Friday, December 21, 2007

Paypal vs. Google Checkout in the UK



Google Checkout and Paypal are both well enough established that we can compare their success in the UK. As the chart below illustrates, Paypal was in the lead for the first couple of months of the holiday shopping season, but Google Checkout overtook it two weeks ago and has maintained a marginal lead since.

Although the two sites are competitors, they have quite different sources of traffic. As you would expect, the majority (59.1%) of Paypal’s traffic comes from its parent, eBay (combined UK and US sites), with another 12.4% from Google (UK and US sites again). A further 11.7% comes from a combination of email providers, social networks and banks; but just 2.2% comes from non-auction Shopping and Classified sites. This compares with 45.3% for Google Checkout, which helps explain the large number of retailers in the table below illustrating the top 20 upstream sites visited before Google Checkout last week.

Most of these are smaller independent retailers, although two larger retailers (Vodafone and Dabs) also appear. Dabs accepts payment via both Google Checkout and Paypal, but currently sends 16 times as much traffic to the former as to the latter. However, visits do not always mean purchases. When Heather Hopkins posted on this topic earlier in the year, there was some discussion in the comments section about what people did once they got to the shopping cart.

While I can’t provide abandonment data, it is likely that people visiting another retail site after either Paypal or Google Checkout may not have completed their purchase. In other words, a lot of downstream traffic to our Shopping and Classifieds category could be used as a proxy for abandonment rates. As the graph below illustrates, more people currently visit another Shopping and Classifieds site after Google Checkout than after Paypal, and the gap is widening.




Source: http://weblogs.hitwise.com/robin-goad/2007/12/paypal_vs_google_checkout_in_t.html

Twitter is dangerous

Twitter is rapidly becoming a serious threat to corporate information protection. The program’s great strength — many-to-many messaging — becomes its great weakness in this context.

Imagine this scenario: 20 people are in a confidential meeting, one of them using Twitter. This attendee broadcasts an off-hand “tweet” (Twitter comment) to his or her “followers” (Twitter friends). With traditional instant messaging, that message would be received by perhaps one or two others. With Twitter, that comment may be seen by 10, 100, 1000, or more followers.

Why does it matter? Twitter has the power to turn groups of innocent bystanders into instant analysts. Even seemingly innocuous comments, when put before a large group of people, can be analyzed more rapidly, and in more depth, than you might expect. This can easily cause a range of unintended, highly negative consequences.

If you’re running corporate IT, what should you do? You’ve got a few choices:

  1. Pretend the problem doesn’t exist. Not being one to advocate head-in-sand methods, I can’t recommend this approach.
  2. Block, or monitor, Twitter, as you might do with traditional instant messaging programs, such as Yahoo or AIM. It’s a tried and true method - not the best, but it works.
  3. Acknowledge the inevitable, and establish clear information sharing policies and guidelines. In the long run users, like water, will seek their own level. In other words, users will eventually adopt the tools they want, whether you want them to or not. The wise among us will recognize this certainty.

The solution: be prepared to strongly enforce information-sharing policies. If confidential information is being shared, even innocently, question the judgment of the sharer.

By the way, if you think Twitter isn’t mainstream enough to matter, think again. It’s currently got almost 700,000 users, many of them influential early adopters. Twitter isn’t going away, and like all tools, it can be used for both good and evil. Balancing Twitter’s dangers and benefits may not be easy, but you’d better start thinking about it today.



Source: http://blogs.zdnet.com/projectfailures/?p=542

Thursday, December 20, 2007

Search Engine Demographics - Google, Yahoo and MSN

It’s amazing to think that search engines can calculate the gender and age breakdown of their users. But it’s true, and recent studies have shown some interesting statistics.

A recent study carried out by Hitwise showed that 55% of Google users are male, whereas 58% of MSN users are female. Did you know that paid search listings are most likely to be clicked on by females, while organic listings are most likely to be clicked on by males?

          Clicked on Organic Ads    Clicked on Paid Ads
Women     56.9%                     43.1%
Men       65.4%                     34.6%

Google and Yahoo are used mostly by users who are under the age of 34 while MSN is used widely by users who are over 55. Visitors to Google sites are 42% more likely to buy online than the average internet user.

Google, Yahoo and MSN combined make up 85-90% of all searches made by internet users today.

Let’s not forget Ask and AOL who apparently are favourites with the female user….


Source: http://www.vertical-leap.co.uk/blog/Search-Engine-Demographics-ndash;-Google-Yahoo-and-MSN.asp

Wednesday, December 19, 2007

Google Improves Results For Supplemental Pages

According to a new post on the Google Webmaster Central blog, the supplemental index is no longer, well, supplemental. Google has long had a two-tiered index and webmasters have generally feared the second, supplemental tier. A Forbes article earlier this year called it "Google Hell", as historically, those pages weren't crawled as often as those in the main index, weren't returned in search results unless the main index didn't contain enough matching pages, and were labeled "supplemental", which implied they were inferior to the other results.

In July, Google removed the supplemental label, saying that they had overhauled the supplemental crawling and index system and therefore the label was no longer needed. Now, they say that the next set of improvements are complete and that they now search both the main and supplemental index for all queries, not just the long tail queries that the main index can't satisfy.

In the post, Google explains the origin of the supplemental index, saying that it was meant to provide a better user experience for searchers looking for obscure queries. However, over time, the supplemental index grew to something that webmasters tried to avoid as they felt that it kept their pages from being shown for relevant queries. In addition, Google didn't refresh those pages as often as it did those in the main index.

With this latest change, Google says that searchers will see a larger set of relevant documents from a deeper slice of the web in results, particularly for non-English queries. Supplemental pages that previously had little chance of ranking now will be queried and scored for relevancy along with the rest of the pages in the index and now potentially have a much better chance of being shown.

What does this mean for the webmaster? In the July post, Google said they would be rolling out this change incrementally, so changes a site owner would see may already be in place. Webmasters who had previously noticed that their pages in the supplemental index weren't being recrawled frequently should have been noticing steady improvements in crawl frequency and indexing freshness in the recent months. It's doubtful that today's blog post is a signal for a dramatic change in overall ranking, but it's worth keeping an eye on the pages that webmasters had previously given up hope on due to their supplemental status to see if hope reigns eternal after all.


Source: http://searchengineland.com/071219-122926.php

Worm Hits Google's Orkut

The intruder appears relatively harmless--and now halted--but raises security concerns.


Google's Orkut social networking site appeared to have been hit by a relatively harmless worm, but one that demonstrated the continuing vulnerability of Web applications.

Some Orkut users received an e-mail telling them they had been sent a new scrapbook entry -- a type of Orkut message -- on their profile from another Orkut user.

They only had to view their profile to become infected by the worm, which added them to an Orkut group, "Infectados pelo Vírus do Orkut," wrote the blogger Kee Hinckley on his site TechnoSocial.

The name of the group, in Portuguese, roughly translates to "infected by the Orkut virus." Orkut is popular in Brazil, as well as India, but has not caught on as well outside those countries compared to MySpace and Facebook.

The description of the group reveals that the worm was designed to show Orkut could be dangerous to users even if they do not click on malicious links, Hinckley wrote. The worm apparently did not try to steal any personal data.

The worm was also noted by Orkut Plus, a site that offers Orkut security tips, and discussed in Google's Orkut help group.

At one time the infected group was adding new members at a rate of 100 per minute, and had reached a few hundred thousand members, according to various postings, but the problem appears now to be fixed, Hinckley wrote.

Orkut's scrapbook feature allows people to post messages that contain HTML code, but it may lack a filter to strip out malicious JavaScript, Hinckley wrote.
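To illustrate the kind of filter Hinckley is talking about, here is a minimal, hypothetical Python sketch (my own illustration, not Orkut's actual code) of an allowlist-based HTML cleaner that drops script elements, on* event-handler attributes, and javascript: URLs from user-submitted scrapbook entries:

```python
# Hypothetical sketch of an allowlist HTML filter; the allowed tag set is an
# assumption for illustration, not anything Orkut actually uses.
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "u", "a", "img", "br", "p"}

class ScrapbookSanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0            # > 0 while inside a disallowed element

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.skip_depth += 1
            return
        safe = []
        for name, value in attrs:
            value = value or ""
            if name.lower().startswith("on"):                      # onclick, onerror, ...
                continue
            if value.strip().lower().startswith("javascript:"):    # javascript: URLs
                continue
            safe.append(' %s="%s"' % (name, value))
        self.out.append("<%s%s>" % (tag, "".join(safe)))

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def sanitize(html):
    parser = ScrapbookSanitizer()
    parser.feed(html)
    return "".join(parser.out)

print(sanitize('<b>hi</b><script>stealCookies()</script><img src="x" onerror="evil()">'))
# -> <b>hi</b><img src="x">
```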

"It does not appear at first glance that the worm does anything more dangerous than pass itself on to one or more of your friends," he wrote. "I think it unlikely that it would be able to steal your password, although it could potentially access other private information."


Source: http://www.pcworld.com/article/id,140653-c,worms/article.html


Monday, December 17, 2007

What To Do When Your Gmail or Google Account is Hacked

I just got a message in Facebook from a friend whose Google Account was hijacked and the password was changed by the hacker. That person is now looking for some kind of a Google helpline to help him revive the account.

It can be a nightmare if someone else takes control of your Google Account because all your Google services like Gmail, Orkut, Google Calendar, Blogger, AdSense, Google Docs and even Google Checkout are tied to the same account.

Here are some options suggested by Google Support when you forget your Gmail password or if someone else takes ownership of your Google Account and changes the password:

1. Reset Your Google Account Password:

Type the email address associated with your Google Account or Gmail user name at google.com/accounts/ForgotPasswd - you will receive an email at your secondary email address with a link to reset your Google Account Password.

This will not work if the other person has changed your secondary email address or if you no longer have access to that address.

2. For Google Accounts Associated with Gmail

If you have problems while logging into your Gmail account, you can consider contacting Google by filling out this form. It does, however, require you to remember the exact date when you created that Gmail account.

3. For Hijacked Google Accounts Not Linked to Gmail

If your Google Account doesn’t use a Gmail address, contact Google by filling out this form. This approach may help bring back your Google Account if you religiously preserve all your old emails. You will be required to know the exact creation date of your Google Account and to provide a copy of the original “Google Email Verification” message.

It may be slightly tough to get your Google Account back but definitely not impossible if you have the relevant information in your secondary email mailbox.


Source: http://www.labnol.org/internet/email/google-account-hacked-gmail-password-change/1947/

Universal Search Comes To Google Maps -- Sort Of

In something of a whimsical post ("Confessions of a search box"), the Google LatLong Blog says that photos, YouTube videos, and book search results have been added and are now discoverable in Google Maps. The photos are from various sources, including Google-owned Panoramio and Flickr.

This is all content that can be found in Google Earth, which shares the same platform as Maps. Increasingly, the two products are less distinct and more of Earth's content and functionality are making their way into Maps (terrain is another good example). Another way to look at this is as the introduction of multiple forms of content (a la Universal Search) into Maps.

Google has said its mission with Earth is to "geographically organize" the world's information. As I said, Earth has had multiple content types for a long time and now those layers are starting to make their way into Maps online. Indeed, Google's vision for Maps is as a multimedia content platform with a geo orientation.


Source: http://searchengineland.com/071214-092027.php


How To Build Landing Pages That Convert

Landing pages are an important tool in any online marketing campaign. They are one of the best ways to convert web clicks into clients, and can help to maximize your online performance. Here are some tips for getting started and building an effective landing page that meets the needs of your clients.

What is a landing page?

A landing page is a web page that a visitor reaches after clicking an online ad or a link, and contains detailed information about the specific product or service that is mentioned. The landing page should be considered part of the marketing campaign and shouldn't just be another page on your website.

When you start developing a landing page you should really consider its purpose. What are you hoping visitors will do when they get there? Is your goal to sell a product, help visitors learn more about a service, or do you want them to provide you feedback? All of these goals would need different landing pages.

An effective landing page makes your visitors' lives easier by providing them all the information that they need without having to scour your site or the web for answers. Using landing pages can significantly impact your conversion rate. A survey by Atlas OnePoint found that the average conversion rate when companies used their homepage as the destination for an advertisement or link was only 6 percent. However, companies that used targeted landing pages had almost double the conversion rate, with 12 percent of their visitors converting.

Landing pages can also improve your search engine optimization because they are filled with keywords about your business or product. Search engines want to provide the most relevant results, so these keyword-rich pages can improve your rank.

How to make a superior landing page

Content is an important part of a landing page, but knowing what to include and what to omit is very important. Your landing page should do one of three things—give your prospect reason to convert, enable them to do so, or resolve any concerns the prospect may have about converting. If any of the information on your page does not accomplish this, then it shouldn't be there.

The page should provide relevant, focused, and detailed information about a specific product or service. It is most beneficial if this can be included on a single page. According to website optimization firm Interactive Marketing Inc., this can increase conversion by 55 percent. This information should also be visible "above the fold," or without the need to scroll down.

It is important to keep in mind who your audience is and make sure that the information you provide is relevant to them. This may mean you need to develop multiple landing pages for a single product. This will allow you to target your message to each specific audience.

Landing pages, as mentioned earlier, have a purpose, either to ask people to buy or to provide them more information—to meet that goal, your page needs to have a call to action. This can be a simple button asking people to purchase a product or click to read a free report. The button should be clearly labeled and should explain what you want customers to do.

Having a well designed page can heavily impact your conversion rate. Eye-tracking studies and other research have given online marketers new information about how users interact with websites.

People tend to look at the upper left hand side first, then at the headline and then at the left side of the page. To maximize your success, the most important information should be in these positions.

Additionally, the look and feel of your page should be consistent with your other marketing materials, and it should appear trustworthy. Users want to see a design that is consistent with the advertisement or link that brought them to your page so they know they're in the right place. If you change your advertising campaign, you should change your landing page as well.

The impression your site gives visitors is crucial. A Stanford study found that 46 percent of web sales are lost on sites lacking the critical elements to build trust. The number one reason people indicated they don't buy from a site is because it had an unprofessional look and feel that lacked credibility. Building this trust is crucial if you're trying to gather personal information about your website's users. The most common answer submitted on personal information forms online is Mickey Mouse. If you want fewer "Mickey Mouses" on your prospect list, the key is building trust.

The headline and page title on your landing page are very important. The page title is in the bar at the top of your web browser, and the headline is the biggest piece of text on the page. These two items have the greatest potential to impact your conversion rate. Include the keywords or phrases you used in the advertisement to get visitors to the site. Position these items where your eye travels first—the top left of the screen.

You now know the key to developing an effective design and helpful content, but what if you don't have an in-house web designer or the resources to hire someone to design a landing page for every online marketing campaign? Luckily, there are online sites that help you create your own landing pages relatively easily and inexpensively. These sites don't require that you know HTML, and designing a landing page can be as easy as creating a PowerPoint slide. For example, Marketo.com offers landing page creation tools and hosting.

Landing page optimization

To improve your landing page—test and test again. There are a few elements that are very important, including load time and headline.

Try to keep load time under 5 seconds. Cater your page to the slowest dial-up connection so as not to lose these visitors. Web analytics software can tell you how many of your visitors are using a slow connection, and a variety of sites, such as iWebTool, offer free tools to test your page load time.
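If you want a quick way to check that number yourself, a rough sketch along these lines (the URL below is a placeholder) times how long the raw page takes to download; real load time also depends on images, scripts, and rendering, so treat it as a lower bound:

```python
# Rough, illustrative timing of a landing page download (not a full
# browser-level load time measurement).
import time
import urllib.request

def measure_download_time(url, timeout=30):
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    return time.monotonic() - start, len(body)

elapsed, size = measure_download_time("http://example.com/landing-page")  # placeholder URL
print("Downloaded %d bytes in %.2f seconds" % (size, elapsed))
if elapsed > 5:
    print("Over the 5-second target - consider trimming page weight.")
```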

Headlines can alter conversion rates, so test a variety of them to see which is most effective. Headlines should tell the benefit to the customer, not necessarily the product features.

A landing page is also a good place to test different prices for your product if you display them online. An Allbusiness.com article about the psychology of pricing noted that prices that end with odd numbers, especially 7s or 9s, tend to be perceived as lower than prices that end with even numbers.

Finally, test the call to action to find one that delivers the highest conversion rates. This includes the buttons themselves. Large, red buttons tend to have the best conversion.

Tuning your landing page

Just building a landing page isn't enough; to be effective, the page must be routinely updated. Updated content can boost your search engine optimization, it can help you track what content generates the best conversion, and it can improve traffic. The more you update, the more reason people have to visit your page. A Marketing Sherpa eye-tracking study showed that consistently updating and tweaking content can increase traffic by 40 percent.

One of the most important things to update is pricing changes. A landing page that misquotes a price will frustrate and most likely turn off a prospect. In a recent survey, Enquiro found that users of B2B websites preferred to see pricing information but it is often unavailable. Supplying a price range may help customers determine if you are within their budget, without requiring you to list specific prices.

Finally, make sure that none of the links are broken and remove any outdated links.

Landing pages are a great way to provide your customers the information they need in one convenient location, and they can help you convert web clicks into clients. While they may take some work to set up and maintain, they can drastically improve your online marketing efforts.


Source: http://searchengineland.com/071214-124154.php

Sunday, December 16, 2007

Google to take on Wikipedia

POLITICAL spin-doctors will be hunting for another way of massaging the truth if a new service launched by the world's most popular search engine lives up to expectations.

Google is going head-to-head with the user-generated online encyclopedia Wikipedia by developing its own repository of knowledge.

The looming battle pits the world's busiest internet site, with 260 million users, against Wikipedia, which was visited by 107 million people in October.

Dubbed the "knol" project after what Google calls a unit of knowledge, the new information source will encourage experts on a particular subject to write an authoritative article about it.

But unlike on Wikipedia, only the author of a "knol" page will be allowed to edit it. Other authors will have to set up competing pages under their own names.

The Google system is designed to overcome a major criticism of Wikipedia: that it is open to abuse because it allows anybody to edit a page.

Widespread tinkering with Wikipedia was exposed this year after the release of a software tool capable of tracing people who edited pages on the encyclopedia.

The offices of former prime minister John Howard and NSW Premier Morris Iemma were accused of tampering with entries.

The knol project, still in an invitation-only test phase, is likely to go public within months.

It follows Wikipedia's move on Google's patch with Search Wikia, an open-source search engine. Wikipedia could go live with an early test version of its search engine this week.

The company says the new search engine will avoid some of the privacy concerns levelled at Google by refusing to share data with advertisers or store users' search terms.


Source: http://www.news.com.au/technology/story/0,25642,22935533-5014108,00.html

Friday, December 14, 2007

Google 2D? Google Tests Vertical Results In Right-Hand Column

Ask 3D was the name Ask gave to the "three pane" user interface rolled out earlier this year. Now Google seems to be copying Ask, at least to the second degree, with the right-most column being used to show vertical results similar to how Ask does it. Google Blogoscoped and Valleywag both have pictures of Google trying out the right-hand column as a location for Google's OneBox results.

It would be interesting to see if this test turns out to be more than a test. Google has adopted Universal Search, but by placing these vertical results outside of the main web results, they no longer seem to fit the blended Search 3.0 Universal Search-style but instead are more like Google's traditional OneBox results.

That brings me to a discussion Danny and I had on the Daily SearchCast yesterday. There's some confusion that anything placed in the middle of results is the result of Universal Search.

To our knowledge, Universal Search specifically means that Google drops a web search result (or two or three) from the usual 10 listings and replaces what was dropped with a vertical result (specifically, from news, local, books, images, or video search). Other types of search units or options inserted into the middle of results are not the consequence of Universal Search, but rather Google doing a different type of blending.

Danny is conducting a follow-up with Google this week to get further clarification about the changes we have been noticing recently with those results.


Source: http://searchengineland.com/071213-111509.php

Thursday, December 13, 2007

MSN AdCenter Labs - What a Surprise!

I recently attended the PPC Summit in Los Angeles where I interacted with online marketers from around the country. To my surprise, many of the presenters at the conference referenced MSN adCenter Labs. This is Microsoft's website for testing new prototypes of tools they are currently developing.

After returning from the conference, I decided to check it out for myself. Below is an overview of the tools I found most useful for optimizing your paid search campaigns:

Detecting Online Commercial Intention: Type in a keyword to view the probability that the keyword is commercial or non-commercial. In other words, this tool helps answer the question "Is the searcher a potential customer, or just researching information?" This tool clarifies exactly what the searcher is looking for.

Search Funnels: This is one of my favorite tools! MSN describes it as "a tool to help visualize and analyze our customer's search behaviors." You can see what your customers are searching for before & after they search on keywords you bid on. I entered "Gucci" and the tool yielded the graph below. You can use this information to better understand who you are marketing to. It can also help you create more relevant ad text.

Keyword Forecast: Think of this tool as a "snapshot" of a keyword's demographics. It displays monthly impression history and age and gender statistics in easy-to-read bar graphs. You can choose to view the data as Flash, pictures, or text.

Keyword Mutation Detector: Use this to generate common keyword misspellings and variations. Unlike many keyword permutation tools, I was surprised that the variations seemed like real typos. It would be handy if you could export your results easily to a spreadsheet, but exporting is not yet a feature in this tool.

If you do decide to use these tools, take note of the disclosure at the bottom of the page: "The tools shown on the adlabs website are not yet in production; they are demos or prototypes of tools that we might include as production adCenter tools in the future."

MSN adCenter Labs is a great resource for agencies and individuals to use when developing campaigns in unfamiliar industries. The tools reveal much more than just forecasted search volumes. Overall I was impressed with the prototypes and look forward to their actual release. I encourage you to visit MSN adCenter Labs, and see what you can get the tools to do for you!



Source: http://www.roirevolution.com/blog/2007/10/msn_adcenter_labs_what_a_surprise.html?utm_source=roinews&utm_medium=email&utm_campaign=roinews+2007-11&utm_content=200711_msn_adcenter&utm_nooverride=1

Tuesday, December 11, 2007

Google Patent on Anchor Text and Different Crawling Rates

How does a search engine use information from anchor text in links pointed to pages?

Why and how do some pages get crawled more frequently than others?

How might links that use permanent and temporary redirects be treated differently by a search engine?

A newly granted patent from Google, originally filed in 2003, explores these topics, and provides some interesting answers, and even some surprising ones.

Of course, this is a patent, and may not necessarily describe the actual processes in use by Google. It is possible that they are being used, or were at one point in time, but there has been plenty of time since the patent was filed for changes to be made to the processes described.

It has long been observed and understood that different pages on the web get indexed at different rates, and that anchor text in hyperlinks pointing to pages can influence what a page may rank for in search results.

Why Use Anchor Text to Determine Relevancy?

When you search, you expect a short list of highly relevant web pages to be returned. The authors of this patent tell us that previous search engine systems only associated the contents of the web page itself with the web page when indexing that page.

They also tell us that there is valuable information about pages that can be found outside the contents of the web page itself in hyperlinks that point to those pages. This information can be especially valuable when the page being pointed towards contains little or no textual information itself.

For image files, videos, programs, and other documents, the textual information on pages linking to those documents may be the only source of textual information about them.

Using anchor text, and information associated with links can also make it possible to index a web page before the web page has been crawled.

Creating Link Logs and Anchor Maps

Link Log - A large part of the process involved in this patent is the creation of a link log during the crawling and indexing process. The link log contains a number of link records identifying source documents (including their URLs) and the pages targeted by the links in those documents.

Anchor Map - Information about text associated with links ends up in a sorted anchor map, which includes a number of anchor records, indicating the source URL of anchor text, and the URL of targeted pages. The anchor information is sorted by the documents that they target. The anchor records may also include some annotation information.

In the creation of a link log and anchor map, a page ranker process may be used to determine a PageRank, or some other query-independent relevance metric, for particular pages. This page ranking may help determine crawling rates and crawling priorities.
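To make those structures a little more concrete, here is a minimal Python sketch of what a link log record and a sorted anchor map might look like - the field names are my own shorthand, not terms taken from the patent:

```python
# Illustrative sketch only; field names and structure are assumptions.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LinkRecord:
    source_url: str          # document the link was found in
    target_url: str          # document the link points to
    anchor_text: str         # text inside the anchor tag
    annotations: list = field(default_factory=list)   # e.g. surrounding text, attributes

def build_anchor_map(link_log):
    """Group anchor records by the document they target, the way the sorted
    anchor map does, so an indexer can pull every anchor for one page at once."""
    anchor_map = defaultdict(list)
    for record in link_log:
        anchor_map[record.target_url].append(record)
    # "Sorted" here simply means ordered by target document identifier.
    return dict(sorted(anchor_map.items()))

link_log = [
    LinkRecord("http://a.example/", "http://b.example/everest.jpg",
               "click here", ["to see a picture of Mount Everest"]),
]
print(build_anchor_map(link_log))
```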

The patent is:

Anchor tag indexing in a web crawler system
Invented by Huican Zhu, Jeffrey Dean, Sanjay Ghemawat, Bwolen Po-Jen Yang, and Anurag Acharya
Assigned to Google
US Patent 7,308,643
Granted December 11, 2007
Filed July 3, 2003

Abstract

Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents is accessed.

A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.

Different Layers and Different Crawling Rates

Base Layer

The base layer of this data structure is made up of a sequence of segments. Each of those might cover more than two hundred million uniform resource locators (URLs), though that number may have changed since this patent was written. Together, segments represent a substantial percentage of the addressable URLs in the entire Internet. Periodically (e.g., daily) one of the segments may be crawled.



Daily Crawl Layer

In addition to segments, there exists a daily crawl layer. In one version of this process, the daily crawl layer might cover more than fifty million URLs. The URLs in this daily crawl layer are crawled more frequently than the URLs in segments. The daily crawl layer can also include high-priority URLs that are discovered by the system during the current crawling period.

Optional Real-Time Layer

Another layer might be an optional real-time layer, which could comprise more than five million URLs. The URLs in a real-time layer are those that are to be crawled multiple times during a given epoch (e.g., multiple times per day). Some URLs in an optional real-time layer are crawled every few minutes. The real-time layer also comprises newly discovered URLs that have not yet been crawled but should be crawled as soon as possible.

Same Robots for All Layers

The URLs in the different layers are all crawled by the same robots, but the results of the crawl are placed in indexes that correspond to those different layers. A scheduling program bases the crawls for those layers on the historical (or expected) frequency of change of the content of the web pages at the URLs and on a measure of URL importance.
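A very rough way to picture those layers in code (the segment count and recrawl periods below are illustrative placeholders, not figures from the patent):

```python
# Illustrative only - segment count and cadences are made-up placeholders.
CRAWL_LAYERS = {
    "base": {
        # The base layer is split into segments; one segment is crawled per
        # epoch (e.g. per day), covering the whole layer round-robin.
        "segments": [set() for _ in range(10)],
        "recrawl_period_days": 10,
    },
    "daily": {
        "urls": set(),                 # higher-priority URLs, crawled every epoch
        "recrawl_period_days": 1,
    },
    "realtime": {
        "urls": set(),                 # crawled several times a day, some every few minutes
        "recrawl_period_days": 1 / 24,
    },
}

def active_segment(epoch_index, layers=CRAWL_LAYERS):
    """Pick the base-layer segment to crawl during this epoch, round-robin."""
    segments = layers["base"]["segments"]
    return segments[epoch_index % len(segments)]
```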

URL Discovery

The sources for URLs used to populate this data structure include:

  • Direct submission of URLs by users to the search engine system
  • Discovery of outgoing links on crawled pages
  • Submissions from third parties who have agreed to provide content, and can give links as they are published, updated, or changed - from sources like RSS feeds

It’s not unusual to see blog posts in the web index these days that indicate they are only a few hours old. It’s quite possible that those posts are included in the real time layer, and may even be entered into the index through an RSS feed submission.

Processing of URLs and Content

Before this information is stored in a data structure, a URL (and the content of the corresponding page) is processed by programs designed to ensure content uniformity and to prevent the indexing of duplicate pages.

The syntax of specific URLs might be looked at, and a host duplicate detection program might be used to determine which hosts are complete duplicates of each other by examining their incoming URLs.

Parts of the Process

Rather than describing this process step-by-step, I’m pointing out some of the key terms, and some of the processes that it covers. I’ve skipped over a few aspects of the patent that increase the speed and efficiency of the process, such as partitioning, and the incremental addition of data through a few processes.

It’s recommended that you read the full text of the patent if you want to see more of the technical aspects involved.

Epoch - a predetermined period of time, such as a day, in which events in this process take place.

Active Segment - the segment from the base layer that gets crawled during a specific epoch. A different segment is selected each epoch, so that over the course of several epochs, all the segments are selected for crawling in a round-robin style.

Movement between Daily Layer and Optional Real-Time Layer - Some URLs might be moved from one layer to another based upon information in history logs that indicate how frequently the content associated with the URLs is changing, as well as individual URL page ranks that are set by page rankers.

The determination as to what URLs are placed in those layers could be made by computing a daily score like this: daily score = (page rank)^2 * (URL change frequency)
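In code, that placement score is about as simple as it sounds; here is a small sketch (the example numbers and the changes-per-day unit are my own, not the patent's):

```python
# daily score = (page rank)^2 * (URL change frequency), as quoted above.
def daily_score(page_rank, change_frequency):
    """change_frequency: estimated content changes per day (assumed unit)."""
    return (page_rank ** 2) * change_frequency

# A high-PageRank page that changes several times a day scores far higher than
# a low-PageRank page that rarely changes, making it a candidate for the daily
# or real-time layer.
print(daily_score(page_rank=8, change_frequency=3.0))   # 192.0
print(daily_score(page_rank=2, change_frequency=0.1))   # 0.4
```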

URL Change Frequency Data - When a URL is accessed by a robot, the information is passed through content filters, which may determine whether content at a URL has changed and when that URL was last accessed by a robot. This information is placed in history logs and then goes to the URL scheduler.

A frequency of change may then be calculated, and supplemental information about a URL may also be reviewed, such as a record of sites whose content is known to change quickly.

PageRank - a query-independent score (also called a document score) is computed for each URL by URL page rankers, which compute the page rank for a given URL by looking at the number of URLs that reference it and the page rank of those referencing URLs. This PageRank data can be obtained from URL managers.

URL History Log - can contain URLs not found in the data structure, such as log records for URLs that no longer exist, or which will no longer be scheduled for crawling because of such things as requests by a site owner that the URL not be crawled.

Placement into Base Layer - When the URL scheduler determines that a URL should be placed in a segment of base layer, an effort is made to ensure that the placement of the URL into a given segment of base layer is random (or pseudo-random), so that the URLs to be crawled are evenly distributed (or approximately evenly distributed) over the segments.

Processing rules might be used to distribute URLs randomly into the appropriate base, daily, and realtime layers.

When All URLs Cannot be Crawled During a Given Epoch - This could be addressed using two different approaches:

  1. A crawl score could be computed for each URL in active segment, daily layer, and real-time layer, and only URLs with high crawl scores are passed on to the URL managers.
  2. In the second, the URL scheduler decides upon an optimum crawl frequency for each such URL and passes that crawl frequency information on to the URL managers, which use it to decide which URLs to crawl. These approaches could be used by themselves, or combined, to prioritize the URLs to crawl.

Factors Determining a Crawl Score — where a crawl score is computed, URLs receiving a high crawl score are passed on to the next stage, and URLs with a low crawl score are not passed on to the next stage during the given epoch. Factors that could be used to compute a crawl score might include:

  • The current location of the URL (active segment, daily segment or real-time segment),
  • URL page rank, and;
  • URL crawl history - can be computed as: crawl score = (page rank)^2 * (change frequency) * (time since last crawl).

Other modifications may impact the crawl score. For example, the crawl score of URLs that have not been crawled in a relatively long period of time can be up-weighted so that the minimum refresh time for a URL is a predetermined period of time, such as two months.
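Here is a small sketch of that crawl score, with the up-weighting for long-uncrawled URLs handled by simply forcing them to the front of the queue - my interpretation of the two-month minimum refresh example, not the patent's exact mechanism:

```python
# crawl score = (page rank)^2 * (change frequency) * (time since last crawl),
# as quoted above; units for the inputs are assumed.
def crawl_score(page_rank, change_frequency, days_since_last_crawl,
                max_refresh_days=60):
    score = (page_rank ** 2) * change_frequency * days_since_last_crawl
    if days_since_last_crawl >= max_refresh_days:
        score = float("inf")           # up-weight URLs that have waited too long
    return score

# URLs are then ranked by score, and only the top scorers are passed on to the
# URL managers during the current epoch.
candidates = {
    "http://a.example/": crawl_score(8, 2.0, 3),     # popular, fast-changing
    "http://b.example/": crawl_score(1, 0.05, 70),   # stale for over two months
}
print(sorted(candidates, key=candidates.get, reverse=True))
```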

Where Crawl Frequency is Used - The URL scheduler may set and refine a URL crawl frequency for each URL in the data structure. This frequency represents a selected or computed crawl frequency for each URL. Crawl frequencies for URLs in daily layer and real-time layer will tend to be shorter than the crawl frequency of URLs in the base layer.

The range of crawl frequencies for any given URL can run from a minute or less to periods of months. The crawl frequency for a URL is computed based on the historical change frequency of the URL and the page rank of the URL.

Dropping URLs - The URL scheduler determines which URLs are deleted from the data structure and therefore dropped from the system. URLs might be removed to make room for new URLs. A “keep score” might be computed for each URL, and the URLs could then be sorted by the “keep score.”

URLs with a low “keep score” could be eliminated as newly discovered URLs are added. The “keep score” could be the page rank of a URL determined by page rankers.
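A quick sketch of that pruning step, using page rank as the keep score (which the patent offers as one possibility); the capacity and scores below are illustrative:

```python
# Keep the highest-scoring URLs up to capacity; drop the rest to make room
# for newly discovered URLs.
def prune_urls(url_keep_scores, capacity):
    ranked = sorted(url_keep_scores, key=url_keep_scores.get, reverse=True)
    return set(ranked[:capacity]), set(ranked[capacity:])

kept, dropped = prune_urls({"a": 7, "b": 2, "c": 5}, capacity=2)
print(kept, dropped)   # {'a', 'c'} {'b'}
```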

Crawl Interval - A target frequency at which a URL should be crawled. For a URL with a crawl interval of two hours, the URL manager will attempt to crawl the URL every two hours. There are a number of criteria that can be used to prioritize which URLs will be delivered to the URL server, including “URL characteristics” such as the category of the URL.

Representative URL Categories - include, but are not limited to, news URLs, international URLs, language categories (e.g., French, German, Japanese, etc.), and file type categories (e.g., postscript, powerpoint, pdf, html).

URL Server Requests - there are times when the URL server requests URLs from URL managers. The URL server may sometimes request specific types of URLs from the URL managers based upon some policy, such as eighty percent foreign URLs/twenty percent news URLs.

Robots - Programs which visit a document at a URL, and recursively retrieve all documents referenced by the retrieved document. Each robot crawls the documents assigned to it by the URL server, and passes them to content filters, which process the links in the downloaded pages.

The URL scheduler determines which pages to crawl as instructed by the URL server, which takes the URLs from the content filters. Robots differ from normal web browsers - a robot will not automatically retrieve content such as images embedded in the document, and are not necessarily configured to follow “permanent redirects”.

Host Load Server - used to keep from overloading any particular target server accessed by robots.

Avoiding DNS Bottleneck Problems - a dedicated local DNS database may be used to resolve IP addresses with domain names so that domain names for IP addresses that have been crawled before don’t have to be looked up every time that a robot goes to visit a URL.
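The idea is simply to pay the DNS lookup cost once per hostname; a toy version might look like this (the patent describes a dedicated, pre-populated DNS database, which this sketch does not attempt to reproduce):

```python
# Toy hostname-to-IP cache; a real crawler would use a dedicated DNS database.
import socket

_dns_cache = {}

def resolve(hostname):
    if hostname not in _dns_cache:
        _dns_cache[hostname] = socket.gethostbyname(hostname)   # one real lookup
    return _dns_cache[hostname]
```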

Handling Permanent and Temporary Redirects - Robots do not follow permanent redirects that are found at URLs that they have been requested to crawl, but instead send the source and target (redirected) URLs of the redirect to the content filters.

The content filters take the redirect URLs and place them in link logs where they are passed back to URL managers. It is the URL managers that determine when and if such redirect URLs will be assigned to a robot for crawling. Robots are set to follow temporary redirects, and obtain page information from the temporary redirects.
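Assuming standard HTTP status codes (301 for permanent, 302 for temporary - the patent itself doesn't spell out the codes), a robot's decision might be sketched like this:

```python
# Sketch of the redirect rule described above; fetch_page and record_redirect
# are hypothetical callbacks standing in for the robot and the content filters.
def handle_redirect(source_url, status, target_url, fetch_page, record_redirect):
    if status == 301:                  # permanent: do not follow
        record_redirect(source=source_url, target=target_url)
        return None                    # URL managers decide later whether to crawl it
    if status == 302:                  # temporary: follow it and get the page
        return fetch_page(target_url)
    raise ValueError("not a redirect response")
```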

Content Filters - robots send the content of pages to a number of content filters. The content filters send information about each page to a DupServer to uncover duplicate pages, including:

  • The URL fingerprint of the page,
  • The content fingerprint of the page,
  • The page’s page rank, and;
  • An indicator as to whether the page is the source of a temporary or permanent redirect.

Handling Duplicates - when found, the page rankings of the duplicate pages at different URLs are compared and the “canonical” page for the set of duplicate pages is identified. If the page sent to the DupServer is not the canonical page, it is not forwarded for indexing.

Instead, an entry might be made for the page in the history log, and the content filter may then cease to work on the page. The DupServer also assists in the handling of both temporary and permanent redirects encountered by the robots.
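Canonical selection among a set of duplicates can be pictured as nothing more than keeping the highest-ranked copy - a simplification of what the DupServer does:

```python
# duplicates: list of (url, page_rank) pairs sharing one content fingerprint.
def pick_canonical(duplicates):
    return max(duplicates, key=lambda pair: pair[1])[0]

print(pick_canonical([("http://a.example/page", 3), ("http://b.example/copy", 7)]))
# -> http://b.example/copy ; the lower-ranked copy is not forwarded for indexing
```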

Link Records and Text Surrounding Links - a link log contains one link record per URL document. A URL document is a document obtained from a URL by a robot and passed to content filter. Each link record lists the URL fingerprints of all the links (URLs) that are found in the URL document associated with a record, and the text that surrounds the link.

For example, a link pointing to a picture of Mount Everest might read “to see a picture of Mount Everest click here.” The anchor text might be the “click here” but the additional text “to see a picture of Mount Everest” could be included in the link record.
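A crude sketch of pulling both the anchor text and some surrounding text into a link record, using a fixed character window as the "predetermined distance" (the window size and the regular-expression approach are my own simplifications):

```python
# Illustrative extraction only; a production parser would not rely on regexes.
import re

def extract_anchor_records(html, window=60):
    records = []
    for match in re.finditer(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                             html, flags=re.I | re.S):
        preceding = html[max(0, match.start() - window):match.start()]
        records.append({
            "target": match.group(1),
            "anchor_text": re.sub(r"<[^>]+>", "", match.group(2)).strip(),
            "surrounding_text": re.sub(r"<[^>]+>", "", preceding).strip(),
        })
    return records

html = 'To see a picture of Mount Everest <a href="http://example.com/everest.jpg">click here</a>.'
print(extract_anchor_records(html))
# -> [{'target': 'http://example.com/everest.jpg', 'anchor_text': 'click here',
#      'surrounding_text': 'To see a picture of Mount Everest'}]
```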

RTlog Matching Document PageRanks with Source URLs - an RTlog stores the documents obtained by robots, and each document is coupled with the page rank assigned to the source URL of the document to form a pair. A document obtained from URL “XYZ” is paired with the page rank assigned to the URL “XYZ” and this pair is stored in an RTlog.

There are three RTlogs, one for the active segment of the base layer, one for the daily layer, and one for the realtime layer.

Creation of Link Maps - the global state manager reads the link logs, and uses the information from the log files to create link maps and anchor maps. The records in the link map are similar to the records in the link log, except that the text is stripped. The link maps are used by page rankers to adjust the page rank of URLs within the data structure. These page rankings persist between epochs.

Creation of Anchor Maps - The global state manager also creates anchor maps. Anchor maps are used by the indexers at the different layers to facilitate the indexing of “anchor text” as well as to facilitate the indexing of URLs that do not contain words.


Text Passages - Each record in the link log includes a list of annotations, such as the text from an anchor tag pointing to a target page. The text included in an annotation can be a continuous block of text from the source document, in which case it is referred to as a text passage.

Annotations may also include text outside the anchor tag in the document referred to by a URL. That text could be determined by being within a predetermined distance of an anchor tag in a source document. The predetermined distance might be based on:

  • A number of characters in the HTML code of the source document,
  • The placement of other anchor tags in the source document, or;
  • Other predefined criteria, referred to as anchor text identification criteria.

Other Annotations - annotations may also include a list of attributes of the text they include. For example, when the text in an annotation is composed in HTML, examples of attributes may include:

  • Emphasized text - (em)
  • Citations - (cite)
  • Variable names - (var)
  • Strongly Emphasized - (strong)
  • Source Code - (CODE)
  • Text position,
  • Number of characters in the text passage,
  • Number of words in the text passage, and;
  • Others.

Annotations as Delete Entries - Sometimes an annotation is a delete entry, which is generated by the global state manager when it determines that a link no longer exists.

Use of Anchor Text from Duplicates - sometimes anchor text pointing to duplicates is used to index the canonical version of the page. We are told that this can be useful when one or more of the links to one or more of the non-canonical pages has anchor text in a different language than the anchor text of the links to the canonical page.

Conclusion

This patent appears to illustrate a lot of the process behind how anchor text may be used to add relevancy to pages that are pointed at by links which use that anchor text. Many of the observations and assumptions about links that people who watch search engines have made are addressed within the patent in one way or another.

The information about how a crawler might handle permanent and temporary redirects differently is interesting, as is the annotation information that a search engine might use. I’m not sure that I’ve seen mention or hint of the use of anchor text pointing to duplicates to provide relevance to the canonical versions of pages before.

The information about crawling rates, and the possible role of PageRank and content change frequency in determining how frequently a page is crawled, is also the most detailed that I can recall seeing in a patent from Google.

Please don’t take any of the information from this patent as gospel - but keep in mind that it is a document created by people from Google, and that even if the processes described within it aren’t being used, they were considered seriously enough to be protected as the intellectual property of the search engine.

Source: http://www.seobythesea.com/?p=929