Posted at 09/27/2013 10:30 AM | Updated as of 09/27/2013 10:30 AM
MENLO PARK, Calif. – Google Inc has overhauled its search algorithm, the foundation of the Internet’s dominant search engine, to better cope with the longer, more complex queries it has been getting from Web users.
Amit Singhal, senior vice president of search, told reporters on Thursday that the company launched its latest “Hummingbird” algorithm about a month ago and that it currently affects 90 percent of worldwide searches via Google.
Google is trying to keep pace with the evolution of Internet usage. As search queries get more complicated, traditional “Boolean” or keyword-based systems begin deteriorating because of the need to match concepts and meanings in addition to words.
“Hummingbird” is the company’s effort to match the meaning of queries with that of documents on the Internet, said Singhal from the Menlo Park garage where Google founders Larry Page and Sergey Brin conceived their now-ubiquitous search engine.
“Remember what it was like to search in 1998? You’d sit down and boot up your bulky computer, dial up on your squawky modem, type in some keywords, and get 10 blue links to websites that had those words,” Singhal wrote in a separate blog post.
“The world has changed so much since then: billions of people have come online, the Web has grown exponentially, and now you can ask any question on the powerful little device in your pocket.”
Page and Brin set up shop in the garage of Susan Wojcicki — now a senior Google executive — in September 1998, around the time they incorporated their company. This week marks the 15th anniversary of their collaboration.
I just saw a page that was specifically created for Googlebot (hysterical). Ironically, the webmaster even included a link that said: “Click here to load this page faster” assuming he was doing the right thing by creating a just-a-content page. In reality, this was a classic example of cloaking. I won’t provide that URL since I reached out to that webmaster and sent him Matt’s video from 2010 when Page Speed was announced. The point is: Do not do anything special for Googlebot that you do not do for your regular users (pretty basic, but apparently not everyone follows it).
Hi! My name is Maile Ohye, and I work at Google as a Developer Programs Tech Lead. I’m so glad to be speaking to you today, because for me, and on behalf of all my colleagues at Google, we understand how important it is to have a strong news ecosystem. So I hope you find something in this presentation that you find useful.
Today we’re going to talk about three main topics. First, the ranking factors of Google News search. Next we’re going to cover some of the frequently asked questions that we hear from publishers or from SEOs. And last, we’re going to talk more about the best practices when you publish articles.
So let’s take a first look at how your articles appear in a Google search result. There are several ways. First is obviously on google.com, where people might see a news OneBox. And this here, in the upper screenshot, shows you news results for a search like “obama medals,” where now the user is shown some news articles. That’s one way your article can appear in Google News.
On the second screenshot, this is from a user going directly to news.google.com. And here’s where they see a similar cluster of articles. But instead of the google.com homepage, they’re seeing it on the News homepage.
So you might be asking yourself, “How did these articles appear?” Now, the way we gather these articles is by first crawling them, next grouping them, and then ranking all of the information. And we’ll cover each of these steps more in depth.
Let’s start with crawling. In the crawling stage, much like web search, we have Googlebot, which is going to go out to your news site to look for new articles. And there are two ways that we retrieve these articles: One is through our discovery crawl, where Google sees new URLs and then crawls those articles. But in addition to that discovery crawl, you can also create news sitemaps. And news sitemaps are a way for you to list exactly what your new URLs are. So we can use that as well, in addition to our discovery crawl, to find your new information.
And of course, we respect the robots exclusion protocol. You can create a robots.txt file, or use HTTP headers, to let us know specifically what documents you want crawled, and what documents you want excluded from Google search results.
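As a rough illustration of how a crawler can honor the robots exclusion protocol, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules and the `news.example.com` URLs are made up for the example, and this is not a description of how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example news site (an assumption).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A published article is not covered by any Disallow rule for Googlebot.
allowed_article = parser.can_fetch(
    "Googlebot", "https://news.example.com/world/article-12345.html"
)
# Anything under /drafts/ is excluded for Googlebot.
blocked_draft = parser.can_fetch(
    "Googlebot", "https://news.example.com/drafts/unfinished.html"
)

print(allowed_article, blocked_draft)
```

A publisher could run a check like this against their own robots.txt before publishing, to confirm that article URLs are crawlable and that excluded areas really are excluded.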
Last, once we’ve crawled and made sure that we have only crawled where we’re allowed to crawl, we bring those articles back to Google. And that’s the end of the crawling phase.
So next, we get into that grouping phase. And here’s where we have this classification idea. In classification, what we’re doing is actually looking at each individual article’s contents. So you can see on this article, “The Millions Kozlowski Didn’t Steal.” We actually take out individual words, like business, Tyco, money, and CFO, and understand that this article pertains to the section of business. That’s how we populate those different sections of Google News, like business, health, and entertainment.
Another thing we’re doing is populating our editions, whether it’s U.K. or U.S. or India. And we can take that from the text as well. Here we’ve taken words like New York and Manhattan, and that led us to believe that this article pertains to the United States. So this is that grouping stage, where we understand what an article is about, and also what sections and editions it pertains to.
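To make the grouping idea concrete, here is a toy sketch that assigns a section and an edition by counting keyword matches in the article text. The keyword lists, the sample sentence, and the function name `best_label` are all invented for illustration; the talk does not say how the real classifier works, and it is surely more sophisticated than word counting.

```python
import re

# Hypothetical keyword lists; only the business and U.S. words come from the talk.
SECTION_KEYWORDS = {
    "business": {"business", "tyco", "money", "cfo"},
    "health": {"hospital", "vaccine", "doctor"},
    "entertainment": {"film", "album", "celebrity"},
}
EDITION_KEYWORDS = {
    "U.S.": {"new york", "manhattan"},
    "U.K.": {"london", "westminster"},
    "India": {"mumbai", "delhi"},
}


def best_label(text, keyword_map):
    """Return the label whose keywords occur most often, or None if none match."""
    lowered = text.lower()
    scores = {}
    for label, keywords in keyword_map.items():
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score > 0 else None


# Made-up sentence standing in for an article body.
sample = (
    "The former CFO of Tyco faced a Manhattan court in New York "
    "over money taken from the business."
)
section = best_label(sample, SECTION_KEYWORDS)
edition = best_label(sample, EDITION_KEYWORDS)
print(section, edition)
```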
So now that we’ve covered crawling and grouping, we now have ranking. And ranking is going to come in two phases. First, of course, is story ranking. Story ranking is much like what you see on the Google News page, where there’s a group of stories, whether it might be Obama and the medal ceremony, or it might be the death of Michael Jackson, or it might be rising oil prices. Story ranking is deciding which of these clusters of stories should be placed first, which second, which third, that type of idea. And we rank these story clusters according to aggregate editorial interest.
So let’s take a deeper look at what that means. In the upper diagram, you can see that a small story has a small effect on publishing activity. Let’s say in North Carolina, a man was giving out free cars to those who really needed them. It’s a great human-interest story. It might be covered in their local newspaper, and also picked up by a few wires. But this is still a relatively small story, not showing as much aggregate editorial interest as, say, a larger story like the death of Michael Jackson, which is not only published in the local newspaper but also in foreign and national papers, covered by many wires, also including op-ed articles and follow-up articles. You can see that due to all the editorial interest about this story, we will likely rank it higher than the human-interest story about a man giving out free cars in North Carolina. So that’s story ranking. We are actually ranking those clusters.
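One simple way to picture "aggregate editorial interest" is to count how many distinct sources covered each story cluster and order clusters by that count. The sketch below does exactly that; the source names, the story labels, and the choice of a plain source count as the measure are all assumptions made for the example, not the actual ranking formula.

```python
from dataclasses import dataclass


@dataclass
class Article:
    story: str   # label of the story cluster the article belongs to
    source: str  # name of the publication (hypothetical names below)


def rank_stories(articles):
    """Order story clusters by number of distinct sources, most covered first."""
    sources_per_story = {}
    for article in articles:
        sources_per_story.setdefault(article.story, set()).add(article.source)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(
        sources_per_story,
        key=lambda story: (-len(sources_per_story[story]), story),
    )


articles = [
    Article("free-cars", "Local Paper"),
    Article("free-cars", "Wire A"),
    Article("michael-jackson", "Local Paper"),
    Article("michael-jackson", "National Paper"),
    Article("michael-jackson", "Foreign Paper"),
    Article("michael-jackson", "Wire A"),
    Article("michael-jackson", "Wire B"),
]
ranked = rank_stories(articles)
print(ranked)
```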
The next part about ranking is the individual article ranking. Article ranking helps us take a cluster of stories — say the death of Michael Jackson — and helps us determine out of those 200 stories, which one should be ranked first for our users, which should be ranked second and so on. There are many signals that go into article ranking, but I am just going to cover four of the major ones for you here.
First is fresh and new. It’s important to us that an article contain recent substantial information about a news topic, and it needs to be objective news to lead this cluster of stories. So press releases, satire, op-eds aren’t eligible to lead clusters.
Another factor is duplication and novelty detection. And that’s where we try to distinguish the original source of content from those that are duplicating the information. So something that we use there is this idea of citation rank. So per article, we can see that if a news story was broken by the Los Angeles Times, and then later another article, say in Washington, cited the Los Angeles Times as having been a source of their information, we can start to see the citation rank taking place for this story: this article from the Los Angeles Times might have higher ranking now, because other people are citing it as being an original story.
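The citation idea can be sketched as a simple tally: within one story, count how many other sources credit a given source. This is only an illustration under stated assumptions. The citing source names "Washington Paper" and "Wire Service" are hypothetical, and a raw count is a stand-in for whatever the real citation rank computes.

```python
from collections import Counter


def citation_counts(citations):
    """Count distinct citing sources per cited source.

    `citations` is an iterable of (citing_source, cited_source) pairs.
    Self-citations and repeated pairs are ignored.
    """
    counts = Counter()
    seen = set()
    for citing, cited in citations:
        if citing == cited or (citing, cited) in seen:
            continue
        seen.add((citing, cited))
        counts[cited] += 1
    return counts


citations = [
    ("Washington Paper", "Los Angeles Times"),
    ("Wire Service", "Los Angeles Times"),
    ("Washington Paper", "Los Angeles Times"),   # duplicate, ignored
    ("Los Angeles Times", "Los Angeles Times"),  # self-citation, ignored
]
counts = citation_counts(citations)
most_cited = counts.most_common(1)[0][0]
print(most_cited, counts[most_cited])
```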
Another factor is local and personal relevancy, and this applies to individual sections as well as editions of your publication. So what we want to do is actually give more weight to local sources that are likely more relevant to the news item. So if we take that idea of a man giving out free cars in North Carolina, it’s likely that we’d take a paper like the Charlotte Observer and know that that could be a higher authority for that story. And, therefore that article might be ranked higher in this cluster.
The last signal I want to cover in article ranking is the idea of trusted sources. For us, trusted sources doesn’t have to do with some arbitrary decision that we make, but it’s actually data driven. So, according to our data over time, did users start to look at your articles and then click on them? Let’s say that there were five articles being listed and a significant number of users chose the third article and went to that source. Then we might start to determine that this source is actually very trusted for that certain type of information, and over time, we start to build out what publications are trusted sources, but not for their entire publication. It is done on a section and category basis, so something like The Sporting News could be very trusted for sports information, but maybe not so much for business. And likely something like the Wall Street Journal might be very trusted in the United States for business information, but maybe not in India. So again these trusted sources have to do with section and edition, so it’s a very specific thing that we’re looking for, based on aggregate user behavior. So, those are just four of the signals that we use in news search article ranking.
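Because trust is described as data driven and scoped to section and edition, one way to picture it is a click-through rate kept separately for each (source, section, edition) combination. The sketch below assumes a made-up event format and made-up numbers; the talk does not give the real computation.

```python
from collections import defaultdict


def click_through_rates(events):
    """Compute clicks / impressions per (source, section, edition).

    `events` is an iterable of (source, section, edition, was_clicked) tuples,
    a hypothetical log format used only for this example.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for source, section, edition, was_clicked in events:
        key = (source, section, edition)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / shown[key] for key in shown}


# Invented data: the same source does well for sports and less well for business.
events = [
    ("The Sporting News", "sports", "U.S.", True),
    ("The Sporting News", "sports", "U.S.", True),
    ("The Sporting News", "sports", "U.S.", True),
    ("The Sporting News", "sports", "U.S.", False),
    ("The Sporting News", "business", "U.S.", True),
    ("The Sporting News", "business", "U.S.", False),
    ("The Sporting News", "business", "U.S.", False),
    ("The Sporting News", "business", "U.S.", False),
]
rates = click_through_rates(events)
print(rates)
```

Keying on the full triple, rather than on the source alone, is what lets one publication score differently for sports and for business.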
Next, let’s go into some of your frequently asked questions. You might be asking: What are the benefits of submitting a news sitemap? Well, we think that sitemaps are beneficial to us and to you as a publisher as well.
First of all they provide you greater control over which of your articles appear in Google News. And that is because, as I mentioned earlier, they help complement our discovery crawl, and tell us exactly what articles are new and which articles we should crawl.
Second, news sitemaps are great because they help you give us metainformation about your articles. So, rather than rely on our extractor, you can give us the publication date and rather than rely on just our extractor to determine the categories for your articles, you can give us good hints by using the keywords field. So, all in all we think news sitemaps provide a large benefit to publishers.
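Here is a minimal sketch of generating a news sitemap with the publication date and keywords fields mentioned above, using Python's standard `xml.etree.ElementTree`. The element names follow the Google News sitemap schema as I understand it, and the article data and the `news.example.com` URL are invented, so check the current News Publisher documentation before relying on any particular tag.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles):
    """Return news sitemap XML (as a string) for a list of article dicts."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["loc"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = article["publication"]
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = article["language"]
        # Publication date supplied by the publisher instead of being extracted.
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = article["published"]
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
        # Keywords act as hints for categorization.
        ET.SubElement(news, f"{{{NEWS_NS}}}keywords").text = ", ".join(article["keywords"])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical article record.
sitemap_xml = build_news_sitemap([
    {
        "loc": "https://news.example.com/business/article-12345.html",
        "publication": "Example Daily",
        "language": "en",
        "published": "2013-09-27",
        "title": "Example headline",
        "keywords": ["business", "Tyco"],
    }
])
print(sitemap_xml)
```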
Another frequently asked question is: Can Googlebot visit our URLs more than once? And the answer is yes, we can definitely recrawl URLs to check for updates. But just taking a step back, initially Googlebot can actually find your new content within a matter of minutes of when you published it. We find your new content through our discovery crawl or through news sitemaps, and after that initial discovery, we will definitely go back and recheck for new article content. The recrawl rate varies, but it’s pretty safe to say that we will probably go back and check for new content within 12 hours. So we’ll find it within a matter of minutes and we’ll recrawl for new content within 12 hours.
You might also be asking: How do I optimize my multimedia content? Well, that’s a great question. So we’re going to take a look at two types of content. First, let’s talk about videos. With videos, you can create a YouTube channel and submit that to us. We are looking to include other types of video hosters, but right now with YouTube, we have a pretty good idea of the user experience, that the video will load, etc. So YouTube is a trusted video hoster platform for us. And if you do use YouTube, remember that including textual descriptions and transcripts is also helpful because that helps us associate a specific video with the subject matter.
Now let’s talk about images. With images, we have five tips that will help you get your images included in Google News Search.
— First, you want to use a large-sized image with good aspect ratio.
— Second, you want descriptive captions and alt text.
— Third, you want to keep your good image near the title and that again helps us associate an image with the subject matter.
— Fourth, you want your good image to be inline and not a clickable version. So again, you want your good image near the title and in line.
— And last, we prefer JPEG. So if you use things like PNG images, that’s not as good for Google News as JPEG, so I would definitely stick with JPEG if you would like your images included in Google News.
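The five image tips can be turned into a small pre-publish checklist. The sketch below is only that: the `ImageInfo` fields are a hypothetical description of an image on the page, and the size and aspect-ratio thresholds are placeholder assumptions, since the talk gives no numbers.

```python
from dataclasses import dataclass

# Placeholder thresholds (assumptions, not values stated in the talk).
MIN_WIDTH = 300
MAX_ASPECT_RATIO = 2.5


@dataclass
class ImageInfo:
    filename: str
    width: int
    height: int
    caption: str
    alt_text: str
    near_title: bool  # image placed close to the article title
    is_inline: bool   # shown in the page, not behind a click


def image_issues(image):
    """Return a list of tips the image does not yet follow."""
    issues = []
    ratio = max(image.width, image.height) / min(image.width, image.height)
    if image.width < MIN_WIDTH or ratio > MAX_ASPECT_RATIO:
        issues.append("use a larger image with a better aspect ratio")
    if not image.caption.strip() or not image.alt_text.strip():
        issues.append("add a descriptive caption and alt text")
    if not image.near_title:
        issues.append("place the image near the title")
    if not image.is_inline:
        issues.append("make the image inline, not a clickable version")
    if not image.filename.lower().endswith((".jpg", ".jpeg")):
        issues.append("prefer JPEG")
    return issues


good = ImageInfo("lead.jpg", 800, 600, "Caption text", "Alt text", True, True)
bad = ImageInfo("thumb.png", 100, 100, "", "", False, False)
print(image_issues(good))
print(image_issues(bad))
```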
So the last frequently asked question of course is: What about PageRank? PageRank is a lesser factor in Google News than it is in web search. And that makes sense, right, because the linking structure for an article that was only published minutes ago isn’t going to be the same as one that was published years or months ago. So we have to use PageRank delicately in Google News. So instead of using signals like PageRank, we actually rely more on the signals we talked about earlier, things like timeliness (is it fresh and new?) or whether it has local or personal relevancy, those types of things.
So now that we have covered how Google crawls and groups and ranks articles, and we answered some of your frequently asked questions, let’s just get into some best practices.
First, it is important that you create permanent unique URLs with at least three digits. And the reason for this is that, traditionally, news publishers have used “article ID equals” followed by a number in their URL strings. And that has helped us to determine that it’s an article and not just a static HTML page. But if your news publishing system doesn’t include digits (at least three for Google News), then you can actually submit a news sitemap, so that’s the workaround. If you don’t have three digits in your URLs, you can create a news sitemap and let us know which specific URLs belong in news.
The second best practice is to not break up the article body, so your news article should have sequential paragraphs that can all be included in Google News. You don’t want to break that up with user comments, or links to related posts, or even things like links to additional pages. That’s not as good for Google News. We’ll take all of the article on that first page. So look again to not break up the article body.
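A publisher could check their own URLs against the three-digit guideline with a short script. The sketch below simply looks for a run of three or more consecutive digits in the path or query string. The example URLs are made up, and the real eligibility rules may include conditions that this check does not capture.

```python
import re
from urllib.parse import urlparse


def has_three_digit_id(url):
    """Return True if the URL path or query contains 3+ consecutive digits."""
    parts = urlparse(url)
    # Only the path and query are inspected, so digits in the host name
    # or port do not count.
    path_and_query = parts.path + "?" + parts.query
    return re.search(r"\d{3,}", path_and_query) is not None


# Hypothetical URLs for illustration.
with_id = has_three_digit_id("https://news.example.com/story?articleid=48213")
without_id = has_three_digit_id("https://news.example.com/world/free-cars.html")
too_short = has_three_digit_id("https://news.example.com/2/story-7.html")
print(with_id, without_id, too_short)
```

URLs that fail the check are the ones to list in a news sitemap, per the workaround described above.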
A third best practice is to put dates between the title and the body, and that will help our data extractor to have the correct publication date.
Fourth, titles matter. And this is to have a good HTML title as well as an article title, so you want your title to be extremely indicative of the story at hand.
Fifth, it’s best for Google News if you separate your original article content from your press releases. And you can do this in a directory structure. And this helps us to determine what is specifically a news article versus what might be satire or opinion or a press release.
And the last tip, of course, is to create unique and informative content and that’s always going to help you do well in rankings. So the more unique content that you create and the more users that enjoy that, the more users we’ll send there. And this is kind of converse to the idea of just publishing other people’s content or just having duplicate information.
So again, the more great information you put out for all of us to read, the more users you’ll attract to your site. If you have additional questions, please feel free to visit our News Publisher Help Center. Thanks so much for watching.