Google API Leak: A Look Inside the Algorithm

The 2024 Google API leak offers an unprecedented look into the inner workings of Google’s search algorithm.

In late May 2024, the world of SEO and digital marketing was shaken by a significant leak of Google API documentation describing more than 14,000 attributes, many of which look like potential ranking factors. This leak, shared by Erfan Azimi with SparkToro’s Rand Fishkin and then distributed with the help of Michael King from iPullRank, provides a rare glimpse into the intricate workings of Google's search algorithm. Here's an in-depth look at what happened, why it matters, and the implications for webmasters and SEO professionals.

The Unfolding of the Leak

In late May, Erfan Azimi shared the leaked Google API documentation with SparkToro’s Rand Fishkin, who in turn brought in Michael King of iPullRank to help distribute the story. The leaked files originated from a Google API document commit titled “yoshi-code-bot /elixer-google-api,” which suggests the documents came from an accidental publication rather than from a hack or a whistle-blower.

The leak appears to have come from GitHub, where these documents were inadvertently and briefly made public (many links in the documentation point to private GitHub repositories and internal pages on Google’s corporate site that require specific, Google-credentialed logins). During this probably accidental public period, between March and May of 2024, the API documentation spread to Hexdocs (which indexes public GitHub repos) and was found and circulated by other sources.

Why This Leak Matters

Google's search algorithm is one of the most closely guarded secrets in the tech world. Incredibly, over the past 25 years, no leak of such magnitude and detail has ever been reported. This documentation provides a rare opportunity to understand the mechanisms behind Google’s ranking systems.

What the Leak Says

The leaked Google documentation outlines each module of the API and breaks them down into summaries, types, functions, and attributes. Most of what we’re looking at are the property definitions for various protocol buffers (or protobufs) that get accessed across the ranking systems to generate SERPs.

It’s also helpful to note documentation like this exists on almost every Google team, explaining various API attributes and modules to help familiarize those working on a project with the data elements available.

Expert Validation

Rand Fishkin reached out to former Googlers for their insights on the leaked documents. Three wrote back. One felt uncomfortable commenting, while the other two confirmed the legitimacy of the leak, saying:

“I didn’t have access to this code when I worked there. But this certainly looks legit.”
“It has all the hallmarks of an internal Google API.”
“It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
“I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
“Nothing I saw in a brief review suggests this is anything but legit.”

Google's Official Response

After a few days, Google officially responded, in their typical fashion. They were quoted by Search Engine Land:

“We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

They also went on to add, “it would be incorrect to assume that this data leak is comprehensive, fully-relevant or even provides up-to-date information on Search rankings.”

Key Revelations from the Leak

Because the documentation describes each module's summaries, types, functions, and attributes in detail, it hints at what Google's systems track. Here are some of the most significant findings:

PageRank and Other Ranking Factors

Some of the most important components revealed in Google’s algorithm appear to be navBoost, NSR and chardScores. Alongside those, other ranking factors that stood out to us included:

Modified PageRank: The documentation mentions a deprecated algorithm called pageRank_NS, associated with document understanding. There are seven types of PageRank, including the well-known ToolBarPageRank.
Business Model Identification: Google identifies different business models such as news, YMYL (Your Money or Your Life), personal blogs, e-commerce, and video sites. The reason for filtering personal blogs remains unclear.
Click Metrics: Features like “goodClicks,” “badClicks,” “lastLongestClicks,” and “impressions” are used to measure user engagement. Google appears to have a way to filter out unwanted clicks and to measure click length in order to identify pogo-sticking behavior (a searcher clicking a result and quickly returning to the results page).
Authority Metrics: Google uses site-wide authority metrics, including traffic from Chrome browsers, and various embedding techniques in its scoring functions.
Original Content Scoring: Short content is ranked differently, with an OriginalContentScore emphasizing originality over length.
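
To make the click metrics above more concrete, here is a minimal sketch of how similar counters could be tallied from your own analytics data. The leak does not explain how Google computes these values, so everything here is an assumption for illustration: the 10-second threshold, the session format, the function name tally_sessions, and our reading of “lastLongestClicks” as the final click in a session that was also the longest.

```python
from dataclasses import dataclass

# Assumed cutoff for a "short" click; the leaked docs do not publish one.
SHORT_CLICK_SECONDS = 10.0


@dataclass
class ClickTally:
    impressions: int = 0
    good_clicks: int = 0
    bad_clicks: int = 0
    last_longest_clicks: int = 0


def tally_sessions(sessions, url):
    """Count click outcomes for `url` across search sessions.

    Each session is a list of (clicked_url, dwell_seconds) tuples in click
    order. Every session is assumed to have shown `url` in its results.
    """
    tally = ClickTally()
    for clicks in sessions:
        tally.impressions += 1
        for clicked_url, dwell in clicks:
            if clicked_url != url:
                continue
            if dwell < SHORT_CLICK_SECONDS:
                # Quick return to the results page: pogo-sticking.
                tally.bad_clicks += 1
            else:
                tally.good_clicks += 1
        # Hypothetical reading: credit the url when it was both the final
        # click of the session and the one with the longest dwell time.
        if clicks:
            last_url, last_dwell = clicks[-1]
            if last_url == url and last_dwell == max(d for _, d in clicks):
                tally.last_longest_clicks += 1
    return tally


if __name__ == "__main__":
    sessions = [
        [("a.com", 3.0), ("b.com", 120.0)],  # searcher bounced off a.com
        [("b.com", 4.0), ("a.com", 95.0)],   # searcher settled on a.com
        [],                                  # impression with no click
    ]
    print(tally_sessions(sessions, "a.com"))
```

A tally like this will not reproduce Google's numbers, but it can help flag pages where most visits from search end in a quick return to the results page.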

Demotions and Penalties

The leak described several factors that can lead to demotions in search rankings, such as:

Poor Navigational Experience: Sites with poor navigation are penalized.
Location Identity: Mismatched location identity can hurt rankings.
Link Relevance: Links that don’t match the target site negatively impact scores.
User Dissatisfaction: User click dissatisfaction and poorly performing pages are penalized.

Content Freshness and Relevance

The leak also surfaced content-related details on questions that have long stumped SEOs. These details might show how Google prioritizes its ranking factors:

Content Updates: Regular updates and additions of unique information, new images, and videos are crucial for maintaining freshness.
Content Length: Short content can rank, but has a different scoring system applied to it. OriginalContentScore is mentioned and suggests that short content is scored for its originality. This is probably why thin content is not always a function of length.
Title Match: How well the page title matches the query remains an important ranking factor, as indicated by titlematchScore. Its description suggests this is still something Google actively gives value to.
User Experience: Google uses page embeddings, site embeddings, site focus and site radius in its scoring function.
Date Association: Google emphasizes fresh results and attempts to associate dates with pages.
Content Relevance: Irregularly updated content has the lowest storage priority for Google and is unlikely to show up for queries where freshness matters.
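
The idea of a site embedding and a site radius from the list above can be illustrated with a small sketch. The leak names these signals but, as far as we can tell, does not give formulas, so the definitions below are our assumptions: the site embedding is taken as the average of the page embeddings, and the radius as the average cosine distance of each page from that average. The function names are hypothetical, and the vectors are toy values rather than real embeddings.

```python
import math


def mean_vector(vectors):
    """Average equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def site_radius(page_embeddings):
    """Average distance of pages from the (assumed) site embedding."""
    site_embedding = mean_vector(page_embeddings)
    distances = [cosine_distance(p, site_embedding) for p in page_embeddings]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy 2-D "embeddings": one tightly themed site, one scattered site.
    focused_site = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
    scattered_site = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    print("focused radius:  ", site_radius(focused_site))
    print("scattered radius:", site_radius(scattered_site))
```

Under these assumptions, a site whose pages all cover closely related topics ends up with a smaller radius than a site that publishes on unrelated subjects, which is one way to think about topical focus.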

How We’re Interpreting This Leak

The contents of the leak are not necessarily proof that Google uses the specific data and signals it mentions for search rankings. Rather, the leak outlines what data Google collects from web pages, sites, and searchers and offers indirect hints about what Google seems to care about.

The biggest takeaway from this leak is the confirmation that creating valuable content and engaging websites remains paramount. Websites should continue to focus on providing great user experiences, earning links, and encouraging user engagement.

New Strategies Going Forward

Remove Poorly Performing Pages: If user metrics are bad, no links point to the page and the page has had plenty of opportunity to thrive, then that page should be eliminated. Site-wide scores and scoring averages are mentioned throughout the leaked docs, and it is just as valuable to delete the weakest links as it is to optimize your new article (with some caveats).
Content Audits: Regular audits to remove (or optimize) outdated, irrelevant, or thin content are crucial. When updating content, look for ways to add unique perspectives, new images, and video content.
Quality Link Building: In the past, Google spokespeople have made efforts to downplay the impact of domain authority, despite the SEO community seeing its positive impacts firsthand. The leak, however, showed that there’s a SiteAuthority score, validating what SEOs have long believed. That said, not all links are created equal. For example, the leak showed that links from newer pages are weighted more strongly than those coming from older content. Additionally, Google is probably ignoring links that do not come from relevant sources, something to keep in mind as you do any link prospecting.
Experimentation: This leak is another indication that we should be taking in the inputs and experimenting with them to see what will work for our websites. It’s not enough to look at anecdotal reviews of things and assume that’s how Google works.
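
As a starting point for the page-pruning audit described above, here is a minimal sketch that flags pages meeting all three conditions: weak user metrics, no inbound links, and enough time to have proven themselves. The thresholds (one year of age, a click-through rate of 0.5% or lower over 90 days) and the PageStats fields are assumptions for illustration, not values from the leak; plug in data from whatever analytics and link tools you use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageStats:
    url: str
    published: date
    clicks_90d: int
    impressions_90d: int
    inbound_links: int


def prune_candidates(pages, today, min_age_days=365, max_ctr=0.005):
    """Return URLs that are old enough, unlinked, and rarely clicked.

    The default thresholds are illustrative assumptions; tune them per site.
    """
    candidates = []
    for page in pages:
        age_days = (today - page.published).days
        if page.impressions_90d:
            ctr = page.clicks_90d / page.impressions_90d
        else:
            ctr = 0.0  # never shown in search counts as poor performance
        if age_days >= min_age_days and page.inbound_links == 0 and ctr <= max_ctr:
            candidates.append(page.url)
    return candidates


if __name__ == "__main__":
    pages = [
        PageStats("/old-guide", date(2021, 3, 1), 2, 1000, 0),
        PageStats("/new-post", date(2024, 4, 15), 0, 50, 0),
        PageStats("/linked-classic", date(2019, 1, 1), 1, 900, 12),
        PageStats("/evergreen", date(2020, 6, 1), 300, 4000, 0),
    ]
    print(prune_candidates(pages, today=date(2024, 6, 1)))
```

Treat the output as a review list rather than a delete list: a flagged page may still be worth consolidating, redirecting, or rewriting before it is removed.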

Final Thoughts

The 2024 Google API leak offers an unprecedented look into the inner workings of Google’s search algorithm. While the information should be taken with caution, it provides valuable insights that can help guide SEO strategies and improve search rankings. As always, focusing on quality content and user experience remains the best approach to achieving SEO success.