By Cyrus Shepard
Last Updated: February 15th, 2024
It’s a big index, but faces growing pains
Google rarely discusses the size of its web index—at least publicly.
What exactly is Google’s index? Simply put, it’s a digital record of web pages and other documents eligible to be served in Google’s search results.
If it’s not in the index, you won’t see it in search results.
Many people assume you can search Google for any page on the web, but the opposite is closer to the truth. Out of trillions of possible pages, Google must narrow its index down to mere “billions” of the most important documents.
Google typically keeps the actual size of its index a secret.
But recently, during testimony in the USA vs. Google antitrust trial, questioning by US attorneys revealed that Google maintained a web index of “about 400 billion documents.”
The number came up during the cross-examination of Google’s VP of Search, Pandu Nayak. The 400 billion refers to Google’s index size in 2020.
Nayak also testified that “for a time,” the index grew smaller than that.
Finally, when asked if Google made any changes to its index capacity since 2020, Nayak replied “I don’t know in the past three years if there’s been a specific change in the size of the index.”
The takeaway is that while 400 billion probably isn’t the exact index size, it’s most likely a good ballpark figure.
Also, the size of the index shifts over time and may even be shrinking.
Google Index Size Over Time
Thanks to Kevin Indig’s excellent research on Google’s index size over time, we can chart how it has grown massively since the 1990s. (Note that many of these data points are estimates.)
But charts can be deceiving. This makes it seem like Google always builds its index size up and to the right year after year.
In reality—as recent court testimony proved—sometimes Google shrinks the size of its index.
In fact, it may be shrinking even today.
How Many Websites Does Google Index?
400 billion is a lot of URLs, but how many websites does this represent?
While we don’t know the exact number, Google’s Gary Illyes shed some light on this topic when he revealed Google checks the robots.txt record of 4 billion hostnames every single day. (A hostname is the domain name along with any subdomain, e.g., “www.example”.)
“…you can see that we have about 4 billion host names that we check every single day for robots.txt. Now, let’s say that all of those have subdomains, or subdirectories, for example. So the number of sites is probably over four, or– very likely over four billion.” Search Off The Record Podcast
While we don’t know if Google indexes all 4 billion of those sites, it’s likely a good estimate.
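To make the hostname distinction concrete, here is a minimal Python sketch that groups URLs by hostname using only the standard library. The URLs are made-up examples, and this is not a description of how Google counts sites; it only illustrates that subdomains of the same domain count as separate hostnames.

```python
from urllib.parse import urlsplit

# Hypothetical URLs, for illustration only.
urls = [
    "https://www.example.com/about",
    "https://www.example.com/contact",
    "https://blog.example.com/post-1",
    "https://example.org/",
]

def unique_hostnames(url_list):
    """Return the set of distinct hostnames found in a list of URLs."""
    return {urlsplit(u).hostname for u in url_list}

hostnames = unique_hostnames(urls)
print(sorted(hostnames))
```

Four URLs collapse into three hostnames here, because the two pages on www.example.com share one hostname while blog.example.com counts separately.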
Google Excludes An Increasing Number of Documents
Google can’t index every page it finds on the web. Nor does it want to.
Google actually discovers trillions of pages while crawling. But as Nayak testified, most of those pages aren’t helpful to users.
“Like I said, trillions of pages is a lot of pages. So it’s a little difficult to get an index of the whole web. It’s not even clear you want an index of the whole web, because the web has a lot of spam in it. So you want an index of sort of the useful parts of the web that would help users.”
Beyond getting rid of spam, Nayak listed several other factors that impact the size of Google’s index:
1. Freshness Of Documents
Some pages on the web change quickly – like the front page of CNN. Other important pages can stay the same for years. The challenge Google faces is estimating how often a page might change to keep its index fresh without unnecessary crawling.
2. Document Size
Webpages are simply growing bigger. Ads, images, and more code mean the average size of web pages has grown huge over time.
Since it costs money to crawl and process web documents, larger pages make indexing more expensive for Google.
“… over time at various times, the average size of documents has gone up for whatever reason. Webmasters have been creating larger and larger documents in various ways. And so for the same size of storage, you can index fewer documents, because each document has now become larger.” Pandurang Nayak, US vs Google
Bigger documents mean pressure to index fewer pages.
3. Metadata Storage
Not only does Google store each document, it creates a tremendous amount of data about each document, including all the words and concepts related to each document.
“… when we get these documents, not only do we create an index, we create a bunch of metadata associated with the document which reflects our understanding of the document. And that has also grown over time. And so that also takes space in the index. And as a result, that results in the number of documents that you can index in a fixed size of storage to go down.” Pandurang Nayak, US vs Google
As Google’s algorithms become more and more sophisticated, the amount of metadata increases, limiting the amount the index can grow in size.
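Nayak’s points about document size and metadata come down to simple division: with a fixed amount of storage, more bytes per document means fewer documents. The sketch below illustrates this in Python. Every number in it is hypothetical and chosen only to keep the arithmetic easy; Google’s real storage figures are not public.

```python
def documents_that_fit(storage_bytes, avg_doc_bytes, avg_metadata_bytes):
    """Number of whole documents that fit in a fixed storage budget."""
    per_document = avg_doc_bytes + avg_metadata_bytes
    return storage_bytes // per_document

# All figures below are hypothetical, not Google's actual numbers.
STORAGE = 1_000_000_000_000  # assumed 1 TB storage budget

# Smaller documents with lighter metadata.
before = documents_that_fit(STORAGE, avg_doc_bytes=50_000, avg_metadata_bytes=50_000)

# Documents and metadata both double in size; storage stays the same.
after = documents_that_fit(STORAGE, avg_doc_bytes=100_000, avg_metadata_bytes=100_000)

print(before)  # 10000000
print(after)   # 5000000
```

In this toy example, doubling the bytes stored per document halves the number of documents the same storage can hold.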
4. Cost of indexing and processing
At the end of the day, all those data centers cost a lot of money – and use a lot of electricity!
“… there is this trade-off that we have in terms of amount of data that you use, the diminishing returns of the data, and the cost of processing the data. And so usually, there’s a sweet spot along the way where the value has started diminishing, the costs have gone up, and that’s where you would stop.” Pandurang Nayak, US vs Google
How Big is 400 Billion Documents?
Make no mistake, 400 billion is a big number.
For example, the size of this (very small) website you are reading right now—Zyppy—is about 50 pages. So Google’s index could hold 8 billion websites like this one.
Some sites are much larger. Wikipedia, for example, has 7 billion pages in Google’s index. So, Google could hold only about 50-60 Wikipedias.
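The comparisons above are simple division, and can be checked with a few lines of Python. The inputs are the figures quoted in this article (400 billion documents, about 50 pages, 7 billion Wikipedia pages), all of which are approximations.

```python
INDEX_SIZE = 400_000_000_000     # "about 400 billion documents" from the testimony
SMALL_SITE_PAGES = 50            # approximate size of this site, per the article
WIKIPEDIA_PAGES = 7_000_000_000  # Wikipedia page count quoted in the article

# How many sites of each size would fill the whole index?
small_sites = INDEX_SIZE // SMALL_SITE_PAGES
wikipedias = INDEX_SIZE / WIKIPEDIA_PAGES

print(f"{small_sites:,} sites of 50 pages")  # 8,000,000,000 sites of 50 pages
print(f"{wikipedias:.1f} Wikipedias")         # 57.1 Wikipedias
```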
To put this figure in perspective, consider the size of Google’s index compared to popular indexes SEOs might know about – Ahrefs, Moz, and the Wayback Machine.
This chart shows huge differences, but size isn’t everything.
“Index” means something different for every system that keeps a record of web documents.
For example, the Wayback Machine stores multiple copies of every URL it records but stores fewer truly “unique” documents. It also processes and stores far less metadata for each URL compared to Google.
Likewise, SEO tools like Ahrefs are more likely to index the more important webpages that are useful for SEO. The data Ahrefs calculates and records for every URL is also far different from Google’s.
And Google, while it filters out a lot of junk, is more likely to contain vast numbers of documents like books, patents, PDFs, and scientific papers that serve smaller and more niche audiences.
Takeaways for Web Publishers
As AI-generated content becomes cheaper to produce and floods the web, Google may be forced to index a shrinking percentage of the web pages it finds.
As Nayak explained, the goal of Google’s index isn’t about making a complete record of all documents but indexing enough pages to satisfy users.
“… making sure that when users come to us with queries, we want to make sure that we’ve indexed enough of the web so we can serve those queries. And so that’s why the index is such a crucial piece of the puzzle.” Pandurang Nayak, US vs Google
This supports what Google has been publicly hinting at for years: Sometimes Google doesn’t index a page because it doesn’t believe the page will be useful to users.
If Google isn’t indexing your pages, you may need to evaluate your site’s technical SEO, the usefulness of the content, links to your site, and your user engagement, among other factors.
It may seem like being in Google’s index is a no-brainer.
But increasingly, we may see more pages excluded from it.