How does Google crawl the web? What elements really impact Google’s crawl?
This article is based on a study of 413 million pages crawled by Botify and 4 billion Googlebot requests, as presented at SMX Paris.
The analysed data came from websites in three industries: Retail, Publishers and Classifieds. Data for each website were taken both from Botify analytics and from the log files (metrics calculated over 30 days of logs).
The discussion below uses the following key metrics:
Compliant URLs
Crawl Ratio
Crawl Frequency
A compliant URL is a URL that meets all of the following requirements (a minimal classification check is sketched after the list):
Has a canonical tag pointing to itself (or no canonical set)
Returns an HTTP 200 status code
Serves text/HTML content
Is indexable
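To make the definition concrete, here is a minimal sketch of how such a classification could be done, assuming each URL’s crawl data is available as a simple dictionary (the field names are illustrative, not Botify’s actual schema):

```python
# Minimal sketch: classify a URL as "compliant" from hypothetical crawl data.
# Field names (status_code, content_type, canonical, indexable) are illustrative.

def is_compliant(url: str, page: dict) -> bool:
    """Return True when a page meets all four compliance requirements."""
    canonical_ok = page.get("canonical") in (None, "", url)             # self-canonical or not set
    status_ok = page.get("status_code") == 200                          # HTTP 200
    html_ok = page.get("content_type", "").startswith("text/html")      # text/HTML content
    indexable_ok = page.get("indexable", False)                         # not blocked by noindex / robots.txt
    return canonical_ok and status_ok and html_ok and indexable_ok

# Example usage
page = {"status_code": 200, "content_type": "text/html; charset=utf-8",
        "canonical": "https://example.com/a", "indexable": True}
print(is_compliant("https://example.com/a", page))  # True
```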
Crawl ratio is the percentage of compliant (indexable) pages crawled by Google within a 30-day time frame, out of all compliant pages.
Crawl frequency is the average number of times a URL on the website was crawled by Google over 30 days.
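Under those definitions, both metrics can be derived from the set of compliant URLs and 30 days of Googlebot hits parsed out of the server logs. A rough sketch, assuming the log lines have already been filtered down to verified Googlebot requests:

```python
from collections import Counter

def crawl_metrics(compliant_urls: set[str], googlebot_hits: list[str]):
    """Compute crawl ratio and crawl frequency over a 30-day log window.

    compliant_urls  -- all compliant URLs known from the site crawl
    googlebot_hits  -- one entry per verified Googlebot request (URL only)
    """
    hits_per_url = Counter(googlebot_hits)
    crawled_compliant = {u for u in compliant_urls if hits_per_url[u] > 0}

    crawl_ratio = len(crawled_compliant) / len(compliant_urls) if compliant_urls else 0.0
    # Average number of Googlebot hits per distinct URL seen in the logs.
    crawl_frequency = sum(hits_per_url.values()) / len(hits_per_url) if hits_per_url else 0.0
    return crawl_ratio, crawl_frequency
```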
The study’s average figures show that:
The crawl ratio of the analysed websites (regardless of size) was 49%
The active pages ratio was 23%, and
The crawl frequency was 2.3 crawls per URL per 30 days
These data will be used as a benchmark in some of the findings below.
Website size matters
The larger the website, the greater the impact of:
Orphan pages
Load Time
% of words vs. Template
Big websites with more than 1 million pages have half the crawl ratio of the average smaller website. They also have a quarter of the average active ratio, i.e. the share of active pages (those that receive any organic traffic at all) among all pages, compared to smaller websites.
The crawl frequency of big websites was 1.7 versus the 2.3 average.
The following elements also have a huge impact, regardless of site size:
PageRank
Page depth
Content Size
Elements that really impact Google’s crawl
Here’s a recap of the confirmed and unconfirmed hypotheses about how certain site elements affect the key metrics (crawl ratio, etc.).
Publishers are crawled 45% more frequently than other industries
This is quite logical, since publishers normally update their content more often than sites in other industries. They also tend to have a larger content size per page.
Non-compliant pages are eating into your crawl budget
In contrast to compliant pages, non-compliant pages are poorly indexable from a technical point of view. These pages send bad crawl signals to web spiders, discouraging them from crawling your pages more.
The reasons why non-compliant pages still get crawled are as follows:
Extensive use of the noindex tag (without nofollow, so the pages are still served to robots but not indexed)
Server errors
Incorrect canonical annotations
The study results show that Google is wasting (at least) 16% of its time and resources crawling non-compliant pages.
The higher the share of non-compliant pages on a site, the lower its crawl ratio.
For large websites with over 100k pages, the negative correlation between the share of non-compliant pages and the overall crawl ratio is very evident.
Orphaned pages steal up to 26% of crawl resources
Orphaned pages:
Are outside of the website structure
Are not discovered by Botify’s crawl of the site structure
Are crawled by Google (accessed through sitemaps or via crawl requests submitted through Google Search Console)
Receive crawl budget
The analysis of the data set showed that orphaned pages can take up to a quarter of a site’s crawl budget.
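In practice, orphaned pages can be spotted by diffing the URLs Googlebot requested (from the logs) against the URLs reachable by following links from the homepage, which is what a structure crawl such as Botify’s would find. A simplified sketch:

```python
def find_orphans(structure_urls: set[str], googlebot_urls: set[str]) -> set[str]:
    """URLs crawled by Google but never reached by following the site's own links."""
    return googlebot_urls - structure_urls

def orphan_crawl_share(structure_urls: set[str], hits_per_url: dict[str, int]) -> float:
    """Share of Googlebot requests spent on orphaned pages (their 'stolen' crawl budget)."""
    orphans = find_orphans(structure_urls, set(hits_per_url))
    orphan_hits = sum(hits_per_url[u] for u in orphans)
    total_hits = sum(hits_per_url.values())
    return orphan_hits / total_hits if total_hits else 0.0
```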
More orphans means a lower crawl ratio
As the percentage of orphaned pages increases, the crawl ratio of the pages within the structure is negatively impacted; this is especially true on big and giant websites.
PageRank is a strong signal that is supposed to steer Googlebot
Internal PageRank:
Is a representation of how popularity spreads through the website’s internal link structure
Is a strong signal that is supposed to steer Googlebot.
The larger the share of internal PageRank concentrated on compliant pages, the better the crawl ratio. Conversely, if your crawl ratio is poor, one of the first things to look at is distributing your internal PageRank better.
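Internal PageRank can be approximated from the internal link graph alone. The sketch below uses networkx to compute it and then measures how much of that PageRank sits on compliant pages; the link graph and the compliant set are assumed to come from a site crawl (this is an illustration, not Botify’s exact method):

```python
import networkx as nx

def compliant_pagerank_share(internal_links: list[tuple[str, str]],
                             compliant_urls: set[str]) -> float:
    """Fraction of internal PageRank concentrated on compliant pages.

    internal_links -- (source_url, target_url) pairs from the site crawl
    """
    graph = nx.DiGraph(internal_links)
    scores = nx.pagerank(graph)          # standard PageRank over the internal link graph
    total = sum(scores.values())         # equals 1.0 up to rounding
    on_compliant = sum(score for url, score in scores.items() if url in compliant_urls)
    return on_compliant / total if total else 0.0
```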
A page’s depth greatly impacts its chances of being crawled by Google
The depth of a URL is:
The number of physical clicks from the homepage
Not the same as the number of folders in the URL path
Greater depth slows down Googlebot’s crawl.
Websites with a higher average depth were found to be, on average, less crawled by Google. For example, at depths below 4 the crawl ratio is 58%, whereas at depths 6-9 it is 39%.
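Because depth is measured in clicks rather than URL folders, it has to be computed on the link graph, for example with a breadth-first search from the homepage. A minimal sketch, assuming the adjacency structure comes from a site crawl:

```python
from collections import deque

def page_depths(homepage: str, links: dict[str, list[str]]) -> dict[str, int]:
    """Depth of each URL = minimum number of clicks from the homepage (BFS)."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:          # first time reached = shortest click path
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths

# Example: /category is 1 click deep, /category/item is 2 clicks deep
links = {"/": ["/category"], "/category": ["/category/item"]}
print(page_depths("/", links))  # {'/': 0, '/category': 1, '/category/item': 2}
```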
Load time has a huge impact on crawl ratio, but only for larger websites
Page load time consists of:
Time to first byte (the responsiveness of the server)
Time to build the DOM.
When we look across all website sizes combined, load time does not seem to have a great impact on Google’s crawl.
Yet while page load time has a low impact on crawl ratio for small websites, for big websites the impact is huge.
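Time to first byte can be approximated from outside the server, for example with Python’s requests library: response.elapsed measures the time between sending the request and parsing the response headers, which is a reasonable TTFB proxy (it does not capture DOM build time, which needs a headless browser). A minimal sketch:

```python
import requests

def approx_ttfb(url: str) -> float:
    """Rough time-to-first-byte in seconds: request sent -> response headers parsed."""
    # stream=True avoids downloading the body, so elapsed stays close to the TTFB.
    with requests.get(url, stream=True, timeout=10) as response:
        return response.elapsed.total_seconds()

print(approx_ttfb("https://example.com"))  # e.g. 0.18
```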
Heavy page templates have a huge impact on big sites’ crawl ratio
Percentage of content is the ratio of real content to template. In most cases, template-heavy pages are heavier to load, i.e. more difficult for Google to crawl.
The study data shows that the percentage of content vs. template is crucial for big sites’ crawl effectiveness, whereas small sites are less likely to be affected.
Content size has a very big impact on Google crawl
Content size is the number of words on a page, excluding the template.
The study shows a strong positive correlation between content size and the crawl ratio.
The effect is observed in all groups of websites, from small (fewer than 100k pages) to big (more than 1 million pages).
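Both metrics boil down to splitting a page’s words into “real content” and “template”. Identifying the template reliably requires comparing many pages of the site, but a simplified per-page sketch, assuming the main content lives inside a known container such as <main> or <article> (an assumption for illustration, not Botify’s method), could look like this:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def content_stats(html: str) -> tuple[int, float]:
    """Return (content word count, % of the page's words that are real content).

    Assumes the main content sits in <main> or <article>; everything else
    (navigation, header, footer, sidebars) is treated as template.
    """
    soup = BeautifulSoup(html, "html.parser")
    total_words = len(soup.get_text(" ", strip=True).split())

    main = soup.find("main") or soup.find("article")
    content_words = len(main.get_text(" ", strip=True).split()) if main else 0

    pct_content = 100.0 * content_words / total_words if total_words else 0.0
    return content_words, pct_content
```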
Elements that were not confirmed to significantly impact crawl