A beginner’s guide to web crawling

Whether you are a website owner or an SEO professional, it is essential to understand website crawling: how search engines crawl websites and web pages, and how they rank (or decide not to rank) a page for a given search query.

Website crawling is a technical process and, to be honest, you do not need to understand every behind-the-scenes technical detail. Understanding the main concept, and what you can do to facilitate crawling for search engines like Google, is enough to help you make your website more search-engine friendly, follow SEO best practices, and rank higher on the search engine results pages (SERPs).

In this post, we will discuss:

What is website crawling
The different types of website crawling
What the future of website crawling looks like in 2022 and beyond
How you can facilitate website crawling for Google

Let’s take it from the top.

What is web crawling?

Before we jump further into this, it’s crucial to understand what web crawling is.

Crawling refers to the process by which search engines discover new and updated content on the internet. Search engines do this by sending out crawlers (also commonly known as robots, bots, or spiders).

These bots “crawl” the internet to find new pages that can be added to the search engine’s index. They also look for pages that were recently updated with new content.

The type of “content” can vary — from web pages to images to videos to PDFs.

The limitations faced by web crawlers

As you can imagine, fetching every web page on the internet and crawling it for new content is a tough job because of the sheer volume of pages. In addition, millions of new pages appear on the web daily.

This requires a lot of computational resources, which may lead to sustainability problems. Later in this article, we will talk more about what this means for the future of web crawling and how this limitation could affect webmasters and SEO professionals.

For now, you should understand how these crawlers or spiders try to overcome this problem by becoming more efficient and how you can leverage this to your benefit.

Crawl spiders usually fetch a few web pages and crawl them. Then they follow the links (internal links and external links) on those web pages to find new URLs to crawl and index. This helps crawlers to become more efficient in building a gigantic database of URLs.

And that’s why adding links to your web pages, especially contextually relevant internal links to other pages, is a recommended SEO practice.
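To make the link-following idea concrete, here is a minimal Python sketch. It is only an illustration, not how Googlebot or any real search engine is implemented. The LINK_GRAPH dictionary, the example.com URLs, and the crawl function are all made up for this example, and a real crawler would fetch pages over HTTP and parse the HTML instead of reading links from a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download and parse each page instead.
LINK_GRAPH = {
    "https://example.com/": ["https://example.com/blog", "https://example.com/about"],
    "https://example.com/blog": ["https://example.com/blog/post-1", "https://example.com/"],
    "https://example.com/about": [],
    "https://example.com/blog/post-1": ["https://example.org/external"],
    "https://example.com/orphan": [],  # no page links here
}


def crawl(seed_urls, link_graph, max_pages=100):
    """Visit the seed URLs first, then follow links to URLs not seen before."""
    discovered = []
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(discovered) < max_pages:
        url = queue.popleft()
        discovered.append(url)
        # Unknown URLs (for example, external pages) simply have no links here.
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return discovered


found = crawl(["https://example.com/"], LINK_GRAPH)
print(found)
```

In this toy example, the crawler starts from the homepage and reaches every page that is linked from somewhere. The page at /orphan never shows up in the result because no other page links to it, which is exactly why internal links matter.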

Two types of web crawling

According to Google, there are two types of web crawling:

Discovery
Refresh

“When it comes to crawling, we have two types of crawling. One is a discovery crawl where we try to discover new pages on your website. And the other is a refresh crawl where we update existing pages that we know about,” says Google’s John Mueller.

The crawl frequency — apart from the type of crawling — also depends on how frequently content is being updated on your website or web page. For example, if your website homepage is updated more regularly than other pages, you will likely see more crawl activity on that page.

And as we explained earlier, the crawl spiders will also find links on the home page and crawl the pages they find with those links.

So, a refresh crawl (for the homepage, to check if there is any new content) can also lead to a discovery crawl if a link to a new page is found there.

One last point to understand about this is that Googlebot is capable of recognizing patterns to adjust its refresh crawl accordingly.

Google’s John Mueller explained this with the following example:

“For example, if you have a news website and you update it hourly, then we should learn that we need to crawl it hourly. Whereas if it’s a news website that updates once a month, then we should learn that we don’t need to crawl every hour.

And that’s not a sign of quality, or a sign of ranking, or anything like that. It’s really just purely from a technical point of view we have learned we can crawl this once a day, once a week, and that’s okay.”
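Google has not published how this pattern recognition works, so the following Python sketch is purely an assumption for illustration. It uses a simple rule we made up: visit a page sooner if it changed since the last visit, and wait longer if it did not. The function name next_refresh_interval and the one-hour and 30-day limits are hypothetical.

```python
from datetime import timedelta


def next_refresh_interval(current, changed,
                          min_interval=timedelta(hours=1),
                          max_interval=timedelta(days=30)):
    """Halve the wait if the page changed since the last visit, else double it."""
    proposed = current / 2 if changed else current * 2
    # Keep the result between the assumed lower and upper limits.
    return max(min_interval, min(proposed, max_interval))


# A page that changes on every visit is revisited more and more often.
busy = timedelta(days=1)
for _ in range(5):
    busy = next_refresh_interval(busy, changed=True)

# A page that never changes is revisited less and less often.
quiet = timedelta(days=1)
for _ in range(5):
    quiet = next_refresh_interval(quiet, changed=False)

print(busy, quiet)
```

Starting from one visit per day, the frequently updated page ends up at the one-hour lower limit, while the unchanged page drifts to the 30-day upper limit. As the quote says, this only describes how often a page is fetched and says nothing about its quality or ranking.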

Google does this to save crawl resources. As we mentioned earlier, crawling is a tough job and can take a lot of computational resources day after day. That is not very sustainable, especially as the Internet continues to grow.

This leads to our next point.

The future of web crawling in 2022 and beyond

In a recent episode of the Search Off the Record podcast, Google’s Search Relations team hinted that Google might reduce the web crawl rate in order to save computational resources and promote sustainability.

“Computing, in general, is not really sustainable. We are carbon-free since, I don’t even know, 2007 or something, but it does not mean that we can’t reduce even more of our footprint on the environment. And crawling is one of those things that early on, we could chop off some low-hanging fruits,” said Google’s Gary Illyes.

He further elaborated how Google might achieve this sustainability goal by reducing the refresh crawl rate.

“One thing that we do, and we might not need to do that much, is refresh crawls. Which means that once we discover a document, a URL, then we go, we crawl it, and then, eventually, we are going to go back and revisit that URL. That is a refresh crawl.

And then every single time we go back to that one URL, that will always be a refresh crawl. Now, how often do we need to go back to that URL?”

What does a reduced crawl rate mean for website owners and SEOs?

A reduced crawl rate for refresh crawls would likely slow down indexing and ranking updates for updated web pages. However, it does not necessarily mean poorer search engine rankings.

Gary Illyes confirmed during the podcast that “it is a misconception” to think “if a page gets crawled more, it will get ranked more.”

7 tips on how to improve crawling on your website

Now that you know what web crawling is and what the future of web crawling may hold, let’s briefly look at some tips that you can use to improve crawling on your website.

Update your content often. If you publish one post per week — with no other content updates across your website — Google will recognize the pattern and slow down the refresh crawl for your website, as we learned earlier.
Let Google know when your website has been updated by submitting the URL for reindexing in Google Search Console.
Build more contextually relevant links from regularly crawled websites as well as regularly crawled web pages on your site.
Spend time and effort to improve the loading speed of your website. If a website is too slow to load, website crawlers may abandon your site.
Add a sitemap and keep it updated to help Google with web crawling.
Reduce the number of orphaned pages on your website. Orphaned pages are those pages that do not have any link pointing to them.
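Reduce redirect chains. A redirect chain is when one URL redirects to another URL that redirects again, so crawlers need several hops to reach the final page.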
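
The last two tips can be checked with a small script. The Python sketch below is only a rough illustration under simplifying assumptions: the PAGES and REDIRECTS dictionaries are hypothetical stand-ins for data you would normally export from a site crawl, and the function names are our own.

```python
# Hypothetical site model: internal links per page, plus known redirects.
PAGES = {
    "/": ["/blog", "/old-pricing"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/"],
    "/pricing": [],
    "/landing-2019": [],  # nothing links here
}
REDIRECTS = {
    "/old-pricing": "/pricing-v2",
    "/pricing-v2": "/pricing",
}


def resolve(url, redirects, max_hops=10):
    """Follow redirects from url and return the whole chain, starting with url."""
    chain = [url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        if target in chain:  # stop on a redirect loop
            break
        chain.append(target)
    return chain


def find_orphans(pages, redirects):
    """Return pages that no link points to, directly or through a redirect."""
    linked = set()
    for links in pages.values():
        for link in links:
            linked.update(resolve(link, redirects))
    return sorted(page for page in pages if page not in linked)


def long_chains(pages, redirects):
    """Return the redirect chains behind internal links that take more than one hop."""
    targets = sorted({link for links in pages.values() for link in links})
    chains = [resolve(target, redirects) for target in targets]
    return [chain for chain in chains if len(chain) > 2]


orphans = find_orphans(PAGES, REDIRECTS)
chains = long_chains(PAGES, REDIRECTS)
print(orphans)
print(chains)
```

Here, /landing-2019 is reported as orphaned because no page links to it, and the link to /old-pricing is reported because it takes two redirects to reach /pricing. Linking directly to /pricing and adding a link to /landing-2019 would resolve both issues.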

We hope you found this beginner’s guide to web crawling useful. If you have any questions or comments, let us know in the comment section below.