
Crawl Budget Explained: How to Stop Wasting Google's Time

What crawl budget is, whether your site actually has a problem, and the real fixes for large and fast-changing sites.

Indexly Team · 12 min read


Crawl budget is the amount of time and resources Google is willing to spend crawling your site in a given period — and for most sites, it's not something you need to worry about. For the ones that do need to worry about it, getting it wrong means new pages take weeks to index, old pages go stale, and Google quietly gives up on sections of your site.

This guide covers what crawl budget actually is, how to tell whether you have a real problem (most sites don't), the two factors Google's own documentation says determine it, and the actual playbook for reclaiming wasted crawl.

Table of contents

  1. What crawl budget actually is
  2. Does your site even have a crawl budget problem?
  3. The two factors: crawl capacity and crawl demand
  4. What wastes crawl budget
  5. How to diagnose with GSC Crawl Stats
  6. The crawl budget optimization playbook
  7. Where sitemaps fit in
  8. FAQ

What crawl budget actually is

Crawl budget is Google's term for how much of your site Googlebot is willing and able to crawl. From Google Search Central: "Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl."

Two key words: can and wants to. If Google can't crawl (your server is slow or erroring) or doesn't want to (your content doesn't seem worth fetching), crawl budget shrinks. How fast new pages appear in search, how quickly content updates get picked up, and how reliable your coverage reports are all follow from that.

"Budget" is slightly misleading. There's no dashboard where Google shows you a daily allowance. It's the behavior that emerges from Googlebot's rate-limiting and priority systems over time.

Does your site even have a crawl budget problem?

This is the honest question most articles skip past. Google's own position, from their 2017 Search Central blog post and still current: "crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on."

A plain-language version of when you actually need to think about this:

You probably don't need to worry about crawl budget if:

  • Your site has fewer than ~10,000 URLs.
  • New pages get indexed within a few days.
  • Your traffic is stable and Search Console coverage looks healthy.
  • You don't see meaningful "Discovered – currently not indexed" counts for pages you care about.

You probably do need to worry about crawl budget if:

  • You run a large e-commerce catalog (100,000+ products).
  • You publish news or frequently-updated content and ranking speed matters.
  • You have a site with lots of faceted navigation, filters, or parameterized URLs.
  • You have a backlog of "Discovered – currently not indexed" URLs that's growing.
  • New pages are taking more than a week to get indexed and there's no other obvious cause.

If you're in the first bucket, spending time on crawl budget optimization is almost pure opportunity cost. Fix internal linking, write better content, clean up your sitemap — those levers matter much more. If you're in the second bucket, read on.

The two factors: crawl capacity and crawl demand

Google's model has two inputs, and optimizing either one without considering the other rarely works.

Crawl capacity limit

The ceiling Googlebot enforces to avoid hammering your server. It goes up when:

  • Your server responds quickly and consistently (sub-200ms target).
  • Your error rate (5xx, timeouts) stays low.
  • You aren't signalling Googlebot to slow down (the legacy Search Console crawl-rate limiter was retired in early 2024, so there is no longer a setting to check here).

It goes down when:

  • Your server is slow to respond.
  • Googlebot hits a lot of 5xx errors or timeouts.
  • DNS or TLS issues intermittently break fetches.

Capacity is a technical problem with a technical fix: make your server faster and more reliable. A CDN in front of the site, a well-tuned cache layer, and an honest audit of how long /sitemap.xml and popular pages take to return will cover most capacity issues.
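The audit mentioned above can start as a few timed fetches. Here is a minimal Python sketch, assuming you run it from a machine outside your own network. The function names are illustrative, and the 200/500/1,000ms cut-offs are the rules of thumb used in this article, not figures published by Google.

```python
import statistics
import time
import urllib.request


def time_fetch_ms(url: str, timeout: float = 10.0) -> float:
    """Return wall-clock milliseconds taken to fetch the full response body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.perf_counter() - start) * 1000


def grade_response_times(samples_ms: list[float]) -> str:
    """Grade the median of several samples against rule-of-thumb thresholds."""
    median = statistics.median(samples_ms)
    if median < 200:
        return "excellent"
    if median < 500:
        return "ok"
    if median < 1000:
        return "slow"
    return "likely throttling crawl"


# Usage (hypothetical URL, replace with your own sitemap and popular pages):
#   samples = [time_fetch_ms("https://example.com/sitemap.xml") for _ in range(5)]
#   print(grade_response_times(samples))
```

Timings from your own machine only approximate what Googlebot sees, so treat the result as a first pass and confirm it against the average response time in the Crawl Stats report.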

Crawl demand

How much of your site Google actually wants to crawl. It goes up when:

  • Your pages are high-quality and frequently referenced by other sites.
  • Content genuinely changes (accurate <lastmod> signals, not bumped timestamps).
  • New high-value pages appear.
  • Your site is known to Google as an authoritative source in its niche.

It goes down when:

  • Large parts of your site are low-value or duplicative.
  • Historical crawls produced a lot of soft 404s, redirects, or thin pages.
  • Your <lastmod> dates have been unreliable and Google has stopped trusting them.

Demand is harder to raise than capacity because it's downstream of quality and authority. This is the part of crawl budget that's largely not a technical problem — it's an editorial one.

The common mistake

Teams spend months making their server 40ms faster and see no improvement in indexing speed. Why? Because their crawl issue was demand-side, not capacity-side. Google wasn't failing to fetch pages; Google just didn't want to.

Before optimizing capacity, diagnose which side your problem is on. Use the Crawl Stats report.

What wastes crawl budget

Concretely, the patterns that burn crawl without producing indexed pages:

1. URL parameter proliferation. ?utm_source=, ?sessionid=, ?sort=price-asc: each unique URL Googlebot sees is a potential fetch. Tracking, session, and sort parameters can multiply one page into many URLs that serve functionally the same content.

2. Faceted navigation on e-commerce. Colour × size × brand × price-range = thousands of URL variations. Google crawls them, sees near-duplicate content, doesn't index, tries again later. Waste.

3. Internal search result pages. Every search query a user makes can produce a crawlable URL. Most sites have no business letting Google index these.

4. Soft 404s. Pages that return 200 OK but display "not found" messaging. Google crawls, evaluates, discards, and re-crawls later hoping for better. Covered in why Google isn't indexing your pages.

5. Redirect chains. A → B → C → D is four fetches to index one page. Long chains are crawl-budget dead weight.

6. Infinite calendar/pagination. /?page=847 for a 50-post blog. /events/2099/01/ for a calendar widget with no end. These can generate effectively unlimited crawlable URLs.

7. Stale URLs in your sitemap. If your sitemap still contains 5,000 URLs that now 404, Googlebot is spending budget fetching dead pages because you told it to.

8. Duplicate content across URL variants. Same product on /products/widget-a/ and /shop/products/widget-a/. Without canonicals, both get crawled indefinitely.

9. JavaScript-heavy pages. Pages that require rendering to extract content cost Googlebot more resources per fetch than static HTML. Not a reason to panic, but at scale it matters.

10. Blocked rendering resources. Googlebot obeys robots.txt, so it won't fetch CSS and JS files you disallow, but it then can't render the pages that depend on them. The fetches it does spend on those pages yield content it can't fully evaluate. Covered in sitemap vs robots.txt.
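To see how much of a URL inventory is parameter noise, the parameter and duplicate patterns above can be measured by collapsing crawled URLs to a normalized form and counting the variants. This is a rough Python sketch. The ignored parameter names are assumptions taken from the examples in this article, so substitute whatever your own site treats as content-neutral.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page is about on this site.
IGNORED_PARAMS = {"sessionid", "sort"}
IGNORED_PREFIXES = ("utm_",)


def normalize(url: str) -> str:
    """Drop ignored query parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS and not key.startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls: list[str]) -> dict[str, list[str]]:
    """Map each normalized URL to the crawled variants that collapse into it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return dict(groups)
```

Feeding this a list of URLs from server logs or a crawler export shows which pages have the most variants, which is usually where canonical tags or robots.txt rules pay off first.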

How to diagnose with GSC Crawl Stats

The Crawl Stats report (Settings → Crawling → Crawl stats) is the best free diagnostic Google gives you. Four things to look at:

1. Total crawl requests over time

If the trend line is rising, Google is willing to crawl more of your site. Falling means capacity or demand is shrinking. Flat is fine.

2. Average response time

The single best capacity-side signal. If this creeps above 1,000ms, expect crawl rate to throttle. Aim for under 500ms; under 200ms is excellent. Rising response times usually mean your app, database, or CDN is under pressure.

3. Crawl requests by response code

  • 2xx dominating: healthy.
  • Lots of 3xx: redirect chains or a redirect-heavy deploy. Investigate.
  • Lots of 4xx: broken internal links or stale sitemap entries. Clean up.
  • 5xx showing up: your server is erroring on Googlebot's requests. This throttles crawl capacity hard. Fix immediately.
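If you copy the per-status request counts out of the report, the triage above can be written down as a small check. This Python sketch is only illustrative. The function names are hypothetical, and the 10% and 1% thresholds are assumptions chosen for the example, not limits documented by Google.

```python
def status_class_shares(counts: dict[int, int]) -> dict[str, float]:
    """Collapse per-status-code request counts into 2xx/3xx/4xx/5xx shares."""
    shares = {"2xx": 0.0, "3xx": 0.0, "4xx": 0.0, "5xx": 0.0}
    total = sum(counts.values())
    if total == 0:
        return shares
    for code, count in counts.items():
        key = f"{code // 100}xx"
        if key in shares:
            shares[key] += count / total
    return shares


def triage(
    counts: dict[int, int],
    warn_share: float = 0.10,   # assumed threshold for 3xx and 4xx
    error_share: float = 0.01,  # assumed threshold for 5xx
) -> list[str]:
    """Return findings ordered by urgency, server errors first."""
    shares = status_class_shares(counts)
    findings = []
    if shares["5xx"] > error_share:
        findings.append("5xx: server errors are throttling crawl capacity, fix first")
    if shares["3xx"] > warn_share:
        findings.append("3xx: look for redirect chains or a redirect-heavy deploy")
    if shares["4xx"] > warn_share:
        findings.append("4xx: clean up broken internal links and stale sitemap entries")
    return findings
```

For example, `triage({200: 900, 301: 50, 404: 150, 503: 20})` flags the 5xx and 4xx shares but not the 3xx share, because redirects make up under 10% of requests.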

4. Crawl requests by purpose (Discovery vs. Refresh)

  • Discovery = Googlebot fetching URLs it hasn't seen before.
  • Refresh = Googlebot re-fetching known URLs to check for changes.

If refresh dwarfs discovery on a site that's adding content, Google is spending budget re-crawling old pages instead of finding new ones. That usually means your internal linking and sitemap aren't surfacing new content effectively. If discovery is high but indexing isn't keeping pace, you probably have quality issues — see the "Crawled – currently not indexed" bucket in your Pages report.

The crawl budget optimization playbook

If diagnosis points to a real problem, work through these in order. The steps are ordered by impact per effort, so start at the top:

Step 1: Fix server reliability

Get 5xx errors to near zero. Get average response time under 500ms. These two things alone often unlock meaningful crawl capacity on sites that didn't previously have it. Cheap wins: CDN in front of your app, proper caching headers, database query review for high-traffic pages.

Step 2: Block genuine crawl traps in robots.txt

Internal search results, session-ID variants, faceted-nav combinations, infinite pagination — anything Google has no business crawling. Be surgical: don't block CSS or JavaScript, and don't block anything that's actually indexable.

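User-agent: *
# Internal search results
Disallow: /search
# Session-ID and tracking parameters on any path, in any query position
Disallow: /*?sessionid=
Disallow: /*&sessionid=
Disallow: /*?utm_
Disallow: /*&utm_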
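Disallow rules match from the start of the URL path, so a rule such as /?utm_ only covers the homepage, and the * wildcard is needed to catch a parameter on any path. Before deploying rules it helps to test them against real URLs. The Python sketch below is a simplified matcher for illustration: it handles only Disallow patterns with * and a trailing $, and it ignores Allow rules and longest-match precedence, which a full robots.txt parser has to implement.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored_end = rule.endswith("$")
    body = rule[:-1] if anchored_end else rule
    # '*' matches any run of characters; everything else is literal.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored_end else ""))


def is_disallowed(path_and_query: str, disallow_rules: list[str]) -> bool:
    """True if any Disallow pattern matches the path plus query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in disallow_rules)


RULES = ["/search", "/*?sessionid=", "/*&sessionid=", "/*?utm_", "/*&utm_"]

# A tracking parameter on a product page is caught by the wildcard rule...
print(is_disallowed("/products/widget-a?utm_source=x", RULES))
# ...but not by a rule that only matches from the site root.
print(is_disallowed("/products/widget-a?utm_source=x", ["/?utm_"]))
```

Treat this as a quick sanity check, and confirm the deployed file against the robots.txt report in Search Console.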

Step 3: Fix redirect chains

A chain of more than one hop is almost always accidental. Audit with Screaming Frog or equivalent; flatten every A → B → C → D down to A → D directly. Re-deploy the redirect map.
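Flattening is mechanical once the redirect map is in one place. This Python sketch assumes the map can be exported as a source-to-target dictionary. The function name is illustrative and is not part of Screaming Frog or any other tool.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Point every source URL straight at its final destination."""
    flattened = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until we reach a URL that does not redirect.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened


chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(flatten_redirects(chain))  # every source now maps directly to /d
```

The loop check matters in practice: old redirect maps often contain cycles, and failing loudly is safer than deploying them.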

Step 4: Clean up the sitemap

Every URL in your sitemap should be live, canonical, and indexable. Each 404, redirect, or noindex URL in the file sends Googlebot to a page you don't want indexed and gives Google less reason to trust the sitemap as a guide to what matters. The scale challenge is keeping this clean as your site grows, which is covered next.
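The "live, canonical, and indexable" rule is easy to express as a filter over crawl results. Here is a minimal Python sketch, assuming your crawler can export the status code, canonical target, and noindex flag for each URL. The CrawledPage record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledPage:
    url: str
    status: int
    canonical: Optional[str] = None  # None means no canonical tag was found
    noindex: bool = False


def belongs_in_sitemap(page: CrawledPage) -> bool:
    """Keep only pages that are live, self-canonical, and indexable."""
    is_live = page.status == 200
    is_canonical = page.canonical is None or page.canonical == page.url
    return is_live and is_canonical and not page.noindex


def clean_sitemap(pages: list[CrawledPage]) -> list[str]:
    """Return the URLs that should remain in the sitemap, in input order."""
    return [page.url for page in pages if belongs_in_sitemap(page)]
```

Running this on each crawl and diffing the output against the published sitemap gives a list of entries to drop.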

Step 5: Consolidate duplicate content

Canonical tags on parameter variants, 301s on deprecated URLs, noindex on pages that genuinely shouldn't be in search (thin tag pages, empty category pages). Every duplicate you remove frees up budget for the URLs that matter.

Step 6: Raise perceived quality

The demand-side lever. Prune pages that aren't pulling their weight, expand thin content on pages that have potential, improve internal linking from strong pages to weaker ones. This is slow — expect months to see movement, not days. But it's the only way to meaningfully raise crawl demand.

Where sitemaps fit in

A clean, accurate sitemap helps crawl budget in two specific ways:

1. It tells Google which pages you actually care about. Without a sitemap, Google reconstructs your URL inventory by crawling links. With a sitemap, Google gets an authoritative list — and can spend crawl budget on those URLs instead of guessing.

2. <lastmod> dates direct refresh crawls. When Google knows which pages genuinely changed recently, it can prioritize refresh-crawling those instead of re-fetching everything. This is why accurate <lastmod> matters more than almost any other sitemap field.

The ways a sitemap hurts crawl budget:

  • Stale URLs in the sitemap → Googlebot fetches dead pages.
  • Inflated <lastmod> dates → Googlebot re-fetches unchanged pages.
  • Duplicates and noindex pages listed → Googlebot wastes capacity on URLs you don't want indexed anyway.
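One way to avoid inflated <lastmod> dates is to move the date only when a fingerprint of the page content changes. The Python sketch below illustrates that idea under simple assumptions. It is not a description of how any particular tool works, and the function names are hypothetical.

```python
import hashlib
from datetime import date


def content_fingerprint(html: str) -> str:
    """Hash the page content so that changes can be detected between crawls."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_lastmod(
    previous_hash: str,
    previous_lastmod: str,
    html: str,
    today: date,
) -> tuple[str, str]:
    """Return (hash, lastmod); lastmod moves only when the content hash changes."""
    current_hash = content_fingerprint(html)
    if current_hash == previous_hash:
        return current_hash, previous_lastmod
    return current_hash, today.isoformat()
```

Hashing raw HTML will also flag cosmetic changes such as rotating tokens or footer timestamps, so in practice you would likely hash only the extracted main content.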

Keeping a sitemap clean at scale is the dull but important maintenance work most teams stop doing after setup. Indexly automates this specifically: each crawl diffs against the previous one, automatically removes URLs that no longer return 200, and only updates <lastmod> on URLs where the content actually changed. On large sites this can reclaim meaningful crawl budget that was previously burnt on sitemap drift. More in sitemaps for large sites and how often should your sitemap update.

FAQ

What is crawl budget in SEO?

Crawl budget is the amount of time and resources Google is willing to spend crawling your site within a given period. It's determined by two factors: how much your server can handle (crawl capacity) and how much Google actually wants to crawl your site (crawl demand). Sites under ~10,000 URLs rarely need to worry about it.

Does my site have a crawl budget problem?

Probably not. Google has officially said most sites don't need to worry about crawl budget. You have a real problem if your site has 100,000+ URLs, you're adding content faster than Google indexes it, or you see a growing "Discovered – currently not indexed" backlog in Search Console. Under those conditions, crawl budget optimization can move the needle meaningfully.

How do I check my crawl budget in Search Console?

Go to Settings → Crawling → Crawl stats. The report shows total crawl requests, average response time, crawl status codes, and requests broken down by purpose (discovery vs. refresh). Rising response times or lots of 5xx errors mean you're losing crawl capacity. Refresh dominating discovery on a growing site means Google isn't finding new content efficiently.

How do I increase my crawl budget?

Two levers: raise crawl capacity (make your server faster, fix errors, shorten redirect chains) and raise crawl demand (improve content quality, consolidate duplicates, earn external links, keep sitemap and <lastmod> dates accurate). Capacity improvements happen within weeks; demand improvements take months. There's no setting in Search Console that lets you request more crawl budget directly.

Does blocking pages with robots.txt save crawl budget?

Yes, when used carefully. Disallow: rules prevent Googlebot from fetching URLs at all, freeing that budget for pages that matter. Use it for internal search, parameter variants, session IDs, and faceted nav combinations. Don't use it for pages you want deindexed — use noindex meta tags for those, because Google needs to crawl the page to see the tag.

Can a sitemap affect crawl budget?

Yes. A clean sitemap directs Googlebot to URLs you care about and uses accurate <lastmod> dates to prioritize refresh crawling. A stale or bloated sitemap wastes budget by pointing Google at 404s, redirects, and URLs that haven't actually changed. On large sites, sitemap hygiene is one of the highest-impact crawl-budget levers available.


The bottom line

Most sites don't have a crawl budget problem, and the ones that do usually have two problems at once: a capacity issue (slow server, lots of errors) and a demand issue (low-quality content, unreliable sitemap, mountains of duplicate URLs). Neither alone explains the indexing lag — the fix has to address both.

If you suspect sitemap drift is part of your crawl budget problem, try Indexly free. Automatic re-crawls keep your sitemap clean; diff tracking catches URL bloat before it becomes thousands of stale entries; and hosted delivery takes the sitemap off your app server entirely. Every page found. Every page indexed.
