Large Site Sitemaps: Handling 50,000+ URLs (Google's Limits Explained)

Past 50,000 URLs your sitemap breaks. Learn Google sitemap limits, sitemap index files, and the sharding strategies that survive at scale.

Indexly Team · 12 min read

Once a site passes 50,000 URLs, a single sitemap file stops working — Google rejects anything larger, and you need to split your sitemap into chunks managed by an index file. That's the easy part. The hard part is picking a sharding strategy that doesn't fall apart six months later when your catalogue has doubled.

This guide covers Google's exact limits, what actually happens when you hit them, the five realistic ways to shard a large sitemap, and the scale problems almost every tutorial skips — the ones that bite you at 200k, 2 million, or 20 million URLs.

Table of contents

  1. Google's hard limits
  2. What happens when you hit them
  3. Sitemap index files explained
  4. Five ways to shard a large sitemap
  5. Scale problems most tutorials skip
  6. Reference architecture for high-scale sitemaps
  7. FAQ

Google's hard limits

These are not guidelines. They're the limits Google's parser enforces, and exceeding them means the whole file gets rejected — not just the overflow.

Constraint                               Limit
URLs per sitemap file                    50,000
Uncompressed file size                   50 MB
Child sitemaps per sitemap index file    50,000
Sitemap index files per site             500 (submitted through Search Console)
Compression                              Gzip supported (.xml.gz)
Encoding                                 UTF-8 required

From Google Search Central: "All formats limit a single sitemap to 50MB (uncompressed) or 50,000 URLs. If you have a larger file or more URLs, you must break your sitemap into multiple sitemaps."

A sitemap index can itself reference up to 50,000 child sitemaps, and you can submit multiple index files. Multiply those out and the theoretical ceiling is 2.5 billion URLs per index — more than any site actually needs. The real question isn't whether you can scale, it's how gracefully.

The 50MB limit is the one that catches people

Most large sites hit the 50,000-URL limit before the 50MB one, because typical URL entries are short. But:

  • Sites with long URLs (UTM-tagged, heavily parameterised, or deeply nested) can hit 50MB at 30–40k URLs.
  • Sitemaps with lots of <lastmod> dates, hreflang alternates, or image/video extensions are fatter per entry.
  • URLs containing non-ASCII characters (Arabic, Chinese, Cyrillic) that get percent-encoded can triple in length.
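To see how much percent-encoding inflates a URL, here is a small sketch using Python's standard library. The Cyrillic path is a made-up example, not taken from any real site.

```python
from urllib.parse import quote

# Hypothetical Cyrillic product path, used only for illustration.
path = "/товары/лампа"

# quote() percent-encodes each UTF-8 byte as %XX and leaves "/" alone.
encoded = quote(path)

raw_bytes = len(path.encode("utf-8"))  # size if the path were stored as raw UTF-8
encoded_bytes = len(encoded)           # size as it appears in the sitemap

print(encoded)
print(f"{raw_bytes} bytes raw, {encoded_bytes} bytes percent-encoded")
```

Each Cyrillic letter takes two bytes in UTF-8 and six characters once encoded, which is where the roughly threefold growth comes from.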

Check both limits. A 47,000-URL sitemap that's 51MB is just as broken as a 51,000-URL one.
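A minimal sketch of checking both limits on one uncompressed sitemap file. The function name and messages are illustrative, and the byte limit assumes the 52,428,800-byte reading of "50MB" from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, measured uncompressed

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap_limits(xml_bytes: bytes) -> list[str]:
    """Return a list of limit violations for one uncompressed sitemap."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append(f"file is {len(xml_bytes)} bytes, over the 50 MB limit")
    # Count <url> entries directly under the <urlset> root.
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{SITEMAP_NS}url"))
    if url_count > MAX_URLS:
        problems.append(f"file lists {url_count} URLs, over the 50,000 limit")
    return problems
```

An empty list means the file passes both checks. Running this in the generation job, before the file is published, catches the problem earlier than Search Console would.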

What happens when you hit them

The failure modes, ranked by how badly they'll ruin your week:

1. Google rejects the file entirely. You'll see "Couldn't fetch" or "Parsing error" in Search Console → Sitemaps. None of the URLs in that sitemap get the discovery boost a sitemap is supposed to provide. For a 60k-URL sitemap, that's 60k pages Google is now relying on internal links alone to find.

2. Silent truncation. Some intermediaries (buggy generators, CDN transformations) silently cut off output past a byte count or row cap. When the cut still leaves well-formed XML, for example a generator that stops early but writes the closing tag anyway, nothing fails: Google reads the first 40k URLs of a 60k-URL sitemap, and the last 20k quietly don't exist in Google's view of your site. This is worse than an outright error because nothing alerts you.

3. Server timeouts. Generating a 50MB sitemap on request, every request, from a live database query — which some CMS plugins still do — OOMs your app server or times out Googlebot before the response finishes. You see "Couldn't fetch" and think Google's broken. Google's fine. Your sitemap controller isn't.

4. Partial indexing. If you split naïvely — first 50k URLs in sitemap 1, next 50k in sitemap 2 — and sitemap 2 is somewhere hard to reach (on a subdomain, behind a redirect, not listed in your index), Google may only process the first half.

The fix for all four is the same: shard properly, serve the files reliably, list them all in a sitemap index.
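One way to avoid the timeout failure mode is to write sitemap files ahead of time from a scheduled job, streaming entries out one at a time instead of building the whole document in memory. The sketch below assumes the caller supplies (loc, lastmod) pairs; the function name is hypothetical.

```python
import io
from xml.sax.saxutils import escape


def write_sitemap(stream, entries):
    """Write (loc, lastmod) pairs to a text stream one entry at a time.

    The function holds only the current entry, so its own memory use
    stays flat however many URLs the iterable yields.
    """
    stream.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    stream.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    count = 0
    for loc, lastmod in entries:
        # escape() turns &, < and > into XML entities inside <loc>.
        stream.write(f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>\n")
        count += 1
    stream.write("</urlset>\n")
    return count


# Usage with an in-memory stream; a scheduled job would open a real file instead.
buffer = io.StringIO()
written = write_sitemap(buffer, [("https://example.com/a?x=1&y=2", "2026-04-15")])
```

Returning the count lets the job compare what it wrote against what it expected, which is a cheap guard against silent truncation.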

Sitemap index files explained

A sitemap index file is a sitemap of sitemaps. Instead of listing URLs, it lists other sitemap files. You submit the index file to Google, and Google follows each reference.

The format is similar to a regular sitemap, but uses <sitemapindex> and <sitemap> tags instead of <urlset> and <url>:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/products-1.xml</loc>
    <lastmod>2026-04-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/products-2.xml</loc>
    <lastmod>2026-04-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
    <lastmod>2026-04-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/categories.xml</loc>
    <lastmod>2026-04-10</lastmod>
  </sitemap>
</sitemapindex>

A few rules most tutorials skip:

  • Each child sitemap must be on the same domain as the index file, or explicitly set up for cross-site submission in Search Console.
  • Child sitemaps must be in the same directory or lower in the site hierarchy relative to the index file.
  • The <lastmod> on each <sitemap> entry should be the most recent <lastmod> of any URL inside that child sitemap. This is how you tell Google "only re-fetch this child if something inside it changed."
  • An index can reference up to 50,000 child sitemaps, and each child sitemap can have 50,000 URLs — giving you 2.5 billion URLs per index.

Getting the index <lastmod> right is the single biggest lever for crawl efficiency on large sites. If the index correctly says "only products-3.xml has changed today," Google only needs to re-fetch products-3.xml. If the index says every child changed today (a common bug when the generator just bumps lastmod to NOW() on every regeneration), Google has no way to tell which children actually changed, re-fetches all of them, and you burn crawl budget. See "How often should your sitemap update" for the full treatment.
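As a rough sketch of that rule, the function below builds an index where each child's <lastmod> is the newest date among the URLs it contains, never the time the generator ran. The input shape (a dict of child URL to URL-level dates) is an assumption made for the example.

```python
from datetime import date
from xml.sax.saxutils import escape


def build_sitemap_index(children: dict[str, list[date]]) -> str:
    """children maps each child sitemap URL to the lastmod dates of its URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for loc, url_dates in sorted(children.items()):
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(loc)}</loc>")
        if url_dates:
            # Newest URL-level lastmod inside this child sitemap.
            lines.append(f"    <lastmod>{max(url_dates).isoformat()}</lastmod>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)


index_xml = build_sitemap_index({
    "https://example.com/sitemaps/blog.xml": [date(2026, 4, 10), date(2026, 4, 14)],
    "https://example.com/sitemaps/categories.xml": [date(2026, 4, 10)],
})
```

A child with no dated URLs gets no <lastmod> at all here, on the view that a missing value is better than an invented one.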

Five ways to shard a large sitemap

"Just split by 50,000 URLs" is a trap. Here are the five strategies that actually work, roughly ordered by how often they're the right choice:

1. Shard by content type

The most common and usually best option. Group URLs by the kind of page they are:

/sitemap.xml                   (index)
/sitemaps/products.xml         (or products-1.xml, products-2.xml, ...)
/sitemaps/categories.xml
/sitemaps/blog.xml
/sitemaps/pages.xml

Why it's good: each child sitemap's <lastmod> pattern tends to be consistent (products change often, static pages rarely), so the index's per-child <lastmod> signals are meaningful. Search Console also lets you see indexing status per sitemap, so you can tell whether your blog section or your product section has coverage issues.
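A minimal sketch of content-type sharding, assuming sections can be recognised by path prefix. The prefixes below are hypothetical and would need to match the site's real URL structure.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_URLS = 50_000

# Hypothetical path prefixes; adjust to the site's real URL structure.
SECTION_PREFIXES = {
    "/products/": "products",
    "/category/": "categories",
    "/blog/": "blog",
}


def shard_by_content_type(urls, max_urls=MAX_URLS):
    """Group URLs by section, then split each section into files of max_urls."""
    sections = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        # Anything that matches no known prefix lands in the "pages" section.
        name = next((n for p, n in SECTION_PREFIXES.items() if path.startswith(p)), "pages")
        sections[name].append(url)

    shards = {}
    for name, section_urls in sections.items():
        chunks = [section_urls[i:i + max_urls] for i in range(0, len(section_urls), max_urls)]
        for n, chunk in enumerate(chunks, start=1):
            # A single-file section keeps a plain name; larger ones get numbered.
            filename = f"{name}.xml" if len(chunks) == 1 else f"{name}-{n}.xml"
            shards[filename] = chunk
    return shards
```

This only enforces the URL-count limit. A section with long URLs or heavy extensions may still need a lower max_urls to stay under 50MB.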

2. Shard by date

Works well for news, publishers, and any site where URLs are effectively immutable once published:

/sitemaps/2026-04.xml
/sitemaps/2026-03.xml
/sitemaps/2026-02.xml

Why it's good: once a month closes, that sitemap never needs to update again, so in steady state Google only has to re-fetch the current month's sitemap. It's a poor fit for sites where old URLs still change: product pages, wiki pages, legal pages.
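Date sharding can be sketched in a few lines, assuming each URL comes with a publication date. The function name and input shape are illustrative.

```python
from collections import defaultdict
from datetime import date


def shard_by_month(published: dict[str, date]) -> dict[str, list[str]]:
    """Map each URL to a YYYY-MM.xml shard based on its publication date."""
    shards = defaultdict(list)
    for url, pub_date in published.items():
        shards[f"{pub_date:%Y-%m}.xml"].append(url)
    return dict(shards)


monthly = shard_by_month({
    "https://example.com/news/a": date(2026, 4, 2),
    "https://example.com/news/b": date(2026, 4, 20),
    "https://example.com/news/c": date(2026, 3, 9),
})
```

A very busy month can still exceed 50,000 URLs, in which case that month needs a second split (by week, or by a numeric suffix).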

3. Shard by language or region

Essential for international sites with hreflang:

/sitemaps/en-us.xml
/sitemaps/fr-fr.xml
/sitemaps/de-de.xml

Works well when your per-locale catalogs are already separated in your data model. Doesn't work if every URL has alternates in every language — then you want a single sitemap with hreflang annotations instead.

4. Shard by update frequency

Group fast-moving and slow-moving URLs separately:

/sitemaps/static.xml       (pages that rarely change)
/sitemaps/dynamic.xml      (pages that change daily+)

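The aim is for the dynamic sitemap to be re-fetched daily while the static one is checked far less often (weekly, say). The actual cadence is up to Google, but accurate <lastmod> values in the index push it in that direction. Efficient if you can cleanly categorise, which is harder in practice than it sounds.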

5. Shard by numeric partition (the fallback)

When nothing else fits:

/sitemaps/products-1.xml    (IDs 1–50000)
/sitemaps/products-2.xml    (IDs 50001–100000)
/sitemaps/products-3.xml    (IDs 100001–150000)

Fine at launch. Gets awkward when old IDs are deleted (sitemap-1 slowly becomes 80% sparse) or when you rebalance. Use this only if content-type or date sharding genuinely doesn't fit.
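The ID-range layout above reduces to one line of integer arithmetic. The sketch below also measures how sparse a shard has become; both function names are made up for the example.

```python
def shard_for_id(product_id: int, size: int = 50_000) -> str:
    """Return the shard file for a 1-based ID (IDs 1-50000 go to products-1.xml)."""
    return f"products-{(product_id - 1) // size + 1}.xml"


def fill_ratio(live_ids, shard_number: int, size: int = 50_000) -> float:
    """Fraction of a shard's ID range that still maps to a live page."""
    low = (shard_number - 1) * size + 1
    high = shard_number * size
    return sum(1 for i in live_ids if low <= i <= high) / size
```

A fill ratio that keeps falling on the oldest shards is the signal that it's time to rebalance or switch strategy.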

Which strategy to pick

Site type                   Best strategy
E-commerce                  By content type (products/categories/blog)
News / publishing           By date (monthly or weekly)
International SaaS          By language, then by content type
Wiki / docs                 By content type
User-generated content      By date, or by user-ID partition
Small multi-purpose site    By content type

Scale problems most tutorials skip

Everything up to this point is in every sitemap article. Here are the problems that only show up at scale, and that most tutorials don't mention.

Problem 1: generation takes longer than the crawl interval

If your sitemap takes 45 minutes to regenerate and you want it to update hourly, you have a bigger architecture problem than a sitemap problem. You either need incremental generation (rebuild only the child sitemaps whose content changed), a faster pipeline, or a lower cadence.
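One possible shape for incremental generation is to fingerprint each shard's entries and rewrite only the shards whose fingerprint changed since the last run. This is a sketch under that assumption; the "url|lastmod" entry format and the function name are invented for the example.

```python
import hashlib


def shards_to_rebuild(shards: dict[str, list[str]], previous_hashes: dict[str, str]):
    """Return (shard names needing a rewrite, new hash table to persist)."""
    new_hashes = {}
    changed = []
    for name, entries in shards.items():
        # Sort so that a different row order alone doesn't count as a change.
        payload = "\n".join(sorted(entries)).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        new_hashes[name] = digest
        if previous_hashes.get(name) != digest:
            changed.append(name)
    return changed, new_hashes


shards = {
    "products-1.xml": ["https://example.com/products/1|2026-04-15"],
    "blog.xml": ["https://example.com/blog/a|2026-04-14"],
}
first_run, saved = shards_to_rebuild(shards, {})      # nothing stored yet: rebuild all
second_run, _ = shards_to_rebuild(shards, saved)      # nothing changed: rebuild none
```

The hash table has to be persisted between runs (a small JSON file or a database row would do). Hashing is cheap compared with rendering and uploading tens of megabytes of XML.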

Problem 2: reading from the live database kills your app

Regenerating a 10-million-URL sitemap by scanning the live database for canonical URLs competes with production traffic. You want the sitemap generator reading from a replica, a separate search index, or an offline pipeline — not hammering the same queries your users hit.

Problem 3: CDN caching serves yesterday's sitemap

The sitemap index is small enough to serve without much caching, but child sitemaps can be 40MB each. You want them CDN-cached so Googlebot can fetch them efficiently, but then stale copies keep getting served after an update. The pattern that works: cache with a short TTL (1 hour is typical), and purge the CDN cache on every regeneration.

Problem 4: the sitemap and the index disagree

The index says products-3.xml was last modified on Tuesday. products-3.xml actually got rewritten on Friday but no one updated the parent index. Google trusts the index, doesn't re-fetch products-3.xml, and your Friday updates wait a week to be seen. Keeping the index in sync with its children is surprisingly hard without tooling that owns both.
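A consistency check for this can be fairly small. The sketch below assumes the index and its children are already loaded as XML strings and that every <lastmod> uses the same ISO format (so plain string comparison orders them correctly); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def stale_index_entries(index_xml: str, children_xml: dict[str, str]) -> list[str]:
    """List child URLs whose index <lastmod> is older than their newest URL."""
    stale = []
    for entry in ET.fromstring(index_xml).findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        index_lastmod = entry.findtext("sm:lastmod", default="", namespaces=NS)
        child = ET.fromstring(children_xml[loc])
        # Newest URL-level lastmod inside the child, or "" if none are dated.
        newest = max((el.text for el in child.findall("sm:url/sm:lastmod", NS)), default="")
        if newest > index_lastmod:
            stale.append(loc)
    return stale
```

Run as a post-generation step, a non-empty result means the index was published without picking up a child's latest changes.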

Problem 5: old URLs don't get removed

A sitemap that still lists URLs you deleted three months ago is a sitemap that tells Google "go fetch these 404s." At volume, that's thousands of wasted fetches per day. Good sitemap pipelines diff against the previous crawl and surface removed URLs — which is also useful as a broken-links-in-progress alert.
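The diff itself is plain set arithmetic, assuming each crawl produces a set of URLs. The function name is illustrative.

```python
def diff_crawls(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Compare two crawls and report which URLs appeared or disappeared."""
    return {
        "added": current - previous,
        "removed": previous - current,
    }


changes = diff_crawls(
    previous={"https://example.com/a", "https://example.com/b"},
    current={"https://example.com/b", "https://example.com/c"},
)
```

Everything in "removed" should be dropped from the sitemap on the next generation, and is worth a look in case the removal wasn't intentional.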

Problem 6: the sitemap becomes the bottleneck for new features

Adding a new section to your site means adding it to the sitemap generator, testing the generator, deploying the generator. If the generator is a custom Laravel command, a Rails task, or a build-time Next.js plugin, that's real work every time. Hosted sitemap tools that re-crawl the live site pick up new sections automatically — no code change.

Reference architecture for high-scale sitemaps

If you're building this from scratch for a site with 100k+ URLs, the pattern that consistently works:

  1. External crawler — a service that crawls your site on a schedule, from infrastructure that isn't your app server. (Indexly does this; so can a properly-configured external cron running a headless crawler.)
  2. Sharded output — child sitemaps grouped by content type, each under 50k URLs and 50MB.
  3. Index file with accurate per-child <lastmod> — so Google only re-fetches children that actually changed.
  4. CDN-backed delivery — sitemaps served from a CDN, not your app. Indexly serves from Cloudflare R2 via the Cloudflare CDN for exactly this reason.
  5. Diff tracking — each crawl is compared against the previous one so you know which URLs were added, removed, or changed. This doubles as an early-warning system for broken deploys.
  6. Alerting on anomalies — sudden 30% drops in URL count usually mean a deploy broke routing, not that your content actually disappeared.
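The anomaly check in step 6 can start out as a single comparison between consecutive crawls. The 30% threshold below mirrors the figure above; the function name and default are otherwise assumptions.

```python
def url_count_anomaly(previous_count: int, current_count: int, max_drop: float = 0.30) -> bool:
    """Return True when the URL count fell by max_drop or more since the last crawl."""
    if previous_count == 0:
        return False  # nothing to compare against on the first crawl
    drop = (previous_count - current_count) / previous_count
    return drop >= max_drop
```

When this fires, a sensible response is to hold back the new sitemap and keep serving the previous one until someone has confirmed the drop is real.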

This is the architecture Indexly implements out of the box: the Agency plan handles up to 100,000 pages per crawl, and Enterprise is unlimited. For sites in this range that don't want to build and maintain the pipeline above, this is one of the few cases where the buy-vs-build decision lands firmly on "buy." For full background on how cadence fits into this, see "How often should your sitemap update" and "Crawl budget explained."

FAQ

What are the sitemap limits for large sites?

Google enforces two hard limits per sitemap file: 50,000 URLs and 50 MB uncompressed, whichever you hit first. Past either, you need multiple sitemap files managed by a sitemap index file. The index itself can reference up to 50,000 child sitemaps, so practical scale is effectively unlimited.

How do I make a sitemap for 10,000 pages?

A single sitemap file works fine up to 50,000 URLs, so 10,000 pages easily fits in one file. Focus on accuracy — only indexable canonical URLs, correct <lastmod> dates, no redirects or 404s — and keep the file updated automatically. You don't need a sitemap index file at this scale; just a reliable update pipeline.

What's a sitemap index file and when do I need one?

A sitemap index is a file that lists other sitemap files. You need one whenever you have more than 50,000 URLs or more than 50 MB of sitemap content. Even below those limits, a sitemap index is useful for splitting content by type (products, blog, categories) so you can diagnose indexing issues per section in Google Search Console.

Should I gzip-compress large sitemaps?

Yes. Gzip typically cuts sitemap size by 70–80%, saves bandwidth, and speeds up Googlebot fetches. Google accepts .xml.gz files — just reference them with the .xml.gz extension in your sitemap index. The 50 MB limit is on the uncompressed size, so gzip doesn't let you pack more URLs per file, but it still helps performance.
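A quick way to measure the saving on your own data, using only the standard library. The sample URLs are synthetic, so the ratio printed here says nothing about a real sitemap.

```python
import gzip

# Synthetic sample: 10,000 short, similar-looking product URLs.
entries = "".join(
    f"<url><loc>https://example.com/products/{i}</loc></url>" for i in range(10_000)
)
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    f"{entries}</urlset>"
).encode("utf-8")

compressed = gzip.compress(xml_bytes)
saving = 1 - len(compressed) / len(xml_bytes)

print(f"uncompressed: {len(xml_bytes)} bytes (this is what the 50 MB limit measures)")
print(f"gzipped:      {len(compressed)} bytes ({saving:.0%} smaller)")
```

The limit check still has to run against len(xml_bytes), not the gzipped size.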

How should I split sitemaps for an e-commerce site with 500,000 products?

Shard by content type first: a separate sitemap for products, categories, blog posts, and static pages. Within products, either shard by category (more meaningful signals) or by product ID ranges (simpler). With 500k products you'll need at least 10 child product sitemaps (500,000 ÷ 50,000), and more if you shard by category, since categories rarely fill a file exactly. All of them go in a single sitemap index file submitted to Search Console.

Can a sitemap have more than 50,000 URLs if I compress it?

No. The 50,000-URL limit is per file and independent of compression. The 50 MB limit is measured on the uncompressed size, so compression doesn't let you fit more into a file either; it only makes the file smaller on the wire. Past 50,000 URLs, you split the sitemap into multiple files and manage them through an index.


The bottom line

Big sitemaps are a solved problem at the XML-format level — the 50k / 50MB limits are well-documented and sitemap index files handle scale cleanly. Where large sites actually break is in the infrastructure around the sitemap: the generator timing out, the index falling out of sync with its children, the CDN serving stale copies, and the slow accretion of 404s nobody notices until a coverage report in Search Console shows 12% of your URLs are gone.

If your site is past 50,000 URLs and you'd rather not build and babysit this pipeline, try Indexly free. Paid plans scale to 100,000 pages per crawl on Agency, and Enterprise is unlimited — sharded, indexed, CDN-delivered, diffed against the previous crawl, with alerts when something unexpected changes. Every page found. Every page indexed.
