Overview

A single sitemap.xml works until the site grows past 50,000 URLs or starts publishing news, images, and video that need different metadata. The sitemap-index pattern, per-format sitemaps, and accurate <lastmod> values separate a site Google crawls efficiently from one that wastes crawl budget on stale pages. This page covers the splitting strategy, the field-by-field rules, and the common errors that produce silent indexing failures.

Split into a sitemap index past 50,000 URLs

A single sitemap file holds at most 50,000 URLs or 50 MB uncompressed. Past either limit, switch to a sitemap-index file that references multiple child sitemaps.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/pages-1.xml</loc>
    <lastmod>2026-05-15T10:00:00-07:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/pages-2.xml</loc>
    <lastmod>2026-05-15T10:00:00-07:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/news.xml</loc>
    <lastmod>2026-05-15T10:00:00-07:00</lastmod>
  </sitemap>
</sitemapindex>

Submit the index URL once in Search Console; the engines walk the index and fetch each child sitemap on their own schedule. Reference the index file from robots.txt.
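
The split can be automated at build time. The following is a minimal Python sketch, assuming the build already has a flat list of canonical URLs; the function names chunk_urls and build_index are illustrative, not part of any particular tool.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(children):
    """Return sitemap-index XML as a string.

    `children` is a list of (loc, lastmod) tuples, one per child sitemap.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in children:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Each chunk becomes one child file (pages-1.xml, pages-2.xml, and so on), and each child's <lastmod> in the index should reflect the newest change inside that child.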

Ship separate sitemaps for pages, news, images, and video

Splitting by format lets each engine fetch only the formats it consumes, keeps file sizes manageable, and lets each sitemap follow the rules its format requires.

  • pages.xml: every canonical content URL.
  • news.xml: news articles published in the last 48 hours (see below).
  • images.xml: image entries with parent URL, caption, and license metadata.
  • videos.xml: video entries with thumbnail, duration, and player URL (see video-seo).
  • products.xml: e-commerce sites can split product URLs from editorial content.

The index file lists all of them. Google Search Console reports indexing status per child sitemap, which makes diagnosing format-specific drops faster.

Set <lastmod> from real content changes, not deploy time

<lastmod> is the single most useful sitemap field and the one most often broken. Google uses it to decide whether to recrawl a URL. Bumping <lastmod> on every deploy teaches Google to ignore the field; setting it accurately keeps the crawl budget aimed at pages that actually changed.

  • Read <lastmod> from the page’s last_updated frontmatter or from the git log of the source file.
  • Never set it to the build timestamp.
  • Use ISO 8601 with a timezone offset: 2026-05-15T09:00:00-07:00.
  • A page that has not changed in two years carries a <lastmod> from two years ago. That is correct.

Google’s John Mueller confirmed in 2023 that consistently inaccurate <lastmod> causes Google to discount the field entirely; the fix is restoring accuracy and waiting weeks for trust to rebuild.
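
One way to wire up the rules above is sketched below, assuming frontmatter has already been parsed into a dict and the source files live in a git checkout. The helper names are hypothetical; the only external call is `git log -1 --format=%cI`, which prints the committer date of the last commit in strict ISO 8601.

```python
import subprocess
from datetime import datetime


def git_last_commit_date(path):
    """Return the committer date of the last commit touching `path`, or None."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        capture_output=True, text=True, check=False,
    )
    stamp = result.stdout.strip().replace("Z", "+00:00")
    return datetime.fromisoformat(stamp) if stamp else None


def resolve_lastmod(frontmatter, path, git_date=git_last_commit_date):
    """Prefer the page's last_updated value; fall back to the git log.

    Never falls back to the build time. Returns an ISO 8601 string with a
    timezone offset, or None when no trustworthy date exists.
    """
    value = frontmatter.get("last_updated")
    if value is None:
        value = git_date(path)
    if value is None:
        return None  # omit <lastmod> rather than invent a date
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    if value.tzinfo is None:
        raise ValueError(f"lastmod for {path} has no timezone offset")
    return value.isoformat(timespec="seconds")
```

Returning None when neither source has a date is a deliberate choice in this sketch: a missing <lastmod> is less harmful than an inaccurate one.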

Skip <priority> and <changefreq> for Google; keep them for Bing

<priority> (0.0 to 1.0) and <changefreq> (always, hourly, daily, weekly, monthly, yearly, or never) are part of the sitemap spec, but Google ignores both. Bing still reads them as advisory signals.

  • Setting every URL to <priority>1.0</priority> makes the field meaningless.
  • Honest priority values map to crawl prioritization on Bing.
  • <changefreq>monthly</changefreq> on a daily news page wastes the signal; align the value with reality.
  • If the build tool emits these fields, leaving them in is fine; if you have to choose where to spend effort, skip them and focus on <lastmod>.

Gzip sitemaps over 10 MB

The 50 MB limit is uncompressed. Files over 10 MB should be served gzipped to reduce crawler bandwidth and fetch latency.

  • Save as pages-1.xml.gz.
  • Set Content-Type: application/xml and Content-Encoding: gzip.
  • Both limits still apply to the uncompressed file: 50,000 URLs and 50 MB. Gzip reduces transfer size only; it does not let a sitemap hold more.
  • Sitemap-index files themselves are small; do not gzip the index.
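
A minimal sketch of the compression step in Python, assuming the sitemap XML is already in memory as a string. The function name is illustrative. The size check runs before compression because the limit is measured on the uncompressed bytes.

```python
import gzip

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 50 MB, measured before gzip


def gzip_sitemap(xml_text):
    """Return gzip-compressed bytes for a sitemap.

    Raises ValueError if the uncompressed file is over the size limit,
    since compressing it would not make it valid.
    """
    raw = xml_text.encode("utf-8")
    if len(raw) > MAX_UNCOMPRESSED_BYTES:
        raise ValueError("sitemap exceeds 50 MB uncompressed; split it instead")
    return gzip.compress(raw)
```

Write the returned bytes to pages-1.xml.gz and reference that filename from the index.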

News sitemap rules: last 48 hours, under 1000 URLs

The News sitemap format is separate from the regular sitemap and serves the Google News index.

  • Include only articles published in the last 48 hours. Older articles get removed; including them does nothing.
  • Limit to 1000 URLs per news sitemap. Above that, split into multiple files.
  • Required fields per entry: news:publication (name and language), news:publication_date (ISO 8601), news:title.
  • Regenerate every time a new article publishes; ping IndexNow on every regeneration.
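
The window and size rules can be applied in one pass at regeneration time. This sketch assumes each article is a dict with a timezone-aware "published" datetime; that shape and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

NEWS_WINDOW = timedelta(hours=48)
MAX_NEWS_URLS = 1000


def select_news_entries(articles, now=None):
    """Return the articles for the news sitemap, split into files.

    Keeps only articles published within the last 48 hours, newest first,
    and groups them into lists of at most 1000 entries (one list per file).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - NEWS_WINDOW
    recent = [a for a in articles if cutoff <= a["published"] <= now]
    recent.sort(key=lambda a: a["published"], reverse=True)
    return [
        recent[i:i + MAX_NEWS_URLS]
        for i in range(0, len(recent), MAX_NEWS_URLS)
    ]
```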

See news-seo for the Publisher Center setup and NewsArticle schema rules.

Image sitemap: one entry per image, parent URL pinned

The Image sitemap format extends the regular sitemap with image entries nested under each URL.

<url>
  <loc>https://example.com/products/trail-runner-3</loc>
  <image:image>
    <image:loc>https://example.com/img/tr3-black.jpg</image:loc>
    <image:caption>Acme Trail Runner 3 in black</image:caption>
    <image:license>https://example.com/licenses/product-images</image:license>
  </image:image>
</url>

  • One <image:image> entry per image; up to 1000 images per URL.
  • The parent <loc> is the page that hosts the image, not the image URL itself.
  • Only image:loc is required. Google deprecated image:caption and image:license in 2022 and no longer reads them for Image Search; keep them only if another consumer of the file uses them.

See image-seo for the on-page image SEO rules.
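
The fragment above omits the namespace declarations that a full file needs on its <urlset> element. A minimal Python sketch that emits them, assuming a mapping from page URL to a list of image URLs (the function name and input shape are illustrative):

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
MAX_IMAGES_PER_URL = 1000

# Serialize the sitemap namespace as the default and the image one as "image:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_sitemap(pages):
    """Return image-sitemap XML for a dict of {page_url: [image_url, ...]}."""
    root = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        if len(image_urls) > MAX_IMAGES_PER_URL:
            raise ValueError(f"{page_url} lists more than 1000 images")
        url = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        # The parent <loc> is the hosting page, not the image file.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```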

Reference every sitemap from robots.txt

robots.txt is the discovery point for sitemap URLs. List the index file, or each sitemap file individually if no index is used.

# https://example.com/robots.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap-index.xml

Submit the index in Google Search Console and Bing Webmaster Tools once; both engines refetch on their own afterwards.
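
A build or CI step can confirm the directive is present. The sketch below uses the standard library's urllib.robotparser (its site_maps() method needs Python 3.8 or later) on a robots.txt body that has already been read into a string; the wrapper function name is illustrative.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```

Failing the build when the returned list does not contain the index URL catches a missing or mistyped Sitemap line before it ships.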

Ping IndexNow on every sitemap change

IndexNow is the push-based notification protocol that Bing and Yandex (and several smaller engines) honor. On every sitemap regeneration, POST the list of changed URLs to the IndexNow endpoint.

  • One API call per build; the protocol accepts up to 10,000 URLs per request.
  • Google does not consume IndexNow yet, but sending the ping costs nothing.
  • See indexnow for the full protocol and authentication key setup.
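
As a rough illustration, the request can be built with the Python standard library. The sketch assumes the shared endpoint api.indexnow.org and the JSON body fields host, key, keyLocation, and urlList; the key value and key file location are placeholders that come from your own IndexNow setup.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
MAX_URLS_PER_REQUEST = 10_000


def build_indexnow_requests(host, key, key_location, changed_urls):
    """Build one POST request per batch of up to 10,000 changed URLs.

    Nothing is sent here; pass each returned request to
    urllib.request.urlopen() to submit it.
    """
    requests = []
    for i in range(0, len(changed_urls), MAX_URLS_PER_REQUEST):
        payload = {
            "host": host,
            "key": key,
            "keyLocation": key_location,
            "urlList": changed_urls[i:i + MAX_URLS_PER_REQUEST],
        }
        requests.append(urllib.request.Request(
            INDEXNOW_ENDPOINT,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json; charset=utf-8"},
            method="POST",
        ))
    return requests
```

Most builds change far fewer than 10,000 URLs, so this normally yields a single request. Sending only the URLs whose content changed, rather than the whole sitemap, keeps the signal meaningful.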

Common errors

  • Every deploy bumps <lastmod> to the build timestamp. Google stops trusting the field; the entire sitemap loses crawl-prioritization value.
  • Sitemap includes noindex pages. The crawler fetches them, finds the noindex, and the page is excluded; you have spent crawl budget on URLs that cannot rank.
  • Sitemap URLs do not match the canonical. The page’s canonical points one place, the sitemap lists another; Google picks one and ignores the other.
  • News sitemap full of week-old articles. The format requires the 48-hour window; older entries do nothing.
  • Single sitemap.xml past 50,000 URLs. The crawler truncates; later URLs never appear in the index.
  • Sitemap not referenced from robots.txt. New sites take longer to be discovered; submit in Search Console as a workaround but fix robots.txt.
  • HTTP URLs in the sitemap when the site is on HTTPS. The crawler treats them as new URLs, follows the 301 to HTTPS, and burns crawl budget. List the canonical HTTPS URLs directly.
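
Several of these errors can be caught before publishing with a small check over the entries about to be written. The sketch below assumes each entry is a dict with "loc", "canonical", and "noindex" keys gathered by the build; that shape and the function name are hypothetical.

```python
from urllib.parse import urlsplit

MAX_URLS_PER_FILE = 50_000


def lint_sitemap(entries, max_urls=MAX_URLS_PER_FILE):
    """Return a list of human-readable problems found in one sitemap's entries."""
    problems = []
    if len(entries) > max_urls:
        problems.append(f"{len(entries)} URLs in one file; limit is {max_urls}")
    for entry in entries:
        loc = entry["loc"]
        if urlsplit(loc).scheme != "https":
            problems.append(f"{loc}: not an HTTPS URL")
        if entry.get("noindex"):
            problems.append(f"{loc}: page is noindex")
        canonical = entry.get("canonical")
        if canonical and canonical != loc:
            problems.append(f"{loc}: canonical points to {canonical}")
    return problems
```

An empty list means none of the checked errors were found. The <lastmod> and robots.txt problems need their own checks because they depend on data outside the entry list.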