Overview

Crawl budget is the number of URLs Googlebot will fetch from a site within a window. For most sites under 10,000 URLs, budget is effectively infinite; Google crawls everything every few days. Above 10,000 URLs, especially with frequent content changes, crawl budget becomes the bottleneck between publish and rank. Block low-value paths, consolidate duplicates, and signal freshness deliberately.

Know when crawl budget matters

Crawl budget is a real concern at three thresholds.

  • Over 10,000 indexable URLs.
  • Over 1,000 URLs published or updated per day.
  • E-commerce, classifieds, and UGC sites with parameter explosion.

Below those thresholds, Google’s discovery scheduler covers the site comfortably. Spending engineering time on crawl budget for a 200-page site is wasted; spend it on content and internal linking. See internal-linking for the depth rules that also affect crawl rate.
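
A quick check against the first threshold is to count the URLs your sitemaps actually expose. A minimal sketch in Python, assuming uncompressed sitemaps.org-format XML and an illustrative sitemap URL:

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_urls(sitemap_url):
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    if root.tag.endswith("sitemapindex"):
        # Sitemap index: recurse into each child sitemap.
        return sum(count_urls(loc.text) for loc in root.findall("sm:sitemap/sm:loc", NS))
    # Plain urlset: count the <url> entries.
    return len(root.findall("sm:url", NS))

print(count_urls("https://example.com/sitemap.xml"))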

Block low-value paths in robots.txt

robots.txt is the first line of crawl budget defense. Block anything that produces URLs but not rankings.

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /account/

Sitemap: https://example.com/sitemap.xml

  • Faceted navigation with infinite filter combinations: block.
  • Internal search result pages: block.
  • Session-id URLs, cart pages, user account pages: block.
  • Test and staging domains: disallow the whole host for every user agent on the staging server itself, as in the snippet after this list; do not rely on the production robots.txt carrying over.
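
For a staging host, the complete robots.txt is two lines (pair it with HTTP auth, since robots.txt is only advisory):

User-agent: *
Disallow: /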

Disallow prevents crawling but does not prevent indexing if the URL is linked from elsewhere. For URLs that must stay out of the index entirely, remove the Disallow and serve noindex instead, so the crawler can actually fetch the directive.

Apply noindex to thin and duplicate pages

noindex,follow removes the URL from the index while letting crawlers traverse it. Use it on pages that exist for users but not for search.

  • Thank-you pages and confirmation flows.
  • Author archive pages with only one or two posts.
  • Tag archives that duplicate category archives.
  • Infinite-scroll paginated archives past page 5.
  • Internal-only documentation.

noindex is a meta tag (or X-Robots-Tag HTTP header), not a robots.txt rule. Robots.txt-blocked URLs cannot receive noindex because the crawler never fetches the page to see it.

<meta name="robots" content="noindex,follow" />
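
For non-HTML resources such as PDFs and feeds, which cannot carry a meta tag, the X-Robots-Tag response header is the equivalent:

HTTP/1.1 200 OK
X-Robots-Tag: noindex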

Use canonical tags to consolidate duplicate URLs

When the same content lives at multiple URLs (sort orders, tracking parameters, pagination), canonical to one URL.

  • Every page declares a self-referential canonical. Parameter variants point back to the clean canonical. See technical.
  • ?utm_source=... URLs canonical to the unparameterized URL.
  • Sort and filter parameters on category pages canonical to the unsorted version.
  • Mobile and desktop variants canonical to the responsive single URL. See mobile-first.

Canonical is a hint, not a directive. Google may ignore the canonical if the pages diverge significantly; 301 redirects are stronger when the URLs are truly equivalent.
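
For concreteness, the tag a sorted variant would serve in its <head> (URLs illustrative):

<!-- served at https://example.com/category/widgets?sort=price -->
<link rel="canonical" href="https://example.com/category/widgets" />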

Signal freshness through sitemap <lastmod>

Google reads <lastmod> in sitemap.xml and uses it to schedule recrawls. Lying about <lastmod> (touching every page on every build) degrades the signal sitewide.

<url>
  <loc>https://example.com/seo/crawl-budget</loc>
  <lastmod>2026-05-14</lastmod>
</url>

  • Set <lastmod> from the page’s last_updated frontmatter, only when content materially changes; see the sketch after this list.
  • Submit the sitemap once in Google Search Console; Google refetches it automatically.
  • Pair with IndexNow for near-instant change notification on Bing and Yandex.
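
A minimal sketch of the frontmatter-driven <lastmod> approach, assuming markdown files under content/ with a last_updated field in YAML frontmatter (paths and field name are assumptions; adapt to your build):

from pathlib import Path

def last_updated(path):
    # Naive frontmatter read: pull last_updated from between the --- fences.
    lines = path.read_text().splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            if line.startswith("last_updated:"):
                return line.split(":", 1)[1].strip()
    return None  # No date recorded: omit <lastmod> rather than fake one.

entries = []
for page in sorted(Path("content").glob("**/*.md")):
    loc = "https://example.com/" + page.relative_to("content").with_suffix("").as_posix()
    date = last_updated(page)
    lastmod = f"\n  <lastmod>{date}</lastmod>" if date else ""
    entries.append(f"<url>\n  <loc>{loc}</loc>{lastmod}\n</url>")

print('<?xml version="1.0" encoding="UTF-8"?>')
print('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
print("\n".join(entries))
print("</urlset>")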

Paginate with ?page=N and self-canonical each page

Pagination is the most common crawl-budget waste. Three patterns; only one is correct in 2026.

  • Wrong: rel="next" and rel="prev". Google deprecated these in 2019.
  • Wrong: canonical every page to page 1. Page 2 is not a duplicate of page 1, so Google treats the tag as a mistake and usually ignores it.
  • Right: every paginated URL canonical to itself. ?page=2 declares https://example.com/blog?page=2 as its canonical.
  • For infinite scroll, also expose paginated URLs with ?page=N so crawlers can traverse without executing scroll logic.

Apply noindex,follow to paginated tails past page 5 if those pages are thin. Crawlers still walk the pagination; the index stays clean. One caveat: Google eventually treats long-standing noindex pages as nofollow too, so anything important linked only from tail pages should also be reachable from indexable pages.
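
Put together, the <head> of a deep tail page carries both signals (URL illustrative):

<!-- served at https://example.com/blog?page=7 -->
<link rel="canonical" href="https://example.com/blog?page=7" />
<meta name="robots" content="noindex,follow" />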

Monitor crawl behavior in Search Console

Google Search Console’s Crawl Stats report (Settings > Crawl stats) is the closest thing to ground truth short of your own server logs.

  • Total crawl requests per day: should track the volume you publish and update.
  • Crawl response codes: 95%+ should be 200. A spike in 5xx responses throttles the crawler; 404s do not throttle it but still waste requests.
  • Response time: under 600 ms keeps the crawler happy; over 2 seconds throttles it.
  • File type breakdown: a high ratio of image and CSS crawls vs HTML crawls means the crawler is wasting budget on assets.
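
Server logs are the natural cross-check for the response-code numbers. A minimal sketch that tallies Googlebot responses by status, assuming a combined-format access log named access.log (the user-agent string is spoofable, so verify hits against Google’s published IP ranges before acting on them):

import re
from collections import Counter

# Combined log format: ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
REQUEST = re.compile(r'"[A-Z]+ \S+ [^"]*" (\d{3}) ')

statuses = Counter()
with open("access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue  # User-agent filter only; pair with IP verification.
        match = REQUEST.search(line)
        if match:
            statuses[match.group(1)] += 1

for status, count in statuses.most_common():
    print(status, count)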

Consolidate duplicate content before crawl-budget tactics

The cheapest crawl-budget win is shipping fewer URLs. Audit for duplicates first.

  • Tag pages duplicating category pages: remove one set, 301 to the other.
  • Multiple URLs serving the same content: pick one canonical, 301 the rest.
  • Soft-404s (pages that return 200 with “no results found”): return real 404s, as in the sketch after this list.
  • Thin pages with no real intent: consolidate or delete. See the thin-content rules in content.
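
The soft-404 fix is usually a one-line guard in the route handler. A sketch in Flask (the framework and the load_products helper are assumptions, not a prescribed stack):

from flask import Flask, abort, render_template

app = Flask(__name__)

@app.route("/category/<slug>")
def category(slug):
    products = load_products(slug)  # hypothetical data-access helper
    if not products:
        abort(404)  # real 404 instead of a 200 "no results found" page
    return render_template("category.html", products=products)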