If a crawler cannot reach the page, nothing else you do counts

Crawlability and indexability decide whether the rest of your SEO work shows up in search at all. Get these right first, then everything you do downstream actually compounds.

Updated May 22, 2026

Crawlability and indexability are the two questions every search engine and AI crawler asks before it evaluates a single word of your content: can I fetch this page, and am I allowed to remember it?

If the answer to either question is no, your perfect heading structure and your beautiful FAQ schema are invisible. This is the lens we run first in every audit, because it sets the ceiling for everything else.

Why this matters more than any other lens

Search engines and AI systems do not browse your site the way a human does. They send automated crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, and many others — that follow a strict set of rules before they ever store a single line of your content.

If a crawler is blocked by robots.txt, caught in a redirect loop, served a noindex directive, or pointed elsewhere by a canonical tag, your page effectively does not exist as far as that system is concerned. We see this pattern constantly: a business invests in content, design, and product, then loses 60–90% of its potential organic reach to a single misconfigured rule.

What this lens looks at

We treat crawlability and indexability as a chain. A break at any link drops everything below it. Our audit walks the full chain in this order:

robots.txt: presence, syntax, blanket disallows, accidental wildcards, and per-bot rules including AI crawlers.
Sitemap: presence, location, declaration in robots.txt, freshness, coverage of important URLs, and exclusion of noindex pages.
HTTP status codes on key pages: 200 vs 3xx redirects, 4xx errors, and 5xx instability.
Canonical tags: missing, conflicting, cross-domain, or pointing to a page that itself is noindexed.
Robots meta tags and X-Robots-Tag headers: noindex, nofollow, and nosnippet directives on pages that should be indexable.
AI crawler access: explicit allow or disallow rules for GPTBot, ClaudeBot, PerplexityBot, CCBot, and Google-Extended (a robots.txt control token that Google honors, not a separate crawler with its own user agent).
Renderability: whether content appears in the raw HTML or only after JavaScript execution.
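
The chain idea can be sketched in a few lines of Python. This is a minimal illustration, not our audit code: the `PageFacts` record and its field names are hypothetical, the AI crawler link is folded into the robots.txt check, and treating a missing sitemap entry or a non-self canonical as a hard break is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFacts:
    """Hypothetical record of what a crawler observed for one URL."""
    url: str
    robots_allowed: bool        # robots.txt permits fetching the URL
    in_sitemap: bool            # URL is listed in the sitemap
    status: int                 # final HTTP status code
    canonical: Optional[str]    # href of rel=canonical, if present
    noindex: bool               # robots meta tag or X-Robots-Tag says noindex
    content_in_raw_html: bool   # key content is present before JavaScript runs


def first_break(page: PageFacts) -> Optional[str]:
    """Walk the chain in order and return the first broken link, or None."""
    if not page.robots_allowed:
        return "robots.txt"
    if not page.in_sitemap:
        return "sitemap"
    if page.status != 200:
        return "status"
    # A missing canonical is treated as passing here to keep the sketch short.
    if page.canonical is not None and page.canonical != page.url:
        return "canonical"
    if page.noindex:
        return "robots meta / X-Robots-Tag"
    if not page.content_in_raw_html:
        return "renderability"
    return None
```

Because the function returns at the first failure, a page blocked in robots.txt reports only that problem, even if it also carries a noindex or a bad status code. That mirrors the ordering above: later links are not worth inspecting until the earlier ones hold.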

The patterns we find on real sites

These are the issues we surface most often, ranked by how badly they hurt visibility:

A blanket Disallow: / left over from a staging environment, blocking the entire site from every search engine.
A sitemap that still lists stale URLs returning 404, which tells search engines the sitemap is unreliable and wastes crawl budget.
Important product or category pages marked noindex by a CMS template that the team forgot was active.
Canonical tags pointing to a marketing tracking URL, causing the real page to be dropped from the index.
AI crawlers blocked by default, so the site never gets cited in ChatGPT, Claude, or Perplexity answers.
Content that only appears after JavaScript runs, with no server-rendered fallback for crawlers that do not execute JS.
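
Several of these patterns can be spotted from a single response. The sketch below uses only the Python standard library and assumes you already have the page's HTML and response headers in hand; the function name `find_blockers` is our own, and detecting tracking URLs by a `utm_` parameter prefix is an assumption that will not cover every tracking scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HeadSignals(HTMLParser):
    """Collect robots meta directives and the rel=canonical href from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots += [d.strip().lower() for d in a.get("content", "").split(",")]
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")


def find_blockers(html, headers):
    """Return a list of indexability problems found in one response."""
    parser = HeadSignals()
    parser.feed(html)
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    issues = []
    if "noindex" in parser.robots:
        issues.append("noindex in robots meta tag")
    if "noindex" in lowered.get("x-robots-tag", "").lower():
        issues.append("noindex in X-Robots-Tag header")
    if parser.canonical:
        query = urlsplit(parser.canonical).query
        # Assumption: tracking parameters use the common utm_ prefix.
        if any(param.startswith("utm_") for param in query.split("&")):
            issues.append("canonical points to a tracking URL")
    return issues
```

An empty list means none of these three problems were found in that response. It says nothing about robots.txt or rendering, which need their own checks.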

How to fix crawlability without breaking anything

Fixing crawlability is one of the highest-leverage things you can do, but it is also one of the easiest to break. Apply changes in this order, and verify each step before moving to the next:

Audit robots.txt first. Confirm that no rule blocks paths you care about. Treat any Disallow: / as a five-alarm fire.
Generate or regenerate your sitemap. Include only canonical, indexable, status-200 URLs. Reference it from robots.txt with a Sitemap: directive.
Submit your sitemap in Google Search Console and Bing Webmaster Tools, then watch the coverage report for excluded pages.
Sweep every important page for noindex meta tags and X-Robots-Tag headers. Remove them where they should not be.
Confirm canonical tags resolve to themselves on canonical pages, and to the correct primary URL on duplicates.
Decide your AI policy explicitly. If you want to appear in AI answers, allow GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.
Run your homepage through a 'view source' check or a curl request to confirm the important content is in the raw HTML.
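
The sitemap step above is the easiest one to automate. As a rough sketch, assuming you can export a list of pages with their status, noindex flag, and canonical URL (the dictionary keys below are hypothetical), a filter like this keeps only canonical, indexable, status-200 URLs:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from page records, skipping anything not indexable.

    Each record is a dict with the hypothetical keys:
    url, status, noindex, canonical.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        keep = (
            page["status"] == 200
            and not page["noindex"]
            and page["canonical"] == page["url"]  # self-canonical only
        )
        if keep:
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = page["url"]
    # Returns the XML body as a string; add an XML declaration when writing to disk.
    return ET.tostring(urlset, encoding="unicode")
```

The filter is deliberately strict: a URL that redirects, errors, carries noindex, or canonicalizes elsewhere is left out rather than listed and explained away later.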

The AI search angle most teams miss

Modern AI answer engines run their own crawlers, and the major ones state that they honor robots.txt. Many sites block them by default through an overly cautious CDN setting or a robots.txt copied from a privacy-focused template.

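The result: your competitors get cited in ChatGPT and Perplexity answers, and you do not. If you want to be visible in AI search, the bare minimum is to make sure no robots.txt rule blocks GPTBot, ClaudeBot, PerplexityBot, or Google-Extended. A crawler that matches no Disallow rule is allowed by default, so an explicit Allow group for each bot matters most when a broader rule would otherwise catch it, and it makes your policy unambiguous. If you also want training-time visibility (your brand learned by the next model generation), allow CCBot too.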
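
One way to check an AI policy before deploying it is to generate the robots.txt text and test it with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `/admin/` path and the sitemap URL are placeholders, the helper names are ours, and the standard-library parser may not match every crawler's own interpretation of edge cases.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def build_robots_txt(allowed_bots, sitemap_url):
    """Emit one explicit Allow group per bot, then a default group."""
    lines = []
    for bot in allowed_bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Placeholder default rule: everyone else is kept out of /admin/ only.
    lines += ["User-agent: *", "Disallow: /admin/", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


def can_fetch(robots_txt, bot, url):
    """Return True if the given robots.txt text lets `bot` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


robots_txt = build_robots_txt(AI_BOTS, "https://example.com/sitemap.xml")
for bot in AI_BOTS + ["CCBot"]:
    print(bot, can_fetch(robots_txt, bot, "https://example.com/pricing"))
```

The same `can_fetch` helper is useful in the other direction: feed it your live robots.txt and a list of URLs you care about, and any `False` is a rule worth reading closely. A leftover `Disallow: /` under `User-agent: *` will show up as `False` for every bot that lacks its own group.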

Frequently asked questions

How do I know if Google can actually crawl my page?

Use the URL Inspection tool in Google Search Console. It reports whether the page is indexed and whether the live URL can be crawled, and it shows the rendered HTML so you can compare it with the source. For AI crawlers, you can approximate a request with curl using a custom User-Agent header.

Should I block AI crawlers to protect my content?

It depends on your business. If your content is your moat and you do not want it summarized in answer engines, block them. If your content is marketing intended to drive awareness, blocking AI crawlers makes you invisible in a fast-growing search surface. Most marketing-oriented sites benefit from allowing them.

What is the difference between noindex and disallow?

Disallow tells a crawler not to fetch a page. Noindex tells a crawler it can fetch the page but must not store it in the index. The dangerous combination is disallow plus noindex: the crawler never fetches the page, so it never sees the noindex, so the URL can still appear in search with no snippet if other pages link to it.
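
The curl check from the first answer can also be done in Python. The sketch below is an approximation only: the User-Agent value you pass is a stand-in for a crawler's real string, the function names are ours, and a server or CDN that verifies crawlers by IP address will treat the request differently from the real bot.

```python
import urllib.request


def crawler_request(url, user_agent):
    """Build a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def raw_html_contains(url, user_agent, phrase, timeout=10):
    """Fetch the raw HTML (no JavaScript is executed) and look for a phrase.

    Returns (status_code, phrase_found). urlopen raises HTTPError for
    4xx and 5xx responses, which is itself a useful signal of a block.
    """
    request = crawler_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, phrase in body
```

Calling `raw_html_contains("https://example.com/", "GPTBot", "Pricing")` would report whether that word is in the server-rendered HTML for a request labeled as GPTBot. If the phrase is visible in a browser but missing here, the content probably depends on JavaScript.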

Run the audit

See how your site scores on this lens.

A free audit returns a specific verdict on crawlability & indexability, with evidence, severity, and a prioritized fix list across all eight lenses. See also the technical SEO guide.