By Greg Nowak. Last updated 2026-06-17.
AI crawler policy is no longer a publisher-only issue. For business websites, it is now a practical decision about lead generation, content reuse, infrastructure load, and who gets to learn from your site.
Most companies do not need a yes-or-no answer. They may want visibility in Google Search and ChatGPT search, while refusing model-training crawlers. They may want service pages and help articles discoverable, while keeping staging, previews, internal search results, parameter-driven duplicates, and forgotten PDFs out of summaries and away from expensive origin paths. The useful question is not “should we block AI?” but “which content deserves discovery, which content needs tighter reuse limits, and which content should never be public in the first place?”
Start with policy, not panic
OpenAI now separates search crawling, training crawling, and user-triggered visits. That matters. OAI-SearchBot is for ChatGPT search visibility, GPTBot is for training, and ChatGPT-User is used for some user-initiated fetches. In practice, that means you can allow discovery while refusing training. It also means private content should never rely on robots.txt alone, because user-triggered visits are a different category and may not behave like ordinary automated crawling.
There is another useful correction to the usual hype. Google says there is no special AI file, no AI-only markup, and no extra schema required to appear in AI Overviews or AI Mode. If a page is indexed, snippet-eligible, and technically sound for normal Search, it can be eligible there too. So this is not a new-content-format project. It is a governance and implementation project.
| Business goal | Best control | Why this layer fits | Main caution |
|---|---|---|---|
| Appear in ChatGPT search | Allow OAI-SearchBot in robots.txt | Keeps public pages eligible for ChatGPT search answers | Edge firewalls can still block it; OpenAI says changes can take about 24 hours to adjust |
| Refuse model training | Disallow GPTBot | Separates training opt-out from search visibility | Do not assume this protects already-public sensitive content |
| Keep a page public but limit reuse in snippets | nosnippet, max-snippet, or data-nosnippet | Lets Google crawl the page while restricting snippet use | If the page is blocked in robots.txt, Google may never see these rules |
| Keep PDFs or exports out of index | X-Robots-Tag | Works for non-HTML files the CMS may not control well | Easy to forget at CDN or server level |
| Protect client-only or non-public content | Authentication or access control | Actually prevents access | robots.txt is not a privacy mechanism |
Use the right control in the right layer
robots.txt is for broad crawl preferences. Use it for staging areas, preview URLs, faceted duplicates, internal search, and low-value archives that should not be crawled at all. Use page-level controls such as meta name='robots', nosnippet, max-snippet, or data-nosnippet when a page should stay public but you want tighter control over how much of it can be reused. Use X-Robots-Tag for PDFs, image files, feeds, exports, and other non-HTML assets. If something is sensitive, require login or remove public access.
The caveat most teams miss is simple but important: page-level rules are only seen if the page can be crawled. Google’s documentation is explicit here. If you disallow a URL in robots.txt, compliant crawlers may never fetch it and therefore never discover its noindex, nosnippet, or max-snippet instructions. That is why “public but restricted” pages usually need to remain crawlable, while “do not fetch this section at all” pages belong in robots.txt.
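A minimal sketch of that caveat, using Python's standard `urllib.robotparser`. The function name `unseen_directives`, the example URLs, and the sample rules are assumptions made for illustration. The standard-library parser only approximates how any particular crawler interprets robots.txt. The idea is simply that if a crawler may not fetch a URL, every page-level directive on that URL is invisible to it.

```python
from urllib.robotparser import RobotFileParser


def unseen_directives(robots_txt, user_agent, url, directives):
    """Return the page-level directives a compliant crawler would never see.

    If robots.txt forbids fetching `url`, the crawler never reads the page,
    so none of its meta robots or X-Robots-Tag rules can take effect.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return []  # page is crawlable, so its directives can be read
    return list(directives)


# Hypothetical rules: the whole /pricing/ section is blocked from crawling.
ROBOTS_TXT = "User-agent: *\nDisallow: /pricing/\n"

hidden = unseen_directives(
    ROBOTS_TXT,
    "Googlebot",
    "https://example.com/pricing/enterprise",
    ["noindex", "max-snippet:50"],
)
```

A non-empty result is a signal to pick one layer: either keep the page crawlable and rely on the page-level rules, or keep the robots.txt block and drop the expectation that those rules will be honored.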
If you only need to suppress part of a page, data-nosnippet is often cleaner than suppressing the whole page. It is useful for repeated CTAs, boilerplate legal text, or client-specific fragments embedded on otherwise public pages.
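# A crawler that matches its own group ignores the * group, so the exclusions are repeated here.
User-agent: OAI-SearchBot
Disallow: /preview/
Disallow: /staging/
Disallow: /?s=
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: *
Disallow: /preview/
Disallow: /staging/
Disallow: /?s=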
That is only a starting point. In WordPress and Drupal stacks, the real job is mapping policy to templates, taxonomies, archive behavior, search pages, and files the CMS never touches directly.
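One way to keep a policy like this honest is to write the intent down as expectations and test the file against them. The sketch below uses Python's standard `urllib.robotparser`. The sample policy, the `check_policy` helper, the `ExampleBot` name, and the paths are all assumptions for illustration. The sample repeats the preview and staging exclusions inside the OAI-SearchBot group because a crawler that matches a named group does not also apply the `*` group. Note that `urllib.robotparser` applies the first matching rule in file order, which can differ from the longest-match behavior some crawlers use, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: search crawler allowed, training crawler refused,
# preview and staging closed to everyone.
SAMPLE_ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /preview/
Disallow: /staging/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /preview/
Disallow: /staging/
"""

# (user agent, path, should the crawler be allowed to fetch it?)
EXPECTATIONS = [
    ("OAI-SearchBot", "/services/", True),
    ("OAI-SearchBot", "/staging/home", False),
    ("GPTBot", "/services/", False),
    ("ExampleBot", "/preview/draft", False),
    ("ExampleBot", "/services/", True),
]


def check_policy(robots_txt, expectations):
    """Return a list of human-readable mismatches (empty means all passed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, path, expected in expectations:
        actual = parser.can_fetch(agent, path)
        if actual != expected:
            failures.append(f"{agent} {path}: expected {expected}, got {actual}")
    return failures


failures = check_policy(SAMPLE_ROBOTS_TXT, EXPECTATIONS)
```

Running the same check against the robots.txt actually served in production, rather than the copy in the repository, is what catches environment and hostname mix-ups.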
Where Cloudflare helps, and where it can trip you up
Cloudflare’s AI Crawl Control is useful because it turns policy into operations. You can review AI crawler activity, see robots.txt violations, and set allow or block actions per crawler. For agency teams and operations leads, that is much cleaner than sprinkling one-off bot rules across several systems.
But the operational risk is at the edge. Cloudflare documents that AI Crawl Control blocking is enforced through WAF custom rules, and rule order matters. A crawler you marked as allowed can still be blocked by upstream security logic. A crawler you marked as blocked can still get through if skip, redirect, or transform rules take precedence. If your site is behind multiple layers of WAF, CDN, and origin config, treat the dashboard as a control panel, not proof that the final result is correct.
There is also a practical limit to remember: Cloudflare says free-plan AI crawler detection is based on user-agent strings. That is useful for managing known, self-identifying bots, but it is not a confidentiality control. If content is commercially sensitive, access control still wins.
What to verify before you call it done
Most failures here are not strategic. They are rollout errors: the wrong hostname, the wrong environment, missing headers on PDFs, or a CDN rule quietly overriding the CMS. Verify the live site, not the intended setup.
curl https://example.com/robots.txt
curl -I https://example.com/brochure.pdf
The first confirms the production robots.txt file actually being served. The second confirms whether headers such as X-Robots-Tag are present on assets marketing teams often forget. Also inspect a sample service page in rendered source to confirm page-level rules are live, not merely configured somewhere in the admin.
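The same checks can be scripted so they run after every deploy. This is a rough sketch using only the Python standard library. The helper names (`split_directives`, `header_robots`, `meta_robots`, `fetch_headers`) are hypothetical, and the parsing is deliberately simple: bot-prefixed header values such as `googlebot: noindex` are kept as single tokens rather than interpreted.

```python
import urllib.request
from html.parser import HTMLParser


def split_directives(value: str) -> set[str]:
    """Split a comma-separated robots directive string into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def header_robots(headers: dict[str, str]) -> set[str]:
    """Collect directives from any X-Robots-Tag header (name is case-insensitive)."""
    found: set[str] = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= split_directives(value)
    return found


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= split_directives(attr.get("content", ""))


def meta_robots(html: str) -> set[str]:
    """Return the robots directives found in the page's meta tags."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def fetch_headers(url: str) -> dict[str, str]:
    """HEAD request, roughly what `curl -I` does. Defined but not called here."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


# Offline examples with made-up responses.
pdf_rules = header_robots({"Content-Type": "application/pdf",
                           "X-Robots-Tag": "noindex, nofollow"})
page_rules = meta_robots('<head><meta name="robots" content="max-snippet:50"></head>')
```

An empty set from `header_robots` on a PDF that was supposed to be kept out of the index points to the missing-header case described above.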
If ChatGPT search visibility matters, confirm that OAI-SearchBot is allowed and not blocked at the CDN. If you aggressively filter bots at the edge, allow OpenAI’s published IP ranges as well. If training opt-out matters, confirm that GPTBot is disallowed. If Google AI visibility matters, keep important pages indexable and snippet-eligible instead of blocking them in robots.txt and expecting page-level directives to do the job.
Need someone to own the cleanup?
This usually does not need a rebuild. It needs an audit, a clear policy by content type, and one person willing to line up CMS settings, server headers, and edge rules so they stop contradicting each other. If you want that handled across WordPress, Drupal, Cloudflare, or a mixed agency stack, Greg can take ownership and keep search visibility intact while the policy gets stricter. See how Greg works as a digital project manager.