AI crawler access control is now a paid operations project for sites that want ChatGPT visibility without opening everything

For years, crawler policy was a blunt instrument. If you wanted search visibility, you let the major engines in. If you did not, you blocked them. That logic breaks down once AI search enters the picture. The same site might want to appear in ChatGPT answers, keep training crawlers away from selected sections, leave Google Search alone, set a separate policy for Gemini-related use, and make sure the WAF is not quietly blocking the requests that matter.

That is the real shift. AI crawler access control is no longer a quick robots.txt edit. It now spans crawler-specific tokens, meta directives, canonical handling, CDN and WAF behavior, IP allowlists, and analytics. If those layers are not aligned, you can end up hiding pages you meant to surface or exposing more content than the business intended.

Search visibility and model training are separate decisions

OpenAI's current crawler documentation makes an important distinction that many teams still miss. OAI-SearchBot is tied to search visibility in ChatGPT's search features. GPTBot is the crawler used for content that may be used in training generative AI foundation models. OpenAI says those controls are independent, which means a site can allow OAI-SearchBot to support ChatGPT search visibility while disallowing GPTBot to keep the same content out of model training.
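As a minimal sketch of that split, the following Python snippet parses a hypothetical robots.txt with the standard library's `urllib.robotparser` and checks both tokens against the same URL. The robots.txt content, the `example.com` URL, and the `is_allowed` helper are illustrative assumptions, and Python's parser is only an approximation of how any real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow the ChatGPT search crawler, block the training crawler.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, used only for illustration.
URL = "https://example.com/articles/some-post"

search_allowed = is_allowed(ROBOTS_TXT, "OAI-SearchBot", URL)
training_allowed = is_allowed(ROBOTS_TXT, "GPTBot", URL)
print(search_allowed, training_allowed)
```

A check like this can run in a deployment pipeline so that an edit to robots.txt cannot silently flip either decision.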

OpenAI also says its search systems can take about 24 hours to adjust after a robots.txt change. Its publisher FAQ adds a practical point: if you want your content included in summaries and snippets in ChatGPT, you need to make sure OAI-SearchBot is not blocked. For any brand that cares about ChatGPT visibility, that should be an explicit policy decision, not an accident.

The FAQ adds another nuance with real operational implications. If OpenAI learns about a disallowed URL from a third-party search provider or from other crawled pages, and has signals that the page is relevant, it may still surface only the link and page title. If you want to prevent that, OpenAI points publishers to the noindex meta tag, but also notes that the crawler has to be allowed to crawl the page in order to read that tag. At that point, this stops being a simple allow-or-block decision and becomes content classification.
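The interaction between a robots.txt block and a noindex tag can be sketched as a small classifier. This is an illustrative sketch only: the `classify` function, its four labels, and the sample robots.txt are assumptions made for this example, not anything OpenAI publishes.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Records whether the page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def classify(robots_txt: str, user_agent: str, url: str, html: str) -> str:
    """Label a page by how robots.txt and the noindex tag interact (hypothetical labels)."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    crawlable = robots.can_fetch(user_agent, url)

    meta = RobotsMetaParser()
    meta.feed(html)

    if meta.noindex and crawlable:
        return "hidden"          # the crawler can read the noindex tag
    if meta.noindex and not crawlable:
        return "conflict"        # noindex is present but the crawler cannot read it
    if not crawlable:
        return "link-only-risk"  # blocked, but link and title may still surface
    return "visible"


ROBOTS_TXT = "User-agent: OAI-SearchBot\nDisallow: /private/\n"
NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'
PLAIN_PAGE = "<html><head><title>Post</title></head></html>"

print(classify(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/private/a", NOINDEX_PAGE))
```

The "conflict" case is the one worth flagging in an audit: the page is both blocked and marked noindex, so the tag cannot take effect for that crawler.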

Google added a separate control for Gemini use

Google documents a similar separation, but the implementation is different. Its crawler documentation says Google's common crawlers obey robots.txt when crawling automatically, and it describes Google-Extended as a standalone product token. Publishers can use that token to manage whether content Google crawls from their sites may be used for future Gemini model training and for grounding in Gemini-related products.

Google is equally clear that Google-Extended does not affect inclusion in Google Search and is not used as a ranking signal. That matters because it removes a false tradeoff. A company can keep normal Google Search visibility and make a separate decision about Gemini training and grounding.

There is a technical wrinkle here that often gets missed in audits. Google says Google-Extended does not have its own separate HTTP user-agent string. The crawling still happens with existing Google user agents; the control sits in the robots.txt token. If a team is only hunting for a new crawler signature in logs, it can miss the actual decision point.
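A short sketch shows why a log search comes up empty while the robots.txt check still finds the policy. The robots.txt content and the sample log entries are hypothetical, and Python's `urllib.robotparser` stands in for Google's own parsing, which may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended, leave everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Sample user-agent strings as they might appear in an access log (illustrative).
LOG_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/some-post"  # placeholder URL

# Searching logs for the token finds nothing, because no request carries it.
seen_in_logs = any("google-extended" in ua.lower() for ua in LOG_USER_AGENTS)

# The decision is visible only in robots.txt.
search_crawl_allowed = robots.can_fetch("Googlebot", url)
extended_use_allowed = robots.can_fetch("Google-Extended", url)

print(seen_in_logs, search_crawl_allowed, extended_use_allowed)
```

An audit that reports on Google-Extended should therefore read robots.txt directly rather than infer the policy from traffic.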

Your WAF can override your crawler policy without telling you

Perplexity's crawler documentation shows why this work has moved beyond a pure SEO checklist. PerplexityBot is the search crawler used to surface and link sites in Perplexity results, and Perplexity says it is not used for AI foundation model training. But Perplexity also documents Perplexity-User, the fetcher used when a user action inside Perplexity triggers a page visit to support an answer. Perplexity says that this user-requested fetcher generally ignores robots.txt.

That one detail changes the implementation model. Once user-triggered fetches are part of the picture, robots.txt is no longer the whole policy. Perplexity explicitly tells site owners using a WAF to whitelist its bots, and its Cloudflare example combines user-agent matching with official IP ranges. The recommended action is to set the rule to Allow so those requests bypass the security rules that would otherwise block or challenge them.
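The matching logic behind such an allow rule, user-agent token plus source IP, can be sketched in Python. The IP ranges below are placeholders taken from the reserved documentation blocks and are NOT Perplexity's published ranges, and `should_allow` is a hypothetical helper; a real rule would live in the WAF itself and use the vendor's current list.

```python
import ipaddress

# Placeholder networks from the reserved documentation ranges.
# These are NOT Perplexity's real ranges; substitute the officially published list.
OFFICIAL_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Bot tokens named in the article.
BOT_TOKENS = ("PerplexityBot", "Perplexity-User")


def should_allow(user_agent: str, remote_ip: str) -> bool:
    """Allow only when the user agent names a known bot AND the IP is in an official range."""
    ua = user_agent.lower()
    ua_match = any(token.lower() in ua for token in BOT_TOKENS)
    ip = ipaddress.ip_address(remote_ip)
    ip_match = any(ip in network for network in OFFICIAL_RANGES)
    return ua_match and ip_match


print(should_allow("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
print(should_allow("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "203.0.113.5"))
```

Requiring both conditions means a spoofed user-agent string from an unlisted address does not bypass the security rules.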

Cloudflare's own AI Crawl Control documentation reinforces the point. It frames AI-specific handling through three separate configuration paths: WAF, Bots, and Transform Rules. In other words, this is now a cross-layer configuration problem. A bot can be allowed in robots.txt and still fail to reach the page because the CDN gets there first.

Canonical handling now reaches into AI training control

Cloudflare's Redirects for AI Training feature adds another layer. For verified AI training crawlers, Cloudflare can inspect the origin HTML, read the <link rel="canonical"> tag, resolve relative canonical URLs, verify that the canonical is same-origin and different from the current URL, and then return a 301 redirect to the canonical URL. If there is no canonical, if the canonical points cross-origin, or if the page is self-canonical, the response passes through unchanged.
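The decision steps described above can be sketched as a small function. This is an approximation written for illustration, not Cloudflare's implementation: `canonical_redirect` and `CanonicalParser` are hypothetical names, and "same-origin" is simplified here to matching scheme and host.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlsplit


class CanonicalParser(HTMLParser):
    """Captures the href of the first <link rel="canonical"> in the document."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.href = attr["href"]


def canonical_redirect(current_url: str, html: str) -> Optional[str]:
    """Return the URL to redirect to with a 301, or None to pass the response through."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.href is None:
        return None  # no canonical tag
    target = urljoin(current_url, parser.href)  # resolve relative canonicals
    current, canonical = urlsplit(current_url), urlsplit(target)
    if (current.scheme, current.netloc) != (canonical.scheme, canonical.netloc):
        return None  # cross-origin canonical
    if target == current_url:
        return None  # self-canonical page
    return target


page = '<html><head><link rel="canonical" href="/post"></head></html>'
print(canonical_redirect("https://example.com/post?ref=x", page))
```

Running the same function over a sample of crawled URLs is one way to estimate how many requests from verified training crawlers such a redirect would actually change.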

That is useful if the problem is duplicate or off-canonical access by verified AI training crawlers. It is not a general fix for AI visibility. Cloudflare is explicit that this redirect behavior applies only to verified bots in the AI Crawler category, and that AI Assistants and AI Search bots are not affected. So canonical-enforcing redirects for training crawlers will not, by themselves, solve ChatGPT search visibility.

That split is why this now needs operational ownership. Many sites need one policy for search inclusion, another for training access, and a third for user-triggered fetchers. Treat all of that as a single "AI bot" setting and you are likely to end up either exposing more than intended or invisible where you wanted exposure.

What a deliberate per-bot policy actually looks like

The question is no longer "Do we allow AI?" It is which bot, which content, which purpose, and which control layer. On a real site, that usually means:

  • Allow OAI-SearchBot on the canonical public pages you actually want surfaced, cited, and linked in ChatGPT search results, then verify referral tracking with utm_source=chatgpt.com.
  • Make a separate decision on GPTBot, because OpenAI documents it as the training crawler rather than the search crawler.
  • Keep Google Search policy distinct from Google-Extended, because Google says the latter does not control Search inclusion or ranking.
  • Treat PerplexityBot and Perplexity-User as different operational cases, especially if your WAF can block or challenge requests before crawler intent matters.
  • Use Cloudflare's AI-specific controls where relevant, and apply canonical-enforcing redirects only where verified training crawler behavior is the actual problem.
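
The referral check mentioned in the first item can be sketched as a filter over landing URLs. The sample URLs and the `is_chatgpt_referral` helper are hypothetical; in practice the same test would usually be expressed as a segment or filter in the analytics tool.

```python
from urllib.parse import parse_qs, urlsplit


def is_chatgpt_referral(landing_url: str) -> bool:
    """True when the landing URL carries utm_source=chatgpt.com."""
    query = parse_qs(urlsplit(landing_url).query)
    return "chatgpt.com" in query.get("utm_source", [])


# Hypothetical landing URLs, as they might appear in an access log or analytics export.
landing_urls = [
    "https://example.com/articles/post?utm_source=chatgpt.com",
    "https://example.com/articles/post?utm_source=newsletter",
    "https://example.com/services",
]

chatgpt_visits = [url for url in landing_urls if is_chatgpt_referral(url)]
print(len(chatgpt_visits))
```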

This is not administrative overhead for its own sake. It is the minimum sensible setup when one vendor separates search and training with different robots tokens, another uses a control token instead of a distinct user-agent string, and another documents a user-triggered fetcher that generally ignores robots.txt.

Why this is now worth paying to implement properly

The commercial risk runs both ways. An accidental block can keep a brand out of the ChatGPT summaries and snippets it wanted. An accidental allow can expose more content to training crawlers than intended. A quiet WAF rule can make a correct robots.txt policy fail in practice. And loose canonical handling can leave duplicate or non-canonical URLs available to verified training crawlers even when the site owner thought the footprint was under control.

That is why this increasingly belongs to someone who can audit the full path: robots.txt, canonical tags, bot tokens, WAF behavior, IP-based allowlists, and analytics. The goal is not to open everything to every AI system. The goal is to publish a policy that matches business intent, then make sure the site, CDN, and measurement stack are all enforcing the same answer.

For sites that want ChatGPT visibility without giving away control, that is the real change. AI crawler governance is no longer a one-file checkbox. It is ongoing operations, and the sites that handle it deliberately will stay visible on the surfaces that matter without creating avoidable exposure elsewhere.

Need help with this kind of work?

If you need a workable per-bot policy across robots.txt, Cloudflare, canonicals, and analytics, Greg can audit the setup and implement the gaps. Get in touch with Greg.

Sources

  • Overview of OpenAI Crawlers
  • Publishers and Developers - FAQ
  • Google's common crawlers
  • Perplexity Crawlers
  • Configuration (Cloudflare AI Crawl Control)
  • Redirects for AI Training
Last modified
2026-06-02

Tags

  • ai search
  • crawler governance
  • Cloudflare
  • Technical SEO
