ChatGPT Visibility Without Open Access: robots.txt Is Only the Start

By Greg Nowak. Last updated 2026-07-18.

Showing up in ChatGPT does not require giving every AI crawler unrestricted access to your entire website. OpenAI separates its search crawler from its training crawler, so businesses can make different decisions about visibility and model training.

That sounds like a simple robots.txt change. In practice, the policy also has to survive your CDN, web application firewall, indexing directives, redirects and canonical tags. A technically correct rule is useless if Cloudflare challenges the request first—or if the page being crawled is a duplicate URL you never intended to promote.

The practical goal is not to “allow AI.” It is to decide which public content may be discovered, summarized, cited or used for training, then ensure every delivery layer enforces that decision.

Separate search visibility from training access

OpenAI currently documents two crawlers relevant to this decision. OAI-SearchBot supports search features in ChatGPT, while GPTBot crawls content that may be used to train generative AI models. Each has its own robots.txt user-agent token, so allowing one does not allow the other.

A company can therefore allow OAI-SearchBot to reach its service pages, articles and public documentation while blocking GPTBot. This does not guarantee inclusion or a citation, but it removes a barrier that would otherwise keep ChatGPT search from reaching the pages you want it to find.

Business objective	Primary control	Operational check
Be eligible for ChatGPT search visibility	Allow `OAI-SearchBot` on selected public pages	Confirm the crawler receives a normal page without a WAF challenge
Limit OpenAI training access	Disallow `GPTBot` for the relevant paths	Check rule precedence and request logs
Keep pages out of results	Use an appropriate `noindex` directive	Allow the crawler to reach the page so it can read the directive
Protect private or licensed material	Authentication and access control	Verify the content is unavailable without authorization
Measure business value	Analytics and conversion events	Track ChatGPT referrals, landing pages and qualified enquiries

A useful AI policy connects each business decision to both a technical control and a verification step.

Publish an explicit crawler policy

A basic site-wide policy that permits ChatGPT search crawling while declining GPTBot could begin like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Do not copy that example blindly. Many organisations need path-level rules. Marketing pages may be intentionally public, while customer portals, internal search results, staging environments, licensed resources or large parameter-driven URL spaces require different treatment.
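One way to keep path-level rules honest is to record the decisions you expect and check them against the file. The sketch below uses Python's standard-library robots.txt parser. The paths, the example.com URLs and the helper names are hypothetical placeholders, not recommendations. Note that this parser applies rules in file order, whereas RFC 9309 specifies longest-match precedence; listing the specific Disallow lines before the broad Allow gives the same answer under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical path-level policy; the paths are placeholders for illustration.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /portal/
Disallow: /staging/
Allow: /

User-agent: GPTBot
Disallow: /
"""

# (user-agent token, URL, expected crawl decision)
EXPECTATIONS = [
    ("OAI-SearchBot", "https://www.example.com/services/audit", True),
    ("OAI-SearchBot", "https://www.example.com/portal/invoices", False),
    ("OAI-SearchBot", "https://www.example.com/staging/home", False),
    ("GPTBot", "https://www.example.com/services/audit", False),
]


def check_policy(robots_txt, expectations):
    """Return every expectation that the parsed policy does not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, url, expected)
        for agent, url, expected in expectations
        if parser.can_fetch(agent, url) != expected
    ]


if __name__ == "__main__":
    failures = check_policy(ROBOTS_TXT, EXPECTATIONS)
    print("policy OK" if not failures else f"mismatches: {failures}")
```

A check like this only confirms what the file says. It cannot tell you how a particular crawler interprets the rules or whether the request ever reaches the origin.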

Document the reason for each rule alongside the implementation. Otherwise, a future redesign, SEO migration or CDN configuration change can quietly undo the policy. Remember that robots.txt is a crawler instruction, not a security boundary. Confidential material belongs behind authentication rather than behind a Disallow line.

Google uses a different model for its optional AI control. Google-Extended is a robots.txt product token rather than a separate HTTP user-agent: Google’s existing crawlers fetch the pages, and the token only governs how the content may be used. Google says the token does not affect normal Google Search inclusion or ranking. That makes it another independent policy choice, and it explains why searching server logs for a “Google-Extended crawler” is the wrong verification method.

Make sure the edge layer agrees

A crawler’s request reaches your CDN or WAF before it reaches your origin server or content management system. Bot protection, rate limiting, managed challenges and custom firewall rules can therefore override the policy you published in robots.txt.

A broad rule that trusts any request claiming to be OAI-SearchBot is unsafe because user-agent strings are easy to spoof. Where possible, use your provider’s verified-bot classification or the crawler operator’s published IP information. Review these rules periodically rather than treating an IP list as permanent configuration.
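As a rough illustration of why the user-agent string alone is not enough, the sketch below accepts a crawler claim only when the source address falls inside a published range. The CIDR blocks are RFC 5737 documentation ranges used as placeholders, not OpenAI's real addresses, and the function names and the sample user-agent string are assumptions for this example. In production the ranges would come from the operator's current published list or from your provider's verified-bot signal.

```python
import ipaddress

# Placeholder CIDR ranges (RFC 5737 documentation blocks), NOT real crawler ranges.
# Replace them with the ranges the crawler operator currently publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "GPTBot": ["198.51.100.0/24"],
}


def claimed_crawler(user_agent):
    """Return the crawler token the user-agent string claims, or None."""
    lowered = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in lowered:
            return token
    return None


def classify_request(user_agent, remote_ip):
    """Classify a request as verified, spoofed-or-unlisted, or not a listed crawler."""
    token = claimed_crawler(user_agent)
    if token is None:
        return "not-a-listed-crawler"
    address = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[token])
    if any(address in network for network in networks):
        return "verified"
    return "spoofed-or-unlisted"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"  # illustrative string only
    print(classify_request(ua, "192.0.2.15"))
    print(classify_request(ua, "203.0.113.9"))
```

Because published ranges change, a hard-coded table like this one is only suitable for a demonstration. The periodic review described above still applies.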

For every crawler you intend to allow, inspect real request logs and verify the complete response path:

The intended URL returns 200, or a short and deliberate redirect chain to the canonical URL.
No CAPTCHA, JavaScript challenge, login page or generic block response is substituted.
The crawler receives the meaningful page content, not an empty client-rendered shell.
The final page has the expected canonical and indexing directives.
Rate limits remain strict enough to protect the service without blocking legitimate crawling.

A manual request with a crawler user-agent can expose obvious edge-rule problems, but it cannot prove how a verified crawler will be classified. Treat it as an initial diagnostic, not final acceptance testing.
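The checklist above can be turned into a small audit of a response you have already captured, for example from a log sample or a manual fetch. The sketch below is a heuristic, not a definitive test: the challenge markers, the 200-character threshold and the function names are assumptions chosen for illustration, and it inspects only the final response after any redirects.

```python
from html.parser import HTMLParser

# Hypothetical markers; real challenge pages differ by provider and configuration.
CHALLENGE_MARKERS = ("captcha", "verify you are human")
MIN_TEXT_CHARS = 200  # assumed threshold for "meaningful" visible content


class PageSummary(HTMLParser):
    """Collect the canonical URL, the robots meta directive and visible text length."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = ""
        self.text_chars = 0
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())


def audit_response(status, body, expected_canonical):
    """Return a list of problems found in one final (post-redirect) response."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    page = PageSummary()
    page.feed(body)
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        problems.append("possible challenge page")
    if page.text_chars < MIN_TEXT_CHARS:
        problems.append("little visible text (client-rendered shell?)")
    if page.canonical != expected_canonical:
        problems.append(f"canonical is {page.canonical!r}")
    if "noindex" in page.robots:
        problems.append("page carries noindex")
    return problems
```

An empty list means none of these heuristics fired. It does not prove that a verified crawler received the same response.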

Handle noindex and canonical URLs deliberately

Blocking crawling and preventing indexing are not identical. OpenAI notes that a disallowed URL discovered through another source may still appear as a title and link. If a page should not appear, an applicable noindex directive is the clearer instruction—but the crawler must be able to fetch the page to read it.
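The conflict described above can be detected mechanically: a page that carries noindex but is also disallowed in robots.txt has a directive the crawler can never read. The following sketch assumes you already hold a mapping of URLs to their noindex status from a hypothetical site crawl; the function name, paths and URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser


def unreadable_noindex(robots_txt, agent, pages):
    """Return URLs whose noindex directive the agent cannot read.

    pages maps each URL to True when the page carries a noindex directive.
    A URL is reported when it has noindex and robots.txt disallows the agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return sorted(
        url
        for url, has_noindex in pages.items()
        if has_noindex and not parser.can_fetch(agent, url)
    )


if __name__ == "__main__":
    robots_txt = "User-agent: OAI-SearchBot\nDisallow: /archive/\n"
    pages = {
        "https://www.example.com/archive/2019-pricing": True,   # noindex, but blocked
        "https://www.example.com/thank-you": True,              # noindex and reachable
        "https://www.example.com/archive/old-news": False,      # blocked, no noindex
    }
    print(unreadable_noindex(robots_txt, "OAI-SearchBot", pages))
```

For each URL reported, decide which outcome matters more. If the page must not appear at all, lift the Disallow for that path so the noindex can be read.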

Canonical tags solve a different problem: selecting the preferred version among duplicate or similar URLs. They should be correct before adding AI-specific edge behaviour. Cloudflare can redirect verified AI training crawlers to a same-origin canonical URL, but its documentation says this feature does not apply to AI Search bots or AI Assistants. It can reduce off-canonical training-crawler access; it is not a shortcut to ChatGPT visibility.

Use a repeatable implementation workflow

Classify content. Group URLs into public marketing, editorial, documentation, transactional, private and low-value duplicate content.
Choose by purpose. Decide separately on AI search discovery, model-training crawling and user-requested page retrieval.
Translate policy into controls. Update robots.txt, indexing directives, authentication, canonicals and edge rules.
Test representative URLs. Include allowed, blocked, redirected, non-indexable and authenticated examples.
Inspect production logs. Check status codes, response sizes, crawl paths, challenges and unexpected spikes.
Measure outcomes. OpenAI identifies ChatGPT referral traffic with utm_source=chatgpt.com. Preserve that parameter and connect visits to meaningful conversion events.
Assign an owner. Recheck the policy after migrations, firewall changes and material crawler-documentation updates.
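For the measurement step, a minimal sketch of how ChatGPT referrals might be counted per landing page is shown below. It relies only on the utm_source=chatgpt.com parameter mentioned above. The input format, a list of landing URL and conversion pairs from a hypothetical analytics export, and the function names are assumptions for this example.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def is_chatgpt_referral(landing_url):
    """True when the landing URL carries utm_source=chatgpt.com."""
    query = parse_qs(urlsplit(landing_url).query)
    return "chatgpt.com" in query.get("utm_source", [])


def summarise(visits):
    """Map landing path -> (ChatGPT visits, ChatGPT conversions).

    visits is an iterable of (landing_url, converted) pairs.
    """
    totals = Counter()
    conversions = Counter()
    for landing_url, converted in visits:
        if not is_chatgpt_referral(landing_url):
            continue
        path = urlsplit(landing_url).path or "/"
        totals[path] += 1
        conversions[path] += int(converted)
    return {path: (totals[path], conversions[path]) for path in totals}


if __name__ == "__main__":
    sample = [
        ("https://www.example.com/services/?utm_source=chatgpt.com", True),
        ("https://www.example.com/services/?utm_source=chatgpt.com", False),
        ("https://www.example.com/blog/post?utm_source=newsletter", True),
    ]
    print(summarise(sample))
```

Redirects that drop the query string would make these visits invisible to a count like this, which is one reason to preserve the parameter through the whole redirect chain.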

This work sits between commercial policy, SEO, infrastructure and analytics. Someone needs authority to resolve conflicts between those teams, not merely permission to edit one file.

Turn the policy into an operating decision

The right configuration depends on what the business publishes and how it expects that content to create value. A consultancy may want broad discovery of its expertise. A software company may expose public documentation while protecting account areas and licensed material. A publisher may need rules at section or content-type level.

If your current setup grew through isolated SEO edits and emergency firewall rules, Greg can audit the whole request path and turn it into a clear, testable crawler policy. Talk to Greg about coordinating the implementation.