AI crawler permissions now belong in a licensing register

By Greg Nowak. Last updated 2026-06-27.

Robots.txt used to be a small technical SEO concern. Let Google in, keep the junk paths out, maybe block staging folders, and move on.

That is no longer enough. For publishers, ecommerce teams, SaaS companies, and knowledge-heavy business sites, crawler access is becoming a commercial permission layer. The same page can be used for ordinary search discovery, an AI answer, model training, real-time grounding, ad validation, or a user-triggered browsing request. Those are not the same business deal.

As of June 27, 2026, this is not a theoretical issue. Axios reported People Inc.'s CEO accusing Google of abusing market power around search crawling and AI use, while Google pointed to Google-Extended as a way to control certain Gemini-related uses without affecting Google Search. A separate Axios event recap put the economics plainly: AI tools are changing how people find information, and content owners are worried about being used as inputs while getting fewer visits back.

Whether the site is a media property, a support hub, or a B2B knowledge base, the operating question is now sharper: which crawlers are allowed, for which purpose, under which terms?

The crawler file is no longer enough

Robots.txt still matters. It tells each crawler, identified by its user-agent token, which parts of a site it may access. But access is only the first decision.

Cloudflare's Content Signals Policy makes the missing layer clear. A traditional robots.txt rule can say where a crawler may go. It does not, on its own, define what the crawler may do with the content after access. Cloudflare's proposed signals separate three uses: search, AI input, and AI training. That distinction matters because search indexing, AI answers grounded in retrieved content, and training or fine-tuning a model carry different commercial consequences.

Cloudflare also stresses that these signals are preferences, not a technical barrier. Some companies may ignore them. So a useful crawler policy has to be paired with enforcement where needed: Cloudflare rules, WAF controls, bot management, and log review.
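As a rough illustration of how the three signals might be read by a script, here is a minimal Python sketch. It assumes the signals are published in robots.txt on a `Content-Signal:` line as comma-separated `key=yes` or `key=no` pairs using the keys `search`, `ai-input`, and `ai-train`; check Cloudflare's current documentation for the exact syntax. The helper name `parse_content_signals` and the sample file are hypothetical, and the sketch ignores which user-agent group a line belongs to.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect key=yes/no pairs from every Content-Signal line (last value wins)."""
    signals: dict[str, bool] = {}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            setting = setting.strip().lower()
            if eq and setting in ("yes", "no"):
                signals[key.strip().lower()] = setting == "yes"
    return signals


# Hypothetical sample file: allow search, decline AI input and AI training.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /
"""

signals = parse_content_signals(SAMPLE_ROBOTS_TXT)
print(signals)
```

Reading the values back this way only confirms what the site has stated. It says nothing about whether a given crawler honors the preference, which is why the enforcement layer still matters.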

That is why crawler governance should not live only in a text file. It belongs in a register that connects policy, licensing, configuration, and monitoring.

Search, training, and user-triggered access are separate decisions

The crawler documentation from Google and OpenAI shows why a register is needed. Google's crawler documentation separates common crawlers such as Googlebot and GoogleOther from Google-Extended. Googlebot affects Google Search and related search surfaces. GoogleOther is a generic crawler that can be used by different Google product teams for publicly accessible content.

Google-Extended is different again. It is a standalone control token for managing whether content Google crawls may be used for future Gemini model training and for grounding in Gemini and Vertex AI. Google says Google-Extended does not affect inclusion or ranking in Google Search.

OpenAI's crawler documentation makes a similar separation. OAI-SearchBot is for surfacing sites in ChatGPT search features. GPTBot is for crawling content that may be used in training OpenAI's generative AI foundation models. ChatGPT-User is tied to certain user actions in ChatGPT and Custom GPTs. It is not used for automatic web crawling, and robots.txt rules may not apply because the action is initiated by a user.

OpenAI also notes that settings are independent, so a site can allow search visibility while disallowing training use. That is the practical shift: crawler policy is no longer a simple allow-or-block choice. It needs to record purpose, route, scope, licensing terms, enforcement, and ownership.
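To make the independence of these settings concrete, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser` and asks the same question for each token. The robots.txt content and the example.com URL are illustrative assumptions, not a recommended policy. Google-Extended is evaluated here only as a rule in the file; as noted above, it is a control token, not a crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: search-oriented tokens allowed, training-oriented tokens not.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/guides/pricing"  # hypothetical page
tokens = ("OAI-SearchBot", "GPTBot", "Googlebot", "Google-Extended")
decisions = {token: parser.can_fetch(token, url) for token in tokens}

for token, allowed in decisions.items():
    print(f"{token:16} {'allow' if allowed else 'disallow'}")
```

The standard-library parser is a simple matcher and may not agree with every vendor's own interpretation of edge cases, so treat the result as a sanity check on the file, not as proof of crawler behavior.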

Register item	Decision to record	Evidence to check	Risk if ignored
Crawler identity	User-agent token, documented purpose, and affected product surface.	Robots.txt entries, Cloudflare events, and server logs.	Useful discovery gets blocked, or training access is allowed by accident.
Use permission	Search, AI input, and AI training treated as separate permissions.	Content Signals Policy values and path-level rules.	A broad allow rule is reused for a purpose the business never approved.
Licensing term	Attribution, paid crawl, paid inference, subscription, or custom license route.	Machine-readable RSL terms and internal ownership notes.	Commercial leverage is weaker because terms are not published in machine-readable form.
Enforcement	Which signals are advisory and which access paths are technically controlled.	WAF rules, bot management, allow lists, block lists, and crawl patterns.	The policy looks fine on paper while unwanted traffic continues unnoticed.
Owner and review	Who approves changes and when the register is checked.	Change history, source documentation updates, and review dates.	Old rules remain in place after crawler behavior or business priorities change.

A crawler licensing register gives technical, content, legal, and commercial teams one shared place to audit crawler decisions.
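One way to picture a register row is as a small record type. The sketch below is an assumption about how such a register could be modeled, not a prescribed schema: the field names, the sample entries, and the 90-day review window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RegisterEntry:
    token: str           # user-agent or control token, e.g. "GPTBot"
    purpose: str         # "search", "ai-input", or "ai-train"
    allowed: bool        # the decision the business approved for this purpose
    scope: str           # path prefix the decision covers
    licensing_term: str  # e.g. "none", "attribution", "paid crawl"
    enforcement: str     # "advisory" (signal only) or "technical" (WAF, bot rules)
    owner: str           # who approves changes
    last_reviewed: date  # when the entry was last checked


def overdue(entries, today, max_age_days=90):
    """Return entries whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry in entries if entry.last_reviewed < cutoff]


# Illustrative entries only.
register = [
    RegisterEntry("Googlebot", "search", True, "/", "none",
                  "advisory", "SEO lead", date(2026, 6, 1)),
    RegisterEntry("GPTBot", "ai-train", False, "/research/", "paid crawl",
                  "technical", "Legal", date(2026, 1, 15)),
]

stale = overdue(register, today=date(2026, 6, 27))
print([entry.token for entry in stale])
```

Keeping the review date on each entry, instead of on the register as a whole, makes it possible to flag individual decisions that have gone stale while others stay current.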

Licensing signals are moving closer to the crawl path

RSL, or Really Simple Licensing, points to the next layer. It is presented as an open content licensing standard for the AI-first internet and lets publishers define machine-readable licensing terms. Those terms can include attribution, pay-per-crawl, and pay-per-inference compensation.

The important operational idea is simple: licensing is moving closer to the crawl path. Instead of relying only on private contracts or human-readable terms pages, a site can publish licensing information that automated systems can read.

For a business site, this does not mean every page needs a complex paid license. It means the organization needs to know which content has commercial value, which content should remain freely discoverable, and which content should require permission or a licensing route before AI use.

Product documentation, editorial archives, research reports, comparison pages, support content, and media files may deserve different treatment. The register is where those differences stop being assumptions and become decisions.

What Greg would audit

A practical audit starts with the current robots.txt file, then maps every relevant crawler token against the site's business goals. Googlebot, GoogleOther, Google-Extended, OAI-SearchBot, GPTBot, and ChatGPT-User should not be handled as one bucket. Each has a different purpose, and not every token is a fetching user agent. Google-Extended, for example, is a control token in robots.txt that governs how content Google has already crawled may be used. It is not a separate crawler.

The audit should also test whether the current rules match the intended content policy. A site may want ordinary search visibility but not training use. Another may want AI search visibility only for selected public pages. A third may need stronger controls around paid research, media files, or gated documentation.
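That comparison between intended policy and actual rules can be sketched as a small check. The function below is a hypothetical helper built on Python's standard `urllib.robotparser`; the robots.txt content, URLs, and intended decisions are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, intended):
    """Compare intended (token, url) -> allowed decisions with what robots.txt says.

    Returns a list of (token, url, expected, actual) tuples for every mismatch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for (token, url), expected in intended.items():
        actual = parser.can_fetch(token, url)
        if actual != expected:
            mismatches.append((token, url, expected, actual))
    return mismatches


# A legacy file that only blocks staging and has no AI-specific rules.
CURRENT_ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
"""

# What the business says it wants (illustrative).
INTENDED = {
    ("Googlebot", "https://example.com/guides/setup"): True,
    ("Googlebot", "https://example.com/staging/draft"): False,
    ("GPTBot", "https://example.com/research/report-2026"): False,
}

mismatches = audit_robots(CURRENT_ROBOTS_TXT, INTENDED)
for token, url, expected, actual in mismatches:
    print(f"{token} on {url}: expected allowed={expected}, robots.txt gives allowed={actual}")
```

In this made-up case the only mismatch is GPTBot on the research path: the business intended to block training access, but the legacy file never mentions it, so the wildcard rule lets it through.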

Next comes enforcement. Cloudflare's own guidance is clear that content signals are not anti-scraping controls. If the business expects actual restriction, then Cloudflare rules, WAF configuration, bot management, and server logs need to be part of the review. A register without monitoring is a snapshot. A register with evidence becomes an operational control.
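The log-review step can start as simply as counting requests per crawler token. The sketch below assumes access logs in the common combined format, where the user agent is the last double-quoted field. The sample lines, user-agent strings, and the set of blocked tokens are made up for illustration. Google-Extended is left out of the watch list because it is a control token and does not fetch pages itself.

```python
import re
from collections import Counter

WATCHED_TOKENS = ("Googlebot", "GoogleOther", "OAI-SearchBot", "GPTBot", "ChatGPT-User")

# In combined log format the user agent is the last double-quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(log_lines):
    """Count requests whose user-agent string contains a watched token."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in WATCHED_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [27/Jun/2026:10:00:00 +0000] "GET /research/a HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [27/Jun/2026:10:00:02 +0000] "GET /research/b HTTP/1.1" 200 4096 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [27/Jun/2026:10:01:00 +0000] "GET /guides/setup HTTP/1.1" 200 2048 "-" "ExampleAgent/1.0 (compatible; Googlebot/2.1)"',
    '192.0.2.9 - - [27/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = count_crawler_hits(SAMPLE_LOG)

# Tokens the register says should not be reaching content (illustrative).
blocked_tokens = {"GPTBot"}
unexpected = {token: count for token, count in hits.items() if token in blocked_tokens}
print(unexpected)
```

User-agent strings can be spoofed, so a count like this is a first pass. Where the result matters, it should be cross-checked against bot management data or the vendor's published verification methods.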

Finally, the audit should connect policy to licensing. If the site uses RSL-style terms, the register should show which content is covered, what use is permitted, what compensation or attribution condition applies, and who owns updates. If no machine-readable licensing exists, that is still a useful finding. It shows where the business is relying on informal assumptions rather than published terms.

The better question is not block or allow

The wrong boardroom question is: should we block AI crawlers?

The better question is: which uses of our content create value for us, and which uses need terms?

Search visibility can still matter. AI answer inclusion may become a discovery channel. Training use may be unacceptable without a deal. User-triggered requests may need to be treated differently from automatic crawling. These are policy decisions with technical consequences.

A crawler licensing register gives those decisions a place to live. It reduces the chance that SEO, infrastructure, legal, content, and commercial teams make isolated changes that cancel each other out. It also creates a cleaner conversation with AI companies, because permissions, restrictions, and licensing signals are documented before negotiation starts.

For GrN.dk clients, the useful outcome is not a fashionable AI policy. It is a working register: allowed crawlers, blocked uses, licensing signals, enforcement checks, and review ownership. That is how crawler governance becomes manageable instead of reactive.