How AI Publisher Licensing Deals Decide Which Brands Get Cited

June 4, 2026

Publisher licensing agreements determine which brands appear in ChatGPT and Gemini responses, not content quality alone. Here's how the mechanism works and how to audit where your brand stands.

Publisher licensing agreements largely determine which brands get cited in AI-generated answers.

Not content quality alone.

Not domain authority.

What matters is which publishers have licensed their content to AI companies, and whether those publishers write about you.

The pace of those agreements tells you how fast the stakes are rising. There were 12 publicly announced AI content licensing deals in 2023. By the end of 2025 that figure had reached 91. By mid-2026 it is projected to hit 127 — and almost all of the growth is in live retrieval and attribution deals, not historical training data. The platforms that decide which brands get cited are actively reshaping who supplies them with the content they pull from.

The Reddit deal is the most publicly reported example of how this works, and the Wall Street Journal, The Atlantic, and Dotdash Meredith have signed agreements of their own. Each agreement shapes what enters a model's training data, what surfaces when buyers query AI tools today, or both. The brands those publishers cover are the brands most likely to get cited.

Two distinct mechanisms govern this: foundational training data and real-time retrieval.

They have different timeframes and require different actions. Most brands conflate them, which is why most GEO (generative engine optimisation) strategies are optimising for the wrong one. Here is how they work, and how to audit where your brand stands.

Key Takeaways

Publisher licensing deals largely determine which brands AI tools cite, and most brands are optimising the wrong variable.
There are two distinct mechanisms: foundational training data and real-time retrieval. Each has a different timeframe and requires a different approach.
Between 40% and 60% of AI-cited sources rotate month to month. Brand citation is not a ranking you earn once.
A five-step audit gives you a measurable baseline for where your brand stands in AI-cited results today.

What Are AI Publisher Licensing Deals?

AI publisher licensing deals are commercial contracts between AI companies and content owners, granting AI labs the right to use published content for model training or real-time retrieval.

These are not the same as RSS feeds or standard crawl permissions. Crawl permission lets a crawler fetch and index pages, as a search engine does. A licensing deal goes further: it grants contractual rights to use that content as training material, to surface it in answers with attribution, or both. Training rights make the content part of the foundation from which the model learns about the world.

The pace of these deals has accelerated since 2023, driven by two pressures:

AI labs needed to expand training datasets beyond what open web crawls could provide.
Publishers threatened legal action after discovering their content had been used without permission or compensation.

The result is a commercial layer that now sits between published content and the AI systems your customers use to make purchase decisions.

The table in the next section lists the publicly reported deals. What it does not show is what the money actually buys, and which brands benefit as a result. That is exactly where most brand strategy goes wrong.

Not all these deals are the same. The early wave (2023–2024) was mostly about training data — AI companies paying for historical archives to build better models.

The newer deals look different. The Washington Post's April 2025 agreement with OpenAI explicitly excludes LLM training: OpenAI can surface Post content in ChatGPT answers and attribute it, but cannot use it to train future models. Le Monde wrote the same exclusion into its Perplexity deal. That distinction matters enormously for what these deals actually buy, and for brands trying to understand what signals the model is acting on when it answers a query about their space.

What Do Those Contracts Actually Buy?

The answer depends entirely on whether the deal is for training data or retrieval access. These are fundamentally different, and they affect your brand visibility in completely different ways.

A training data deal shapes what the model has learned before it answers any question. Call it long-term memory. Brands that appear consistently in that training material become part of the model's baseline understanding of the world — in plain terms, what it will say about your category without looking anything up.

A retrieval deal works differently. It means content is accessed in real time while the model is answering a specific query. This is what the industry calls retrieval-augmented generation, or RAG: the model searches for current sources to support its answer. The content does not change what the model knows permanently. It only shapes that one response.
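The retrieval step can be sketched in a few lines of Python. This is a toy illustration only: the documents, domain names, and keyword-overlap scoring are assumptions made for the example, and production systems typically rely on search indexes or embedding similarity instead. What it shows is the shape of the process: passages are fetched for one query and attached to the prompt, while the model itself is left unchanged.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share (toy keyword retrieval)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc["text"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; documents with no shared words are never included.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, sources):
    """Attach retrieved passages to the question; nothing is added to the model's training."""
    context = "\n".join(
        f"[{i + 1}] {s['source']}: {s['text']}" for i, s in enumerate(sources)
    )
    return f"Answer using the sources below and cite them.\n{context}\n\nQuestion: {query}"


# Hypothetical passages standing in for licensed or indexed content.
documents = [
    {"source": "news.example", "text": "New project management tools launched in 2025"},
    {"source": "forum.example", "text": "Thread comparing project management tools for remote teams"},
    {"source": "recipes.example", "text": "Weeknight pasta recipes"},
]

query = "project management tools launched in 2025"
sources = retrieve(query, documents)
prompt = build_prompt(query, sources)
print(prompt)
```

Only sources that make it into the retrieved set can be cited in that response, which is why presence on the platforms a system retrieves from matters for sourced answers.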

AI Platform	Publisher / Content Owner	Deal Value	Deal Type	Key Terms	Year
OpenAI	Associated Press	Undisclosed	Training data	Archive access (1985–present)	2023
OpenAI	Axel Springer (Politico, Business Insider)	Tens of millions of euros	Training + paywalled content	Revenue + ChatGPT product access	2023
OpenAI	News Corp (WSJ, The Times, NY Post, The Australian)	~$250M over 5 years	Training + live retrieval	Attribution, links, product collaboration	2024
OpenAI	Le Monde + Prisa Media	Undisclosed	Training + attribution	Logo, hyperlink in ChatGPT summaries	2024
OpenAI	Financial Times	Undisclosed	Training + live retrieval	Attributed content + product collaboration	2024
OpenAI	Reddit	~$70M/year (est.)	Live retrieval (Data API)	Real-time API access; advertising partnership	2024
OpenAI	Dotdash Meredith (People, AllRecipes, Investopedia)	Undisclosed	Training + attribution	Attribution + links in ChatGPT responses	2024
OpenAI	Vox Media (The Verge, Vox, New York Mag)	Undisclosed	Training + product	Technology collaboration; product co-development	2024
OpenAI	The Atlantic	Undisclosed	Training	165+ years of archive access	2024
OpenAI	TIME	Undisclosed	Training + live retrieval	101-year archive + real-time access; attribution	2024
OpenAI	Condé Nast (Vogue, Wired, The New Yorker)	Undisclosed	Training + live retrieval	Attribution + links in ChatGPT/SearchGPT	2024
OpenAI	Hearst (Good Housekeeping, Esquire, Cosmopolitan)	Undisclosed	Attribution	Citations across 20+ magazines, 40+ newspapers	2024
OpenAI	Stack Overflow	Undisclosed	Training	Discussions used to improve model performance	2024
OpenAI	The Guardian	Undisclosed	Attribution + content licensing	Summaries in ChatGPT; in-house AI product access	2025
OpenAI	Schibsted (VG, Aftenposten, Aftonbladet)	Undisclosed	Live retrieval	Real-time news in ChatGPT; attribution; analytics	2025
OpenAI	Axios	Undisclosed (+ local newsroom funding)	Attribution	Attributed summaries in ChatGPT; 4 local newsrooms funded	2025
OpenAI	The Washington Post	Undisclosed	Live retrieval only (training explicitly excluded)	ChatGPT summaries + attribution; no model training rights	2025
Google	Reddit	~$60M/year	Training + live retrieval	Efficient API access to Reddit content corpus	2024
Google	News Corp (WSJ, The Times, NY Post)	Undisclosed	Live retrieval	Premium news content in Google AI features	2024
Google	Associated Press	Undisclosed	Live retrieval	Real-time news feed for Gemini chatbot	2025
Google	The Guardian, Washington Post, FT, Der Spiegel, Times of India (+ others)	Undisclosed	Attribution (pilot program)	AI article overviews in Google News with attribution	2025
Microsoft	Taylor & Francis (academic publisher)	£8M (~$10M) / year	Training	Academic journal content for model training	2024
Microsoft	FT, Reuters, Axel Springer, Hearst, USA Today	Undisclosed	Live retrieval	News summaries and voice briefings via Copilot Daily	2024
Microsoft	HarperCollins	Undisclosed	Training	Nonfiction titles; optional author participation	2024
Microsoft	Associated Press, People Inc., USA Today Co.	Pay-per-use	Live retrieval	Publisher Content Marketplace — per-use payment model	2025
Amazon	New York Times	~$20–25M/year	Training + live retrieval	Editorial, cooking, sports content for Alexa + model training	2025
Amazon	Condé Nast + Hearst	Undisclosed	Live retrieval	Content for Rufus AI shopping assistant	2025
Amazon	Reach (Mirror, Daily Express, Daily Star)	Usage-based	Training	Nova AI model training; usage-based compensation	2026
Meta	Reuters	Undisclosed	Training + live retrieval	Fact-based news content for Meta AI	2024
Meta	CNN, Fox News, People Inc., USA Today Co. (+ 3 others)	Undisclosed	Training (Llama LLM)	Real-time news summaries + outbound links	2025
Meta	News Corp (WSJ, The Times, NY Post, The Australian)	Up to $50M/year	Content licensing	US and UK archive access; multi-year	2026
Perplexity	TIME, Der Spiegel, Fortune, Entrepreneur, Texas Tribune, Automattic	Revenue share	Attribution	Ad revenue share when content is cited + analytics	2024
Perplexity	LA Times, The Independent, Lee Enterprises (200+ local titles), Adweek (+ others)	Revenue share	Attribution	Ad revenue share; API access; publisher analytics	2024
Perplexity	Le Monde	Undisclosed	Live retrieval (training excluded)	Answer generation; model training explicitly excluded	2025
Perplexity	Gannett (USA Today + 200+ local newsbrands)	Undisclosed	Live retrieval	Local news in Perplexity search; API access; analytics	2025
Perplexity	Getty Images	Undisclosed	Image licensing	Licensed image display with attribution and source links	2025
ProRata.ai (Gist)	FT, Axel Springer, The Atlantic, Fortune, Reuters, Hearst (then 500+ via News/Media Alliance)	50% revenue share	Attribution	Proprietary revenue allocation per content used; opt-in model	2024–2025
Mistral	Agence France-Presse (AFP)	Undisclosed (multi-year)	Live retrieval	2,300 daily stories in 6 languages for Le Chat chatbot	2025
xAI (Grok)	X (Twitter) — proprietary data	N/A (internal)	Training + live retrieval	Real-time social data from the X platform; no external deal structure	2023–present

Sources: Press Gazette, Digiday, Rob Kelly / Media & the Machine (June 2026 update), AI Watch Dog. Deal values where undisclosed reflect confidential terms; estimates sourced from reported figures. Table current as of June 2026.

The Reddit deal is primarily a retrieval deal. OpenAI gained access to Reddit's Data API, letting its models pull current Reddit discussions during queries. That is a very different thing from Reddit content being permanently encoded in the model's knowledge.

The News Corp deal is structured differently and is the clearest case study in what dual-platform positioning looks like. The company holds a deal with OpenAI worth approximately $250M over five years, covering the Wall Street Journal, The Times of London, the New York Post, and The Australian. It also holds a separate deal with Meta worth up to $50M per year. That is one media portfolio commanding meaningful fees from two of the largest AI deployments in the world, simultaneously. The Journal's editorial voice appears in ChatGPT when users ask about financial topics. It appears in Meta AI when users ask news-adjacent questions. The brands that get cited in both contexts are not there by accident. They are there because the publications that cover them are contractually part of the retrieval layer.

Amazon's arrival as a major licensing player in 2025 changed the dynamic. The New York Times deal (reported at $20–25M per year) feeds Alexa summaries and Amazon's proprietary model training simultaneously. Condé Nast and Hearst signed separately with Amazon for the Rufus shopping assistant.

If your brand operates in categories those publishers cover (lifestyle, home, technology, finance), the content pipeline feeding Amazon's AI now has a defined supply chain. Your brand either has a presence within it through citations in those publications or it does not.

For brand visibility, the practical difference is this:

Training data shapes the model's unprompted recommendations.
Retrieval data shapes the model's sourced answers.

Your strategy needs to address both. Most current frameworks only address one.

How Do AI Publisher Deals Affect Your Brand Visibility?

Now you know the two types of deal. Here is what that means for your brand.

Brand visibility in AI search comes from three intersecting sources:

What the model already knows (foundational knowledge)
What it retrieves during a query (RAG)
How it interprets the context of the user's intent

If your brand has a sustained presence in high-authority publications included in AI training data, the model associates you with your category before it retrieves anything. If your content appears on platforms AI systems use for live retrieval, such as Reddit, YouTube, and licensed news outlets, you show up in cited answers. If neither applies, your brand exists in the model's world only to the extent that your website or secondary mentions provide enough signal to surface you at all.

Most brands are absent from both.

The competitive implication is direct. Brands with strong foundational presence get recommended in responses where the model never searches for anything. As AI assistants get better at answering from their own knowledge, foundational presence becomes the more valuable asset. Brands that only exist in retrieval results will become invisible for the queries that matter most.

What Is the Difference Between Training Data and RAG?

Training data is what the model carries into every conversation. RAG is what it goes looking for during one.

The practical difference shows up the moment you ask an AI a category question. Ask ChatGPT: "What's the best project management software for a remote team?" In many responses, Asana or Notion appear without a single cited source. That is foundational knowledge at work. Those brands have been in enough high-authority content, over enough time, that the model associates them with that category by default. It does not need to search. It already knows.

Now ask: "What project management tools launched in 2025?" The model will typically search and return results with citations from recent articles. That is RAG. The model went looking because its foundational knowledge has a cut-off date.

The difference matters for strategy because the two types of visibility require different actions:

Foundational visibility comes from sustained presence in high-authority content over months and years. It builds slowly and compounds.
Retrieval visibility comes from being on the right platforms with the right content structure. It can be improved quickly but changes constantly.

Most brands chase retrieval visibility instead of foundational presence because it is faster to act on and easier to measure. Foundational visibility is the harder, slower work. It is also the work that builds a position competitors cannot copy in a quarter.

Real-time optimisation cannot fully compensate for a weak training data footprint. Evertune reports that this holds when base model responses, queried through direct model API access, are tested separately from search-augmented ones.

What Are the Four Signals That Drive AI Citations?

The signals that determine whether your brand appears in AI-generated answers fall into four categories. Most GEO frameworks cover three. The fourth determines long-term, unprompted visibility.

Entity clarity

AI models organise their knowledge around entities: named companies, people, products, and concepts with consistent, verifiable attributes. A brand that appears across sources with a consistent name, description, and category is easier for the model to represent accurately.

Entity clarity comes from structured data (schema markup on your site), Wikipedia presence, consistent information across business directories, and persistent brand language across authoritative publications. Without it, the model may hold fragmented knowledge of your brand, or in some cases, quietly merge you with a competitor in the same category.
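As one illustration of the structured-data part of entity clarity, the sketch below builds a schema.org Organization block in Python and prints it as JSON-LD. Every value (company name, URLs, founding year, Wikidata ID) is a placeholder invented for the example. The property names used (name, url, description, foundingDate, sameAs) are standard schema.org Organization properties.

```python
import json

# Hypothetical brand details: replace every value with your own.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://www.example.com",
    "description": "Project management software for remote teams.",
    "foundingDate": "2015",
    # sameAs links this entity to its profiles on other sources,
    # so the same name, category, and facts can be matched across them.
    "sameAs": [
        "https://www.linkedin.com/company/example-co",
        "https://www.crunchbase.com/organization/example-co",
        "https://www.wikidata.org/wiki/Q0000000",
    ],
}

json_ld = json.dumps(organization, indent=2)
print(json_ld)
```

The printed JSON would typically be placed on the site inside a script tag of type application/ld+json. The point is less the markup itself than keeping these values identical everywhere the brand is described.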

Content extractability

AI systems favour content that leads with the answer. A paragraph that buries its key claim in the fourth sentence is harder for the model to incorporate into a synthesised response than one that opens with the point and elaborates afterwards.

This is not about bullet-pointing everything. It is about removing the narrative build-up that works for human readers but adds noise for AI systems trying to extract a usable answer.

Platform authority

AI models pull from specific sources: Reddit threads, YouTube transcripts, LinkedIn articles, peer-reviewed publications, and licensed news content. A brand that participates genuinely in these environments builds a presence across the platforms AI systems trust.

Reddit matters here not simply because OpenAI paid for the API. Real user discussions about your brand are exactly the kind of third-party signal AI models are designed to surface.

Foundational model footprint

Most GEO frameworks leave this signal out. It builds through sustained presence in high-authority, training-eligible content over time: coverage in publications AI labs license, academic references, media mentions, and consistent brand association across years of authoritative text.

Building foundational model knowledge is a 12 to 18-month project, not a campaign. Brands that start now earn a compounding advantage over those that treat GEO as something to revisit next quarter.

Why Does Traditional SEO Fall Short in AI Search?

Traditional SEO is built on ranking logic. A piece of content earns a position in a search index through signals an algorithm can evaluate: backlinks, relevance, click-through rate, site health.

AI search has no ranking index. It has a generation process. The model does not sort your content against competitors and return the best match. It synthesises a response from what it already knows and what it retrieves. The competition is not for rank position. It is for inclusion in that pool of knowledge and sources. The data makes this concrete: 62% of pages cited in AI Overviews do not rank in Google's top 10 for the same query, a finding we document in our competitor gap analysis for AI search.

The measurement gap is just as sharp. Traditional SEO gives you rankings, traffic, and click data. AI visibility gives you none of those by default. A brand can appear in thousands of AI responses per month and record zero corresponding traffic, because the user got a complete answer without clicking anywhere.

Then there is the volatility problem. Between 40% and 60% of AI-cited sources rotate out month to month. AI visibility is not a position you defend. It is a share of voice you actively manage. You can only manage what you measure, and most brands are currently measuring neither.
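A rough way to measure that rotation for your own query set is to compare the sources cited in two consecutive months. The sketch below assumes one possible definition (the share of last month's cited sources that no longer appear this month) and uses made-up domains. Other definitions are possible, so treat it as an illustration rather than the method behind the 40% to 60% figure.

```python
def rotation_rate(previous_sources, current_sources):
    """Share of last month's cited sources that are no longer cited this month."""
    previous = set(previous_sources)
    if not previous:
        return 0.0
    dropped = previous - set(current_sources)
    return len(dropped) / len(previous)


# Hypothetical domains cited for the same query set in two consecutive months.
may = ["a.example", "b.example", "c.example", "d.example", "e.example"]
june = ["a.example", "b.example", "f.example", "g.example", "c.example"]

rate = rotation_rate(may, june)
print(f"{rate:.0%} of May's sources rotated out in June")
```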

Can Brands Without Publisher Deals Still Win AI Visibility?

Yes. The licensing deals shape the playing field. They do not close it to everyone else.

Publisher deals primarily determine which content sources AI models treat as high-authority training material and trusted retrieval targets. Brands that appear consistently in those sources benefit. But the system is not sealed. AI models still pull from the open web, surface community discussions, and carry knowledge from sources that predate any licensing arrangement.

The Samsung example makes this concrete. Future Publishing ran a GEO campaign for Samsung using its own AI visibility tooling, and reported 28% growth in AI citations from Future-sourced content over three months. Samsung did not sign a deal with OpenAI or Google. The result came from structured content, platform leverage, and consistent presence in the right places.

A smaller-scale version of the same logic works for mid-sized brands. A B2B software company with a well-maintained LinkedIn presence, genuine participation in relevant Reddit communities, and a website structured for content extraction can build meaningful AI visibility without a publishing deal or a seven-figure content budget. It takes longer than a campaign. It compounds in a way that a campaign cannot.

The path runs through the four signals above: entity clarity, content extractability, genuine platform presence, and consistent media coverage in publications that appear in AI training and retrieval datasets. None of those require a licensing contract.

How Do You Audit Your AI Visibility?

Run a five-step quarterly audit to establish where your brand actually stands, rather than optimising from an assumed baseline.

Most GEO content describes optimisation tactics without first measuring the gap those tactics are meant to close. Auditing before optimising changes the work from guesswork to measurement.

Step 1: Baseline your citation rate

Query each major AI platform (ChatGPT, Perplexity, Gemini, Claude) with 10 to 15 category-level questions relevant to your product or service. Record whether your brand appears, how it is described, and whether the citation includes a source or appears to draw from foundational knowledge.

Run each query in a fresh conversation thread, in incognito mode. AI responses vary, so run each query three times and average the results. This gives you a citation rate baseline: the percentage of relevant AI queries in which your brand surfaces.
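The averaging described above can be done in a spreadsheet or in a few lines of Python. The sketch below assumes you have logged, for each query, whether the brand appeared in each of the three runs. The queries and results are invented for illustration.

```python
from statistics import mean

# Hypothetical audit log: for each query, whether the brand appeared
# in each of three fresh-thread runs (True = brand mentioned).
runs = {
    "best project management software for remote teams": [True, False, True],
    "project management tools for agencies": [False, False, False],
    "alternatives to spreadsheets for task tracking": [True, True, True],
}


def citation_rate(runs_by_query):
    """Average, across queries, of the share of runs that mention the brand."""
    per_query = [
        mean(1 if hit else 0 for hit in hits) for hits in runs_by_query.values()
    ]
    return mean(per_query)


rate = citation_rate(runs)
print(f"Citation rate: {rate:.0%}")
```

Keeping the per-run results, rather than only the final percentage, makes it easier to rerun the same query set next month and compare like with like.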

Step 2: Audit your entity footprint

Check for consistent brand representation across the following sources: Google Knowledge Panel, Wikipedia (if applicable), Wikidata, Crunchbase, your LinkedIn company page, and the schema markup on your own site.

Inconsistencies in brand name, category, or founding year are signals of entity ambiguity that AI models may carry into their answers.
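Once the values from each source are written down, spotting mismatches is mechanical. The following sketch assumes the attributes were copied by hand from each profile. The source names, field names, and values are hypothetical.

```python
# Hypothetical values copied by hand from each source during the audit.
profiles = {
    "website_schema": {"name": "Example Co", "category": "Project management software", "founded": "2015"},
    "linkedin": {"name": "Example Co", "category": "Project management software", "founded": "2015"},
    "crunchbase": {"name": "ExampleCo Inc.", "category": "Project management software", "founded": "2016"},
}


def find_inconsistencies(profiles_by_source):
    """Return each attribute with more than one distinct value, with the value per source."""
    fields = {field for profile in profiles_by_source.values() for field in profile}
    conflicts = {}
    for field in sorted(fields):
        values = {
            source: profile.get(field)
            for source, profile in profiles_by_source.items()
        }
        if len(set(values.values())) > 1:
            conflicts[field] = values
    return conflicts


conflicts = find_inconsistencies(profiles)
for field, values in conflicts.items():
    print(field, values)
```

This exact-match comparison is deliberately strict: it flags "ExampleCo Inc." against "Example Co" even though a human would read them as the same company, which is the kind of ambiguity the audit is meant to surface.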

Step 3: Audit content extractability

Take your 10 highest-traffic pages and evaluate each for extractability. Does each paragraph lead with its conclusion? Is the key claim in the first sentence? Are headings descriptive? Are FAQs answered directly and completely?

Score each page on a simple four-point rubric per dimension and identify where the extractability gap is largest. Those pages are your first optimisation priority.
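A minimal sketch of that scoring follows. It assumes a scale of 1 (poor) to 4 (strong) on the four questions listed above; the dimension names, page URLs, and scores are illustrative, not a fixed standard.

```python
# The four dimensions mirror the questions in this step; names are illustrative.
DIMENSIONS = [
    "conclusion_first",
    "claim_in_first_sentence",
    "descriptive_headings",
    "direct_faq_answers",
]
MAX_SCORE = 4  # assumed four-point scale: 1 = poor, 4 = strong

# Hypothetical reviewer scores for three pages.
pages = {
    "/pricing": {"conclusion_first": 4, "claim_in_first_sentence": 3, "descriptive_headings": 4, "direct_faq_answers": 2},
    "/blog/guide": {"conclusion_first": 2, "claim_in_first_sentence": 1, "descriptive_headings": 3, "direct_faq_answers": 2},
    "/features": {"conclusion_first": 3, "claim_in_first_sentence": 3, "descriptive_headings": 2, "direct_faq_answers": 4},
}


def extractability_gap(scores):
    """Total points missing from a perfect score across all dimensions."""
    return sum(MAX_SCORE - scores[dimension] for dimension in DIMENSIONS)


# Pages with the largest gap come first: these are the first optimisation priority.
ranked = sorted(pages, key=lambda url: extractability_gap(pages[url]), reverse=True)
for url in ranked:
    print(url, extractability_gap(pages[url]))
```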

Step 4: Audit platform presence

Search for your brand name on Reddit, YouTube, and in Google News over the past 90 days. Count authentic third-party mentions. Note which platforms have zero mentions.

Business listings are an overlooked gap here: they account for 42% of AI citations across B2B categories, nearly matching websites at 44%. Absence from these sources is a specific, fixable gap in your retrieval visibility.

Step 5: Measure monthly AI share of voice

Run your citation rate audit monthly. Track your brand alongside two to three direct competitors using the same query set.

This gives you an AI share of voice metric that shows how your visibility is moving relative to the competitive landscape, not just against an absolute baseline. The total time investment for a quarterly audit is four to six hours. What it gives you is something most brands currently do not have: a number.
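One simple way to turn the monthly counts into a share of voice figure is shown below. It assumes share of voice means each brand's mentions divided by total mentions across the tracked brands. The brand names and counts are invented, and other definitions (for example, comparing citation rates directly) would also be reasonable.

```python
# Hypothetical monthly counts: AI responses, from the same query set,
# that mention each tracked brand.
mentions = {"Your Brand": 12, "Competitor A": 30, "Competitor B": 18}


def share_of_voice(mentions_by_brand):
    """Each brand's mentions as a fraction of all tracked-brand mentions."""
    total = sum(mentions_by_brand.values())
    if total == 0:
        return {brand: 0.0 for brand in mentions_by_brand}
    return {brand: count / total for brand, count in mentions_by_brand.items()}


shares = share_of_voice(mentions)
for brand, share in shares.items():
    print(f"{brand}: {share:.0%}")
```

Because the figure is relative, it can fall even when your own citation count rises, which is the signal an absolute baseline would miss.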

What Does the Shifting Licensing Landscape Mean for Your Strategy?

The licensing landscape is not static, and brands that treat it as fixed will be managing yesterday's competitive environment.

The EU AI Act now requires providers of general-purpose AI models to publish a summary of the content used for training and to maintain a policy for complying with EU copyright law, including rights-holder opt-outs. Separate transparency provisions in the Act cover the labelling of AI-generated content. Non-compliance by general-purpose model providers can result in fines of up to 3% of annual worldwide turnover or €15 million, whichever is higher. This creates a compliance incentive for AI labs to formalise their data relationships, which means the published deal landscape will become more transparent over the next two years, not less.

The global market for AI training dataset licensing was valued at $4.8 billion in 2025 and is projected to reach $22.6 billion by 2034, growing at 18.8% per year. That growth reflects both increased AI development activity and the formalisation of data supply chains that were previously informal or legally ambiguous.
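The three figures in that projection are consistent with each other, which a quick compound-growth check confirms. The sketch below uses only the numbers quoted above and assumes nine compounding years (2025 to 2034).

```python
start_value = 4.8     # USD billions in 2025 (figure from the text)
end_value = 22.6      # USD billions projected for 2034 (figure from the text)
growth_rate = 0.188   # 18.8% per year (figure from the text)
years = 2034 - 2025   # assumed nine compounding periods

# Forward check: compound the 2025 value at the stated rate.
projected = start_value * (1 + growth_rate) ** years

# Reverse check: the annual rate implied by the start and end values.
implied_rate = (end_value / start_value) ** (1 / years) - 1

print(f"Projected 2034 value: ${projected:.1f}B")
print(f"Implied annual growth: {implied_rate:.1%}")
```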

Deals will multiply, and the content inside them will turn over as new publications are added and existing agreements renegotiated. A brand that builds visibility through a single publication or platform may find that visibility shift as the licensing structure changes beneath it.

The brands best insulated from that volatility are those with strong foundational model knowledge, already embedded in the model's learned understanding rather than subject to ongoing deal changes, and broad platform presence across multiple retrieval sources. Breadth across many sources, rather than depth in a single one, is the more defensible position.

FAQs

What is a publisher licensing deal in the context of AI?

A publisher licensing deal is a commercial agreement giving an AI company the right to use a publication's content for model training or real-time retrieval. These deals differ from standard web crawl permissions because they carry formal rights, fees, and attribution terms negotiated directly between parties.

Does appearing on Reddit guarantee AI visibility?

No. Reddit content is a retrieval source for some AI platforms, meaning your brand can appear in Reddit-sourced answers. Consistent, authentic community participation matters more than simply being present on the platform. Promotional or keyword-stuffed posts are unlikely to be surfaced as high-quality signals.

How long does it take to build AI visibility?

Retrieval visibility through optimised content can show results in four to twelve weeks. Foundational model knowledge, built through sustained presence in high-authority, training-eligible publications, accumulates over 12 to 18 months. Both are worth pursuing at the same time, for different reasons.

How is AI visibility measured?

AI visibility is measured through citation rate (how often your brand appears in AI answers to relevant queries), share of voice (your citation rate relative to competitors), and sentiment (whether mentions are positive, neutral, or negative). Traditional analytics tools do not capture this data. Dedicated AI visibility platforms are emerging to fill the gap.

Are AI publisher deals accessible to smaller brands?

Direct licensing deals between small brands and AI labs are not currently available. The downstream effects are accessible through indirect means: appearing in publications covered by licensing agreements, building entity clarity, optimising content extractability, and maintaining genuine platform presence. None of those require a direct commercial relationship with an AI company.

The Reddit deal made headlines. What rarely makes headlines is the mechanism by which that deal, and dozens like it, determines which brands your customers encounter when they ask an AI what to buy. Getting that right requires understanding the difference between what AI models know and what they look up, building a strategy that addresses both, and measuring where you actually stand before optimising anything. Brands that start on this now will not be scrambling to catch up in 18 months.

Want to know where your brand stands in AI search right now? Tenpoint Labs runs AI visibility audits for B2B and consumer brands. Get in touch to start with a baseline.

Angelique Haughey

Angelique Haughey is a senior SEO and content strategist at Tenpoint Labs. She has over a decade of experience in organic search, from keyword and intent strategy to content systems built to rank, across retail, medical, and B2B. She writes about the shift from traditional SEO to AEO and GEO.