RAG Optimization

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique where a language model retrieves relevant documents or text chunks from an external database before generating a response. Instead of relying solely on knowledge from training, the model augments its answer with up-to-date, source-specific content.

This is how most production AI assistants work in 2026:

User query
    ↓
Retrieve relevant chunks from indexed content (vector database)
    ↓
Inject chunks into LLM context window
    ↓
LLM generates answer grounded in retrieved content
    ↓
Answer with citations

Your website content can be part of these pipelines — if it is structured for retrieval.

The chunking problem

Before content is stored in a vector database, it is split into chunks — typically 512 to 2,000 tokens. The chunking strategy determines whether the retrieved content is coherent and useful.

The problem: If your page is one long wall of text, it will be chunked at arbitrary token boundaries, splitting mid-sentence or mid-argument. The retrieved chunk becomes semantically incomplete.

The solution: Write content in self-contained sections that can be chunked naturally along heading boundaries.

Structuring content for retrieval

Use headings as chunk boundaries

Each ## Heading ideally contains a self-contained unit of information. When a RAG system chunks your page by heading, each chunk is semantically complete:

## What is robots.txt?

robots.txt is a plain-text file placed at your domain root...

## How to block an AI crawler

To block GPTBot, add the following to your robots.txt...

## Which crawlers respect robots.txt?

Most major AI crawlers — GPTBot, ClaudeBot, PerplexityBot — honor robots.txt...

Each section can be retrieved independently without losing meaning.

Keep chunks focused on a single concept

Avoid mixing unrelated concepts within a section. If a section covers both "what is X" and "how to configure X" and "X pricing", split it into three sections.

Include context at the top of each section

RAG systems often retrieve a chunk without surrounding text. Start each section with a brief statement of what it covers:

Without context:

To enable this, navigate to the dashboard and toggle the setting.

With context:

To enable Markdown for Agents on Cloudflare, navigate to your zone dashboard and toggle the Markdown for Agents setting in Quick Actions.

Use explicit entity names

Avoid pronouns and implicit references within sections:

Avoid: "It supports multiple document types." Better: "Workers AI's AI.toMarkdown() supports multiple document types."

Metadata that improves retrieval accuracy

When content is indexed in a vector store, the chunk is often stored with metadata. The metadata enables filtered retrieval ("find recent docs about robots.txt from web4agents.com"):

Metadata field	Source	How to improve
Title	`<title>` or `# H1`	Make it descriptive and unique
URL	Canonical URL	Use a clean, descriptive URL slug
Publication date	JSON-LD `datePublished`	Always include it
Author	JSON-LD `author`	Include for E-E-A-T signals
Content type	Schema.org `@type`	`Article`, `TechArticle`, `FAQPage`

Semantic density

Semantic density is the ratio of meaningful information to total tokens. High-density content is preferred by RAG systems because:

It maximizes information per token
It fills the context window with more useful content
It reduces the chance that a chunk is "noise"

How to increase semantic density:

Remove filler phrases ("It is worth noting that...", "As we can see...")
Eliminate repetitive introductions ("In this section, we will explain...")
Replace vague adjectives with specific values ("very fast" → "under 100ms")
Cut any content that does not add information for the reader

FAQ sections for question-answering pipelines

Question-answering RAG systems look for content that matches the structure of a question and its answer. An FAQ section is the clearest possible signal:

## Frequently Asked Questions

**Can I have different Content Signals policies per page?**
Yes. Set the `Content-Signal` header per route in your server or CDN configuration.

**Does llms.txt replace robots.txt?**
No. They serve complementary purposes: robots.txt controls crawler access; 
llms.txt provides a structured content summary for AI systems.

FAQ items are also excellent candidates for FAQPage Schema.org markup, which generates rich results in traditional search.

Structured data accelerates RAG indexing

When an AI agent or RAG indexer parses your page, JSON-LD provides a pre-extracted, machine-readable version of your key facts. This is faster and more reliable than parsing prose:

{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "RAG Optimization",
  "description": "How to optimize your content for Retrieval-Augmented Generation pipelines.",
  "datePublished": "2026-02-01",
  "author": { "@type": "Organization", "name": "Web4Agents" },
  "keywords": "RAG, retrieval, AI agents, vector database, chunking"
}

Serve Markdown directly

RAG pipelines typically convert HTML to text before indexing. By serving text/markdown responses (via Accept: text/markdown content negotiation), you let the pipeline skip the HTML parsing step and receive clean, structured content directly.

See Markdown for Agents for implementation details.

Summary checklist

Content organized in self-contained sections with descriptive headings
Each section opens with a context statement
Explicit entity names (no ambiguous pronouns)
JSON-LD with headline, description, datePublished, author
FAQ section for common questions about the topic
Accept: text/markdown response available (via Cloudflare or custom)
High semantic density (remove filler, be specific)
URL slug reflects the content topic exactly