January 30, 2026


Like most within the circles I run in, I have spent a fair chunk (no pun intended) of time this year looking more closely at how LLMs read and interpret web content.
Most of that time has been spent weighing what I kept seeing across the SEO industry. LinkedIn posts, conference sessions and webinars, including Sitebulb’s recent AI and LLM SEO webinars, all got me thinking.
My approach is simple. If something does not add up, I pull it apart and try to understand how it actually works. For example, I hear people say that JSON-LD (structured data) is important for LLM optimisation, and at the same time, I keep reading that script tags are stripped out before chunking. I also came across arguments that say if the information in JSON-LD already appears on the page, then including it again creates extra noise in the embedding layer. These things do not all align, so I try to make sense of it. The more I read and research, the clearer the gap becomes between what the industry is saying and what actually seems to happen in the LLM ingestion pipeline.
When content is prepared for an LLM, the extraction step removes most of the markup. The LLM does not see the page as we see it. It only sees whatever remains after extraction.
From everything I have read, the elements that consistently survive are the ones that carry visible text: headings, paragraphs, list items, table content and anchor text.
Most other HTML elements, including formatting tags, semantic cues and layout containers, do not appear in the extracted text the model works with. This matches what multiple scraping and LLM ingestion guides show.
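As a rough sketch of what that extraction step does, here is a minimal parser built on Python's standard library. Real pipelines (trafilatura, Readability and the like) are far more sophisticated, and the markup below is invented for illustration, but the principle is the same: visible text survives, and anything inside a script tag, including JSON-LD, vanishes before chunking ever starts.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep visible text; drop script/style content entirely."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only record text that sits outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return "\n".join(self.parts)

page = """
<article>
  <h1>Chunking and LLMs</h1>
  <p>The model only sees <strong>extracted</strong> text.</p>
  <script type="application/ld+json">{"@type": "Article"}</script>
</article>
"""
extractor = TextExtractor()
extractor.feed(page)
print(extractor.text())
# The JSON-LD block does not appear in the output, and the <strong>
# emphasis is flattened into plain text.
```

Note what happens to the formatting tags: the words inside them survive, but the emphasis itself leaves no trace in the extracted stream.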
Do you remember when bold <b> and <strong> tags were seen as an influencing factor in Google organic search? They’re gone in the age of LLMs…
This is where the confusion in the industry really shows. I have seen people talk as if JSON-LD is essential for LLM-driven visibility. But JSON-LD lives inside a script tag. And everything I have read about extraction pipelines says that script tags are removed before chunking happens. So the model does not appear to see the schema block at all.
There is also another angle I came across. If the structured data (schemas) repeats information that is already visible on the page, then for embedding-based systems, it becomes duplication and extra tokenisation that does not help the model understand anything new.
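A crude way to see the duplication point, assuming whitespace tokens as a stand-in for a real tokeniser, and using invented example strings: if the JSON-LD values simply mirror the visible copy, inlining them costs tokens without introducing anything new.

```python
# Hypothetical on-page copy and the same facts restated as JSON-LD values.
visible = "chunking and llms by jane doe"
jsonld_values = "chunking and llms jane doe"

tokens_before = visible.split()
tokens_after = (visible + " " + jsonld_values).split()

# Terms the duplicate adds that were not already on the page.
new_terms = set(tokens_after) - set(tokens_before)
print(len(tokens_after) - len(tokens_before), sorted(new_terms))
# → 5 extra tokens, zero new terms
```

The extra tokens buy nothing: every term in the schema values already appears in the visible text, which is the duplication argument in miniature.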
So you end up with two conflicting narratives:
- The industry view: JSON-LD is essential for LLM-driven visibility.
- The ingestion view: script tags, and the JSON-LD inside them, are stripped before chunking, so the model never sees the schema block.
When you put those two views next to each other, the LLM side does not support the idea that JSON-LD contributes anything unless the data is visible in the text. That is my interpretation based on the sources, not a claim of absolute fact. But the logic holds up.
Chunking is the bit that sits between extraction and the model actually doing any work. Once the page has been flattened, the LLM does not read it as one long stream. It reads it in chunks of cleaned text.
If the content has sensible paragraphs, clear breaks and a logical flow, those chunks tend to fall in the right places. Related sentences stay together, and the model has enough context to understand what is going on. If the structure is messy, too dense or all over the place, the chunks cut across ideas in awkward ways, and the model ends up with weaker context to work with.
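To make that concrete, here is a small sketch of paragraph-aware chunking under an assumed character budget (real systems budget in tokens, and the splitting logic is far more involved). Whole paragraphs are packed into chunks, so a boundary never falls mid-idea.

```python
def chunk_paragraphs(text, max_chars=500):
    """Split on blank lines and pack whole paragraphs into chunks.
    max_chars is a crude stand-in for a token budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk rather than cut through a paragraph.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = (
    "LLMs read extracted text in chunks.\n\n"
    "Clear paragraph breaks keep related sentences together.\n\n"
    "Messy structure forces boundaries mid-idea."
)
for i, chunk in enumerate(chunk_paragraphs(doc, max_chars=80)):
    print(i, repr(chunk))
```

Run on a page with clean paragraph breaks, each chunk holds a complete thought. Feed it one dense wall of text instead and there is nothing sensible to split on, which is exactly the failure mode described above.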
The way I see it, chunking improves the model’s ability to interpret your content. It does not improve the content itself. It is a machine level optimisation that happens before any reasoning starts. It changes the input, not the writing.
That is why a lot of the advice about “writing for GEO” does not sit right with me. Much of it is not about making better content. It is about making the LLM’s preprocessing layer’s job easier.
There is also another practical impact, and a very real risk, to consider: optimising a page for LLM chunking does not always align with improving the experience for users. LLMs prefer shorter paragraphs, frequent headings and a very predictable structure. That does not always create the best reading experience.
If that harms user engagement, it can work against behaviour-driven feedback systems like NavBoost. In other words, improving something for the machine may weaken the signals that matter for traditional rankings. This is something the industry has not really squared yet. SEOs like Lily Ray get it.
Anyway, these are my own interpretations and takeaways from looking into how LLM ingestion likely works; take them with a pinch of salt. There is a lot of noise in the industry at the moment, and some of the ideas that get repeated simply do not line up with how extraction and embedding systems behave.
For me, the most useful part of this process has been understanding what actually reaches the model. It becomes much easier to separate tactics that help users from tactics that only help the machine. Writing about these ideas also helps to solidify my stance.