- March 24, 2025
If you’ve been in SEO for a while, you’ll know that keyword cannibalisation is one of those annoying issues that can quietly wreck your rankings. Two or more of your own pages end up competing against each other in Google, causing them to flip-flop in rankings or, worse, get buried entirely.
For years, we’ve relied on rank tracking, manual audits, and Search Console data to spot it. But by the time you see pages swapping positions or losing clicks, it’s already too late: Google has noticed, and your site is paying the price.
So, I’ve been experimenting with embeddings as a preemptive strike. Instead of waiting for ranking drops, I’m using cosine similarity scores to detect pages that are semantically too close before they start fighting for the same rankings. It’s still a work in progress, but I’m already seeing how this could change the way we tackle cannibalisation.
Embeddings are basically a way to represent words, phrases, or entire pages as numerical vectors in a high-dimensional space. Instead of just looking at exact-match keywords, embeddings help Google understand the meaning behind content.
For example, a handful of differently worded phrases – think of the various ways someone might search for a small-space sofa – can all mean essentially the same thing. Traditional SEO tools might treat them as separate keywords, but embeddings group them together based on meaning, which is exactly what Google’s algorithms do when ranking pages.
The problem? If your site has multiple pages with similar semantic meaning, they can start cannibalising each other. Google struggles to decide which one to rank, and as a result, neither of them performs as well as it could.
This is where cosine similarity comes in.
Cosine similarity is a metric that tells us how close two pieces of content are in meaning. A score of 1.0 means the two are, as far as the model can tell, identical in meaning, while a score near 0.0 means they have essentially nothing in common.
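If you want to see the maths without any libraries in the way, it’s just the dot product of two vectors divided by the product of their lengths. The four-dimensional vectors below are toy numbers purely for illustration – real embeddings run to hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings", purely for illustration
page_a = np.array([0.9, 0.1, 0.4, 0.3])
page_b = np.array([0.8, 0.2, 0.5, 0.3])  # very similar meaning
page_c = np.array([0.1, 0.9, 0.0, 0.2])  # different topic

print(cosine_similarity(page_a, page_b))  # high, close to 1.0
print(cosine_similarity(page_a, page_c))  # much lower
```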
I recently created and published a Python script that automates this entire process using sentence-transformers to calculate cosine similarity between embeddings.
Get the script here: https://chrisleverseo.com/forum/t/detecting-duplicate-content-and-off-topic-pages-using-python-ai-nlp.130/
Here’s how I’m using it:
I run my key pages through sentence-transformers to get numerical representations of their content. You can also use BERT, OpenAI, or Hugging Face models for this.
I calculate cosine similarity between all the pages targeting related keywords. If two pages have a high similarity score (e.g., 0.85 or above), that’s a red flag.
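Here’s a minimal sketch of those two steps, assuming you’ve already pulled the main body copy of each page into a URL-to-text dictionary (the URLs, the model choice, and the 0.85 threshold are just my working assumptions – swap in your own):

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

# Assumed input: URL -> main body copy, scraped however you like
pages = {
    "/sofas/small-spaces-guide": "…page copy…",
    "/sofas/small-space-buying-guide": "…page copy…",
    "/sofas/leather-vs-fabric": "…page copy…",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-transformers model works
urls = list(pages.keys())
embeddings = model.encode([pages[u] for u in urls], normalize_embeddings=True)

THRESHOLD = 0.85  # my current red-flag line; tune it for your site

for i, j in combinations(range(len(urls)), 2):
    score = util.cos_sim(embeddings[i], embeddings[j]).item()
    if score >= THRESHOLD:
        print(f"{urls[i]} <-> {urls[j]}: {score:.2f}  <- potential cannibalisation")
```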
Not all similar pages cause problems, so I also factor in:
Search intent differences – If one page is a blog and the other is a product page, it’s usually fine.
Which page Google is ranking – If a lower-priority page is stealing rankings from a more important one, I know where to focus.
Existing internal linking structure – If two pages are too similar but both are valuable, I might tweak the internal links to signal to Google which one should rank.
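The script doesn’t make those judgement calls for you, but you can get part of the way there by keeping a little metadata next to each URL. A rough sketch, with hypothetical field names and page types:

```python
# Hypothetical metadata I maintain by hand: page type and how much I care about each URL
page_meta = {
    "/sofas/small-spaces-guide": {"type": "blog", "priority": 2},
    "/sofas/small-space-buying-guide": {"type": "blog", "priority": 1},
    "/sofas/leather-2-seater": {"type": "product", "priority": 1},
}

def needs_attention(url_a: str, url_b: str, score: float, threshold: float = 0.85) -> bool:
    """Only flag pairs that are highly similar AND serve the same intent (same page type)."""
    if score < threshold:
        return False
    # A blog post overlapping with a product page is usually fine, even at high similarity
    return page_meta[url_a]["type"] == page_meta[url_b]["type"]
```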
Once I’ve identified pages that are too semantically similar, I’ve got a few options:
If two pages are ranking for the same terms and offering basically the same content, I’ll combine them into one stronger, better page and 301 redirect the weaker one.
Example: If I had two posts, “Best Sofas for Small Spaces” and “Small Space Sofas Buying Guide”, I’d merge them into one authoritative guide that keeps the best content from both.
If both pages need to exist, I tweak the titles, headings, and on-page focus so each one targets a clearly distinct intent:
Example: “Leather Sofas vs. Fabric Sofas” should focus on comparisons, while “Best Leather Sofas 2025” should focus on top picks—that way, they don’t compete.
If there’s a legitimate reason to keep both pages live but I only want one to rank, I’ll:
Add a canonical tag pointing to the preferred page.
Use noindex if the secondary page doesn’t need to appear in search results.
Example: If I have ex-display sofas sale pages but also a clearance category page, I might canonicalise the individual sales pages to avoid splitting rankings.
It’s proactive, not reactive – I’m catching cannibalisation before Google does.
It saves rankings before they drop – No more waiting for Search Console to show traffic declines.
It’s scalable – This works whether you’ve got 10 pages or 10,000 pages.
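A quick note on that last point: scoring every pair naively is an n-squared job, fine for a few hundred URLs but sluggish at tens of thousands. As far as I can tell, sentence-transformers ships a paraphrase_mining helper that chunks the comparisons for you – something along these lines:

```python
from sentence_transformers import SentenceTransformer, util

# Same URL -> body text mapping as in the earlier sketch
pages = {
    "/sofas/small-spaces-guide": "…page copy…",
    "/sofas/small-space-buying-guide": "…page copy…",
    # …thousands more…
}

model = SentenceTransformer("all-MiniLM-L6-v2")
urls = list(pages.keys())

# paraphrase_mining chunks the comparisons internally instead of building the full n x n matrix
pairs = util.paraphrase_mining(model, [pages[u] for u in urls], top_k=5)

for score, i, j in pairs:  # triples of [score, index_a, index_b], best-scoring first
    if score >= 0.85:
        print(f"{urls[i]} <-> {urls[j]}: {score:.2f}")
```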
I’m still experimenting with the scripts and refining the process, but early results are promising. Google’s ranking systems are increasingly semantic, and using embeddings means we can spot potential conflicts before they start causing problems.
If you’re dealing with ranking volatility, content conflicts, or weird keyword fluctuations, this is definitely worth testing.
Have you tried using embeddings for SEO yet? I’d love to hear how others are using them – drop me a message or comment with your thoughts!