Duplicate content and off-topic pages cause problems for SEO, but finding them manually is slow and unreliable. Whether it's product pages that are too similar, old blog posts covering the same topic from slightly different angles, or random pages that do not fit the site's main focus, these issues can dilute rankings. Worse, if Google detects overlapping content before you do, it can lead to keyword cannibalisation, where the wrong pages compete for rankings, damaging organic visibility.
Technical SEOs rely on automation to solve problems at scale. This script takes a structured approach by using text embeddings, NLP, and cosine similarity scoring to analyse content the way Google does. Instead of just checking for matching words, it looks at the actual meaning behind the content.
The script extracts text from web pages, removing boilerplate content (header and footer) when semantic HTML5 is used. It then runs the content through Google's NLP API to identify the top five named entities and categorise the page intent. Using text embeddings, it calculates cosine similarity scores to detect duplicate or near-duplicate pages and assigns a relevance score to highlight content that does not fit with the rest of the site.
By running this Python script, you can spot potential content cannibalisation before Google does and take control of how your pages rank. This is a proper data-backed analysis using Google's own NLP API and embeddings, providing a structured way to detect content duplication and off-topic pages. It is a scalable way to audit content without relying on manual review or guesswork.
Example of the output data: https://docs.google.com/spreadsheets/d/1rfcGMSGrCEYcZCreSKQ5LFrZoV_2fOzzj7JP9USpgM8/edit?usp=sharing
Installing and Running the Script
This script works on Windows, Mac, and Google Colab. Create a new folder, copy the Python script below into a file inside it, and name the file script.py (or whatever you like).
Code:
import os
import re
import json
import time
import logging
import numpy as np
import pandas as pd
from datetime import datetime

# Selenium
import urllib3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager  # Auto-manage ChromeDriver

# Sentence Transformers & Cosine Similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Google Cloud NLP
from google.cloud import language_v1

# Suppress warnings and logs
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
logging.getLogger('selenium').setLevel(logging.CRITICAL)

# Load the BERT model
bert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Google NLP API Credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(
    os.getcwd(), "keys", "dashboard-project-google-nlp-key.json"
)


def extract_page_content(url, driver):
    """Extract the page's main content, excluding header and footer."""
    try:
        driver.get(url)
        time.sleep(2)  # Allow page to load

        # Page title
        title = driver.title.strip()

        # Body text excluding <header> and <footer>
        body_element = driver.find_element(By.TAG_NAME, 'body')
        body_text = body_element.text.strip()

        # Remove header and footer if they exist
        if driver.find_elements(By.TAG_NAME, 'header'):
            header = driver.find_element(By.TAG_NAME, 'header').text.strip()
            body_text = body_text.replace(header, "")
        if driver.find_elements(By.TAG_NAME, 'footer'):
            footer = driver.find_element(By.TAG_NAME, 'footer').text.strip()
            body_text = body_text.replace(footer, "")

        return title, body_text
    except Exception as e:
        print(f"Error extracting content from {url}: {e}")
        return "", ""


def analyse_text_with_nlp(content):
    """Use Google NLP API to get entities, categories, and sentiment score."""
    client = language_v1.LanguageServiceClient()
    doc = {"content": content, "type_": language_v1.Document.Type.PLAIN_TEXT}
    resp = {}
    try:
        resp["entities"] = client.analyze_entities(request={"document": doc}).entities
        resp["categories"] = client.classify_text(request={"document": doc}).categories
        resp["sentiment"] = client.analyze_sentiment(request={"document": doc}).document_sentiment
    except Exception as e:
        print(f"Error with Google NLP API: {e}")
    return resp


def extract_nlp_details(content):
    """Parse the NLP response, returning top entities, categories, and sentiment score."""
    nlp_response = analyse_text_with_nlp(content)
    entities = [ent.name for ent in nlp_response.get("entities", [])[:5]]  # Only entity names, no scores
    categories = [cat.name for cat in nlp_response.get("categories", [])]
    sentiment_score = round(nlp_response.get("sentiment", {}).score, 4) if nlp_response.get("sentiment") else None
    return {
        "entities": entities,
        "categories": categories,
        "sentiment_score": sentiment_score
    }


def compute_embedding(body):
    """Compute an embedding for the entire page content (excluding header/footer)."""
    return bert_model.encode(body, convert_to_numpy=True) if body else np.zeros(384)


def compute_relevance_scores(embeddings):
    """Compute how relevant each page is to the overall theme of the site using mean embeddings."""
    mean_embedding = np.mean(embeddings, axis=0)  # Compute the site's 'central' embedding
    relevance_scores = cosine_similarity(embeddings, [mean_embedding])[:, 0]  # Compare each page to the mean
    return relevance_scores


def main():
    # Load URLs
    input_file = os.path.join(os.getcwd(), "input.csv")
    df_urls = pd.read_csv(input_file)

    # Output CSV
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    output_csv = os.path.join(os.getcwd(), f"url-data-{timestamp}.csv")

    # Selenium headless driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--log-level=3")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

    pages = []
    embeddings = []
    for idx, row in df_urls.iterrows():
        url = row["URL"]
        print(f"Processing {idx+1}/{len(df_urls)}: {url}")
        title, body_text = extract_page_content(url, driver)

        # Compute embedding for the full page content
        embedding_vec = compute_embedding(body_text)

        # Word & sentence count
        word_count = len(re.findall(r"\b\w+\b", body_text)) if body_text else 0
        sentence_count = len(re.split(r'[.?!]', body_text)) if body_text else 0  # Basic sentence count

        # Google NLP analysis
        nlp_data = extract_nlp_details(body_text) if body_text else {"entities": [], "categories": [], "sentiment_score": None}

        pages.append({
            "URL": url,
            "Title": title,
            "WordCount": word_count,
            "SentenceCount": sentence_count,
            "TopEntities": ", ".join(nlp_data["entities"]),  # No salience scores
            "ContentCategories": ", ".join(nlp_data["categories"]),
            "SentimentScore": nlp_data["sentiment_score"],
            "Embedding": json.dumps(embedding_vec.tolist())  # Save as JSON
        })
        embeddings.append(embedding_vec)

    driver.quit()

    # Convert embeddings to a NumPy array
    embeddings_mat = np.array(embeddings)  # Shape: (N, 384)

    # Compute the relevance scores
    relevance_scores = compute_relevance_scores(embeddings_mat)

    # Store similarity results in the same CSV
    n_docs = len(pages)
    for i in range(n_docs):
        row_sim = cosine_similarity([embeddings_mat[i]], embeddings_mat)[0]
        row_sim[i] = -1.0  # Prevent self-matching

        # Sort in descending order and take the top three matches
        top_idx = np.argsort(-row_sim)[:3]
        pages[i]["RelevanceScore"] = round(relevance_scores[i], 4)
        for rank, j in enumerate(top_idx, start=1):
            pages[i][f"Nearest{rank}URL"] = pages[j]["URL"]
            pages[i][f"Nearest{rank}Similarity"] = round(row_sim[j], 4)  # Keep similarity scores

    # Save results to CSV
    df_out = pd.DataFrame(pages)
    df_out.to_csv(output_csv, index=False)
    print(f"\nData saved to {output_csv}")


if __name__ == "__main__":
    main()
Install the Required Python Libraries
For Windows or Mac, open PowerShell or Terminal and run:
Code:
pip install numpy pandas selenium sentence-transformers scikit-learn google-cloud-language webdriver-manager
For Google Colab, paste this into a code cell and run it:
Code:
!pip install numpy pandas selenium sentence-transformers scikit-learn google-cloud-language webdriver-manager
Getting a Google Cloud NLP API Key
Since the script uses Google’s Natural Language API, you need to set up access through Google Cloud.

1. Create a Google Cloud Project
- Go to the Google Cloud Console.
- Click the project selector in the top menu and select New Project.
- Name your project something relevant, like SEO-Content-Analysis.
- Click Create and wait for Google to set it up.
2. Enable the Natural Language API
- In the Google Cloud Console, go to the APIs & Services dashboard.
- Click Enable APIs and Services at the top.
- Search for Natural Language API.
- Click Enable.
3. Create a Service Account
- In the Google Cloud Console, go to IAM & Admin > Service Accounts.
- Click Create Service Account.
- Give it a name (e.g., seo-audit-bot).
- Click Create and Continue.
- Under "Grant this service account access to project," select Owner (or Editor if you prefer stricter access).
- Click Done.
4. Generate the API Key (JSON Key File)
- Find your new service account in the list and click it.
- Go to the Keys tab and click Add Key > Create New Key.
- Select JSON format.
- Click Create, and it will download a .json file to your computer.
5. Move the Key File to Your Project Folder
Move the .json file into your project folder (the same folder as the script): create a new folder inside it named "keys" and place the key file there. You might want to rename it to something simple, like:
Code:
google-nlp-key.json
Then find this line in the script and replace "dashboard-project-google-nlp-key.json" with your JSON file name:
Code:
# Google NLP API Credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(
    os.getcwd(), "keys", "dashboard-project-google-nlp-key.json"
)
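Optionally, you can verify the key is being picked up before running the full audit. This is a standalone test snippet (not part of the main script) using the same API calls the script makes; any successful response means the credentials work:
Code:
# Optional sanity check for the Google NLP credentials
import os
from google.cloud import language_v1

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(
    os.getcwd(), "keys", "google-nlp-key.json"  # Use your own key file name
)

client = language_v1.LanguageServiceClient()
doc = {"content": "Testing the Google Cloud NLP setup.", "type_": language_v1.Document.Type.PLAIN_TEXT}
sentiment = client.analyze_sentiment(request={"document": doc}).document_sentiment
print(f"Sentiment score: {sentiment.score:.2f}")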
Note: the Google NLP API is not free, but it is not expensive to run. I have only tested this script on a Windows desktop PC, so the process may differ on Google Colab and Mac.
Prepare Your input.csv File
Before running the script, create a file named input.csv in the same folder as the script. It needs a header row with URL in the first cell, followed by one URL per row (the script reads the URL column by name). The script will read from this file, process the URLs, and generate a CSV report with the results.
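For example, input.csv might look like this (the URLs are placeholders):
Code:
URL
https://www.example.com/
https://www.example.com/mortgages/first-time-buyers
https://www.example.com/mortgages/remortgaging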
Running the Script
Once everything is set up, run the script with:
Code:
python script.py
For Google Colab, upload input.csv manually before running the script.
How the Script Works
This script does not rely on simple text matching. It converts page content into text embeddings, which represent meaning rather than exact wording. This allows for more accurate detection of duplicate content and topic alignment.
What Embeddings and Cosine Similarity Mean
Text embeddings convert page content into numerical vectors, allowing similarity comparisons between pages. This script uses BERT-based sentence embeddings with a 384-dimensional representation. The script calculates cosine similarity between embeddings to measure how closely related two pages are. In practice, similarity scores range from 0 (no meaningful overlap) to 1 (identical meaning).
- Pages scoring above 0.85 are likely duplicates or highly similar.
- Pages with moderate similarity (0.6 to 0.8) may share overlapping content but are not exact duplicates.
- Pages below 0.4 are significantly different.
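If you want to see the scoring in isolation, here is a minimal sketch using the same model as the main script. The two example sentences are made up, but reworded near-duplicates like these typically land in the duplicate range:
Code:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Same model the main script uses
model = SentenceTransformer('all-MiniLM-L6-v2')

page_a = "Compare mortgage rates and find the best fixed-rate deal for your home."
page_b = "Find the best fixed-rate mortgage deal by comparing rates for your home."

embeddings = model.encode([page_a, page_b], convert_to_numpy=True)  # Shape: (2, 384)
score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Similarity: {score:.4f}")  # Reworded duplicates typically score above 0.85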
How Relevance Score Works
Instead of comparing each page to a predefined category or the homepage, the script calculates the mean embedding of all pages in the dataset. Each page is then compared against this average to determine how well it fits within the overall theme of the site.

Most pages should score between 0.5 and 0.7, which indicates normal variation. Scores below 0.3 suggest the content may not align with the rest of the site.
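To make the calculation concrete, here is the relevance logic on its own, with random placeholder vectors standing in for real page embeddings:
Code:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder data: 5 pages x 384 dimensions (real runs use the script's embeddings)
embeddings_mat = np.random.rand(5, 384)

mean_embedding = np.mean(embeddings_mat, axis=0)  # The site's "central" theme
relevance_scores = cosine_similarity(embeddings_mat, [mean_embedding])[:, 0]
print(relevance_scores)  # One score per page; low values flag potentially off-topic pages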
CSV Output and What It Means
The script generates a CSV file with the following columns:

| Column | Description |
|---|---|
| URL | The page being analysed |
| Title | The page’s title |
| WordCount | Total number of words extracted |
| SentenceCount | Total number of sentences extracted |
| TopEntities | Key names, brands, and places detected by Google NLP |
| ContentCategories | The category assigned by Google NLP |
| SentimentScore | The emotional tone of the page (-1.0 = negative, 1.0 = positive) |
| Embedding | The page's raw 384-dimensional embedding, stored as a JSON string |
| RelevanceScore | Measures how well the page matches the site's overall content |
| Nearest1URL | The most similar page on the site |
| Nearest1Similarity | The cosine similarity score of the most similar page (1.0 = identical) |
| Nearest2URL | The second most similar page |
| Nearest2Similarity | The cosine similarity score of the second most similar page |
| Nearest3URL | The third most similar page |
| Nearest3Similarity | The cosine similarity score of the third most similar page |
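Because the Embedding column is stored as a JSON string, you can load it back into NumPy for further analysis if needed. A small sketch (the file name is a placeholder; yours will carry its own timestamp):
Code:
import json
import numpy as np
import pandas as pd

df = pd.read_csv("url-data-2025-01-01_12-00-00.csv")  # Adjust to your output file

# Decode each JSON string back into a vector
embeddings_mat = np.array([json.loads(e) for e in df["Embedding"]])
print(embeddings_mat.shape)  # (number of pages, 384)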
How to Use the Data
Identifying Duplicate or Similar Pages
If two pages have a high similarity score, they are likely duplicates or thin variations.

| URL | Nearest1URL | Nearest1Similarity |
|---|---|---|
| /product-a | /product-a-variant | 0.92 |
| /service-x | /service-x-uk | 0.88 |
If pages are intentional duplicates, ensure canonical tags are correctly set. If they serve the same intent, consider merging them into one stronger page.
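To pull duplicate candidates straight out of the report, a quick pandas filter works (the file name is a placeholder; the 0.85 cutoff matches the guideline above):
Code:
import pandas as pd

df = pd.read_csv("url-data-2025-01-01_12-00-00.csv")  # Adjust to your output file

# Pages whose closest neighbour scores above 0.85 are duplicate candidates
dupes = df[df["Nearest1Similarity"] > 0.85][["URL", "Nearest1URL", "Nearest1Similarity"]]
print(dupes.sort_values("Nearest1Similarity", ascending=False).to_string(index=False))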
Identifying Off-Topic Pages
Relevance Score measures how well a page aligns with the site’s main theme.

| URL | RelevanceScore | ContentCategories |
|---|---|---|
| /home-insurance-guide | 0.83 | Insurance, Finance |
| /best-pop-songs-2024 | 0.21 | Music, Entertainment |
A relevance score below 0.3 suggests the page may not fit within the site’s content strategy. If it is an important page, it may need repositioning. If it is irrelevant, consider removing or redirecting it.
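The same approach surfaces off-topic candidates (again, the file name is a placeholder and the 0.3 cutoff matches the threshold above):
Code:
import pandas as pd

df = pd.read_csv("url-data-2025-01-01_12-00-00.csv")  # Adjust to your output file

# Pages scoring below 0.3 against the site's mean embedding may be off-topic
off_topic = df[df["RelevanceScore"] < 0.3][["URL", "RelevanceScore", "ContentCategories"]]
print(off_topic.sort_values("RelevanceScore").to_string(index=False))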
Using Sentiment Analysis
Sentiment analysis helps determine whether content is positive, neutral, or negative.

| Score | Meaning |
|---|---|
| 0.6 and above | Strongly positive, useful for marketing and sales pages |
| 0.3 to 0.5 | Neutral, typical for informational content |
| Below 0.2 | Likely negative, which may indicate UX or messaging issues |
A sales page with a negative sentiment score may discourage conversions. If a page is unexpectedly positive, check if it is overly promotional.
Final Thoughts - Preemptive Strike on Potential Cannibalisation
Duplicate content, thin pages, and off-topic content have always been challenges for SEO, but traditional methods rely on basic text matching or manual review, both of which are inefficient at scale. This script takes a more advanced and scalable approach by leveraging AI-powered text embeddings and Google NLP.

Rather than just counting matching words, the script understands content at a deeper level, making it possible to detect true duplication, content overlap, and off-topic pages that may not be immediately obvious.
This is especially useful for large websites, eCommerce stores, publishers, and content-heavy sites where manual audits are impractical.
What this script does better than manual review or basic text matching
- Finds near-duplicates even when wording differs
  - Unlike exact text matching, embeddings detect semantic similarity.
  - A product page and a slightly reworded variant may look unique to the naked eye but will score high on cosine similarity.
- Flags content that might not belong on the site
  - The relevance score helps identify off-topic content based on the site’s main theme.
  - If an insurance site has a page about pop music, it will likely score low in relevance.
- Eliminates the need for manual comparison
  - No more side-by-side content checks or guesswork.
  - The output is structured in a way that makes it easy to prioritise which pages need attention first.
- Works at scale
  - Whether you are checking 50 pages or 50,000, the method is the same.
  - Works for e-commerce, news sites, SaaS, and content publishers dealing with large volumes of pages.
How this helps SEO strategy
- Prevents duplicate content issues
  - Helps reduce duplicate page bloat and consolidate weak content.
  - Avoids potential Google indexing issues from near-duplicate pages.
- Supports content pruning
  - Identifies thin or overlapping pages that may not add value.
  - Helps make decisions on which pages to remove, merge, or rewrite.
- Improves internal linking and site structure
  - By spotting thematic clusters, this can inform better internal linking strategies.
  - If two blog posts are highly similar, one could become a subtopic of the other instead of existing as a standalone page.
- Helps content teams prioritise work
  - The structured data allows for quick decision-making on where to focus content efforts.
  - Instead of manually comparing content, SEO teams can quickly filter and prioritise based on similarity and relevance scores.
Where this fits into an SEO audit
For sites with thousands of pages, this script can be used alongside traditional site audits. It works well when:

- Migrating content and needing to identify redundant pages.
- Cleaning up duplicate blog posts, product pages, or category pages.
- Ensuring content aligns with brand messaging after a site-wide refresh.
- Auditing user-generated content that may be off-topic.
Final takeaway
Content audits are a necessary part of SEO, but they don’t need to be slow and painful. This script makes it easier to spot problems quickly, allowing SEOs and content teams to focus on fixing issues rather than spending hours finding them. It is a data-driven approach that saves time, improves accuracy, and scales effortlessly, whether you are working with a small website or an enterprise-level site with thousands of pages.
Important warning about Google NLP API costs
Google's Natural Language API is not free. The cost depends on the number of requests made, with free usage limited to a small number of API calls per month. Large-scale audits may result in charges, so it is important to monitor API usage in the Google Cloud billing dashboard. Before running the script on a large dataset, check the latest pricing on Google Cloud to avoid unexpected costs.

Lastly, I have no affiliation with the Reeds Rains website; it is just a random site I chose to showcase the script. Mortgages are a topic where content often overlaps.