This is a Node.js script that leverages Puppeteer to automate the extraction of essential on-page data. Built with SEO professionals and digital marketers in mind, it systematically crawls a website and gathers the on-page elements you need to inform your SEO strategy.
I wrote this script for a migration, re-platforming to Magento: it scraped the existing content, and a separate Python script then matched the scraped data against the new platform and flagged anything that was missing. I'm posting it here so that I have quick access to find it again and to share it with the technical SEO community.
Key Features:
- Automated Browsing: Utilizing Puppeteer, the script launches a headless browser to simulate real user interactions, ensuring accurate and reliable data collection.
- Data Extraction: It scrapes the critical SEO elements from each page, including:
  - Title Tags
  - Meta Descriptions
  - Canonical URLs
  - Headings (H1-H6)
  - Paragraph Content
  - Images (Source and Alt Text)
  - JSON-LD Scripts
- Structured Storage: The extracted data is neatly organised and saved as JSON files. Each file is named after the processed URL (see the filename sketch after this list), ensuring easy reference and further analysis.
- Efficient Crawling: The script maintains a queue of URLs to visit, ensuring no duplicate or irrelevant URLs (e.g., those containing query parameters or certain file extensions) are processed. It also includes a delay mechanism to control the crawl speed, preventing server overload.
- Scalability: This script is designed to handle extensive website structures, making it ideal for comprehensive site audits and large-scale data collection projects.
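The filename convention described under Structured Storage is easiest to see on its own. Below is a minimal sketch of the same two-step replace that the script's saveJson helper uses; the urlToFilename name and the example URLs are mine, added purely for illustration.

Code:
// Sketch of the saveJson filename mapping, pulled out for illustration.
const urlToFilename = (url) =>
  url.replace(/https?:\/\/|www\.|\//g, '_').replace(/[^a-zA-Z0-9_]/g, '') + '.json';

console.log(urlToFilename('https://www.next.co.uk'));               // "__nextcouk.json"
console.log(urlToFilename('https://www.next.co.uk/help/delivery')); // "__nextcouk_help_delivery.json"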
How It Works:
- Initialization: The script begins by launching a headless browser and opening a new page with a user-agent string to mimic typical browser behaviour.
- Base URL Processing: It starts with a given base URL, extracting and saving data while dynamically discovering new internal links to follow.
- Data Evaluation: On each page, a set of JavaScript functions evaluates and gathers the required on-page elements listed above.
- Link Management: New links are filtered and added to the queue, ensuring only relevant internal pages are crawled (a sketch of the filter follows this list).
- Completion: After processing all queued URLs, the browser closes, and the script concludes, leaving behind a comprehensive set of JSON files containing the scraped data.
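Before the full listing, here is the link filter from the crawl loop pulled out on its own, as referenced in the Link Management step. This is only a sketch: the isCrawlable name does not appear in the script, and the example URLs are illustrative.

Code:
// Sketch of the filter applied to discovered links before they are queued.
const isCrawlable = (link, baseUrl) =>
  link.startsWith(baseUrl) &&
  !link.includes('?') &&
  !link.includes('#') &&
  !link.match(/\.(zip|xml|pdf|js|css)$/);

// Only the first URL passes: the second has a query string, the third is external.
['https://www.next.co.uk/help', 'https://www.next.co.uk/search?q=shoes', 'https://example.com/page']
  .forEach(link => console.log(link, isCrawlable(link, 'https://www.next.co.uk')));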
Grab the code below:
Code:
const puppeteer = require('puppeteer');
const fs = require('fs');
const path = require('path');

// Simple promise-based pause, used to throttle the crawl between pages.
function delay(time) {
  return new Promise(resolve => setTimeout(resolve, time));
}

(async () => {
  // Launch a headless browser and spoof a typical desktop Chrome user agent.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36');

  const baseUrl = 'https://www.next.co.uk';
  let visitedUrls = new Set();
  let urlQueue = [baseUrl];

  // All output is written to a ./json directory next to the script.
  const jsonDir = path.join(__dirname, 'json');
  if (!fs.existsSync(jsonDir)) {
    fs.mkdirSync(jsonDir);
  }

  // Save one page's scraped data, naming the file after its URL.
  const saveJson = (data, url) => {
    const filename = url.replace(/https?:\/\/|www\.|\//g, '_').replace(/[^a-zA-Z0-9_]/g, '') + '.json';
    fs.writeFileSync(path.join(jsonDir, filename), JSON.stringify(data, null, 2), 'utf-8');
  };

  // Visit a URL, extract the on-page SEO elements and write them to disk.
  async function extractAndSave(url) {
    console.log(`Processing: ${url}`);
    try {
      await page.goto(url, { waitUntil: 'networkidle2' });
      const data = await page.evaluate(() => {
        const getTitle = () => document.querySelector('title') ? document.querySelector('title').innerText : 'N/A';
        const getMetaDescription = () => document.querySelector('meta[name="description"]') ? document.querySelector('meta[name="description"]').content : 'N/A';
        const getCanonical = () => document.querySelector('link[rel="canonical"]') ? document.querySelector('link[rel="canonical"]').href : 'N/A';
        const getHeadings = () => Array.from(document.querySelectorAll('h1, h2, h3, h4, h5, h6')).map(h => ({ tag: h.tagName, content: h.innerText }));
        const paragraphs = Array.from(document.querySelectorAll('p')).map(p => p.innerText);
        const images = Array.from(document.querySelectorAll('img')).map(img => ({ src: img.src, alt: img.alt || 'N/A' }));
        const jsonLdScripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]')).map(script => {
          try { return JSON.parse(script.innerText); }
          catch (e) { return { error: 'Failed to parse JSON-LD' }; }
        });
        return { URL: location.href, Title: getTitle(), MetaDescription: getMetaDescription(), CanonicalURL: getCanonical(), Headings: getHeadings(), ParagraphsContent: paragraphs, Images: images, JSONLD: jsonLdScripts };
      });
      saveJson(data, url);
      await delay(2000); // Delay to limit crawl speed
    } catch (error) {
      console.error(`Failed to process ${url}:`, error);
    }
  }

  // Crawl loop: process the queue, skipping visited URLs, URLs with query
  // strings or fragments, and common non-HTML file types.
  while (urlQueue.length > 0) {
    const currentUrl = urlQueue.shift();
    if (!visitedUrls.has(currentUrl) && !currentUrl.includes('?') && !currentUrl.includes('#') && !currentUrl.match(/\.(zip|xml|pdf|js|css)$/)) {
      visitedUrls.add(currentUrl);
      await extractAndSave(currentUrl);
      // Collect internal links from the page just visited and queue any we
      // have not seen yet, applying the same filters as above.
      const links = await page.evaluate(baseUrl => Array.from(document.querySelectorAll('a[href]')).map(a => new URL(a.href, baseUrl).href).filter(link => link.startsWith(baseUrl) && !link.includes('?') && !link.includes('#') && !link.match(/\.(zip|xml|pdf|js|css)$/)), baseUrl);
      links.forEach(link => { if (!visitedUrls.has(link)) { urlQueue.push(link); } });
    }
  }

  await browser.close();
  console.log('Crawling has finished.');
})();
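Once the crawl finishes, every page has its own file under ./json. For the migration I matched these files against the new platform's data with a Python script; the snippet below is not that script, just a quick Node-based sketch showing how you might spot-check the output, for example by flagging pages that came back without a title tag.

Code:
// Illustrative sketch only: list crawled pages whose title tag was missing.
const fs = require('fs');
const path = require('path');

const jsonDir = path.join(__dirname, 'json');
fs.readdirSync(jsonDir)
  .filter(file => file.endsWith('.json'))
  .forEach(file => {
    const data = JSON.parse(fs.readFileSync(path.join(jsonDir, file), 'utf-8'));
    if (!data.Title || data.Title === 'N/A') {
      console.log(`Missing title: ${data.URL}`);
    }
  });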