    This is a Node.js script that uses Puppeteer to automate the extraction of on-page SEO data. Aimed at SEO professionals and digital marketers, it crawls a website page by page and collects the elements you need for audits and migrations.

    I used this script on a migration, re-platforming to Magento. I used it to scrape the content and then used a Python script to match up the data and flag what was missing. I'm posting here so that I have quick access to find it again and to share it with the technical SEO community.
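
    The matching step can be sketched in JavaScript as well. This is not the Python script mentioned above, only a minimal illustration of the same idea. It assumes pages on the old and new sites share the same URL path, and the names `findMigrationGaps`, `pathOf` and `FIELDS` are hypothetical. The record shape mirrors the JSON the crawler writes.

```javascript
// Fields to compare between the old and new crawl (assumed selection).
const FIELDS = ['Title', 'MetaDescription', 'CanonicalURL'];

// Compare by path so the old and new domains can differ.
function pathOf(url) {
    return new URL(url).pathname.replace(/\/+$/, '') || '/';
}

// Hypothetical helper: list pages or fields present on the old site but not the new one.
function findMigrationGaps(oldPages, newPages) {
    const newByPath = new Map(newPages.map(p => [pathOf(p.URL), p]));
    const gaps = [];
    for (const oldPage of oldPages) {
        const key = pathOf(oldPage.URL);
        const match = newByPath.get(key);
        if (!match) {
            gaps.push({ path: key, issue: 'page missing on new site' });
            continue;
        }
        for (const field of FIELDS) {
            // The crawler stores 'N/A' when an element is absent.
            const had = oldPage[field] && oldPage[field] !== 'N/A';
            const has = match[field] && match[field] !== 'N/A';
            if (had && !has) {
                gaps.push({ path: key, issue: `${field} missing on new site` });
            }
        }
    }
    return gaps;
}
```

    Feeding it the parsed JSON files from a crawl of each site gives a flat list of gaps to review.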

    Key Features:

    1. Automated Browsing: Using Puppeteer, the script launches a headless browser and loads each page as a real browser would, so content rendered by JavaScript is captured as well.
    2. Data Extraction: It scrapes the key on-page SEO elements from each page, including:
      • Title Tags
      • Meta Descriptions
      • Canonical URLs
      • Headings (H1-H6)
      • Paragraph Content
      • Images (Source and Alt Text)
      • JSON-LD Scripts
    3. Structured Storage: The extracted data is neatly organised and saved as JSON files. Each file is named after the processed URL, ensuring easy reference and further analysis.
    4. Efficient Crawling: The script maintains a queue of URLs to visit and skips duplicates and irrelevant URLs (e.g., those containing query parameters, fragments, or file extensions such as .zip, .xml, .pdf, .js and .css). A 2-second delay after each page limits the crawl speed and reduces load on the server.
    5. Scalability: The crawl keeps following internal links until the queue is empty, so it can cover large sites for comprehensive audits and data collection projects. Pages are processed one at a time, so expect long run times on very large sites.
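
    The filtering and file naming described in points 3 and 4 can be isolated into two small functions. This is a sketch for illustration: the regular expressions are the ones the script uses, but the function names `isCrawlable` and `urlToFilename` are hypothetical and do not appear in the script.

```javascript
// Extensions the crawler skips (same list as the script's regex).
const SKIPPED_EXTENSIONS = /\.(zip|xml|pdf|js|css)$/;

// Hypothetical helper: true when a URL is internal and worth crawling.
function isCrawlable(url, baseUrl) {
    return url.startsWith(baseUrl)
        && !url.includes('?')   // no query parameters
        && !url.includes('#')   // no fragments
        && !SKIPPED_EXTENSIONS.test(url);
}

// Same transformation the script applies to name each JSON file.
function urlToFilename(url) {
    return url
        .replace(/https?:\/\/|www\.|\//g, '_')  // protocol, www. and slashes become underscores
        .replace(/[^a-zA-Z0-9_]/g, '')          // drop everything else that is not filename-safe
        + '.json';
}

console.log(urlToFilename('https://www.example.com/shop/shoes'));
// prints "__examplecom_shop_shoes.json"
```

    Because dots and hyphens are stripped, two different URLs can in principle map to the same filename, which is worth keeping in mind on large sites.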

    How It Works:

    • Initialization: The script begins by launching a headless browser and opening a new page with a user-agent string to mimic typical browser behaviour.
    • Base URL Processing: It starts with a given base URL, extracting and saving data while dynamically discovering new internal links to follow.
    • Data Evaluation: On each page, a set of JavaScript functions evaluate and gather the required on-page elements listed above.
    • Link Management: New links are filtered and added to the queue, ensuring only relevant internal pages are crawled.
    • Completion: After processing all queued URLs, the browser closes, and the script concludes, leaving behind a comprehensive set of JSON files containing the scraped data.
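
    Stripped of the browser, the queue logic in the steps above looks roughly like this. It is a minimal sketch that runs against an in-memory link graph: `crawl`, `getLinks` and `fakeSite` are hypothetical names, and `getLinks` stands in for loading a page and reading its links.

```javascript
// Breadth-first crawl: visit each URL once, following links as they are discovered.
async function crawl(startUrl, getLinks) {
    const visited = new Set();
    const queue = [startUrl];
    const order = [];
    while (queue.length > 0) {
        const url = queue.shift();
        if (visited.has(url)) continue; // already processed
        visited.add(url);               // mark before following links
        order.push(url);
        for (const link of await getLinks(url)) {
            if (!visited.has(link)) queue.push(link);
        }
    }
    return order;
}

// Usage with a tiny fake site (paths instead of full URLs, for brevity).
const fakeSite = {
    '/': ['/a', '/b'],
    '/a': ['/', '/b'],
    '/b': ['/a', '/c'],
    '/c': [],
};
crawl('/', async url => fakeSite[url] || []).then(order => console.log(order));
// logs [ '/', '/a', '/b', '/c' ]
```

    The same URL can sit in the queue more than once, which is why the visited check happens when a URL is taken off the queue and not only when it is added.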

    Grab the code below:

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    const path = require('path');

    // Resolve after `time` milliseconds; used to throttle the crawl.
    function delay(time) {
        return new Promise(resolve => setTimeout(resolve, time));
    }

    (async () => {
        const browser = await puppeteer.launch({ headless: true });
        const page = await browser.newPage();
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36');

        const baseUrl = 'https://www.next.co.uk';
        const visitedUrls = new Set();
        const urlQueue = [baseUrl];

        // Create the output directory on first run.
        const jsonDir = path.join(__dirname, 'json');
        if (!fs.existsSync(jsonDir)) {
            fs.mkdirSync(jsonDir, { recursive: true });
        }

        // Write one JSON file per page, named after a sanitised form of the URL.
        const saveJson = (data, url) => {
            const filename = url.replace(/https?:\/\/|www\.|\//g, '_').replace(/[^a-zA-Z0-9_]/g, '') + '.json';
            fs.writeFileSync(path.join(jsonDir, filename), JSON.stringify(data, null, 2), 'utf-8');
        };

        // Load a page, pull out the on-page SEO elements, and save them.
        async function extractAndSave(url) {
            console.log(`Processing: ${url}`);
            try {
                await page.goto(url, { waitUntil: 'networkidle2' });
                const data = await page.evaluate(() => {
                    const getTitle = () => document.querySelector('title') ? document.querySelector('title').innerText : 'N/A';
                    const getMetaDescription = () => document.querySelector('meta[name="description"]') ? document.querySelector('meta[name="description"]').content : 'N/A';
                    const getCanonical = () => document.querySelector('link[rel="canonical"]') ? document.querySelector('link[rel="canonical"]').href : 'N/A';
                    const getHeadings = () => Array.from(document.querySelectorAll('h1, h2, h3, h4, h5, h6')).map(h => ({ tag: h.tagName, content: h.innerText }));
                    const paragraphs = Array.from(document.querySelectorAll('p')).map(p => p.innerText);
                    const images = Array.from(document.querySelectorAll('img')).map(img => ({ src: img.src, alt: img.alt || 'N/A' }));
                    const jsonLdScripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]')).map(script => {
                        try { return JSON.parse(script.innerText); }
                        catch (e) { return { error: 'Failed to parse JSON-LD' }; }
                    });
                    return { URL: location.href, Title: getTitle(), MetaDescription: getMetaDescription(), CanonicalURL: getCanonical(), Headings: getHeadings(), ParagraphsContent: paragraphs, Images: images, JSONLD: jsonLdScripts };
                });
                saveJson(data, url);
                await delay(2000); // Delay to limit crawl speed
            } catch (error) {
                console.error(`Failed to process ${url}:`, error);
            }
        }

        // Work through the queue, skipping URLs with query strings, fragments, or non-HTML extensions.
        while (urlQueue.length > 0) {
            const currentUrl = urlQueue.shift();
            if (!visitedUrls.has(currentUrl) && !currentUrl.includes('?') && !currentUrl.includes('#') && !currentUrl.match(/\.(zip|xml|pdf|js|css)$/)) {
                visitedUrls.add(currentUrl); // Mark as visited so the page is not crawled twice
                await extractAndSave(currentUrl);
                // Collect internal links from the page that was just loaded.
                const links = await page.evaluate(baseUrl => Array.from(document.querySelectorAll('a[href]')).map(a => new URL(a.href, baseUrl).href).filter(link => link.startsWith(baseUrl) && !link.includes('?') && !link.includes('#') && !link.match(/\.(zip|xml|pdf|js|css)$/)), baseUrl);
                links.forEach(link => { if (!visitedUrls.has(link)) { urlQueue.push(link); } });
            }
        }

        await browser.close();
        console.log('Crawling has finished.');
    })();
