Nodejs/Puppeteer Web Page Scraper to JSON

    Chris

    This Node.js script uses Puppeteer to automate the extraction of essential on-page data. Aimed at SEO professionals and digital marketers, it systematically crawls a website and gathers the elements you need for audits, migrations and other SEO work.

    I used this script on a migration, re-platforming to Magento: it scraped the existing content, and a separate Python script then matched up the data and flagged anything that was missing. I'm posting it here so that I can find it again quickly and share it with the technical SEO community.

    Key Features:

    1. Automated Browsing: Utilizing Puppeteer, the script launches a headless browser to simulate real user interactions, ensuring accurate and reliable data collection.
    2. Data Extraction: It meticulously scrapes critical SEO elements from each page, including:
      • Title Tags
      • Meta Descriptions
      • Canonical URLs
      • Headings (H1-H6)
      • Paragraph Content
      • Images (Source and Alt Text)
      • JSON-LD Scripts
    3. Structured Storage: The extracted data is neatly organised and saved as JSON files. Each file is named after the processed URL, ensuring easy reference and further analysis (an example output file is shown after this list).
    4. Efficient Crawling: The script maintains a queue of URLs to visit, ensuring no duplicate or irrelevant URLs (e.g., those containing query parameters or certain file extensions) are processed. It also includes a delay mechanism to control the crawl speed, preventing server overload.
    5. Scalability: This script is designed to handle extensive website structures, making it ideal for comprehensive site audits and large-scale data collection projects.
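
    For reference, each saved file follows the structure below. The URL and values here are purely illustrative; the field names are the ones the script collects:

    Code:
    {
      "URL": "https://www.example.com/mens/shirts",
      "Title": "Men's Shirts | Example Store",
      "MetaDescription": "Shop our range of men's shirts with free delivery.",
      "CanonicalURL": "https://www.example.com/mens/shirts",
      "Headings": [
        { "tag": "H1", "content": "Men's Shirts" },
        { "tag": "H2", "content": "New In" }
      ],
      "ParagraphsContent": ["Free delivery on orders over £50."],
      "Images": [
        { "src": "https://www.example.com/img/shirt.jpg", "alt": "Blue shirt" }
      ],
      "JSONLD": [
        { "@context": "https://schema.org", "@type": "BreadcrumbList" }
      ]
    }
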
    How It Works:

    • Initialization: The script begins by launching a headless browser and opening a new page with a user-agent string to mimic typical browser behaviour.
    • Base URL Processing: It starts with a given base URL, extracting and saving data while dynamically discovering new internal links to follow.
    • Data Evaluation: On each page, a set of JavaScript functions runs in the browser context to gather the on-page elements listed above.
    • Link Management: New links are filtered and added to the queue, ensuring only relevant internal pages are crawled.
    • Completion: After processing all queued URLs, the browser closes, and the script concludes, leaving behind a comprehensive set of JSON files containing the scraped data.

    Grab the code below:

    Code:
    const puppeteer = require('puppeteer');
    const fs = require('fs');
    const path = require('path');
    // Simple promise-based delay, used to throttle the crawl speed
    function delay(time) {
        return new Promise(resolve => setTimeout(resolve, time));
    }
    (async () => {
        const browser = await puppeteer.launch({ headless: true });
        const page = await browser.newPage();
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36');
        const baseUrl = 'https://www.next.co.uk';
        let visitedUrls = new Set();
        let urlQueue = [baseUrl];
        const jsonDir = path.join(__dirname, 'json');
        if (!fs.existsSync(jsonDir)) {
            fs.mkdirSync(jsonDir);
        }
        // Save the scraped data as a JSON file named after the URL, with non-alphanumeric characters stripped
        const saveJson = (data, url) => {
            const filename = url.replace(/https?:\/\/|www\.|\//g, '_').replace(/[^a-zA-Z0-9_]/g, '') + '.json';
            fs.writeFileSync(path.join(jsonDir, filename), JSON.stringify(data, null, 2), 'utf-8');
        };
        // Load a URL, extract the on-page elements in the browser context and save the result to disk
        async function extractAndSave(url) {
            console.log(`Processing: ${url}`);
            try {
                await page.goto(url, { waitUntil: 'networkidle2' });
                const data = await page.evaluate(() => {
                    const getTitle = () => document.querySelector('title') ? document.querySelector('title').innerText : 'N/A';
                    const getMetaDescription = () => document.querySelector('meta[name="description"]') ? document.querySelector('meta[name="description"]').content : 'N/A';
                    const getCanonical = () => document.querySelector('link[rel="canonical"]') ? document.querySelector('link[rel="canonical"]').href : 'N/A';
                    const getHeadings = () => Array.from(document.querySelectorAll('h1, h2, h3, h4, h5, h6')).map(h => ({ tag: h.tagName, content: h.innerText }));
                    const paragraphs = Array.from(document.querySelectorAll('p')).map(p => p.innerText);
                    const images = Array.from(document.querySelectorAll('img')).map(img => ({ src: img.src, alt: img.alt || 'N/A' }));
                    const jsonLdScripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]')).map(script => {
                        try { return JSON.parse(script.innerText); }
                        catch (e) { return { error: 'Failed to parse JSON-LD' }; }
                    });
                    return { URL: location.href, Title: getTitle(), MetaDescription: getMetaDescription(), CanonicalURL: getCanonical(), Headings: getHeadings(), ParagraphsContent: paragraphs, Images: images, JSONLD: jsonLdScripts };
                });
                saveJson(data, url);
                await delay(2000); // Delay to limit crawl speed
            } catch (error) {
                console.error(`Failed to process ${url}:`, error);
            }
        }
        // Crawl loop: work through the queue, skipping visited URLs, query strings, fragments and non-HTML file types
        while (urlQueue.length > 0) {
            const currentUrl = urlQueue.shift();
            if (!visitedUrls.has(currentUrl) && !currentUrl.includes('?') && !currentUrl.includes('#') && !currentUrl.match(/\.(zip|xml|pdf|js|css)$/)) {
                visitedUrls.add(currentUrl);
                await extractAndSave(currentUrl);
                // Collect internal links from the current page and queue any that haven't been visited yet
                const links = await page.evaluate(baseUrl =>
                    Array.from(document.querySelectorAll('a[href]'))
                        .map(a => new URL(a.href, baseUrl).href)
                        .filter(link =>
                            link.startsWith(baseUrl) &&
                            !link.includes('?') &&
                            !link.includes('#') &&
                            !link.match(/\.(zip|xml|pdf|js|css)$/)
                        ), baseUrl);
                links.forEach(link => { if (!visitedUrls.has(link)) { urlQueue.push(link); } });
            }
        }
        await browser.close();
        console.log('Crawling has finished.');
    })();
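
    To run the script, save it in an empty folder (for example as scrape.js; any filename works), install Puppeteer with npm install puppeteer, set baseUrl to the site you want to crawl, and start it with node scrape.js. The JSON files are written to a json folder next to the script.

    On the migration mentioned above, the matching against the new platform's data was handled by a separate Python script. Purely as an illustration of the idea, a minimal Node sketch that reads the saved JSON files and flags pages with a missing title or meta description might look like this:

    Code:
    // Minimal sketch only: reads the JSON files produced by the scraper above
    // and flags pages where the title or meta description came back as 'N/A'.
    const fs = require('fs');
    const path = require('path');
    const jsonDir = path.join(__dirname, 'json');
    fs.readdirSync(jsonDir)
        .filter(file => file.endsWith('.json'))
        .forEach(file => {
            const data = JSON.parse(fs.readFileSync(path.join(jsonDir, file), 'utf-8'));
            const missing = [];
            if (!data.Title || data.Title === 'N/A') missing.push('title');
            if (!data.MetaDescription || data.MetaDescription === 'N/A') missing.push('meta description');
            if (missing.length > 0) {
                console.log(`${data.URL} is missing: ${missing.join(', ')}`);
            }
        });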
     