archive.org_bot

Name: archive.org_bot
Author: Internet Archive

Internet Archive • Since 2015

Other Respects robots.txt

#archival #internet-archive #wayback-machine #crawler

What is archive.org_bot?

archive.org_bot is one of the Internet Archive's web crawlers, used to archive web pages for the Wayback Machine. It is distinct from ia_archiver and represents a newer crawling infrastructure. The bot systematically archives publicly accessible web content to preserve the web's history, and it respects robots.txt directives.

User Agent String

Mozilla/5.0 (compatible; archive.org_bot; +http://archive.org/details/archive.org_bot)

How to Control archive.org_bot

Block Completely

To prevent archive.org_bot from accessing your entire website, add this to your robots.txt file:

# Block archive.org_bot
User-agent: archive.org_bot
Disallow: /

Block Specific Directories

To restrict access to certain parts of your site while allowing others:

User-agent: archive.org_bot
Disallow: /admin/
Disallow: /private/
Disallow: /wp-admin/
Allow: /public/
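A quick way to sanity-check rules like these before deploying them is to feed them to Python's standard-library robots.txt parser. The sketch below is illustrative only: the example.com URLs and the SomeOtherBot name are placeholders, and urllib.robotparser applies rules in file order (first match wins), which may differ from how the crawler itself resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# The directory rules from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /admin/
Disallow: /private/
Disallow: /wp-admin/
Allow: /public/
"""

BOT = "archive.org_bot"
BASE = "https://example.com"  # placeholder host, not from the original page

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check a handful of sample paths against the parsed rules.
sample_paths = ["/admin/users", "/private/report.pdf", "/public/index.html", "/blog/post"]
results = {path: parser.can_fetch(BOT, BASE + path) for path in sample_paths}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")

# The group names only archive.org_bot, so other agents are unaffected.
other_bot_allowed = parser.can_fetch("SomeOtherBot", BASE + "/admin/users")
```

Paths that match no rule, such as /blog/post, remain crawlable because robots.txt allows anything that is not explicitly disallowed.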

Set Crawl Delay

To slow down the crawl rate (note: not all bots respect this directive):

User-agent: archive.org_bot
Crawl-delay: 10
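To confirm that the directive is written in a form parsers can read, the same standard-library parser exposes the value through crawl_delay(). This is a minimal sketch; it shows only how a parser reads the value, not whether archive.org_bot honors it, and SomeOtherBot is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: archive.org_bot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds for the matching group,
# or None when no Crawl-delay applies to that user agent.
bot_delay = parser.crawl_delay("archive.org_bot")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical agent name

print("archive.org_bot delay:", bot_delay)
print("SomeOtherBot delay:", other_delay)
```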

How to Verify archive.org_bot

Verification Method:
A reverse DNS lookup of the requesting IP address should resolve to a hostname under archive.org.

Learn more in the official documentation.
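The reverse DNS check can be sketched in Python with the socket module. Two details here are assumptions rather than anything stated above: accepting any hostname that equals archive.org or ends in .archive.org, and the extra forward lookup that confirms the hostname maps back to the same IP (a common safeguard against spoofed PTR records).

```python
import socket

def hostname_is_archive_org(hostname: str) -> bool:
    """Return True if hostname is archive.org or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return host == "archive.org" or host.endswith(".archive.org")

def verify_archive_org_bot(ip: str) -> bool:
    """Reverse-resolve ip, check the domain, then forward-confirm it."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname_is_archive_org(hostname):
        return False
    try:
        # Assumption: forward confirmation is an added safeguard.
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return False
    forward_ips = {info[4][0] for info in infos}
    return ip in forward_ips
```

A request handler could call verify_archive_org_bot(remote_ip) only when the user agent matches, and cache the result per IP, since DNS lookups are slow compared with a string match.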

Detection Patterns

Multiple ways to detect archive.org_bot in your application:

Basic Pattern

/archive\.org_bot/i

Strict Pattern

/^Mozilla\/5\.0 \(compatible; archive\.org_bot; \+http:\/\/archive\.org\/details\/archive\.org_bot\)$/

Flexible Pattern

/archive\.org_bot[\s\/]?[\d\.]*?/i

Vendor Match

/.*Internet Archive.*archive\.org_bot/i
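As a rough check, the patterns can be run against the documented user agent string with Python's re module. The classify helper and its return labels are illustrative names, not part of any library. Note that the user agent string shown earlier does not contain the text "Internet Archive", so the vendor pattern does not match it.

```python
import re

# User agent string as documented above.
UA = "Mozilla/5.0 (compatible; archive.org_bot; +http://archive.org/details/archive.org_bot)"

BASIC = re.compile(r"archive\.org_bot", re.IGNORECASE)
STRICT = re.compile(
    r"^Mozilla/5\.0 \(compatible; archive\.org_bot; "
    r"\+http://archive\.org/details/archive\.org_bot\)$"
)
VENDOR = re.compile(r".*Internet Archive.*archive\.org_bot", re.IGNORECASE)

def classify(user_agent: str) -> str:
    """Return 'exact', 'partial', or 'none' (hypothetical labels)."""
    if STRICT.match(user_agent):
        return "exact"
    if BASIC.search(user_agent):
        return "partial"
    return "none"

print(classify(UA))
print("vendor pattern matches documented UA:", VENDOR.search(UA) is not None)
```

The basic pattern is usually the safer default: the strict pattern stops matching as soon as the crawler changes any character of its user agent string.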

Implementation Examples

// PHP Detection for archive.org_bot
function detect_archive_org_bot() {
    $user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $pattern = '/archive\\.org_bot/i';

    if (preg_match($pattern, $user_agent)) {
        // Log the detection
        error_log('archive.org_bot detected from IP: ' . $_SERVER['REMOTE_ADDR']);

        // Set cache headers
        header('Cache-Control: public, max-age=3600');
        header('X-Robots-Tag: noarchive');

        // Optional: Serve cached version
        $cache_file = 'cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
        if (file_exists($cache_file)) {
            readfile($cache_file);
            exit;
        }
        return true;
    }
    return false;
}

# Python/Flask Detection for archive.org_bot
import re
from flask import Flask, request

app = Flask(__name__)
BOT_PATTERN = re.compile(r'archive\.org_bot', re.IGNORECASE)

def detect_archive_org_bot():
    user_agent = request.headers.get('User-Agent', '')
    return bool(BOT_PATTERN.search(user_agent))

@app.after_request
def add_bot_headers(response):
    # Attach caching headers to responses served to the bot
    if detect_archive_org_bot():
        response.headers['Cache-Control'] = 'public, max-age=3600'
        response.headers['X-Robots-Tag'] = 'noarchive'
    return response

# Django Middleware
class ArchiveOrgBotMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def detect_bot(self, request):
        user_agent = request.META.get('HTTP_USER_AGENT', '')
        return bool(BOT_PATTERN.search(user_agent))

    def __call__(self, request):
        if self.detect_bot(request):
            # Handle bot traffic
            pass
        return self.get_response(request)

// JavaScript/Node.js Detection for archive.org_bot
const express = require('express');
const app = express();

// Middleware to detect archive.org_bot
function detectArchiveOrgBot(req, res, next) {
  const userAgent = req.headers['user-agent'] || '';
  const pattern = /archive\.org_bot/i;

  if (pattern.test(userAgent)) {
    // Log bot detection
    console.log('archive.org_bot detected from IP:', req.ip);

    // Set cache headers
    res.set({
      'Cache-Control': 'public, max-age=3600',
      'X-Robots-Tag': 'noarchive'
    });

    // Mark request as bot
    req.isBot = true;
    req.botName = 'archive.org_bot';
  }
  next();
}

app.use(detectArchiveOrgBot);

# Apache .htaccess rules for archive.org_bot

# Block completely
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteRule .* - [F,L]

# Or redirect to a static version
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteCond %{REQUEST_URI} !^/static/
RewriteRule ^(.*)$ /static/$1 [L]

# Or set environment variable for PHP
SetEnvIfNoCase User-Agent "archive\.org_bot" is_bot=1

# Add cache headers for this bot
<If "%{HTTP_USER_AGENT} =~ /archive\.org_bot/i">
    Header set Cache-Control "public, max-age=3600"
    Header set X-Robots-Tag "noarchive"
</If>

# Nginx configuration for archive.org_bot

# Map user agent to variable
map $http_user_agent $is_archive_org_bot {
    default 0;
    ~*archive\.org_bot 1;
}

server {
    # Block the bot completely (remove this block to use the options below)
    if ($is_archive_org_bot) {
        return 403;
    }

    # Or serve cached content. try_files is not allowed inside "if",
    # so bot requests are handed to a named location instead.
    error_page 418 = @bot_cache;

    location / {
        if ($is_archive_org_bot) {
            return 418;
        }
        try_files $uri @backend;
    }

    location @bot_cache {
        root /var/www/cached;
        try_files $uri $uri.html $uri/index.html @backend;
    }

    # Add headers for bot requests
    # (assumes an upstream named "backend" is defined elsewhere)
    location @backend {
        if ($is_archive_org_bot) {
            add_header Cache-Control "public, max-age=3600";
            add_header X-Robots-Tag "noarchive";
        }
        proxy_pass http://backend;
    }
}

Should You Block This Bot?

Recommendations based on your website type:

Site Type	Recommendation	Reasoning
E-commerce	Optional	Evaluate based on bandwidth usage vs. benefits
Blog/News	Allow	Increases content reach and discoverability
SaaS Application	Block	No benefit for application interfaces; preserve resources
Documentation	Selective	Allow for public docs, block for internal docs
Corporate Site	Limit	Allow for public pages, block sensitive areas like intranets

Advanced robots.txt Configurations

E-commerce Site Configuration

User-agent: archive.org_bot
Crawl-delay: 5
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /api/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Allow: /products/
Allow: /categories/
Sitemap: https://example.com/sitemap.xml
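The rules containing * rely on wildcard matching, which Python's urllib.robotparser does not implement. The sketch below shows how such patterns are commonly interpreted (the * and $ special characters described in RFC 9309); whether archive.org_bot applies exactly these semantics is an assumption, and rule_to_regex and rule_matches are illustrative helper names.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into an anchored regular expression.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pieces = [re.escape(piece) for piece in body.split("*")]
    return re.compile("^" + ".*".join(pieces) + ("$" if anchored else ""))

def rule_matches(rule: str, path: str) -> bool:
    """Return True if path (including any query string) matches rule."""
    return rule_to_regex(rule).match(path) is not None

# Sample URLs are made up for illustration.
for rule, path in [
    ("/*?sort=", "/products?sort=price"),
    ("/*&page=", "/shoes?color=red&page=2"),
    ("/cart/", "/cart/items"),
    ("/*?sort=", "/products"),
]:
    print(f"{rule!r} vs {path!r}: {rule_matches(rule, path)}")
```

Rules without wildcards behave as plain prefix matches, so /cart/ covers everything beneath that directory.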

Publishing/Blog Configuration

User-agent: archive.org_bot
Crawl-delay: 10
Disallow: /wp-admin/
Disallow: /drafts/
Disallow: /preview/
Disallow: /*?replytocom=
Allow: /

SaaS/Application Configuration

User-agent: archive.org_bot
Disallow: /app/
Disallow: /api/
Disallow: /dashboard/
Disallow: /settings/
Allow: /
Allow: /pricing/
Allow: /features/
Allow: /docs/
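This configuration mixes a broad Allow: / with narrower Disallow rules, so the outcome depends on how conflicts are resolved. A minimal sketch, assuming the longest-match precedence described in RFC 9309 (the most specific rule wins, and Allow wins a tie); the is_allowed helper and the sample paths are illustrative, and the crawler's actual behavior is not confirmed here.

```python
# (kind, path prefix) pairs taken from the SaaS configuration above.
RULES = [
    ("disallow", "/app/"),
    ("disallow", "/api/"),
    ("disallow", "/dashboard/"),
    ("disallow", "/settings/"),
    ("allow", "/"),
    ("allow", "/pricing/"),
    ("allow", "/features/"),
    ("allow", "/docs/"),
]

def is_allowed(path: str, rules=RULES) -> bool:
    """Resolve conflicts by picking the longest matching prefix."""
    best_length = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_length
        tie_prefers_allow = len(prefix) == best_length and kind == "allow"
        if longer or tie_prefers_allow:
            best_length = len(prefix)
            allowed = kind == "allow"
    return allowed

for sample in ["/", "/pricing/", "/app/login", "/dashboard/reports", "/docs/setup"]:
    print(sample, "->", "allowed" if is_allowed(sample) else "blocked")
```

Under this rule the order of lines inside the group does not matter, which is why Allow: / can sit after the Disallow lines without overriding them.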

Quick Reference

User Agent Match

archive.org_bot

Robots.txt Name