archive.org_bot
Internet Archive • Since 2015
What is archive.org_bot?
archive.org_bot is one of the Internet Archive's web crawlers used to archive web pages for the Wayback Machine. This is distinct from ia_archiver and represents a newer crawling infrastructure. The bot systematically archives publicly accessible web content to preserve the web's history. It respects robots.txt directives.
User Agent String
Mozilla/5.0 (compatible; archive.org_bot; +http://archive.org/details/archive.org_bot)
How to Control archive.org_bot
Block Completely
To prevent archive.org_bot from accessing your entire website, add this to your robots.txt file:
# Block archive.org_bot
User-agent: archive.org_bot
Disallow: /
Block Specific Directories
To restrict access to certain parts of your site while allowing others:
User-agent: archive.org_bot
Disallow: /admin/
Disallow: /private/
Disallow: /wp-admin/
Allow: /public/
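Rules like these can be sanity-checked locally before deployment. The sketch below feeds the directory rules above to Python's standard-library `urllib.robotparser`; the helper name `is_allowed` and the sample paths are made up for illustration, and the parser's matching may differ from what the crawler itself does.

```python
from urllib.robotparser import RobotFileParser

# The directory rules from the example above.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /admin/
Disallow: /private/
Disallow: /wp-admin/
Allow: /public/
"""

# Parse the rules from memory instead of fetching them over HTTP.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="archive.org_bot"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT (hypothetical helper)."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/admin/users", "/public/index.html", "/blog/post"):
        print(path, is_allowed(path))
```

Paths that match no rule, such as `/blog/post` here, are treated as allowed.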
Set Crawl Delay
To slow down the crawl rate (note: not all bots respect this directive):
User-agent: archive.org_bot
Crawl-delay: 10
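To confirm that the directive is at least parsed as intended, the same standard-library parser can read it back. This is a minimal sketch; `crawl_delay_for` is a hypothetical helper name, and it says nothing about whether the crawler actually honors the value.

```python
from urllib.robotparser import RobotFileParser

# The crawl-delay rule from the example above.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Crawl-delay: 10
"""


def crawl_delay_for(agent, robots_txt=ROBOTS_TXT):
    """Return the Crawl-delay (in seconds) that robots_txt declares for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


if __name__ == "__main__":
    print(crawl_delay_for("archive.org_bot"))
```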
How to Verify archive.org_bot
Verification Method:
A reverse DNS lookup of the requesting IP address should resolve to a hostname under archive.org.
Learn more in the official documentation.
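The reverse DNS check can be sketched in Python with `socket.gethostbyaddr`. The function names below are hypothetical, and the accepted suffix (`archive.org` and its subdomains) is an assumption drawn from the verification note above rather than from a published hostname list. The lookup function is injectable so the logic can be exercised without network access.

```python
import socket


def hostname_is_archive_org(hostname):
    """True if hostname is archive.org itself or one of its subdomains."""
    host = hostname.rstrip(".").lower()
    return host == "archive.org" or host.endswith(".archive.org")


def verify_archive_org_ip(ip, reverse_lookup=socket.gethostbyaddr):
    """Reverse-resolve `ip` and check that the hostname is under archive.org.

    `reverse_lookup` must behave like socket.gethostbyaddr and return
    a (hostname, aliases, addresses) tuple.
    """
    try:
        hostname, _aliases, _addresses = reverse_lookup(ip)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
    return hostname_is_archive_org(hostname)
```

A stricter check would also resolve the returned hostname forward and confirm that it maps back to the same IP address, since reverse DNS records are controlled by whoever owns the address block.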
Detection Patterns
Multiple ways to detect archive.org_bot in your application:
Basic Pattern
/archive\.org_bot/i
Strict Pattern
/^Mozilla\/5\.0 \(compatible; archive\.org_bot; \+http:\/\/archive\.org\/details\/archive\.org_bot\)$/
Flexible Pattern
/archive\.org_bot[\s\/]?[\d\.]*?/i
Vendor Match
/.*Internet Archive.*archive\.org_bot/i
Implementation Examples
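Before choosing one of the detection patterns, it can help to run all four against the documented user agent string. The sketch below ports them to Python's `re` syntax (no delimiter escaping needed); the dictionary keys and the `matching_patterns` helper are names invented for this example. Note that the vendor pattern expects the text "Internet Archive", which does not appear in the user agent string shown on this page, so it does not match that string.

```python
import re

# User agent string as documented above.
UA = ("Mozilla/5.0 (compatible; archive.org_bot; "
      "+http://archive.org/details/archive.org_bot)")

# Python ports of the four detection patterns.
PATTERNS = {
    "basic": re.compile(r"archive\.org_bot", re.IGNORECASE),
    "strict": re.compile(
        r"^Mozilla/5\.0 \(compatible; archive\.org_bot; "
        r"\+http://archive\.org/details/archive\.org_bot\)$"
    ),
    "flexible": re.compile(r"archive\.org_bot[\s/]?[\d.]*?", re.IGNORECASE),
    "vendor": re.compile(r".*Internet Archive.*archive\.org_bot", re.IGNORECASE),
}


def matching_patterns(user_agent):
    """Return the names of the patterns that match `user_agent`."""
    return [name for name, rx in PATTERNS.items() if rx.search(user_agent)]


if __name__ == "__main__":
    print(matching_patterns(UA))
```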
// PHP Detection for archive.org_bot
function detect_archive_org_bot() {
    $user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $pattern = '/archive\\.org_bot/i';
    if (preg_match($pattern, $user_agent)) {
        // Log the detection
        error_log('archive.org_bot detected from IP: ' . $_SERVER['REMOTE_ADDR']);
        // Set cache headers
        header('Cache-Control: public, max-age=3600');
        header('X-Robots-Tag: noarchive');
        // Optional: Serve cached version
        if (file_exists('cache/' . md5($_SERVER['REQUEST_URI']) . '.html')) {
            readfile('cache/' . md5($_SERVER['REQUEST_URI']) . '.html');
            exit;
        }
        return true;
    }
    return false;
}
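# Python/Flask Detection for archive.org_bot
import re
from flask import request

# Escape the dot so it matches a literal "." only
BOT_PATTERN = re.compile(r'archive\.org_bot', re.IGNORECASE)

def detect_archive_org_bot():
    """Return True when the current Flask request comes from archive.org_bot."""
    user_agent = request.headers.get('User-Agent', '')
    return bool(BOT_PATTERN.search(user_agent))

def add_bot_headers(response):
    """Set cache headers for bot traffic; register with app.after_request(add_bot_headers)."""
    if detect_archive_org_bot():
        response.headers['Cache-Control'] = 'public, max-age=3600'
        response.headers['X-Robots-Tag'] = 'noarchive'
    return response

# Django Middleware
class archiveorg_botMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def detect_bot(self, request):
        # Django exposes the User-Agent header through request.META
        user_agent = request.META.get('HTTP_USER_AGENT', '')
        return bool(BOT_PATTERN.search(user_agent))

    def __call__(self, request):
        if self.detect_bot(request):
            # Handle bot traffic
            pass
        return self.get_response(request)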
// JavaScript/Node.js Detection for archive.org_bot
const express = require('express');
const app = express();

// Middleware to detect archive.org_bot
function detectarchiveorg_bot(req, res, next) {
    const userAgent = req.headers['user-agent'] || '';
    // Escape the dot so it matches a literal "." only
    const pattern = /archive\.org_bot/i;
    if (pattern.test(userAgent)) {
        // Log bot detection
        console.log('archive.org_bot detected from IP:', req.ip);
        // Set cache headers
        res.set({
            'Cache-Control': 'public, max-age=3600',
            'X-Robots-Tag': 'noarchive'
        });
        // Mark request as bot
        req.isBot = true;
        req.botName = 'archive.org_bot';
    }
    next();
}

app.use(detectarchiveorg_bot);
# Apache .htaccess rules for archive.org_bot

# Block completely
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteRule .* - [F,L]

# Or redirect to a static version
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
RewriteCond %{REQUEST_URI} !^/static/
RewriteRule ^(.*)$ /static/$1 [L]

# Or set environment variable for PHP
SetEnvIfNoCase User-Agent "archive\.org_bot" is_bot=1

# Add cache headers for this bot
<If "%{HTTP_USER_AGENT} =~ /archive\.org_bot/i">
    Header set Cache-Control "public, max-age=3600"
    Header set X-Robots-Tag "noarchive"
</If>
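# Nginx configuration for archive.org_bot

# Map user agent to variable (the map block belongs in the http context)
map $http_user_agent $is_archive_org_bot {
    default 0;
    ~*archive\.org_bot 1;
}

server {
    # Option 1: block the bot completely
    # (remove this block to use the caching option below)
    if ($is_archive_org_bot) {
        return 403;
    }

    # Option 2: serve cached content
    location / {
        # try_files is not allowed inside "if", so bot requests are handed
        # to a named location through an internal error_page redirect
        error_page 418 = @bot_cache;
        if ($is_archive_org_bot) {
            return 418;
        }
        try_files $uri @backend;
    }

    location @bot_cache {
        root /var/www/cached;
        try_files $uri $uri.html $uri/index.html @backend;
    }

    # Add headers for bot requests
    location @backend {
        if ($is_archive_org_bot) {
            add_header Cache-Control "public, max-age=3600";
            add_header X-Robots-Tag "noarchive";
        }
        proxy_pass http://backend;
    }
}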
Should You Block This Bot?
Recommendations based on your website type:
| Site Type | Recommendation | Reasoning |
|---|---|---|
| E-commerce | Optional | Evaluate based on bandwidth usage vs. benefits |
| Blog/News | Allow | Increases content reach and discoverability |
| SaaS Application | Block | No benefit for application interfaces; preserve resources |
| Documentation | Selective | Allow for public docs, block for internal docs |
| Corporate Site | Limit | Allow for public pages, block sensitive areas like intranets |
Advanced robots.txt Configurations
E-commerce Site Configuration
User-agent: archive.org_bot
Crawl-delay: 5
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /api/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Allow: /products/
Allow: /categories/
Sitemap: https://example.com/sitemap.xml
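The `*` wildcards in rules such as `Disallow: /*?sort=` are an extension described in RFC 9309, and Python's `urllib.robotparser` does plain prefix matching, so it will not evaluate them. The sketch below shows one way to test such rules by translating them to regular expressions. It assumes the crawler follows RFC 9309 wildcard semantics, which this page does not confirm; `rule_to_regex` and `rule_matches` are hypothetical helper names and the sample URLs are made up.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path rule into a compiled regex.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL, and everything else is matched literally as a prefix.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path, url_path):
    """True if url_path (path plus query string) falls under rule_path."""
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    print(rule_matches("/*?sort=", "/products/shoes?sort=price"))
    print(rule_matches("/*?sort=", "/products/shoes"))
```

Under these assumptions, the three wildcard rules keep sorted, filtered, and paginated listing URLs out of the archive while the plain product and category pages stay crawlable.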
Publishing/Blog Configuration
User-agent: archive.org_bot
Crawl-delay: 10
Disallow: /wp-admin/
Disallow: /drafts/
Disallow: /preview/
Disallow: /*?replytocom=
Allow: /
SaaS/Application Configuration
User-agent: archive.org_bot
Disallow: /app/
Disallow: /api/
Disallow: /dashboard/
Disallow: /settings/
Allow: /
Allow: /pricing/
Allow: /features/
Allow: /docs/
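In this configuration `Allow: /` overlaps with every `Disallow` line, so the outcome depends on how conflicts are resolved. RFC 9309 says the most specific (longest) matching rule wins, with `Allow` preferred on a tie. The sketch below applies that resolution to the SaaS rules; it assumes the crawler follows RFC 9309, which this page does not confirm, and it handles plain prefixes only (no wildcards). The names `RULES` and `is_allowed` are invented for the example.

```python
# The SaaS rules above as (kind, path prefix) pairs.
RULES = [
    ("disallow", "/app/"),
    ("disallow", "/api/"),
    ("disallow", "/dashboard/"),
    ("disallow", "/settings/"),
    ("allow", "/"),
    ("allow", "/pricing/"),
    ("allow", "/features/"),
    ("allow", "/docs/"),
]


def is_allowed(path, rules=RULES):
    """Resolve overlapping prefix rules by longest match; Allow wins ties."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    for path in ("/app/login", "/pricing/", "/", "/dashboard/billing"):
        print(path, is_allowed(path))
```

With longest-match resolution the order of the lines does not matter: `/app/login` is blocked because `/app/` is more specific than `/`, while marketing pages stay open.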
Quick Reference
User Agent Match: archive.org_bot
Robots.txt Name: archive.org_bot
Category: other
Respects robots.txt: Yes
