What is Algolia Crawler?
Algolia Crawler is the crawler behind Algolia's hosted search-as-a-service platform. It indexes website content so that the many sites and applications using Algolia for their internal search can keep their indices up to date. Unlike general-purpose crawlers, it is typically configured per customer to crawl specific websites that have implemented Algolia search, focusing on the content that needs to be searchable. Each crawler can be customized to extract specific data and follow custom crawling rules, which makes it flexible across search implementations; a configuration sketch follows below.
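For a sense of that flexibility, here is a minimal sketch of a crawler configuration in Algolia's documented JavaScript config format. The URLs, index names, and CSS selectors are placeholders, and exact fields may vary by product version:

// Contents of an Algolia crawler configuration (edited in the Algolia dashboard)
new Crawler({
  appId: 'YOUR_APP_ID',            // placeholder application ID
  apiKey: 'YOUR_CRAWLER_API_KEY',  // placeholder API key
  indexPrefix: 'crawler_',         // prefix applied to generated indices
  startUrls: ['https://www.example.com'],
  actions: [
    {
      indexName: 'docs',
      pathsToMatch: ['https://www.example.com/docs/**'],
      // Build one search record per crawled page from the parsed DOM
      recordExtractor: ({ url, $ }) => [
        {
          objectID: url.href,
          title: $('h1').first().text(),
          content: $('main p').text(),
        },
      ],
    },
  ],
});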
User Agent String
Algolia Crawler
How to Control Algolia Crawler
Block Completely
To prevent Algolia Crawler from accessing your entire website, add this to your robots.txt file:
User-agent: Algolia Crawler
Disallow: /
Block Specific Directories
To restrict access to certain parts of your site while allowing others:
User-agent: Algolia Crawler
Disallow: /admin/
Disallow: /private/
Disallow: /wp-admin/
Allow: /public/
Set Crawl Delay
To slow down the crawl rate (note: not all bots respect this directive):
User-agent: Algolia Crawler
Crawl-delay: 10
How to Verify Algolia Crawler
Verification Method:
Algolia Crawler is configured for specific Algolia customers, so legitimate traffic should correspond to a crawler that you, or a vendor you work with, have actually set up. Learn more in the official documentation.
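A practical, if indirect, check is to log requests that claim this user agent and cross-reference them against crawl runs in your own Algolia dashboard. A minimal sketch as Express-style middleware (the log file name is an arbitrary choice for illustration):

const fs = require('fs');

// Append each hit claiming the Algolia Crawler user agent to a log file
// so entries can be cross-checked against your own crawl schedule.
function logAlgoliaHits(req, res, next) {
  const userAgent = req.headers['user-agent'] || '';
  if (/Algolia Crawler/i.test(userAgent)) {
    const entry = `${new Date().toISOString()} ${req.ip} ${req.originalUrl}\n`;
    fs.appendFile('algolia-crawler-hits.log', entry, () => {});
  }
  next();
}

// app.use(logAlgoliaHits);

If you have never configured an Algolia crawler for your site, any traffic in this log is claiming an identity it should not have.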
Detection Patterns
Multiple ways to detect Algolia Crawler in your application:
Basic Pattern
/Algolia Crawler/i
Strict Pattern
/^Algolia Crawler$/
Flexible Pattern
/Algolia Crawler[\s\/]?[\d\.]*/i
Vendor Match
/Algolia/i
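The patterns differ mainly in how much of the user agent string they require to match. A quick sketch comparing them against sample strings (the sample user agents are invented for illustration):

const patterns = {
  basic: /Algolia Crawler/i,
  strict: /^Algolia Crawler$/,
  flexible: /Algolia Crawler[\s\/]?[\d\.]*/i,
  vendor: /Algolia/i,
};

const samples = [
  'Algolia Crawler',            // exact token: matches all four patterns
  'Algolia Crawler/2.0',        // versioned: fails only the strict pattern
  'Mozilla/5.0 (compatible; Algolia Crawler)', // embedded: fails strict
];

for (const ua of samples) {
  for (const [name, re] of Object.entries(patterns)) {
    console.log(`${name.padEnd(8)} ${re.test(ua)}  ${ua}`);
  }
}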
Implementation Examples
PHP
function detect_algolia_crawler() {
    $user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';

    if (preg_match('/Algolia Crawler/i', $user_agent)) {
        // Log the detection
        error_log('Algolia Crawler detected from IP: ' . $_SERVER['REMOTE_ADDR']);

        // Set cache headers
        header('Cache-Control: public, max-age=3600');
        header('X-Robots-Tag: noarchive');

        // Optional: serve a cached version if one exists
        $cache_file = 'cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
        if (file_exists($cache_file)) {
            readfile($cache_file);
            exit;
        }

        return true;
    }

    return false;
}
Python
import re

from flask import request

BOT_PATTERN = re.compile(r'Algolia Crawler', re.IGNORECASE)


def detect_algolia_crawler():
    """Return True if the current request's User-Agent matches Algolia Crawler."""
    user_agent = request.headers.get('User-Agent', '')
    return bool(BOT_PATTERN.search(user_agent))


def add_bot_headers(response):
    """Flask after_request hook: set cache headers on responses to the bot."""
    if detect_algolia_crawler():
        response.headers['Cache-Control'] = 'public, max-age=3600'
        response.headers['X-Robots-Tag'] = 'noarchive'
    return response


class AlgoliaCrawlerMiddleware:
    """Django-style middleware variant of the same check."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if self.detect_bot(request):
            # Handle bot traffic (e.g. tag the request for logging)
            request.is_bot = True
        return self.get_response(request)

    def detect_bot(self, request):
        user_agent = request.META.get('HTTP_USER_AGENT', '')
        return bool(BOT_PATTERN.search(user_agent))
JavaScript
const express = require('express');
const app = express();

// Middleware to detect Algolia Crawler
function detectAlgoliaCrawler(req, res, next) {
  const userAgent = req.headers['user-agent'] || '';
  const pattern = /Algolia Crawler/i;

  if (pattern.test(userAgent)) {
    // Log bot detection
    console.log('Algolia Crawler detected from IP:', req.ip);

    // Set cache headers
    res.set({
      'Cache-Control': 'public, max-age=3600',
      'X-Robots-Tag': 'noarchive'
    });

    // Mark request as bot
    req.isBot = true;
    req.botName = 'Algolia Crawler';
  }

  next();
}

app.use(detectAlgoliaCrawler);
.htaccess
RewriteEngine On

# Option 1: block Algolia Crawler outright (quotes are required because
# the user agent contains a space)
RewriteCond %{HTTP_USER_AGENT} "Algolia Crawler" [NC]
RewriteRule .* - [F,L]

# Option 2 (alternative to Option 1): serve static cached copies instead
# RewriteCond %{HTTP_USER_AGENT} "Algolia Crawler" [NC]
# RewriteCond %{REQUEST_URI} !^/static/
# RewriteRule ^(.*)$ /static/$1 [L]

# Tag bot requests and set cache headers
SetEnvIfNoCase User-Agent "Algolia Crawler" is_bot=1
<If "%{HTTP_USER_AGENT} =~ /Algolia Crawler/i">
    Header set Cache-Control "public, max-age=3600"
    Header set X-Robots-Tag "noarchive"
</If>
Nginx
map $http_user_agent $is_algolia_crawler {
    default 0;
    "~*Algolia Crawler" 1;
}

server {
    # Option 1: block the bot outright. Uncomment to use instead of the
    # cached-copy handling below; the two approaches are alternatives.
    # if ($is_algolia_crawler) {
    #     return 403;
    # }

    location / {
        # nginx does not allow try_files inside "if", so hand bot
        # requests off to a named location via the error_page trick.
        error_page 418 = @bot;
        if ($is_algolia_crawler) {
            return 418;
        }
        try_files $uri @backend;
    }

    location @bot {
        root /var/www/cached;
        add_header Cache-Control "public, max-age=3600";
        add_header X-Robots-Tag "noarchive";
        try_files $uri $uri.html $uri/index.html @backend;
    }

    location @backend {
        proxy_pass http://backend;
    }
}
Should You Block This Bot?
Recommendations based on your website type:
E-commerce: Optional. Evaluate based on bandwidth usage vs. benefits.
Blog/News: Allow. Increases content reach and discoverability.
SaaS Application: Block. No benefit for application interfaces; preserve resources.
Documentation: Selective. Allow for public docs, block for internal docs.
Corporate Site: Limit. Allow for public pages, block sensitive areas like intranets.
Advanced robots.txt Configurations
E-commerce Site Configuration
User-agent: Algolia Crawler
Crawl-delay: 5
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /api/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Allow: /products/
Allow: /categories/
Sitemap: https://example.com/sitemap.xml
Publishing/Blog Configuration
User-agent: Algolia Crawler
Crawl-delay: 10
Disallow: /wp-admin/
Disallow: /drafts/
Disallow: /preview/
Disallow: /*?replytocom=
Allow: /
SaaS/Application Configuration
User-agent: Algolia Crawler
Disallow: /app/
Disallow: /api/
Disallow: /dashboard/
Disallow: /settings/
Allow: /
Allow: /pricing/
Allow: /features/
Allow: /docs/
Quick Reference
User Agent Match: Algolia Crawler
Robots.txt Name: Algolia Crawler
Category: Other
Respects robots.txt: Yes