User Agent Database | Bot & Crawler User Agent Strings

User Agent String Database & Bot Directory

A comprehensive, regularly updated database of 189 verified user agent strings used by web crawlers, bots, and spiders. Identify AI crawlers like GPTBot and ClaudeBot, search engine spiders like Googlebot and Bingbot, SEO tools, social media preview bots, and more.

Use this directory to look up any bot's user agent string, find its robots.txt name for blocking or allowing access, verify its vendor, and understand its crawling behavior. Each entry includes the full user agent string you can copy directly into your server configuration or analytics filters.

  • 189 total user agents
  • 40 AI crawlers
  • 49 search engine bots
  • 69 SEO, social & monitoring bots

GPTBot (OpenAI)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
#ai-training #chatgpt #crawler #gpt
robots.txt: GPTBot

ChatGPT-User (OpenAI)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)
#ai #chatgpt #browsing #plugin
robots.txt: ChatGPT-User

OAI-SearchBot (OpenAI)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)
#ai #search #openai #crawler
robots.txt: OAI-SearchBot

Claude-Web (Anthropic)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-Web/1.0; +https://www.anthropic.com)
#ai #claude #anthropic #crawler
robots.txt: Claude-Web

ClaudeBot (Anthropic)
ClaudeBot/1.0; +https://www.anthropic.com
#ai #claude #anthropic #training
robots.txt: ClaudeBot

CCBot (Common Crawl)
CCBot/2.0 (https://commoncrawl.org/faq/)
#dataset #ai-training #crawler #open-data
robots.txt: CCBot

PerplexityBot (Perplexity AI)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/bot)
#ai #search #answer-engine #crawler
robots.txt: PerplexityBot

Google-Extended (Google)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Google-Extended/1.0; +https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers)
#ai #bard #google #training
robots.txt: Google-Extended

Meta-ExternalAgent (Meta)
Meta-ExternalAgent/1.0 (+https://developers.facebook.com/docs/sharing/bot)
#ai #meta #facebook #training
robots.txt: Meta-ExternalAgent

Amazonbot (Amazon)
Amazonbot/0.1 (+https://developer.amazon.com/support/amazonbot)
#amazon #alexa #ai #crawler
robots.txt: Amazonbot

Googlebot (Google)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
#search #google #crawler #indexing
robots.txt: Googlebot

Googlebot-Image (Google)
Googlebot-Image/1.0
#search #google #images #crawler
robots.txt: Googlebot-Image

Googlebot-Video (Google)
Googlebot-Video/1.0
#search #google #video #crawler
robots.txt: Googlebot-Video

Googlebot-News (Google)
Googlebot-News
#search #google #news #crawler
robots.txt: Googlebot-News

AdsBot-Google (Google)
AdsBot-Google (+http://www.google.com/adsbot.html)
#ads #google #quality-check #crawler
robots.txt: AdsBot-Google

Mediapartners-Google (Google)
Mediapartners-Google
#adsense #google #ads #crawler
robots.txt: Mediapartners-Google

Bingbot (Microsoft)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
#search #bing #microsoft #crawler
robots.txt: bingbot

BingPreview (Microsoft)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
#search #bing #preview #snapshot
robots.txt: BingPreview

DuckDuckBot (DuckDuckGo)
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
#search #privacy #duckduckgo #crawler
robots.txt: DuckDuckBot

DuckAssistBot (DuckDuckGo)
Mozilla/5.0 (compatible; DuckAssistBot/1.0; +https://duckduckgo.com/duckassist)
#ai #assistant #duckduckgo #privacy
robots.txt: DuckAssistBot

YandexBot (Yandex)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
#search #yandex #russian #crawler
robots.txt: YandexBot

Baiduspider (Baidu)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
#search #baidu #chinese #crawler
robots.txt: Baiduspider

Applebot (Apple)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
#search #apple #siri #spotlight
robots.txt: Applebot

Applebot-Extended (Apple)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot-Extended/0.1; +http://www.apple.com/go/applebot)
#ai #apple #training #crawler
robots.txt: Applebot-Extended

What Is a User Agent String?

A user agent string is an HTTP header that identifies the client making a request to a web server. Every time a browser, bot, or crawler visits a web page, it sends a user agent string that reveals what software is making the request. This allows web servers to serve appropriate content, log traffic sources, and enforce access rules.

Anatomy of a Bot User Agent String

Bot user agent strings generally follow a common format. Here is the structure of Googlebot's user agent string:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Platform token — compatibility string inherited from early web browsers

Compatible flag — indicates this is a bot, not a regular browser

Bot name & version — the crawler's identity used in robots.txt rules

Documentation URL — link to the bot's official information page

Not all bots follow this format. Some use minimal identifiers (e.g., Bytespider), while others mimic full browser user agent strings. The key identifier for robots.txt rules is the bot name portion, which is listed as the "robots.txt name" in each entry of this directory.
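The bot-name token can also be pulled out programmatically. The Python sketch below is a heuristic, not an official parser (the regexes and function name are illustrative): it tries the common "compatible; Name/version" token first, then falls back to a leading "Name/version" identifier like CCBot's.

```python
import re
from typing import Optional

# Heuristic patterns: a "compatible; Name/version" token, or a UA string
# that simply leads with "Name/version" (as CCBot/2.0 does).
COMPATIBLE_TOKEN = re.compile(r"compatible;\s*([A-Za-z0-9_.\-]+)")
LEADING_TOKEN = re.compile(r"^([A-Za-z0-9_.\-]+)/[\d.]+")

def bot_name(user_agent: str) -> Optional[str]:
    """Return the bot-name portion of a UA string, or None if not found."""
    match = COMPATIBLE_TOKEN.search(user_agent) or LEADING_TOKEN.match(user_agent)
    return match.group(1) if match else None

print(bot_name("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # Googlebot
print(bot_name("CCBot/2.0 (https://commoncrawl.org/faq/)"))  # CCBot
```

Because some bots mimic full browser strings, a heuristic like this should be combined with the verification methods described later, not trusted on its own.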

Types of Web Crawlers & Bots

Web bots serve many different purposes. Understanding which category a bot belongs to helps you decide whether to allow or block it. This directory organizes bots into the following categories:

AI & LLM Crawlers

Bots operated by AI companies to collect training data for large language models, or to fetch web content in real-time when users interact with AI assistants. Includes bots from OpenAI (GPTBot), Anthropic (ClaudeBot), Meta (FacebookBot), Google (Gemini-Deep-Research), and others.

40 bots in this category

Search Engine Bots

Crawlers from search engines that index web pages to serve search results. These are the most well-known bots on the web and include Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot, and others. Blocking these bots will remove your site from their search results.

49 bots in this category

SEO & Marketing Tools

Bots from SEO platforms that crawl websites to analyze backlinks, track rankings, audit technical issues, and monitor competitor sites. Includes crawlers from Ahrefs, Semrush, Moz, Screaming Frog, Majestic, and others commonly seen in server logs.

30 bots in this category

Social Media Bots

Fetchers that retrieve page metadata when URLs are shared on social platforms. They generate the link preview cards showing titles, descriptions, and thumbnails. Includes bots from Facebook, Twitter/X, LinkedIn, Discord, Slack, Telegram, and others.

15 bots in this category

Monitoring & Performance

Synthetic monitoring agents that test website availability, performance, and functionality. They run scheduled checks from global locations to detect outages and measure response times. Includes tools like UptimeRobot, Pingdom, Datadog, and others.

17 bots in this category

Security Scanners

Vulnerability scanners and security auditing tools that test web applications for known weaknesses. These tools typically do not respect robots.txt as they need to test all accessible endpoints. Includes Nessus, Qualys, Nikto, OpenVAS, and CensysInspect.

7 bots in this category

How to Manage Bot Access with robots.txt

The robots.txt file is the standard mechanism for controlling which bots can crawl your website. Place it at the root of your domain (e.g., https://example.com/robots.txt) and define rules for each bot using its robots.txt name. Here is an example configuration:

# Allow search engines full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow everything else
User-agent: *
Allow: /

Key Points About robots.txt

  • Voluntary compliance: robots.txt is a protocol, not a security measure. Legitimate bots from major companies respect it, but malicious scrapers may ignore it entirely.
  • Bot name matching: The User-agent value must match the bot's robots.txt name exactly. Use the "robots.txt name" field shown in each bot entry in this directory.
  • Specificity matters: More specific rules take precedence over general ones. A rule for Googlebot-Image overrides a wildcard rule.
  • Crawl-delay: Some bots support a Crawl-delay directive that limits how frequently they request pages, reducing server load.
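Python's standard library can evaluate rules like these directly, which is handy for testing a policy before deploying it. The sketch below feeds an in-memory robots.txt (a hypothetical policy, not this site's) to urllib.robotparser and demonstrates the matching behavior described above:

```python
from urllib import robotparser

# A hypothetical robots.txt, parsed from memory to avoid a network fetch.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot-Image
Disallow: /private/

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/articles/page.html"))        # False: blocked site-wide
print(rp.can_fetch("Googlebot-Image", "/private/img.png"))  # False: specific path rule
print(rp.can_fetch("SomeOtherBot", "/articles/page.html"))  # True: wildcard allow
```

In production you would point set_url() at your live robots.txt and call read() instead of parse(); the matching logic is the same.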

AI Crawlers: Understanding the New Wave of Web Bots

Since 2023, a significant new category of web crawlers has emerged: AI training bots. Companies like OpenAI, Anthropic, Google, Meta, and Cohere now operate crawlers that collect web content to train and improve their large language models (LLMs). This has created new challenges for website owners who need to decide whether their content should be used for AI training.

Types of AI Crawlers

AI-related bots generally fall into three categories:

  • Training data crawlers collect web content at scale for model training. Examples include GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), Bytespider (ByteDance), and FacebookBot (Meta). Blocking these prevents your content from being included in future training datasets.
  • User-action fetchers retrieve web pages in real-time when a user asks an AI assistant for current information. Examples include ChatGPT-User, Claude-User, Perplexity-User, and MistralAI-User. Blocking these prevents AI assistants from accessing your content during conversations.
  • AI search crawlers index content specifically for AI-powered search products. Examples include OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), and Google-CloudVertexBot. Blocking these may affect your visibility in AI search results.

How to Control AI Crawler Access

Most reputable AI companies respect robots.txt directives. You can selectively block training crawlers while still allowing user-action fetchers if you want your content to be referenceable but not used for training. Each AI bot entry in this directory specifies whether it respects robots.txt and what its specific purpose is, helping you make informed decisions about which bots to allow.
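For example, a policy along these lines (bot names taken from the entries in this directory; adjust the list to your own preferences) blocks training-data crawlers while leaving real-time fetchers and AI search crawlers untouched:

```
# Block training-data collection
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow real-time user fetchers and AI search indexing
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```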

Frequently Asked Questions About User Agents & Web Crawlers

What is a user agent string?

A user agent string is an identifier that web browsers, bots, and crawlers send to web servers with every HTTP request. It tells the server what software is making the request, including the application name, version, and sometimes the operating system. For example, Googlebot identifies itself as "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" so that website owners can recognize it as Google's search engine crawler.

How do I identify which bots are crawling my website?

You can identify bots crawling your site by checking your server access logs, which record the user agent string for every request. Look for known bot identifiers like Googlebot, Bingbot, GPTBot, or ClaudeBot. You can also use analytics tools that filter bot traffic, or set up server-side logging to flag requests from known crawler user agent strings. This directory catalogs all major bots to help you identify them.
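A first pass over access logs can be as simple as substring-matching known bot tokens against each log line. The Python sketch below is illustrative (the bot list is a small sample and the log lines are made up; real logs vary by server format):

```python
from collections import Counter

# A small sample of bot-name tokens to look for; extend from this directory.
KNOWN_BOTS = ["Googlebot", "bingbot", "GPTBot", "ClaudeBot", "CCBot", "AhrefsBot"]

def count_bot_hits(log_lines):
    """Tally requests per known bot by substring-matching the UA field."""
    hits = Counter()
    for line in log_lines:
        for bot in KNOWN_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # count each request once
    return hits

sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '5.6.7.8 - - [01/Jan/2025] "GET /a HTTP/1.1" 200 "-" "GPTBot/1.0"',
]
print(count_bot_hits(sample))
```

Substring matching trusts the claimed user agent; pair it with IP verification (see below) when the distinction matters.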

How do I block AI crawlers like GPTBot or ClaudeBot from my website?

To block AI crawlers, add disallow rules to your robots.txt file. For example, to block OpenAI's GPTBot, add "User-agent: GPTBot" followed by "Disallow: /" on the next line. You can block multiple AI crawlers by adding separate rules for each bot (GPTBot, ClaudeBot, CCBot, Bytespider, etc.). Note that robots.txt relies on voluntary compliance — most reputable AI companies respect these directives, but it is not a guaranteed enforcement mechanism.

What is robots.txt and how does it work?

robots.txt is a plain text file placed in the root directory of a website (e.g., example.com/robots.txt) that provides instructions to web crawlers about which pages or sections of the site they are allowed or not allowed to access. It uses a simple syntax with "User-agent" to specify which bot the rule applies to and "Disallow" or "Allow" to define access permissions. While it is an industry standard respected by major search engines and legitimate bots, it is advisory rather than enforceable.

How do I know if a crawler respects robots.txt?

Reputable crawlers from established companies like Google, Microsoft, and most AI companies typically respect robots.txt. You can verify compliance by adding a disallow rule for a specific bot and then monitoring your server logs to see if that bot continues to access blocked URLs. Each bot entry in this directory includes a "respects robots.txt" field to help you understand the expected behavior. You can also use the verification methods listed for each bot to confirm the crawler's identity is genuine.

What is the difference between a web crawler and a web scraper?

A web crawler (or spider) systematically browses the web to index content, typically following links from page to page. Search engines like Google use crawlers to discover and index web pages. A web scraper extracts specific data from web pages for a particular purpose such as price monitoring or data analysis. The key difference is intent: crawlers aim to index and discover content broadly, while scrapers target specific data from specific pages. Many AI bots function as crawlers that collect training data at scale.

Why are there so many different Google crawlers?

Google uses specialized crawlers for different purposes: Googlebot handles general web search indexing, Googlebot-Image focuses on image search, Googlebot-Video targets video content, AdsBot-Google checks ad landing page quality, Google-Extended collects data for AI training, and Storebot-Google indexes product and shopping content. Each crawler can be independently controlled via robots.txt, giving website owners granular control over which Google services can access their content.
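A robots.txt along these lines (a hypothetical policy, using the robots.txt names listed in this directory) illustrates that granular control: it keeps web search indexing while opting out of AI training and image indexing:

```
# Keep general web search
User-agent: Googlebot
Allow: /

# Opt out of Google AI training
User-agent: Google-Extended
Disallow: /

# Opt out of image search
User-agent: Googlebot-Image
Disallow: /
```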

Should I block all bots from my website?

Blocking all bots is generally not recommended, as it would prevent search engines from indexing your site, remove you from search results, and stop social media platforms from generating link previews. Instead, take a selective approach: allow search engine bots and social media crawlers that benefit your visibility, while blocking unwanted bots such as aggressive scrapers or AI training crawlers if you prefer not to contribute to their datasets. Review each bot's purpose before deciding whether to allow or block it.

How to Use This User Agent Database

For Website Owners

  • Create robots.txt rules: Look up the robots.txt name for any bot and add allow or disallow rules to control access to your content.
  • Filter analytics data: Exclude known bot user agent strings from your analytics reports to get accurate visitor counts.
  • Configure rate limiting: Set up server-side rate limits for aggressive crawlers to protect your server resources.
  • Audit your traffic: Cross-reference your server logs with this database to identify which bots are crawling your site and how often.

For Developers & Security Teams

  • Bot detection: Use the user agent strings in this database to build detection logic that identifies and classifies bot traffic in your applications.
  • Firewall rules: Create WAF rules to block unwanted bots at the network level before they reach your application servers.
  • Verification: Use the verification methods listed for each bot to confirm that a request actually comes from the claimed bot and not a spoofed user agent.
  • Monitoring: Track the appearance of new or unknown bots in your logs to identify potential scraping or scanning activity.
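Google, Microsoft, and several other vendors document a reverse-then-forward DNS check for verifying their crawlers: look up the PTR record for the requesting IP, confirm the hostname falls under the vendor's domain, then resolve that hostname forward and confirm it maps back to the same IP. A Python sketch of that check (function names are illustrative, the domain list here covers Googlebot only, and the full check requires live DNS):

```python
import socket

# Domains Google documents for Googlebot PTR records.
GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def ptr_matches(hostname: str, allowed_suffixes=GOOGLEBOT_DOMAINS) -> bool:
    """Pure check: does the PTR hostname fall under an allowed domain?"""
    return hostname.rstrip(".").endswith(allowed_suffixes)

def verify_crawler_ip(ip: str, allowed_suffixes=GOOGLEBOT_DOMAINS) -> bool:
    """Reverse lookup the IP, check the hostname suffix, then forward
    resolve the hostname and require it to map back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not ptr_matches(hostname, allowed_suffixes):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

The forward lookup matters: a PTR record alone can be set to any value by whoever controls the IP's reverse zone, so the suffix check without the round trip is spoofable.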