What is Diffbot?
Diffbot is a web crawler that extracts and structures website content using AI-powered visual understanding, providing knowledge graph data for applications like market intelligence, e-commerce, and AI model training. You can use Known Agents (formerly Dark Visitors) Agent Analytics to see when Diffbot visits your website.
Agent Type
AI Data Provider
Expected Behavior
AI data providers are API services that crawl, scrape, and index the web to supply structured data to AI models, agents, and applications. They act as intermediaries between the open web and AI systems, converting web content into LLM-ready formats for training, retrieval-augmented generation (RAG), search, and other AI workflows. Traffic from these services can be high-volume and systematic, as they maintain their own indexes or crawl on-demand in response to API requests from their customers. A single provider may serve thousands of downstream AI applications, amplifying the reach of each crawl.
Detail
| Operated By | Diffbot |
| Last Updated | 7 hours ago |
Top Website Blocking Trend Over Time: the percentage of the world's top 1,000 websites that block Diffbot (chart not reproduced).
Overall AI Data Provider Traffic: the percentage of all internet traffic coming from AI data providers (chart not reproduced).
User Agent String
| Example | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.4238.8 Safari/537.36 Diffbot-User/0.1 (+http://www.diffbot.com) |
Access other known user agent strings and recent IP addresses using the API.
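As a rough illustration, a server-side check could look for the Diffbot token in a request's User-Agent header. The Python sketch below is only that: the function name `looks_like_diffbot` and the case-insensitive substring match are assumptions made for this example, not a documented matching rule. A match also does not prove the visit is genuine, because user agent strings can be spoofed.

```python
def looks_like_diffbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the Diffbot token.

    Assumption: a case-insensitive substring match on "diffbot" is enough
    to flag the example string shown above. It says nothing about whether
    the request really came from Diffbot.
    """
    return "diffbot" in user_agent.lower()


# The example user agent string from the table above.
example = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.4238.8 Safari/537.36 "
    "Diffbot-User/0.1 (+http://www.diffbot.com)"
)

print(looks_like_diffbot(example))
```

A check like this is suitable for logging or rough traffic counts; it should not be treated as authentication.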
Robots.txt
In this example, all pages are blocked. You can customize which pages are off-limits by swapping out / for a different disallowed path.
User-agent: Diffbot # https://knownagents.com/agents/diffbot
Disallow: /
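To sanity-check rules like these before deploying them, one option is Python's standard `urllib.robotparser` module. The sketch below parses the block-everything rule above and a variant with a narrower path. The `/private/` path, the `example.com` URLs, and the helper name `build_parser` are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The rule from the example above: every path is disallowed for Diffbot.
block_all = build_parser(
    "User-agent: Diffbot # https://knownagents.com/agents/diffbot\n"
    "Disallow: /\n"
)

# Hypothetical variant: only paths under /private/ are disallowed.
block_private = build_parser(
    "User-agent: Diffbot\n"
    "Disallow: /private/\n"
)

for label, parser in (("block all", block_all), ("block /private/", block_private)):
    for url in ("https://example.com/blog/post", "https://example.com/private/report"):
        print(label, url, parser.can_fetch("Diffbot", url))
```

This only shows how a compliant parser interprets the file. Whether a given crawler actually honors it is a separate question.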
Frequently Asked Questions About Diffbot
Should I Block Diffbot?
Consider your priorities. Diffbot crawls websites on behalf of its customers to supply data for AI training, search, and retrieval-augmented generation. Your content may be redistributed to many downstream AI applications through a single provider. You may want to block it if you're concerned about how your content is being used across those systems, or allow it if you value the discoverability and reach it can provide.
How Do I Block Diffbot?
You can block or limit Diffbot's access by configuring user agent token rules in your robots.txt file. The best way to do this is with Automatic Robots.txt, which updates automatically as new agents are discovered. While the vast majority of agents operated by reputable companies honor robots.txt directives, bad actors may ignore them entirely. In that case, you'll need alternative blocking methods such as firewall rules or server-level restrictions. You can verify whether Diffbot is respecting your rules by setting up Agent Analytics to monitor its visits to your website.
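For agents that ignore robots.txt, a server-level restriction might take the form of a small filter in the application itself. The following is a minimal sketch of that idea as Python WSGI middleware, under stated assumptions: the class name `BlockAgentsMiddleware`, the `BLOCKED_TOKENS` tuple, and the substring match on the User-Agent header are all hypothetical, and a client that spoofs its user agent would pass straight through.

```python
# Assumption: matching this lowercase token in the User-Agent header is
# how we choose to identify the agent. It is not an official rule.
BLOCKED_TOKENS = ("diffbot",)


class BlockAgentsMiddleware:
    """WSGI middleware that answers 403 for blocked user agent tokens."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked_tokens = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked_tokens):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everything else is passed to the wrapped application unchanged.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    body = b"Hello"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


app = BlockAgentsMiddleware(demo_app)
```

Filtering in application code is the simplest place to experiment. The same rule is often cheaper to enforce at a reverse proxy or firewall, which stops the request before it reaches the application.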
Will Blocking Diffbot Hurt My SEO?
Blocking AI data providers has no direct impact on traditional SEO rankings since they don't control search engine indexing. However, these services feed content into AI search engines, RAG pipelines, and conversational AI platforms. Blocking them could reduce your content's representation across multiple AI-powered discovery channels simultaneously, since a single provider may supply data to many downstream applications.
Does Diffbot Access Private Content?
AI data providers typically crawl publicly accessible web content to build their indexes and fulfill API requests. Some providers operate large-scale proxy networks and may attempt to access content aggressively or bypass rate limits. The scope depends on what their customers request and the provider's own indexing priorities. Most focus on public content, but their scale and the diversity of downstream use cases mean your content could be accessed more broadly than with a single-purpose crawler.
How Can I Tell if Diffbot Is Visiting My Website?
Setting up Agent Analytics gives you real-time visibility into Diffbot's visits to your website, along with those of hundreds of other AI agents, crawlers, and scrapers. It also lets you measure human traffic arriving at your website from AI search and chat platforms like ChatGPT, Perplexity, and Gemini.
Why Is Diffbot Visiting My Website?
Diffbot crawled your site to fulfill data requests from its customers or to build and maintain its own web index. Your site was likely identified as containing content relevant to AI training datasets, search indexes, or retrieval-augmented generation pipelines. The crawl may have been triggered by a specific customer API request or as part of the provider's broader web indexing efforts.
How Can I Authenticate Visits From Diffbot?
Agent Analytics authenticates visits from many agents, telling you whether each visit actually came from the agent it claims to be or was spoofed by a bad actor. This helps you identify suspicious traffic patterns and make informed decisions about blocking or allowing specific user agents.