We tested four AI documentation tools - ProdE, DeepWiki (by Cognition, the company behind Devin), Google Code Wiki, and Claude Code - on three open-source codebases. We scored them for AI agents and human developers separately. Nine evaluation passes each.
No benchmark existed for AI documentation tools. So we built one and open-sourced it — methodology, raw data, evaluation reports, all MIT licensed. Clone it, run it, verify every claim.
Here's how they ranked for agent utility.
## What mattered
Documentation serves two audiences that want different things. Agents need dense source references and comprehensive coverage. Humans need clear prose, diagrams, and logical structure. We scored both:
- Agent Score = average of Completeness, Correctness, and Referencing
- Human Score = average of Completeness, Correctness, and Presentation
| Tool | Agent Score | Human Score |
|---|---|---|
| ProdE | 8.7 | 8.2 |
| DeepWiki | 7.6 | 7.9 |
| Google Code Wiki | 6.3 | 5.4 |
| Claude Code | 6.2 | 7.1 |
ProdE leads both tracks. Highest agent score (8.7) and highest human score (8.2). Not a trade-off - it wins on both.
Here's every dimension:
| Tool | Completeness | Correctness | Referencing | Presentation |
|---|---|---|---|---|
| ProdE | 9.0 | 8.1 | 9.0 | 7.7 |
| DeepWiki | 7.9 | 7.7 | 7.1 | 8.1 |
| Claude Code | 6.3 | 7.9 | 4.2 | 7.0 |
| Google Code Wiki | 5.2 | 7.0 | 6.8 | 4.1 |
ProdE leads three of four dimensions. DeepWiki leads in Presentation - its diagrams and prose are polished, and presentation is an area we're actively improving in ProdE. But on the dimensions that determine agent utility - Completeness and Referencing - ProdE's lead is decisive.
A note on Correctness (8.1): zero hallucinations were detected across all 9 evaluations - every spot-checked file path, symbol reference, and behavioral claim pointed to real code. The score stops short of 10 only because, with 100+ files per project and thousands of claims, exhaustive verification isn't possible in a single pass. The evaluator described ProdE's references as "the most precise of the four sets."
Jump to the full benchmark details - methodology, per-dimension breakdown, and per-project scores.
## Why documentation is agent infrastructure
In agentic development, documentation isn't just helpful - it's the retrieval surface agents query to navigate codebases, answer questions, and generate code.
Consider an enterprise with 200+ repositories. A developer asks their AI agent: "how does the payment service validate transactions?"
With well-referenced documentation, the agent finds a citation like validators.py:validate_transaction and jumps straight to the implementation. Two documentation qualities make this work:
- Completeness - An agent can only answer questions about topics that are documented. Coverage gaps become hallucination gaps. In an enterprise with hundreds of services, incomplete documentation means entire services are invisible to agents.
- Referencing - Source citations tell agents which files and functions to examine. When docs reference payment/validators.py:validate_transaction, the agent goes straight there. Without references, agents produce confident but unverifiable answers. Good references narrow the search space from thousands of files to the handful that matter.
## The benchmark
We generated documentation for three well-known open-source projects:
| Project | What it tests |
|---|---|
| FastAPI | Web framework - routing, dependency injection, middleware, OpenAPI generation |
| Pydantic | Data validation - Rust internals, schema generation, type system |
| Mermaid | Diagramming tool - parser toolchain, rendering pipeline, accessibility |
For each project, we cloned the actual source repository so every claim could be verified against real code. Each documentation set was scored on four dimensions: Completeness (coverage breadth), Correctness (technical accuracy), Referencing (source citation density and accuracy), and Presentation (prose clarity, diagrams, code examples).
We ran the evaluation three times per project - same prompt, same data, three passes - to account for LLM non-determinism. Three projects × three runs = nine evaluations per tool.
## Completeness: the answer boundary
An agent can only answer questions about topics that are documented. If the docs cover routing but not middleware, every middleware question becomes a hallucination risk. Completeness determines the boundary of what agents can accurately answer.
ProdE scored 9/10 in Completeness in every evaluation - all projects, all runs.
ProdE covers topics no other tool touches - Rust internals in Pydantic, CI/CD pipeline configuration, testing harness internals, concurrency helpers, logging conventions, type aliases. These expand the boundary of what agents can accurately answer.
For example, in FastAPI: ProdE has dedicated pages for SSE architecture (3 pages), middleware configuration (6 pages including CORS, GZip, HTTPS redirect, trusted hosts), DI internals (5 pages on the Dependant model, graph construction, and solver), and testing harness internals - none of which exist in any other tool's output.
In an enterprise with hundreds of repos, this breadth is the difference between an agent that can answer questions about any service and one that hits coverage gaps and starts guessing.
## Referencing: ProdE's decisive advantage
When a developer asks their coding assistant "how does dependency injection resolve in FastAPI?", the agent retrieves docs, synthesizes an answer, and needs to point to specific source locations. If the docs reference dependencies/utils.py:solve_dependencies, the developer can jump to that file and verify. If the docs say "the DI system resolves dependencies" without citing source, the agent generates a confident but unverifiable answer.
ProdE scored 9/10 in Referencing across all 9 evaluations - not a single run docked it below 9. This is the most stable individual score in the entire benchmark.
| Metric | ProdE | DeepWiki | Ratio |
|---|---|---|---|
| FastAPI source refs | 2,549–2,823 | 821–1,486 | 2–3x |
| Mermaid source refs | 3,295–3,462 | 217–283 | 12–15x |
| Pydantic source refs | 3,072–4,008 | 497–1,382 | 3–6x |
| FastAPI cross-refs | 770–779 | 118–420 | 2–7x |
| Mermaid cross-refs | 995–1,003 | 59–252 | 4–17x |
| Pydantic cross-refs | 936–4,000+ | 108–1,587 | 3–9x |
ProdE's structured [[symbol:repo:path:ClassName]] and [[file:repo:path]] format is machine-parseable. An agent doesn't need to regex-match file paths out of prose - it gets a structured citation that names the repository, file, and symbol. Every spot-checked reference across all three runs pointed to a real file and a real symbol.
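Because the format is regular, extraction needs no prose parsing at all. A sketch under stated assumptions - the exact grammar isn't published here, so this handles only the two shapes shown above with colon-free path segments, and the repo name "payments" is made up:

```python
import re

# Matches [[symbol:repo:path:Name]] and [[file:repo:path]] citations.
CITATION = re.compile(
    r"\[\[(?P<kind>symbol|file):(?P<repo>[^:\]]+):(?P<path>[^:\]]+)"
    r"(?::(?P<name>[^\]]+))?\]\]"
)

def extract_citations(page_text: str) -> list[dict]:
    """Return every structured citation in a docs page as a dict."""
    return [m.groupdict() for m in CITATION.finditer(page_text)]

page = "Validation lives in [[symbol:payments:payment/validators.py:validate_transaction]]."
print(extract_citations(page))
# → [{'kind': 'symbol', 'repo': 'payments',
#     'path': 'payment/validators.py', 'name': 'validate_transaction'}]
```

For a [[file:…]] citation the `name` field simply comes back `None` - the agent still gets repo and path without guessing.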
Claude Code, by contrast, averages only 2–3 source references per file (27–72 per project total). It writes excellent prose, but an agent can't trace claims back to source code. Google Code Wiki provides some references but with systematic quality issues - many point to incorrect files or missing symbols.
## Where each tool stands
### ProdE - Best overall, strongest for agents
- Densest source citations - 2,500–4,000 structured references per project with symbol-level precision
- Broadest coverage - 114–140 files per project covering Rust internals, CI/CD, testing harness, build system
- Rich cross-reference graph - 770–4,000+ wiki-style cross-references creating a navigable knowledge graph
- Most code examples - 289–413 Python blocks per project, showing both modern and classic patterns
- Most consistent - 0.3-point score spread across 9 evaluations (vs. 1.0–1.3 for competitors)
### DeepWiki - Strong presentation, moderate referencing
- Best diagrams - 248–284 Mermaid diagrams per project, ~5x more per file than ProdE
- Clear prose - well-organized, accessible technical writing; highest Presentation score (8.1)
- Moderate references - 500–1,500 source citations per project, but 2–15x fewer than ProdE
- Good completeness - 7.9/10 covering core and supporting topics
### Claude Code - Best prose, weakest referencing
- Strongest writing quality - "the strongest writing of all four sets" per the evaluator
- Highest code density per file - 7+ code blocks per file with modern syntax
- Critically sparse references - only 27–72 source references per project; agents can't trace claims to code
- Narrow coverage - 13–17 files per project, missing many topics entirely
### Google Code Wiki - Thin coverage, uneven quality
- Graphviz architecture diagrams provide useful high-level overviews
- Zero code examples across all projects - no Python, no JavaScript
- Many stub pages under 300 words, and systematic reference quality issues
- Lowest coverage - Completeness score of 5.2/10
## Per-project scores: all three runs
### FastAPI
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 5 / 5 / 5 | 6 / 7 / 7 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 7 | 7 / 6 / 7 | 8 / 8 / 7 |
| Referencing | 9 / 9 / 9 | 8 / 7 / 6 | 7 / 6 / 7 | 4 / 6 / 6 |
| Presentation | 8 / 7 / 7 | 8 / 8 / 8 | 4 / 3 / 4 | 7 / 8 / 7 |
### Pydantic
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 7 / 8 | 4 / 5 / 4 | 6 / 6 / 6 |
| Correctness | 9 / 8 / 8 | 8 / 8 / 8 | 7 / 8 / 7 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 8 / 7 | 7 / 7 / 7 | 3 / 3 / 4 |
| Presentation | 8 / 8 / 8 | 8 / 8 / 8 | 3 / 4 / 4 | 7 / 6 / 7 |
### Mermaid
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 7 / 6 / 6 | 6 / 7 / 6 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 8 | 7 / 6 / 8 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 7 / 7 | 7 / 6 / 7 | 4 / 5 / 3 |
| Presentation | 8 / 7 / 8 | 9 / 8 / 8 | 6 / 4 / 5 | 7 / 7 / 7 |
Pattern: ProdE scores 9/10 in Completeness and Referencing across every single evaluation - all three projects, all three runs. No other tool comes close to this consistency. DeepWiki's Referencing ranges from 6–8, Claude Code's from 3–6.
## Methodology
The evaluator is Claude (Opus). We used a structured evaluation prompt with explicit safeguards against common biases: don't confuse volume with quality, verify source references against actual code, flag hallucinations explicitly, score on an absolute scale, and don't penalize platform-specific formats.
Scoring: Each tool receives two composite scores. The Agent Score is the average of Completeness, Correctness, and Referencing. The Human Score is the average of Completeness, Correctness, and Presentation. Completeness and Correctness are shared - only Referencing (agent-specific) and Presentation (human-specific) differ between the two tracks. This means the gap between a tool's Agent and Human score is entirely attributable to how it performs on referencing vs. presentation.
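As a worked example, the composites reduce to a few averages. Using ProdE's FastAPI runs from the per-project tables above:

```python
from statistics import mean

# ProdE's FastAPI scores, one list entry per run (from the tables above).
runs = {
    "completeness": [9, 9, 9],
    "correctness":  [8, 8, 8],
    "referencing":  [9, 9, 9],
    "presentation": [8, 7, 7],
}

dims = {k: mean(v) for k, v in runs.items()}  # average over the 3 runs
agent = mean([dims["completeness"], dims["correctness"], dims["referencing"]])
human = mean([dims["completeness"], dims["correctness"], dims["presentation"]])
print(round(agent, 1), round(human, 1))  # → 8.7 8.1
```

The headline 8.7/8.2 numbers are this same computation averaged across all three projects.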
Blinding: Since Claude is both the evaluator and one of the tools being evaluated (Claude Code), we labeled Claude Code's output as "Doc X" in the evaluation reports. ProdE launched after the evaluator's training cutoff, so the evaluator has no pre-existing association with ProdE either - both ProdE and Claude Code are effectively unknown to the evaluator. ProdE does not use Claude models in its generation pipeline.
Every input and output is published. Run the benchmark yourself with a different evaluator model and see if the conclusions hold.
The full benchmark - all 9 evaluation reports, the evaluation rubric, documentation outputs, and source repositories - is available on GitHub.
See ProdE documentation in action on your codebase
Get a demo