We tested four AI documentation tools - ProdE, DeepWiki (by Cognition, the company behind Devin), Google Code Wiki, and Claude Code - on three open-source codebases. We scored them for AI agents and human developers separately. Nine evaluation passes each.
No benchmark existed for AI documentation tools. So we built one and open-sourced it — methodology, raw data, evaluation reports, all MIT licensed. Clone it, run it, verify every claim.
Here's how they ranked for agent utility.
## What mattered
Documentation serves two audiences that want different things. Agents need dense source references and comprehensive coverage. Humans need clear prose, diagrams, and logical structure. We scored both:
- Agent Score = average of Completeness, Correctness, and Referencing
- Human Score = average of Completeness, Correctness, and Presentation
| Tool | Agent Score | Human Score |
|---|---|---|
| ProdE | 8.7 | 8.2 |
| DeepWiki | 7.6 | 7.9 |
| Google Code Wiki | 6.3 | 5.4 |
| Claude Code | 6.2 | 7.1 |
ProdE leads both tracks. Highest agent score (8.7) and highest human score (8.2). Not a trade-off - it wins on both.
Here's every dimension:
| Tool | Completeness | Correctness | Referencing | Presentation |
|---|---|---|---|---|
| ProdE | 9.0 | 8.1 | 9.0 | 7.7 |
| DeepWiki | 7.9 | 7.7 | 7.1 | 8.1 |
| Claude Code | 6.3 | 7.9 | 4.2 | 7.0 |
| Google Code Wiki | 5.2 | 7.0 | 6.8 | 4.1 |
ProdE leads three of four dimensions. DeepWiki leads in Presentation - its diagrams and prose are polished, and presentation is an area we're actively improving in ProdE. But on the dimensions that determine agent utility - Completeness and Referencing - ProdE's lead is decisive.
A note on Correctness (8.1): zero hallucinations were detected across all 9 evaluations - every spot-checked file path, symbol reference, and behavioral claim pointed to real code. The score stops short of 10 only because, with 100+ files per project and thousands of claims, exhaustive verification isn't possible in a single pass. The evaluator described ProdE's references as "the most precise of the four sets."
Jump to the full benchmark details - methodology, per-dimension breakdown, and per-project scores.
## Why documentation is agent infrastructure
In agentic development, documentation isn't just helpful - it's the retrieval surface agents query to navigate codebases, answer questions, and generate code.
Consider an enterprise with 200+ repositories. A developer asks their AI agent: "how does the payment service validate transactions?"
With well-referenced documentation, the agent finds a citation like validators.py:validate_transaction and jumps straight to the implementation. Two documentation qualities make this work:
- Completeness - An agent can only answer questions about topics that are documented. Coverage gaps become hallucination gaps. In an enterprise with hundreds of services, incomplete documentation means entire services are invisible to agents.
- Referencing - Source citations tell agents which files and functions to examine. When docs reference payment/validators.py:validate_transaction, the agent goes straight there. Without references, agents produce confident but unverifiable answers. Good references narrow the search space from thousands of files to the handful that matter.
## The benchmark
We generated documentation for three well-known open-source projects:
| Project | What it tests |
|---|---|
| FastAPI | Web framework - routing, dependency injection, middleware, OpenAPI generation |
| Pydantic | Data validation - Rust internals, schema generation, type system |
| Mermaid | Diagramming tool - parser toolchain, rendering pipeline, accessibility |
For each project, we cloned the actual source repository so every claim could be verified against real code. Each documentation set was scored on four dimensions: Completeness (coverage breadth), Correctness (technical accuracy), Referencing (source citation density and accuracy), and Presentation (prose clarity, diagrams, code examples).
We ran the evaluation three times per project - same prompt, same data, three passes - to account for LLM non-determinism. Three projects × three runs = nine evaluations per tool.
## Completeness: the answer boundary
An agent can only answer questions about topics that are documented. If the docs cover routing but not middleware, every middleware question becomes a hallucination risk. Completeness determines the boundary of what agents can accurately answer.
ProdE scored 9/10 in Completeness in every evaluation - all projects, all runs.
ProdE covers topics no other tool touches - Rust internals in Pydantic, CI/CD pipeline configuration, testing harness internals, concurrency helpers, logging conventions, type aliases. These expand the boundary of what agents can accurately answer.
For example, in FastAPI: ProdE has dedicated pages for SSE architecture (3 pages), middleware configuration (6 pages including CORS, GZip, HTTPS redirect, trusted hosts), DI internals (5 pages on the Dependant model, graph construction, and solver), and testing harness internals - none of which exist in any other tool's output.
In an enterprise with hundreds of repos, this breadth is the difference between an agent that can answer questions about any service and one that hits coverage gaps and starts guessing.
## Referencing: ProdE's decisive advantage
When a developer asks their coding assistant "how does dependency injection resolve in FastAPI?", the agent retrieves docs, synthesizes an answer, and needs to point to specific source locations. If the docs reference dependencies/utils.py:solve_dependencies, the developer can jump to that file and verify. If the docs say "the DI system resolves dependencies" without citing source, the agent generates a confident but unverifiable answer.
ProdE scored 9/10 in Referencing across all 9 evaluations - not a single run docked it below 9. This is the most stable individual score in the entire benchmark.
| Metric | ProdE | DeepWiki | Ratio |
|---|---|---|---|
| FastAPI source refs | 2,549–2,823 | 821–1,486 | 2–3x |
| Mermaid source refs | 3,295–3,462 | 217–283 | 12–15x |
| Pydantic source refs | 3,072–4,008 | 497–1,382 | 3–6x |
| FastAPI cross-refs | 770–779 | 118–420 | 2–7x |
| Mermaid cross-refs | 995–1,003 | 59–252 | 4–17x |
| Pydantic cross-refs | 936–4,000+ | 108–1,587 | 3–9x |
ProdE's structured [[symbol:repo:path:ClassName]] and [[file:repo:path]] format is machine-parseable. An agent doesn't need to regex-match file paths out of prose - it gets a structured citation that names the repository, file, and symbol. Every spot-checked reference across all three runs pointed to a real file and a real symbol.
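Because the format is regular, extraction needs no prose parsing at all. A sketch under stated assumptions - the exact grammar isn't published here, so this handles only the two shapes shown above with colon-free path segments, and the repo name "payments" is made up:

```python
import re

# Matches [[symbol:repo:path:Name]] and [[file:repo:path]] citations.
CITATION = re.compile(
    r"\[\[(?P<kind>symbol|file):(?P<repo>[^:\]]+):(?P<path>[^:\]]+)"
    r"(?::(?P<name>[^\]]+))?\]\]"
)

def extract_citations(page_text: str) -> list[dict]:
    """Return every structured citation in a docs page as a dict."""
    return [m.groupdict() for m in CITATION.finditer(page_text)]

page = "Validation lives in [[symbol:payments:payment/validators.py:validate_transaction]]."
print(extract_citations(page))
# → [{'kind': 'symbol', 'repo': 'payments',
#     'path': 'payment/validators.py', 'name': 'validate_transaction'}]
```

For a [[file:…]] citation the `name` field simply comes back `None` - the agent still gets repo and path without guessing.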
Claude Code, by contrast, averages only 2–3 source references per file (27–72 per project total). It writes excellent prose, but an agent can't trace claims back to source code. Google Code Wiki provides some references but with systematic quality issues - many point to incorrect files or missing symbols.
## Where each tool stands
### ProdE - Best overall, strongest for agents
- Densest source citations - 2,500–4,000 structured references per project with symbol-level precision
- Broadest coverage - 114–140 files per project covering Rust internals, CI/CD, testing harness, build system
- Rich cross-reference graph - 770–4,000+ wiki-style cross-references creating a navigable knowledge graph
- Most code examples - 289–413 Python blocks per project, showing both modern and classic patterns
- Most consistent - 0.3-point score spread across 9 evaluations (vs. 1.0–1.3 for competitors)
### DeepWiki - Strong presentation, moderate referencing
- Best diagrams - 248–284 Mermaid diagrams per project, ~5x more per file than ProdE
- Clear prose - well-organized, accessible technical writing; highest Presentation score (8.1)
- Moderate references - 500–1,500 source citations per project, but 2–15x fewer than ProdE
- Good completeness - 7.9/10 covering core and supporting topics
### Claude Code - Best prose, weakest referencing
- Strongest writing quality - "the strongest writing of all four sets" per the evaluator
- Highest code density per file - 7+ code blocks per file with modern syntax
- Critically sparse references - only 27–72 source references per project; agents can't trace claims to code
- Narrow coverage - 13–17 files per project, missing many topics entirely
### Google Code Wiki - Thin coverage, uneven quality
- Graphviz architecture diagrams provide useful high-level overviews
- Zero code examples across all projects - no Python, no JavaScript
- Many stub pages under 300 words, and systematic reference quality issues
- Lowest coverage - Completeness score of 5.2/10
## Per-project scores: all three runs
### FastAPI
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 5 / 5 / 5 | 6 / 7 / 7 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 7 | 7 / 6 / 7 | 8 / 8 / 7 |
| Referencing | 9 / 9 / 9 | 8 / 7 / 6 | 7 / 6 / 7 | 4 / 6 / 6 |
| Presentation | 8 / 7 / 7 | 8 / 8 / 8 | 4 / 3 / 4 | 7 / 8 / 7 |
### Pydantic
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 7 / 8 | 4 / 5 / 4 | 6 / 6 / 6 |
| Correctness | 9 / 8 / 8 | 8 / 8 / 8 | 7 / 8 / 7 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 8 / 7 | 7 / 7 / 7 | 3 / 3 / 4 |
| Presentation | 8 / 8 / 8 | 8 / 8 / 8 | 3 / 4 / 4 | 7 / 6 / 7 |
### Mermaid
| Dimension | ProdE | DeepWiki | Google Code Wiki | Claude Code |
|---|---|---|---|---|
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 7 / 6 / 6 | 6 / 7 / 6 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 8 | 7 / 6 / 8 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 7 / 7 | 7 / 6 / 7 | 4 / 5 / 3 |
| Presentation | 8 / 7 / 8 | 9 / 8 / 8 | 6 / 4 / 5 | 7 / 7 / 7 |
Pattern: ProdE scores 9/10 in Completeness and Referencing across every single evaluation - all three projects, all three runs. No other tool comes close to this consistency. DeepWiki's Referencing ranges from 6–8, Claude Code's from 3–6.
## Methodology
The evaluator is Claude (Opus). We used a structured evaluation prompt with explicit safeguards against common biases: don't confuse volume with quality, verify source references against actual code, flag hallucinations explicitly, score on an absolute scale, and don't penalize platform-specific formats.
Scoring: Each tool receives two composite scores. The Agent Score is the average of Completeness, Correctness, and Referencing. The Human Score is the average of Completeness, Correctness, and Presentation. Completeness and Correctness are shared - only Referencing (agent-specific) and Presentation (human-specific) differ between the two tracks. This means the gap between a tool's Agent and Human score is entirely attributable to how it performs on referencing vs. presentation.
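As a worked example, the composites reduce to a few averages. Using ProdE's FastAPI runs from the per-project tables above:

```python
from statistics import mean

# ProdE's FastAPI scores, one list entry per run (from the tables above).
runs = {
    "completeness": [9, 9, 9],
    "correctness":  [8, 8, 8],
    "referencing":  [9, 9, 9],
    "presentation": [8, 7, 7],
}

dims = {k: mean(v) for k, v in runs.items()}  # average over the 3 runs
agent = mean([dims["completeness"], dims["correctness"], dims["referencing"]])
human = mean([dims["completeness"], dims["correctness"], dims["presentation"]])
print(round(agent, 1), round(human, 1))  # → 8.7 8.1
```

The headline 8.7/8.2 numbers are this same computation averaged across all three projects.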
Blinding: Since Claude is both the evaluator and one of the tools being evaluated (Claude Code), we labeled Claude Code's output as "Doc X" in the evaluation reports. ProdE launched after the evaluator's training cutoff, so the evaluator has no pre-existing association with ProdE either - both ProdE and Claude Code are effectively unknown to the evaluator. ProdE does not use Claude models in its generation pipeline.
Every input and output is published. Run the benchmark yourself with a different evaluator model and see if the conclusions hold.
The full benchmark - all 9 evaluation reports, the evaluation rubric, documentation outputs, and source repositories - is available on GitHub.
See ProdE documentation in action on your codebase
Get a demo