
We Benchmarked 4 AI Code Documentation Tools. ProdE Scored Highest.

We tested four AI documentation tools - ProdE, DeepWiki (by Cognition, the company behind Devin), Google Code Wiki, and Claude Code - on three open-source codebases. We scored them for AI agents and human developers separately. Nine evaluation passes each.

No benchmark existed for AI documentation tools. So we built one and open-sourced it — methodology, raw data, evaluation reports, all MIT licensed. Clone it, run it, verify every claim.

Here's how they ranked for agent utility:

  • ProdE - 8.7
  • DeepWiki - 7.6
  • Google Code Wiki - 6.3
  • Claude Code - 6.2

At a glance: 8.7 ProdE agent score, a 15% lead over #2, across 3 open-source projects and 9 evaluation passes.

What mattered

Documentation serves two audiences that want different things. Agents need dense source references and comprehensive coverage. Humans need clear prose, diagrams, and logical structure. We scored both:

  • Agent Score = average of Completeness, Correctness, and Referencing
  • Human Score = average of Completeness, Correctness, and Presentation
| Tool | Agent Score | Human Score |
| --- | --- | --- |
| ProdE | 8.7 | 8.2 |
| DeepWiki | 7.6 | 7.9 |
| Google Code Wiki | 6.3 | 5.4 |
| Claude Code | 6.2 | 7.1 |

ProdE leads both tracks. Highest agent score (8.7) and highest human score (8.2). Not a trade-off - it wins on both.

Here's every dimension:

| Tool | Completeness | Correctness | Referencing | Presentation |
| --- | --- | --- | --- | --- |
| ProdE | 9.0 | 8.1 | 9.0 | 7.7 |
| DeepWiki | 7.9 | 7.7 | 7.1 | 8.1 |
| Claude Code | 6.3 | 7.9 | 4.2 | 7.0 |
| Google Code Wiki | 5.2 | 7.0 | 6.8 | 4.1 |

ProdE leads three of four dimensions. DeepWiki leads in Presentation - its diagrams and prose are polished, and that's an area we're actively improving. But on the dimensions that determine agent utility - Completeness and Referencing - ProdE's lead is decisive.

A note on Correctness (8.1): zero hallucinations were detected across all 9 evaluations - every spot-checked file path, symbol reference, and behavioral claim pointed to real code. The score reflects that with 100+ files per project and thousands of claims, exhaustive verification isn't possible in a single pass. The evaluator still called ProdE's output "the most precise of the four sets."

Jump to the full benchmark details - methodology, per-dimension breakdown, and per-project scores.

· · ·

Why documentation is agent infrastructure

In agentic development, documentation isn't just helpful - it's the retrieval surface agents query to navigate codebases, answer questions, and generate code.

Consider an enterprise with 200+ repositories. A developer asks their AI agent: "how does the payment service validate transactions?"

How agents navigate enterprise codebases:

  • 100s of repos - the agent can't clone and read every repository.
  • Docs layer - the agent queries documentation first, finding service boundaries, key modules, and file paths.
  • Right repos - docs guide the agent to the 2–3 repos that matter for the question.
  • Exact code - references point to specific files and symbols, such as validators.py:validate_transaction.

Without good docs, the agent scans blindly, guesses at repo boundaries, and risks hallucination on every question outside its context. With ProdE docs, the documentation names the repos, the modules, and the exact files, and the agent navigates with precision.

Two documentation qualities make this work:

  • Completeness - An agent can only answer questions about topics that are documented. Coverage gaps become hallucination gaps. In an enterprise with hundreds of services, incomplete documentation means entire services are invisible to agents.
  • Referencing - Source citations tell agents which files and functions to examine. When docs reference payment/validators.py:validate_transaction, the agent goes straight there. Without references, agents produce confident but unverifiable answers. Good references narrow the search space from thousands of files to the handful that matter.
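The narrowing step described above can be sketched in code. Everything here is hypothetical - the page topics, reference strings, and helper names are illustrative, not taken from any of the benchmarked tools:

```python
# Hypothetical sketch: how an agent could use documentation citations
# to narrow a search before reading any source code.
from dataclasses import dataclass

@dataclass
class DocPage:
    topic: str
    references: list[str]  # e.g. "payment/validators.py:validate_transaction"

def candidate_files(pages: list[DocPage], question_keywords: set[str]) -> list[str]:
    """Return source locations cited by pages whose topic matches the question."""
    hits = []
    for page in pages:
        if question_keywords & set(page.topic.lower().split()):
            hits.extend(page.references)
    return sorted(set(hits))

pages = [
    DocPage("Payment transaction validation",
            ["payment/validators.py:validate_transaction"]),
    DocPage("Routing internals", ["routing/router.py:Router"]),
]
print(candidate_files(pages, {"payment", "validation"}))
# A handful of cited files instead of a whole-repo scan.
```

The point is the shape of the lookup, not the matching heuristic: documentation with dense, accurate references turns "scan everything" into "read these two files."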
· · ·

The benchmark

We generated documentation for three well-known open-source projects:

| Project | What it tests |
| --- | --- |
| FastAPI | Web framework - routing, dependency injection, middleware, OpenAPI generation |
| Pydantic | Data validation - Rust internals, schema generation, type system |
| Mermaid | Diagramming tool - parser toolchain, rendering pipeline, accessibility |

For each project, we cloned the actual source repository so every claim could be verified against real code. Each documentation set was scored on four dimensions: Completeness (coverage breadth), Correctness (technical accuracy), Referencing (source citation density and accuracy), and Presentation (prose clarity, diagrams, code examples).

We ran the evaluation three times per project - same prompt, same data, three passes - to account for LLM non-determinism. Three projects × three runs = nine evaluations per tool.

· · ·

Completeness: the answer boundary

An agent can only answer questions about topics that are documented. If the docs cover routing but not middleware, every middleware question becomes a hallucination risk. Completeness determines the boundary of what agents can accurately answer.

ProdE scored 9/10 in Completeness in every evaluation - all projects, all runs.

  • ProdE: 114–140 files/project
  • DeepWiki: 25–44 files/project
  • Claude Code: 13–17 files/project
  • Google Code Wiki: 25 files/project

ProdE covers topics no other tool touches - Rust internals in Pydantic, CI/CD pipeline configuration, testing harness internals, concurrency helpers, logging conventions, type aliases. These expand the boundary of what agents can accurately answer.

For example, in FastAPI: ProdE has dedicated pages for SSE architecture (3 pages), middleware configuration (6 pages including CORS, GZip, HTTPS redirect, trusted hosts), DI internals (5 pages on the Dependant model, graph construction, and solver), and testing harness internals - none of which exist in any other tool's output.

In an enterprise with hundreds of repos, this breadth is the difference between an agent that can answer questions about any service and one that hits coverage gaps and starts guessing.

· · ·

Referencing: ProdE's decisive advantage

When a developer asks their coding assistant "how does dependency injection resolve in FastAPI?", the agent retrieves docs, synthesizes an answer, and needs to point to specific source locations. If the docs reference dependencies/utils.py:solve_dependencies, the developer can jump to that file and verify. If the docs say "the DI system resolves dependencies" without citing source, the agent generates a confident but unverifiable answer.

ProdE scored 9/10 in Referencing across all 9 evaluations - not a single pass docked it below 9. This is the most stable individual score in the entire benchmark.

| Metric | ProdE | DeepWiki | Ratio |
| --- | --- | --- | --- |
| FastAPI source refs | 2,549–2,823 | 821–1,486 | 2–3x |
| Mermaid source refs | 3,295–3,462 | 217–283 | 12–15x |
| Pydantic source refs | 3,072–4,008 | 497–1,382 | 3–6x |
| FastAPI cross-refs | 770–779 | 118–420 | 2–7x |
| Mermaid cross-refs | 995–1,003 | 59–252 | 4–17x |
| Pydantic cross-refs | 936–4,000+ | 108–1,587 | 3–9x |

ProdE's structured [[symbol:repo:path:ClassName]] and [[file:repo:path]] format is machine-parseable. An agent doesn't need to regex-match file paths out of prose - it gets a structured citation that names the repository, file, and symbol. Every spot-checked reference across all three runs pointed to a real file and a real symbol.
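The bracket format lends itself to simple parsing. The regex below is our own illustrative sketch - the `[[symbol:repo:path:ClassName]]` and `[[file:repo:path]]` shapes come from the text above, but the parsing code and helper names do not:

```python
import re

# Illustrative parser for the two citation shapes described above.
# Captures: kind (symbol/file), repo, path, and an optional symbol name.
CITATION = re.compile(r"\[\[(symbol|file):([^:\]]+):([^:\]]+)(?::([^\]]+))?\]\]")

def parse_citations(text: str) -> list[dict]:
    """Extract structured citations from a documentation page."""
    out = []
    for kind, repo, path, symbol in CITATION.findall(text):
        out.append({"kind": kind, "repo": repo, "path": path,
                    "symbol": symbol or None})
    return out

doc = ("Dependencies resolve in "
       "[[symbol:fastapi:dependencies/utils.py:solve_dependencies]] "
       "(see [[file:fastapi:routing.py]]).")
for ref in parse_citations(doc):
    print(ref)
```

This is the practical meaning of "machine-parseable": an agent gets repository, file, and symbol as structured fields rather than fishing paths out of prose.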

Claude Code, by contrast, averages only 2–3 source references per file (27–72 per project total). It writes excellent prose, but an agent can't trace claims back to source code. Google Code Wiki provides some references but with systematic quality issues - many point to incorrect files or missing symbols.

· · ·

Where each tool stands

ProdE - Best overall, strongest for agents

  • Densest source citations - 2,500–4,000 structured references per project with symbol-level precision
  • Broadest coverage - 114–140 files per project covering Rust internals, CI/CD, testing harness, build system
  • Rich cross-reference graph - 770–4,000+ wiki-style cross-references creating a navigable knowledge graph
  • Most code examples - 289–413 Python blocks per project, showing both modern and classic patterns
  • Most consistent - 0.3-point score spread across 9 evaluations (vs. 1.0–1.3 for competitors)

DeepWiki - Strong presentation, moderate referencing

  • Best diagrams - 248–284 Mermaid diagrams per project, ~5x more per file than ProdE
  • Clear prose - well-organized, accessible technical writing; highest Presentation score (8.1)
  • Moderate references - 500–1,500 source citations per project, but 2–15x fewer than ProdE
  • Good completeness - 7.9/10 covering core and supporting topics

Claude Code - Best prose, weakest referencing

  • Strongest writing quality - "the strongest writing of all four sets" per the evaluator
  • Highest code density per file - 7+ code blocks per file with modern syntax
  • Critically sparse references - only 27–72 source references per project; agents can't trace claims to code
  • Narrow coverage - 13–17 files per project, missing many topics entirely

Google Code Wiki - Thin coverage, uneven quality

  • Graphviz architecture diagrams provide useful high-level overviews
  • Zero code examples across all projects - no Python, no JavaScript
  • Many stub pages under 300 words, and systematic reference quality issues
  • Lowest coverage - Completeness score of 5.2/10
· · ·

Per-project scores: all three runs

FastAPI

| Dimension | ProdE | DeepWiki | Google | Claude Code |
| --- | --- | --- | --- | --- |
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 5 / 5 / 5 | 6 / 7 / 7 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 7 | 7 / 6 / 7 | 8 / 8 / 7 |
| Referencing | 9 / 9 / 9 | 8 / 7 / 6 | 7 / 6 / 7 | 4 / 6 / 6 |
| Presentation | 8 / 7 / 7 | 8 / 8 / 8 | 4 / 3 / 4 | 7 / 8 / 7 |

Pydantic

| Dimension | ProdE | DeepWiki | Google | Claude Code |
| --- | --- | --- | --- | --- |
| Completeness | 9 / 9 / 9 | 8 / 7 / 8 | 4 / 5 / 4 | 6 / 6 / 6 |
| Correctness | 9 / 8 / 8 | 8 / 8 / 8 | 7 / 8 / 7 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 8 / 7 | 7 / 7 / 7 | 3 / 3 / 4 |
| Presentation | 8 / 8 / 8 | 8 / 8 / 8 | 3 / 4 / 4 | 7 / 6 / 7 |

Mermaid

| Dimension | ProdE | DeepWiki | Google | Claude Code |
| --- | --- | --- | --- | --- |
| Completeness | 9 / 9 / 9 | 8 / 8 / 8 | 7 / 6 / 6 | 6 / 7 / 6 |
| Correctness | 8 / 8 / 8 | 8 / 7 / 8 | 7 / 6 / 8 | 8 / 8 / 8 |
| Referencing | 9 / 9 / 9 | 7 / 7 / 7 | 7 / 6 / 7 | 4 / 5 / 3 |
| Presentation | 8 / 7 / 8 | 9 / 8 / 8 | 6 / 4 / 5 | 7 / 7 / 7 |

Pattern: ProdE scores 9/10 in Completeness and Referencing across every single evaluation - all three projects, all three runs. No other tool comes close to this consistency. DeepWiki's Referencing ranges from 6–8, Claude Code's from 3–6.
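The per-run cells above can be aggregated mechanically. This sketch uses the ProdE column from the FastAPI table; the helper names are ours, not part of the benchmark tooling:

```python
from statistics import mean

def parse_runs(cell: str) -> list[int]:
    """Turn a '9 / 9 / 9' table cell into [9, 9, 9]."""
    return [int(x) for x in cell.split("/")]

# ProdE's FastAPI column, copied from the table above.
prode_fastapi = {
    "Completeness": "9 / 9 / 9",
    "Correctness": "8 / 8 / 8",
    "Referencing": "9 / 9 / 9",
    "Presentation": "8 / 7 / 7",
}
means = {dim: round(mean(parse_runs(cell)), 1)
         for dim, cell in prode_fastapi.items()}
print(means)  # per-dimension means for one project
```

The same aggregation across all three projects is what produces the per-dimension averages in the results table near the top of the post.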

· · ·

Methodology

The evaluator is Claude (Opus). We used a structured evaluation prompt with explicit safeguards against common biases: don't confuse volume with quality, verify source references against actual code, flag hallucinations explicitly, score on an absolute scale, and don't penalize platform-specific formats.

Scoring: Each tool receives two composite scores. The Agent Score is the average of Completeness, Correctness, and Referencing. The Human Score is the average of Completeness, Correctness, and Presentation. Completeness and Correctness are shared - only Referencing (agent-specific) and Presentation (human-specific) differ between the two tracks. This means the gap between a tool's Agent and Human score is entirely attributable to how it performs on referencing vs. presentation.
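The two composites can be checked directly. This sketch applies the formulas above to ProdE's dimension averages from the results table; small rounding differences can appear depending on whether you average per-run composites or composite the averaged dimensions:

```python
# Agent Score averages Completeness, Correctness, and Referencing;
# Human Score swaps Referencing for Presentation.
def agent_score(completeness: float, correctness: float, referencing: float) -> float:
    return round((completeness + correctness + referencing) / 3, 1)

def human_score(completeness: float, correctness: float, presentation: float) -> float:
    return round((completeness + correctness + presentation) / 3, 1)

# ProdE's per-dimension averages from the results table.
prode = {"completeness": 9.0, "correctness": 8.1,
         "referencing": 9.0, "presentation": 7.7}
print(agent_score(prode["completeness"], prode["correctness"],
                  prode["referencing"]))  # 8.7, matching the headline score
```

Because Completeness and Correctness are shared between the tracks, the gap between a tool's two composites is driven entirely by Referencing versus Presentation, as the methodology notes.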

Blinding: Since Claude is both the evaluator and one of the tools being evaluated (Claude Code), we labeled Claude Code's output as "Doc X" in the evaluation reports. ProdE launched after the evaluator's training cutoff, so the evaluator has no pre-existing association with ProdE either - both ProdE and Claude Code are effectively unknown to the evaluator. ProdE does not use Claude models in its generation pipeline.

Every input and output is published. Run the benchmark yourself with a different evaluator model and see if the conclusions hold.

The full benchmark - all 9 evaluation reports, the evaluation rubric, documentation outputs, and source repositories - is available on GitHub.

See ProdE documentation in action on your codebase

Get a demo