Promptshelf AI Tool Report Card · 5 tools × 8 dimensions · A-F grades

Grade your AI coding tool's report card.

Stop arguing about which AI coding tool is best — grade them. Five contenders (Claude Code, Cursor, GitHub Copilot, Windsurf, OpenAI Codex CLI) across eight dimensions that actually decide whether a tool earns its monthly fee. A-F per dimension. GPA-style overall. Share the URL with your team and have the argument with structure instead of vibes.

5 tools | 8 dimensions per tool | A · B · C · D · F · — (unrated) | 4.0 GPA scale | ?rc=<30hex> share URL | school report card aesthetic
Bay 01

Pick the tools you've used

Only the tools you've actually used in production. A grade you fabricate is worse than no grade. Skipping a tool leaves it unrated in the share URL — your teammates can fill it in.

Bay 02

Grade each dimension

A = excellent, the tool is a force multiplier on this dimension. B = solid. C = acceptable, you live with it. D = problem area, you work around it. F = actively hurts the workflow. Hover the tiny grey "—" to leave a dimension unrated.

Bay 03

The report card

Each tool you graded gets a card. Unrated dimensions show as a dotted-line "—" and are excluded from the GPA. Compare the cards side-by-side; the dimensional breakdown is the whole point, and the GPA is just a convenient summary for chat.

Bay 04

Share the report card

The URL encodes all 40 grades and the optional grader name. Send it to a teammate; their browser reconstructs your exact grading and they can edit it to add their own. Drop the URL in your team chat to compare report cards side-by-side.

Bay 05

If the report card surfaced an obvious gap…

Three follow-on tools that turn a graded report card into action.

Bay 06

FAQ

Why letter grades instead of star ratings or numerical scores?
Stars (★★★☆☆) and 1–10 numbers compress to a single ranking and lose dimensional information. Letter grades carry an implicit cultural meaning: A is excellent, C is acceptable, F is failing. When you say a tool is C-grade at debugging, the reader knows immediately what that means. The school-report-card format also subtly encodes that the rater put each dimension on its own row — not a single overall vibe — which is the whole point. Use the GPA as your one-number summary if you need one.
Which AI coding tools are included?
Claude Code (Anthropic's CLI / IDE plugin family), Cursor (VS Code fork with native AI), GitHub Copilot (the original inline-completion incumbent), Windsurf (the agent-mode IDE formerly known as Codeium), and OpenAI Codex CLI (OpenAI's terminal coding agent, distinct from the older 2021 Codex model). These are the 5 commercial tools most engineers compare in early 2026. Adding more tools is straightforward — open an issue on the repo or commission a custom build.
What do the 8 dimensions actually measure?
Instruction Following: does it do what you asked, not what it inferred you meant.
Debugging Help: quality of root-cause analysis vs. shotgun fixes.
Refactor Quality: multi-file refactors land cleanly without orphaned imports.
Knowledge Currency: recognizes 2026 libraries and current APIs, not just pre-2024 patterns.
Hallucination Rate: invents packages or methods that don't exist (lower = higher grade).
Context Retention: keeps track of constraints across long sessions.
Code Style Consistency: matches your codebase rather than a generic style.
Ergonomic Friction: keyboard shortcuts, latency, interrupt-ability, ease of pausing the model mid-stream.
How is the GPA calculated?
Standard 4.0 scale: A=4, B=3, C=2, D=1, F=0. The GPA is the unweighted mean of your letter grades across the 8 dimensions, rounded to two decimals. Unrated dimensions are excluded from the average, so a tool you rated on only 3 dimensions gets a GPA from those 3. The overall letter grade follows: 3.5+ = A, 2.5+ = B, 1.5+ = C, 0.5+ = D, below 0.5 = F. In effect, the bands round the GPA to the nearest letter's point value.
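The calculation above can be sketched in a few lines of Python. This is an illustration of the stated rules, not the tool's actual source: the function names are hypothetical, and applying the letter bands to the already-rounded GPA is an assumption.

```python
from typing import Optional

# Grade points on the 4.0 scale described above.
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def gpa(grades: list[Optional[str]]) -> Optional[float]:
    """Unweighted mean of the rated dimensions; None if nothing was rated."""
    rated = [POINTS[g] for g in grades if g is not None]
    if not rated:
        return None
    return round(sum(rated) / len(rated), 2)


def overall_letter(value: float) -> str:
    """Map a GPA to its overall letter using the 3.5 / 2.5 / 1.5 / 0.5 bands."""
    for cutoff, letter in ((3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D")):
        if value >= cutoff:
            return letter
    return "F"


# A tool rated on only 3 of 8 dimensions: (4 + 3 + 2) / 3 = 3.0, overall B.
card = ["A", "B", None, "C", None, None, None, None]
print(gpa(card), overall_letter(gpa(card)))
```

Returning None for a fully unrated tool keeps it out of any comparison instead of reporting it as a 0.0.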
How does the share URL work?
Each grade encodes as 3 bits: 0 = unrated, 1 = A, 2 = B, 3 = C, 4 = D, 5 = F. Five tools × 8 dimensions gives 40 grades; at 3 bits each that is 120 bits, which pack into 30 hexadecimal characters as ?rc=<30hex>. The URL is opaque-looking but stable: it survives copy-paste, supports any combination of unrated tools, and decodes deterministically. Add an optional grader name with &n=<hex-utf8>. Send the URL to a teammate; their browser reconstructs your exact grading, and they can edit it to add their own grades or counter-rate tools you skipped.
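A minimal Python sketch of the 3-bit packing follows. The code values and the 30-character length come from the answer above; the bit order (first grade in the most significant bits), the lowercase hex, and the function names are assumptions, so strings produced here are not guaranteed to match the tool's real URLs.

```python
import string
from typing import Optional

# 3-bit codes from the FAQ: 0 = unrated, 1 = A ... 5 = F (6 and 7 are unused).
CODES = {None: 0, "A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
LETTERS = {code: letter for letter, code in CODES.items()}


def encode_grades(grades: list[Optional[str]]) -> str:
    """Pack 40 grades (5 tools x 8 dimensions) into 30 hex characters."""
    if len(grades) != 40:
        raise ValueError("expected exactly 40 grades")
    bits = 0
    for grade in grades:
        bits = (bits << 3) | CODES[grade]  # assumed order: first grade ends up highest
    return format(bits, "030x")  # 120 bits -> 30 zero-padded hex digits


def decode_grades(rc: str) -> list[Optional[str]]:
    """Inverse of encode_grades; rejects malformed input and unused codes."""
    if len(rc) != 30 or any(c not in string.hexdigits for c in rc):
        raise ValueError("expected 30 hexadecimal characters")
    bits = int(rc, 16)
    grades = []
    for shift in range(117, -1, -3):  # 40 three-bit fields, highest first
        code = (bits >> shift) & 0b111
        if code not in LETTERS:
            raise ValueError(f"invalid grade code {code}")
        grades.append(LETTERS[code])
    return grades


# One tool fully graded, the other four left unrated.
card = ["A", "B", "B", "C", "A", "D", "B", "F"] + [None] * 32
rc = encode_grades(card)
assert decode_grades(rc) == card
print(f"?rc={rc}")
```

Because unrated is code 0, an empty report card encodes as thirty zeros, and any mix of rated and skipped tools round-trips without a separate presence flag.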
Are these grades supposed to be objective?
No. The whole point is that AI tool quality is workload-dependent and team-dependent. A tool that's A-grade for greenfield React work might be C-grade for legacy Rails refactoring. The Report Card is a structured way to share your specific experience — the value is the dimensional breakdown, not a universal ranking. If your team has 4 different report cards for the same tool, that's information about the workload diversity, not about which rater is wrong. Compare URLs in your team chat and have the argument with structure instead of vibes.
Why no Cursor-specific tier breakdown (Tab vs Composer vs Agent)?
The Report Card grades the tool as a whole because most teams pick a tool and use the modes interchangeably; the friction of switching modes counts against the tool's ergonomic-friction grade. If you want to grade Cursor Tab and Cursor Composer separately, you can: duplicate the share URL with one row blank and tag each copy with a grader name of cursor-tab or cursor-composer (hex-encoded in &n=, like any grader name). The tool-decision-flowchart is the better surface if you want a workflow-by-workflow recommendation rather than per-tool grades.
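The share-URL answer gives the name parameter as &n=<hex-utf8>, so a tag such as cursor-tab would presumably travel as the hex of its UTF-8 bytes. A small Python sketch under that assumption; the helper names are hypothetical and the all-zero rc value is only a placeholder for a real 30-character code.

```python
def encode_name(name: str) -> str:
    """Hex-encode a grader name as UTF-8 bytes for the &n= parameter."""
    return name.encode("utf-8").hex()


def decode_name(hex_name: str) -> str:
    """Recover the grader name from its hex form."""
    return bytes.fromhex(hex_name).decode("utf-8")


def share_query(rc: str, name: str = "") -> str:
    """Build the query string; the name parameter is omitted when empty."""
    query = f"?rc={rc}"
    if name:
        query += f"&n={encode_name(name)}"
    return query


placeholder_rc = "0" * 30  # stand-in for a real packed report card
for tag in ("cursor-tab", "cursor-composer"):
    print(share_query(placeholder_rc, tag))
```

Hex keeps the parameter URL-safe without percent-encoding, at the cost of doubling the name's length.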