Why letter grades instead of star ratings or numerical scores?
Stars (★★★☆☆) and 1–10 numbers compress to a single ranking and lose dimensional information. Letter grades carry an implicit cultural meaning: A is excellent, C is acceptable, F is failing. When you say a tool is C-grade at debugging, the reader knows immediately what that means. The school-report-card format also subtly encodes that the rater put each dimension on its own row — not a single overall vibe — which is the whole point. Use the GPA as your one-number summary if you need one.
Which AI coding tools are included?
Claude Code (Anthropic's CLI / IDE plugin family), Cursor (VS Code fork with native AI), GitHub Copilot (the original inline-completion incumbent), Windsurf (the agent-mode IDE formerly known as Codeium), and OpenAI Codex CLI (OpenAI's terminal coding agent, distinct from the older 2021 Codex model). These are the 5 commercial tools most engineers compare in early 2026. Adding more tools is straightforward — open an issue on the repo or commission a custom build.
What do the 8 dimensions actually measure?
Instruction Following: whether it does what you asked, not what it inferred you meant. Debugging Help: quality of root-cause analysis vs. shotgun fixes. Refactor Quality: whether multi-file refactors land cleanly without orphaned imports. Knowledge Currency: whether it recognizes 2026 libraries and current APIs, not just pre-2024 patterns. Hallucination Rate: how often it invents packages or methods that don't exist (fewer inventions earn a higher grade). Context Retention: whether it keeps track of constraints across long sessions. Code Style Consistency: whether it matches your codebase rather than a generic style. Ergonomic Friction: keyboard shortcuts, latency, interruptibility, and how easily you can pause the model mid-stream.
How is the GPA calculated?
Standard 4.0 scale: A=4, B=3, C=2, D=1, F=0. The GPA is the unweighted mean of your letter grades across the 8 dimensions, rounded to two decimals. Unrated dimensions are excluded from the average, so a tool you rated on only 3 dimensions gets a GPA from those 3. The overall letter grade follows: 3.5+ = A, 2.5+ = B, 1.5+ = C, 0.5+ = D, below 0.5 = F. Each cutoff sits halfway between two adjacent letters' point values, so the overall grade is the GPA rounded to the nearest letter.
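The calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual source: the names gpa and overall_letter are hypothetical, unrated dimensions are represented as None, and the rounding mode (Python's built-in round) is an assumption, since the text only says "rounded to two decimals".

```python
from typing import Optional, Sequence

# Point values and band cutoffs as stated in the answer above.
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
BANDS = [(3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D")]


def gpa(grades: Sequence[Optional[str]]) -> Optional[float]:
    """Unweighted mean of the rated dimensions; None if nothing is rated."""
    rated = [POINTS[g] for g in grades if g is not None]
    if not rated:
        return None  # assumption: a fully unrated tool has no GPA at all
    return round(sum(rated) / len(rated), 2)


def overall_letter(value: float) -> str:
    """Map a GPA onto the overall letter grade using the stated cutoffs."""
    for cutoff, letter in BANDS:
        if value >= cutoff:
            return letter
    return "F"


# A tool rated on only 3 of the 8 dimensions: (4 + 3 + 2) / 3 = 3.0
example = ["A", "B", "C", None, None, None, None, None]
print(gpa(example), overall_letter(gpa(example)))
```

Skipping None entries rather than counting them as zero is what keeps a partially rated tool from being dragged toward F.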
How does the share URL work?
Each grade encodes as 3 bits: 0 = unrated, 1 = A, 2 = B, 3 = C, 4 = D, 5 = F. With 5 tools × 8 dimensions = 40 grades × 3 bits = 120 bits, packed into 30 hexadecimal characters as ?rc=<30hex>. The URL is opaque-looking but stable — it survives copy-paste, supports any combination of unrated tools, and decodes deterministically. Add an optional grader name with &n=<hex-utf8>. Send the URL to a teammate; their browser reconstructs your exact grading, and they can edit it to add their own grades or counter-rate tools you skipped.
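The packing described above can be sketched as follows. The 3-bit codes, the 40-grade count, the 30-character length, and the hex-encoded UTF-8 name come from the answer; everything else is an assumption made for illustration, in particular the bit order (first grade in the most significant bits), the tool-by-tool ordering of the 40 grades, and the function names encode_rc, decode_rc, and encode_name.

```python
from typing import List, Optional

# 3-bit codes as stated above: 0 = unrated, 1 = A ... 5 = F.
CODES = {None: 0, "A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
LETTERS = {code: letter for letter, code in CODES.items()}
N_GRADES = 5 * 8          # 5 tools x 8 dimensions
HEX_LEN = N_GRADES * 3 // 4  # 120 bits -> 30 hex characters


def encode_rc(grades: List[Optional[str]]) -> str:
    """Pack 40 grades into a 30-character hex string (assumed bit order)."""
    if len(grades) != N_GRADES:
        raise ValueError(f"expected {N_GRADES} grades, got {len(grades)}")
    packed = 0
    for grade in grades:
        packed = (packed << 3) | CODES[grade]  # first grade ends up highest
    return format(packed, f"0{HEX_LEN}x")


def decode_rc(rc: str) -> List[Optional[str]]:
    """Inverse of encode_rc; rejects the unused codes 6 and 7."""
    if len(rc) != HEX_LEN:
        raise ValueError(f"expected {HEX_LEN} hex characters")
    packed = int(rc, 16)
    grades: List[Optional[str]] = []
    for shift in range(3 * (N_GRADES - 1), -1, -3):
        code = (packed >> shift) & 0b111
        if code not in LETTERS:
            raise ValueError(f"invalid grade code {code}")
        grades.append(LETTERS[code])
    return grades


def encode_name(name: str) -> str:
    """Hex-encode the UTF-8 bytes of a grader name for the n parameter."""
    return name.encode("utf-8").hex()


grades = ["A", "B", None, "C"] + [None] * 36
query = f"?rc={encode_rc(grades)}&n={encode_name('sam')}"
assert decode_rc(encode_rc(grades)) == grades
```

Because every grade occupies a fixed 3-bit slot, an unrated cell is simply a zero and the string length never changes, which is why any mix of rated and unrated tools decodes the same way.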
Are these grades supposed to be objective?
No. The whole point is that AI tool quality is workload-dependent and team-dependent. A tool that's A-grade for greenfield React work might be C-grade for legacy Rails refactoring. The Report Card is a structured way to share your specific experience — the value is the dimensional breakdown, not a universal ranking. If your team has 4 different report cards for the same tool, that's information about the workload diversity, not about which rater is wrong. Compare URLs in your team chat and have the argument with structure instead of vibes.
Why no Cursor-specific tier breakdown (Tab vs Composer vs Agent)?
The Report Card grades the tool as a whole because most teams pick a tool and use the modes interchangeably; the friction of switching modes counts against the tool's Ergonomic Friction grade. If you want to grade Cursor Tab and Cursor Composer separately, you can: duplicate the share URL with one row blank and tag the copies with the names cursor-tab and cursor-composer via the &n= parameter. The tool-decision-flowchart is the better surface if you want a workflow-by-workflow recommendation rather than per-tool grades.