
Your whole team has Claude. Who's actually good with it?

Promptster records real work sessions and scores every engineer's AI fluency — who's strong, who needs training, and on exactly what.

Integrations · no tool mandates

Works where your engineers already work.

Codex
  • task briefs: the spec they hand the agent
  • transcripts: course-correction when it drifts
  • file diffs: what ships vs gets reworked
  • test runs: proof it works, or a guess

scored: discovery · implementation · verification

Claude Code
  • plans & prompts: do they scope before coding
  • agent transcripts: steering vs rubber-stamping
  • shell commands: how they recover when stuck
  • file diffs: accepted as-is vs reworked

scored: discovery · implementation · verification

Cursor
  • agent chats: the context they feed it
  • inline edits: surgical fixes vs full-file pastes
  • terminal: how they debug when it breaks
  • test runs: verified or vibes

scored: discovery · implementation · verification

Claude Code, Codex, and Cursor are all fully supported.

The problem · self-report is broken

Ask engineers how good they are with AI.
Then measure it.

  • Experienced devs believed AI made them faster. It measurably made them slower.

    Felt: +20%
    Measured: −19%
    METR, 2025 — a randomized controlled trial with early-2025 tools
  • Only 36% of employees say the AI training they got was sufficient. The rest are improvising on a tool they use daily.

    36% say training was sufficient · 64% improvising daily
    BCG, 2025
  • The real spread inside one team: engineers measurably slowed to 0.8× sit next to the 5× the case studies celebrate. Nobody can say who's where.

    0.8×
    METR, 2025 + vendor case studies

Every chart above is self-report or an aggregate — none can name who on your team uses AI well.

The math · seats are a rounding error

The cost isn't the subscription.
It's the salary next to it.

A $200-a-month AI seat sits next to a $200,000 engineer. The subscription is around 1% of the cost of the person using it. Optimizing the seat is a procurement exercise; what the engineer does with it is a payroll-scale variable.

That 19% drag from the METR RCT? On one $200k engineer, that's roughly $38,000 a year in lost output, from someone who believes the tool is making them faster. Multiply by however many engineers on your team match that profile. You don't know the number. Neither do they.

And the drag is only half the math. Vendors' own case studies put a 5× engineer at the top of the curve, while METR measured a 0.8× one — both are probably on your payroll right now. Uneven fluency means you're paying for the ceiling and collecting the floor.

The return on fixing it is denominated in engineer-hours recovered and defects that never ship, not in subscription line items.

One engineer: $200,000 / yr
Output you're paying for: 19% drag from unverified AI use ≈ $38k / yr, the slowdown METR measured in devs who felt faster
The AI seat next to that bar costs $2,400 / yr, about 1%. The risk was never the subscription.
−$38k / yr: the drag, per engineer who's slower with AI and doesn't know it
5× possible: the productivity vendors' own case studies report, the ceiling your team is paying for but not collecting
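
The arithmetic in this section can be checked in a few lines. This is a rough sketch that uses only the figures quoted above ($200 a month, $200,000 a year, a 19% slowdown). The function names are illustrative, and treating a 19% slowdown as 19% of salary is the page's simplification, not a precise model of lost output.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: salary stands in for the value of an engineer's output.

def annual_seat_cost(monthly_price: float) -> float:
    """Yearly cost of one AI seat."""
    return monthly_price * 12


def seat_share_of_salary(monthly_price: float, salary: float) -> float:
    """Seat cost as a fraction of the salary of the person using it."""
    return annual_seat_cost(monthly_price) / salary


def annual_drag(salary: float, slowdown: float) -> float:
    """Salary-equivalent cost of a given slowdown (simplified)."""
    return salary * slowdown


if __name__ == "__main__":
    salary = 200_000
    print(annual_seat_cost(200))                              # 2400
    print(round(seat_share_of_salary(200, salary) * 100, 1))  # 1.2
    print(round(annual_drag(salary, 0.19)))                   # 38000
```

On these inputs the seat is about 1.2% of salary, while the drag is roughly sixteen times the seat cost.
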
How it works · 3 steps

Everyone sells AI training. Nobody verifies it worked.
We're the verification layer.

01

Install.

One CLI instruments the AI tools your engineers already use — Claude Code, Codex, Cursor — scoped to the repos you choose. Work outside those repos is never touched. Engineers see the scope before anything runs.

# platform team · one-time setup
promptster team init
scoped to 3 repos · 23 engineers notified
02

Baseline.

Two weeks of real work sessions — prompts, file diffs, terminal commands, test runs. No homework, no simulation day. Each session is scored across discovery, implementation, and verification, rolled up per engineer.

● baseline · week 2 of 2
412 sessions · 23 engineers · 3 repos
scored: discovery · implementation · verification
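
One way to picture the per-engineer roll-up described in this step is sketched below. It is a hypothetical illustration, not Promptster's scoring method: the 0 to 100 scale, the `SessionScore` record, the engineer labels, and the plain average are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("discovery", "implementation", "verification")


@dataclass(frozen=True)
class SessionScore:
    """Hypothetical per-session record; the 0-100 scale is an assumption."""
    engineer: str
    discovery: float
    implementation: float
    verification: float


def roll_up(sessions: list[SessionScore]) -> dict[str, dict[str, float]]:
    """Average each dimension over all of an engineer's sessions."""
    by_engineer: dict[str, list[SessionScore]] = defaultdict(list)
    for session in sessions:
        by_engineer[session.engineer].append(session)
    return {
        name: {dim: mean(getattr(s, dim) for s in scored) for dim in DIMENSIONS}
        for name, scored in by_engineer.items()
    }


# Made-up sample data, for illustration only.
sessions = [
    SessionScore("eng-a", 80, 70, 40),
    SessionScore("eng-a", 60, 90, 30),
    SessionScore("eng-b", 50, 50, 90),
]
report = roll_up(sessions)
```

A real roll-up might weight recent or longer sessions more heavily; the unweighted mean is used here only because it is the simplest choice to read.
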
03

Fix.

Per-engineer training prescriptions, not a slide deck for the whole org. One engineer rubber-stamps AI diffs; another never feeds the agent context. Different problems, different fixes. Then re-assess and prove the delta.

# per-engineer prescription
M. Okafor → verification loop
re-assess wk 6 → 38 → 71 · delta proven
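
The "different problems, different fixes" idea can be sketched as picking an engineer's weakest dimension and then comparing baseline to re-assessment. This is an assumption about how a prescription might be derived, not the product's logic. The function names are hypothetical, and so are all example scores except the 38 and 71 shown above, which are assumed here to be verification scores.

```python
def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension as the training focus."""
    return min(scores, key=scores.get)


def delta(before: float, after: float) -> float:
    """Change in a score between baseline and re-assessment."""
    return after - before


# Hypothetical baseline; only the 38 comes from the example above.
baseline = {"discovery": 74, "implementation": 69, "verification": 38}
focus = weakest_dimension(baseline)        # "verification"
improvement = delta(baseline[focus], 71)   # 33
```
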
Evidence · not a black box

Every score links to a moment
you can replay.

Sessions open in an IDE-style replay: file tree, syntax-highlighted diffs, the full prompt timeline. When the report says an engineer skipped verification, you can scrub to the minute it happened — and so can they.

Why not what you already have

Nothing in your stack
watches the actual work.

Surveys, AI upskilling platforms, and DevEx dashboards all orbit the question and miss it completely. None of them can tell you which engineer needs what training — because none of them see the work itself.

Comparison of surveys, quizzes and certifications, DevEx dashboards, and Promptster across four dimensions of AI-fluency measurement.
Columns: surveys / self-report, AI upskilling platforms (Section et al.), and DevEx dashboards (what you have) vs Promptster (observed sessions).

What it measures
  • Surveys / self-report
    Perception (see the METR chart above).
  • AI upskilling platforms (Section et al.)
    General AI literacy: quizzes and prompt exercises for knowledge workers, not engineers in real codebases.
  • DevEx dashboards
    Aggregate output: DORA, throughput, cycle time.
  • Promptster
    Observed behavior in real work sessions.

Resolution
  • Surveys / self-report
    Team-level vibes, anonymized by design.
  • AI upskilling platforms (Section et al.)
    Per-person, but on generic exercises: not your stack, not your repos.
  • DevEx dashboards
    Team or repo aggregates. Can't name names.
  • Promptster
    Per-engineer, per-dimension: discovery, implementation, verification.

Tells you WHO needs help, and why
  • Surveys / self-report
    No.
  • AI upskilling platforms (Section et al.)
    Who scored low on a quiz, not who rubber-stamps AI output at 4pm.
  • DevEx dashboards
    No. A slow team average has no name attached.
  • Promptster
    Yes, with the session replay as evidence.

What you do next
  • Surveys / self-report
    Run another survey next quarter.
  • AI upskilling platforms (Section et al.)
    A course and a certificate, verified by another quiz.
  • DevEx dashboards
    Argue about the dashboard in a planning meeting.
  • Promptster
    A per-engineer training prescription. Re-assess on real work to prove the delta.

The objection · worth answering straight

This is not
surveillance.

Tools that grade people in secret deserve the side-eye they get. This one shows its work — to the people being scored, first.

  • Scoped. Capture is limited to the repos and workspaces your company chooses. Nothing outside them, ever.
  • Transparent. Every engineer sees the exact capture manifest before anything runs. No accounts, no dashboard to babysit — they just keep working.
  • Growth-oriented. The output is “here's the training that makes you faster” — a report built to be shared with each engineer, not held over them.
capture manifest · shown to every engineer
Captured
  • prompts
  • file diffs
  • terminal commands
  • test runs
Never captured
  • keystrokes
  • screen recording
  • clipboard
  • webcam
  • browser activity
  • anything outside scoped repos
The same shape of data you'd put in a PR description — what was prompted, what was tried, what was tested.
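
The manifest above amounts to an allow-list plus a repo boundary. The sketch below shows one way such a check could work. It is not Promptster's implementation: the function names, the event-type strings, and the example paths are all assumptions made for illustration.

```python
from pathlib import Path

# Allow-list taken from the "Captured" column; anything else is denied by default.
CAPTURED = {"prompts", "file diffs", "terminal commands", "test runs"}


def in_scope(path: str, scoped_repos: list[str]) -> bool:
    """True if path is a scoped repo root or lies inside one."""
    target = Path(path).resolve()
    for repo in scoped_repos:
        root = Path(repo).resolve()
        # Compare whole path components, so /work/api-private
        # is not mistaken for something inside /work/api.
        if target == root or root in target.parents:
            return True
    return False


def should_capture(event_type: str, path: str, scoped_repos: list[str]) -> bool:
    """Capture only allow-listed event types that occur inside scoped repos."""
    return event_type in CAPTURED and in_scope(path, scoped_repos)


# Hypothetical scope and events.
repos = ["/work/api"]
inside = should_capture("prompts", "/work/api/src/app.py", repos)
outside = should_capture("prompts", "/home/me/notes.txt", repos)
denied = should_capture("keystrokes", "/work/api/src/app.py", repos)
```

Because the check is deny-by-default, the "Never captured" items need no separate rule: an event type that is not on the allow-list is simply dropped.
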
Pilot cohorts · limited slots

Book a 15-min walkthrough —
we'll replay a scored session live.

We're running pilot cohorts with a handful of teams. Bring a VP Eng or platform lead; we'll bring a real session and its score. You decide in 15 minutes whether the signal is worth a two-week baseline.

or leave an email for the next cohort
no spam · we reply personally
FAQ · the practical questions

Short answers,
no marketing answers.

  • Which AI tools does this work with?
    Claude Code, Codex, and Cursor — all three fully supported, capturing full work sessions: prompts, file diffs, terminal commands, test runs. Engineers keep whichever tool they already use; the scoring is the same across all of them. Mixed-tool teams are the norm, not a problem.
  • How long does setup take?
    About 30 minutes for your platform team: install one CLI, choose which repos are in scope, and engineers get notified with exactly what will be captured. Engineers never need an account or touch a dashboard. The platform is for you; they just keep working. The two-week baseline starts as soon as engineers begin working in the scoped repos.
  • What does the report actually look like?
    A team-level scoreboard with per-engineer scores across discovery, implementation, and verification — who's strong, who needs training, on exactly what. Each score links to the sessions behind it, viewable in an IDE-style replay (file tree, syntax-highlighted diffs, prompt timeline). Per-engineer training prescriptions come with it, and a re-assessment after training shows the delta.
  • How is the data handled?
    Capture is limited to the repos you scoped — nothing outside them. We capture prompts, file diffs, terminal commands, and test runs; we never capture keystrokes, screen, clipboard, webcam, or browser activity. Every engineer sees that capture manifest before anything runs, and per-engineer reports are built to be shared with them. Dashboard access stays with the managers you invite. Data is retained for the engagement and deleted on request.
  • What are the pilot terms?
    We're running pilot cohorts with a limited number of teams. A pilot is a two-week baseline on repos you choose, the full team report, and a working session to walk through it. Terms and pricing are discussed on the walkthrough call — we'd rather show you a scored session first.