Your whole team has Claude. Who's actually good with it?

Promptster records real work sessions and scores every engineer's AI fluency — who's strong, who needs training, and on exactly what.

Book a 15-min walkthrough · See a scored session

Integrations · no tool mandates

Works where your engineers already work.

Codex

task briefs: the spec they hand the agent
transcripts: course-correction when it drifts
file diffs: what ships vs gets reworked
test runs: proof it works, or a guess

scored: discovery · implementation · verification

Claude Code

plans & prompts: do they scope before coding
agent transcripts: steering vs rubber-stamping
shell commands: how they recover when stuck
file diffs: accepted as-is vs reworked

scored: discovery · implementation · verification

Cursor

agent chats: the context they feed it
inline edits: surgical fixes vs full-file pastes
terminal: how they debug when it breaks
test runs: verified or vibes

scored: discovery · implementation · verification

The problem · self-report is broken

Ask engineers how good they are with AI.
Then measure it.

Experienced devs believed AI made them faster. It measurably made them slower.
Felt: +20% · Measured: −19%
METR, 2025 (a randomized controlled trial with early-2025 tools)

36% of employees say the AI training they got was sufficient. The other 64% are improvising on a tool they use daily.
BCG, 2025

The real spread inside one team: engineers measurably slowed to 0.8× sit next to the 5× the case studies celebrate. Nobody can say who's where.
0.8× to 5×
METR, 2025 + vendor case studies

Every chart above is self-report or an aggregate — none can name who on your team uses AI well.

The math · seats are a rounding error

The cost isn't the subscription.
It's the salary next to it.

A $200-a-month AI seat sits next to a $200,000 engineer. The subscription is around 1% of the cost of the person using it. Optimizing the seat is a procurement exercise; what the engineer does with it is a payroll-scale variable.

That 19% drag from the METR RCT? On one $200k engineer, that's roughly $38,000 a year in lost output, from someone who believes the tool is making them faster. Multiply by however many engineers on your team match that profile. You don't know the number. Neither do they.
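The arithmetic behind those figures, as a minimal sketch. It follows the page's own simplification that a 19% slowdown costs 19% of salary. The `team_drag` helper and the headcount passed to it are hypothetical; plug in your own numbers.

```python
# Worked arithmetic for the figures quoted above.
SALARY_PER_YEAR = 200_000  # engineer cost used in the text
SEAT_PER_MONTH = 200       # AI seat price used in the text
DRAG = 0.19                # slowdown measured in the METR RCT

seat_per_year = SEAT_PER_MONTH * 12
seat_share = seat_per_year / SALARY_PER_YEAR

# Simplifying assumption: a 19% slowdown is treated as 19% of salary lost.
drag_cost = DRAG * SALARY_PER_YEAR


def team_drag(engineers_affected: int) -> float:
    """Yearly drag for a hypothetical number of affected engineers."""
    return engineers_affected * drag_cost


print(f"seat: ${seat_per_year:,} / yr ({seat_share:.1%} of salary)")
print(f"drag: ${drag_cost:,.0f} / yr per affected engineer")
print(f"five affected engineers: ${team_drag(5):,.0f} / yr")
```

The seat comes out at about 1.2% of salary, which the text rounds to 1%. The drag is roughly sixteen times the yearly seat cost.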

And the drag is only half the math. Vendors' own case studies put a 5× engineer at the top of the curve, while METR measured a 0.8× one — both are probably on your payroll right now. Uneven fluency means you're paying for the ceiling and collecting the floor.

The return on fixing it is denominated in engineer-hours recovered and defects that never ship, not in subscription line items.

One engineer: $200,000 / yr

output you're paying for · 19% drag from unverified AI use ≈ $38k / yr (the slowdown METR measured in devs who felt faster)

The AI seat next to that bar costs $2,400 / yr, about 1%. The risk was never the subscription.

−$38k / yr: the drag, per engineer who's slower with AI and doesn't know it

5× possible: the productivity vendors' own case studies report, the ceiling your team is paying for but not collecting

How it works · 3 steps

Everyone sells AI training. Nobody verifies it worked.
We're the verification layer.

Install.

One CLI instruments the AI tools your engineers already use — Claude Code, Codex, Cursor — scoped to the repos you choose. Work outside those repos is never touched. Engineers see the scope before anything runs.

# platform team · one-time setup

❯ promptster team init

✓ scoped to 3 repos · 23 engineers notified
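One way to picture repo scoping is a path check against the chosen repo roots. This is an illustrative sketch only, not Promptster's implementation; `is_in_scope` and the `/work/...` paths are hypothetical.

```python
import posixpath
from pathlib import PurePosixPath


def is_in_scope(path: str, scoped_repos: list[str]) -> bool:
    """True if `path` is a scoped repo root or lives inside one.

    Paths are normalized first so that `..` segments cannot escape a repo.
    The comparison is component-wise, so `/work/api-legacy` does not match
    the root `/work/api`.
    """
    target = PurePosixPath(posixpath.normpath(path))
    return any(
        target.is_relative_to(PurePosixPath(posixpath.normpath(root)))
        for root in scoped_repos
    )


# Hypothetical scope: three repos, as in the setup output above.
scoped = ["/work/api", "/work/web", "/work/infra"]

print(is_in_scope("/work/api/src/middleware/auth.ts", scoped))  # True
print(is_in_scope("/work/api-legacy/README.md", scoped))        # False
print(is_in_scope("/work/api/../secrets/keys.txt", scoped))     # False
```

The check is default-deny: anything that does not resolve under a scoped root is out.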

Baseline.

Two weeks of real work sessions — prompts, file diffs, terminal commands, test runs. No homework, no simulation day. Each session is scored across discovery, implementation, and verification, rolled up per engineer.

● baseline · week 2 of 2

412 sessions · 23 engineers · 3 repos

scored: discovery · implementation · verification
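A per-engineer roll-up could look something like the sketch below. The page does not say how session scores are combined, so the plain average here is an assumption, and `SessionScore`, `roll_up`, the engineer IDs, and the numbers are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("discovery", "implementation", "verification")


@dataclass(frozen=True)
class SessionScore:
    """One scored work session (hypothetical shape)."""
    engineer: str
    discovery: float
    implementation: float
    verification: float


def roll_up(sessions):
    """Average each dimension across an engineer's sessions."""
    by_engineer = defaultdict(list)
    for session in sessions:
        by_engineer[session.engineer].append(session)
    return {
        engineer: {
            dim: mean(getattr(s, dim) for s in group) for dim in DIMENSIONS
        }
        for engineer, group in by_engineer.items()
    }


# Made-up scores, only to show the shape of the output.
sessions = [
    SessionScore("eng-a", discovery=40, implementation=70, verification=30),
    SessionScore("eng-a", discovery=60, implementation=90, verification=50),
    SessionScore("eng-b", discovery=80, implementation=80, verification=80),
]

for engineer, scores in roll_up(sessions).items():
    print(engineer, scores)
```

Keeping the three dimensions separate, instead of collapsing them into one number, is what lets a report say what an engineer needs help with.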

Fix.

Per-engineer training prescriptions, not a slide deck for the whole org. One engineer rubber-stamps AI diffs; another never feeds the agent context. Different problems, different fixes. Then re-assess and prove the delta.

# per-engineer prescription

M. Okafor → verification loop

re-assess wk 6 · 38 → 71 · delta proven
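The delta in that example is a plain subtraction, sketched below. `score_delta` is a hypothetical helper, and the score scale is not stated on the page, so the result is in whatever points the report uses.

```python
def score_delta(baseline: int, reassessed: int) -> int:
    """Point change between the baseline score and the re-assessment."""
    return reassessed - baseline


baseline, reassessed = 38, 71  # the two scores shown in the example above
change = score_delta(baseline, reassessed)
print(f"{baseline} -> {reassessed}: {change:+d} points")  # 38 -> 71: +33 points
```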

Evidence · not a black box

Every score links to a moment
you can replay.

Sessions open in an IDE-style replay: file tree, syntax-highlighted diffs, the full prompt timeline. When the report says an engineer skipped verification, you can scrub to the minute it happened — and so can they.

promptster · auth-middleware · a.rivera · session 0397

BASELINE · WK 2

Explorer

▾src

▾middleware

▾lib

▾types

▾tests

--- a/src/middleware/auth.ts
+++ b/src/middleware/auth.ts
@@ -1,8 +1,56 @@
 import { NextRequest, NextResponse } from "next/server"
 import { verifyToken } from "@/lib/jwt"

-export async function authMiddleware(req: NextRequest) {
-  const token = req.headers.get("authorization")
-  return NextResponse.next()
+export async function authMiddleware(
+  req: NextRequest
+): Promise<NextResponse> {
+  const token = req.headers.get("authorization")
+    ?.replace("Bearer ", "")
+
+  if (!token) {
+    return NextResponse.json({ error: "Unauthorized" }, { status: 401 })
+  }
+
+  const payload = await verifyToken(token)
+  if (!payload) {
+    return NextResponse.json({ error: "Invalid token" }, { status: 401 })
+  }

Timeline

prompt · tool call · file diff · command

0:00 · 0:40 · 1:20 · 2:00 · 2:40

file diff · src/middleware/auth.ts (+48 −0) · t=0:41

Why not what you already have

Nothing in your stack
watches the actual work.

Surveys, AI upskilling platforms, and DevEx dashboards all orbit the question and miss it completely. None of them can tell you which engineer needs what training — because none of them see the work itself.

Comparison of surveys, quizzes and certifications, DevEx dashboards, and Promptster across four dimensions of AI-fluency measurement.

What it measures

Surveys / self-report: Perception — see the METR chart above.
AI upskilling platforms (Section et al.): General AI literacy — quizzes and prompt exercises for knowledge workers, not engineers in real codebases.
DevEx dashboards: Aggregate output — DORA, throughput, cycle time.
Promptster: Observed behavior in real work sessions.

Resolution

Surveys / self-report: Team-level vibes, anonymized by design.
AI upskilling platforms (Section et al.): Per-person, but on generic exercises — not your stack, not your repos.
DevEx dashboards: Team or repo aggregates. Can't name names.
Promptster: Per-engineer, per-dimension: discovery, implementation, verification.

Tells you WHO needs help, and why

Surveys / self-report: No.
AI upskilling platforms (Section et al.): Who scored low on a quiz — not who rubber-stamps AI output at 4pm.
DevEx dashboards: No. A slow team average has no name attached.
Promptster: Yes — with the session replay as evidence.

What you do next

Surveys / self-report: Run another survey next quarter.
AI upskilling platforms (Section et al.): A course and a certificate, verified by another quiz.
DevEx dashboards: Argue about the dashboard in a planning meeting.
Promptster: A per-engineer training prescription. Re-assess on real work to prove the delta.

The objection · worth answering straight

This is not
surveillance.

Tools that grade people in secret deserve the side-eye they get. This one shows its work — to the people being scored, first.

Scoped. Capture is limited to the repos and workspaces your company chooses. Nothing outside them, ever.
Transparent. Every engineer sees the exact capture manifest before anything runs. No accounts, no dashboard to babysit — they just keep working.
Growth-oriented. The output is “here's the training that makes you faster” — a report built to be shared with each engineer, not held over them.

capture manifest · shown to every engineer

Captured

prompts
file diffs
terminal commands
test runs

Never captured

keystrokes
screen recording
clipboard
webcam
browser activity
anything outside scoped repos

The same shape of data you'd put in a PR description — what was prompted, what was tried, what was tested.
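The manifest reads naturally as an allowlist: only the four captured kinds pass, and only inside scoped repos. A minimal sketch, assuming a hypothetical event shape with `kind` and `repo` fields; this is not Promptster's actual schema or code.

```python
# The four captured kinds from the manifest above (the snake_case
# identifiers are assumptions made for this sketch).
CAPTURED = frozenset({"prompt", "file_diff", "terminal_command", "test_run"})


def filter_events(events, scoped_repos):
    """Keep only allowlisted event kinds from scoped repos.

    Default-deny: a kind that is not explicitly listed is dropped, so
    keystrokes, clipboard, or browser activity can never slip through
    by omission.
    """
    scoped = set(scoped_repos)
    return [
        event
        for event in events
        if event["kind"] in CAPTURED and event["repo"] in scoped
    ]


# Hypothetical events, to show what passes and what is dropped.
events = [
    {"kind": "prompt", "repo": "api"},
    {"kind": "clipboard", "repo": "api"},        # never captured
    {"kind": "test_run", "repo": "side-project"},  # outside scope
    {"kind": "file_diff", "repo": "web"},
]

kept = filter_events(events, scoped_repos=["api", "web"])
print([(e["kind"], e["repo"]) for e in kept])
# [('prompt', 'api'), ('file_diff', 'web')]
```

An allowlist is the safer shape here than a blocklist: a new, unlisted kind of data is excluded until someone deliberately adds it.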

Pilot cohorts · limited slots

Book a 15-min walkthrough —
we'll replay a scored session live.

We're running pilot cohorts with a handful of teams. Bring a VP Eng or platform lead; we'll bring a real session and its score. You decide in 15 minutes whether the signal is worth a two-week baseline.

Book the walkthrough

or leave an email for the next cohort

no spam · we reply personally

FAQ · the practical questions

Short answers,
no marketing answers.

Which AI tools does this work with?

Claude Code, Codex, and Cursor. All three are fully supported, capturing full work sessions: prompts, file diffs, terminal commands, test runs. Engineers keep whichever tool they already use; the scoring is the same across all of them. Mixed-tool teams are the norm, not a problem.

How long does setup take?

About 30 minutes for your platform team: install one CLI, choose which repos are in scope, and engineers get notified with exactly what will be captured. Engineers never need an account or touch a dashboard. The platform is for you; they just keep working. The two-week baseline starts as soon as engineers begin working in the scoped repos.

What does the report actually look like?

A team-level scoreboard with per-engineer scores across discovery, implementation, and verification: who's strong, who needs training, and on exactly what. Each score links to the sessions behind it, viewable in an IDE-style replay (file tree, syntax-highlighted diffs, prompt timeline). Per-engineer training prescriptions come with it, and a re-assessment after training shows the delta.

How is the data handled?

Capture is limited to the repos you scoped, and nothing outside them. We capture prompts, file diffs, terminal commands, and test runs; we never capture keystrokes, screen, clipboard, webcam, or browser activity. Every engineer sees that capture manifest before anything runs, and per-engineer reports are built to be shared with them. Dashboard access stays with the managers you invite. Data is retained for the engagement and deleted on request.

What are the pilot terms?

We're running pilot cohorts with a limited number of teams. A pilot is a two-week baseline on repos you choose, the full team report, and a working session to walk through it. Terms and pricing are discussed on the walkthrough call; we'd rather show you a scored session first.
