How to Measure AI Fluency on Your Engineering Team (2026 Guide)

90% of developers use AI coding tools. Almost no engineering org can say who uses them well. Here's why self-report and seat counts fail, what AI fluency actually consists of, and how to measure it from observed Claude Code and Cursor sessions — so training budget goes where it works.

Paarth Jamdagneya

TL;DR

Your engineers all have AI subscriptions. Some of them are getting a real multiplier out of them. Some are slower than they were before — and they don't know it. Nothing in your current stack can tell you which is which. Surveys can't (developers misjudge their own AI productivity by ~40 points), quizzes can't (they test trivia, not behavior), and DevEx dashboards can't (throughput tells you that a team is slow, not who needs help with what). The only honest signal is the work itself: the actual Claude Code and Cursor sessions. This post covers what the research says about the gap, why the usual metrics fail, what AI fluency actually decomposes into, and a practical loop — baseline, train, re-assess — for running it inside your own org.

Every mandate created the same blind spot

The 2025 wave of AI mandates is well documented. Shopify made "reflexive AI usage" a baseline expectation and put it in performance reviews. Coinbase bought Cursor and Copilot for every engineer and let go of the ones who refused to onboard. Zapier set an AI-fluency bar for every new hire. By the 2025 DORA report, 90% of developers were using AI tools at work.

So adoption is solved. What replaced it is a harder question that almost nobody can answer: of the engineers using these tools every day, who is actually good at it?

The vendors' own customer stories describe the problem without meaning to. Cursor's case studies quote customers reporting "60% of our org" and ">70% of our engineers" on the tool, which is another way of saying that 30 to 40% of the org isn't. Anthropic's customer pages cite teams seeing "2–10x velocity improvements across engineering teams." A 2–10x spread is not a success metric. It's a skills gap wearing a press release.

You can't survey your way out

The instinctive fix is to ask people. The research on that is brutal.

METR ran a randomized controlled trial in 2025: experienced open-source developers worked on real tasks from their own repositories, with each task randomly assigned to an AI-allowed or AI-forbidden condition, using early-2025 tools (Cursor Pro with Claude 3.5/3.7 Sonnet). Tasks took 19% longer when AI was allowed. The same developers had predicted AI would make them 24% faster and, this is the part that should worry you, after finishing still believed it had made them 20% faster. That's a roughly 40-point gap between perceived and measured productivity. (METR's 2026 follow-up suggests newer tools have closed much of the slowdown, but nothing suggests developers got better at estimating their own uplift.)
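
To make the "roughly 40-point" figure concrete, here is the arithmetic as a short Python sketch. The three percentages are the ones quoted above; the variable and function names are ours, chosen for illustration, and the sign convention (positive means faster) is an assumption of this sketch.

```python
# Figures quoted in the paragraph above. Sign convention (our assumption):
# positive = faster with AI, negative = slower with AI.
predicted_speedup_pct = 24   # what developers forecast before the tasks
perceived_speedup_pct = 20   # what they still believed after finishing
measured_speedup_pct = -19   # what the trial measured (19% slower)


def perception_gap(believed_pct: float, measured_pct: float) -> float:
    """Gap in percentage points between believed and measured change."""
    return believed_pct - measured_pct


gap_before = perception_gap(predicted_speedup_pct, measured_speedup_pct)
gap_after = perception_gap(perceived_speedup_pct, measured_speedup_pct)

print(f"gap before the tasks: {gap_before} points")  # 43
print(f"gap after the tasks:  {gap_after} points")   # 39
```

Both readings land within a few points of 40, which is where the rounded figure comes from.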

It's not just developers. Section's AI Proficiency Benchmark found 54% of knowledge workers rate themselves AI-proficient while roughly 10% test as proficient. BCG's AI at Work 2025 found only 36% of employees consider their AI training sufficient, and 18% of regular AI users received no training at all.

Three independent methodologies, one conclusion: self-report on AI skill is noise. If your AI-adoption picture comes from a survey, you don't have an AI-adoption picture.

The metrics teams actually use, and why each one fails

In practice, engineering orgs trying to measure this land on one of four proxies:

Percent of code written by AI. Coinbase tracks it (33%, targeting 50%). It measures volume, not judgment. An engineer who rubber-stamps every suggestion scores highest on exactly the behavior you should be most worried about.

Seat counts and spend. "Everyone has a license" is procurement data, not capability data. A $200/month seat sits next to a $200,000 engineer — the subscription is a rounding error on the cost of the person using it, which also means optimizing the subscription is a rounding-error activity. The leverage is in what the human does with the tool.

Quizzes and certifications. Vendor academies and assessment platforms can verify that an engineer knows what a context window is. They cannot verify that, at 4pm on a Tuesday with a failing test suite, that engineer writes a focused prompt, checks the diff before accepting it, and pushes back when the model goes sideways. Knowledge tests measure knowledge. Fluency is behavior.

DevEx dashboards. DORA metrics and throughput analytics are genuinely useful — and aggregate. They can show a team's cycle time moved after the Cursor rollout. They cannot tell you which five engineers are carrying the improvement, which five are net-negative with the tool, or what specifically the second group is doing wrong. You can't write a training plan from a percentile.
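
A toy illustration of the aggregation problem, in Python. The per-engineer numbers are invented for this example and come from no real team, and averaging percentage changes is a deliberate simplification of how a dashboard would compute a team figure.

```python
from statistics import mean

# Hypothetical change in cycle time per engineer after an AI tool rollout,
# in percent. Negative = faster, positive = slower. Invented numbers.
cycle_time_change_pct = {
    "eng_a": -40,
    "eng_b": -35,
    "eng_c": -30,
    "eng_d": 15,
    "eng_e": 20,
    "eng_f": 10,
}

# What an aggregate dashboard shows: one healthy-looking team number.
team_average = mean(cycle_time_change_pct.values())

# What it hides: the engineers who got slower with the tool.
slower = sorted(e for e, c in cycle_time_change_pct.items() if c > 0)

print(f"team average: {team_average}%")     # -10%
print(f"slower with the tool: {slower}")    # ['eng_d', 'eng_e', 'eng_f']
```

The team number reads as a 10% improvement while half the team is net-negative, and even the per-engineer view still says nothing about what the slower half is doing wrong.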

What AI fluency actually is

If you watch hundreds of real AI coding sessions — which is what we do — fluency stops being a vibe and decomposes into observable, scoreable behaviors:

  • Context management. Does the engineer give the agent what it needs — the right files, constraints, conventions — or paste an error and hope? Do they manage long sessions or let context rot until the model degrades?
  • Appropriate reliance. The big one. Do they verify AI output before shipping it — run the tests, read the diff — or rubber-stamp? Equally: do they under-rely, hand-writing code the tool would have done in seconds? Both tails are expensive.
  • Task decomposition and delegation. Do they hand the agent well-shaped problems, or dump a vague epic and iterate by frustration?
  • Recovery and steering. When the model goes wrong — it will — do they notice quickly, correct course cheaply, or sunk-cost their way through twenty bad turns?
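
One way to picture those four dimensions as data is the minimal Python sketch below. It is not Promptster's actual schema or rubric: the field names, the 0 to 5 scale, and the sample scores are all assumptions made for illustration. A single number per dimension is also a simplification, since a real rubric would need to tell over-reliance apart from under-reliance.

```python
from dataclasses import dataclass
from statistics import mean

# The four dimensions listed above, as hypothetical field names.
DIMENSIONS = (
    "context_management",
    "appropriate_reliance",
    "task_decomposition",
    "recovery_and_steering",
)


@dataclass
class SessionScore:
    engineer: str
    scores: dict[str, float]  # dimension -> score on an assumed 0-5 scale


def build_profiles(sessions: list[SessionScore]) -> dict[str, dict[str, float]]:
    """Average each engineer's per-dimension scores across their sessions."""
    by_engineer: dict[str, list[SessionScore]] = {}
    for session in sessions:
        by_engineer.setdefault(session.engineer, []).append(session)
    return {
        engineer: {d: mean(s.scores[d] for s in group) for d in DIMENSIONS}
        for engineer, group in by_engineer.items()
    }


def weakest_dimension(profile: dict[str, float]) -> str:
    """Return the dimension with the lowest average score."""
    return min(DIMENSIONS, key=lambda d: profile[d])


# Invented sample data: two sessions for eng_a, one for eng_b.
sessions = [
    SessionScore("eng_a", dict(zip(DIMENSIONS, (4, 2, 3, 4)))),
    SessionScore("eng_a", dict(zip(DIMENSIONS, (5, 1, 4, 4)))),
    SessionScore("eng_b", dict(zip(DIMENSIONS, (3, 4, 2, 3)))),
]

profiles = build_profiles(sessions)
for engineer, profile in profiles.items():
    print(engineer, "weakest:", weakest_dimension(profile))
```

The point of the shape is that the output is per engineer and per dimension, which is what a training plan needs.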

None of this is visible in a survey, a cert, or a dashboard. All of it is visible in the session record: the prompts, the diffs, the commands, the moments where the engineer accepted, rejected, or redirected the machine. That's the same process telemetry we built for hiring assessments — and engineering leaders kept asking us to point it at the team they already have. (That's the origin story of this product, told straight: hiring first, internal teams because customers pulled us there.)

The loop: baseline, prescribe, re-assess

Here's the part the training market gets backwards. Tool-specific instruction is nearly free now — Anthropic's own Academy ships Claude Code courses at no cost, and GitHub has Copilot learning paths and a certification. Content is not the bottleneck. Verification is. Everyone sells instruction; almost nobody can tell you whether it changed how anyone works. (DX's Q4 2025 impact report found structured enablement is among the strongest predictors of AI outcomes — and most orgs still don't provide it, partly because nobody can prove what it returns.)

A measurement-first program looks like this:

  1. Baseline. Capture two weeks of real AI coding sessions across the team. Score each engineer on the fluency dimensions above. Now you know your actual distribution — not the self-reported one.
  2. Prescribe. Training goes to the engineers who need it, targeted at what they specifically do wrong. The over-reliers get verification-loop habits, not another prompt-writing workshop. The under-delegators get the opposite. The free vendor courses suddenly become useful because they're being assigned to the right deficits.
  3. Re-assess. Run the same measurement after. The delta — in fluency scores and in the engineer-hours they imply — is the program's ROI, in a form a CFO can read.
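
The re-assess step reduces to a per-engineer, per-dimension subtraction. Below is a minimal Python sketch under stated assumptions: both rounds use the same dimension names and the same scale, the scores are invented, and `fluency_delta` is a hypothetical helper rather than part of any product API.

```python
Scores = dict[str, dict[str, float]]  # engineer -> dimension -> score


def fluency_delta(baseline: Scores, reassessment: Scores) -> Scores:
    """Change per engineer and dimension, for engineers measured in both rounds."""
    return {
        engineer: {
            dim: round(reassessment[engineer][dim] - score, 2)
            for dim, score in dims.items()
            if dim in reassessment[engineer]
        }
        for engineer, dims in baseline.items()
        if engineer in reassessment
    }


# Invented example: eng_a was trained on verification habits,
# eng_b left the team before the second round.
baseline = {
    "eng_a": {"appropriate_reliance": 1.5, "context_management": 4.0},
    "eng_b": {"task_decomposition": 2.0},
}
reassessment = {
    "eng_a": {"appropriate_reliance": 3.0, "context_management": 4.0},
}

delta = fluency_delta(baseline, reassessment)
print(delta)  # {'eng_a': {'appropriate_reliance': 1.5, 'context_management': 0.0}}
```

Comparing only engineers present in both rounds keeps the delta honest: team churn between measurements should not show up as a training effect.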

The economics only work in salary terms, so do the math in salary terms: a 19%-class drag on one $200k engineer is roughly $38k a year. If a baseline finds even a handful of engineers in that tail — and the perception-gap research says they won't self-identify, because they can't — the measurement pays for itself before the first training session runs.
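
The same salary math as a Python sketch you can rerun with your own numbers. The $200k salary and the 19% drag come from the paragraph above; the count of five engineers is an assumed stand-in for "a handful", and the function name is ours.

```python
def annual_drag_cost(salary: float, slowdown_pct: float) -> float:
    """Yearly cost of a productivity drag, expressed in salary terms."""
    return salary * slowdown_pct / 100


per_engineer = annual_drag_cost(200_000, 19)

# Assumption for illustration: the baseline finds five engineers in the tail.
engineers_in_tail = 5
tail_cost = per_engineer * engineers_in_tail

print(f"per engineer: ${per_engineer:,.0f} a year")                    # $38,000
print(f"{engineers_in_tail} engineers:  ${tail_cost:,.0f} a year")     # $190,000
```

This treats salary as the whole cost of an engineer, so it understates the figure for any org that counts benefits and overhead.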

Five questions to ask before you buy anything in this category

  1. Does it observe real work, or administer a test? (Simulated environments measure test-taking.)
  2. Can it name which engineer needs what training, or does it stop at team aggregates?
  3. Does it distinguish over-reliance from under-reliance, or just count usage?
  4. Can it show a before/after delta on the same engineers, on the same dimensions?
  5. Is it transparent to the engineers being measured? (If it isn't, it's surveillance, and your team will treat it accordingly — visibility into what's captured and framing around growth aren't nice-to-haves, they're what makes the data honest.)

Where this is going

The 2025 mandate wave settled the adoption question by force. The 2026 question — the one boards are now asking the executives who issued those mandates — is whether it worked. "90% adoption" is not an answer. "Here's our fluency distribution, here's who we trained on what, here's the measured delta" is.

We're running internal-team pilots of exactly that loop on top of Promptster's session capture and replay. If you run an engineering org and want a baseline read on your team's AI fluency — who's strong, who needs training, on exactly what — we'd like to talk.
