The product

Real work, measured.

Fluency tests how people actually work with AI — through real scenarios, judgement calls, and verification tasks. Two ways to run it: an industry-standard baseline, or a fully bespoke sandbox built around your sector and scoring priorities.

Path 01 · Baseline

Industry-standard baseline.

The fixed test. Same modules, same scoring, every time. The right answer when you need consistency across hires, cohorts, or time.

  • Nine modules, fixed sequence
  • A consistent benchmark across your org
  • Comparable scores across every candidate

Path 02 · Sandbox

Fully bespoke sandbox.

The same engine, built around your org. Pick the modules that matter, anchor every scenario in your sector, and weight the scoring to your priorities.

  • Choose your module set
  • Scenarios drawn from your sector
  • Scoring weights tuned to your work

Module library

Ten modules. Each a different lens.

Every module isolates a different way of working with AI — from live collaboration to verification to risk judgement. Run the baseline set, or pick the ones that fit your work.

Not exhaustive · new modules are added regularly

Interactive chat · 25 pts

Live AI Collaboration

Candidates brief, refine, and direct an AI through a real workplace task.

Binary decisions · 15 pts

AI Judgement

Genuine judgement calls on when AI should and shouldn’t be used.

Adversarial chat · 10 pts

Restriction Override

How candidates handle a constrained AI — without breaking the rules.

Drag-and-drop · 5 pts

Risk Ranking

Order AI use-cases by risk. A read on real-world prioritisation.

Slider allocation · 5 pts

Priority Weighting

How candidates brief an AI when budget and accuracy compete.

Error detection · 15 pts

Verification Challenge

Spot the errors in AI-generated work — hallucinations, omissions, drift.

Structured review · 10 pts

Handoff Review

Critical review of AI output before it leaves the candidate’s desk.

Lane assignment · 5 pts

Workflow Architect

Which parts of a workflow belong to AI and which belong to a human.

Prompt debugging · 10 pts

Error Trace

Find the prompt mistake that caused an AI to go wrong.

Parameter tuning · 10 pts

Live Parameter Tuning

Tune temperature, length, and tone live to hit a target output.

One module, in depth

Live AI Collaboration.

A representative example — the same depth applies to every module.

What the candidate sees · Live · 15 min

Scenario brief

A six-month engagement has just ended. The client’s COO has asked for a feedback report covering what the engagement achieved, where it fell short, and three recommendations for the next phase. You have an AI assistant — direct it.

Candidate

Draft a 4-section feedback report for the COO. Section 1: outcomes. Section 2: gaps. Section 3: three recommendations. Section 4: a closing line. Keep tone consultative, not defensive.

AI · draft

Got it. Quick check before I draft: any specific outcomes the COO already knows about? Anything to under-play or emphasise for the board?

Candidate

They know the headline KPIs. Don’t under-play the second-phase rollout slip; own it. Emphasise the three new processes their team has carried forward.

AI · draft

Drafting… 380-word report incoming. Want me to keep the closing line warm, or punchy?

Brief · clarify · refine · approve · 02:14 / 15:00

What’s being scored

Five dimensions.

  • Specificity · ×1.2

    Did they brief with constraints, audience, and structure — or hand-wave?

  • Iteration quality · ×1.4

    When the AI asked a clarifying question, was the candidate’s answer useful?

  • Critical judgement · ×1.3

    Did they push back, override, or accept everything the AI suggested?

  • Tone & appropriateness · ×0.9

    Did they direct toward a tone appropriate to the audience?

  • Time-to-useful · ×0.7

    How quickly did the brief converge on something usable?
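
One way the five weights above could combine into the module's 25 points is a weighted mean. This is a minimal sketch, not Fluency's actual scoring: the 0-to-1 rating per dimension, the normalisation by the sum of the weights, and the names `WEIGHTS`, `LIVE_COLLAB_POINTS`, and `module_score` are all assumptions made for illustration.

```python
# Dimension weights as listed above. How they are combined is an assumption.
WEIGHTS = {
    "specificity": 1.2,
    "iteration_quality": 1.4,
    "critical_judgement": 1.3,
    "tone_appropriateness": 0.9,
    "time_to_useful": 0.7,
}

LIVE_COLLAB_POINTS = 25  # Live AI Collaboration is listed at 25 pts


def module_score(ratings: dict[str, float]) -> float:
    """Weighted mean of hypothetical 0-1 ratings, scaled to the module's points."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the five dimensions")
    for name, value in ratings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} rating must be between 0 and 1")
    weighted = sum(WEIGHTS[d] * ratings[d] for d in WEIGHTS)
    return LIVE_COLLAB_POINTS * weighted / sum(WEIGHTS.values())
```

Under this scheme a strong rating on Iteration quality (×1.4) moves the module score twice as far as the same rating on Time-to-useful (×0.7).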

Scoring

Four bands. One scale.

Every score lands on a calibrated 100-point scale, split into four capability bands. The same scale across every module, every candidate, every time — so you can build a baseline that means something to your org.

  • Calibrated scale, not opinion.

    Anchored by reference rubrics, not the mood of the reviewer.

  • Evidence with every score.

    Each dimension links back to the exact moment in the candidate’s response.

  • Comparable across time.

    A 62 today means the same a year from now. No grade inflation.

Capability band

Baseline · 44

Emerging · Developing · Proficient · Exceptional
  • 0–34 · Emerging. Treats AI as a search box. Limited critical engagement.
  • 35–54 · Developing. Briefs well but accepts most outputs. Verifies inconsistently.
  • 55–79 · Proficient. Iterates with intent. Verifies risky output. Knows when not to use AI.
  • 80–100 · Exceptional. Treats AI as a collaborator. Reasons about its failure modes. Trustworthy with high-stakes work.
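
The band boundaries above translate directly into a lookup. The sketch below assumes scores are whole numbers; the names `BANDS` and `band_for` are illustrative, not part of the product.

```python
# Band boundaries copied from the list above (inclusive on both ends).
BANDS = [
    (0, 34, "Emerging"),
    (35, 54, "Developing"),
    (55, 79, "Proficient"),
    (80, 100, "Exceptional"),
]


def band_for(score: int) -> str:
    """Return the capability band for a whole-number score from 0 to 100."""
    for low, high, name in BANDS:
        if low <= score <= high:
            return name
    raise ValueError("score must be a whole number from 0 to 100")
```

With these boundaries, the baseline of 44 shown above falls in the Developing band.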

The report

Everything in one page.

Review a candidate in under ten minutes. Annotated below.

Alex Rivera

Senior PM candidate · 14 May 2026

Proficient

Overall

Baseline · 44

62/100 · +18 vs. baseline

Emerg. · Devel. · Profic. · Except.

By module

  • Live AI Collaboration · 18/25
  • AI Judgement · 12/15
  • Verification Challenge · 9/15
  • Workflow Architect · 4/5
  • Error Trace · 5/10

Evidence · why this score

  • Strong iteration in Live AI — pushed back on the AI’s first draft tone.
  • Caught 4/6 errors in Verification, including one numeric drift.
  • Missed two prompt issues in Error Trace.

  • Candidate & date. Who took it, when, what they were assessed for.
  • Headline score. Overall + band + delta vs. your org’s baseline.
  • Module breakdown. Where they were strong, where they weren’t.
  • Evidence. Direct quotes and moments from their work.
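
The headline delta and the module breakdown in the sample report can be reproduced with a few lines of arithmetic. This is a sketch only: the `ModuleResult` type and both helper functions are hypothetical, and the figures are the ones shown for the sample candidate above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleResult:
    name: str
    earned: int
    available: int

    @property
    def share(self) -> float:
        """Fraction of the module's points the candidate earned."""
        return self.earned / self.available


# Module scores as shown in the sample report.
RESULTS = [
    ModuleResult("Live AI Collaboration", 18, 25),
    ModuleResult("AI Judgement", 12, 15),
    ModuleResult("Verification Challenge", 9, 15),
    ModuleResult("Workflow Architect", 4, 5),
    ModuleResult("Error Trace", 5, 10),
]


def delta_vs_baseline(overall: int, baseline: int) -> str:
    """Format the signed gap between a score and the org baseline."""
    return f"{overall - baseline:+d} vs. baseline"


def weakest_first(results: list[ModuleResult]) -> list[ModuleResult]:
    """Order modules from lowest to highest share of points earned."""
    return sorted(results, key=lambda r: r.share)
```

Sorting by share rather than raw points keeps a 5-point module comparable with a 25-point one when looking for weak spots.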

Sandbox

Built around your work.

Same engine, same scoring rigour — re-shaped around the modules, scenarios, and priorities that matter to your org.

Pick what you measure.

Choose any subset of the ten modules. Build a 20-minute focused test or a deep diagnostic — whatever fits the work.

Live AI Collaboration
AI Judgement
Verification
Workflow Architect
Risk Ranking
Error Trace
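
The page does not say how a module subset maps back onto the 100-point scale. One plausible approach is proportional rescaling, sketched below. The point values come from the module library; the function `sandbox_scale` and the rescaling rule itself are assumptions.

```python
# Point values as listed in the module library.
MODULE_POINTS = {
    "Live AI Collaboration": 25,
    "AI Judgement": 15,
    "Restriction Override": 10,
    "Risk Ranking": 5,
    "Priority Weighting": 5,
    "Verification Challenge": 15,
    "Handoff Review": 10,
    "Workflow Architect": 5,
    "Error Trace": 10,
    "Live Parameter Tuning": 10,
}


def sandbox_scale(selected: list[str], earned: dict[str, float]) -> float:
    """Rescale points earned on a chosen module subset to a 0-100 score.

    Assumption: the subset's available points are stretched proportionally
    to 100. Modules missing from `earned` count as zero.
    """
    if not selected:
        raise ValueError("select at least one module")
    unknown = set(selected) - set(MODULE_POINTS)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    available = sum(MODULE_POINTS[m] for m in selected)
    total = sum(earned.get(m, 0) for m in selected)
    return round(100 * total / available, 1)
```

The six modules shown above carry 75 of the library's points, so under this rule each point earned in that sandbox would count for one and a third points on the final scale.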

Trust

What sits behind every score.

Calibrated, not opinionated.

Every score anchored to a reference rubric — not a reviewer’s mood.

Auditable forever.

Every assessment, response, and rationale stored. Re-scoreable on demand.

Private by default.

Candidate data isolated per org. No candidate data is used to train models. GDPR-aligned.

Bias by design? No.

Scoring tested across personas to surface — and reduce — systemic bias.

Get started

Run the baseline, or build a sandbox.

Request access and we’ll walk you through the platform — or scope a sandbox built around your sector and scoring priorities.