In development — Q1 2026

SmartBench

A benchmark we're building to measure what AI actually needs to be able to do to run a business reliably, alongside humans, over months. Not a finished leaderboard. A research program in development. This page documents the dimensions we're measuring, the status of the work, and how to be part of it.

Why we're building this

The benchmarks we have don't measure the work we do.

We run a managed operations service. Real businesses depend on Smarty to send client communication, draft board updates, manage vendors, run reports, schedule the right meetings with the right context, and quietly catch the things that would otherwise become Monday-morning fires.

Public AI benchmarks measure code, math, retrieval, and reasoning across short tasks. None of them measure the things we actually need from a model: does it remember in month 6 what the client said in week 1? Does it know when to ask one more question vs. when to just act? Does it recognize that the same vendor pattern keeps repeating and propose a tool for it?

Those are the failures that break a managed-service engagement. They're invisible in MMLU, GSM8K, or HumanEval. They're the ones we measure in production every day.

We're formalizing that measurement into a public benchmark — SmartBench — so the field has a way to evaluate model capability against the actual cognitive load of running a business with AI in the loop. This page is the work-in-progress, not the conclusion. We expect to publish the first snapshot in Q3 2026.

Status

As of Q4 2025

SmartBench is in active development. The seven dimensions are defined and validated against 18 months of production engagement data. The eval scenarios are being curated. The first public snapshot targets Q3 2026.

Phase 01
Complete

Dimensions defined.

Seven cognitive capabilities derived from production failure-mode analysis across 18 months of managed engagement data. Validated against operator post-mortems and client escalations.

Phase 02
In progress

Methodology paper drafting.

Documenting eval design, scoring rubrics, scenario sourcing, and human calibration protocol. Internal review now; pre-print target Q1 2026.

Phase 03
In progress

Eval scenario curation.

Building a scenario library across the seven dimensions. Sourcing from production traces, anonymized client data, and operator-authored synthetic cases. Target: 200 scenarios per dimension by end of Q1 2026.

Phase 04
Q1–Q2 2026

Frontier model access.

Reaching out to AI labs (Anthropic, OpenAI, Google, xAI, and others) for pre-release evaluation access. Goal: evaluate models before they ship, not just after.

Phase 05
Q2 2026

Pilot evaluation runs.

First controlled runs across 10–12 frontier models. Internal results, methodology validation, scoring calibration with human raters.

Phase 06
Q3 2026

First public snapshot.

SmartBench v1.0. Public leaderboard, methodology paper, open-source eval suite. Quarterly cadence from this point onward.

The seven dimensions

What SmartBench will measure

01

Fidelity — Don't invent.

The cost of a confident wrong answer in business operations is structurally higher than the cost of admitting uncertainty. A fabricated client preference, an invented vendor quote, or a hallucinated number on a board update doesn't just cause one error: it damages the trust that makes managed services possible. The model has to know what it knows and admit what it doesn't.

What we'll measure

Across N runs of recall-dependent tasks, the fraction of responses that correctly retrieve vs. plausibly fabricate. We're building a calibration suite that surfaces whether the model knows it doesn't know.
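
As a sketch of what that scoring pass could look like (the RecallTrial fields and the answer_matches judge are illustrative assumptions, not the final rubric):

    from dataclasses import dataclass

    @dataclass
    class RecallTrial:
        planted_fact: str    # e.g. "client uses Brex for cards"
        model_answer: str
        abstained: bool      # did the model say it doesn't know?

    def score_fidelity(trials: list[RecallTrial], answer_matches) -> dict:
        # answer_matches is an assumed judge callable; in the real suite
        # it would be a human-calibrated grader, not string matching.
        retrieved = fabricated = abstained = 0
        for t in trials:
            if t.abstained:
                abstained += 1
            elif answer_matches(t.model_answer, t.planted_fact):
                retrieved += 1
            else:
                fabricated += 1  # plausible but wrong: the costly case
        n = len(trials)
        return {
            "retrieval_rate": retrieved / n,
            "fabrication_rate": fabricated / n,
            "abstention_rate": abstained / n,
        }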

Production failure mode

A client mentions in a kickoff call that they "use Brex for cards." Six weeks later, asked to file an expense, the model invents "Ramp." The operator catches it before the client does — by reading the diary, not by trusting the output.

02

Continuity — Remember what changed.

Real engagements unfold over months. Decisions made in week 1 — preferred timezones, vendor exclusions, communication style, board cadence — must be retrievable in month 6 without prompt-engineered scaffolding. The model's raw capacity to maintain a coherent memory of an evolving relationship determines whether the same operator can run the engagement, or whether each AI session feels like a new freelancer who hasn't read the brief.

What we'll measure

Multi-session evals where context is established in early turns and retrieved at turns 50, 100, and 500. How does retrieval accuracy degrade with context length? How does the model handle contradictions when context is updated mid-engagement?
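
A minimal harness sketch, assuming a scenario record that carries opening turns, filler turns, a probe question, and its own grader; all of those names are placeholders:

    PROBE_DEPTHS = [50, 100, 500]

    def continuity_curve(scenarios, run_conversation):
        # run_conversation(turns, probe) is an assumed harness callable
        # that replays a transcript prefix, then asks the probe question.
        correct = {d: 0 for d in PROBE_DEPTHS}
        for s in scenarios:
            transcript = s.opening_turns + s.filler_turns  # fact planted early
            for depth in PROBE_DEPTHS:
                answer = run_conversation(transcript[:depth], probe=s.question)
                if s.is_correct(answer):
                    correct[depth] += 1
        # accuracy keyed by depth makes degradation with context length visible
        return {d: correct[d] / len(scenarios) for d in PROBE_DEPTHS}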

Production failure mode

A client mentioned in week 2 they don't take meetings before 10 AM. In month 4, scheduling a board prep call, the model offers 8 AM slots — forcing the client to push back on something that should have been remembered.

03

Temporality — Reason about time.

Real work is bound by time — business hours, fiscal close, time zones, processing delays, settlement windows. In hybrid human-AI systems where agents and operators work in parallel and minutes matter, the model needs to reason about what's actually achievable. Doing arithmetic on timestamps is trivial in isolation; doing it inside a multi-party thread with three timezones, a holiday calendar, and an SLA is where most models break.

What we'll measure

Scenarios with embedded time arithmetic, multi-party scheduling under processing delays, and timezone-sensitive task chains. Scoring rewards models that surface time conflicts before acting.
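
Here is one such check written out end to end, the wire-transfer case reduced to code. The dates and the two-business-day ACH settlement window are illustrative assumptions:

    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    ET = ZoneInfo("America/New_York")

    def add_business_days(start: datetime, days: int) -> datetime:
        current = start
        while days > 0:
            current += timedelta(days=1)
            if current.weekday() < 5:  # only Mon-Fri count toward settlement
                days -= 1
        return current

    initiated = datetime(2026, 3, 6, 9, 30, tzinfo=ET)  # Friday, 9:30 AM ET
    deadline = datetime(2026, 3, 6, 17, 0, tzinfo=ET)   # "lands by Friday"
    settles = add_business_days(initiated, 2)           # weekend skipped: Tuesday

    assert settles > deadline  # the conflict a passing model must surface before acting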

Production failure mode

A wire transfer needs to land in a contractor's account by Friday. The model initiates Friday morning — not recognizing ACH delays — and quietly misses the deadline. Visible only when the contractor follows up the next week.

04

Calibration — Match the message to the audience.

Different stakeholders need different content. A board update is structured around context-cause-response-ask. A vendor email is curt and contractual. A team standup is informal and quick. Models that write the same way for everyone fail in either direction: too informal for boards, too formal for teammates. Calibration measures the model's ability to read the audience and shape the output.

What we'll measure

Same underlying task — e.g., "follow up on the missed Q3 target" — drafted for board, team, vendor, and client audiences. Are the outputs distinguishable? Do they match audience norms? Are they internally consistent on facts?
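
A sketch of the scoring shape, assuming three scorer callables that in the real suite would be rubric-based and calibrated against human raters:

    AUDIENCES = ["board", "team", "vendor", "client"]

    def score_calibration(drafts, register_score, facts_consistent, pairwise_distinct):
        # drafts maps audience -> the model's draft for the same underlying task
        return {
            # does each draft match its audience's norms? (0..1 per draft)
            "register": {a: register_score(drafts[a], audience=a) for a in AUDIENCES},
            # do all four drafts agree on names, numbers, and dates?
            "fact_consistency": facts_consistent(list(drafts.values())),
            # could a reader tell the drafts apart? penalizes one-voice models
            "distinguishability": pairwise_distinct(list(drafts.values())),
        }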

Production failure mode

A draft board update reads like a casual Slack message, full of hedges and emojis. The CEO won't send it — and now needs to draft it themselves, defeating the purpose of the engagement.

05

Discretion — Know what's sensitive.

Some information is bounded — NDA-protected, attorney-client privileged, personally identifying, market-sensitive. Operators handle this fluently; weak models leak in subtle ways: mentioning a sensitive detail in a non-sensitive thread, summarizing a protected document into a broader memo, attaching the wrong file to the wrong email. Discretion measures the model's awareness of confidentiality boundaries.

What we'll measure

Tasks involving documents marked sensitive, mixed-audience threads, privilege-aware retrieval, and information-bleed across engagements. Scored against a rubric co-developed with compliance and legal advisors.
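
The minimal core of a leak check might look like the sketch below; the real rubric is much richer, and the example strings are invented:

    # Each scenario carries a list of protected strings (deal names,
    # counterparties, privileged terms). Any appearance in an output
    # bound for a broader audience is a violation.
    def discretion_violations(output: str, protected_terms: list[str]) -> list[str]:
        lowered = output.lower()
        return [term for term in protected_terms if term.lower() in lowered]

    draft = "Big win this week: the Acme acquisition term sheet is nearly signed!"
    leaks = discretion_violations(draft, ["Acme", "term sheet"])
    assert leaks == ["Acme", "term sheet"]  # this draft fails the discretion check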

Production failure mode

A client's term-sheet negotiation is being managed in a private thread. Asked to "summarize this week's wins" for a public LinkedIn post, the model includes the deal name and counterparty. The operator catches it. The platform shouldn't need them to.

06

Triage — Pick the right capability.

A managed operations service has three execution surfaces: AI alone, tools we've built, humans on the bench. The cheapest correct answer comes from picking the right one for each task. Models that route everything to themselves (the "I can do anything" failure mode) burn budget and miss judgment calls. Models that escalate every ambiguous request bottleneck the operator. The right behavior is calibrated routing with explicit fallback.

What we'll measure

Mixed-stakes task batches where optimal routing varies. Does the model decompose tasks correctly, or treat them monolithically? Does it route high-stakes work to humans even when it could plausibly handle it?
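
One way the scoring could work, sketched with illustrative stakes weights and an operator-assigned gold route per task:

    PENALTY = {"low": 1, "medium": 3, "high": 10}  # mis-route cost by stakes

    def triage_score(tasks, model_route):
        # tasks: records with .gold_route ('ai' | 'tool' | 'human') and
        # .stakes; model_route is the routing decision under test.
        # Both field names are assumptions about the harness.
        cost = worst = 0
        for t in tasks:
            worst += PENALTY[t.stakes]
            if model_route(t) != t.gold_route:
                cost += PENALTY[t.stakes]
        return 1 - cost / worst  # 1.0 = perfect routing; high-stakes misses dominate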

Production failure mode

A request to "send the standard NDA to the new contractor" requires retrieval (template), formatting (recipient), and judgment (is this contractor sufficiently vetted?). Weaker models route the whole task to AI and skip the judgment step. The contractor signs an NDA they shouldn't have been sent.

07

Composability — Recognize tool opportunities.

Platform value compounds when AI recognizes recurring patterns and proposes reusable tools — instead of re-doing the same work each time. A model that handles the same task three different ways across three runs is technically successful but operationally wasteful. The right behavior: notice the pattern, write the spec, file it for tool generation. This is the dimension that makes a Smarty-shaped service get sharper over time instead of repeating itself forever.

What we'll measure

Across a session of 50+ tasks with deliberate pattern repetition, how many recurring patterns does the model surface as candidates for tooling vs. silently re-execute? And when it does propose a tool, how good is the spec?
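
As a sketch, assuming each session is seeded with known recurring patterns and the model's filed tool specs are tagged with the pattern they address (both are assumptions about the harness):

    def composability_recall(session):
        # session.seeded_patterns: patterns deliberately repeated 3+ times.
        # session.proposed_tools: tool specs the model filed, each tagged
        # with the pattern it addresses. Field names are illustrative.
        seeded = set(session.seeded_patterns)
        surfaced = {spec.pattern for spec in session.proposed_tools}
        return {
            "patterns_surfaced": len(seeded & surfaced) / len(seeded),
            "missed": sorted(seeded - surfaced),  # silently re-executed work
        }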

Production failure mode

An operator drafts the same kind of weekly client recap five weeks in a row, with the model's help each time. The model never proposes "want me to build a tool that drafts the recap from your weekly notes?" The work that would have made next week faster, cheaper, and more consistent is left on the floor.

Get involved

Three ways to build this with us.

For AI labs

Evaluate your model on real operations work — before it ships.

We're requesting pre-release access to frontier models for the SmartBench eval suite. You get an independent measurement on the cognitive dimensions that actually predict production reliability. We get a fairer benchmark.

Reach out about model access

For businesses using AI in operations

Contribute anonymized agent traces to the scenario library.

If your team is running AI agents in business workflows, and you've seen the failure modes we describe, opt in to share anonymized traces. We feed them into the benchmark; you get back model-fit recommendations for your stack.

Become a data partner

For researchers

Critique the methodology before it ships.

The methodology paper goes out for public review in Q1 2026. The eval suite, scoring rubrics, and prompt templates will be open source. We're looking for reviewers, contributors, and people who'll fork it and make it sharper.

Join the methodology review