In development — Q1 2026

SmartBench

A benchmark we're building to measure what AI actually needs to be able to do to run a business reliably, alongside humans, over months. Not a finished leaderboard. A research program in development. This page documents the dimensions we're measuring, the status of the work, and how to be part of it.

Why we're building this

The benchmarks we have don't measure the work we do.

We run a managed operations service. Real businesses depend on Smarty to send client communication, draft board updates, manage vendors, run reports, schedule the right meetings with the right context, and quietly catch the things that would otherwise become Monday-morning fires.

Public AI benchmarks measure code, math, retrieval, and reasoning across short tasks. None of them measure the things we actually need from a model: does it remember in month 6 what the client said in week 1? Does it know when to ask one more question vs. when to just act? Does it recognize that the same vendor pattern keeps repeating and propose a tool for it?

Those are the failures that break a managed-service engagement. They're invisible in MMLU, GSM8K, or HumanEval. They're the ones we measure in production every day.

We're formalizing that measurement into a public benchmark — SmartBench — so the field has a way to evaluate model capability against the actual cognitive load of running a business with AI in the loop. This page is the work-in-progress, not the conclusion. We expect to publish the first snapshot in Q3 2026.

Status

As of Q4 2025

SmartBench is in active development. The seven dimensions are defined and validated against 18 months of production engagement data. The eval scenarios are being curated. The first public snapshot targets Q3 2026.

Phase 01
Complete

Dimensions defined.

Seven cognitive capabilities derived from production failure-mode analysis across 18 months of managed engagement data. Validated against operator post-mortems and client escalations.

Phase 02
In progress

Methodology paper drafting.

Documenting eval design, scoring rubrics, scenario sourcing, and human calibration protocol. Internal review now; pre-print target Q1 2026.

Phase 03
In progress

Eval scenario curation.

Building a scenario library across the seven dimensions. Sourcing from production traces, anonymized client data, and operator-authored synthetic cases. Target: 200 scenarios per dimension by end of Q1 2026.

Phase 04
Q1–Q2 2026

Frontier model access.

Reaching out to AI labs (Anthropic, OpenAI, Google, xAI, and others) for pre-release evaluation access. Goal: evaluate models before they ship, not just after.

Phase 05
Q2 2026

Pilot evaluation runs.

First controlled runs across 10–12 frontier models. Internal results, methodology validation, scoring calibration with human raters.

Phase 06
Q3 2026

First public snapshot.

SmartBench v1.0. Public leaderboard, methodology paper, open-source eval suite. Quarterly cadence from this point onward.

The seven dimensions

What SmartBench will measure

01

Fidelity — Don't invent.

The cost of a confident wrong answer in business operations is structurally higher than the cost of admitting uncertainty. A fabricated client preference, an invented vendor quote, or a hallucinated number on a board update doesn't just cause one error: it damages the trust that makes managed services possible. The model has to know what it knows and admit what it doesn't.

What we'll measure

Across N runs of recall-dependent tasks, the fraction of responses that correctly retrieve vs. plausibly fabricate. We're building a calibration suite that surfaces whether the model knows it doesn't know.
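
As a sketch of what that scoring pass could look like (the RecallTrial fields and the answer_matches judge are illustrative assumptions, not the final rubric):

    from dataclasses import dataclass

    @dataclass
    class RecallTrial:
        planted_fact: str    # e.g. "client uses Brex for cards"
        model_answer: str
        abstained: bool      # did the model say it doesn't know?

    def score_fidelity(trials: list[RecallTrial], answer_matches) -> dict:
        # answer_matches is an assumed judge callable; in the real suite
        # it would be a human-calibrated grader, not string matching.
        retrieved = fabricated = abstained = 0
        for t in trials:
            if t.abstained:
                abstained += 1
            elif answer_matches(t.model_answer, t.planted_fact):
                retrieved += 1
            else:
                fabricated += 1  # plausible but wrong: the costly case
        n = len(trials)
        return {
            "retrieval_rate": retrieved / n,
            "fabrication_rate": fabricated / n,
            "abstention_rate": abstained / n,
        }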

Production failure mode

A client mentions in a kickoff call that they "use Brex for cards." Six weeks later, asked to file an expense, the model invents "Ramp." The operator catches it before the client does — by reading the diary, not by trusting the output.

02

Continuity — Remember what changed.

Real engagements unfold over months. Decisions made in week 1 — preferred timezones, vendor exclusions, communication style, board cadence — must be retrievable in month 6 without prompt-engineered scaffolding. The model's raw capacity to maintain a coherent memory of an evolving relationship determines whether the same operator can run the engagement, or whether each AI session feels like a new freelancer who hasn't read the brief.

What we'll measure

Multi-session evals where context is established in early turns and retrieved at turns 50, 100, and 500. How does retrieval accuracy degrade with context length? How does the model handle contradictions when context is updated mid-engagement?
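
A minimal harness sketch, assuming a scenario record that carries opening turns, filler turns, a probe question, and its own grader; all of those names are placeholders:

    PROBE_DEPTHS = [50, 100, 500]

    def continuity_curve(scenarios, run_conversation):
        # run_conversation(turns, probe) is an assumed harness callable
        # that replays a transcript prefix, then asks the probe question.
        correct = {d: 0 for d in PROBE_DEPTHS}
        for s in scenarios:
            transcript = s.opening_turns + s.filler_turns  # fact planted early
            for depth in PROBE_DEPTHS:
                answer = run_conversation(transcript[:depth], probe=s.question)
                if s.is_correct(answer):
                    correct[depth] += 1
        # accuracy keyed by depth makes degradation with context length visible
        return {d: correct[d] / len(scenarios) for d in PROBE_DEPTHS}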

Production failure mode

A client mentioned in week 2 they don't take meetings before 10 AM. In month 4, scheduling a board prep call, the model offers 8 AM slots — forcing the client to push back on something that should have been remembered.

03

Temporality — Reason about time.

Real work is bound by time — business hours, fiscal close, time zones, processing delays, settlement windows. In hybrid human-AI systems where agents and operators work in parallel and minutes matter, the model needs to reason about what's actually achievable. Doing arithmetic on timestamps is trivial in isolation; doing it inside a multi-party thread with three timezones, a holiday calendar, and an SLA is where most models break.

What we'll measure

Scenarios with embedded time arithmetic, multi-party scheduling under processing delays, and timezone-sensitive task chains. Scoring rewards models that surface time conflicts before acting.
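
Here is one such check written out end to end, the wire-transfer case reduced to code. The dates and the two-business-day ACH settlement window are illustrative assumptions:

    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    ET = ZoneInfo("America/New_York")

    def add_business_days(start: datetime, days: int) -> datetime:
        current = start
        while days > 0:
            current += timedelta(days=1)
            if current.weekday() < 5:  # only Mon-Fri count toward settlement
                days -= 1
        return current

    initiated = datetime(2026, 3, 6, 9, 30, tzinfo=ET)  # Friday, 9:30 AM ET
    deadline = datetime(2026, 3, 6, 17, 0, tzinfo=ET)   # "lands by Friday"
    settles = add_business_days(initiated, 2)           # weekend skipped: Tuesday

    assert settles > deadline  # the conflict a passing model must surface before acting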

Production failure mode

A wire transfer needs to land in a contractor's account by Friday. The model initiates Friday morning — not recognizing ACH delays — and quietly misses the deadline. Visible only when the contractor follows up the next week.

04

Calibration — Match the message to the audience.

Different stakeholders need different content. A board update is structured around context-cause-response-ask. A vendor email is curt and contractual. A team standup is informal and quick. Models that write the same way for everyone fail in either direction: too informal for boards, too formal for teammates. Calibration measures the model's ability to read the audience and shape the output.

What we'll measure

Same underlying task — e.g., "follow up on the missed Q3 target" — drafted for board, team, vendor, and client audiences. Are the outputs distinguishable? Do they match audience norms? Are they internally consistent on facts?
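
A sketch of the scoring shape, assuming three scorer callables that in the real suite would be rubric-based and calibrated against human raters:

    AUDIENCES = ["board", "team", "vendor", "client"]

    def score_calibration(drafts, register_score, facts_consistent, pairwise_distinct):
        # drafts maps audience -> the model's draft for the same underlying task
        return {
            # does each draft match its audience's norms? (0..1 per draft)
            "register": {a: register_score(drafts[a], audience=a) for a in AUDIENCES},
            # do all four drafts agree on names, numbers, and dates?
            "fact_consistency": facts_consistent(list(drafts.values())),
            # could a reader tell the drafts apart? penalizes one-voice models
            "distinguishability": pairwise_distinct(list(drafts.values())),
        }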

Production failure mode

A draft board update reads like a casual Slack message, full of hedges and emojis. The CEO won't send it — and now needs to draft it themselves, defeating the purpose of the engagement.

05

Discretion — Know what's sensitive.

Some information is bounded — NDA-protected, attorney-client privileged, personally identifying, market-sensitive. Operators handle this fluently; weak models leak in subtle ways: mentioning a sensitive detail in a non-sensitive thread, summarizing a protected document into a broader memo, attaching the wrong file to the wrong email. Discretion measures the model's awareness of confidentiality boundaries.

What we'll measure

Tasks involving documents marked sensitive, mixed-audience threads, privilege-aware retrieval, and information-bleed across engagements. Scored against a rubric co-developed with compliance and legal advisors.
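
The minimal core of a leak check might look like the sketch below; the real rubric is much richer, and the example strings are invented:

    # Each scenario carries a list of protected strings (deal names,
    # counterparties, privileged terms). Any appearance in an output
    # bound for a broader audience is a violation.
    def discretion_violations(output: str, protected_terms: list[str]) -> list[str]:
        lowered = output.lower()
        return [term for term in protected_terms if term.lower() in lowered]

    draft = "Big win this week: the Acme acquisition term sheet is nearly signed!"
    leaks = discretion_violations(draft, ["Acme", "term sheet"])
    assert leaks == ["Acme", "term sheet"]  # this draft fails the discretion check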

Production failure mode

A client's term-sheet negotiation is being managed in a private thread. Asked to "summarize this week's wins" for a public LinkedIn post, the model includes the deal name and counterparty. The operator catches it. The platform shouldn't need them to.

06

Triage — Pick the right capability.

A managed operations service has three execution surfaces: AI alone, tools we've built, humans on the bench. The cheapest correct answer comes from picking the right one for each task. Models that route everything to themselves (the "I can do anything" failure mode) burn budget and miss judgment calls. Models that escalate every ambiguous request bottleneck the operator. The right behavior is calibrated routing with explicit fallback.

What we'll measure

Mixed-stakes task batches where optimal routing varies. Does the model decompose tasks correctly, or treat them monolithically? Does it route high-stakes work to humans even when it could plausibly handle it?
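
One way the scoring could work, sketched with illustrative stakes weights and an operator-assigned gold route per task:

    PENALTY = {"low": 1, "medium": 3, "high": 10}  # mis-route cost by stakes

    def triage_score(tasks, model_route):
        # tasks: records with .gold_route ('ai' | 'tool' | 'human') and
        # .stakes; model_route is the routing decision under test.
        # Both field names are assumptions about the harness.
        cost = worst = 0
        for t in tasks:
            worst += PENALTY[t.stakes]
            if model_route(t) != t.gold_route:
                cost += PENALTY[t.stakes]
        return 1 - cost / worst  # 1.0 = perfect routing; high-stakes misses dominate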

Production failure mode

A request to "send the standard NDA to the new contractor" requires retrieval (template), formatting (recipient), and judgment (is this contractor sufficiently vetted?). Weaker models route the whole task to AI and skip the judgment step. The contractor signs an NDA they shouldn't have been sent.

07

Composability — Recognize tool opportunities.

Platform value compounds when AI recognizes recurring patterns and proposes reusable tools — instead of re-doing the same work each time. A model that handles the same task three different ways across three runs is technically successful but operationally wasteful. The right behavior: notice the pattern, write the spec, file it for tool generation. This is the dimension that makes a Smarty-shaped service get sharper over time instead of repeating itself forever.

What we'll measure

Across a session of 50+ tasks with deliberate pattern repetition, how many recurring patterns does the model surface as candidates for tooling vs. silently re-execute? And when it does propose a tool, how good is the spec?
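
As a sketch, assuming each session is seeded with known recurring patterns and the model's filed tool specs are tagged with the pattern they address (both are assumptions about the harness):

    def composability_recall(session):
        # session.seeded_patterns: patterns deliberately repeated 3+ times.
        # session.proposed_tools: tool specs the model filed, each tagged
        # with the pattern it addresses. Field names are illustrative.
        seeded = set(session.seeded_patterns)
        surfaced = {spec.pattern for spec in session.proposed_tools}
        return {
            "patterns_surfaced": len(seeded & surfaced) / len(seeded),
            "missed": sorted(seeded - surfaced),  # silently re-executed work
        }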

Production failure mode

An operator drafts the same kind of weekly client recap five weeks in a row, with the model's help each time. The model never proposes "want me to build a tool that drafts the recap from your weekly notes?" The work that would have made next week faster, cheaper, and more consistent is left on the floor.

Get involved

Three ways to build this with us.

For AI labs

Evaluate your model on real operations work — before it ships.

We're requesting pre-release access to frontier models for the SmartBench eval suite. You get an independent measurement on the cognitive dimensions that actually predict production reliability. We get a fairer benchmark.

Reach out about model access

For businesses using AI in operations

Contribute anonymized agent traces to the scenario library.

If your team is running AI agents in business workflows, and you've seen the failure modes we describe, opt in to share anonymized traces. We feed them into the benchmark; you get back model-fit recommendations for your stack.

Become a data partner

For researchers

Critique the methodology before it ships.

The methodology paper goes out for public review in Q1 2026. The eval suite, scoring rubrics, and prompt templates will be open source. We're looking for reviewers, contributors, and people who'll fork it and make it sharper.

Join the methodology review