Syntax
Plain TypeScript: no DSL, no canvas. A scene pairs a prompt with assertions about behavior. Chain any of 13 deterministic matchers; reach for the LLM judge only when the behavior under test is genuinely fuzzy.
import { agent, scene, expect } from "@agest/core";

// `executor` is your agent wrapped as (input) => Promise<Response>, defined elsewhere.
await agent(executor, () => {
  // Out-of-scope request: the agent is expected to decline.
  scene("What is the weather today?")
    .expect("response", (r) => {
      expect(r).toBe.refusal();
    });

  // In-scope request, repeated to measure a pass rate.
  scene("Say 'good morning' in Japanese")
    .runs(5) // pass rate + 95% CI
    .expect("response", (r) => {
      expect(r).toBe.notRefusal();
      expect(r).toBe.containingText("ohayou");
    });

  // Fuzzy quality: handed to the LLM judge.
  scene("Explain quantum physics")
    .expect("response", (r) => {
      expect(r).toBe.judgedBy({
        criteria: "Stays in the language-learning context",
        failWhen: "Explains physics instead of redirecting",
      });
    });
});
- .refusal() (declined)
- .notRefusal() (complied)
- .containingText()
- .notContainingText()
- .matchingPattern()
- .equalTo() · notEqualTo
- .containingItem()
- .containingSubset()
- .ofLength()
- .matchingSchema() (zod)
- .judgedBy() (LLM)
- .satisfying() (predicate)
- .semanticallySimilarTo() (soon)
.runs(n)
.turns(n)
.expectSchema()
suite()
Capabilities
Behavior coverage
Tag scenes with capability areas (refusal, correctness, format, tool-use, memory, performance, robustness) and agest coverage renders a radar of which behaviors are tested, how well, and where confidence is still too thin to trust.
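To make the idea of distinct coverage states concrete, here is a minimal sketch of how an area could be classified. It is an illustration only, not agest's actual logic: the `AreaStats` and `AreaTarget` shapes, the `classifyArea` function, and the ordering of the checks are all assumptions introduced for this example.

```typescript
// Hypothetical state names, loosely modelled on the labels in the CLI report.
type CoverageState = "untested" | "under-tested" | "low confidence" | "ok";

// Hypothetical per-area statistics (not agest's real data model).
interface AreaStats {
  runs: number;       // total scene executions recorded for the area
  lowerBound: number; // lower end of the pass-rate confidence interval, 0..1
}

// Hypothetical per-area target, as a team might define it.
interface AreaTarget {
  minRuns: number;       // below this, the sample is too small to judge
  minLowerBound: number; // required lower bound on the pass rate, 0..1
}

function classifyArea(stats: AreaStats, target: AreaTarget): CoverageState {
  if (stats.runs === 0) return "untested";
  if (stats.runs < target.minRuns) return "under-tested";
  if (stats.lowerBound < target.minLowerBound) return "low confidence";
  return "ok";
}
```

Checking sample size before pass rate is what keeps "no data" and "not enough data" from being reported as failures.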
Statistical confidence
.runs(n) repeats a scene and reports a pass rate with a Wilson 95% confidence interval. Flakiness stops being a gut feeling: "94% of the time, ±4%" is a number you can gate on.
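For readers who want to see where such a number comes from, the standard Wilson score interval can be sketched in a few lines. This is the textbook formula, not agest's source code; the `wilsonInterval` name and the `Interval` shape are made up for this example, and z = 1.96 is the usual value for a two-sided 95% level.

```typescript
interface Interval {
  rate: number;  // observed pass rate
  lower: number; // lower end of the interval
  upper: number; // upper end of the interval
}

// Wilson score interval for `passes` successes out of `runs` trials.
function wilsonInterval(passes: number, runs: number, z = 1.96): Interval {
  if (
    !Number.isInteger(passes) || !Number.isInteger(runs) ||
    runs <= 0 || passes < 0 || passes > runs
  ) {
    throw new RangeError("expected integers with runs > 0 and 0 <= passes <= runs");
  }
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  // The interval is centred on a value pulled toward 0.5, not on p itself.
  const center = (p + z2 / (2 * runs)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
  return {
    rate: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}
```

With this formula, 5 passes out of 5 runs still gives a lower bound of only about 0.57, which is why a small clean run is not the same as high confidence.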
Opinionated for your team
An extensible agest.config.ts sets which capability areas matter, per-area confidence targets, your judge model, pricing, and thresholds. Encode your team's quality standard once; enforce it in CI.
Framework-agnostic
Wrap any agent (a raw model SDK, LangChain / LangGraph, or a remote HTTP endpoint) in a one-line (input) => Promise<Response> executor. No canvas, no lock-in.
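As a rough illustration of the executor shape, here is a sketch that wraps a remote HTTP endpoint. Everything beyond the (input) => Promise<Response> signature is an assumption: the `AgentResponse` type stands in for the library's real `Response` type, which may differ, and the `{ input }` request body and `{ output }` reply are a hypothetical wire format. The fetch function is passed in so the sketch can be exercised with a stub.

```typescript
// Hypothetical stand-in for the library's Response type.
interface AgentResponse {
  text: string;
}

type Executor = (input: string) => Promise<AgentResponse>;

// The minimal subset of fetch this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

function httpExecutor(url: string, fetchImpl: FetchLike): Executor {
  return async (input) => {
    const res = await fetchImpl(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ input }), // assumed request shape
    });
    if (!res.ok) {
      throw new Error(`agent endpoint returned HTTP ${res.status}`);
    }
    const data = (await res.json()) as { output?: unknown }; // assumed reply shape
    return { text: String(data.output ?? "") };
  };
}
```

On Node.js 22+ the global fetch can be supplied as `fetchImpl`; a raw model SDK or a LangChain agent would be wrapped the same way, with the HTTP call swapped for the SDK call.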
Reports
The coverage radar is drawn with the exact braille pixel renderer the CLI ships with. Compare models, track prompt evolution, and find the accuracy-vs-cost sweet spot — or keep the whole thing terminal-native.
$ terminal-native
agest stats
agest coverage
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGEST STATS · 12 reports found · loan-eligibility
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Success Rate
────────────────────────────────────────────────────────────
anthropic/claude-sonnet-4 ████████████████████ 100%
anthropic/claude-haiku-4-5 ███████████████████░ 93%
google/gemini-2.0-flash ███████████████████░ 93%
openai/gpt-4.1-mini ██████████████████░░ 87%
openai/gpt-4.1-nano ████████████████░░░░ 80%
meta-llama/llama-3.1-8b ███████████████░░░░░ 73%
mistralai/ministral-8b ████████████░░░░░░░░ 60%
Capability Coverage (claude-sonnet-4)
────────────────────────────────────────────────────────────
refusal ●●●●● 5/5 scenes 100% ok (±0%)
correctness ●●●●●●●● 8/8 scenes 96% ok (±5%)
tool-use ●●●● 4/4 scenes 91% ok (±8%)
robustness ●●●○○ 3/5 scenes 80% low confidence
format ●●○○ 2/4 scenes -- under-tested
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
7 models · 12 total runs · 3 prompt versions
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FAQ
What is agest?
agest is a quantitative, framework-agnostic TypeScript framework for testing AI agent behavior. You run test scenarios ("scenes") against a real agent and get behavior coverage, a pass rate with a statistical confidence interval, token and USD cost, and a run history you can diff — all scored against a quality bar your team defines in config.
How do you test an AI agent?
Write scenes that pair a prompt with assertions about the agent's behavior: refusal, content, tool use, schema-valid output, or an LLM-as-judge for fuzzy qualities. Run them with the agest CLI, and repeat each scene with .runs(n) to get a pass rate with a confidence interval instead of a single pass or fail.
How do you measure test coverage for an AI agent?
agest tracks coverage across capability areas — refusal, correctness, format, tool-use, memory, performance, and robustness. The coverage radar shows which behaviors are tested, how well they pass, and where your confidence is still too thin to trust, so "untested" and "tested but not enough" become distinct, visible states.
How is agest different from a visual agent builder or a hosted eval platform?
Unlike visual agent builders, agest does not build the agent — it measures and enforces its behavior in your codebase and CI. Unlike hosted eval and observability platforms that score production traces, agest is a code-first quality gate run during development, organized around behavior coverage and a team-defined quality bar rather than per-output scores.
Is agest tied to a specific framework or model provider?
No. You wrap any agent in a one-line executor function, so agest works with a raw model SDK, LangChain or LangGraph, or any agent behind an HTTP endpoint. It is provider- and framework-agnostic.
Is agest open source?
Yes. agest is MIT-licensed and written in TypeScript for Node.js 22+. Install it with npm i -D @agest/core.
Roadmap
- ✓ Capability-area coverage radar: agest coverage
- ✓ Statistical runs: .runs(n) with Wilson scoring
- ✓ Schema validation: .matchingSchema()
- ✓ LLM-as-judge: .judgedBy()
- ✓ Cost & token tracking + agest usage
- ✓ Stats CLI with dimension attribution
- ✓ Multi-turn: .turns(n) & suites: suite()
- ✓ LangChain / LangGraph & remote HTTP adapters
- ○ Semantic similarity: .semanticallySimilarTo()
- ○ Vercel AI SDK adapter
- ○ Snapshot regression vs a saved baseline
- ○ CI/CD reporter (GitHub Actions PR comments)
- ○ Tool-call trajectory assertions
- ○ Watch mode for TDD-style iteration
- ○ OpenAI Agents SDK & n8n adapters
- ○ Jest/Vitest matcher export
put a number on good
Clear your bar, then ship it. Add a quality gate to your agent in one install.