agest docs

Scene syntax, the assertion catalog, visual + CLI reports, and the roadmap — the reference for @agest/core.

§01

Syntax

Plain TypeScript — no DSL, no canvas. A scene pairs a prompt with assertions about behavior. Chain any of 13 deterministic matchers; reach for the LLM judge only when it's genuinely fuzzy.

language-assistant.agest.ts
import { agent, scene, expect } from "@agest/core";

// `executor` is your (input) => Promise<Response> wrapper, defined elsewhere.
await agent(executor, () => {
  scene("What is the weather today?")
    .expect("response", (r) => {
      expect(r).toBe.refusal();
    });

  scene("Say 'good morning' in Japanese")
    .runs(5) // pass rate + 95% CI
    .expect("response", (r) => {
      expect(r).toBe.notRefusal();
      expect(r).toBe.containingText("ohayou");
    });

  scene("Explain quantum physics")
    .expect("response", (r) => {
      expect(r).toBe.judgedBy({
        criteria: "Stays in the language-learning context",
        failWhen: "Explains physics instead of redirecting",
      });
    });
});
refusal
  • .refusal() (declined)
  • .notRefusal() (complied)
content
  • .containingText()
  • .notContainingText()
  • .matchingPattern()
structure & schema
  • .equalTo() · .notEqualTo()
  • .containingItem()
  • .containingSubset()
  • .ofLength()
  • .matchingSchema() (zod)
semantic & custom
  • .judgedBy() (LLM)
  • .satisfying() (predicate)
  • .semanticallySimilarTo() (soon)
compose: .runs(n), .turns(n), .expectSchema(), suite()

§02

Capabilities

01

Behavior coverage

Tag scenes with capability areas — refusal, correctness, format, tool-use, memory, performance, robustness — and agest coverage renders a radar of which behaviors are tested, how well, and where confidence is still too thin to trust.

02

Statistical confidence

.runs(n) repeats a scene and reports a pass rate with a Wilson 95% confidence interval. Flakiness stops being a gut feeling — "94% of the time, ±4%" is a number you can gate on.
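
To make the interval concrete, here is a minimal sketch of the standard Wilson score interval for a pass rate. It illustrates the statistic named above and is not agest's internal code; the function name `wilsonInterval` and its return shape are assumptions made for this example.

```typescript
// Wilson score interval for a binomial pass rate.
// z = 1.96 corresponds to a two-sided 95% confidence level.
// `wilsonInterval` is a hypothetical helper, not part of @agest/core.
function wilsonInterval(
  passes: number,
  runs: number,
  z = 1.96,
): { low: number; high: number } {
  if (runs <= 0) throw new RangeError("runs must be positive");
  if (passes < 0 || passes > runs) {
    throw new RangeError("passes must be between 0 and runs");
  }

  const p = passes / runs; // observed pass rate
  const z2 = z * z;
  const denom = 1 + z2 / runs;

  // The interval is centered on a value pulled toward 0.5, not on p itself.
  const center = (p + z2 / (2 * runs)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));

  // Clamp to [0, 1] to absorb floating-point drift at the extremes.
  return {
    low: Math.max(0, center - half),
    high: Math.min(1, center + half),
  };
}

// Five passes out of five runs still leaves a wide interval.
const fiveOfFive = wilsonInterval(5, 5);
console.log(fiveOfFive.low.toFixed(3), fiveOfFive.high.toFixed(3));
```

With only five runs, a perfect record gives a lower bound of roughly 0.57, which is why a small n reads as thin confidence even when every run passes. Raising n narrows the interval.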

03

Opinionated for your team

An extensible agest.config.ts sets which capability areas matter, per-area confidence targets, your judge model, pricing, and thresholds. Encode your team's quality standard once; enforce it in CI.
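
As a rough illustration of what "encode once, enforce in CI" could look like, the sketch below checks per-area results against per-area targets. Every name here (`AreaTarget`, `AreaResult`, `judgeArea`, `gate`, and the field names) is hypothetical and chosen for this example; it is not the schema of agest.config.ts.

```typescript
// Hypothetical shapes for illustration only; not agest's config schema.
type Area =
  | "refusal"
  | "correctness"
  | "format"
  | "tool-use"
  | "memory"
  | "performance"
  | "robustness";

interface AreaTarget {
  minPassRate: number; // e.g. 0.95
  minScenes: number; // fewer scenes than this counts as under-tested
}

interface AreaResult {
  scenes: number;
  passRate: number;
}

type Verdict = "ok" | "below target" | "under-tested";

// Separate "not enough evidence" from "enough evidence, but failing".
function judgeArea(target: AreaTarget, result?: AreaResult): Verdict {
  if (!result || result.scenes < target.minScenes) return "under-tested";
  return result.passRate >= target.minPassRate ? "ok" : "below target";
}

// The gate passes only when every area with a target is "ok".
function gate(
  targets: Partial<Record<Area, AreaTarget>>,
  results: Partial<Record<Area, AreaResult>>,
): { passed: boolean; verdicts: Partial<Record<Area, Verdict>> } {
  const verdicts: Partial<Record<Area, Verdict>> = {};
  let passed = true;
  for (const area of Object.keys(targets) as Area[]) {
    const target = targets[area];
    if (!target) continue;
    const verdict = judgeArea(target, results[area]);
    verdicts[area] = verdict;
    if (verdict !== "ok") passed = false;
  }
  return { passed, verdicts };
}
```

In a CI step, a script built on this idea would exit with a non-zero code when `passed` is false, so a thin or failing area blocks the merge.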

04

Framework-agnostic

Wrap any agent — a raw model SDK, LangChain / LangGraph, or a remote HTTP endpoint — in a one-line (input) => Promise<Response> executor. No canvas, no lock-in.
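
A sketch of what such an executor might look like for an agent behind an HTTP endpoint. Several things here are assumptions: that the input is a string, that the response can be reduced to `{ text }`, and that the endpoint accepts `{ "input": ... }` and returns `{ "output": ... }`. The names `AgentResponse`, `FetchLike`, and `httpExecutor` are hypothetical and not part of @agest/core.

```typescript
// Hypothetical response shape; agest's real Response type may differ.
interface AgentResponse {
  text: string;
}

type Executor = (input: string) => Promise<AgentResponse>;

// The small slice of the fetch API used here, so a stub can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds an executor that forwards each prompt to a remote agent.
// The request and response payload shapes are assumptions for this sketch.
function httpExecutor(url: string, fetchImpl: FetchLike): Executor {
  return async (input) => {
    const res = await fetchImpl(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ input }),
    });
    if (!res.ok) {
      throw new Error(`agent endpoint returned HTTP ${res.status}`);
    }

    const data: unknown = await res.json();
    if (
      typeof data !== "object" ||
      data === null ||
      typeof (data as { output?: unknown }).output !== "string"
    ) {
      throw new Error("unexpected payload from agent endpoint");
    }
    return { text: (data as { output: string }).output };
  };
}
```

On Node.js 22+, the built-in global `fetch` can be passed as `fetchImpl`; taking it as a parameter keeps the executor easy to stub in unit tests.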

§03

Reports

The coverage radar is drawn with the exact braille pixel renderer the CLI ships with. Compare models, track prompt evolution, and find the accuracy-vs-cost sweet spot — or keep the whole thing terminal-native.

$ agest coverage --radar
[Figure: coverage radar. Grid of rings and spokes, one spoke per capability area; spoke length = pass rate.]
[Charts: success rate by model (prompt v1 vs prompt v2); accuracy vs speed (avg tokens per case); prompt evolution (accuracy over 4 prompt iterations).]

Terminal-native: agest stats · agest coverage

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  AGEST STATS  ·  12 reports found  ·  loan-eligibility
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Success Rate
  ────────────────────────────────────────────────────────────
  anthropic/claude-sonnet-4   ████████████████████   100%
  anthropic/claude-haiku-4-5  ███████████████████░    93%
  google/gemini-2.0-flash     ███████████████████░    93%
  openai/gpt-4.1-mini         ██████████████████░░    87%
  openai/gpt-4.1-nano         ████████████████░░░░    80%
  meta-llama/llama-3.1-8b     ███████████████░░░░░    73%
  mistralai/ministral-8b      ████████████░░░░░░░░    60%

  Capability Coverage  (claude-sonnet-4)
  ────────────────────────────────────────────────────────────
  refusal       ●●●●●     5/5 scenes   100%  ok (±0%)
  correctness   ●●●●●●●●  8/8 scenes    96%  ok (±5%)
  tool-use      ●●●●      4/4 scenes    91%  ok (±8%)
  robustness    ●●●○○     3/5 scenes    80%  low confidence
  format        ●●○○      2/4 scenes    --   under-tested

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  7 models · 12 total runs · 3 prompt versions
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
§04

FAQ

What is agest?

agest is a quantitative, framework-agnostic TypeScript framework for testing AI agent behavior. You run test scenarios ("scenes") against a real agent and get behavior coverage, a pass rate with a statistical confidence interval, token and USD cost, and a run history you can diff — all scored against a quality bar your team defines in config.

How do you test an AI agent?

Write scenes that pair a prompt with assertions about the agent's behavior — refusal, content, tool use, schema-valid output, or an LLM-as-judge for fuzzy qualities. Run them with the agest CLI, and repeat each scene with .runs(n) to get a pass rate with a confidence interval instead of a single pass or fail.

How do you measure test coverage for an AI agent?

agest tracks coverage across capability areas — refusal, correctness, format, tool-use, memory, performance, and robustness. The coverage radar shows which behaviors are tested, how well they pass, and where your confidence is still too thin to trust, so "untested" and "tested but not enough" become distinct, visible states.

How is agest different from a visual agent builder or a hosted eval platform?

Unlike visual agent builders, agest does not build the agent — it measures and enforces its behavior in your codebase and CI. Unlike hosted eval and observability platforms that score production traces, agest is a code-first quality gate run during development, organized around behavior coverage and a team-defined quality bar rather than per-output scores.

Is agest tied to a specific framework or model provider?

No. You wrap any agent in a one-line executor function, so agest works with a raw model SDK, LangChain or LangGraph, or any agent behind an HTTP endpoint. It is provider- and framework-agnostic.

Is agest open source?

Yes. agest is MIT-licensed and written in TypeScript for Node.js 22+. Install it with npm i -D @agest/core.

§05

Roadmap

Shipped
  • Capability-area coverage radar: agest coverage
  • Statistical runs: .runs(n) with Wilson scoring
  • Schema validation: .matchingSchema()
  • LLM-as-judge: .judgedBy()
  • Cost & token tracking, plus agest usage
  • Stats CLI with dimension attribution
  • Multi-turn .turns(n) and suites via suite()
  • LangChain / LangGraph & remote HTTP adapters
Up next
  • Semantic similarity: .semanticallySimilarTo()
  • Vercel AI SDK adapter
  • Snapshot regression vs a saved baseline
Planned
  • CI/CD reporter (GitHub Actions PR comments)
  • Tool-call trajectory assertions
  • Watch mode for TDD-style iteration
  • OpenAI Agents SDK & n8n adapters
  • Jest/Vitest matcher export

put a number on good

Clear your bar, then ship it. Add a quality gate to your agent in one install.

$ npm i -D @agest/core