Behavioral test runner · MIT

agest

STOP SHIPPING
AGENTS ON VIBES.

Quantitative testing for AI agents. Define what good means — which behaviors, how much coverage, what statistical confidence, what cost — in version-controlled config, then measure every run against that bar.

// coverage for behavior, not lines.

$ npm i -D @agest/core
refusal · correctness · format · tool-use · memory · performance · robustness

> TRUSTED BY

Teams shipping agents they can actually measure.

> THE LOOP

Building agents costs money — agest faces it. Spend the tokens once, measure everything, refine against your bar, repeat.

FIG.01 · THE TUNING LOOP · cost in → signal out
01
BUILD

Define the agent — and the bar it has to clear.

agest.config.ts
02
RUN

Run the suite for real. The tokens get spent either way.

agest.runs(n)
03
MEASURE

That one paid run hands back the whole readout — not a pass/fail.

coverage · ci · cost
04
REFINE

Change one dimension; the numbers tell you what moved.

model · prompt · tools

↻ refine & re-run — scored against your bar
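As a rough illustration of "scored against your bar", the sketch below compares one run's readout with a set of thresholds and lists whichever dimensions fall short. It is not agest's config format or API: the `Readout` and `Bar` shapes, the `failedChecks` helper, and every threshold value are hypothetical, chosen only to make the idea concrete. The readout numbers mirror the sample figures shown on this page.

```typescript
// Hypothetical shapes for illustration only; not the agest API.
interface Readout {
  coverage: number;   // fraction of target behaviors exercised, 0..1
  accuracy: number;   // fraction of scored checks that passed, 0..1
  costUsd: number;    // spend attributed to the run, in USD
  latencySec: number; // wall-clock time, in seconds
}

interface Bar {
  minCoverage: number;
  minAccuracy: number;
  maxCostUsd: number;
  maxLatencySec: number;
}

// Returns the names of the dimensions that miss the bar (empty means the bar is cleared).
function failedChecks(readout: Readout, bar: Bar): string[] {
  const failures: string[] = [];
  if (readout.coverage < bar.minCoverage) failures.push("coverage");
  if (readout.accuracy < bar.minAccuracy) failures.push("accuracy");
  if (readout.costUsd > bar.maxCostUsd) failures.push("cost");
  if (readout.latencySec > bar.maxLatencySec) failures.push("speed");
  return failures;
}

// Sample readout taken from the figures on this page; the bar values are assumed.
const sampleReadout: Readout = { coverage: 0.82, accuracy: 0.94, costUsd: 0.002, latencySec: 3.4 };
const sampleBar: Bar = { minCoverage: 0.8, minAccuracy: 0.9, maxCostUsd: 0.005, maxLatencySec: 5 };

const failures = failedChecks(sampleReadout, sampleBar);
console.log(failures.length === 0 ? "bar cleared" : `below bar: ${failures.join(", ")}`);
```

A gate of this kind would typically map a non-empty failure list to a non-zero exit code so a CI job can block the merge.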

what one paid run returns · BUILT-IN · $0 EXTRA
COV 82%
ACC 94%
COST $.002
SPEED 3.4s
same run also writes
scored report · coverage radar · ± Wilson 95% CI · $ cost waterfall · Δ run-history diff · CI exit gate
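For readers unfamiliar with the Wilson interval named above, here is a minimal sketch of the standard Wilson score interval for a pass rate. It shows the textbook formula only and makes no claim about how agest computes or reports it; the `wilsonInterval` name and its signature are hypothetical, and z = 1.96 is the usual normal quantile for 95% confidence.

```typescript
// Wilson score interval for a binomial proportion.
// Hypothetical helper for illustration; not part of any published API.
function wilsonInterval(
  successes: number,
  trials: number,
  z: number = 1.96, // normal quantile for a 95% two-sided interval
): { lower: number; upper: number } {
  if (
    !Number.isInteger(successes) ||
    !Number.isInteger(trials) ||
    trials <= 0 ||
    successes < 0 ||
    successes > trials
  ) {
    throw new RangeError("need integers with 0 <= successes <= trials and trials > 0");
  }
  const p = successes / trials; // observed pass rate
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const halfWidth =
    (z / denom) * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials));
  // Clamp to [0, 1] to absorb floating-point drift at the extremes.
  return {
    lower: Math.max(0, center - halfWidth),
    upper: Math.min(1, center + halfWidth),
  };
}

// Example: 47 passes out of 50 runs (a 94% observed pass rate).
const interval = wilsonInterval(47, 50);
console.log(`95% CI: ${interval.lower.toFixed(3)} to ${interval.upper.toFixed(3)}`);
```

With 47 passes in 50 runs the interval spans roughly 0.84 to 0.98, which is why a bar stated as a lower confidence bound is stricter than the same bar stated as a raw pass rate.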

Syntax, assertions, reports & roadmap → read the docs

put a number on good

Clear your bar, then ship it. Add a quality gate to your agent in one install.

$ npm i -D @agest/core