sam-evals

Overnight, reproducible benchmarks for local coding agents — run real models against real tasks inside locked-down containers, score every run, and compare style profiles night over night.


What it is

sam-evals is a harness for benchmarking coding agents the way you’d benchmark a compiler: deterministic, sandboxed, and resumable. It runs an agent (by default, OpenCode driving a local Ollama model) against small TypeScript task suites inside locked-down Docker containers, scores each run on three axes (pass/fail, stage, and style), and writes everything to SQLite so experiments are comparable across nights.

It is not a general LLM-eval platform. It’s narrow on purpose: local-model, coding-agent, style-aware evaluation — the niche the big platforms are weakest at.

Sandboxed by construction

Agent containers reach exactly one thing: a local model proxy on an internal Docker network. No internet, no $HOME, no secrets. Checks run with --network none.
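As a rough illustration of that isolation, the sketch below builds Docker argument lists for the two container kinds. The function names, image names, and network name are hypothetical and not taken from sam-evals; only the Docker flags themselves (`--network none`, `--network <name>`, `network create --internal`) are standard Docker CLI.

```typescript
// Hypothetical helpers: the names and shapes here are assumptions for illustration.

// An internal Docker network has no route to the outside world.
function internalNetworkArgs(name: string): string[] {
  return ["network", "create", "--internal", name];
}

// Agent containers join only the internal network, where the model proxy lives.
// No host volumes are mounted, so $HOME and secrets are never visible.
function agentArgs(image: string, network: string, command: string[]): string[] {
  return ["run", "--rm", "--network", network, image, ...command];
}

// Check containers get no network at all.
function checkArgs(image: string, command: string[]): string[] {
  return ["run", "--rm", "--network", "none", image, ...command];
}
```

Each list would be passed to the `docker` binary by whatever process spawner the harness uses. Keeping the argument builders pure makes the isolation policy easy to unit-test without starting a container.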

One cell, one number

A cell = one task × one style profile × one model × one temperature. Every cell yields a classified, stage-scored row you can rank and diff.
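To make the definition concrete, here is a minimal sketch of a cell as a data type, plus the cross product that expands a matrix into cells. The `Cell` field names, `expandMatrix`, and `cellKey` are assumptions for illustration, not the harness's actual schema.

```typescript
// Hypothetical shape of a cell; field names are assumptions.
interface Cell {
  task: string;
  profile: string;
  model: string;
  temperature: number;
}

// Expand the four axes into every combination (a plain cross product).
function expandMatrix(
  tasks: string[],
  profiles: string[],
  models: string[],
  temperatures: number[],
): Cell[] {
  const cells: Cell[] = [];
  for (const task of tasks) {
    for (const profile of profiles) {
      for (const model of models) {
        for (const temperature of temperatures) {
          cells.push({ task, profile, model, temperature });
        }
      }
    }
  }
  return cells;
}

// A stable string key, usable for deduplication or as a lookup key.
function cellKey(cell: Cell): string {
  return [cell.task, cell.profile, cell.model, String(cell.temperature)].join("|");
}
```

For example, two tasks, three profiles, one model, and two temperatures expand to 2 × 3 × 1 × 2 = 12 cells.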

Style profiles as variables

Decompose how you constrain the agent — lint rails, Tiger Style, strict types — into profiles and measure which one gets a model furthest per wall-minute.
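One way to read "furthest per wall-minute" is stages reached divided by wall-clock minutes, aggregated per profile. The sketch below ranks profiles on that ratio. The metric definition, the `ProfileRun` shape, and `rankProfiles` are all assumptions for illustration; the harness's real scoring may differ.

```typescript
// Hypothetical per-run record; field names are assumptions.
interface ProfileRun {
  profile: string;
  stagesReached: number;
  wallMinutes: number;
}

interface ProfileScore {
  profile: string;
  stagesPerMinute: number;
}

// Sum stages and minutes per profile, then rank by the ratio, best first.
function rankProfiles(runs: ProfileRun[]): ProfileScore[] {
  const totals = new Map<string, { stages: number; minutes: number }>();
  for (const run of runs) {
    const t = totals.get(run.profile) ?? { stages: 0, minutes: 0 };
    t.stages += run.stagesReached;
    t.minutes += run.wallMinutes;
    totals.set(run.profile, t);
  }
  return Array.from(totals.entries())
    .map(([profile, t]) => ({
      profile,
      // Guard against division by zero for runs with no recorded time.
      stagesPerMinute: t.minutes > 0 ? t.stages / t.minutes : 0,
    }))
    .sort((a, b) => b.stagesPerMinute - a.stagesPerMinute);
}
```

Summing before dividing weights long runs more heavily than averaging per-run ratios would; either choice is defensible, and this sketch picks the simpler one.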

Resumable & honest

Matrices run sequentially under a wall-clock budget and skip cells that are already recorded, so a crashed night resumes under the same run id. Pass rate alone lies, so the report also surfaces timeouts, no-diffs, and security flags.
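The resume behavior can be sketched as a loop that skips recorded cells and stops once the budget is spent. This is a simplified, synchronous illustration: `runMatrix`, `RunOutcome`, the status labels, and the injected clock are hypothetical, and in the real harness the recorded set would come from SQLite rather than an in-memory `Set`.

```typescript
// Hypothetical outcome record; the status labels are assumptions.
interface RunOutcome {
  key: string;
  status: "pass" | "fail" | "timeout" | "no-diff";
}

// Run cells one at a time, skipping recorded ones, until the budget runs out.
function runMatrix(
  keys: string[],
  recorded: Set<string>,
  budgetMs: number,
  runCell: (key: string) => RunOutcome,
  now: () => number = Date.now,
): RunOutcome[] {
  const deadline = now() + budgetMs;
  const outcomes: RunOutcome[] = [];
  for (const key of keys) {
    if (recorded.has(key)) continue; // already scored on a previous attempt
    if (now() >= deadline) break; // budget spent; remaining cells wait for a resume
    const outcome = runCell(key);
    recorded.add(key); // record immediately so a crash loses at most one cell
    outcomes.push(outcome);
  }
  return outcomes;
}
```

Calling `runMatrix` again with the same `recorded` set picks up where the previous call stopped, which is the property that lets a crashed night continue under one run id.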

Start here

Introduction: What it is, the problem it solves, and how it differs from promptfoo / Inspect.

How it works: The full cell lifecycle, end to end, from scratch dir to scored SQLite row.

Quickstart: The planned bunx flow: setup → doctor → run → report.

Concepts: The core model of cells, suites, profiles, scoring, and the sandbox.

CLI reference: Every command: setup, doctor, run, matrix, report, compare.

Why it exists: The battery research that produced this harness (the credibility story).