
sam-evals

Overnight, reproducible benchmarks for local coding agents — run real models against real tasks inside locked-down containers, score every run, and compare style profiles night over night.

sam-evals is a harness for benchmarking coding agents the way you’d benchmark a compiler: deterministic, sandboxed, and resumable. It runs an agent (by default, OpenCode driving a local Ollama model) against small TypeScript task suites inside locked-down Docker containers, scores each run on three axes (pass/fail, stage, and style), and writes everything to SQLite so experiments are comparable across nights.

It is not a general LLM-eval platform. It’s narrow on purpose: local-model, coding-agent, style-aware evaluation — the niche the big platforms are weakest at.

Sandboxed by construction

Agent containers reach exactly one thing: a local model proxy on an internal Docker network. No internet, no $HOME, no secrets. Checks run with --network none.
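As a rough illustration of that network isolation, the sketch below builds Docker CLI argument lists for the two kinds of container. The network and image names (`evals-internal`, `agent-image`, `checks-image`) and both helper functions are hypothetical, not sam-evals' actual code. Only the `--internal` and `--network` flags are real Docker options.

```typescript
// Hypothetical helpers: names and structure are assumptions for illustration only.

// Arguments for creating a Docker network with no route to the outside world.
function internalNetworkArgs(name: string): string[] {
  return ["network", "create", "--internal", name];
}

// Arguments for running a container. Passing null for the network disables
// networking entirely (--network none), as described for check containers.
function dockerRunArgs(image: string, network: string | null): string[] {
  return ["run", "--rm", "--network", network === null ? "none" : network, image];
}

// Agent container: attached only to the internal network where the proxy lives.
const agentArgs = dockerRunArgs("agent-image", "evals-internal");

// Check container: no network at all.
const checkArgs = dockerRunArgs("checks-image", null);
```

Note that no volume or environment flags appear here: in this sketch, leaving them out is what keeps $HOME and secrets out of the container.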

One cell, one number

A cell = one task × one style profile × one model × one temperature. Every cell yields a classified, stage-scored row you can rank and diff.
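To make the cell definition concrete, here is a minimal sketch of expanding a matrix into cells as the Cartesian product of the four axes. The `Cell` and `MatrixSpec` shapes, the `cellKey` format, and the example task, profile, and model names are all assumptions for illustration, and the real schema may differ.

```typescript
// Hypothetical shapes: not the actual sam-evals schema.
interface Cell {
  task: string;
  profile: string;
  model: string;
  temperature: number;
}

interface MatrixSpec {
  tasks: string[];
  profiles: string[];
  models: string[];
  temperatures: number[];
}

// Expand a spec into one cell per (task, profile, model, temperature) combination.
function expandMatrix(spec: MatrixSpec): Cell[] {
  const cells: Cell[] = [];
  for (const task of spec.tasks) {
    for (const profile of spec.profiles) {
      for (const model of spec.models) {
        for (const temperature of spec.temperatures) {
          cells.push({ task, profile, model, temperature });
        }
      }
    }
  }
  return cells;
}

// A stable string key so a recorded cell can be recognized on a later run.
function cellKey(c: Cell): string {
  return [c.task, c.profile, c.model, c.temperature.toFixed(2)].join("|");
}

const exampleCells = expandMatrix({
  tasks: ["task-a", "task-b"],
  profiles: ["baseline", "strict-types"],
  models: ["model-x"],
  temperatures: [0.2, 0.8],
});
// 2 tasks x 2 profiles x 1 model x 2 temperatures = 8 cells
```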

Style profiles as variables

Decompose how you constrain the agent — lint rails, Tiger Style, strict types — into profiles and measure which one gets a model furthest per wall-minute.
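One way to read "furthest per wall-minute" is total stage score divided by total wall-clock minutes, aggregated per profile. The sketch below ranks profiles on that ratio. The `RunRow` fields and the metric itself are assumptions, and sam-evals may define the comparison differently.

```typescript
// Hypothetical row shape: field names are assumptions for illustration.
interface RunRow {
  profile: string;
  stageScore: number;   // how far the run got, higher is better
  wallSeconds: number;  // wall-clock time the run consumed
}

interface ProfileRank {
  profile: string;
  scorePerMinute: number;
}

// Sum score and time per profile, then rank by score per wall-minute.
function rankProfiles(rows: RunRow[]): ProfileRank[] {
  const totals = new Map<string, { score: number; seconds: number }>();
  for (const row of rows) {
    const t = totals.get(row.profile) || { score: 0, seconds: 0 };
    t.score += row.stageScore;
    t.seconds += row.wallSeconds;
    totals.set(row.profile, t);
  }
  const ranks: ProfileRank[] = [];
  totals.forEach((t, profile) => {
    // Guard against division by zero for profiles with no recorded time.
    const scorePerMinute = t.seconds > 0 ? t.score / (t.seconds / 60) : 0;
    ranks.push({ profile, scorePerMinute });
  });
  return ranks.sort((a, b) => b.scorePerMinute - a.scorePerMinute);
}
```

Aggregating totals before dividing, rather than averaging per-run ratios, keeps one very short run from dominating a profile's rank.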

Resumable & honest

Matrices run sequentially under a wall-clock budget and skip recorded cells, so a crashed night resumes with the same run id. Pass rate alone lies — the report surfaces timeouts, no-diffs, and security flags too.
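The resume behavior can be sketched as a sequential loop that skips recorded cells and stops starting new ones once the budget is spent. Everything named here (`runMatrix`, the `recorded` set, the injected clock) is hypothetical. The real harness records to SQLite and runs containers, whereas this sketch uses an in-memory set and a synchronous callback to stay self-contained.

```typescript
interface MatrixResult {
  ran: string[];        // cells executed in this session
  skipped: string[];    // cells already recorded by an earlier session
  remaining: string[];  // cells left for the next session (budget exhausted)
}

// Run cells one at a time under a wall-clock budget, skipping recorded ones.
function runMatrix(
  keys: string[],
  recorded: Set<string>,
  budgetMs: number,
  runCell: (key: string) => void,
  now: () => number = Date.now, // injectable clock, handy for testing
): MatrixResult {
  const deadline = now() + budgetMs;
  const result: MatrixResult = { ran: [], skipped: [], remaining: [] };
  for (const key of keys) {
    if (recorded.has(key)) {
      result.skipped.push(key);
      continue;
    }
    if (now() >= deadline) {
      result.remaining.push(key);
      continue;
    }
    runCell(key);
    recorded.add(key); // record only after the cell finishes
    result.ran.push(key);
  }
  return result;
}
```

Because a cell is recorded only after it completes, a crash mid-cell leaves it unrecorded, and calling `runMatrix` again with the same keys and the same `recorded` set picks up at that cell.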