sam-evals is a harness for benchmarking coding agents the way you’d benchmark a compiler:
deterministic, sandboxed, and resumable. It runs an agent (OpenCode driving a local Ollama model, by
default) against small TypeScript task suites inside locked-down Docker containers, scores each run on
three axes (pass/fail, stage, and style), and writes everything to SQLite so experiments are comparable
across nights.
It is not a general LLM-eval platform. It’s narrow on purpose: local-model, coding-agent, style-aware evaluation — the niche the big platforms are weakest at.
Sandboxed by construction
Agent containers reach exactly one thing — a local model proxy on an internal Docker network.
No internet, no $HOME, no secrets. Checks run with --network none.
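As a rough illustration of the isolation described above, the sketch below builds `docker run` argument lists for the two container kinds. It is not taken from sam-evals: the `SandboxSpec` type, the function names, the image, the network name, and the `/work` mount are all hypothetical placeholders. It assumes the internal network was created beforehand with `docker network create --internal <name>`.

```typescript
// Hypothetical shape; sam-evals' real configuration may look different.
interface SandboxSpec {
  image: string;   // placeholder image name
  workdir: string; // host directory holding the task checkout
}

// Agent container: attached only to an internal Docker network, so it can
// talk to the model proxy on that network but has no route to the internet.
// Only the task directory is mounted (no $HOME), and no --env or --env-file
// flags are passed, so no host secrets are forwarded.
function agentRunArgs(spec: SandboxSpec, internalNetwork: string): string[] {
  return [
    "run", "--rm",
    "--network", internalNetwork,
    "--volume", `${spec.workdir}:/work`,
    "--workdir", "/work",
    spec.image,
  ];
}

// Check container: no network at all.
function checkRunArgs(spec: SandboxSpec): string[] {
  return [
    "run", "--rm",
    "--network", "none",
    "--volume", `${spec.workdir}:/work`,
    "--workdir", "/work",
    spec.image,
  ];
}
```

The point of the split is that the agent needs exactly one network peer while the checks need none, so each gets the narrowest setting that still works.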
One cell, one number
A cell = one task × one style profile × one model × one temperature. Every cell yields a classified, stage-scored row you can rank and diff.
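The cell definition above is a plain cross product, which a minimal sketch can make concrete. The `Cell` type, `expandMatrix`, and `cellKey` are illustrative names, not sam-evals' actual API, and the key format is an assumption.

```typescript
// One cell = one task x one style profile x one model x one temperature.
interface Cell {
  task: string;
  profile: string;
  model: string;
  temperature: number;
}

// Enumerate every combination in a fixed, deterministic order.
function expandMatrix(
  tasks: string[],
  profiles: string[],
  models: string[],
  temperatures: number[],
): Cell[] {
  const cells: Cell[] = [];
  for (const task of tasks) {
    for (const profile of profiles) {
      for (const model of models) {
        for (const temperature of temperatures) {
          cells.push({ task, profile, model, temperature });
        }
      }
    }
  }
  return cells;
}

// A stable identity for a cell, usable as a lookup key for recorded rows.
// The "|" separator is an arbitrary choice for this sketch.
function cellKey(cell: Cell): string {
  return [cell.task, cell.profile, cell.model, String(cell.temperature)].join("|");
}
```

For example, 2 tasks, 1 profile, 2 models, and 2 temperatures expand to 8 cells, each with a distinct key.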
Style profiles as variables
Decompose how you constrain the agent — lint rails, Tiger Style, strict types — into profiles and measure which one gets a model furthest per wall-minute.
Resumable & honest
Matrices run sequentially under a wall-clock budget and skip recorded cells, so a crashed night resumes with the same run id. Pass rate alone lies — the report surfaces timeouts, no-diffs, and security flags too.
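The resume behaviour can be sketched as a sequential loop that skips already-recorded cells and stops once the budget is spent. This is an assumption-laden illustration, not the harness's code: `runMatrix`, `Outcome`, and the in-memory `Map` (standing in for the SQLite table) are hypothetical, and a real runner would be asynchronous.

```typescript
// Hypothetical outcome labels; the real classification may be richer.
type Outcome = "pass" | "fail" | "timeout" | "no_diff";

// Run cells one at a time under a wall-clock budget, skipping any cell
// whose key is already recorded. Calling this again with the same
// `recorded` store resumes where the previous call stopped.
function runMatrix<C>(
  cells: C[],
  keyOf: (cell: C) => string,
  recorded: Map<string, Outcome>, // stand-in for persisted rows
  runCell: (cell: C) => Outcome,
  budgetMs: number,
  now: () => number = Date.now,
): { ran: number; skipped: number; unvisited: number } {
  const deadline = now() + budgetMs;
  let ran = 0;
  let skipped = 0;
  for (const cell of cells) {
    const key = keyOf(cell);
    if (recorded.has(key)) {
      skipped++; // already recorded in an earlier session
      continue;
    }
    if (now() >= deadline) break; // budget spent; leave the rest for later
    recorded.set(key, runCell(cell));
    ran++;
  }
  return { ran, skipped, unvisited: cells.length - ran - skipped };
}

// Tally every outcome, not just passes, so timeouts and no-diffs stay visible.
function tally(recorded: Map<string, Outcome>): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { pass: 0, fail: 0, timeout: 0, no_diff: 0 };
  for (const outcome of recorded.values()) counts[outcome]++;
  return counts;
}
```

Recording the outcome immediately after each cell is what makes a crash cheap: at most the in-flight cell is lost.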