HumanEval Benchmark
Evaluates OpenAI's o3-mini model on the HumanEval benchmark (164 code-generation problems). Generated code is executed inside the Boxer sandbox, which keeps untrusted LLM output isolated from the host. The script reports a pass@1 score.
Source: examples/humaneval/
Prerequisites
- uv installed
- Boxer server running locally
- OPENAI_API_KEY set in your environment (or via .env)
Setup
# Start boxer (from the repo root)
cd packages/core && go run . --config config.dev.json
# In a separate terminal, set up credentials and install dependencies
cd examples/humaneval
cp .env.example .env # then fill in your OPENAI_API_KEY
uv sync
Usage
Quick Smoke Test (3 problems)
OPENAI_API_KEY=sk-... uv run python evaluate.py --max-problems 3
Expected output:
Loading HumanEval dataset…
Evaluating 3 problem(s) with model=o3-mini, workers=8
[ 1/ 3] HumanEval/0 PASS 412ms
[ 2/ 3] HumanEval/1 PASS 389ms
[ 3/ 3] HumanEval/2 PASS 501ms
────────────────────────────────────────
pass@1: 3/3 (100.0%)
Full Evaluation
OPENAI_API_KEY=sk-... uv run python evaluate.py
Artifacts are written to examples/humaneval/results/ (wiped and recreated on each run):
results/
├── summary.json ← aggregate: pass@1, total, passed, run metadata
└── problems/
├── HumanEval_0/
│ ├── code.py ← full test harness (prompt + completion + tests + check)
│ ├── completion.txt ← raw LLM output
│ ├── stdout.txt ← captured stdout from sandbox
│ ├── stderr.txt ← captured stderr from sandbox
│ └── result.json ← {task_id, passed, exit_code, wall_ms, error?}
└── ...
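
The per-problem result.json files are enough to recompute the aggregate score without rerunning the evaluation. The following is a minimal sketch, assuming each result.json carries a boolean passed field as listed in the tree above. The helper name pass_at_1 is hypothetical and is not part of evaluate.py.

```python
import json
from pathlib import Path


def pass_at_1(results_dir: str) -> tuple[int, int, float]:
    """Return (passed, total, fraction) from <results_dir>/problems/*/result.json."""
    # Assumption: one result.json per problem directory, with a boolean "passed" key.
    files = sorted(Path(results_dir).glob("problems/*/result.json"))
    passed = sum(1 for f in files if json.loads(f.read_text())["passed"])
    total = len(files)
    # Guard against an empty results directory to avoid dividing by zero.
    return passed, total, (passed / total if total else 0.0)
```

Calling pass_at_1("results") from examples/humaneval/ after a run should agree with the figure in summary.json, provided the layout matches the assumption above.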
Options
| Flag | Default | Description |
|---|---|---|
| --boxer-url | http://localhost:8080 | Boxer server base URL |
| --model | o3-mini | OpenAI model ID |
| --max-problems | (all 164) | Limit number of problems |
| --workers | 8 | Concurrent async tasks |
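
For readers who want to see how these flags map to parsed values, here is a rough argparse sketch. The flag names and defaults are copied from the table. Everything else (the build_parser helper, the use of None to mean "all problems") is an assumption, and evaluate.py may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults mirror the options table; the structure is illustrative.
    parser = argparse.ArgumentParser(description="HumanEval evaluation (sketch)")
    parser.add_argument("--boxer-url", default="http://localhost:8080",
                        help="Boxer server base URL")
    parser.add_argument("--model", default="o3-mini", help="OpenAI model ID")
    # Assumption: None stands for "no limit", i.e. all 164 problems.
    parser.add_argument("--max-problems", type=int, default=None,
                        help="Limit number of problems")
    parser.add_argument("--workers", type=int, default=8,
                        help="Concurrent async tasks")
    return parser


# Equivalent to passing `--max-problems 3` on the command line.
args = build_parser().parse_args(["--max-problems", "3"])
print(args.boxer_url, args.model, args.max_problems, args.workers)
# prints: http://localhost:8080 o3-mini 3 8
```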
How It Works
- Loads the openai_humaneval dataset from HuggingFace
- For each problem, calls o3-mini to complete the function body
- Assembles the test harness: prompt + completion + tests + check(entry_point)
- Uploads the .py file to Boxer and executes it in a python:3.12-slim container
- A zero exit code means the tests passed (pass@1)
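
The harness assembly and the exit-code rule can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluate.py: the helper names assemble_harness and passes are hypothetical, the toy problem is made up rather than taken from the dataset, and the harness runs in a local child interpreter instead of being uploaded to Boxer, so there is no sandboxing here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def assemble_harness(prompt: str, completion: str, test: str, entry_point: str) -> str:
    """Concatenate prompt + completion + tests + check(entry_point)."""
    return f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"


def passes(source: str, timeout_s: float = 10.0) -> bool:
    """Run the harness in a child interpreter; exit code 0 counts as a pass."""
    # Illustration only: this executes on the host, NOT inside a sandbox.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "code.py"
        path.write_text(source)
        try:
            proc = subprocess.run([sys.executable, str(path)],
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


# Toy problem in the HumanEval shape (made up for this example).
PROMPT = 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n'
TEST = ("def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n")

good = assemble_harness(PROMPT, "    return a + b\n", TEST, "add")
bad = assemble_harness(PROMPT, "    return a - b\n", TEST, "add")
print(passes(good), passes(bad))  # prints: True False
```

A failing assertion inside check raises AssertionError, which makes the interpreter exit with a non-zero code. This is why a zero exit code is sufficient as the pass signal.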