CI Integration
Dynobox runs in CI like any other command-line test step. A successful run exits with 0; config, flag, discovery, load, or job failures exit with 1.
Use text output when humans will read the log:
dynobox run dynobox --quiet --harness claude-code
Use JSON output when a later CI step should consume the results:
dynobox run dynobox --reporter json --harness claude-code > dynobox-report.ndjson
--reporter json writes newline-delimited JSON to stdout. Each completed job produces one "type": "job" record, followed by one "type": "summary" record. Every record includes "schema": "dynobox.report.v1".
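The record stream described above can be checked with a small parser. This is a minimal sketch, not part of Dynobox: parseReport is a hypothetical helper, and it relies only on what this page states (one JSON object per line, a type of job or summary, and the schema value).

```javascript
// Hypothetical helper, not part of Dynobox.
const EXPECTED_SCHEMA = 'dynobox.report.v1';

function parseReport(text) {
  const jobs = [];
  let summary = null;
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines, e.g. the trailing newline
    const record = JSON.parse(line);
    if (record.schema !== EXPECTED_SCHEMA) {
      throw new Error(`Unexpected schema: ${record.schema}`);
    }
    if (record.type === 'job') jobs.push(record);
    else if (record.type === 'summary') summary = record;
  }
  // summary stays null if the stream ended before a summary record was written.
  return {jobs, summary};
}
```

Rejecting an unknown schema value early is a design choice in this sketch; a consumer could instead skip such records.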
Recommended Pattern
- Install Node.js 22 or newer.
- Install dynobox.
- Install the harness executable for the job.
- Run dynobox run once per harness, usually through a CI matrix.
- Upload the JSON report as a build artifact.
- Summarize the final JSON summary record in the job output.
For targeted CI jobs, combine the JSON reporter with scenario filters:
dynobox run dynobox --reporter json --scenario "release*" > dynobox-report.ndjson
Scenario filters match the compiled scenario name or id. Repeat the flag or use comma-separated values to select multiple patterns:
dynobox run dynobox --scenario "release*,publish package"
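To make the selection semantics concrete, here is an illustrative sketch of how repeated flags and comma-separated values could combine into one pattern list. It is not Dynobox's implementation: toPatterns, globToRegExp, and matchesScenario are hypothetical, the scenario object shape ({name, id}) is assumed, and the sketch assumes that * matches any run of characters and that matching is case-sensitive, neither of which this page specifies.

```javascript
// Illustrative only; Dynobox's real matcher may behave differently.
function toPatterns(flagValues) {
  // Each --scenario value may itself hold comma-separated patterns.
  return flagValues
    .flatMap((value) => value.split(','))
    .map((pattern) => pattern.trim())
    .filter(Boolean);
}

function globToRegExp(pattern) {
  // Escape regex metacharacters, then turn the escaped "*" into ".*"
  // (assumed wildcard semantics).
  const escaped = pattern
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\\\*/g, '.*');
  return new RegExp(`^${escaped}$`);
}

function matchesScenario(scenario, flagValues) {
  const patterns = toPatterns(flagValues).map(globToRegExp);
  // A scenario is selected if any pattern matches its name or its id.
  return patterns.some((re) => re.test(scenario.name) || re.test(scenario.id));
}
```

Under these assumptions, passing ['release*,publish package'] and passing ['release*', 'publish package'] select the same scenarios.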
GitHub Actions
A reference workflow lives at examples/.github/workflows/example-eval.yml. It runs a matrix over claude-code and codex, writes one NDJSON report per harness, uploads each report, and appends a compact summary to the GitHub Actions step summary.
Copy the workflow into your repository's .github/workflows/ directory and adjust:
- DYNOBOX_TARGET for the directory or file containing your dynos.
- Harness install commands for your pinned versions.
- Secrets for the selected harnesses.
The example assumes:
- ANTHROPIC_API_KEY is available for claude-code.
- OPENAI_API_KEY is available for codex.
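The compact step summary mentioned above could be produced along these lines. This is a sketch rather than the reference workflow's actual script: formatStepSummary is a hypothetical helper, and because this page does not document the shape of totals, the sketch prints it as raw JSON instead of guessing at its fields.

```javascript
// Hypothetical helper; builds Markdown text for one harness.
function formatStepSummary(harness, summary) {
  if (!summary) {
    // No summary record, e.g. the run failed before any job started.
    return `### Dynobox (${harness})\n\nNo summary record found.\n`;
  }
  return [
    `### Dynobox (${harness})`,
    '',
    `- Status: ${summary.status}`,
    // The shape of totals is not documented here, so print it verbatim.
    `- Totals: ${JSON.stringify(summary.totals)}`,
    '',
  ].join('\n');
}
```

In GitHub Actions, the returned text can be appended to the file whose path is given by the GITHUB_STEP_SUMMARY environment variable.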
Read JSON Reports
The JSON reporter is line-oriented. Read the file one line at a time and parse each line as a separate JSON object.
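import {readFileSync} from 'node:fs';

// Parse each non-empty line as its own JSON record.
const records = readFileSync('dynobox-report.ndjson', 'utf8')
  .split('\n')
  .filter((line) => line.trim() !== '')
  .map((line) => JSON.parse(line));

// The summary record can be absent if the run failed before any job started.
const summary = records.find((record) => record.type === 'summary');
console.log(summary ? summary.totals : 'No summary record found');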
Useful job fields include jobId, scenario, harness, status, passed, warnings, observations, and assertions.
Useful summary fields include status, totals, plan, failedJobs, and warningJobs.
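As an example of using the job fields listed above, the following sketch picks out failing jobs from already-parsed records. describeFailedJobs is a hypothetical helper, and it assumes that passed is a boolean and that jobId, harness, and status are printable strings, which this page does not spell out.

```javascript
// Hypothetical helper; takes the array of parsed report records.
function describeFailedJobs(records) {
  return records
    // Keep only job records that explicitly report passed === false (assumed boolean).
    .filter((record) => record.type === 'job' && record.passed === false)
    .map((record) => `${record.jobId} [${record.harness}]: ${record.status}`);
}
```

Each returned line can be written to the CI log so that failures are visible without opening the uploaded report.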
Permission warnings are advisory. They explain when a harness blocked a tool action, but they do not change job status or exit codes. Use --permission-mode dangerous only for trusted evals that intentionally need full local access.
Config and discovery failures can happen before any job runs. In those cases, Dynobox writes the config error to stderr and exits 1; there may be no JSON summary record to parse.
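A later CI step can account for this case by looking at the exit code and the report together. The following is a minimal sketch under that assumption: interpretRun and its result labels are hypothetical names, not Dynobox output.

```javascript
// Hypothetical helper; exitCode is the dynobox process exit code,
// reportText is whatever was captured from stdout (possibly empty).
function interpretRun(exitCode, reportText) {
  const records = reportText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
  const summary = records.find((record) => record.type === 'summary') ?? null;

  if (exitCode === 0) return {kind: 'success', summary};
  // Non-zero exit with no summary: the run likely failed before any job ran,
  // so the explanation is on stderr rather than in the report.
  if (summary === null) return {kind: 'failed-before-summary', summary};
  return {kind: 'failed-with-summary', summary};
}
```

Treating the two failure kinds separately lets the CI step point readers at stderr in one case and at the report in the other.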
Artifact Naming
When a CI matrix runs multiple harnesses, write one report per harness:
dynobox run dynobox --reporter json --harness "$HARNESS" > "dynobox-${HARNESS}.ndjson"
This keeps reports easy to compare and avoids interleaving records from different CI jobs.
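With one file per harness, a comparison step can recover the harness name from the file name. The sketch below assumes the dynobox-${HARNESS}.ndjson naming shown above; harnessFromFilename and compareReports are hypothetical helpers, and the input is a plain object mapping file names to their text contents.

```javascript
// Hypothetical helpers for comparing per-harness reports.
function harnessFromFilename(filename) {
  const match = /^dynobox-(.+)\.ndjson$/.exec(filename);
  return match ? match[1] : null;
}

function compareReports(reportsByFilename) {
  return Object.entries(reportsByFilename).map(([filename, text]) => {
    const summary = text
      .split('\n')
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line))
      .find((record) => record.type === 'summary');
    return {
      harness: harnessFromFilename(filename),
      // A report without a summary record is flagged rather than skipped.
      status: summary ? summary.status : 'no summary',
    };
  });
}
```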