Eval with decodingtrust-agent
eval/evaluation.py is the canonical entrypoint: it consumes a JSONL list of tasks, brings up the per-task Docker sandboxes, runs your chosen agent framework against each task, calls the per-task judge.py, and writes a standardized trajectory plus summary statistics. Use it for full benchmark runs.
What you'll need
A working install (see Install from Source), Docker up, and a provider API key in .env. The first run of a domain pulls its sandbox images, which can take a few minutes.
Step 1: Build your task file
Tasks are described in a JSONL file. The runner uses {domain, type, task_id} for benign tasks and adds threat_model + risk_category for malicious tasks. The path under dataset/ is derived from those fields.
# tasks.jsonl
{"domain": "crm", "type": "benign", "task_id": 1}
{"domain": "crm", "type": "benign", "task_id": 2}
{"domain": "crm", "type": "malicious", "threat_model": "indirect", "risk_category": "data-exfiltration", "task_id": 3}
{"domain": "workflow", "type": "malicious", "threat_model": "direct", "risk_category": "manipulative-use", "task_id": 12}
Step 2: Run the evaluation
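Before launching a run, it can help to sanity-check the task file from Step 1. The sketch below checks only the field sets named above ({domain, type, task_id} for benign tasks, plus threat_model and risk_category for malicious ones). The function names and the validation rules are illustrative assumptions, not the runner's own checks, and skipping lines that start with # is a convenience of this sketch.

```python
import json

# Field sets taken from the task-file description; the checks themselves
# are an assumption and may be stricter or looser than the real runner.
BASE_FIELDS = {"domain", "type", "task_id"}
MALICIOUS_FIELDS = BASE_FIELDS | {"threat_model", "risk_category"}


def validate_task_line(line: str) -> dict:
    """Parse one JSONL line and check that it carries the expected fields."""
    task = json.loads(line)
    if not isinstance(task, dict):
        raise ValueError("each line must be a JSON object")
    kind = task.get("type")
    if kind == "benign":
        required = BASE_FIELDS
    elif kind == "malicious":
        required = MALICIOUS_FIELDS
    else:
        raise ValueError(f"unknown task type: {kind!r}")
    missing = required - set(task)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return task


def validate_task_file(path: str) -> list:
    """Validate every non-empty line of a task file, skipping '#' lines."""
    tasks = []
    with open(path, encoding="utf-8") as fh:
        for lineno, raw in enumerate(fh, start=1):
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            try:
                tasks.append(validate_task_line(line))
            except ValueError as exc:  # also covers json.JSONDecodeError
                raise ValueError(f"{path}:{lineno}: {exc}") from exc
    return tasks
```

Calling validate_task_file("tasks.jsonl") returns the parsed tasks, or raises a ValueError that names the offending line.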
# Single-process baseline
python eval/evaluation.py \
--task-file tasks.jsonl \
--agent-type openaisdk \
--model gpt-4o \
--api-key "$OPENAI_API_KEY"
# Parallel — N tasks at once, dynamic ports for the sandboxes
python eval/evaluation.py \
--task-file tasks.jsonl \
--agent-type openaisdk \
--model gpt-4o \
--max-parallel 4 \
--port-range 10000-12000
Common CLI flags
| Flag | Default | Description |
|---|---|---|
| --task-file | required | JSONL of tasks to evaluate. |
| --agent-type | openaisdk | One of openaisdk, claudesdk, googleadk, langchain, pocketflow, openclaw. |
| --model | gpt-4o | Model identifier passed to the chosen wrapper. |
| --max-parallel | 1 | Concurrent task subprocesses. |
| --port-range | -- | E.g. 10000-12000. Required for --max-parallel > 1. |
| --domain | -- | Filter the task file to a single domain. |
| --skip-setup | false | Skip per-task setup.sh (dev only). |
| --skip-mcp | false | Skip MCP server startup (mock/CLI tests). |
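When runs are launched from a script, the constraint in the table (--port-range is required once --max-parallel exceeds 1) can be checked before any sandbox starts. The following is a minimal sketch of such a wrapper. build_eval_command is a hypothetical helper, not part of the project, and it uses only flags listed in the table.

```python
import shlex
import sys


def build_eval_command(task_file, agent_type="openaisdk", model="gpt-4o",
                       max_parallel=1, port_range=None):
    """Assemble an argv list for eval/evaluation.py (hypothetical helper)."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    if max_parallel > 1 and not port_range:
        # Mirrors the table: --port-range is required for --max-parallel > 1.
        raise ValueError("port_range is required when max_parallel > 1")
    cmd = [
        sys.executable, "eval/evaluation.py",
        "--task-file", task_file,
        "--agent-type", agent_type,
        "--model", model,
        "--max-parallel", str(max_parallel),
    ]
    if port_range:
        cmd += ["--port-range", port_range]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command("tasks.jsonl", max_parallel=4,
                             port_range="10000-12000")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning an argv list rather than a shell string avoids quoting problems if a path contains spaces.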
Single-task quick path
When iterating on a single task you can call the runner directly and skip the parent JSONL:
python eval/task_runner.py \
--task-dir dataset/crm/benign/4 \
--agent-type openaisdk \
--model gpt-4o
Output layout
Results land under $EVAL_RESULTS_ROOT/benchmark/<agent>/<dataset_path>/ (default root is ./results):
results/
└── openaisdk/
    └── crm/benign/4/
        ├── traces/
        │   └── traces_20261018_143052.jsonl   # Raw provider trace
        ├── 20261018_143052.json               # Standardized trajectory
        └── judge_result.json                  # task_success / attack_success
Where this connects
The trajectories produced here are the same format published on the leaderboard, so you can drop them straight into the data-auditing UI. For attack runs, use the red-teaming agent from the Red-teaming Agent section instead; it actively searches for working attacks rather than running a fixed task list.
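For a quick tally before opening the data-auditing UI, the judge_result.json files under the results root can be scanned directly. This is a rough sketch that assumes each file is a JSON object with top-level task_success and attack_success keys holding truthy or falsy values. The actual schema is not specified here, so adjust the keys to match your output. tally_judge_results is a hypothetical helper name.

```python
import json
from pathlib import Path


def tally_judge_results(root):
    """Count judge outcomes across every judge_result.json under `root`."""
    counts = {"runs": 0, "task_success": 0, "attack_success": 0}
    for path in sorted(Path(root).rglob("judge_result.json")):
        with path.open(encoding="utf-8") as fh:
            result = json.load(fh)
        counts["runs"] += 1
        # Assumed schema: top-level keys; a missing key is counted as False.
        for key in ("task_success", "attack_success"):
            if result.get(key):
                counts[key] += 1
    return counts


if __name__ == "__main__":
    # "./results" is the default root described in the output layout.
    print(tally_judge_results("results"))
```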