Eval with decodingtrust-agent

eval/evaluation.py is the canonical entrypoint for full benchmark runs. It reads a JSONL list of tasks, brings up the per-task Docker sandboxes, runs your chosen agent framework against each task, calls the per-task judge.py, and writes a standardized trajectory plus summary statistics.

What you'll need

You need a working install (see Install from Source), a running Docker daemon, and a provider API key in .env. The first run for a domain pulls that domain's sandbox images, which can take a few minutes.

Step 1 — Build your task file

Tasks are described in a JSONL file, one JSON object per line. Benign tasks are identified by {domain, type, task_id}; malicious tasks add threat_model and risk_category. The runner derives each task's path under dataset/ from those fields.

# tasks.jsonl
{"domain": "crm", "type": "benign", "task_id": 1}
{"domain": "crm", "type": "benign", "task_id": 2}
{"domain": "crm", "type": "malicious", "threat_model": "indirect", "risk_category": "data-exfiltration", "task_id": 3}
{"domain": "workflow", "type": "malicious", "threat_model": "direct", "risk_category": "manipulative-use", "task_id": 12}

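Before a long run it can help to sanity-check the task file. The sketch below checks each line for the fields listed above. The function names and the validation rules are illustrative assumptions, not the checks the runner itself performs.

```python
import json

# Field names are taken from the task-file description above. The rules
# applied here are an assumption, not the runner's own validation logic.
BENIGN_FIELDS = {"domain", "type", "task_id"}
MALICIOUS_FIELDS = BENIGN_FIELDS | {"threat_model", "risk_category"}


def missing_fields(task: dict) -> set:
    """Return the required fields that are absent from one task record."""
    required = MALICIOUS_FIELDS if task.get("type") == "malicious" else BENIGN_FIELDS
    return required - set(task)


def check_task_lines(lines) -> list:
    """Parse JSONL lines and return (line_number, problem) pairs."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((number, "expected a JSON object"))
            continue
        missing = missing_fields(task)
        if missing:
            problems.append((number, "missing " + ", ".join(sorted(missing))))
    return problems
```

To use it, open the file and pass the handle in, for example `check_task_lines(open("tasks.jsonl"))`. An empty list means every line carries the expected fields.
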
Step 2 — Run the evaluation

# Single-process baseline
python eval/evaluation.py \
  --task-file tasks.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --api-key "$OPENAI_API_KEY"

# Parallel — N tasks at once, dynamic ports for the sandboxes
python eval/evaluation.py \
  --task-file tasks.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 4 \
  --port-range 10000-12000
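
If you launch runs from a script, the same commands can be assembled in Python. This is a minimal sketch: `build_eval_command` is a hypothetical helper, it uses only the flags shown above, and it enforces the rule from the flag table that `--port-range` must accompany `--max-parallel` values above 1.

```python
import sys


def build_eval_command(task_file, agent_type="openaisdk", model="gpt-4o",
                       max_parallel=1, port_range=None):
    """Assemble the argument list for eval/evaluation.py (hypothetical helper)."""
    if max_parallel > 1 and port_range is None:
        # Mirrors the documented constraint on --port-range.
        raise ValueError("--port-range is required when --max-parallel > 1")
    cmd = [
        sys.executable, "eval/evaluation.py",
        "--task-file", str(task_file),
        "--agent-type", agent_type,
        "--model", model,
        "--max-parallel", str(max_parallel),
    ]
    if port_range is not None:
        cmd += ["--port-range", port_range]
    return cmd
```

Pass the returned list to `subprocess.run(cmd, check=True)` from the repository root to start the run. Building a list rather than a shell string avoids quoting problems in file paths.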

Common CLI flags

Flag	Default	Description
--task-file	required	JSONL of tasks to evaluate.
--agent-type	openaisdk	One of `openaisdk`, `claudesdk`, `googleadk`, `langchain`, `pocketflow`, `openclaw`.
--model	gpt-4o	Model identifier passed to the chosen wrapper.
--max-parallel	1	Concurrent task subprocesses.
--port-range	--	Port range for the sandboxes, e.g. `10000-12000`. Required when `--max-parallel` is greater than 1.
--domain	--	Filter the task file to a single domain.
--skip-setup	false	Skip per-task `setup.sh` (dev only).
--skip-mcp	false	Skip MCP server startup (mock/CLI tests).

Single-task quick path

When iterating on a single task, you can call eval/task_runner.py directly on the task directory and skip the task file:

python eval/task_runner.py \
  --task-dir dataset/crm/benign/4 \
  --agent-type openaisdk \
  --model gpt-4o

Output layout

Results land under $EVAL_RESULTS_ROOT/benchmark/<agent>/<dataset_path>/ (default root is ./results):

results/
└── openaisdk/
    └── crm/benign/4/
        ├── traces/
        │   └── traces_20261018_143052.jsonl       # Raw provider trace
        ├── 20261018_143052.json                    # Standardized trajectory
        └── judge_result.json                       # task_success / attack_success
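
To get a quick tally across tasks, you can walk the results tree and read each judge_result.json. The sketch below assumes each file is a JSON object with top-level boolean `task_success` and `attack_success` keys. That schema is inferred from the comment in the layout above and is not confirmed, and `summarize_judgements` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_judgements(results_root):
    """Count judge outcomes across every judge_result.json under results_root."""
    counts = {"tasks": 0, "task_success": 0, "attack_success": 0}
    for path in sorted(Path(results_root).rglob("judge_result.json")):
        result = json.loads(path.read_text(encoding="utf-8"))
        counts["tasks"] += 1
        # Assumed schema: top-level keys holding true/false (or absent).
        for key in ("task_success", "attack_success"):
            if result.get(key) is True:
                counts[key] += 1
    return counts
```

Calling `summarize_judgements("results")` returns a dictionary of counts. Because it searches recursively, it works regardless of how deep the per-task directories sit under the root.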

Where this connects