Eval with decodingtrust-agent

eval/evaluation.py is the canonical entrypoint. It reads a JSONL list of tasks, brings up the per-task Docker sandboxes, runs your chosen agent framework against each task, calls the per-task judge.py, and writes a standardized trajectory plus summary statistics. Use it for full benchmark runs.

What you'll need
A working install (see Install from Source), a running Docker daemon, and a provider API key in .env. The first run of a domain pulls its sandbox images, which can take a few minutes.

Step 1 — Build your task file

Tasks are described in a JSONL file, one JSON object per line. Benign tasks are identified by domain, type, and task_id; malicious tasks add threat_model and risk_category. The runner derives each task's path under dataset/ from those fields.

# tasks.jsonl
{"domain": "crm", "type": "benign", "task_id": 1}
{"domain": "crm", "type": "benign", "task_id": 2}
{"domain": "crm", "type": "malicious", "threat_model": "indirect", "risk_category": "data-exfiltration", "task_id": 3}
{"domain": "workflow", "type": "malicious", "threat_model": "direct", "risk_category": "manipulative-use", "task_id": 12}
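Before starting a long run, it can help to check the task file against the field rules above. The following is a minimal sketch and not part of the repository: `validate_task` and `load_tasks` are hypothetical helper names, and treating `benign` and `malicious` as the only valid `type` values is an assumption drawn from the examples.

```python
import json

# Field sets taken from the description above.
BENIGN_FIELDS = {"domain", "type", "task_id"}
MALICIOUS_FIELDS = BENIGN_FIELDS | {"threat_model", "risk_category"}


def validate_task(task: dict) -> list[str]:
    """Return a list of problems found in one task record (empty if none)."""
    task_type = task.get("type")
    if task_type == "benign":
        required = BENIGN_FIELDS
    elif task_type == "malicious":
        required = MALICIOUS_FIELDS
    else:
        # ASSUMPTION: only these two task types exist.
        return [f"unknown type: {task_type!r}"]
    missing = sorted(required - set(task))
    return [f"missing field: {name}" for name in missing]


def load_tasks(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines, and fail on the first bad record."""
    tasks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        task = json.loads(stripped)
        problems = validate_task(task)
        if problems:
            raise ValueError(f"line {line_no}: {'; '.join(problems)}")
        tasks.append(task)
    return tasks
```

Calling `load_tasks(open("tasks.jsonl").read())` raises a `ValueError` that names the offending line, which is cheaper to hit than a failure midway through sandbox startup.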

Step 2 — Run the evaluation

# Single-process baseline
python eval/evaluation.py \
  --task-file tasks.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --api-key "$OPENAI_API_KEY"

# Parallel — N tasks at once, dynamic ports for the sandboxes
python eval/evaluation.py \
  --task-file tasks.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 4 \
  --port-range 10000-12000
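How the runner assigns ports inside the range is not described here, so the sketch below is only a pre-flight check you could run yourself before a parallel run. `parse_port_range` and `free_ports` are hypothetical helpers, and treating the range as inclusive on both ends is an assumption.

```python
import socket


def parse_port_range(spec: str) -> tuple[int, int]:
    """Parse a 'START-END' string such as '10000-12000'."""
    start_text, end_text = spec.split("-", 1)
    start, end = int(start_text), int(end_text)
    if not 0 < start <= end <= 65535:
        raise ValueError(f"invalid port range: {spec!r}")
    return start, end


def free_ports(start: int, end: int, host: str = "127.0.0.1") -> list[int]:
    """Return the ports in [start, end] that can currently be bound on host."""
    available = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is already in use or cannot be bound.
            available.append(port)
    return available
```

The result is a snapshot: a port that is free during the check can still be taken before the sandboxes start, so treat a short list as a warning rather than a guarantee.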

Common CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| --task-file | required | JSONL of tasks to evaluate. |
| --agent-type | openaisdk | One of openaisdk, claudesdk, googleadk, langchain, pocketflow, openclaw. |
| --model | gpt-4o | Model identifier passed to the chosen wrapper. |
| --max-parallel | 1 | Concurrent task subprocesses. |
| --port-range | -- | E.g. 10000-12000. Required for --max-parallel > 1. |
| --domain | -- | Filter the task file to a single domain. |
| --skip-setup | false | Skip per-task setup.sh (dev only). |
| --skip-mcp | false | Skip MCP server startup (mock/CLI tests). |
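If you launch runs from a script, the flag rules listed above can be checked before the process starts. This is a minimal sketch: `build_eval_command` is a hypothetical wrapper, it only uses flags named in this section, and it assumes the command is run from the repository root.

```python
import sys
from typing import Optional

# Agent types as listed for --agent-type.
AGENT_TYPES = {
    "openaisdk", "claudesdk", "googleadk",
    "langchain", "pocketflow", "openclaw",
}


def build_eval_command(
    task_file: str,
    agent_type: str = "openaisdk",
    model: str = "gpt-4o",
    max_parallel: int = 1,
    port_range: Optional[str] = None,
    domain: Optional[str] = None,
) -> list[str]:
    """Build the argv list for eval/evaluation.py, validating flag combinations."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unsupported agent type: {agent_type!r}")
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    if max_parallel > 1 and port_range is None:
        raise ValueError("--port-range is required when --max-parallel > 1")

    cmd = [
        sys.executable, "eval/evaluation.py",
        "--task-file", str(task_file),
        "--agent-type", agent_type,
        "--model", model,
        "--max-parallel", str(max_parallel),
    ]
    if port_range is not None:
        cmd += ["--port-range", port_range]
    if domain is not None:
        cmd += ["--domain", domain]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building an argv list rather than a shell string avoids quoting problems with paths and model names.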

Single-task quick path

When iterating on a single task you can call the runner directly and skip the parent JSONL:

python eval/task_runner.py \
  --task-dir dataset/crm/benign/4 \
  --agent-type openaisdk \
  --model gpt-4o
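To go from a record in the task file to a `--task-dir` value, a small mapping helper can be enough. The sketch below is hypothetical: the benign layout follows the `dataset/crm/benign/4` example above, while the malicious layout (nesting by `threat_model`, then `risk_category`) is an assumption that should be checked against the actual `dataset/` tree.

```python
from pathlib import PurePosixPath


def task_dir_for(task: dict, dataset_root: str = "dataset") -> str:
    """Map one task record to a candidate --task-dir value."""
    parts = [task["domain"], task["type"]]
    if task["type"] == "malicious":
        # ASSUMPTION: malicious tasks nest by threat_model, then risk_category.
        parts += [task["threat_model"], task["risk_category"]]
    parts.append(str(task["task_id"]))
    return str(PurePosixPath(dataset_root, *parts))
```

For the benign record `{"domain": "crm", "type": "benign", "task_id": 4}` this returns `dataset/crm/benign/4`, the directory used in the command above.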

Output layout

Results land under $EVAL_RESULTS_ROOT/benchmark/<agent>/<dataset_path>/ (default root is ./results):

results/
  openaisdk/
    crm/benign/4/
      traces/
        traces_20261018_143052.jsonl   # Raw provider trace
      20261018_143052.json             # Standardized trajectory
      judge_result.json                # task_success / attack_success
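Once a run finishes, the per-task judge files can be tallied with a short script. This is a sketch under stated assumptions: `summarize_judge_results` is a hypothetical helper, and it assumes `judge_result.json` holds top-level boolean keys named `task_success` and `attack_success`, either of which may be absent. Check a real file before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_judge_results(root) -> Counter:
    """Count judge outcomes across every judge_result.json found under root."""
    counts = Counter()
    for path in sorted(Path(root).rglob("judge_result.json")):
        result = json.loads(path.read_text(encoding="utf-8"))
        counts["total"] += 1
        # ASSUMPTION: both keys are optional top-level booleans.
        if result.get("task_success") is True:
            counts["task_success"] += 1
        if result.get("attack_success") is True:
            counts["attack_success"] += 1
    return counts
```

Pointing it at one agent's directory gives per-agent totals; pointing it at the results root aggregates across agents.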

Where this connects
The trajectories produced here use the same format as those published on the leaderboard, so you can drop them straight into the data-auditing UI. For attack runs, use the red-teaming agent from the Red-teaming Agent section instead: it actively searches for working attacks rather than running a fixed task list.