Install SDK
The fastest way to use DTap is via the decodingtrust-agent-sdk package on PyPI. It ships the dtap CLI, every supported agent framework backend, and the benchmark task lists. The per-task dataset and the Docker sandboxes are fetched on demand.
Prerequisites: pip, Docker Engine 24+ with the compose plugin (for the per-task sandbox containers), and an API key for at least one model provider (OpenAI, Anthropic, or Google).
Step 1 — pip install
A virtual environment is strongly recommended. The base install pulls in every agent framework; the package is light enough that there is no reason to split it.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install decodingtrust-agent-sdk
If you only want a single framework's dependencies, the optional extras [openai], [claude], [google], [langchain], [pocketflow], and [all] are also defined.
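To confirm what actually got installed, a small Python check can help. The sketch below uses only the standard library: it reports whether the interpreter is running inside a virtual environment and reads the extras that an installed distribution declares in its metadata. The helper names in_virtualenv and declared_extras are illustrative and are not part of the SDK.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def declared_extras(dist_name: str):
    """Return the extras a distribution declares, or None if it is not installed."""
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    extras = declared_extras("decodingtrust-agent-sdk")
    if extras is None:
        print("decodingtrust-agent-sdk is not installed in this environment")
    else:
        print("declared extras:", ", ".join(extras) or "(none)")
```

Running it with the interpreter from .venv should show whether the package is visible there and which extras its metadata lists.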
Step 2 — Configure provider keys
Export the key(s) for the providers you plan to use. Only the providers you actually invoke are required:
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...
The SDK also reads .env files via python-dotenv, so placing these in a .env file in your working directory works as well.
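A quick way to see which providers are usable is to check the environment before launching anything. The following is a minimal sketch, assuming only the three variable names from the export lines above; configured_providers is a hypothetical helper, not an SDK function, and it inspects the process environment only.

```python
import os

# Environment variable names taken from the export lines above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return provider names whose key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("keys found for:", ", ".join(found))
    else:
        print("no provider keys set; export at least one before running dtap")
```

If the keys live in a .env file instead, calling python-dotenv's load_dotenv() before the check should load them into the environment first.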
Step 3 — Download the dataset
The per-task config.yaml / judge.py / setup.sh files that the evaluator consumes live on HuggingFace at AI-Secure/DecodingTrust-Agent-Platform. Pull them into a dataset/ folder next to where you'll run dtap:
# Install the HuggingFace CLI if you don't have it
pip install -U huggingface_hub
# Download the dataset/ folder (~216 MB)
hf download AI-Secure/DecodingTrust-Agent-Platform \
    --repo-type dataset \
    --local-dir dataset
Alternatively, use the Python API for programmatic downloads:
from huggingface_hub import snapshot_download

snapshot_download(
    "AI-Secure/DecodingTrust-Agent-Platform",
    repo_type="dataset",
    local_dir="dataset",
)
Step 4 — Verify the install
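Before the CLI checks, one optional sanity check is to confirm that the Step 3 download produced the per-task files. The sketch below walks a local folder and reports task directories that lack any of the three file names mentioned in Step 3. It assumes each task is a directory identified by its config.yaml and that every task carries all three files; neither is confirmed by this guide, and scan_tasks is a hypothetical helper.

```python
from pathlib import Path

# File names mentioned in Step 3; that every task carries all three is an assumption.
EXPECTED = ("config.yaml", "judge.py", "setup.sh")


def scan_tasks(root):
    """Map each directory under root containing config.yaml to its missing expected files."""
    report = {}
    for config in sorted(Path(root).rglob("config.yaml")):
        task_dir = config.parent
        report[task_dir] = [n for n in EXPECTED if not (task_dir / n).is_file()]
    return report


if __name__ == "__main__":
    tasks = scan_tasks("dataset")
    incomplete = {d: m for d, m in tasks.items() if m}
    print(f"{len(tasks)} task directories found, {len(incomplete)} missing files")
    for task_dir, missing in incomplete.items():
        print(f"  {task_dir}: missing {', '.join(missing)}")
```

A count of zero task directories most likely means the download went to a different folder than the one being scanned.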
The package installs the dtap console script. Use the introspection subcommands to confirm everything is wired up:
# Show CLI help
dtap --help
# List the 14 benchmark domains and their task counts
dtap info domain
# List the Docker environments defined in dt_arena/config/env.yaml
dtap info env
# List the agent backends and which pip extra each one needs
dtap agent list
Step 5 — Run your first evaluation
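Because an evaluation needs both the dtap console script and Docker, a short preflight check can save a failed run. The sketch below only looks up executables on PATH with the standard library; it does not verify the Docker Engine version or the compose plugin. The name missing_tools is illustrative, not part of the SDK.

```python
import shutil

# Executables this guide relies on: the dtap console script and the Docker CLI.
REQUIRED_TOOLS = ("dtap", "docker")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("dtap and docker are both on PATH")
```

The which parameter exists only so the lookup can be swapped out in tests; in normal use the default shutil.which is enough.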
This runs a single benign CRM task with the OpenAI Agents SDK backbone. Docker pulls the Salesforce sandbox on the first run, and the evaluation finishes in a couple of minutes:
dtap eval \
    --task-list benchmark/crm/benign.jsonl \
    --agent-type openaisdk \
    --model gpt-4o \
    --max-parallel 4
Results land under results/benchmark/<agent_type>/<model>/<domain>/<type>/<task_id>/ (override the root with EVAL_RESULTS_ROOT).
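For scripts that post-process runs, the path pattern above can be assembled programmatically. This is a minimal sketch and not the SDK's own logic: it assumes that EVAL_RESULTS_ROOT replaces the leading results component, which the guide does not spell out, and results_dir is a hypothetical helper.

```python
import os
from pathlib import Path


def results_dir(agent_type, model, domain, task_type, task_id, env=None):
    """Build the per-task results directory following the pattern above.

    Assumption: EVAL_RESULTS_ROOT replaces the leading "results" component.
    """
    env = os.environ if env is None else env
    root = Path(env.get("EVAL_RESULTS_ROOT") or "results")
    return root / "benchmark" / agent_type / model / domain / task_type / task_id


if __name__ == "__main__":
    # The component values here are illustrative; check an actual run for the real names.
    print(results_dir("openaisdk", "gpt-4o", "crm", "benign", "1"))
```

Comparing the printed path against the directory a real run creates is the quickest way to confirm or correct the assumption about the root.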