Install SDK

The fastest way to use DTap is via the decodingtrust-agent-sdk package on PyPI. It ships the dtap CLI, every supported agent framework backend, and the benchmark task lists. The per-task dataset and the Docker sandboxes are not bundled: you download the dataset in Step 3, and Docker pulls each sandbox the first time a task needs it.

Prerequisites

You need Python 3.10 or newer, pip, Docker Engine 24 or newer with the compose plugin (used for the per-task sandbox containers), and an API key for at least one model provider (OpenAI, Anthropic, or Google).
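
The list above can be checked before installing anything. The following is a minimal sketch, not part of the SDK: check_prerequisites is a hypothetical helper that only tests the Python version and whether a docker executable is on PATH. It does not verify the Docker Engine version or the compose plugin.

```python
import shutil
import sys


def check_prerequisites(which=shutil.which, version=None):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper (not provided by decodingtrust-agent-sdk).
    `which` and `version` are injectable so the function is easy to test.
    """
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    # The guide asks for Python 3.10 or newer.
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # Only checks that the executable exists, not that the daemon is running.
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

To confirm the engine version and the compose plugin, run docker version and docker compose version by hand.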

Step 1 — pip install

A virtual environment is strongly recommended. The base install pulls in every agent framework; the package is light enough that there is no reason to split it.

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

pip install decodingtrust-agent-sdk

If you only want a single framework's dependencies, the optional extras [openai], [claude], [google], [langchain], [pocketflow], and [all] are also defined, for example pip install "decodingtrust-agent-sdk[openai]".

Step 2 — Configure provider keys

Export the key(s) for the providers you plan to use. Only the keys for providers you actually invoke are required:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...

The SDK also reads .env files via python-dotenv, so dropping these in a .env at your working directory works as well.
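
To see which providers are configured in the current shell, a small check like the one below may help. It is a sketch under stated assumptions: configured_providers and the provider labels are hypothetical names chosen for illustration, not identifiers used by dtap. Only the three environment variable names come from this guide.

```python
import os

# Environment variable names from the step above; the dictionary keys are
# illustrative labels, not identifiers recognised by dtap.
KEY_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return the labels of providers whose API key is set and non-blank."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, key in KEY_BY_PROVIDER.items()
        if env.get(key, "").strip()
    ]


if __name__ == "__main__":
    # If the keys live in a .env file, load it first, e.g. with
    # python-dotenv: `from dotenv import load_dotenv; load_dotenv()`.
    print("Configured providers:", configured_providers() or "none")
```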

Step 3 — Download the dataset

The per-task config.yaml / judge.py / setup.sh files that the evaluator consumes live on HuggingFace at AI-Secure/DecodingTrust-Agent-Platform. Pull them into a dataset/ folder next to where you'll run dtap:

# Install the HuggingFace CLI if you don't have it
pip install -U huggingface_hub

# Download the dataset/ folder (~216 MB)
hf download AI-Secure/DecodingTrust-Agent-Platform \
  --repo-type dataset \
  --local-dir dataset

Alternatively, use the Python API for programmatic downloads:

from huggingface_hub import snapshot_download

snapshot_download(
    "AI-Secure/DecodingTrust-Agent-Platform",
    repo_type="dataset",
    local_dir="dataset",
)
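
After the download finishes, a quick sanity check can catch a partial copy. The sketch below assumes, without confirmation from the dataset itself, that every task lives in its own directory containing a config.yaml next to its judge.py. scan_dataset is a hypothetical helper, and the expected tuple should be adjusted (for example to include setup.sh) once you know which files every task carries.

```python
from pathlib import Path


def scan_dataset(dataset_root="dataset", expected=("judge.py",)):
    """Count task directories and report those missing expected files.

    Assumption: a task directory is any directory holding a config.yaml.
    Returns (number_of_tasks, {relative_task_dir: [missing file names]}).
    """
    root = Path(dataset_root)
    incomplete = {}
    task_count = 0
    for config in sorted(root.rglob("config.yaml")):
        task_dir = config.parent
        task_count += 1
        missing = [name for name in expected if not (task_dir / name).is_file()]
        if missing:
            incomplete[task_dir.relative_to(root).as_posix()] = missing
    return task_count, incomplete


if __name__ == "__main__":
    count, incomplete = scan_dataset()
    print(f"{count} task directories found, {len(incomplete)} incomplete")
```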

Step 4 — Verify the install

The package installs the dtap console script. Use the introspection subcommands to confirm everything is wired up:

# Show CLI help
dtap --help

# List the 14 benchmark domains and their task counts
dtap info domain

# List the Docker environments defined in dt_arena/config/env.yaml
dtap info env

# List the agent backends and which pip extra each one needs
dtap agent list
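
The same check can be scripted, for instance in CI. This is a minimal sketch using only the standard library: install_status is a hypothetical helper, and it confirms only that the distribution is installed and that a dtap executable is on PATH, not that the subcommands above succeed.

```python
import shutil
from importlib import metadata


def install_status(dist_name="decodingtrust-agent-sdk", script="dtap"):
    """Report the installed distribution version and the console script path.

    Either value is None when the corresponding item cannot be found.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {"version": version, "script_path": shutil.which(script)}


if __name__ == "__main__":
    print(install_status())
```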

Step 5 — Run your first evaluation

The command below runs a single benign CRM task with the OpenAI Agents SDK backend. Docker pulls the Salesforce sandbox on first run, and the evaluation finishes in a couple of minutes:

dtap eval \
  --task-list benchmark/crm/benign.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 4

Results land under results/benchmark/<agent_type>/<model>/<domain>/<type>/<task_id>/ (override the root with EVAL_RESULTS_ROOT).
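
The directory layout above can be walked to list finished runs. The sketch below is an illustration, not an SDK feature: list_task_results is a hypothetical helper that takes the benchmark directory explicitly, because this guide does not spell out exactly which path segment EVAL_RESULTS_ROOT replaces. It also assumes that each of the five path components is a single directory level, which would not hold for a model name containing a slash.

```python
from pathlib import Path

# Path components in the order given by the results layout above.
FIELDS = ("agent_type", "model", "domain", "type", "task_id")


def list_task_results(base="results/benchmark"):
    """Return one dict per task directory found five levels below `base`."""
    base = Path(base)
    rows = []
    for path in sorted(base.glob("*/*/*/*/*")):
        if path.is_dir():
            rows.append(dict(zip(FIELDS, path.relative_to(base).parts)))
    return rows


if __name__ == "__main__":
    for row in list_task_results():
        print(row)
```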

Working from source instead?