Install Environment
Each domain ships with one or more Docker Compose stacks under dt_arena/envs/<env>/ that emulate the real apps the agent interacts with: Salesforce, Gmail, Slack, GitHub, and others. The evaluation runner brings them up automatically per task; this page covers the prerequisites, what each stack contains, and how to drive them by hand for debugging.
compose plugin, and at least 16 GB of free disk for the bigger stacks (SuiteCRM, GitLab).
Pull the sandbox images
Most environments ship a docker-compose-hub.yml file that pulls prebuilt images. The first task in a domain pulls them on demand, so you can either let the runner do it or warm the cache up front:
# Warm the image cache for a domain (CRM example)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose-hub.yml pull
# Or all environments at once
for f in dt_arena/envs/*/docker-compose-hub.yml; do
  docker compose -f "$f" pull
done
Bring up a single environment
For interactive debugging — for example, watching the SuiteCRM UI while you run a task — start the stack manually:
# Salesforce CRM (SuiteCRM + MariaDB)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose.yml up -d
# Browse the UI
open http://localhost:8080 # SuiteCRM dashboard
# Tear down (and reset state)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose.yml down -v
Port allocation for parallel runs
When you run more than one task at a time, the runner allocates dynamic ports from a configurable range so that multiple sandbox copies can coexist. Pin the range with either a CLI flag or an environment variable:
# CLI flag
python eval/evaluation.py --task-file tasks.jsonl --max-parallel 4 \
  --port-range 10000-12000
# Or environment variable (picked up by every entrypoint)
export DT_PORT_RANGE="10000-12000"
Each environment exposes its ports via dynamic variables (e.g. SALESFORCE_API_PORT) that the runner injects into both the MCP server and the task setup.sh. The mapping lives in dt_arena/config/env.yaml.
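To make the idea of range-based allocation concrete, here is a minimal sketch of how ports could be drawn from a "LOW-HIGH" range and turned into environment variables. It is not the runner's actual implementation: the function names (parse_port_range, is_free, allocate_ports) are hypothetical, and the bind-based availability check is only an assumed, best-effort strategy.

```python
import socket


def parse_port_range(spec: str) -> range:
    """Parse a 'LOW-HIGH' string such as '10000-12000' into an inclusive range."""
    low_text, high_text = spec.split("-", 1)
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return range(low, high + 1)


def is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound right now (best-effort check)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def allocate_ports(names: list[str], spec: str) -> dict[str, str]:
    """Assign each variable name a distinct free port from the range.

    Hypothetical helper: the real runner may allocate differently.
    """
    candidates = (p for p in parse_port_range(spec) if is_free(p))
    env: dict[str, str] = {}
    for name in names:
        try:
            env[name] = str(next(candidates))
        except StopIteration:
            raise RuntimeError(f"no free port left in {spec} for {name}") from None
    return env
```

Calling allocate_ports(["SALESFORCE_API_PORT"], "10000-12000") returns a dict that could be merged into the environment of a subprocess. Note that a port found free here can still be taken by another process before the container binds it, so a real allocator would need some form of locking or retry.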
Resetting between tasks
Many environments support fast in-place resets without recreating containers, via a reset.sh mounted inside the container and registered under reset_scripts: in env.yaml:
# dt_arena/config/env.yaml
environments:
  salesforce:
    docker_compose: "dt_arena/envs/salesforce_crm/docker-compose.yaml"
    reset_scripts:
      mariadb: "/scripts/reset.sh"  # Path INSIDE the container
    ports:
      SALESFORCE_API_PORT:
        default: 8080
        container_port: 8080
When the runner finishes a task it calls docker exec <container> /scripts/reset.sh instead of bringing the whole stack down. New environments get reset support by mounting a script and adding it under reset_scripts.
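The reset step can be sketched as follows: given the reset_scripts mapping from env.yaml, build one docker exec command per service. This is an illustration under assumptions, not the runner's code. The helper names (build_reset_commands, run_resets) are hypothetical, and so is the container name in the usage note, because the page does not say how service keys such as mariadb map to running container names.

```python
import shlex
import subprocess


def build_reset_commands(
    reset_scripts: dict[str, str],
    containers: dict[str, str],
) -> list[list[str]]:
    """Build one `docker exec <container> <script>` command per service.

    reset_scripts: service name -> script path inside the container
                   (the reset_scripts block from env.yaml).
    containers:    service name -> running container name (assumed to be
                   supplied by the caller; how the runner resolves this
                   is not documented here).
    """
    commands = []
    for service, script in reset_scripts.items():
        container = containers[service]  # KeyError if the service is not running
        commands.append(["docker", "exec", container, script])
    return commands


def run_resets(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

For example, build_reset_commands({"mariadb": "/scripts/reset.sh"}, {"mariadb": "salesforce-mariadb-1"}) yields a single command, where salesforce-mariadb-1 is a made-up container name. Keeping run_resets in dry-run mode by default makes it easy to inspect what would run before touching a live stack.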
Available environments
The full registry lives at dt_arena/envs/registry.yaml. As of the latest release the platform ships sandboxes for Gmail, Google Calendar / Docs / Drive / Sheets / Forms, Slack, Zoom, Atlassian, WhatsApp, X, LinkedIn, Reddit, Notion, Dropbox, GitHub, GitLab, Snowflake, Databricks, PayPal, Robinhood, Chase, ServiceNow, Salesforce CRM, OS-Filesystem, Code-Terminal, Browser (e-commerce), arXiv (research), Telecom, Hospital (medical), Yahoo Finance, Booking, Expedia, Southwest, United, Enterprise, FedEx, and DoorDash. See the Environment sidebar for the per-app docs page.
sudo docker exec ... to seed databases inside containers. Add yourself to the docker group or configure passwordless sudo for /usr/bin/docker if running unattended.