Install Environment

Each domain ships with one or more Docker Compose stacks under dt_arena/envs/<env>/ that emulate the real apps the agent interacts with, such as Salesforce, Gmail, Slack, and GitHub. The evaluation runner brings them up automatically per task; this page covers the prerequisites, what each stack contains, and how to drive them by hand for debugging.

Prerequisites

Docker Engine 24+ (or Docker Desktop) with the Compose plugin, and at least 16 GB of free disk space for the larger stacks (SuiteCRM, GitLab).
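The requirements above can be checked before the first pull. The following is only a sketch: the helper names (docker_version_ok, disk_ok, check_prereqs) are hypothetical and not part of the platform, and the disk check looks at the filesystem of the current directory, which may differ from the one Docker stores its images on.

```shell
# Sketch of a prerequisite check; all function names are illustrative.

# Succeeds if the version string (e.g. "24.0.7") has a major version >= 24.
docker_version_ok() {
  major="${1%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$major" -ge 24 ]
}

# Succeeds if the given free space (in KiB) is at least 16 GiB.
disk_ok() {
  [ "$1" -ge $((16 * 1024 * 1024)) ]
}

# Runs the checks against the local Docker installation.
check_prereqs() {
  version="$(docker version --format '{{.Server.Version}}')" || return 1
  docker_version_ok "$version" || {
    echo "Docker Engine 24+ required, found $version" >&2; return 1; }
  docker compose version >/dev/null || {
    echo "Docker Compose plugin not found" >&2; return 1; }
  # Column 4 of POSIX df output is the available space in KiB.
  free_kb="$(df -Pk . | awk 'NR==2 {print $4}')"
  disk_ok "$free_kb" || echo "warning: less than 16 GB free here" >&2
}
```

Calling check_prereqs from the repository root prints nothing when the checks pass and a message on stderr otherwise.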

Pull the sandbox images

Most environments ship a docker-compose-hub.yml file that pulls prebuilt images. The first task in a domain pulls them on demand, so you can either let the runner do it or warm the cache up front:

# Warm the image cache for a domain (CRM example)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose-hub.yml pull

# Or all environments at once
for f in dt_arena/envs/*/docker-compose-hub.yml; do
  docker compose -f "$f" pull
done

Bring up a single environment

For interactive debugging — for example, watching the SuiteCRM UI while you run a task — start the stack manually:

# Salesforce CRM (SuiteCRM + MariaDB)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose.yml up -d

# Browse the UI
open http://localhost:8080      # SuiteCRM dashboard (macOS; use xdg-open on Linux)

# Tear down (and reset state)
docker compose -f dt_arena/envs/salesforce_crm/docker-compose.yml down -v
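Note that up -d returns as soon as the containers have started, which can be well before the web UI answers requests. A minimal sketch of a readiness wait follows; wait_for_http is a hypothetical helper, and the 120-second default and 2-second polling interval are arbitrary assumptions.

```shell
# Poll a URL until it answers successfully or the timeout (seconds) elapses.
# wait_for_http is an illustrative helper, not part of the platform.
wait_for_http() {
  url="$1"
  timeout="${2:-120}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f: treat HTTP errors as failure, -sS: quiet but keep error messages
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "timed out waiting for $url" >&2
  return 1
}
```

For the stack above, running wait_for_http http://localhost:8080 180 before opening the browser avoids landing on a connection error while SuiteCRM is still initializing.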

Port allocation for parallel runs

When you run more than one task at a time, the runner allocates dynamic ports from a configurable range so that multiple sandbox copies can coexist. Pin the range with either a CLI flag or an environment variable:

# CLI flag
python eval/evaluation.py --task-file tasks.jsonl --max-parallel 4 \
  --port-range 10000-12000

# Or environment variable (picked up by every entrypoint)
export DT_PORT_RANGE="10000-12000"
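A malformed range is easy to miss until a run fails, so it can help to validate the value before exporting it. The sketch below assumes the START-END format shown above; parse_port_range is a hypothetical helper and says nothing about how the runner itself parses the value.

```shell
# Validate a START-END port range and print "START END" on success.
# parse_port_range is illustrative; the runner's own parsing may differ.
parse_port_range() {
  range="$1"
  case "$range" in
    *-*) ;;                                   # must contain a dash
    *) echo "expected START-END, got '$range'" >&2; return 1 ;;
  esac
  lo="${range%-*}"   # text before the last dash
  hi="${range#*-}"   # text after the first dash
  for p in "$lo" "$hi"; do
    case "$p" in
      ''|*[!0-9]*) echo "non-numeric port in '$range'" >&2; return 1 ;;
    esac
    if [ "$p" -lt 1 ] || [ "$p" -gt 65535 ]; then
      echo "port $p is outside 1-65535" >&2; return 1
    fi
  done
  if [ "$lo" -gt "$hi" ]; then
    echo "range start $lo is greater than end $hi" >&2; return 1
  fi
  echo "$lo $hi"
}
```

For example, parse_port_range "10000-12000" >/dev/null && export DT_PORT_RANGE="10000-12000" only sets the variable when the range is well formed.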

Each environment exposes its ports via dynamic variables (e.g. SALESFORCE_API_PORT) that the runner injects into both the MCP server and the task setup.sh. The mapping lives in dt_arena/config/env.yaml.
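As an illustration of how a task setup.sh might consume an injected variable, the fragment below falls back to 8080 when SALESFORCE_API_PORT is unset, so the same script also works against a stack started by hand. The function name salesforce_base_url and the use of localhost as the host are assumptions made for this sketch.

```shell
# Hypothetical fragment of a task setup.sh.
# Builds the base URL from the injected port, defaulting to 8080
# when the variable is unset or empty (e.g. for a manually started stack).
salesforce_base_url() {
  port="${SALESFORCE_API_PORT:-8080}"
  echo "http://localhost:${port}"
}
```

Under a parallel run with SALESFORCE_API_PORT=10042 injected, the function would print http://localhost:10042.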

Resetting between tasks

Many environments support fast in-place resets without recreating containers, via a reset.sh mounted inside the container and registered under reset_scripts: in env.yaml:

# dt_arena/config/env.yaml
environments:
  salesforce:
    docker_compose: "dt_arena/envs/salesforce_crm/docker-compose.yml"
    reset_scripts:
      mariadb: "/scripts/reset.sh"   # Path INSIDE the container
    ports:
      SALESFORCE_API_PORT:
        default: 8080
        container_port: 8080

When the runner finishes a task, it calls docker exec <container> /scripts/reset.sh instead of bringing the whole stack down. To add reset support to a new environment, mount a script inside the container and register it under reset_scripts.
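The same reset can be triggered by hand while debugging. The sketch below assumes that the key under reset_scripts (mariadb in the example) is also the Compose service name, which the configuration does not state explicitly. reset_env and the DOCKER override are hypothetical conveniences; setting DOCKER=echo prints the final command instead of running it.

```shell
# Command used to talk to Docker; set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Run the in-container reset script for one service of a Compose stack.
# Usage: reset_env <compose-file> <service> [script-path]
# reset_env is an illustrative helper, not the runner's implementation.
reset_env() {
  compose_file="$1"
  service="$2"
  script="${3:-/scripts/reset.sh}"   # path INSIDE the container
  # Resolve the container ID of the service from the Compose project.
  cid="$($DOCKER compose -f "$compose_file" ps -q "$service")" || return 1
  if [ -z "$cid" ]; then
    echo "no running container for service '$service'" >&2
    return 1
  fi
  $DOCKER exec "$cid" "$script"
}
```

With the CRM stack running, reset_env dt_arena/envs/salesforce_crm/docker-compose.yml mariadb restores the state without recreating any containers.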

Available environments

The full registry lives at dt_arena/envs/registry.yaml. As of the latest release the platform ships sandboxes for Gmail, Google Calendar / Docs / Drive / Sheets / Forms, Slack, Zoom, Atlassian, WhatsApp, X, LinkedIn, Reddit, Notion, Dropbox, GitHub, GitLab, Snowflake, Databricks, PayPal, Robinhood, Chase, ServiceNow, Salesforce CRM, OS-Filesystem, Code-Terminal, Browser (e-commerce), arXiv (research), Telecom, Hospital (medical), Yahoo Finance, Booking, Expedia, Southwest, United, Enterprise, FedEx, and DoorDash. See the Environment sidebar for the per-app docs page.

Linux + sudo for Docker