Environments
Realistic, policy-aligned simulations of the apps where AI agents make decisions — email and chat, source control, banking, scheduling, customer service, OS shells, and more. Each environment ships with GUIs, MCP tool inventories, and direct + indirect attack policies.
Workflow
20 environments
Atlassian
The Atlassian environment simulates a project coordination workspace that integrates issue tracking and collaborative project management workflows. It is designed around common Jira-style interactions, including project browsing, issue discovery, board-based task tracking, and detailed issue inspection. This environment is particularly useful for evaluating workflow agents in settings where they must interpret project artifacts, update task state, and reason over structured issue metadata together with unstructured text content.
Databricks
The Databricks environment simulates a data-engineering and analytics workspace for retrieval, query, and assistant-driven reasoning workflows. Like Snowflake, this environment does not rely on a GUI; instead, agents interact exclusively through MCP tools, matching realistic usage patterns in data platforms where operations are performed through search, SQL, and function-style interfaces.

DoorDash
The DoorDash environment simulates an on-demand food-delivery and local-commerce platform that serves as an end-to-end ordering workspace for restaurants, grocers, and convenience stores. It supports account registration and login, storefront browsing by category with featured / past-orders / local-liquor sections, free-text restaurant search, per-store detail inspection (info, menu, featured items, sections, reviews), individual menu-item lookup, single-store cart management (add / update / remove / clear) with item notes, checkout with configurable fulfillment options, delivery address, scheduled delivery time, payment method, taxes, fees, tips, and promo-code application. It also supports order placement, order-status tracking, order history review, cancellation when eligible, and receipt-style summaries, making it suitable for evaluating agents in everyday food-ordering, price comparison, and local-commerce workflow scenarios. This environment is particularly important because delivery workflows combine real-time availability, substitutions, payment decisions, location data, and potentially irreversible purchase actions, creating realistic opportunities for both benign ordering assistance and adversarial manipulation through injected delivery instructions, malicious promo codes, unauthorized purchases, or harmful changes to address, tip, and payment settings.

Dropbox
The Dropbox environment simulates a cloud file-storage and collaboration platform that serves as a personal and shared workspace for documents, images, and folder hierarchies. It supports account login, hierarchical file and folder browsing, file upload and download, folder creation, file move / copy / rename / delete, starring, full-text file search, shared-link generation with configurable access levels, file-request pages for inbound uploads from external parties, and per-account storage-quota inspection, making it suitable for evaluating agents in document-management and content-collaboration workflow scenarios. This environment is particularly important because cloud-storage content combines structured metadata (paths, sizes, MIME types, ownership, sharing permissions) with unstructured file bodies that may contain arbitrary documents, images, PDFs, or executable artifacts, creating realistic opportunities for both benign file-management tasks and adversarial manipulation through malicious file uploads, overly permissive share links, exfiltration via shared links, or prompt-injection attacks hidden inside file contents, file names, or file-request submissions.

FedEx
We construct a simulated FedEx service environment that models shipment operations and tracking workflows through a unified MCP-based interface. The environment captures the core functionality of real-world logistics platforms, where users interact with shipping services to create shipments, estimate rates, and track delivery progress.

Gmail
The Gmail environment simulates a realistic email workspace for workflow-agent evaluation, covering the core operations needed for communication-centric tasks. It supports reading inboxes and email threads, inspecting email details, composing new emails, replying to and forwarding messages, and sending notifications to designated recipients. This environment is especially important for studying workflow security because email is both a primary action channel and a major attack surface: agents may be asked to process untrusted message content and then perform downstream actions such as replying, forwarding information, or contacting other users.

Google Calendar
The Google Calendar environment simulates a scheduling workspace that supports common calendar-management workflows. Agents can inspect events across month, week, and day views, create or edit events, check event details, and manage invitations and attendee-related updates. This environment is useful for evaluating whether agents can reliably coordinate legitimate scheduling tasks while remaining robust against attacks that misuse invitations, event descriptions, or scheduling-related notifications.

Google Docs
The Google Docs environment simulates a collaborative document workspace for document-centered workflow tasks. It supports document browsing, reading and editing content, adding comments, and sharing documents through link-based access control. This environment is particularly relevant for workflow-agent evaluation because agents may need to extract information from shared documents or act on document content, making document text a natural channel for indirect prompt injection and other content-based attacks.

Google Drive
The Google Drive environment simulates a cloud file-management workspace for document discovery, organization, and retrieval workflows. It supports file browsing, filtering, and navigation over shared and stored resources, making it useful for evaluating whether agents can correctly locate relevant files and reason over document repositories in realistic productivity settings. This environment is particularly relevant for workflow-agent evaluation because file repositories often serve as an entry point for downstream actions, and file metadata or search results may also become channels for indirect attacks.

Google Forms
The Google Forms environment simulates a form-based data-collection workspace for survey, intake, and structured response workflows. It supports viewing form content and inspecting form details in a layout similar to real-world form interfaces. This environment is particularly relevant for workflow-agent evaluation because forms often serve as a lightweight channel for collecting user-submitted information, making them a realistic source of externally provided content that may later influence downstream workflow actions.

Google Sheets
The Google Sheets environment simulates a spreadsheet workspace for structured data inspection and table-centered workflow tasks. It supports spreadsheet browsing and sheet-level detail views, enabling agents to read and reason over tabular information in realistic office workflows. This environment is useful for evaluating whether agents can safely handle structured business data, summaries, and records without being misled by malicious or deceptive spreadsheet content.

LinkedIn
The LinkedIn environment simulates a professional social network centered on career identity, hiring, and industry communication. It supports account login, personalized feed browsing, job-opportunity exploration, and network management (connections, invitations, and followed pages), making it suitable for evaluating agents in career-oriented workflow scenarios. This environment is particularly important because professional profiles combine structured career metadata (companies, titles, degrees, skills) with unstructured content such as feed posts, endorsements, recruiter messages, and job descriptions, creating realistic opportunities for both benign networking and adversarial manipulation through recruiter impersonation, malicious invitations, job-scam postings, or instructions hidden inside messages and posts.

Notion
The Notion environment simulates a workspace-centric productivity platform that unifies note-taking, wiki-style documentation, lightweight databases, and collaborative knowledge management. It supports account login, multi-workspace browsing, hierarchical page creation and editing with rich block content, structured databases with typed properties, full-text search, comments and sharing, favorites, templates, and a trash/restore flow, making it suitable for evaluating agents in knowledge-work scenarios. This environment is particularly important because Notion pages mix structured metadata (page titles, icons, database properties, workspace roles) with unstructured long-form content such as paragraphs, to-dos, toggles, embedded databases, comments, and page covers, creating realistic opportunities for both benign productivity workflows and adversarial manipulation through prompt injection hidden inside page bodies, database cells, comments, or shared pages.

PayPal
The PayPal environment simulates a payment workspace for financial workflow tasks. It supports inspecting account and balance information, navigating finance-related views, managing send/request flows, and interacting with linked payment methods such as bank cards. This environment enables evaluation of high-impact financial actions in workflow settings, where agents may be asked to initiate, route, or confirm payments and must therefore be robust against fraud, deceptive payment requests, and other misuse of financial operations.

Reddit
The Reddit environment simulates a community-driven social-discussion platform organized around subreddits, long-form posts, threaded comments, up/down voting, and private messaging. It supports account login, personalized and subreddit-scoped feed browsing, subreddit discovery and membership management, post and comment authoring with voting, public user profiles, and direct messaging, making it suitable for evaluating agents in open community interaction scenarios. This environment is particularly important because Reddit content is largely user-generated and pseudonymous, combining structured metadata (subreddit, author, karma, score) with unstructured free-text posts, comment threads, direct messages, and link posts that frequently embed external URLs, creating realistic opportunities for both benign community participation and adversarial manipulation through trolling, vote manipulation, karma farming, scam link posts, or prompt-injection attacks hidden inside posts, comments, or unsolicited private messages.

Slack
The Slack environment simulates an internal messaging workspace for channel-based communication and notification workflows. Agents can read channel messages and message threads, inspect user information, and post updates or notifications to relevant channels. This environment captures a common enterprise communication setting in which agents must process conversational content and coordinate downstream actions, while also being exposed to untrusted instructions embedded in messages or threads.
Snowflake
The Snowflake environment simulates a warehouse-style structured data workspace for analytics and retrieval workflows. Unlike GUI-centric environments, this environment is accessed entirely through MCP tools, reflecting realistic agent use over data platforms where task execution is mediated through query, retrieval, and analyst-style interfaces rather than direct visual interaction.

WhatsApp
The WhatsApp environment simulates a recipient-facing messaging workspace for workflow-agent evaluation. It supports common communication and follow-up actions in a mobile-style interface, including account access, message review, conversation tracking, call history inspection, and call-related interactions. Compared with channel-based systems such as Slack, this environment emphasizes direct user-to-user communication, making it particularly relevant for evaluating risks such as phishing, impersonation, spam, and other deceptive messaging behaviors.

X
The X environment simulates a short-form social microblogging platform built around posts, replies, reposts, likes, follows, and personalized feeds. It supports account login, timeline browsing, profile inspection, trend and topic exploration, and post-level interaction, making it suitable for evaluating agents in information-consumption and social-interaction scenarios. This environment is particularly important because public social posts mix first-party user content with third-party quotes, links, hashtags, and replies, creating realistic opportunities for both benign engagement and adversarial manipulation through injected instructions, misleading claims, or socially coercive phrasing.

Zoom
The Zoom environment simulates a meeting-management workspace that supports common conferencing workflows. Agents can inspect the home page and meeting details, schedule meetings through calendar-related views, and manage invitation-related actions. This environment is useful for evaluating workflow agents in coordination tasks involving meetings and attendance, while also exposing risks such as deceptive invitations, fraudulent meeting updates, or malicious instructions embedded in meeting-related content.
CRM
1 environment
Customer Service
1 environment
Travel
5 environments
Booking
We construct a simulated travel booking platform where a travel agent helps users plan, book, and pay for trips. The environment is populated with large-scale datasets covering 3.8 million flight records for 2025, 5047 accommodation listings, 9551 restaurant entries, and 5302 tourist attractions. The platform also maintains user account data, including trip histories and saved payment methods. This data-driven design reflects the structured multi-step workflow of a real travel booking experience, where an assistant must query available options, compare alternatives along multiple criteria, make reservations, manage bookings, process payments, and interact with platform features such as reviews and host messaging.

Enterprise Rent-A-Car
The Enterprise environment simulates a global car-rental platform that serves as a ground-transportation workspace for vehicle reservation, Emerald Club loyalty management, and pickup / return logistics. It supports account registration and login, rental-location lookup by 3-letter code or free-text search (airport, neighborhood, or downtown sites), vehicle-class browsing (such as economy, SUV, and minivan), pickup / return date–time car search with live pricing, end-to-end booking with primary-driver details and configurable mileage plans, promo-code and corporate-account discounts, authenticated and anonymous booking lookup, cancellation, and a recent-searches history, making it suitable for evaluating agents in car-rental and multi-leg travel-planning workflow scenarios. This environment is particularly important because car-rental workflows combine driver identity data (name, age, driver's license context), irreversible reservation actions, and sensitive corporate-account / payment inputs, creating realistic opportunities for both benign trip-logistics tasks and adversarial manipulation through spoofed confirmation codes, misleading vehicle-class labels, malicious promo or corporate codes, or prompt-injection attacks hidden inside location listings, vehicle descriptions, or search-session payloads.

Expedia
The Expedia environment simulates an online travel-agency platform that serves as an end-to-end lodging-reservation workspace for hotels, resorts, vacation rentals, and short-term apartments across multiple destinations. It supports rewards-account registration and login with authenticated profile lookup, destination browsing and free-text search, rich property search with per-night price, star rating, review rating, refundability, breakfast-included, pay-later, amenity and property-type filters, per-property detail pages covering rooms / amenities / reviews, authenticated and anonymous booking creation with guest contact details and optional promo codes, personal booking lists, confirmation-code lookup, cancellation by internal id or confirmation code, promo-code validation, a "pick up where you left off" continuation card, and persisted recent-search history, making it suitable for evaluating agents in travel-planning and lodging-reservation workflow scenarios.

Southwest Airlines
The Southwest Airlines environment simulates a low-cost-carrier airline's booking platform that serves as a trip-planning workspace for Southwest-style flight search and Rapid Rewards loyalty management. It supports account registration and login, airport lookup by IATA code or city, one-way and round-trip flight search across Southwest's signature Wanna Get Away / Wanna Get Away Plus / Anytime / Business Select fare classes, per-flight fare inspection, live flight-status checks by flight number and date, Low-Fare-Calendar month-view browsing with configurable length of stay, end-to-end booking with multi-passenger itineraries (including Rapid Rewards numbers), authenticated and anonymous booking lookup, cancellation, promo-code resolution, and a recent-searches history, making it suitable for evaluating agents in price-sensitive airline-reservation and itinerary-management workflow scenarios. This environment is particularly important because low-cost-carrier workflows combine fare-class optionality (basic vs. fully-flexible), calendar-style price exploration, and irreversible booking actions with personally identifiable passenger data, creating realistic opportunities for both benign trip-planning tasks and adversarial manipulation through spoofed confirmations, injected promo codes, malicious itinerary-change instructions, or unauthorized cancellation requests. Therefore, it provides a realistic testbed for assessing whether agents can safely distinguish legitimate booking assistance from harmful actions involving payments, travel identity data, loyalty accounts, and irreversible reservation management.

United Airlines
The United Airlines environment simulates a flag-carrier airline's booking platform that serves as a travel-planning workspace for flight search, reservation, and MileagePlus loyalty management. It supports account registration and login, airport lookup by IATA code or city, one-way / round-trip / multi-city flight search across several cabin classes, per-flight fare inspection, live flight-status checks by flight number and date, end-to-end booking with multi-passenger itineraries (including MileagePlus and Known Traveler numbers), authenticated and anonymous booking lookup, cancellation, travel-credit redemption, promo-code resolution, and a recent-searches history, making it suitable for evaluating agents in real-world airline-reservation and itinerary-management workflow scenarios. This environment is particularly important because flight-booking workflows combine structured transactional state (passenger identities, fare classes, payment, promo codes, travel credits) with high-stakes irreversible actions (confirmed bookings, cancellations) and sensitive personal data (dates of birth, travel documents), creating realistic opportunities for both benign trip-planning tasks and adversarial manipulation through malicious promo codes, spoofed confirmation codes, tampered passenger payloads, or prompt-injection attacks hidden inside airport listings, fare-class labels, or promo descriptions.
Coding
3 environments
Code-Terminal
We construct a sandboxed code-execution environment that provides coding agents with a fully functional Linux development workstation within a dedicated Docker container. The environment is equipped with a standard development toolchain, including Python 3 with `pip`, as well as common Unix utilities such as `wget` and `curl`. The container is reset between tasks, ensuring complete state isolation across evaluations.

GitHub
The GitHub environment simulates a collaborative software-development workspace for repository management, code review, and issue-tracking workflows. It supports repository navigation, issue inspection, pull-request review, and commit-history exploration, making it suitable for evaluating agents in development-centered workflow scenarios. This environment is particularly important because software repositories combine structured metadata with unstructured code, comments, and review discussions, creating realistic opportunities for both benign collaboration and adversarial manipulation.

GitLab
The GitLab environment simulates a collaborative software-development workspace for project management, issue tracking, and repository-centered workflows. It supports project navigation, issue inspection, board-based task tracking, and detailed issue review, making it suitable for evaluating agents in development-oriented workflow scenarios. This environment is particularly useful because GitLab-style project systems combine structured metadata with unstructured descriptions, comments, and workflow state, creating realistic opportunities for both benign coordination and adversarial manipulation.
Browser
1 environment
Research
1 environment
OS-Filesystem
1 environment
Windows
1 environment
macOS
1 environment
Finance
3 environments
Chase
The Chase environment simulates a consumer retail-banking portal modeled after JPMorgan Chase's online banking experience. It supports account login, balance and statement inspection, transaction history review, internal and external ACH transfers, bill pay with managed payees, debit- and credit-card management, and Zelle peer-to-peer payments with favorites, requests, and split flows, making it suitable for evaluating agents in high-stakes financial workflow scenarios. This environment is particularly important because banking interfaces combine structured account metadata (account numbers, balances, card numbers, routing information) with unstructured content such as transaction memos, payee notes, Zelle messages, and in-app notifications, creating realistic opportunities for both benign money-management tasks and adversarial manipulation through unauthorized-transfer prompts, fraudulent bill-payee injection, scam Zelle requests, or instructions hidden inside transaction descriptions and notifications.

Robinhood
The Robinhood environment simulates a commission-free retail brokerage platform covering both equities and cryptocurrencies. It supports account login, portfolio and position inspection, real-time stock and crypto quotes, market and limit order placement and cancellation, watchlist management, and cash transfers between the brokerage and a linked bank, making it suitable for evaluating agents in high-stakes financial-trading workflow scenarios. This environment is particularly important because brokerage interfaces combine structured account metadata (buying power, cash balance, positions, P&L) with time-sensitive market data and irreversible trading actions, and interleave real-time quotes with free-text order notes, discovery lists (top gainers, losers, trending tickers), and crypto venues, creating realistic opportunities for both benign investing workflows and adversarial manipulation through unsolicited stock / crypto recommendations, pump-and-dump framing, scam transfer targets, or prompt-injection attacks hidden inside order notes, discovery feed items, or market-data descriptions.

Yahoo Finance
The Yahoo Finance environment simulates a full-featured brokerage and wealth management platform for evaluating financial AI agents. The platform is built as a Flask web application serving real-time market data from the Finnhub API, with a multi-account portfolio management system supporting equity trading, options chains, and fund transfers. The environment provides six core pages: stock quote with integrated trading panel, portfolio dashboard with holdings tracking and recent activity, per-ticker news feeds with full article browsing, markets overview with major indices and top movers, options chain data with calls and puts, and a stock screener. Agents interact with the platform exclusively through MCP tools that extract simplified content from the rendered HTML pages, faithfully reproducing how a real brokerage assistant would process web-based financial data.
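As a rough illustration of what "extracting simplified content from rendered HTML" can look like, the sketch below strips a page down to its visible text using only the Python standard library. It is not the environment's actual tool code: the names `TextExtractor`, `simplify_html`, and `SAMPLE_PAGE` are hypothetical, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies (hypothetical helper)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty fragments.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def simplify_html(html: str) -> str:
    """Return one line of text per visible text fragment in the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample page, standing in for a rendered quote page.
SAMPLE_PAGE = """
<html><head><style>.q{color:red}</style></head>
<body><h1>ACME</h1><script>track()</script>
<p>Last price: <b>12.34</b></p></body></html>
"""

if __name__ == "__main__":
    print(simplify_html(SAMPLE_PAGE))
```

A design note under the same assumptions: any text that survives this kind of simplification, including news article bodies, reaches the agent verbatim, which is why page content is a plausible channel for indirect prompt injection.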



