Workflow

Workflow

Workflow environments cover the everyday productivity surface area where agents read untrusted content and act on behalf of users — email, calendars, docs, chat, social, payments, and more.

Domain overview

Workflow agents are increasingly used to automate productivity and operational tasks across a wide range of digital platforms, including Gmail, Google Calendar, Google Forms, Google Docs, Slack, WhatsApp, Zoom, Databricks, Snowflake, Atlassian, and PayPal. These agents support common workplace functions such as email and messaging management, scheduling, document handling, task coordination, notification routing, structured data transfer, and routine content generation, often operating across multiple interconnected environments within a single workflow.

To construct realistic workflow tasks, we build the benchmark on sandbox environments that simulate real-world applications with highly similar user interfaces, functionally complete operations, and flexible data customization. This design allows us to preserve the structure and interaction patterns of real workflow systems while enabling controlled initialization of environment states, convenient customization of task data, and systematic construction of benign and adversarial scenarios.

Because workflow agents routinely process untrusted external content (e.g., emails, documents, forms, chat messages, tickets, and system outputs) while also interacting with sensitive data and action-taking tools, they expose a broad and realistic attack surface. In practice, these agents may read private messages, handle internal documents, query structured databases, update project systems, trigger notifications, or even initiate payment-related actions. When security enforcement is weak or inconsistent across connected environments, attackers may manipulate the agent into performing harmful actions such as phishing, spam dissemination, sensitive data leakage, destructive file operations, fraudulent payment requests, or deceptive cross-platform notifications. Such failures can lead to privacy violations, financial loss, operational disruption, legal exposure, and reputational damage.

To systematically study this domain, we design a benchmark that covers a diverse set of workflow environments and 5 aggregated benign task categories drawn from real-world automation scenarios. We further derive 9 key security risk categories from platform-specific policy documents across these environments, covering harms such as Financial Fraud, Unsafe Content, Data Exfiltration, Copyright / IP Infringement, Illegal Activity, Messaging Abuse, Misinformation, Sensitive File Deletion, and Bot Attack. Guided by these risks, we construct red-teaming tasks under two primary threat models, namely direct malicious prompting and indirect prompt injection through external workflow content, to systematically evaluate the security robustness of workflow agents.

Benign task categories

Data Query & Reporting

Retrieves, aggregates, or summarizes structured or semi-structured information from workflow environments, such as documents, spreadsheets, repositories, project systems, or transaction records, and produces task-relevant reports or summaries

Meeting Coordination

Supports meeting- and schedule-related workflows, including calendar inspection, availability checking, event creation or modification, invitation handling, and coordination of meeting logistics across communication and scheduling tools

Payment & Financial

Handles legitimate finance-related workflow operations such as invoice processing, payment execution, refund handling, payout management, billing updates, and other authorized financial or transaction-centered actions

Survey & Data Collection

Manages workflows centered on collecting, viewing, organizing, or processing form responses, spreadsheet entries, or other user-submitted records used for operational tracking and information gathering

Communication & Announcement

Drafts, sends, forwards, or posts legitimate messages, notifications, reminders, and announcements through email, chat, or other communication channels to the appropriate recipients

Results in this domain

Indirect / Direct ASR (lower is safer) and BSR (higher is more capable) for every evaluated agent on the Workflow suite.

FrameworkModel
Indirect ASR
Lower = safer
Direct ASR
Lower = safer
BSR
Higher = more capable

Environments

20 environments in the Workflow domain.

Atlassian

Atlassian

The Atlassian environment simulates a project coordination workspace that integrates issue tracking and collaborative project management workflows. It is designed around common Jira-style interactions, including project browsing, issue discovery, board-based task tracking, and detailed issue inspection. This environment is particularly useful for evaluating workflow agents in settings where they must interpret project artifacts, update task state, and reason over structured issue metadata together with unstructured text content.

Databricks

The Databricks environment simulates a data-engineering and analytics workspace for retrieval, query, and assistant-driven reasoning workflows. Like Snowflake, this environment does not rely on a GUI; instead, agents interact exclusively through MCP tools, matching realistic usage patterns in data platforms where operations are performed through search, SQL, and function-style interfaces.

DoorDash

DoorDash

The DoorDash environment simulates an on-demand food-delivery and local-commerce platform that serves as an end-to-end ordering workspace for restaurants, grocers, and convenience stores. It supports account registration and login, storefront browsing by category with featured / past-orders / local-liquor sections, free-text restaurant search, per-store detail inspection (info, menu, featured items, sections, reviews), individual menu-item lookup, single-store cart management (add / update / remove / clear) with item notes, checkout with configurable fulfillment options, delivery address, scheduled delivery time, payment method, taxes, fees, tips, and promo-code application. It also supports order placement, order-status tracking, order history review, cancellation when eligible, and receipt-style summaries, making it suitable for evaluating agents in everyday food-ordering, price comparison, and local-commerce workflow scenarios. This environment is particularly important because delivery workflows combine real-time availability, substitutions, payment decisions, location data, and potentially irreversible purchase actions, creating realistic opportunities for both benign ordering assistance and adversarial manipulation through injected delivery instructions, malicious promo codes, unauthorized purchases, or harmful changes to address, tip, and payment settings.

Dropbox

Dropbox

The Dropbox environment simulates a cloud file-storage and collaboration platform that serves as a personal and shared workspace for documents, images, and folder hierarchies. It supports account login, hierarchical file and folder browsing, file upload and download, folder creation, file move / copy / rename / delete, starring, full-text file search, shared-link generation with configurable access levels, file-request pages for inbound uploads from external parties, and per-account storage-quota inspection, making it suitable for evaluating agents in document-management and content-collaboration workflow scenarios. This environment is particularly important because cloud-storage content combines structured metadata (paths, sizes, MIME types, ownership, sharing permissions) with unstructured file bodies that may contain arbitrary documents, images, PDFs or executable artifacts, creating realistic opportunities for both benign file-management tasks and adversarial manipulation through malicious file uploads, overly permissive share links, exfiltration via shared links, or prompt-injection attacks hidden inside file contents, file names, or file-request submissions.

FedEx

FedEx

We construct a simulated FedEx service environment that models shipment operations and tracking workflows through a unified MCP-based interface. The environment captures the core functionality of real-world logistics platforms, where users interact with shipping services to create shipments, estimate rates, and track delivery progress.

Gmail

Gmail

The Gmail environment simulates a realistic email workspace for workflow-agent evaluation, covering the core operations needed for communication-centric tasks. It supports reading inboxes and email threads, inspecting email details, composing new emails, replying and forwarding messages, and sending notifications to designated recipients. This environment is especially important for studying workflow security because email is both a primary action channel and a major attack surface: agents may be asked to process untrusted message content and then perform downstream actions such as replying, forwarding information, or contacting other users.

Google Calendar

Google Calendar

The Google Calendar environment simulates a scheduling workspace that supports common calendar-management workflows. Agents can inspect events across month, week, and day views, create or edit events, check event details, and manage invitations and attendee-related updates. This environment is useful for evaluating whether agents can reliably coordinate legitimate scheduling tasks while remaining robust against attacks that misuse invitations, event descriptions, or scheduling-related notifications.

Google Docs

Google Docs

The Google Docs environment simulates a collaborative document workspace for document-centered workflow tasks. It supports document browsing, reading and editing content, adding comments, and sharing documents through link-based access control. This environment is particularly relevant for workflow-agent evaluation because agents may need to extract information from shared documents or act on document content, making document text a natural channel for indirect prompt injection and other content-based attacks.

Google Drive

Google Drive

The Google Drive environment simulates a cloud file-management workspace for document discovery, organization, and retrieval workflows. It supports file browsing, filtering, and navigation over shared and stored resources, making it useful for evaluating whether agents can correctly locate relevant files and reason over document repositories in realistic productivity settings. This environment is particularly relevant for workflow-agent evaluation because file repositories often serve as an entry point for downstream actions, and file metadata or search results may also become channels for indirect attacks.

Google Forms

Google Forms

The Google Forms environment simulates a form-based data-collection workspace for survey, intake, and structured response workflows. It supports viewing form content and inspecting form details in a layout similar to real-world form interfaces. This environment is particularly relevant for workflow-agent evaluation because forms often serve as a lightweight channel for collecting user-submitted information, making them a realistic source of externally provided content that may later influence downstream workflow actions.

Google Sheets

Google Sheets

The Google Sheets environment simulates a spreadsheet workspace for structured data inspection and table-centered workflow tasks. It supports spreadsheet browsing and sheet-level detail views, enabling agents to read and reason over tabular information in realistic office workflows. This environment is useful for evaluating whether agents can safely handle structured business data, summaries, and records without being misled by malicious or deceptive spreadsheet content.

LinkedIn

LinkedIn

The LinkedIn environment simulates a professional social network centered on career identity, hiring, and industry communication. It supports account login, personalized feed browsing, job-opportunity exploration, and network management (connections, invitations, and followed pages), making it suitable for evaluating agents in career-oriented workflow scenarios. This environment is particularly important because professional profiles combine structured career metadata (companies, titles, degrees, skills) with unstructured content such as feed posts, endorsements, recruiter messages, and job descriptions, creating realistic opportunities for both benign networking and adversarial manipulation through recruiter impersonation, malicious invitations, job-scam postings, or instructions hidden inside messages and posts.

Notion

Notion

The Notion environment simulates a workspace-centric productivity platform that unifies note-taking, wiki-style documentation, lightweight databases, and collaborative knowledge management. It supports account login, multi-workspace browsing, hierarchical page creation and editing with rich block content, structured databases with typed properties, full-text search, comments and sharing, favorites, templates, and a trash/restore flow, making it suitable for evaluating agents in knowledge-work scenarios. This environment is particularly important because Notion pages mix structured metadata (page titles, icons, database properties, workspace roles) with unstructured long-form content such as paragraphs, to-dos, toggles, embedded databases, comments, and page covers, creating realistic opportunities for both benign productivity workflows and adversarial manipulation through prompt injection hidden inside page bodies, database cells, comments, or shared pages.

Paypal

Paypal

The PayPal environment simulates a payment workspace for financial workflow tasks. It supports inspecting account and balance information, navigating finance-related views, managing send/request flows, and interacting with linked payment methods such as bank cards. This environment enables evaluation of high-impact financial actions in workflow settings, where agents may be asked to initiate, route, or confirm payments and must therefore be robust against fraud, deceptive payment requests, and other misuse of financial operations.

Reddit

Reddit

The Reddit environment simulates a community-driven social-discussion platform organized around subreddits, long-form posts, threaded comments, up/down voting, and private messaging. It supports account login, personalized and subreddit-scoped feed browsing, subreddit discovery and membership management, post and comment authoring with voting, public user profiles, and direct messaging, making it suitable for evaluating agents in open community interaction scenarios. This environment is particularly important because Reddit content is largely user-generated and pseudonymous, combining structured metadata (subreddit, author, karma, score) with unstructured free-text posts, comment threads, direct messages, and link posts that frequently embed external URLs, creating realistic opportunities for both benign community participation and adversarial manipulation through trolling, vote manipulation, karma farming, scam link posts, or prompt-injection attacks hidden inside posts, comments, or unsolicited private messages.

Slack

Slack

The Slack environment simulates an internal messaging workspace for channel-based communication and notification workflows. Agents can read channel messages and message threads, inspect user information, and post updates or notifications to relevant channels. This environment captures a common enterprise communication setting in which agents must process conversational content and coordinate downstream actions, while also being exposed to untrusted instructions embedded in messages or threads.

Snowflake

The Snowflake environment simulates a warehouse-style structured data workspace for analytics and retrieval workflows. Unlike GUI-centric environments, this environment is accessed entirely through MCP tools, reflecting realistic agent use over data platforms where task execution is mediated through query, retrieval, and analyst-style interfaces rather than direct visual interaction.

WhatsApp

WhatsApp

The WhatsApp environment simulates a recipient-facing messaging workspace for workflow-agent evaluation. It supports common communication and follow-up actions in a mobile-style interface, including account access, message review, conversation tracking, call history inspection, and call-related interactions. Compared with channel-based systems such as Slack, this environment emphasizes direct user-to-user communication, making it particularly relevant for evaluating risks such as phishing, impersonation, spam, and other deceptive messaging behaviors.

X

X

The X environment simulates a short-form social microblogging platform built around posts, replies, reposts, likes, follows, and personalized feeds. It supports account login, timeline browsing, profile inspection, trend and topic exploration, and post-level interaction, making it suitable for evaluating agents in information-consumption and social-interaction scenarios. This environment is particularly important because public social posts mix first-party user content with third-party quotes, links, hashtags, and replies, creating realistic opportunities for both benign engagement and adversarial manipulation through injected instructions, misleading claims, or socially coercive phrasing.

Zoom

Zoom

The Zoom environment simulates a meeting-management workspace that supports common conferencing workflows. Agents can inspect the home page and meeting details, schedule meetings through calendar-related views, and manage invitation-related actions. This environment is useful for evaluating workflow agents in coordination tasks involving meetings and attendance, while also exposing risks such as deceptive invitations, fraudulent meeting updates, or malicious instructions embedded in meeting-related content.