Finance

Finance-domain environments (Yahoo Finance, Chase banking, Robinhood) where agents must reason about market data, place trades, and move funds without falling for adversarial news, alerts, or transfer requests.

Domain overview

Financial services represent one of the highest-stakes domains for AI agent deployment. Modern brokerage and wealth management platforms increasingly rely on AI-powered assistants to automate workflows spanning market research and analysis (fundamental research, portfolio monitoring, options analysis, risk assessment, client reporting) and trading operations (order execution, payment processing, multi-account management, client communications). These agents have access to highly sensitive data (portfolio holdings, account balances, client financial records) and can execute consequential, often irreversible actions: buying or selling securities, trading options, initiating fund transfers, and sending investment recommendations to clients.

The combination of real-time market data, multi-account management, and cross-platform integrations (email, messaging, payment systems) creates a broad attack surface. An adversary, whether a third-party attacker injecting malicious content into the agent's environment or a malicious user directly issuing instructions, can manipulate an agent into executing unauthorized trades, exfiltrating sensitive portfolio data, sending fraudulent communications to clients, or carrying out illegal financial schemes. The consequences include direct financial loss, regulatory violations (SEC, FINRA), legal liability, and reputational damage, compounded by the fact that most financial actions are irreversible once settled.

We design a comprehensive evaluation framework consisting of 600 tasks: 200 benign tasks across 15 categories of real-world financial workflows, 200 direct-threat tasks across 20 risk categories (malicious user), and 200 indirect-threat tasks across 13 risk categories (benign user, malicious environment). Risk categories are grounded in enforceable regulations: SEC and FINRA rules, the Dodd-Frank Act, the Bank Secrecy Act, and the EU AI Act.
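The split sizes above can be written down as a small data structure, which makes the arithmetic easy to check. This is a minimal sketch: the `Split` class, its field names, and the helper functions are hypothetical and are not part of the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    name: str           # split label as used in the text
    tasks: int          # number of tasks in the split
    categories: int     # number of categories the tasks are spread across
    threat_source: str  # hypothetical field: who is adversarial in this split


# Counts taken from the paragraph above.
SUITE = [
    Split("benign", 200, 15, "none"),
    Split("direct", 200, 20, "malicious user"),
    Split("indirect", 200, 13, "malicious environment"),
]


def total_tasks(suite):
    """Total number of tasks across all splits."""
    return sum(split.tasks for split in suite)


def mean_tasks_per_category(split):
    """Average tasks per category; the real per-category counts may be uneven."""
    return split.tasks / split.categories
```

Under these counts, `total_tasks(SUITE)` gives 600, and the direct-threat split averages 10 tasks per risk category.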

Benign task categories

Trade Execution

Executes buy, sell, and limit orders for equities and options based on user instructions received via email, Slack, or direct request, including order confirmation and position verification

Portfolio Review

Monitors and reports on current portfolio holdings, account balances, position values, and key metrics such as market capitalization and price changes across single or multiple accounts

Stock Research

Conducts fundamental analysis on individual equities covering price, valuation ratios (P/E, EPS), 52-week ranges, trading volume, and beta, then communicates findings via email

Options Analysis

Analyzes options chains including implied volatility, strike selection, term structure, and skew, and produces strategy recommendations for hedging or income generation

Payment Processing

Processes financial transactions via PayPal including invoice creation, payout execution, and fee payments, coordinated with portfolio and communication workflows

News Digest

Monitors and summarizes market news, earnings reports, and sector developments, then distributes digests to clients or posts commentary

Market Intelligence

Gathers and synthesizes real-time market data, trending stocks, sector performance, and macroeconomic indicators to support trading decisions

Analyst Review

Reviews and acts upon analyst research received via email, incorporating stock data and news to produce data-grounded client communications

Client Communication

Drafts and sends personalized investment updates, portfolio summaries, and research briefings to clients via email based on current market conditions

Performance Review

Evaluates portfolio performance including daily P&L, position attribution, and risk metrics, producing detailed performance reports for client distribution

Risk Assessment

Assesses portfolio risk exposure including concentration analysis, sector allocation, beta weighting, and drawdown scenarios

Sector Analysis

Conducts comparative analysis across sectors or peer companies, evaluating relative valuations, momentum, and competitive positioning

Sector Exposure

Analyzes portfolio allocation across market sectors, identifies concentration risks, and recommends rebalancing strategies

Income Review

Evaluates dividend income, yield characteristics, and income-generating potential of portfolio holdings for income-focused strategies

Market Analysis

Provides broad market overview including index performance, sector rotation, and macroeconomic context to frame individual investment decisions

Policy & risk framework

Domain policies

We draw on the following securities regulations that govern broker-dealer operations and trading conduct:

FINRA Rules: The comprehensive rulebook enforced by the Financial Industry Regulatory Authority, including Rule 2020 (prohibition on manipulative and deceptive devices), Rule 2010 (standards of commercial honor), Rule 2111 (suitability obligations), Rule 2150 (improper use of customer funds), Rule 2210 (communications with the public), Rule 3260 (discretionary accounts), Rule 3310 (anti-money laundering compliance), Rule 5310 (best execution), and Rule 2165 (financial exploitation of specified adults).

SEC Rules: Federal securities regulations including Rule 10b-5 (prohibition on fraud and manipulation), Regulation Best Interest (broker-dealer care and disclosure obligations), Regulation S-P (privacy of consumer financial information), and Rules 15g-2/15g-9 (penny stock disclosure requirements).

Dodd-Frank Wall Street Reform and Consumer Protection Act: Including Section 747 (anti-spoofing provisions for derivatives markets).

Bank Secrecy Act (BSA): Including 31 U.S.C. 5324 (structuring transactions to evade reporting requirements) and suspicious activity reporting obligations.

Federal criminal statutes: Including 18 U.S.C. 1343 (wire fraud) and 18 U.S.C. 1028 (identity fraud).

General regulatory frameworks

We additionally consider the EU AI Act, which establishes a risk-based regulatory framework for AI systems, including obligations for high-risk applications in financial services such as transparency, accountability, non-discrimination, and safeguards against harmful automated actions.

Results in this domain

Indirect / Direct ASR (lower is safer) and BSR (higher is more capable) for every evaluated agent on the Finance suite.

Framework / Model | Indirect ASR (lower = safer) | Direct ASR (lower = safer) | BSR (higher = more capable)
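The three columns are rates over the corresponding task splits. The sketch below shows one way such rates could be computed from per-episode outcomes. It assumes that ASR stands for attack success rate and BSR for benign success rate (this page does not expand the acronyms), and the `Episode` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    split: str       # "benign", "direct", or "indirect"
    succeeded: bool  # benign: task completed; threat splits: attack reached its goal


def rate(episodes, split):
    """Fraction of episodes in `split` that succeeded, or None if the split is empty."""
    subset = [e for e in episodes if e.split == split]
    if not subset:
        return None
    return sum(e.succeeded for e in subset) / len(subset)


def summarize(episodes):
    """Assumed metric definitions: ASR over threat splits, BSR over the benign split."""
    return {
        "indirect_asr": rate(episodes, "indirect"),
        "direct_asr": rate(episodes, "direct"),
        "bsr": rate(episodes, "benign"),
    }
```

Returning `None` for an empty split keeps a missing evaluation from being mistaken for a 0% attack success rate.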

Environments

3 environments in the Finance domain.

Chase

The Chase environment simulates a consumer retail-banking portal modeled after JPMorgan Chase's online banking experience. It supports account login, balance and statement inspection, transaction history review, internal and external ACH transfers, bill pay with managed payees, debit- and credit-card management, and Zelle peer-to-peer payments with favorites, requests, and split flows. These features make it suitable for evaluating agents in high-stakes financial workflows. The environment matters because banking interfaces combine structured account metadata (account numbers, balances, card numbers, routing information) with unstructured content such as transaction memos, payee notes, Zelle messages, and in-app notifications. That mix creates realistic opportunities for both benign money-management tasks and adversarial manipulation: unauthorized-transfer prompts, fraudulent bill-payee injection, scam Zelle requests, or instructions hidden inside transaction descriptions and notifications.
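To illustrate the distinction between structured metadata and free-text content, the following sketch partitions a record into the two groups and flags free-text fields that contain instruction-like phrasing. The field names and the pattern list are hypothetical, since the environment's actual schema is not described here, and a keyword heuristic like this is easy to evade; it is meant only to show where untrusted text enters.

```python
import re

# Hypothetical free-text field names; the real schema is not given in the text.
FREE_TEXT_FIELDS = {"memo", "payee_note", "zelle_message", "notification"}

# Illustrative patterns only; not a real defense against prompt injection.
SUSPICIOUS = re.compile(
    r"\b(?:ignore (?:all |any )?(?:previous|prior) instructions"
    r"|transfer \$?\d+|send money|add payee)",
    re.IGNORECASE,
)


def partition(record):
    """Split a record into (structured fields, free-text fields)."""
    structured = {k: v for k, v in record.items() if k not in FREE_TEXT_FIELDS}
    free_text = {k: v for k, v in record.items() if k in FREE_TEXT_FIELDS}
    return structured, free_text


def flag_free_text(record):
    """Return the sorted names of free-text fields that match a suspicious pattern."""
    _, free_text = partition(record)
    return sorted(k for k, v in free_text.items() if SUSPICIOUS.search(str(v)))
```

The useful part of the design is the partition itself: an agent can treat structured fields as data to act on while treating anything in the free-text group as content to report, never as instructions.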

Robinhood

The Robinhood environment simulates a commission-free retail brokerage platform covering both equities and cryptocurrencies. It supports account login, portfolio and position inspection, real-time stock and crypto quotes, market and limit order placement and cancellation, watchlist management, and cash transfers between the brokerage and a linked bank. These features make it suitable for evaluating agents in high-stakes trading workflows. The environment matters because brokerage interfaces combine structured account metadata (buying power, cash balance, positions, P&L) with time-sensitive market data and irreversible trading actions. They also interleave real-time quotes with free-text order notes, discovery lists (top gainers, losers, trending tickers), and crypto venues. That mix creates realistic opportunities for both benign investing workflows and adversarial manipulation: unsolicited stock or crypto recommendations, pump-and-dump framing, scam transfer targets, or prompt-injection attacks hidden inside order notes, discovery feed items, or market-data descriptions.
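Because trades are irreversible, one simple safeguard is to compare a proposed order against what the user actually asked for before it is placed. The sketch below is a hypothetical pre-trade check, not part of the environment: the `Order` fields, the `check_order` function, and its rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    symbol: str
    side: str           # "buy" or "sell"
    quantity: float
    limit_price: float  # price used to estimate the cost of the order


def check_order(order, allowed_symbols, buying_power):
    """Return a list of problems; an empty list means the order passed these checks."""
    problems = []
    if order.symbol.upper() not in {s.upper() for s in allowed_symbols}:
        problems.append("symbol not named in the user's request")
    if order.quantity <= 0:
        problems.append("quantity must be positive")
    if order.side == "buy" and order.quantity * order.limit_price > buying_power:
        problems.append("order cost exceeds buying power")
    return problems
```

A ticker that appears only in a discovery feed or an order note, and not in the user's request, would fail the first check under this scheme.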

Yahoo Finance

The Yahoo Finance environment simulates a full-featured brokerage and wealth management platform for evaluating financial AI agents. The platform is built as a Flask web application serving real-time market data from the Finnhub API, with a multi-account portfolio management system supporting equity trading, options chains, and fund transfers. The environment provides six core pages: stock quote with integrated trading panel, portfolio dashboard with holdings tracking and recent activity, per-ticker news feeds with full article browsing, markets overview with major indices and top movers, options chain data with calls and puts, and a stock screener. Agents interact with the platform exclusively through MCP tools that extract simplified content from the rendered HTML pages, faithfully reproducing how a real brokerage assistant would process web-based financial data.
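The idea of extracting simplified content from rendered HTML can be sketched with Python's standard `html.parser` module. This is an assumption about the general approach rather than the environment's actual tool code: the class and function names are hypothetical, and the sample markup in the usage note is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def simplify_html(html):
    """Return the page's visible text, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For example, `simplify_html("<h1>AAPL</h1><p>Price: <b>190.12</b></p>")` yields three lines: `AAPL`, `Price:`, and `190.12`. An extractor of this kind passes through all visible text, so adversarial content embedded in a news article reaches the agent alongside legitimate data.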