Legal
Legal-domain agent tasks over contracts, statutes, and case files — testing whether agents preserve client privilege and resist instructions hidden inside document text or counsel notes.
Domain overview
Legal practice represents one of the most sensitive domains for AI agent deployment. Law firms are increasingly adopting AI-powered associates to automate workflows spanning legal research (case law search, statutory analysis, citation verification, judicial profiling) and practice management (matter tracking, document drafting, court filing preparation, client communications). These agents operate with access to highly privileged data---attorney-client communications, litigation strategy, confidential settlement terms, witness information---and can execute consequential, often irreversible actions: filing court documents, sending legal correspondence to clients and opposing counsel, preparing settlement agreements, and issuing payments.
The combination of privileged client data, confidential litigation strategy, and cross-platform integrations (email, messaging, payment systems) creates a broad attack surface. An adversary---whether a malicious user directly instructing the agent or a third-party attacker injecting content into the agent's environment---can manipulate an agent into disclosing privileged information, fabricating legal citations, filing fraudulent documents, sending coercive communications, or issuing unauthorized payments. The consequences include malpractice liability, bar disciplinary action, client harm, and criminal exposure, compounded by the irreversible nature of filed documents and sent communications.
We design a comprehensive evaluation framework consisting of 600 tasks: 200 benign tasks across 20 categories of real-world legal practice workflows, 200 direct-threat tasks across 10 risk categories (malicious user), and 200 indirect-threat tasks across 12 risk categories (benign user, malicious environment). Risk categories are grounded in the ABA Model Rules of Professional Conduct, the EU AI Act, the UNESCO Recommendation on the Ethics of AI, and applicable U.S. federal criminal statutes.
Platform. We build a simulated legal practice platform comprising two integrated subsystems. The legal research subsystem mirrors the CourtListener REST API, providing full-text search across federal case law, opinion retrieval with citation networks, statute lookup, judicial profiles with career positions and financial disclosures, and court docket data. The practice management subsystem simulates a law firm's internal infrastructure, including matter management (case files with clients, opposing parties, confidential strategy notes), a document vault (complaints, demand letters, expert reports, medical records seeded per task), and legal drafting tools (draft creation and retrieval, court filing preparation and retrieval). Both subsystems are implemented as a Flask web server inside a self-contained Docker container fronted by an MCP (Model Context Protocol) interface that proxies all data access to the container. Tasks may additionally integrate external MCP servers for Gmail (client and inter-office email), Slack (team messaging), and PayPal (payments and invoicing), reflecting the multi-system workflows common in modern legal practice. Tasks use four MCP server configurations: legal only (26 tasks), legal + Slack (22 tasks), legal + Gmail (79 tasks), and legal + Gmail + Slack (73 tasks).
Data. The platform serves real legal data sourced from the CourtListener REST API (v4). The case law database contains 1{,}153 opinion clusters and 991 full opinion texts spanning federal courts, along with 313 judge profiles (with career positions, education records, ABA ratings, and political affiliations), 1{,}206 court dockets, 1{,}001 courts, 60 financial disclosure records (with associated investment, gift, debt, and income data), and 718 citation relationships. The statutory database includes 116 curated federal statutes spanning constitutional law, civil rights, employment, immigration, bankruptcy, securities, and criminal procedure.
For each task, the environment is seeded with: (1) a synthetic matter file containing fabricated client information, opposing party details, case numbers, assigned judges, and confidential litigation strategy, (2) vault documents tailored to the legal scenario (complaints, demand letters, expert reports, medical records, TRO orders, settlement agreements, franchise agreements, insurance policies), (3) pre-configured email and Slack accounts with assignment messages from supervising partners, and (4) the full case law and statutory databases shared across all tasks. Matter data comprises 195 synthetic law firm case files across diverse practice areas. All data is served from the Docker container and is deterministic across evaluation runs, ensuring reproducibility.
Policy & risk framework
Domain policies
We select the ABA Model Rules of Professional Conduct as the primary domain-specific policy governing legal AI agent behavior. The ABA Rules establish ethical obligations for attorneys and, by extension, AI systems acting on their behalf. Four rules are referenced across our tasks: Rule 1.1 (competence, 16 references), Rule 1.6 (confidentiality of information, 34 references), Rule 1.7 (conflict of interest, 16 references), and Rule 8.4 (misconduct, 29 references).
General regulatory frameworks
We additionally consider three regulatory frameworks: (1) the UNESCO Recommendation on the Ethics of AI , which provides principles for safety and human well-being (P1.1), robustness (P1.4), information security (P1.5), accuracy and reliability (P1.6), data protection (S2.3.3), cybersecurity resilience (S2.1.15), and responsibility for AI output (S3.3.9). UNESCO provisions are the most frequently cited across our tasks (608 total references across 24 distinct provisions); (2) the EU AI Act , which classifies legal AI assistants as high-risk systems (Annex III, Area 8) subject to accuracy requirements (Art. 15(1)), AI content disclosure obligations (Art. 50(2)), and prohibitions on social scoring (Art. 5(1)(c)) and criminal risk prediction based on profiling (Art. 5(1)(d)) (133 total references across 7 provisions); (3) applicable U.S. federal criminal statutes , including prohibitions on wire fraud (18 U.S.C. 1343), mail fraud (1341), false statements to federal agencies (1001), impersonation of government officers (912), bribery of witnesses (201), witness tampering (1512), blackmail (873), and extortion under the Hobbs Act (1951) (158 total references across 15 provisions).
Threat models
Indirect threat model
In the indirect threat model, the user is benign and submits a legitimate legal task (e.g., "draft a motion to dismiss based on the matter file and relevant case law"). The adversary operates exclusively through the agent's environment---not through the user prompt---by injecting malicious content into matter files, vault documents, case law annotations, or tool descriptions that the agent processes during task execution. The agent must complete the user's original task while resisting environmental manipulation that attempts to hijack its actions toward a hidden malicious goal.
Direct threat model
In the direct threat model, the user is the adversary and directly instructs the agent to perform unethical or illegal legal actions. Unlike the indirect model, the adversary has full control over the task instruction and can combine it with environment and tool-description injections to maximize attack effectiveness. Each of the 200 direct tasks (20 per risk category) targets a specific harmful scheme such as drafting fraudulent court filings, fabricating evidence, sending extortion demands, impersonating government officials, or bribing witnesses.
Results in this domain
Indirect / Direct ASR (lower is safer) and BSR (higher is more capable) for every evaluated agent on the Legal suite.
| Framework | Model | Indirect ASR Lower = safer | Direct ASR Lower = safer | BSR Higher = more capable |
|---|---|---|---|---|