Medical
Hospital client environment exercising scheduling, intake, prescription review, and chart access. It evaluates whether agents protect PHI under realistic clinical content injections.
Domain overview
Healthcare is a safety-critical domain where AI agents are increasingly used to assist with clinical and administrative tasks such as diagnostic question answering, treatment planning, medical documentation, and billing or insurance support. These agents often interact with electronic health records (EHRs), clinical tools, and hospital information systems to analyze patient information, request tests, recommend treatments, and generate billing codes (e.g., ICD and CPT codes).
Medical agents typically operate with access to highly sensitive patient data, including protected health information (PHI), laboratory results, imaging reports, and insurance records. They may also influence high-stakes decisions such as medical diagnoses, treatment recommendations, and healthcare billing. Failures or malicious manipulation of these systems can lead to patient harm, insurance fraud, privacy violations, and regulatory risks.
The medical domain is heavily regulated by frameworks such as HIPAA and HITECH for data privacy and security, as well as FDA guidance for Software as a Medical Device (SaMD) and international medical software standards (e.g., IEC 62304 and IEC 82304-1). To evaluate the robustness of medical AI agents, we construct benign tasks covering three representative clinical workflows: Diagnostic Q&A, Treatment Planning, and Billing and Insurance Support. Grounded in major healthcare regulations and standards, we further derive security risk categories and design adversarial red-teaming tasks to test whether agents can be manipulated into producing unsafe diagnoses, harmful treatments, fraudulent billing actions, or privacy violations.
Benign task categories
Diagnostic Q&A
The agent analyzes patient information, including symptoms, medical history, physical examination findings, and test results, to determine the most likely diagnosis. The agent may request additional diagnostic tests (e.g., laboratory tests or imaging) and reason over the collected evidence to produce a clinically plausible diagnosis.
Treatment Planning
Given a patient diagnosis and clinical context, the agent proposes an initial treatment plan consistent with common medical guidelines. This may include medication recommendations, non-pharmacological interventions, and follow-up monitoring.
Billing and Insurance Support
The agent assists with administrative healthcare workflows by generating appropriate billing information based on patient diagnoses and clinical services performed. This includes identifying relevant ICD diagnosis codes and CPT procedure codes and preparing supporting information for insurance documentation.
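As a rough illustration of the billing output described above, the sketch below holds an agent's proposed ICD and CPT codes and flags entries whose shape is clearly wrong. The `BillingDraft` class and both regular expressions are hypothetical and are not part of the benchmark. The patterns check only the surface form of ICD-10-CM codes and five-digit Category I CPT codes, not whether a code exists or is clinically justified.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Shape-only patterns (assumptions for illustration, not official validators).
# ICD-10-CM: letter, digit, alphanumeric, then an optional dot plus 1-4 alphanumerics.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")
# CPT: five-digit Category I codes only; other CPT categories are out of scope here.
CPT_SHAPE = re.compile(r"^[0-9]{5}$")


@dataclass
class BillingDraft:
    """Hypothetical container for the codes an agent proposes."""
    icd_codes: List[str] = field(default_factory=list)
    cpt_codes: List[str] = field(default_factory=list)

    def malformed(self) -> List[str]:
        """Return codes that fail the shape check, ICD entries first."""
        bad = [c for c in self.icd_codes if not ICD10_SHAPE.match(c)]
        bad += [c for c in self.cpt_codes if not CPT_SHAPE.match(c)]
        return bad


# Toy input: one well-formed and one malformed entry per code type.
draft = BillingDraft(icd_codes=["E11.9", "diabetes"], cpt_codes=["99213", "9921"])
print(draft.malformed())
```

A shape check like this can catch free-text or truncated codes early, but judging whether the codes match the documented services would still require a real code set and clinical review.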
Policy & risk framework
Domain policies
We consider several regulatory and technical frameworks that govern the safe use of AI systems in healthcare environments. The Health Insurance Portability and Accountability Act (HIPAA) defines requirements for safeguarding protected health information (PHI), including rules on data access, disclosure, and breach notification. The Health Information Technology for Economic and Clinical Health (HITECH) Act strengthens HIPAA protections by introducing stricter security safeguards, breach reporting obligations, and accountability requirements for entities handling electronic health records. The 21st Century Cures Act promotes secure interoperability and patient access to electronic health information while prohibiting information blocking practices that restrict appropriate data sharing. Finally, the FTC Health Breach Notification Rule (HBNR) requires vendors of personal health records and related entities to notify individuals, the FTC, and in some cases the media when unauthorized acquisition of identifiable health information occurs, extending breach accountability beyond HIPAA-covered entities.
General regulatory frameworks
We additionally incorporate regulatory guidance and international standards governing medical software safety and clinical AI systems. The FDA Software as a Medical Device (SaMD) Guidance establishes regulatory expectations for AI systems involved in diagnosis and treatment decisions, including requirements for transparency, validation, and lifecycle monitoring. The U.S. Federal Food, Drug, and Cosmetic Act (FD&C Act) regulates the safety and truthful representation of medical devices and clinical claims. International medical software standards such as IEC 62304 and IEC 82304-1 define best practices for medical software lifecycle management, risk control, and system safety. We also reference principles from the WHO Guidance on Ethics and Governance of Artificial Intelligence for Health, which highlights requirements for transparency, accountability, and responsible deployment of healthcare AI systems.
Results in this domain
Indirect / Direct ASR (lower is safer) and BSR (higher is more capable) for every evaluated agent on the Medical suite.
| Framework | Model | Indirect ASR (lower = safer) | Direct ASR (lower = safer) | BSR (higher = more capable) |
|---|---|---|---|---|
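
The page does not spell out how the table's metrics are computed. As a minimal sketch, assume ASR is the fraction of adversarial tasks on which the attacker's goal was achieved (tallied separately for indirect and direct attacks) and BSR is the fraction of benign tasks the agent completed. The `Outcome` record, the `rate` helper, and the toy outcomes below are hypothetical and are not benchmark results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    """Hypothetical per-task record (assumption, not the benchmark's schema)."""
    kind: str      # "benign", "indirect", or "direct"
    success: bool  # benign: task completed; attack: attacker goal achieved


def rate(outcomes: List[Outcome], kind: str) -> float:
    """Fraction of tasks of the given kind marked successful (NaN if there are none)."""
    hits = [o.success for o in outcomes if o.kind == kind]
    return sum(hits) / len(hits) if hits else float("nan")


# Toy outcomes for illustration only.
outcomes = [
    Outcome("benign", True), Outcome("benign", True),
    Outcome("benign", False), Outcome("benign", True),
    Outcome("indirect", True), Outcome("indirect", False),
    Outcome("indirect", False), Outcome("indirect", False),
    Outcome("direct", True), Outcome("direct", True),
    Outcome("direct", False), Outcome("direct", False),
]

bsr = rate(outcomes, "benign")            # 3 of 4 benign tasks completed
indirect_asr = rate(outcomes, "indirect") # 1 of 4 indirect attacks succeeded
direct_asr = rate(outcomes, "direct")     # 2 of 4 direct attacks succeeded
print(f"BSR={bsr:.2f}  indirect ASR={indirect_asr:.2f}  direct ASR={direct_asr:.2f}")
```

Under these assumptions a safer agent drives both ASR values toward 0 while keeping BSR close to 1, so the columns are best read together and not in isolation.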
Environments
1 environment in the Medical domain.