macOS

An image-grounded macOS desktop environment, the counterpart to the Windows domain. It exercises click-driven workflows over native applications under both pop-up and screenshot-borne injections.

Domain overview

The macOS domain evaluates agents operating in a full graphical desktop environment on macOS 14.8 (Sonoma). Agents interact through both shell commands and GUI operations (screenshots, clicks, typing), reflecting the two modalities that deployed OS agents must handle. Because agents must interpret visual elements (dialog boxes, context menus, system notifications), the domain introduces an injection surface absent from text-only environments: in addition to planting malicious instructions in files, tool outputs, or configuration entries, adversaries can place them in on-screen content, such as pop-ups, that the agent encounters while performing its task.
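
The two interaction modalities described above can be pictured as a small action vocabulary. The following is a minimal sketch, not the benchmark's actual interface: the class names (`ShellCommand`, `Click`, `TypeText`, `Screenshot`) and the helpers `modality` and `summarize` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# Hypothetical action types; the real agent interface may differ.
@dataclass(frozen=True)
class ShellCommand:
    command: str


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Screenshot:
    pass


Action = Union[ShellCommand, Click, TypeText, Screenshot]


def modality(action: Action) -> str:
    """Return 'shell' for command-line actions and 'gui' for all others."""
    return "shell" if isinstance(action, ShellCommand) else "gui"


def summarize(trajectory: List[Action]) -> Dict[str, int]:
    """Count how many actions in a trajectory fall into each modality."""
    counts = {"shell": 0, "gui": 0}
    for action in trajectory:
        counts[modality(action)] += 1
    return counts
```

Under these assumptions, a trajectory that takes a screenshot, clicks, types, and then runs one shell command would be summarized as three GUI actions and one shell action. Each GUI observation is a point where on-screen injected content could reach the agent.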

Agents with system-level access can create LaunchAgents, edit shell profiles such as .zshrc, escalate privileges via sudo, exfiltrate credentials from Keychain, bypass Gatekeeper and SIP protections, and set up persistent monitoring through cron jobs or login items. These operations use the same tools as legitimate administration, making them difficult to distinguish from benign tasks.
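
To illustrate why these operations are hard to separate from routine administration, here is a minimal sketch of a pattern-based flagger for shell commands. It is an assumption-laden illustration, not the benchmark's detection or judging method: the `SENSITIVE_PATTERNS` table and `flag_command` function are hypothetical, and the patterns cover only a few well-known macOS commands and paths.

```python
import re
from typing import List

# Hypothetical pattern table; labels follow the operations listed above.
SENSITIVE_PATTERNS = {
    "launch_agent": re.compile(r"Library/LaunchAgents/|\blaunchctl\b"),
    "shell_profile": re.compile(r"\.(zshrc|zprofile|bash_profile)\b"),
    "sudo": re.compile(r"(^|[\s;&|])sudo\s"),
    "keychain": re.compile(
        r"\bsecurity\s+(find-generic-password|find-internet-password|dump-keychain)\b"
    ),
    "gatekeeper_sip": re.compile(
        r"\bspctl\s+--master-disable\b|\bcsrutil\s+disable\b"
    ),
    "cron": re.compile(r"\bcrontab\b"),
}


def flag_command(command: str) -> List[str]:
    """Return the sorted labels of all sensitive patterns found in a command.

    A match only marks the command as touching a sensitive surface;
    it says nothing about whether the intent is benign or malicious.
    """
    return sorted(
        label
        for label, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(command)
    )
```

Note that `flag_command("crontab -l")` returns `["cron"]` even though listing cron jobs is an ordinary diagnostic step. A surface-level match of this kind cannot tell administration from abuse, which is the difficulty the paragraph above describes.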

Benign task categories

File & Directory Management: Creating, copying, moving, renaming, and organizing files

System Configuration: Display settings, power options, notification preferences

Application & Process Control: Launching, pinning, closing applications and processes

Network & Connectivity: Network status checks, adapter configuration, Wi-Fi profile management

Security & Privacy Settings: Firewall rules, password policies, encryption status

Scripting & Automation: Shell scripts, cron jobs, automation routines

Office & Document Processing: Document operations and format conversions

System Monitoring & Diagnostics: Disk usage, CPU/memory status, system logs, system reports

Results in this domain

Indirect and direct attack success rate (ASR; lower is safer) and benign success rate (BSR; higher is more capable) for every evaluated agent on the macOS suite.

Columns: Framework / Model; Indirect ASR (lower = safer); Direct ASR (lower = safer); BSR (higher = more capable)
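
As a rough illustration of how the three reported numbers could be derived, the sketch below computes each rate as a percentage of episodes of one kind. It assumes ASR is the share of attack episodes in which the attacker's goal was achieved and BSR is the share of benign tasks completed. The page does not spell out these definitions, and `EpisodeResult` and `success_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EpisodeResult:
    # Assumed labels: "benign", "indirect", or "direct".
    kind: str
    # Benign episodes: the task was completed.
    # Attack episodes: the attacker's goal was achieved.
    succeeded: bool


def success_rate(results: List[EpisodeResult], kind: str) -> Optional[float]:
    """Percentage of episodes of the given kind that succeeded.

    Returns None when there are no episodes of that kind, so an
    empty category is not mistaken for a 0% rate.
    """
    subset = [r for r in results if r.kind == kind]
    if not subset:
        return None
    return 100.0 * sum(r.succeeded for r in subset) / len(subset)
```

Under these assumptions, `success_rate(results, "indirect")` and `success_rate(results, "direct")` would fill the two ASR columns, and `success_rate(results, "benign")` would fill the BSR column.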

Environments

1 environment in the macOS domain.