macOS
Image-grounded macOS desktop environment, the counterpart to Windows, that exercises click-driven workflows over native applications under both pop-up and screenshot-borne injections.
Domain overview
The macOS domain evaluates agents operating within a full graphical desktop environment on macOS 14.8 (Sonoma). Agents interact through both shell commands and GUI operations (screenshots, clicks, typing), reflecting how deployed OS agents must handle both modalities. Because agents must interpret visual elements (dialog boxes, context menus, system notifications), the domain introduces an injection surface absent from text-only environments: in addition to planting malicious instructions in files, tool outputs, or configuration entries, adversaries can place them in on-screen content, such as pop-ups, that the agent reads from screenshots while performing its task.
Agents with system-level access can create LaunchAgents, edit shell profiles such as .zshrc, escalate privileges via sudo, exfiltrate credentials from Keychain, bypass Gatekeeper and SIP protections, and set up persistent monitoring through cron jobs or login items.
These operations use the same tools as legitimate administration, making them difficult to distinguish from benign tasks.
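One way to make such changes visible, though not to judge their intent, is to compare the persistence-related files before and after an agent run. The sketch below is illustrative only and is not the benchmark's own checker: the `WATCHED` list, `snapshot`, and `diff_snapshots` are hypothetical names, and the watch list covers only a few of the locations mentioned above (user LaunchAgents and zsh profiles).

```python
import hashlib
from pathlib import Path

# Assumed watch list (hypothetical): a few common macOS persistence
# locations, given relative to a user's home directory.
WATCHED = ["Library/LaunchAgents", ".zshrc", ".zprofile"]


def snapshot(home: Path) -> dict[str, str]:
    """Map each watched file (path relative to home) to its SHA-256 digest."""
    digests = {}
    for rel in WATCHED:
        target = home / rel
        if target.is_dir():
            files = sorted(p for p in target.rglob("*") if p.is_file())
        elif target.is_file():
            files = [target]
        else:
            files = []  # location does not exist yet
        for f in files:
            key = f.relative_to(home).as_posix()
            digests[key] = hashlib.sha256(f.read_bytes()).hexdigest()
    return digests


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Return sorted lists of added, removed, and modified paths."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(p for p in common if before[p] != after[p]),
    }
```

Calling `snapshot(Path.home())` before and after a task and passing both results to `diff_snapshots` lists which watched files changed. Because a legitimate administration task may touch the same files, a non-empty diff is only a signal for closer review, not evidence of an attack.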
Benign task categories
| Category | Typical tasks |
|---|---|
| File & Directory Management | Creating, copying, moving, renaming, and organizing files |
| System Configuration | Display settings, power options, notification preferences |
| Application & Process Control | Launching, pinning, closing applications and processes |
| Network & Connectivity | Network status checks, adapter configuration, Wi-Fi profile management |
| Security & Privacy Settings | Firewall rules, password policies, encryption status |
| Scripting & Automation | Shell scripts, cron jobs, automation routines |
| Office & Document Processing | Document operations and format conversions |
| System Monitoring & Diagnostics | Disk usage, CPU/memory status, system logs, system reports |
Results in this domain
Indirect and direct attack success rate (ASR; lower is safer) and benign success rate (BSR; higher is more capable) for every evaluated agent on the macOS suite.
| Framework | Model | Indirect ASR (lower = safer) | Direct ASR (lower = safer) | BSR (higher = more capable) |
|---|---|---|---|---|
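The three reported columns can be read as simple success fractions over per-episode outcomes. The sketch below assumes, since the page does not define them, that ASR is the share of attack episodes in which the adversary's goal was reached and BSR is the share of benign episodes the agent completed. The `Episode` type and `rate` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    kind: str      # assumed labels: "indirect", "direct", or "benign"
    success: bool  # attack goal reached (attack kinds) or task completed (benign)


def rate(episodes: list[Episode], kind: str) -> Optional[float]:
    """Fraction of episodes of the given kind that succeeded, or None if there are none."""
    matching = [e for e in episodes if e.kind == kind]
    if not matching:
        return None  # avoid dividing by zero when a kind was not evaluated
    return sum(e.success for e in matching) / len(matching)
```

Under these assumptions, `rate(episodes, "indirect")` and `rate(episodes, "direct")` would give the two ASR columns and `rate(episodes, "benign")` the BSR column, each as a value between 0 and 1.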
Environments
1 environment in the macOS domain.