Agent-desktop is a cross-platform CLI for structured desktop automation, enabling AI agents to interact with native applications via accessibility APIs rather than pixel-based methods. It uses a progressive skeleton traversal to manage context size for LLMs.
Raw Developer Origin & Technical Request
Hacker News
May 2, 2026
I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 Stars on GH). I figured it was worth sharing here.
Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:
1. Take a screenshot
2. Have the model predict pixel coordinates
3. Click x,y
4. Take another screenshot
5. Repeat
That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.
But the OS already exposes structured UI information:
- macOS: Accessibility API
- Windows: UI Automation
- Linux: AT-SPI
Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.
So I built a desktop equivalent: agent-desktop.
It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.
A typical loop looks like this:
agent-desktop snapshot --app Slack -i --compact
agent-desktop click @e12
agent-desktop type @e5 "ship it"
agent-desktop press cmd+return
So the loop becomes:
1. Snapshot
2. Decide
3. Act
4. Snapshot again
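As a rough illustration, the snapshot/decide/act loop above could be driven from Python along these lines. This is a sketch under assumptions, not the project's code: `agent_loop`, `cli_runner`, and the `decide` callback are hypothetical names, and the shape of the snapshot JSON is not specified in the post, so the sketch passes the parsed JSON to `decide` without inspecting it. Only the CLI invocations themselves are taken from the commands shown above.

```python
import json
import subprocess
from typing import Callable, List, Optional, Sequence

# A runner takes an argv list and returns the command's stdout as text.
Runner = Callable[[Sequence[str]], str]


def cli_runner(argv: Sequence[str]) -> str:
    """Run the real CLI. Assumes `agent-desktop` is installed and on PATH."""
    result = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return result.stdout


def agent_loop(
    app: str,
    decide: Callable[[object], Optional[Sequence[str]]],
    run: Runner = cli_runner,
    max_steps: int = 10,
) -> List[List[str]]:
    """Snapshot, decide, act, snapshot again. Returns the actions taken."""
    history: List[List[str]] = []
    for _ in range(max_steps):
        raw = run(["agent-desktop", "snapshot", "--app", app, "-i", "--compact"])
        snapshot = json.loads(raw)   # the post says commands emit JSON
        action = decide(snapshot)    # e.g. ["click", "@e12"], or None when done
        if action is None:
            break
        run(["agent-desktop", *action])
        history.append(list(action))
    return history
```

In a real agent, `decide` would be the model call. Injecting `run` keeps the loop testable without the binary installed.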
The main design problem was context size.
A naive approach would dump the full accessibility tree into the model, but real apps get huge. Slack can easily exceed 50,000 tokens for a full tree dump, which makes the approach impractical.
The approach I ended up using is progressive skeleton traversal:
- First pass: return a shallow tree, typically depth 3, with deeper containers truncated and annotated with children_count
- Named containers get references so the agent can request only that subtree
- The agent drills down into the relevant region with --root @e3
- References are scoped and invalidated only for that subtree
- After acting, the agent can re-query just that region instead of re-snapshotting the whole app
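To make the traversal concrete, here is a minimal in-memory sketch of the idea described in the list above: cut the tree at a fixed depth, annotate truncated containers with `children_count`, and give named containers a ref the agent can drill into. The `Node` type, the `skeleton` function, and the output dict layout are assumptions made for illustration; the tool's actual data model and ref numbering may differ. In particular, this sketch renumbers refs from `@e1` on every call, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    role: str
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def skeleton(root: Node, max_depth: int = 3) -> Tuple[dict, Dict[str, Node]]:
    """Return (tree, refs): a depth-limited view plus a ref -> Node map."""
    refs: Dict[str, Node] = {}

    def visit(node: Node, depth: int) -> dict:
        out: dict = {"role": node.role}
        if node.name:
            out["name"] = node.name
        # Named containers get a ref so the agent can request just that subtree.
        if node.children and node.name:
            ref = f"@e{len(refs) + 1}"
            refs[ref] = node
            out["ref"] = ref
        if node.children and depth >= max_depth:
            # Truncate here: report how much is hidden instead of expanding it.
            out["children_count"] = len(node.children)
        elif node.children:
            out["children"] = [visit(child, depth + 1) for child in node.children]
        return out

    return visit(root, 1), refs


# Drilling down is a second call rooted at a ref from the first pass,
# playing the role of `--root @e3`:
#   tree, refs = skeleton(app_root)
#   subtree, _ = skeleton(refs["@e3"])
```

The saving comes from the first pass staying small regardless of how deep the app's tree is; only the region the agent asks about is ever expanded.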
In practice, this reduced token usage by about 78% to 96% versus full-tree dumps in Electron apps like Slack, VS Code, and Notion.
A few implementation details that may be interesting here:
- Rust workspace with strict platform/core separation through a PlatformAdapter trait
- Accessibility-first activation chain; mouse synthesis is the fallback, not the default
- Deterministic element refs like @e1, @e2, with optimistic re-identification across UI shifts
- Structured errors with machine-readable codes plus retry suggestions
- C ABI via cdylib, so it can be loaded directly from Python, Swift, Go, Node, Ruby, or C without shelling out
- Batch operations in a single call
- Support for windows, menus, sheets, popovers, alerts, and notifications
- Special handling for Chromium/Electron accessibility trees, which can get very deep and noisy
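Two of the points above, the accessibility-first activation chain and structured errors with retry suggestions, can be sketched together. This is an illustration only: the strategy names, the `ACTION_FAILED` code, and the result layout are hypothetical and are not taken from the tool's real output.

```python
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionError:
    code: str        # machine-readable code (hypothetical value below)
    message: str     # human-readable detail
    suggestion: str  # what the agent could try next


# A strategy is a (name, attempt) pair; attempt returns True on success.
Strategy = Tuple[str, Callable[[str], bool]]


def activate(ref: str, strategies: List[Strategy]) -> dict:
    """Try each strategy in order; report which one worked, or a structured error."""
    tried: List[str] = []
    for name, attempt in strategies:
        tried.append(name)
        if attempt(ref):
            return {"ok": True, "ref": ref, "via": name}
    error = ActionError(
        code="ACTION_FAILED",
        message=f"no strategy could activate {ref} (tried: {', '.join(tried)})",
        suggestion="re-snapshot and retry with a fresh ref",
    )
    return {"ok": False, "error": asdict(error)}


# Accessibility action first, synthetic mouse click only as the fallback.
chain: List[Strategy] = [
    ("accessibility_press", lambda ref: False),  # pretend the element exposes no press action
    ("mouse_click", lambda ref: True),
]
result = activate("@e12", chain)
```

Ordering the chain this way means the mouse is only synthesized when the semantic action is unavailable, and a failure still returns something an LLM can act on rather than free-form text.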
Why I think this matters: pixel-based desktop control feels like a leaky abstraction. The OS already knows the UI semantically. Accessibility APIs give you roles, names, actions, hierarchy, focus, selection, and state directly. That seems like a much better substrate for desktop agents than screenshot loops.
If you're building your own desktop agent, internal automation tool, or research prototype, this may be useful.
Install:
npm install -g agent-desktop
agent-desktop snapshot --app Finder -i
Repo: github.com/lahfir/agent-desk...
I'd especially love feedback from people who've built desktop automation before. What are the biggest pain points you've run into, and what would you want a tool like this to support?