
Clawd Cursor
Safe desktop control for any AI agent. Compiles the screen into a UI map and acts on elements by stable id (screenshot/vision only as a last resort), verifies its own actions, and gates everything through one safety checkpoint. Local · cross-OS · any model.
What it is
Clawd Cursor is a local MCP server that gives any tool-calling agent (Claude Code, Cursor, Windsurf, OpenClaw, the Claude Agent SDK, or your own loop) safe control of the real desktop. It clicks, types, reads the screen, opens apps, and drives any GUI the way a human would: native apps, the browser, even a canvas.
Most "let an agent use the computer" tools take a screenshot and feed it to a vision model, which is slow, expensive, and brittle. Clawd Cursor compiles the screen into one UI map: it fuses the accessibility tree and OCR into a confidence-scored set of elements, each tagged with a stable el_NN id, and acts on elements by id, not pixel coordinates. Coordinates appear only in the last-resort screenshot/vision tier (live pixels off the current frame), for canvas-only apps or tasks that genuinely need spatial reasoning. The result is cheaper, faster, private, and (uniquely) it checks that each action actually did what it claimed.
If a human can do it on a screen, your agent can too. No API, no integration, no problem: only the right sequence of reads, clicks, keys, and waits. Use it as the last-mile fallback: native API exists? Use it. CLI? Use it. Clawd Cursor is for the click, the legacy app, the GUI with no public surface.
Why it's different
The desktop-agent space is crowded. The closest install-and-go peers are Windows-MCP and Terminator (desktop MCP servers); browser-only tools (browser-use, Playwright MCP) are adjacent; and OmniParser / UI-TARS are vision-centric parsing approaches you'd build an agent around, not products you install. Here's the honest comparison across those approaches, showing what Clawd Cursor does that the popular options don't:
| Capability | Clawd Cursor | browser-use | Playwright MCP | OmniParser / UI-TARS | computer-use |
| --- | --- | --- | --- | --- | --- |
| Any desktop app, not just the web | yes | web only | web only | yes | yes |
| Cross-OS (Windows + macOS + Linux) | yes | yes | yes | varies | sandbox |
| Perception without a vision model | yes (compiled a11y + OCR map) | DOM | a11y tree | no (vision-centric) | no (vision) |
| Verifies its own actions (deviation) | yes | no | no | no | no |
| Single safety chokepoint (allow/confirm/block) | yes | no | no | no | no |
| Any model / vendor | yes | yes | not an agent | model-specific | Claude only |
| MCP-native (one config, any host) | yes | library | test framework | no | tool-use API |
| Local-only, no cloud required | yes | yes | yes | needs a model | screens → cloud |
Three things here are genuinely rare:
- Cheapest-tier-first perception, fully local. Accessibility tree (free) → OCR (cheap) → screenshot (expensive; the only tier that puts pixels in the model's context; "screenshot" and "vision" are the same step). The agent climbs only when it must, so token cost tracks task difficulty, and with a local model nothing leaves the machine. Vision-centric agents (OmniParser, UI-TARS) need a screenshot in the model for every observation.
- It verifies. Pass expect on a consequential action and Clawd Cursor re-checks the live screen (with a short settle window for async UIs) and reports a DEVIATION instead of a hollow "success." A completed task can't be marked done on evidence that was already true before it acted.
- One safety gate. Every call, whether from an editor over stdio, an external agent over HTTP, or the built-in loop, routes through a single safety.evaluate() chokepoint (allow / confirm / block) before it touches the desktop. The agent cannot bypass it.
Plus: an on-screen "desktop control in progress" banner with a blinking red dot whenever an agent is driving; double-click it to stop. A human at the machine always knows, and always has a kill switch.
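The verification rule in the second point above can be sketched as a small decision function. This is a minimal illustration, not Clawd Cursor's code: the function name, the outcome labels other than DEVIATION, and the idea of passing the settle-window re-checks as an array are all assumptions made for the example.

```typescript
// Hypothetical outcome labels; only DEVIATION is a term the README itself uses.
type Outcome = "CONFIRMED" | "DEVIATION" | "UNPROVEN";

// `before`: whether the expected state already held before the action ran.
// `after`: successive re-checks of the live screen during the settle window.
function judgeAction(before: boolean, after: boolean[]): Outcome {
  // Evidence that was already true cannot prove the action did anything.
  if (before) return "UNPROVEN";
  // Async UIs may lag, so any re-check inside the settle window may confirm.
  return after.some((seen) => seen) ? "CONFIRMED" : "DEVIATION";
}

console.log(judgeAction(false, [false, true]));
console.log(judgeAction(false, [false, false]));
console.log(judgeAction(true, [true]));
```

Checking the state before acting is what separates a real confirmation from a condition that happened to be true all along.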
Windows (PowerShell)
powershell -c "irm https://clawdcursor.com/install.ps1 | iex"
macOS / Linux
curl -fsSL https://clawdcursor.com/install.sh | bash
Notes. You never run clawdcursor mcp yourself; the host spawns it over stdio on demand. clawdcursor doctor is not part of MCP setup; it only configures the built-in LLM for the autonomous agent daemon. On macOS, Accessibility is required (primary control path); Screen Recording is optional (vision fallback only). For editor permission allowlists, use the server-level wildcard mcp__clawdcursor rather than per-tool entries, since it survives tool renames.
The engine
The perception + verification core (the UI State Compiler, since v1.5.0):
- compile_ui fuses the accessibility tree and OCR into one confidence-scored map of the screen, every element tagged with a stable el_NN id. Act on an element by {element_id, snapshot_id} instead of pixels: near-free in tokens, and it survives DPI, resize, and layout shifts. find_button / find_field locate a target by meaning and hand you the id.
- Reactive verification. Put expect on an action and Clawd Cursor confirms the outcome on the live screen, returning a DEVIATION when the UI didn't obey.
- Cross-platform parity. The compiler, secure-field redaction, and coordinate handling run on Windows, macOS, and Linux; the external-agent (MCP) surface resolves el_NN refs through the safety gate and discloses when it attached to your existing browser.
Set-of-Mark-style element IDs and a11y/OCR fusion aren't new ideas on their own; what's rare is doing them locally, a11y-first (no vision model required), with a built-in verification gate and one safety chokepoint, across three operating systems, behind a single MCP config.
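To make the id-based addressing concrete, here is a rough sketch of what a compiled UI map and a lookup by meaning might look like. The field names, the UiMap shape, and both helper functions are hypothetical; the README names el_NN ids, snapshot ids, and confidence scores but does not document the real schema. The freshness check is likewise an assumption about why a snapshot id travels with the element id.

```typescript
// Hypothetical shapes, for illustration only.
interface UiElement {
  id: string; // e.g. "el_02"
  role: "button" | "field" | "text";
  name: string;
  confidence: number; // 0..1
}

interface UiMap {
  snapshotId: string;
  elements: UiElement[];
}

interface ElementRef {
  element_id: string;
  snapshot_id: string;
}

// Locate a button by its visible name and return an id-based reference,
// preferring the highest-confidence match.
function findButton(map: UiMap, name: string): ElementRef | undefined {
  const wanted = name.trim().toLowerCase();
  const matches = map.elements
    .filter((el) => el.role === "button" && el.name.toLowerCase() === wanted)
    .sort((a, b) => b.confidence - a.confidence);
  const best = matches[0];
  return best ? { element_id: best.id, snapshot_id: map.snapshotId } : undefined;
}

// Assumption: a reference is only valid against the snapshot it came from.
function isFresh(ref: ElementRef, current: UiMap): boolean {
  return ref.snapshot_id === current.snapshotId;
}

const sampleMap: UiMap = {
  snapshotId: "snap_1",
  elements: [
    { id: "el_01", role: "field", name: "To", confidence: 0.97 },
    { id: "el_02", role: "button", name: "Send", confidence: 0.93 },
    { id: "el_03", role: "button", name: "Send", confidence: 0.41 },
  ],
};

const sendRef = findButton(sampleMap, "send");
console.log(sendRef);
```

Because the reference carries no coordinates, nothing in it depends on DPI, window size, or layout.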
See the changelog for the full release history, or the latest release.
How it works
Where the brain lives decides how you run it. Both modes can run side-by-side.
| Brain lives… | Mode | Command | What you call |
| --- | --- | --- | --- |
| In your editor (Claude Code, Cursor, Windsurf, Zed) | Direct tools | clawdcursor mcp | Each tool, via stdio MCP |
| In a headless agent with its own LLM (OpenClaw, Agent SDK, your loop) | Direct tools | clawdcursor agent --no-llm | Same, over HTTP MCP |
| Inside Clawd Cursor itself (scheduled / "submit and walk away") | Thin agent loop | clawdcursor agent + doctor-configured LLM | task / submit_task |
| External brain that delegates sub-tasks to the built-in loop | Direct + delegation | clawdcursor agent + your client | task({instruction:…}) to hand off |
The loop
Read the a11y tree (cheap) → act on named targets → verify from fresh observations → escalate perception only when needed (OCR → screenshot, the one tier that sends pixels to the model). Sparse a11y tree? system.detect_webview switches Electron/WebView2 apps to browser.* over CDP. Canvas-only (Paint, Figma, games)? Screenshot + coordinate click.
flowchart TB
task["User task"] --> loop["Agent LLM loop: plans · chooses tools · verifies"]
loop --> observe{"Cheapest observation that answers the question"}
observe -- "obs·a11y: free" --> a11y["A11y tree (structured text + el_NN handles)"]
observe -- "obs·ocr: cheap" --> ocr["OCR (OS-level, no vision LLM)"]
observe -- "obs·dom: medium" --> dom["Browser DOM (CDP)"]
observe -- "obs·vision: expensive" --> vision["Screenshot (image into context)"]
a11y --> act
ocr --> act
dom --> act
vision --> act
act["Act: click/type/key/drag · invoke/set_value · open_app · batch"] --> safety
safety["Single safety gate: safety.evaluate() → allow / confirm / block"] -- allowed --> tools["Tool registry: 98 granular + 7 compound"]
safety -- needs user --> confirm["Human confirmation"] --> tools
safety -- denied --> blocked["blocked"]
tools --> desktop["Real desktop"]
desktop --> verify{"expect: does state match?"}
verify -- pass --> done["done"]
verify -- "DEVIATION" --> loop
classDef agentNode fill:#dbeafe,stroke:#2563eb,color:#0f172a;
classDef gate fill:#ede9fe,stroke:#7c3aed,color:#0f172a;
classDef obsNode fill:#fef9c3,stroke:#ca8a04,color:#0f172a;
classDef actNode fill:#ffedd5,stroke:#ea580c,color:#0f172a;
classDef stop fill:#fee2e2,stroke:#dc2626,color:#0f172a;
class loop,verify agentNode;
class safety,confirm,tools gate;
class observe,a11y,ocr,dom,vision obsNode;
class act actNode;
class blocked stop;
batch for deterministic stretches. When the next N steps are known, collapse them into one call; each step still routes through the safety gate, and on any guard miss or error the batch halts with a per-step trace.
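The halt-on-first-miss behavior described above might look roughly like the following. It is a sketch under stated assumptions: runBatch, the Step and StepTrace shapes, and the injected gate and run callbacks are hypothetical stand-ins for the safety check and the real tool dispatch, and the allowConfirm parameter only loosely mirrors the batch({allowConfirm:true}) option mentioned elsewhere in this README.

```typescript
type Verdict = "allow" | "confirm" | "block";

interface Step {
  name: string;
  arguments: Record<string, unknown>;
}

interface StepTrace {
  name: string;
  status: "ok" | "halted";
  reason?: string;
}

// Run steps in order; every step is gated, and the first refusal or
// failure stops the batch and is recorded in the trace.
function runBatch(
  steps: Step[],
  gate: (step: Step) => Verdict,
  run: (step: Step) => boolean,
  allowConfirm = false,
): StepTrace[] {
  const trace: StepTrace[] = [];
  for (const step of steps) {
    const verdict = gate(step);
    const permitted = verdict === "allow" || (verdict === "confirm" && allowConfirm);
    if (!permitted) {
      trace.push({ name: step.name, status: "halted", reason: `safety gate returned ${verdict}` });
      break;
    }
    if (!run(step)) {
      trace.push({ name: step.name, status: "halted", reason: "step failed" });
      break;
    }
    trace.push({ name: step.name, status: "ok" });
  }
  return trace;
}

const steps: Step[] = [
  { name: "open_app", arguments: { name: "Notepad" } },
  { name: "type", arguments: { text: "hello" } },
  { name: "key", arguments: { combo: "ctrl+alt+del" } },
  { name: "key", arguments: { combo: "mod+s" } },
];

const trace = runBatch(
  steps,
  (s) => (s.arguments["combo"] === "ctrl+alt+del" ? "block" : "allow"),
  () => true,
);
console.log(trace);
```

In this run the third step is refused, so the fourth never executes and the trace ends with the halted entry.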
Task delegation. With an LLM configured on the daemon, an external agent can hand off at any point: task({"instruction":"…"}). The built-in loop takes the wheel and reports back, so you can offload grunt work to a cheaper model without burning your own context.
The toolbox
Two catalogs ship side-by-side. The toolbox is 7 compound tools, each with an action enum covering ~10-20 verbs (~1,500 tokens total, about 12× smaller than granular; the computer_20250124 shape editor hosts already know). The granular surface is the 98 underlying primitives, one schema per verb (for runtimes that need top-level tools, or for debugging). Both run through the same safety.evaluate() chokepoint; the full catalog is always visible via MCP tools/list.
| Toolbox | Actions |
| --- | --- |
| computer | screenshot, click, double_click, right_click, triple_click, hover, move, scroll, scroll_horizontal, drag, drag_path, type, key, wait |
| accessibility | read_tree, find, get_element, focused, invoke, focus, set_value, get_value, expand, collapse, toggle, select, state, list_children, wait_for, compile_ui, find_button, find_field, smart_click, smart_type, smart_read |
| window | list, active, focus, maximize, minimize, restore, close, resize, list_displays, screen_size, open_app, open_file, open_url, switch_tab, navigate |
| system | clipboard_read, clipboard_write, system_time, ocr, undo, shortcuts_list, shortcuts_run, delegate, detect_webview, relaunch_with_cdp, system_prompt, build_uri, open_uri, open_app, open_file, open_url, detect_app, app_guide, learn_app |
| browser | connect, page_context, read_text, click, type, select_option, evaluate, wait_for, list_tabs, switch_tab, scroll |
| task | run (default; bounded-sync: waits up to timeouts, returns {status:"running"} + progress if longer, re-call to keep waiting), status, abort. Delegates to the built-in loop. Requires clawdcursor agent with an LLM. |
| batch | {steps:[…]}: collapse N calls into one round-trip; each step {name, arguments, expect?}, re-perceived and safety-gated, halts with a trace on any miss. |
computer({ action: "key", combo: "mod+s" }) // Cmd+S / Ctrl+S, resolved per-OS
accessibility({ action: "invoke", name: "Send" }) // click by name, not pixels
window({ action: "open_app", name: "Outlook" })
task({ instruction: "open Notepad and type hello" }) // hand off to the thin loop
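On the wire, a call like the ones above travels as an MCP tools/call request. The sketch below shows one plausible way a client could build that JSON-RPC envelope. The toolCall helper and the id counter are hypothetical, and the argument names (action, combo) are taken from the examples above rather than from a published schema.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

let nextId = 1;

// Wrap a compound-tool call in MCP's `tools/call` envelope:
// the tool name and its arguments go under `params`.
function toolCall(name: string, args: Record<string, unknown>): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const saveRequest = toolCall("computer", { action: "key", combo: "mod+s" });
console.log(JSON.stringify(saveRequest, null, 2));
```

The same envelope works over either transport; only the way the bytes are delivered differs.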
Cheapest-tier-first perception
Every observation has a cost. Start at the cheapest rung that works; climb only when it fails. The live log (CLAWD_LOG=pretty, default on a TTY) shows the ladder in real time via per-call badges.
| Tier | Badge | Cost | Source | When |
| --- | --- | --- | --- | --- |
| T1 structured | obs·a11y | ~free | accessibility.*, window.*, browser.read_text, clipboard | Default. Text + bounds, no image, no vision LLM. |
| T2 OCR | obs·ocr | cheap | system.ocr, smart_read / smart_click / smart_type | A11y tree empty/sparse. OS-level text out, no image bytes. |
| T3 DOM | obs·dom | medium | browser.read_text / page_context (CDP) | WebView / Electron / Chrome content. |
| T4 screenshot (vision) | obs·vision | expensive | computer.screenshot | The only tier that puts pixels in the model's context. Canvas-only apps or spatial reasoning. Last resort. |
Acting tools log act. Watching obs·a11y → act → obs·a11y on a normal turn, and the rare climb to obs·vision, is the whole efficiency model, visible.
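The ladder can be pictured as a loop that tries each tier in cost order and stops at the first one that yields an answer. The following is only a sketch of that idea: the Tier names follow the badges above, but cheapestObservation and the observe probe are hypothetical, and a real agent would choose tiers per question rather than through one generic callback.

```typescript
type Tier = "a11y" | "ocr" | "dom" | "vision";

// Ordered cheapest to most expensive, following T1..T4 above.
const LADDER: Tier[] = ["a11y", "ocr", "dom", "vision"];

// `observe` is a hypothetical probe: it returns a result for a tier, or
// undefined when that tier cannot answer (e.g. an empty accessibility tree).
function cheapestObservation(
  observe: (tier: Tier) => string | undefined,
): { tier: Tier; result: string } | undefined {
  for (const tier of LADDER) {
    const result = observe(tier);
    if (result !== undefined && result.length > 0) {
      return { tier, result };
    }
  }
  return undefined;
}

// A window whose accessibility tree is empty but whose text OCR can read.
const sparseTree = (tier: Tier): string | undefined =>
  tier === "ocr" ? "Save  Cancel" : tier === "vision" ? "<image>" : undefined;

const picked = cheapestObservation(sparseTree);
console.log(picked?.tier);
```

Here the screenshot tier is never reached because OCR already answered, which is the cost behavior the badges make visible in the log.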
Transports
One protocol (MCP), two transports, same catalog and JSON-RPC envelope. Both stateless; no session handshake.
| Transport | When | Client config |
| --- | --- | --- |
| stdio MCP | Editor hosts. Tools appear on demand, no daemon. | {"command":"clawdcursor","args":["mcp","--compact"]} |
| HTTP MCP | Headless agents, daemons, orchestration, Agent SDK. | POST JSON-RPC to http://127.0.0.1:3847/mcp. Run clawdcursor agent. Bearer token at ~/.clawdcursor/token. |
# HTTP MCP: list tools
# (the Accept header is required: the MCP spec's Streamable HTTP transport
# rejects requests that don't accept both JSON and SSE with a 406)
curl -s -X POST http://127.0.0.1:3847/mcp \
  -H "Authorization: Bearer $(cat ~/.clawdcursor/token)" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
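For agents written in TypeScript, the same request can be assembled in code. This is a sketch that only builds the request; the HttpRequest shape and listToolsRequest helper are hypothetical names, and reading the token from ~/.clawdcursor/token is left to the caller.

```typescript
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the tools/list request from the curl example above.
// The token is trimmed because a token file typically ends with a newline.
function listToolsRequest(token: string, port = 3847): HttpRequest {
  return {
    url: `http://127.0.0.1:${port}/mcp`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${token.trim()}`,
      "Content-Type": "application/json",
      // Both media types must be accepted, as noted above.
      Accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/list" }),
  };
}

const req = listToolsRequest("example-token\n");
console.log(req.url, req.headers);
```

With the daemon running, the result can be handed to any HTTP client, for example fetch(req.url, { method: req.method, headers: req.headers, body: req.body }).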
Platform support
Platform code lives behind a single PlatformAdapter interface (src/platform/{windows,macos,linux}.ts + wayland-backend.ts). Business logic never reads process.platform.
| Platform | UI Automation | OCR | Browser (CDP) | Input |
| --- | --- | --- | --- | --- |
| Windows 10/11 (x64 / ARM64) | UIA via PowerShell bridge | Windows.Media.Ocr | Chrome / Edge | nut-js |
| macOS 12+ (Intel / Apple Silicon) | JXA + System Events (TCC-safe) | Apple Vision | Chrome / Edge | nut-js + System Events |
| Linux X11 | AT-SPI via python3-gi | Tesseract | Chrome / Edge | nut-js |
| Linux Wayland | AT-SPI via python3-gi | Tesseract | Chrome / Edge | ydotool / wtype |
- Windows: no setup; the PowerShell bridge spawns on demand.
- macOS: first run needs Accessibility (required) + Screen Recording (optional); clawdcursor grant walks the dialogs. Retina/HiDPI is handled in-adapter, so don't pre-scale coordinates.
- Linux X11: apt install tesseract-ocr python3-gi gir1.2-atspi-2.0.
- Linux Wayland: same, plus ydotool + ydotoold (preferred) or wtype (keyboard only).
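The adapter seam described at the top of this section can be illustrated with a toy version. Everything here is hypothetical: the README names the PlatformAdapter interface but not its members, so the two methods, the adapter objects, and selectAdapter are invented for the example, with return values taken from the platform table above.

```typescript
// Hypothetical method set, for illustration only.
interface PlatformAdapter {
  readonly id: "windows" | "macos" | "linux";
  ocrEngine(): string;
  inputBackend(): string;
}

const windowsAdapter: PlatformAdapter = {
  id: "windows",
  ocrEngine: () => "Windows.Media.Ocr",
  inputBackend: () => "nut-js",
};

const macosAdapter: PlatformAdapter = {
  id: "macos",
  ocrEngine: () => "Apple Vision",
  inputBackend: () => "nut-js + System Events",
};

function linuxAdapter(wayland: boolean): PlatformAdapter {
  return {
    id: "linux",
    ocrEngine: () => "Tesseract",
    inputBackend: () => (wayland ? "ydotool / wtype" : "nut-js"),
  };
}

// The only place a platform string is inspected. The accepted values are
// the ones Node reports in process.platform.
function selectAdapter(platform: string, wayland = false): PlatformAdapter {
  switch (platform) {
    case "win32":
      return windowsAdapter;
    case "darwin":
      return macosAdapter;
    case "linux":
      return linuxAdapter(wayland);
    default:
      throw new Error(`unsupported platform: ${platform}`);
  }
}

// Business logic receives an adapter and never branches on the OS itself.
function summarize(adapter: PlatformAdapter): string {
  return `${adapter.id}: OCR via ${adapter.ocrEngine()}, input via ${adapter.inputBackend()}`;
}

console.log(summarize(selectAdapter("linux", true)));
```

Keeping the platform switch in one function is what lets the rest of the code stay OS-agnostic.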
Safety & privacy
| Tier | Actions | Behavior |
| --- | --- | --- |
| Allow | Reading, opening apps, navigation, typing into non-sensitive fields, minimize | Executes immediately |
| Confirm | Sends, deletes, purchases, transfers, close-window/quit-app & show-desktop key combos, sensitive apps | Pauses for approval (batch({allowConfirm:true}) to authorize) |
| Block | Ctrl+Alt+Del, lock / log-out / force-quit / shutdown key sequences | Refused outright (no path) |
- Network isolation. Binds to 127.0.0.1. Verify: netstat -an | findstr 3847 (Windows) or netstat -an | grep 3847 (Unix).
- Bearer-token auth on every HTTP request (~/.clawdcursor/token).
- Sensitive-app policy. Email, banking, password managers, private messaging auto-elevate to Confirm.
- No telemetry by default. Nothing phones home. Screenshots stay in RAM; with a local model nothing leaves the machine; with a cloud provider, screenshots go only to the endpoint you configured. clawdcursor report is opt-in and previews exactly what it sends.
- Prompt-injection defense. Screen text is returned inside <untrusted-screen-content> tags: data, never instructions.
- Log privacy. Logs redact password-field values (AXSecureTextField, UIAIsPassword=true).
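The three-tier policy in the table above can be read as a classification function. The sketch below is a simplified stand-in and not the real safety.evaluate(): the Action shape, the kind and app labels, and the rule sets are all assumptions chosen to mirror a few rows of the table.

```typescript
type Verdict = "allow" | "confirm" | "block";

// Hypothetical action description; the labels are illustrative only.
interface Action {
  kind: string; // e.g. "read", "send", "key"
  combo?: string; // key combination, if any
  app?: string; // category of the target app, if known
}

const BLOCKED_COMBOS = new Set(["ctrl+alt+del"]);
const CONFIRM_KINDS = new Set(["send", "delete", "purchase", "transfer"]);
const SENSITIVE_APPS = new Set(["email", "banking", "password-manager", "private-messaging"]);

// Check the strictest rule first so a blocked combo can never be downgraded.
function evaluate(action: Action): Verdict {
  if (action.combo !== undefined && BLOCKED_COMBOS.has(action.combo.toLowerCase())) {
    return "block";
  }
  if (CONFIRM_KINDS.has(action.kind)) {
    return "confirm";
  }
  // Sensitive apps elevate otherwise harmless actions to Confirm.
  if (action.app !== undefined && SENSITIVE_APPS.has(action.app)) {
    return "confirm";
  }
  return "allow";
}

console.log(evaluate({ kind: "read" }));
console.log(evaluate({ kind: "read", app: "banking" }));
console.log(evaluate({ kind: "key", combo: "Ctrl+Alt+Del" }));
```

Because every transport funnels into one function of this kind, the policy only has to be correct in one place.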
See SECURITY.md for private vulnerability reporting.
Architecture
| Directory | What lives here |
| --- | --- |
| src/core/ | Thin agent loop (runAgent), sense layer (a11y / snapshot / fingerprint / UI compiler), reactive verification, focus guard, safety gate. |
| src/tools/ | 98 granular tools + 7 compound aggregators + batch, playbooks, registry, dispatch. |
| src/platform/ | PlatformAdapter + Windows / macOS / Linux / Wayland, OCR engine, CDP driver, URI handler. |
| src/llm/ | Provider clients (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), credentials, model config. |
| src/surface/ | CLI, MCP server (stdio + HTTP), dashboard, doctor, onboarding, control banner. |
The PlatformAdapter is the only thing platform code talks to; safety.evaluate() is the only way tools execute. Those two seams are the whole point.
CLI
For humans diagnosing an install. Agents connect via MCP.
clawdcursor consent Manage desktop-control consent (--accept / --revoke / --status)
clawdcursor grant Grant macOS permissions (interactive, macOS only)
clawdcursor doctor Configure the AI provider for `agent` mode (+ diagnostics)
clawdcursor status Readiness check (consent, permissions, AI config)
clawdcursor mcp stdio MCP server; editor hosts spawn this, you don't
clawdcursor agent Daemon: HTTP MCP on :3847, optional built-in thin loop
clawdcursor agent --no-llm Daemon, tool surface only (no built-in brain)
clawdcursor stop Stop every running mode
clawdcursor uninstall Remove all config and data
Options: --port (default 3847) · --compact · --no-banner · --provider · --accept
Development
git clone https://github.com/AmrDab/clawdcursor.git && cd clawdcursor
npm install
npm run build # tsc + postbuild → dist/surface/cli.js
npm test # vitest (1,000+ tests)
npm run lint # eslint
npm link # global clawdcursor shim (Admin shell on Windows)
Tests run on Node 20 & 22 against Ubuntu, macOS, and Windows in CI, plus a coverage ratchet, a perf tripwire, and an npm audit gate.
Tech: TypeScript · Node 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander.
Contributing
PRs welcome; see CONTRIBUTING.md for the dev loop, branch conventions, and the test matrix every change clears. Bugs and features in issues; private security reports via SECURITY.md.
License
MIT; see LICENSE.
Acknowledgments
Built on the Model Context Protocol SDK, nut-js, Playwright, the Anthropic computer_20250124 tool shape, and the AT-SPI / UIA / AX trees that make app-agnostic GUI automation possible at all.
Install
clawdcursor is an MCP server published to npm. Install it into any MCP-capable agent (Claude Code, Claude Desktop, Cursor, Windsurf, Zed, OpenAI Codex, or your own loop) the same way you install any other MCP server.
1. Install the engine + grant consent (once)
npm i -g clawdcursor
clawdcursor consent --accept # one-time desktop-control consent (required)
clawdcursor grant # macOS only: approve Accessibility + Screen Recording
Zero-install also works: swap clawdcursor for npx -y clawdcursor in any snippet below and npx fetches it on demand. A global install is recommended anyway: it's pinnable and inspectable on disk (safer for a tool with full desktop control than auto-fetching latest every run), and it's the path on which the macOS native helper builds at install time. Requires Node.js 20+.
Per-OS prerequisites. Windows installs clean: sharp and @nut-tree-fork/nut-js ship prebuilt binaries, so no C++/Python build tools are needed. macOS needs Xcode Command Line Tools (xcode-select --install) for screenshots / vision; core accessibility-driven control still works without them. Linux needs a few system packages npm can't install: tesseract-ocr (OCR), python3-gi + gir1.2-atspi-2.0 (accessibility tree), and, on Wayland, ydotool (synthetic input).
2. Add it to your agent (pick your host)
Claude Code
claude mcp add clawdcursor -s user -- clawdcursor mcp --compact
OpenAI Codex: add to ~/.codex/config.toml:
[mcp_servers.clawdcursor]
command = "clawdcursor"
args = ["mcp", "--compact"]
Cursor / Windsurf / Claude Desktop: add to the host's MCP config:
{ "mcpServers": { "clawdcursor": { "command": "clawdcursor", "args": ["mcp", "--compact"] } } }
Zed: Zed uses context_servers (not mcpServers) in settings.json:
{ "context_servers": { "clawdcursor": { "command": { "path": "clawdcursor", "args": ["mcp", "--compact"] } } } }
That's the whole setup. Ask your agent: "open Outlook and reply to the latest email from Sarah."
Or: one-command plugin (Claude Code)
Skip the manual config: this repo ships a plugin that registers the tools and bundles the usage skill in one step. It resolves the package's bin (never a hard-coded dist/ path), so an upgrade can't break it:
claude plugin marketplace add AmrDab/clawdcursor
claude plugin install clawdcursor@clawdcursor
One-line installers (clone + build; handles the macOS native build)