Local control: in-SDK sidecar + desktop/browser drivers #161
Add a deny-by-default CapabilityPolicy that gates which command names a local browser/desktop driver will execute (shell, arbitrary scripts, cookies/storage, and secrets are opt-in), a name-keyed driver registry so one package can host many drivers, and the command-name contract mirroring the hai_drivers interfaces. Co-authored-by: Cursor <cursoragent@cursor.com>
Long-polling sidecar (single-owner lease, connect-time drain, command_uid replay cache + echo), capability policy (deny-by-default with opt-ins), driver registry, pyautogui desktop driver and Selenium browser driver. Co-authored-by: Cursor <cursoragent@cursor.com>
…e open Co-authored-by: Cursor <cursoragent@cursor.com>
…+ config knobs

Policy now derives allowed commands from the driver's public methods minus the danger sets (shell/scripts/cookies/secrets), removing the hand-maintained method lists that duplicated the drivers. Replace the driver registry with a direct lazy factory and trim SidecarConfig to essentials.

Co-authored-by: Cursor <cursoragent@cursor.com>
- serialize_result recurses into dicts (fixes get_observation_snapshot crash)
- browser: reject file/chrome/js/data URLs; real markdown via markdownify; guard get_logs on CDP attach
- desktop: run_command merges os.environ instead of replacing it
- sidecar: interrupt long-poll on stop, reconnect on 404, back off on 429, tear down driver on shutdown
- drop dead dedup cache + racy drain-on-connect (server delivers one cmd at a time, fresh uid, no replay)
- split drivers into desktop/ and browser/ subpackages

Co-authored-by: Cursor <cursoragent@cursor.com>
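Illustrative sketch (not the PR's code) of the URL-scheme rejection described in this commit. The function name and the `BLOCKED_SCHEMES` set are hypothetical; the commit's "js" is assumed to mean the `javascript:` scheme.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist. The four schemes come from the commit message;
# the exact names used inside the driver are an assumption.
BLOCKED_SCHEMES = frozenset({"file", "chrome", "javascript", "data"})


def is_navigation_allowed(url: str) -> bool:
    """Return False when the URL's scheme is on the blocklist."""
    # urlsplit lowercases the scheme; .lower() is kept as a defensive no-op.
    scheme = urlsplit(url.strip()).scheme.lower()
    return scheme not in BLOCKED_SCHEMES
```

A driver could call a check like this before `navigate` and raise instead of loading the page.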
…constants Co-authored-by: Cursor <cursoragent@cursor.com>
…down

- vendor h.js + defuddle.full.js; execute_script auto-injects hjs with iframe guard
- extract_markdown -> Defuddle (main-content, in-browser)
- get_viewport_html -> hjs_0x2a.collectViewportHTML() (screen-bounds pruned DOM)
- viewport_markdown -> collectViewportHTML then CustomMarkdownify (markdownify), full-page fallback
- ship js assets via wheel force-include

Co-authored-by: Cursor <cursoragent@cursor.com>
…l` CLI

Client now injects the local session_id for any source:"local" environment on create_agent/update_agent/patch_agent and on inline-agent create_session, so callers only pass source:"local" and the env id. Adds `hai local browser` and `hai local desktop` to run the sidecar from the CLI.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e, typed envs)

- enter_secret clicks (x, y) to focus the target before typing, so the secret lands in the field the agent pointed at instead of stale focus.
- get_tab_title honors tab_id by switching, reading, and restoring the tab.
- close_active_tab guards against an empty handle list after closing the last tab.
- localize_environments/localize_agent now wire source:"local" envs whether they arrive as dicts or typed Pydantic models.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
if not allow_cookies:
    allowed -= _COOKIES
if not allow_secrets:
    allowed -= _SECRETS
Script policy bypass via helpers
High Severity
With allow_scripts disabled, CapabilityPolicy only removes execute_script, but other allowed browser driver commands such as get_viewport_html, extract_markdown, scroll_page, and observation_bundle still execute page JavaScript internally, so the CLI --allow-scripts gate does not actually block script execution.
Reviewed by Cursor Bugbot for commit 0d1c561.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ce SidecarBusyError in CLI

update_agent/patch_agent take agent_name positionally; the **kwargs-only wrappers raised TypeError on a positional call. Accept *args and pass through.

`hai local browser/desktop` acquired the lease inside asyncio.run, outside the guarded block, so a busy sidecar dumped a raw traceback; route it to the CLI error.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ingle-source capability

- sidecar: cache command_uid -> result and re-post on redelivery instead of re-running side effects (a transient result-POST failure left the command pending and re-executed it on the next poll).
- desktop driver: route keyboard through pyautogui (matches the Key contract and the remote executor); drop pynput, whose member names (enter/esc) diverge from the contract names (return/escape) and silently failed.
- policy: walk the full MRO so inherited driver methods are gated, not just those declared on the concrete class.
- config/wiring: KIND_TO_CAPABILITY is the single source; _CAPABILITIES derives from it.
- pyproject: drop unused pynput; collapse the all extra to a self-reference.

Co-authored-by: Cursor <cursoragent@cursor.com>
…shutdown

Remove CapabilityPolicy and the --allow-* CLI flags: the GUI keystroke and script paths reach the same surface anyway, so the gate was a formality.

- _dispatch rejects unknown/private names cleanly instead of crashing the poll loop
- build the driver only after the machine lease is acquired (no leak on busy lease)
- SIGINT/SIGTERM now stop the sidecar cooperatively so in-flight commands finish
- floor the 429 backoff so Retry-After: 0 can't busy-loop
- guard malformed fetch bodies so a bad json() doesn't kill the loop

Co-authored-by: Cursor <cursoragent@cursor.com>
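Illustrative sketch (not the PR's code) of flooring the 429 backoff so that `Retry-After: 0` cannot busy-loop. The function name and both constants are assumptions; the PR does not state the actual floor or cap.

```python
import math
from typing import Optional

MIN_BACKOFF_SECONDS = 1.0   # assumed floor; the real value is not given in the PR
MAX_BACKOFF_SECONDS = 60.0  # assumed cap, added here only to bound the sleep


def backoff_seconds(retry_after: Optional[str]) -> float:
    """Turn a Retry-After header value into a bounded sleep duration."""
    try:
        requested = float(retry_after)
    except (TypeError, ValueError):
        # Missing header, or an HTTP-date form this sketch does not parse.
        requested = MIN_BACKOFF_SECONDS
    if not math.isfinite(requested):
        requested = MIN_BACKOFF_SECONDS
    # The floor is what prevents Retry-After: 0 from spinning the poll loop.
    return min(max(requested, MIN_BACKOFF_SECONDS), MAX_BACKOFF_SECONDS)
```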
- desktop snapshot emits screenshot_b64 (str), the field ObservationSnapshot requires; the old screenshot_png key raised a validation error on every observe
- release_key clears its modifier bit instead of XOR (stray release no longer flips it back on); key mask mutates only after a successful perform
- CDP mouse events carry the buttons bitmask so drags register, and moves use button "none"
- _run_script keeps the iframe guard on during retries so transient blocks retry
- _focus_new_tab switches to the genuinely new handle, not window_handles[-1]
- block chrome-extension/devtools/filesystem URL schemes
- a kind-less dict env defaults to web so session_id still autowires

Co-authored-by: Cursor <cursoragent@cursor.com>
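Illustrative sketch (not the PR's code) of why clearing the modifier bit differs from XOR on a stray release. The bit assignments and function names are hypothetical.

```python
# Hypothetical modifier bits; the driver's real values are not shown in the PR.
SHIFT = 1 << 0
CTRL = 1 << 1


def release_with_xor(mask: int, bit: int) -> int:
    """Old behaviour: toggles the bit, so releasing an unpressed key sets it."""
    return mask ^ bit


def release_with_clear(mask: int, bit: int) -> int:
    """Fixed behaviour: always leaves the bit cleared, pressed or not."""
    return mask & ~bit
```

With an empty mask, `release_with_xor(0, SHIFT)` turns Shift on, while `release_with_clear(0, SHIFT)` leaves the mask at 0.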
self._action_builder = ActionBuilder
self._destroyed = False
self.cursor_x = 0
self.cursor_y = 0
Mouse position never synced
Medium Severity
SeleniumWebDriver keeps cursor_x/cursor_y for CDP mouse events but initializes them to (0, 0) and never reads the browser’s actual pointer. webpage_metadata and observation_bundle expose that stale position, and click, mouse_press, and scroll can act at the wrong coordinates when no prior mouse_move_to ran.
Reviewed by Cursor Bugbot for commit 7345734.
…rt.py

Three single-purpose helper modules become one; the defuddle bundle is now read lazily and cached on first extract_markdown instead of at import.

Co-authored-by: Cursor <cursoragent@cursor.com>
Cosmetic: module tunables read as plain UPPER_CASE. Class-private methods and driver internals keep the underscore (the dispatch firewall keys off it). Co-authored-by: Cursor <cursoragent@cursor.com>
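Illustrative sketch (not the PR's code) of a dispatch firewall that keys off the leading underscore. `dispatch`, `UnknownCommandError`, and `FakeDriver` are hypothetical names used only here.

```python
from typing import Any, Dict


class UnknownCommandError(Exception):
    """Raised for command names the driver does not publicly expose."""


class FakeDriver:
    """Stand-in driver used only for this sketch."""

    def click(self, x: int, y: int) -> tuple:
        return (x, y)

    def _teardown(self) -> str:
        return "internal"


def dispatch(driver: object, name: str, args: Dict[str, Any]) -> Any:
    # Underscore-prefixed names are treated as internals and never dispatched.
    if not isinstance(name, str) or name.startswith("_"):
        raise UnknownCommandError(f"command not allowed: {name!r}")
    method = getattr(driver, name, None)
    if not callable(method):
        raise UnknownCommandError(f"unknown command: {name!r}")
    return method(**args)
```

Raising a dedicated error lets the poll loop report the rejection as a command result instead of crashing.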
…sktop->pyautogui_desktop

Package dirs now name their implementation. CLI commands, capability strings, install extras, and class names are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
def click(self, x: int, y: int, button: str = "left") -> None:
    self._pyautogui.click(x=x, y=y, button=button)
    self._settle_after_click()
Desktop click coordinate space mismatch
High Severity
get_observation_snapshot reports the cursor in screenshot pixel space (including after width downscaling via screenshot_max_width), but click, mouse_move_to, and related input helpers forward those coordinates unchanged to PyAutoGUI, which expects logical screen coordinates from _screen_size. Agents aiming from the observation image will miss clicks whenever capture dimensions differ from the stored screen size.
Reviewed by Cursor Bugbot for commit db90a85.
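Illustrative sketch (not the PR's code) of one way to address the mismatch above: map a point from screenshot pixel space back to logical screen coordinates before handing it to the input layer. The function name and signature are hypothetical.

```python
from typing import Tuple


def screenshot_to_screen(
    x: int,
    y: int,
    screenshot_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Scale a screenshot-space point to screen space and clamp it on-screen."""
    shot_w, shot_h = screenshot_size
    screen_w, screen_h = screen_size
    if shot_w <= 0 or shot_h <= 0:
        raise ValueError("screenshot dimensions must be positive")
    sx = round(x * screen_w / shot_w)
    sy = round(y * screen_h / shot_h)
    # Clamp so rounding at the far edge cannot leave the screen bounds.
    return (min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1))
```

For a 1280x720 capture of a 2560x1440 screen, the point (640, 360) maps to (1280, 720).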
… cached

- localize_agent/create_agent now recurse into inline subagents, so a local browser/desktop child gets its session_id (was only top-level environments)
- the result cache is now LRU: a cache hit refreshes recency so an actively redelivered command_uid is not evicted and re-executed mid-retry

Co-authored-by: Cursor <cursoragent@cursor.com>
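Illustrative sketch (not the PR's code) of an LRU result cache keyed by command_uid, where a hit refreshes recency. The class name, `MISSING` sentinel, and default size are assumptions.

```python
from collections import OrderedDict
from typing import Any

MISSING = object()  # sentinel so a cached None result is distinguishable from a miss


class ResultCache:
    """Bounded command_uid -> result map with least-recently-used eviction."""

    def __init__(self, max_entries: int = 256) -> None:
        self._max = max_entries
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, command_uid: str) -> Any:
        if command_uid not in self._items:
            return MISSING
        # A hit refreshes recency, so a uid being redelivered is not evicted.
        self._items.move_to_end(command_uid)
        return self._items[command_uid]

    def put(self, command_uid: str, result: Any) -> None:
        self._items[command_uid] = result
        self._items.move_to_end(command_uid)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # drop the least recently used entry
```

On redelivery, a sidecar could call `get` first and re-post the stored result rather than running the command again.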
Stored but never read (vestigial in the upstream driver too); removing it so the constructor doesn't advertise an option that does nothing. Co-authored-by: Cursor <cursoragent@cursor.com>
Matches the consolidated hai_drivers desktop interface (single screenshot_b64 method); the command proxy forwards screenshot_b64, so the desktop driver must expose it rather than screenshot_png_bytes. Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 1c267dc.
def click(self, button: str = "left", delay_before_release: float = 0.05) -> None:
    self.mouse_press(button=button)
    time.sleep(delay_before_release)
    self.mouse_release(button=button)
Browser sidecar click args mismatch
High Severity
The sidecar invokes driver methods by RPC name with JSON args. Desktop control uses click with x/y, and sidecar tests dispatch the same shape, but SeleniumWebDriver.click only accepts button and delay_before_release. Browser click requests carrying coordinates raise a TypeError or never move the pointer before clicking.
Reviewed by Cursor Bugbot for commit 1c267dc.
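Illustrative sketch (not the PR's code) of one possible shape for a browser `click` that tolerates the desktop-style x/y arguments. `SketchBrowserDriver` is a hypothetical stand-in that records events instead of sending CDP commands.

```python
import time
from typing import List, Optional, Tuple


class SketchBrowserDriver:
    """Records pointer events so the call order can be inspected."""

    def __init__(self) -> None:
        self.cursor_x = 0
        self.cursor_y = 0
        self.events: List[Tuple] = []

    def mouse_move_to(self, x: int, y: int) -> None:
        self.cursor_x, self.cursor_y = x, y
        self.events.append(("move", x, y))

    def mouse_press(self, button: str = "left") -> None:
        self.events.append(("press", button, self.cursor_x, self.cursor_y))

    def mouse_release(self, button: str = "left") -> None:
        self.events.append(("release", button, self.cursor_x, self.cursor_y))

    def click(
        self,
        x: Optional[int] = None,
        y: Optional[int] = None,
        button: str = "left",
        delay_before_release: float = 0.05,
    ) -> None:
        if (x is None) != (y is None):
            raise ValueError("x and y must be given together")
        if x is not None and y is not None:
            # Move first so the press lands where the caller pointed.
            self.mouse_move_to(x, y)
        self.mouse_press(button=button)
        time.sleep(delay_before_release)
        self.mouse_release(button=button)
```

Keeping x/y optional preserves the existing no-argument call while accepting the shape the sidecar tests dispatch.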


Made with Cursor
Note
High Risk
Enables real local browser/desktop control and subprocess execution on the user machine; incorrect wiring or driver bugs could affect the host OS, though optional installs and a per-session lease limit blast radius.
Overview
Adds local computer-use so agents can drive the user’s machine via an in-process sidecar that long-polls the platform for commands and runs them on local drivers, instead of only remote hosted environments.
Packaging: optional extras `hai-agents[desktop]` (pyautogui, pillow) and `[browser]` (selenium, markdownify); `all` composes them; wheel force-includes bundled browser JS (defuddle / h.js).

SDK wiring: `Client`/`AsyncClient` now expose `agents` and `sessions` subclasses that, on create/update agent and create session, rewrite `user_device` environments (and nested subagents) to inject a deterministic `session_id` derived from environment id, API key, and capability (`web` → browser, `desktop` → desktop).

Sidecar: `SidecarClient` ensures a trajectory channel, polls `/api/v1/commands/...`, dispatches by method name to `LocalDesktopDriver` or `SeleniumWebDriver`, posts results with idempotent caching by `command_uid`, and uses a machine lease so only one sidecar owns a session.

CLI: `hai local browser` (Chrome debugger port) and `hai local desktop`.

Drivers: Desktop driver covers screenshots/observation snapshots, pointer/keyboard, files, and `run_command` (including detach on Windows/macOS). Browser driver attaches to Chrome via CDP, blocks risky URL schemes, injects page helper JS for viewport/DOM work, and implements navigation, input, tabs, cookies, and observation bundles (screenshot + markdown).

Reviewed by Cursor Bugbot for commit 1c267dc. Bugbot is set up for automated code reviews on this repo.
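Illustrative sketch (not the PR's code) of a deterministic session_id derived from environment id, API key, and capability. The hashing scheme, separator, and output length are assumptions; only the three inputs, the kind-to-capability mapping, and the kind-less default of web come from this PR.

```python
import hashlib
from typing import Optional

# Mapping as described in the summary: web -> browser, desktop -> desktop.
KIND_TO_CAPABILITY = {"web": "browser", "desktop": "desktop"}


def derive_session_id(environment_id: str, api_key: str, kind: Optional[str] = None) -> str:
    """Derive a stable id; the SHA-256 construction here is hypothetical."""
    capability = KIND_TO_CAPABILITY[kind or "web"]  # kind-less envs default to web
    # A unit-separator byte keeps field boundaries unambiguous.
    material = "\x1f".join((environment_id, api_key, capability)).encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]
```

Because the id depends only on its inputs, the SDK client and a separately started sidecar can compute the same value without coordinating.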