
Local control: in-SDK sidecar + desktop/browser drivers#161

Open
abonneth wants to merge 21 commits into main from antoine/local-control

abonneth (Collaborator) commented Jun 29, 2026

Made with Cursor


Note

High Risk
Enables real local browser/desktop control and subprocess execution on the user’s machine. Incorrect wiring or driver bugs could affect the host OS, though optional installs and a per-session lease limit the blast radius.

Overview
Adds local computer-use so agents can drive the user’s machine via an in-process sidecar that long-polls the platform for commands and runs them on local drivers, instead of only remote hosted environments.

Packaging: optional extras hai-agents[desktop] (pyautogui, pillow) and hai-agents[browser] (selenium, markdownify); the all extra composes them; the wheel force-includes the bundled browser JS (defuddle / h.js).

SDK wiring: Client / AsyncClient now expose agents and sessions subclasses that, on create/update agent and create session, rewrite user_device environments (and nested subagents) to inject a deterministic session_id derived from environment id, API key, and capability (web → browser, desktop → desktop).
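
A minimal sketch of how a deterministic session_id could be derived from the three inputs named above. The hashing scheme, the `derive_session_id` name, and the `local-` prefix are assumptions for illustration; the description only states which inputs feed the id and how environment kinds map to capabilities.

```python
import hashlib

# Mapping stated in the description: environment kind -> capability.
KIND_TO_CAPABILITY = {"web": "browser", "desktop": "desktop"}


def derive_session_id(environment_id: str, api_key: str, kind: str) -> str:
    """Hypothetical helper: identical inputs always yield the same id."""
    capability = KIND_TO_CAPABILITY[kind]
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    material = "".join(
        f"{len(part)}:{part}" for part in (environment_id, api_key, capability)
    )
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"local-{capability}-{digest[:32]}"
```

Because the id depends only on its inputs, the SDK and a sidecar started separately with the same environment id, API key, and capability would arrive at the same session without exchanging any state.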

Sidecar: SidecarClient ensures a trajectory channel, polls /api/v1/commands/..., dispatches by method name to LocalDesktopDriver or SeleniumWebDriver, posts results with idempotent caching by command_uid, and uses a machine lease so only one sidecar owns a session. CLI: hai local browser (Chrome debugger port) and hai local desktop.
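
The dispatch-and-cache step described above might look roughly like this. `CommandRunner`, its result shape, and the choice to cache rejections are hypothetical; only the ideas of dispatching by method name and caching results by command_uid come from this PR.

```python
from typing import Any, Callable, Dict, Optional


class CommandRunner:
    """Hypothetical stand-in for the sidecar's dispatch step."""

    def __init__(self, driver: Any) -> None:
        self._driver = driver
        self._results: Dict[str, Any] = {}  # command_uid -> cached result

    def handle(self, command_uid: str, method: str, args: Dict[str, Any]) -> Any:
        # A redelivered command is answered from the cache, not re-executed.
        if command_uid in self._results:
            return self._results[command_uid]
        # Refuse private or unknown names instead of raising in the poll loop.
        target: Optional[Callable[..., Any]] = None
        if not method.startswith("_"):
            target = getattr(self._driver, method, None)
        if not callable(target):
            result = {"ok": False, "error": f"unknown command: {method}"}
        else:
            result = {"ok": True, "value": target(**args)}
        self._results[command_uid] = result
        return result
```

Keying the cache on command_uid means a retried result POST can resend the stored value without repeating a click or a shell command on the host.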

Drivers: Desktop driver covers screenshots/observation snapshots, pointer/keyboard, files, and run_command (including detach on Windows/macOS). Browser driver attaches to Chrome via CDP, blocks risky URL schemes, injects page helper JS for viewport/DOM work, and implements navigation, input, tabs, cookies, and observation bundles (screenshot + markdown).
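
As an illustration of the URL scheme blocking mentioned above, here is a small sketch. The `BLOCKED_SCHEMES` name and `is_navigation_allowed` helper are assumptions; the scheme list is taken from the commit messages in this PR (file, chrome, js, data, chrome-extension, devtools, filesystem).

```python
from urllib.parse import urlsplit

# Schemes named in this PR's commit messages; "js" is read as "javascript".
BLOCKED_SCHEMES = frozenset(
    {"file", "chrome", "javascript", "data", "chrome-extension", "devtools", "filesystem"}
)


def is_navigation_allowed(url: str) -> bool:
    """Hypothetical guard: reject URLs whose scheme is on the blocklist."""
    scheme = urlsplit(url.strip()).scheme.lower()
    return scheme not in BLOCKED_SCHEMES
```

A blocklist like this lets scheme-less input such as `example.com` through; an allowlist of `http` and `https` would be the stricter alternative if that behavior is not wanted.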

Reviewed by Cursor Bugbot for commit 1c267dc. Bugbot is set up for automated code reviews on this repo.

abonneth and others added 7 commits June 29, 2026 17:31
Add a deny-by-default CapabilityPolicy that gates which command names a local
browser/desktop driver will execute (shell, arbitrary scripts, cookies/storage,
and secrets are opt-in), a name-keyed driver registry so one package can host
many drivers, and the command-name contract mirroring the hai_drivers interfaces.

Co-authored-by: Cursor <cursoragent@cursor.com>
Long-polling sidecar (single-owner lease, connect-time drain, command_uid
replay cache + echo), capability policy (deny-by-default with opt-ins),
driver registry, pyautogui desktop driver and Selenium browser driver.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e open

Co-authored-by: Cursor <cursoragent@cursor.com>
…+ config knobs

Policy now derives allowed commands from the driver's public methods minus the
danger sets (shell/scripts/cookies/secrets), removing the hand-maintained method
lists that duplicated the drivers. Replace the driver registry with a direct lazy
factory and trim SidecarConfig to essentials.

Co-authored-by: Cursor <cursoragent@cursor.com>
- serialize_result recurses into dicts (fixes get_observation_snapshot crash)
- browser: reject file/chrome/js/data URLs; real markdown via markdownify; guard get_logs on CDP attach
- desktop: run_command merges os.environ instead of replacing it
- sidecar: interrupt long-poll on stop, reconnect on 404, back off on 429, tear down driver on shutdown
- drop dead dedup cache + racy drain-on-connect (server delivers one cmd at a time, fresh uid, no replay)
- split drivers into desktop/ and browser/ subpackages
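
Two of the fixes above, sketched under assumptions: a result serializer that recurses into containers, and an environment builder that overlays caller variables on os.environ rather than replacing it. The function bodies and the `build_env` name are illustrative and not taken from the diff.

```python
import base64
import os
from typing import Any, Dict, Optional


def serialize_result(value: Any) -> Any:
    """Hypothetical JSON-safe conversion that recurses into containers."""
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, dict):
        return {str(k): serialize_result(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [serialize_result(v) for v in value]
    return value


def build_env(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Overlay caller-supplied variables on the inherited environment."""
    env = dict(os.environ)  # copy, so the parent process is not mutated
    env.update(extra or {})
    return env
```

The resulting mapping can be passed as `env=` to `subprocess.run`, so the child process still sees PATH and other inherited variables alongside the overrides.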

Co-authored-by: Cursor <cursoragent@cursor.com>
…constants

Co-authored-by: Cursor <cursoragent@cursor.com>
…down

- vendor h.js + defuddle.full.js; execute_script auto-injects hjs with iframe guard
- extract_markdown -> Defuddle (main-content, in-browser)
- get_viewport_html -> hjs_0x2a.collectViewportHTML() (screen-bounds pruned DOM)
- viewport_markdown -> collectViewportHTML then CustomMarkdownify (markdownify), full-page fallback
- ship js assets via wheel force-include

Co-authored-by: Cursor <cursoragent@cursor.com>
abonneth marked this pull request as ready for review June 29, 2026 18:15
abonneth requested a review from adeprezh as a code owner June 29, 2026 18:15
Comment thread src/hai_agents/local/selenium_browser/driver.py
Comment thread src/hai_agents/local/browser/driver.py Outdated
Comment thread src/hai_agents/local/browser/driver.py Outdated
…l` CLI

Client now injects the local session_id for any source:"local" environment on
create_agent/update_agent/patch_agent and on inline-agent create_session, so
callers only pass source:"local" and the env id. Adds `hai local browser` and
`hai local desktop` to run the sidecar from the CLI.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/hai_agents/local/wiring.py
abonneth and others added 2 commits June 29, 2026 23:10
…e, typed envs)

- enter_secret clicks (x, y) to focus the target before typing, so the secret
  lands in the field the agent pointed at instead of stale focus.
- get_tab_title honors tab_id by switching, reading, and restoring the tab.
- close_active_tab guards against an empty handle list after closing the last tab.
- localize_environments/localize_agent now wire source:"local" envs whether they
  arrive as dicts or typed Pydantic models.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/hai_agents/local/policy.py Outdated
if not allow_cookies:
    allowed -= _COOKIES
if not allow_secrets:
    allowed -= _SECRETS


Script policy bypass via helpers

High Severity

With allow_scripts disabled, CapabilityPolicy only removes execute_script, but other allowed browser driver commands such as get_viewport_html, extract_markdown, scroll_page, and observation_bundle still execute page JavaScript internally, so the CLI --allow-scripts gate does not actually block script execution.

Reviewed by Cursor Bugbot for commit 0d1c561.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/hai_agents/local/browser/driver.py Outdated
abonneth and others added 4 commits June 30, 2026 13:57
…ce SidecarBusyError in CLI

update_agent/patch_agent take agent_name positionally; the **kwargs-only
wrappers raised TypeError on a positional call. Accept *args and pass through.

`hai local browser/desktop` acquired the lease inside asyncio.run, outside the
guarded block, so a busy sidecar dumped a raw traceback; route it to the CLI error.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ingle-source capability

- sidecar: cache command_uid -> result and re-post on redelivery instead of
  re-running side effects (a transient result-POST failure left the command
  pending and re-executed it on the next poll).
- desktop driver: route keyboard through pyautogui (matches the Key contract
  and the remote executor); drop pynput, whose member names (enter/esc) diverge
  from the contract names (return/escape) and silently failed.
- policy: walk the full MRO so inherited driver methods are gated, not just
  those declared on the concrete class.
- config/wiring: KIND_TO_CAPABILITY is the single source; _CAPABILITIES derives
  from it.
- pyproject: drop unused pynput; collapse the all extra to a self-reference.

Co-authored-by: Cursor <cursoragent@cursor.com>
…shutdown

Remove CapabilityPolicy and the --allow-* CLI flags: the GUI keystroke and
script paths reach the same surface anyway, so the gate was a formality.

- _dispatch rejects unknown/private names cleanly instead of crashing the poll loop
- build the driver only after the machine lease is acquired (no leak on busy lease)
- SIGINT/SIGTERM now stop the sidecar cooperatively so in-flight commands finish
- floor the 429 backoff so Retry-After: 0 can't busy-loop
- guard malformed fetch bodies so a bad json() doesn't kill the loop
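
The 429 backoff floor from the list above can be sketched as follows. The constant names and values and the `backoff_delay` helper are assumptions; the commit only says the backoff is floored so that `Retry-After: 0` cannot busy-loop.

```python
import math
from typing import Optional

MIN_BACKOFF_SECONDS = 1.0       # assumed floor
MAX_BACKOFF_SECONDS = 60.0      # assumed cap
DEFAULT_BACKOFF_SECONDS = 5.0   # assumed fallback when the header is unusable


def backoff_delay(retry_after: Optional[str]) -> float:
    """Turn a Retry-After header value into a clamped sleep duration."""
    requested = DEFAULT_BACKOFF_SECONDS
    if retry_after is not None:
        try:
            parsed = float(retry_after)
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch ignores that form.
            parsed = math.nan
        if math.isfinite(parsed):
            requested = parsed
    return min(max(requested, MIN_BACKOFF_SECONDS), MAX_BACKOFF_SECONDS)
```

Clamping on both sides keeps a zero or negative header from spinning the poll loop and keeps a very large one from stalling the sidecar indefinitely.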

Co-authored-by: Cursor <cursoragent@cursor.com>
- desktop snapshot emits screenshot_b64 (str), the field ObservationSnapshot
  requires; the old screenshot_png key raised a validation error on every observe
- release_key clears its modifier bit instead of XOR (stray release no longer
  flips it back on); key mask mutates only after a successful perform
- CDP mouse events carry the buttons bitmask so drags register, and moves use
  button "none"
- _run_script keeps the iframe guard on during retries so transient blocks retry
- _focus_new_tab switches to the genuinely new handle, not window_handles[-1]
- block chrome-extension/devtools/filesystem URL schemes
- a kind-less dict env defaults to web so session_id still autowires
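
To make the release_key fix in the list above concrete, this sketch contrasts XOR with an explicit bit clear. The bit values and function names are illustrative only and are not the driver's real constants.

```python
SHIFT = 1 << 0  # illustrative modifier bits
CTRL = 1 << 1


def release_with_xor(mask: int, bit: int) -> int:
    # Buggy variant: XOR toggles, so releasing a key that is not held sets its bit.
    return mask ^ bit


def release_with_clear(mask: int, bit: int) -> int:
    # Fixed variant: AND with the complement clears the bit and is idempotent.
    return mask & ~bit
```

With only CTRL held, a stray SHIFT release leaves the mask unchanged under `release_with_clear`, while `release_with_xor` would report SHIFT as pressed.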

Co-authored-by: Cursor <cursoragent@cursor.com>
self._action_builder = ActionBuilder
self._destroyed = False
self.cursor_x = 0
self.cursor_y = 0


Mouse position never synced

Medium Severity

SeleniumWebDriver keeps cursor_x/cursor_y for CDP mouse events but initializes them to (0, 0) and never reads the browser’s actual pointer. webpage_metadata and observation_bundle expose that stale position, and click, mouse_press, and scroll can act at the wrong coordinates when no prior mouse_move_to ran.

Reviewed by Cursor Bugbot for commit 7345734.

abonneth and others added 2 commits June 30, 2026 23:01
…rt.py

Three single-purpose helper modules become one; the defuddle bundle is now
read lazily and cached on first extract_markdown instead of at import.

Co-authored-by: Cursor <cursoragent@cursor.com>
Cosmetic: module tunables read as plain UPPER_CASE. Class-private methods and
driver internals keep the underscore (the dispatch firewall keys off it).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/hai_agents/local/wiring.py Outdated
Comment thread src/hai_agents/local/sidecar.py
…sktop->pyautogui_desktop

Package dirs now name their implementation. CLI commands, capability strings,
install extras, and class names are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

def click(self, x: int, y: int, button: str = "left") -> None:
    self._pyautogui.click(x=x, y=y, button=button)
    self._settle_after_click()


Desktop click coordinate space mismatch

High Severity

get_observation_snapshot reports the cursor in screenshot pixel space (including after width downscaling via screenshot_max_width), but click, mouse_move_to, and related input helpers forward those coordinates unchanged to PyAutoGUI, which expects logical screen coordinates from _screen_size. Agents aiming from the observation image will miss clicks whenever capture dimensions differ from the stored screen size.
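
One possible way to address the mismatch described above is to map screenshot pixels back to logical screen coordinates before calling the input helpers. The `to_screen_coords` helper below is hypothetical and is not the fix adopted in this PR.

```python
from typing import Tuple


def to_screen_coords(
    x: int,
    y: int,
    image_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Scale a point from screenshot space into logical screen space."""
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    sx = round(x * scr_w / img_w)
    sy = round(y * scr_h / img_h)
    # Clamp so rounding at the right or bottom edge cannot leave the screen.
    return (min(max(sx, 0), scr_w - 1), min(max(sy, 0), scr_h - 1))
```

For example, a point picked at (640, 360) on a 1280x720 screenshot of a 2560x1440 screen maps to (1280, 720). The alternative is to report observation coordinates in logical screen space so that no conversion is needed on input.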

Reviewed by Cursor Bugbot for commit db90a85.

… cached

- localize_agent/create_agent now recurse into inline subagents, so a local
  browser/desktop child gets its session_id (was only top-level environments)
- the result cache is now LRU: a cache hit refreshes recency so an actively
  redelivered command_uid is not evicted and re-executed mid-retry
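
A sketch of the LRU behavior described above, using `collections.OrderedDict`. The `ResultCache` class and its default capacity are assumptions; the point is that a hit moves the entry to the most-recent end so an actively redelivered command_uid is evicted last.

```python
from collections import OrderedDict
from typing import Any


class ResultCache:
    """Hypothetical bounded command_uid -> result cache with LRU eviction."""

    def __init__(self, capacity: int = 128) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, command_uid: str, default: Any = None) -> Any:
        if command_uid not in self._items:
            return default
        self._items.move_to_end(command_uid)  # a hit refreshes recency
        return self._items[command_uid]

    def put(self, command_uid: str, result: Any) -> None:
        self._items[command_uid] = result
        self._items.move_to_end(command_uid)
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used
```

Without the `move_to_end` call in `get`, eviction would follow insertion order, which is the failure mode the commit describes.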

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/hai_agents/local/selenium_browser/driver.py Outdated
Stored but never read (vestigial in the upstream driver too); removing it so the
constructor doesn't advertise an option that does nothing.

Co-authored-by: Cursor <cursoragent@cursor.com>
Matches the consolidated hai_drivers desktop interface (single
screenshot_b64 method); the command proxy forwards screenshot_b64, so the
desktop driver must expose it rather than screenshot_png_bytes.

Co-authored-by: Cursor <cursoragent@cursor.com>

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 1c267dc.

def click(self, button: str = "left", delay_before_release: float = 0.05) -> None:
    self.mouse_press(button=button)
    time.sleep(delay_before_release)
    self.mouse_release(button=button)


Browser sidecar click args mismatch

High Severity

The sidecar invokes driver methods by RPC name with JSON args. Desktop control uses click with x/y, and sidecar tests dispatch the same shape, but SeleniumWebDriver.click only accepts button and delay_before_release. Browser click requests carrying coordinates raise a TypeError or never move the pointer before clicking.
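
One way the mismatch above could be resolved is to let the browser click accept optional coordinates and move the pointer first. `BrowserPointer` below is a hypothetical stand-in that records events in place of the real CDP calls; it is not the SeleniumWebDriver implementation.

```python
import time
from typing import List, Optional, Tuple


class BrowserPointer:
    """Hypothetical pointer model; events are recorded, not sent to a browser."""

    def __init__(self) -> None:
        self.cursor_x = 0
        self.cursor_y = 0
        self.events: List[Tuple[str, int, int, str]] = []

    def mouse_move_to(self, x: int, y: int) -> None:
        self.cursor_x, self.cursor_y = x, y
        self.events.append(("move", x, y, "none"))

    def mouse_press(self, button: str = "left") -> None:
        self.events.append(("press", self.cursor_x, self.cursor_y, button))

    def mouse_release(self, button: str = "left") -> None:
        self.events.append(("release", self.cursor_x, self.cursor_y, button))

    def click(
        self,
        x: Optional[int] = None,
        y: Optional[int] = None,
        button: str = "left",
        delay_before_release: float = 0.05,
    ) -> None:
        if (x is None) != (y is None):
            raise ValueError("x and y must be given together")
        if x is not None and y is not None:
            self.mouse_move_to(x, y)  # same argument shape as the desktop click
        self.mouse_press(button=button)
        time.sleep(delay_before_release)
        self.mouse_release(button=button)
```

Keeping x and y optional preserves the existing click-at-current-position call shape while accepting the coordinate-carrying requests that the desktop driver already handles.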

Reviewed by Cursor Bugbot for commit 1c267dc.
