improvement(execution, connectors): offload large function inputs, increase connector limits + better error propagation #5089
…onnector size limits

Addresses a class of 10 MB limit failures:
- executor/variables: offload over-budget function block-output context values to durable large-value refs (lazy `sim.values.read`) so JS function blocks can merge medium files without exceeding the 10 MB inter-block request-body cap.
- connectors: stream downloads via `readBodyWithLimit` (memory-safe), and surface oversized files as visible `failed` KB documents instead of silently dropping them: listing-time for github/s3/dropbox/onedrive/sharepoint, fetch-time for gitlab/azure/google-drive via a shared `ConnectorFileTooLargeError`. Raise the per-file cap from a hardcoded 10 MB to the canonical 100 MB KB document limit (`CONNECTOR_MAX_FILE_BYTES`), except Google Drive's export path (Google's hard 10 MB export-API limit).
- sync-engine: `classifyExternalDoc` + bulk `skipDocuments` (failed rows with a reason, excluded from retry), byte-bounded batch concurrency to cap peak worker memory at the raised cap, and a `metadata.fileSize ?? size` fallback.

# Conflicts:
#	apps/sim/connectors/utils.test.ts
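The streaming download described in the commit message can be illustrated with a minimal sketch. This is not the PR's `readBodyWithLimit`; the names `readBodyWithLimitSketch` and `BodyTooLargeErrorSketch` are hypothetical, and the only assumption is a standard Fetch `Response` whose body is a web `ReadableStream`. The idea is to count bytes as chunks arrive and cancel the stream as soon as the cap is crossed, rather than buffering the whole body with `response.text()`.

```typescript
// Hypothetical sketch, not the PR's actual implementation.
class BodyTooLargeErrorSketch extends Error {
  limitBytes: number
  constructor(limitBytes: number) {
    super(`response body exceeds ${limitBytes} bytes`)
    this.name = 'BodyTooLargeErrorSketch'
    this.limitBytes = limitBytes
  }
}

async function readBodyWithLimitSketch(
  response: Response,
  limitBytes: number
): Promise<Uint8Array> {
  // Fast path: reject before reading when the server declares an oversized body.
  const declared = Number(response.headers.get('content-length'))
  if (Number.isFinite(declared) && declared > limitBytes) {
    await response.body?.cancel()
    throw new BodyTooLargeErrorSketch(limitBytes)
  }
  if (!response.body) return new Uint8Array(0)

  const reader = response.body.getReader()
  const chunks: Uint8Array[] = []
  let total = 0
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    total += value.byteLength
    if (total > limitBytes) {
      // Stop pulling data so memory use stays bounded by roughly the cap.
      await reader.cancel()
      throw new BodyTooLargeErrorSketch(limitBytes)
    }
    chunks.push(value)
  }

  // Concatenate the buffered chunks into one contiguous array.
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}
```

A caller could catch the size error and record the document as failed with a reason instead of dropping it, which is the behaviour the commit message describes for fetch-time connectors.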
PR Summary
Medium Risk
Overview
Function blocks: Resolved block-output values that would blow the ~6 MB inline budget (data + display) are offloaded to durable large-value refs and read in the JS sandbox via `sim.values.read`.
Workflow SSE: If the Redis event buffer rejects a write (e.g. oversized block output), terminal events still go out on the live SSE stream so the UI does not hang on "running."
KB connectors: Shared
Sync engine:
Agent skill
Reviewed by Cursor Bugbot for commit e1bece6.
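As a rough illustration of the offload idea in the summary above, the sketch below replaces the largest context values with small reference objects until the serialized total fits an inline budget. Everything here is an assumption made for illustration: the ref shape, the in-memory `Map` standing in for durable storage, the largest-first policy, and the helper names are hypothetical and are not the PR's resolver or the real `sim.values.read` API.

```typescript
// Hypothetical ref shape; the PR's actual large-value ref format is not shown in this page.
interface LargeValueRefSketch {
  __largeValueRef: true
  key: string
  sizeBytes: number
}

const sketchEncoder = new TextEncoder()

// Serialized size in bytes of a JSON-compatible value.
function jsonByteLength(value: unknown): number {
  const json = JSON.stringify(value)
  return json === undefined ? 0 : sketchEncoder.encode(json).byteLength
}

function isLargeValueRefSketch(value: unknown): value is LargeValueRefSketch {
  return typeof value === 'object' && value !== null && '__largeValueRef' in value
}

// Offload the largest values first until the remaining inline payload fits the budget.
// The small size of the ref objects themselves is ignored for simplicity.
function offloadOverBudgetSketch(
  context: Record<string, unknown>,
  budgetBytes: number,
  store: Map<string, string>
): Record<string, unknown> {
  const entries = Object.entries(context).map(([name, value]) => ({
    name,
    value,
    size: jsonByteLength(value),
  }))
  let total = entries.reduce((sum, entry) => sum + entry.size, 0)
  const result: Record<string, unknown> = { ...context }

  entries.sort((a, b) => b.size - a.size)
  for (const entry of entries) {
    if (total <= budgetBytes) break
    const key = `large-value/${entry.name}`
    store.set(key, JSON.stringify(entry.value))
    const ref: LargeValueRefSketch = { __largeValueRef: true, key, sizeBytes: entry.size }
    result[entry.name] = ref
    total -= entry.size
  }
  return result
}

// Lazy read: the consumer resolves a ref only when it actually needs the value.
function readValueSketch(ref: LargeValueRefSketch, store: Map<string, string>): unknown {
  const raw = store.get(ref.key)
  if (raw === undefined) throw new Error(`missing large value: ${ref.key}`)
  return JSON.parse(raw)
}
```

The point of the lazy read is that the request body carries only the small refs, so a function that merges several medium files never has to ship all of them inline at once.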
Greptile Summary
This PR tackles a class of 10 MB cap failures across workflow execution and KB connectors by offloading oversized function-block context values to durable large-value refs, raising the per-file connector limit to the canonical 100 MB KB document cap, and replacing silent drops of oversized files with visible `failed` KB documents.
Confidence Score: 4/5
Safe to merge with awareness of the two P2 notes; no blocking defects introduced.
The changes are broad (19 files, three distinct subsystems) but well-structured: shared utilities have unit tests, the sync engine's new classification and batching paths are tested, and the resolver offload logic includes thorough test coverage. The two observations are minor: the `takeIndexableWithinCap` boundary behaviour is explicitly documented as intentional, and the `storageKey` null assumption holds because `addDocument` uploads before committing the row. No correctness regression or data-loss path was identified.
`apps/sim/lib/knowledge/connectors/sync-engine.ts` and `apps/sim/connectors/utils.ts` warrant a second read: the new `takeIndexableWithinCap` boundary behaviour means oversized files that appear after the indexable quota is saturated are still silently excluded from the failed-row surface.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Connector listDocuments] --> B{size reported?}
B -- yes, size > cap --> C[stubOrSkipBySize → markSkipped\nskippedReason set, contentDeferred=false]
B -- no size or within cap --> D[Normal stub\ncontentDeferred=true]
C --> E[takeIndexableWithinCap\nSkipped items ride along,\nnot counted against quota]
D --> E
E --> F[classifyExternalDoc]
F -- skip, no existing row --> G[skip op → skipDocuments\nbulk-insert failed rows]
F -- skip, existing row --> H[unchanged ++\nlast-known-good kept]
F -- add/update --> I[contentOps]
I --> J[chunkOpsByByteBudget\ncount + byte budget]
J --> K{contentDeferred?}
K -- yes --> L[getDocument hydration]
L -- skippedReason at fetch time --> G
L -- content OK --> M[addDocument / updateDocument]
K -- no --> M
G --> N[DB: failed row\nstorageKey=null]
M --> O[DB: pending row\nstorageKey set]
O --> P[stuck-doc retry sweep\nisNotNull storageKey excludes N]
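The `chunkOpsByByteBudget` step in the flowchart (count + byte budget) could look roughly like the sketch below. This is an assumption about the general technique, not the PR's code: the `SizedOpSketch` type and the function name are hypothetical. Ops are grouped greedily so that each batch stays within both a maximum item count and a maximum total byte size, which bounds peak memory when a batch is processed concurrently.

```typescript
// Hypothetical op shape: only the reported size matters for batching.
interface SizedOpSketch {
  id: string
  sizeBytes: number
}

// Greedy batching: close the current batch when adding the next op would
// exceed either the count limit or the byte budget.
function chunkByByteBudgetSketch<T extends SizedOpSketch>(
  ops: T[],
  maxCount: number,
  maxBytes: number
): T[][] {
  const batches: T[][] = []
  let current: T[] = []
  let currentBytes = 0

  for (const op of ops) {
    const wouldOverflow =
      current.length > 0 &&
      (current.length >= maxCount || currentBytes + op.sizeBytes > maxBytes)
    if (wouldOverflow) {
      batches.push(current)
      current = []
      currentBytes = 0
    }
    // An op larger than the whole budget still gets a batch of its own.
    current.push(op)
    currentBytes += op.sizeBytes
  }
  if (current.length > 0) batches.push(current)
  return batches
}

const exampleBatches = chunkByByteBudgetSketch(
  [
    { id: 'a', sizeBytes: 60 },
    { id: 'b', sizeBytes: 50 },
    { id: 'c', sizeBytes: 10 },
    { id: 'd', sizeBytes: 100 },
    { id: 'e', sizeBytes: 5 },
  ],
  3,
  100
)
// exampleBatches groups the ids as [a], [b, c], [d], [e]
```

With a count-only limit, three files near the raised per-file cap could land in the same concurrent batch; adding the byte budget splits them apart.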
Reviews (2): Last reviewed commit: "fix accounting issue"
@greptile
bugbot run
Cursor Bugbot has reviewed your changes and found 1 potential issue.

Summary
Fixes a class of 10 MB limit failures across workflow execution and KB connectors.
- Executor: over-budget function block-output context values are offloaded to durable large-value refs (read lazily via `sim.values.read`), so a JS function can merge medium files without busting the 10 MB inter-block request-body cap (the original "Seedance" merge failure).
- Connectors: oversized files now surface as visible `failed` KB documents (with a reason) instead of being silently dropped, at listing time (GitHub/S3/Dropbox/OneDrive/SharePoint) and fetch time (GitLab/Azure/Google Drive via a shared `ConnectorFileTooLargeError`).
- Connectors: `response.text()` downloads replaced with streaming `readBodyWithLimit` (cancels past the cap; closes a Dropbox OOM/DoS gap).
- Connectors: per-file cap raised from a hardcoded 10 MB to the canonical 100 MB KB document limit (`CONNECTOR_MAX_FILE_BYTES`), except Google Drive's export path (Google's hard 10 MB export-API limit).
- Sync engine: `classifyExternalDoc` classification, bulk `skipDocuments` (failed rows, excluded from the stuck-doc retry sweep), byte-bounded batch concurrency so the raised cap can't OOM the worker, and a `metadata.fileSize ?? size` fallback so skipped rows show the real size.

Type of Change
Testing
WiP
Checklist