Disable crewai shutdown telemetry hang (ar-r82f.21) #21
Merged
…(ar-r82f.21)
crewai 1.14.6 transitively launches daemon threads from chromadb and posthog that don't honor any documented opt-out env var at shutdown. Result: CLI commands hang ~5 minutes on interpreter exit after running.
Discovered 2026-05-29 during EOD smoke test (ar-r82f.15). Workaround verified to drop shutdown time from ~5 min to <2s.
Two pieces:
- _telemetry_shim.py: sets 8 documented opt-out env vars at import time, before crewai/chromadb/posthog get loaded transitively
- CLI main: wraps typer app() with os._exit(0) to bypass any lingering atexit handlers that snuck through
The shim is conservative: env-var-only, no monkey-patching. The os._exit(0) wrapper only applies to the CLI entry point (uvicorn/FastAPI servers stay on their normal exit path).
bead: ar-r82f.21
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
atc964 added a commit that referenced this pull request Jun 24, 2026
PR #15's CI was hitting a 10min post-pytest interpreter hang even after PR #21's telemetry shim landed. The shim handles CLI shutdown (os._exit after typer returns), but pytest exits normally and triggers chromadb/posthog atexit handlers from transitive deps.
This hook fires after pytest's reporting is complete (so test results still print correctly) and hard-exits with the test session's exit status. It is the same pattern as the CLI wrapper, applied to the pytest entry point.
Verified locally: pytest tests/unit/test_storage.py exits in <2s instead of hanging. Should drop the seller CI Test job from ~10min timeout to ~2-3min like buyer-agent.
bead: ar-r82f.16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
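For illustration, a hook of this kind might look like the sketch below. It is not the code from this commit: the `conftest.py` location, the `SESSION` dict, and the `hard_exit` helper are hypothetical names. The sketch records the status in `pytest_sessionfinish` and performs the hard exit in `pytest_unconfigure`, on the assumption that a plain `pytest_sessionfinish` implementation may run before the terminal summary line is printed.

```python
# conftest.py (hypothetical location; a sketch, not the hook from this commit)
import os
import sys

# Last exit status reported by pytest; a dict so hooks can mutate it in place.
SESSION = {"status": 0}


def hard_exit(status, exit_fn=os._exit):
    """Flush stdio, then terminate without running atexit handlers.

    exit_fn is injectable only so the helper can be exercised in a test.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    exit_fn(int(status))


def pytest_sessionfinish(session, exitstatus):
    # Only record the status here; exiting at this point could cut off reporting.
    SESSION["status"] = int(exitstatus)


def pytest_unconfigure(config):
    # Runs at the very end of the pytest run; exit with the recorded status.
    hard_exit(SESSION["status"])
```

`os._exit` does not flush stdio buffers, which is why the helper flushes `stdout` and `stderr` before exiting.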
atc964 added a commit that referenced this pull request Jun 24, 2026
* Fix CI Test job to fail on pytest failures
The previous workflow wrapped pytest in `timeout 60s ... && exit 0 on code
124/137`, which swallowed exit code 124 (timeout-killed) as success.
In practice, pytest itself completes quickly (~9s) but the python interpreter
hangs ~50s during shutdown cleaning up crewai telemetry threads. The 60s
shell `timeout` was firing during that shutdown hang, killing the interpreter
with SIGTERM, returning 124, which the workflow then mapped to exit 0 even
though pytest had reported test failures.
Symptom: run 26593648988 reported "5 failed, 709 passed" but the Test job
concluded SUCCESS.
Fix: drop the shell timeout wrapper entirely. Use pytest-timeout for per-test
deadlines (30s unit, 60s integration) and let pytest's exit code propagate
to GitHub Actions directly.
bead: ar-r82f.16
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* [poison-pill] Add deliberate failing test to verify CI surfaces failures
Temporary commit. Verifies the fix in the previous commit causes the
CI Test job to conclude FAILURE on pytest failures (previously it
silently passed). This commit will be reverted once CI is observed red.
bead: ar-r82f.16
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* Add hybrid timeout to CI: outer shell timeout + pytest-timeout thread method
The previous CI runs hung after pytest printed its summary line. The default
signal-based pytest-timeout cannot kill processes that ignore signals or are
stuck in teardown/atexit code paths. We layer two safety nets:
1. Outer `timeout --kill-after=10s 300s` ensures the shell kills the process
   group if pytest itself fails to exit (the actual failure mode in run
   26595978562: pytest summary printed, then 11min hang until cancellation).
2. Inner `--timeout=30 --timeout-method=thread` gives finer-grained per-test
   protection using thread-based termination, more reliable than signals in
   containerized CI.
No exit-code masking. Both timeouts propagate failures naturally as job
failures (exit 124 or 137).
bead: ar-r82f.16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Remove CI poison-pill test (verification complete) (ar-r82f.16)
Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>
* Add OTEL_SDK_DISABLED to suppress crewai telemetry shutdown hang (ar-r82f.16)
crewai 1.14.6 has a telemetry shutdown handler at
crewai/telemetry/telemetry.py:211 that raises RuntimeError
("cannot schedule new futures after interpreter shutdown") on clean
test exits. This causes pytest to pass in ~7s but the Python
interpreter to hang ~5min on shutdown, tripping the outer 300s
timeout and producing exit 124 / CI conclusion FAILURE despite all
tests passing.
CREWAI_TELEMETRY_OPT_OUT alone does not suppress this shutdown path.
crewai's telemetry uses OpenTelemetry under the hood, so setting
OTEL_SDK_DISABLED=true should fully disable the shutdown handler
and let the interpreter exit cleanly.
Investigation tracked in ar-r82f.21.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Bump CI timeout to 600s + add telemetry opt-out env vars (ar-r82f.16)
Pragmatic ship for the lingering interpreter-shutdown hang. Cycle 3
confirmed pytest itself completes (no traceback in the killed runs),
but the Python interpreter hangs ~5min on exit, triggering the outer
timeout 300s and producing exit 124 / FAILURE despite a clean test run.
Two-part fix:
1. Bump outer `timeout` from 300s to 600s on both pytest steps. Even
   if the interpreter still sits in atexit handlers for several
   minutes, it has room to complete naturally instead of being killed.
2. Add belt-and-suspenders telemetry opt-out env vars on both steps,
   targeting the dependencies most likely to be registering atexit
   network calls:
   - CREWAI_DISABLE_TELEMETRY / CREWAI_DISABLE_TRACKING: the other
     crewai-recognized opt-out names (CREWAI_TELEMETRY_OPT_OUT alone
     turned out to be a documented-but-unrecognized no-op).
   - ANONYMIZED_TELEMETRY: chromadb's opt-out switch.
   - POSTHOG_DISABLED: posthog (used by chromadb/crewai) opt-out.
Any one of these may shorten the hang; together with the 600s budget
the workflow should report SUCCESS on clean runs while still surfacing
real pytest failures (cycle 1 verified this works in 8.47s).
bead: ar-r82f.16
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add pytest_sessionfinish os._exit hook (ar-r82f.21 / ar-r82f.16)
PR #15's CI was hitting a 10min post-pytest interpreter hang even after
PR #21's telemetry shim landed. The shim handles CLI shutdown (os._exit
after typer returns), but pytest exits normally and triggers chromadb/
posthog atexit handlers from transitive deps.
This hook fires after pytest's reporting is complete (so test results
still print correctly) and hard-exits with the test session's exit
status — same pattern as the CLI wrapper, applied to the pytest entry
point.
Verified locally: pytest tests/unit/test_storage.py exits in <2s instead
of hanging. Should drop the seller CI Test job from ~10min timeout to
~2-3min like buyer-agent.
bead: ar-r82f.16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
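The exit-code masking fixed in the first commit of this list can be modeled in a few lines of Python. This is a toy model, not the workflow itself: the function names are hypothetical, and the only facts taken from the messages above are that a timeout kill surfaces as 124 or 137 and that the old workflow mapped those codes to 0.

```python
TIMEOUT_CODES = (124, 137)  # timeout fired / process was SIGKILLed (128 + 9)


def observed_exit_code(pytest_exit_code: int, killed_by_timeout: bool) -> int:
    """Exit code the workflow step sees from the timeout-wrapped command."""
    # If the wrapper kills the process, pytest's own exit code is lost.
    return 124 if killed_by_timeout else pytest_exit_code


def masked_job_exit(code: int) -> int:
    """Old behavior: timeout kills are rewritten to success."""
    return 0 if code in TIMEOUT_CODES else code


def propagated_job_exit(code: int) -> int:
    """Fixed behavior: the step's exit code decides the job result unchanged."""
    return code


# pytest reported failures (exit 1), then the interpreter hung and was killed.
seen = observed_exit_code(pytest_exit_code=1, killed_by_timeout=True)
old_result = masked_job_exit(seen)      # 0: job concludes SUCCESS
new_result = propagated_job_exit(seen)  # 124: job concludes FAILURE
```

The same model shows why the later commits still needed to remove the hang itself: without masking, a clean test run that is killed during shutdown also surfaces as 124.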
---------
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Bug
crewai 1.14.6 transitively imports chromadb and posthog at startup. Both
libraries register daemon threads and `atexit` handlers that do not honor
any of the libraries' documented opt-out env vars during the shutdown path.
Result: every CLI invocation hangs ~5 minutes on interpreter exit after the
command completes.
Discovered 2026-05-29 during EOD smoke test (local fix in
`smoke-fix/ar-r82f.15-2026-05-29`). The workaround drops shutdown time from
~5 min to <2 s.
Fix
Two pieces across three files, shipped as a conservative shim with no monkey-patching:
1. `src/ad_seller/_telemetry_shim.py`: env-var opt-outs at import time
Sets 8 documented opt-out env vars before crewai/chromadb/posthog get
imported transitively, so their module-level initialization takes the
disabled branch:
- `OTEL_SDK_DISABLED=true`
- `CREWAI_DISABLE_TELEMETRY=true`
- `CREWAI_DISABLE_TRACKING=true`
- `CREWAI_TELEMETRY_OPT_OUT=true`
- `ANONYMIZED_TELEMETRY=false`
- `POSTHOG_DISABLED=true`
- `CHROMA_TELEMETRY_DISABLED=true`
- `DO_NOT_TRACK=1`
Uses `os.environ.setdefault` so user-set values are never overridden.
2. `src/ad_seller/__init__.py`: import shim first
The shim is imported as the very first line of `ad_seller/__init__.py` to
guarantee it runs before any transitive crewai/chromadb/posthog import.
3. `src/ad_seller/interfaces/cli/main.py`: `os._exit(0)` wrapper
Even with the env vars set, some atexit handlers can sneak in via
transitive deps at runtime. The CLI `__main__` block wraps `app()` with
`os._exit(0)` to hard-exit after typer returns, bypassing any remaining
atexit handlers.
This only applies to the CLI entry point. uvicorn/FastAPI servers
(`interfaces/api/`) stay on their normal graceful exit path.
Why `os._exit` is needed
Once the main thread finishes, CPython's shutdown sequence calls every
registered `atexit` handler. If any handler blocks on a daemon thread that
itself is blocked waiting for a network connection (PostHog flush,
chromadb heartbeat), the process hangs indefinitely. `os._exit` skips all
atexit handlers, achieving the same result as SIGKILL without losing the
exit code.
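The difference between a normal exit and `os._exit` can be observed with a small self-contained experiment. This is a sketch for illustration only; the child script and the `run_child` helper are hypothetical and stand in for the real CLI.

```python
import subprocess
import sys

# Child program: registers an atexit handler, prints, then exits one of two ways.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("atexit handler ran"))
print("command finished")
sys.stdout.flush()
{exit_call}
"""


def run_child(exit_call: str):
    """Run the child with the given exit statement; return (code, stdout lines)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_TEMPLATE.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout.splitlines()


normal = run_child("sys.exit(3)")  # atexit handler runs before the process ends
hard = run_child("os._exit(3)")    # atexit handler is skipped, exit code kept
```

In both cases the exit code is 3, but only the `sys.exit` run prints the handler's line. The child flushes `stdout` explicitly because `os._exit` does not flush stdio buffers.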
The proper fix is crewAI honoring opt-out env vars in their shutdown path
(filed upstream). This shim is the pragmatic workaround until that lands.
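As a rough sketch of the two pieces described under Fix, assuming nothing about the actual contents of `_telemetry_shim.py` or `main.py`: the names `TELEMETRY_OPT_OUTS`, `apply_telemetry_opt_outs`, and `run_cli` are hypothetical, and `app` stands in for the typer application. The PR description shows `os._exit(0)`; passing through the CLI's own exit code is a choice made only in this sketch.

```python
import os
import sys

# The eight opt-out variables listed in the PR description.
TELEMETRY_OPT_OUTS = {
    "OTEL_SDK_DISABLED": "true",
    "CREWAI_DISABLE_TELEMETRY": "true",
    "CREWAI_DISABLE_TRACKING": "true",
    "CREWAI_TELEMETRY_OPT_OUT": "true",
    "ANONYMIZED_TELEMETRY": "false",
    "POSTHOG_DISABLED": "true",
    "CHROMA_TELEMETRY_DISABLED": "true",
    "DO_NOT_TRACK": "1",
}


def apply_telemetry_opt_outs(environ=os.environ):
    """Set each opt-out only if the user has not already set it."""
    for name, value in TELEMETRY_OPT_OUTS.items():
        environ.setdefault(name, value)
    return environ


def run_cli(app, exit_fn=os._exit):
    """Run the CLI callable, then hard-exit so atexit handlers never run."""
    code = 0
    try:
        app()
    except SystemExit as exc:
        # Sketch-only choice: keep the CLI's exit code instead of always using 0.
        if isinstance(exc.code, int):
            code = exc.code
        elif exc.code is not None:
            code = 1
    sys.stdout.flush()
    sys.stderr.flush()
    exit_fn(code)


# In a real shim this call would run at import time, before crewai is imported.
apply_telemetry_opt_outs()
```

A CLI entry point would then call `run_cli(app)` from its `__main__` block, while server entry points would skip the wrapper and exit normally.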
Scope
This PR is only the telemetry shim + force-exit wrapper.
The `memory=True` → `memory=False` bulk change is tracked separately under
bead ar-i84f.
This PR is the seller-side counterpart to the buyer-agent PR. It also
unblocks PR #15 (ar-r82f.16) which has been parked DRAFT pending this
fix landing on both sides.
Verification
bead: ar-r82f.21