Disable crewai shutdown telemetry hang (ar-r82f.21) #21

Merged
atc964 merged 2 commits into main from fix/telemetry-shim-ar-r82f-21 on Jun 24, 2026

@atc964 atc964 commented Jun 24, 2026

Bug

crewai 1.14.6 transitively imports chromadb and posthog at startup. Both
libraries register daemon threads and atexit handlers that do not honor
any of the libraries' documented opt-out env vars during the shutdown
path. Result: every CLI invocation hangs ~5 minutes on interpreter exit
after the command completes.

Discovered 2026-05-29 during EOD smoke test (local fix in
smoke-fix/ar-r82f.15-2026-05-29). The workaround drops shutdown time
from ~5 min to <2 s.

Fix

Three pieces, shipped as a conservative shim with no monkey-patching:

1. src/ad_seller/_telemetry_shim.py — env-var opt-outs at import time

Sets 8 documented opt-out env vars before crewai/chromadb/posthog get
imported transitively, so their import-time initialization code takes the
disabled branch:

| Env var | Purpose |
| --- | --- |
| OTEL_SDK_DISABLED=true | OpenTelemetry SDK |
| CREWAI_DISABLE_TELEMETRY=true | crewAI telemetry |
| CREWAI_DISABLE_TRACKING=true | crewAI tracking |
| CREWAI_TELEMETRY_OPT_OUT=true | crewAI opt-out |
| ANONYMIZED_TELEMETRY=false | chromadb |
| POSTHOG_DISABLED=true | PostHog |
| CHROMA_TELEMETRY_DISABLED=true | chromadb |
| DO_NOT_TRACK=1 | generic DNT |

Uses os.environ.setdefault so user-set values are never overridden.
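
A minimal sketch of what such a shim could look like, assuming the variable names and values from the table above. The names `_TELEMETRY_OPT_OUTS` and `apply_telemetry_opt_outs` are hypothetical and are not taken from the actual `_telemetry_shim.py`:

```python
import os
from typing import List, MutableMapping

# Opt-out variables and values as listed in the PR description.
_TELEMETRY_OPT_OUTS = {
    "OTEL_SDK_DISABLED": "true",
    "CREWAI_DISABLE_TELEMETRY": "true",
    "CREWAI_DISABLE_TRACKING": "true",
    "CREWAI_TELEMETRY_OPT_OUT": "true",
    "ANONYMIZED_TELEMETRY": "false",
    "POSTHOG_DISABLED": "true",
    "CHROMA_TELEMETRY_DISABLED": "true",
    "DO_NOT_TRACK": "1",
}


def apply_telemetry_opt_outs(
    environ: MutableMapping[str, str] = os.environ,
) -> List[str]:
    """Set each opt-out unless a value already exists; return the names newly set."""
    newly_set = []
    for name, value in _TELEMETRY_OPT_OUTS.items():
        if name not in environ:
            newly_set.append(name)
        # setdefault leaves any value the user exported untouched.
        environ.setdefault(name, value)
    return newly_set


# Runs at import time, which is the point of the shim (hypothetical layout).
apply_telemetry_opt_outs()
```

Because `setdefault` only writes missing keys, a user who exports `ANONYMIZED_TELEMETRY=True` keeps that value, while the other seven variables are still filled in.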

2. src/ad_seller/__init__.py — import shim first

The shim is imported as the very first line of ad_seller/__init__.py
to guarantee it runs before any transitive crewai/chromadb/posthog import.
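
Why the ordering matters can be shown with a small self-contained experiment. This is only an illustration: `fake_dep` and `DEMO_TELEMETRY_DISABLED` are hypothetical stand-ins for a dependency that reads its opt-out variable once, at import time, and do not correspond to real crewai, chromadb, or posthog names.

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

FLAG = "DEMO_TELEMETRY_DISABLED"  # hypothetical opt-out variable

# Stand-in dependency: it decides about telemetry once, when it is imported.
FAKE_DEP_SOURCE = (
    "import os\n"
    f"TELEMETRY_ENABLED = os.environ.get({FLAG!r}) != 'true'\n"
)


def telemetry_enabled_after_import(set_flag_first: bool) -> bool:
    """Import the stand-in dependency, setting the flag before or after the import."""
    os.environ.pop(FLAG, None)
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "fake_dep.py").write_text(FAKE_DEP_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            if set_flag_first:
                os.environ.setdefault(FLAG, "true")  # shim runs first
            fake_dep = importlib.import_module("fake_dep")
            os.environ.setdefault(FLAG, "true")  # too late if not set earlier
            return fake_dep.TELEMETRY_ENABLED
        finally:
            # Undo the experiment so repeated calls start from a clean state.
            sys.path.remove(tmp)
            sys.modules.pop("fake_dep", None)
            os.environ.pop(FLAG, None)
```

Calling `telemetry_enabled_after_import(True)` reports telemetry as disabled, while `telemetry_enabled_after_import(False)` reports it as enabled even though the variable is set immediately after the import.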

3. src/ad_seller/interfaces/cli/main.py — os._exit(0) wrapper

Even with the env vars set, some atexit handlers can sneak in via
transitive deps at runtime. The CLI __main__ block wraps app() with
os._exit(0) to hard-exit after typer returns, bypassing any remaining
atexit handlers.

This only applies to the CLI entry point. uvicorn/FastAPI servers
(interfaces/api/) stay on their normal graceful exit path.
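
A hedged sketch of a force-exit wrapper of this kind. The PR describes a plain `os._exit(0)` after `app()`; the version below is a variation that forwards the command's exit status instead of always reporting 0, which is an assumption and not the merged code. `run_then_hard_exit` and its `hard_exit` parameter are hypothetical names; the parameter exists only so the function can be exercised without killing the calling process.

```python
import os
import sys
import traceback
from typing import Callable


def run_then_hard_exit(
    app: Callable[[], object],
    hard_exit: Callable[[int], object] = os._exit,
) -> None:
    """Run the CLI callable, then leave the process without running atexit handlers."""
    code = 0
    try:
        app()
    except SystemExit as exc:
        # CLI frameworks commonly finish by raising SystemExit.
        if exc.code is None:
            code = 0
        elif isinstance(exc.code, int):
            code = exc.code
        else:
            print(exc.code, file=sys.stderr)
            code = 1
    except Exception:
        # Report the failure here, since the normal exit path is bypassed.
        traceback.print_exc()
        code = 1
    # os._exit does not flush stdio buffers, so flush explicitly first.
    sys.stdout.flush()
    sys.stderr.flush()
    hard_exit(code)
```

In a CLI module this would be called as `run_then_hard_exit(app)` from the `if __name__ == "__main__":` block, leaving server entry points on their normal exit path.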

Why os._exit is needed

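CPython runs every registered atexit handler during interpreter shutdown,
after the main thread finishes. If a handler waits on a background thread
that is itself blocked on a network connection (PostHog flush, chromadb
heartbeat), the process cannot exit until that wait ends, which is the
~5 minute hang seen here. os._exit terminates the process immediately and
skips all atexit handlers, so nothing is left to block on. Unlike an
external SIGKILL, the process still reports an exit status of its own
choosing. os._exit also skips flushing stdio buffers, so pending output
must be flushed before the call.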
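
The difference between the two exit paths can be observed with a small experiment that runs a child interpreter twice. This is an illustrative sketch: the handler only prints a line, standing in for a telemetry flush that would block in the real case.

```python
import subprocess
import sys
import textwrap

# Child program: registers an atexit handler, then exits normally or via os._exit.
CHILD = textwrap.dedent(
    """
    import atexit
    import os
    import sys

    def telemetry_flush():
        # Stand-in for a handler that would block on the network.
        print("atexit handler ran")

    atexit.register(telemetry_flush)
    print("command finished")
    if sys.argv[1] == "hard":
        sys.stdout.flush()  # os._exit does not flush stdio buffers
        os._exit(0)
    """
)


def run_child(mode: str):
    """Run the child in the given mode and return (exit code, captured stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout


normal = run_child("normal")  # handler output appears after the command output
hard = run_child("hard")      # handler never runs
```

Both runs exit with status 0, but only the normal run's output contains the handler's line. Without the explicit flush, the hard-exit run could also lose the "command finished" line, because the child's stdout is a pipe and is block-buffered.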

The proper fix is crewAI honoring opt-out env vars in their shutdown path
(filed upstream). This shim is the pragmatic workaround until that lands.

Scope

This PR is only the telemetry shim + force-exit wrapper.
The memory=True → memory=False bulk change is tracked separately under
bead ar-i84f.

This PR is the seller-side counterpart to the buyer-agent PR. It also
unblocks PR #15 (ar-r82f.16) which has been parked DRAFT pending this
fix landing on both sides.

Verification

git diff main fix/telemetry-shim-ar-r82f-21 --stat
# src/ad_seller/__init__.py            |  2 ++
# src/ad_seller/_telemetry_shim.py     | 51 +++++++
# src/ad_seller/interfaces/cli/main.py |  8 +++++-
# 3 files changed, 60 insertions(+), 1 deletion(-)

bead: ar-r82f.21

atc964 and others added 2 commits June 24, 2026 12:27
…(ar-r82f.21)

crewai 1.14.6 transitively launches daemon threads from chromadb and
posthog that don't honor any documented opt-out env var at shutdown.
Result: CLI commands hang ~5 minutes on interpreter exit after running.

Discovered 2026-05-29 during EOD smoke test (ar-r82f.15). Workaround
verified to drop shutdown time from ~5 min to <2s.

Two pieces:
- _telemetry_shim.py: sets 8 documented opt-out env vars at import time,
  before crewai/chromadb/posthog get loaded transitively
- CLI main: wraps typer app() with os._exit(0) to bypass any lingering
  atexit handlers that snuck through

The shim is conservative — env-var-only, no monkey-patching. The
os._exit(0) wrapper only applies to the CLI entry point (uvicorn/
FastAPI servers stay on their normal exit path).

bead: ar-r82f.21
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@atc964 atc964 merged commit 190941d into main Jun 24, 2026
2 checks passed
@atc964 atc964 deleted the fix/telemetry-shim-ar-r82f-21 branch June 24, 2026 17:16
atc964 added a commit that referenced this pull request Jun 24, 2026
PR #15's CI was hitting a 10min post-pytest interpreter hang even after
PR #21's telemetry shim landed. The shim handles CLI shutdown (os._exit
after typer returns), but pytest exits normally and triggers chromadb/
posthog atexit handlers from transitive deps.

This hook fires after pytest's reporting is complete (so test results
still print correctly) and hard-exits with the test session's exit
status — same pattern as the CLI wrapper, applied to the pytest entry
point.

Verified locally: pytest tests/unit/test_storage.py exits in <2s instead
of hanging. Should drop the seller CI Test job from ~10min timeout to
~2-3min like buyer-agent.

bead: ar-r82f.16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
atc964 added a commit that referenced this pull request Jun 24, 2026
* Fix CI Test job to fail on pytest failures

The previous workflow wrapped pytest in `timeout 60s ... && exit 0 on code
124/137`, which swallowed exit code 124 (timeout-killed) as success.

In practice, pytest itself completes quickly (~9s) but the python interpreter
hangs ~50s during shutdown cleaning up crewai telemetry threads. The 60s
shell `timeout` was firing during that shutdown hang, killing the interpreter
with SIGTERM, returning 124, which the workflow then mapped to exit 0 even
though pytest had reported test failures.

Symptom: run 26593648988 reported "5 failed, 709 passed" but the Test job
concluded SUCCESS.

Fix: drop the shell timeout wrapper entirely. Use pytest-timeout for per-test
deadlines (30s unit, 60s integration) and let pytest's exit code propagate
to GitHub Actions directly.

bead: ar-r82f.16

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* [poison-pill] Add deliberate failing test to verify CI surfaces failures

Temporary commit. Verifies the fix in the previous commit causes the
CI Test job to conclude FAILURE on pytest failures (previously it
silently passed). This commit will be reverted once CI is observed red.

bead: ar-r82f.16

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add hybrid timeout to CI: outer shell timeout + pytest-timeout thread method

The previous CI runs hung after pytest printed its summary line. The default
signal-based pytest-timeout cannot kill processes that ignore signals or are
stuck in teardown/atexit code paths. We layer two safety nets:

1. Outer `timeout --kill-after=10s 300s` ensures the shell kills the process
   group if pytest itself fails to exit (the actual failure mode in run
   26595978562 — pytest summary printed, then 11min hang until cancellation).
2. Inner `--timeout=30 --timeout-method=thread` gives finer-grained per-test
   protection using thread-based termination, more reliable than signals in
   containerized CI.

No exit-code masking. Both timeouts propagate failures naturally as job
failures (exit 124 or 137).

bead: ar-r82f.16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove CI poison-pill test (verification complete) (ar-r82f.16)

Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>

* Add OTEL_SDK_DISABLED to suppress crewai telemetry shutdown hang (ar-r82f.16)

crewai 1.14.6 has a telemetry shutdown handler at
crewai/telemetry/telemetry.py:211 that raises RuntimeError
("cannot schedule new futures after interpreter shutdown") on clean
test exits. This causes pytest to pass in ~7s but the Python
interpreter to hang ~5min on shutdown, tripping the outer 300s
timeout and producing exit 124 / CI conclusion FAILURE despite all
tests passing.

CREWAI_TELEMETRY_OPT_OUT alone does not suppress this shutdown path.
crewai's telemetry uses OpenTelemetry under the hood, so setting
OTEL_SDK_DISABLED=true should fully disable the shutdown handler
and let the interpreter exit cleanly.

Investigation tracked in ar-r82f.21.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Bump CI timeout to 600s + add telemetry opt-out env vars (ar-r82f.16)

Pragmatic ship for the lingering interpreter-shutdown hang. Cycle 3
confirmed pytest itself completes (no traceback in the killed runs),
but the Python interpreter hangs ~5min on exit, triggering the outer
timeout 300s and producing exit 124 / FAILURE despite a clean test run.

Two-part fix:

1. Bump outer `timeout` from 300s to 600s on both pytest steps. Even
   if the interpreter still sits in atexit handlers for several
   minutes, it has room to complete naturally instead of being killed.

2. Add belt-and-suspenders telemetry opt-out env vars on both steps,
   targeting the dependencies most likely to be registering atexit
   network calls:
     - CREWAI_DISABLE_TELEMETRY / CREWAI_DISABLE_TRACKING: the other
       crewai-recognized opt-out names (CREWAI_TELEMETRY_OPT_OUT alone
       turned out to be a documented-but-unrecognized no-op).
     - ANONYMIZED_TELEMETRY: chromadb's opt-out switch.
     - POSTHOG_DISABLED: posthog (used by chromadb/crewai) opt-out.

Any one of these may shorten the hang; together with the 600s budget
the workflow should report SUCCESS on clean runs while still surfacing
real pytest failures (cycle 1 verified this works in 8.47s).

bead: ar-r82f.16

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add pytest_sessionfinish os._exit hook (ar-r82f.21 / ar-r82f.16)

PR #15's CI was hitting a 10min post-pytest interpreter hang even after
PR #21's telemetry shim landed. The shim handles CLI shutdown (os._exit
after typer returns), but pytest exits normally and triggers chromadb/
posthog atexit handlers from transitive deps.

This hook fires after pytest's reporting is complete (so test results
still print correctly) and hard-exits with the test session's exit
status — same pattern as the CLI wrapper, applied to the pytest entry
point.

Verified locally: pytest tests/unit/test_storage.py exits in <2s instead
of hanging. Should drop the seller CI Test job from ~10min timeout to
~2-3min like buyer-agent.

bead: ar-r82f.16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>