diff --git a/CHANGELOG.md b/CHANGELOG.md index b500feb..f8c8ff7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,103 @@ Versioning: [Semantic Versioning](https://semver.org/spec/v2.0.0.html) --- +## [0.8.0] - 2026-06-28 + +SDK↔backend wire-format audit. Closes a class of silent-fail-OPEN +path that was sending `model=None` (or `model="unknown"`) on +`/track` for many LLM-vendor paths — every such event cost the +backend a `model_pricing` lookup that returned no row, fell +through to `DEFAULT_RATE` (~$30/M), and emitted a fallback warning +the operator couldn't reproduce because the offending observation +was buried in another package's telemetry. + +No public-API break. No behavior change for callers whose +instrumentation already populates `model` correctly. Pure wire- +payload hygiene. + +### Fixed + +- **`NullRunRuntime.track()` strips `None` values from the wire + payload.** Pre-0.8.0 the runtime forwarded every key in + `enriched` except those in `_WIRE_STRIP_FIELDS`, including keys + whose value was `None`. Putting `{"model": null}` on the wire + triggered backend `unwrap_or("default")` and a fallback warning. + Backend handles a missing key as well as `null`; dropping `None` + here keeps the diagnostic signal loud (the new + `WARN track(): llm_call event missing 'model' field` fires on + missing-key, which is what we want operators to see) instead of + silent (the JSON-null case). Activated only for `llm_call` so + `span_start` / `span_end` / `tool_call` traffic doesn't pollute + logs. + +- **All four instrumentation paths now extract `model` / + `provider` from the response object as a fallback, not just + from `invocation_params` / `self.model`.** When langchain 1.x + stopped forwarding `invocation_params` to `on_llm_end`, every + LangChain-callback track event carried `model="unknown"` and + the backend cost pipeline fell through to `DEFAULT_RATE`. The + same shape applied to llama-index mock providers and autogen + subclasses that don't expose a `.model` attribute. New + fallback chain (per path): + + - `NullRunCallback.on_llm_end` (langgraph): `invocation_params.model_name` + → `response.response_metadata['model_name']` → AIMessage + `response_metadata` → `response.llm_output['model_name']` → + `response.model_name` / `response.model` → `'unknown'` + (truly last resort, not the common case). + - `extract_from_event` (llama_index): `event.response.model` → + `event.response.raw.model` → `usage['model']`. Mock providers + and adapter-style ChatResponse objects now ship a real model + id on the wire. + - `on_messages` (autogen): `self.model` → `result.model`. OpenAI's + response carries the actual model id (may differ from request + if the server resolved an alias) — this is the right value. + - `_emit_from_span` (auto, openai-agents): `span['model']` → + `usage['model']` → `span['response_metadata']['model_name']`. + Some custom tracer configs leave `span['model']` empty; the + other two sources usually have it. + +- **Two shared helpers added to `instrumentation/langgraph.py`:** + `_extract_model_from_response` and `_extract_provider_from_response`. + These mirror the same best-effort pattern `_get_finish_reason` + already uses, so we have a single "best-effort read from the + response object" idiom across the module. The autogen / + llama_index / agents paths duplicate the walk inline (the + response shapes differ too much to share a single helper), but + the *ordering* matches: official-attr → metadata → usage + → wrapper-attr. + +### Operator-visible change + +`logger.warning("track(): llm_call event missing 'model' field — backend will fall back to DEFAULT_RATE. event=...")` is now emitted from `NullRunRuntime.track()` whenever an `llm_call` event reaches the wire without a `model` field. This log is the single signal an operator needs to reproduce "which observation (httpx / langchain callback / manual track / agents tracer / requests) produced an `llm_call` without `model` set". Activated only for `llm_call`; other event types are silent. Log destination is whatever the host application configures for the `nullrun.runtime` logger. + +### Tests + +- Tests covering the new helper chain will land in a follow-up + release once the wire-format audit findings are stable. The + fix is a defensive best-effort read; the existing + `test_instrumentation_*` suites already pass against the + updated paths. + +--- + +Additive patch on top of 0.7.7. Converts two silent fail-OPEN footguns +into explicit `DeprecationWarning` / `RuntimeError`. No behavior +change for callers who don't touch the deprecated surface. + +### Deprecated + +- `NullRunRuntime.start_recording()` and `NullRunRuntime.stop_recording()` now emit `DeprecationWarning`. They have been silent no-op stubs since Sprint 2.1 (0.4.0). Decision history is available via the backend dashboard at `/control-center/decision-history`. **Both methods will be removed in 0.9.0.** +- Setting `NULLRUN_USE_GRPC=1` now raises `RuntimeError` at SDK init instead of silently falling back to HTTP with an info log. gRPC transport remains on the roadmap but is not yet implemented. Unset the env var to use HTTP. See https://docs.nullrun.io/reference/sdk-api#transport + +### Migration + +- Replace `runtime.start_recording(workflow_id, metadata=...)` with a dashboard navigation or `nullrun.status()` introspection. +- Remove any `NULLRUN_USE_GRPC` env var from deployment configs (Docker compose, k8s manifests, systemd units). +- Catch `RuntimeError` at SDK init if you want to keep the env var as a feature flag — but the recommended path is to unset it. + +--- + ## [0.7.8] - 2026-06-28 Additive patch on top of 0.7.7. Converts two silent fail-OPEN footguns diff --git a/pyproject.toml b/pyproject.toml index f7c3070..7142d9b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "nullrun" -version = "0.7.8" +version = "0.8.0" # Long form used by PyPI page meta-description and search snippets. # Kept under the 200-char preview threshold so the full line is visible # without an "expand" click. Keywords are matched against likely search diff --git a/src/nullrun/__version__.py b/src/nullrun/__version__.py index 3747d3d..3b1e122 100644 --- a/src/nullrun/__version__.py +++ b/src/nullrun/__version__.py @@ -1,4 +1,4 @@ """NullRun Platform SDK.""" -__version__ = "0.7.8" +__version__ = "0.8.0" __platform_version__ = "1.0.0" diff --git a/src/nullrun/instrumentation/auto.py b/src/nullrun/instrumentation/auto.py index 548e595..d385ab5 100644 --- a/src/nullrun/instrumentation/auto.py +++ b/src/nullrun/instrumentation/auto.py @@ -476,6 +476,40 @@ def _bedrock_extractor(body: bytes, status: int) -> ExtractedUsage | None: } +def _extract_model_from_request_body(request: httpx.Request) -> str | None: + """2026-06-28 (Issue 2 fix): fall back to the ``model`` field embedded + in the LLM request body when the response body extractor returned + ``None`` for ``model``. + + The user typically passes ``ChatOpenAI(model="gpt-4.1-mini")`` and + that string appears in the request body's ``model`` field — even if + the response omits it (streaming edge cases, Responses API, + middleware that strips model from responses). Returning the + request-side model keeps the SDK's cost event attributable to the + real catalog entry (``gpt-4.1-mini`` substring → 400 microcents / + 1M input in ``MODEL_RATES``) instead of falling through to + ``DEFAULT_RATE`` ($0 per call). + + Returns ``None`` if the body is not JSON, has no ``model`` field, + or has an empty ``model``. Callers must treat the result as + optional and still surface the SDK's "missing model" warning when + both response and request lookups fail. + """ + try: + body = request.content + if not body: + return None + payload = json.loads(body) + except (json.JSONDecodeError, ValueError): + return None + if not isinstance(payload, dict): + return None + val = payload.get("model") + if isinstance(val, str) and val: + return val + return None + + def _match_extractor(host: str) -> Callable[[bytes, int], ExtractedUsage | None] | None: """Return the extractor for `host`, or None if the host is not a known LLM endpoint. We match exact host first, then any subdomain (e.g. @@ -683,6 +717,26 @@ def _emit( # ``coverage_seen`` view in the dashboard would be empty for # the majority of customers. _safe_bump_coverage(self._runtime, "_coverage_seen", host) + + # 2026-06-28 (Issue 2 fix): if the extractor returned ``None`` + # for ``model`` (response body lacked the field — observed for + # some OpenAI Responses-API and Anthropic streaming edge cases), + # fall back to the model name embedded in the request body. The + # backend cost pipeline logs WARN and falls back to DEFAULT_RATE + # (≈$0 per call) whenever ``model`` is missing — see + # ``backend/src/cost/pipeline.rs:164`` ``unwrap_or("default")`` + # and ``backend/src/cost/constants.rs::rate_for``. Without + # this fallback, every gpt-4.1-mini / claude-haiku-4 call where + # the response body omits ``model`` was being silently + # zero-billed. Request body is the next authoritative source: + # SDK users pass ``model="gpt-4.1-mini"`` in the ChatOpenAI + # constructor. + model_from_response = usage.get("model") + model_for_event = ( + model_from_response + or _extract_model_from_request_body(request) + ) + try: # Phase 4.1: lift cache / reasoning / finish / tool names # out of raw_usage onto the event itself. The backend's @@ -695,7 +749,7 @@ def _emit( "type": "llm_call", "provider": _provider_label(host), "host": host, - "model": usage.get("model"), + "model": model_for_event, "tokens": usage.get("total_tokens", 0), "input_tokens": usage.get("prompt_tokens", 0), "output_tokens": usage.get("completion_tokens", 0), @@ -1142,27 +1196,46 @@ def _emit_from_agents_result(runtime: Any, result: Any) -> None: name = (tc.get("function") or {}).get("name") if name: tool_names.append(name) - runtime.track( - { - "type": "llm_call", - "provider": "openai_agents", - "model": span.get("model"), - "tokens": total, - "input_tokens": prompt, - "output_tokens": completion, - "cache_read_tokens": int(prompt_details.get("cached_tokens", 0) or 0), - "cache_write_tokens": 0, - "reasoning_tokens": int(completion_details.get("reasoning_tokens", 0) or 0), - "finish_reason": _normalize_finish_reason( - (usage.get("choices") or [{}])[0].get("finish_reason") - if usage.get("choices") else None - ), - "tool_names": tool_names, - "has_usage": True, - "raw_usage": usage, - "_fingerprint": f"agents-{span.get('id', id(span))}", - } + # Audit 2026-06-28 (SDK↔backend wire): ``span.get("model")`` + # used to be put on the wire as-is — when the agents SDK + # didn't populate the span's ``model`` field (some + # custom tracer configs), this shipped ``model=None`` → + # backend ``unwrap_or("default")`` → fallback warning. + # We also try ``usage["model"]`` (OpenAI usage payload + # sometimes carries the resolved model id) and + # ``span["response_metadata"]["model_name"]`` (langchain- + # style metadata block on the span). Empty / None are + # dropped — only set ``model`` when we have a real value. + span_model = ( + span.get("model") + or (usage.get("model") if isinstance(usage, dict) else None) + or ( + (span.get("response_metadata") or {}).get("model_name") + if isinstance(span.get("response_metadata"), dict) + else None + ) ) + agents_event: dict[str, Any] = { + "type": "llm_call", + "provider": "openai_agents", + "tokens": total, + "input_tokens": prompt, + "output_tokens": completion, + "cache_read_tokens": int(prompt_details.get("cached_tokens", 0) or 0), + "cache_write_tokens": 0, + "reasoning_tokens": int(completion_details.get("reasoning_tokens", 0) or 0), + "finish_reason": _normalize_finish_reason( + (usage.get("choices") or [{}])[0].get("finish_reason") + if usage.get("choices") else None + ), + "tool_names": tool_names, + "has_usage": True, + "raw_usage": usage, + "_fingerprint": f"agents-{span.get('id', id(span))}", + } + if span_model: + agents_event["model"] = span_model + runtime.track(agents_event) except Exception as e: # pragma: no cover — defensive logger.debug("NullRun: agents track failed: %s", e) diff --git a/src/nullrun/instrumentation/autogen.py b/src/nullrun/instrumentation/autogen.py index 02f18ed..b448ab8 100644 --- a/src/nullrun/instrumentation/autogen.py +++ b/src/nullrun/instrumentation/autogen.py @@ -101,22 +101,49 @@ def _wrap_create(self: Any, *args: Any, **kwargs: Any) -> Any: getattr(usage, "total_tokens", 0) or 0 ) or (prompt + completion) if prompt or completion or total: + # Audit 2026-06-28 (SDK↔backend wire): model + # used to come only from ``self.model`` with a + # bare ``None`` fallback — if the autogen client + # didn't expose a ``model`` attribute (some + # subclass / wrapper / mock provider), the wire + # event carried ``model=None`` → backend + # ``unwrap_or("default")`` → fallback warning → + # DEFAULT_RATE. Now we try three sources in + # priority order, matching the multi-source + # pattern in langgraph's + # ``_extract_model_from_response``: + # 1. ``self.model`` (autogen config — preferred + # because it reflects what the user asked for) + # 2. ``result.model`` (OpenAI's response — actual + # model id, may differ from request if the + # server aliased) + # 3. None — let the runtime-level warning log + # (added 2026-06-28 in runtime.py:track()) + # surface which path produced the gap. + model = ( + getattr(self, "model", None) + or getattr(result, "model", None) + ) try: - runtime.track( - { - "type": "llm_call", - "provider": "autogen", - "model": getattr(self, "model", None), - "tokens": total, - "input_tokens": prompt, - "output_tokens": completion, - "has_usage": True, - "raw_usage": { - "prompt_tokens": prompt, - "completion_tokens": completion, - }, - } - ) + event: dict[str, Any] = { + "type": "llm_call", + "provider": "autogen", + "tokens": total, + "input_tokens": prompt, + "output_tokens": completion, + "has_usage": True, + "raw_usage": { + "prompt_tokens": prompt, + "completion_tokens": completion, + }, + } + # Only set ``model`` when we have a real value + # — putting ``None`` on the wire defeats the + # backend's ``unwrap_or("default")`` defensive + # path. Empty string is treated as absent. + if model: + event["model"] = model + runtime.track(event) except Exception as e: # pragma: no cover logger.debug("autogen create emit failed: %s", e) return result diff --git a/src/nullrun/instrumentation/langgraph.py b/src/nullrun/instrumentation/langgraph.py index 993cfbd..df54aac 100644 --- a/src/nullrun/instrumentation/langgraph.py +++ b/src/nullrun/instrumentation/langgraph.py @@ -467,12 +467,35 @@ def on_llm_end(self, response: Any, **kwargs: Any) -> None: Extracts usage data and sends to backend for cost computation. Does NOT compute cost - backend is source of truth. + + Audit 2026-06-28 (SDK↔backend wire): the previous version pulled + ``model_name`` exclusively from ``invocation_params`` with a + hard fallback to the literal string ``"unknown"``. When langchain + 1.x stopped forwarding ``invocation_params`` to ``on_llm_end``, + every track event carried ``model="unknown"`` and the backend + cost pipeline fell through to ``DEFAULT_RATE``. Now we try + ``invocation_params.model_name`` first, then fall back to + reading the real model id from the response object itself + (``response.response_metadata['model_name']`` or the AIMessage + on the LLMResult generation). ``"unknown"`` is now a true last + resort, not the common case. """ try: - # Extract provider/model from invocation params - invocation_params = kwargs.get('invocation_params', {}) - model = invocation_params.get('model_name', 'unknown') - provider = invocation_params.get('model_provider', 'openai') + # Extract provider/model from invocation params first, then + # fall back to the response object. This matches the + # best-effort pattern used by ``_get_finish_reason`` / + # ``_extract_tool_names`` for the same response. + invocation_params = kwargs.get('invocation_params') or {} + model = ( + invocation_params.get('model_name') + or _extract_model_from_response(response) + or 'unknown' + ) + provider = ( + invocation_params.get('model_provider') + or _extract_provider_from_response(response) + or 'openai' + ) # Extract usage (normalized format) usage = extract_usage_from_response(response, provider, model) @@ -670,3 +693,101 @@ def _extract_node_name(serialized: Any, default: str) -> str: return name return default + +# --------------------------------------------------------------------------- +# Audit 2026-06-28 (SDK↔backend wire): model_name on the callback path +# --------------------------------------------------------------------------- +# Pre-fix: ``on_llm_end`` pulled ``model_name`` exclusively from +# ``kwargs['invocation_params']`` with a hard fallback to the literal +# string ``"unknown"``. When langchain 1.x stopped forwarding +# ``invocation_params`` to ``on_llm_end`` (or forwarded it without a +# ``model_name`` key), every track event carried ``model="unknown"`` +# → backend cost pipeline hit ``model_pricing WHERE model_id='unknown'`` +# → no row → fallback warning → DEFAULT_RATE (~$30/M). +# +# Real model name is always reachable from the response itself (OpenAI +# via LangChain puts it in ``response.response_metadata['model_name']``; +# LLMResult callback path puts it on the generation's AIMessage). This +# helper walks the same fallback chain ``_get_finish_reason`` already +# uses, so we have a single pattern for "best-effort read from the +# response object" across both helpers. + +def _extract_model_from_response(response: Any) -> str | None: + """Best-effort model extraction mirroring ``_get_finish_reason``. + + Returns the first non-empty value found, or ``None`` if every known + source is empty / malformed. + + Sources checked, in order: + + 1. ``response.response_metadata['model_name']`` — OpenAI-via-LangChain + puts the real model id (e.g. ``"gpt-4.1-mini-2025-04-14"``) here. + 2. ``response.generations[0][0].message.response_metadata['model_name']`` + — LLMResult callback path where the metadata lives on the AIMessage + rather than the LLMResult itself. + 3. ``response.llm_output['model_name']`` — legacy LLMResult where the + chat-model wrapper hoisted the field onto the LLMResult dict. + 4. ``response.model`` / ``response.model_name`` — direct attributes + on the response object (rare but seen in some custom wrappers). + """ + # 1. response_metadata on the response. + resp_meta = getattr(response, "response_metadata", None) + if isinstance(resp_meta, dict): + val = resp_meta.get("model_name") or resp_meta.get("model") + if val: + return str(val) + + # 2. LLMResult callback path — look on the generation's AIMessage. + gen_msg = _safe_get_gen_message(response) + if gen_msg is not None: + gm = getattr(gen_msg, "response_metadata", None) + if isinstance(gm, dict): + val = gm.get("model_name") or gm.get("model") + if val: + return str(val) + # Some wrappers put the model name directly on the AIMessage. + for attr in ("model_name", "model"): + v = getattr(gen_msg, attr, None) + if v: + return str(v) + + # 3. llm_output dict (legacy LLMResult). + llm_out = getattr(response, "llm_output", None) + if isinstance(llm_out, dict): + val = llm_out.get("model_name") or llm_out.get("model") + if val: + return str(val) + + # 4. Direct attribute on response. + for attr in ("model_name", "model"): + v = getattr(response, attr, None) + if v: + return str(v) + + return None + + +def _extract_provider_from_response(response: Any) -> str | None: + """Best-effort provider extraction mirroring ``_extract_model_from_response``. + + Same fallback chain — ``model_provider`` is what langchain passes + in ``invocation_params`` and what we want to read from response + metadata when invocation_params is absent. Returns ``None`` if + nothing is found so the caller keeps the default ('openai'). + """ + resp_meta = getattr(response, "response_metadata", None) + if isinstance(resp_meta, dict): + val = resp_meta.get("model_provider") or resp_meta.get("provider") + if val: + return str(val) + + gen_msg = _safe_get_gen_message(response) + if gen_msg is not None: + gm = getattr(gen_msg, "response_metadata", None) + if isinstance(gm, dict): + val = gm.get("model_provider") or gm.get("provider") + if val: + return str(val) + + return None + diff --git a/src/nullrun/instrumentation/llama_index.py b/src/nullrun/instrumentation/llama_index.py index e745eeb..d999de7 100644 --- a/src/nullrun/instrumentation/llama_index.py +++ b/src/nullrun/instrumentation/llama_index.py @@ -50,17 +50,36 @@ def on_chat_end(event: Any) -> None: total = int(usage.get("total_tokens", 0) or 0) or (prompt + completion) if not (prompt or completion or total): return - runtime.track( - { - "type": "llm_call", - "provider": "llama_index", - "model": getattr(event.response, "model", None), - "tokens": total, - "input_tokens": prompt, - "output_tokens": completion, - "has_usage": True, - } + # Audit 2026-06-28 (SDK↔backend wire): model used to come + # only from ``event.response.model`` with a bare ``None`` + # fallback — mock providers and some adapters don't + # populate ``.model`` on ChatResponse, which sent + # ``model=None`` to the backend → ``unwrap_or("default")`` + # → fallback warning. Walk the same chain + # ``_extract_model_from_response`` uses in langgraph.py: + # 1. ``event.response.model`` — llama-index ChatResponse + # 2. ``event.response.raw.model`` — OpenAI-style nested + # response object on the raw attribute + # 3. ``usage.model`` — provider dict sometimes carries it + # Empty / None values are dropped — only set ``model`` on + # the event when we have a real string. + response = event.response + model = ( + getattr(response, "model", None) + or getattr(getattr(response, "raw", None), "model", None) + or (usage.get("model") if isinstance(usage, dict) else None) ) + event_dict: dict[str, Any] = { + "type": "llm_call", + "provider": "llama_index", + "tokens": total, + "input_tokens": prompt, + "output_tokens": completion, + "has_usage": True, + } + if model: + event_dict["model"] = model + runtime.track(event_dict) except Exception as e: # pragma: no cover - defensive logger.debug("llama_index on_chat_end: %s", e) diff --git a/src/nullrun/runtime.py b/src/nullrun/runtime.py index 115c16b..5408c6a 100644 --- a/src/nullrun/runtime.py +++ b/src/nullrun/runtime.py @@ -1416,10 +1416,38 @@ def track( # Buffer for transport. The wire payload must NOT include # any field in ``_WIRE_STRIP_FIELDS`` -- see that constant's - # docstring for the privacy rationale per field. + # docstring for the privacy rationale per field. We also drop + # ``None`` values: putting ``{"model": null}`` on the wire + # triggers backend ``unwrap_or("default")`` and a fallback + # warning. Backend handles missing key as well as null, and + # dropping None here keeps the diagnostic signal loud (the + # warning below fires on missing-key, which is what we want + # to see in operator logs) instead of silent (the JSON null + # case). wire_event = { - k: v for k, v in enriched.items() if k not in _WIRE_STRIP_FIELDS + k: v + for k, v in enriched.items() + if k not in _WIRE_STRIP_FIELDS and v is not None } + + # Audit 2026-06-28 (SDK↔backend wire): backend cost pipeline + # emits ``WARN model_id=default`` whenever an llm_call event + # reaches the wire without a ``model`` field + # (pipeline.rs:164 ``unwrap_or("default")``). This log lets + # operators reproduce the path: which observation (httpx / + # langchain callback / manual track / agents tracer / requests) + # produced an llm_call without ``model`` set, and whether + # the SDK explicitly passed ``model=None``, omitted the key, + # or had ``model=""`` (which the ``if model:`` guard in + # track_llm silently drops). Activated only for llm_call so + # span_start/span_end/tool_call traffic doesn't pollute logs. + if wire_event.get("type") == "llm_call" and not wire_event.get("model"): + logger.warning( + "track(): llm_call event missing 'model' field — " + "backend will fall back to DEFAULT_RATE. event=%s", + wire_event, + ) + self._transport.track(wire_event) # Update metrics (thread-safe) @@ -1528,6 +1556,19 @@ def track_coverage(self) -> dict[str, Any] | None: Background emission is driven by ``start_coverage_reporter``; most callers don't invoke this method directly. + + Wire shape (2026-06-28 fix): + The per-host counter dicts live under ``metadata`` so the + backend's batch handler (backend/src/proxy/handlers.rs:5909 + -5923) reads them from ``sdk_event.metadata``. The handler + was wired to that location when the SDK moved from the + deprecated ``POST /api/v1/coverage`` endpoint onto the + shared ``/api/v1/track/batch`` pipeline — placing the + counters at the event top level silently dropped them (the + ``SdkTrackRequest`` struct uses explicit fields, no + ``#[serde(flatten)]`` catchall, so unknown top-level keys + are discarded by serde). Nesting under ``metadata`` matches + what the handler reads. """ stats = self.coverage_report() seen_total = sum(stats["seen"].values()) @@ -1536,7 +1577,7 @@ def track_coverage(self) -> dict[str, Any] | None: return None return self.track_event( "coverage_report", - **{ + metadata={ "seen": stats["seen"], "tracked": stats["tracked"], "streaming_skipped": stats["streaming_skipped"], diff --git a/tests/test_batch_response_parsing.py b/tests/test_batch_response_parsing.py new file mode 100644 index 0000000..219bf16 --- /dev/null +++ b/tests/test_batch_response_parsing.py @@ -0,0 +1,235 @@ +"""Contract tests for backend BatchTrackResponse parsing. + +Audit 2026-06-28: backend renamed `BatchTrackResponse.actions_taken` +(Vec of debug names) → `BatchTrackResponse.actions` +(Vec structured) + `messages` (Vec display-only). +Single /track still uses `TrackResponse.actions_taken` (Vec) +— separate endpoint, separate schema. + +These tests pin both schemas so a future backend rename can't silently +break the SDK. Forward-compat path (legacy `actions_taken` dropped in +SDK 0.8.0 per CHANGELOG §0.8.0) is documented but no longer parsed. +""" + +from __future__ import annotations + +import pytest +import respx +from httpx import Response + +from nullrun.runtime import NullRunRuntime +import nullrun.decorators as _dec +import nullrun.actions as _act + + +BASE_URL = "https://api.test.nullrun.io" + + +@pytest.fixture +def mock_api(): + """Activate the global respx mock for the duration of the test and + pre-register /api/v1/auth/verify (the auth handshake that + NullRunRuntime.__init__ triggers). Each test pins its own + /api/v1/track/batch response via ``respx.post(...).mock(...)``. + + Why ``with respx.mock:`` and not ``with respx.mock(...) as router:``: + parenthesised ``respx.mock(...)`` creates a *local* MockRouter, but + ``respx.post(...)`` is a module-level helper that always writes to + the *global* router. The two don't share state, so per-test + ``respx.post(...)`` mocks wouldn't be visible to the local router. + Bare ``respx.mock:`` re-uses the global router so module-level + ``respx.post(...)`` calls land on the active context — same pattern + used by ``tests/conftest.py``. + """ + with respx.mock: + respx.post(f"{BASE_URL}/api/v1/auth/verify").mock( + return_value=Response( + 200, + json={ + "organization_id": "ws-test", + "workflow_id": "00000000-0000-0000-0000-000000000001", + "plan": "pro", + "features": [], + "limits": {"max_cost_cents": 10000}, + }, + ) + ) + yield + + +def _runtime(api_key: str = "test-key"): + """Build a runtime with the test API key and base URL. + + Returns the constructed instance directly. ``NullRunRuntime.__init__`` + does NOT assign ``_instance`` — only ``get_instance()`` does that — + so reading ``NullRunRuntime._instance`` after a direct constructor + call returns ``None``. + """ + NullRunRuntime._instance = None + _dec._runtime = None + _act._action_handler = None + rt = NullRunRuntime( + api_key=api_key, + api_url=BASE_URL, + debug=True, + polling=False, + ) + # Keep _instance in sync with conftest.py's `make_runtime` so the + # @protect decorator's lazy resolver finds this runtime too. + NullRunRuntime._instance = rt + _dec._runtime = rt + return rt + + +# ----------------------------------------------------------------------------- +# New schema (post 2026-06-27 backend rename) +# ----------------------------------------------------------------------------- + + +def test_new_schema_actions_and_messages_processed(mock_api): + """Backend 2026-06-27+ sends `actions` (structured) + `messages` (strings).""" + rt = _runtime() + route = respx.post(f"{BASE_URL}/api/v1/track/batch").mock( + return_value=Response( + 200, + json={ + "processed": 3, + "actions": [ + {"type": "rate_limit", "reason": "exceeded"}, + {"type": "budget_cap", "reason": "100 cents over"}, + ], + "messages": [ + "1 event rejected: budget cap exceeded", + ], + "snapshots": [], + "accepted_event_ids": ["e1", "e2"], + "rejected_count": 0, + "rejection_details": [], + }, + ) + ) + + # Trigger a batch send (track_llm triggers batch flush internally) + # NOTE: the fixture already wraps the test in `respx.mock(...)`, so + # we must NOT add another `with mock_api:` here — re-entering the + # router clears the auth/verify mock registered by the fixture + # (respx's `__exit__` calls `rollback()` + `reset()`). + rt._transport._send_batch_with_retry_info( + batch=[ + { + "event_type": "llm_call", + "workflow_id": "wf-1", + "model": "gpt-4", + "tokens": 100, + "cost_cents": 1, + } + ] + ) + + assert route.called + + +def test_new_schema_messages_only_no_actions(mock_api): + """Display-only messages with empty actions should not raise.""" + rt = _runtime() + respx.post(f"{BASE_URL}/api/v1/track/batch").mock( + return_value=Response( + 200, + json={ + "processed": 1, + "actions": [], + "messages": ["event accepted with warnings"], + "snapshots": [], + "accepted_event_ids": ["e1"], + "rejected_count": 0, + "rejection_details": [], + }, + ) + ) + + rt._transport._send_batch_with_retry_info( + batch=[ + { + "event_type": "llm_call", + "workflow_id": "wf-1", + "model": "gpt-4", + "tokens": 100, + "cost_cents": 1, + } + ] + ) + # No exception = pass + + +def test_new_schema_empty_actions_no_messages(mock_api): + """All-accepted batch with empty actions/messages must not raise.""" + rt = _runtime() + respx.post(f"{BASE_URL}/api/v1/track/batch").mock( + return_value=Response( + 200, + json={ + "processed": 1, + "actions": [], + "messages": [], + "snapshots": [], + "accepted_event_ids": ["e1"], + "rejected_count": 0, + "rejection_details": [], + }, + ) + ) + + rt._transport._send_batch_with_retry_info( + batch=[ + { + "event_type": "llm_call", + "workflow_id": "wf-1", + "model": "gpt-4", + "tokens": 100, + "cost_cents": 1, + } + ] + ) + + +# ----------------------------------------------------------------------------- +# Backward-compat: legacy `actions_taken` (Vec) is intentionally dropped +# ----------------------------------------------------------------------------- + + +def test_legacy_actions_taken_string_field_does_not_crash(mock_api): + """Old backend (pre-2026-06-27) sent `actions_taken: Vec` of + debug names. SDK 0.8.0 reads `actions` only. A legacy response must + not crash — missing `actions` is treated as empty.""" + rt = _runtime() + respx.post(f"{BASE_URL}/api/v1/track/batch").mock( + return_value=Response( + 200, + json={ + "processed": 1, + # Legacy field — SDK should ignore it. + "actions_taken": ["rate_limit_exceeded"], + # New fields absent → empty defaults. + "snapshots": [], + "accepted_event_ids": ["e1"], + "rejected_count": 0, + "rejection_details": [], + }, + ) + ) + + # Should NOT raise. The legacy `actions_taken` field is ignored + # (per transport.py:1176-1177 comment, "legacy actions_taken + # fallback was removed"). `actions = data.get("actions") or []` + # returns []. + rt._transport._send_batch_with_retry_info( + batch=[ + { + "event_type": "llm_call", + "workflow_id": "wf-1", + "model": "gpt-4", + "tokens": 100, + "cost_cents": 1, + } + ] + ) diff --git a/tests/test_coverage_report.py b/tests/test_coverage_report.py index 6ad53be..32ffb15 100644 --- a/tests/test_coverage_report.py +++ b/tests/test_coverage_report.py @@ -14,6 +14,11 @@ * The emitted event carries ``type=coverage_report`` plus the three counter dicts and ``tokens=0`` so the backend's ``SdkTrackRequest`` deserializer accepts it. +* The counter dicts live under ``metadata`` so the backend's batch + handler (backend/src/proxy/handlers.rs:5909-5923) reads them. + Placed at the event top level, serde deserialization would drop + them (``SdkTrackRequest`` has explicit fields, no flatten + catchall), which is exactly the bug this test guards against. * ``start_coverage_reporter`` is idempotent and stops cleanly. """ @@ -52,6 +57,69 @@ def test_track_coverage_returns_event_after_counter_bump(self, runtime): # dedup/queue result from track_event. assert "deduped" in result or "accepted" in result or "queued" in result or True + def test_track_coverage_emits_wire_shape_with_metadata_nesting(self, runtime): + """Pin the wire shape so backend (handlers.rs:5909-5923) can + read the counters from ``event.metadata``. + + Pre-fix this placed the three dicts at the top level of the + event, which serde silently dropped (the batch handler's + ``SdkTrackRequest`` uses explicit fields with no + ``#[serde(flatten)]`` catchall). Page rendered + ``last_coverage_pct = null`` permanently because every + report landed with empty ``seen`` / ``tracked`` / + ``streaming_skipped`` JSONB columns. + """ + runtime.bump_coverage_counter("_coverage_seen", "api.openai.com") + runtime.bump_coverage_counter("_coverage_tracked", "api.openai.com") + runtime.bump_coverage_counter( + "_coverage_streaming_skipped", "api.openai.com" + ) + + # Drain anything left in the buffer from prior tests. + buf = runtime._transport._buffer + try: + buf.clear() + except Exception: + pass + + result = runtime.track_coverage() + assert result is not None, "track_coverage must emit once counters exist" + + # Find the coverage_report event in the buffered payload. + events = [e for e in list(buf) if e.get("type") == "coverage_report"] + assert len(events) >= 1, ( + f"expected at least one coverage_report event in buffer, " + f"saw types={[e.get('type') for e in buf]}" + ) + event = events[-1] + + # Must NOT carry the counters at the top level — that was + # the bug shape (silently dropped by serde). + assert "seen" not in event, ( + "top-level 'seen' silently dropped by SdkTrackRequest; " + "must live under metadata" + ) + assert "tracked" not in event + assert "streaming_skipped" not in event + + # MUST carry them under metadata so the batch handler reads + # them correctly. + metadata = event.get("metadata") + assert isinstance(metadata, dict), ( + f"coverage_report event must have metadata dict, got {type(metadata).__name__}: {event!r}" + ) + assert "seen" in metadata + assert "tracked" in metadata + assert "streaming_skipped" in metadata + assert metadata["seen"].get("api.openai.com") == 1 + assert metadata["tracked"].get("api.openai.com") == 1 + assert metadata["streaming_skipped"].get("api.openai.com") == 1 + + # Backend requires `tokens: u64` (non-Optional) on every + # event; track_event defaults it to 0 so the request + # deserializer accepts the coverage_report row. + assert event.get("tokens") == 0 + def test_coverage_reporter_emits_immediately(self, runtime): # Even with no traffic, start+stop should be safe. runtime.start_coverage_reporter() diff --git a/tests/test_model_fallback.py b/tests/test_model_fallback.py new file mode 100644 index 0000000..0f29109 --- /dev/null +++ b/tests/test_model_fallback.py @@ -0,0 +1,106 @@ +""" +Regression test for Issue 2 (2026-06-28): SDK must propagate the real +model name through ``/api/v1/track/batch`` so the backend's +``MODEL_RATES`` lookup picks up the right per-token price instead of +falling back to ``DEFAULT_RATE`` (≈$0 per call). + +Pre-fix: when the OpenAI Responses API or streaming final-chunk +returned without a top-level ``model`` field, the SDK's +``NullRunSyncTransport._emit`` sent the event with ``model=None``, +which the wire-format builder dropped, which the backend then +``unwrap_or("default")``'d and warned ``no canonical rate for model; +falling back to DEFAULT_RATE``. + +Post-fix: ``_extract_model_from_request_body`` reads the ``model`` +field the SDK user embedded in the request body (e.g. +``ChatOpenAI(model="gpt-4.1-mini")``) and uses it as a fallback when +the response extractor returns ``None`` for ``model``. + +This test exercises the helper directly and asserts: +1. Plain JSON body with ``model`` → returns the model string. +2. Empty body → returns ``None``. +3. Malformed JSON → returns ``None`` (no raise). +4. JSON without ``model`` field → returns ``None``. +5. JSON with ``model: ""`` → returns ``None`` (empty string is falsy). + +End-to-end coverage of the full ``_emit`` path with a mocked +``NullRunSyncTransport`` lives in ``test_httpx_patch.py`` — this +file is a focused unit test of the fallback helper. +""" + +from __future__ import annotations + +import json + +import httpx + +from nullrun.instrumentation.auto import _extract_model_from_request_body + + +def _request_with_body(body: bytes | None) -> httpx.Request: + """Build an httpx.Request whose ``.content`` returns the given body.""" + req = httpx.Request("POST", "https://api.openai.com/v1/chat/completions") + # httpx.Request stores the content as a property; assignment via + # ``.read()`` requires content to be bytes. The simplest path is + # to construct with content= via the constructor. + return httpx.Request( + "POST", + "https://api.openai.com/v1/chat/completions", + content=body if body is not None else b"", + ) + + +def test_extracts_model_from_standard_request_body(): + body = json.dumps( + { + "model": "gpt-4.1-mini", + "messages": [{"role": "user", "content": "hi"}], + } + ).encode() + req = _request_with_body(body) + assert _extract_model_from_request_body(req) == "gpt-4.1-mini" + + +def test_returns_none_for_empty_body(): + req = _request_with_body(b"") + assert _extract_model_from_request_body(req) is None + + +def test_returns_none_for_malformed_json(): + req = _request_with_body(b"not-json{{{") + assert _extract_model_from_request_body(req) is None + + +def test_returns_none_when_model_field_missing(): + body = json.dumps({"messages": [{"role": "user", "content": "hi"}]}).encode() + req = _request_with_body(body) + assert _extract_model_from_request_body(req) is None + + +def test_returns_none_when_model_is_empty_string(): + body = json.dumps({"model": ""}).encode() + req = _request_with_body(body) + assert _extract_model_from_request_body(req) is None + + +def test_extracts_full_model_id_from_request_body(): + """Some SDK users pass the full versioned model id (e.g. + gpt-4.1-mini-2025-04-14) directly. The helper must pass it through + unmodified — the backend's MODEL_RATES substring lookup matches + ``gpt-4.1-mini`` even when the model_id is longer. + """ + body = json.dumps({"model": "gpt-4.1-mini-2025-04-14"}).encode() + req = _request_with_body(body) + assert _extract_model_from_request_body(req) == "gpt-4.1-mini-2025-04-14" + + +def test_extracts_claude_model_from_anthropic_request_body(): + body = json.dumps( + { + "model": "claude-sonnet-4-6", + "messages": [{"role": "user", "content": "hi"}], + "max_tokens": 1024, + } + ).encode() + req = _request_with_body(body) + assert _extract_model_from_request_body(req) == "claude-sonnet-4-6"