Composable, chainable query API: pa.callables().with_decorator(...).reachable_to(...).without_passing_through(...) #155

@rahlk

Summary

Add a composable, chainable query API on top of the existing CLDK analysis façade — the equivalent of what DataFrame.groupby(...).filter(...).agg(...) is to pandas. Today, answering a security/code-analysis question with CLDK requires manually composing the results of multiple get_* calls in user code; this proposal moves that composition into the library so the queries themselves become the unit of citation.

The dream

pa.callables()
  .with_decorator("http.route", auth="public")
  .reachable_to(sink_pattern("request.env[*].sudo().*"))
  .without_passing_through(sanitizers=["check_access", "has_group"])

This single chain expresses: "every callable that is decorated as a public HTTP route, from which there exists a call/dataflow path to a sink matching this attribute-chain pattern, where no callable on the path is one of these named sanitizers."

For framework-shaped audits (web, RPC, CLI, message handlers), one such chain replaces dozens of hand-written analyses. The result is a citable fact: the chain itself is the evidence, not the prose around it.

Motivation (from a real audit)

I recently used CLDK to build a Proof-of-Exploitability report over 12 alerts in an Odoo addon, following a sources.json → reachability → taint → verdict pipeline (see the poe-with-cldk skill methodology). The most valuable thing CLDK gave me was authoritative decorator capture via pa.get_method(...).decorators — that single fact (auth='public' on a route) carried the entire severity of the top finding.

The bottleneck was composition: I had to walk get_methods() manually, apply the decorator predicate by reading each decorators list myself, run get_callers/get_callees on every match to test reachability, and then re-read the source body to check for sanitizers. That composition is the analysis. It belongs in the library, not in every user's notebook.

What the proposal contains

1. A Query / Selector chain over CLDK's existing graphs

Methods to start a query:

  • pa.callables() — every PyCallable in the analysis
  • pa.classes(), pa.fields(), pa.modules() — analogous starting points
  • pa.callsites(pattern) — every call site matching a structural pattern (sidesteps the dynamic-dispatch problem; see §3 below)

Methods to filter (return a narrowed query):

  • .with_decorator(name, **kwarg_predicates) — match decorator by name and named-argument value (auth="public")
  • .in_module(glob), .in_class(name), .in_directory(path)
  • .matching(name_pattern) — name/signature regex
  • .modified_since(commit_ref) — git-blame integration (later)
  • .not_in_tests() — convention-based test exclusion
  • .where(predicate) — escape hatch for arbitrary user predicates

Methods to relate (return a query over a related set):

  • .callers(), .callees() — one hop in the call graph
  • .transitive_callers(), .transitive_callees() — closure
  • .reachable_to(other_query, *, via=["call", "dataflow"], sanitizers=None) — path existence between two queries with optional sanitizer-mask
  • .without_passing_through(sanitizers=[...]) — narrow a reachability set by removing paths that touch any sanitizer
  • .subclasses(), .superclasses(), .implementers()
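
To make the intended semantics of .reachable_to(...) combined with .without_passing_through(...) concrete, here is a minimal sketch over a plain dict call graph. The helper name `reaches_without`, the toy graph, and every node name are hypothetical and not part of CLDK; a real implementation would sit on CLDK's own call graph.

```python
from collections import deque

def reaches_without(graph, source, sinks, sanitizers=frozenset()):
    """Return True if some call path from `source` hits a sink without
    visiting any sanitizer node (hypothetical helper, not CLDK API)."""
    if source in sanitizers:
        return False
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node in sinks:
            return True
        for callee in graph.get(node, ()):
            # Sanitizers mask every path that runs through them.
            if callee not in seen and callee not in sanitizers:
                seen.add(callee)
                queue.append(callee)
    return False

# Toy call graph: caller -> callees (all names are made up).
CALL_GRAPH = {
    "route_public": ["load_note"],
    "route_checked": ["check_access"],
    "check_access": ["load_note"],
    "load_note": ["sudo_browse"],
}

unsanitized = [
    src for src in ("route_public", "route_checked")
    if reaches_without(CALL_GRAPH, src, {"sudo_browse"}, {"check_access"})
]
```

The sketch reads "on the path" literally: a sanitizer masks a path only when it is a node on that path. A sanitizer that is called and returns before the sink is reached would need dataflow or dominance information, which this sketch does not attempt.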

Methods to terminate (return concrete results):

  • .signatures() — list of FQ names
  • .objects() — list of PyCallable / PyClass / etc.
  • .paths() — list of resolved paths (for reachability queries), each carrying a confidence/visibility label
  • .count()
  • .explain(), the most important terminator: returns the queries CLDK ran, the backends that resolved them, and the unresolved-edge set. This is what makes the result citable.
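
The start / filter / terminate split above could be shaped roughly as follows. This is a sketch under assumptions: `ToyCallable` stands in for PyCallable, its `decorators` layout is invented for illustration, and the `Query` class is not existing CLDK code.

```python
from dataclasses import dataclass

@dataclass
class ToyCallable:
    """Stand-in for PyCallable; the fields are assumptions for this sketch."""
    signature: str
    decorators: tuple = ()   # e.g. (("http.route", {"auth": "public"}),)

class Query:
    """Immutable chain: each filter returns a new Query; terminators evaluate."""

    def __init__(self, items, steps=()):
        self._items = tuple(items)
        self._steps = tuple(steps)   # (label, predicate) pairs

    def where(self, predicate, label="where(<predicate>)"):
        return Query(self._items, self._steps + ((label, predicate),))

    def with_decorator(self, name, **kwargs):
        def has_decorator(c):
            return any(
                dec_name == name
                and all(dec_args.get(k) == v for k, v in kwargs.items())
                for dec_name, dec_args in c.decorators
            )
        return self.where(has_decorator, f"with_decorator({name!r}, {kwargs})")

    def objects(self):
        return [c for c in self._items if all(p(c) for _, p in self._steps)]

    def signatures(self):
        return [c.signature for c in self.objects()]

    def count(self):
        return len(self.objects())

# Toy data; the signatures are made up.
callables = [
    ToyCallable("notes.public_view", (("http.route", {"auth": "public"}),)),
    ToyCallable("notes.user_view", (("http.route", {"auth": "user"}),)),
    ToyCallable("notes.helper"),
]
public_routes = Query(callables).with_decorator("http.route", auth="public")
```

Because every filter returns a new `Query`, a partially built chain such as `Query(callables).with_decorator("http.route")` can be reused as the base of several narrower queries without side effects.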

2. Joins between graphs

The current get_* methods return one slice at a time. Useful queries need joins across those slices, plus collect-style terminators like the ones in Java's stream API or Rust's iterators.

(pa.callables()
   .with_decorator("http.route")
   .calling(pa.callables().in_class("CrmNoteController"))
   .where(lambda c: c.modified_since("HEAD~10"))
   .signatures())

Call graph ⋈ inheritance ⋈ decorator-set ⋈ git-blame is the highest-leverage join.

3. Pattern-based sink matching on attribute chains

For dynamic-dispatch-heavy code (Odoo ORM, Django ORM, SQLAlchemy, message-bus dispatch), the resolved call graph cannot see the sinks. A sink_pattern("request.env[*].sudo().*") matcher operating at the AST / token level on attribute chains — usable inside .reachable_to(sink_pattern(...)) — is the escape hatch. It also bridges Bandit-/Semgrep-style pattern rules into CLDK without re-implementing them.

4. Provenance, lazily evaluated, cached

  • Lazy: a chain is a query plan, not a list of results. Evaluation happens at the terminator.
  • Cached: re-running the same chain hits the cache (the analysis backend is already cache-backed by analysis_cache.json; queries should follow).
  • Provenance: every result carries the chain + backend resolutions that produced it. `.explain()` surfaces this; for security reports it becomes the citation.
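
The three properties above can be sketched together in a few lines. Everything here is hypothetical: `LazyQuery` operates on plain signature strings, the cache is an in-memory dict rather than analysis_cache.json, and `explain()` reports only the plan and cache state, not the backend resolutions or unresolved-edge set the proposal calls for.

```python
import re

class LazyQuery:
    """A chain is a plan; nothing is evaluated until a terminator runs."""
    _cache = {}       # (source, plan) -> results, shared by equal chains
    evaluations = 0   # counts real evaluations, to make cache hits visible

    def __init__(self, source, plan=(), predicates=()):
        self._source = tuple(source)
        self._plan = tuple(plan)
        self._predicates = tuple(predicates)

    def matching(self, pattern):
        regex = re.compile(pattern)
        return LazyQuery(
            self._source,
            self._plan + (f"matching({pattern!r})",),
            self._predicates + (lambda name: regex.search(name) is not None,),
        )

    def _key(self):
        return (self._source, self._plan)

    def signatures(self):
        key = self._key()
        if key not in LazyQuery._cache:
            LazyQuery.evaluations += 1
            LazyQuery._cache[key] = [
                name for name in self._source
                if all(p(name) for p in self._predicates)
            ]
        return list(LazyQuery._cache[key])

    def explain(self):
        return {"plan": list(self._plan), "cached": self._key() in LazyQuery._cache}

# Toy signatures; the names are made up.
SOURCE = ("notes.route_public", "notes.route_user", "notes.helper")
first = LazyQuery(SOURCE).matching(r"route_")
evals_before = LazyQuery.evaluations   # building the chain evaluated nothing
result = first.signatures()
again = LazyQuery(SOURCE).matching(r"route_").signatures()   # equal plan: cache hit
```

Keying the cache on the plan text works here because each label fully determines its predicate. An opaque `.where(lambda ...)` step would need a different cache key or would have to bypass the cache.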

5. Honest visibility labels

Every reachability / dataflow result should carry a label distinguishing:

  • resolved — CodeQL/Jedi confirmed the edge
  • structural — pattern matched but not resolved through dispatch
  • unresolved — analyzer can't see this edge, surfaced anyway

This is the single most valuable property for security work: it lets the consumer set their confidence tier based on the result, instead of being silently misled by dropped edges.
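
As a sketch of how the three labels might be attached to edges and rolled up to a path, assuming (this is a design choice of the sketch, not something the proposal specifies) that a path is only as trustworthy as its weakest edge. The `Visibility`, `Edge`, and `path_visibility` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Visibility(Enum):
    RESOLVED = "resolved"       # backend confirmed the edge
    STRUCTURAL = "structural"   # pattern matched, not resolved through dispatch
    UNRESOLVED = "unresolved"   # analyzer cannot see the edge; surfaced anyway

# Lower rank = weaker evidence (ordering is an assumption of this sketch).
_RANK = {Visibility.UNRESOLVED: 0, Visibility.STRUCTURAL: 1, Visibility.RESOLVED: 2}

@dataclass(frozen=True)
class Edge:
    caller: str
    callee: str
    visibility: Visibility

def path_visibility(edges):
    """Label a non-empty path with the weakest label among its edges."""
    return min((e.visibility for e in edges), key=_RANK.__getitem__)

# Toy path; node names are made up.
path = [
    Edge("route_public", "load_note", Visibility.RESOLVED),
    Edge("load_note", "sudo_browse", Visibility.STRUCTURAL),
]
label = path_visibility(path)
```

With this roll-up, a consumer who only trusts resolved paths can filter on `label`, while the structural and unresolved paths remain visible instead of being dropped.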

Concrete example end-to-end

The 12-folder audit I did by hand would collapse to roughly:

public_unauth_sudo = (
    pa.callables()
      .with_decorator("http.route", auth="public")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().browse")))
      .without_passing_through(sanitizers=["check_access", "has_group"])
)

inapp_sudo_no_gate = (
    pa.callables()
      .with_decorator("http.route", auth="user")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().*")))
      .without_passing_through(sanitizers=[
          "check_access", "has_group",
          "owner_user_id.id == request.env.user.id",   # value-predicate sanitizers
      ])
)

for c in public_unauth_sudo.objects():
    print(c.signature, "-> CONFIRMED unauth sudo path")
for c in inapp_sudo_no_gate.objects():
    print(c.signature, "-> CONDITIONALLY CONFIRMED (in_app_role)")

Twelve handwritten findings become two queries, each .explain()-able into the same evidence I produced by hand.

What this is not

  • Not a new analysis backend. It is a query layer over the existing Jedi + CodeQL results plus AST/token access for pattern matching.
  • Not a scanner. CLDK should remain the analyst-side tool that audits scanner alerts; this proposal makes that audit composable and citable, not automatic.
  • Not an opinionated security framework. Sanitizer/source/sink dictionaries are user-supplied (with the same justification discipline the sources.json validator already enforces); the library returns facts and provenance.

Relationship to existing stubs

Two stubs in the current SDK become natural starting points for the chain API once implemented:

  • get_methods_with_decorators(...) → pa.callables().with_decorator(...)
  • get_entry_point_methods(...) → pa.callables().is_entry_point() (with framework recipes deciding what counts)

get_calling_lines and get_call_targets would underlie .callers() / .callees().

Why this is the highest-leverage change

Pandas is my reference: it is great because it makes composing common slicing operations cheap and citable. CLDK already has the underlying analysis facts; what's missing is the surface that turns them into composable, citable queries.

For framework-shaped security audits specifically (web, RPC, CLI, async messaging — i.e., most real-world Python codebases), the marginal value of this change is larger than the marginal value of any analysis-engine improvement. A composable query layer with pattern-based escape hatches and honest visibility labels would be awesome.

Out of scope (suggested separate issues)

  • Polyglot graphs that follow HTTP/gRPC/message-queue edges across languages
  • Differential queries between commits / branches
  • Bridge to coverage / dynamic instrumentation
  • A framework-recipe registry (Flask, Django, FastAPI, Odoo, Express, Rails, Spring) shipping as data
