Dev/improves - full DWARF support #8

Merged: buzzer-re merged 6 commits into main from dev/improves on Jun 20, 2026

@buzzer-re

Owner

No description provided.

buzzer-re and others added 6 commits June 19, 2026 09:29
Two quality gaps are addressed:

1. DWARF source info was ignored. A new shared backends/dwarf.py reads
   per-function source file/dir/line and type definitions via pyelftools.
   When a binary has debug info, the export tree now mirrors the original
   source files and directories (src/raw/<dir>/<file>.c) instead of only
   call-graph clusters; functions without debug info keep the cluster
   fallback. Each function also gets an `origin: file:line` annotation and
   decl_file/decl_dir/decl_line in functions.json / function-index.json.

2. Types were never exported (the header carried only a hardcoded pseudo-type
   prelude). Backends now expose a types() catalog and per-function
   return/param/local types, sourced from the tool rather than invented:
   - angr: DWARF DIEs (structs/unions/enums/typedefs) + subprogram protos
   - IDA: local type library (TIL) + function tinfo (return/params/calltype)
   - radare2: best-effort tsj/tuj/tej + tc (DWARF support is limited)
   These are written to a new types.json and an include/<bin>.types.h of real
   C declarations, included by the main header. functions.json now carries
   return_type/params/locals/calltype.

All wiring is defensive: missing debug info or an older session degrades to
the previous behavior. Adds DWARF unit/integration tests and an exporter
source-grouping/types test.
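
The source-grouping rule described above can be sketched roughly as follows. This is a minimal illustration, not the exporter's actual code: the function names, the `clusters/` fallback directory, and the path sanitizing are assumptions; only the `src/raw/<dir>/<file>.c` layout and the `origin: file:line` annotation come from the description.

```python
from pathlib import PurePosixPath
from typing import Optional


def export_path(decl_dir: Optional[str], decl_file: Optional[str], cluster: str) -> str:
    """Pick the export location for one function (hypothetical helper)."""
    if not decl_file:
        # No debug info for this function: keep the call-graph cluster fallback.
        # The "clusters/" directory name is an assumption made for this sketch.
        return f"src/raw/clusters/{cluster}.c"
    # Drop the filesystem root and any "."/".." parts so the result
    # always stays inside the export tree.
    parts = [p for p in PurePosixPath(decl_dir or "").parts if p not in ("/", ".", "..")]
    return str(PurePosixPath("src/raw", *parts, PurePosixPath(decl_file).name))


def origin_annotation(decl_file: str, decl_line: int) -> str:
    """Build the per-function `origin: file:line` annotation."""
    return f"origin: {decl_file}:{decl_line}"
```

A function declared in `lib/net/parse.c` would land in `src/raw/lib/net/parse.c`, while a function with no DWARF record keeps its cluster file.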

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pyelftools caches every DIE it parses onto its CompileUnit. Walking all
compilation units of a large .debug_info therefore accumulated the entire
DWARF tree in memory (multiple GB) and could OOM-kill analysis on big
binaries. Release each unit's DIE cache once we are done with it, capping
memory to a single unit at a time. On a large real-world shared object this
dropped peak RSS from ~7 GB (OOM) to ~650 MB with no loss of recovered data.
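
The effect of releasing each unit's cache can be illustrated with a toy model. `FakeUnit` and its `cache`/`release` members are stand-ins invented for this sketch; they are not the pyelftools API, and the real fix clears whatever cache pyelftools keeps on the unit.

```python
from typing import Iterator, List


class FakeUnit:
    """Stand-in for a compilation unit that caches every DIE it parses."""

    def __init__(self, n_dies: int) -> None:
        self.n_dies = n_dies
        self.cache: List[int] = []  # hypothetical cache attribute

    def iter_dies(self) -> Iterator[int]:
        for i in range(self.n_dies):
            self.cache.append(i)  # parsing a DIE populates the cache
            yield i

    def release(self) -> None:
        self.cache.clear()


def walk(units: List[FakeUnit], release: bool) -> int:
    """Visit every DIE and return the peak number of DIEs cached at once."""
    peak = 0
    for unit in units:
        for _ in unit.iter_dies():
            peak = max(peak, sum(len(u.cache) for u in units))
        if release:
            unit.release()  # done with this unit: drop its cached DIEs
    return peak
```

With units of 3, 5, and 2 DIEs, the peak is 10 cached DIEs without the release step and 5 (the largest single unit) with it.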

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The DWARF reader only inspected attributes directly on each subprogram DIE.
For C++ and optimized code the concrete function DIE typically omits
decl_file/decl_line/type and instead references a DW_AT_specification or
DW_AT_abstract_origin DIE that carries them. As a result the vast majority
of functions in a C++ binary recovered no source file and fell back to the
generic cluster layout.

Follow those references (decl_file is resolved against the file table and
comp_dir of the CU that actually holds it, which may differ from the
concrete DIE's CU). On a large real-world C++ shared object this raised
source-file coverage from ~3.9k to ~38.7k of ~39k functions, with memory
still bounded (~0.65 GB). Adds a C++ regression test.
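
A minimal sketch of the reference-following step, using plain dictionaries in place of pyelftools DIEs. The `resolve_attr` helper, the offset-keyed `dies` map, and the `"cu"` key are assumptions made for illustration. Returning the DIE that actually holds the attribute lets the caller resolve decl_file against that DIE's own CU.

```python
from typing import Any, Dict, Optional, Tuple

Die = Dict[str, Any]
REFERENCE_ATTRS = ("DW_AT_specification", "DW_AT_abstract_origin")


def resolve_attr(die: Die, name: str, dies: Dict[int, Die]) -> Optional[Tuple[Any, Die]]:
    """Return (value, owning DIE) for `name`, following reference attributes."""
    seen = set()
    current: Optional[Die] = die
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against reference cycles
        if name in current:
            return current[name], current
        target = None
        for ref in REFERENCE_ATTRS:
            if ref in current:
                target = dies.get(current[ref])  # look up the referenced DIE by offset
                break
        current = target
    return None
```

For example, a concrete DIE that carries only `DW_AT_abstract_origin` resolves `DW_AT_decl_line` through the abstract DIE and, if needed, on to the declaration it specifies.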

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The IDA backend used APIs incorrectly, so source files/types never came
through and every function fell back to the cluster layout:

- types() iterated the numbered-type ordinal space and rendered with
  tinfo_t._print(), which is not exposed; switch to the idiomatic
  db.types (named types) and render via serialize()+idc_print_type(),
  falling back to tinfo_t.dstr().
- source recovery only probed the function entry ea, which is frequently
  not annotated; scan the first instruction heads (ida_bytes.next_head)
  for the first get_sourcefile()/get_source_linnum() hit.
- prototype recovery called the non-existent ._print() on types; use
  get_func_details() with rettype/arg .dstr().

Adds fake-IDA unit tests covering these paths (IDA cannot run in CI).
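
The head-scanning idea from the second bullet can be sketched without IDA by injecting the lookups as callables, much as a fake-IDA test would. The helper name, the `max_heads` limit, and the rule that a hit needs a file but not necessarily a line are assumptions made for this sketch; in the real backend the callables would be the IDA functions named above.

```python
from typing import Callable, Optional, Tuple

BADADDR = 0xFFFFFFFFFFFFFFFF  # "no address" sentinel used by this sketch


def first_source_hit(
    start: int,
    end: int,
    next_head: Callable[[int, int], int],
    get_sourcefile: Callable[[int], Optional[str]],
    get_source_linnum: Callable[[int], Optional[int]],
    max_heads: int = 16,  # hypothetical bound on how far to scan
) -> Optional[Tuple[str, Optional[int]]]:
    """Return (file, line) for the first annotated head in [start, end)."""
    ea = start
    for _ in range(max_heads):
        if ea == BADADDR or ea >= end:
            break
        path = get_sourcefile(ea)
        if path:
            line = get_source_linnum(ea)
            return path, (line if line and line != BADADDR else None)
        ea = next_head(ea, end)  # advance to the next instruction head
    return None
```

Probing only the entry address corresponds to `max_heads=1`, which misses functions whose first annotated instruction comes slightly later.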

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

IDA's DWARF plugin defaults to DWARF_IMPORT_LNNUMS=NO, so it imported
types but no source file/line information; get_sourcefile() and
get_source_linnum() returned nothing and every function fell back to the
cluster layout. Pass -Odwarf:import_lnnums=1 when creating the database so
the loader imports source locations, which the backend already reads.
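
As a rough illustration of where the option goes, the sketch below builds an argument list for a headless database-creation run. The helper name and the executable path are hypothetical, and the use of IDA's `-A` (autonomous) and `-B` (batch) switches is an assumption about how the database is created; only `-Odwarf:import_lnnums=1` comes from this change.

```python
from typing import List


def ida_create_db_argv(ida_exe: str, binary: str) -> List[str]:
    """Build the command line for creating a database with line numbers imported."""
    return [
        ida_exe,                    # path to the IDA text-mode executable (assumed)
        "-A",                       # autonomous mode: no dialog boxes
        "-B",                       # batch mode: analyze and create the database
        "-Odwarf:import_lnnums=1",  # ask the DWARF plugin to import source locations
        binary,                     # the input file must come after the switches
    ]
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues with the colon and equals sign in the option.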

Verified on a real IDA 9.3 install: functions now resolve to their source
file/line and group by source file, while recovered types (structs/enums)
continue to populate the type catalog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pyelftools ships with the angr extra via cle, so it is absent from the
type checking environment and mypy raised import-not-found on dwarf.py.
Mirror the existing angr.* override to ignore its missing imports.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@buzzer-re buzzer-re merged commit 4f3502f into main Jun 20, 2026
4 checks passed