Dev/improves - full dwarf support #8
Merged
Two quality gaps are addressed:

1. DWARF source info was ignored. A new shared backends/dwarf.py reads per-function source file/dir/line and type definitions via pyelftools. When a binary has debug info, the export tree now mirrors the original source files and directories (src/raw/<dir>/<file>.c) instead of only call-graph clusters; functions without debug info keep the cluster fallback. Each function also gets an `origin: file:line` annotation and decl_file/decl_dir/decl_line in functions.json / function-index.json.

2. Types were never exported (the header carried only a hardcoded pseudo-type prelude). Backends now expose a types() catalog and per-function return/param/local types, sourced from the tool rather than invented:

   - angr: DWARF DIEs (structs/unions/enums/typedefs) + subprogram protos
   - IDA: local type library (TIL) + function tinfo (return/params/calltype)
   - radare2: best-effort tsj/tuj/tej + tc (DWARF support is limited)

   These are written to a new types.json and an include/<bin>.types.h of real C declarations, included by the main header. functions.json now carries return_type/params/locals/calltype.

All wiring is defensive: missing debug info or an older session degrades to the previous behavior.

Adds DWARF unit/integration tests and an exporter source-grouping/types test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
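The source-grouping rule in item 1 can be sketched roughly as follows. This is only an illustration, not the exporter's code: the helper name `export_path`, the `clusters/` fallback directory, and the record shape are hypothetical. Only the `src/raw/<dir>/<file>.c` layout and the `decl_file`/`decl_dir` keys come from the description above.

```python
from pathlib import PurePosixPath
from typing import Mapping, Optional


def export_path(func: Mapping[str, Optional[str]], cluster: str) -> PurePosixPath:
    """Pick an export path for one function record (hypothetical helper)."""
    decl_file = func.get("decl_file")
    if not decl_file:
        # No debug info for this function: keep the call-graph cluster fallback.
        # The "clusters" directory name is an assumption made for this sketch.
        return PurePosixPath("src/raw/clusters") / f"{cluster}.c"

    decl_dir = (func.get("decl_dir") or "").strip("/")
    # Drop "." and ".." components so the result stays inside the export tree.
    parts = [p for p in PurePosixPath(decl_dir).parts if p not in (".", "..", "/")]
    return PurePosixPath("src/raw", *parts, PurePosixPath(decl_file).name)
```

For example, a record with `decl_dir="/proj/lib"` and `decl_file="parser.c"` maps to `src/raw/proj/lib/parser.c`, while a record without `decl_file` lands under the cluster fallback.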
pyelftools caches every DIE it parses onto its CompileUnit. Walking all compilation units of a large .debug_info therefore accumulated the entire DWARF tree in memory (multiple GB) and could OOM-kill analysis on big binaries.

Release each unit's DIE cache once we are done with it, capping memory to a single unit at a time. On a large real-world shared object this dropped peak RSS from ~7 GB (OOM) to ~650 MB with no loss of recovered data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
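A minimal sketch of the release-after-each-unit pattern, written against duck-typed unit objects so it can be exercised without a real binary. The helper name `walk_units_bounded` is hypothetical, and the private attribute names `_dielist` and `_diemap` are an assumption about pyelftools internals that should be checked against the installed version; the actual change in backends/dwarf.py may differ.

```python
from typing import Any, Callable, Iterable


def walk_units_bounded(units: Iterable[Any], visit: Callable[[Any], None]) -> int:
    """Visit every DIE of every unit, dropping each unit's DIE cache afterwards.

    Returns the number of DIEs visited.
    """
    count = 0
    for cu in units:
        for die in cu.iter_DIEs():
            visit(die)
            count += 1
        # Assumption: the per-unit cache lives in these private lists.
        # Clearing them lets the parsed DIEs of this unit be garbage-collected
        # before the next unit is parsed.
        for name in ("_dielist", "_diemap"):
            cache = getattr(cu, name, None)
            if isinstance(cache, list):
                cache.clear()
    return count
```

With pyelftools this would be called as `walk_units_bounded(dwarfinfo.iter_CUs(), visit)`. The memory bound only holds if `visit` copies out the plain values it needs and does not keep references to the DIE objects themselves.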
The DWARF reader only inspected attributes directly on each subprogram DIE. For C++ and optimized code the concrete function DIE typically omits decl_file/decl_line/type and instead references a DW_AT_specification or DW_AT_abstract_origin DIE that carries them. As a result the vast majority of functions in a C++ binary recovered no source file and fell back to the generic cluster layout.

Follow those references (decl_file is resolved against the file table and comp_dir of the CU that actually holds it, which may differ from the concrete DIE's CU). On a large real-world C++ shared object this raised source-file coverage from ~3.9k to ~38.7k of ~39k functions, with memory still bounded (~0.65 GB).

Adds a C++ regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
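The reference-following step might look like the sketch below. It assumes DIE objects shaped like pyelftools' (an `attributes` mapping whose values have `.value`, and a `get_DIE_from_attribute(name)` method); the helper name `resolve_decl` and the hop limit are hypothetical and not taken from the change itself.

```python
from typing import Any, Optional, Tuple

# Attributes that point from a concrete DIE to the DIE carrying the declaration.
_REF_ATTRS = ("DW_AT_specification", "DW_AT_abstract_origin")


def resolve_decl(die: Any, attr: str, max_hops: int = 8) -> Optional[Tuple[Any, Any]]:
    """Return (value, owning_die) for `attr`, following reference attributes.

    The owning DIE is returned as well because DW_AT_decl_file must be
    resolved against the file table of the unit that holds the attribute.
    """
    seen = set()
    for _ in range(max_hops):
        if id(die) in seen:
            return None  # reference cycle in malformed debug info
        seen.add(id(die))
        if attr in die.attributes:
            return die.attributes[attr].value, die
        ref = next((r for r in _REF_ATTRS if r in die.attributes), None)
        if ref is None:
            return None
        die = die.get_DIE_from_attribute(ref)
    return None
```

A caller would then look up the file index in the line program of `owning_die.cu` rather than in the concrete DIE's unit, which is the distinction the commit message draws.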
The IDA backend used APIs incorrectly, so source files/types never came through and every function fell back to the cluster layout:

- types() iterated the numbered-type ordinal space and rendered with tinfo_t._print(), which is not exposed; switch to the idiomatic db.types (named types) and render via serialize()+idc_print_type(), falling back to tinfo_t.dstr().
- source recovery only probed the function entry ea, which is frequently not annotated; scan the first instruction heads (ida_bytes.next_head) for the first get_sourcefile()/get_source_linnum() hit.
- prototype recovery called the non-existent ._print() on types; use get_func_details() with rettype/arg .dstr().

Adds fake-IDA unit tests covering these paths (IDA cannot run in CI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
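The head-scanning idea in the second bullet can be sketched with the IDA calls injected as plain callables, which is also how it can be tested without IDA. The helper name `first_source_hit`, the `max_heads` limit, and the `badaddr` default are assumptions for illustration; in a real session the three callables would be IDA's `next_head`, `get_sourcefile`, and `get_source_linnum`.

```python
from typing import Callable, Optional, Tuple

# Assumed "no address" sentinel for 64-bit databases; pass IDA's own BADADDR in practice.
DEFAULT_BADADDR = 0xFFFFFFFFFFFFFFFF


def first_source_hit(
    start_ea: int,
    end_ea: int,
    next_head: Callable[[int, int], int],
    get_sourcefile: Callable[[int], Optional[str]],
    get_source_linnum: Callable[[int], Optional[int]],
    max_heads: int = 32,
    badaddr: int = DEFAULT_BADADDR,
) -> Optional[Tuple[str, Optional[int]]]:
    """Scan the first instruction heads of a function for source annotations."""
    ea = start_ea
    for _ in range(max_heads):
        if ea == badaddr or ea >= end_ea:
            break
        path = get_sourcefile(ea)
        if path:
            line = get_source_linnum(ea)
            # Treat a missing or sentinel line number as unknown.
            return path, (line if line not in (None, badaddr) else None)
        ea = next_head(ea, end_ea)
    return None
```

Bounding the scan with `max_heads` keeps the cost per function small; the value 32 is arbitrary and would need tuning against real databases.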
IDA's DWARF plugin defaults to DWARF_IMPORT_LNNUMS=NO, so it imported types but no source file/line information; get_sourcefile() and get_source_linnum() returned nothing and every function fell back to the cluster layout.

Pass -Odwarf:import_lnnums=1 when creating the database so the loader imports source locations, which the backend already reads.

Verified on a real IDA 9.3 install: functions now resolve to their source file/line and group by source file, while recovered types (structs/enums) continue to populate the type catalog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
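As a rough illustration of where the option goes, the sketch below builds an argument list for launching IDA on a binary. Only the `-Odwarf:import_lnnums=1` switch comes from the commit message; the helper name `ida_create_argv`, the `extra` parameter, and the executable name in the example are hypothetical, and the backend's real launch code may look quite different.

```python
from typing import List, Sequence

# Ask IDA's DWARF plugin to import source line numbers (off by default).
DWARF_LNNUMS_OPT = "-Odwarf:import_lnnums=1"


def ida_create_argv(ida_executable: str, binary: str, extra: Sequence[str] = ()) -> List[str]:
    """Build the argv for creating a database with DWARF line info imported."""
    # Switches go before the input file, which is the last argument.
    return [ida_executable, DWARF_LNNUMS_OPT, *extra, binary]
```

The resulting list can be handed to `subprocess.run` as-is, for example `ida_create_argv("idat", "libfoo.so")`, where `idat` stands in for whatever IDA executable the installation provides.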
pyelftools ships with the angr extra via cle, so it is absent from the type-checking environment and mypy raised import-not-found on dwarf.py. Mirror the existing angr.* override to ignore its missing imports.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No description provided.