gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers by iamsharduld · Pull Request #151498 · python/cpython

iamsharduld · 2026-06-15T10:33:42Z

tarfile reads a member's extended header (a GNU long name/link, or a pax
header) with a single read sized directly by the header's size field:

buf = tarfile.fileobj.read(self._block(self.size))

self.size is taken from the archive and is not validated, so a crafted file of
about 512 bytes can claim several gigabytes (or, via base-256 encoding, far
more) and make read() pre-allocate that much memory. This happens on
open/iterate (tarfile.open(...).getmembers()), before any extraction filter
runs. A 512-byte archive claiming 1 GiB drives a ~950 MiB resident allocation;
a claim of 1 TiB raises MemoryError even on high-RAM machines.
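
To make the size claim concrete: the 12-byte size field of a tar header is normally octal text, which tops out just under 8 GiB, while the GNU base-256 form stores a raw big-endian integer after a marker byte. A minimal sketch of decoding both forms follows. `parse_size_field` is a hypothetical helper written for illustration, not part of tarfile's API, and it ignores negative base-256 values for brevity.

```python
def parse_size_field(field: bytes) -> int:
    """Decode a 12-byte tar size field (hypothetical helper, illustration only)."""
    if field[0] == 0o200:
        # GNU base-256: marker byte, then a big-endian unsigned integer.
        return int.from_bytes(field[1:], "big")
    # Classic form: octal ASCII digits, NUL- or space-terminated.
    text = field.split(b"\0", 1)[0].decode("ascii").strip()
    return int(text, 8) if text else 0


# Largest value the 11-digit octal form can express.
octal_max = parse_size_field(b"77777777777\0")
# A base-256 field claiming 1 TiB, still only 12 bytes on disk.
one_tib = parse_size_field(b"\x80" + (1 << 40).to_bytes(11, "big"))
```

Either way the claim costs the attacker only the 12 bytes of the field, which is why the header size alone cannot be trusted as an allocation size.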

This PR reads the extended-header data in bounded chunks instead, so an
oversized or truncated header can no longer force a huge up-front allocation.
The bytes returned for valid archives are unchanged, and the change is safe for
both seekable and streaming (r|) tars.
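
The bounded-chunk idea can be sketched as below. This is an illustration under stated assumptions, not the code from the PR: `read_bounded` and `CHUNK` are hypothetical names, and the 1 MiB value simply mirrors the constant discussed in the review.

```python
import io

CHUNK = 1024 * 1024  # assumed chunk size for this sketch (1 MiB)


def read_bounded(fileobj, size, chunk=CHUNK):
    """Read up to *size* bytes, never requesting more than *chunk* at once."""
    parts = []
    remaining = size
    while remaining > 0:
        data = fileobj.read(min(remaining, chunk))
        if not data:
            break  # end of file: return a short result, as read(size) would
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


# The header claims 4 GiB, but the file only holds 1000 bytes.
payload = io.BytesIO(b"x" * 1000)
short = read_bounded(payload, 4 * 1024**3)
```

Memory use then tracks the data actually present in the file rather than the size the header claims.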

Issue: tarfile: memory exhaustion via oversized extended-header (GNU long name / pax) size field #151497

vstinner

cc @encukou @cmaloney

vstinner · 2026-06-15T15:51:42Z

            self.assertIs(fobj.seekable(), True)


+class _ReadSizeRecorder(io.BytesIO):


You can rename it to ReadSizeRecorder, I don't think that the _ prefix is useful.
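
For context, the diff excerpt shows only the class line. A recorder of this kind could look like the sketch below; the body is an assumption for illustration, with the `max_read_size` attribute name taken from the assertion quoted later in this review.

```python
import io


class ReadSizeRecorder(io.BytesIO):
    """BytesIO that remembers the largest size ever passed to read()."""

    def __init__(self, initial_bytes=b""):
        super().__init__(initial_bytes)
        self.max_read_size = 0

    def read(self, size=-1):
        # Track the requested size, not the number of bytes returned.
        if size is not None and size > self.max_read_size:
            self.max_read_size = size
        return super().read(size)


fobj = ReadSizeRecorder(b"\0" * 1024)
fobj.read(512)
fobj.read(100)
```

Recording the requested size matters here because the pre-allocation is driven by the argument to read(), even when far fewer bytes come back.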

vstinner · 2026-06-15T15:52:03Z

+    # size far larger than the file actually contains; opening such an archive
+    # must not try to read (and so pre-allocate) the claimed size in one go.
+
+    def _crafted_archive(self, hdrtype):


You can rename it to crafted_archive, I don't think that the _ prefix is useful. Same remark for _check() method.

vstinner · 2026-06-15T15:52:57Z

+        except tarfile.ReadError:
+            pass  # a truncated header is fine; we only check the allocation
+        # The bogus ~4 GiB size must never reach a single read() call.
+        self.assertLess(fobj.max_read_size, 10 * 1024 * 1024)


You can decorate the class with @support.cpython_only and use the private attribute tarfile._EXTHEADER_READ_CHUNK.

Suggested change

-        self.assertLess(fobj.max_read_size, 10 * 1024 * 1024)
+        self.assertLessEqual(fobj.max_read_size, tarfile._EXTHEADER_READ_CHUNK)

vstinner · 2026-06-15T16:05:29Z

+# bounded chunks to avoid a huge up-front allocation when a crafted or
+# truncated archive claims far more data than the file actually contains
+# (gh-151497).
+_EXTHEADER_READ_CHUNK = 1024 * 1024  # 1 MiB


I checked the _safe_read() argument when running test_tarfile. If I ignore the 4 GiB outlier, the size is between 512 bytes and 4 KiB. So a limit of 1 MiB sounds reasonable to me.

encukou · 2026-06-16T11:34:45Z

+# bounded chunks to avoid a huge up-front allocation when a crafted or
+# truncated archive claims far more data than the file actually contains
+# (gh-151497).
+_EXTHEADER_READ_CHUNK = 1024 * 1024  # 1 MiB


I wouldn't expect the test suite to contain real-world data.
But 1 MiB should do fine: it's well over io.DEFAULT_BUFFER_SIZE.

encukou · 2026-06-16T11:36:51Z

+    """Read up to *size* bytes from *fileobj* in bounded chunks.
+
+    Returns the same bytes as ``fileobj.read(size)`` would (including a short
+    result at end of file), but never pre-allocates *size* bytes, so an


Nitpick: it will preallocate size bytes if size is small.

Suggested change

-    result at end of file), but never pre-allocates *size* bytes, so an
+    result at end of file), but limits pre-allocation, so an
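
The nitpick can be made concrete with a small sketch: in a chunked loop each read() request is min(remaining, chunk), so a small *size* is still requested in full. `request_sizes` is a hypothetical helper for illustration, and the tiny chunk value is chosen only to keep the example readable.

```python
import io


def request_sizes(size, chunk, data):
    """Return the sizes a chunked reader would pass to read() (illustrative)."""
    fileobj = io.BytesIO(data)
    sizes = []
    remaining = size
    while remaining > 0:
        want = min(remaining, chunk)
        sizes.append(want)
        got = fileobj.read(want)
        if not got:
            break  # end of file
        remaining -= len(got)
    return sizes


small = request_sizes(size=3, chunk=4, data=b"abcdefghij")   # size below the chunk
large = request_sizes(size=10, chunk=4, data=b"abcdefghij")  # size above the chunk
```

Here `small` holds a single request of the full 3 bytes, while `large` is split into requests of at most 4 bytes, which is why "limits pre-allocation" is the more accurate wording.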

iamsharduld requested a review from ethanfurman as a code owner June 15, 2026 10:33

bedevere-app Bot added the awaiting review label Jun 15, 2026

bedevere-app Bot mentioned this pull request Jun 15, 2026

tarfile: memory exhaustion via oversized extended-header (GNU long name / pax) size field #151497

Open

vstinner reviewed Jun 15, 2026

encukou reviewed Jun 16, 2026

iamsharduld wants to merge 1 commit into python:main from iamsharduld:gh-tarfile-extheader-memory
