gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers#151498
gh-151497: Avoid huge pre-allocation for oversized tarfile extended headers#151498iamsharduld wants to merge 1 commit into
Conversation
…nded headers
tarfile reads a member's extended header (a GNU long name/link or a pax
header) with a single read sized by the header's size field:
buf = tarfile.fileobj.read(self._block(self.size))
The size is taken from the archive and is not validated, so a ~512-byte
crafted file can claim several gigabytes (or, via base-256 encoding, far
more) and make read() pre-allocate that much memory -- on open/iterate,
before any extraction filter runs.
Read the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge allocation. The bytes
returned for valid archives are unchanged.
| self.assertIs(fobj.seekable(), True) | ||
|
|
||
|
|
||
| class _ReadSizeRecorder(io.BytesIO): |
There was a problem hiding this comment.
You can rename it to ReadSizeRecorder, I don't think that the _ prefix is useful.
| # size far larger than the file actually contains; opening such an archive | ||
| # must not try to read (and so pre-allocate) the claimed size in one go. | ||
|
|
||
| def _crafted_archive(self, hdrtype): |
There was a problem hiding this comment.
You can rename it to crafted_archive, I don't think that the _ prefix is useful. Same remark for _check() method.
| except tarfile.ReadError: | ||
| pass # a truncated header is fine; we only check the allocation | ||
| # The bogus ~4 GiB size must never reach a single read() call. | ||
| self.assertLess(fobj.max_read_size, 10 * 1024 * 1024) |
There was a problem hiding this comment.
You can decorate the class with @support.cpython_only and use the private attribute tarfile._EXTHEADER_READ_CHUNK.
| self.assertLess(fobj.max_read_size, 10 * 1024 * 1024) | |
| self.assertLessEqual(fobj.max_read_size, tarfile._EXTHEADER_READ_CHUNK) |
| # bounded chunks to avoid a huge up-front allocation when a crafted or | ||
| # truncated archive claims far more data than the file actually contains | ||
| # (gh-151497). | ||
| _EXTHEADER_READ_CHUNK = 1024 * 1024 # 1 MiB |
There was a problem hiding this comment.
I checked the _safe_read() argument when running test_tarfile. If I ignore the 4 GiB outlier, the size is between 512 bytes and 4 kiB. So a limit of 1 MiB sounds reasonable to me.
There was a problem hiding this comment.
I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.
| # bounded chunks to avoid a huge up-front allocation when a crafted or | ||
| # truncated archive claims far more data than the file actually contains | ||
| # (gh-151497). | ||
| _EXTHEADER_READ_CHUNK = 1024 * 1024 # 1 MiB |
There was a problem hiding this comment.
I wouldn't expect the test suite to contain real-world data.
But, 1 MiB should do fine. It's well over io.DEFAULT_BUFFER_SIZE.
| """Read up to *size* bytes from *fileobj* in bounded chunks. | ||
|
|
||
| Returns the same bytes as ``fileobj.read(size)`` would (including a short | ||
| result at end of file), but never pre-allocates *size* bytes, so an |
There was a problem hiding this comment.
Nitpick: it will preallocate size bytes if size is small.
| result at end of file), but never pre-allocates *size* bytes, so an | |
| result at end of file), but limits pre-allocation, so an |
tarfilereads a member's extended header (a GNU long name/link, or a paxheader) with a single read sized directly by the header's
sizefield:self.sizeis taken from the archive and is not validated, so a ~512-bytecrafted file can claim several gigabytes (or, via base-256 encoding, far more)
and make
read()pre-allocate that much memory — on open/iterate(
tarfile.open(...).getmembers()), before any extraction filter runs. A512-byte archive claiming 1 GiB drives a ~950 MiB resident allocation; a claim
of 1 TiB raises
MemoryErroreven on high-RAM machines.This reads the extended-header data in bounded chunks instead, so an oversized
or truncated header can no longer force a huge up-front allocation. The bytes
returned for valid archives are unchanged, and the change is safe for both
seekable and streaming (
r|) tars.