Fix Base45 dropping trailing bytes on non-ASCII input #45

Open
gaoflow wants to merge 1 commit into dhondta:main from gaoflow:fix-base45-multibyte-dataloss

gaoflow commented Jun 15, 2026


Problem

The Base45 codec silently drops trailing bytes when the input contains non-ASCII content, producing output that is too short and no longer round-trips.

>>> import codext
>>> codext.encode(b'\xcf\xb1\x1b', 'base45')
b'OBQ'          # should be b'OBQR0'
>>> codext.decode(codext.encode(b'\xcf\xb1\x1b', 'base45'), 'base45')
b'\xcf\xb1'      # the trailing 0x1b byte is lost

Cause

base45_encode/base45_decode iterate with range(0, len(text), step) but index into t = b(text). Because the codec layer converts a bytes input to str (UTF-8) before the codec runs, b(text) is longer than text whenever the content is non-ASCII, so len(text) ends the loop early and the final group is never emitted. For b'\xcf\xb1\x1b' (\xcf\xb1 decodes to the single character U+03F1), text has length 2 while the byte sequence has length 3, dropping the third byte.
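
The length mismatch can be reproduced without codext. The following is a minimal sketch, assuming the codec layer's bytes-to-str conversion behaves like a plain UTF-8 decode and that b(text) behaves like a UTF-8 encode; the variable names are illustrative and are not taken from the codext source.

```python
raw = b'\xcf\xb1\x1b'

# Assumption: models the codec layer turning bytes into str before the codec runs.
text = raw.decode('utf-8')
# Assumption: models t = b(text), i.e. converting the str back to bytes.
t = text.encode('utf-8')

# Two characters (U+03F1 and ESC), but three bytes.
print(len(text), len(t))

# The encoder consumes the byte sequence two bytes at a time.
step = 2
buggy_starts = list(range(0, len(text), step))  # bounded by the character count
fixed_starts = list(range(0, len(t), step))     # bounded by the byte count

print(buggy_starts)  # the group starting at index 2 is never visited
print(fixed_starts)
```

With the character count as the bound, only the group starting at index 0 is processed, so the third byte never reaches the output.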

Fix

Iterate over len(t) (the actual byte sequence) in both functions. Encoded output now matches RFC 9285 and the reference base45 implementation, and encoding round-trips for arbitrary byte input.
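
For reference, a standalone Base45 encoder and decoder that bound their loops by the length of the sequence they actually index might look like the sketch below. It is not the codext implementation: the names ALPHABET, base45_encode_sketch and base45_decode_sketch are hypothetical, and the code only follows the grouping rules of RFC 9285 (two bytes to three characters, a lone trailing byte to two characters).

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def base45_encode_sketch(data: bytes) -> bytes:
    """Encode bytes as Base45, iterating over the byte length."""
    out = []
    for i in range(0, len(data), 2):  # bound is the byte count, not a str length
        chunk = data[i:i + 2]
        if len(chunk) == 2:
            # Two bytes form a 16-bit value, written as three base-45 digits,
            # least significant digit first.
            n = chunk[0] * 256 + chunk[1]
            n, c = divmod(n, 45)
            e, d = divmod(n, 45)
            out += [ALPHABET[c], ALPHABET[d], ALPHABET[e]]
        else:
            # A lone trailing byte is written as two base-45 digits.
            d, c = divmod(chunk[0], 45)
            out += [ALPHABET[c], ALPHABET[d]]
    return "".join(out).encode("ascii")


def base45_decode_sketch(encoded: bytes) -> bytes:
    """Decode Base45 back to bytes, iterating over the encoded length."""
    s = encoded.decode("ascii")
    out = bytearray()
    for i in range(0, len(s), 3):
        # str.index raises ValueError for characters outside the alphabet.
        vals = [ALPHABET.index(ch) for ch in s[i:i + 3]]
        if len(vals) == 3:
            n = vals[0] + vals[1] * 45 + vals[2] * 45 * 45
            if n > 0xFFFF:
                raise ValueError("three-character group exceeds 16 bits")
            out += n.to_bytes(2, "big")
        elif len(vals) == 2:
            n = vals[0] + vals[1] * 45
            if n > 0xFF:
                raise ValueError("two-character group exceeds 8 bits")
            out.append(n)
        else:
            raise ValueError("dangling single character")
    return bytes(out)


print(base45_encode_sketch(b'\xcf\xb1\x1b'))  # b'OBQR0'
print(base45_decode_sketch(b'OBQR0'))         # b'\xcf\xb1\x1b'
```

Because the sketch takes bytes directly, no str conversion is involved and the loop bound and the indexed sequence cannot disagree.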

Tests

Added test_codec_base45 covering the RFC 9285 vectors (AB, Hello!!, base-45), the exact regression value (b'\xcf\xb1\x1b' -> b'OBQR0'), and round-trips for binary inputs. The full test suite passes.

I worked on this with AI assistance under my direction and reviewed the change myself.

The Base45 encoder and decoder iterated with range(0, len(text), step)
while indexing into t = b(text). codext converts a bytes input to str
(UTF-8) before the codec runs, so for any non-ASCII content b(text) is
longer than text and len(text) stops the loop early, silently dropping
the trailing byte(s). For example encode(b'\xcf\xb1\x1b') returned
'OBQ' instead of 'OBQR0' and the value no longer round-tripped.

Iterate over len(t) (the actual byte sequence) instead. Output now
matches RFC 9285 and the reference base45 implementation, and encoding
round-trips for arbitrary byte input.