Feature or enhancement
Proposal:
While investigating the performance of random.Random vs. random.SystemRandom, I noticed some asymmetries that suggest performance enhancements for the former.
Currently, random.getrandbits is implemented (via _random_Random_getrandbits_impl) by filling a raw memory allocation with entropy (looping genrand_uint32) and passing the buffer to _PyLong_FromByteArray which creates its own copy. (Similarly, integer-range functions like randint and randrange are built on top of this, in a reasonable way — perhaps micro-optimizations are possible there, but they don't seem worthwhile to me.)
Then random.randbytes is implemented by calling random.getrandbits and feeding the result to int.to_bytes from Python. This entails two useless copies and additional Python object manipulation overhead (round-tripping raw buffer -> int object -> bytes object, which is also a lot of bit packing/unpacking for the special digit representations within the ints). By contrast, random.SystemRandom uses os.urandom to get bytes directly (from a system call, but no real adaptation is necessary).
A micro-benchmark suggests (weakly) that random.randbytes for large chunks could be made over three times as fast by avoiding the extra work:
$ py3.14 -m timeit --setup 'from random import randbytes' 'randbytes(1000000)'
50 loops, best of 5: 4.21 msec per loop
$ py3.14 -m timeit --setup 's = b"\xff"*1000000' 'int.from_bytes(s, "little").to_bytes(1000000, "little")'
100 loops, best of 5: 2.91 msec per loop
The underlying genrand_uint32, meanwhile, works (implementing the Mersenne Twister algorithm) by refilling an entropy pool in 32-bit units if necessary, extracting a value, and then scrambling a bit further with bit arithmetic. I don't know of a 64-bit version of the algorithm for filling the entropy pool (and it's more complex than I'd consider messing with); but I'm confident the final step can be extended to 64 bits. Nowadays 64-bit architectures are commonplace and this is surely worth taking advantage of.
Suggestions:
-
Extract the portion of genrand_uint32 that retrieves a value from self->state (roughly the existing function minus the y ^= ... lines at the end) to a helper function (I'll call it _get_entropy here), which can probably be marked inline
-
Add genrand_uint64 logic, along the lines of:
uint64_t y = _get_entropy() << 32 | _get_entropy();
y ^= (y >> 11) & 0xffffffff001fffffULL; /* mask out the 11 bits that would otherwise overflow from high to low */
y ^= (y << 7) & 0x9d2c56809d2c5680ULL; /* the masks for the left-shifts already handle overflows */
y ^= (y << 15) & 0xefc60000efc60000ULL;
y ^= (y >> 18) & 0xffffffff00003fffULL;
return y;
(This might also be useful for _random_Random_random_impl; is the current use of 26/27 bits from each value necessary?)
-
Add _genrand_buffer, implementing the loop to populate a buffer currently in _random_Random_getrandbits_impl
-
Have _random_Random_getrandbits_impl proceed (after the special-case checks) by calling long_alloc, using _genrand_buffer to populate the ->long_value.ob_digit, and masking out the extra bits (this requires changing the math for the allocation size; it might also make sense to modify genrand_uint64 to be able to get a twodigits directly?)
-
Add a C implementation for random.Random.randbytes that, similarly, populates a new bytes object's allocated memory directly
(Edit: if genrand_uint32 is an implementation detail and doesn't need to be preserved, then 64-bit versions could be used exclusively. In this case, the logic for checking the pool state can be simplified, checking only once per call, taking advantage of the fact that the pool size is even.)
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Feature or enhancement
Proposal:
While investigating the performance of
random.Randomvs.random.SystemRandom, I noticed some asymmetries that suggest performance enhancements for the former.Currently,
random.getrandbitsis implemented (via_random_Random_getrandbits_impl) by filling a raw memory allocation with entropy (loopinggenrand_uint32) and passing the buffer to_PyLong_FromByteArraywhich creates its own copy. (Similarly, integer-range functions likerandintandrandrangeare built on top of this, in a reasonable way — perhaps micro-optimizations are possible there, but they don't seem worthwhile to me.)Then
random.randbytesis implemented by callingrandom.getrandbitsand feeding the result toint.to_bytesfrom Python. This entails two useless copies and additional Python object manipulation overhead (round-tripping raw buffer ->intobject ->bytesobject, which is also a lot of bit packing/unpacking for the special digit representations within theints). By contrast,random.SystemRandomusesos.urandomto get bytes directly (from a system call, but no real adaptation is necessary).A micro-benchmark suggests (weakly) that
random.randbytesfor large chunks could be made over three times as fast by avoiding the extra work:The underlying
genrand_uint32, meanwhile, works (implementing the Mersenne Twister algorithm) by refilling an entropy pool in 32-bit units if necessary, extracting a value, and then scrambling a bit further with bit arithmetic. I don't know of a 64-bit version of the algorithm for filling the entropy pool (and it's more complex than I'd consider messing with); but I'm confident the final step can be extended to 64 bits. Nowadays 64-bit architectures are commonplace and this is surely worth taking advantage of.Suggestions:
Extract the portion of
genrand_uint32that retrieves a value fromself->state(roughly the existing function minus they ^= ...lines at the end) to a helper function (I'll call it_get_entropyhere), which can probably be markedinlineAdd
genrand_uint64logic, along the lines of:(This might also be useful for
_random_Random_random_impl; is the current use of 26/27 bits from each value necessary?)Add
_genrand_buffer, implementing the loop to populate a buffer currently in_random_Random_getrandbits_implHave
_random_Random_getrandbits_implproceed (after the special-case checks) by callinglong_alloc, using_genrand_bufferto populate the->long_value.ob_digit, and masking out the extra bits (this requires changing the math for the allocation size; it might also make sense to modifygenrand_uint64to be able to get atwodigitsdirectly?)Add a C implementation for
random.Random.randbytesthat, similarly, populates a newbytesobject's allocated memory directly(Edit: if
genrand_uint32is an implementation detail and doesn't need to be preserved, then 64-bit versions could be used exclusively. In this case, the logic for checking the pool state can be simplified, checking only once per call, taking advantage of the fact that the pool size is even.)Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response