GG-49287 Bulk decode of primitive arrays (vectorized float-vector deserialization)#78

Draft
PakhomovAlexander wants to merge 1 commit into master from gg-49287-bulk-float-decode

@PakhomovAlexander

Draft — profiling-driven client-side deserialization fast path (part of GG-49287).

Problem

PrimitiveArray.to_python_not_null deserializes a primitive array element-by-element:
```python
return [ctypes_object.data[i] for i in range(ctypes_object.length)]
```
For large float arrays — e.g. the 1536-d vector returned as the value of each vector-query
result row — this is the dominant client cost. cProfile of a kNN query loop (500 queries × 100
results) against a live node put ~41% of total client CPU in this one list comprehension
(~77M per-element ctypes accesses).
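The ~77M figure follows directly from the workload shape quoted above. A quick check, with variable names that are illustrative only:

```python
# Workload shape taken from the description; names are illustrative.
queries = 500
results_per_query = 100
vector_dim = 1536

# One ctypes item access per vector element in the original comprehension.
element_reads = queries * results_per_query * vector_dim
print(f"{element_reads:,}")  # 76,800,000
```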

Change (one method)

Bulk buffer decode in a single C-level pass:
```python
mv = memoryview(ctypes_object.data)
return mv.cast('B').cast(mv.format.lstrip('<>=!@')).tolist()
```
The cast through bytes is required because a ctypes LittleEndianStructure array reports an
explicit byte-order format (e.g. '<f') that memoryview.tolist() rejects; we strip the
order prefix and reinterpret in native order. This is correct only on little-endian platforms,
where native order matches the little-endian layout of the data. The same code path handles
every primitive array element type.
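A minimal standalone sketch of the technique, assuming a plain ctypes array stands in for `ctypes_object.data` (the helper names `decode_elementwise` and `decode_bulk` are hypothetical and not part of pygridgain):

```python
import ctypes


def decode_elementwise(data):
    """Original approach: one ctypes item access per element."""
    return [data[i] for i in range(len(data))]


def decode_bulk(data):
    """Bulk approach: reinterpret the buffer once, then convert in C."""
    mv = memoryview(data)
    # ctypes reports an explicit byte-order prefix (e.g. '<f'); drop it so the
    # second cast receives a plain native format code such as 'f'.
    native_format = mv.format.lstrip('<>=!@')
    return mv.cast('B').cast(native_format).tolist()


# Values chosen to be exactly representable as 32-bit floats.
floats = (ctypes.c_float * 4)(0.5, -1.25, 3.0, 1024.0)
ints = (ctypes.c_int * 3)(1, -2, 3)

# Calling tolist() directly on the ctypes-backed view is what the double
# cast works around; report the error if this interpreter raises it.
try:
    memoryview(floats).tolist()
except NotImplementedError as exc:
    print("direct tolist() rejected:", exc)

# Both decoders should agree for float and integer element types.
assert decode_bulk(floats) == decode_elementwise(floats)
assert decode_bulk(ints) == decode_elementwise(ints)
```

The sketch checks equivalence only on the host it runs on. It says nothing about big-endian hosts, where little-endian data reinterpreted in native order would decode incorrectly.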

Result

Measured on dbpedia-openai 1536-d vector queries (single client; identical results, with
key checksums unchanged at every efSearch):

| k (results/query) | baseline | this PR | speedup |
|---|---|---|---|
| 10 | 531 q/s | 736 q/s | 1.39x |
| 100 | 71 q/s | 126 q/s | 1.77x |
| 800 | 9.7 q/s | 18.2 q/s | 1.88x |

The win grows with the number of result rows deserialized. Combined with not over-fetching
(request k results with the efSearch beam carried on the negative-threshold sentinel rather
than k=efSearch), this is what closes most of the vector-query client gap to RediSearch in the
high-recall regime.

Notes

Draft. Validated end-to-end via the ann-benchmarks pygridgain harness; CI runs the unit suite.

…erialization)

PrimitiveArray.to_python_not_null built the Python list element-by-element over the
ctypes array (`[obj.data[i] for i in range(len)]`). For large float arrays — e.g. a
1536-d vector value per vector-query result row — this is the dominant client cost:
cProfile of a kNN query loop showed ~41% of total client CPU in this one comprehension
(~77M per-element ctypes reads for 500 queries x 100 results).

Replace it with a single bulk buffer decode: memoryview(obj.data).cast('B').cast(fmt).tolist().
The cast through bytes is needed because ctypes LittleEndianStructure reports an explicit
byte-order format (e.g. '<f') that memoryview.tolist() rejects; we strip the order prefix
and reinterpret in native order (correct on little-endian platforms). Generic across all
primitive array types (float/int/long/...).
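To get a rough feel for the per-array difference in isolation, a micro-benchmark along these lines could be used. It is a sketch under assumptions: a standalone ctypes float array stands in for the deserialized value, the helper names are hypothetical, and the timings it prints depend on the machine and are not the measurements reported here.

```python
import ctypes
import timeit

DIM = 1536  # vector dimensionality quoted in the text


def decode_elementwise(data):
    # One ctypes item access per element.
    return [data[i] for i in range(len(data))]


def decode_bulk(data):
    # Single buffer reinterpretation followed by a C-level tolist().
    mv = memoryview(data)
    return mv.cast('B').cast(mv.format.lstrip('<>=!@')).tolist()


vector = (ctypes.c_float * DIM)(*[i / 8 for i in range(DIM)])

# Sanity check before timing: both paths must produce the same list.
assert decode_bulk(vector) == decode_elementwise(vector)

slow = timeit.timeit(lambda: decode_elementwise(vector), number=200)
fast = timeit.timeit(lambda: decode_bulk(vector), number=200)
print(f"element-wise: {slow:.4f}s  bulk: {fast:.4f}s  ratio: {slow / fast:.1f}x")
```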

Measured (dbpedia-openai 1536-d vector queries, single client, identical results /
key-checksums): 1.4x @ k=10 -> 1.9x @ k=800 on the vector-query result path; the win grows
with the number of result rows deserialized. Part of GG-49287 (pygridgain deserialization
fast path).