struct endianness in Python

TIL that the struct module in the Python standard library defaults to interpreting binary data using the native byte order of the machine it runs on.

This means that the following code:

import struct

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)

behaves differently on big-endian vs. little-endian systems.

I found this out thanks to this bug report against my sqlite-fts4 library.

My decode_matchinfo() function runs against a binary data structure returned by SQLite - more details on that in Exploring search relevance algorithms with SQLite.

SQLite doesn't change the binary format depending on the endianness of the system, which means that my function here works correctly on little-endian but does the wrong thing on big-endian systems:

Update: I was entirely wrong about this. SQLite DOES change the format based on the endianness of the system. My bug fix was incorrect - see this issue comment for details.

On little-endian systems:

>>> buf = b'\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00'
>>> decode_matchinfo(buf)
(1, 2, 2, 2)

But on big-endian systems:

>>> buf = b'\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00'
>>> decode_matchinfo(buf)
(16777216, 33554432, 33554432, 33554432)

The fix is to add a first character to that format string specifying the endianness that should be used. See Byte Order, Size, and Alignment in the Python documentation.

>>> struct.unpack("<IIII", buf)
(1, 2, 2, 2)
>>> struct.unpack(">IIII", buf)
(16777216, 33554432, 33554432, 33554432)
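
To make the prefix characters a little more concrete, here is a small sketch that packs the same integer with each of the explicit prefixes. The variable names are just for illustration; the prefixes themselves ("<", ">", "!" and "=") are the ones listed in the struct documentation.

```python
import struct
import sys

value = 1

little = struct.pack("<I", value)   # explicit little-endian
big = struct.pack(">I", value)      # explicit big-endian
network = struct.pack("!I", value)  # network order, which is big-endian
native = struct.pack("=I", value)   # this machine's byte order, standard sizes

print(little)          # b'\x01\x00\x00\x00'
print(big)             # b'\x00\x00\x00\x01'
print(network == big)  # True

# The native result matches whichever explicit order this machine uses
print(native == (little if sys.byteorder == "little" else big))  # True
```

Leaving the prefix off entirely is equivalent to "@", which uses native byte order as well as native sizes and alignment.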

So the fix for my bug was to rewrite the function to look like this:

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("<" + ("I" * (len(buf) // 4)), buf)
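
Given the update above, hard-coding "<" may not be what you want if the data really is written in the byte order of the machine that produced it. One possible variant, shown here only as a sketch, takes the byte order as an argument. The byteorder parameter is my own hypothetical addition for illustration, not something the sqlite-fts4 library provides, and it defaults to "=" (native order).

```python
import struct

def decode_matchinfo(buf, byteorder="="):
    # buf is a bytestring of unsigned integers, each 4 bytes long.
    # byteorder is a hypothetical parameter for this sketch:
    # "=" is native order, "<" is little-endian, ">" is big-endian.
    count = len(buf) // 4
    # A repeat count such as "2I" means the same as "II"
    return struct.unpack(f"{byteorder}{count}I", buf)

buf = b"\x01\x00\x00\x00\x02\x00\x00\x00"
print(decode_matchinfo(buf, "<"))  # (1, 2)
print(decode_matchinfo(buf, ">"))  # (16777216, 33554432)
```

Calling it with no second argument decodes using the byte order of the current machine, which is the same behavior as the original unprefixed version as far as byte order goes.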

Bonus: How to tell which endianness your system has

Turns out Python can tell you whether your system is big-endian or little-endian like this:

>>> from sys import byteorder
>>> byteorder
'little'
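
As a sanity check, you can get the same answer out of struct by packing a known value in native order and looking at which byte comes first. This is only a sketch; the names native and detected are made up for the example.

```python
import struct
import sys

# Pack 1 as a native-order unsigned 32-bit integer.
# On a little-endian machine the low byte (0x01) comes first.
native = struct.pack("=I", 1)
detected = "little" if native[0] == 1 else "big"

print(detected == sys.byteorder)  # True

# int.from_bytes accepts the same "little" / "big" strings as sys.byteorder
print(int.from_bytes(native, sys.byteorder))  # 1
```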

Created 2022-07-28T08:48:00-07:00, updated 2022-07-30T11:07:51-07:00