Using io.BufferedReader to peek against a non-peekable stream

When building the --sniff option for sqlite-utils insert (which attempts to detect the correct CSV delimiter and quote character by looking at the first 2048 bytes of a CSV file) I needed to peek ahead in an incoming stream of data.

I use Click, and Click can automatically handle both files and standard input. Peeking ahead in a file is easy: you can call .read() and then .seek(0), or use the .peek() method directly. Peeking ahead in standard input is not: anything you consume from it is gone, and you can't rewind to it later on.
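To make the difference concrete, here is a small sketch that contrasts a seekable temporary file with an OS pipe, which behaves like piped standard input. The sample data and variable names are made up for illustration.

```python
import io
import os
import tempfile

# Seekable case: read a few bytes, then rewind with seek(0)
with tempfile.TemporaryFile() as fp:
    fp.write(b"a,b\n1,2\n")
    fp.seek(0)
    file_head = fp.read(3)
    fp.seek(0)  # rewinding works on a regular file
    file_after_rewind = fp.read()

# Non-seekable case: a pipe, standing in for piped standard input
read_fd, write_fd = os.pipe()
os.write(write_fd, b"a,b\n1,2\n")
os.close(write_fd)

with os.fdopen(read_fd, "rb") as pipe:
    pipe_is_seekable = pipe.seekable()
    pipe_head = pipe.read(3)
    try:
        pipe.seek(0)
        rewind_failed = False
    except io.UnsupportedOperation:
        # The bytes already read cannot be "un-read"
        rewind_failed = True
    pipe_rest = pipe.read()

print(file_after_rewind, pipe_is_seekable, rewind_failed, pipe_rest)
```

After the failed rewind, only the unread remainder of the pipe is still available, which is exactly the problem for code that wants to inspect the start of the stream and then parse all of it.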

Since my code works by passing a file-like object to the csv.reader() function, I needed a way to look at the first 2048 bytes and still leave the stream intact for that function to consume from the start.

I figured out how to do that using the io.BufferedReader class. Here's the pattern:

import io
import sys
import csv

# Get a file-like object in binary mode
fp = open("myfile.csv", "rb")
# Or from standard input (need to use .buffer here to get bytes):
# fp = sys.stdin.buffer

# Wrap it in a buffered reader with a 4096 byte buffer
buffered = io.BufferedReader(fp, buffer_size=4096)

# Wrap THAT in a text io wrapper that can decode to unicode
# (newline="" leaves line endings untouched, as the csv module recommends)
decoded = io.TextIOWrapper(buffered, encoding="utf-8", newline="")

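# Now I can look at the first 2048 bytes without consuming them.
# peek() may return fewer or more bytes than requested, so slice the result
first_bytes = buffered.peek(2048)[:2048]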

# But I can still pass the "decoded" object to csv.reader
reader = csv.reader(decoded)
for row in reader:
    print(row)
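As a rough illustration of how the peeked bytes could feed delimiter detection, the sketch below passes them to csv.Sniffer and then parses the full stream with the detected dialect. This is not the sqlite-utils implementation: NonSeekableStream and sniff_and_read are hypothetical names invented here, and the list of candidate delimiters is an assumption.

```python
import csv
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for piped stdin: readable, but not seekable."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[: len(chunk)] = chunk
        return len(chunk)


def sniff_and_read(raw, sample_size=2048):
    """Detect the delimiter from a peeked sample, then parse every row."""
    buffered = io.BufferedReader(raw, buffer_size=4096)
    decoded = io.TextIOWrapper(buffered, encoding="utf-8", newline="")
    # peek() does not advance the stream; slice because it may return extra
    sample = buffered.peek(sample_size)[:sample_size]
    dialect = csv.Sniffer().sniff(
        sample.decode("utf-8", errors="ignore"),
        delimiters=",;\t|",  # assumed candidate set, adjust as needed
    )
    rows = list(csv.reader(decoded, dialect))
    return dialect.delimiter, rows


delimiter, rows = sniff_and_read(
    NonSeekableStream(b"id;name;age\n1;Cleo;5\n2;Pancakes;4\n")
)
print(delimiter, rows)
```

The sample is decoded with errors="ignore" because cutting the bytes at a fixed length can split a multi-byte UTF-8 character at the end of the sample.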

My implementation is in this commit.

Created 2021-02-15T11:17:28-08:00