When building the --sniff option for sqlite-utils insert (which attempts to detect the correct CSV delimiter and quote character by looking at the first 2048 bytes of a CSV file) I needed to peek ahead in an incoming stream of data.
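For context, here's a rough sketch of what sniffing a sample can look like. It assumes the standard library's csv.Sniffer does the detection, which is my assumption for illustration and not something stated above, and the sample bytes are made up.

```python
import csv

# Hypothetical sample: the start of a semicolon-delimited, double-quoted file.
sample_bytes = b'id;name;hobby\n1;"Cleo";"naps"\n2;"Pancakes";"walks"\n'

# Only look at the first 2048 bytes, decoded leniently in case the
# cut-off lands in the middle of a multi-byte character.
sample_text = sample_bytes[:2048].decode("utf-8", errors="ignore")

# csv.Sniffer.sniff() returns a Dialect subclass describing its best guess.
dialect = csv.Sniffer().sniff(sample_text)
detected_delimiter = dialect.delimiter
detected_quotechar = dialect.quotechar
```

The sniffer only ever needs a sample of the data, which is why being able to peek at the start of the stream is enough.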
I use Click, and Click can automatically handle both files and standard input. The problem I had is that peeking ahead in a file is easy (you can call .read() and then .seek(0), or use the .peek() method directly) but peeking ahead in standard input is not: anything you consume from it can't be rewound to later on.
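To make the rewinding problem concrete, here's a small sketch that uses an OS pipe as a stand-in for piped standard input (an assumption for illustration, since a pipe is not seekable in the same way). The sample bytes are invented.

```python
import os

# Hypothetical CSV bytes pushed through a pipe, standing in for piped stdin.
original = b"id,name\n1,Cleo\n"
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(original)

with os.fdopen(read_fd, "rb") as pipe_reader:
    pipe_is_seekable = pipe_reader.seekable()
    # Consume the first few bytes, as a naive "look ahead" would.
    consumed = pipe_reader.read(7)
    try:
        pipe_reader.seek(0)
        rewound = True
    except OSError:
        # io.UnsupportedOperation is a subclass of OSError.
        rewound = False
    # Whatever is left no longer includes the bytes already read.
    remainder = pipe_reader.read()
```

After this runs, the bytes in consumed are gone from the stream: a reader handed pipe_reader at this point would start partway through the header row.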
Since my code works by passing a file-like object to the csv.reader() function, I needed a way to read the first 2048 bytes but then reset the stream ready for that function to consume it.
I figured out how to do that using the io.BufferedReader class. Here's the pattern:
    import io
    import sys
    import csv

    # Get a file-like object in binary mode
    fp = open("myfile.csv", "rb")
    # Or from standard input (need to use .buffer here)
    fp = sys.stdin.buffer

    # Wrap it in a buffered reader with a 4096 byte buffer
    buffered = io.BufferedReader(fp, buffer_size=4096)

    # Wrap THAT in a text io wrapper that can decode to unicode
    decoded = io.TextIOWrapper(buffered, encoding="utf-8")

    # Now I can read the first 2048 bytes...
    # (peek() may return more or fewer bytes than requested)
    first_bytes = buffered.peek(2048)

    # But I can still pass the "decoded" object to csv.reader
    reader = csv.reader(decoded)
    for row in reader:
        print(row)
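Here's a self-contained way to check that the pattern holds on a stream that can't be rewound. This is a sketch under a few assumptions: an OS pipe stands in for standard input, the sample rows are invented, and I've added newline="" (which the csv module documentation recommends) and a slice on the peeked bytes.

```python
import csv
import io
import os

# Hypothetical sample rows, written into a pipe to stand in for piped stdin.
payload = b"id,name\n1,Cleo\n2,Pancakes\n"
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(payload)

# buffering=0 gives a raw, unbuffered reader, so the buffer size is ours to set.
raw = os.fdopen(read_fd, "rb", buffering=0)
buffered = io.BufferedReader(raw, buffer_size=4096)
# newline="" is what the csv module docs recommend for file objects.
decoded = io.TextIOWrapper(buffered, encoding="utf-8", newline="")

# peek() can return more or fewer bytes than requested, so slice the result.
first_bytes = buffered.peek(2048)[:2048]

# Nothing was consumed: csv.reader still starts from the first row.
rows = list(csv.reader(decoded))
decoded.close()
```

The buffer size matters here: peek() can only hand back what fits in the buffer, so the buffer needs to be at least as large as the sample you want to inspect.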
My implementation is in this commit.
Created 2021-02-15T11:17:28-08:00