Processing a stream of chunks of JSON with ijson

A follow-up to Using OpenAI functions and their Python library for data extraction and Using the ChatGPT streaming API from Python. If I have a stream of chunks of a larger JSON document, how can I output full individual JSON objects as soon as they are available?

My completed JSON will look like this:

{
  "items": [
    {
      "title": "Classic and Custom Car Show",
      "date": "Sat, Sep 2",
      "venue_name": "1700 W Hillsdale Blvd • San Mateo, CA",
      "start_time": "9:00 AM"
    },
    {
      "title": "【衛蘭、王君馨、衛詩、鍾舒漫、吉中鳴牧師】MMO AINEO LIVE 愛你喲!佈道會(三藩市站)",
      "date": "Sat, Aug 26",
      "venue_name": "San Mateo Performing Arts Center • San Mateo, CA",
      "start_time": "7:30 PM"
    },
    {
      "title": "San Francisco Small Business Expo 2023",
      "date": "Fri, Aug 18",
      "venue_name": "Hyatt Regency San Francisco Airport • Burlingame, CA",
      "start_time": "10:00 AM"
    },
    {
      "title": "Stanford Genetics Conference on Structural Variants and DNA Repeats",
      "date": "Thu, Sep 7",
      "venue_name": "Stanford Center for Academic Medicine • Palo Alto, CA",
      "start_time": "8:00 AM"
    },
    {
      "title": "DJ QUIK / ROJAI / Z-MAN / DJ SAURUS",
      "date": "Fri, Aug 18",
      "venue_name": "The Longboard Margarita Bar • Pacifica, CALIFORNIA",
      "start_time": "9:00 PM"
    },
    {
      "title": "DUE SOUTH W/ CHERRY GLAZERR, MOMMA (DUO), KING ISIS (FREE!)",
      "date": "Sat, Aug 26",
      "venue_name": "Jerry Garcia Amphitheater • San Francisco, CA",
      "start_time": "2:30 PM"
    }
  ]
}

If that's going to arrive as a sequence of chunks, how can I display those items as soon as they become available?

After much experimentation I figured out this recipe using ijson:

import ijson
import json
import time

chunks = [
    '{\n  "items": [\n    {\n      "ti',
    'tle": "Classic and Custom Car ',
    'Show",\n      "date": "Sat, Sep',
    ' 2",\n      "venue_name": "1700',
    " W Hillsdale Blvd • San Mateo,",
    ' CA",\n      "start_time": "9:0',
    '0 AM"\n    },\n    {\n      "titl',
    'e": "【衛蘭、王君馨、衛詩、鍾舒漫、吉中鳴牧師】MMO ',
    'AINEO LIVE 愛你喲!佈道會(三藩市站)",\n   ',
    '   "date": "Sat, Aug 26",\n    ',
    '  "venue_name": "San Mateo Per',
    "forming Arts Center • San Mate",
    'o, CA",\n      "start_time": "7',
    ':30 PM"\n    },\n    {\n      "ti',
    'tle": "San Francisco Small Bus',
    'iness Expo 2023",\n      "date"',
    ': "Fri, Aug 18",\n      "venue_',
    'name": "Hyatt Regency San Fran',
    "cisco Airport • Burlingame, CA",
    '",\n      "start_time": "10:00 ',
    'AM"\n    },\n    {\n      "title"',
    ': "Stanford Genetics Conferenc',
    "e on Structural Variants and D",
    'NA Repeats",\n      "date": "Th',
    'u, Sep 7",\n      "venue_name":',
    ' "Stanford Center for Academic',
    ' Medicine • Palo Alto, CA",\n  ',
    '    "start_time": "8:00 AM"\n  ',
    '  },\n    {\n      "title": "DJ ',
    "QUIK / ROJAI / Z-MAN / DJ SAUR",
    'US",\n      "date": "Fri, Aug 1',
    '8",\n      "venue_name": "The L',
    "ongboard Margarita Bar • Pacif",
    'ica, CALIFORNIA",\n      "start',
    '_time": "9:00 PM"\n    },\n    {',
    '\n      "title": "DUE SOUTH W/ ',
    "CHERRY GLAZERR, MOMMA (DUO), K",
    'ING ISIS (FREE!)",\n      "date',
    '": "Sat, Aug 26",\n      "venue',
    '_name": "Jerry Garcia Amphithe',
    'ater • San Francisco, CA",\n   ',
    '   "start_time": "2:30 PM"\n   ',
    " }\n  ]\n}",
]


events = ijson.sendable_list()
coro = ijson.items_coro(events, "items.item")

seen_events = set()

for chunk in chunks:
    coro.send(chunk.encode("utf-8"))
    if events:
        # Any we have not seen yet?
        unseen_events = [e for e in events if json.dumps(e) not in seen_events]
        if unseen_events:
            for event in unseen_events:
                seen_events.add(json.dumps(event))
                print(json.dumps(event))
                time.sleep(1)

You create an ijson coroutine, then send it chunks of JSON data. It will write new items to the sendable_list() as soon as they are available.

The hardest part to figure out was this:

coro = ijson.items_coro(events, "items.item")

This is the syntax to indicate that I'm interested in the array items in the {"items": [...]} object.

The confusing part is that item here is a reserved word in ijson which means "individual items of an array". I'm not sure what you are meant to do if your nested object also has a key called "item" - I guess work to avoid that situation from coming up!

This works though. The above code, when executed, prints out each of the nested objects one at a time, with a 1 second sleep between each one.

Created 2023-08-15T18:07:38-07:00 · Edit