Planning parallel downloads with TopologicalSorter

For complicated reasons I found myself wanting to write Python code to resolve a graph of dependencies and produce a plan for efficiently executing them, in parallel where possible.

I figured out how to do it using graphlib.TopologicalSorter, which was added to the Python standard library in Python 3.9.

Since I want my code to work on older versions of Python, I was pleased to see that the graphlib.py module is standalone and can be easily vendored (though it uses the walrus := operator so I had to modify it for compatibility with Python 3.6 and 3.7).

For this example: imagine I know the dependencies of some packages, and I want to fire off some downloads - running them in parallel where possible.

from graphlib import TopologicalSorter

graph = {
    "datasette": {"httpx", "sqlite-utils", "click"},
    "sqlite-utils": {"click", "httpx"}
}

ts = TopologicalSorter(graph)
ts.prepare()
plan = []
while ts.is_active():
    nodes = ts.get_ready()
    plan.append(nodes)
    ts.done(*nodes)

print(plan)

This outputs:

[('click', 'httpx'), ('sqlite-utils',), ('datasette',)]

So the plan would be to download click and httpx in parallel, then sqlite-utils, then datasette.

In this usage of the module we're passing the full graph as an argument to TopologicalSorter. It's also possible to build it up a step at a time like this:

ts = TopologicalSorter()
# add(node, *predecessors)
ts.add("datasette", "httpx", "sqlite-utils", "click")
ts.add("sqlite-utils", "click", "httpx")
ts.prepare()
plan = []
while ts.is_active():
    nodes = ts.get_ready()
    plan.append(nodes)
    ts.done(*nodes)

Having called .prepare() no further .add() calls are allowed.

If you just want the nodes ordered such that none of them come before a dependency you can use .static_order():

>>> graph = {
...     "datasette": {"httpx", "sqlite-utils", "click"},
...     "sqlite-utils": {"click", "httpx"}
... }
>>> ts = TopologicalSorter(graph)
>>> list(ts.static_order())
['click', 'httpx', 'sqlite-utils', 'datasette']

The .prepare() method raises graphlib.CycleError if it detects a cycle:

>>> TopologicalSorter({"datasette": {"sqlite-utils"}, "sqlite-utils": {"datasette"}}).prepare()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/graphlib.py", line 104, in prepare
    raise CycleError(f"nodes are in a cycle", cycle)
graphlib.CycleError: ('nodes are in a cycle', ['datasette', 'sqlite-utils', 'datasette'])

Created 2021-11-16T13:41:07-08:00, updated 2021-11-16T15:48:37-08:00 · History · Edit