Querying the GitHub archive with the ClickHouse playground

Via this comment on Hacker News I started exploring the ClickHouse Playground. It's really cool, and among other things it allows CORS-enabled API hits that can query a decade of history from the GitHub events archive in less than a second.

ClickHouse

ClickHouse is an open source column-oriented database, originally developed at Yandex but spun out into a separate, VC-funded company in 2021. It's designed for big data analytical queries - in a similar space to HBase, BigQuery and DuckDB.

It turns out it can do that trick with HTTP range queries where you can point it at the URL to a Parquet or .native.zst file (ClickHouse native format, optionally compressed using Facebook Zstandard) and run queries without downloading the entire file first.

Exploring the Playground

The ClickHouse Playground is a free hosted environment for trying out ClickHouse. You can access it here:

https://play.clickhouse.com/play?user=play

Try this query, taken from the ClickHouse Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) tutorial:

SELECT count() FROM github_events WHERE event_type = 'WatchEvent'

The playground interface says Elapsed: 0.043 sec, read 341.54 million rows, 341.54 MB - and returns a count() of 341429632

github_events

The github_events table contains a copy of the GH Archive - a project that archives and makes available the public GitHub timeline, I think using data from the public events API. GH Archive then makes that data available as compressed newline-delimited JSON in this bucket. The archive stretches back to February 2011, and is constantly updated.

The ClickHouse demo table is continually updated with the latest archived data, by this script, which runs every 10 minutes.

You can do all sorts of fun stuff with it. Here's my recent activity acrosss all of GitHub:

SELECT
  created_at,
  actor_login,
  repo_name,
  event_type,
  title
FROM
  github_events
WHERE
  actor_login = 'simonw'
  AND repo_name != 'simonw/disaster-data'
ORDER BY
  created_at DESC
LIMIT
  100

This link executes that query - note how it includes a base64 encoded copy of the SQL query following the # in the URL.

There are 77 tables total in the Playground instance - you can get a list of them like this:

SELECT database, name FROM system.tables

API access

You can access the API via curl like this:

curl 'https://play.clickhouse.com/' \
  -X POST \
  -H 'Authorization: Basic cGxheTo=' \
  --data-raw $'
    SELECT created_at, actor_login, repo_name
    FROM github_events
    WHERE event_type = \'WatchEvent\'
    ORDER BY created_at DESC LIMIT 100'

This defaults to returning TSV without column headers, like this:

2023-01-01 03:59:59	Willmac16	nlohmann/json
2023-01-01 03:59:59	Samrose-Ahmed	Stebalien/stash-rs
2023-01-01 03:59:57	CodePromoter	aplus-framework/image

To get back data in JSON instead, add ?default_format=JSON to the URL. Here I'm piping that through jq to pretty print it:

curl 'https://play.clickhouse.com/?default_format=JSON' \
  -X POST \
  -H 'Authorization: Basic cGxheTo=' \
  --data-raw $'
    SELECT created_at, actor_login, repo_name
    FROM github_events
    WHERE event_type = \'WatchEvent\'
    ORDER BY created_at DESC LIMIT 1' | jq

Output:

{
  "meta": [
    {
      "name": "created_at",
      "type": "DateTime"
    },
    {
      "name": "actor_login",
      "type": "LowCardinality(String)"
    },
    {
      "name": "repo_name",
      "type": "LowCardinality(String)"
    }
  ],
  "data": [
    {
      "created_at": "2023-01-01 03:59:59",
      "actor_login": "Willmac16",
      "repo_name": "nlohmann/json"
    }
  ],
  "rows": 1,
  "rows_before_limit_at_least": 341429632,
  "statistics": {
    "elapsed": 0.925636889,
    "rows_read": 341540363,
    "bytes_read": 10567093585
  }
}

More format options are documented here.

CORS requests from JavaScript

This pattern works for running queries from JavaScript. CORS is enabled - I pasted this into the Firefox DevTools console on https://www.example.com/ and it returned the results I expected:

r = await fetch("https://play.clickhouse.com/?user=play", {
  method: "POST",
  body: `SELECT
        created_at,
        event_type,
        actor_login,
        repo_name,
        number,
        title,
        body
      FROM
        github_events
      WHERE
         actor_login = 'simonw'
      ORDER BY
        created_at desc
      LIMIT
        100
      FORMAT JSON`,
});
d = await r.json();

Here I'm using FORMAT JSON at the end of the query itself, and passing the requested user as ?user=play rather than sending an Authorization header.

Created 2022-12-31T21:06:14-08:00, updated 2023-01-01T10:37:42-08:00 · History · Edit