Summing columns in remote Parquet files using DuckDB

vivym/midjourney-messages on Hugging Face is a large (~8GB) dataset of 55,082,563 Midjourney messages - each one with the prompt and a URL to the generated image hosted on Discord.

TLDR: Each record links to a Discord CDN URL, and the total size of all of those images is 148.07 TiB - so Midjourney has cost Discord a LOT of money in CDN costs!

Read on for the details of how I figured this out.

Each of the records looks like this (converted to JSON):

{
  "id": "1144508197969854484",
  "channel_id": "989268300473192561",
  "content": "**adult Goku in Dragonball Z, walking on a beach, in a Akira Toriyama anime style** - Image #1 <@1016225582566101084>",
  "timestamp": "2023-08-25T05:46:58.330000+00:00",
  "image_id": "1144508197693046875",
  "height": 1024,
  "width": 1024,
  "url": "https://cdn.discordapp.com/attachments/989268300473192561/1144508197693046875/anaxagore54_adult_Goku_in_Dragonball_Z_walking_on_a_beach_in_a__987e6fd5-64a1-43f6-83dd-c58d2eb42948.png",
  "size": 1689284
}

The records are hosted on Hugging Face as Parquet files - 56 of them, each around 160MB. The full list is visible in the dataset's file browser - the files are stored in Git LFS, but Hugging Face also offers direct HTTP download links.

I wanted to total up the size column there to see how space on Discord's CDN is taken up by these Midjourney images. But I didn't want to download all 8GB of data just to run that query.

DuckDB can query remotely hosted Parquet files, using HTTP Range header tricks to retrieve just the subset of data needed to answer a query - for an aggregation like this, that means each file's metadata footer plus the column chunks holding the size column. It should be able to sum the size column without downloading the whole dataset.
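
Note that remote HTTP reads in DuckDB are handled by the httpfs extension. Recent builds will usually install and load it automatically the first time a URL is queried, but if you get an error about the https:// protocol you can set it up explicitly:

INSTALL httpfs;  -- one-time download of the extension
LOAD httpfs;     -- enable http(s):// URLs in this session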

The URL to each Parquet file looks like this (this actually redirects to a final URL, but DuckDB seems to be able to follow those redirects transparently):

https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet

Querying a single file

Here's a DuckDB query that calculates the sum of that size column without retrieving the entire file:

SELECT SUM(size) FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet';

Running that in the duckdb CLI tool gives me this result:

┌───────────────┐
│   sum(size)   │
│    int128     │
├───────────────┤
│ 3456458790156 │
└───────────────┘

That's 3.14 TiB, just for the first file.

Tracking network usage with nettop

How many bytes of data did that retrieve? We can find out on macOS using the nettop command.

First we need the PID of our duckdb process:

ps aux | grep duckdb
simon            19992   0.0  0.0 408644352   1520 s114  S+    2:30PM   0:00.00 grep duckdb
simon            19985   0.0  0.0 408527616   4752 s118  S+    2:30PM   0:00.01 duckdb

Now, before executing the SQL query, we can start nettop watching that process:

nettop -p 19985

Then I ran the SQL query, and saw this in the output from nettop:

5331 KiB

So DuckDB retrieved 5.3MB of data (from a file that was 159MB) to answer that query.

Using UNION ALL

How about the total across all 56 files? I generated this UNION ALL query to answer that question - each inner SELECT sums a single file, and the outer query adds up the 56 subtotals:

SELECT SUM(size_total)
FROM (
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet' UNION ALL
    SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
);
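
As an aside, that wall of repetitive SQL is easy to generate mechanically. Here's a sketch of my own showing one way to do it in DuckDB itself, building the inner SELECT lines as a single string:

SELECT string_agg(
    'SELECT SUM(size) as size_total FROM ''https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/'
        || format('{:06d}', n) || '.parquet''',
    ' UNION ALL' || chr(10)
    ORDER BY n
) AS generated_sql
FROM generate_series(0, 55) t(n);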

Using read_parquet()

Update: Alex Monahan tipped me off to a more concise alternative for the same query:

SELECT SUM(size)
FROM read_parquet([
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet',
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
]);

This version displays a useful progress bar while the query is executing:

The results of the query with a progress bar at 100%

Using list_transform()

@adityawarmanfw shared this much more elegant solution:

SELECT
    SUM(size) AS size
FROM read_parquet(
    list_transform(
        generate_series(0, 55),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);

This is using a DuckDB lambda function - really neat!
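
To see what that lambda produces, you can evaluate the list expression on its own - here just the first three filenames, with the URL prefix dropped for readability (an illustrative query of mine, not from the original experiment):

SELECT list_transform(
    generate_series(0, 2),
    n -> format('{:06d}', n) || '.parquet'
) AS filenames;

generate_series(0, 2) returns the list [0, 1, 2], so this should come back as ['000000.parquet', '000001.parquet', '000002.parquet'].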

To measure the bandwidth used by the full 56-file query, I ran it in a fresh DuckDB instance with nettop watching the network traffic. Here's what that looked like while it was running:

Animated GIF of nettop showing different connections being made and how much bandwidth is used for each one

The total data transferred was 287 MiB - still a lot of data, but a big saving on 8GB.

That's also around what I'd expect for 56 files, given that a single file fetched 5.3MB earlier and 5.3 * 56 = 296.8.

The end result

The result it gave me was:

┌─────────────────┐
│ sum(size_total) │
│     int128      │
├─────────────────┤
│ 162800469938172 │
└─────────────────┘

That's 162800469938172 bytes - or 148.07 TiB of CDN space used by Midjourney images!

(I got ChatGPT to build me a tool for converting bytes to KB/MB/GB/TB: Byte Size Converter.)
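
Newer DuckDB releases can also do this conversion in SQL via the format_bytes() function - an assumption about your build, so check the documentation for your version; it reports binary units (KiB, MiB, GiB, TiB):

SELECT format_bytes(162800469938172) AS total;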

CTEs and views work too

You can run this query using a CTE to make it nicer to read:

with midjourney_messages as (
    select
        *
    from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    )
)
select sum(size) as size from midjourney_messages;

(I used generate_series(0, 2) here instead of (0, 55) to speed up these subsequent experiments.)

Or you can define a view, which lets you refer to midjourney_messages in multiple queries. This is a bad idea though, as each time you execute the query against the view it will download the data again:

create view midjourney_messages as
select * from read_parquet(
    list_transform(
        generate_series(0, 2),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);

Running create view here transferred 37 KiB according to nettop - presumably from loading metadata in order to be able to answer describe queries like this one:

describe midjourney_messages;

Output:

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ VARCHAR     │ YES     │         │         │       │
│ channel_id  │ VARCHAR     │ YES     │         │         │       │
│ content     │ VARCHAR     │ YES     │         │         │       │
│ timestamp   │ VARCHAR     │ YES     │         │         │       │
│ image_id    │ VARCHAR     │ YES     │         │         │       │
│ height      │ BIGINT      │ YES     │         │         │       │
│ width       │ BIGINT      │ YES     │         │         │       │
│ url         │ VARCHAR     │ YES     │         │         │       │
│ size        │ BIGINT      │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘

nettop confirmed that each time I ran this query:

select count(*) from midjourney_messages;

Approximately 114 KiB of data was fetched.
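
If you do want to run several queries against this data, one workaround (a sketch of my own, not something from the original experiment) is to materialize just the column you need into a local table, paying the download cost once:

create table midjourney_sizes as
select size from read_parquet(
    list_transform(
        generate_series(0, 2),
        n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
            format('{:06d}', n) || '.parquet'
    )
);

-- subsequent queries run against local data, with no further network traffic
select sum(size) as size from midjourney_sizes;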

parquet_metadata()

chrisjc tipped me off about the parquet_metadata() function, which can be used like this:

select * from parquet_metadata(
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
);

That returns the following, after fetching 37 KiB:

┌──────────────────────┬──────────────┬────────────────────┬───┬──────────────────┬──────────────────────┬──────────────────────┐
│      file_name       │ row_group_id │ row_group_num_rows │ … │ data_page_offset │ total_compressed_s…  │ total_uncompressed…  │
│       varchar        │    int64     │       int64        │   │      int64       │        int64         │        int64         │
├──────────────────────┼──────────────┼────────────────────┼───┼──────────────────┼──────────────────────┼──────────────────────┤
│ https://huggingfac…  │            0 │            1000000 │ … │           601280 │             13133418 │             23093988 │
│ https://huggingfac…  │            0 │            1000000 │ … │         13133571 │                  116 │                  112 │
│ https://huggingfac…  │            0 │            1000000 │ … │         13396455 │             46191873 │            208657682 │
│ https://huggingfac…  │            0 │            1000000 │ … │         59593218 │              9046231 │             36052113 │
│ https://huggingfac…  │            0 │            1000000 │ … │         68973087 │             13118570 │             23093988 │
│ https://huggingfac…  │            0 │            1000000 │ … │         81491806 │               498549 │               915584 │
│ https://huggingfac…  │            0 │            1000000 │ … │         81990515 │               496767 │               916607 │
│ https://huggingfac…  │            0 │            1000000 │ … │         82909448 │             71430496 │            180090922 │
│ https://huggingfac…  │            0 │            1000000 │ … │        154593238 │              5392260 │              8286381 │
├──────────────────────┴──────────────┴────────────────────┴───┴──────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows                                                                                                   23 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Since some of the columns here were truncated in the middle, I typed .columns to switch modes and ran the query again:

┌──────────────────────┬─────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│        Column        │  Type   │        Row 1         │ … │        Row 7         │        Row 8         │        Row 9         │
├──────────────────────┼─────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│ file_name            │ varchar │  https://huggingfac… │ … │  https://huggingfac… │  https://huggingfac… │  https://huggingfac… │
│ row_group_id         │ int64   │                    0 │ … │                    0 │                    0 │                    0 │
│ row_group_num_rows   │ int64   │              1000000 │ … │              1000000 │              1000000 │              1000000 │
│ row_group_num_colu…  │ int64   │                    9 │ … │                    9 │                    9 │                    9 │
│ row_group_bytes      │ int64   │            481107377 │ … │            481107377 │            481107377 │            481107377 │
│ column_id            │ int64   │                    0 │ … │                    6 │                    7 │                    8 │
│ file_offset          │ int64   │             13133422 │ … │             82486423 │            153917026 │            159309733 │
│ num_values           │ int64   │              1000000 │ … │              1000000 │              1000000 │              1000000 │
│ path_in_schema       │ varchar │                   id │ … │                width │                  url │                 size │
│ type                 │ varchar │           BYTE_ARRAY │ … │                INT64 │           BYTE_ARRAY │                INT64 │
│ stats_min            │ varchar │                      │ … │                  256 │                      │                  312 │
│ stats_max            │ varchar │                      │ … │                 9408 │                      │             17937790 │
│ stats_null_count     │ int64   │                    0 │ … │                    0 │                    0 │                    0 │
│ stats_distinct_count │ int64   │                      │ … │                      │                      │                      │
│ stats_min_value      │ varchar │  1089054097631629352 │ … │                  256 │  https://cdn.discor… │                  312 │
│ stats_max_value      │ varchar │  1144508197969854484 │ … │                 9408 │  https://cdn.discor… │             17937790 │
│ compression          │ varchar │               SNAPPY │ … │               SNAPPY │               SNAPPY │               SNAPPY │
│ encodings            │ varchar │  PLAIN, RLE, RLE_DI… │ … │  PLAIN, RLE, RLE_DI… │  PLAIN, RLE, RLE_DI… │  PLAIN, RLE, RLE_DI… │
│ index_page_offset    │ int64   │                      │ … │                      │                      │                      │
│ dictionary_page_of…  │ int64   │                    4 │ … │             81989656 │             82486530 │            153917473 │
│ data_page_offset     │ int64   │               601280 │ … │             81990515 │             82909448 │            154593238 │
│ total_compressed_s…  │ int64   │             13133418 │ … │               496767 │             71430496 │              5392260 │
│ total_uncompressed…  │ int64   │             23093988 │ … │               916607 │            180090922 │              8286381 │
├──────────────────────┴─────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows  (4 shown)                                                                                                   23 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Typing .rows switches the mode back to the default again.
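
Those stats_min_value and stats_max_value columns are the per-column statistics that make metadata-only tricks possible. As a follow-up illustration (my own query), the largest value of the size column can be read without touching any data pages - 17937790 bytes, so the biggest single image in this file is around 17MB:

select path_in_schema, stats_max_value
from parquet_metadata(
    'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
)
where path_in_schema = 'size';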

The same trick in ClickHouse

richraposa on Hacker News pointed out that ClickHouse can do the same HTTP Range header trick:

SELECT sum(size)
    FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')

Output:

┌───────sum(size)─┐
│ 159344011148016 │
└─────────────────┘

1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
Peak memory usage: 458.88 KiB.

This transfers around 290 MiB, effectively the same as DuckDB. One caveat: the {01..55} range skips 000000.parquet, so this sums only 55 of the 56 files - the difference from the DuckDB total is exactly the first file's 3456458790156 bytes. Changing the pattern to 0000{00..55}.parquet should pick up the missing file, since ClickHouse ranges preserve leading zeros.

Polars support for streaming Parquet over HTTP

Polars provides "blazingly fast DataFrames in Rust, Python, Node.js, R and SQL". Apparently inspired by this post, they added support for answering this kind of query by streaming portions of Parquet files over HTTP in #12493: Downloading https dataset doesn't appropriately push down predicates.

In the released version, the following should work:

import polars as pl

base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

# Build the list of 56 URLs: 000000.parquet through 000055.parquet
files = [f"{base_url}{i:06}.parquet" for i in range(56)]

# scan_parquet() builds a lazy query; collect() executes it,
# fetching only the data needed to sum the size column
df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())

Created 2023-11-14T14:43:17-08:00, updated 2023-11-17T09:33:11-08:00