vivym/midjourney-messages on Hugging Face is a large (~8GB) dataset of 55,082,563 Midjourney message records - each with the prompt and a URL to the generated image hosted on Discord.
TLDR: Each record links to a Discord CDN URL, and the total size of all of those images is 148.07 TB - so Midjourney has cost Discord a LOT of money in CDN costs!
Read on for the details of how I figured this out.
Each of the records looks like this (converted to JSON):
{
"id": "1144508197969854484",
"channel_id": "989268300473192561",
"content": "**adult Goku in Dragonball Z, walking on a beach, in a Akira Toriyama anime style** - Image #1 <@1016225582566101084>",
"timestamp": "2023-08-25T05:46:58.330000+00:00",
"image_id": "1144508197693046875",
"height": 1024,
"width": 1024,
"url": "https://cdn.discordapp.com/attachments/989268300473192561/1144508197693046875/anaxagore54_adult_Goku_in_Dragonball_Z_walking_on_a_beach_in_a__987e6fd5-64a1-43f6-83dd-c58d2eb42948.png",
"size": 1689284
}
The records are hosted on Hugging Face as Parquet files - 56 of them, each around 160MB. The files are stored in Git LFS, but Hugging Face also offers direct HTTP download links for each of them.
I wanted to total up the size column there to see how much space on Discord's CDN is taken up by these Midjourney images. But I didn't want to download all 8GB of data just to run that query.
DuckDB can query remotely hosted Parquet files, using HTTP Range header tricks to retrieve just the subset of data needed to answer a query - so it should be able to sum up the size column without downloading the whole dataset. (Depending on your DuckDB version, you may first need to install and load the httpfs extension.)
The URL to each Parquet file looks like this (this actually redirects to a final URL, but DuckDB seems to be able to follow those redirects transparently):
https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet
Here's a DuckDB query that calculates the sum of that size column without retrieving the entire file:
SELECT SUM(size) FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet';
Running that in the duckdb CLI tool gives me this result:
┌───────────────┐
│ sum(size) │
│ int128 │
├───────────────┤
│ 3456458790156 │
└───────────────┘
That's 3.14TB (counting 1TB as 2^40 bytes), just for the first file.
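The conversion from bytes can be double-checked with quick arithmetic in Python, using binary units:

```python
# sum(size) for 000000.parquet, from the query result above
size_bytes = 3456458790156

# Binary units: 1 TB here means 2**40 bytes
tib = size_bytes / 1024**4
print(f"{tib:.2f} TB")  # 3.14 TB
```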
How many bytes of data did that retrieve? We can find out on macOS using the nettop command.
First we need the PID of our duckdb process:
ps aux | grep duckdb
simon 19992 0.0 0.0 408644352 1520 s114 S+ 2:30PM 0:00.00 grep duckdb
simon 19985 0.0 0.0 408527616 4752 s118 S+ 2:30PM 0:00.01 duckdb
Now we can run the following, before we execute that SQL query:
nettop -p 19985
Then I ran the SQL query, and saw this in the output from nettop:
5331 KiB
So DuckDB retrieved 5.3MB of data (from a file that was 159MB) to answer that query.
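That works out to only a few percent of the file - a rough calculation (the file size is approximate):

```python
fetched_kib = 5331             # from the nettop output above
file_bytes = 159_000_000       # the file is roughly 159MB

# Fraction of the file DuckDB actually had to download
print(f"{fetched_kib * 1024 / file_bytes:.1%}")
```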
How about the total across all 56 files? I generated this UNION ALL query to answer that question:
SELECT SUM(size_total)
FROM (
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet' UNION ALL
SELECT SUM(size) as size_total FROM 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
);
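One way to generate a query like that, rather than typing out 56 URLs by hand, is a short Python loop - a sketch, not necessarily the script used for the original:

```python
base = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"

# One SELECT per file, 000000.parquet through 000055.parquet
selects = [
    f"SELECT SUM(size) as size_total FROM '{base}{n:06d}.parquet'"
    for n in range(56)
]
query = "SELECT SUM(size_total)\nFROM (\n" + " UNION ALL\n".join(selects) + "\n);"
print(query)
```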
Update: Alex Monahan tipped me off to a more concise alternative for the same query:
SELECT SUM(size)
FROM read_parquet([
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000001.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000002.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000003.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000004.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000005.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000006.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000007.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000008.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000009.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000010.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000011.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000012.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000013.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000014.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000015.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000016.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000017.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000018.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000019.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000020.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000021.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000022.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000023.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000024.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000025.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000026.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000027.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000028.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000029.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000030.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000031.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000032.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000033.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000034.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000035.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000036.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000037.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000038.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000039.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000040.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000041.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000042.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000043.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000044.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000045.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000046.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000047.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000048.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000049.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000050.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000051.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000052.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000053.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000054.parquet',
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000055.parquet'
]);
This version displays a useful progress bar while the query is executing.
@adityawarmanfw shared this much more elegant solution:
SELECT
SUM(size) AS size
FROM read_parquet(
list_transform(
generate_series(0, 55),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
);
This is using a DuckDB lambda function - really neat!
To measure the total data transferred, I ran the full query in a fresh DuckDB instance with nettop watching the network traffic. The total came to 287 MiB - still a lot of data, but a big saving over the full 8GB.
That's also around what I'd expect for 56 files, given that a single file fetched 5.3MB earlier and 5.3 * 56 = 296.8.
The result it gave me was:
┌─────────────────┐
│ sum(size_total) │
│ int128 │
├─────────────────┤
│ 162800469938172 │
└─────────────────┘
That's 162800469938172 bytes - or 148.07 TB of CDN space used by Midjourney images!
(I got ChatGPT to build me a tool for converting bytes to KB/MB/GB/TB: Byte Size Converter.)
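A minimal Python version of such a converter might look like this - the function name is mine, and it uses binary (1024-based) units to match the figures in this post:

```python
def human_size(n: float) -> str:
    """Convert a byte count to a human-readable string using 1024-based units."""
    for unit in ["bytes", "KB", "MB", "GB", "TB", "PB"]:
        if n < 1024:
            return f"{n:.2f} {unit}"
        n /= 1024
    return f"{n:.2f} EB"

print(human_size(162800469938172))  # 148.07 TB
```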
You can run this query using a CTE to make it nicer to read:
with midjourney_messages as (
select
*
from read_parquet(
list_transform(
generate_series(0, 2),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
)
)
select sum(size) as size from midjourney_messages;
(I used generate_series(0, 2) here instead of (0, 55) to speed up these subsequent experiments.)
Or you can define a view, which lets you refer to midjourney_messages in multiple queries. This is a bad idea though, as each time you execute a query against the view it will download the data again:
create view midjourney_messages as
select * from read_parquet(
list_transform(
generate_series(0, 2),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
);
Running create view here transferred 37 KiB according to nettop - presumably from loading metadata in order to be able to answer describe queries like this one:
describe midjourney_messages;
Output:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id │ VARCHAR │ YES │ │ │ │
│ channel_id │ VARCHAR │ YES │ │ │ │
│ content │ VARCHAR │ YES │ │ │ │
│ timestamp │ VARCHAR │ YES │ │ │ │
│ image_id │ VARCHAR │ YES │ │ │ │
│ height │ BIGINT │ YES │ │ │ │
│ width │ BIGINT │ YES │ │ │ │
│ url │ VARCHAR │ YES │ │ │ │
│ size │ BIGINT │ YES │ │ │ │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
nettop confirmed that approximately 114 KiB of data was fetched each time I ran this query:
select count(*) from midjourney_messages;
chrisjc tipped me off about the parquet_metadata() function, which can be used like this:
select * from parquet_metadata(
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
);
That returns the following, after fetching 37 KiB:
┌──────────────────────┬──────────────┬────────────────────┬───┬──────────────────┬──────────────────────┬──────────────────────┐
│ file_name │ row_group_id │ row_group_num_rows │ … │ data_page_offset │ total_compressed_s… │ total_uncompressed… │
│ varchar │ int64 │ int64 │ │ int64 │ int64 │ int64 │
├──────────────────────┼──────────────┼────────────────────┼───┼──────────────────┼──────────────────────┼──────────────────────┤
│ https://huggingfac… │ 0 │ 1000000 │ … │ 601280 │ 13133418 │ 23093988 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 13133571 │ 116 │ 112 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 13396455 │ 46191873 │ 208657682 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 59593218 │ 9046231 │ 36052113 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 68973087 │ 13118570 │ 23093988 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 81491806 │ 498549 │ 915584 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 81990515 │ 496767 │ 916607 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 82909448 │ 71430496 │ 180090922 │
│ https://huggingfac… │ 0 │ 1000000 │ … │ 154593238 │ 5392260 │ 8286381 │
├──────────────────────┴──────────────┴────────────────────┴───┴──────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows 23 columns (6 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Since some of the columns here were truncated in the middle, I typed .columns to switch modes and ran the query again:
┌──────────────────────┬─────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│ Column │ Type │ Row 1 │ … │ Row 7 │ Row 8 │ Row 9 │
├──────────────────────┼─────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│ file_name │ varchar │ https://huggingfac… │ … │ https://huggingfac… │ https://huggingfac… │ https://huggingfac… │
│ row_group_id │ int64 │ 0 │ … │ 0 │ 0 │ 0 │
│ row_group_num_rows │ int64 │ 1000000 │ … │ 1000000 │ 1000000 │ 1000000 │
│ row_group_num_colu… │ int64 │ 9 │ … │ 9 │ 9 │ 9 │
│ row_group_bytes │ int64 │ 481107377 │ … │ 481107377 │ 481107377 │ 481107377 │
│ column_id │ int64 │ 0 │ … │ 6 │ 7 │ 8 │
│ file_offset │ int64 │ 13133422 │ … │ 82486423 │ 153917026 │ 159309733 │
│ num_values │ int64 │ 1000000 │ … │ 1000000 │ 1000000 │ 1000000 │
│ path_in_schema │ varchar │ id │ … │ width │ url │ size │
│ type │ varchar │ BYTE_ARRAY │ … │ INT64 │ BYTE_ARRAY │ INT64 │
│ stats_min │ varchar │ │ … │ 256 │ │ 312 │
│ stats_max │ varchar │ │ … │ 9408 │ │ 17937790 │
│ stats_null_count │ int64 │ 0 │ … │ 0 │ 0 │ 0 │
│ stats_distinct_count │ int64 │ │ … │ │ │ │
│ stats_min_value │ varchar │ 1089054097631629352 │ … │ 256 │ https://cdn.discor… │ 312 │
│ stats_max_value │ varchar │ 1144508197969854484 │ … │ 9408 │ https://cdn.discor… │ 17937790 │
│ compression │ varchar │ SNAPPY │ … │ SNAPPY │ SNAPPY │ SNAPPY │
│ encodings │ varchar │ PLAIN, RLE, RLE_DI… │ … │ PLAIN, RLE, RLE_DI… │ PLAIN, RLE, RLE_DI… │ PLAIN, RLE, RLE_DI… │
│ index_page_offset │ int64 │ │ … │ │ │ │
│ dictionary_page_of… │ int64 │ 4 │ … │ 81989656 │ 82486530 │ 153917473 │
│ data_page_offset │ int64 │ 601280 │ … │ 81990515 │ 82909448 │ 154593238 │
│ total_compressed_s… │ int64 │ 13133418 │ … │ 496767 │ 71430496 │ 5392260 │
│ total_uncompressed… │ int64 │ 23093988 │ … │ 916607 │ 180090922 │ 8286381 │
├──────────────────────┴─────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┤
│ 9 rows (4 shown) 23 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Typing .rows switches the mode back to the default again.
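As a sanity check, the total_compressed_size values for the nine columns shown in that metadata should add up to roughly the size of the whole Parquet file (around 159MB) - and they do:

```python
# total_compressed_size for each of the nine columns, from the metadata output above
compressed = [
    13133418, 116, 46191873, 9046231, 13118570,
    498549, 496767, 71430496, 5392260,
]
total = sum(compressed)
print(f"{total / 1000 / 1000:.1f} MB")  # 159.3 MB
```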
richraposa on Hacker News pointed out that ClickHouse can do the same HTTP Range header trick:
SELECT sum(size)
FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
Output:
┌───────sum(size)─┐
│ 159344011148016 │
└─────────────────┘
1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
Peak memory usage: 458.88 KiB.
This transfers around 290 MiB, effectively the same as DuckDB.
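One detail worth noting: the {01..55} range expands to 55 filenames starting at 000001, so this pattern skips 000000.parquet - which is why the ClickHouse total is lower than DuckDB's by exactly that first file's 3456458790156 bytes. A quick Python check of the expansion:

```python
# ClickHouse expands 0000{01..55}.parquet to these names
names = [f"0000{n:02d}.parquet" for n in range(1, 56)]
print(len(names), names[0], names[-1])  # 55 000001.parquet 000055.parquet

# The gap between the two totals is exactly file 000000's sum
print(162800469938172 - 159344011148016)  # 3456458790156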
Polars provides "blazingly fast DataFrames in Rust, Python, Node.js, R and SQL". Apparently inspired by this post, they added support for answering this kind of query by streaming portions of Parquet files over HTTP in #12493: Downloading https dataset doesn't appropriately push down predicates.
In the released version, the following should work:
import polars as pl
base_url = "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"
files = []
for i in range(56):
url = f"{base_url}{i:06}.parquet"
files.append(url)
df = pl.scan_parquet(files).select(pl.col("size").sum())
print(df.collect())
Created 2023-11-14T14:43:17-08:00, updated 2023-11-17T09:33:11-08:00