Given a large, heterogeneous table I wanted to identify patterns in the rows in terms of which columns were not null.
Imagine a table like this for example:
id field1 field2 field3 field4
I want to know how many records have values for (id, field1, field2) compared to the number of rows with values for (id, field3, field4).
I worked out the following query pattern for answering this question:
select
case when [pk] is not null then 'pk, ' else '' end ||
case when [created] is not null then 'created, ' else '' end ||
case when [planet_int] is not null then 'planet_int, ' else '' end ||
case when [on_earth] is not null then 'on_earth, ' else '' end ||
case when [state] is not null then 'state, ' else '' end ||
case when [_city_id] is not null then '_city_id, ' else '' end ||
case when [_neighborhood] is not null then '_neighborhood, ' else '' end ||
case when [tags] is not null then 'tags, ' else '' end ||
case when [complex_array] is not null then 'complex_array, ' else '' end ||
case when [distinct_some_null] is not null then 'distinct_some_null, ' else '' end
as columns,
count(*) as num_rows
from
[facetable]
group by
columns
order by
num_rows desc
This has the desired effect: it gives me back all of the combinations of not-null columns in the table, with a count for each one.
(Running this on a table with 1,000,000+ rows took about 40 seconds, so I had to use datasette data.db --setting sql_time_limit_ms 100000
to bump up the default time limit in Datasette.)
One remaining problem: how to generate the above query for an arbitrary table. I came up with the following SQL query for generating a SQL query like the above:
select 'select
' || group_concat(' case when [' || name || '] is not null then ' || quote(name || ', ') || ' else '''' end', ' ||
') || '
as columns,
count(*) as num_rows
from
[' || :table || ']
group by
columns
order by
num_rows desc' as query from pragma_table_info(:table)
Try that out in a demo that includes the datasette-query-links plugin.
This takes :table
as an input and generates SQL which can be used to generate column-combination counts.
Created 2021-01-12T14:00:51-08:00, updated 2022-03-20T22:34:14-07:00 · History · Edit