Identifying column combination patterns in a SQLite table

Given a large, heterogeneous table I wanted to identify patterns in the rows in terms of which columns were not null.

Imagine a table like this for example:

id  field1  field2  field3  field4

I want to know how many records have values for (id, field1, field2) compared to the number of rows with values for (id, field3, field4).

I worked out the following query pattern for answering this question:

select
    case when [pk] is not null then 'pk, ' else '' end ||
    case when [created] is not null then 'created, ' else '' end ||
    case when [planet_int] is not null then 'planet_int, ' else '' end ||
    case when [on_earth] is not null then 'on_earth, ' else '' end ||
    case when [state] is not null then 'state, ' else '' end ||
    case when [_city_id] is not null then '_city_id, ' else '' end ||
    case when [_neighborhood] is not null then '_neighborhood, ' else '' end ||
    case when [tags] is not null then 'tags, ' else '' end ||
    case when [complex_array] is not null then 'complex_array, ' else '' end ||
    case when [distinct_some_null] is not null then 'distinct_some_null, ' else '' end
  as columns,
  count(*) as num_rows
from
  [facetable]
group by
  columns
order by
  num_rows desc

Try that here.

This has the desired effect: it gives me back all of the combinations of not-null columns in the table, with a count for each one.

(Running this on a table with 1,000,000+ rows took about 40 seconds, so I had to use datasette data.db --setting sql_time_limit_ms 100000 to bump up the default time limit in Datasette.)

One remaining problem: how to generate the above query for an arbitrary table. I came up with the following SQL query for generating a SQL query like the above:

select 'select
' || group_concat('    case when [' || name || '] is not null then ' || quote(name || ', ') || ' else '''' end', ' ||
') || '
  as columns,
  count(*) as num_rows
from
  [' || :table || ']
group by
  columns
order by
  num_rows desc' as query from pragma_table_info(:table)

Try that out in a demo that includes the datasette-query-links plugin. This takes :table as an input and generates SQL which can be used to generate column-combination counts.

Created 2021-01-12T14:00:51-08:00, updated 2022-03-20T22:34:14-07:00 · History · Edit