Related content with SQLite FTS and a Datasette template function

Today I added "related TILs" to this TIL website - so each TIL now shows five related TILs at the bottom of the page.

I'm generating the related content using a SQLite full-text search query.

The related content query

I take the title and body text of an entry, strip out any non-alphanumeric characters, de-dupe the words, and then combine them to form a big OR search query.

I built an initial prototype of this using an Observable notebook that generated these queries from pasted text. See also my research issue.

Here's a simple version of that query, using some example words pulled from this entry. It executes against til_fts which is a SQLite FTS table created using the .enable_fts() method of the sqlite-utils Python library.

select title, rank from til_fts where til_fts match '
  i OR wanted OR to OR run OR some OR django OR tests OR using OR
  pytestdjango OR and OR with OR configured OR pick OR up OR the
  OR databaseurl OR environment OR variable OR via OR djdatabaseurl
  OR against OR a OR postgresql OR server OR running OR in OR
  github OR actions OR it OR took OR while OR figure OR out OR
  right OR pattern OR trick OR was OR define OR postgres OR service'
order by rank limit 5

Here are the results from that query. Unsurprisingly the entry itself shows up first, but the other items look relevant enough to me:

title rank
Running tests against PostgreSQL in a service container -61.04335068286244
Talking to a PostgreSQL service container from inside a Docker container -37.54518907167069
Allowing a container in Docker Desktop for Mac to talk to a PostgreSQL server on the host machine -29.712660785491842
Installing different PostgreSQL server versions in GitHub Actions -27.738649522063493
Docker Compose for Django development -25.027695483498224

The key trick here is the order by rank - SQLite FTS has a robust scoring mechanism built in, and it turns out it's good enough that simply feeding in all of the words from the entry and sorting by that rank provides good-enough related content results on its own.

Improving the query

In order to display related content I need to do two additional things: I need to filter out the current entry, and I need to join against the til table in order to retrieve the information I want to display.

Here's the finished query:

select
  til.topic, til.slug, til.title, til.created
from
  til
  join til_fts on til.rowid = til_fts.rowid
where
  til_fts match :words
  and not (
    til.slug = :slug
    and til.topic = :topic
  )
order by
  til_fts.rank
limit
  5

And an example of it running, which returns the following:

topic slug title created
github-actions service-containers-docker Talking to a PostgreSQL service container from inside a Docker container 2020-09-18T11:43:19-07:00
docker docker-for-mac-container-to-postgresql-on-host Allowing a container in Docker Desktop for Mac to talk to a PostgreSQL server on the host machine 2022-03-31T22:48:17-07:00
github-actions different-postgresql-versions Installing different PostgreSQL server versions in GitHub Actions 2021-07-05T17:43:13-07:00
docker docker-compose-for-django-development Docker Compose for Django development 2021-05-24T22:08:23-07:00
django just-with-django Using just with Django 2022-06-06T14:24:37-07:00

Implementing this as a Datasette template function

I decided to implement this using a custom Jinja template function, added using a Datasette one-off plugin.

The custom function is called related_tils(), and is called from the template like this:

{% set related = related_tils(til) %}
{% if related %}
  <h3>Related</h3>
  <ul class="related">
    {% for til in related %}
    <li><span class="topic">{{ til.topic }}</span> <a href="/{{ til.topic }}/{{ til.slug }}">{{ til.title }}</a> - {{ til.created[:10] }}</li>
    {% endfor %}
  </ul>
{% endif %}

Here's the implementation of that custom template function. It's registered using the extra_template_vars Datasette plugin hook (described here).

The function itself is an inner async def function, because it needs to be able to use await in order to execute its SQL query.

Jinja automatically awaits this kind of function when the template is rendered (see enable_async=True in the Jinja documentation).

So this function takes a TIL entry, creates a de-duped list of words from the title and body, uses that to construct and execute the SQL query and returns the results:

from datasette import hookimpl
import re

non_alphanumeric = re.compile(r"[^a-zA-Z0-9\s]")
multi_spaces = re.compile(r"\s+")


@hookimpl
def extra_template_vars(request, datasette):
    async def related_tils(til):
        text = til["title"] + " " + til["body"]
        text = non_alphanumeric.sub(" ", text)
        text = multi_spaces.sub(" ", text)
        words = list(set(text.lower().strip().split()))
        sql = """
        select
          til.topic, til.slug, til.title, til.created
        from
          til
          join til_fts on til.rowid = til_fts.rowid
        where
          til_fts match :words
          and not (
            til.slug = :slug
            and til.topic = :topic
          )
        order by
          til_fts.rank
        limit
          5
        """
        result = await datasette.get_database().execute(
            sql,
            {"words": " OR ".join(words), "slug": til["slug"], "topic": til["topic"]},
        )
        return result.rows

    return {
        "related_tils": related_tils,
    }

Here's the full code for that plugin, which also registers some other template functions.

Related

Created 2022-07-31T17:53:23-07:00, updated 2022-07-31T19:05:19-07:00 · History · Edit