Today I built a system for monitoring Reddit for new posts that link to various domains that I own.
I modelled it on my existing scrape-hacker-news-by-domain project, which I described in some detail in this blog post.
The goal of the Reddit scraper was to run a search for new posts that link to simonwillison.net or datasette.io, using the unofficial search API - the Reddit search page with .json appended to the URL.
Not much to say about the scraper. It looks like this:
curl -H "$USER_AGENT" \
'https://www.reddit.com/search/.json?q=site%3Asimonwillison.net&sort=new' \
| jq > simonwillison-net.json
Where USER_AGENT is set like this, because Reddit doesn't like curl hitting it with the default agent:
USER_AGENT='User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
The HTML version of this page is https://www.reddit.com/search/?q=site%3Asimonwillison.net&sort=new
It then repeats the same thing for datasette.io and commits the results. The code lives in this scheduled GitHub Actions file: https://github.com/simonw/scrape-reddit-by-domain/blob/main/.github/workflows/scrape.yml
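Since each domain is just a different q= parameter, the pattern generalizes to a loop. Here's a minimal sketch of that - the loop and output filenames are my own illustration, not copied from the linked workflow:
#!/bin/bash
# Illustrative sketch: fetch recent Reddit posts for each domain
USER_AGENT='User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'

for domain in simonwillison.net datasette.io
do
  curl -H "$USER_AGENT" \
    "https://www.reddit.com/search/.json?q=site%3A${domain}&sort=new" \
    | jq > "${domain//./-}.json"
done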
I created myself a Datasette signed API token with full permissions that would expire after five minutes, using the /-/create-token interface.
I set that as an environment variable like so:
export DS_TOKEN='dstok_...'
Then I ran this to create the table and populate it with some initial data:
cat simonwillison-net.json | jq '{
table: "reddit_posts",
rows: [.data.children[].data | {
id,
subreddit,
title,
url,
ups,
num_comments,
created_utc: (.created_utc | todateiso8601),
subreddit_subscribers,
domain,
permalink: ("https://www.reddit.com" + .permalink)
}],
pk: "id"
}' | \
curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer $DS_TOKEN" \
-d @- https://simon.datasette.cloud/data/-/create
This uses jq to extract the bits of the JSON I care about and reformat them into a smaller set of columns. It constructs a JSON document that matches the format expected by the /-/create API, documented here.
(I used ChatGPT to figure out that jq recipe, in particular the | todateiso8601 part.)
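That filter converts Reddit's created_utc seconds-since-epoch numbers into ISO 8601 strings. A quick way to try it on its own (the timestamp here is just an illustrative value):
jq -n '1678752000 | todateiso8601'
"2023-03-14T00:00:00Z"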
It pipes the resulting JSON to curl and makes an authenticated POST request to the /-/create API. This created the reddit_posts table and populated it with the initial data.
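For illustration, the document this pipeline POSTs to /-/create is shaped like this - the structure comes straight from the jq program above, but the row values here are invented:
{
  "table": "reddit_posts",
  "rows": [
    {
      "id": "11qnzz0",
      "subreddit": "programming",
      "title": "Example post title",
      "url": "https://simonwillison.net/2023/example/",
      "ups": 10,
      "num_comments": 3,
      "created_utc": "2023-03-14T00:00:00Z",
      "subreddit_subscribers": 5000000,
      "domain": "simonwillison.net",
      "permalink": "https://www.reddit.com/r/programming/comments/11qnzz0/example/"
    }
  ],
  "pk": "id"
}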
With the table created, I could now create a long-lived API token that would allow me to write to the table. I used the /-/create-token interface again, but this time I created a token that would never expire, restricted to write and alter permissions on the new reddit_posts table I had just created.
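One way to sanity-check a token like this - assuming the standard Datasette /-/actor.json debugging endpoint is exposed on Datasette Cloud - is to make an authenticated request and confirm the actor it comes back with:
curl -H "Authorization: Bearer $DS_TOKEN" \
  https://simon.datasette.cloud/-/actor.json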
I wrote the following script to submit all rows from all JSON files in the current directory:
#!/bin/bash
for file in *.json
do
cat "$file" | jq '{
rows: [.data.children[].data | {
id,
subreddit,
title,
url,
ups,
num_comments,
created_utc: (.created_utc | todateiso8601),
subreddit_subscribers,
domain,
permalink: ("https://www.reddit.com" + .permalink)
}],
replace: true
}' | \
curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer $DS_TOKEN" \
-d @- https://simon.datasette.cloud/data/reddit_posts/-/insert
done
This uses the /-/insert API instead, which is a little different from the /-/create API. It doesn't take a table argument - instead it expects the table name to be part of the URL.
I also send replace: true - this ensures that any existing rows with a primary key that matches incoming data will be replaced by the updated version.
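Which means the request body for this table is shaped like this (row abbreviated, values invented) - note there is no table key, and replace: true sits alongside rows:
{
  "rows": [
    {
      "id": "11qnzz0",
      "subreddit": "programming",
      "title": "Example post title"
    }
  ],
  "replace": true
}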
I ran chmod 755 submit-to-datasette-cloud.sh and added it to the GitHub repository.
Having set the DS_TOKEN secret for my repository, I added the following to the scrape.yml file:
- name: Submit latest to Datasette Cloud
env:
DS_TOKEN: ${{ secrets.DS_TOKEN }}
run: |-
./submit-to-datasette-cloud.sh
Now every time the workflow runs (once an hour), any new records will be submitted up to Datasette Cloud.
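The hourly cadence comes from the workflow's cron trigger. A minimal sketch of what that trigger looks like - not copied from the linked scrape.yml:
on:
  schedule:
    - cron: '0 * * * *'  # top of every hour
  workflow_dispatch:     # also allow manual runs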
I used the datasette-public plugin to make the reddit_posts table public. You can see that here:
https://simon.datasette.cloud/data/reddit_posts
Then I used the datasette-write plugin to create a SQL view defined like this:
create view reddit_posts_atom as select
'reddit:' || id as atom_id,
title as atom_title,
permalink as atom_link,
created_utc as atom_updated,
'<a href="' || url || '">' || url || '</a><br><br>'
|| 'Subreddit: ' || subreddit || '<br>'
|| 'Comments: ' || num_comments || '<br>'
|| 'Upvotes: ' || ups || '<br>'
|| 'Subscribers: ' || subreddit_subscribers as atom_content_html
from
reddit_posts
order by
created_utc desc
limit
100;
This follows the conventions required by the datasette-atom plugin.
I made that view public too, and now you can visit it here:
https://simon.datasette.cloud/data/reddit_posts_atom
Or subscribe to the Atom feed here:
https://simon.datasette.cloud/data/reddit_posts_atom.atom
Created 2023-03-13T22:45:43-07:00