Here's a trick I've used a couple of times in the past few days.
If I have a question about a codebase and I want to answer it using an LLM, I'll use ripgrep (aka rg) to search that codebase for a specific symbol and output the 5-10 lines of code that follow that symbol.
Then I'll pipe the result to my llm CLI tool, running through a large context model such as Claude or gpt-4-turbo, and ask it a question about the code.
I wanted to figure out what "read ribbons" are in the source code for https://lobste.rs/ - which I checked out from GitHub like this:
cd /tmp
git clone https://github.com/lobsters/lobsters
cd lobstersHere's the recipe I use, explained in full below:
rg -I -A 5 read_ribbons | \
llm --system 'explain what read ribbons are' -m gpt-4-turboWhich gave me:
Based on the SQL queries and the schema definitions you've provided, it seems "read ribbons" is a concept related to tracking user interactions with stories or posts in a social platform or content management system, possibly a forum, news aggregator, or a blog commenting system.
Here's an explanation of key components related to read ribbons and their functionality as inferred from the provided SQL snippets:
Purpose: Read ribbons seem to serve the purpose of tracking whether users have read or followed specific stories or posts. This feature is likely used to manage and display unread content to users, enhancing their experience by making it easier to catch up on what's new or changed since their last visit.
Database Structure:
- The
read_ribbonstable has a few important fields:is_following(a boolean indicating if a user is following a story),created_atandupdated_at(timestamps for tracking when the read ribbon was created or last updated),user_id, andstory_id(foreign keys linking to the users and stories tables respectively). It essentially creates a many-to-many relationship between users and stories, with the additional context about following status and read status.- There are indices on
user_idandstory_idfor performance optimization, ensuring that queries to find which stories a user is following (or who is following a story) are efficient.Functionality:
- Queries involving the
read_ribbonstable seem to be focused on fetching unread comments for the user, distinguishing between unread and read comments based on theupdated_attimestamp in comparison to thecreated_attimestamp of comments.- The concept of "following" a story is important here. A user appears to automatically follow stories they interact with (e.g., through commenting), and their read ribbons are updated when they read new comments to ensure accurate tracking of what they've seen.
- Additional conditions in the queries include checks for comment and story scores (to filter out negatively scored items), comment moderation status, and ensuring that users do not see their comments as unread.
Actions: The operations performed with read ribbons include updating the
updated_attimestamp when a user views new content, and filtering comments and stories to present users with relevant, unread discussions. Additionally, after certain actions like viewing unread replies, corresponding "after actions" (such asupdate_read_ribbons) are triggered to keep the read ribbon statuses up to date.To sum up, read ribbons are a feature used for managing and tracking user engagement with stories or posts, distinguishing between read and unread content to improve user experience on platforms where content is frequently updated or added.
ripgrep is a lightning fast grep-style tool. If you give it a search term it will find that term in every file (including in subdirectories) of your current directory:
rg read_ribbonsdb/views/replying_comments_v01.sql
2: read_ribbons.user_id,
4: read_ribbons.story_id as story_id,
10: (read_ribbons.updated_at < comments.created_at) as is_unread
12: read_ribbons
14: comments ON comments.story_id = read_ribbons.story_id
20: read_ribbons.is_following = 1
21: AND comments.user_id != read_ribbons.user_id
23: (parent_comments.user_id = read_ribbons.user_id
25: AND stories.user_id = read_ribbons.user_id))
db/schema.rb
177: create_table "read_ribbons", id: { type: :bigint, unsigned: true }, charset: "utf8mb4", collation: "utf8mb4_general_ci", force: :cascade do |t|
183: t.index ["story_id"], name: "index_read_ribbons_on_story_id"
184: t.index ["user_id"], name: "index_read_ribbons_on_user_id"
...
For the purposes of feeding code into a large language model those line numbers are just wasted tokens. Using -N removes them:
rg -N read_ribbonsdb/views/replying_comments_v01.sql
read_ribbons.user_id,
read_ribbons.story_id as story_id,
...
The first time I wrote up this tip I had a nasty shock when I found that piping content from rg outputs it in a different format to if you send it to the CLI. That's illustrated by this:
rg -N read_ribbons | catdb/views/replying_comments_v01.sql: read_ribbons.user_id,
db/views/replying_comments_v01.sql: read_ribbons.story_id as story_id,
db/views/replying_comments_v01.sql: (read_ribbons.updated_at < comments.created_at) as is_unread
db/views/replying_comments_v01.sql: read_ribbons
...
Those filenames are even more wasted tokens! Adding the -I/--no-filename option fixes that problem:
rg -IN read_ribbons | catAnd it turns out rg -I has the same result as rg -IN, at least for piped content (checked using | cat).
I want a bit more context. I've found that for this purpose returning the 5 or 10 lines following a match works great, which can be achieved using -A 5 (for --after-context). rg will merge together overlapping sections:
rg -I -A 5 read_ribbonsThat produces output like this - no line numbers, no filenames, and a -- between each match:
...
--
after_action :update_read_ribbons, only: [:unread]
after_action :clear_unread_replies_cache, only: [:comments, :stories]
after_action :zero_unread_replies_cache, only: [:all, :unread]
def all
@title = "All Your Replies"
--
def update_read_ribbons
story_ids = @replies.pluck(:story_id).uniq
ReadRibbon
.where(user_id: @user.id, story_id: story_ids)
.update_all(updated_at: Time.current)
end
Tokens cost money, so I ran a check to see the token count by piping through my ttok tool:
rg -I -A 5 read_ribbons | ttokIt replied 4312 which is fine - GPT-4 Turbo is currently priced at 1c per 1,000 input tokens.
The -C 5 gives 5 lines before the match and 5 lines after. -B 5 is just 5 lines before.
My llm tool accepts piped content, and can also take a --system prompt (for models that support it) providing instructions about what to do with that content. Hence this final recipe:
rg -I -A 5 read_ribbons | \
llm --system 'explain what read ribbons are' -m gpt-4-turboThe -m gpt-4-turbo option can be shortened to -m 4t.
You can see the full prompt and response here, extracted using llm logs -c | pbcopy.
Created 2024-02-11T14:48:24-08:00, updated 2024-02-11T15:31:27-08:00 · History · Edit