tree-sitter is a "parser generator tool and an incremental parsing library". It has a very good reputation these days.
Two useful posts by Douglas Creager: Getting started with tree-sitter and A map of the tree-sitter ecosystem.
I want to be able to parse SQLite SQL - in particular I want to be able to parse CREATE TABLE
statements, because SQLite stores those directly in its sqlite_master
metadata table as the main source of truth about a table, and I want to be able to introspect them beyond what's possible with pragma table_info()
and friends.
It turns out there are tree-sitter grammars for a huge array of languages, including one for SQLite SQL hosted at github.com/dhcmrlchtdj/tree-sitter-sqlite.
tree-sitter grammars need to be compiled - they generate C code, which should then be compiled into a .so
library file.
py-tree-sitter provides Python bindings for tree-sitter that can both work with compiled grammars AND manage the compilation process.
Here's how I compiled the SQLite grammar:
git clone https://github.com/dhcmrlchtdj/tree-sitter-sqlite
pipenv shell
python -m pip install tree_sitter
python
>>> from tree_sitter import Language, Parser
>>> Language.build_library('/tmp/sqlite.so', ['/tmp/tree-sitter-sqlite'])
True
This gave me a /tmp/sqlite.so
file - 1.1MB in size.
Here's how to use that from Python:
from tree_sitter import Language, Parser
language = Language('/tmp/sqlite.so', 'sqlite')
parser = Parser()
parser.set_language(language)
sql = b"""CREATE TABLE _datasette_auth_tokens (
id INTEGER PRIMARY KEY,
secret TEXT,
description TEXT,
permissions TEXT,
actor_id TEXT,
created_timestamp INTEGER,
last_used_timestamp INTEGER,
expires_after_seconds INTEGER
);"""
tree = parser.parse(sql)
print(tree.root_node.sexp())
Which outputs:
(sql_stmt_list (sql_stmt (create_table_stmt (CREATE) (TABLE) (identifier) (column_def (identifier) (type_name (identifier)) (column_constraint (PRIMARY) (KEY))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))))))
I explored the tree structure a bit in Python as well:
>>> n = tree.root_node
>>> dir(n)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'child_by_field_id', 'child_by_field_name', 'child_count', 'children', 'children_by_field_id', 'children_by_field_name', 'end_byte', 'end_point', 'field_name_for_child', 'has_changes', 'has_error', 'id', 'is_missing', 'is_named', 'named_child_count', 'named_children', 'next_named_sibling', 'next_sibling', 'parent', 'prev_named_sibling', 'prev_sibling', 'sexp', 'start_byte', 'start_point', 'text', 'type', 'walk']
>>> n.text
b'CREATE TABLE _datasette_auth_tokens (\n id INTEGER PRIMARY KEY,\n secret TEXT,\n description TEXT,\n permissions TEXT,\n actor_id TEXT,\n created_timestamp INTEGER,\n last_used_timestamp INTEGER,\n expires_after_seconds INTEGER\n);'
>>> n.children
[<Node type=sql_stmt, start_point=(0, 0), end_point=(9, 1)>, <Node type=";", start_point=(9, 1), end_point=(9, 2)>]
>>> n.children[0]
<Node type=sql_stmt, start_point=(0, 0), end_point=(9, 1)>
>>> n.children[0].children
[<Node type=create_table_stmt, start_point=(0, 0), end_point=(9, 1)>]
>>> n.children[0].children[0]
<Node type=create_table_stmt, start_point=(0, 0), end_point=(9, 1)>
>>> n.children[0].children[0].children
[<Node type=CREATE, start_point=(0, 0), end_point=(0, 6)>,
<Node type=TABLE, start_point=(0, 7), end_point=(0, 12)>,
<Node type=identifier, start_point=(0, 13), end_point=(0, 35)>,
<Node type="(", start_point=(0, 36), end_point=(0, 37)>,
<Node type=column_def, start_point=(1, 3), end_point=(1, 25)>,
<Node type=",", start_point=(1, 25), end_point=(1, 26)>,
<Node type=column_def, start_point=(2, 3), end_point=(2, 14)>,
<Node type=",", start_point=(2, 14), end_point=(2, 15)>,
<Node type=column_def, start_point=(3, 3), end_point=(3, 19)>,
<Node type=",", start_point=(3, 19), end_point=(3, 20)>,
<Node type=column_def, start_point=(4, 3), end_point=(4, 19)>,
<Node type=",", start_point=(4, 19), end_point=(4, 20)>,
<Node type=column_def, start_point=(5, 3), end_point=(5, 16)>,
<Node type=",", start_point=(5, 16), end_point=(5, 17)>,
<Node type=column_def, start_point=(6, 3), end_point=(6, 28)>,
<Node type=",", start_point=(6, 28), end_point=(6, 29)>,
<Node type=column_def, start_point=(7, 3), end_point=(7, 30)>,
<Node type=",", start_point=(7, 30), end_point=(7, 31)>,
<Node type=column_def, start_point=(8, 3), end_point=(8, 32)>,
<Node type=")", start_point=(9, 0), end_point=(9, 1)>]
>>> n.children[0].children[0].text
b'CREATE TABLE _datasette_auth_tokens (\n id INTEGER PRIMARY KEY,\n secret TEXT,\n description TEXT,\n permissions TEXT,\n actor_id TEXT,\n created_timestamp INTEGER,\n last_used_timestamp INTEGER,\n expires_after_seconds INTEGER\n)'
I've only just started exploring tree-sitter - the Python documentation has more details on ways to walk the tree, plus a description of tree-sitter's pattern matching language which looks like it may be the key to using it effectively.
grantjenks/py-tree-sitter-languages is a handy Python package that bundles compiled versions of a bunch of grammars for different architectures, as published wheel files:
https://pypi.org/project/tree-sitter-languages/#files
This means you can install that package rather than compiling them yourself:
pip install tree-sitter-languages
I had a PR accepted adding the SQLite grammar. Now you can do this:
from tree_sitter_languages import get_parser
parser = get_parser("sqlite")
tree = parser.parse(b"""CREATE TABLE [replicate_predictions] (
[id] TEXT PRIMARY KEY,
[_model_guess] TEXT,
[completed_at] TEXT,
[created_at] TEXT,
[error] TEXT,
[input] TEXT,
[logs] TEXT,
[metrics] TEXT,
[output] TEXT,
[started_at] TEXT,
[status] TEXT,
[urls] TEXT,
[version] TEXT,
[webhook_completed] TEXT
)""")
print(tree.root_node.sexp())
Output:
(sql_stmt_list (sql_stmt (create_table_stmt (CREATE) (TABLE) (identifier) (column_def (identifier) (type_name (identifier)) (column_constraint (PRIMARY) (KEY))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))) (column_def (identifier) (type_name (identifier))))))
Created 2023-07-13T22:49:08-07:00, updated 2023-07-20T11:55:52-07:00 · History · Edit