The puzzle of tree-sitter parser maintenance and distribution
Lately I’ve been thinking about a problem which feels like an interesting governance one. (Yet another, yes!)
The recent abandonment of the nvim-treesitter repository made quite a few waves, so it feels like a fitting occasion to lay out this problem here.
Tree-sitter is a parser generator. You specify a grammar for a language (say, Java or CSS) and it generates pretty efficient parsing code for it in C. This parser can then be used in all sorts of applications, such as syntax highlighting in text editors (Neovim, Helix, Zed, Emacs…). It’s also used by GitHub to list function definitions in files, for instance. We also rely on it in Mergiraf, our resolution tool for git conflicts.
For each language that you want to parse with tree-sitter, you need a grammar (written in the format understood by tree-sitter), which often comes together with a “scanner” (custom lexing code written in C). Of course, such parsers require maintenance: for instance because the target language evolves, because bugs are found, to improve performance, or simply because tree-sitter itself evolves. So far, so good: each parser can just be an open-source project of its own, with people maintaining the parser, and each user simply depending on that software, right?
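To make this concrete, here is a minimal sketch of what such a grammar looks like, for a hypothetical toy language (the rule names are made up for illustration; the `grammar`, `seq`, `repeat` and `choice` helpers are provided by the tree-sitter CLI when it generates the C parser):

```javascript
// grammar.js — illustrative sketch only, not a real published grammar.
module.exports = grammar({
  name: "foolang",

  rules: {
    // A source file is a sequence of function definitions.
    source_file: $ => repeat($.function_definition),

    function_definition: $ => seq(
      "fn",
      $.identifier,
      "(", ")",
      $.block,
    ),

    block: $ => seq("{", repeat($.statement), "}"),

    statement: $ => choice(
      seq("return", $.identifier, ";"),
      seq($.identifier, ";"),
    ),

    identifier: $ => /[a-zA-Z_]+/,
  },
});
```

Running `tree-sitter generate` on a file like this produces the C parsing code, which is then compiled together with any custom scanner.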
Sadly, not really. Most tools which rely on tree-sitter parsers can’t just use them as plugins: they need additional configuration to give meaning to the syntax trees the parsers produce. For instance, an editor using tree-sitter for syntax highlighting needs to map node types (defined by the parser) to the colors and other styles used in the editor. Platforms like GitHub need to decide which nodes of the trees will be displayed as function definitions in their UI. Tools like Mergiraf need additional configuration to determine which nodes they can merge “commutatively” (meaning that their local order does not matter). This tool-specific “glue” also needs maintenance, as both the parser and the tool evolve.

To support those use cases, tree-sitter comes with a notion of “queries”: patterns that select nodes of syntax trees. To put it a bit simplistically, a text editor can use queries to define which nodes it will display in bold font. A lot of tools use such queries as the bridge between the parser and their own concepts, but the queries remain tool-specific, so they still require their own maintenance.
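As an illustration, a highlighting query file (conventionally named `highlights.scm`) might look like the following; the node types and capture names here are hypothetical, and which ones actually exist varies per parser and per tool:

```scheme
; highlights.scm — illustrative only.
; Capture the name of a function definition as a "function" highlight.
(function_definition
  name: (identifier) @function)

; Highlight string literals as strings.
(string_literal) @string
```

The tool then decides what a capture name like `@function` or `@string` means in its own terms, e.g. which color or font style to apply, which is exactly the tool-specific glue discussed above.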
On top of that, tools rely on different distribution formats for parsers. Some tools require their users to point them to a Git repository containing the source of the parser. Some others require bindings for the parser in a specific language (such as a Rust crate or a Python module), often uploaded to a package registry (Crates.io, PyPI, NPM). Others require a compiled version of the parser as a dynamic library or a WASM file.
The nvim-treesitter repository that was abandoned recently is one that stores this glue code between Neovim, a text editor, and hundreds of tree-sitter parsers. Maintaining this glue looks like pretty grueling work. One of its maintainers had been visibly strained by it for a while, and it looks like a conflict with a user was the last straw that pushed him to abandon ship, in the form of archiving the repository.
So the big question the tree-sitter ecosystem is struggling with right now is how to organize the maintenance of not just the parsers themselves but also the glue required by the tools using them. People are looking for alternatives to the “monorepo” approach of nvim-treesitter, for instance using one repository per language. Beyond choices of repository structure, the question I’m interested in is who does this work and how they organize themselves.
Motivations to maintain parsers
People get involved in tree-sitter parser/query maintenance for all sorts of reasons:
- folks who care about a specific parser, for a specific downstream use. For instance, a Neovim user who programs in PHP may want to get involved in the creation or maintenance of the PHP parser, because it translates to improvements to their work environment. They are knowledgeable about the programming language being parsed, perhaps less so about tree-sitter parser development or distribution.
- folks who care about a specific downstream tool, but no language in particular. For instance, I work on Mergiraf, which relies on a range of tree-sitter parsers. I have made contributions to (and maintain) various tree-sitter parsers, primarily driven by the motivation to make their use in Mergiraf easier. I do that for parsers of languages that I often don’t know at all.
- perhaps more rarely, people who care about a programming language and maintain a tree-sitter parser for it so that their language is well-supported in a range of software development tools. This happens mostly for relatively niche programming languages. They are obviously experts of their language, but might be less familiar with all of the downstream tools relying on tree-sitter parsers (which there is no authoritative list of).
- perhaps also less frequently, people who are excited by tree-sitter itself as a cool piece of tech and want to make great parsers with it. They might be more knowledgeable about parser optimization and the latest shiny stuff coming from the tree-sitter project itself. This includes tree-sitter maintainers working on flagship parsers, acting as showcases for tree-sitter.
In any case, the involvement of those contributors is generally relatively lightweight, driven by their specific motivation. Blame them for being selfish if you want, but maintaining a tree-sitter parser is not a particularly creative or fulfilling task. I would say it’s perhaps even a rather boring one that doesn’t give you a lot of open-source “street credibility” or employability, as far as I can tell.
Failure of collaboration in parser maintenance
Can we aggregate all this passing interest from people with varying motivations to maintain an ecosystem of parsers and queries?
It generally does not work out very well. What often happens is something along the lines of:
- Jane creates a parser for FooLang because she wants to use it in Emacs. She spends quite some time doing this - making a parser from scratch is no easy feat! She writes the queries that work for her with her use of Emacs. She publishes her parser as a GitHub repository, under her own account.
- Three months later, Barnaby opens an issue asking if she could upload her parser to NPM, so that he can use it more easily in his Node.js project. Jane doesn’t have an NPM account, so it’s a bit annoying for her to do so (and to commit to publishing future releases there as well). She declines.
- Barnaby forks the project and sets up a pipeline to publish it to NPM under his own account.
- Two months later, Eliott opens a PR to fix a bug on Jane’s repository, as he found a case where a valid construct of FooLang isn’t parsed correctly. Jane happily accepts, updates the queries accordingly and publishes a new version. Barnaby’s fork doesn’t get the improvement.
- Four months later, Jessica opens a PR to Jane’s repo, to add support for a new feature of FooLang. Jane has switched to VSCode so she doesn’t care so much about this parser anymore and doesn’t review the PR.
You get the idea: the end state is a network of forks, some of which are distributed in some package registries, some benefiting from various bug fixes in the parser, and, down the line, a lot of duplication of work.
Community Package Maintenance Organizations
Couldn’t those people get together in one GitHub organization and maintain parsers and queries collaboratively? There is tree-sitter-grammars, a GitHub organization which is somewhat close to that. But:
- it does not accept new parsers, nor does it seem to accept new members, as far as I can tell. I also could not find any written governance for it;
- while there is some effort to have all parsers follow a set of guidelines, there are still big discrepancies between the parsers in terms of the package registries they get published to or the queries they include, for instance.
Since I wasn’t able to join this organization, I created a similar one called grammar-orchard (on Codeberg), hosting a slowly-growing collection of parsers. There is a lot more I should do: improve the organization itself, improve the contribution and onboarding process, advertise the grammars to end users, distribute the parsers in more formats. But it has the merit of existing and of having already enabled nice collaborations. For instance, our Java parser has received sustained contributions from two contributors who reviewed each other’s patches after I onboarded them, while the upstream repository doesn’t seem to have anyone available to review external PRs. Similarly, I collaborated with a Rust team member on our Rust parser, leading to a lot of improvements to the parser (which is in turn used to “fuzz” the Rust compiler to avoid crashes). If you are interested in fostering collaboration around your tree-sitter parser, consider joining us and moving your parser there!
Both of those organizations can be seen as “Community Package Maintenance Organizations”, a concept coined by my colleague Théo Zimmermann. He has studied many of those organizations, trying to understand why people start them and what makes them successful. I definitely need to integrate a lot of the lessons learned from his research into the Grammar Orchard!
That being said, neither of those organizations claims to solve the problem of maintaining the “glue” between parsers and their reusers (which is what the tree-sitter community is currently most concerned with).
I don’t know whether such organizations could be extended to also maintain queries or other configuration glue for selected end-user tools. It could be possible to onboard people who have a specific interest in those tools, and who would help maintain the queries directly in the repositories where the parsers are developed. I don’t know to what extent this can work, but it could be worth trying.
There is of course also the question of to what extent that configuration glue can be shared between tools. The fact that those tools have different feature sets, design choices and target audiences gives me little hope that sufficient normalization can happen for this to really bear fruit.
Comments can go to the associated Mastodon post as usual :)