Contribution experience report: Git

Improving the experience of new contributors in OpenRefine or other projects I maintain is an important topic for me. I think that whether a contributor stays active in a project depends a lot on the experience they have during their first contact. I would generally like to have more feedback about the hurdles people face when trying to contribute to projects where I am active. So I am starting to document my own experience when contributing to other projects. This feels like a useful way to take note of things I would like to take inspiration from. Perhaps it is of interest to those projects too. Of course, I don’t claim that this experience is representative. Also, I acknowledge that not every open source project is necessarily seeking new contributors: it can be a deliberate choice not to invest any energy in onboarding people or to refuse any external contributions, for various reasons. It’s not what I wish for the projects I am involved in, though.

Let’s start with a first report about contributing to the Git project.

My motivation to contribute

In short:

Here is the longer version. I currently spend quite some time merging or rebasing things in OpenRefine and I thought it would be wise to invest in a bit of tooling to ease that. I discovered that git makes it possible to define custom merge drivers. This lets users change the algorithm used to merge two diverging versions of a given file together. You can provide your own executable file whose task is, given the two diverging versions and the common base version of a file, to compute the merged file (possibly with merge conflict markers in it).
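To make the driver contract more concrete, here is a minimal sketch in Python. It assumes the driver is registered through a `driver = ... %O %A %B` line in the Git configuration and enabled for some paths with a `merge=` attribute in `.gitattributes`. The driver name `naive`, the script name `naive_driver.py` and the whole-file merge rule are all hypothetical and only there for illustration. Git expects the driver to overwrite the file holding the current version (`%A`) with the result and to exit with a non-zero status when conflicts remain.

```python
#!/usr/bin/env python3
"""Toy merge driver: resolves a file only when at most one side changed it.

Hypothetical registration (all names are placeholders):

    # .git/config
    [merge "naive"]
        driver = python3 naive_driver.py %O %A %B

    # .gitattributes
    *.txt merge=naive
"""
import sys


def _with_newline(text):
    """Make sure a non-empty text ends with a newline before adding markers."""
    return text if text.endswith("\n") or not text else text + "\n"


def merge_whole_file(base, ours, theirs):
    """Return (merged_text, has_conflict) using a whole-file three-way rule."""
    if ours == theirs or theirs == base:
        # Both sides agree, or only our side changed: keep our version.
        return ours, False
    if ours == base:
        # Only their side changed: take their version.
        return theirs, False
    # Both sides changed differently: wrap both versions in conflict markers.
    merged = (
        "<<<<<<< ours\n" + _with_newline(ours)
        + "=======\n" + _with_newline(theirs)
        + ">>>>>>> theirs\n"
    )
    return merged, True


def main(argv):
    # Argument order follows the assumed "%O %A %B" command line.
    base_path, ours_path, theirs_path = argv[1:4]
    texts = []
    for path in (base_path, ours_path, theirs_path):
        with open(path, encoding="utf-8") as handle:
            texts.append(handle.read())
    merged, has_conflict = merge_whole_file(*texts)
    # The result must replace the contents of the current version's file.
    with open(ours_path, "w", encoding="utf-8") as handle:
        handle.write(merged)
    return 1 if has_conflict else 0


# Only act when the three paths were actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) >= 4:
    sys.exit(main(sys.argv))
```

A real driver would of course merge at a finer granularity than the whole file; the point here is only the shape of the interface: three input paths, one file overwritten, and an exit status that signals conflicts.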

When writing my own merge driver, I wanted to use git’s own algorithms as a starting point, and thankfully this is possible via the git merge-file command, which is essentially a command-line interface to the natively available merge drivers in git. However, I quickly noticed that this interface would sometimes give worse results than what I would get when letting git merge use its default merge driver. This is because the newer algorithms developed for git merge hadn’t been made available to the lesser-known git merge-file command.
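As a sketch of that starting point, a driver can delegate the textual merge to `git merge-file` and post-process the result. With `-p`, the command prints the merged text to standard output instead of overwriting the current file, and its exit status is the number of conflicts (or a negative value on error, which shows up as a status above 127 for the calling process). The function names below are hypothetical, and the snippet assumes `git` is available on the `PATH`.

```python
import subprocess


def merge_file_command(current, base, other):
    """Build the argument list for a three-way merge printed to stdout.

    Note the order: git merge-file takes <current> <base> <other>, which
    differs from the usual %O %A %B order of merge driver placeholders.
    """
    return [
        "git", "merge-file", "-p",
        "-L", "ours", "-L", "base", "-L", "theirs",  # labels for the markers
        current, base, other,
    ]


def run_builtin_merge(current, base, other):
    """Return (merged_text, conflict_count) as computed by git merge-file."""
    result = subprocess.run(
        merge_file_command(current, base, other),
        capture_output=True,
        text=True,
    )
    if result.returncode > 127:
        # Negative exit values from git appear as large statuses here.
        raise RuntimeError(result.stderr.strip() or "git merge-file failed")
    return result.stdout, result.returncode
```

A custom driver could call `run_builtin_merge`, apply its own clean-up to the returned text, write it to the current file, and exit with a non-zero status if the conflict count is not zero.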

So it felt like an exciting opportunity to try and make it possible to use those newer algorithms in git merge-file. Knowing the lineage and success of the git project, it felt quite daunting but also very exciting to submit a contribution there. That added to the motivation.

First contact with the project

Technically this wasn’t my first contact with the project as a contributor: I can’t resist bragging about the fact that I had already got a commit in 7 years earlier. Granted, it was only adding a single letter to a translation file, but you know, it’s still a commit.

Although there is an official git project on GitHub, it’s not the place where the discussions take place. The real deal is the pretty scary official mailing list, where everything happens: project discussion, but also code review because contributions are sent as patches directly on the mailing list.

I found the mailing list “scary” because it looked like a big forest of patches, mentioning a lot of things I had never heard of. I was vaguely aware that people have pretty specific customs about how to behave and how to format messages there. I was pretty sure I was going to fail to observe those traditions, outing me as a clueless noob, or maybe even an annoying spammer if I messed up really badly.

So because it looked a bit daunting, I didn’t even try to discuss my ideas with the project before implementing them. I think it’s generally much better to take the time to discuss beforehand, because it can save you a lot of time if the maintainers disagree with your approach. But given that my change felt pretty straightforward, I wanted to try to just craft the perfect patch directly. It would be just so clearly right and good that it would get waved through the reviewing pipeline and released to the world. Wooooosh!

Development environment

For me it’s usually a bit of a chore to install and get used to a new development environment required by a project. But in this case, it was like a visit to an adorable curiosity cabinet.

Git itself has so few dependencies that I already had basically everything I needed on my Debian machine. I ran make and iteratively installed some missing libraries or tools via APT. That was it, it compiled. To make changes to the code, I just used vanilla Vim.

No need for the latest version of that fancy package manager that you first need to install by piping a curl into a sudo. No need for virtual environments, long Docker pulls, or anything like that. It felt like a different universe. Surprisingly slick and enjoyable.

Finding my way into the code base

That was a little trickier. I used my standard technique of searching through all files in the repository with grep, gradually backtracking from the command line options to the internal data structures used to represent the configuration of diff algorithms, through the various interfaces. I did write C code in the past, so I wasn’t completely new to its oddities, but still, trying to figure out how to properly use a macro or how to set a flag with bitwise operators felt like quite an adventure.
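For readers who have not met that idiom: options are packed as individual bits of a single integer, and a mask groups the bits that are mutually exclusive. Here is a rough sketch, written in Python rather than C, with made-up flag names and bit positions that are not Git’s actual definitions.

```python
# Hypothetical flag layout, for illustration only.
IGNORE_WHITESPACE = 1 << 1
ALGO_PATIENCE = 1 << 14
ALGO_HISTOGRAM = 1 << 15
# Mask covering every bit that encodes the choice of algorithm.
ALGO_MASK = ALGO_PATIENCE | ALGO_HISTOGRAM


def select_algorithm(flags, algorithm):
    """Clear any previously selected algorithm bit, then set the new one."""
    return (flags & ~ALGO_MASK) | algorithm


def has_flag(flags, flag):
    """Test whether a given bit is set."""
    return (flags & flag) != 0


flags = IGNORE_WHITESPACE | ALGO_PATIENCE
flags = select_algorithm(flags, ALGO_HISTOGRAM)
```

After the last line, the whitespace bit is untouched, the patience bit is cleared and the histogram bit is set. Clearing with the mask first is what prevents two algorithms from being selected at once.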

To continue with the curiosity cabinet: a lot of the source files are just at the root of the repository, not even in a folder. Being now used to deep folder structures from the Java world, I find that hilarious (in an enjoyable way). Also, I thought it was the norm to use preprocessor statements to make include guards preventing double includes, but apparently not in this code base. I had no idea it was manageable to do without for a project of that scale. (There are indeed include guards, not sure how I missed them. Thanks to Andrew Clayton for pointing out my mistake.)

Reviewing experience

To submit my changes, I could have tried to use git send-email, but it felt much easier to make a pull request to the GitHub repository as that’s a workflow I am used to. The GitGitGadget tool is a sort of GitHub bot that takes care of translating such pull requests to patches of the expected format and sends those to the mailing list. For my previous pull request, I had used submitGit, a similar tool that was in use at the time (it does not seem to work anymore).

The timeline of my patch was as follows:

The communication during the review process was very friendly and helpful. At some point, I added the metadata “Reviewed-by: Philipp Wood” (with email address) to my commit with the intention of acknowledging his reviewing effort, but it turned out to be a faux pas: I guess it implies that the person approved the changes, while at this point he had not. This breach of etiquette was pointed out to me in a most excellent way.

Testing infrastructure

Git’s test suite is written as a collection of bash scripts which test it via its command-line interface. Given that Git’s command-line interface is its canonical user interface, those are de facto end-to-end tests. There are ongoing efforts to introduce unit tests written in C, but it looks like this is still not ready for contributors to adopt.

Tests being bash scripts, they can just be run as such, which is again quite convenient: no need to learn a testing framework, just run the bash file that contains the test you care about.

On top of that, the GitHub repository comes with a collection of pull request checks which run this test suite and a lot of other things. That was also super convenient to make sure I did not break anything with my changes, without having to learn how to invoke those checks on my own (and without having to provide the computation resources for it). I have no idea how the maintainers work with this test suite since they are not using GitHub (or any other forge, as far as I know), but as an external contributor, it is useful.

Code formatting

My editor inserted spaces instead of tabs for indentation, which conflicted with the project’s style. I think this was caught either by git itself, by highlighting the whitespace in the diff view, or by the pull request checks on GitHub. The pull request checks might also have brought up other issues that I don’t remember: in any case, it was simple enough to fix those as a reaction to that.
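This kind of mistake is easy to detect mechanically. The sketch below is a rough approximation in the spirit of Git’s `indent-with-non-tab` whitespace check, not the check the project actually runs: it reports lines whose leading whitespace contains a run of spaces at least as wide as a tab. The function name and the default tab width of 8 are assumptions.

```python
def lines_indented_with_spaces(text, tab_width=8):
    """Return 1-based numbers of lines indented with spaces instead of tabs."""
    offending = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Isolate the leading run of spaces and tabs.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        # A full tab width of consecutive spaces should have been a tab.
        if " " * tab_width in indent:
            offending.append(number)
    return offending


sample = (
    "int main(void)\n"
    "{\n"
    "        return 0;\n"  # eight spaces: flagged
    "\treturn 1;\n"        # tab: fine
    " * comment line\n"    # a single space for alignment: fine
    "}\n"
)
```

Calling `lines_indented_with_spaces(sample)` flags only the third line, since it is the only one indented with a full tab width of spaces.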

Governance and roadmap

As I understand it, this project is structured around a central maintainer, Junio C Hamano, who has the final responsibility for accepting patches and leading the release process. Other contributors (such as Philipp Wood in my case) seem to help with reviewing, but I don’t know to what extent this role is formalized. The current maintainer seems to have been designated by Linus Torvalds directly, so I guess the understanding is that he remains the only one ultimately in charge until he designates someone else. This is all guesswork on my part without having looked into it at all, just based on my existing interactions with the project. This rather blurry governance (from my perspective) was not an obstacle to me, beyond the need to find someone to review my changes in the first place (which happened by itself).

I had a look at the minutes from the Git Contributor’s Summit 2023, which I found very interesting. They give a sense of where the project is heading and show that people are interested in easing the onboarding of newcomers.

Would I contribute again?

Absolutely. It would likely be focused on something I need myself though - I don’t really see myself contributing to git for its own sake, because I think there are a lot of people in big tech companies who are in a better position to do so and I think it’s right that those companies contribute back.