In OpenRefine we still haven’t got ourselves a clear, public roadmap indicating what we are working towards in the near future. There are various proposals as to how we should get ourselves one.
So I thought I would at least write down my own priorities and paint a picture of my vision for the project. I don’t claim that this should be OpenRefine’s roadmap: there are plenty of other efforts I would happily support, but which I am less interested in working on myself. Other team members have their own agendas and my own priorities are not set in stone either.
Here is a short summary of my personal goals:
- Grow the team and set up the project so that it runs sustainably
- Reproducibility
- Reconciliation
- Extensibility
Goal 1: Grow the team and set up the project so that it runs sustainably
This goal is not about which features OpenRefine should have, nor about software development per se, so maybe it’s not what you expected to read first. But it is something I dedicate a good chunk of my time to.
The bottom line is that I won’t be around forever in this project and I would be really happy to leave it in a state where I would be reasonably confident that an enthusiastic, talented and friendly team would keep it running for a while after my departure.
Although I have this as a personal goal, it’s also clear I am doing a lot of things wrong in this regard. I have definitely been learning quite a lot on this topic for the past few years, also because of my involvement in Kanthaus, a house project which shares similar challenges, with the constant need to onboard more people to keep the community active and dynamic.
Goal 2: Reproducibility
One of the things that I really liked when I first discovered OpenRefine was the ability to extract the history and replay it on a new version of the dataset. It’s such a great idea and could potentially be so useful! Sadly, it is far from reliable, so I really want to improve that.
First, why is this feature important to me?
- It’s something a lot of people really need! The research community is an obvious example, given the realization that a lot of scientific results cannot be reliably replicated based on the published description of the research methods. So, if someone uses OpenRefine at some point in their research workflow, I want it to be really easy to inspect their data cleaning process and re-run it in a new environment, on a slightly different dataset. But it’s not just about research! OpenRefine is also used a lot to import data into knowledge graphs, such as Wikidata. Importing a dataset into a knowledge graph is rarely a one-off task, because those datasets often get updated. If we can reliably re-run data cleaning processes made with OpenRefine, then we are really close to having a clean infrastructure for continuously importing datasets into Wikidata. You could have an online platform which hosts those workflows, making it possible for other contributors to audit and update them as the target data model evolves.
- Many of the architectural changes that are required for this reproducibility also enable other features, seemingly unrelated, which can really help users even if they do not care about re-running their data cleaning process on a new dataset later on. One of the longest-standing open issues in OpenRefine was about the grid view being reset to the first page whenever the user ran an operation that could affect multiple rows in the grid (such as matching a reconciled cell to a particular entity). Determining whether the current view on the grid can meaningfully be preserved requires assigning suitable metadata to operations. This is the same sort of metadata that is required to determine whether a series of operations can be applied on a new grid, or whether two operations can safely be run in parallel.
- I have the impression that for many users, OpenRefine is the first step in their journey towards more principled data transformations. They might have done manual edits in Excel before and might move on to Jupyter notebooks later. I want to believe that OpenRefine can be a meaningful step on this data literacy path, in that it still lets the user carry out transformations interactively, via a user interface, but introduces them to some concepts of programming. The fact that operations run on all rows by default. The combining of filters to select a set of rows. The gradual exposure to regular expressions or expression languages such as GREL or Python. And of course, this ability to re-run the history on a new file, which means that the user is, in some way, interactively writing a small program without thinking about it. But that can only be convincing if this last feature really lives up to the legitimate expectations that users develop about it - which is not the case yet.
I am currently working on this very topic (in the scope of a funded project) and you can see an overview of the progress in this post. In the Development & Design category of OpenRefine’s forum I also post other updates, often with screencasts to demonstrate the features I am working on.
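To make this concrete: the extracted history is a JSON list of operations, which can be pasted into another project via the Undo/Redo tab’s Apply dialog. An abridged single-operation example (the column name and expression here are just illustrative):

```json
[
  {
    "op": "core/text-transform",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "author",
    "expression": "grel:value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column author using expression grel:value.trim()"
  }
]
```

Today, replaying such a list on a new dataset can silently misbehave when the new grid differs from the original one, which is exactly the gap this goal is about.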
Goal 3: Reconciliation
Reconciliation is another feature of OpenRefine which got me hooked when I first discovered the tool. It struck me as a really well designed and approachable solution to such an important problem. I was looking for a tool to match a dataset against Wikidata and there were not many decent options. Magnus Manske’s Mix’n’Match tool was a helpful attempt to crowd-source the matching of various databases with Wikidata, but it made it hard to finely tune the matching or automate parts of it, and its integration with other steps of a data import process was fairly unclear. Magnus had also written a reconciliation service for Wikidata, which was not working very well in OpenRefine. It’s hard to blame him for that: the protocol was pretty poorly documented and one essentially had to inspect OpenRefine’s own source code to understand what the server was supposed to do. So that’s what led me to write a Wikidata reconciliation service and then to get involved in OpenRefine’s development.
So what is there left to do? Many things!
Although I was initially enthusiastic about the design and user-friendliness of the feature, there are tons of things to improve. Reconciliation is a very tricky task and as a tool developer I feel like I have some duties which I don’t really fulfill yet in this area:
- a duty not to deceive users. That means conveying the correct expectations about the matches suggested by reconciliation services and encouraging users to review those matches critically.
- a duty not to waste people’s time. That means giving them time-efficient and intuitive workflows to configure reconciliation, review and improve its results. Having fast reconciliation services is not sufficient for that: the time I primarily want to minimize is not so much the time spent waiting for reconciliation results to come in, but rather the time spent reviewing and correcting reconciliation results.
- a duty to encourage sound methodologies. We don’t do a good job of directing users towards a scientifically valid workflow (train and tune reconciliation on a sample of a dataset, evaluate its performance on another sample) which is the only way they can get a sense of what accuracy they are achieving.
Behind the scenes, the reconciliation protocol is still something that was generalized from Freebase’s own APIs and the current specifications still bear traces of that, with oddities which don’t make so much sense outside of Freebase, or things that simply can be improved more than 10 years later, in a world where people’s expectations about APIs have changed a little. I think this protocol deserves to be adopted broadly, by a lot of data sources and a lot of clients wanting to do matching with those sources. That’s why we started a Community Group within W3C. It gathers a lot of people (48 to date) interested in the protocol for one reason or another, who come together to improve its specifications. A few years after the creation of the group, we have done quite a lot already: documenting the existing protocol (version 0.1), releasing a first improved version of the specs (version 0.2) and drafting a lot of changes for the next version. We have documented the ecosystem around the protocol, built a test bench to help developers check the compliance of their service with the specifications and new services and tools using the protocol have appeared. But a lot of those improvements still haven’t reached OpenRefine users so that’s something I want us to catch up on.
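For readers who have not seen the protocol, here is roughly what a reconciliation exchange looks like, simplified from the version 0.2 specifications (the identifiers are Wikidata’s). A batch of queries sent to the service:

```json
{
  "q0": {
    "query": "Jane Austen",
    "type": "Q5",
    "properties": [ { "pid": "P569", "v": "1775" } ]
  }
}
```

And a (heavily abridged) response, listing candidate entities:

```json
{
  "q0": {
    "result": [
      {
        "id": "Q36322",
        "name": "Jane Austen",
        "score": 98.5,
        "match": true,
        "type": [ { "id": "Q5", "name": "human" } ]
      }
    ]
  }
}
```

The current draft additionally lets candidates carry a list of individual matching features alongside the overall score, rather than just the single opaque number shown here.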
Concretely, what do I want to get done? We (meaning Ayushi, Lydia, Lozana and I) are currently working (in the scope of a funded project) on various usability improvements for our existing reconciliation features. Beyond this effort, I would like to work on the following:
- updating OpenRefine so it can take advantage of the newer features offered by the current draft of the protocol. For instance, exposing the reconciliation features returned by the services to the user, so that they can rely on something better than an opaque matching score.
- basing OpenRefine’s reconciliation features on an external, reusable pair of libraries (one backend-side in Java, another frontend-side in JavaScript). I see this as an opportunity to do a big clean-up in this code base and to encourage the use of the protocol in other clients. The libraries would also help with supporting multiple versions of the protocol, as I would like OpenRefine to remain compatible with as many existing services as possible. I have started working on such a Java library (provisionally named ReconToolkit).
- improving the tool to let the user easily build a robust set of criteria to match a column in their dataset to an external service. This could take the form of interactively asking the user to make judgments on certain rows, selected by an active learning algorithm, to construct a matching classifier fitted to the dataset at hand (based both on features exposed by the reconciliation service and other ones computed locally). Such a model could then be evaluated on another set of rows to get an estimate of its performance. It could be made reusable and shareable. Although I am phrasing those features in fairly technical terms, I suspect it might be possible to present this to users in an approachable way, even for those not familiar with machine learning or statistics.
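To give a flavour of what such an active-learning loop could look like, here is a pure-Python toy sketch. Everything in it is invented for illustration (the feature values, the seed judgments, the tiny hand-rolled logistic regression): it is not how OpenRefine works today, just a minimal demonstration of uncertainty sampling, where the user is repeatedly asked about the candidate match the model is least sure about.

```python
import math

# Each row pairs a dataset record with a reconciliation candidate, described by
# features: some returned by the service (e.g. its matching score), some
# computed locally (e.g. a name similarity). All values here are invented.
FEATURES = [
    [0.95, 0.90], [0.91, 0.85], [0.12, 0.20], [0.55, 0.40],
    [0.88, 0.92], [0.15, 0.10], [0.52, 0.60], [0.48, 0.55],
]
TRUE_LABELS = [1, 1, 0, 1, 1, 0, 0, 1]  # stand-in for the user's judgments

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, epochs=1000, lr=0.5):
    """Fit a tiny logistic-regression matcher with plain gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Active-learning loop: start from two seed judgments, then repeatedly ask the
# "user" (here: TRUE_LABELS) about the most uncertain remaining row, i.e. the
# one whose predicted match probability is closest to 0.5.
labeled = {0: 1, 2: 0}
pool = [i for i in range(len(FEATURES)) if i not in labeled]
for _ in range(3):
    w, b = train([FEATURES[i] for i in labeled], list(labeled.values()))
    ask = min(pool, key=lambda i: abs(predict(w, b, FEATURES[i]) - 0.5))
    labeled[ask] = TRUE_LABELS[ask]
    pool.remove(ask)

w, b = train([FEATURES[i] for i in labeled], list(labeled.values()))
matches = [i for i in range(len(FEATURES)) if predict(w, b, FEATURES[i]) > 0.5]
print("rows judged by the user:", sorted(labeled))
print("rows classified as matches:", matches)
```

The point of the exercise: after only five judgments, the model classifies all remaining rows, and the same held-out-rows idea gives the accuracy estimate mentioned above. A real implementation would of course use better features and a proper learning library, but the interaction pattern is the interesting part.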
There are other things I would really like to see happen (and I would likely help with), related to reconciliation but outside the scope of OpenRefine itself, such as:
- Getting the reconciliation protocol to become a W3C recommendation. Not that I think the actual status makes a ton of difference for adoption, but rather that it would be a sign that we are happy enough about the state of the specs and their implementations that we apply for this. We have already made many very useful improvements to the specs by attempting to comply with W3C’s guidelines (for instance thanks to Fabian Steeg’s tireless work on internationalization matters), so I find this process quite helpful so far
- Having a nice, user-facing directory of reconciliation services. The list in the Test bench is not designed for that
- Having a Wikibase extension which implements the reconciliation protocol. I am hoping this would make reconciliation more reliable, faster and easier to install in a Wikibase instance
Goal 4: Extensibility
Data cleaning needs are very diverse and there is no chance that OpenRefine fulfills enough of those out of the box. People import data from various sources, need various sorts of cleaning steps and export the results to various places. To cater for that, OpenRefine has an extension system, which lets third parties add features to OpenRefine without modifying it directly.
Although we already have an extension system, it’s far from working as it should. First, the experience of installing an OpenRefine extension is quite technical, which likely puts off many users. There should be an easy way to install an extension from the application itself. The same goes of course for upgrading and uninstalling. The stability guarantees are very thin: it’s easy for an extension to crash the entire app, for instance if it was designed to work with a different version of OpenRefine. The development process of extensions is also poorly documented. All this means that the hurdles for a third party to start developing and then keep maintaining an extension are very high.
I think it’s key to the sustainability of the project that this extension system works well. There is a lot of potential in custom integrations with specific platforms, to help people ingest data following a specific data model. Our Wikibase integration is used a lot and the community is asking us for more. The RDF Transform extension is also a popular one. There are extensions to work with OpenStreetMap data too. The Ontotext Refine fork, developed to help people ingest data into the GraphDB triple store, recently got promoted to a standalone product, and they made it possible to install OpenRefine extensions in it. We cannot have built-in support for all those platforms out of the box, so it’s important that other teams are able to take responsibility for this development.
Because OpenRefine has a fairly uncommon architecture, by being a web app that people run locally, there aren’t a lot of existing application platforms to pick from. OpenRefine extensions must be able to add new functionality both server-side and client-side, in the same extension. I am not aware of any framework which offers that. We have been discussing possible strategies on OpenRefine’s forum for a while now and I feel like we are slowly getting somewhere, but there are still a lot of open questions in my opinion.