My roadmap for OpenRefine

In OpenRefine we still haven’t got ourselves a clear, public roadmap indicating what we are working towards in the near future. There are various proposals as to how we should get ourselves one.

So I thought I would at least write down what my own priorities are and paint a picture of my vision for the project. I don’t claim that this should be OpenRefine’s roadmap: there are a ton of other efforts that I would really agree with, but which I am less interested in working on myself. Other team members have their own agendas and my own priorities are not set in stone either.

Here is a short summary of my personal goals:

  1. Grow the team and set up the project so that it runs sustainably
  2. Reproducibility
  3. Reconciliation
  4. Extensibility

Goal 1: Grow the team and set up the project so that it runs sustainably

This goal is not about which features OpenRefine should have and isn’t about software development per se, so maybe that’s not what you expected to read first. But that’s something I dedicate a good chunk of my time to.

The bottom line is that I won’t be around forever in this project and I would be really happy to leave it in a state where I would be reasonably confident that an enthusiastic, talented and friendly team would keep running for a while after my departure.

Although I have this as a personal goal, it’s also clear I am doing a lot of things wrong in this regard. I have definitely been learning quite a lot on this topic for the past few years, also because of my involvement in Kanthaus, a house project which shares similar challenges, with the constant need to onboard more people to keep the community active and dynamic.

Goal 2: Reproducibility

One of the things that I really liked when I first discovered OpenRefine was the ability to extract the history and replay it on a new version of the dataset. It’s such a great idea and could be potentially so useful! Sadly it’s really far from reliable so I really want to improve that.

First, why is this feature important to me?

  • It’s something a lot of people really need! The research community is an obvious example, given the realization that a lot of scientific results cannot be reliably replicated based on the published description of the research methods. So, if someone uses OpenRefine at some point in their research workflow, I want it to be really easy to inspect their data cleaning process and re-run it in a new environment, on a slightly different dataset. But it’s not just about research! OpenRefine is also used a lot to import data into knowledge graphs, such as Wikidata. Importing a dataset into a knowledge graph is rarely a one-off task, because those datasets often get updated. If we can reliably re-run data cleaning processes made with OpenRefine, then we are really close to having a clean infrastructure for continuously importing datasets into Wikidata. You could have an online platform which hosts those workflows, making it possible for other contributors to audit and update them as the target data model evolves.
  • Many of the architectural changes that are required for this reproducibility also enable other features, seemingly unrelated, which can really help users even if they do not care about re-running their data cleaning process on a new dataset later on. One of the most long-standing open issues we had in OpenRefine was about the grid view being reset to the first page whenever the user would run an operation that could affect multiple rows in the grid (such as matching a reconciled cell to a particular entity). Determining if the current view on the grid can meaningfully be preserved requires assigning suitable metadata to operations. This is the same sort of metadata that is required to determine if a series of operations can be applied on a new grid, or if two operations can safely be run in parallel.
  • I have the impression that for many users, OpenRefine is the first step in their journey towards more principled data transformations. They might have done manual edits in Excel before and might move on to Jupyter notebooks later. I want to believe that OpenRefine can be a meaningful step on this data literacy path, in that it still lets the user carry out transformations interactively, via a user interface, but introduces them to some concepts of programming. The fact that operations run on all rows by default. The combining of filters to select a set of rows. The gradual exposure to regular expressions or expression languages such as GREL or Python. And of course, this ability to re-run the history on a new file, which means that the user is, in some way, interactively writing a small program without thinking about it. But that can only be convincing if this last feature really lives up to the legitimate expectations that users develop about it - which is not the case yet.
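
To make the idea of operation metadata from the second point a bit more concrete, here is a minimal sketch. The OperationMetadata class, its fields and the helper functions are hypothetical names chosen for illustration, not OpenRefine's actual data model. The sketch only shows how declaring which columns an operation reads and writes makes it possible to check whether a recorded history applies to a new grid, and whether two operations are independent of each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationMetadata:
    """Hypothetical description of what an operation touches."""
    name: str
    reads: frozenset   # columns the operation needs as input
    writes: frozenset  # columns the operation creates or modifies
    preserves_rows: bool = True  # False if rows are added, removed or reordered


def first_missing_columns(history, columns):
    """Return (operation name, missing columns) for the first operation that
    cannot run on a grid with the given columns, or None if all of them can."""
    available = set(columns)
    for op in history:
        missing = op.reads - available
        if missing:
            return op.name, sorted(missing)
        available |= op.writes  # later operations may rely on new columns
    return None


def independent(a, b):
    """Two operations can be reordered (or run in parallel) if neither one
    writes to a column that the other reads or writes."""
    return (not (a.writes & (b.reads | b.writes))
            and not (b.writes & (a.reads | a.writes)))


def view_can_be_kept(op):
    """The current page of the grid stays meaningful if rows are left in place."""
    return op.preserves_rows


# Illustrative history: the operations and column names are made up
trim = OperationMetadata("trim name", frozenset({"name"}), frozenset({"name"}))
reconcile = OperationMetadata("reconcile city", frozenset({"city"}), frozenset({"city"}))
initials = OperationMetadata("derive initials", frozenset({"name"}), frozenset({"initials"}))

ok = first_missing_columns([trim, reconcile, initials], ["name", "city"])
problem = first_missing_columns([trim, reconcile], ["name", "town"])
```

With this sample history, the first check finds nothing missing, while the second one reports that "reconcile city" needs a "city" column which the new grid lacks. Real operations would need richer metadata than sets of column names (for instance, operations that depend on facets), so this is only a starting point.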

I am currently working on this very topic (in the scope of a funded project) and you can see an overview of the progress in this post. In the Development & Design category of OpenRefine’s forum I also post other updates, often with screencasts to demonstrate the features I am working on.

Goal 3: Reconciliation

Reconciliation is another feature of OpenRefine which got me hooked when I first discovered the tool. It struck me as a really well designed and approachable solution to such an important problem. I was looking for a tool to match a dataset against Wikidata and there were really not a lot of decent options. Magnus Manske’s Mix’n’Match tool was a helpful attempt to crowd-source matching of various databases with Wikidata, but made it hard to finely tune the matching or automate parts of it, and its integration with other steps of a data import process was fairly unclear. Magnus had also written a reconciliation service for Wikidata, which was not working very well in OpenRefine. It’s hard to blame him for that: the protocol was pretty poorly documented and one essentially had to inspect OpenRefine’s own source code to understand what the server was supposed to do. So that’s what led me to write a Wikidata reconciliation service and then to get involved in OpenRefine’s development.

So what is there left to do? Many things!

Although I was initially enthusiastic about the design and user-friendliness of the feature, there are tons of things to improve. Reconciliation is a very tricky task and as a tool developer I feel like I have some duties which I don’t really fulfill yet in this area:

  • a duty not to deceive users. That means conveying the correct expectations about the matches suggested by reconciliation services and encouraging users to review those critically.
  • a duty not to waste people’s time. That means giving them time-efficient and intuitive workflows to configure reconciliation, review and improve its results. Having fast reconciliation services is not sufficient for that: the time I primarily want to minimize is not so much the time spent waiting for reconciliation results to come in, but rather the time spent reviewing and correcting reconciliation results.
  • a duty to encourage sound methodologies. We don’t do a good job of directing users towards a scientifically valid workflow (train and tune reconciliation on a sample of a dataset, evaluate its performance on another sample) which is the only way they can get a sense of what accuracy they are achieving.
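
As an illustration of the train-then-evaluate workflow mentioned in the last point, here is a minimal sketch. It assumes that a user has manually reviewed a sample of rows, recording for each one the score of the top candidate and whether that candidate was correct. The function names, the choice of a single score threshold as the thing being tuned, and the 95% precision target are all assumptions made for the example, not features of OpenRefine.

```python
import random


def precision_recall(reviewed, threshold):
    """reviewed: list of (score, is_correct) pairs for the top candidate of
    each row. A candidate is auto-matched when its score reaches the threshold."""
    accepted = [ok for score, ok in reviewed if score >= threshold]
    true_positives = sum(accepted)
    total_correct = sum(ok for _, ok in reviewed)
    # By convention, report 1.0 when there is nothing to be wrong about
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct if total_correct else 1.0
    return precision, recall


def tune_threshold(train, min_precision=0.95):
    """Pick the lowest threshold whose precision on the training sample
    reaches min_precision; infinity means that nothing should be auto-matched."""
    for threshold in sorted({score for score, _ in train}):
        precision, _ = precision_recall(train, threshold)
        if precision >= min_precision:
            return threshold
    return float("inf")


def split_sample(reviewed, train_fraction=0.5, seed=0):
    """Shuffle the reviewed rows and split them into two disjoint samples."""
    rows = list(reviewed)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Usage: tune on one sample, then report accuracy on the other one only
# train, test = split_sample(reviewed_rows)
# threshold = tune_threshold(train)
# precision, recall = precision_recall(test, threshold)
```

The important part is the split: the figures measured on the sample used for tuning are optimistic, so only those computed on the held-out sample give a fair sense of the accuracy being achieved.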

Behind the scenes, the reconciliation protocol is still something that was generalized from Freebase’s own APIs and the current specifications still bear traces of that, with oddities which don’t make so much sense outside of Freebase, or things that simply can be improved more than 10 years later, in a world where people’s expectations about APIs have changed a little. I think this protocol deserves to be adopted broadly, by a lot of data sources and a lot of clients wanting to do matching with those sources. That’s why we started a Community Group within W3C. It gathers a lot of people (48 to date) interested in the protocol for one reason or another, who come together to improve its specifications. A few years after the creation of the group, we have done quite a lot already: documenting the existing protocol (version 0.1), releasing a first improved version of the specs (version 0.2) and drafting a lot of changes for the next version. We have documented the ecosystem around the protocol and built a test bench to help developers check the compliance of their service with the specifications. New services and tools using the protocol have also appeared. But a lot of those improvements still haven’t reached OpenRefine users so that’s something I want us to catch up on.
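
For readers who have never looked at the protocol, here is a rough sketch of what a batch of reconciliation queries and its response look like, assuming the JSON shapes documented for version 0.2 of the specifications. The helper functions, the "ex:" identifiers, the scores and the "name_similarity" feature are all made up for the example, and no request is actually sent: the specifications remain the authority on the exact format.

```python
import json
from urllib.parse import urlencode


def build_query_batch(names, type_id=None, limit=3):
    """Build a batch of queries, keyed by arbitrary identifiers (q0, q1, ...)."""
    batch = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            query["type"] = type_id  # restrict candidates to this type
        batch[f"q{i}"] = query
    return batch


def encode_batch(batch):
    """Form-encode the batch, serialized as JSON under the `queries` parameter."""
    return urlencode({"queries": json.dumps(batch)})


def best_candidates(response):
    """Map each query key to its highest scoring candidate, or None."""
    return {
        key: max(payload.get("result", []), key=lambda c: c["score"], default=None)
        for key, payload in response.items()
    }


def feature_values(candidate):
    """Flatten the optional list of reconciliation features into a dict."""
    return {f["id"]: f["value"] for f in candidate.get("features", [])}


# Illustrative response: identifiers, scores and features are hypothetical
SAMPLE_RESPONSE = {
    "q0": {"result": [
        {"id": "ex:1", "name": "Berlin", "score": 100, "match": True,
         "type": [{"id": "ex:City", "name": "city"}],
         "features": [{"id": "name_similarity", "value": 100}]},
        {"id": "ex:2", "name": "Berlin Township", "score": 61, "match": False,
         "type": [{"id": "ex:City", "name": "city"}],
         "features": [{"id": "name_similarity", "value": 61}]},
    ]},
    "q1": {"result": []},
}

batch = build_query_batch(["Berlin", "Hamburg"], type_id="ex:City")
body = encode_batch(batch)
best = best_candidates(SAMPLE_RESPONSE)
```

The per-candidate features are the part of the response that lets a client look beyond the single opaque score.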

Concretely, what do I want to get done? We (meaning Ayushi, Lydia, Lozana and I) are currently working (in the scope of a funded project) on various usability improvements for our existing reconciliation features. Beyond this effort, I would like to work on the following:

  • Updating OpenRefine so it can take advantage of the newer features offered by the protocol, in its current draft. For instance, exposing the reconciliation features returned by the services to the user, so that they can rely on something better than an opaque matching score.
  • Basing OpenRefine’s reconciliation features on an external, reusable pair of libraries (one backend-side in Java, another frontend-side in JS). I see this as an opportunity to do a big cleanup in this code base and encourage the use of the protocol in other clients. The libraries would also help with supporting multiple versions of the protocol, as I would like OpenRefine to remain compatible with as many existing services as possible. I have started working on such a Java library (provisionally named ReconToolkit).
  • Improving the tool to let the user easily build a robust set of criteria to match a column in their dataset to an external service. This could take the form of interactively asking the user to make judgments on certain rows, selected by an active learning algorithm, to construct a matching classifier fitted to the dataset at hand (based both on features exposed by the reconciliation service and other ones computed locally). Such a model could then be evaluated on another set of rows to get an estimate of its performance. It could be made reusable and shareable. Although I am phrasing those features in fairly technical terms, I suspect it might be possible to present this to users in an approachable way, even for those not familiar with machine learning or statistics.
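
To give a sense of what selecting rows with an active learning algorithm could mean in the last point, here is a small sketch of uncertainty sampling, which is only one possible strategy and not necessarily the one that would end up being used. It asks the user about the rows whose score lies closest to the current decision threshold, since those are the ones the model is least sure about. The function name and its parameters are hypothetical.

```python
def rows_to_review(scores, threshold, already_reviewed=(), batch_size=5):
    """Pick the unreviewed rows whose score is closest to the threshold.

    scores: dict mapping a row identifier to the current matching score
    of its best candidate.
    """
    reviewed = set(already_reviewed)
    candidates = [row for row in scores if row not in reviewed]
    # Smallest distance to the decision boundary first
    candidates.sort(key=lambda row: abs(scores[row] - threshold))
    return candidates[:batch_size]


# Illustrative scores for five rows, with row 3 already reviewed
example_scores = {0: 98, 1: 52, 2: 10, 3: 47, 4: 75}
next_rows = rows_to_review(example_scores, threshold=50,
                           already_reviewed={3}, batch_size=2)
```

Each round of judgments would then be used to refit the classifier before selecting the next batch of rows.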

There are other things I would really like to see happen (and I would likely help with), related to reconciliation but outside the scope of OpenRefine itself, such as:

  • Getting the reconciliation protocol to become a W3C recommendation. Not that I think the actual status makes a ton of difference for adoption, but rather that it would be a sign that we are happy enough about the state of the specs and their implementations that we apply for this. We have already made many very useful improvements to the specs by attempting to comply with W3C’s guidelines (for instance thanks to Fabian Steeg’s tireless work on internationalization matters), so I have found this process quite helpful so far
  • Having a nice, user-facing directory of reconciliation services. The list in the Test bench is not designed for that
  • Having a Wikibase extension which implements the reconciliation protocol. I am hoping this would make reconciliation more reliable, faster and easier to install in a Wikibase instance

Goal 4: Extensibility

Data cleaning needs are very diverse and there is no chance that OpenRefine fulfills enough of those out of the box. People import data from various sources, need various sorts of cleaning steps and export the results to various places. To cater for that, OpenRefine has an extension system, which lets third parties add features to OpenRefine without modifying it directly.

Although we already have an extension system, it’s far from working as it should. First, the experience of installing an OpenRefine extension is quite technical, which likely puts off many users. There should be an easy way to install an extension from the application itself. The same goes of course for upgrading and uninstalling. The stability guarantees are very thin: it’s easy for an extension to crash the entire app, for instance if it was designed to work with a different version of OpenRefine. The development process of extensions is also poorly documented. All this means that the hurdles for a third party to start developing and then keep maintaining an extension are very high.

I think it’s key to the sustainability of the project that this extension system works well. There is a lot of potential in custom integrations with specific platforms, to help people ingest data following a specific data model. Our Wikibase integration is used a lot and the community is asking us for more. The RDF Transform extension is also a popular one. There are extensions to work with OpenStreetMap data too. The Ontotext Refine fork, developed to help people ingest data into the GraphDB triple store, recently got promoted to a standalone product and they made it possible to install OpenRefine extensions in it. We cannot have built-in support for all those platforms out of the box, so it’s important that other teams are able to take responsibility for this development.

Because OpenRefine has a fairly uncommon architecture, by being a web app that people run locally, there aren’t a lot of existing application platforms to pick from. OpenRefine extensions must be able to add new functionality both server-side and client-side, in the same extension. I am not aware of any framework which offers that. We have been discussing possible strategies on OpenRefine’s forum for a while now and I feel like we are slowly getting somewhere, but there are still a lot of open questions in my opinion.