Wikibase day-dreaming
Some months ago, I saw a job posting for Wikibase.Cloud’s product manager. It piqued my interest and led me to ask myself what I would do if I were to work at Wikimedia Deutschland, in a position to influence the direction of the Wikibase project. I realized I have opinions, so this is an attempt to put them in an understandable format, mostly so that I can refer to them in discussions, as those subjects regularly come up.
Experiment with more federation scenarios
The volume of data stored in Wikidata has been growing steadily for the past few years, and we have known that this growth isn’t sustainable. Wikidata’s usefulness relies in large part on its data being made available in its Query Service, a triple store that cannot easily be scaled. The Search Team at the Wikimedia Foundation has recently decided to stop offering all of Wikidata in a single Query Service, splitting the graph in two because the load was no longer manageable. The SQL database itself has been growing at a concerning rate. And beyond the purely technical aspects, there is also the question of how much data the Wikidata community can curate, as the human effort required to take care of all those entities is considerable.
At the same time, users of third-party Wikibase instances struggle, because they basically need to build their knowledge graph from scratch, with tooling that is inferior to what is available on Wikidata. Wikibase has been primarily designed to power Wikidata, a centralized and unique knowledge graph. The idea of multiple Wikibases referring to each other is a very foreign concept that doesn’t fit so well with the Wikibase way of doing things (such as referring to items via “Qids”, not qualified by any sort of domain or prefix).
It’s urgent both for Wikidata and for third-party Wikibases to make it easier to work across multiple Wikibase instances, so that:
- certain chunks of Wikidata entities can be split out to separate Wikibase instances, the first obvious candidate being the scholarly articles stored by the Wikicite project,
- people working in third-party Wikibases can more easily build on top of Wikidata and other Wikibase instances.
Whether we want to call those changes “federation” or not, let’s make sure they address those pressing needs. Introducing federation in a platform that has not been designed for it can be very difficult, but I believe some lightweight approaches could already make a big difference. A few years ago, Wikimedia Deutschland experimented with having a Wikibase instance reuse the properties of another one, an experiment that was not judged conclusive. Let’s just explore more scenarios!
To do so, I would focus on the obvious concrete case at hand: Wikicite. This sub-community of Wikidata is a good candidate for splitting out to another Wikibase instance, because it curates a relatively well-delimited set of entities within Wikidata, and is itself responsible for the lion’s share of the scalability issues. The Wikicite project has also been held back by the growth constraints of Wikidata, forcing it to have a patchy coverage of the subject it models, so I believe it could really thrive if it could free itself from those constraints. My priority would be to sit down with Wikicite participants and try to understand which blockers would prevent them from migrating scholarly articles to a separate Wikibase instance today.
Here’s how I expect it would go. The first obvious thing to try (in my opinion) would be to load all scholarly articles in a separate Wikibase instance, keeping other general-purpose items (journals, organizations, humans, topics…) in Wikidata. For this to work, we’d need to first import Wikidata properties into the Wikibase instance, strategically choosing their datatypes:
- a property like cites work (P2860) would retain its item datatype, since it would mostly be used to link one scholarly article to another (both present in the same Wikibase instance),
- a property like author (P50) would get a string datatype, so that it can store a Wikidata Qid and appropriately link to it via formatter URL (P1630) and formatter URI for RDF resource (P1921).
The exact details of which types of entities to store in which Wikibase can of course be adjusted: the goal would be to keep those choices consistent with property domains and ranges, so that the separation can be enforced by the property datatypes. For instance, one could also decide to store entities about people alongside scholarly articles, to enable mass-importing scholarly profiles from external databases without flooding Wikidata. Local items could be equated with Wikidata items via a dedicated property (as is already customary in many Wikibase instances).
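To make the “author” case more concrete, here is a minimal sketch (in Python) of the RDF mapping this setup is aiming for. The Wikicite Wikibase domain and its item and property ids are made up for the example; the substantive assumption is that the local “author” property carries a formatter URI for RDF resource of the form http://www.wikidata.org/entity/$1, so that its string values (Wikidata Qids) get expanded into Wikidata entity URIs at export time:

```python
# Sketch: exporting a string-valued "author" statement as an RDF link to Wikidata.
# The base URIs, item id and property id of the Wikicite Wikibase are hypothetical.

WIKICITE_ENTITY = "https://wikicite.example.org/entity/"
WIKICITE_PROP_DIRECT = "https://wikicite.example.org/prop/direct/"

# Value of "formatter URI for RDF resource" (P1921) on the local "author" property:
AUTHOR_FORMATTER_URI = "http://www.wikidata.org/entity/$1"


def author_triple(local_item: str, local_author_prop: str, wikidata_qid: str) -> str:
    """Render a string-valued 'author' statement as an N-Triples line
    whose object is the corresponding Wikidata entity URI."""
    subject = WIKICITE_ENTITY + local_item
    predicate = WIKICITE_PROP_DIRECT + local_author_prop
    obj = AUTHOR_FORMATTER_URI.replace("$1", wikidata_qid)
    return f"<{subject}> <{predicate}> <{obj}> ."


# A hypothetical scholarly article (local Q17) authored by some Wikidata item (Q42):
print(author_triple("Q17", "P12", "Q42"))
# <https://wikicite.example.org/entity/Q17> <https://wikicite.example.org/prop/direct/P12> <http://www.wikidata.org/entity/Q42> .
```

A “cites work” statement, by contrast, would keep local entity URIs on both ends of the triple, since both articles live in the same Wikibase instance.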
What problems do I expect in such an experiment?
- it’s going to be cumbersome to add Wikidata Qids as values of properties like “author”. They will not be rendered with labels, nor be validated when they are saved.
- Wikicite is heavily reliant on the Wikidata Query Service and expects all the entities relevant to the project to be present in the same triple store. The Scholia app is a good example of that: it is very difficult to rewrite its queries such that they can still run across two triple stores.
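To give an idea of what the second point means in practice, here is a simplified Scholia-style query (list scholarly articles and the labels of their authors) in its current single-endpoint form and in the federated form it would need after a split. The Wikicite endpoint and its prefixes are hypothetical; the Wikidata ids in the first query are real:

```python
# Illustration only: the two shapes a Scholia-style query would take.
# "wikicite.example.org" and the wcd:/wcdt: prefixes are hypothetical.

# Today, against the single Wikidata Query Service (which predeclares the
# wd:, wdt:, wikibase: and bd: prefixes):
QUERY_TODAY = """
SELECT ?article ?authorLabel WHERE {
  ?article wdt:P31 wd:Q13442814 ;   # instance of: scholarly article
           wdt:P50 ?author .        # author
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

# After a split, the same data lives in two triple stores, so the query has to
# federate: match articles locally, then fetch author labels from Wikidata.
QUERY_AFTER_SPLIT = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wcd:  <https://wikicite.example.org/entity/>
PREFIX wcdt: <https://wikicite.example.org/prop/direct/>
SELECT ?article ?authorLabel WHERE {
  ?article wcdt:P31 wcd:Q1 ;    # hypothetical local "scholarly article" class
           wcdt:P12 ?author .   # hypothetical local "author", exported as Wikidata URIs
  SERVICE <https://query.wikidata.org/sparql> {
    ?author rdfs:label ?authorLabel .
    FILTER(LANG(?authorLabel) = "en")
  }
}
LIMIT 10
"""
```

Even this toy example shows the pain: the federated version can no longer use the label service, it has to ship every author binding over the network, and each of Scholia’s many queries would need this kind of manual surgery.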
We can work on those problems.
- We make a Wikibase extension which declares a new property datatype. Such properties have Wikidata items as values, with the same auto-complete widget. The labels of the entities referred to are cached in a dedicated table, refreshed whenever those statements change or upon explicit request (see the sketch after this list). The local statements referring to Wikidata items export to RDF as you would expect, with the corresponding Wikidata entity URIs. This provides a transparent way to refer to Wikidata items, imitating the existing UX without needing any architectural changes in Wikibase;
- We run a Query Service containing both the data from the Wikicite Wikibase instance and the relevant parts of Wikidata, fed by two query service updaters. Hopefully, a reduced read/write load would make this more tractable than the Wikidata Query Service. This can be trialled at a small scale. Plugging an off-the-shelf query service updater into Wikidata might not work so well given the volume of changes, so some adaptations might be required there.
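For the first point, here is a rough sketch in Python of the label-caching logic such a datatype would rely on. The real implementation would live in a MediaWiki/Wikibase extension in PHP, and the in-memory cache below stands in for the dedicated SQL table; the wbgetentities call, however, is the standard API for fetching labels from Wikidata:

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Stand-in for the dedicated cache table the extension would maintain:
# Qid -> {language code: label}.
label_cache: dict[str, dict[str, str]] = {}


def refresh_labels(qids: list[str], languages: str = "en|fr|de") -> None:
    """Fetch labels for the given Wikidata items and store them in the cache.
    Would be called whenever a statement using the new datatype changes,
    or upon explicit request."""
    for i in range(0, len(qids), 50):  # wbgetentities accepts at most 50 ids per call
        batch = qids[i:i + 50]
        response = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities",
            "ids": "|".join(batch),
            "props": "labels",
            "languages": languages,
            "format": "json",
        })
        response.raise_for_status()
        for qid, entity in response.json().get("entities", {}).items():
            label_cache[qid] = {
                lang: data["value"] for lang, data in entity.get("labels", {}).items()
            }


refresh_labels(["Q42"])
print(label_cache.get("Q42", {}).get("en"))  # label to display next to the stored Qid
```

With such a cache in place, the statement value can be rendered with a human-readable label in the UI while still being stored and exported as a plain reference to the Wikidata item.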
Surely this is not straightforward, but perhaps still worth trying out?
Don’t invest more in EntitySchemas
How do you find the capacity to work on federation? By stopping work on EntitySchemas! Wikimedia Deutschland has apparently been spending a lot of effort on EntitySchemas lately, to make them linkable via statements and to give them more of the features that are expected of Wikibase entities. From the outside, it looks like this effort took a rather long-winded path (by not making EntitySchemas real Wikibase entities and re-implementing much of the associated functionality separately), but even if EntitySchemas were turned into proper Wikibase entities, a major issue would still remain: ShEx (the language in which EntitySchemas are expressed) is designed to validate RDF data, not data expressed in Wikibase’s own data model.
To understand why that’s a problem, we need to think about the use cases that those EntitySchemas are supposed to eventually enable. The ones I am aware of are:
- generating a report showing to what extent a set of Wikibase entities (or a single one) complies with the schema, helping identify quality issues and address them via manual or automated edits,
- doing the same sort of quality assurance on candidate edits from a batch upload tool such as OpenRefine, so that data modelling issues can be addressed ahead of an import,
- generating data entry forms (similarly to Cradle or Wikibase Lexeme Forms) that match the data modeling conventions on the instance. This helps users manually input new entities without pre-existing knowledge of the expected properties, and saves time by avoiding the need to input those properties in the first place,
- documenting data modeling conventions in a standard format, to be read directly as ShEx by other users.
The fact that Wikibase stores its data in its own format, which is very different from RDF, is a significant hurdle for providing a satisfactory user experience around use cases 1, 2 and 3. To validate Wikibase data with ShEx, one needs to translate it to RDF. This translation is by now well established, as it’s used to populate the query service, but it’s not so easy to explain to end users, and it’s a one-way street. By this, I mean that I am not aware of any tool (even third party) to create a new Wikidata item by supplying its RDF representation. To provide an experience comparable to that of the WikibaseQualityConstraints extension, which is able to highlight issues with specific statements directly in the Wikibase UI, one would need to find a way to lift the results of ShEx validation back up to the Wikibase data model, and that’s difficult precisely because of this one-way street conversion. One can for sure come up with something that works in basic cases, but in my opinion it’s going to be very hard to make it really user-friendly and reliable.
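To illustrate the gap, here is roughly what one and the same “author” statement looks like in the two worlds. The item and statement ids below are made up, but the shapes follow the standard Wikibase JSON and RDF models; a ShEx violation is reported against the RDF triples, and mapping it back to the exact statement (or qualifier, or reference) in the JSON is where things get hairy:

```python
# The same statement in both representations (ids are hypothetical).

# What Wikibase stores and edits (simplified JSON):
statement_json = {
    "mainsnak": {
        "snaktype": "value",
        "property": "P50",
        "datatype": "wikibase-item",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": "Q99999"},
        },
    },
    "type": "statement",
    "rank": "normal",
}

# What a ShEx validator sees, after the one-way translation to RDF (Turtle):
statement_rdf = """
wd:Q11111 wdt:P50 wd:Q99999 .                      # "truthy" triple
wd:Q11111 p:P50 wds:Q11111-aaaa-bbbb .             # full statement node
wds:Q11111-aaaa-bbbb ps:P50 wd:Q99999 ;
                     wikibase:rank wikibase:NormalRank .
"""
```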
Beyond the fact that ShEx is designed to validate RDF, another issue is its very broad expressivity, which makes it hard to validate data against ShEx schemas efficiently at scale. This is also a big hurdle for use cases 1, 2 and 3. Some years ago, a call was held on this topic and the idea of defining a subset of ShEx was floated. This would have the benefit of constraining the expressivity to keep implementations tractable, and could also be the occasion to introduce additional fields required to generate data input forms (such as labels or placeholders for input fields). This could potentially work, but then, if we are to define a new format, why not make it validate data in the Wikibase model directly? And at that point it doesn’t really have anything to do with ShEx anymore.
Use case 4 remains, but does not look so exciting to me. ShEx schemas for Wikibase data are not very readable per se, since Wikibase entity URIs are quite cryptic unless you learn Qids and Pids by heart. If people are enthusiastic about the ShEx syntax despite that, they can easily embed ShEx schemas in wiki pages, potentially via a template to help with linking to an external validator. No custom Wikibase development is required to satisfy this use case.
For those reasons, I think investing more effort into EntitySchemas should not be a priority.
More dogfooding
One constant struggle in the Wikimedia movement, and probably in a lot of other volunteer communities supported by a small paid team, is the gap between the volunteers (Wikidata editors, third-party Wikibase users) and the employees (teams at Wikimedia Deutschland and the Wikimedia Foundation). By gap I mean a cultural gap, but also a gap in priorities, because the two parties look at the products from very different vantage points. It’s something that is difficult to avoid. When hiring for a role, organizations will generally favour someone with proven experience in this particular type of role (possibly in another industry) over someone coming from the grassroots community that they serve. Volunteers may have given a lot of their time to the movement, but are they going to fit in the organizational chart? Are they going to be reliable as employees?
Of course, there are ways to reduce this gap, by having the two parties communicate better (user research interviews, in-person or online gatherings), and those are used in the Wikimedia movement. Still, in the context of Wikidata and Wikibase, I think there is room for improvement. Over the past few years, we have seen a lot of turnover in the product manager positions at Wikimedia Deutschland, with vacant positions being filled by professionals who had little previous involvement in the wiki community, as far as I can tell. Getting to know the community and forming a deep understanding of the product takes time, and I think it is a lot more likely to happen if those professionals become direct users of their product.
So how about we do more dogfooding? Can we have a Wikibase.Cloud product manager who is actively involved as an administrator of a Wikibase instance on that platform? Not a toy Wikibase created for testing purposes, but a Wikibase actively used by a community that has a job to do with it. Let them run editathons for that community on their work time. I think it would really help them make informed decisions about the product they are in charge of.
Beyond that, it would of course be ideal to retain those product managers for longer. That’s a difficult goal, and there likely isn’t one single reason why the previous ones have left their positions. But intuitively, having them develop a closer connection to their user base should make their work easier and more fulfilling. As someone who has interacted on a regular basis with teams at Wikimedia Deutschland because of my involvement in OpenRefine and the Wikibase Stakeholder Group, I have found it difficult to maintain a working relationship with those teams, given the rapidly changing faces on the other side of the Zoom call.
And what about reconciliation?
I have been working on OpenRefine for what feels like a long time (and will be leaving the project this year). As part of that, I have been promoting its reconciliation protocol as something that linked open data platforms should implement. That includes Wikibase: I think it would be really useful for Wikibase to implement this protocol directly. So it may surprise some readers that I don’t rank this as the top priority among Wikibase improvements. I hope this conveys how urgent I think the federation improvements are. The house is burning!

In my opinion, the Wikidata Query Service should never have had to be split: the Wikidata community should have been given much more explicit feedback about what sort of growth the infrastructure can sustain, so that it can make more informed decisions about whether certain types of entities should be stored in Wikidata or elsewhere. My understanding is that the Wikidata team sees its role as that of infrastructure maintainers, who are there to serve the community of volunteers and accommodate the organic growth that happens there. This is a principle that has worked well for Wikipedia, because the editorial boundaries agreed on by the community make growth much more manageable. Wikidata is a wiki of a very different nature, where users can make legitimate mass imports that can bring the infrastructure to its knees. Letting community members find out about the limits of the infrastructure by running into those walls is not doing them a favour.
Another reason why I wouldn’t rank reconciliation as a top priority is that it’s a project that can be tackled by an external team relatively well. We’ve been trying to tackle that in the Wikibase Stakeholder Group and have some funding applications pending, so perhaps it might even end up happening, who knows!
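For readers who have not come across the reconciliation protocol: a reconciliation service accepts a batch of text queries (optionally constrained by type or properties) and returns ranked candidate entities. Here is a minimal sketch of a call against the publicly hosted Wikidata reconciliation endpoint that OpenRefine uses at the time of writing; a native Wikibase implementation would expose the same kind of endpoint:

```python
import json
import requests

# Publicly hosted reconciliation service for Wikidata; any Wikibase
# implementing the protocol would expose a similar endpoint.
ENDPOINT = "https://wikidata.reconci.link/en/api"

queries = {
    "q0": {
        "query": "Douglas Adams",  # the text we want to match to an entity
        "type": "Q5",              # restrict candidates to instances of human
        "limit": 3,
    }
}

response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
response.raise_for_status()

for candidate in response.json()["q0"]["result"]:
    # Each candidate comes with an id, a label, a score and a "match" flag.
    print(candidate["id"], candidate["name"], candidate["score"], candidate["match"])
```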