Topic on Talk:Wikimedia Enterprise

Data subsets

2 comments • 17:05, 20 October 2021

Summary by LWyatt (WMF)

Suggestions replied to

Pelagic (talk | contribs)

I don’t know if there would be a market for it, but I imagine some customers might want to receive certain subsets of data in their feed/dump, for example:

just the lead paragraphs before the first headings
only the structured (unparsed) wikitext from infoboxes (cf. DBpedia)
delayed but cleaner feed, e.g. only revisions that have not been reverted or undone for x days
only particular types of pages, e.g. English Wikipedia has articles, lists, disambiguation, and redirects all within the same namespace
just the TOCs and hatnotes that describe structure rather than content

Or maybe nobody wants this, especially if our existing downloaders are already geared toward receiving big dumps and running their own parse-and-filter processes?
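To make two of the suggestions above concrete, here is a minimal sketch of the kind of parse-and-filter step a downloader might run today. It is an illustration only, not anything Wikimedia Enterprise provides: the `Revision` record and its `reverted` flag are hypothetical, and the heading pattern assumes plain `== Heading ==` wikitext lines (levels 2 to 6), ignoring edge cases such as headings inside templates or comments.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: a section heading is a line such as "== Title ==" (2 to 6 equals signs).
HEADING_RE = re.compile(r"^={2,6}[^=\n].*?={2,6}\s*$", re.MULTILINE)


def lead_section(wikitext: str) -> str:
    """Return the wikitext that precedes the first section heading."""
    match = HEADING_RE.search(wikitext)
    return (wikitext[:match.start()] if match else wikitext).strip()


@dataclass
class Revision:
    """Hypothetical revision record; not a real API type."""
    rev_id: int
    timestamp: datetime
    reverted: bool  # assumed to be supplied by some upstream revert detection


def stable_revisions(revisions, now, min_age_days):
    """Keep revisions that are at least min_age_days old and not marked reverted."""
    cutoff = now - timedelta(days=min_age_days)
    return [r for r in revisions if not r.reverted and r.timestamp <= cutoff]
```

For example, `stable_revisions(revs, datetime(2020, 7, 28), 7)` would drop anything newer than 21 July 2020 along with anything flagged as reverted, which is roughly the "delayed but cleaner" feed described in the list.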

Edited 21:03, 28 July 2020

Seddon (WMF) (talk | contribs)

@Pelagic I think this is a really good point. In the long term we are definitely thinking about how we can provide parsed content like this to reusers (both internal and external). One of the aspects that we have identified in our very early community research interviews is that if we are to undertake this work, it comes with a responsibility to democratize our data for organizations without the resources of the largest users. We should be about leveling the playing field, not reinforcing monopolies, and about helping to encourage a competitive and healthy internet. It's not just startups or alternatives to the internet giants that we are considering, but also universities and university researchers; archives and archivists; and non-profits like the Internet Archive. We are a long way off from that, but it's definitely within our sights for the future.

01:07, 30 July 2020