User talk:Johan (WMF)

Wishlist
Are you keeping a list of ideas for future wishlist discussions? Here's one from me that I believe is new:

I want a script or tool that tells me who wrote most of the content in a given revision of a page. I'm interested in figuring out who actually wrote the visible sentences and paragraphs, not who formatted citations or inserted infoboxes (both of which can add a lot of bytes). It should also count only edits that actually create content, rather than, e.g., undoing page blankings (which shows up as a large positive byte count but doesn't add any new content).

Unlike w:en:Wikipedia:WikiTrust, I don't care who wrote which specific words, although presumably you would need a similar mechanism to determine the contributions. Instead, I want a list of contributors ranked from most to least content, or perhaps a percentage for the first handful. This would be useful for compliance with the BY aspect of the license (if you copy it to some forms of media, you need to name the five most significant contributors) and also for statistical work on contributions (allowing us to separate "who wrote the most content" from "who did the most formatting or reverting"). Whatamidoing (WMF) (talk) 19:20, 11 January 2016 (UTC)
 * Not really, but since you've asked me twice now, I should probably get the hint and set one up.
 * I think this could build on research that User:EpochFail is looking/has looked at. /Johan (WMF) (talk) 07:46, 12 January 2016 (UTC)
 * Tracking authorship is surprisingly difficult, since it requires substantial computation for large pages. However, I have performed the necessary computations on XML dumps using batch-style large-scale computing systems (e.g. en:Hadoop). Once I finish my analysis work (see m:R:Measuring value-added), the next step is to try to implement it as a live system that synchronizes with a wiki via recent changes. In the meantime, there are systems that generate stats like this on demand. See http://people.aifb.kit.edu/ffl//whovisual/. You'll be in for a long wait if you try to generate the authorship for the current version of en:Anarchism, but it's worth testing out on pages with less history. --EpochFail (talk) 14:00, 12 January 2016 (UTC)
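
To make the request above a bit more concrete, here is a minimal sketch of the general idea: walk through a page's revisions in order, diff each one against the previous one, and credit every surviving word to whoever first added it. This is only an illustration under simplifying assumptions, not the method used by WikiTrust, the tool linked above, or the dump-based analysis. The function names `attribute_tokens` and `author_shares` are hypothetical, tokens are plain whitespace-separated words, wikitext markup is not stripped, and a revert is only recognised when a revision is identical to an earlier one.

```python
from collections import Counter
from difflib import SequenceMatcher


def attribute_tokens(revisions):
    """Return (token, author) pairs for the last revision.

    `revisions` is a chronological list of (author, text) tuples.
    Hypothetical helper: whitespace tokens, no wikitext parsing.
    """
    tokens, owners = [], []
    seen = {}  # token tuple of an earlier revision -> its owner list
    for author, text in revisions:
        new_tokens = text.split()
        key = tuple(new_tokens)
        if key in seen:
            # Identical to an earlier revision (e.g. undoing a blanking):
            # restore the earlier attribution, credit nothing to the reverter.
            tokens, owners = new_tokens, list(seen[key])
            continue
        new_owners = []
        matcher = SequenceMatcher(a=tokens, b=new_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep their original author.
                new_owners.extend(owners[i1:i2])
            elif tag in ("insert", "replace"):
                # Added or rewritten tokens are credited to this editor.
                new_owners.extend([author] * (j2 - j1))
            # "delete": removed tokens carry nothing forward.
        tokens, owners = new_tokens, new_owners
        seen[key] = list(new_owners)
    return list(zip(tokens, owners))


def author_shares(attributed):
    """Rank authors by their share of surviving tokens, largest first."""
    counts = Counter(author for _, author in attributed)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(author, n / total) for author, n in counts.most_common()]


revisions = [
    ("Alice", "The cat sat on the mat"),
    ("Vandal", ""),                                  # page blanking
    ("Bob", "The cat sat on the mat"),               # undoes the blanking
    ("Carol", "The black cat sat on the mat today"),
]
shares = author_shares(attribute_tokens(revisions))
print(shares)
```

In this toy history, Bob's revert restores a large number of bytes but earns no authorship credit, which is the distinction asked for above. Diffing every revision pair is also why the cost grows quickly on pages with long histories.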