Talk:Wikimedia Discovery/RFC

Re: Public Curation of Relevance
I think referring to the current relevance algorithm as a "black box through elastic search" is a little disingenuous. The code is all open source, as is the configuration. Here is where we set default weights we give to fields, here's where they're overridden, here's where they're mostly used. We do the same with phrase slopping and a bunch of other config too. Additionally, the concept of relevance is documented upstream and all the code going into it is public. I think we can do a better job of documenting how these different things come together to generate a _score though. Maybe that will help it seem less of a black box :)

I totally agree that we should continue to add ways to allow users to help curate content in search and affect relevance. In fact, Nik and I did some work on this quite some time ago that I think goes mostly forgotten. For Wikinews, we do article age favoritism, as new articles are more interesting on a news site than old ones (and that's what lsearchd did too)--this weighting is configurable. However, I think the biggest (and most unused) feature we already have support for is allowing wikis to configure how they want to boost/lower featured/bad content. I configured it for enwiki some time ago but I highly doubt this super powerful feature has made it to many other wikis. ^demon[omg plz] 18:24, 10 December 2015 (UTC)
 * I totally agree. Concerning the "black box": yes, everything is open, but it is extremely complex, so let's say it's a "complex box" :). Scoring is not an easy thing, so I don't think we'll be able to make Cirrus very easy to understand, but to address this problem I've started to document the scoring mechanisms used by Cirrus. Concerning boost templates, this is something I've wondered about before: who owns these settings in the System Message? Would it make sense to move them into wmf-config? DCausse (WMF) (talk)
 * I'm not really sure ownership is the right question. Originally I was hoping by making it a message the individual wikis could manage it themselves, but if you think doing it via wmf-config would be an improvement I don't think it matters much. Main thing is getting wikis to help advise how they view their high quality (and low quality) content. ^demon[omg plz] 19:53, 17 December 2015 (UTC)
 * Agreed. My concern with System message vs wmf-config is that we are currently exploring solutions with custom rescore profiles. Writing a profile is still a complex task, and configuring template weights outside the context of the rescore formulas might be impossible. Our first experiment will be on Wikidata; let's see how it works after the first results. We should come to an easy process where wikis can guide us with hints on page quality that we could include in our formulas. IMHO template boosts are one of the best criteria (maybe even better than incoming links) and play an important role in the new completion suggester. It's quite frustrating to see that this is enabled only for enwiki :( DCausse (WMF) (talk) 14:35, 21 December 2015 (UTC)
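
To make the boost-template idea in this thread concrete, here is a minimal sketch. It assumes a rule format of one "Template name|percent%" pair per line and a simple multiplicative adjustment of a base score. The function names, the rule format, and the way boosts combine are illustrative assumptions only; this is not how CirrusSearch actually computes a _score.

```python
def parse_boost_rules(text):
    """Parse lines such as 'Template:Featured article|200%' into {name: multiplier}.

    The 'name|percent%' line format is an assumption made for this sketch.
    Blank lines, comment lines, and malformed lines are skipped.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, percent = line.rpartition("|")
        if not sep or not percent.endswith("%"):
            continue  # no separator or no percent sign: ignore the line
        try:
            rules[name.strip()] = float(percent[:-1]) / 100.0
        except ValueError:
            continue  # percent part was not a number
    return rules


def boosted_score(base_score, page_templates, rules):
    """Multiply the base score by the boost of every matching template (hypothetical model)."""
    score = base_score
    for template in page_templates:
        score *= rules.get(template, 1.0)
    return score


rules = parse_boost_rules("Template:Featured article|200%\nTemplate:Stub|50%")
featured = boosted_score(10.0, ["Template:Featured article"], rules)
stub = boosted_score(10.0, ["Template:Stub"], rules)
```

In this toy model a wiki only has to say which templates mark high or low quality pages and by how much; how that hint is folded into a real rescore formula is the open question discussed above.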

Data
I see two fundamentally different types of data that could be exposed via Wikis: single source authoritative and user-created data.

The user-created data source is similar to Wikidata, where information is contributed directly to the wiki-based storage. This path requires each data contributor to justify the legality and validity of each change, as well as track the history of every data contribution. While this presents a considerable task from organizational, social and legal perspectives, it is actually relatively easy technologically, especially if the data is roughly a few megabytes in size. We could create a namespace "Data" capable of storing JSON/GeoJSON and CSV data, with the subpage "/doc" used for localizable metadata and documentation (similar to Lua module pages).

Unlike user-created data, the single source data comes entirely from some well-known source that the community deems reliable and acceptable. The community or developers could perform some data transformation/filtering/parsing/cleanup before making it available to the various tools, such as graphs/maps/search/query, but those transformations do not alter the data's meaning, license, or authoritativeness. For example, a transformation could extract the needed portion of a public API's result and convert numbers and datetimes to Wiki format. The OSM data is an example of a more complex transformation, except that OSM required considerable custom infrastructure to make it user-accessible. Ideally, external data should be community-configurable without the need for any WMF involvement.
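
As a rough illustration of the kind of transformation described above, the following sketch extracts one portion of a JSON API result and converts numbers and datetimes. Everything specific here is an assumption for the example: the shape of the API response, the field names, and the choice of the 14-digit YYYYMMDDHHMMSS timestamp as the "Wiki format".

```python
import json
from datetime import datetime, timezone


def transform(raw_json, path, number_fields, datetime_fields):
    """Extract the list found at `path` and normalize selected fields.

    path            -- keys to follow from the top of the response (hypothetical)
    number_fields   -- fields holding numeric strings, converted to float
    datetime_fields -- fields holding ISO 8601 strings, converted to
                       YYYYMMDDHHMMSS in UTC (assumed target format)
    """
    data = json.loads(raw_json)
    for key in path:
        data = data[key]

    rows = []
    for record in data:
        row = dict(record)  # copy so the parsed source stays untouched
        for field in number_fields:
            row[field] = float(row[field])
        for field in datetime_fields:
            # Accept a trailing 'Z' by rewriting it as an explicit UTC offset.
            parsed = datetime.fromisoformat(row[field].replace("Z", "+00:00"))
            row[field] = parsed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
        rows.append(row)
    return rows


raw = '{"result": {"items": [{"name": "a", "value": "42.5", "updated": "2016-01-21T17:07:00Z"}]}}'
rows = transform(raw, ["result", "items"], ["value"], ["updated"])
```

The transformation only reshapes and reformats values; it does not add, merge, or reinterpret data, which is what keeps the source's meaning and license intact.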

P.S. See also I Dream of Content and implementation steps. Also, in Phabricator: in-wiki CSV & similar, 3D storage. --Yurik (talk) 17:07, 21 January 2016 (UTC)

Wikidata as meta store
Wikidata could potentially work as a sort of a meta-data store for both types of data storage, recording source URL, content license, geo coordinate and the country for the geoJSON data, etc. --Yurik (talk) 17:32, 22 January 2016 (UTC)

Comprehensibility
I've gone through the page and, being BOLD, edited it for grammar. This should make it easier for non-native English speakers, or non-Native techies to understand (Diff). I've removed all contractions ("we've") and replaced some buzzwords ("relevantly licensed", "going forward", "leverage", "gaming"...). I've also fixed a bunch of things that were just incorrectly conjugated/spelled/agreed ("different then Google's Customer Search" -> "different to Google's Custom Search", or "users experiences" -> "users' experiences"). But most importantly, I've replaced all instances of "surface" as a transitive verb. The way that it was used here - "to surface an idea/wikipedia article" is a metaphorical usage of the format "to surface a sunken ship" but Discovery Department documents are the only time I've seen it used to mean "to make an idea more visible". Perhaps it's common within the world of people who think about search-engines, but it's really unusual as a real-world English usage (see, for example, the closest Merriam Webster gets is its definition 3.2 http://www.merriam-webster.com/dictionary/surface ). Please reconsider all future usages of the phrase "to better surface" and replace it with something more commonly understood, like "to help identify" or "to make more visible" etc.

That all notwithstanding, this sentence still doesn't mean anything and I have no idea what to change it to. Can you explain what you're talking about here? "We do want to be very sensitive to not bias our users' experiences with any kind of content and allow our communities to help steer this."

Wittylama (talk) 13:53, 5 February 2016 (UTC)
 * Thanks Wittylama for giving this page some love. I agree that your edits are helpful in conveying the intended ideas. I wonder if the use of "surface" was some unconscious attempt to use a word other than 'discover'? We may never know.


 * As for that sentence I think the intent (I'll ask other watchers of the page to confirm/correct) is that we want to have a neutral experience with content that might appear in search. In short, we want to be cautious in including information in search results from other sources. It has to be neutral in point of view. We also want to make these decisions of what to include in results with the communities.


 * I am going to make one correction to your edits. Elasticsearch is the name of a specific bit of software. I'll add a hyperlink on the first instance of the word for clarification. CKoerner (WMF) (talk) 17:14, 5 February 2016 (UTC)


 * Thanks Wittylama for the non-native speaker point of view. It's very difficult to know what will and won't be clear to others who are reading in a foreign language. You changed "our communities" to "the editing community", and I changed that back to just "the community". I don't think we are limiting ourselves to the needs of editors—though they are much more likely to be commenting here. Readers who don't edit are also an important part of the Discovery team's user base.


 * I also changed a couple of Britishisms to their more typically American variants to preserve consistency throughout, and added some examples to the last "open question" that hopefully make it clearer what we mean by "multiple projects". TJones (WMF) (talk) 18:49, 5 February 2016 (UTC)