Talk:Wikimedia Discovery/RFC

Re: Public Curation of Relevance
I think referring to the current relevance algorithm as a "black box through elastic search" is a little disingenuous. The code is all open source, as is the configuration. Here is where we set default weights we give to fields, here's where they're overridden, here's where they're mostly used. We do the same with phrase slopping and a bunch of other config too. Additionally, the concept of relevance is documented upstream and all the code going into it is public. I think we can do a better job of documenting how these different things come together to generate a _score though. Maybe that will help it seem less of a black box :)

I totally agree that we should continue to add ways to allow users to help curate content in search and affect relevance. In fact, Nik and I did some work on this quite some time ago that I think goes mostly forgotten. For Wikinews, we do article age favoritism, as new articles are more interesting on a news site than old ones (and that's what lsearchd did too)--this weighting is configurable. However, I think the biggest (and most unused) feature we already have support for is allowing wikis to configure how they want to boost/lower featured/bad content. I configured it for enwiki some time ago but I highly doubt this super powerful feature has made it to many other wikis. ^demon[omg plz] 18:24, 10 December 2015 (UTC)
 * I totally agree. Concerning the "black box": yes, everything is open, but it's extremely complex, so let's say it's a "complex box" :)... Scoring is not an easy thing, so I don't think we'll be able to make Cirrus very easy to understand, but to address this problem I've started to document the scoring mechanisms used by Cirrus. Concerning boost templates, this is something I've wondered about before: who owns these settings in the system message? Would it make sense to move them into wmf-config? DCausse (WMF) (talk)
 * I'm not really sure ownership is the right question. Originally I was hoping by making it a message the individual wikis could manage it themselves, but if you think doing it via wmf-config would be an improvement I don't think it matters much. Main thing is getting wikis to help advise how they view their high quality (and low quality) content. ^demon[omg plz] 19:53, 17 December 2015 (UTC)
 * Agreed. My concern with system message vs wmf-config is that we are currently exploring solutions with custom rescore profiles; writing a profile is still a complex task, and configuring template weights outside the context of the rescore formulas might be impossible. Our first experiment will be on Wikidata; let's see how it works after the first results. We should arrive at an easy process where wikis can guide us with hints on page quality that we could include in our formulas. IMHO template boosts are one of the best criteria (maybe even better than incoming links) and play an important role in the new completion suggester; it's quite frustrating to see that they are enabled only for enwiki :( DCausse (WMF) (talk) 14:35, 21 December 2015 (UTC)
 * I think the black box part refers to the fact that as a reader or editor, I cannot easily provide input for the search engine to change its relevance mechanics (maybe via boost templates, but as you notice, this is under-used). I cannot say that when I am looking for certain search terms, I think some results are more relevant than others. This is already changing as we explore incorporating clickthrough analytics and page view statistics into the relevancy metrics, so that users' behavior forms an input channel into relevance, but there are still more ways to let users provide input into relevance. --Smalyshev (WMF) (talk) 20:20, 19 February 2016 (UTC)
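
For readers unfamiliar with boost templates, the idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration, assuming a configuration of one `Template name|percentage` pair per line and a plain multiplication of a base score. The function names, template names and percentages are made up for the example, and this is not a description of how CirrusSearch actually applies its rescore formulas.

```python
def parse_boost_templates(text):
    """Parse lines of the form 'Template:Name|150%' into {name: multiplier}."""
    boosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, separator, percent = line.rpartition("|")
        if not separator:
            continue  # ignore lines without a 'name|percent' pair
        boosts[name.strip()] = float(percent.strip().rstrip("%")) / 100.0
    return boosts


def boosted_score(base_score, page_templates, boosts):
    """Multiply the base score by the boost of every template found on the page."""
    score = base_score
    for template in page_templates:
        score *= boosts.get(template, 1.0)  # unknown templates leave the score alone
    return score


# Hypothetical configuration: names and percentages are illustrative only.
config = """
Template:Featured article|200%
Template:Good article|150%
Template:Stub|50%
"""
boosts = parse_boost_templates(config)
print(boosted_score(10.0, ["Template:Featured article"], boosts))  # 20.0
print(boosted_score(10.0, ["Template:Stub"], boosts))              # 5.0
```

The point of the sketch is only that a wiki can express "this content is high quality" or "this content is low quality" without touching the scoring code itself.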

Data
I see two fundamentally different types of data that could be exposed via Wikis: single source authoritative and user-created data.

The user-created data source is similar to Wikidata, where information is contributed directly to the wiki-based storage. This path requires each data contributor to justify the legality and validity of each change, as well as track the history of every data contribution. While this presents a considerable task from organizational, social and legal perspectives, it is actually relatively easy technologically, especially if the data is roughly a few megabytes in size. We could create a namespace "Data" capable of storing JSON/GeoJSON and CSV data, with the subpage "/doc" used for localizable metadata and documentation (similar to Lua module pages).

Unlike user-created data, the single source data comes entirely from some well known source that community deems reliable and acceptable. The community or developers could perform some data transformation/filtering/parsing/cleanup before making it available to the various tools, such as graphs/maps/search/query, but those transformations do not alter data's meaning, license, or authoritativeness. For example, transformations would extract the needed portion of the public API's result and convert numbers and datetimes to Wiki format. The OSM data is an example of a more complex transformation, except that OSM required considerable custom infrastructure to make it user-accessible. Ideally, external data should be community configurable without the need of any WMF involvement.

P.S. See also I Dream of Content and implementation steps. Also, in phabricator: in-wiki CSV & similar, 3D storage. --Yurik (talk) 17:07, 21 January 2016 (UTC)
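
As a rough illustration of the "single source" transformation described above (extract the needed portion of an API result and normalise numbers and datetimes without changing their meaning), here is a minimal sketch. The shape of the API response, the field names and the sample values are all hypothetical, and reading "Wiki format" as a 14-digit `YYYYMMDDHHMMSS` UTC timestamp is an assumption made for the example.

```python
import json
from datetime import datetime, timezone


def to_wiki_timestamp(iso_string):
    """Convert an ISO 8601 datetime string to a 14-digit YYYYMMDDHHMMSS UTC string."""
    parsed = datetime.fromisoformat(iso_string)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when no offset given
    return parsed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def transform(api_response_text):
    """Keep only the needed fields and normalise their types; values are unchanged."""
    payload = json.loads(api_response_text)
    rows = []
    for record in payload["results"]:
        rows.append({
            "name": record["name"],
            "population": int(record["population"].replace(",", "")),
            "updated": to_wiki_timestamp(record["updated"]),
        })
    return rows


# Hypothetical API response; the structure and values are made up.
sample = json.dumps({
    "meta": {"source": "example", "page": 1},
    "results": [
        {"name": "Exampleville", "population": "12,345",
         "updated": "2016-01-21T17:07:00+00:00", "internal_id": 42},
    ],
})
print(transform(sample))
# [{'name': 'Exampleville', 'population': 12345, 'updated': '20160121170700'}]
```

The transformation drops fields the tools do not need (`meta`, `internal_id`) and changes representation only, so the data's license and authoritativeness are unaffected.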

Wikidata as meta store
Wikidata could potentially work as a sort of a meta-data store for both types of data storage, recording source URL, content license, geo coordinate and the country for the geoJSON data, etc. --Yurik (talk) 17:32, 22 January 2016 (UTC)
 * It doesn't have to be Wikidata... commons could support this sort of thing or somewhere else. For Wikidata, I think there are questions about how the data would fit, etc. Aude (talk) 08:21, 17 February 2016 (UTC)

Comprehensibility
I've gone through the page and, being BOLD, edited it for grammar. This should make it easier for non-native English speakers, or non-Native techies to understand (Diff). I've removed all contractions ("we've") and replaced some buzzwords ("relevantly licensed", "going forward", "leverage", "gaming"...). I've also fixed a bunch of things that were just incorrectly conjugated/spelled/agreed ("different then Google's Customer Search" -> "different to Google's Custom Search", or "users experiences" -> "users' experiences"). But most importantly, I've replaced all instances of "surface" as a transitive verb. The way that it was used here - "to surface an idea/wikipedia article" is a metaphorical usage of the format "to surface a sunken ship" but Discovery Department documents are the only time I've seen it used to mean "to make an idea more visible". Perhaps it's common within the world of people who think about search-engines, but it's really unusual as a real-world English usage (see, for example, the closest Merriam Webster gets is its definition 3.2 http://www.merriam-webster.com/dictionary/surface ). Please reconsider all future usages of the phrase "to better surface" and replace it with something more commonly understood, like "to help identify" or "to make more visible" etc.

That all notwithstanding, this sentence still doesn't mean anything and I have no idea what to change it to. Can you explain what you're talking about here? "We do want to be very sensitive to not bias our users' experiences with any kind of content and allow our communities to help steer this."

Wittylama (talk) 13:53, 5 February 2016 (UTC)
 * Thanks Wittylama for giving this page some love. I agree that your edits are helpful in conveying the intended ideas. I wonder if the use of "surface" was some unconscious attempt to use a word other than 'discover'? We may never know.
 * Surface, discover, etc. have been pretty common terms in discussions that I've had around finding and searching for content, which is likely why I wrote it. We can certainly change it to a different word if it's confusing. Thanks for the feedback. Tfinc (talk) 17:44, 16 February 2016 (UTC)


 * As for that sentence I think the intent (I'll ask other watchers of the page to confirm/correct) is that we want to have a neutral experience with content that might appear in search. In short, we want to be cautious in including information in search results from other sources. It has to be neutral in point of view. We also want to make these decisions of what to include in results with the communities.
 * Exactly, if we add any information to our index it should be supported by our community. So say we were tasked with surfacing OSM-related data in search for articles; we'd want to make sure our community was supportive of that. Tfinc (talk) 17:44, 16 February 2016 (UTC)


 * I am going to make one correction to your edits. Elasticsearch is the name of a specific bit of software. I'll add a hyperlink on the first instance of the word for clarification. CKoerner (WMF) (talk) 17:14, 5 February 2016 (UTC)


 * Thanks Wittylama for the non-native speaker point of view. It's very difficult to know what will and won't be clear to others who are reading in a foreign language. You changed "our communities" to "the editing community", and I changed that back to just "the community". I don't think we are limiting ourselves to the needs of editors—though they are much more likely to be commenting here. Readers who don't edit are also an important part of the Discovery team's user base.


 * I also changed a couple of Britishisms to their more typically American variants to preserve consistency throughout, and added some examples to the last "open question" that hopefully make it clearer what we mean by "multiple projects". TJones (WMF) (talk) 18:49, 5 February 2016 (UTC)
 * Thanks for those fixes TJones (WMF) - a good example of 'people in glass houses...' :) Wittylama (talk) 14:20, 10 February 2016 (UTC)

Abortive searches
Good to hear you are looking at null results in searches. Are you also looking at people who looked at the Wikipedia result so briefly that it obviously wasn't what they wanted? One of the easy wins in search is to publish lists of popular search terms that don't currently have an obvious Wikipedia article. In some cases people will be able to create redirects to resolve them. There is a broader issue regarding languages that are written in scripts other than Latin. I recently met some Georgian Wikipedians in Tbilisi and they explained that one of their main problems is that many Georgians don't have a Georgian script keyboard and instead use Latin or even Cyrillic keyboards. Someone searching in the Georgian Wikipedia might well type in the Latin or Cyrillic scripts, and it should be fairly easy to render that search into Georgian script. WereSpielChequers (talk) 13:50, 13 February 2016 (UTC)
 * WereSpielChequers, I like your username, very witty. Yes, the team is looking into searches that return zero results. The nature of wiki and privacy concerns make things a little complicated. On one hand, if we don't have an article for the search term, visitors are given a redlink. Sometimes that's good, as we don't have an article and need one created; other times we have an article, but the search phrase was not accurate enough or our search was not smart enough. We also want to be careful about producing a list of 'missed' search terms. It's very hard, if not impossible, to sanitize everything that gets put into that box. Folks could accidentally copy-and-paste their Social Security Number, National Insurance number, passwords, and other private information. If we scoop that up in some report, even with every humanly imaginable attempt to remove those items, we run the risk of sharing something we shouldn't.


 * We have a big epic task with a goal of cutting the rate of zero results in half, and you can track our metrics on the Search Metrics Dashboard to see how we're progressing over time.


 * I like the feedback regarding searching in scripts non-native to the language in question. I can't speak to the details there, but I will pass it along to the engineers and get a task created (probably on Monday, I'm taking a moment this morning to let you know I saw your questions). Cheers! CKoerner (WMF) (talk) 15:25, 13 February 2016 (UTC)
 * Thanks CKoerner, I appreciate the risk of accidentally searching for your password etc, but that's why I was suggesting we look for popular search terms. If your password winds up as one of the most common search terms then having someone create a redirect to the article on you is the least of your worries. Equally the community is capable of going through quite a long list of popular unsuccessful search terms and creating articles or redirects for many of them. But this is a seriously long tail project, and long before you get to most people's mobile numbers or safe combinations you will have dropped below the minimum number of hits for anyone to bother about creating a redirect WereSpielChequers (talk) 23:29, 14 February 2016 (UTC)
 * No problem, WereSpielChequers. I dug around the billions of tasks in Phabricator and stumbled across T100330, which sounds close to, but is not quite, what you're suggesting regarding scripts. I created a new task, T127003, to track progress on that front. As for a top list of terms, I created another task to at least have the conversation with our engineers. It's T127002. CKoerner (WMF) (talk) 17:57, 15 February 2016 (UTC)
 * We are currently working on detecting the language of the query better. One of the next steps could also be supporting transliteration (i.e. converting from one script, usually Latin, to something like Cyrillic or Georgian). This is not easy due to the multitude of transliteration standards and the uncertainty about when to even attempt to transliterate, but it is an interesting direction. --Smalyshev (WMF) (talk) 20:58, 19 February 2016 (UTC)
 * Thanks Smalyshev, I appreciate there are many possible transliteration requirements in existence, and even, as you say, different standards. Would it be simpler if this was chapter driven, i.e. particular communities such as the Georgian one asking if we could support a particular transliteration? Alternatively, would it be worth testing this on some wikis with the same "searching for xxxxx did you mean to look for YYYYY" that I often get when searching for typos? WereSpielChequers (talk) 19:21, 22 February 2016 (UTC)
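
Two ideas from this thread can be sketched together: listing only those zero-result queries that were seen at least a minimum number of times (the long-tail argument above), and offering a Georgian-script "did you mean" for a query typed on a Latin keyboard. This is a minimal sketch under stated assumptions: the function names, the sample log and the threshold are hypothetical, and the mapping is a naive, partial one-letter table that ignores the competing transliteration standards mentioned above.

```python
from collections import Counter

# Partial, naive Latin-to-Georgian table (an assumption for illustration only;
# real schemes need digraphs and must resolve ambiguous letters).
LATIN_TO_GEORGIAN = {
    "a": "ა", "b": "ბ", "g": "გ", "d": "დ", "e": "ე", "v": "ვ", "z": "ზ",
    "t": "თ", "i": "ი", "l": "ლ", "m": "მ", "n": "ნ", "o": "ო", "r": "რ",
    "s": "ს", "u": "უ",
}


def popular_zero_result_terms(queries, min_count):
    """Return (term, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


def suggest_georgian(query):
    """Return a Georgian-script suggestion, or None if any character is unmapped."""
    converted = []
    for ch in query.lower():
        if ch == " ":
            converted.append(ch)
        elif ch in LATIN_TO_GEORGIAN:
            converted.append(LATIN_TO_GEORGIAN[ch])
        else:
            return None  # do not guess when the table has no answer
    return "".join(converted)


# Hypothetical log of queries that returned zero results.
log = ["tbilisi", "Tbilisi", "tbilisi ", "hunter2", "batumi"]
for term, count in popular_zero_result_terms(log, min_count=3):
    print(term, count, suggest_georgian(term))  # tbilisi 3 თბილისი
```

Terms typed only once or twice, such as the accidental password in the sample log, never reach the published list, which is the privacy argument made above. A real threshold would be far higher than 3.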

Cross wiki searches
To someone searching for information, a dictionary, an encyclopaedia, a travel guide and a set of quotes are all possible data sources. Currently our default is to silo by wiki, but that artificially hides much of what we have. It would be good if people could see search results across multiple wikis. You'd need the option to opt out generally, by wiki and in individual searches, but it would serve readers better (and encourage cross-wiki editing). If you also enabled people to set search preferences to include which languages they can read and what order they prefer them in, then you would be offering a much more interesting and nuanced service. WereSpielChequers (talk) 13:59, 13 February 2016 (UTC)
 * See T109957. --AKlapper (WMF) (talk) 09:44, 15 February 2016 (UTC)
 * T112351 too!. CKoerner (WMF) (talk) 14:27, 15 February 2016 (UTC)
 * That was a rather flippant response. Apologies for the brevity. What I meant to say was, "We're working on it! What more would you like to see?" CKoerner (WMF) (talk) 18:19, 15 February 2016 (UTC)
 * Thanks. T112351 could be broadened out to also include Wikiquote, maybe even simple wiki. In effect bringing each language version closer together. Allowing for multiple languages is non obvious to Brits and Yanks, but outside the anglosphere multilingualism is the norm. WereSpielChequers (talk) 20:13, 15 February 2016 (UTC)

Proximity
I assume in wikiTravel, but also in Wikimedia Commons and sometimes I expect in Wikipedia, there would be a great benefit in having a button that shows you "More Nearby". The Geograph solved this many years ago, and since their contents are compatibly licensed it would be worth asking them about their search software. There are some primitive workarounds on Wikipedia such as lists of neighbouring villages, so the desire is clearly there; it is just that the search technology is years behind. When I'm working with Geograph images I often use their search rather than ours for that very reason, even though it doesn't include those Commons images not sourced from the Geograph. If it can be done at a very fine level, a matter of feet rather than tens of yards, then it becomes of interest to our partners in the GLAM sector, and an added benefit to them of openly licensing media and releasing it on Commons. WereSpielChequers (talk) 14:08, 13 February 2016 (UTC)
 * Are you looking for something similar to Special:Nearby but maybe more robust? Tell me more. CKoerner (WMF) (talk) 18:18, 15 February 2016 (UTC)
 * I wasn't aware of Special:Nearby and haven't seen any articles use it. Having now looked at it it seems to be a list of articles that are near your "current location", which in my case is a mile or so from where I am. That's interesting and I can see several uses of it, though grouping things by direction would help. To be useful to Wikipedia readers and be something we could put in articles it would need to pick up location from the article, and group nearby things by direction. For village articles a nearby option that showed other villages would be cool. WereSpielChequers (talk) 09:21, 16 February 2016 (UTC)
 * Special:Nearby does support this but it is not easy to find this feature (e.g. en:Special:Nearby). I have a script to add an icon (currently a globe) above the coordinates in the top right of a page that links to Special:Nearby for the page. Aude (talk) 08:18, 17 February 2016 (UTC)
 * That link prompts my PC to ask for my location. I was thinking of the opposite, a feature that worked from the location code in an article to give you other articles that are located near that one. Special:Nearby works on your location. WereSpielChequers (talk) 17:04, 23 February 2016 (UTC)
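
The request in this thread (start from the coordinates in an article rather than the reader's location, and group nearby articles by direction) can be sketched as follows. This is only an illustration: the haversine distance and initial-bearing formulas are standard, but the function names, the article list and its coordinates are hypothetical, and it says nothing about how Special:Nearby is implemented.

```python
import math

EARTH_RADIUS_KM = 6371.0
COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def compass_direction(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, rounded to one of eight compass points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    bearing = math.degrees(math.atan2(y, x)) % 360
    return COMPASS[int((bearing + 22.5) // 45) % 8]


def nearby_by_direction(origin, articles, radius_km):
    """Group articles within radius_km of origin by direction, nearest first."""
    groups = {}
    for title, lat, lon in articles:
        d = distance_km(origin[0], origin[1], lat, lon)
        if 0 < d <= radius_km:  # skip the origin article itself and distant ones
            direction = compass_direction(origin[0], origin[1], lat, lon)
            groups.setdefault(direction, []).append((round(d, 1), title))
    return {direction: sorted(items) for direction, items in groups.items()}


# Hypothetical article coordinates, as if read from each article's location data.
village = (51.50, -0.10)
others = [
    ("North Village", 51.55, -0.10),
    ("East Village", 51.50, 0.00),
    ("Far Town", 53.00, -0.10),
]
print(nearby_by_direction(village, others, radius_km=10))
```

With these made-up coordinates the result has one entry under "N" and one under "E", and "Far Town" is left out because it lies beyond the 10 km radius.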

Look alike
Some months ago I saw a demonstration in the British Library in London of various image processing software that had been tested on a set of a million images from the library. The best was very effective at learning from a set of images such as ships and finding lots of similar images. It would be enormously useful when categorising images on commons to be able to look for "similar images to this one". Slightly more ambitious, and possibly beyond easily available software, would be to highlight part of an image and search either for similar shapes or materials. WereSpielChequers (talk) 23:00, 14 February 2016 (UTC)
 * There seem to be a few open-source implementations of image identification, though unfortunately fewer than there are closed ones. Besides building or finding a library to enable this, we also have unique issues at our size, mainly around integration and scale. A million static images is very unlike the 30+ million that are in constant flux on Commons. Our servers would weep at the thought of the processing power required.
 * I think this is way ahead of us at the moment. Many things would need to be done to lay a framework for such a feature. However, I hope other team members will refute this with news that I am wrong. CKoerner (WMF) (talk) 18:34, 15 February 2016 (UTC)
 * Oh, hey, I just stumbled across your suggestion for the community tech wishlist! You're already ahead of me on this one. There's a task already created tracking this work too. T120759 CKoerner (WMF) (talk) 19:24, 15 February 2016 (UTC)
 * I don't want to slaughter server kitties, and I appreciate that this might have significant hardware implications if we implemented it too quickly. But a Beta release, one limited to power users until Moore's Law had caught up with us, would likely make a huge difference to tasks such as categorising our vast backlog of images on Commons. If a university team could run this on a high-end PC with a million files a year ago, then the first part of my suggestion is already practical or will be within years. The "slightly more ambitious" last sentence of my suggestion might be years away and require some software investments, but I think it is practical. WereSpielChequers (talk) 20:21, 15 February 2016 (UTC)
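
To make "similar images to this one" a little more concrete, here is a toy sketch of one simple technique, an average hash compared by Hamming distance. This is not the software demonstrated at the British Library and not a proposal for Commons. It assumes each image has already been reduced to a tiny grayscale grid, all names and pixel values are hypothetical, and it does not address the scale concerns raised above.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a grayscale pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def most_similar(query, library, limit=3):
    """Rank library images by hash distance to the query image, closest first."""
    query_hash = average_hash(query)
    ranked = sorted(
        library.items(),
        key=lambda item: hamming_distance(query_hash, average_hash(item[1])),
    )
    return [name for name, _ in ranked[:limit]]


# Hypothetical 2x2 grayscale thumbnails (0 = black, 255 = white).
ship_a = [[10, 200], [10, 200]]
library = {
    "ship_b": [[20, 180], [30, 190]],
    "tree": [[200, 10], [200, 10]],
}
print(most_similar(ship_a, library, limit=1))  # ['ship_b']
```

Hashes like this can be computed once per file and stored, so a lookup compares short bit strings rather than re-processing images, although they only catch near-duplicates and not "all pictures of ships".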

Spectroscopic analysis
Different rocks and different samples of metal have distinctive impurities and trace elements that can be used to identify source quarries and mines for that object. I suspect that lighting conditions and in some cases camera technology may limit the usefulness of much of the media we currently have for this sort of search and discovery. But it would be of interest to academics and probably others if we could group archaeological artifacts by source deposit. This really would lead to new discoveries. WereSpielChequers (talk) 23:00, 14 February 2016 (UTC)
 * I think, like content-based image retrieval, this is one that is technically challenging. It also has a much smaller impact on the broad swath of contributors and visitors to Wikimedia projects. Sounds like an idea for a potential grant, but you'd have to make a compelling argument. Happy to help review anything you submit. CKoerner (WMF) (talk) 18:37, 15 February 2016 (UTC)
 * I agree this might not be possible in the short term, but it would be an interesting direction to head in, and it would be an interesting service to offer our partners in the Education and GLAM sectors if it could be done. So the user group might be niche, but very influential. WereSpielChequers (talk) 08:54, 16 February 2016 (UTC)

3D images
There is already software in existence that can create 3D models from multiple photographs of the same object. Obviously we should offer this on Commons, and also support holding 3D models on Commons. This may require creating an open source format for 3D images. Bonus features would include mapping multiple paint schemes onto the model; side-by-side displays so that comparisons can easily be done, such as Spitfire v FW 190; and morphing sequences of different models in succession, such as up and down phylogenic trees and their design equivalents, such as in ceramics and steam engines. WereSpielChequers (talk) 23:13, 14 February 2016 (UTC)
 * See T3790 for related discussion. --AKlapper (WMF) (talk) 09:41, 15 February 2016 (UTC)
 * That brings up one of the most depressing quotes I've seen for a while, "since this task was created in 2005". The opportunity cost of Flow, AFT, Gather and assorted IT mistakes, white elephants and wrongly prioritised WMF hobbyhorses is that an awful lot of good ideas have languished for years in Bugzilla and Phabricator. WereSpielChequers (talk) 20:04, 15 February 2016 (UTC)
 * Yes and I wish I could do something to undo all of that WereSpielChequers. The fact that it's high on the community wish list and an active task does give me hope that progress will be made. It looks like even today it was added to the Reading team's radar to review. If you know other Wikipedians who could constructively add to the discussion and task, please encourage them to do so. Having that activity helps with motivation. Also, I'll try to keep my ears/eyes open about progress - ping me if you feel like things aren't moving as they should. CKoerner (WMF) (talk) 20:21, 15 February 2016 (UTC)
 * More semi-good, potentially positive, let's not get our hopes up, but interesting news. Nemo mentioned updating the list of accepted submissions to Wikimania on the mailing list. One of the sessions that was accepted is titled, "Dynamic SVG for Wikimedia projects: Exploring applications, techniques and best practice for interactive and animated vector graphics". There is a mention of "a simple 3D object viewer". CKoerner (WMF) (talk) 20:30, 15 February 2016 (UTC)
 * I've pinged the editor who introduced me to the idea of 3D images. He's aware of how many years the community has been trying to get 3D support enabled on Commons, so I don't want to raise his hopes unduly. WereSpielChequers (talk) 11:02, 16 February 2016 (UTC)

Proper public-domain & CC-search across several domains
I'd like to suggest a number of sites for inclusion, especially a number of image repositories that are essentially not reachable through current search protocols. No-one provides this functionality, and Google's CC/PD image search is abysmal (really). It only covers the obvious sources such as Pixabay, freestockphotos.biz and Flickr.

I've personally used the search string on Google images: ***searchterm*** site:www.plos.org OR site:peerj.com OR elifesciences.org OR bmjopen.bmj.com OR www.biomedcentral.com OR sagepub.com/journals/Journal202037/title/* OR springeropen.com OR site:*.gov OR site:molecularbrain.com OR behavioralandbrainfunctions.com OR www.scirp.org/journal/jbbs OR etsmjournal.com -site:openi.nlm.nih.gov -site:lookfordiagnosis.com -site:nlm.nih.gov -site:ncbi.nlm.nih.gov

This string includes a number of open repositories of med/bio-images which can be used on Wikipedia. It is far from all-inclusive and only a very crude tool as each image must later be manually checked so that it is freely licensed. This could hopefully be automated in some way so that certain standardized licensing tags could be included in the Discovery tool.

I think prototyping this type of tool would do wonders towards including the Wikipedia community, and would be an excellent first step in showing how Discovery can be used by Wikipedians. Hopefully this could dispel quite a lot of animosity towards the project.

Other sites that carry CC/PD-content that can be used on Wikimedia-projects (without proper Google-integration) are:
 * http://www.deviantart.com/ — currently has no search tool that allows users to differentiate between CC and proprietary content — each page with CC-type content is tagged accordingly (e.g.: http://dembsky.deviantart.com/art/Pixel-Social-Store-Icon-Set-255259854 – visible at bottom right)
 * https://openclipart.org/

As part of a community effort I would very much like to take part in your discussions concerning this, and if there is any possibility or if you are interested in discussing the myriad of available sources please feel free to e-mail me.

Best, CFCF (talk) 21:57, 27 February 2016 (UTC)
 * Related is the FIST tool. Legoktm (talk) 07:38, 28 February 2016 (UTC)