Discovery/Status updates/2017-12-11


This is the weekly update for the week starting 2017-12-11.


  • We've officially closed out the Discovery Department wiki page (including the Portal work) and added a new Search Platform wiki page.
  • The automation of the Portal page is now complete. Going forward, as part of the automation process, two tickets will continue to be updated for the stats and translations weekly updates. More information on how the automation is done can be found in the documentation on Diffusion. 🎁 🎉
  • "Gnomes and trolls and hobgoblins (oh my!)—Failed queries and the vicarious fear of missing out"
    • Check out Trey's latest blog post! [1]



  • We closed out the large ticket that tracked much of the backend data engineering and plumbing work we've done so far for machine learning to rank. [2]
  • We also closed out a ticket that fixed an issue with some features being hidden by the completion suggester, such as sub-phrase matching on Wikisource. [3]
  • We've ported all our Selenium tests from Ruby to Node.js for the Search Platform team [4]
  • Trey finished a long review of the available Serbian morphological libraries (full report here). [5] [6]
  • We tuned Wikidata prefix search. [7]
  • We added the ability for searches to return extra match data in API responses (Wikidata will use it). [8]


  • Categories are now loaded into Blazegraph automatically every Monday.


  • Jan finished up several small tasks related to the automation of the portal page [9]:
    • Updated the technical documentation for the portals repository [10]
    • Fixed a minor UI problem with rendering the small print in English and Russian [11]

This status was last updated 2017-12-12.

Tech: Search Platform

1. Implement advanced methodologies such as “learning to rank” machine learning techniques and signals to improve search result relevance across language Wikipedias.

  • Begin to automate the machine learning pipeline, starting by targeting eight to ten languages, other than English, that match (at a minimum) current performance and then deploy those models. (DONE)

2. Improve support for multiple languages by researching and deploying new language analyzers as they make sense to individual language wikis.

  • Investigate available open source language software and see if it can be converted into Elasticsearch plugins. (DONE)
  • Investigate usage of fall-back languages. (DONE)
  • Investigate fuzzy (phonetic) matching. (IN PROGRESS)
  • Continue general language support. (ONGOING)

3. Investigate how to expand and scale Wikidata Query Service to improve its ability to power features on-wiki for readers.

  • Work on sub-category filtering and searching within the Wikidata Query Service. (ONGOING)

4. Address technical debt:

  • Convert existing Selenium tests to Node.js. (DONE)
  • Investigate ownership and maintenance of Logstash. (DONE)

Structured Data on Commons

1. Commons search will be extended via CirrusSearch, Elasticsearch, and Wikidata Query Service to support searching based on structured data elements describing media.

  • Determine advanced search requirements and measures for Structured Data on Commons. (NOT STARTED YET)

2. Advanced search capabilities (e.g., Wikidata Query Service, SPARQL queries) will be updated to support the more specific media search filters and the relationships to the topics they represent.

  • Begin work on prefix and full-text search in Elasticsearch on Wikidata in preparation for the Structured Data on Commons project. (ONGOING)


The Wikidata Query Service goal for this quarter is to work on sub-category filtering and searching within the service. Stas and Guillaume will maintain the service to support its continued growth and use, and the Analysis team will help with statistics.



Update the portal codebase to be completely automated for ease of ongoing maintenance.

  • Automate portal project updates: statistics and translations (DONE)


Support the move to be more operationally centralized and roll out a new map style that has numerous updates and enhancements.

  • Finalize and deploy the new map style; replicate the maps test cluster in Wikimedia Cloud Services; monitor for critical bugs. (IN PROGRESS)


The team will continue to work closely with the Search Platform team to analyze A/B tests and other assorted data. They will also begin determining a baseline set of metrics for Structured Data on Commons. (IN PROGRESS)