User:EBernhardson (WMF)/Notes/Accept-Language

November 2015 - Questions and comments are welcome on the talk page


This is currently a work in progress and is not complete.

Hypothesis
Using the first non-English language listed in the Accept-Language HTTP header will provide a good proxy for the language a query is written in when that query returns no results against the wiki it was originally run against. Further, this is a better proxy than the existing Elasticsearch langdetect plugin.

Process
Started by taking an hour's worth of queries plus Accept-Language headers for enwiki from the Hive wmf.webrequests table using the following query. The specific day and hour to work with were chosen arbitrarily. This gives us a set of 180,298 queries to start with, which we will use to calculate the expected change to the zero result rate. This feels much too low to be the total number of full text queries on enwiki for that hour, but it is probably a reasonable number to run this test against.

The result of this query was then run through the following PHP script to filter out queries that had an invalid Accept-Language header recorded, or whose header only included English. Run against the above set of 180,298 queries, we end up with 28,929 queries that could be affected. This means around 16% of our search queries to enwiki contain a non-English Accept-Language header.
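The filtering step can be illustrated with a minimal Python sketch. This is not the PHP script referred to above. It assumes that "first" means first in header order (q-values are validated but not used for ranking), that a header with any malformed entry is treated as invalid as a whole, and that only the primary subtag (for example `de` from `de-AT`) matters. The function name `first_non_english` is hypothetical.

```python
import re

# One header entry: a basic language range (or "*"), optionally with a q-value.
_ENTRY = re.compile(
    r"^([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)"
    r"(?:\s*;\s*q=[01](?:\.\d{0,3})?)?$"
)


def first_non_english(header):
    """Return the primary subtag of the first non-English language in the
    header, or None if the header is empty, malformed, or English-only."""
    if not header:
        return None
    tags = []
    for entry in header.split(","):
        match = _ENTRY.match(entry.strip())
        if match is None:
            # Assumption: one bad entry invalidates the whole header.
            return None
        tags.append(match.group(1).lower())
    for tag in tags:
        primary = tag.split("-")[0]
        if tag != "*" and primary != "en":
            return primary
    return None


if __name__ == "__main__":
    print(first_non_english("en-US,en;q=0.8,de;q=0.6"))
    print(first_non_english("en-US,en;q=0.5"))
```

A query is kept only when this function returns a language code. The 16% figure above follows directly from the two counts: 28,929 / 180,298 ≈ 0.160.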

These queries are then run against the enwiki index we have in the hypothesis-testing cluster to see which are zero result queries. This was done with the following command line:

The results of this were filtered down to only the zero result queries, giving ???? queries to test our original hypothesis against.

For the next step I needed a map from the languages to their wikis. This was sourced from.

These queries were then separated out into one file per wiki using the following PHP script:
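As a rough illustration of this step (again in Python, not the PHP script mentioned above), the sketch below groups queries by target wiki and writes one file per wiki. The input shape (pairs of query and language code), the `<wiki>.txt` file naming, and the example map entries are all assumptions made for the illustration; the real language map is the one referred to in the previous step.

```python
import os
from collections import defaultdict


def split_by_wiki(rows, lang_to_wiki, out_dir):
    """Write one file per wiki, one query per line.

    rows is an iterable of (query, language_code) pairs. Returns the
    number of queries written for each wiki.
    """
    grouped = defaultdict(list)
    for query, lang in rows:
        wiki = lang_to_wiki.get(lang)
        if wiki is not None:  # skip languages with no known wiki
            grouped[wiki].append(query)

    os.makedirs(out_dir, exist_ok=True)
    for wiki, queries in grouped.items():
        path = os.path.join(out_dir, wiki + ".txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(queries) + "\n")
    return {wiki: len(queries) for wiki, queries in grouped.items()}


if __name__ == "__main__":
    # Hypothetical map entries and queries, for illustration only.
    example_map = {"de": "dewiki", "fr": "frwiki"}
    example_rows = [("berlin wetter", "de"), ("paris météo", "fr")]
    print(split_by_wiki(example_rows, example_map, "queries_by_wiki"))
```

Queries whose language has no entry in the map are dropped silently here; logging them instead would make it easier to spot languages missing from the map.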

Now that we have all the zero result queries that have a usable Accept-Language header broken out into files per wiki, we can run them with the following: