All required

"Recieved 3 ratings. At least 48 results must be rated". Ugh, that's tedious! I wish I could only rate the results I feel most certain/whatever. Nemo 21:59, 9 May 2016 (UTC)

Ok, I skipped a dozen and then I found one query where I could mark nearly everything "maybe" (articles which might have contained information on that topic, or not) except a couple "probably" and one "relevant" (the article by that name). I'm really confused by how you expect a person to rate 48 search results, which would require opening them and seeing if they happen to contain something of relevance or not; and also by the presence of non-mainspace pages. Nemo 22:07, 9 May 2016 (UTC)
@Nemo bis: The 80% scoring requirement comes from a couple of angles. The initial concern is that the human judgements will be plugged into nDCG to generate scores when evaluating changes to our ranking formulas. In this model, unrated results are equivalent to irrelevant results. Having only 3 or 4 rated results out of the realm of possibilities makes the entire effort less likely to inform changes in ranking. Secondly, we can't take results from a single judge as definitive; we need to look at inter-judge agreement. Initially we are aiming for 5 judgements per query, but this may go higher or lower depending on what we find. For the reasons above, and probably others, a high minimum scoring requirement is industry standard in judgement platforms.
With respect to needing to open all the links, I am hopeful that by providing query snippets (the same ones you see in a search results page) graders can make a reasonably informed decision without opening the link. This is part of the reason we are using only 4 levels of relevance, as opposed to the 10+ that Google uses in their own judgement platform. These 4 levels of relevance (top 5, top 20, plausibly related, and irrelevant) should allow users to rate most results based only on the information provided in the snippet without needing to open the link; at least that is the hope. If you didn't notice the snippets, they are shown when clicking the down arrow next to each title. I will soon be adding an 'open all snippets' button which might make this more discoverable.
Overall, our version of the scoring platform is, in my opinion, simpler and easier than what Google, Microsoft, and other search engines use. For these reasons it is also less powerful, but we need to start somewhere. For example, Google's scoring platform has a 140-page instructions document, with more than 10 different relevance levels, that must be followed when rating results. Additionally, Google requires graders to take into account location data and mobile vs. desktop use when considering relevance. The difference, though, is that other entities have the ability to pay people ($0.01 - $0.03 per ranked result) as an enticement to follow the more complicated guidelines.
Thanks for the feedback. We have a weekly relevance meeting and I'll bring this up; perhaps we can reduce the scoring requirement to 50%, or make the instructions clearer with regard to snippets, or perhaps something else. We know that human judgement of relevance is one of the most important tools, if not the most important, used by major search engines to inform changes to their search result rankings. More so than clickthroughs, and more so than engagement metrics. We will be working hard to make our platform easy enough to entice users into actually rating results while also providing us the information we need to improve search.
EBernhardson (WMF) (talk) 16:44, 10 May 2016 (UTC)
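The nDCG point in the reply above can be illustrated with a small sketch. This is not the platform's actual scoring code: the numeric gains assigned to the four relevance levels, the linear-gain DCG variant, and the page titles are all assumptions made for illustration. The ideal ordering is computed only from the results in the ranked list, which is a common simplification. Unrated results are given a gain of 0, the same as "irrelevant".

```python
import math

# Assumed numeric gains for the four relevance levels (hypothetical mapping).
GAIN = {"irrelevant": 0, "plausibly related": 1, "top 20": 2, "top 5": 3}


def dcg(gains):
    """Discounted cumulative gain, linear-gain variant: sum of gain / log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_titles, ratings):
    """nDCG of one ranking; any title missing from `ratings` counts as gain 0."""
    gains = [GAIN[ratings[t]] if t in ratings else 0 for t in ranked_titles]
    ideal_dcg = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two candidate rankings that differ only in the order of results C and D.
ranking_1 = ["A", "B", "C", "D"]
ranking_2 = ["A", "B", "D", "C"]

# A grader who rated a single result versus one who rated all four.
sparse = {"B": "top 5"}
full = {"A": "irrelevant", "B": "top 5", "C": "top 20", "D": "plausibly related"}

print(ndcg(ranking_1, sparse), ndcg(ranking_2, sparse))  # identical scores
print(ndcg(ranking_1, full), ndcg(ranking_2, full))      # ranking_1 scores higher
```

With the sparse ratings the two rankings are indistinguishable, because C and D both count as irrelevant; only the fuller set of ratings shows that placing C above D is the better order. That is the sense in which a handful of rated results tells us little about a ranking change.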
I did not see the snippets. I now see they are available after clicking the arrow next to the title, which would still require a minimum of 97 clicks for each page. Perhaps show the snippets by default? That would halve the wrist fatigue. Nemo 16:51, 10 May 2016 (UTC)
I gave it another try. In my opinion it's impossible to give meaningful scores to the results for a generic query like "controls" or "JFK", while for instance "dative latin" is very easy because one article is clearly the best fit (on the other hand, Special:Search/dative_latin clearly tells us we're scoring the keywords' vicinity too much and/or the words in headers too little). Nemo 17:04, 10 May 2016 (UTC)

Link Explosion

The snippets definitely make it easier to determine whether the result is relevant, but sometimes they aren't enough, which is what the "link" link is for. As Nemo mentioned above, sometimes you need to open a lot of them, and it is kind of a drag. I'm not sure if there's a natural UX solution for this problem, but as a baseline I have a tool I wrote for myself that opens 20 links at once with the click of a button. There are three batches of 20 links, and opening a new batch closes the last batch if it is still open, and there's a button to close all opened tabs. Since I built it for myself, the UI isn't necessarily obvious to anyone else, but the JavaScript is straightforward. Does a "link explosion" button to open 10+ links at once have any appeal to anyone else? (Of course it wouldn't work on mobile, but on my laptop I'd certainly use it.) TJones (WMF) (talk) 13:35, 23 June 2016 (UTC)
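The batching behaviour described above can be modelled roughly as follows. This is only a sketch of the bookkeeping, not the tool itself: the class name, the method names, and the example URLs are hypothetical, and a list stands in for real browser tabs (an in-browser version would presumably open and close windows from JavaScript instead).

```python
BATCH_SIZE = 20  # batch size mentioned in the comment above


def make_batches(links, size=BATCH_SIZE):
    """Split the result links into consecutive batches of at most `size`."""
    return [links[i:i + size] for i in range(0, len(links), size)]


class LinkExploder:
    """Hypothetical model: tracks which links are 'open' instead of driving a browser."""

    def __init__(self, links, size=BATCH_SIZE):
        self.batches = make_batches(links, size)
        self.open_tabs = []

    def open_batch(self, index):
        """Open one batch; any previously opened batch is closed first."""
        self.close_all()
        self.open_tabs = list(self.batches[index])
        return self.open_tabs

    def close_all(self):
        """The 'close all opened tabs' button."""
        self.open_tabs = []


# Placeholder URLs, for illustration only.
links = [f"https://example.org/wiki/Result_{n}" for n in range(60)]
exploder = LinkExploder(links)
exploder.open_batch(0)
exploder.open_batch(1)  # batch 0 is closed before batch 1 is opened
print(len(exploder.batches), len(exploder.open_tabs))
```

Keeping at most one batch open at a time caps the number of tabs at 20, which is the property that makes the button tolerable on a laptop.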