Talk:Wikimedia Discovery

From MediaWiki.org

Activity?

According to m:Wikimedia Foundation Engineering reorganization FAQ, "Search and Discovery" is a team, not an activity. If so, AFAICS it shouldn't use that template and it should be named Wikimedia Search and Discovery or similar (available information on "and" vs. "&" and casing is inconsistent). --Nemo 17:11, 30 April 2015 (UTC)

+1. The other point is that "search" redirects to "Search and Discovery" (in my view a reason to delete the redirect: when searching for "Search", I don't normally expect to find a team's page), and for "mediawiki search" in Google this page has a relatively good rank, but it doesn't describe a function in the MediaWiki software or (as you said) an activity. So I suggest moving this page to a page with a clear title. --Florianschmidtwelzow (talk) 08:00, 13 May 2015 (UTC)
The page has been moved to Discovery, so no longer has the word "search" in it, nor "&". --KSmith (WMF) (talk) 22:59, 10 June 2015 (UTC)
Nemo just moved this page, without consultation. I don't necessarily object, but some discussion would have been polite. Unlike all the other Wikimedia xxx team pages mentioned at team prefix:Wikimedia, Discovery is an entire Department, like Reading and Editing. --KSmith (WMF) (talk) 05:35, 27 July 2015 (UTC)
There is a discussion about WMF team pages, their names and locations, at meta:Meta:Babel#Pages for WMF teams and departments which I hope you will contribute to. Rogol Domedonfors (talk) 20:59, 27 July 2015 (UTC)

CirrusSearch component

There are several interesting reports in the CirrusSearch component. In Phabricator I see quite some activity around recent things, while reports formerly considered normal/high priority (i.e. useful to improve search results) are mostly inactive. Does someone plan to go through all the reports and triage them? It would probably take less than a day for one of the ElasticSearch persons. --Nemo 11:28, 28 August 2015 (UTC)

user:Nemo bis, have you noticed any improvement in the management of older reports? John Vandenberg (talk) 04:39, 6 January 2016 (UTC)

Community Liaison job opening at WMF

Hi. There's a new job posting for a Community Liaison to work with the Discovery department. Please pass it along, if you know someone who might be interested or a good fit. Thanks. Quiddity (WMF) (talk) 20:06, 21 October 2015 (UTC)

Offsite homepage report

This page links to http://ironholds.org/misc/homepage_presentation.html , which I initially thought was a broken page because the '>' link is not easy to find (ironic for the Discovery team?) due to color scheme used. I assume it was work for WMF, and therefore should be posted onto a Wikimedia server, preferably a public wiki, and not a private website without a free content license. Could that report be posted on MediaWiki, User:Okeyes (WMF), please? John Vandenberg (talk) 02:58, 6 January 2016 (UTC)

To my knowledge there's nowhere to host this kind of report. If you can point me to a public Wikimedia server that lets me host arbitrary HTML content, I would be interested to hear where. I have tried to export it as PDF but the format doesn't make it easy; I'll probably put some work into putting it together as a more structured report a la our other ones.
Your claim that it lacks a free content license is not the case; it is MIT-licensed and openly released, as is all of my work. That it lacks a copyright template is because I do not believe in releasing my work with any restrictions, which means CC-0 or MIT. I would be interested to know how you encountered the report (I've had multiple people poke me about it so I assume it's being discussed somewhere?) Ironholds (talk) 23:43, 6 January 2016 (UTC)
I encountered it because the link is on this page. I'm not aware of any other discussions occurring about it. Your website doesn't mention that it is MIT, and it doesn't link to https://github.com/wikimedia-research/wp_home , so there is no way a reader can ascertain its copyright status. A PDF of the content on Commons would be great, even if some of the functionality is lost. Is it possible to post the HTML using GitHub Pages? That way the rendered version is more clearly linked to the repo where the license is declared. John Vandenberg (talk) 23:56, 6 January 2016 (UTC)
I'll see if I can generate an HTML version for that, sure, but I'd much rather the PDF, which I'll work on today. Thanks, Ironholds (talk) 16:05, 7 January 2016 (UTC)
The report can now be found here. Ironholds (talk) 02:28, 9 January 2016 (UTC)

Wikipedia portal only

Is wikipedia.org the only portal that is under the purview of Discovery? This is strictly a clarification question about scope, as everything I see related to this appears to be wikipedia.org only, but the other portals are not explicitly excluded from the scope.

I assume wikipedia.org is the only portal with traffic significant enough to warrant it being a legitimate target for optimising knowledge pathways at present, but will improvements made to it also trickle down to the other portals? John Vandenberg (talk) 03:03, 6 January 2016 (UTC)

Caveat that the product manager can and will provide a better answer, which might contradict this one: all the portals are within our demesne. If changes also improve the other portals, absolutely. Most of the changes we're looking at are around UX design and should transfer nicely. Ironholds (talk) 16:06, 7 January 2016 (UTC)

Congratulations on the Knight Foundation grant!

Great to hear about your success! I was wondering if you would be willing to share the full application that you sent in on-wiki? Having examples of successful applications can help other Wikimedia organizations in their work with external project grants. We have recently started a list to gather positive examples here. Kind regards, John Andersson (WMSE) (talk) 15:58, 8 January 2016 (UTC)

I think there's a better approach than that. Keep in mind that at large funding levels between major partners, the process is not "send in an application and hope for the best" but a series of meetings to explain and gain mutual understanding about a set of objectives. Training in that for chapters by the people who were involved is a great idea!--Jimbo Wales (talk) 11:06, 1 February 2016 (UTC)

Organizing Main Discovery Page

Should each project (Search, Portal, Maps, etc.) have a landing page that talks more in-depth about the work (a la Wikipedia.org Portal Improvements)? Right now search goes to an extension page, but that's not really what search is all about. It's the technical implementation of a much larger corpus of work. I think things could be a little more balanced but wanted some feedback. CKoerner (WMF) (talk) 01:27, 29 January 2016 (UTC)

Platypus

There is a demo [1] from some French students in theoretical computer science. They wrote an open source project whose aim is to create an open source question answering framework and a demo of it. Just for the info. --Molarus (talk) 09:04, 31 January 2016 (UTC)

See: Wikimania 2015 - State of Wikidata.pdf

Discovery vs. original research

What is the difference between "discovery" and "original research"? Between "discovery" and "search"? The terms "search" and "original research" are well understood in the context of Wikipedia, but what, exactly does "discovery" mean? Is it a concept of science fiction? Wbm1058 (talk) 04:15, 1 February 2016 (UTC)

I am speaking only for myself and I am not a staff member. "Discovery" is a broader term than just "search". On Wikipedia, people discover things in a number of ways: the most basic links to other articles, series of articles, categories, sequences and timelines, the front page, and, yes, the search box. Another aspect of "discovery" is how people find Wikipedia - search engines, links from other websites, re-use of our content by people who link back to us, sharing on social media, etc.
We can also think of "Discovery" in the context of readers and in the context of editors. Currently, as an editor, if I visit an article without an image and I think "Gee, I wish this had an image" then I probably go to commons and use the search box there. Can we make that process easier and more efficient? Currently, as an editor, if I see a link to an outside source I may wonder what other Wikipedia entries link to that source. Can we make that process easier and more efficient? Etc.
"Discovery" has nothing to do with "original research" - which is an entirely different concept and entirely different concern.--Jimbo Wales (talk) 11:05, 1 February 2016 (UTC)

Is the knowledge engine a tool for data mining? Does it use machine learning or genetic programming for the purposes of knowledge discovery? Wbm1058 (talk) 04:29, 1 February 2016 (UTC)

The WMF has discontinued the use of the term 'knowledge engine' - presumably because it was causing people to ask just this kind of question. In my view, we can think of the entire workings of Wikipedia and the Wikimedia projects, including the editors, the software, discovery elements, APIs, etc. as a global "knowledge engine". But that's just a way of thinking, not a specific plan.--Jimbo Wales (talk) 11:05, 1 February 2016 (UTC)

About "I probably go to commons and use the search box there.": Actually I did just that yesterday. I have to say first, that I'm an experienced editor. I was looking for an icon, but I didn't know what icons are there. I started with the searchword "icon" and moved then to the categories commons:Category:Icons. With categories I can search, without knowing the right name of the file. I don't know if the searchbox will ever do the same, but I understand that new editors don't know categories. By the way, I think there is AI software about tagging pictures. Categorizing new pictures by software could be a help, I think, at least I remember that I had read somewhere that this is a big part of the work commons editors have to do. PS: I do searching this way in Wikipedia too. PPS: Maybe another aspect of search is that I'm using sometimes Wikipedia to search for searchwords. Since I'm no native English speaker, I don't know the right English word. Therefore searching in WP and switching from one language into the other is sometimes the first step before going to a search engine. I have learned this while I was researching things in the internet for writing articles. Now I'm doing this quite often. --Molarus (talk) 12:48, 1 February 2016 (UTC)

My view is that both "knowledge engine" and "discovery" are ambiguous terms, so I'm not sure switching from one to the other is helpful. Perhaps "super search" or "enhanced search" would be better.
Maybe just view "discovery" as a "code word" while in the R&D phase, and wait until actual product(s) emerge from this to give them more permanent names.

I see that one example of enhanced search would be "what links here" both to and from external websites, and other Wikimedia projects. Wbm1058 (talk) 16:28, 3 February 2016 (UTC)

Hello Wbm1058, I'm the Community Liaison for the Discovery team. Happy to help answer any questions you or other folks have about the work the team is doing. I thought it might be helpful to clarify a few things you've mentioned.
You're right that the team name of Discovery is a bit ambiguous. That's intentional. The work the team is doing is not exactly "just search"; they're working on the discovery, or finding, of information across the various Wikimedia projects. Like how our Editing team takes care of editing, the Reading team, well, readers. Discovery is the team looking at those folks looking for information. Could be search, could be embedding editable maps, could be how people enter into our projects - heck it could even be how people use our API to pull data out of our projects for analysis and use elsewhere.
So Discovery is the group of projects that we're all working on. I'll talk about a few briefly, but you can learn more on the main team page (something I'm working on improving). One project is Search - making it easier to search on say English Wikipedia without having to go back to a search engine to find what you're looking for. One small example is improvements to the suggestion tool (demo here) that is more lenient and allows for things like misspellings (happens to the best of us). Future ideas might be showing images from Commons or quotes from Wikiquote when you search for something like "Albert Einstein".
Another is the portal. Did you know that millions of folks visit wikipedia.org every day? Not a specific language wiki, but that landing page. That's a big opportunity to introduce the projects to people who might be totally new to the movement, in new and useful ways. We've already done some testing (using A/B tests) to show that a few small improvements can result in more visitors finding content within our projects. We have a draft article we hope to share with folks soon you can read for more info (and I'd love any feedback on the article itself!).
We are looking at ways of learning from users of our technology to understand how it's being used and how to improve it. I appreciate your feedback and thoughts and hope you continue to join us in the journey. CKoerner (WMF) (talk) 19:46, 4 February 2016 (UTC)

Completion suggester

Hello, I've been reading a very interesting mailing list thread on the new completion suggester.

Rather than talk about the weight of pageviews, I was wondering another thing: why does the search return suggestions on the first character? Shouldn't it return a list on the third character?

I mean, if I type "p", I will almost surely type more characters. Why request suggestions so quickly? --NaBUru38 (talk) 02:57, 10 February 2016 (UTC)

Hello NaBUru38, which mailing list thread? I'm still learning and would love to be aware of the conversation. Other folks who are more knowledgeable might chime in here, but here's my two cents. I think part of the reason we start completing with a single character, like "p", is that we want people to get feedback on their search immediately instead of delayed by a certain character limit. Another reason is that we actually have articles to point people to that are only a single character long, like the article on English Wikipedia for P, or P, the American alternative rock band! CKoerner (WMF) (talk) 16:23, 10 February 2016 (UTC)
Hello, I meant this. --NaBUru38 (talk) 23:31, 10 February 2016 (UTC)
I just saw your response NaBUru38. That made me chuckle. We have to be careful, apparently computers have dirty minds. :) CKoerner (WMF) (talk) 21:37, 15 February 2016 (UTC)
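The trade-off raised in this thread (suggest from the first character, or wait until the third) can be illustrated with a minimal sketch. It is not how the completion suggester is implemented; the function name `suggest`, the `min_chars` parameter, and the sample titles are all made up for illustration. It only shows that a three-character minimum would make a one-character title such as "P" impossible to suggest.

```python
from bisect import bisect_left


def suggest(prefix, titles, min_chars=1, limit=5):
    """Return up to `limit` titles that start with `prefix`, ignoring case.

    `min_chars` is a hypothetical knob: prefixes shorter than this get no
    suggestions at all, which is the behaviour proposed in the thread.
    """
    if len(prefix) < min_chars:
        return []
    key = prefix.casefold()
    # Sort once by case-folded title so a binary search can find the first match.
    ordered = sorted(titles, key=str.casefold)
    keys = [t.casefold() for t in ordered]
    results = []
    for i in range(bisect_left(keys, key), len(ordered)):
        if not keys[i].startswith(key):
            break  # sorted order: no later title can match
        results.append(ordered[i])
        if len(results) == limit:
            break
    return results


# Illustrative titles only; not taken from any real index.
TITLES = ["P", "P (band)", "Paris", "Pearl", "Physics", "Q"]

print(suggest("p", TITLES))               # suggestions from the first character
print(suggest("p", TITLES, min_chars=3))  # nothing until three characters are typed
print(suggest("P (", TITLES, min_chars=3))
```

With `min_chars=3` the title "P" never appears, because no three-character prefix can match a one-character title.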

Searching two terms in the same section only

Right now there are people complaining at the Wikipedia Reference Desk that the Reference Desk, they think, is useless and not part of the encyclopedia. I believe they are wrong, in part, because they fail to understand that the daily Q-and-A of the Refdesk is just phase 1 of a multiphase operation. At some point, we need to process the voluminous archives we've accumulated to produce lists of answered questions, in which we've separated each question, rephrased it to be more readable and match the answer we were able to give, and provided specific, sourced answers, that can be more effectively searched.

A very, very basic step in this would be to make it easier to search the archived questions we have now. Presently, if you want to search out something about Jupiter's atmosphere, you get back results for any day where people talked about Jupiter in one question and the Earth's or someone else's atmosphere in another. At the very minimum, I'd like a way to do a search for Jupiter and atmosphere only when they appear in the same section (and by section, in this instance, I mean h2 but ignore h3 and below...)

To be honest, there could be an option in the search for this right now and I don't know - the search documentation is ... someplace... I saw it once... it doesn't really jump out at me when I do a search, now does it? I mean, in a link like this - what newbies are given when they hit the button "search reference desk archives" - they grudgingly spill the beans that there's a "prefix:" magic word, but they certainly don't point you toward the full list.

In general, I think that a Knowledge Engine might have a good use for Refdesk archives. If there is a way that you can point Refdesk users toward what would be the most useful curation to do to make a digest of these records for you to use, it might benefit both initiatives. Wnt (talk) 20:59, 14 February 2016 (UTC)
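The same-section search requested above can be sketched offline, for example against a downloaded archive page. This is not a feature of the on-wiki search; the function names and the sample archive text are assumptions for illustration. The sketch splits wikitext at level-2 headings only (so h3 and deeper stay inside their parent section) and returns the sections that contain every term. Matching is plain case-insensitive substring matching, not real tokenised search.

```python
import re

# Matches "== Heading ==" but not "=== Subheading ===" or deeper levels.
H2 = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", re.MULTILINE)


def split_h2_sections(wikitext):
    """Yield (heading, body) pairs; h3 and deeper headings remain in the body."""
    matches = list(H2.finditer(wikitext))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        yield m.group(1), wikitext[m.end():end]


def sections_with_all_terms(wikitext, terms):
    """Return headings of the h2 sections whose heading or body has every term."""
    wanted = [t.casefold() for t in terms]
    hits = []
    for heading, body in split_h2_sections(wikitext):
        haystack = (heading + "\n" + body).casefold()
        if all(t in haystack for t in wanted):
            hits.append(heading)
    return hits


# Made-up archive page: both terms occur on the page, but together in one section only.
ARCHIVE = """\
== Storms on Jupiter ==
How old is the Great Red Spot?
=== Follow-up ===
What is the atmosphere made of?
== Earth's atmosphere ==
Why is the sky blue?
== Moons ==
How many moons does Jupiter have?
"""

print(sections_with_all_terms(ARCHIVE, ["Jupiter", "atmosphere"]))
```

A whole-page search would count this page as a hit for both terms because of the second and third sections; the section-level check keeps only the first.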

Yes, the prefix search we currently use is rather limiting for pages not in the main namespace. In the example conversation you gave on the Reference desk, is the problem that the individual did not search first? I'm not very familiar with that corner of the English Wikipedia and would appreciate clarification. CKoerner (WMF) (talk) 19:29, 15 February 2016 (UTC)
@CKoerner (WMF): The link I give is just if you hit "search archives" from a main Refdesk page like this one. People should search before asking questions, yes... but that's not really the whole story. The problem is, if you go to the talk page from the Science desk I just linked, you'll see a lot of ...... the kind of people who have too much influence on these projects nowadays, complaining the Refdesk is worthless or at least not an encyclopedia because it only helps one person. But, I don't think it should help only one person. I think we've accumulated this huge database of answered questions, which we could improve further with a lot of work by editors. But a big obstacle to motivating that is that the search is so poor. It's just too hard to pick out a relevant preceding question, which means that the same questions get asked and re-asked, while others who don't want to wait around for an answer are just bouncing and we never know about it. Wnt (talk) 01:39, 18 February 2016 (UTC)
Also, we are working on improving the way searching (and the results within) work. There was a short blog post about it a few months back. That was before I joined the team so I'm not super familiar with it, but I'm happy to reach out to other team members for more information if you'd like. CKoerner (WMF) (talk) 19:31, 15 February 2016 (UTC)

Knowledge Engine by Wikipedia

It seems that this is the working title of the project [2]. It would be helpful to know the precise relationship between this and the various actions planned and discussed here. One important question which needs an answer somewhere is what Curation means in this context. The proposal uses the phrases "openly curated", "public curation mechanisms", "curation of that data". May we know who you envisage undertaking this curation? Are you by any chance assuming that the Knowledge Engine by Wikipedia will be curated by the current Wikipedia volunteer community? Rogol Domedonfors (talk) 22:24, 13 February 2016 (UTC)

Rogol Domedonfors, I'm sorry for any confusion over the phrase "knowledge engine". The FAQ clarifies a bit. There isn't anything being built by that name. It's an old term used mainly (if not only) for the grant. I'll do my best to help with the intent of the word "curation". It's pretty simple, even if our past explanations have been lacking. It means exactly what it means today to the movement. If we want to improve the quality of our search, say by improving the ranking of articles in search results, we'll do so with the communities. "Openly curated", "public curation mechanisms" and the like refer to the already impressive work we've done together. This is a continuation of that tenet into new areas - like improving search across Wikimedia projects. So yes, your assumption is correct. Would you like to help? CKoerner (WMF) (talk) 19:19, 15 February 2016 (UTC)
Thanks for the prompt response. As I understand it then, the WMF put forward a grant proposal for a "Knowledge Engine", and received funding for it, but actually nothing like that is being built. The proposal was made without any kind of consultation with the community, and the WMF assumed, and assured the grant-giving body, on the basis of no consultation whatsoever, that the community would be ready, willing and able to take on this extra curation work as an addition to the work they already do voluntarily. The possibility that the community might decline was not identified as any kind of risk: it seems to have been taken for granted, and money was actually asked for and accepted on that basis. This attitude to the community is unsatisfactory to say the least. You say that you will do this work with the communities. How can you possibly be so sure? Can you please now explain in more detail what new work you are expecting that the communities will be willing to do and give them some reasons to think that they might be willing to take it on? It would be much appreciated if you were to take an attitude that better suggested that you understood that this was a request, and a request that might well be turned down in the current poor state of relations between the WMF and its volunteers, volunteers who are most unlikely to appreciate being taken for granted in this way. Rogol Domedonfors (talk) 20:26, 15 February 2016 (UTC)
I'm afraid Rogol Domedonfors that things aren't so easily black and white in this case. Many people have been involved in the creation of the grant and the related work around search, maps, portals and more. The language has changed, but the intent has not. We want to improve search within and across Wikimedia projects. Our beloved MediaWiki.org included. I do not know if you are aware of the FAQ regarding the grant. It has a section about the element of human curation. What specifically within the grant, Discovery's list of work, and the FAQ do you believe the community would not welcome? These ideas would be helpful in our request for comment and I'd be happy to illustrate concerns for the developers involved as things move forward. CKoerner (WMF) (talk) 21:53, 15 February 2016 (UTC)
The links you give make it clear that nobody is currently able to state, specifically, what that extra work will be (although it is clear that you do expect extra work to be done in "new areas") so you can hardly ask me to say, specifically, what the community would or would not welcome, when you yourself do not know, specifically, what you are asking the community to undertake. You do admit that extra work will be required, and yet you are content to assume that the community will be happy to provide extra effort, and indeed to accept grant money on the basis that extra effort will be forthcoming simply for the asking. Do you really not understand how arrogant that appears? Did you ever test those assumptions -- at what stage did you engage the community with your plans and gain some kind of consensus that the project would be worth the extra work and that there was a broad willingness to deliver that extra effort? Where was that engagement and when did it take place? Was it before or after committing the WMF to the expenditure of a large amount of donor money? Rogol Domedonfors (talk) 22:28, 15 February 2016 (UTC)
You're right, I shouldn't attempt to speak for all volunteers of the Wikimedia movement. No one should. You're also right that at this point in time we do not have a list of every imaginable way we might require curation in regards to search and other Discovery projects. Heck, I might even be wrong in that we need curation from contributors. I also don't intend to assume volunteers will be upset if asked for further curation and involvement. If I understand your concerns, they are in the same vein: that the WMF should work with the communities. We are trying. It's something I take very seriously, as a volunteer and now a WMF staff member. A small example is the work we're starting on trying to update the weight of pages in search results. One of our first tasks, and one that blocks (or prevents) us from rolling out the technological solution, is a task asking for community input.
While I can't speak to the decisions around the grant (it's in the past, before I joined) I can say that my intent with responding to you is to help moving forward. I know it might be perceived as trite, but I hope you believe me.
The entire movement is founded on the idea that everyone pitching in a little can make quite a bit of change. I'll ask you again, what are your specific concerns, how can we address them, and how are we falling short? To further that, if you feel affronted by some past work of the team, has Discovery produced anything without involving the community that you'd like to discuss? CKoerner (WMF) (talk) 23:08, 15 February 2016 (UTC)
My concern, as I have expressed twice already, is that the WMF has embarked on a programme of action which will require extra work from the volunteer community with no engagement with that community and no evidence for the belief that the work required will be forthcoming. That should not have been done, and the WMF needs to acknowledge that mistake and work to repair the damage to the relationship caused by that failure. The way to address that failure is to engage the community as soon as possible, that is now, and not just in small-scale tactical questions, but in deciding whether the programme is broadly speaking workable or not. I think you need to ask the community whether it is willing to support your programme with its efforts before proceeding. Rogol Domedonfors (talk) 07:33, 16 February 2016 (UTC)
There was a presentation of a concept (which was leaked, and which never even became an actual plan), and then there is the grant, which is binding. The "Knowledge Engine" concept that was presented in June had already evolved substantially by the time the grant was awarded (September?), and has continued to evolve since. Many of us believe it was a mistake to even give that presentation without vetting the ideas with community members, so I understand your frustrations about that.
As far as I can tell, the actual grant only refers to curation in one context: "Test results from exploring relevancy through a federation of open data sources, including structured data via Wikidata, and curation of that data with human and machine learning." Humans are already curating Wikidata, so I think that's covered. I am not aware of any actual plans for Discovery work that would rely on any human curation beyond what is already being done.
You are correct that we as an organization need to do much better at addressing all of this. Just recently, those of us who work in (or with) Discovery were specifically encouraged to be as open as possible, which should help. It helped encourage me to write this response. Lila is working in other channels to try to clear up confusion and respond directly to any questions that haven't already been answered. Thank you for your patience here. --KSmith (WMF) (talk) 01:30, 18 February 2016 (UTC)
Thank you for those assurances. Rogol Domedonfors (talk) 21:55, 18 February 2016 (UTC)
I too am concerned about the concept of "curation" in relationship to "enhanced search". Surely y'all are aware of the concept of search engine optimization (SEO), probably more so than I am, and the risk that an army of low-paid editors in "global south" locations motivated to elevate their clients' pages in wiki search results would likely overwhelm western developed country volunteers, unless adequate controls were implemented to prevent that. Also, that unless this "curation" work was really fun and interesting, it would likely take only twelve months to build up an eleven-month "curation backlog". Recall how past requests, such as to curate Article Feedback Tool comments, went. Wbm1058 (talk) 16:43, 26 February 2016 (UTC)
  • I've posted a series of questions on the discussion page of the FAQ, here. I hope they will be answered and incorporated into the FAQ. I understand that there are questions there that you may see as going beyond the scope of the Discovery team's focus; if there are, please do get help from the people who can answer them. Thanks. Jytdog (talk) 16:45, 17 February 2016 (UTC)

Images in search results

Ping Deskana_(WMF) or anyone else.

Having software blindly grab images is a problem. I suggest you pull them off of the search results. Wikipedia is extremely not-censored. We have everything from explicit porn to images of Muhammad to stomach-curdling medical images.

For example, typing pearl n in the search box instantly shows a list with exactly one image, that image is the first search result, and it's an image of a cumshot across a woman's neck. I couldn't even begin to guess how many short simple character-strings will bring up penises and vaginas and assholes and various sex acts. Our lead images on the articles Muhammad and Depictions_of_Muhammad are "innocuous", but other articles very well could lead with offending images. Having them pop up on unrelated searches could get ugly. Alsee (talk) 23:20, 14 March 2016 (UTC)

@Alsee: Thanks for your suggestion (and for pinging me so that I saw it). I agree that some users could find the page images that are chosen objectionable. Unfortunately, what you're suggesting is not going to happen, for several reasons. Firstly, last year I worked with the Design Research Team to perform usability tests on the mobile app when we added these images to search results, which showed that users typically found it much easier to use search when they had additional context from images. Secondly, in an A/B test on the Wikipedia portal, adding these images significantly increased the rate at which users click through to search results, which validates the outcome of the usability testing. Thirdly, the whole point of the relevant policy is to prevent exactly the kind of thing you're requesting, namely censoring content to try to make it acceptable to the masses. I quote the policy,
"Wikipedia may contain content that some readers consider objectionable or offensive—​​even exceedingly so. Attempting to ensure that articles and images will be acceptable to all readers, or will adhere to general social or religious norms, is incompatible with the purposes of an encyclopedia."
So, in summary, I think your intent is laudable and I appreciate it, but sorry, your suggestion goes against both the qualitative and quantitative data that shows these images are useful to users and the relevant Wikipedia policies. --Dan Garry, Wikimedia Foundation (talk) 00:00, 15 March 2016 (UTC)
Wow, I'm one of the most strident advocates of our notcensored policy and even I consider that a strikingly bold interpretation. Notcensored clearly prohibits attempts to remove/filter specific content merely because someone claims it's offensive. Saying that the policy is relevant to decisions about general usage of content (such as here) may threaten to turn up the heat on an issue that has been on a low unending boil. I'm in the middle of a conflict right now with people wanting explicit content removed from an article, and it's currently running significantly in favor of removal because nudity-is-evil-think-of-the-children crusaders show up and don't give a damn about policy. We keep the issue down to a low steady boil partly on the basis that the content is relevant to the specific article, because it's being used for an encyclopedia purpose, and because someone showing up at that article expects to find relevant content on the explicit-topic. When you rip the images out of the article, when they're no longer being used for an encyclopedic purpose (you're using them for mere navigation and completely unnecessary frills), when people who aren't looking for explicit-topics start getting hit in the face with porn, the think-of-the-children crowd are probably going to throw a shitfit. Alsee (talk) 05:29, 15 March 2016 (UTC)
Practical question: If search results aren't allowed to contain images (regardless of whether the reason is "offensiveness" or anything else), then how will people find the images they are looking for at Commons?
Also, Alsee, you should look up the history of the Image filter on Meta. The "think of the children" crowd, and even the "don't force individual crime victims to see photos of violent crimes" crowd, definitely lost. WhatamIdoing (talk) 16:38, 17 March 2016 (UTC)
WhatamIdoing, I was not talking about media searches. I was talking about the new Discovery article search, which starts displaying random images the instant you start typing.
I know the Meta image filter referendum in detail. They "lost", but they didn't get the message. The current situation is so screwed up that right now they are voting for, and on the verge of getting, G-RATED VIDEOS massively removed (enwiki) project-wide as spillover from their battle to remove videos they don't like. One of the best defenses against the "children" crowd comes from that referendum - Principle Of Least Astonishment. Someone deliberately going to an article on an explicit-topic cannot be ASTONISHed to find relevant content there. Automatic images in basic article search throw that out the window. Someone typing as few as TWO LETTERS can ASTONISHingly get hit in the face with a closeup photo of genitals, and three letters brings up countless sex acts up to and including artwork of a woman having sex with an animal.
The "children" crowd don't give a rat's-ass about notcensored policy except to subvert it or make it GONE. The last thing we need is for "moderates" to join the attack against notcensored policy, because typing two or three letters into the search box ASTONISHingly hits kids in the face with genitals or bestiality. Alsee (talk) 12:35, 18 March 2016 (UTC)

The w:MediaWiki:Bad image list is significantly shorter than the list of all non-free files we have exempted from PageImages. - Hahnchen (talk) 23:46, 17 March 2016 (UTC)

@Deskana (WMF), Alsee, WhatamIdoing, Hahnchen: Yes, Wikipedia is not censored. But we should also try to apply the principle of least astonishment. In many situations there may be several images in an article suitable as the "hero image" for that article. There is a real risk that an algorithm will pick a very controversial image from the article, while an equally or more suitable alternative is also available on the same page. I think the solution is as simple as the problem is complex: give editors the ability to override the choice of the image. That way they can find the right balance between Wikipedia not being censored and the principle of least astonishment.
I do think that adding images to an auto-completing search engine is going to fundamentally increase the tension between 'no censorship' and the principle of least astonishment. So far I've always been able to browse Wikipedia in public without fear of encountering any graphic content without explicitly looking for it. (In fact there are plenty of articles on Wikipedia which I could freely read in the privacy of my own home, but would not be allowed to read at work, or possibly even in public spaces.) I'm not saying that we should thus censor images from search, but the principle of least astonishment would dictate that we should add a clear warning stating that typing something in the searchbox may immediately reveal explicit images. This may in turn cause people to avoid using search. This is a hard problem... —Ruud 01:29, 20 March 2016 (UTC)
PageImages should not return an image if that image appears on w:MediaWiki:Bad image list. Implement that, and then figure out if it's worth returning an alternate image from the target article. - Hahnchen (talk) 23:19, 21 March 2016 (UTC)
No, that would be a direct violation of NOTCENSORED. If that image is in an article then it's supposed to be in the article. The purpose of the Bad Image List is to shut down actual cases of people disruptively posting the images where they don't belong. I think including images in the article search results is a bad idea, but if we are going to include them then it's not acceptable for someone to apply arbitrary standards of which images they find personally offensive. The whole plan for image filtering was killed for good reason after the Image_filter_referendum. There were highly polarized responses, but the basic outcome was that there was no viable implementation other than an opt-in block-everything. Proponents of filtering had no interest in that sort of filter. People who object to those images want them *gone* to "protect the children" from ever seeing them. Adding an opt-in block-all-images to search results is certainly an option, but I doubt there's any actual call for it. Alsee (talk) 09:33, 22 March 2016 (UTC)
I'm fine with images remaining in the article. The article is not censored. There is no impact on content. Excluding w:MediaWiki:Bad image list from PageImages means that those images can only be used in those articles, which is exactly why w:MediaWiki:Bad image list exists in the first place. - Hahnchen (talk) 12:03, 22 March 2016 (UTC)
Selectively filtering images from the interface is as unmanageable as selectively filtering them from articles. What standards of modesty are you going to apply? European? American? Middle Eastern? It should either be 'show all images', 'show none of the images', or 'let the user choose between all or none' (all three having their own downsides). But not 'selectively hide some images by some criterion that's never going to be globally applicable'. —Ruud 17:53, 22 March 2016 (UTC)
Your argument that it would be unmanageable would have more weight had it not already been managed. MediaWiki already supports w:MediaWiki:Bad image list; each wiki, should it choose to implement it, already has a set of images that are limited to appearing only in certain articles. - Hahnchen (talk) 10:04, 23 March 2016 (UTC)
No, the 'bad image list' is only there to prevent vandalism. The only criteria for an image to be added to that list is whether it is used for vandalism or not. If some persistent vandal keeps spamming Rabbit in montana.jpg all over the place, it will get added to the 'bad image list'. That vandals tend to use more offensive images than bunnies, is entirely coincidental. It was never intended to be used—and should not be used—as a tool for censorship. —Ruud 12:15, 23 March 2016 (UTC)
Rabbit in montana.jpg
Rabbit in montana.jpg
Per Ruud. BadImageList (BIL) is used to block actual cases of vandalism that couldn't be resolved by usual methods. It doesn't restrict images to articles. Many images on BIL are used on non-article pages. BIL adds exemptions for any non-abusive use anywhere, up to and including someone merely requesting the image on their user page. Usage in search results would be inherently non-abusive use, which would be instant grounds to add an exemption so the image can appear. (Uh oh, the rabbits are multiplying. They'd hit BIL in just a few generations. And no, I don't plan on any more.) Alsee (talk) 19:52, 23 March 2016 (UTC)

Policy rewrite[edit]

Ping Deskana_(WMF) and Whatamidoing_(WMF).

I'm hoping that there's some innocent explanation here and that there's something I missed, but it looks like your Community Liaison for Product Development unilaterally wrote yourselves an exemption into policy?![3] I searched the policy talk page and Village Pump (policy) and I could find zero community discussion on expanding the permitted usage of non-free content. Alsee (talk) 01:35, 16 March 2016 (UTC)

@Alsee:
  1. User:Whatamidoing (WMF) is not, in any sense, "my" community liaison. She is neither managed by me, nor does she work with the Discovery Department. Chris Koerner is Discovery's community liaison. If you have issues with Whatamidoing's conduct then I cannot help you. In such instances, you should contact her manager, Quim Gil.
  2. The editing pattern you are reporting is not atypical. There's even an essay about it: the BOLD, revert, discuss cycle. Whatamidoing made a bold edit, you reverted it, and now discussion can take place. This all seems perfectly normal and acceptable to me.
  3. Since the implementation and deployment of phab:T124225, the entire discussion about non-free images in search results is pretty much moot; search implementations use the PageImages extension for images, and PageImages only returns free images, so non-free images are no longer used in search results. If the decision is to include this exemption in the policy, then that's good because it does future-proof things in case PageImages starts returning non-free images again, but I doubt that PageImages will be changed in that manner any time soon.
--Dan Garry, Wikimedia Foundation (talk) 23:09, 16 March 2016 (UTC)
I'd also like to add that WhatamIdoing is a long time Wikipedian and is probably one of the most helpful and knowledgeable people when it comes to understanding the policies of the English Wikipedia, the community, and the history around it.
As a member of the WMF staff, Whatamidoing_(WMF) is also one of the best "go to" members of the Community Liaisons team. She is incredibly smart and illuminating whenever I've asked for help. In that sense, she's my liaison to many areas of the communities we work alongside. CKoerner (WMF) (talk) 14:32, 17 March 2016 (UTC)
Actually, Deskana, there is an explicit policy on the English Wikipedia that even policies can be boldly edited. It's described at w:en:WP:PGBOLD. I do a lot of guideline and policy work as a volunteer, and this is a very common approach for me. In this case, my initial edit was refined by a regular at that guideline, and has not been contested by anyone – except Alsee, whose reversion appears to be in direct violation of that fundamental policy. The exact wording in the policy is, "you should not remove any change solely on the grounds that there was no formal discussion indicating consensus for the change before it was made." Alsee might choose either to self-revert or to explain his "substantive reason for challenging it", if any. I've already started a discussion, so it should be very convenient for him to do so, if he actually has a substantive reason.
For the record, I do not edit any policies or guidelines at any Wikipedia as part of my job. WhatamIdoing (talk) 16:33, 17 March 2016 (UTC)
So the outcome of the Design Research Team Usability study was that search images improved the effectiveness of search. Yet phab:T124225 was somehow greenlit for development and deployment, and now, regardless of changes to the non-free media policy, there is doubt that PageImages will show non-free content. What sort of decision making process allowed that to happen? Mobile search is now showing up irrelevant misleading images for the sake of serving up something free. - Hahnchen (talk) 19:46, 17 March 2016 (UTC)
I discovered the change to policy because I was helping the Reading Team get non-free images included in Hovercards. Jkatz (WMF) was eager to explore the possibility, when I suggested it. The idea is that PageImage could return a flag indicating if an image is non-free, then non-free images could be included or excluded as appropriate by the feature that requested an image. In most cases that should be "exclude", but Hovercards is compatible with the non-free image policy. Phabricator task T91683 is addressing the issue of irrelevant or misleading images. Alsee (talk) 20:04, 17 March 2016 (UTC)
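A minimal sketch of the flag idea described above: the image lookup reports whether an image is free, and each calling feature decides whether to use it. The `PageImage` type, the `is_free` field, and the per-feature table are hypothetical names for illustration only; they are not the PageImages extension's actual API, and the "hovercards may use non-free images" entry simply encodes the claim made in the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageImage:
    """Hypothetical result of an image lookup for a page."""
    title: str
    is_free: bool  # hypothetical flag: True if the file is freely licensed


# Hypothetical per-feature policy table; unknown features default to "exclude".
FEATURE_ALLOWS_NON_FREE = {
    "hovercards": True,   # assumption taken from the comment above
    "search": False,
}


def image_for_feature(image: Optional[PageImage], feature: str) -> Optional[PageImage]:
    """Return the image if the requesting feature may show it, else None."""
    if image is None:
        return None
    if image.is_free or FEATURE_ALLOWS_NON_FREE.get(feature, False):
        return image
    return None
```

With this shape, the filtering decision lives in the feature that requests the image rather than in the lookup itself.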
@Hahnchen: To be fair, phab:T124225 was about filtering out a very small number of non-free images from PageImages and replacing them with different, free images; it was not about removing images completely. As such, the findings of the design research are not directly applicable to that matter. I personally agree wholeheartedly with your comments on this matter and think that the solution in T124225 is not ideal, but it's important to note that the matters discussed in T124225 are a mostly separate issue. --Dan Garry, Wikimedia Foundation (talk) 21:09, 17 March 2016 (UTC)
That "very small number" of images includes pretty much every piece of pop culture, which is disproportionately popular on Wikipedia and the Internet as a whole, see w:Wikipedia:Top_25_Report. The hovercard popup and mobile search image for w:10 Cloverfield Lane shows a picture of John Goodman, who isn't even in the lead role, instead of the identifying artwork. And while T124225 is not relevant to the issue of offensive imagery, it is to do with the policy rewrite, which is something that should have been exhaustively explored before committing engineering resources into degrading the user experience. I cannot comprehend the decision making process that greenlit T124225. - Hahnchen (talk) 23:00, 17 March 2016 (UTC)

Semantic and structured search: Wikidata + Wiki(pedia)[edit]

One of the reasons that the search engine tends to give bad results is that, much like Google, it simply tries to search based on the content of the articles or PageRank.

Something that I haven't seen here, and that would be an obvious win, would be to do some data mining on data from Wikipedia to discover obvious connections and how closely related the content is.

One simple use case / example is searching for authors related to Shakespeare, or authors born in Shakespeare's era. Using the Wikidata Query Service it is possible to obtain this information and use it to enhance search, or even to present other useful searches that the user may want to do later based on subject similarity.
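As a rough illustration of the "authors born in Shakespeare's era" example, the sketch below builds a SPARQL query for the Wikidata Query Service. The properties P106 (occupation) and P569 (date of birth) and the item Q36180 (writer) are real Wikidata identifiers; the function name and the choice of 1554 to 1574 as "Shakespeare's era" (ten years either side of his birth in 1564) are assumptions made for the example.

```python
def writers_born_between(start_year: int, end_year: int, limit: int = 50) -> str:
    """Build a SPARQL query listing writers born in [start_year, end_year]."""
    return f"""
SELECT ?person ?personLabel ?birth WHERE {{
  ?person wdt:P106 wd:Q36180 ;   # occupation: writer
          wdt:P569 ?birth .      # date of birth
  FILTER(YEAR(?birth) >= {start_year} && YEAR(?birth) <= {end_year})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


# Assumed window around Shakespeare's birth year (1564).
SHAKESPEARE_ERA_QUERY = writers_born_between(1554, 1574)
```

The resulting string can be pasted into the query service's web interface, or sent to its SPARQL endpoint (https://query.wikidata.org/sparql) as the `query` parameter.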

Actually, even without using Wikidata, it is possible to extract information from infoboxes to determine all authors in the encyclopedia born in a certain year. So to an extent this is possible with some tweaks. One approach taken by Wikia (yes, evil Wikia :) was to develop portable infoboxes, which make these queries easier by using, for example, categories. These little tools store their structured data in the page props, making it possible to do all sorts of nifty queries; see [4]. So one could, for example, search for:

 Writers born in 19XX

It would then prioritize infobox data, pick up the terms / arguments "writer" and "birth_date" (or synonyms), and present more meaningful results. Or better yet, add more metadata about search to templatedata. 09:14, 15 April 2016 (UTC)
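A toy sketch of how a query such as "Writers born in 19XX" could be turned into a structured filter over infobox data. Everything here is illustrative: the record layout, the field names `occupation` and `birth_date`, and the query pattern are assumptions, not how page props or any existing search backend actually store or parse this.

```python
import re

# Illustrative records standing in for structured infobox data.
PAGES = [
    {"title": "George Orwell",
     "infobox": {"occupation": "writer", "birth_date": "1903-06-25"}},
    {"title": "Virginia Woolf",
     "infobox": {"occupation": "writer", "birth_date": "1882-01-25"}},
]

# Matches e.g. "Writers born in 19XX": a plural occupation and a century prefix.
QUERY_PATTERN = re.compile(
    r"^(?P<occupation>\w+?)s born in (?P<prefix>\d{2})XX$", re.IGNORECASE
)


def structured_search(query, pages):
    """Return titles whose infobox matches the occupation and birth-year prefix."""
    match = QUERY_PATTERN.match(query.strip())
    if match is None:
        return []  # not a structured query; a real system would fall back to full text
    occupation = match.group("occupation").lower()
    prefix = match.group("prefix")
    return [
        page["title"]
        for page in pages
        if page["infobox"].get("occupation") == occupation
        and page["infobox"].get("birth_date", "").startswith(prefix)
    ]
```

Synonym handling ("author" vs. "writer") is left out; it would need a mapping layer in front of the occupation comparison.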

Ireland[edit]

On English Wikipedia, the Ireland article is about the island, while the article about the country is at Republic of Ireland. This causes problems for the Wikimedia search-box, which makes it more difficult to directly "discover" the article about the country, as the title of the article about the country begins with the letter "R" rather than "I". A lot of readers must be stopping off at the hatnote on the island before getting to the article about the Republic, because the search box fails to lead them directly to the article about the country.

When I search for "Ireland" these are the search suggestions produced by the software, in order:

  1. Ireland, the island
  2. Ireland national cricket team
  3. Ireland in the Eurovision Song Contest 2009
  4. Ireland–United Kingdom relations
  5. Ireland national rugby union team
  6. Ireland–United States relations
  7. Ireland–New Zealand relations
  8. Ireland and World War I
  9. Ireland national football team (1882–1950)
  10. Ireland in the Eurovision Song Contest 2008

Why doesn't Ireland (country), Ireland (state) or Ireland (republic) even crack the top ten on this list? How is this list constructed, and what can be done to fix this? Is there anything editors can do to fix it? Wbm1058 (talk) 15:13, 28 June 2016 (UTC)

@Wbm1058: My country is ranked low in search... for shame! In seriousness, thanks for the report. There's a number of relevance problems similar to this one; for example, searching for "kennedy" doesn't get you John F. Kennedy in the first page of results. There's no short explanation for exactly why this happens for some queries. As problems like this one are often caused by hacks that were put in place to fix some other queries in the past, it's suboptimal for us to hack a solution together for this one specific case, as that makes our code hard to maintain and often causes the same problem further down the line. Similarly, whilst we could pull some levers and change some parameters to fix this search query, it's just as likely to break a bunch of other searches and not really achieve anything. In Q2 2016-17 (i.e. Oct - Dec 2016) we're hoping to replace the current algorithm, tf–idf, with Okapi BM25, which we're hopeful will fix many of these issues. Hope that helps! --Dan Garry, Wikimedia Foundation (talk) 17:57, 28 June 2016 (UTC)
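For readers unfamiliar with Okapi BM25, here is a small, self-contained sketch of the textbook scoring function. It is not the CirrusSearch or Elasticsearch implementation: the values `k1=1.2` and `b=0.75` are common defaults rather than anything stated in this thread, and the IDF uses the "+1 inside the logarithm" variant so that scores stay non-negative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents containing each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Two properties distinguish this from plain tf-idf: repeated occurrences of a term add less and less to the score (saturation, controlled by `k1`), and a match in a short document counts for more than the same match in a long one (controlled by `b`).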
Okapi BM25, wow that's some complex looking math! Right, I'm not expecting any hacks, just an algorithm that works for these. You should probably confirm that Okapi BM25 fixes this, and if it doesn't, then either tweak it so it does, or go back to the drawing board. I think some knowledge of our disambiguation conventions needs to be factored into any such algorithm. These need some special consideration. If it's done by page views, then page views of redirects need to be added into the views of the redirect's target. With regard to Kennedy, I think {{DEFAULTSORT:Kennedy, John F.}} should leverage such biographies into the right search indexes. Wbm1058 (talk) 18:25, 28 June 2016 (UTC)
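The point about redirects can be sketched as follows: before page views are used as a ranking signal, views recorded against a redirect are added to the redirect's target. The function name and data layout are hypothetical, and the sketch only follows a single redirect hop.

```python
from collections import defaultdict


def fold_redirect_views(views, redirects):
    """Add each redirect's page views to its target's total.

    views:     mapping of page title -> view count
    redirects: mapping of redirect title -> target title (single hop only)
    """
    totals = defaultdict(int)
    for title, count in views.items():
        totals[redirects.get(title, title)] += count
    return dict(totals)
```

For example, if "Ireland (country)" redirects to "Republic of Ireland", the views of both titles end up counted under "Republic of Ireland".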
@Wbm1058: Investigating whether or not BM25 would fix these kinds of problems properly is exactly why this work is planned to take a whole quarter; just switching over would be easy, but it'd break stuff and we wouldn't really know whether it even fixed anything. By the way, I got the dates wrong before, sorry! This work is actually planned for Q1, July - September 2016. :-) --Dan Garry, Wikimedia Foundation (talk) 19:47, 28 June 2016 (UTC)
Just noting the article's default-sort tag {{DEFAULTSORT:Ireland, Republic of}} is probably something that should be latched onto to create the search suggestions: en:Ireland, Republic of. Wbm1058 (talk) 20:18, 30 June 2016 (UTC)
Wbm1058, There's a task for that! CKoerner (WMF) (talk) 17:47, 1 July 2016 (UTC)
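A minimal sketch of the DEFAULTSORT suggestion made in this thread: the sort key is read from the page's wikitext and treated as an extra name under which the page can be found by prefix. The function names and the naive regex are assumptions for illustration; this is not how the existing suggester works.

```python
import re

# Naive match for {{DEFAULTSORT:...}}; does not handle nested templates.
DEFAULTSORT_RE = re.compile(r"\{\{DEFAULTSORT:([^}]+)\}\}")


def suggestion_keys(title, wikitext):
    """Return the strings under which a page should be findable by prefix."""
    keys = [title]
    match = DEFAULTSORT_RE.search(wikitext)
    if match:
        sort_key = match.group(1).strip()
        if sort_key != title:
            keys.append(sort_key)
    return keys


def suggest(prefix, pages):
    """Return titles of pages with any key starting with `prefix` (case-insensitive)."""
    wanted = prefix.casefold()
    return [
        title
        for title, wikitext in pages.items()
        if any(key.casefold().startswith(wanted) for key in suggestion_keys(title, wikitext))
    ]
```

With the tag `{{DEFAULTSORT:Ireland, Republic of}}` on the Republic of Ireland article, typing "Irel" would then surface that article even though its title starts with "R".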

Search prefix:[edit]

Implement multiple parameters to prefix: operator on fulltext searches[edit]

Hi guys, checking in here for the first time in a while, I see it's been quiet here. Hope your project's progressing well since attention has shifted away from the "Knowledge Engine" controversy.

Anyhow, I have a question that I posted at the English Wikipedia's village pump. Hoping that someone from the WMF might have some insight on this. Thanks, Wbm1058 (talk) 15:27, 18 December 2016 (UTC)

Followup question at en:Wikipedia:Village pump (technical)/Archive 152#Search prefix revisited – search with multiple prefixes? -- Wbm1058 (talk) 16:24, 13 January 2017 (UTC)

Must say that I'm disappointed to hear crickets. Where, oh where have you gone, Mr. Rainman, who knew how to develop nifty search systems? Wbm1058 (talk) 17:44, 31 January 2017 (UTC)
I suppose you've read Help:CirrusSearch. So the main issue is with the character limit? In what cases do you need to search subpages of multiple basepages at once, other than those discussion archives templates? Nemo 07:24, 2 March 2017 (UTC)
No, the main issue is not the character limit. It's possible that might become a secondary issue in complex requests to search several different prefixes, if the main issue is ever resolved.
Note the clunky search box implementation I added at the top of en:WT:Requested moves. Enhancements to allow a cleaner looking implementation on that page would be appreciated. Wbm1058 (talk) 17:37, 8 March 2017 (UTC)

Search in Finnish[edit]

I was happy to notice the other day that CirrusSearch now provides much better suggestions (and typo corrections) than Google does, at least for Finnish: for instance w:fi:Special:Search/palatka pariisin succeeds where https://www.google.com/search?q=Palatka+Pariisin fails (finding w:fi:Palatkaa Pariisiin! which has a different number of a and i). --Nemo 17:43, 30 December 2016 (UTC)