Talk:Wikimedia Discovery/FAQ

Be clearer on data sources
The first mention of "data sources" is
 * and incorporating new data sources for our projects

But that just links to a map, which seems to be a different way to display search results. Please give actual potential data sources instead of an unclear link, thanks. -- SPage (WMF) (talk) 19:47, 9 November 2015 (UTC)
 * The section below calls out that it's supported by external OSM data. That data corpus includes items (buses, trains, etc) that are outside of what our elastic indices include. RU WikiVoyage and soon EN WikiVoyage will default to our tiles and are already starting to surface transit, points of interest, and articles for discovery of new content. As for other data I cite 'census, national gallery, etc' in https://www.mediawiki.org/wiki/Wikimedia_Discovery/FAQ#If_you.27re_adding_new_data_sources.2C_isn.27t_that_a_search_engine.3F but that's really up for a community discussion of what data sources can help in the same way that OSM did Tfinc (talk) 19:57, 10 November 2015 (UTC)


 * If Fox News or TeleSUR have the slightest chance of appearing as data sources of this searching project, I will campaign to stop it. --NaBUru38 (talk) 14:27, 15 February 2016 (UTC)
 * I'm right there with you NaBUru38. There are data sources, like OpenStreetMap, census data, and other bits of useful open data that we could pull into search results. These would provide a richer search experience - on wiki - than what we currently have. Of course, we want to engage early and frequently with the community to determine what would be acceptable to include. I created a task (T126980) to track this concern. I encourage you to reference it as we move forward.


 * To be honest, we probably won't get around to this for a while. We have more immediate improvements in this quarter and are working on our longer-term plans, which is where something at this level most likely resides! CKoerner (WMF) (talk) 15:38, 15 February 2016 (UTC)

OpenStreetMap is not incorporated into search (e.g. in elastic) and is separate. Of course, search results could be displayed on a map with OSM tiles. Parts of OSM data could also be used as an overlay for Wikivoyage, but I don't think it should be mentioned as part of the grant and "search engine" in this way. Aude (talk) 19:36, 17 February 2016 (UTC)

orphan
This page has been up for a week but, as of the time of writing this literally nothing links to this page, it's an orphan. That's kind of ironic given that it's a strategy document for the 'discovery' team :-) Is there a plan for when this 'not-a-knowledge-engine' strategy will be announced widely? Wittylama (talk) 23:26, 9 November 2015 (UTC)
 * Thanks for the feedback. I've linked it from https://www.mediawiki.org/wiki/Wikimedia_Discovery so it's no longer an orphan. Next we'll be adding a set of wiki pages to complement the discussions that have been happening on Phabricator, email lists, and on wiki to bring it all together Tfinc (talk) 21:13, 10 November 2015 (UTC)
 * Also would love your feedback on the Discovery Roadmap linked in the FAQ. https://www.mediawiki.org/wiki/File:Discovery_Year_0-1-2.pdf Tfinc (talk) 21:17, 10 November 2015 (UTC)

"Are you building Google?"
This FAQ included the question "Are you building a search engine?" But after a complex edit history from a few days before the November 2015 Board of Trustees meeting, that is not at all the question that is addressed; specifically, the answer begins by stating "We are not building Google," and then includes a couple of sentences I basically don't understand. This does not even come close to addressing the important question "Are you building a search engine," and leaving the section title intact is IMO highly deceptive to the casual reader. I'm not qualified or positioned to improve the answer (though I think that should be done). But I do think it's important that the question reflect what is actually said, which is why I have (for the second time) changed the question title to "Are you building Google?" Pinging whoever reverted this the first time. Happy to hear your thoughts, but I hope you can at least agree that this question is not addressed in this section? -Pete F (talk) 03:57, 14 February 2016 (UTC)
 * "There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors. - Jeff Atwood"
 * Discovery is not building a competitor to Google in the sense that Google searches everything it can find. We are trying to improve search across Wikimedia projects to provide better results. That's it. Imagine being able to search for "leaning tower in Italy" and not only see the Wikipedia article in your language, but photos from Commons, a map of its location, and information on its physical properties from Wikidata.
 * I'm new to all of this. Let me ask around and see if we can get some clarification. CKoerner (WMF) (talk) 18:11, 15 February 2016 (UTC)
 * Pete F As you may be aware a blog post went up yesterday. Lila's also hosting an AMA-style (ask me anything) on Meta. I believe Tomasz is looking into an interview as well with the Signpost. I don't know if any of that helps with clarity, but I wanted to highlight the efforts to bring some clarity to things. I'm updating the FAQ with some of the questions that have been asked on this talk page and others. Again, more than happy to help where I can. CKoerner (WMF) (talk) 19:42, 17 February 2016 (UTC)

More questions
The answer to the first question is very unhelpful. Whatever you are calling it, please actually describe it. That is what all the questions below are about. In what follows, please replace "Knowledge Engine" with whatever it is you are calling it. The name is not what is important. What you are working toward, is important.

From what I have been able to piece together, the Knowledge Engine is a) a bunch of data, contained in or linked to the Wikidata database; b) an interface to receive a query; c) algorithms that create and display an answer to the query based on the data, formatted sort of like a WP article.

I have no idea how KE search results are envisioned to relate to existing WP content; if the notion here is just to archive existing content, or somehow fragment it and import it all into Wikidata, or if existing content will somehow remain in existence and available to the public. I have no idea if the WMF intends to put any further energy into making existing WP content more available to the public in the form in which it currently exists. Please address all that.

I don't understand what role the existing editing community is intended to play in all this.

If I am way off track, would you please describe in some detail what the Knowledge Engine actually is imagined to be, and what it will do, and how it will relate to the Wikipedia that exists today -- concretely, so it is understandable to the average reader of WP? (without technobabble) Three key questions there.

More concretely

1) Please rewrite the question "Would users go to Wikipedia if it were an open channel beyond an encyclopedia?" in plain English.

The "Wikipedia" we all know is an encyclopedia full of articles created and maintained by people.

I don't know what an "open channel" means.

I don't know what "an open channel beyond an encyclopedia" means. The question implies that the "encyclopedia" won't exist anymore - instead, the "open channel" would exist. Is that what is envisioned? If so, what happens to the Wikipedia content that exists today?

2) What would a Knowledge Engine search result look like? Are this and this prototypes of what you have in mind?

3) How exactly does Wikidata fit into the Knowledge Engine vision?  Is the vision here that results like those above will be created on the fly by the algorithms based on Wikidata when someone makes a query to the Knowledge Engine?  And that WMF will aggregate a bunch of other source data into Wikidata, or link to, or whatever?

4) Will there be any content curated by editors as there is today, or will the editing community become curators of Wikidata?

Please incorporate answers into the FAQ.

Thanks Jytdog (talk) 01:22, 17 February 2016 (UTC)
 * Good feedback Jytdog, thank you for taking the time to share. I updated the FAQ a little. Please have a look and let me know if there's something that is still not clear. CKoerner (WMF) (talk) 22:41, 17 February 2016 (UTC)
 * Thanks for your reply!
 * Let me first say that if you are not aware of it (and I can't believe you are not :) ) your management's behavior has generated a lot of distrust.  So I am looking for clear answers that make sense in light of other information that is out there, primarily the actual Knight grant agreement.
 * Second, I have worked a lot with grant agreements in academia, and I understand how they work, and I know that you cannot change the scope unilaterally.  Would you please let me know either way,  if WMF and Knight have amended the agreement to change the scope?  I am assuming the scope has not changed....
 * Third, I understand you are not using "Knowledge Engine". What do we call this?
 * OK, specific responses.


 * a) A bunch of the changes made today don't address my questions which were "where is this going - what is the vision" questions. Many of the answers  are down in the weeds of what is happening in this phase of discovery.  If that is mostly what you are going to talk about here, that is fine, but please be explicit about that so I can ask somewhere else, where I can get the answers I am looking for.  Please don't waste my time.   What I am looking for is  a simple statement of the big picture, like this: "the whatever-you-are-calling-it system is envisioned to be a) a bunch of data in the WMF domains and outside it that are linked to the Wikidata database; b) an interface to receive a query at wikipedia.org; c) algorithms that create and display an answer to the query based on the data, formatted sort of like a WP article sort of like this."  Something high level and understandable.  Can you do that?


 * b) You simply removed reference to the "Would users go to Wikipedia if it were an open channel beyond an encyclopedia?" question, which is part of what you are meant to be exploring in this phase of the grant and probably the most alarming aspect for me.   I don't see that you re-wrote this and moved it elsewhere.  Why have you removed the reference to this?   What would be really helpful would be an FAQ like "What is 'open channel beyond an encyclopedia'?"  with a clear explanation.


 * c) I understand that the current WMF line is that the KE is meant to search "Wikimedia projects." This contradicts what WMF said in the grant application, which is nowhere limited to WMF domains but instead talks about "the internet".    Again, unless you have amended the agreement, you have to be dealing with "the internet."  Now my sense is that you want to link other freely-available data sources on the internet to Wikidata or have the "KE" directly query them too.  But please give an answer here that deals with "the internet" in reasonably plain English.


 * d) WMF is putting out what are to me misleading statements saying "What are we not doing? We’re not building a global crawler search engine." and here you have an FAQ that says "Are you building a search engine to replace Google?" Nobody thinks you are doing either thing, and it is a bit frustrating.  (I find the former especially... bad as I have never seen anyone imply you are building a "crawler")   But to your FAQ...  I don't think that anybody thinks that WMF ever intends to do all the things that Google does nor even all the things a Google search can produce (e.g. show me relevant flights or movie times).   But the grant makes it very clear that the WMF finds great fault with "commercial search engines" and is proposing to do something much better - more transparent, more privacy, and not driven by money.


 * I think you are intending to provide "better" answers for certain kinds of queries than commercial search engines provide and the WMF wants people to come to wikipedia.org (our main search page) to do them. And as is noted explicitly in the revised FAQ,  there is also a desire to keep users who are already within WMF domains, within them, instead of losing them when they can't find stuff and having them go off and do a Google/Bing etc search.  Wanting people to come to wikipedia.org instead of Google/Bing/etc for certain queries, and not losing users who are already here to Google/Bing/etc, is competing with them.  It just is.


 * Would you please revise the FAQ to deal with the heart of the matter? The current FAQ is really a distraction and doesn't ask nor answer the real question. A question like: "Why would I search with the envisioned search engine instead of Google's or Bing's"  or "How would the results of a search through our engine be different than a search in Google or Bing or other search engines?"  might be good.


 * e) You don't say anything about what a "query result" is envisioned to look like. Please do. (this is really important to me, at least)   Without understanding what a search result is envisioned to look like, and whether it would lead to actual WP articles or if it will lead to a machine-generated "knowledge graph" or mini-article, I cannot make sense of this whole thing.  I really can't, and this is a very key issue about what people will find when they come to "Wikipedia"   Will they find an encyclopedia, or will they find a "channel beyond" it?   Please do clarify.


 * f) This is a funny edit note. But see above. And please note that I have seen this video which looks an awful lot like content created by a "robot" (to go with your funny term) in response to a query, and I am aware of Approach 6 discussed here.  This really is a big vision thing - is the WMF walking away from having search point to articles and making article content more available the public?  Where are they taking us?   I really (really!) do see the value in having search work better and many other benefits to what I think the KE could do.... I just don't see the vision of how that fits with the WP-that-exists.  That is what I am looking for.  And I understand that you might not be the one to articulate that.  But someone needs to.  Please point me to them if it is not you.  Thanks.


 * And thanks for your patience. Jytdog (talk) 03:20, 18 February 2016 (UTC)

Structure of the page
Merging in the Knight grant stuff makes sense, but probably not at the very top of the page. I would propose having a few top-level sections, like Knowledge Engine, Knight Grant, and Discovery work. And then all the existing questions could fit within one of the top-level buckets. While the term "knowledge engine" is interesting from a historical perspective, and the Knight grant is interesting from a transparency perspective, I suspect a lot of people just want to know what the Discovery department has been doing, is doing now, and plans to do in the future. --KSmith (WMF) (talk) 00:04, 18 February 2016 (UTC)
 * Yes, sorry for any confusion there KSmith (WMF). I saw we had two FAQs with related topics and wanted to bring clarity to things. If you want to take on the restructure, please do. CKoerner (WMF) (talk) 16:17, 18 February 2016 (UTC)
 * What's appropriate here? These are conversations and I'd hate to move things around and upset anyone. CKoerner (WMF) (talk) 18:13, 18 February 2016 (UTC)
 * Nevermind, I've been looking at too many Talk pages today and got confused. CKoerner (WMF) (talk) 18:14, 18 February 2016 (UTC)
 * I created the basic structure I envisioned. I'm pretty sure it could be improved. --KSmith (WMF) (talk) 21:00, 18 February 2016 (UTC)

Are you really trying to improve plain old WP search??
Is part of what you are trying to improve searches through this? If so I am wildly happy as that search engine is awful. I waste so much time looking for stuff -- especially trying to find things in old Talk page discussions or archived discussions on notice boards. I waste so. much. time. with that search engine. Just in en-wiki, which is where the information I want is. If that is what you were really doing (outside of the Knight Foundation grant stuff) I would be very happy. Please do tell if that is part of what you are fixing. Thanks! Jytdog (talk) 03:24, 18 February 2016 (UTC)
 * Yes. I've been a volunteer for about 5 years now and a visitor even longer. I too keep going crazy with the current search. It's getting better. We're a relatively new team mind you, and are already making small, progressive improvements. We have a beta feature now for the Completion Suggester which is a small step in the right direction. That should be going out to everyone in the near future (by the end of March, communication and feedback withstanding). We also have a list of goals for this current quarter if you want to keep up. There's a lot to improve. We're tracking our thoughts and progress in Phabricator. If you see something you're interested in knowing more about, let us know. CKoerner (WMF) (talk) 18:49, 18 February 2016 (UTC)
 * This would be amazing. A lot of what I have heard has been about making all the various WM content and WP content more accessible when someone searches.  Like I said I get especially frustrated trying to find specific diffs as well as old discussions.  I am sorry to ask this, but is your team aware that if you search (for example) the ANI archives, you get results like this?  There is no discernable time-order, and no way to refine the search - not even old school boolean works.   I waste so much time going from those results, to what I am looking for, and I have always wondered "How can this be?" Jytdog (talk) 19:51, 20 February 2016 (UTC)
 * It will be amazing! :) There's some smart folks working on it. Check out the beta feature if you haven't already. It's a small, but apparent improvement. I've reached out to the Discovery team to ask how/if some of our near-term goals will impact the types of searches you are asking about. I'll let you know what I find out. CKoerner (WMF) (talk) 21:27, 23 February 2016 (UTC)
 * that would be great. Jytdog (talk) 02:41, 24 February 2016 (UTC)

The "robots" question
The following FAQ is framed in a pretty disrespectful way, and it is not at all clear to me, that the response here speaks for the WMF board and the ED. I am parking it here for now.


 * Is the Wikimedia Foundation looking into replacing editors with robots?

No. We think technologies like machine learning and similar tools can help with aggregation of all the rich content humans have created across our projects. Like the work our colleagues have done with ORES in improving the quality of article content.

At no part are we trying to replace or subvert the work of our human editors. We want to figure out smarter ways to return search results that answer visitors questions - even when those searches currently result in zero results. Imagine in the future searching for something we don't have an article for in a particular language Wikipedia - but we do have books in Wikisource, or quotes in Wikiquotes, or photos in Commons. Wouldn't it be great to have a link to those items in search results instead of nothing?

-Jytdog (talk) 15:42, 20 February 2016 (UTC)
 * I wrote that answer. You singled out the word "robots" in quotes. Do you feel that is a disrespectful description? I thought about using the word "bot" as in "Internet bot", but I thought for clarity I'd be explicit and spell out the (more common) word. We got this question, or variants along the lines of 'automatically written articles by a bot' quite a bit in other venues (mailing lists, other talk pages). I thought adding it to the FAQ would answer the common question and provide clarity about how we are currently looking at utilizing bots. Sorry for any confusion. If I've missed your concern, could you be more specific with what you feel is disrespectful? CKoerner (WMF) (talk) 20:47, 23 February 2016 (UTC)
 * Thanks for replying. Nobody has had robots in mind that I have ever heard of.   I and others have been concerned about the WMF's strategy with regard to the relationship between user-generated and computer-generated (or bot-generated, or "automated generation of") content.


 * In any case, is there a long term plan to start including bot-generated content more systematically in any/all of the projects, or to use such content more? If you don't know, please say that. Thanks. Jytdog (talk) 02:32, 24 February 2016 (UTC)
 * btw, two of the most useful things I have seen written about this question, and in this context, were two posts at Jimmy's talk page -  this and this.  Those were real efforts to communicate clearly, I learned a lot,  and I remain very grateful for them.  I know they are too long for use here, but they recognize the concern and speak to it (even with the jabs :) ) so I wanted you to see them.  Jytdog (talk) 03:33, 24 February 2016 (UTC)
 * Thanks for the heads-up. Those are pretty good summaries of the work that's going on with regard to automation or bots. I am not aware of any long-term plans beyond what's in our current quarter goals and the Discovery Year 0-1-2 presentation. As mentioned on Jimbo's talk page, work in these areas is at a very early stage of initial development. CKoerner (WMF) (talk) 15:02, 24 February 2016 (UTC)
 * Thank you for linking to that presentation. I was not aware of it (I know, I know it has been sitting there for a long time.  There are users for you. Surprisingly missing things.).  I am taking "I am not aware" as meaning that you don't know if there are long-term plans nor what they might be.  I don't know if you are in a position to know.  I really don't.  (If you are in a position to know, then your saying "I am not aware of X" means that X doesn't exist.  If you are not in a position to know, then you are just saying "I don't know").  Chris, I want to point you to the following in the Knight grant - on the first page.


 * It says that the goal is "To advance new models for finding information by supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet."
 * It says: Over the next six months, the Wikimedia Foundation will:
 * • Answer key questions:
 * •• Would users go to Wikipedia if it were an open channel beyond an encyclopedia?


 * And it says starting on that same page:
 * "OUTCOMES
 * Knowledge Engine by Wikipedia will create a model for surfacing high quality, public information on the Internet. The project will pave the way for non-commercial information to be found and utilized by Internet users. The discovery stage will lay the foundation for the project. During this period, the team will establish core usage and performance metrics that will help determine what is built."


 * Please stop now, and really stop. Imagine you are me and you are on the outside, and are trying to figure out what has actually been going on.  Now read the goal, and read the first "key question", and read the Outcome.  The only thing that makes any sense about what was envisioned (and for all we know, is still envisioned)  - and especially in light of the takeover of the wikipedia.org page, and the things that Denny has said, and the stuff that has been said about extending Discovery to include other sources of cc-licensed content (and not things like the Stanford Encyclopedia), and everything that has been said about Discovery currently "just exploring" -  is that the WMF was (or still is)  planning dramatic changes to this place - to what people would actually encounter when they come to wikipedia.org.   There is no mention of how existing, user-generated content fits into that vision.  And really importantly, do you see how the Discovery Year 0-1-2 presentation fits exactly into that notion that the WMF is planning big changes to what people who come to wikipedia.org experience?


 * Can you see that? I am not asking you to confirm or deny it or agree to it  - I am just asking if you can see what I am seeing.  Can you?
 * And please also note that all of this makes it clear that the work of Discovery in exploring things, is very much Stage one of a larger plan. So everybody insisting that Discovery is "just exploring" now is just ... exasperating.  We know  that.  It is not the point.   It is the larger plan that everyone is concerned about.  Can you see that, too?


 * And finally - and please do answer this - is the Discovery Year 0-1-2 presentation still the operative strategy for your group? This is something you should know, so I am looking for a clear answer on this.


 * Please do respond to all three questions. Thanks!  Jytdog (talk) 18:02, 24 February 2016 (UTC)


 * Oh also, what is the "Licensing" referring to, on slide 7 of the 3 year plan? Thanks. Jytdog (talk) 22:39, 24 February 2016 (UTC)


 * Quick note, see the last bullet in the "For What" section here. I know this is only a brainstorming, strategy-thinking meeting. I know that.  It is just that this thing I have been saying is apparently not completely la-la land. I get the problem that the Product team is trying to solve there - how to keep what the WMF does, relevant.  And I see how doing cool stuff via the portal is way more nimble/scalable/etc  and... tractable than dealing with this crazy nuthouse of user-generated content.  (and again I know that is only brainstorming; I see no "smoking gun" about commitments made there) Jytdog (talk) 02:40, 25 February 2016 (UTC)

Edits today
I was BOLD and made some edits to this today, to try to help the Discovery team address the concerns in the community. I am sorry if this was offensive, and if I wrote anything wrong. If I did write anything wrong, please correct it. Happy to discuss. Jytdog (talk) 18:05, 20 February 2016 (UTC)
 * Thank you for the edits. I made a few tweaks as well. I hope this better addresses some of your concerns. CKoerner (WMF) (talk) 21:16, 23 February 2016 (UTC)

What do you call the search function?
I notice this page calls it a "search mechanism" but that is not a common term. Is it a search engine? Are there bots that crawl WP and the other projects and index them? Thanks. Jytdog (talk) 02:49, 24 February 2016 (UTC)
 * So it is Extension:CirrusSearch which uses Elasticsearch, which is a search engine for enterprise search. And you are looking at improving that search engine, first for intra-WM stuff and then to other open sources of information.  OK. Jytdog (talk) 02:05, 25 February 2016 (UTC)
 * (edit conflict) According to the definition given in Search engine (computing), yes, the Discovery Department is working on a search engine. That's nothing new, however; the CirrusSearch extension has existed for years, and search functionality in MediaWiki has been worked on many years before that. Search is also only part of our efforts to improve content discovery on Wikimedia projects; we're also working on maps as a content discovery mechanism, for example.  --Dan Garry, Wikimedia Foundation (talk) 02:12, 25 February 2016 (UTC)
 * yes that was very helpful, thanks. Jytdog (talk) 02:42, 25 February 2016 (UTC)

Staff notes
this is great.

Two snippets, with comments...


 * "Moiz: Follow-up: The initial grant was just a small step toward a lot more funding. Who will be funding that in the future, given the bad press?"
 * "Lila: We never counted on Knight to cover team expenses, only to supplement. Staffing comes out of the common bucket. We budget the work first, and then apply any grant moneys where they fit. Annual fundraising goal includes grants. So this won’t change how we fund the team. Just potentially the total amount of money available."

About that, and relevant to what I was saying above. You are happy to talk about near term goals. You are happy to say "not Google". What is the strategy? What are the projected budgets for it? The persistent not-saying it, is the problem that is breeding all kinds of very bad feelings. And
 * "...We have been trying to work with CL to get the message out. Two office hours have not attracted anyone from communities."  Office hours?  Where are they publicized? Jytdog (talk) 22:18, 24 February 2016 (UTC)
 * While it's possible things have changed, my experience with office hours has been very high noise-to-signal ratio. If office hours have become a standard approach for engaging community, I don't find it surprising to find that nobody is showing up. I was glad to see some discussion about whether other approaches might yield better results -- I hope that discussion progresses. -Pete F (talk) 18:17, 25 February 2016 (UTC)
 * Sorry I am completely ignorant here. I reckon "office hours" means somebody available for a live chat at some designated time and "place".  Would someone please say where is that site, and how are they publicized?  Sorry to be ignorant. Jytdog (talk) 18:38, 25 February 2016 (UTC)

FAQ: If you're adding new data sources, isn't that a search engine?
It is unclear to me how this is helpful, so moving it here for discussion. Seems mostly redundant.


 * If you're adding new data sources, isn't that a search engine?

If you define "search engine" as including a web crawler that indexes the whole web, which is the most common definition, no.

We do have a goal of improving the search function to make it work better in each project, and across projects, and yes we want to expand what it reaches to other sources of high-quality, public data.

The goal is to expand the amount of knowledge presented in search results and expand the context beyond just textual search. We want to begin by showcasing content from other wiki projects including appropriate languages based on query input.

The data could be used to potentially evolve and improve the quality of our existing search experience.

Our first new data source outside of Wikimedia projects is OpenStreetMap data for Maps, which our Wikivoyage community is already starting to experiment with. There are other data sets that we could potentially surface (census, national gallery, etc) but that will be up to our communities to decide. Some of these could certainly show up in search results and we have Phabricator tasks around improving GeoData content (T112026).

-Jytdog (talk) 02:36, 25 February 2016 (UTC)