Technical decision making/Decision records/T274181

With feedback from the Technical Decision Forum's representatives, the Structured Data Across Wikimedia Architecture (SDAW) team was able to exit the process early. The Technical Decision Forum process helped the team realize that the proposal was at a larger scale than originally anticipated. With a clearer understanding of its decision proposal, the SDAW team will work on a more focused framework to bring to the Technical Decision Forum. The feedback the SDAW team received is as follows:

Question:
Was the problem clearly stated?

Feedback

 * The Problem statement talks about tagging articles, paragraphs and even sentences with relevant language independent Wikidata concepts. Leaving aside the fact that there exist concepts that are not language independent (e.g. https://en.wikipedia.org/wiki/Saudade), I'd like to focus on the fact that tagging single sentences is a completely different scale of problem than tagging articles (or even paragraphs). Regardless of implementation, the amount of computing resources required for tagging paragraphs or sentences will be orders of magnitude more than for tagging articles. Could this problem be broken down more? Perhaps starting with just tagging articles and scaling up from there? From my understanding a big part of the gains will come from tagging paragraphs (e.g. an intro/summary being an answer to a question), so maybe paragraphs can fit in the initial plan as well. But adding sentences to start with sounds a bit too much to me.


 * I have read the "What" section at least 5 times now. I finally realized that "The first most useful type of metadata is the topic of the content." is the closest thing to an explanation of the goal of this body of work that I can understand. The size of the grant funding the work is irrelevant (as is the fact of an earmarked grant being involved at all). Placing this at the lead of the section obfuscates rather than informs. It is still not clear to me after several re-reads whether extracting/identifying/cataloging "sections" is part of the work expected, or whether this will instead build on some existing structural decomposition of articles. There is a paragraph on "Structuring content into discrete sections" but it does not contain statements of proposed action. Instead it merely states that this might be a nice thing for external reusers of content and other workflows, without establishing a concrete basis for those statements. I would personally expect the What to be statements in active voice about a problem domain and the high level course of action to be explored. Ideally this would also be written in inverted pyramid/journalistic style so that one does not have to hunt for the important ideas within other tangentially related prose.


 * As stated, it sounds like the problem is the presence of the grant itself.

 * I asked members of my team to review the SDAW document, and one of my teammates has several questions (posted below verbatim):
 * 1) "Looking through the presentation, I'm a bit confused by the application of this idea. I see that they are trying to apply section-level concepts to the lead section, which (ideally) summarizes every section in the article. shouldn't this just reflect article-level concepts, and if so, couldn't they use existing ways of describing a topic (eg category tree - although this might be a useful way of replacing that; existing descriptors on Wikidata)? and how would this work with abstracting out references, since again leads are more typically supported only in body? also how does link analysis interact with project-determined linking standards like enwp's MOS:LINK - eg something linked in an earlier section may not be linked again later? they argue that it minimizes bias vs machine learning - I don't agree that it would"
 * 2) Questions I had are the following: 1. Will there be a difference between tagging cited and un-cited content? Will there be a preference for getting tagged information from cited content? 2. Will people be able to access the website of the citation?


 * The Problem Statement seems fairly thorough and well-written. It doesn't delve too deeply into specific technical details, but I assume that is completely standard and acceptable for these types of documents.  I'm also not entirely clear on how the two additional goals (increased readership from underserved markets, increased editors from emerging/mobile markets) specifically tie into this project, but again I'm not sure that matters.


 * The problem statement is clear and well articulated. I understand that we are in a What and Why stage now.  At the same time, I look forward to a clear "what does done look like" including success metrics to help illustrate what happens next with this decision.


 * The problem statement is understood, but the third paragraph, talking specifically about discrete sections and sentences, might be out of scope for this project, and is being worked on, researched, scoped, and examined by other teams. The process and technical decisions involved in creating the knowledge store are independent of this decision record, and that should be clarified.


 * While the problem is clear, it might have benefited the decision to limit the scope. The current scope makes this a large scale effort, resulting in many teams being involved. Overall it looks well defined; my only question is whether the "what" part also involves exposing the structured content to consumers (e.g. some sort of structured content API) or whether the scope for now is just generating the data. Do we expect some sort of product integration with the Android/iOS apps?


 * The scope is unclear. On one hand it is pretty much open ended: structured tagging for content. On the other hand it says the first task is section topics. Should this be evaluated for the larger or smaller scope? Some clarification is in order; maybe first do the overall thing only and then later a more specific focus area? It's easy to imagine that structured data may help with many things, and enable things we haven't even thought about. However, the examples listed here are vague and it would be nice to have a bit more detail about why or how we think it helps the things mentioned. I think it would help to connect these to planned/expected focus areas within this project. For example, section topic modeling can help (among other things) Section Translation, which facilitates translation and knowledge parity.


 * Regarding the structure of the first section, I think there’s enough overlap between the phrasing of the first and second questions (“What is the problem or opportunity?” and “What does the future look like if this is achieved?”) that the 3rd and 4th paragraphs could be placed in the second section, leaving the first section to outline the decision statement and its scope. Speaking of scope, it’s not clear if this decision record covers all milestones mentioned in the roadmap or only M1. If it’s the former, then I’d expect community consultation, editing, and moderation to be mentioned explicitly. Regarding the “What does the future look like if this is achieved?” section: We do know that exposing structured data in a machine-readable format increases average pageviews per day. See https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test. Regarding the “What happens if we do nothing?” section: What happens if we fail to meet the goals of the grant?


 * Looks good, I like the callout about breaking an article into sections as its own separate win. It's helpful to see the distinction between building a system for serving content modularly and the specific application of this system, that is, tagging sections with topics. If one byproduct is clients/API layers never again needing to parse HTML to extract sections, then this system has much more potential than just providing quick facts, and that is a selling point that could be hit a little harder.

 * The Problem Statement seems pretty broad to us. We appreciate the late addition of the Summary paragraph at the top. The general tagging using Wikidata concepts seems generally clear, especially accompanied by examples from the linked documents. We still have a fairly vague understanding of how tagging Wikipedia articles with topical metadata is going to help achieve the goals of increasing the number of readers, especially from underserved communities, and increasing the number of contributors and editors, especially from emerging markets and on mobile. We wonder if this could be elaborated a bit more, and if some further examples could be provided to demonstrate the potential future applications benefiting from this work. The "What does the future look like if this is achieved?" section seems relatively short compared to the problem statement. Finally, it might not exactly fit the Decision Statement Overview document, but WMDE is curious to hear more details on how it is planned to use Wikidata in applications like "question answering" or "providing quick facts".


 * The Platform Engineering team struggled a lot with the “What” section as it didn’t provide a lot of technical details. Beyond the grant requirement, members of the team wanted more detail on how it would be implemented in order to better comment on the other sections, to fully understand the size of this project, and to judge whether it should be broken up into sections. There was also some feeling that this document may be retrofitting a decision that has already been made. Open question: Will the topic be the only metadata implemented in the context of this 3-year project? Or is the scope broader?


 * While the problem statement is pretty clear, the problem and project described are both very large, and it is hard to be sure.

Question:
Does the solution support the Foundation goals?

Feedback

 * Why is the earmarked grant information in the What section but not in the Why section?


 * Since we're asked specifically about the MTP and the 2030 strategy, it would be nice to link directly to the objectives that this project aligns to. I don't know if the "objectives it supports" listed in the document are from either of them.


 * It can be hinted by the text itself, but it is not entirely clear until you read the additional background links.


 * The "on-site search [that] is a significant improvement over the current state" does not seem to match the scope defined earlier. Per my understanding of this document this (SDAW) is more a platform improvement project, and while it will showcase what it enables with concrete examples, search feels too important and too complex to be an explicit goal.

 * There is general excitement around our ability to collect annotations such as these, which would be highly important to us in building successful and inclusive ML and information retrieval / recommender system technologies, especially for cross-lingual work. However, two main questions appeared around community needs. First, has the community asked for this (section-level structured data)? That is, is there a community of volunteers who are excited to give constructive feedback, add topics when the tool exists, build tools to make it easier, etc.? Otherwise, it feels like we're building a system that will generate a massive amount of backlog / potential for inaccuracies, and therefore frustration, without any plan to handle that (e.g., Article Feedback Tool [1]). Related to that is the question (from the supplementary slide-deck) about how users would interact with “the meta”: has anyone studied usage of the RelatedArticles [2] extension? This project gave editors the ability to override CirrusSearch's "related articles" recommendations (which are automatically generated). Second, why annotations on the section level? Alternative candidates would be page-level annotations, where one can identify a clear need from the community (see all the ad-hoc technologies such as WikiProjects + PageAssessments [3] that have been developed to track these things). While this can be more complicated because you're building on existing technologies and trying not to disrupt existing workflows, it's much more likely to get community buy-in and would serve a clear need.
[1] https://en.wikipedia.org/wiki/Wikipedia:Article_Feedback_Tool [2] https://www.mediawiki.org/wiki/Extension:RelatedArticles [3] https://www.mediawiki.org/wiki/Extension:PageAssessments


 * I feel like the "how" in this area gets a bit lost. They are all worded like objectives to me. Maybe just a simple prefix of "Objective: {asdf}" and "How: {asdf}" could clear this up, or add a specific reference to topics here.


 * We generally understand how solving this problem supports Wikimedia goals. We would like to see more clarity brought to the "Increase impact of knowledge with data" part. From the text itself it has not been entirely clear to us which Objectives are meant here: whether this is primarily targeting Big Tech organisations (OKAPI?) that would be enabled to provide Wikimedia content to new audiences via their products, or whether the main target is emerging markets. Table 1 in the public SDAW grant text (I don't have access to the Google doc linked from the Decision Statement Overview, so I am relying on https://commons.wikimedia.org/wiki/File:2020_Structured_Data_Across_Wikimedia_proposal.pdf) clarified this point for us to a degree, but we think being more explicit in the Statement Overview would still be beneficial.


 * Better clarity on the impact of this project would be helpful. Open question: Why should we prioritize this over other efforts? And how does it map to our OKRs?


 * As written this is great. It seems like there's also an opportunity for increasing Knowledge Equity through this work, at least as a second-order effect via gap analysis of the resultant corpus, which might be good to highlight.