Wikimedia Apps/Offline support

Summary
In response to New Readers research showing the need for better offline support, as well as a well-received Community Wishlist proposal, the Wikipedia Readers team is working on improving the offline user experience by adding support for loading Wikipedia articles from ZIM files. On the app side, the technical work is nearly complete; searching and loading articles from one or more ZIM files loaded onto the device works well. Additionally the interface for finding and downloading packs of content is complete. We are currently performing contextual research in India on the initial production prototype.

Goal
Provide the means for those without reliable sources of internet to use Wikipedia.

Why are we doing this?
Primarily, we want to improve the offline experience for a large number of Wikipedia users, especially new readers who often have spotty internet access and wish to reduce their data usage. Better offline capabilities may also help boost retention and uptake of the mobile apps by further differentiating them from the mobile web.

Key personas and user stories
The detailed user stories are based on three personas: Sandeep (Indian New Reader), Femi (Nigerian New Reader) and Michelle (Active Reader). Michelle is one of the "Pragmatic" personas; the New Reader personas are available from the New Readers user personas page.

Sandeep
Motivation: Browsing without an internet connection Story: As a user with limited internet connectivity except free wifi on campus, I want to continue to browse Wikipedia even when I’m offline.

Femi
Motivation: Conserve mobile data usage Story: As a user who primarily uses Wikipedia on mobile data (and pays a premium for it), I want to be able to use Wikipedia whilst conserving mobile data usage.

Michelle
Motivation: Active app user about to experience connectivity issues Story: As an active user going overseas where I will have no mobile data, I want to be able to continue using Wikipedia to look up interesting content related to my travel destination.

Health worker
Motivation: Active Wikipedia reader in a health/medical specialist field about to experience connectivity issues Story: As a health worker going to work in an area with poor internet access, I want to still have easy access to Wikipedia for content related to my field of work in medicine.

Design and Research
The Wikipedia Android team created a prototype for the core interaction flow for the Sandeep 'Browsing offline' user story. Tasks for this "V1 prototype" are captured in the Phabricator board: https://phabricator.wikimedia.org/project/view/2824/

V1 - User research study
Hureo, a third-party user research agency, conducted a user study of the initial "V1 prototype" with participants in Pune, India. The goals of the study were to:
 * 1) Discover whether participants find this feature useful and whether they are successful in fulfilling their offline information needs.
 * 2) Identify usability improvements.
Findings and recommendations from the study will be posted in September to inform further development of the feature.

Glossary

 * ZIM file - The file format used to store a set of articles. An open, compressed archive format from the openZIM project, designed for storing HTML content for offline use.
 * Offline Library – proposed name for the new feature in Wikipedia mobile apps which enables users to download large 'packs' of articles for reading without needing connectivity.
 * Article packs - proposed name for the individual downloadable sets of articles in the ZIM format. Example use in context: "Download Wikipedia article packs to your Offline Library now for data-free access later"
 * Compilation / Offline compilation - the initial placeholder name used for this feature to read ZIM files offline in the Wikipedia mobile app. The term was used both to refer to the overall feature, as well as the set of articles.
 * MW Offliner - Tool produced by Kiwix to convert MediaWiki pages to ZIM files. The main file generation option for offline Wiki content.
 * Swift - Specialized media serving infrastructure used to host media on Commons. The part of the Wikimedia infrastructure best suited for serving ZIM files to production.
 * MCS - The Mobile Content Service. The RESTBase APIs used by the apps. They provide a layer on top of MediaWiki that allows content to be cached and formatted in a way that is best for mobile clients.
 * Collections Extension - A MediaWiki extension which allows you to collect groups of articles and package them in various ways. It once supported ZIM file creation, but that broke and was never repaired. Currently undergoing some redevelopment to support the deprecation of the OCG PDF rendering tool.

Filling in the WMF Row
When the team initially began planning to fulfill the Wishlist request and help to advance the New Readers program by supporting ZIM files, the planned feature included ONLY support for the last step in the pipeline described above. That is, based on the prototype implementation, the app would detect and "side load" ZIM files on device, but would not otherwise enable downloading or creating packs. Very early in the process of design and considering the target personas and initial research, it became clear that without some built-in downloading interface and a list of appropriate packs in easy-to-access form, the feature would likely fail, and remain unknown and unused apart from a small core of offline ecosystem participants (e.g. DocJames).

At this stage, it was decided that additional requirements would be needed and we would either mirror or work with Kiwix to host a more tailored and appealing set of content, while enabling users to easily find and download the content from within the Android app. Also during this time it became clear that some technical limitations of the ZIM format and offliner tool would benefit from modification to suit our stack and proposed user-experience.

Because we believed the acceptance of this feature may hinge on the ease of use and relevancy of content for target personas and markets, we began working towards moving further to the left, to own file generation and hosting. We would use mwoffliner (including some improvements we hoped to upstream), but defer content and curation to WikiProjects and popularity (most read articles) for the initial release. We planned to defer decisions around long term curation until initial testing and beta feedback, and to work with the Community Liaisons to explore the communities' preferences for curation.

Initial Requirements

 * Support loading ZIM files from the user's device via an Offline Library management interface.
 * Allow the user to search the content of ZIM files loaded into the Library.
 * Clearly identify when the user transitions via a link to content outside the library.
 * Include basic resource information at appropriate points (e.g. file size).
 * Each ZIM should have a user-friendly title, description and icon or image (via a service or meta-data augmentation).
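
As a rough illustration of the last two requirements, the display information for a pack could be modelled as a small record plus a helper that formats the file size for the user. This is only a sketch: the `ArticlePack` type, its field names and the `human_size` helper are hypothetical and do not correspond to any existing app code or to fields in the ZIM format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticlePack:
    """Display metadata for one downloadable pack (all field names are hypothetical)."""
    title: str        # user-friendly title shown in the Offline Library
    description: str  # short description of what the pack contains
    icon_url: str     # icon or image representing the pack
    size_bytes: int   # size of the ZIM file, shown before download


def human_size(size_bytes: int) -> str:
    """Format a byte count for display, using 1024-based units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024


pack = ArticlePack(
    title="Medicine",
    description="Articles curated by WikiProject Medicine.",
    icon_url="medicine.png",
    size_bytes=5 * 1024 ** 3,
)
label = f"{pack.title} ({human_size(pack.size_bytes)})"
```

Showing the size next to the title lets a user on a metered connection judge the cost of a download before starting it.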

Emergent Requirements
In addition to developing our general knowledge and competency around this technology, there are a couple of areas for exploration and possible improvement around the content of the ZIM files currently created by the mwoffliner tool, which is used to create most if not all of the ZIM files currently available:
 * We'd like to expand beyond Kiwix’s library of ZIM files. (T169905)
 * The HTML content of the articles in the existing ZIM files has a lot of Kiwix-specific formatting, which the Wikipedia app needs to strip before displaying it. Ideally the articles in the ZIM file shouldn't be adulterated in any way, and should be identical to the content received if a network request were made to get the same article. (T172764)

There are also concerns around hosting ZIM files to be downloaded in-app:
 * We'd like to expand upon and improve the metadata that is "baked into" the various ZIM files offered to our users. This metadata is what the user sees when deciding which compilation to download, so it must be worded very clearly and meaningfully. (T164760)
 * We need to use infrastructure that we can scale for hosting content that we serve to our apps. That means that we need to find production WMF hardware to host the ZIM files that we serve. After some internal discussion, Swift, the service used to host all of the media content uploaded to Wikimedia Commons, has emerged as a strong candidate for hosting these files, and we need to test Swift’s capacity for handling and serving files of this size. (T172123)
 * Wikimedia's production Swift service does not permit uploading files from Cloud VPS. If Swift is indeed used for hosting ZIM files, they will therefore have to be uploaded from a ZIM file generation service set up consistently with the requirements for running in the Wikimedia production environment. We are working on prototyping an mwoffliner instance set up as it would be if running in Wikimedia production. (T172769)

ZIM Format Revision
One core requirement from the beginning is to use and support the ZIM file format created by Kiwix. The existing ecosystems of volunteers and organizations interested in offline Wikipedia have generally embraced ZIMs as their standard format of exchange, and Kiwix is working towards the OpenZIM standard, which they (and we) hope to see become an open, widely supported standard. That said, there are some aspects of the ZIM file and its meta-data we would ideally augment or work to have incorporated into the standard. Specifically:

Meta-data
Based on our initial research and design, we believe it is key that each pack of content have an easy-to-understand title, description and icon. Although the existing format supports a title, it is often not human readable. Additionally we'd like to see a longer description field and a field added for

Rewriting of URLs
All URLs to other articles and images are rewritten to point back into the ZIM file, instead of the live Wikipedia URL. We can theoretically remove this transform, since our app can decide on the fly whether to load an article from the network or from an offline pack (as long as all internal links have a "title" attribute). Mwoffliner also removes links to articles that aren't present in the given ZIM file. We want our users to be aware of all links and to have the option to leave the offline experience and seamlessly get a page from Wikipedia online.
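
The on-the-fly decision described above could look roughly like the following sketch. It assumes the app already has the link's "title" attribute and a set of titles available in the user's offline packs. The function name, the return values, the English Wikipedia base URL and the choice to prefer the local copy whenever one exists are all assumptions made for illustration, not the app's actual behaviour.

```python
from urllib.parse import quote


def resolve_link(title, offline_titles, is_online,
                 base_url="https://en.wikipedia.org/wiki/"):
    """Decide where to load a linked article from (hypothetical helper).

    Returns a (source, location) tuple:
      ("offline", title)      read the article from a local pack
      ("online", url)         fetch the live article
      ("unavailable", title)  not in any pack and no connectivity
    """
    normalized = title.replace(" ", "_")
    # Assumption: a local copy is preferred, which also saves mobile data.
    if normalized in offline_titles:
        return ("offline", normalized)
    if is_online:
        return ("online", base_url + quote(normalized))
    # The link stays visible to the user, but cannot be followed right now.
    return ("unavailable", normalized)


offline_titles = {"Albert_Einstein", "Pune"}
in_pack = resolve_link("Albert Einstein", offline_titles, is_online=False)
outside_pack = resolve_link("Lagos", offline_titles, is_online=True)
```

Because links to articles outside the pack are kept rather than removed, the "unavailable" case gives the interface a place to tell the user that the article exists but requires a connection.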

Presentational tweaks

Kiwix includes a special license footer, as well as certain other transforms that they feel will improve the appearance of the article. Generally we do not want to remove content, and would prefer these transforms not be applied, in favor of our own mobile-friendly transforms.

Media licenses

The current ZIM format does not include links to file pages or complete media license and attribution information. If users are using ZIM files created or served by the Foundation, we must include more robust licensing information for images and other media.

Space-saving cleanup

Mwoffliner has quite a bit of logic to remove unnecessary or unused tags from the article. These removed tags will need to be reviewed to ensure we also want to remove them, and also we may want to upstream these removals to our own APIs, so they can be performed uniformly across articles from either a ZIM or an API.

Incremental updates
The ZIM model is a "build once read many" model that optimizes for space and reading. There are many stories for our key personas that do not fit this model. For example, personas like Sandeep and the Health Worker have some regular free internet access and have motivation to want the latest, most accurate version of their articles. In the long term, we would ideally discuss and plan with Kiwix how to best support incremental updates of two types:
 * 1) One-by-one article update: A user opens an article from their offline library. They are on wifi, so the app checks whether a newer revision is available. If so, the user is prompted to replace the existing copy with the updated one. Once updated, the new revision acts as part of the library and works offline, etc.
 * 2) Batch (diff) update: A user has the top 5,000 most read collection from 2 months ago. They are on a pay-by-byte network and want to press an "update" button which will sync the local pack to the most updated version. However, because they pay by the byte, they don't want to just delete the existing collection and download a new one, but rather get the new articles and delete removed ones, without touching or re-installing content that has not changed.
The ideal technical path is to construct ZIM files out of pure, unadulterated API responses. It will then be the responsibility of the API to deliver content that is as lightweight as possible, and the responsibility of the client to perform any transforms that are specific to that client's presentation of the article. We can adopt these changes without breaking compatibility with existing ZIMs, but if they are not upstreamed, we would need to move into the "Content Generation" phase.
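
The two update types described in this section can be sketched as follows. The sketch assumes each pack can be described by a manifest mapping article titles to revision IDs. Such a manifest is hypothetical and is not part of the current ZIM format. The only fact relied on is that MediaWiki revision IDs increase over time, so a larger ID means a newer revision.

```python
def needs_refresh(local_revision, latest_revision):
    """One-by-one update: True if a newer revision than the local copy exists."""
    return latest_revision > local_revision


def plan_pack_update(local, remote):
    """Batch (diff) update: compare two {title: revision_id} manifests.

    Returns the titles to download (new or changed), the titles to delete
    (no longer in the pack) and the titles left untouched.
    """
    to_download = sorted(t for t, rev in remote.items() if local.get(t) != rev)
    to_delete = sorted(t for t in local if t not in remote)
    unchanged = sorted(t for t, rev in remote.items() if local.get(t) == rev)
    return {"download": to_download, "delete": to_delete, "unchanged": unchanged}


# Hypothetical manifests: the pack on the device vs. the latest published pack.
local_manifest = {"Malaria": 100, "Cholera": 200, "Measles": 300}
remote_manifest = {"Malaria": 100, "Cholera": 250, "Dengue_fever": 50}
plan = plan_pack_update(local_manifest, remote_manifest)
```

In this example only "Cholera" and "Dengue_fever" would be transferred, "Measles" would be removed and "Malaria" would be left alone, which is the behaviour a user on a pay-by-byte network is asking for.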

Short term plans
Because the scope of the WMF's ownership and involvement has continued to expand, the shipping schedule for this app update has been extended and has become somewhat vague. While this may delay the basic functionality being available to users, we believe it is worth taking the time to perform needed research with real users and to resolve the additional dependencies introduced by adding download and indexing steps to our scope, even if we do not ultimately expand scope into packaging and hosting. Additionally, the underlying goal of this feature is to expand access to Wikipedia to users who are (comparatively) new to the internet. For this audience we believe it is doubly important that features be as easy to use as possible, and in line with their expectations of how the modern, mobile internet should look and work.

That said, given that the basic functionality is in iteration, we want to get the feature pushed to production and in the hands of our users as quickly as possible, while maintaining a sound approach from engineering, design, and legal perspectives.

Hosting Options

 * 1) Host by Kiwix on their existing infrastructure
   * dumps.wikimedia.org (one of many Kiwix mirrors)
     * PROS: we host this, and no interference from any third party
     * CONS: Ops said that this is not robust enough for our scale, no guaranteed uptimes.
     * QUESTIONS: are we ok with this? Can we build something into the client that says "temporary outage" or whatever as a gap solution? Is this forbidden by Ops or just discouraged?
   * Kiwix URL
     * PROS: simple. They say it's robust enough to perform, though we haven't tested this. One of many mirrors is hosted at Wikimedia
     * CONS: There are legal concerns that would need to be worked through, depending on the design and implementation. For example, Kiwix does not have a privacy policy. Additionally content curation, packaging and creation would remain as is in the current ecosystem.
     * BLOCKERS (mostly legal):
       * A list of the 3rd party hosting services they use, with a clear mapping and an NDA in place for any user information sent to those servers as part of downloading or file requests.
       * We would need some kind of basic SLA. We should also plan for and discuss potential termination (including for "emergencies" or urgent situations).
       * Kiwix lacks some basic legal needs such as a privacy policy.
       * Usage must be explicitly opt-in before any connections to 3rd party servers. Users must understand they are sending their request to someone outside the WMF, and what the privacy implications are.
       * We will need to create a process to handle content takedown requests. This is less of an issue in this case than if we build and host these ourselves, but we'd need to define clear expectations and process for these rare but important situations.
       * Cost to Kiwix and their partners and potential need for financial support from the Foundation.
 * 2) Host ourselves, requiring Ops deployment
   * PROS: in house, totally within our control. We can make, host, and serve custom files more easily, and this opens the door for us to make tools for users to create ZIM files that will then be easy to download from the device. Privacy concerns are completely addressed. Achieving content diversity and opening these tools to community would be technically easier.
   * CONS: Complicated technically and from a project management perspective. Timelines and outputs currently unclear.
   * QUESTIONS:
     * How long would this take to set up? At what cost?
     * Can we upstream changes to mwoffliner successfully and standardize metadata requirements?
     * Does this imply that we or the Wikipedia communities MUST then own curation and open generation if the feature is accepted and there is demand for additional content?

Following initial deployment
This feature set represents an entirely new line of inquiry for our apps and website, and that means that there are a lot of open questions that will help guide us to make the right decisions about how far to push the features to make them the most useful to our target users. We have options from simply using the Kiwix files all the way to building out curation tools with community support, and everything in between. In order to make those decisions, we need to learn some things.
 * 1) Usage data. The app will be sending back feature usage data, and within a short time from release we should understand what packs people are more interested in downloading and using, as well as if there are commonalities.
 * 2) In-app survey data. We can ask users a couple of questions to understand if they have a need for this feature and if it is useful to them.
 * 3) Community consultation. If we're considering making a space for people to curate content and create ZIM files, we need to make sure that there is community support for managing this new workflow and the content it produces. Is this something people are interested in? What do the Community Liaisons think?
 * 4) Marketing funnel data. We are planning to do some marketing work around this feature, which will come with conversion information. Through this push we'll likely get qualitative feedback as well, through app store reviews, social, etc.

Choices for long term
After we learn more, we have a few options (and probably gray areas in between) that we can explore. This could (and will likely) include development on the web as well. These are initial thoughts that would likely evolve as we learn.
 * 1) Use Kiwix's hosting. See above for detail.
 * 2) WMF production servers, using a copy of the Kiwix library.
   * PROS: no content curation on our side, entirely done through the existing offline ecosystem.
   * CONS: no opportunity to iterate on content packs by us or wider community. ZIM files not formatted ideally for app
 * 3) WMF has tools for content curation, but they are smaller scale and allow readers to create ZIM files, then download them. They have the option to upload them to a central repository or send straight to their own devices. We could consider a Labs tool or something that power users and offline distributors could use.
   * PROS: lower risk from community curation side
   * CONS: if we give the option for upload to a central repository, we still need to monitor that and make sure that files are not corrupted and content is not sketchy.
   * QUESTIONS:
     * What tools do existing offline distributors use? Are they up to snuff?
     * Could we build a simple uploader to front our hosting?
 * 4) Rich curation toolset around ZIM files. Use Extension:Collection, Book Creator or similar tool to allow anyone to create ZIMs and automatically put them in a library that's publicly accessible, including in the app.
   * PROS: fully integrated approach, removing all curation from the Foundation. Allows for fully customized ZIM files from anyone, anywhere.
   * CONS: Unknown if community would want to take this on. Risk of problematic ZIM files being created and propagated (SuSa).
   * QUESTIONS:
     * Is this necessary?
     * Does community want it? Will they reject it?
     * How technically complex is it? What would we need to give up on the roadmap to make space for this?