Extension talk:Wikispeech

Component reuse
I've read the extension page, Wikispeech and Wikispeech/Pilot study, but I still don't find any answer to the biggest question, i.e. whether you are able to reuse libraries/third party components or you plan to develop everything from scratch and in isolation. --Nemo 08:24, 17 December 2016 (UTC)
 * Text-to-speech is a huge endeavour (especially when having multiple languages as target), and to be viable I'd expect this extension to rely on existing software for the synthesis engine. The configuration mentions https://morf.se but the domain has no information.
 * Even the NLP component is potentially too big a project in itself, if you don't join forces with others. Producing annotated text is something that I'd expect to be provided by a "central" parsing API, to be used by others as well: there is a wide demand for text extracts and Content translation saw a lot of work on LinearDoc, segmentation and markup.
 * Even for the audio player I see no mention of whether you think TimedMediaHandler will be insufficient. I expect this to be the easiest part to fix at any point of the development, but I hope not much work is going to be devoted to the player (which seems to be the least interesting deliverable).

Questions from wikitech-l

 * From Bawolff in 

From what I gather from your existing implementation, your current plan is:
 * Using a ParserAfterParse hook, do some complex regex/DOMDocument manipulation to create "utterance" annotations of clean HTML for the TTS
 * Insert this utterance HTML at the end of the page HTML
 * JavaScript posts this to a (currently) Python API, which returns a JSON response containing a URL for the current utterance (not sure how long an utterance is, but I'm assuming it's about a paragraph)
 * JavaScript plays the file.

Is this your general plan, or is the existing code more a proof of concept? I'm not sure I'm a fan of adding extra markup in this fashion if it's only going to be used by a fraction of our users.
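
For illustration, the cleaning and segmentation step described above could look roughly like the following. This is a minimal sketch in Python and not the extension's actual code (which runs in PHP inside a parser hook): the names `SKIPPED_TAGS`, `TextExtractor` and `segment_utterances`, the choice of tags to skip, and the naive sentence split are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Tags whose content is assumed to be unsuitable for speech (hypothetical list).
SKIPPED_TAGS = {"sup", "table", "style", "script"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def segment_utterances(html):
    """Return a list of {'id', 'text'} dicts, one per sentence."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse whitespace left behind by removed markup.
    text = re.sub(r"\s+", " ", "".join(extractor.parts)).strip()
    # Naive split: after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [{"id": i, "text": s} for i, s in enumerate(sentences)]
```

A real segmenter would need language-specific rules (abbreviations, decimal numbers and so on), which this sketch deliberately ignores.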


 * Wikispeech is made up of two main parts: the MW extension and the backend TTS service. In addition to coordination, WMSE's primary development focus is on the MW extension, whilst our partner STTS is leading development on the TTS service. For that reason I'll ask someone from there to answer the questions related to the TTS service/server (also Nemo's questions above).


 * The MW extension has two main parts, recitation and editing. So far we have only actively been working on the recitation part. The editing part will allow both for editing the pronunciation in a particular article ("VI" is a numeral here, "João" should be pronounced in Portuguese etc.) and for making corrections/additions to the TTS itself. Those edits will be stored as some type of annotation (T148734); the exact implementation depends in part on the outcome of the Dev Summit in January.


 * Recitation (i.e. having the text spoken to you) largely follows the steps described by Bawolff.
 * Currently an utterance is roughly a sentence. This might change and might even be something which differs per language.
 * Only the first utterance is sent to the TTS server on page load (the rest being sent if you start listening).
 * Playback is currently done through JS/HTML5. The plan is to (if possible) make use of TMH/Video.js so as to re-use as much of the existing infrastructure on MW as possible.
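
The "first utterance on page load, the rest once you start listening" behaviour can be sketched as follows. This is an illustration only, written in Python rather than the client-side JavaScript the extension actually uses; `UtteranceQueue` and the `request_audio` callable (text in, audio URL out) are hypothetical names standing in for the real call to the TTS server.

```python
class UtteranceQueue:
    """Requests audio for the first utterance eagerly and the rest on demand."""

    def __init__(self, utterances, request_audio):
        # request_audio is a hypothetical callable: utterance text -> audio URL.
        self.utterances = utterances
        self.request_audio = request_audio
        self.audio_urls = {}
        if utterances:
            # Page load: only the first utterance is sent to the TTS server.
            self._prepare(0)

    def _prepare(self, index):
        # Each utterance is requested at most once.
        if index not in self.audio_urls:
            self.audio_urls[index] = self.request_audio(self.utterances[index])
        return self.audio_urls[index]

    def start_listening(self):
        """Called when playback starts: request the remaining utterances."""
        for index in range(len(self.utterances)):
            self._prepare(index)
        return self.audio_urls[0] if self.utterances else None
```

The point of the design is that a reader who never presses play costs the TTS server at most one request per page view.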


 * Most dev work in the Recitation/Player end is related to one of:
 * Cleaning out unsupported content and utterance segmentation;
 * Handling going backwards/forwards in the audio while keeping track of where in the text you are (complicated by the fact that sound != visual text);
 * (coming) Visualising what is being recited as well as allowing highlighted text to be recited.
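
As a rough illustration of the second and third points, one way to keep track of the position in the text is to store, for each utterance, the character offset at which it starts in the visible text. The sketch below is in Python and is only an assumption about how this could be done, not a description of the extension's code; `RecitationCursor` and its methods are hypothetical names.

```python
from bisect import bisect_right


class RecitationCursor:
    """Tracks which utterance is being recited."""

    def __init__(self, start_offsets):
        # start_offsets[i] is the character offset in the visible text at
        # which utterance i begins; the list must be sorted ascending.
        self.start_offsets = start_offsets
        self.current = 0

    def skip(self, steps):
        """Move forwards (positive) or backwards (negative), clamped to range."""
        last = len(self.start_offsets) - 1
        self.current = max(0, min(last, self.current + steps))
        return self.current

    def seek_to_text_offset(self, offset):
        """Jump to the utterance containing the given character offset."""
        index = bisect_right(self.start_offsets, offset) - 1
        self.current = max(0, index)
        return self.current
```

`seek_to_text_offset` is the piece that would let a highlighted passage be recited: the start of the selection is mapped to an utterance index, and playback continues from there. The harder problem mentioned above, that the spoken form differs from the visible text, is not addressed by this sketch.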


 * By "extra markup" do you mean the utterance tags which we inject at the end of the HTML? We've purposely kept this separate from the main content of the page so as not to affect any other workflows/extensions/gadgets. If the delay (segmenting + loading the first audio) is acceptable, I guess (without having looked into it) that it might be possible to only insert it once you press play (or on some other trigger) by working on the same content as we currently access in ParserAfterTidy. Issues which might be caused by such a change are: segmenting/cleaning isn't cached, delayed playback of the first sentence, and more work on the client as opposed to the server.


 * On the MW extension side I think we are not married to any particular solution, so suggestions/thoughts or known similar problems/solutions are all welcome if they help us reach the desired endpoint in a more efficient way.


 * Apologies for the slightly hurried answer. I'm heading off on Christmas holidays, so I probably won't be able to respond in person until after the holiday season. /André Costa (WMSE) (talk) 17:53, 19 December 2016 (UTC)