Extension talk:Wikispeech

Component reuse
I've read the extension page, Wikispeech and Wikispeech/Pilot study, but I still don't find any answer to the biggest question, i.e. whether you are able to reuse libraries/third party components or you plan to develop everything from scratch and in isolation. --Nemo 08:24, 17 December 2016 (UTC)
 * Text-to-speech is a huge endeavour (especially when having multiple languages as target), and to be viable I'd expect this extension to rely on existing software for the synthesis engine. The configuration mentions https://morf.se but the domain has no information.
 * Even the NLP component is potentially too big a project in itself, if you don't join forces with others. Producing annotated text is something that I'd expect to be provided by a "central" parsing API, to be used by others as well: there is a wide demand for text extracts and Content translation saw a lot of work on LinearDoc, segmentation and markup.
 * Even for the audio player I see no mention of whether you think TimedMediaHandler will be insufficient. I expect this to be the easiest part to fix at any point of the development, but I hope not much work is going to be devoted to the player (which seems to be the least interesting deliverable).
 * Hi. STTS (see section below) can probably answer some of these questions better but I'll give it a first stab.
 * The basic setup of the TTS service is that it should act as a wrapper with a standardised API allowing you to plug in existing libraries/components per language depending on what exists. So where libre components exist the plan is that these should be reused and only an adaptor to make it work with the wrapper should be needed. For the first three languages (especially sv and ar) we may also be providing a few of the components which are missing today.
 * Thanks for the Content Translation links. The cleaning step is a bit special in that it needs to know both what to send to the TTS and what the corresponding "untreated" text is. What you want to get rid of may also differ between wikis. The segmenting cannot be too fine, since that affects how sentences are pronounced, but it needs to be fine enough that you don't request loads of data which you end up not using or which takes unnecessarily long to generate (i.e. could be generated while you listen to the previous segment). I'll take a closer look at the Content Translation links when I'm back, to see how large the overlap of needs is.
 * For the audio player see my answer in the section below.
 * I noticed that the two schematics from our report never made it on to Commons (and thus never into the Pilot Study page here on mw.org). Maybe can fix that when they are back in the office. /André Costa (WMSE) (talk) 19:10, 19 December 2016 (UTC)
 * The TTS platform we're using is MaryTTS. To get a bit more information about what's currently being used by the TTS server, see: https://morf.se/wikispeech/. Note that this implementation is under development by STTS and may not be the most stable.
 * When it comes to the player, we'd of course like to use a standard MW player and will try TMH (T142562). Since quite a few things (preprocessing, API requests etc.) need to be in place before the playback starts, this hasn't been prioritized. In the development so far, standard HTML5 audio elements (and a rudimentary button-based GUI) have sufficed, and as far as I understand, the leap to (the new) TMH won't be too big.
 * /Sebastian Berlin (WMSE) (talk) 09:38, 20 December 2016 (UTC)
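
To illustrate the wrapper idea described above (one standardised API with a pluggable component per language), here is a minimal sketch in Python. It is not taken from the Wikispeech or STTS code: the names `TtsWrapper`, `register`, `synthesise` and `fake_swedish_engine` are hypothetical, and the "engine" only returns placeholder bytes rather than audio.

```python
from typing import Callable, Dict

# Hypothetical adaptor signature: takes cleaned text, returns audio bytes.
Synthesiser = Callable[[str], bytes]


class TtsWrapper:
    """Routes a request to whichever adaptor is registered for a language."""

    def __init__(self) -> None:
        self._adaptors: Dict[str, Synthesiser] = {}

    def register(self, lang: str, adaptor: Synthesiser) -> None:
        # One adaptor per language code; a later call replaces an earlier one.
        self._adaptors[lang] = adaptor

    def synthesise(self, lang: str, text: str) -> bytes:
        try:
            adaptor = self._adaptors[lang]
        except KeyError:
            raise ValueError(f"no TTS component registered for {lang!r}") from None
        return adaptor(text)


def fake_swedish_engine(text: str) -> bytes:
    # Stand-in for a real engine; returns placeholder bytes, not audio.
    return b"sv:" + text.encode("utf-8")


wrapper = TtsWrapper()
wrapper.register("sv", fake_swedish_engine)
```

With this shape, supporting another language would only mean writing one more adaptor function around an existing engine; callers of `synthesise` would not change.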

Questions from wikitech-l

 * From Bawolff in 

From what I gather from your existing implementation, your current plan is:
 * Using a ParserAfterParse hook, do some complex regexes/DomDocument manipulation to create "utterance" annotations of clean html for the tts
 * insert this utterance html at end of page html
 * javascript posts this to a (currently) python api, that returns a json response that contains a url for the current utterance (not sure how long an utterance is, but I'm assuming it's about a paragraph)
 * javascript plays the file.

Is this your general plan, or is the existing code more a proof of concept? I'm not sure I'm a fan of adding extra markup in this fashion if it's only going to be used by a fraction of our users.


 * Wikispeech is made up of two main parts: the MW extension and the backend TTS service. In addition to the coordination, WMSE's primary development focus is on the MW extension, whilst our partner STTS is leading development of the TTS service. For that reason I'll ask someone from there to answer the questions related to the TTS service/server (also Nemo's questions above).


 * The MW extension has two main parts, recitation and editing. So far we have only actively been working on the recitation part. The editing part will allow for both editing the pronunciation in a particular article ("VI" is a numeral here, "João" should be pronounced in Portuguese etc.) and making corrections/additions to the TTS itself. Those edits will be stored as some type of annotation (T148734); the exact implementation depends in part on the outcome of the Dev Summit in January.


 * Recitation (i.e. having the text spoken to you) largely follows the steps described by Bawolff.
 * Currently utterance ~ sentence. This might change and might even be something which differs per language.
 * Only the first utterance is sent to the TTS server on page load (the rest being sent if you start listening).
 * Playback is currently done through JS/html5. The plan is to (if possible) make use of TMH/Video.js so as to re-use as much of the existing infrastructure on MW as possible.
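
As a rough illustration of "utterance ~ sentence", the following Python sketch splits plain text into utterances while recording where each one sits in the original text, which is the kind of bookkeeping needed to keep track of the position in the text during playback. It is an assumption-laden example, not the extension's code: the `Utterance` class, the `segment` function and the naive punctuation rule are all hypothetical, and real segmentation would have to handle abbreviations, markup and language-specific rules.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character
    text: str   # the text that would be sent to the TTS server


# Naive rule: a run of non-terminator characters plus any trailing . ! or ?
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def segment(text: str) -> List[Utterance]:
    utterances = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue  # skip whitespace-only matches
        # Shift the start past any leading whitespace in the match.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        utterances.append(Utterance(start, start + len(stripped), stripped))
    return utterances


page_text = "Hello world. How are you? Fine"
utterances = segment(page_text)
first = utterances[0]  # only this one would be requested on page load
```

Because every utterance keeps its `start` and `end` offsets, `page_text[u.start:u.end]` always gives back the utterance text, so the rest of the list can be requested later without re-segmenting.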


 * Most dev work in the Recitation/Player end is related to one of:
 * Cleaning out unsupported content and utterance segmentation;
 * Handling going backwards/forwards in the audio while keeping track of where in the text you are (complicated by the fact that sound != visual text);
 * (coming) Visualising what is being recited as well as allowing highlighted text to be recited.


 * With "extra markup" do you mean the utterance tags which we inject into the end of the html? We've deliberately kept this separate from the main content of the page so as to not affect any other workflows/extensions/gadgets. If the delay (segmenting + loading the first audio) is acceptable I guess (without having looked into it) that it might be possible to only insert it once you press play (or on some other trigger) by working on the same content as we currently access in ParserAfterTidy. Issues which might be caused by such a change are: segmenting/cleaning isn't cached, delayed playback of the first sentence, more work on the client as opposed to the server.


 * On the MW extension side I think we are not married to any particular solution so suggestions/thoughts or known similar problems/solution are all welcome if they help us reach the desired endpoint in a more efficient way.


 * Apologies for the slightly hurried answer. Heading off on Christmas holidays, for that reason I probably also won't be able to respond in person until after the holiday season. /André Costa (WMSE) (talk) 17:53, 19 December 2016 (UTC)
 * I haven't really thought about this very hard, but I guess my main concern would be that if the utterance info is roughly the same size as the page content, then this would double the page size, which seems like a high cost to pay. You might also consider inserting the data into a js variable as JSON instead of appending things to the end of the html (mostly because it seems slightly cleaner; the utterance markup isn't really valid html afaik). Loading the data in a separate request does indeed introduce a lot of issues with cache consistency that are very difficult to solve. Another concern I have is that ParserAfterParse is called in a bunch of places, not just for main body content, so I'm not sure if it's the most appropriate hook. I'd also be a bit wary that all that DomDocument stuff might be memory heavy, or time heavy on really large pages (but I don't know; I certainly have not done any benchmarks). You may want to write up your proposed implementation plan and then submit it to ArchCom for comments - they might be able to give advice on how best to integrate into MediaWiki.


 * I'll also be at dev summit, perhaps we could talk more about this there. Bawolff (talk) 20:24, 19 December 2016 (UTC)
 * I agree that the current implementation adds a lot of extra data and may not be the best. One of the main reasons it's done this way is to do as much as possible server side, to allow for low-end user devices. As it is now, you could get away without even using JS if you tweaked the server a bit, so that you could get a direct link to the audio. Another reason for the current implementation is that it was the first somewhat reasonable solution I could come up with; if there are other ways of conveying data from PHP to JS, I'd be happy to hear them.
 * A note about the hook: as André says above, ParserAfterTidy is the one used to retrieve the HTML. We've noticed that it's called for more than just the body, but since this hasn't broken anything (yet), fixing it isn't highly prioritized. Again, I'd be happy if someone knows a better hook (or equivalent) to get the HTML.
 * /Sebastian Berlin (WMSE) (talk) 11:40, 20 December 2016 (UTC)
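
For comparison with the markup approach, the JSON suggestion discussed above could look roughly like the sketch below. It is written in Python only to show the shape of the payload; in the extension the serialising would happen in PHP, and the field names (`id`, `start`, `text`) and the function `utterances_to_json` are made up for this example.

```python
import json
from typing import Dict, List


def utterances_to_json(utterances: List[Dict[str, object]]) -> str:
    # Compact separators keep the payload small; ensure_ascii=False keeps
    # non-ASCII text such as "João" as-is instead of \u escapes.
    return json.dumps(utterances, ensure_ascii=False, separators=(",", ":"))


payload = utterances_to_json([
    {"id": 0, "start": 0, "text": "VI is a numeral here."},
    {"id": 1, "start": 22, "text": "João is a Portuguese name."},
])

# Size in bytes of what would be added to the page, for comparison
# against the size of the page content itself.
payload_bytes = len(payload.encode("utf-8"))
```

Measuring `payload_bytes` against the size of the rendered page for a few long articles would give a first idea of whether the "doubling the page size" concern applies to this representation as well.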