Extension talk:Wikispeech

If you have discovered specific pronunciation errors, see Extension:Wikispeech/Pronunciation errors for how to report them.

Component reuse[edit]

I've read the extension page, Wikispeech and Wikispeech/Pilot study, but I still can't find an answer to the biggest question, i.e. whether you will be able to reuse libraries/third-party components or plan to develop everything from scratch and in isolation.

  • Text-to-speech is a huge endeavour (especially when targeting multiple languages), and to be viable I'd expect this extension to rely on existing software for the synthesis engine. The configuration mentions https://morf.se but the domain has no information.
  • Even the NLP component is potentially too big a project in itself if you don't join forces with others. Producing annotated text is something I'd expect to be provided by a "central" parsing API that others could use as well: there is wide demand for text extracts, and Content translation saw a lot of work on LinearDoc, segmentation and markup.
  • Even for the audio player, I see no mention of whether you expect TimedMediaHandler to be insufficient. I expect this to be the easiest part to fix at any point in development, but I hope not much work will be devoted to the player (which seems to be the least interesting deliverable).

--Nemo 08:24, 17 December 2016 (UTC)Reply

Hi. STTS (see section below) can probably answer some of these questions better but I'll give it a first stab.
The basic setup of the TTS service is that it should act as a wrapper with a standardised API, allowing you to plug in existing libraries/components per language, depending on what exists. So where libre components exist, the plan is to reuse them; only an adaptor should be needed to make them work with the wrapper. For the first three languages (especially sv and ar) we may also be providing a few of the components which are missing today.
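A rough sketch of the wrapper idea: a common interface with one adaptor per engine. The class names below are invented for illustration (this is not the actual Wikispeech code); the HTTP parameters follow the public MaryTTS interface mentioned further down.

```python
from abc import ABC, abstractmethod
import urllib.parse
import urllib.request


class TTSAdaptor(ABC):
    """Common interface implemented by each per-language engine adaptor."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return audio bytes for the given text."""


class MaryTTSAdaptor(TTSAdaptor):
    """Adaptor wrapping a MaryTTS HTTP server (default port 59125)."""

    def __init__(self, base_url: str, voice: str):
        self.base_url = base_url
        self.voice = voice

    def synthesize(self, text: str) -> bytes:
        # Parameter names follow the MaryTTS HTTP API.
        query = urllib.parse.urlencode({
            "INPUT_TEXT": text,
            "INPUT_TYPE": "TEXT",
            "OUTPUT_TYPE": "AUDIO",
            "AUDIO": "WAVE",
            "VOICE": self.voice,
        })
        with urllib.request.urlopen(f"{self.base_url}/process?{query}") as f:
            return f.read()


# One adaptor per language; the voice name is a placeholder.
ADAPTORS = {"sv": MaryTTSAdaptor("http://localhost:59125", "swedish-voice")}
```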
Thanks for the links to the Content Translation reasoning. The cleaning step is a bit special in that it needs to know both what to send to the TTS and what the "untreated" corresponding text is. What you want to get rid of may also differ between wikis. The segmenting cannot be too fine, since that affects how sentences are pronounced, while it needs to be fine enough that you don't request loads of data that you end up not using, or that takes unnecessarily long to generate (i.e. data that could be generated while you listen to the previous segment). I'll take a closer look at the Content Translation links when I'm back, though, to see how large the overlap of needs is.
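To make the granularity trade-off concrete, here is a deliberately naive sentence segmenter; a real one must also handle abbreviations and markup, which this regex gets wrong.

```python
import re


def segment(text):
    """Split text into sentence-sized utterances.

    Sentence boundaries keep prosody intact, while the segments stay
    small enough that the next one can be synthesised while the
    previous one is playing.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(segment("First sentence. Second one! A third?"))
# ['First sentence.', 'Second one!', 'A third?']
```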
For the audio player see my answer in the section below.
I noticed that the two schematics from our report never made it onto Commons (and thus never into the Pilot Study page here on mw.org). Maybe @Sebastian Berlin (WMSE) and John Andersson (WMSE): can fix that when they are back in the office. /André Costa (WMSE) (talk) 19:10, 19 December 2016 (UTC)Reply
The TTS platform we're using is MaryTTS. To get a bit more information about what's currently being used by the TTS server, see: https://morf.se/wikispeech/. Note that this implementation is under development by STTS and may not be the most stable.
When it comes to the player, we'd of course like to use a standard MW player and will try TMH (phab:T142562). Since quite a few things (preprocessing, API requests...) need to be in place before playback starts, this hasn't been prioritized. In the development so far, standard HTML5 audio elements (and a rudimentary button-based GUI) have sufficed, and as far as I understand, the leap to (the new) TMH won't be too big.
/Sebastian Berlin (WMSE) (talk) 09:38, 20 December 2016 (UTC)Reply

Questions from wikitech-l[edit]

From Bawolff in [1]
From what I gather from your existing implementation, your current plan is:
* Using a ParserAfterParse hook, do some complex regex/DomDocument manipulation to create "utterance" annotations of clean HTML for the TTS server.
* Insert this utterance HTML at the end of the page HTML.
* JavaScript posts this to a (currently) Python API, which returns a JSON response containing a URL for the current utterance (not sure how long an utterance is, but I'm assuming it's about a paragraph; see the sketch after this list).
* JavaScript plays the file.
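For illustration, the request/response cycle in the third bullet might look like the sketch below; the endpoint and JSON field names are assumptions, not the actual API.

```python
import json
import urllib.parse
import urllib.request


def fetch_utterance_audio(api_url, utterance_text):
    """POST one utterance to the TTS API and return the audio URL."""
    data = urllib.parse.urlencode({"input": utterance_text}).encode()
    with urllib.request.urlopen(api_url, data=data) as response:
        payload = json.load(response)
    # Assumed response shape:
    # {"audio": {"url": "https://tts.example.org/utterance-0.ogg"}}
    return payload["audio"]["url"]
```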

Is this your general plan, or is the existing code more a proof of concept?
I'm not sure I'm a fan of adding extra markup in this fashion if it's only going to be used by a fraction of our users.
Wikispeech is made up of two main parts: the MW extension and the backend TTS service. In addition to coordinating the project, WMSE's primary development focus is on the MW extension, whilst our partner STTS is leading development of the TTS service. For that reason I'll ask someone from there to answer the questions related to the TTS service/server (also Nemo's questions above).
The MW extension has two main parts, recitation and editing. So far we have only actively been working on the recitation part. The editing part will allow both editing the pronunciation in a particular article ("VI" is a numeral here, "João" should be pronounced in Portuguese, etc.) and making corrections/additions to the TTS itself. Those edits will be stored as some type of annotation (T148734); the exact implementation depends in part on the outcome of the Dev Summit in January.
Recitation (i.e. having the text spoken to you) largely follows the steps described by Bawolff.
  • Currently utterance ~ sentence. This might change and might even be something that differs per language.
  • Only the first utterance is sent to the TTS server on page load; the rest are sent once you start listening (see the sketch after this list).
  • Playback is currently done through JS/HTML5. The plan is to (if possible) make use of TMH/Video.js so as to reuse as much of the existing infrastructure in MW as possible.
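A toy model of that lazy loading: synthesize() and play() below are stand-ins for the TTS request and the HTML5 audio element (in the extension this happens asynchronously in JavaScript).

```python
from concurrent.futures import ThreadPoolExecutor


def play_page(utterances, synthesize, play):
    """Fetch utterance n+1 in the background while utterance n plays."""
    if not utterances:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, utterances[0])  # on page load
        for i in range(len(utterances)):
            audio = pending.result()
            if i + 1 < len(utterances):
                # Request the next utterance before starting playback,
                # so it is ready (or nearly so) when this one ends.
                pending = pool.submit(synthesize, utterances[i + 1])
            play(audio)
```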
Most dev work on the recitation/player end is related to one of:
  • Cleaning out unsupported content and utterance segmentation;
  • Handling going backwards/forwards in the audio while keeping track of where in the text you are (complicated by the fact that sound != visual text);
  • (coming) Visualising what is being recited as well as allowing highlighted text to be recited.
With "extra markup" do you mean the utterance tags which we inject into the end of the html? We've on purpose kept this separate from the main content of the page so as to not affect any other workflows/extensions/gadgets. If the delay (segmenting+loading the first audio) is acceptable I guess (without having looked into it) that it might be possible to only insert it once you press play (or some other trigger) by working on the same content as we currently access in ParserAfterTidy. Issues which might be cause by such a change are: segmenting/cleaning isn't cached, delayed playback of first sentence, more work on client as opposed to server.
On the MW extension side I think we are not married to any particular solution, so suggestions/thoughts or known similar problems/solutions are all welcome if they help us reach the desired endpoint in a more efficient way.
Apologies for the slightly hurried answer. I'm heading off on Christmas holidays, so I probably won't be able to respond in person until after the holiday season. /André Costa (WMSE) (talk) 17:53, 19 December 2016 (UTC)Reply
I haven't really thought about this very hard, but I guess my main concern would be that if the utterance info is roughly the same size as the page content, this would double the page size, which seems like a high cost to pay. You might also consider inserting the data into a JS variable as JSON instead of appending things to the end of the HTML (mostly because it seems slightly cleaner; the utterance markup isn't really valid HTML afaik). Loading the data in a separate request does indeed introduce a lot of cache-consistency issues that are very difficult to solve. Another concern I have is that ParserAfterParse is called in a bunch of places, not just for the main body content, so I'm not sure it's the most appropriate hook. I'd also be a bit wary that all that DomDocument manipulation might be memory- or time-heavy on really large pages (but I don't know; I certainly haven't done any benchmarks). You may want to write up your proposed implementation plan and submit it to ArchCom for comments - they might be able to give advice on how best to integrate into MediaWiki.
I'll also be at dev summit, perhaps we could talk more about this there. Bawolff (talk) 20:24, 19 December 2016 (UTC)Reply
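To illustrate Bawolff's suggestion, the utterance data could be serialised into a JS config variable instead of trailing markup; the variable name and fields below are invented for the example.

```python
import json

# Hypothetical payload that the PHP side could expose as a config
# variable; none of these names come from the actual extension.
utterances = [
    {
        "index": 0,
        "content": "Wikispeech is a text-to-speech extension.",
    },
]
print(json.dumps({"wgWikispeechUtterances": utterances}))
```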
I agree that the current implementation adds a lot of extra data and may not be the best. One of the main reasons it's done this way is to do as much as possible server-side, to allow for low-end user devices. As it is now, you could get away without even using JS if you tweaked the server a bit, so that you could get a direct link to the audio. Another reason for the current implementation is that it was the first somewhat reasonable solution I could come up with; if there are other ways of conveying data from PHP to JS, I'd be happy to hear them.
A note about the hook: as André says above, ParserAfterTidy is the one used to retrieve the HTML. We've noticed that it's called for more than just the body, but since this hasn't broken anything (yet), fixing it isn't highly prioritized. Again, I'd be happy to hear if someone knows a better hook (or equivalent) for getting the HTML.
/Sebastian Berlin (WMSE) (talk) 11:40, 20 December 2016 (UTC)Reply

Get extension and customization[edit]

Hello, how do you get this extension? How do you then customize it with the various sounds of my language? My language is Veneto and I would like to integrate it into vec.wiki, not only for the deaf, but also for those who would like to practice the Venetian language and hear it spoken. We would like to integrate it into our wiki as a gadget. We like the idea! Thank you in advance for your reply, and keep up the good work! --Fierodelveneto (talk) 17:56, 29 June 2020 (UTC)Reply

@Karl Wettin (WMSE), Sebastian Berlin (WMSE), and André Costa (WMSE): (I tag those who made the latest changes for visibility to my comment) --Fierodelveneto (talk) 17:58, 29 June 2020 (UTC)Reply
@Fierodelveneto: We wrote a small article about adding languages and voices to the extension. As you'll see, it's not that easy right now. There is an upcoming project that will be used to create new languages and voices, but it's going to take a while (a year?) before there is anything to show. Karl Wettin (WMSE) (talk) 10:38, 3 July 2020 (UTC)Reply
@Karl Wettin (WMSE): Hi, thank you very much for your reply. Ah, what a pity! Anyway, if you need help with regard to Veneto, I'm available. We always try to be a wiki that tries gadgets in a "test" version, both to detect their problems (thus helping the programmers) and above all because we want to be a wiki that gives access to all resources. I believed it was somehow possible to make a local modification (inside the wiki) to how the various sounds are read. Too bad that's not the case. It is truly a great project. If you have new information, please tell me! Keep up the good work! --Fierodelveneto (talk) 10:51, 3 July 2020 (UTC)Reply

Basque language[edit]

Hello @Karl Wettin (WMSE), Sebastian Berlin (WMSE), and André Costa (WMSE): ! Three years ago we got "everything" (not really) ready to add the Basque language to this project. I'm following the task at Phabricator but have never gotten an answer about its readiness. Is it possible to deploy now, or do we have to wait? Is the Basque language incorporated as intended? Thanks for your great work! -Theklan (talk) 11:26, 3 September 2020 (UTC)Reply

Hi @Theklan: !
For a number of reasons we've been forced to push Basque a bit further into the future in the current timetable. I'm afraid there are no specific dates or official schedules to show you. We are preparing for a beta release within the WMF infrastructure with support for Swedish, Arabic and English.
There has however been a bit of progress with Basque. Speechoid (the Wikispeech speech synthesis backend framework) has support for AhoTTS (the Basque TTS) but we have yet to implement the structures needed for building and deploying it within the WMF infrastructure.
We also need to take a closer look at the patches you supplied to our project, as they have not been merged and the codebase has since undergone several large changes. At the very least we need to look at the current text segmenter and see how it compares to the solution you submitted with regard to abbreviations. Perhaps our new segmenter will handle the problems you identified, or perhaps we'll have to refactor our project a bit to support language-specific segmenters. In the latter case this would have to be queued behind pre-existing commitments.
Once the beta release is done we will take a look at what is missing for Basque to be added as an available language, taking the pre-existing work into consideration. At that point we will get back to you. As a minority language Basque is of extra interest to us. --Karl Wettin (WMSE) (talk) 11:26, 17 September 2020 (UTC)Reply

How to use this extension?[edit]

How to use this extension to machine-read Wikipedia pages? RIT RAJARSHI (talk) 09:56, 14 September 2020 (UTC)Reply

I cannot find any option or settings that can enable this feature. RIT RAJARSHI (talk) 09:57, 14 September 2020 (UTC)Reply

The extension is not yet available on any Wikipedia version. We are working on getting it enabled as a beta feature. /Sebastian Berlin (WMSE) (talk) 10:13, 7 October 2020 (UTC)Reply

@Sebastian Berlin (WMSE): let us know when we can implement it on Wikipedia; we can't wait on vec.wiki! --ꜰɪᴇʀᴏᴅᴇʟᴠᴇɴᴇᴛᴏ (Talk)-(Contributions) 20:45, 30 March 2021 (UTC)Reply

Two minor findings[edit]

  • To change speed (under the gear icon, settings), both saving the setting and reloading the page are required (Chrome).
  • In Swedish it correctly read out "cm" as "centimeters", but not "kg" as "kilograms"; it ended up as "ghrr".

Best regards, LittleGun (talk) 11:34, 3 June 2021 (UTC).Reply

Thank you for the feedback.

To change speed (under the gear icon, settings), both saving the setting and reloading the page are required (Chrome).

Yes, for the settings to take effect you have to reload the page. There is a notice in the settings popup dialogue explaining this (though it hasn't been translated into Swedish yet, I noticed). Technically the settings take effect right away, but they don't apply to already loaded utterances. This could be fixed, but it's not trivial, and since the settings probably won't be changed that often it shouldn't impact the user too much. I did create a task for it anyway: phab:T284218.

In Swedish it correctly read out "cm" as "centimeters", but not "kg" as "kilograms"; it ended up as "ghrr".

This is probably due to the word missing from the lexicon, in which case the speech synthesis "guesses" how to pronounce it, which sometimes works and sometimes doesn't. I'll have a look in the lexicon and see if I can fix it. Sebastian Berlin (WMSE) (talk) 12:14, 3 June 2021 (UTC)Reply
I had a look in the lexicon, and it looks like "kg" should be pronounced as "kilo". Could you give me the article where you encountered this so I can have a closer look? Sebastian Berlin (WMSE) (talk) 13:44, 3 June 2021 (UTC)Reply
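For illustration, the lexicon behaviour described above boils down to a lookup with a grapheme-to-phoneme fallback; the transcriptions and the g2p guesser below are invented.

```python
# Invented example transcriptions; real lexica store phonetic entries.
LEXICON = {
    "cm": "s E n t I m e t e r",
    "kg": "k i: l u",
}


def transcribe(word, g2p_guess):
    """Use the lexicon entry when present, else guess from the spelling."""
    entry = LEXICON.get(word.lower())
    return entry if entry is not None else g2p_guess(word)
```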

Supported languages[edit]

Thanks for this exciting and useful software! I'm very happy to see it deployed!

Is there a list of supported languages somewhere?

It does not seem to support Hebrew (I am not surprised; TTS is exceptionally hard in Hebrew). The controls do appear on HEWP pages when the script is invoked, but pressing the PLAY button only changes it to the STOP icon and highlights the page title; no sound is emitted and the highlight does not advance. Ijon (talk) 14:23, 11 June 2021 (UTC)Reply

Currently Arabic, English and Swedish are supported. They are listed under Extension:Wikispeech#TTS engines, under MaryTTS, which is the only TTS engine currently supported. We aim to add more TTS engines and languages in the future.
The controls should not show up on unsupported pages, e.g. when the page language isn't supported. This is a bug that was introduced when we had to change how the modules are run. Sebastian Berlin (WMSE) (talk) 15:18, 11 June 2021 (UTC)Reply

How to add languages?[edit]

It would be nice to have some documentation on what is needed to add support for more languages. With the documentation written, perhaps the workload could be distributed across the community? Ainali (talk) 15:13, 11 June 2021 (UTC)Reply

Estonian text-to-speech models[edit]

I found freely licensed text-to-speech models for Estonian here: https://koodivaramu.eesti.ee/tartunlp/text-to-speech Perhaps they can be of use? Ainali (talk) 11:51, 30 June 2022 (UTC)Reply