Accentor: An Explicit Lexical Stress Model for TTS Systems

Abstract

The accurate placement of word stress is a critical component of the correct pronunciation of words. Contemporary publicly available text-to-speech (TTS) datasets have a relatively narrow coverage of unique words, which causes modern neural TTS systems to synthesize speech that often suffers from lexical stress errors. In this work, we propose an efficient approach for explicitly modeling lexical stress knowledge with a dedicated Accentor neural network. The Accentor is trained separately on a large lexically diverse stress-annotated text corpus, which we automatically compile using an automatic speech recognition system. We demonstrate that the Accentor can be composed with a TTS acoustic model to reliably control the word stress encoded in the generated acoustic features. Experiments show that our approach increases the stress prediction accuracy by a factor of 12 in comparison to other modern TTS systems and improves the naturalness and comprehensibility of the synthesized speech.

Paper

Controllable acoustic model

The aim of the acoustic model is to process text annotated with stress information and produce acoustic features that respect the stress patterns present in the input. The model thereby enables control of stress placement in the synthesized speech. The following samples (two per sentence) demonstrate that the stress encoded in the acoustic features accurately follows the lexical stress labels specified in the input text (a grave accent placed immediately before each stressed vowel). Underlined words indicate the changes in lexical stress placement between the two samples of the same sentence.

Input text	Synthesized speech
`Утринното л`ятно сл`ънце изск`окна вис`око над Ст`ара планин`а.
Утр`инното л`ятно сл`ънце `изскокна висок`о над Ст`ара план`ина.
Сег`а к`азвай: какв`о му се п`ада на т`оя б`яс нечест`ив?
С`ега казв`ай: какв`о му с`е пад`а на т`оя б`яс неч`естив?
А к`ак стр`астно об`ичаше т`ой да здрав`исва вс`ичките си при`ятели и позн`айници!
`А как стр`астно обич`аше т`ой д`а здр`ависва вс`ичките си прият`ели и познайниц`и!
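
The stress notation used in the input text above can be read programmatically. The following is a minimal sketch, assuming only the convention visible in the samples (a grave accent immediately before the stressed letter); the function name `parse_stress` is hypothetical and is not part of the Accentor or of the acoustic model.

```python
from __future__ import annotations

STRESS_MARK = "`"  # grave accent, as it appears in the sample texts


def parse_stress(word: str) -> tuple[str, list[int]]:
    """Split a stress-annotated word into plain text and stressed positions.

    Assumption: each stress mark applies to the character that follows it.
    Returned indices refer to positions in the plain (mark-free) word.
    """
    plain: list[str] = []
    stressed: list[int] = []
    mark_pending = False
    for ch in word:
        if ch == STRESS_MARK:
            mark_pending = True
            continue
        if mark_pending:
            stressed.append(len(plain))  # index of the letter about to be added
            mark_pending = False
        plain.append(ch)
    return "".join(plain), stressed


# The same word with two different stress placements, taken from the samples.
print(parse_stress("планин`а"))  # ('планина', [6])
print(parse_stress("план`ина"))  # ('планина', [4])
```

Words without a mark, such as clitics, simply yield an empty list of stressed positions.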

Comparison of our approach with a standard FastPitch model

We compare the composition of the Accentor and the controllable acoustic model (14.2M parameters in total) with a standard FastPitch model (44.7M parameters). The following samples demonstrate that our approach significantly increases the stress prediction accuracy and improves the naturalness and comprehensibility of the synthesized speech. Each speech sample is accompanied by a copy of the input text annotated with stress marks that indicate the placement of lexical stress in the synthesized speech. Words whose pronunciations contain lexical stress errors are typeset in bold.

Input text	FastPitch	Our approach
Гражданин видял фотографията на търсения мъж по телевизията и се обадил.	Гр`ажданин вид`ял фотогр`афията на търс`ения м`ъж по телев`изията и се обад`ил.	Гр`ажданин вид`ял фотогр`афията на т`ърсения м`ъж по телев`изията и се об`адил.
Тя дали беше права, че не ѝ бяха дали всички права.	Т`я дал`и б`еше прав`а, че не `ѝ б`яха д`али вс`ички пр`ава.	Т`я дал`и б`еше пр`ава, че не `ѝ б`яха д`али вс`ички прав`а.
Сепваше падащо листо или търкулнато от катеричка орехче.	С`епваше п`адащо л`исто или търкулн`ато от катер`ичка ор`ехче.	С`епваше п`адащо лист`о или търк`улнато от к`атеричка `орехче.
Бъдещият виртуоз личеше от всяка нота.	Б`ъдещият вирту`оз л`ичеше от всяка нот`а.	Б`ъдещият вирту`оз лич`еше от вс`яка н`ота.
Главният сладкар измайстори огромен сладкиш, на чийто връх стои шейна.	Гл`авният сл`адкар изм`айстори огр`омен сл`адкиш, на ч`ийто вр`ъх сто`и ш`ейна.	Гл`авният сладк`ар измайстор`и огр`омен сладк`иш, на ч`ийто вр`ъх сто`и шейн`а.
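
The two annotated transcripts of a sentence can be compared word by word to locate where the systems disagree on stress placement. The sketch below assumes the grave-accent notation used in the table above; the function name `stress_disagreements` is hypothetical. It only reports disagreements and does not decide which placement is correct, and it is not the evaluation procedure used in the paper.

```python
from __future__ import annotations

STRESS_MARK = "`"       # grave accent before the stressed letter
PUNCTUATION = ".,!?:;"  # trailing punctuation to ignore in the report


def stress_disagreements(annotated_a: str, annotated_b: str) -> list[str]:
    """Return the plain words whose stress marking differs between two
    annotations of the same sentence."""
    words_a = annotated_a.split()
    words_b = annotated_b.split()
    if len(words_a) != len(words_b):
        raise ValueError("annotations have different numbers of words")

    differing = []
    for a, b in zip(words_a, words_b):
        plain_a = a.replace(STRESS_MARK, "")
        plain_b = b.replace(STRESS_MARK, "")
        if plain_a != plain_b:
            raise ValueError(f"word mismatch: {plain_a!r} vs {plain_b!r}")
        if a != b:  # same letters, different stress marks
            differing.append(plain_a.strip(PUNCTUATION))
    return differing


fastpitch = "Б`ъдещият вирту`оз л`ичеше от всяка нот`а."
ours = "Б`ъдещият вирту`оз лич`еше от вс`яка н`ота."
print(stress_disagreements(fastpitch, ours))  # ['личеше', 'всяка', 'нота']
```

Given a trusted reference annotation as one of the two inputs, the share of words absent from the returned list would give a simple word-level stress accuracy.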

Unclear or blurred lexical stress characteristics with a standard FastPitch model

We noticed deficiencies in the speech synthesized with the standard FastPitch model. It often suffers from unclear or blurred lexical stress, which makes the speech sound unnatural and sometimes greatly reduces its comprehensibility. These defects are not present in the speech synthesized with our approach. The following samples demonstrate such defects. Words whose pronunciations contain unclear lexical stress are typeset in bold.

Input text	FastPitch	Our approach
Свидетели на гледката бяха още няколкостотин лондонски любители на театъра.
Нападателят е арестуван.
Алуминият е включен и в окачването.

The controllable acoustic model is based on a novel architecture called StreamSpeech, which was published at ICASSP 2023.