In this study, a model-based speech synthesis prototype for spoken Tigrinya is developed within an integrated speech synthesis framework, the Festival speech synthesis system. The frontend of the framework is a grapheme-based synthesizer, and the backend is the CLUSTERGEN synthesizer, an instance of statistical parametric speech synthesis. The main reason for choosing this framework was that Tigrinya is an under-resourced language. Each of the 249 Tigrinya graphemes was treated as an independent phoneme, irrespective of the language's 32 phonological phonemes.
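Treating every grapheme as its own phoneme means the phone inventory can be read directly off the text. The following minimal Python sketch (not the author's code) counts every distinct Ethiopic syllable character in a set of sentences. It assumes the syllables occupy U+1200 to U+135A of the Unicode Ethiopic block, so Ethiopic punctuation such as '።' and Ethiopic numerals are skipped; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Assumption: Ethiopic syllables (fidel) occupy U+1200..U+135A; punctuation
# such as U+1362 '።' and the Ethiopic numerals lie after this range.
FIDEL_FIRST, FIDEL_LAST = 0x1200, 0x135A


def is_fidel(ch: str) -> bool:
    """True if ch is an Ethiopic syllable character."""
    return FIDEL_FIRST <= ord(ch) <= FIDEL_LAST


def grapheme_inventory(sentences: Iterable[str]) -> Counter:
    """Count each distinct fidel; every one is treated as its own phone."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence if is_fidel(ch))
    return counts


if __name__ == "__main__":
    inventory = grapheme_inventory(["ሰላም ሰላም።"])
    print(len(inventory), inventory.most_common())
```

Run over a full corpus, the size of such an inventory shows how many of the 249 graphemes actually receive training data.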
For this study, 800 previously prepared sentences, re-recorded in the recommended way, were used as the corpus. The adopted methodology was amended and extended. The entire prototype was built automatically. A tenfold threshold method was used for training and testing the prototype, and the resulting prototype is deployable on Android. The synthesized speech obtained a Mel Cepstral Distortion (MCD) score of 5.82, MCD being the framework's built-in objective metric, while the subjective evaluation gave scores of 4.5 and 4.3 out of 5 for naturalness and intelligibility, respectively. Both evaluations were interpreted as showing that the synthesized speech is very close to natural human speech. Finally, directions for future work are indicated.
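The source does not spell out the "tenfold threshold method". One reading, assumed here and not confirmed by the study, is that every tenth utterance is held out for testing and the other nine in ten are used for training; under that assumption the 800-sentence corpus splits into 720 training and 80 test utterances. The Python sketch below shows this split; the utterance IDs are hypothetical.

```python
def tenfold_split(utt_ids, fold=10):
    """Hold out every `fold`-th utterance for testing; train on the rest.

    Assumption: this is one possible reading of the "tenfold threshold method".
    """
    train, test = [], []
    for index, utt_id in enumerate(utt_ids, start=1):
        (test if index % fold == 0 else train).append(utt_id)
    return train, test


if __name__ == "__main__":
    ids = [f"tig_{n:04d}" for n in range(1, 801)]  # hypothetical utterance IDs
    train, test = tenfold_split(ids)
    print(len(train), len(test), test[:2])
```

If the method instead denotes ten-fold cross-validation, each tenth of the corpus would serve as the test set in turn, and the split would be repeated ten times.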
Table of Contents
1. Introduction
2. Tigrinya Language and Its Writing System
3. Literature Review
3.1 Natural Language Processing (NLP)
3.2 Digital Signal Processing (DSP)
4. Related Works
5. The Proposed Solution
6. Discussion
7. Conclusion
Research Objectives and Key Topics
This study aims to develop a model-based speech synthesis prototype for the Tigrinya language using the integrated Festival speech synthesis framework, addressing the resource and transliteration limitations of previous concatenative synthesis approaches. The research focuses on leveraging the phonetic nature of the Tigrinya script to create a more efficient and natural-sounding synthesis system.
- Development of a Grapheme-based synthesis model for Tigrinya.
- Implementation of the CLUSTERGEN statistical parametric synthesis backend.
- Elimination of complex transliteration processes by mapping characters directly to the International Phonetic Alphabet (IPA).
- Performance evaluation using both objective metrics (Mel Cepstral Distortion) and subjective human feedback.
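For reference, Mel Cepstral Distortion is commonly defined, for a pair of time-aligned frames with mel-cepstral coefficients c_d (natural) and c'_d (synthesized), as MCD = (10 / ln 10) * sqrt(2 * sum over d of (c_d - c'_d)^2), averaged over all frames and usually excluding the energy coefficient c_0; lower values are better. The Python sketch below implements that formula, assuming the frames are already aligned. It is an illustration, not Festival's built-in implementation, which may differ in details such as the coefficient range.

```python
import math
from typing import Sequence

Frame = Sequence[float]

# 10 / ln(10) converts the natural-log cepstral distance to decibels
DB_FACTOR = 10.0 / math.log(10.0)


def mel_cepstral_distortion(ref: Sequence[Frame], syn: Sequence[Frame]) -> float:
    """Mean per-frame MCD in dB; coefficient 0 (energy) is excluded.

    Assumes ref and syn are already time-aligned frame by frame.
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("need the same non-zero number of aligned frames")
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same number of coefficients")
        squared = sum((a - b) ** 2 for a, b in zip(r[1:], s[1:]))
        total += DB_FACTOR * math.sqrt(2.0 * squared)
    return total / len(ref)


if __name__ == "__main__":
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[0.7, 0.3, -0.2], [1.1, 0.4, 0.0]]
    print(round(mel_cepstral_distortion(reference, synthesized), 3))
```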
Excerpt from the Book
The Proposed Solution
The proposed solution addresses the gap identified in the literature review of previous attempts to synthesize Tigrinya in two ways. First, it abandons the transliteration of Ge'ez characters into the Latin alphabet, mapping them directly to IPA instead, and replaces the backend with a model-based synthesizer, CLUSTERGEN. Second, it takes the Tigrinya writing system, rather than the language's 32 phonological phonemes, as the basis for synthesis. The architecture of the proposed solution is depicted in the diagram below, which shows how the whole synthesis process fits together.
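The direct mapping can be sketched as a lookup from each fidel straight to an IPA string, with no Latin transliteration in between. The table in this Python sketch is a hypothetical six-entry excerpt, not the study's 249-entry mapping, and it simplifies the sixth order to consonant plus /ɨ/ even though that order is often realized without a vowel.

```python
# Hypothetical excerpt of a fidel-to-IPA table; a full table would cover
# all 249 graphemes. Sixth-order forms are simplified to consonant + ɨ.
FIDEL_TO_IPA = {
    "ለ": "lə",
    "ላ": "la",
    "ሰ": "sə",
    "ም": "mɨ",
    "ሓ": "ħa",
    "ዓ": "ʕa",
}


def fidel_to_ipa(text: str) -> list[str]:
    """Map each fidel directly to IPA, one unit per grapheme, skipping whitespace."""
    units = []
    for ch in text:
        if ch.isspace():
            continue
        try:
            units.append(FIDEL_TO_IPA[ch])
        except KeyError:
            raise KeyError(f"no IPA entry for {ch!r} (U+{ord(ch):04X})") from None
    return units


if __name__ == "__main__":
    print(fidel_to_ipa("ሰላም"))
```

Keeping one IPA string per grapheme preserves the grapheme-as-phoneme unit while removing the Latin intermediate step.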
The corpus consisted of 800 previously prepared sentences, re-recorded in a professional studio by an experienced female journalist who is a native speaker.
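Festvox voice-building scripts, which provide the CLUSTERGEN tools, conventionally read the recording prompts from a file with one entry per line in the form ( utterance_id "text" ). The Python sketch below writes and parses lines in that form; IDs such as tig_0001 are hypothetical, and the sketch simply rejects text containing double quotes rather than handling escapes.

```python
import re

# One prompt per line: ( utterance_id "text" )
PROMPT_RE = re.compile(r'^\(\s*(\S+)\s+"([^"]*)"\s*\)$')


def format_prompt(utt_id: str, text: str) -> str:
    """Render one prompt line; quotes in the text are not supported."""
    if '"' in text:
        raise ValueError("text must not contain double quotes")
    return f'( {utt_id} "{text}" )'


def parse_prompts(lines):
    """Return (utterance_id, text) pairs, skipping blank lines."""
    prompts = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        match = PROMPT_RE.match(line)
        if match is None:
            raise ValueError(f"line {number}: not a prompt entry: {line!r}")
        prompts.append((match.group(1), match.group(2)))
    return prompts


if __name__ == "__main__":
    lines = [format_prompt("tig_0001", "ሰላም"), ""]
    print(parse_prompts(lines))
```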
Summary of Chapters
Introduction: Provides an overview of speech synthesis applications and identifies the necessity for a model-based approach for under-resourced languages like Tigrinya.
Tigrinya Language and Its Writing System: Explains the phonetic characteristics of the Tigrinya script, which consists of 249 characters organized into seven orders.
Literature Review: Discusses the fundamentals of Natural Language Processing and Digital Signal Processing, while categorizing various text-to-speech synthesis methods.
Related Works: Examines previous efforts in Tigrinya speech synthesis, highlighting the limitations of concatenative models and existing transliteration practices.
The Proposed Solution: Details the transition to a CLUSTERGEN-based architecture and the use of grapheme-to-phoneme conversion to improve synthesis efficiency.
Discussion: Outlines the technical implementation, including the handling of pharyngeal consonants and the training process using a tenfold threshold method.
Conclusion: Summarizes the successful development of an Android-deployable prototype that achieves natural and intelligible speech output.
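The seven-order organization of the script can be recovered arithmetically, because the Unicode Ethiopic block lays out each primary consonant row as eight consecutive code points starting at a multiple of eight from U+1200, with the seven orders in the first seven slots. The Python sketch below assumes that layout and restricts itself to a hypothetical subset of primary rows, since labialized series use a different layout; the vowel values follow the common description of the orders (ə, u, i, a, e, ɨ, o).

```python
# Vowel of each order, 1st..7th (the 6th is often realized with no vowel)
ORDER_VOWELS = ("ə", "u", "i", "a", "e", "ɨ", "o")

# Hypothetical subset of primary consonant rows, given by their 1st-order forms
PRIMARY_ROWS = set("ሀለሐመረሰቀበተነአከወዐዘየደገጠፈ")

ETHIOPIC_START = 0x1200  # rows are assumed to start every 8 code points from here


def decompose(fidel: str) -> tuple[str, int, str]:
    """Split a fidel into (1st-order base, order number 1..7, vowel)."""
    cp = ord(fidel)
    if cp < ETHIOPIC_START:
        raise ValueError(f"{fidel!r} is not an Ethiopic character")
    slot = (cp - ETHIOPIC_START) % 8
    base = chr(cp - slot)
    if base not in PRIMARY_ROWS or slot == 7:
        raise ValueError(f"{fidel!r} is not one of the seven orders of a known row")
    return base, slot + 1, ORDER_VOWELS[slot]


if __name__ == "__main__":
    for ch in "ሰላም":
        print(ch, decompose(ch))
```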
Keywords
Speech synthesis, Statistical parametric speech synthesis, Grapheme, Spoken language, Tigrinya, Festival framework, CLUSTERGEN, Natural Language Processing, Digital Signal Processing, Phonetic, Prototype, Mel Cepstral Distortion, Corpus, Under-resourced language, Android deployable.
Frequently Asked Questions
What is the primary focus of this research?
The research focuses on creating a more efficient, model-based speech synthesis prototype for the Tigrinya language, specifically by adopting the Festival speech synthesis framework.
What are the central thematic fields covered?
The work explores Natural Language Processing (NLP), Digital Signal Processing (DSP), phonetic analysis of the Tigrinya writing system, and statistical parametric speech synthesis.
What is the main goal of the proposed system?
The primary goal is to overcome the limitations of earlier concatenative systems by using a CLUSTERGEN model that directly processes Tigrinya graphemes into speech without relying on complex Latin-based transliteration.
Which scientific methods were employed?
The researcher utilized statistical parametric speech synthesis, CLUSTERGEN voice building tools, and a tenfold threshold method for model training and testing.
What topics are addressed in the main body?
The main body covers the linguistic structure of Tigrinya, a review of previous synthesis attempts, the architecture of the proposed solution, and a detailed discussion of the training and evaluation phases.
Which keywords best characterize this work?
Key terms include Speech synthesis, CLUSTERGEN, Grapheme, Tigrinya, statistical parametric synthesis, and phonetic modeling.
How were the 14 pharyngeal consonants handled in the system?
The system was specifically updated to include these 14 pharyngeal consonants (both voiceless and voiced) in the phoneset to ensure the synthesis system was efficient and accurate.
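The count of 14 is consistent with treating each of the seven orders of the two pharyngeal rows as a separate grapheme-phone: the voiceless row ሐ (/ħ/) and the voiced row ዐ (/ʕ/). This is an inference from the numbers, not a detail the study states. The Python sketch below generates such entries from their Unicode code points; the PhoneEntry record and its fields are hypothetical and do not reproduce Festival's phoneset format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneEntry:
    """Hypothetical phoneset entry for one grapheme-phone."""
    grapheme: str
    consonant_ipa: str
    voiced: bool
    order: int  # 1..7


# First-order code points of the two pharyngeal rows
PHARYNGEAL_ROWS = (
    (0x1210, "ħ", False),  # ሐ row: voiceless pharyngeal fricative
    (0x12D0, "ʕ", True),   # ዐ row: voiced pharyngeal fricative
)


def pharyngeal_entries() -> list[PhoneEntry]:
    """Build one entry per order of each pharyngeal row (2 x 7 = 14)."""
    entries = []
    for first, ipa, voiced in PHARYNGEAL_ROWS:
        for slot in range(7):  # the seven orders occupy consecutive code points
            entries.append(PhoneEntry(chr(first + slot), ipa, voiced, slot + 1))
    return entries


if __name__ == "__main__":
    print("".join(e.grapheme for e in pharyngeal_entries()))
```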
Why was the 'AMP' modification necessary?
An automatically derived phone '&' caused labeling errors because it appeared in nearly every word; changing it to 'AMP' resolved this issue during the EHMM labeling process.
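The fix amounts to renaming one phone symbol before labeling. A minimal Python sketch, assuming phone names are plain strings and that characters such as '&' are unsafe for the labeling tools (the UNSAFE set below is a hypothetical choice, not taken from EHMM's documentation), could rename and then validate the inventory:

```python
# Hypothetical set of characters to keep out of phone names
UNSAFE = set('&()";\'')
RENAMES = {"&": "AMP"}


def sanitize_phones(phones: list[str]) -> list[str]:
    """Apply renames, then reject any phone name that is still unsafe."""
    cleaned = [RENAMES.get(p, p) for p in phones]
    for p in cleaned:
        if not p or any(ch in UNSAFE or ch.isspace() for ch in p):
            raise ValueError(f"unsafe phone name: {p!r}")
    return cleaned


if __name__ == "__main__":
    print(sanitize_phones(["s", "&", "l"]))
```

Validating after renaming catches any other problematic symbol before the labeling step rather than during it.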
What were the results of the subjective evaluation?
The synthesized speech achieved a score of 4.5 for naturalness and 4.3 for intelligibility on a 5-point scale, indicating that the output is comparable to natural human speech.
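Scores like these are typically mean opinion scores: the average of listener ratings on a 1 to 5 scale, computed separately for naturalness and intelligibility. The Python sketch below assumes that aggregation; the ratings in the example are invented for illustration and are not the study's data.

```python
from typing import Sequence


def mean_opinion_score(ratings: Sequence[int], low: int = 1, high: int = 5) -> float:
    """Average listener ratings after checking they lie on the rating scale."""
    if not ratings:
        raise ValueError("no ratings")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} outside {low}..{high}")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Invented ratings, for illustration only
    naturalness = [5, 4, 4, 3]
    intelligibility = [4, 4, 5, 5, 3]
    print(mean_opinion_score(naturalness), mean_opinion_score(intelligibility))
```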
Is this prototype suitable for mobile platforms?
Yes, the research confirms that the developed prototype is Android-deployable, allowing for potential use on mobile devices.
Cite This Work
Luel Negasi Tewelde (Author), 2017, Grapheme Based Tigrinya Speech Synthesis Using Statistical Parametric Speech Synthesis, Munich, GRIN Verlag, https://www.grin.com/document/434779