This thesis deals with emotion recognition from speech signals. Several feature sets
and classifiers are tested for their performance. Feature sets of different sizes are
compared: the feature set of the Institute for Signal Processing and System Theory
(ISS) as well as the standardised feature sets of eight challenges in which
paralinguistic information had to be recognised. The question is whether there is a
connection between the size of a feature set and its performance. The feature sets
are investigated both with and without feature selection (SFFS), in combination with
a Naive Bayes classifier, a k-Nearest-Neighbour classifier and a Support Vector
Machine. The goal of this thesis is to find the features that are selected most
frequently in the best-performing feature sets.
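The abstract names SFFS as the feature selection method. As a rough illustration of the general idea behind sequential floating forward selection, the Python sketch below alternates a greedy forward step with conditional backward steps. The function `sffs`, the `toy_criterion` and all of its numbers are hypothetical and chosen only to make the floating behaviour visible. The source does not state which selection criterion the thesis uses; a wrapper setup would presumably plug in a classifier's validation performance instead.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Subset = FrozenSet[int]


def sffs(n_features: int,
         criterion: Callable[[Subset], float],
         max_size: int) -> Dict[int, Tuple[float, Subset]]:
    """Return the best (score, subset) found for each subset size up to max_size."""
    if not 0 < max_size <= n_features:
        raise ValueError("max_size must be between 1 and n_features")

    selected: Subset = frozenset()
    best: Dict[int, Tuple[float, Subset]] = {}

    while len(selected) < max_size:
        # Forward step: add the single feature that maximises the criterion.
        candidates = [selected | {f} for f in range(n_features)
                      if f not in selected]
        selected = max(candidates, key=criterion)
        score = criterion(selected)
        size = len(selected)
        if size not in best or score > best[size][0]:
            best[size] = (score, selected)

        # Floating backward steps: drop a feature only while the reduced
        # subset beats the best subset of that size seen so far.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in selected), key=criterion)
            reduced_score = criterion(reduced)
            if reduced_score <= best[len(reduced)][0]:
                break
            selected = reduced
            best[len(reduced)] = (reduced_score, reduced)

    return best


# Hypothetical per-feature merits (integers to avoid rounding effects).
MERIT = [60, 50, 45, 5]


def toy_criterion(subset: Subset) -> int:
    """Made-up criterion: feature 0 is strong alone, 1 and 2 are strong together."""
    score = sum(MERIT[f] for f in subset)
    if {1, 2} <= subset:
        score += 50  # features 1 and 2 complement each other
    if 0 in subset and (1 in subset or 2 in subset):
        score -= 40  # feature 0 is redundant next to 1 or 2
    return score


result = sffs(n_features=4, criterion=toy_criterion, max_size=3)
for size in sorted(result):
    score, subset = result[size]
    print(size, sorted(subset), score)
# 1 [0] 60
# 2 [1, 2] 145
# 3 [0, 1, 2] 165
```

With this toy criterion, a purely greedy forward search would keep feature 0 in the two-feature subset and end up with [0, 1]. The backward step lets the search revise that early choice and find [1, 2].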
Table of Contents
- 1. Introduction
- 1.1. Motivation
- 1.2. Emotion Recognition
- 1.2.1. Representation of Emotion
- 1.2.2. Pattern Recognition
- 1.3. State-of-the-Art
- 1.4. Contribution of this Thesis
- 1.5. Structure
- 2. Feature Extraction
- 2.1. Human Speech Characteristics
- 2.1.1. Source-Filter Model
- 2.1.2. Psychoacoustics and Voice Perception
- 2.2. Main Idea of Feature Extraction
- 2.3. The Given Feature Sets
- 2.3.1. INTERSPEECH 2009 Emotion Challenge
- 2.3.2. INTERSPEECH 2010 Paralinguistic Challenge
- 2.3.3. INTERSPEECH 2011 Speaker State Challenge
- 2.3.4. The First International Audio-Visual Emotion Challenge (AVEC 2011)
- 2.3.5. INTERSPEECH 2012 Speaker Trait Challenge
- 2.3.6. The Continuous Audio-Visual Emotion Challenge (AVEC 2012)
- 2.3.7. INTERSPEECH 2013 Computational Paralinguistics Challenge
- 2.3.8. The Continuous Audio-Visual Emotion and Depression Recognition Challenge (AVEC 2013)
- 2.3.9. The ISS Feature Set
- 2.4. Local Features
- 2.4.1. Time-Domain Features
- 2.4.2. Spectral Features
- 2.4.3. Pitch Features
- 2.5. Global Features
- 2.5.1. Harmony Features
- 2.5.2. Functionals
- 2.5.3. Segmentation
- 2.6. Feature Selection
- 3. Classification
- 3.1. Main Idea
- 3.2. Bayesian Decision Theory
- 3.3. Bayes Plug-in
- 3.3.1. Gaussian Model
- 3.3.2. Maximum-Likelihood Estimation
- 3.4. k-Nearest-Neighbors
- 3.5. Support Vector Machine
- 3.5.1. Hard-margin SVM
- 3.5.2. Soft-margin SVM
- 3.5.3. Kernel Trick
- 3.5.4. Multiclass SVM
- 4. Material and Methods
- 4.1. Speech Database
- 4.2. Libraries for Feature Extraction and Classification
- 4.2.1. ISS Classification Toolbox
- 4.2.2. libSVM
- 4.2.3. openSMILE
- 4.3. Evaluation Method
- 5. Simulation and Results
- 5.1. Implementation
- 5.1.1. Naive Bayes Classifier
- 5.1.2. k-Nearest-Neighbour Classifier
- 5.1.3. Support Vector Machine
- 5.2. Comparison of Classifiers
- 5.3. Comparison of Feature Sets
- 5.4. Selected Features
- 5.5. Confusion Matrix
- 6. Discussion
- 6.1. Comparison with Literature
- 6.2. Optimal Feature Set
- 6.3. Curse of Dimensionality
- 6.4. Optimal Classifier
- 6.5. Challenges
- 6.6. Summary
- 6.7. Outlook
Objectives and Key Themes
This thesis investigates the relationship between the size of a feature set and the performance of emotion recognition from speech signals. It compares the feature sets of several paralinguistic challenges with that of the Institute for Signal Processing and System Theory (ISS), using three classifiers: Naive Bayes, k-Nearest-Neighbour, and Support Vector Machine.
- The impact of feature set size on emotion recognition accuracy
- Comparison of different feature extraction techniques and classifiers
- Identification of features frequently selected for optimal performance
- Analysis of the curse of dimensionality in relation to emotion recognition
- Evaluation of the performance of various classifiers for emotion recognition
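One intuition behind the question of feature set size can be sketched with a k-Nearest-Neighbour classifier: a single uninformative feature with a large value range can dominate the distance computation and change the decision. The example below is a minimal illustration under stated assumptions, not the thesis's implementation. The function `knn_predict`, the class labels and all data values are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]],
                train_y: Sequence[str],
                query: Sequence[float],
                k: int = 3) -> str:
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one informative feature that separates the classes.
labels: List[str] = ["calm", "calm", "calm", "angry", "angry", "angry"]
informative = [[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]]

# Hypothetical second feature that carries no class information
# but spans a much larger value range.
noise = [5.0, -4.0, 9.0, 0.5, -0.5, 1.0]
extended = [row + [n] for row, n in zip(informative, noise)]

small = knn_predict(informative, labels, [0.3])
large = knn_predict(extended, labels, [0.3, 0.0])
print(small, large)
# calm angry
```

The query lies among the "calm" samples along the informative feature, yet the added feature moves all three nearest neighbours to the other class. Feature selection or feature scaling are two common ways to reduce this effect.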
Chapter Summaries
- Chapter 1 introduces the concept of emotion recognition from speech signals, highlighting its significance and the challenges involved. It provides an overview of different emotion representation models, pattern recognition techniques, and the current state-of-the-art in the field. The chapter also outlines the specific contribution of this thesis and its structure.
- Chapter 2 covers feature extraction: human speech characteristics, the main idea behind feature extraction, and a detailed description of the feature sets used in the study. It then treats local and global features, including time-domain features, spectral features, pitch features, harmony features, functionals, and segmentation, and concludes with an explanation of feature selection.
- Chapter 3 focuses on classification. It presents the main idea, Bayesian decision theory, and the Bayes plug-in approach, and then examines the k-Nearest-Neighbour classifier and the Support Vector Machine together with their theoretical foundations.
- Chapter 4 outlines the material and methods used in the study. It describes the speech database, the libraries used for feature extraction and classification, and the evaluation method.
- Chapter 5 presents the simulation and results of the study. It includes details of the implementation, comparisons of different classifiers and feature sets, the selected features, and a confusion matrix illustrating the recognition performance.
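To make the role of the confusion matrix in Chapter 5 concrete, the following Python sketch builds one from true and predicted labels and derives per-class recall from it. The emotion labels and predictions are invented for illustration and are not results from the thesis. Averaging the per-class recalls (often called unweighted average recall) is an assumption here; the source does not name the evaluation measure that was used.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def per_class_recall(matrix: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Hypothetical labels and predictions, for illustration only.
classes = ["anger", "joy", "neutral"]
true = ["anger", "anger", "anger", "joy", "joy",
        "neutral", "neutral", "neutral"]
pred = ["anger", "joy", "anger", "joy", "anger",
        "neutral", "neutral", "joy"]

matrix = confusion_matrix(true, pred, classes)
recalls = per_class_recall(matrix)
average_recall = sum(recalls) / len(recalls)

for name, row in zip(classes, matrix):
    print(name, row)
# anger [2, 1, 0]
# joy [1, 1, 0]
# neutral [0, 1, 2]
print(round(average_recall, 3))
# 0.611
```

Reading along a row shows which emotions a given class is mistaken for, which a single accuracy figure would hide.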
Keywords
This thesis explores emotion recognition from speech signals using various feature sets and classifiers. It compares the effectiveness of different feature sets, analyses the impact of feature set size on performance, and evaluates how well each classifier suits the task. Key terms: emotion, features, classification, speech, Naive Bayes, k-Nearest-Neighbour, Support Vector Machine.
- Cite this text
- Tobias Gruber (Author), 2014, Comparison of different features sets and classifiers for emotion recognition of speech, Munich, GRIN Verlag, https://www.grin.com/document/300174