This thesis deals with emotion recognition from speech signals using several feature sets and classifiers. Feature sets of different sizes are compared: the feature set of the Institute for Signal Processing and System Theory (ISS) as well as the standardised feature sets of eight paralinguistic challenges. The central question is whether there is a connection between the size of a feature set and its recognition performance. The feature sets are investigated both with and without SFFS feature selection, in combination with a Naive Bayes classifier, a k-Nearest-Neighbour classifier, and a Support Vector Machine. The goal of this thesis is to identify the features that are selected most frequently in the best-performing feature sets.
Table of Contents
1. Introduction
1.1. Motivation
1.2. Emotion Recognition
1.2.1. Representation of Emotion
1.2.2. Pattern Recognition
1.3. State-of-the-Art
1.4. Contribution of this Thesis
1.5. Structure
2. Feature Extraction
2.1. Human Speech Characteristics
2.1.1. Source-Filter Model
2.1.2. Psychoacoustics and Voice Perception
2.2. Main Idea of Feature Extraction
2.3. The Given Feature Sets
2.3.1. INTERSPEECH 2009 Emotion Challenge
2.3.2. INTERSPEECH 2010 Paralinguistic Challenge
2.3.3. INTERSPEECH 2011 Speaker State Challenge
2.3.4. The First International Audio-Visual Emotion Challenge (AVEC 2011)
2.3.5. INTERSPEECH 2012 Speaker Trait Challenge
2.3.6. The Continuous Audio-Visual Emotion Challenge (AVEC 2012)
2.3.7. INTERSPEECH 2013 Computational Paralinguistics Challenge
2.3.8. The Continuous Audio-Visual Emotion and Depression Recognition Challenge (AVEC 2013)
2.3.9. The ISS Feature Set
2.4. Local Features
2.4.1. Time-Domain Features
2.4.2. Spectral Features
2.4.3. Pitch Features
2.5. Global Features
2.5.1. Harmony Features
2.5.2. Functionals
2.5.3. Segmentation
2.6. Feature Selection
3. Classification
3.1. Main Idea
3.2. Bayesian Decision Theory
3.3. Bayes plug-in
3.3.1. Gaussian Model
3.3.2. Maximum-Likelihood Estimation
3.4. k-Nearest-Neighbours
3.5. Support Vector Machine
3.5.1. Hard-margin SVM
3.5.2. Soft-margin SVM
3.5.3. Kernel trick
3.5.4. Multiclass SVM
4. Material and Methods
4.1. Speech Database
4.2. Libraries for Feature Extraction and Classification
4.2.1. ISS Classification Toolbox
4.2.2. libSVM
4.2.3. openSMILE
4.3. Evaluation Method
5. Simulation and Results
5.1. Implementation
5.1.1. Naive Bayes Classifier
5.1.2. k-Nearest-Neighbour Classifier
5.1.3. Support Vector Machine
5.2. Comparison of Classifiers
5.3. Comparison of Feature Sets
5.4. Selected Features
5.5. Confusion Matrix
6. Discussion
6.1. Comparison with Literature
6.2. Optimal Feature Set
6.3. Curse of Dimensionality
6.4. Optimal Classifier
6.5. Challenges
6.6. Summary
6.7. Outlook
A. Feature Set Listing
A.1. IS09 Emotion Challenge
A.2. IS10 Paralinguistic Challenge
A.3. IS11 Speaker State Challenge
A.4. AVEC 2011
A.5. IS12 Speaker Trait Challenge
A.6. AVEC 2012
A.7. IS13 Computational Paralinguistics Challenge
A.8. AVEC 2013
A.9. ISS Feature Set
B. Entire Results
B.1. Recognition rates of feature sets using Naive Bayes classifier
B.2. Recognition rates of feature sets using k-Nearest-Neighbour classifier
B.3. Recognition rates of feature sets using Support Vector Machine
B.4. Selected Features
B.5. Confusion Matrices
Objectives & Core Topics
The primary objective of this thesis is to investigate and compare various feature sets in combination with multiple classifiers for the task of emotion recognition from speech signals. It specifically examines whether the size of a feature set influences recognition performance, analyzes the efficacy of feature selection algorithms (specifically SFFS) on high-dimensional feature sets, and determines which features most commonly contribute to high performance in emotion recognition models.
- Performance comparison of standardized challenge-based feature sets versus the ISS feature set.
- Evaluation of Naive Bayes, k-Nearest-Neighbour, and Support Vector Machine classifiers.
- Impact of dimensionality reduction and Sequential Floating Forward Selection (SFFS) on recognition rates.
- Analysis of feature importance for discrimination of basic emotional states.
- Integration of diverse feature extraction libraries including openSMILE and the ISS Classification Toolbox.
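The SFFS procedure named above alternates a greedy forward step (add the best feature) with conditional backward steps (drop a feature again whenever that beats the best subset of the resulting size seen so far). A minimal sketch, with a caller-supplied score function standing in for the cross-validated recognition rate actually used in the thesis; the feature names and the toy criterion in the usage example are illustrative assumptions, not taken from the thesis:

```python
def sffs(features, score, k_max):
    """Sequential Floating Forward Selection (sketch).

    `score` maps a list of feature names to a quality value, e.g. a
    cross-validated recognition rate. Returns the best subset of size k_max.
    """
    selected = []
    best = {}  # best (subset, score) recorded for each subset size
    while len(selected) < k_max:
        # Forward step: add the single feature that maximises the score.
        remaining = [f for f in features if f not in selected]
        f_add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        best[len(selected)] = (list(selected), score(selected))
        # Floating backward step: remove a feature as long as the reduced
        # subset beats the best score recorded for that smaller size.
        while len(selected) > 2:
            f_rm = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_rm]
            s = score(reduced)
            if s > best.get(len(reduced), (None, float("-inf")))[1]:
                selected = reduced
                best[len(selected)] = (list(selected), s)
            else:
                break
    return best[k_max][0]
```

With a toy score that rewards two "informative" features and slightly penalises subset size, e.g. `sffs(["f0", "f1", "f2", "f3"], lambda s: len(set(s) & {"f1", "f3"}) - 0.01 * len(s), 2)`, the procedure picks exactly those two features. The strict-improvement test in the backward step guarantees termination, since the recorded best score per subset size can only increase.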
Excerpt from the Book
2.4. Local Features
Time-domain features are features that can be computed directly from the signal x_k(n). They are usually easy to compute and interpret, but carry only very limited information about the emotional state. Typical examples are the energy and the zero-crossing rate.
The energy of a speech signal is related to the arousal level of an emotion; for example, it can serve as a measure to distinguish anger from boredom. The absolute energy $E_k^{\mathrm{abs}}$ of the $k$th speech frame is given by

$$E_k^{\mathrm{abs}} = \sum_{n=0}^{N-1} |x_k(n)|^2. \quad (2.2)$$

The root-mean-square energy $E_k^{\mathrm{RMS}}$ is defined as

$$E_k^{\mathrm{RMS}} = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} |x_k(n)|^2}. \quad (2.3)$$

This energy is approximately the volume of the emitted sound signal; only approximately, because it also depends on the distance to the microphone and on the room impulse response. The zero-crossing rate $\mathrm{ZCR}_k$ is defined by

$$\mathrm{ZCR}_k = \frac{1}{N} \sum_{n=0}^{N-1} |\operatorname{sgn}(x_k(n)) - \operatorname{sgn}(x_k(n-1))| \quad (2.4)$$

and, although it is calculated in the time domain, it makes a rough statement about the spectrum of zero-mean signals. Speech is indeed zero-mean: a low ZCR indicates strong low-frequency components, whereas a high ZCR indicates strong high-frequency components. This can be used to separate voiced from unvoiced speech.
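As a minimal illustrative sketch (not the thesis implementation), Eqs. (2.2)-(2.4) can be computed for a single frame in plain Python. The frame is assumed to be a list of samples; the n = 0 term of Eq. (2.4), which would require the last sample of the previous frame, is dropped here:

```python
import math


def sign(x):
    """sgn(x), with sgn(0) = 1 so that zero samples do not count as crossings."""
    return 1 if x >= 0 else -1


def frame_features(frame):
    """Absolute energy (2.2), RMS energy (2.3) and zero-crossing rate (2.4)
    of one speech frame x_k(n), n = 0 .. N-1."""
    n_samples = len(frame)
    e_abs = sum(s * s for s in frame)             # Eq. (2.2)
    e_rms = math.sqrt(e_abs / n_samples)          # Eq. (2.3)
    zcr = sum(abs(sign(frame[n]) - sign(frame[n - 1]))
              for n in range(1, n_samples)) / n_samples  # Eq. (2.4), n >= 1
    return e_abs, e_rms, zcr
```

For a frame that alternates between +1 and -1, every sign change contributes 2 to the sum in Eq. (2.4), so the ZCR of `[1.0, -1.0, 1.0, -1.0]` evaluates to 6/4 = 1.5, near the maximum the formula allows.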
Summary of Chapters
1. Introduction: This chapter provides the motivation for emotion recognition, defines the core concepts of pattern recognition and emotion representation, and outlines the thesis structure.
2. Feature extraction: This chapter details human speech production models, explains how various standardized feature sets are extracted, and defines the local and global features used in the study.
3. Classification: This chapter introduces the theoretical framework of classification, covering Bayesian decision theory, k-Nearest-Neighbours, and the mechanics of Support Vector Machines.
4. Material and Methods: This chapter describes the used speech databases, the software libraries (ISS Toolbox, libSVM, openSMILE) and the evaluation protocols employed for the experiments.
5. Simulation and Results: This chapter presents the experimental findings, including recognition rate comparisons across classifiers and different feature set dimensions.
6. Discussion: This chapter evaluates the results against existing literature, discusses optimal feature and classifier choices, and addresses challenges encountered during simulation.
Keywords
emotion recognition, speech processing, feature extraction, classification, Naive Bayes, k-Nearest-Neighbour, Support Vector Machine, SFFS, feature selection, paralinguistics, mel-frequency cepstral coefficients, auditory spectrum, confusion matrix, dimensionality reduction
Frequently Asked Questions
What is the core focus of this research?
The research focuses on the automated recognition of emotions from speech signals by comparing the performance of different feature sets and classification algorithms.
What are the primary thematic areas covered?
The work covers speech signal processing, feature extraction techniques, pattern recognition theory, machine learning classification, and the experimental evaluation of emotional speech databases.
What is the main goal or research question?
The main goal is to determine if a correlation exists between the size of a feature set and the recognition performance, and to identify the most robust features for effective emotion discrimination.
Which scientific methods are employed?
The study utilizes feature selection algorithms like Sequential Floating Forward Selection (SFFS), statistical modeling for feature extraction, and three distinct machine learning classifiers: Naive Bayes, k-Nearest-Neighbour, and Support Vector Machines.
What is discussed in the main body of the work?
The main body examines the theoretical basis for speech feature extraction, evaluates various standard feature sets from previous challenges, details the experimental classification setup, and analyzes results obtained through cross-validation.
Which keywords define this work?
Key terms include emotion recognition, speech processing, classification, SFFS, Support Vector Machine, and feature extraction.
How does the size of the feature set affect classification?
The thesis finds no simple linear dependency: extremely small feature sets perform poorly, but an optimal set balances feature quality against quantity so as to avoid redundant data.
Why are Confusion Matrices used in the results?
Confusion matrices are used to visualize which specific emotion classes are frequently confused with one another, such as the difficulty in distinguishing joy from anger.
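Such a matrix is straightforward to build from paired label lists. A small sketch (the class names in the example are illustrative, not taken from the thesis) that counts, for each true class, how often each class was predicted:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Confusion matrix as a nested list.

    Rows correspond to the true class, columns to the predicted class,
    both in the order given by `labels`.
    """
    idx = {lab: i for i, lab in enumerate(labels)}
    mat = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        mat[idx[true]][idx[pred]] += 1
    return mat
```

Off-diagonal entries expose systematic confusions: if many "anger" utterances land in the "joy" column, the two high-arousal emotions are being mixed up, exactly the kind of pattern the thesis reads off its confusion matrices.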
- Quote paper
- Tobias Gruber (Author), 2014, Comparison of different features sets and classifiers for emotion recognition of speech, Munich, GRIN Verlag, https://www.grin.com/document/300174