Who is Speaking? Male or Female

Master's Thesis, 2013

79 Pages, Grade: A+







1 Introduction

2 Background
2.1 Speech
2.1.1 Speech Signal
2.2 Speech Signal Processing
2.2.1 Fourier Transform
2.2.2 Discrete Cosine Transform
2.2.3 Digital Filters
2.2.4 Nyquist-Shannon Sampling Theorem
2.2.5 Window Functions

3 Speech Enhancement
3.1 Signal to Noise Ratio
3.2 Spectral Subtraction
3.3 Cepstral Mean Normalization
3.4 RASTA Filtering
3.5 Voice Activity Detector
3.5.1 The Empirical Mode Decomposition Method
3.5.2 The Hilbert Spectrum Analysis
3.5.3 Voice Activity Detection

4 Gender Identification Systems
4.1 Acoustic Features
4.1.1 Mel Frequency Cepstral Coefficients (MFCC)
4.1.2 Shifted Delta Cepstral (SDC)
4.1.3 Pitch Extraction Method
4.2 Pitch Based Models
4.3 Models based on Acoustic Features
4.4 Fused Models

5 Learning Techniques for Gender Identification
5.1 Overview
5.2 Adaboost
5.3 Gaussian Mixture Model (GMM)
5.3.1 GMM Training
5.3.2 GMM Testing
5.4 Decision Making
5.5 Likelihood Ratio
5.6 Universal Background Model
5.6.1 UBM Training

6 System Design and Implementation
6.1 Toolboxes
6.1.1 Signal Processing Toolbox
6.1.2 Machine Learning Toolbox
6.2 System Design
6.2.1 Requirement
6.2.2 Initial Approach
6.2.3 Algorithm
6.2.4 Feature Selection
6.3 Experiments and Results
6.3.1 Pitch Based Models
6.3.2 Models Based on Acoustic Features
6.3.3 Fused Model
6.3.4 YouTube Videos

7 Conclusion
7.1 Summary
7.2 Future Recommendation


A Appendix

List of Tables

6.1 Results from pitch based model trained with 1 male and 1 female speaker

6.2 Results from pitch based model trained with 9 male and 1 female speakers

6.3 Results from pitch based model trained with 1 male and 9 female speakers

6.4 Results from pitch based model trained with 8 male and 8 female speakers

6.5 Results from MFCC model trained using 8 GMM components

6.6 Results from MFCC model trained using 16 GMM components

6.7 Results from MFCC model trained using 32 GMM components

6.8 Results from SDC model trained using 8 GMM components

6.9 Results from SDC model trained using 16 GMM components

6.10 Results from SDC model trained using 32 GMM components

6.11 Results from fused model trained using 8 GMM components on SDC features

6.12 Results from fused model trained using 16 GMM components on SDC features

6.13 Results from fused model trained using 32 GMM components on SDC features

6.14 Results from acoustic and fused models tested on a large amount of data

6.15 Accuracy of all the models that were tested on YouTube videos

List of Figures

2.1 Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding. The grey boxes represent computer systems for natural language processing [HAH01]

2.2 Conversion of analogue signal to digital signal. Red lines show the digital value of the analogue signal

2.3 DFT applied to a speech signal

2.4 DCT applied to a speech signal

2.5 Sampling of a Continuous Signal

2.6 Hamming window effect

3.1 A noisy speech signal [Vat12]

3.2 A clean speech signal [Vat12]

3.3 VAD applied to noisy speech [SZ12]

4.1 A block diagram of gender identification model

4.2 A block diagram of MFCC computation [Vat12]

4.3 Mel frequency scale [Vat12]

4.4 Graph of Mel filterbank of 24 filters [Vat12]

4.5 Computational model of SDC [TcSKD02]

4.6 Plot of male and female pitch [Sun00]

4.7 Block diagram of gender identification model trained using MFCC [Sun00]

4.8 A fused gender identification model [Sun00]

4.9 An AdaBoost score fusion model [IKJGY10]

5.1 Optimal decision boundary between two classes

5.2 One Dimensional Gaussian Mixture Model

6.1 Block diagram of SDC feature extraction

6.2 Block diagram of the final fused model

6.3 Snapshot of the Graphical User Interface of the system


The aim of this project was to create a system that identifies the gender of a speaker from speech. In this dissertation I explain the signal processing background, such as the Fourier transform and the DCT, needed to understand the signal processing that happens inside digital devices. I also investigated different classification techniques, such as AdaBoost and Gaussian mixture models, as well as the different types of methods used in gender identification: fusion methods, acoustic methods and pitch-based methods.

From this perspective I implemented three types of models (four models in total) that are described in the literature, and I introduce a new method for gender recognition that uses SDC features together with pitch to identify the gender. All models were trained and tested on the same amount of speech. The SDC and fused SDC models gave satisfactory results on the VoxForge dataset. Finally, I tested the acoustic and fused models on YouTube videos, which gave almost 90% accuracy. The results of my implementations are shown in chapter 6.


No portion of the work referred to in this dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


i. The author of this report (including any appendices and/or schedules to this report) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, the University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in the University's policy on presentation of Theses.


I would like to thank my supervisor Dr Ke Chen for his support and guidance throughout the duration of this project without whom this dissertation would never be a reality. Doing this project was a great experience and I learned a lot.

I would also like to thank my dear friend Mr Manav Priyan Olikara for his help and advice regarding this topic, and Ali Zia, who helped me in generating the graphics for this project. I would finally like to thank my friends and family for their support.

Chapter 1

Introduction


As computers become more significant in our daily lives, the interaction between humans and machines grows more important day by day. The desire of humans to communicate with machines in a natural way has led to the evolution of natural language processing. As advancements in this field continue, it is likely that voice interaction systems will replace the standard keyboard in the near future. Looking at today's technology market, we have some truly state-of-the-art technologies, such as Microsoft® Kinect and Apple® Siri, which perform really well. But every speech system available today has its own drawbacks, and continuous work is being done to increase the performance of such systems. To increase the performance of speech systems, pre-processing such as gender recognition and language identification is required.

This MSc project focuses on automatic gender identification from speech, i.e. detecting whether the spoken speech comes from a male or a female speaker. Automatic Gender Identification (AGI) via speech has several applications in the field of natural language processing. [AH96] showed that gender-dependent speech recognition models are more accurate than gender-independent models. Google's latest speech recognition system, found in Android devices and Google Glass, first identifies the gender of the speaker before performing speech recognition for search; its recognition accuracy is exceptionally high compared to their previous system, which was a single gender-independent model. Recently a company launched its "Kinect"-based online fitting room, which determines the gender of the person using it from their speech in order to suggest clothes. In the context of multimedia indexing, gender recognition can decrease the search space by up to half [HC].

Automatic gender recognition is itself a complex task with its own problems and limitations; to date, no gender recognition system exists that works in a real-time environment with 100% accuracy. In a real-world environment, or in the case of multimedia indexing, many acoustic conditions exist, such as noisy speech, compressed speech, silence, telephone speech and different languages, which significantly reduce the performance of a general gender identification system. Ideally, then, a system is required that gives acceptable performance under these acoustic conditions.

In general, there are three main approaches to building an automatic gender identification system. The first uses pitch as the discriminating factor and labelled data to identify the gender of the speaker. The second uses acoustic features such as MFCCs and unlabelled data: relevant features are extracted and then a model is trained, typically a GMM for each gender, and the score from one model is subtracted from the other to decide the gender. The third approach, commonly used since about 2005, combines pitch models with acoustic models to form a fused model.

The dissertation is organised as follows. The first chapter presents the challenge of gender identification. The second chapter presents the necessary signal processing background. The following chapters each describe a step of a gender identification system, including speech enhancement techniques to reduce background noise, feature extraction, gender modelling methods and decision-making techniques. Finally, the last chapter presents the implementations done for this project and the results obtained from testing the different models on a large set of speakers and on YouTube videos.

Chapter 2

Background


To understand the gender identification process using speech, we first need to understand the structure of speech itself. This chapter covers human speech and the basic difference between male and female voices.

2.1 Speech

Spoken language, or human speech, is the natural form of human communication which requires the use of voice. In linguistic terms, human speech is a sound wave produced by the lungs and given its uniqueness by the tongue, lips and jaw [HAH01].

2.1.1 Speech Signal

Speech is produced when the air pressure generated by the lungs reaches the vocal cords. The speech then resonates in the oral and nasal cavities according to the position of the lips, tongue and other organs in the mouth. In terms of signal processing, the speech signal is an analogue signal which is the convolution of the source

[Figure not included in this excerpt]

Figure 2.1: Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding. The grey boxes represent computer systems for natural language processing [HAH01]

e[n] and the resonance of speech in the mouth, which can be modelled as the filter h[n]:

x[n] = e[n] * h[n]

where x[n] is the speech signal and * denotes convolution.
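The source-filter relation x[n] = e[n] * h[n] can be sketched numerically. The following Python snippet (the thesis itself uses MATLAB; the excitation and filter values here are hypothetical toys, not a real vocal-tract model) convolves an impulse-train excitation with a short impulse response:

```python
import numpy as np

# Hypothetical excitation: a sparse impulse train standing in for glottal pulses.
e = np.zeros(16)
e[::4] = 1.0

# Hypothetical vocal-tract "filter": a short decaying impulse response.
h = np.array([1.0, 0.5, 0.25])

# The observed speech signal is the convolution of source and filter.
x = np.convolve(e, h)

print(len(x))  # full convolution length: len(e) + len(h) - 1 = 18
```

Each impulse in `e` is replaced by a copy of `h`, which is exactly what the source-filter model of speech production describes.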

Human Speech Frequency

Human speech occupies the 300 Hz to 3400 Hz part of the audio range [Tit00a]. The full audible range, i.e. the range of frequencies humans can hear, is 20 Hz to 20,000 Hz. Beyond 20,000 Hz lies the ultrasonic region, which humans are unable to hear.

[Figure not included in this excerpt]

Figure 2.2: Conversion of analogue signal to digital signal. Red lines show the digital value of the analogue signal

Fundamental Frequency (Pitch)

Generally, the fundamental frequency is defined as the lowest frequency of a periodic waveform. The fundamental frequency, usually known as the pitch in natural language processing, is the biggest discriminating factor between male and female speech. A typical adult male has a fundamental frequency between 85 Hz and 180 Hz, while an adult female has a fundamental frequency in the range of 165 Hz to 225 Hz [BO00].
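These ranges suggest a minimal pitch-based decision rule. The sketch below is illustrative only: the 172.5 Hz threshold is simply a midpoint between the typical male (85-180 Hz) and female (165-225 Hz) ranges quoted above, whereas real systems (including the models in chapter 6) learn the decision boundary from labelled data:

```python
def classify_by_pitch(f0_hz, threshold_hz=172.5):
    """Toy pitch-based gender decision.

    threshold_hz is a hypothetical midpoint between the typical
    male and female fundamental-frequency ranges; it is not a
    value taken from the thesis experiments.
    """
    return "female" if f0_hz > threshold_hz else "male"

print(classify_by_pitch(120))  # a typical male pitch
print(classify_by_pitch(210))  # a typical female pitch
```

Note that the ranges overlap between 165 Hz and 180 Hz, which is precisely why pitch alone is not a perfect discriminator and why fused models are considered later.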

2.2 Speech Signal Processing

From [HAH01] we know that speech is an analogue signal, but today's computers work with digital signals, so speech is stored in digital form. When speech is converted to digital form it loses some information, so an accurate representation of the analogue signal in digital form is required. A conversion of an analogue signal can be seen in figure 2.2.

2.2.1 Fourier Transform

According to Joseph Fourier, any signal can be represented as a linear combination of sinusoids, which means that the Fourier transform can be described as transforming a function of time f(t) into a function of frequency F(ω). This can be shown as

F(ω) = ∫_{-∞}^{+∞} f(t) e^{-jωt} dt

There exist different types of Fourier transforms, but the most famous are

1. Continuous Time Fourier Transform
2. Continuous Fourier Transform
3. Discrete Fourier Transform
4. Discrete Time Fourier Transform

In an automatic gender recognition system only the discrete Fourier transform is required, so only that one will be explained.

Discrete Fourier Transform

For a discrete signal x[n] of length N (treated as periodic), the discrete Fourier transform can be defined as

X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j2πkn/N},  k = 0, 1, …, N-1

A discrete Fourier transform applied to a signal can be seen in figure 2.3
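In practice the DFT is computed with a fast Fourier transform (FFT) routine rather than by evaluating the sum directly. The Python sketch below (the thesis implementation uses MATLAB; the four-sample signal here is an arbitrary toy) evaluates the standard DFT sum X[k] = Σ x[n]·e^{-j2πkn/N} directly and checks it against NumPy's FFT:

```python
import numpy as np

# Toy discrete signal (stand-in for one frame of speech).
x = np.array([1.0, 2.0, 0.0, -1.0])
N = len(x)

# Direct evaluation of the DFT sum X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N).
n = np.arange(N)
k = n.reshape(-1, 1)
X_direct = np.sum(x * np.exp(-2j * np.pi * k * n / N), axis=1)

# NumPy's FFT computes the same transform efficiently.
X_fft = np.fft.fft(x)

print(np.allclose(X_direct, X_fft))  # the two agree
```

The k = 0 coefficient is just the sum of the samples, which gives a quick sanity check on any DFT implementation.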

2.2.2 Discrete Cosine Transform

The Discrete Cosine Transform, commonly known as the DCT, is similar to the DFT. The DCT transforms a finite sequence of data points into a sum of cosines vibrating at different frequencies. The DCT is widely used for compression of images and sound, where the higher-frequency components can be discarded, which means that the transformed signal is mostly comprised of lower frequencies

[Figure not included in this excerpt]

Figure 2.3: DFT applied to a speech signal

thus the majority of the information can be found in the first few coefficients. More information about the use of the DCT in speech processing can be found in [MAMG11]. Mathematically, the DCT (type II) can be defined as

X_T[k] = Σ_{n=0}^{N-1} x[n] · cos[(π/N)(n + 1/2)k],  k = 0, 1, …, N-1

where X_T[k] is the k-th coefficient. The DCT applied to a speech signal can be seen in figure 2.4
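The energy-compaction property that makes the DCT useful for compression (and later for MFCC computation) is easy to demonstrate. The Python sketch below (a toy cosine stands in for a smooth signal; the thesis itself works in MATLAB) shows that almost all of a smooth signal's energy lands in the first few DCT coefficients:

```python
import numpy as np
from scipy.fft import dct

# Smooth toy signal: a single slow cosine sweep over 64 samples.
x = np.cos(np.linspace(0, np.pi, 64))

# Type-II DCT with orthonormal scaling (energy-preserving).
X = dct(x, type=2, norm='ortho')

energy_total = np.sum(X ** 2)
energy_first4 = np.sum(X[:4] ** 2)
print(energy_first4 / energy_total)  # close to 1: energy is compacted
```

Discarding everything after the first few coefficients would therefore lose almost no energy for this signal, which is the basis of DCT compression.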

2.2.3 Digital Filters

Digital filters are mathematical models applied to a signal to remove some of its components or to enhance certain aspects of it. In natural language processing, widely used filters are low-pass, band-pass and high-pass filters [APHA96].

Low Pass Filter

A low pass filter is used to discard the frequencies higher than the cut-off frequency in a speech signal.

[Figure not included in this excerpt]

Figure 2.4: DCT applied to a speech signal

High Pass Filter

A high pass filter is used to discard the frequencies lower than the cut-off frequency in a speech signal.

Band Pass Filter

The band pass filter allows a certain range of frequencies to pass and discards all frequencies that are higher or lower than the cut-off frequencies.
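The three filter types above can be sketched with SciPy. In the following Python example (the sampling rate, tone frequencies and cut-off are all illustrative choices, not values from the thesis), a low-pass Butterworth filter removes a high-frequency tone while keeping a low one:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 8000.0                       # illustrative sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)

# Mixture of a 200 Hz tone (to keep) and a 3000 Hz tone (to remove).
sig = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)

# 4th-order Butterworth low-pass with a 1000 Hz cut-off
# (cut-off is given normalised to the Nyquist frequency fs/2).
b, a = butter(4, 1000.0 / (fs / 2), btype='low')
filtered = lfilter(b, a, sig)

# The 3000 Hz component should now be strongly attenuated.
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(len(filtered), 1.0 / fs)
print(spectrum[freqs == 200][0] / spectrum[freqs == 3000][0])
```

Swapping `btype='low'` for `'high'` or `'bandpass'` gives the other two filter types described above.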


Sampling

A human speech signal is naturally an analogue signal, but to perform any computational task on it, it must be converted to digital form. In signal processing, sampling means converting a continuous-time signal to a discrete-time signal by measuring it at regular intervals of time. This regular interval is called the sampling interval and is denoted T_s. The sampling frequency, generally known as the sampling rate and denoted f_s = 1/T_s, is defined as the number of discrete samples taken from a signal in one second. The higher the sampling frequency, the better the digital signal, as more information is captured and less is lost. In speech processing, 44.1 kHz is usually considered a good sampling rate, which means that 44,100 samples are taken from one second of speech.


Quantization

In digital signal processing, quantization is the process of mapping a continuous range of values to a smaller set of discrete or integer values. The error induced by this loss of information is called the quantization error. Quantization is used in analogue-to-digital converters to convert sampled signals to digital values using a quantization level specified in bits. As the loss of information during quantization is irreversible, it is good practice to use a higher quantization level. A good-quality compact disc is sampled at 44.1 kHz with a quantization level of 16 bits, which gives 65,536 possible values per sample.
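The arithmetic above is easy to check. This Python sketch (the test signal is an arbitrary ramp, used only to exercise the quantizer) computes the number of 16-bit levels and shows that uniform round-to-nearest quantization keeps the error within half a step:

```python
import numpy as np

bits = 16
levels = 2 ** bits            # 65,536 possible values, as for CD audio
print(levels)

# Uniform quantization of a toy signal in [-1, 1).
x = np.linspace(-1, 1, 1001, endpoint=False)
step = 2.0 / levels           # width of one quantization step
xq = np.round(x / step) * step

# Round-to-nearest quantization error is bounded by half a step.
max_err = np.max(np.abs(x - xq))
print(max_err <= step / 2)
```

Each extra bit halves the step size, which is why higher quantization levels preserve the signal better.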

[Figure not included in this excerpt]

Figure 2.5: Sampling of a Continuous Signal

2.2.4 Nyquist-Shannon Sampling Theorem

The Nyquist-Shannon sampling theorem, more commonly known as the Nyquist sampling theorem, states that

"If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart [Wik13b]."

This means that to reconstruct a continuous signal from its digital form, the sampling rate f_s must be greater than twice the bandwidth B of the signal:

f_s > 2B
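Violating the condition f_s > 2B causes aliasing: a tone above half the sampling rate reappears at a lower frequency. The Python sketch below (tone and sampling rates are illustrative values chosen so every frequency falls on an FFT bin) samples a 5 Hz sine well above and then below its Nyquist rate of 10 Hz:

```python
import numpy as np

f_signal = 5.0  # Hz, illustrative tone (Nyquist rate = 10 Hz)

# Sampled well above the Nyquist rate: the spectral peak is at 5 Hz.
fs_good = 40.0
t = np.arange(0, 2, 1 / fs_good)
x = np.sin(2 * np.pi * f_signal * t)
freqs = np.fft.rfftfreq(len(x), 1 / fs_good)
peak_good = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
print(peak_good)

# Sampled below the Nyquist rate: the tone aliases to |fs - f| = 3 Hz.
fs_bad = 8.0
t = np.arange(0, 2, 1 / fs_bad)
x = np.sin(2 * np.pi * f_signal * t)
freqs = np.fft.rfftfreq(len(x), 1 / fs_bad)
peak_bad = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
print(peak_bad)
```

Once a signal has aliased, the original frequency cannot be recovered, which is why anti-aliasing low-pass filters are applied before sampling.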

2.2.5 Window Functions

In signal processing, a window function is a mathematical function whose value is zero outside a given interval. As a person talks, the sound produced changes very quickly, so to study every change or segment of the speech, it is divided into many short frames with the help of a window function.

[Figure not included in this excerpt]

Figure 2.6: Hamming window effect

The most common window functions are the rectangular and Hamming windows. The rectangular window is constant inside the given interval and zero outside it, which means the changes around the edges of the frame are abrupt. To reduce this abrupt effect, the Hamming window is used, which is given by the equation

w[n] = 0.54 - 0.46 · cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1

The effect of a Hamming window can be seen in figure 2.6. MATLAB's signal processing toolbox provides a Hamming window function: hamming(N) returns an N-point window, which is then multiplied element-wise by the frame.
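The same window is available outside MATLAB. The Python sketch below checks NumPy's built-in window against the standard Hamming definition w[n] = 0.54 - 0.46·cos(2πn/(N-1)) and applies it to a toy frame (the frame of ones is illustrative; a real system would window a frame of speech samples):

```python
import numpy as np

N = 8
# NumPy's built-in Hamming window, the counterpart of MATLAB's hamming(N).
w = np.hamming(N)

# The same values from the standard definition.
n = np.arange(N)
w_manual = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
print(np.allclose(w, w_manual))

# Windowing a frame tapers its edges toward 0.54 - 0.46 = 0.08.
frame = np.ones(N)
windowed = frame * w
print(windowed[0])
```

The tapered edges are what suppress the spectral leakage that an abrupt rectangular cut-off would cause.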

Chapter 3

Speech Enhancement

Automatic gender identification systems trained on clean speech in lab environments degrade in performance under real-world conditions due to additive environmental sounds such as noise and music. In addition, the natural pauses in human speech, known as silence, are also treated as an additive environmental sound. Systems whose performance does not degrade in a real-world environment are called robust systems.

Several researchers have tried to design auditory systems that mimic the human auditory system, since it is robust to environmental changes [Ghi87]. Before training a speech-based system, speech enhancement pre-processing has to be done so that good-quality speech can be extracted from recordings made in a real-world environment. This chapter discusses some speech enhancement techniques.

3.1 Signal to Noise Ratio

The signal-to-noise ratio, usually denoted SNR, is a measure of the quality of the input signal relative to the background noise it contains. Mathematically, the SNR in decibels can be written as

SNR_dB = 10 · log10(P_signal / P_noise)

where P_signal and P_noise are the average powers of the signal and the noise.
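The decibel form SNR_dB = 10·log10(P_signal/P_noise) can be sketched directly. In the Python example below, the tone, noise level and random seed are all illustrative choices constructed so that the noise has roughly one hundredth of the signal's power, i.e. an SNR of about 20 dB:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels: 10*log10 of the signal-to-noise power ratio."""
    p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10 * np.log10(p_signal / p_noise)

# Toy example: unit-power tone, noise with ~1/100th of that power.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sqrt(2) * np.sin(2 * np.pi * 200 * t)   # average power = 1
noise = 0.1 * rng.standard_normal(8000)            # average power ~ 0.01
print(snr_db(clean, noise))  # ~20 dB
```

A higher SNR means the speech dominates the noise; speech enhancement techniques such as spectral subtraction aim to raise the effective SNR before features are extracted.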


University of Manchester

Hassam Sheikh (Author), 2013, Who is Speaking? Male or Female, Munich, GRIN Verlag, https://www.grin.com/document/265700

