Vocal Melody Extraction from Polyphonic Audio files using enhanced Cluster tracking

Research Paper (postgraduate), 2014

10 Pages


Table of Contents


I. Introduction

II. Partial Space Extraction

III. Source Tracking System
A. Tracking A Comb
B. Initializing Comb
C. Comb Termination

IV. Identification of Vocal Comb




Harmonic Cluster Tracking for Vocal Melody Extraction from Polyphonic Audio


Determining the predominant melody in a musical performance that contains a large number of instruments is a difficult task, and it is one of the most challenging problems in music information retrieval and computational musicology. Over the past decade, melody extraction has emerged as an active research topic. In this paper, we present a novel framework that estimates the predominant vocal melody in real time by tracking the various source frequencies with harmonic clusters (combs) and then identifying the predominant vocal source by its harmonic strength. In contrast to existing methods, which track only the pitch values, we rely on the strong higher harmonics for robustness against distortion of the first harmonic caused by low-frequency accompaniments in the signal. The proposed method, although on-line, is shown to significantly outperform our implementation of a state-of-the-art offline method for vocal melody extraction.

Keywords: Music information retrieval, Predominant frequency estimation, Pitch tracking, Spectral harmonics, Vocal melody estimation, Computational musicology, Vocal harmonics.


In this work, we focus on estimating the melody of the singing voice of a single singer, in the presence of pitched and percussive accompaniment, from a monaural (single-channel) audio recording. We ignore the linguistic information in the song and consider only the melody information.

A general architecture underlying most F0 estimation systems is outlined in Fig. 1. The information spread in the audio signal is transformed into a suitable representation space, where it can be conveniently clustered into subspaces such that each subspace represents one source and can be extracted reliably. Static constraints are used to analyze and cluster the information in a single time frame. Dynamic constraints model the evolution of the subspaces over time and help in clustering the source information over successive time frames. The clustered subspaces then give multiple F0s corresponding to the various sources, out of which one particular source, the vocal source in this work, is selected using a harmonic strength criterion and instrument-specific constraints. In this paper, we represent the audio signal in terms of a partial space, which consists of the sinusoidal components. We see the F0 estimation task as a two-level clustering of this partial space into various sound sources: first statically, at the spectral level (in each temporal window), and second dynamically (over successive temporal windows).

A harmonically related cluster of partials is termed a comb. With each comb aiming to track the partials from a single source, several combs simultaneously track the various pitched sources. While other works use the F0 and salience values for dynamic constraints, our work relies on directly tracking the higher harmonics. We also develop a novel vocal melody selection scheme using a series of filters, with the aim of making the system implementable in real time. The overall flowchart of the proposed system is shown in Fig. 2: it starts with the extraction of the partial space, followed by the harmonic source tracking module and finally the vocal source identification module.


As the first step, the signal has to be transformed to a representation space that contains most of the relevant information about the various pitched sources in the polyphony. This is achieved by considering the frequency and amplitude of all the partials (peaks in the magnitude spectrum) in the discrete Fourier transform (DFT) space. An N-point short-time Fourier transform (STFT) of the monaural music recording is computed using a sliding Hanning window with a temporal length of 80 ms. N is chosen to be of the order of the sampling frequency. The next step is to extract the peaks (partials) in the spectrum. There are various ways to improve the partial estimation accuracy; however, our work uses the simplest of all, i.e. the amplitude and frequency of only the local maxima in the spectrum. A local maximum is a point whose amplitude is greater than that of its immediate neighbors on the frequency axis.
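The local-maximum rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `extract_partials` and its parameters are hypothetical, and the magnitude spectrum of one STFT frame is assumed to be already available as a list.

```python
from typing import List, Tuple


def extract_partials(magnitudes: List[float],
                     sample_rate: float,
                     n_fft: int) -> List[Tuple[float, float]]:
    """Return (frequency_hz, amplitude) for each local maximum of one spectrum.

    A bin is a local maximum when its magnitude is strictly greater than
    that of both immediate neighbors on the frequency axis.
    """
    partials = []
    for k in range(1, len(magnitudes) - 1):
        if magnitudes[k] > magnitudes[k - 1] and magnitudes[k] > magnitudes[k + 1]:
            # Bin k of an n_fft-point DFT corresponds to k * fs / n_fft Hz.
            partials.append((k * sample_rate / n_fft, magnitudes[k]))
    return partials


# Toy spectrum with peaks at bins 2 and 5 (bin spacing 8000 / 16 = 500 Hz).
spectrum = [0.0, 0.2, 1.0, 0.3, 0.1, 0.6, 0.2, 0.0]
peaks = extract_partials(spectrum, sample_rate=8000.0, n_fft=16)
```

No interpolation between bins is applied here, which matches the "simplest of all" choice stated in the text; the frequency resolution is therefore limited to the bin spacing.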


This section explains how the partial space is clustered into source subspaces and how these clusters are tracked in time, using harmonic as well as dynamic constraints. The state of the c-th comb at the n-th instant (time frame) is indexed by $c = 1, \ldots, C$. The amplitude and frequency of the h-th partial in the c-th comb at the n-th instant are represented by $A_c^h[n]$ and $F_c^h[n]$, respectively.

The comb state contains the frequencies and amplitudes of the partials associated with it, indexed by the harmonic number h. We omit n at several places below for ease of representation. In the following discussion, two likelihood functions, static and dynamic, will be used, defined as

illustration not visible in this excerpt

Eq. (1) ensures that for the static likelihood to be high, the distance between the two frequencies should be small and the partial amplitude should be large. This likelihood is called static because it is used in the context of a single spectrum. On the other hand, Eq. (2) ensures that for the dynamic likelihood to be high, the distances between both the frequencies and the partial amplitudes should be small. This likelihood is called dynamic because it is generally used in the context of the dynamic evolution of spectra, where the reference values are those predicted from the spectra at the previous instants.
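The qualitative behavior of the two likelihoods can be illustrated with a short sketch. The exact form of Eqs. (1) and (2) is not visible in this excerpt, so the Gaussian-shaped kernels, the function names, and the spread parameters `sigma_f` and `sigma_a` below are all assumptions chosen only to reproduce the stated properties.

```python
import math


def static_likelihood(f_partial: float, a_partial: float,
                      f_expected: float, sigma_f: float = 20.0) -> float:
    """Assumed form: high when the frequency distance is small
    and the partial amplitude is large."""
    closeness = math.exp(-((f_partial - f_expected) ** 2) / (2.0 * sigma_f ** 2))
    return a_partial * closeness


def dynamic_likelihood(f_partial: float, a_partial: float,
                       f_pred: float, a_pred: float,
                       sigma_f: float = 20.0, sigma_a: float = 0.2) -> float:
    """Assumed form: high when both the frequency and the amplitude
    are close to the values predicted from previous instants."""
    d_f = ((f_partial - f_pred) ** 2) / (2.0 * sigma_f ** 2)
    d_a = ((a_partial - a_pred) ** 2) / (2.0 * sigma_a ** 2)
    return math.exp(-(d_f + d_a))
```

With these assumed forms, a louder partial at the same frequency raises the static likelihood, whereas for the dynamic likelihood any deviation from the predicted amplitude, up or down, lowers the value.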


The state of an active comb, which is tracking a cluster of partials, has to be updated at the next time instant. This task is accomplished using a prediction update and a measurement update, as in the Kalman filter framework.

Prediction update: The a priori state of the comb is estimated using the first-order prediction as

illustration not visible in this excerpt

for all h = 1, … [illustration not visible in this excerpt]. Here, the coefficients in the expression above are prediction coefficients, and ∆ is defined as

illustration not visible in this excerpt

where g stands for either A or F.
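A first-order prediction of this kind can be sketched as below. The prediction equations themselves are not visible in this excerpt, so the specific form used here (previous value plus a coefficient times the most recent difference) and the names `predict` and `predict_comb` are assumptions for illustration only.

```python
from typing import List, Tuple


def predict(prev: float, prev2: float, coeff: float) -> float:
    """Assumed first-order prediction: g[n-1] + coeff * (g[n-1] - g[n-2])."""
    delta = prev - prev2
    return prev + coeff * delta


def predict_comb(freqs_prev: List[float], freqs_prev2: List[float],
                 amps_prev: List[float], amps_prev2: List[float],
                 coeff_f: float = 1.0,
                 coeff_a: float = 0.5) -> Tuple[List[float], List[float]]:
    """Apply the prediction to every harmonic h of one comb.

    coeff_f and coeff_a play the role of the prediction coefficients
    for frequency (F) and amplitude (A); their values are hypothetical.
    """
    freqs = [predict(f1, f2, coeff_f) for f1, f2 in zip(freqs_prev, freqs_prev2)]
    amps = [predict(a1, a2, coeff_a) for a1, a2 in zip(amps_prev, amps_prev2)]
    return freqs, amps
```

A coefficient of 0 would simply hold the previous value, while a coefficient of 1 extrapolates the latest trend linearly.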

Measurement update: The partial space at the n-th instant is used as the measurement to obtain the a posteriori estimate of the comb state. To handle situations in which the first harmonic is distorted by interference from the accompaniment, the system relies on the higher harmonics for the state update. A harmonic h is selected as a leader harmonic if it is strong in amplitude. The criterion used is that its amplitude should be greater than a constant fraction of that of the first harmonic for the latest 4 available states.
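The leader-harmonic criterion can be sketched as follows. The value of the constant fraction is not given in this excerpt, so `fraction=0.5` is a placeholder, and the reading that the condition must hold in every one of the latest 4 states (rather than on average) is an assumption.

```python
from typing import List


def leader_harmonics(amp_history: List[List[float]],
                     fraction: float = 0.5,
                     n_states: int = 4) -> List[int]:
    """Return the harmonic numbers (2, 3, ...) that qualify as leaders.

    amp_history holds one amplitude list per past state, oldest first;
    amp_history[t][h - 1] is the amplitude of harmonic h at state t.
    """
    recent = amp_history[-n_states:]
    if not recent:
        return []
    leaders = []
    # Index 0 is the first harmonic itself, so the search starts at index 1.
    for idx in range(1, len(recent[0])):
        if all(state[idx] > fraction * state[0] for state in recent):
            leaders.append(idx + 1)
    return leaders


history = [[1.0, 0.8, 0.2, 0.6]] * 4
print(leader_harmonics(history))
```

Only the higher harmonics are tested here because the first harmonic serves as the reference amplitude in the criterion.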

Among the potential states of the comb, the winner, which finally updates the c-th comb, is chosen as the one that maximizes the static likelihood:

illustration not visible in this excerpt


If any of the three strongest (in amplitude) partials in the partial space is not being tracked by one of the already active combs, then a new comb c is initialized using the corresponding features of that partial, whose frequency is considered as the F0 of the new comb. Note that in the algorithm, this step is performed after all the active comb states have been updated at the current instant. The maximum number of combs is fixed, and each comb tracks a fixed number of harmonics. If all the combs are active, then an existing comb is terminated to start the new comb. For this step, the expression

illustration not visible in this excerpt

selects the partial that is close to the predicted harmonic frequency and has a large amplitude.
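The check for untracked strong partials can be sketched as below. How the system decides that a partial is "being tracked" is not specified in this excerpt, so the frequency tolerance `tol_hz` and the function name are assumptions; partials are (frequency, amplitude) pairs and each comb is represented only by its list of harmonic frequencies.

```python
from typing import List, Tuple


def untracked_strong_partials(partials: List[Tuple[float, float]],
                              combs: List[List[float]],
                              n_strongest: int = 3,
                              tol_hz: float = 10.0) -> List[Tuple[float, float]]:
    """Return the strongest partials that no active comb is tracking.

    A partial counts as tracked when its frequency lies within tol_hz of
    any harmonic frequency of any active comb (assumed criterion).
    """
    strongest = sorted(partials, key=lambda p: p[1], reverse=True)[:n_strongest]
    candidates = []
    for freq, amp in strongest:
        tracked = any(abs(freq - f) <= tol_hz for comb in combs for f in comb)
        if not tracked:
            candidates.append((freq, amp))
    return candidates


partials = [(220.0, 1.0), (330.0, 0.9), (440.0, 0.8), (550.0, 0.1)]
active_combs = [[220.0, 440.0, 660.0]]
print(untracked_strong_partials(partials, active_combs))
```

Each returned partial would seed a new comb, with its frequency taken as the F0 of that comb.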

C. Comb Termination

An instrument typically plays for a short period of time and then pauses. Correspondingly, a comb should be terminated when the power of its source decays below a certain level. The comb c is terminated if the sum of its partial amplitudes falls below a threshold fraction of the amplitude of the strongest partial in the partial space:

illustration not visible in this excerpt
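The termination rule stated above can be sketched in a few lines. The threshold value is not given in this excerpt, so `threshold=0.1` is a placeholder, and the handling of an empty partial space is an added assumption.

```python
from typing import List


def should_terminate(comb_amps: List[float],
                     partial_space_amps: List[float],
                     threshold: float = 0.1) -> bool:
    """True when the summed comb amplitude drops below
    threshold * (amplitude of the strongest partial in the partial space)."""
    if not partial_space_amps:
        # Assumption: with no partials at all, nothing is left to track.
        return True
    return sum(comb_amps) < threshold * max(partial_space_amps)
```

Because the threshold is relative to the strongest partial in the current frame, the rule adapts to the overall loudness of the recording.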

To make this system more specific to a particular instrument, i.e. vocal in this work, we specify some more constraints.

1) The vocal F0 range is limited:
2) The first harmonic is very predominant.


Among the combs present at the current time instant (up to the maximum number allowed), the one that corresponds to the vocal source has to be identified. To determine the vocal contour, the harmonic strength criterion is used. The vocal recognition score for the c-th comb at the n-th instant is defined as

illustration not visible in this excerpt

While other researchers have used recognition criteria over individual frames, we use the knowledge that a comb tracks the same source to smooth the score over time with a first-order linear filter, represented in the z-domain as

illustration not visible in this excerpt
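A first-order smoothing filter of this kind can be sketched as follows. The actual transfer function is not visible in this excerpt, so the common one-pole form H(z) = (1 - beta) / (1 - beta z^-1), the zero initial state, and the value of `beta` are assumptions.

```python
from typing import List


def smooth_scores(scores: List[float], beta: float = 0.9) -> List[float]:
    """Apply y[n] = (1 - beta) * x[n] + beta * y[n-1] with y[-1] = 0."""
    smoothed = []
    prev = 0.0
    for s in scores:
        prev = (1.0 - beta) * s + beta * prev
        smoothed.append(prev)
    return smoothed


print(smooth_scores([1.0, 1.0, 1.0], beta=0.5))
```

The filter needs only the previous output, so it can run frame by frame, which is in keeping with the on-line goal of the system. A larger `beta` gives heavier smoothing at the cost of a slower response when the vocal source changes.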

There is also a possibility that two combs start tracking the same source, due to tracking error.

illustration not visible in this excerpt

Fig. 5. Block schematic for obtaining the score for vocal comb identification.

To reduce the score of combs that track instruments, we develop a filter based on the idea that instruments mostly have a stable pitch contour, whereas the vocal pitch contour has an involuntary instability called jitter. The stability of the pitch contour is quantified using the standard deviation (SD), calculated over a finite number of previous instants:

illustration not visible in this excerpt

[illustration not visible in this excerpt] If the SD is less than a threshold [illustration not visible in this excerpt], the recognition score is attenuated using the filter

illustration not visible in this excerpt
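The jitter-based attenuation can be sketched as below. The attenuation filter itself is not visible in this excerpt, so the simple multiplicative gain, the threshold value, and the use of raw F0 values in Hz (rather than, say, cents) are all assumptions.

```python
import statistics
from typing import List


def attenuate_if_stable(score: float,
                        recent_f0: List[float],
                        sd_threshold: float = 1.0,
                        gain: float = 0.5) -> float:
    """Scale the score down when the recent pitch contour is too stable.

    recent_f0 holds the F0 values of one comb over a finite number of
    previous instants; a small standard deviation suggests an instrument.
    """
    if len(recent_f0) < 2:
        return score  # not enough history to judge stability
    sd = statistics.pstdev(recent_f0)
    return score * gain if sd < sd_threshold else score
```

A perfectly flat contour such as five frames at 440 Hz is attenuated, while a contour that wanders by a few hertz from frame to frame keeps its full score.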

This weakens the score of the instrumental combs. Sometimes, the vocal pitch contour also happens to have an SD slightly below the threshold.


Extraction accuracy

The harmonic cluster tracking system works in two steps, namely harmonic source tracking and vocal pitch selection. We evaluate the accuracy of melody extraction both without and with vocal source selection.



illustration not visible in this excerpt

An estimated pitch value is considered correct if it is less than half a semitone away from the ground-truth pitch value. The accuracy is defined in two ways. Raw pitch accuracy (RPA) is the probability of giving the correct pitch value. Raw chroma accuracy (RCA) allows octave errors: it is the probability that the estimated pitch value, when mapped into the same octave as the ground-truth pitch value, is identified as correct.
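The two measures can be computed as in the sketch below, which is an illustration rather than the evaluation code used in the paper. It assumes that both sequences contain only voiced frames with strictly positive frequencies in Hz and that the ground-truth sequence is non-empty; the function names are hypothetical.

```python
import math
from typing import Sequence


def cents_between(est_hz: float, ref_hz: float) -> float:
    """Signed pitch distance in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(est_hz / ref_hz)


def raw_pitch_accuracy(est: Sequence[float], ref: Sequence[float],
                       tol_cents: float = 50.0) -> float:
    """Fraction of frames whose estimate is within half a semitone."""
    hits = sum(1 for e, r in zip(est, ref) if abs(cents_between(e, r)) < tol_cents)
    return hits / len(ref)


def raw_chroma_accuracy(est: Sequence[float], ref: Sequence[float],
                        tol_cents: float = 50.0) -> float:
    """Same as RPA, but octave errors are forgiven by folding the
    distance to the nearest multiple of 1200 cents."""
    hits = 0
    for e, r in zip(est, ref):
        d = cents_between(e, r)
        folded = d - 1200.0 * round(d / 1200.0)
        if abs(folded) < tol_cents:
            hits += 1
    return hits / len(ref)


estimated = [440.0, 880.0, 300.0]
reference = [440.0, 440.0, 440.0]
rpa = raw_pitch_accuracy(estimated, reference)
rca = raw_chroma_accuracy(estimated, reference)
```

In this toy example the second frame is exactly one octave too high, so it counts as an error for RPA but as correct for RCA, while the third frame is wrong under both measures.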


In this work, we have described a harmonic cluster tracking system that tracks the various harmonic sources in polyphonic music and identifies the predominant vocal melody among them based on various heuristics.

The evaluation results indicate that the proposed on-line, real-time system achieves significant accuracy levels compared with the existing state-of-the-art offline method. The current approach identifies the desired sound source based mainly on the harmonic strength, but using timbral features for this task may further improve the accuracy. The major contributions of this work are:

(i) Unified approach: Most of the previous approaches apply the static and dynamic constraints separately, applying the static constraints first, followed by the dynamic ones in the form of dynamic programming, whereas the proposed approach applies both within a single tracking framework.
(ii) Tracking strong higher harmonics: While previous approaches track the F0 trajectories, using the Viterbi algorithm etc., our algorithm depends on the strong higher harmonics for tracking.
(iii) Vocal selection filters: Instead of the commonly used offline methods based on dynamic programming, our approach uses score filters that select the vocal source based on strength and jitter constraints.
(iv) Real-time method: The proposed method is implementable in real time.


This research paper was made possible through the help and support of everyone, including parents, teachers, family, friends and, in essence, all sentient beings. In particular, I would like to express my gratitude to the following advisors and contributors.

First and foremost, I would like to thank (name of guide) for his support and encouragement. He kindly read my paper and offered invaluable, detailed advice on grammar, organization, and the theme of the paper. Second, I would like to thank all the other professors who have taught me about Buddhism over the past two years of my pursuit of the master's degree.

Finally, I sincerely thank my parents, family, and friends, who provided advice and financial support. Without them, this research paper would not have been possible.

Thank You


[1] A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917-1930, 2002.

[2] A. Klapuri, “Automatic music transcription as we know it today,” J. New Music Res., vol. 33, no. 3, pp. 269-282, 2004.

[3] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. Secaucus, NJ: Springer-Verlag, 2006.

[4] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice-Hall, 2002.

[5] P. Boersma and D. Weenink, Praat: Doing Phonetics by Computer, 2004.

[6] G. Hu and D. L. Wang, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067-2079, Nov. 2010.

[7] E. Vincent, N. Bertin, and R. Badeau, “Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2008, pp. 109-112.

[8] J. Durrieu, G. Richard, B. David, and C. Fevotte, “Source/filter model for unsupervised main melody extraction from polyphonic audio signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 564-575, Mar. 2010.

Excerpt out of 10 pages


Vaaibhav Vaidya (Author), 2014, Vocal Melody Extraction from Polyphonic Audio files using enhanced Cluster tracking, Munich, GRIN Verlag, https://www.grin.com/document/334010
