Table of Contents
Abstract
I. Introduction
II. Partial Space Extraction
III. Source Tracking System
A. Tracking a Comb
B. Initializing a Comb
C. Comb Termination
IV. Identification of Vocal Comb
V. Performance
VI. Conclusion
Acknowledgement
References
Harmonic Cluster Tracking for Vocal Melody Extraction from Polyphonic Audio
Abstract
Determining the predominant melody from musical performances that contain a large number of instruments is a complicated and highly challenging task, and one of the hardest problems in music information retrieval and computational musicology. Over the past decade, melody extraction has emerged as an active research topic. In this paper, we present a novel framework that estimates the predominant vocal melody in real time by tracking the frequencies of multiple sources with harmonic clusters (combs) and identifying the actual predominant vocal source by its harmonic strength. We rely on strong higher harmonics for robustness against distortion of the first harmonic caused by low-frequency accompaniment in the signal, in contrast to existing methods that track only pitch values. The proposed method, although on-line, is shown to significantly outperform our implementation of a state-of-the-art offline method for vocal melody extraction.
Keywords: music information retrieval, predominant frequency estimation, pitch tracking, spectral harmonics, vocal melody estimation, computational musicology, vocal harmonics.
I. INTRODUCTION
In this work, we focus our attention on estimating the melody of a singing voice from a single singer in the presence of pitched and percussive accompaniment in a monaural, i.e. single-channel, audio recording. We ignore the linguistic information in the song and consider only the melody information.
A general architecture underlying most F0 estimation systems is outlined in Fig. 1. The information in the audio signal is transformed into a suitable representation space, where it can be conveniently clustered into subspaces such that each subspace represents one source and can be extracted reliably. Static constraints are used to analyze and cluster the information within a single time frame. Dynamic constraints model the evolution of subspaces over time and help in clustering the source information across successive time frames. The clustered subspaces then yield multiple F0s corresponding to various sources, out of which one particular source, the vocal source in this work, is selected using a harmonic strength criterion and instrument-specific constraints. In this paper, we represent the audio signal as a partial space consisting of sinusoidal components. We view the F0 estimation task as a two-level clustering of this partial space into various sound sources: first statically, at the spectral level (i.e. within each temporal window), and second dynamically (i.e. over successive temporal windows).
A harmonically related cluster of partials is termed a comb. With each comb aiming to track the partials from a single source, several combs simultaneously track various pitched sources. While other works use F0 and salience values for dynamic constraints, our work relies on directly tracking the higher harmonics. In this work, we develop a novel vocal melody selection scheme using a series of filters, with the aim of making the system implementable in real time. The overall flowchart of the proposed system is shown in Fig. 2: first the extraction of the partial space, then the harmonic source tracking module, and finally the vocal source identification module.
II. PARTIAL SPACE EXTRACTION
As the first step, the signal is transformed into a representation space that contains most of the relevant information about the various pitched sources in the polyphony. This is achieved by considering the frequency and amplitude of all the partials (peaks in the magnitude spectrum) in the discrete Fourier transform (DFT) space. An N-point short-time Fourier transform (STFT) of the monaural music recording is computed using a sliding Hanning window with a temporal length of 80 ms; N is chosen to be of the order of the sampling frequency. The next step is to extract the peaks (partials) in the spectrum. There are various ways to improve the partial estimation accuracy; however, our work uses the simplest of all, i.e. the amplitude and frequency of only the local maxima in the spectrum. A local maximum is a point whose amplitude is greater than that of its immediate neighbors on the frequency axis.
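For concreteness, a minimal sketch of this step follows, assuming a NumPy/SciPy pipeline; the 80 ms window follows the text, while the FFT-size rule and all function names are our own illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.signal import stft

def extract_partial_space(x, fs):
    """Return, per frame, a list of (frequency, amplitude) local maxima."""
    nperseg = int(0.08 * fs)                # 80 ms Hanning window, as in the text
    nfft = 2 ** int(np.ceil(np.log2(fs)))   # N of the order of the sampling frequency
    f, t, Z = stft(x, fs=fs, window='hann', nperseg=nperseg, nfft=nfft)
    mag = np.abs(Z)
    frames = []
    for n in range(mag.shape[1]):
        s = mag[:, n]
        # a local maximum exceeds both immediate neighbors on the frequency axis
        idx = np.where((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:]))[0] + 1
        frames.append(list(zip(f[idx], s[idx])))
    return frames
```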
III. SOURCE TRACKING SYSTEM
This section explains how the partial space is clustered into source subspaces and how these clusters are tracked in time, using harmonic as well as dynamic constraints. The state of the $c$th comb at the $n$th instant (time frame) is denoted $S_c[n]$, $c = 1, \ldots, C_{\max}$. The amplitude and frequency of the $h$th partial in the $c$th comb at the $n$th instant are denoted $A_c^h[n]$ and $F_c^h[n]$, respectively.
The comb state contains the frequencies and amplitudes of the $H$ partials associated with it. We omit $n$ at several places below for ease of notation. In the following discussion, two likelihood functions will be used, static and dynamic, defined as follows (the expressions are reconstructed here in a Gaussian form consistent with their description, the originals not being legible in this excerpt):
$$\mathcal{L}_s\!\left(F_p, A_p \mid F\right) = A_p \exp\!\left(-\frac{(F_p - F)^2}{2\sigma_F^2}\right) \qquad (1)$$

$$\mathcal{L}_d\!\left(F_p, A_p \mid \hat{F}, \hat{A}\right) = \exp\!\left(-\frac{(F_p - \hat{F})^2}{2\sigma_F^2} - \frac{(A_p - \hat{A})^2}{2\sigma_A^2}\right) \qquad (2)$$
Eq. (1) ensures that for the static likelihood to be high, the distance between the frequencies $F_p$ and $F$ should be small and the partial amplitude $A_p$ should be large. This likelihood is called static because it is used in the context of a single spectrum. On the other hand, Eq. (2) ensures that for the dynamic likelihood to be high, the distances between both the frequencies and the partial amplitudes should be small. This likelihood is called dynamic as it is generally used in the context of the dynamic evolution of spectra, such that $\hat{F}$ and $\hat{A}$ are values predicted from the spectra at previous instants.
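The following sketch implements the two likelihoods in the Gaussian form reconstructed above; the widths sigma_f and sigma_a are illustrative constants, not values from the paper:

```python
import numpy as np

def static_likelihood(f_partial, a_partial, f_target, sigma_f=20.0):
    # high when the partial lies close in frequency to the target
    # and has large amplitude, as described for Eq. (1)
    return a_partial * np.exp(-((f_partial - f_target) ** 2) / (2 * sigma_f ** 2))

def dynamic_likelihood(f_partial, a_partial, f_pred, a_pred,
                       sigma_f=20.0, sigma_a=0.1):
    # high when both frequency and amplitude are close to the values
    # predicted from previous frames, as described for Eq. (2)
    return np.exp(-((f_partial - f_pred) ** 2) / (2 * sigma_f ** 2)
                  - ((a_partial - a_pred) ** 2) / (2 * sigma_a ** 2))
```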
A. TRACKING A COMB
The state of an active comb $c$, which is tracking a cluster of partials, has to be updated at each new time instant. This task is accomplished using prediction and measurement updates, as in the Kalman filter framework.
Prediction update: The a priori state of the comb is estimated using the first-order prediction as
$$\hat{g}_c^h[n] = g_c^h[n-1] + \alpha_g \, \Delta g_c^h[n-1], \qquad h = 1, \ldots, H,$$

where $\alpha_g$ are the prediction coefficients, and $\Delta$ is defined as

$$\Delta g_c^h[n-1] = g_c^h[n-1] - g_c^h[n-2],$$

where $g$ is to be substituted with $A$ or $F$.
Measurement update: The partial space at the $n$th instant is used as the measurement to obtain the a posteriori estimate of the comb state. To handle situations where the first harmonic is distorted by accompaniment interference, the system relies on the higher harmonics for the state update. A harmonic $h$ is selected as a leader harmonic if it is strong in amplitude; the criterion used is that its amplitude should be greater than a constant fraction of that of the first harmonic for the latest four available instants.
The winning potential state, which finally updates the $c$th comb, is chosen as the one that maximizes the static likelihood over the candidate states $\tilde{S}$:

$$S_c[n] = \arg\max_{\tilde{S}} \sum_{h=1}^{H} \mathcal{L}_s\!\left(F^h, A^h \mid \hat{F}_c^h[n]\right).$$
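Putting the two updates together, a simplified per-frame comb update might look as follows; the Comb structure, the prediction coefficient alpha, and sigma_f are our own illustrative choices, and the leader-harmonic mechanism and amplitude prediction are omitted for brevity:

```python
import numpy as np

class Comb:
    """State of one comb: amplitudes A[h] and frequencies F[h] of H harmonics."""
    def __init__(self, f0, a0, n_harmonics=10):
        self.F = f0 * np.arange(1, n_harmonics + 1)  # assume harmonic positions
        self.A = np.full(n_harmonics, float(a0))
        self.F_prev = self.F.copy()                  # frequencies at the previous instant
        self.active = True

def predict(comb, alpha=1.0):
    """A priori frequencies by first-order prediction:
    F_hat[n] = F[n-1] + alpha * (F[n-1] - F[n-2])."""
    return comb.F + alpha * (comb.F - comb.F_prev)

def measure(comb, partials, f_hat, sigma_f=20.0):
    """A posteriori update: each harmonic takes the partial that maximizes
    the static likelihood around its predicted frequency."""
    freqs = np.array([p[0] for p in partials])
    amps = np.array([p[1] for p in partials])
    comb.F_prev = comb.F.copy()
    for h, fh in enumerate(f_hat):
        score = amps * np.exp(-((freqs - fh) ** 2) / (2 * sigma_f ** 2))
        best = int(np.argmax(score))
        comb.F[h], comb.A[h] = freqs[best], amps[best]
```

At each frame, predict is called for every active comb before measure, so that all combs compete for the same partial space.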
B. INITIALIZING A COMB
If any of the three strongest (in amplitude) partials in the partial space is not being tracked by one of the already active combs, then a new comb $c$ is initialized, with $F_c^1$ and $A_c^1$ set to the frequency and amplitude of that partial; $F_c^1$ is considered as the F0. Note that in the algorithm, this step is performed after all the active comb states have been updated at the current instant. The maximum number of combs is $C_{\max}$, each of which tracks $H$ harmonics. If all $C_{\max}$ combs are already active, an existing comb is terminated to start the new one. For each remaining harmonic $h$ of the new comb, the static likelihood

$$\arg\max_p \, \mathcal{L}_s\!\left(F_p, A_p \mid h\,F_c^1\right)$$

selects the partial that is close to the predicted harmonic frequency and has a large amplitude.
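A sketch of this initialization rule, continuing the Comb structure from the previous sketch (max_combs and the frequency tolerance tol are illustrative constants):

```python
import numpy as np

def maybe_initialize(combs, partials, max_combs=10, tol=0.03):
    """Start a new comb on any of the three strongest untracked partials."""
    ranked = sorted(partials, key=lambda p: p[1], reverse=True)[:3]
    for f, a in ranked:
        # a partial counts as tracked if some active comb holds a nearby harmonic
        tracked = any(c.active and np.any(np.abs(c.F - f) < tol * f)
                      for c in combs)
        n_active = sum(c.active for c in combs)
        if not tracked and n_active < max_combs:
            combs.append(Comb(f0=f, a0=a))  # the partial frequency is taken as F0
```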
C. COMB TERMINATION
The playing of any instrument continues for some time and then stops; correspondingly, a comb should be terminated once the power of its source decays below a certain level. The comb $c$ gets terminated if the sum of its partial amplitudes falls below a threshold $\tau$ times $A_{\max}[n]$, the amplitude of the strongest partial in the partial space:
$$\sum_{h=1}^{H} A_c^h[n] < \tau \, A_{\max}[n].$$
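A corresponding termination check might look as follows (the threshold tau is an illustrative constant):

```python
def terminate_weak(combs, partials, tau=0.05):
    """Deactivate any comb whose summed partial amplitudes fall below
    tau times the strongest partial in the current frame."""
    a_max = max(a for _, a in partials)
    for c in combs:
        if c.active and c.A.sum() < tau * a_max:
            c.active = False
```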
To make this system more specific to a particular instrument, i.e. the voice in this work, we specify some more constraints:
1) The vocal F0 range is limited.
2) The first harmonic is very predominant.
IV. IDENTIFICATION OF VOCAL COMB
Among the at most $C_{\max}$ combs present at the current time instant, the one which corresponds to the vocal source has to be identified. To determine the vocal contour, the harmonic strength criterion is used. The vocal recognition score for the $c$th comb at the $n$th instant is defined as (reconstructed here as the comb's total harmonic strength, consistent with the stated criterion)

$$R_c[n] = \sum_{h=1}^{H} A_c^h[n].$$
While other researchers have used recognition criteria over individual frames, we use the knowledge that a comb tracks the same source to smooth the score over time using a first-order linear filter, represented in the z-domain (in a standard first-order form, with $\beta$ an assumed smoothing constant) as

$$H(z) = \frac{1 - \beta}{1 - \beta z^{-1}}.$$
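A one-line sketch of this recursive smoothing (the coefficient beta is an assumed design constant):

```python
def smooth_score(prev_smoothed, raw_score, beta=0.9):
    """y[n] = beta * y[n-1] + (1 - beta) * x[n], i.e. the first-order
    filter H(z) = (1 - beta) / (1 - beta * z^-1)."""
    return beta * prev_smoothed + (1.0 - beta) * raw_score
```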
There is also a possibility that two combs start tracking the same source, due to tracking error.
Fig. 5. Block schematic for obtaining the score for vocal comb identification.
To reduce the score of combs tracking pitched instruments, we develop a filter based on the idea, presented in earlier work, that these instruments mostly have a stable pitch contour, whereas the vocal pitch contour has an involuntary instability called jitter. The stability of the pitch contour is quantified using its standard deviation (SD), calculated over a finite number of previous instants:
$$\sigma_c[n] = \sqrt{\frac{1}{W} \sum_{k=0}^{W-1} \left( F_c^1[n-k] - \bar{F}_c^1[n] \right)^2},$$

where $\bar{F}_c^1[n]$ is the mean F0 of the comb over the same $W$-frame window.
If $\sigma_c[n]$ is less than a threshold $\sigma_{\mathrm{th}}$, the recognition score is attenuated using a filter, reconstructed here as a constant gain (the original expression is not legible in this excerpt):

$$R_c[n] \leftarrow \gamma \, R_c[n], \qquad 0 < \gamma < 1.$$
This weakens the instrumental comb strength. Sometimes, the vocal pitch contour also happens to have an SD slightly less than the threshold.
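A sketch of this stability filter (the window length, SD threshold, and attenuation factor gamma are illustrative constants):

```python
import numpy as np

def pitch_sd(f0_history, window=20):
    """Standard deviation of a comb's F0 over recent frames; a low value
    suggests a stable (likely instrumental) pitch contour."""
    return float(np.std(np.asarray(f0_history[-window:], dtype=float)))

def attenuate_if_stable(score, f0_history, sd_threshold=2.0, gamma=0.5):
    """Attenuate the recognition score of combs with stable pitch contours."""
    if pitch_sd(f0_history) < sd_threshold:
        return gamma * score
    return score
```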
V. PERFORMANCE
A. EXTRACTION ACCURACY
We track harmonic clusters in two steps, namely harmonic source tracking and vocal pitch selection. We evaluate the melody extraction accuracy both without and with vocal source selection.
TABLE I
PITCH AND CHROMA ACCURACIES
The estimated melody pitch is considered correct if it is less than half a semitone away from the ground-truth pitch value. The accuracy is defined in two ways. Raw pitch accuracy (RPA) is the probability of giving the correct pitch value. Raw chroma accuracy (RCA) allows octave errors and is the probability that the estimated pitch value, when mapped into the same octave as the ground-truth pitch value, is correct.
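These two measures can be computed from frame-wise pitch tracks as follows; a sketch assuming pitch values in Hz and unvoiced ground-truth frames marked as 0:

```python
import numpy as np

def rpa_rca(est_f0, ref_f0):
    """Raw pitch / raw chroma accuracy over voiced frames (ref_f0 > 0).
    A frame is correct when the estimate is within half a semitone
    (50 cents) of the ground truth; chroma additionally forgives
    octave errors by wrapping the deviation to one octave."""
    est, ref = np.asarray(est_f0, float), np.asarray(ref_f0, float)
    voiced = ref > 0
    cents = 1200.0 * np.log2(np.maximum(est[voiced], 1e-9) / ref[voiced])
    rpa = float(np.mean(np.abs(cents) <= 50.0))
    # wrap to [-600, 600) cents so octave shifts do not count as errors
    chroma = (cents + 600.0) % 1200.0 - 600.0
    rca = float(np.mean(np.abs(chroma) <= 50.0))
    return rpa, rca
```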
VI. CONCLUSION
In this work, we have described a harmonic cluster tracking system for tracking various harmonic sources in polyphonic music and, among them, identifying the predominant vocal melody based on various heuristics.
The evaluation results indicate that the proposed on-line, real-time system achieves significant accuracy compared to an existing state-of-the-art offline method. The current approach identifies the desired sound source mainly by its harmonic strength; using timbral features for this task may further improve the accuracy. The major contributions of this work are:
(i) Unified approach: Most previous approaches apply the static and dynamic constraints separately, the static constraints first, followed by the dynamic ones in the form of dynamic programming; in contrast, our framework applies both kinds of constraints jointly within a single tracking step.
(ii) Tracking strong higher harmonics: While previous approaches track the 'F0 trajectories' using the Viterbi algorithm etc., our algorithm depends upon the 'strong higher harmonics' for tracking.
(iii) Vocal selection filters: Instead of the commonly used dynamic-programming-based offline methods, our approach uses score filters which select the vocal source based on strength and jitter constraints.
(iv) Real-time method: The proposed method is implementable in real time.
Acknowledgement
This research paper was made possible through the help and support of everyone, including parents, teachers, family, friends, and, in essence, all sentient beings. I would especially like to dedicate my acknowledgment of gratitude toward the following significant advisors and contributors.
First and foremost, I would like to thank (name of guide) for his utmost support and encouragement. He kindly read my paper and offered invaluable detailed advice on grammar, organization, and the theme of the paper. Second, I would like to thank all the other professors who have taught me about Buddhism over the past two years of my pursuit of the master's degree.
Finally, I sincerely thank my parents, family, and friends, who provided advice and financial support; without them, this research paper would not have been possible.
References
[1] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917-1930, 2002.
[2] A. Klapuri, "Automatic music transcription as we know it today," J. New Music Res., vol. 33, no. 3, pp. 269-282, 2004.
[3] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. Secaucus, NJ: Springer-Verlag, 2006.
[4] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice-Hall, 2002.
[5] P. Boersma and D. Weenink, Praat: Doing Phonetics by Computer, 2004.
[6] G. Hu and D. L. Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067-2079, Nov. 2010.
[7] E. Vincent, N. Bertin, and R. Badeau, "Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2008, pp. 109-112.