Musical information retrieval. Signal Analysis and Feature Extraction using Python

Research Paper (postgraduate), 2021

40 Pages, Grade: 8.0

M. Sai Chaitanya (Author)






Chapter 1- Review of Music Concepts
1.1 Literature Overview
1.2 Basic music elements
1.3 Music terminology

Chapter 2- Musical Information Retrieval
2.1 What is MIR?
2.2 Feature Extraction
2.2.1 Low-level similarity
2.2.2 Top-level similarity
2.2.3 Mid-level similarity
2.2.4 Process of Feature extraction

Chapter 3- Signal Analysis and Feature Extraction using Python
3.1 Why Python?
3.2 Basic Feature Extraction
3.2.1 Zero Crossing Rate
3.2.2 Fourier Transform using Python
3.2.3 Short-Time Fourier Transform using Python
3.2.4 Spectrogram
3.2.5 Mel-spectrogram




Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. MIR is a small but growing field of research with many real-world applications. Those involved in MIR may have a background in musicology, psychoacoustics, psychology, academic music study, signal processing, informatics, machine learning, optical music recognition, computational intelligence, or some combination of these.

MIR is being used by businesses and academics to categorize, manipulate and even create music. One of the classical MIR research topics is genre classification, which is categorizing music items into one of several pre-defined genres such as classical, jazz, rock, etc. Mood classification, artist classification, and music tagging are also popular topics.

This paper gives a comprehensive overview of research on the multidisciplinary field of Music Information Retrieval (MIR). MIR uses knowledge from areas as diverse as signal processing, machine learning, information theory and music theory. The main aim of this paper is to explore how this knowledge can be used to develop novel methodologies for browsing and retrieving large music collections, a hot topic given recent advances in online music distribution and searching. Emphasis is given to audio signal processing techniques.

M. Sai Chaitanya, Soubhik Chakraborty


I take immense pleasure in thanking Dr. S. Chakraborty, Professor and ex-Head, Department of Mathematics, Birla Institute of Technology, Mesra, Ranchi, for enabling me to carry out this project and for his constant guidance and support. I would also like to thank the entire Department of Mathematics, BIT Mesra, for their immense support during my project tenure.

I would like to express my heartfelt gratitude to Mr Nikhil Ken and Mr Rishij Roy Choudary for helping me gain insight into music by sharing their practical knowledge, and for giving me the opportunity to learn and work alongside them, which helped me gain immensely enriching professional experience.

Finally, yet importantly, I would like to express my heartfelt thanks to my beloved parents for their blessings, my friends and all those who supported me directly or indirectly for their help.


List of Figures

Figure 1.1: Example of a musical score

Figure 2.1: Audio Feature

Figure 2.2: Time Series plot using scipy python

Figure 2.3: Frequency domain of input audio

Figure 2.4: Window function shape of an audio signal

Figure 2.5: Zero crossing rate

Figure 2.6: Spectral centroid

Figure 2.7: Block diagram of MFCC

Figure 3.1: Display the kick drum signals

Figure 3.2: Display the snare drum signals

Figure 3.3: Plotting the signal

Figure 3.4: Plotting the zero crossing rate

Figure 3.5: Plot of an audio signal

Figure 3.6: Plotting the spectrum

Figure 3.7: Zoom in the spectrum

Figure 3.8: Spectrogram of the audio file using librosa

Figure 3.9: Displaying mel spectrogram using librosa


MIR (Music Information Retrieval) is a new research field dedicated to meeting users' music information needs. Despite its name's focus on retrieval, MIR incorporates a variety of approaches aimed at music management, easy access, and enjoyment, as will be seen. The majority of MIR studies, proposed methods, and built structures are all content-based.

The basic premise of content-based approaches is that a document can be represented by a collection of features computed directly from its content. Typically, content-based access to multimedia data necessitates the development of new methodologies that are customised to each medium. However, since the underlying models are likely to represent fundamental characteristics shared by various media, languages, and application domains, the core information retrieval (IR) techniques, which are focused on statistics and probability theory, may be more widely used outside the textual case. For this reason, the research results achieved in the area of IR, in particular in the case of text documents, are a continuous reference for MIR approaches.

Businesses and academics use MIR to categorise, manipulate, and even create music. Genre classification, the categorising of music objects into one of several pre-defined genres such as classical, jazz, rock, and so on, is a classic MIR research subject. Music tagging, mood classification, and artist classification are also common topics. This paper focuses on content-based MIR tools, strategies, and methods rather than the programmes that incorporate them. Systems can be compared based on the number of retrieval tasks they can perform, the size of their collections, and the techniques they use.

The paper is organised as follows: this segment concludes with a brief description of certain musical principles. Chapter 1 introduces the dimensions that define musical documents and can be used to classify their substance, as well as the peculiarities of the music language. Chapter 2 introduces and specifies a variety of information needs that have been taken into account by various MIR methods, highlighting the key typologies of MIR users. Chapter 3 looks at how to process musical documents in order to extract features relevant to their dimensions of interest, and presents the efforts carried out towards a shared evaluation framework together with their initial results. Finally, some concluding considerations are drawn in the concluding part.

Chapter 1: Review of Music Concepts

1.1 Literature overview

The majority of music retrieval methods and strategies are focused on a variety of music principles that may be unfamiliar to those without musical experience. As a result, this section provides a brief overview of certain fundamental concepts and terminology.

1.2 Basic music elements

With the exception of certain percussion instruments, any musical instrument produces almost periodic vibrations. The sounds made by musical instruments, in particular, are the product of a combination of different frequencies, all of which are integer multiples of a fundamental frequency, commonly referred to as F0.

The three basic features of a musical sound are

- Pitch, which is related to the perception of the fundamental frequency of a sound; pitch is said to range from low or deep to high or acute sounds.
- Intensity, which is related to the amplitude, and thus to the energy, of the vibration; textual labels for intensity range from soft to loud; the intensity is also defined as loudness.
- Timbre, which is defined as the set of sound characteristics that allows listeners to perceive as different two sounds with the same pitch and the same intensity.
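These three attributes can be illustrated with a short synthesis sketch (using NumPy; the function name, harmonic weights, and all parameter values below are arbitrary choices for illustration only): a tone is built as a sum of partials at integer multiples of F0, where F0 sets the pitch, the overall amplitude sets the intensity, and the relative weights of the partials shape the timbre.

```python
import numpy as np

def harmonic_tone(f0, duration, sr=22050, amplitude=0.5,
                  harmonic_weights=(1.0, 0.5, 0.25)):
    """Synthesize a tone as a sum of partials at integer multiples of f0.

    f0 controls the pitch, amplitude controls the intensity, and the
    relative harmonic_weights of the partials shape the timbre.
    """
    t = np.arange(int(sr * duration)) / sr
    tone = sum(w * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, w in enumerate(harmonic_weights))
    tone /= np.max(np.abs(tone))      # normalize, then scale by amplitude
    return amplitude * tone

# A 440 Hz tone (pitch A4) with partials at 440, 880, and 1320 Hz.
a4 = harmonic_tone(440.0, duration=0.5)
```

Changing `harmonic_weights` while keeping `f0` and `amplitude` fixed produces tones of the same pitch and intensity but different timbre, which is exactly the distinction drawn in the list above.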

Pitch and intensity perception are more complicated than the above definitions suggest. The human ear does not behave in a linear manner when it comes to pitch or intensity perception. However, the fundamental frequency and the energy of a sound may be used to estimate these two perceptually important qualities of sound.

Timbre, on the other hand, is a multidimensional sound quality that defies straightforward categorization. It encompasses the recognition of the sound source (differentiating a saxophone from a violin), of the playing technique (whether a string has been plucked or played with the bow), of the nuances of that technique (the velocity of the bow and its pressure on the string), and of the surrounding acoustics (whether the violinist played in a small room or in a concert hall). Given all these characteristics, it is not surprising that timbre is often defined by what it is not.

Many percussive musical instruments have no fundamental frequency, and their vibration is therefore referred to as noise. Noises are nonetheless perceived as lying in one of three registers: low, medium, or high. Intensity and timbre are also useful noise descriptors.

A chord is formed when two or more sounds are played together. Depending on the tone of the various sounds and, in particular, the distances between them, a chord may have various qualities. Many music genres rely on chords, especially pop, rock, and jazz, where polyphonic musical instruments—such as the piano, keyboard, and guitar—are often devoted to accompaniment and essentially play chords.

1.3 Music terminology

Apart from the fundamental principles discussed in the preceding section, there are numerous words currently used to characterise music that may be unfamiliar to those without a musical background. Music theory and experience also influenced some of the language used by the MIR group.

The musical concepts that are relevant for this overview are the following:

- The tempo is the speed at which a musical work is played, or expected to be played, by performers. The tempo is usually measured in beats per minute.
- The tonality of a song is related to the role played by the different chords of a musical work; tonality is defined by the name of the chord that plays a central role in a musical work. The concept of tonality may not be applicable to some music genres.
- The time signature, usually in the form of a fractional number, gives information on the organization of strong and soft beats along the time axis.
- The key signature, usually in the form of a number of alterations (the sharp ♯ and flat ♭ symbols), is an incomplete representation of the tonality; it is useful for performers because it expresses which notes have to be consistently played altered.

Figure 1.1 depicts four measures of a polyphonic musical score for piano, taken from Claude Debussy's Première Arabesque. The time signature (the C sign, denoting common time) indicates that measures must be divided into four equal beats, and the three sharps (the ♯ signs) indicate that, unless otherwise indicated, all occurrences of the notes F, C, and G must be raised by a semitone. The presence of three sharps could indicate that the excerpt's tonality is either A major or F♯ minor, both of which have the same number of sharps, with the former being the more probable tonality.

Other principles are more concerned with sound production and the criteria that define single notes or groups of notes. A sound, for example, begins to be perceived at its onset time, lasts for a certain amount of time, and stops being perceived at its offset time. Finally, sounds are created by musical instruments and the voice, both of which, due to their conformation, have a restricted range of pitches that they can produce; this range is referred to as the instrument or voice register.

[Figure not included in this excerpt]

Fig. 1.1 Example of a musical score (excerpt from Première Arabesque by Claude Debussy)

Chapter 2: Music Information Retrieval

2.1 What is MIR?

Music Information Retrieval (MIR) involves searching and organising large collections of music, or music information, according to their relevance to specific queries. This is particularly relevant given the vast quantities of musical information available in digital format, and the popularity of music-related digital services. In general, research in Music Information Retrieval (MIR) focuses on the extraction and inference of meaningful features from music (from the audio signal), indexing of music based on these features, and the creation of various search and retrieval schemes (for instance, content-based search, music recommendation systems, or user interfaces for browsing large music collections).

Furthermore, given its obvious commercial appeal, most media content owners and distributors (e.g. Philips, Sony, Apple) are actively involved in research in the field, while numerous libraries are seeking to incorporate some form of support for MIR in their on-line digital services. Simple MIR systems retrieve data according to a textual query introduced by the user, e.g. ‘David Bowie Heroes’. In those cases the text is compared with the text data that is associated with albums and tracks, making the system essentially no different from any text-based search engine (e.g. Google, Yahoo). However, given the characteristics of the content being retrieved, there is a need for systems that are able to accept “musical” queries such as scores, sung melodies (query by humming) or recorded audio segments (query by example). This proposal is concerned with the latter case.

The goal of querying by example is to retrieve pieces of music from a large collection of digital music content, by their similarity to an example audio document. The ability to perform queries by example is an important requirement for MIR systems. It poses numerous challenges including computational and complexity issues, the design of an appropriate testbed and the choice of an adequate representation for the audio in the query and music collection. The choice of audio representation dictates the similarities that can be identified by the system.
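As a rough sketch of similarity-ranked retrieval, assume each piece in the collection has already been reduced to a feature vector (the track names, vectors, and function names below are invented for illustration); query by example then amounts to ranking the collection against the query's vector under some similarity measure, here the cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_example(query_vec, collection):
    """Rank (name, feature_vector) pairs by decreasing similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in collection]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy collection of 3-dimensional feature vectors (entirely made up).
collection = [
    ("track_a", np.array([0.9, 0.1, 0.0])),
    ("track_b", np.array([0.1, 0.9, 0.2])),
    ("track_c", np.array([0.8, 0.2, 0.1])),
]
query = np.array([1.0, 0.0, 0.0])
ranking = rank_by_example(query, collection)
```

As the text notes, the choice of the audio representation behind these vectors dictates which similarities such a ranking can actually capture.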

As a consequence, MIR aims at making the world's vast store of music available to individuals. To this end, different representations of music-related subjects (e.g., songwriters, composers, performers, consumers) and items (music pieces, albums, video clips, etc.) are considered. A key problem in MIR is classification, which assigns labels to each song based on genre, mood, artist, etc. Music classification is an interesting topic with many potential applications. It provides important functionalities for music retrieval, because most end users may only be interested in certain types of music; a classification system would enable them to search for the music they are interested in. On the other hand, different music types have different properties, and we can manage them more effectively and efficiently once they are categorized into different groups.

2.2 Feature Extraction

The key components of a classification system are feature extraction and classifier learning. Feature extraction addresses the problem of how to represent the examples to be classified in terms of feature vectors or pairwise similarities. The purpose of classifier learning is to find a mapping from the feature space to the output labels so as to minimize the prediction error. We focus on music classification based on audio signals.

Existing approaches to choosing the representation can be broadly divided into those attempting to identify low-level (acoustic) similarity and those aiming to quantify high-level (e.g. note, melody, etc.) similarity. Low-level features can be further divided into two classes, timbre and temporal features, as shown in Fig. 2.1. Timbre features capture the tonal quality of sound that is related to different instrumentation, whereas temporal features capture the variation and evolution of timbre over time.

Low-level features are obtained directly from various signal processing techniques like the Fourier transform, spectral/cepstral analysis, autoregressive modelling, etc. Low-level features have been used predominantly in music classification, due to the simple procedures to obtain them and their good performance. However, they are not closely related to the intrinsic properties of music as perceived by human listeners. Mid-level features provide a closer relationship and include mainly three classes of features, namely rhythm, pitch, and harmony. These features are usually extracted on top of low-level ones. At the top level, semantic labels provide information on how humans understand and interpret music, like genre, mood, style, etc. This is an abstract level, as the labels cannot be readily obtained from lower-level features; the distance between mid-level features and labels is the semantic gap. The purpose of content-based music classification is to bridge the semantic gap by inferring the labels from low-/mid-level features.

From a different perspective, audio features can also be categorized into short-term features and long-term features, as illustrated by Fig. 2.1. Short-term features like timbre features usually capture the characteristics of the audio signal in frames with 10-100 ms duration, whereas long-term features like temporal and rhythm features capture the long-term effect and interaction of the signal and are normally extracted from local windows with longer durations. Hence, the main difference here is the length of the local windows used for feature extraction.
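The short-term framing described above can be sketched as follows (a minimal NumPy implementation; the function name and the frame and hop lengths are illustrative defaults, not prescribed values):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, hop_ms=10):
    """Slice a signal into overlapping short-term analysis frames.

    A frame_ms in the 10-100 ms range yields the short-term frames described
    above; long-term features would aggregate over many such frames.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

sr = 22050
x = np.random.default_rng(0).standard_normal(sr)   # 1 second of noise
frames = frame_signal(x, sr)                        # 25 ms frames, 10 ms hop
```

Each row of `frames` is then the input to a per-frame feature (timbre descriptors, short-time spectra, etc.), while long-term features would be computed over longer windows or over sequences of these rows.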

[Figure not included in this excerpt]

Fig.2.1 Audio feature

2.2.1 Low-level similarity

Systems based on low-level acoustic similarity measures are usually intended to recognise a given recording under noisy conditions and despite high levels of signal degradation. These audio fingerprints represent audio as a collection of low-level feature sets mapped to a more compact representation by a classification algorithm: Haitsma and Kalker use Fourier coefficients and quantization on logarithmically spaced sub-bands, Allamanche et al. use MPEG-7 low-level spectral features, while Batlle and Cano use Mel Frequency Cepstrum Coefficients (MFCC) followed by decoded Hidden Markov Models (HMMs) to produce the required labelling. These features can be extracted from the signal on a frame-by-frame basis and require little (if any) use of musical knowledge and theory. This technology has shown great success in detecting an exact one-to-one correspondence between the audio query and a recording in the database, even when the query has been distorted by compression and background noise. It has seen commercial application in music recognition for end-users and in radio broadcast monitoring. However, acoustic similarity measures disregard any correlation to the musical characteristics of the sound. As a result, two different recordings of the same song (even sharing performer and instrumentation) may not necessarily be near matches in a similarity-ranked list. Low-level similarity measures perform poorly in the retrieval of musically relevant near matches.
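A much-simplified sketch in the spirit of such fingerprinting schemes (this is not the actual Haitsma-Kalker algorithm; the band count, frequency limits, and function names are arbitrary) computes per-frame energies in logarithmically spaced sub-bands and quantizes energy differences across bands and frames into bits:

```python
import numpy as np

def band_energies(frame, sr, n_bands=8, fmin=300.0, fmax=3000.0):
    """Energy of one frame in logarithmically spaced frequency sub-bands."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def fingerprint_bits(prev_frame, frame, sr):
    """One sub-fingerprint: the sign of the energy difference between
    adjacent bands and consecutive frames (a toy version of the scheme
    cited above)."""
    e_prev, e_cur = band_energies(prev_frame, sr), band_energies(frame, sr)
    diff = (e_cur[1:] - e_cur[:-1]) - (e_prev[1:] - e_prev[:-1])
    return (diff > 0).astype(int)

sr = 8000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 1000 * t)
bits = fingerprint_bits(tone[:1024], tone[1024:], sr)
```

Matching then reduces to comparing such bit strings (e.g. by Hamming distance), which is what makes fingerprints robust to compression and background noise while remaining blind to musical near matches.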

2.2.2 Top-level similarity

As an alternative to low-level similarity for music retrieval, we could attempt to find high-level representations from audio that emphasise the musical similarities between recordings.

Automatic transcription and harmonic modelling algorithms are constrained by the type of instrumentation and music that can be analysed. This is largely due to the use of the rules of formal musical notation, which are not necessarily related to the sonic contents of recorded music. For musically relevant retrieval, the use of an unnecessarily high level of representation introduces errors in the symbolic data and reduces the scope of plausible musical queries. This is even more important for music where the performance is a better representation of the musical work than the score (e.g. pop music as opposed to classical music). Therefore, in order to successfully identify musical similarities, we need an alternative representation to the non-musically-relevant low-level features and the error-prone and constrained high-level musical notation. We propose mid-level representations as this alternative.

2.2.3 Mid-level similarity

Mid-level representations of music are measures obtained by the process of transforming the audio signal into a highly sub-sampled function that characterizes the attributes of musical constructs in the original signal. This is the key process in a wide class of musical signal processing algorithms (e.g. onset and pitch detection, tempo and chord estimation). These processes are designed taking into account musical knowledge and a deep understanding of human perception.

Mid-level representations attain higher levels of semantic complexity than low-level features (e.g. they successfully characterise the rhythmic structure of a piece), but without being bounded by the constraints imposed by the rules of music notation. We divide mid-level representations into two categories: event-based mid-level representations, concerned with characterising attributes of individual musical events such as note onsets and pitch; and segment-based mid-level representations, concerned with characterising longer musical sections such as melody, harmony, chorus, etc. Furthermore, we propose that the former may be seen as a grammar which can be used to generate the latter.
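An event-based mid-level representation can be illustrated with a deliberately naive energy-based onset picker (real onset-detection functions are considerably more refined; the threshold, frame length, and function name here are arbitrary). It flags frames whose energy jumps sharply relative to the previous frame and reports their times:

```python
import numpy as np

def onset_candidates(x, sr, frame_ms=20, threshold=4.0):
    """Flag frames whose energy exceeds `threshold` times the previous
    frame's energy; return candidate onset times in seconds."""
    frame_len = int(sr * frame_ms / 1000)
    n = len(x) // frame_len
    energy = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n)])
    return [i * frame_len / sr                 # frame start time in seconds
            for i in range(1, n)
            if energy[i] > threshold * (energy[i - 1] + 1e-12)]

sr = 8000
silence = np.zeros(sr // 2)                                  # 0.5 s of silence
burst = 0.8 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)
onsets = onset_candidates(np.concatenate([silence, burst]), sr)
```

The highly sub-sampled `onsets` list (a handful of times instead of thousands of samples) is exactly the kind of function over musical events that the text describes.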

2.2.4 Process of Feature Extraction

Time- and frequency-domain representation techniques for the automatic description of music recordings are based on the computation of time and frequency representations of audio signals. We summarize here the main concepts and procedures to obtain such representations. The frequency of a simple sinusoid is defined as the number of times that a cycle is repeated per second, and it is usually measured in cycles per second, or Hertz (Hz). As an example, a sinusoidal wave with a frequency f = 440 Hz performs 440 cycles per second. The inverse of the frequency f is called the period T (T = 1/f), which is measured in seconds and indicates the temporal duration of one oscillation of the sinusoidal signal. In the time domain, analog signals x(t) are sampled every Ts seconds to obtain digital signal representations x[n] = x(nTs), where n = 0, 1, 2, ... and fs = 1/Ts is the sampling rate in samples per second (Hz). According to the Nyquist-Shannon sampling theorem, a given audio signal should be sampled at least at twice its maximum frequency to avoid so-called aliasing, i.e. the introduction of artifacts during the sampling process. Time-domain representations, illustrated in Fig. 2.2, are suitable to extract descriptors related to the temporal evolution of the waveform x[n], such as the location of major changes in signal properties.
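These definitions can be made concrete with a few lines of Python (NumPy assumed; the numeric values are just examples):

```python
import numpy as np

f = 440.0            # frequency in Hz (cycles per second)
T = 1.0 / f          # period in seconds, T = 1/f
sr = 22050           # sampling rate fs in samples per second
Ts = 1.0 / sr        # sampling interval in seconds, Ts = 1/fs

# Sample x(t) = sin(2*pi*f*t) at t = n*Ts, i.e. x[n] = x(n*Ts).
n = np.arange(int(0.01 * sr))          # 10 ms worth of sample indices
x = np.sin(2 * np.pi * f * n * Ts)

# Nyquist-Shannon: fs must exceed twice the maximum signal frequency.
nyquist_ok = sr > 2 * f
```

Here `x` is a time-domain representation of the kind shown in Fig. 2.2; at fs = 22050 Hz a 440 Hz sinusoid is sampled roughly 50 times per cycle, comfortably above the Nyquist limit.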

The frequency spectrum of a time-domain signal is a representation of that signal in the frequency domain. It can be generated via the Fourier Transform (FT) of the signal, and the resulting values are usually presented as amplitude and phase, both plotted versus frequency, as illustrated in Fig. 2.3.

For sampled signals x[n] we use the Discrete version of the Fourier Transform (DFT). Spectrum analysis is usually carried out in short segments of the sound signal (called frames), in order to capture the variations in frequency content along time (Short Time Fourier Transform-STFT). This is mathematically expressed by multiplying the discrete signal x[n] by a window function w[n], which typically has a bell-shaped form and is zero-valued outside of the considered interval as illustrated in Fig. 2.4.
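A minimal STFT along these lines, assuming NumPy and using a Hann window as the bell-shaped window function w[n] (the frame and hop lengths are illustrative defaults), might look like:

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """Minimal short-time Fourier transform: multiply each frame by a Hann
    (bell-shaped) window, then take the DFT of each windowed frame."""
    window = np.hanning(frame_len)                  # w[n], zero at the edges
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.rfft(window * x[i * hop:i * hop + frame_len])
                     for i in range(n_frames)])

sr = 8000
t = np.arange(sr) / sr                      # 1 second of samples
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz sinusoid
X = stft(x)                                 # shape: (frames, frame_len//2 + 1)
```

Each row of `X` is the spectrum of one short frame; stacking the magnitudes of these rows over time is precisely what a spectrogram (Section 3.2.4) displays.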

Various numerical values derived from the audio signal are used to characterise it. These are referred to as signal features. Feature extraction is a crucial step in the audio analysis process and, more generally, an essential processing step in machine learning and classification tasks. The aim is to extract a set of features from the dataset of interest that is more informative with respect to the desired properties of the original data, i.e. the audio signal. Feature extraction can also be viewed as a data-rate reduction procedure, because analysis algorithms are based on a relatively small number of features. The original data, i.e. the audio signal, is voluminous and, as such, hard to process directly in any analysis task. It therefore needs to be transformed from its initial representation into a more suitable one, by extracting audio features that represent the properties of the original signals while reducing the volume of data. In order to achieve this goal, it is important to have a good knowledge of the application domain, so that we can decide on the best features.
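As a sketch of feature extraction as data-rate reduction, the toy feature vector below (mean, RMS energy, and zero-crossing rate; this particular choice of features is purely illustrative) collapses thousands of samples into three numbers:

```python
import numpy as np

def feature_vector(x):
    """Reduce a whole signal to a tiny descriptive feature vector:
    [mean, RMS energy, zero-crossing rate]. Which features to pick is
    the domain-knowledge decision discussed above."""
    rms = np.sqrt(np.mean(x ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)   # fraction of sign changes
    return np.array([np.mean(x), rms, zcr])

sr = 8000
x = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)   # 1 s of signal: 8000 samples
fv = feature_vector(x)                              # 8000 numbers -> 3 numbers
```

Downstream analysis then operates on `fv` rather than on the raw waveform, which is what makes classification over large collections tractable.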

[Figure not included in this excerpt]

Fig. 2.2 Time-series plot of an audio file using scipy in Python


M. Sai Chaitanya and Soubhik Chakraborty, 2021, Musical information retrieval. Signal Analysis and Feature Extraction using Python, Munich, GRIN Verlag.