Software-Based Extraction of Objective Parameters from Music Performances

Doctoral Thesis / Dissertation, 2008

194 Pages, Grade: 1.0



1 Introduction

2 Music Performance and its Analysis
2.1 Music Performance
2.2 Music Performance Analysis
2.3 Analysis Data
2.3.1 Data Acquisition
2.3.2 Instrumentation & Genre
2.3.3 Variety & Significance of Input Data
2.3.4 Extracted Parameters
2.4 Research Results
2.4.1 Performance
2.4.2 Performer
2.4.3 Recipient
2.5 Software Systems for Performance Analysis

3 Tempo Extraction
3.1 Performance to Score Matching
3.1.1 Score Following
3.1.2 Audio to Score Alignment
3.2 Proposed Algorithm
3.2.1 Definitions
3.2.2 Pre-Processing
3.2.3 Processing
3.2.4 Similarity Measure
3.2.5 Tempo Curve Extraction
3.2.6 Evaluation

4 Dynamics Feature Extraction
4.1 Implemented Features
4.1.1 Peak Meter
4.1.2 VU Meter
4.1.3 Root Mean Square Based Features
4.1.4 Zwicker Loudness Features
4.2 Example Results

5 Timbre Feature Extraction
5.1 Implemented Features
5.1.1 Spectral Rolloff
5.1.2 Spectral Flux
5.1.3 Spectral Centroid
5.1.4 Spectral Spread
5.1.5 Mel Frequency Cepstral Coefficients
5.2 Example Results

6 Software Implementation
6.1 Data Extraction
6.1.1 FEAPI
6.1.2 Performance Optimizations
6.2 Performance Player
6.2.1 Smoothing Filter
6.2.2 Overall Results for each Feature
6.2.3 Graphical User Interface

7 String Quartet Performance Analysis
7.1 Musical Score
7.2 Recordings
7.3 Procedure
7.3.1 Audio Treatment
7.3.2 Analysis Data
7.3.3 Feature Space Dimensionality Reduction
7.4 Overall Performance Profiles
7.4.1 Tempo
7.4.2 Timing
7.4.3 Loudness
7.4.4 Timbre
7.5 Performance Similarity
7.5.1 Repetition Similarity
7.5.2 Overall Similarity
7.6 Overall Observations
7.6.1 Dimensionality of Overall Observations
7.6.2 Relationships between Overall Observations
7.7 Summary

8 Conclusion
8.1 Summary
8.2 Potential Algorithmic Improvements
8.3 Future Directions

List of Figures

List of Tables

A Standard Transformations
A.1 Discrete Fourier Transformation
A.2 Principal Component Analysis

B Software Documentation
B.1 Parameter Extraction
B.1.1 Command Line
B.1.2 Input and Output Files
B.2 Performance Player
B.2.1 Loading Performances
B.2.2 Visualize Parameters
B.2.3 Play Performances

C Result Tables - String Quartet Analysis


List of Symbols and Abbreviations

illustration not visible in this excerpt


Music is a performing art. In most of its genres, it requires a performer or a group of performers who “self-consciously enacts music for an audience” [Slo85]. In classical or traditional western music, the performer renders the composer’s work, a score containing musical ideas and performance instructions, into a physical realization.

Different performances of the same score may significantly differ from each other, indicating that not only the score defines the listener’s music experience, but also the performance itself. Performers can be identified by listeners with regard to certain characteristics of their performances, and certain performers can be as famous as composers. A performance is a unique physical rendition or realization of musical ideas that is never just a reproduction but always a (new) interpretation. The performer is expected to “animate the music, to go beyond what is explicitly provided by the notation or aurally transmitted standard - to be ‘expressive’ ” [Cla02b]. Bach explains [Bac94]

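Worinn aber besteht der gute Vortrag? in nichts anderm als der Fertigkeit, musikalische Gedancken nach ihrem wahren Inhalte und Affeckt singend oder spielend dem Gehöre empfindlich zu machen. (In English: But what does good performance consist of? Of nothing other than the ability to make musical thoughts perceptible to the ear, through singing or playing, according to their true content and affect.)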

If different performances of the same piece of music are expected to represent the underlying musical ideas, why do they differ so clearly from each other, and what are the differences and commonalities between them?

For a better understanding of the role of music performances, it is helpful to consider the performance as embedded into a chain of musical communication starting at the composer and his score and ending with the listener, as shown in Fig. 1.1. The model is loosely based on Kendall’s three-stage model featuring Composer, Performer and Listener [KC90]. The feedback paths indicate possible interrelations with the performance.

Figure 1.1: Chain of Musical Communication (illustration not visible in this excerpt)

Obviously, no direct communication takes place between composer and listener. Instead, the composer translates his musical ideas into a score which is analyzed by the performer to derive a performance concept or plan and finally to render the acoustic realization, the performance, that is subsequently perceived by the listener. Each of the communication stages allows or even enforces interpretation, modification, addition and dismissal of information. Chapter 2 provides more in-depth analysis of several of the communication stages within the context of music performance.

Music Performance Analysis (MPA) aims at obtaining a basic understanding of music performances. Good examples of applied MPA, although highly subjective, are reviews of concerts and recordings that do not focus on the score information but rather on a specific performance or rendition of this score.

The first systematic studies of music performance date to the beginning of the 20th century, when mechanical and technical tools became available to record, reproduce and eventually analyze music performances that previously had been unique, non-repeatable experiences. It was not only the reproducibility but also the objectivity of the newly available data that motivated researchers to discover music performances as a topic of scientific interest. Piano rolls, for example, which were used to record and reproduce performances on mechanical pianos, proved to be excellent sources of detailed tempo and timing data for the recorded performances. Mechanical sensors and cameras made it possible to track performance data such as hammer movements in pianos, and oscillographs and similar devices allowed the frequency analysis of recorded performances. The evolution of measurement devices, the introduction of MIDI (Musical Instrument Digital Interface) as a standard for control and recording of electronic musical instruments, as well as the rise of digital approaches in signal recording, storage and analysis contributed to the development of the research field Music Performance Analysis. Especially during the last decade, new possibilities of data extraction and data mining were introduced and helped to simplify and speed up the process of analysis significantly. Despite all technical improvements, the main difficulties in performance research appear to remain the same as before:

- How to extract data that fulfills high demands on reliability, validity and external validity, i.e. the significance of the gathered data set, to allow general conclusions to be drawn?
- How to structure and interpret the extracted information in a musically, musicologically or psychologically meaningful way?

The first difficulty is actually a combination of problems; although for many pieces a nearly limitless number of recordings can be found, only the audio recording of these performances is available instead of detailed and accurate performance data provided by sensors frequently used in performance research. Since the “manual” extraction of performance data from audio is time-consuming, automated data extraction by a software system can be used for previously impracticable large-scale analyses while providing objective and reproducible results. Recently, modern digital audio signal approaches have led to encouraging results in the context of audio content analysis. For example, the accuracy and reliability of high-level data extracted from audio signals increased significantly.

The aim of this work is to adapt and develop such approaches for use in a software system for the automatic acquisition of performance data from audio recordings in a sufficiently robust and accurate way, and to make the extracted data easily accessible to the analyst. This will be referred to as a descriptive approach which presents characteristics and properties of musical performances, as opposed to an interpretative approach that would attempt to explain the results in their musical, psychological or other context. For example, it is neither the goal of a descriptive approach to reveal a concept of interpretation or a performance plan nor to assess performance quality or to develop models of performance reception by listeners.

For this purpose, we restrict ourselves to the analysis of recordings from professional music performances of pre-existent compositions that are available in classical score notation, and do not aim at the analysis of improvisation, sight-reading and rehearsals or music that does not stand in the western concert tradition. There are no restrictions on instrumentation or genre, but the focus lies on polyphonic or multi-voiced ensemble music performed by more than one musician.

To demonstrate the suitability of the presented system, an analysis of string quartet performances is undertaken. The analysis of chamber ensemble performances is a rather neglected object of study, and the current understanding of music performance is mainly gained from piano performances. The presented results can be used to verify if and how these insights can be transferred to ensemble music with non-keyboard instruments.

In summary, the main contributions of this work are the design and implementation of a software system dedicated to music performance analysis, the presentation of optimized methods for audio content analysis, and the performance analysis of string quartet recordings.

Chapter 2 is an introduction to music performance and its characteristics. Furthermore, it summarizes past and present approaches to systematic performance research with a focus on the extraction, the properties and the interpretation of the investigated performance data.

Chapter 3 describes the algorithmic design of the software library for automatic tempo and timing extraction from an audio file utilizing a score representation of the piece of music. The algorithm is based on a Dynamic Time Warping approach that finds the optimal global match between discrete times of a performed audio recording and the note events of a quantized MIDI file, given a fitting similarity measure between audio and MIDI data.
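The Dynamic Time Warping idea mentioned above can be illustrated with a minimal sketch. This is not the algorithm developed in Chapter 3: the function name `dtw_path`, the scalar per-frame features and the absolute-difference distance are assumptions chosen only to keep the example short. The sketch accumulates local distances between audio frames and score events and then backtracks the cheapest global path.

```python
def dtw_path(audio_feats, score_feats):
    """Return (total_cost, path) of the optimal global alignment.

    Illustrative assumption: each audio frame and each score event is
    described by one number, and the local distance is their absolute
    difference. A real system would use a richer similarity measure.
    """
    n, m = len(audio_feats), len(score_feats)
    inf = float("inf")
    # cost[i][j]: cheapest accumulated cost of aligning the first i audio
    # frames with the first j score events.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(audio_feats[i - 1] - score_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # audio advances
                                 cost[i][j - 1],      # score advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end of both sequences to their start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy data: five audio frames that dwell on the first and last score event.
total, path = dtw_path([1, 1, 2, 3, 3], [1, 2, 3])
print(total, path)
```

Each pair in the returned path links an audio frame index to a score event index; reading off the first frame assigned to each event would give one possible estimate of that event's onset time.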

Chapters 4 and 5 describe the selection, interpretation and implementation of various low-level audio features for the analysis of both musical dynamics and timbre variation in music performances.

Chapter 6 presents the implementation of the complete software system for music performance analysis, which is split into two parts: the extraction of performance data, and the sonification and visualization of the data.

A systematic study of 21 performances of a movement of Beethoven’s string quartet No. 13 op. 130 can be found in Chap. 7. It investigates tempo, loudness and timbre characteristics extracted from commercial recordings with recording dates between 1911 and 1999. The final Chap. 8 summarizes and concludes this thesis.

2 Music Performance & Performance Analysis

2.1 Music Performance

The chain of musical communication, depicted in Fig. 1.1, shows that the composer communicates musical ideas or information via the score to the performer. A clear distinction should be made between the terms musical score and music. According to Hill, the score is not the music itself, but sets down musical information, together with indications on how this information may be interpreted [Hil02]. Other authors describe the score as one of a number of possible representations such as a Compact Disc (CD), recordings and written descriptions or see the score as a “blueprint for a performance” [Cla02b].

A score that stands in the tradition of western music history always contains information on pitch and (relative) duration of each note; almost always instructions on musical dynamics appear in the score as well. Other instructions for example on character, quality or specific ways to perform may also be found in the score. Some of the contained information is available only implicitly (e.g. information on the musical structure) or might be ambiguous or hidden, complicating its description and quantification (compare [Dor42], [Mey56], [Pal97], [BM99]).

All this information is subject to the performers’ interpretation — they detect and evaluate implicit information, try to understand and explain performance instructions, identify ways to convey their understanding of musical ideas to the listener and transform the discrete score representation of pitch, duration and dynamics to continuous scales.

It can be observed that later scores tend to be more explicit in terms of performance instructions than earlier scores, indicating that composers tried to eliminate the unspecified or ambiguous information in the score [Dor42]. This may be due to the increasing awareness of the fact that scores often take into account performance rules that may seem “natural” at the time of composition but may change over decades and centuries, possibly leading to unintended performances.

Although the literature on musical performance frequently conveys the impression that imprecision and restriction of the score representation is undesirable, the fact is that there can be no true or absolute interpretation. Music is a living art and constant re-interpretation of music representations is the artistic breath that gives music life.

Seashore introduced the idea of defining the expressive parts of a performance as deviations from a “neutral”, mechanical score rendition [Sea38]. However, the assumption that all information on such a “neutral” performance is already contained explicitly in the score seems unlikely on second thought, as the understanding and interpretation of a score might require cultural, historical and musicological considerations as well.

Other authors defined a neutral performance as a performance that is perceived as mechanical (which is not necessarily a mechanical performance [Par03]). A different suggestion was that the required neutral reference performance should be a performance with “perfectly normative rubato (and the equivalent on all other relevant expressive parameters)” [Cla91], that is, a performance that matches all standard expectations of the listener.

Although controlled deviations from such a (normative or subjective) reference are most definitely directly connected with the perception of musical expression, they should not be confused with the expression or expressive deviations, as these terms “usually refer to physical phenomena, that is, deviation in timing, articulation, intonation, and so on in relation to a literal interpretation of the score. This use should be distinguished from a more general meaning of expression in music” as the expression’s domain is the mind of the listener or the performer [Gab99].

Every performance requires a concept or plan which can be created by either a rigorous or a rather intuitive and unsystematic analysis of the score (for instance for sight-reading). This analysis should probably not be seen as an independent process applied to the act of interpretation but as “an integral part of the performing process” [Rin02].

The performance plan is a mental representation of the music [Gab99] that is an abstract list of actions that may be realized in an indefinite number of ways and is specified only relative to the context [Slo82]. Both authors stress the importance of structural and other “musical” information for this performance plan, but it also has to contain all intentions of the performer on what to express or convey to the listener. Of course the performance plan is so closely related to the performance itself that in many cases it does not make sense to treat them separately, and the following paragraphs will not always differentiate between the plan and the performance itself.

Every music performance is highly individual in both its production and its perception. Still, a list of parameters that the performance may depend on can be compiled. The number of influencing parameters on the performance (and the performance plan) itself is probably infinite; nevertheless, the following list attempts to describe the main influences that may explicitly or implicitly influence a musical performance (also compare [Dor42], [Slo82], [Slo85], [Pal97], [TH02], [Wal02], [Cla02b], [Cla02a], [Par03], [Jus03a], [Jus03b]).

- general interpretative rules:
These are rules, conventions, or norms that every performance follows because it would be perceived as uncommon or even unnatural otherwise.
- performance plan and expressive strategy:
A concept of interpretation as a list of actions that may be influenced by
  - interpretation of musical structure or shape, e.g. the question of how to successfully convey melody, phrases, etc. to the listener.
  - addition of unexpectedness or deviation from expected conventions or rules.
  - stylistic and cultural context and rules that may vary over time or between countries or follow “performance fashions” [Cla02b], including instruments or instrument characteristics (such as timbre), used tuning frequencies and temperaments, and typical performance styles with respect to articulation, ornamentation, vibrato styles, tempo, rubato styles, etc. This may apply to both the historic context (the time the piece of music was composed or premiered) and the context at the time of the performance.
  - musical mood and emotional expression that the performer plans to convey to the listener.
  - performance context such as the expected audience, the style and performance plan of other performances and works in the concert program.
- the performers’ personal, social and cultural background:
A very broad category that includes e.g. previous performing and general experiences, teachers and mentors, attitude, manners and mannerisms, etc.
- physical influences:
The auditory and motor or, more generally, physical abilities of the performer, general human limitations (e.g. in timing precision, breathing) as well as attributes of the musical instrument that can impose limitations on e.g. fingering, changing of hand positions etc. may lead to forced or unintended deviations from the performance plan.
- rehearsal:
The rehearsal phase allows direct feedback on the performance plan and may also train some specific motor abilities of the performer. It should be noted that a rehearsal can also be seen as a performance itself.
- immediate influences:
Influences that may change the performance at the time of performance and may lead to a deviation from the performance concept such as
  - runtime feedback control, i.e. the feedback that the performer directly receives that may consist of auditory, visual, tactile, and other cues [Tod93]. This includes various parameters such as the instrument’s sound and reaction, the performance of co-performers, the acoustics of the environment, the reaction of the audience etc.
  - external influences not directly related to the performance such as humidity, temperature, distractions, etc.
  - “internal” influences such as the emotional and physical state of the performers (stress, stage fright, fatigue, illness, etc.)

Expressive movements are sometimes also considered to be part of a performance, since performers may move in ways that are not directly related to the generation of sound but to the character of music. In the context of this dissertation, only the acoustical properties of a performance will be taken into account.

Four classes of acoustical parameters that can be used for the description or characterization of music performances have already been identified in the 1930s by Seashore [Sea38]:

- tempo and timing: global or local tempo and its variation, rubato, or expressive timing, subtle variation of note lengths in phrases, articulation of tones, etc.
- velocity, loudness or intensity: musical dynamics, crescendo and diminuendo, accents, tremolo, etc.
- pitch: temperament, tuning frequency, expressive intonation, vibrato, glissando, etc.
- timbre: sound quality and its variation resulting from instrumentation and instrument-specific properties such as bow positioning (string instruments).

Recorded performances can differ significantly from the live performance, even in the case of so-called live recordings ([Cla02a], [Joh02]). The reason is that more persons than the performers themselves, e.g. the producer and sound engineer, may influence the final result during this production stage. Furthermore, mechanical and technological restrictions enforce differences between an original and reproduced performance, but also open up new possibilities to improve a recorded performance in the post-production process. For example, it is established recording practice (at least in the context of classical music) to not only record complete performances and finally choose the “best”, but instead to record several or many so-called takes of passages of the musical piece. The recording process can also involve repeated listening to the recorded takes and discussions on the performance with influence on the following performances. Afterward, it is decided which parts of these takes will finally be used on the published CD and these will be edited in a way that the cuts are inaudible. Having analyzed seven productions of Beethoven’s 9th Symphony, Weinzierl and Franke found between 50 and 250 cuts between different recording takes in each production; the average number of edits increased with the technical evolution [WF02]. Nowadays, Digital Audio Workstations allow music signals to be edited at nearly any score position.

Microphones and their positioning as well as signal processing done by the sound and mastering engineers may impact the loudness, the timbre, the reverberation and other parameters of the recording. These “interventions” can also vary over time to artificially increase or decrease acoustical or performance-based effects (e.g. increase the loudness of a specific instrument for its solo part etc.). Maempel et al. give an overview on processing options and typical objectives in the post production context [MWK08].

The musician’s and the producer team’s influences are not distinguishable on the final product, for example the CD. Therefore, the resulting recording including the (post) production stage will be referred to as performance in the remainder of this text; this seems to be a valid approach as the artist usually states his final agreement with the recording.

It should be kept in mind that it might not only be the editing and processing that differentiate a recorded performance from a live performance, but also the possible adaptation of the performer to a different reception and expectation in the recording context [Cla02a]. However, these recordings represent one of the principal forms in which music has been available in the last and the current century.

The listener, as the receiving end point of the communication chain, subjectively interprets the music. He listens to a performance and conceives musical ideas and other information that is conveyed by the performance. Since the recipient is affected by the incoming information, at this point in the communication chain the subjective effects of a performance can be analyzed. As Lundin points out, the kinds of possible affective reactions are practically limitless [Lun53].

2.2 Music Performance Analysis

Music Performance Analysis (MPA) aims at studying the performance of a musical score rather than the score itself. It deals with the observation, extraction, description, interpretation and modeling of music performance parameters as well as the analysis of attributes and characteristics of the generation and perception of music performance. Three basic directions can be roughly distinguished in the field of systematic performance analysis:

- to study the performance itself: to identify common and individual characteristics in the performance data, general performance rules, or differences between individual performances
- to study the generation or production of a performance: to understand the underlying principles of performance plans, the relation of the performers’ intention to objective performance parameters (see below), and to investigate the performers’ motor and memory skills
- to study the reception of a performance: to comprehend how performances or the variation of specific parameters are perceived by a listener, and to study how he is affected

MPA could on the one hand lead to more explicit formulations of the different (objective) performance characteristics in the practice of music-teaching or enable the development of teaching assisting systems that give the student direct and objective feedback on the performance parameters. On the other hand, it could assist the implementation of performance models that generate computer renditions of human-like music performances. MPA also gains insights that can be valuable for the investigation of music esthetics and music history.

One of the problems of MPA is to define a suitable reference that the extracted performance data may be compared to. While a mechanical rendition seems to be an obvious choice as reference, other reference renditions such as a (human) performance that attempts a mechanical rendition, a rendition that is perceived to be mechanical, a rendition that is perceived to be standard or common, or an average rendition calculated from many performances could be considered to be more meaningful reference renditions. However, in the latter cases the reference renditions can only be valid in a specific context. This will usually not be desirable from the analyst’s point of view.
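To make the choice of reference concrete, the following sketch compares one performance against two of the reference renditions named above: a mechanical rendition at constant tempo and an average rendition calculated from several performances. All function names, the representation of a performance as a list of inter-onset intervals (IOIs) in seconds, and the numbers are assumptions for illustration and are not taken from this work.

```python
def mechanical_iois(score_durations_beats, tempo_bpm):
    """Nominal IOIs in seconds for a constant-tempo (mechanical) rendition."""
    return [d * 60.0 / tempo_bpm for d in score_durations_beats]


def average_rendition(performances):
    """Element-wise mean IOI over several performances of the same notes."""
    count = len(performances)
    return [sum(values) / count for values in zip(*performances)]


def relative_deviation(performance, reference):
    """Per-note deviation from a reference, as a fraction of the reference."""
    return [(p - r) / r for p, r in zip(performance, reference)]


# Hypothetical data: three performances of the same four quarter notes.
performances = [
    [0.50, 0.52, 0.55, 0.70],
    [0.48, 0.50, 0.56, 0.74],
    [0.52, 0.54, 0.57, 0.66],
]
mechanical = mechanical_iois([1, 1, 1, 1], tempo_bpm=120)
average = average_rendition(performances)

print(relative_deviation(performances[0], mechanical))
print(relative_deviation(performances[0], average))
```

With data of this kind, the final lengthening shared by all three performances appears as a large deviation from the mechanical reference but nearly vanishes against the average reference, which illustrates why the latter is only meaningful within the set of performances it was computed from.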

As Clarke points out, “musical analysis is not an exact science and cannot be relied upon to provide an unequivocal basis for distinguishing between errors and intentions” [Cla04], emphasizing the challenge of meaningful interpretation of extracted performance data. A related difficulty that music performance analysis has to deal with is to distinguish between inherent performance attributes and individual performance attributes. In the context of musical accents, Parncutt [Par03] distinguishes between immanent accents that are assumed to be apparent from the score (structural, harmonic, melodic, metrical, dynamic, instrumental) and performed accents that are “added” to the score by the performer. This approach may be applied to nearly all extracted parameters, and in the general case it might not be possible to distinguish score-inherent and performer-induced characteristics.

The interpretation of importance and meaning of characteristics derived from performance data is a difficult task. In the end, final conclusions can only be drawn by taking into account subjective judgments. The methodology and the questionnaires or rating scales for such subjective tests, and the way they can be taken into account, have, however, only begun to evolve into systematic approaches during the last centuries. The problem of extracting relevant characteristics is apparent in the design of systems intended to automatically generate music performances from a score. Clarke notes (in the context of parameters possibly influencing performances): “Whatever the attitude and strategy of different performers to this wealth of influence, it is clear that a theory of performance which is presented as a set of rules relating structure to expression is too abstract and cerebral, and that the reality is far more practical, tangible and indeed messy” [Cla02b].

Different areas of research contribute to the field of MPA, including musicology, (music) psychology and engineering. An introduction to the research field is given by Clarke [Cla04]. Articles providing extensive overviews have been compiled for example by Gabrielsson [Gab99], Palmer [Pal97] and Goebl et al. [GDP+05]. The following sections do not reiterate these but intend to give an impression on the variety of different approaches to the analysis of music performance.

There are several possibilities to structure the available literature on musical performance analysis. The publications have been grouped depending on different characteristics of method and methodology, although this may lead to multiple citations of the same publications.

2.3 Analysis Data

2.3.1 Data Acquisition

The acquisition of empirical data is one of the crucial points in systematic music performance analysis. Among the various methods that have been proposed and used to acquire data, two general approaches can be identified: monitoring performances (or performance parameters) by mechanical or technical devices, or extracting the parameters from an audio recording of the performance. Both concepts have inherent advantages and disadvantages.

The monitoring approach usually provides accurate and detailed results since the measurement devices can track the performance parameters more or less directly, but the analysis is exclusively restricted to specific performances that were produced under special conditions and with the specific performers that were available.

The direct extraction of performance parameters from the audio (as opposed to from the instrument with sensors) is difficult and most definitely results in less accurate data. This is true for both the manual annotation of audio (such as marking note onset times) and the fully automated extraction of data. Additionally, some parameters of interest may even be impossible to extract from the audio, such as information on piano pedaling or note-off times. Other parameters of interest such as the performers’ movements are obviously not extractable from the audio at all.

The advantage of extracting parameters directly from the audio signal is the possibility to analyze an enormous and continuously growing heritage of recordings, including outstanding and legendary performances recorded throughout the last century and until now. Hence, audio-based approaches allow the empirical basis to be widened considerably with respect to the amount of available sources and their significance.

Audio Content Analysis, an increasingly important branch of Music Information Retrieval, deals with the automatic extraction and analysis of (musical) information from digital audio signals. The majority of the published algorithms work “blind”, meaning that they only have audio data available as input information while any additional input such as the score representation of the analyzed music is not available. Thus, most of these systems (such as so-called transcription systems) aim at the extraction of score information from the audio rather than the extraction of performance information. This is however not a technical necessity, so similar approaches can be utilized to extract performance information directly from the audio signal. The increasing accuracy and robustness of these systems will make such approaches more and more important for MPA.

Piano or Keyboard Performance

The introduction of mechanical pianos at the end of the 19th century made the acquisition of objective performance data possible through piano rolls. For example, Hartmann [Har32] presented an early analysis of tempo and timing of two piano performances based on their piano rolls. There are also later approaches to the analysis of performance data from piano rolls [Dov95].

Other historic approaches used proprietary sensors that were built to extract performance data. The most prominent example is the Iowa Piano Camera that was used by Seashore [Sea38] and his team at the University of Iowa in the 1930s. For each piano key, this “camera” recorded onset and note-off times and hammer velocity by optical means. Another example of a proprietary system is Shaffer’s Bechstein grand piano [Sha84], using photo cells to detect hammer movements.

The introduction of the MIDI (Musical Instrument Digital Interface) specification (for the latest revision, see [MID01]) in the 1980s resulted in an increasing number of electronic instruments and MIDI sequencers as well as computer hardware and software solutions that supported this specification and opened up new possibilities to measure, store and analyze pianists’ performance data. Some music performance research has been done with the help of electronic instruments such as synthesizer keyboards and electronic pianos ([Pal89], [DH94], [Rep96b]), but the majority has concentrated on acoustic instruments with built-in sensors that automatically output MIDI data, such as the Yamaha Disklavier product series or Bösendorfer grand pianos with the so-called SE-System ([Rep96a], [Rep96d], [Rep96c], [Rep97c], [Rep97a], [Bre00], [SL01], [Goe01], [WAD+01], [Sta01], [Wid02], [WT03], [Wöl04], [WDPB06], [TMCV06]).

As already pointed out, the analysis of performances that have not or cannot be recorded on specifically equipped instruments has to be based on the audio data itself. This is the case for the vast majority of available recordings.

To extract the tempo curve from an audio recording, the usual approach is to either tap along with the performance ([DG02], [Hon06]) or to manually annotate the onset times in a wave editor/display or a similar application ([Pov77], [Rep90], [Rep92], [Rep97b], [Rep98], [Rep99a], [Rep99b], [Wid95a], [Wid95b], [Wid98a]). Both approaches have also been automated or partly automated by the use of automatic beat tracking systems — followed by manual correction of beat times — ([Wid98b], [ZW03]/ [WZ04], [Tim05], [DGC06]) or more recently by alignment algorithms using score or MIDI data as additional input ([Ari02], [MKR04], [DW05]). The main difference between tap-along and beat-tracking approaches as compared to manual onset time annotation and alignment systems is that in the former case the resulting tempo curve resolution is on beat level, meaning that between-beat timing variations cannot be analyzed, while the latter usually takes into account each single note onset time, whether this note lies on the beat or not.

A focus on piano performances can be observed in the literature. One of the obvious reasons is that the piano is a very common instrument with a large (solo) repertoire, but there are more reasons that make the piano an appealing choice. The tones produced by a piano have a percussive character that makes this instrument far more suitable for accurate timing analysis than, for instance, string instruments. Its mechanics make it possible to measure data with sensors less intrusively and probably more easily than on other instruments that offer a more direct interaction between performer and sound production. Furthermore, the pianist is in some ways more restricted than other instrumentalists; he is limited to fixed (and equal-tempered) pitch frequencies, which rules out key or harmony dependent intonation and other performance specifics such as vibrato. He also has little influence on the timbre of a played note, and after hitting a key, he is not able to control any of the typical note parameters such as pitch, loudness or timbre except its duration. From a technical point of view, these restrictions seem to make the piano a rather unattractive instrument with limited degrees of freedom, but even with these limitations, piano performances are an integral part of western cultural life, meaning that the mentioned restrictions do not really impede the communication of musical expression between pianist and audience. The reduction of possible parameter dimensions is, however, beneficial in performance research because it simply keeps the measurement dataset smaller. Last but not least, the (commercial) availability of electronic and acoustic instruments using MIDI as a universal communication protocol simplified the performance data acquisition significantly since custom-built solutions were no longer necessary.
While the recording of MIDI data from other non-keyboard instruments is at least partly possible, the fact that MIDI is a keyboard-focused protocol results in limited usefulness in many cases. Despite the good reasons for the usage of piano as the main instrument for performance analysis, it has not yet been conclusively shown that the insights gained from piano performance analysis can be applied to performances with other instruments and ensembles (although the few studies on other instruments indicate that this might at least partly be the case).

Other Instruments or Instrumentations

Most non-piano instruments represented in the musical performance literature are monophonic, meaning that two or more notes never occur simultaneously. In this case, common approaches to frequency analysis can be assumed to be robust enough to extract the pitch variation over time. Proprietary as well as commercially available systems have been applied to the task of pitch extraction from the audio signal ([Sea38], [Scp0], [Dil01], [FJP03], [Wal04], [Bow06], [Orn07], [Rap07], [MAG08], [RPK08]). Seashore invented the “Tonoscope” for the pitch analysis of monophonic signals [Sea02]. It consists of a rotating drum covered with a paper containing small dots, each representing a certain frequency. The input signal is projected on the rotating paper by means of a light-emitting gas tube. If the input frequency matches one of the frequencies a dot represents, this line of dots appears to stand still for the observer and gives a clear indication of the frequency. The “Melograph” used in [Orn07] appears to be of a basically similar design. Other studies work with spectrogram visualizations, use commercially available software solutions for the detection of monophonic pitches, or implement their own software algorithms for pitch detection.

The majority of these systems are not able to extract note onset times, so tempo and timing information is either not analyzed or is extracted by manual annotation. However, to name two counter-examples, Kendall compared timing and dynamics of monophonic melodies performed on piano, clarinet, oboe, violin, and trumpet [KC90] and Ramirez et al. used automatically extracted timing data for the identification of performers of violin recordings [RPK08].

The tempo and timing data for other, non-monophonic signals has usually been extracted by tapping along (e.g. [Hon06]) or by manually setting onset time labels (e.g. [Ras79], [Jer03a], [Jer04]). Clynes [CW86] did not analyze the tempo on a beat or onset level but measured the overall duration of single movements.

2.3.2 Instrumentation & Genre

The majority of musical performance research focuses on the piano as the instrument of main interest.

Other individual instruments include the singing voice ([Sea38], [Scp0], [FJP03], [Rap07]), string instruments such as violin, viola, and violoncello ([Sea38], [KC90], [Dil01], [Bow06], [Orn07], [MAG08], [RPK08]), wind instruments such as flute, clarinet, oboe and trumpet ([KC90], [Wal04], [Orn07]), organ ([Jer03a], [Jer04]) and percussion instruments ([Dah00]).

Publications researching chamber music performances show up less frequently (e.g. [Ras79], [CW86], [Hon06]).

A large variety can be found in the style of musical pieces chosen for performance research. The date of composition of the analyzed musical pieces ranges from the 16th to the 20th century and a general focus on well-known and popular composers such as Bach, Mozart, Beethoven, Schumann, and Chopin can be observed.

2.3.3 Variety & Significance of Input Data

With respect to the question if and how reliably conclusions can be drawn from the extracted data, it is important to verify how and from whose performance this data has been generated.

For example, it could be argued that performance data gathered under “laboratory conditions” is insignificant per se due to the unnatural recording environment; however, these special conditions are also given for many (studio) recording sessions that resulted in recordings that are in fact perceived as convincing performances by the listeners, so we may disregard this point of view.

Still, when the data is acquired under such laboratory conditions, it implies that the number and possibly the skill of the available performers might be restricted. For example, some research has been done on student performances (e.g. [Rep96a], [Rep96d], [Rep96c], [Rep97c], [Rep97b], [Bre00], [SL01], [Goe01], [Wöl04], [Bow06]). This fact by itself is not too remarkable, but it nevertheless raises the question of whether and how research methods and conclusions take into account the possible discrepancies between the performances of student pianists (or just available pianists) and the performances of professional and famous pianists. Under the assumption that fame is related to higher professional skills of the performer, this could be a noteworthy criterion.

Due to the difficulties of acquiring large sets of performance data described above, the number of performers per study is usually small. The majority of the research in the presented paper database has been done with five or fewer performers per publication ([Har32], [Pov77], [Ras79], [Sha84], [KC90], [DH94], [HHF96], [Rep96b], [Bre00], [Dah00], [LKSW00], [Dil01], [GD01], [Shi01], [Sta01], [WAD+01], [Wid02], [Wid98b], [FJP03], [WT03], [Dil04], [Jer04], [WDPB06], [DGC06], [Hon06], [Rap07]) or six to ten performers ([Rep96a], [Rep96d], [Rep96c], [Rep97c], [Rep97a], [Rep99d], [ZW03]/ [WZ04], [Wöl04]). Examples of publications evaluating more performers are [Orn07] with 15 performers, [Rep90] and [Rep97b] with 19 and 20 performers, respectively, [Goe01] with 22 performers, [Rep92] (and using the same data set [Wid95a], [Wid95b], [Wid98a]) with 24 performers and finally [Rep98]/ [Rep99a]/ [Rep99b] with an outstanding number of 108 performers (115 performances). This raises the question of whether and how insights gained from a small group of performers can be extrapolated to allow general assumptions on performances.

Table 2.1 summarizes the characteristics of the analyzed data set for many of the cited publications. Although the usefulness of such a summary is obviously limited, it gives a quick overview of the data set properties.

illustration not visible in this excerpt

Table 2.1: Overview over the analyzed data set in selected MPA publications

2.3.4 Extracted Parameters

The basic classes of objective performance parameters have been identified by Seashore in the 1930s as tempo and timing, pitch, dynamics, and timbre [Sea38].

The variation of tempo and timing over time is one of the most thoroughly researched aspects in MPA. The extracted onset times are usually converted into relative inter-onset intervals (IOI) by calculating the discrete derivative. Then, each data point is normalized by the corresponding note duration from the score in beats. The resulting curve of normalized IOIs is an inverted representation of the tempo with the unit s/Beat (as opposed to the usual musical tempo definition in Beat/s, compare Chap. 3).

The analysis of the articulation is in most cases restricted to keyboard performances that have been captured in MIDI format. Articulation is then simply interpreted as a measure of performed note overlap or note duration with respect to the score note duration.
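The IOI normalization described above can be illustrated with a minimal sketch. The function name and the input layout (one onset time in seconds and one score position in beats per note) are assumptions made for this example only and are not taken from any of the cited studies.

```python
def normalized_iois(onset_times_s, score_positions_beats):
    """Return one normalized IOI (seconds per beat) per pair of consecutive onsets.

    onset_times_s: performed onset times in seconds, in score order.
    score_positions_beats: score position of each onset in beats (hypothetical input).
    """
    if len(onset_times_s) != len(score_positions_beats):
        raise ValueError("one score position is required per onset")
    result = []
    for k in range(len(onset_times_s) - 1):
        # Discrete derivative of the onset times: the inter-onset interval.
        ioi_s = onset_times_s[k + 1] - onset_times_s[k]
        # Notated distance between the two events in beats.
        score_distance = score_positions_beats[k + 1] - score_positions_beats[k]
        if score_distance <= 0:
            raise ValueError("score positions must be strictly increasing")
        result.append(ioi_s / score_distance)
    return result


# Three quarter-note steps and one half-note step, all played at 0.5 s per beat.
curve = normalized_iois([0.0, 0.5, 1.0, 2.0], [0, 1, 2, 4])
```

Because the values are in s/Beat, a larger value corresponds to a slower passage; taking 60 divided by each value converts the curve to beats per minute.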

In order to analyze the musical dynamics in a performance, the level or loudness over time is extracted using sound intensity or psycho-acoustically motivated loudness measurements. Strictly speaking, such measurements do not correspond directly to musical dynamics as these would depend on the musical context, on the instrument or instrumentation, on the timbre, etc. Nevertheless, intensity and loudness measurements seem to be a good approximation to dynamics (see e.g. [Nak87], [Ger95], Chap. 4).
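As a simple illustration of a level measurement over time, the following sketch computes a block-wise root mean square level in decibels relative to full scale. It is only one possible intensity-based measure; the block size, hop size, and the lower level limit are assumptions chosen for the example, and psycho-acoustic loudness models are considerably more involved.

```python
import math


def rms_level_db(samples, block_size, hop_size, floor_db=-100.0):
    """Return one RMS level in dB (re full scale 1.0) per analysis block.

    samples: sequence of floats in the range [-1.0, 1.0].
    floor_db: assumed lower limit, used to avoid log(0) for silent blocks.
    """
    levels = []
    for start in range(0, len(samples) - block_size + 1, hop_size):
        block = samples[start:start + block_size]
        mean_square = sum(x * x for x in block) / block_size
        if mean_square <= 0.0:
            levels.append(floor_db)
        else:
            # 10*log10 of the mean square equals 20*log10 of the RMS value.
            levels.append(max(floor_db, 10.0 * math.log10(mean_square)))
    return levels


# A full-scale square wave followed by silence.
signal = [1.0, -1.0] * 4 + [0.0] * 8
levels = rms_level_db(signal, block_size=8, hop_size=8)
```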

Pitch-related performance parameters such as vibrato and intonation can be directly analyzed by extracting a fundamental frequency or pitch curve from the audio signal. Due to technological restrictions of current analysis systems for polyphonic music, this usually has been limited to monophonic input signals.

The analysis of timbre deviations in performances is probably one of the least-researched parameters in MPA. This may be on the one hand due to the multidimensional nature of timbre (compare Chap. 5), on the other hand because it is assumed to be of least importance and partly of high correlation with dynamics.

2.4 Research Results

2.4.1 Performance

Many studies focus on a rather descriptive approach to performance analysis by just analyzing extracted data such as the tempo curve ([Har32], [Sea38], [Pov77], [Sha84], [Pal89], [Rep90], [Rep92], [Rep98]) or the loudness/intensity curve ([Sea38], [Rep96d], [Rep99a], [Shi01]) to identify attributes of the extracted parameters between different performances and performers.

The relation of musical structure (melodic, metric, rhythmic, harmonic, etc.) or the musical gestalt to tempo and loudness deviations has been intensely researched ([Har32], [Sha84], [Slo85], [DH93], [Rep96d], [Kru96], [Pal97], [Rep99a], [LKSW00], [TADH00], [Hon06], [WDPB06]). Most authors agree on the close relationship between musical structure such as musical phrases or accents and performance deviations, mainly in tempo and timing. In particular, larger tempo changes seem to be most common at phrase boundaries. There is a general tendency to apply ritardandi or note lengthening at the end of a phrase and at moments of musical tension ([Pal89], [Rep90], [Rep92], [Rep98]). Shifres found indications that loudness patterns are used to outline more global structural levels while rubato patterns were mostly used for local structural information in his test set [Shi01]. Some of these systematic deviations, both in timing and dynamics, are apparently applied, although less prominently, even if the performer is asked to deliver a “mechanical” rendition (that is, with constant tempo and dynamics) of the musical piece (see [Sea38], [Pal89], [KC90]).

Repp found a coupling of timing and dynamic patterns [Rep96d], but in a later study, he only found weak relationships between timing and dynamics [Rep99a].

Desain et al. and Repp report on the influence of overall tempo on expressive timing strategies ([DH94], [Rep95]). They find that the concept of relational invariance cannot be simply applied to expressive timing at different tempi, a result similar to that of Windsor et al. [WAD+01], who analyzed tempo-dependent grace note timing. The overall tempo might also influence overall loudness [DP04], an effect that the authors link to the increasing amplitude of pianists’ vertical finger movements toward higher tempi.

Goebl [GD01] investigated the relationship of the composer’s tempo indications (andante, allegro, etc.) with the “real” tempo and was not able to separate different tempo classes sufficiently with the tempo extracted from the performance. The number of note events per minute, however, seemed to be easier to map to the tempo indications.

Studies on the timing of pedaling in piano performance can be found in [Rep96b], [Rep97c]. The observations seem to be hard to generalize, but a relationship between pedal timing and overall tempo can be identified.

The articulation, or the amount of key (non-)overlap, has been studied (in the context of keyboard instruments) in [Har32], [Pal89], [Rep97a], [Bre00] and [Jer03a]/ [Jer03b]/ [Jer04]. In summary, key overlap times for legato articulation seem to decrease with increasing Inter-Onset-Intervals (IOIs).
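To make the notion of key overlap concrete, the following sketch derives overlap times from note onset and offset times as they would be available in MIDI data. The note representation as (onset, offset) pairs and both function names are assumptions for this illustration; the cited studies may define and normalize overlap differently.

```python
def key_overlap_times(notes):
    """Return the overlap in seconds between each note and its successor.

    notes: list of (onset_s, offset_s) tuples of one voice, sorted by onset.
    Positive values indicate overlapping keys (legato), negative values a gap.
    """
    return [cur_off - next_on
            for (_, cur_off), (next_on, _) in zip(notes, notes[1:])]


def relative_overlaps(notes):
    """Return each overlap time divided by the corresponding inter-onset interval."""
    overlaps = key_overlap_times(notes)
    iois = [nxt[0] - cur[0] for cur, nxt in zip(notes, notes[1:])]
    return [ov / ioi for ov, ioi in zip(overlaps, iois)]


# First transition is played legato, the second one detached.
melody = [(0.0, 0.55), (0.5, 0.9), (1.0, 1.5)]
overlaps = key_overlap_times(melody)
relative = relative_overlaps(melody)
```

Relating the overlap to the IOI makes values comparable across tempi, which is useful when overlap is studied as a function of IOI.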

The accuracy of timing synchronization of two or more performers has been studied in [Ras79] and [Sha84], with the result that performers are highly capable of synchronizing onset times even when modulating the tempo over time. Other publications deal with the timing synchronicity between both hands or between the melody and the accompaniment in piano music [Har32], [Sha84]. In many cases of piano performance, a lead of the melody before accompanying voices can be observed [Pal89], but whether this represents a performance concept or a consequence of the higher velocity of the melody tones is a subject of discussion ([Rep96a], [Goe01]).

The evaluation of the consistency of repeated performances of the same performers has shown their ability to reproduce a rendition quite exactly in terms of timing ([Sea38], [Sha84]), dynamics ([Rep96d]), and pedal timing ([Rep96b]). This seems to be the case for performances spaced by several years as well ([Rep99a], [Hon06]). Measuring only the overall movement durations of several performances of the same ensemble over several years, Clynes found very stable overall tempi [CW86].

Performance data from student and professional performances has been compared in [Pal89] and [Rep97b]. While individual differences tended to be more pronounced among the professionals, both groups seemed to share the same general performance concepts.

Statistical and machine learning approaches have been tested to use the extracted tempo and loudness information for the purpose of classification, structuring the data or extracting general rules from the data. Dovey tried to extract general as well as individual rules from two of Rachmaninov’s piano roll recordings by using Inductive Logic Programming [Dov95]. Supervised learners can be used to assign representations of the extracted performance data to the corresponding artists with promising results ([Sta01], [ZW03]/ [WZ04], [Dil04]). Other machine learning methods have been used to identify general performance rules ([Wid95a], [Wid95b], [Wid98a], [Wid98b], [Wid02], [WT03]) and to determine individual differences between artists [Wid98b].

Repp [Rep98], [Rep99a] investigated the (statistical) relationships between the extracted performance data and sociocultural variables such as the artists’ gender, nationality, year of birth and recording date but, although some significant correlations could be found, pointed out that these results should be regarded with caution and that individual differences are likely to outweigh any sociocultural correlations.

Walker showed that instrumental timbre may influence several performance parameters such as timing, articulation, and dynamics [Wal04].

The analysis of vocal performances focuses frequently on the evaluation of vibrato rates and depth and the change or stability of pitch over time ([Sea38], [Scp0], [Rap07], [Bow06]) or other intonation characteristics of the performance ([FJP03], [Orn07]). Fletcher analyzed the vibrato (and other acoustical features) of flute players [Fle75].

2.4.2 Performer

While the publications listed above deal mainly with the analysis of the performance itself, the second area of musical performance analysis tries to determine the capabilities, goals, and characteristics of performers.

For example, Repp analyzed the kind of errors (i.e. pitch deviations from the score) pianists make during a performance [Rep96c] and examined whether and how severely they were perceived by listeners, coming to the conclusion that the errors were concentrated in less important parts of the score in which they were harder to recognize.

The relationship between the performers’ intentions and the parameters extracted from performances has been studied in various ways. Palmer found good correspondence between notated intentions with respect to melody and phrasing and the extracted timing parameters [Pal89]. Also, systematic relationships between intended emotionality of the performance and the performance data (that is, representations of loudness and timing) can be detected ([Jus00], [Dil01]/ [Dil03]/ [Dil04]).

Other studies investigate the importance of the feedback of the music instrument to the performer (see e.g. [Slo82]); there have been studies that report on the effect of deprivation of auditory feedback ([Rep99d], [Wöl04]), investigated the performers’ reaction to delayed or changed auditory feedback ([PP02], [FP03], [Pfo05]) or evaluated the role of tactile feedback in a piano performance [GP08].

Publications on the nature of memorization and learning of a musical piece (or its performance) tried to identify differences between novice and expert performers [DP00], to learn more on the nature of performance memory itself ([Pal00], [MP03], [Pal06]), and to find out more on the relation between a real and a virtual, imagined performance [Wöl04].

2.4.3 Recipient

It is the listener of a music performance who ultimately consumes, interprets and probably judges a music performance. Overall judgment ratings of performance data have been evaluated in various studies. In an early publication, Repp reported some significant relations of ratings to measured timing patterns [Rep90], while in a later study he had to conclude that “the aesthetic impression of the original recordings rested primarily on aspects other than those measured (such as texture, tone, or aspects of timing and dynamics (...))” [Rep99b]. Timmers did a similarity rating experiment and concluded that performances are judged in other ways than those generally used to represent performance data [Tim05]. In [Tim01], she let listeners rate the goodness of fit of two succeeding parts of different performance pairs. Kendall investigated the communication of three levels of expressiveness: without expression, with appropriate expression, and with exaggerated expression [KC90]. Listeners were in many cases able to identify these three levels. Thompson et al. investigated the variation of listener ratings for a performance over time and found that the listening time to reach a decision was typically in the short range of 15 to 20 s [TWV07].

The difficulties of studying emotional affection of the listener of a music performance are discussed by Scherer [Sch03a], who criticizes “the tendency to assume that music evokes ’basic’ or ’fundamental’ emotions” such as anger, fear, etc. Despite such difficulties in approach and methodology, many attempts have been made to investigate the relationship between emotional affections and objective performance data. For example, Juslin detected relationships between moods and tempo and loudness cues [Jus00], and Kantor reported indications of associations of such cues and emotional reactivity [Kan06]. Similar conclusions have been drawn in [SL01] and [Sch04] from studying the time-varying emotional valence or the arousal and its relationship with performance data. Timmers found strong correlations between the dynamics and listeners’ judgments of emotionality [TMCV06] and very good communication of emotional activity between performer and listener [Tim07a]. In another study, she examined the influence of recording age and reproduction quality, observing that judgments of age and quality changed strongly with the recording date, in contrast to the perceived emotions, which were mostly independent of the recording date; the communication of emotional valence tended to be more restrained for old recordings [Tim07b]. Husain varied the tempo and the mode (major, minor) of a performance and found indications that tempo modifications had an effect on arousal and mode modifications on mood [HTS02]. Krumhansl evaluated the influence of timing and loudness variations on judgments of musical tension and found a close relationship of musical structure with both the listeners’ musical tension rating and the performance data [Kru96].

The tempo perception of a music performance has been studied by Dixon, who found listeners to prefer smoothed beat sequences over the performed ones [DGC06]. Lapidaki investigated how the preferred tempo of a musical piece depends on the initial tempo of a performance [Lap00]; a general dependency was found, but there was also a group of listeners who arrived at very consistent tempo preferences. Repp found systematic deviations between the tapping of listeners and the metronomical time of music events, a result that seems to correspond well with the performers’ inability to render a performance mechanically [Rep99c]. Aarden reported dependencies between tempo and “melodic expectancy” [Aar06].

Of course, there are many more research angles from which music performance can be studied. For example, the impact of visual performance cues on judgments of tension and musical phrasing is studied in [TCV03], [DP06], and the brain activation of listeners of music performances is measured in [NLSK02]. Furthermore, the design of computational models of music performances is a closely related topic of research. Most prominent is the KTH Model or the KTH rule system developed at KTH over the last 30 years (compare [SAF83], [FBS06]). Other models have been published by Todd (e.g. [Tod92], [Tod95]) and Mazzola et al. (e.g. [MZ94]). More recently, Widmer et al. proposed an automatically trained model for music performance [WT03].

2.5 Software Systems for Performance Analysis

The number of complete software systems dedicated to music performance analysis is limited. In most cases, research focuses on the extraction of single performance parameters.

POCO [Hon90] is a software system for the analysis and automatic generation of music performances. It is a comparatively old system that is still frequently used by a group of researchers at the Music Cognition Group in Amsterdam. It seems to have rather comprehensive analysis functions but is restricted to MIDI (or other symbolic) input data.

An early approach to extract performance data from audio signals while utilizing a MIDI representation of the score was proposed by Scheirer [Sch95]. Scheirer, targeting the analysis of piano performances, used filter bank outputs combined with an onset detection algorithm to extract timing and velocity data. The system has apparently not been used for performance analysis in later publications.

The work at the ÖFAI (Austrian Research Institute for Artificial Intelligence) by Widmer, Goebl, Dixon et al. (see selected publications above) has introduced a variety of tools to extract performance data from audio signals that in combination probably come closest to a complete state-of-the-art system for music performance analysis. Some of their individual tools are available online, but they remain individual components for performance analysis rather than an integrated system.

Dillon [Dil04] presented a software system for music performance analysis that does work on audio input, but mainly targets subjective aspects of performance analysis such as the recognition of “expressive intentions” and the “detection of arousal”. Since the audio processing is considered to be only one small part of a bigger system, it is relatively simple; it aims at monophonic input sources and is therefore probably not well suited to the analysis of polyphonic audio input.

3 Tempo Extraction

Tempo and timing are among the most important performance parameters. Musical tempo is usually given in the unit beats per minute (BPM), and can be defined as the rate at which beats, i.e. perceived pulses with equal duration units, occur [DH93]. From a more score-based point of view, two definitions of the beat duration are common, either as the denominator of the time signature of the musical score or simply as the length of a quarter note. Different representations of tempo are of interest in performance analysis, e.g. the overall tempo, the tempo variation over time, and its micro structure.

The measure of overall tempo is not in every case as simple as one would imagine at first glance: by dividing the overall number of beats by the length in minutes one obtains a proper estimate of the mean tempo, but the result does not necessarily match the perceived tempo a listener would indicate; there is a difference between the mean tempo and the perceived tempo. Gabrielsson [Gab99] distinguishes between the mean tempo and the main tempo, the latter being a measure with slow beginnings or final ritardandi removed. Repp [Rep94] found good correlation of the mean value of a logarithmic Inter-Onset-Interval distribution with the perceived tempo. Goebl [GD01] proposes a mode tempo that is computed by sweeping a window over the histogram displaying the occurrence of inter-beat intervals and selecting the maximum position as mode tempo. In most cases, the result should be similar to the position of the histogram maximum.
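The difference between a mean tempo and a mode-like tempo can be illustrated with a small sketch. The sliding-window procedure below only loosely follows the idea of sweeping a window over the inter-beat interval distribution; the window width, the step size, and the function names are assumptions for this example and do not reproduce the exact method of [GD01].

```python
def mean_tempo_bpm(beat_times_s):
    """Mean tempo: number of beat intervals divided by the duration in minutes."""
    n_intervals = len(beat_times_s) - 1
    duration_min = (beat_times_s[-1] - beat_times_s[0]) / 60.0
    return n_intervals / duration_min


def mode_tempo_bpm(beat_times_s, window_s=0.05, step_s=0.005):
    """Tempo at the densest region of the inter-beat interval distribution.

    window_s and step_s are assumed values chosen for illustration.
    """
    ibis = [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]
    lowest, highest = min(ibis), max(ibis)
    best_center, best_count = lowest, -1
    n_steps = int((highest - lowest) / step_s) + 1
    for k in range(n_steps):
        center = lowest + k * step_s
        # Count the intervals that fall into the window around this position.
        count = sum(1 for x in ibis if abs(x - center) <= window_s / 2.0)
        if count > best_count:  # the first maximum is kept on ties
            best_center, best_count = center, count
    return 60.0 / best_center


# Five beats at 120 BPM followed by a strong final ritardando.
beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.5, 5.0]
mean_bpm = mean_tempo_bpm(beats)
mode_bpm = mode_tempo_bpm(beats)
```

In this constructed example the ritardando pulls the mean tempo well below the tempo of the steady passage, while the mode-like estimate stays at the tempo that occurs most often.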

The tempo variation over time, or the local tempo, can be extracted by identifying the event time $t_b$ of every beat and calculating the local tempo between beats $i$ and $i+1$ (with times in seconds and the result in BPM) by

$$\mathrm{BPM}_{\mathrm{local}}(i) = \frac{60}{t_b(i+1) - t_b(i)}$$

Alternatively, the time $t_o$ of every event may be extracted, regardless of whether it is a beat or not, to calculate the local tempo between two events. In this case, the distance in beats $\Delta\tau_{i,i+1}$ between two events has to be known to calculate the correct micro tempo (with times in seconds and the result in BPM):

$$\mathrm{BPM}_{\mathrm{micro}}(i) = \frac{60 \cdot \Delta\tau_{i,i+1}}{t_o(i+1) - t_o(i)}$$

The latter has the advantage of not being restricted to the beat resolution and thus revealing the tempo micro-structure.
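Both tempo representations can be sketched in a few lines, assuming that tempo is expressed in BPM and all times are given in seconds. The function names and the input format are hypothetical and serve only to illustrate the two definitions.

```python
def local_tempo_bpm(beat_times_s):
    """Local tempo between consecutive beats: 60 divided by the inter-beat interval."""
    return [60.0 / (t1 - t0) for t0, t1 in zip(beat_times_s, beat_times_s[1:])]


def micro_tempo_bpm(event_times_s, score_positions_beats):
    """Tempo between consecutive events, which need not lie on a beat.

    score_positions_beats gives the score position of every event in beats,
    so that the distance in beats between two events is known.
    """
    tempi = []
    for k in range(len(event_times_s) - 1):
        delta_tau = score_positions_beats[k + 1] - score_positions_beats[k]
        delta_t = event_times_s[k + 1] - event_times_s[k]
        tempi.append(60.0 * delta_tau / delta_t)
    return tempi


# Beat-level resolution: the last beat interval is twice as long.
beat_level = local_tempo_bpm([0.0, 0.5, 1.0, 2.0])

# Event-level resolution: two eighth notes followed by a quarter note.
event_level = micro_tempo_bpm([0.0, 0.25, 0.5, 1.0], [0.0, 0.5, 1.0, 2.0])
```

With event-level input, timing variations between two beats (here the two eighth notes) each yield their own tempo value, which is not possible at beat resolution.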

There have been many publications dealing with the automatic extraction of tempo from a digital audio signal. One group of common approaches can be summarized under the term “tempo tracking systems” or “beat tracking systems”.

Scheirer [Sch98] presented a tempo extraction system using a bank of resonance filters that process envelope differences. This has the advantage of not requiring a dedicated onset detection, but leads to a quantized tempo histogram. A similar approach is used by Klapuri [Kla03].

In contrast to these systems, a frequently attempted approach is to extract the onset times of all musical events in a first processing stage, followed by an adaptive beat tracking engine to extract the tempo and the beat locations with the information provided by the series of onsets. Examples can be found in publications of Goto ([GM95], [Got01]), Dixon [Dix99] and Meudic [Meu02]. In more recent publications, Laroche [Lar03] as well as Peeters [Pee05] use dynamic programming techniques to determine the tempo curve from the onset locations.

All of these approaches have in common that they are “blind” in the sense that they do not have and do not require information on the analyzed audio material such as the overall number of beats. In the context of this work, this is not optimal because:

- these systems usually do not react very well to sudden tempo changes
- these systems usually try to find the best match between the resulting beats (or beat grid) and the extracted onset times, which may not be a correct assumption for frequent syncopations or rests
- additional information in form of MIDI (score) files is available and could easily be utilized to improve the accuracy of the results.

Hence, we are interested in an algorithm for the automatic synchronization of audio data with MIDI (or, more generally, score) data that associates each symbolic (score) event with its actual time of occurrence in the audio signal. Such approaches are usually called Performance-to-Score-Matching systems.

3.1 Performance to Score Matching

Performance-to-Score-Matching systems can be differentiated by their capabilities of real-time matching. Real-time systems are usually called Score Following systems, and non-real-time (or offline) implementations are referred to as Audio-to-Score Alignment or Audio-Score Synchronization systems.

Possible applications of such alignment systems could be (compare e.g. [SRS03]):

- linking notation and performance in applications for musicologists so that they can work on a symbolic notation while listening to a real performance
- using the alignment score as a distance measure for finding the best matching document from a database
- musicological comparison of different performances
- construction of a new score describing a selected performance by adding information such as dynamics, mix information, or lyrics
- performance segmentation into note samples automatically labeled and indexed in order to build a unit database
- musical tutoring or coaching where the timing of a recorded performance is compared to a reference performance

3.1.1 Score Following

Historically, the research on matching a pre-defined score automatically with a performance goes back to the year 1984. At that time, Dannenberg [Dan84] and Vercoe [Ver84] independently presented systems for the automatic computer-based accompaniment of a monophonic input source in real-time.

In the following years, Dannenberg and Bloch ([BD85], [DM88]) enhanced Dannenberg’s system by allowing polyphonic input sources, increasing its robustness against musical ornaments, and using multiple agent systems. Vercoe [VP85] focused on the implementation of learning from the real performance to improve the score follower’s accuracy.

Baird et al. ([BBZ90], [BBZ93]) proposed a score following system with MIDI input (for the performance) that is based on the concept of musical segments as opposed to single musical events; the tracking algorithm itself is not described in detail.

Heijink [Hei96] and Desain et al. [DHH97] presented a score following system that takes into account structural information as well. It uses a combination of strict pitch matching between performance and score and dynamic programming.

While many of the previously presented publications focus on the score following part rather than on the audio processing itself, Puckette and Lippe ([PL92], [Puc95]) worked on systems with audio-only input for monophonic signals such as clarinet, flute, or vocals.

Vantomme [Van95] presented a monophonic score following system that uses temporal patterns from the performer as its primary information. From a local tempo estimate he predicts the next event’s onset time and detects if the expected onset time matches the measured onset time within a tolerance. In the case of an ’emergency’, he falls back to the use of pitch information.
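The prediction-and-tolerance idea described above can be illustrated with a minimal sketch. It is not Vantomme's implementation: the function names, the use of beats as the score time unit, and the 50 ms default tolerance are assumptions chosen purely for illustration.

```python
def predict_next_onset(last_onset_s, score_ioi_beats, local_tempo_bpm):
    """Predict the next onset time in seconds.

    last_onset_s    : time of the last matched onset (seconds)
    score_ioi_beats : inter-onset interval to the next score event (beats)
    local_tempo_bpm : local tempo estimate (beats per minute)
    """
    seconds_per_beat = 60.0 / local_tempo_bpm
    return last_onset_s + score_ioi_beats * seconds_per_beat


def onset_matches(predicted_s, measured_s, tolerance_s=0.05):
    """Return True if the measured onset lies within the tolerance window.

    The tolerance value is a hypothetical default, not taken from [Van95].
    """
    return abs(measured_s - predicted_s) <= tolerance_s
```

If `onset_matches` returns False, a follower of this kind would have to fall back to other information, for example pitch, as described above.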

Grubb and Dannenberg ([GD97], [GD98]) proposed, in the context of a monophonic vocal performance, a system that uses fundamental frequency, spectral features and amplitude changes as extracted features for the tracking process to enhance the system’s robustness. The estimated score position is calculated based on a probability density function conditioned on the distance computed from the previous score event, from the current observation, and from a local tempo estimate.

Raphael published several approaches that make use of probabilistic modeling and machine learning, incorporating Markov Models ([Rap99], [Rap01], [Rap04]).

Cano et al. [CLB99] presented a real-time score following system for monophonic signals based on a Hidden Markov Model (HMM). They used the features zero crossings, energy and its derivative, and three features based on fundamental frequency.

Orio et al. ([OD01], [OLS03]) introduced a score-following system for polyphonic music that utilizes a two-level HMM: the upper level models each event as a state, and the lower level models the signal with attack, sustain, and rest phases. They use a so-called Peak Structure Distance (PSD) that represents the energy sum of band-pass filter outputs, with the filters centered around the harmonic series of the pitch of the score event under consideration.

Cont [Con06] presented a polyphonic score following system using hierarchical HMMs that uses learned pitch templates for multiple fundamental frequency matching.

3.1.2 Audio to Score Alignment

The publications presented above deal with score-following as a real-time application. The following publications deal with the related topic of non-real-time audio to score alignment.

The importance of reliable pattern matching methods has already been recognized in early publications on score following and alignment; in most cases dynamic programming approaches have been used, see for example Dannenberg’s publications on score-following mentioned above, and Large [Lar93].

Orio and Schwarz [OS01] presented an alignment algorithm for polyphonic music based on dynamic time warping that uses a combination of local distances (similarity measures). It uses the PSD [OD01], a Delta of PSD (∆PSD) that models a kind of onset probability, and a Silence Model for low energy frames.
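Since dynamic time warping recurs in most of the alignment systems reviewed here, a generic textbook sketch may help. It assumes a precomputed local distance matrix (rows: audio frames, columns: score events) and the three standard step directions; it is not the specific variant of [OS01], and the function name `dtw_path` is hypothetical.

```python
def dtw_path(dist):
    """Align two sequences given their local distance matrix.

    dist[i][j] is the local distance between audio frame i and score
    event j. Returns (total_cost, path), where path is a list of (i, j)
    index pairs from (0, 0) to the last frame and the last event.
    """
    n, m = len(dist), len(dist[0])
    inf = float("inf")

    # Accumulated cost matrix.
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = dist[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,                 # vertical step
                acc[i][j - 1] if j > 0 else inf,                 # horizontal step
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # diagonal step
            )
            acc[i][j] = dist[i][j] + best

    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

In the systems discussed in this section, the local distance would be derived from audio and score features (for example the PSD); here any non-negative matrix can be passed in.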

Meron and Hirose [MH01] proposed a similar approach with audio features that are relatively simple to compute and added a post-processing step after the dynamic time warping to refine the alignment.

Arifi et al. ([Ari02], [ACKM04]) proposed a system that attempts to extract multiple pitches segmented into onsets and uses dynamic programming to align MIDI data to the extracted data. The algorithm has been tuned for polyphonic piano music.

Turetsky and Ellis [TE03] avoided the problems of calculating a spectral similarity measure between symbolic and audio representation by generating an audio file from the (reference) MIDI data and aligning the two audio sequences. For the alignment, a dynamic programming approach is used as well.

Similarly, Dannenberg and Hu ([DH03], [HDT03]) generated an audio file from the MIDI file to align two audio sequences. They calculate the distance measure based on 12-dimensional pitch chromagrams (each element representing an octave-independent pitch class). The alignment path is then calculated by a dynamic programming approach.
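To make the notion of a 12-dimensional, octave-independent pitch chroma concrete, the following sketch folds MIDI note numbers into pitch classes and compares two chroma vectors. It works on symbolic pitches rather than on audio, and the cosine distance is an assumption chosen for illustration; neither is claimed to be the feature extraction or distance used in [DH03] or [HDT03].

```python
import math


def chroma_from_midi(pitches):
    """Fold MIDI note numbers into a 12-element pitch class histogram."""
    chroma = [0.0] * 12
    for p in pitches:
        chroma[p % 12] += 1.0  # pitch class 0 = C, 1 = C sharp, ...
    return chroma


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because of the octave folding, a C major triad yields the same chroma vector in any octave, so the distance between two such triads is (numerically) zero.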

Shalev-Shwartz et al. [SSKS04] presented a non-real-time system for audio to score alignment that uses dynamic programming but additionally provides a training stage. Here, they derived a confidence measure from audio and MIDI similarity data and trained a weight vector for these features to optimize the alignment accuracy over the training set. The audio feature set contains simple pitch-style features extracted by band-pass filtering, derivatives in spectral bands to measure onset probability, and a time deviation from the local tempo estimate.

The alignment system of Müller et al. [MKR04] is also based on dynamic programming. It is targeted at piano music, but they claim genre-independence. For the pitch feature extraction, they used a (zero-phase) filter-bank based approach, with each band pass’ center frequency located at a pitch of the equal-tempered scale; the filter outputs are used to extract onset times per pitch.

Dixon and Widmer [DW05] presented an audio-to-audio alignment tool for polyphonic music that works in pseudo-real-time with a modified dynamic programming algorithm. As the similarity measure, a Spectral Flux grouped into semi-tone bands is used.


1 Royal Institute of Technology, Sweden



Software-Based Extraction of Objective Parameters from Music Performances
Technical University of Berlin
Dr. Alexander Lerch (Author), 2008, Software-Based Extraction of Objective Parameters from Music Performances, Munich, GRIN Verlag,

