This thesis examines the similarity, complexity, and evolution of English song lyrics over the past five decades with the help of statistical methods. Its central research question is: Can information gained by Natural Language Processing and statistical topic modelling be used to determine whether and to what extent song lyrics of various genres changed over the course of the past 50 years?
Based on this, the thesis pursues three goals. The first is to determine how similar songs of five diverse genres (alternative, country, pop, rock, and hip-hop) are, as measured by text statistics and text features derived with Natural Language Processing (NLP) and text mining methods. The second is to use the same methods to find out whether song lyrics are becoming less complex and therefore less sophisticated. The third, and the main goal of the thesis, is to compute statistical topic models by applying Latent Dirichlet Allocation (LDA) in order to analyse how similar the topics of songs are and whether they changed over time. This is done by calculating similarity measures on the per-topic-per-word probability distributions that the LDA models output.
Table of Contents
- 1. Introduction to the Topic
- 1.1. Previous Research
- 1.2. Hypotheses and Methods
- 2. Theory
- 2.1. Music Information Retrieval
- 2.2. Natural Language Processing
- 2.2.1. Lemmatization
- 2.2.2. Part-of-Speech Tagging
- 2.3. Text Data Mining
- 2.3.1. N-grams
- 2.3.2. Term Frequency-Inverse Document Frequency
- 2.4. Topic Modeling Using Latent Dirichlet Allocation
- 2.4.1. Model Tuning
- 2.4.2. Model Evaluation
- 2.5. Similarity Measures
- 2.5.1. Jensen-Shannon Divergence
- 2.5.2. Hellinger Distance
- 2.5.3. Log Ratio
- 3. Data
- 3.1. Data Selection and Web Scraping
- 3.2. Data Pre-Processing
- 3.3. The Final Data Set
- 4. Analyses
- 4.1. Text Statistics
- 4.1.1. Comparison of Text Statistics
- 4.1.2. Comparison of Word Use
- 4.2. Text Features
- 4.2.1. Term Frequency-Inverse Document Frequency in Application
- 4.2.2. Part-of-Speech Tagging in Application
- 4.2.3. N-grams in Application
- 4.2.4. Conclusions about Text Statistics and Text Features
- 4.3. LDA Modeling
- 4.3.1. Parameter Tuning
- 4.3.2. Model Evaluation
- 4.3.3. Topic Similarity Within Models
- 4.3.4. Topic Similarity Between Models
- 4.3.5. Conclusions about LDA Modeling and Similarity Measures
- 5. Findings and Prospects
- 5.1. Findings Compared to Previous Research
- 5.2. Need for Improvement and Future Applications
Objectives and Key Themes
The main objective of this thesis is to investigate whether Natural Language Processing (NLP) and statistical topic modeling, specifically Latent Dirichlet Allocation (LDA), can reveal how song lyrics across different genres have changed over the past 50 years. The thesis aims to determine the similarity between song lyrics of various genres, explore potential decreases in lyrical complexity over time, and analyze the evolution of lyrical topics using LDA and similarity measures.
- Similarity of song lyrics across different genres.
- Changes in lyrical complexity over time.
- Evolution of lyrical topics across genres and decades.
- Application and evaluation of NLP techniques in analyzing song lyrics.
- Optimal parameter settings for LDA topic modeling.
Chapter Summaries
1. Introduction to the Topic: This chapter introduces the central research question: Can NLP and statistical topic modeling determine changes in song lyrics across genres over 50 years? It motivates this question by referencing previous research indicating a potential decline in lyrical sophistication and introduces the thesis's goals: analyzing genre similarity using NLP and text mining; assessing lyrical complexity changes; and using LDA to compare topic similarity and change across time. The chapter concludes with a brief overview of relevant previous research and a preview of the hypotheses to be tested and methods employed.
2. Theory: This chapter lays the theoretical groundwork for the methods used in the thesis. It covers Music Information Retrieval (MIR), focusing on the underutilization of lyrics data; Natural Language Processing (NLP), explaining lemmatization and part-of-speech tagging; and Text Data Mining, detailing the application of n-grams and tf-idf. The core of the chapter details Latent Dirichlet Allocation (LDA) for topic modeling, including model tuning (perplexity, log-likelihood) and evaluation. Finally, it describes the similarity measures (Jensen-Shannon Divergence, Hellinger Distance, and Log Ratio) used to compare topic distributions.
3. Data: This chapter describes the creation of a custom dataset of song lyrics. It details the selection of five genres (alternative, country, pop, rock, and hip-hop) and a time span (1970-2018), the web scraping process used to gather data from Discogs.com and Wikipedia.com, and the subsequent data pre-processing steps including tokenization, lemmatization, and stop word removal. It concludes with a summary of the final dataset, including statistics on song count, word count, and other relevant variables.
4. Analyses: This chapter presents the data analysis conducted to test the hypotheses. It begins with an examination of text statistics (song length, word length, lexical diversity, lexical density, word frequencies, log odds ratios) to explore differences and similarities between genres and across decades, as well as to assess changes in lyrical complexity. Then, it moves to analyzing text features like tf-idf, parts of speech, and n-grams (for repetition analysis). Finally, the chapter presents a comprehensive analysis of LDA modeling, detailing parameter tuning, model evaluation, and the use of similarity measures to compare topic distributions within and between models (across genres and decades).
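As a rough illustration of two of the text statistics named above, the following Python sketch computes lexical diversity and an n-gram based repetition score. The preview does not state the exact definitions used in the thesis, so both are assumptions: lexical diversity is taken to be the type-token ratio, and repetition is taken to be the share of n-gram occurrences that belong to an n-gram appearing more than once. The function names and the toy lyric are hypothetical.

```python
from collections import Counter


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by total words (assumed definition)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_share(tokens, n=2):
    """Share of n-gram occurrences whose n-gram appears more than once."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(grams)


# Hypothetical toy lyric, already tokenized by whitespace.
tokens = "la la la baby la la la".split()
print(lexical_diversity(tokens))
print(repeated_ngram_share(tokens, n=2))
```

A low type-token ratio combined with a high repeated n-gram share would point to a repetitive lyric. Both scores depend on song length, so comparisons between genres or decades would presumably need some form of length normalization.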
Frequently Asked Questions
What is the main topic of this thesis?
This thesis investigates whether Natural Language Processing (NLP) and Latent Dirichlet Allocation (LDA) topic modeling can reveal how song lyrics across different genres have changed over the past 50 years. It aims to determine the similarity between song lyrics of various genres, explore potential decreases in lyrical complexity over time, and analyze the evolution of lyrical topics.
What are the key objectives of the research?
The main objectives are to analyze the similarity of song lyrics across different genres, assess changes in lyrical complexity over time, explore the evolution of lyrical topics across genres and decades, apply and evaluate NLP techniques in analyzing song lyrics, and determine optimal parameter settings for LDA topic modeling.
What data was used in the analysis?
A custom dataset of song lyrics was created, encompassing five genres (alternative, country, pop, rock, and hip-hop) spanning from 1970 to 2018. Data was gathered through web scraping from Discogs.com and Wikipedia.com, followed by pre-processing steps such as tokenization, lemmatization, and stop word removal.
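The three pre-processing steps can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in the thesis: the stop word list and the lemma lookup table are hypothetical and deliberately tiny, and a real analysis would rely on an NLP library's lemmatizer and a full stop word list.

```python
import re

# Hypothetical, deliberately tiny stop word list and lemma lookup table.
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "me", "my", "the", "to", "you"}
LEMMAS = {
    "loved": "love",
    "loving": "love",
    "roads": "road",
    "was": "be",
    "were": "be",
}


def preprocess(lyrics):
    """Tokenize, lemmatize via table lookup, then drop stop words."""
    # Lowercase and keep runs of letters; apostrophes are allowed inside words.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", lyrics.lower())
    # Stand-in for a real lemmatizer: unknown tokens are kept unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


print(preprocess("I loved you, and the roads were loving me!"))
# prints ['love', 'road', 'be', 'love']
```

The steps run in the order listed above (tokenization, lemmatization, stop word removal). Whether stop words are removed before or after lemmatization in the thesis is not stated in the preview.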
What methods were used in the analysis?
The analysis employed several methods including: Natural Language Processing (NLP) techniques (lemmatization, part-of-speech tagging), text mining techniques (n-grams, TF-IDF), Latent Dirichlet Allocation (LDA) for topic modeling, and similarity measures (Jensen-Shannon Divergence, Hellinger Distance, Log Ratio) to compare topic distributions. Text statistics (song length, word length, lexical diversity, etc.) were also analyzed.
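To make the TF-IDF weighting concrete, here is a minimal Python sketch. It assumes the plain variant in which term frequency is the count divided by document length and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. The preview does not say which variant the thesis uses, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy "lyrics", already tokenized.
docs = [
    ["love", "truck", "road", "love"],
    ["love", "money", "street"],
    ["love", "guitar", "road"],
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "love" occurs in every document, so its weight is zero, while "truck" occurs in only one document and receives the highest weight there. This is the property that makes the measure useful for finding terms that are characteristic of one genre or decade.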
What are the key themes explored in the thesis?
Key themes include the similarity of song lyrics across genres, changes in lyrical complexity over time, the evolution of lyrical topics, the application and evaluation of NLP techniques in analyzing song lyrics, and the optimization of LDA topic modeling parameters.
What are the main findings of the study (as previewed)?
The preview does not detail specific findings, but it indicates that the analysis will compare the findings with previous research and discuss areas for improvement and future applications of the methodology.
What is the structure of the thesis?
The thesis is structured into five chapters: 1. Introduction, 2. Theory (covering MIR, NLP, Text Mining, LDA, and similarity measures), 3. Data (data selection, scraping, and pre-processing), 4. Analyses (text statistics, text features, LDA modeling), and 5. Findings and Prospects.
What specific NLP techniques are used?
The thesis utilizes lemmatization and part-of-speech tagging as core NLP techniques.
What topic modeling technique is employed?
Latent Dirichlet Allocation (LDA) is the primary topic modeling technique used.
What similarity measures are used to compare topic distributions?
The study employs Jensen-Shannon Divergence, Hellinger Distance, and Log Ratio to compare topic distributions.
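Two of these measures have standard textbook definitions that can be sketched in plain Python. The example below assumes that each topic is given as a probability vector over the same vocabulary, that is, one row of an LDA model's per-topic-per-word distribution, and it uses base-2 logarithms so that the Jensen-Shannon Divergence lies between 0 and 1. The topic vectors are hypothetical. The Log Ratio is left out because the preview does not state how it is defined.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence; bounded in [0, 1] with base-2 logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Distance bounded in [0, 1]; 0 for identical distributions."""
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / 2)


# Hypothetical word distributions of two topics over a four-word vocabulary.
topic_a = [0.5, 0.3, 0.2, 0.0]
topic_b = [0.0, 0.2, 0.3, 0.5]
print(jensen_shannon_divergence(topic_a, topic_b))
print(hellinger_distance(topic_a, topic_b))
```

Both measures are symmetric and tolerate zero probabilities, which a plain Kullback-Leibler divergence does not. Values near 0 indicate topics with nearly identical word distributions, and values near 1 indicate topics that share almost no probability mass.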
- Laura Zapf (Author), 2019, How Did English Songs Evolve? Retrieving Information from Song Lyrics Via Natural Language Processing and Statistical Topic Modeling, Munich, GRIN Verlag, https://www.grin.com/document/997210