Advances in computational linguistics and statistics have had a clear impact on the emergence of corpus linguistics and on the sophistication of its applications, which now extend beyond purely linguistic questions to real-life problems. One such area is authorship attribution.
Authorship attribution is a field of study concerned with identifying the most likely author of an anonymous or disputed document from a set of suspected authors. To this end, a range of methodologies, techniques, and approaches has been devised and repeatedly tested on various data sets to verify their effectiveness. Although the literature shows no consensus on which methodology is best, all authorship attribution studies rest on the assumption that each author has a distinctive "linguistic fingerprint" that can be captured by detecting and measuring the linguistic clues embedded in their style.
Adopting an experimental framework, this study gauges the discriminating and clustering power of the selected methodology on a specific type of data: samples of political journal articles. The compiled corpus is a special-purpose one, strictly controlled for genre, register, and date of publication. It comprises eleven samples extracted from eleven articles, ranging in length from 1,101 to 1,113 words; three are treated as test (hypothetically questioned) samples and the remaining eight as training samples. The corpus represents the journalistic writing of four authors.
Table of Contents
Introduction
CHAPTER 1
1.1 Computational Linguistics and Corpus Linguistics
1.2 Corpus-Based vs. Corpus-Driven Studies
1.3 A Historical Background of Corpus Linguistics
1.4 Types of Corpora
1.4.1 General Reference vs. Special Purpose Corpus
1.4.2 Written vs. Spoken Corpus
1.4.3 Monolingual vs. Multilingual Corpus
1.4.4 Synchronic vs. Diachronic Corpus
1.4.5 Open vs. Closed Corpus
1.4.6 Learner Corpus
1.4.7 Online Corpus/ Web as a Corpus
1.5 Methods in Corpus Linguistics
1.5.1 Concordance
1.5.2 Frequency / WordLists
1.5.3 Keyword Lists
1.5.4 Collocate Lists
1.5.5 Dispersion Plots
CHAPTER 2
2.1 Introduction
2.2 Areas of Forensic Linguistics and Corpus Linguistics
2.2.1 Qualitative and Quantitative Analysis in Forensic Linguistics
2.2.2 Authorship Attribution
CHAPTER 3
3.1 Introduction
3.2 Text Corpus
3.2.1 Text Genre and Time of Publication
3.2.2 Sampling Methodology
3.2.3 Length of Text Samples
3.3 The Research Methodology
3.3.1 The Stylistics Method
3.3.2 The Computational Method
3.3.3 The Statistical Method/ SPSS (Version 19)
3.3.4 Authorship Attribution Approach
CHAPTER 4
4.1 Qualitative Analysis
4.2 Quantitative Analysis
4.2.1 Wordsmith Tools and Excel Program
4.2.2 SPSS (19)
CHAPTER 5
5.1 Conclusions
5.2 Recommendations
5.3 Suggestions for Further Research
Objectives and Topics
This work aims to evaluate the viability and efficacy of an integrated, non-traditional methodology for authorship attribution by applying linguistic, computational, and statistical principles to a controlled corpus of political journalism articles.
- Application of corpus linguistics to forensic authorship attribution.
- Comparative analysis of qualitative and quantitative methodologies.
- Evaluation of function words as effective style markers.
- Utilization of computational tools and statistical techniques (SPSS/PCA/Cluster Analysis) in forensic investigations.
Excerpt from the Book
2.2.2.4.3 Machine Learning Technique
The difficulty with the two aforementioned techniques is the substantial amount of data they require for both the known and the unknown authorship texts (Grant, 2008: 226). By contrast, machine learning techniques are better suited to analyzing rather short texts such as e-mails, threatening letters, and suicide notes. Another advantage is that they can consider a large number of potentially relevant features without loss of accuracy even if most of those features later prove irrelevant (Koppel et al., 2009: 11). Their shortcoming, however, is that they cannot provide linguistic explanations for the computationally obtained results in authorship cases (Grant, 2008: 226).
A machine learning technique begins with the documents of known authors (the training corpus) and constructs a classification algorithm that identifies and counts the relative frequencies of features in these documents and that successfully discriminates between the known authors' documents (Koppel et al., 2013: 319). The anonymous document is then attributed to the most likely author on the basis of the resulting algorithm (ibid.). Machine learning techniques include support vector machines, neural networks, and decision trees (Abbasi and Chen, 2005: 68).
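The train-then-attribute workflow described above can be illustrated with a minimal Python sketch. It is not the study's own procedure, and it does not use any of the classifiers named in the excerpt: a simple nearest-centroid rule stands in for them so that the example stays short. The function-word list, the function names, and the toy texts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math
import re

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "but", "on"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def centroid(vectors):
    """Average several frequency profiles component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(known_texts):
    """Build one average profile per author from the training corpus.

    known_texts maps an author label to a list of that author's texts.
    """
    return {
        author: centroid([profile(text) for text in texts])
        for author, texts in known_texts.items()
    }


def attribute(model, questioned_text):
    """Pick the author whose average profile is closest (Euclidean distance)."""
    questioned = profile(questioned_text)
    return min(model, key=lambda author: math.dist(model[author], questioned))


# Invented toy training corpus: two "authors" with different habits.
known = {
    "A": [
        "the cat sat on the mat and the dog sat on the rug",
        "the rain fell on the roof and the wind blew on the door",
    ],
    "B": [
        "he wanted to go to town but had to stay",
        "she tried to call but chose to wait",
    ],
}
model = train(known)
guess = attribute(model, "they hoped to win but failed to score")
```

In this sketch the questioned text is assigned to whichever known author has the most similar function-word profile. Real studies would use far longer samples, a much larger word list, and a validated classifier.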
Summary of Chapters
Introduction: Provides an overview of the scope and purpose of authorship attribution in modern applied linguistics and introduces the study's experimental framework.
CHAPTER 1: Examines the theoretical foundations of corpus linguistics, including its definitions, history, various types of corpora, and essential research methods.
CHAPTER 2: Discusses the intersection of forensic linguistics and authorship attribution, detailing qualitative vs. quantitative approaches and historical developments in feature identification.
CHAPTER 3: Outlines the research methodology, describing the specific text corpus, sampling techniques, and the integration of computational and statistical tools.
CHAPTER 4: Presents the qualitative and quantitative findings, including scatterplots and cluster analyses used to evaluate authorial styles.
CHAPTER 5: Summarizes the study’s conclusions, provides recommendations for future practice, and suggests directions for further research.
Keywords
Forensic Linguistics, Authorship Attribution, Corpus Linguistics, Stylistics, Quantitative Analysis, Qualitative Analysis, Function Words, Principal Component Analysis, Cluster Analysis, SPSS, WordSmith Tools, Idiolect, Linguistic Fingerprint, Style Markers, Political Journalism.
Frequently Asked Questions
What is the core focus of this research?
This research investigates the effectiveness of using a corpus-based approach, specifically focusing on function words as style markers, to identify authors of disputed texts within political journalism.
What are the primary themes covered?
The work explores the methodology of authorship attribution, the theory of idiolect (or linguistic fingerprint), the role of corpus design, and the application of statistical software for linguistic classification.
What is the central research question?
The central inquiry is whether corpus linguistics, when combined with statistical techniques, provides a robust and replicable methodology for addressing forensic authorship problems.
Which scientific methods are employed?
The study utilizes a mixed-methods approach, combining qualitative stylistic examination with quantitative multivariate statistical techniques, specifically Principal Component Analysis (PCA) and Cluster Analysis via SPSS.
What is covered in the main section?
The main sections detail the criteria for corpus construction, the justification for using function words as discriminators, and the specific statistical processes used to analyze and visualize authorial differences.
What are the key terms associated with this work?
Key terms include Forensic Linguistics, Authorship Attribution, Corpus-based Study, Style Markers, and Multivariate Statistical Analysis.
How is the "idiolect" theory applied in this forensic study?
The study transitions from the abstract "idealized" theory of idiolect to the practical concept of "idiolectal style," focusing on measurable and consistent linguistic habits in text.
Why are function words used as the main discriminator?
Function words are preferred because they are topic-independent, high-frequency, and occur inevitably in almost every sentence, making them stable indicators of an individual's subconscious writing style.
- Cite this work
- Khalid Shakir Hussein (Author), Eman Abdul Kareem (Author), 2017, A Corpus-Based Analysis of Using Function Words in English Forensic Authorship Attribution, Munich, GRIN Verlag, https://www.grin.com/document/385050