A Corpus-Based Analysis of Using Function Words in English Forensic Authorship Attribution

A Case of Political Journalism Disputes

Case Study, 2017

124 Pages




1.1 Computational Linguistics and Corpus Linguistics
1.2 Corpus-Based VS. Corpus-Driven Studies
1.3 A Historical Background of Corpus Linguistics
1.4 Types of Corpora
1.4.1 General Reference vs. Special Purpose Corpus
1.4.2 Written vs. Spoken Corpus
1.4.3 Monolingual vs. Multilingual Corpus
1.4.4 Synchronic vs. Diachronic Corpus
1.4.5 Open vs. Closed Corpus
1.4.6 Learner Corpus
1.4.7 Online Corpus/ Web as a Corpus
1.5 Methods in Corpus Linguistics
1.5.1 Concordance
1.5.2 Frequency / WordLists
1.5.3 Keyword Lists
1.5.4 Collocate Lists
1.5.5 Dispersion Plots

2.1 Introduction
2.2 Areas of Forensic Linguistics and Corpus Linguistics
2.2.1 Qualitative and Quantitative Analysis in Forensic Linguistics
2.2.2 Authorship Attribution

3.1 Introduction
3.2 Text Corpus
3.2.1 Text Genre and Time of Publication
3.2.2 Sampling Methodology
3.2.3 Length of Text Samples
3.3 The Research Methodology
3.3.1 The Stylistics Method
3.3.2 The Computational Method
3.3.3 The Statistical Method/ SPSS (Version 19)
3.3.4 Authorship Attribution Approach

4.1 Qualitative Analysis
4.2 Quantitative Analysis
4.2.1 Wordsmith Tools and Excel Program
4.2.2 SPSS (19)

5.1 Conclusions
5.2 Recommendations
5.3 Suggestions for Further Researches



Authorship attribution, the task of assigning the most likely author to writings of unknown or disputed authorship, has long occupied specialists and scholars and continues to do so. They have set out to devise various procedures and techniques that help attribute disputed or unknown writings to their authors. Some of these procedures and techniques are concerned with linguistic issues, as in the present study, and others with non-linguistic (traditional) issues.

Since the seminal work of Mosteller and Wallace (1964), which established the new trend of non-traditional authorship studies, a wide variety of methodologies, techniques, and approaches to authorship attribution have appeared and have been continually tested on various types of data. This book is situated within this new trend of non-traditional applications of authorship attribution, which draws upon the integration of knowledge and principles from stylistics, computational linguistics, and statistics.

Indeed, this study introduces the reader to one of the controversial issues in applied linguistics: there is no consensus among practitioners as to which methodology performs better than the others or which style marker is the best discriminator for settling authorship problems. Therefore, the results of most non-traditional authorship attribution studies are taken as probabilities of varying strength rather than as absolutely decisive answers.

The theory on which authorship attribution studies are built is the theory of idiolect, or linguistic fingerprint. In its idealized form, this theory is held to be unfeasible or impractical for forensic investigations, while its practical form is more promising for such investigations (

The present study adopts an experimental framework and sets out to evaluate the behaviour of the methodology and the discriminatory and clustering power of function words against a special purpose corpus. This corpus consists of political journalism articles which are strictly controlled for genre, register, and date of publication. It represents naturally occurring texts rather than texts fabricated for experimental purposes.

The basic research questions this book addresses concern the viability and efficacy of the intended non-traditional methodology in tackling an authorship attribution case. They may be summarized as follows:

- Does corpus linguistics offer feasible potential for addressing forensic authorship attribution enquiries?
- How efficient and workable is the intended methodology when used against a particular type of corpus designed for the purpose of this study?
- What is the discriminatory power of English function words within the context of forensic authorship attribution?

All authorship attribution studies rest on the assumption that the linguistic style of each language user can be captured by a process of detection and measurement that rigorously targets the linguistic clues left by writers in their texts, and by seeking the consistent patterns and idiosyncrasies that distinguish one writer from a defined set of other candidate authors. A vast variety of linguistic clues (style markers) has been proposed for this purpose. For this study, it is hypothesized that function words, as style markers, have an efficient discriminatory power that helps considerably in capturing the linguistic style of each author, whether employed qualitatively or quantitatively.

Moreover, this book assesses how robust and rigorous the methodology is when used against a special purpose corpus representing one genre (political journalism). It explores what stylistics, computational linguistics, and statistics can offer the field of forensic authorship attribution, and investigates how corpus linguistics can be used to address a methodological issue in authorship attribution studies.

The book falls into two parts, theoretical and practical. The theoretical part gives a general overview of the literature and of what has been said and done in the relevant field from different perspectives. Chapters one, two, and three cover this part of the study. The practical part sets out the blueprint used for collecting, measuring, and analyzing the corpus of the study, and it is covered in chapters four, five, and six.

The general analytic procedure involves building a special purpose corpus for this study, taking into account the guidelines offered by corpus linguistics and authorship attribution studies, followed by selecting two samples for each of the four authors (acting as the training corpus) and a further two samples for one of those authors (acting as the test corpus). A qualitative analysis is conducted by seeking the prominent, characteristic, and outstanding style markers in the test corpus as compared with the training corpus. This qualitative analysis is complemented by a quantitative analysis using three software programs: WordSmith Tools, Excel, and SPSS. This analysis adopts the procedures offered by Hussein (2014) with a slight difference:

1) Finding the thirty most frequent function words in the master corpus when processed via WordSmith Tools (version 4.0),
2) Establishing a matrix in an Excel spreadsheet showing how frequent each function word is in each sample,
3) Conducting a thorough statistical analysis in SPSS (version 19) using two techniques (principal component analysis and cluster analysis),
4) After obtaining the result, and with the aim of exploring how rigorous this model of analysis is, adding a third test sample by one of the candidate authors to the test corpus,
5) Analyzing and interpreting the results portrayed in visual forms.
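
The first two steps of this procedure can be illustrated with a minimal Python sketch. It is not the study's own tooling, which relies on WordSmith Tools and Excel; the function-word list, the sample texts, and the names `tokenize`, `top_function_words`, and `frequency_matrix` are all hypothetical and chosen only for illustration. Frequencies are expressed per 1,000 tokens, which is an assumption made here so that samples of different lengths remain comparable.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of English function words;
# the list actually used in the study is not given in this section.
FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "it", "for",
    "on", "with", "as", "by", "but", "not", "this", "be", "at", "or",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_function_words(samples, n=30):
    """Step 1: the n most frequent function words in the pooled (master) corpus."""
    counts = Counter()
    for text in samples.values():
        counts.update(t for t in tokenize(text) if t in FUNCTION_WORDS)
    return [word for word, _ in counts.most_common(n)]

def frequency_matrix(samples, words):
    """Step 2: one row per sample, one column per word, as rates per 1,000 tokens."""
    matrix = {}
    for name, text in samples.items():
        tokens = tokenize(text)
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty samples
        matrix[name] = [1000 * counts[w] / total for w in words]
    return matrix

# Toy samples standing in for the training and test texts.
samples = {
    "author_A_1": "The vote is on the agenda, and the party is not ready for it.",
    "author_B_1": "It is a question of trust, but that is for the voters to settle.",
    "disputed_1": "The party is not ready, and the vote is on the agenda.",
}
words = top_function_words(samples, n=5)
matrix = frequency_matrix(samples, words)
```

A matrix of this shape, with one row per sample and one column per function word, is the kind of input that a statistics package would take for the principal component and cluster analyses mentioned in step 3.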

Last but not least, the value of this type of study can be stated in both academic and practical terms. From an academic perspective, this study sheds light on one of the growing areas within forensic linguistics, which has expanded especially with the increasing abundance of electronically available data. It draws attention to the interdisciplinary nature of the field and spells out how advances in other areas of knowledge can be used to solve authorship attribution problems.

From a practical perspective, it could be of value to practitioners seeking insight into how an authorship problem is handled. The accumulated knowledge and expertise gained by those practitioners may then help them handle real-life authorship problems effectively. For those interested in corpus linguistics, this study offers an opportunity to become practically acquainted with how a corpus-based study is undertaken.

Khalid Shakir Hussein

Eman Abdul Kareem

Dec. 2017


Corpus Linguistics

1.1 Computational Linguistics and Corpus Linguistics

Corpus-based studies represent one of the many applications of modern linguistics. They have grown and caught the interest of many linguists over the last two decades. This raises several questions: what motivates such studies, what premises they depend on, and what facilities make them possible. It is therefore worthwhile to consider the impact of computational linguistics on the development of corpus linguistic studies and the relationship that holds between the two.

Computational linguistics, as considered by Igore and Alexander (2004:25), is "a synonym of automatic processing of natural language", which aims at designing and developing programs capable of processing words and texts of natural language. It studies natural languages in the same way traditional linguistics does, but "using computers as a tool to model (and, sometimes, verify or falsify) fragments of linguistic theories deemed of particular interest" (Boguraev et al., 1995:1).

Computational linguistics started as a special area of artificial intelligence, growing out of approaches to machine translation developed by the American mathematician Warren Weaver in 1949 (Hirst, 2013:1). Such approaches were concerned with semantic analysis and symbolic methods of parsing (ibid). Most efforts during the period from the mid-1950s to around 1970 tended to be theory-neutral, and much attention was paid to such practical techniques as machine translation (MT) and question answering (QA) (Schubert, 2015).

During the 1970s, there was a gradual shift towards incorporating theoretical linguistic foundations and world knowledge in comprehensive, modular, and re-usable paradigms, leading to more advanced programs (ibid). With this shift, computational linguistics became part of linguistics as much as it was of computer science or artificial intelligence (John and Helen, 1998:6).

Wilks (2006:761-762) refers in this respect to Winograd's programming system, which is based on Halliday's model in processing strings of linguistic symbols, and to McCarthy and Hayes's model, which is inspired by ideas from Montague's theoretical semantics and Lakoff's generative semantics. Jurafsky and Martin (2007:11) point to Roger Schank and his associates, who developed language understanding programs that involve Fillmore's case grammar notions in their representations.

The period from the late 1980s to the end of 1995 was characterized by a radical shift, in all aspects of natural language processing, to a corpus-based statistical approach (Schubert, 2015). This empirical, data-oriented approach was underpinned by the increasing availability of machine-readable texts and by a new interest in the way linguistic items are distributed and structured (ibid). It included statistically based learning techniques, as in part-of-speech taggers and accurate speech recognizers, and it was developed to overcome difficulties encountered by computational linguistics (ibid).

The shift that computational linguistics underwent from a rationalist enterprise, inspired by theoretical linguistics and artificial intelligence, to an empirical approach based on corpora and statistics had a rapid and parallel impact on the line of research in corpus linguistics (Hirst, 2013:10-11). Advances in technology and computational linguistics have been a crucial enabling factor in the growth of corpus linguistics, and the latter has both influenced and been influenced by them (McCarthy and O'Keeffe, 2010:6). This mutual influence means that any advance in one area is reflected in the other.

Despite the affinity and overlap between computational linguistics and corpus linguistics in their reliance on corpora and computers, they remain largely different in terms of motivations and methods of analysis (ibid). The main contrast is that corpus linguistics makes use of computational methods to analyze data, while computational linguistics exploits linguistic data as a means to an end, i.e., to develop systems capable of processing various linguistic inputs (Lüdeling and Zeldes, 2008).

The facilities that computational technologies offer to linguists, at both the individual and the disciplinary level, are promising: they allow linguists to collect information more quickly and to analyze a large collection of data efficiently in a few minutes (John and Helen, 1998:2). Moreover, they are reshaping the discipline, throwing light on new areas of research, new types of data, and new tools of analysis (ibid).

In recent years, the methods employed in computational linguistics research, whether theoretical or practical, have relied heavily on theories and findings from different branches of science such as computer science, theoretical linguistics, psycholinguistics, philosophical logic, and cognitive science in general (Schubert, 2015). Furthermore, the applications of computational linguistics are diverse: text editing, information retrieval from document databases, automatic translation from one language to another, text generation from pictures and formal diagrams, speech recognition, etc. (Igore and Alexander, 2004:53-54).

1.2 Corpus-Based VS. Corpus-Driven Studies

Computational linguistics, as outlined in the previous section, is a discipline with a clear-cut identity, with theoretical and practical concerns, and with the goal of devising systems for analyzing linguistic data. The orientation of corpus linguistics, however, is conceived from different perspectives.

Some linguists view corpus linguistics as a methodology, among them McEnery and Wilson (1996:1), Bowker and Pearson (2002:9), Meyer (2004:xi), McEnery et al. (2006:7-8), and Aarts (2001:7). According to McEnery and Wilson (1996:1), corpus linguistics is the study of language based on samples of real language use and it is "a methodology rather than an aspect of language requiring explanation or description". Similarly, McEnery et al. (2006:7) note that corpus linguistics is not an independent branch of linguistics like phonetics or syntax; rather than focusing on one aspect of language, it can be used to investigate any area of linguistic concern. They (ibid:7-8) add that "As corpus linguistics is a whole system of methods and principles of how to apply corpora in language studies and teaching/learning, it certainly has a theoretical status. Yet theoretical status is not theory itself".

On the other hand, others affirm the genuine theoretical status of corpus linguistics, such as Tognini-Bonelli (2001:1), who indicates that while a methodology means the application of a previously determined set of rules and accepting facts in a certain situation, corpus linguistics can formulate "its own sets of rules and pieces of knowledge before they are applied". Along similar lines, Leech (1992:106) argues that computer corpus linguistics is not a new methodology for linguistic studies, "but a new research enterprise, and in fact a new philosophical approach to the subject", and Teubert (2005:2) identifies it as "a theoretical approach to the study of language".

Hardie and McEnery (2010:385) explain that the two opposing perspectives, corpus linguistics as a methodology versus corpus linguistics as a theory (an independent discipline), have been formally labelled "corpus-based" and "corpus-driven" respectively by Tognini-Bonelli (2001). The latter shows that the main goal of a corpus-based study is to verify, exemplify, and on certain occasions modify theories and models that were formulated before large corpora became available (2001:65f). A corpus-driven study, on the other hand, holds that the corpus is the only source of theoretical statements, which should be formulated in a way that reflects the evidence present in the corpus (ibid:84).

Taking a middle-ground position, both perspectives seem plausible and able to contribute to linguistic studies, although some linguists appear to favour the corpus-based approach, considering it more privileged and prevalent than the corpus-driven one. For example, Xiao (2008:996) states that the corpus-based approach is better suited to contributing to linguistic theory since it seeks to "strike a balance between the use of corpora [empiricism] and the use of intuitions [rationalism]", and Gries (2006:191) refers to it as "a major methodological paradigm in applied and theoretical linguistics" (cited in Gries, 2012:42).

The present study is a corpus-based linguistic study supported by computational facilities and tools. It can therefore be viewed as a methodology within the domain of applied linguistics in general and forensic linguistics in particular.

Last but not least, corpus-based descriptive studies should be distinguished from conventional descriptive fieldwork in linguistics (the mainstream of linguistic studies during the 1940s-1950s) in that the former focus not only on "what is said or written, where, when and by whom", but also on measurement, that is, the frequencies of particular linguistic items (Kennedy, 2014:9). The availability of corpora in electronic form enables linguists to undertake many types of measurement and sorting, either for purposes of linguistic description or for resolving language-related problems (ibid:11).

1.3 A Historical Background of Corpus Linguistics

What has been presented so far are the two opposing perspectives from which corpus linguistics is viewed and the resources that underpin its evolution as a domain of computer-assisted linguistic research. Its historical origins and chronological development, however, have yet to be covered.

The word "corpus" (plural: corpora) is originally a Latin word meaning "body" (Baker et al., 2006:48). In linguistics, a corpus refers to any collection of naturally occurring examples of language, spoken or written, that serves as a source of evidence upon which linguistic studies depend (Meyer, 2004:xi; Hunston, 2002:2). On this definition, two types of corpora are recognized: pre-electronic corpora, created before the computer era, and electronic corpora (Meyer, 2008:1).

The dominant practice in descriptive linguistics until the late 1950s was based on the analysis of collections of authentic non-electronic texts. This tradition was prevalent before the emergence of what has come to be known as the intuition-based approach inspired by Chomsky (Biber et al., 2010:549). During the 1940s-1950s, using corpora was familiar among American structuralists such as Harris, Fries, and Hill, for whom "a corpus of authentically occurring discourse was the thing that the linguist was meant to be studying" (Leech, 1992:105).

Early examples of corpus-based work are found in C. C. Fries' grammars of written and spoken American English (1940 and 1952 respectively). More notable is the work of Randolph Quirk (the Survey of English Usage), which focused on compiling and analyzing a large non-electronic database, recorded on file cards, representing samples of the everyday linguistic interactions, spoken and written, of ordinary people (Biber et al., 2010:549; Tognini-Bonelli, 2010:15).

It was not until the 1960s that work on large electronic corpora began to emerge. Since that time, the word corpus has almost always implied a collection of texts stored in electronic form which can be read and analyzed automatically or semi-automatically rather than manually (Baker, 1995:226), or a set of "machine readable texts" (Crystal, 2003:95). Tognini-Bonelli and Sinclair (2006:208) divide the chronological development of electronic corpora into three stages or, as they call them, "generations":

- 1960-1980: over these twenty years, corpora were built from non-electronic data; everything that existed in hardcopy had to be converted into electronic form through keyboarding. The first electronic corpus to emerge was the Brown Corpus, compiled by W. Nelson Francis and Henry Kučera, containing a million words of American English from documents. The invention of the tape recorder encouraged the creation of the first corpus of spoken language at the University of Edinburgh. Corpora of different types and in different languages also appeared in the 1970s.
- 1980-2000: corpus size increased considerably due to the use of scanners, which improved access to printed documents.
- The new millennium: unlimited quantities of text that never existed in hardcopy format have become available on the internet.

Ever since, corpus linguistics has been used extensively in addressing research questions from different areas of linguistics and applied linguistics such as lexicography, translation, language teaching and learning, literary stylistics, sociolinguistics, pragmatics, forensic linguistics, etc. (Hunston, 2002; McCarthy and O'Keeffe, 2010).

Consequently, it is plausible to expect different types of corpora, distinguished by the purposes for which they are compiled. This issue is the subject of the following section.

1.4 Types of Corpora

An overview of the ways corpora are classified is needed in order to identify the main features of each type, the different opinions on the topic, and the type of corpus that fits the research questions at hand.

Many have attempted to classify corpora or to set criteria for sorting them; however, no consensus has been reached on this issue. For example, Baker (1995:229) sets out a number of criteria on the basis of which corpora are designed. The more important ones are as follows:

- general language vs. restricted domain
- written vs. spoken language
- synchronic vs. diachronic
- typicality in terms of ranges of sources (speakers/writers) and genres (e.g. newspaper editorials, radio interviews, court hearings, fiction, journal articles)
- geographical limits, e.g. British vs. American English
- monolingual vs. bilingual or multilingual (ibid)

Sereda (2012:197) adds further criteria such as:

- Size: small, medium, large
- Dynamism: dynamic (monitor) vs. static
- Authorship: one author, two authors and more
- Nature of application: research, aligned comparable, parallel, reference, learner, translation, etc.
- Annotation: unannotated, annotated (morphological, semantic, syntactic, prosodic, etc.)
- Language of texts: English, German, etc.
- Access: free, commercial, closed.

The variety of classification criteria yields different classifications. Bowker and Pearson (2002) classify corpora into eleven types: written vs. spoken corpus, synchronic vs. diachronic corpus, general reference vs. special purpose corpus, monolingual vs. multilingual corpus, open vs. closed corpus, and learner corpus. McEnery et al. (2006) focus on the main English corpora, classifying them on the basis of their potential use into eight types: written vs. spoken corpus, general vs. specialized corpus, synchronic vs. diachronic corpus, learner vs. monitor corpus. Still other typologies are suggested by Sinclair (1988), Tognini-Bonelli (2010), Teubert and Cermakova (2004), etc.

With their suggested typology, Bowker and Pearson (2002:11) argue that the list of corpus types is not exhaustive since "it is still possible to identify some broad categories of corpora that can be compiled on the basis of different criteria to meet different aims". Accordingly, the following paragraphs briefly highlight the most common and widely agreed types of corpora:

1.4.1 General Reference vs. Special Purpose Corpus

Within corpus typology, "a general corpus appears to be the superordinate in a hierarchy" (Pearson, 1998:44). It includes texts from a variety of language genres or text types (spoken, written, public, private) which tend to be balanced in size (Baker et al., 2006:18). General corpora are designed to be quite large, ranging from 50 to 500 million words, if not more (Teubert and Cermakova, 2004:118). The best-known large general corpora are the British National Corpus (100 million words) and the Bank of English (450 million words) (ibid). General corpora aim to provide a comprehensive picture of language in its broadest sense and to act as a widely available resource for comparative studies of various linguistic features (Reppen and Simpson, 2002:95).

A special purpose corpus is designed for a specific research goal or to study particular specialist genres of language: child language, English for Academic Purposes, etc. (Baker et al., 2006:147). It is usually much smaller than a general reference corpus, but the category is quite broad and can overlap with other corpus types. For example, MICASE (Michigan Corpus of Academic Spoken English) is a set of specialized spoken corpora representing speech in specific academic settings (Breyer, 2011:28-29). A special purpose corpus can be used in any given language research context (ibid:29). It may be the most essential "growth area" in corpus linguistics since researchers increasingly realize the significance of register-specific studies and descriptions of language (Reppen and Simpson, 2002:95).

1.4.2 Written vs. Spoken Corpus

A written corpus includes texts that have been created in written form such as novels, newspapers, books, magazines, letters, diaries, and even electronically produced texts like emails, websites, etc. (Baker et al., 2006:171).

A spoken corpus, in contrast, includes only transcripts of spoken materials: lectures, conversations, broadcasts, etc. (Bowker and Pearson, 2002:12). However, some corpora comprise a mixture of both written and spoken formats, such as the British National Corpus (ibid).

1.4.3 Monolingual vs. Multilingual Corpus

A monolingual corpus is the most typical type used by linguists which contains texts created only in one language (Biel, 2009: 3). It is used for descriptive purposes which involve intralingual analysis and for comparing the language of different genres (ibid). It is mainly used within forensic linguistics, but also in monolingual lexicography and foreign language teaching to build the study materials (ibid).

A multilingual corpus is made up of texts in more than one language and can also be further divided into two subtypes, i.e. comparable and parallel (Bowker and Pearson, 2002:12).

A comparable corpus can be defined as a corpus containing various individual sub-corpora that are chosen using the same or a similar sampling frame, i.e., texts of the same size, of the same genre, in the same domain, and from the same sampling period, but representing a range of different languages (McEnery and Hardie, 2012:20). These sub-corpora are not translations of each other; their comparability lies in the sameness of their sampling criteria (ibid).

A parallel corpus, in contrast, contains source texts with their equivalent translations (McEnery and Xiao, 2007:20). It is usual for parallel corpora to be bilingual or multilingual. They may be unidirectional (e.g. from English into Chinese only), bidirectional (e.g. containing both source English texts with their Chinese translations and vice versa), or multi-directional (e.g. the same material with their German, French, English equivalents) (ibid).

Essentially, comparable and parallel corpora are designed differently so as to fulfil different purposes: typically the former are used in contrastive studies, whilst the latter are used in translation research (McEnery and Hardie, 2012:20).

1.4.4 Synchronic vs. Diachronic Corpus

A synchronic corpus presents snapshots of language created within close limited periods of time (Bowker and Pearson 2002:12), while a diachronic corpus includes texts from one language produced over different periods of time (McEnery et al.,2006:65). Such a corpus is used to trace back changes or to investigate language evolution (ibid).

1.4.5 Open vs. Closed Corpus

An open corpus, also known as a monitor (dynamic) corpus, is one which undergoes constant expansion and is commonly used by dictionary makers to look up newly coined words and recent changes in meaning (Bowker and Pearson, 2002:12-13). Monitor corpora are very large and grow larger over time; most of them are opportunistic, and some are designed according to rigid principles (Breyer, 2011:26-27). Examples of this type are the Longman Written American Corpus and the Bank of English (BoE) (Baker et al., 2006:65).

In contrast, a closed corpus, or in the terms of Baker et al. (2006:152) a static corpus, is designed to be of a particular size, without any intention to include further texts beyond the targeted size. Most available corpora are static, presenting a "snapshot" of a particular language variety (ibid).

1.4.6 Learner Corpus

A learner corpus contains materials produced by language learners from different language backgrounds and is used as a resource to gain insight into such issues as error ranges and learner needs (Breyer, 2011:29). The most famous corpus of this type is the International Corpus of Learner English (ICLE), established in 1990 by Sylviane Granger (ibid). Initially, learner corpora were viewed as a subtype of specialized corpora, yet their value in identifying common errors made by foreign language learners has led them to be considered a type in their own right (Blecha, 2009:26).

1.4.7 Online Corpus/ Web as a Corpus

The idea of using the web, since its emergence in the 1990s, as a source of corpora has been proposed by a number of authors (Kilgarriff and Grefenstette, 2003; Fletcher, 2004). The availability of this tremendous amount of linguistic data has offered linguists an opportunity to gain insights into various genres, rare items, and neologisms (recently invented words) (Breyer, 2011:29). In addition, new text types like emails, blogs, and chat room logs have started to be used as objects of study (ibid).

Joseph (2004:382) indicates that many linguistic studies use corpora as their primary sources of data, and others use internet data. Researchers turn to web texts or internet data, as opposed to traditional corpora, because web texts are freely available, immense in size, continually updated, and rich in the most recent language use (Renouf, 2007:42). It is crucial to realize that researchers either use the internet as a source of texts for corpus compilation (as is the case in the present study) or search it directly as a corpus in itself (Kilgarriff and Grefenstette, 2003, cited in Hussein, 2014:65).

In spite of the great advantages the web as a corpus has over traditional corpora, it has some shortcomings, and "theoretical objections to using web as a corpus come thick and fast" (Renouf, 2007:42). The main shortcoming is the inability to control its content (Baker et al., 2006:171), as some texts are converted into electronic format through scanning, which is not 100 percent error-free (ibid:142). This lack of control negatively affects the central principles of corpus linguistics: representativeness, replicability of searches (due to the constant change of web materials), and exhaustiveness (Breyer, 2011:30; Renouf, 2007:42).

To differentiate a corpus of any type from a mere archive, many practitioners propose a number of central criteria that should be taken into consideration when creating a corpus, so as to ensure the reliability of the results as far as possible. On the one hand, Leech (1991: 11) suggests that an archive should be distinguished from a corpus since the latter is designed to be representative of a particular domain of language use. On the other hand, Gries (2006: 4) sets out a number of criteria that, if met, characterize the standard or prototypical corpus, which is:

- machine-readable
- representing naturally occurring language, that is, authentic language produced in natural communicative settings
- representative of each part of the particular genre, variety, or register under study
- balanced with regard to the volume of samples included
- analyzed systematically and exhaustively; a corpus should not serve as a body of texts from which certain samples are freely chosen and others neglected; rather, all samples are to be included in the analysis
- analyzed on the basis of concordance lines, frequency lists, collocations, etc., which are the focus of the next section.

1.5 Methods in Corpus Linguistics

Corpora of different sizes, designs, and structures are compiled to serve different purposes of linguistic inquiry and analysis. Such analyses are hard to conduct with precision without an explicit methodological framework.

Leech (1991: 12) maintains that a corpus, however large, when stored in electronic form, is of no use to corpus users unless search and retrieval methods are available to extract the required information, an example of which is a concordance program. Likewise, Sinclair (2004: 189) holds that the essence of the corpus, as opposed to the text, is that the researcher observes it indirectly, via tools. Indirect observation may be used to investigate things that are difficult to observe directly because they are too far apart from each other, too frequent or too infrequent, or can only be picked out after some kind of quantitative processing (ibid).

Before proceeding, it is worth stating that whereas a traditional intuition-based approach ignores corpus data entirely, considering the introspective judgment of speakers an appropriate source of linguistic analysis (Sampson, 1980: 150), a corpus-based approach draws on both corpus data and intuition, offering the researcher more reliable outcomes (McEnery et al., 2006: 7). So a corpus-based analysis "should be seen as a complementary approach to more traditional approaches, rather than the single correct approach" (Biber et al., 1998: 7-8).

The main point of this argument is that issues of corpus design are not the only prerequisite for improving our understanding of how language works; two further components are of vital importance for any successful corpus study: human intuition (for interpretation or confirmation) and software tools (for extracting and retrieving information) (Anthony, 2009: 90).

The growth in the number of corpora goes hand in hand with the development of software tools which allow us to "easily search, retrieve, sort and carry out calculations on corpora" (ibid: 88). Some of these tools (methods of corpus investigation) are as follows:

1.5.1 Concordance

A concordance is defined as "a collection of the occurrences of a word-form, each in its own textual environment" (Sinclair, 1991: 32). The textual environment is the immediate co-text (context) preceding and following the search word (Cheng, 2012: 73). A concordance is thus a list of all occurrences of the search item in a corpus and is usually displayed on the computer screen in the KWIC format (i.e., keyword in context) (ibid: 7).

The search item appears centered both horizontally and vertically within its context, usually marked by colour and typeface to make it visually prominent (Scott, 2010: 147). The horizontal (textual) dimension allows the researcher to observe the linear progress of the text and how this creates meaning, whereas the vertical dimension adds further insight into the similarities and differences between one line and those surrounding it, so that the researcher is able to generalize from recurrences (Sinclair, 2005).

What is meant by "word-form" in a corpus is best illustrated by an example. It is argued that eat, eats, eaten, ate, and eating are a set of word-forms belonging to the same lemma (EAT), since they carry the same sense (Hunston, 2002: 17-18). The researcher is then able to search for either a single word-form or a set of word-forms. By using a lemma list search (advanced search) and specifying the search forms (e.g., speak, spoke, speaks, spoken), the researcher can obtain a concordance containing all four forms (Scott, 2016: 188).

A concordance of the word-form "give" might look like the following sample (ibid: 183):

... could not give me the time ...

... Rosemary, give me another ...

... would not give much for that ...
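
As an illustration only, the KWIC display described above can be approximated in a few lines of Python. This is a minimal sketch and not the implementation of any concordancer cited here: the function name `kwic`, the `width` parameter, the sample text, and the simple letters-and-apostrophes definition of a word are all assumptions made for the example. Passing several word-forms at once imitates a lemma list search.

```python
import re

def kwic(text, forms, width=30):
    """Return one KWIC line per occurrence of any word-form in `forms`.

    `width` is the number of characters of co-text kept on each side
    (a hypothetical parameter chosen for this sketch).
    """
    targets = {form.lower() for form in forms}
    lines = []
    # Assumption: a "word" is a run of letters and apostrophes.
    for match in re.finditer(r"[A-Za-z']+", text):
        if match.group().lower() in targets:
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left co-text so the search item is
            # vertically aligned across all lines.
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, loosely modelled on the concordance lines above.
sample = ("He could not give me the time. Rosemary, give me another. "
          "I would not give much for that. She gave nothing away.")

for line in kwic(sample, ["give"], width=20):
    print(line)

# A lemma-style search: several word-forms of the same lemma at once.
lemma_lines = kwic(sample, ["give", "gave"], width=20)
```

Because the left co-text is padded to a fixed width, the search item lines up in a single column, which is what makes the vertical reading of a concordance possible.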

Concordancing software is not restricted to searching for individual word-forms; it also makes it possible to search for phrase patterns (fixed combinations of word-forms), such as the phrase "thank you for" in the following figure (Tribble, 2010: 173-174).

[Figure not included in this excerpt]

Figure (1.1) Phrase Pattern by Concordance
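
A phrase-pattern search of this kind can be sketched with a regular expression. The example below is a hedged illustration under stated assumptions: `find_phrase`, its `width` parameter, and the sample sentences are hypothetical and are not taken from the tools or sources cited in this section.

```python
import re

def find_phrase(text, phrase, width=25):
    """Return (left co-text, match, right co-text) for each hit of `phrase`.

    The match is case-insensitive and tolerates any amount of whitespace
    between the words of the phrase (assumptions made for this sketch).
    """
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(), right))
    return hits

# Invented sample text for illustration.
letters = ("Dear Sir, thank you for your letter. Thank  you for the quick "
           "reply. We thank them for coming.")

for left, hit, right in find_phrase(letters, "thank you for"):
    print(f"{left:>25} [{hit}] {right}")
```

Only the fixed combination is retrieved: "thank them for" is not matched, because the pattern requires every word-form of the phrase in sequence.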

1.5.2 Frequency / WordLists

Evison (2010: 122) refers to the production of frequency lists and the generation of concordances as the central analytical techniques, both of which rest on the fact that electronic corpora can be searched very rapidly. The frequency list is shown either in rank order, running from the most frequent words to the least frequent, or in alphabetical order (ibid).

[Figure not included in this excerpt]

Figure (1.2) Example of Frequency or WordList
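
The two orderings of a frequency list can be illustrated with a short Python sketch. It is an assumption-laden example, not a description of how any particular word-list program works: the names `tokenize` and `frequency_list`, the lower-casing of all words, and the sample sentence are choices made here for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-cased word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Count how often each word-form occurs in the text."""
    return Counter(tokenize(text))

# Invented sample sentence for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"
freq = frequency_list(sample)

# Rank order: most frequent first; ties are broken alphabetically.
by_rank = sorted(freq.items(), key=lambda item: (-item[1], item[0]))

# Alphabetical order.
by_alpha = sorted(freq.items())

for word, count in by_rank:
    print(f"{word:<5} {count}")
```

In this small sample the rank-ordered list begins with "the" (4 occurrences), followed by "on" and "sat" (2 each), while the alphabetical list begins with "and".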

The generated frequency list reports the number of types and tokens, where types are the unique or distinct words in a corpus and tokens are the total number of words in a corpus (Cheng, 2012: 62-63). It sometimes also displays the type/token ratio (TTR), obtained by dividing the number of types by the number of tokens, a measure used for comparison between corpora (ibid: 63).
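
The calculation can be written out as a minimal sketch. The function name `type_token_stats`, the simple tokenization rule, and the sample sentence are assumptions introduced for this example; real tools may count words differently (for instance, in how they treat numbers or hyphenated forms).

```python
import re

def type_token_stats(text):
    """Return (number of types, number of tokens, type/token ratio).

    Tokens are lower-cased runs of letters and apostrophes, which is a
    simplifying assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    # Guard against division by zero for an empty text.
    ttr = len(types) / len(tokens) if tokens else 0.0
    return len(types), len(tokens), ttr

# Invented sample sentence for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"
n_types, n_tokens, ttr = type_token_stats(sample)
print(n_types, n_tokens, round(ttr, 3))
```

Here the sample has 13 tokens but only 8 types, since "the", "sat", and "on" recur, so the TTR is 8/13, or roughly 0.62.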


Quote paper
Khalid Shakir Hussein (Author), Eman Abdul Kareem (Author), 2017, A Corpus-Based Analysis of Using Function Words in English Forensic Authorship Attribution, Munich, GRIN Verlag, https://www.grin.com/document/385050

