A corpus-based study of the language used in book reviews

1.1 Corpus linguistics - historical sketch and scope of interest
1.1.1 The definition and area of interest
1.1.2 The historical background of corpus linguistics
1.2 The idea of corpus
1.2.1 Defining the term
1.2.2 Major electronic corpora
The COBUILD Bank of English
The British National Corpus
Other important corpora of English
Corpora for special purposes
1.2.3 Issues in compilation and corpus design
Criteria for corpus design
Size
Representativeness
1.2.4 Basic procedures used in corpus analysis
Frequency list
Concordances

2.1 Main characteristic of the press
2.1.1 Press genres
2.2 Main characteristic of reviews
2.2.1 Book reviews
Book review vocabulary

3.1 Compilation of the DIY Book Review Corpus – basic information
3.2 Frequency of words
3.2.1 General versus specialized vocabulary
3.2.2 Findings concerning the frequency list of book review terms
3.3 Concordance analyses of selected terms
3.3.1 The noun reader
3.3.2 The noun book
3.3.3 The noun novel
3.3.4 The noun plot
3.3.5 The noun character
3.3.6 The noun story
3.3.7 Synonyms writer and author








Reading books is a pastime that has enjoyed great popularity for many years. The readership of a particular book depends on several factors. One of the most important seems to be the expert opinion provided by a book review. Readers often reach for a book only because reviewers have presented it in a positive light. The book review is a widely used opinion genre which plays a vital role in the publishing industry. The language of this genre, like the language of any other area, has its characteristic features, such as typical words and phrases. Although book reviewing is such a widespread phenomenon, it does not receive as much attention from linguists as it should. To be precise, the number of valuable, comprehensive linguistic materials devoted to this subject is insufficient. This is particularly evident from the point of view of a person using English as a second language. Most of the available course books focus on theory and the structure of the review, providing merely some basic vocabulary and sample compositions, which is not enough to read and write this type of text effectively. Hence, the idea of writing a thesis which could answer the question of what words and phrases are most likely to be encountered in a book review seemed fascinating to its author.

The aim of this thesis is to discuss certain aspects of the language of book reviews, stressing the importance and indispensability of a corpus-based approach in any study of this type. The thesis sets out to show that the register of book reviewing is a specialized language, characterized by specific vocabulary. This means that some words and phrases tend to be used more frequently here than in other registers, or that some words have a special, narrowed meaning.

For this purpose the researcher has compiled her own corpus of book reviews in electronic form, which provides the foundation for the analysis. This collection of written language samples is referred to as the Do-It-Yourself Book Review Corpus (henceforth, DIY BRC). The reviews come from the websites of popular quality newspapers.

To make the research relatively reliable, the body of linguistic data needs to be as large as possible. Therefore, my corpus (DIY BRC) totals approximately 115 000 words derived from as many as 166 book reviews. The mean sample size is between 660 and 982 words. The text processing and corpus analysis are conducted by means of a professional lexical analysis software package known as WordSmith Tools.

The thesis is divided into three chapters. The first provides a theoretical background concerning fundamental issues in corpus linguistics. It discusses the following topics: the definition and area of interest of corpus linguistics; a brief historical sketch of corpus-based studies; the definition and different classifications of corpora; factors taken into account during the compilation of a corpus; and the basic procedures and tools of linguistic data processing used by researchers. Chapter two focuses on the issues that show the nature of the book review as a press genre. Here the author discusses the main characteristics of the press, the classification of press genres, the most important features of various types of reviews, and the characteristics of the book review. The final section of this chapter contains a glossary of book review terms, created intuitively by the researcher herself in order to compare her selection of words with the results of the corpus analysis conducted in the following chapter. The third chapter, in turn, is devoted to a corpus-based study of the language of the book reviews gathered in the DIY BRC. The research is carried out with WordSmith Tools, especially with frequency lists and concordances, which are its most basic tools. The first sections of this chapter give some basic information concerning the DIY BRC, and then the frequency of words is presented and discussed. In the later sections the author examines the use of the most important and central words in the corpus by means of concordances. Chapter three also demonstrates how concordance analysis can determine contextual similarities and differences between particular synonyms.

Obviously, there has not been enough space in this diploma paper to carry out a detailed analysis of all, or even most, of the specialized terms characteristic of book reviews. Nevertheless, it is hoped that a thesis on such a popular topic, and the conclusions drawn from this research, will turn out to be interesting and useful to others.



1.1 Corpus linguistics - historical sketch and scope of interest

Corpus linguistics is a relatively new research enterprise which developed rapidly throughout the 20th century. However, it should be noted that corpus-based studies began much earlier. This new philosophical approach to linguistic enquiry has become especially popular with the advancement of computer technology since the 1980s. In this subchapter I will discuss the definition of corpus linguistics, its scope and its historical outline.

1.1.1 The definition and area of interest

The problem of defining corpus linguistics has been debated from different standpoints. According to McEnery and Hardie[1], corpus linguistics can be defined as dealing with some set of machine-readable texts which is considered a suitable basis for the study of particular research questions. This view is sharply opposed by Meyer[2], who argues that corpus linguistics is a way of doing linguistics. Like other linguists such as Hindquist[3], he claims that it is not really a domain of research but a methodology, “a methodological basis for pursuing linguistic research”[4].

However, all these authors agree that corpus linguistics is a heterogeneous field which combines with other branches of linguistics, complements other approaches and may facilitate research in other linguistic studies.

Corpus linguistics uses texts as a basis for linguistic research. In addition, it includes methodologies for descriptive purposes. Kennedy claims that it deals first of all with the description and explanation of the nature, structure and use of language, and with such areas as language learning, variation and change[5].

Kennedy divides work in corpus linguistics into four major areas of activity and accordingly identifies four groups of researchers dealing with corpus-based analysis. The first group includes corpus makers and compilers. The second group develops tools for corpus analysis. The researchers in the third group work in descriptive linguistics, using computerised corpora to describe the lexicon and grammar of a language. The researchers in the fourth group work on corpus-based description for a variety of purposes, such as language acquisition and natural language processing by machine[6].

According to Leech[7], the study of corpus linguistics focuses:

-on linguistic performance rather than competence,
-on linguistic description rather than linguistic universals,
-on quantitative as well as qualitative models of language,
-on a more empiricist, rather than a rationalist view of scientific inquiry.

As Kennedy has noted, the use of computers in corpus linguistics has extended the scope of corpus-based research. It is therefore not surprising that such projects improve the work on dictionaries and contribute to the making of word lists, descriptive grammars, diachronic and synchronic comparative studies of speech varieties, and to stylistic, pedagogical and other applications.

1.1.2 The historical background of corpus linguistics

The development of corpus linguistics can be divided into three phases. The first covers the time before computers (before the 1960s), the second lasted until the 1980s, and the third embraces studies from the 1980s to the present.

Corpus research based on the study of electronic, machine-readable corpora began in the 1960s[8]. Before that time corpus-based linguistic analysis was carried out using manual techniques, which was much more time-consuming. According to Kennedy, pre-electronic corpus studies can be found in five main fields of scholarship:

1. biblical and literary studies
2. lexicography
3. dialect studies
4. language education studies
5. grammatical studies[9].

The use of large collections of texts in machine-readable form for linguistic description began with the making of the Brown Corpus by linguists at the University of London and Brown University, who compiled a corpus of one million words of written text. Between 1965 and 1986 the growing interest in corpus use could be seen in the number of corpus-based studies, which increased by 310 from an initial 10. While the corpus-based approach won great popularity, it also became the target of criticism. Some linguists suggested that a corpus could never be a useful tool for linguistic analysis. As Chomsky said: “[s]ome sentences won’t occur because they are obvious, others because they are false, still others because they are impolite.”[10] He argued that a corpus is composed of a finite set of sentences, while a language is “an infinite set of sentences”. Besides, corpora would be skewed because they come under scrutiny, whereas intuitions do not. What is more, Chomsky claimed that corpus data is likely to contain “performance errors”. However, not all linguists agreed with him, and corpus studies became more and more popular from 1980 onwards, as techniques and new arguments in favour of the use of corpora became more significant and apparent.[11]

1.2 The idea of corpus

1.2.1 Defining the term

As McEnery and Wilson claim, any collection of texts, even of just two texts, can be called a corpus, because the term corpus is the Latin equivalent of body.[12] Nevertheless, in modern linguistics the word has more specific connotations: sampling and representativeness, finite size, machine-readable form and the status of a standard reference. Therefore, a corpus can be defined as ‘a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research’.[13] It should be added that corpus linguistics deals with both written and spoken forms.

1.2.2 Major electronic corpora

The COBUILD Bank of English

The Collins Birmingham University International Language Database (COBUILD) was set up at the University of Birmingham in 1980 under the direction of John Sinclair.[14] It was the first major mega-corpus project[15], now known as the Bank of English, which is an open-ended corpus, like language itself dynamic rather than static. Originally it contained 20 million words; by 2007 it had grown to 450 million words, and by 2010 to over 650 million.[16] The Bank of English set new standards in the area of corpus-based studies.[17]

The British National Corpus

The BNC “was undoubtedly the most ambitious corpus compilation project yet attempted”.[18] It was a collaboration between Oxford University Press, Longman Group, W. & R. Chambers, the British Library and the Universities of Oxford and Lancaster, with the British government paying half of the cost. The corpus was planned to be well balanced and representative of British English as a whole. It is composed of about 100 million words from samples of both spoken and written British English.[19] The corpus is widely used by linguists and dictionary makers.

Other important corpora of English

This section briefly discusses two general corpora of English which are less important in corpus-based research but are still well known throughout the world: the International Corpus of English and the American National Corpus.

The International Corpus of English (ICE) brings together corpora from eighteen countries all over the world, developed for the comparative study of regional varieties of English. The project was initiated by Professor Sidney Greenbaum in 1988.[20] This collection of one-million-word corpora in both written and spoken form provided for the first time “the resources for systematic study of the national variety as an end in itself”.[21]

The American National Corpus (ANC) was planned to be the American equivalent of the BNC in length and coverage.[22] In 1998 a number of scholars started to compile the corpus, funded by a group of publishers and computer companies, and five years later 11 million words were released. The ANC was intended to contain at least 100 million words, but completing the compilation turned out to be impossible for financial reasons.

Other important corpora of English worth mentioning are the Corpus of Contemporary American English (COCA), the Cambridge International Corpus and the Oxford English Corpus. The corpora mentioned here are only a small number of the many English corpora designed so far; it would be impossible to deal with all of them in this dissertation.

Corpora for special purposes

This section is concerned with specialised corpora, which are used to study particular features of a language or a genre of language. As Greenbaum noted, “[o]f making many corpora there is no end”.[23] Therefore, a classification or a clear typology of such corpora would be quite complicated and would have to be constantly brought up to date.

Special corpora bring into focus the language of a selected area, such as sport, medicine or economics, but also the language used in verbal communication in the classroom, in newspapers or in academic essays. It is also possible to compile a corpus of texts related to a specific topic. As Tognini-Bonelli points out, special corpora do not provide a description of ordinary language, because they contain a large number of unusual features.[24]

Moreover, according to Aston[25], one of the main differences between general and specialized corpora is that the latter are easier to become familiar with, owing to their relatively small size. Their samples may be selected from a larger non-specialized corpus, or they can be constructed for a particular purpose.

Kennedy discusses some examples of such corpora, inter alia[26]:
- the Corpus of Spoken American English (CSAE), covering English as used by adults, planned to consist of 200 000 words but increased to one million words,
- the Complete Corpus of Old English, consisting of 3 022 texts, published in 1981 as the basis for the Dictionary of Old English,
- the Helsinki Corpus of English: Diachronic Part, comprising 400 texts covering the period from Old English to Early Modern English; with 1.5 million words, it was the first specialized electronic diachronic corpus of English,
- the Child Language Data Exchange System (CHILDES), intended for analysing the process of language acquisition and development among children; the corpus consists of 20 million words from data which have been recorded and transcribed since the mid-1980s.

The last project discussed in this section is Polish and English Language Corpora for Research and Applications, known as PELCRA. It was started in 1997 as a joint venture of the Department of Linguistics and Modern English Language of Lancaster University and the Department of English Language of Łódź University. The project is aimed at building modern Polish and English language corpora for research and practical use in linguistic engineering and language studies. It includes a spoken corpus of Polish as well as Polish-English and English-Polish parallel and comparable corpora comprising translated texts. With over 100 000 000 words, the PELCRA project is one of the most significant ventures in Polish corpus linguistics.[27]

1.2.3 Issues in compilation and corpus design

With the development of the Internet, a special kind of corpus appeared, known as the do-it-yourself (DIY) corpus. Such corpora are also labelled disposable, virtual or ad hoc corpora. As Zanettin[28] claims, a DIY corpus is a compilation of Internet documents created ad hoc in response to a specific text to be translated. Furthermore, it is an open and disposable corpus, designed for a particular aim and usually discarded as soon as the research is completed.

A corpus is not just a compilation of randomly collected texts, but rather a project prepared in detail. Therefore, anyone who plans to create a DIY corpus is advised to look closely at some aspects connected with the process of corpus building.

Criteria for corpus design

Prior to compiling a corpus, a researcher should take many issues into consideration. First of all, the aim of the project ought to be clearly defined, since it affects the type of corpus and its design. The main problem that linguists come up against, however, is to compile a corpus that can serve the purpose for which it was intended.

It is obviously impossible to include every object of research in a corpus. Therefore, a compiler should consider some general criteria for text selection. The majority of linguists deal only with external criteria, which seem to be primary, at least at the beginning of the process. These are fundamentally non-linguistic criteria relating to representativeness, size and sampling, which are presented below.

Other issues, according to Atkins et al., are the range of language varieties, the time period covered, the decision whether to include both written and spoken texts and, if speech is included, the approximate level of encoding detail to be recorded in electronic form.[29] If one decides to include both written material and speech in a corpus, the next choice concerns the proportion of each, which Kennedy calls balance.[30]

Including spoken language may cause some problems, which is why linguists often concentrate only on the written form. Sinclair warned compilers not to include so-called ‘quasi-speech’ forms such as film scripts and drama texts, which some researchers may regard as spoken material. They are of little value in a general corpus, because such scripts do not substitute for natural conversations.[31]

A corpus may comprise either a total population of texts, e.g. those of a particular author, or a sample of texts from the population under analysis. Another matter that should be borne in mind is the type of corpus: whether it should be synchronic, which means that the time of publication of the texts is largely irrelevant, or diachronic, if the period to be covered is closely connected with the subject of study.[32]

All things considered, there are many criteria that may be used in corpus design. Nevertheless, it is advisable to adopt the smallest set of criteria that is regarded as reasonable for a corpus. Useful criteria for a general corpus, according to Sinclair, relate to the style of the texts (formal, informal), the age, sex and origin of the author, the genre of writing (whether the work is fiction or non-fiction) and the source of the texts (book, newspaper, journal).[33]

Size

The size of a corpus is one of its defining characteristics. Kennedy remarks that the quantity of text concerns not only the total number of words (tokens), but also the number of different words (types) in a corpus, the number of texts from different categories, the number of samples in each category and the number of words in each sample.[34]
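The size measures listed by Kennedy can be illustrated with a short Python sketch. It is only a hedged illustration, not the procedure used in this thesis: the tokenisation rule, the function names `tokenize` and `corpus_size`, and the three toy sentences are all assumptions made for the example.

```python
import re
from statistics import mean


def tokenize(text):
    """Split a text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def corpus_size(samples):
    """Return basic size measures for a corpus given as {category: [texts]}."""
    all_tokens = []
    sample_lengths = []
    for texts in samples.values():
        for text in texts:
            tokens = tokenize(text)
            all_tokens.extend(tokens)
            sample_lengths.append(len(tokens))
    return {
        "tokens": len(all_tokens),        # running words
        "types": len(set(all_tokens)),    # different word forms
        "categories": len(samples),
        "samples_per_category": {c: len(t) for c, t in samples.items()},
        "mean_sample_length": mean(sample_lengths),
    }


# Hypothetical toy data, invented for illustration only.
toy_corpus = {
    "fiction reviews": [
        "The plot is thin but the characters are vivid.",
        "The author tells the story well.",
    ],
    "non-fiction reviews": [
        "The book is a readable history of the press.",
    ],
}

stats = corpus_size(toy_corpus)
print(stats)
```

In this toy example the number of types is smaller than the number of tokens because words such as "the" and "is" recur; in a real corpus the gap widens as the corpus grows.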

According to Sinclair, “a corpus should be as large as possible and should keep on growing”.[35] This is advisable because of the number of word occurrences: in order to research the meaning and uses of a word, a large number of occurrences must be available. Thus, if the frequency of a word is low, no reliable description of its features and collocations is possible. In general, a larger corpus is regarded as more representative than a smaller one. Nevertheless, a specialized corpus need not be as big as a corpus of the general language.[36] Hence, some corpora do not need to be as large as others in order to serve the purpose for which they were intended.
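The link between corpus size and the number of occurrences can be made concrete with a small frequency count. The following Python sketch is an illustration under stated assumptions: the sample sentence is invented, the helper names `frequency_list` and `sparse_types` are hypothetical, and the threshold of two occurrences is arbitrary and does not come from the literature cited here.

```python
import re
from collections import Counter


def frequency_list(text):
    """Count how often each lowercase word form occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sparse_types(freq, min_occurrences):
    """List the word forms that occur fewer than min_occurrences times."""
    return sorted(word for word, n in freq.items() if n < min_occurrences)


# Hypothetical sample text, invented for illustration only.
sample = "the plot of the novel is slow but the plot of the sequel is gripping"

freq = frequency_list(sample)
print(freq.most_common(3))    # the most frequent word forms
print(sparse_types(freq, 2))  # forms with too few occurrences to describe
```

Words returned by `sparse_types` are those for which the sample offers too little evidence; enlarging the corpus is the usual way of moving them above the chosen threshold.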

What has been said so far can be supplemented by Kennedy’s remark that “any corpus, however big, can never be more than a minuscule sample of all the speech or writing”.[37] This implies that even a large corpus may not be adequate for a reliable description of the language. Nevertheless, the greater the number of occurrences, the more reliable the statistical results and observations on which a descriptive statement is based.

Representativeness

As Biber emphasises, a corpus is not simply a collection of texts. It aims to represent a language or some part of it.[38] Therefore, questions about the notion of representativeness arise frequently.

The term ‘representativeness’ means that an analysis based on the contents of a corpus can be generalised to a larger sample or to the language as a whole.[39] Although size does not guarantee representativeness, it is related to it and needs to be taken into account in order to compile a maximally representative corpus. Linguistic studies of this issue have shown that content and size are closely interdependent factors. Sinclair argues that small, specialized corpora display less internal variation than general corpora, a factor that has implications not only for the size but also for the representativeness of the corpus. Hence the conclusion that the greater the variation in the corpus, the more samples and the larger a corpus are recommended to assure the representativeness and validity of the data.[40] Moreover, representativeness determines the type of research questions that can be asked and how far the results can be generalized.[41] For example, conclusions drawn from the study of a corpus of conversations between children could not be generalized to conversation as a whole.

The linguistic literature presents some methods which may be used in compiling a maximally representative corpus. Biber et al. discuss two procedures called ‘proportional sampling’ and ‘stratified sampling’. The first reflects the proportions of the language varieties which a group of people, called informants, are involved in over a period of time. For example, they can record every conversation, instruction or television show they have heard, produced or listened to over a week. However, the fact that some language varieties will probably fail to be included casts doubt on the usefulness of this method.

Therefore, compilers tend to favour the second method, which is intended to cover the broadest possible range of language varieties.[42] That is why researchers often strive to create a corpus covering all existing genres.
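The contrast between the two sampling procedures can be sketched in Python. This is a simplified illustration and not Biber et al.'s actual procedure: proportional sampling is approximated here by a simple random draw from the pooled texts, stratified sampling by a fixed quota per variety, and the population figures, the variety labels and the function names are all hypothetical.

```python
import random


def proportional_sample(population, k, rng):
    """Draw k texts at random from the pooled population, so that each
    variety tends to appear in proportion to its share of the pool."""
    pool = [(variety, text)
            for variety, texts in population.items()
            for text in texts]
    return rng.sample(pool, k)


def stratified_sample(population, per_stratum, rng):
    """Draw the same number of texts from every variety (stratum),
    so that no variety is left out of the corpus."""
    chosen = []
    for variety, texts in population.items():
        n = min(per_stratum, len(texts))
        chosen.extend((variety, text) for text in rng.sample(texts, n))
    return chosen


# Hypothetical population: conversation dominates, academic prose is rare.
population = {
    "conversation": [f"conv_{i}" for i in range(90)],
    "news": [f"news_{i}" for i in range(8)],
    "academic": [f"acad_{i}" for i in range(2)],
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
prop = proportional_sample(population, 10, rng)
strat = stratified_sample(population, 2, rng)

print(sorted({variety for variety, _ in prop}))   # rare varieties may be absent
print(sorted({variety for variety, _ in strat}))  # every variety is present
```

With a population as skewed as this one, a small proportional sample can easily miss the rare varieties altogether, which is the weakness noted above; the stratified sample includes every variety by construction.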

Corpus data is helpful in showing what is statistically central and typical in language, which allows researchers to draw conclusions and make generalizations about their linguistic findings. Hence, representativeness should be considered a crucial factor when compiling a corpus.


