The development of corpus linguistics to its present-day concept

Term Paper (Advanced seminar), 2005
21 Pages, Grade: 1


Table of Contents

1. Introduction

2. Early Corpus Linguistics
2.1 The concept of early corpus linguistics
2.2 Corpus-based work up to the end of the 1950s

3. Criticism of corpus linguistics
3.1 Noam Chomsky and Abercrombie
3.2 Evaluation of the criticism

4. The influence of computer technology and techniques
4.1 Different generations of corpora

5. The concept of modern corpus linguistics
5.1 Definition of a corpus

6. Future prospects in corpus linguistics

7. Conclusion


1. Introduction

Thirty years ago when this research [corpus linguistics] started it was considered impossible to process texts of several million words in length.

Twenty years ago it was considered marginally possible but still lunatic.

Ten years ago it was considered quite possible but still lunatic.

Today it is very popular.[1]

John Sinclair's short description of the history of corpus linguistics, dating from 1991, is striking, but it gives no reasons for the rise of corpus linguistics to its present-day popularity. Changes must obviously have taken place that transformed corpus linguistics from impossible to lunatic to popular. But first a more basic question: what actually is corpus linguistics (CL)? Although it is already a well-established methodology in linguistics, it is still unknown territory to many language students. This claim rests merely on the fact that hardly anyone the author talked to knew what corpus linguistics is about. So what is the answer to that question?

“It doesn't exist.” One can at least be sure that this answer, given in 1999 by Noam Chomsky, probably the most influential critic of CL, is no longer appropriate. Corpus linguistics, today understood as the study of linguistic phenomena through large collections of machine-readable texts, does exist, although the idea behind it has not always been exactly the same. If the 1990s had been Chomsky's formative years, he would probably have held a different opinion, since that was the time when CL reached the status it has today.

This paper provides an overview of the different stages that CL has gone through. Early corpus linguistics, a term that covers all corpus-based work up to the end of the 1950s, is presented first. That is the time when Noam Chomsky makes the early researchers reconsider their work in ways that to some extent devalue what has been done up to that point. As a result, corpus research suffers a certain discontinuity.

Nevertheless, corpus-based work does not cease entirely, and improvements in computer technology open up completely new possibilities for corpus research. Over the decades a considerable number of machine-readable corpora are created for an ever wider range of purposes, and they give rise to many kinds of analysis.

After the presentation of the chronological development of CL, the penultimate chapter of the paper deals with the concept of modern corpus linguistics and gives a definition of a corpus, a task that is not yet settled.

Work to improve corpus-linguistic methodology is still going on. The last chapter gives an overview of future prospects.

2. Early Corpus Linguistics

The idea of collecting texts for language analysis is not new. As early as the Middle Ages, people begin to make lists of all the words in a particular text, together with their contexts. Other scholars produce lists of the most frequent words by counting word frequencies in single texts or in collections of texts.[2]
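The two kinds of list mentioned above, a frequency list and a list of words in context, can be illustrated with a few lines of modern code. The following is only a minimal sketch under stated assumptions: the sample sentence, the function names and the very simple tokenisation are invented for illustration and are not taken from any of the historical studies, which were of course compiled by hand.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_list(tokens, n=10):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, used only to demonstrate the two functions.
sample = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(sample)

for word, count in frequency_list(tokens, n=3):
    print(f"{count:3d}  {word}")

for left, keyword, right in concordance(tokens, "cat", width=2):
    print(f"{left:>10} [{keyword}] {right}")
```

A real corpus tool would need a more careful treatment of punctuation, hyphenation and spelling variants; the sketch only shows that both kinds of list rest on the same basic operation of counting and locating word forms.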

McEnery and Wilson[3] use the term “early corpus linguistics” for all linguistic corpus-based work done before the advent of Chomsky. That covers the period up to the end of the 1950s, when structuralism was the dominant linguistic paradigm. Famous linguists of the structuralist tradition such as Boas, Sapir, Bloomfield and Harris use methods of language analysis that can undoubtedly be called corpus-based. However, the term 'corpus linguistics' is not yet used explicitly at that time; it only comes up later.

2.1 The concept of early corpus linguistics

Early corpus-based methods are used as the basis for a variety of linguistic studies. Researchers collect and analyse naturally occurring data in order to describe and document, for example, language change or the phenomena of language acquisition, or to test linguistic hypotheses. Further fields of study are described in section 2.2.

The main thought behind these tasks is that language description is a matter of objective fact and not of subjective speculation. By collecting real-life examples, linguists find plenty of objective material which seems to offer the answers to all linguistic questions. But the idea of a corpus as the sole explicandum of language is only acceptable if one believes that language is finite. McEnery and Wilson claim that the linguists of that time tacitly hold this view. Their concept is “that the sentences of a natural language are finite, and […] that the sentences of a natural language can be collected and enumerated.”[4] That means it should be possible to obtain a complete collection of every occurring sentence via a corpus: “Like blades of grass on a lawn, the sentences of language are great in number, but if one has sufficient patience one may collect and count them all. They are finite. There are just so many blades of grass, and there are just so many sentences. Language is an enumerable set which can be gathered and counted.”[5] Since the idea that linguistics can be treated as an empirical science in its own right, like physics for example, sounds so tempting, no one really voices criticism. Unfortunately, at the end of the 1950s Noam Chomsky disturbs this idyllic situation, as chapter 3 describes.

2.2 Corpus-based work up to the end of the 1950s

Early empirical studies play an important role in the development of corpus linguistics. They form the basis for an idea which will be re-examined and improved over the following decades.

In 1897 Käding starts to compare frequency distributions of letters and sequences of letters in order to derive spelling conventions from them. He uses an impressively large corpus of some 11 million German words. Today it seems almost unimaginable that he could work through such a large amount of text without technical aids.

In the period from roughly 1876 to 1926, language acquisition research is based on diaries in which parents carefully record their children's language. It is interesting to note that those findings are still used as sources of normative data in language acquisition research over half a century later.
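What counting letters and sequences of letters involves can be sketched in a few lines of code. This is a modern illustration only, not a reconstruction of Käding's manual procedure: the function name, the three German sample words and the decision to count sequences only within word boundaries are assumptions made for the example.

```python
from collections import Counter


def letter_ngrams(text, n=1):
    """Count sequences of n adjacent letters inside each word of the text."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation attached to a word is ignored.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


# Hypothetical sample; a real study would read a large corpus instead.
sample = "Schule Schiff Tisch"

single_letters = letter_ngrams(sample, n=1)
letter_pairs = letter_ngrams(sample, n=2)
letter_triples = letter_ngrams(sample, n=3)

print(single_letters.most_common(3))
print(letter_pairs.most_common(3))
print(letter_triples["sch"])
```

Even in this tiny sample the sequence "sch" stands out, which hints at how frequency counts over millions of words can bring spelling conventions to light.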

Corpus-based data is also used by Fries and Traver (1940) and Bongers (1947) in research on foreign language pedagogy. The vocabulary lists they use are derived from corpora which are based on the studies of Thorndike (1921), West and Palmer (1933). “From the 1920s there was, especially in the United States and in the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes.”[6]

“From the 1930s Prague School linguists [undertake] quantitative studies (mainly of Czech, English and Russian) of the frequency of certain grammatical processes, the relative frequencies of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures.”[7] The studies are carried out manually, and the work moves either in the direction of comparative stylistic analysis (e.g. Krámský, 1972) or towards a quantitative comparison of varieties of English (e.g. Dušková, 1977).

Also working in the field of comparative linguistics, Eaton (1940) collates word frequencies in German, Dutch, French and Italian. Even today this work is still considered very sophisticated. Lorge (1949) follows Eaton's example and also uses semantic frequency lists. Fries (1952) presents, at a very early stage, a descriptive grammar based on a corpus of telephone conversations. Not only its early publication date is remarkable, but also the fact that the grammar does not rely exclusively on written data.

This small selection gives an idea of the growing experimental interest in corpus-based work, which in turn reflects the belief in its potential.

3. Criticism of corpus linguistics

In the late 1950s some linguists, above all Noam Chomsky, doubt that findings about the nature of language drawn from work on a corpus are really useful. Chomsky's arguments are so persuasive that his influence brings about a change in the linguistic paradigm of the time. Over the following two decades corpus linguistics becomes extremely unpopular.

3.1 Noam Chomsky and Abercrombie

Noam Chomsky causes a shift from empiricism towards rationalism.

The empiricist approach is based upon the analysis of external data, such as collected texts and corpora. That is, to decide whether sentence x is a valid sentence of language y, one consults a corpus.


[1] John Sinclair in Svartvik, Jan (1996): Corpora are becoming mainstream. In: Thomas, Jenny & Short, Michael (eds), p. 4.

[2] cf. (17.02.2005)

[3] McEnery, Tony & Wilson, Andrew (1996): Corpus Linguistics. Edinburgh: Edinburgh UP, pp. 2f.

[4] McEnery & Wilson (1996: 6)

[5] ibid.

[6] Kennedy, Graeme (1998): Introduction to Corpus Linguistics. London: Longman, p. 10

[7] Kennedy (1998: 10)

LMU Munich  (Institut für Englische Philologie)
Corpus linguistics and teaching
Describes the development of corpus linguistics from its beginning up to now, including Chomsky's critical statements
Bernadette Wonner (Author), 2005, The development of corpus linguistics to its present-day concept, Munich, GRIN Verlag,

