How can the use of frequency information from corpora be used in foreign language teaching? A corpus-based study on vocabulary in course books

Diploma Thesis, 2016

122 Pages, Grade: A


Table of contents


2 Coursebook vocabulary

3 Project coursebooks in primary education
3.1 Project home study program

4 Word frequency in current learner’s dictionaries
4.1 Oxford defining vocabulary
4.2 Longman defining vocabulary
4.3 Corpus linguistics in language learning

5 Research
5.1 Goal of research
5.2 Development of methods and tasks
5.2.1 Coursebook series selection
5.2.2 Acquisition of data
5.2.3 Questionnaires
5.2.4 Mistakes and errors
5.3 Compliance of the corpus
5.3.1 Corpus compliance/builder programs
5.3.2 Corpus preparation and importation
5.3.3 Part of speech tagging and lemmatization
5.3.4 Co-occurrence analysis
5.3.5 Thematic analysis
5.3.6 Comparative analysis
5.4 Oxford 3000 corpus

6 Output interpretation
6.1 Questionnaires
6.3 Project
6.3.1 Word list derived analysis
6.3.2 Co-occurrence analysis
6.3.3 Thematic analysis and data mining
6.3.4 Comparative analysis
6.3.5 Mistakes
6.4 Project
6.4.1 Word list derived analysis
6.4.2 Co-occurrence analysis
6.4.3 Thematic analysis and data mining
6.4.4 Comparative analysis
6.4.5 Mistakes
6.5 Project
6.5.1 Word list derived analysis
6.5.2 Co-occurrence analysis
6.5.3 Thematic analysis and data mining
6.5.4 Comparative analysis
6.5.5 Mistakes
6.6 Project
6.6.1 Word list derived analysis
6.6.2 Co-occurrence analysis
6.6.3 Thematic analysis and data mining
6.6.4 Comparative analysis
6.6.5 Mistakes
6.7 Project 4th edition series evaluation
6.8 Analysis and results of the comparison with the Oxford
6.8.1 Project
6.8.2 Project
6.8.3 Project
6.8.4 Project
6.8.5 Project
6.9 Conclusion of research


Bibliography – primary sources

Bibliography – secondary sources

Internet sources



The aim of this thesis is to conduct a study of a complete fourth edition Project coursebooks series on the basis of corpora, representing the individual coursebooks, created for the purposes of this study, in terms of corpus linguistics. The representation of vocabulary in these corpora is compared afterwards, individually and in conjunction, to the content of The Oxford 3000 list of essential words.

Keywords: Corpus Linguistics, Project coursebooks series, vocabulary, English language teaching


Cílem diplomové práce je vypracování studie kompletní řady učebnic Project, publikovaných vydavatelstvím Oxford University Press, při využití korpusů vytvořených pro účel této studie, v rámci korpusové lingvistiky. Na základě vypracované studie je provedeno porovnání zastoupení slovní zásoby v jednotlivých korpusech, i jako celku ve všech pěti učebnicích s obsahem seznamu nezbytných slov The Oxford 3000.

Klíčová slova: Korpusová lingvistika, řada učebnic Project, slovní zásoba, výuka anglického jazyka

I would like to thank PhDr. René Kron for his supervision and valuable advice. My thanks also go to Mgr. Martina Brychová, head of the English department at ZŠ náměstí Míru Nový Bor, for the opportunity to conduct research among the teachers and on the fourth series of Project coursebooks, and to the team of the T-LAB office in Italy, who were kind enough to provide me with a license for their software for the analytical part of this thesis. My biggest thanks go to my partner for his patience and comfort, and especially to my mother, who has helped me in every way she deemed possible and has always supported me.

List of abbreviations

Illustration not included in this excerpt

List of tables

Table 1: Project 1 – Listen associated lemmas

Table 2: Project 1 – Concordances words lemmas

Table 3: Project 1 – Cluster representation elementary contexts

Table 4: Project 1 – cluster representation

Table 5: Project 1 – cluster 1 analysis variables

Table 6: Project 1 – cluster 2 analysis variables

Table 7: Project 1 – cluster 3 analysis variables

Table 8: Project 4 – cluster 1 analysis variables

Table 9: Project 1 – Games console vs Game console frequency of use

Table 10: Project 2 – People associated lemmas

Table 11: Project 2 – Concordances words lemmas

Table 12: Project 2 – Cluster representation elementary contexts

Table 13: Project 2 – Lemmas with highest frequency in individual clusters

Table 14: Project 2 – cluster representation

Table 15: Project 2 – cluster 3 analysis variables

Table 16: Project 2 – cluster 4 analysis variables

Table 17: Project 2 – cluster 5 analysis variables

Table 18: Project 2 – cluster 6 analysis variables

Table 19: Project 2 – cluster 7 analysis variables

Table 20: Project 2 – Satsuma vs Tangerine frequency of use

Table 21: Project 3 – Listen associated lemmas

Table 22: Project 3 – Concordances words lemmas

Table 23: Project 3 – Cluster representation elementary contexts

Table 24: Project 3 – cluster representation

Table 25: Project 3 – cluster 1 analysis variables

Table 26: Project 3 – cluster 2 analysis variables

Table 27: Project 3 – cluster 3 analysis variables

Table 28: Project 4 – Listen associated lemmas

Table 29: Project 4 – Concordances words lemmas

Table 30: Project 4 – Cluster representation elementary contexts

Table 31: Project 4 – cluster representation

Table 32: Project 4 – cluster 1 analysis variables

Table 33: Project 4 – cluster 2 analysis variables

Table 34: Project 4 – cluster 3 analysis variables

Table 35: Project 4 – cluster 4 analysis variables

Table 36: Project 4 – An usual day vs A usual day frequency of use

Table 37: Project 5 – People associated lemmas

Table 38: Project 5 – Concordances words lemmas

Table 39: Project 5 – Cluster representation elementary contexts

Table 40: Project 5 – Lemmas and variables with highest frequency in individual clusters

Table 41: Project 5 – cluster representation

Table 42: Project 5 – cluster 1 analysis variables

Table 43: Project 5 – cluster 2 analysis variables

Table 44: Project 5 – cluster 3 analysis variables

Table 45: Project 5 – cluster 4 analysis variables

Table 46: Project 5 – cluster 5 analysis variables

List of figures

Figure 2: Project 1 – Listen associated lemmas

Figure 3: Project 1 – Co-word analysis

Figure 4: Project 1 – Sequence analysis

Figure 5: Project 1 – Characteristic lemmas

Figure 6: Project 1 – lemmas variables

Figure 7: Project 1 – comparative analysis

Figure 8: Project 2 – People associated lemmas

Figure 9: Project 2 – Co-word analysis

Figure 10: Project 2 – Sequence analysis

Figure 11: Project 2 – Characteristic lemmas

Figure 12: Project 2 – lemmas variables

Figure 13: Project 2 – comparative analysis

Figure 14: Project 3 – Listen associated lemmas

Figure 15: Project 3 – Co-word analysis

Figure 16: Project 3 – Sequence analysis

Figure 17: Project 3 – Characteristic lemmas

Figure 18: Project 3 – lemmas variables

Figure 19: Project 3 – comparative analysis

Figure 20: Project 4 – Listen associated lemmas

Figure 21: Project 4 – Co-word analysis

Figure 22: Project 4 – Sequence analysis

Figure 23: Project 4 – Characteristic lemmas

Figure 24: Project 4 – lemmas variables

Figure 25: Project 4 – comparative analysis

Figure 26: Project 5 – People associated lemmas

Figure 27: Project 5 – Co-word analysis

Figure 28: Project 5 – Sequence analysis

Figure 29: Project 5 – Characteristic lemmas

Figure 30: Project 5 – lemmas variables

Figure 31: Project 5 – comparative analysis

1 Introduction

Teaching English and especially vocabulary to young learners is one of the most challenging responsibilities that teachers face. The necessity of presenting vocabulary in a certain manner cannot be avoided and it is mostly up to the teachers which approach and methodology they are going to choose.

A large number of studies on how computers can facilitate the learning of English as a foreign language (EFL) have been published in recent years. The vast majority of current and future students are born in the digital era, in which more than 50% of children are more advanced in using technology than their parents (Ofcom, 2013). The age when each household owned a printed copy of an encyclopedia, a dictionary and other texts that served as a source of relevant information for decades is long gone. Nowadays information is accessed and retrieved almost exclusively through the internet, to which 91% of households have access (Ofcom, 2013).

With the development of corpus linguistics and the creation of immense corpora such as the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC), both teachers and students also have unlimited access to hundreds of millions of words in various corpora and the possibility to explore their relations and occurrence patterns. This advantage is, however, rarely used in practice, due to the relatively short existence of the discipline but most importantly due to the lack of information about corpora in ELT that is presented to teachers.

This thesis focuses on the use of corpora, based on word frequency, in teaching English to young learners, with specific application to students of the school náměstí Míru in Nový Bor. The entire teaching staff of the secondary school participated in this study through a questionnaire in which they expressed their opinions of and experiences with Project coursebooks and the difficulties of teaching vocabulary to students. The thesis also briefly evaluates the fourth series of Project coursebooks in comparison with the previous, third series, as an extension of prior work on the Project coursebooks.

The study offers the following hypothesis: little or no importance is given to the manner in which vocabulary is presented to students, and the vocabulary is often introduced in various types of categorization, such as a combination of semantic, thematic and “according to the units of coursebook” arrangements.

The research part of the thesis deals with the linguistic analysis of data extracted from each coursebook and their subsequent individual comparison with the Oxford 3000 list of essential words. The aim of the thesis is to investigate the linguistic attributes of the texts forming the Project coursebooks and to examine their relations. The research concludes with a comparison of the entire vocabulary of the fourth edition of the Project coursebooks with the Oxford 3000 lexicon.

2 Coursebook vocabulary

Vocabulary is one of the most overlooked components of language learning in schools. It is not neglected in terms of quantity; rather, the approach to presenting vocabulary lists to students and the system used for memorization are inefficient. When dealing with vocabulary, the focus is mainly on writing down multiple words that are often connected only by the coursebook unit to which they belong. Gairns and Redman talk about the importance of differentiating between “street learning”, where vocabulary emerges accidentally but is selected naturally, and school learning in a controlled environment.

“In a school learning situation with limited time available, conflicting student interests, and the constraints imposed by other syllabus demands, we cannot leave lexis to take care of itself in this random fashion and assume that students will acquire the vocabulary which best suits their needs.” (1986, 1)

The key to the best results in vocabulary acquisition for English language learners lies in the selection of high-priority words and the provision of enough opportunities for their practice.

The vocabulary of the Project coursebooks is introduced to students exclusively in the Student’s Book. The additional, optional material that complements the coursebook series (workbook, methodological guide) does not introduce new words, so students only learn new vocabulary under direct teacher supervision while working with the coursebook in class. This has been confirmed by Mrs Michaela Vareková from the Oxford University Press office in Ostrava.

The intended outcome of presenting vocabulary to learners through a coursebook series is in many cases more or less far from the actual end result. Considering the possibility that this could be the case for the Project coursebook series, the entire teaching staff of the secondary school ZŠ náměstí Míru Nový Bor took part in a questionnaire survey on how working with coursebooks affects the actual teaching of vocabulary.

The results varied between children in so-called “language classes” and regular classes. In ZŠ náměstí Míru, students from the second grade onward are divided by a test which distinguishes students with a potential for advanced language learning from those without it. The children who pass the test are then placed in the “language classes”, which have 2 more hours of English a week in their timetables.

3 Project coursebooks in primary education

Coursebooks are elementary instruments of a teacher’s methodology and form a major part of a course’s syllabus. It is common for the non-active English learner to rely solely on the coursebook and complementary materials, even more so when students are given homework and practice exercises from the coursebook and supplementary materials.

At the present time, the variety and number of ELT coursebooks available to teachers is very extensive. In the Czech Republic, the MŠMT accepted the Project coursebooks into a list of integrated coursebook series for primary English language education in 2008. The inclusion of the Project coursebooks in this list was valid until 2014; however, in December 2012 the MŠMT prolonged the validity of the third series of Project coursebooks up to 7. 10. 2019. What is startling is that the fourth edition of Project coursebooks, which was published in 2014, is also valid in the very same MŠMT list, but for a period that is a full 3 days longer and expires on 10. 10. 2019. The reason why the third edition of Project coursebooks, which is over 20 years old at the present date, is even considered a candidate for the MŠMT-issued list of appropriate teaching material is unknown. That it is in fact considered a valid and up-to-date coursebook that is supposed to teach contemporary language to children born in the 21st century is dismal.

Tom Hutchinson, the author of the Project coursebooks series, has been developing the series for over 30 years. In his video comment on the release of the latest, fourth edition of books he adds that the very first book edition was released in 1995, with second following in 1999, third in 2008 and finally the fourth in 2014. Every edition received an update of style, content and additional teaching material as well.

The school náměstí Míru in Nový Bor, which provided one set of the fourth edition of Project coursebooks (books 1–4) for the purposes of this study, has been using the third edition of the coursebooks since their release in 1995. At the present time, students are still taught with the third edition, as the school lacks the budget to upgrade to the newest edition of the series. Even though the school does not currently use the fourth edition for teaching, the validity and up-to-date status of the newest edition serve the purposes of the corpus analysis better than the third one. Therefore, this thesis will analyze data from the following titles, intended for various secondary school classes:

- Project 1, fourth edition
- Project 2, fourth edition
- Project 3, fourth edition
- Project 4, fourth edition
- Project 5, fourth edition

3.1 Project home study program

As complementary material to the Project coursebooks series, Oxford University Press also offers an online interactive home study section for students on its website. Each coursebook is covered in 6 units, like the coursebooks themselves, and additionally with a Picture dictionary, Phrase builder, Tests and Phonetic symbols. All of the categories except the Audio section contain a form of exercise and testing that students can do online at home. The 6 units are further separated into 7 subsections: Grammar, Pronunciation, Vocabulary, Listening, Test, Audio and Games, the last of which are very simple flash games based on testing vocabulary. These sections differ from those in the home study program for the previous edition of the coursebooks, which did not include a separate audio section for each unit.

The website also offers interactive home study material for the previous, third edition of the coursebooks, which is still used in a large number of schools that lack the necessary budget to update to the fourth edition of Project coursebooks.

4 Word frequency in current learner’s dictionaries

“Frequency information has a central role to play in learning a language. Nation (1990) showed that the 4,000 – 5,000 most frequent words account for up to 95 percent of a written text and the 1,000 most frequent words account for 85 percent of speech.” (Davies, 2010)

Even though Nation presents the results only for the English language, the data provide clear testimony to the benefits of using frequency in vocabulary learning. Frequency data are especially helpful to learners who have just started to learn English, because they prioritize the most frequently used words, which learners will most likely end up actually using in written or spoken discourse. As Ranalli claims, “learners who could decide for themselves which new items to learn showed a 50 percent higher rate of recall than those using words chosen by someone else.” (2003, 9) The learner also acquires a lexicon which they are able to use in various learning situations and over a long period of time. Despite this, frequency dictionaries should not act as the sole information source for the learner.

“Over the last one or two decades the importance of frequency in the shaping of the language system has been underlined by many researchers. According to some, frequency effects are all-pervasive and can be found in any area of language processing and production. Ellis (2002), for instance, claims that there are “frequency effects in the processing of phonology, phonotactics, reading, spelling, lexis, morphosyntax, formulaic language, language comprehension, grammaticality, sentence production and syntax” (Ellis, 2002, 143). Corpus-linguistic and cognitive linguistic research has emphasized the role that frequency and statistical associations play in the shaping of the language system.” (Kreyer, 2013, 11)

Patterns of frequency greatly influence the acquisition of language, which they shape in many considerable ways. Kreyer (2013) further argues that the higher the frequency of a word in the language, the more it is entrenched in the language system itself. When it comes to retrieving previously learnt vocabulary, words with the highest frequency of use are recalled faster and with higher accuracy than words with a lower frequency of use.
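The coverage figures quoted from Davies (2010) at the start of this chapter express what share of the running words in a text is accounted for by the N most frequent words. The following Python sketch shows one way such a share could be computed. The sample sentence, the simple tokenizer and the function names are illustrative assumptions and are not taken from the cited studies or from the coursebook corpora.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(tokens, top_n):
    """Share of running words covered by the top_n most frequent word forms."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Hypothetical sample text, not taken from the coursebooks.
sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

for n in (1, 3, 5):
    print(f"top {n} word forms cover {coverage(tokens, n):.0%} of the text")
```

On a real corpus the same calculation would be run on lemmatized text and on much larger values of N, such as the 1,000 or 4,000 most frequent words.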

4.1 Oxford defining vocabulary

The Oxford 3000 is a fully digitized list of 3000 words that are absolutely critical for an English language learner’s vocabulary. The corpus-based list includes words that are most frequently used across a variety of text types from different fields and disciplines. This eliminates words that are only used in one narrow type of English. The words in the list are also connected by the principle of familiarity: the most frequently used days of the week (Friday, Saturday), for example, are accompanied by the remaining days, which are not used with the same frequency but belong to the same set and must be presented alongside them.

The best definition of the Oxford 3000 is provided by Patrick Phillips, the ELT Dictionaries Development Editor at Oxford University Press, in the presentation video of the Oxford 3000 word list.

“The Oxford 3000 is a list of the most important and the most useful words in English. It was compiled by language experts and by experienced English teachers using information in the Oxford corpus collection and the British National Corpus. The selection of the Oxford 3000 was based on three criteria – frequency, range and familiarity.” (2010)

The list itself is available not only online but also in the Oxford Advanced Learner’s Dictionary, 7th edition. In the dictionary, the most frequently used words, listed in the Oxford 3000, are preceded by a key symbol.

Any text can be compared with the Oxford 3000 list in an online text checker, which is available at the website of Oxford learner’s dictionaries. The input text can be compared with Oxford 3000 vocabulary, Oxford 2000 keyword vocabulary and Academic word list. The online text checker is, however, not the best solution for comparison of the corpus derived from Project coursebooks with the vocabulary of Oxford 3000 and therefore was not used in the data analysis process of this thesis.

The Oxford 3000 is incorporated in coursebooks such as Q Skills, Aim High and Navigate. It is not, however, deliberately integrated in the fourth edition of the Project coursebooks series. Determining the number of words from the Oxford 3000 present in the vocabulary of the individual coursebooks and in the entire coursebook series is one of the aims of this study.
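In principle, establishing how many words of a reference list occur in a coursebook vocabulary is a set comparison. The sketch below illustrates the idea in Python. It is not the procedure used in this thesis, which relied on T-LAB. The five-word `reference_list` is a hypothetical stand-in for a real word list, and the coursebook words are invented for illustration.

```python
def list_coverage(vocabulary, reference):
    """Return the shared words and the share of the vocabulary found in the reference list."""
    vocab = {word.lower() for word in vocabulary}
    ref = {word.lower() for word in reference}
    shared = vocab & ref
    return shared, len(shared) / len(vocab)


# Hypothetical stand-ins: neither list is taken from the Oxford 3000 or from Project.
reference_list = ["book", "school", "friend", "listen", "people"]
coursebook_vocab = ["Book", "listen", "satsuma", "people"]

shared, share = list_coverage(coursebook_vocab, reference_list)
missing = sorted({word.lower() for word in coursebook_vocab} - shared)

print(f"{share:.0%} of the coursebook vocabulary is in the reference list")
print("not in the reference list:", missing)
```

The comparison is only meaningful when both sides use the same unit, for example lemmas on both sides rather than inflected forms on one side and headwords on the other.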

4.2 Longman defining vocabulary

Longman has had a major role in the development of ELT dictionaries since 1935. It was Longman himself who invented the Defining Vocabulary of the 2000 most common words in that year. (Longman) Since then, many others have followed his lead. All Longman dictionaries are designed for learners and compiled using the Longman Corpus Network, which consists of 330 million words from a variety of text genres. The equivalent of the Oxford 3000, the Longman Communication 3000 word list, can be found in the Longman Dictionary of Contemporary English, 5th edition. In this dictionary, the words selected for the Longman Communication 3000 list are marked with a diamond symbol. As the name of the list suggests, it contains 3000 common English words selected on the basis of statistical analysis of the large Longman Corpus Network. The criteria for selecting words for the Longman Communication 3000 list are based on raw frequency and range, and the words in the list are not organized semantically. (Newman, 2011, 159)

4.3 Corpus linguistics in language learning

A language corpus is a systematic collection of multiple texts that is investigated linguistically for similarities and differences among the texts and their parts. In essence, identical parts of language can be treated both as a text and as a corpus (Mohsen, 2001, 1). Language represented in a corpus is naturally occurring and can be derived from both spoken and written discourse. This study is concerned with the written form of language as it is presented in the fourth series of Project coursebooks.

Corpus linguistics is therefore a method of linguistic analysis of naturally occurring language carried out with the use of specialized software and a corpus or multiple corpora in a format supported by such software. The content of the corpus follows certain extralinguistic principles, and information on its exact composition is available to the researcher. (Nesselhauf, 2005)

According to Harmer, “Students who consult a language corpus get the thrill of being their own language researchers and of seeing the evidence that is immediately persuasive.” (2001, 177) It is the authenticity of the language and information about its use that makes it attractive to students.

The use of corpus linguistics in language learning is still quite open to exploration. (Krieger, 2003) The limited number of studies dedicated to this topic are often restricted to theoretical investigation of the problem, because the use of corpus linguistics in EFL education is quite uncommon.

Corpora can be used in two pedagogical approaches, direct and indirect. In the direct approach, both teacher and students work with the corpus data themselves in a “raw” form. They do not rely on researchers who serve as mediators or providers of corpus-based material, as in the indirect approach. As stated by Ute Römer in a work on pedagogical applications of corpora, the direct approach is accurately explained by Tim Johns.

“Tim Johns, the “father” of this direct, so-called “data-driven learning” (DDL), approach, suggests to “confront the learner as directly as possible with the data, and to make the learner a linguistic researcher” (Johns 2002, 108). Johns’ motto for this inductive learning approach in which learner work with concordances and consult corpora in an exploratory way is “Every student a Sherlock Holmes!”” (2006, 4)
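The concordances mentioned in the quotation are typically displayed as keyword-in-context (KWIC) lines, in which every occurrence of a search word is shown with a few words on either side. A minimal Python sketch of such a display follows. The function name `kwic`, the window size and the sample sentence are assumptions made for illustration and do not reproduce the output of any particular concordancer.

```python
def kwic(tokens, keyword, window=3):
    """Return one keyword-in-context line per hit, with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical instruction text, not quoted from the coursebooks.
sample = "Listen to the story . Then listen again and check your answers".split()
concordance = kwic(sample, "listen", window=2)

for line in concordance:
    print(line)
```

Reading the lines one below the other lets a learner notice recurring patterns around the keyword, which is the exploratory use of corpora that the direct approach has in mind.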

There are, however, certain criteria that need to be taken into consideration when it comes to counting and analyzing words. Multiple word meanings, multiword items and the grouping of words into families or lemmas all alter the resulting frequency counts.
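The effect of grouping on frequency can be shown with a small Python sketch that counts the same tokens once as word forms and once as lemmas. The lemma table `LEMMAS` is a hand-made toy mapping and the sentence is invented. A real study would take lemmas from a lemmatizer or from the analysis software.

```python
from collections import Counter

# Toy lemma table for illustration only; not a real lemmatizer.
LEMMAS = {"goes": "go", "went": "go", "going": "go", "books": "book"}


def count_forms(tokens):
    """Count every distinct word form separately."""
    return Counter(token.lower() for token in tokens)


def count_lemmas(tokens, lemmas=LEMMAS):
    """Count tokens after mapping each form to its lemma where one is known."""
    return Counter(lemmas.get(token.lower(), token.lower()) for token in tokens)


tokens = "She goes to school and he went to school with two books".split()
forms = count_forms(tokens)
lemma_counts = count_lemmas(tokens)

print("as forms: ", forms["goes"], forms["went"], forms["go"])
print("as lemmas:", lemma_counts["go"])
```

Counted as forms, "goes" and "went" each occur once and "go" does not occur at all. Counted as lemmas, "go" occurs twice, so the ranking of a word in a frequency list depends on the counting unit chosen.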

The selection of a core vocabulary for young learners must adhere to certain principles, such as consideration of the learner’s age, the type of discourse (spoken or written) and the frequency of use across a wide range of texts. As reported by Li, “Statistics show that a core vocabulary for EFL learners selected by distributed frequency achieves higher cumulative coverage than a core vocabulary selected by raw frequency alone.” (2010, 1)
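The difference between raw frequency and frequency across a range of texts can be illustrated with the Python sketch below. Li's exact measure of distributed frequency is not defined in this chapter, so the sketch uses range (the number of texts in which a word occurs) as a simple stand-in. This substitution, the function names and the three invented texts are all assumptions.

```python
from collections import Counter


def raw_frequency(texts):
    """Total number of occurrences of each word over all texts."""
    total = Counter()
    for tokens in texts:
        total.update(tokens)
    return total


def text_range(texts):
    """Number of texts in which each word occurs at least once."""
    spread = Counter()
    for tokens in texts:
        spread.update(set(tokens))
    return spread


# Three hypothetical texts; the first one is dominated by a single topic word.
texts = [
    "volcano volcano volcano lava".split(),
    "school friend book".split(),
    "school book friend".split(),
]
freq = raw_frequency(texts)
rng = text_range(texts)

for word in ("volcano", "school"):
    print(word, "raw frequency:", freq[word], "range:", rng[word])
```

In this toy data "volcano" has the highest raw frequency but occurs in only one text, while "school" is less frequent overall but occurs in two. A selection based on raw frequency alone would therefore favour a word tied to a single topic.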

5 Research

5.1 Goal of research

The objective of this study is the production of specific linguistic corpora representing the Project fourth edition coursebooks. The data of the produced corpora will be analyzed by means of word list derived analysis, co-occurrence analysis, thematic analysis and comparative analysis, and subsequently compared with the vocabulary of the Oxford 3000 list of essential words.

The work will furthermore evaluate the fourth edition of Project coursebooks in comparison to the third edition of the series, which was the subject of the bachelor’s thesis Supplementary Projects in Teaching English to Young Learners.

Little or no importance is given to the manner in which vocabulary is presented to students, and the vocabulary is often presented in various types of categorization, such as a combination of semantic, thematic and “according to the units of a coursebook” arrangements. Not only for this reason, learning the full extent of the presented vocabulary is one of the biggest challenges that students face when learning English as a second language. To test this hypothesis, the study is supported by the opinions of all English teachers from ZŠ náměstí Míru Nový Bor, who have used Project coursebooks for teaching English since 1995. The school currently uses the Project third edition series, which was approved by the MŠMT in 2008 as a recommended integrated coursebook for teaching English, with validity until 2019.

Considering the possibility of differences between the intended presentation of vocabulary to young learners through a coursebook series and the actual results and problems that learners experience after being confronted with that presentation, a questionnaire survey was carried out among the teaching staff of ZŠ náměstí Míru Nový Bor to explain how working with coursebooks affects the actual experience of teaching vocabulary. The study is supported by claims provided by the English teachers from the aforementioned school.

5.2 Development of methods and tasks

The following part of the thesis describes the conditions, expectations and methods used for obtaining data from the Project fourth series coursebooks. The data, collected separately for each section of the individual coursebooks, subsequently served for the purposes of the following analysis. This chapter also describes the methods used for the analysis of the data itself, along with a detailed listing and description of the processes applied to the data during the corpus analysis.

The comparative method was chosen as the main investigative technique for this thesis.

5.2.1 Coursebook series selection

The first fundamental step of the thesis research was the selection of the coursebook series that would be evaluated, transformed into data in plain text format and then analyzed using various analysis tools.

This study is a continuation of previous research by the author, "Supplementary Projects in Teaching English to Young Learners", which was carried out on the Project coursebooks third series, the integrated coursebooks used for teaching English in the school náměstí Míru in Nový Bor. At the time the abovementioned thesis was compiled, the fourth edition of Project coursebooks had not yet been completed and therefore was not an eligible basis for the research. At present, however, the fourth series of the Project coursebooks has not only been completed and published in full, but has also been accepted by the MŠMT into the list of integrated coursebooks recommended for primary education in English.

Mrs. Martina Brychová, head of the English department at ZŠ náměstí Míru Nový Bor, was kind enough to provide Project fourth series coursebooks 1, 2, 3 and 4 for the purposes of this study. The fifth and last coursebook was not lent by the school, because the school only uses the first four books for teaching English to young learners up to the age of 15. The remaining book, which is rarely found in schools or in bookshops specializing in education, was therefore purchased from an online bookstore solely for the purpose of acquiring the data and analyzing them further.

5.2.2 Acquisition of data

For the purposes of this study, the full text of each individual coursebook was considered a better representation than a random selection of sample pages from each coursebook. This complies with the claim that corpora must be as large as possible for the best reliability of the results of an investigation (Sinclair, 1991). The only exceptions not included in the data collected from the Project coursebooks are multiple listings of items unrelated to the goals of the thesis analysis, such as:

- List of names of states

These words do not fall into the category of basic learner’s vocabulary and therefore are not relevant for the purpose of the study.

- Phonetic symbols

Each coursebook contains a Pronunciation chapter in the penultimate section of the book. The phonetic symbols were not included in the data obtained from this chapter, because phonetic symbols cannot be analyzed in the T-LAB software. The only words from this section that were included in the corpus of the individual coursebooks were the words accompanying the symbols, instructions etc.

- Numbers

Numbers standing alone out of context were omitted in the data collection process. Numbers belonging to a certain contextual surrounding were included in the collected data.

The full extent of the acquired data in .TXT files is presented in the electronic attachment of the thesis.
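The transcription for this study was carried out manually. Purely as an illustration, two of the exclusion rules described above (standalone numbers and phonetic symbols) could be expressed as a small text filter like the Python sketch below. The assumption that transcriptions appear between slashes, the regular expressions and the sample lines are hypothetical and do not describe the actual layout of the coursebooks.

```python
import re

# Assumption: a line holding nothing but a number is a number "standing alone".
NUMBER_ONLY = re.compile(r"^\s*\d+([.,]\d+)?\s*$")
# Assumption: phonetic transcriptions are written between slashes, e.g. /ʃɪp/.
PHONETIC = re.compile(r"/[^/\s]+/")


def clean_lines(lines):
    """Drop number-only lines and strip slash-delimited transcriptions."""
    cleaned = []
    for line in lines:
        if NUMBER_ONLY.match(line):
            continue  # number without context: omitted
        line = PHONETIC.sub("", line)
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


# Hypothetical raw lines, not quoted from the coursebooks.
raw = [
    "12",
    "There are 12 students in the class.",
    "ship /ʃɪp/ sheep /ʃiːp/",
    "Listen and repeat.",
]
cleaned = clean_lines(raw)
print(cleaned)
```

The number inside a sentence is kept because it belongs to a context, and the words accompanying the phonetic symbols are kept while the symbols themselves are removed.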

5.2.3 Questionnaires

The practices and opinions of English teachers from ZŠ náměstí Míru Nový Bor, who use Project coursebooks in English lessons, were collected in the form of a questionnaire consisting of 4 questions with 2 additional open-ended supplementary questions for specification of previous affirmative answers. The questionnaire examined the way vocabulary is presented to students, how they categorize this vocabulary and whether students experience difficulties learning the full extent of required vocabulary.

The last question investigated the opinions of teachers on redundancy of certain vocabulary in the coursebooks. The questionnaire is presented in attachment 1.

5.2.4 Mistakes and errors

The coursebooks as they stand should also be analyzed in terms of correctness. Because every page of each coursebook was transcribed manually from image format into text format, a series of mistakes and errors was discovered in all coursebooks without exception. A list and description of the individual mistakes and errors, accompanied by their location, is placed in a separate chapter at the end of each coursebook analysis section.

5.3 Compliance of the corpus

For the purposes of this study, the data from the entire fourth edition of Project coursebooks were transcribed into plain text files. Thus “teaching oriented corpora”, compiled specifically for EFL purposes, were created. The advantage of these corpora is that they consist of texts which are already used by students in class. Therefore, pupils are familiar with the co-text of target features (Timmis, 2015). This section describes the software used for the compilation and analysis of the individual corpora, the processes that preceded the construction of the corpora themselves, and the subsequent procedures.

5.3.1 Corpus compliance/builder programs

T-LAB Plus 2016 is an all-in-one software package which contains linguistic, statistical and graphical tools for text analysis, with use in research fields such as Content Analysis, Semantic Analysis, Sentiment Analysis, Thematic Analysis, Text Mining, Perceptual Mapping, Discourse Analysis, Network Text Analysis, Document Clustering and Text Summarization. Because of this breadth, the software was chosen as the primary source of the analysis output data.

The program offers a trial version which is limited in the amount of data that can be entered. Files to be analyzed may be up to 20 Kb in .TXT format, and each can include up to 20 short documents. The plain text data of the first coursebook consist of 52 .txt files; the remaining coursebooks are composed of a nearly identical number of text files, which differs due to distinct content and its division into categories.

T-LAB Plus 2016 required the installation of SQL Server 2012 Express LocalDB. After that, the program itself was installed and was fully functional with the license that was provided by the T-LAB office team exclusively for the purposes of this study. The capacity of the provided full version far exceeded the demands of this thesis; therefore, T-LAB Plus 2016 and Microsoft Excel were the only programs used for the complete analysis of the data in this thesis.

5.3.2 Corpus preparation and importation

After the collection of the data from the coursebooks and their transformation into plain text format, the files for each individual coursebook had to be merged into a corpus that could be processed by the T-LAB software. All the files for one corpus were selected and loaded into the program’s database simultaneously, and a new folder was created for the corpus to be imported.

The corpus importation itself is an automatic process carried out by the T-LAB software. It transforms the corpus into a set of tables integrated in the T-LAB database, which is then accessible for further analysis. The options selected for all 5 corpora created for the purpose of this thesis were as follows:

- Lemmatization
- Advanced stop-word check
- Advanced multi-word check
- Text segmentation (elementary contexts): Chunks
- Key term selection: TF-IDF
- Key term selection: 3000 items

The list of key words is selected from the categories of content words: nouns, verbs, adjectives and adverbs. With a corpus consisting of multiple files, as is the case in this study, the software first selects the words that surpass the minimum occurrence threshold and subsequently applies the TF-IDF test to each selected word across all the analyzed texts. TF-IDF is a statistical measure of the relevance of a word across the analyzed documents. The importance of a word increases proportionally to the number of times it appears in a document but is offset by the frequency of the word in the corpus. The result is a selection of the words with the highest TF-IDF values, which carry the greatest significance throughout the source texts.
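The two-step selection described above (occurrence threshold first, TF-IDF ranking second) can be illustrated with a minimal Python sketch. It is not the T-LAB implementation: the common variant tf × log(N/df) is assumed, and the function names, the threshold value and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return {term: highest TF-IDF score the term reaches in any document}."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    scores = {}
    for doc in documents:
        if not doc:
            continue  # an empty document contributes nothing
        for term, count in Counter(doc).items():
            tf = count / len(doc)              # relative frequency in this document
            idf = math.log(n_docs / df[term])  # 0 when the term occurs in every document
            scores[term] = max(scores.get(term, 0.0), tf * idf)
    return scores

def select_key_terms(documents, min_occurrences=2, max_terms=3000):
    """Apply the occurrence threshold, then rank the survivors by TF-IDF."""
    totals = Counter(term for doc in documents for term in doc)
    scores = tf_idf(documents)
    candidates = [t for t in scores if totals[t] >= min_occurrences]
    candidates.sort(key=lambda t: (-scores[t], t))
    return candidates[:max_terms]

# Toy documents (already tokenised); not taken from the coursebook corpus.
docs = [
    ["listen", "and", "repeat", "listen", "again"],
    ["read", "and", "listen"],
    ["town", "school", "town", "and", "picture"],
]
key_terms = select_key_terms(docs)
```

In this toy example a word that occurs in every document, such as and, receives a score of zero, which mirrors the offsetting effect of corpus-wide frequency described above.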

5.3.3 Part of speech tagging and lemmatization

For tagging and lemmatization, the first intention was to use the CLAWS tagger. This tagger has been used, for example, for tagging the British National Corpus and the Frequency Dictionary of Contemporary American English. The tagger does consider context when assigning parts of speech; however, ambiguous words such as light (adjective and noun) or since (preposition, adverb and conjunction) present a difficulty and the program may produce errors. The CLAWS tagger is also a separate tool which requires the input of single files and therefore does not fit the needs of this thesis.

“Even a good POS-tagger will produce a wrong result in 2-4 per cent of tokens. Parsers or semantic taggers produce a bigger error rate than that. Ideally, errors should be manually corrected in a carefully controlled way, to avoid inconsistencies, before the release of the corpus. This is an expensive and time-consuming process though. So in practice it is common to use annotated corpora which contain a considerable number of errors.” (Viana, 2011, 168)

In view of this, the results of each individual analysis step performed by the T-LAB software were retained in their original state, to avoid uneven alteration of the produced data.

5.3.4 Co-occurrence analysis

Co-occurrence analysis discloses relationships between lexical units (i.e. words or lemmas) which determine the local meaning of chosen key words. According to Botley, word co-occurrence coefficients are numeric scores which reflect how often a pair of words occurs in segments of compared corpora (2000, 12). This is done in two different manners. The first marks co-occurrences within the elementary contexts of sentences and paragraphs; the second, advanced option computes the co-occurrences within so-called n-grams, which represent strings of two or more words automatically selected as useful word sequences. These operations were carried out after the initial computation of co-occurrence analysis of all words in the individual corpora.
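Both manners of counting can be sketched in a few lines of Python. The sketch is only illustrative: the function names and the toy contexts are hypothetical, and the way T-LAB selects "useful" n-grams is not reproduced (every n-gram is simply counted).

```python
from collections import Counter
from itertools import combinations

def context_cooccurrences(contexts):
    """For each unordered word pair, count the contexts that contain both words."""
    pairs = Counter()
    for context in contexts:
        # set() makes a pair count once per context, however often it repeats there.
        for a, b in combinations(sorted(set(context)), 2):
            pairs[(a, b)] += 1
    return pairs

def ngram_counts(tokens, n=2):
    """Count every run of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy elementary contexts (tokenised sentences).
contexts = [
    ["listen", "and", "repeat"],
    ["listen", "again", "and", "repeat"],
    ["read", "and", "listen"],
]
pairs = context_cooccurrences(contexts)
bigrams = ngram_counts(contexts[1])
```

Here the pair (listen, repeat) is counted in two of the three contexts, regardless of how far apart the two words stand within a context, whereas the bigram count only registers directly adjacent words.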

Radial diagrams displaying the relations of individual lemmas put the lemma in the center of the diagram and the associated words at a distance proportional to the degree of their association.

Word associations

The word association tool automatically selects the full extent of elementary contexts with the list of key words singled out in the previous automatic operation and their pairs matching in terms of co-occurrence. The relation between these word pairs was subsequently used to compare sets of contexts in which both of the key words were present.

Co-word analysis

As Quin He argues, co-word analysis is based on the co-occurrence frequency of pairs of words or phrases (1999). The distribution of individual lemmas, their occurrence in the corpus and their position in relation to other lemmas are visualized by means of the Sammon mapping method.

Sequence analysis

As the heading suggests, sequence analysis is concerned with lemmas and their arrangement and order. The T-LAB software allows sequence analysis of key words, sequences of themes and sequences previously recorded in a sequence file. For the purpose of this study, the sequence analysis of key words was chosen as the investigative method.

Concordances

The concordancing process carried out by the T-LAB software extracts all elementary contexts in which the selected key words are present. The information emerging from concordance analysis can be used in major areas of teaching and learning EFL. The syllabus and all course materials should take account of concordances, which serve as a source of instances of actual use of lexical items. Concordancing can also identify which of the different meanings of an item applies in a given use (Mohsen, 2001, 80).
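The extraction of elementary contexts for a key word can be sketched as follows. This is a simplified illustration, not the T-LAB procedure: the function name, the file names and the punctuation handling are assumptions, and each context is treated as one short text string.

```python
def concordance(contexts, keyword):
    """Return (source, context) pairs for contexts containing the keyword.

    The keyword is upper-cased in the output so that it stands out.
    """
    keyword = keyword.lower()
    hits = []
    for source, text in contexts:
        words = text.split()
        # Compare without surrounding punctuation and without regard to case.
        matches = [w.strip(".,!?").lower() == keyword for w in words]
        if any(matches):
            marked = " ".join(w.upper() if m else w for w, m in zip(words, matches))
            hits.append((source, marked))
    return hits

# Toy contexts; the file names are hypothetical.
sample = [
    ("unit1a.txt", "Read and listen. Write the short forms."),
    ("unit1b.txt", "Look at the picture."),
    ("unit2a.txt", "Listen again and repeat."),
]
hits = concordance(sample, "listen")
```

Each hit keeps the name of its text of origin, so an instance of use can be traced back to the part of the coursebook it comes from.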

5.3.5 Thematic analysis

This analysis focuses on thematic groups of key terms belonging to the same category and determines, examines and maps themes that arise from the texts in the analyzed corpora. The T-LAB software performs an unsupervised clustering in which it extracts themes. The program uses the Thematic Analysis of Elementary Contexts tool for this purpose. The subtasks applied to the data in the corpus are the following:

- Co-occurrence analysis for identification of thematic clusters in context units
- Comparative analysis of the profiles of individual clusters
- Generation of numerous types of graphs, diagrams and tables
- Export of newly emerged thematic clusters

The tool for modeling of emerging themes provides a way of discovering, examining and modeling the main topics and themes emerging from the analyzed texts. The default option, used for the analysis of all individual coursebooks, operates with a bottom-up approach which analyses word co-occurrences via probabilistic modeling. The themes themselves consist of co-occurrence patterns of key terms and serve for automatic classification of the context units in this thesis.

Key context of thematic words

An influential subsection of thematic analysis is the operation of selecting the key context of thematic words. During this process, the elementary contexts which contain multiple co-occurrences of words and the words linked to them in the particular thematic field are presented in the form of two correlating tables. The provided output is presented in HTML format.

Thematic clustering

As argued by Thomas Tinkham, recent psychological research suggests that vocabulary organized thematically is more easily learnable than vocabulary organized by traditional semantic and syntactic clustering. According to the results of his research, semantic clustering (eye, nose, ear) serves as a hindrance, while thematic clustering (frog, pond, green) serves as a facilitator of new language vocabulary learning.

“Delving into the psychology research of the first half of this century, one finds an enormous body of literature dedicated to the study of “Interference” and how this phenomenon affects learning and memory. In essence, “interference theory” postulates that as the similarity between information intended to be learnt and information learnt either before or after that information increases, the difficulty of learning that information also increases. Hundreds of studies supporting and expanding the theory have produced findings which would lead readers to predict that near-simultaneous presentation of semantically and syntactically similar words, e.g., knife, fork and spoon, would impede rather than facilitate the learning of all three.” (1997, 140)

Based on this theory and the capability of the T-LAB software to perform such analysis, thematic clustering, rather than semantic clustering, was selected for the purposes of this thesis. Cluster analysis requires previous execution of correspondence analysis and for this reason it is listed as a following type of analysis in the synopsis. Cluster analysis served to detect groups of lexical units which share two complementary features: high internal homogeneity and high external heterogeneity.

The first thematic analysis applied to the data was the thematic analysis of elementary contexts, which produced a representation of the corpus content in significant thematic clusters. These thematic clusters each consist of elementary contexts such as sentences or paragraphs and are characterized by the same patterns of key words. They are also described through lexical units (i.e. words, lemmas) and the most characteristic variables of the context units in which they are present.

5.3.6 Comparative analysis

As the heading suggests, the comparative analysis part of the thesis is dedicated to finding similarities and differences between units of a certain context, which in the case of this study were the individual corpora of the Project fourth edition coursebooks 1-5, followed by their comparison with the Oxford 3000 list corpus.

Cluster analysis

Cluster analysis is performed on the basis of the results of the previous comparative analysis and presents a list of words and lemmas which share certain similarities and are connected to one another within a given cluster. The number of clusters and the internal connections and relations of lemmas in these clusters are automatically detected by the software; however, the final number of clusters is confirmed by the researcher.

5.4 Oxford 3000 corpus

The Oxford 3000 word list was used as a selection of keywords for comparison with each individual corpus created from the Project coursebook data. The Oxford 3000 list of words is only available online, in two forms. The first is an interactive alphabet with entries consisting of related words assigned to each letter. The alternative form is a downloadable PDF file which contains the whole Oxford 3000 list with an introduction. The PDF file was converted into a Word document (.doc) and, due to the size of the document, the entries were left in the form in which they are presented in the list, e.g. about adv., prep., and imported into the T-LAB software in this form. The output data, however, showed that T-LAB analyzed the grammatical categories as words, and therefore the Oxford 3000 list was unusable for the purposes of this study in this form. As a result, the list consisting of 140 pages was manually altered into a form which contained only words and phrase words without any additional information.
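The list was cleaned by hand in this study. Purely as an illustration of the problem, the following Python sketch shows how trailing grammatical labels could be stripped from entries of the form about adv., prep. The set of labels and the function name are assumptions and do not cover every label used in the actual list; the sample entries other than about are illustrative.

```python
import re

# Assumed, incomplete set of grammatical labels that may follow a headword.
POS_LABELS = r"(?:n|v|adj|adv|prep|conj|pron|det|exclamation|number|modal v|auxiliary v)"

# One or more labels, separated by commas, at the very end of the entry.
LABEL_RUN = re.compile(rf"\s+{POS_LABELS}\.?(?:\s*,\s*{POS_LABELS}\.?)*\s*$")

def strip_labels(entry):
    """Remove the trailing run of grammatical labels from a word list entry."""
    return LABEL_RUN.sub("", entry.strip())

entries = ["about adv., prep.", "according to prep.", "light n., adj., v."]
cleaned = [strip_labels(e) for e in entries]
```

Because the pattern is anchored to the end of the entry, multi-word items such as according to keep all of their words and lose only the labels.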

6 Output interpretation

This section of the thesis analyzes the data produced by the T-LAB software in the form of tables and graphs and draws inferences about their meanings and the relations between them. The program allows the data to be exported in the following formats: .DAT, .TXT, .XLS, .HTML. The data presented in the result section and the electronic attachment of this thesis include files in all of the abovementioned formats, so that they can be imported and re-elaborated with any software compatible with the particular file formats. This part furthermore presents the findings and conclusions of the questionnaires given to each English teacher at ZŠ náměstí Míru Nový Bor.

6.1 Questionnaires

Contrary to expectations, the results of the questionnaire did not provide a clear outcome, even though the answers to the first two questions were unanimous. The completed questionnaires provided no convincing evidence that vocabulary is presented to students in one particular way. For the first question, which targeted the way in which vocabulary is presented to students, the entire teaching staff chose the option of selecting all three possibilities presented in the questionnaire (thematic, semantic and unit-related grouping of vocabulary).

To clarify the provided data, the head of the English department of ZŠ náměstí Míru, Mgr. Martina Brychová, was consulted, and her explanation provided a key to interpreting the answers from the questionnaire. Mgr. Brychová comments on the findings of the questionnaire in the following words:

“When it comes to me personally, I do not ask students to write down vocabulary in my language classes at all. We learn from context and not from individual vocabulary items. And if so, and that is mostly in the primary school, I use different processes and methods, I do not stick to just one of them to make the lessons more diverse. For this reason, I use all of them, not at the same time but according to the needs of currently discussed topic. For example for parts of the body I use semantic heading Parts of the body or Clothes and for holidays together with travelling we have thematic heading Holidays.”

According to the results of the questionnaire, teachers solve the problem of presenting vocabulary to students by dividing the vocabulary content into smaller units. None of the teachers found any of the vocabulary presented in the Project coursebooks redundant.

6.2 Project 1

The analysis of the fourth edition of Project coursebooks series begins chronologically with the first book of the series, Project 1 student’s book. The analysis steps applied to this coursebook were duplicated in the analysis process of the following coursebooks. Project 1 student’s book presents 84 numbered pages of content for students which were transformed into 52 individual files based on the distribution of individual units and parts of the coursebook.

6.2.1 Word list derived analysis

The entire corpus contains 1778 distinct words with 17091 occurrences (tokens) and 618 hapaxes, collected from 52 texts containing 1641 elementary contexts. The most frequent lemma in the corpus is the article the with 1035 occurrences, followed by the article a with 699 occurrences. The most frequent conjunction is and with 524 instances of use, and the most frequent verb is be in its third person singular form, is (356 instances). This is an expected result, consistent with the frequency of these words in large scale corpora.
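The figures reported here (distinct words, tokens, hapaxes and the most frequent items) are standard word list measures. A minimal sketch of how they can be derived from a tokenised text is given below; the function name and the toy sentence are hypothetical, and T-LAB's own tokenisation and lemmatization are not reproduced.

```python
from collections import Counter

def word_list_stats(tokens):
    """Summarise a token list: distinct words, tokens, hapaxes, top item."""
    freq = Counter(tokens)
    return {
        "types": len(freq),                                      # distinct words
        "tokens": sum(freq.values()),                            # all occurrences
        "hapaxes": sum(1 for c in freq.values() if c == 1),      # words occurring once
        "most_frequent": freq.most_common(1),
    }

stats = word_list_stats("the cat and the dog and the bird".split())
```

A hapax is a word that occurs exactly once in the corpus; in the toy sentence these are cat, dog and bird.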

6.2.2 Co-occurrence analysis

Word associations

Elementary context analysis with the occurrence threshold set to 1 produced a list of 101 words. This list was checked and personal names were omitted, producing a final list of 94 words with available co-occurrences. The word with the most occurrences was listen, with 174 occurrences in the corpus. This is due to its frequent use in instructions to various exercises. Play followed in second place with 74 occurrences, and the following words scored above 60: town, look, complete, picture, read, question, name and school. The most frequently occurring multiword items were the contracted forms it’s, with 104 instances, and I’m, with 60 instances in the corpus.

As an example of associated lemmas, a representative sample of items with a coefficient of 0.20 and above was chosen and is presented in the form of a table. The reading keys to the table are as follows:

LEMMA_B = lemma associated with listen

COEFF = value of the selected index

CE_B = total amount of elementary contexts that contain selected lemma

CE_AB = total amount of elementary contexts where lemmas A and B are associated

CHI2 = chi square value concerning the co-occurrence significance

(p) = probability associated with the chi square value

[Illustration not included in this excerpt]

Table 1: Project 1 - Listen associated lemmas
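The relation between the columns of such a table can be sketched in Python. The sketch assumes that the index is the cosine coefficient (named in this thesis for the co-word analysis), that CHI2 is the uncorrected Pearson chi square of the 2 × 2 table of context counts with one degree of freedom, and that the total number of contexts and the contexts containing lemma A are known; the parameter names n_contexts and ce_a and the numbers in the example are hypothetical, and T-LAB may compute its values differently.

```python
import math

def association(n_contexts, ce_a, ce_b, ce_ab):
    """Return (cosine, chi2, p) for two lemmas from their context counts."""
    # 2 x 2 contingency table of elementary contexts.
    a = ce_ab                              # contexts with both A and B
    b = ce_a - ce_ab                       # contexts with A only
    c = ce_b - ce_ab                       # contexts with B only
    d = n_contexts - ce_a - ce_b + ce_ab   # contexts with neither
    cosine = ce_ab / math.sqrt(ce_a * ce_b)
    chi2 = n_contexts * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail probability of the chi square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return cosine, chi2, p

# Hypothetical counts: 100 contexts, A in 20, B in 10, both in 8.
cosine, chi2, p = association(100, 20, 10, 8)
```

With these hypothetical counts the chi square value is 25, far above the value expected if the two lemmas were distributed independently of each other.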

As the table shows, the combination of listen and repeat is the most frequently used sequence of words in Project 1 student’s book. This combination appears 37 times throughout the entire book. The remaining words associated with the word listen are presented in the form of a radial diagram.

[Illustration not included in this excerpt]

Figure 1: Project 1 – Listen associated lemmas

The further a word is located from the center formed by the word listen, the less frequently it occurs together with that word in the whole corpus. The figure clearly shows that the words most strongly associated with the word listen are words used mostly in instructions for various exercises, such as repeat, read, hear, speak and check.

Co-word analysis

The mapping method was applied to 50 items automatically selected by the T-LAB software from a list of 284 key terms. The maximum capacity for this selection is 100 items; however, for the purposes of this study the smaller sample provided enough evidence. The maximum number of key terms within each cluster was set to 8 by default. The association index used was cosine. From this list, 21 items were removed because they represented proper nouns such as names of countries and people, some with a frequency as high as 33 occurrences in the corpus.

As expected, the most frequently used lemma, listen, also has the most co-occurrences, reaching 196 items. Next in order are the lemmas play and town with 93 related words. The distribution is best demonstrated by an MDS map using Sammon’s method.

[Illustration not included in this excerpt]

Figure 2: Project 1 – Co-word analysis

This diagram also provides an insight into the thematic categories which surround certain words and their distribution across the student’s book. All lemmas marked with the same color are connected, and proximity on the axis to lemmas from another color category also represents a connection between them. This is best demonstrated by the lemmas day, car, number, word, project, speak and partner, which overlap into another segment of the diagram. The full extent of individual association indexes can be reviewed in the CSV file attached to this thesis.

Sequence analysis

The sequence analysis was also carried out on the selected set of 284 key words, without allowing loop cases (i.e. successor = predecessor). In terms of sequence analysis, the lemma listen is connected to 193 items. The lemmas play, town, name and picture followed with 92, 89, 77 and 69 occurrences in the list of key words respectively. The most frequent predecessor of listen is read, while the most commonly used successor is repeat.
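Counting predecessors and successors of a key word, with loop cases excluded, can be sketched as follows. The sketch assumes that words outside the key word list are dropped before adjacent pairs are counted, which may differ from the T-LAB procedure; the function names and the toy sequence are hypothetical.

```python
from collections import Counter

def transitions(sequence, key_terms):
    """Count ordered pairs of adjacent key words, skipping loop cases."""
    filtered = [t for t in sequence if t in key_terms]
    counts = Counter()
    for predecessor, successor in zip(filtered, filtered[1:]):
        if predecessor != successor:  # loop cases are not allowed
            counts[(predecessor, successor)] += 1
    return counts

def neighbours(counts, lemma):
    """Return the predecessor and successor frequencies of one lemma."""
    predecessors = Counter({p: c for (p, s), c in counts.items() if s == lemma})
    successors = Counter({s: c for (p, s), c in counts.items() if p == lemma})
    return predecessors, successors

# Toy sequence of lemmas.
text = "read and listen listen and repeat read and listen then repeat".split()
counts = transitions(text, {"read", "listen", "repeat"})
predecessors, successors = neighbours(counts, "listen")
```

In the toy sequence the most frequent predecessor of listen is read and its most frequent successor is repeat, the same pattern as the one reported above for Project 1.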

[Illustration not included in this excerpt]

Figure 3: Project 1 – Sequence analysis

Concordances

The T-LAB software offers two concordance options, one carried out on words and another on lemmas. The concordance analysis was applied to words with a minimum threshold value of 4 and a maximum value set by the occurrence number of the highest scoring item, 173. The total number of words that met the set requirements was 519. Consistent with the previous types of analysis, the word with the highest occurrence in the concordance list was listen.

Each of the items in concordance list is connected with automatically generated concordances in a form of full sentences in which the particular word was used and annotation of the text of origin. In the case of listen, the data provide a confirmation of the results of previous investigations which suggest that listen is mostly used in instructions for exercises throughout the coursebook. Following are three examples of the full sentences using the word listen.

- Read and LISTEN. Write the short forms.
- LISTEN again and repeat.
- LISTEN again and do the actions.

In the case of concordance analysis applied to the list of lemmas, the number of items that met the criteria was 488. The minimum and maximum thresholds were set on the same principle as in the word analysis, in this case producing the values 4 for the minimum and 196 for the maximum. The lemma with the most occurrences matched the result of the analysis applied to words: listen. However, the order of the following items differed slightly between the two lists.

[Illustration not included in this excerpt]

Table 2: Project 1 – Concordances words lemmas

The data in table 2 reveal that only one item in each list has no counterpart in the other list. In the word list this item is school; in the lemma list its place is taken by picture.

6.2.3 Thematic analysis and data mining

The preset methods and criteria selected for the document clustering were unsupervised clustering for a bottom-up analysis (bisecting K-means), with the maximum number of thematic clusters to be obtained set to 10 and the minimum value of co-occurrences within the context units set to 2. The sparse matrix and word indexes with input data produced in this process were exported as a CSV file in the attachment of the thesis. As in the previous types of analysis, 52 documents and 284 key terms were used for this analysis.
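The principle of bisecting K-means can be sketched in plain Python: all items start in one cluster, and the cluster with the largest internal error is repeatedly split in two until the requested number of clusters is reached. The sketch is an illustration under stated assumptions, not the T-LAB implementation: Euclidean distance, deterministic seeding and two-dimensional toy points standing in for document vectors are all choices made for this example.

```python
def mean(points):
    """Centroid of a list of equally long coordinate lists."""
    return [sum(column) / len(points) for column in zip(*points)]

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    centre = mean(points)
    return sum(sq_dist(p, centre) for p in points)

def two_means(points, iterations=20):
    """Split points into two groups with plain 2-means (deterministic seeds)."""
    c1 = max(points, key=lambda p: sq_dist(p, points[0]))  # far from the first point
    c2 = max(points, key=lambda p: sq_dist(p, c1))         # far from the first seed
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c1) <= sq_dist(p, c2)]
        right = [p for p in points if sq_dist(p, c1) > sq_dist(p, c2)]
        if not left or not right:
            break
        c1, c2 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=sse)
        left, right = two_means(worst)
        if not left or not right:
            break  # the cluster cannot be split any further
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

# Toy data: three visibly separate groups of points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [20, 0], [21, 0]]
clusters = bisecting_kmeans(points, 3)
```

The value of k plays the role of the maximum number of clusters set before the analysis; the procedure stops earlier if no remaining cluster can be split.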


Linguistics, Corpus study, Project coursebooks, Oxford 3000
Quote paper
Karin Dietiová (Author), 2016, How can the use of frequency information from corpora be used in foreign language teaching? A corpus-based study on vocabulary in course books, Munich, GRIN Verlag,

