Lexical Repetition in Academic Discourse

A Computer-Aided Study of the Text-organizing Role of Repetition

Doctoral Thesis / Dissertation, 2016
187 Pages, Grade: A


Table of Contents

1 Introduction
1.1 Background to the study
1.2 Aims of the present research
1.3 An overview of the dissertation

2 Theoretical Framework
2.0 Overview
2.1 Coherence and cohesion
2.1.1 Definitions of coherence and cohesion
2.1.2 Types of cohesion
2.2 Lexical cohesion and lexical repetition
2.2.1 Categories of lexical cohesion
2.2.2 Lexical chains or a lexical net?
2.3 Hoey’s (1991) Repetition Model
2.3.1 The theoretical background of the model
2.3.2 Hoey’s (1991) taxonomy of lexical repetition
2.3.3 Links and bonds creating a lexical net
2.3.4 The steps of the analysis
2.3.5 Applications of Hoey’s (1991) model
2.3.6 Inconsistencies within Hoey’s (1991) model
2.3.7 The link triangle and the mediator missing
2.3.8 The questions of anaphora resolution
2.3.9 Prescriptiveness or descriptiveness of the model
2.4 Károly’s (2002) Repetition Model
2.4.1 Károly’s (2002) taxonomy of lexical repetition
2.4.2 Károly’s (2002) method of analysis
2.4.3 Károly’s empirical investigation
2.4.4 A corpus-based investigation using Károly’s (2002) taxonomy
2.5 Summary

3 Methodological background: the academic writing context
3.0 Overview
3.1 The nature of academic discourse
3.1.1 General features of English academic discourse
3.1.2 The types of writing tasks required at university
3.1.3 Disciplinary differences in academic discourse
3.1.4 Implications for language pedagogy
3.1.5 Independent vs. integrative writing tasks
3.2 Task variables influencing academic discourse quality
3.2.1 The classification of variables in academic writing
3.2.2 Contextual variables of integrated academic discourse quality
3.2.3 Cognitive variables of integrated academic discourse quality
3.2.4 Summary writing as a complex task
3.2.5 Writing a compare/contrast essay
3.3 Assessing academic discourse
3.3.1 ‘Traditional’ and recent academic essay assessment practices
3.3.2 Validity in L2 academic writing assessment
3.3.3 Generalizability of judgement on academic discourse quality
3.3.4 Reliability of perceived discourse quality
3.3.5 Text quality requirements by course teachers
3.3.6 Explicit instruction on coherence, cohesion and lexical repetition in higher education
3.3.7 Automated assessment of text quality
3.3.8 Controversial views on the automated assessment of essay quality
3.4 Summary

4 Aims and Research Questions

5 Research design and procedures of analysis
5.1 A sequential mixed design
5.2 Stage 1: Analysis of academic summaries
5.2.1 The summary writing task
5.2.2 Corpus size and representativity
5.2.3 Context validity evaluation of the summary writing task
5.2.4 Features of the input text
5.2.5 Quality assessment of the corpus
5.2.6 Methods of data analysis in Stage 1
5.3 Stage 2: Analysis of compare/contrast essays
5.3.1 The compare/contrast essay writing task
5.3.2 Quality assessment of the corpus
5.3.3 Methods of data analysis in Stage 2

6 Results of the lexical repetition analysis of academic summaries
6.1 General features of the summaries
6.2 Results related to repetition type
6.3 Results related to the combination of links and bonds
6.4 Methodological outcomes
6.5 Summary

7 Results of the lexical repetition analysis of compare/contrast essays
7.1 General features of the compare/contrast essays
7.2 Results related to repetition type
7.3 Results related to the combination of links and bonds
7.4 Features not detected
7.5 Methodological outcomes with automation in mind
7.6 Summary

8 The design of a new LRA model for large-scale analysis
8.1 The newly proposed LRA model: the three modules of the analysis
8.2 Phase 1: Preparation of the corpus
8.2.1 Plagiarism check
8.2.2 L2 special corpora treatment / Error treatment
8.2.3 POS tagging
8.2.4 POS tagging for lower level L2 texts
8.2.5 Using WordNet with the existing taxonomy
8.2.6 Using WordNet with errors in a learner corpus
8.3 Phase 2: Finding links
8.3.1 Theoretical considerations: altering the taxonomy
8.3.2 Introducing the concept of ‘key term’ into the coding process
8.3.3 Lexical unit identification in the case of multiword units
8.4 Special use of the model for academic summary writing
8.5 Visual representation of links and bonds
8.6 Connecting the new LRA model to a cognitive framework
8.7 The scope and limitations of the new LRA model

9 Conclusions
9.1 Summary of main results
9.2 Pedagogical implications
9.3 Limitations
9.4 Terminology issues
9.5 Suggestions for further research




Because of its diverse functions and the divided attitudes towards it, lexical repetition is an aspect of cohesion that creates difficulty for raters assessing L2 academic written discourse. Current computer-aided lexical cohesion analysis frameworks built for large-scale assessment fail to take into account where repetitions occur in a text and what role their patterns play in organizing discourse. This study intends to fill this gap by applying a sequential mixed methods design, drawing on Hoey’s (1991) theory-based analytical tool devised for the study of the text-organizing role of lexical repetition, and on its refined version, Károly’s (2002) lexical repetition model, which was found capable of predicting teachers’ perceptions of argumentative essay quality with regard to content and structure. The study first tests the applicability of these models to assessing the role of lexical repetition in the organization of other academic genres, and then proposes a more complex, computer-aided analytical instrument that may be used to assess discourse cohesion directly through the study of lexical repetition.

To test the applicability of Károly’s model to other academic genres, two small corpora of thirty-five academic summaries and eight compare/contrast essays were collected from English major BA students at a Hungarian university. The lexical repetition patterns within the corpora were analyzed manually in the case of the summaries, and partially with a concordance program in the case of the compare/contrast essays. The findings revealed that in both genres lexical repetition patterns differed between high- and low-rated texts.

Given that in its present form the model cannot be applied to large-scale corpora, in the third stage of the research a computer-aided model was designed for large-scale lexical repetition analysis. First, drawing on the theoretical, empirical and methodological results gained from the corpora, several new analytical steps were proposed and built into a modular format. Next, in order to better align the new computer-aided analysis with its manual version, parallel processes were identified between the new analytical model and an existing socio-cognitive framework. The newly proposed model may help teachers assess discourse cohesion, and it can also serve as a self-study aid by visualizing the lexical net created by semantic relations among sentences in a text.

List of Tables

Table 1. Types of cohesive devices in Halliday and Hasan (1976) with the researcher’s examples

Table 2. The changes of lexical cohesion taxonomies based on Halliday and Hasan’s (1976) and Hasan’s (1984) models

Table 3. Types of repetitions based on Hoey’s (1991) taxonomy

Table 4. Types of lexical relations in Károly’s taxonomy with examples (examples based on Károly, 2002, p. 104, and these two corpora)

Table 5. Variables with the strongest predictive power in Károly’s (2002) lexical repetition analysis research

Table 6. The difference in explicitness caused by phrasal vs. clausal modification (Biber & Gray, 2010)

Table 7. Features of task setting and features of input text within contextual variables (based on Weir’s (2005) socio-cognitive framework)

Table 8. Cognitive variables involved in integrative academic discourse based on Chan (2013) and Chan, Wu, & Weir (2014)

Table 9. The differences between writing a summary ‘essay’ vs summarizing

Table 10. Whole text summary task and guided summary task questions for idea selection based on Tankó, 2012, p

Table 11. Mental processes during summarization as suggested by four models

Table 12. The constructs, their divisions, and the cognitive processes involved in the summary task of the research (Stage 1)

Table 13. Assessing written academic discourse

Table 14. Subscales and descriptions of the IELTS Writing Module (based on Cambridge IELTS webinar, 26 February 2014)

Table 15. University teacher’s requirement (based on Moore & Morton, 1999) analyzed from a language pedagogy angle

Table 16. The various uses of Coh-Metrix in analyzing L2 writing

Table 17. Summary of the intended approach of data collection and analysis

Table 18. Basic features of the summary task

Table 19. The features of task setting in Stage 1

Table 20. Features of the input text for Stage 1 summary task

Table 21. The constructs, their divisions, and samples (Stage 1)

Table 22. Overview of assessments, methods and their aims in Stage 1

Table 23. Summary of the categories of repetition (based on Károly, 2002)

Table 24. The number of bonds pointing backward and forward within Text

Table 25. Compare/contrast essay evaluation: the organization and discourse control rubrics

Table 26. Károly’s (2002) taxonomy for LRA with examples from the compare/contrast essay corpus

Table 27. The summary construct

Table 28. The frequency of types of repetition in high- and low-rated summaries

Table 29. The difference between the mean frequencies of SUR and DUR in high- and low-rated summaries

Table 30. Comparison of results related to lexical repetition types between Károly’s (2002) argumentative essay corpus and Stage 1 summary corpus

Table 31. Frequency of links and bonds and density of bonds

Table 32. Possible lexical repetition patterns in one or multiple paragraph summaries

Table 33. The frequency of types of repetition in high- and low-rated essays (SUR = same unit repetition; DUR = different unit repetition)

Table 34. Comparison of results related to lexical repetition types between Károly’s (2002) argumentative essay corpus and Stage 2 compare/contrast essay corpus

Table 35. Two interpretations of a learner sentence and their analyses (based on Dickinson & Ragheb, 2013)

Table 36. Proper names as possible topics in four disciplines

Table 37. The contextualization of the new LRA model in this study: the parallel processes between the model and the cognitive processes during reading (Khalifa & Weir, 2009), with explanations.

List of Figures

Figure 1. Illustration of a semantic network for business trips (based on Grimm, Hitzler, & Abecker, 2005, p. 39) Nouns represent the concepts (in rectangles), the arrows specify the relationships between the concepts

Figure 2. Visual representation of a gene ontology within the field of biology (based on the online training material of the European Bioinformatics Institute)

Figure 3. Topical Structure Analysis indicating semantic links between sentences (Lautamatti, 1987, p. 102)

Figure 4. A net and a chain of lexical repetition in two studies (Hoey, 1991, p. 81; and Barzilay & Elhadad, 1999, p. 116)

Figure 5. The link triangle (Hoey, 1991, p. 65)

Figure 6. The General—Particular relationship in text (Hoey, 1995, p. 135)

Figure 7. Synonyms offered for the word bank in WordNet

Figure 8. Three paragraphs of a sample compare/contrast essay indicating some of the lexical repetition links (adjectives/adverbs). Text: Oxford Advanced Learners’ Dictionary 8th ed. (OUP, 2008)

Figure 9. A priori (before the test) components of Weir’s (2005) validation framework

Figure 10. Weir’s (2005) whole socio-cognitive framework

Figure 11. The Cognitive Process Model of the Composing Process (Flower & Hayes, 1981, p. 370)

Figure 12. The notion of comparison (Mitchell, 1996)

Figure 13. Suggested outline for the point-by-point pattern (based on the online academic writing guidelines of Humber University)

Figure 14. The block-by-block pattern

Figure 15. The hook-and-eye technique connecting major thoughts on discourse level (Creswell, 2007, p. 59)

Figure 16. A detail of the repetition matrix of Text 3, itemized and classified

Figure 17. The matrix showing the number of links between each sentence of Text

Figure 18. The span of bonds in Text

Figure 19. The strength of connection between bonded sentences in Text

Figure 20. A detail of the headword frequency list (Text 4). Headwords are listed according to occurrence. (N.= number, % = the percentage of the occurrence in the text)

Figure 21. Another detail of the headword frequency list (Text 3). The numbers represent the sentences in which the words occur. (No.1 = the title)

Figure 22. The span of bonds in Text 1. The two figures (e.g., 2-3) show the two sentences connected by bonds

Figure 23. The three big modules of the new LRA model

Figure 24. The steps of the new LRA model

Figure 25. Teich and Fankhauser’s lexical repetition analysis links in text view (2005)

Figure 26. Visual representation of topic sentence and conclusion sentence identified as central by LRA (text based on Oshima & Hogue, 2006)

1. Introduction

1.1 Background to the study

Cohesion and coherence in text have become widely researched areas within the field of discourse analysis, and a great deal of attention has been given to the subjects of lexical cohesion and lexical repetition due to their significant discourse function (e.g., Halliday, 1985; Halliday & Hasan, 1976; Hoey, 1991; Reynolds, 1995, 2001; Tyler, 1994, 1995). Discourse is “a unit of language larger than a sentence and which is firmly rooted in a specific context” (Halliday & Hasan, 1990, p. 41). Lexical cohesion was defined by Hoey (1991) as “the dominant mode of creating texture” because it is “the only type of cohesion that regularly forms multiple relationships” in text (p. 10). He called these relationships lexical repetition, using repetition in a broader sense, referring not only to reiterations but also to various other forms of semantic relatedness, such as synonyms, antonyms and meronyms, as well as other paraphrases.

Based on Halliday and Hasan’s (1976) empirical investigation of cohesive ties in various text types, Hoey (1991) concluded that lexical cohesion accounted for at least forty percent of the total cohesive devices. In a more recent corpus linguistic study, Teich and Fankhauser (2004) claimed that nearly fifty percent of cohesive ties consist of lexical cohesion devices (p. 327), thus making lexical cohesion the most pronounced contributor to semantic coherence.

The study of lexical cohesion is relevant to language pedagogy because what and how to repeat in English written text causes disagreement among native and non-native language users alike. Most teachers would agree with Connor (1984), for instance, who found that in students’ texts repeated words were a sign both of limited vocabulary and of poor text structuring. The problem is more complex, however, because lexical choice depends not only on language proficiency but on various other factors as well. According to Reynolds (2001), for example, the lexical repetition writers use varies with (1) the writing topic, (2) their cultural background, and (3) the development of their writing ability, the third being the most decisive factor. Myers (1991) also found that scientific articles generally require more reiteration than popular articles because exact concepts in this field cannot be replaced by synonyms. What should and should not be repeated in academic writing is therefore context-dependent, and this complexity calls for more research into lexical cohesion in general, and into texts produced by language learners on various topics and in various genres in particular.

Lexical cohesion is studied both in text linguistics (discourse analysis) and in corpus linguistics, two terms that cover related but distinct approaches to the study of text. In discourse analysis, the various cohesive devices are first categorized according to semantic relatedness criteria and a theoretical framework is built, which is later tested on a small number of texts. Lexical repetition patterns are analyzed quantitatively and manually (e.g., the researcher counts how many times certain categories are represented in the text) as well as qualitatively (e.g., conclusions are drawn by observing the types, location and lexical environment of repeated words). The main problem with this type of analysis is that only a small number of texts can be observed; the data gained therefore do not permit generalizations.

The other approach to lexical cohesion analysis is offered by corpus linguistics, which allows for the automated analysis of large amounts of linguistic data. A disadvantage of this method is that individual differences within texts in a corpus cannot be observed. Reviewing best practice in text-based research, Graesser, McNamara, and Louwerse (2011) maintain that the recent shift in discourse analysis is characterized by moving from “theoretical generalizations based on empirical evidence observing a small corpus to large-scale corpus-based studies” (p. 37), and the results have changed “from deep, detailed, structured representations of a small sample of texts to comparatively shallow, approximate, statistical representations of large text corpora” (p. 37).

Several manual and computer-aided methods exist for analyzing lexical features in text. Of particular interest are frameworks capable not only of identifying and classifying linguistic elements but also of providing information on their patterns and their roles in structuring text. Hoey’s (1991) theory-based analytical tool was the first framework to offer a manual analytical method for studying the text-structuring role of lexical repetition. This framework explores the semantic network of a text (its links, bonds and lexical net) and distinguishes between central and marginal sentences by finding lexical repetition patterns. With this method it is possible to summarize certain types of discourse.
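The counting skeleton behind such an analysis can be sketched in a few lines. The following toy Python fragment is my own illustration, not Hoey’s or any published implementation: it counts shared content words between every sentence pair as “links” and treats pairs with three or more links as “bonded”. In the full model, synonyms, antonyms and other paraphrases would also count as links, and sentences participating in many bonds would be the candidate central sentences.

```python
# A reductive sketch of the link/bond idea in a Hoey-style repetition
# analysis. Only simple word repetition is counted as a link here; the
# published taxonomies recognize many more repetition types.

from itertools import combinations

BOND_THRESHOLD = 3  # three or more links between two sentences form a bond

STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "it"}

def content_words(sentence):
    """Lowercase the sentence, strip punctuation, drop function words."""
    return {w.strip(".,;:").lower() for w in sentence.split()} - STOPWORDS

def links_and_bonds(sentences):
    """Count links for every sentence pair and collect the bonded pairs."""
    words = [content_words(s) for s in sentences]
    links = {}   # (i, j) -> number of repeated items shared by sentences i, j
    bonds = []   # sentence pairs connected by BOND_THRESHOLD or more links
    for i, j in combinations(range(len(sentences)), 2):
        shared = words[i] & words[j]
        links[(i, j)] = len(shared)
        if len(shared) >= BOND_THRESHOLD:
            bonds.append((i, j))
    return links, bonds
```

On this simplified view, a sentence’s centrality is just the number of bonds it participates in; the real taxonomy is far richer, but the pair-wise counting structure is the same.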

Hoey’s (1991) comprehensive analytical model was later revised by Károly (2002) who made significant changes in the categories. Károly also extended the model by introducing several new analytical steps to reveal the organizing function of lexical repetition in texts. Hers was the first application of Hoey’s model in a Hungarian higher education setting. Károly’s (2002) research results showed that her theory-driven ‘objective’ analytical tool not only offered a descriptive function, but with her analytical procedures the tool was capable of predicting the ‘intuitive’ assessment of teachers judging argumentative essay quality with regard to its content and structure.

Given that in holistic scoring teachers assign more weight to content and organization than to any other component (Freedman, 1979), and given that these two components comprise the concepts of cohesion and coherence, which are responsible for textuality (Halliday & Hasan, 1976), it is of little surprise that lexical repetition analysis (LRA) can detect the difference between valued and poor writing. The results of Károly’s (2002) analysis showed that the texts, which had previously been judged by experienced university instructors, differed significantly in both repetition types and patterns. Post-tests conducted with another group of teachers confirmed these findings, indicating that the analytical measures devised can reliably predict how teachers perceive essay quality, and that the results may be generalized to a wider sample.

Hoey (1991) found predictable lexical repetition patterns in news articles, whereas Károly (2002) studied the academic argumentative essay genre in this respect. Because the summary and the compare/contrast essay are the two most commonly used genres across the disciplines at universities (Bridgeman & Carlson, 1983; Moore & Morton, 1999), these integrative (reading-into-writing) tasks also deserve such thorough investigation. Research therefore needs to be extended to the predictive power of Károly’s model in the genres most likely faced by EFL (English as a Foreign Language) students across universities in Hungary. At present no study applies Károly’s (2002) lexical repetition analysis (LRA) model to the genres of the academic summary and the compare/contrast essay. Nor does a tool exist with a similar theoretical basis that can be applied to large-scale corpora using the same method for other academic genres.

Large-scale essay assessment applications (such as E-rater and Intelligent Essay Assessor) have been in use for decades to test essay writing skills in the EFL context. However, these applications were developed by major testing agencies and are not available to the public. These essay scoring programs measure cohesion and coherence quantitatively, using statistical methods and natural language processing (NLP) techniques. Their methods focus on cohesion and coherence at the local level, mainly by comparing adjacent sentences semantically, on the assumption that words in adjacent sentences form semantic chains which can be identified for topic progression. This can be called the lexical chain principle. These chains, however, are linear in nature and indicate cohesion only at the local level, whereas discourse also shows global cohesion. If a text is considered to create (or to be created by) lexical nets, it is necessary to observe semantic links between all the sentences in the text, even those located far from each other; in other words, the lexical chain principle must be replaced by the lexical net principle. Károly’s (2002) lexical repetition analysis tool, which was based on Hoey’s (1991) Repetition Model, can “measure” discourse cohesion, yet in its present form it is not fit for large-scale application.
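The difference between the two principles can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not the algorithm of any actual scoring engine: plain word overlap stands in for the full repetition taxonomy, and the only thing that changes between the two functions is which sentence pairs are compared.

```python
# Chain principle vs. net principle, reduced to the set of sentence pairs
# compared. overlap() is a crude stand-in for a real repetition taxonomy.

def overlap(s1, s2):
    """Words shared by two sentences (lowercased, punctuation stripped)."""
    w1 = {w.strip(".,").lower() for w in s1.split()}
    w2 = {w.strip(".,").lower() for w in s2.split()}
    return w1 & w2

def chain_pairs(sentences):
    """Lexical chain principle: only neighbouring sentences are compared."""
    return {(i, i + 1): overlap(sentences[i], sentences[i + 1])
            for i in range(len(sentences) - 1)}

def net_pairs(sentences):
    """Lexical net principle: every pair of sentences is compared."""
    return {(i, j): overlap(sentences[i], sentences[j])
            for i in range(len(sentences))
            for j in range(i + 1, len(sentences))}
```

For a text whose opening and closing sentences repeat the topic word, the net view records that long-distance link, while the chain view, by construction, never compares the two sentences at all.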

1.2 Aims of the present research

With the above in mind, the purpose of the study is

(1) to extend the use of Hoey’s (1991) and Károly’s (2002) lexical repetition models to the academic summary and the compare/contrast essay genres by analyzing two Hungarian EFL university student corpora;
(2) to test whether Károly’s (2002) analytical tool can predict teachers’ judgements regarding discourse quality in the case of these two genres as well;
(3) to adapt this analytical tool to enable large-scale analysis of EFL student corpora; and
(4) to design the steps and modules necessary for a computer-assisted lexical repetition analysis.

The main questions guiding this research are therefore the following:

(1) Is Károly’s (2002) theory-based lexical repetition model, a revised version of Hoey’s (1991) Repetition Model, applicable to the study of summaries and compare/contrast essays written by Hungarian EFL university students?
(2) What modifications are needed in Károly’s (2002) theory-based lexical repetition model to be applicable to large-scale EFL learner corpora?

The study uses a mixed methods design including both qualitative and quantitative methods, as suggested by Creswell (2007), and a sequential mixed design paradigm, as described by Tashakkori and Teddlie (2003). The rationale behind combining qualitative and quantitative methods is that, according to Tyler (1995) and Károly (2002), quantitative analysis alone cannot inform research about the real role of lexical repetition in organizing discourse. In the first stage of this study, the model is applied to the summary genre. The second stage utilizes results gained from the first stage and continues to test the model on compare/contrast essays. In this second stage, a concordance analyzer is introduced at the initial step of the analysis. In the third stage, the theoretical, empirical and methodological results of the previous stages form the basis of the design of the new, semi-automated analytical tool. Thus, results gained from each stage inform the next stage of research, in line with the sequential mixed design paradigm.

This study is multidisciplinary in nature, aiming to contribute to the fields of (a) applied linguistics, more closely to discourse analysis and corpus linguistics; (b) language pedagogy, especially to the teaching and evaluating EFL academic writing; and (c) computer science, to enhance educational software development.

1.3 An overview of the dissertation

This dissertation is organized into nine chapters. Chapter 2 surveys the literature on theoretical aspects of coherence, cohesion and lexical repetition. Hoey’s (1991) and Károly’s (2002) repetition models are compared, followed by a description of the major computer-aided applications of these models.

Chapter 3 focuses on the context of academic discourse: from task setting to assessment. First the chapter enumerates the typical written academic genres across disciplines, followed by the description of the task variables influencing discourse quality. The requirements of the summary and compare/contrast essay genres, the two genres investigated in this study, are described next. The final part of the chapter deals with academic discourse assessment, with particular emphasis on teachers’ perceptions of coherence, cohesion and lexical repetition in students’ texts. Basic findings of research into automated summarization and automated essay scoring are also introduced.

Chapter 4 enumerates and explains the research questions. Chapter 5 describes the research design used in this study, introducing the detailed steps of analysis in both Stages 1 and 2 of the research, where Stage 1 aims to test Károly’s (2002) repetition model on academic summaries and Stage 2 extends the model to compare/contrast essays. Chapters 6 and 7 give an account of the results of Stages 1 and 2, respectively. Chapter 8 presents the new computer-aided model designed for the lexical repetition analysis of larger corpora. Chapter 9 summarizes the main findings of the study, and considers the implications, limitations and possible areas of future research.

2 Theoretical Framework

2.0 Overview

The aim of this chapter is to provide a theoretical background to the study of the text-organizing role of lexical repetition, in order to be able to propose a new computer-aided analytical model later on. It offers a brief introduction to the theories behind the two basic concepts of textuality, coherence and cohesion, with special emphasis on lexical cohesion and lexical repetition, giving definitions of how these key terms are used in this paper. After the presentation of Hoey’s (1991) and Károly’s (2002) lexical repetition models, as well as Károly’s (2002) empirical investigation, which is the starting point of this research project, some examples follow of how these models have been applied to large corpora. The strengths and weaknesses of previous analyses are also highlighted so as to fulfil the theoretical, empirical, and methodological aims of the current investigation.

2.1 Coherence and cohesion

2.1.1 Definitions of coherence and cohesion

The complex nature of coherence and cohesion offers grounds for a wide spectrum of interpretations. Károly (2002), in her concise review of the most influential schools of English written text analysis, distinguishes three groups among the various descriptive models of coherence: (1) those focusing on the surface elements of text and their combinations, (2) those defining coherence as an interaction between textual elements and the mind, and (3) those claiming that coherence is created entirely in people’s minds.

The first group includes Halliday and Hasan (1976), and later Halliday (1985), who see coherence as created by surface textual elements (e.g., identity and similarity chains). ‘Interactionalists’, who form the largest of the three groups, such as van Dijk and Kintsch (1983), define coherence as a cognitive interaction between the reader and the textual elements. According to de Beaugrande and Dressler (1981), coherence refers to “how the configuration of concepts and relations which underlie the surface text, are mutually accessible and relevant” (pp. 3-4). Hoey also contends that coherence is a “facet of the reader’s evaluation of a text” (1991, p. 12). More recently, coherence was defined by Crossley and McNamara as “the understanding that the reader derives from the text” (2010, p. 984), its main factors being prior knowledge, textual features, and reading skills (McNamara, Kintsch, Songer, & Kintsch, 1996). The relatively few theoreticians in the third group, such as Sanders, Spooren and Noordman (1993), approach coherence not as a property of discourse but as a mental representation of it in people’s minds (p. 94).

A more consistent approach can be observed towards textual cohesion, because most researchers agree that cohesion is a property of the text. Halliday and Hasan (1976) maintain that cohesion is “the relation of meaning that exists within the text, and that define it as a text” (p. 4). According to their interpretation, cohesion occurs “where the interpretation of some element in the text is dependent on that of another. The one presupposes the other, in the sense that it cannot be effectively decoded except by recourse to it” (ibid.). Thus, cohesion largely (but not exclusively) contributes to coherence.

De Beaugrande and Dressler (1981) see cohesion as one of their six criteria of textuality: cohesion, coherence, intentionality, acceptability, situationality and intertextuality. They claim that cohesion is indispensable for a text to be a text. Enkvist (1990) and Hoey (1991) also see cohesion as a property of text, and therefore as objectively observable. Hoey’s definition is perhaps a little blurred because he concentrates on the textual roles of sentences: “Cohesion may be crudely defined as the way certain words or grammatical features of a sentence can connect that sentence to its predecessors (and successors) in a text” (1991, p. 3). The nature of cohesion might be better captured by focusing on the text instead, as “[c]ohesion refers to the presence or absence of explicit clues in the text that allow the reader to make connections between the ideas in the text” (Crossley & McNamara, 2010, p. 984). Such explicit clues can be “overlapping words and concepts between sentences” or connectives such as therefore or consequently (ibid.).

Widdowson (1978) takes a different approach towards cohesion. He argues that cohesion is neither necessary, nor sufficient for coherence. His famous example for this is the following conversation (p. 29):

A: That’s the telephone.
B: I’m in the bath.
A: O.K.

Even though this short exchange is an example of spoken discourse, where coherence can be detected across turn boundaries (what A says / what B says / what A says), it still demonstrates that coherence can exist without the explicit markers of cohesion. As Widdowson puts it, coherence is by nature interactive, while cohesion is within discourse, which is by nature static (as cited in Fulcher, 1989, p. 148).

Given that there is disagreement in the literature about coherence and cohesion, for the purposes of this research project, the following two similar sets of definitions will be used for cohesion and coherence:

(1) Coherence is “the understanding that the reader derives from the text” (Crossley & McNamara, 2010, p. 984).
(2) “Cohesion refers to the presence or absence of explicit clues in the text that allow the reader to make connections between the ideas in the text” (Crossley & McNamara, 2010, p. 984); and

(1) “[C]oherence is the quality that makes a text conform to a consistent world picture and is therefore summarizable and interpretable.” (Enkvist, 1990, p. 14)
(2) “Cohesion is the term for overt links on the textual surface [.]” (Enkvist, 1990, p. 14)

From these two coherence definitions, the phrases “reader derives from the text”, “quality”, “consistent world picture”, and “interpretable” are the key terms. From the two cohesion definitions “presence or absence”, “overt links”, “explicit clues”, “connections between ideas”, and “textual surface” are the most important terms for this study, because the presence or absence of overt links and explicit clues will be observed first quantitatively, and next qualitatively, in order to interpret readers’ quality judgements derived from the text.

2.1.2 Types of cohesion

Halliday and Hasan (1976) in their analytic model identify the semantic and lexico-grammatical elements which are responsible for creating texture in English. The five categories are reference, substitution, ellipsis, conjunction, and lexical cohesion, shown in Table 1 with examples. The first four are mainly grammatical categories, and as such, they are fairly straightforward. The category of lexical cohesion seems more problematic, with its two subclasses: reiteration and collocation. (This category will be analyzed in Section 2.2.1). The cohesive relation between any two of these lexical elements is called a cohesive tie. These ties form cohesive chains, and the interactions among chains further cause global “cohesive harmony” (Hasan, 1984) in text.

[Table not included in this excerpt]

Table 1. Types of cohesive devices in Halliday and Hasan (1976) with the researcher’s examples

In her later work, Hasan (1984) changed the categories of lexical cohesion considerably, indicating that semantics is an area where items are particularly difficult to classify.

2.2 Lexical cohesion and lexical repetition

Lexical cohesion, lexical organization and their roles in establishing coherence have been the focus of several influential studies (e.g., Halliday & Hasan, 1976, 1985; Hoey, 1991; Reynolds, 1995, 2001; Sinclair, 1998; Tyler, 1992, 1994, 1995). Hoey focused his research on cohesion instead of coherence, on the assumption that markers of cohesion appear in the text as observable features. In his view, the study of coherence is outside the scope of textual analysis because only explicitly manifest data can be analyzed (1991). He maintains that lexical repetition items are suitable for ‘objective’ analysis as they appear on the surface of text as categorizable and countable items.

Lexical cohesion was defined by Hoey (1991) as “the dominant mode of creating texture”, because it is “the only type of cohesion that regularly forms multiple relationships” in text (p. 10), making it unique among cohesive devices. His empirical investigation indicated that lexical cohesion accounted for more than forty percent of the total cohesion devices in the various texts he studied (1991). In a more recent corpus linguistic study it was claimed that nearly fifty percent of a text’s cohesive ties consist of lexical cohesion devices (Teich & Fankhauser, 2004), thus making it the most pronounced contributor to semantic coherence.

A further argument for the relevance of lexical repetition studies is offered by Stubbs (2004) in the Handbook of Applied Linguistics. In his chapter on corpus linguistics, he makes the following observation when describing the importance of word frequency lists: “A few, mainly grammatical, words are very frequent, but most words are very rare, and in an individual text or smallish corpus, around half the words typically occur only once each” (p. 116). Reversing this logic, the statement also implies that the other half of the words in any individual text or smallish corpus occur at least twice. Even without counting further types of lexical repetition, such as repetition by synonyms or antonyms, the number of repeated words in any text is thus impressive and certainly warrants further study (even though the items occurring multiple times are most probably function words, e.g., be, and, I).
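
Stubbs’s observation is easy to check on any text: count the frequency of each word type and take the proportion of types that occur exactly once. The following is a minimal illustrative sketch; the sample sentence is invented, and Stubbs’s figure of course concerns real texts and corpora rather than a toy example:

```python
from collections import Counter

def hapax_proportion(text: str) -> float:
    """Proportion of word types that occur exactly once (hapax legomena)."""
    counts = Counter(text.lower().split())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

sample = "the cat sat on the mat and the dog sat by the door"
print(hapax_proportion(sample))  # 7 of the 9 word types occur only once
```

On real corpora the tokenization would need to handle punctuation and case far more carefully than this sketch does.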

2.2.1 Categories of lexical cohesion

Lexical cohesion plays an important role in creating texture (Halliday & Hasan, 1976): it is the central device that makes the text hang together, defining the “aboutness” of text (ibid., Chapter 6). It is interesting to observe the shifts of categories within lexical cohesion in two different models: Halliday and Hasan’s 1976 model and Hasan’s 1984 revision. Table 2 offers a comparison of these two models.

It is noteworthy that exact repetition of a word comes first in both models, although in Halliday and Hasan’s (1976) model it is called reiteration. The reason for this might be that repeating the same semantic item is the most obvious and easily recognizable way to create semantic relatedness. In the first model, however, reiteration also comprises synonymy and superordinates, as well as the vague category of ‘general’ item, thus widening the concept of reiteration beyond its traditional grammatical sense. Later, Hasan (1984) changed the categories, separating repetition of the same base form from any other, orthographically different, semantic items.

[Table not included in this excerpt]

Table 2. The changes of lexical cohesion taxonomies based on Halliday and Hasan’s (1976) and Hasan’s (1984) models

Another interesting category is that of collocation, which subsumes antonymy, meronymy, and, as Hoey put it, a “ragbag of lexical relations” (1991, p. 7). Word pairs classified as collocation are, for instance, laugh – joke, try – succeed, or ill – doctor (Halliday & Hasan, 1976, pp. 285-286). The term collocation in connection with co-occurrence was thoroughly discussed in studies in the sixties (see e.g., Halliday, 1966; or for a review: Langedoen, 2009). In more recent publications, collocations are referred to as words frequently appearing together within a sentence, including phrases such as bright idea or talk freely, and are described as “essential building blocks of natural-sounding English”4. In the case of the above examples, however, the word pairs are not required to occur intra-sententially: they can appear in different sentences.

Two observations need to be made here concerning Halliday and Hasan’s (1976) lexical cohesion classification. The first is that, even though several discourse analysts (e.g. Hoey, 1991; Martin, 1992; Tanskanen, 2006) criticized this taxonomy as offering categories which were partly unjustified and partly too vague (typically meaning the category of collocation), time has justified Halliday and Hasan’s decision to include non-systematic sense relations, such as the above examples, in their taxonomy. The pairs laugh – joke, try – succeed, and ill – doctor are all semantically related, and all of them, although they cannot be easily categorized, are perceived as cohesive ties. These types of relations are dealt with in psycholinguistics and, after almost twenty years of development in automated textual analysis, can now be studied using computer-aided corpus linguistic techniques.

One way of analyzing text for such non-systematic sense relations as ill – doctor is applying knowledge representation (KR) frameworks commonly built for AI (Artificial Intelligence) purposes (Davis, Shrobe, & Szolovits, 1993). Semantic networks and maps are built to represent world knowledge (i.e., everything humans know but a computer does not, for example: When we feel ill, we go to the doctor.). Figure 1 illustrates the semantic network for business trip.

[Figure not included in this excerpt]

Figure 1. Illustration of a semantic network for business trips (based on Grimm, Hitzler, & Abecker, 2005, p. 39) Nouns represent the concepts (in rectangles), the arrows specify the relationships between the concepts.

These networks contain not only concepts (nouns, adjectives, etc.) but also the hierarchical relations between them (X is a part of Y), which are defined and taught to the program. Rules, such as IF X, THEN Y, or logic, such as X is an employee, therefore X must be a person, are also provided to improve the knowledge base of the application. In this way, ontologies (hierarchies) are built for each domain (each field of knowledge) to serve the Semantic Web (see e.g., Grimm, Hitzler, & Abecker, 2005 for more on this topic). Figure 2 shows part of a visual ontology for the field of biology. As can be seen from the examples provided in Figures 1 and 2, world knowledge constitutes a large part of perceived coherence and cohesion in texts.
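
The hierarchy and rule mechanisms described above can be sketched as a tiny triple store with one transitive inference rule. All facts and names below are illustrative inventions, not taken from the cited sources:

```python
# Minimal sketch of a knowledge-representation triple store with one
# transitive inference rule. Facts and names are illustrative only.
facts = {
    ("employee", "is_a", "person"),
    ("manager", "is_a", "employee"),
    ("business_trip", "has_participant", "employee"),
}

def is_a(entity: str, category: str, kb: set) -> bool:
    """Follow is_a edges transitively: the logic 'X is an employee,
    therefore X must be a person' described above."""
    if (entity, "is_a", category) in kb:
        return True
    parents = [o for (s, r, o) in kb if s == entity and r == "is_a"]
    return any(is_a(parent, category, kb) for parent in parents)

print(is_a("manager", "person", facts))  # True via manager -> employee -> person
```

Real KR frameworks add far richer logic (cardinality, disjointness, open-world reasoning), but the principle of inheriting category membership through a hierarchy is the same.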

[Figure not included in this excerpt]

Figure 2. Visual representation of a gene ontology within the field of biology (based on the online training material of the European Bioinformatics Institute5 )

Another new way of analyzing text for non-systematic sense relations is applying Word Association Norms (WAN) to a corpus. WANs are an alternative lexical knowledge source for analyzing linguistic computational creativity, for example, to explore lexical associations common in poetic texts (Netzer, Gabay, Goldberg, & Elhadad, 2009). These norms are collections of cue words and the sets of free associations they elicit from respondents. WANs are used with statistical measures to analyze, for example, the semantic distance of associated terms in poetry. Such applications, made possible by advances in technology, demonstrate that Halliday and Hasan were right to include collocation-type lexical relations in their taxonomy, because these word pairs contribute greatly to lexical cohesion; there are, however, no grounds for naming them collocations.

The second observation regarding the above taxonomy is that although Hoey was one of the linguists who criticized the ‘ragbag’ nature of collocation, he himself proposed a rather similarly ragbag category, that of paraphrase which includes synonymy, antonymy, superordinates, and the even more obscure subgroup: link triangle. Hoey’s (1991) categories will be explored further in Sections 2.3.2 and 2.3.6 of this chapter.

2.2.2 Lexical chains or a lexical net?

According to Hasan (1984), not every word is equally important in a text with regard to its cohesive power. Tokens (i.e., actual words) of a text may or may not form semantic relationships, called cohesive ties, with other words. Tokens which are not parts of chains are called peripheral tokens, whereas tokens which are parts of chains are relevant tokens, central to the text. Centrality is a recurring but ever-changing concept in discourse analysis. Mann and Thompson (1988), in their Rhetorical Structure Theory, differentiate between nuclei (the units that are most central to the writer’s purposes) and satellites (less central supporting or expanding units); they call the produced patterns schemas. The hierarchy of important and less important ideas is also described by van Dijk and Kintsch (1983), who refer to it as a system of macro- and microstructures. Hoey (1991) is similarly concerned with centrality, distinguishing between central and marginal sentences. An important property of chains as cohesive ties is that central tokens within chains connect several discourse levels: words connect sentences, sentences connect paragraphs, and the list can be continued up to chapter or whole-book level. The longer the chain, the longer the writer “stays on topic”.

Other influential models devised for analyzing coherence also attempt to recognize chains, be they lexical or phrasal, even if this fact is not mentioned explicitly in the name of the model. Topical Structure Analysis (TSA), for instance, by Lautamatti (1987) examines semantic relationships between sentence topics and overall discourse topics: it looks at the internal topical structure of paragraphs as reflected by the repetition of key words and phrases (see Figure 3). Thus, the aim of the model is to provide insights into the organizational patterns of discourse by observing chains in the text.

[Figure not included in this excerpt]

Figure 3. Topical Structure Analysis indicating semantic links between sentences (Lautamatti, 1987, p. 102)

By focusing on lexical chains in discourse, several conclusions can be drawn regarding coherence requirements. For example, comparative studies between languages show that English paragraphs tend to have a higher use of internal coherence than Spanish paragraphs (Simpson, 2000), making lexical chains in English texts a relevant field for study.

Chains not only connect discourse structures; they also divide parts of the text. Identifying where chains begin and end is used for both text summarization and text segmentation in corpus linguistics. Barzilay and Elhadad (1999) created summaries by extracting strong lexical chains from news articles: chain strength was scored according to length (“the number of occurrences of members of the chain”, p. 116) and the homogeneity index (“the number of distinct occurrences divided by the length”, p. 116). As far as segmentation is concerned, close correspondence between the starting and ending points of lexical chains and paragraph boundaries (structural unit boundaries) was found, for example, by Morris and Hirst (1991) and Berber Sardinha (2000).
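
Following the definitions quoted from Barzilay and Elhadad (1999, p. 116), the two chain-strength measures can be computed directly. The example chain below is invented for illustration:

```python
def chain_metrics(chain):
    """Chain-strength measures as quoted above: 'chain' is the list of
    word occurrences belonging to one lexical chain."""
    length = len(chain)                 # number of occurrences of members
    distinct = len(set(chain))          # number of distinct occurrences
    homogeneity = distinct / length     # homogeneity index
    return length, homogeneity

# An invented four-occurrence chain with two distinct members.
print(chain_metrics(["machine", "machine", "device", "machine"]))  # (4, 0.5)
```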

Hasan (1984) describes two types of chains. Identity chains are text-bound and held together by the semantic bond of co-referentiality. In other words, identity chains are made up of words which have the same referent (John – he – the boy). Similarity chains are not text-bound, and are based on co-classification or co-extension. To give an example, if something in a text is described as easy, and later something else is also described as easy, a similarity chain is formed between these two elements, i.e., between the two mentions of easy. The concept of the similarity chain is close to the psycholinguistic concept of word association and is also a basic tenet of intertextuality.

Hoey (1991) maintains that the presence of chains does not guarantee coherence: it is the interaction of chains that matters in this respect. Therefore, Hasan’s contribution to clarifying the relationship of coherence and cohesion, according to Hoey, is that Hasan abandoned the classificatory approach, and introduced an integrated approach. In other words, the combination of ties within chains is a more important idea in Hasan’s model than observing and classifying the ties without their context.

Besides considering texts as holders of lexical chains, they can also be viewed as containers of a lexical net (or network). The lexical net concept, however, is more connected to research on semantic networks in literature, rather than to research in discourse analysis. Such studies, for example, in psycholinguistics describe the mental lexicon (the arrangement of words in one's mind), or more recently, studies utilize Princeton WordNet6, an online semantic network database. According to Hoey (1991), who is a major advocate for the lexical net concept, the main difference between chains and a net is that the constituents of chains (i.e., the ties) have directions, pointing either backward or forward, whereas a net is a two-dimensional map of words disregarding directionality. In order to prove scientifically whether the chain or net representation of text is more accurate, more research is necessary concerning the two types of lexical repetition patterning. It is possible, for example, that there are generic differences and, for certain genres or registers, chain patterns would be more suitable, while other genres would call for a net-like organization.
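
The difference in directionality can be made concrete: the same repetition links can be stored as a directed chain, with each tie pointing back to an earlier sentence, or as an undirected net. A minimal sketch with invented sentence numbers:

```python
# The same three repetition links stored two ways; sentence numbers invented.
links = [(1, 3), (3, 7), (1, 7)]  # sentence pairs sharing a repetition

# Chain view: directionality kept, each tie pointing back to an earlier sentence.
chain = [(max(a, b), min(a, b)) for (a, b) in links]

# Net view: a flat, two-dimensional map of connections, direction discarded.
net = {frozenset(pair) for pair in links}

print(chain)                     # each pair ordered later -> earlier
print(frozenset({3, 1}) in net)  # order no longer matters in the net
```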

Lexical chains and nets are comparable in certain aspects if they are represented visually. Figure 4 shows a net from Hoey (1991) and a chain as illustrated in Barzilay and Elhadad (1999). At first sight, they look very similar. The second illustration (the chain) is a little unusual in this form because it looks more like a net. However, the spread-out form only serves as an aid to highlight the interaction between the items. The division below information, for example, indicates that this word is repeated further as area and datum.

[Figure not included in this excerpt]

Figure 4. A net and a chain of lexical repetition in two studies (Hoey, 1991, p. 81; and Barzilay & Elhadad, 1999, p. 116)

The main difference between these two representations of semantic relations is what they connect: in the first illustration, provided by Hoey, sentences are shown which are connected by three, or more than three, lexical repetitions. The numbers indicate which sentences in the text bond, i.e., are significantly connected semantically. The second illustration, on the other hand, shows instead the actual words that link sentences. Hoey’s net therefore, visualizes a higher layer of discourse.

2.3 Hoey’s (1991) Repetition Model

2.3.1 The theoretical background of the model

Hoey (1991) was the first to provide a comprehensive analytical model which reveals the organizing function of lexical repetition in texts. His great contribution to knowledge in discourse analysis and corpus linguistics was that he recognized the information content of the lexical net which was created by lexical repetition.

In his view, the role of grammatical cohesion is less significant than that of lexical cohesion; therefore, he focuses only on words with lexical meaning. In creating his model, he draws mostly on the theories of Hasan (1984), Winter (1977, 1979) and Phillips (see Hoey, 1991, pp. 14-25). The connection between Hasan’s and Hoey’s models has already been highlighted in the previous sections (Sections 2.1 and 2.2). The contributions of Winter (1977, 1979) and Phillips are briefly described below.

Hoey (1991) adopted the broad interpretation of repetition from Winter (1977, 1979), even though they disagreed on the function of repetition. According to Winter, the function of repetition is to focus attention on the word which has been replaced. Thus, in the John – he word pair the focus is on John. Although Hoey did not refute this explicitly, the main problem with this for him must have been the directionality, more precisely the anaphoric direction of repetition assumed by Winter. On the other hand, Winter’s other assumption that replacement (exact repetition or repetition by another item) also needs to be observed at a clause level, was favoured by Hoey. According to Winter, if we repeat an item, the position of new versus old information will also change in the clauses involved. With this assumption, Winter integrates two levels: lexical analysis and clausal analysis.

Hoey translated Winter’s main conclusions for his own research in the following way:

1. “If cohesion is to be interpreted correctly, it must be interpreted in the context of the sentences where it occurs.
2. We are more likely to arrive at a satisfactory account of how cohesion works if we concentrate on the way repetition clusters in pairs of sentences.
3. It is the common repeating function of much cohesion that is important, not the classificatory differences between types of cohesion.
4. There is informational value to repetition, in that it provides a framework for interpreting what is changed.
5. Relations between sentences established by repetition need not be adjacent and may be multiple.” (Hoey, 1991, p. 20)

As in his critique of Hasan’s (1984) work, Hoey again stresses the importance of interaction over classification (conclusion No. 3 above). He maintains that repetition defines a context in which sentences interact not only with neighbouring sentences but also over greater distances (Nos. 4 and 5). Perhaps this is why Hoey proposed the term organization instead of structure when describing texts in his study (1991).

Besides Winter (1977, 1979) and Hasan (1984), Phillips (1985) influenced Hoey methodologically. Phillips analyzed long texts by computer, which made Hoey broaden his enquiries to book-length texts, and observe long-distance lexical relations. A noteworthy result of Phillips, as cited by Hoey (1991, p. 24), is that academic texts contain many more long-distance clusters of repetition than other types of text, and these consistent mentions have an important organizational function. The fact that Phillips used automated means of research drove Hoey to work on his methodology with automation in mind.

2.3.2 Hoey’s (1991) taxonomy of lexical repetition

As mentioned in the previous section, Hoey himself did not regard the classification of links as being of primary importance compared to other aspects of his work, namely the role of lexical repetition patterns in organizing discourse. The key concepts of his taxonomy can be grouped into lexical and non-lexical repetition. Lexical repetition is categorized, as shown in Table 3, in the following way:

Table 3. Types of repetitions based on Hoey’s (1991) taxonomy

[Table not included in this excerpt]

Simple lexical repetition occurs “when a lexical item that has already occurred in a text is repeated with no greater alteration than is entirely explicable in terms of a closed grammatical paradigm” (p. 55). It includes exact repetitions and repetitions of the same word with inflectional changes.

Complex lexical repetition occurs “either when two lexical items share a lexical morpheme, but are not formally identical, or when they are formally identical, but have different grammatical functions” (p. 55). Hoey’s example for this is drug as a noun and drugging as in making sleepy (-ing form of a verb).

Simple paraphrase occurs “whenever a lexical item may substitute another in context without loss or gain in specificity and with no discernible change in meaning” (p. 62). Such are synonyms, e.g., begin – start.

Complex paraphrase occurs when “two lexical items are definable such that one of the items includes the other, although they share no lexical morpheme”. This category is broken down into three subcategories: 1. antonymy, 2/a link triangle, 2/b the “mediator” missing, 3. other types of complex paraphrase: superordinates and co-reference (p. 64). The categories link triangle and the mediator missing are very difficult to interpret; they will therefore be discussed in Section 2.3.7 among the problematic features of Hoey’s taxonomy.

Non-lexical repetitions are substitution links, and as such, grammatical categories: personal pronouns, demonstrative pronouns and modifiers.

Words of a text form links with other words according to these main categories. The links need to be coded according to repetition type, counted and their positions recorded (more details in Section 2.3.4). According to Hoey’s description, only content words (words with lexical meaning, e.g., nouns, verbs, adjectives, etc.) can be part of a link. Grammatical items and other semi-grammatical categories, such as connectives, although they play a role in cohesion, are not analyzed within his framework of lexical cohesion. However, substitutions, such as pronouns need to be replaced by the original item, thus resolving the anaphora, i.e. the backward reference created between the pronoun and the missing noun. This theoretical, as well as methodological, problem is analyzed in Section 2.3.8.
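
As a rough illustration of the first two lexical categories, a crude suffix-stripping heuristic can separate exact and inflectional repetition from complex repetition via a shared morpheme. This is only a sketch under strong simplifying assumptions: Hoey’s analysis was manual, a real implementation would need lemmatization and POS tagging, and the stemmer below overgenerates (it would, for instance, wrongly relate car and carpet):

```python
INFLECTIONS = ("ing", "ed", "es", "s")  # a few common inflectional endings

def stem(word: str) -> str:
    """Strip one inflectional suffix; far cruder than real lemmatization."""
    for suffix in INFLECTIONS:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def classify_repetition(w1: str, w2: str) -> str:
    a, b = w1.lower(), w2.lower()
    if a == b:
        return "simple (exact)"
    if stem(a) == stem(b):
        return "simple (inflectional variant)"
    if a.startswith(stem(b)) or b.startswith(stem(a)):
        return "complex (shared lexical morpheme)"
    return "no morphological link"

print(classify_repetition("drug", "drugs"))     # simple (inflectional variant)
print(classify_repetition("drug", "drugging"))  # complex (shared lexical morpheme)
```

Simple and complex paraphrase cannot be detected this way at all; they require an external lexical resource such as a thesaurus or WordNet.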

2.3.3 Links and bonds creating a lexical net

Hoey (1991) claimed that “lexical items form links when they enter into semantic relationships” (p. 91). These links, however, are only realized between two sentences, not inside a sentence. Therefore, if two words are repeated within one sentence, they are not analyzed. The reason for this, according to Hoey, is that intra-sentential repetitions do not play a role in structuring discourse, even if they have an important function, e.g., emphasis (They laughed and laughed and laughed uncontrollably.). Hoey differentiated his concept of link from Hasan’s cohesive tie in two respects. Firstly, his categories were greatly different from Hasan’s. Secondly, he emphasized that links have no directionality.

Hoey’s important claim is that certain sentences play a more central role in organizing discourse than others. Sentences sharing three or more links are significant for the organization of discourse because they form bonds, a higher-level connection. Marginal sentences, with fewer than three links, do not contribute essentially to the topic; therefore, if omitted, they do not disrupt its flow (Hoey, 1991).
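
The link-and-bond mechanism can be sketched for the simplest case, where links are counted only as shared content words (simple repetition). The sentences and the stopword list below are invented for illustration; Hoey’s full taxonomy of course covers paraphrase and complex repetition as well:

```python
# Links counted as shared content words only (simple repetition);
# the stopword list and sentences are invented for illustration.
STOPWORDS = {"the", "a", "an", "and", "in", "of", "to", "is", "it"}

def link_count(sent1: str, sent2: str) -> int:
    """Number of content words shared by two sentences."""
    content = lambda s: {w for w in s.lower().split() if w not in STOPWORDS}
    return len(content(sent1) & content(sent2))

def bonded(sent1: str, sent2: str, threshold: int = 3) -> bool:
    """Hoey's criterion: three or more links turn a sentence pair into a bond."""
    return link_count(sent1, sent2) >= threshold

s1 = "the drug trial measured patient recovery rates"
s2 = "recovery rates improved when the drug dose was reduced"
print(link_count(s1, s2), bonded(s1, s2))  # 3 links -> the pair bonds
```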

Bonded sentences lead to nets, which ultimately organize text, in a manner similar to Hasan’s (1984) identity and similarity chains. Hoey found that bonded sentences are central to text, as they are the core bearers of information (resembling the concept of macropropositions by van Dijk and Kintsch, 1983). Hoey’s main claim, that links created via lexical repetition may form bonds which subsequently create significant sentences, was later reaffirmed by Reynolds (1995) and Teich and Fankhauser (2004, 2005).

Hoey also defines the minimum level of linkage necessary to create a bond, i.e., identifies certain constraints on making sentences significant. He sets three links as the threshold, and turns to Sinclair’s (1988) work on word sense disambiguation to support this claim. Word sense disambiguation is a necessary step in identifying the right meaning of a word in a sentence. The English language contains many polysemous and homonymous7 words, therefore differentiation of meaning is a relevant problem for discourse analysis. Sinclair, who uses corpus linguistic techniques, finds word sense disambiguation one of the most problematic issues. He recommends looking at the collocational patterns of words for sense disambiguation, because different senses of a word are likely to have different collocational profiles.

Let us relate this to the problem of threshold calculation for bonding. Suppose we have to distinguish between two senses of the word bank (sense 1: financial institution; sense 2: part of a river). By observing the preceding and following words, we will find significant lexical differences between the two senses. If, in a text about money matters, we find a sentence about a river, the collocational profile of that sentence will be so different from the other sentences that it will “stick out”. The reason, therefore, why three is the minimum number of links to form a bonded sentence is that if a sentence is central to the topic, it will be linked in at least three places: at least once as a key word, plus in two other places as collocations of this key word, collocations which probably reappear in the text, forming further links with other sentences. Even though the above might shed light on Hoey’s decision regarding the threshold, it still does not explain why he chose three, instead of four or five, as the limit for bonding.

Hoey (1991) used a news article and a short section of a non-narrative book for his analysis. Based on the results gained from this small data pool, he showed three methods by which abridgements (summaries) of a longer text could be created: (1) by deleting marginal sentences; (2) by collecting central sentences; or (3) by selecting topic-opening and topic-closing sentences. He admitted that these modes would summarize different aspects of the original text, with shifts in meaning. Nevertheless, he emphasized that the patterns of lexical repetition he examined are characteristic of non-narrative prose, as opposed to narration, which has a more linear structure.

Although Hoey presented these models as possible means to create summaries, he did not give guidance on how to evaluate the quality of these summaries. He argued, however, that the lexical repetition patterns revealed by his analytical tool can indicate differences in text quality. He put forward two claims, a weak and a strong one:

“The weak claim: each bond marks a pair of sentences that is semantically related in a manner not entirely accounted for in terms of its shared lexis.” (p.125)

“The strong claim: because of the semantic relation referred to in the weak claim, each bond forms an intelligible pair in its context.” (p.126)

What follows from this, firstly, is that bonded sentences hold important information content; and secondly, that bonded sentences have a special discourse function in text. Hoey did not claim that the observed lexical repetition patterns are present in every text type; in fact, he excluded narrative texts from his analysis, maintaining that they are structurally different from the news article he experimented with.

2.3.4 The steps of the analysis

Hoey’s analytical steps consist of three phases: (1) identifying the lexical items which enter into relationship with other items, (2) classifying these relationships according to the taxonomy (i.e. finding the links), and (3) identifying which sentence pairs have three or more than three links (i.e., finding the bonds). The detailed steps of the analysis also appear in Appendix A, illustrated with diagrams.

1. Coding the text according to the taxonomy. Finding links between every sentence pair, including the title, which also counts as one sentence. (1—2, 1—3, 1—n, etc., and in the same way 2—3, 2—4, 2—n).
2. Writing the links into a connectivity matrix where each cell represents a sentence, as Hoey put it: “to trace a sentence’s connections with other sentences in the text” (1991, p. 85). All links should be written into the cells.
3. The information in the first matrix is then transferred into a second matrix in numerical format, each cell recording the number of links between the sentence pair.
4. Cells containing three or more links should be highlighted, because these mark the bonded sentences. In what follows, only these sentences are examined.
5. The locations of bonded sentences need to be found in the text, and they should be highlighted.
6. If the purpose of the analysis is to create a summary, either the bonded sentences should be collected, or the marginal sentences should be deleted (same procedure). The third procedure is to collect all the topic opening and topic closing sentences. The bonded sentences will give the basis of the summary.
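
Steps 1 to 6 can be sketched end-to-end, with simple repetition standing in for the full taxonomy. The sentences, the stopword list, and the crude tokenization below are all illustrative assumptions, not Hoey’s actual data or coding procedure:

```python
from itertools import combinations

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was",
             "were", "from", "between"}

def content_words(sentence: str) -> set:
    """Very crude tokenization: lowercase, strip trailing punctuation."""
    return {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS

def analyse(sentences, threshold=3):
    """Steps 1-5: count links per sentence pair, record them in a numeric
    connectivity matrix, and mark the bonded sentences."""
    n = len(sentences)
    words = [content_words(s) for s in sentences]
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = len(words[i] & words[j])   # links between sentences i, j
    bonded = set()
    for i, j in combinations(range(n), 2):
        if matrix[i][j] >= threshold:
            bonded.update((i, j))
    # Step 6: collecting the bonded (central) sentences yields the summary.
    return matrix, [sentences[i] for i in sorted(bonded)]

sents = [
    "The repetition model counts lexical links between sentences.",
    "Narrative texts were excluded from the study.",
    "Counting lexical links between sentence pairs reveals the repetition patterns.",
]
matrix, summary = analyse(sents)
print(matrix[0][2])                     # the first and third sentences bond
print(summary == [sents[0], sents[2]])  # the marginal sentence is dropped
```

Deleting the marginal sentences and collecting the bonded ones are, as the text notes, the same procedure seen from two sides.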

2.3.5 Applications of Hoey’s (1991) model

As the above method is laborious, the first question is whether it can be applied to a large corpus. In his book on lexical repetition patterns, Hoey analyzed a five-sentence article in detail, as well as the first forty sentences of the first chapter of a non-narrative book. He concluded that it is theoretically possible to create summaries of texts of “unlimited length” using his repetition model, but he did not give instructions on how to do so in practice. Furthermore, the process of comparing lengthy texts with their summaries was not examined either.

The model as the basis for automated summarization

To the knowledge of the researcher, the first text-processing computer application based on Hoey’s model (Tele-Pattan) was created by Benbrahim and Ahmad in 1994 (de Oliveira, Ahmad, & Gillam, 1996). It represented a computer implementation of two of Hoey’s four lexical repetition categories: Simple Repetition and Complex Repetition. The program created five summaries of the same stock exchange news, which were then evaluated by four traders and five university students. The outcome was that 60% of the raters felt that essential information was missing, and the participants evaluated the summaries differently. As the texts were not available in the research paper, the experiment cannot be replicated. However, it can be argued that the text processing program was limited in use because (1) it incorporated only two of Hoey’s categories, and perhaps as a consequence (2) the resulting summaries were rated differently, even though the type of text (stock exchange news) did not allow for a wide lexical or syntactic variety. The result is all the more surprising because, for a non-expert, it would seem relatively easy to summarize such a functionally ‘predictable’ genre.

As the size of available and searchable corpora increased significantly, British Telecom financed a project, led by Hoey and Collier, to design a “software suite” for the abridgement of electronic texts (Collier, 1994) by automatically selecting central sentences, i.e., sentences containing the main ideas of a text. The program was able to create a matrix of links in seconds, but again only for the two basic repetition categories: simple and complex repetition. According to Collier (1994), thesaural links were added manually to analyze antonyms, but this step resulted in only a minor improvement in the program. His research plan lists several semantic and structural difficulties in automating central concordance line selection, and he concludes that further research into these areas is necessary.

Two programs evolved from the original version: a document similarity tracer (Shares) and an automatic document summarization/abridgement system (Seagull). A demo version of both can be accessed at the Birmingham City University Research and Development Unit for English Studies website8. Several other attempts have been made to use computer discourse analysis programs based on Hoey’s taxonomy (de Oliveira et al., 1996; Monacelli, 2004); however, no research we know of has utilized the whole of Hoey’s framework without alterations.

As described above, Collier (1994) attempted the automated identification of repetition links using a concordance selector. Due to the extremely laborious nature of data collection, many studies investigating lexical repetition patterns utilize a concordance program (e.g., AntConc9 or Concordance 3.310) to search discourse data. As the data are textual, frequency analysis software is helpful for counting how many times certain words appear in a text. Using a concordancer, it is also possible to count how many times certain pairs are repeated, and the software can show in which sentences the repetitions occur. It cannot evaluate qualitative data, however, without a human observer to process the information.
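What such a frequency pass provides can be sketched in a few lines. This is an illustration of the principle only, not a reimplementation of AntConc or Concordance: it counts how often each word occurs and records the sentences in which it appears.

```python
from collections import Counter, defaultdict

def frequency_index(sentences):
    """Return (word frequencies, word -> list of sentence numbers)."""
    freq = Counter()
    where = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        for token in sentence.lower().split():
            word = token.strip(".,;:!?")   # crude punctuation stripping
            freq[word] += 1
            where[word].append(number)
    return freq, where

sentences = [
    "Repetition builds cohesion.",
    "Cohesion depends on repetition patterns.",
    "Patterns of repetition can be counted.",
]
freq, where = frequency_index(sentences)
```

The `where` index is the quantitative half of the job; deciding whether two occurrences actually form a repetition link remains the human observer's task.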

Since its first implementation, automated summarization has been widely used for a variety of purposes (e.g., Barzilay & Elhadad, 1999). These summarization applications, however, use algorithms different from the one Hoey provided. Mani (2001) and Spärck Jones (1999) gave detailed descriptions of the latest developments in this field.

The model as the basis for a discourse segmentation tool

Hoey’s (1991) claims regarding central sentences, particularly topic-initial and topic-closing sentences, instigated research into text segmentation (i.e., dividing a text into meaningful units). While he aimed to synthesize text by collecting the topic-opening and topic-closing sentences revealed by bonds, other researchers wanted to achieve the exact opposite: segmenting discourse by identifying paragraph boundaries.

Hearst (1994) used a computer program to locate segment boundaries by lexical repetition counts, a method different from Hoey’s. She compared equally long parts of a text and tried to find similarities: if two text parts were very different lexically, they did not constitute a single coherent paragraph. Hearst called this method text tiling; she computed simple repetition only and used a technical text, which is more likely to contain repeated terminology.
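The tiling principle can be sketched as follows. This is a deliberately simplified version of my own, not Hearst's implementation: her algorithm works on token sequences and adds smoothing and depth scoring, whereas the toy below merely computes cosine similarity between the word counts of adjacent, equally sized sentence blocks, with a low score suggesting a topic boundary.

```python
import math
from collections import Counter

def bag(sentences):
    """Bag-of-words counts for a block of sentences."""
    return Counter(w.strip(".,;!?").lower() for s in sentences for w in s.split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def boundary_scores(sentences, block_size=2):
    """Similarity of adjacent blocks at each gap; low values suggest boundaries."""
    scores = []
    for gap in range(block_size, len(sentences) - block_size + 1):
        left = bag(sentences[gap - block_size:gap])
        right = bag(sentences[gap:gap + block_size])
        scores.append(cosine(left, right))
    return scores
```

In a text whose first two sentences share vocabulary with each other but not with the last two, the score at the middle gap drops to zero, marking the segment boundary.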

Berber Sardinha (2000), on the other hand, who criticized Hearst for comparing equally long discourse parts, looked for an alternative method and attempted segmentation using Hoey’s analytical framework. He soon found, however, that the lexical net organization pattern is an obstacle to segmentation, because the sentences forming the net were all part of one large group connecting, rather than segmenting, the whole of the text. Therefore, Berber Sardinha diverged from Hoey’s framework and looked for link sets, which more closely resemble cohesive chains. He calculated link set medians, which provided meaningful information about where link sets begin and end.

The prevalent problem with these early computational applications was that they only searched for simple repetition, because computerized thesauri were not available. This trend changed with the implementation of Princeton WordNet11, an online thesaurus. Its synsets (sets of synonyms organized by systematic semantic relations) are now the basis for most lexical analysis of semantic relatedness in corpora.

Large-scale lexical repetition analysis with WordNet

Teich and Fankhauser (2004, 2005) connected Hoey’s framework with WordNet12 as a thesaurus in order to observe differences in lexical repetition patterns between registers in the Brown Corpus13. As they noted, the flexibility of Hoey’s categories facilitated the creation of a multi-layer corpus, the building of which was the focus of attention around the turn of the millennium. This kind of corpus is annotated at multiple linguistic levels, such as the syllable, word, phrase, clause and text levels. Teich and Fankhauser’s results shed light on interesting aspects of lexical repetition. It was found, for example, that register-specific vocabulary forms stronger cohesive ties than general vocabulary (2005), as the words typical of the specific register form longer chains within a text than general vocabulary does. Their qualitative analysis also revealed that texts in the learned and government registers have longer-stretching lexical chains than texts from press or fiction. This might be the result of academic texts in English being more linear (c.f. Lautamatti, 1987), showing higher lexical density (the number of lexical words divided by the total number of words), or making heavier use of nominalization (Halliday and Martin, 1993).

The model as the basis for validating multiple choice test prompts

A unique implementation of Hoey’s lexical repetition analysis was carried out by Jones (2009), who investigated question-answer pairs in an EFL context. He analyzed reading passages and related question prompts in the Pearson Language Test Reading Comprehension Section, investigating the semantic similarity between the wording of the questions and the wording of the related answers by looking for semantic links between the sentences, drawing on Hoey’s categories. As it was a manual analysis, he was able to use all the categories in Hoey’s (1991) taxonomy. His original assumption was that an answer is more difficult for EFL learners if the semantic links between the prompt and the answer are semantically less related, i.e., if they are found further down in Hoey’s taxonomy table. Thus, for instance, a word in the question which is repeated exactly (simple repetition) in the answer makes the answer fairly easy, whereas if the question-answer pair contains derived repetitions or superordinate terms, finding the right answer is linguistically more demanding for the student. In this pilot study, Jones was able to lay down the theoretical foundations for further studies to measure scientifically the semantic distance of multiple choice item pairs in the Pearson Reading Comprehension test prompts.

2.3.6 Inconsistencies within Hoey’s (1991) model

Several inconsistencies exist within Hoey’s (1991) model. Károly (2002) pointed out that it contains three weaknesses: (1) theoretical problems with the taxonomy, such as several obscure category labels, and the unclear definition of the basic unit of analysis, (2) weaknesses of the method of analysis, such as not examining intra-sentential repetition, or the missing theoretical foundation for choosing the number of bonds to be seen as significant connections, (3) research methodological problems, such as making strong claims based on a single text type.

The inconsistencies of the model derive from two sources: Hoey’s inductive approach to data collection and analysis, and his claim that the classification of links is of lesser importance than the patterns created by the interaction of links. He analyzes data gained from a short text and, based on these results, draws conclusions implying that the same applies to longer texts. This inductive approach can be observed, for example, when Hoey gives guidance on how to categorize certain problematic words.

It needs to be mentioned that approaches which start by analyzing existing data without a previous hypothesis are standard procedure in corpus linguistics, so this is not a unique characteristic of Hoey’s research. The problem is rather that he uses his data as illustration and makes decisions on a case-by-case basis, which makes the model difficult to apply.

The first problem is that his categories are difficult to identify, and the second is that the categorization is unjustified. He did not use traditional grammatical categories and subcategories, perhaps because he wanted to create a new multi-layered analytical tool and found the existing categories too restrictive for this purpose. It might be suggested that these invented categories are more confusing than helpful during the coding process.

Regarding the second problem, the category Simple Repetition seems to be the most straightforward; nevertheless, even this is problematic, especially when observed with automation in mind or from a theoretical point of view. Regarding the latter issue, Hoey contemplates whether, when we repeat the same word in the same form, it is still the same word, or whether its meaning has changed by the very fact that it was repeated.

Another interesting problem, which Hoey recognizes, is that words have different functions within a text, even though he does not explicitly state this: he calls such cases accidental repetition in his book On the surface of text (1995, p. 108). He suggests observing the functions of the word reason as an example.

No faculty of the mind is more worthy of development than the reason. It alone renders the other faculties worth having. The reason for this is simple.

According to the author’s explanation, this word appears with two different meanings in the text; therefore, the two examples within this paragraph cannot be considered a repetition link, since the second mention of reason has a referential function in this context, similar to metadiscoursal14 phrases such as in the following or let us see. Hoey does not offer a list of such problematic words, or a rule on the basis of which these umbrella terms or special discourse function words should be excluded from the analysis. A list of such nouns was later compiled by Hyland (1998), who collected them by corpus-analytical means from eighty research articles across eight academic disciplines (see Appendix B).

During the analysis of a large corpus, errors caused by misidentified links such as the above might be scarce compared to the number of correctly identified links. However, if we analyze a corpus of short essays by computer, the lack of unified instructions could be a serious problem. Given that academic discourse is notorious for using metadiscoursal nouns (issue, problem, reason), the error rate of the analysis can be higher.

A serious, but different, kind of difficulty arises when we attempt to analyze Paraphrases. A complex paraphrase can, by definition, be a one-word unit, a phrase, a clause, a sentence or even “a stretch of text” (Hoey, 1995, p. 112). It is difficult to prepare coders for such intricacies regarding the unit of analysis. Even though the complex nature of any text cannot be denied, it is still questionable whether all these features need to be analyzed within the same single framework in order to yield data on textual cohesiveness.

The above-mentioned problems with the taxonomy are but a few of the inconsistencies in the framework and the over-flexible nature of its units of analysis. Due to these unresolved issues, the identification of units and the annotation of text are prone to low inter-coder agreement, thus hindering reliability. It seems that Hoey’s ground-breaking idea to collate several discourse layers still needs further refinement, and more experiments are needed for its implementation.

2.3.7 The link triangle and the mediator missing

Hoey introduces two categories which are unique in the reviewed literature: the link triangle (shown in Figure 5) and the mediator missing. According to Hoey’s (1991) explanation, if a word forms links with two other items in a text, this creates a putative link between those two items, which were not previously connected. An alternative version of this concept is the mediator missing, which occurs when one of the elements does not appear directly in the text but is instead referred to by a pronoun.

[Figure not included in this excerpt]

Figure 5. The link triangle (Hoey, 1991, p. 65)

Hoey rightly maintains that semantic triangles exist; the problem, however, is that these word associations can take any number of other shapes as well, for example a square (with the fourth repeated item) or an octagon (with the eighth mention). Therefore, even if Hoey reveals another important feature of discourse by introducing the triangle concept, there is no theoretical basis for calling such formations triangles and insisting on placing them in his link taxonomy.

Besides theoretical considerations, the concept of the link triangle is also problematic from a methodological point of view. Hoey’s category inadvertently confuses his own taxonomy by looking for connections between more than two elements at the same time. While data on the frequencies and locations of inter-sentential relationships between lexical units can be observed and analyzed relatively easily, triangle-type relationships would be more difficult to detect and record, and triangle frequencies could prove impossible to interpret alongside the other types of data. For instance, supposing we find three links between the first and the 20th sentence, according to Hoey we can claim that these two sentences are bonded; in other words, there is a strong semantic and structural relationship between them across a distance of 20 sentences. However, it is not described what procedure should be followed if there is another word in sentence 21 which may be a candidate for a link triangle: where should this be recorded in the connectivity matrix, and how would it influence the overall number and length of bonds within the text? This is a problem for both manual and computerized analysis, even considering the latest advancements in technology. As far as the mediator missing category is concerned, such links would have to be manually coded (and possibly inserted) on a case-by-case basis, which would seriously slow down the analysis, if not make it impossible.

Hoey further elaborates on the triangle concept in his other influential book, “On the surface of discourse” (1983), where he observes triangles one level higher: at the discourse level, and he describes their discourse functions such as the problem – solution or the general – particular patterns. See Figure 6 for the general – particular relationship.

[Figure not included in this excerpt]

Figure 6. The General—Particular relationship in text (Hoey, 1995, p.135).

2.3.8 The questions of anaphora resolution

Another problematic area is whether to resolve pronominal anaphoric reference, one of the major contributors to cohesion, affecting at least every second sentence in a typical English text. According to Hoey’s (1991) methodology, if a word is repeated by substitution (i.e., replaced by a pronoun in the following sentence), the original noun (who or what the pronoun refers to) should be inserted in the place of the pronoun in order to recreate the original lexical link. Thus, in the sentence pair John wants to go to the theatre. He is a great fan of contemporary drama, he should be replaced by John for the purpose of the analysis.
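The substitution step itself is mechanical once the antecedent is known; the hard part, automatic anaphora resolution, is deliberately left out of this sketch, where the antecedent is supplied by hand.

```python
import re

def resolve(sentence, pronoun, antecedent):
    """Replace a whole-word pronoun with its (manually supplied) antecedent.

    Determining the antecedent automatically is the genuinely difficult,
    unresolved part of the task; this function only performs the rewrite."""
    return re.sub(rf"\b{re.escape(pronoun)}\b", antecedent, sentence)

resolved = resolve("He is a great fan of contemporary drama.", "He", "John")
```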

If we want to create summaries using Hoey’s model, re-establishing links is a logical and necessary step to improve cohesion. However, this treatment cannot be applied if we want to connect Hoey’s model with text quality research: if we followed Hoey’s advice and replaced the pronouns with their preceding referents in their original form, the number of simple repetitions would increase considerably, distorting perceptions of discourse quality (not to mention dubious cases in which it is difficult to decide who or what the author meant as the referent).

2.3.9 Prescriptiveness or descriptiveness of the model

Tyler (1995) criticized Hoey on the grounds that quantitative investigation of lexical repetition alone cannot capture the difference between well-formed and badly formed texts: qualitative analysis is necessary to explore how repetition is used. Connor (1984) went further by suggesting that a text lacking lexical cohesive links may still be better organized than another text which contains more lexical repetition links but does not have a well-formed argument. Therefore, if we accept the assumption that the quantity of lexical repetition is not a major dimension of discourse quality, the next obvious question is: if we examine the cohesion patterns formed by lexical repetition links in texts, will we be able to judge discourse quality?

Tyler’s (1992, 1994) empirical studies indicated that repetition in itself was not sufficient to create cohesion, because the perceived quality difference between native and non-native speakers’ language production is influenced by what is repeated and how. This issue is not addressed in Hoey’s studies. Nevertheless, Tyler did not contradict Hoey’s main claim regarding the function of bonds as text-building devices. Reynolds (1995, 2001) also found differences in the use of lexical repetition between native and non-native speakers. He applied Hoey’s coding system and methodology to students’ expository essays. His findings revealed that EFL writers did not use lexical repetition devices optimally: bonded sentences were not used to form the basis of the developing argument structure. Reynolds’ conclusion was that “the content of what is being repeated is as important as the quantity” (1995, p. 185). Thus, it can be concluded that Hoey’s model has great potential for studying lexical repetition, particularly if it is complemented with a content-based approach.

2.4 Károly’s (2002) Repetition Model

Károly (2002) applied Hoey’s model to explore the text-organizing role of lexical repetition. She revised Hoey’s (1991) taxonomy, placing it in a wider perspective, and used the model in a Hungarian EFL academic context.

2.4.1 Károly’s (2002) taxonomy of lexical repetition

[Table not included in this excerpt]

Table 4. Types of lexical relations in Károly’s taxonomy with examples (examples based on Károly, 2002, p. 104, and these two corpora)

Károly introduced the term lexical unit as the basic unit of her analysis. This is a unit “whose meaning cannot be compositionally derived from the meaning of its constituent elements” (Károly, 2002, p. 97), i.e., the individual words placed one after the other mean something different from what each word means standing alone. A lexical unit can be a one-word unit, an idiom or a phrasal compound (words expressing a unique concept, e.g., non-native English speaking teachers, non-NESTs). She also proposed a new taxonomy of lexical repetition types, as indicated in Table 4.

Table 4 shows that Károly (2002) uses more traditional grammatical terms than Hoey (1991), and her units of analysis are linguistic constituents which can be more easily identified than Hoey’s. The instantial relations category introduces a semantic category which is temporarily bound by context and resembles Hasan’s (1994) instantial lexical cohesion category, originally broken down into equivalence, naming and semblance. Károly also argues for differentiating between inflection and derivation within the category same unit repetition, because inflectional differences are only syntactic variants and therefore represent closer semantic connections than derivation, which changes the meaning of the word, irrespective of whether it happens with or without word class change. Hoey’s original idea that a unit may be as small as a word but can stretch as far as a passage, if two passages are paraphrases of each other (e.g., 1995, p. 110), is lost in Károly’s more rigorous categorization. As a consequence, the semantic flexibility Hoey’s analysis offered is sacrificed. On the other hand, the clarity of the categories and the traditional grammatical terminology enhance the reliability of the coding, inasmuch as coders do not have to make many ad hoc decisions.
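The inflection/derivation distinction can be illustrated with a toy heuristic. The suffix inventory and the shared-stem test below are invented for illustration and are far cruder than the morphological analysis Károly's coding presupposes.

```python
# Hypothetical suffix inventory; real morphological analysis is much richer
# and would also handle stem changes (e.g., "run" / "running").
INFLECTIONAL_SUFFIXES = {"s", "es", "ed", "ing"}

def repetition_type(a, b):
    """Classify a word pair by a crude shared-stem heuristic:
    identical forms, inflectional variant, derivational variant, or no match."""
    if a == b:
        return "same form"
    stem, other = (a, b) if b.startswith(a) else (b, a)
    if other.startswith(stem):
        suffix = other[len(stem):]
        return "inflection" if suffix in INFLECTIONAL_SUFFIXES else "derivation"
    return "no match"
```

Under this sketch, walk/walked counts as same unit repetition with inflection, while walk/walker is a derivational variant whose meaning has changed.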

2.4.2 Károly’s (2002) method of analysis

Károly also found several weaknesses in Hoey’s methodology, such as not examining intra-sentential repetition and the missing theoretical foundation for choosing the number of bonds to be seen as significant connections. As far as Hoey’s research methodology is concerned, Károly’s criticism was that Hoey made strong claims about the role of lexical patterns in discourse based on a single text type.

Károly (2002) not only revised the categories but also introduced a number of new analytical steps related to the combination of links and bonds to extend the research capacity of the analytical tool. Her method of analysis focused on new aspects of bonds, such as their position, length, and strength between sentences with special discourse function (SDF), such as the title, the thesis statement, the topic sentences, and the concluding sentences.

For instance, the length of bonds category indicates how far apart bonded sentences are located, and the distinction between adjacent bonds and non-adjacent bonds indicates which sentences form mutual relationships. The strength of bonds was calculated to reveal how many links connect the sentences in a given text. Károly’s new quantitative analytical measures are shown in Appendix C.
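Assuming a sentence-by-sentence matrix of link counts is already available, these bond measures are straightforward to derive. The sketch below is my own illustration of the definitions just given (the measures are Károly's; the default threshold of three links per bond follows Hoey).

```python
def bonds(link_matrix, threshold=3):
    """Return (i, j, strength, length) for each sentence pair whose link
    count reaches the threshold: strength = number of links, length = j - i."""
    found = []
    n = len(link_matrix)
    for i in range(n):
        for j in range(i + 1, n):
            strength = link_matrix[i][j]
            if strength >= threshold:
                found.append((i, j, strength, j - i))
    return found

def split_adjacency(bond_list):
    """Separate adjacent bonds (length 1) from non-adjacent ones."""
    adjacent = [b for b in bond_list if b[3] == 1]
    return adjacent, [b for b in bond_list if b[3] > 1]
```

From these tuples, the density of bonds (bonds per sentence) and the position of bonds relative to sentences with special discourse function follow directly.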

2.4.3 Károly’s empirical investigation

Károly (2002) investigated the organization of ten high-rated and ten low-rated argumentative EFL essays written by English BA majors at a Hungarian university. Her main hypothesis was that her revised tool is able to differentiate between high-rated and low-rated essays, based on the role lexical repetition plays in structuring texts.

Károly used a number of variables, which she later reduced to five: the frequency of derived repetition, the relative use of bonds at paragraph boundaries, the density of bonds, the frequency of adjacent bonds, and the number of bonds between the title and the topic sentences of the essay. These proved capable of predicting raters’ quality judgements of the essays with 95% certainty. The variables with the most predictive power are shown in Table 5.

Károly’s (2002) research results showed that her theory-driven “objective” analytical tool not only offered a descriptive function but, with her analytical measures, was capable of predicting the “intuitive” assessment of teachers evaluating the essays with regard to the content and structure of EFL academic argumentative essays.

[Table not included in this excerpt]

Table 5. Variables with the strongest predictive power in Károly’s (2002) lexical repetition analysis research

The results of Károly’s analysis proved that the texts, which had previously been rated high or low by experienced university instructors, differed significantly in both the amount and the types of repetition. Her results indicated that high-rated essays contained significantly more repetition links, including more same unit repetition (and, within this, derived repetition), as well as more simple opposites and instantial relations.

An interesting finding was that the analytical tool could not discriminate between high-rated and low-rated essays based on the combination of bonds. The four aspects observed here were the quantity of bonds, the number of adjacent and non-adjacent bonds, the length of bonds and the strength of bonds. Therefore, as a next step, a content-based approach was used to investigate the sentences with special discourse function (SDF), such as the title, the thesis statement and the topic sentences.

The novel outcome of the analysis was that the number of bonds connecting SDF sentences was significantly higher in high-rated essays, particularly between the title and the topic sentences, and between the title and the rest of the sentences. Károly’s results thus revealed hidden dimensions of lexical repetition: the first result, for instance, means that even high-rated essays contained many repetition links, although it is common practice for teachers to advise against using repetition in texts. Post-tests conducted with another group of teachers confirmed these findings, indicating that the analytical measures devised are reliable and the results may be generalized to a wider sample.

Another interesting aspect of teachers’ perceptions of essay quality was also uncovered by Károly’s (2002) analysis: one essay which the model predicted to be low-rated, due to its lack of appropriate bonding, was still scored high by the teachers. A content-based analysis revealed that this particular essay utilized a number of rhetorical devices, such as illustrative examples, parallelisms, and rhetorical questions, to support the argument. This, together with irony, as in the following example, indicates that the model cannot capture certain features perceived as significant in overall discourse quality.

What a shame that there are such inhuman beings living around us as journalists, you think when reading through the passage.

2.4.4 A corpus-based investigation using Károly’s (2002) taxonomy

To my knowledge, no research has been carried out by manual coding using Károly’s (2002) model. A recent computer-aided empirical investigation based on Károly’s (2002) taxonomy aimed to compare shifts in lexical cohesion patterns between translated and authentic Hungarian texts (Seidl-Péch, 2011). Seidl-Péch found that authentic Hungarian and translated Hungarian texts differ in their lexical cohesion patterns. Her quantitative analysis was facilitated by language technology modules provided by Orosz and Laki (Laki, 2011; Novák, Orosz & Indig, 2011), whose linguistic parser (analyzer) helped to automate the analysis. The Hungarian WordNet program15 (Prószéky & Miháltz, 2008) was used to explore semantic links between sentences.

Although Seidl-Péch’s (2011) study was the first to utilize Károly’s lexical repetition analysis model for a multilingual corpus-based investigation, it cannot be considered a model for our research for several reasons. Firstly, some of her methodological decisions were determined by her special research focus, namely studying the quality of translation. Secondly, the scope of Seidl-Péch’s research limited her investigation to nouns. (Interestingly, the screenshots in the results section reveal that the software also analyzed pronouns, e.g., azt [that, in object form], arra [onto that], p. 135. It is possible that anaphoric references (i.e., repetitions by pronoun) were also included in the sum of repetitions. If not, it would be important for future research such as this one to know how they were discarded.)

Thirdly, Seidl-Péch (2011) did not provide enough detail on exactly how the application analyzed the texts. She did not explain, for example, how lexical sense disambiguation occurred. An English example would be this: how did the application decide which synonym sets applied to the word bank? As lexical repetition analysis (LRA) sets out to identify semantic relations, and polysemy and homonymy are frequent in English, the key methodological question is whether the software offered the word for the researcher to choose the right meaning manually in the given context, or whether the application selected the right meaning on its own, entirely automatically. In the latter case, it is of key importance to examine how the program decided which meaning was relevant.

To explore this feature in the English version of WordNet, on which HunWordNet was based, I experimented with the word bank to find out which meaning is offered first: the most frequent, the most likely16, or one determined by some other factor. The result was that bank as sloping land was offered before bank as financial institution (as shown in Figure 7), which indicates that the WordNet application was trained on a general corpus (the Brown Corpus) as the database for calculating frequencies, and not on specialized texts, such as texts from the business domain. A more detailed description would have been helpful so that further research could replicate the treatment of synonyms and antonyms coded with HunWordNet in Seidl-Péch’s (2011) research.
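WordNet lists a lemma's synsets in order of their frequency in a sense-tagged corpus, so taking the first synset amounts to a most-frequent-sense baseline. The toy thesaurus below is invented for illustration (it is not the WordNet API) and merely mimics how such a baseline picks sloping land for bank regardless of the surrounding domain.

```python
# TOY_SYNSETS is a hypothetical stand-in for WordNet's frequency-ordered
# synset lists; the glosses are invented for illustration.
TOY_SYNSETS = {
    "bank": ["sloping land beside a body of water", "financial institution"],
}

def most_frequent_sense(word):
    """First-listed sense, i.e., a most-frequent-sense baseline.

    A context-sensitive disambiguator would instead compare the surrounding
    words against each gloss before choosing."""
    senses = TOY_SYNSETS.get(word.lower())
    return senses[0] if senses else None
```

Because the baseline ignores context entirely, a business text mentioning bank would still receive the sloping land sense first, which is precisely the methodological risk discussed above.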

[Figure not included in this excerpt]

Figure 7. Synonyms offered for the word bank in WordNet

The two methodological questions detailed above were already addressed in Sections 2.3.3 and 2.3.8, where Hoey’s (1991) anaphora resolution and link threshold decisions were discussed as theoretical decisions. Now both have reappeared as research application issues. Nevertheless, Seidl-Péch’s (2011) research indicates that it is feasible to automate Károly’s (2002) framework.

Another issue to consider is that Seidl-Péch limited the scope of her research to nouns because HunWordNet contains noun synsets. It seems appropriate to do more research into texts which contain more adjectives or verbs than usual, to explore their text-structuring significance. One genre in which adjectives and adverbs are also frequently compared with their opposites is the compare/contrast essay. Figure 8 shows part of such an essay, presented here as an example of how much lexical cohesion would be lost without the analysis of adjectives and adverbs. (Note: in reality, many more adjectival/adverbial repetition links start from these two paragraphs than indicated in Figure 8; only the link pairs between the second and third paragraphs are illustrated.)

[Figure not included in this excerpt]

Figure 8. Three paragraphs of a sample compare/contrast essay indicating some of the lexical repetition links (adjectives/adverbs). Text: Oxford Advanced Learners’ Dictionary 8th ed. (OUP, 2008)

2.5 Summary

The aim of this chapter was to provide a theoretical background to the study of cohesion and coherence in discourse. Whereas there seems to be more of a consensus regarding cohesion, an overt textual feature, coherence, which involves more hidden, cognitive factors, is viewed in various, often contradictory ways. Within the concept of cohesion, Hoey’s (1991) comprehensive lexical repetition analysis model was discussed, revealing that the patterns lexical repetition creates play a key role in organizing discourse, and thus their investigation can serve various purposes (for instance, text summarization). Károly’s (2002) extended and revised version of Hoey’s original model was capable of adding a discourse quality judgement dimension to the framework. The current investigation follows these lines, extends the use of the model to two other academic genres, and designs a method which will enable the model to be applied to larger corpora, too.

3 Methodological background: the academic writing context

3.0 Overview

The academic context from which the corpora of this study originate plays a significant role in this research for several reasons. The first reason is that this research project draws on theories and practices from discourse analysis and corpus linguistics, both of which deal with “real”, authentic texts created within an existing social context. In this study, both the summary and the compare/contrast essay corpora consist of real assignments at a university: the former task is written in a disciplinary course (English for Tourism), and the latter is an Academic Writing course assignment. This has an undeniable effect on the texts as products.

The second reason why it is important to observe the academic context is that the students’ writings in the corpora were all first evaluated by their course tutors. The tutors’ judgements are consequential for this study, because their quality judgements of the texts are collated with the predictive power of the lexical repetition analysis tool tested here on two genres. Therefore, it is important to find out how clear tutors’ perceptions are of the three major concepts (coherence, cohesion and lexical repetition) discussed in the previous chapter. It is also interesting to find out how clear these concepts are in a relatively new field: automated essay assessment. Thus, this chapter is the ‘melting pot’ of viewpoints regarding these three concepts, examining discourse analysis theory, everyday assessment practicalities and language technology.

This interdisciplinarity requires us to draw on the relevant literature to describe what quality of writing really means in a university context, with particular focus on research into EFL (English as a foreign language) student writing. In order to better understand the notion of quality in academic discourse, three major areas need to be discussed: what kinds of writing tasks students are assigned; what variables (contextual and cognitive factors) influence their final products; and how teachers make quality judgements about these products. The latter two areas, as they have instigated a large body of research in their own right, are examined only from a narrower, EFL pedagogical angle.

First, the academic writing tasks that have been found most prominent in university contexts generally are described. This is followed by an enumeration of the task variables of academic writing, following through from task setting to assessment. Next, the two major integrative tasks in the focus of this research are introduced: summary writing and compare/contrast essay writing. The last part of the section deals with the theoretical and empirical aspects of evaluating writing assignments. Given that student texts are assessed in both ways, manual and automated assessment practices are also discussed, with particular focus on how teachers and computers ‘judge’ particular aspects of cohesion and coherence.

3.1 The nature of academic discourse

3.1.1 General features of English academic discourse

In this study the research focus is on written discourse, more precisely on written academic discourse17. Discourse as a product appears in the form of particular genres. Genre is defined by Swales as “a class of communicative events, the members of which share some set of communicative purposes. These purposes are recognized by the expert members of the parent discourse community and thereby constitute the rationale for the genre. This rationale shapes the schematic structure of the discourse and influences and constrains choice of content and style” (Swales, 1990, p. 58).

Typical written academic genres are, for instance, the journal article or the research proposal, whereas spoken genres are the lecture or the student presentation. The communicative events (i.e. the kinds of written and oral products) of the discourse communities which Swales describes above have evolved simultaneously with the emergence of the scientific communities, thus the same members comprise the scientific as well as the discourse communities. Therefore typical genres,18 text types, and structural conventions might differ across disciplines.

Besides typical genres and text types, another discourse feature of academic writing is the academic register, as opposed to, say, legal or media registers. The notion of register is defined by Biber et al. (1998) as a “cover term for varieties defined by their situational characteristics” considering the “purpose, topic, setting, interactiveness, mode, etc.” of the situation (p. 135). The academic English register, particularly written discourse, is objective, complex, and formal. It is also more structured, in both its spoken and written forms, than general English.

Objectivity in academic discourse is achieved by using an impersonal style, for example, by avoiding first person singular pronouns, as well as using referencing when introducing other authors’ ideas. Higher lexical density, i.e., a greater use of content words (verbs and nouns) than structure words (words with grammatical function), is also pervasive in this register. Particular grammatical forms, such as passive structures and nominalization make academic texts more static and more abstract. Nominalization is characterized by Francis (1990) as “a synoptic interpretation of reality: it freezes the processes and makes them static so that they can be talked about and evaluated” (p.54).
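To make the notion concrete, lexical density can be approximated as the proportion of content words among all tokens. The sketch below is an illustration of this ratio only: the function-word list is a small, hypothetical subset, and the crude regex tokenizer sidesteps the part-of-speech tagging a serious measurement would require.

```python
import re

# Small illustrative subset of English function words (a real analysis
# would use a full stop-word list or a POS tagger).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "or", "but", "is", "are",
    "was", "were", "to", "by", "with", "that", "this", "it", "as", "for",
}

def lexical_density(text: str) -> float:
    """Proportion of content words (tokens outside FUNCTION_WORDS) among all tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)
```

On the fully nominalized phrase “hazardous waste management” every token is a content word, so the density is 1.0, while a clausal paraphrase such as “the waste is managed by the council” contains four function words out of seven tokens, giving roughly 0.43.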

Biber and Gray (2010) discuss nominalization as a strategy to make a text more condensed. They argue that academic discourse is wrongly characterised as elaborated; on the contrary, it is compressed and implicit due to nominalization and passivisation. They illustrate the gradual shift towards reduced explicitness with the following three phrases (p.11):

(a) someone manages hazardous waste
(b) hazardous waste is managed
(c) hazardous waste management.

In example (b) the agent is omitted; in example (c) “it is not even explicit that an activity is occurring” (p. 11). It is also obvious that while the first two examples can stand on their own as complete sentences, example (c), which is the most likely in academic discourse, needs further syntactic elements to build up a whole sentence, which, as a consequence, results in more information content in that sentence. The relevance of this observation for our research is that, due to this compression, a high number of sentence elements in academic discourse will enter into links with other sentences in the text.

Besides nominalization, the lack of elaboration in academic writing is caused by the relative lack of clausal modifiers. Instead, the above-mentioned phrasal modification is employed in cases where relative clauses would provide extra information to facilitate comprehension. Table 6 shows how condensed phrasal modification can be, with some examples from the same Biber and Gray (2010) corpus-based study (p. 9).

[Table not included in this excerpt]

Table 6. The difference in explicitness caused by phrasal vs. clausal modification (Biber & Gray, 2010)

Apart from difficulties caused by conventions and syntax, language learners also face problems created by academic vocabulary. Coxhead (2000) set out to collect the word families which appear with high frequency in English-language academic texts. The resulting Academic Word List (AWL) contains 570 word families commonly found in academic journals, textbooks, lab manuals, and course notes, thus comprising “general academic English”. It does not include terminology characteristic of specific disciplines. Academic words cover ten percent of an average academic text (Reid, 2015), yet their abstractness is an obstacle to comprehension.
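The ten-percent coverage figure can in principle be checked mechanically for any text by counting tokens that belong to the AWL. The sketch below assumes a tiny, hypothetical subset of AWL items and matches exact forms only, whereas the real list is organized into word families covering inflected and derived forms.

```python
import re

# Tiny illustrative subset; the actual AWL comprises 570 word families.
AWL_SAMPLE = {"analyse", "concept", "data", "method", "research",
              "significant", "theory"}

def awl_coverage(text: str, awl_words=AWL_SAMPLE) -> float:
    """Share of tokens matching an AWL item (exact forms only, for illustration)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in awl_words)
    return hits / len(tokens)
```

A figure of around 0.10 on a longer passage would be consistent with the coverage Coxhead reports for general academic text.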

3.1.2 The types of writing tasks required at university

Several studies have assessed the typical writing tasks required from students at universities (Bridgeman & Carlson, 1983; Hale et al., 1996; Horowitz, 1986; Huang, 2010; Leki & Carson, 2012). Some of these studies identify genres common across universities. According to data collected by Horowitz (1986) from 29 mainly undergraduate courses across 17 departments in a US university context, the most common genres students have to write are summary of/reaction to a reading, annotated bibliography, report on a specified participatory experience, case study, and research project. He also describes connecting theory and data and synthesizing multiple sources as frequently required tasks.

A decade later, a survey conducted across various discipline areas from Accounting to Visual Arts in Australian universities (Moore & Morton, 1999) yielded broadly similar findings. The following genres were found to be the most prevalent: essay (the most common, at 60% of all tasks), review, literature review, experimental report, case study report, research report, research proposal, and summary.

Other studies do not use the term genre when they refer to students’ tasks such as description, summarization, explanation, etc. Terminology varies in this respect: the terms tasks, rhetorical functions, abilities, and skills are all used to refer to these smaller assignments. In a large-scale survey (Bridgeman & Carlson, 1983) covering 190 university departments in Canada and the USA, for example, when teachers were asked which tasks they perceived as the most typical at both undergraduate and postgraduate levels, the two most common text types mentioned were description and interpretation of non-verbal input and comparison and contrast plus taking a position. In the Moore and Morton (1999) survey these were evaluation (67%), description (49%), summarization (35%), comparison (35%), and explanation (28%).

Rosenfeld, Courtney and Fowles (2004) identified the specific tasks required both at undergraduate and graduate levels. Their large-scale survey included more than 700 faculty members from thirty institutions in the United States. The most important subtasks the students were required to perform across disciplines are presented in a combined table in Appendix D.

These findings suggest that requirements gradually grow from undergraduate to doctoral levels. Johns and Swales (2002) also found that there is an upward progression in terms of assignment length, complexity of resources utilized, and sophistication expected from students. The table shows that across the three levels, the three most important requirements are the same: proper crediting of sources, writing coherently and writing in standard written English.

As the focus of this dissertation is coherence and cohesion, it is important to note here that requirements regarding coherence and cohesion are ranked as very important at all three levels. Another observation is that summarization and comparison and contrast, the two types of discourse this dissertation observes, appear at Master’s and Doctoral levels as core requirements. Even though these studies were based on Anglo-Saxon higher education practice, the tasks mentioned above can be considered typical in the Hungarian academic context as well.

3.1.3 Disciplinary differences in academic discourse

In order to be able to understand teachers’ requirements regarding students’ written tasks, specific, disciplinary differences also need to be examined. In line with this, more recent studies set out to map the different writing tasks across faculties. For instance, in the social sciences, humanities and arts domains the review, proposal, case study and summary genres were found to be typical, whereas only library research papers and project reports were commonly assigned across all the disciplines (Cooper & Bikowski, 2007). The same genres also differ in many aspects depending on the subject domain: a report or case study, for instance, written for the biology department might differ considerably from the same genre submitted to the sociology department. This difference, according to Carter (2007), lies in the fact that various fields of study have their own “ways of doing and ways of knowing”, which, as a consequence, have shaped their “ways of writing” practices (p. 393). New disciplines form their own standards. As Haswell (2013) observes, “[e]ach new field of study seems to function out of a different epistemology requiring a different set of writing skills – unfamiliar composing processes, novel genres and tasks, shifting standards and expectations” (p. 416).

Carter (2007) argues that four main types of writing can be distinguished across academic fields: problem-solving assignments, research from sources, empirical inquiries, and performance assignments. Problem-solving tasks are typical in the business and engineering domains, where students need to write business plans, project reports or project proposals. Research from sources is dominant in the humanities, a typical task being literary criticism. Empirical inquiry relies on data collected by the students themselves and is characteristic of science disciplines, an example of which could be the lab report. In performance assignments the artistic value of the text is as important as the content. Performance text types are creative writing tasks or poetry, typical in arts classes (examples from Olson, 2013). Carter (2007) assumes that identifying these four types of “metagenres”, i.e., higher categories than individual genres, is more informative than analyzing texts according to the conventions of the genre alone. This “overarching allegiance” in disciplinary communication was also observed by Widdowson (1979), who argued that “scientific exposition is structured according to certain patterns of rhetorical organization which […] imposes conformity on members of the scientific community no matter what language they happen to use” (p. 61).



17 Discourse is “a unit of language larger than a sentence and which is firmly rooted in a specific context” (Halliday & Hasan, 1990, p. 41). For the purposes of this study it will be regarded as a similar concept as text, which, according to Halliday and Hasan, comprises “any passage, spoken or written, of whatever length, that does form a unified whole” (1976, p. 1).

18 Biber (1988) differentiates between genre and text type. The former is categorized by external criteria, the latter are texts similar in linguistic pattern, and can take the form of various genres.


Eötvös Loránd Tudományegyetem (Doctoral School of Education)
PhD Programme in Language Pedagogy
Quote paper
Dr Maria Adorjan (Author), 2016, Lexical Repetition in Academic Discourse, Munich, GRIN Verlag, https://www.grin.com/document/459913

