This study compared the assessment behavior of nonnative-speaker raters with overseas educational backgrounds (NNSWOEB) and nonnative-speaker raters without overseas educational backgrounds (NNSW/OOEB) in oral English assessment. The eight participants, all nonnative speakers, were divided into two groups according to their overseas educational background. Each group of four raters rated oral English samples from 28 English Education majors. The data were analyzed using the Minifac (Facets) Rasch software. The analysis revealed that (a) there is a significant difference in severity between NNSWOEB and NNSW/OOEB raters; (b) NNSW/OOEB raters are more severe than NNSWOEB raters on every rating criterion except pronunciation; and (c) NNSWOEB raters interpret the rating scale differently from NNSW/OOEB raters. The findings support previous research showing that nonnative speakers, or speakers without overseas educational backgrounds, are more severe than native speakers, or speakers with overseas educational backgrounds, particularly on grammatical criteria.
Key words: NNSWOEB; NNSW/OOEB; rater severity; rater consistency
In recent years, speaking skills have come to play an important role in students' overall English proficiency. A growing number of students and parents place great emphasis on spoken English, and some students have taken English speaking tests such as IELTS (International English Language Testing System), MSE (Main Suite Examinations), and GESE (Graded Examinations in Spoken English), in which all the raters are native speakers. However, because of limited teaching resources, students in Korea and China are taught oral English mostly by nonnative speakers, some of whom have overseas educational backgrounds and some of whom do not. In this situation, students may receive different scores depending on their raters' overseas educational background. This research aims to identify the differences in assessment behavior between raters with and without overseas educational backgrounds.
The research questions are listed as follows:
1. What are the differences between NNSWOEB raters and NNSW/OOEB raters in terms of rater severity?
2. What are the differences between NNSWOEB raters and NNSW/OOEB raters in terms of rating criteria?
3. What are the differences between NNSWOEB raters and NNSW/OOEB raters in terms of rating scale?
Performance assessment refers to a variety of tasks and situations in which students are given opportunities to demonstrate their understanding and to thoughtfully apply knowledge, skills, and habits of mind in a variety of contexts. These assessments often occur over time and result in a tangible product or observable performance. They encourage self-evaluation and revision, require judgment to score, reveal degrees of proficiency based on established criteria, and make the scoring criteria public (Marzano, Pickering, and McTighe, 1993). Performance assessment introduces new features into the assessment setting, such as (1) the raters themselves, who vary in the standards they use and in how consistently they apply them, and (2) the rating procedures they are required to implement. The interaction between rater characteristics and the qualities of the rating scales they use has a crucial influence on the ratings that are given, regardless of the quality of the performance (McNamara, 1996).
Raters' variability: McNamara (1996) pointed out some of the important ways in which raters may differ from one another.
1. Two raters may simply differ in their overall leniency.
2. Raters may display particular patterns of severity or leniency in relation to only one group of candidates, not others, or in relation to particular tasks, not others, or in relation to particular analytic scores, not others. In general, a rater may be consistently lenient on one item and consistently severe on another; this is a kind of rater-item interaction.
3. Raters may differ from each other in the way they interpret the rating scale they are using.
4. Finally, and rather more obviously, raters may differ in terms of their consistency; that is, the extent of the random error associated with their ratings.
Raters' language backgrounds, in particular their L1, are the most commonly studied factors in analyses of rating behavior. Brown (1995), Johnson & Lim (2009), and Shi (2001) found either non-significant or small interaction effects between raters' L1 and their rating scores. In contrast, Hill (1997) reported a significant effect of raters' L1 background on rater severity. Although most studies found either a non-significant or a small difference in severity, they all revealed dissimilarities between native-speaker (NS) and nonnative-speaker (NNS) raters in rating.
Little research has examined the relationship between nonnative raters' overseas educational background and differences in their rating behavior.
In this study, a group of eight raters was assigned to rate the oral English samples of 28 students. All the raters are nonnative speakers: four are classified as nonnative speakers with overseas educational backgrounds (NNSWOEB), and four as nonnative speakers without overseas educational backgrounds (NNSW/OOEB). The NNSWOEB raters are categorized as such because they studied in English-speaking countries for more than three years. The NNSW/OOEB raters are Chinese or Korean and did not study in English-speaking countries. Basic information about the raters is listed as follows.
illustration not visible in this excerpt
The speaking task used as the research instrument is an IELTS speaking test (trial version), in which students are asked to express their ideas on housing. The rating rubric is the IELTS speaking band descriptors (public version), a 9-point scale consisting of four analytic criteria: fluency and coherence, lexical resource, grammatical range and accuracy, and pronunciation. The scoring sheet consists of the student ID number, the analytic scores, the holistic score, and the rater's comments on the scores.
All raters completed a one-hour basic training session before each operational scoring session. After all the data were collected, a quantitative analysis was conducted with the Minifac (Facets) Rasch software to compare the ratings of NNSWOEB and NNSW/OOEB raters.
Results and Discussion
In terms of rater severity and consistency
This section is divided into three parts. The first part centers on Table 1 and discusses the rating quality of all eight raters in terms of MnSq and ZStd. The second part presents the results in Table 2 and analyzes the differences between the two groups in terms of rating criteria. The last part uses the statistics in Table 3, Table 4, and Table 5 to examine the differences between the two groups in terms of rating scale.
illustration not visible in this excerpt
Model, Sample: Separation 3.11 Strata 4.48 Reliability (not inter-rater) .91
Model fixed: chi-square: 72.7 d.f.: 7 significance (probability): .00
Exact agreements: 898 = 35.2% Expected: 849.1 = 33.3%
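The summary statistics reported under Table 1 are related to one another by standard Rasch formulas. The following is a minimal sketch, assuming the usual definitions (strata = (4G + 1) / 3 and reliability = G^2 / (1 + G^2), where G is the separation index). The function names are illustrative and are not part of the Facets software.

```python
def strata(separation: float) -> float:
    """Number of statistically distinct severity levels: (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


def separation_reliability(separation: float) -> float:
    """Reliability of the separation among raters: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Separation value taken from the summary line under Table 1.
reported_separation = 3.11

computed_strata = strata(reported_separation)
computed_reliability = separation_reliability(reported_separation)

print(f"strata      = {computed_strata:.2f}")
print(f"reliability = {computed_reliability:.2f}")
```

With the reported separation of 3.11, these formulas reproduce the reported strata (4.48) and reliability (.91). Note that this reliability is not an inter-rater agreement index: a high value indicates that the raters differ reliably in severity.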
As shown in Table 1, raters are listed in order from most severe to most lenient: rater Jiyeon (NNSW/OOEB) is the most severe and rater Ada (NNSWOEB) is the most lenient. The difference between them is 0.86 (6.77 - 5.91 = 0.86), which means that rater Jiyeon would score the same oral English sample 0.86 lower than rater Ada.
The table also shows that the two most severe raters are NNSW/OOEB raters and the most lenient raters are NNSWOEB raters. The fixed chi-square is 72.7 with 7 degrees of freedom (reported significance: .00), so the null hypothesis that the raters do not differ in severity when rating oral English materials is rejected. Taken together with the ordering of raters in Table 1, this is evidence that NNSW/OOEB raters are more severe than NNSWOEB raters.
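The reported significance of .00 is a rounded value. As a rough check, the upper-tail probability can be recomputed from the reported chi-square (72.7) and degrees of freedom (7). The sketch below uses the closed-form series for integer degrees of freedom and only the Python standard library. The function name `chi2_sf` is illustrative, and this is not necessarily how Facets computes the value.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(chi / sqrt(2)) + 2 * phi(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1)),
    # where chi = sqrt(x) and phi is the standard normal density.
    chi = math.sqrt(x)
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


# Values taken from the "Model fixed" line under Table 1.
p_value = chi2_sf(72.7, 7)
print(f"p = {p_value:.3g}")
```

The resulting probability is far below .001, which is consistent with the reported value of .00 and with rejecting the hypothesis that all eight raters share the same severity.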
Column 4 shows the logit measure of each rater. Measures above 0 indicate greater severity than measures below 0. In this study, all the raters' measures are below 0, which means that all the raters are relatively lenient. Column 5 shows the standard error, which indicates the precision of the estimate: the smaller the standard error, the more precise the estimate (Linacre 2005). In this study, the standard errors range from 0.10 to 0.15, which is relatively small, so the severity estimates can be regarded as fairly precise.
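For reference, the link between logit measures and severity can be read from the many-facet Rasch model. The study does not state its exact model specification, so the commonly used rating-scale formulation below is an assumption:

```latex
\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
\]
% P_{nijk}: probability that student n receives band k from rater j on criterion i
% B_n: ability of student n
% C_j: severity of rater j
% D_i: difficulty of criterion i
% F_k: difficulty of the step from band k-1 to band k
```

Because the rater term C_j enters with a negative sign, a larger rater measure lowers the log-odds of the higher band, which is why higher logit values correspond to more severe raters. Under the usual normal approximation, a 95% interval for a rater measure is C_j ± 1.96 SE, so standard errors of 0.10 to 0.15 correspond to intervals of roughly ±0.20 to ±0.29 logits.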