This paper examines various corpora and translation tools that can assist during the translation process and that were used in order to ensure a high level of translation quality.
It documents a series of exercises carried out with these corpora and tools.
The term “corpus” can be defined as a collection of spoken or written utterances that exist in machine-readable form (annotated corpora) or in their “raw” state (unannotated corpora) and that are used for linguistic tasks. However, the term is mainly used to refer to the machine-readable variety.
Corpora are used to analyze lexical, syntactic and semantic/pragmatic aspects of language and to compare different languages and language registers with each other. Observations made through corpora can not only support translation but also bring clarity to areas such as historical linguistics and language acquisition.
To be applicable in those fields, a corpus must be multifunctional and reusable. It therefore needs to conform to standards of falsifiability (the model can be tested on different samples of corpus material and can be replaced by a better-fitting model if necessary), completeness (the model has to account for unrestricted data), and objectivity (the model can be tested objectively by observers who have no emotional stake in its success or failure).
Table of Contents
1. Introduction
2. Exercise 1: Corpora: Definition, Corpus Parts, Kinds of Corpora and their Usage
2.1 Definition
2.2 Corpus Parts
2.3 Kinds of Corpora
2.4 Usage of Corpora
3. Exercise 2: AntConc: Frequency List, Keyword List, Collocations, N-grams/Clusters, Concordance Plot
3.1 Frequency List
3.2 Keyword List
3.3 Collocational Behaviour
3.4 Comparison between Collocates and Clusters/N-Grams
3.5 Concordance Plot
4. Exercise 3: Using TreeTagger for Annotation, Analyzing Annotation Errors, Analyzing the Tagset
4.1 TreeTagger for Annotation of biology.txt
4.2 Analyzing the Annotation Errors
4.3 Analyzing the Tagset
5. Exercise 4: Creating an XML File, Defining Tags for Metadata, Defining some new Tags
6. Exercise 5: Search in BNC, Usage of Semantically Related Words, Using DWDS
6.1 Search in BNC
6.2 Usage of Semantically Related Words
6.3 Using DWDS
7. Exercise 6: Analyzing German Support Verb Constructions with CQPWeb
9. Exercise 8: Passives in English and German and Translational Universals
9.1 Translational Differences of English and German Passives
9.2 Translational Universals: Shining Through and Normalization
10. Exercise 9: Synonyms and Antonyms
11. Exercise 10: Terminology Database MultiTerm
Bibliography
1. Introduction
During the seminar “Ausgewählte Themen der Maschinellen Übersetzung und der Fachkommunikation: Introduction to Language Resources for Translators”, different corpora and translation tools that can assist during the translation process were used in order to ensure a high level of translation quality. The following sections document the exercises carried out with these corpora and tools.
2. Exercise 1: Corpora: Definition, Corpus Parts, Kinds of Corpora and their Usage
2.1 Definition
The term “corpus” can be defined as a collection of spoken or written utterances that exist in machine-readable form (annotated corpora) or in their “raw” state (unannotated corpora) and that are used for linguistic tasks. However, the term is mainly used to refer to the machine-readable variety.1
2.2 Corpus Parts
For a corpus to be usable with different kinds of data and in different projects, it has to be standardized to a certain extent. Therefore, every corpus consists of three parts:
- Raw data, which is the “authentic” language data, such as video/audio data, transcriptions of spoken language or texts in their physical form.
- Metadata, the “data about data”, which describe the primary data within the corpus, e.g. by giving information about genre, data size, format, authors, etc.
- Annotations, which add linguistic analysis to the raw data. In spoken-language data, the speech signal is first segmented into phonemes, words, sentences, etc. The data is then transcribed, so that the exact wording of an utterance is reproduced as text or in phonetic symbols. In text corpora, segmentation marks the structure of the text, e.g. word boundaries (tokens), paragraphs, footnotes, syntactic phrases and their functions. Furthermore, extralinguistic interpretations such as emotions, meaning, facial expressions and gestures can be recorded.
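The three parts can be pictured as a simple data structure. The following is only a minimal sketch: the class names `CorpusDocument` and `Annotation`, the layer name "token", the whitespace tokenization and the sample sentence are all assumptions made for illustration and do not come from any particular corpus standard.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labelled character span over the raw text (hypothetical structure)."""
    start: int        # inclusive character offset into the raw text
    end: int          # exclusive character offset
    layer: str        # e.g. "token" (layer names are assumptions)
    label: str = ""


@dataclass
class CorpusDocument:
    raw: str                                         # raw data
    metadata: dict = field(default_factory=dict)     # "data about data"
    annotations: list = field(default_factory=list)  # annotation spans

    def tokenize(self):
        """Add one 'token' annotation per whitespace-separated word."""
        position = 0
        for word in self.raw.split():
            start = self.raw.index(word, position)
            end = start + len(word)
            self.annotations.append(Annotation(start, end, "token"))
            position = end

    def spans(self, layer):
        """Return the raw text covered by every annotation of one layer."""
        return [self.raw[a.start:a.end]
                for a in self.annotations if a.layer == layer]


# Invented example values, not taken from the seminar material.
doc = CorpusDocument(
    raw="Phage DNA enters the cell.",
    metadata={"genre": "biology", "language": "en"},
)
doc.tokenize()
print(doc.spans("token"))
```

Keeping the annotations as offsets into the unchanged raw text means that several annotation layers can be added later without altering the primary data.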
2.3 Kinds of Corpora
Besides the aforementioned annotated and unannotated corpora, there are various other kinds of corpora:
- Monolingual Corpora are collections of texts written in a single language.
- Parallel Corpora hold the same text in more than one language. Typically, these corpora are bilingual rather than multilingual. However, parallel corpora can raise questions such as “which word is the translation of which word?” or “which sentences are translations of which?”
- Parallel Aligned Corpora avoid these issues by aligning the sentences and word units that are translations of one another.
- Comparable Corpora contain similar texts, e.g. texts of the same genre, written in more than one language or language variety.
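To illustrate why alignment helps, a sentence-aligned parallel corpus can be sketched as a list of sentence pairs. The English-German pairs below are invented for illustration, and `find_translations` is a hypothetical helper, not part of any corpus tool.

```python
# Hypothetical sentence-aligned English-German pairs (invented examples).
aligned_pairs = [
    ("The cell divides.", "Die Zelle teilt sich."),
    ("The phage infects the cell.", "Der Phage infiziert die Zelle."),
    ("The gene is copied.", "Das Gen wird kopiert."),
]


def find_translations(pairs, source_word):
    """Return the target sentences aligned with source sentences
    that contain source_word (case-insensitive, punctuation stripped)."""
    needle = source_word.lower()
    hits = []
    for source, target in pairs:
        tokens = [t.strip(".,;:!?").lower() for t in source.split()]
        if needle in tokens:
            hits.append(target)
    return hits


print(find_translations(aligned_pairs, "cell"))
```

Because each source sentence is stored together with its translation, the question “which sentences are translations of which?” is answered by the data itself. Word-level alignment would need an additional, finer-grained mapping that this sketch does not attempt.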
2.4 Usage of Corpora
Corpora are used to analyze lexical, syntactic and semantic/pragmatic aspects of language and to compare different languages and language registers with each other. Observations made through corpora can not only support translation but also bring clarity to areas such as historical linguistics and language acquisition. To be applicable in those fields, a corpus must be multifunctional and reusable. It therefore needs to conform to standards of falsifiability (the model can be tested on different samples of corpus material and can be replaced by a better-fitting model if necessary), completeness (the model has to account for unrestricted data), and objectivity (the model can be tested objectively by observers who have no emotional stake in its success or failure).2
3. Exercise 2: AntConc: Frequency List, Keyword List, Collocations, N-grams/Clusters, Concordance Plot
Corpus linguistics offers several research methods. One of them is annotation, which applies a certain structure to the corpus so that this structure can later be extracted. The programme AntConc provides several tools for this kind of corpus-linguistic research and for data-driven learning:
3.1 Frequency List
This AntConc tool counts all the words in the corpus and presents them in an ordered list, which shows which words are the most frequent. This helps to study the type of vocabulary used in the text, to identify common word clusters and grammatical patterns, and to compare the frequency of a word across different text files.
The most frequent word in the given “biology text” is the definite article “the”. This is not very surprising as “the” is the most frequent word in the English language3. Other than that, many prepositions such as “of”, “in”, “to”, “by”, “for” and “with” occur frequently in the text since they indicate the relationship of a noun/pronoun to another text element. Furthermore, the conjunction “and”, which connects two ideas with each other, has a very high frequency in the text. Besides that, the high frequency of the words “dna” and “phage” makes clear that the given text deals with a biological topic.
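The principle behind such a frequency list can be sketched in a few lines. This is not AntConc's implementation: the tokenization (lowercased runs of the letters a to z) is an assumption chosen to mirror the lowercase forms quoted above, and the sample sentence is invented.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumed tokenization: lowercase the text, keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()


# Invented sample sentence, not taken from biology.txt.
sample = ("The phage injects the DNA into the cell "
          "and the DNA of the phage is copied.")

for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

Even in this tiny sample, the function word “the” ranks first, followed by the content words that reveal the topic.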
3.2 Keyword List
This tool identifies words with an unusually high frequency by comparing the words that occur in the corpus (biology.txt) with those in a reference corpus (british.txt).
Since the words of the two texts are compared with each other and sorted by keyness, the unusually high frequency of biological words such as “dna”, “phage”, “plasmid” etc. clarifies the genre of the text.
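The comparison against a reference corpus can be sketched as follows. The text does not state which keyness statistic was used, so the log-likelihood measure below is an assumption. The two sample texts are invented, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenization: lowercased runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word that occurs a times in a target
    corpus of c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_target)
    if b > 0:
        total += b * math.log(b / expected_reference)
    return 2 * total


def keyword_list(target_text, reference_text):
    """Return (word, keyness) pairs for words that are relatively more
    frequent in the target text, highest keyness first."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented sample texts standing in for biology.txt and british.txt.
target_text = "the phage dna enters the cell and the plasmid dna is copied"
reference_text = ("the weather in the north of the country is mild "
                  "and the south is warm")

keywords = keyword_list(target_text, reference_text)
print(keywords[0][0])
```

Function words such as “the” are frequent in both texts and therefore receive no or low keyness, while topic-specific words rise to the top of the list. The sketch assumes that both texts are non-empty.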
3.3 Collocational Behaviour
The AntConc Collocates Tool shows which words typically occur together in a corpus and therefore makes it possible to investigate non-sequential patterns within a language. AntConc can order collocates by frequency, by the value of a statistical measure, by their frequency to the right or to the left of the search term, and by the start or end of the word. With the adjective “genetic” as the search term, it becomes clear that it most often collocates with nouns such as “transposition” and “manipulation”. Adjective-noun collocations are very common since adjectives describe nouns and give more information about them.
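Counting collocates to the left and right of a search term can be sketched as follows. This simplified version only counts raw frequencies within a fixed window and does not compute a statistical measure. The function name, the window sizes and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def collocates(text, node, left=1, right=1):
    """Count the words found within `left` tokens before and `right`
    tokens after each occurrence of the search term `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    left_counts, right_counts = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left_counts.update(tokens[max(0, i - left):i])
        right_counts.update(tokens[i + 1:i + 1 + right])
    return left_counts, right_counts


# Invented sample sentence, not taken from biology.txt.
sample_text = ("Genetic manipulation relies on genetic transposition, "
               "and genetic manipulation is common.")

left_counts, right_counts = collocates(sample_text, "genetic")
print(right_counts.most_common())
```

Sorting the right-hand counts by frequency corresponds to one of the orderings described above. In the sample, the nouns that follow “genetic” appear at the top of the list.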
[...]
1 Tony McEnery and Andrew Wilson, Corpora and Translation: Uses and Future Prospects (1993), p. 1.
2 Geoffrey Leech, “Corpora and Theories of Linguistic Performance”, in Directions in Corpus Linguistics (Mouton de Gruyter, 1991), p. 108.