Introduction to Language Resources for Translators


Seminar Paper, 2017

21 Pages, Grade: 1,0


Excerpt


Table of Contents

1. Introduction

2. Exercise 1: Corpora: Definition, Corpus Parts, Kinds of Corpora and their Usage
2.1 Definition
2.2 Corpus Parts
2.3 Kinds of Corpora
2.4 Usage of Corpora

3. Exercise 2: AntConc: Frequency List, Keyword List, Collocations, N-grams/Cluster, Concordance Plot
3.1 Frequency List
3.2 Keyword List
3.3 Collocational Behaviour
3.4 Comparison between Collocates and Clusters/N-Grams
3.5 Concordance Plot

4. Exercise 3: Using TreeTagger for Annot., Analy. Annot. Errors, Analy. the Tagset
4.1 TreeTagger for Annotation of biology.txt
4.2 Analyzing the Annotation Errors
4.3 Analyzing the Tagset

5. Exercise 4: Creating an XML File, Def. Tags for Metad., Def. some new Tags
6. Exercise 5: Search in BNC, Usage of Semantically Related Words, Using DWDS
6.1 Search in BNC
6.2 Usage of Semantically Related Words
6.3 Using DWDS

7. Exercise 6: Analyzing German Support Verb Constructions with CQPWeb

9. Exercise 8: Passives in English and German and Translational Universals
9.1 Translational Differences of English and German Passives
9.2 Translational Universals: Shining Through and Normalization

10. Exercise 9: Synonyms and Antonyms

11. Exercise 10: Terminology Database MultiTerm

Bibliography

1. Introduction

In the seminar „Ausgewählte Themen der Maschinellen Übersetzung und der Fachkommunikation: Introduction to Language Resources for Translators”, different corpora and translation tools were used that can assist the translation process and help to ensure a high level of translation quality. This paper documents the exercises that were carried out with these corpora and tools.

2. Exercise 1: Corpora: Definition, Corpus Parts, Kinds of Corpora and their Usage

2.1 Definition

The term “corpus” can be defined as a collection of spoken or written utterances that exist in machine-readable form (annotated corpora) or in their “raw” state (unannotated corpora) and that are used for linguistic tasks. However, the term is mainly used to refer to the machine-readable variety.1

2.2 Corpus Parts

For a corpus to be usable with different kinds of data and across different projects, it has to be standardized to a certain extent. Therefore, every corpus consists of three parts:

- Raw data, which is the “authentic” language data, such as video/audio data, transcriptions of spoken language or texts in their physical form.
- Metadata, or the “data about data”, which describe the primary data within the corpus, e.g. by giving information about genre, data size, format, authors etc.
- Annotations, which add linguistic analysis to the raw data. In spoken-language data, the speech signal is first segmented into phonemes, words, sentences etc. The data is then transcribed, so that the exact wording of an utterance is reproduced as text or in phonetic symbols. In text corpora, segmentation captures the structure of the text, e.g. word boundaries (tokens), paragraphs, footnotes, syntactic phrases and their functions. Furthermore, extralinguistic interpretations such as emotions, meaning, facial expressions and gestures can be recorded.

2.3 Kinds of Corpora

Besides the aforementioned annotated and unannotated corpora, there are various other kinds of corpora:

- Monolingual Corpora are collected texts written in one language.
- Parallel Corpora hold the same text in more than one language. Typically, these corpora are bilingual rather than multilingual. However, parallel corpora raise questions such as “which word is the translation of which word?” or “which sentences are translations of which?”
- Parallel Aligned Corpora avoid these issues by aligning the sentences and word units that are translations of one another.
- A Comparable Corpus contains similar texts, e.g. texts of the same genre, written in more than one language or language variety.
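The difference between a plain parallel corpus and an aligned one can be illustrated with a small data structure. The following is only a sketch: the class name `AlignedPair`, its fields and the example sentence pair are hypothetical and do not come from any corpus used in the seminar. Each word link is stored as a pair of token positions (source index, target index).

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """One sentence pair of a (hypothetical) parallel aligned corpus."""
    source: list[str]                      # tokens of the source sentence
    target: list[str]                      # tokens of the target sentence
    links: list[tuple[int, int]] = field(default_factory=list)

    def translations_of(self, word):
        """Return all target tokens that are linked to the given source word."""
        return [self.target[j] for i, j in self.links if self.source[i] == word]


# Invented example pair (English -> German), aligned word by word.
pair = AlignedPair(
    source=["the", "phage", "infects", "the", "cell"],
    target=["der", "Phage", "infiziert", "die", "Zelle"],
    links=[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
)

print(pair.translations_of("the"))
print(pair.translations_of("cell"))
```

With the links in place, the question “which word is the translation of which word?” becomes a simple lookup, whereas an unaligned parallel corpus would only provide the two token lists.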

2.4 Usage of Corpora

Corpora are used to analyze lexical, syntactic and semantic/pragmatic aspects and to compare different languages and language registers with each other. The observations made in these areas can not only support translation, they can also bring clarity to areas such as historical linguistics and language acquisition. In order to be applicable in those fields, the corpus must be multifunctional and reusable. Therefore, it needs to conform to standards of falsifiability (the model can be tested on different samples of corpus material and can be replaced by a better fitting model if necessary), completeness (the model has to account for unrestricted data), and objectivity (the model can be tested objectively by observers who do not have an emotional connection to its success or failure).2

3. Exercise 2: AntConc: Frequency List, Keyword List, Collocations, N-grams/Cluster, Concordance Plot

There are different research methods within the field of Corpus Linguistics. One of them is called Annotation. Annotation applies and extracts a certain structure within the corpus. The programme AntConc offers several tools for carrying out corpus-linguistic research and data-driven learning:

3.1 Frequency List

This AntConc tool counts all the words in the corpus and presents them in an ordered list, which shows which words are the most frequent in the corpus. This can help to study the type of vocabulary used in the text, to identify common word clusters and grammatical patterns and to compare the frequency of a word in different text files.

The most frequent word in the given “biology text” is the definite article “the”. This is not very surprising as “the” is the most frequent word in the English language3. Other than that, many prepositions such as “of”, “in”, “to”, “by”, “for” and “with” occur in the text since they indicate the relationship of a noun/pronoun to another text element. Furthermore, the conjunction “and”, which connects two ideas with each other, has a very high frequency in the text. Besides that, the high frequency of the words “dna” and “phage” makes clear that the given text deals with a biological topic.
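The counting step behind such a frequency list can be sketched in a few lines of Python. This is a minimal sketch and not AntConc's actual implementation: the function name `frequency_list`, the simple tokenization rule (lowercased runs of letters and digits) and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency."""
    # Assumption: tokens are lowercased runs of letters/digits,
    # optionally joined by an apostrophe or a hyphen.
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    return Counter(tokens).most_common()


# Invented sample sentence, not taken from biology.txt.
sample = "The phage injects the DNA, and the DNA of the phage is copied."
freq = frequency_list(sample)

for word, count in freq[:3]:
    print(word, count)
```

Even in this tiny sample the function word “the” ranks first, followed by the content words that indicate the topic.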

3.2 Keyword List

With this tool, words with an unusually high frequency can be identified by comparing the words occurring in the corpus (biology.txt) with those in a reference corpus (british.txt).

Since the words of the two texts are compared to each other and sorted by keyness, the unusually high frequency of biological words such as “dna”, “phage”, “plasmid” etc. clarifies the genre of the text.
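How a keyness ranking can be computed is sketched below. The sketch assumes the log-likelihood statistic, which is one common keyness measure; which measure and settings were used in the exercise is not stated in the text. The function names and the two token lists are hypothetical and merely stand in for biology.txt and british.txt.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one word.

    a, b: frequency of the word in the target / reference corpus
    c, d: total number of tokens in the target / reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    return 2 * score


def keyword_list(target_tokens, reference_tokens):
    """Return (word, keyness) pairs, highest keyness first."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        # Keep only words that are relatively more frequent in the target corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented mini corpora for illustration.
target_tokens = "the dna of the phage and the dna of the plasmid".split()
reference_tokens = "the cat sat on the mat and the dog sat by the door".split()

keywords = keyword_list(target_tokens, reference_tokens)
for word, score in keywords[:3]:
    print(word, round(score, 2))
```

Words such as “the” occur with similar relative frequency in both token lists and therefore receive a keyness close to zero, while words that are absent from the reference corpus rise to the top of the list.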

3.3 Collocational Behaviour

The AntConc Collocates Tool shows which words typically occur together in a corpus and therefore makes it possible to investigate non-sequential patterns within a language. AntConc is able to order collocates by frequency, by the value of a statistical measure, by the frequency on the right or on the left side of the search term and by the start or end of the word. When the adjective “genetic” is chosen as a search term, it becomes clear that it most often collocates with nouns such as “transposition” and “manipulation”. Adjective-noun collocations are very common since adjectives describe nouns and give more information about them.
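Counting collocates to the left and right of a search term can be sketched as follows. This is an illustration under stated assumptions and not AntConc's own code: the function name `collocates`, the window parameter `span` and the sample token list are hypothetical, and only raw frequencies are computed, without a statistical measure.

```python
from collections import Counter


def collocates(tokens, node, span=1):
    """Count words within `span` tokens to the left/right of `node`.

    Returns (word, total, left_count, right_count) tuples,
    ordered by total frequency.
    """
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])
    total = left + right
    return [(w, n, left[w], right[w]) for w, n in total.most_common()]


# Invented token sequence for illustration.
tokens = ("genetic manipulation and genetic transposition "
          "rely on genetic manipulation").split()

for row in collocates(tokens, "genetic", span=1):
    print(row)
```

Keeping the left and right counts separate mirrors the sorting options described above, so the same result can be ordered by overall frequency or by the frequency on one side of the search term.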

[...]


1 Tony McEnery and Andrew Wilson, Corpora and Translation: Uses and Future Prospects (1993), p. 1.

2 Geoffrey Leech, “Corpora and Theories of Linguistic Performance”, in Directions in Corpus Linguistics (Mouton de Gruyter, 1991), p. 108.

Excerpt out of 21 pages

Details

Title
Introduction to Language Resources for Translators
College
Saarland University
Grade
1,0
Author
Year
2017
Pages
21
Catalog Number
V1187657
ISBN (eBook)
9783346623799
Language
English
Keywords
Linguistik, Translation, Übersetzen, Sprachwissenschaft, Language Science, Computer Linguistics
Quote paper
Marie-Louise Meiser (Author), 2017, Introduction to Language Resources for Translators, Munich, GRIN Verlag, https://www.grin.com/document/1187657
