This Bachelor of Arts thesis contributes to the CREAM project, a collaboration between Novartis Pharma AG and Bielefeld University. The thesis explores a method called n-gram modeling, which supplies its user with information about the frequency with which words are used. This information is needed to improve a database the CREAM project works on, specifically by calculating probabilities over search queries sent to the database. The thesis consists of five chapters. The first chapter introduces the CREAM project and the database. The second chapter surveys the current state of n-gram modeling and its treatment in contemporary literature. The third chapter deals extensively with how corpora have to be prepared for analysis and how n-gram models can be computed in terms of the frequency distribution of words. Chapter four introduces computer code that extracts n-grams from a corpus. Finally, chapter five evaluates the information retrieved by the code and gives an outlook on future work.
Due to copyright restrictions, the appendix is not included in this publication.
Table of Contents
1 Introduction to CREAM
1.1 The Corpus Research for Exploitation of Annotated Metadata project (CREAM)
1.1.1 The main topic
1.1.2 The Corpus - eNova Database
1.1.2.1 eNova Application and Sample Walk Through
1.1.3 The main goal
2 State of the Art
2.1 Contributions to Word Prediction and Word Probability
2.1.1 Current Contributions to n-Gram Analysis
2.2 Applied n-Gram Analysis
2.2.1 National Security
2.2.2 Spelling Correction
2.2.2.1 Other Areas Related to Spelling Correction
3 n-Grams - Word Count and n-Gram Modeling
3.1 Introduction to Word Count in Corpora
3.2 Tokenization - Word Segmentation in Corpora
3.2.1 Word Types vs. Word Tokens
3.2.2 Stemming and Lemmatization
3.2.3 Non-word Characters
3.3 Parameters of Tokenization
3.3.1 Compounding and Words Separated by Whitespace
3.3.2 Hyphens
3.3.3 Case-(In-)Sensitivity
3.3.4 Other Cases in English
3.3.5 Stop Words
3.3.6 Other Languages
3.4 Introduction to Word Probability and Word Prediction
3.4.1 Markov Assumption and n-Gram Modeling
3.4.2 n-Grams - n-Token Sequences of Words
3.4.3 Simple n-Gram Analysis - Maximum Likelihood Estimation
3.5 n-Gram Analysis over Sample Corpus
4 Waterfall Model
4.1 Introduction
4.2 Introduction to the Waterfall Model of Software Development
4.2.1 Requirement Analysis
4.2.2 Specification Phase
4.2.3 Design
4.2.4 Implementation and Testing - Monogram Code
4.2.5 Implementation and Testing - Bigram Code
4.2.6 Integration
5 Interpretation of n-Gram Retrieval
5.1 Introduction
5.2 Presentation of Monograms
5.2.1 Interpretation of Monograms
5.2.2 Spelling Correction
5.3 Bigrams
5.3.1 Relative Frequency of Sample Bigrams
5.4 Trigrams
5.5 Main Interpretation and Forecast
Objectives and Research Themes
The primary goal of this thesis is to improve the user experience and guided search functionality of the eNova pharmaceutical database through the application of n-gram modeling. By analyzing historical search queries, the work aims to infer user search strategies, predict intended keyword combinations, and optimize database performance to support both scientific and expert users.
- Application of n-gram modeling for word probability analysis in a specialized pharmaceutical database.
- Methodological preparation of data corpora, including tokenization, stop word filtering, and handling of domain-specific constraints.
- Implementation of Ruby-based computational tools to extract monograms, bigrams, and trigrams from search query logs (a minimal sketch follows this list).
- Evaluation of search patterns and performance improvements for the eNova database interface.
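The thesis's actual scripts are not reproduced on this page; the following is a minimal Ruby sketch of the kind of extraction described above. The file name, the tokenization rule, and the treatment of Boolean operators as stop words (mentioned in the FAQ below) are assumptions for illustration.

# Illustrative sketch only; file name, tokenization rule, and the
# stop-word list are assumptions, not the thesis's actual scripts.
BOOLEAN_OPS = %w[and or not].freeze

def tokenize(query)
  query.downcase.scan(/[\w.'-]+/) - BOOLEAN_OPS
end

def ngram_counts(queries, n)
  counts = Hash.new(0)
  queries.each do |q|
    tokenize(q).each_cons(n) { |gram| counts[gram.join(" ")] += 1 }
  end
  counts
end

queries   = File.readlines("queries.txt", chomp: true)  # hypothetical log file
monograms = ngram_counts(queries, 1)
bigrams   = ngram_counts(queries, 2)
trigrams  = ngram_counts(queries, 3)

Counting n-grams per query, rather than over one concatenated token stream, keeps n-grams from spanning two unrelated searches.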
Excerpt from the Book
1.1.2 The Corpus - eNova Database
As the main goal of the CREAM project is to exploit large databases, we will first take a look at the corpus. The corpus we are going to work with consists of search queries submitted to the Novartis Corporate Drug Literature Database, also called eNova. With eNova, Novartis created a system that supplies customers with information about its own products (drugs). Hence, Novartis provides scientists, doctors, internal experts, and the like with a thorough and large database of pharmaceutical keywords and expert-written articles about a given drug and related issues, such as contributing authors, respective journals, drug side effects, and so on. The unique features of the drug literature database include, for instance: “1) a comprehensive coverage of products (drugs) by Novartis, 2) customized Novartis drug specific abstracts, [...], and 4) direct access to the full text of most articles” [Mas, 2008]. eNova is therefore a drug literature database of excellent quality and quantity. Furthermore, every search executed over eNova is saved to disk for further analysis. The analysis of this corpus is part of this thesis and will be dealt with throughout the following chapters. What follows are some sample lines taken from the corpus of executed searches, to give a first impression of what the “raw” corpus looks like:
ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"
41a544e889f3807af4bcb5340d5ed3e5,,00290:00290.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#3 ) and (#2 )"
41a544e889f3807af4bcb5340d5ed3e5,,00072:00072.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#5 ) and ( 'TP FTY720'.PRN )"
As we will focus on the occurring keywords, we will refer to the actual term (such as xolair) as the stem and to its specification (such as .prn) as its suffix. A detailed list of what the suffixes stand for is given in appendix A. It is important to note that we will be working with two different corpora. One consists of internal expert search queries, based on searches by experts within Novartis’s department
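To make the record layout in the excerpt concrete, here is a hedged sketch of how one such raw line might be split into its fields and each keyword divided into stem and suffix. The field interpretation (session hash, empty field, query string) is inferred from the sample lines, not documented in the excerpt.

# Field layout inferred from the sample lines above (an assumption):
# <session hash>,,<query string, sometimes quoted>
line = 'ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"'

session_hash, _, raw_query = line.split(",", 3)
query = raw_query.delete('"')    # drop the surrounding quotes

# Split each keyword into stem and suffix, e.g. "xolair.prn" -> xolair / prn
query.scan(/(\w+)\.([a-z]+)/i).each do |stem, suffix|
  puts "stem=#{stem} suffix=#{suffix}"
end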
Summary of Chapters
1 Introduction to CREAM: Introduces the CREAM project and the eNova pharmaceutical database, setting the stage for the research objective of utilizing n-gram modeling to enhance guided search capabilities.
2 State of the Art: Provides a theoretical overview of word prediction, n-gram modeling, and related computational linguistics concepts, including historical context like the Shannon Game.
3 n-Grams - Word Count and n-Gram Modeling: Discusses the methodology of corpus preparation, focusing on tokenization, lemmatization, and the mathematical foundations of word probability and n-gram sequences.
4 Waterfall Model: Details the software development process used to create the Ruby-based n-gram analysis tools, covering requirement analysis, design, implementation, and integration of the code.
5 Interpretation of n-Gram Retrieval: Presents the empirical results of the n-gram analysis performed on the eNova database, including interpretations of monograms, bigrams, and trigrams, and concludes with a forecast for future work.
Keywords
n-gram modeling, corpus analysis, pharmaceutical database, eNova, Novartis, tokenization, word probability, Markov assumption, information retrieval, Ruby, software development, waterfall model, spelling correction, search strategy, lexicography
Frequently Asked Questions
What is the core focus of this research?
The work focuses on analyzing search query data from the eNova pharmaceutical database to understand user search behavior and improve the efficiency of guided search interfaces through n-gram modeling.
Which fields are covered in this study?
The study integrates computational linguistics and software engineering, applying natural language processing (NLP) techniques to optimize professional, domain-specific search engines.
What is the primary objective of the thesis?
The main goal is to implement a method for predicting keyword combinations that users are likely to look for, thereby enabling more intuitive and powerful search interfaces using picklists and statistical distribution data.
What methodology is employed?
The research uses n-gram analysis—specifically monograms, bigrams, and trigrams—combined with the Maximum Likelihood Estimation (MLE) method to calculate word probabilities within the provided search corpora.
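For the bigram case, the MLE estimate is simply the bigram count divided by the count of its first word: P(w2 | w1) = C(w1 w2) / C(w1). A minimal sketch follows; the keywords and counts are invented for illustration, not taken from the thesis.

# MLE for a bigram: P(w2 | w1) = count(w1 w2) / count(w1).
# Counts below are invented for illustration.
monograms = { "xolair" => 120 }
bigrams   = { "xolair asthma" => 30 }

def mle(bigrams, monograms, w1, w2)
  bigrams.fetch("#{w1} #{w2}", 0).to_f / monograms.fetch(w1)
end

puts mle(bigrams, monograms, "xolair", "asthma")  # => 0.25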
How is the main body structured?
The main body follows a software development lifecycle approach (Waterfall Model) to conceptualize, design, and implement customized Ruby code that parses search queries and generates statistical models.
Which keywords characterize this study?
Key terms include n-gram modeling, corpus analysis, eNova, Novartis, tokenization, and Markov assumption.
How does the author handle spelling errors in the data?
The author notes that while spelling errors exist (approx. 0.7-1.1%), they are deemed minor enough not to significantly skew the statistical results, though future improvements might incorporate automated spelling correction.
How is the Ruby code utilized?
Three distinct Ruby scripts are used to handle different levels of n-gram analysis (monograms, bigrams, and trigrams), specifically filtered for pharma-industry search syntax (e.g., ignoring Boolean operators).
Why are there two different corpora in the analysis?
The study contrasts "internal expert" queries with "external expert" queries to demonstrate differences in search strategies and keyword usage between different user groups.
What is the conclusion regarding future developments?
The author suggests that future work should focus on optimizing code for syntactic case-insensitivity and shifting the focus towards analyzing suffixes to better categorize generic search intents.
Citation
Bohnes, Marc (2009): Word prediction and word probability exemplified in searches over a pharmaceutical database. Munich: GRIN Verlag. https://www.grin.com/document/199178