This Bachelor of Arts thesis contributes to the CREAM project, a collaboration between Novartis Pharma AG and Bielefeld University. The thesis examines a method called n-gram modeling, which provides information about how frequently words are used. This information is needed to improve a database that the CREAM project works on; the improvement concerns the calculation of probabilities for search queries sent to the database. The thesis consists of five chapters. The first chapter introduces the CREAM project and the database. The second chapter surveys the current state of n-gram modeling and where it appears in contemporary literature. The third chapter deals extensively with how corpora have to be prepared for analysis and how n-gram models can be computed from the frequency distribution of words. Chapter four introduces computer code that uses a corpus to obtain certain n-grams. Finally, chapter five evaluates the information retrieved by the code and gives an outlook on future work.
Because it contains copyright-protected material, the appendix is not part of the thesis.
Table of Contents
- Introduction to CREAM
- The Corpus Research for Exploitation of Annotated Metadata project (CREAM)
- The main topic
- The Corpus eNova Database
- eNova Application and Sample Walk Through
- The main goal
- State of the Art
- Contributions to Word Prediction and Word Probability
- Current Contributions to n-Gram Analysis
- Applied n-Gram Analysis
- National Security
- Spelling Correction
- Other Areas Related to Spelling Correction
- n-Grams - Word Count and n-Gram Modeling
- Introduction to Word Count in Corpora
- Tokenization - Word Segmentation in Corpora
- Word Types vs. Word Tokens
- Stemming and Lemmatization
- Non-word Characters
- Parameters of Tokenization
- Compounding and Words Separated by Whitespace
- Hyphens
- Case-(In-)Sensitivity
Objectives and Key Themes
This Bachelor of Arts thesis aims to contribute to the CREAM project, a collaborative effort between Novartis Pharma AG and Bielefeld University, by exploring the use of n-gram modeling for improving a database used within the project. The thesis examines the potential of n-gram modeling to calculate probabilities in search queries submitted to the database, ultimately enhancing search efficiency and effectiveness. Key themes explored in the thesis include:
- N-gram modeling and its application in word prediction and probability calculation
- The role of corpora in n-gram analysis, particularly in preparing and analyzing data
- Practical applications of n-gram modeling in areas such as national security and spelling correction
- The methodology of tokenization and its importance in processing corpora for n-gram analysis
- The development and implementation of computer code to extract n-grams from corpora
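The themes listed above can be illustrated with a minimal Python sketch. It is not the code from the thesis: the function names (`tokenize`, `ngrams`, `bigram_probability`) and the parameters `lowercase` and `split_hyphens` are hypothetical and chosen only to mirror the tokenization parameters named in the table of contents (case sensitivity, hyphens, non-word characters). The probability shown is the standard maximum-likelihood bigram estimate, count(w1 w2) / count(w1), which is assumed here as one plausible way to score word sequences in a search query.

```python
import re
from collections import Counter


def tokenize(text, lowercase=True, split_hyphens=False):
    """Split raw text into word tokens (hypothetical helper, not the thesis code)."""
    if lowercase:
        # Case-insensitive counting: "The" and "the" become one word type.
        text = text.lower()
    if split_hyphens:
        # Treat "drug-induced" as two tokens instead of one.
        text = text.replace("-", " ")
    # Keep runs of letters/digits, allowing internal hyphens;
    # punctuation and other non-word characters are dropped.
    return re.findall(r"[^\W_]+(?:-[^\W_]+)*", text)


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(tokens, first, second):
    """Maximum-likelihood estimate of P(second | first) from raw counts."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    if unigram_counts[first] == 0:
        return 0.0  # unseen context word: no estimate without smoothing
    return bigram_counts[(first, second)] / unigram_counts[first]


if __name__ == "__main__":
    sample = "The drug-induced rash. The drug works; the rash fades."
    tokens = tokenize(sample)
    print("tokens:", len(tokens), "types:", len(set(tokens)))
    print("P(rash | the) =", bigram_probability(tokens, "the", "rash"))
```

Changing `lowercase` or `split_hyphens` changes both the token count and the type count for the same text, which is why tokenization decisions have to be fixed before n-gram frequencies are compared. The sketch applies no smoothing, so any word pair that does not occur in the corpus receives probability zero.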
Chapter Summaries
Chapter 1 introduces the CREAM project and its primary goals. It focuses on the Corpus eNova database, providing a description of the database and its application within the project. This chapter also highlights the importance of calculating probabilities in search queries submitted to the database.
Chapter 2 delves into the current state of the art in n-gram modeling and its contributions to word prediction and probability calculation. It examines existing research on n-gram analysis and discusses its practical applications in areas such as national security and spelling correction.
Chapter 3 explores the process of preparing corpora for n-gram analysis. It covers aspects of tokenization, including word segmentation, word types vs. word tokens, stemming, lemmatization, and the handling of non-word characters. The chapter also discusses different parameters involved in tokenization, such as compounding, hyphens, and case sensitivity.
Chapter 4 introduces computer code specifically designed to extract n-grams from corpora. This chapter explains the functionality and usage of the code, demonstrating its ability to retrieve relevant n-grams from a given corpus.
Keywords
The primary keywords and focus topics of the thesis include n-gram modeling, word prediction, word probability, corpora, tokenization, spelling correction, national security, and the CREAM project. This thesis examines the potential of n-gram analysis in improving search efficiency and effectiveness within the CREAM project's database. The research focuses on applying n-gram modeling techniques to enhance user experience and optimize search results.
- Quote paper
- B.A. Marc Bohnes (Author), 2009, Word prediction and word probability exemplified in searches over a pharmaceutical database, Munich, GRIN Verlag, https://www.grin.com/document/199178