This Bachelor of Arts thesis contributes to the CREAM project, a collaboration between Novartis Pharma AG and Bielefeld University. The thesis explores a method called n-gram modeling, which supplies its user with information about the frequency with which words are used. This information is needed to improve a database the CREAM project works on, specifically by calculating probabilities over search queries sent to the database. The thesis consists of five chapters. The first chapter introduces the CREAM project and the database. The second chapter surveys the current state of n-gram modeling and its treatment in contemporary literature. The third chapter deals extensively with how corpora have to be prepared for analysis and how n-gram models can be computed in terms of the frequency distribution of words. Chapter four introduces computer code that extracts n-grams from a corpus. Finally, chapter five evaluates the information retrieved by the code and gives an outlook on future work.
Due to copyright restrictions, the appendix is not included in this publication.
Table of Contents
1 Introduction to CREAM
1.1 The Corpus Research for Exploitation of Annotated Metadata project (CREAM)
1.1.1 The main topic
1.1.2 The Corpus - eNova Database
1.1.2.1 eNova Application and Sample Walk Through
1.1.3 The main goal
2 State of the Art
2.1 Contributions to Word Prediction and Word Probability
2.1.1 Current Contributions to n-Gram Analysis
2.2 Applied n-Gram Analysis
2.2.1 National Security
2.2.2 Spelling Correction
2.2.2.1 Other Areas Related to Spelling Correction
3 n-Grams - Word Count and n-Gram Modeling
3.1 Introduction to Word Count in Corpora
3.2 Tokenization - Word Segmentation in Corpora
3.2.1 Word Types vs. Word Tokens
3.2.2 Stemming and Lemmatization
3.2.3 Non-word Characters
3.3 Parameters of Tokenization
3.3.1 Compounding and Words Separated by Whitespace
3.3.2 Hyphens
3.3.3 Case-(In-)Sensitivity
3.3.4 Other Cases in English
3.3.5 Stop Words
3.3.6 Other Languages
3.4 Introduction to Word Probability and Word Prediction
3.4.1 Markov Assumption and n-Gram Modeling
3.4.2 n-Grams - n-Token Sequences of Words
3.4.3 Simple n-Gram Analysis - Maximum Likelihood Estimation
3.5 n-Gram Analysis over Sample Corpus
4 Waterfall Model
4.1 Introduction
4.2 Introduction to the Waterfall Model of Software Development
4.2.1 Requirement Analysis
4.2.2 Specification Phase
4.2.3 Design
4.2.4 Implementation and Testing - Monogram Code
4.2.5 Implementation and Testing - Bigram Code
4.2.6 Integration
5 Interpretation of n-Gram Retrieval
5.1 Introduction
5.2 Presentation of Monograms
5.2.1 Interpretation of Monograms
5.2.2 Spelling Correction
5.3 Bigrams
5.3.1 Relative Frequency of Sample Bigrams
5.4 Trigrams
5.5 Main Interpretation and Forecast
Objectives and Research Themes
The primary goal of this thesis is to improve the user experience and guided search functionality of the eNova pharmaceutical database through the application of n-gram modeling. By analyzing historical search queries, the work aims to infer user search strategies, predict intended keyword combinations, and optimize database performance to support both scientific and expert users.
- Application of n-gram modeling for word probability analysis in a specialized pharmaceutical database.
- Methodological preparation of data corpora, including tokenization, stop word filtering, and handling of domain-specific constraints.
- Implementation of Ruby-based computational tools to extract monograms, bigrams, and trigrams from search query logs (a minimal sketch follows this list).
- Evaluation of search patterns and performance improvements for the eNova database interface.
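The thesis's actual scripts are not reproduced on this page; the following is a minimal Ruby sketch of the kind of extraction described above. The file name, the tokenization rule, and the treatment of Boolean operators as stop words (mentioned in the FAQ below) are assumptions for illustration.

# Illustrative sketch only; file name, tokenization rule, and the
# stop-word list are assumptions, not the thesis's actual scripts.
BOOLEAN_OPS = %w[and or not].freeze

def tokenize(query)
  query.downcase.scan(/[\w.'-]+/) - BOOLEAN_OPS
end

def ngram_counts(queries, n)
  counts = Hash.new(0)
  queries.each do |q|
    tokenize(q).each_cons(n) { |gram| counts[gram.join(" ")] += 1 }
  end
  counts
end

queries   = File.readlines("queries.txt", chomp: true)  # hypothetical log file
monograms = ngram_counts(queries, 1)
bigrams   = ngram_counts(queries, 2)
trigrams  = ngram_counts(queries, 3)

Counting n-grams per query, rather than over one concatenated token stream, keeps n-grams from spanning two unrelated searches.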
Excerpt from the Book
1.1.2 The Corpus - eNova Database
As the main goal of the CREAM project is to exploit large databases, we will first take a look at the corpus. The corpus we are going to work with consists of search queries submitted to the Novartis Corporate Drug Literature Database, also called eNova. With eNova, Novartis created a system that supplies customers with information about its own products (drugs). Hence, Novartis provides scientists, doctors, internal experts, and the like with a thorough and large database of pharmaceutical keywords and expert-written articles about a given drug and related issues, such as contributing authors, respective journals, drug side effects, and so on. The unique features of the drug literature database include, for instance: “1) a comprehensive coverage of products (drugs) by Novartis, 2) customized Novartis drug specific abstracts, [...], and 4) direct access to the full text of most articles” [Mas, 2008]. eNova is therefore a drug literature database of excellent quality and quantity. Furthermore, every search executed over eNova is saved to disk for further analysis. The analysis of this corpus is part of this thesis and will be dealt with throughout the following chapters. What follows are some sample lines taken from the corpus of executed searches, to give a first impression of what the “raw” corpus looks like:
ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"
41a544e889f3807af4bcb5340d5ed3e5,,00290:00290.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#3 ) and (#2 )"
41a544e889f3807af4bcb5340d5ed3e5,,00072:00072.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#5 ) and ( 'TP FTY720'.PRN )"
As we will focus on the occurring keywords, we will refer to the actual term (such as xolair) as the stem and to its specification (such as .prn) as its suffix. A detailed list of what the suffixes stand for is given in appendix A. It is important to note that we will be working with two different corpora. One consists of internal expert search queries, based on searches by experts within Novartis’s department
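To make the record layout in the excerpt concrete, here is a hedged sketch of how one such raw line might be split into its fields and each keyword divided into stem and suffix. The field interpretation (session hash, empty field, query string) is inferred from the sample lines, not documented in the excerpt.

# Field layout inferred from the sample lines above (an assumption):
# <session hash>,,<query string, sometimes quoted>
line = 'ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"'

session_hash, _, raw_query = line.split(",", 3)
query = raw_query.delete('"')    # drop the surrounding quotes

# Split each keyword into stem and suffix, e.g. "xolair.prn" -> xolair / prn
query.scan(/(\w+)\.([a-z]+)/i).each do |stem, suffix|
  puts "stem=#{stem} suffix=#{suffix}"
end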
Summary of Chapters
1 Introduction to CREAM: Introduces the CREAM project and the eNova pharmaceutical database, setting the stage for the research objective of utilizing n-gram modeling to enhance guided search capabilities.
2 State of the Art: Provides a theoretical overview of word prediction, n-gram modeling, and related computational linguistics concepts, including historical context like the Shannon Game.
3 n-Grams - Word Count and n-Gram Modeling: Discusses the methodology of corpus preparation, focusing on tokenization, lemmatization, and the mathematical foundations of word probability and n-gram sequences.
4 Waterfall Model: Details the software development process used to create the Ruby-based n-gram analysis tools, covering requirement analysis, design, implementation, and integration of the code.
5 Interpretation of n-Gram Retrieval: Presents the empirical results of the n-gram analysis performed on the eNova database, including interpretations of monograms, bigrams, and trigrams, and concludes with a forecast for future work.
Keywords
n-gram modeling, corpus analysis, pharmaceutical database, eNova, Novartis, tokenization, word probability, Markov assumption, information retrieval, Ruby, software development, waterfall model, spelling correction, search strategy, lexicography
Frequently Asked Questions
What is the core focus of this research?
The work focuses on analyzing search query data from the eNova pharmaceutical database to understand user search behavior and improve the efficiency of guided search interfaces through n-gram modeling.
Which fields are covered in this study?
The study integrates computational linguistics and software engineering, applying natural language processing (NLP) techniques to optimize professional, domain-specific search engines.
What is the primary objective of the thesis?
The main goal is to implement a method for predicting keyword combinations that users are likely to look for, thereby enabling more intuitive and powerful search interfaces using picklists and statistical distribution data.
What methodology is employed?
The research uses n-gram analysis—specifically monograms, bigrams, and trigrams—combined with the Maximum Likelihood Estimation (MLE) method to calculate word probabilities within the provided search corpora.
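For the bigram case, the MLE estimate is simply the bigram count divided by the count of its first word: P(w2 | w1) = C(w1 w2) / C(w1). A minimal sketch follows; the keywords and counts are invented for illustration, not taken from the thesis.

# MLE for a bigram: P(w2 | w1) = count(w1 w2) / count(w1).
# Counts below are invented for illustration.
monograms = { "xolair" => 120 }
bigrams   = { "xolair asthma" => 30 }

def mle(bigrams, monograms, w1, w2)
  bigrams.fetch("#{w1} #{w2}", 0).to_f / monograms.fetch(w1)
end

puts mle(bigrams, monograms, "xolair", "asthma")  # => 0.25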
How is the main body structured?
The main body follows a software development lifecycle approach (Waterfall Model) to conceptualize, design, and implement customized Ruby code that parses search queries and generates statistical models.
Which keywords characterize this study?
Key terms include n-gram modeling, corpus analysis, eNova, Novartis, tokenization, and Markov assumption.
How does the author handle spelling errors in the data?
The author notes that while spelling errors exist (approx. 0.7-1.1%), they are deemed minor enough not to significantly skew the statistical results, though future improvements might incorporate automated spelling correction.
How is the Ruby code utilized?
Three distinct Ruby scripts are used to handle different levels of n-gram analysis (monograms, bigrams, and trigrams), specifically filtered for pharma-industry search syntax (e.g., ignoring Boolean operators).
Why are there two different corpora in the analysis?
The study contrasts "internal expert" queries with "external expert" queries to demonstrate differences in search strategies and keyword usage between different user groups.
What is the conclusion regarding future developments?
The author suggests that future work should focus on optimizing code for syntactic case-insensitivity and shifting the focus towards analyzing suffixes to better categorize generic search intents.
Citation
Bohnes, Marc (2009): Word prediction and word probability exemplified in searches over a pharmaceutical database. Munich: GRIN Verlag. https://www.grin.com/document/199178