Automatic extraction and processing of document references

A CRF-based approach

Master's Thesis, 2007

67 Pages, Grade: 1.0



1 Introduction
1.1 Project description
1.2 Related tasks
1.3 Problem formalisation
1.4 Evaluation measures
1.5 Approaches to named entity recognition

2 Machine learning approaches to sequence labelling
2.1 Classifier-based approaches
2.2 Probabilistic sequence models
2.2.1 Hidden Markov Models
2.2.2 Maximum Entropy Markov Models
2.2.3 Conditional Random Fields
2.2.4 Comparison of sequence models
2.2.5 Motivation for using CRFs
2.3 Features
2.3.1 Lexical features
2.3.2 Linguistic features
2.3.3 Orthographical features
2.3.4 Formatting
2.3.5 Context features

3 Implementation
3.1 Definition of the named entity
3.2 Data analysis
3.3 CRF implementation
3.4 Data preprocessing
3.4.1 File format conversion - class FileConverter
3.4.2 Extracting potential candidates - class ContextExtractor
3.4.3 Annotation guidelines
3.4.4 Generating training data - class DatasetGenerator
3.5 Experiments
3.5.1 Initial feature types
3.5.2 Tagging scheme and performance measure
3.5.3 Number of models
3.5.4 Additional features
3.6 Critical evaluation
3.7 System overview
3.8 Processing of the extracted references

4 Conclusions and future work
4.1 Conclusions
4.2 Future work
4.2.1 Additional reference types
4.2.2 Improving the model
4.2.3 Additional training data
4.2.4 Precision recall trade-off

Terms of Reference
Schedule and Gantt chart
System documentation
Experimental results
Chi-square example


Thanks a lot to Markus Nosse for valuable advice and continuous support, to Christo Panchev for critical comments on the thesis, to Achim Hässler and Marc Beyerle for contributing their ideas and helpful suggestions, to Kristina Acker and Simone Hartung for proof-reading, and to Christoph Zwirello and my family for moral support.


While reading documents, you often encounter text passages advising you to refer to other documents for more information about a specific topic. These references to other documents are particularly common in technical documents, written for the sole pur- pose of providing the reader with as much relevant information as possible, without rephrasing information that can be found elsewhere. Knowing how the documents in a system are interrelated, i.e. which other documents a document refers to or is referred by, can be extremely helpful when trying to get access to relevant information. A typical example of such a “knowledge net” providing information about document relations is CiteSeer1, a digital library of academic literature. For each document in the library system, CiteSeer displays lists of related documents, such as a list of documents that the current document cites as well as a list of documents that the current document is cited by. The assumption that inspired this thesis is that such lists are not only helpful when reading academic literature but could also assist a reader of technical documents stored in a company’s document management system. The idea was thus to extend an existing document management system by displaying, for each document stored in the system, a list of links to documents that the current document refers to.

As information about how the documents in this system are interrelated was not avail- able, the focus of the project underlying this thesis was on the first step towards solving this task: automatically analyzing documents in order to extract names of related doc- uments. Once all document names mentioned in a document have been extracted, the next step would then be to search for these documents in the system’s database and, in case they have been successfully found, create links to the respective documents.

The outcome of the project was a system that performs the extraction task. It is based on Conditional Random Fields, a machine learning technique introduced by Lafferty et al. (2001), and is able to extract document names from unseen documents, achiev- ing high precision scores (88%) and acceptable recall scores (65%) on a test dataset. The implementation is based on a Java package provided by Sarawagi & Cohen (2005), which was adapted and extended to suit the nature of the task. As the approach is based on supervised learning, the project also involved the generation of appropriate training data.

This thesis is subdivided into four chapters: Chapter 1 introduces the task in more detail, chapter 2 gives an overview of approaches to related tasks, chapter 3 describes the implementation and the experiments that were carried out in order to tune the system, and chapter 4 discusses the results and provides suggestions for future work.

1 Introduction

1.1 Project description

The project underlying this thesis was carried out for IBM Germany Development GmbH in Böblingen, Germany, hereafter referred to as the client, and aimed at developing a tool that identifies parts in text documents that denote references to other documents. Background of the project is a MyDMS1 document management system used by the client for managing technical documentation. Many of the documents stored in this system’s database contain references to other documents in which additional information about a specific topic can be found, e.g. in the form of a sentence like “Please refer to the XYZ Specification for details”. At present, a referenced document can only be found using the system’s search function, which can be tedious and time-consuming. The idea underlying the project is thus to generate, for each document, a list of links to documents that the current document refers to. Such a list could then be displayed along with each document, e.g. on the document’s meta-page in the system, and would allow the user to navigate easily to a document of interest and thereby speed up the process of finding relevant information in related documents. Generating this list can be subdivided into the following subtasks:

1. extracting all document names mentioned in the current document
2. searching for the mentioned documents in the database
3. generating links to these documents
4. displaying the list of links along with the current document

In the project described in this thesis, the focus was on solving the first subtask. Due to the large amount of documents stored in the document management system, automating this step is necessary to avoid having to analyse vast numbers of documents manually.

The goal of the project was thus to develop and evaluate a system that solves the task of automatically extracting document references from text. To reach this goal, the following objectives were defined:

- Research on related tasks and machine learning techniques used for these tasks
- Definition of a set of features
- Generation of training data
- Implementation of a system extracting references
- Demonstration of how the extracted references can be used to extend an existing document management system

1.2 Related tasks

To the best of my knowledge, this was the first attempt to extract document references automatically. However, the task of extracting document references is closely related to other information extraction (IE) tasks such as the extraction of named entities (NEs), a task commonly referred to as named entity recognition (NER). The NER problem was first defined in the context of the Sixth Message Understanding Conference (MUC-6)2 (Sundheim 1996) and focuses at recognizing (and usually classifying) NEs in semi- or unstructured input text. NEs are words or sequences of words that belong to a defined category. Depending on the task, these categories can encompass different types of entities. Common NE categories are organisations, persons and locations (e.g. Sundheim 1996, Sundheim 1998), but domain specific categories are popular as well, e.g. biomedical terms, such as protein, gene, disease or virus names (e.g. Kim et al. 2004, Tsai et al. 2005). NER is usually performed as a sequence labelling task, where the input text is viewed as a sequence of words (or tokens) and each token in this sequence is assigned a label (or tag) according to whether it is part of an NE or not. Other sequence labelling tasks in natural language processing (NLP) include part-of-speech (POS) tagging (the task of assigning

POS tags to the words in an input text), shallow parsing (the task of dividing sentences into non-overlapping phrases, also called phrase chunking), and speech recognition (the task of mapping acoustic signals to a sequence of words). However, most similar to the task at hand is the NER task because it uses word tokens as its input (unlike speech recognition, which is based on acoustic signals), involves a rather restricted number of labels (unlike POS tagging, which usually uses a large tag set), and aims at extracting relevant pieces of information from a large amount of data that is irrelevant to the task (unlike phrase chunking, which aims at assigning structure to the input text and can be considered a preparatory step to IE tasks).

1.3 Problem formalisation

The extraction of document references involves correctly detecting and delimiting these references in a string of text. Viewing the task as an NER problem, it can be formalised as follows: For x = x1 + x2 + ... + xn being a sequence of n tokens and C = c1, c2, ..., cm being a set of m NE categories to be predicted, an NE phrase, denoted as (s,e)c, is a phrase spanning from token xs to token xe, with s ≤ e, assigned to c. A solution to the NER problem is a set of non-overlapping NE phrases, where non-overlapping means that for each pair of NE phrases ((si, ei)c [Abbildung in dieser Leseprobe nicht enthalten] . The goal is to [Abbildung in dieser Leseprobe nicht enthalten] find the correct set of NE phrases for the given token sequence x.

1.4 Evaluation measures

Commonly used evaluation measures for NER and other classification systems are precision and recall. These measures are calculated based on a contingency table (shown in Table 1.1) and compare the tagging predicted by the system for a test dataset to the “correct” tagging, i.e. the tags assigned by a human expert.

Abbildung in dieser Leseprobe nicht enthalten

Table 1.1: Contingency table

Precision measures the proportion of instances the system correctly assigned to category

c (TP) to all instances assigned to c (TP+FP):

precision [Abbildung in dieser Leseprobe nicht enthalten] .

Analogously, recall measures the proportion of instances correctly assigned to category c (TP) to all instances tagged c by the human expert (TP+FN):

recall [Abbildung in dieser Leseprobe nicht enthalten]

For NER tasks, precision and recall can be measured either on the token-level, i.e. based on how many individual tokens have been assigned the correct label, or on the entity- level, i.e. based on how many entities have been found (an entity is only considered to be found if all its tokens are tagged correctly). Measuring performance on the entity-level usually yields slightly worse results. For the task at hand, performance will be measured on the token-level.

As for most tasks, precision and recall do not make much sense in isolation from each other (high levels of precision can be obtained at the price of low levels of recall and vice versa), they are usually combined into a single measure, the so-called F-measure (or Fβ function) introduced by van Rijsbergen (1979):

[Abbildung in dieser Leseprobe nicht enthalten])·precision·recall

[Abbildung in dieser Leseprobe nicht enthalten]·precision+recall

If β = 1, the function attributes equal importance to precision and recall. If β < 1, precision is stressed, analogously, if β > 1, recall is attributed greater importance. In particular, F2 weights recall twice as much as precision, F0.5 weights precision twice as much as recall. For the evaluation of the experiments described in chapter 3, a β value of 1 was used.

Achieving an F-measure score of 100% is not realistic because even human annotation (tagging) is generally not perfect, due to lack of concentration when annotating or short- comings in the clarity and completeness of annotation guidelines. The quality of human annotation is usually measured according to the agreement of the annotation of sev- eral people on the same data. This agreement, referred to as interannotator agreement, ranges from as high as 97% for MUC-7 (Marsh & Perzanowski 1998) to as low as 89% for biomedical NEs (Demetriou & Gaizauskas 2004).

The client requires an F-measure score of about two-thirds (at least 66%) for the system to be considered useful.

1.5 Approaches to named entity recognition

The simplest approach to the NER task is to maintain a lexicon of NEs and their morphological variants and match the input text against this lexicon. However, for most tasks this would require huge lexicons which, e.g. due to the dynamic nature of natural language and spelling variations, would still be far from exhaustive. A more flexible approach is to make use of the fact that NEs of the same category are often found within similar contexts and can share common internal features. McDonald (1996) refers to these two complementary kinds of evidence as internal (i.e. “derived from within the sequence of words that comprise the name”) and external (“provided by the context in which the name appears”). Internal evidence can for example refer to the fact that English proper names are usually capitalised, or to morphological knowledge, e.g. that “land” at the end of a word indicates a location. External evidence can for example refer to the fact that specific words tend to occur in the context of an NE, e.g. the word “Mr.” usually indicates that a person name follows. Successful NER systems integrate both internal and external evidence.

This evidence can be used in two different ways: It can serve as the basis for manually generated rules or it can be used to build a machine learning (ML) model based on statistics concerning this evidence. Early NER systems were often based on manually constructed rules that generally combined regular expressions to define patterns that matched NEs and their contexts. In addition, these systems often made use of lists storing candidates, keywords or exceptions. Rule-based systems for English include SPARSER (McDonald 1996), University of Sheffield’s LaSIE-II (Humphreys et al. 1998), and ISOQuest’s NetOwl (Aone et al. 1998, Krupka & Hausman 1998). Another rule- based system is University of Edinburgh’s LTG system (Mikheev et al. 1998, Mikheev et al. 1999), which uses a hybrid approach by incorporating a learned model and will therefore be discussed later, in section 2.2.2.

The major disadvantage of rule-based approaches is that a lot of expert knowledge and heuristics are required to formulate rules and that these rules are usually specific to language and domain. To improve the robustness and portability of NER systems, most current approaches thus apply ML to build NER systems using annotated training data. ML approaches are based on the idea that, less difficult than formulating rules is to provide examples of what should and should not be tagged as NE. An ML system is then used to automatically find patterns in these training examples and apply these patterns to find NEs in previously unseen text. An overview of work in NER using ML is given in the following chapter.

2 Machine learning approaches to sequence labelling

This chapter provides an overview of machine learning (ML) approaches to NER and other sequence labelling tasks. Unlike rule-based systems, which are built using human intuition, ML-based systems integrate a model learned from given training examples. The idea is to combine different pieces of evidence automatically rather than by manu- ally writing rules. The two essential components of an ML task are training and infer- encing. In supervised learning, training is the process in which the learning algorithm learns a model using the training dataset (a set of labelled examples). Learning is mostly based on statistics about frequencies of events in this dataset and can be understood as learning a set of parameters (known as weights), which reflect the importance of certain features describing the data. The learned model is then applied to unlabelled instances to predict the labels, a process referred to as inferencing.

The first ML-based systems for NER tasks were developed for MUC-6 (Sundheim 1996) but were outperformed by rule-based systems. However, much progress has been made over the last decade, and ML-based systems have now become the standard approach to solving NER tasks. For example, ML was used in all systems participating in the shared task of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), a workshop on biomedical entity recognition held in 2004 (Collier et al. 2004).

ML approaches to NER can be subdivided into classifier-based approaches and approaches based on probabilistic sequence models. This chapter introduces work done in both fields, the focus being on probabilistic models because this approach will be adopted for solving the task at hand. Furthermore, it introduces features commonly used to represent textual input data to ML systems.

2.1 Classifier-based approaches

Early efforts in NER were usually classifier-based. Classifier-based systems look at each token in the input individually and classify it as to whether it is part of an NE or not. The classification result at each position may depend on the previous classifications. The first NER system that integrated a learning component is the Alembic system by Aberdeen et al. (1995), who apply a variant of Eric Brill’s transformation-based error- driven learning algorithm (Brill 1995). The system preclassifies a token according to a simple algorithm (e.g. the most likely NE category for the token in the training data), and, in a second step, improves the classification by applying rules selected automat- ically from a large number of rules generated via transformation. The transformation rules are based on rule skeletons containing variables. With different values assigned to the variables, the rules are evaluated on the training corpus and kept only in case they improve classification performance. Aberdeen et al. (1995) experiment with both automatically-learned and hand-crafted rule sequences but in the version of Alembic submitted for MUC-6 (Sundheim 1996) use hand-crafted sequences only, as they out- perform the rules learned automatically.

Other early approaches (Sekine 1998, Berger 1997) were based on decision trees, which are not suitable for problems with many features and have hardly been applied later. However, a system by Carreras et al. (2002) uses AdaBoost, a boosting algorithm, to combine several decision trees, and outperforms all other systems both on the Spanish and the Dutch test dataset of the Conference on Computational Natural Language Learning (CoNLL-2002)1 (Roth & van den Bosch 2002) shared task.

Memory-based learning, e.g. based on TiMBL (Daelemans et al. 2003), performed com- paratively bad on the CoNLL-2002 (Roth & van den Bosch 2002) and CoNLL-2003 (Daelemans & Osborne 2003) shared tasks. Rössler (2006) suggests that this might be due to TiMBL’s difficulties in dealing with redundant features inherent in natural- language data.

Another method that has lately been applied to NER tasks are Support Vector Ma- chines (SVMs), introduced by Vapnik (1995). SVMs were the most common approach applied to the JNLPBA-2004 shared task but were not successful in isolation because, by concentrating on the classification of single tokens without looking at the NEs as a sequence of tokens, they failed to recognise complex NEs (Rössler 2006). In the system which achieved the best results, Zhou & Su (2004) used SVMs in combination with a probabilistic sequence model (a Hidden Markov Model) to overcome this common problem associated with classifier-based approaches, by taking into account sequential information that is naturally present in textual data.

2.2 Probabilistic sequence models

For the last couple of years, most approaches to sequence labelling problems such as NER have been based on probabilistic sequence models. In these models, two random variables X and Y denote any input token sequences and label sequences, respectively. These models can be used to find the most likely sequence of labels for a given sequence of tokens. The most important probabilistic models used for the NER task are presented in the following.

2.2.1 Hidden Markov Models

Introduction to HMMs

The first ML system to show results competitive with those achieved by rule-based systems is BNN’s IdentiFinder (Miller et al. 1998, Bikel et al. 1999), presented at MUC- 7 (Sundheim 1998), which is based on a Hidden Markov Model (HMM) approach. An HMM is a type of probabilistic finite-state machine, which consists of sequences of states generating sequences of observations. A simple HMM is shown in Figure 2.1, where xi denotes the ith observation of observation sequence x (e.g. the sequence of tokens in a text) and si denotes the ith state of state sequence s (e.g. the sequence of states emitting a label for each token). According to standard notation, the shaded nodes indicate the sequence of observations made, the non-shaded nodes indicate the sequence of hidden states which generated the observations. An HMM is characterised by the number of states, the number of distinct observation symbols per state, the state tran- sition probability distribution (the probability of going from one state to another), the observation symbol (or emission) probability distribution in each state (the probability of a state generating an observation), and the initial state distribution (the probability of the sequence starting in a particular state) (Rabiner 1989). In ML, the state transition probabilities and emission probabilities are learned from training data.

Abbildung in dieser Leseprobe nicht enthalten

Figure 2.1: A simple hidden Markov model

Assuming a one-to-one relationship between states and labels (i.e. each state corresponds to a single label), we can calculate the probability of a sequence of labels y given a sequence of observations x using Bayes Rule:

Abbildung in dieser Leseprobe nicht enthalten

The most likely sequence of labels for sequence x can be found by maximizing P (y|x). Since the unconditioned probability P (x) is constant for a given observation sequence, we only have to maximise P (x, y), the joint probability of the observation sequence and the label sequence.

The calculation of P (x, y) is based on the Markov assumption, which states that the current state is only dependent on the previous state2:

Abbildung in dieser Leseprobe nicht enthalten

where yi is the label corresponding to the ith state, P (yi|yi−1) is the state transition probability from the state of label yi−1 to the state of label yi and P (xi|yi) is the emission probability of the ith observation xi in the state of yi. P (xi|yi) cannot be derived directly but can be calculated using Bayes theorem, from p(yi|xi), which is estimated from training data. An efficient and commonly used algorithm for finding the most probable sequence of states is the Viterbi decoder (Viterbi 1967). For a detailed description, refer to Manning & Schütze (1999).

Named entity recognition with HMMs

HMMs are a popular and well developed tool for modelling sequential data and have been applied successfully to a number of language-related tasks including speech recogni- tion (Rabiner 1989), POS tagging (Kupiec 1992, Lee et al. 2000), text segmentation and topic tracking (van Mulbregt et al. 1998), and information extraction (Leek 1997, Sey- more et al. 1999, Freitag & McCallum 1999). Early HMM-based approaches to NER are by Bikel et al. (1997), Miller et al. (1998), and Bikel et al. (1999). Bikel et al. (1997) describe a system called Nymble, later called IdentiFinder (Miller et al. 1998, Bikel et al. 1999), which uses a separate model for each NE class and is based on bigram (a sequence of two tokens) probabilities. To deal with zero-probabilities due to data sparseness (i.e. bigrams that did not appear in the training data), they back off to less powerful models, the least powerful model being the unigram (a single token) model. The different models are assigned different weights using a smoothing method.

More recent approaches try to make use of the fact that a token can be described in terms of multiple features (cf. section 2.3), which should all be integrated into the model. This is not straightforward with HMMs because, being generative models, they estimate the joint probability P (x, y) of observation sequence x and label sequence y. For an observation being a complex vector of features, calculating P (x, y) requires calculating the joint probability of each label and all possible combinations of feature values, which is estimated based on the co-occurrence of labels and observations in the training data. As the amount of annotated training data is always limited, this poses serious limitations to the complexity of the feature vector.

Zhou & Su (2002) thus use a modified HMM, which is based on the mutual information independence assumption instead of the conditional probability independence assumption after Bayes’ rule and allows them to take into account additional features. They also use trigrams to make use of local context. Their system outperforms all previous systems on the MUC-6 (Sundheim 1996) and MUC-7 (Sundheim 1998) datasets. Zhou & Su (2004) use the HMM described in Zhou & Su (2002) and combine it with SVMs to address the data sparseness problem. Their system achieves the best results on the JNLPBA-2004 (Collier et al. 2004) dataset.

Another way of overcoming the shortcomings of traditional HMMs is to use other model types such as dynamic Bayesian nets.

Dynamic Bayesian nets

Hidden Markov Models (HMMs) can be viewed as a special type of Bayesian network (BN, also belief network), a popular kind of probabilistic network based on directed graphs. In its simplest form, a BN consists of a network structure and its associated conditional probability table (CPT). The network structure is represented by a directed graph, the CPT specifies a probability distribution based on the network. BNs were not designed to model temporal reasoning, which is needed for sequence labelling tasks, but have been expanded into so-called dynamic BNs (DBNs), which represent dependencies between variables at adjacent positions (like HMMs). DBNs generalise over HMMs by allowing the state information to be decomposed into a set of variables, which enables them to represent a token as a complex set of features and thereby integrate various aspects of language into a single probabilistic model. DBNs have been shown to out- perform HMMs on standard speech recognition tasks (Zweig & Russell 1998) and have also been successfully applied to an IE task (Peshkin & Pfeffer 2003). However, to my knowledge, they have not been applied to any NER task.

An alternative way of integrating a complex set of features is to apply non-generative models like Maximum Entropy Markov models or Conditional Random fields, both described in the following sections.

2.2.2 Maximum Entropy Markov Models

An alternative finite-state model is the Maximum Entropy Markov Model (MEMM), a discriminative model introduced by McCallum et al. (2000), which combines the advan- tages of HMMs and Maximum Entropy models. Unlike generative models, discriminative models make no attempt to model underlying distributions but are only interested in optimizing a mapping from inputs to desired outputs. For the NER task, this means that discriminative models concentrate on estimating P (y|x) rather than P (x, y). This allows for a much more complex feature vector. For estimating the probability distri- butions from data, McCallum et al. (2000) use the maximum entropy (ME) framework, introduced in the following.

Maximum Entropy

Maximum entropy (ME) models are statistical models introduced by Ratnaparkhi (1996) for POS tagging, which generate a label probability distribution for each observation.

In an ME model the probability of a label yi given an observation xi is calculated based on a vector of binary feature functions, each associated with a weight. For fj (xi, yi) denoting the jth feature function, λj denoting the weight of that function and Z(xi) being a normalisation factor, P(yi|xi) is calculated as

Abbildung in dieser Leseprobe nicht enthalten

The weight vector λ is estimated based on the principle of ME introduced by Jaynes (1957) into the field of statistical mechanics. The principle states that the best model for the data is the one based on the probability distribution that “has maximum entropy subject to whatever is known”, i.e. the distribution that is consistent with the constraints represented by the knowledge derived from the training data, and makes the fewest assumptions beyond these constraints.

One of the first ME-based NER systems is the MENE system (Borthwick et al. 1998b, Borthwick et al. 1998a), a purely statistical system, which does not use any hand- generated patterns. The system uses features similar to those used in BBN’s Nymble/ Identifinder system (Bikel et al. 1997, Miller et al. 1998, Bikel et al. 1999) but allows several features to fire in overlapping cases. For instance, in MENE, both the initial cap feature and the all cap feature would activate on “IBM”, while in Nymble, the “initial capitalisation” feature would not fire because the “all-capitalized” feature would take precedence. As the conditional nature of the ME model allows for a large feature set, the system is able to make use of a large amount of evidence represented as binary features when making its tagging decisions. Using MENE alone, the results are inferior to other systems. However, Borthwick et al. (1998b) have very successfully combined MENE with the rule-based NYU system (Grishman 1995) to achieve better results.

The effectiveness of combining an ME-component with a rule-based system has also been shown by Mikheev et al. (1998) and Mikheev et al. (1999). Their NER system LTG uses a hybrid approach with five phases, three of which involve maximum entropy, and outperforms all other systems on the MUC-7 (Sundheim 1998) dataset, achieving an F-measure score of 93.39. The first phase uses so-called sure-fire rules, which only mark up candidates in particular contexts and are very reliable in terms of precision. To improve recall, the system performs a probabilistic partial match of the entities extracted in phase 1, pretags all occurrences of these partial matches and classifies them using an

ME model (phase 2). In phase 3, additional NEs are found based on the NEs found so far and making use of NE lists. Phase 4 applies another ME model on partial matches found in the previous phases, phase 5 completes the NER by annotating NEs in titles.

Maximum Entropy Markov Models

Like the systems MENE and LTG, MEMMs are based on the ME principle. However, unlike the models used in traditional ME systems, they are able to represent sequential dependencies by including information about the previous label, similar to HMMs. In an MEMM, the transition and observation functions used in HMMs are replaced by a single function P (yi|yi−1, xi) that provides the probability of the current state yi given the previous state yi−1 and the current observation xi. This model reflects the fact that the observations are given and that we are interested in the probability of the label sequence induced by the state sequence rather than the probability of the observations. The difference is illustrated in Figure 2.2, where the black filling of a node indicates that the corresponding variable is generated by the model. In MEMMs, P (y|x) is calculated as follows:

Abbildung in dieser Leseprobe nicht enthalten

where [Abbildung in dieser Leseprobe nicht enthalten] is the jth feature applied to the ith token of observation sequence x.

For the NER task, their main advantage over HMMs is that they are discriminative rather than generative. Unlike generative models, which need the probability of the observation to calculate the joint probability of the observation and the label, MEMMs concentrate on estimating the conditional probability of the label given the observation. Thus, they can present each observation as a complex combination of multiple, nonindependent features and incorporate more pieces of evidence.

Label bias problem

The main disadvantage of MEMMs and other discriminative Markov models based on directed graphs is that they are potential victims of the Label Bias Problem (Lafferty et al. 2001), i.e. that they are biased towards states with few successor states. The problem is illustrated in Figure 2.3 (after Bottou (1991)).




2 The Message Understanding Conferences (MUC) were held seven times between 1987 and 1997 and were financed by the US Defense Advanced Research Projects Agency (DARPA) to encourage the development of new and better methods of information extraction. For each MUC, participating groups were given sample data on the type of information to be extracted and asked to develop systems to perform the extraction task. For the conferences, the systems were then evaluated against a test dataset. The task of recognizing named entities was added for the sixth conference (MUC-6).

1 The CoNLL is a meeting of the SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics, which has been held yearly since 1997. The shared task of CoNLL-2002 and CoNLL-2003 concerned language-independent named entity recognition.

2 This is true of a first-order Markov model; in an n-order Markov model, the current state is dependent on n previous states.

Excerpt out of 67 pages


Automatic extraction and processing of document references
A CRF-based approach
University of Sunderland  (School of Computing and Technology)
Catalog Number
ISBN (eBook)
ISBN (Book)
File size
1285 KB
Für die Arbeit wurde die Bewertung "with distinction" vergeben.
Automatic, CRF-based
Quote paper
Kathrin Eichler (Author), 2007, Automatic extraction and processing of document references, Munich, GRIN Verlag,


  • No comments yet.
Read the ebook
Title: Automatic extraction and processing of document references

Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free