While reading documents, you often encounter passages that advise you to consult other documents for more information on a specific topic. Such references are particularly common in technical documents, which are written to give the reader as much relevant information as possible without rephrasing information that can be found elsewhere. Knowing how the documents in a system are interrelated, i.e. which other documents a document refers to or is referred to by, can be extremely helpful when trying to access relevant information. A typical example of such a “knowledge net” providing information about document relations is CiteSeer, a digital library of academic literature. For each document in the library, CiteSeer displays lists of related documents, such as the documents that the current document cites and the documents that cite it.
The assumption that inspired this thesis is that such lists are not only helpful when reading academic literature but could also assist a reader of technical documents stored in a company’s document management system. The idea was thus to extend an existing document management system by displaying, for each stored document, a list of links to the documents it refers to. As information about how the documents in this system are interrelated was not available, the project underlying this thesis focused on the first step towards solving this task: automatically analyzing documents in order to extract the names of related documents. Once all document names mentioned in a document have been extracted, the next step would be to search for these documents in the system’s database and, where they are found, create links to them.
The outcome of the project was a system that performs the extraction task. It is based on Conditional Random Fields, a machine learning technique introduced by Lafferty et al. (2001), and is able to extract document names from unseen documents, achieving high precision (88%) and acceptable recall (65%) on a test dataset. The implementation is based on a Java package provided by Sarawagi & Cohen (2005), which was adapted and extended to suit the nature of the task. As the approach is based on supervised learning, the project also involved the generation of appropriate training data.
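
To make the reported scores concrete, the following is a minimal sketch of how precision, recall and their harmonic mean (F1) could be computed for an extraction task of this kind. It assumes exact-match scoring of extracted document names against a hand-annotated gold set, which the abstract does not specify. The function names `score_extraction` and `f_measure` and the sample document names are hypothetical, and F1 is not a figure reported in the abstract; it is only derived here from the two reported values.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_extraction(predicted, gold):
    """Return (precision, recall, f1) for two collections of names.

    Assumption: a predicted name counts as correct only if it matches
    a gold name exactly. The thesis may use a different criterion.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical example: names found by a system vs. names in the annotation.
predicted = {"Installation Guide", "Release Notes", "API Manual"}
gold = {"Installation Guide", "Release Notes", "Security Handbook", "User Guide"}
precision, recall, f1 = score_extraction(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Combining the scores reported in the abstract (88% precision, 65% recall).
print(f"F1 for the reported scores: {f_measure(0.88, 0.65):.3f}")
```

Under these assumptions, the reported precision of 88% and recall of 65% combine to an F1 of roughly 0.75, which reflects that the system misses more names than it wrongly extracts.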
Table of Contents
- Introduction
- Project description
- Related tasks
- Problem formalisation
- Evaluation measures
- Approaches to named entity recognition
- Machine learning approaches to sequence labelling
- Classifier-based approaches
- Probabilistic sequence models
- Hidden Markov Models
- Maximum Entropy Markov Models
- Conditional Random Fields
- Comparison of sequence models
- Motivation for using CRFs
- Features
- Lexical features
- Linguistic features
- Orthographical features
- Formatting
- Context features
- Implementation
- Definition of the named entity
- Data analysis
- CRF implementation
- Data preprocessing
- File format conversion - class FileConverter
- Extracting potential candidates - class ContextExtractor
- Annotation guidelines
- Generating training data - class DatasetGenerator
- Experiments
- Initial feature types
- Tagging scheme and performance measure
- Number of models
- Additional features
- Critical evaluation
- System overview
- Processing of the extracted references
- Conclusions and future work
- Conclusions
- Future work
- Additional reference types
- Improving the model
- Additional training data
- Precision recall trade-off
Objectives and Key Themes
The objective of this thesis is to develop a system that automatically extracts document names from technical documents, enabling the creation of a "knowledge net" that links related documents within a company's document management system. The system is based on Conditional Random Fields (CRFs) and aims to achieve high precision and acceptable recall scores. Key themes include:
- Automatic document reference extraction
- Conditional Random Fields (CRFs) for sequence labelling
- Named entity recognition
- Data preprocessing and training data generation
- System evaluation and future directions
Chapter Summaries
- Chapter 1 introduces the project's goal of automatically extracting document references from technical documents. It outlines the problem, formalizes its definition, and discusses evaluation measures. This chapter also explores existing approaches to named entity recognition, providing context for the chosen CRF-based approach.
- Chapter 2 covers machine learning approaches to sequence labelling, focusing on probabilistic models such as Hidden Markov Models, Maximum Entropy Markov Models, and CRFs. It compares these models and explains why CRFs were chosen for this project.
- Chapter 3 elaborates on the implementation of the system. It defines the target named entity, analyzes the data, and discusses the CRF implementation using the Java package by Sarawagi & Cohen (2005). The chapter further describes the data preprocessing techniques, including file format conversion, potential candidate extraction, and annotation guidelines. It concludes by discussing the experiments conducted to tune the system, focusing on feature selection, tagging schemes, and performance evaluation.
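
The sequence labelling setup mentioned in these summaries can be illustrated with a small sketch. It assumes a BIO tagging scheme (B = first token of a document name, I = continuation, O = outside any name), which is a common choice for named entity recognition but is not confirmed here as the scheme used in the thesis. The function `decode_bio` and the example sentence are hypothetical. In a real system the tags would be predicted by the trained CRF; here they are written by hand.

```python
def decode_bio(tokens, tags):
    """Collect the token spans labelled as document names.

    tokens: list of str
    tags:   list of "B", "I" or "O", one per token
    Returns a list of extracted names (tokens joined by spaces).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new name starts; close any name that is still open.
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the name that is currently open.
            current.append(token)
        else:
            # "O", or a stray "I" without a preceding "B": close any open name.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hypothetical sentence with hand-written tags.
tokens = ["For", "details", "see", "the", "Installation", "Guide",
          "and", "the", "Release", "Notes", "."]
tags = ["O", "O", "O", "O", "B", "I",
        "O", "O", "B", "I", "O"]
print(decode_bio(tokens, tags))  # ['Installation Guide', 'Release Notes']
```

The extracted names are what a later step would look up in the document management system in order to create links.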
Keywords
This thesis focuses on the automatic extraction and processing of document references using a CRF-based approach. The main keywords include: document reference extraction, named entity recognition, sequence labelling, Conditional Random Fields, data preprocessing, machine learning, supervised learning, evaluation metrics, precision, recall, technical documentation, knowledge net, document management system.
Citation: Kathrin Eichler (Author), 2007, Automatic extraction and processing of document references, Munich, GRIN Verlag, https://www.grin.com/document/158610