This thesis presents a memory-based word sense disambiguation system that makes use of automatic feature selection and minimal parameter optimization. We show that the system performs competitively with other state-of-the-art systems and further use it to evaluate automatically acquired data for word sense disambiguation.
The goal of the thesis is to demonstrate that automatically extracted examples can help increase the performance of supervised approaches to word sense disambiguation. We conduct several experiments and discuss their results in order to illustrate the advantages and disadvantages of the automatically acquired data.
Table of Contents
1 Introduction
2 Basic Approaches to Word Sense Disambiguation
2.1 Knowledge-Based
2.1.1 The Lesk Algorithm
2.1.2 Alternative Methods
2.2 Unsupervised Corpus-Based
2.2.1 Distributional Methods
2.2.2 Translational Equivalence Methods
2.3 Supervised Corpus-Based
2.3.1 Sense Inventories
2.3.2 Source Corpora
2.3.3 Data Preprocessing
2.3.4 Feature Vectors
2.3.5 Supervised WSD Algorithms
2.4 Semi-Supervised Corpus-Based
3 Comparability for WSD Systems
3.1 Differences between WSD Systems
3.2 Most Frequently Used Baselines
3.2.1 The Lesk Algorithm
3.2.2 Most Frequent Sense
3.2.3 Random Choice
4 Evaluation of WSD Systems
4.1 Fundamentals in Evaluation of WSD Systems
4.2 International Evaluation Exercise Senseval
4.3 Senseval-1
4.4 Senseval-2
4.5 Senseval-3
4.5.1 The All-Words Task
4.5.2 The Lexical Sample Task
4.5.3 Other Tasks
4.6 Semeval-1
4.7 Summary
5 TiMBL: Tilburg Memory-Based Learner
5.1 Overview
5.2 Application
6 Automatic Extraction of Examples for WSD
6.1 Overview of the System
6.2 Data Collection
6.2.1 Sense Inventory
6.2.2 Source Corpora
6.2.3 Automatic Annotation
6.3 Data Preprocessing
6.3.1 Basic Preprocessing
6.3.2 Part-of-Speech Tagging
6.4 Training and Test Sets
6.4.1 Feature Vectors
6.4.2 Training Set
6.4.3 Test Set
6.5 Algorithm Selection
6.6 Parameter Optimizations
6.6.1 General Parameter Optimizations
6.6.2 Automatic Feature Selection
6.7 Scoring
6.8 Experimental Results
6.8.1 Supervised WSD
6.8.2 Unsupervised WSD
6.8.3 Semi-supervised WSD
6.8.4 Discussion
6.9 Evaluation
7 Conclusion, Future and Related Work
7.1 Related Work
7.2 Future Work
7.3 Conclusion
Research Objective and Scope
The primary objective of this thesis is to address the knowledge acquisition bottleneck in Word Sense Disambiguation (WSD) by demonstrating that automatically extracted training examples can effectively enhance the performance of supervised WSD systems. The research investigates methods to minimize human labor in creating labeled datasets while maintaining high accuracy.
- Development of a semi-supervised WSD system.
- Automatic extraction and annotation of data from online dictionaries and large corpora.
- Implementation of memory-based learning (TiMBL) for classification tasks.
- Comparative evaluation of various training configurations against established benchmarks.
- Analysis of the trade-offs between manually and automatically acquired training instances.
Excerpt from the Book
2.1.1 The Lesk Algorithm
One highly influential dictionary-based method, which to a great extent laid the basis for most of the research in the area, is the Lesk method presented in (Lesk, 1986). The algorithm is based on the assumption that words with similar surroundings usually share a common topic. In other words, contextual overlap (in this approach, overlap among dictionary definitions) is used as a measure to pick the most likely sense of a given word. Lesk's method was also one of the first to work on any word in any text. The algorithm is classified as knowledge-based because it acquires knowledge only from a set of dictionary entries (a separate entry is needed for each sense of every word), concentrating on the immediate context of the target word, i.e. the words that closely surround it. As an example we can look at the "evergreen" case of disambiguating pine cone that Lesk (1986) first suggested, as shown in (1) on page 14.
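The gloss-overlap idea can be sketched in a few lines. The sketch below implements the *simplified* Lesk variant, which compares each sense's gloss directly with the context rather than pairing glosses of neighboring words as Lesk's original formulation does; the toy gloss inventory is hypothetical and stands in for a real dictionary.

```python
# Simplified Lesk: pick the sense whose dictionary gloss shares the
# most words with the target word's surrounding context.

def simplified_lesk(context_words, sense_glosses):
    """Return the sense whose gloss overlaps most with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Lesk's classic "pine cone" example, with made-up glosses:
glosses = {
    "pine#1": "kinds of evergreen tree with needle-shaped leaves",
    "pine#2": "waste away through sorrow or illness",
}
context = "the pine cone hangs from an evergreen tree".split()
print(simplified_lesk(context, glosses))  # → pine#1
```

The tree-sense gloss shares "evergreen" and "tree" with the context, while the sorrow-sense gloss shares nothing, so the overlap count selects pine#1.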
Summary of Chapters
1 Introduction: Provides an overview of ambiguity in computational linguistics and sets the motivation for automating Word Sense Disambiguation.
2 Basic Approaches to Word Sense Disambiguation: Surveys fundamental methodologies including knowledge-based, unsupervised, supervised, and semi-supervised approaches to WSD.
3 Comparability for WSD Systems: Discusses the inherent difficulties in comparing diverse WSD systems and the necessity of establishing baseline performance metrics.
4 Evaluation of WSD Systems: Details standard evaluation practices and historical competition benchmarks, specifically focusing on the Senseval series.
5 TiMBL: Tilburg Memory-Based Learner: Introduces the software tool utilized for the memory-based learning components of the research.
6 Automatic Extraction of Examples for WSD: Explains the design of the semi-supervised system, data collection strategies, feature vector construction, and experimental results.
7 Conclusion, Future and Related Work: Summarizes the thesis findings, discusses related research, and proposes future directions for optimizing WSD performance.
Keywords
Word Sense Disambiguation, WSD, Memory-Based Learning, Automatic Extraction, Semi-Supervised Learning, Senseval, Corpus Linguistics, Feature Selection, Parameter Optimization, TiMBL, Machine Learning, Natural Language Processing, Sense Annotation, Knowledge Acquisition, Lexical Semantics.
Frequently Asked Questions
What is the core focus of this thesis?
The thesis focuses on solving the knowledge acquisition bottleneck in Word Sense Disambiguation (WSD) by developing a semi-supervised system that uses automatically extracted and labeled data to improve performance.
What are the primary themes covered?
The work covers WSD methodologies, system comparability, evaluation metrics (precision, recall, F-score), the use of memory-based learning, and the impact of data quality on classification accuracy.
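The metrics mentioned above are conventionally computed in Senseval-style scoring as precision over the instances the system attempted, recall over all instances, and their harmonic mean as the F-score. A minimal sketch (the counts are illustrative, not results from the thesis):

```python
# Senseval-style WSD scoring: precision over attempted instances,
# recall over all instances, F-score as their harmonic mean.

def wsd_scores(correct, attempted, total):
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

p, r, f = wsd_scores(correct=75, attempted=90, total=100)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.833 0.75 0.789
```

Note that precision and recall coincide only when the system attempts every instance (full coverage).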
What is the main goal or research question?
The primary goal is to demonstrate that automatically extracted examples can reduce the need for manual annotation while maintaining or improving the performance of supervised WSD systems.
Which scientific method is employed?
The research employs a memory-based learning (MBL) approach, utilizing the TiMBL software, and incorporates automatic feature selection and parameter optimization to refine the classification process.
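Memory-based learning stores all training instances and classifies a new instance by a vote among its nearest stored neighbors. The sketch below is a minimal k-NN classifier with a symbolic overlap distance, in the spirit of TiMBL's default IB1 setup; it omits TiMBL's feature weighting, and the feature vectors and sense labels are hypothetical.

```python
# Minimal memory-based (k-NN) classification with an overlap distance:
# keep every training example, rank them by feature mismatches, vote.
from collections import Counter

def overlap_distance(a, b):
    """Number of mismatching feature values between two vectors."""
    return sum(x != y for x, y in zip(a, b))

def mbl_classify(train, instance, k=1):
    """train: list of (feature_tuple, sense); majority sense of k nearest."""
    ranked = sorted(train, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy context features (word-before, target POS, word-after):
train = [
    (("the", "NN", "wall"), "bank#building"),
    (("river", "NN", "flooded"), "bank#shore"),
    (("a", "NN", "loan"), "bank#building"),
]
print(mbl_classify(train, ("the", "NN", "loan"), k=1))  # → bank#building
```

Because no abstraction step discards training data, rare but informative contexts remain available at classification time, which is one motivation for memory-based approaches to WSD.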
What topics are discussed in the main part of the work?
The main part details the system architecture, the process of automatic data collection from various sources, the automatic annotation framework (SenseRelate::TargetWord), and experimental evaluations using Senseval-3 benchmarks.
Which keywords characterize the work?
Key terms include Word Sense Disambiguation, Semi-Supervised Learning, Memory-Based Learning, Automatic Extraction, and Feature Selection.
How does this system differ from fully supervised approaches?
Unlike fully supervised approaches that rely exclusively on manually annotated datasets, this system integrates automatically gathered and labeled examples to extend the training data, thereby reducing human effort.
What significance do the experiments on the "solid" or "suspend" examples have?
These examples illustrate cases where automated data might be sparse or poorly distributed, highlighting the critical role that the quality and distribution of training data play in overall system success.
- Cite this work
- Desislava Zhekova (Author), 2009, Automatic Extraction of Examples for Word Sense Disambiguation, Munich, GRIN Verlag, https://www.grin.com/document/267346