The number of Amharic documents on the Web is increasing as many newspaper publishers have started providing their services electronically. Effective tools for extracting and exploiting the valuable information in Amharic text are not available, and manually extracting information from a large amount of unstructured text is tiresome and time-consuming; this was the main motivation for this research. The overall objective of the research was to develop an information extraction system for Amharic vacancy announcement text. A total of 116 Amharic vacancy announcement texts containing 10,766 words were collected from the "Ethiopian reporter" newspaper, which is published in Amharic twice a week. For this study, nine candidate texts were selected from the Amharic vacancy announcement text: organization, position, qualification, experience, salary, number of people required, work agreement, deadline, and phone number. Experiments were carried out on each component of the system separately to evaluate its performance; this helps to identify drawbacks and gives some clues for future work.
The experimental results show that an overall F-measure of 71.7% was achieved. In order to make the system to be applicable in this domain which is Amharic vacancy announcement,
Table of Contents
CHAPTER ONE
INTRODUCTION
1.1. GENERAL BACKGROUND
1.2. STATEMENT OF THE PROBLEM
1.3. OBJECTIVE OF THE STUDY
1.3.1. GENERAL OBJECTIVE
1.3.2. SPECIFIC OBJECTIVES
1.4. METHODOLOGY
1.4.1. STUDY DESIGN
1.4.2. LITERATURE REVIEW
1.4.3. DATA SOURCES AND DATA PREPARATION FOR THE EXPERIMENT
UNDERSTANDING OF DOMAIN LANGUAGE
1.4.4. DESIGN AND IMPLEMENTATION OF AVATIES
1.5. APPLICATION OF RESULTS AND BENEFICIARIES
1.6. SCOPE AND LIMITATIONS OF THE STUDY
1.7. ORGANIZATION OF THE STUDY
CHAPTER TWO
LITERATURE REVIEW
2.1. INTRODUCTION
2.2. INFORMATION EXTRACTION (IE)
2.3. BUILDING INFORMATION EXTRACTION SYSTEMS
I. KNOWLEDGE ENGINEERING APPROACH
II. AUTOMATIC TRAINING APPROACH
2.4. ARCHITECTURE OF INFORMATION EXTRACTION SYSTEM
2.5. PREPROCESSING OF INPUT TEXTS
2.6. LEARNING AND APPLICATION OF THE EXTRACTION MODEL
2.7. POST PROCESSING OF OUTPUT
2.8. RELATED NLP FIELDS TO INFORMATION EXTRACTION
2.8.1. INFORMATION RETRIEVAL (IR)
2.8.2. TEXT SUMMARIZATION
2.8.3. QUESTION ANSWERING SYSTEMS
2.8.4. TEXT CATEGORIZATION
2.9. INFORMATION EXTRACTION (IE) AND INFORMATION RETRIEVAL (IR)
2.10. EVALUATION OF INFORMATION EXTRACTION
2.11. RELATED WORKS
INFORMATION EXTRACTION FOR E-JOB MARKETPLACE
INFORMATION EXTRACTION FROM AMHARIC TEXT
INFORMATION EXTRACTION FROM ENGLISH TEXT
INFORMATION EXTRACTION FROM CHINESE TEXT
CHAPTER THREE
THE AMHARIC WRITING SYSTEM
3.1. INTRODUCTION
3.2. AMHARIC CHARACTER REPRESENTATION AND WRITING SYSTEM
3.3. AMHARIC PUNCTUATION MARKS AND NUMERALS
3.4. CHARACTERISTICS OF THE AMHARIC WRITING SYSTEM
3.5. THE MORPHOLOGY OF AMHARIC
3.6. GRAMMATICAL STRUCTURE OF AMHARIC
3.6.1. WORD CATEGORIZATION IN AMHARIC
3.7. SENTENCES IN AMHARIC
CHAPTER FOUR
DESIGN AND IMPLEMENTATION OF AVATIES
4.1. INTRODUCTION
4.2. PROPOSED MODEL
DATA PREPROCESSING
LEARNING AND EXTRACTION COMPONENT
POST PROCESSING
THE PROTOTYPE SYSTEM
CHAPTER FIVE
RESULT AND EVALUATION
5.1. INTRODUCTION
5.2. EVALUATION METRICS
5.3. THE DATASETS
5.4. EXPERIMENTAL RESULT AND EVALUATION OF EACH COMPONENT OF OUR SYSTEM
5.4.1. EXPERIMENTAL RESULT AND EVALUATION OF NORMALIZATION
5.4.2. EXPERIMENTAL RESULT AND EVALUATION OF STOPWORD REMOVAL
5.4.3. EXPERIMENTAL RESULT AND EVALUATION OF TRANSLITERATION
5.4.5. EXPERIMENTAL RESULT AND EVALUATION OF PROTOTYPE SYSTEM FOR CANDIDATE TEXT EXTRACTION
5.4.5.1. EXPERIMENTAL RESULT AND EVALUATION OF ORGANIZATION AND POSITION EXTRACTION
5.4.5.2. EXPERIMENTAL RESULT AND EVALUATION OF OTHER CANDIDATE TEXT EXTRACTION
CHAPTER SIX
CONCLUSION AND RECOMMENDATION
6.1. CONCLUSIONS
6.2. RECOMMENDATION
REFERENCE
Research Objectives and Topics
The primary objective of this research is to design and implement an automated Information Extraction (IE) system for Amharic vacancy announcement texts. The study aims to overcome the challenges of manually processing unstructured job postings by developing a rule-based system capable of accurately identifying and extracting key organizational and job-related data.
- Development of an Amharic-specific information extraction architecture.
- Implementation of robust linguistic preprocessing techniques for Amharic text (tokenization, normalization, transliteration).
- Design and testing of rule-based algorithms to extract specific data attributes like organization, position, qualification, salary, and deadlines.
- Evaluation of system performance using precision, recall, and F-measure metrics on real-world datasets.
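The three evaluation metrics named in the last objective can be illustrated with a small Python sketch. It is not the thesis's evaluation code: the function name `precision_recall_f1`, the (slot, value) representation, and the sample data are all assumptions made for illustration. The F-measure is computed here as the balanced harmonic mean of precision and recall.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted (slot, value) pairs against a hand-labelled gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # pairs the system got exactly right
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data for one announcement (English placeholders, not real results).
gold = {
    ("position", "accountant"),
    ("salary", "5000"),
    ("deadline", "10 days"),
    ("phone", "0111234567"),
}
extracted = {
    ("position", "accountant"),
    ("salary", "5000"),
    ("deadline", "7 days"),  # wrong value, so it counts against precision
}

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

In this toy case two of the three extracted pairs are correct and two of the four gold pairs are found, so precision is 2/3 and recall is 1/2.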
Excerpt from the Book
1.1. GENERAL BACKGROUND
Rapid developments in Information and Communication Technology are making huge amounts of data and information available. Much of this data is in electronic form (for example, the more than a billion documents on the World Wide Web). Usually these data are unstructured or semi-structured and can generally be considered a text database. Likewise, recent decades have witnessed a rapid proliferation of Amharic textual information available in digital form in a myriad of repositories on the Internet and intranets. As a result of this growth, a huge amount of valuable information that can be used in education, business, health, and many other areas is hidden in the unstructured representation of the textual data and is thus hard to search. This has resulted in a growing need for effective and efficient techniques for analyzing free-text data and discovering valuable and relevant knowledge from it in the form of structured information, and has led to the emergence of Information Extraction technologies.
Information Extraction (IE) is one of the NLP applications that aim to automatically extract structured factual information from unstructured text. As Riloff [2] discusses, the task of automatically extracting information from texts involves identifying a predefined set of concepts and deciding whether a text is relevant for a certain domain and, if so, extracting a set of facts from that text.
IE has three components regardless of the language and domain for which it is developed: linguistic preprocessing, learning and application, and post-processing. Linguistic preprocessing uses different tools to make the natural language texts ready for extraction. The learning and application component learns a model and extracts the required information from the preprocessed text.
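The three-component structure described above can be sketched as a chain of functions. This is only an illustrative skeleton under stated assumptions: the function names, the `SLOTS` tuple, the "label: value" rule, and the English placeholder text are hypothetical, and the post-processing step (filling a fixed template) is a guess at what that stage might do, since the excerpt does not describe it.

```python
SLOTS = ("position", "salary")  # hypothetical template slots


def preprocess(text):
    """Linguistic preprocessing: here reduced to whitespace tokenization."""
    return text.split()


def extract(tokens):
    """Application step: one hand-written rule stands in for the extraction model.

    A token such as "Salary:" is treated as a label and the next token as its value.
    """
    facts = {}
    for label, value in zip(tokens, tokens[1:]):
        key = label.rstrip(":").lower()
        if label.endswith(":") and key in SLOTS:
            facts[key] = value
    return facts


def postprocess(facts):
    """Post-processing: fill a fixed template, leaving missing slots as None."""
    return {slot: facts.get(slot) for slot in SLOTS}


def run_pipeline(text):
    """Run the three stages in order on one announcement."""
    return postprocess(extract(preprocess(text)))


record = run_pipeline("Position: Accountant  Salary: 5000 birr")
print(record)
```

Keeping the stages as separate functions mirrors the idea of evaluating each component on its own before measuring the whole system.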
Chapter Summaries
CHAPTER ONE: Provides an introduction to the research, outlining the background, problem statement, objectives, and the methodology used to develop the IE system.
CHAPTER TWO: Reviews the literature on Information Extraction (IE) techniques, related NLP fields, and existing IE systems, providing a foundation for the proposed approach.
CHAPTER THREE: Discusses the Amharic writing system, morphology, and grammatical structure, highlighting language-specific challenges relevant to the research.
CHAPTER FOUR: Details the design and implementation of the AVATIES prototype, including the proposed model, preprocessing steps, and extraction algorithms.
CHAPTER FIVE: Presents the experimental results and performance evaluation of the system, using precision, recall, and F-measure metrics on the collected test dataset.
CHAPTER SIX: Concludes the thesis by summarizing key findings and providing recommendations for future research and system improvements.
Keywords
Information Extraction, Amharic Language, Vacancy Announcement, Rule-Based System, Natural Language Processing, Tokenization, Normalization, Named Entity Recognition, Gazetteer, Morphology, POS Tagging, Prototype System, Precision, Recall, F-measure.
Frequently Asked Questions
What is the core purpose of this research?
The research focuses on designing an automated Information Extraction system specifically for Amharic vacancy announcements to reduce the manual effort of extracting job-related details from unstructured newspaper text.
What are the central themes of this work?
Key themes include natural language processing for the Amharic language, rule-based IE modeling, linguistic preprocessing, and system performance evaluation within the specific domain of job postings.
What is the primary research question?
The study seeks to determine the most effective approaches, algorithms, and models for designing an Amharic IE system capable of accurately identifying and extracting relevant data from unstructured vacancy announcements.
Which scientific methodology does the author use?
The research employs an experimental methodology, involving the collection of Amharic vacancy texts, development of rule-based algorithms, and testing the system's performance using standard NLP metrics like precision and recall.
What is covered in the main body of the work?
The main body covers the literature review of IE systems, an analysis of the Amharic language structure (writing system and morphology), the technical design of the AVATIES prototype, and a thorough performance evaluation.
Which keywords define this work?
The study is characterized by terms such as Information Extraction, Amharic Language, Vacancy Announcement, Rule-Based System, Natural Language Processing, and Prototype System.
Why is the Amharic writing system challenging for information extraction?
Amharic has a unique syllabic script where characters exhibit spelling variations due to interchangeably used consonants that share the same sound, requiring robust normalization algorithms for effective extraction.
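A minimal Python sketch of such a normalization step follows. It assumes that each commonly cited homophone series (ሐ and ኀ with ሀ, ሠ with ሰ, ዐ with አ, ፀ with ጸ) is folded onto one representative series across its seven vowel orders. Which series serves as the canonical one, and the names `_HOMOPHONE_BASES` and `normalize`, are assumptions for illustration and not necessarily the choices made in the thesis.

```python
# (variant base code point, canonical base code point) for each homophone series.
_HOMOPHONE_BASES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]

# Translation table covering the seven vowel orders of every variant series.
_NORMALIZE = {
    variant + order: canonical + order
    for variant, canonical in _HOMOPHONE_BASES
    for order in range(7)
}


def normalize(text):
    """Replace variant homophone characters with their canonical counterparts."""
    return text.translate(_NORMALIZE)


print(normalize("ሠላም"))  # the initial ሠ is replaced by ሰ
```

The sketch relies on the Unicode Ethiopic block placing the vowel orders of a series at consecutive code points, so a fixed offset maps a whole series at once. Labialized forms and other spelling variants are not handled.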
How does the AVATIES prototype handle different vacancy formats?
The system uses a combination of gazetteers (predefined lists) and feature-based context rules to identify entities like job position and organization name regardless of the specific format of the vacancy text.
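How a gazetteer and a context rule can work together is sketched below in Python. The word lists, the function name `find_position`, the order in which the two rules are tried, and the English placeholder text are all hypothetical. The actual AVATIES rules operate on Amharic text and are not reproduced here.

```python
GAZETTEER = {"accountant", "driver", "secretary"}  # hypothetical list of known positions
CONTEXT_CUES = {"position", "title"}               # hypothetical trigger words


def find_position(tokens):
    """Return the first token that looks like a job position, or None."""
    cleaned = [t.strip(":,.").lower() for t in tokens]

    # Rule 1: direct gazetteer hit anywhere in the text.
    for original, word in zip(tokens, cleaned):
        if word in GAZETTEER:
            return original.strip(":,.")

    # Rule 2: context rule, take the token that follows a cue word.
    for i, word in enumerate(cleaned[:-1]):
        if word in CONTEXT_CUES:
            return tokens[i + 1].strip(":,.")

    return None


print(find_position("We need one Driver urgently".split()))
print(find_position("Position: Cashier, full time".split()))
```

The second call shows why the context rule matters: "Cashier" is not in the gazetteer, yet it is still found because it follows the cue word, which is one way to stay independent of the layout of a particular announcement.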
What were the final results of the experiment?
The prototype system achieved an overall F-measure of 71.7%, demonstrating that a rule-based knowledge engineering approach is a promising direction for Amharic information extraction.
- Sintayehu Hirpassa (Author), 2011, Designing an Information Extraction System for Amharic Vacancy Announcement Text, Munich, GRIN Verlag, https://www.grin.com/document/289226