Designing an Information Extraction System for Amharic Vacancy Announcement Text


Thesis (M.A.), 2011

105 Pages, Grade: Very Good


Excerpt

TABLE OF CONTENTS

ACRONYMS & ABBREVIATIONS

ABSTRACT

CHAPTER ONE INTRODUCTION
1.1. GENERAL BACKGROUND
1.2. STATEMENT OF THE PROBLEM
1.3. OBJECTIVE OF THE STUDY
1.3.1 GENERAL OBJECTIVE
1.3.2. SPECIFIC OBJECTIVES
1.4. METHODOLOGY
1.4.1. STUDY DESIGN
1.4.2. LITERATURE REVIEW
1.4.3. DATA SOURCES AND DATA PREPARATION FOR THE EXPERIMENT
UNDERSTANDING OF DOMAIN LANGUAGE
1.4.4. DESIGN AND IMPLEMENTATION OF AVATIES
1.5. APPLICATION OF RESULTS AND BENEFICIARIES
1.6. SCOPE AND LIMITATIONS OF THE STUDY
1.7. ORGANIZATION OF THE STUDY

CHAPTER TWO LITERATURE REVIEW

2.1. INTRODUCTION
2.2. INFORMATION EXTRACTION (IE)
2.3. BUILDING INFORMATION EXTRACTION SYSTEMS
I. KNOWLEDGE ENGINEERING APPROACH
II. AUTOMATIC TRAINING APPROACH
2.4. ARCHITECTURE OF INFORMATION EXTRACTION SYSTEM
2.5. PREPROCESSING OF INPUT TEXTS
2.6. LEARNING AND APPLICATION OF THE EXTRACTION MODEL
2.7. POST PROCESSING OF OUTPUT
2.8. RELATED NLP FIELDS TO INFORMATION EXTRACTION
2.8.1. INFORMATION RETRIEVAL (IR)
2.8.2. TEXT SUMMARIZATION
2.8.3. QUESTION ANSWERING SYSTEMS
2.8.4. TEXT CATEGORIZATION
2.9. INFORMATION EXTRACTION (IE) AND INFORMATION RETRIEVAL (IR)
2.10. EVALUATION OF INFORMATION EXTRACTION
2.11. RELATED WORKS
INFORMATION EXTRACTION FOR E-JOB MARKETPLACE
INFORMATION EXTRACTION FROM AMHARIC TEXT
INFORMATION EXTRACTION FROM ENGLISH TEXT
INFORMATION EXTRACTION FROM CHINESE TEXT

CHAPTER THREE THE AMHARIC WRITING SYSTEM
3.1. INTRODUCTION
3.2. AMHARIC CHARACTER REPRESENTATION AND WRITING SYSTEM
3.3. AMHARIC PUNCTUATION MARKS AND NUMERALS
3.4. CHARACTERISTICS OF THE AMHARIC WRITING SYSTEM
3.5. THE MORPHOLOGY OF AMHARIC
3.6. GRAMMATICAL STRUCTURE OF AMHARIC
3.6.1 WORD CATEGORIZATION IN AMHARIC
3.7. SENTENCES IN AMHARIC

CHAPTER FOUR DESIGN AND IMPLEMENTATION OF AVATIES
4.1. INTRODUCTION
4.2. PROPOSED MODEL
DATA PREPROCESSING
LEARNING AND EXTRACTION COMPONENT
POST PROCESSING
THE PROTOTYPE SYSTEM

CHAPTER FIVE RESULT AND EVALUATION
5.1. INTRODUCTION
5.2. EVALUATION METRICS
5.3. THE DATASETS
5.4. EXPERIMENTAL RESULT AND EVALUATION OF EACH COMPONENT OF OUR SYSTEM
5.4.1. EXPERIMENTAL RESULT AND EVALUATION OF NORMALIZATION
5.4.2. EXPERIMENTAL RESULT AND EVALUATION OF STOPWORD REMOVAL
5.4.3. EXPERIMENTAL RESULT AND EVALUATION OF TRANSLITERATION
5.4.4. EXPERIMENTAL RESULT AND EVALUATION OF PART OF SPEECH TAGGER
5.4.5. EXPERIMENTAL RESULT AND EVALUATION OF PROTOTYPE SYSTEM FOR CANDIDATE TEXT EXTRACTION

CHAPTER SIX CONCLUSION AND RECOMMENDATION
6.1. CONCLUSIONS
6.2. RECOMMENDATION

REFERENCE

Appendix

ACRONYMS & ABBREVIATIONS

[Figure not included in this excerpt]

ABSTRACT

The number of Amharic documents on the Web is increasing as many newspaper publishers have started providing their services electronically. Tools for extracting and exploiting the valuable information in Amharic text that are effective enough to satisfy users are not available, and manually extracting information from a large amount of unstructured text is a tiresome and time-consuming job. These problems motivated the researcher to undertake this work.

The overall objective of the research was to develop an information extraction system for Amharic vacancy announcement text. The system was developed using the Python and Visual Basic programming languages, and a rule-based technique was applied to automatically decide the correct candidate texts based on their surrounding context words. A total of 116 Amharic vacancy announcement texts containing 10,766 words were collected from the “Ethiopian Reporter” newspaper, which is published in Amharic twice a week.

For this study, nine candidate texts were selected from the Amharic vacancy announcement text: organization, position, qualification, experience, salary, number of people required, work agreement, deadline and phone number. Experiments were carried out on each component of the system separately to evaluate its performance, which helps to identify drawbacks and gives some direction for future work.

The experimental results show that an overall F-measure of 71.7% was achieved. To make the system applicable in the Amharic vacancy announcement domain, further study is required, such as incorporating additional rules, improving the speed of the system by modifying the algorithm, designing a better user interface and integrating other NLP facilities.

CHAPTER ONE INTRODUCTION

1.1. GENERAL BACKGROUND

Rapid developments in Information and Communication Technology are making huge amounts of data and information available. Much of this data is in electronic form (for example, the more than a billion documents on the World Wide Web). Usually these data are unstructured or semi-structured and can generally be considered a text database. Likewise, recent decades have witnessed a rapid proliferation of Amharic textual information available in digital form in a myriad of repositories on the Internet and intranets. As a result of this growth, a huge amount of valuable information that could be used in education, business, health and many other areas is hidden in unstructured textual data and is thus hard to search. This has resulted in a growing need for effective and efficient techniques for analyzing free-text data and discovering valuable and relevant knowledge from it in the form of structured information, and has led to the emergence of Information Extraction technologies.

Information Extraction (IE) is one of the NLP applications that aim to automatically extract structured factual information from unstructured text. As Riloff [2] discusses, the task of automatically extracting information from texts involves identifying a predefined set of concepts, deciding whether a text is relevant for a certain domain, and, if so, extracting a set of facts from that text.

An IE system has three components regardless of the language and domain for which it is developed: linguistic preprocessing, learning and application, and post processing. Linguistic preprocessing uses different tools to make the natural language texts ready for extraction. The learning and application component learns a model and extracts the required information from the preprocessed text.

In the last component, semantic post processing assigns the extracted information to its predefined attribute category and handles normalization and duplication problems in the extracted data [5].
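The three components described above can be pictured as a chain of functions. The following is a minimal sketch in Python, not the thesis implementation: the names `Document`, `preprocess`, `extract`, `postprocess` and `run_pipeline` are hypothetical, and the digit-only rule merely stands in for a real extraction model.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical container passed from one component to the next.
    raw: str
    tokens: list = field(default_factory=list)
    candidates: list = field(default_factory=list)  # (slot, text) pairs
    record: dict = field(default_factory=dict)      # slot -> list of values


def preprocess(doc):
    # Stand-in for linguistic preprocessing: plain whitespace tokenization.
    doc.tokens = doc.raw.split()
    return doc


def extract(doc):
    # Stand-in for the extraction model: every digit-only token becomes a
    # candidate for an assumed "number" slot.
    doc.candidates = [("number", t) for t in doc.tokens if t.isdigit()]
    return doc


def postprocess(doc):
    # Assign each candidate to its slot and drop duplicates, keeping order.
    for slot, text in doc.candidates:
        values = doc.record.setdefault(slot, [])
        if text not in values:
            values.append(text)
    return doc


def run_pipeline(raw):
    return postprocess(extract(preprocess(Document(raw)))).record
```

Calling `run_pipeline` on a string returns a dictionary keyed by slot name. Keeping the stages as separate functions also mirrors the way the study evaluates each component on its own.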

In principle, there are two approaches to designing an IE system: (1) the learning approach, and (2) the Knowledge Engineering approach. For systems or modules using learning techniques, an annotated corpus of domain-relevant texts is necessary. This approach calls for someone who has enough knowledge about the domain and the tasks of the system to annotate the texts appropriately. The annotated texts are the input of the system or module, which runs a training algorithm on them. Thus, the system obtains knowledge from the annotated texts and can use it to gain the desired information from new texts of the same domain.

The Knowledge Engineering (KE) approach needs a system developer who is familiar with both the requirements of the application domain and the functions of the designed IE system. The developer is concerned with defining the rules used to extract the relevant information. Therefore, a corpus of domain-relevant texts must be available for this task [9].

IE is quite different from IR. An IR system finds texts that are relevant to a query and presents them to the user. An IE application analyzes texts and presents only the specific information from them that the user is interested in.

IE systems are more difficult and knowledge-intensive to build, and are to varying degrees tied to particular domains and scenarios. They are also more computationally intensive than IR systems. In applications where there are large text volumes, however, IE is potentially much more efficient than IR because it can dramatically reduce the amount of time people spend reading texts [4].

During the last ten years, IE has become an increasingly researched field. As [2] stated, “unfortunately, during this time most of the known IE systems have been invented for texts written in the English language. In comparison to the success registered for English IE systems for most of other languages are still lacking essential components”.

Depending on the number of native speakers, prosperity of countries, and the need for natural language processing capabilities, as well as due to the complexity of certain languages, IE systems are uniquely designed for individual languages.

1.2. STATEMENT OF THE PROBLEM

With the popular use of the World Wide Web as a global information system, a number of newspapers are already flourishing online. Likewise, in Ethiopia most Amharic newspaper publishers provide their publications online. The “Ethiopian Reporter” is one of the well-known newspapers in Ethiopia. It appears twice a week with content such as news, politics, science and technology, sport, business, vacancies and social affairs. The newspaper presents the vacancies of different organizations in structured, semi-structured and unstructured forms. As Cowie and Wilks [3] noted, manually extracting information from such often unstructured or semi-structured text is a tiresome and time-consuming job. Thus, getting the right information for decision making from the abundant unstructured text is a big challenge.

In addition, the unavailability of tools for the Amharic language that extract and exploit valuable information effectively enough to satisfy users has been a major problem. It is hoped that the availability of an IE tool can ease this information searching process.

Unlike other research domains, IE is language and domain dependent [38]. An IE system developed for English in a specific domain may not work for Amharic even if the domain is similar. There are language-specific issues that may not be handled by a system developed for English, because an IE system has to account for the nature of the language and the domain for which it is developed.

To the best of the researcher's knowledge, the work of Tsedalu [21] is the only research conducted on an Amharic IE system. That work is limited to extracting numeric and nominal data from Amharic news text. Only news texts about a single issue were considered, and extraction of relationships between entities was outside the scope of that work. In addition, although the language is the same as in this research, the domain is different. That system may not fully handle the concerns that arise in information extraction from Amharic vacancy announcement text (AVAT). For these and other reasons, the researcher undertook this research to design an IE system for AVAT.

This study has attempted to answer the following research questions:

1. What approaches should be followed in designing an Amharic IE system that identifies useful information from vacancy announcements?
2. What algorithms are suitable for automatic Amharic IE?
3. What model should be adopted to design an Amharic IE system?

1.3. OBJECTIVE OF THE STUDY

1.3.1 GENERAL OBJECTIVE

The general objective of this study is to design an information extraction system for AVAT.

1.3.2. SPECIFIC OBJECTIVES

The specific objectives of the research are:

To review word categorization and character representation in the Amharic language.

To build an architecture for IE from AVAT.

To develop suitable approaches and algorithms for IE.

To develop a prototype system that demonstrates the potential of the Amharic IE system.

To evaluate the performance and usability of the prototype developed for Amharic information extraction.

1.4. METHODOLOGY

1.4.1. STUDY DESIGN

The design of this research is experimental, and the study involved several activities. Identifying the problem in the area of AVAT was the starting point. To address the problem, an IE system was designed and implemented. Once a system is designed, testing is mandatory to check its applicability and to evaluate its performance. Accordingly, the system designed in this study was tested and evaluated on the test dataset.

1.4.2. LITERATURE REVIEW

In order to gain a better understanding of IE and to design a system for the Amharic language, local and international research was thoroughly reviewed. Literature such as journal articles, proceedings, papers and books was reviewed to achieve the objective of this research.

1.4.3. DATA SOURCES AND DATA PREPARATION FOR THE EXPERIMENT

The researcher collected the AVATs required for training and testing the system from the “Ethiopian Reporter” newspaper, which is published in Amharic twice a week. For the purpose of this study, 116 AVATs containing 10,766 words in total were selected purposefully to cover a range of vacancy announcements. They differ in the organization that posted the vacancy and in the type of vacancy. The newspaper was chosen as a data source since it has a large collection of AVATs in its database.

After the raw AVATs were collected, different data preprocessing tasks (such as tokenization, normalization and transliteration) were undertaken and a gazetteer was prepared.
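To make the tokenization and normalization steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the thesis code: the function names and the `HOMOPHONES` table are hypothetical, and the table covers only the base forms of a few Ethiopic characters that are pronounced alike, whereas a full normalizer would need to cover every order of each character.

```python
import re

# Ethiopic punctuation marks: wordspace, full stop, comma, semicolon,
# colon, preface colon, question mark, paragraph separator.
ETHIOPIC_PUNCT = "፡።፣፤፥፦፧፨"

# Assumed, deliberately incomplete mapping of homophone characters
# (base forms only) to one canonical character.
HOMOPHONES = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"})


def normalize(text):
    # Replace each listed variant character with its canonical form.
    return text.translate(HOMOPHONES)


def tokenize(text):
    # Turn Ethiopic punctuation into spaces, then split on whitespace.
    cleaned = re.sub("[" + re.escape(ETHIOPIC_PUNCT) + "]", " ", text)
    return cleaned.split()


def preprocess(text):
    # Normalize first so that all later lookups see one spelling per word.
    return tokenize(normalize(text))
```

Normalizing before tokenizing means that a gazetteer or rule set only needs to store one spelling of each word.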

UNDERSTANDING OF DOMAIN LANGUAGE

Facts about the Amharic language that are important for the research work, such as word categorization, character and number representation and other language-specific issues, have been analyzed and presented. This helps to understand the nature of the language with regard to information extraction.

1.4.4. DESIGN AND IMPLEMENTATION OF AVATIES

The design phase contains document preprocessing, learning and extraction, and post processing as its three main components. To develop a prototype system, appropriate tools were selected and used.

The data preprocessing components of the IE system, such as the Tokenizer, Normalizer and Transliterator, which are mostly language-specific algorithms, were developed in the Python programming language. Python was also used to develop the candidate text selector and tagger and the candidate text extractor. Python was chosen mainly because of the researcher's familiarity with it.

The POS tagger developed by Gebrekidan [36] is used as one of the features in the IE component. In addition, Microsoft SQL Server 2008 was used to store the extracted candidates, and the Visual Basic programming language was used to develop the user interface, which helps a user interact with the system and access data from the database.

1.5. APPLICATION OF RESULTS AND BENEFICIARIES

Nowadays most people use online newspapers as a source of information for vacancy announcements. Thus, those who use newspapers and websites for job search are the main beneficiaries of this study. It will help them save time when searching for detailed information about jobs that are posted in unstructured format. It also helps them access facts and details easily.

The system will also be of great significance to newspaper publishers, as it will help them provide vacancy news in an attractive and structured fashion. It can also keep them from committing errors when converting unstructured AVAT into structured form, and it can enhance their day-to-day activities.

Besides this, the structured information extracted from unstructured text can be used as input for other applications, such as question answering systems.

Finally, it is hoped that this study will serve as a starting point for other researchers to focus more on this research issue.

1.6. SCOPE AND LIMITATIONS OF THE STUDY

Designing an information extraction system requires intensive knowledge of natural language processing. The main limitation of the study is the unavailability of a sufficiently large corpus and of word category resources for natural language processing in the domain. This constrains the number of rules that can be generated.

A full-fledged information extraction system requires a number of NLP tools such as a Sentence Parser, Part of Speech (POS) tagger, Named Entity Recognizer (NER), Co-reference Resolution and others. Even though some NLP systems for the Amharic language have been developed by other researchers, they are not publicly available.

With these limitations in mind, the researcher designed a rule-based IE system only for a specific domain, AVAT, and the study was confined to extracting organization, job title, required qualification, work experience, salary, number of people required, job agreement and deadline data from AVAT. Extraction of other types of information from AVAT is outside the scope of the study.

1.7. ORGANIZATION OF THE STUDY

The thesis is organized into six chapters. The first chapter contains the background, statement of the problem, objectives and methodology of the research. Chapter 2 reviews the literature on IE and related subject areas. It also lays the foundation for understanding what an IE system comprises, what approaches are used, and which components an IE system requires. The last part of the chapter discusses related work on IE systems in different languages and domains.

Chapter 3 discusses the Amharic language with regard to IE and presents language-specific issues such as the writing system and language structure. Chapter 4 discusses the architecture and design of the system: its main components, their functional operation and the sub-components of each. It also covers the main implementation issues of the IE system and the algorithms and techniques used to develop it. The results and performance evaluation of the system are presented in Chapter 5. Finally, Chapter 6 gives conclusions and recommendations for further study.

CHAPTER TWO LITERATURE REVIEW

2.1. INTRODUCTION

Nowadays, an increasing amount of information is available in the form of electronic documents. This makes it nearly impossible to manually search, filter and choose which information one should use for his/her own purpose [20].

Different scholars have tried to develop information management systems that facilitate drawing summarized and relevant information from an ocean of information, so that the right information for decision making can be acquired. Among the solutions to these problems are Information Retrieval (IR), Information Extraction (IE), Question Answering, Text Summarization and Text Categorization [21].

In this chapter we describe the requirements and components of IE systems and present various approaches for building such systems. We then present important methodologies and systems for IE.

Related NLP fields are also reviewed in order to show their similarities to and differences from IE. The standards used to evaluate the performance of IE systems are also presented in this chapter.

2.2. INFORMATION EXTRACTION (IE)

IE has become an important approach to the problem of information overload: it locates target phrases in a document and transforms them into a structured representation.

As defined by Eikvil [15], “it is the task of locating specific pieces of data from a natural language document, a particularly useful sub-area of natural language processing (NLP). In IE, the data to be extracted from a natural language text is given by a template may be either one of a set of specified values or strings taken directly from the document.”

Mooney and Califf [16] also define it: “IE is a form of shallow text processing that locates a specified set of relevant items in a natural language document, transforming unstructured text into a structured database”. Systems for this task require significant domain-specific knowledge. In general, then, IE is the process of extracting relevant and factual data from unstructured or free text.

IE usually uses NLP tools, lexical resources and semantic constraints for better efficiency [21]. The General Architecture for Text Engineering (GATE), a widely known open source software system for natural language processing, defines IE as a system that analyses unstructured text in order to extract information about pre-specified types of events, entities or relationships.

According to Wilks and Brewster [8], the requirement for templates and the bundling of domain- and corpus-specific information with IE techniques are two major challenges for IE.

2.3. BUILDING INFORMATION EXTRACTION SYSTEMS

At this point, we shall turn our attention to what is actually involved in building IE systems. Before discussing in detail the basic parts of an IE system, we point out that there are two basic approaches to the design of IE systems, which we label as the Knowledge Engineering Approach and the Automatic Training Approach.

I. KNOWLEDGE ENGINEERING APPROACH

In the Knowledge Engineering Approach, the grammars used by a component of the IE system are developed by a “knowledge engineer”, i.e. a person who is familiar with the IE system and with the formalism for expressing rules for that system. The knowledge engineer, either alone or in consultation with an expert in the application domain, writes rules for the IE system component that mark or extract the sought-after information.

Typically the knowledge engineer will have access to a moderate-size corpus of domain-relevant texts (a moderate-size corpus is all that a person could reasonably be expected to personally examine), and his or her own intuitions [1]. It is obviously the case that the skill of the knowledge engineer plays a large factor in the level of performance that will be achieved by the overall system. In addition to requiring skill and detailed knowledge of a particular IE system, the knowledge engineering approach usually requires a lot of labor as well [1].

Building a high performance system is usually an iterative process whereby a set of rules is written, the system is run over a training corpus of texts, and the output is examined to see where the rules under and over generate. The knowledge engineer then makes appropriate modifications to the rules, and iterates the process [1]. Thus, the performance of the IE system depends on the skill of the knowledge engineer.
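The write, run and inspect cycle described above can be sketched in a few lines of Python. This is a hedged illustration rather than the rules used in the thesis: the three patterns, the slot names and the helpers `apply_rules` and `missed_slots` are hypothetical, and `SAMPLE` is a fragment of the vacancy announcement quoted later in this chapter.

```python
import re

# Fragment of the vacancy announcement quoted in Section 2.4.
SAMPLE = "ከምረቃ በኋላ 6 ዓመት የሥራ ልምድ የቅጥር ሁኔታ ኮንትራት ደመወዝ በስምምነት"

# Hypothetical hand-written rules: slot name -> pattern whose first group
# captures the candidate text next to its context words.
RULES = {
    "experience": re.compile(r"(\d+)\s+ዓመት\s+የሥራ\s+ልምድ"),
    "work_agreement": re.compile(r"የቅጥር\s+ሁኔታ\s+(\S+)"),
    "salary": re.compile(r"ደመወዝ\s+(\S+)"),
}


def apply_rules(text, rules=RULES):
    # Run every rule over the text and keep the first match per slot.
    found = {}
    for slot, pattern in rules.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found


def missed_slots(text, expected, rules=RULES):
    # Slots that the rules left empty or filled with the wrong value;
    # the knowledge engineer would revise the rules for these.
    found = apply_rules(text, rules)
    return sorted(s for s, v in expected.items() if found.get(s) != v)
```

Running `missed_slots` over a hand-annotated training corpus after each change to `RULES` gives the knowledge engineer the list of cases to examine in the next iteration.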

II. AUTOMATIC TRAINING APPROACH

The Automatic Training Approach is quite different. Following this approach, it is not necessary to have someone on hand with detailed knowledge of how the IE system works, or how to write rules for it.

It is necessary only to have someone who knows enough about the domain and the task to take a corpus of texts, and annotate the texts appropriately for the information being extracted.

Typically, the annotations would focus on one particular aspect of the system’s processing. For example, a name recognizer would be trained by annotating a corpus of texts with the domain-relevant proper names.

A co-reference component would be trained with a corpus indicating the co-reference equivalence classes for each text. Once a suitable training corpus has been annotated, a training algorithm is run, resulting in information that a system can employ in analyzing novel texts. Another approach to obtaining training data is to interact with the user during the processing of a text: the user is allowed to indicate whether the system’s hypotheses about the text are correct [4, 9]. The approaches mentioned above can be applied to free text, semi-structured text or structured text used as input to an IE system [16].

Free text: an unstructured collection of text. It cannot easily be managed by computers because it has no structure or predefined format. Natural language components are applied in order to extract information from free text [21].

Semi-structured text: data that is not in the form of tuples like structured text but is also different from free text; it lies between the two. Information marked up with HTML tags is an example of semi-structured text [21].

Structured text: textual information that exists in a database or file and follows a predefined and strict format. Such information can easily be extracted by using the format description, since the format is known [21].

IE approaches based on supervised machine learning techniques are divided into the following three categories [21]:

I Rule learning
II Linear separators
III Statistical learning

I Rule Learning

This approach is based on a symbolic inductive learning process. The extraction patterns represent the training examples in terms of attributes and relations between textual elements.

Some IE systems use propositional learning (i.e. zero order logic), for instance AutoSlog-TS and CRYSTAL, while others perform relational learning (i.e. first order logic), for instance WHISK and SRV. This approach has been used to learn from structured, semi-structured and free-text documents [21].

II Linear Separators

In this approach the classifiers are learned as sparse networks of linear functions (i.e. linear separators of positive and negative examples). It has been commonly used to extract information from semi-structured documents. It has been applied in problems such as extraction of data from job ads, and detection of an e-mail address change [21].

In general, IE systems based on this approach rest on the hypothesis that looking at the word combinations around the information of interest is enough to learn the required extraction patterns [21].

III Statistical Learning

This approach is focused on learning Hidden Markov Models (HMMs) as useful knowledge to extract relevant fragments from documents [21].

These IE systems also differ from each other in the features that they use. Some use only basic features such as token string, capitalization, and token type (word, number, etc.).

In addition, others use linguistic features such as part-of-speech, semantic information from gazetteer lists, and the outputs of other IE systems (most frequently general purpose named entity recognizers).

A few systems also exploit genre-specific information such as document structure. In general, the more features a system uses, the better the performance it can achieve. One of the most successful machine learning methods for IE is the Support Vector Machine (SVM), a general supervised machine learning algorithm. It has achieved state-of-the-art performance on many classification tasks, including named entity recognition.
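The kinds of features listed above can be illustrated with a small feature extractor. The sketch below is an assumption-laden example in Python, not a component of any system cited here: the function names and feature keys are hypothetical, and capitalization is left out because the Ethiopic script has no letter case.

```python
def token_type(token):
    # Coarse token type: number, word, or anything else (e.g. phone numbers).
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "other"


def token_features(tokens, i, gazetteer, pos_tags=None):
    # Build a feature dictionary for tokens[i]. `gazetteer` is a set of
    # known single-token entries; `pos_tags` is an optional list of tags
    # aligned with `tokens` (both are hypothetical inputs).
    token = tokens[i]
    return {
        "token": token,
        "type": token_type(token),
        "in_gazetteer": token in gazetteer,
        "pos": pos_tags[i] if pos_tags else None,
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
    }
```

A dictionary like this can be fed to a rule matcher or, after vectorization, to a classifier such as an SVM.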

2.4. ARCHITECTURE OF INFORMATION EXTRACTION SYSTEM

Different scholars use different steps to design information extraction systems for different languages and domains. The work in [1] categorizes IE into six main tasks:

Part-of-Speech (POS) Tagging

Named Entity Recognition (NER)

Syntax Analysis

Co-references and Discourse Analysis

Extraction Patterns

Bootstrapping

I. Part-of-speech tagging (POS)

POS tagging is the act of assigning to each word in a sentence a tag that describes how the word is used in that sentence. That is, POS tagging determines whether a given word is used as a noun, adjective, verb, etc.

As Pal and Molina [23] acknowledge, POS tagging is one of the best-known disambiguation problems because many words are ambiguous: they may be assigned more than one POS tag (for example, the English word round may be a noun, an adjective, a preposition, an adverb or a verb).

A POS tagger finds the possible tags or lexical categories for each word that is in its lexicon and guesses possible tags for unknown words. It then chooses a tag for each word that is ambiguous in its part of speech. If a word can be assigned more than one tag, the word can have different meanings or functions in different contexts.

According to Antonio [29], there are two approaches to automatic POS tagging. Rule-based approaches use linguistic knowledge to formulate simple rules that assign a part of speech to an ambiguous word using context information. Statistical approaches (of which hidden Markov models trained using the expectation-maximization algorithm are the standard model) use the statistics collected from ambiguously or unambiguously tagged texts to estimate the likelihood of each possible interpretation of a sentence or text portion, so that the most likely disambiguation is chosen.
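How a hidden Markov model picks the most likely tag sequence can be shown with the Viterbi algorithm. The sketch below is only an illustration in Python: the tag set and all probabilities are hand-set toy values chosen for the English word "round", not parameters estimated with expectation-maximization, and the function assumes a non-empty sentence.

```python
# Toy model (all values are assumptions made for this example).
TAGS = ["DET", "ADJ", "NOUN"]
START_P = {"DET": 0.8, "ADJ": 0.1, "NOUN": 0.1}
TRANS_P = {
    "DET": {"ADJ": 0.4, "NOUN": 0.6},
    "ADJ": {"ADJ": 0.2, "NOUN": 0.8},
    "NOUN": {"DET": 0.6, "ADJ": 0.1, "NOUN": 0.3},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "ADJ": {"round": 0.6},
    "NOUN": {"round": 0.2, "table": 0.8},
}


def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Best predecessor for tag t at this position.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(t, 0.0), p) for p in tags
            )
            column[t] = (prob * emit_p[t].get(word, 0.0), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
```

With these toy values, "round" is tagged as an adjective in "the round table" because the path through the adjective reading has the higher overall probability. Plain probabilities are used here for readability; an implementation for longer sentences would normally work with log probabilities to avoid underflow.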

II. Named Entity Recognition (NER)

Named entities are among the most frequently extracted types of tokens when extracting information from documents. Named entity recognition is the classification of every word in a document as a person name, organization, location, date, time, monetary value, percentage, or “none of the above”. Some approaches use a simple lookup in gazetteers, i.e. predefined lists of geographic locations, company names, person names, names of animals and other things, while others use trainable Hidden Markov Models to identify named entities and their types.

For example, the NER component takes the following AVAT and recognizes the named entities and numbers that will be used as attributes for the predefined database slots:

አ/ማኅበራችን ከዚህ በታች በተመለከተው የሥራ መደብ አመልካቾችን አወዳድሮ በኮንትራት ለመቅጠር ስለሚፈልግ የትምህርትና የሥራ ልምድ ማስረጃዎቻችሁን በመያዝ በ10 ተከታታይ የሥራ ቀናት ውስጥ ሰው ሃብት ሥራ አመራር ቡድን ቢሮ ቁጥር 104 እየቀረባችሁ መመዝገብ የምትችሉ መሆኑን እናስታውቃለን። የሥራ መደቡ የሕግ ባለሙያ ተፈላጊ ችሎታ ከታወቀ ዩኒቨርስቲ በሕግ የመጀመሪያ ዲግሪ በመደበኛ የትምህርት ክፍለ ጊዜ የጨረሰና ከምረቃ በኋላ 6 ዓመት የሥራ ልምድ የቅጥር ሁኔታ ኮንትራት ደመወዝ በስምምነት አድራሻ፡- ንፋስ ስልክ የቀድሞው ኢጭማኮ ቅጥር ግቢ 1ኛ ፎቅ ቢሮ ቁጥር 104 ስ.ቁ 011-42-38-64 ወይም 0114-42-47-77 የውስጥ መስመር 525/253 ኮሜት ትራንስፖርት አክሲዮን ማኅበር

The names and numbers in the text, such as ንፋስ ስልክ, ኮሜት ትራንስፖርት አክሲዮን ማኅበር, 104, ሕግ ባለሙያ, ሕግ, 6, ዲግሪ, ሰው ሃብት ሥራ አመራር and ዩኒቨርስቲ, will be extracted as the named entities that represent different things in the text.
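The simple lookup approach mentioned above can be sketched as a longest-match search over a gazetteer. The following Python example is illustrative only: the `GAZETTEER` entries are taken from the vacancy text above, but the entity types assigned to them and the function name `gazetteer_ner` are assumptions.

```python
# Hypothetical gazetteer: tuple of tokens -> assumed entity type.
GAZETTEER = {
    tuple("ኮሜት ትራንስፖርት አክሲዮን ማኅበር".split()): "organization",
    tuple("ንፋስ ስልክ".split()): "location",
}


def gazetteer_ner(tokens, gazetteer=GAZETTEER):
    # Scan left to right, always trying the longest gazetteer entry first.
    max_len = max((len(key) for key in gazetteer), default=1)
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in gazetteer:
                entities.append((" ".join(span), gazetteer[span]))
                i += n
                break
        else:
            # No gazetteer entry starts here; keep digit-only tokens as numbers.
            if tokens[i].isdigit():
                entities.append((tokens[i], "number"))
            i += 1
    return entities
```

Trying the longest span first keeps a multi-word name such as the organization above from being split into shorter partial matches.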

III. Syntax analysis

In contrast to POS tagging, syntax analysis, also called syntax parsing, looks beyond the scope of single words. During syntax analysis we attempt to identify syntactical parts of a sentence (verb group, noun group and prepositional phrases) and their functions (subject, direct and indirect object, modifiers and determiners). Simple sentences, consisting, for instance, of a main clause only, can be parsed using a finite state grammar. Simple finite state grammars are often not sufficient to parse more complex sentences, consisting of one or more subordinate clauses in addition to the main clause, or containing syntax structures, such as prepositional phrases, adverbial phrases, conjunction, personal and relative pronouns and genitives in noun phrases.

Using finite state grammars in such cases may result in errors. Instead, those cases are handled by statistically founded methods which have to be trained with training text corpora.
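A finite state grammar for simple noun groups can be written as a small state machine over POS tags. The sketch below, in Python, is a toy English example under stated assumptions: the tag names `DET`, `ADJ` and `NOUN`, the pattern "optional determiner, any adjectives, one or more nouns" and the function name `noun_groups` are all hypothetical, and an Amharic grammar would need its own patterns.

```python
def noun_groups(tagged):
    # `tagged` is a list of (word, tag) pairs.
    # Pattern recognized: DET? ADJ* NOUN+
    groups, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":              # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # at least one noun
            k += 1
        if k > j:
            groups.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1  # no noun group starts here; move on
    return groups
```

A pattern like this has no way to attach a subordinate clause or a nested phrase to the group, which is the limitation of finite state grammars that the text points out.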

[...]

Excerpt out of 105 pages

Details

Title
Designing an Information Extraction System for Amharic Vacancy Announcement Text
College
Addis Ababa University
Course
Natural Language Processing
Grade
Very Good
Author
Year
2011
Pages
105
Catalog Number
V289226
ISBN (eBook)
9783656895565
ISBN (Book)
9783656895572
File size
1129 KB
Language
English
Notes
This paper gives a very good view of Natural Language Processing, specifically Information Extraction for a local language of Ethiopia. The author is a pioneer on this topic in our school and obtained remarkable results from his study; I believe that this study is a milestone for the integration of local languages into the computerized world.
Tags
information, extraction
Quote paper
Sintayehu Hirpassa (Author), 2011, Designing an Information Extraction System for Amharic Vacancy Announcement Text, Munich, GRIN Verlag, https://www.grin.com/document/289226
