Development of an automatic news summarizer for isiXhosa language


Master's Thesis, 2017
115 Pages, Grade: 75

Excerpt

Table of Contents

ACKNOWLEDGEMENTS

DEDICATION

ACRONYMS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

LIST OF LISTINGS

INTRODUCTION

1. INTRODUCTION AND BACKGROUND
1.0 Overview
1.1. Automatic Text Summarization (ATS)
1.2. Motivation
1.3. The Problem Statement and Justification of the Study
1.4. Research Questions
1.5. Objectives of the Study
1.5.1. Specific Objectives
1.6. Significance of the Study
1.7. Research Methodology
1.8. Literature Review
1.9. Data Source Collection and Preparation
1.9.1. Corpus Preparation
1.9.2. Manual Summary Preparation
1.10. Summarization Method and Tools used in this Study
1.10.1. Development Tools
1.10.2. The Natural Language Toolkit (NLTK)
1.10.4. Installing the NLTK data
1.10.8. Operating System
1.10.9. The Python Programming Language
1.10.10. The Numpy Library
1.10.11. PyCharm Integrated Development Environment (IDE)
1.12. Scope and Limitations of the Study
1.13. Outline of the Dissertation

CHAPTER TWO
LITERATURE REVIEW
2.0 Introduction
2.1. Automatic Text Summarization
2.2. Processes of Automatic Text Summarization
2.2.1. Summarization Parameters
2.2.2. Methods of Summarization
2.3. Linguistic Concepts to Consider
2.3.1. Coherence
2.3.2. Cohesion
2.3.3. Lexical Cohesion
2.4. News Writing Structure
2.5. Evaluation Methods used in Automatic Summarization

CHAPTER THREE
THE XHOSA LANGUAGE
3.0 Introduction
3.1. Xhosa Consonants and Vowels
3.1.1. The Vowel System
3.1.2. Consonants
3.2. Overview of Xhosa Orthography
3.3. Xhosa Morpheme Types
3.3.1. Xhosa Nouns
3.3.2. Xhosa Prefixes
3.3.3. The Xhosa Noun Stems
3.3.4. Xhosa Suffixes
3.3.5. Pronouns
3.3.6. Verbs
3.3.7. Adjectives
3.3.8. Apostrophe
3.4. Abbreviation
3.5 Summary

CHAPTER FOUR
METHODOLOGY AND SYSTEM DESIGN
4.0 Introduction
4.1. Methodology
4.2. Proposed Algorithm
4.4.1. How the Algorithm Works
4.3. Preprocessing
4.3.1. Tokenization
4.3.2. Stop Words
4.3.3. Stemming
4.6 Sentence Ranking
4.7 Summary Generation
4.8 System Design
4.10 Summary

CHAPTER FIVE
IMPLEMENTATION
5.0 Introduction
5.1. Tokenization
5.2. Stop Word Removal
5.3. Stemming
5.4. Implementation
5.4.1. The IsiXhoSum Interface
5.4.2. Modules of the Xhosa Text Summarizer
5.5. Experimentation
5.5.1. Corpus Preparation
5.5.2. Creation of Manual Summaries
5.6. Summary

CHAPTER SIX
TESTING, RESULTS, AND DISCUSSION
6.0 Introduction
6.1. Testing
6.2. Results
6.2.1. Results of Subjective Evaluation
6.2.2. Results of Objective Evaluation
6.3. Discussion of the Results
6.4. Discussion on Coherence and Cohesion
6.5. Summary

CHAPTER SEVEN
CONCLUSION AND FUTURE WORK
7.0 Introduction
7.1. Research Summary
7.2. Conclusion and Future Work

REFERENCES

LIST OF APPENDIXES

Appendix A: List of Publications

Appendix B: Xhosa Stemmed Nouns and Verbs

Appendix C: The Xhosa Stop Word List

Subjective Evaluation Results

Appendix D: Comparison of the Methods in keeping the first sentence

Appendix E: Objective Evaluation results

Appendix F: Example Summary

Appendix G: Manual Summaries

Appendix H: System generated Summaries

ACKNOWLEDGEMENTS

First and foremost, I would like to give my sincere thanks to God Almighty for helping me complete this work.

I would like to express my sincere appreciation to my supervisor, Dr. Zelalem Shibeshi, for the inspiration, advice and knowledge he gave me throughout this two-year period. Those weekly meetings have shaped me into the dedicated person I am today.

I would like to thank my co-supervisor, Professor C. Botha, for the tremendous support and advice that he gave to make this research a success.

I would also like to express my gratitude to the Telkom Center of Excellence (CoE) and the National Research Foundation (NRF) for the financial support they provided for my study at the University of Fort Hare.

I want to thank the Head of Department (HOD), Mr. S. Scott, for allowing me to do my research in the Department of Computer Science at the University of Fort Hare.

I send my sincere appreciation to Mr. Zengethwa, who helped significantly in the creation of the manual extracts for this study. Without his work this research would not have been possible.

Last but not least, I send my sincere gratitude and appreciation to my family for believing in me ever since I embarked on this research journey.

I want to thank all the participants who helped during the evaluation period of this study.

Finally, I want to thank everybody who has supported me in one way or another during this study. I am much indebted to all of you.

DEDICATION

I dedicate this work to my whole family, i.e. my Grandmother, my Father, my Mother, Brothers, Cousins, and Aunts, not forgetting my late Uncle, my partner, son, and nephews. They had confidence in me and gave me words of guidance. All the motivation that you gave throughout the years is greatly appreciated.

[Figure not included in this excerpt]

ABSTRACT

From a practical perspective, given the abundance of digital content nowadays, a technological solution that summarizes written text without losing its message, coherence and cohesion of ideas is essential. Such technology saves readers time and gives them a chance to focus on the content that matters most.

This is one of the research areas in natural language processing and information retrieval to which this dissertation tries to contribute. It tries to contextualize tools and technologies that were developed for other languages in order to automatically summarize textual Xhosa news articles. Specifically, the dissertation aims at developing a text summarizer for textual Xhosa news articles based on extraction methods.

In doing so, it examines the literature to understand the techniques and technologies used to analyze the contents of a written text and to transform and synthesize it, studies the phonology and morphology of the Xhosa language, and finally designs, implements and tests an extraction-based automatic news article summarizer for the Xhosa language. It also presents the literature review, the research design, and the methods, tools and technologies used to design, implement and test the pilot system.

Two approaches were used to extract relevant sentences: term frequency and sentence position. The Xhosa summarizer is evaluated using a test set. This study employed both subjective and objective evaluation methods. The results of both methods are satisfactory.

Keywords: Xhosa, Automatic Text Summarization, Term Frequency, Sentence Position.

LIST OF TABLES

Table 1: Vowels in Xhosa, Vanderstouwe [34, p.3]

Table 2: Velaric Sounds (Clicks), Vanderstouwe [34, p.8]

Table 3: Examples of Xhosa Nouns that end with a Vowel, Mtuze et al. [39]

Table 4: Sample of Common Xhosa Stop Words used in this Research

Table 5: Sample of Xhosa Stemmed Verbs

Table 6: Sample of Xhosa Stemmed Nouns

Table 7: Basic Statistics of the Xhosa Test Set

Table 8: Manual Summaries Used for Evaluation

Table 9: Text files tested on Xhosa Text Summarizer

Table 10: Shows Results of a Better Method

Table 11: Results of a coherent summary

Table 12: Linguistic Quality Results

Table 13: Output of the ROUGE2.0 tool

LIST OF FIGURES

Figure 1: Architecture Design of Xhosa Text Summarizer

Figure 2: Xhosa Text Summarizer interface

Figure 3: Algorithm for Selecting Article in the Corpus

LIST OF LISTINGS

Listing 1: Sentence Tokenisation Code

Listing 2: Removing Noisy Words from the Text

Listing 3: Stemming of words Code Method

INTRODUCTION

1. INTRODUCTION AND BACKGROUND

1.0 Overview

This dissertation traces the insights made during exploratory research work completed on the subject of making summaries of Xhosa news articles. This chapter highlights the topic by giving a context for the research and by explaining the motivation for initiating the project. Having examined the results of the exploration, the primary objectives of the research are set out, together with its scope and conclusions.

1.1. Automatic Text Summarization (ATS)

The volume of information available to Internet users has been increasing on a daily basis. In this information age, the growth of electronic information has necessitated intensive research in the areas of Natural Language Processing (NLP) and Information Retrieval (IR). The fast growth of information has made it difficult for many users to cope with all the text that is potentially of interest to them. As a result, systems that can automatically summarize one or more documents have recently become a focus of interest in the field of automatic summarization [1]. Automatic text summarization has become a suitable tool for assisting people in the task of reading large volumes of textual information.

Examples of summaries that users choose are: news headlines, scientific abstracts, minutes of meetings, and weather forecasts. These are all kinds of summaries people enjoy reading on a daily basis[2].

A summary can help users to get the meaning of a complete text document within a short time. The following are some of the general reasons that support the necessity of text summarization.

- Summarization improves document indexing efficiency
- A summary or abstract saves reading time
- A machine-generated summary is free from bias
- A summary or an abstract facilitates document selection and literature searches

1.2. Motivation

There has been much work done for many languages throughout the world with regard to automatic summarization. The Xhosa language is on the rise in terms of electronic content, especially online. As a result, there is a need for language processing applications for the Xhosa language, and an automatic text summarizer is one such application that is required by the local community. A large number of people and organizations can benefit from this application by obtaining the most relevant content within the minimum period. As a result of this need, there is a strong motivation to design a Xhosa Text Summarizer.

1.3. The Problem Statement and Justification of the Study

The rate at which information has grown electronically has made it difficult for users to obtain important information in the shortest possible time. In other words, users are highly affected by information overload. Information overload also leads users to read unnecessary details and waste their time. Most of the time they read documents that they are not even interested in, in too much detail. People who speak different languages around the globe are facing this problem, and this is true for people who speak the Xhosa language.

These days many agencies produce a large amount of textual information specifically for Xhosa speakers. These include media bodies such as the South African Broadcasting Corporation (SABC), which broadcasts the news in the Xhosa language, online news publishers and national newspaper publishers. The range of documents goes beyond the publishing of news; it includes reports from government offices, especially from the Provincial Legislatures of the Eastern and Western Cape Provinces. Most of these items run to more than five paragraphs, which is too long for readers in this busy world.

As mentioned above, newspapers and other news releases in the Xhosa language reach readers from many sources. It is evident that it is of paramount importance to read news items to keep abreast of what is happening in the world. However, because of the busy lives that people lead and their day-to-day activities, there is frequently insufficient time to read entire documents resulting in important information being missed.

Automatic Text Summarization (ATS) systems are still scarce, especially for the Xhosa language. It would be advantageous if the agencies, some of which host daily shows, that release news in the Xhosa language could have a tool that can provide users with a short version of the original text.

Hence, there is a real need to design a tool that would deal with the problem of information overload, especially in the domain of news in the Xhosa language.

This study presents a possible solution to this particular problem. This work also makes a contribution towards developing Natural Language Processing applications, which can be used by Xhosa native speakers. This work investigates text summarization applications for Xhosa, to increase the scope of the text summarization research for this language.

1.4. Research Questions

How can a computerized system select key sentences?

How can language-based rules be used to extract the salient sentences and reduce content in the text?

1.5. Objectives of the Study

The aim of this study is to investigate, implement, and evaluate a Xhosa text summarizer. The summarizer is named IsiXhoSum and is based on the methods and algorithms put forward by H. P. Luhn [5] and H. P. Edmundson [14]. Language-specific rules were adapted and implemented to support the Xhosa language.

The general objectives of this research are: to understand the structure of Xhosa news text, to investigate how a Xhosa text summarizer can be created that can extract relevant sentences from a written text document, and to present these as a readable summary. The aim is to discover an approach that does not require too many linguistic resources but that gives an acceptable result.

The original methods developed when automatic summarization started did not rely heavily on semantic resources such as parsers, annotated corpora and so on. This work concentrates on these early methods, which do not require many semantic resources.

Up to the present time, no work has been done to discover how the elements of the Xhosa language perform in a summarization context. As a significant aspect of the development of an automatic summarizer, this work focuses on determining how key information is distributed throughout a document. Thus, how Xhosa news items are composed in sections or paragraphs is of particular interest in this research.

1.5.1. Specific Objectives

To understand the field of automatic text summarization and to analyze the related research.

To investigate existing methods and algorithms used to develop extraction-based automatic summarization.

To develop a prototype Xhosa news summarizer that will serve as a model for a full scale/ fully operational Xhosa text summarizer.

To develop a test set to evaluate the system.

To draw conclusions from the results of the research.

To suggest future work on how the system could be used for other documents written in the Xhosa language.

1.6. Significance of the Study

This work designs and develops a complete Xhosa text summarizer. The summarizer will be useful in different regions of Southern Africa, especially to Xhosa-speaking people and to foreigners who want to learn isiXhosa as a language. This research is also significant in that it could initiate further studies in the field of automatic text summarization for Xhosa and other Southern African languages.

1.7. Research Methodology

Before initiating the study, a research methodology has to be chosen. This choice is influenced by reviewing existing literature in the field. The steps to be followed are:

1.8. Literature Review

To achieve the objectives detailed in Section 1.5, an extensive literature review on automatic text summarization was made. This review looked at relevant published documents, materials on the Internet, books, and journal articles, to get a deeper knowledge of the nature of the Xhosa language and of the structure of a document. Chapter two of this thesis contains the literature review.

1.9. Data Source Collection and Preparation

In order to create a corpus, data must be collected. The data used in this study was collected from different sources. This data should be clean and in a readable format. The raw data does not arrive clean, so the corpus has to be cleaned and non-text characters excluded, so that only the necessary letters remain in the documents. How the data was collected and prepared to form a Xhosa news corpus is explained in Section 5.5.1 of this work.
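
A cleaning step of this kind could look roughly like the sketch below. It is an illustration under stated assumptions and not the cleaning code used in this study: the function name clean_text and the set of characters that are kept (Latin letters, digits, whitespace and basic punctuation) are hypothetical choices made for the example.

```python
import re
import unicodedata

# Assumption: only Latin letters, digits, whitespace and basic punctuation
# are treated as "necessary" characters; everything else is dropped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")
_TAGS = re.compile(r"<[^>]+>")
_SPACES = re.compile(r"[ \t]+")


def clean_text(raw):
    """Return raw text with markup debris and non-text characters removed."""
    # Normalize compatibility forms such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Strip HTML-like tags left over from web pages.
    text = _TAGS.sub(" ", text)
    # Replace every disallowed character with a space.
    text = _DISALLOWED.sub(" ", text)
    # Collapse runs of spaces, trim each line, and drop empty lines.
    lines = [_SPACES.sub(" ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Which characters count as noise depends on the sources, so the allowed set would need to be adjusted to the websites actually used.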

1.9.1. Corpus Preparation

This study involves the preparation of a corpus (a collection of written or spoken material stored on a computer and used to find out how language is used) of texts that was used for analysis. These texts were collected from various websites that publish news in the Xhosa language. These sources are available online, and a complete explanation of how the corpus was created is provided in this study.

1.9.2. Manual Summary Preparation

A portion of the corpus was given to a linguist to create extracts manually. These texts were randomly selected from the corpus. Each text had to contain at least two to three paragraphs to be selected. The linguist used his expertise to extract the key sentences in each text. Section 5.5.2 explains in detail how the manual summary was created.

1.10. Summarization Method and Tools used in this Study

The method used in this study is extraction-based automatic summarization. Using the extractive approach, salient sentences are extracted from the document and displayed for the user. There is no need for summary regeneration (the rewriting of sentences to form the summary) when the extraction method is used. Sentences are weighted based on the cue phrases they contain, their location in the document, and whether they contain the most frequently used words (term frequency) in the text document. Then, using an effective combination of these extraction features, the sentences with the highest weights are selected to form a summary.
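
The weighting idea described above can be illustrated with a minimal sketch. It is not the IsiXhoSum implementation: the function names, the naive sentence splitter, the equal default feature weights (w_tf, w_pos, w_cue) and the linear combination are all assumptions made for the example, and the stop word and cue phrase lists are left for the caller to supply.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercased word tokens of a sentence."""
    return re.findall(r"[a-z']+", sentence.lower())


def score_sentences(text, stop_words=frozenset(), cue_phrases=(),
                    w_tf=1.0, w_pos=1.0, w_cue=1.0):
    """Return (index, sentence, score) tuples in document order."""
    sentences = split_sentences(text)
    tf = Counter(w for s in sentences for w in words(s) if w not in stop_words)
    max_tf = max(tf.values(), default=1)
    scored = []
    for i, sent in enumerate(sentences):
        content = [w for w in words(sent) if w not in stop_words]
        # Term frequency: mean normalized frequency of the content words.
        tf_score = sum(tf[w] / max_tf for w in content) / len(content) if content else 0.0
        # Position: earlier sentences receive a higher score.
        pos_score = 1.0 - i / len(sentences)
        # Cue phrases: 1.0 if any cue phrase occurs in the sentence.
        cue_score = 1.0 if any(c in sent.lower() for c in cue_phrases) else 0.0
        scored.append((i, sent, w_tf * tf_score + w_pos * pos_score + w_cue * cue_score))
    return scored


def summarize(text, n=2, **kwargs):
    """Keep the n highest-scoring sentences, restored to document order."""
    ranked = sorted(score_sentences(text, **kwargs), key=lambda t: t[2], reverse=True)
    return " ".join(sent for _, sent, _ in sorted(ranked[:n]))
```

Setting one of the weights to zero switches the corresponding feature off, which makes it easy to compare feature combinations on the same text.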

1.10.1. Development Tools

The Xhosa Text Summarizer is built using the Python programming language. There are prerequisites to observe; these include preparing the development environment and settling on an operating system to use. The subsections that follow explain the components that have been put together. The next subsection starts by explaining the Natural Language Toolkit (NLTK) in detail and how it is used in this work.

1.10.2. The Natural Language Toolkit (NLTK)

NLTK is a collection of various language-processing modules and has been developed as an open source library. It is intended to support research and teaching in Natural Language Processing and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning [44]. It offers convenient functions and wrappers, which serve as building blocks for common NLP tasks. The toolkit has built-in versions of raw and pre-processed corpora that are mentioned and used in a wide range of NLP literature and courses. This library was used in this study because it gives access to text analysis tools. The toolkit has a variety of built-in tokenizers and statistically based modules for text analysis. It also has a wide range of text collections in corpora for a variety of languages.

1.10.3. The NLTK Installation

The Natural Language Toolkit is a Python package and requires the following versions of Python: 2.6, 2.7, and 3.7 +. The package is distributed as a downloadable file or as an installer. For this project, the installation that was chosen is the nltk-3.0Awin32.exe (md5), installed on a 32-bit Windows operating system.

1.10.4. Installing the NLTK data

The NLTK data is a separate package that is installed independently of the installations outlined in 4.2.1.2. The NLTK data comes in the form of a folder containing subfolders for chunkers, corpora (raw and annotated), stop words, models, stemmers, and tokenizers. To download and install the entire package, first launch the Python interpreter and then type the following commands:

    >>> import nltk
    >>> nltk.download()

When the commands above have been typed and run, a window appears with the NLTK downloader. In this downloader, it is possible to select all the files or just a selected number of files. For this study, all the files were selected and installed on the machine. On this machine, the complete folder is located on the Local Disk (C/users/Zukile/AppData/Roaming/nltk_data). In the following subsections, some of the NLTK modules that were used are explained.

1.10.5. The NLTK Tokenization

Tokenization is the process of splitting a sentence into its constituent tokens. For segmented languages such as English, the existence of whitespace makes tokenization relatively easy [45]. However, for languages such as Chinese and Arabic, the task is more difficult since there are no explicit boundaries [45]. Xhosa is written in the Latin alphabet; it uses the same letters that English uses, and words are separated by whitespace. Tokenizing Xhosa text is therefore about as simple as tokenizing English text. The NLTK tokenization module was used for the tokenization of the Xhosa text in this study.
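
To illustrate why whitespace-delimited Latin-script text is straightforward to tokenize, the following sketch splits text into sentences and words using only regular expressions from the Python standard library. It is a stand-in for illustration and not the NLTK tokenizer used in this study: the function names and the sentence boundary rule (end punctuation followed by whitespace and an uppercase letter) are assumptions, abbreviations are not handled, and the sample sentences are illustrative rather than taken from the corpus.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace
# and an uppercase letter. Abbreviations are not handled in this sketch.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# A word is a run of letters, optionally joined by apostrophes or hyphens.
_WORD = re.compile(r"[A-Za-z]+(?:['\-][A-Za-z]+)*")


def split_into_sentences(text):
    """Split text into a list of sentences."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_into_words(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return _WORD.findall(sentence)


sample = "Abantwana bayadlala. Umfundi ufunda incwadi."
for sentence in split_into_sentences(sample):
    print(split_into_words(sentence))
```

A production tokenizer would also need a list of Xhosa abbreviations so that a full stop inside an abbreviation is not treated as a sentence boundary.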

1.10.6. The NLTK Corpora

As stated in [45], NLTK comes with several useful text corpora already loaded. Besides regular content words, there is another class of words called stop words, which perform important grammatical functions but are unlikely to be interesting by themselves, such as prepositions and complementizers. NLTK comes bundled with the Stop Words Corpus, a list of 2400 stop words across 11 different languages (including English). In this study, the stop word list has been modified to handle Xhosa stop words.
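
Filtering stop words from a token list can be sketched as follows. This is an assumption-laden illustration, not the code of this study: the four words in XHOSA_STOP_WORDS are a hypothetical sample of common Xhosa conjunctions chosen for the example, whereas the list actually used is the one given in Appendix C.

```python
# Hypothetical sample of Xhosa function words, for illustration only.
# The stop word list actually used in the study is given in Appendix C.
XHOSA_STOP_WORDS = frozenset({"kwaye", "kodwa", "okanye", "ukuba"})


def remove_stop_words(tokens, stop_words=XHOSA_STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["Kodwa", "abantwana", "bayadlala", "kwaye", "bayacula"]
print(remove_stop_words(tokens))
```

Matching on the lowercase form means that a stop word at the start of a sentence is removed as well.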

1.10.7. The NLTK Corpus Reader Class

NLTK's corpus reader classes are used to access the contents of a diverse set of corpora. Each corpus reader class is specialized to handle a specific corpus format[46]. Examples include the PlaintextCorpusReader. This class handles corpora that consist of a set of unannotated text files. In addition, the nltk.corpus package automatically creates a set of corpus reader instances used to access the corpora in the NLTK data. Each corpus uses a "corpus reader" object from nltk.corpus.

Each corpus reader provides a variety of methods to read data from the corpus, depending on the format of the corpus. For example, plaintext corpora support methods to read the corpus as raw text, a list of words, a list of sentences, or a list of paragraphs. In this study, the plain text corpus reader was used to create and read Xhosa text files.

1.10.8. Operating System

The operating system used is 32-bit Microsoft Windows 8. NLTK supports different operating systems, including Windows, Linux, and Macintosh. This work is based on the Natural Language Toolkit, which is written in Python. Python has various packages and relies on extension libraries; one of these libraries is described in Section 1.10.10.

1.10.9. The Python Programming Language

Python supports multiple programming paradigms, including object-oriented, imperative, functional and procedural styles. Python version 2.7 was selected and installed, as this release supports the NLTK library. The NLTK installation also recommends another library called Numpy, which is optional for NLTK but was installed together with Python in this study.

1.10.10. The Numpy Library

Numpy is an open source library that has a very large collection of modules. Numpy, or Numeric Python, is a Python extension that supports large multi-dimensional arrays and matrices [47]. Numpy is equipped with high-level mathematical functions to operate on these arrays. Numpy is not part of the Python standard library, so it must be installed separately to obtain this array functionality. The library builds on the original Numeric code base and combines it with features introduced by Numarray, together with further additions [48]. Once Python and Numpy are installed, the NLTK installation follows.

1.10.11. PyCharm Integrated Development Environment (IDE)

PyCharm is a smart code editor that provides first-class support for Python, JavaScript, CoffeeScript, TypeScript, CSS, and popular template languages. It offers language-aware code completion, error detection, and on-the-fly code fixes.

In this study, the PyCharm IDE was used to create the user interface during the development of the Xhosa Text Summarizer. It also gives access to extensive libraries for building the graphical user interface.

1.11. Evaluation Methods

The intrinsic method is an assessment that focuses on the summary itself; it attempts to measure its cohesion, coherence, and informativeness. Usually, a comparison is made with another summary of the same text. In this study, the evaluation of the summary is made both subjectively and objectively. This, however, is not the only evaluation method; some authors use subjective evaluation only.

In the subjective evaluation, the evaluators (native speakers) look closely at several aspects. They look at linguistic qualities, such as informativeness, and at how coherent the summary is. They also evaluate the relevance of each summary. Following pre-defined guidelines, the evaluators allocate a score on a predefined scale to each summary under evaluation. The scores are quantitative and are based on a range of qualitative features, such as content and fluency.

To evaluate the system objectively, a special tool called ROUGE2.0 was installed and configured for the requirements of Xhosa news summaries.
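
ROUGE-style measures compare a system summary with a reference summary by counting the n-grams they share. The sketch below shows the basic ROUGE-N computation (clipped n-gram overlap reported as recall, precision and F1). It is a simplified illustration and not the ROUGE2.0 tool: the function names are hypothetical, the inputs are assumed to be already tokenized, and options that real tools offer, such as stemming, stop word handling and multiple references, are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) of n-gram overlap between two token lists."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clipped matches: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


system = "the cat sat on the mat".split()
manual = "the cat lay on the mat".split()
print(rouge_n(system, manual, n=1))
print(rouge_n(system, manual, n=2))
```

Recall is the traditional ROUGE-N figure; precision and F1 are shown because they penalize overly long system summaries.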

1.12. Scope and Limitations of the Study

This study specifically focuses on the development of an automatic text summarizer for Xhosa news articles. The emphasis is on news articles only, and it is a single-document summarization project. The scope of this research is therefore limited to applying techniques that do not require sophisticated language-based rules, in order to find the most suitable factors for producing accurate summaries automatically.

1.13. Outline of the Dissertation

This dissertation comprises several chapters. Chapter one, the introductory chapter, gives the background, motivation, problem statement, objectives, significance of the study, methodology and evaluation methods. It also discusses the scope of the research.

CHAPTER TWO presents the literature review undertaken to discover the status of Automatic Text Summarization. It not only gives a profound technical background but also provides a concise overview of different summarization approaches and the resources used over the last decades to automatically summarize the text.

CHAPTER THREE introduces the Xhosa language and discusses its consonant and vowel inventory. It also covers the language's orthography, morpheme types and abbreviations, as well as the structure of news writing.

CHAPTER FOUR describes the methodology and discusses the tool, the Natural Language Toolkit (NLTK), used to design the summarizer. The modules of the toolkit that were considered are explained in detail in this chapter.

CHAPTER FIVE talks about the Implementation of the Xhosa Text Summarizer. It also shows the interface between the system and the modules used in the Xhosa Text Summarizer. This chapter finishes by explaining how the summarizer was tested.

CHAPTER SEVEN concludes the study. It starts by giving a research summary. This is followed by a presentation of the conclusions that were drawn from the research. Lastly, it presents ideas for the future work that needs to be carried out, following on from the current work, to produce a fully-fledged tool.

CHAPTER TWO

LITERATURE REVIEW

2.0 Introduction

Advancements in Information and Communication Technology (ICT) have resulted in an increase in the production, collection, organization, storage and dissemination of information, which in turn has resulted in so-called information overload.

Currently, the various technologies used to produce information make it possible for users to access information in multiple formats, from single or multiple sources, and in one or several languages. This means we need tools to cope with this information explosion. Summarization is an important tool that helps us keep up to date with what is happening in the world. Summarization primarily condenses textual information from one or more sources and presents it to the reader in a short format.

In general, summarization has many uses. The following are ways in which we use summarization in our everyday life, according to Pachantouris G et al. [3]:

- Headlines of the news
- Table of contents of a magazine
- Preview of a movie
- Abstract summary of a scientific paper
- Review of a book
- Highlights of a meeting

The remaining sections of this chapter describe the processes, basic concepts, types and methods of automatic text summarization.

2.1. Automatic Text Summarization

Dinegde, G. D. et al. [4] in 2014 defined Automatic Text Summarization (ATS) as the task of taking one or more text documents as input and reducing them to produce a condensed form (i.e. a summary) of the original text. The most important thing in summarization is to be able to find the most important sentences, and this involves knowing the semantics of the written or spoken document(s). In addition, writing a concise and fluent summary requires the capacity to reorganize, transform, and join information expressed in different sentences of the input.

A complete interpretation of the document(s), followed by the creation of abstracts, is usually a very difficult task even for people to perform, and it has remained a difficult task in the area of automatic text summarization. The main objective of automatic text summarization is to reduce a complex and lengthy text while preserving its relevance and content [5].

H. P. Luhn [5] states that although automatic text summarization is possible and viable, the focus on extractive methods draws researchers' attention to one significant question: How can a system determine the sentences that are most relevant in any text document?

In past years, the field of Natural Language Processing (NLP) has seen advances in the sophistication of the machine learning and language processing techniques that determine the significance of sentences. The authors in [6] state that determining which information must be included in the summary depends on a variety of factors, such as the genre and the nature of the source text.

2.2. Processes of Automatic Text Summarization

R. Boguraev [7] in 2009 described automatic text summarization as a process with three phases:

- Analysis of a text
- Transforming the analyzed text
- The synthesis of the output text

Analysis of the text involves the identification of the content to be able to make an internal representation. This may involve the implementation of statistical methods in order to extract the key content and even complex methods that involve deeper natural language processing methods. Statistical methods select salient terms of contextual sentences that make them and connect them to form a summary. Other methods require a complete understanding of the source so that at the end a summary is constructed.

Transformation is changing of text from extraction or abstraction. The transformation step is likely to make some cleaning and conforming to the incoming data to gain accurate data which is correct, complete, consistent, and unambiguous etc. For extraction summaries, the central topics identified in the previous step are forwarded to the next step for further processing. For abstract summaries however, a process of interpretation is performed. This process includes merging or fusing related topics into more general ones, removing redundancies, etc.

The last phase called the synthesis of the output text takes the summary representation and produces a suitable summary, which precisely corresponds to the requirements of the users. This final step in the process deals with the organization of the content. The following sub section reflects on summarization parameter.
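The three phases can be illustrated with a minimal Python sketch. It is not the summarizer developed in this study: the function names, the regular-expression sentence splitter, and the average word frequency score are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def analyze(text):
    """Analysis: split the text into sentences and count word frequencies."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"\w+", text.lower()))
    return sentences, word_counts


def transform(sentences, word_counts):
    """Transformation: score each sentence by the mean frequency of its words."""
    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        score = sum(word_counts[w] for w in words) / len(words) if words else 0.0
        scored.append((index, score))
    return scored


def synthesize(sentences, scored, max_sentences=2):
    """Synthesis: keep the top-scoring sentences in their original order."""
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_sentences]
    keep = sorted(index for index, _ in top)
    return " ".join(sentences[i] for i in keep)


def summarize(text, max_sentences=2):
    sentences, word_counts = analyze(text)
    return synthesize(sentences, transform(sentences, word_counts), max_sentences)
```

Calling `summarize(text, 1)` on a short text returns the single sentence whose words are, on average, the most frequent in that text.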

2.2.1. Summarization Parameters

The use of text summarizers varies depending on the user’s needs and on the application. There are therefore important factors to take into account when designing a text summarizer. Various types of summaries are defined based on different scenarios. These scenarios include:

- The nature of the input text to be summarized
- The purpose of the summary
- The output of the summary

Several recent studies [56], [59], [60] define two types of summary for this purpose: user-focused and generic. A user-focused summary is tailored to the requirements of a specific user or group of users. This means that the needs of users are considered when developing the summarizer. The user query and background knowledge of the subject are the most important factors for user-focused summaries. Generic summaries, by contrast, aim at a wide-ranging readership community.

Another important way to look at summaries is in terms of the difference between indicative and informative summaries. H. Borko and C. L. Bernier [8] in 1975 stated that, depending on the content of the text document to be summarized, a summary can be either informative or indicative. An informative summary is meant to represent (and often replace) the original document. Accordingly, it must contain all the relevant information needed to convey the structure of the original. An indicative summary suggests the contents of the article without giving details of its substance. It helps attract the user into reading the full document. User-focused summaries have gained wide popularity because of their ability to capture users’ requirements and interests.

Summarization systems can also be divided into single-document and multi-document systems. A single-document system takes only one document as input. A multi-document system takes more than one text document as input [9], [10], [11].

Text summaries can also be viewed as extracts or abstracts, a distinction described by Eduard Hovy et al. [12] in 1997. An extract is produced by selecting important sentences exactly as they appear in the original text and placing them in the summary. An abstract is produced by breaking the text down into a number of key ideas, merging certain ideas to obtain more general ones, and generating new sentences that differ from those in the original text(s). In abstraction, the focus is therefore more on semantic meaning and cohesion. The method produces sentences that do not appear in the source text, although they are composed from its existing content.

A typical example would be the sentence “She ate an orange, peach, and apple.” A more concise form of this sentence is “She ate fruit.” The abstract produces the more general concept ‘fruit’ by joining the topics orange, peach, and apple. Implementing abstractive methods requires symbolic world knowledge, which is very difficult to obtain on a large scale.

An extract is created by selecting certain sentences or phrases exactly as they appear in the original text to form a summary [2]. It is a collection of meaningful sentences from a document, reproduced verbatim [13].

Extraction methods have been the main focus in the area of automatic text summarization, but they raise issues of cohesion and balance [7]. Several challenges arise when creating an extractive summary:

- Selecting the important sentences from a long text.
- Creating a summary that is coherent.
- Avoiding redundancy of terms in the summary.

The extraction method remains useful and workable. However, extracting sentences from a text with the statistical keyword approach often causes problems of cohesion (explained later in the linguistic concepts section). To improve the quality of the resulting summary, several methods are usually combined. The following section explains such methods.

2.2.2. Methods of Summarization

One of the most important decisions in automatic text summarization is the choice of an appropriate method for creating a summary. Researchers have used numerous extraction features and weighting methods to try to create a fluent summary. Summarization systems use a number of methods as independent components. These include positional methods, cue word and phrase methods, query methods, and word and phrase frequency methods.

- Sentence Positional Methods

As mentioned before, the position of a title, sentence, or paragraph in a document carries a great deal of significance. This is especially true of newspapers, where the first sentence of the first paragraph usually carries the most significant meaning. That sentence has first priority when making a summary [19].
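A positional score can be sketched in a few lines of Python. The weighting used here, the reciprocal of the paragraph rank times the sentence rank, is an assumption made for illustration and is not taken from [19]; it only ensures that the first sentence of the first paragraph receives the highest score.

```python
def position_scores(paragraphs):
    """Score sentences by position.

    `paragraphs` is a list of paragraphs, each a list of sentences.
    Returns a dict mapping (paragraph_index, sentence_index) to a score.
    The 1 / (rank * rank) weighting is a hypothetical choice.
    """
    scores = {}
    for p_index, paragraph in enumerate(paragraphs):
        for s_index, _ in enumerate(paragraph):
            scores[(p_index, s_index)] = 1.0 / ((p_index + 1) * (s_index + 1))
    return scores
```

With two paragraphs of two sentences each, the opening sentence scores 1.0 and the last sentence scores 0.25.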

- Cue Word or Phrase Method

In certain genres, words such as “significant” or “conclusion” signal importance, so sentences containing them are prioritized for extraction. H. P. Edmundson [14] experimented with three types of cue words: 783 bonus words, which the author said positively affect the relevance of a sentence (e.g. “greatest,” “incidentally,” and “significant”); 73 stigma words, which negatively affect the relevance of a sentence (e.g. “hardly,” “impossible,” and “inadequate”); and 193 null words, which were considered irrelevant. Edmundson [14] then computed a cue weight for each sentence as the sum of the weights of the individual cue words it contains. The scholars in [15] adopted the same method and reported that it was their best method, based on the 64% joint precision and recall they obtained. They manually built a collection of cue phrases from scientific texts in a specific domain. They then rated each cue phrase for relevance to the text unit by allocating a so-called ‘goodness score’ ranging from one to three.
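The cue weight computation can be sketched as follows. The word sets contain only the example words quoted above, and the weights of +1 for a bonus word and -1 for a stigma word are assumptions; the actual word lists and weights in [14] are much larger and are not reproduced here.

```python
import re

# Tiny illustrative lists; the real lists in [14] hold hundreds of words.
BONUS_WORDS = {"greatest", "incidentally", "significant"}
STIGMA_WORDS = {"hardly", "impossible", "inadequate"}


def cue_weight(sentence, bonus=BONUS_WORDS, stigma=STIGMA_WORDS):
    """Sum the weights of the cue words in a sentence.

    Bonus words add 1, stigma words subtract 1, and all other words
    (including null words) contribute nothing. These weights are assumed.
    """
    weight = 0
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in bonus:
            weight += 1
        elif word in stigma:
            weight -= 1
    return weight
```

A sentence with two bonus words scores 2, and a sentence with one stigma word scores -1.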

- Query Method

R. M. Alguliev et al. [16] in 2007 stated that the query method is used in query-based text summarization systems. Given a text document, all the sentences are scored based on term frequency in the document. In addition, sentences that contain several query phrases get higher scores, while sentences with a single query phrase get lower scores. The sentences with the highest scores make it into the summary together with their structural context. These portions of text may be taken from various sections and subsections, so the summary is a collection of extracts. The number of sentences extracted depends largely on the summary requirements.
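A minimal sketch of query-based scoring is given below. It counts how many query terms each sentence contains and returns the best sentences in document order. The function names and the simple term count are assumptions for illustration and do not reproduce the scoring in [16].

```python
import re


def query_scores(sentences, query):
    """Count the query terms that occur in each sentence."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    scores = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        scores.append(sum(1 for w in words if w in query_terms))
    return scores


def top_sentences(sentences, query, n=1):
    """Return the n highest-scoring sentences, kept in document order."""
    scores = query_scores(sentences, query)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(ranked)]
```

A sentence that contains both query terms outranks a sentence that contains only one of them.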

- Word and Phrase Frequency Method

H. P. Luhn [5] uses Zipf’s law of word distribution.

Zipf’s law states that:

- Few words occur very often
- Fewer words occur somewhat often
- Many words occur infrequently

From this law, Luhn developed the following extraction criterion: if a text contains some words unusually frequently, then the sentences containing those words are probably important. The systems created by the scholars in [5], [14], [15] use various frequency measures and report performance of between 15 percent and 35 percent recall and precision.
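The frequency criterion can be sketched as follows. This is a simplified illustration and not the exact measure used in [5]: the stop word list, the `min_count` threshold, and the normalization by sentence length are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop word list, kept very short for the example.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def frequency_scores(sentences, min_count=2, stop_words=STOP_WORDS):
    """Score each sentence by its share of frequently occurring words."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words if w not in stop_words)
    # A word is treated as significant if it occurs at least min_count times.
    significant = {w for w, c in counts.items() if c >= min_count}
    return [
        sum(1 for w in words if w in significant) / len(words) if words else 0.0
        for words in tokenized
    ]
```

Sentences that contain none of the frequent words receive a score of zero and would be left out of the extract.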

- Title Method

This method is similar to the query method, except that the interest is only in the words that appear in titles and headings. In the work of H. P. Edmundson [14], it was combined with the word and phrase method. In his method, each title word is assigned the same score, and the score of a text unit is the sum of the scores of the title words it contains. According to the author in [15], the score is the mean frequency of title word occurrences in the sentences.

- Machine Learning Methods

With the rise of machine learning methods in the field of Natural Language Processing in the 1990s, researchers have published many papers that apply a wide range of statistical methods to create document extracts.

Many systems assume feature independence and rely on Naïve Bayes methods. Others have focused on the choice of relevant features, and some have focused on learning algorithms that relax the independence assumption.

Other researchers have also used models such as Hidden Markov Models, log-linear models, and neural networks to improve extractive summarization significantly.

- Naive Bayes Approach

Julian Kupiec et al. [17] described a method that is trained on, and learns from, data. This approach was derived from the work of H. P. Edmundson [14]. The authors in [17] used a classification function, a naïve Bayes classifier, to classify whether each sentence is worthy of inclusion in the summary.

Let:

- S1 be a particular sentence
- S be the set of sentences that make up the summary
- F1, F2, ..., Fk be the features

[Figure not included in this excerpt]

The following formula assumes independence of features:

[Figure not included in this excerpt]
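Since the two formulas are not reproduced in this excerpt, the standard form of the naïve Bayes classifier is given here for orientation, written with the symbols S1, S, and F1, ..., Fk defined above. It is a sketch based on the usual statement of Bayes' rule and is not a copy of the missing figures.

```latex
% Bayes' rule: probability that sentence S_1 belongs to the summary S,
% given the observed features F_1, ..., F_k
P(S_1 \in S \mid F_1, F_2, \ldots, F_k)
  = \frac{P(F_1, F_2, \ldots, F_k \mid S_1 \in S)\, P(S_1 \in S)}
         {P(F_1, F_2, \ldots, F_k)}

% With the assumption that the features are statistically independent
P(S_1 \in S \mid F_1, F_2, \ldots, F_k)
  = \frac{\prod_{j=1}^{k} P(F_j \mid S_1 \in S)\, P(S_1 \in S)}
         {\prod_{j=1}^{k} P(F_j)}
```

Each sentence is assigned this probability as a score, and the highest-scoring sentences are selected for the extract.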

- Sentence Position Method

The scholars in [18] discuss the significance of a single feature, sentence position. The method weighs a sentence by its position in the text, and the authors named it the “position method.” It arose from the idea that texts generally follow a predictable discourse structure, and that sentences of greater topic centrality tend to occur in certain specifiable locations (e.g. title, abstract). The authors argued that since discourse structure varies significantly across domains, the position method cannot be defined as naively as in [19].

- Hidden Markov Models (HMM)

In this approach, given a set of features, the model computes the posterior probability that each sentence is a summary sentence. The HMM makes fewer independence assumptions; in particular, it does not assume that the probability that sentence i is in the summary is independent of whether sentence i-1 is in the summary.

J. M. Conroy et al. used an HMM to develop their text summarizer. They considered five features in developing the HMM. The features used in the HMM (built into its state structure) are the position of the sentence in the document, the position of the sentence in the paragraph, the number of terms in the sentence, and the probability of the terms given the document.

- Neural Networks and Third Party features

The author in [52] used neural networks to train a system to learn the types of sentences that should be included in a summary. The network is trained with sentences from several test paragraphs, where a human reader has identified for each sentence whether it should be included in the summary or not.

- Combination of different techniques

Researchers often combine the methods outlined above. They have found that no single method scores text well enough to match the way humans create extracts.

Combining methods means integrating evidence from various sources, and combinations of methods seem to perform well. This implies that there is no single best approach that does well alone. Julian Kupiec et al. [17] developed a Bayesian classifier that estimates the probability that a sentence will be included in the final summary, given features such as paragraph position, cue phrase indicators, word frequency, upper-case words, and sentence length. Short sentences were generally excluded. Their experimental results were 33% for paragraph position, 29% for cue phrase indicators, and 42% when the methods were combined.

On the other hand, the authors in [18] compared eighteen combinations of features and obtained an optimal combination by means of a machine learning algorithm. The combined features are the same as those presented above, plus a few others indicating the presence of names, dates, quantities, pronouns, and quotes in the sentence. The best-performing method overall was the learned function. The query term method achieved the second-best score. The third-best score (up to 20 percent length) was attained by the word frequency method, the lead method, and the naïve combination function. The scholars in [18] further say that summaries should be no longer than 35 percent and no shorter than 15 percent of the source text.
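One simple way to combine several methods is a weighted sum of their per-sentence scores, followed by selection within a length budget. The sketch below assumes hand-set weights and measures summary length in sentences. Both are illustrative choices, whereas the systems in [17] and [18] learn their combination from data.

```python
def combine_scores(feature_scores, weights):
    """Weighted sum of per-sentence scores from several methods.

    feature_scores: {method_name: [score for each sentence]}
    weights:        {method_name: weight}; missing names count as 0.
    """
    sentence_count = len(next(iter(feature_scores.values())))
    combined = [0.0] * sentence_count
    for name, scores in feature_scores.items():
        for i, score in enumerate(scores):
            combined[i] += weights.get(name, 0.0) * score
    return combined


def select_by_ratio(sentences, combined, ratio=0.25):
    """Keep the best sentences up to `ratio` of the source, in original order."""
    if not 0.15 <= ratio <= 0.35:
        raise ValueError("ratio should lie between 0.15 and 0.35")
    budget = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:budget])]
```

With eight sentences and a ratio of 0.25, the two sentences with the highest combined scores are returned.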

2.3. Linguistic Concepts to Consider

When people write text, they write to convey a specific idea, concept, or event. A sensible text is not just a bag of sentences that bear no meaning. The way a text is created is not arbitrary: it has grammatical structure and meaning, and its parts are not all equally relevant. A text has a necessary component referred to as its semantic structure, and every text document naturally revolves around a specific idea.

In a good presentation, the major idea is divided into subsidiary concepts and ideas, which connect to form a broader picture. The topics should flow, moving in an orderly manner that leads the reader to the general idea in an easy and comprehensible way. A text therefore needs to show a high level of coherence to help the reader understand the general concept.

2.3.1. Coherence

In linguistics, coherence is the semantic integrity of a text and an essential component of a well-composed text. It gives the reader the feeling that the text is written in a logical manner. Coherence is the semantic structure of a text, and modeling it requires an interpretation of the text.

Coherence relations, also called semantic relations, can be used to create a model of a text. Common relations are elaboration, cause, support, exemplification, contrast, and result. Classifying the relations between sentences is a very complex process. Typically, coherence analysis produces trees whose nodes are text segments (paragraphs, sentences, and phrases) linked by these relations.

The coherence structure of a text can be represented well using discourse structure and rhetorical parsing. Daniel Marcu [21] presents an efficient summarization system that makes use of models of coherence. Marcu [21] uses cue phrases, which he calls discourse markers. The local discourse structures form a tree-like model that constitutes the global discourse structure of the text.

Coherence structure is a complex feature to deal with because it requires more knowledge than can be obtained from the text itself. Therefore, using both coherence and cohesion is crucial in understanding the dynamics of a text.

2.3.2. Cohesion

Cohesion is simpler than coherence and helps to establish the discourse structure of a text. It is considered a surface-level feature. Coherence deals with the entire semantic structure of the text, whereas cohesion deals only with relationships among peer units of the text. Given a text document, cohesion ascertains whether a unit of text is connected to other units in the same text. Ruqaiya Hasan et al. [22] state five types of cohesion relationship found in a text.

- Conjunction

The use of conjunctive structures such as ’and’ to present two pieces of information in a cohesive manner. For example, in the sentence ’I have a cat, and his name is Felix’, two facts are connected by the conjunction ’and’.

- Reference

The use of pronouns to refer to entities. For example, in ’Dr. Kenny lives in London. He is a doctor.’ the pronoun ’he’ in the second sentence refers to ’Dr. Kenny’ in the first sentence.

- Lexical Cohesion

The use of related words. For instance, in the sentence ’Prince is the succeeding leader of the kingdom.’ the word ’leader’ is a more general word for ’prince’.

- Substitution

Replacing a noun phrase with a substitute word. In the example ’As soon as John was given a cup of tea, Mary wanted one too.’ the word ’one’ stands for the phrase ’cup of tea’.

- Ellipsis

Referring to a noun without repeating it. For example, in ’Do you have a car? No, I don’t’, the word ’car’ is implied without being stated in the second sentence.

Of the above cohesion structures, lexical cohesion is the most definite and the easiest to identify.

2.3.3. Lexical Cohesion

A lexicon is defined as a structured knowledge base that holds semantic data about a set of words. Cohesion is based on the type of relation between units of a text document, where the units are words and phrases. Lexical cohesion arises when words or phrases in a text document are semantically related. Establishing lexical cohesion relies on ascertaining the semantic relationships between words or phrases.
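The idea of using a lexicon to detect lexical cohesion can be sketched as follows. The `LEXICON` dictionary is a hypothetical, hand-made mapping from words to more general concepts, built from the examples in this chapter. A real system would rely on a much larger lexical resource.

```python
import re

# Hypothetical toy lexicon: each word maps to a more general concept.
LEXICON = {
    "prince": "leader",
    "king": "leader",
    "orange": "fruit",
    "peach": "fruit",
    "apple": "fruit",
}


def concept(word, lexicon=LEXICON):
    """Map a word to its general concept, or to itself if it is not listed."""
    word = word.lower()
    return lexicon.get(word, word)


def cohesion_links(unit_a, unit_b, lexicon=LEXICON):
    """Return the concepts shared by two text units."""
    concepts_a = {concept(w, lexicon) for w in re.findall(r"\w+", unit_a)}
    concepts_b = {concept(w, lexicon) for w in re.findall(r"\w+", unit_b)}
    return concepts_a & concepts_b
```

Two units that share at least one concept can be treated as lexically linked, even when they share no surface word.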

2.4. News Writing Structure

News writing usually gives an account of what has been happening in one place or another. The news may also give information about many issues, including new projects, ongoing projects, initiatives, and discoveries. News writing aims to answer the basic questions about an event: who, what, when, where, and why, and often also how. These answers are placed at the beginning of the article. The way news is structured conveys tone and the relative importance of the content to its intended reader. The author in [23] states that the major concern is with the structure of vocabulary and sentences.

The authors also state that news stories contain at least one of the following characteristics relative to the intended audience: proximity, prominence, timeliness, human interest, oddity, or consequence. News stories typically follow an inverted pyramid structure, in which the amount of important information decreases in subsequent paragraphs [23]. Discussing these characteristics further is beyond the scope of this work.

Newspapers generally adhere to an expository writing style. Expository writing is a type of writing whose purpose is to explain, inform, or describe. As mentioned in [23], the purpose of expository writing is to explain and analyze information by presenting an idea, relevant evidence, and appropriate discussion.

Whatever structure a news item follows, the first sentence carries the most significant structural element of the story. The lead is usually the first sentence; in some instances, the first two sentences form the lead. Ideally, these sentences are 20 to 25 words in length [23].

2.5. Evaluation Methods used in Automatic Summarization

Although researchers have been attempting over the last several years to create summaries that could replace human-written ones, evaluation remains an unresolved problem [61]. It is hard to define what makes a summary better, even perceptually, although it is sometimes easier to state whether a summary is poor or good.

There are two types of summary evaluation: extrinsic and intrinsic. In an extrinsic evaluation, the quality of the summary is judged by how well it helps a person perform another task, such as information retrieval. In an intrinsic evaluation, humans judge the quality of the summarization directly by analyzing the automatically generated summary.

Comparing system output to some ideal summary was performed in the works of [14], [21], [17]. To simplify the evaluation of extracts, Daniel Marcu [21] independently developed an automated method to create extracts corresponding to abstracts (ideal summaries).

The other way to use the intrinsic method is to have evaluators rate the responsiveness and/or linguistic quality of the systems’ summaries using some scale (readability, grammar, informativeness, fluency, coverage, redundancy) [24].
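When both the system output and the ideal summary are extracts, the comparison is commonly expressed as precision and recall over the selected sentences. The sketch below assumes that each summary is given as a collection of sentence identifiers; the function name is hypothetical.

```python
def precision_recall(system_sentences, ideal_sentences):
    """Compare a system extract with an ideal extract.

    Precision: share of system sentences that are also in the ideal extract.
    Recall:    share of ideal sentences that the system also selected.
    """
    system, ideal = set(system_sentences), set(ideal_sentences)
    overlap = len(system & ideal)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(ideal) if ideal else 0.0
    return precision, recall
```

For example, a system that selects sentences 0, 1, and 2 when the ideal extract holds sentences 1, 2, 3, and 4 has a precision of two thirds and a recall of one half.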

2.6. Discussion of Related Works

The main objective of text summarization is to identify the most important sentences in a document and generate a concise form of it. The work presented by Dinegde, G. D. et al. [4] is a complete explanation of the abstractive summarization that was carried out to create automatic summaries. Abstractive summarization requires deep semantic understanding of the text, which means that deep linguistic features must be thoroughly studied. In extractive summarization, on the other hand, summaries are created by picking certain sentences or phrases exactly as they appear in the original text [2]. In this method, sentences are extracted verbatim [13] from a text document. Abstraction is still a difficult method to implement, and a lot of work has been done using extraction methods.

There are many summarization methods and systems available for languages such as English. Although some of them claim to be language-independent, they need at least language resources to work with. There is existing work on the isiXhosa language, and language-based resources such as stemmers [34] are available. With such resources, Xhosa can be extended to other Natural Language Processing technologies.

The sentence position method was also adopted in this study because of the way newspapers are usually written: the first sentence of the first paragraph carries significant meaning. It has first priority when making a summary [19].

2.7. Conclusion

In this chapter, we have looked at different approaches carried out by various researchers in the area of automatic text summarization. We have looked at major approaches to summarization, such as statistical approaches, as they are of high importance in this research. We also gave some insight into linguistic concepts that are imperative to understanding a whole text and its summary. We then closed the chapter by discussing evaluation methods, which are the methods used in this research.

CHAPTER THREE

THE XHOSA LANGUAGE

3.0 Introduction

Xhosa (called isiXhosa in the language) is a language spoken, for the most part, in the Eastern and Western Cape of the Republic of South Africa. It is one of South Africa's eleven official languages. The Census Department of South Africa [43] reports that around 18% of the nation's population speaks Xhosa. The language has a rich morphology. As indicated in a study [25], Xhosa is a southeastern Bantu language and part of the Nguni language family, which includes isiZulu, isiNdebele, and siSwati.

According to the work in [62], Xhosa, like Zulu, is one of the Nguni languages, a group that shares a significant degree of commonality. However, these languages differ in some respects, for example in phonology, morphology, vocabulary, and sentence structure [55]. For this reason, they are viewed as independent languages with individual characteristics and their own dictionaries and linguistic usage. Xhosa is one of the Bantu languages distinguished by click sounds, e.g. x, c, and q.

The authors in [26] state that tone is one of the notable elements of the language. They explain that the meaning of consonants and vowels depends on whether they are spoken with a rising or falling voice. The scholar in [25] states that the orthography of the language is Latin-based and uses a writing system created by Christian missionaries in the nineteenth century.

The first Xhosa-language paper is believed to have been published and distributed in 1834. The Xhosa language has several dialects: Ngqika, Gcaleka, Mfengu, Thembu, Bomvana, and Mpondomise. However, the author in [27] points out that other authors say that the Xhosa language is based on the Gcaleka, Ndlambe, and Gaika dialects.

Xhosa, like other Bantu languages, has borrowed words generously from Khoisan (languages of the Southern African aboriginal hunter-gatherer populations) and, in modern times, from English and Afrikaans [28]. Some scholars believe that the clicks in the Xhosa language are the result of close interaction between the Xhosa and Khoisan peoples.

Being one of South Africa’s eleven official languages, Xhosa is a medium of instruction in various areas of the country, from grade one up to senior levels.

According to Europa Publications [29], literary work in Xhosa, including prose and poetry, has been developed. The South African Broadcasting Corporation offers a domestic service in Xhosa on both radio (129 hours per week) and television (15 hours a week on TV2). There is also a radio programme, “Umhlobo Wenene FM”, which is broadcast at the national level. Broadcasts in Xhosa alone are concentrated in 27 community FM radio stations in the Eastern Cape. A number of publications and newspapers are published in Xhosa and English or other African languages.

The online presence of the language is increasing significantly; this includes newspapers, online dictionaries, and online courses. Religious documents (online Xhosa bibles), research articles, and journals are published and are available online in the language. Xhosa and other African vernacular languages are used in numerous organizations for legislative, judicial, and administrative purposes. This growth, therefore, suggests the need for a means to filter the most important content for interested users to read, hence the necessity of tools such as automatic text summarizers.

The next sections of this chapter discuss the Xhosa writing system and its alphabet, followed by punctuation marks and their usage, the morphology of Xhosa, and Xhosa word boundaries.

3.1. Xhosa Consonants and Vowels

The scholars in [30] state, “Every language of the world contains the two basic classes of speech sounds often referred to as consonants and vowels.” In written Xhosa, it is not easy to distinguish vowels from consonants, as the common word syllables in this language contain common consonants and vowels. The work in [28] states clearly that the phonology of Xhosa has a simple vowel inventory and a highly marked consonantal system, which contains ejectives, implosives, and clicks. The following subsections describe the vowel inventory and the consonantal system.

3.1.1. The Vowel System

Yule [31, p.40] states that “while the consonant sounds are mostly articulated via closure or obstruction in the vocal tract, vowel sounds are produced with the relatively free flow of air.” Finegan [52, p.89] agrees, stating that “Vowel sounds are produced by passing air through different shapes of the mouth, with the various positions of the tongue and of the lips, and with the air stream relatively unobstructed by narrow passages except at the glottis.”

[...]

Excerpt out of 115 pages

Details

Title
Development of an automatic news summarizer for isiXhosa language
Course
Computer Science
Grade
75
Author
Year
2017
Pages
115
Catalog Number
V442361
ISBN (eBook)
9783668861718
ISBN (Book)
9783668861725
Language
English
Notes
Appendix F is in Xhosa (South Africa).
Tags
IsiXhosa, Python, NLTK
Quote paper
Zukile Ndyalivana (Author), 2017, Development of an automatic news summarizer for isiXhosa language, Munich, GRIN Verlag, https://www.grin.com/document/442361
