Textual Classification for Sentiment Detection: Brand Reputation Analysis on the Web using Natural Language Processing and Machine Learning


Academic Paper, 2018
54 Pages


Abstract
Cloud computing makes it possible to build scalable machine learning systems for processing massive amounts of complex data, whether structured or unstructured, real-time or historical: the so-called Big Data. Several cloud computing platforms are publicly available, for instance Amazon EC2, EMR, and Google Compute Engine. More importantly, open-source APIs and libraries have been developed to ease programming on the cloud, for instance Cascading, Storm, Scalding, Apache Spark and Trackur. Meanwhile, computational intelligence approaches, examples of which include evolutionary computation, immune-inspired approaches, and swarm intelligence, are also employed to develop scalable machine learning and data analytics tools. In this project, we present the sentiment-focused web crawling problem and design a sentiment-focused web crawler framework for faster discovery and retrieval of sentimental content on the Web. We have developed a computational framework to perform automated reputation analysis on the Web using Natural Language Processing and Machine Learning. This paper introduces the framework and tests its performance on automated sentiment analysis for brand reputation. In addition, we propose different strategies for predicting the polarity scores of web pages. Experiments have shown that our proposed framework is more efficient than existing frameworks. Reputation analysis is a useful application for organizations seeking people's opinions about their products and services. Our approach consists of four parts. In the first part, the framework performs Web crawling based on the query specified by the user. In the second part, it locates relevant information within the textual data using Entity Recognition. In the third part, the relevant information is recorded in the database for feature extraction/engineering and classification. Lastly, the framework displays the data for reputation analysis. In the training phase, we used data provided by the marketing team of the University of the Witwatersrand, emoticons, a subset of the SentiStrength lexicon, and the ClueWeb09 dataset. Each domain was labelled (positive, negative, or neutral) with equal numbers of each polarity in plain text. In the test phase, the classifier predicted the polarity of real-time data. We used accuracy as the evaluation metric to measure how precisely the classifier performed, and additionally included negation detection to improve its accuracy. With this, we observed better results in both the training and test stages.
Keywords: Sentiment Detection, Reputation Analysis, Web Crawling.

Résumé

The Internet is growing at lightning speed, and the data stored on it is vast. The growth of the Internet has given rise to an enormous source of data. This data can contain sentimental information, notably on how people think about different issues. Nowadays, people's opinions play a leading role in industry. This is why companies, large and small, are studying automatic methods to retrieve the information they need from the large volumes of data stored on the Web. Automated corporate reputation analysis is an effective method for solving this kind of problem. It automatically determines how keywords, terms, or user-generated content can harm a brand name, a product, or a company mentioned in a text. Automated reputation analysis uses sentiment detection, which relies on advanced methods such as machine learning and natural language processing to capture polarity, which can be positive, negative, or neutral, from plain text. This research focuses on crawling the Web for automated reputation analysis of companies on the Internet. An automated reputation analysis is performed on the University of the Witwatersrand to study its popularity on the Internet. There is a wide range of fields for which information can be retrieved; this research studies sentiments about Wits from publicly available data. The research presents a new perspective on focused Web crawling. We have proposed a sentiment-focused Web crawling system to facilitate the rapid discovery of sentimental content. The proposed system can be applied generically to collect, process, and display the reputation of different brands/companies in real time. This study also describes tools that enable the development of technologies supporting textual processing to accelerate sentiment detection for company/brand reputation analysis. With this in mind, we propose an application simulating the proposed system, clearly defined to perform focused Web crawling.

Keywords: Sentiment Detection, Reputation Analysis, Web Crawling.

Acknowledgements
Firstly, I would like to thank my supervisor, Professor Turgay Celik, for his
advice and guidance throughout the research and writing process.

Contents

Abstract
Résumé
Acknowledgements
1 Introduction
1.1 Aims and Objectives of the Research
1.2 System Architecture
1.2.1 Schematic Description of the Architecture
1.3 Literature Review
1.4 Textual Data Retrieval
1.5 Sentiment Analysis, NLP and Machine Learning
1.5.1 N-gram
1.5.2 Bag-of-Words
1.5.3 Autoencoders
1.5.4 Learning Algorithms
1.5.5 Neural Networks (NNs)
1.5.6 Named Entity Recognition (NER)
1.6 Evaluating the System
1.6.1 Evaluating Coverage
1.6.2 Evaluating Accuracy
1.6.3 F-Measure
1.6.4 Accuracy
1.7 Content Extraction
1.7.1 Word Extraction
1.7.2 Training Phase
1.7.3 Emoticons
1.7.4 Our Training Approach
1.7.5 Training
1.7.6 Analysis
1.8 Graphical User Interface
1.9 Results
1.10 Testing
1.10.1 Completeness of ANN (Accuracy)
1.11 Empirical Testing of ANN
1.12 Discussion
1.13 Conclusion

List of Figures

1.1 The platform hierarchy.
1.2 Reputation Mining (Morinaga et al., 2002).
1.3 The architecture of the semantic content analysis framework (Musto et al., 2015).
1.4 COBRA architecture (Spangler et al., 2009).
1.5 Data collection and labelling.
1.6 N-gram model.
1.7 BoW model.
1.8 Autoencoder.
1.9 Supervised Classification.
1.10 Basic structure of Neural Networks.
1.11 Named Entity Recognition.
1.12 A text extraction sample by BoilerPipe.
1.13 The pre-processing step of a Web page using the Stanford POS tagger.
1.14 A sample of the Wits marketing dataset.
1.15 Emoticons and their variations (Read, 2005).
1.16 Graph of positive features (x axis) vs. the cost (y axis).
1.17 Graph of negative features (x axis) vs. the cost function (y axis).
1.18 Graph of cost vs. the number of iterations.
1.19 Home page of the web-based application, showing the initial screen where the user is invited to perform reputation analysis.
1.20 Log-in screen of the web-based application, shown once the user has chosen to perform reputation analysis.
1.21 Results for "Wits University".
1.22 Results for "Wits students protest".
1.23 Learning process of the ANN: features (x axis) and time (y axis).
1.24 Negative features and the bag-of-words performance.

List of Tables

1.1 Binary Confusion Matrix.
1.2 BoilerPipe extraction performance.
1.3 The words-list.
1.4 List of English Stop Words.
1.5 Evaluation metrics for ANN.
1.6 Confusion Matrix approximation.
1.7 The Bag-of-Words performance.
1.8 The meanings of colors for the GUI.
1.9 Predicting the Web page polarity.
1.10 Polarity prediction using SentiStrength.
1.11 The Bag-of-Words performance.
1.12 Confusion Matrix for ANN.
1.13 Approximation of the ANN Confusion Matrix. The empirical error approximation ranges from 0.0569% to 0.13%.
1.14 Performance of ANN for Testing and Training.

Nomenclature

(x^(i), y^(i))  The i-th training example
h_{W,b}(x)  Output of our hypothesis on input x, using parameters W and b (this should be a vector of the same dimension as the target value)
Learning rate (a real number)
The parameter of the encoder (recognition model)
The parameter of the decoder (generative model)
Specific hypothesis function parameter (a real number)
a_i^(l)  Activation/output of unit i in layer l of the network (a_i^(1) = x_i)
g(·)  Neural network activation function (returns a real number)
S  Sentimentality (s_i^+: sentences with positivity, s_i^-: sentences with negativity, s_i: sentences with neutrality)
SS_p  The average of the sentimentality scores of the sentences in page p (S_i is the set of sentences in page p, where SS_p = (1/i) × Σ_{S_i ∈ p} (s_i^+ - s_i^- - s_i/3))
W_ij^(l)  The parameter associated with the connection between unit j in layer l and unit i in layer l + 1
X  Input features of a training example, in R^n (see note 1)
y  Output/target values
Z_i^(l)  Total weighted sum of inputs to unit i in layer l (a_i^(l) = f(Z_i^(l)))

1. An upper case, boldface letter is a matrix; an upper case, light (non-boldface) letter is a set. A lower case, boldface letter is a vector.

Dedication

Dedicated to my supervisor, Prof. Turgay Celik, and to my father, Prof. Jean-Paul Nkongolo Mukendi.

Epigraph

The more you try, the luckier you get.

Cédric Villani

Chapter 1
Introduction
Knowing the reputation of your own products or your competitors' products is important for marketing and customer relationship management. Questionnaire surveys are conducted for this purpose, and open questions are generally used in the hope of gaining valuable information about corporate/brand reputations. However, it is very costly to gather and analyze the large volume of high-quality survey data necessary for meaningful brand reputation analysis. One approach that promises to reduce costs in this regard is to automatically extract opinions about specific products from the Internet. The purpose of this research is to provide a framework for automatically collecting and analyzing opinions for reputation analysis on the Internet.
Organizations and companies need to know the public's feelings and judgments about their products or services. To achieve this, they must conduct opinion polls or survey a target group. With the popularization of Internet usage, a significant repository of textual opinions and reviews has been created. The most popular sources are C-Net, IMDB, Amazon, Rotten Tomatoes, Twitter and Facebook (Zeinalipour-Yazti et al., 2004). Our research concentrates on brand reputation analysis and, as a case study, we focus on Wits.
The availability of these textual opinions has changed the information-gathering process. It is possible to read the opinions and experiences of hundreds of people about almost every existing product. Reading through all this information in order to reach a conclusion on whether a product or a service is good or bad is a time-consuming task. Moreover, drawing an inference (positive, negative or neutral) when there are conflicting opinions is very difficult. Reputation analysis is a powerful process that can automatically extract opinions and sentiments from online sources, and

classify them as positive, negative or neutral.
1.1 Aims and Objectives of the Research

The information gathered from such a brand reputation mining activity can be used by organizations to:

· compare their performance against competitors;
· assess specific marketing strategies;
· gauge how a particular product or service is received in the market.

The successful conduct of such brand reputation mining entails three broad challenges:

· the identification and collection of mentions in Web media;
· the application of data mining techniques to the gathered information in order to determine the sentiment associated with the opinions expressed, in the form of mentions, by individuals or groups;
· the display of the results.

This research was conducted for the following reasons:

· implementing a real-time reputation analysis platform;
· developing efficient algorithms to address the problem of reputation analysis;
· implementing solution techniques for studying the reputation of Wits University on the Internet;
· evaluating the proposed framework systematically.

1.2 System Architecture

The number of web pages and the amount of content available is enormous, because the size of the World Wide Web has grown tremendously over the past few years. To access this information, a system is needed to scan the available websites and pick out the sites relevant to the user's interests. The most common systems used for this purpose are crawlers. The proposed framework integrates a database and a crawler.
1.2.1 Schematic Description of the Architecture

Figure 1.1 (page 4) illustrates the automation of brand reputation analysis on the Internet. To create the dataset used for this research, text was taken from the different links provided by the crawler. The crawler then populated the database with relevant information. Prior to storing the textual data in the database, a named-entity-recognition step checked the validity of the data. This named entity recognition (corporate/brand name recognition) detected texts which include a corporate/brand name, using the keywords specified in the query. This phase draws on natural language processing and machine learning. The database contains four components separated into four tables (label, sentiments, crawler and raw text). We then extracted observations/sentences containing the keywords and classified each of them using a classifier. The classification of the extracted observations/sentences focused on their polarity (positive, negative, neutral). This was achieved through iterative feature extraction and feature engineering. To perform the sentiment detection, feature extraction used NLP techniques such as N-grams and bag-of-words (Tang et al., 2014).
Feature engineering utilized the autoencoder approach by reconstructing a model of the extracted observations/sentences based on the keywords. This model was then utilized by the classifier. Here, we considered the process of converting words to vectors. There are several methods for doing so; we considered a very simple implementation, proposed by Lebret and Collobert (2015), which uses autoencoders to jointly learn representations for words and phrases. An ANN and SentiStrength (a lexicon-based sentiment analysis library which, given a short piece of text written in English, generates a positive/negative and a neutral sentiment score for each word in the text) were used to classify the extracted

observations/sentences (Vural et al., 2013). Following this, the results of the classification step were displayed using a Web-based interface.
We opted for Artificial Neural Networks because they produce good results in complex domains and are suitable for both discrete and continuous data, performing especially well in the continuous domain (Ikonomakis et al., 2005). SentiStrength captures the inherent characteristics of the textual data well and minimizes the upper bound on the generalization error. Its ability to learn can be independent of the dimensionality of the feature space (global minima vs. local minima) (Ikonomakis et al., 2005).
Figure 1.1: The platform hierarchy.
This framework focuses mainly on the implementation of the various components needed to create the final robust and accurate reputation analysis system. These are the main tasks completed in this research:

· Dataset collection: collecting the required textual datasets using a crawler.
· Feature extraction/engineering: designing and extracting a novel, discriminative set of features from the input texts in the datasets. This is probably the most important step in the research, as it directly affects the eventual sentiment detection rates of the system. These features are gradient-based spatio-temporal features, as well as deep learning-based features.
· Classifier selection, design, and implementation: selecting, designing, and implementing an appropriate classifier for the system. The classifier was trained on the extracted features; the chosen classifier is an ANN.
· Query browser: the query browser takes the user's request/query. The crawler then uses this query to extract the textual data, utilizing the query's keywords as a reference while crawling the Internet.
· Reporting and Visualization Module: after classifying the textual data, this module concentrates mainly on visualizing the classified textual data. For each web page crawled, we assigned a classification specifying its polarity: positivity is marked in yellow, negativity in red, and neutrality in green.
· Data Labelling: at this level, the user is given the opportunity to read the content of a web page in order to test the accuracy of the classifier. The user can read the textual content of the page, determine its polarity, and compare it with the polarity given by the classifier.
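As a small illustration of the colour convention used by the Reporting and Visualization Module, the polarity-to-colour mapping can be sketched as follows (a hypothetical sketch; the names are illustrative and not taken from the actual implementation):

```python
# Hypothetical sketch of the visualization colour convention described above:
# each crawled page's predicted polarity maps to a display colour.
POLARITY_COLOURS = {
    "positive": "yellow",
    "negative": "red",
    "neutral": "green",
}

def colour_for(polarity: str) -> str:
    """Return the display colour for a predicted polarity label."""
    return POLARITY_COLOURS[polarity.lower()]
```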
1.3 Literature Review

Morinaga et al. (2002) implemented a system that performs reputation analysis on the Internet. The proposed system included one component for opinion extraction and another for text mining; the first part is essentially a question-answering system for an application-specific question, and the second part has four basic functions: extracting features from different words, extracting words simultaneously, extracting and analyzing sentences and, finally, correspondence analysis. To combine these two parts, the authors labeled the opinions, which allowed them to cast the task as a supervised learning problem. Morinaga et al. (2002) used real data to explicitly demonstrate that the proposed system may allow users to

capture crucial knowledge about the reputation of the products of interest and to effectively minimize the cost of collecting and analyzing opinions. Their system can also be applied to mining well beyond the field of industrial products, for example to events, individuals, governments, services, and companies.
The purpose of the research presented by Morinaga et al. (2002) was to determine the reputation of a company/brand by carefully studying online opinions. Since the texts are collected from the Internet, some information about a product will inevitably be irrelevant; however, opinions that describe an individual's experience of a product are exactly what is needed. The difference between the work of Morinaga et al. (2002) and the system presented in this research is the data mining approach. Morinaga et al. (2002) use sentiment analysis, but there is no topic discovery element as included in our framework (a classification approach for topic discovery has been implemented). The strategy of Morinaga et al. (2002) was to designate a text with label A as 1 and a text with any other label as 0, so that a set D of texts is denoted by a binary sequence. They denoted the subset of D consisting of texts that contain a word or phrase w as E(w), and the remaining sequence as D - E(w). Assume that I(E(w)) and I(D - E(w)) represent the information-theoretic complexity of E(w) and D - E(w), respectively. In general, for any binary sequence x, its complexity in information theory is denoted I(x), and one can compute it using stochastic complexity. Figure 1.2 (next page) shows the reputation and extraction system flow used by Morinaga et al. (2002).
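To make the complexity measure I(x) concrete, a common two-part MDL approximation of the stochastic complexity of a binary sequence under a Bernoulli model is n·H(p̂) + (1/2)·log2(n), where p̂ is the empirical frequency of ones. This is only an illustrative sketch under that assumption; Morinaga et al.'s exact estimator is not reproduced here:

```python
import math

def stochastic_complexity(bits):
    """Approximate the stochastic complexity I(x) of a binary sequence
    under a Bernoulli model: code length n*H(p_hat) for the data plus a
    (1/2)*log2(n) parameter cost (a standard two-part MDL approximation)."""
    n = len(bits)
    k = sum(bits)
    if k == 0 or k == n:
        # Degenerate sequences carry no data cost, only the parameter cost.
        return 0.5 * math.log2(n)
    p = k / n
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return n * entropy + 0.5 * math.log2(n)
```

Intuitively, a balanced sequence (maximal uncertainty) has a much higher complexity than a constant one of the same length.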
The system supports opinion extraction and the analysis of the extracted opinions. The user inputs a product name into the system, and the opinion-retrieval function uses a search engine to retrieve web pages that include this name. The system then extracts all phrases that express opinions about the product and feeds them into an opinion database (see Figure 1.2). The text mining component, a particularly crucial one, accepts as input an analysis condition specifying the target category and produces an output consisting of the mining results.

Figure 1.2: Reputation Mining (Morinaga et al., 2002).
Musto et al. (2015) created CrowdPulse, a domain-agnostic system for the textual analysis of social streams. The system performs social media analysis and makes use of algorithms for semantic processing. It was implemented to detect the most dangerous zones of Italian territory according to the content posted on social media, and was deployed to monitor the state of the city of L'Aquila after the terrible and shocking earthquake of April 2009. In this setting, the system demonstrated its effectiveness as a remarkably innovative use of such technology. CrowdPulse presents itself as a real-time framework for the semantic analysis of social streams. The platform takes an analytic approach: each analysis is performed by executing heuristic extraction and processing. In a typical case, a user interacts with the social network on which he wants to apply the heuristic, and then specifies the type of processing he would like to perform on the data. The platform aims to extract, analyze, aggregate and classify a huge amount of data/information, which is very important for users. It is therefore worth emphasizing that the system is totally independent of the domain to which it is applied.

The architecture of the system is shown in Figure 1.3, after which a small description of each part of the system is provided.

Figure 1.3: The architecture of the semantic content analysis framework (Musto et al., 2015).

· Social Extractor: populates a relational database with all the information, by referring to social network APIs. This relational database is updated in real time and is driven by a heuristic approach (for example, collecting all tweets containing a specific hashtag, posts/tweets from different places, and all crawled messages).
· Semantic Tagger: categorizes each piece of content. In this step, an algorithmic method such as Tag.me or DBpedia Spotlight is applied (Musto et al., 2015).
· Sentiment Analyzer: assigns a polarity to each piece of content. In this context, Musto et al. (2015) created a lexicon-based approach that associates each vocabulary entry with a polarity (positive, negative or neutral).
· Domain-specific processing: produces the results required for each specific scenario. It incorporates a variety of data mining and machine learning techniques.
· Analytics Console: produces the outputs, providing data visualization widgets.

Once the extraction processes have been triggered, all the content is processed semantically. This step is encapsulated and the output is finally stored locally. The information is aggregated and then presented to the user via an interactive interface updated in real time. The aggregation of the data and the type of widgets presented depend largely on the analysis and the results the user would like to obtain: in some cases it may be important to represent the sentiment of the population on a pie chart, or to track the growth of sentiment over a time window, while in other cases the user will ask to place all the geographic content on a map to analyze the propagation of certain subjects across several domains, and so on. The analysis possibilities that could come from it are nearly endless.
Similarly, Spangler et al. (2009) described an integrated brand and reputation analysis solution, called COBRA (Corporate Brand and Reputation Analysis), that mines CGM (Consumer Generated Media) content for insight. Spangler et al. (2009) implemented a platform that monitors and gives feedback on the reputation of a brand. The main purpose of the COBRA platform is to detect the different product categories, topics, issues and brands that need to be monitored. With this in mind, the platform relies heavily on keyword-based queries attached to a brand in order to extract sufficient data (see Figure 1.4). This method can be very greedy in its use of bandwidth and data storage. The difference with the system that we propose is that our primary objective remains the analysis of the reputation of companies/brands. As a result, the data we collect is considerably less ambiguous (non-significant data is ignored), and the data collection is optimized.

Figure 1.4: COBRA architecture (Spangler et al., 2009).
COBRA is an implementation of three systematic components, as can be seen in Figure 1.4. Typically, a generic ETL (Extract Transform Load) engine continuously ingests CGM content in structured or unstructured form and feeds an information warehouse, as shown on the left side of Figure 1.4. Subsequently, an analytics engine gives users the opportunity to model the analysis that will be used to semantically tag the brands, topics, and problems of the different contextual sources (top right in Figure 1.4). Finally, an alerting system examines the tagged data in order to generate brand image and reputation alerts (bottom right in Figure 1.4). COBRA embodies an array of analytical techniques to identify and monitor brand image and reputation alerts, identifying alerts through an approach that progressively filters the data in four stages:
· Extended keyword-based queries: this component retrieves information from identified data sources on the Internet and collects content containing brand name matches. For multiple brands, the user's query must contain the brand names and any possible variants. The main purpose is to collect enough information about the entities to be analyzed, including brands and companies; the subsequent analysis phases can then treat and filter the data. Such queries should avoid pulling in a whole range of insignificant information, which can affect the results of the analysis.
· Snippets: a textual collection method that analyzes content available on the Internet. The majority of content available on the Internet is often irrelevant, and the information may span different topics. COBRA therefore produces results based on the query stored in the relational database. To collect the textual content that mentions specific brands, Spangler et al. (2009) used the Java regular expression syntax. They also reduced the total amount of data users retrieve by focusing on the text segments relevant to the subject instead of entire documents.
· Analytical modeling: COBRA uses a variety of analytical tools to collect and identify the brands or problems/topics. Users begin by identifying the main models and the brand and company names, which they then report to COBRA. The basic models and filtering models are then constructed by making use of the users' domain knowledge, for instance knowledge about the candy industry and its brands, or using knowledge generated by the system through textual exploration.
· Orthogonal filtering: beyond the first three filtering steps (small queries, the generation of extracted content, and the conceptual analytic modeling of keywords), COBRA adds a unique filtering method, called orthogonal filtering, which identifies important alerts.
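Spangler et al. used Java regular expressions for the snippet stage; the same idea can be sketched in Python (the function name and the window parameter are illustrative, not from COBRA itself):

```python
import re

def brand_snippets(text, brand_names, window=60):
    """Collect short snippets of context around brand-name matches,
    in the spirit of COBRA's snippet stage: keep only the text segments
    near a brand mention instead of the entire document."""
    pattern = re.compile(
        "|".join(re.escape(name) for name in brand_names),
        re.IGNORECASE,
    )
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end])
    return snippets
```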
1.4 Textual Data Retrieval

We first defined an entry page for textual data retrieval. A web page typically contains the URLs of other web pages, so we retrieved these URLs from the current page and added all of them to the crawling queue. We then crawled another page and repeated the same process recursively. Essentially, the crawling scheme can be viewed as a depth-first or breadth-first traversal: as long as we can access the Internet and analyze a web page, we can crawl a website. Following this, we extracted the body content of each crawled web page and applied Entity Recognition to eliminate irrelevant information. Lastly, the relevant information was recorded in the database for later processing (see Figure 1.5).
Figure 1.5: Data collection and labelling.
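The crawling loop just described (queueing affiliated URLs and visiting them breadth-first) can be sketched as follows; `get_links` stands in for the HTTP fetch-and-parse step, and the function names are ours, not the framework's:

```python
from collections import deque

def crawl(seed_url, get_links, max_pages=100):
    """Breadth-first crawl starting from an entry page.

    `get_links(url)` is assumed to return the URLs found on a page;
    in the real framework it would fetch the page over HTTP.
    """
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):   # affiliated URLs of the current page
            if link not in visited:
                queue.append(link)
    return visited

# Toy link graph standing in for the live Web.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl("a", lambda u: graph.get(u, []))
```

Replacing the deque's `popleft` with `pop` would turn the same loop into a depth-first traversal.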
1.5
Sentiment Analysis, NLP and Machine
Learning
The sentiment analysis process classified the polarity of the text retrieved
at different levels - an attempt is made to determine whether the opinion
expressed in a text is positive, negative or neutral.
1.5.1
N-gram
In our research, feature extraction used the N-gram and bag-of-words techniques that are prominent in modern Natural Language Processing. An N-gram is simply a contiguous sequence of N items as they appear in a text (see Figure 1.6); N denotes the length of each sequence. Parsing must be applied in order to obtain the syntactic paths. Nowadays, analyzers are available for many languages. Unfortunately, there are no parsers for all languages; for English or Spanish, however, there is a plethora of parsers (Volcani and Fogel, 2006).
Figure 1.6: N-gram model.
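As a simple illustration of the model, the contiguous N-token sequences of a text can be generated as follows (illustrative code, not the framework's implementation):

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the pass rate is positive".split()
bigrams = ngrams(tokens, 2)   # N = 2: word pairs
```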
1.5.2
Bag-of-Words
The bag-of-words model is a representation used in Natural Language Processing. This model represents a text as a multiset of its words, disregarding grammar and word order while keeping multiplicity. The bag-of-words model can also be used in computer vision (Pang et al., 2002). It is usually combined with classification algorithms in which the frequency of each word is used as a feature for training a classifier (Deep Learning). In our research, the bag-of-words model was used as a tool for generating frequencies. After transforming the text into a bag of words, we performed various computations to measure the textual characteristics. One of the most used features was the word frequency, i.e. the number of times a term appears in the text (see Figure 1.7).

Figure 1.7: BoW model.
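The frequency-generating use of the model described above can be sketched in a few lines (illustrative only):

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its multiset of words, ignoring order and grammar."""
    return Counter(text.lower().split())

bow = bag_of_words("I really enjoyed this class and I liked this class")
```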
1.5.3
Autoencoders
In machine learning, documents are usually represented as bag-of-words
(BoW), which reduces a piece of text with arbitrary length to a fixed length
vector. Despite its simplicity, BoW remains the dominant representation in many applications including text classification (Koncz and Paralic, 2011).
There has also been a large body of work dedicated to learning useful repre-
sentations for textual data. By exploiting the co-occurrence pattern of words,
one can learn a low dimensional vector that forms a compact and meaningful
representation for a document.
The new representation is often found useful for subsequent tasks such as
topic visualization and information retrieval. Autoencoders have attracted a lot of attention in recent years as a building block of Deep Learning (Mescheder et al., 2017). In our framework, the autoencoder acts as the feature learning method by reconstructing inputs with respect to a given loss function. We implemented the autoencoder as a neural network whose hidden layer was taken as the learned feature. While it is often trivial to obtain good reconstructions with plain autoencoders, much effort has been devoted to regularization in order to prevent them from overfitting.

Figure 1.8: Autoencoder.
An autoencoder always consists of two parts (see Figure 1.8), the encoder φ and the decoder ψ, which can be defined as transitions such that:

φ : X → F,   ψ : F → X   (1.1)

In the simplest case, where there is one hidden layer, the encoder stage of an autoencoder takes the input of the generated features x ∈ R^d = X and maps it to z ∈ R^p = F, z = σ(W x + b), where z is usually referred to as the latent representation. Here, σ is an element-wise activation function such as a sigmoid function, W is a weight matrix and b is a bias vector. After that, the decoder stage of the autoencoder maps z to a reconstruction x′ of the same shape as x:

x′ = σ′(W′ z + b′)   (1.2)

where σ′, W′ and b′ for the decoder may differ in general from the corresponding σ, W and b for the encoder, depending on the design of the autoencoder. Autoencoders are unsupervised learning models, as discussed in the following subsection 1.5.4.
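A minimal sketch of one encode/decode pass under equations (1.1) and (1.2), with randomly initialized (untrained) parameters; the dimensions d and p are arbitrary here:

```python
import numpy as np

def sigmoid(s):
    """Element-wise activation function."""
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
d, p = 8, 3                                      # input (x in R^d) and latent (z in R^p) sizes
W, b = rng.normal(size=(p, d)), np.zeros(p)      # encoder parameters
W2, b2 = rng.normal(size=(d, p)), np.zeros(d)    # decoder parameters

x = rng.normal(size=d)
z = sigmoid(W @ x + b)            # encoder: latent representation (eq. 1.1)
x_hat = sigmoid(W2 @ z + b2)      # decoder: reconstruction (eq. 1.2)
loss = np.mean((x - x_hat) ** 2)  # reconstruction loss minimized during training
```

Training would adjust W, b, W2 and b2 by gradient descent on this loss; the hidden vector z is then taken as the learned feature.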
1.5.4
Learning Algorithms
In supervised learning, we have a set of examples that consists of input-output pairs. The desired predictor is a function that maps an input to a relevant output or label. This set of examples is divided into two distinct subsets, a training set and a test set (Pang et al., 2002). The training set is a collection of correctly labeled examples, while the test set contains unseen or new input data that is labeled by the predictor (see Figure 1.9). In unsupervised learning, by contrast, the predictor is a function that detects the patterns in the input data even though no explicit feedback is provided. There is no training set, and no labeled data are involved. The function groups all inputs into several sub-groups based on their common patterns; this task is called clustering (Pang et al., 2002). Semi-supervised learning is useful when there are a few labeled examples and many more unlabeled ones. The algorithm generates an appropriate predictor to label new data using knowledge of both the labeled data and the clustering function.
Figure 1.9: Supervised Classification.
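A toy illustration of the division into training and test subsets described above (the data and the 80/20 ratio are assumptions for illustration):

```python
def split_dataset(examples, train_fraction=0.8):
    """Divide labeled (input, label) pairs into a training set and a test set."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

# Toy labeled examples: 1 = positive, 0 = negative.
data = [("great service", 1), ("awful delays", 0), ("really enjoyed it", 1),
        ("boring class", 0), ("positive pass rate", 1)]
train, test = split_dataset(data)
```

In practice the examples would be shuffled before splitting so that both subsets follow the same distribution.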
1.5.5
Neural Networks (NNs)
Neural networks are generally regarded as a major segment of the supervised learning discipline. As with all types of classification in supervised learning, the goal of a neural network is to utilize labelled data to train the network so that it can classify any new instance of data as either expected or unexpected (Furnkranz et al., 1998). This labelled data consists of a vector X, corresponding to the various features of a particular object or environment and their values. A vector Y is also included, which, in a binary classification problem, classifies each observation as either expected or unexpected, based on the values of the vector X at that position (Furnkranz et al., 1998). An example of such a problem is text classification, where positive/negative and neutral observations are classified differently in the Y vector, with different combinations of features in the X vector. After learning a parameter vector W that multiplies with X to minimize the classification error on already-labelled data, the program can then create labels for new data, and thus determine whether new instances of textual data are positive/negative or neutral.

Figure 1.10: Basic structure of Neural Networks.
Features are input to the system as a layer of nodes (see Figure 1.10). In order to deal with non-linear relationships in the data, the values obtained in the input layer of nodes are then converted, using a set of parameters w_1, into a new set of features known as the hidden layer (Furnkranz et al., 1998). This process can be repeated for a number of hidden layers; however, networks with one or two hidden layers are most commonly used for their lower computation times. Finally, at the end of the hidden layers, the features are once again converted into a final output value, which classifies the observation as either positive/negative or neutral. On the first run-through of this algorithm, randomized parameters are used, which are unlikely to give accurate results (Furnkranz et al., 1998). Thus, through a process known as back-propagation, labelled training data can be used to iteratively update each set of parameters until the algorithm has a reliable set with which to make accurate classifications. To update the parameters, a cost function is developed, based on the mean error of the observations in relation to the current classification function (Furnkranz et al., 1998). Hence, the algorithm is progressively updated until it is deemed suitable to tackle any new data of the same nature. The value of the cost function eventually converges to a minimum, at which point the learning process is complete. In this research, the neural network algorithm is implemented by propagating the activation function g(S) values from the input layer to the output layer.
u_0^(t) = 1,  t ∈ [1, 2]   (1.3)

u_i^(1) = X^(i)   (1.4)

u_i^(t) = g(S_i^(t)),  t ∈ [2, 3]   (1.5)

where

S_i^(t) = Σ_{j ≤ n^(t-1)} u_j^(t-1) θ_{j,i}^(t-1)   (1.6)

and

g(S) = 1 / (1 + e^(-S))   (1.7)

where u_i^(t) represents the activation value of the i-th node of the t-th layer and g(S) represents the activation function. θ is a hypothesis-function parameter with a numerical value (θ ∈ R).
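Equations (1.3) to (1.7) can be sketched as a forward pass in which each layer prepends the bias unit u_0 = 1 and applies g to the weighted sums; the θ matrices below are randomly initialized for illustration:

```python
import numpy as np

def g(S):
    """Sigmoid activation g(S) = 1 / (1 + e^{-S})  (equation 1.7)."""
    return 1.0 / (1.0 + np.exp(-S))

def forward(x, thetas):
    """Propagate activations from the input layer to the output layer.

    Each theta has shape (n_prev + 1, n_next); the extra row corresponds
    to the bias unit u_0 = 1 prepended to every layer's activations.
    """
    u = np.asarray(x, dtype=float)        # u^(1) = X  (equation 1.4)
    for theta in thetas:
        u = np.concatenate(([1.0], u))    # u_0^(t) = 1  (equation 1.3)
        u = g(u @ theta)                  # u_i^(t) = g(S_i^(t))  (equations 1.5-1.6)
    return u

rng = np.random.default_rng(1)
# 3 input features -> 5 hidden nodes -> 1 output node.
thetas = [rng.normal(size=(4, 5)), rng.normal(size=(6, 1))]
y = forward([0.2, -0.1, 0.7], thetas)
```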
1.5.6
Named Entity Recognition (NER)
NER segments entities: it seeks to locate and categorize different entities such as names of people, organizations, locations, time expressions, quantities, monetary values, percentages, etc. (see Figure 1.11). In our research, Named Entity Recognition detected and classified company/brand names within textual content.
Figure 1.11: Named Entity Recognition.
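A minimal gazetteer-based sketch of the detection step: a dictionary of known brand names is matched against the text. This is an illustration only, not the NER model used by the framework, and the gazetteer entries are hypothetical:

```python
import re

# Hypothetical gazetteer of the brand/company names being tracked.
BRANDS = {"Wits University": "ORGANIZATION", "Johannesburg": "LOCATION"}

def find_entities(text):
    """Locate and categorize known entity mentions in a text."""
    found = []
    for name, label in BRANDS.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), name, label))
    return sorted(found)   # ordered by position in the text

hits = find_entities("Students at Wits University in Johannesburg protested.")
```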
1.6
Evaluating the System
The performance of the system relies on the sentiment analysis. The efficiency of sentiment analysis applications was calculated through experiments on test data. For binary classifiers there are different metrics to measure the performance, and the testing phase can be split into the following phases (Wang et al., 2012):
1.6.1
Evaluating Coverage
The coverage of the system refers to the percentage of the total observations that the system is able to make a classification for (Gonçalves et al., 2013). A high coverage value is desired, to make sure that a classification can be generated for every feature used by the system. The formula for calculating the precision, used here to gauge the reliability of those classifications, is shown below (Gonçalves et al., 2013):

Precision = TP / (TP + FP)   (1.8)

By using this metric, we are able to determine the reliability of the system in terms of making classifications.
1.6.2
Evaluating Accuracy
In order to evaluate accuracy, we used the metrics of precision and recall. We need a baseline to use as a reference (ground truth). As false/true negatives/positives relate to the predictions, we must first establish a set of data points whose classification is known ahead of time. The predicted outcomes are then compared to the known cases. A false positive is the case where the model predicts something that is known not to be so; a false negative is the case where the model predicts something not to be so when the known classification is that it is. Precision measures how many of the classifications made were correct. Recall compares the number of correct classifications made to the total number that could have been made based on the dataset. Both a high coverage and a high recall are desirable for the classification system. The formula for calculating the recall is shown below (Gonçalves et al., 2013):

Recall = TP / (TP + FN)   (1.9)

where "TP" stands for true positives, "FP" for false positives and "FN" for false negatives. A high recall means that the classifier can correctly predict most of the Web pages.

1.6.3
F-Measure
The F-measure, also called the harmonic mean of precision and recall or the balanced F-score, is calculated as follows (Hatzivassiloglou and McKeown, 1997):

F = 2 · (Precision · Recall) / (Precision + Recall)   (1.10)

In this formula, precision and recall are weighted evenly.
1.6.4
Accuracy
Another statistical metric of how well a binary classifier performs on test data is accuracy (Hatzivassiloglou and McKeown, 1997). To calculate the accuracy, both the true positive and true negative values among the total number of examined cases are considered. The accuracy formula is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1.11)

Given a classification algorithm f(x|θ), after training, the prediction for x is positive polarity if f(x|θ) ≥ τ, for a certain threshold τ. We take f(x|θ) ∈ [0, 1] to embody a probabilistic space: if x has a positive polarity, this implies that P̂(+|x) ≈ f(x|θ). We say that x has a negative polarity if f(x|θ) < τ, with P̂(−|x) ≈ 1 − f(x|θ). Then, based on the true label of x, we consider 4 cases in our confusion matrix and count the number of their occurrences: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). For neural networks, we focused mainly on the classification of positive and negative polarity using a binary classification; in this case, FN/FP represent misclassifications.
                      Positive (Actual)    Negative (Actual)
Positive (Predicted)  True Positive (TP)   False Positive (FP)
Negative (Predicted)  False Negative (FN)  True Negative (TN)

Table 1.1: Binary Confusion Matrix.
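Equations (1.8) to (1.11) can be checked directly against confusion-matrix counts; here we use the ANN training counts reported later in Table 1.5:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and accuracy from the confusion
    matrix counts (equations 1.8-1.11)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Counts from Table 1.5 (training phase of the ANN).
p, r, f, a = metrics(tp=22, fp=16, fn=4, tn=103)
```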

1.7
Content Extraction
In the cleanup step, we extracted the appropriate content; irrelevant information such as advertising and menu bars was excluded from the main content. To perform this extraction and filtering, we used another tool named BoilerPipe, which enabled the framework to retrieve the body content of web pages. It offers 5 options for textual collection, among which CanolaExtractor is considered the most efficient because it surpassed the extraction performance of the other 4 functions on a larger number of pages (see Table 1.2). The textual extraction of a Web page by CanolaExtractor from BoilerPipe is presented in Figure 1.12 (page 21).
BoilerPipe Option            Size of features extracted
KeepEverythingExtractor      600 kB
ArticleSentencesExtractor    650 kB
NumWordsRulesExtractor       567 kB
CanolaExtractor              1 MB
LargestContentExtractor      700 kB

Table 1.2: BoilerPipe extraction performance.
1.7.1
Word Extraction
We also incorporate the Stanford CoreNLP library, which is used extensively for textual segmentation of the extracted page content.² Stanford CoreNLP is a set of Natural Language Processing tools; this library can provide the base forms of words, part-of-speech tags and the structural tags of texts from raw English text input.

Our platform first sends the text to the Stanford POS tagger for pre-processing steps such as sentence segmentation, tokenization of all sentences and tagging of tokens (see Figure 1.13, page 22).

In order to extract the Web page words-list, our framework builds an array including all stemmed tokens (the words) along with their frequencies and part-of-speech tags (see Table 1.3, page 21).

After constructing the words-list, all words tagged as verbs and adjectives are sent to the classifier in order to extract their polarity scores. The total scores of all positive/negative and neutral words are calculated, and based on the higher score, our classifier predicts the Web page polarity. If

² NLP Stanford Core Library (version 1.2.0), http://nlp.stanford.edu/software/corenlp.shtml.

Figure 1.12: A text extraction sample by BoilerPipe.
Token    Word    Frequency   Tag
bought   buy     1           VBD
speedy   speedy  1           JJ
to       to      3           TO
's       is      5           VBZ
...      ...     ...         ...

Table 1.3: The words-list.
the classifier can correctly predict the polarity of the Web page, it will be added to the database. Our classifier goes through all real-time/testing data sets and predicts the polarity of the extracted sentences/observations.

The author bases his analysis on the contents of the following website: http://www.bbc.co.uk/programmes/profiles/N8TcrLGxrf6dYzLZP1zhQj/meet-the-candidates.
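The scoring step described above, summing lexicon scores over the verbs and adjectives of the words-list, can be sketched as follows; the lexicon values here are invented for illustration, not those produced by our trained classifier:

```python
# Hypothetical polarity scores for illustration; the framework queries a
# trained classifier for these values.
LEXICON = {"speedy": 0.6, "enjoyed": 0.8, "boring": -0.7, "fail": -0.9}

def page_polarity(words_list):
    """Predict a page's polarity from the scores of its verbs and adjectives.

    `words_list` holds (stem, frequency, pos_tag) rows as in Table 1.3.
    """
    score = sum(LEXICON.get(word, 0.0) * freq
                for word, freq, tag in words_list
                if tag.startswith(("VB", "JJ")))   # verbs and adjectives only
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

label = page_polarity([("speedy", 1, "JJ"), ("buy", 1, "VBD"), ("boring", 2, "JJ")])
```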

Figure 1.13: The pre-processing step of a Web page using Stanford POS
tagger.
One of the strong points of this method is its capacity for further expansion, by considering more out-of-domain data sets as well as adding additional sentiment lexicons.

Stop Words are words which do not carry enough significance to be used in Search Queries. In this research, these words are filtered out of search queries because they return a vast amount of unnecessary information. Table 1.4 below shows a list of the English stop words ignored.
a, about, above, across, after, afterwards
again, against, all, almost, alone, along
already, also, although, always, am, among
amongst, amoungst, amount, an, and, another
any, anyhow, anyone, anything, anyway, anywhere
...
Table 1.4: List of English Stop Words.
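The filtering step can be sketched as follows (only a small subset of the stop-word list is shown):

```python
# A small subset of the English stop words of Table 1.4.
STOP_WORDS = {"a", "about", "above", "across", "after", "all", "am", "an", "and"}

def remove_stop_words(query):
    """Drop words carrying no search significance from a query."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

terms = remove_stop_words("a protest about fees across all universities")
```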
1.7.2
Training Phase
In the training phase, the classifier takes a lexicon and a Web page words-list as input and computes the polarity of the Web page by querying the sentimental values of all of its adjectives and verbs.

Figure 1.14: A sample of the Wits marketing dataset.
1.7.3
Emoticons
To easily classify the polarity of a text/message, it is necessary to focus on the emoticons it contains. We can define emoticons as representations of happy or sad feelings. To specify the polarity of the emoticons, we considered an entire group of common emoticons. Emoticons have been used in combination with other techniques to build a set of learning data. Figure 1.15 (next page) shows a sample of the emoticons that we used to train the classification algorithm.
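A minimal sketch of emoticon-based polarity labelling, using a small subset of the emoticon group (the scores themselves are illustrative):

```python
# A small subset of the common emoticon group, mapped to polarity scores.
EMOTICONS = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}

def emoticon_polarity(text):
    """Sum the polarities of the emoticons occurring in a message."""
    return sum(score for emo, score in EMOTICONS.items() if emo in text)

score = emoticon_polarity("I really enjoyed this :) :D")
```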
1.7.4
Our Training Approach
The entire textual dataset was broken down into individual observations using Python code, with the most commonly appearing words being identified and recorded. The particular number of common words chosen can be increased or decreased depending on the size of the dataset, or if the results generated are not accurate enough. The logic behind this is that if an observation's polarity is to be predicted, it should contain at least some of the words commonly associated with the polarity of past observations examined. After building this list of observations, training and testing datasets were built from the obtained textual data. A multidimensional array was built for the dataset, utilizing the previously obtained common words as features,

Figure 1.15: Emoticons and their variations (Read, 2005).
and each record in the dataset as a separate row in the array. By reading the text file containing the observations, each observation was read in one by one, using the % tags to separate the observations. Each record was then compared, word by word, to the list of common words generated previously. If any of the words in the observation matched a common word, this was recorded in the multidimensional dataset array by incrementing the corresponding value in the array. Entire datasets for the textual data were read in using this method, with a corresponding targets array labelled "0" or "1", depending on which observation was currently being read. The finalized array was then split into a training and a testing dataset for the purposes of the neural network method. This particular system uses a 4-layer network in total (with 2 hidden layers), so three sets of random weights were first generated for the transitions between the layers. Forward propagation is first applied to the network, using the following activation function (Goodfellow et al., 2016):

h(x) = 1 / (1 + e^(−W^T x))   (1.12)

where W is the weight vector and x is the value of the input for that node. As a result, hypotheses are generated for each node in the hidden layers (to be used as features for the next layer) as well as a hypothesis for the output in the final layer. Backpropagation was then applied, in order to update the weights between the output layer and the second hidden layer, the second hidden layer and the first hidden layer, and the first hidden layer and the input layer,

respectively. This is done to improve hypothesis generation accuracy, as mentioned previously by Goodfellow et al. (2016). Following this, the parameter with which to split the data into positive/negative was obtained by testing the previously obtained factors on a combined portion of the dataset with positive/negative labels. This test was done repeatedly, each time updating the parameter value until the optimal separation was found.
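The row-construction step described above can be sketched as follows (the common-word list here is a toy example; the real features come from the collected dataset):

```python
def build_feature_row(observation, common_words):
    """Count matches between an observation and the common-word features,
    giving one row of the multidimensional dataset array."""
    tokens = observation.lower().split()
    return [tokens.count(word) for word in common_words]

# Illustrative feature words; in practice these are the most frequent
# words found across the whole textual dataset.
common_words = ["protest", "fees", "university", "enjoyed"]
row = build_feature_row("Students protest fees at the university protest",
                        common_words)
```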
1.7.5
Training
The implementation created for the intake and preprocessing of the dataset
was found to proceed quickly enough to get results in a reasonable amount of
time. This was based on the current amount of data collected for the tests.
Obviously, it is expected that the more data is added to the dataset, the
longer the code will take to generate results.
The ANN algorithm, when tested on the basis of true positives, true negatives, false positives, and false negatives, was found to produce approximately 11% false positives and 2.7% false negatives, as shown in Table 1.5 below. ANN classified each observation using a binary polarity with an empirical error of 13.79%.
                      Positive (Actual)   Negative (Actual)
Positive (Predicted)  22 = 15.172413 %    16 = 11.034482 %
Negative (Predicted)  4 = 2.7586206 %     103 = 71.034482 %

Table 1.5: Evaluation metrics for ANN.
In terms of approximations, the range of the ANN's confusion matrix is given in Table 1.6.

                      Positive (Actual)   Negative (Actual)
Positive (Predicted)  20 to 33            4 to 18
Negative (Predicted)  4 to 16             89 to 103

Table 1.6: Confusion Matrix approximation.

Empirical error = 13% to 17%

As we can see from the confusion matrix approximation, up to 33 positive observations are predicted as positive, with up to 103 negative observations classified as negative. In contrast, up to 34 observations were misclassified. ANN is an adaptive algorithm which changes its inner structure based on the information passing through it. Therefore, learning in an ANN means that a processing unit can update its inputs/outputs due to changes in the environment. For training, we used training samples with unique features (words, sentences); to perform testing we used testing samples with other unique features.
Figure 1.16: Graph of positive features (x axis) Vs the Cost (y axis).
Figure 1.17: Graph of Negative features (x axis) Vs the Cost function (y
axis).
From the graphs of the ANN shown in Figure 1.16 and Figure 1.17 (page 26), we note that there is a strong correlation between the volume of the dataset and the classifier prediction. This implies that for a big dataset with multiple observations, the algorithm will decrease the value of the cost function; however, if the volume of the dataset is reduced, the algorithm will increase the value of the cost function. The cost function of the ANN reached a minimum, showing that the algorithm was able to converge correctly. The graph in Figure 1.18 shows the value of the cost function decreasing with the number of iterations.

Figure 1.18: Graph of Cost vs. The Number of Iterations.
1.7.6
Analysis
From the results obtained from the ANN implementation, it can be seen that
most of the data was able to be classified correctly. This is likely due to the
textual polarity data not having too much overlap, resulting in the boundary
value being able to separate them properly. Due to the varied nature of the
feature data, with features potentially changing as the dataset gets larger, it
can be seen that this classification method could possibly have a reduction
in prediction accuracy as the sample size increases, due to more observations
potentially overlapping. However, a larger dataset could, instead, increase
hypothesis generation accuracy as the algorithm would have more data to
work with and the boundary between positive/negative could be better un-
derstood.
The quality of the data is thus important in this respect. In addition, for the neural networks implementation, it was found that the algorithm only misclassified a minor number of observations, with an empirical error of 13% to 17%, suggesting a correct classification rate of at least 83%. The vast majority of observations were thus classified correctly. The algorithm was able to handle the complex relationships between the various features, making use of the hidden layers to account for these relationships, and was largely successful in classifying each observation as either positive or negative. Once again, a larger dataset could potentially increase output accuracy for this algorithm. The results of the bag-of-words are shown in Table 1.7 (page 28), with the keyword Wits having the highest occurrence within the textual dataset.

Words        Occurrence
Wits         200
University   119
Students     85
Research     75
2016         68
School       58
South        55
Student      51
African      50
Academic     45

Table 1.7: The Bag-of-Words performance.
1.8
Graphical User Interface
The data visualization used in this work is discussed in this section. To visualize the data, we incorporated our framework into Trackur⁴ and created searches for the terms that a user wants to track, such as brand names and corporate terms. The Trackur API allows developers to store structured data in their databases. In this research, Trackur was used as a crawler, and it thus allowed the platform to utilize tens of millions of crawled textual data items. This data was extracted from social media and other sources of information. Our design presents the information in an understandable way, which makes it possible to understand the results produced by the system. We simply implement a principle of structural organization, defining categories of information by function or importance. Colors allowed us to classify the crawled web pages with respect to their polarity. Table 1.8 (page 29) is a

⁴ Trackur is a social media monitoring tool for individuals up through large companies and agencies; with this API, new items are found in almost real time.

list of the colors and how they are interpreted by the GUI framework:
Color    Meaning
Red      Negative polarity
Yellow   Positive polarity
Green    Neutral polarity

Table 1.8: The meanings of colors for the GUI.
"NA" is used to indicate that the textual contents of a certain Web page
are not available. "Date" indicates the date a website was written. "Sentiment" indicates the polarity prediction. "Source" indicates the source of the web page. A snippet is a small text segment around a specified keyword. Figure 1.21 displays the result for the query "Wits University"; similarly, Figure 1.22 displays the result for the query "Wits students protest".

Figure 1.19: Home-page of the web-based application. This figure shows the initial screen of the application. On this page, the user is prompted to perform reputation analysis.
The user name and password would have been given to the user by the system administrator. The backend MySQL database contains a table for all users that stores their user-name and password credentials. If both the user name and password match, the user is logged into the system; otherwise, the user is presented with an error message and asked to re-enter their credentials.

The author refers to the following website: http://reputationanalysis.wifeo.com/

Figure 1.20: Log-in screen of the web-based application. Once the user has accepted to perform reputation analysis, he will have access to the log-in screen.

Figure 1.21: Results for: "Wits University".

The polarity of the entire Web page was displayed in the final reputation analysis application.

Figure 1.22: Results for:"Wits students protest".
1.9
Results
In this section, we discuss the use of ANN and SentiStrength for sentiment detection in reputation analysis. Our study was aimed at investigating the use of textual data in Web mining for reputation analysis using the aforementioned classifiers. Since our proposed framework is supervised, given a Web page, the method first counts the number of positive, negative and neutral observations. If the number of neutral observations s_i is larger than both the number of positive observations s_i^+ and the number of negative observations s_i^-, the Web page p_j is considered neutral. If the number of positive observations is larger than the number of negative/neutral observations, the Web page is considered positive, otherwise negative (see Table 1.9).
Condition                          Prediction
(s_i > s_i^+) ∧ (s_i > s_i^-)      p_j ← s_i
(s_i^+ > s_i^-) ∧ (s_i^+ > s_i)    p_j ← s_i^+
(s_i^- > s_i^+) ∧ (s_i^- > s_i)    p_j ← s_i^-
(s_i = s_i^+) ∧ (s_i > s_i^-)      p_j ← s_i ∨ s_i^+
(s_i = s_i^-) ∧ (s_i > s_i^+)      p_j ← s_i ∨ s_i^-
(s_i^+ = s_i^-) ∧ (s_i^+ > s_i)    p_j ← s_i^+ ∨ s_i^-
(s_i^+ = s_i^-) ∧ (s_i^+ = s_i)    p_j ← s_i^+ ∨ s_i^- ∨ s_i

Table 1.9: Predicting the Web page polarity.
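The decision rules of Table 1.9 amount to taking the label (or labels, in case of a tie) with the largest observation count; a minimal sketch:

```python
def predict_page(pos, neg, neu):
    """Majority rule of Table 1.9: the page takes the polarity with the
    strictly largest observation count; ties yield every tied label."""
    best = max(pos, neg, neu)
    return [label for label, count in (("positive", pos),
                                       ("negative", neg),
                                       ("neutral", neu))
            if count == best]

label = predict_page(5, 2, 3)   # clear positive majority
tie = predict_page(2, 2, 1)     # positive/negative tie
```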
We also compare our results with the results obtained by Wang and Araki (2008). One of the more interesting services available on the crawler used is insights for data. The crawler/API provides a basic "search" feature but can also include some incredibly detailed filters; for instance, we can look at the data from various perspectives.
1.10
Testing
To evaluate our classifiers, we applied them to a real-time domain to predict the polarity of the crawled Web pages. For each crawled Web page, its contents are first extracted and stored, and the polarity is computed. Table 1.10 below shows the classification of sentences using SentiStrength.
Sentence                                      Highest polarity         Compound
Make sure you prepared:) or : you will fail   neg: 0.314               -0.3594
The pass rate is positive                     neu: 0.1                 0.00
I really enjoyed this class                   pos: 0.545               0.5563
I dislike this class, it is boring            neu: 0.505, neg: 0.495   -0.5994
I like you                                    neu: 0.286, pos: 0.714   0.3612

Table 1.10: Polarity prediction using SentiStrength.
Additionally, we show the keywords detected in terms of reputation prediction (next page, Table 1.11). Since the core keyword is "Student(s)", one can say that the textual dataset contained information about student protests in South Africa. Nevertheless, very different meanings can be drawn from the same results.
1.10.1
Completeness of ANN (Accuracy)
In this subsection we report the average Accuracy, for comparability with earlier results in text classification, and summarize the average for the learning algorithm. Table 1.12 shows the confusion matrix for the ANN, with an empirical error of 7.3%.

Words          Occurrence
Students       113
Student        93
University     77
Universities   72
Protest        67
CPUT           62
South          58
2016           55
FeesMustFall   47
Violence       46

Table 1.11: The Bag-of-Words performance.
                      Positive (Actual)   Negative (Actual)
Positive (Predicted)  89 = 72.3577235 %   2 = 1.62601626 %
Negative (Predicted)  7 = 5.6910569 %     25 = 20.3252032 %

Table 1.12: Confusion Matrix for ANN.
Table 1.12: Confusion Matrix for ANN.
1.11
Empirical Testing of ANN
The issue of conducting computational experiments has been addressed since the late 70's (Chen et al., 1999). Empirical testing of algorithms has been the focus of research in a variety of contexts. One of the major limitations of ANN is the learning process, which is relatively slow; the implementation took slightly longer to arrive at results because backpropagation has to be done repeatedly (see Figure 1.24). The framework presented by Wang and Araki (2008) achieved 70% for opinion-sentence classification. This result is lower than the one achieved by our framework, which attained 92% Accuracy, surpassing the accuracy presented by Wang and Araki (2008).
1.12
Discussion
Results from this experiment demonstrate the limiting factor of ANN, which suffered from significant slowdowns with larger datasets. An increasing amount of data to process, as well as additional features, exponentially increased the processing time required for the learning algorithm to finish. However, since the algorithm tries to find an exact fit for the textual data, the accuracy of the generated hypotheses is likely to be high. Thus, anyone considering this algorithm may have to weigh the tradeoff between processing time and accuracy and make a determination in that regard.

                      Positive (Actual)   Negative (Actual)
Positive (Predicted)  83 to 90            1 to 10
Negative (Predicted)  4 to 15             22 to 30

Table 1.13: Approximation of the ANN Confusion Matrix. The empirical error approximation ranges from 5.69% up to 13%.

ANN        TP     TN    FP   FN    Accuracy
Training   22     103   16   4     86.20%
Testing    89     25    2    7     92.68%
Average    55.5   64    9    5.5   89.44%

Table 1.14: Performance of ANN for Testing and Training.

Figure 1.23: Learning process of ANN. Features (x axis) and Time (y axis)
Figure 1.24: Negative features and the Bag-of-Words performance.

It can thus be concluded that using both the ANN and the SentiStrength
algorithm for reputation analysis is a viable proposition. The ANN learns a
set of parameters in order to compare the text in a given observation to the
obtained words, and thus decides whether to classify the observation as
positive or negative. The SentiStrength algorithm can effectively
determine the polarity of a sentence. The compound score is computed by
summing the valence scores of each word in the lexicon, and then normalized
to be between -1 (most extreme negative) and +1 (most extreme positive)
(Araujo et al., 2016). It is therefore clear that such fast and efficient
methods of classification can be implemented and incorporated into a
reputation analysis framework.
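The compound-score computation described above (summing per-word valence scores and squashing the sum into the [-1, +1] range) can be sketched as follows. The lexicon values, the example sentences, and the normalization constant alpha = 15 are illustrative assumptions (alpha = 15 is the convention popularized by the VADER tool), not values taken from this paper:

```python
import math

# Illustrative valence lexicon (assumed scores, not from the paper).
lexicon = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -3.4}

def compound_score(tokens, alpha=15.0):
    """Sum per-token valence scores, then squash the sum into [-1, +1]."""
    s = sum(lexicon.get(t, 0.0) for t in tokens)
    return s / math.sqrt(s * s + alpha)

print(compound_score("the service was good and great".split()))  # positive, ~+0.79
print(compound_score("terrible bad service".split()))            # negative
```

Words absent from the lexicon contribute nothing, and the square-root normalization guarantees the score stays strictly inside (-1, +1) no matter how many sentiment-bearing words the sentence contains.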
1.13
Conclusion
The Internet is growing at lightning speed and the data stored therein is vast.
The increasing growth of the Internet makes it an enormous source of data,
especially on how people feel about different issues. Nowadays, the opinions
of people play a crucial role in industry, so companies large and small are
studying automatic approaches to retrieve the information they need from
large volumes of data on the Internet. Reputation analysis is an effective
method to deal with this problem. Reputation analysis automatically deter-
mines how different keywords, terms, topics, or user-generated content may
harm a brand name, product, or company that is mentioned. Reputation
analysis utilizes sentiment detection, which involves advanced methods such
as machine learning and natural language processing to capture polarity
(positive, negative, or neutral), with or without its strength, from plain
text. This research focuses on Web mining for reputation analysis. A
reputation analysis is performed on the University of the Witwatersrand to
study its popularity on the Internet. There exists a wide range of fields for
which information can be retrieved. This research investigated sentiments
about Wits from publicly available data. The system can be used to retrieve,
process, and display the reputation of different brands and corporations. This
study also describes tools that enable the development of technologies that
support text processing to speed up sentiment detection in reputation analy-
sis. In this perspective, we offer an application of the proposed framework
to a clearly defined system such as focused web crawling. Our work differs
substantially from that of Wang and Araki (2008): they utilized unsupervised
techniques, whereas our framework applied supervised algorithms.

References
[Ackoff 1989] Russell L Ackoff. From data to wisdom. Journal of Applied
Systems Analysis, 16(1):3–9, 1989.
[Aggarwal and Zhai 2012] Charu C Aggarwal and ChengXiang Zhai. Mining
text data. Springer Science & Business Media, 2012.
[Araujo et al. 2016] Matheus Araujo, Julio Reis, Adriano Pereira, and Fabri-
cio Benevenuto. An evaluation of machine translation for multilingual
sentence-level sentiment analysis. In Proceedings of the 31st Annual ACM
Symposium on Applied Computing, pages 1140–1145. ACM, 2016.
[Asghar et al. 2014] Muhammad Zubair Asghar, Aurangzeb Khan, Shakeel
Ahmad, and Fazal Masud Kundi. A review of feature extraction in senti-
ment analysis. Journal of Basic and Applied Scientific Research, 4(3):181–
186, 2014.
[Blanco and Moldovan 2011] Eduardo Blanco and Dan I Moldovan. Some
issues on detecting negation from text. In FLAIRS Conference, pages
228–233, 2011.
[Chang and Lin 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a li-
brary for support vector machines. ACM Transactions on Intelligent Sys-
tems and Technology (TIST), 2(3):27, 2011.
[Chen et al. 1999] Chun-Hung Chen, S David Wu, and Liyi Dai. Ordinal
comparison of heuristic algorithms using stochastic optimization. IEEE
Transactions on Robotics and Automation, 15(1):44–56, 1999.
[Cooley et al. 1997] Robert Cooley, Bamshad Mobasher, and Jaideep Srivas-
tava. Web mining: Information and pattern discovery on the world wide
web. In Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE
International Conference on, pages 558–567. IEEE, 1997.
[Dadvar et al. 2011] Maral Dadvar, Claudia Hauff, and Franciska MG
de Jong. Scope of negation detection in sentiment analysis. 2011.

[Drake 2003] Miriam Drake. Encyclopedia of library and information science,
volume 1. CRC Press, 2003.
[Durant and Smith 2006] Kathleen T Durant and Michael D Smith. Mining
sentiment classification from political web logs. In Proceedings of Workshop
on Web Mining and Web Usage Analysis of the 12th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining (WebKDD-
2006), Philadelphia, PA, 2006.
[Frakes and Baeza-Yates 1992] William B Frakes and Ricardo Baeza-Yates.
Information retrieval: data structures and algorithms. 1992.
[Freitag 1998] Dayne Freitag. Information extraction from HTML: Application
of a general machine learning approach. In AAAI/IAAI, pages 517–523,
1998.
[Fu et al. 2012] Tianjun Fu, Ahmed Abbasi, Daniel Zeng, and Hsinchun
Chen. Sentimental spidering: leveraging opinion information in focused
crawlers. ACM Transactions on Information Systems (TOIS), 30(4):24,
2012.
[Furnkranz et al. 1998] Johannes Furnkranz, Tom Mitchell, Ellen Riloff,
et al. A case study in using linguistic phrases for text categorization on
the WWW. In Working Notes of the AAAI/ICML Workshop on Learning
for Text Categorization, pages 5–12, 1998.
[Godsay 2015] Manasee Godsay. The process of sentiment analysis: a study.
International Journal of Computer Applications, 126(7), 2015.
[Gonçalves et al. 2013] Pollyanna Gonçalves, Matheus Araújo, Fabrício Ben-
evenuto, and Meeyoung Cha. Comparing and combining sentiment analy-
sis methods. In Proceedings of the First ACM Conference on Online Social
Networks, pages 27–38. ACM, 2013.
[Goodfellow et al. 2016] Ian Goodfellow, Yoshua Bengio, and Aaron
Courville. Deep learning. Book in preparation for MIT Press. URL:
http://www.deeplearningbook.org, 2016.
[Hatzivassiloglou and McKeown 1997] Vasileios Hatzivassiloglou and Kath-
leen R McKeown. Predicting the semantic orientation of adjectives. In
Proceedings of the Eighth Conference on European Chapter of the Associa-
tion for Computational Linguistics, pages 174–181. Association for Com-
putational Linguistics, 1997.

[Ikonomakis et al. 2005] M Ikonomakis, Sotiris Kotsiantis, and V Tampakas.
Text classification using machine learning techniques. WSEAS Transac-
tions on Computers, 4(8):966–974, 2005.
[Kantor 1994] Paul B Kantor. Information retrieval techniques. Annual Re-
view of Information Science and Technology, 29:53–90, 1994.
[Keim 2002] Daniel A Keim. Information visualization and visual data min-
ing. IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8,
2002.
[Kennedy and Inkpen 2006] Alistair Kennedy and Diana Inkpen. Sentiment
classification of movie reviews using contextual valence shifters. Compu-
tational Intelligence, 22(2):110–125, 2006.
[Koncz and Paralic 2011] Peter Koncz and Jan Paralic. An approach to fea-
ture selection for sentiment analysis. In Intelligent Engineering Systems
(INES), 2011 15th IEEE International Conference on, pages 357–362.
IEEE, 2011.
[Konstantinova et al. 2011] Natalia Konstantinova, Sheila CM De Sousa,
and JA Sheila. Annotating negation and speculation: the case of the
review domain. In RANLP Student Research Workshop, pages 139–144,
2011.
[Koppel and Schler 2006] Moshe Koppel and Jonathan Schler. The impor-
tance of neutral examples for learning sentiment. Computational Intelli-
gence, 22(2):100–109, 2006.
[Kosala and Blockeel 2000] Raymond Kosala and Hendrik Blockeel. Web
mining research: A survey. ACM SIGKDD Explorations Newsletter, 2(1):1–
15, 2000.
[Kucuktunc et al. 2012] Onur Kucuktunc, B Barla Cambazoglu, Ingmar We-
ber, and Hakan Ferhatosmanoglu. A large-scale sentiment analysis for Ya-
hoo! Answers. In Proceedings of the Fifth ACM International Conference on
Web Search and Data Mining, pages 633–642. ACM, 2012.
[Kumar et al. 1999] Ravi Kumar, Prabhakar Raghavan, Sridhar Ra-
jagopalan, and Andrew Tomkins. Extracting large-scale knowledge bases
from the web. In VLDB, volume 99, pages 639–650, 1999.
[Larsen and Aone 1999] Bjornar Larsen and Chinatsu Aone. Fast and effec-
tive text mining using linear-time document clustering. In Proceedings of
the Fifth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 16–22. ACM, 1999.
[Lebret and Collobert 2015] Rémi Lebret and Ronan Collobert. "The sum
of its parts": Joint learning of word and phrase representations with au-
toencoders. arXiv preprint arXiv:1506.05703, 2015.
[Martin et al. 2006] Olivier Martin, Irene Kotsia, Benoit Macq, and Ioannis
Pitas. The eNTERFACE'05 audio-visual emotion database. In Data Engi-
neering Workshops, 2006. Proceedings. 22nd International Conference on,
pages 8–8. IEEE, 2006.
[Mescheder et al. 2017] Lars Mescheder, Sebastian Nowozin, and Andreas
Geiger. Adversarial variational Bayes: Unifying variational autoencoders
and generative adversarial networks. arXiv preprint arXiv:1701.04722,
2017.
[Morinaga et al. 2002] Satoshi Morinaga, Kenji Yamanishi, Kenji Tateishi,
and Toshikazu Fukushima. Mining product reputations on the web. In Pro-
ceedings of the Eighth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 341–349. ACM, 2002.
[Mullen and Collier 2004] Tony Mullen and Nigel Collier. Sentiment anal-
ysis using support vector machines with diverse information sources. In
EMNLP, volume 4, pages 412–418, 2004.
[Musto et al. 2015] Cataldo Musto, Giovanni Semeraro, Pasquale Lops, and
Marco de Gemmis. CrowdPulse: A framework for real-time semantic anal-
ysis of social streams. Information Systems, 54:127–146, 2015.
[Nasukawa and Yi 2003] Tetsuya Nasukawa and Jeonghee Yi. Sentiment
analysis: Capturing favorability using natural language processing. In Pro-
ceedings of the 2nd International Conference on Knowledge Capture, pages
70–77. ACM, 2003.
[Nicola 2013] Raluca Georgeta Nicola. Categorization and visualization of
Twitter data. PhD thesis, Technical University of Dresden, 2013.
[Nkongolo 2017] Mike Nkongolo. A Web-Based Prototype Course Recom-
mender System using Apache Mahout. GRIN Verlag, 2017.
[O'Leary 2013] Daniel E O'Leary. Artificial intelligence and big data. IEEE
Intelligent Systems, 28(2):96–99, 2013.

[Pang et al. 2002] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
Thumbs up?: sentiment classification using machine learning techniques.
In Proceedings of the ACL-02 Conference on Empirical Methods in Natural
Language Processing, Volume 10, pages 79–86. Association for Computa-
tional Linguistics, 2002.
[Pyle 1999] Dorian Pyle. Data preparation for data mining, volume 1. Mor-
gan Kaufmann, 1999.
[Read 2005] Jonathon Read. Using emoticons to reduce dependency in ma-
chine learning techniques for sentiment classification. In Proceedings of the
ACL Student Research Workshop, pages 43–48. Association for Computa-
tional Linguistics, 2005.
[Routray et al. 2013] Preeti Routray, Chinmaya Kumar Swain, and
Smita Praya Mishra. A survey on sentiment analysis. International Jour-
nal of Computer Applications, 76(10), 2013.
[Russell et al. 1995] Stuart Russell and Peter Norvig. Artificial Intelligence:
A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
[Saif et al. 2012] Hassan Saif, Yulan He, and Harith Alani. Semantic senti-
ment analysis of Twitter. The Semantic Web – ISWC 2012, pages 508–524,
2012.
[Scott and Matwin 1999] Sam Scott and Stan Matwin. Feature engineering
for text classification. In ICML, volume 99, pages 379–388, 1999.
[Sebastiani 2002] Fabrizio Sebastiani. Machine learning in automated text
categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
[Spangler et al. 2009] Scott Spangler, Ying Chen, Larry Proctor, Ana
Lelescu, Amit Behal, Bin He, Thomas D Griffin, Anna Liu, Brad Wade,
and Trevor Davis. COBRA: mining web for corporate brand and reputation
analysis. Web Intelligence and Agent Systems: An International Journal,
7(3):243–254, 2009.
[Stuart and Majewski 2015] Keith Douglas Stuart and Maciej Majewski. In-
telligent opinion mining and sentiment analysis using artificial neural net-
works. In International Conference on Neural Information Processing,
pages 103–110. Springer, 2015.

[Taboada et al. 2008] Maite Taboada, Kimberly Voll, and Julian Brooke.
Extracting sentiment as a function of discourse structure and topicality.
Simon Fraser University School of Computing Science Technical Report,
2008.
[Tang et al. 2014] Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming
Zhou. Coooolll: A deep learning system for Twitter sentiment classification.
In Proceedings of the 8th International Workshop on Semantic Evaluation
(SemEval 2014), pages 208–212, 2014.
[Thelwall et al. 2012] Mike Thelwall, Kevan Buckley, and Georgios Pal-
toglou. Sentiment strength detection for the social web. Journal of the
Association for Information Science and Technology, 63(1):163–173, 2012.
[Turney 2002] Peter D Turney. Thumbs up or thumbs down?: semantic
orientation applied to unsupervised classification of reviews. In Proceedings
of the 40th Annual Meeting on Association for Computational Linguistics,
pages 417–424. Association for Computational Linguistics, 2002.
[Unwin 2000] Antony Unwin. Visualisation for data mining. In International
Conference on Data Mining, Visualization and Statistical System, Seoul,
Korea, 2000.
[Volcani and Fogel 2006] Yanon Volcani and David Fogel. System and
method for determining and controlling the impact of text, November 14
2006. US Patent 7,136,877.
[Vural et al. 2013] A Gural Vural, B Barla Cambazoglu, Pinar Senkul, and
Z Ozge Tokgoz. A framework for sentiment analysis in Turkish: Applica-
tion to polarity detection of movie reviews in Turkish. In Computer and
Information Sciences III, pages 437–445. Springer, 2013.
[Wallace et al. 2012] Byron C Wallace, Issa J Dahabreh, Thomas A Trikali-
nos, Joseph Lau, Paul Trow, Christopher H Schmid, et al. Closing the gap
between methodologists and end-users: R as a computational back-end.
Journal of Statistical Software, 49(5):1–15, 2012.
[Wang and Araki 2008] Guangwei Wang and Kenji Araki. A graphic rep-
utation analysis system for mining Japanese weblog based on both un-
structured and structured information. In Advanced Information Network-
ing and Applications – Workshops, 2008. AINAW 2008. 22nd International
Conference on, pages 1240–1245. IEEE, 2008.

[Wang et al. 2012] Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar,
and Shrikanth Narayanan. A system for real-time Twitter sentiment anal-
ysis of 2012 US presidential election cycle. In Proceedings of the ACL 2012
System Demonstrations, pages 115–120. Association for Computational
Linguistics, 2012.
[Zeinalipour-Yazti et al. 2004] Demetrios Zeinalipour-Yazti, Vana Kaloger-
aki, and Dimitrios Gunopulos. Information retrieval techniques for peer-
to-peer networks. Computing in Science & Engineering, 6(4):20–26, 2004.
[Zhang et al. 2003] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data
preparation for data mining. Applied Artificial Intelligence, 17(5-6):375–
381, 2003.
[Zhang et al. 2011] Shu Zhang, Wenjie Jia, Yingju Xia, Yao Meng, and Hao
Yu. Product features extraction and categorization in Chinese reviews.
In The Sixth International Multi-Conference on Computing in the Global
Information Technology, ICCGI, 2011.
Mike Nkongolo (2018). Textual Classification for Sentiment Detection. Brand Reputation Analysis on the Web using Natural Language Processing and Machine Learning. Munich: GRIN Verlag. ISBN 9783668701687. https://www.grin.com/document/419732