Cloud computing makes it possible to build scalable machine learning systems for processing massive amounts of complex data, whether structured or unstructured, real-time or historical: the so-called Big Data. Public cloud computing platforms are now widely available, for instance Amazon EC2, EMR, and Google Compute Engine. More importantly, open source APIs and libraries have been developed to ease programming on the cloud, for instance Cascading, Storm, Scalding, Apache Spark and Trackur. Meanwhile, computational intelligence approaches, such as evolutionary computation, immune-inspired approaches, and swarm intelligence, are also employed to develop scalable machine learning and data analytics tools.
In this project, we present the sentiment-focused web crawling problem and design a sentiment-focused web crawler framework for faster discovery and retrieval of sentiment-bearing content on the Web. We have developed a computational framework to perform automated reputation analysis on the Web using Natural Language Processing and Machine Learning. This paper introduces the framework and tests its performance on automated sentiment analysis for brand reputation. In addition, we propose different strategies for predicting the polarity scores of web pages.
Experiments show that our proposed framework is more efficient than existing frameworks. Reputation analysis is a useful application for organizations that want to know people's opinions about their products and services.
Our approach consists of four parts. In the first part, the framework performs Web crawling based on the query specified by the user. In the second part, the framework locates relevant information within textual data using Entity Recognition. In the third part, the relevant information is recorded in the database for feature extraction/engineering and classification. Lastly, the framework displays the data for reputation analysis. In the training phase, we used data provided by the marketing team of the University of the Witwatersrand, emoticons, a subset of the SentiStrength lexicon, and the ClueWeb09 dataset. Each domain was labelled accordingly (positive, negative, or neutral), with equal numbers of examples per polarity, in plain text. In the test phase, the classifier predicted the polarity of real-time data. We used accuracy as the evaluation metric to measure how often the classifier's predictions were correct.
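The four parts described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the framework's actual code: the in-memory PAGES dictionary stands in for the Web, a plain substring match stands in for Entity Recognition, an in-memory SQLite table stands in for the database, and every function name is hypothetical.

```python
import sqlite3

# Hypothetical stand-in for the Web: URL -> page text.
PAGES = {
    "http://example.org/a": "Wits has a great research culture. Parking is bad.",
    "http://example.org/b": "The weather was terrible today.",
}


def crawl(query, pages):
    """Part 1: keep pages whose text mentions the query (case-insensitive)."""
    q = query.lower()
    return [(url, text) for url, text in pages.items() if q in text.lower()]


def locate_entity_sentences(text, entity):
    """Part 2: crude stand-in for Entity Recognition.

    Splits on full stops and keeps only sentences that name the entity.
    A real recognizer would classify entity spans instead of matching substrings.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if entity.lower() in s.lower()]


def store(conn, url, sentences):
    """Part 3: record relevant sentences for later feature extraction."""
    conn.executemany(
        "INSERT INTO mentions (url, sentence) VALUES (?, ?)",
        [(url, s) for s in sentences],
    )


def display(conn):
    """Part 4: fetch the stored rows for presentation."""
    query = "SELECT url, sentence FROM mentions ORDER BY url"
    return conn.execute(query).fetchall()


def run_pipeline(query, pages):
    """Run the four parts in order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (url TEXT, sentence TEXT)")
    for url, text in crawl(query, pages):
        store(conn, url, locate_entity_sentences(text, query))
    rows = display(conn)
    conn.close()
    return rows
```

Calling run_pipeline("Wits", PAGES) keeps only the sentence that names the queried entity; classification of the stored sentences would follow as a separate step.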
Frequently Asked Questions
What is the primary focus of this research?
The primary focus of this research is to provide a framework for automatically collecting and analyzing opinions for reputation analysis on the Internet, with a case study focusing on Wits (University of the Witwatersrand).
What are the aims and objectives of this research?
The research aims to implement a real-time reputation analysis platform, develop efficient algorithms for reputation analysis, study the reputation of Wits University on the Internet, and systematically evaluate the proposed framework.
What is the system architecture described in the research?
The system architecture includes a web crawler, a database, and components for Natural Language Processing (NLP) and Machine Learning (ML). The crawler collects data, which is then processed using named entity recognition and stored in the database. Sentiment detection, feature extraction, and engineering are performed before classification and display of the results through a web-based interface.
What are some of the key data mining techniques used?
The research uses N-gram, Bag-of-Words, and autoencoders for feature extraction and engineering. The sentiment detection employs Artificial Neural Networks (ANNs) and SentiStrength for classification.
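As an illustration of the N-gram and Bag-of-Words features mentioned above, the following Python sketch counts word n-grams and turns a small corpus into count vectors. The whitespace tokenizer, the function names, and the choice of unigrams plus bigrams are assumptions made for the example, not details taken from the research.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n_values=(1, 2)):
    """Count every n-gram of the requested sizes in one document."""
    tokens = tokenize(text)
    bag = Counter()
    for n in n_values:
        bag.update(ngrams(tokens, n))
    return bag


def vectorize(texts, n_values=(1, 2)):
    """Map documents to count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_ngrams(t, n_values) for t in texts]
    vocab = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocab] for bag in bags]
    return vocab, vectors
```

For example, vectorize(["good service", "bad service"], n_values=(1,)) yields the vocabulary ["bad", "good", "service"] and the vectors [0, 1, 1] and [1, 0, 1], which a classifier can consume directly.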
How is the system evaluated?
The system is evaluated based on coverage, accuracy (precision and recall), F-measure, and the empirical error of the ANN classifier.
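The metrics named above have standard definitions, sketched below in Python for a single "positive" class. The function name, the label strings, and the convention of returning 0.0 when a denominator is zero are assumptions for illustration; the research may compute or average these metrics differently.

```python
def evaluate(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators instead of raising ZeroDivisionError.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = correct / len(pairs) if pairs else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Precision is the share of predicted positives that are truly positive, recall is the share of true positives that were found, and the F-measure is their harmonic mean.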
What is BoilerPipe, and how is it used?
BoilerPipe is a tool used for extracting the main textual content from web pages, filtering out irrelevant information like advertising and menu bars.
What is Named Entity Recognition (NER) used for?
Named Entity Recognition is used to detect and classify specific entities, such as company/brand names within the textual data, allowing the framework to focus on relevant information.
What is the role of the Graphical User Interface (GUI)?
The GUI is designed to display the results of the sentiment analysis in an understandable way, utilizing colors to indicate polarity (positive, negative, or neutral).
What datasets were used for training?
The training data comprised data provided by the marketing team of the University of the Witwatersrand, emoticons, a subset of the SentiStrength lexicon, and the ClueWeb09 dataset.
What is the cost function used for in the context of neural networks?
The cost function measures the mean error of the observations relative to the current classification function. During backpropagation, the network weights are updated iteratively to reduce this cost until a minimum is reached, indicating that the learning process is near completion.
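To make the idea concrete, here is a minimal Python sketch of a cost being driven down by gradient updates. It assumes a single sigmoid neuron, a mean squared error cost, and a fixed learning rate; these choices and all names are illustrative only, and the network and cost function used in the research may differ.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def mse_cost(w, b, xs, ys):
    """Mean squared error of the neuron's outputs over all observations."""
    total = sum((sigmoid(w * x + b) - y) ** 2 for x, y in zip(xs, ys))
    return total / len(xs)


def train(xs, ys, lr=0.5, epochs=200):
    """Gradient descent on one neuron; returns weights and the cost history."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            a = sigmoid(w * x + b)
            # Chain rule: d(cost)/d(z) for one example, the backpropagated error.
            delta = 2.0 * (a - y) * a * (1.0 - a)
            grad_w += delta * x
            grad_b += delta
        # Step against the mean gradient to reduce the cost.
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
        history.append(mse_cost(w, b, xs, ys))
    return w, b, history
```

Training on a toy set such as xs = [-2, -1, 1, 2] with ys = [0, 0, 1, 1] shows the recorded cost shrinking across epochs; a flattening cost curve is the usual sign that learning is close to a minimum.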
- Mike Nkongolo (Author), 2018, Textual Classification for Sentiment Detection. Brand Reputation Analysis on the Web using Natural Language Processing and Machine Learning, Munich, GRIN Verlag, https://www.grin.com/document/419732