The use of microblogging sites such as Twitter to share feelings and opinions about current hot topics, products, services, and events has risen sharply. Such opinionated data needs to be leveraged effectively to extract valuable insight. This research work focused on designing a comprehensive feature-based Twitter Sentiment Analysis (TSA) framework using a supervised machine learning approach, with an integrated, sophisticated negation handling approach and a knowledge-based Tweet Normalization System (TNS). We generated three real-time Twitter datasets using the search operators #Demonetization, #Lockdown, and #9pm9minutes, and also used one publicly available benchmark dataset, SemEval-2013, to assess the viability of our comprehensive feature-based Twitter sentiment analysis system on tweets. We leveraged a variety of features, including lexicon-based, POS-based, morphological, n-gram, negation, and cluster-based features, to ascertain which classifier works well with which feature group. We employed three state-of-the-art classifiers in our framework: Support Vector Machine (SVM), Decision Tree Classifier (DTC), and Naive Bayesian (NB). We observed SVM to be the best-performing classifier across all the Twitter datasets except #9pm9minutes, for which DTC performed best. Moreover, our SVM model trained on the SemEval-2013 training dataset outperformed NRC Canada, the winning team of SemEval-2013 Task 2, in terms of macro-averaged F1 score, averaged over the positive and negative classes only. Although state-of-the-art Twitter sentiment analysis systems have reported significant performance, it is still challenging to deal with some critical aspects such as negation and tweet normalization.
Table of Contents
1. Introduction
1.1. Levels of Sentiment Analysis
1.1.1 Document-level Sentiment Analysis
1.1.2 Sentence-level Sentiment Analysis
1.1.3 Aspect-level Sentiment Analysis
1.2. Sentiment Analysis Approaches
1.2.1 Knowledge-Based Approach (Lexicon-Based Approach)
1.2.2 Statistical Approach
1.2.3 Hybrid Approach
1.3. History of Sentiment Analysis, Specifically Twitter Sentiment Analysis
1.4. Twitter
1.5. Need for Twitter Sentiment Analysis
1.6. Research Objectives
1.6.1 Corpus Creation
1.6.2 Tweet Normalization
1.6.3 Feature Engineering
1.6.4 Negation Handling
1.6.5 Training of Classifiers
1.6.6 Evaluation of Classifiers
1.7. Thesis Organization
2. Literature Review
2.1. Twitter Sentiment Analysis
2.2. Data Pre-processing
2.3. Negation Modelling
2.3.1 Forms of Negation
2.3.2 Addressing Negation
2.3.2.1 Negation Scope Detection
2.3.2.2 Negation Handling
3. Experimental Set up
3.1. Twitter Corpus
3.1.1 Benchmark Twitter Dataset
3.1.2 Real-time Twitter Dataset
3.1.2.1 Real-Time Twitter Corpus Labelling
3.2. Linguistic Resources
3.3. Classification Features
3.3.1 Ngrams Features
3.3.2 POS-Based Features
3.3.3 Morphological (Twitter-Specific Features)
3.3.4 Cluster Features
3.3.5 Lexicon-Based Features
3.3.6 Negation Features
3.4. Supervised Machine Learning Classifiers
3.4.1 Naive Bayesian Classifier
3.4.2 Support Vector Machine
3.4.3 Decision Tree Classifiers
3.5. Evaluation Metrics
3.5.1 Accuracy
3.5.2 Precision (Positive Predictive Value)
3.5.3 Recall (Sensitivity)
3.5.4 F1 score
3.6. Conclusion
4. Tweet Normalization
4.1. Tweet Normalization System (TNS)
4.1.1 Phase 1: Basic Cleaning Operations
4.1.2 Phase 2: Tweet Normalization
4.2. Evaluation of Tweet Normalization System (TNS)
4.2.1 TNS Evaluation Result on #Demonetization Corpus
4.2.2 TNS Evaluation Result on #Lockdown Corpus
4.2.3 TNS Evaluation Result on #9pm9minutes Corpus
4.2.4 TNS Evaluation Result on Twitter SemEval-2013 Dataset
4.3. Conclusion
5. Negation Handling
5.1. Types of Negation
5.2. Phases of Modelling Syntactic Negation
5.2.1 Negation Cue Identification
5.2.2 Negation Scope Detection
5.2.3 Handling the Negated Context Words
5.3. Proposed Algorithm for Negation Exception Cases
5.4. Evaluation of Negation Exception Algorithm (NEA)
5.4.1 Evaluation of NEA on #Demonetization Corpus
5.4.2 Evaluation of NEA on #Lockdown Dataset
5.4.3 Evaluation of NEA on #9pm9minutes Dataset
5.4.4 Evaluation of NEA on SemEval-2013 Twitter Dataset
5.5. Conclusion
6. Classifiers Training and Evaluation
6.1. Training of Classifiers
6.2. Classifier Evaluation Results on Real-time Twitter Datasets
6.2.1 Evaluation on #Demonetization Corpus
6.2.2 Evaluation on #Lockdown Corpus
6.2.3 Evaluation on #9pm9minutes Corpus
6.3. Evaluation on Benchmark SemEval-2013 Twitter Dataset
6.4. Contribution of Negation Handling Approach with Incorporated Negation Exception Algorithm on Classifiers Performance
6.5. Contribution of Each Pre-processing Module on Classifiers Performance
6.6. Conclusion
7. Conclusion
7.1. Corpus Creation
7.2. Tweet Normalization System (TNS)
7.3. Negation Modelling
7.4. Feature Engineering
7.5. Classification Result
7.6. Future Work
Objectives & Research Themes
The primary research objective is to develop a comprehensive, feature-based framework for Twitter Sentiment Analysis (TSA) that employs supervised machine learning to classify tweets as positive, negative, or neutral. The study seeks to address significant challenges in analyzing unstructured social media text, specifically by improving negation handling and tweet normalization, in order to achieve higher classification performance compared to existing approaches.
- Design of a comprehensive, feature-based Twitter Sentiment Analysis (TSA) framework using supervised machine learning.
- Development of a knowledge-based Tweet Normalization System (TNS) to clean and standardize unstructured Twitter data.
- Creation of a sophisticated negation modeling approach that incorporates a novel negation exception algorithm.
- Empirical evaluation of state-of-the-art machine learning classifiers (SVM, DTC, NB) across real-time and benchmark datasets.
- Exhaustive feature engineering and ablation experiments to identify the most significant feature groups for classification tasks.
Book Excerpt
3.3.1 Ngrams Features
An n-gram is a sequence of N consecutive words: “happy” is a unigram (one word), “great movie” is a bi-gram (2 words), “very excellent speech” is a tri-gram (3 words), “demonetization is a disaster” is a 4-gram (4 words), and so on. Unigram features alone cannot capture context-specific information. For instance, in the text “this movie sucks”, the unigram “sucks” conveys a negative sentiment because it is used in the context of “movie”; in another context it may convey a different sentiment. Thus, we also captured bi-grams in our work, as they can capture domain-specific information. Higher-order n-grams, however, are rare in the corpus and therefore lead to sparse feature vectors (many zeros), and sparse vectors require a large amount of computational resources and memory. Hence, we limited extraction to unigrams and bi-grams.
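As an illustration of the unigram and bi-gram extraction described above, the following is a minimal sketch in Python. The function names `extract_ngrams` and `unigram_bigram_features` are hypothetical, and tokenization is assumed to be simple lowercasing and whitespace splitting; the thesis's own tokenizer is not shown in this excerpt.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigram_bigram_features(text):
    """Count the unigrams and bi-grams of a text.

    Assumption: lowercasing plus whitespace splitting stands in for
    a real tweet tokenizer.
    """
    tokens = text.lower().split()
    features = Counter(extract_ngrams(tokens, 1))
    features.update(extract_ngrams(tokens, 2))
    return features

features = unigram_bigram_features("demonetization is a disaster")
for ngram, count in features.items():
    print(ngram, count)
```

For the four-word example this yields four unigrams and three bi-grams. Each additional order adds more, and rarer, features, which is the sparseness concern raised above.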
We evaluated two ways of representing n-grams, the TF-IDF weighting scheme and Bag-of-Words (CountVectorizer), and obtained better results with TF-IDF because it penalizes frequently occurring words. Bag-of-Words (BOW) is one of the simplest representations of a text: it records each word of the text together with its number of occurrences, discarding the position of each word in that text. It is a popular technique for extracting features from text. A BOW model consists of two things: a vocabulary of words and the number of occurrences of each word. The main problem with the frequency counts of the BOW model is that very common words such as ‘a’, ‘an’, and ‘the’ receive high scores even though they are not informative, which affects model performance.
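The difference between raw counts and TF-IDF weights can be seen in a small pure-Python sketch. It assumes the textbook weighting tf × log(N / df) and a made-up three-document corpus. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed and normalized variant by default, so their numbers will differ from the ones computed here.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Raw term counts per document; word positions are discarded."""
    return [Counter(doc.lower().split()) for doc in docs]

def tfidf(docs):
    """Weight each count by log(N / df), where df is the number of
    documents containing the term. A term found in every document
    gets weight 0, however often it occurs."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]

# Made-up corpus for illustration only.
docs = ["the movie is great", "the speech is a disaster", "the the the end"]
bow = bag_of_words(docs)
weights = tfidf(docs)

print(bow[2]["the"], weights[2]["the"])   # high raw count, zero TF-IDF weight
print(weights[0]["great"])                # rare term keeps a positive weight
```

In the third document ‘the’ has the highest raw count, yet its TF-IDF weight is zero because it appears in every document, whereas ‘great’ appears in only one document and keeps a positive weight.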
Summary of Chapters
Chapter 1 Introduction: Provides a fundamental overview of sentiment analysis, its importance in modern social media, and introduces the core objectives and research gaps addressed by this thesis.
Chapter 2 Literature Review: Surveys existing state-of-the-art studies in sentiment analysis and Twitter sentiment analysis, focusing on pre-processing and negation modeling approaches.
Chapter 3 Experimental Set up: Details the generation and annotation of real-time Twitter corpora and describes the linguistic resources and feature engineering processes used.
Chapter 4 Tweet Normalization: Presents the developed knowledge-based Tweet Normalization System and evaluates its impact on data quality across various datasets.
Chapter 5 Negation Handling: Explains the process of negation modeling, including the proposed negation exception algorithm to handle complex linguistic cases.
Chapter 6 Classifiers Training and Evaluation: Discusses the implementation of machine learning classifiers, presenting ablation experiments and performance comparisons.
Chapter 7 Conclusion: Summarizes the thesis findings, validates the proposed methodologies, and suggests directions for future research.
Keywords
Sentiment Analysis, Twitter Sentiment Analysis, Negation Modelling, Negation Exception Case, Negation Exception Algorithm, Tweet Normalization System, Supervised Machine Learning, Real-Time Twitter Dataset, Benchmark Twitter Dataset, Corpus-Based Statistical Approach, Reverse Polarity, Negation Cue, Feature Engineering
Frequently Asked Questions
What is the core focus of this research?
The research focuses on designing a robust, feature-based framework for Twitter Sentiment Analysis that utilizes supervised machine learning and addresses critical linguistic challenges like negation and noise in unstructured tweets.
What are the primary themes investigated in the study?
The core themes include comprehensive feature engineering, the development of a knowledge-based Tweet Normalization System, sophisticated negation modeling with exception handling, and comparative evaluation of state-of-the-art machine learning classifiers.
What is the main goal or research question?
The main goal is to determine which combinations of linguistic features and machine learning classifiers achieve optimal sentiment analysis performance while effectively handling unstructured social media content, negation, and negation exception cases.
Which scientific methods are employed?
The research employs three supervised machine learning models: Support Vector Machine (SVM), Naive Bayesian (NB), and Decision Tree Classifier (DTC). These are combined with a corpus-based statistical approach for negation handling and linguistic rule-based techniques for tweet normalization.
What aspects are covered in the main section of the work?
The main section covers the collection and labeling of Twitter corpora, the development of normalization and negation algorithms, extensive feature engineering, and systematic ablation experiments to evaluate classifier performance.
Which keywords characterize this work?
Key terms include Sentiment Analysis, Twitter Sentiment Analysis, Negation Modelling, Negation Exception Algorithm, Tweet Normalization System, Supervised Machine Learning, and Feature Engineering.
How does the negation exception algorithm function?
The algorithm uses POS tagging and pattern matching to identify negation cues and then applies linguistic rules to distinguish between actual negation and cases where negation words have no negating effect (e.g., in specific phrases or rhetorical questions), thereby preventing misclassification.
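The general idea can be illustrated with a simplified sketch. This is not the thesis's algorithm: it uses no POS tagging, does not handle rhetorical questions, and the cue list, the exception phrases, and the rule that a negation scope ends at the next punctuation token are all assumptions chosen for illustration.

```python
import re

# Hypothetical, illustrative lists; the thesis's actual cues and exception
# cases are not given in this excerpt.
NEGATION_CUES = {"not", "no", "never", "cannot"}
EXCEPTION_PHRASES = {("not", "only"), ("no", "doubt"), ("no", "wonder")}
CLAUSE_END = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    """Append _NEG to tokens that fall inside a negation scope.

    Assumed scope rule: from a negation cue up to the next punctuation
    token. A cue that starts an exception phrase opens no scope.
    """
    marked, in_scope = [], False
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if CLAUSE_END.match(tok):
            in_scope = False          # punctuation closes the scope
            marked.append(tok)
        elif low in NEGATION_CUES or low.endswith("n't"):
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            if (low, nxt) not in EXCEPTION_PHRASES:
                in_scope = True       # genuine negation: open a scope
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked

print(mark_negation("this is not good , but fine".split()))
print(mark_negation("not only cheap but great".split()))
```

In the first call only “good” is marked as negated; in the second call nothing is marked, because “not only” is treated as an exception phrase with no negating effect.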
Why is the Tweet Normalization System necessary?
Twitter data is highly noisy due to informal language, acronyms, misspelled words, and symbols. The TNS cleans and standardizes this data, which significantly reduces data sparsity and dimensionality, enabling more accurate training of machine learning classifiers.
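A minimal sketch of such a cleaning and normalization pass is shown below. The `SLANG` lookup table, the function name `normalize_tweet`, and the specific regular expressions are hypothetical and stand in for the knowledge-based resources of the TNS, which are not listed in this excerpt.

```python
import re

# Hypothetical lookup table; a real system would rely on far larger resources.
SLANG = {"gr8": "great", "u": "you", "plz": "please", "2moro": "tomorrow"}

def normalize_tweet(tweet):
    """Basic cleaning followed by dictionary-based token normalization."""
    text = re.sub(r"https?://\S+", "", tweet)         # drop URLs
    text = re.sub(r"@\w+", "", text)                  # drop user mentions
    text = re.sub(r"#(\w+)", r"\1", text)             # keep hashtag word, drop '#'
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)        # squeeze 3+ repeated chars to 2
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # lowercase, strip symbols
    return " ".join(SLANG.get(tok, tok) for tok in tokens)

raw = "@user This movie is gr8!!! plz watch https://t.co/abc #Lockdown sooooo good"
print(normalize_tweet(raw))
```

Mapping variants such as “gr8” and “great” to one token is what reduces the vocabulary size, and with it the sparsity and dimensionality of the feature vectors.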
- Cite this work
- Manu Banga (Author), A Comprehensive Approach on Sentiment Analysis & Prediction, Munich, GRIN Verlag, https://www.grin.com/document/1315485