The volume of data available on social networks calls for new methodologies for analyzing information. Some methods allow users to combine different types of data in order to extract relevant information. In this context, the present paper applies a model, via a platform, to group information generated by Twitter users, which facilitates the detection of trends and data related to particular pathologies. To implement the model, an analysis tool based on the Levenshtein distance was developed; it determines what is required to convert a text into each of the following terms: 'gripa' ("flu"), 'dolor de cabeza' ("headache"), 'dolor de estomago' ("stomachache"), 'fiebre' ("fever") and 'tos' ("cough"), in the area of Bogotá.

Excerpt

1. Abstract

2. Introduction

3. Obtaining the information

4. Applying the Levenshtein Distance

5. Experimentation

6. Conclusions

7. References

Research Objectives and Topics

The primary objective of this research is to develop and implement a methodology for analyzing social media data, specifically Twitter, to detect trends and behaviors related to specific pathologies, such as influenza. The study focuses on processing mass information to identify patterns, evaluate sentiment, and determine the prevalence of users reporting symptoms within a specific geographical area like Bogotá.

Application of data mining and web mining techniques on social media platforms.
Utilization of the Levenshtein distance algorithm for text normalization and analysis.
Development of automated information gathering using Python and the Twitter API.
Sentiment analysis to classify the emotional tone associated with pathology-related terms.
Clustering and visualization of semantic relationships between concepts in user tweets.

Excerpt from the Book

4 Applying the Levenshtein Distance

The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, or substitutions) needed to transform one string into another (Vladimir I. Levenshtein 1965). It was chosen because of the simplicity of the algorithm. A rough example of its behavior:

LDistance (home, horse) = 2

Substitution m -> r: hore

Insertion of s: horse
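
The calculation above can be reproduced with a short dynamic-programming routine. This is a minimal sketch of the textbook algorithm, not the authors' analyzer; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous[j] is the distance between the prefix of source handled
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("home", "horse"))  # 2
```

Keeping only two rows of the table at a time is enough because each cell depends only on the current and the previous row.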

The analyzer removes special characters and blank spaces from the string. It then takes a portion of the string to analyze against the desired patterns; the resulting string contains at most 4 words (Table 1).
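
A cleaning step of this kind might look like the sketch below. It rests on several assumptions that the text does not confirm: that "blank spaces" means surplus whitespace (including line breaks), that the retained portion is the first four words, and that letter case is left unchanged. The name clean_tweet is hypothetical, and the output is not claimed to reproduce Table 1.

```python
import re

MAX_WORDS = 4  # word limit stated in the text


def clean_tweet(text: str, max_words: int = MAX_WORDS) -> str:
    """Strip special characters and surplus whitespace, then keep
    at most max_words words (assumed here to be the first ones)."""
    # Replace anything that is not a letter, digit, or whitespace
    # (and also underscores) with a space. Accented letters are kept
    # because str patterns are Unicode-aware in Python 3.
    no_specials = re.sub(r"[^\w\s]|_", " ", text)
    # split() with no arguments collapses repeated blanks and newlines.
    words = no_specials.split()
    return " ".join(words[:max_words])


print(clean_tweet("Voy a tener gripa #Enfermo"))  # Voy a tener gripa
```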

A snapshot of the corpus is:

'Tengo Mucha Gripa!\n>:(',

'Tenemos indicios de gripa!',

'Voy a tener gripa #Enfermo',

'Muriendo de gripa!!!',

After cleaning, the following information is obtained (Table 1):

Chapter Summary

1. Abstract: Provides a high-level overview of using Twitter data mining and Levenshtein distance to detect trends related to specific pathologies in Bogotá.

2. Introduction: Discusses the importance of social networks for mass opinion analysis and defines the role of web mining and data processing in modern sciences.

3. Obtaining the information: Describes the technical setup using Python libraries, the Twitter API, and GeoPlanet to collect location-specific tweet data.

4. Applying the Levenshtein Distance: Explains the algorithmic approach used to process text, normalize strings, and prepare the corpus for pattern detection.

5. Experimentation: Details the clustering and sentiment analysis experiments conducted to visualize relationships between symptom-related concepts.

6. Conclusions: Summarizes the necessity of data cleaning and highlights the potential of combining machine learning with mass data analysis for decision-making.

7. References: Lists the academic sources and technical documentation utilized to support the research methodology and findings.

Keywords

Twitter, Data Mining, Web Mining, Levenshtein Distance, Pathologies, Social Networks, Sentiment Analysis, Big Data, Clustering, Information Analysis, Python, Bogotá, Influenza, Information Processing, Artificial Intelligence.

Frequently Asked Questions

What is the core focus of this research paper?

The paper focuses on developing a methodology to detect and analyze public health-related trends, specifically pathologies, by mining and processing massive amounts of user-generated data from Twitter.

What are the primary thematic areas covered?

The study covers data mining, web mining, algorithm-based text analysis, sentiment analysis, and the application of social network data for medical trend detection.

What is the main research goal?

The goal is to demonstrate that information gathered via social networks can be analyzed using data mining techniques and artificial intelligence to predict or identify the prevalence of sickness in a specific population.

Which scientific methods are utilized?

The authors utilize the Levenshtein distance algorithm for string comparison, sentiment analysis for classifying emotional tone, and clustering techniques to visualize semantic links between symptoms and user expressions.

What is discussed in the main body of the paper?

The main body details the technical implementation of data extraction using Python, the normalization of text via the Levenshtein distance, and the experimental results regarding symptom clustering and sentiment scoring.

Which keywords characterize this work?

Key terms include Twitter, Data Mining, Levenshtein Distance, Sentiment Analysis, Pathologies, Big Data, and Information Analysis.

How does the Levenshtein distance help in identifying sickness?

It allows the analyzer to convert varied user-written forms of words (like misspellings or shorthand) into a standard "pathology" format, enabling consistent grouping and analysis of the data.
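
As an illustration of this idea, the sketch below maps a user-written form to the closest of the five terms named in the abstract. The function names and the threshold max_distance = 2 are assumptions made for the example, not values taken from the paper; for a term as short as 'tos', a stricter threshold would probably be needed to avoid false matches.

```python
from typing import Optional

# Target terms listed in the abstract.
PATHOLOGY_TERMS = ["gripa", "dolor de cabeza", "dolor de estomago", "fiebre", "tos"]


def levenshtein(source: str, target: str) -> int:
    """Standard edit distance with insertions, deletions, substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def nearest_term(text: str, max_distance: int = 2) -> Optional[str]:
    """Return the closest pathology term, or None if every term is
    more than max_distance edits away (hypothetical threshold)."""
    candidate = text.lower()
    best = min(PATHOLOGY_TERMS, key=lambda term: levenshtein(candidate, term))
    return best if levenshtein(candidate, best) <= max_distance else None


print(nearest_term("gripe"))   # gripa
print(nearest_term("futbol"))  # None
```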

Why is sentiment analysis important for this study?

Sentiment analysis is used to determine if the context surrounding a symptom is negative, which helps researchers infer that the individual posting the tweet is likely actually feeling ill.

What significance does the city of Bogotá have in this study?

Bogotá serves as the specific geographical focus for the data extraction process, where tweets were linked to the city via coordinate systems to ensure local relevance.

End of excerpt from 4 pages

Details

Title: Behavior of Users Talking about Pathologies and Diseases on Twitter
Authors: Dennis Salcedo (Author), Alejandro León (Author)
Year of publication: 2015
Pages: 4
Catalog number: V302920
ISBN (eBook): 9783668014046
ISBN (Book): 9783668014053
Language: English
Keywords: behavior users talking pathologies diseases twitter
Product safety: GRIN Publishing GmbH

Cite this work: Dennis Salcedo (Author), Alejandro León (Author), 2015, Behavior of Users Talking about Pathologies and Diseases on Twitter, München, GRIN Verlag, https://www.grin.com/document/302920
