In many areas of science, as well as in economic and political practice, the task is to determine the probability that a certain event occurs. In marketing, for example, it is of interest which factors increase the probability of a purchase; in medicine, which factors increase the risk of an illness; and in politics, how certain variables affect the probability of being elected. All these events can be viewed as dichotomous (binary) variables (purchase vs. non-purchase; disease vs. no disease; election vs. non-election; etc.). It is precisely in these cases, when the dependent variable is dichotomous, that linear regression fails to provide a satisfactory answer. Because there are only two possible outcomes, the probability that the event occurs is 1 minus the probability that it does not occur. (The event is the dependent variable, also called the endogenous or explained variable, the regressand, or the prognosis variable.) Logistic regression makes it possible to estimate the probability that the event occurs. On the one hand, the method resembles discriminant analysis in that it is a two-group approach. On the other hand, it resembles linear regression analysis, since the independent variables (also called exogenous or explanatory variables, regressors, or predictors) are weighted via a regression approach (see Backhaus, p. 284).

Logistic regression is widely used in machine learning and in medicine. Machine learning can have a major impact in medicine because it helps analyze enormous amounts of data (see Tripepi et al., 2008, p. 808). The aim of this paper is to create a machine learning tool based on logistic regression that analyzes Covid-19 data from Mexico. In future pandemics, such an algorithm could help physicians and health authorities answer questions such as: Who should be vaccinated first? Who has a higher risk of a severe course of the disease? Answering these kinds of questions quickly and on the basis of data could save considerable resources and could dampen the course of the pandemic. Furthermore, it simplifies difficult decisions and could also support decision-making in a triage situation when resources are scarce.

Excerpt

1. INTRODUCTION

2. METHODOLOGY

2.1 WHY NOT LINEAR REGRESSION?

2.2 LOGISTIC REGRESSION IS THE SOLUTION

3. APPLICATION OF LOGISTIC REGRESSION ALGORITHM TO COVID19 DATA FROM MEXICO

3.1 WHAT INFLUENCES THE COURSE OF AN INFECTION?

3.2 EXPLORATORY DATA ANALYSIS AND DATA CLEANING

3.3 PREDICTION WITH A LOGISTIC REGRESSION MODEL

4. CONCLUSION

Objectives & Key Topics

The primary objective of this work is to develop a machine learning tool based on logistic regression to analyze COVID-19 data from Mexico, aiming to predict whether a patient requires hospitalization and to identify factors influencing the severity of the infection.

Theoretical overview and limitations of linear regression in classification tasks
Mathematical foundation and application of the logistic regression algorithm
Identification of critical variables such as age and comorbidities in COVID-19 progression
Data cleaning and preprocessing of a large-scale Mexican COVID-19 dataset
Predictive modeling performance and accuracy assessment using confusion matrices

Excerpt from the Book

2.1 Why not linear regression?

Logistic regression is used for so-called classification problems. While the linear approach deals with metrically scaled dependent variables whose value is to be predicted by the regression line, the logistic approach is concerned with determining the probability of occurrence of a certain event. As mentioned earlier, there are many problems in which the dependent variable is qualitative. Why this causes difficulties for linear regression is illustrated in Figure 1, which plots two variables from the dataset used for the model in chapter 3. The question examined is what influence age has on the treatment of the patient (outpatient = 0, inpatient = 1). The straight line fitted by the linear regression function clearly leaves the permissible range of values, i.e. it takes on values below 0 or above 1 (see James et al., 2013, p. 130).

Thus, the probability of occurrence of an event cannot be estimated reliably with a linear regression. First, the linear regression approach assumes that the dependent variable can range from -∞ to +∞ rather than from 0 to 1. Second, it assumes that the residuals u_k are normally distributed, which is not the case for binary variables. Third, implausible estimates arise when using linear regression, as shown above. In conclusion, one could apply linear regression by coding the outcome as a dummy variable, and OLS would produce estimates, but since some of them would fall outside the [0;1] range they would be hard to interpret as probabilities (see Hosmer et al., 2013, pp. 3-7).
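The out-of-range problem described above can be reproduced with a few lines of Python. The following is only a minimal sketch: the seven data points are hypothetical toy values, not taken from the Mexican dataset, and the helper names `fit_ols` and `predict` are chosen for illustration. Treatment status is coded 0 for outpatient (ambulant) and 1 for inpatient (stationary).

```python
# Hypothetical toy data (NOT from the Mexican dataset): patient age and
# treatment status, coded 0 = outpatient (ambulant), 1 = inpatient (stationary).
ages = [20, 30, 40, 50, 60, 70, 80]
inpatient = [0, 0, 0, 0, 1, 1, 1]


def fit_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_ols(ages, inpatient)


def predict(age):
    """Fitted value of the straight line at the given age."""
    return intercept + slope * age


for age in (20, 50, 80):
    print(age, round(predict(age), 3))
# 20 -0.214   (below 0, not a valid probability)
# 50 0.429
# 80 1.071    (above 1, not a valid probability)
```

Even inside the observed age range, the fitted line returns a negative value for the youngest patient and a value above 1 for the oldest, which is the interpretability problem the text describes.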

Summary of Chapters

1. INTRODUCTION: This chapter outlines the necessity of modeling dichotomous events in economics and medicine, introducing logistic regression as a superior alternative to linear models for such purposes.

2. METHODOLOGY: This section explains why linear regression fails to provide valid probability estimates for binary outcomes and derives the mathematical logic behind the logistic regression function and the Maximum Likelihood approach.

3. APPLICATION OF LOGISTIC REGRESSION ALGORITHM TO COVID19 DATA FROM MEXICO: This chapter contextualizes COVID-19 risk factors through existing literature and details the practical cleaning, processing, and application of a logistic regression model on the Mexican pandemic dataset.

4. CONCLUSION: The final chapter summarizes the efficacy of the developed model in assisting public health decision-making and efficient resource allocation during pandemic scenarios.

Keywords

Logistic Regression, Machine Learning, COVID-19, Mexico, Hospitalization, Data Cleaning, Maximum Likelihood, Binary Variables, Predictive Modeling, Public Health, Pandemic, Comorbidities, Confusion Matrix, Odds Ratio, Statistical Learning

Frequently Asked Questions

What is the core focus of this research paper?

The paper focuses on applying a logistic regression machine learning algorithm to assess the probability of a COVID-19 patient requiring inpatient care based on a dataset from Mexico.

Which thematic fields are addressed in the study?

The study integrates medical literature regarding COVID-19 risk factors with statistical methodology to demonstrate the practical application of binary classification models.

What is the central research question?

The central research question is whether it is possible to accurately predict if a COVID-19 patient will need to be hospitalized, thereby helping to dampen the course of a pandemic.

Which scientific method is utilized?

The author employs a logistic regression model, utilizing the Maximum Likelihood approach for parameter estimation and evaluating the model's performance via accuracy metrics and confusion matrices.
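To make the Maximum Likelihood idea concrete, the sketch below fits a one-variable logistic regression by plain gradient ascent on the Bernoulli log-likelihood. It is an illustration under stated assumptions, not the author's implementation: the data are hypothetical toy values, the age is rescaled as (age - 50) / 10, and the function names, learning rate and number of steps are arbitrary choices. Statistical software typically uses a faster optimiser, but the quantity being maximised is the same.

```python
import math

# Hypothetical toy data (NOT from the Mexican dataset): ages 20..80 rescaled
# as (age - 50) / 10, and treatment status (0 = outpatient, 1 = inpatient).
ages_scaled = [-3, -2, -1, 0, 1, 2, 3]
inpatient = [0, 0, 0, 1, 0, 1, 1]


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, x, y):
    """Bernoulli log-likelihood of the data under p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Maximise the log-likelihood by gradient ascent (illustrative settings)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to b0 and b1.
        residuals = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        grad0 = sum(residuals)
        grad1 = sum(r * xi for r, xi in zip(residuals, x))
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


b0, b1 = fit_logistic(ages_scaled, inpatient)
print("log-likelihood at start:", log_likelihood(0.0, 0.0, ages_scaled, inpatient))
print("log-likelihood at fit:  ", log_likelihood(b0, b1, ages_scaled, inpatient))
for xi in ages_scaled:
    print(xi, round(sigmoid(b0 + b1 * xi), 3))
```

Unlike a straight line, every fitted value passes through the logistic function and therefore stays strictly between 0 and 1.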

What main topics are covered in the core chapters?

The core chapters cover the mathematical deficiencies of linear regression, the literature-based identification of COVID-19 outcome predictors, data cleaning processes, and the final predictive analysis of the model.

Which keywords best characterize this work?

Significant keywords include Logistic Regression, COVID-19, Machine Learning, Hospitalization, Binary Classification, and Statistical Learning.

Why were the variables "ICU" and "Intubed" excluded from the model?

They were excluded because they contained too many missing values and, crucially, because they occur chronologically after the decision for hospitalization is made, making them unsuitable for an early intervention tool.

How well does the model perform in terms of accuracy?

The model achieves an accuracy of approximately 81.56% on the test subset when predicting the need for patient hospitalization.
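For readers unfamiliar with how such an accuracy figure is derived, the following sketch builds a confusion matrix from predicted probabilities and computes accuracy from it. The labels and probabilities are hypothetical toy values, the 0.5 classification threshold is an assumption (the threshold used in the paper is not stated here), and the helper names are illustrative.

```python
# Hypothetical toy values (NOT results from the paper).
actual = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]           # 1 = hospitalized
probabilities = [0.1, 0.2, 0.7, 0.3, 0.9, 0.4, 0.8, 0.2, 0.6, 0.1]

THRESHOLD = 0.5  # assumed cut-off for classifying a patient as hospitalized
predicted = [1 if p >= THRESHOLD else 0 for p in probabilities]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of correct predictions among all predictions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


counts = confusion_counts(actual, predicted)
print(counts)            # {'tp': 3, 'tn': 5, 'fp': 1, 'fn': 1}
print(accuracy(counts))  # 0.8
```

Accuracy alone does not distinguish between the two kinds of error, which is why the confusion matrix is usually reported alongside it.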

What is the significance of the McFadden’s R-squared value achieved?

The model achieved a McFadden’s R-squared of approximately 0.41, which the author interprets as an excellent fit according to standard statistical benchmarks.
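McFadden's R-squared compares the log-likelihood of the fitted model with that of a null model that predicts the same probability (the observed share of positive cases) for every patient: R² = 1 - ln L(model) / ln L(null). The sketch below computes it for hypothetical toy values; the numbers and function names are illustrative and do not reproduce the paper's result.

```python
import math

# Hypothetical toy values (NOT results from the paper).
actual = [0, 0, 0, 1, 1, 1]                      # 1 = hospitalized
model_probs = [0.2, 0.3, 0.6, 0.4, 0.7, 0.9]     # fitted P(hospitalized)


def bernoulli_log_likelihood(y, probs):
    """Sum of log-probabilities assigned to the observed outcomes."""
    return sum(
        math.log(p) if yi == 1 else math.log(1 - p)
        for yi, p in zip(y, probs)
    )


def mcfadden_r2(y, probs):
    """McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null)."""
    base_rate = sum(y) / len(y)                   # null model: intercept only
    ll_null = bernoulli_log_likelihood(y, [base_rate] * len(y))
    ll_model = bernoulli_log_likelihood(y, probs)
    return 1 - ll_model / ll_null


r2 = mcfadden_r2(actual, model_probs)
print(round(r2, 3))
```

A value of 0 means the model is no better than the null model, and values closer to 1 indicate that the model assigns high probability to the outcomes that were actually observed.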

Excerpt out of 14 pages

Details

Title: A powerful tool for future pandemics?
Subtitle: Application of a Logistic Regression to Mexican Covid-19 Data
College: University of Bayreuth
Grade: 2,0
Author: Anonym (Author)
Publication Year: 2023
Pages: 14
Catalog Number: V1394994
ISBN (PDF): 9783346942449
Language: English
Tags: logistic regression logistische Regression Statistical learning
Product Safety: GRIN Publishing GmbH

Quote paper: Anonym (Author), 2023, A powerful tool for future pandemics?, Munich, GRIN Verlag, https://www.grin.com/document/1394994
