Predicting COVID-19 Cases in US Long-Term Care Facilities

An Empirical Study Using Epidemiological Data

Master's Thesis, 2020

61 Pages, Grade: 1.0


Table of Contents

List of Figures

List of Tables

List of Abbreviations


1. Introduction
1.1 Background
1.2 Research Scope
1.3 Structure of this Paper

2. Literature Review
2.1 COVID-19 and US Nursing Homes
2.2 Factors Influencing the Number of COVID-19 Cases

3. Research Methodology
3.1 Description of the Datasets
3.2 Data Processing in Python
3.3 Statistical Analyses
3.4 Prediction of Nursing Homes with COVID-19 Cases

4. Findings and Discussion
4.1 Evaluation and Interpretation of the Developed Models
4.2 Discussion of Results

5. Conclusion
5.1 Summary of Key Findings
5.2 Limitations of the Analyses
5.3 Implications for Practice and Future Research



Appendix 1: Python Documentation

List of Figures

Figure 1: Methodological Approach

Figure 2: Data Mapping Process

Figure 3: List of Missing Values

Figure 4: Feature-Importance and Machine Learning Models Accuracy Comparison

List of Tables

Table 1: Literature Review of considered Variables in similar Studies

Table 2: Univariate Analyses of COVID-19 Cases and Deaths in US Nursing Homes

Table 3: Bivariate Analyses of COVID-19 Cases and Deaths in US Nursing Homes

List of Abbreviations

CDC Centers for Disease Control and Prevention

CMS Centers for Medicare and Medicaid Services

hprd hours per resident day

LTCfocus Long-term Care: Facts on Care in the US

ONS Office for National Statistics

PHE Public Health England

RF Random Forest

RKI Robert Koch-Institute

RN Registered Nurse

UK United Kingdom

US United States


Objectives – The focus of this paper is to identify factors that increase the probability of COVID-19 cases in nursing homes and to provide an exemplary concept for applying the findings using machine learning algorithms to allow future research to derive appropriate countermeasures for practice.

Subjects – This study investigates all active, and Medicare and Medicaid certified nursing home facilities in the United States (excluding Alaska and Washington, DC) that reported the number of COVID-19 infections and fatalities among their residents and staff up to August 02nd, 2020 (13,069 nursing homes). The required information about nursing homes and the respective epidemiological data was obtained from nine datasets retrieved from three state and publicly accessible databases.

Methods – The dependent variables are the reported COVID-19 infections and fatalities among nursing home residents and staff. The independent variables are clustered into five different themes: (1) facility characteristics, (2) rating, deficiencies and fines, (3) nurse staffing, (4) resident demographics and (5) external factors. A total of seven machine learning models were tested: logistic regression, nearest neighbours, Gaussian naïve Bayes, support vector machines, decision trees, random forests and neural networks.

Results – The findings show evidence of a relationship between COVID-19 infections and fatalities and (1) the size of a nursing home, (2) the facility's age, (3) whether a nursing home is for-profit, (4) whether a nursing home is urban or rural, (5) the number of federal deficiencies, (6) the total amount of fines, (7) the concentration of residents with Medicaid, (8) the share of residents from a racial or ethnic minority, (9) the excess of beds in the respective county of a nursing home, (10) the number of infections per 100,000 people in a county, (11) the number of deaths per 100,000 people in a county, (12) the occupancy rate, (13) the overall CMS facility rating, (14) the total reported registered nurse (RN) staffing levels, (15) the total reported nurse staffing levels and (16) its competitive environment (Herfindahl Index). Whether a nursing home was a chain was not significantly related to COVID-19 cases and deaths. Even though the for-profit status, the urban location, the rating, the number of serious federal deficiencies and infection control deficiencies and the total amount of fines paid showed a statistical significance, these factors only marginally contributed to the machine learning model.

1. Introduction

1.1 Background

Nursing homes face high rates of COVID-19 mortality amongst their residents because of their advanced age, frequent underlying health conditions and the movement of staff between facilities (Comas-Herrera et al., 2020; McMichael et al., 2020). In addition, the lack of appropriate monitoring systems and the different testing strategies and capacities across and even within countries have possibly led to a general underestimation of the severity of COVID-19 in nursing homes (ECDC, 2020).

According to data collected from the Centers for Disease Control and Prevention (CDC), the first COVID-19 patient in the United States (US) was diagnosed on January 20th (Holshue et al., 2020), and within a little over six months, the total reached 4,623,301 confirmed cases and 153,621 deaths as of August 02nd. The data from the Centers for Medicare and Medicaid Services showed that, of all recorded COVID-19-related deaths, 53,196 were in nursing homes, which accounts for 35% (CMS, 2020a; USAFacts, 2020a). Similarly severe ratios can also be found in many European countries. According to the Office for National Statistics (ONS), Public Health England (PHE) and the Care Quality Commission (CQC), as of August 2nd, the United Kingdom (UK) had 306,663 confirmed COVID-19 infections and 41,272 reported deaths, of which 19,394 occurred in care homes, accounting for about 47% of deaths (CQC, 2020; ONS, 2020). In Germany, the Robert Koch-Institute (RKI) reported 209,983 COVID-19 cases and 9,141 deaths due to the virus, 3,621 of which occurred in care and nursing homes, or 40% of all COVID-19-related deaths (RKI, 2020). However, data availability and reporting methodologies vary significantly between countries and often do not allow disclosure of the number of COVID-19 infections and deaths in individual care homes, which is, for instance, the case in the UK (McNeill and MacAskill, 2020).

In response to the pandemic, the US has mandated the CMS to develop an appropriate surveillance system to detect problem areas and provide timely information on future infection control measures (CDC, 2020a; CMS, 2020b). With penalties for failing to report, approximately 93% of the 15,404 nursing homes reported the required data to the CDC by August 02nd (CMS, 2020a). With this extent of publicly available data, in combination with other data sources and the use of machine learning models, the drivers of COVID-19 cases in nursing homes can be better understood and used to identify virus-susceptible facilities much earlier, thereby minimising the spread of the virus (ECDC, 2020). Although the dataset initially contained incorrect entries due to unprepared nursing homes and a lack of controls on the part of the CMS, the dataset showed a significant improvement in quality after just a few weeks. It will be updated weekly in the future (CMS, 2020a).

1.2 Research Scope

The primary objective of this study is to identify factors that increase the probability of COVID-19 cases in nursing homes. This is done using methods of data preparation and statistical analysis and by applying several machine learning models to publicly available datasets. Three sub-goals are derived from this objective.

Sub-goal 1 is to identify COVID-19 facilitating factors based on the literature review and initial insights from analysing the study sample. For this sub-goal, a total of five studies conducted in the past months are used as a starting point and compared with the results of this work in the fourth chapter.

Sub-goal 2 is to process the dataset developed for this study and to conduct the statistical analyses of the COVID-19 facilitating factors identified in Sub-goal 1. This objective also includes the application of machine learning models to classify nursing homes with and without virus outbreaks. The second sub-goal forms the core of this study, in which the insights from the first sub-goal are used to process a total of nine selected datasets. The individual datasets were first cleaned of redundant variables, then converted into a uniform format and finally combined into one dataset and prepared for further study. The processing includes the handling of missing values, feature engineering for the statistical analyses and the application of machine learning algorithms. Subsequently, representative findings were obtained and compared with those of the previously selected studies. The target variables, or dependent variables, are also based on similar studies. A total of four target variables were defined and analysed in the descriptive study: (1) confirmed COVID-19 infections among residents, (2) confirmed COVID-19 fatalities among residents, (3) confirmed COVID-19 infections among staff, and (4) confirmed COVID-19 fatalities among staff. The machine learning models focus on the classification of nursing homes with infections among residents.

Sub-goal 3 is to evaluate the achieved results in light of comparable studies and findings from the literature review. Based on the analyses and predictions carried out in Sub-goal 2, the last sub-goal bridges the gap between the conducted analyses and models and their role in the existing literature.

1.3 Structure of this Paper

Chapter 2 introduces nursing homes in the context of the COVID-19 pandemic and presents the variables used in similar studies through an extensive literature review. Chapter 3 describes the methodology for this study. The focus in the third chapter is on explaining the selected datasets and outlining the process from obtaining the data to the application of different machine learning models. The results are evaluated and interpreted in Chapter 4. Subsequently, the insights gained are discussed within the context of similar studies. This paper ends with a summary of its central findings and limitations and provides an outlook regarding the implications for research and practice.

2. Literature Review

The following chapter lays the foundation for empirical studies used in this work. The first part briefly reviews the literature about the impact of COVID-19 on US nursing homes in order to provide a better understanding of the ecosystem investigated in this study. The second part of the literature review defines the main variables considered as potential drivers of COVID-19 cases and fatalities in nursing homes within the context of similar studies.

2.1 COVID-19 and US Nursing Homes

Nursing homes, also known as long-term care facilities or skilled-care facilities, play an important role in providing care for dependent older people. Such facilities help vulnerable people who have difficulty living independently due to chronic illness or old age. Especially because of an ageing population in many places, the need for elderly care will increase (ECDC, 2020; National Institute on Aging, 2017; World Health Organization, 2017). According to a recent report by Comas-Herrera et al. (2020), the effects of COVID-19 on residents and staff in nursing homes have become mainly apparent in two ways: (1) nursing homes are overcrowding due to a large number of fatalities in a short period, and (2) too many staff members are becoming infected.

In recent months, there have been numerous scientific publications on the new Coronavirus. While a majority of these are medically focussed on understanding its symptoms and finding a cure (e.g., Holshue et al., 2020), there is also an increasing body of studies re-creating the dynamics of the virus and predicting its geographical distribution (e.g., Dowd et al., 2020; McMichael et al., 2020; Ren et al., 2020). The latter is also being investigated with respect to nursing homes, although only a handful of related publications have been issued to date (e.g., Abrams et al., 2020; Harrington et al., 2020; He et al., 2020; Li et al., 2020). In contrast to the academic work, both governmental institutions and non-profit organisations provide regular updates on the number of infections and fatalities and offer analyses, predictions and in some cases also recommendations for necessary countermeasures (Comas-Herrera et al., 2020; Dawson et al., 2020; Mollalo et al., 2020).

The first case of COVID-19 among nursing homes was recorded by the Life Care Center of Kirkland, which quickly became a hotbed for the new virus. As of April 23rd, 2020, there were already over 50,000 confirmed COVID-19 cases and approximately 10,000 deaths related to the virus in US nursing homes, which was more than one-tenth of all COVID-19 cases recorded in the US at that time and almost one-third of recorded deaths. To better protect the old, vulnerable population, the CMS has published a comprehensive guide on infection control procedures. States are also developing their own guidelines for visiting nursing homes, screening staff and using personal protective equipment (He et al., 2020).

2.2 Factors Influencing the Number of COVID-19 Cases

The studies considered for this paper were all conducted in recent months. However, they differ in part in their periods of study, the US states they include, the total number of nursing homes, and the dependent and independent variables they investigate. While Gopal et al. (2020), Harrington et al. (2020) and He et al. (2020) all conducted their research in the state of California, the number of nursing homes in their study samples varied between 713 and 1,223 facilities. Li et al. (2020) examined 215 nursing homes in Connecticut, and Abrams et al. (2020) used a sample of 9,395 nursing homes, which the authors obtained from over 30 US states. The three most common variables examined by the five teams of authors are the ownership type, which was used in all five papers, the rating of nursing homes, which was also used in all five papers, and the size of the nursing home, which was used in all studies except that of Li et al. (2020). In addition, the publications in some cases placed different emphases on the variables under study or identified differences in the statistical relevance of similar variables. Moreover, the studies also differed in the techniques used to evaluate their data. For the purposes of this paper, the variables used in the mentioned studies are divided into the following five themes, which are introduced in more detail in the following section: (1) facility characteristics, (2) rating, deficiencies and fines, (3) nurse staffing, (4) resident demographics and (5) external factors. A comparison of the studies is shown in Table 1.

Illustration not included in this excerpt

Table 1: Literature Review of considered Variables in similar Studies

Facility Characteristics

For this work, the facility characteristics were divided into the following features: (1) the facility size, which is measured by the number of certified beds, (2) the facility's age, which indicates the years the nursing home has been in operation, (3) the ownership type, which specifies whether a nursing home is profit-oriented, non-profit or state-owned, (4) the chain affiliation, indicating whether the nursing home is part of a larger nursing home operator with at least two facilities, (5) the state the nursing home is located in, with a distinction between urban and rural areas, and (6) the occupancy rate.

Recent studies have found that nursing homes with infected residents were on average larger than facilities without COVID-19 infections (Abrams et al., 2020; Gopal et al., 2020; Harrington et al., 2020). Regarding facility size, Abrams et al. (2020), Gopal et al. (2020) and Harrington et al. (2020) found this factor statistically significant (P < 0.05). Li et al. (2020) did not find a statistically significant relationship between the size of a nursing home and the number of reported infections. He et al. (2020) examined the age of facilities in their work and found a statistically significant relationship without describing the finding further. Three of the five research teams found a statistically relevant positive correlation between for-profit ownership of a nursing home and the number of reported COVID-19 cases (Harrington et al., 2020; He et al., 2020; Li et al., 2020). The occupancy rate was only investigated by Gopal et al. (2020) and He et al. (2020), both of whom found a negative association with the likelihood of COVID-19 cases.

CMS Rating

The rating is an indicator of the quality of a nursing home and thus is often a focus of the academic literature on nursing homes. The Five-Star Quality Rating System for US nursing homes was developed by the CMS to give consumers and their families more transparency by providing information on the Nursing Home Compare website. The data is based on structured interviews from nursing home caregivers and residents. Each US nursing home receives a rating between one and five stars. A five-star rating is regarded as far above average in quality, and a one-star rating as far below average. For each nursing home, there is an overall rating on the five-star scale and three separate ratings on health inspection, staffing, and quality measures (CMS, 2020c).

Three of the five publications examined included only the overall rating as a variable in their work, and with the exception of Abrams et al. (2020), they also demonstrated that the overall rating was statistically relevant (He et al., 2020; Li et al., 2020). Gopal et al. (2020) examined the separate ratings on health inspection, staffing and quality measures. The authors Harrington et al. (2020) focused instead on the rating of staffing and divided it further into the nurse staffing and registered nurse staffing and also achieved significant results.

Deficiencies and Fines

Another indicator of a facility's quality is the total number and severity of deficiencies and the resulting fines. Nursing homes that do not meet the required safety standards for their residents are issued deficiencies. Castle et al. (2011) defined resident safety in the nursing homes based on a modified description of the Agency for Healthcare Research and Quality as 'Freedom from accidental or preventable injuries produced by medical care' (Castle et al., 2011:1). Issued deficiencies are further classified based on their scope and severity. Serious deficiencies are those that risked the life of residents and are assigned Level G or higher (Grabowski and Stevenson, 2008; Harrington et al., 2012).

The study conducted by Harrington et al. (2020) found that facilities with high ratings were associated with a lower number of total deficiencies. These nursing homes reported, on average, fewer infection control deficiencies. However, compared to the infection control deficiencies, the total number of deficiencies was a more robust measure in determining the COVID-19 incidents in nursing homes. Abrams et al. (2020), by contrast, could not statistically support the same relationship. Rather, they found that prior infection violations did not significantly relate to the probability of having COVID-19 cases. Serious federal deficiencies and the amount of fines paid by nursing homes have not yet been studied.

Nurse Staffing

It has been shown that nurse staffing has a beneficial effect on the process as well as on the outcomes of care at nursing homes, particularly in terms of fewer deficiencies and longer resident life expectancies (Kim et al., 2009; O'Neill et al., 2003). A study carried out by the CMS in 2001 showed that the staffing levels should be at least 0.75 RN hours per resident day (hprd) and 4.1 nursing hprd in order to prevent patient jeopardy and harm (CMS, 2001). For better quality care, a panel of experts has even recommended staffing levels of at least 4.55 hprd (Harrington et al., 2000).

Harrington et al. (2020), for example, have also noted the urgency of implementing their earlier recommendation in their most recent study on COVID-19. Concerning total RN staffing, Li et al. (2020) found that, in nursing homes with one or more resident fatalities, every 20-minute increase in RN staffing reduced the predicted COVID-19 fatalities among nursing home residents by 26%.

Resident Demographics

In the context of this study, resident demographics include the proportion of Medicaid-insured residents on the one hand and the proportion of racial or ethnic minority residents on the other. According to the CDC, racial and ethnic minority groups are more at risk due to inequalities in social determinants of health (CDC, 2020b).

The effect of these demographic factors on the probability of having COVID-19 cases in nursing homes can also be seen in the work of Li et al. (2020) in which the authors found that facilities with a higher concentration of Medicaid residents or racial or ethnic minority residents had 16% and 15% more confirmed cases than their counterparts, respectively. A statistically significant association between the share of racial or ethnic minority residents and the number of cases was also found by He et al. (2020). In a previous study, the authors Li et al. (2015) also found that certified nursing assistant hprd increased over a period of 10 years in nursing homes with a small share of minorities, while it decreased on average over the same period in nursing homes with a larger share.

External Factors

For the purposes of this work, external factors are divided into the following features: excess beds in a county, the Herfindahl Index, county infections and county deaths per 100,000 people. The Herfindahl Index ranges from 0 to 1, with lower values indicating a lower market concentration and thus higher competition in a given county (Harrington et al., 2012). Academic studies have shown that differences in competition have an impact on staffing levels and thus on the quality of care nursing homes provide (Grabowski, 2001).

Compared to the other areas, external factors are only considered in the study by Gopal et al. (2020) in the form of the number of infections in a county per 100,000 people, which according to the authors, is also statistically relevant for the number of COVID-19 cases in nursing homes. The variables excess beds in a county, the Herfindahl Index and the county deaths per 100,000 people have not yet been studied in association with COVID-19.

3. Research Methodology

In order to prepare the different datasets, several methods of data preparation, statistical analyses and machine learning algorithms were applied. Thus, the following chapter is divided into four sections. The first section describes the selected datasets. The second section explains the preparation process in detail. The third section provides an overview of the applied statistical analyses. The fourth section introduces the application and optimisation of different machine learning models. The methodological approach is illustrated in Figure 1. The complete code can be seen in Appendix 1.

Illustration not included in this excerpt

Figure 1: Methodological Approach (Author's Own Compilation)

3.1 Description of the Datasets

This study obtained information on nursing homes and their respective epidemiological data from nine datasets, which were retrieved from three state and publicly accessible databases. The nursing homes studied are all active, Medicare and Medicaid certified facilities in the US (excluding Alaska and Washington, DC due to sample size limitations) that reported the number of COVID-19 infections and fatalities among their residents and staff up to August 02nd, 2020. Each row in the dataset corresponds to a nursing home in the US. This data forms the main data frame for data processing.

The American health insurer Medicare offers a comprehensive collection of information on nursing homes on its website in the dataset collection 'Nursing Home Compare'. The datasets provide a wide range of characteristics on the 15,404 nursing homes currently active and certified by Medicare and Medicaid, such as the number of certified beds, quality measures, staffing and other information used in the five-star rating system. The collection comprises 26 individual datasets in seven categories: general information, health and fire safety inspections, penalties, quality measures, staffing, star ratings and skilled nursing facility value-based purchasing. The collection was developed by the CMS using the following three sources: (1) the CMS' health inspection database, (2) a national database of resident clinical data known as the Minimum Dataset (MDS) and (3) Medicare claims data. For the purpose of this study, the following 4 of the 26 datasets were selected: Provider info, which contains general information about nursing homes, such as the number of certified beds, quality measure scores and staffing times, and the lists of all health deficiencies and complaint inspections from the past three years, which consist of three individual datasets (CMS, 2020d).

Since the US commissioned the CMS to develop a surveillance system, nursing homes are required to report COVID-19 relevant information, such as the number of confirmed infections of residents, to the appropriate authorities. The reported data is reviewed by the CMS and published in the COVID-19 Nursing Home Dataset at regular intervals. As of August 02nd, 14,256 of the 15,404 nursing homes have provided this information (CMS, 2020a).

Additional information about the characteristics of nursing home residents was obtained from the second online repository used for this study, Long-term Care: Facts on Care in the US (LTCfocus), which was developed by Brown University. LTCfocus obtains information from five different sources: (1) the Online Survey Certification and Reporting System, (2) the MDS, (3) state policy data (SP), (4) the Area Resource File and (5) the Residential History File. These datasets provide information about health and the functional status of nursing home residents, the characteristics of nursing homes, relevant state policies and the characteristics of markets in which nursing homes exist (Brown University, 2017).

To account for external factors such as confirmed infections and deaths per capita on a county level, this study obtained data from USAFacts. USAFacts is a non-profit and nonpartisan civic initiative that provides an understandable source of government data. The initiative collects data from (1) the CDC and (2) state and local-level public health agencies. In addition, by referencing state and local agencies directly, the initiative confirms its county-level data (USAFacts, 2020b).

3.2 Data Processing in Python

In order to gain insights from the different datasets, various data processing steps are carried out and documented in detail. For better transparency and reproducibility of this study, the programming language Python and the programming environment Jupyter Notebook were used. Due to its platform independence and a large number of available packages and functions, Python is one of the most popular software tools for the application of predictive analyses (Anandarajan, Hill and Nolan, 2019). This study used Python version 3.6.

Step 1: Reading the Datasets

The collected data for this study contains over 280 features distributed over the nine different datasets, which also include duplicate information. Since considering every feature would go beyond the scope of this study, a handful of relevant features were selected based on similar studies. A total of 38 features were chosen to be mapped in Step 2. The COVID-19 Nursing Home Dataset was filtered on the most recent date (August 02nd, 2020). The three datasets about the nursing home deficiencies were merged and aggregated on a nursing home level. In the LTCfocus dataset, a new feature named ‘county_code’ was created, which was necessary for the later merging of the datasets. From the USAFacts datasets, only the observations of COVID-19 infections and fatalities that matched the previously selected date of August 02nd, 2020 were kept. These datasets were then merged with the information about the county population into one data frame for further processing.
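
The filtering and aggregation described in this step can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the code from Appendix 1; the provider IDs, field layout and values are hypothetical. Counting scope and severity levels of G or higher as serious follows the definition given in Section 2.2.

```python
from collections import Counter
from datetime import date

# Hypothetical weekly COVID-19 reports: (provider_id, week_ending, resident_cases)
reports = [
    ("015009", date(2020, 7, 26), 3),
    ("015009", date(2020, 8, 2), 5),
    ("015010", date(2020, 8, 2), 0),
]

# Hypothetical deficiency records, one row per citation: (provider_id, scope_severity)
deficiencies = [
    ("015009", "D"),
    ("015009", "G"),
    ("015010", "E"),
    ("015009", "F"),
]

# Keep only the rows that belong to the most recent reporting date.
latest = max(week for _, week, _ in reports)
latest_reports = {pid: cases for pid, week, cases in reports if week == latest}

# Aggregate the citations to one count per nursing home.
total_deficiencies = Counter(pid for pid, _ in deficiencies)
# Levels are single letters, so "G or higher" is a simple string comparison.
serious_deficiencies = Counter(pid for pid, level in deficiencies if level >= "G")

# Combine everything into one record per facility.
facilities = {
    pid: {
        "resident_cases": cases,
        "total_deficiencies": total_deficiencies.get(pid, 0),
        "serious_deficiencies": serious_deficiencies.get(pid, 0),
    }
    for pid, cases in latest_reports.items()
}
```

With real data, the same pattern would typically be expressed with data frame operations; the dictionary form is used here only to keep the sketch self-contained.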

Step 2: Data Mapping as Preparation for the Cleaning Process

As this study obtained information from nine datasets from the three aforementioned online repositories, data mapping was a crucial part of the data preparation process. The mapping required additional information to allocate nursing homes to counties and to link the nursing home data to the confirmed cases on a county level. The necessary information for this mapping step was retrieved from (1) the US Census Bureau, which provided the Census Bureau Region and Division Codes and the Federal Information Processing Standards (FIPS) Codes for States and (2) the National Bureau of Economic Research, which provided the CMS' SSA to FIPS State and County Crosswalk data. The newly mapped data frame contained a total of 25 different features. A simplified overview of the data-mapping process can be seen in Figure 2.
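
As a rough illustration of the mapping idea, the sketch below links facilities to county-level case counts through a crosswalk table. All codes, field names and numbers are hypothetical placeholders, and the per-100,000 rate is computed here only to show how the linked county data could be used.

```python
# Hypothetical crosswalk: SSA state-and-county code -> FIPS county code
ssa_to_fips = {"01000": "01001", "01010": "01003"}

# Hypothetical facility rows keyed by SSA county code
facilities = [
    {"provider_id": "015009", "ssa_county": "01000"},
    {"provider_id": "015010", "ssa_county": "01010"},
    {"provider_id": "015011", "ssa_county": "99999"},  # no crosswalk entry
]

# Hypothetical county-level data keyed by FIPS county code
county_data = {
    "01001": {"cases": 1100, "population": 55000},
    "01003": {"cases": 3000, "population": 200000},
}

def map_facility(facility):
    """Attach the FIPS county code and the county infection rate to a facility."""
    fips = ssa_to_fips.get(facility["ssa_county"])
    county = county_data.get(fips)
    rate = None
    if county is not None:
        # Confirmed infections per 100,000 county residents.
        rate = county["cases"] * 100_000 / county["population"]
    return {**facility, "county_code": fips, "county_cases_per_100k": rate}

mapped = [map_facility(f) for f in facilities]
```

Facilities without a crosswalk match keep `None` in the county fields, which makes unmatched rows visible for the later handling of missing values.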

Step 3: Data Pre-processing

In the newly created data frame, 52% of the 15,404 nursing homes (7,985 facilities) had missing entries for at least one of their features. An overview of the missing values can be seen in Figure 3. In order to retain as many nursing homes as possible, missing values were filled with the median of the respective column. Missing values in categorical features, such as the chain affiliation (multifac), and in county-specific data were not filled. After this step, 5,650 of the 7,985 affected nursing homes had no missing information. The remaining 2,335 nursing homes with missing entries were dropped from the dataset, which leaves the 13,069 nursing homes of the study sample. As an extensive data quality assurance process is conducted by the CMS, this study applied no additional outlier detection or treatment.
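
The two-stage treatment of missing values (median imputation for numeric columns, then dropping rows that are still incomplete) can be sketched as follows. The rows and column names other than 'multifac' are hypothetical, and the sketch uses plain Python rather than the data frame code of Appendix 1.

```python
from statistics import median

# Hypothetical rows; None marks a missing entry.
rows = [
    {"id": "A", "beds": 100, "occupancy": 0.80, "multifac": "Yes"},
    {"id": "B", "beds": None, "occupancy": 0.90, "multifac": "No"},
    {"id": "C", "beds": 60, "occupancy": None, "multifac": None},
    {"id": "D", "beds": 80, "occupancy": 0.70, "multifac": "No"},
]
numeric_cols = ["beds", "occupancy"]  # only these columns are imputed

# Column medians computed from the observed (non-missing) values.
medians = {
    col: median(r[col] for r in rows if r[col] is not None)
    for col in numeric_cols
}

# Fill missing numeric entries with the column median; leave other columns as-is.
imputed = [
    {k: (medians[k] if k in medians and v is None else v) for k, v in r.items()}
    for r in rows
]

# Drop every row that still has a missing entry (here: a missing categorical value).
complete = [r for r in imputed if all(v is not None for v in r.values())]
```

In this toy example, facility B is retained through imputation, while facility C is dropped because its categorical 'multifac' entry remains missing.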

Step 4: Feature Engineering

The feature engineering step includes the transformation of the variables and the creation of new variables (Vanderplas, 2016). Two of the newly created features are, for instance, the Herfindahl Index and the surplus of beds in a county. To calculate the Herfindahl Index, the number of beds in a facility is first divided by the total number of nursing home beds in its county. This market share of each facility is then squared, and the squared shares are summed over all facilities in the county. The excess of beds in a county is calculated by subtracting the number of occupied nursing home beds in a county from the total number of beds in the respective county (Harrington et al., 2012). Of the 27 features in the target data frame, 23 were either transformed or newly created.
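
The two county-level features follow directly from this description. The sketch below is a minimal illustration with hypothetical county codes and bed counts; the function and field names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical facilities with certified and occupied beds.
facilities = [
    {"county": "01001", "beds": 50, "occupied": 40},
    {"county": "01001", "beds": 50, "occupied": 45},
    {"county": "01003", "beds": 120, "occupied": 100},
]

def county_measures(facilities):
    """Return the Herfindahl Index and the excess beds for every county."""
    beds = defaultdict(list)
    occupied = defaultdict(int)
    for f in facilities:
        beds[f["county"]].append(f["beds"])
        occupied[f["county"]] += f["occupied"]

    result = {}
    for county, bed_counts in beds.items():
        total = sum(bed_counts)
        # Sum of squared market shares (share = facility beds / county beds).
        herfindahl = sum((b / total) ** 2 for b in bed_counts)
        result[county] = {
            "herfindahl": herfindahl,
            "excess_beds": total - occupied[county],
        }
    return result

measures = county_measures(facilities)
```

A county with a single facility has an index of 1 (no competition), while two equally sized facilities yield 0.5, which matches the interpretation that lower values indicate more competition.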

Step 5: Data Normalisation

The features in the prepared dataset were in different units. As machine learning algorithms are in general sensitive to these differences in the scale of magnitudes of features, this study addressed this issue by normalising the data (Ramyaa et al., 2019; Vanderplas, 2016).
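
The text does not state which normalisation method was applied. As one common option, the sketch below assumes min-max scaling, which maps every feature to the range from 0 to 1; the function name and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features on very different scales.
beds = [60, 80, 100, 160]
fines = [0, 5000, 20000, 100000]

scaled_beds = min_max_scale(beds)
scaled_fines = min_max_scale(fines)
```

After scaling, both features span the same range, so distance-based models such as nearest neighbours are no longer dominated by the feature with the larger units.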

Illustration not included in this excerpt

Figure 2: Data Mapping Process (Author's Own Compilation)

Illustration not included in this excerpt

Figure 3: List of Missing Values (Author's Own Compilation)

3.3 Statistical Analyses

The prepared dataset was used for statistical analyses. The conducted analyses provided a fundamental understanding of the relationship between COVID-19 infections and fatalities in US nursing homes and decreased the number of selected features by excluding variables that were not statistically relevant for the machine learning models. This study applies two key statistical analyses: (1) univariate analyses of the independent variables and (2) bivariate analyses. In contrast to the processing section, the statistical analyses were conducted using Microsoft Excel.

Univariate Analyses

Univariate analyses of the independent variables were conducted to calculate the number of nursing homes and the average number of confirmed COVID-19 cases and COVID-19-related fatalities (He et al., 2020). These two outcomes were further differentiated by the type of affected person: residents or staff. Each of these dependent variables was then divided into the following categories: no COVID-19 case or fatality, total cases or fatalities up to and including the 95th percentile, and cases or fatalities above the 95th percentile. This categorisation distinguishes normal outbreaks from extreme outbreaks and was used for all subcategories except staff fatalities, for which the 95th percentile is zero.
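Although these analyses were carried out in Microsoft Excel, the categorisation can be expressed equivalently in Python. The following sketch uses fictitious case counts; the variable and category names are assumptions.

```python
import pandas as pd

# Hypothetical number of resident cases per nursing home.
cases = pd.Series([0, 0, 3, 5, 8, 12, 20, 35, 61, 140], name="resident_cases")
p95 = cases.quantile(0.95)


def outbreak_category(n, threshold):
    """Assign one of the three categories described in the text."""
    if n == 0:
        return "none"
    if n <= threshold:
        return "up to 95th percentile"
    return "above 95th percentile"


category = cases.map(lambda n: outbreak_category(n, p95)).rename("category")

# Number of nursing homes and average number of cases per category.
summary = cases.groupby(category).agg(["count", "mean"])
print(summary)
```

The resulting table contains, for each category, the number of facilities and their average number of cases.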

Bivariate Analyses

The bivariate analyses compared the facilities with confirmed infections and fatalities among residents and staff to facilities with no reported COVID-19 infection or fatality. Continuous features (i.e., facility size, facility age, occupancy rate and the overall CMS facility rating) were converted into binary values, 'high' and 'low', using the median of the respective measure as the threshold (Harrington et al., 2020).
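The median split can be sketched as follows. The data are fictitious, and the treatment of values that equal the median (assigned to 'low' here) is an assumption, as the text does not specify it.

```python
import pandas as pd

# Hypothetical facilities: size in beds and whether a resident case occurred.
facilities = pd.DataFrame({
    "total_beds": [40, 80, 120, 160, 200],
    "resident_case": [0, 0, 1, 1, 1],
})

# Values above the median are 'high'; all others are 'low' (assumption).
median_beds = facilities["total_beds"].median()
facilities["size_group"] = (facilities["total_beds"] > median_beds).map(
    {True: "high", False: "low"}
)

# Share of facilities with at least one resident case per group.
share_with_cases = facilities.groupby("size_group")["resident_case"].mean()
print(share_with_cases)
```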

3.4 Prediction of Nursing Homes with COVID-19 Cases

In order to predict COVID-19 cases in nursing homes, various steps regarding the use, selection and improvement of machine learning algorithms were carried out and documented.

Step 1: Selection of Appropriate Models

The initial step was to compare the predictive accuracy of seven machine learning methods for predicting which nursing homes are susceptible to COVID-19 infections: logistic regression, nearest neighbours, Gaussian naïve Bayes, support vector machines, decision trees, random forests (RF) and neural networks. While some machine learning models are more suitable for classification tasks than others, testing a wide range of models is often the first step in practice (Anandarajan, Hill and Nolan, 2019; Ramyaa et al., 2019).
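Such a comparison could be set up as sketched below. The example assumes scikit-learn with largely default settings and uses synthetic data from make_classification instead of the nursing home dataset; the specific classes and parameters are therefore not necessarily those used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One candidate per method named in the text.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "nearest neighbours": KNeighborsClassifier(),
    "Gaussian naive Bayes": GaussianNB(),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
}

# Mean accuracy over 10 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```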

Step 2: Feature Selection

Feature selection is an effective method for addressing the high dimensionality of data: removing redundant features reduces computation time and improves learning accuracy (Cai et al., 2018). To further enhance the robustness of the model, the least predictive features of the chosen model were dropped. In addition, ranking the features by their importance in the machine learning model supports the interpretation of the results. Unexpectedly high-ranking features hint at previously unknown factors contributing to the probability of COVID-19 cases in nursing homes (Liu, 2011; Zien et al., 2009).
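A possible implementation of this step is sketched below. It assumes the impurity-based feature importance of a scikit-learn random forest and synthetic data; the number of dropped features (n_drop) is a hypothetical choice, as the text does not report the cut-off used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with generic feature names (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

# Fit the forest and rank the features by importance, highest first.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)

# Drop the least predictive features (hypothetical cut-off).
n_drop = 3
X_reduced = X.drop(columns=importance.index[-n_drop:])
```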

Step 3: Hyperparameter Tuning

The RF algorithm has several hyperparameters that must be set before running the model, for instance, the number of trees or the minimum number of samples required to split an internal node or to form a leaf node. Hyperparameter tuning selects a set of optimal parameters for the respective algorithm. Grid search is the most commonly used method to optimise hyperparameters in machine learning models and was therefore used in this study. The technique works by exhaustively testing every combination in a predefined subset of hyperparameter values. Its advantage is that it finds the best combination of the predefined parameters. Its disadvantage is that, because all combinations are tested, it is computationally intensive and can be time-consuming. In addition, intensive hyperparameter tuning carries the risk of overfitting (Probst et al., 2019).
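A grid search over the hyperparameters mentioned above might look as follows. The sketch assumes scikit-learn's GridSearchCV and synthetic data; the grid values are purely illustrative and are not the ones explored in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid: 2 x 2 x 2 = 8 combinations are tested exhaustively.
param_grid = {
    "n_estimators": [25, 50],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Even this small grid requires 40 model fits (8 combinations with 5 folds each), which illustrates why the method becomes expensive as the grid grows.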

Step 4: Evaluation of Results

In order to evaluate the obtained results, this paper used a series of methods. Since the focus is on classification, accuracy is used as the main evaluation metric. More specifically, the average accuracy score across a K-Fold cross-validation was used, which gives a more robust metric than a single train-test split. The K-Fold method splits the dataset into k folds; each fold serves once as the test set while the remaining folds form the training set. This study used k = 10, and a separate accuracy was calculated for each split. In general, accuracy describes a relation between the trueness and precision of a measurement method. Trueness is the distance from the averaged measurements to the true value. Precision measures the variability within a set of measurements and thus gives information about the reproducibility of the measurement. Measurement methods of high trueness and high precision are called accurate. In binary classification tasks, accuracy is typically defined as the ratio of correctly classified cases to the total number of cases (ISO, 1994).
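The evaluation procedure can be sketched as an explicit loop over the ten folds. The example assumes scikit-learn's KFold with shuffling, a random forest with illustrative settings and synthetic data; none of these details are confirmed by the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for the prepared dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracy = []

for train_idx, test_idx in kfold.split(X):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    # Accuracy: correctly classified cases divided by all cases in the fold.
    fold_accuracy.append(np.mean(predictions == y[test_idx]))

mean_accuracy = float(np.mean(fold_accuracy))
print(f"mean accuracy over {len(fold_accuracy)} folds: {mean_accuracy:.3f}")
```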

4. Findings and Discussion

The insights from the conducted statistical analyses and the optimised machine learning models are presented and discussed in the following chapter. The first part of this chapter focuses on the visualisation and description of the results, while the second part discusses them with reference to similar studies.

4.1 Evaluation and Interpretation of the Developed Models

This section is divided into the description of findings from the statistical analyses and the application of machine learning models.

Statistical Analyses

The study sample consisted of 13,069 nursing homes, which accounts for 84.8% of the 15,404 nursing homes in the US. Results are shown in Table 1. Of the 13,069 nursing homes, which included both skilled nursing facilities and nursing homes, 6,885 (52.7%) had at least one documented COVID-19 case among residents, and 654 of them had at least 61 (over the 95th percentile of the total sample) confirmed infections. Among facilities with a positive COVID-19 case among residents, the average number of cases was 22.7. While Florida (65.5%) and New York (63.3%) had the largest share of COVID-19 affected facilities, California reported the highest number of infected nursing home residents, with 14,581 confirmed cases, and New York had the second highest number, with 11,839 cases among residents.


COVID-19, Predicting, Forecasting, Data Analysis, Long-Term Care Facilities, Nursing Homes, Coronavirus
Quote paper
Metin Baki (Author), 2020, Predicting COVID-19 Cases in US Long-Term Care Facilities, Munich, GRIN Verlag,

