Excerpt

## Programme

0. Introduction

0.1. Introduction to data manipulation

0.1.1. Analyzing the data structure

0.1.2. Analyzing first observation

0.1.3. Counting the number of observations and variables

0.1.4. Selecting sample

0.1.5. Selecting variables

0.1.6. Dropping Variables

0.1.7. Renaming variable

0.1.8. Filtering subset data

0.1.9. Summarizing numerical variables

0.1.10. Group data by categorical variable

0.1.11. Selecting rows by position

0.1.12. Use of IF ELSE Statement

0.1.13. Selecting with condition

0.1.14. Summarizing the number of levels in factor variables

0.1.15. Identifying levels of factor variable

0.2. Data summarizing

0.2.1. Central tendency and dispersion measures for quantitative variable

0.2.2. Calculating frequency for qualitative variable

0.2.3. Analyzing quantitative and qualitative data

0.2.4. Analyzing two qualitative data

0.2.5. Calculating percentage for two qualitative data

1. Normality analysis

1.1. Analyzing normality visually

1.2. Testing normality numerically

1.3. Testing normality using skewness and kurtosis

2. Correlation Analysis

2.1. Pearson and Spearman Correlation

2.2. Partial correlation

2.3. Polyserial correlation

2.4. Point-biserial correlation

3. Multiple Regression Analysis

3.1. Assumptions of Multiple regression

3.2. Testing Multiple Regression Assumption in R

3.2.1. Check the linearity of the model

3.2.2. Analyzing the mean of residuals

3.2.3. Testing homoscedasticity

3.2.4. Testing normality of residuals

3.2.5. Testing for Independence of residuals

3.2.6. Checking for collinearity

3.2.7. Checking for Model Outliers

3.2.8. Checking for Other Assumptions

3.2.9. Variable Importance

4. Mean Comparison

4.1. One-sample t-test

4.2. Comparing the means of two independent groups

4.2.1. Unpaired Two Samples T-test (parametric)

4.2.2. Unpaired Two-Samples Wilcoxon Test (non-parametric)

4.2.3. Comparing means for paired samples

4.2.3.1. Preliminary test to check paired t-test assumptions

5. Comparing the means of more than two groups

6. Test of Independence

7. Testing Association between two nominal variables

8. Comparing proportion and independence test

8.1. One-proportion Z-test

8.2. Two-proportions z-test

9. Testing for stationarity for time series

10. Factor Analysis

10.1. Exploratory Factor Analysis

10.1.1. Descriptive statistics

10.1.2. Testing correlation and sample size for factor analysis

10.1.3. Reliability

10.1.4. Identifying Optimal Number of Factors

10.1.4.1. Using Eigenvalues

10.1.4.2. Parallel analysis

10.1.5. Run the EFA with seven factors

10.2. Confirmatory Factor Analysis

10.2.1. Indices of Goodness of Fit

10.2.2. Fitting the model with CFA

10.2.2.1. Specify the model

10.2.2.2. Fitting the model

10.2.2.3. Getting the model summary, confidence intervals and goodness of fit indicators

10.2.2.4. Obtain confidence intervals for the estimated coefficients

10.2.2.5. Obtain goodness of fit indicators of the model

10.2.2.6. Reliability Analysis

10.2.2.7. Getting standardized estimates

10.2.2.8. Plotting the estimates

10.2.3. Fitting the model with CFA using SEM

10.2.3.1. Specify the model

10.2.3.2. Fitting the model

10.2.3.3. Summarizing the model

10.2.3.4. Getting goodness of fit indicators of the model

10.3. Analyzing non-normally distributed data

11. Structural Equation Modeling

11.1. Model one

11.1.1. Model Specification

11.1.2. Model fitting

11.1.3. Summarizing the model

11.1.4. Getting goodness of fit indicators of the model

11.1.5. Reporting the goodness of fit indicators of the model

11.2. Model two

11.2.1. Model specification with residual covariance

11.2.2. Model fit

11.2.3. Model summary

11.2.4. Reporting the goodness of fit indicators of the model

12. Goodness-of-Fit Measures

13. Mediation and Moderation

13.1. Baron and Kenny procedures

13.1.1. Analyzing the Relationship between Independent variable (FDI) and mediator variable (EXP).

13.1.2. Variables EXP and GDP must be related once the effect of FDI is controlled

13.1.3. Analyzing the relationship between Independent and Dependent variables

13.1.4. Analyzing the decrease of the Relationship between FDI and GDP

13.2. Mediation analysis using Nonparametric bootstrap

13.3. Robust mediation analysis

14. Robust Regression

14.1. Investigating Data Normality

14.2. Identifying Outliers

14.3. Linear Regression

14.4. Robust regression

14.5. Comparing Robust and Linear regression

15. Nonparametric regression

15.1. Kendall–Theil Sen Siegel nonparametric linear regression

15.2. Generalized additive models

References

## Abstract

The purpose of this guide is to show how to conduct data analysis using the R tool. The guide does not aim to teach statistics or related fields; rather, it shows practically when and how inferential statistics are conducted, for those who have little knowledge of the R programming environment. It is a collection of the packages needed to conduct data analysis. The guide indicates, step by step, how to choose a statistical test based on the research questions. It also presents the assumptions that must be respected for a statistical test to be valid. This guide covers normality tests, correlation analysis (numerical, ordinal, binary, and categorical), multiple regression analysis, robust regression, nonparametric regression, comparing a one-sample mean to a known standard mean, comparing the means of two independent groups, comparing the means of paired samples, comparing the means of more than two groups, independence tests, comparing proportions, goodness-of-fit tests, testing for stationarity of time series, exploratory factor analysis, confirmatory factor analysis, and structural equation modeling. Scripts and code are available for each test, and the guide shows how to report the results of each analysis. It will help researchers and data analysts, and will contribute to increasing the quality of their publications.

Keywords: data analysis, correlation, multiple regression, structural equation modeling, t-test, ANOVA, independence test.

## 0. Introduction

Data analysis is used in different circumstances by students, researchers, professionals, policymakers, etc. However, the tools and techniques needed to conduct statistical analysis are not taught in many colleges and universities. Many of those who attempt statistical analysis do it incorrectly, which greatly distorts the results obtained. Each statistical test has its own conditions or assumptions that must be met for the result to be valid. For instance, Schober and Boer (2018) indicated that the use of the Pearson correlation coefficient is conditional on having jointly normally distributed data following a bivariate normal distribution. For non-normally distributed continuous data, for ordinal data, or for data with relevant outliers, a Spearman rank correlation can be applied instead as a measure of a monotonic association.

Ghasemi and Zahediasl (2012) indicated that nearly 50% of published articles have at least one statistical error. They added that normality and other assumptions should be respected because, when these assumptions are violated, it is impossible to draw accurate and reliable conclusions about reality. Analytical methods differ depending on whether or not normality is satisfied, and inconsistent results may be obtained depending on the data analysis method used (Kwak & Park, 2019). The authors added that in many clinical research papers, the findings are presented and interpreted without checking or testing normality.

This guide presents the procedures to conduct normality tests, correlation analysis, multiple regression analysis, robust regression analysis, nonparametric regression, comparisons of two means, independence tests, comparisons of proportions, exploratory and confirmatory factor analysis, etc. It also provides useful code for data manipulation. The guide gathers the packages and code required for each test in order to improve the quality of published findings. The reader is encouraged to consult the underlying theory, even though a brief introduction is given for each test.

### 0.1. Introduction to data manipulation

The purpose of this introductory part is to provide skills in analyzing quantitative and qualitative data. It also covers univariate and bivariate analysis. Finally, it shows how to access and use data from a data frame. For these objectives, we use the hdv2003 data from the “questionr” package, and the “dplyr” package for data operations.

library(questionr)

library(dplyr)

data(hdv2003) #Calling for data

Data <- hdv2003 # saving the data in Data

#### 0.1.1. Analyzing the data structure

str(Data)

Data is the name of the data to be analysed. The data analyst will replace Data with the name of his/her own data. The function indicates the number of observations and the nature of the variables (integer, factor, numeric). Each variable begins with the symbol $. As can be seen in the result, we have 2000 observations and 20 variables. Some of them are integers (numbers), others factors (qualitative variables), etc.

'data.frame': 2000 obs. of 20 variables:

- id : int 1 2 3 4 5 6 7 8 9 10 ...

- age : int 28 23 59 34 71 35 60 47 20 28 ...

- sexe : Factor w/ 2 levels "Homme","Femme": 2 2 1 1 2 2 2 1 2 1 ...

- nivetud : Factor w/ 8 levels "N'a jamais fait d'etudes",..: 8 NA 3 8 3 6 3 6 NA 7 ...

- poids : num 2634 9738 3994 5732 4329 ...

- occup : Factor w/ 7 levels "Exerce une profession",..: 1 3 1 1 4 1 6 1 3 1 ...

- qualif : Factor w/ 7 levels "Ouvrier specialise",..: 6 NA 3 3 6 6 2 2 NA 7 ...

- freres.soeurs: int 8 2 2 1 0 5 1 5 4 2 ...

- clso : Factor w/ 3 levels "Oui","Non","Ne sait pas": 1 1 2 2 1 2 1 2 1 2 ...

- relig : Factor w/ 6 levels "Pratiquant regulier",..: 4 4 4 3 1 4 3 4 3 2 ...

- trav.imp : Factor w/ 4 levels "Le plus important",..: 4 NA 2 3 NA 1 NA 4 NA 3 ...

- trav.satisf : Factor w/ 3 levels "Satisfaction",..: 2 NA 3 1 NA 3 NA 2 NA 1 ...

- hard.rock : Factor w/ 2 levels "Non","Oui": 1 1 1 1 1 1 1 1 1 1 ...

- lecture.bd : Factor w/ 2 levels "Non","Oui": 1 1 1 1 1 1 1 1 1 1 ...

- peche.chasse : Factor w/ 2 levels "Non","Oui": 1 1 1 1 1 1 2 2 1 1 ...

- cuisine : Factor w/ 2 levels "Non","Oui": 2 1 1 2 1 1 2 2 1 1 ...

- bricol : Factor w/ 2 levels "Non","Oui": 1 1 1 2 1 1 1 2 1 1 ...

- cinema : Factor w/ 2 levels "Non","Oui": 1 2 1 2 1 2 1 1 2 2 ...

- sport : Factor w/ 2 levels "Non","Oui": 1 2 2 2 1 2 1 1 1 2 ...

- heures.tv : num 0 1 0 2 3 2 2.9 1 2 2 ...

#### 0.1.2. Analyzing first observation

The function head(Data) displays the first six rows of the data under analysis. To display the first ten rows, the code becomes head(Data, 10).

Table 1: Analyzing the first rows of the data

[Table not shown in this excerpt]

As can be seen, the first row or first observation is a Femme (woman), and the third is an Homme (man). We can see some variables like id, age, sexe (gender), etc. It is important to visualize the first lines to make sure that your data have been imported correctly into the R working environment.
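Since Table 1 itself is not reproduced in this excerpt, the same idea can be illustrated with R's built-in iris data frame, used here purely as a stand-in for your own data:

```r
# Inspect the first rows of a data frame (iris is a stand-in for Data)
head(iris)      # first six rows
head(iris, 10)  # first ten rows
```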

#### 0.1.3. Counting the number of observations and variables

One may wish to know the number of rows and the number of variables. The dim() function, applied to the name of the data, returns the number of rows and of variables (columns). As the result shows, we have 2000 rows and 20 variables.

dim(Data)

[1] 2000 20

#### 0.1.4. Selecting sample

One may wish to select a sample from the data set, either a given number of rows chosen at random or a given percentage of the rows. The sample_n() function picks random rows from a data frame (or table); the second argument of the function tells R the number of rows to select.

sample_n(Data, 5) # randomly selecting five rows

[Output not shown in this excerpt]

The sample_frac() function randomly returns a given fraction of the rows. In the example below, it returns 3% of the rows.

sample_frac(Data, 0.03)

[Output not shown in this excerpt]

#### 0.1.5. Selecting variables

As indicated by Bhalla (n.d.), select() is used to choose only the desired variables. For example, to select the variable “age” in our data, the following code is used.

select(Data,age)

#### 0.1.6. Dropping Variables

One may wish to delete some variables from the dataset. A minus sign (-) before a variable tells R to drop that variable. To drop the variable “age” from Data, for instance, the following code is used. However, it is advisable to save the result in a new dataset so as to keep the original data intact.

select(Data, -age)

#### 0.1.7. Renaming variable

One may wish to change the name of a variable; the rename() function is used for this. It follows the structure rename(data, new_name = old_name), where data is the name of the dataset whose variable you wish to rename. To replace age with old, for instance, the following code is used:

rename(Data, old = age)

#### 0.1.8. Filtering subset data

The filter() function is used to subset data matching logical conditions. Suppose that you want to filter the observations whose variable “nivetud” is “2eme cycle” and save them in Data1.

Data1 <- filter(Data, nivetud == "2eme cycle")

Data2 <- filter(Data, nivetud %in% c("2eme cycle", "N'a jamais fait d'etudes") & age >= 20)

#### 0.1.9. Summarizing numerical variables

If your data set contains both numerical variables and factors, you may want to summarise each of them separately.

summarise_if(Data, is.numeric, list(n = ~n(), mean = mean, median = median)) # returns the count, mean and median of each numeric variable

#### 0.1.10. Group data by categorical variable

If you wish to group data based on a grouping factor, when making a contingency table for instance, the group_by() function groups the data by a categorical variable. For example, to group the data by sex:

group_by(Data, sexe)

#### 0.1.11. Selecting rows by position

If you wish to select some rows (observations), you just indicate their positions. The slice() function is used to select rows by position, here rows 1, 5, 8 and 9.

slice(Data, 1, 5, 8, 9)

[Output not shown in this excerpt]

#### 0.1.12. Use of IF ELSE Statement

This function returns one value if the condition is met and another value if it is not. It can be used to transform a numerical variable into a categorical variable, for instance.

if_else(condition, true, false)

true: value returned if the condition is met

false: value returned if the condition is not met

Suppose that for the variable “age” we want to use “Minor” for ages under 20 years old and “Major” for the others. We write the following code:

age<-if_else(Data$age<20, "Minor", "Major")

We can see the result by making the table.

table(age)

age

Major Minor

1952 48

You can also nest the calls for more complex recoding. Note that the code below overwrites the original variable Data$age.

Data$age <- ifelse(Data$age < 18, "Mineur",
ifelse(Data$age >= 18 & Data$age <= 25, "Moyen",
ifelse(Data$age > 25 & Data$age <= 30, "Grand",
ifelse(Data$age >= 30 & Data$age <= 45, "Experimente", "vieux"))))

table(Data$age)

Experimente Grand Moyen vieux

568 160 191 1081

#### 0.1.13. Selecting with condition

Data3 <- select_if(Data, is.numeric) # selecting numeric variables

Data4 <- select_if(Data, is.factor) # selecting factor variables

#### 0.1.14. Summarizing the number of levels in factor variables

summarise_if(Data, is.factor, nlevels)

#### 0.1.15. Identifying levels of factor variable

levels(Data$sexe)

"Homme" "Femme"

### 0.2. Data summarizing

This section shows how to summarize quantitative and qualitative data. For quantitative data, the mean, median, variance, minimum, maximum, skewness, etc. can be calculated, whereas for qualitative data, frequencies can be calculated.
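These measures can also be obtained with base R functions; below is a quick sketch on a small vector whose values are purely illustrative:

```r
# Hypothetical numeric data, for illustration only
x <- c(2, 4, 4, 7, 9, 15)

mean(x)    # central tendency: arithmetic mean
median(x)  # central tendency: middle value
var(x)     # dispersion: variance
sd(x)      # dispersion: standard deviation
range(x)   # minimum and maximum
```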

#### 0.2.1. Central tendency and dispersion measures for quantitative variable

The jmv package provides many useful descriptive statistics in APA format, but there are many other packages and functions that can produce various summaries, such as describe() from the psych package.

library(jmv)

descriptives(data = Data, vars = c("age"), freq = FALSE, hist = TRUE, dens = TRUE,
bar = FALSE, barCounts = FALSE, box = TRUE, sd = TRUE, variance = TRUE,
range = TRUE, se = TRUE, skew = TRUE, kurt = TRUE, quart = TRUE, pcEqGr = TRUE, pcNEqGr = 10)

[Output not shown in this excerpt]

Unneeded results can be omitted from the call. The advantage of this package is that it provides the standard errors of kurtosis and skewness, which can be used in z calculations, as will be seen in the normality analysis.

[Figure not shown in this excerpt]

Figure 1: Density diagram of age

[Figure not shown in this excerpt]

Figure 2: Boxplot of Age

#### 0.2.2. Calculating frequency for qualitative variable

Two qualitative variables are analyzed in this code. However, you may analyze one or more variables.

descriptives(data = Data,vars = c( "occup","qualif"), freq = TRUE, bar = TRUE, barCounts = TRUE)

[Figure not shown in this excerpt]

Figure 3: Bar chart of occup

[Figure not shown in this excerpt]

Figure 4: Bar chart of qualif

#### 0.2.3. Analyzing quantitative and qualitative data

This is a bivariate analysis: it combines two variables in the same table. We want to see the age distribution based on the occup variable.

We need the “tidycomm” package.

library(tidycomm)

Data %>% crosstab(age, occup)

[Output not shown in this excerpt]

#### 0.2.4. Analyzing two qualitative data

A contingency table can be made from a combination of two categorical variables.

Data %>% crosstab(sexe, nivetud)

[Output not shown in this excerpt]
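If tidycomm is not available, base R's table() builds a similar contingency table; here is a sketch on a small synthetic data frame (the values are invented for illustration, not taken from hdv2003):

```r
# Synthetic example data, purely illustrative
df <- data.frame(
  sexe   = c("Homme", "Femme", "Femme", "Homme", "Femme"),
  cinema = c("Oui", "Non", "Oui", "Oui", "Non")
)

tab <- table(df$sexe, df$cinema)  # cross-tabulate the two qualitative variables
print(tab)

prop.table(tab, margin = 1) * 100  # convert counts to row percentages
```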

#### 0.2.5. Calculating percentage for two qualitative data

To calculate percentages in the contingency table:

Data %>% crosstab(sexe, occup, add_total = TRUE, percentages = TRUE)

[Output not shown in this excerpt]

## 1. Normality analysis

Several statistical procedures, such as correlation, regression, t-tests, and analysis of variance, called parametric tests, are based on the assumption that the data follow a normal (Gaussian) distribution (Ghasemi & Zahediasl, 2012). Conversely, these authors added that when working with large enough sample sizes (> 30 or 40), the violation of the normality assumption should not cause major problems.

Normality can be checked visually or numerically. Ghasemi and Zahediasl proposed the histogram, stem-and-leaf plot, boxplot, P-P plot (probability-probability plot), and Q-Q plot (quantile-quantile plot). Before conducting any statistical test, you must know the null hypothesis being tested. Another thing to know is the significance level alpha (α), which constitutes a cut-off for accepting or rejecting the null hypothesis. In most publications, 5% is used as the cut-off. However, one can set α at 10% or less.

In testing the normality assumption, the **null hypothesis** of the **test** is that **the data are normally distributed**. Thus, if the p-value is less than the chosen alpha level (5%, 10%), the **null hypothesis** is rejected and there is evidence that the **tested** data are not normally distributed. To show how normality testing is conducted in R, we generate the following data. Data can also be imported into R from one's own file.
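As a quick illustration of this decision rule, base R's Shapiro-Wilk test can be applied to simulated data (the sample below is hypothetical, not the dataset used later in this guide):

```r
# Simulated right-skewed sample, for illustration only
set.seed(42)
x <- rexp(50, rate = 1)  # exponential data: non-normal by construction

st <- shapiro.test(x)    # H0: the data are normally distributed
print(st$p.value)

# Decision at alpha = 0.05
if (st$p.value < 0.05) {
  message("Reject H0: evidence that the data are not normally distributed")
} else {
  message("Fail to reject H0: no evidence against normality")
}
```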

### 1.1. Analyzing normality visually

To analyze visually whether data are normally distributed, a boxplot can be used. qqPlot() can also be used, but the “car” package must be installed. One can also use ggqqplot(), but the “ggpubr” package must be installed. The normality assumption must be checked for small samples (n < 30). Ghasemi and Zahediasl (2012) indicated that when working with large enough sample sizes (> 30 or 40), the violation of the normality assumption should not cause major problems. This follows from the central limit theorem, which indicates that, no matter what distribution the data have, the sampling distribution of the mean tends to be normal if the sample is large enough (n > 30).

#Data creation

[Code and figure not shown in this excerpt]

Figure 5: Boxplot normality test

As can be seen on the boxplot, some observations fall outside the whiskers of the boxplot.
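The data-creation code itself is not reproduced in this excerpt; a minimal sketch that produces a variable X containing a few outliers (the values are invented for illustration) could be:

```r
# Simulated variable with two artificial outliers, for illustration only
set.seed(7)
X <- c(rnorm(28, mean = 50, sd = 5), 90, 110)

boxplot(X, main = "Boxplot of X")  # the outliers appear beyond the whiskers
```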

Loading the libraries required for qqPlot() and ggqqplot():

library(car)

library(ggpubr)

qqPlot(X, main = "Testing normality for X")

In this code, we begin with the function followed by its arguments in parentheses. X is the name of the variable to be tested, and main is an argument that adds a title to the figure.

[Figure not shown in this excerpt]

Figure 6: qqPlot normality test

The most interesting feature of the ggqqplot() function is that observations that differ from the majority of observations, called outliers, are labeled. Here, observations 18 and 23 are considered far from most observations. As can be seen in Figure 6, some points deviate markedly from the straight line, indicating that the data are not normally distributed.

ggqqplot(X,main=" normality analysis",col="orangered",ylab="observations")

In this code, we begin with the function followed by its arguments in parentheses. X is the name of the variable to be tested, main adds a title, col personalizes the color of the figure, and ylab sets the label of the y-axis.

**[...]**

- Quote paper
- Docteur Antoine Niyungeko (Author), 2021, Practical Guide for Data Analysis Using R Tool, Munich, GRIN Verlag, https://www.grin.com/document/1010252
