The purpose of this guide is to show how to conduct data analysis using the R tool. It does not aim to teach statistics or related fields; rather, it shows practically when and how inferential statistics are conducted, for readers with little knowledge of the R programming environment. It gathers the packages needed to conduct data analysis and indicates, step by step, how to choose a statistical test based on the research question. It also presents the assumptions that must be met to validate a statistical test. This guide covers normality tests; correlation analysis (numerical, ordinal, binary, and categorical); multiple regression analysis; robust regression; nonparametric regression; comparing a one-sample mean to a known standard mean; comparing the means of two independent groups; comparing the means of paired samples; comparing the means of more than two groups; independence tests; comparing proportions; goodness-of-fit tests; testing for stationarity in time series; exploratory factor analysis; confirmatory factor analysis; and structural equation modeling. Scripts and code are provided for each test, along with guidance on how to report the results of each analysis. This guide will help researchers and data analysts and will contribute to increasing the quality of their publications.
Table of Contents
0. Introduction
0.1. Introduction to data manipulation
0.1.1. Analyzing the data structure
0.1.2. Analyzing first observation
0.1.3. Counting the number of observations and variables
0.1.4. Selecting sample
0.1.5. Selecting variables
0.1.6. Dropping Variables
0.1.7. Renaming variable
0.1.8. Filtering subset data
0.1.9. Summarizing numerical variables
0.1.10. Group data by categorical variable
0.1.11. Selecting rows by position
0.1.12. Use of IF ELSE Statement
0.1.13. Selecting with condition
0.1.14. Summarizing the number of levels in factor variables
0.1.15. Identifying levels of factor variable
0.2. Data summarizing
0.2.1. Central tendency and dispersion measures for quantitative variable
0.2.2. Calculating frequency for qualitative variable
0.2.3. Analyzing quantitative and qualitative data
0.2.4. Analyzing two qualitative data
0.2.5. Calculating percentage for two qualitative data
1. Normality analysis
1.1. Analyzing normality visually
1.2. Testing normality numerically
1.3. Testing normality using skewness and kurtosis
2. Correlation Analysis
2.1. Pearson and Spearman Correlation
2.2. Partial correlation
2.3. Polyserial correlation
2.4. Point-biserial correlation
3. Multiple Regression Analysis
3.1. Assumptions of Multiple regression
3.2. Testing Multiple Regression Assumption in R
3.2.1. Check the linearity of the model
3.2.2. Analyzing the mean of residuals.
3.2.3. Testing homoscedasticity
3.2.4. Testing normality of residuals
3.2.5. Testing for Independence of residuals
3.2.6. Checking for collinearity
3.2.7. Checking for Model Outliers
3.2.8. Checking for Other Assumptions
3.3. Variable Importance
4. Mean Comparison
4.1. One-sample t-test
4.2. Comparing the means of two independent groups
4.2.1. Unpaired Two Samples T-test (parametric)
4.2.2. Unpaired Two-Samples Wilcoxon Test (non-parametric)
4.2.3. Comparing means for paired samples
4.2.3.1. Preliminary test to check paired t-test assumptions
5. Comparing the means of more than two groups
6. Test of Independence
7. Testing Association between two nominal variables
8. Comparing proportion and independence test
8.1. One-proportion Z-test
8.2. Two-proportions z-test
9. Testing for stationarity for time series
10. Factor Analysis
10.1. Exploratory Factor Analysis
10.1.1. Descriptive statistics
10.1.2. Testing correlation and sample size for factor analysis
10.1.3. Reliability
10.1.4. Identifying Optimal Number of Factors
10.1.4.1. Using Eigenvalues
10.1.4.2. Parallel analysis
10.1.5. Run the EFA with seven factors
10.2. Confirmatory Factor Analysis
10.2.1. Indices of Goodness of Fit
10.2.2. Fitting the model with CFA
10.2.2.1. Specify the model
10.2.2.2. Fitting the model
10.2.2.3. Getting the model summary, confidence intervals and goodness of fit indicators
10.2.2.4. Obtain confidence intervals for the estimated coefficients
10.2.2.5. Obtain goodness of fit indicators of the model
10.2.2.6. Reliability Analysis
10.2.2.7. Getting standardized estimates
10.2.2.8. Plotting the estimates
10.2.3. Fitting the model with CFA using SEM
10.2.3.1. Specify the model
10.2.3.2. Fitting the model
10.2.3.3. Summarizing the model
10.2.3.4. Getting goodness of fit indicators of the model
10.3. Analyzing data that are not normally distributed
11. Structural Equation Modeling
11.1. Model one
11.1.1. Model Specification
11.1.2. Model fitting
11.1.3. Summarizing the model
11.1.4. Getting goodness of fit indicators of the model
11.1.5. Reporting the goodness of fit indicators of the model
11.2. Model two
11.2.1. Model specification with residual covariance
11.2.2. Model fit
11.2.3. Model summary
11.2.4. Reporting the goodness of fit indicators of the model
12. Goodness-of-Fit Measures
13. Mediation and Moderation
13.1. Baron and Kenny procedures
13.1.1. Analyzing the Relationship between Independent variable (FDI) and mediator variable (EXP).
13.1.2. Variables EXP and GDP must be related once the effect of FDI is controlled
13.1.3. Analyzing the relationship between Independent and Dependent variables
13.1.4. Analyzing the decrease of the Relationship between FDI and GDP
13.2. Mediation analysis using Nonparametric bootstrap
13.3. Robust mediation analysis
14. Robust Regression
14.1. Investigating Data Normality
14.2. Identifying Outliers
14.3. Linear Regression
14.4. Robust regression
14.5. Comparing Robust and Linear regression
15. Nonparametric regression
15.1. Kendall–Theil Sen Siegel nonparametric linear regression
15.2. Generalized additive models
Objectives and Research Scope
This guide provides a practical, step-by-step introduction to conducting data analysis using the R environment, specifically designed for users with limited prior programming knowledge. The primary objective is to equip researchers and data analysts with the necessary R packages, code scripts, and statistical procedures to ensure valid and high-quality research outcomes.
- Practical application of statistical tests, including normality, correlation, and regression.
- Step-by-step guidance on data manipulation and preparation using R packages like dplyr.
- Methods for validating statistical assumptions to prevent errors in research findings.
- Advanced modeling techniques such as Exploratory Factor Analysis (EFA), Confirmatory Factor Analysis (CFA), and Structural Equation Modeling (SEM).
- Procedures for mean comparisons, independence tests, and robust regression analysis.
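As a minimal illustration of the dplyr verbs the guide relies on for data manipulation, the sketch below uses the built-in mtcars data set (assuming the dplyr package is installed) to filter rows, select variables, and summarize by group:

```r
# Minimal dplyr sketch using the built-in mtcars data set:
# filter rows, select variables, then summarize by group.
library(dplyr)

res <- mtcars %>%
  filter(mpg > 20) %>%                      # keep fuel-efficient cars only
  select(mpg, cyl, hp) %>%                  # keep three variables
  group_by(cyl) %>%                         # group by number of cylinders
  summarise(mean_mpg = mean(mpg),           # per-group mean mpg
            n = n())                        # per-group count
res
```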
Extract from the Book
1. Normality analysis
Several statistical procedures, such as correlation, regression, t-tests, and analysis of variance (the so-called parametric tests), are based on the assumption that the data follow a normal, or Gaussian, distribution (Ghasemi & Zahediasl, 2012). These authors add, however, that when working with large enough sample sizes (> 30 or 40), violation of the normality assumption should not cause major problems.
Normality can be checked visually or numerically. Ghasemi and Zahediasl propose the histogram, stem-and-leaf plot, boxplot, P-P plot (probability-probability plot), and Q-Q plot (quantile-quantile plot). Before conducting any statistical test, you must know the null hypothesis being tested. You must also choose a significance level, alpha (α), which constitutes the cut-off for accepting or rejecting the null hypothesis. In most publications, 5% is used as the cut-off, but one may also set α at 10% or lower.
In testing the normality assumption, the null hypothesis is that the data are normally distributed. Thus, if the p-value is less than the chosen alpha level (e.g., 5% or 10%), the null hypothesis is rejected and there is evidence that the tested data are not normally distributed. To show how a normality test is conducted in R, we generate the following data; data can also be imported into R from a file.
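A minimal base-R sketch of both the visual and the numerical check, using simulated data in place of the book's own example:

```r
# Simulate a sample; in practice you would import your own data,
# e.g. with read.csv(). The seed makes the example reproducible.
set.seed(123)
x <- rnorm(100, mean = 50, sd = 10)

# Visual checks: histogram and Q-Q plot
hist(x)
qqnorm(x)
qqline(x)

# Numerical check: Shapiro-Wilk test
# H0: the data are normally distributed
res <- shapiro.test(x)
res$p.value  # compare the p-value with the chosen alpha level
```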
Summary of Chapters
0. Introduction: Outlines the necessity of applying correct statistical techniques in R to prevent common errors and invalid conclusions in research.
1. Normality analysis: Discusses the visual and numerical methods for testing the assumption of normality, which is critical for the validity of parametric tests.
2. Correlation Analysis: Details how to measure the strength and direction of relationships between variables using techniques like Pearson and Spearman correlation.
3. Multiple Regression Analysis: Explains the assumptions and implementation of linear regression models in R, focusing on diagnostic checks for model validity.
4. Mean Comparison: Covers procedures for comparing sample means, including t-tests for independent and paired samples, as well as non-parametric Wilcoxon alternatives.
5. Comparing the means of more than two groups: Introduces one-way ANOVA for comparing multiple group means and the Kruskal-Wallis test as a non-parametric alternative.
6. Test of Independence: Explains the use of the Chi-square test to analyze associations between categorical variables in contingency tables.
7. Testing Association between two nominal variables: Focuses on measures like Phi Coefficient and Cramer's V for evaluating associations between categorical variables.
8. Comparing proportion and independence test: Discusses One-proportion and Two-proportions Z-tests for comparing observed proportions.
9. Testing for stationarity for time series: Describes the importance of stationarity in time-series data and how to test for unit roots using ADF and KPSS tests.
10. Factor Analysis: Explains both Exploratory (EFA) and Confirmatory Factor Analysis (CFA) to identify and test latent variables.
11. Structural Equation Modeling: Introduces the SEM framework for testing complex theoretical models involving multiple latent constructs.
12. Goodness-of-Fit Measures: Outlines the chi-square goodness of fit test for comparing observed versus expected distributions in discrete data.
13. Mediation and Moderation: Presents approaches for mediation analysis using the Baron and Kenny procedure and non-parametric bootstrap methods.
14. Robust Regression: Explores techniques to ensure reliable regression analysis even when the data contain outliers.
15. Nonparametric regression: Introduces robust alternatives like Kendall–Theil Sen Siegel regression and Generalized Additive Models (GAMs).
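As an illustration of the independence test summarized in Chapter 6, a minimal base-R sketch on a hypothetical 2x2 contingency table:

```r
# Hypothetical 2x2 contingency table: group membership vs. outcome.
tab <- matrix(c(30, 10,
                20, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("A", "B"),
                              outcome = c("success", "failure")))

# H0: the two variables are independent
res <- chisq.test(tab)
res$p.value  # compare with the chosen alpha to accept or reject H0
```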
Keywords
Data analysis, R tool, correlation, multiple regression, structural equation modeling, t-test, ANOVA, independence test, normality test, factor analysis, mediation, moderation, robust regression, nonparametric, stationarity.
Frequently Asked Questions
What is this guide primarily about?
This guide serves as a practical, technical manual for researchers and analysts on how to perform various statistical data analyses using the R programming environment.
What are the central thematic fields covered?
The book covers a broad spectrum of statistical analysis including descriptive statistics, correlation, regression analysis, mean comparisons, factor analysis, structural equation modeling, and mediation analysis.
What is the primary goal of the author?
The author's goal is to bridge the gap between statistical theory and practical implementation, helping users perform tests correctly to increase the quality and reliability of their published research.
Which scientific methods are primarily used?
The book employs both parametric and non-parametric statistical methods, ranging from standard tests like t-tests and ANOVA to more complex techniques like EFA, CFA, and SEM.
What is the focus of the main content?
The main content focuses on practical R code, the interpretation of statistical output, and the validation of specific assumptions required for each test, such as normality, homoscedasticity, and independence.
How would you describe the keyword profile of this work?
The work is characterized by keywords related to empirical research methods and R programming, specifically focusing on data manipulation, regression, and complex structural modeling.
Why is it necessary to perform normality tests before conducting regressions?
Parametric tests assume normally distributed data; violating this assumption can lead to biased regression coefficients and invalid standard errors, potentially resulting in false Type I or Type II errors.
What does the book suggest when data contains outliers?
The author advises using robust regression methods, such as the `lmRob` or `mblm` functions, which down-weight the influence of deviating observations to yield more reliable results compared to standard linear regression.
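The contrast can be sketched with base tooling: the guide's examples use `lmRob` and `mblm`, but `MASS::rlm` (shipped with every R installation) illustrates the same down-weighting idea on hypothetical data with one injected outlier.

```r
library(MASS)  # rlm: robust regression via M-estimation

set.seed(42)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)   # true slope is 2
d$y[20] <- 200               # inject one gross outlier

fit_ols <- lm(y ~ x, data = d)   # OLS: slope pulled toward the outlier
fit_rob <- rlm(y ~ x, data = d)  # robust fit: outlier down-weighted

coef(fit_ols)["x"]
coef(fit_rob)["x"]  # closer to the true slope of 2
```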
- Text citation
- Doctor Antoine Niyungeko (Author), 2021, Practical Guide for Data Analysis Using R Tool, Munich, GRIN Verlag, https://www.grin.com/document/1010252