Regression Analysis is an important statistical tool for many applications. The most frequently used approach to Regression Analysis is the method of Ordinary Least Squares. This method, however, is vulnerable to outliers: even a single outlier can spoil the estimation completely. How can this vulnerability be described in theoretical terms, and are there alternatives? This thesis gives an overview of the relevant concepts and of alternative approaches.
This thesis introduces the three fundamental approaches to Robustness (qualitative, infinitesimal, and quantitative Robustness) and applies them to different estimators. The estimators under study are measures of location, scale, and regression. The Robustness approaches are important not only for the theoretical assessment of estimators but also for the development of alternatives to classical estimators. The focus lies on the (Robustness) performance of estimators when outliers occur in the data set. Measures of location and scale provide necessary stepping stones into the topic of Regression Analysis. In particular, the median and trimming approaches are found to produce very robust results.
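The robustness of the median and the trimmed mean compared to the arithmetic mean can be sketched with a small numerical example (hypothetical data; the 10% trimming fraction is chosen purely for illustration — the thesis itself works in SAS, so Python is used here only as a sketch language):

```python
from statistics import mean, median

def trimmed_mean(data, alpha=0.1):
    """Mean after discarding the lowest and highest alpha fraction."""
    xs = sorted(data)
    k = int(alpha * len(xs))
    return mean(xs[k:len(xs) - k])

# A clean sample near 10, and the same sample with one gross outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
contaminated = clean[:-1] + [1000.0]  # replace one value by an outlier

print(mean(contaminated))           # dragged far away from 10
print(median(contaminated))         # stays near 10
print(trimmed_mean(contaminated))   # stays near 10
```

The single gross error drags the arithmetic mean far from the bulk of the data, while the median and the trimmed mean remain essentially unchanged.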
These results are used in Regression Analysis to find alternatives to the method of Ordinary Least Squares. Its vulnerability can be overcome by applying the methods of Least Median of Squares or Least Trimmed Squares. Different outlier diagnostic tools are introduced to improve the poor efficiency of these Regression Techniques. Furthermore, this thesis presents a simulation of several Regression Techniques in different situations, focusing in particular on changes in the regression estimates when outliers occur in the data.
The theoretically derived results as well as the results of the simulation lead to the recommendation of the method of Reweighted Least Squares. Applied to problems of Regression Analysis, this method provides outlier-resistant and efficient estimates.
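The idea behind Reweighted Least Squares can be sketched in a few lines of Python. This is a deliberately simplified one-step version with hard 0/1 rejection weights, not the thesis's exact procedure (which starts from a high-breakdown LMS/LTS fit); here the robust initial estimate is simply assumed, and the data are hypothetical:

```python
import numpy as np

def reweighted_ls(X, y, beta0, c=2.5):
    """One reweighting step: estimate a robust residual scale via the
    normalized MAD, give zero weight to points with |residual| > c*scale,
    and refit by weighted least squares on the remaining points."""
    r = y - X @ beta0
    scale = 1.4826 * np.median(np.abs(r - np.median(r)))  # robust scale
    w = (np.abs(r) <= c * scale).astype(float)            # 0/1 weights
    XtW = X.T * w                                         # apply weights column-wise
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical demo: y = 1 + 2x with small noise and one gross outlier.
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
noise = np.array([0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15, -0.15, 0.1, -0.1])
y = 1.0 + 2.0 * x + noise
y[9] += 50.0                                              # contaminate one observation

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # plain OLS, spoiled
beta_rls = reweighted_ls(X, y, np.array([1.0, 2.0]))      # robust start assumed
print(beta_ols, beta_rls)
```

The contaminated observation pulls the plain OLS slope far from 2, while the reweighting step rejects it and recovers estimates close to the true intercept and slope.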
Table of Contents
1. Introduction
2. The Classical Linear Ordinary Least Squares Regression
2.1. Introduction to OLS
2.2. Properties of the Least Squares Estimates
2.3. Problems of OLS
3. Outliers and OLS
3.1. Outlier definition and common error sources
3.2. Outliers in Regression Analysis and their influence on OLS results
4. The concept of Robustness
4.1. Introduction to Robustness
4.2. Qualitative Robustness
4.3. Infinitesimal Robustness
4.4. Quantitative Robustness
4.5. Robust Estimates
4.6. On asymptotic Results
5. Some measures of location and scale – with regard to their robustness properties
5.1. Introduction
5.2. Measures of location
5.2.1. A Definition
5.2.2. The Arithmetic Mean
5.2.3. The Median
5.2.4. Trimmed mean(s)
5.2.5. Other measures of location
5.3. Measures of scale
5.3.1. A Definition
5.3.2. The Standard deviation
5.3.3. The Median Absolute Deviation (MAD)
5.3.4. The t-Quantile Range
5.3.5. Other scale estimates
5.4. Higher Dimensions
6. Robust Regression Techniques
6.1. An Introduction and Definition
6.2. M-Estimates
6.3. The Repeated Median
6.4. The Least Median of Squares Regression
6.5. The Least Trimmed Squares Regression
6.6. The Coakley – Hettmansperger Estimator
6.7. Reweighted Least Squares
6.8. The Multivariate Reweighted Least Squares Approach
6.8.1. The Hat Matrix
6.8.2. The Minimum Volume Ellipsoid Estimator
6.9. Other Regression Methods and Limitations
6.10. Conclusions on Robust Regression
7. Application to SAS and Simulation
7.1. Introduction to Robustness Application and Simulation purposes
7.2. The initial data set – The zero contamination case
7.3. Seemingly negligible contamination in X-direction
7.4. Seemingly negligible contamination in Y-direction
7.5. High Leverage contamination
7.6. Large overall contamination
8. Conclusions
Objectives and Topics
This thesis explores the vulnerability of the Ordinary Least Squares (OLS) regression method to outliers and evaluates alternative robust regression techniques. The primary objective is to examine how outliers affect OLS results and to introduce, discuss, and simulate more robust statistical estimators that provide reliable estimates in the presence of contaminated data. Key topics include:
- The theoretical foundations of Robustness (qualitative, infinitesimal, and quantitative).
- Evaluation of various measures of location and scale regarding their robustness properties.
- Detailed analysis of Robust Regression Techniques, including M-Estimates, Least Median of Squares, and Least Trimmed Squares.
- Implementation and comparative simulation using SAS to assess the performance of regression methods under different levels of outlier contamination.
- Practical recommendations for robust regression, emphasizing the Reweighted Least Squares (RLS) method.
Excerpt from the Book
3.2. Outliers in Regression Analysis and their influence on OLS results
A point (xi1, ..., xip, yi) that deviates from the (in our case: linear) relation described by the majority of the data is called a Regression Outlier. The term linear relation describes the relation between the dependent and independent variables. A deviation from the actual trend can result from deviations either in the dependent or in the independent variable space.
If an x-value xi lies far away from the bulk of the other x-values, the observation (xi, yi) is called a leverage point, or an "outlier in the x-direction". This definition does not take the y-value of the observation into account. A leverage point is not necessarily harmful; a leverage observation can in fact be quite beneficial. We therefore have to distinguish between good and bad leverage points. If a leverage point is also a regression outlier, as defined above, it is called a bad leverage point. A bad leverage point has a huge influence on the Least Squares estimates. If a leverage point does fit the linear relation described by the other observations, it is very beneficial for the analysis; such an observation is called a good leverage point. To see the benefit of good leverage points, recall the covariance matrix of the Least Squares estimator, Var(β̂|X) = σ²(X'X)⁻¹: a higher dispersion among the independent variables reduces the variance of the LS estimator. This becomes more apparent if we reduce the regression model to a simple one.
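The contrast between good and bad leverage points described above can be reproduced with a small Python sketch using the closed-form Least Squares estimate (hypothetical data):

```python
import numpy as np

def ols(X, y):
    # Closed-form Least Squares estimate (X'X)^(-1) X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([1., 2., 3., 4., 5.])
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 1.0 * x                       # exact linear trend with slope 1

Xlev = np.vstack([X, [1.0, 20.0]])      # add an observation with extreme x
y_bad = np.append(y, 2.0)               # far below the trend (trend predicts 22)
y_good = np.append(y, 22.0)             # exactly on the trend

print(ols(X, y)[1])                     # slope 1.0 on the clean data
print(ols(Xlev, y_bad)[1])              # slope even turns negative: bad leverage
print(ols(Xlev, y_good)[1])             # slope stays 1.0: good leverage

# The good leverage point also shrinks the slope-variance factor in
# Var(b|X) = sigma^2 (X'X)^(-1):
print(np.linalg.inv(X.T @ X)[1, 1], np.linalg.inv(Xlev.T @ Xlev)[1, 1])
```

A single bad leverage point is enough to reverse the sign of the fitted slope, while a good leverage point leaves the fit intact and reduces the variance of the slope estimate, exactly as the covariance formula suggests.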
Summary of Chapters
1. Introduction: Presents the motivation for robust statistics, highlighting the vulnerability of OLS regression to outliers and outlining the scope of the thesis.
2. The Classical Linear Ordinary Least Squares Regression: Recalls the OLS method, its necessary assumptions, and its key properties like the Best Linear Unbiased Estimator (BLUE) status.
3. Outliers and OLS: Defines outliers and investigates their impact on regression results, specifically distinguishing between leverage points and vertical outliers.
4. The concept of Robustness: Introduces the three fundamental pillars of robustness theory: qualitative, infinitesimal, and quantitative robustness, which serve as evaluation tools for estimators.
5. Some measures of location and scale – with regard to their robustness properties: Analyzes traditional and robust estimators for center and dispersion, serving as a basis for understanding robust regression.
6. Robust Regression Techniques: Discusses detailed robust alternatives to OLS, covering M-Estimates, Repeated Median, Least Median of Squares, Least Trimmed Squares, and Reweighted Least Squares.
7. Application to SAS and Simulation: Provides a simulation-based comparison of OLS, M-estimation, and high-breakdown techniques under various contamination scenarios.
8. Conclusions: Synthesizes the results and recommends the Reweighted Least Squares method as a robust and efficient approach for practical regression analysis.
Keywords
Regression Analysis, Ordinary Least Squares, Robustness, Outliers, Leverage Points, Breakdown Point, Influence Function, M-Estimates, Least Median of Squares, Least Trimmed Squares, Reweighted Least Squares, Asymptotic Efficiency, Simulation, SAS, Statistical Inference.
Frequently Asked Questions
What is the core focus of this thesis?
The work primarily addresses the sensitivity of standard regression models to outlier contamination and identifies robust alternative methods that remain reliable when data violates classical assumptions.
What are the central themes covered?
The thesis explores the theoretical concepts of robustness, analyzes various estimators for location, scale, and regression, and performs extensive simulation studies to compare their performance.
What is the primary research goal?
The main goal is to evaluate techniques that maintain high statistical performance in the presence of outliers and to provide actionable recommendations for robust regression analysis.
Which statistical methods are discussed?
The thesis covers OLS, M-Estimates, the Repeated Median, Least Median of Squares (LMS), Least Trimmed Squares (LTS), and Reweighted Least Squares (RLS).
What does the main part of the work involve?
The main section evaluates specific robust regression estimators through their robustness properties, computational feasibility, and efficiency, culminating in simulation-based benchmarking.
Which keywords characterize this work?
Key terms include Robustness, Outliers, Breakdown Point, Regression Analysis, Least Trimmed Squares, and Statistical Simulation.
How does the author define a "bad" versus a "good" leverage point?
A leverage point is a data point with an x-value far from the bulk of data. A leverage point is "good" if it fits the linear trend of the majority of data, thereby reducing estimator variance; it is "bad" if it deviates from that trend and exerts undue influence on the regression line.
Why is the "masking effect" important in outlier detection?
The masking effect occurs when outliers prevent the detection of other outliers, rendering standard diagnostics like the Hat Matrix unreliable. This is why the thesis advocates for initial robust estimation to correctly identify these influential points.
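The masking effect can be reproduced in a few lines of Python (hypothetical data): two bad leverage points at the same extreme x-value pull the OLS fit towards themselves, so their residuals end up smaller than those of the clean points, and they split the hat-matrix leverage between them:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 20., 20.])
X = np.column_stack([np.ones_like(x), x])
# Clean trend y = 2 + x, plus two bad leverage points far below it.
y = np.array([3., 4., 5., 6., 7., 2., 2.])

H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix X (X'X)^(-1) X'
beta = np.linalg.solve(X.T @ X, X.T @ y)    # OLS fit on all points
resid = y - X @ beta

print(np.diag(H))   # the two x = 20 points share the leverage between them
print(resid)        # their OLS residuals are smaller than the clean points'
```

The pair of outliers masks itself: each gets only a moderate hat-matrix diagonal and a small OLS residual, so neither classical diagnostic flags them. A robust initial fit would instead leave both with large residuals and expose them immediately.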
Quote paper: Robert Finger (Author), 2006, Robust Methods in Regression Analysis – Theory and Application, Munich, GRIN Verlag, https://www.grin.com/document/73282