Estimation in Case of Endogenous Selection with Application to Wage Regression

Master's Thesis, 2016

108 Pages, Grade: 1,0



List of Abbreviations

List of Figures

List of Tables

1 Introduction

2 Theory and Models
2.1 Literature Review and General Theory
2.2 Application to the Linear Model in a Cross-Sectional Setting
2.3 Testability of the Independence Condition
2.4 Application to the Linear Model in a Pooled OLS Setting

3 Simulations
3.1 Cross-Sectional Simulations
3.2 Pooled OLS Simulations
3.3 Sargan-Hansen J-Test Simulations
3.4 Conclusions from the Simulations

4 Data and Statistical Analysis
4.1 Empirical Literature Review
4.2 Data Description, Variables and Manipulations
4.3 Statistical Analysis

5 Conclusions


A Mathematical Annex

B Figures

C Tables


I would like to express my gratitude to Christoph Breunig for his guidance throughout this thesis. I am also grateful to the participants and instructors of the Studienabschlussseminar for their very useful tips and remarks. Additionally, I would like to thank Xavier D’Haultfoeuille, who kindly replied to my questions despite being on holiday. Special thanks go to my family, my girlfriend Diana, and my friends for their understanding and encouragement.


This thesis addresses the problem of linear regression estimation with selectively observed response data when selection is endogenous. The approach relies critically on the existence of an instrument that is independent of the selection, conditional on potential outcomes and other covariates. A parametric two-step estimation procedure is proposed. In the first step, the probability of selection is estimated with a generalized method of moments estimator. The second step uses the estimated probability weights to perform an inverse probability weighted least squares estimation. Two potential estimators are presented and expressions for their asymptotic variance-covariance matrices are provided. As an extension, it is shown how the concept can be used in a multiple-period setup with a pooled weighted least squares estimator. Finite sample properties are illustrated in a Monte Carlo simulation study. An empirical illustration applies the theory to wage regressions using the Survey of Health, Ageing and Retirement in Europe dataset.

List of Abbreviations

DGP: Data generating process
FGLS: Feasible generalized least squares
GMM: Generalized method of moments
IPW: Inverse probability weighting
MAR: Missing-at-random
MCAR: Missing-completely-at-random
MNAR: Missing-not-at-random

List of Figures

1 Missingness in a Two-Period Setting

2 Normal Distribution and Distribution of Selective Observed Variable

3 Distribution of Selective Observed Variable and Distribution of Weighted Selective Observed Variable

4 Simulation Density 1 (Sargan-Hansen J-Test)

5 Simulation Density 2 (Sargan-Hansen J-Test)

6 Simulation Density 3 (Sargan-Hansen J-Test)

7 Simulation Density 4 (Sargan-Hansen J-Test)

8 Moment Conditions for Wave 1

9 Densities of Logarithmic Wages (Before Inverse Probability Weighting)

10 Densities of Logarithmic Wages (After Inverse Probability Weighting)

11 Moment Conditions for Wave 2

12 Moment Conditions for Waves 1-2

List of Tables

1 Simulation Section: Definitions

2 Simulation Section: Cross-Sectional DGPs

3 Simulation Section: Pooled OLS DGPs

4 Simulation Section: Sargan-Hansen J-Test DGPs

5 Variables Included in the Statistical Analysis

6 Simulation Results 1 (Cross-Sectional)

7 Simulation Results 2 (Cross-Sectional)

8 Simulation Results 3 (Cross-Sectional)

9 Simulation Results 4 (Cross-Sectional)

10 Simulation Results 5 (Cross-Sectional)

11 Simulation Results 6 (Cross-Sectional)

12 Simulation Results (Pooled OLS)

13 Simulation Results (Sargan-Hansen J-Test)

14 Countries Included in SHARE

15 Summary Statistics for Wave 1

16 Summary Statistics for Wave 2

17 OLS Estimates for Available Responses in Wave 1

18 Probit Estimates for Response Probability in Wave 1

19 GMM Estimates for Response Probability in Wave 1

20 OLS-IPW Estimates in Wave 1

21 Robustness Analysis in Wave 1

22 OLS Estimates for Available Responses in Wave 2

23 Probit Estimates for Response Probability in Wave 2

24 GMM Estimates for Response Probability in Wave 2

25 OLS-IPW Estimates in Wave 2

26 Robustness Analysis in Wave 2

27 POLS Estimates for Available Responses in Waves 1-2

28 Probit Estimates for Response Probability in Waves 1-2

29 GMM Estimates for the Response Probability in Waves 1-2

30 OLS-IPW Estimates in Waves 1-2

31 Robustness Analysis in Waves 1-2

1 Introduction

Econometrics and Statistics provide useful tools for empirical research in the Social Sciences. "Regression-like" models are very popular and are used in a wide variety of fields that rely on cross-sectional or panel data. However, especially in the Social Sciences and Economics, much empirical research relies on surveys, interviews and data collections where non-response is very common. While this might not be problematic in some special cases, in most cases it is a serious concern, especially if a significant number of non-responses occurs for the very variable that the researcher wants to use as the dependent variable in her model. The problem becomes even worse if the non-response mechanism is connected with the level of the variable of interest.

A rather drastic example that illustrates the potential consequences of mishandling non-responses is a study conducted by the German Institute for Economic Research (DIW) in 2009. In their study, the scholars used data from the German Socioeconomic Panel in 2005 and estimated the poverty risk of German children. They concluded that 16.3% of all children in their sample were at risk of poverty. This surprisingly high share was all the more controversial because the number was published in a report of the Organisation for Economic Co-operation and Development (OECD) by Chapple and Richardson (2009). In this report, the OECD average share of children at risk of poverty is around 12.3%. Since the report was published three weeks before the Bundestagswahl in 2009, a heated debate about the poverty risk of children in Germany began, eventually leading to an increase of the child allowance by 20 Euro, which equals an increase of 10%. In 2011, the DIW published a share of 8.3% for 2005 without any further comment on the topic. In the end it turned out that the researchers at the DIW had made a small but momentous error: in cases of family income non-response, the income was set to zero. This procedure biased the family incomes towards zero and therefore wrongly increased the poverty risk.[1]

While this anecdote shows how consequential decisions on missing values can be, it does not even capture the problem of endogenous non-response. A small hypothetical example might help to understand this problem. Assume that a researcher is interested in the expectation of a random variable that is normally distributed with a mean of one and a variance of four. If the data generating process (DGP) were known to the researcher, she could immediately conclude that the expectation is equal to one. But, as almost always in empirical research, the researcher does not know the probabilistic structure; she can only collect data. Alas, as may also happen in real research, not all data is observable and the missingness itself depends on the random variable. Assume that the probability of observing a realization greater than the mean is 0.8, while the probability of observing a realization smaller than the mean is 0.4. Thus, it is more likely to observe a realization above the mean than one below it. As a consequence, the sample mean will be biased upwards if the missing values are simply excluded from the sample.

The graphical representation in figure 2 in the Annex illustrates this point. The solid line represents the true density of the random variable: a normal distribution centered around one and scaled by its variance. The dashed line represents the distribution of the data that the researcher can actually collect. Obviously, the selection mechanism shifts the density to the right and changes its general shape. Now, much more probability mass lies to the right of the true expected value. The expected value associated with the new density is roughly 1.53.[2]

What should the researcher do? One possible approach is to assume that the missingness is purely random and thus to use only the observations for which a response is available. Given that enough data is collected, the sample average will converge to 1.53, but it will not come any closer to the true expected value of one.
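The upward bias in this example can be checked numerically. The following minimal Python sketch simulates the hypothetical DGP described above (mean one, variance four, response probabilities 0.8 and 0.4) and averages only the observed draws. The function name, sample size and seed are illustrative assumptions and are not taken from the code accompanying this thesis.

```python
import random


def complete_case_mean(n=200_000, seed=1):
    """Average only the observed draws of Y* ~ N(1, 4) under endogenous selection."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        y = rng.gauss(1.0, 2.0)            # variance 4 means standard deviation 2
        p_obs = 0.8 if y > 1.0 else 0.4    # response probability depends on Y* itself
        if rng.random() < p_obs:           # the draw is observed
            total += y
            count += 1
    return total / count


naive = complete_case_mean()
print(round(naive, 2))  # expected to be close to 1.53, not to the true mean of 1
```

Increasing `n` tightens the estimate around 1.53 but does not move it towards one, which is the point made in the text.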

Many solutions are possible in such a situation. The solution that forms the main topic of this thesis is inverse probability weighting (IPW). The general intuition behind inverse probability weighting is that, if a value is observed only twice out of five tries, it should be scaled up by 5/2 = 2.5. This applies to realizations smaller than one, which are observed with probability 0.4. The same logic applies to realizations greater than one: those are observed with probability 0.8 and should be scaled by 1/0.8 = 1.25. Then, all weighted observations are summed up and divided by the number of all observations, including those that are missing. This approach is visualized in figure 3. As illustrated there, the weighting recovers the true expectation because the inverse probability weights shift and stretch the probability mass in such a way that it is balanced again around the true first moment. In this case, much of the probability mass is shifted to the left in order to counterbalance the upward bias.
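The weighting logic of this toy example can be sketched in a few lines of Python. The sketch assumes that the response probabilities 0.8 and 0.4 are known, which is a simplification of this illustration only; in the remainder of the thesis they have to be estimated. Function name, sample size and seed are hypothetical.

```python
import random


def ipw_mean(n=200_000, seed=1):
    """Inverse probability weighted mean for Y* ~ N(1, 4) with known response probabilities."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    for _ in range(n):
        y = rng.gauss(1.0, 2.0)
        p_obs = 0.8 if y > 1.0 else 0.4
        if rng.random() < p_obs:
            weighted_sum += y / p_obs      # weight 1.25 above the mean, 2.5 below it
    # Divide by the number of ALL draws, including the missing ones.
    return weighted_sum / n


ipw = ipw_mean()
print(round(ipw, 2))  # expected to be close to the true mean of 1
```

Dividing by the full sample size rather than by the number of observed draws is what makes the weighted sum an estimate of the population mean.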

The theoretical foundations of this thesis are the results of D’Haultfoeuille (2010) and Breunig et al. (2015). In these two papers, a nonparametric framework for estimation in case of endogenous selection based on an instrumental variable approach is developed. The central idea is the assumption of having an instrument at hand that is independent of the missingness mechanism once the potential response and covariates are controlled for. This instrument enables the researcher to form a moment condition, allowing for the estimation of the probability of non-response. After the estimation of those probabilities, an estimator that uses inverse probability weights can be employed.

The main research agenda of this thesis is to show how this concept can work in a parametric two-step estimation procedure when the equation of interest can be described by a linear model and the inverse probabilities can be estimated by a parametric link function employing the generalized method of moments (GMM). Special attention is given to the differences and similarities between the two candidate estimators that arise in such a setting. One could use an estimator that applies inverse probability weights only to the response variables, or an estimator that weights the response variables as well as the covariates. It turns out that the latter performs much better in terms of efficiency and precision. Another central part of this thesis is the development of estimators for the variance-covariance matrix of such a two-step estimator. Here, large sample analysis is used to find expressions that are easy to interpret and estimate. It is also shown how the approach can be extended to a multiple-period setting. Simulations are used for further investigation of finite sample properties. An application to wage regression, using the Survey of Health, Ageing and Retirement in Europe (SHARE), shows how the theory can be applied. Although the application potentially suffers from the absence of a valid instrument, it provides some evidence that returns to educational attainment might be overestimated in this dataset if one does not correct for the endogenous non-response problem.

This thesis is structured in four main sections. Section 2 gives a literature review and explains the theory, first in general and then in the context of parametric estimators. Most attention is given to the estimation of the standard errors in such an estimation procedure. Besides the standard errors, possible extensions of the theory to multiple-period data are discussed. Section 3 provides simulations for the estimators. Section 4 describes the SHARE dataset and shows an application of the theory. Section 5 concludes. Additional material on mathematical derivations can be found in the Mathematical Annex. All simulations and empirical findings can be reproduced with the code on the USB flash drive attached to this thesis.

2 Theory and Models

This section is organized as follows. Subsection 2.1 reviews the literature and describes the theory developed by D’Haultfoeuille (2010) and Breunig et al. (2015) in a very general way. Based on these general results, subsection 2.2 shows how the theory can be used in the framework of a linear cross-sectional model. Subsection 2.3 explains some ideas on how to test the central assumption of the theory. Subsection 2.4 shows how the linear model can be generalized to two periods in a pooled OLS framework.

A note on notation: the general explanations in subsection 2.1 use the notation of D’Haultfoeuille (2010), which denotes random variables by capital letters (e.g. X is a random variable). Subsections 2.2 and 2.4, where estimators are developed, employ the notation of Wooldridge (2010) with random variables in lower case letters with index i (e.g. x_i is regarded as a random variable). The switch of notation is justified because, for the general theory, the index does not improve understanding but would rather distract from the essential part. This is different for the sections where estimators are developed. There, the notation with lower case letters including the index i helps to distinguish the vectors from the matrices and is very useful for derivations such as establishing consistency or asymptotic distributions.

2.1 Literature Review and General Theory

In order to formalize the problem, assume that the researcher is interested in the relationship between a response variable Y* and a vector of covariates X. More precisely, she wants to investigate the conditional expectation E[Y*|X]. The Econometrician observes realizations of the random variables (Y, Δ, X), where Δ is an indicator that equals one if Y* is observed and zero otherwise. The actually observed variable Y is defined as Y = ΔY*.[3] In such a setting, it is central to know how the non-response mechanism works in order to identify the conditional expectation of interest. By the law of total expectation it holds that:

illustration not visible in this excerpt
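The display is not available in this excerpt. For orientation, the standard decomposition that the law of total expectation delivers in this setting is written out below in LaTeX. It is a generic textbook identity expressed in the notation introduced above and is not necessarily identical to the original display.

```latex
\begin{align*}
E[Y^* \mid X]
  &= E[Y^* \mid X, \Delta = 1]\, P(\Delta = 1 \mid X)
   + E[Y^* \mid X, \Delta = 0]\, P(\Delta = 0 \mid X)
\end{align*}
```

The first conditional expectation and both probabilities are identified from the observed data, whereas E[Y*|X, Δ = 0] is not. This is why assumptions on the selection mechanism are needed.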

If the selection mechanism is missing-completely-at-random (MCAR), then Δ ⊥ (Y*, X) holds, meaning that the selection is independent of the joint distribution of Y* and X. Provided MCAR is assumed, one can write [illustration not visible in this excerpt]. Thus, deletion of non-response observations is justified and the Econometrician need not worry too much about the missing values. An example of MCAR is provided by Briggs et al. (2003). Assume that an online questionnaire is distributed to individuals in order to ascertain some of their characteristics. Unfortunately, not all questionnaires are returned because of a random computer error. Then the non-response can be called MCAR, because the reason for the failure to return the questionnaire is unrelated to any variables under consideration.

While the idea of treating the selection as an entirely random process is very tempting, it is nevertheless an unrealistic assumption in many settings. Another classical approach is the missing-at-random (MAR) assumption (see for example Rubin (1976) or Rubin and Little (2002)). This assumption essentially states that, conditional on the observed covariates, the selection mechanism and the outcomes are independent: Δ ⊥ Y* | X, implying [illustration not visible in this excerpt]. Returning to the online questionnaire example, MAR would apply if the return of the questionnaire depends on some observable characteristics, but not on the outcome variable. Imagine that the submission of the online questionnaire does not work for some elderly people who have trouble handling their computer. If the researcher is interested in the health status as response variable Y*, controlling for X represented by age, she could work with the MAR assumption.

However, this assumption is also rather stringent, because the selection mechanism is often related to both the covariates and the response variable itself, a situation sometimes called missing-not-at-random (MNAR). There are numerous examples where such a suspicion is justified. To stay within the framework of the online questionnaire, this would be the case if people refuse to report the characteristic the researcher is interested in because it is associated with some social stigma, such as obesity, mental disorder or sexual preferences. If this is true, matters become more complicated. Now [illustration not visible in this excerpt] cannot be identified and E[Y*|X] is not estimable without further assumptions.
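The practical difference between the three mechanisms can be illustrated with a small simulation. The sketch below uses a purely hypothetical DGP (a binary covariate x, an outcome Y* = 2 - x + e with standard normal e, and made-up response probabilities) and reports the complete-case mean overall and within the group x = 0. None of these numbers come from the thesis.

```python
import random


def complete_case_means(mechanism, n=200_000, seed=7):
    """Return (overall mean, mean given x = 0) of the observed outcomes.

    Hypothetical DGP: x ~ Bernoulli(0.5), Y* = 2 - x + e with e ~ N(0, 1),
    so that E[Y*] = 1.5 and E[Y* | x = 0] = 2.
    """
    rng = random.Random(seed)
    sum_all = sum_x0 = 0.0
    n_all = n_x0 = 0
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 2.0 - x + rng.gauss(0.0, 1.0)
        if mechanism == "MCAR":
            p = 0.6                          # response unrelated to x and Y*
        elif mechanism == "MAR":
            p = 0.9 if x == 0 else 0.3       # response depends on x only
        else:                                # "MNAR"
            p = 0.9 if y > 1.5 else 0.3      # response depends on Y* itself
        if rng.random() < p:
            sum_all += y
            n_all += 1
            if x == 0:
                sum_x0 += y
                n_x0 += 1
    return sum_all / n_all, sum_x0 / n_x0


results = {m: complete_case_means(m) for m in ("MCAR", "MAR", "MNAR")}
for name, (overall, given_x0) in results.items():
    print(name, round(overall, 2), round(given_x0, 2))
```

In this setup, the complete-case mean is expected to be unbiased overall only under MCAR. Under MAR the overall mean is distorted, but the mean conditional on x is still close to 2. Under MNAR even the conditional mean is shifted upwards.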

Many methods for handling these different forms of missingness have been proposed in the literature. Some of the best known are imputation, identification at infinity, bounds, inverse probability weighting and instrumental variables.

The procedure of multiple imputation is very common in applied work, at least in Statistics. While many methods are concerned with the question of when it is permissible to simply drop missing observations, it is also possible to "fill in the gaps" where data is missing. Such methods are described in Rubin and Little (2002) and naturally require assumptions about how to impute the missing values. Mostly, it is again assumed that the data is MAR or MCAR in order to create estimates of the missing values. One of the most prominent methods is the so-called EM algorithm (expectation-maximization algorithm). This method treats missing values as random variables and aims to estimate the unknown parameters via maximum likelihood (see Dempster et al. (1977)).

Identification at infinity does not need the MAR assumption. In this approach, the problem of endogenous selection is solved by relying on a covariate with large support. This idea was developed by Chamberlain (1986) and uses the fact that selection becomes negligible for large values of the covariates. With this idea, the effects of covariates on the response variable can be identified in a linear model. The approach was refined by D’Haultfoeuille and Maurel (2013), who propose another identification strategy based on the idea that the selection variable becomes independent of the covariates when the response (instead of the covariates) tends to infinity.

The idea of imposing bounds on Y* conditional on X dates back to Manski (1989). If such a restriction can be made, then an estimable bound for the conditional expectation can be derived, and consequently bounds on the effects of covariates on the conditional expectation can also be found.

The technique of inverse probability weighting is very common in the survey literature and was pioneered by Horvitz and Thompson (1952). The estimator derived in that paper is now called the Horvitz-Thompson estimator. It applies inverse probability weights to the observed units to account for different proportions of observations within subsamples of a target population. In its original form, it relies on knowing the inverse probability weights. In Econometric applications, it is mostly assumed that the MAR assumption holds; the probability of response is then estimated using a binary-response model such as the Logit or the Probit. Examples in which the asymptotics of such an inverse probability weighted estimator are also derived can be found in Wooldridge (1999) and Wooldridge (2002).

There is also a large literature on the use of instruments. Mostly, it is assumed that there are two equations: one that describes the conditional expectation and one that represents the missingness mechanism. The instrument comes into play via assumptions on how it is related to these equations. A very famous example is the work of Heckman. In Heckman (1974) and Heckman (1979), the problem of endogenous selection is solved via a parametric two-step estimation method. It is assumed that there is a first step, in which selection is determined, and a second step, describing the equation of interest, which is influenced by the first step. In order to correct for the impact of the first step, it is assumed that at least one variable that appears in the selection equation is not part of the second-step equation. Hence, the method relies on instruments that determine the selection but not the outcome.

This thesis combines the last two approaches, as it also relies on the existence of an instrument and uses inverse probability weighting. It differs in some aspects, however, mostly because an instrumental strategy different from that of the Heckman model is employed and because the MAR assumption is not needed for the estimation of the probability weights. Instead, another, potentially restrictive, assumption is necessary. Analogous to D’Haultfoeuille (2010) and Breunig et al. (2015), the analysis fundamentally relies on the following assumption:

Assumption A1

Δ ⊥ X | Y*

In plain English, this means that the selection mechanism Δ is independent of X, provided that one has controlled for the potential response Y*. However, A1 does not imply that the selection mechanism is independent of X in general. To give an intuition why this assumption is useful, imagine a situation where the correlation between X and Y* is positive. If A1 holds, then X helps to estimate the probability of response by the following reasoning. If X is relatively low for realizations where Y* is not observed and relatively high for the observed realizations, then it is likely that the probability of response increases in Y*. Because X is conditionally independent of Δ, the variation of this variable provides observable exogenous variation for Y*, which allows the probability of response to be identified.

In which settings is such an assumption plausible? An interesting example can be found in Breunig et al. (2015). Assume that Y* denotes the quantity of alcohol a person consumes. In a typical survey it is very likely that persons with very high alcohol consumption refuse to report it. In this setting, the scholars used total expenditure as an instrument, a variable that is likely to influence the spending on alcohol but is unlikely to directly influence non-response.

Another application where A1 could be useful is wage regression. Income is often not reported if it is very low or very high.[4] In such a setting it is somewhat harder to think of a potential instrument. A very plausible instrument would be lagged income, as it is not unlikely that lagged income is independent of the non-response when controlling for contemporaneous income.

A similar approach is applied by Davezies et al. (2011) for the estimation of the rate of unemployment in France. As this rate was subject to attrition over time, the researchers tried to correct for potentially endogenous non-response by using past employment status as an instrument.

Zhao and Shao (2015) used A1 in order to investigate the probability of cotton factory workers developing dyspnoea. The dyspnoea statuses are given for three time periods, where no missing values are present in the first period, but a nontrivial share of non-responses is present in the two subsequent periods. In order to solve this problem, it is assumed that the dyspnoea status in the first available period can be used as an instrument satisfying A1.

Besides the practical examples, it might be helpful to express A1 in the context of a two-equation system. Assume that Y* is a function of the covariates and an error term ε. The equation that describes the selection of Y* depends on Y* itself and an error term η:

illustration not visible in this excerpt

This is the same representation as in D’Haultfoeuille (2010, Proposition 2.1), where it is shown that the assumption η ⊥ (X, ε) implies A1.[5] Therefore, in such a setting A1 can be interpreted as assuming that the error term of the selection equation is independent of the instrument X and of the error term of the equation of interest.

Since the identification of the conditional expectation E[Y*|X] will crucially depend on inverse probability weighting, it has to be assumed:

Assumption A2

illustration not visible in this excerpt

This assumption ensures that the inverse of the response probability always exists. In addition, it excludes the case where selection is a deterministic function of the outcome. If, for example, Δ = 1{Y* < b} for some real-valued constant b, then the probability of having Δ = 1 for responses greater than b is zero, violating A2.[5] [6]

Given this assumption, define the function g(·) = 1/P(Δ = 1 | Y* = ·). This function gives the inverse of the response probability conditional on Y*. The argument is the value Y* takes. Now it can be shown that identification is provided by inverse probability weighting:

illustration not visible in this excerpt

The first equality is just an extension by P(Δ = 1 | Y*) g(Y*) = 1. This extension is justified by assumption A2. Equality two uses the law of iterated expectations, and equality three uses assumption A1 together with the fact that the expectation of an indicator variable gives the probability of the variable being one. The last equality holds because Y* differs from Y only if Δ = 0. Hence, although the inverse probability weighting relies on the probability of response conditional on Y*, it only needs to be evaluated at the known Y in order to recover the conditional expectation. One should also recognize that recovering the conditional expectation does not come for free. If g is just a constant g ∈ [1, ∞) for all values Y* can take, then V[Y* g] = V[Y*] g² ≥ V[Y*]. Thus, the variance increases with inverse probability weighting, a result that follows very intuitively from the fact that the application of the weights stretches the density of Y, as illustrated in figure 3.
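Since the display of equation (1) is not available in this excerpt, the following LaTeX sketch writes out one possible version of the chain of equalities, reconstructed from the verbal description above. The ordering and grouping of the steps are an assumption and may differ from the original display; only the first and the last expression are pinned down by the surrounding text.

```latex
\begin{align*}
E[Y^* \mid X]
  &= E\big[\, Y^* g(Y^*)\, P(\Delta = 1 \mid Y^*) \,\big|\, X \big]
     && \text{(A2: } P(\Delta = 1 \mid Y^*)\, g(Y^*) = 1 \text{)} \\
  &= E\big[\, Y^* g(Y^*)\, E[\Delta \mid Y^*, X] \,\big|\, X \big]
     && \text{(A1 and } E[\Delta \mid \cdot\,] = P(\Delta = 1 \mid \cdot\,) \text{)} \\
  &= E\big[\, \Delta\, g(Y^*)\, Y^* \,\big|\, X \big]
     && \text{(law of iterated expectations)} \\
  &= E\big[\, \Delta\, g(Y)\, Y \,\big|\, X \big]
     && \text{(} Y = Y^* \text{ whenever } \Delta = 1 \text{)}
\end{align*}
```

In the last line the weight g only has to be evaluated where Δ = 1, that is, at observed values of the response.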

Since it is unlikely that the Econometrician knows g, it is a crucial point how to identify the probability of the sample selection. From assumption A1 a very useful moment condition can be derived:

illustration not visible in this excerpt

The proof is straightforward:

illustration not visible in this excerpt

In the first equality, the law of iterated expectations is applied. The second equality is a result of assumption A1. Equality three makes use of the definition of g. Again, this approach works, because in all cases where g must be evaluated at the unknown Y*, Δ is equal to zero.
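The moment condition can be checked numerically. The sketch below assumes that equation (2) has the form E[Δ g(Y) - 1 | X] = 0, which is consistent with the later statement E[W|X] = I_N, and verifies two of its unconditional implications by simulation. The DGP (a standard normal instrument X, Y* = X + e, and a logistic response probability in Y*) is a hypothetical choice made only for this illustration.

```python
import math
import random


def moment_check(n=200_000, seed=3):
    """Sample analogs of E[D g(Y) - 1] and E[(D g(Y) - 1) X] when A1 holds."""
    rng = random.Random(seed)
    m_const = m_x = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                       # instrument
        y_star = x + rng.gauss(0.0, 1.0)              # potential response
        p = 1.0 / (1.0 + math.exp(-(0.5 + y_star)))   # P(D = 1 | Y*), free of X given Y*
        observed = rng.random() < p
        # g(Y) = 1 / p is only needed where Y* is observed; otherwise D g(Y) = 0.
        resid = (1.0 / p if observed else 0.0) - 1.0
        m_const += resid
        m_x += resid * x
    return m_const / n, m_x / n


m_const, m_x = moment_check()
print(round(m_const, 3), round(m_x, 3))  # both expected to be close to zero
```

Both sample moments should be close to zero when the true response probability is used. Evaluating them at a wrong response probability generally moves them away from zero, which is what makes the condition usable for estimation.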

In order to exploit equation (2) parametrically, some more restrictive assumptions are needed. Additionally, most applications use covariates that do not fulfill A1. Assume that the covariates that determine the conditional expectation E[Y*|X] can be split as follows. Let X = (X1, X2)' be of dimension k, where X1 serves as an instrument satisfying assumption A1. Further, let X2 be of dimension q and define l = q + 1. Additionally, define the l-dimensional vector V ≡ (X2, Y*), altering A1 into:

illustration not visible in this excerpt

(ii) P(Δ = 1 | V) = F(V'θ0), where F is a known, continuous, differentiable and strictly increasing function mapping from R to (0,1), with f(x) ≡ dF(x)/dx

Here, A3(i) ensures that the expectation of X is identified with the given data. Condition A3(ii) assumes that the probability of response can be modeled by a link function F that has the properties of a cumulative distribution function applied to a linear index V'θ0. Therefore, the problem of estimating the probability of response is reduced to the problem of estimating the l-dimensional parameter vector θ0. For local identification a rank condition is needed:

Assumption A4

illustration not visible in this excerpt

Assumptions A1’, A3 and A4’ ensure global identification, such that [illustration not visible in this excerpt] holds only for θ = θ0.
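To make the parametric first step more concrete, the following Python sketch estimates θ0 in a deliberately simple, just-identified case. All specifics are assumptions of this illustration and are not taken from the thesis: the link F is logistic, X2 consists of a constant only, X1 is a scalar instrument, and the two sample moments are the unconditional implications of E[Δ/F(V'θ) - 1 | X] = 0 obtained with the instruments 1 and X1. With as many moments as parameters, GMM reduces to solving the sample moment equations, so no weighting matrix appears.

```python
import math
import random


def simulate(n, theta0, theta1, rng):
    """Hypothetical DGP: instrument x, Y* = x + e, logistic selection on Y*."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y_star = x + rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(theta0 + theta1 * y_star)))
        observed = rng.random() < p
        data.append((x, y_star if observed else 0.0, observed))  # Y = D * Y*
    return data


def estimate_theta(data, lo=0.0, hi=3.0, iters=40):
    """Solve mean(D/F - 1) = 0 and mean((D/F - 1) x) = 0 for (t0, t1).

    For the logistic link, 1/F = 1 + exp(-(t0 + t1 * y)). The first moment
    can therefore be solved for a = exp(-t0) given t1, and t1 is found by
    bisection on the second moment.
    """
    n = len(data)
    obs = [(x, y) for x, y, observed in data if observed]
    share_missing = 1.0 - len(obs) / n
    mean_x = sum(x for x, _, _ in data) / n
    mean_dx = sum(x for x, _ in obs) / n

    def profile(t1):
        s0 = sum(math.exp(-t1 * y) for _, y in obs) / n
        s1 = sum(x * math.exp(-t1 * y) for x, y in obs) / n
        a = share_missing / s0                    # first moment solved for exp(-t0)
        return a, mean_dx + a * s1 - mean_x       # value of the second moment

    # Bisection; assumes the profiled moment is decreasing in t1 and changes
    # sign on [lo, hi], which is plausible here because x and Y move together.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if profile(mid)[1] > 0.0:
            lo = mid
        else:
            hi = mid
    t1 = 0.5 * (lo + hi)
    a, _ = profile(t1)
    return -math.log(a), t1


rng = random.Random(11)
sample = simulate(20_000, theta0=0.5, theta1=1.0, rng=rng)
t0_hat, t1_hat = estimate_theta(sample)
print(round(t0_hat, 2), round(t1_hat, 2))  # expected to be near 0.5 and 1.0
```

The profiling trick relies on the closed form of the logistic link. For a general link function or an over-identified set of moments, one would instead minimize a quadratic form in the sample moments numerically.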

2.2 Application to the Linear Model in a Cross-Sectional Setting

In this subsection, the results of the previous subsection are applied to the linear model in its most basic form in a cross-sectional case. References for this application are Wooldridge (1999), Wooldridge (2010), Lin (2000) and Baser et al. (2004). In particular, the theory of the linear model, of two-step estimators and of inverse probability weighting is taken from Wooldridge, while the work of Lin (2000) and Baser et al. (2004) helps to understand the applied side of the models.

2.2.1 A Tale of Two Estimators

Assume that the true data generating process is given by:

Assumption OLS1

(i) y_i* = x_i'β0 + u_i, u_i ~ (0, σ²)

(ii) {(y_i*, x_i), i = 1, ..., N} is a random sample from the underlying population and can be regarded as a collection of i.i.d. random vectors

Here, x_i is a k-dimensional vector of random design covariates and β0 is the vector of parameters that the Econometrician is interested in. The parameter space is denoted by B ⊂ R^k. Let X = (x_1, ..., x_N)' represent the (N x k) matrix that contains the stacked vectors x_i' as rows, and define u as the vector that contains all N error terms u_i. Further, define the N-dimensional vector Y* as the vector in which all observations y_i* are stacked. Hence, the linear model can be written in matrix notation as

Y* = Xβ0 + u

In order to ensure consistency of the OLS estimator, it must be assumed that E[X'u] = 0 and that E[X'X] is of rank k. The first condition is equivalent to the assumption that u has mean zero and is uncorrelated with all covariates. The second condition allows for identification of β0, because the rank condition ensures that E[X'X] is invertible, resulting in a unique solution of the normal equations. Given these assumptions, the least squares estimator, defined as β̂_OLS = (X'X)^{-1}X'Y*, is consistent (see for example Wooldridge (2010, Chapter 4)). However, this is only true when no endogenous selection is present in the data. If all missing values are simply excluded, the error term is no longer independent of the covariates and E[X'u] = 0 fails to hold.[9]

Assume that all covariates in X are still fully observed, but that not all elements of Y* are observable anymore. The observed response can then be denoted by y_i = δ_i y_i*, with δ_i being an indicator showing whether y_i* is observed or not. Because of assumption OLS1(ii), this implies that {(y_i, δ_i, x_i), i = 1, ..., N} is a random sample from the underlying population.

In order to identify the conditional expectation, assumptions in the spirit of assumptions A1 and A2 from subsection 2.1 are needed. Let x_i = (x_{1,i}, x_{2,i})', where x_{1,i} is of dimension p and x_{2,i} is q-dimensional, with p + q = k. Also define v_i = (y_i*, x_{2,i})' of dimension l, with l = q + 1. Given these definitions, assume that x_{1,i} serves as a vector of instruments:

Assumption OLS-IPW

illustration not visible in this excerpt

(ii) The function P(δ_i = 1 | v_i) = p(v_i) > 0 is known to the Econometrician

Here, OLS-IPW(i) is the same assumption as A1’, and OLS-IPW(ii) states that, for some reason, the Econometrician simply knows the function p(v_i), an assumption that will soon be replaced by a more realistic one. In order to have a matrix expression for the probability weighting, define the (N x N) weighting matrix W ≡ diag({δ_i/p(v_i)}_{i=1}^N). The vector of observed response variables can be defined as Y ≡ diag({δ_i}_{i=1}^N) Y*. Assuming that the rank condition holds, together with E[X'u] = 0, OLS1 and OLS-IPW, a linear estimator can be derived:

illustration not visible in this excerpt

In this derivation, the first equality follows from assumption OLS-IPW(i) and the central result of equation (1). The first implication, from line one to line two, follows from the law of iterated expectations, and the third one uses assumption OLS1(i), which provides that the expectation of u is zero. Then X' is applied to both sides, expectations are taken and, in a last step, the rank condition helps to identify β0. Note that in this expression only the observed response vector Y is weighted. The expression (E[X'X])^{-1} uses all information on the covariates, including observations for which the response is missing. However, based on the result of equation (2), it follows that E[W|X] = I_N, with I_N being the (N x N) identity matrix. With the law of iterated expectations, another possible estimator can be established:

illustration not visible in this excerpt

This estimator puts weights on the response variables as well as on the covariates.[10] The obvious question is why one should choose the estimator from equation (4) instead of the estimator from equation (3), which weights only the vector of response variables, especially because all information on covariates for which the response is not available is lost. In order to answer this question, the statistical properties of both estimators are derived. Applying the analogy principle, i.e. replacing expectations by sample analogs, yields estimators for equations (3) and (4): define the response-weighted estimator β̂ ≡ (X'X)^{-1}X'WY and the fully weighted estimator β̃ ≡ (X'WX)^{-1}X'WY. The question of consistency of inverse probability weighted least squares estimators was first discussed by Nathan and Holt (1980) and Hausman and Wise (1981). Both β̂ and β̃ are consistent if it is assumed that:

Assumption OLS2

illustration not visible in this excerpt

Note that OLS2(i) demands that the weighted covariates are uncorrelated with the error terms. The second assumption provides the invertibility of E[X'WX] = E[X'X]. If the estimators should be unbiased, even stronger assumptions are needed:

illustration not visible in this excerpt

Here it is only possible to conclude that the estimator is unbiased if E[(X'X)^{-1}X'Wu] = 0 holds, a result that could be achieved by assuming E[Wu|X] = 0. The same finding applies to the fully weighted estimator (X'WX)^{-1}X'WY:

illustration not visible in this excerpt

[10] A similar estimator that weights both responses and covariates can be found in Lin (2000). In Breunig et al. (2015), only the estimator that weights the responses is considered.

Therefore, the estimators (X'X)^{-1}X'WY and (X'WX)^{-1}X'WY are unbiased if one is willing to assume that E[Wu|X] = 0. Under the weaker assumption E[X'Wu] = 0 alone, both estimators are biased. For consistency, however, the weaker assumption is enough. With finite first moments:

illustration not visible in this excerpt


illustration not visible in this excerpt


illustration not visible in this excerpt

Here the second line follows by the continuous mapping theorem and the general rules of probability limits (see Klenke (2013, Chapter 13)). Almost the same can be done for the second estimator:

illustration not visible in this excerpt
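The two estimators with known response probabilities, (X'X)^{-1}X'WY and (X'WX)^{-1}X'WY, can be computed in a few lines. The sketch below uses an assumed simulation design that is not taken from the thesis (one standard normal regressor plus a constant, true coefficients (1, 1), logistic response probability in y*). It treats p(v_i) as known, in the spirit of OLS-IPW(ii).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical design: one regressor and a constant, true coefficients (1, 1).
x1 = rng.normal(size=n)
X = np.column_stack([x1, np.ones(n)])
beta0 = np.array([1.0, 1.0])
y_star = X @ beta0 + rng.normal(size=n)        # latent response y*

# Assumed known response probability p(v_i), logistic in y*.
p = 1.0 / (1.0 + np.exp(-(y_star + 0.5)))
S = (rng.random(n) < p).astype(float)          # response indicator S_i
y = S * y_star                                 # observed response y_i = S_i * y_i*

# Diagonal of W. Only p for observed units matters, because S_i = 0 gives weight 0.
w = S / p

# Response-weighted estimator: (X'X)^{-1} X'WY
beta_resp = np.linalg.solve(X.T @ X, X.T @ (w * y))

# Fully weighted estimator: (X'WX)^{-1} X'WY
Xw = X * w[:, None]                            # rows of X scaled by the weights
beta_full = np.linalg.solve(Xw.T @ X, Xw.T @ y)

print("mean weight (should be near 1):", w.mean())
print("response-weighted estimate    :", beta_resp)
print("fully weighted estimate       :", beta_full)
```

The printed mean weight is a sample check of E[W|X] = I_N. Both estimates should lie close to the assumed true coefficients, which is what consistency suggests for a sample of this size.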

Besides consistency, the limiting distribution of the estimators is of high interest. A crucial point that must be clarified here is that the random variable p(v_i) from assumption OLS-IPW(ii) certainly depends on y_i* and hence also on β0, but not on the estimators β̂ or β̃. From this perspective, there is no need to apply the delta method when establishing the asymptotic variance of the estimators. OLS-IPW(ii) also implies that one does not have to bother at this point with the question of how the estimation of p(v_i) influences the variance-covariance matrix. If the second moments are finite, asymptotic normality can be shown by rewriting the estimator:

illustration not visible in this excerpt

where e ≡ WY - Xβ0 are just the residuals of this estimator. It holds that E[X'e] = 0 and V[X'e] = E[X'ee'X]. With N tending to infinity, (N^{-1}X'X)^{-1} converges in probability to (E[X'X])^{-1} by the law of large numbers and the continuous mapping theorem. By applying Slutsky's theorem, it holds that:

illustration not visible in this excerpt

At this point, nothing explicit can be said about which of the two estimators is more efficient. Since the inverse probabilities are greater than one and are applied to the covariates, the variance of the covariates increases, and one could think that this causes the variance to be lower in the case of the fully weighted estimator β̃ = (X'WX)^{-1}X'WY. On the other hand, the residuals of the two estimators are different. This reasoning leads to the heart of the problem that arises with the response-weighted estimator β̂ = (X'X)^{-1}X'WY. Weighting the response does indeed allow the expectation to be recovered, but this property does not carry over to the variance. While β̃ allows the use of the "true" residuals, which are weighted by 1/p^2(v_i) after squaring them, β̂ must rely on residuals that balance in expectation but increase strongly when squared. For β̂, the squared residual is (y_i/p(v_i) - x_i'β)^2 if S_i = 1 and (x_i'β)^2 if S_i = 0. For β̃, it is 0 for S_i = 0 and (y_i - x_i'β)^2/p^2(v_i) for S_i = 1. On top of that, the variation induced by the weighting of the residuals of β̃ is counterbalanced by the inverted weights in the outer part of the variance-covariance matrix. Although not unambiguous, this justifies the suspicion that β̃ could be more efficient. Estimable expressions for the asymptotic variance-covariance matrices are obtained by applying the analogy principle:

illustration not visible in this excerpt

using û_i = y_i - x_i'β̃. It should be highlighted that these variance-covariance matrices are robust to heteroscedasticity and, in the case of equation (6), "[...] simply the White (1980) heteroscedasticity-consistent covariance matrix estimator applied to the stratified sample, where all variables for observation i are weighted [...] before performing the regression." (see Wooldridge (1999, Page 1395)). Hence, if the probabilities of selection were known to the econometrician, her task would be accomplished. Unfortunately, this is almost never the case, and hence they have to be estimated. In order to do this, extend assumption OLS-IPW such that:

Assumption OLS-IPW’

(i) S_i ⊥ x_{1,i} | v_i
(ii) P(S_i = 1 | v_i) = F(v_i'θ0), where F is a known, continuous, differentiable and strictly increasing function, mapping from R to (0,1), with f(x) = dF(x)/dx
(iii) For all ξ ∈ R^l: P(v_i'ξ = 0 | S_i = 1) = 1 ⇒ ξ = 0
(iv) E[x_{1,i} | S_i = 1, v_i] = Γ1 x_{2,i} + Γ2 y_i, with Γ2 ≠ 0

Compared to assumption OLS-IPW, it is now assumed that the function that generates the non-response is not fully known, but can be characterized by an estimable vector of parameters θ0 (from the parameter space Θ ⊂ R^l) that, together with v_i, provides a linear index for a cumulative distribution function. Assumption OLS-IPW'(i) is implied by assuming η_i ⊥ (x_{1,i}, u_i) | x_{2,i}, as shown in the Mathematical Annex A.2. Hence, the researcher should be willing to assume that the variables contained in x_{1,i}, as well as the error term u_i from the equation of interest, are independent of the error term η_i in the selection equation once x_{2,i} is controlled for.

The second new element in OLS-IPW' is given by OLS-IPW'(iii) and OLS-IPW'(iv), which ensure global identification of θ0. If these assumptions hold, the moment condition can be derived using the law of iterated expectations:

illustration not visible in this excerpt

Equation (7) provides k moment conditions for GMM. As v_i is of dimension l, l = q + 1 parameters have to be estimated with k moment conditions. Hence, since k = p + q, it follows that k ≥ q + 1 ⇔ p ≥ 1, which means that at least one instrument is needed. In the Mathematical Annex A.6, the concept of GMM and its application to this specific problem are discussed. It also presents and discusses the assumption sets GMM1 and GMM2 that imply that the GMM estimator, defined by the solution θ̂ to the quadratic form

illustration not visible in this excerpt

Because of the first-step GMM estimation, the estimators have now become two-step estimators. As a reference point, the following derivations will mostly refer to the fully weighted estimator β̃_IPW = (X'ŴX)^{-1}X'ŴY, but comments on the response-weighted estimator β̂_IPW = (X'X)^{-1}X'ŴY are added when needed.
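The two-step procedure can be sketched as follows. Since equation (7) is not visible in this excerpt, the sketch assumes that the moment condition takes the form E[x_i (S_i/F(v_i'θ0) - 1)] = 0, which is what E[W|X] = I_N suggests. It further assumes a logistic F, a single instrument and a constant as the only other covariate (so k = l = 2 and the model is exactly identified), and identity weighting in the quadratic form. All names and parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
n = 50_000

x1 = rng.normal(size=n)                        # instrument x_{1,i}
X = np.column_stack([x1, np.ones(n)])          # x_i = (x_{1,i}, x_{2,i}), x_{2,i} = 1
beta0 = np.array([1.0, 1.0])                   # assumed true regression coefficients
theta0 = np.array([1.0, 0.5])                  # assumed true selection parameters
y_star = X @ beta0 + rng.normal(size=n)
V = np.column_stack([y_star, np.ones(n)])      # v_i = (y_i*, x_{2,i})
sel = rng.random(n) < 1.0 / (1.0 + np.exp(-(V @ theta0)))

def inv_prob_weights(theta):
    """S_i / F(v_i'theta) for a logistic F; only rows with S_i = 1 are touched."""
    w = np.zeros(n)
    index = np.clip(V[sel] @ theta, -30.0, 30.0)   # guard against overflow
    w[sel] = 1.0 + np.exp(-index)                  # 1 / F(z) = 1 + exp(-z)
    return w

def moments(theta):
    """Sample analog of the assumed moment condition, a vector of length k."""
    return X.T @ (inv_prob_weights(theta) - 1.0) / n

# First step: minimize the quadratic form g'g (identity weighting).
theta_hat = least_squares(moments, x0=np.zeros(2)).x

# Second step: fully weighted least squares with estimated weights.
w_hat = inv_prob_weights(theta_hat)
y = np.where(sel, y_star, 0.0)                 # observed response
Xw = X * w_hat[:, None]
beta_ipw = np.linalg.solve(Xw.T @ X, Xw.T @ y)

print("theta_hat:", theta_hat)
print("beta_ipw :", beta_ipw)
```

With more instruments than needed, the identity weighting would no longer be innocuous and an estimate of the optimal weighting matrix would have to be used instead. In the exactly identified case assumed here the choice does not matter.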

Given standard regularity conditions, the two-step estimators are still consistent: "Estimation of the probabilities using parametric methods has no interesting consequences for the consistency of the weighted M-estimator: consistency follows under standard regularity conditions from basic results on two-step estimation [...]" (see Wooldridge (1999, Page 123) and Newey and McFadden (1994)). Following Wooldridge (2010, Chapter 12), in his general framework three conditions are sufficient for the two-step estimator to be consistent. The first condition demands that the estimator θ̂ converges in probability to some limiting vector θ*. This is not the same as being consistent. However, in this application a consistent estimator is available under assumptions GMM1 and GMM2 from the Mathematical Annex A.6. Hence, here it holds that θ* = θ0. As a second condition, it must hold that:

illustration not visible in this excerpt

This condition is called the identification condition. It demands that, given θ0, the expectation of the weighted least squares criterion is smallest at the true value β0 compared to any other parameter value β in the parameter space B. In the linear model case it can be restated as E[S_i(x_i'β0 - x_i'β)^2 / F(v_i'θ0)] > 0 for all β ∈ B with β ≠ β0, and it holds if the rank of E[X'WX] is k, because then the parameter β0 is unique. Therefore, this condition is already implied by OLS2(ii). The third condition is not implied by the assumptions used so far and is given by:

Assumption UWLLN1’

This allows the conclusion that the corresponding two-step estimator is consistent as well. The central results of this subsection are provided below.

Assumptions OLS1 and OLS2 allow identification of β0 if all variables are observable (i.e. W = I_N). But in the case of endogenous non-response, the OLS estimator β̂_OLS = (X'X)^{-1}X'Y is inconsistent.

Assumptions OLS1, OLS2 and OLS-IPW allow identification of β0 under endogenous non-response if the inverse probabilities are known. The two possible estimators

illustration not visible in this excerpt

are consistent but biased. With finite second moments they are asymptotically normal and their variance-covariance matrices can be estimated using:

illustration not visible in this excerpt
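Because the variance expressions are not visible in this excerpt, the following sketch only illustrates the sandwich form that the text describes: an outer part built from X'X or X'WX and an inner part built from squared residuals. It assumes known response probabilities and a hypothetical simulation design, and it is not a reproduction of equations (5) and (6).

```python
import numpy as np

def sandwich(X, r, A):
    """Return A^{-1} B A^{-1} / N with B = N^{-1} sum_i r_i^2 x_i x_i'."""
    n = X.shape[0]
    B = (X * (r ** 2)[:, None]).T @ X / n
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv / n

rng = np.random.default_rng(3)
n = 50_000
x1 = rng.normal(size=n)
X = np.column_stack([x1, np.ones(n)])
y_star = X @ np.array([1.0, 1.0]) + rng.normal(size=n)   # assumed true coefficients (1, 1)
p = 1.0 / (1.0 + np.exp(-(y_star + 0.5)))                # assumed known p(v_i)
S = (rng.random(n) < p).astype(float)
y = S * y_star
w = S / p

# Response-weighted estimator with residuals e_i = w_i * y_i - x_i'beta.
A_resp = X.T @ X / n
beta_resp = np.linalg.solve(A_resp, X.T @ (w * y) / n)
V_resp = sandwich(X, w * y - X @ beta_resp, A_resp)

# Fully weighted estimator with weighted residuals w_i * (y_i - x_i'beta),
# which are zero whenever S_i = 0.
A_full = (X * w[:, None]).T @ X / n
beta_full = np.linalg.solve(A_full, X.T @ (w * y) / n)
V_full = sandwich(X, w * (y - X @ beta_full), A_full)

se_resp = np.sqrt(np.diag(V_resp))
se_full = np.sqrt(np.diag(V_full))
print("standard errors, response-weighted:", se_resp)
print("standard errors, fully weighted   :", se_full)
```

Comparing the two sets of standard errors in such a simulation gives a first impression of the efficiency question raised above, although a single draw under one assumed design cannot settle it.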

The assumptions OLS1, OLS2, OLS-IPW', GMM1, GMM2 and UWLLN1 (or UWLLN1' for the response-weighted estimator) allow the response probabilities to be estimated consistently and asymptotically efficiently via GMM and provide two consistent two-step estimators:

illustration not visible in this excerpt

2.2.2 On the Estimation of Standard Errors

An important remaining question is which standard errors to use. As stated in Newey and McFadden (1994, Chapter 6): "An important question for two-step estimators is whether the estimation of the first step affects the asymptotic variance of the second, and if so, what effect does the first step have. Ignoring the first step can lead to inconsistent standard error estimates, and hence confidence intervals that are not even asymptotically valid." Simply replacing W by its estimate Ŵ and using the variance-covariance estimator of equation (6) would ignore the variance induced by the estimation of W. The following derivation is very close to the findings described in Wooldridge (1999) and Wooldridge (2010, Chapter 12), where a general theory for two-step M-estimators is developed. A useful starting point is to assume again OLS1, OLS2, OLS-IPW', GMM1, GMM2 and UWLLN1. Additionally, the following assumption set is needed:

Assumption UWLLN2

illustration not visible in this excerpt

The crucial question is whether the fact that θ̂ plays a role in equation (12) has an influence on the asymptotic distribution of the second-step estimator. It would not matter if the following equation held:

If UWLLN2(ii) holds, then N^{-1} Σ_{i=1}^N (-S_i f(v_i'θ̂) u_i x_i v_i' / F^2(v_i'θ̂)) converges in probability to E[-S_i f(v_i'θ0) u_i x_i v_i' / F^2(v_i'θ0)] ≡ F0. The (k x l) matrix F0 gives the influence of the estimation of θ0 on the score of the second-step estimation. Remember that θ0 is estimated by GMM and that √N(θ̂ - θ0) = O_p(1). Then, by the mean value theorem, it can be written:

illustration not visible in this excerpt

This means that, in expectation, the influence of the first-step estimation on the score of the second step is zero. In any situation where S_i, and hence v_i, is not related to y_i* but only to x_i, this condition would hold by mean independence of the error term: E[-S_i f( [illustration not visible in this excerpt] ] = 0. This is, for example, the case under the MAR assumption. There, the influence of the first-step estimation can indeed be ignored, as shown by Baser et al. (2004) and Wooldridge (2010, Chapter 19). Another prominent example where the first-step estimate can be ignored is feasible generalized least squares (FGLS) estimation (see Wooldridge (2010, Chapter 7, Page 178)).

But equation (15) does not hold in the given setting, as the problem of sample selection is endogenous. Therefore, it cannot be assumed that the first-step estimation can be ignored and a correction for the first-step estimator is indeed necessary.
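The difference between the two situations can be made visible by computing a sample analog of the matrix E[-S_i f(v_i'θ0) u_i x_i v_i' / F^2(v_i'θ0)] in a simulation, where the true errors u_i and the true θ0 are available. The design below is an assumption made for illustration only (logistic F, one regressor plus a constant, hypothetical parameter values). It compares selection on the latent response with selection on a covariate alone.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x1 = rng.normal(size=n)
X = np.column_stack([x1, np.ones(n)])
u = rng.normal(size=n)                          # true errors, known only in a simulation
y_star = X @ np.array([1.0, 1.0]) + u           # assumed true coefficients (1, 1)
theta0 = np.array([1.0, 0.5])                   # assumed selection parameters

def influence_matrix(V):
    """Sample analog of E[-S f(v'theta0) u x v' / F(v'theta0)^2], logistic F."""
    index = V @ theta0
    F = 1.0 / (1.0 + np.exp(-index))
    f = F * (1.0 - F)                           # logistic density
    S = (rng.random(n) < F).astype(float)       # draw the response indicator
    factor = -S * f * u / F ** 2                # scalar part for each observation
    return (X * factor[:, None]).T @ V / n      # (k x l) matrix

# Endogenous selection: the index contains the latent response y*.
F0_endogenous = influence_matrix(np.column_stack([y_star, np.ones(n)]))
# Selection on covariates only: the index contains x1 instead of y*.
F0_exogenous = influence_matrix(np.column_stack([x1, np.ones(n)]))

print("selection on y*:\n", F0_endogenous)
print("selection on x1:\n", F0_exogenous)
```

Under selection on the covariate, all entries should be close to zero up to sampling noise, whereas selection on the latent response should produce entries that are clearly different from zero. This matches the argument that a first-step correction is needed in the endogenous case.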

This raises the question of how the variability of θ̂ is structured. In the Mathematical Annex A.6, the asymptotically efficient estimation of θ0 using GMM is discussed. From this, the optimal (k x k) weighting matrix is given by [illustration not visible in this excerpt]. If a suitable estimator for this expression is used, then the first-step estimation will be asymptotically efficient. Define the (l x k) matrix Ω0 ≡ (Γ0'Σ0Γ0)^{-1}Γ0'Σ0, with Γ0 ≡ [illustration not visible in this excerpt] of dimension (k x l) and Σ0 [illustration not visible in this excerpt], which is (k x k). Note that Σ0 gives the inverse expectation of the outer product of the GMM moment condition from equation (7), and Γ0 represents the first derivative of the moment condition with respect to θ, evaluated at θ0. By assuming GMM1 and GMM2, equation (37) from the Mathematical Annex A.6 can be used:


[1] The story was uncovered by Rademaker (2011) appearing in the Financial Times Germany and also appeared in an article by Harder and Lohse (2011) in the Frankfurter Allgemeine Zeitung.

[2] See in the Mathematical Annex A.1 for the derivation of this number.

[3] Implying, that all response variables that are zero-valued are treated as being a non-response. This convention might be problematic in cases where the value ”0” is indeed a meaningful response. A solution for this problem is to assume that the dataset comes with coding for missing values that is such that non-response can be clearly disentangled from ”0” as a response. For example, if the missing is coded ”NA” or ”-”. Or otherwise, if a continuous response takes values on the real line, then it holds P(Y* = 0) = 0 almost surely. If it is not clear, whether an observed zero is a non-response or in fact a ”0”, the methodology employed in this thesis is misleading.

[4] A dependence between response and the height of income is for example found by Frick and Grabka (2003), reporting a negative relationship. Among the numerous papers that deal with the problem of missing income data and the problem of endogeneity in income response are also Chen et al. (2008), Kim et al. (2007), Korinek et al. (2006) and Riphahn and Serfling (2005).

[5] The proof for this case, as well as for the case with additional covariates, can be found in the Mathematical Annex A.2.

[6] Henceforth, all equations are assumed to hold almost surely. Also, the possibility that the response probability gets arbitrarily close to zero is ruled out in the following.

[7] The author of this thesis is very grateful to Xavier D’Haultfoeuille who gave a lot of useful help with regard to the interpretation of this assumption, its similarity to the rank condition in linear instrumental variable regression and the aspect of lagged responses as an instrument.

[8] One might wonder whether the assumption that F is a standard parametric link function like in a Logit or a Probit is enough to assume A4 in order to achieve global identification. See a discussion of this in the Mathematical Annex A.5.

[9] See in the Mathematical Annex A.1 for an example. A deeper investigation of the bias is for example provided by Nathan and Holt (1980).

Excerpt out of 108 pages


Estimation in Case of Endogenous Selection with Application to Wage Regression
Humboldt-University of Berlin  (Institute for Statistics and Econometrics)
Inverse Probability, Sample Selection, GMM, Endogenous Selection, Wage Regression, IV Estimation, OLS, Two-Stage-Estimation
Quote paper
Michael Lebacher (Author), 2016, Estimation in Case of Endogenous Selection with Application to Wage Regression, Munich, GRIN Verlag,

