On the Construction of Nonasymptotic Confidence Intervals in Linear Models (and for Expectations)
Joint work with: Alexis Derumigny and Yannick Guyonvarch.
Building on earlier work on the construction of confidence intervals (CIs) for ratios of expectations, and using tools from first-order Edgeworth expansions of standardized sums of independent variables, we develop nonasymptotic CIs (i.e., CIs with guaranteed coverage at any sample size) for individual coefficients in linear models, both without endogeneity (standard Ordinary Least Squares settings) and with endogeneity (Instrumental Variable frameworks with one endogenous regressor, one excluded instrument, and additional exogenous covariates).
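For reference, here is the textbook statement of the expansion underlying this approach (not the paper's exact result): for i.i.d. variables with a finite third moment, and under Cramér's condition, the distribution of the standardized sum admits a first-order Edgeworth expansion.

```latex
% First-order Edgeworth expansion of the standardized sum
% S_n = \sqrt{n}\,(\bar{X}_n - \mu)/\sigma of i.i.d. variables X_i with
% mean \mu, variance \sigma^2, and skewness \lambda_3 = E[(X_i-\mu)^3]/\sigma^3:
\[
  \Pr(S_n \le x)
  = \Phi(x) + \frac{\lambda_3}{6\sqrt{n}}\,\bigl(1 - x^2\bigr)\,\varphi(x)
  + o\bigl(n^{-1/2}\bigr),
\]
% where \Phi and \varphi are the standard normal CDF and density.
```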
History or status: preparing for submission. We have split the project into two parts, leaving the endogeneity case for future articles. The current article focuses on the case of an expectation and of the coefficients of a linear regression model without endogeneity. It is motivated and supported by general possibility and impossibility results about the construction of nonasymptotically valid confidence intervals.
Abstract: We are concerned with constructing nonasymptotic confidence intervals (CIs) for the individual coefficients of a linear regression model, i.e. CIs whose probability of containing the true parameter is at least the nominal level at any sample size. Existing tools for nonasymptotic inference rely either on the normality of the error term or on the independence between the error term and the observed covariates of the model. Those assumptions may be restrictive in economic applications: normality rules out models with skewed or fat-tailed idiosyncratic shocks, while independence does not allow for heteroskedasticity. Our contribution is twofold. First, when there is no endogeneity issue, we propose a novel CI close in spirit to the one based on the t-statistic. Unlike the latter, which is only asymptotically valid, our CI is valid at any sample size under moment restrictions: we require the distribution of the errors conditional on the covariates to have bounded kurtosis and bounded fourth moment. Furthermore, our CI has asymptotically the same length as the t-statistic CI, whose coverage equals its nominal level in the limit. Our CI also has a closed-form expression, which makes it very easy to compute. Second, with one endogenous regressor, one instrument, and exogenous covariates, we propose a modification of the asymptotic Anderson-Rubin confidence set (CS) that is valid at any sample size under conditions analogous to those of the exogenous case. Our CS inherits the following appealing property of the Anderson-Rubin construction: inference remains valid even when the excluded instrument is weak (possibly uncorrelated with the endogenous variable). To the best of our knowledge, we provide the first nonasymptotic inference result in a weak-instrument framework without imposing normality of the residuals. We also discuss improvements of our results when the errors are assumed to be symmetrically distributed conditional on the covariates. We finally investigate the practical performance of our method in a simulation study.
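As a point of comparison, here is a minimal sketch (in Python) of the asymptotic benchmark mentioned in the abstract: the heteroskedasticity-robust t-statistic CI for a single OLS coefficient. This is the standard textbook construction, not the paper's nonasymptotic interval, and all names are illustrative.

```python
import numpy as np
from scipy import stats

def t_stat_ci(X, y, j, level=0.95):
    """Asymptotic CI for the j-th coefficient of y = X @ beta + eps,
    using the heteroskedasticity-robust (HC0) sandwich variance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Sandwich covariance: (X'X)^{-1} X' diag(resid_i^2) X (X'X)^{-1}
    meat = (X * resid[:, None] ** 2).T @ X
    se_j = np.sqrt((XtX_inv @ meat @ XtX_inv)[j, j])
    z = stats.norm.ppf(0.5 + level / 2)  # normal critical value
    return beta_hat[j] - z * se_j, beta_hat[j] + z * se_j
```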
Average Marginal Effects in Nonparametric Instrumental Regressions based on RKHS
Joint work with: Elia Lapenta.
History or status: preparing for submission.
Abstract: We propose an estimator of average marginal effects in Nonparametric Instrumental Regressions (NPIV) based on Reproducing Kernel Hilbert Spaces (RKHS). The estimator involves a single regularization parameter, which makes it easy to use in practice. We establish its consistency and asymptotic normality. Since the asymptotic distribution depends on an intricate asymptotic covariance, we propose a weighted Bayesian bootstrap method to obtain critical values and conduct inference.
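For readers unfamiliar with the inference step, here is a generic sketch of a weighted Bayesian bootstrap for a toy statistic (a mean). The paper applies this idea to its RKHS-based estimator, which is considerably more involved; the Exp(1) weights and all names below are illustrative assumptions.

```python
import numpy as np

def bayesian_bootstrap(data, stat, n_draws=2000, level=0.95, seed=None):
    """Weighted (Bayesian) bootstrap: reweight observations with i.i.d.
    Exp(1) weights normalized to mean one, recompute the statistic, and
    return percentile critical values."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        w = rng.exponential(1.0, size=len(data))
        w /= w.mean()                    # weights average to one
        draws[b] = stat(data, w)
    return np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])

# Usage: a weighted mean as the statistic.
x = np.random.default_rng(0).normal(loc=2.0, size=500)
print(bayesian_bootstrap(x, lambda d, w: np.average(d, weights=w)))
```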
Identification and estimation of a polarization index in large choice sets, with an application to U.S. Congress speech (1873-2016)
Joint work with: Xavier D’Haultfoeuille and Roland Rathelot.
In “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech” (Econometrica 2019), M. Gentzkow, J. M. Shapiro, and M. Taddy propose a novel method to quantify the extent to which two exogenous groups make different choices among a set of alternatives or options. In the context of high-dimensional choices, that is, when only a few choices per option are observed, their method deals with the so-called small-unit bias, which would otherwise preclude meaningful comparisons over time or across settings. Such a problem arises in various applications: residential segregation between immigrants and natives; occupational segregation between men and women; speech polarization between Democrats and Republicans in the US Congress; etc. Their methodology relies on the penalized estimation of a structural discrete choice model. As a consequence, it requires data at the chooser level (in the example of congressional speech, this means knowing the mapping between speeches and speaker identities). Moreover, since the number of possible options may be large (in the same example, the vocabulary consists of hundreds of thousands of distinct bigrams), they resort to a Poisson approximation and distributed computing to perform the estimation. In this setting of high-dimensional choices, we propose an alternative method to estimate a polarization index. Our method comes with formal identification, estimation, and inference results under a simple and testable statistical model. Our estimators and confidence intervals have closed-form expressions, so they are very light to compute. Besides, our method only uses aggregated data, namely the counts of choices per group and option.
History or status: a complete working paper version can be found in Chapter 3 of my PhD manuscript (link); slides (link). The project is currently suspended since our main application (speech polarization) uses a bag-of-words approach, which has most likely been superseded by more modern Natural Language Processing (NLP) methods, in particular word embeddings. Nonetheless, the general methodological contribution (measuring group differences in high-dimensional choice sets) remains relevant and may be of interest for other applications.
Abstract: Recently, political divisiveness appears to have been increasing in various democracies (Trump, populism, extreme right-wing parties in Europe, etc.). Language, as a basic determinant of group identity, might be part of that story. “Witch hunt” versus “impeachment hearing”; “undocumented workers” compared to “illegal aliens”; “death tax” or “progressive wealth tax”: those partisan expressions name the same object but with different connotations. They diffuse into media coverage and can induce framing effects on public opinion. Hence the interest in measuring the speech polarization of political leaders and comparing its evolution over time or across countries. One way to do so would rely on linguistics and literary exegesis. Another is a statistical analysis that, despite being quite rough (the data are essentially word counts), enables the comprehensive study of a large corpus of texts without relying on ex-ante partisan expressions. The aim is to quantify the extent to which two distinct groups speak using different words. In our application, we study Democrat and Republican speakers in the US Congress between 1873 and 2016. Gentzkow, Shapiro, and Taddy (Econometrica 2019) address the same issue with a structural discrete choice model and a machine-learning-type penalization. We provide an alternative method with the following advantages: (i) a formal partial identification result for the parameter of interest, a speech partisanship index, within a testable statistical model; (ii) simple and computationally light estimators for the bounds and confidence intervals; (iii) only “aggregated data” is required. As a consequence, our methodology can easily be applied to other settings that involve quantifying differences in the choices made by individuals split into two groups in a “high-dimensional” context, meaning that the number of distinct options is large relative to the number of observed choices in the data. In our application, these are choices of words spoken by Republican and Democrat speakers, but they might as well be choices of residential locations between natives and immigrants when investigating segregation, product choices between distinct groups of consumers in empirical industrial organization, etc.
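To see the small-unit bias concretely, consider a naive plug-in index on simulated counts: even when the two groups draw words from exactly the same distribution, the naive index stays far from zero as long as counts per option are small. The index below (half the L1 distance between empirical choice frequencies) is only an illustration, not the partisanship index of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_options = 10_000                           # vocabulary size
probs = rng.dirichlet(np.ones(n_options))    # one COMMON choice distribution

def naive_index(c1, c2):
    """Half the L1 distance between empirical choice frequencies."""
    return 0.5 * np.abs(c1 / c1.sum() - c2 / c2.sum()).sum()

# Both groups draw from the same distribution, so the true index is 0,
# yet the plug-in estimate is large when counts per option are small.
for n_words in (10_000, 100_000, 1_000_000):
    c_dem = rng.multinomial(n_words, probs)
    c_rep = rng.multinomial(n_words, probs)
    print(f"{n_words:>9} words: naive index = {naive_index(c_dem, c_rep):.3f}")
```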
Measures of several dimensions of residential segregation in France between 1968 and 2019 based on the Labor Force Survey clusters
History or status: currently suspended; a first-iteration draft can be found in Chapter 2 of my PhD manuscript (link).
Abstract: This article takes advantage of the sampling scheme of the French Labor Force Survey, which draws clusters of around thirty adjacent housing units, to study several dimensions of residential segregation in France from 1968 to 2019. Such clusters form relevant neighborhoods for the study of residential segregation, provided the indices account for the small-unit bias so that they can be compared over time or across different dimensions of segregation (French versus non-French people, jobseekers versus employed workers, college graduates versus non-graduates, white-collar versus blue-collar workers, etc.).
Applying the methodology developed in D’Haultfoeuille and Rathelot (Quantitative Economics, 2017) and Rathelot (Journal of Business and Economic Statistics, 2012), we estimate annual segregation indices for different specifications of the “minority” and “majority” groups, aiming to quantify several dimensions of residential segregation and compare them.
The results suggest two main conclusions.
First, whatever the dimension under study (ethnicity, immigrant status, nationality, occupational category, labor market status, education), the estimated indices do not reveal any substantial evolution over time: within each dimension, the magnitude of residential segregation has remained broadly constant over the past decades.
Second, they reveal that the magnitude of segregation differs across dimensions, according to the following decreasing ranking (in parentheses, the order of magnitude of the corresponding Duncan segregation index, averaged over the period 1968-2019; a minimal computational sketch of this index is given after this abstract):
– nationality (0.65)
– ethnicity, proxied by the parents’ country of birth, and immigrant status (0.50)
– social status, as measured by occupational category or college education (0.40)
– labor market status, specified here as unemployed or employed (0.25)
A conditional analysis, separating neighborhoods that belong to urban areas of 200,000 inhabitants or more from neighborhoods belonging to smaller urban areas, complements the unconditional analysis.
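For concreteness, here is a minimal sketch of the raw, plug-in Duncan dissimilarity index computed from neighborhood-level counts; the estimators used in the paper correct this plug-in for the small-unit bias, which the sketch below does not do.

```python
import numpy as np

def duncan_index(minority, majority):
    """Plug-in Duncan dissimilarity index from per-neighborhood counts:
    D = 0.5 * sum_k |m_k / M - n_k / N|, where m_k (resp. n_k) is the
    minority (resp. majority) count in neighborhood k."""
    m = np.asarray(minority, dtype=float)
    n = np.asarray(majority, dtype=float)
    return 0.5 * np.abs(m / m.sum() - n / n.sum()).sum()

# Toy example with three neighborhoods:
print(duncan_index([30, 5, 5], [20, 40, 40]))  # 0.55: substantial segregation
```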
On the Construction of Confidence Intervals for Ratios of Expectations
Joint work with: Alexis Derumigny and Yannick Guyonvarch.
Ratios of expectations are frequent parameters of interest in applied economics. In particular, any conditional expectation can be expressed as such a ratio. Inference for these quantities usually stems from asymptotic normality and the delta method, so the resulting guarantees are only asymptotic. In settings where the denominator is close to zero, we document through simulations that the asymptotic approximation may require hundreds of thousands of observations. In finite samples, this means that the confidence intervals used in practice may contain the target parameter with probability far below their nominal level. To address this issue, we investigate how to conduct reliable inference in settings with a denominator close to zero.
History or status: working paper arXiv:1904.07111; somewhat obsolete, as we have reorganized the paper (splitting it to separate the asymptotic from the nonasymptotic results, and extending the nonasymptotic results to linear regressions).
Abstract: In econometrics, many parameters of interest can be written as ratios of expectations. The main approach to constructing confidence intervals for such parameters is the delta method. However, this asymptotic procedure yields intervals that may not be reliable for small sample sizes or, more generally, in a sequence-of-models framework that allows the expectation in the denominator to decrease to 0 with the sample size. In this setting, we prove a generalization of the delta method for ratios of expectations and the consistency of the nonparametric percentile bootstrap. We also investigate finite-sample inference and show a partial impossibility result: nonasymptotic uniform confidence intervals can be built for ratios of expectations, but not at every level. Based on this, we propose an easy-to-compute index to appraise the reliability of the intervals obtained with the delta method. Simulations and an application illustrate our results and the practical usefulness of our rule of thumb.
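For reference, here is a minimal sketch of the delta-method CI for a ratio of expectations theta = E[Y]/E[X], the baseline procedure the abstract builds on; the names are illustrative. Consistent with the simulations mentioned above, its coverage deteriorates as E[X] gets close to zero.

```python
import numpy as np
from scipy import stats

def delta_ratio_ci(x, y, level=0.95):
    """Delta-method CI for theta = E[Y]/E[X] from an i.i.d. sample (x_i, y_i)."""
    n = len(x)
    mx, my = x.mean(), y.mean()
    theta = my / mx
    grad = np.array([-my / mx**2, 1.0 / mx])  # gradient of (a, b) -> b/a
    S = np.cov(x, y)                          # 2x2 sample covariance of (X, Y)
    half = stats.norm.ppf(0.5 + level / 2) * np.sqrt(grad @ S @ grad / n)
    return theta - half, theta + half
```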