### Explicit Nonasymptotic Bounds for the Distance to the First-Order Edgeworth Expansion

*Joint work with:* Alexis Derumigny and Yannick Guyonvarch.

We develop explicit nonasymptotic bounds for the distance to the first-order Edgeworth expansion in the case of a standardized sum of independent random variables with finite fourth moment. In another article, we use this tool to construct nonasymptotic confidence intervals in linear models.

*History or status:* working paper arXiv:2101.05780; under review.

*Abstract:* In this article, we study bounds on the uniform distance between the cumulative distribution function of a standardized sum of independent centered random variables with moments of order four and its first-order Edgeworth expansion. Existing bounds are sharpened in two frameworks: when the variables are independent but not identically distributed and in the case of independent and identically distributed random variables. Improvements of these bounds are derived if the third moment of the distribution is zero. We also provide adapted versions of these bounds under additional regularity constraints on the tail behavior of the characteristic function. We finally present an application of our results to the lack of validity of one-sided tests based on the normal approximation of the mean for a fixed sample size.
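For orientation, the quantity being bounded can be written as follows in the i.i.d. case. The notation ($S_n$, $\lambda_3$, $\Delta_n$) is assumed here for illustration and may differ from the paper's. The $X_i$ are i.i.d. centered random variables with variance $\sigma^2$, and $\Phi$ and $\varphi$ denote the standard normal distribution function and density.

```latex
S_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} X_i,
\qquad
\lambda_3 = \frac{\mathbb{E}[X_1^3]}{\sigma^3},
\qquad
\Delta_n = \sup_{x \in \mathbb{R}}
\left| \mathbb{P}(S_n \le x) - \Phi(x)
       - \frac{\lambda_3}{6\sqrt{n}} \,(1 - x^2)\, \varphi(x) \right|
```

When $\lambda_3 = 0$, the correction term vanishes and $\Delta_n$ reduces to the uniform (Kolmogorov) distance between the distribution of $S_n$ and the standard normal.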

### On the Construction of Nonasymptotic Confidence Intervals in Linear Models

*Joint work with:* Alexis Derumigny and Yannick Guyonvarch.

Building on earlier work on the construction of confidence intervals (CIs) for ratios of expectations and using tools from first-order Edgeworth expansions of standardized sums of independent variables, we develop nonasymptotic CIs (*i.e.* with guaranteed coverage for any sample size) for individual coefficients in linear models, both without endogeneity (standard Ordinary Least Squares settings) and with endogeneity (Instrumental Variable frameworks with one endogenous regressor, one excluded instrument, and additional exogenous covariates).

*History or status:* preparing for submission; draft (March 2020 version) available upon request.

*Abstract:* We are concerned with constructing nonasymptotic confidence intervals (CIs) for the individual coefficients of a linear regression model, *i.e.* CIs whose probability of containing the true parameter is at least the nominal level for any sample size. The existing tools to conduct nonasymptotic inference rely either on the normality of the error term or on the independence between the error term and the observed covariates of the model. However, those assumptions may be restrictive in economic applications: normality rules out models with skewed or fat-tailed idiosyncratic shocks, while independence does not allow for heteroskedasticity. Our contribution is twofold. First, when there is no endogeneity issue, we propose a novel CI close in spirit to the one based on the t-statistic. Unlike the latter, which has only asymptotic validity, our CI is valid for any sample size under moment restrictions. In particular, we require the distribution of errors conditional on covariates to have bounded kurtosis and fourth moment. Furthermore, our CI has the same length asymptotically as the one based on the t-statistic, whose coverage equals its nominal level in the limit. Our CI also has a closed-form expression, which makes it very easy to compute. Second, with one endogenous regressor, one instrument, and exogenous covariates, we propose a modification of the asymptotic Anderson-Rubin confidence set (CS) that is valid for any sample size under conditions analogous to those of the exogenous case. Our CS inherits the following appealing property of the Anderson-Rubin construction: inference remains valid even when the excluded instrument is weak (uncorrelated with the endogenous variable). To the best of our knowledge, we provide the first nonasymptotic inference result in a weak-instrument framework without imposing normality of residuals. We also discuss improvements in our results when we assume that errors are symmetrically distributed conditional on covariates. We finally investigate the practical performance of our method in a simulation study.

### Measuring Speech Polarization: Identification and Estimation

*Joint work with:* Xavier D’Haultfoeuille and Roland Rathelot.

In “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech” (Econometrica 2019), M. Gentzkow, J. M. Shapiro, and M. Taddy propose a novel method to quantify the extent to which two exogenous groups make different choices among a set of alternatives or options. In the context of high-dimensional choices, that is, when only a few choices per option are observed, their method deals with the so-called small-unit bias, which otherwise precludes relevant comparisons over time or across settings. Such a problem arises in various applications: residential segregation between immigrants and natives; occupational segregation between men and women; speech polarization between Democrats and Republicans in the US Congress; etc. Their methodology relies on the penalized estimation of a structural discrete choice model. As a consequence, it requires data at the chooser level (in the example of congressional speech, this means knowing the mapping between speeches and speaker identities). Moreover, as the number of possible options might be large (in the same example, the vocabulary is made of hundreds of thousands of distinct bigrams), they use a Poisson approximation and distributed computing to perform the estimation. In this setting of high-dimensional choices, we propose an alternative method to estimate a polarization index. Our method comes with formal identification, estimation, and inference results under a simple and testable statistical model. Our estimators and confidence intervals have closed-form expressions, so they are computationally light. Besides, our method only uses aggregated data, namely the counts of choices per group and option.

*History or status:* preparing for submission.

*Abstract:* Recently, political divisiveness appears to have been increasing in various democracies (Trump, populism, extreme right-wing parties in Europe, etc.). Language, as a basic determinant of group identity, might be part of that story. “Witch hunt” versus “impeachment hearing”; “undocumented workers” compared to “illegal aliens”; “death tax” or “progressive wealth tax”: those partisan expressions name the same object but with different connotations. They diffuse into media coverage and can induce framing effects on public opinion. Hence the interest in measuring the speech polarization of political leaders and comparing its evolution over time or across countries. One way to do so would be through linguistics and literary exegesis. Another is a statistical analysis that, despite being quite rough (the data are essentially word counts), enables the comprehensive study of a large corpus of texts without relying on ex-ante partisan expressions. The aim is to quantify the extent to which two distinct groups speak using different words. In our application, we study Democrat and Republican speakers in the US Congress between 1873 and 2016. Gentzkow, Shapiro, and Taddy (Econometrica 2019) address the same issue with a structural discrete choice model and a machine-learning type penalization. We provide an alternative method with the following advantages: *(i)* a formal partial identification result for the parameter of interest, a speech partisanship index, within a testable statistical model; *(ii)* simple and computationally light estimators for the bounds and confidence intervals; *(iii)* only “aggregated data” is required. As a consequence, our methodology can easily be applied to other settings with the same problem of quantifying differences in the choices made by individuals split into two groups in a “high-dimensional” context, meaning that the number of distinct options is large relative to the number of observed choices in the data. In our application, these are choices of words pronounced by Republican and Democrat speakers, but they might as well be choices of residential locations between natives and immigrants when investigating segregation, product choices between distinct groups of consumers in empirical industrial organization, etc.
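The small-unit bias mentioned above can be illustrated with a short simulation. This is a minimal sketch, not the method of this project or of Gentzkow, Shapiro, and Taddy: it uses a naive plug-in dissimilarity index as a stand-in for a polarization index, and the function names (`dissimilarity`, `simulate_index`) are hypothetical. Both groups draw their choices from the same uniform distribution over options, so the true index is zero by construction.

```python
import random


def dissimilarity(counts_a, counts_b):
    """Naive plug-in dissimilarity index between two groups' choice shares."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(counts_a, counts_b)
    )


def simulate_index(n_options, n_choices, seed=0):
    """Plug-in index when both groups choose uniformly at random.

    The population index is 0 here; any positive value is pure sampling noise.
    """
    rng = random.Random(seed)
    counts_a = [0] * n_options
    counts_b = [0] * n_options
    for _ in range(n_choices):
        counts_a[rng.randrange(n_options)] += 1
        counts_b[rng.randrange(n_options)] += 1
    return dissimilarity(counts_a, counts_b)


if __name__ == "__main__":
    # Same number of observed choices, few options versus many options.
    print("few options :", simulate_index(n_options=10, n_choices=2000))
    print("many options:", simulate_index(n_options=1000, n_choices=2000))
```

With the number of observed choices held fixed, the plug-in index should come out far larger when there are many options than when there are few, even though the two groups behave identically. This is why naive indices are hard to compare across settings with different numbers of options or observations.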

### Several Dimensions of Residential Segregation in France (1968-2019)

The sampling scheme of the French Labor Force Survey happens to draw clusters of around thirty to forty adjacent housing units. Such clusters form relevant neighborhoods to study residential segregation provided the segregation indices account for small-unit bias, so that they can be compared over time or across different dimensions of segregation (French versus non-French people; jobseekers versus workers; college graduates versus non-graduates, etc.). In this project, I try to construct long-term series of residential segregation indices to describe the evolution of residential segregation in France and compare the magnitude of segregation along several dimensions (nationality, ethnicity, education, socio-economic status).

*History or status:* ongoing project.

### Understanding Jobseekers’ Demand for Training

*Joint work with:* Bruno Crépon, Anett John, and Esther Mbih.

The broad topic of this project is the demand for vocational training from jobseekers. More precisely, it aims at investigating the barriers that might prevent jobseekers from registering for and completing intermediate to long (at least two months) professional training programs. Indeed, despite potentially high gains in terms of labor market outcomes, the demand for vocational training programs remains low in France, particularly among young and low-educated people. The project is done in collaboration with the French public employment service (“Pôle emploi”).

*History or status:* ongoing project; a survey to obtain a comprehensive map of the barriers faced by jobseekers and notably contrast the standard barriers of information, supply, and funding with behavioral impediments was administered in 2020 to jobseekers; a working paper describing the results of the survey is available upon request; current stage: design of the interventions.

### On the Construction of Confidence Intervals for Ratios of Expectations

*Joint work with:* Alexis Derumigny and Yannick Guyonvarch.

Ratios of expectations are frequent parameters of interest in applied economics. In particular, any conditional expectation can be expressed as such. Inference for these quantities usually stems from asymptotic normality and the delta method. Their properties are therefore only asymptotic. In settings where the denominator is close to zero, we document through simulations that the asymptotic approximation may require hundreds of thousands of observations. This entails that, in finite samples, the confidence intervals used in practice can have a probability of containing the target parameter far below their nominal level. To address this issue, we investigate how to conduct reliable inference in settings with a denominator close to zero.

*History or status:* working paper arXiv:1904.07111; somewhat obsolete as we have reorganized the paper (dividing it to separate asymptotic and nonasymptotic results and extending the nonasymptotic results to linear regressions).

*Abstract:* In econometrics, many parameters of interest can be written as ratios of expectations. The main approach to construct confidence intervals for such parameters is the delta method. However, this asymptotic procedure yields intervals that may not be relevant for small sample sizes or, more generally, in a sequence-of-model framework that allows the expectation in the denominator to decrease to 0 with the sample size. In this setting, we prove a generalization of the delta method for ratios of expectations and the consistency of the nonparametric percentile bootstrap. We also investigate finite-sample inference and show a partial impossibility result: nonasymptotic uniform confidence intervals can be built for ratios of expectations but not at every level. Based on this, we propose an easy-to-compute index to appraise the reliability of the intervals based on the delta method. Simulations and an application illustrate our results and the practical usefulness of our rule of thumb.
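The standard delta-method interval discussed above can be sketched as follows. This is an illustrative implementation under assumptions of my own choosing, not code from the paper: the function names (`delta_method_ci`, `coverage`), the Gaussian data-generating process, and the parameter values are all hypothetical. It uses only the Python standard library.

```python
import math
import random
from statistics import NormalDist, fmean


def delta_method_ci(x, y, level=0.95):
    """Delta-method CI for E[X] / E[Y] from paired samples x and y."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    ratio = mean_x / mean_y
    # Linearization of the ratio: psi_i = (x_i - ratio * y_i) / mean(y).
    # The psi_i average to zero, so their variance is the mean of squares.
    psi = [(xi - ratio * yi) / mean_y for xi, yi in zip(x, y)]
    var = sum(p * p for p in psi) / (n - 1)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * math.sqrt(var / n)
    return ratio - half_width, ratio + half_width


def coverage(n, mean_y, reps=2000, seed=0):
    """Monte Carlo coverage of the interval for the true ratio 1 / mean_y.

    Assumed design: X ~ N(1, 1) and Y ~ N(mean_y, 1), drawn independently.
    """
    rng = random.Random(seed)
    target = 1.0 / mean_y
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(1.0, 1.0) for _ in range(n)]
        y = [rng.gauss(mean_y, 1.0) for _ in range(n)]
        lower, upper = delta_method_ci(x, y)
        hits += lower <= target <= upper
    return hits / reps


if __name__ == "__main__":
    print("denominator far from 0:", coverage(n=200, mean_y=5.0))
    print("denominator near 0    :", coverage(n=200, mean_y=0.05))
```

In this toy design, coverage should be close to the nominal level when the denominator is well separated from zero and tends to fall below it when the denominator is close to zero relative to its sampling variability, which is the phenomenon the paper investigates.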