### Under Review

### segregsmall: a Command to Estimate Segregation in the Presence of Small Units

*Joint work with* Xavier D’Haultfoeuille and Roland Rathelot.

The Stata package “segregsmall” implements three methods to measure classical segregation indices (Duncan, Theil, Atkinson, Coworker a.k.a. Isolation, and Gini) in the context of small units. Units can be geographical areas, residential neighborhoods, firms, classrooms, or other clusters, provided that every individual belongs to exactly one unit. “Small units” refers to settings in which the number of individuals per unit is small, typically a few dozen at most. In such settings, which are quite common in applications, measuring segregation from the variation of the empirical minority shares across units, although a natural idea, leads to biased indices. These indices are upward biased, and they cannot be reliably compared over time or across settings since the bias itself may vary. Hence the value of methods that correct for this small-unit bias.
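The small-unit bias is easy to exhibit by simulation. The sketch below (plain Python, independent of the Stata package) allocates individuals to units of size 20 purely at random, so true segregation is zero, yet the naive Duncan dissimilarity index computed from empirical minority shares is clearly positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def duncan(minority, total):
    """Naive Duncan dissimilarity: 0.5 * sum_u |m_u/M - w_u/W|,
    where m_u (w_u) is the minority (majority) count in unit u."""
    majority = total - minority
    return 0.5 * np.abs(minority / minority.sum()
                        - majority / majority.sum()).sum()

n_units, unit_size, p = 500, 20, 0.3   # small units, random allocation
sims = [duncan(rng.binomial(unit_size, p, n_units),
               np.full(n_units, unit_size)) for _ in range(200)]
print(np.mean(sims))   # clearly positive although true segregation is 0
```

The bias shrinks as units grow: re-running the same simulation with a much larger `unit_size` drives the naive index toward its true value of zero.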

*History or Status:* Under revision (submitted to *The Stata Journal*).

The **package** is operational and can be easily installed in Stata from this GitHub page. Please contact me should you have any questions or difficulties.

*Abstract:* Suppose that a population, comprised of a minority and a majority group, is allocated into units, which can be neighborhoods, firms, classrooms, etc. Qualitatively, there is some segregation whenever the allocation process concentrates minority individuals in some units more than in others. Quantitative measures of segregation have struggled with the small-unit bias: when units contain few individuals, indices based on the minority shares in units are upward biased. For instance, they would point to a positive amount of segregation even when the allocation process is strictly random. The Stata command segregsmall implements three recent methods correcting for this bias: the nonparametric, partial identification approach of D’Haultfœuille and Rathelot (2017), the parametric model of Rathelot (2012), and the linear correction of Carrington and Troske (1997). The package also allows for conditional analyses, namely measures of segregation taking into account characteristics of the individuals or the units.

### Preparing for Submission

### On the Construction of Nonasymptotic Confidence Intervals in Linear Models

*Joint work with* Alexis Derumigny and Yannick Guyonvarch.

Building upon earlier reflections on the construction of confidence intervals (CIs) for ratios of expectations, we develop nonasymptotic CIs (*i.e.* with guaranteed coverage for any sample size) for individual coefficients in linear models, both without endogeneity (standard Ordinary Least Squares settings) and with endogeneity (Instrumental Variable frameworks with one endogenous regressor, one excluded instrument, and additional exogenous covariates).

*History or Status:* Preparing for submission. Draft (March 2020 version) available upon request.

*Abstract:* We are concerned with constructing nonasymptotic confidence intervals (CIs) for the individual coefficients of a linear regression model, *i.e.* CIs whose probability of containing the true parameter is at least the nominal level for any sample size. The existing tools to conduct nonasymptotic inference rely either on the normality of the error term or on the independence between the error term and the observed covariates of the model. However, those assumptions may be restrictive in economic applications: normality rules out models with skewed or fat-tailed idiosyncratic shocks, while independence does not allow for heteroskedasticity. Our contribution is twofold. First, when there is no endogeneity issue, we propose a novel CI close in spirit to the one based on the t-statistic. Unlike the latter, which has only asymptotic validity, our CI is valid for any sample size under moment restrictions. In particular, we require the distribution of errors conditional on covariates to have a bounded fourth moment and bounded kurtosis. Furthermore, our CI has the same length asymptotically as the one based on the t-statistic, whose coverage equals its nominal level in the limit. Our CI also has a closed-form expression, which makes it very easy to compute. Second, with one endogenous regressor, one instrument, and exogenous covariates, we propose a modification of the asymptotic Anderson-Rubin confidence set (CS) that is valid for any sample size under conditions analogous to the exogenous case. Our CS inherits the following appealing property of the Anderson-Rubin construction: inference remains valid even when the excluded instrument is weak (uncorrelated with the endogenous variable). To the best of our knowledge, we provide the first nonasymptotic inference result in a weak-instruments framework without imposing normality of residuals. We also discuss improvements in our results when errors are assumed to be symmetrically distributed conditional on covariates. We finally investigate the practical performance of our method in a simulation study.
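To see concretely why normality matters for finite-sample validity, the minimal simulation below (an illustration of the problem, not of our proposed interval) checks the coverage of the usual t-based CI in the simplest linear model, an intercept-only regression, *i.e.* estimating a mean, when the errors are skewed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 5000
tcrit = 2.262               # 97.5% quantile of Student t with n-1 = 9 df
mu = np.exp(0.5)            # true mean of a LogNormal(0, 1) variable

covered = 0
for _ in range(reps):
    y = rng.lognormal(0.0, 1.0, n)        # skewed shocks: normality fails
    half = tcrit * y.std(ddof=1) / np.sqrt(n)
    covered += (y.mean() - half) <= mu <= (y.mean() + half)
coverage = covered / reps
print(coverage)   # noticeably below the nominal 0.95
```

With symmetric, light-tailed errors the same experiment would give coverage close to 0.95 even at this sample size; the skewness of the LogNormal shocks is what breaks the t-interval here.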

### Measuring Speech Polarization: Identification and Estimation

*Joint work with* Xavier D’Haultfoeuille and Roland Rathelot.

In “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech” (Econometrica 2019), M. Gentzkow, J. M. Shapiro, and M. Taddy propose a novel method to quantify to what extent two exogenous groups make different choices among a set of alternatives or options. In the context of high-dimensional choices, that is, when only a few choices per option are observed, their method deals with the so-called small-unit bias, which would otherwise preclude relevant comparisons over time or across settings. This problem arises in various applications: residential segregation between immigrants and natives; occupational segregation between men and women; speech polarization between Democrats and Republicans in the US Congress; etc. Their methodology relies on the penalized estimation of a structural discrete choice model. As a consequence, it requires data at the chooser level (in the example of congressional speech, this means knowing the map between speeches and speaker identities). Moreover, as the number of possible options may be large (in the same example, the vocabulary comprises hundreds of thousands of distinct bigrams), they resort to a Poisson approximation and distributed computing to perform the estimation. In this setting of high-dimensional choices, we propose an alternative method to estimate a polarization index. Our method comes with formal identification, estimation, and inference results under a simple and testable statistical model. Our estimators and confidence intervals have closed-form expressions, so they are computationally light. Besides, our method only uses aggregated data, namely the counts of choices per group and option.

*History or Status:* Preparing for submission.

*Abstract:* Recently, political divisiveness appears to have been increasing in various democracies (Trump, populism, extreme right-wing parties in Europe, etc.). Language, as a basic determinant of group identity, might be part of that story. “Witch hunt” versus “impeachment hearing”; “undocumented workers” compared to “illegal aliens”; “death tax” or “progressive wealth tax”: these partisan expressions name the same object but with different connotations. They diffuse into media coverage and can induce framing effects on public opinion. Hence the interest in measuring the speech polarization of political leaders and comparing its evolution over time or across countries. One way to do so would be linguistics and literary exegesis. Another is a statistical analysis that, despite being quite rough (the data are essentially word counts), enables the comprehensive study of a large corpus of texts without relying on ex-ante lists of partisan expressions. The aim is to quantify to what extent two distinct groups speak using different words. In our application, we study Democrat and Republican speakers in the US Congress between 1873 and 2016. Gentzkow, Shapiro, and Taddy (Econometrica 2019) address the same issue with a structural discrete choice model and a machine-learning-type penalization. We provide an alternative method with the following advantages: *(i)* a formal partial identification result for the parameter of interest, a speech partisanship index, within a testable statistical model; *(ii)* simple and computationally light estimators for the bounds and confidence intervals; *(iii)* only “aggregated data” is required. As a consequence, our methodology can easily be applied to other settings that raise the same problem of quantifying differences in the choices made by individuals split into two groups in a “high-dimensional” context, meaning that the number of distinct options is large relative to the number of observed choices in the data.
In our application, these are choices of words pronounced by Republican and Democrat speakers, but they could equally be choices of residential location between natives and immigrants when investigating segregation, product choices between distinct groups of consumers in empirical industrial organization, etc.
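The high-dimensional bias alluded to above can be reproduced from aggregated counts alone. The sketch below uses a naive plug-in index, the total-variation distance between the two groups' empirical phrase distributions (a deliberately crude stand-in, not our partisanship index nor that of Gentzkow, Shapiro, and Taddy): even when both groups draw phrases from exactly the same distribution, the naive index is far from its true value of zero whenever the number of tokens is small relative to the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1_000                    # vocabulary size (number of options)
p = np.full(V, 1 / V)        # both groups speak from the SAME distribution

def naive_tv(n_dem, n_rep):
    """Plug-in total-variation distance between the two groups'
    empirical phrase distributions, from aggregated counts only."""
    c_d = rng.multinomial(n_dem, p)
    c_r = rng.multinomial(n_rep, p)
    return 0.5 * np.abs(c_d / n_dem - c_r / n_rep).sum()

tv_few = naive_tv(5_000, 5_000)            # few tokens per group
tv_many = naive_tv(1_000_000, 1_000_000)   # many tokens per group
print(tv_few, tv_many)   # large spurious "polarization" with few tokens
```

Note that the simulation only needs the count of each phrase per group, which is exactly the aggregated-data format our method requires.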

### Work in Progress

### On the Construction of Confidence Intervals for Ratios of Expectations

*Joint work with* Alexis Derumigny and Yannick Guyonvarch.

Ratios of expectations are frequent parameters of interest in applied economics; in particular, any conditional expectation can be written as such a ratio. Inference for these quantities usually relies on asymptotic normality and the delta method, so its guarantees are only asymptotic. In settings where the denominator is close to zero, we document through simulations that the asymptotic approximation may require hundreds of thousands of observations to be accurate. In finite samples, this means that the confidence intervals used in practice can have a probability of containing the target parameter far below their nominal level. To address this issue, we investigate how to conduct reliable inference in settings with a denominator close to zero.
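For concreteness, here is a minimal sketch of the standard delta-method interval for a ratio of expectations (`ratio_ci` is an illustrative helper name, not code from the paper). The denominator mean is deliberately kept safely away from zero, the benign regime; the paper studies what happens when it is not:

```python
import numpy as np

def ratio_ci(y, x, z=1.96):
    """Delta-method 95% CI for E[Y]/E[X] from i.i.d. draws (y_i, x_i).
    z = 1.96 is the standard normal 97.5% quantile."""
    n = len(y)
    est = y.mean() / x.mean()
    # Asymptotic variance of the ratio: Var(Y - theta*X) / (n * E[X]^2)
    se = np.sqrt(np.var(y - est * x, ddof=1) / n) / abs(x.mean())
    return est, est - z * se, est + z * se

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(2.0, 1.0, n)    # denominator mean safely away from zero
y = rng.normal(1.0, 1.0, n)    # true ratio E[Y]/E[X] = 0.5
est, lo, hi = ratio_ci(y, x)
print(est, lo, hi)
```

As E[X] shrinks toward zero relative to the sampling noise of the denominator, the normal approximation underlying this interval deteriorates, which is the starting point of the paper.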

*History or Status:* An arXiv preprint (April 2019) is available here. Nonetheless, we plan to reorganize the paper, probably splitting it to better present our different sets of results.

*Abstract:* In econometrics, many parameters of interest can be written as ratios of expectations. The main approach to construct confidence intervals for such parameters is the delta method. However, this asymptotic procedure yields intervals that may not be relevant for small sample sizes or, more generally, in a sequence-of-model framework that allows the expectation in the denominator to decrease to 0 with the sample size. In this setting, we prove a generalization of the delta method for ratios of expectations and the consistency of the nonparametric percentile bootstrap. We also investigate finite-sample inference and show a partial impossibility result: nonasymptotic uniform confidence intervals can be built for ratios of expectations but not at every level. Based on this, we propose an easy-to-compute index to appraise the reliability of the intervals based on the delta method. Simulations and an application illustrate our results and the practical usefulness of our rule of thumb.

### Understanding Jobseekers’ Demand for Training

*Joint work with* Bruno Crépon, Anett John, and Esther Mbih.

The broad topic of this project is jobseekers’ demand for vocational training. More precisely, it investigates the barriers that may prevent jobseekers from registering for and completing intermediate-to-long (at least two months) professional training programs. Despite potentially high gains in terms of labor market outcomes, the demand for vocational training remains low in France, particularly among young and low-educated people. The project is conducted in collaboration with the French public employment service (“Pôle emploi”) and relates to the “Plan d’Investissement dans les Compétences” (PIC, a 15-billion-euro program to develop professional training launched in 2018 by the French government). The first stage of the project consists in eliciting the perceived barriers to training through a comprehensive survey sent to 100,000 randomly selected jobseekers. The originality and goal of this survey are to obtain a comprehensive map of the barriers faced by jobseekers, and notably to contrast the standard barriers (information, supply, and funding) with behavioral impediments. It will tell us what holds jobseekers back from vocational training: mainly standard barriers, or behavioral ones? Our intuition (supported by a pilot with 1,200 jobseekers, to be confirmed by the large-scale survey) is that, in addition to the standard barriers, the behavioral obstacles are significant. The former are largely beyond our control, but the latter might be alleviated at a much smaller cost by simple interventions such as information or persuasion messages sent by email or other media. Thus, the second stage of the project consists of designing interventions to alleviate the behavioral barriers and testing them against each other. The precise design of the interventions will be determined based on the survey’s results.
A priori, there would be two sets of interventions: information to correct “external beliefs” (e.g. about the returns to training), and information or persuasion messages targeting jobseekers’ “internal beliefs” (in particular about self-efficacy and self-esteem). We will test these interventions and their combination on a large sample through an RCT and explore the heterogeneity of the impacts along several dimensions. In particular, we would like to study the effects of our interventions depending on the ex-ante propensity to enter training; to do so, we plan to use the extensive datasets of “Pôle emploi” to predict entry into training with machine learning tools. Finally, if the survey reveals substantial variation in the behavioral barriers faced by jobseekers, we will explore more targeted interventions. The basic idea would be to run a short baseline survey on treated individuals to identify the prevailing barrier for each of them and then tailor the intervention accordingly. This would also make it possible to examine the heterogeneity of the impacts across the different types of behavioral barriers.

*History or Status:* Data collection is ongoing (the survey was launched in February 2020).