Journées de la statistique I2E 2026

Insee-ENSAE-ENSAI, at ENSAI, Rennes - 9 & 10 June 2026

Programme

Each presentation lasts 40 minutes, including questions.

Tuesday 9 June

13:30: welcome
14:00: welcome address by Corinne Prost, Director of Methodology and Statistical and International Coordination at Insee
14:10 - 15:30: Sébastien Da Veiga (ENSAI), Julien Jamme (Insee)
15:30 - 16:40: poster session
16:40 - 18:00: Johann Faouzi (ENSAI), Matthieu Lerasle (ENSAE)
19:00: dinner

Wednesday 10 June

9:15: welcome
9:30 - 10:50: Arnak Dalalyan (ENSAE), Jean Rubin (Insee)
10:50 - 11:10: coffee break
11:10 - 12:30: Sébastien Herbreteau (ENSAI), Jaouad Mourtada (ENSAE)
12:30 - 13:30: lunch
13:30 - 14:50: Meilame Tayebjee (Insee), Emmanuel Pilliat (ENSAI)
14:50 - 15:10: coffee break
15:10 - 16:30: Austin Stromme (ENSAE), Lionel Truquet (ENSAI)

Organisation

François Portier (ENSAI)
Valérie Ledonné (ENSAI)

Titles and abstracts of the talks

Sébastien Da Veiga (ENSAI)

Brief introduction to conformal prediction, with a discussion on recent research challenges

Conformal prediction has recently emerged as a promising and popular framework for producing confidence intervals around predictions with no assumptions on the data distribution and without relying on asymptotics in the number of observations. In this talk we will start by introducing the basics of conformal prediction, and discuss the numerous extensions that have been proposed to widen its practical applicability and computability (cross-validation, adaptivity, asymmetry, …). We will also discuss open research questions in this field.
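As a rough illustration of the basics mentioned in this abstract, here is a minimal sketch of split conformal prediction with absolute residuals as the conformity score. The toy data, the fixed `model` function and the helper names are assumptions made for the example; they are not taken from the talk.

```python
import math
import random


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Return a (1 - alpha) prediction interval from calibration residuals."""
    n = len(cal_residuals)
    # Rank of the conformal quantile among the sorted calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    q = sorted(cal_residuals)[k - 1]
    return (prediction - q, prediction + q)


rng = random.Random(0)


def model(x):
    # Hypothetical pre-trained predictor; any fitted model could be used here.
    return 2.0 * x


def draw(n):
    # Toy data: y = 2x + standard Gaussian noise.
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2.0 * x + rng.gauss(0, 1)))
    return data


calibration = draw(500)
test_points = draw(2000)

# Conformity scores on the held-out calibration set.
residuals = [abs(y - model(x)) for x, y in calibration]

covered = 0
for x, y in test_points:
    lo, hi = split_conformal_interval(residuals, model(x), alpha=0.1)
    covered += lo <= y <= hi
coverage = covered / len(test_points)
print(f"empirical coverage: {coverage:.3f}")
```

The interval has constant width here, which is one motivation for the adaptive extensions the abstract alludes to.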

Julien Jamme (Insee)

Perturbation of magnitude tables with a driven utility-risk perspective

Protecting magnitude tables against disclosure rests on a trade-off between statistical utility and respondent confidentiality, which existing methods formalise unevenly. Mechanisms targeting dominant contributors are operationally light but lack an explicit link between their parameters and the risk-utility trade-off, whereas differential privacy offers a complete probabilistic formalism at the cost of an interpretability gap and a high sensitivity for magnitude statistics. This paper proposes a perturbation mechanism designed to reconcile three desirable properties: explicit targeting of the classical disclosure scenarios, an analytical and interpretable link between parameters and the risk-utility trade-off, and large-scale practicability. The mechanism multiplies each total by a Gaussian noise factor combining a dominance-driven component and a general-purpose component, governed by only three parameters. From the resulting distribution we derive, in closed form, a measure of information loss and risk measures for three attack scenarios (external inference, internal inference and differencing), each transposing a statistical disclosure rule. These expressions make calibration a purely analytical decision, requiring neither simulation nor recalibration on the data, and yield a structured, step-by-step calibration procedure. Random keys are handled in a continuous setting through SHA-512 hashing.
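The abstract does not spell out how the hashed keys feed the noise, so the following is only one plausible sketch. It assumes a cell key is hashed with SHA-512, mapped to a uniform value in (0, 1), turned into a standard Gaussian draw by the inverse CDF, and applied multiplicatively. The single `sigma` parameter is a hypothetical stand-in for the three parameters of the actual mechanism, and the key format is invented for the example.

```python
import hashlib
from statistics import NormalDist


def key_to_uniform(cell_key: str) -> float:
    """Map a key to a reproducible value strictly inside (0, 1)."""
    digest = hashlib.sha512(cell_key.encode("utf-8")).digest()
    # Keep 52 bits so that the value is represented exactly as a float.
    value = int.from_bytes(digest[:8], "big") >> 12
    return (value + 0.5) / 2**52


def perturb_total(total: float, cell_key: str, sigma: float = 0.05) -> float:
    """Multiply a cell total by a key-driven Gaussian factor (illustrative)."""
    u = key_to_uniform(cell_key)
    z = NormalDist().inv_cdf(u)  # standard Gaussian draw derived from the key
    return total * (1.0 + sigma * z)


# The same key always yields the same perturbed value.
first = perturb_total(1000.0, "region=35|sector=C")
second = perturb_total(1000.0, "region=35|sector=C")
other = perturb_total(1000.0, "region=35|sector=D")
print(first, second, other)
```

Deriving the noise from a hash of the key makes the perturbation reproducible: publishing the same cell twice gives the same value, so the noise cannot be averaged out by repeated queries.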

Johann Faouzi (ENSAI)

Time series clustering with CLUES-WEASEL

Time series data are common in many real-world applications and domains, with increasing interest in automated information extraction using machine learning. One such task is time series clustering, which consists in identifying clusters among a set of time series in an unsupervised fashion. Most time series clustering algorithms face the same trade-off: they sacrifice clustering performance for faster runtimes, or vice versa. We present a novel time series clustering algorithm that we call CLUES-WEASEL, which stands for CLustering with the UnsupervisEd Second version of Word ExtrAction for time SEries cLassification. CLUES-WEASEL extracts features using the unsupervised version of the transformation step of WEASEL 2.0, which is a time series classification algorithm, then reduces these features using principal component analysis, and finally performs clustering with the k-means algorithm using these reduced extracted features. Through extensive experiments, we provide evidence that CLUES-WEASEL is significantly better than any other existing time series clustering algorithm while being (much) faster than any state-of-the-art one. We also show that the architecture of CLUES-WEASEL can work well with other time series feature extraction algorithms. Our findings highlight the relevance of CLUES-WEASEL for time series clustering.

Matthieu Lerasle (ENSAE)

Classification bounds for the MLE in logistic regression

Logistic regression is an elementary model for classification. The asymptotic behaviour of the maximum likelihood estimator (MLE) is described by Wilks' theorem, which ensures that its excess risk is of order d/n, where d is the number of covariates and n the number of observations. Zhang's lemma is a result that transfers excess risk bounds into classification bounds. Applying this standard recipe shows that the classification risk of the MLE in logistic regression is at worst of order \sqrt{d/n}. When the design is Gaussian, the model satisfies a margin condition that can be exploited to show that this risk is in fact at worst of order (d/n)^{2/3}.

In this talk, I will show that Zhang's lemma can be sharpened further in this problem thanks to a 2D margin condition.

Combining this result with recent sharp non-asymptotic excess risk bounds for the MLE in logistic regression, we deduce that the classification risk of this estimator is in fact of the optimal order d/n.

This is joint work with H. Chardon and J. Mourtada.

Arnak Dalalyan (ENSAE)

A Simple Proof of Improved Wasserstein Bounds for Langevin Monte Carlo

I will present a simple and sharp analysis of the Langevin Monte Carlo algorithm. The main theorem provides a non-asymptotic upper bound on the Wasserstein-2 error under strong convexity and smoothness assumptions. The proof is shorter than existing ones and reveals that the discretization error is controlled by an average of coordinate-wise smoothness constants, rather than by the worst-case smoothness parameter. I will discuss the resulting improvement in the mixing-time bound, compare it with prior work, and show how the argument extends to variable step-size schemes.
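For readers unfamiliar with the algorithm, the sketch below runs the standard Langevin Monte Carlo iteration x_{k+1} = x_k - h ∇f(x_k) + \sqrt{2h} ξ_k on a strongly convex and smooth quadratic potential. The potential, its curvatures, the step size h and the chain length are illustrative assumptions, not choices taken from the talk.

```python
import math
import random
from statistics import pvariance


def lmc(grad, x0, step, n_steps, rng):
    """Run Langevin Monte Carlo and return the list of visited points."""
    x = list(x0)
    path = []
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        g = grad(x)
        # Gradient step plus independent Gaussian noise on each coordinate.
        x = [xi - step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        path.append(x)
    return path


# Potential f(x) = sum_i a_i * x_i^2 / 2, so the target is Gaussian with
# coordinate variances 1 / a_i (strong convexity 1, smoothness 4).
CURVATURES = (1.0, 4.0)


def grad_potential(x):
    return [a * xi for a, xi in zip(CURVATURES, x)]


rng = random.Random(0)
path = lmc(grad_potential, x0=[3.0, -3.0], step=0.05, n_steps=50_000, rng=rng)
samples = path[5_000:]  # discard a burn-in period

var_0 = pvariance([s[0] for s in samples])
var_1 = pvariance([s[1] for s in samples])
print(f"variances: {var_0:.3f} (target 1.0), {var_1:.3f} (target 0.25)")
```

With a fixed step size the chain settles on a slightly biased distribution (here the second coordinate has stationary variance about 0.278 rather than 0.25); this discretization error is the kind of quantity that non-asymptotic Wasserstein bounds control.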

Jean Rubin (Insee)

With-replacement balanced sampling

Balanced sampling methods are a natural approach to leveraging known information in order to obtain better estimates at reduced cost. For example, they are used in France for the design of the Master Sample, which is a first-stage geographical sample representative of the entire territory and used among other things for the French population census. A standard way to produce a balanced sample is to use the “cube” method proposed by Deville and Tillé (2004), but this method focuses on constructing samples without replacement. With-replacement balanced sampling could help provide easy-to-use bootstrap variance estimates. Similarly, being able to select the same individual multiple times can simplify the construction of balanced stream-samples, i.e., individuals arriving progressively over time [Sunter (1977); Tillé (2019)].

We present methods for performing balanced sampling with replacement by generalizing the cube method in several ways. The first approach applies the cube method to a duplicated population, redistributing inclusion probabilities of individuals from the original population. The second approach involves relaxing the constraint that forces selection multiplicities to be less than unity in the standard cube method. The third approach proposes a generalization of the geometric interpretation of the cube method in the case of balanced sampling with replacement: at each stage of the flight phase of the with-replacement cube, there are in general no longer only two possible candidate states. Compared to the standard cube method, it is therefore also necessary to choose a probability distribution to randomly select the next state while remaining centered on the original state on average. Finally, we present a variance approximation formula for these types of sampling, in the style of Deville and Tillé (2005), and perform simulations to study its quality.

Sébastien Herbreteau (ENSAI)

Divergence-Free Neural Networks with Application to Image Denoising

We introduce a resource-efficient neural network architecture with zero divergence by design, adapted for high-dimensional problems. Our method is directly applicable to image denoising, for which divergence-free estimators are particularly well-suited for self-supervised learning, in accordance with Stein's unbiased risk estimation theory. Comparisons of our parameterization on popular denoising datasets demonstrate that it retains sufficient expressivity to remain competitive with other divergence-based approaches, while outperforming its counterparts when the noise level is unknown and varies across the training data.

Jaouad Mourtada (ENSAE)

Estimation of discrete distributions in relative entropy, and the deviations of the missing mass

We consider the problem of estimating a distribution over a finite alphabet from an i.i.d. sample, with accuracy measured in relative entropy (Kullback-Leibler divergence). While optimal bounds on the expected risk are known, high-probability guarantees remain less well-understood. First, we characterize the performance of the classical Laplace (add-one) estimator, obtaining matching upper and lower bounds on its performance and establishing its optimality among confidence-independent estimators. We then characterize the minimax-optimal high-probability risk and show that it is achieved by a simple confidence-dependent smoothing technique. Notably, the optimal non-asymptotic risk incurs an additional logarithmic factor compared to the ideal asymptotic rate. Next, motivated by modern regimes in which the alphabet size exceeds the sample size, we discuss methods that adapt to the sparsity of the underlying distribution. We introduce an estimator using data-dependent smoothing, for which we establish a high-probability risk bound depending on two effective sparsity parameters. As part of our analysis, we also derive a sharp high-probability upper bound on the "missing mass", namely the total probability of symbols that do not appear in the sample.
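To make the objects in this abstract concrete, here is a minimal sketch of the Laplace (add-one) estimator, its relative entropy risk, and the missing mass of a sample. The four-symbol distribution and the sample are invented for the example.

```python
import math
from collections import Counter


def laplace_estimate(sample, alphabet):
    """Add-one estimator: (count + 1) / (n + alphabet size)."""
    counts = Counter(sample)
    n, k = len(sample), len(alphabet)
    return {s: (counts[s] + 1) / (n + k) for s in alphabet}


def kl_divergence(p, q):
    """Relative entropy KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


def missing_mass(p, sample):
    """Total probability under p of the symbols absent from the sample."""
    seen = set(sample)
    return sum(prob for s, prob in p.items() if s not in seen)


# Toy distribution and sample (illustrative values only).
p = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
sample = ["a", "a", "b", "a", "c", "b", "a", "b"]

q = laplace_estimate(sample, list(p))
print("estimate:", q)
print("KL(p || q):", kl_divergence(p, q))
print("missing mass:", missing_mass(p, sample))
```

Because every symbol receives a positive estimated probability, the relative entropy stays finite even for symbols that never appear in the sample, such as "d" here.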

Meilame Tayebjee (Insee)

Public health policy evaluation via reinforcement learning and foundation models

Representing patient health trajectories as structured states for causal inference remains an open methodological challenge. We introduce a framework in which dense embeddings, generated by a foundational Transformer trained on longitudinal medical records, serve directly as Markovian states in an Off-Policy Evaluation (OPE) pipeline. These embeddings capture the full semantic and temporal complexity of individual healthcare pathways across the French population (67M individuals), enabling principled counterfactual reasoning over high-dimensional health histories. As a demonstration of this approach, we apply the framework to primary care policy (PCP): we show that anchored PCP access significantly reduces emergency entries, particularly among aging populations and those with chronic conditions, and quantify the causal cost of medical deserts by linking regional primary care deficits to excess adverse events.

Emmanuel Pilliat (ENSAI)

A Unified Framework for Infinitely Many-Armed Bandits

We study bandit problems where the sampling budget is far smaller than the number of arms, possibly infinite. Instead of minimizing simple regret, which requires the arm means to be bounded, we maximize the expected reward of the recommended arm, with guarantees that hold even for unbounded distributions. The analysis relies on a single quantity that captures the difficulty of recommending a good arm. The resulting upper bounds recover known rates, uncover new transition phenomena tied to the noise level, and give the first guarantees for unbounded distributions. The talk also offers algorithmic insights, including a practical refinement with strong empirical performance and an efficient implementation.

Austin Stromme (ENSAE)

On the implicit regularization of Langevin dynamics with projected noise

We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new light on the effects of symmetry on stochastic gradient descent for over-parametrized models. Our main result identifies a novel form of implicit regularization: when the initial and target densities are both invariant under the group action, Langevin dynamics with projected noise is equivalent in law to Langevin dynamics with isotropic diffusion but with an additional drift term proportional to the negative log volume of the group orbit. We prove this result by constructing a coupling of the two processes via a third process on the group itself, and identify the additional drift as the mean curvature of the orbits.

Lionel Truquet (ENSAI)

When Taylor Meets Taylor: Asymptotic Bias Corrections for Fluctuation Scaling

Taylor's power law of fluctuation scaling states that the variance of a stochastic quantity scales as a power of its mean, a relationship widely observed in ecology, finance, physics, and even in the distribution of prime numbers. In ecology, empirical estimation of this law typically proceeds through a log-log regression between empirical variances and empirical means computed from collections of time series observed across many units or locations. This naturally leads to a two-dimensional asymptotic framework in which both the number of units n and the time horizon T grow jointly large.

We show that estimating a growing number of empirical moments may induce non-negligible asymptotic biases closely related to the Neyman and Scott incidental parameter problem. Similar effects are well known in the asymptotic theory of panel data models, where the estimation of many nuisance parameters affects the limiting behaviour of structural estimators.

Using higher-order Taylor expansions of the estimating equations, we derive explicit analytical bias corrections and establish valid asymptotic inference for the resulting debiased estimators under general heterogeneity conditions. Several open questions and possible extensions are also discussed.
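The log-log regression described in this abstract can be sketched as follows. The example simulates n units observed over T periods, computes the empirical mean and variance of each unit, and regresses log variance on log mean. Exponential data are an assumption chosen because their variance equals the squared mean, so the true Taylor exponent is 2; the values of n and T are arbitrary, and no bias correction from the talk is implemented.

```python
import math
import random
from statistics import fmean, variance

rng = random.Random(0)
n_units, horizon = 50, 500  # n units, each observed over T = 500 periods

log_means, log_vars = [], []
for i in range(n_units):
    # Unit-specific scale, spread log-uniformly between 1 and 100.
    scale = 10 ** (2 * i / (n_units - 1))
    series = [rng.expovariate(1.0 / scale) for _ in range(horizon)]
    log_means.append(math.log(fmean(series)))
    log_vars.append(math.log(variance(series)))


def ols(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = ols(log_means, log_vars)
print(f"estimated Taylor exponent: {slope:.3f} (true value 2 in this simulation)")
```

Both regressors and responses are noisy empirical moments here, which is the setting in which the abstract reports that biases can arise when n and T grow together.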