Alexis Derumigny

Assistant Professor of Statistics

Research Interests

Dependence modeling, copulas, high-dimensional statistics, non-parametric statistics, kernel smoothing, statistical modeling of conditional distributions

Biography

Jobs

  • February 2021 – now: Assistant Professor of Statistics at Delft University of Technology (Delft, Netherlands)
  • August 2019 – January 2021: Researcher at the University of Twente (Enschede, Netherlands)
  • October 2016 – July 2019: Teaching Assistant (“Chargé de TD”) at ENSAE ParisTech (Palaiseau, France)
  • May 2016 – September 2016: Graduate Research Intern, CREST (Palaiseau, France)
  • June 2015 – January 2016: Quantitative Analyst Intern, Meteo Protect (Paris, France)

Education

Publications

Abstract: ” Several procedures have been recently proposed to test the simplifying assumption for conditional copulas. Instead of considering pointwise conditioning events, we study the constancy of the conditional dependence structure when some covariates belong to general Borel conditioning subsets. We introduce several test statistics based on the equality of conditional Kendall’s taus and derive their asymptotic distributions under the null hypothesis. In settings where such conditioning events are not fixed ex ante, we propose a data-driven procedure to recursively build such relevant subsets. This procedure is based on decision trees that maximize the differences between the conditional Kendall’s taus, which correspond to the leaves of the trees. Empirical results for such tests are illustrated in the supplementary materials. Moreover, a study of the conditional dependence between financial stock returns is presented, and highlights specific contagion effects of past returns. The last application deals with conditional dependence between coverage amounts in an insurance dataset. “

GitHub page of the package: https://github.com/AlexisDerumigny/CondCopulas
Also available on CRAN at: https://cran.r-project.org/package=CondCopulas. The algorithms proposed in this article are available in the R functions bCond.simpA.CKT() and bCond.treeCKT() of the CondCopulas package.

Abstract: ” We study the weak convergence of conditional empirical copula processes indexed by general families of conditioning events that have nonzero probabilities. Moreover, we also study the case where the conditioning events are chosen in a data-driven way. The validity of several bootstrap schemes is stated, including the exchangeable bootstrap. We define general multivariate measures of association, possibly given some fixed or random conditioning events. By applying our theoretical results, we prove the asymptotic normality of the estimators of such measures. We detail the link between pointwise conditional copulas and conditional copulas indexed by general events and their application in statistical methodology. We illustrate our results with financial data. “

Abstract: ” In recent years, implementing a circular economy in cities has been considered by policy makers as a potential solution for achieving sustainability. Existing literature on circular cities is mainly focused on two perspectives: urban governance and urban metabolism. Both these perspectives, to some extent, miss an understanding of space. A spatial perspective is important because circular activities, such as the recycling, reuse, or storage of materials, require space and have a location. It is therefore useful to understand where circular activities are located, and how they are affected by their location and surrounding geography. This study therefore aims to understand the existing state of waste reuse activities in the Netherlands from a spatial perspective, by analyzing the degree, scale, and locations of spatial clusters of waste reuse. This was done by measuring the spatial autocorrelation of waste reuse locations using global and local Moran’s I, with waste reuse data from the national waste registry of the Netherlands. The analysis was done for 10 material types: minerals, plastic, wood and paper, fertilizer, food, machinery and electronics, metal, mixed construction materials, glass, and textile. It was found that all materials except for glass and textiles formed spatial clusters. By varying the grid cell sizes used for data aggregation, it was found that different materials had different “best fit” cell sizes where spatial clustering was the strongest. The best fit cell size is ∼7 km for materials associated with construction and agricultural industries, and ∼20–25 km for plastic and metals. The best fit cell sizes indicate the average distance of companies from each other within clusters, and suggest a suitable spatial resolution at which the material can be understood. Hotspot maps were also produced for each material to show where reuse activities are most spatially concentrated. “
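For readers unfamiliar with the global Moran's I statistic mentioned in this abstract, here is a minimal sketch of the standard formula, I = (n / W) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)². The helper `line_adjacency` and the toy values are hypothetical illustrations; the study itself works with gridded waste-registry data and its own choice of spatial weights, neither of which is reproduced here.

```python
def global_morans_i(values, weights):
    """Global Moran's I of `values` for a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # W: sum of all spatial weights
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations between cells i and j
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    sum_sq = sum(d * d for d in dev)
    return (n / total_weight) * cross / sum_sq


def line_adjacency(n):
    """Hypothetical weights: n cells on a line, adjacent cells get weight 1."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


# A smooth gradient gives positive autocorrelation, an alternating
# pattern gives negative autocorrelation.
clustered = global_morans_i([1.0, 2.0, 3.0, 4.0], line_adjacency(4))
alternating = global_morans_i([1.0, 0.0, 1.0, 0.0], line_adjacency(4))
print(clustered, alternating)
```

Values near +1 suggest that similar values sit next to each other (clusters), values near −1 suggest a checkerboard-like pattern, and values near 0 suggest no spatial structure at the chosen resolution.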

Abstract: ” Meta-elliptical copulas are often proposed to model dependence between the components of a random vector. They are specified by a correlation matrix and a map g, called a density generator. While the correlation matrix can easily be estimated from pseudo-samples of observations, this is not the case for the density generator when it does not belong to a parametric family. We state sufficient conditions to non-parametrically identify this generator. Several nonparametric estimators of g are then proposed, by M-estimation, simulation-based inference or by an iterative procedure available in an R package. Some simulations illustrate the relevance of the latter method. “

GitHub page of the package: https://github.com/AlexisDerumigny/ElliptCopulas
Also available on CRAN at: https://cran.r-project.org/package=ElliptCopulas

Abstract: ” This paper deals with robust inference for parametric copula models. Estimation using Canonical Maximum Likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the Maximum Mean Discrepancy (MMD) principle. We derive non-asymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on [0,1]^d, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available. “

GitHub page of the package: https://github.com/AlexisDerumigny/MMDCopula
Also available on CRAN at: https://cran.r-project.org/package=MMDCopula

The R scripts to reproduce the simulations and the figures of the paper are available at https://github.com/AlexisDerumigny/Reproducibility-EstimationOfCopulasViaMMD.
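As a rough illustration of the MMD principle behind this paper, the sketch below computes the generic empirical squared MMD between two samples with a Gaussian kernel (V-statistic form). This is not the interface of the MMDCopula package and not the estimator studied in the article; the function names, the bandwidth and the toy samples are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u, v, bandwidth):
    """Gaussian kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=0.5):
    """Empirical squared MMD between samples xs and ys (V-statistic)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical pseudo-observations in [0,1]^2: one sample with positive
# dependence, one with negative dependence.
sample_a = [(0.1, 0.2), (0.4, 0.5), (0.8, 0.9)]
sample_b = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]

same = mmd_squared(sample_a, sample_a)
different = mmd_squared(sample_a, sample_b)
print(same, different)
```

In an MMD-based fit, one would pick the copula parameter whose simulated sample minimizes such a distance to the observed pseudo-observations; because only kernel evaluations are needed, no copula density is required.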

Abstract: ” Collection of accurate and representative data from agricultural fields is required for efficient crop management. Since growers have limited available resources, there is a need for advanced methods to select representative points within a field in order to best satisfy sampling or sensing objectives. The main purpose of this work was to develop a data-driven method for selecting locations across an agricultural field given observations of some covariates at every point in the field. These chosen locations should be representative of the distribution of the covariates in the entire population and represent the spatial variability in the field. They can then be used to sample an unknown target feature whose sampling is expensive and cannot be realistically done at the population scale.
An algorithm for determining these optimal sampling locations, namely the multifunctional matching (MFM) criterion, was based on matching of moments (functionals) between sample and population. The selected functionals in this study were standard deviation, mean, and Kendall’s tau. An additional algorithm defined the minimal number of observations that could represent the population according to a desired level of accuracy. The MFM was applied to datasets from two agricultural plots: a vineyard and a peach orchard. The data from the plots included measured values of slope, topographic wetness index, normalized difference vegetation index, and apparent soil electrical conductivity. The MFM algorithm selected the number of sampling points according to a representation accuracy of 90% and determined the optimal location of these points. The algorithm was validated against values of vine or tree water status measured as crop water stress index (CWSI). Algorithm performance was then compared to two other sampling methods: the conditioned Latin hypercube sampling (cLHS) model and a uniform random sample with spatial constraints. Comparison among sampling methods was based on measures of similarity between the target variable population distribution and the distribution of the selected sample.

GitHub page of the package: https://github.com/AlexisDerumigny/MFunctMatching

Abstract: ” Conditional Kendall’s tau is a measure of dependence between two random variables, conditionally on some covariates. We assume a regression-type relationship between conditional Kendall’s tau and some covariates, in a parametric setting with a large number of transformations of a small number of regressors. This model may be sparse, and the underlying parameter is estimated through a penalized criterion and a two-step inference procedure. We prove non-asymptotic bounds with explicit constants that hold with high probabilities. We derive the consistency of the latter estimator, its asymptotic law and some oracle properties. Some simulations and applications to real data conclude the paper. “

Abstract: ” We study nonparametric estimators of conditional Kendall’s tau, a measure of concordance between two random variables given some covariates. We prove non-asymptotic pointwise and uniform bounds, that hold with high probabilities. We provide “direct proofs” of the consistency and the asymptotic law of conditional Kendall’s tau. A simulation study evaluates the numerical performance of such nonparametric estimators. An application to the dependence between energy consumption and temperature conditionally on calendar days is finally provided. “
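The idea of a kernel-based estimator of conditional Kendall's tau can be sketched as follows: each observation receives a weight that depends on how close its covariate is to the conditioning point, and concordance signs of pairs are averaged with the product of these weights. The version below uses a Gaussian kernel and a normalization by 1 − Σ w_i², which reduces to the usual Kendall's tau when all weights are equal. It is an illustrative sketch with hypothetical names and data, not the CondCopulas implementation, and the exact estimators in the article may differ.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def conditional_kendalls_tau(xs, ys, zs, z0, bandwidth):
    """Kernel-weighted Kendall's tau of (X, Y) given Z = z0 (sketch)."""
    # Normalized Gaussian kernel weights centered at the conditioning point
    kernel = [math.exp(-0.5 * ((z - z0) / bandwidth) ** 2) for z in zs]
    total = sum(kernel)
    w = [k / total for k in kernel]
    n = len(xs)
    # Weighted average of concordance signs over ordered pairs i != j
    num = sum(w[i] * w[j] * sign((xs[i] - xs[j]) * (ys[i] - ys[j]))
              for i in range(n) for j in range(n) if i != j)
    return num / (1.0 - sum(wi * wi for wi in w))


# Hypothetical data: Y is an increasing (resp. decreasing) function of X.
xs = [0.1, 0.5, 0.3, 0.9, 0.7]
zs = [0.0, 0.25, 0.5, 0.75, 1.0]
tau_up = conditional_kendalls_tau(xs, [2 * x for x in xs], zs, 0.5, 0.3)
tau_down = conditional_kendalls_tau(xs, [-x for x in xs], zs, 0.5, 0.3)
print(tau_up, tau_down)
```

With perfectly monotone data the estimate is 1 or −1 whatever the weights; with real data, the bandwidth controls how local the conditioning is.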

Abstract: ” It is shown how the problem of estimating conditional Kendall’s tau can be rewritten as a classification task. Conditional Kendall’s tau is a conditional dependence parameter that is a characteristic of a given pair of random variables. The goal is to predict whether the pair is concordant (value of 1) or discordant (value of -1) conditionally on some covariates. The consistency and the asymptotic normality of a family of penalized approximate maximum likelihood estimators are proven, including the equivalent of the logit and probit regressions in our framework. Specific algorithms are detailed, adapting usual machine learning techniques, including nearest neighbors, decision trees, random forests and neural networks, to the setting of the estimation of conditional Kendall’s tau. Finite sample properties of these estimators and their sensitivities to each component of the data-generating process are assessed in a simulation study. Finally, all these estimators are applied to a dataset of European stock indices. “

Abstract: ” Extending the results of Bellec, Lecué and Tsybakov to the setting of sparse high-dimensional linear regression with unknown variance, we show that two estimators, the Square-Root Lasso and the Square-Root Slope can achieve the optimal minimax prediction rate, which is (s/n) log(p/s), up to some constant, under some mild conditions on the design matrix. Here, n is the sample size, p is the dimension and s is the sparsity parameter. We also prove optimality for the estimation error in the l_q-norm, with q in [1,2] for the Square-Root Lasso, and in the l_2 and sorted l_1 norms for the Square-Root Slope. Both estimators are adaptive to the unknown variance of the noise. The Square-Root Slope is also adaptive to the sparsity s of the true parameter. Next, we prove that any estimator depending on s which attains the minimax rate admits a version adaptive to s that still attains the same rate. We apply this result to the Square-Root Lasso. Moreover, for both estimators, we obtain valid rates for a wide range of confidence levels, and improved concentration properties as in [Bellec, Lecué and Tsybakov, 2017] where the case of known variance is treated. Our results are non-asymptotic.”

Abstract: ” We discuss the so-called “simplifying assumption” of conditional copulas in a general framework. We introduce several tests of the latter assumption for non- and semiparametric copula models. Some related test procedures based on conditioning subsets instead of point-wise events are proposed. The limiting distribution of such test statistics under the null are approximated by several bootstrap schemes, most of them being new. We prove the validity of a particular semiparametric bootstrap scheme. Some simulations illustrate the relevance of our results. “

Preprints and submitted articles

Abstract: ” In this article, we obtain explicit bounds on the uniform distance between the cumulative distribution function of a standardized sum S_n of n independent centered random variables with moments of order four and its first-order Edgeworth expansion. Those bounds are valid for any sample size with n^{-1/2} rate under moment conditions only and n^{-1} rate under additional regularity constraints on the tail behavior of the characteristic function of S_n. In both cases, the bounds are further sharpened if the variables involved in S_n are unskewed. We also derive new Berry-Esseen-type bounds from our results and discuss their links with existing ones. We finally apply our results to illustrate the lack of finite-sample validity of one-sided tests based on the normal approximation of the mean. “
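To make the object of study concrete, the sketch below evaluates the classical first-order Edgeworth expansion of the distribution function of a standardized sum, Φ(x) + λ₃ (1 − x²) φ(x) / (6 √n), where λ₃ is the skewness. It assumes i.i.d. summands for simplicity, whereas the article covers independent, not necessarily identically distributed variables; the explicit bounds derived in the article are not reproduced here, and the function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def edgeworth_cdf(x, n, skewness):
    """First-order Edgeworth approximation of P(S_n <= x), i.i.d. case."""
    correction = skewness * (1.0 - x * x) * normal_pdf(x) / (6.0 * math.sqrt(n))
    return normal_cdf(x) + correction


# Hypothetical setting: n = 25 observations with skewness 1.5.
at_zero = edgeworth_cdf(0.0, 25, 1.5)
at_one = edgeworth_cdf(1.0, 25, 1.5)      # the correction vanishes at |x| = 1
unskewed = edgeworth_cdf(0.7, 25, 0.0)    # no correction without skewness
print(at_zero, at_one, unskewed)
```

The correction term is of order n^{-1/2} and disappears for unskewed variables, which is consistent with the sharper bounds mentioned in the abstract for that case.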

Abstract: ” Kendall’s tau and conditional Kendall’s tau matrices are multivariate (conditional) dependence measures between the components of a random vector. For large dimensions, available estimators are computationally expensive and can be improved by averaging. Under structural assumptions on the underlying Kendall’s tau and conditional Kendall’s tau matrices, we introduce new estimators that have a significantly reduced computational cost while keeping a similar error level. In the unconditional setting we assume that, up to reordering, the underlying Kendall’s tau matrix is block-structured with constant values in each of the off-diagonal blocks. Consequences on the underlying correlation matrix are then discussed. The estimators take advantage of this block structure by averaging over (part of) the pairwise estimates in each of the off-diagonal blocks. Derived explicit variance expressions show their improved efficiency. In the conditional setting, the conditional Kendall’s tau matrix is assumed to have a constant block structure, independently of the conditioning variable. Conditional Kendall’s tau matrix estimators are constructed similarly as in the unconditional case by averaging over (part of) the pairwise conditional Kendall’s tau estimators. We establish their joint asymptotic normality, and show that the asymptotic variance is reduced compared to the naive estimators. Then, we perform a simulation study which displays the improved performance of both the unconditional and conditional estimators. Finally, the estimators are used for estimating the value at risk of a large stock portfolio; backtesting illustrates the obtained improvements compared to the previous estimators. “

Abstract: ” We consider the least-squares regression problem with unknown noise variance, where the observed data points are allowed to be corrupted by outliers. Building on the median-of-means (MOM) method introduced by Lecué and Lerasle (Ann. Statist. 48(2):906-931, April 2020) in the case of known noise variance, we propose a general MOM approach for simultaneous inference of both the regression function and the noise variance, requiring only an upper bound on the noise level. Interestingly, this generalization requires care due to regularity issues that are intrinsic to the underlying convex-concave optimization problem. In the general case where the regression function belongs to a convex class, we show that our simultaneous estimator achieves with high probability the same convergence rates and a similar risk bound as if the noise level was unknown, as well as convergence rates for the estimated noise standard deviation. In the high-dimensional sparse linear setting, our estimator yields a robust analog of the square-root LASSO. Under weak moment conditions, it jointly achieves with high probability the minimax rates of estimation s^{1/p} ((1/n) log(p/s))^{1/2} for the ℓ_p-norm of the coefficient vector, and the rate ((s/n) log(p/s))^{1/2} for the estimation of the noise standard deviation. Here n denotes the sample size, p the dimension and s the sparsity level. We finally propose an extension to the case of unknown sparsity level s, providing a jointly adaptive estimator (β̃, σ̃, s̃). It simultaneously estimates the coefficient vector, the noise level and the sparsity level, with proven bounds on each of these three components that hold with high probability. “

    Abstract: ” It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to which extent the bias-variance trade-off is unavoidable and allows to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or chi-square divergence. Some of these inequalities rely on a new concept of information matrices. In a second part of the article, the abstract lower bounds are applied to several statistical models including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we propose to combine the general strategy for lower bounds with a reduction technique. This allows us to reduce the original problem to a lower bound on the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. To highlight possible extensions of the proposed framework, we moreover briefly discuss the trade-off between bias and mean absolute deviation. “

    Abstract: ” In econometrics, many parameters of interest can be written as ratios of expectations. The main approach to construct confidence intervals for such parameters is the delta method. However, this asymptotic procedure yields intervals that may not be relevant for small sample sizes or, more generally, in a sequence-of-model framework that allows the expectation in the denominator to decrease to 0 with the sample size. In this setting, we prove a generalization of the delta method for ratios of expectations and the consistency of the nonparametric percentile bootstrap. We also investigate finite-sample inference and show a partial impossibility result: nonasymptotic uniform confidence intervals can be built for ratios of expectations but not at every level. Based on this, we propose an easy-to-compute index to appraise the reliability of the intervals based on the delta method. Simulations and an application illustrate our results and the practical usefulness of our rule of thumb. “
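The delta-method interval that this abstract takes as its starting point can be sketched as follows for a ratio E[X]/E[Y]: the asymptotic variance of the ratio of sample means is (σ_X² − 2 r σ_XY + r² σ_Y²) / (n μ_Y²), with r = μ_X/μ_Y. The code is a textbook illustration with hypothetical names and data; the reliability index and the bootstrap procedure proposed in the article are not implemented here.

```python
import math


def ratio_delta_ci(xs, ys, z=1.96):
    """Delta-method confidence interval for E[X] / E[Y] (sketch).

    Returns (ratio, lower, upper); z = 1.96 targets a nominal 95% level.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, ys)) / (n - 1)
    ratio = mean_x / mean_y
    # Estimated asymptotic variance of the ratio of sample means
    var_ratio = (var_x - 2.0 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    half_width = z * math.sqrt(max(var_ratio, 0.0))  # guard against rounding
    return ratio, ratio - half_width, ratio + half_width


# Hypothetical check: with a constant denominator equal to 1, the interval
# reduces to the usual normal-approximation interval for the mean of X.
ratio, lower, upper = ratio_delta_ci([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5)
print(ratio, lower, upper)
```

The division by the squared sample mean of Y shows why such intervals become fragile when the denominator is close to 0, which is the regime the abstract is concerned with.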

    Abstract: ” U-statistics constitute a large class of estimators, generalizing the empirical mean of a random variable X to sums over every k-tuple of distinct observations of X. They may be used to estimate a regular functional θ(P_X) of the law of X. When a vector of covariates Z is available, a conditional U-statistic may describe the effect of z on the conditional law of X given Z=z, by estimating a regular conditional functional θ(P_{X|Z=·}). We prove concentration inequalities for conditional U-statistics. Assuming a parametric model of the conditional functional of interest, we propose a regression-type estimator based on conditional U-statistics. Its theoretical properties are derived, first in a non-asymptotic framework and then in two different asymptotic regimes. Some examples are given to illustrate our methods. “

    Conferences & communications:

    2020:

    2019:

    2018:

    2017:

    2016:

    • Inference of elliptical copula generators, with Jean-David Fermanian, Invited speaker at the 9th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics2016, Seville, Spain, 9-11 December 2016).

    Teaching

    2020 – 2021:

    • Introduction to Statistics (University of Twente, ATLAS Bachelor program 3rd Semester)

    2018 – 2019:

    • Numerical Analysis (ENSAE 1st year)
    • Probability Theory ; C++ ; Mathematical Statistics 1 (ENSAE 2nd year)
    • Time Series ; Financial Econometrics (ENSAE 3rd year)

    2017 – 2018:

    • Probability Theory ; Numerical Analysis (ENSAE 1st year)
    • C++ (ENSAE 2nd year)
    • Time Series ; Financial Econometrics (ENSAE 3rd year)

    2016 – 2017:

    • Analysis and Topology ; Convex Optimization ; Numerical Analysis (ENSAE 1st year)
    • Financial Econometrics (ENSAE 3rd year)