27
Glucodensities: a new representation of glucose profiles using distributional data analysis Marcos Matabuena 1,2,* , Alexander Petersen 3 , Juan C.Vidal 2 and Francisco Gude 1 1 Unidad de Epidemiolog´ ıa Cl´ ınica, Hospital C´ ınico Universitario de Santiago de Compostela, Spain 2 CiTIUS (Centro Singular de Investigaci´on en Tecnolox´ ıas Intelixentes), Universidade de Santiago de Compostela, Spain 3 Department of Statistics and Applied Probability, University of California, Santa Barbara * [email protected] August 19, 2020 Abstract Biosensor data has the potential ability to improve disease control and detection. However, the analysis of these data under free-living conditions is not feasible with current statistical techniques. To address this challenge, we introduce a new functional representation of biosen- sor data, termed the glucodensity, together with a data analysis framework based on distances between them. The new data analysis procedure is illustrated through an application in dia- betes with continuous-time glucose monitoring (CGM) data. In this domain, we show marked improvement with respect to state of the art analysis methods. In particular, our findings demonstrate that i) the glucodensity possesses an extraordinary clinical sensitivity to capture the typical biomarkers used in the standard clinical practice in diabetes, ii) previous biomark- ers cannot accurately predict glucodensity, so that the latter is a richer source of information, and iii) the glucodensity is a natural generalization of the time in range metric, this being the gold standard in the handling of CGM data. Furthermore, the new method overcomes many of the drawbacks of time in range metrics, and provides deeper insight into assessing glucose metabolism. 1 Introduction The steadily increasing availability and prominence of biosensor data have given rise to new method- ological challenges for their statistical analysis. A primary feature of these data is that the moni- tored individuals are in free-living conditions, making a direct analysis of the recorded time series between groups of patients problematic if not infeasible. A clear example of such data is found in the study of diabetes, where continuous glucose monitoring (CGM) is increasingly used. The elevation of glucose is distinct between individuals and is influenced by factors such as mealtimes, diet composition, or physical exercise (Ewings et al., 2015). Consequently, an exciting topic of debate is how to exploit the enormous wealth of information recorded by CGM to draw more re- liable conclusions about the glucose homeostasis rather than the cursory summary measures such as fasting plasma glucose (FPG) or glycated hemoglobin (A1c) (Zaccardi and Khunti, 2018). 1 arXiv:2008.07840v1 [stat.AP] 18 Aug 2020

Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Glucodensities: a new representation of glucose profiles using

distributional data analysis

Marcos Matabuena1,2,∗, Alexander Petersen3, Juan C.Vidal2 and Francisco Gude1

1Unidad de Epidemiologıa Clınica, Hospital Cınico Universitario de Santiago deCompostela, Spain

2CiTIUS (Centro Singular de Investigacion en Tecnoloxıas Intelixentes), Universidade deSantiago de Compostela, Spain

3Department of Statistics and Applied Probability, University of California, Santa Barbara∗[email protected]

August 19, 2020

Abstract

Biosensor data has the potential ability to improve disease control and detection. However,the analysis of these data under free-living conditions is not feasible with current statisticaltechniques. To address this challenge, we introduce a new functional representation of biosen-sor data, termed the glucodensity, together with a data analysis framework based on distancesbetween them. The new data analysis procedure is illustrated through an application in dia-betes with continuous-time glucose monitoring (CGM) data. In this domain, we show markedimprovement with respect to state of the art analysis methods. In particular, our findingsdemonstrate that i) the glucodensity possesses an extraordinary clinical sensitivity to capturethe typical biomarkers used in the standard clinical practice in diabetes, ii) previous biomark-ers cannot accurately predict glucodensity, so that the latter is a richer source of information,and iii) the glucodensity is a natural generalization of the time in range metric, this being thegold standard in the handling of CGM data. Furthermore, the new method overcomes manyof the drawbacks of time in range metrics, and provides deeper insight into assessing glucosemetabolism.

1 Introduction

The steadily increasing availability and prominence of biosensor data have given rise to new method-ological challenges for their statistical analysis. A primary feature of these data is that the moni-tored individuals are in free-living conditions, making a direct analysis of the recorded time seriesbetween groups of patients problematic if not infeasible. A clear example of such data is foundin the study of diabetes, where continuous glucose monitoring (CGM) is increasingly used. Theelevation of glucose is distinct between individuals and is influenced by factors such as mealtimes,diet composition, or physical exercise (Ewings et al., 2015). Consequently, an exciting topic ofdebate is how to exploit the enormous wealth of information recorded by CGM to draw more re-liable conclusions about the glucose homeostasis rather than the cursory summary measures suchas fasting plasma glucose (FPG) or glycated hemoglobin (A1c) (Zaccardi and Khunti, 2018).

1

arX

iv:2

008.

0784

0v1

[st

at.A

P] 1

8 A

ug 2

020

Page 2: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Since 2010, the American Diabetes Association (ADA) has included measurement of A1c levelsto both diagnosis and diabetes control Association et al. (2018). A1c levels reflect underlying glucoselevels over the preceding 2–3 months, testing is convenient because blood samples can be obtained atany time of day, overnight fasting is not required, and A1c within-patient reproducibility is superiorto that of fasting plasma glucose and oral glucose tolerance tests (OGTTs) Selvin et al. (2007).However, recent articles have provided evidence for the need to go beyond A1c and use new measuresfor glycemic control (Group, 2018; Bergenstal, 2015) in order to capture more diverse aspects of thetemporally evolving glucose levels beyond the average, for example, glucose variability and time inrange metrics. The time in metric range measures the proportion of time an individual’s glucoselevels is maintained in different target zones. In the case of diabetes, these can include rangescorresponding to hypoglycemia and hyperglycemia. In an innovative article, Beck et al. (2019)validated the time in range metric, showing that it is a good predictor of long-term microvascularcomplications despite just measuring glucose values seven times per day. Lu et al. (2018) reachedsimilar conclusions but using CGM technology only for 24 hours in each patient. At the same time,it is well-known that two patients may have the same glycosylated hemoglobin and a completelydifferent glycemic profile (Beck et al., 2017). These new approaches and findings have lead clinicalspecialists to consider that continuous glucose measurement during long monitoring periods canlead to more accurate results in research and clinical practice than in standard methods (Hirschet al., 2019). In fact, since 2012, the European Medicine Agency (for Medicinal Products forHuman Use et al., 2012) recommends the use of CGM to validate the effect of drugs for treatmentor prevention of diabetes mellitus.

Traditionally, CGM was designed for the risk management in real-time for type 1 diabetes andthe control of glucose values with insulin pumps (Kovatchev et al., 2009; Feig et al., 2017; DiMeglioet al., 2018). Notwithstanding, more recent applications of CGM have been more general. Theyinvolve, for example, screening patients, optimizing diet, epidemiological studies, assessing patientprognosis, and supporting treatment prescription, and have even been used in healthy populations(Freeman and Lyons, 2008; Hall et al., 2018). In addition to the increasing utility of CGM data,the technology is gradually becoming cheaper, and new devices capable of measuring glucose in anon-invasive way, for example, with glasses (Nichols et al., 2013), are quickly emerging. All of theseadvances are facilitating the adoption of CGM in standard clinical practice.

In 2012, a panel of experts discussed how to represent CGM data in an “easy to view format”(Bergenstal et al., 2013). They also analyzed the convenience of using glycemic variability measuresand other summary measures such as time in range to extract the recorded information from CGM.In 2019, ADA established an updated version of clinical standards to use and define target zoneswith time in range metrics (Battelino et al., 2019). In a more recent review about the CGM metric,they establish time in range as a gold standard measure (Nguyen et al., 0).

Motivated by the problem of analyzing data gathered via CGM more precisely, while stillleveraging the advantages possessed by time in range metrics, we propose an approach based on theconstruction of a functional profile of glucose values for each subject. Conceptually, the approachis a natural extension of time in range metrics in which the ranges shrink and size and increase innumber, so that new profile effectively measures the proportion of time each patient spends at eachspecific glucose concentration rather than a coarsely defined range. As a result of this, the newfunctional profile, which we refer to as a glucodensity, automatically and simultaneously capturesall parameters arising from individual glucose distributions. Figure 1 illustrates a set of constructedglucodensities that represent the data objects for which we will propose the use of a tailored set of

2

Page 3: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

statistical methods.Mathematically, glucodensities constitute functional distributional data since each glucodensity

represents a distribution of glucose concentrations. As such, these complex and constrained curvescannot be directly analyzed with the usual techniques. To overcome this, we introduce a frameworkfor the analysis of glucodensities by compiling suitable methods that are based on the calculationof glucodensities distances. We also reveal the superior clinical capacity of our representationcompared to classical measures of diabetes. Finally, we demonstrate that our representation hasa higher sensitivity than the standard time in range metric to explain the glycemic differencesbetween patients in various settings, including regression analysis. A new shiny interface to use themethods outlined in this paper is available at https://tec.citius.usc.es/diabetes.

50 100 150 200 250 300 350 400

0.00

0.02

0.04

0.06

Glucodensities

Glucose, mg/dL

Den

sity

Figure 1: Example of a set of glucodensities estimated from a random sample of the AEGISpopulation-based study

1.1 Outline

The structure of this paper is as follows. First, we briefly describe the AEGIS study, and themethods used. We then formally introduce the concept of glucodensity, the estimation methods,and some essential statistical background to understand the statistical procedures introduced inthe paper. Subsequently, we explain the regression models used in the validation of the repre-

3

Page 4: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Biomarker Clinical significance

A1cin diabetes diagnosis and control

Gold standard marker

HOMA-IRresistance and β-cell function

Measurements to quantify insulin

CONGA

MODDglucose variability

Summary indices of

MAGE

Table 1: Clinical importance of biomarkers used in the statistical analysis

sentation. Afterward, we show the results that demonstrate the superiority of glucodensity overglucose representations of state-art. Then, we illustrate the use with real data of the glucodensitiesmethodology in two-sample testing and cluster analysis. Finally, we discuss the clinical implica-tions of these results, their limitations, and the new perspectives of the glucodensities method inmedicine and device technology.

2 Sample and Procedures

2.0.1 Study design

A subset of the subjects in the A Estrada Glycation and Inflammation Study (AEGIS; trialNCT01796184 at www.clinicaltrials.gov) provided the sample for the present work. In thelatter cross-sectional study, an age-stratified random sample of the population (aged ≥ 18) wasdrawn from Spain’s National Health System Registry. A detailed description has been publishedelsewhere (Gude et al., 2017). For a one year beginning in March, subjects were periodically exam-ined at their primary care centre where they (i) completed an interviewer-administered structuredquestionnaire; (ii) provided a lifestyle description; (iii) were subjected to biochemical measure-ments; and (iv) were prepared for CGM (lasting 6 days). The subjects who made up the presentsample were the 581 (361 women, 220 men) who completed at least 2 days of monitoring, out ofan original 622 persons who consented to undergo a 6-day period of CGM. Another 41 originalsubjects were withdrawn from the study due to non-compliance with protocol demands (n = 4) ordifficulties in handling the device (n = 37). The characteristics of the participants are shown onthe Table 2.

2.0.2 Ethical approval and informed consent

The present study was reviewed and approved by the Clinical Research Ethics Committee fromGalicia, Spain (CEIC2012-025). Written informed consent was obtained from each participant inthe study, which conformed to the current Helsinki Declaration.

2.0.3 Laboratory determinations

Glucose was determined in plasma samples from fasting participants by the glucose oxidase per-oxidase method. A1c was determined by high-performance liquid chromatography in a Menarini

4

Page 5: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Men (n = 220) Women (n = 361)

Age, years 47.8± 14.8 48.2± 14.5A1c, % 5.6± 0.9 5.5± 0.7FPG mg/dL 97± 23 91± 21HOMA-IR mg/dL.µUI/m 3.97± 5.56 2.74± 2.47BMI kg/m2 28.9± 4.7 27.7± 5.3CONGA mg/dL 0.88± 0.40 0.86± 0.36MAGE mg/dL 33.6± 22.3 31.2± 14.6MODD 0.84± 0.58 0.77± 0.33

Table 2: Characteristics of AEGIS study participants by sex. Mean and standard deviation areshown. BMI - body mass index; FPG - fasting plasma glucose; A1c - glycated haemoglobin;HOMA − IR - homeostasis model assessment-insulin resistance; CONGA - glycemic variabilityin terms of continuous overall net glycemic action; MODD - mean of daily differences; MAGE -mean amplitude of glycemic excursions.

Diagnostics HA-8160 analyser; all A1c values were converted to DCCT-aligned values (Hoelzel et al.,2004). Insulin resistance was estimated using the homeostasis model assessment method (HOMA-IR) as the fasting concentration of plasma insulin (µ units/mL) × plasma glucose (mg/dL)/ 405(Matthews et al., 1985).

2.0.4 Glycaemic variability

Glycaemic variability was measured in terms of continuous overall net glycemic action (CONGA)(McDonnell et al., 2005), the mean amplitude of glycaemic excursions (MAGE) (Service et al.,1970), and the mean of the daily differences (MODD) (Molnar et al., 1972) in glucose concentration.

2.0.5 CGM Procedures

At the start of each monitoring period, a research nurse inserted a sensor (EnliteTM, Medtronic,Inc, Northridge, CA, USA) subcutaneously into the subject’s abdomen, and instructed him/her inthe use of the iProTM CGM device (Medtronic, Inc, Northridge, CA, USA). The sensor continu-ously measures the interstitial glucose level 40 − 400 (range mg/dL) of the subcutaneous tissue,recording values every 5 min. Participants were also provided with a conventional OneTouchRVerioR Pro glucometer (LifeScan, Milpitas, CA, USA) as well as compatible lancets and test stripsfor calibrating the CGM. All subjects were asked to make at least three capillary blood glucosemeasurements (usually before main meals). These readings were taken without checking the currentCGM reading. On the seventh day the sensor was removed and the data downloaded and storedfor further analysis. If the number of data-acquisition “skips” per day totalled more than 2 h, theentire day’s data were discarded.

2.1 Time-in-range metric

The time in the range metric was calculated with two different methods. In the first, through theCGM records of the AEGIS study, we estimate the deciles of CGM records with normoglycemic

5

Page 6: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

patients and use as cut-offs the deciles (Table 3). In the second, we use cut-off points establishedby the ADA in the 2019 Medical guideline (Battelino et al., 2019) (Table 4).

Range 1 < 85

Range 2 85− 90

Range 3 91− 94

Range 4 95− 98

Range 5 99− 101

Range 6 102− 105

Range 7 106− 109

Range 8 110− 115

Range 9 116− 124

Range 10 > 125

Table 3: Cut-offs for metric time in range using own estimations throught normoglucemic individ-uals of AEGIS study

Range 1 < 54

Range 2 54− 69

Range 3 70− 180

Range 4 181− 250

Range 5 > 250

Table 4: Cut-offs for metric time in range following ADA guidelines Battelino et al. (2019)

2.1.1 Statistical procedure

The density functions for each individual was estimated with non-parametric Nadaraya-Watsonprocedure. For this purpose, we used a Gaussian kernel and rule of thumb as a smoothing parameter.In addition, we estimate quantile representation for 2-Wasserstein methods using the empiricaldistribution.

The following three regression models were used: i) The non-parametric kernel functional regres-sion model through 2-Wasserstein distance with the glucodensity as predictor (Ferraty and Vieu,2006); ii) A global 2-Wasserstein regression model where the glucodensity is response (Petersenand Muller, 2019); and iii) k-nearest neighbor algorithm in the case of time in range metrics withk = 10 neighbors.

In the case of time in range metrics, we applied the isometric log-ratio (ilr) transformationfor compositional data prior to fitting the model. To avoid problems with zeros, a fixed positiveconstant was added to each each range, which were then normalized to add to 1.

All analyses were carried out using R software. Functional data analysis was performed usingthe fda.usc package (Febrero-Bande and de la Fuente, 2012), which is freely available at https:

//cran.r-project.org/, and our own implementations of the ANOVA test of Dubey and Muller

6

Page 7: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

(2019) or Frechet regression in Petersen and Muller (2019) using the 2-Wasserstein distance. Theglucodensities and their quantile representation were estimated using the R basis functions.

3 Definition and Estimation of the Glucodensity

For patient i, denote the gathered glucose monitoring data by pairs (tij , Xij), j = 1, . . . ,mi, wherethe tij represent recording times that are typically equally spaced across the observation interval,and Xij is the glucose level at time tij ∈ [0, Ti]. Note that the number of records mi, the spacingbetween them, and the overall observation length Ti can vary by patient. One can think of these dataas discrete observations of a continuous latent process Yi(t), with Xij = Yi(tij). The glucodensityfor this patient is defined in terms of this latent process as fi(x) = F ′i (x), where

Fi(x) =1

Ti

∫ Ti

01 (Yi(t) ≤ x) dt for inf

t∈[0,Ti]Yi(t) ≤ x ≤ sup

t∈[0,Ti]Yi(t)

is the proportion of the observation interval in which the glucose levels remain below x. Since Fiare increasing from 0 to 1, the data to be modeled are a set of probability density functions fi,i = 1, . . . , n.

Of course, neither Fi nor the glucodensity fi is observed in practice, but one can construct anapproximation through a density estimate fi(·) obtained from the observed sample. In this case ofCGM data, the glucodensities may have different support and shape. Therefore, we suggest usinga non-parametric approach to estimate each density function. For example, using a kernel-typeestimator, we have

fi(x) =1

mi

mi∑j=1

Khi(x−Xij),

where hi > 0 is the smoothing parameter and Khi(s) = 1hiK( shi ). The choice of K does not have a

big impact on the efficiency of the estimator, but the value of hi is crucial.Several alternatives for selecting the smoothing parameter have been proposed in the literature,

including cross-validation, minimizing the estimated mean integrated squared error (MISE), or a“rule of thumb” derived from the assumption that the density is Gaussian. In this last case, the

choice can be explicitly written as hi = 1.06σim−1/5i , where σi is the sample standard deviation of

the Xij . For more details, see Silverman (1986). Other approaches for the density function estima-tion include the use orthogonal series (e.g., Fourier or Wavelet) expansions, splines, or smoothing ofhistograms. For further details the reader is referred to Antoniadis (1997); Izenman (1991); Mullerand Petersen (2014).

3.1 Distance-based Descriptive Statistics

Let [a, b] be an interval of the real line, which may be unbounded, and suppose that each gluco-density fi has support contained in [a, b]. From a statistical point of view, the sample f1, . . . , fnmay be modeled and analyzed using methods of functional data analysis (Ramsay et al., 2005;

Wang et al., 2016). However, since the fi must be positive and satisfy∫ ba fi(x)dx = 1, classical

methods have in recent years been adapted to account for the nonlinear, distributional structureof density samples (Petersen and Muller, 2016; Hron et al., 2016). The general approach is todefine a metric or distance between densities that, in turn, leads to descriptive statistics that

7

Page 8: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

respect the unique density properties. For example, define the data space of glucodensities asA := f : [a, b]→ R+ :

∫ ba f(x)dx = 1 and

∫ ba x

2f(x)dx <∞. Given two arbitrary glucodensitiesf, g ∈ A, the 2-Wasserstein distance (Villani, 2008) between f and g is

dW 2(f, g) =

√∫ b

a(F−1(x)−G−1(x))2dx, (1)

where F and G are the cumulative distribution functions (cdfs) of the density functions f and g.The 2-Wasserstein distance is a natural distance to measure the similarity between density

functions through its representation in the space of the quantile (inverse cdf) functions and ithas already been successfully applied in biological problems. Furthermore, it has computationaland modeling advantages compared to the usual L2[a, b] metric when glucodensities have differentsupport within [a, b]. Finally, it has a physical interpretation in the theory of optimal transport.

As glucodensities are distributional data, the subsequent application of the usual techniquesfor functional data, such as estimation of mean, covariance, and regression models, may lead tomisleading results. Hence, we have chosen to use models based on the 2-Wasserstein distance,although other choices are possible. As a starting point, based on the notion of distance we cangeneralise the mean and variance of a random variable that takes values in an abstract space withmetric structure (Frechet, 1948). As we will see, similar adaptations can be developed for regression,hypothesis testing, or to perform cluster analysis. Given a distance d : A×A→ R+, of which dW 2

is one example, and a random variable f defined on A, the Frechet mean of f is

µf = arg ming∈A

E(d2(f, g)).

The Frechet variance of Z is thenσ2f = E(d2(f, µf )).

If the choice of distance is the Wasserstein metric dW 2 , these are given the names of Wassersteinmean and variance, respectively. In the following subsections we will extend these concepts ofFrechet to statistical methodologies of regression, clustering, and hypothesis testing based on thenotion of distance.

4 Regression models with glucodensities

4.0.1 Non-parametric regression model with Glucodensity as the Predictor

Let f be a functional random variable taking values in (A, dW 2) and Y a random variable that takevalues in the real line. We assume the following regression relationship between f and Y , whichrepresent the predictor and response variables, respectively:

Y = g(f) + ε (2)

where g : A→ R is an unknown smooth function, and the random error ε satisfies E(ε) = 0.Given a sample (fi, Yi) ∈ A × Rni=1, most non-parametric estimators g(·) have the form of a

weighted average of the responses

g(x) =n∑i=1

wni(x)Yi. (3)

8

Page 9: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

In general, the weights wni(x) depend on the distance between each fi and x, with larger distancesreceiving lower weights, and satisfy

∑ni=1wni(x) = 1 (Ferraty and Vieu, 2006). A typical choice

would be the Nadaraya–Watson weights

wni(x) =K(d(x,fi)h )∑n

i=1(K(d(x,fj)h ))

, (4)

where h is a smoothing parameter and K : R → R is a known univariate probability densityfunction called the kernel. For more details about this procedure see Ferraty and Vieu (2006). Asan alternative for the above method, we can use the kernel methods in Reproductive Kernel HilbertSpaces (RKHS) (Preda, 2007; Szabo et al., 2016).

4.1 Regression model with Glucodensity as the Response

In the case of the regression models with a density function as response, the literature is not veryextensive to the current date (Nerini and Ghattas, 2007; Han et al., 2019; Petersen and Muller,2019; Capitaine et al., 2019; Talska et al., 2018). In this article we use the model proposed inPetersen and Muller (2019) which allows us to incorporate the desired metric dW 2 and is a directgeneralization of classical linear regression. The primary rationale for our use of this model is that,unlike the other approaches mentioned above, there is a methodology developed to performanceinferential procedures such as confidence bands and hypothesis testing in order to establish thesignificance of the input variables in the model Petersen et al. (2019).

Let f be a random variable (e.g. a glucodensity) that take values in the space of (A, dW 2) definedabove. Consider a random vector U ⊂ Rd that contains the set of predictors. Our interest is in theFrechet regression function, or function of conditional Frechet means,

f(u) := arg ming∈A

E(d2W 2(f, g)|U = u), u ∈ Rd (5)

Petersen and Muller (2019) imposes a particular model for f that, in direct analogy to classicallinear regression, takes the form of a weighted Frechet mean:

f(u) = arg ming∈A

E(s(U, u)d2W 2(f, g)), u ∈ Rd. (6)

Here, the weight function is

s(U, u) = 1 + (U − µ)TΣ−1(u− µ), µ = E(U),Σ = Cov(U), (7)

and Σ is assumed to be positive definite.Given a sample (Ui, fi), i = 1, . . . , n, of independent pairs each distributed as (U, f), one can

proceed to estimate f(u) for any desired input u. Due to the intimate connection between theWasserstein metric and quantile functions as in (1), for most inferential procedures it is sufficientto estimate the conditional Wasserstein mean quantile function Q(u) corresponding to f(u). Let Dbe the set of quantile functions, Qi the quantile function corresponding to the random density fi,and define empirical weights sin(u) = 1+(Ui−U)T Σ−1(u−U), where U and Σ are the sample meanand variance of the Ui, respectively. The natural estimator under dW2 is the weighted empiricalmean quantile function

Q(u) = arg minQ∈D

n∑i=1

sin(x)‖Q−Qi‖2, (8)

9

Page 10: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

where ‖·‖ denotes the L2[0, 1] norm on D.

A straightforward algorithm for computing Q(u) is shown in Supplementary Material of originalreference Petersen et al. (2019). In addition, two algorithms are given to estimate the confidencebands at a given significance level α for both the quantile functional parameter Q(u) and the densityparameter f(u).

5 Results

5.1 Clinical validation of the glucodensity

To validate the glucodensity representation, we use the database from the AEGIS study (Gudeet al., 2017). The database contains the continuous glucose monitoring data between 2-6 days of581 patients from a random sample of a general population. A detailed description of the datais introduced in Section 2 together with characteristics of patients in Table 2. To develop thevalidation task, we use two different regression models: i) a non-parametric regression model wherethe unique predictor is glucodensity, and ii) a linear regression model where the response is aglucodensity. Further details on the regression models used can be found in the Section 3. Thefirst model was used to predict glycated hemoglobin (A1c) (Kilpatrick, 2000), homeostatic modelassessment (HOMA-IR) Ausk et al. (2010), and the following measures of glycemic variabilityService (2013); Monnier et al. (2008); Gude et al. (2017): continuous overall net glycemic action(CONGA), mean amplitude of glycemic excursions (MAGE) and mean of daily differences (MODD),through glucodensity representation. In contrast, the second was used to predict the glucodensitywith the five variables above. Figure 1 gives a visualization of the sample of glucodensities used inthese models. Biological significance in variables under consideration is described in Table 1.

5.1.1 Prediction of biomarkers using the glucodensity

The aim of the first set of regression analyses is to demonstrate that the glucodensity is sufficientlyrich in its information content to recover the aforementioned biomarkers with high precision. Toquantify this precision, we estimated the R2 after fitting a non-parametric model for each biomarkeras the outcome variable, using the glucodensity as the sole predictor (i.e. independent variable).The R2 estimates for A1c, HOMA-IR, MAGE, MODD, CONGA were 0.79, 0.79, 0.92, 0.86, and 0.92respectively. To supplement the results, Figure 2 shows the predicted values against the observedvalues, where the outstanding predictive capacity of the glucodensity can be seen independently ofhigh or low response values.

5.1.2 Prediction of the glucodensity using biomarkers

In the second regression analysis with the glucodensity as the outcome variable, we aim to show thatthe previous measurements commonly used in the clinical practice are not capable of capturing theglucodensity with high accuracy. This fact is not completely surprising because, as noted by someauthors (Zaccardi and Khunti, 2018), the information provided by a CGM is more precise thanthat contained in summary measures. To accomplish this, we computed a suitable version of R2

for this task after fitting a regression model where the response is a glucodensity, and the previousvariables are the predictors. In this case, the R2 estimated was 0.74. As predicted, comparedto the previous section’s results, we were not able to accurately capture the complex nature of

10

Page 11: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

3 4 5 6 7 8 9 105

67

89

10

A1c, %

Real values

Est

imat

ed v

alue

s

0 10 20 30 40 50 60

010

2030

40

HOMA−IR, mass units

Real values

Est

imat

ed v

alue

s

50 100 150

5010

015

0

MAGE, mg/dL

Real values

Est

imat

ed v

alue

s

1 2 3 4 5

12

34

5

MODD

Real values

Est

imat

ed v

alue

s

0.5 1.0 1.5 2.0 2.5

0.5

1.5

2.5

CONGA, mg/dL

Real values

Est

imat

ed v

alue

s

Figure 2: Real values vs Estimated values when glucodensity is predictor

glucodensities, even while using the combined predictive power of several commonly used summarymeasures. Moreover, in some cases, the differences in prediction can be significant (see Figure 3).

5.2 Comparison of time in range metrics with glucodensities

To illustrate the higher clinical sensitivity of glucodensities compared to time in range metrics, wecompared the ability of each representation to predict A1c, HOMA-IR, and glycemic variabilitymetrics MODD, MAGE, and CONGA, using the data from AEGIS study. The predictive capacityof the glucodensity representation was illustrated above, and this section gives the correspondingresults for time in range metrics, where these were calculated according to two sets of cutoffs. Inthe first, the deciles of the normoglycemic individuals from the AEGIS study were used, while inthe second those proposed by the ADA were used. Tables 3 and 4 in Section 2 show the exact cutoffvalues for both cases. Since the time in range metrics constitute a sample of standard compositionaldata, the isometric log-ratio (ilr) transformation was employed in combination with a k-nearestneighbor algorithm as a regression model for predicting the scalar variables. Methodological detailsabout this statistical procedure can be found in Section 2.

11

Page 12: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

0.0 0.2 0.4 0.6 0.8 1.0

−10

0−

500

5010

0

Residuals

Quantile

Glu

cose

, mg/

dL

Figure 3: Residuals in quantile space prediction glucodensities

5.2.1 Prediction of A1c, HOMA-IR and glycemic variability measures using time inrange metrics

Figure 4 compares the real and estimated values of the previous five variables under the two timein range metrics under consideration with. Table 5 provides the estimates of R2 for each variableand metric. The predictive capacity is significantly worse than that attained by the glucodensitymethodology. The superiority of the glucodensity is particularly noteworthy in the case of theHOMA-IR variable, where the association is quite weak for time in range metrics. Even for the othervariables where the values of R2 are moderate, the larger residuals seen in diabetes patients withmore severe alterations of glucose metabolism indicate that time in range metrics are particularlypoorly suited for such patients. Interestingly, we do not observe substantial or consistent differences

A1c HOMA-IR CONGA MAGE MODD

Normoglycemic cut-off 0.63 0.22 0.68 0.65 0.65

ADA cut-off 0.61 0.08 0.73 0.69 0.60

Table 5: R2 estimated with time in range metrics under consideration

12

Page 13: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

between the two time in range metrics used, as deciles perform better than ADA criteria for twoof the variables, while in other instances the ordering was reversed.

3 4 5 6 7 8 9 10

5.0

6.0

7.0

8.0

A1c, %

Real values

Est

imat

ed v

alue

s

0 10 20 30 40 50 60

26

1014

HOMA−IR, mass units

Real values

Est

imat

ed v

alue

s

50 100 150

2040

6080

100

MAGE, mg/dL

Real values

Est

imat

ed v

alue

s

1 2 3 4 5

0.5

1.5

2.5

MODD

Real values

Est

imat

ed v

alue

s

0.5 1.0 1.5 2.0 2.5

0.5

1.0

1.5

2.0

CONGA, mg/dL

Real values

Est

imat

ed v

alue

s

Figure 4: Real values vs. Estimated values when time in range metric is the predictor. Blue, timein range metric with cut-offs calculated with normoglycemic patients of AEGIS database. Red,time in range metric using of cut-offs suggested by ADA.

6 Hyphotesis and clustering analysis with glucodensities

6.1 Analysis of variance with glucodensities

As a special case of regression, suppose we have a sample f1, . . . fn of glucodensities defined on(A, dW ) belonging to k different groups G1, G2, · · · , Gk that partition 1, . . . , n and are of size nj(j = 1, · · · , k), so that

∑kj=1 nj = n. If the goal is to simply test whether the Wasserstein means

are equal for each group, Petersen et al. (2019) developed testing procedures based on model (6) forthis purpose. An advantage of this model is its flexibility, which allows for multiple factor layoutsas well as tests for interactions. However, the theoretical properties of these tests require a type ofequal variance assumption that may be restrictive for some data sets.

More generally, one may wish to test the null hypothesis that the population distributions of thek groups share common Wasserstein means and variances, against the alternative that at least one

13

Page 14: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

of the groups has a different population distribution compared to the others in terms of either itsWasserstein mean or variance. In this scenario, Dubey and Muller (2019) investigated a test statisticbased on the group proportions λj,n = njn

−1, the groupwise sample Wasserstein means µj =arg ming∈A

∑i∈Gj

d2W 2(fi, g) and variances Vj = n−1j∑

i∈Gjd2W 2(fi, µj), the pooled Wasserstein

mean µp = arg ming∈A∑k

j=1

∑i∈Gj

d2W 2(fi, g) and variance Vp = n−1∑k

j=1

∑i∈Gj

d2W 2(fi, µp), andfinally the quantities

σ2j =1

nj

∑i∈Gj

d4W 2(fi, µj)−

1

nj

∑i∈Gj

d2W 2(fi, µj)

2

as estimates of the variance of Vj . Then, with

Fn = Vp −k∑j=1

λj,nVj , Rn =∑j<l

λj,n, λl,nσ2l σ

2j

(Vj − Vl),

the proposed test statistic is

Tn =nRn∑k

j=1 λj,nσ−2j

+nF 2

n∑kj=1 λj,nσ

2j

. (9)

Dubey and Muller (2019) demonstrated that the corresponding test is distribution-free, inthat the limiting distribution of Tn does not depend on the underlying distribution under someassumptions. In practice, it was also demonstrated that it can be useful to calibrate the test underthe null hypothesis via a simple empirical bootstrap over the preceding statistics. For more detailswe refer the reader to the supplementary material of the original reference.

6.2 Energy distance methods with glucodensities

The energy distance is a statistical distance between two distribution functions proposed in 1984by Gabor J. Szekely (Szekely and Rizzo (2017)). This distance is inspired by the concept ofgravitational energy between two bodies, and has experienced a rise in appeal for modern statisticalapplications due to its applicability to data of a complex nature such as functions, graphs or objectsthat live in negative type space.

Consider independent random variables Y, Y ′ ∼ F and Z,Z ′ ∼ G that are defined on a(semi)metric space (Ω, ρ) of negative type, where ρ : V × V → R is the semi-metric. Though thenotation in this section is quite general, in particular we have in mind the case (Ω, ρ) = (A, dW 2)corresponding to glucodensities. The energy distance associated with ρ between the distribution Fand G is

ερ(F,G) = 2E(ρ(Y, Z))− E(ρ(Y, Y ′))− E(ρ(Z,Z ′)).

Given random samples Y1, . . . , Yniid∼ F and Z1, . . . , Zm

iid∼ G, the sample energy distance is

ερ(F,G) = 21

nm

n∑i=1

m∑j=1

ρ(Yi, Zj)−1

n2

n∑i=1

n∑i=1

ρ(Yi, Yj)−1

m2

m∑i=1

m∑i=1

ρ(Zi, Zj).

The asymptotic distribution of the above statistic for a null hypothesis (H0 : F = G) as well asfor the alternative (Ha : F 6= G) is dependent on the chosen semimetric ρ. Besides, its expression

14

Page 15: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

is difficult to calculate and to implement in practice. Hence, when using the energy distancebased methods, the distribution under the null hypothesis is usually calibrated with a permutationmethod. Alternatives to calibrate the distribution under the null hypothesis include the wild ora weighted boostrap, as described in Leucht and Neumann (2013); Jimenez-Gamero et al. (2019).The energy distance can also be extended to handle samples from more than two populations.

Given k independent samples Yj1, . . . , Yjnj

iid∼ Fj , j = 1, . . . , k, the energy distance statistic is

ερ(F1, . . . , Fk)∑

1≤j<l≤k

njnl2n

[2gjl − gjj − gll], gjl =1

njnl

nj∑i=1

nl∑i′=1

ρ(Yji, Yli′), (10)

where n = n1 + · · · , nk.We now explain how this statistic can be adapted to perform clustering. Consider random pairs

(Yi, Ii), i = 1, . . . , n, where Yi is observed and takes values in (Ω, ρ), while Ii ∈ 1, . . . , k is anunobserved label of cluster membership. The task is to recover the true clusters C∗j = i : Ii = j,j = 1, . . . , k. Let C1, . . . , Ck be a generic partition of 1, . . . , n, and denote the size of each clusterby |Cj |. Then a clustering may be chosen by optimizing the statistic

Sρ(C1, . . . , Ck) =∑

1≤j<l≤k

njnl2n

[2gjl − gjj − gll], gjl =1

|Cj ||Cl|∑

(i,i′)∈Cj×Cl

ρ(Yi, Y′i ) (11)

over all possible clusters Cj . At first view, this seems computationally intractable due to the ap-pearance of distances between the elements of each cluster. However, defining

Wρ(C1, . . . , Ck) =

k∑j=1

|Cj |2gjj , (12)

it can be proven that Sρ + Wρ is constant. This implies that maximizing Sρ is equivalent tominimizing Wρ.

In Franca et al. (2020), the authors show the equivalence between the previous optimizationproblem with the clustering procedure kernel k-means. The latter relationship allows the solvingof kernel Kgroup clustering procedure through the popular heuristics algorithms as Hartigan andLloyd that allow finding the optimal solution with the k-means algorithm.

6.3 Example of hypothesis testing and clustering analysis with glucodensitymethodology

Below, we illustrate the methodology of glucodensities in hypothesis testing and cluster analysiswith the 2-Wasserstein distance. We use the ANOVA test (Dubey and Muller, 2019) and thek-groups algorithm Franca et al. (2020).

6.3.1 Hypothesis testing

An interesting question to address in an epidemiological study is whether there are differences be-tween men and women in the glycemic profile. The ANOVA test is an important instrument toestablish whether there are statistically significant differences in mean and variance with glucoden-sities where there are two or more patient groups. After applying this method with AEGIS data,

15

Page 16: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

the test yields a p-value equal to 0.10. Therefore, there is no statistically significant differencebetween men and women at the significance level of 5 percent.

Figure 5 shows the glucodensity samples for each gender using their quantile representations.The pointwise means of these quantile functions constitute the quantile function of the sampleWasserstein mean glucodensites. These, together with pointwise standard deviation curves, arealso shown in Figure 5. On average, the groups are quite similar. However, certain discrepanciesare observed between both groups in terms of their variance, although not large enough for the testto show statistical significant differences.

0.0 0.4 0.8

5010

015

020

025

030

035

040

0

Women

Quantile

Glu

cose

, mg/

dL

0.0 0.4 0.8

5010

015

020

025

030

035

040

0

Men

Quantile

Glu

cose

, mg/

dL

0.0 0.4 0.8

8010

012

014

016

0

Mean

Quantile

Glu

cose

, m

g/dL

WomenMen

0.0 0.4 0.8

1020

3040

50

Standart desviation

Quantile

Glu

cose

, m

g/dL

WomenMen

Figure 5: (Left two panels) Glucodensities for men and women of the AEGIS study, plotted asquantile functions; (Third panel) 2-Wasserstein mean quantile functions for each group; (FourthPanel) Cross-sectional standard deviation curves for quantile functions in each group.

6.3.2 Clustering analysis

Cluster analysis is an essential tool for identifying subgroups of patients with similar characteristics.As an example, with the diabetes patients’ data from the AEGIS study, we perform a clusteranalysis using three clusters. To establish when a patient has diabetes, we use the doctor’s previousdiagnostic criteria, or if individuals currently have their glucose values measured with A1c and FPGin the ranges established by the ADA to be classified in that category.

16

Page 17: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Figure 6 contains the results of applying the cluster analysis in diabetes patients. The algorithmhas identified three differentiated groups of patients. The first group is patients with normal glucosevalues, probably because they are on medication, and the diagnosis of diabetes was made in thepast. The second group are patients with slightly altered diabetes metabolism. Finally, the lastgroup is patients with severely altered glucose values, and as can be seen in the glucodensities, theirglucose is continuously fluctuating. The two-dimensional graphical representation of the densityfunction of A1c and FPG helps to validate these findings.

50 100 150 200 250 300 350 400

0.00

0.02

0.04

Cluster 1

Glucose, mg/dL

Den

sity

Cluster 1

A1c, %

FP

G, m

g/dL

0.005 0.01

5 6 7 8 9 10

100

200

50 100 150 200 250 300 350 400

0.00

0.02

0.04

Cluster 2

Glucose, mg/dL

Den

sity

Cluster 2

A1c, %

FP

G, m

g/dL 0

.001

0.001

0.002

0.003

0.004

5 6 7 8 9 10

100

200

50 100 150 200 250 300 350 400

0.00

0.02

0.04

Cluster 3

Glucose, mg/dL

Den

sity

Cluster 3

A1c, %

FP

G, m

g/dL

5e−04

5e−04

5e−04 0.001

0.001

0.0015

0.0025 0.003

0.004

5 6 7 8 9 10

100

200

Figure 6: Clustering analysis of diabetes patients in AEGIS study

7 Discussion

The primary contribution of this article is to propose a new representation of CGM data calledglucodensity. We have validated this representation from a clinical point of view, proving that it ismore accurate than time in range metrics.

17

Page 18: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

7.1 Diabetes etiology and biological components to capture in a mathematicalrepresention

Diabetes encompasses a heterogeneous group of impaired glucose metabolism, such as the frequentpresence of hyperglycemias or hypoglycemias Association et al. (2018). Anomalous glucose fluctu-ations are another essential trait of dysglycemic regulation Monnier and Colette (2011); Monnieret al. (2008). The use of glycemic control measures that go beyond the average glucose valuessuch as A1c and also capture i) the impact of time spent at each glucose concentration on theglucose deregulation process, ii) the oscillations of glucose associated with cellular damage Monnierand Colette (2011), is crucial in the management of patients with diabetes as in the assessment ofglucose metabolism with a high degree of precision.

7.2 Clinical validation of glucodensity

Our proposal accurately captures the components of diabetes mentioned above. Using clinical data,we evaluated the clinical sensitivity against established biomarkers in diabetes. We found a highassociation between A1c, HOMA-IR, CONGA, MODD, MAGE, and glucodensity. In the case ofthe HOMA-IR variable, the predictive ability does not seem excellent, although, to the best of ourknowledge, no known marker shows a predictive ability against that variable. However, our modelcan provide consistent values in moderate and large HOMA-IR values. While the fit for the variableA1c was not perfect, we must consider that the time scale for the A1c and the glucodensities werequite different. A1c is a measure that reflects the average glucose over 2−4 months while monitoringpatients for less than 1 week to compute the glucodensity. Our R2 of 0.79 is better than the averageglucose recorded by the monitoring period (R2 = 0.61), which indicates that an individual’s glucosedistributional values may give extra information to the long-term glucose averages.

In the prediction of glucodensity from A1c, HOMA-IR, and glycemic variability measures, theestimated R2 shows a moderate relationship between those variables. However, we are introducingthe essential variables of the glucose deregulation process. A possible explanation of this is thatthe use of the summary measures commonly used in diabetes can hardly capture an individual’sglycemic profile. Glucose metabolism is very complex and highly dependent on the patient’s con-ditions. For example, the cellular mechanisms are different in type I and type II diabetes. Inthe former, there is an inhibition of β-cell function and consequent non-insulin production, whileinsulin secretion is reduced in the latter (Taylor, 2013). In this context, the introduction of theconcept of glucodensity provides greater clinical accuracy to the possible decisions derived fromsuch representation compared to traditional methods because we utilize the entire distribution ofglucose concentrations of an individual over time.

7.3 Time in range metrics vs. Glucodensity

While time in range metrics may also achieve the previous aim, they do so to a clearly lesserextent than the glucodensity. Our proposal can capture the differences between individuals in eachglucose concentration. Notwithstanding, time in range only measures glucose differences alongintervals with the subsequent loss of information. Also, time in range metrics are substantiallylimited since the target zones must be defined previously, and these may also depend on the studypopulation or the aim of the analysis.

Empirical results demonstrate the advantages of our proposal out of the theoretical framework.The ability of glucodensity to predict A1c, HOMA-IR, and the CONGA, MAGE, and MODD

18

Page 19: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

variability measures is surprisingly high, much higher than that achieved with the range metricdespite using two different target zones: the deciles of normoglycemic patients glucose values andthe target zones prescribed by the ADA.

The estimated R2 between glucodensities and A1c is similar than that reported by other authorsbetween A1c and average glucose values Nathan et al. (2007). However, in this study, patients aremonitored only for 2 − 6 days and not for weeks. Two possible factors we must consider in theanalysis of the results. First, there are both diabetic and non-diabetic patients in our database, and,second, the glucodensity captures A1c better because it represents the entire distribution of glucoseconcentration values, while glycation rates are known to increase with glucose concentrations Singhet al. (2001). In particular, the estimated R2 between A1c and the mean glucose in our databaseis only 0.61.

7.4 Statistical considerations

From a statistical standpoint, glucodensities are a special constrained type of functional data knownas distributional data; therefore, it is not possible to directly use the usual statistical techniques.In this paper, we have proposed a framework for the analysis of these distributional data based ondistances with existing techniques for hypothesis testing, cluster analysis, and regression models.However, further methodological development is necessary, as it can be the case of mixed modelsor causal inference methods where there is no available methodology.

7.5 Limitations

A potential limitation of our representation is that it ignores the order of events. Instead, it analyzesonly the distribution of glucose values. However, the event sequence may not be a critical componentin diabetes modeling. The main factor of microvascular and macrovascular complications is chronichyperglycemia Cryer (2014); Skrha et al. (2016), and this is captured with high accuracy by ourmodels. Moreover, an essential aspect of managing diabetes patients is hypoglycemia control, andour proposal also captures this. Finally, the third component of dysglycemia Monnier et al. (2008),glucose variability, can accurately predict by our representations, at least, through metrics CONGA,MAGE, and MODD.

The sample size used may also be a limitation from a statistical point of view. Nevertheless,in the field of diabetes, the AEGIS study is the world’s largest databases and, unlike other stud-ies, is composed of randomly selected individuals from a general population and non-participantsZeevi et al. (2015). Finally, for study validation, perhaps the most reliable way of validating thenew representation is in terms of the patients’ long-term prognosis. However, to the best of ourknowledge, no study with a reasonable sample size has this information from the intensive use ofCGM technology. Moreover, we have established the clinical validity from variables that do havea clear and established relationship with the prognosis and prevalence of diabetes as evidenced inthe current literature in the field.

7.6 Potential Applications

Adopting the concept of glucodensity in clinical practice and biomedical research could be verypromising in the following ways.

19

Page 20: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

• To have a simple and more accurate representation of the glycaemic profile of an individual.This representation is especially useful in the management of diabetic patients and to assessthe effects of an intervention.

• To establish if there are statistically significant differences between patients subjected todifferent interventions, for example, in a clinical trial.

• To identify different subtypes of patients based on their glycaemic condition and other vari-ables. Cluster analysis of glucodensities can create new patient subtypes based on the riskof diabetes or other complications. Furthermore, it allows us to describe the etiology ofthe disease better by creating groups of subjects whose glucose profiles and other clinicalcharacteristics are similar.

• To establish the prognosis or risk of a patient or to analyze the relationship of an individual’sglycaemic profile with different clinical variables in epidemiological studies.

• To predict changes in the glycaemic profile based on the characteristics of the individuals andthe intervention performed. For example: how does the glucodensity vary according to thediet?

• To recommend the most advantageous treatments for a patient. Following the previous idea,a causal inference model could be fitted where the response is glucodensity, for example, toestablish which diet is the most beneficial for the individual to achieve a suitable glucoselevels.

7.7 Future work

We introduce glucodensities methodology with CGM data. However, our methodology is alsovalid for data from other biosensors such as accelerometers to measure physical activity levels. Inthis domain, the time in range metric is one of the most used representations, and perhaps theadoption of our approach can lead to better results Dumuid et al. (2018, 2020). The adoption ofnew methodology with other biosensors may be an essential research issue to be addressed in thefuture.

In the diabetes field, it will be necessary to evaluate the predictive capacity of the glucodensityin the long-term prognosis of patients. In addition, it would be interesting to assess, in moreextended monitoring periods, the reproducibility between days and weeks with the representationconstructed. One way to accomplish this is to compute the intraclass correlation coefficient (ICC)using, for example, the methodology proposed recently in Xu et al. and based on distances betweenfunctions.

Acknowledgements

We thanks Russell Lyons for his discussions on the use of energy-distance based methods withglucodensities.

This work has received financial support from Carlos III Health Institute, Grant/Award Num-ber: PI16/01395; Ministry of Economy and Competitiveness (SPAIN) European Regional Devel-opment Fund (FEDER); the Axencia Galega de Innovacion, Consellerıa de Economıa, Emprego e

20

Page 21: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Industria, Xunta de Galicia, Spain, Grant/Award Number: GPC IN607B 2018/01; National Sci-ence Foundation, Grant/Award Number: DMS-1811888; the Spanish Ministry of Economy andCompetitiveness Grant/Award Number: TIN2015-73566-JIN and TIN2017-84796-C21-R.

Competing Interests

The authors declare no competing interests.

References

Antoniadis, A. (1997). Wavelets in statistics: A review. Journal of the Italian Statistical Society,6 97. URL https://doi.org/10.1007/BF03178905.

Association, A. D. et al. (2018). 6. glycemic targets: standards of medical care in dia-betes—2018. Diabetes Care, 41 S55–S64.

Ausk, K. J., Boyko, E. J. and Ioannou, G. N. (2010). Insulin resistance pre-dicts mortality in nondiabetic individuals in the u.s. Diabetes Care, 33 1179–1185.https://care.diabetesjournals.org/content/33/6/1179.full.pdf, URL https://care.

diabetesjournals.org/content/33/6/1179.

Battelino, T., Danne, T., Bergenstal, R. M., Amiel, S. A., Beck, R., Biester, T.,Bosi, E., Buckingham, B. A., Cefalu, W. T., Close, K. L., Cobelli, C., Dassau, E.,DeVries, J. H., Donaghue, K. C., Dovc, K., Doyle, F. J., Garg, S., Grunberger,G., Heller, S., Heinemann, L., Hirsch, I. B., Hovorka, R., Jia, W., Kordonouri, O.,Kovatchev, B., Kowalski, A., Laffel, L., Levine, B., Mayorov, A., Mathieu, C., Mur-phy, H. R., Nimri, R., Nørgaard, K., Parkin, C. G., Renard, E., Rodbard, D., Saboo,B., Schatz, D., Stoner, K., Urakami, T., Weinzimer, S. A. and Phillip, M. (2019). Clini-cal targets for continuous glucose monitoring data interpretation: Recommendations from the in-ternational consensus on time in range. Diabetes Care. https://care.diabetesjournals.org/content/early/2019/06/07/dci19-0028.full.pdf, URL https://care.diabetesjournals.

org/content/early/2019/06/07/dci19-0028.

Beck, R. W., Bergenstal, R. M., Riddlesworth, T. D., Kollman, C., Li, Z., Brown,A. S. and Close, K. L. (2019). Validation of time in range as an outcome measure for diabetesclinical trials. Diabetes Care, 42 400–405. https://care.diabetesjournals.org/content/42/3/400.full.pdf, URL https://care.diabetesjournals.org/content/42/3/400.

Beck, R. W., Connor, C. G., Mullen, D. M., Wesley, D. M. and Bergenstal, R. M.(2017). The fallacy of average: How using hba1c alone to assess glycemic control can be mislead-ing. Diabetes Care, 40 994–999. https://care.diabetesjournals.org/content/40/8/994.

full.pdf, URL https://care.diabetesjournals.org/content/40/8/994.

Bergenstal, R. M. (2015). Glycemic variability and diabetes complications: Does it mat-ter? simply put, there are better glycemic markers! Diabetes Care, 38 1615–1621.https://care.diabetesjournals.org/content/38/8/1615.full.pdf, URL https://care.

diabetesjournals.org/content/38/8/1615.

21

Page 22: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Bergenstal, R. M., Ahmann, A. J., Bailey, T., Beck, R. W., Bissen, J., Buckingham,B., Deeb, L., Dolin, R. H., Garg, S. K., Goland, R., Hirsch, I. B., Klonoff, D. C.,Kruger, D. F., Matfin, G., Mazze, R. S., Olson, B. A., Parkin, C., Peters, A.,Powers, M. A., Rodriguez, H., Southerland, P., Strock, E. S., Tamborlane, W. andWesley, D. M. (2013). Recommendations for standardizing glucose reporting and analysis tooptimize clinical decision making in diabetes: The ambulatory glucose profile (agp). DiabetesTechnology & Therapeutics, 15 198–211. PMID: 23448694, https://doi.org/10.1089/dia.

2013.0051, URL https://doi.org/10.1089/dia.2013.0051.

Capitaine, L., Genuer, R. and Thiebaut, R. (2019). Frechet random forests. 1906.01741.

Cryer, P. E. (2014). Glycemic goals in diabetes: Trade-off between glycemic control and iatrogenichypoglycemia. Diabetes, 63 2188–2195. https://diabetes.diabetesjournals.org/content/

63/7/2188.full.pdf, URL https://diabetes.diabetesjournals.org/content/63/7/2188.

DiMeglio, L. A., Evans-Molina, C. and Oram, R. A. (2018). Type 1 diabetes. TheLancet, 391 2449 – 2462. URL http://www.sciencedirect.com/science/article/pii/

S0140673618313205.

Dubey, P. and Muller, H.-G. (2019). Frechet analysis of variance for random ob-jects. Biometrika, 106 803–821. http://oup.prod.sis.lan/biomet/article-pdf/106/4/803/30646779/asz052.pdf, URL https://doi.org/10.1093/biomet/asz052.

Dumuid, D., Pedisic, Z., Palarea-Albaladejo, J., Martın-Fernandez, J. A., Hron, K.and Olds, T. (2020). Compositional data analysis in time-use epidemiology: What, why, how.International journal of environmental research and public health, 17 2220. 32224966[pmid], URLhttps://pubmed.ncbi.nlm.nih.gov/32224966.

Dumuid, D., Stanford, T. E., Martin-Fernandez, J.-A., Zeljko Pedisic, Maher, C. A.,Lewis, L. K., Hron, K., Katzmarzyk, P. T., Chaput, J.-P., Fogelholm, M., Hu, G.,Lambert, E. V., Maia, J., Sarmiento, O. L., Standage, M., Barreira, T. V., Broyles,S. T., Tudor-Locke, C., Tremblay, M. S. and Olds, T. (2018). Compositional data analysisfor physical activity, sedentary time and sleep research. Statistical Methods in Medical Research,27 3726–3738. PMID: 28555522, https://doi.org/10.1177/0962280217710835, URL https:

//doi.org/10.1177/0962280217710835.

Ewings, S. M., Sahu, S. K., Valletta, J. J., Byrne, C. D. and Chipperfield, A. J. (2015).A bayesian network for modelling blood glucose concentration and exercise in type 1 diabetes.Statistical Methods in Medical Research, 24 342–372. PMID: 24492795, https://doi.org/10.1177/0962280214520732, URL https://doi.org/10.1177/0962280214520732.

Febrero-Bande, M. and de la Fuente, M. (2012). Statistical computing in functional dataanalysis: The r package fda.usc. Journal of Statistical Software, Articles, 51 1–28. URL https:

//www.jstatsoft.org/v051/i04.

Feig, D. S., Donovan, L. E., Corcoy, R., Murphy, K. E., Amiel, S. A., Hunt, K. F.,Asztalos, E., Barrett, J. F. R., Sanchez, J. J., de Leiva, A., Hod, M., Jovanovic,L., Keely, E., McManus, R., Hutton, E. K., Meek, C. L., Stewart, Z. A., Wysocki,T., O’Brien, R., Ruedy, K., Kollman, C., Tomlinson, G., Murphy, H. R., Grisoni, J.,

22

Page 23: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Byrne, C., Davenport, K., Neoh, S., Gougeon, C., Oldford, C., Young, C., Green, L.,Rossi, B., Rogers, H., Cleave, B., Strom, M., Adelantado, J. M., Chico, A. I., Tun-didor, D., Malcolm, J., Henry, K., Morris, D., Rayman, G., Fowler, D., Mitchell,S., Rosier, J., Temple, R., Turner, J., Canciani, G., Hewapathirana, N., Piper, L.,Kudirka, A., Watson, M., Bonomo, M., Pintaudi, B., Bertuzzi, F., Daniela, G., Mion,E., Lowe, J., Halperin, I., Rogowsky, A., Adib, S., Lindsay, R., Carty, D., Craw-ford, I., Mackenzie, F., McSorley, T., Booth, J., McInnes, N., Smith, A., Stanton,I., Tazzeo, T., Weisnagel, J., Mansell, P., Jones, N., Babington, G., Spick, D., Mac-Dougall, M., Chilton, S., Cutts, T., Perkins, M., Scott, E., Endersby, D., Dover,A., Dougherty, F., Johnston, S., Heller, S., Novodorsky, P., Hudson, S., Nisbet,C., Ransom, T., Coolen, J., Baxendale, D., Holt, R., Forbes, J., Martin, N., Wal-bridge, F., Dunne, F., Conway, S., Egan, A., Kirwin, C., Maresh, M., Kearney, G.,Morris, J., Quinn, S., Bilous, R., Mukhtar, R., Godbout, A., Daigle, S., Lubina, A.,Jackson, M., Paul, E., Taylor, J., Houlden, R., Breen, A., Banerjee, A., Bracken-ridge, A., Briley, A., Reid, A., Singh, C., Newstead-Angel, J., Baxter, J., Philip,S., Chlost, M., Murray, L., Castorino, K., Frase, D., Lou, O. and Pragnell, M.(2017). Continuous glucose monitoring in pregnant women with type 1 diabetes (conceptt):a multicentre international randomised controlled trial. The Lancet, 390 2347 – 2359. URLhttp://www.sciencedirect.com/science/article/pii/S0140673617324005.

Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice(Springer Series in Statistics). Springer-Verlag, Berlin, Heidelberg.

for Medicinal Products for Human Use, C. et al. (2012). Guideline on clinical investigationof medicinal products in the treatment or prevention of diabetes mellitus. London, EuropeanMedicines Society.

Franca, G., Vogelstein, J. T. and Rizzo, M. (2020). Kernel k-groups via hartigan’s method.IEEE Transactions on Pattern Analysis and Machine Intelligence 1–1.

Frechet, M. R. (1948). Les elements aleatoires de nature quelconque dans un espace distancie.Annales de l’institut Henri Poincare, 10 215–310. URL http://www.numdam.org/item/AIHP_

1948__10_4_215_0.

Freeman, J. and Lyons, L. (2008). The use of continuous glucose monitoring toevaluate the glycemic response to food. Diabetes Spectrum, 21 134–137. https:

//spectrum.diabetesjournals.org/content/21/2/134.full.pdf, URL https://spectrum.

diabetesjournals.org/content/21/2/134.

Group, B. A. W. (2018). Need for regulatory change to incorporate beyond a1c glycemic metrics.Diabetes Care, 41 e92–e94.

Gude, F., Dıaz-Vidal, P., Rua-Perez, C., Alonso-Sampedro, M., Fernandez-Merino,C., Rey-Garcıa, J., Cadarso-Suarez, C., Pazos-Couselo, M., Garcıa-Lopez, J. M. andGonzalez-Quintela, A. (2017). Glycemic variability and its association with demographicsand lifestyles in a general adult population. Journal of diabetes science and technology, 11780–790. 28317402[pmid], URL https://pubmed.ncbi.nlm.nih.gov/28317402.

23

Page 24: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Hall, H., Perelman, D., Breschi, A., Limcaoco, P., Kellogg, R., McLaughlin, T. andSnyder, M. (2018). Glucotypes reveal new patterns of glucose dysregulation. PLOS Biology,16 1–23. URL https://doi.org/10.1371/journal.pbio.2005143.

Han, K., Muller, H.-G. and Park, B. U. (2019). Additive functional regression for densitiesas responses. Journal of the American Statistical Association, 0 1–24. https://doi.org/10.

1080/01621459.2019.1604365, URL https://doi.org/10.1080/01621459.2019.1604365.

Hirsch, I. B., Sherr, J. L. and Hood, K. K. (2019). Connecting the dots: Vali-dation of time in range metrics with microvascular outcomes. Diabetes Care, 42 345–348. https://care.diabetesjournals.org/content/42/3/345.full.pdf, URL https://

care.diabetesjournals.org/content/42/3/345.

Hoelzel, W., Weykamp, C., Jeppsson, J.-O., Miedema, K., Barr, J. R., Goodall, I.,Hoshino, T., John, W. G., Kobold, U., Little, R., Mosca, A., Mauri, P., Paroni, R.,Susanto, F., Takei, I., Thienpont, L., Umemoto, M. and Wiedmeyer, H.-M. (2004).Ifcc reference system for measurement of hemoglobin a1c in human blood and the nationalstandardization schemes in the united states, japan, and sweden: A method-comparison study.Clinical Chemistry, 50 166–174. http://clinchem.aaccjnls.org/content/50/1/166.full.

pdf, URL http://clinchem.aaccjnls.org/content/50/1/166.

Hron, K., Menafoglio, A., Templ, M., Hruuzova, K. and Filzmoser, P. (2016). Simplicialprincipal component analysis for density functions in bayes spaces. Computational Statistics &Data Analysis, 94 330–350.

Izenman, A. J. (1991). Review papers: Recent developments in nonparametric density estima-tion. Journal of the American Statistical Association, 86 205–224. https://doi.org/10.1080/01621459.1991.10475021, URL https://doi.org/10.1080/01621459.1991.10475021.

Jimenez-Gamero, M., Alba-Fernandez, M. and Ariza-Lopez, F. (2019). Approximatingthe null distribution of a class of statistics for testing independence. Journal of Computationaland Applied Mathematics, 354 131 – 143. URL http://www.sciencedirect.com/science/

article/pii/S0377042718301572.

Kilpatrick, E. S. (2000). Glycated haemoglobin in the year 2000. Journal of Clinical Pathol-ogy, 53 335–339. https://jcp.bmj.com/content/53/5/335.full.pdf, URL https://jcp.

bmj.com/content/53/5/335.

Kovatchev, B. P., Breton, M., Man, C. D. and Cobelli, C. (2009). In silico preclinicaltrials: A proof of concept in closed-loop control of type 1 diabetes. Journal of Diabetes Scienceand Technology, 3 44–55. PMID: 19444330, https://doi.org/10.1177/193229680900300106,URL https://doi.org/10.1177/193229680900300106.

Leucht, A. and Neumann, M. H. (2013). Dependent wild bootstrap for degenerate u- and v-statistics. Journal of Multivariate Analysis, 117 257 – 280. URL http://www.sciencedirect.

com/science/article/pii/S0047259X13000304.

Lu, J., Ma, X., Zhou, J., Zhang, L., Mo, Y., Ying, L., Lu, W., Zhu, W., Bao, Y.,Vigersky, R. A. and Jia, W. (2018). Association of time in range, as assessed by con-tinuous glucose monitoring, with diabetic retinopathy in type 2 diabetes. Diabetes Care,

24

Page 25: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

41 2370–2376. https://care.diabetesjournals.org/content/41/11/2370.full.pdf, URLhttps://care.diabetesjournals.org/content/41/11/2370.

Matthews, D., Hosker, J., Rudenski, A., Naylor, B., Treacher, D. and Turner, R.(1985). Homeostasis model assessment: insulin resistance and β-cell function from fasting plasmaglucose and insulin concentrations in man. Diabetologia, 28 412–419.

McDonnell, C., Donath, S., Vidmar, S., Werther, G. and Cameron, F. (2005). A novelapproach to continuous glucose analysis utilizing glycemic variation. Diabetes Technology &Therapeutics, 7 253–263. PMID: 15857227, https://doi.org/10.1089/dia.2005.7.253, URLhttps://doi.org/10.1089/dia.2005.7.253.

Molnar, G., Taylor, W. and Ho, M. (1972). Day-to-day variation of continuously monitoredglycaemia: a further measure of diabetic instability. Diabetologia, 8 342–348.

Monnier, L. and Colette, C. (2011). Glycemic variability: Can we bridge the divide betweencontroversies? Diabetes Care, 34 1058–1059. https://care.diabetesjournals.org/content/34/4/1058.full.pdf, URL https://care.diabetesjournals.org/content/34/4/1058.

Monnier, L., Colette, C. and Owens, D. R. (2008). Glycemic variability: the third componentof the dysglycemia in diabetes. is it important? how to measure it? Journal of diabetes scienceand technology, 2 1094–1100.

Muller, H.-G. and Petersen, A. (2014). Density estimation including examples. Wiley StatsRef:Statistics Reference Online 1–12.

Nathan, D., Turgeon, H. and Regan, S. (2007). Relationship between glycated haemoglobinlevels and mean glucose levels over time. Diabetologia, 50 2239–2244.

Nerini, D. and Ghattas, B. (2007). Classifying densities using functional regression trees: Ap-plications in oceanology. Computational Statistics & Data Analysis, 51 4984 – 4993. URLhttp://www.sciencedirect.com/science/article/pii/S0167947306003550.

Nguyen, M., Han, J., Spanakis, E. K., Kovatchev, B. P. and Klonoff, D. C. (0). Areview of continuous glucose monitoring-based composite metrics for glycemic control. DiabetesTechnology & Therapeutics, 0 null. PMID: 32069094, https://doi.org/10.1089/dia.2019.

0434, URL https://doi.org/10.1089/dia.2019.0434.

Nichols, S. P., Koh, A., Storm, W. L., Shin, J. H. and Schoenfisch, M. H. (2013). Biocom-patible materials for continuous glucose monitoring devices. Chemical Reviews, 113 2528–2549.URL https://doi.org/10.1021/cr300387j.

Petersen, A., Liu, X. and Divani, A. A. (2019). Wasserstein f -tests and confidence bands forthe frechet regression of density response curves. 1910.13418.

Petersen, A. and Muller, H.-G. (2016). Functional data analysis for density functions bytransformation to a hilbert space. Ann. Statist., 44 183–218. URL https://doi.org/10.1214/

15-AOS1363.

Petersen, A. and Muller, H.-G. (2019). Frechet regression for random objects with euclideanpredictors. Ann. Statist., 47 691–719. URL https://doi.org/10.1214/17-AOS1624.

25

Page 26: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Preda, C. (2007). Regression models for functional data by reproducing kernel hilbert spacesmethods. Journal of Statistical Planning and Inference, 137 829 – 840. Special Issue on Nonpara-metric Statistics and Related Topics: In honor of M.L. Puri, URL http://www.sciencedirect.

com/science/article/pii/S0378375806001443.

Ramsay, J., Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer Seriesin Statistics, Springer. URL https://books.google.es/books?id=mU3dop5wY_4C.

Selvin, E., Crainiceanu, C. M., Brancati, F. L. and Coresh, J. (2007). Short-term vari-ability in measures of glycemia and implications for the classification of diabetes. Archives ofinternal medicine, 167 1545–1551.

Service, F. J. (2013). Glucose variability. Diabetes, 62 1398–1404. https://

diabetes.diabetesjournals.org/content/62/5/1398.full.pdf, URL https://diabetes.

diabetesjournals.org/content/62/5/1398.

Service, F. J., Molnar, G. D., Rosevear, J. W., Ackerman, E., Gatewood, L. C. andTaylor, W. F. (1970). Mean amplitude of glycemic excursions, a measure of diabetic insta-bility. Diabetes, 19 644–655. http://diabetes.diabetesjournals.org/content/19/9/644.

full.pdf, URL http://diabetes.diabetesjournals.org/content/19/9/644.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall,London.

Singh, R., Barden, A., Mori, T. and Beilin, L. (2001). Advanced glycation end-products: areview. Diabetologia, 44 129–146. URL https://doi.org/10.1007/s001250051591.

Skrha, J., Soupal, J. and Prazny, M. (2016). Glucose variability, hba1c and microvascularcomplications. Reviews in Endocrine and Metabolic Disorders, 17 103–110.

Szabo, Z., Sriperumbudur, B. K., Poczos, B. and Gretton, A. (2016). Learning theory fordistribution regression. J. Mach. Learn. Res., 17 5272–5311.

Szekely, G. J. and Rizzo, M. L. (2017). The energy of data. Annual Review of Statistics andIts Application, 4 447–479. https://doi.org/10.1146/annurev-statistics-060116-054026,URL https://doi.org/10.1146/annurev-statistics-060116-054026.

Talska, R., Menafoglio, A., Machalova, J., Hron, K. and Fivserova, E. (2018). Com-positional regression with functional response. Computational Statistics & Data Analysis, 12366–85.

Taylor, R. (2013). Type 2 diabetes. Diabetes Care, 36 1047–1055. https:

//care.diabetesjournals.org/content/36/4/1047.full.pdf, URL https://care.

diabetesjournals.org/content/36/4/1047.

Villani, C. (2008). Optimal transport: old and new, vol. 338. Springer Science & Business Media.

Wang, J.-L., Chiou, J.-M. and Muller, H.-G. (2016). Functional data analysis. Annual Reviewof Statistics and Its Application, 3 257–295.

26

Page 27: Glucodensities: a new representation of glucose pro les using … · 2020. 8. 19. · Glucodensities: a new representation of glucose pro les using distributional data analysis Marcos

Xu, M., Reiss, P. T. and Cribben, I. (????). Generalized reliability based on distances.Biometrics, n/a. https://onlinelibrary.wiley.com/doi/pdf/10.1111/biom.13287, URLhttps://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13287.

Zaccardi, F. and Khunti, K. (2018). Glucose dysregulation phenotypes—time to improve out-comes. Nature Reviews Endocrinology, 14 632–633.

Zeevi, D., Korem, T., Zmora, N., Israeli, D., Rothschild, D., Weinberger, A., Ben-Yacov, O., Lador, D., Avnit-Sagi, T., Lotan-Pompan, M. et al. (2015). Personalizednutrition by prediction of glycemic responses. Cell, 163 1079–1094.

27