
Learning representations of irregular particle-detector geometry with distance-weighted graph networks

Shah Rukh Qasim 1,2 ([email protected]), Jan Kieseler 1 ([email protected]), Yutaro Iiyama 1 ([email protected]), and Maurizio Pierini 1 ([email protected])

1 CERN, Geneva, Switzerland
2 National University of Sciences and Technology, Islamabad, Pakistan

July 25, 2019

Abstract. We explore the use of graph networks to deal with irregular-geometry detectors in the context of particle reconstruction. Thanks to their representation-learning capabilities, graph networks can exploit the full detector granularity, while natively managing the event sparsity and arbitrarily complex detector geometries. We introduce two distance-weighted graph network architectures, dubbed GarNet and GravNet layers, and apply them to a typical particle reconstruction task. The performance of the new architectures is evaluated on a data set of simulated particle interactions on a toy model of a highly granular calorimeter, loosely inspired by the endcap calorimeter to be installed in the CMS detector for the High-Luminosity LHC phase. We study the clustering of energy depositions, which is the basis for calorimetric particle reconstruction, and provide a quantitative comparison to alternative approaches. The proposed algorithms provide an interesting alternative to existing methods, offering equally performing or less resource-demanding solutions with fewer underlying assumptions on the detector geometry and, consequently, the possibility to generalize to other detectors.

1 Introduction

Traditionally, Machine Learning (ML) techniques have been a key ingredient of event processing at particle colliders, employed in tasks such as particle reconstruction (clustering), identification (classification), and energy or direction measurement (regression) in calorimeters and tracking devices. The first applications of Neural Networks to High Energy Physics (HEP) date back to the '80s [1,2,3,4]. Starting with the MiniBooNE experiment [5], Boosted Decision Trees became the state of the art and played a crucial role in the discovery of the Higgs boson by the ATLAS and CMS experiments [6]. Recently, a series of studies on different aspects of LHC data taking and data processing workflows have demonstrated the potential of Deep Learning (DL) in collider applications, both as a way to speed up current algorithms and as a way to improve their performance. Nevertheless, the list of DL models actually deployed in the centralized workflows of the LHC experiments remains quite short.1 Many of these studies,

1 As an example, at the moment such a list for the CMS experiment consists of a set of b-tagging algorithms [7,8] and a data quality monitoring algorithm for the muon drift tube chambers [9]. Other applications exist at the analysis level, downstream from the centralized event processing. In data analyses, one typically considers abstract four-momenta and not low-level quantities such as detector hits, making the use of DL techniques easier.

which are typically proof-of-concept demonstrations, are based on convolutional neural networks (CNNs) [10], which perform computer vision tasks by applying translation-invariant kernels to raw digital images. CNN architectures applied to HEP data thus impose the requirement that the particle detectors be represented as regular arrays of sensors. This requirement, common to many of the approaches described in Section 2, creates problems for realistic applications of CNNs in collider experiments.2

In this work, we propose novel Deep Learning architectures based on graph networks to improve the performance and reduce the execution time of typical particle-reconstruction tasks, such as cluster reconstruction and particle identification. In contrast to CNNs, graph networks can learn optimal detector-hit representations without making specific assumptions on the detector geometry. In particular, no data preprocessing is required, even for detectors with irregular geometries. We consider the specific case of particle reconstruction in calorimeters, for which this characteristic of graph networks may become especially relevant in the near future. In view of the High-Luminosity LHC phase, the endcap calorimeter of the CMS detector will be replaced by a novel-design digital calorimeter, the High Granularity Calorimeter (HGCAL),

2 The picture is completely different in other HEP domains. For instance, CNNs have been successfully deployed in neutrino experiments, where the regular-array assumption meets the geometry of a typical detector.



consisting of arrays of hexagonal silicon sensor cells interleaved with absorber layers [11]. Being positioned close to the beam pipe and exposed to ∼200 proton-proton collisions on average per bunch crossing, this detector will be characterized by high occupancy over its large number of readout channels. Downstream in the data processing pipeline, the unprecedented number of sensors and their geometry will cause an increase in event size and, consequently, in computational needs, necessitating novel data processing approaches given the expected computing limitations [12]. The detector we consider in this study, described in detail in Section 4, is loosely inspired by the HGCAL geometry. In particular, it features a similarly irregular sensor structure, with sensor sizes varying with the detector depth as well as within a single layer. On the other hand, the HGCAL hexagonal sensors were traded for square-shaped sensors, in order to keep the computing resources needed to generate the training data set within a manageable limit.

As a benchmark application, we consider the basis for all further particle reconstruction tasks in a calorimeter: the clustering of the recorded energy deposits into disentangled showers from individual particles. To this purpose, we introduce two novel distance-weighted graph network architectures, the GarNet and the GravNet layers, which are designed to provide a good balance between performance and computing resource needs for inference. While our discussion is limited to a calorimetry-related problem, the design of these new layer architectures is such that it automatically generalizes to any kind of sparse data, such as hits collected by a typical tracking device or reconstructed particle candidates inside a hadronic jet. We believe that architectures of this kind are more practical to deploy in a realistic experimental environment and could become relevant for the LHC experiments, both for offline and real-time event processing and selection, as well as for shower simulation.

This paper is structured as follows: Section 2 reviews related previous works. In Section 3, we describe the GravNet and GarNet architectures. Section 4 describes the data set used for this study. Section 5 introduces the metrics used to optimize the networks. Section 6 describes the models. The results are presented in Sections 7 and 8 in terms of accuracy and computational efficiency, respectively. Conclusions are presented in Section 9.

2 Related Work

In recent years, DL models, and in particular CNNs, have become very popular in different areas of HEP. CNNs were successfully applied to calorimeter-oriented tasks, including particle identification [11,13,14,15,16], energy regression [11,13,15,16], hadronic jet identification [17,18,19,20], fast simulation [13,21,22,23,24], and pileup subtraction in jets [25]. Many of these works assume a simplified detector description: the detector is represented as a regular array of sensors expressed as 2D or 3D images, and the problem of overlapping regions at the transition between detector components (e.g. barrel and endcap) is ignored.

Sometimes the fixed-grid pixel shape is intended to reflect the typical angular resolution of the detector, which is implicitly assumed to be constant, while in reality it depends on the energy of the incoming particle.

In order to overcome this practical difficulty with CNN architectures, different HEP-related studies investigating alternative architectures have been performed. In the context of jet identification, several authors studied models based on recurrent [7,8,26] and recursive [27] networks, graph networks [28], and DeepSets [29]. Recurrent architectures have also been studied for event classification [30]. In general, these approaches take as input a particle-based representation of an event and thus are easier to apply in applications running after a global event reconstruction based on a particle-flow algorithm [31,32].

Outside the HEP domain, overcoming the necessity of a regular structure motivated original research on graph-based networks [33], which in general are suited for processing point-wise data with no regular structure by representing them as vertices in a graph. A comprehensive overview of various graph-based networks can be found in Ref. [34]. In a typical implementation of a graph-based network, the vertices are connected according to some predefined criteria at the preprocessing stage. The connections between the vertices (edges) then define paths of information exchange [35,36]. In some cases, the edge and vertex properties are used to infer the attention (weight) assigned to each neighbour during this information exchange, while leaving the neighbour relations (adjacency matrix) unchanged [37]. Some of these architectures have already been considered for collider physics, in the context of jet tagging [38], event topology classification [39], and pileup subtraction [40].

Particularly interesting for irregular detectors are, however, networks that are capable of learning the geometry, as studied in combination with message passing [41]. Within this approach, the adjacency matrix is trainable. In other words, the neighbour relations, which encode the detector geometry, are not imposed at the preprocessing stage but are inferred from the input data. Although this approach is promising, its downside is the need to connect all vertices with each other, which makes it computationally challenging for graphs with a large number of vertices, as the memory requirement becomes forbiddingly high. This problem is overcome by defining only a subset of connections between neighbours in a learnable space representation, where the edge properties of each vertex with respect to a limited number of its neighbours are used to calculate a new feature representation per vertex, which is then passed to the next layer of similar structure [42]. This approach is implemented in the EdgeConv layer and the corresponding DGCNN model [42]. The neighbours are selected based on the new vertex features, which makes it particularly challenging to create a gradient for training with respect to the neighbour selection. The DGCNN model works around this issue by using the edge features themselves. However, due to the dynamic calculation of neighbour relations in high-dimensional space, this network still requires substantial computing resources, which


would make its use for triggering purposes in collider detectors unfeasible.

3 The GravNet and GarNet layers

The neural network layers proposed in this study are designed to provide competitive performance on particle reconstruction tasks while dealing with data sparsity in an efficient way. These architectures aim to keep a trainable space representation at minimal computational cost. The layers receive as input a B × V × F_IN data set, consisting of a batch of B examples, each represented by a set of V detector hits, embedded in the network through F_IN features. For instance, the F_IN features could include the Cartesian coordinates of a given sensor, its address (layer number, module number, etc.), the sensor time stamp, the recorded energy, etc.

A pictorial representation of the operations performed by the two layers is shown in Fig. 1. For both architectures, the first step is to apply a dense3 neural network to each of the V detector hits, deriving from the F_IN features two output arrays: the first array (S) is interpreted as a set of coordinates in some learned representation space (for the GravNet layer) or as the distances between the considered vertex and a set of S aggregators (for the GarNet layer); the second array (F_LR) is interpreted as a learned representation of the vertex features. At this point, a given input example of initial dimension V × F_IN is converted into a graph with V vertices in the abstract space identified by S. Each vertex is represented by the F_LR features, derived from the initial inputs. The projection from V × F_IN to this graph is linear, with trainable weights and bias vectors.

The main difference between the GravNet and the GarNet architectures is in the way the V vertices are connected when building the graph. In the case of the GravNet layer, the Euclidean distances d_jk between pairs (j, k) of vertices in the S space are used to associate to each vertex its N closest neighbours. In the case of the GarNet layer, the graph is built by connecting each of the V vertices to a set of dim(S) aggregators. What is learned by S, in this case, is the distance between a vertex and each of the aggregators.
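As an illustration of the GravNet neighbour selection, the following TensorFlow sketch computes the N nearest neighbours of each vertex in the learned space. This is only a sketch of the idea, not the released implementation (see the repository referenced in the footnote below); the function name and tensor layout are our own choices:

```python
import tensorflow as tf

def knn_in_learned_space(S, N):
    """Select the N closest neighbours of each vertex in the learned space S.

    S: float tensor of shape (V, dim_S) with the learned coordinates of
    the V vertices. Returns (distances, indices), each of shape (V, N).
    """
    # Pairwise squared Euclidean distances d2[j, k] between all vertices.
    d2 = tf.reduce_sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
    # top_k on the negated distances yields the smallest entries per row;
    # k = N + 1 because each vertex is its own closest neighbour.
    neg_d2, idx = tf.math.top_k(-d2, k=N + 1)
    # Drop the self-edge in column 0 and return true distances.
    return tf.sqrt(-neg_d2[:, 1:]), idx[:, 1:]
```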

Once the edges of the graph are built, each vertex (aggregator) of the GravNet (GarNet) layer collects the information associated with the F_LR features across its edges. This is done in three steps:

1. The quantities

   $f^i_{jk} = f^i_j \times V(d_{jk})$   (1)

   are computed for each feature f^i of each of the vertices v_j connected to a given vertex or aggregator v_k, scaling the original value by a potential, a function of the Euclidean distance d_jk, giving the gravitational network GravNet its name. The potential function is introduced to enhance the contribution of close-by vertices. For this reason, V has to be a decreasing function of d_jk. In this study, we use a Gaussian potential $V(d_{jk}) = \exp(-d_{jk}^2)$ for the GravNet layer4 and an exponential potential $V(d_{jk}) = \exp(-|d_{jk}|)$ for the GarNet layer.

2. The f^i_jk quantities computed from all the edges associated with a vertex or aggregator v_k are combined, generating a new feature f^i_k of v_k. For instance, we consider the average of the f^i_jk across the j edges and their maximum. In our case, it was particularly crucial to extend the choice of aggregation functions beyond the maximum, which was already explored for similar architectures [42]. In fact, the mean function (as any other similar function) helped improve the convergence of the model by taking into account the contribution of all the vertices.

3. Each combination rule adopted in the previous step generates a new set of features F_LR. All of them are concatenated to the original F_IN vector. This extended vector is transformed into a set of F_OUT new vertex features, using a fully connected dense layer with tanh activation. The concatenation is done for each initial vertex. In the case of the GarNet layer, this requires an additional step of passing the f^i_k features of the v_k aggregators back to the initial vertices, weighted by the V(d_jk) potential. This exchange of the garnered information through the aggregators gives the GarNet layer its name.

3 Here and in the following, dense layer refers to a learnable weight-matrix multiplication and bias-vector addition with respect to the last feature dimension, with shared weights over all other dimensions. In this case, the weights and bias are applied to the vertex features F_IN and shared over the vertices V. This can also be thought of as a 2D convolution with a 1 × 1 kernel.

The full process transforms the initial B × V × F_IN data set into a B × V × F_OUT data set. As is common with graph networks, the main advantage comes from the fact that the F_OUT output (unlike the F_IN input) carries collective information from each vertex and its surroundings, providing a more informative input to downstream processing. Thanks to the distinction between the learned space information S and the learned features F_LR, the dimensionality of connections in the graph is kept under control, resulting in a smaller memory consumption than, for instance, the EdgeConv layer.
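The three steps above can likewise be sketched for a single example, again as an illustration rather than the released code; gravnet_aggregate and its arguments are hypothetical names, and the neighbour indices are assumed to come from a selection like the one sketched earlier:

```python
import tensorflow as tf

def gravnet_aggregate(F_IN, F_LR, neighbor_idx, neighbor_dist, F_OUT=48):
    """Sketch of steps 1-3 for a single example.

    F_IN: (V, n_in) input features; F_LR: (V, n_lr) learned features;
    neighbor_idx, neighbor_dist: (V, N) from the neighbour selection.
    """
    # Step 1: gather the neighbours' features and scale them by the
    # Gaussian potential V(d) = exp(-d^2) of Eq. (1).
    f_j = tf.gather(F_LR, neighbor_idx)              # (V, N, n_lr)
    potential = tf.exp(-neighbor_dist ** 2)[..., None]
    f_jk = f_j * potential
    # Step 2: combine across the edges with two aggregation functions.
    f_mean = tf.reduce_mean(f_jk, axis=1)            # (V, n_lr)
    f_max = tf.reduce_max(f_jk, axis=1)              # (V, n_lr)
    # Step 3: concatenate with the original inputs and project to the
    # F_OUT new vertex features with a dense layer and tanh activation.
    extended = tf.concat([F_IN, f_mean, f_max], axis=-1)
    return tf.keras.layers.Dense(F_OUT, activation="tanh")(extended)
```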

The two layer architectures and the models based on them, described in the following sections, are implemented in TensorFlow [43].5

4 Data set

The data set used in this paper is based on a simplified calorimeter with irregular geometry, built in GEANT4 [44].

4 A gravitational potential (−1/d) has singularities at d = 0 and therefore cannot be used; however, the potential we are using has a similar qualitative effect of pulling together vertices.

5 The code for the models and layers can be found at https://github.com/jkiesele/caloGraphNN


Fig. 1: Pictorial representation of the data flow across the GarNet and the GravNet layers. (a) The input features F_IN of each v_i ∈ V are processed by a dense neural network with two output arrays: a set of learned features F_LR and spatial information S in some learned representation space. (b) In the case of the GravNet layer, the S quantities are interpreted as the coordinates of the vertices in some abstract space. The graph is built in this space, connecting each v_i to its N closest neighbours (N = 4 in the figure), using the Euclidean distance d_ij between the vertices to rank the neighbours. (c) In the case of the GarNet layer, the S quantities are interpreted as the distances between the vertices and a set of S aggregators in some abstract space. The graph is then built connecting each v_i vertex to each a_j aggregator, and the S quantities are the d_ij Euclidean distances. (d) Once the graph structure is established, the f^i_j features of the v_j vertices connected to a given vertex or aggregator v_k are converted into the f^i_jk quantities, through a potential (a function of d_jk). The corresponding information is then gathered across the graph and turned into a new feature f^i_k of v_k (e.g. summing over the edges, or taking the maximum). (e) For each choice of gathering function, a new set of features f^i_k ∈ F_LR is generated. The F_LR vector is concatenated to the initial F_IN vector. The resulting feature vector is given as input to a dense neural network with tanh activation, which returns the output representation F_OUT.

Fig. 2: Calorimeter geometry. The markers indicate the centres of the sensors, and their size the sensor size. Layers are colour-coded for better visualisation.

The calorimeter is made entirely of tungsten, with a width of 30 cm × 30 cm in the x and y directions and a length of 2 m in the longitudinal direction (z), which corresponds to 20 nuclear interaction lengths. The longitudinal dimension is further split into 20 layers of equal thickness. Each layer contains square sensor cells, with a fine segmentation in the quadrant with x > 0 and y > 0 and a lower granularity elsewhere. The total number of cells and their individual sizes vary by layer, replicating the basic features of a slightly irregular calorimeter. For more details, see Fig. 2 and Table 1.

Charged pions are generated at z = −2 m; the x and y coordinates of the generation vertex are randomly sampled within |x| < 5 cm and |y| < 5 cm. The x and y components of the particle momentum are set to 0, while the z component is sampled uniformly between 10 and 100 GeV. The particles therefore impinge on the calorimeter front face perpendicularly and shower along the longitudinal direction.

The resulting total energy deposit in each cell, as well as the cell position, width, and layer number, are recorded for each event. These quantities correspond to the F_IN feature vector given as input to the graph models (see Section 3).


Table 1: Number of cells in the finely segmented quadrant and in the rest of the layer, for the benchmark calorimeter geometry described in the text.

Layer    Cells (x > 0, y > 0)    Cells elsewhere
0        64                      48
1        64                      108
2–3      100                     192
4–7      64                      108
8–11     64                      48
12–13    16                      12
14–19    4                       3

Each example consists of the result of two overlapping showers. Cell by cell, the energies of the two showers are summed, and the fraction belonging to each of the showers in each cell is defined as the ground truth. In addition, the position of the largest energy deposit per shower is recorded. If this position is the same for the two overlapping showers, they are considered not separable and the event is discarded. This applies to about 5% of the events.
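As an illustration of how such targets can be constructed, the following numpy sketch overlays two single-shower energy maps and computes the per-cell fractions; the function name, array names, and the flat per-cell indexing are assumptions made for the example:

```python
import numpy as np

def build_targets(e1, e2):
    """Overlay two single-shower energy maps (flat per-cell arrays) and
    build the per-cell fraction targets; returns None if the two showers
    peak in the same cell (not separable, ~5% of the events)."""
    if np.argmax(e1) == np.argmax(e2):
        return None
    e_tot = e1 + e2
    # Fractions are well defined only where some energy was deposited.
    frac1 = np.divide(e1, e_tot, out=np.zeros_like(e_tot), where=e_tot > 0)
    frac2 = np.divide(e2, e_tot, out=np.zeros_like(e_tot), where=e_tot > 0)
    return e_tot, np.stack([frac1, frac2], axis=-1)
```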

In total, 16 000 000 events are generated. Out of these, 100 000 are used for validation and 250 000 for testing. The rest is used for training.

5 Clustering metrics

To identify individual showers and use their properties, e.g. for a subsequent particle identification task, the energy deposits should be clustered so that overlapping parts are identified without removing important parts of the original shower. Therefore, the clustering algorithms should predict the energy fraction of each sensor belonging to each shower. Lower energy deposits are slightly less important. These considerations define the loss function:

$L = \sum_k \frac{\sum_i \sqrt{E_i t_{ik}}\,(p_{ik} - t_{ik})^2}{\sum_i \sqrt{E_i t_{ik}}}$ ,   (2)

where p_ik and t_ik are the predicted and true energy fractions in sensor i and shower k. These are weighted by the square root of E_i t_ik, which is the total energy deposit in sensor i belonging to shower k, to introduce a mild energy scaling within each shower.
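A direct numpy transcription of Equation (2), for a single event with two showers, could look as follows (a sketch; the argument layout is our own choice):

```python
import numpy as np

def clustering_loss(E, p, t):
    """Loss of Eq. (2). E: (n_sensors,) deposited energies; p, t:
    (n_sensors, 2) predicted and true fractions for the two showers."""
    w = np.sqrt(E[:, None] * t)                       # sqrt(E_i * t_ik)
    # Weighted mean-square error per shower, then summed over showers.
    return np.sum(np.sum(w * (p - t) ** 2, axis=0) / np.sum(w, axis=0))
```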

In addition, in each event we randomly label one of the showers as the test shower and the other as the noise shower, and define the clustering energy response R_k of shower k (k = test, noise) as:

$R_k = \frac{\sum_i E_i\, p_{ik}}{\sum_i E_i\, t_{ik}}$ .   (3)
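Equation (3) translates into a one-line reduction; again a sketch with an assumed argument layout:

```python
import numpy as np

def clustering_response(E, p, t, k):
    """Clustering energy response R_k of Eq. (3); k = 0 (test) or 1 (noise)."""
    return np.sum(E * p[:, k]) / np.sum(E * t[:, k])
```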

6 Models

The models need to incorporate neural network layers to identify localized structures as well as to perform information exchange globally between the sensors. This can be achieved either by multiple message-passing iterations between neighbouring sensors or by a direct global information exchange. Here, we employ a combination of both. The input to all models is an array of sensors, each holding its recorded energy deposit, global position coordinates, sensor size, and layer number. We compare three different graph-network approaches to a CNN-based approach (Binning), presented as a baseline. Each model is designed to contain approximately 100 000 free parameters. The model structures are as follows:

– Binning: a regular grid of 20 × 20 × 20 pixels is imposed on the irregular geometry. Each pixel contains the information of at most one sensor6. The information is concatenated to the mean of these features over all pixels, pre-processed by one 1 × 1 × 1 CNN layer with 20 nodes, and then fed through eight blocks of CNN layers. Each block consists of a CNN layer with a kernel of 7 × 7 × 1 followed by a layer with a kernel of 1 × 1 × 3, each containing 14 filters. The output of each block is passed to the next block and simultaneously added to a list of all block outputs. All CNN layers employ tanh activation functions. Finally, the full list of block outputs per pixel is reshaped to represent the vertices of the graph and fed through a dense layer with 128 nodes and ReLU activation. Different CNN models have also been tested and showed similar or worse performance.

– DGCNN model: adapting the model proposed in Ref. [42] to our problem, the sensor features are interpreted as positions of points in a 16-dimensional space and fed through one global space transformation followed by four blocks, each comprising one EdgeConv layer. Our EdgeConv layer has a configuration similar to that of Ref. [42], with 40 neighbouring vertices and three internal dense layers with ReLU activation and 64 nodes each, acting on the edges. The output of the EdgeConv layer is concatenated with its mean over all vertices and fed to one dense layer with 64 nodes and ReLU activation, which concludes the block. The output of each block is passed to the next block and simultaneously added to a list of all block outputs per vertex, together with the mean over vertices. This list is finally fed to a dense layer with 32 nodes and ReLU activation.

– GravNet model: the model consists of four blocks. Each block starts by concatenating the mean of the vertex features to the vertex features, followed by three dense layers with 64 nodes and tanh activation and one GravNet layer with S = 4 coordinate dimensions, F_LR = 22 features to propagate, and F_OUT = 48 output nodes per vertex. For each vertex, 40 neighbours are considered. The output of each block is passed as input to the next block and added to a list containing the output of all blocks (a schematic sketch of one such block is given after this list). This determines the full vector of vertex features passed to a final dense layer with 128 nodes and ReLU activation.

6 Alternative configurations with more than one sensor per pixel were also investigated and showed similar performance.


– GarNet model: the original vertex features are concatenated with the mean of the vertex features and then passed to one dense layer with 32 nodes and tanh activation before entering 11 subsequent GarNet layers. These layers contain S = 4 aggregators, to which F_LR = 20 features are passed, and F_OUT = 32 output nodes. The output of each layer is passed to the next and added to a vector containing the concatenated outputs of each GarNet layer. The latter is finally passed to a dense layer with 48 nodes and ReLU activation.

In all cases, each output vertex of these model building blocks is fed through one dense layer with ReLU activation and three nodes, followed by a dense layer with two output nodes and softmax activation. This last processing step determines the energy fraction belonging to each shower. Batch normalisation [45] is applied in all models to the input and after each block.
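As an illustration of this common structure, the following Keras sketch shows one GravNet-model block and the shared output head. The GravNet layer itself is passed in as a callable, since its full implementation lives in the repository cited in Section 3; gravnet_block and output_head are hypothetical names chosen for this example:

```python
import tensorflow as tf
from tensorflow.keras import layers

def gravnet_block(x, gravnet_layer):
    """One block of the GravNet model: concatenate the feature mean over
    vertices, apply three dense layers, then one GravNet layer.
    x: (B, V, F) vertex features; gravnet_layer: a callable implementing
    the layer of Section 3."""
    mean = tf.reduce_mean(x, axis=1, keepdims=True)           # (B, 1, F)
    x = tf.concat([x, tf.broadcast_to(mean, tf.shape(x))], axis=-1)
    for _ in range(3):
        x = layers.Dense(64, activation="tanh")(x)
    return gravnet_layer(x)                                   # (B, V, 48)

def output_head(x):
    """Shared head: per-vertex energy fractions for the two showers."""
    x = layers.Dense(3, activation="relu")(x)
    return layers.Dense(2, activation="softmax")(x)
```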

All models are trained on the full training data set using the Adam optimizer [46] and an initial learning rate of about 3 × 10^−4, the exact value depending on the model. The learning rate is reduced exponentially in steps to a minimum of 3 × 10^−6 after 2 million iterations. Once the learning rate has reached the minimum level, it is modulated by 10% at a fixed frequency, following the method proposed in Ref. [47].
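The schedule can be sketched as a function of the iteration number. The continuous decay used here (the paper reduces the rate in steps), the sinusoidal shape of the 10% modulation, and the modulation period are illustrative assumptions:

```python
import math

def learning_rate(step, lr0=3e-4, lr_min=3e-6,
                  decay_steps=2_000_000, mod_period=50_000):
    """Exponential decay from lr0 to lr_min over decay_steps iterations,
    then a 10% modulation at a fixed frequency."""
    if step < decay_steps:
        return lr0 * (lr_min / lr0) ** (step / decay_steps)
    return lr_min * (1.0 + 0.1 * math.sin(2.0 * math.pi * step / mod_period))
```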

7 Clustering performance

All approaches described in Section 6 perform well for clustering purposes. An example is shown in Fig. 3, where two charged pions with an energy of approximately 50 GeV enter the calorimeter. One pion loses a significant fraction of its energy in an electromagnetic shower in the first calorimeter layers. The remaining energy is carried by a single particle passing through the central part of the calorimeter before showering. The second pion passes the first layers as a minimally ionizing particle and showers in the central part of the calorimeter. Even though the two showers largely overlap, the GravNet network (shown here as an example) is able to identify and separate the two showers very well. The track within the calorimeter is well identified and reconstructed and the energy fractions properly assigned, even in the parts where the two showers heavily overlap. Similar performance can be observed with the other investigated methods.

Quantitatively, the models are compared with respect to multiple performance metrics. The first two are the mean and the variance of the loss function value (µ_L and σ_L) computed according to Equation (2) over the test events. The mean and the variance of the test shower response (µ_R and σ_R), where the response is defined in Equation (3), are also compared. While the test shower response follows an approximately normal distribution over the majority of the test events, a small outlier population, where the shower clustering fails, leads µ_R and σ_R to misparametrize the core of the distribution. Therefore, the response kernel mean µ*_R and variance σ*_R, restricted to test showers with response between 0.2 and 2.8, are added to the set of evaluation metrics.

Fig. 3: Comparison of (a) true energy fractions and (b) energy fractions reconstructed by the GravNet model, for two charged pions with an energy of approximately 50 GeV showering in different parts of the calorimeter. Colours indicate the fraction belonging to each of the showers. The size of the markers scales with the square root of the energy deposit in each sensor.

In addition, we also compare the clustering accuracy (A), defined as the fraction of showers with response between 0.7 and 1.3. Finally, the above set of metrics is duplicated, with the second set using only the sensors with energy fractions between 0.2 and 0.8 in the computation of the loss function and the response. The second set of metrics characterizes the performance of the models in the particularly challenging case of reconstructing significantly overlapping clusters. The two sets of metrics are called inclusive and overlap-specific in the remainder of the discussion.


The metric values are listed in Table 2. Comparing the inclusive metrics, it can be seen that the GravNet layer outperforms the other approaches, including even the more resource-intensive DGCNN model. The GarNet model performance is in between the DGCNN model and the binning approach in terms of reconstruction of individual shower hit fractions, parametrized by µ_L and σ_L. However, in characteristics related to the clustering response, the binning model slightly outperforms the GarNet and DGCNN models. On the other hand, with respect to the overlap-specific metrics, the graph-based approaches outperform the binning approach. The DGCNN and GravNet models perform equally well, and the GarNet model lies in between the binning approach and GravNet.

Table 2: Mean and variance of loss, response, and response within the Gaussian kernel, as well as clustering accuracy.

Inclusive
           µ_L    σ_L    µ_R    σ_R    µ*_R   σ*_R   A
Binning    0.191  0.017  1.083  0.183  1.046  0.057  0.867
DGCNN      0.174  0.012  1.082  0.179  1.045  0.052  0.881
GarNet     0.182  0.011  1.086  0.190  1.048  0.055  0.872
GravNet    0.172  0.012  1.077  0.173  1.042  0.049  0.886

Overlap-specific
           µ_L    σ_L     µ_R    σ_R    µ*_R   σ*_R   A
Binning    0.163  0.0045  1.005  0.099  1.004  0.096  0.697
DGCNN      0.154  0.0046  1.004  0.090  1.002  0.087  0.728
GarNet     0.157  0.0048  1.005  0.095  1.004  0.092  0.714
GravNet    0.156  0.0047  1.004  0.091  1.003  0.088  0.721

One should notice that part of the incorrectly predicted events are actually correctly clustered events in which the test shower is labelled as the noise shower (shower swapping). Since the labelling is irrelevant in a clustering problem, this behaviour is not a real inefficiency of the algorithm. We denote by s the fraction of events where this behaviour is observed. In Table 3, we calculate the loss for both label choices and evaluate the performance parameters for the assignment that minimizes the loss. The binning model shows the largest fraction of swapped showers. The difference in response between the best-performing GravNet model and the GarNet model is enhanced, while the difference between the GravNet and DGCNN models scales similarly, likely because of their similar general structure.
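The swap correction can be sketched by evaluating the loss of Equation (2) under both label assignments and keeping the smaller one. This builds on the clustering_loss sketch of Section 5, and the function name is hypothetical:

```python
def swap_corrected_loss(E, p, t):
    """Evaluate the loss of Eq. (2) for both shower-label assignments and
    keep the smaller one. Returns the corrected loss and a flag marking a
    swapped event; clustering_loss is the Section 5 sketch."""
    direct = clustering_loss(E, p, t)
    swapped = clustering_loss(E, p, t[:, ::-1])   # test/noise labels exchanged
    return (direct, False) if direct <= swapped else (swapped, True)
```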

In Fig. 4, the performance of the models is compared in bins of the test shower energy with respect to the inclusive and overlap-specific µ_R and σ_R. For the inclusive metrics, the GravNet model outperforms the other models over the full range, and the GarNet model shows the worst performance, albeit within a comparable range. The resource-intensive DGCNN model lies in between GravNet and GarNet.

The overall upward bias in the response for lower shower energies warrants an explanation. This bias is a result of edge effects, induced by our choice of using an adapted mean-square-error loss to predict a quantity bounded in [0,1] (the energy fraction). This choice of loss function creates an expectation value larger than 0 at a peak value of 0 (and vice versa at a fraction of 1), and therefore pushes the prediction away from being exactly 0 or 1, leading to an underestimation at high energies and an overestimation at low energies.

Table 3: Mean and variance of loss, response, and response within the Gaussian kernel, as well as clustering accuracy, corrected for shower swapping. The last column shows the fraction of swapped showers.

Inclusive
           µ_L    σ_L    µ_R    σ_R    µ*_R   σ*_R   A      s [%]
Binning    0.179  0.007  1.076  0.139  1.047  0.054  0.875  3.2
DGCNN      0.167  0.006  1.076  0.138  1.047  0.050  0.887  2.6
GarNet     0.176  0.006  1.081  0.149  1.049  0.054  0.877  2.5
GravNet    0.164  0.006  1.071  0.126  1.044  0.047  0.892  2.7

Overlap-specific
           µ_L    σ_L     µ_R    σ_R    µ*_R   σ*_R   A      s [%]
Binning    0.160  0.0037  1.005  0.098  1.004  0.095  0.699  3.2
DGCNN      0.152  0.0038  1.003  0.089  1.002  0.086  0.729  2.6
GarNet     0.154  0.0040  1.005  0.094  1.003  0.091  0.715  2.5
GravNet    0.152  0.0039  1.004  0.090  1.003  0.087  0.722  2.7

The design of a customized loss function that eliminates this bias is left to future studies. For the moment, we are interested in a performance comparison between models, all affected by this bias.

For the overlap-specific metrics, the edge effects are highly suppressed. The figures confirm that the graph-based models outperform the binning method at all test shower energies. It is also seen that the GravNet and the DGCNN model show similar performance.

8 Resource requirements

In addition to the clustering performance, it is important to take into account the computational resources demanded by each model during inference. The required inference time and memory consumption can have a significant impact on the applicability of the network for reconstruction tasks in constrained environments, such as the online and offline central-processing workflows of a typical collider-physics experiment. We evaluate the inference time t and memory consumption m for the models studied here on one NVIDIA GTX 1080 Ti GPU for batch sizes of 1 and 100, denoted as (t_1, m_1) and (t_100, m_100), respectively. The inference time is also evaluated on one Intel Xeon E5-2650 CPU core (t^CPU_10) for a fixed batch size of 10. As shown in Fig. 5, memory consumption and execution times differ significantly between the models. The binning approach outperforms all other models, because of the highly optimized CNN implementations. The DGCNN model requires the largest amount of memory, while the model using the GravNet layers requires about 50% less. The GarNet model provides a good compromise of memory consumption with respect to performance. In terms of inference time, the binning model is the fastest, and the graph-based models show a similar behaviour for small batch sizes on a GPU. The GarNet and the GravNet model benefit from parallelizing over a larger batch.


Fig. 4: Mean (left) and variance (right) of the test shower response as a function of the test shower energy, for the full shower (top) and for overlapping showers (bottom), computed summing the true deposited energy. Swapping of the showers is allowed here.

In particular, the GarNet model is mostly sequential, which also explains its outstanding performance on a single CPU core, with an inference time almost a factor of 10 shorter than that of the DGCNN model.

9 Conclusions

In this work, we introduced the GarNet and GravNet layers, which are distance-weighted graph networks capable of learning irregular patterns of sparse data, such as the detector hits in a particle physics detector with realistic geometry. Using as a benchmark problem the hit clustering in a highly granular calorimeter, we show how these network architectures offer a good compromise between clustering performance and computational resource needs, when compared to CNN-based and other graph-based networks. In the specific case considered here, the performance of the GarNet and GravNet models is comparable to the CNN and graph baselines. On the other hand, the simulated calorimeter in the benchmark study is only slightly irregular and can still be represented by an almost regular array. In more realistic applications, e.g. with the hexagonal sensors and the non-projective geometry of the future HGCAL detector of CMS, the difference in performance between the graph-based approaches and the CNN-based approaches is expected to increase further, making the GarNet approach a very efficient candidate for fast and accurate inference, and the GravNet approach a good candidate for high-performance reconstruction with significantly smaller resource requirements but similar performance compared to the DGCNN model for a similar number of free parameters.

It should also be noted that the GarNet and GravNet architectures make no specific assumption on the structure of the underlying data, and thus can be employed for many other applications related to particle and event reconstruction, such as tracking and jet identification. Exploring the extent of usability of these architectures will be the focus of follow-up work.


Fig. 5: Comparison of inference time and memory consumption for the network architectures described in the text, evaluated on CPUs and GPUs with different choices of batch size (the quantities shown are t_1, t_100/100, t^CPU_10/1000, m_1, and m_100/1000). The shaded area represents the +1σ statistical uncertainty band.


Note added

After the completion of this work, Ref. [28] appeared, discussing the application of a similar approach to the problem of jet tagging.

Acknowledgments

We thank our CMS colleagues for many suggestions received in the development of this work. The training of the models was performed on the GPU clusters of the CERN TechLab and the CERN CMG group. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no 772369).

References

1. B. H. Denby, "Neural Networks and Cellular Automata in Experimental High-energy Physics," Comput. Phys. Commun., vol. 49, 1988.
2. C. Peterson, "Track Finding With Neural Networks," Nucl. Instrum. Meth., vol. A279, 1989.
3. P. Abreu et al., "Classification of the hadronic decays of the Z0 into b and c quark pairs using a neural network," Phys. Lett., vol. B295, 1992.
4. B. H. Denby, "Neural networks in high-energy physics: A ten year perspective," Comput. Phys. Commun., vol. 119, 1999.
5. H.-J. Yang, B. P. Roe, and J. Zhu, "Studies of boosted decision trees for MiniBooNE particle identification," Nucl. Instrum. Meth., vol. A555, 2005.
6. A. Radovic et al., "Machine learning at the energy and intensity frontiers of particle physics," Nature, vol. 560, no. 7716, 2018.
7. V. Khachatryan et al., "CMS Phase 1 heavy flavour identification performance and developments," Tech. Rep. CMS-DP-2017-013, 2017.
8. V. Khachatryan et al., "New Developments for Jet Substructure Reconstruction in CMS," Tech. Rep. CMS-DP-2017-027, 2017.
9. A. A. Pol et al., "Detector monitoring with artificial neural networks at the CMS experiment at the CERN Large Hadron Collider," Comput. Softw. Big Sci., vol. 3, 2019.
10. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Commun. ACM, vol. 60, pp. 84-90, May 2017.
11. CMS Collaboration, "The Phase-2 Upgrade of the CMS Endcap Calorimeter," Tech. Rep. CERN-LHCC-2017-023. CMS-TDR-019, 2017.
12. V. Khachatryan et al., "Technical Proposal for the Phase-II Upgrade of the CMS Detector," Tech. Rep. CERN-LHCC-2015-010. LHCC-P-008. CMS-TDR-15-02, 2015.
13. F. Carminati et al., "Calorimetry with deep learning: particle classification, energy regression, and simulation for high-energy physics." "Deep Learning for Physical Sciences" workshop at NIPS 2017, 2017.
14. D. Guest, K. Cranmer, and D. Whiteson, "Deep Learning and its Application to LHC Physics," Ann. Rev. Nucl. Part. Sci., vol. 68, 2018.
15. L. De Oliveira, B. Nachman, and M. Paganini, "Electromagnetic Showers Beyond Shower Shapes." arXiv:1806.05667 [hep-ex], 2018.
16. A. Abada et al., "FCC-hh: The Hadron Collider," The European Physical Journal Special Topics, vol. 228, pp. 755-1107, Jul 2019.
17. J. Cogan et al., "Jet-Images: Computer Vision Inspired Techniques for Jet Tagging," JHEP, vol. 02, 2015.
18. P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, "Deep learning in color: towards automated quark/gluon jet discrimination," JHEP, vol. 01, 2017.
19. L. de Oliveira et al., "Jet-images deep learning edition," JHEP, vol. 07, p. 069, 2016.
20. P. Baldi et al., "Jet Substructure Classification in High-Energy Physics with Deep Neural Networks," Phys. Rev., vol. D93, no. 9, 2016.
21. L. de Oliveira, M. Paganini, and B. Nachman, "Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis," Comput. Softw. Big Sci., vol. 1, no. 1, 2017.
22. M. Paganini, L. de Oliveira, and B. Nachman, "CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks," Phys. Rev., vol. D97, no. 1, 2018.
23. G. R. Khattak, S. Vallecorsa, and F. Carminati, "Three dimensional energy parametrized generative adversarial networks for electromagnetic shower simulation," 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3913-3917, 2018.
24. P. Musella and F. Pandolfi, "Fast and Accurate Simulation of Particle Detectors Using Generative Adversarial Networks," Comput. Softw. Big Sci., vol. 2, no. 1, 2018.
25. P. T. Komiske et al., "Pileup Mitigation with Machine Learning (PUMML)," JHEP, vol. 12, 2017.
26. ATLAS Collaboration, "Identification of Jets Containing b-Hadrons with Recurrent Neural Networks at the ATLAS Experiment," Tech. Rep. ATL-PHYS-PUB-2017-003, 2017.
27. G. Louppe et al., "QCD-Aware Recursive Neural Networks for Jet Physics," JHEP, vol. 01, 2019.
28. H. Qu and L. Gouskos, "ParticleNet: Jet Tagging via Particle Clouds." arXiv:1902.08570 [hep-ph], 2019.
29. P. T. Komiske, E. M. Metodiev, and J. Thaler, "Energy Flow Networks: Deep Sets for Particle Jets," JHEP, vol. 01, 2019.
30. T. Q. Nguyen et al., "Topology classification with deep learning to improve real-time event selection at the LHC." arXiv:1807.00083 [hep-ex], 2018.
31. A. M. Sirunyan et al., "Particle-flow reconstruction and global event description with the CMS detector," JINST, vol. 12, no. 10, 2017.
32. M. Aaboud et al., "Jet reconstruction and performance using particle flow with the ATLAS Detector," Eur. Phys. J., vol. C77, no. 7, 2017.
33. F. Scarselli et al., "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, 2009.
34. P. W. Battaglia et al., "Relational inductive biases, deep learning, and graph networks." arXiv:1806.01261, 2018.
35. M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), pp. 3844-3852, Curran Associates, Inc., 2016.
36. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph Attention Networks," International Conference on Learning Representations, 2018.
37. C. Selvi and E. Sivasankar, "A novel adaptive genetic neural network (AGNN) model for recommender systems using modified k-means clustering approach," Multimedia Tools and Applications, vol. 78, pp. 14303-14330, Jun 2019.
38. I. Henrion et al., "Neural message passing for jet physics." "Deep Learning for Physical Sciences" workshop at NIPS 2017, 2017.
39. M. Abdughani et al., "Probing stop with graph neural network at the LHC." 2018.
40. J. Arjona Martinez, O. Cerri, M. Pierini, M. Spiropulu, and J.-R. Vlimant, "Pileup mitigation at the Large Hadron Collider with Graph Neural Networks," Eur. Phys. J. Plus, vol. 134, p. 333, 2019. arXiv:1807.07988 [hep-ph].
41. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 1263-1272, JMLR.org, 2017.
42. Y. Wang et al., "Dynamic graph CNN for learning on point clouds." arXiv:1801.07829 [cs.CV], 2018.
43. M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
44. S. Agostinelli et al., "GEANT4: A Simulation toolkit," Nucl. Instrum. Meth., vol. A506, 2003.
45. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448-456, 2015.
46. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
47. L. N. Smith and N. Topin, "Super-convergence: Very fast training of residual networks using large learning rates." arXiv:1708.07120 [cs.LG], 2017.