
Boosted Top Tagging with Deep Neural Networks

Jannicke Pearkes, University of British Columbia, Engineering Physics

Wojtek Fedorko, Alison Lister, Colin Gay
Inter-Experimental Machine Learning Workshop

March 22nd, 2017

Overview

• Introduction
• Method
  – Monte Carlo samples
  – Network architecture & training
• Results
  – Preprocessing
  – pT dependence
  – Pileup dependence
  – Learning what is being learnt
• Next Steps

Introduction

• Train a deep neural network to discriminate between jets originating from top quarks and those originating from QCD background

[Figure: top-quark decay topology. At low top pT the W and b decay products are resolved as separate jets; at high top pT the boosted decay products merge into a single large-radius jet. Image: Emily Thompson]

Monte Carlo Samples

• Signal: Z' → ttbar
• Background: dijet
• Generated with PYTHIA v8.219, NNPDF23 LO AS 0130 QED PDF
• DELPHES v3.4.0 using the default CMS card
• Jets clustered from DELPHES energy-flow objects
• Anti-kT jets selected with R = 1.0
• Trimming performed with the kT algorithm, R = 0.2, pT fraction = 5%
• Signal jets are selected where a truth top decays hadronically within ΔR = 0.75 of a large-radius jet
• Jets are required to have |η| ≤ 2.0
• Jets are subsampled to be flat in pT and signal-matched in η
• Jets with pT between 600 and 2500 GeV are used
• ~4 million signal jets and ~4 million background jets
• Sample divided 80% / 10% / 10% into training, validation, and testing sets

Examples of Jet Images

[Figure: six example jet images in the translated (pseudorapidity η, azimuthal angle) plane, color scale showing jet pT per pixel in GeV. Background jets with pT = 1370, 702, and 2376 GeV; signal jets with pT = 781, 1480, and 2358 GeV.]

Jet images are typically very sparse: roughly 5-10% pixel activation on average when using a 0.1 × 0.1 grid [1].

[1] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, "Jet-images -- deep learning edition", JHEP 07 (2016) 069, arXiv:1511.05190 [hep-ph].

Neural Network Inputs

• Use a sequence of jet constituents rather than an image
• Advantages:
  – No loss of information due to pixelization in an image
  – Inputs are more information-dense
• Using 120 constituents, the average activation is 30-50%
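Since the network described on the next slide is fully connected, the ordered constituent list has to be packed into a fixed-length vector. A minimal sketch of that packing, assuming (pT, η, φ) features per constituent and zero-padding; only the 120-constituent cap is from the slides:

```python
import numpy as np

def jet_to_input(constituents, n_max=120, n_features=3):
    """Pad or truncate the pT-ordered constituent list (rows of pT, eta, phi)
    to a fixed-length flat vector for a fully connected network."""
    arr = np.zeros((n_max, n_features))
    n = min(len(constituents), n_max)
    arr[:n] = np.asarray(constituents)[:n]
    return arr.flatten()
```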


Training and Network Architecture

• Implemented with Keras
• Initially planned on using an LSTM, but ended up using a fully connected network
• We found that performance between the LSTM and the fully connected network was very similar, but the deep networks were much faster to train (~10 times), which allowed for faster experimentation with preprocessing techniques and network architectures

Network type: Fully connected
Number of layers: 5, [300, 150, 50, 10, 5, 1]
Number of free parameters: 41,323
Activation function: Rectified linear units, sigmoid on output
Optimizer: Adam
Loss: Binary cross-entropy
Early stopping: Patience of 5
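A minimal Keras sketch of the quoted architecture. The input dimension is not stated on the slides (360 below assumes 120 constituents × 3 features and is a guess; with it, the parameter count will not reproduce the quoted 41,323), and the commented training call is illustrative only:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Layer widths from the table; input_dim is an assumption (see above).
model = Sequential()
model.add(Dense(300, activation='relu', input_dim=360))
model.add(Dense(150, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # top (1) vs. QCD (0)

model.compile(optimizer='adam', loss='binary_crossentropy')

# Early stopping on validation loss with patience 5, as quoted.
early_stop = EarlyStopping(monitor='val_loss', patience=5)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```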

Preprocessing

• Large-radius (R = 1.0) jets are trimmed using R = 0.2 subjets found with the kT algorithm and a pT fraction of 5%
• Order subjets by subjet pT, and jet constituents by constituent pT within each subjet
• We use only the 120 highest-pT jet constituents
• Perform preprocessing using domain knowledge about the physics at hand
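The subjet-then-constituent ordering can be sketched as a nested sort; the (subjet_pt, constituents) pair structure below is an assumed representation, not the authors' data format:

```python
def order_constituents(subjets):
    """Order subjets by descending subjet pT and, within each subjet, order
    constituents (rows of pT, eta, phi) by descending constituent pT."""
    ordered = []
    for _, constituents in sorted(subjets, key=lambda s: s[0], reverse=True):
        ordered.extend(sorted(constituents, key=lambda c: c[0], reverse=True))
    return ordered  # the first 120 entries are fed to the network
```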

No Preprocessing

[Figure: ROC curve, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV, trimming only.]

Trimming only: AUC = 0.83, Rε=50% = 8.85, Rε=80% = 3.36
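The quoted working points (AUC and background rejection R at a fixed signal efficiency ε) can be computed from tagger scores as sketched below; using scikit-learn here is an assumed tool choice, not something stated in the talk:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tagger_metrics(scores, labels, eff=0.50):
    """AUC and background rejection (1 / false-positive rate) at a given
    signal efficiency. labels: 1 for top jets, 0 for QCD jets.
    Assumes the false-positive rate is nonzero at the working point."""
    fpr, tpr, _ = roc_curve(labels, scores)
    rejection = 1.0 / np.interp(eff, tpr, fpr)  # tpr is monotonically increasing
    return auc(fpr, tpr), rejection
```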

Scale

• Scale pT of all jet constituents by a common factor to ensure that the constituent pT is approximately between 0 and 1
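A one-line version of this scaling; the common factor below (1/2500 GeV⁻¹, the top of the sample's pT range) is an assumption, since the slide only says the factor maps constituent pT roughly into [0, 1]:

```python
def scale_jet(constituents, factor=1.0 / 2500.0):
    """Scale every constituent pT (column 0 of an N x 3 array of
    pT, eta, phi rows) by a common factor."""
    out = constituents.copy()
    out[:, 0] *= factor
    return out
```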

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: trimming only vs. scaling.]

Scaling: AUC = 0.900, Rε=50% = 21.3, Rε=80% = 6.02

Translate

• Center the jet about the highest-pT subjet in the η, φ plane
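A sketch of the centering step, assuming constituents are stored as (pT, η, φ) rows and φ differences are wrapped into (−π, π]:

```python
import numpy as np

def translate_jet(constituents, subjet_eta, subjet_phi):
    """Center constituents on the leading subjet in the (eta, phi) plane."""
    out = constituents.copy()
    out[:, 1] -= subjet_eta
    dphi = out[:, 2] - subjet_phi
    out[:, 2] = np.mod(dphi + np.pi, 2.0 * np.pi) - np.pi  # wrap to (-pi, pi]
    return out
```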

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: trimming only, scaling, translation.]

Translation: AUC = 0.924, Rε=50% = 33.2, Rε=80% = 8.48

Rotate

• Designed the method of rotation to preserve jet mass
• Transform (pT, η, φ) into (px, py, pz)
• Rotate so that the second-highest-pT subjet is aligned with the negative y-axis
• Transform (px, py, pz) back to (pT, η, φ)
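After translation the leading subjet points along the x-axis, so a rotation about the x-axis leaves it fixed and, being a rotation in momentum space, preserves all invariant masses. A sketch under that reading of the slide (the (pT, η, φ) row layout is assumed):

```python
import numpy as np

def rotate_jet(constituents, subjet2):
    """Rotate all constituents about the x-axis so the second-highest-pT
    subjet lies along the negative y-axis. Inputs are (pT, eta, phi) rows."""
    def to_cartesian(pt, eta, phi):
        return pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)

    # Rotation angle that takes the second subjet's (py, pz) to (-r, 0).
    _, py2, pz2 = to_cartesian(*subjet2)
    beta = np.pi - np.arctan2(pz2, py2)

    rotated = []
    for pt, eta, phi in constituents:
        px, py, pz = to_cartesian(pt, eta, phi)
        py_r = py * np.cos(beta) - pz * np.sin(beta)
        pz_r = py * np.sin(beta) + pz * np.cos(beta)
        pt_r = np.hypot(px, py_r)
        rotated.append((pt_r, np.arcsinh(pz_r / pt_r), np.arctan2(py_r, px)))
    return np.array(rotated)
```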

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: trimming only, scaling, translation, rotation.]

Rotation: AUC = 0.932, Rε=50% = 42.3, Rε=80% = 9.57

Flip

• The third subjet is not constrained, but can be moved to the right half of the plane
• Flip the jet if the pT-weighted average position is in the left half of the plane
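A sketch of the flip, reflecting about the vertical axis when the pT-weighted centroid falls in the left half-plane; interpreting "left/right" as the sign of the translated-and-rotated horizontal coordinate is an assumption:

```python
import numpy as np

def flip_jet(constituents):
    """Reflect the jet (negate column 1, the horizontal coordinate of the
    pT, eta, phi rows) if the pT-weighted centroid is in the left half-plane."""
    pt, x = constituents[:, 0], constituents[:, 1]
    if np.average(x, weights=pt) < 0:
        constituents = constituents.copy()
        constituents[:, 1] *= -1.0
    return constituents
```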

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: trimming only, scaling, translation, rotation, flip.]

Flip: AUC = 0.933, Rε=50% = 44.3, Rε=80% = 9.75

Performance on Truth vs Reconstructed Jets

Performance after preprocessing

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: DNN and τ32 on truth jets and on reconstructed jets.]

Performance at 50% overall Signal Efficiency

[Figure: signal efficiency and background rejection vs. jet pT at a fixed 50% overall signal efficiency, shown separately for reconstructed jets and truth jets.]

Truth jets: AUC = 0.947, Rε=50% = 66, Rε=80% = 13
Reconstructed jets: AUC = 0.933, Rε=50% = 44, Rε=80% = 9.7

Pileup

Performance at different levels of pileup

[Figure: ROC curves, background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV: no pileup, pileup = 23, pileup = 50.]

Extremely stable performance with respect to pileup.

Performance at different levels of pileup

[Figure: signal efficiency and background rejection vs. jet pT for no pileup, pileup = 23, and pileup = 50.]

pT dependence is also stable with respect to pileup.

Learning what is being learnt

Jet Mass

[Figure: P(jet mass | DNN output) for background jets, and jet mass distributions for signal and background (flat pT distribution, 600 < jet pT < 2500 GeV).]


Next Steps

Short term:
• We plan to revisit LSTMs
• Thorough Bayesian hyper-parameter optimization

Longer term:
• Both top and W tagging with deep neural networks are now reasonably well-established on Monte Carlo
• "But does it work on data?"
• Start working towards evaluating the performance of these techniques on data
• Investigate the effects of systematics and strategies for mitigating their impact

Thank you!


W-tagging performance on truth

"QCD-Aware Recursive Neural Networks for Jet Physics", Louppe, Cho, Becot, Cranmer, https://arxiv.org/abs/1702.00748

Zooming

"Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks", Barnard, Dawe, Dolan, Rajcic, https://arxiv.org/pdf/1609.00607v2.pdf

Performance when trained and tested on different levels of pileup

[Figure: signal efficiency and background rejection vs. jet pT for networks trained on µ = 0, 23, and 50, each tested on µ = 0, 23, and 50 (all nine train/test combinations).]

• Examined how a neural network trained at one pileup level performs at another level of pileup
• The NN seems relatively robust to the changes in pileup expected at the LHC in the next few years

Jet Mass

[Figure: P(jet mass | DNN output) for background jets, and jet mass distributions for signal and background (flat pT distribution, 600 < jet pT < 2500 GeV).]

[Figure: τ32 distributions for signal and background (flat pT distribution, 600 < jet pT < 2500 GeV), and P(τ32^wta | DNN output) for background jets.]
