Identification of Jets Containing b-Hadrons with Recurrent Neural Networks at the ATLAS Experiment

ATL-PHYS-PUB-2017-003

Dan Guest
ATLAS Collaboration

UC Irvine

May 9, 2017


https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-PHYS-PUB-2017-003/

Why b-tag? (an oversimplification)

I Want Higgs?

I Higgs mostly decays to b-quarks

I b-quarks make jets

I At the LHC, everything makes jets

I Not everything makes b-jets

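[Figure: the Standard Model particle table, giving mass, charge, and spin for each particle.]

Group         Particle                 Mass             Charge  Spin
Quarks        up (u)                   ≈2.3 MeV/c²      2/3     1/2
Quarks        down (d)                 ≈4.8 MeV/c²      -1/3    1/2
Quarks        charm (c)                ≈1.275 GeV/c²    2/3     1/2
Quarks        strange (s)              ≈95 MeV/c²       -1/3    1/2
Quarks        top (t)                  ≈173.07 GeV/c²   2/3     1/2
Quarks        bottom (b)               ≈4.18 GeV/c²     -1/3    1/2
Leptons       electron (e)             0.511 MeV/c²     -1      1/2
Leptons       muon (μ)                 105.7 MeV/c²     -1      1/2
Leptons       tau (τ)                  1.777 GeV/c²     -1      1/2
Leptons       electron neutrino (νe)   <2.2 eV/c²       0       1/2
Leptons       muon neutrino (νμ)       <0.17 MeV/c²     0       1/2
Leptons       tau neutrino (ντ)        <15.5 MeV/c²     0       1/2
Gauge bosons  gluon (g)                0                0       1
Gauge bosons  photon (γ)               0                0       1
Gauge bosons  W boson (W)              80.4 GeV/c²      ±1      1
Gauge bosons  Z boson (Z)              91.2 GeV/c²      0       1
Higgs         Higgs boson (H)          ≈126 GeV/c²      0       0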

The Standard Model (as Seen by Collider Physics)

I Some are stable

I Many unstable

I Some form jets

I Some metastable

I Neutrinos → ET^miss

I Short-lived particles are a big part of what we measure!


The b-hadron Decay Chain

[Figure: b-hadron decay cascade, with labels PV (primary vertex), B, D, and K.]

I b-hadrons decay through a cascade

I βγcτ ≈ 6.4 mm for B with pT = 70 GeV

I But many decay distances are O(detector resolution)
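As a rough cross-check of the 6.4 mm figure, the mean transverse flight distance is βγcτ = (pT/m)·cτ. The sketch below assumes a B-hadron mass of about 5.28 GeV and cτ of about 0.48 mm; both are approximate values that do not appear on the slide, and the function name is purely illustrative.

```python
def mean_flight_distance_mm(pt_gev, mass_gev=5.28, ctau_mm=0.48):
    """Mean transverse flight distance (pT / m) * c*tau, in mm.

    mass_gev and ctau_mm are assumed, approximate B-hadron values.
    """
    beta_gamma = pt_gev / mass_gev  # transverse boost, pT / m
    return beta_gamma * ctau_mm


distance = mean_flight_distance_mm(70.0)
print(f"{distance:.1f} mm")
```

With these assumed inputs the result rounds to the value quoted above. The same function at pT = 20 GeV gives a distance of roughly 2 mm, which shows why low-pT decays sit closer to the detector resolution.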


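Reconstructing Secondary Vertices: The ATLAS approaches

[Figure: two vertexing approaches. Single SV: one secondary vertex (SV) displaced from the primary vertex (PV). JetFitter: vertices along a flight line from the PV.]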

I Many discriminants come from vertices, combine them with ML


The problem with SV tagging

I Sometimes we don’t find a vertex

I Requires cutting on track-vertex compatibility

I Cuts always lose information

I Tuned “by hand”

I Experiment-specific

[Figure: number of JetFitter vertices with ≥ 2 tracks (0 to 5), in arbitrary units, for b jets, c jets, and light-flavour jets. ATLAS Simulation Preliminary, √s = 13 TeV, tt̄.]

I There is no FastJet for vertex reconstruction


Impact parameter (IP) tagging

I Take all tracks in a jet

I Apply some selection

I Extrapolate to perigee

I Per-track discriminants:
  I Sd0 ≡ d0/σd0
  I Sz0 ≡ z0/σz0
  I track “quality”

[Figure: impact parameter (IP) of a track relative to the primary vertex (PV).]

I Compute per-track likelihood Lf(track) with f ∈ {b, c, light}
  I Per-jet likelihood pf = ∏trk Lf(track variables)

I IP-based tagging is the problem we solve with RNNs
  I More on this later
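The per-jet product of per-track likelihoods can be sketched as follows. This is an illustration only: the per-track values are invented, the function names are hypothetical, and turning pb and plight into a log-likelihood ratio is one common choice of discriminant rather than something stated on this slide.

```python
import math

FLAVOURS = ("b", "c", "light")


def jet_log_likelihoods(track_likelihoods):
    """Sum per-track log-likelihoods into a per-jet log p_f for each flavour.

    track_likelihoods: list of dicts mapping flavour -> L_f(track).
    Summing logs is the product over tracks, without numerical underflow.
    """
    return {
        f: sum(math.log(trk[f]) for trk in track_likelihoods)
        for f in FLAVOURS
    }


def log_likelihood_ratio(track_likelihoods, signal="b", background="light"):
    """ln(p_signal / p_background) for one jet."""
    log_p = jet_log_likelihoods(track_likelihoods)
    return log_p[signal] - log_p[background]


# Made-up per-track likelihood values, for illustration only.
tracks = [
    {"b": 0.30, "c": 0.20, "light": 0.05},
    {"b": 0.10, "c": 0.10, "light": 0.10},
    {"b": 0.20, "c": 0.10, "light": 0.02},
]
llr = log_likelihood_ratio(tracks)
print(round(llr, 3))
```

Because the product treats each track independently, any correlation between tracks in the same jet is invisible to this construction.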


Putting it all together

Low-Level

I IP: track-based variables

I Likelihood: gives pb, pc, plight

I SV: gives vertex variables

I JetFitter: similar to SVx

High-level

I MV2: combine with BDT

I It’s easy to focus on the high-level tagger (MV2), but upstream is important too


IP3D: ATLAS’s IP Tagger

I Need to define Lf(track)
  I Lf(Sd0, Sz0, category)
  I Sd0 shown right

I Use histograms from simulation

I 3D binning scheme:
  I 35 bins in Sd0
  I 20 bins in Sz0
  I 14 track categories

I Track category represents quality of track

[Figure: track signed d0 significance for the “Good” track category (−20 to 40), in arbitrary units on a log scale, for b jets, c jets, and light-flavour jets. √s = 13 TeV, tt̄.]
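A histogram-based Lf(Sd0, Sz0, category) amounts to a bin lookup in a 3D template. The sketch below shows only the mechanics: the bin counts (35, 20, 14) come from the slide, while the axis ranges, the uniform bin widths, the clamping of out-of-range values, and the flat placeholder template are all assumptions made for illustration.

```python
from bisect import bisect_right


def make_edges(low, high, n_bins):
    """Uniform bin edges (n_bins + 1 values) between low and high."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]


# Assumed axis ranges and uniform binning; only the bin counts are from the slide.
SD0_EDGES = make_edges(-20.0, 40.0, 35)
SZ0_EDGES = make_edges(-20.0, 40.0, 20)
N_CATEGORIES = 14


def find_bin(value, edges):
    """Index of the bin containing value, clamped to the first or last bin."""
    index = bisect_right(edges, value) - 1
    return min(max(index, 0), len(edges) - 2)


def lookup(template, sd0, sz0, category):
    """Return L_f(Sd0, Sz0, category) from a nested list template[cat][i][j]."""
    return template[category][find_bin(sd0, SD0_EDGES)][find_bin(sz0, SZ0_EDGES)]


# Flat placeholder template: every bin carries the same probability.
n_bins_total = 35 * 20 * N_CATEGORIES
flat = [[[1.0 / n_bins_total] * 20 for _ in range(35)] for _ in range(N_CATEGORIES)]
value = lookup(flat, sd0=3.2, sz0=-1.0, category=13)
print(value)
```

In practice one such template would be filled from simulation for each flavour hypothesis f.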


Improving Upstream Taggers: What IP3D misses

I Relations among tracks:
  I relation to neighbor bins
  I relation to neighbor tracks

I These are important (see right)

I New (SV inspired) track variables:
  I pT^frac ≡ pT^track / pT^jet
  I ∆R(track, jet)
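The two new per-track variables can be computed as below. The slide does not spell out ∆R, so this sketch assumes the usual collider definition, ∆R = sqrt(∆η² + ∆φ²) with ∆φ wrapped into [−π, π); the function names are illustrative.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(d_eta^2 + d_phi^2) between two directions."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def pt_fraction(track_pt, jet_pt):
    """Fraction of the jet pT carried by one track."""
    return track_pt / jet_pt


# A track and a jet on opposite sides of the phi = +-pi boundary.
print(delta_r(0.1, 3.0, 0.0, -3.0))
print(pt_fraction(7.0, 70.0))
```

The wrap-around matters: without it, a track at φ = 3.0 and a jet at φ = −3.0 would appear to be 6 radians apart rather than about 0.28.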

Curse of Dimensionality

I Already 29,400 bins

I New variable → ∼10× bins (and events to “train”)

[Figure: two panels showing subleading Sd0 versus leading Sd0 (both −20 to 60), one for b-jets and one for light-jets, pT > 20 GeV, |η| < 2.5. √s = 13 TeV, tt̄.]
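The bin count quoted above can be reproduced with simple arithmetic. The sketch assumes that 29,400 is the 35 × 20 × 14 grid counted once per flavour hypothesis (b, c, light), and takes the "∼10× per new variable" growth at face value; neither assumption is stated explicitly on the slide.

```python
N_SD0_BINS = 35
N_SZ0_BINS = 20
N_CATEGORIES = 14
N_FLAVOURS = 3  # b, c, light: assumed to be the factor that gives 29,400

bins_per_flavour = N_SD0_BINS * N_SZ0_BINS * N_CATEGORIES
total_bins = bins_per_flavour * N_FLAVOURS


def bins_after_adding(n_new_variables, bins_per_new_variable=10, start=total_bins):
    """Total bin count after adding variables with ~10 bins each."""
    return start * bins_per_new_variable ** n_new_variables


for n in range(3):
    print(n, bins_after_adding(n))
```

Adding both pT^frac and ∆R to a binned likelihood would, under these assumptions, push the template to millions of bins, each of which needs enough simulated tracks to be populated.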


Recurrent Neural Networks (RNNs)

I RNNs can process a sequence of arbitrary length

I Output is a fixed-dimensional vector for each jet
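The two properties above can be seen in a toy recurrence. This is a minimal sketch with a plain tanh cell and random, untrained weights; it is not the network used in the note, whose architecture is not described on this slide, and the feature and hidden sizes are arbitrary.

```python
import math
import random


def rnn_final_state(sequence, w_in, w_rec, bias):
    """Run a plain tanh recurrence over a sequence; return the last hidden state.

    sequence: list of per-track feature lists, of any length.
    w_in:  hidden x features input weights.
    w_rec: hidden x hidden recurrent weights.
    """
    hidden = [0.0] * len(bias)
    for features in sequence:
        # Each new hidden state depends on the current track and the previous state.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], features))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
                + bias[i]
            )
            for i in range(len(bias))
        ]
    return hidden


random.seed(0)
N_FEATURES, N_HIDDEN = 2, 4  # e.g. (Sd0, Sz0) per track; sizes are arbitrary
w_in = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_HIDDEN)]
w_rec = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_HIDDEN)]
bias = [0.0] * N_HIDDEN

short_jet = [[5.1, 0.3], [2.0, -0.4]]
long_jet = [[8.3, 1.2], [4.4, 0.1], [1.9, -0.7], [0.5, 0.2], [0.1, 0.0]]

out_short = rnn_final_state(short_jet, w_in, w_rec, bias)
out_long = rnn_final_state(long_jet, w_in, w_rec, bias)
print(len(out_short), len(out_long))
```

A jet with two tracks and a jet with five tracks both produce a vector of the same size, which a later layer can map to class probabilities.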



ROC Curves for a Multi-Background Discriminant

I Eventually we’ll combine with vertex-based approaches

I Conventional HEP discriminants are binary
  I Train against a mix of backgrounds (e.g. the MV2 background is 7% c-jets)

I We use 4 outputs:
  I pb: bottom jet
  I pc: charm jet
  I plight: “light” jet (u, d, s, g)
  I pτ: τ jet

I Combine everything for the sake of plots

\[ D_{\mathrm{RNN}} = \ln \frac{p_b}{f_c\,p_c + f_\tau\,p_\tau + (1 - f_c - f_\tau)\,p_{\mathrm{light}}} \qquad (1) \]

I The f weighting parameters can be adjusted post-training

I For this talk: fc = 0.07, fτ = 0
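Equation (1) translates directly into code. The sketch below uses the values quoted for this talk as defaults; the function name and the example probabilities are invented for illustration.

```python
import math


def d_rnn(p_b, p_c, p_light, p_tau, f_c=0.07, f_tau=0.0):
    """Discriminant of equation (1): ln(p_b / weighted background mix)."""
    denominator = f_c * p_c + f_tau * p_tau + (1.0 - f_c - f_tau) * p_light
    return math.log(p_b / denominator)


# Invented network outputs for one jet.
score = d_rnn(p_b=0.80, p_c=0.10, p_light=0.05, p_tau=0.05)
print(round(score, 3))
```

Since fc and fτ enter only at this stage, the same trained network can be re-weighted toward c-jet or τ-jet rejection without retraining.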


RNN Performance (compared to IP3D)

[Figure: light-jet rejection 1/εl and c-jet rejection 1/εc versus b-jet efficiency εb (0.6 to 1), for IP3D and for RNNIP trained on (Sd0, Sz0, category), on (Sd0, Sz0, category, pT^frac), and on (Sd0, Sz0, category, pT^frac, ∆R). √s = 13 TeV, tt̄, pT > 20 GeV, |η| < 2.5.]

I Lowest line is IP3D

I Next up: RNN with IP3D inputs

I Each new variable adds discrimination

I At the 70% working point:
  I RNN with IP3D inputs improves light rejection by a factor of 1.7
  I With ∆R(track, jet) and pT^frac, improves light rejection by a factor of 2.5


RNN Performance (compared to high-level tagger)


[Figure: light-jet rejection 1/εl and c-jet rejection 1/εc for MV2c10, RNNIP, IP3D, and SV1. √s = 13 TeV, tt̄, pT > 20 GeV, |η| < 2.5.]

I MV2 using IP3D still rejects more background for εb < 0.9

I But this uses JetFitter and SV → much more information

I RNN as input for MV2 is outside the scope of this talk
  I But we can imagine replacing IP3D with the RNN


RNN Performance by pT

[Figure: light-jet rejection 1/εl and c-jet rejection 1/εc versus b-jet pT [GeV] (0 to 500), at a flat 70% b-tagging working point, for MV2c10, RNNIP, IP3D, and SV1. √s = 13 TeV, tt̄, pT > 20 GeV, |η| < 2.5.]

I Cut on the discriminant such that εb = 0.7 in each pT bin

I Same trend as previous slide: rejection for IP3D < RNN < MV2

I RNN tagger is no more pT dependent than other taggers
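A flat-efficiency working point can be sketched as follows: pick the cut that keeps about 70% of b-jets, then count how often a background jet passes it. The scores below are invented toy values and the function names are illustrative; for the plots above, the same procedure would be repeated independently in each pT bin.

```python
def threshold_for_efficiency(signal_scores, efficiency):
    """Cut value such that roughly `efficiency` of signal jets score at or above it."""
    ranked = sorted(signal_scores, reverse=True)
    n_pass = max(1, round(efficiency * len(ranked)))
    return ranked[n_pass - 1]


def rejection(background_scores, cut):
    """Background rejection 1 / efficiency for jets with score >= cut."""
    n_pass = sum(score >= cut for score in background_scores)
    return float("inf") if n_pass == 0 else len(background_scores) / n_pass


# Toy discriminant scores, invented for illustration.
b_scores = [4.1, 3.7, 3.2, 2.9, 2.5, 2.2, 1.8, 1.1, 0.4, -0.3]
light_scores = [2.6, 1.0, 0.2, -0.5, -0.9, -1.4, -1.8, -2.2, -2.7, -3.1]

cut = threshold_for_efficiency(b_scores, 0.70)
light_rejection = rejection(light_scores, cut)
print(cut, light_rejection)
```

Fixing εb per bin removes the pT dependence of the signal efficiency, so any remaining trend in the rejection reflects the background alone.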


RNN output correlation with input: Sd0 and Sz0

[Figure: correlation ρ(DRNN, Sd0) and correlation ρ(DRNN, Sz0) versus the position i of the track in the sequence, for b-jets, c-jets, and light-jets. √s = 13 TeV, tt̄, pT > 20 GeV, |η| < 2.5.]

I DRNN output is highly correlated with jet Sd0 for “early” tracks in |Sd0| ordering

I Interesting, but maybe not surprising: b hadrons have ∼ 5 tracks

I Effect is less pronounced for Sz0
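The quantity plotted here is a correlation coefficient between the per-jet discriminant and one input of the i-th track. A minimal sketch, assuming a Pearson correlation and tracks already sorted by decreasing |Sd0| (neither detail is confirmed on the slide), with invented toy jets:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_position(jets, position):
    """rho(D, Sd0 of the track at `position`) over jets that have such a track.

    jets: list of (discriminant, [Sd0 of track 1, Sd0 of track 2, ...]) pairs,
    with tracks assumed to be ordered by decreasing |Sd0|.
    """
    pairs = [(d, sd0s[position]) for d, sd0s in jets if len(sd0s) > position]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


# Toy jets, invented for illustration.
jets = [
    (3.0, [9.0, 4.0, 1.0]),
    (1.0, [3.0, 2.5]),
    (-1.0, [1.0, 0.5, 0.2]),
    (2.0, [6.0, 1.0, 0.9]),
]
rho_first = correlation_by_position(jets, 0)
print(round(rho_first, 3))
```

Jets with fewer tracks than the requested position drop out of the average, so the sample shrinks for later positions.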


RNN output correlation with input: ∆R and pT^frac

[Figure: correlation ρ(DRNN, pT fraction) and correlation ρ(DRNN, ∆R), each on a scale from −0.2 to 0.5. √s = 13 TeV, tt̄, pT > 20 GeV, |η| < 2.5.]

I Much less correlation between DRNN and ∆R(track, jet) or pT^frac

I But these are useful discriminants nonetheless


Notes on Software

I We train with Keras
  I Use Theano backend

I Our reconstruction framework doesn’t support batched NumPy arrays

I Within our reconstruction, we evaluate with lwtnn
  I Used in ATLAS for top and W tagging
  I Also used by CMS for DeepFlavour

Help and other ideas welcome

I lwtnn is written “as needed”

I Is there a more sustainable approach?


https://github.com/fchollet/keras

http://deeplearning.net/software/theano/

https://github.com/lwtnn/lwtnn

Conclusions

I RNNs are a promising tool for flavor tagging
  I Use relatively low-level variables
  I Can augment vertex-based approaches

I Many interesting questions:
  I What other low-level variables could we include?
  I How does this complement a high-level tagger (e.g. MV2, DeepFlavour)?
  I How does this compare to the CMS approach?
  I Can we “understand” (visualize) what we’ve learned?

I Thanks for listening, ideas are welcome!


BACKUP



Thanks

I Michela Paganini and Jonathan Shlomi for the graphics

I Zihao Jiang, Michael Kagan, Michela, and the rest of the RNN team for training lots of networks

I The ATLAS flavor tagging group for a good problem

I ATLAS for all the simulation



IP3D Categories

Fractional contribution [%] of each track category

 #   b-jets   c-jets   light-jets   Category
 0    1.9      2.0      1.9         No hits in first two layers; expected hit in IBL and b-layer
 1    0.1      0.1      0.1         No hits in first two layers; expected hit in IBL and no expected hit in b-layer
 2    0.04     0.04     0.04        No hits in first two layers; no expected hit in IBL and expected hit in b-layer
 3    0.03     0.03     0.03        No hits in first two layers; no expected hit in IBL and b-layer
 4    2.4      2.3      2.1         No hit in IBL; expected hit in IBL
 5    1.0      1.0      0.9         No hit in IBL; no expected hit in IBL
 6    0.5      0.5      0.5         No hit in b-layer; expected hit in b-layer
 7    2.4      2.4      2.2         No hit in b-layer; no expected hit in b-layer
 8    0.01     0.01     0.03        Shared hit in both IBL and b-layer
 9    2.0      1.7      1.5         At least one shared pixel hit
10    3.2      3.0      2.7         Two or more shared SCT hits
11    1.0      0.87     0.6         Split hits in both IBL and b-layer
12    1.8      1.4      0.9         Split pixel hit
13   83.6     84.8     86.4         Good

I Fractions are based on simulated tt̄



Training

I Use 3.2 million jets from simulated tt̄

I Training time: with a CPU, a few days on a (busy) cluster

I We train only on the first 15 tracks (0.5% of jets have 15+ tracks)

Track Selection

I Jet Algorithm: Anti-kt, R = 0.4

I Track pT > 1 GeV

I |d0| < 1 mm, |z0 sin θ| < 1.5 mm

I n(Si hits) ≥ 7, n(Si holes) ≤ 2, n(pixel holes) ≤ 1
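The track selection and the 15-track limit can be sketched as a filter followed by a sort and a truncation. The cut values are those listed above; the dictionary keys are hypothetical names, and ordering by decreasing |Sd0| before truncating is an assumption based on the ordering mentioned in the correlation slides.

```python
def passes_selection(track):
    """Apply the track-level cuts listed on the slide."""
    return (
        track["pt_gev"] > 1.0
        and abs(track["d0_mm"]) < 1.0
        and abs(track["z0_sin_theta_mm"]) < 1.5
        and track["n_si_hits"] >= 7
        and track["n_si_holes"] <= 2
        and track["n_pixel_holes"] <= 1
    )


def select_tracks(tracks, max_tracks=15):
    """Filter tracks, order by decreasing |Sd0| (assumed), keep the first 15."""
    selected = [t for t in tracks if passes_selection(t)]
    selected.sort(key=lambda t: abs(t["sd0"]), reverse=True)
    return selected[:max_tracks]


# Invented tracks: 20 that pass every cut, plus one that fails the pT cut.
template = {
    "pt_gev": 2.0, "d0_mm": 0.1, "z0_sin_theta_mm": 0.2,
    "n_si_hits": 9, "n_si_holes": 0, "n_pixel_holes": 0, "sd0": 0.0,
}
tracks = [dict(template, sd0=float(i)) for i in range(20)]
tracks.append(dict(template, sd0=50.0, pt_gev=0.5))
kept = select_tracks(tracks)
print(len(kept), kept[0]["sd0"])
```

Truncating after the sort means the tracks that are dropped are the ones with the smallest |Sd0|, which carry the least lifetime information.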

