Top Thinkshop-2 Nov. 10-12, 2000 Pushpa Bhat
Advanced Analysis Algorithms for Top Analysis
Pushpa Bhat
Fermilab
Top Thinkshop 2, Fermilab, IL, November 2000
A reasonable man adapts himself to the world. An unreasonable man persists in trying to adapt the world to himself. So, all progress depends on the unreasonable man.
- Bernard Shaw
What do we gain?
b-tag efficiency in Run I: DØ ~20%, CDF ~53%. But DØ was able to measure the top quark mass with a precision approaching that of CDF, by using multivariate techniques to separate signal and background while minimizing the correlation of the selection with the top quark mass.
Optimal Analysis Methods
The new generation of experiments will be far more demanding than the previous one in data handling at all stages. The time-honored procedure of choosing and applying cuts on one event variable at a time is rarely optimal! The measurements being multivariate, the optimal methods of analysis are necessarily multivariate:
- Discriminant Analysis: Partition the multidimensional variable space; identify boundaries between classes of objects
- Cluster Analysis: Assign objects to groups based on similarity
- Regression Analysis: Functional approximation/fitting
Data Analysis Tasks
- Particle Identification: e-ID, μ-ID, b-ID, τ-ID, q/g
- Signal/Background Event Classification: Signals of new physics are rare and small (finding a "jewel" in a haystack)
- Parameter Estimation: t mass, H mass, track parameters, for example
- Function Approximation: Correction functions, tag rates, fake rates
- Data Exploration: Data-driven extraction of information, latent structure analysis
Why Multivariate Methods?
Because they are optimal!
Example: a linear discriminant in two variables, D(x1, x2) = 2.014 x1 + 1.592 x2
[Figure: scatter plots in the (x1, x2) plane showing the discriminant boundary]
Optimal Event Selection

r(x) = p(x|s) p(s) / [ p(x|b) p(b) ]

defines decision boundaries that minimize the probability of misclassification.

So, the problem mathematically reduces to that of calculating r(x), the Bayes Discriminant Function, or the probability densities.

Posterior probability:

p(s|x) = p(x|s) p(s) / [ p(x|s) p(s) + p(x|b) p(b) ] = r / (1 + r)
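The Bayes discriminant r(x) and the posterior r/(1+r) on this slide can be sketched numerically. This is a toy illustration, assuming hypothetical 1-D Gaussian class-conditional densities (signal at +1, background at -1); the function names are ours, not from the talk.

```python
import math

def gauss(x, mu, sigma):
    """Normal probability density N(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_discriminant(x, p_s=0.5, p_b=0.5):
    """r(x) = p(x|s) p(s) / (p(x|b) p(b)) for toy 1-D Gaussian densities."""
    likelihood_s = gauss(x, mu=+1.0, sigma=1.0)  # assumed signal density
    likelihood_b = gauss(x, mu=-1.0, sigma=1.0)  # assumed background density
    return (likelihood_s * p_s) / (likelihood_b * p_b)

def posterior_signal(x, p_s=0.5, p_b=0.5):
    """p(s|x) = r / (1 + r)."""
    r = bayes_discriminant(x, p_s, p_b)
    return r / (1.0 + r)

# With equal priors and symmetric densities, x = 0 is the decision boundary:
print(posterior_signal(0.0))   # 0.5
print(posterior_signal(3.0))   # close to 1: clearly signal-like
```

Cutting on the posterior (or equivalently on r) at a fixed value traces out exactly the decision boundaries described above.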
Probability Density Estimators

Histogramming: the basic problem of non-parametric density estimation is very simple! Histogram the data in M bins in each of the d feature variables: M^d bins → Curse of Dimensionality. In high dimensions we would either require a huge number of data points, or most of the bins would be empty, leading to an estimated density of zero.

But the variables are generally correlated and hence tend to be restricted to a sub-space → Intrinsic Dimensionality.
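The M^d blow-up is easy to make concrete. A minimal sketch, with an assumed budget of 10 bins per variable and 100,000 events:

```python
# Number of histogram bins M^d for M bins per variable in d dimensions,
# versus a fixed budget of N data points (illustrative numbers, not from the talk).
M, N = 10, 100_000
for d in (1, 2, 5, 10):
    bins = M ** d
    print(f"d={d:2d}: {bins:>12d} bins, ~{N / bins:.4g} events per bin")
```

Already at d = 5 there is on average one event per bin; at d = 10 almost every bin is empty, which is the "estimated density of zero" problem above.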
Kernel-Based Methods

Akin to histogramming, but adopts importance sampling. Place in the d-dimensional space a hypercube of side h centered on each data point x_n:

p̃(x) = (1/N) Σ_{n=1}^{N} (1/h^d) H( (x − x_n)/h )

where N = number of data points, H(u) = 1 if x_n lies in the hypercube and 0 otherwise, and h = smoothing parameter.

The estimate will have discontinuities. These can be smoothed out using different forms for the kernel function H(u). A common choice is a multivariate Gaussian kernel:

p̃(x) = (1/N) Σ_{n=1}^{N} (1/(2πh²)^{d/2}) exp( −|x − x_n|² / 2h² )
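The Gaussian-kernel estimate above can be sketched in a few lines of NumPy. This is a toy check on a 1-D standard normal sample; the function name and the choice h = 0.3 are ours:

```python
import numpy as np

def kde_gaussian(x, data, h):
    """Gaussian kernel density estimate
    p~(x) = (1/N) sum_n (2*pi*h^2)^(-d/2) exp(-|x - x_n|^2 / (2 h^2)).
    x: point of shape (d,); data: array of shape (N, d); h: smoothing parameter.
    """
    N, d = data.shape
    sq_dist = np.sum((data - x) ** 2, axis=1)          # |x - x_n|^2 for all n
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2))) / norm

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=(5000, 1))          # 1-D standard normal
est = kde_gaussian(np.array([0.0]), sample, h=0.3)
print(est)  # near the true density 1/sqrt(2*pi) ~ 0.40, biased slightly low by smoothing
```

The smoothing parameter h plays the role of the bin width: too small and the estimate is spiky, too large and real structure is washed out.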
K Nearest-Neighbor Method

Place a hypersphere centered at each data point x and allow the radius to grow to a volume V until it contains K data points. Then the density at x is

p(x) = K / (N V)

where N = total number of data points.

If our data set contains N_k points in class C_k and N points in total, then

p(x|C_k) = K_k / (N_k V)

where K_k = number of points of class C_k in the volume V. Bayes' theorem then gives

p(C_k|x) = p(x|C_k) p(C_k) / p(x) = K_k / K
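The final result, p(C_k|x) ≈ K_k/K, says the class posterior is just the class fraction among the K nearest neighbors. A minimal sketch on toy 2-D signal/background clusters (the data and K = 25 are our assumptions):

```python
import numpy as np

def knn_posterior(x, data, labels, K):
    """p(C_k|x) ~= K_k / K: fraction of the K nearest neighbors in class k."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = labels[np.argsort(dist)[:K]]
    return {k: float(np.mean(nearest == k)) for k in np.unique(labels)}

rng = np.random.default_rng(1)
sig = rng.normal(+1.0, 1.0, size=(500, 2))   # toy signal cluster around (+1, +1)
bkg = rng.normal(-1.0, 1.0, size=(500, 2))   # toy background cluster around (-1, -1)
data = np.vstack([sig, bkg])
labels = np.array(["s"] * 500 + ["b"] * 500)

post = knn_posterior(np.array([1.0, 1.0]), data, labels, K=25)
print(post)  # dominated by "s" deep inside the signal cluster
```

Note that the volume V cancels in K_k/K, so no explicit density normalization is needed for classification.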
Discriminant Approximation with Neural Networks

The output of a feed-forward neural network can approximate the Bayesian posterior probability p(s|x,y) directly, without estimating the class-conditional probabilities:

n(x, y, w) ≈ p(s|x,y) = r / (1 + r)
Calculating the Discriminant

Consider the sum

E(w) = Σ_i [ n(x_i, y_i, w) − d_i ]²

where d_i = 1 for signal, 0 for background, and w = vector of parameters. Then

∂E/∂n = 0  ⇒  n(x, y, w) = p(s|x,y) = r / (1 + r)

in the limit of large data samples, and provided that the function n(x, y, w) is flexible enough.
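This result can be checked numerically: minimize the squared error above with 0/1 targets, and the fitted output approaches the true posterior. A toy check assuming 1-D Gaussian classes at ±1 with equal priors, where the exact posterior is sigmoid(2x), so a one-node "network" n(x) = sigmoid(w1 x + w0) should converge to w1 ≈ 2, w0 ≈ 0 (model, learning rate, and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.concatenate([rng.normal(+1, 1, 2000), rng.normal(-1, 1, 2000)])
ds = np.concatenate([np.ones(2000), np.zeros(2000)])  # d_i = 1 signal, 0 background

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Minimize E(w) = mean_i [n(x_i; w) - d_i]^2 by full-batch gradient descent
# for the minimal model n(x) = sigmoid(w1*x + w0).
w1, w0 = 0.0, 0.0
lr = 1.0
for _ in range(5000):
    n = sigmoid(w1 * xs + w0)
    grad_common = 2.0 * (n - ds) * n * (1.0 - n)   # dE/dn * dn/da
    w1 -= lr * np.mean(grad_common * xs)
    w0 -= lr * np.mean(grad_common)

# For N(+-1, 1) classes with equal priors the exact posterior is sigmoid(2x),
# so the fit should approach w1 ~ 2, w0 ~ 0.
print(w1, w0)
```

The network never sees the class-conditional densities; the 0/1 targets alone drive the output toward r/(1+r), exactly as the slide states.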
Neural Networks

A NN estimates a mapping function without requiring a mathematical description of how the output formally depends on the inputs. The "hidden" transformation functions g adapt themselves to the data as part of the training process. The number of such functions needs to grow only as the complexity of the problem grows.

D_NN = g( Σ_j w_j g( Σ_i w_ij x_i + θ_j ) + θ ),   with g(a) = 1 / (1 + e^{−a})

[Figure: feed-forward network with inputs x1…x4, one hidden layer, and output D_NN]
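The D_NN formula above is a plain forward pass and can be written directly. A minimal sketch with a toy 4-input, 2-hidden-node network and arbitrary illustrative weights (all names and sizes are our choices):

```python
import numpy as np

def sigmoid(a):
    """g(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def d_nn(x, w_hidden, theta_hidden, w_out, theta_out):
    """D_NN = g( sum_j w_j * g( sum_i w_ij x_i + theta_j ) + theta ).
    x: (n_in,); w_hidden: (n_hidden, n_in); w_out: (n_hidden,)."""
    hidden = sigmoid(w_hidden @ x + theta_hidden)   # inner sums over inputs i
    return sigmoid(w_out @ hidden + theta_out)      # outer sum over hidden nodes j

rng = np.random.default_rng(0)
out = d_nn(np.array([0.5, -1.0, 2.0, 0.0]),
           rng.normal(size=(2, 4)), rng.normal(size=2),
           rng.normal(size=2), 0.1)
print(out)  # a single number in (0, 1)
```

Because the output passes through g, D_NN always lies in (0, 1), which is what allows it to be interpreted as a posterior probability.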
Why are NN models powerful?
- Neural networks are universal approximators: with a sufficiently large NN, you can approximate a function to arbitrary accuracy
- Convergence of the approximation is rapid
- High dimensionality is not a curse any more!
- Model complexity can be controlled by regularization
- They extrapolate gracefully
Also, they need to have optimal flexibility/complexity.

Example: Mth-order polynomial fits to data drawn from h(x) = 0.5 + 0.4 sin(2πx):

M = 1 (simple), M = 3 (flexible), M = 10 (highly flexible)

[Figure: the three polynomial fits overlaid on the data]
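The under/overfitting pattern in this example is easy to reproduce. A minimal sketch, with our own choices of sample size, noise level, and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 15)
# Data drawn from h(x) = 0.5 + 0.4 sin(2*pi*x) plus a little noise:
y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, x.size)

rms = {}
for M in (1, 3, 10):
    coeffs = np.polyfit(x, y, deg=M)        # least-squares Mth-order fit
    resid = y - np.polyval(coeffs, x)
    rms[M] = float(np.sqrt(np.mean(resid ** 2)))
    print(f"M={M:2d}: training RMS = {rms[M]:.4f}")
```

Training error always falls as M grows, but M = 10 is chasing the noise; out of sample, the M = 3 fit would generalize best here, which is the point of the slide.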
The Golden Rule
Keep it simple. As simple as possible. Not any simpler.
- Einstein
Measuring the Top Quark Mass

The Discriminants

[Figure: distributions of the discriminant variables; shaded = top. DØ]
Measuring the Top Quark Mass

[Figure: background-rich and signal-rich regions in the discriminant plane]

DØ Lepton+jets: m_t = 173.3 ± 5.6 (stat.) ± 6.2 (syst.) GeV/c²
Strategy for Discovering the Higgs Boson at the Tevatron

P.C. Bhat, R. Gilmartin, H. Prosper, PRD 62 (2000); hep-ph/0001152
WH Results from NN Analysis, M_H = 100 GeV/c²

[Figure: NN distributions, WH vs Wbb]
WH (110 GeV/c²) NN Distributions
Results, Standard vs. NN
A good chance of discovery up to M_H = 130 GeV/c² with 20-30 fb⁻¹
Improving the Higgs Mass Resolution

Use m_jj and H_T (= Σ E_T^jets) to train NNs to predict the Higgs boson mass.

The resolution improves: 13.8% → 12.2%, 13.1% → 11.3%, 13% → 11%
Newer Approaches: Ensembles of Networks
- Committees of Networks: performance can be better than that of the best single network
- Stacks of Networks: control both bias and variance
- Mixtures of Experts: decompose complex problems
Bayesian Reasoning

The Bayesian approach provides a well-founded mathematical procedure for making straightforward and meaningful model comparisons. It also allows treatment of all uncertainties in a consistent manner.

Examples of useful applications:
- Fitting binned data to multi-source models, PLB 407 (1997) 73
- Extraction of the solar neutrino survival probability, PRL 81 (1998) 5056

Bayesian methods are mathematically linked to adaptive algorithms such as Neural Networks (NN). Hybrid methods involving NNs for probability density estimation with a Bayesian treatment can be very powerful.
Summary
Multivariate methods have already made an impact on discoveries and precision measurements, and will be the methods of choice in future analyses.
We have only scratched the surface in our use of advanced analysis algorithms.
Hybrid methods combining "intelligent" algorithms and probabilistic approaches will be the wave of the future!