
Page 1: Random Forest Photometric Redshift Estimation

Random Forest Photometric Redshift Estimation

Samuel Carliles¹
Tamas Budavari², Sebastien Heinis², Carey Priebe³, Alex Szalay²

Johns Hopkins University
¹Dept. of Computer Science
²Dept. of Physics & Astronomy
³Dept. of Applied Mathematics & Statistics

Page 2: Random Forest Photometric Redshift Estimation

Photometric Redshifts

You know what they are. I did it on SDSS DR6 colors:

$z_{\rm spec} = f(u-g,\, g-r,\, r-i,\, i-z)$

$\hat{z}_{\rm phot} = \hat{f}(u-g,\, g-r,\, r-i,\, i-z)$

$\epsilon = \hat{z}_{\rm phot} - z_{\rm spec}$

I did it with Random Forests.

Page 3: Random Forest Photometric Redshift Estimation

Regression Trees

» A binary tree
» It partitions input training data into clusters of similar objects
» Each new test object is matched with the cluster to which it is “closest” in the input space
» The output value is the mean of the output values of training objects in its cluster (see the prediction sketch below)
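A minimal sketch of that prediction step, assuming a hypothetical node structure (a leaf holds its cluster's mean output; an internal node holds a split dimension, a split point, and two children). This is illustrative, not the internals of any particular library:

```r
# Walk a test object down the tree to its leaf; return that cluster's mean.
predict_tree <- function(node, x) {
  if (node$is_leaf) {
    return(node$mean)                 # mean output of the matched cluster
  }
  if (x[node$dim] <= node$split) {
    predict_tree(node$left, x)        # object falls left of the split
  } else {
    predict_tree(node$right, x)       # object falls right of the split
  }
}

# Tiny hand-built tree: split on dimension 1 at 0.5
tree <- list(is_leaf = FALSE, dim = 1, split = 0.5,
             left  = list(is_leaf = TRUE, mean = 0.1),
             right = list(is_leaf = TRUE, mean = 0.3))
predict_tree(tree, c(0.7, 0.2))       # -> 0.3
```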

Page 4: Random Forest Photometric Redshift Estimation

Building a Regression Tree

» Starting at the root node, choose a dimension on which to split
» Choose the point which “best” distinguishes clusters in that dimension
» Points left go in the left child; points right go in the right child
» Repeat the process in each child node until all objects are in their own leaf node (a build sketch follows the figure placeholder below)

[Figure: example tree with successive split points x1, x2, x3]
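A recursive build in that spirit might look like the following sketch. `best_split` is a hypothetical helper standing in for the dimension/point selection described on the next two slides (one possible version is sketched after slide 6); it assumes continuous inputs with distinct values so the recursion terminates:

```r
# Sketch of the recursive build (illustrative, not library internals).
build_tree <- function(X, y) {
  if (nrow(X) <= 1) {                 # each object ends in its own leaf node
    return(list(is_leaf = TRUE, mean = mean(y)))
  }
  s <- best_split(X)                  # choose dimension and split point
  go_left <- X[, s$dim] <= s$point    # points left vs. right of the split
  list(is_leaf = FALSE, dim = s$dim, split = s$point,
       left  = build_tree(X[go_left, , drop = FALSE],  y[go_left]),
       right = build_tree(X[!go_left, , drop = FALSE], y[!go_left]))
}
```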

Page 5: Random Forest Photometric Redshift Estimation

How Do You Choose the Dimension and Split Point?

» The best split point in a dimension is the one which minimizes resubstitution error in that dimension
» The best dimension is the one with the lowest best resubstitution error

Page 6: Random Forest Photometric Redshift Estimation

What’s Resubstitution Error?

• For a candidate split point, there are points left and points right:

$$E = \frac{1}{N_L}\sum_{x \in L}\left(x - \bar{x}_L\right)^2 + \frac{1}{N_R}\sum_{x \in R}\left(x - \bar{x}_R\right)^2$$

• That’s the resubstitution error

• Minimize it (see the sketch below)
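A sketch of this criterion and of the selection rule from the previous slide. All names are illustrative, it follows the slide's formula literally (spread measured in the split dimension itself), and it assumes at least two distinct values per dimension:

```r
# Resubstitution error of candidate split point s in one dimension x,
# per the formula above.
resub_error <- function(x, s) {
  xl <- x[x <= s]                     # points left of the split
  xr <- x[x >  s]                     # points right of the split
  sum((xl - mean(xl))^2) / length(xl) +
    sum((xr - mean(xr))^2) / length(xr)
}

# Best split point in one dimension: minimize resubstitution error.
best_point <- function(x) {
  cand <- head(sort(unique(x)), -1)   # keep both children non-empty
  errs <- vapply(cand, function(s) resub_error(x, s), numeric(1))
  list(point = cand[which.min(errs)], error = min(errs))
}

# Best dimension: the one whose best point has the lowest error.
best_split <- function(X) {
  per_dim <- lapply(seq_len(ncol(X)), function(j) best_point(X[, j]))
  j <- which.min(vapply(per_dim, `[[`, numeric(1), "error"))
  list(dim = j, point = per_dim[[j]]$point)
}
```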

Page 7: Random Forest Photometric Redshift Estimation

Randomizing a Regression Tree

» Train it on a bootstrap sample: a sample of N objects chosen uniformly at random with replacement from the complete training set
» Instead of choosing the best dimension to split on, choose the best from among a random subset of input dimensions (both randomizations are sketched below)
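The two randomizations in isolation, as a sketch with stand-in data (four columns matching the four colors; the `mtry` value uses the common p/3 regression default, an assumption rather than anything stated on the slide):

```r
set.seed(1)
X <- matrix(rnorm(400 * 4), ncol = 4)          # stand-in inputs (4 colors)
n <- nrow(X)

boot   <- sample(n, size = n, replace = TRUE)  # uniform, with replacement
X_boot <- X[boot, , drop = FALSE]              # bootstrap training sample

mtry <- max(1, floor(ncol(X) / 3))             # assumed p/3 regression default
dims <- sample(ncol(X), mtry)                  # dims considered at one split
```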

Page 8: Random Forest Photometric Redshift Estimation

Random Forest

• An ensemble of “randomized” Regression Trees

• Ensemble estimate is the mean of individual tree estimates

• This gives a distribution of iid estimation errors

• Central Limit Theorem gives the distribution of their mean

• Their mean is exactly $\hat{z}_{\rm phot} - z_{\rm spec}$

• That means we have the error distribution for that object! (sketched below)
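A sketch of how that plays out for a single object, under the slide's iid assumption; all numbers are stand-ins:

```r
set.seed(1)
tree_preds <- rnorm(100, mean = 0.12, sd = 0.03)  # stand-in per-tree estimates
z_phot <- mean(tree_preds)                        # ensemble estimate
se_hat <- sd(tree_preds) / sqrt(length(tree_preds))  # CLT spread of that mean
```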

Page 9: Random Forest Photometric Redshift Estimation

Implemented in R

◊ More training data -> better estimates
◊ Forests converge pretty quickly in forest size
◊ Training set size, input space constrained by memory in R implementation (a usage sketch follows)
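For reference, a minimal call to the R randomForest package on toy stand-in data (the column names and generated values are illustrative, not the SDSS training set):

```r
library(randomForest)   # Breiman–Cutler Random Forests for R

set.seed(1)
# Toy stand-in for SDSS DR6 colors; names and values are illustrative.
train <- data.frame(ug = rnorm(500), gr = rnorm(500),
                    ri = rnorm(500), iz = rnorm(500))
train$zspec <- 0.1 + 0.05 * train$ug + rnorm(500, sd = 0.02)

rf    <- randomForest(zspec ~ ug + gr + ri + iz, data = train, ntree = 100)
zphot <- predict(rf, newdata = train[1:5, ])   # ensemble-mean estimates
```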

Page 10: Random Forest Photometric Redshift Estimation

Results

RMS error = 0.023
Training set size = 80,000

Page 11: Random Forest Photometric Redshift Estimation

Error Distribution

[Figure: standardized error distribution compared with the standard normal]

Since we know the error distribution* for each object, we can standardize them, and the results should be standard normal over all test objects. Like in this plot! :)

If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly) A sketch of the check follows below.
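A sketch of that calibration check, with simulated stand-ins for the per-object sigmas and errors:

```r
set.seed(1)
sigma <- runif(5000, 0.01, 0.04)   # stand-in per-object error sigmas
err   <- rnorm(5000, sd = sigma)   # stand-in z_phot - z_spec errors
std   <- err / sigma               # standardized errors

hist(std, breaks = 40, freq = FALSE, main = "Standardized errors")
curve(dnorm(x), add = TRUE)        # should track the N(0,1) density
```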

Page 12: Random Forest Photometric Redshift Estimation

Summary

Random Forest estimates come with Gaussian error distributions

0.023 RMS error is competitive with other methodologies

This makes Random Forests good

Page 13: Random Forest Photometric Redshift Estimation

Future Work

The CRLB (Cramér–Rao lower bound) says bigger N gives better estimates from the same estimator

80,000 objects is good, but we have way more than that available

Random Forests in R are extremely memory- (and therefore time-) inefficient, I believe due to the FORTRAN implementation

So I’m writing a C# implementation