48
Using R in Astronomy Getting started quickly with R Juan Pablo Torres- Papaqui Departamento de Astronomía Universidad de Guanajuato DA-UG, México [email protected] División de Ciencias Naturales y Exactas, Campus Guanajuato, Sede Valenciana

Using R in Astronomy

  • Upload
    chun

  • View
    38

  • Download
    0

Embed Size (px)

DESCRIPTION

Using R in Astronomy. Getting started quickly with R. Juan Pablo Torres-Papaqui. Departamento de Astronomía Universidad de Guanajuato DA-UG, México. [email protected] División de Ciencias Naturales y Exactas, Campus Guanajuato, Sede Valenciana. Why R? - PowerPoint PPT Presentation

Citation preview

Page 1: Using R in Astronomy

Using R in Astronomy

Getting started quickly with R

Juan Pablo Torres-Papaqui

Departamento de AstronomíaUniversidad de Guanajuato

DA-UG, México

[email protected]ón de Ciencias Naturales y Exactas,

Campus Guanajuato, Sede Valenciana

Page 2: Using R in Astronomy

Why R?

R is a free language and environment for statistical computing and graphics. It is superior to other graphics/analysis packages commonly used in Astronomy, e.g. supermongo, pgplot, gnuplot, and features a very wide range of statistical analysis and data plotting functions.

Furthermore, R is already the most popular amongst the leading software for statistical analysis, as measured by a variety of indicators, and is rapidly growing in influence.

“R is really important to the point that it's hard to overvalue it. It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.” Google research scientist, quoted in the New York Times.

Page 3: Using R in Astronomy

Key features

It's a mature, widely used (around 2-3 million users) and well-supported free and open source software project; it's committed to a twice-yearly update schedule (it runs on GNU/linux, unix, MacOS and Windows)Excellent graphics capabilitieVector arithmetic, plus many built-in basic & advanced statistical and numerical analysis toolsIntelligent handling of missing data values (denoted by “NA”)Highly extensible, with around 3000 user-contributed packages availableIt's easy to use and has excellent online help and associated documentation

Page 4: Using R in Astronomy

Download

The R Project for Statistical Computing

http://www.r-project.org/

Page 5: Using R in Astronomy

Getting started

The online help pages can be accessed as follows:

help.start() # Load HTML help pages into browserhelp(package) # List help page for "package"?package # Shortform for "help(package)"help.search("keyword") # Search help pages for "keyword"?help # For more optionshelp(package=base) # List tasks in package "base"q() # Quit R

Page 6: Using R in Astronomy

Getting started

Some extra packages are not loaded by default, but can be loaded as follows:

library("package")library() # List available packages to loadlibrary(help="package") # list package contents

Page 7: Using R in Astronomy

Getting started

Commands can be separated by a semicolon (";'') or a newline

All characters after "#'' are treated as comments (even on the same line as commands)

Variables are assigned using "<-'' (although "='' also works):

x <- 4.5y <- 2*x^2 # Use "<<-'' for global assignment

Page 8: Using R in Astronomy

Vectors

x <- c(1, 4, 5, 8) # Concatenate values & assign to xy <- 2*x^2 # Evaluate expression for each element in x> y # Typing the name of a variable evaluates it[1] 2 32 50 128

Page 9: Using R in Astronomy

If x is a vector, y will be a vector of the same length. The vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) will be evaluated piecewise for corresponding elements, without the need to loop explicitly over the values:

> x+y[1] 3 36 55 136

> 1:10 # Same as seq(1, 10) [1] 1 2 3 4 5 6 7 8 9 10

> a <- 1:10a[-1] # Print whole vector *except* the first elementa[-c(1, 5)] # Print whole vector except elements 1 & 5a[1:5] # Print elements 1 to 5a[length(a)] # Print last element, for any length of vectorhead(a, n=3) # Print the first n elements of atail(a, n=3) # Print the last n elements of a

Page 10: Using R in Astronomy

Demonstrations of certain topics are available: demo(graphics) # Run grapics demonstrationdemo() # List topics for which demos are available?demo # List help for "demo" function

Page 11: Using R in Astronomy

Typing the name of a function lists its contents; typing a variable evaluates it, or you can use print(x).

A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control:

First, generate a set of 10,000 Gaussian distributed random numbers:

data <- rnorm(1e4) # Gaussian distributed numbers with mean=0 & sigma=1hist(data) # Plots a histogram, with sensible bin choices by default

See ?hist for the full range of control parameters available, e.g. to specifiy 7 bins:

hist(data, breaks=7)

Page 12: Using R in Astronomy

Data Input/Output

Page 13: Using R in Astronomy

Assuming you have a file (“file.dat”) containing the following data, with either spaces or tabs between fields:

r x y1 4.2 14.22 2.4 64.83 8.76 63.44 5.9 32.25 3.4 89.6

setwd("C:/Users/juanchito/Documents/Astronomy School/") #only for windows users

Page 14: Using R in Astronomy

Now read this file into R:

> inp <- read.table("file.dat", header=T) # Read data into data frame> inp # Print contents of inp> inp[0] # Print the type of object that inp is:NULL data frame with 5 rowsor data frame with 0 columns and 5 rows

> colnames(inp) # Print the column names for inp:[1] "r" "x" "y" # This is simply a vector that can easily be changed:

> colnames(inp) <- c("radius", "temperature", "time")> colnames(inp)[1] "radius" "temperature" "time"

Page 15: Using R in Astronomy

Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:

> inp$r # Print the Radius column

but note what happens if the information is ambiguous or if the column doesn't exist:

> inp$t # Could match "inp$temperature" or "inp$time"NULL> inp$wibble # "wibble" column doesn't existNULL

Alternatively, you can refer to the columns by number:

> inp[[1]] # Print data as a vector (use "[[" & "]]")[1] 1 2 3 4 5

> inp[1] # Print data as a data.frame (only use "[" and "]")

Page 16: Using R in Astronomy

Writing data out to a file (if no filename is specified (the default), the output is written to the console)

> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t")> write.table(inp, quote=F, row.names=F, col.names=T) radius temperature time1 4.2 14.22 2.4 64.83 8.76 63.44 5.9 32.25 3.4 89.6

Page 17: Using R in Astronomy

Saving data

?savesave(inp, t, file="data.RData")load(file="data.RData") # read back in the saved data:# - this will load the saved R objects into the current session, with# the same properties, i.e. all the variable will have the same names# and contents

Note that when you quit R (by typing q()), it asks if you want to save the workspace image, if you specify yes (y), it writes out two files to the current directory, called .RData and .Rhistory. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at).

At any time you can save the history of commands using:

savehistory(file="my.Rhistory")loadhistory(file="my.Rhistory") # load history file into current R session

Page 18: Using R in Astronomy

Plotting data

Page 19: Using R in Astronomy

Useful functions:

?plot # Help page for plot command?par # Help page for graphics parameter control?Devices # or "?device"# (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats)

dev.list() # list graphics devices????colours() # or "colors()" List all available colours?plotmath # Help page for writing maths expressions in R

Tip: to generate png, jpeg etc. files, I find it's best to create pdf versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the pdf file into gimp), otherwise the figures are "blocky" (i.e. suffer from aliasing).

Page 20: Using R in Astronomy

To create an output file copy of a plot for printing or including in a document etc.

dev.copy2pdf(file="myplot.pdf") # } Copy contents of current graphics devicedev.copy2eps(file="myplot.eps") # } to a PDF or Postscript file

Page 21: Using R in Astronomy

Adding more datasets to a plot:

x <- 1:10; y <- x^2z <- 0.9*x^2plot(x, y) # Plot original datapoints(x, z, pch="+") # Add new data, with different symbolslines(x,y) # Add a solid line for original datalines(x, z, col="red", lty=2) # Add a red dashed line for new datacurve(1.1*x^2, add=T, lty=3, col="blue") # Plot a function as a curvetext(2, 60, "An annotation") # Write text on plotabline(h=50) # Add a horizontal lineabline(v=3, col="orange") # Add a vertical line

Page 22: Using R in Astronomy

Error bars

source("errorbar.R") # source code# Create some data to plot:

x <- seq(1,10); y <- x^2xlo <- 0.9*x; xup <- 1.08*x; ylo <- 0.85*y; yup <- 1.2*y

Plot graph, but don't show the points (type="n") & plot errors:

plot(x, y, log="xy", type="n")errorbar(x, xlo, xup, y, ylo, yup)

?plot gives more info on plotting options (as does ?par). To use different line colours, styles, thicknesses & errorbar types for different points:

errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5, lwd=1:3)

Page 23: Using R in Astronomy

Plotting a function or equation

curve(x^2, from=1, to=100) # Plot a function over a given rangecurve(x^2, from=1, to=100, log="xy") # With log-log axes

Plot chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom

dof <- 10curve(pchisq(x*dof,df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1dof <- 100curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared", ylab="Chi-sq distribution function")abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1

Page 24: Using R in Astronomy

Define & plot custom function:

myfun <- function(x) x^3 + log(x)*sin(x)

#--Plot dotted (lty=3) red line, with 3x normal line width (lwd=3):curve(myfun, from=1, to=10, lty=3, col="red", lwd=3)

#--Add a legend, inset slightly from top left of plot:legend("topleft", inset=0.05, "My function", lty=3, lwd=3, col="red")

Plotting maths symbols:

demo(plotmath) # Shows a demonstration of many examples?plotmath # Manual page with more information

Page 25: Using R in Astronomy

Analysis

Page 26: Using R in Astronomy

Basic stuff

x <- rnorm(100) # Create 100 Gaussian-distributed random numbershist(x) # Plot histogram of xsummary(x) # Some basic statistical infod <- density(x) # Compute kernel density estimated # Print info on density analysis ("?density" for details)plot(d) # Plot density curve for xplot(density(x)) # Plot density curve directly

Plot histogram in terms of probability & also overlay density plot:

hist(x, probability=T)lines(density(x), col="red") # overlay smoothed density curve

#--Standard stuff:mean(x); median(x); max(x); min(x); sum(x); weighted.mean(x)sd(x) # Standard deviationvar(x) # Variancemad(x) # median absolute deviation (robust sd)

Page 27: Using R in Astronomy

Testing the robustness of statisical estimators

?replicate # Repeatedly evaluate an expression

Calculate standard deviation of 100 Gaussian-distributed random numbers (NB default sd is 1; can specify with rnorm(100, sd=3) or whatever)

sd(rnorm(100))

Now, let's run N=5 Monte Carlo simulations of the above test:

replicate(5, sd(rnorm(100)))

and then compute the mean of the N=5 values:

mean(replicate(5, sd(rnorm(100))))

Page 28: Using R in Astronomy

Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:

mean(replicate(1e4, sd(rnorm(100))))

With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:

mean(replicate(1e4, sd(rnorm(10)))) # almost a 3% bias lowmean(replicate(1e4, sd(rnorm(3)))) # >10% bias low

Page 29: Using R in Astronomy

You can see the bias more clearly with a histogram or density plot:

a <- replicate(1e4, sd(rnorm(3)))hist(a, breaks=100, probability=T) # Specify lots of binslines(density(a), col="red") # Overlay kernel density estimateabline(v=1, col="blue", lwd=3) # Mark true value with thick blue line

This bias is important to consider when esimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimator mad():

mean(replicate(1e4, mad(rnorm(10)))) # ~9% bias lowmean(replicate(1e4, mad(rnorm(3)))) # >30% bias low

Page 30: Using R in Astronomy

However, in the presence of outliers, mad() is robust compared to sd() as seen with the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:

outlier <- function(N,mean=0,sd=1) { a <- rnorm(N, mean=mean, sd=sd) a[1] <- mean+5*sd # Replace 1st value with 5 sigma outlier return(a)}

Now compare the performance of mad() vs. sd():

mean(replicate(1e4, mad(outlier(10))))mean(replicate(1e4, sd(outlier(10))))

Page 31: Using R in Astronomy

You can also compare the median vs. Mean:

mean(replicate(1e4, median(outlier(10))))mean(replicate(1e4, mean(outlier(10))))

...and without the outlier:

mean(replicate(1e4, median(rnorm(10))))mean(replicate(1e4, mean(rnorm(10))))

Page 32: Using R in Astronomy

Kolmogorov-Smirnov test, taken from manual page (?ks.test)

x <- rnorm(50) # 50 Gaussian random numbersy <- runif(30) # 30 uniform random numbers

# Do x and y come from the same distribution?ks.test(x, y)

Page 33: Using R in Astronomy

Correlation tests

Load in some data:

dfile <- "table1.txt"A <- read.table(dfile, header=T, sep="|")cor.test(A$z, A$Tx) # Specify X & Y data separatelycor.test(~ z + Tx, data=A) # Alternative format when using a data frame ("A")cor.test(A$z, A$Tx, method="spearman") # } Alternative methodscor.test(A$z, A$Tx, method="kendall") # }

plot(A$z,A$Tx)

Page 34: Using R in Astronomy

Spline interpolation of data

x <- 1:20y <- jitter(x^2, factor=20) # Add some noise to vectorsf <- splinefun(x, y) # Perform cubic spline interpolation

# - note that "sf" is a function:sf(1.5) # Evaluate spline function at x=1.5plot(x,y); curve(sf(x), add=T) # Plot data points & spline curve

Page 35: Using R in Astronomy

Scatter plot smoothing

#--Using above data frame ("A"):plot(Tx ~ z, data=A)

#--Return (X & Y values of locally-weighted polynomial regressionlowess(A$z, A$Tx)

#--Plot smoothed data as line on graph:lines(lowess(A$z, A$Tx), col="red")

Page 36: Using R in Astronomy

Regression

dfile <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt" A <- read.table(dfile, header=T, sep="|")names(A) # Print column names in data framem <- lm(Tx ~ z, data=A) # Fit linear modelsummary(m) # Show best-fit parameters & errors

Page 37: Using R in Astronomy

Plot data & best-fit model:

plot(Tx ~ z, A)abline(m, col="blue") # add best-fit straight line model

Plot residuals:

plot(A$z, resid(m))abline(h=0, col="red")

Page 38: Using R in Astronomy

Functions

Page 39: Using R in Astronomy

Basics

cat # Type function name without brackets to list contentsargs(cat) # Return arguments of any functionbody(cat) # Return main body of function

Create your own function

> fun <- function(x, a, b, c) (a*x^2) + (b*x^2) + c> fun(3, 1, 2, 3)[1] 30

> fun(5, 1, 2, 3)[1] 78

Page 40: Using R in Astronomy

A more complicated example of a function. First, create some data:

set.seed(123) # allow reproducible random numbersa <- rnorm(1000, mean=10, sd=1) # 1000 Gaussian random numbersb <- rnorm(100, mean=50, sd=15) # smaller population of higher numbersx <- c(a, b) # Combine datasetshist(x) # Shows outlier population clearlysd(x) # Strongly biased by outliersmad(x) # Robustly estimates sd of main samplemean(x) # biasedmedian(x) # robust

Page 41: Using R in Astronomy

Now create a simple sigma clipping function (you can paste the following straight into R):

sigma.clip <- function(vec, nclip=3, N.max=5) { mean <- mean(vec); sigma <- sd(vec) clip.lo <- mean - (nclip*sigma) clip.up <- mean + (nclip*sigma) vec <- vec[vec < clip.up & vec > clip.lo] # Remove outliers if ( N.max > 0 ) { N.max <- N.max - 1 # Note the use of recursion here (i.e. the function calls itself): vec <- Recall(vec, nclip=nclip, N.max=N.max) } return(vec)}

Now apply the function to the test dataset:

new <- sigma.clip(x)hist(new) # Only the main population remains!length(new) # } Compare numbers oflength(x) # } values before & after clipping

Page 42: Using R in Astronomy

Factors

Page 43: Using R in Astronomy

Factors are a vector type which treats the values as character strings which form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).

Basics

a <- rpois(100, 3) # Create 100 Poisson distributed random numbers (lambda=3)hist(a) # Plot frequency histogram of dataa[0] # List type of vector ("numeric")b <- as.factor(a) # Change type to factor

List type of vector b ("factor"): also lists the "levels", i.e. all the different types of value:

b[0]

Now, plot b, to get a barchart showing the frequency of entries in each level:

plot(b)table(b) # a tabular summary of the same information

Page 44: Using R in Astronomy

A more practical example. Read in table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:

file <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt"A <- read.table(file, header=T, sep="|")

By default, “read.table()” will treat non-numeric values as factors. Sometimes this is annoying, in which case use “as.is=T”; you can specify the type of all the input columns explicitly -- see “?read.table”.

A[1,] # Print 1st row of data frame (will also show column names):plot(A$det) # Plot the numbers of each detector typeplot(A$cctype) # Plot the numbers of cool-core and non-cool core clusters

The “det” column is the detector used for the observation of each cluster: S for ACIS-S and I for ACIS-I. The “cctype” denotes the cool core type: “CC” for cool core, “NCC” for non-cool core.

Page 45: Using R in Astronomy

Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):

plot(cctype ~ det, data=A, xlab="ACIS Detector", ylab="Cool-core type")

You can easily explore the properties of the different categories, by plotting a factor against a numeric variable. The result is to show a boxplot (see “?boxplot”) of the numeric variable for each catagory. For example, compare the mean temperature of cool-core & non-cool core clusters:

plot(Tx ~ cctype, data=A)

and compare the redshifts of clusters observed with ACIS-S & ACIS-I:

plot(z ~ det, data=A)

You can define your own factor categories by partitioning a numeric vector of data into discrete bins using “cut()”, as shown here:

Page 46: Using R in Astronomy

Converting a numeric vector into factors

Split cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create new column of data frame with the bin assignments (A$z.bin):

A$z.bin <- cut(A$z, 2)A$z.bin # Show levels & bin assignmentsplot(A$z.bin) # Show bar chart

Plot mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous - an example of the common selection effect known as Malmquist bias):

A$Tx.bin <- cut(A$Tx, 2)plot(z ~ Tx.bin, data=A)plot(Tx ~ z, data=A) # More clearly seen!

Page 47: Using R in Astronomy

Check if redshifts of (non) cool-core clusters differ:

plot(cctype ~ z.bin, data=A) # Equally mixed

Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):

colour <- A$cctype

Now all you need to do is change the names of the levels:

levels(colour) # Show existing levelslevels(colour) <- c("blue", "red") # specify blue=CC, red=NCCcolour[0] # Show data type (& levels, since it's is a factor)

Page 48: Using R in Astronomy

Now change the type of the vector to character (i.e. not a factor):

col <- as.character(colour)col[0] # Confirm change of data typecol # } Print values andcolour # } note the difference

Now plot mean temperature vs. redshift, colour coded by CC type:

plot(Tx ~ z, A, col=col)

Why not add a plot legend:

legend(x="topleft", legend=c(levels(A$cctype)), text.col=levels(colour))

Now for something a little bit fancier:

plot(Tx ~ z, A, col=col, xlab="Redshift", ylab="Temperature (keV)")legend(x="topleft", c(levels(A$cctype)), col=levels(colour),text.col=levels(colour), inset=0.02, pch=1)