Upload
joshua-bloom
View
510
Download
1
Tags:
Embed Size (px)
Citation preview
Computational Training for Domain Scientists & Data Literacy
Joshua BloomMoore Founda1on Data-‐Driven Inves1gator
UC Berkeley, Astronomy
@pro%sb
AAAS, San Jose Sunday Feb 15, 2015
“knowledge and understanding of scientific concepts and processes required for personal decision making, participation in civic and cultural affairs, and economic productivity"
Scientific Literacy
National Science Education Standards Report, National Academies of Science 1996
Data Literacy, as a broad new educational imperative
Data Literacy“knowledge and understanding of analytical concepts and processes required to draw fair conclusions from data that impact personal, professional, and civic lives”
now @ AAAS
http://boingboing.net/2013/01/01/correlation-between-autism-dia.html
via reddit/r/skeptic
Our ML framework found the Nearest Supernova in 3
Decades ..‣ Built & Deployed robust, real-‐?me machine learning framework, discovering >10,000 events in > 10 TB of imaging → 50+ journal ar?cles
‣ Built Probabilis?c Event classifica?on catalogs with innova?ve ac?ve learning
hPp://?medomain.org hPps://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
What is the toolbox of the modern (data-driven) scientist?
domain training
statistics
advanced computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce
And...How do we teach this with what little time the students have?
What is the toolbox of the modern (data-driven) scientist?
Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series
BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...]
parallel programming bootcamp
...and entire degree programs
Taught by CS/Stats
Aimed at Engineers & Programmers Heading Toward Industry
2010: 85 campers 2012a: 135 campers
Python Bootcamps at Berkeley
2012b: 210 campers
Python Bootcamps at Berkeley
2013a: 253 campers
‣ 3 days of live/archive streamed lectures ‣ all open material in GitHub ‣ widely disseminated (e.g., @ NASA)
http://pythonbootcamp.info
Part of the Designated Emphasis in Computational Science & Engineering at Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical computing
interfacing to other languagesBayesian inference & MCMC
hardware control
parallelism
Python for Data Science @ Berkeley [Sept 2013]
64%
36%female male
8%4%
8%
12%
4%12% 8%
16%
16%
12%PsychologyAstronomyNeuroscienceBiostatisticsPhysicsChemical EngineeringISchoolEarth and Planetary SciencesIndustrial EngineeringMechanical Engineering
“Parallel Image Reconstruction from Radio Interferometry Data”
“Graph Theory Analysis of Growing Graphs”
http://mb3152.github.io/Graph-Growth/
“Realtime Prediction of Activity Behavior from Smartphone”
“Bus Arrival Time Prediction in Spain”
Time domain preprocessing
- Start with raw photometry!
- Gaussian process detrending!
- Calibration!
- Petigura & Marcy 2012!
!Transit search
- Matched filter!
- Similar to BLS algorithm (Kovcas+ 2002)!
- Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)!
!Data validation
- Significant peaks in periodogram, but inconsistent with exoplanet transit
TERRA – optimized for small planets
Detrended/calibrated photometry
TERRA
Raw
Flu
x (p
pt)
Cal
ibra
ted
Flux
Erik Petigura Berkeley Astro Grad Student
Petigura, Howard, & Marcy (2013)
Prevalence of Earth-size planets orbiting Sun-like starsErik A. Petiguraa,b,1, Andrew W. Howardb, and Geoffrey W. Marcya
aAstronomy Department, University of California, Berkeley, CA 94720; and bInstitute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare loomsas a touchstone in the question of life in the universe. We searchedfor Earth-size planets that cross in front of their host stars byexamining the brightness measurements of 42,000 stars fromNational Aeronautics and Space Administration’s Kepler mission.We found 603 planets, including 10 that are Earth size (1−2 R⊕)and receive comparable levels of stellar energy to that of Earth(0:25− 4 F⊕). We account for Kepler’s imperfect detectability ofsuch planets by injecting synthetic planet–caused dimmings intothe Kepler brightness measurements and recording the fractiondetected. We find that 11 ± 4% of Sun-like stars harbor an Earth-size planet receiving between one and four times the stellar inten-sity as Earth. We also find that the occurrence of Earth-size planets isconstant with increasing orbital period (P), within equal intervals oflogP up to∼200 d. Extrapolating, one finds 5:7+1:7
−2:2% of Sun-like starsharbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s)Kepler mission was launched in 2009 to search for planets
that transit (cross in front of) their host stars (1–4). The resultingdimming of the host stars is detectable by measuring their bright-ness, and Kepler monitored the brightness of 150,000 stars every30 min for 4 y. To date, this exoplanet survey has detected morethan 3,000 planet candidates (4).The most easily detectable planets in the Kepler survey are
those that are relatively large and orbit close to their host stars,especially those stars having lower intrinsic brightness fluctua-tions (noise). These large, close-in worlds dominate the list ofknown exoplanets. However, the Kepler brightness measurementscan be analyzed and debiased to reveal the diversity of planets,including smaller ones, in our Milky Way Galaxy (5–7). Theseprevious studies showed that small planets approaching Earthsize are the most common, but only for planets orbiting close totheir host stars. Here, we extend the planet survey to Kepler’smost important domain: Earth-size planets orbiting far enoughfrom Sun-like stars to receive a similar intensity of light energyas Earth.
Planet SurveyWe performed an independent search of Kepler photometry fortransiting planets with the goal of measuring the underlying oc-currence distribution of planets as a function of orbital period,P, and planet radius, RP. We restricted our survey to a set of Sun-like stars (GK type) that are the most amenable to the detectionof Earth-size planets. We define GK-type stars as those with sur-face temperatures Teff = 4,100–6,100 K and gravities logg = 4.0–4.9(logg is the base 10 logarithm of a star’s surface gravity measured incm s−2) (8). Our search for planets was further restricted to thebrightest Sun-like stars observed by Kepler (Kp = 10–15 mag). These42,557 stars (Best42k) have the lowest photometric noise, makingthem amenable to the detection of Earth-size planets. Whena planet crosses in front of its star, it causes a fractional dimmingthat is proportional to the fraction of the stellar disk blocked,δF = ðRP=RpÞ2, where Rp is the radius of the star. As viewed bya distant observer, the Earth dims the Sun by ∼100 parts permillion (ppm) lasting 12 h every 365 d.
We searched for transiting planets in Kepler brightness mea-surements using our custom-built TERRA software packagedescribed in previous works (6, 9) and in SI Appendix. In brief,TERRA conditions Kepler photometry in the time domain, re-moving outliers, long timescale variability (>10 d), and systematicerrors common to a large number of stars. TERRA then searchesfor transit signals by evaluating the signal-to-noise ratio (SNR) ofprospective transits over a finely spaced 3D grid of orbital period,P, time of transit, t0, and transit duration, ΔT. This grid-basedsearch extends over the orbital period range of 0.5–400 d.TERRA produced a list of “threshold crossing events” (TCEs)
that meet the key criterion of a photometric dimming SNR ratioSNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri-terion, many of which are inconsistent with the periodic dimmingprofile from a true transiting planet. Further vetting was performedby automatically assessing which light curves were consistent withtheoretical models of transiting planets (10). We also visuallyinspected each TCE light curve, retaining only those exhibiting aconsistent, periodic, box-shaped dimming, and rejecting thosecaused by single epoch outliers, correlated noise, and other dataanomalies. The vetting process was applied homogeneously to allTCEs and is described in further detail in SI Appendix.To assess our vetting accuracy, we evaluated the 235 Kepler
objects of interest (KOIs) among Best42k stars having P > 50 d,which had been found by the Kepler Project and identified as planetcandidates in the official Exoplanet Archive (exoplanetarchive.ipac.caltech.edu; accessed 19 September 2013). Among them, wefound four whose light curves are not consistent with beingplanets. These four KOIs (364.01, 2,224.02, 2,311.01, and 2,474.01)have long periods and small radii (SI Appendix). This exercisesuggests that our vetting process is robust and that careful scrutinyof the light curves of small planets in long period orbits is useful toidentify false positives.
Significance
A major question is whether planets suitable for biochemistryare common or rare in the universe. Small rocky planets withliquid water enjoy key ingredients for biology. We used theNational Aeronautics and Space Administration Kepler tele-scope to survey 42,000 Sun-like stars for periodic dimmingsthat occur when a planet crosses in front of its host star. Wefound 603 planets, 10 of which are Earth size and orbit in thehabitable zone, where conditions permit surface liquid water.We measured the detectability of these planets by injectingsynthetic planet-caused dimmings into Kepler brightness mea-surements. We find that 22% of Sun-like stars harbor Earth-sizeplanets orbiting in their habitable zones. The nearest such planetmay be within 12 light-years.
Author contributions: E.A.P., A.W.H., and G.W.M. designed research, performed research,analyzed data, and wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
Data deposition: The Kepler photometry is available at the Milkulski Archive for SpaceTelescopes (archive.stsci.edu). All spectra are available to the public on the CommunityFollow-up Program website (cfop.ipac.caltech.edu).1To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1319909110/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1319909110 PNAS | November 26, 2013 | vol. 110 | no. 48 | 19273–19278
AST
RONOMY
Bootcamp/Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
Dangers of a Cursory Training/Teaching Curriculum
34 Fitting a straight line to data
0 50 100 150 200 250 300x
0
100
200
300
400
500
600
700
y
y = 1.33 x + 164
0 50 100 150 200 250 300x
0
100
200
300
400
500
600
700
y
0.0
1.0
1.0
1.0
0.0
0.0
0.00.00.0
0.00.0
0.0
0.00.0
0.0
0.0
0.0
0.0
0.0
0.0
Figure 10.— Partial solution to Exercise 14: On the left, the same as Figure 9but including the outlier points. On the right, the same as in Figure 4but applying the outlier (mixture) model to the case of two-dimensionaluncertainties.
0 50 100 150 200 250 300x
0
100
200
300
400
500
600
700
y
forward � � � y = (2.24 ± 0.11) x + (34 ± 18)reverse � · � y = (2.64 ± 0.12) x + (�50 ± 21)
Figure 11.— Partial solution to Exercise 15: Results of “forward and reverse”fitting. Don’t ever do this.
Data analysis recipes:
Fitting a model to data
⇤
David W. HoggCenter for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut fur Astronomie, Heidelberg
Jo BovyCenter for Cosmology and Particle Physics, Department of Physics, New York University
Dustin LangDepartment of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a modelto data, using as an example the fit of a straight line to a set of pointsin a two-dimensional plane. Standard weighted least-squares fittingis only appropriate when there is a dimension along which the datapoints have negligible uncertainties, and another along which all theuncertainties can be described by Gaussians of known variance; theseconditions are rarely met in practice. We consider cases of general,heterogeneous, and arbitrarily covariant two-dimensional uncertain-ties, and situations in which there are bad data (large outliers), un-known uncertainties, and unknown but expected intrinsic scatter inthe linear relationship being fit. Above all we emphasize the impor-tance of having a “generative model” for the data, even an approx-imate one. Once there is a generative model, the subsequent fittingis non-arbitrary because the model permits direct computation of thelikelihood of the parameters or the posterior probability distribution.Construction of a posterior probability distribution is indispensible ifthere are “nuisance parameters” to marginalize away.
It is conventional to begin any scientific document with an introductionthat explains why the subject matter is important. Let us break with tra-dition and observe that in almost all cases in which scientists fit a straightline to their data, they are doing something that is simultaneously wrong
and unnecessary. It is wrong because circumstances in which a set of two
⇤The notes begin on page 39, including the license1 and the acknowledgements2.
1
arX
iv:1
008.
4686
v1 [
astro
-ph.
IM]
27 A
ug 2
010
Dataanalysisrecipes:
Fittingamodeltodata
⇤
Dav
idW
.H
oggCenter
forCosm
ologyan
dParticle
Physics,
Dep
artmentof
Physics,
New
York
University
Max
-Plan
ck-In
stitutfurAstron
omie,
Heid
elberg
JoB
ovy
Center
forCosm
ologyan
dParticle
Physics,
Dep
artmentof
Physics,
New
York
University
Dustin
Lan
gDep
artmentof
Com
puter
Scien
ce,University
ofToron
to
Prin
cetonUniversity
Observatory
Abstra
ct
We
go
thro
ugh
the
many
consid
eratio
ns
involv
edin
fittin
ga
model
todata
,usin
gas
an
exam
ple
the
fit
ofa
straig
ht
line
toa
setofpoin
tsin
atw
o-d
imen
sional
pla
ne.
Sta
ndard
weig
hted
least-sq
uares
fittin
gis
only
appro
pria
tew
hen
there
isa
dim
ensio
nalo
ng
which
the
data
poin
tshav
eneg
ligib
leuncerta
inties,
and
anoth
eralo
ng
which
all
the
uncerta
inties
can
be
describ
edby
Gaussia
ns
ofknow
nva
riance;
these
conditio
ns
are
rarely
met
inpra
ctice.W
eco
nsid
erca
sesof
gen
eral,
hetero
gen
eous,
and
arb
itrarily
covaria
nt
two-d
imen
sional
uncerta
in-
ties,and
situatio
ns
inw
hich
there
are
bad
data
(larg
eoutliers),
un-
know
nuncerta
inties,
and
unknow
nbut
expected
intrin
sicsca
tterin
the
linea
rrela
tionsh
ipbein
gfit.
Abov
eall
we
emphasize
the
impor-
tance
of
hav
ing
a“gen
erativ
em
odel”
for
the
data
,ev
enan
approx
-im
ate
one.
Once
there
isa
gen
erativ
em
odel,
the
subseq
uen
tfittin
gis
non-a
rbitra
rybeca
use
the
model
perm
itsdirect
com
puta
tion
ofth
elik
elihood
ofth
epara
meters
or
the
posterio
rpro
bability
distrib
utio
n.
Constru
ction
ofa
posterio
rpro
bability
distrib
utio
nis
indisp
ensib
leif
there
are
“nuisa
nce
para
meters”
tom
arg
inalize
away.
Itis
conven
tional
tobegin
any
scientifi
cdocu
men
tw
ithan
intro
duction
that
explain
sw
hy
the
subject
matter
isim
portan
t.Let
us
break
with
tra-dition
and
observe
that
inalm
ostall
casesin
which
scientists
fit
astraigh
tlin
eto
their
data,
they
aredoin
gsom
ethin
gth
atis
simultan
eously
wrong
andunnecessary.
Itis
wron
gbecau
secircu
mstan
cesin
which
aset
oftw
o
⇤The
notes
beg
inon
page
39,in
cludin
gth
elicen
se1
and
the
ack
now
ledgem
ents
2.
1
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Sta7s7cal Inference
Data Literacy in the Undergraduate & Graduate Training Mission
Versioning & Reproducibility
“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
-‐ Git has emerged as the de facto versioning tool -‐ Berkeley Common Environment (BCE) So]ware Stack -‐ “Reproducible and Collabora?ve Sta?s?cal Data Science” (Sta?s?cs 157: P. Stark)
Data Literacy in the Undergraduate & Graduate Training Mission
Undergraduate Curriculum
• Data Science Task Force report generated at Chancellor/Provost behest • Recommenda?ons:
– A founda?onal lower-‐division course accessible to everyone (sta?s?cal & computa?onal thinking), with broad applicability
– A suite of courses that advance the use of data sciences within the broad swath of disciplines available to Berkeley undergraduates
– A high-‐quality minor for students who wish to couple data sciences with their major discipline
– A data sciences major for those students who want this to be their primary area of study
pg. 12; Announced Feb 11
Chancellor’s Report for 2015
Grand challenges at the intersec7on of: • Open science and reproducible data analysis • Responses of coupled human-‐natural systems to rapid environmental change (water resources, land use planning, agriculture, etc.)
• Development of evidence-‐based solu?ons in public policy, resource management and environmental design
• Five schools and colleges: LeFers & Science, Natural Resources, Engineering, Public Policy and Environmental Design
• Two year graduate training program; Master’s and PhD students • Four cohorts of up to 20 students each, first cohort in fall 2015
Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded in December 2013 as a result of a year+ long na?onal selec?on process $37.8M over 5 years, along with University of Washington & NYU
‣ An accelerator for data-‐driven discovery ‣ An agent of change in the modern university as Data Science takes hold ‣ An incubator for the next genera?on of Data Science technology and prac?ce
Leadership from across the spectrum
Joshua Bloom, Professor, Astronomy; Director, Center for Time Domain Informa?cs Henry Brady, Dean, Goldman School of Public Policy
Cathryn Carson, Associate Dean, Social Sciences; Ac?ng Director of Social Sciences Data Laboratory "D-‐Lab” David Culler, Chair, EECS
Michael Franklin, Professor; EECS, Co-‐Director, AMP Lab
Erik Mitchell, Associate University Librarian
Faculty Lead/PI: Saul PerlmuFer, Physics, Berkeley Center for Cosmological Physics
Fernando Perez, Researcher, Henry H. Wheeler Jr. Brain Imaging Center
Jasjeet Sekhon, Professor, Poli?cal Science and Sta?s?cs; Center for Causal Inference and Program Evalua?on Jamie Sethian, Professor, Mathema?cs
Kimmen Sjölander, Professor, Bioengineering, Plant and Microbial Biology
Philip Stark, Chair, Sta?s?cs
Ion Stoica, Professor, EECS; Co-‐Director, AMP Lab
“I love working with astronomers, since their data is worthless.”
-‐ Jim Gray, Microso'
Created by Natalia Bilenko.Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/
Established CS/Stats/Math in Service of novelty in domain science
vs.
Novelty in domain science driving & informing novelty in CS/Stats/Math
“novelty2 problem” an extra Burden for Forefront Scien?sts
hPps://medium.com/tech-‐talk/dd88857f662
ỉπ vs.
21st Century Educational Mission: Produce Gamma-shaped people
deep domain skill/knowledge/training deep methodological knowledge/skill
deep domain or methodological skill/knowledge/training strong methodological or domain knowledge/skill
Goal: empower teams of gamma’s to excel
SummaryData science at Berkeley is thriving and BIDS is its intellectual home -‐ incuba?ng novel science & methodologies -‐ teaching & training -‐ innovate environments, interac?ons, & networks
Teaching impera7ves emerging around data literacy -‐ new campus-‐wide undergraduate curriculum (star?ng with a founda?on for all 1st years) -‐ new mul?-‐year, mul?-‐disciplinary graduate program
@pro%sb
Thank you.
PyData, May 4. 2014