Invited Talk given at the NCRR P41 Director's meeting on October 12, 2010
Amazon Web Services: a platform for life science research
Deepak Singh, Ph.D., Amazon Web Services
NCRR P41 PI meeting, October 2010
the new reality
lots and lots and lots and lots and lots of data
lots and lots and lots and lots and lots of people
lots and lots and lots and lots and lots of places
constant change
science in a new reality
science in a new data reality
goal
optimize the most valuable resource
compute, storage, workflows, memory, transmission, algorithms, cost, …
people
Credit: Pieter Musterd, under a CC-BY-NC-ND license
enter the cloud
what is the cloud?
infrastructure
scalable
3,000 CPUs for one firm’s risk management application
[Chart: number of EC2 instances provisioned per day over one week, scaling from ~300 to ~3,000]
highly available
US East Region
Availability Zone A
Availability Zone B
Availability Zone C
Availability Zone D
durable
99.999999999%
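That eleven-nines figure is S3's design target for object durability. As a back-of-the-envelope illustration (assuming, for the sake of the sketch, that it means an annual per-object loss probability of 10^-11), the expected number of objects lost per year scales linearly with how many you store:

```python
def expected_annual_losses(num_objects, loss_probability=1e-11):
    """Expected objects lost per year, assuming an independent
    per-object annual loss probability (an illustrative model only)."""
    return num_objects * loss_probability

# storing one billion objects at eleven-nines durability
print(expected_annual_losses(10**9))  # about 0.01 objects per year
```

In other words, at a billion stored objects you would expect to lose roughly one object per century under this simplified model.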
dynamic
extensible
secure
a utility
on-demand instances
reserved instances
spot instances
infrastructure as code
class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  def public_dns
    @aws_hash[:dns_name] || ""
  end

  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end
include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos", "redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{apache2-prefork-dev libapr1-dev}.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end
import time

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction

# set your aws keys and S3 bucket, e.g. from environment or .boto
AWSKEY = ""
SECRETKEY = ""
S3_BUCKET = ""
NUM_INSTANCES = 1

conn = boto.connect_emr(AWSKEY, SECRETKEY)

bootstrap_step = BootstrapAction(
    "download.tst",
    "s3://elasticmapreduce/bootstrap-actions/download.sh",
    None)

step = StreamingStep(
    name='Wordcount',
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(
    name="testbootstrap",
    log_uri="s3://" + S3_BUCKET + "/logs",
    steps=[step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES)

print "finished spawning job (note: starting still takes time)"

state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state != u'COMPLETED':
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
Connect to Elastic MapReduce
Install packages
Set up mappers & reducers
job state
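The streaming step above wires a word-count mapper to EMR's built-in 'aggregate' reducer. The map/reduce logic itself can be sketched locally in plain Python (no Hadoop; the function names here are illustrative, not part of boto or EMR):

```python
from collections import defaultdict

def wordcount_map(line):
    # emit (word, 1) for each token, as the streaming mapper would
    for word in line.split():
        yield word, 1

def wordcount_reduce(pairs):
    # sum the counts per word, like EMR's built-in 'aggregate' reducer
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [kv for line in lines for kv in wordcount_map(line)]
print(wordcount_reduce(pairs))  # {'the': 2, 'quick': 1, ...}
```

On the cluster, the shuffle between map and reduce is what Hadoop streaming provides; everything else is exactly this pattern, distributed.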
a data science platform
dataspaces
Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
accept all data formats
evolve APIs
beyond the database and the data warehouse
move compute to the data
data is a royal garden
compute is a fungible commodity
“I terminate the instance and relaunch it. That’s my error handling”
Source: @jtimberman on Twitter
the cloud is an architectural and cultural fit for data science
amazon web services
your data science platform
s3://1000genomes
http://aws.amazon.com/publicdatasets/
Credit: Angel Pizzaro, U. Penn
mapreduce for genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
AWS knows scalable infrastructure
you know the science
we can make this work together
http://aws.amazon.com/education
http://aws.amazon.com/publicdatasets
[email protected]
Twitter: @mndoci
http://slideshare.net/mndoci
http://mndoci.com
Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig
Credit: Oberazzi, under a CC-BY-NC-SA license