Platforms for data science

Life science research, data platforms and cloud computing

Deepak Singh, Ph.D., Amazon Web Services

Data transmission for international genomics projects 2010

the new reality

lots and lots and lots and lots and lots of data

lots and lots and lots and lots and lots of people

lots and lots and lots and lots and lots of places

constant change

science in a new reality

data science in a new reality

data as a programmable resource

versioning

provenance capture

filter

aggregate

integrate

extend

mashup

automate
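A minimal sketch of what treating data as a programmable resource might look like, written in Ruby to match the deck's other examples. The `Dataset` class, the field names, and the sample records are all hypothetical, invented for illustration; the point is only that filter and aggregate steps can be chained while each step is recorded for provenance.

```ruby
# Hypothetical sketch: a dataset that records every operation applied to it.
class Dataset
  attr_reader :rows, :provenance

  def initialize(rows, provenance = ["load"])
    @rows = rows
    @provenance = provenance
  end

  # filter: keep only rows matching the block; returns a new Dataset
  # so the original stays untouched (a crude form of versioning).
  def filter(label, &block)
    Dataset.new(@rows.select(&block), @provenance + ["filter:#{label}"])
  end

  # aggregate: count rows per value of the given key.
  def count_by(key)
    @rows.each_with_object(Hash.new(0)) { |row, counts| counts[row[key]] += 1 }
  end
end

# Illustrative records only, not real data.
samples = Dataset.new([
  { id: "s1", platform: "A", passed_qc: true },
  { id: "s2", platform: "B", passed_qc: false },
  { id: "s3", platform: "A", passed_qc: true }
])

passed = samples.filter("passed_qc") { |row| row[:passed_qc] }
counts = passed.count_by(:platform)
```

Because `filter` returns a new object rather than mutating the old one, every intermediate result keeps its own provenance trail and can be inspected or reused.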

human interfaces

tough problem

really tough problem in the new reality

goal

optimize the most valuable resource

compute, storage, workflows, memory, transmission, algorithms, cost, …

enter the cloud

what is the cloud?

infrastructure

scalable

highly available

dynamic

extensible

secure

a utility

programmable

class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  def public_dns
    @aws_hash[:dns_name] || ""
  end

  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end

include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos","redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end

a data science platform

dataspaces

Further reading: Jeff Hammerbacher, "Information Platforms and the Rise of the Data Scientist," in Beautiful Data

accept all data formats

evolve APIs

beyond the database and the data warehouse

move compute to the data

data is a royal garden

compute is a fungible commodity

“I terminate the instance and relaunch it. Thats my error handling”

Source: @jtimberman on Twitter
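The quote describes replacement as the recovery strategy: rather than repairing a failed machine, discard it and start a fresh one. A minimal sketch of that idea, assuming a hypothetical `FakeCloud` class standing in for a real provisioning API. No AWS calls are made, and none of these method names come from an AWS library.

```ruby
# Hypothetical in-memory stand-in for a cloud API; not a real AWS client.
class FakeCloud
  attr_reader :launched, :terminated

  def initialize
    @launched = []
    @terminated = []
  end

  def launch
    id = "i-#{@launched.length + 1}"
    @launched << id
    id
  end

  def terminate(id)
    @terminated << id
  end
end

# Run a job on a fresh instance; on any error, terminate that instance
# and relaunch, up to max_attempts times. The instance that succeeds is
# left running for the caller to use.
def run_with_replacement(cloud, max_attempts: 3)
  attempts = 0
  begin
    instance = nil
    attempts += 1
    instance = cloud.launch
    yield instance
  rescue StandardError
    cloud.terminate(instance) if instance
    retry if attempts < max_attempts
    raise
  end
end

cloud = FakeCloud.new
# Illustrative job: fails on the first instance, succeeds on the second.
result = run_with_replacement(cloud) do |id|
  raise "simulated failure" if id == "i-1"
  "done on #{id}"
end
```

This only works when the job keeps no irreplaceable state on the instance, which is the sense in which compute is treated as fungible while the data is not.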

the cloud is an architectural and cultural fit for data science

amazon web services

your data science platform

s3://1000genomes
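An `s3://` address is a bucket name plus an optional object key. A minimal sketch of splitting one apart and building the corresponding virtual-hosted-style HTTPS URL. The helper names and the `README` key are hypothetical, used only to show the shape of the address, and no network request is made.

```ruby
# Split an s3:// URI into its bucket and (possibly empty) key.
def parse_s3_uri(uri)
  match = uri.match(%r{\As3://([^/]+)(?:/(.*))?\z})
  raise ArgumentError, "not an s3:// URI: #{uri}" unless match
  { bucket: match[1], key: match[2].to_s }
end

# Build a virtual-hosted-style HTTPS URL for the bucket or object.
def https_url(uri)
  parts = parse_s3_uri(uri)
  "https://#{parts[:bucket]}.s3.amazonaws.com/#{parts[:key]}"
end

bucket_only = parse_s3_uri("s3://1000genomes")
with_key = https_url("s3://1000genomes/README") # key is a placeholder
```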

Credit: Angel Pizarro, U. Penn

http://usegalaxy.org/cloud

AWS knows massively scalable infrastructure

you know the needs of the science

we can make this work together

deesingh@amazon.com, Twitter: @mndoci

http://slideshare.net/mndoci

Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig

Credit: Oberazzi, under a CC-BY-NC-SA license
