
Parallel Frameworks & Big Data: Hadoop and Spark on BioHPC

Updated for 2016-06-15

[web] portal.biohpc.swmed.edu

[email] biohpc-help@utsouthwestern.edu

What is 'Big Data'?

Big data & parallel processing go hand-in-hand

Models of parallelization

Apache Hadoop

Apache Spark

Overview

"Big Data"

Datasets too big for conventional data processing applications

In business, often Predictive Analytics

Big Data in Biomedical Science

Complex, long-running analyses on GB datasets ≠ Big Data

A small number of genomic datasets (100s of GBs) is no longer 'big data'

Easy to manage on individual modern servers with standard apps

Cross-'omics' integration often is a big data problem

Thousands of datasets of different types: genomics, proteomics, metabolomics, imaging…

Different formats – import, storage, and integration problems

Common questions fit the 'predictive analytics' model of Big Data in business

A Big Data Example – ProteomicsDB

Partnership between academia and industry (SAP)

Data size is 721TB (not huge), but it contains 43 million MS/MS spectra

Biggest challenge is presentation and interpretation – search, visualization, etc.

Web environment uses SAP HANA: 160 cores and 2TB RAM behind the scenes

Wilhelm M, Schlegl J, Hahne H, Moghaddas Gholami A, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, Mathieson T, Lemeer S, Schnatbaum K, Reimer U, Wenschuh H, Mollenhauer M, Slotta-Huspenina J, Boese JH, Bantscheff M, Gerstmair A, Faerber F, Kuster B. Mass-spectrometry-based draft of the human proteome. Nature. 2014 May 29;509(7502):582-7. doi: 10.1038/nature13319.

Big Data… light

BUT – all achieved using a traditional single relational database, algorithms, etc.

No flexible search/query architecture

Looks like 'Big Data'…

TB scale input data

40,000 core hours of processing on TACC

Guo X, Trudgian DC, Lemoff A, Yadavalli S, Mirzaei H. Confetti: a multiprotease map of the HeLa proteome for comprehensive proteomics. Mol Cell Proteomics. 2014 Jun;13(6):1573-84. doi: 10.1074/mcp.M113.035170. Epub 2014 Apr 2.

Consider a Big Data problem…

I have 5,000 sequenced exomes with associated clinical observations, and want to:

• Find all SNPs in each patient
• Aggregate statistics for SNPs across all patients
• Easily query for SNP associations with any of my clinical variables
• Build a predictive model for a specific disease

A lot of data, a lot of processing – need to do each step in parallel

Parallelization – Shared Memory

[Diagram: one system with two CPUs of four cores each, each CPU with its own directly attached RAM]

1 system, multiple CPUs

Each CPU has a portion of RAM attached directly

CPUs can access their own or the other's RAM

Parallel processes or threads can access any of the data in RAM easily
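
As an illustration (not from the original slides), a minimal Python sketch of the shared-memory idea: several worker processes write into one block of RAM that they all map, via multiprocessing.Array.

from multiprocessing import Process, Array

def work(shared, i):
    # each worker writes into its own slot of the shared block of RAM
    shared[i] = i * i

if __name__ == "__main__":
    shared = Array("d", 4)     # four doubles living in shared memory
    procs = [Process(target=work, args=(shared, i)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(list(shared))        # the parent reads results directly: [0.0, 1.0, 4.0, 9.0]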

Parallelization – Distributed Shared Memory

[Diagram: RAM from several systems combined into a single logical address space]

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program, but costly and now uncommon

Parallelization – Message Passing

Many systems, multiple CPUs each

Can only directly access RAM inside a single system

If needed, data from another system must be passed as a message across the network (a minimal sketch follows the diagram below)

Must be carefully planned and optimized

Difficult

[Diagram: eight systems, each with four cores and local RAM, connected only by a network]
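
As an illustration (not part of the BioHPC examples), a minimal message-passing sketch assuming the mpi4py package: data only moves between processes via explicit send/receive messages.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"chunk": list(range(10))}
    comm.send(data, dest=1, tag=11)       # explicit message to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=11)    # matching receive on rank 1
    print("rank 1 received", data)

Run with something like mpirun -np 2 python msg_demo.py (msg_demo.py is a hypothetical filename).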

Map Reduce

Impose a rigid structure on tasks: always map, then reduce (see the plain-Python sketch below).

http://dme.rwth-aachen.de/de/research/projects/mapreduce
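
A plain-Python sketch of the same idea (illustration only; Hadoop distributes these steps across nodes): each line is mapped to per-line word counts, which are then reduced into one total.

from functools import reduce
from collections import Counter

lines = ["to be or not to be", "to be is to do"]

# Map: each line -> a count of the words it contains
mapped = [Counter(line.split()) for line in lines]

# Reduce: merge the per-line counts into a single total
totals = reduce(lambda a, b: a + b, mapped, Counter())
print(totals)   # Counter({'to': 4, 'be': 3, ...})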

Distributed Computing – Hadoop

A different way of thinking

Specify the problem as mapping and reduction steps

Run on subsets of data on different nodes

Use a special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

https://en.wikipedia.org/wiki/Apache_Hadoop

Map – Count words in a line

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

WordCount.java

Reduce – Sum the occurrences of words from all lines

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

WordCount.java

Driver – Run the map-reduce task

public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}

WordCount.java

Before we begin…

Because of the way the cluster is configured, we need to reserve slave nodes for our hadoop/spark jobs before starting them. This is so hadoop/spark have permission to ssh to these nodes.

The dummy job script 'reserve_myhadoop_node.sbatch' is provided.

Run…

sbatch reserve_myhadoop_node.sbatch

…as many times as the number of slave nodes you want to use.

When you are done…

When you are finished working with hadoop/spark, cancel these slave reservation jobs…

scancel -u $USER -n myhadoop-reserved

BUT don't do this until your main hadoop & spark jobs have finished and cleaned up.

Running a Hadoop Job on BioHPC – sbatch script

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark

export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i 's/Nucleus[0]/10.10.10./'

$HADOOP_HOME/bin/start-all.sh
$HADOOP_HOME/bin/hadoop dfs -mkdir data
$HADOOP_HOME/bin/hadoop dfs -put pg2701.txt data
$HADOOP_HOME/bin/hadoop dfs -ls data
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount data wordcount-output

$HADOOP_HOME/bin/hadoop dfs -ls wordcount-output
$HADOOP_HOME/bin/hadoop dfs -get wordcount-output

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_hadoop.sbatch

Hadoop Limitations on BioHPC

Inefficiency: (Small) wait for everything to start up. Workers are solely dedicated to you, and sit idle during portions of the job that are not highly parallelized.

HDFS: Uses /tmp on each compute node. Slow HDD – but lots of RAM for caching. Not persistent – deleted after the job ends.

Old Hadoop: Running Hadoop 1.2.1. Update to 2.2.0 possible; will keep it more up to date if there is a lot of demand.

Looking for interested users to try out persistent HDFS on /project.

General Hadoop Limitations

Model: Rigid map->reduce framework; hard to model some problems. Iterative algorithms can be difficult (a lot of scientific analysis).

Language: Java is the only 1st-class language. Wrappers/frameworks for other languages are available, but generally slower.

HDFS: Always writes results to disk after map/reduce. Architecture not good for small files / random reading.

Many things are alleviated by additional Hadoop projects – Hive, Pig, HBase, etc.

SPARK

• In-memory computing model
• Load/save data using HDFS or a standard file system
• Scala, Java, Python 1st-class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra, etc.
• Can leverage hadoop features (HDFS, HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop
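
A tiny PySpark sketch of the in-memory model (an illustration, assuming an interactive pyspark session where sc is already defined, as in the demo later, and a hypothetical numbers.txt input file):

data = sc.textFile("numbers.txt")             # load from HDFS or a standard file system
nums = data.map(lambda s: float(s)).cache()   # keep the parsed RDD in memory

print(nums.count())    # first action reads the file and materializes the RDD
print(nums.sum())      # later actions reuse the cached data instead of re-reading it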

Singular Value Decomposition

Challenge: Visualize patterns in a huge assay x gene dataset

Solution: Use SVD to compute eigengenes, and visualize the data in a few dimensions that capture the majority of the interesting patterns

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.

SVD on the cluster with Spark

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
  _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

Simple scripting language – this is Scala; you can also use Python (see the PySpark sketch below).

spark_svd.scala
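
A rough PySpark equivalent of the Scala script above – a sketch, assuming an interactive pyspark session with sc defined, the same matrix.txt input, and a Spark version recent enough to expose RowMatrix.computeSVD in Python:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt") \
         .map(lambda line: Vectors.dense([float(x) for x in line.split("\t")])) \
         .cache()

mat = RowMatrix(rows)
svd = mat.computeSVD(20, computeU=True)   # same k and U as the Scala example
print(svd.s)                              # the top 20 singular values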

Running Spark jobs on BioHPC

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i 's/Nucleus[0]/10.10.10./'

$HADOOP_HOME/bin/start-all.sh

source $HADOOP_CONF_DIR/spark/spark-env.sh
myspark start

spark-shell -i large_svd.scala

myspark stop

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_spark.sbatch

Demo – Interactive Spark – Python & Scala

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark cluster: sbatch slurm_spark_interactive.sh

Wait for spark to start, then load settings: source hadoop-conf.<jobid>/spark/spark-env.sh

Connect using an interactive Scala session: spark-shell

…or an interactive Python session: pyspark

And shut down when done: scancel <jobid>

Some Python example code – compute Pi

from random import random

NUM_SAMPLES = 1000000

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
          .map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)

What do you want to do?

Discussion

What big-data problems do you have in your research?

Are Hadoop and/or Spark interesting for your projects?

How can we help you use Hadoop/Spark for your work?

What is lsquoBig Datarsquo

Big data amp parallel processing are hand-in-hand

Models of parallelization

Apache Hadoop

Apache Spark

Overview

ldquoBig Datardquo

Datasets too big for conventional data processing applications

In business often Predictive Analytics

Big Data in Biomedical Science

4

Complex long running analyses on GB datasets ne Big Data

Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo

Easy to manage on individual modern servers with standard apps

Cross lsquoomicsrsquo integration often is a big data problem

Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip

Different formats ndash import storage integration problems

Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business

A Big Data Example ndash Proteomics DB

5

Partnership between academia and industry (SAP)

Data size is 721TB (not huge) but contains 43 million MSMS spectra

Biggest challenge is presentation and interpretation ndash search visualization etc

Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes

Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9

Big Datahellip light

6

BUT ndash all achieved using a traditional single relational database algorithms etc

No flexible search query architecture

Looks like lsquoBig Datarsquohellip

TB scale input data

40000 core hours of processing on TACC

Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

ldquoBig Datardquo

Datasets too big for conventional data processing applications

In business often Predictive Analytics

Big Data in Biomedical Science

4

Complex long running analyses on GB datasets ne Big Data

Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo

Easy to manage on individual modern servers with standard apps

Cross lsquoomicsrsquo integration often is a big data problem

Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip

Different formats ndash import storage integration problems

Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business

A Big Data Example ndash Proteomics DB

5

Partnership between academia and industry (SAP)

Data size is 721TB (not huge) but contains 43 million MSMS spectra

Biggest challenge is presentation and interpretation ndash search visualization etc

Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes

Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9

Big Datahellip light

6

BUT ndash all achieved using a traditional single relational database algorithms etc

No flexible search query architecture

Looks like lsquoBig Datarsquohellip

TB scale input data

40000 core hours of processing on TACC

Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Big Data in Biomedical Science

4

Complex long running analyses on GB datasets ne Big Data

Small number of genomic datasets (100s of GBS) no longer lsquobig datarsquo

Easy to manage on individual modern servers with standard apps

Cross lsquoomicsrsquo integration often is a big data problem

Thousands of datasets of different typesgenomics proteomics metabalomics imaginghellip

Different formats ndash import storage integration problems

Common questions fit the lsquopredictive analyticsrsquo model of Big Data in business

A Big Data Example ndash Proteomics DB

5

Partnership between academia and industry (SAP)

Data size is 721TB (not huge) but contains 43 million MSMS spectra

Biggest challenge is presentation and interpretation ndash search visualization etc

Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes

Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9

Big Datahellip light

6

BUT ndash all achieved using a traditional single relational database algorithms etc

No flexible search query architecture

Looks like lsquoBig Datarsquohellip

TB scale input data

40000 core hours of processing on TACC

Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

A Big Data Example ndash Proteomics DB

5

Partnership between academia and industry (SAP)

Data size is 721TB (not huge) but contains 43 million MSMS spectra

Biggest challenge is presentation and interpretation ndash search visualization etc

Web environment uses SAP HANA160 cores and 2TB RAM behind the scenes

Nature 2014 May 29509(7502)582-7 doi 101038nature13319Mass-spectrometry-based draft of the human proteomeWilhelm M1 Schlegl J2 Hahne H3 Moghaddas Gholami A3 Lieberenz M4 Savitski MM5 Ziegler E4 Butzmann L4 Gessulat S4 Marx H6 Mathieson T5 Lemeer S6 Schnatbaum K7 Reimer U7 Wenschuh H7 Mollenhauer M8 Slotta-Huspenina J8 Boese JH4 Bantscheff M5 Gerstmair A4 Faerber F4 Kuster B9

Big Datahellip light

6

BUT ndash all achieved using a traditional single relational database algorithms etc

No flexible search query architecture

Looks like lsquoBig Datarsquohellip

TB scale input data

40000 core hours of processing on TACC

Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Big Datahellip light

6

BUT ndash all achieved using a traditional single relational database algorithms etc

No flexible search query architecture

Looks like lsquoBig Datarsquohellip

TB scale input data

40000 core hours of processing on TACC

Mol Cell Proteomics 2014 Jun13(6)1573-84 doi 101074mcpM113035170 Epub 2014 Apr 2Confetti a multiprotease map of the HeLa proteome for comprehensive proteomicsGuo X1 Trudgian DC1 Lemoff A1 Yadavalli S1 Mirzaei H2

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model: The rigid map -> reduce framework makes some problems hard to express. Iterative algorithms can be difficult (a lot of scientific analysis is iterative).

Language: Java is the only 1st-class language. Wrappers/frameworks for other languages are available, but are generally slower.

HDFS: Results are always written to disk after map/reduce. The architecture is not good for small files or random reading.

Many of these issues are alleviated by additional Hadoop projects – Hive, Pig, HBase etc.

SPARK

21

• In-memory computing model
• Load/save data using HDFS or a standard file system
• Scala, Java, Python 1st-class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra, etc.
• Can leverage Hadoop features (HDFS, HBase) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop
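To see why, here is a sketch of the same word count as the Java MapReduce example earlier, written as Spark RDD operations (assuming a SparkContext sc is already available, as it is inside spark-shell):

// Word count as a chain of RDD transformations - the whole MapReduce job in a few lines.
// Assumes `sc` (SparkContext) already exists, e.g. inside spark-shell.
val counts = sc.textFile("pg2701.txt")          // read the input text
  .flatMap(line => line.split("\\s+"))          // map: split each line into words
  .map(word => (word, 1))                       // emit (word, 1) pairs
  .reduceByKey(_ + _)                           // reduce: sum the counts for each word

counts.saveAsTextFile("wordcount-output")       // write one part file per partition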

Singular Value Decomposition

22

Challenge: Visualize patterns in a huge assay x gene dataset

Solution: Use SVD to compute eigengenes, and visualize the data in a few dimensions that capture the majority of the interesting patterns

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis", in A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001
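For reference, the factorization being computed is the standard truncated SVD (nothing specific to Spark or microarrays):

\[ X \approx U_k \, \Sigma_k \, V_k^{T}, \qquad k \ll \min(m, n) \]

Here X is the m assay x n gene matrix; the columns of V_k are the dominant patterns in gene space, and the rows of U_k Sigma_k give each assay's coordinates in those k dimensions – the low-dimensional view that actually gets plotted.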

SVD on the cluster with Spark

23

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
    _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

Simple scripting language – this is Scala; you can also use Python

spark_svd.scala
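As a follow-up sketch (standard MLlib API, continuing from the code above), the three factors can be unpacked from the result and the singular values inspected:

// Continuing from spark_svd.scala: unpack the factors returned by computeSVD.
val U: RowMatrix = svd.U   // left singular vectors, distributed (available because computeU = true)
val s: Vector = svd.s      // singular values, largest first, as a local vector
val V: Matrix = svd.V      // right singular vectors, as a local matrix

println(s)                 // quick look at how fast the singular values fall off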

Running Spark jobs on BioHPC

24

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i "s/Nucleus[0]/10.10.10/"

$HADOOP_HOME/bin/start-all.sh

source $HADOOP_CONF_DIR/spark/spark-env.sh
myspark start

spark-shell -i large_svd.scala

myspark stop

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_spark.sbatch

Demo – Interactive Spark – Python & Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a Spark cluster: sbatch slurm_spark_interactive.sh

Wait for Spark to start, then load settings: source hadoop-conf.<jobid>/spark/spark-env.sh

Connect using an interactive Scala session: spark-shell (see the example after this list)

…or an interactive Python session: pyspark

And shut down when done: scancel <jobid>
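Once connected, a quick sanity check that the shell can reach the workers might look like this (a sketch in the Scala shell; exact output will vary):

// Inside spark-shell: run a trivial distributed job to confirm the workers respond.
val rdd = sc.parallelize(1 to 1000000)
println(rdd.sum())             // 5.000005E11 - the sum of 1..1,000,000
println(rdd.partitions.length) // how many partitions the data was split into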

Some Python example code – compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
          .map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
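The estimate works because the fraction of random points in the unit square that land inside the quarter circle tends to pi/4, so multiplying the observed fraction by 4 approximates pi; with 1,000,000 samples the result is typically only accurate to two or three decimal places.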

What do you want to do

27

Discussion

What big-data problems do you have in your research?

Are Hadoop and/or Spark interesting for your projects?

How can we help you use Hadoop/Spark for your work?

Consider a Big Data problemhellip

7

I have 5000 sequenced exomes with associated clinical observations and want to

bull Find all SNPs in each patient

bull Aggregate statistics for SNPs across all patients

bull Easily query for SNP associations with any of my clinical variables

bull Build a predictive model for a specific disease

A lot of data a lot of processing ndash need to do each step in parallel

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 220 possibleMore up to date if a lot of demand

Looking for interested users to try out persistent HDFS on project

General Hadoop Limitations

20

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

21

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

22

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

23

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

24

binbash Run on the super partitionSBATCH -p super With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

25

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

Some python example code ndash compute Pi

26

from random import random

NUM_SAMPLES = 1000000

def sample(p)x y = random() random()return 1 if xx + yy lt 1 else 0

count = scparallelize(xrange(0 NUM_SAMPLES)) map(sample) reduce(lambda a b a + b)

print Pi is roughly f (40 count NUM_SAMPLES)

What do you want to do

27

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Before we beginhellip

16

Because of the way the cluster is configured we need to reserve slave nodes forour hadoop spark jobs before starting them This is so hadoopspark have permission to ssh to these nodes

The dummy job script lsquoreserve_myhadoop_nodesbatchrsquo is provided

Runhellip

sbatch reserve_myhadoop_nodesbatch

hellipas many times as the number of slave nodes you want to use

When you are donehellip

17

When you are finished working with hadoopspark cancel these slave reservation jobshellip

scancel ndashu $USER ndashn myhadoop-reserved

BUT donrsquot do this until your main hadoop amp spark jobs have finished and cleaned up

Running a Hadoop Job on BioHPC ndash sbatch script

18

binbash Run on the super partitionSBATCH -p super

With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

19

Inefficiency: (Small) wait for everything to start up. Workers are solely dedicated to you, and sit idle during portions of the job that are not highly parallelized.

HDFS: Uses /tmp on each compute node. Slow HDD – but lots of RAM for caching. Not persistent – deleted after the job ends.

Old Hadoop: Running Hadoop 1.2.1. An update to 2.2.0 is possible – we will keep it more up to date if there is a lot of demand.

Looking for interested users to try out persistent HDFS on /project.

General Hadoop Limitations

20

Model: Rigid map->reduce framework – hard to model some problems. Iterative algorithms can be difficult (a lot of scientific analysis is iterative).

Language: Java is the only 1st class language. Wrappers / frameworks for other languages are available, but generally slower.

HDFS: Always writes results to disk after map/reduce. The architecture is not good for small files / random reading.

Many things are alleviated by additional Hadoop projects – Hive, Pig, HBase etc.

SPARK

21

• In-memory computing model
• Load/save data using HDFS or standard file system
• Scala, Java, Python 1st class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra etc.
• Can leverage hadoop features (HDFS, HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop
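To make the in-memory point concrete, here is a minimal pyspark sketch (an added illustration, not from the original deck; the input path and one-number-per-line format are assumptions). The RDD is parsed once, kept in memory with cache(), and then reused by several actions without re-reading the file from disk, which is exactly the pattern that is awkward to express as a chain of plain map/reduce jobs.

# Sketch of the in-memory model ('sc' is the SparkContext from pyspark).
# Hypothetical input: a text file with one number per line.
values = (sc.textFile("file:///tmp/numbers.txt")
            .map(lambda line: float(line))
            .cache())                                 # keep the parsed RDD in memory

# Several passes over the same cached RDD - no re-read from disk in between.
n     = values.count()
total = values.reduce(lambda a, b: a + b)
vmax  = values.reduce(lambda a, b: a if a > b else b)

print "n=%d mean=%f max=%f" % (n, total / n, vmax)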

Singular Value Decomposition

22

Challenge: Visualize patterns in a huge assay x gene dataset.

Solution: Use SVD to compute eigengenes; visualize the data in a few dimensions that capture the majority of the interesting patterns.

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis" in A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow, eds., pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
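For reference, a short sketch of the linear algebra involved (added for clarity; standard SVD notation, with the eigengene/eigenassay naming of Wall et al. depending on how the matrix is oriented). For an assay x gene data matrix X,

X = U \Sigma V^{T}, \qquad X \approx X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T}

where the singular values \sigma_1 \ge \sigma_2 \ge \dots are the diagonal entries of \Sigma. Keeping only the k largest terms gives the best rank-k approximation of X, which is why a few leading singular vectors capture most of the interesting structure; the Scala example on the next slide computes the top k = 20.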

SVD on the cluster with Spark

23

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
  _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

Simple scripting language – this is Scala; you can also use Python.

spark_svd.scala
