
1. HPC & I/O
2. BioPerl

A simplified picture of the system

[Figure: cluster architecture diagram. Labeled components:
- User machines, connected over the LAN (0.1-1 Gbps)
- Login server(s): jhpce01.jhsph.edu, jhpce02.jhsph.edu
- Data transfer server: transfer01.jhsph.edu, on the research network (10-100 Gbps)
- Ethernet switches (10-40 Gbps)
- Compute farm: 72 nodes, ~3000 cores, with direct attached storage
- NFS exported file systems, including /users
- Lustre file system
- Storage systems labeled DCS01, DCS02, DCS03 and DCL01]

Review of technical notions

- Central Processing Unit (CPU)
    - The part of the computer that executes instructions (programs)
- Random Access Memory (RAM)
    - Very fast volatile memory that is used like a scratchpad by the CPU
- Mass Storage (Disk)
    - Where data & apps are kept more or less permanently. Very slow compared to RAM
- Network (Ethernet, internet)
    - Computers and devices communicate over networks.
    - These days it's mostly Ethernet.

Review of sizes

- Storage and memory sizes
    - 1 byte = 8 bits = 1 character
    - 1 megabyte (MB) = 10^6 bytes
    - 1 gigabyte (GB) = 1000 MB = 10^9 bytes
    - 1 terabyte (TB) = 1000 GB = 10^12 bytes
    - 1 petabyte (PB) = 1000 TB = 10^15 bytes
- Typical sizes
    - USB stick: 4-128 GB
    - Laptop disk drive: 250-1000 GB
    - Enterprise storage appliance: 100 TB
    - Scale-out cluster storage: > 1 PB
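The decimal size units can be turned into a small helper that prints byte counts in readable form. This is a minimal sketch: the name `human_size` and the sample values are illustrative assumptions, and the "KB" step is added for completeness even though the slide starts at MB.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a byte count to a human-readable string using decimal units,
# where each step up is a factor of 1000 (1 MB = 10^6 bytes, and so on).
# human_size is a hypothetical helper name, not part of any library.
sub human_size {
    my ($bytes) = @_;
    my @units = ('bytes', 'KB', 'MB', 'GB', 'TB', 'PB');
    my $i = 0;
    while ($bytes >= 1000 && $i < $#units) {
        $bytes /= 1000;
        $i++;
    }
    return sprintf("%.1f %s", $bytes, $units[$i]);
}

# Illustrative values only, not measurements of any real system
for my $n (8, 1e6, 250e9, 1e12, 1.5e15) {
    printf "%-20s %s\n", $n, human_size($n);
}
```

Note that many tools report binary units instead (1 KiB = 1024 bytes), so numbers from `ls -lh` or `du -h` may differ slightly from this decimal convention.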

Key technical notions (network)

- Bandwidth
    - How much data per second you can pump through a pipe.
    - Measured in gigabits per second (Gbps).
- Latency
    - How long does it take for that first piece of data to get through?
    - Measured in nanoseconds, microseconds or (gasp!) milliseconds.
- A practical demonstration
    - http://speedtest.comcast.net
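Bandwidth and latency combine into a rough estimate of transfer time: latency plus the number of bits divided by the link speed. The sketch below assumes an idealized link with no protocol overhead or congestion; `transfer_seconds` and the example sizes, speeds and the 0.5 ms latency are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate the seconds needed to move $bytes over a link:
#   time = latency + (bits to send) / (bits per second)
# Hypothetical helper; ignores protocol overhead and congestion.
sub transfer_seconds {
    my ($bytes, $gbps, $latency_s) = @_;
    return $latency_s + ($bytes * 8) / ($gbps * 1e9);
}

my $latency = 0.0005;    # assumed 0.5 ms latency

# Large transfer: bandwidth dominates
printf "1 TB at 1 Gbps:  %.1f s\n", transfer_seconds(1e12, 1,  $latency);
printf "1 TB at 10 Gbps: %.1f s\n", transfer_seconds(1e12, 10, $latency);

# Tiny transfer: latency dominates
printf "1 KB at 1 Gbps:  %.6f s\n", transfer_seconds(1e3, 1, $latency);
```

For the large file a faster link helps almost proportionally, while for the 1 KB message nearly all of the time is latency, so extra bandwidth makes little difference.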

Time scales for data transfer

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns             14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns             20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns
Send 1K bytes over 1 Gbps network       10,000   ns    0.01 ms
Read 4K randomly from SSD*             150,000   ns    0.15 ms
Read 1 MB sequentially from memory     250,000   ns    0.25 ms
Round trip within same datacenter      500,000   ns    0.5  ms
Read 1 MB sequentially from SSD*     1,000,000   ns    1    ms  4X memory
Disk seek                           10,000,000   ns   10    ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20    ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150    ms

What if we multiply all the time scales by 1 billion to humanize them?

Main memory reference                        1.6 min      Brushing your teeth
Send 2KB over 1 Gbps network                 5.5 hr       From lunch to end of work day
Read 1 MB sequentially from memory           2.9 days     A long weekend
Round trip within same datacenter            5.8 days     A medium vacation

Reading 1 MB from SSD
  SSD random read                            1.7 days     A normal weekend
  SSD read 1 MB sequentially                 11.6 days    Waiting for almost 2 weeks for a delivery

Reading 1 MB from disk
  Disk seek                                  16.5 weeks   A semester in university
  Read 1 MB sequentially from disk           7.8 months   Almost producing a new human being
  Total time                                 1 year

Internet packet
  Round trip from California to Netherlands  4.8 years    Average time it takes to complete a bachelor's degree
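The scaling is simple arithmetic: a latency of N nanoseconds multiplied by one billion is N seconds, which can then be expressed in everyday units. A minimal sketch follows; `humanize_ns` is a hypothetical helper, and it reports only seconds, minutes, hours, days and 365-day years, whereas the slide also uses weeks and months.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scale a latency in nanoseconds by one billion and express the result in
# everyday units. N ns * 1e9 = N seconds, so the number carries over as-is.
sub humanize_ns {
    my ($ns) = @_;
    my $seconds = $ns;    # N ns scaled by 1e9 is N s
    my @scale = (
        [ 365 * 86400, 'years' ],
        [ 86400,       'days'  ],
        [ 3600,        'hr'    ],
        [ 60,          'min'   ],
    );
    for my $step (@scale) {
        my ($size, $unit) = @$step;
        return sprintf("%.1f %s", $seconds / $size, $unit) if $seconds >= $size;
    }
    return sprintf("%.1f s", $seconds);
}

# Latencies in ns, taken from the latency comparison table
my @rows = (
    [ 'Main memory reference',              100         ],
    [ 'Read 1 MB sequentially from memory', 250_000     ],
    [ 'Round trip within same datacenter',  500_000     ],
    [ 'Read 1 MB sequentially from SSD',    1_000_000   ],
    [ 'Disk seek',                          10_000_000  ],
    [ 'Send packet CA->Netherlands->CA',    150_000_000 ],
);
printf "%-38s %s\n", $_->[0], humanize_ns($_->[1]) for @rows;
```

Because this sketch rounds to one decimal place, a few values differ slightly from the slide, which appears to truncate (for example, 100 s prints as 1.7 min rather than 1.6 min).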

The main lesson

- Do as much computing as you can in RAM
- Avoid disk I/O as much as possible
- If you must go to the disk, suck in entire files at a time rather than fetching one record at a time.
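In Perl, one way to read a whole file at once is to undefine the input record separator `$/` before reading. The sketch below is illustrative: `slurp` is a hypothetical helper name, the scratch file is created only for the demo, and the approach assumes the file fits comfortably in RAM.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read an entire file into memory with a single read of the filehandle.
# Hypothetical helper; only sensible when the file fits in RAM.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                 # no record separator: <$fh> returns everything
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Demo: create a small scratch file (removed automatically at exit)
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "record $_\n" for 1 .. 5;
close($tmp_fh);

my $data    = slurp($tmp_path);
my @records = split /\n/, $data;    # all further work happens in RAM
print scalar(@records), " records read in one pass\n";
```

Perl's ordinary line-by-line reads are already buffered, so the gain from slurping depends on the file system and access pattern; it tends to matter most on shared network storage, where each separate request carries latency.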

BioPerl on the Cluster

BioPerl

- BioPerl provides object-oriented software modules for many of the typical tasks of bioinformatics programming:
    - Manipulating individual sequences
    - Accessing genomic data directly from databases
    - Transforming formats of database/file records
    - Searching for "similar" sequences
    - Creating and manipulating sequence alignments
    - Searching for genes and other structures in DNA
    - Developing machine-readable sequence annotations

www.bioperl.org

BioPerl Module Groups

BioPerl IO modules
SeqIO          FASTA, GenBank, EMBL, etc.
SearchIO       BLAST, FASTA, HMMER
AlignIO        ClustalW, Phylip, MSF, etc.
TreeIO         Newick, Nexus, NHX
MapIO          MapMaker
Matrix::IO     Scoring, Phylip
Assembly::IO   Ace, Phrap
Ontology::IO   InterPro, GO, SO
More…

Bio::SeqIO

- The principal class for input/output
- Methods
    - new -- opens a new seqstream for I/O
    - next_seq -- gets the next entry in the input seqstream
    - write_seq -- writes to a seqstream
    - there is more…
- Refer to the web site for documentation
- Example: format conversion

UniProtKB/Swiss-Prot format

Each sequence entry is composed of lines. Each line begins with a two-character code, which indicates the type of data contained in the line.

ID        Identification.
AC        Accession number(s).
DT        Date.
DE        Description.
GN        Gene name(s).
OS        Organism species.
OG        Organelle.
OC        Organism classification.
RN        Reference number.
RP        Reference position.
RC        Reference comments.
RX        Reference cross-references.
RA        Reference authors.
RL        Reference location.
CC        Comments or notes.
DR        Database cross-references.
KW        Keywords.
FT        Feature table data.
SQ        Sequence header.
(blanks)  Sequence data.
//        Termination line.
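To make the line-code layout concrete, here is a minimal hand-rolled sketch that groups the lines of one entry by their two-character code. It is not how BioPerl parses these files and it ignores many details of the real format. The helper name `parse_entry` and the sample entry (including its ID and accession) are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group the lines of one flat-file entry by their two-character line code.
# Returns a hash ref: code => array ref of line contents. Sequence lines
# (blank code) are concatenated under the key 'seq'.
# Hypothetical helper for illustration only; use Bio::SeqIO for real work.
sub parse_entry {
    my ($text) = @_;
    my %entry;
    for my $line (split /\n/, $text) {
        last if $line =~ m{^//};                  # termination line
        if ($line =~ /^([A-Z]{2})\s+(.*)$/) {     # coded line, e.g. "ID   ..."
            push @{ $entry{$1} }, $2;
        }
        elsif ($line =~ /^\s+(\S.*)$/) {          # blank code: sequence data
            (my $residues = $1) =~ s/\s+//g;      # drop the spacing between blocks
            $entry{seq} .= $residues;
        }
    }
    return \%entry;
}

# Made-up entry, not a real database record
my $sample = <<'END';
ID   EXAMPLE_HUMAN   Reviewed;   12 AA.
AC   X00000;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
END

my $e = parse_entry($sample);
print "ID:  $e->{ID}[0]\n";
print "OS:  $e->{OS}[0]\n";
print "Seq: $e->{seq}\n";
```

In practice Bio::SeqIO with the 'swiss' format handles this parsing, including multi-line fields and feature tables, so a sketch like this is mainly useful for understanding what the parser sees.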

Swiss-Prot to FASTA format conversion

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# create a SeqIO object for the input stream
my $in = Bio::SeqIO->new('-file'   => "sprot.txt",
                         '-format' => 'swiss');

# create a SeqIO object for the output stream
my $out = Bio::SeqIO->new('-file'   => ">sprot.fasta",
                          '-format' => 'fasta');

# read the input stream and write to the output stream,
# one record at a time
while ( my $seq = $in->next_seq() ) {
    $out->write_seq($seq);
}

- swiss2fasta.pl

Example: Remote database query

#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;

my ($gb, $seq1, $seq2, $seq_id);

# use eval to test for success of code block
eval { $gb = Bio::DB::GenBank->new() };
if ($@) {
    die "Warning: Couldn't connect to Genbank";
}

# get by sequence id
$seq1   = $gb->get_Seq_by_id('MUSIGHBA1');
$seq_id = $seq1->display_id();
print "got seq1 display id is $seq_id \n";

# get by accession number
$seq2   = $gb->get_Seq_by_acc('AF303112');
$seq_id = $seq2->display_id();
print "got seq2 display id is $seq_id \n";

# get a bunch of sequences by accession number
my $seqio = $gb->get_Stream_by_id([ qw(2981014 J00522 AF303112) ]);
while ( my $seqobj = $seqio->next_seq() ) {
    print $seqobj->display_id(), "\n";
    print $seqobj->seq() . "\n\n";
}
