1. HPC & I/O 2. BioPerl


A simplified picture of the system

[Diagram. Components labeled on the slide:]

• User machines
• LAN: 0.1-1 Gbps
• Login server(s): jhpce01.jhsph.edu, jhpce02.jhsph.edu
• Compute farm: 72 nodes, ~3000 cores
• Ethernet switches (10-40 Gbps)
• Data transfer server: transfer01.jhsph.edu
• Research network: 10-100 Gbps
• Storage: direct attached storage; NFS exported file systems (/users); DCS01, DCS02, DCS03; DCL01; Lustre file system


Review of technical notions

• Central Processing Unit (CPU)
    • The part of the computer that executes instructions (programs)

• Random Access Memory (RAM)
    • Very fast volatile memory that is used like a scratchpad by the CPU

• Mass Storage (Disk)
    • Where data & apps are kept more or less permanently. Very slow compared to RAM

• Network (Ethernet, internet)
    • Computers and devices communicate over networks.
    • These days it's mostly Ethernet.


Review of sizes

• Storage and memory sizes
    • 1 byte = 8 bits = 1 character
    • 1 megabyte (MB) = 10^6 bytes
    • 1 gigabyte (GB) = 1000 MB = 10^9 bytes
    • 1 terabyte (TB) = 1000 GB = 10^12 bytes
    • 1 petabyte (PB) = 1000 TB = 10^15 bytes

• Typical sizes
    • USB stick: 4-128 GB
    • Laptop disk drive: 250-1000 GB
    • Enterprise storage appliance: 100 TB
    • Scale-out cluster storage: > 1 PB
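As a small illustration of the decimal units above, the following Perl sketch converts a raw byte count into the largest convenient unit. The function name human_size is hypothetical, and the KB step (1000 bytes) is an addition that does not appear on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decimal (SI) units: each step up is a factor of 1000.
# 'KB' is not listed on the slide; it is included here for completeness.
my @UNITS = ('bytes', 'KB', 'MB', 'GB', 'TB', 'PB');

# Convert a byte count to a short human-readable string.
sub human_size {
    my ($bytes) = @_;
    my $i = 0;
    while ($bytes >= 1000 && $i < $#UNITS) {
        $bytes /= 1000;
        $i++;
    }
    return sprintf("%.1f %s", $bytes, $UNITS[$i]);
}

print human_size(10**6), "\n";          # one megabyte
print human_size(250 * 10**9), "\n";    # a 250 GB laptop disk
print human_size(3 * 10**15), "\n";     # scale-out cluster storage
```

Note that these are decimal units, as on the slide; tools that report binary units (powers of 1024) will show slightly different numbers for the same byte count.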


Key technical notions (Network)

• Bandwidth
    • How much data per second can you pump through a pipe?
    • Measured in gigabits per second (Gbps).

• Latency
    • How long does it take for that first piece of data to get through?
    • Measured in nano-, micro- or (gasp!) milliseconds.

• A practical demonstration
    • http://speedtest.comcast.net
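The two notions combine into a rough estimate of transfer time: one latency delay plus the size in bits divided by the bandwidth. The Perl sketch below assumes this simple model (it ignores protocol overhead and congestion); the function name transfer_seconds and the example values are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimated time in seconds to move $bytes over a link:
# one latency delay plus the time to push the bits through the pipe.
sub transfer_seconds {
    my ($bytes, $gbps, $latency_s) = @_;
    my $bits = $bytes * 8;                 # bandwidth is quoted in bits
    return $latency_s + $bits / ($gbps * 1e9);
}

# 1 GB over a 1 Gbps link with 0.5 ms latency (illustrative values)
printf "%.4f s\n", transfer_seconds(1e9, 1, 0.0005);

# The same file over a 10 Gbps link
printf "%.4f s\n", transfer_seconds(1e9, 10, 0.0005);

# A single 1 KB message: here latency dominates
printf "%.6f s\n", transfer_seconds(1000, 1, 0.0005);
```

For large files the bandwidth term dominates; for many small messages the latency term does.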


Time scales for data transfer

Latency Comparison Numbers
--------------------------
L1 cache reference                            0.5 ns
Branch mispredict                             5   ns
L2 cache reference                            7   ns              14x L1 cache
Mutex lock/unlock                            25   ns
Main memory reference                       100   ns              20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy              3,000   ns
Send 1K bytes over 1 Gbps network        10,000   ns    0.01 ms
Read 4K randomly from SSD*              150,000   ns    0.15 ms
Read 1 MB sequentially from memory      250,000   ns    0.25 ms
Round trip within same datacenter       500,000   ns    0.5  ms
Read 1 MB sequentially from SSD*      1,000,000   ns    1    ms   4x memory
Disk seek                            10,000,000   ns   10    ms   20x datacenter round trip
Read 1 MB sequentially from disk     20,000,000   ns   20    ms   80x memory, 20x SSD
Send packet CA->Netherlands->CA     150,000,000   ns  150    ms


What if we multiply all the time scales by 1 billion to humanize them?

Main memory reference                        1.6 min      Brushing your teeth
Send 2KB over 1 Gbps network                 5.5 hr       From lunch to end of work day
Read 1 MB sequentially from memory           2.9 days     A long weekend
Round trip within same datacenter            5.8 days     A medium vacation

Reading 1MB from SSD
  SSD random read                            1.7 days     A normal weekend
  SSD read 1 MB sequentially                 11.6 days    Waiting for almost 2 weeks for a delivery

Reading 1MB from Disk
  Seek                                       16.5 weeks   A semester in university
  Read 1 MB sequentially from disk           7.8 months   Almost producing a new human being
  Total time                                 1 year

Internet packet
  Round trip from California to Netherlands  4.8 years    Average time it takes to complete a bachelor's degree
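The scaling is simple arithmetic: multiplying by one billion turns 1 ns into 1 s, so a latency of N nanoseconds becomes N seconds. The following Perl sketch reproduces a few of the figures; the function name humanize_ns and the choice of units (a 365-day year, no weeks or months) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scale a latency in nanoseconds by one billion (1 ns becomes 1 s)
# and express the result in the largest unit that fits.
sub humanize_ns {
    my ($ns) = @_;
    my $seconds = $ns;    # N ns x 1e9 = N s
    my @units = (
        [ 'years',   365 * 86400 ],
        [ 'days',    86400 ],
        [ 'hours',   3600 ],
        [ 'minutes', 60 ],
    );
    for my $u (@units) {
        my ($name, $size) = @$u;
        return sprintf("%.1f %s", $seconds / $size, $name)
            if $seconds >= $size;
    }
    return sprintf("%.1f seconds", $seconds);
}

print humanize_ns(100), "\n";            # main memory reference: 1.7 minutes
print humanize_ns(500_000), "\n";        # datacenter round trip: 5.8 days
print humanize_ns(150_000_000), "\n";    # CA->Netherlands->CA: 4.8 years
```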


The main lesson

• Do as much computing as you can in RAM.

• Avoid disk I/O as much as possible.

• If you must go to the disk, read in entire files at a time rather than fetching one record at a time.
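One common way to follow this advice in Perl is to "slurp" a file: undefine the input record separator so that a single read returns the whole file, then do all further work in memory. This is only a sketch; the helper name slurp and the file name records.txt are hypothetical, and the approach assumes the file fits comfortably in RAM.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file into memory with one read on the filehandle.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                 # no record separator: <$fh> returns everything
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Hypothetical input file name, for illustration only.
my $file = 'records.txt';
if (-e $file) {
    my $data    = slurp($file);
    my @records = split /\n/, $data;   # all further work happens in RAM
    print scalar(@records), " records read\n";
}
```

For files too large to hold in memory, the same idea applies at a smaller scale: read in large chunks rather than issuing many tiny requests.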


BioPerl on the Cluster


BioPerl

• BioPerl provides object-oriented software modules for many of the typical tasks of bioinformatics programming.
    • Manipulating individual sequences
    • Accessing genomic data directly from databases
    • Transforming formats of database/file records
    • Searching for "similar" sequences
    • Creating and manipulating sequence alignments
    • Searching for genes and other structures in DNA
    • Developing machine-readable sequence annotations


www.bioperl.org


Bioperl Module Groups


BioPerl IO modules

SeqIO          FASTA, GenBank, EMBL, etc.
SearchIO       BLAST, FASTA, HMMER
AlignIO        ClustalW, Phylip, MSF, etc.
TreeIO         Newick, Nexus, NHX
MapIO          MapMaker
Matrix::IO     Scoring, Phylip
Assembly::IO   Ace, Phrap
Ontology::IO   InterPro, GO, SO
More…


Bio::SeqIO

• The principal class for input/output

• Methods
    • new -- opens a new seqstream for I/O
    • next_seq -- gets the next entry in the input seqstream
    • write_seq -- writes to a seqstream
    • there is more…

• Refer to the web site for documentation
• Example: format conversion


UniProtKB/Swiss-Prot format

Each sequence entry is composed of lines. Each line begins with a two-character code, which indicates the type of data contained in the line.

ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
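To make the line-code layout concrete, here is a minimal Perl sketch that groups the lines of one entry by their two-character code, using only core Perl. It is not how Bio::SeqIO parses the format; the function name parse_entry, the SEQ hash key, and the abbreviated sample entry are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group the lines of one entry by their two-character line code.
# Sequence data lines start with blanks and are collected under 'SEQ'
# (a key chosen here for illustration).
sub parse_entry {
    my (@lines) = @_;
    my %fields;
    for my $line (@lines) {
        last if $line =~ m{^//};                 # termination line
        if ($line =~ /^([A-Z]{2})\s+(.*)$/) {    # coded line, e.g. "ID   ..."
            push @{ $fields{$1} }, $2;
        }
        elsif ($line =~ /^\s+(\S.*)$/) {         # sequence data
            (my $residues = $1) =~ s/\s+//g;     # drop spacing between blocks
            $fields{SEQ} .= $residues;
        }
    }
    return \%fields;
}

# A made-up, abbreviated entry used only to exercise the parser.
my @entry = (
    "ID   TEST_HUMAN              Reviewed;          10 AA.",
    "AC   P00000;",
    "DE   Hypothetical test protein.",
    "SQ   SEQUENCE   10 AA;",
    "     MKTAY IAKQR",
    "//",
);

my $f = parse_entry(@entry);
print "ID line:  $f->{ID}[0]\n";
print "Sequence: $f->{SEQ}\n";
```

In practice Bio::SeqIO with the 'swiss' format handles the details of real entries, which this sketch does not attempt to cover.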


Swiss-Prot to FASTA format conversion

    #!/usr/bin/perl -w
    use strict;
    use Bio::SeqIO;

    # create a SeqIO object for the input stream
    my $in = Bio::SeqIO->new('-file'   => "sprot.txt",
                             '-format' => 'swiss');

    # create a SeqIO object for the output stream
    my $out = Bio::SeqIO->new('-file'   => ">sprot.fasta",
                              '-format' => 'fasta');

    # read the input stream and write to the output stream
    # one record at a time
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }

• swiss2fasta.pl


Example: Remote database query

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;

    my ($gb, $seq1, $seq2, $seq_id);

    # use eval to test for success of code block
    eval { $gb = Bio::DB::GenBank->new() };
    if ($@) {
        die "Couldn't connect to GenBank: $@";
    }

    # get by sequence id
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    $seq_id = $seq1->display_id();
    print "got seq1 display id is $seq_id \n";

    # get by accession number
    $seq2 = $gb->get_Seq_by_acc('AF303112');
    $seq_id = $seq2->display_id();
    print "got seq2 display id is $seq_id \n";

    # get a bunch of sequences by accession number
    my $seqio = $gb->get_Stream_by_id([ qw(2981014 J00522 AF303112) ]);
    while ( my $seqobj = $seqio->next_seq() ) {
        print $seqobj->display_id(), "\n";
        print $seqobj->seq(), "\n\n";
    }