1. HPC & I/O
2. BioPerl
A simplified picture of the system
[Figure: diagram of the cluster. Components shown:]
- User machines, on a LAN (0.1-1 Gbps)
- Login server(s): jhpce01.jhsph.edu, jhpce02.jhsph.edu
- Compute farm: 72 nodes, ~3000 cores
- Direct attached storage
- Ethernet switches (10-40 Gbps)
- NFS exported file systems, including /users
- Storage systems: DCS01, DCS02, DCS03, DCL01 (Lustre file system)
- Data transfer server: transfer01.jhsph.edu, on the research network (10-100 Gbps)
Review of technical notions
- Central Processing Unit (CPU)
  - The part of the computer that executes instructions (programs)
- Random Access Memory (RAM)
  - Very fast volatile memory that is used like a scratchpad by the CPU
- Mass Storage (Disk)
  - Where data & apps are kept more or less permanently. Very slow compared to RAM
- Network (ethernet, internet)
  - Computers and devices communicate over networks.
  - These days it's mostly ethernet.
Review of sizes
- Storage and memory sizes
  - 1 byte = 8 bits = 1 character
  - 1 megabyte (MB) = 10^6 bytes
  - 1 gigabyte (GB) = 1000 MB = 10^9 bytes
  - 1 terabyte (TB) = 1000 GB = 10^12 bytes
  - 1 petabyte (PB) = 1000 TB = 10^15 bytes
- Typical sizes
  - USB stick: 4-128 GB
  - Laptop disk drive: 250-1000 GB
  - Enterprise storage appliance: 100 TB
  - Scale-out cluster storage: > 1 PB
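The decimal unit definitions above can be checked with a few lines of Perl. This is only a minimal sketch; the helper name `convert_size` and the lookup table are hypothetical, introduced for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decimal (SI) unit sizes in bytes, matching the definitions in the list.
my %BYTES_PER = (
    B  => 1,
    MB => 10**6,
    GB => 10**9,
    TB => 10**12,
    PB => 10**15,
);

# Hypothetical helper: convert a quantity from one unit to another.
sub convert_size {
    my ($value, $from, $to) = @_;
    die "unknown unit\n"
        unless exists $BYTES_PER{$from} && exists $BYTES_PER{$to};
    return $value * $BYTES_PER{$from} / $BYTES_PER{$to};
}

printf "1 PB   = %d TB\n",   convert_size(1,   'PB', 'TB');
printf "250 GB = %.2f TB\n", convert_size(250, 'GB', 'TB');
```

Run as is, it prints that 1 PB is 1000 TB and that a 250 GB laptop disk is 0.25 TB.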
Key technical notions (Network)
- Bandwidth
  - How much data per second can you pump through a pipe?
  - Measured in gigabits per second (Gbps).
- Latency
  - How long does it take for that first piece of data to get through?
  - Measured in nano-, micro- or (gasp!) milliseconds
- A practical demonstration
  - http://speedtest.comcast.net
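To see how bandwidth and latency combine, a rough estimate of transfer time is latency plus payload size divided by bandwidth. The sketch below assumes an idealized link with no protocol overhead. The helper name `transfer_seconds` is hypothetical, and the 0.5 ms latency is only an illustrative value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: idealized transfer time in seconds.
#   $bytes     payload size in bytes
#   $gbps      link bandwidth in gigabits per second
#   $latency_s time for the first piece of data to arrive, in seconds
sub transfer_seconds {
    my ($bytes, $gbps, $latency_s) = @_;
    my $bits = $bytes * 8;                      # bandwidth is quoted in bits
    return $latency_s + $bits / ($gbps * 1e9);
}

# Assumed latency of 0.5 ms (illustrative only).
printf "1 GB over 1 Gbps:  %.2f s\n", transfer_seconds(1e9, 1,  0.0005);
printf "1 GB over 10 Gbps: %.2f s\n", transfer_seconds(1e9, 10, 0.0005);
printf "1 KB over 1 Gbps:  %.6f s\n", transfer_seconds(1e3, 1,  0.0005);
```

For the large transfers, bandwidth dominates. For the 1 KB transfer, almost all of the time is latency.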
Time scales for data transfer
Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns              14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns              20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns
Send 1K bytes over 1 Gbps network       10,000   ns    0.01 ms
Read 4K randomly from SSD*             150,000   ns    0.15 ms
Read 1 MB sequentially from memory     250,000   ns    0.25 ms
Round trip within same datacenter      500,000   ns    0.5  ms
Read 1 MB sequentially from SSD*     1,000,000   ns    1    ms   4X memory
Disk seek                           10,000,000   ns   10    ms   20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20    ms   80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150    ms
What if we multiply all the time scales by 1 billion to humanize them?
Operation                                      Scaled time   Comparable to
Main memory reference                          1.6 min       Brushing your teeth
Send 2KB over 1 Gbps network                   5.5 hr        From lunch to end of work day
Read 1 MB sequentially from memory             2.9 days      A long weekend
Round trip within same datacenter              5.8 days      A medium vacation
Reading 1MB from SSD
  SSD random read                              1.7 days      A normal weekend
  SSD read 1 MB sequentially                   11.6 days     Waiting for almost 2 weeks for a delivery
Reading 1MB from Disk
  Disk seek                                    16.5 weeks    A semester in university
  Read 1 MB sequentially from disk             7.8 months    Almost producing a new human being
  Total time                                   1 year
Internet packet
  Round trip from California to Netherlands    4.8 years     Average time it takes to complete a bachelor's degree
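The scaling is simple arithmetic: multiplying by one billion turns each nanosecond into one second, so a latency of N ns becomes N seconds. The sketch below redoes that conversion. The helper name `humanize_ns` and the unit table are hypothetical, and it rounds to one decimal place, so a result can differ in the last digit from the slide's figures.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Units from largest to smallest, with their length in seconds.
my @UNITS = (
    [ 'years',   365 * 86400 ],
    [ 'days',    86400 ],
    [ 'hours',   3600 ],
    [ 'minutes', 60 ],
);

# Hypothetical helper: scale a latency in nanoseconds by 1e9 and
# express it in the largest unit in which it is at least 1.
sub humanize_ns {
    my ($ns) = @_;
    my $seconds = $ns;    # (ns * 1e-9 s) * 1e9 = the same number, in seconds
    for my $unit (@UNITS) {
        my ($name, $size) = @$unit;
        return sprintf('%.1f %s', $seconds / $size, $name)
            if $seconds >= $size;
    }
    return sprintf('%.1f seconds', $seconds);
}

print 'Round trip within same datacenter: ', humanize_ns(500_000),     "\n";
print 'Send packet CA->Netherlands->CA:   ', humanize_ns(150_000_000), "\n";
```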
The main lesson
- Do as much computing as you can in RAM
- Avoid disk I/O as much as possible
- If you must go to the disk, read entire files at a time rather than fetching one record at a time.
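A minimal sketch of the "entire file at a time" pattern in Perl follows. It assumes the file fits comfortably in RAM. The helper name `slurp_lines` and the scratch file are hypothetical, created only so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example runs on its own.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "record $_\n" for 1 .. 1000;
close $fh or die "close failed: $!";

# Hypothetical helper: read the whole file in one request, then split
# it into records in memory.
sub slurp_lines {
    my ($file) = @_;
    open my $in, '<', $file or die "cannot open $file: $!";
    local $/;                    # undefined record separator: read everything
    my $content = <$in>;
    close $in;
    return [ split /\n/, $content ];
}

my $lines = slurp_lines($path);
printf "read %d records in one pass\n", scalar @$lines;
```

Perl already buffers ordinary line-by-line reads. The pattern matters most when the alternative is reopening or seeking in the file for every record.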
BioPerl on the Cluster
BioPerl
- BioPerl provides object-oriented software modules for many of the typical tasks of bioinformatics programming:
  - Manipulating individual sequences
  - Accessing genomic data directly from databases
  - Transforming formats of database/file records
  - Searching for "similar" sequences
  - Creating and manipulating sequence alignments
  - Searching for genes and other structures in DNA
  - Developing machine-readable sequence annotations
www.bioperl.org
Bioperl Module Groups
BioPerl IO modules
SeqIO          FASTA, GenBank, EMBL, etc.
SearchIO       BLAST, FASTA, HMMER
AlignIO        ClustalW, Phylip, MSF, etc.
TreeIO         Newick, Nexus, NHX
MapIO          MapMaker
Matrix::IO     Scoring, Phylip
Assembly::IO   Ace, Phrap
Ontology::IO   InterPro, GO, SO
More…
Bio::SeqIO
- The principal class for input/output
- Methods
  - new -- opens a new seqstream for I/O
  - next_seq -- gets the next entry in the input seqstream
  - write_seq -- writes to a seqstream
  - there is more…
- Refer to the web site for documentation
- Example: format conversion
UniProtKB/Swiss-Prot format
Each sequence entry is composed of lines. Each line begins with a two-character code, which indicates the type of data contained in the line.
Code      Content
ID        Identification
AC        Accession number(s)
DT        Date
DE        Description
GN        Gene name(s)
OS        Organism species
OG        Organelle
OC        Organism classification
RN        Reference number
RP        Reference position
RC        Reference comments
RX        Reference cross-references
RA        Reference authors
RL        Reference location
CC        Comments or notes
DR        Database cross-references
KW        Keywords
FT        Feature table data
SQ        Sequence header
(blanks)  Sequence data
//        Termination line
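To illustrate the line-code layout, here is a small sketch in plain Perl that groups the lines of one entry by their two-character code and collects the sequence data. The record is a made-up, heavily abbreviated example (the ID and accession are hypothetical), and `parse_entry` is a hypothetical helper, not part of BioPerl. For real data, Bio::SeqIO with the 'swiss' format is the appropriate tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up, abbreviated entry used only for illustration.
my $record = <<'END';
ID   EXAMPLE_HUMAN           Reviewed;          12 AA.
AC   X00000;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
END

# Hypothetical helper: returns a hash of line code => list of line
# contents, plus the sequence with internal whitespace removed.
sub parse_entry {
    my ($text) = @_;
    my %fields;
    my $sequence = '';
    for my $line (split /\n/, $text) {
        last if $line =~ m{^//};                  # termination line
        if ($line =~ /^([A-Z]{2})\s+(.*)$/) {     # two-character line code
            push @{ $fields{$1} }, $2;
        }
        elsif ($line =~ /^\s+(\S.*)$/) {          # blank code: sequence data
            (my $residues = $1) =~ s/\s+//g;
            $sequence .= $residues;
        }
    }
    return (\%fields, $sequence);
}

my ($fields, $sequence) = parse_entry($record);
print "accession: $fields->{AC}[0]\n";
print "organism:  $fields->{OS}[0]\n";
print "sequence:  $sequence (", length($sequence), " residues)\n";
```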
Swiss-Prot to FASTA format conversion (swiss2fasta.pl)

    #!/usr/bin/perl -w
    use strict;
    use Bio::SeqIO;

    # create a SeqIO object for the input stream
    my $in = Bio::SeqIO->new('-file'   => "sprot.txt",
                             '-format' => 'swiss');

    # create a SeqIO object for the output stream
    # (the leading ">" opens the file for writing)
    my $out = Bio::SeqIO->new('-file'   => ">sprot.fasta",
                              '-format' => 'fasta');

    # read the input stream and write to the output stream
    # one record at a time
    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }
Example: Remote database query

    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;

    my ($gb, $seq1, $seq2, $seq_id);

    # use eval to test for success of code block
    eval { $gb = Bio::DB::GenBank->new() };
    if ($@) {
        die "Warning: Couldn't connect to GenBank";
    }

    # get by sequence id
    $seq1   = $gb->get_Seq_by_id('MUSIGHBA1');
    $seq_id = $seq1->display_id();
    print "got seq1 display id is $seq_id \n";

    # get by accession number
    $seq2   = $gb->get_Seq_by_acc('AF303112');
    $seq_id = $seq2->display_id();
    print "got seq2 display id is $seq_id \n";

    # get a bunch of sequences by id or accession number
    my $seqio = $gb->get_Stream_by_id([ qw(2981014 J00522 AF303112) ]);
    while ( my $seqobj = $seqio->next_seq() ) {
        print $seqobj->display_id(), "\n";
        print $seqobj->seq() . "\n\n";
    }