37
Tutorial 1: An Introduc1on to Linux for NextGen DNA Sequencing Data Analysis: A Handson Tutorial Dr Stratos Efstathiadis [email protected] Technical Director High Performance Compu1ng Facility Center for Health Informa1cs and Bioinforma1cs NYULMC

Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Tutorial  1:  An  Introduc1on  to  Linux  for  Next-­‐Gen  

DNA  Sequencing  Data  Analysis:  A  Hands-­‐on  Tutorial  

Dr  Stratos  Efstathiadis  [email protected]  

Technical  Director  High  Performance  Compu1ng  Facility  

Center  for  Health  Informa1cs  and  Bioinforma1cs  NYULMC  

Page 2: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

This is the First Part in a Series of tutorials in High Performance Computing and Sequencing Informatics

•  This tutorial will cover: –  Running basic Linux commands

•  Logging-in, Files and Directories, File Editing, Running command line tools.

–  How to map millions of Next-Gen DNA reads onto a Genome –  A complete Next-Gen Seq. Read Alignment Example –  Using the Integrative Genome Viewer (IGV)

•  This tutorial will Not cover: –  Sequencing technologies or alignment algorithms in detail. –  Advanced Linux concepts.

•  File and Directory permissions, Environment variables, Shell scripts, batch job submission, etc. will be covered in another tutorial.

Page 3: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Bioinformatics Requires Powerful Computers

•  One definition of bioinformatics is: “The use of computers to analyze biological problems.”

•  As biological data sets have grown larger and biological problems have become more complex, the requirements for computing power have also grown.

•  Computers that can provide this power generally use the Unix operating system - so you must learn Unix

Page 4: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Will  computers  crash  Genomics?  SCIENCE,  11  February  2011,  pg  666  

Page 5: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Remote Login using Secure Shell (ssh)

•  Secure Shell: A set of tools that allow secure interaction with remote servers (ssh, sshd, ssh-add, ssh-keygen, ssh-agent, etc.) using two-factor authentication, based on –  What you know (a pass phrase) –  What you have (a key)

–  ssh is the de-facto remote login mechanism in Linux/Unix. rlogin, rsh, telnet, etc. use insecure protocols (transmit clear

text passwords) and are not used (actually, blocked).

Page 6: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

How to login remotely on a Linux server using Secure Shell (ssh)

Username:    your  NYULMC  Kerberos  id  Hostname:  ec2-­‐aa-­‐ddd-­‐cc-­‐nn.compute-­‐1.amazonaws.com  

•  Troubleshoo1ng:  ssh  –vvvv  username@hostname  

•  The  ssh-­‐agent  stores  the  key  in  memory  (of  your  laptop/desktop).  

$ ssh [email protected]

Page 7: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

First Linux Commands -bash-3.2$ date Wed Jan 2 13:28:49 EST 2013

-bash-3.2$ pwd (Present/current Working Directory) /home/username

-bash-3.2$ ls (list files in the pwd)

-bash-3.2$ touch myfile (creates an empty file)

Linux  file  names  and  commands  are  Case  Sensi?ve:  myfile  is  different  than  MyFile  

-bash-3.2$ LS -­‐bash:  LS:  command  not  found  

Home  directory  

Page 8: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Working  with  Directories  

•  Directories  are  a  means  of  organizing  your  files  on  a  Unix  computer.    – They  are  equivalent  to  folders  on  Windows  and  Macintosh  computers    

•  Directories  contain  files,  executable  programs,  and  sub-­‐directories    

•  Understanding  how  to  use  directories  is  crucial  to  manipula1ng  your  files  on  a  Unix  system.    

Page 9: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

The  Tree  structure  of  the  file  system  

Page 10: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Your  Home  Directory  

•  When  you  login  to  the  server,  you  always  start  in  your  Home  directory.  

•  Create  sub-­‐directories  to  store  specific  projects  or  groups  of  informa1on,  just  as  you  would  place  folders  in  a  filing  cabinet.  

•  Do  not  accumulate  thousands  of  files  with  cryp1c  names  in  your  Home  directory  

Page 11: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Shortcuts  

•  There  are  some  important  shortcuts  in  Unix  for  specifying  directories  

•   .  (dot)  means  "the  current  directory"    

•   ..  means  "the  parent  directory"  -­‐  the  directory  one  level  above  the  current  directory,  so  cd  ..  will  move  you  up  one  level  

•   ~  (1lde)  means  your  Home  directory,  so  cd  ~  will  move  you  back  to  your  Home.  

–  Just  typing  a  plain  cd  will  also  bring  you  back  to  your  home  directory  

Page 12: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

File  &  Directory  Commands  •  This  is  a  minimal  list  of  Unix  commands  that  you  must  know  for  file  management:  

ls (list) mkdir (make directory)

cd (change directory) rmdir (remove directory)

cp (copy) pwd (present working directory)

mv (move) more (view by page)

rm (remove) cat (view entire file on screen)

•  All  of  these  commands  can  be  modified  with  many  op1ons.  Learn  to  use  Unix  ‘man’  pages  for  more  informa1on.  

Page 13: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Copy  &  Move  •  cp  lets  you  copy  a  file  from  any  directory  to  any  other  directory,  or  create  a  copy  of  a  file  with  a  new  name  in  one  directory  

•  cp filename.ext newfilename.ext•  cp filename.ext subdir/newname.ext•  cp /u/jdoe01/filename.ext ./subdir/newfilename.ext

•  mv  allows  you  to  move  files  to  other  directories,  but  it  is  also  used  to  rename  files.    –  Filename  and  directory  syntax  for  mv  is  exactly  the  same  as  for  the  cp  command.    

•  mv filename.ext subdir/newfilename.ext

–  NOTE:  When  you  use  mv  to  move  a  file  into  another  directory,  the  current  file  is  deleted.    

Page 14: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Delete  •  Use  the  command  rm  (remove)  to  delete  files  

•  There  is  no  way  to  undo  this  command!!!  – We  have  set  the  server  to  ask  if  you  really  want  to  remove  each  file  before  it  is  deleted.  

– You  must  answer  “Y”  or  else  the  file  is  not  deleted.  

> ls af151074.gb_pr5 test.seq > rm test.seq rm: remove test.seq? y > ls af151074.gb_pr5

Page 15: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Exercise:  Working  with  Files  and  Directories  

-bash-3.2$ mkdir project1 (Make Directory) -bash-3.2$ cd project1 (Change Directory) -bash-3.2$ pwd /home/efstae01/project1 -bash-3.2$ cp /data/tutorial/ChIPseq_chr19.fastq . -bash-3.2$ ls –la

-bash-3.2$ head ChIPseq_chr19.fastq @HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79## @HWUSI-EAS610_0001:3:1:5:1490#0/1 GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG + B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? @HWUSI-EAS610_0001:3:1:6:388#0/1 CAGAGTTCATGAAATAGGCCTCTAGTCTTCCTAGAC

Page 16: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

FASTQ  Files  

@HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##

36  bp  read  

•  The de-facto file format for sharing sequence read data •  Sequence and a per-base quality score •  FASTQ Files are Text files •  4 Lines per read -bash-3.2$ wc –l ChIPseq_chr19.fastq 1082524 ChIPseq_chr19.fastq

36  Quality  scores  

Page 17: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How
Page 18: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Star?ng  emacs  

•  To  start  Emacs,  at  the  >  command  prompt,  just  type:    emacs    

•  To  use  Emacs  to  edit  a  file,  type:      emacs  filename  

(where  filename  is  the  name  of  your  file)  

•  When  Emacs  is  launched,  it  opens  either  a  blank  text  window  or  a  window  containing  the  text  of  an  exis1ng  file.    

Page 19: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Save  &  Exit  •  To  save  a  file  as  you  are  working  on  it,  type:        Ctrl-­‐x  »  Ctrl-­‐s  

•  To  exit  emacs  and  return  to  the  Unix  shell,  type:    Ctrl  -­‐x  »  Ctrl  -­‐c  

 If  you  have  made  any  changes  to  the  file,  Emacs  will  ask  you  if  you  want  to  save:    

Save file /u/browns02/nrdc.msf? (y,n,!,.,q,C-r or C-h)•  Type  “y”  to  save  your  changes  and  exit  •  If  you  type  “n”,  then  it  will  ask  again:  

Modified buffers exist; exit anyway? (yes or no)•  If  you  answer  “no”,  then  it  will  return  you  to  the  file,  you  must  answer  “yes”  to  exit  without  saving  changes  

Page 20: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Copying  data  using  Secure  Copy  (scp)  

Exercise:    Login  on  the  tutorial  Amazon  Instance  -­‐bash-­‐3.2$  cd  project1  -­‐bash-­‐3.2$  fastqc  ChIPseq_chr19.fastq  -­‐bash-­‐3.2$  ls  –l    Copy  the  file  ChIPseq_chr19.fastq.zip  to  your  laptop  using  scp:  

Laptop>  scp  user@ec2-­‐54-­‐242-­‐89-­‐16.compute-­‐1.amazonaws.com:~/project1/ChIPseq_chr19_fastqc.zip  .  

Unzip  it:  Laptop>  unzip  ChIPseq_chr19_fastqc.zip  

Open  with  your  browser  fastqc-­‐report.html  

Page 21: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Short  Read  Alignment  •  Given  a  reference  and  a  set  of  reads,  report  at  least  one  “good”  local  alignment  for  each  read  if  one  exists  –  Approximate  answer  to:  where  in  genome  did  the  read  originate?  

…TGATCATA… GATCAA

…TGATCATA… GAGAAT

beqer  than  

•  What is “good”? For now, we concentrate on:

…TGATATTA… GATcaT

…TGATcaTA… GTACAT

beqer  than  

–  Fewer mismatches is better –  Failing to align a low-quality

base is better than failing to align a high-quality base

Page 22: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Short  Read  Applica1ons  

Finding  the  alignments  is  typically  the  performance  boqleneck  

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…  GCGCCCTA  

GCCCTATCG  GCCCTATCG  

CCTATCGGA  CTATCGGAAA  

AAATTTGC  AAATTTGC  

TTTGCGGT  TTGCGGTA  

GCGGTATA  

GTATAC…  

TCGGAAATT  CGGAAATTT  

CGGTATAC  

TAGGCTATA  

GCCCTATCG  GCCCTATCG  

CCTATCGGA  CTATCGGAAA  

AAATTTGC  AAATTTGC  

TTTGCGGT  

TCGGAAATT  CGGAAATTT  CGGAAATTT  

AGGCTATAT  AGGCTATAT  AGGCTATAT  GGCTATATG  

CTATATGCG  

…CC  …CC  …CCA  …CCA  …CCAT  

ATAC…  C…  C…  

…CCAT  …CCATAG   TATGCGCCC  

GGTATAC…  CGGTATAC  

GGAAATTTG  

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…  ATAC…  …CC  

 GAAATTTGC  

Page 23: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Indexing  

•  Genomes  and  reads  are  too  large  for  direct  approaches  like  dynamic  programming  

•  Indexing  is  required  

•  Choice  of  index  is  key  to  performance  

Suffix  tree   Suffix  array   Seed  hash  tables  Many  variants,  incl.  spaced  seeds  

Page 24: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

•  Invented  by  David  Wheeler  in  1983  (Bell  Labs).  Published  in  1994.              “A  Block  Sor@ng  Lossless  Data  Compression  Algorithm”                  Systems  Research  Center  Technical  Report  No  124.  Palo  Alto,  CA:  Digital  Equipment  

Corpora@on,  Burrows  M,  Wheeler  DJ.  1994    

•  Originally  developed  for  compressing  large  files  (bzip2,  etc.)  

•  Lossless,  Fully  Reversible  

•  Alignment  Tools  based  on  BWT:  bow@e,  BWA,  SOAP2,  etc.  

•  Approach:  –  Align  reads  on  the  transformed  reference  genome,  using  an  efficient  index  (FM  index)  –  Solve  the  simple  problem  first  (align  one  character)  and  then  build  on  that  solu1on  to  solve  a  

slightly  harder  problem  (two  characters)  etc.  

•  Results  in  great  speed  and  efficiency  gains  (a  few  GigaByte  of  RAM  for  the  en1re  H.  Genome).  Other  approaches  require  tens  of  GigaBytes  of  memory  and  are  much  slower.  

NGS  Read  Alignment  Burrows  Wheeler  Transforma1on  (BWT)  

Page 25: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How
Page 26: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Generating the SAM file

-­‐bash-­‐3.2$  cd  project1  -­‐bash-­‐3.2$  bwa  aln  /data/tutorial/mm9.fasta      ChIPseq_chr19.fastq  >  

ChIPseq_chr19.sai  bwa_aln]  17bp  reads:  max_diff  =  2  [bwa_aln]  38bp  reads:  max_diff  =  3  [bwa_aln]  64bp  reads:  max_diff  =  4  [bwa_aln]  93bp  reads:  max_diff  =  5  [bwa_aln]  124bp  reads:  max_diff  =  6  [bwa_aln]  157bp  reads:  max_diff  =  7  [bwa_aln]  190bp  reads:  max_diff  =  8  [bwa_aln]  225bp  reads:  max_diff  =  9  [bwa_aln_core]  calculate  SA  coordinate...  25.94  sec  [bwa_aln_core]  write  to  the  disk...  0.03  sec  [bwa_aln_core]  262144  sequences  have  been  processed.  [bwa_aln_core]  calculate  SA  coordinate...  0.86  sec  [bwa_aln_core]  write  to  the  disk...  0.00  sec  [bwa_aln_core]  270631  sequences  have  been  processed.  [main]  Version:  0.6.2-­‐r126  [main]  CMD:  bwa  aln  /data/tutorial/mm9.fasta  ChIPseq_chr19.fastq  [main]  Real  1me:  28.230  sec;  CPU:  28.229  sec  

-­‐bash-­‐3.2$  bwa  samse  /data/tutorial/mm9.fasta  ChIPseq_chr19.sai  ChIPseq_chr19.fastq    >  ChIPseq_chr19.sam  

[bwa_aln_core]  convert  to  sequence  coordinate...  3.39  sec  [bwa_aln_core]  refine  gapped  alignments...  0.43  sec  [bwa_aln_core]  print  alignments...  0.50  sec  [bwa_aln_core]  262144  sequences  have  been  processed.  [bwa_aln_core]  convert  to  sequence  coordinate...  1.75  sec  [bwa_aln_core]  refine  gapped  alignments...  0.31  sec  [bwa_aln_core]  print  alignments...  0.01  sec  [bwa_aln_core]  270631  sequences  have  been  processed.  [main]  Version:  0.6.2-­‐r126  [main]  CMD:  bwa  samse  /data/tutorial/mm9.fasta  ChIPseq_chr19.sai  ChIPseq_chr19.fastq  [main]  Real  1me:  6.710  sec;  CPU:  6.711  sec  

Page 27: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

more  •  Use  the  command  more  to  view  at  the  contents  of  a  file  one  screen  at  a  1me:    

   > -bash-3.2$ more ChIPseq_chr19.sam @SQ SN:chr10 LN:129993255

@SQ SN:chr11 LN:121843856@SQ SN:chr12 LN:121257530@SQ SN:chr13 LN:120284312@SQ SN:chr14 LN:125194864@SQ SN:chr15 LN:103494974@SQ SN:chr16 LN:98319150@SQ SN:chr17 LN:95272651@SQ SN:chr18 LN:90772031@SQ SN:chr19 LN:61342430@SQ SN:chr1LN:197195432@SQ SN:chr2LN:181748087@SQ SN:chr3LN:159599783@SQ SN:chr4LN:155630120@SQ SN:chr5LN:152537259@SQ SN:chr6LN:149517037@SQ SN:chr7LN:152524553@SQ SN:chr8LN:131738871@SQ SN:chr9LN:124076172@SQ SN:chrMLN:16299@SQ SN:chrXLN:166650296@SQ SN:chrYLN:15902555HWUSI-EAS610_0001:3:1:4:1405#0 16 chr1960874227 37 36M * 0 0

CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9T26HWUSI-EAS610_0001:3:1:5:1490#0 0 chr1932960373 37 36M * 0 0

GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36HWUSI-EAS610_0001:3:1:6:388#016 chr1918177553 37 36M * 0 0

GTCTAGGAAGACTAGAGGCCTATTTCATGAACTCTG @AABAAB@@B@?BBBABABA;?<CBBBBBA>C;BBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36HWUSI-EAS610_0001:3:1:7:1045#0 16 chr1960698168 37 36M * 0 0

ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9AB87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36Hit  the  spacebar  to  page  down  through  the  file  

–  Ctrl-­‐U  moves  back  up  a  page  –  At  the  boqom  of  the  screen,  more  shows  how  much  of  the  file  has  been  displayed  

Page 28: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

hqp://samtools.sourceforge.net/SAM1.pdf  

Page 29: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

SAM/BAM  format  

V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1 V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1 V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2 V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1

Query  Name   Ref  sequence  

posi1on  of  alignment  

query  sequence  (same  strand  as  ref)   query  quality  

@HD VN:1.0 @SQ SN:chr20 AS:HG18 LN:62435964 @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891 @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891

Header  sec1on  

Alignment  sec1on  

Page 30: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How
Page 31: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Converting the SAM file to a BAM file

BAM  file  B:      Binary  format  vs  Text  format  (SAM),  resul1ng  in  more  efficient  data  storage.  Sequencing  plazorm  independent.  

-­‐bash-3.2$ samtools view -bt /data/tutorial/mm9.fasta -o ChIPseq_chr19.bam ChIPseq_chr19.sam

[samopen] SAM header is present: 22 sequences.

-bash-3.2$ samtools sort ChIPseq_chr19.bam ChIPseq_chr19.sorted

-bash-3.2$ samtools index ChIPseq_chr19.sorted.bam

-bash-3.2$ samtools view in.bam chr2:20,100,000-20,200,000 > out.bam

-bash-3.2$ samtools view lib15103.sorted.rmdup.bam "gi|121635883|ref|NC_008769.1|:2,100,000-2,200,000"

Page 32: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

BAM Files: Using the bitwise FLAG -f : bit set -F: bit is not set

Counting Alignments: -bash-3.2$ samtools view -c ChIPseq_chr19.sorted.bam 270631

Count unmapped reads –f 4 (the bitwise flag 0x004 is set) -bash-3.2$ samtools view -c -f 4 ChIPseq_chr19.sorted.bam 1839

Count the reads that are not unmapped (hence, count mapped alignments) –F 4 (the bitwise flag 0x0004 is Not set)

-bash-3.2$ samtools view -c -F 4 ChIPseq_chr19.sorted.bam 268792

-bash-3.2$ samtools flagstat ChIPseq_chr19.sorted.bam

Page 33: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How
Page 34: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Putting it all together in a shell script (use emacs) /data/tutorial/first

REFERENCE=/data/tutorial/mm9.fasta  

FASTQ=ChIPseq_chr19.fastq  SAI=ChIPseq_chr19.sai  

SAM=ChIPseq_chr19.sam  BAM=ChIPseq_chr19.bam  

SORTED=ChIPseq_chr19.sorted  

echo  $REFERENCE;  echo  $FASTQ;  echo  $SAI  

bwa  aln  $REFERENCE  $FASTQ  >  $SAI  bwa  samse  $REFERENCE  $SAI  $FASTQ    >  $SAM  

samtools  view  -­‐bt  $REFERENCE  -­‐o  $BAM  $SAM        samtools  sort  $BAM  $SORTED  

Page 35: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

The  screen  command  

•  Run  “screen”  before  you  run  your  command  •  Detach  by  pressing:  Control-­‐a  d  •  Logout  

•  Login  again  •  Re-­‐aOach  by  running  the  command:  “screen  –r”  

•  You  may  have  several  screen  sessions  acRve  

Page 36: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Using IGV to view alignments

The  Integra1ve  Genome  Viewer  can  be  installed  on  your  laptop  

wget  hqp://www.broadins1tute.org/igv/projects/downloads/IGV_2.1.22.zip    

unzip  IGV_2.1.22.zip    

cd  IGV_2.1.22    

java  -­‐Xmx2g  -­‐jar  igv.jar  

You  will  need  to  transfer  to  your  laptop:  •  Indexed  Reference  FASTA  file  •  Reference  Genome  Feature  File  (GFF)  •  Sorted  and  indexed  mapped  read  BAM  files  

Page 37: Tutorial)1:) An)Introduc1on)to)Linux)for)Next7Gen) DNA ...– Running basic Linux commands • Logging-in, Files and Directories, File Editing, Running command line tools. – How

Using IGV to view alignments