
NGUYEN HOANG BACH, MSc.

Sassari, 2011

WHOLE GENOME ASSEMBLY AND ANALYSIS SHORT REPORT

Supervisor

MASSIMO DELIGIOS, PhD. Prof. PIERO CAPPUCINELLI

DIVISION OF CLINICAL AND EXPERIMENTAL MICROBIOLOGY

DEPARTMENT OF BIOMEDICAL SCIENCES

UNIVERSITY OF SASSARI


Nguyen Hoang Bach, MSc. Page 1

Part 01 Install Cygwin for Windows 7 OS

Install Velvet 1.0.19

Create contig with Velvet

A. Install Cygwin with perl, C++ compiler, debugger, and make for Windows 7 OS

Cygwin is:

a collection of tools which provide a Linux look and feel environment for Windows.

a DLL (cygwin1.dll) which acts as a Linux API layer providing substantial Linux API functionality. The Cygwin DLL currently works with all recent, commercially released x86 32 bit and 64 bit versions of Windows, with the exception of Windows CE (see note 1).

Note 1: Windows CE (now officially known as Windows Embedded Compact, previously known as Windows Embedded CE, and sometimes abbreviated WinCE) is an operating system developed by Microsoft for embedded systems. Windows CE is a distinct operating system and kernel, rather than a trimmed-down version of desktop Windows. It is not to be confused with Windows XP Embedded, which is NT-based.

The full Cygwin package list is at http://www.cygwin.com/packages/. The relevant packages are:

gcc-g++                  GCC-3 Series legacy compiler: C++ compiler
gdb                      The GNU Debugger
make                     The GNU version of the 'make' utility
perl                     Larry Wall's Practical Extracting and Report Language
perl-Error               Perl module for OO error/exception handling
perl-ExtUtils-Depends    Build Perl XS that depend on other XS
perl-ExtUtils-PkgConfig  Perl module for using pkg-config
perl-Graphics-Magick     GraphicsMagick Perl bind (PerlMagick)
perl-Image-Magick        Image manipulation software suite (Perl bindings)
perl-Locale-gettext      Perl module for using gettext and libintl
perl-SGMLSpm             Perl SGMLS parser module
perl-Tk                  Perl interface for Tk (X11)
perl-Win32-GUI           Perl Win32-GUI module
perl-XML-Simple          Perl module for simple XML access
perl-libwin32            Perl extensions for using the Win32 API
perl-ming                A SWF output library - (Perl bindings)
perl_manpages            Perl manpages


The make utility automatically determines which pieces of a large program need to be recompiled, and issues commands to recompile them. GNU make was implemented by Richard Stallman and Roland McGrath; development since version 3.76 has been handled by Paul D. Smith.

GNU make conforms to section 6.2 of IEEE Standard 1003.2-1992 (POSIX.2). It is most commonly used with C programs, but it works with any programming language whose compiler can be run with a shell command. Indeed, make is not limited to programs: it can describe any task where some files must be updated automatically from others whenever the others change.

B. Install Velvet 1.0.19 running on Windows 7 OS with Cygwin (including C++ compiler, debugger, and make)

What is Velvet?

Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), United Kingdom.

Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

What are the memory and time requirements of velvetg?

Both depend on the number and size of the reads to be assembled. The memory requirement can be estimated with the regression equation given below. The speed at which velvetg runs depends on many variables, including CPU type and speed, memory bus speed, size and number of reads, and the value of k, so it is difficult to estimate. For 30 million 36-mers with a k of 29, the initial velvetg run finishes in 15 - 20 minutes on a desktop with 16 GB RAM. Subsequent runs are faster. Therefore, 160 hours seems plenty. The biggest concern will be the memory requirement. The memory estimator is:

Ram required for velvetg (KB) = -109635 + 18977*ReadSize + 86326*GenomeSize + 233353*NumReads - 51092*K

Ram required for velvetg (GB) = Ram required for velvetg (KB) / 1048576

where:

- ReadSize is the read length in bases
- GenomeSize is in millions of bases (Mb)
- NumReads is the number of reads in millions
- K is the k-mer hash length used in velveth


On this system (64-bit Fedora 10, quad core, 16 GB RAM), the results are accurate to within +/- 0.5 - 0.8 GB. For example, for:

- K = 31
- Number of reads = 50 million
- Read size = 36
- Genome size = 5 megabases

the estimator returns ~10.5 GB of RAM required. The regression equation should be fairly valid for the following ranges:

- K = 15 - 31
- NumReads = 5 - 70 million
- Genome size = 2 - 10 megabases
- Read length (size) = 20 - 75 bases
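The memory estimator can be checked with a few lines of Python. This is only a sketch of the regression quoted in this section; the function name velvetg_ram_gb and its argument names are chosen here for illustration and are not part of Velvet.

```python
def velvetg_ram_gb(read_size, genome_size_mb, num_reads_m, k):
    """Estimate velvetg RAM in GB with the regression quoted in the text.

    read_size      -- read length in bases
    genome_size_mb -- genome size in millions of bases
    num_reads_m    -- number of reads in millions
    k              -- k-mer hash length passed to velveth
    """
    ram_kb = (-109635
              + 18977 * read_size
              + 86326 * genome_size_mb
              + 233353 * num_reads_m
              - 51092 * k)
    # 1048576 = 1024 * 1024, converting KB to GB.
    return ram_kb / 1048576


if __name__ == "__main__":
    # Worked example from the text: K = 31, 50 million reads of 36 bases, 5 Mb genome.
    print(f"{velvetg_ram_gb(36, 5, 50, 31):.2f} GB")
```

For the worked example the result lands close to the ~10.5 GB quoted in the text. Outside the stated validity ranges the regression should not be trusted.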

C. Create contigs from the sequence data with Velvet 1.0.19

Step1: Compile Velvet

make

or, to allow k-mer lengths up to 57:

make 'MAXKMERLENGTH=57'

Step2: Combine the two paired-end read files of MT_HUE_20 into a single file

./shuffleSequences_fastq.pl ./data/s_5_1_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq ./data/s_5_2_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq fullseq.fastq

Syntax:

./shuffleSequences_filetype.pl ./[include_path/file1_name] ./[include_path/file2_name] ./[include_path/newfile_name]
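To make clear what the shuffling step produces, here is a minimal Python sketch of interleaving two FASTQ files record by record. It is not the shuffleSequences_fastq.pl script itself. It assumes every FASTQ record spans exactly four lines, and the function names are hypothetical.

```python
from itertools import islice


def interleave_fastq(lines1, lines2):
    """Yield lines that alternate 4-line FASTQ records from two inputs.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, qualities). Stops at the end of the shorter input.
    """
    it1, it2 = iter(lines1), iter(lines2)
    while True:
        rec1 = list(islice(it1, 4))
        rec2 = list(islice(it2, 4))
        if len(rec1) < 4 or len(rec2) < 4:
            break
        yield from rec1
        yield from rec2


def interleave_files(path1, path2, out_path):
    """Write the interleaved records of two FASTQ files to out_path."""
    with open(path1) as f1, open(path2) as f2, open(out_path, "w") as out:
        out.writelines(interleave_fastq(f1, f2))
```

Calling interleave_files with the two read files and fullseq.fastq as arguments would give a file in which each forward read is immediately followed by its mate.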

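Step3: Run velveth (without arguments it prints its usage help)

./velveth

Step4: Run velvetg (without arguments it prints its usage help)

./velvetg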


Step5:

./velveth output_directory hash_length [-file_format] [-read_type] [filename]

output_directory: Velvet_dir/output_dir

hash_length: the length of the k-mers being entered in the hash table.

- It must be an odd number, to avoid palindromes. If we put in an even number, Velvet will just decrement it and proceed.
- It must be below or equal to MAXKMERLENGTH (default 31 bp), because it is stored on 64 bits.
- It must be strictly inferior to the read length, otherwise we simply will not observe any overlaps between reads.

As is often the case, it is a trade-off between specificity and sensitivity. Longer k-mers bring more specificity (i.e. fewer spurious overlaps) but lower coverage (cf. below), so there is a sweet spot to be found with time and experience. It helps to think in terms of "k-mer coverage", i.e. how many times a k-mer has been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is

Ck = C ∗ (L−k+1)/L

where k is the hash length and L the read length. Experience shows that this k-mer coverage should be above 10 to start getting decent results. If Ck is above 20, we might be "wasting" coverage. Experience also shows that empirical tests with different values for k are not that costly to run!
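The k-mer coverage relation is easy to tabulate for a few candidate hash lengths. The sketch below implements Ck = C ∗ (L−k+1)/L directly; the function name and the example values (nucleotide coverage 100, read length 36) are illustrative assumptions, not measurements from this project.

```python
def kmer_coverage(coverage, read_length, k):
    """Return the k-mer coverage Ck = C * (L - k + 1) / L."""
    if k >= read_length:
        raise ValueError("k must be strictly smaller than the read length")
    return coverage * (read_length - k + 1) / read_length


if __name__ == "__main__":
    # Hypothetical library: nucleotide coverage 100, 36-base reads.
    for k in (21, 25, 29, 31):
        print(k, round(kmer_coverage(100, 36, k), 1))
```

Scanning a few odd values of k this way gives a quick first guess at which hash lengths keep Ck inside the 10 to 20 band mentioned in the text.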

-file_format: the supported formats are fasta (default), fastq, fasta.gz, fastq.gz, eland and gerald.

-read_type: the read categories are:

- short (default)
- shortPaired
- short2 (same as short, but for a separate insert-size library)
- shortPaired2 (see above)
- long (for Sanger, 454 or even reference sequences)
- longPaired

filename: the input file name, including its path.

For example:

./velveth contig 31,45,2 -fastq -shortPaired seq/sequences-data1.fastq seq/sequences-data2.fastq

Here the hash length is specified as 31,45,2, which runs velveth with hash lengths from 31 to 43 in steps of 2 (note: k-mers have to be odd). This creates seven directories named contig_31 .. contig_43. To save disk space, the Sequences file is symbolically linked by Velvet to the first directory (in this case contig_31).
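The directory names that result from a hash-length specification such as 31,45,2 can be listed with a short Python sketch. It assumes, as described in this section, that the upper bound is exclusive (31 to 43 for 31,45,2); the function name hash_lengths is hypothetical.

```python
def hash_lengths(spec):
    """Expand a 'start,stop,step' specification into the list of k values.

    Assumption taken from the text: the stop value itself is not used,
    so '31,45,2' gives 31, 33, ..., 43.
    """
    start, stop, step = (int(part) for part in spec.split(","))
    return list(range(start, stop, step))


if __name__ == "__main__":
    for k in hash_lengths("31,45,2"):
        print(f"contig_{k}")
```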

Step6: Running velvetg and determining optimal K

./velvetg contig_33 -exp_cov 396.0 -ins_length1 300 -ins_length2 3000


The expected coverage parameter was estimated by first counting the number of reads in each library with grep piped to wc (word count):

grep "@HWI-EAS210R_0001" 3kb_mp_shuffled.fastq | wc

8362680 8362680 342363112

grep "@HWI-EAS210R_0001" 300bp_pe_shuffled.fastq | wc

6069248 6069248 248522420

The first number in this output is the number of lines that match the grep pattern. We can arrive at the expected coverage by multiplying those counts by the length of reads in each library and dividing by the total length of the genome (or our best estimate of it). So to calculate the expected coverage we could perform the following calculation:

((8362680 * 38) + (6069248 * 54)) / 1,630,000 = 396

It is important to note here that we can increase the value of the -exp_cov parameter and we may see an improvement in the n50 of the assembly, but it may also produce mis-assemblies.
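The same expected-coverage arithmetic can be written as a small Python function, which makes it easy to repeat for other libraries. The function name and the (read count, read length) tuple layout are illustrative choices; the numbers are the ones quoted in the text.

```python
def expected_coverage(libraries, genome_length):
    """Total sequenced bases divided by the (estimated) genome length.

    libraries     -- iterable of (number_of_reads, read_length) pairs
    genome_length -- genome length in bases
    """
    total_bases = sum(n_reads * read_len for n_reads, read_len in libraries)
    return total_bases / genome_length


if __name__ == "__main__":
    # Read counts from the grep | wc output, read lengths 38 and 54 bases.
    libs = [(8362680, 38), (6069248, 54)]
    print(round(expected_coverage(libs, 1_630_000)))
```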

When velvetg finishes it will output the number of nodes, n50, and max and total size of the assembly created.

If we look in the contig_* directory, we will also see a few files:

contigs.fa Graph LastGraph Log PreGraph Roadmaps Sequences stats.txt

These files are explained in detail in the Velvet manual, but the most useful files for post-analysis are contigs.fa, Log, and stats.txt. These results should be entered into the spreadsheet at the front of the lab.

Running the following custom script will output the n50 as well as n90 values for this assembly. For Ubuntu Linux users, we will run:

perl /usr/local/bin/calculateN50.pl auto_*/contigs.fa

Where * is the value of k.
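The calculateN50.pl script is a custom script and is not reproduced here. As a rough illustration of what such a script computes, the following Python sketch derives N50 and N90 from the contig lengths in a FASTA file. The function names are hypothetical, and the definition used (the length of the contig at which the running total of lengths, sorted longest first, reaches the given percentage of the assembly) is the common one but is an assumption about the original script.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx(lengths, percent):
    """Contig length at which the longest-first running sum reaches percent % of the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        contig_lengths = fasta_lengths(handle)
    print("N50:", nx(contig_lengths, 50), "N90:", nx(contig_lengths, 90))
```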

We may notice that this n50 value is slightly different from what was reported by Velvet. This is because Velvet reports its n50 (as well as everything else) in k-mer space. For example, the relationship between coverage and k-mer coverage is defined by the following:


Ck = C ∗ (L−k+1)/L

where C is the coverage, L the read length, and k the k-mer length. For other quantities, such as a contig length, it is as simple as adding k-1 to the reported length.

Result:

- Nodes: 2232
- Max length: 94 408 bp
- Min length: 89 bp

Nodes with a short length (<400 bp) can be deleted with software such as Geneious or CLC Genomics Workbench.
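As an alternative to a graphical tool, short nodes can also be dropped with a small script. The Python sketch below is one possible way to do it, not the procedure used in this report. It reads FASTA records and keeps only those of at least 400 bp; the function names and the default threshold argument are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_length=400):
    """Keep only the records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for header, seq in filter_short(read_fasta(handle)):
            print(f">{header}\n{seq}")
```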

Part 02 Assembly - Blast - Mapping - Annotation

Step7: Ligate all the contig nodes obtained from Velvet and create the circular genome with Geneious Pro 4.8.5 (Build 2010-03-04 10:01)

Geneious Pro is a commercial bioinformatics software platform that is both ultra-powerful and easy to use. We are able to search, organize and analyze genomic and protein information via a single desktop program that provides publication ready images to enhance the impact of our research.

- Create a folder and import the contigs.fa file into this folder.

- Sort all nodes in order and select them all.

- Ligate the nodes with Cloning tools -> Ligate Sequences….

- Select Circularize sequences to make a circular genome.

- Export the circular sequence into a new folder and save it as a FASTA file.


Step8: Predict the full set of open reading frames (ORFs) with GeneMarkS (http://exon.gatech.edu/GeneMark/genemarks.cgi)

The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. GeneMarkS can detect prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start.

[Figure: Step-by-step diagram of the GeneMarkS procedure]

Figure 2. (A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. (B) The state 'gene' represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on.


The GeneMarkS web form is filled in as follows:

- Sequence File upload: upload the circular genome.

- Running Options: Use Prokaryotic Version.

- Output Options: enter an email address (to receive the result via email) and select "Translate GeneMarkS predicted genes into proteins" (get a list of protein translations of predicted genes in FASTA format; ideal for a smooth transition to using protein data).

- Run: Start GeneMarkS.

Result:

1. Protein Translation: Copy all the ORF translations and save them into a FASTA file

>Translation: 385..582 (direct), 66 amino acids

MLDLVELLTHWHAGRSQVRLSESLGIDRKTVRKYTAPAIAAGIEPGGEPLSAEQWAELIG

GWFPE*

….

2. Gene List

GeneMark.hmm PROKARYOTIC (Version 2.8)

Date: Wed Apr 20 09:25:23 2011

Sequence file name: sequence

Model file name: GeneMarkS_plus_Heuristic_AT_and_NONC.mod

RBS: Y

Model information: Pseudonative.model

FASTA definition line: empty-FASTA-def-line

Predicted genes

Save the content into a new FASTA file


Step9: Convert the full ORF FASTA file (obtained from GeneMarkS) to tabular format with the Galaxy Tool and edit it with MS Excel

Galaxy Tool: http://main.g2.bx.psu.edu/

Once converted to tabular format, the file can be opened in MS Excel and manipulated easily.

- Upload full_orf_mt_hue_20_sorted.faa and convert it to tabular format.
- Save the tabular file and open it with MS Excel.
- Insert a new column (column A, C1) and fill it with labels (orf_0001 … orf_####).
- Save this tabular file and convert it back to FASTA format.
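The labelling step can also be scripted instead of going through Galaxy and Excel. The Python sketch below is one hypothetical alternative: it prefixes each FASTA header with a running label orf_0001, orf_0002, and so on. Keeping the original header text after the label is an assumption made here; the report does not say exactly how the final headers were laid out.

```python
def number_orfs(lines, prefix="orf_", width=4):
    """Yield FASTA lines with each header prefixed by a numbered ORF label.

    Assumption: the original header text is kept after the new label,
    separated by one space.
    """
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            original = line[1:].strip()
            yield f">{prefix}{counter:0{width}d} {original}\n"
        else:
            yield line


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(number_orfs(handle))
```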

Step10: BLAST the ORFs against the NCBI server via Blast2GO

Blast2GO is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Blast2GO can annotate thousands of sequences in one session. We can follow and modify the annotation process at any stage.

[Figure: Blast2GO pipeline]


Start Blast2GO by Java Web Start

Requirements:

- The minimum requirement to run Blast2GO is a working Java installation (version > 1.5; the latest version is 1.6).

- At least 512 MB of free system memory is required (recommended: 2000-3000 MB).

- A high-speed internet connection.

A. BLAST all ORFs against the NCBI server

- Create a new project, then import full_orf_mt_hue_20_sorted.faa (with the ORF order added).

- Run the BLAST step with the configuration below.

- The BLAST process can be paused, the data saved, and the process resumed later. With the 4757 ORFs of the MT_HUE_20 sample and Blast Hits = 20, it takes about 24 hours with a high-speed internet connection. In this case, however, we use Blast Hits = 5.

- When the BLAST process has finished, export the result as a FASTA file: File > Export > Exports as FASTA


Step11: Create a GFF file to annotate the circular genome of MT_HUE_20

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. Here is a brief description of the GFF fields:

1. seqname - The name of the sequence. Must be a chromosome or scaffold.
2. source - The program that generated this feature.
3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
4. start - The starting position of the feature in the sequence. The first base is numbered 1.
5. end - The ending position of the feature (inclusive).
6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
9. group - All lines with the same group are linked together into a single item.


Example:

MT_HUE_20_circular GeneMarkS source 1 4559459 . + . Name source

MT_HUE_20_circular GeneMarkS CDS 385 582 . + . Name orf_0001 ; locus_tag integrase catalytic region

MT_HUE_20_circular GeneMarkS CDS 665 2749 . + . Name orf_0002 ; locus_tag transposase

MT_HUE_20_circular GeneMarkS CDS 2991 5033 . + . Name orf_0003 ; locus_tag conserved hypothetical protein

MT_HUE_20_circular GeneMarkS CDS 5046 5168 . + . Name orf_0004 ; locus_tag ---NA---

MT_HUE_20_circular GeneMarkS CDS 5333 6586 . + . Name orf_0005 ; locus_tag cytochrome p450 125 cyp125

MT_HUE_20_circular GeneMarkS CDS 6586 7605 . + . Name orf_0006 ; locus_tag acyl- dehydrogenase fade28

MT_HUE_20_circular GeneMarkS CDS 7683 8753 . + . Name orf_0007 ; locus_tag acyl- dehydrogenase fade29

- Convert the ORF's BLAST result to tabular format with the Galaxy Tool.

- Open the tabular file with MS Excel and split the content of the first column into two columns:

orf_0001|integrase catalytic region => orf_0001 integrase catalytic region

Data > Text to Columns > Delimited with | > Finish

- Delete the amino acid sequence column.

- Create the GFF file from the tabular file and the gene list with MS Excel, where:

C1: Name of MT circular genome (MT_HUE_20_circular)

C9: =CONCATENATE("Name ",#column orf_number," ; ","locus_tag ", #column Sequence desc.)

- Copy the content of the Excel file and paste it into a .txt file.

- Rename this file to mt_hue_20_circular.gff
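Because GFF fields must be tab-separated, it can be safer to generate the lines with a script than to copy them out of a spreadsheet. The following Python sketch builds one GFF line in the layout of the example above (score and frame set to "."). The function name gff_line and its parameters are hypothetical; the attribute string mirrors the CONCATENATE formula used for column 9.

```python
def gff_line(seqname, source, feature, start, end, strand, orf_name, description):
    """Return one tab-separated, nine-field GFF line.

    Score and frame are written as '.', as in the example lines of this
    report. The ninth field follows the layout
    'Name <orf> ; locus_tag <description>'.
    """
    attributes = f"Name {orf_name} ; locus_tag {description}"
    fields = [seqname, source, feature, str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)


if __name__ == "__main__":
    print(gff_line("MT_HUE_20_circular", "GeneMarkS", "CDS",
                   385, 582, "+", "orf_0001", "integrase catalytic region"))
```

Writing the lines this way guarantees real tab characters between the nine fields, which is the property the track display depends on.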

Step12: Open GFF file with Geneious

To obtain the full annotated genome of the MT_HUE_20 strain, we combine in Geneious the circular sequence obtained from the contigs, the sequence descriptions obtained from the BLAST of all ORFs, and the GFF file.

- Open Geneious and create a new folder named GFF.

- Import the mt_hue_20_circular.gff file.

- Get the sequence for this GFF file (mt_hue_20_circular.fasta).

- Visualize the genome in circular form: Tool -> Circular Sequences

- Zoom in or out to find the specific ORF


[Figure: A long fragment of the MT_HUE_20 genome, including many ORFs]

Step13: Manipulate specific gene with annotated genome of MT_HUE_20

To find a specific gene, for example the RNA polymerase beta subunit (rpoB) gene, we search the top BLAST hit data to identify the name of the ORF. In this case: >orf_1934|dna-directed rna polymerase subunit beta rpob

We then use Geneious to analyze this sequence: export it, BLAST it against the NCBI server, look for mutations, and so on.
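Looking up an ORF by gene name in the exported BLAST result can be done with a simple keyword search over the FASTA headers. The Python sketch below assumes headers of the form >orf_NNNN|description, as in the rpoB example; the function name find_orfs is hypothetical.

```python
def find_orfs(headers, keyword):
    """Return (orf_id, description) pairs whose description contains keyword.

    Assumption: each header looks like '>orf_1934|some description'.
    The match is case-insensitive.
    """
    keyword = keyword.lower()
    hits = []
    for header in headers:
        text = header.lstrip(">").strip()
        orf_id, _, description = text.partition("|")
        if keyword in description.lower():
            hits.append((orf_id, description))
    return hits


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        fasta_headers = [line for line in handle if line.startswith(">")]
    for orf_id, description in find_orfs(fasta_headers, sys.argv[2]):
        print(orf_id, description, sep="\t")
```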

Part 03 Bowtie 0.12.7, MagicViewer

1. Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB (for unpaired alignment) or 2.9 GB (for paired-end or colorspace alignment). Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie can also output alignments in the standard SAM format, allowing Bowtie to interoperate with other tools supporting SAM, including the SAMtools consensus, SNP, and indel callers. Bowtie runs on the command line under Windows, Mac OS X, Linux, and Solaris.

Bowtie also forms the basis for other tools, including TopHat: a fast splice junction mapper for RNA-seq reads, Cufflinks: a tool for transcriptome assembly and isoform quantitation from RNA-seq reads, Crossbow: a cloud-computing software tool for large-scale resequencing data, and Myrna: a cloud computing tool for calculating differential gene expression in large RNA-seq datasets.

Windows Shell: Convert full sequence reads (fastq) to .SAM file

D:\Softwares\Biotool\bowtie-0.12.7>bowtie.exe -S ./indexes/Test1/fullseq.fastq align_mt.sam

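Syntax: bowtie_folder>bowtie.exe -S ./[path_file_fullseq.fastq] ./[path_file_fullseq.sam]

In general, bowtie is invoked as bowtie [options] <ebwt> <reads> [<hit>], where <ebwt> is the basename of an index built with bowtie-build.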
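Once the SAM file exists, a quick sanity check is to count how many reads were aligned. The Python sketch below does this by reading the FLAG field (second column) of each alignment line and testing bit 0x4, which the SAM format defines as "segment unmapped". The function name is hypothetical, and the file name used in the usage line is simply the one from the command above.

```python
def count_alignments(sam_lines):
    """Return (mapped, unmapped) read counts from SAM-formatted lines.

    Header lines (starting with '@') and blank lines are skipped.
    A record counts as unmapped when FLAG bit 0x4 is set.
    """
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


if __name__ == "__main__":
    with open("align_mt.sam") as handle:
        n_mapped, n_unmapped = count_alignments(handle)
    print("mapped:", n_mapped, "unmapped:", n_unmapped)
```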


2. MagicViewer helps us study a variety of genomic data, such as de novo sequencing, transcriptome sequencing and targeted re-sequencing, especially exon-capture and high-throughput sequencing. It can be used for mapping, SNP detection or association studies.

Analyze .SAM file with MagicViewer_1.2.1_i386_win32 program

Step 1: Run MagicViewer.bat file with Windows Shell

D:\Softwares\Biotools\MagicViewer_1.2.1_i386_win32>MagicViewer.bat

Step 2: Convert the .SAM file to a sorted, indexed .BAM file

Create a new project and input the reference genome FASTA file (the H37Rv genome from NCBI) and the alignment file (the full-sequence SAM file). MagicViewer will convert it to a sorted, indexed BAM file.


…… (writing in progress)