529053 Evolutionary Genomics Ari Löytynoja / ari.loytynoja@helsinki.fi
14 March, 2016: Introduction to Genomics
Genome within Ensembl browser
http://www.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000139618;r=13:32315474-32400266
Genome
Genes
Variation
Repeats
Genes
- https://en.wikipedia.org/wiki/Gene
Variation
- SNP or SNV (single-nucleotide polymorphism/variation)
- indels (insertions and deletions)
- structural variation
  - CNV (copy-number variation)
  - inversions
  - translocations
https://en.wikipedia.org/wiki/Structural_variation
https://en.wikipedia.org/wiki/Chromosome_abnormality
Variation
- caused by mutations
- visible in DNA sequence
- proportion of variable sites depends on evolutionary distance
  - within species little
  - between species lots
seq1 CGATGCGCGATACATCGACGTGCA
seq2 CGATGCGCGGTACATCGACGTGCA
seq3 CGATGCGCGATACATCGACGTGCA
seq4 CGATGCGCGATACATCGACGTGCA
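The proportion of variable sites in the four sequences above can be counted directly. The following is a minimal Python sketch, assuming the sequences are already aligned and of equal length; the function name `variable_sites` is a hypothetical helper, not part of the course material.

```python
def variable_sites(seqs):
    """Return 0-based column indices where the aligned sequences differ."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    # A column is variable if it contains more than one distinct base.
    return [i for i in range(length) if len({s[i] for s in seqs}) > 1]


alignment = {
    "seq1": "CGATGCGCGATACATCGACGTGCA",
    "seq2": "CGATGCGCGGTACATCGACGTGCA",
    "seq3": "CGATGCGCGATACATCGACGTGCA",
    "seq4": "CGATGCGCGATACATCGACGTGCA",
}

sites = variable_sites(list(alignment.values()))
proportion = len(sites) / len(alignment["seq1"])
print(sites, proportion)
```

Here a single column out of 24 varies (A in seq1, seq3 and seq4; G in seq2), which is the kind of low within-species variation the slide describes.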
Repeats
Many types
- https://en.wikipedia.org/wiki/Tandem_repeat
- https://en.wikipedia.org/wiki/Retrotransposon
- https://en.wikipedia.org/wiki/Transposable_element
The Alu element is the most abundant transposable element in the human genome
- ~300 bases long, ~1 million copies → makes up ~11% of the human genome
- repeat copies are similar and cause trouble in genome assembly and short-read mapping
- often useless for the analysis and ignored where possible
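The Alu figures can be checked with back-of-the-envelope arithmetic. This is a rough sketch using the rounded numbers from the slide and a human genome size of about 3 x 10^9 bp; the variable names are illustrative only.

```python
alu_length_bp = 300        # approximate length of one Alu copy
alu_copies = 1_000_000     # approximate number of copies
genome_size_bp = 3e9       # approximate human genome size (rounded)

# Fraction of the genome occupied by Alu copies under these rounded inputs.
alu_fraction = alu_length_bp * alu_copies / genome_size_bp
print(f"{alu_fraction:.0%}")
```

The rounded inputs give about 10%, the same order as the ~11% quoted on the slide; the small gap presumably reflects the rounding of copy number, element length and genome size.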
Genome
Genes
Variation
Repeats
→ correlations!
Variation data
- mainly interested in SNPs and indels observed between samples
  - distribution of variation across sites
  - distribution among samples
- here, the reference sequence is known
- sample data are multiple genomes
- to us, data come from a magic box
- data: millions to billions of ~100 bp fragments
Sample genome: millions to billions of bp long (human ~3 x 10^9 bp)
DNA fragmentation → fragments of 400-1500 bp
DNA sequencing → 100 bp known | 200-1000 bp unknown | 100 bp known
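The known/unknown layout can be written as simple arithmetic. The sketch below assumes the diagram describes reads taken from both ends of each fragment, with a read length of 100 bp; `paired_end_layout` is a hypothetical helper name.

```python
def paired_end_layout(fragment_len, read_len=100):
    """Split a fragment into (known, unknown, known) lengths in bp."""
    if fragment_len < 2 * read_len:
        raise ValueError("fragment shorter than two reads; the reads would overlap")
    # Both ends are sequenced; whatever lies between them stays unread.
    return read_len, fragment_len - 2 * read_len, read_len


for fragment_len in (400, 1200):
    print(fragment_len, paired_end_layout(fragment_len))
```

Under these assumptions a 400 bp fragment leaves 200 bp unread and a 1200 bp fragment leaves 1000 bp unread, which spans the 200-1000 bp range shown for the unknown middle part.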
Variation data
- mainly interested in SNPs and indels observed between samples
1. detection of CNVs (and structural variation) using short-read data is tricky
2. evolution of CNVs is unclear
→ population genetics theory is best developed for SNP data
Short read mapping → Genomic analyses
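Because the reference sequence is known, short read mapping amounts to finding where each fragment fits on the reference. The following toy sketch scans every position and tolerates a small number of mismatches; it is an illustration only, since real mappers rely on indexed data structures rather than a linear scan. The function `map_read` and the 8-base read are hypothetical; the reference string is seq1 from the variation slide.

```python
def map_read(reference, read, max_mismatches=1):
    """Return (start, mismatches) for each 0-based position where the read fits."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Count positions where the read disagrees with the reference window.
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "CGATGCGCGATACATCGACGTGCA"  # seq1 from the variation slide
read = "GCGGTACA"                       # toy read carrying the seq2 variant

print(map_read(reference, read))
```

The read maps to a single position with one mismatch, and that mismatch is the candidate SNP. A read drawn from a repeat would instead produce several equally good hits, which is why repeats make mapping difficult.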
Illumina sequencing
- most common sequencing machines
- Illumina reads have systematic errors
- some errors can be accounted for
Illumina sequencing
- data output in fastq format
- per base sequence content shows contamination and fragment enrichment
- base call qualities reflect sample DNA quality and issues in sequencing run
- base qualities are taken into account in later analyses
- typically no need for clean up
[Figure: good data vs. bad data]
@NS500198:34:H1708BGXX:1:11101:4647:1058 1:N:0:4
ACAACNCGCCCGTGNTGCAGGACTGGGTCACGGCCACTGACATCCGCGTGGCCTTCCGCCGCCTGCACACGTTCGGTGACGAGAACGAGGCCGACTCCGAGCTGGCGCGCGCCTCGTACTTCTACGCCGTGTCCGACCT
+
7A7AA#FFFFFFFF#AFFFFF7FFFFFFF7FFFFF)FFF.F.FFFFFFFFFF.FFF.<FFFFFAFAAF.<AFF<FFFA<FAF<<)FA.AF<FA7FFFF<.AFFFFFAFAA<.AAFA<AA.FFF.F<FAAAFFF<7.FAA
@NS500198:34:H1708BGXX:1:11101:20099:1059 1:N:0:4
GCCACNAAAATTTANAACTAGAGCTGCCCTATGCCCCAGCAATTGCACTCCTGGGTATTTACCCCAAAGACACAGATGTAGTGAAAAGAAGGGCCATATTCACCCCAACGTTCATAGCAGCAAAGTCCACTATAGCCAA
+
<AA.A#A.7A.F<F#FFFF.AAFFA.FFFAFFF7FFFFFFFFFF.7)FFFFF.AFFAAAF.F)FFF<AFAF7FAF.FAFFFFFFFFA..)FFF.FF7FF)A...FF..<.<FF7)<FFFF<7F.)F.FFF.7FFFFFF7
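Each fastq record has four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The sketch below decodes the quality characters into Phred scores, assuming the Phred+33 encoding (ASCII code minus 33). The helper `parse_fastq` is hypothetical, and the example uses only the first 15 bases of the first record above to keep it short.

```python
def parse_fastq(lines):
    """Yield (header, sequence, phred_scores) from an iterable of FASTQ lines."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33 (assumed): quality score = ASCII code of the character - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# First record above, truncated to its first 15 bases for brevity.
record = [
    "@NS500198:34:H1708BGXX:1:11101:4647:1058 1:N:0:4",
    "ACAACNCGCCCGTGN",
    "+",
    "7A7AA#FFFFFFFF#",
]

parsed = list(parse_fastq(record))
name, seq, scores = parsed[0]
for base, q in zip(seq, scores):
    # Error probability implied by a Phred score Q is 10^(-Q/10).
    print(base, q, round(10 ** (-q / 10), 4))
```

The two uncalled bases (`N`) carry the quality character `#`, which decodes to a score of 2 and an error probability of about 0.63, while `F` decodes to 37, or roughly 0.0002. This is how later analyses can weight each base by its quality instead of requiring a separate clean-up step.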
Overview of resequencing data analysis
[Figure: samples → fastq data → mapping → bam data → variant calling → vcf data → summary statistics → analysis]
vcf data
18917 C A 0/0 0/0 0/0 0/0
18969 G T 0/0 0/0 0/0 0/0
19022 G T 0/1 1/1 1/1 1/1
19030 T A 0/1 1/1 1/1 1/1
19163 A G 0/0 0/0 0/0 0/0
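One simple summary statistic from the vcf data is the frequency of the alternative allele at each site. The sketch below assumes the simplified layout shown on the slide (position, reference allele, alternative allele, then one diploid genotype per sample); a real VCF file has additional columns, and missing or phased genotypes are not handled here. The helper `alt_allele_frequency` is hypothetical.

```python
vcf_rows = """\
18917 C A 0/0 0/0 0/0 0/0
18969 G T 0/0 0/0 0/0 0/0
19022 G T 0/1 1/1 1/1 1/1
19030 T A 0/1 1/1 1/1 1/1
19163 A G 0/0 0/0 0/0 0/0
"""


def alt_allele_frequency(genotypes):
    """Fraction of alleles equal to '1' across unphased diploid genotypes."""
    alleles = [allele for gt in genotypes for allele in gt.split("/")]
    return alleles.count("1") / len(alleles)


frequencies = {}
for line in vcf_rows.splitlines():
    pos, ref, alt, *genotypes = line.split()
    frequencies[pos] = alt_allele_frequency(genotypes)
    print(pos, ref, alt, frequencies[pos])
```

In this excerpt only positions 19022 and 19030 vary among the four samples, each with 7 of 8 alleles carrying the alternative base; the other three sites match the reference in every sample.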