Developing a Bioinformatics Pipeline
for Forensic STR Data Analysis
By Anna Blendermann
Step 1: STR Analysis and Profiling
STR Analysis – a forensic tool that examines specific STR regions of nuclear DNA to find differences between DNA profiles
DNA profiles can identify or exonerate suspects in criminal investigations
My project works with a bioinformatics pipeline that processes the genetic data from STR regions
Locus – the specific location of a DNA sequence
DNA Sequence – the precise ordering of nucleotide bases
Short Tandem Repeat (STR) – a unit of two to six nucleotide bases repeated back to back in a DNA sequence within one locus
Step 2: Next Generation Sequencing
DNA Sequencing – the process of determining the precise order of nucleotide bases within a DNA molecule
Next Generation Sequencing (NGS) – modern sequencing technologies used to sequence DNA
Quicker and cheaper than Sanger sequencing
Becoming feasible as a method for STR typing
Step 3: Bioinformatics Processing
Bioinformatics Pipeline – program that receives input data from NGS instruments and generates statistical output data
Groovy script – runs fastq files through STRait Razor
Perl script (STRait Razor) – parses fastq files into csv files
Java code by Harish Swaminathan – creates statistics files from csv files
Bracket Notation Program
My Program – standalone Java program that converts DNA sample sequences into bracket notation
Bracket Notation – condensed format for DNA sequences that highlights allele length and the number of base repeats
Base repeat – the short unit of nucleotide bases that repeats (four bases in most of the examples here)
Allele length – the number of base repeats in a DNA sequence
Default Sequence – the base repeat of three, four, or five letters that is standard for a particular DNA locus
CSF1PO = AGAT forward, ATCT reverse
D12S391 = AGAT/AGAC forward, ATCT/GTCT reverse
D19S433 = AAGG/AAAG forward, CCTT/CTTT reverse
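In the D12S391 and D19S433 entries, each reverse repeat is the reverse complement of the forward repeat (AGAT reads as ATCT on the opposite strand). The sketch below shows one way to compute that relationship. The class and method names are hypothetical and are not taken from the Bracket Notation Program itself.

```java
// Hypothetical helper: derives the reverse-strand repeat from the forward repeat.
class ReverseComplement {

    // Returns the reverse complement of a DNA string (A<->T, C<->G, read backwards).
    static String reverseComplement(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            out.append(complement(seq.charAt(i)));
        }
        return out.toString();
    }

    // Maps a single base to its complementary base.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AGAT")); // ATCT
        System.out.println(reverseComplement("AGAC")); // GTCT
        System.out.println(reverseComplement("AAGG")); // CCTT
    }
}
```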
Simple Sequences
AGATAGATAGATAGATAGAT
AGAT + AGAT + AGAT + AGAT + AGAT = [AGAT]5
Meaning: Allele length 5 with AGAT repeat
TGTCTGTCTGTCTGTCTGTCTGTC
TGTC + TGTC + TGTC + TGTC + TGTC + TGTC = [TGTC]6
Meaning: Allele length 6 with TGTC repeat
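A simple sequence is a single repeat unit copied end to end, so the conversion amounts to checking every copy and counting. The following is a minimal sketch of that idea, assuming the repeat unit for the locus is already known. `SimpleBracket` and `collapse` are hypothetical names, not the program's actual code.

```java
// Hypothetical sketch: converts a simple sequence to bracket notation.
class SimpleBracket {

    // Collapses a sequence made up entirely of one repeat unit into "[UNIT]n".
    // Throws if the sequence is not a whole number of copies of the unit.
    static String collapse(String seq, String unit) {
        if (unit.isEmpty() || seq.isEmpty() || seq.length() % unit.length() != 0) {
            throw new IllegalArgumentException("Sequence is not a whole number of repeats");
        }
        // Check every copy of the unit, stepping one unit at a time.
        for (int i = 0; i < seq.length(); i += unit.length()) {
            if (!seq.startsWith(unit, i)) {
                throw new IllegalArgumentException("Unexpected bases at position " + i);
            }
        }
        int count = seq.length() / unit.length();
        return "[" + unit + "]" + count;
    }

    public static void main(String[] args) {
        System.out.println(collapse("AGATAGATAGATAGATAGAT", "AGAT")); // [AGAT]5
    }
}
```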
Compound Sequences
AGATAGATAGACAGACAGAC
AGAT + AGAT + AGAC + AGAC + AGAC = [AGAT]2[AGAC]3
Meaning: Allele length 5 with AGAT/AGAC repeat
TCTATCTATCTATCTATCTGTCTG
TCTA + TCTA + TCTA + TCTA + TCTG + TCTG = [TCTA]4[TCTG]2
Meaning: Allele length 6 with TCTA/TCTG repeat
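A compound sequence can be handled by reading the sequence one repeat unit at a time and grouping identical neighbouring units. The sketch below assumes a fixed unit length (four bases here) and a sequence that starts on a repeat boundary. The class name `CompoundBracket` is hypothetical.

```java
// Hypothetical sketch: run-length encodes consecutive fixed-length units.
class CompoundBracket {

    // Splits seq into units of unitLength and groups identical neighbours,
    // e.g. AGAT AGAT AGAC AGAC AGAC -> [AGAT]2[AGAC]3.
    // Bases left over at the end (fewer than one unit) are appended unchanged.
    static String toBracket(String seq, int unitLength) {
        if (unitLength <= 0) {
            throw new IllegalArgumentException("unitLength must be positive");
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i + unitLength <= seq.length()) {
            String unit = seq.substring(i, i + unitLength);
            int count = 0;
            // Count how many times this unit repeats back to back.
            while (seq.startsWith(unit, i)) {
                count++;
                i += unitLength;
            }
            out.append('[').append(unit).append(']');
            if (count > 1) {
                out.append(count); // a single copy is written without a number
            }
        }
        out.append(seq.substring(i));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBracket("AGATAGATAGACAGACAGAC", 4)); // [AGAT]2[AGAC]3
    }
}
```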
Complex Sequences
AGATAGATAGATTGAGACAGACAGACAGAT
= AGAT + AGAT + AGAT + TG + AGAC + AGAC + AGAC + AGAT
= [AGAT]3TG[AGAC]3[AGAT]
Meaning: Allele length 7 with AGAT/AGAC repeat
TGGCTGGCTGGCTGGCACAATGGCTGTCTGTCTGTCAA
= TGGC + TGGC + TGGC + TGGC + ACAA + TGGC + TGTC + TGTC + TGTC + AA
= [TGGC]4[ACAA][TGGC][TGTC]3AA
Meaning: Allele length 9.2 (nine four-base repeats plus two extra bases) with TGGC/TGTC repeat
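Complex sequences contain bases that do not fit the repeat pattern, such as the TG in the AGAT/AGAC example, so fixed-size chunking is no longer enough. One possible approach, sketched below, is to match the locus's known repeat units greedily and copy any unmatched base through as a plain letter. This is an illustration under that assumption, not the program's actual algorithm, and `ComplexBracket` is a hypothetical name.

```java
import java.util.List;

// Hypothetical sketch: motif-aware conversion for sequences with interruptions.
class ComplexBracket {

    // Walks the sequence left to right. At each position, if a known motif
    // starts there, its consecutive copies are collapsed into [MOTIF]n;
    // otherwise the single base is emitted as-is and the scan moves on.
    static String toBracket(String seq, List<String> motifs) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < seq.length()) {
            String motif = motifAt(seq, i, motifs);
            if (motif == null) {
                out.append(seq.charAt(i)); // interruption base, kept literal
                i++;
                continue;
            }
            int count = 0;
            while (seq.startsWith(motif, i)) {
                count++;
                i += motif.length();
            }
            out.append('[').append(motif).append(']');
            if (count > 1) {
                out.append(count);
            }
        }
        return out.toString();
    }

    // Returns the first non-empty motif that matches at pos, or null if none does.
    private static String motifAt(String seq, int pos, List<String> motifs) {
        for (String m : motifs) {
            if (!m.isEmpty() && seq.startsWith(m, pos)) {
                return m;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String seq = "AGATAGATAGATTGAGACAGACAGACAGAT";
        System.out.println(toBracket(seq, List.of("AGAT", "AGAC"))); // [AGAT]3TG[AGAC]3[AGAT]
    }
}
```

Because unmatched bases are copied through one at a time, this sketch never brackets an interruption. Bracketing a non-repeating four-base block would need an extra rule.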
Rare Sequences
The FGA and PentaD loci contain specific sequence patterns that make converting to Bracket Notation difficult
FGA Locus:
TTTCTTTCTTTCTTTTTTCTCTTTCTTTCTCCTTCCTTCC
= TTTC + TTTC + TTTC + TTTT + TTCT + CTTT + CTTT + CTCC + TTCC + TTCC
= [TTTC]3[TTTT][TTCT][CTTT]2[CTCC][TTCC]2
Meaning: Allele length 10 with TTTC/CTTT/TTCC repeat
PentaD Locus:
GAAAGAAAAAAAAGAAAGAAAAGAAAAAACGAAGGGGAAAAAAA
= GAAAGAAAAAAAAG + AAAGA + AAAGA + AAAAACGAAGGGGAAAAAAA
= [AAAGA]2 (flanks removed)
Meaning: Allele length 2 with AAAGA repeat
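For the PentaD example, the repeat region has to be found inside flanking sequence that itself contains stray copies of AAAGA. A hedged sketch of one way to do this is to search for the longest back-to-back run of the repeat unit and discard everything outside it. `FlankTrimmer` and its methods are hypothetical names, and the real program may locate the flanks differently.

```java
// Hypothetical sketch: finds the repeat region by longest tandem run.
class FlankTrimmer {

    // Returns {start, count} for the longest run of consecutive motif copies.
    // If the motif never occurs, returns {-1, 0}. Ties keep the earliest run.
    static int[] longestRun(String seq, String motif) {
        if (motif.isEmpty()) {
            throw new IllegalArgumentException("motif must not be empty");
        }
        int bestStart = -1;
        int bestCount = 0;
        for (int start = 0; start + motif.length() <= seq.length(); start++) {
            int count = 0;
            int pos = start;
            while (seq.startsWith(motif, pos)) {
                count++;
                pos += motif.length();
            }
            if (count > bestCount) {
                bestCount = count;
                bestStart = start;
            }
        }
        return new int[] {bestStart, bestCount};
    }

    // Writes the longest run in bracket notation, dropping the flanks.
    static String toBracket(String seq, String motif) {
        int count = longestRun(seq, motif)[1];
        if (count == 0) {
            return "";
        }
        return "[" + motif + "]" + (count > 1 ? String.valueOf(count) : "");
    }

    public static void main(String[] args) {
        String pentaD = "GAAAGAAAAAAAAGAAAGAAAAGAAAAAACGAAGGGGAAAAAAA";
        System.out.println(toBracket(pentaD, "AAAGA")); // [AAAGA]2
    }
}
```

Isolated single copies of AAAGA in the flanks are ignored because a longer run exists elsewhere in the read.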
See the Difference!
[Figure: raw genetics data compared with Bracket Notation Program output]
Galaxy Web Platform
Galaxy Web Platform – open, web-based platform for data-intensive biomedical research where biologists can perform, reproduce, and share complete analyses of their work
Bracket Notation Program
Converts the DNA sequences in the output files from Harish's Java code into bracket notation
Produces user-friendly data that is easy to read
Processes multiple folders of files with genetics data
Plans for the future?
Soon to be available on Galaxy for genetics research
Starting point for other computer science interns
Thank you! Questions?