Developing a Bioinformatics Pipeline for Forensic STR Data Analysis By Anna Blendermann

Final Presentation-Delta

Page 1: Final Presentation-Delta

Developing a Bioinformatics Pipeline

for Forensic STR Data Analysis

By Anna Blendermann

Page 2: Final Presentation-Delta

Step 1: STR Analysis and Profiling

STR Analysis – a forensic technique that evaluates specific STR regions of nuclear DNA to identify differences between DNA profiles
DNA profiles can identify or exonerate suspects in criminal investigations

My project works with a bioinformatics pipeline that processes genetic data from STR regions
Locus – the specific location of a DNA sequence
DNA Sequence – the precise ordering of nucleotide bases
Short Tandem Repeat (STR) – a unit of two to six nucleotide bases repeated in tandem in a DNA sequence within one locus

Page 3: Final Presentation-Delta

Step 2: Next Generation Sequencing

DNA Sequencing – the process of determining the precise order of nucleotide bases within a DNA molecule

Next Generation Sequencing – modern sequencing technologies used to sequence DNA
Quicker and cheaper than Sanger sequencing
Becoming feasible as a method for STR typing

Page 4: Final Presentation-Delta

Step 3: Bioinformatics Processing

Bioinformatics Pipeline – program that receives input data from NGS instruments and generates statistical output data
Groovy script – runs fastq files through STRait Razor
Perl script (STRait Razor) – parses fastq files into csv files
Java code by Harish Swaminathan – creates statistics files from csv files

Page 5: Final Presentation-Delta

Bracket Notation Program

My Program – standalone Java program that converts DNA sample sequences into bracket notation

Bracket Notation – condensed format for DNA sequences that highlights allele length and number of base repeats
Base repeat – short sequence of nucleotide bases (most often four) that repeats back to back
Allele length – the number of base repeats in a DNA sequence

Default Sequence – base repeat of three, four, or five letters that is the default for a particular DNA locus
CSF1PO = AGAT forward, ATCT reverse
D12S391 = AGAT/AGAC forward, ATCT/GTCT reverse
D19S433 = AAGG/AAAG forward, CCTT/CTTT reverse
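The reverse entries listed for D12S391 and D19S433 are the reverse complements of the forward repeats (AGAT gives ATCT, AAGG gives CCTT). The sketch below shows one way that relationship could be computed. The class and method names are hypothetical and are not taken from the original program.

```java
import java.util.Map;

// Hypothetical helper, not part of the original Bracket Notation Program.
class ReverseComplement {
    // Watson-Crick base pairing: A pairs with T, C pairs with G.
    private static final Map<Character, Character> PAIR =
            Map.of('A', 'T', 'T', 'A', 'C', 'G', 'G', 'C');

    // Reads the motif back to front and swaps each base for its partner.
    static String reverseComplement(String motif) {
        StringBuilder out = new StringBuilder(motif.length());
        for (int i = motif.length() - 1; i >= 0; i--) {
            Character partner = PAIR.get(motif.charAt(i));
            if (partner == null) {
                throw new IllegalArgumentException("Not a DNA base: " + motif.charAt(i));
            }
            out.append(partner);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String motif : new String[] {"AGAT", "AGAC", "AAGG", "AAAG"}) {
            System.out.println(motif + " forward, " + reverseComplement(motif) + " reverse");
        }
    }
}
```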

Page 6: Final Presentation-Delta

Simple Sequences

AGATAGATAGATAGATAGAT
AGAT + AGAT + AGAT + AGAT + AGAT = [AGAT]5
Meaning: Allele length 5 with AGAT repeat

TGTCTGTCTGTCTGTCTGTCTGTC
TGTC + TGTC + TGTC + TGTC + TGTC + TGTC = [TGTC]6
Meaning: Allele length 6 with TGTC repeat
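For a simple sequence, the conversion amounts to cutting the sequence into fixed-size units and counting consecutive identical units. The following is a minimal sketch of that idea, assuming the unit length is known in advance and the sequence length is an exact multiple of it. `SimpleBracketNotation` and `encode` are illustrative names, not the original program's API.

```java
// Hypothetical sketch, not the original Bracket Notation Program.
class SimpleBracketNotation {
    // Splits the sequence into units of the given length and run-length
    // encodes consecutive identical units as [UNIT]count.
    static String encode(String sequence, int unit) {
        if (unit <= 0 || sequence.length() % unit != 0) {
            throw new IllegalArgumentException("Sequence length must be a multiple of the unit length");
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sequence.length()) {
            String motif = sequence.substring(i, i + unit);
            int count = 0;
            // Advance while the same unit keeps repeating.
            while (sequence.startsWith(motif, i)) {
                count++;
                i += unit;
            }
            out.append('[').append(motif).append(']');
            if (count > 1) {
                out.append(count); // a single repeat is written without a count
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("AGATAGATAGATAGATAGAT", 4)); // prints [AGAT]5
    }
}
```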

Page 7: Final Presentation-Delta

Compound Sequences

AGATAGATAGACAGACAGAC
AGAT + AGAT + AGAC + AGAC + AGAC = [AGAT]2[AGAC]3
Meaning: Allele length 5 with AGAT/AGAC repeat

TCTATCTATCTATCTATCTGTCTG
TCTA + TCTA + TCTA + TCTA + TCTG + TCTG = [TCTA]4[TCTG]2
Meaning: Allele length 6 with TCTA/TCTG repeat
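Bracket notation is lossless, so a bracketed string can be expanded back into its raw sequence, which is a convenient way to check a conversion. The sketch below assumes a grammar of `[MOTIF]count` groups (count optional, meaning 1) mixed with loose bases. The class name `BracketDecoder` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the original Bracket Notation Program.
class BracketDecoder {
    // Either a bracketed motif with an optional repeat count, or loose bases.
    private static final Pattern TOKEN =
            Pattern.compile("\\[([ACGT]+)\\](\\d*)|([ACGT]+)");

    static String decode(String notation) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(notation);
        int pos = 0;
        while (m.find()) {
            if (m.start() != pos) {
                throw new IllegalArgumentException("Unexpected text at index " + pos);
            }
            if (m.group(1) != null) {
                // Bracketed motif: a missing count means a single repeat.
                int count = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
                out.append(m.group(1).repeat(count));
            } else {
                // Loose bases between bracketed groups are copied as-is.
                out.append(m.group(3));
            }
            pos = m.end();
        }
        if (pos != notation.length()) {
            throw new IllegalArgumentException("Unexpected text at index " + pos);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("[AGAT]2[AGAC]3")); // prints AGATAGATAGACAGACAGAC
    }
}
```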

Page 8: Final Presentation-Delta

Complex Sequences

AGATAGATAGATTGAGACAGACAGACAGAT
= AGAT + AGAT + AGAT + TG + AGAC + AGAC + AGAC + AGAT
= [AGAT]3TG[AGAC]3[AGAT]
Meaning: Allele length 7 with AGAT/AGAC repeat

TGGCTGGCTGGCTGGCACAATGGCTGTCTGTCTGTCAA
= TGGC + TGGC + TGGC + TGGC + ACAA + TGGC + TGTC + TGTC + TGTC + AA
= [TGGC]4[ACAA][TGGC][TGTC]3AA
Meaning: Allele length 9.2 with TGGC/TGTC repeat
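Complex sequences contain bases that do not belong to any repeat, so fixed-size chunking no longer works. One possible approach, sketched below, is a greedy left-to-right scan against a list of expected motifs for the locus: a run of a matching motif is bracketed, and any base that starts no motif is copied through unbracketed. This is an assumption about how such a conversion could be done, not a description of the original program, and the names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch, not the original Bracket Notation Program.
class ComplexBracketNotation {
    static String encode(String sequence, List<String> motifs) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sequence.length()) {
            // Find the first expected motif that starts at position i.
            String hit = null;
            for (String motif : motifs) {
                if (!motif.isEmpty() && sequence.startsWith(motif, i)) {
                    hit = motif;
                    break;
                }
            }
            if (hit == null) {
                // No motif starts here: emit the base unbracketed.
                out.append(sequence.charAt(i));
                i++;
                continue;
            }
            int count = 0;
            while (sequence.startsWith(hit, i)) {
                count++;
                i += hit.length();
            }
            out.append('[').append(hit).append(']');
            if (count > 1) {
                out.append(count);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String seq = "AGATAGATAGATTGAGACAGACAGACAGAT";
        // prints [AGAT]3TG[AGAC]3[AGAT]
        System.out.println(encode(seq, List.of("AGAT", "AGAC")));
    }
}
```

A greedy scan depends on the order and choice of motifs, which is one reason a per-locus default sequence is useful.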

Page 9: Final Presentation-Delta

Rare Sequences

The FGA and PentaD loci contain specific sequence patterns that make converting to Bracket Notation difficult

FGA Locus: TTTCTTTCTTTCTTTTTTCTCTTTCTTTCTCCTTCCTTCC
= TTTC + TTTC + TTTC + TTTT + TTCT + CTTT + CTTT + CTCC + TTCC + TTCC
= [TTTC]3[TTTT][TTCT][CTTT]2[CTCC][TTCC]2
Meaning: Allele length 10 with TTTC/CTTT/TTCC repeat

PentaD Locus: GAAAGAAAAAAAAGAAAGAAAAGAAAAAACGAAGGGGAAAAAAA
= GAAAGAAAAAAAAG + AAAGA + AAAGA + AAAAACGAAGGGGAAAAAAA
= [AAAGA]2 (flanks removed)
Meaning: Allele length 2 with AAAGA repeat
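For the PentaD example, removing the flanks can be thought of as finding the longest uninterrupted run of the AAAGA repeat and discarding everything on either side. The sketch below illustrates that idea under the assumption that the repeat motif is known beforehand. `FlankTrimmer` and `longestRun` are hypothetical names, and the original program may handle flanks differently.

```java
// Hypothetical sketch, not the original Bracket Notation Program.
class FlankTrimmer {
    // Returns bracket notation for the longest tandem run of the motif,
    // ignoring whatever flanking sequence surrounds it.
    static String longestRun(String sequence, String motif) {
        if (motif.isEmpty()) {
            throw new IllegalArgumentException("Motif must not be empty");
        }
        int best = 0;
        // Try every start position and count back-to-back copies of the motif.
        for (int start = 0; start + motif.length() <= sequence.length(); start++) {
            int count = 0;
            int j = start;
            while (sequence.startsWith(motif, j)) {
                count++;
                j += motif.length();
            }
            best = Math.max(best, count);
        }
        if (best == 0) {
            return ""; // motif not found at all
        }
        StringBuilder out = new StringBuilder("[" + motif + "]");
        if (best > 1) {
            out.append(best);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pentaD = "GAAAGAAAAAAAAGAAAGAAAAGAAAAAACGAAGGGGAAAAAAA";
        System.out.println(longestRun(pentaD, "AAAGA")); // prints [AAAGA]2
    }
}
```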

Page 10: Final Presentation-Delta

See the Difference!

RAW GENETICS DATA

BRACKET NOTATION PROGRAM

Page 11: Final Presentation-Delta

Galaxy Web Platform

Galaxy Web Platform – open, web-based platform for data-intensive biomedical research where biologists can perform, reproduce, and share complete analyses of their work

Bracket Notation Program
Modifies DNA sequences in the output files from Harish Swaminathan's Java code
Produces user-friendly, easy-to-read data
Processes multiple folders of genetic data files

Plans for the future?
Soon to be available on Galaxy for genetics research
Starting point for other computer science interns

Page 12: Final Presentation-Delta

Thank you! Questions?