denis-bauer
An overview of QBI’s production informatics framework, with an emphasis on which services are provided and how the resulting data are made available: from interactive quality control to integration with external data in the genome browser.
The Queensland Brain Institute |
QBI’s Centre for Brain Genomics
The informatics side of things
April 11, 2023
[Sprengben [why not get a friend]]
Objective of QBI’s Centre for Brain Genomics
On-time delivery
Reliable data production
Convincing data
Easy delivery
Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun PMID: 21716280.
Bird’s-eye view of the facility’s workflow
Detailed workflow
CASAVA
Raw sequence reads
projects flowcell
cBot
HiSeq
30 different programs
HiSeq cluster cluster
Overview of Production Informatics framework
//clusterstorage
//cluster-vm
Run/ Data/
MakeFastq.sh trigger.sh armed trigger.sh html
Unaligned/ bwa/, reCaAl/, variant/ Summary.html
Apache, IGV, R, UCSC
Automatic
Manual
Processing Evaluation
Trigger.sh
• Keeping data separate from scripts
• Automating verification, quality control and summary HTML generation
• Rerunning the pipeline from any point
Flexible generic names: header
# Programs
BWA="/clusterdata/hiseq_apps/bin/$MODE/bwa"
SAMTOOLS="/clusterdata/hiseq_apps/bin/$MODE/samtools"
IGVTOOLS="/clusterdata/hiseq_apps/bin/$MODE/igvtools/IGVTools/igvtools.jar"

# Task names
TASKFASTQC="fastQC"
TASKBWA="bwa"
TASKRCA="reCalAln"

# File abbreviations
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"
ALN="aln" # aligned
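The generic names above let a script derive every related file name from the read-1 file alone. A minimal sketch of that idea using bash pattern substitution follows; the `derive_names` helper, the sample file name and the `.aln.bam` output suffix are hypothetical illustrations, not taken from the facility's scripts.

```shell
#!/usr/bin/env bash
# Generic name fragments, as defined in the header above
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"
ALN="aln"

# derive_names is a hypothetical helper: given a read-1 file name,
# print the mate file name and an (assumed) alignment output name.
derive_names() {
    local f="$1"
    local mate="${f/$READONE/$READTWO}"           # swap read1 -> read2
    local bam="${f/_$READONE.$FASTQ/.$ALN.bam}"    # drop read tag, add suffix
    printf '%s %s\n' "$mate" "$bam"
}

derive_names "s_1_read1.fastq.gz"
```

Because only the variables carry the naming convention, changing `READONE` or `FASTQ` in the header would rename inputs and outputs everywhere without touching the processing code.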
Config.txt
#********************
# Tasks
#********************
mappingBWA="1"
recalibrateQualScore="1"

#********************
# Paths
#********************
FASTA="/clusterdata/resources/hg19/hg19.fasta"
SEQREG="chr1:229994688-230071581"
DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"

#********************
# PARAMETER
#********************
LIBRARY="QBI"
ADDPARAMBWA="--force single"
Specifies what to do, e.g. mapping and recalibration
Specifies where to find resources
Customizes standard scripts for this project
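A config file of this shape is plain shell, so a driver script can source it and branch on the task flags. The sketch below assumes that mechanism; the slides do not show how trigger.sh reads Config.txt, and the `enabled_tasks` helper and the temporary stand-in config are hypothetical.

```shell
#!/usr/bin/env bash
# enabled_tasks is a hypothetical helper: source a config file in a
# subshell (so its variables do not leak) and list tasks flagged "1".
enabled_tasks() {
    (
        . "$1"
        [ "${mappingBWA:-0}" = "1" ] && echo "bwa"
        [ "${recalibrateQualScore:-0}" = "1" ] && echo "reCalAln"
        true   # keep the exit status at 0 even if the last flag is off
    )
}

# Demo with a stand-in config; the flag names are copied from Config.txt
CONFIG="$(mktemp)"
cat > "$CONFIG" <<'EOF'
mappingBWA="1"
recalibrateQualScore="1"
LIBRARY="QBI"
EOF
enabled_tasks "$CONFIG"
rm -f "$CONFIG"
```

Keeping the flags in the config rather than in the scripts is what allows one set of standard scripts to serve every project.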
Call
• trigger.sh config.txt armed
• trigger.sh config.txt html
s_1_read1.fastq s_1_read2.fastq s_2_read1.fastq s_2_read2.fastq
s_3_read1.fastq s_3_read2.fastq s_4_read1.fastq s_4_read2.fastq
s_1.bam s_2.bam
s_1.ashrr.bam s_2.ashrr.bam
s_3.bam s_4.bam
s_3.ashrr.bam s_4.ashrr.bam
Sub1_s_1.out Sub1_s_2.out Sub2_s_3.out Sub2_s_4.out
Sub1_s_1.out Sub1_s_2.out Sub2_s_3.out Sub2_s_4.out
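The slides do not spell out what the `armed` argument does. The sketch below assumes it means "really submit the jobs" and that a call without it only lists what would run, with one job per read-1 file. The `plan_jobs` helper and its messages are hypothetical.

```shell
#!/usr/bin/env bash
# plan_jobs is a hypothetical helper: one job per read-1 file in a directory.
# Mode "armed" is ASSUMED to mean "submit"; any other mode only lists jobs.
plan_jobs() {
    local dir="$1" mode="${2:-}"
    local f lane
    for f in "$dir"/*_read1.fastq; do
        [ -e "$f" ] || continue                  # no match: skip the literal glob
        lane="$(basename "$f" _read1.fastq)"     # e.g. s_1
        if [ "$mode" = "armed" ]; then
            echo "submit $lane -> $lane.bam"     # a real script would submit to the scheduler here
        else
            echo "would submit $lane -> $lane.bam"
        fi
    done
}

# Demo with empty stand-in files named as on the slide
demo="$(mktemp -d)"
touch "$demo/s_1_read1.fastq" "$demo/s_1_read2.fastq" \
      "$demo/s_2_read1.fastq" "$demo/s_2_read2.fastq"
plan_jobs "$demo"
plan_jobs "$demo" armed
rm -rf "$demo"
```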
Summary.html
Project Cards
Sequence statistics
Data Visualization
Download
Run check points
Mapping stats
Interesting Regions
Scaffold of pbsScripts.sh: Error catching
# QCVARIABLES, loosing reads, unmapped read,no such file,file not found,bwa.sh: line
>>>>>>>>>> Errors
QC_PASS .. 0 have We are loosing reads/184
QC_PASS .. 0 have for unmapped read/184
QC_PASS .. 0 have no such file/184
QC_PASS .. 0 have file not found/184
QC_PASS .. 0 have bwa.sh: line/184
Code example for setting up what errors to look out for
Output in Summary.html
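One way such a report could be produced is to count, for each listed phrase, how many job logs contain it and to pass only when the count is zero. This is a minimal sketch under that assumption; the `check_errors` helper, the `*.out` log layout and the `QC_FAIL` label are hypothetical, and only the `QC_PASS .. N have phrase/total` line format is copied from the slide.

```shell
#!/usr/bin/env bash
# check_errors is a hypothetical helper. For each comma-separated phrase it
# counts how many *.out logs in a directory contain that phrase.
check_errors() {
    local dir="$1" phrases="$2"
    local total=0 hits status phrase f
    for f in "$dir"/*.out; do
        [ -e "$f" ] && total=$((total + 1))
    done
    echo ">>>>>>>>>> Errors"
    local IFS=','                     # split the phrase list on commas only
    for phrase in $phrases; do
        hits=$(grep -l -F -- "$phrase" "$dir"/*.out 2>/dev/null | wc -l | tr -d ' ')
        if [ "$hits" -eq 0 ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $phrase/$total"
    done
}

# Demo with two stand-in logs, one of which contains an error message
logs="$(mktemp -d)"
echo "job finished" > "$logs/Sub1_s_1.out"
echo "bwa.sh: line 12: no such file" > "$logs/Sub1_s_2.out"
check_errors "$logs" "unmapped read,no such file,bwa.sh: line"
rm -rf "$logs"
```

Matching fixed strings with `grep -F` keeps the phrase list readable and avoids surprises from regex metacharacters such as the dot in `bwa.sh`.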
Scaffold of pbsScripts.sh: checkpoints
>>>>>>>>>> CheckPoints
QC_PASS .. 184 have mapping/184
QC_PASS .. 184 have sorting and bam-conversion/184
QC_PASS .. 184 have mark duplicates/184
QC_PASS .. 184 have statistics/184
QC_PASS .. 184 have coverage track/184

echo "********* mapping"
$BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
$BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}
Code example for setting up checkpoints in pbsScripts.sh
Output in Summary.html
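The checkpoint report is the mirror image of the error report: a step passes only when every log contains its marker. A minimal sketch, assuming each job echoes a `********* <step>` line as in the code example above; the `check_points` helper, the `*.out` log layout and the `QC_FAIL` label are hypothetical.

```shell
#!/usr/bin/env bash
# check_points is a hypothetical helper. For each step name it counts how
# many *.out logs contain the marker line echoed when that step starts.
check_points() {
    local dir="$1"; shift
    local total=0 hits status name f
    for f in "$dir"/*.out; do
        [ -e "$f" ] && total=$((total + 1))
    done
    echo ">>>>>>>>>> CheckPoints"
    for name in "$@"; do
        hits=$(grep -l -F -- "********* $name" "$dir"/*.out 2>/dev/null | wc -l | tr -d ' ')
        # Pass only if every log reached this step
        if [ "$hits" -eq "$total" ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $name/$total"
    done
}

# Demo: the second stand-in job stopped before the statistics step
logs="$(mktemp -d)"
printf '********* mapping\n********* statistics\n' > "$logs/Sub1_s_1.out"
printf '********* mapping\n' > "$logs/Sub1_s_2.out"
check_points "$logs" "mapping" "statistics"
rm -rf "$logs"
```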
Availability: tailored to skills
Website RStudio Command line
1 2 3
Documentation: Project Server
Application Backup/Version Control
Data Warehousing
RStudio
Project Cards
Software
Processed Data
External Genomic Resources
Custom Scripts
Custom Scripts
Visualization
IGV, Genome Browser
Statistical Analysis
Quality Control
Hypothesis Generation
Data Processing and Analysis
HiSeq Output
Rsync
Version Control
Genomes, Annotation, etc.
7 project-cards
10 Projects, 6 HiSeq-Runs
40 wiki pages, 250 Tasks, 551h logged
160 Commits
35 external programs
41 custom scripts (4197 lines of code)
5 TB raw data
750 GB processed data
57 GB external data
//cluster-vm //clusterstorage //groupshare, //ethan
Covering all aspects of: design*, set-up*, maintenance*, usage (*except cluster)
//project
Processed Data
Raw Data
Cluster
Galaxy
Project Server
Content
BWA, GATK, samtools, etc.
The big picture
Three things to remember
• Reliable data production
  – All projects have a similar structure and are processed in the same way
• Convincing data
  – All steps are tightly quality controlled and the QC report is accessible
• Easy delivery
  – We tailored data availability to skill levels (webpage, RStudio, console)
• On-time delivery
  – Production informatics has priority on the cluster
Next week
• NGS Discussion group:
Methylation analysis
Kevin Dudley and Danay Baker-Andresen