Report Calc for Quality Control


Abstract

To provide a genomic narrative that can be trusted, microbiology laboratories need quality control (QC) metrics to accompany their genomic pipelines. QC metrics enable:

•  Implementing standards in routine lab sample processing
•  Performance comparison of pipeline optimizations or alternatives
•  Retrospective tracing of problems that arise

QC metrics are not easy to implement – they may need to be adjusted for organism type, sample quality, sequencing technology and preparation, and the mix of software components that are brought together in a pipeline. Another challenge is to transform QC reporting from a manual review of a pipeline’s disparate and often opaque application log files into an automated system of reporting and decision making that can be adjusted by researchers and system administrators who are not expert programmers. We have developed a general-purpose text-mining and reporting application called Report Calc for Quality Control (RCQC) that works directly within command-line scripts, or as a tool in Galaxy (an interactive bioinformatics platform and workflow engine). An RCQC interpreter follows instructions in an RCQC script to extract QC variables from various application log and report files. It can implement rules that trigger warning or failure statuses in an active pipeline. Various opportunities arise for metrics along the stages of a genomic pipeline; our initial focus is on basic assembly metrics as illustrated on this poster.
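As a rough illustration of the extract-then-evaluate idea described in the abstract, the Python sketch below pulls two values out of a log excerpt with regular expressions and applies a threshold rule. It is not RCQC recipe syntax. The log text, the `extract_metrics` and `evaluate` helpers, the variable names and the 50% threshold are all hypothetical and chosen only for this example.

```python
import re

# Made-up log excerpt for illustration; real tool logs will differ.
LOG_TEXT = """\
[TOOL] Total pairs:      1000
[TOOL] Combined pairs:   800
[TOOL] Percent combined: 80.00%
"""

# Hypothetical mapping of QC variable names to the patterns that capture them.
PATTERNS = {
    "total_pairs": r"Total pairs:\s+(\d+)",
    "combined_pairs": r"Combined pairs:\s+(\d+)",
}


def extract_metrics(text, patterns):
    """Return a dict of integer metrics found in text; missing ones are omitted."""
    metrics = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            metrics[name] = int(match.group(1))
    return metrics


def evaluate(metrics, min_combined_fraction=0.5):
    """Return 'pass', 'warn' or 'fail' based on a hypothetical threshold rule."""
    if "total_pairs" not in metrics or "combined_pairs" not in metrics:
        return "warn"  # a value could not be extracted, so flag for review
    if metrics["total_pairs"] == 0:
        return "fail"
    fraction = metrics["combined_pairs"] / metrics["total_pairs"]
    return "pass" if fraction >= min_combined_fraction else "fail"


metrics = extract_metrics(LOG_TEXT, PATTERNS)
status = evaluate(metrics)
print(metrics, status)
```

Keeping the patterns and the threshold in plain data structures is one way to let a non-programmer adjust them without editing the extraction logic.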

QC Ontology

Using the JSON-LD format’s metadata feature, RCQC can link particular QC report terms to their standardized ontology counterparts. Creating a controlled vocabulary for QC enables reports from disparate genomic pipelines to be compared, which should eventually lead to a set of pipeline metrics for accrediting commercial, government and open source software. Within the context of the OBO Foundry of ontologies we are introducing an ontology called GenEpiO (currently available at https://github.com/Public-Health-Bioinformatics/irida_ontology) which holds QC terms like “genome size ratio”, “contig count”, etc. Using the Protégé ontology editor it is easy to see the definitions for these terms.

Acknowledgements

IRIDA project funding is provided by Genome Canada, Genome BC, and the Genomics R&D Initiative (GRDI) with additional support from Simon Fraser University and Cystic Fibrosis Canada. We thank additional project advisors for constructive comments.

RCQC Recipes

We have started a library of simple “recipe” scripts that extract quality control (QC) data from various reports like FastQC, QUAST, CheckM and SPAdes into the popular and software-friendly JSON format (an auto-generated HTML version of the same content is also available). One can override sections of an RCQC recipe with settings that test variations in a pipeline job. An example RCQC text-mining script and output HTML and JSON reports are shown below along with typical report files from other pipeline tools.
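To make the ideas of overriding recipe settings and of linking report terms to ontology entries through JSON-LD more concrete, here is a minimal Python sketch. It does not reproduce RCQC's actual report layout. The setting names, the metric values, the `build_report` helper and the `example.org` IRIs are placeholders invented for this example; real GenEpiO identifiers would take the place of the IRIs.

```python
import json

# Hypothetical default settings a recipe might carry.
DEFAULT_SETTINGS = {"min_contig_length": 500, "max_contig_count": 300}

# Placeholder IRIs; real ontology identifiers would be substituted here.
CONTEXT = {
    "contig_count": "http://example.org/qc#contig_count",
    "genome_size_ratio": "http://example.org/qc#genome_size_ratio",
}


def build_report(metrics, overrides=None):
    """Merge overrides into the defaults and wrap metrics in a JSON-LD style dict."""
    settings = {**DEFAULT_SETTINGS, **(overrides or {})}
    report = {"@context": CONTEXT, "settings": settings}
    report.update(metrics)
    return report


# Made-up metric values; the override tightens one threshold for a test run.
report = build_report(
    {"contig_count": 142, "genome_size_ratio": 0.98},
    overrides={"max_contig_count": 200},
)
print(json.dumps(report, indent=2, sort_keys=True))
```

The `@context` member is the JSON-LD mechanism that maps a short term such as `contig_count` to an IRI, so two pipelines that use the same IRI are reporting the same quantity even if their local key names differ.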

A Scripting Language For Standardized Evaluation Of Quality Metrics In Galaxy And Command-line Driven Workflows

Damion M. Dooley¹; Aaron J. Petkau²; Franklin Bristow²; Gary Van Domselaar²; William W.L. Hsiao³

¹ Department of Pathology, University of British Columbia; ² National Microbiology Laboratory, Public Health Agency of Canada; ³ Department of Pathology, University of British Columbia & BC Public Health Microbiology and Reference Laboratory

This work stemmed from the plan to enhance QC reporting on the web-based Integrated Rapid Infectious Disease Analysis (www.IRIDA.ca) project, which manages sequence libraries and pipelines for foodborne pathogen assembly, annotation, SNP detection, and phylogenetic analysis. RCQC has been developed to work as a command-line Python app, but in addition, since IRIDA uses Galaxy to execute its pipeline, we have a Galaxy RCQC tool for “pro” users to develop recipes. We will be offering a basic version of this tool that allows users without programming skills to adjust key QC parameters only. Recipes can include conditionals that trigger a halt to a pipeline by sending the appropriate signal (exit code). More than one RCQC recipe can be run in a pipeline, and their report output can be daisy-chained in order to contribute to a single collective report. QC metric conditionals shown below can either signal a possible error situation (the “fail(qc …)” call), or even call a halt to futile pipeline work (via “fail(job …)”).
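The difference between flagging a QC problem and halting a job, and the chaining of reports between recipes, can be sketched in plain Python as follows. This is an illustration of the behaviour described above, not RCQC's implementation of “fail(qc …)” or “fail(job …)”. The `JobHalt` exception, the `apply_rules` and `run_recipe` helpers, the metric names and the thresholds are all hypothetical.

```python
import copy


class JobHalt(Exception):
    """Raised when a rule decides further pipeline work is futile."""


def apply_rules(metrics, report):
    """Append QC flags to report; raise JobHalt on an unrecoverable condition."""
    if metrics["contig_count"] > 300:
        # Comparable in spirit to a QC-level failure: record it and carry on.
        report.setdefault("qc_flags", []).append("contig_count above 300")
    if metrics["total_reads"] == 0:
        # Comparable in spirit to a job-level failure: stop the pipeline.
        raise JobHalt("no reads to assemble")
    return report


def run_recipe(metrics, previous_report=None):
    """Return (report, exit_code); the report extends any earlier recipe's output."""
    report = copy.deepcopy(previous_report) if previous_report else {}
    try:
        apply_rules(metrics, report)
    except JobHalt as halt:
        report["job_status"] = "halted: " + str(halt)
        return report, 1
    report["job_status"] = "ok"
    return report, 0


# Two made-up recipe runs; the second receives the first run's report.
first, code1 = run_recipe({"contig_count": 450, "total_reads": 1000})
second, code2 = run_recipe({"contig_count": 10, "total_reads": 0}, previous_report=first)
print(code1, code2, second)
```

A command-line wrapper would typically pass the returned code to `sys.exit`, so that a non-zero value tells the calling workflow engine to stop scheduling downstream steps.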

Implementation

In developing a scripting language to do this work, we did not want to reinvent the wheel (in fact RCQC offers up for reuse all of Python’s built-in math and operator functions). We did however need a flexible mechanism for adjusting parameters and formulae for pipeline operation – one that did not require recompilation after each user-driven change. As a result, the RCQC system provides a more transparent rule set that reduces the skill needed to make process adjustments. Standard assembly pipeline QC metrics are introduced which provide a blueprint for the way QC components could be shared amongst NGS sequencing pipelines. Further information, including source code, is available at https://github.com/Public-Health-Bioinformatics/rcqc.

Figure: Protégé ontology editor view of GenEpiO assembly quality control terms

Figure: RCQC recipe for text-mining flash.log, shown with FLASH, FastQC and CheckM report files and the JSON-LD and HTML report output
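One way a scripting layer can reuse Python's `math` and `operator` functions, and let formulae change without any rebuild, is to look functions up by name and evaluate rules stored as data. The sketch below illustrates that general technique under stated assumptions; it is not taken from the RCQC source. The `build_namespace`, `call` and `evaluate` helpers, the nested-tuple rule format and the genome size figures are invented for this example.

```python
import math
import operator


def build_namespace():
    """Collect public callables from the math and operator modules by name."""
    namespace = {}
    for module in (math, operator):
        for name in dir(module):
            attr = getattr(module, name)
            if not name.startswith("_") and callable(attr):
                namespace[name] = attr
    return namespace


FUNCTIONS = build_namespace()


def call(name, *args):
    """Look up a function by its text name and apply it to the arguments."""
    if name not in FUNCTIONS:
        raise KeyError("unknown function: " + name)
    return FUNCTIONS[name](*args)


def evaluate(expr):
    """Recursively evaluate a nested (function_name, arg, ...) tuple."""
    if isinstance(expr, tuple):
        name, *args = expr
        return call(name, *(evaluate(arg) for arg in args))
    return expr


# Hypothetical rule held as data: is assembled size / expected size below 0.9?
rule = ("lt", ("truediv", 4800000, 5000000), 0.9)
print(evaluate(rule))
```

Because the rule is only data, a user can change the threshold or swap `lt` for another comparison in a text file and rerun, with no change to the interpreter code.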