Bioinformatics Resource Centers for Biodefense and Emerging/Re-Emerging Infectious Diseases (BRC) INFLUENZA RESEARCH DATABASE (IRD) AND VIRUS PATHOGEN RESOURCE (VIPR) COMPENDIUM OF BRC SYSTEM VERSION 4.0 Performance Period: September 15, 2014- September 14, 2019 Developed Under Contract Number: HHSN272201400028C Delivered: October 30, 2018 Project Sponsor: National Institutes of Health (NIH) National Institute of Allergy and Infectious Diseases (NIAID) Division of Microbiology and Infectious Diseases (DMID) Prepared by: Civilian Agencies Group Health Solutions 2101 Gaither Rd, Suite 600 Rockville, Maryland 20850 (404) 414-0925 fax: (301) 527-6401 [email protected]



Compendium of BRC System V4

Table of Contents

1.0 Introduction
    1.1 Scope and Purpose
    1.2 Identification

2.0 Hardware Configurations
    2.1 Development Environment
    2.2 Production and Test Environments

3.0 Database Specification and Architecture
    3.1 Production Database Architecture
    3.2 Database Configuration
        3.2.1 Staging Database
        3.2.2 Production Data Warehouse
        3.2.3 Workbench Database
        3.2.4 Data Load and Integration
        3.2.5 Database Management Tools
    3.3 Key Database Features used in the Current Database Configuration
    3.4 Data
        3.4.1 Supported Data Types
        3.4.2 Lookup Data
    3.5 Database Population and Refresh Procedures
    3.6 Database Backup Procedures

4.0 Tools Used by IRD and ViPR BRC: Commercial, Open-source, and Tools Developed by the Scientific Community

5.0 System Web Interface Architecture
    5.1 Web Interface Overview
    5.2 Use-Case View
    5.3 Logical View
    5.4 Open Source Library / Software
    5.5 Release Testing
    5.6 System Security
        5.6.1 Security Scanning
        5.6.2 Conversion to https

6.0 Web Interface User Operations
    6.1 Search operations
        6.1.1 Search sequences (IRD and ViPR)
        6.1.2 Search surveillance data (IRD)
        6.1.3 Search epitopes (IRD and ViPR)
        6.1.4 3D Protein structures (IRD and ViPR)
        6.1.5 Phenotype (IRD)
        6.1.6 Human clinical metadata (IRD and select ViPR taxa)
        6.1.7 Serology experiments (IRD)
        6.1.8 Sequence Feature Variant Types (IRD and select ViPR taxa)
        6.1.9 PCR Primer Probe Data (IRD)
        6.1.10 Host Factor Data (IRD and ViPR)
        6.1.11 Antiviral Reagents (IRD and select ViPR and non-ViPR taxa)
        6.1.12 Laboratory Experiments (IRD)
        6.1.13 WHO Influenza vaccine strains (IRD)
        6.1.14 Protein Domains and Motifs (ViPR)
        6.1.15 Ortholog groups (selected ViPR taxa)
    6.2 Analyze and Visualize
        6.2.1 Identify Similar Sequences (IRD and ViPR)
        6.2.2 Align Sequences (IRD and ViPR)
        6.2.3 Visualize Aligned Sequences (IRD and ViPR)
        6.2.4 Identify Short Peptides in Proteins (IRD and ViPR)
        6.2.5 Identify Point Mutations in Proteins (IRD)
        6.2.6 Analyze Sequence Variation (SNP) (IRD and ViPR)
        6.2.7 Generate Phylogenetic Trees (IRD and ViPR)
        6.2.8 Metadata Sequence Analysis (IRD and ViPR)
        6.2.9 Annotate Nucleotide Sequences (IRD)
        6.2.10 Identify Sequence Features in Segments (IRD)
        6.2.11 Antiviral Resistance Risk Assessment Tool (select ViPR taxa)
        6.2.12 Sequence Format Conversion (IRD and ViPR)
        6.2.13 Genome Annotator (GATU) (ViPR)
        6.2.14 Pandemic H1N1 Classification (IRD)
        6.2.15 HPAI H5N1 Clade Classification (IRD)
        6.2.16 US and Global Swine H1 Clade Classification (IRD)
        6.2.17 PCR Primer Design (IRD and ViPR)
        6.2.18 HA Subtype Numbering Conversion (IRD)
        6.2.19 Genotype Recombination-Detection (selected ViPR taxa)
        6.2.20 Rotavirus A Genotype Determination (selected ViPR taxa)
        6.2.21 View Genomes in GBrowse (selected ViPR taxa)
    6.3 Workbench
        6.3.1 Working Sets
        6.3.2 Searches
        6.3.3 Analysis Tool Results
        6.3.4 Uploaded Data Files
        6.3.5 Organization and Management of the Workbench

1.0 INTRODUCTION

1.1 SCOPE AND PURPOSE

The Bioinformatics Resource Center for Infectious Diseases - Virus (Virus-BRC) contract is the renewal of the ViPR BRC contract. The Northrop Grumman BRC Team (Northrop Grumman - J. Craig Venter Institute - Vecna) developed the Influenza Research Database (IRD) under BRC Contract No HHSN266200400041C from 2004 to 2013 and the Virus Pathogen Resource (ViPR) under Contract No HHSN272200900041C from 2009 to 2014. In 2013, the ViPR contract was modified to include support for IRD through the end of September 2014. The present contract extends support for both the IRD and ViPR resources through September 14, 2019. IRD and ViPR share a common hardware platform, database schema, and software architecture as well as Principal Investigator (PI), scientific support team and development team.

The scope of the Virus-BRC contract is to provide for facilities, equipment, qualified personnel, and all necessary resources and services to collect, archive, update, integrate, and maintain genomics and other types of data in support of research on human pathogenic Category A through C viruses, and to provide for query, analysis and display of such information through user friendly interfaces and computational analysis tools freely available to the scientific community. The ultimate goal is to provide resources to the research community to facilitate development of vaccines, diagnostics and therapeutics for these viral pathogens.

IRD is maintained as a distinct web-based, curated, stable, relational database to collect, store, view, display, annotate, query, and analyze genomic and related data and bibliographic information, providing a robust and user friendly resource for the scientific influenza virus research community. IRD provides a comprehensive genomic and proteomic data repository for influenza virus research data as well as an analysis platform supported by appropriate tools to facilitate all types of influenza research.

ViPR provides a comprehensive repository for diverse data types related to families of single-stranded and double-stranded RNA and DNA viruses that are pathogenic to humans and pose a threat to public health:

Arenaviridae

Caliciviridae

Coronaviridae

Filoviridae  

Flaviviridae

Hantaviridae

Hepeviridae

Herpesviridae

Nairoviridae

Paramyxoviridae

Peribunyaviridae

Phenuiviridae

Picornaviridae

Poxviridae

Reoviridae

Rhabdoviridae

Togaviridae

ViPR also provides an analysis platform and appropriate tools to facilitate genomic, proteomic and other types of studies on these pathogens. It was designed with the capability to focus resources and activities on particular human pathogens and to respond to immediate and long-term changes in priorities, in the NIAID Category A-C Priority Pathogens list, and in the emerging/re-emerging infectious disease pathogen lists. This is demonstrated by the rapid development and deployment of resources dedicated to contemporary outbreak situations: Ebolavirus and Enterovirus D68 in the first year of the current contract, Zika virus in the second year, and Lassa virus in the past year. Currently the team is rebuilding the taxonomy and architecture of viruses in the former Bunyaviridae family, following NCBI's establishment of a new Bunyavirales order and its assignment of 9 new or existing families to that order. The Contracting Officer’s Technical Representative (COTR) provides guidance regarding the type of bioinformatics support required for these pathogens as needs arise.

IRD and ViPR also serve as repositories of data from other NIAID-funded research programs: (i) influenza surveillance and laboratory data from the Centers of Excellence in Influenza Research and Surveillance (CEIRS) program, (ii) changes in intra- and extracellular host factors in response to viral infections from the Systems Biology for Infectious Diseases Research program, and (iii) human clinical metadata provided mainly by Genome Sequencing Centers (GSCs). IRD and ViPR are the vehicles for integrating these contributions with sequence data and sharing the enriched products with the worldwide virus research community.

1.2 IDENTIFICATION

Northrop Grumman IT provides Virus-BRC development under Contract No HHSN272201400028C.

2.0 HARDWARE CONFIGURATIONS

A single hardware configuration supports both the IRD and ViPR systems. The hardware is divided into the Production, Test, Staging and Development environments. The Production and Test environments support the IRD and ViPR web applications that are exposed to the research community. The Staging environment is used to collect data from public data sources such as NCBI and UniProt and to prepare it for use in the data warehouse supporting Production and Test. The Development environment is used by the development team during development of new software and the preparation of new data for release to the Production environment. The Production and Test environments consist of two identical hardware configurations that provide both a complete backup environment, and a Test environment that is identical to the Production environment. Having identical hardware configurations for both Production and Test allows us to deploy a fully tested software and data release with little or no system downtime. While the Development environment continues to reside in the Rockville, MD Northrop Grumman facility, the Production, Test and Staging environments are all hosted on the Amazon Web Services (AWS) cloud hosting facility (East).

2.1 DEVELOPMENT ENVIRONMENT

The Development environment makes heavy use of “Virtual Machines” (VMs) to optimize use of physical servers.

Figure 2-1 Architectural drawing of Development Environment

Primary components of the Development environment, as illustrated in Figure 2-1, include:

One web server (VM)

One application server to run the ViPR/IRD website (VM)

One online analysis server to perform user initiated bioinformatics processes

An R&D Database Server with a single R&D database instance

A hardware firewall to protect the Development environment from unauthorized intrusion

High speed switches and data transfer buses to move data among the components

A high speed fibre connection to the high capacity fibre storage array that houses the Development database

An internal network used by members of the technical team

The JIRA change tracking system resides in a cloud-hosted environment managed by Atlassian, the product vendor.

As indicated above and in Figure 2-1, we use VMs extensively in the Development environment. This approach makes efficient use of the hardware and provides great flexibility. The three physical servers configured as multiple VMs are:

VM Server 1 – 2X Dual Core Processors, 16GB RAM, 1.2TB Storage (RAID5)

VM1 – Development Web Server

VM Server 2 – 2X Quad Core Processors, 32GB RAM, 500GB Storage (RAID1)

VM1 – Development Application Server

VM2 – Build Server

VM Server 3 – 2X Quad Core Processors, 48GB RAM, 2TB Storage (RAID5)

VM 1-6 – Development Analysis Servers


2.2 PRODUCTION AND TEST ENVIRONMENTS

Following receipt of approval from the NIAID CIO on May 5, 2016 to migrate the Production and Test environments to the Amazon Cloud (AWS) platform, migration began in June 2016 and was completed on July 21, 2016. Within AWS the Virus-BRC systems run in their own independent Virtual Private Cloud (VPC). The VPC contains separate public and private address spaces. The web servers exist within the public address space (equivalent to a DMZ), while the application servers, data analysis servers and RDBMS servers exist in the private address space. Separating the public-facing components from the components in the private address space provides a high level of security for the critical components. All servers, whether in the public or private address space, can only be accessed through defined security groups and host-based firewall (iptables) rules.
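The public/private separation described above can be sketched as a small default-deny reachability check. This is a minimal illustration only, not the deployed configuration: the address ranges, tier names and the application port are hypothetical, since this document does not state them. Port 1521 is the default Oracle listener port.

```python
import ipaddress

# Hypothetical address plan; the real VPC ranges are not given in this document.
PUBLIC_SUBNET = ipaddress.ip_network("10.0.1.0/24")   # web tier (DMZ equivalent)
PRIVATE_SUBNET = ipaddress.ip_network("10.0.2.0/24")  # app, analysis and RDBMS tiers

# Security-group style allow rules: (source network, destination tier, port).
# Anything not listed is denied.
ALLOW_RULES = [
    (ipaddress.ip_network("0.0.0.0/0"), "web", 443),  # internet -> web servers
    (PUBLIC_SUBNET, "app", 8080),                     # web tier -> app servers (port assumed)
    (PRIVATE_SUBNET, "db", 1521),                     # private tier -> Oracle listener
]


def is_allowed(source_ip, dest_tier, port):
    """Return True only if an explicit rule permits the connection."""
    src = ipaddress.ip_address(source_ip)
    return any(
        src in network and tier == dest_tier and rule_port == port
        for network, tier, rule_port in ALLOW_RULES
    )
```

Under these assumed rules an external client can reach a web server on port 443 but cannot open a connection to the database tier, which is the property the public/private split is meant to provide.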

On AWS all of the servers are virtual EC2 machines. EBS volumes are attached to individual EC2 machines to provide the required storage space. EBS volumes are provided in two forms: high speed Solid State Drives (SSD) and magnetic disks. SSD volumes are used primarily for database servers, application servers and data analysis servers. AWS S3 storage is used for data backup and document archiving.

AWS CloudWatch is used to monitor the EC2 servers and the AWS infrastructure. Kerio Connect is used to send automated email notifications to system administrators, database administrators and users.

The Virus-BRC system architecture on the AWS Cloud is shown in Figures 2-2a and 2-2b. Figure 2-2a shows the overall configuration while Figure 2-2b provides a detailed diagram of the hardware components comprising Availability Zone 1 (Production).


Figure 2-2a and 2-2b: Architectural drawing of Test and Production environments on AWS

As shown in Figure 2-2a and 2-2b, the following components make up each instance (Prod1 and Prod2) of the Production environment:

An ELB-configured EC2 web server virtual machine. Web access requests are directed to the application servers in the appropriate environment. AWS instance type t2.medium servers (2 CPU/4 GB RAM) are used as the web servers.

An ELB-configured application server runs the ViPR and IRD website software. AWS instance type m4.4xlarge servers (16 CPU/64 GB RAM) are used as the application servers. These servers serve as the Prod1 and Prod2 environments, which allow quick deployments by pointing the web server to the application server with the latest software updates.

An ELB-configured online analysis server is used to remove heavy processing loads from the application servers to maintain overall system performance. AWS instance type m4.10xlarge servers (64 CPU/160 GB RAM) are used as the analysis servers. Batch processing and user-initiated bioinformatics processes are directed to the online analysis servers.

An EC2 server using the Oracle 12c RDBMS tool is configured as the database server. AWS instance type i2.2xlarge servers (8 CPU/61 GB RAM) are used as the database servers. Database servers are configured with Solid State Drives (SSD) to provide enhanced I/O capabilities. More information about the database configuration is provided in Section 3 “Database Specification and Architecture”.

The workbench database is a separate permanent database residing on an EC2 server which is supported by a Staging database and synced to the Standby database.

In Figure 2-2b we represent the two stacks (prod1 and prod2) by the presence of two each of App and Analysis servers and Data Warehouses. We use either Prod1 or Prod2 (whichever is currently the non-active production environment) as the Staging and Test environment. This allows us to fully utilize all of the available servers in the AWS cloud. The Staging database is used for the preparation of the incoming data for the next release and is synced with the Standby database.
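The Prod1/Prod2 arrangement described above resembles a two-stack cut-over, which can be sketched as follows. This is a minimal model under stated assumptions, not the team's deployment tooling: the class names, method names and release labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stack:
    name: str
    release: str  # software/data release currently installed on this stack


class Deployment:
    """Two identical stacks; the web tier points at exactly one of them."""

    def __init__(self, prod1, prod2):
        self.stacks = {"prod1": prod1, "prod2": prod2}
        self.active = "prod1"

    @property
    def standby(self):
        # The non-active stack doubles as the Staging and Test environment.
        return "prod2" if self.active == "prod1" else "prod1"

    def stage_release(self, release):
        # New releases are installed and tested on the non-active stack only.
        self.stacks[self.standby].release = release

    def cut_over(self):
        # Repoint the web tier; the previously active stack becomes Staging/Test.
        self.active = self.standby
        return self.active
```

For example, after `stage_release("4.1")` the active stack still serves the old release; only `cut_over()` exposes the new one, which is why a release can go live with little or no downtime.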

3.0 DATABASE SPECIFICATION AND ARCHITECTURE

We currently have Development, Test and Production database environments. The Test and Production environments use exactly the same configuration, both to avoid incompatibilities between the two and to allow smooth deployment of data and software releases from Test to Production. The Development environment uses a slightly lower-capacity server but maintains exactly the same software configuration.

3.1 PRODUCTION DATABASE ARCHITECTURE

The ViPR and IRD database is deployed on a highly available and scalable Oracle enterprise architecture using Oracle Automatic Storage Management (ASM). At present, the total available database size is 10TB including staging, lookup, and production as well as other auxiliary datasets. Each database node is a Linux-based AWS i2.2xlarge instance type server with 8 CPUs and 61GB RAM using solid state storage.

A combination of Oracle and AWS tools, including Elastic Block Store (EBS), and the performance-enhancing features provided by Oracle 12c, including In-Memory and Multi-Tenant, is used in the AWS cloud configuration to improve overall database performance and throughput. Section 3.2.5 lists the database tools and features used in the AWS configuration.

3.2 DATABASE CONFIGURATION

The database is organism independent and divided into three components to protect the read-only production data warehouse and to ensure minimal interruptions in operation of IRD and ViPR website applications while refreshing data.

3.2.1 Staging Database

The IRD/ViPR database is populated with data downloaded from public data sources, data generated by the IRD/ViPR team, and data submitted by IRD/ViPR users. Downloaded and submitted data is initially placed on the Staging database. Various types of automated data curation, data cleansing, and data validations are performed here to ensure data integrity and quality before it is prepared for transfer to the Production data warehouse.

The Staging database uses a microorganism independent infrastructure that was originally adapted from the Genome Unified Schema (GUS) developed at the Computational Biology and Informatics Laboratory at the University of Pennsylvania. The GUS infrastructure provides storage of genomic sequence data and its annotation along with all the available descriptive information (e.g. literature references, author names, notes and comments) for virus sequences downloaded from NCBI GenBank. It maximizes the usability of data through use of a set of schemas that integrates genome, protein, transcriptome, gene regulation and networks, ontologies and controlled vocabularies, and gene expression. Enhancements were made by the IRD/ViPR team to store new data types not supported by GUS, including pre-computed and runtime-generated enrichment/value-added information such as BLAST results, functional motifs, ORF predictions, ortholog predictions, protein annotation, predicted CTL epitopes, comparative genomics, manual and automated annotation and curation data.

3.2.2 Production Data Warehouse

This read-only database stores highly curated data of production quality. The Production warehouse is populated using Extract-Transform-Load (ETL) scripts that extract data from various internal databases, including the Staging database, and apply aggregation, integration and data validation rules before updating the Production data warehouse. Data in the Production data warehouse is available to external users via the IRD and ViPR web application interfaces.

3.2.3 Workbench Database

IRD and ViPR applications allow users to search for, select, combine/subtract/intersect and store various supported data types for user-specific analysis and research purposes. The application stores all such selected datasets persistently in the “Workbench” database. Genomic and proteomic datasets stored in this database as “working sets” are simply pointers to details stored in the Production data warehouse. This database also stores results from analyses run using IRD/ViPR tools and user-submitted data sets. The data in this database is protected by a separate passive Standby database that utilizes the Oracle 12cR1 Data Guard tool for data synchronization between the two.
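The idea that working sets are pointers, and that they can be combined, subtracted and intersected, can be illustrated with plain sets of identifiers. This is a sketch under stated assumptions: the accession strings and the in-memory `WAREHOUSE` dictionary are hypothetical stand-ins for records held in the Production data warehouse.

```python
# Hypothetical warehouse records keyed by accession; the values stand in for
# the full details that live in the Production data warehouse.
WAREHOUSE = {
    "ACC0001": {"family": "Flaviviridae"},
    "ACC0002": {"family": "Flaviviridae"},
    "ACC0003": {"family": "Togaviridae"},
    "ACC0004": {"family": "Filoviridae"},
}


def resolve(working_set):
    """Look up the full records for the identifiers held in a working set."""
    return {acc: WAREHOUSE[acc] for acc in sorted(working_set)}


# A working set stores identifiers only, never copies of the records.
set_a = frozenset({"ACC0001", "ACC0002", "ACC0003"})
set_b = frozenset({"ACC0002", "ACC0003", "ACC0004"})

combined = set_a | set_b      # combine
subtracted = set_a - set_b    # subtract
intersected = set_a & set_b   # intersect
```

Because only identifiers are stored, a working set stays small and always resolves to the current warehouse record when it is opened.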

3.2.4 Data Load and Integration

Figure 3-1 below illustrates the flow of data from external data sources through the Staging database to the Production data warehouse, where it becomes available for use by the IRD and ViPR applications.

Figure 3-1 Flow of external data into Production data warehouse
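The flow from staged data to the warehouse can be sketched as a small extract, validate, transform and load loop. This is an illustrative sketch only: the production ETL is implemented in Oracle (see Table 3-1), and the record fields, validation rules and sample records here are hypothetical.

```python
# Hypothetical staged records; field names are illustrative only.
STAGED = [
    {"accession": "ACC0001", "source": "GenBank", "sequence": "atggcc"},
    {"accession": "ACC0002", "source": "user", "sequence": "ATGXQZ"},
    {"accession": "", "source": "GenBank", "sequence": "ATGAAA"},
]

VALID_BASES = set("ACGTUNRYKMSWBDHV")  # IUPAC nucleotide codes


def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if not record["accession"]:
        problems.append("missing accession")
    if not set(record["sequence"].upper()) <= VALID_BASES:
        problems.append("non-nucleotide characters in sequence")
    return problems


def transform(record):
    """Derive the warehouse row: normalize case and precompute the length."""
    seq = record["sequence"].upper()
    return {**record, "sequence": seq, "length": len(seq)}


def run_etl(staged):
    """Load only records that pass validation; keep the rest for review."""
    warehouse, rejected = [], []
    for record in staged:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            warehouse.append(transform(record))
    return warehouse, rejected


warehouse, rejected = run_etl(STAGED)
```

Keeping rejected records alongside the reason for rejection, rather than silently dropping them, mirrors the role of the Staging database as the place where cleansing happens before anything reaches the read-only warehouse.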

3.2.5 Database Management Tools

The Oracle tools enumerated in Table 3-1 are used for managing IRD and ViPR data.

Table 3-1 Oracle tools used

Key Oracle features/tools used in the Database configuration | Purpose

Oracle Text Indexing | Used for providing global keywords based database search. Indexing also supports Boolean searches.

RMAN and Oracle Secure Backup | Database backup

SQL*Loader | Data Load

XMLDB | To parse and store XML data

Oracle Partitioning | Partitioning enables us to maintain the logical division between the IRD and ViPR data, in addition to providing greater performance in data retrieval and better data management. The database uses list and reference partitioning to gain performance advantages.

Virtual Columns | Oracle virtual columns are used generously in order to split the single column’s data in different ways (without adding any hardware and data management overheads) to provide better data sorting and search facility.

EXPDP and IMPDP | These tools are used for taking logical backups and restoring the logical backups.

Materialized view with Query Rewrite option | Merge of data sets which are distributed across various schemas/tables to improve performance.

Oracle EM 12c Cloud Manager | Database administration and monitoring

Oracle Data Guard 12cR1 | Standby database management

Views and PL/SQL Packages, Procedures and Functions | ETL scripts

Database Links | To establish communication between various databases configured across multiple servers

In-Memory | Store data in random access memory for performance improvement of database queries

Multi-tenancy | To consolidate various databases and share server resources to ensure optimal use of database server resources.

Solid State Drives (SSD) and Higher IOPS (AWS infrastructure features) | Improvement in the RDBMS performance

Oracle Automatic Storage Management (ASM) | ASM is an integrated, high-performance database file system and disk manager. Provides Striping, Mirroring, Online storage reconfiguration and dynamic rebalancing, and Managed file creation and deletion. ASM allows the database to manage storage automatically instead of requiring an administrator to do it. ASM eliminates the need to directly manage potentially thousands of Oracle database files.
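Two of the features in Table 3-1, list partitioning and virtual columns, can be illustrated conceptually in a few lines. This is a sketch of the ideas only, not Oracle syntax or the actual schema: the partition names, the partition key values, and the choice of a strain-name column as the one being split are all assumptions.

```python
import re
from collections import defaultdict

# List partitioning: each partition owns an explicit list of key values.
# These lists are assumptions; the document only states that partitioning
# keeps IRD and ViPR data logically separate.
PARTITION_LISTS = {
    "p_ird": {"Orthomyxoviridae"},
    "p_vipr": {"Flaviviridae", "Filoviridae", "Coronaviridae"},
}

SUBTYPE_PATTERN = re.compile(r"\((H\d+N\d+)\)$")
YEAR_PATTERN = re.compile(r"/(\d{4})(?:\(|$)")


def partition_for(family):
    """Route a row to the partition whose value list contains its key."""
    for name, families in PARTITION_LISTS.items():
        if family in families:
            return name
    raise ValueError(f"no partition accepts family {family!r}")


class Row:
    """One stored column (strain_name) exposed in several derived ways."""

    def __init__(self, family, strain_name):
        self.family = family
        self.strain_name = strain_name

    @property
    def subtype(self):
        # Computed on read and never stored: the analogue of a virtual column.
        match = SUBTYPE_PATTERN.search(self.strain_name)
        return match.group(1) if match else None

    @property
    def year(self):
        # A second derived view of the same stored value.
        match = YEAR_PATTERN.search(self.strain_name)
        return int(match.group(1)) if match else None


partitions = defaultdict(list)
for row in [
    Row("Orthomyxoviridae", "A/California/07/2009(H1N1)"),
    Row("Flaviviridae", "hypothetical-isolate/2016"),
]:
    partitions[partition_for(row.family)].append(row)
```

A query restricted to one resource only needs to scan that resource's partition, and sorting or filtering by `subtype` or `year` does not require storing those values separately.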

3.3 KEY DATABASE FEATURES USED IN THE CURRENT DATABASE CONFIGURATION

IRD and ViPR share the same database schemas and data models across the Production, Test and Development environments. The database design and data models are microorganism independent, stable, and designed to store, process, retrieve, and manage terabytes of data. Data are partitioned and distinguished based on key data elements like family, genus and species across all data models. This configuration enables the system to separate and distribute resource-intensive and time-consuming tasks such as raw data capture, loading, processing and data integration onto different databases without impacting the performance and functioning of other databases or the performance of the web applications. The Staging area is fully normalized with no redundancy and tuned to provide optimal performance in loading, updating, cleansing and validating raw data from all external sources. The Production data warehouse area uses a de-normalized star schema design that incorporates data redundancy to provide optimal performance in returning data from search requests by a web application. This database configuration provides superior overall performance in the Production warehouse by isolating the impact of processing data loads to the Staging database.
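The trade-off between the normalized Staging area and the de-normalized warehouse can be shown with a small denormalization example. It uses SQLite rather than Oracle so that it is self-contained, and it flattens two tables into one wide table rather than reproducing the actual star schema; the table names, column names and accession values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Staging side: normalized, each taxonomy fact is stored once.
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY, family TEXT, genus TEXT, species TEXT
    );
    CREATE TABLE sequence (
        seq_id INTEGER PRIMARY KEY,
        taxon_id INTEGER REFERENCES taxon (taxon_id),
        accession TEXT
    );

    INSERT INTO taxon VALUES (1, 'Filoviridae', 'Ebolavirus', 'Zaire ebolavirus');
    INSERT INTO sequence VALUES (10, 1, 'ACC0001'), (11, 1, 'ACC0002');

    -- Warehouse side: taxonomy is repeated on every row so searches need no join.
    CREATE TABLE wh_sequence AS
        SELECT s.accession, t.family, t.genus, t.species
        FROM sequence AS s JOIN taxon AS t USING (taxon_id);
    """
)

# A search against the warehouse table filters directly on the redundant columns.
rows = conn.execute(
    "SELECT accession FROM wh_sequence WHERE genus = ? ORDER BY accession",
    ("Ebolavirus",),
).fetchall()

# The redundancy is visible as the same family value stored once per sequence.
family_copies = conn.execute(
    "SELECT COUNT(*) FROM wh_sequence WHERE family = 'Filoviridae'"
).fetchone()[0]
```

The join cost is paid once at load time instead of on every search, at the price of storing the taxonomy redundantly, which is acceptable for a read-only warehouse.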

The data models can be easily extended to accommodate new data types and data from new sources. Open source tools are carefully combined with the Oracle database and its tools, a best-of-breed commercial relational database management system, to provide an industrial-strength, technologically advanced, and performance-efficient platform. We also use open source architectural tools and data models developed by the functional genomics community. The infrastructure is designed for scalability, high availability, and extensibility, and each component of the architecture is tightly integrated with the others. Additionally, the environment is tuned to leverage Oracle 12cR1's distributed transaction and parallel processing capabilities to read and write data in multiple threads simultaneously for optimal resource utilization and efficiency gains.


3.4 DATA

3.4.1 Supported Data Types

At present the database supports:

Data from various publicly accessible or proprietary databases (e.g. GenBank, UniProt, IEDB, PDB) and information contributed by BRC collaborators (e.g. Genome Sequencing Centers, the Systems Biology Projects, the CEIRS Program)

Data provided by outside investigators and laboratories (clinical data, surveillance data, experimental research data, gene expression data)

Data submitted directly via web submissions by research scientists (e.g. genomic sequences, sequence features)

Data/results extracted from scientific literature

Data generated internally by IRD and ViPR curation and annotation tools (e.g. Influenza segment autocuration, Influenza H5N1 Clade classification, PA-X and other influenza protein computations, computation of mature peptides from polyproteins, genotype determination, recombination detection, etc.)

Table 3-2 enumerates the data types included in IRD and ViPR databases. The table also shows the source of each data type, refresh cycle, tools used to handle each data type, and the URL link to an appropriate SOP.

Table 3-2 – IRD and ViPR data types

Data Types | Source | IRD/ViPR/Both | Refresh Cycle | Tools/Script Name | SOP Link/Source URL

Genomes, segments

GenBank Both IRD: Daily; ViPR: Weekly (Zika virus, Picornaviridae, Filoviridae); Bi-monthly (all other ViPR families)

Tools: GUS3.5, BioPERL

Scripts: GUS/IRD: influenza_daily.sh; VIPR:/loadBRC -m NCBI::GENOME

ftp://ftp.ncbi.nih.gov/, http://eutils.ncbi.nlm.nih.gov/, ftp://ftp.ncbi.nih.gov/genbank/daily-nc/nc$nc_mmdd.flat.gz

Uniprot Both Every 3 months UniProtSplitFileParse.sh www.uniprot.org

Active site Both Bi-monthly PERL loadBRC -m ACTIVESITE www.uniprot.org

Immune Epitope

IEDB Both Bi-monthly PERL loadBRC -m IEDB -d -p -l

www.iedb.org

Pfam Domains (InterProScan)

IRD/VIPR Both Bi-monthly 1. $BIN_ROOT/runController.pl -P interproscan::flu::Config, $BIN_ROOT/runController.pl -P interproscan::flu::tmhmmConfig, $BIN_ROOT/runController.pl -P interproscan::flu::moveConfig

http://www.ViPRbrc.org/brcDocs/documents/VIPR_INTERPROSCAN_SOP.pdf

Orthologs (Ortho-MCL)

ViPR ViPR Bi-monthly (currently supports 3 ViPR families: Poxviridae, Herpesviridae, Coronaviridae)

$BIN_ROOT/runController.pl -P orthomcl::Config, $BIN_ROOT/runController.pl -P orthomcl::moveConfig

Orthologs (SOG)

ViPR ViPR Bi-monthly (currently supports 3 ViPR families: Poxviridae, Herpesviridae, Coronaviridae)

https://www.viprbrc.org/brcDocs/documents/SOP_SOG_Orthologs.pdf

Predicted NetCTL

IRD/VIPR Both Bi-monthly ./netctl_influenza.sh --process --repeat --noemail -o tns=BRCSTAGE -o virus_type=flu -o start_date=[2012-04-18] -o end_date=[2012-06-07] --nolock

For noflu: ./netctl_influenza.sh --process --repeat --noemail -o tns=BRCSTAGE -o virus_type=mat -o start_date=[2012-04-18] -o end_date=[2012-06-11] --nolock

./netctl_master1.sh BRCSTAGE

http://www.ViPRbrc.org/brcDocs/documents/VIPR_NETCTL.pdf

Clinical and Experiment Data

DPCC, JCVI, UMB, Broad Institute and various others

Both Upon submission by data providers

PERL runBRC -x BRCPRD2.xml -m CLINICALDATA -o "-c"

Surveillance DPCC, JCVI

Both Upon submission by data providers

Reagent data CEIRS IRD Bi-yearly Tool: SQL*Loader

Protein 3D structure

PDB Both Bi-monthly PERL runBRC -x BRCPRD2.xml -m PDB -o "-c", PERL runBRC -x BRCPRD2.xml -m PDB -o "-p -w"

ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/

PFAM Sanger Both Bi-monthly Tool: SQL*Loader ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/

Sequence Features

IRD/VIPR Both Bi-monthly Tool: ClustalW

Scripts: FLU: PERL loadBRC -m FLU:SFVT -d -p -l; VIPR: ./loadBRC -m NOFLUSFVT -d -p -l

Host Factor Experiment

Systems Biology projects

Both Upon submission by data providers

Antiviral Reagents

Drug Bank, ATC, PDB

Both PERL loadBRC -m DRUGBANK -d -p -l

http://www.drugbank.ca/downloads

Gene Ontology EMBL-EBI

Both Every 6 months Oracle SQL*Loader http://www.ebi.ac.uk/QuickGO/GAnnotation?source=InterPro

NRDB GenBank Both Bi-monthly PERL loadBRC -m NRDB -d -p -l

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

IsoElectric Point and Molecular Weight

IRD/VIPR computation

Both Bi-Monthly For flu: ./iepmw_influenza.sh --process --repeat --nolock --noemail -o tns=BRCSTAGE -o virus_type=flu -o start_date=2012-08-16 -o end_date=2012-10-26 -o v=0 -o faa=yes -o iepmw=yes

For noflu: ./iepmw_master1.sh BRCSTAGE

For Mat: ./iepmw_influenza.sh --process --repeat --nolock --noemail -o tns=BRCSTAGE -o virus_type=mat -o start_date=2012-08-16 -o end_date=2012-11-05 -o v=0 -o faa=yes -o iepmw=yes

http://www.ViPRbrc.org/brcDocs/documents/VIPR_SOP_MW_IP.pdf

Pre-computed Blast (Sequence Similarities)

ViPR VIPR Bi-Monthly ./blastdb_refseq.sh --process --repeat --nolock -o tns=BRCPRD21 -o ahw=[110912]

http://www.ViPRbrc.org/brcDocs/documents/VIPR_BLASTP.pdf

Short Sequence Search DB

IRD /VIPR internal computation

IRD and VIPR

Bi-Monthly For flu: ./proteindb.sh --process --repeat --nolock -o tns=BRCPRD11

For noflu: nohup ./proteindb_master2.sh BRCPRD11

Mature_peptide Annotations

ViPR VIPR Bi-Monthly Tool: BioPerl, clustalw

Script: PERL loadBRC -m MATPEP -d -p -l

Sequence polymorphism data

IRD computation

IRD Bi-Monthly Tools used: clustalw2

Script Name:

1. $BIN_ROOT/runController.pl -P flu::fomaSNP::aaseq::Config

2. $BIN_ROOT/runController.pl -P flu::fomaSNP::aaseq::moveConfig

http://www.fludb.org/brcDocs/documents/IRD_FluPolymorphism.pdf

BlastDB Search IRD/VIPR computation

Both Bi-Monthly For flu: ./blastdb.sh --process --repeat --nolock -o tns=BRCPRD11

For noflu: ./blastdb_master2.sh BRCPRD11

Organism/Species Taxonomy

GenBank Both Daily GUS/IRD: influenza_daily.sh

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

PA-X protein identification

IRD IRD Daily Tools: ClustalW

Scripts: PAXGeneDailyConfig.pm, processGenbank.pl

http://www.fludb.org/brcDocs/documents/PAX_SOP.pdf

H5N1 Clade classification

IRD IRD Daily Tools: Taxit, pplacer, guppy

Scripts: H5N1Classifier.pl, processH5N1ClassifierByFastaFile.pl

http://www.fludb.org/brcDocs/documents/PAX_SOP.pdf

H1N1 pandemic classification

IRD IRD Daily Tool: ClustalW

Script: PERL H1N1Classifier.pl

Genotype determination and recombination detection

VIPR VIPR Bi-Monthly Tool: BioPERL

Script: loadBRC -m GENOTYPE -d -p -l

Influenza variant proteins (PA-N155, PA-N182, PB1-N40, M42, NS3)

IRD IRD Daily BRC Team developed algorithms

Influenza swine H1 clade classification

IRD IRD Daily BRC Team developed algorithms

Rotavirus A segment and genotype determination

VIPR ViPR Bi-monthly Tool: RotaC

Script: Java and PERL

External database integration

OpenFlu IRD Weekly Bash Shell script http://openflu.vital-it.ch/browse.php

External database integration

Empres-i Both Weekly Bash Shell script http://empres-i.fao.org/eipws3g/

External database integration

SSGCID Both Monthly Bash Shell script

Reagent annotations

Geo Both Quarterly SQL*Loader

Lookup data Miscellaneous (Table 3-3)

Both Daily/monthly SQL*Loader

All these data types are supported by data models and schemas that are comprehensive, fully integrated, and highly extensible. They support a wide spectrum of data relevant to targeted infectious diseases and provide the highest levels of data integrity, reliability, performance, scalability (data volume and concurrency), distribution, and interoperability.

3.4.2 Lookup Data

As part of the data curation process, IRD and VIPR databases use lookup tables containing standard values to help ensure that specific data types use consistent values.

Table 3-3 lists the most important lookup data types and their data sources.

Table 3-3 Data validation lookup tables

Data Type | Data Sources | Used by: IRD/ViPR/Both
Country List | NCBI | Both
USA States | NCBI | Both
Sample Types | CEIRS, JCVI | Both
Latitude and Longitude range vs. country mapping | External | IRD
Geographic Region | Internal | Both
Avian Species | AVI Base | IRD
Non-human mammalian Species | | IRD
Curated Host Names | IRD/ViPR (Internal) | Both
Influenza Genome Autocuration | Internal | IRD
Microarray Reagent Information | External | Both
WHO Flu Vaccines | WHO | IRD
PUBMED | NCBI | Both
Taxonomy | NCBI | ViPR
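How a lookup table can enforce consistent values during curation is sketched below. This is a minimal illustration only: the field names, the sample lookup values, and the `validate_record` helper are all hypothetical and do not reflect the actual IRD/ViPR lookup tables or validation scripts.

```python
# Hypothetical lookup sets standing in for lookup tables such as Country List
# and Sample Types; the values are illustrative only.
LOOKUPS = {
    "country": {"USA", "Canada", "Viet Nam"},
    "sample_type": {"Nasal swab", "Serum"},
}


def validate_record(record, lookups):
    """Return (field, value) pairs whose value is absent from the lookup set.

    Fields missing from the record are skipped here; a real pipeline would
    likely treat required-but-missing fields as a separate kind of error.
    """
    errors = []
    for field, allowed in lookups.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            errors.append((field, value))
    return errors
```

A record such as `{"country": "U.S.", "sample_type": "Serum"}` would be flagged on the `country` field, which could then be corrected by curation or reported back to the data provider.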

3.5 DATABASE POPULATION AND REFRESH PROCEDURES

The core of the database is built on genomic DNA or RNA sequence data, which is analyzed and annotated to provide additional information about gene location, gene structure, predicted open reading frames, functionally relevant polymorphisms, bio-markers, three-dimensional structures, domains and motifs, predicted or known biochemical and biological functions, immune epitope predictions, etc. With these genomic data as a foundation, we then integrate all additional data types (see Table 3-2 above) as layers onto the genomic data. These layers include, but are not limited to, gene expression data derived from microarray experiments, phenotypic data of naturally occurring and engineered genetic mutations, drug-bank data showing drug modes of action, virulence data, surveillance data, clinical and vaccine study data, and laboratory assay data.

IRD and ViPR use automated data processing pipelines for populating and refreshing different data types. These pipelines download the data from their respective sources at regular intervals. The data refresh procedure includes not only processing the raw data, but also curating it, both by automated curation scripts (called from within the automated data processing pipeline) and by manual curation. Subsequently, the data pass through rigorous data validation scripts before being integrated with other data types and published to the Production data warehouse.

We also receive updated/new data sets (e.g. influenza surveillance, human clinical metadata, sequence metadata and host-factor) from known data providers on an unscheduled basis. Submitters deposit their data onto our secure sFTP server which is scanned nightly for new submissions. Submissions are then processed by automated data processing jobs. When necessary, manual curation is used to map non-standard data fields to standard templates. Errors are reported to the data providers.
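The step of mapping non-standard submitted fields to a standard template can be sketched as follows. This is an assumption-laden illustration: the alias table, the required-field list, and the `standardize` function are hypothetical and are not taken from the IRD/ViPR submission templates.

```python
# Hypothetical mapping from submitted column names to template field names.
FIELD_ALIASES = {
    "collection date": "collection_date",
    "collection_date": "collection_date",
    "host": "host_species",
    "host species": "host_species",
    "country": "country",
}

# Hypothetical set of fields the template requires.
REQUIRED = ("collection_date", "host_species", "country")


def standardize(row):
    """Map one submitted row onto the template.

    Returns (clean_row, unmapped, missing): unmapped column names would go to
    manual curation, and missing required fields would be reported back to
    the data provider.
    """
    clean, unmapped = {}, []
    for key, value in row.items():
        standard_name = FIELD_ALIASES.get(key.strip().lower())
        if standard_name is None:
            unmapped.append(key)
        else:
            clean[standard_name] = value
    missing = [name for name in REQUIRED if name not in clean]
    return clean, unmapped, missing
```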

The data processing pipeline uses several protocols (e.g. FTP, HTTP, HTTPS, and Secure FTP) to download or upload data and supports a wide variety of data formats (XML, Excel, CSV, GenBank, GFF3, FASTA, etc.). A major strength of our data processing pipeline is the capability to automatically poll familiar sources and detect whether new or updated data has been released. It then downloads data incrementally or in total from these external sources and processes the data using ETLs to store and integrate with other data types. We use separate loading modules for each data type, which can be scheduled to run at any time and independently of each other. These modules can run in parallel, helping to optimize use of resources and allowing efficient processing of high data volumes. Some of the less resource intensive but time consuming data processing pipelines (for example, H5N1 clade classification, Sequence Feature Variant Type data computation) run in Virtual Server environments in order to take advantage of parallel processing.
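Two ideas from the paragraph above, detecting whether a source has released new data and running independent loading modules in parallel, can be sketched in a few lines. The digest comparison is one possible change-detection strategy chosen for illustration (the actual pipeline may rely on timestamps or release files instead), and the function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def has_changed(payload, last_digest):
    """Compare the SHA-256 digest of downloaded bytes with the stored digest.

    Returns (changed, new_digest) so the caller can persist the new digest.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return digest != last_digest, digest


def run_modules(modules, max_workers=4):
    """Run independent loader callables in parallel.

    `modules` maps a module name to a zero-argument callable. A failure in
    one module is recorded and does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in modules.items()}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:  # keep going; report the error per module
                results[name] = ("failed", str(exc))
    return results
```

Because each module is submitted separately, a slow or failing loader for one data type does not block the loaders for the others.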

The data loading modules use Perl, Java, and Linux-based Bourne Shell scripting language as well as SQL*Loader and Oracle’s DOM/SAX parser and PL/SQL to load data into the database. The genome sequence data loading module uses a GUS-provided GenBank parser in Perl.

These loading modules are plugged into the underlying framework by means of configuration files. The framework handles the basic features of task scheduling, tracking, monitoring, error handling, and alerting. Because the multiple data sources provide data in different formats and use different data models, we first load the data into our Staging database, where a series of data cleansing, transformation, and translation scripts convert the data into an integrated data model before it is published to the Production data warehouse.
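A configuration-driven arrangement of this kind can be sketched as a registry of named steps selected by a configuration file. Everything here is hypothetical: the JSON layout, the step names, and the `loader` decorator illustrate the pattern only and are not the actual BRC framework, which uses Perl, Java, and shell modules.

```python
import json

# Registry of available steps, filled in by the @loader decorator below.
LOADERS = {}


def loader(name):
    """Register a function under a name that configuration files can refer to."""
    def register(fn):
        LOADERS[name] = fn
        return fn
    return register


@loader("drop_empty_accession")
def drop_empty_accession(rows):
    """Cleansing step: discard staged rows that have no accession."""
    return [row for row in rows if row.get("accession")]


@loader("uppercase_country")
def uppercase_country(rows):
    """Transformation step: normalize the country field to upper case."""
    return [{**row, "country": row["country"].upper()} for row in rows]


def run_pipeline(config_text, staged_rows):
    """Apply the steps named in the configuration, in order, to staged rows."""
    config = json.loads(config_text)
    rows = staged_rows
    for step_name in config["steps"]:
        rows = LOADERS[step_name](rows)
    return rows
```

Adding a new data type in this sketch means registering new step functions and writing a new configuration, without changing `run_pipeline` itself.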

Data refresh schedules as well as names of scripts responsible for data refresh for each of the data types can be found in Table 3-2 above.

Overall, the current data population and refresh procedures, in combination with scalable and flexible database architecture, are designed and tuned to:

Support massive data downloads in heterogeneous data formats from external data sources, in order to leverage scientific work already done on targeted organisms

Run resource intensive prediction algorithms and data analysis tools to add more enrichment and annotation

Integrate large amounts of complex data from many sources

Store non-conventional complex data types (e.g. Sequence Features, topology, 3D structures, images and data streams) in addition to conventional data types

Provide ample room to handle future growth in data volume, increases in use of resource intensive prediction algorithms, and increases in the number of concurrent users

Utilize and support acquisition and integration of technologically advanced commercial and open source tools, including tools developed by the scientific community

A complete listing of tools used by ViPR and IRD in data analysis from all sources is found in Table 4-1 in Section 4.

3.6 DATABASE BACKUP PROCEDURES

IRD and ViPR use industry standard secure database backup and recovery procedures utilizing RMAN, the advanced Oracle 12c backup and recovery tool. On AWS, the database backups are automatically transferred to the AWS S3 storage immediately after the backup task finishes. In addition, we provide extra data protection by including a Standby database server using the Oracle 12c Data Guard tool. Our backup routines support both hot and cold backups and are designed to run automatically. During the third contract year, the Oracle Secure Backup utility was added to the toolset used for database backup and recovery.

The Workbench (Section 6.3) database archive logs are backed up on to the S3 storage as well as applied continuously to a Standby database in order to protect current user data in the event of a system crash.

For data types and schemas that change frequently (e.g., user workbenches, daily-updated genomic data), our automatic processes perform full backups on a weekly basis and supplement these with daily differential or incremental backups. Data types and schemas that remain unchanged between release cycles are backed up immediately prior to going live with a release. Tape storage media are used, and the media are stored both onsite and at an offsite location.

Additionally, we take logical backups of the Staging, data warehouse, and Workbench databases every night so that lost data can be restored quickly at the table level if needed.
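The backup cadence described in this section can be summarized as a small scheduling rule. The sketch below is illustrative only: the choice of Sunday for the weekly full backup and the job labels are assumptions, and the real schedule is implemented with RMAN and related Oracle tools rather than Python.

```python
from datetime import date


def backups_for(day, changes_frequently, is_release_day=False):
    """Return the backup jobs due on a given day under the assumed schedule.

    Frequently changing schemas get a weekly full backup (assumed here to run
    on Sunday) and an incremental backup on other days. Stable schemas get a
    full backup only just before a release. A nightly logical backup is added
    in every case, which is a simplification of the text above.
    """
    jobs = []
    if changes_frequently:
        jobs.append("full" if day.weekday() == 6 else "incremental")
    elif is_release_day:
        jobs.append("full")
    jobs.append("logical")
    return jobs
```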


4.0 TOOLS USED BY IRD AND VIPR BRC: COMMERCIAL, OPEN-SOURCE, AND TOOLS DEVELOPED BY THE SCIENTIFIC COMMUNITY

IRD and ViPR utilize popular commercial and open source software as well as custom software tools developed by the BRC team to support data annotation and analysis by users. These tools are used in data generation, data annotation, data analysis, and graphic display of analysis results. They cover not only sequence data, but also metadata and integration of sequence data with metadata. Both IRD and ViPR offer seamless integration of analysis tools with data in the shared database, making it very convenient for scientists to analyze data. They also offer workbenches that allow a user to analyze their own uploaded data in the same manner as data extracted from the database. Table 4-1 enumerates the tools that are either used by IRD and ViPR for data preparation or are integrated into the web interface for use in interactive data analysis and manipulation.

The following tools are used in ViPR or IRD as of September 2017 and will continue to be used during the third Virus-BRC Option Year.

Table 4-1 Open Source and custom-developed analysis tools and algorithms used by BRC systems

Tool Name | IRD | ViPR | Front-end Utility | Backend Data Handling | Developed In-House | Open Source | Comments

Blast (BlastN, BlastP, BlastX)

yes yes yes yes yes Convenient options are provided for user to blast against search results from IRD/ViPR databases or user’s custom database.

Muscle yes yes yes yes yes MUSCLE is chosen as the main alignment program for speed and accuracy.

ClustalW yes yes yes yes ClustalW is used when refined alignment is needed for data analysis, e.g. flu annotation pipeline, SFVT computation.

UClust yes yes yes yes yes UClust is provided as a MUSCLE pre-processor to improve speed and quality of alignment.

Mauve yes yes yes For aligning large DNA genomes (e.g. Pox), ViPR guides the user in installing Mauve locally and inputting sequences selected from a search or working set.

Jalview (customized)

yes yes yes yes Jalview is a Java applet chosen as the tool to visualize alignments. Customization is provided to manipulate aligned sequences and highlight special features in alignment.


FastME yes yes yes FastME is chosen for speed to construct quick phylogenetic trees for a large set of short sequences or a small set of long sequences.

PhyML yes yes yes yes PhyML is chosen for both nucleotide and protein input type to construct phylogenetic tree.

RaxML yes yes yes yes RaxML is chosen to handle large datasets or longer input sequences for construction of a phylogenetic tree.

Phylip (dnadist, protdist, drawgram, protpar)

yes yes yes yes Several programs from Phylip package are used in the tree building pipeline.

ModelCompare yes yes yes yes This is an in-house built program that employs the PhyML program to do model comparison for nucleotide input.

ProtTest yes yes yes yes ProtTest is chosen to handle model comparison for amino acid input.

Archaeopteryx(customized)

yes yes yes yes Archaeopteryx is a Java applet used in visualization of phylogenetic trees. Customization is provided for user to decorate tree with metadata, or to use a tree to edit sequences from IRD/ViPR datasets.

Archaeopteryx.js(customized)

yes yes yes yes yes Archaeopteryx.js is a Javascript-based tool adapted from the original Archaeopteryx java applet and is used in visualization of phylogenetic trees. Customization is provided for user to decorate tree with metadata, or to use a tree to edit sequences from IRD/ViPR datasets.


JMOL (customized)

yes yes yes yes JMOL is a customized Java Applet that is used for protein structure visualization. Customization is provided for user to highlight or animate features in 3-D protein structure.

JsMOL (customized)

yes yes yes yes JsMOL is a customized Javascript tool that is used for protein structure visualization. Customization is provided for user to highlight or animate features in 3-D protein structure.

GBrowse yes yes yes GBrowse is used to show genome level data for several ViPR families.

GoogleMap yes yes yes GoogleMap is used to show locations of Influenza surveillance samples.

ReadSeq yes yes yes yes Some utilities provided by ReadSeq have been implemented for simple sequence file format conversion.

NetCTL yes yes yes yes NetCTL is used for epitope prediction.

InterProScan yes yes yes yes InterProScan is used for protein domain and motif prediction.

Primer3 yes yes yes yes Primer3 algorithm is adopted for predicting optimal primer set(s).

Genotype Determination and Recombination Detection

yes yes yes This tool is provided for major species from Flaviviridae.

GATU yes yes yes GATU (Genome Annotation Transfer Utility) adopted from a BRC project at the University of Victoria, is used to transfer annotations from a well-studied strain to counterparts in the same taxa.


Meta-CATS yes yes yes yes The meta-CATS tool was developed by the BRC team to perform customized comparative genomics analyses with minimal manual manipulation. After assigning sequences to as many as 10 different groups based on metadata values, statistical analyses can identify positions varying significantly among groups.

R 3.2.3 yes yes yes yes Linux-based enterprise implementation of R. R 3.2.3 is used in statistical data analysis.

Influenza Annotation Pipeline

yes yes yes yes This tool is an interactive version of the IRD Influenza annotation pipeline. The tool will align the sequences against a consensus sequence profile to identify possible sequencing errors, determine the Influenza type, segment, and for segments 4 and 6 of type A, the subtype, and translate the nucleotide into amino acid sequence.

Sequin yes yes yes IRD provides Influenza Sequence GenBank submission service for Influenza research community. Sequin is the main tool adopted from NCBI to handle GenBank data generation.

Identify Short Peptides in Proteins (bl2seq)

yes yes yes yes This tool allows user to find short amino acid strings or oligopeptides in target proteins. It is useful for finding epitopes, ligand binding sites, or sequence domains, from among a target set of proteins.


Identify Point Mutations in Proteins

yes yes yes This tool scans the specified proteins from type A Influenza virus of a specified subtype for the presence of a user-specified amino acid at a chosen position.

Analyze Sequence Variation (SNP)

yes yes yes yes yes IRD provides pre-computed SNP data for Influenza A sequences grouped by segment number, subtype, and host. Sequences are aligned and a consensus sequence determined for each group, and the variation from that value has been determined for each position in the sequence. Both IRD and ViPR provide this analysis tool for nucleotide or amino acid sequences selected from database or uploaded by user.

2009 pH1N1 Classification

yes yes yes yes This Blast-based classification approach has been used to evaluate all influenza sequences from IRD. It is also available for user sequences on the IRD web site.

PA-X Protein Computation

yes yes yes This computational method was applied to all Influenza A segment 3 sequences in IRD to predict an alternative protein translation product. This new protein, known as PA-X, is the result of a ribosomal stutter and frame shift to a new +1 reading frame.

HPAI H5N1 Clade Classification

yes yes yes yes This tool evaluates H5 Influenza sequences for assignment to HPAI Clades as defined by the WHO. It is also available for user sequences on the IRD web site.


SFVT Computation

yes yes yes yes yes These IRD and ViPR pipelines define variation at specific regions and/or subregions termed Sequence Features (SFs), which are of structural or functional interest, including curated immune epitopes. SFs and relevant metadata are obtained from scientific literature and/or public domain databases. Variant Types are computed by pairwise sequence alignments of all protein sequences in the respective taxa. A web tool accepts user sequences and maps characterized SFs to them.

Virus Mature Peptide Computation

yes yes yes This tool accepts a virus GenBank file, and uses the specified taxon_id to identify an appropriate reference sequence. It aligns the reference sequence to a target genome, and generates viral mature peptides, polyprotein cleavage sites, and sequences as a text file for subsequent data warehouse loading. A Gene Symbol is assigned to each mat_peptide.

VIGOR no yes yes no yes Used in annotating Rotavirus A segments and genotypes (see below). Also being evaluated as a potential replacement for GATU.

RotaC yes yes yes no yes An annotation pipeline for genotyping Rotavirus A viruses. This tool is based on software written by Dan Katzel at the J. Craig Venter Institute that is a Jillion optimized reimplementation of RotaC2.0. It is also available for user sequences on the ViPR web site.


Influenza alternate protein computation

yes no no yes yes IRD developed algorithms for computing recently characterized alternative influenza proteins (e.g. PB1-N40, PA-N155, M42, etc.). Existing data enrichment pipelines are used to annotate the novel proteins.

Swine H1N1 clade classification

yes yes yes yes An IRD algorithm that classifies the clade of the HA of H1 viruses, from any host and any NA subtype, with reference to the USDA classification of US swine H1 viruses. Developed by team member Catherine Macken, in collaboration with Tavis Anderson and swine influenza experts at USDA. It is also available for user sequences on the IRD web site.

Customized Ortho MCL

yes yes yes yes A customized version of the OrthoMCL algorithm is used by ViPR to define ortholog groups of viral proteins that are predicted to perform similar functions across virus isolates.

Google reCAPTCHA

yes yes yes no no yes Google reCAPTCHA is used to embed a CAPTCHA in the web pages in order to protect them against spam and other types of automated abuse.

HA subtype numbering conversion

yes yes no no yes This tool allows users to renumber HA sequences according to a cross-subtype numbering scheme proposed in Burke DF, Smith DJ. 2014.

pplacer yes yes yes no This tool aligns a query sequence to a reference MSA and places it in a pre-defined, well-curated phylogenetic tree. It is used in both the IRD clade classification tools and the ViPR genotype tool.


guppy yes yes no yes no yes Guppy is a tool for working with, visualizing, and comparing collections of phylogenetic placements, such as those made by pplacer.

MAFFT no yes no yes no yes Alignment program used in genotype tool. It was required to use MAFFT instead of ClustalW.

Antiviral Resistance Risk Assessment

yes yes yes no yes This tool leverages SFVT computation to determine whether amino acid changes associated with altered response to antiviral drugs are present in a user query sequence.

Influenza Profile Testing Tool

yes yes no yes This tool was built for offline use by IRD curators to assess performance of candidate profiles for the Influenza Annotation Pipeline. It inputs nucleotide working sets, aligns the sequences to the profile, and reports conflicts with the profile, which might represent sequencing artifacts, e.g. CDS-deletions.

Cross-BRC Human Pathogen Interaction (HPI) API

yes yes yes yes yes This tool provides RESTful web services API for other BRC centers to query IRD/ViPR host factor data to identify experiments that result in different behavior of similar host genes set.

CLASSIFI yes yes yes yes This tool performs analysis by statistical method to determine if any gene is over/under represented in an experiment in the GO/Pathway domain.

WGCNA (R package)

yes yes no yes no yes This is used in weighted gene correlation network analysis, which is implemented to visualize (i.e. heatmap) gene significance P-values for Data Models of selected Host Factor Experiments in both IRD and ViPR.


Cytoscape yes yes yes no no yes Displays a graphical network connectivity visualization (nodes and edges) for Data Models of selected Host Factor Experiments in both IRD and ViPR. Specifically, it includes node/edge data from the Experimental Matrix results for GenBank accessions that are included in the reagent annotation. The visualization is computed for selected experiments and displayed per metadata module (e.g. timepoint).


Figure 4-1 below provides a schematic representation of how the tools found in IRD and ViPR can be used for integrated scientific hypothesis generation. In the example shown, nucleotide sequences from pathogens that meet a particular set of conditions are aggregated using a search tool and aligned using the MUSCLE algorithm. The resulting multiple sequence alignment can then be visualized using the integrated JalView-based interactive alignment viewer; used to generate and view a phylogenetic tree using one of several available algorithms and the interactive tree viewer/decorator; analyzed with the SNP analysis tool to determine sequence variation at each coordinate; or analyzed with the Meta-CATS tool to compute sequence variation between cohorts defined by pathogen metadata.

Figure 4-1 Schematic representation of integrated use of IRD/ViPR tools for research

5.0 SYSTEM WEB INTERFACE ARCHITECTURE

5.1 WEB INTERFACE OVERVIEW

The ViPR and IRD systems provide three major functions: Search, Analysis, and Workbench.

Search tools allow users to search the IRD/ViPR database warehouse for a variety of different data types using a large number of filtering options. Search results can be refined, sent to analysis tools, saved to a secure, personal workbench, or downloaded to a local workstation.

The computational tools are used for exploring data and generating hypotheses. Ongoing research examines whether nuances in the input data might affect the results. Analysis tools are customized to accept sequences for analysis from a working set on a workbench, a search result, or external files on a local workstation.

After users search the databases or perform analysis tasks, they are able to store the results to their workbench, where they are able to perform further analysis or research, and securely share the information with selected colleagues.

5.2 USE-CASE VIEW

The use case functionality diagram shown in Figure 5-1 below describes the user functions that can be performed in the system. Use cases are displayed as system functionalities, and a single functionality may encompass more than one use case. Functionalities of different kinds are grouped by color.

Figure 5-1 Use case functionality diagram

5.3 LOGICAL VIEW

The ViPR and IRD BRCs use a three-tier J2EE web application structure. The presentation layer is based on the Spring MVC framework, which separates the view from the model. Its controller routes each user request to the appropriate business layer components and then renders the view to the user. The view is constructed with JSP/HTML for the browser, CSS for styling, and JavaScript and Ajax code to enhance the user experience.

The business layer contains both Java/J2EE and Perl modules. The Java/J2EE modules handle the main flow and functionality of the system, and the Perl modules provide the bioinformatics analysis features.

The data access layer is based on the Hibernate framework. Hibernate facilitates the storage and retrieval of Java domain objects via Object/Relational Mapping, and provides an abstraction layer that spares a Java/J2EE application from tedious hand-written JDBC code.

The Spring framework is used to "glue" the different layers together. Spring is a lightweight inversion-of-control and aspect-oriented container framework, so it helps to speed up the development of the J2EE application.

The three tier model is based on the responsibility of each layer as shown below in Figure 5-2.

Figure 5-2 Three tier Web application
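The separation of responsibilities described in this section can be illustrated with a minimal sketch. It is written in Python rather than the system's Java/Perl stack purely for brevity, and every class, method, and record in it (StrainRepository, StrainService, StrainController, the sample accession) is hypothetical and not taken from the IRD/ViPR code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strain:
    accession: str
    name: str


class StrainRepository:
    """Data access layer: hides storage details (an ORM would sit here)."""

    def __init__(self, rows):
        self._rows = {row.accession: row for row in rows}

    def find(self, accession):
        return self._rows.get(accession)


class StrainService:
    """Business layer: applies rules and knows nothing about rendering."""

    def __init__(self, repository):
        self._repository = repository  # injected rather than constructed here

    def describe(self, accession):
        strain = self._repository.find(accession.strip().upper())
        if strain is None:
            raise KeyError(accession)
        return {"accession": strain.accession, "name": strain.name}


class StrainController:
    """Presentation layer: maps a request to a service call and a view."""

    def __init__(self, service):
        self._service = service

    def handle(self, request):
        try:
            model = self._service.describe(request["accession"])
        except KeyError:
            return "404 strain not found"
        return f"200 {model['accession']}: {model['name']}"


# Wiring the layers together, the role the text assigns to the container.
repository = StrainRepository([Strain("AB000001", "Example strain (made-up record)")])
controller = StrainController(StrainService(repository))
```

Because each layer receives its collaborator through the constructor, any layer can be replaced (for example, by a test double) without touching the others, which is the practical benefit of the inversion-of-control approach.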

5.4 OPEN SOURCE LIBRARY / SOFTWARE

In addition to Hibernate and the Spring framework, the IRD/ViPR system development team also utilizes other open source libraries and software to build the software that provides the functionalities to the user. Table 5-1 enumerates open source development tools used.

Table 5-1 Open Source software and development tools in use

Library | Description | URL
Hibernate | Java ORM framework | http://www.hibernate.org/
Spring | An open source application framework and Inversion of Control container for the Java platform | http://www.springsource.org/
xFire | Java SOAP framework for building SOAP web services | http://xfire.codehaus.org/
Jersey | RESTful web services framework in Java | https://jersey.java.net
jUnit | Java unit test framework | http://www.junit.org/
BioJava | An open-source project dedicated to providing a Java framework for processing biological data | http://biojava.org/wiki/Main_Page
Apache POI | The Java API for Microsoft Documents | http://poi.apache.org/
Apache Ant | A software tool for automating software build processes | http://ant.apache.org/
JFreeChart | An open-source framework for the programming language Java, which allows the creation of a wide variety of both interactive and non-interactive charts | http://www.jfree.org/jfreechart/
iText | A free and open source library for creating and manipulating PDF files in Java | http://itextpdf.com/
Prototype | A JavaScript framework/library | http://prototypejs.org/
jQuery | A multi-browser JavaScript library designed to simplify the client-side scripting of HTML | http://jquery.com/
R Software | R is a free software environment for statistical computing and graphics | https://www.r-project.org/
WGCNA | R package used for weighted correlation network analysis | https://CRAN.R-project.org/package=WGCNA
Cytoscape | An open source software platform for visualizing complex networks and integrating these with any type of attribute data | https://cytoscape.org
Plotly | An open-source JavaScript charting library | https://plot.ly/

5.5 RELEASE TESTING

System integration tests and regression tests are performed during pre-deployment testing of each release to ensure data integrity and the performance of website functionalities. A majority of the system tests and all of the regression tests are automated using the Rational Functional Tester tool. Test scripts are extended to cover the new features and enhancements of each release. In addition to automated testing, manual and use-case testing are performed by testers and scientists to validate new functionalities, scientific data, workflows, and use cases. Issues identified during pre-deployment testing are reported and tracked in the JIRA bug tracking system. New releases are not made available for public use until all high priority issues are resolved.

Performance and load testing is conducted using the performance testing tool NeoLoad to ensure system responsiveness and stability under a specified load. Scripts are generated to simulate the communications between a web client and the application and database servers for selected heavily used and time-consuming transactions, and are parameterized for multiple users. The results are analyzed by the test, development, and database leads to identify performance issues, assess system scalability, and locate any bottlenecks in the application software. A test summary report is prepared with findings and recommendations for improving application responsiveness and ensuring an optimal user experience.

5.6 SYSTEM SECURITY

5.6.1 Security Scanning

To mitigate application security risk, strengthen program management and achieve regulatory compliance, the Virus-BRC test team performs monthly real time security assessments on Virus-BRC applications using the IBM Security AppScan tool. These assessments aim to uncover any security issues in the web applications, explain the impact and risks associated with the issues found, and provide guidance in the prioritization and remediation steps. Tests are performed against the Virus-BRC applications, the IRD and ViPR web sites, from the perspective of an authorized/unauthorized attacker. The objective of these tests is to perform controlled attack and penetration activities to assess the overall level of security of the IRD and ViPR web applications.

At the end of each test, a summary report is generated and submitted to the management team to provide a general understanding of the security status of the application. It includes all issue types found, all remediation tasks recommended, all vulnerable URLs and other information to provide a more detailed understanding of the security issues, as well as to assist in scoping and prioritizing the work required to remedy the issues found. Issues identified by AppScan include vulnerabilities often found during common web application attacks like session hijack, SQL injection, Cross-Site Scripting (XSS), and Cross-Site Request Forgery (CSRF). All vulnerabilities are evaluated and either fixed or determined to be false positives or low impact vulnerabilities. A monthly Summary Risk Assessment Report is submitted to NIAID along with the monthly AppScan results.

5.6.2 Conversion to https

In order to address the security risks related to HTTP in the AWS deployment of the ViPR and IRD systems, we have adopted the use of HTTPS (Hypertext Transfer Protocol Secure) so that all data transferred between the users’ computers and the websites is encrypted. HTTPS is a secure communication protocol over a computer network on the Internet. All data sent over the HTTPS protocol is encrypted between the client and the server to protect against eavesdropping and tampering attacks.

Adopting HTTPS provides the following benefits to the ViPR and IRD systems:

Authentication – ensures the security of user communication with the BRC websites. This helps protect against man-in-the-middle attacks where attackers deliver false information to the users.

Encryption – all user information such as phone number, email address, password, and corporation/organization is encrypted. This helps reduce the chance of exposing users to spear phishing attempts.

Data integrity – ensures data are not modified or corrupted during transfer.
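From a client's point of view, these benefits only hold if the client actually verifies the server certificate and refuses plain HTTP. The following is a minimal sketch of that client-side check using Python's standard ssl module. The helper names and the choice of TLS 1.2 as a minimum version are illustrative assumptions, not settings taken from the IRD/ViPR deployment.

```python
import ssl


def strict_client_context():
    """Return a TLS context that verifies the server certificate and hostname."""
    # create_default_context() loads the system CA store and turns on
    # certificate and hostname verification by default.
    context = ssl.create_default_context()
    # Assumption for illustration: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def is_secure_url(url):
    """Reject plain-HTTP URLs before any request is attempted."""
    return url.lower().startswith("https://")
```

A context built this way can be passed to standard library calls such as urllib.request.urlopen(url, context=...), so that a connection fails outright if the certificate cannot be validated.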

6.0 WEB INTERFACE USER OPERATIONS

The IRD and ViPR systems described in this document enable seamless bioinformatics searches and analyses of viral data supported by robust suites of bioinformatics analysis tools. The fundamental paradigm IRD and ViPR recommend to end-users is Search > Analyze > Save to Workbench (see Figure 6-1 for IRD and Figure 6-2 for ViPR). Search results and analyses may each be saved to the user’s workbench. For lengthy (long running) analyses, users have the option of having the job run asynchronously and then retrieving the results with a ticket number, and/or having results automatically saved to their workbench. The following sections provide brief descriptions of operations available in each category.
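The ticket-based pattern for long-running analyses can be sketched as follows. This is a generic illustration built on Python's standard thread pool. The JobTickets class and the gc_fraction stand-in analysis are hypothetical and do not describe how IRD/ViPR schedules jobs internally.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(sequence):
    """Stand-in for a long-running analysis: fraction of G/C bases."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


class JobTickets:
    """Run jobs in the background and hand back a ticket for later retrieval."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, func, *args):
        ticket = uuid.uuid4().hex  # opaque identifier returned to the user
        self._jobs[ticket] = self._pool.submit(func, *args)
        return ticket

    def status(self, ticket):
        return "done" if self._jobs[ticket].done() else "running"

    def result(self, ticket, timeout=None):
        # Blocks until the job finishes (or the timeout expires).
        return self._jobs[ticket].result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

The caller keeps only the ticket string, so the submitting session can end and the result can be collected later, which mirrors the retrieve-by-ticket option described above.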

Figure 6-1 IRD home page


Figure 6-2 ViPR home page


6.1 SEARCH OPERATIONS

Both IRD and ViPR provide a suite of search capabilities. The Search Tool “landing page” of each system is shown in Figures 6-3 and 6-4. These pages provide an introduction to the suite of search tools provided by each system.

Figure 6-3 IRD landing page for Search operations.


Figure 6-4 ViPR landing page for Search operations.

6.1.1 Search sequences (IRD and ViPR)

Search for strains, segments (nucleotide) or protein sequences meeting search criteria. Data is acquired from external sources (e.g. GenBank) or custom computations (e.g. Influenza HPAI H5 clade assignment algorithm, Influenza variant proteins, ZIKV mature peptide computation, Rotavirus A or Hepatitis C Virus genotyping). Filter searches by metadata when available. Influenza and Rift Valley Fever sequence search results are integrated with outbreak information when such data is available from the EMPRES-i resource. Search results can be analyzed by integrated tools, saved to a personal workbench, or downloaded.

API command-line calls enable remote searching of nucleotide or protein sequences in IRD or ViPR via lists of identifiers or metadata criteria. In addition to GenBank metadata, annotations derived from custom computations can also be used for searching (IRD-annotated H1 and H5 clades, ViPR-computed SOG orthologies, etc.). Returned sequences can be accompanied by metadata in FASTA or JSON output.
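To make the relationship between the two output styles concrete, the sketch below converts FASTA records whose header lines carry metadata into JSON. It does not call the IRD/ViPR API. The pipe-delimited header layout and the field names (accession, strain, host) are assumptions made for illustration, and the actual header format returned by the service may differ.

```python
import json


def parse_fasta(text, fields=("accession", "strain", "host")):
    """Parse FASTA text into dicts, splitting each header on '|' (assumed layout)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            values = line[1:].split("|")
            current = dict(zip(fields, values))
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            current["sequence"] += line
    return records


def fasta_to_json(text):
    """Serialize the parsed records as a JSON array."""
    return json.dumps(parse_fasta(text), indent=2)
```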

6.1.2 Search surveillance data (IRD)

Search records of avian or non-human mammalian surveillance data. Data is acquired from collaborators at the Centers of Excellence for Influenza Research and Surveillance (CEIRS). Search results can be analyzed by integrated tools, displayed on a map, saved to a personal workbench as a working set, or downloaded.

6.1.3 Search epitopes (IRD and ViPR)

Search for experimentally determined or predicted epitopes. Experimentally determined epitopes are obtained from the Immune Epitope Database (IEDB) or curated from the literature. Predicted MHC Class I epitopes are computed on protein sequences using the NetCTL algorithm. Search results can be downloaded.

6.1.4 3D Protein structures (IRD and ViPR)

Search for 3D structures obtained from Protein Data Bank. Structures can be viewed and manipulated in JSMol (Javascript) and overlaid with public data (e.g. epitopes) or custom data (e.g. Sequence Features).

6.1.5 Phenotype (IRD)

Search for strains carrying specific phenotypic characteristics (e.g. enhanced transmission), either based on experimental evidence or on the presence of sequence markers identified with unique phenotypes.

6.1.6 Human clinical metadata (IRD and select ViPR taxa)

Search clinical data from patients who presented to a physician with viral symptoms. Samples were collected, and virus was isolated and sequenced from some positive samples. Clinical data are integrated with the sequence record by the BRC, and are available for download.

6.1.7 Serology experiments (IRD)

A limited number of serum samples were collected by CEIRS investigators from avian, non-human mammalian and human subjects and tested for the presence of serotype-specific influenza antibodies. Search results are classified as positive/negative; positive samples identify the serotypes detected.

6.1.8 Sequence Feature Variant Types (IRD and select ViPR taxa)

Search viral proteins for Sequence Feature Variant Types (SFVT): regions and/or sub-regions that define structural, functional, or immunological properties and that vary in sequence among genomes of their taxa. The variation of a selected SFVT is displayed and can be downloaded as a table, or can be computed as a phylogenetic tree.

Building upon SFVT, IRD added data for 113 molecular determinants of important viral phenotypes defined in the CDC H5N1 Genetic Changes Inventory (http://www.cdc.gov/flu/pdf/avianflu/h5n1-inventory.pdf). The phenotype markers are broken into categories such as: determinant of virulence, tissue tropism, species adaptation, antiviral drug activity, inflammatory response, etc. On a Details page, users can search for the strains carrying a specific Variant Type.

To categorize viral sequences according to the phenotype of their response to Antiviral Reagents (Section 6.1.11), sensitivity/resistance mutations are also computed as Sequence Feature Variant Types.

6.1.9 PCR Primer Probe Data (IRD)

High quality primer/probe sets used in rapid detection and sub-typing of influenza and in diagnostic applications are mapped onto respective segments by an extension of the Sequence Feature Data Model. This allows users to rapidly see whether a given segment is a perfect match to the primer/probe set, and if not, determine the extent and the location of the differences.
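The idea of reporting both the extent and the location of differences between a primer and a segment can be sketched with a simple ungapped scan. This is only an illustration and is not the Sequence Feature Data Model implementation: the function name is hypothetical, and the sketch ignores reverse-complement binding and degenerate bases.

```python
def best_primer_site(segment, primer):
    """Return (start, mismatch_positions) for the best ungapped primer site.

    start is the 0-based offset in the segment. mismatch_positions lists the
    0-based offsets within the primer that differ (empty for a perfect match).
    """
    segment, primer = segment.upper(), primer.upper()
    if len(primer) > len(segment):
        raise ValueError("primer is longer than the segment")
    best = None
    for start in range(len(segment) - len(primer) + 1):
        window = segment[start:start + len(primer)]
        mismatches = [i for i, (a, b) in enumerate(zip(window, primer)) if a != b]
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
            if not mismatches:
                break  # a perfect match cannot be improved on
    return best
```

An empty mismatch list corresponds to the "perfect match" case in the text. A non-empty list gives the number of differences (its length) and where they fall within the primer.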

6.1.10 Host Factor Data (IRD and ViPR)

The collection includes genomic, proteomic, and lipidomic (and eventually metabolomic) studies of host gene responses to pathogens. The database consists of high-throughput results and metadata contributed by external collaborators studying host responses to influenza or coronavirus. Users can mine biosets describing specific expression patterns from one or more experiments, and analyze their contents to identify shared or unique genes or pathways. Co-expression networks and metadata association modules are computed, and the linkages among members of a module can be displayed interactively using Cytoscape.js.

6.1.11 Antiviral Reagents (IRD and select ViPR and non-ViPR taxa)

IRD and ViPR support searching for antiviral drug data relevant to supported viruses as well as selected other pathogens (e.g. Hepatitis B, HIV). Data include Drugbank descriptions and indications, drug binding sites on viral proteins, and viral mutations affecting drug sensitivity. Drug binding site data is integrated with relevant 3D Protein Structure records (Section 6.1.4). Drug sensitivity mutations are also computed as Sequence Feature Variant Types (Section 6.1.8). The Antiviral Drug module will facilitate research and development of antiviral drugs and is shared on both IRD and ViPR websites to support anti-viral research across taxa.

6.1.12 Laboratory Experiments (IRD)

IRD provides data management support and serves as a repository for experiment and clinical data generated by NIAID-sponsored influenza virus-related research. Experiments can be searched by keyword and viewed as tables or graphs supplied by the investigators.

6.1.13 WHO Influenza vaccine strains (IRD)

Season-by-season summaries of WHO recommendations for influenza vaccine composition. With each listing, a link is provided to the corresponding WHO Selection Document, and to detailed strain information.

6.1.14 Protein Domains and Motifs (ViPR)

The ViPR team uses InterProScan algorithms to map domains to viral proteins. Within the selected taxa, a user can search for all proteins matching any domain (peptidase, viral_helicase, etc.) or a domain(s) specified by an accession or keyword. Users can search for viral proteins with specified motifs (coiled coil, etc.) as computed by InterProScan.

6.1.15 Ortholog groups (selected ViPR taxa)

The ViPR team has computed and named ortholog groups to classify proteins according to sequence and functional similarity. Search to return all members of a group for analysis or saving to a working set.


6.2 ANALYZE AND VISUALIZE

Both IRD and ViPR provide a suite of analysis tools. The Analysis Tool “landing page” of each system is shown in Figures 6-5 and 6-6. The following pages provide an introduction to the suite of analysis tools provided by each system.

Figure 6-5 IRD landing page for Analyze and Visualize operations.


Figure 6-6 ViPR landing page for Analyze and Visualize operations.

6.2.1 Identify Similar Sequences (IRD and ViPR)

Use BLAST algorithms to identify similar nucleotide or amino acid sequences in a variety of custom viral databases. Inputs can be search results, working sets, or external sequences.

6.2.2 Align Sequences (IRD and ViPR)

IRD and ViPR use MUSCLE (Multiple Sequence Comparison by Log-Expectation) to align sequences. Uclust is provided as a MUSCLE pre-processor to improve both speed and quality of alignment. Inputs can be selected from search results, working sets or an uploaded file and results can be output in a variety of familiar formats. Output can be visualized or can be directed to the phylogenetic tree generating tool. For aligning large DNA genomes in ViPR (Poxviridae and Herpesviridae), the MAUVE algorithm is used.

6.2.3 Visualize Aligned Sequences (IRD and ViPR)

A customized version of the JalView applet was used to visualize and interact with a multiple sequence alignment. The team is replacing this Java applet with a customized version of the JavaScript-based MSAViewer. Visualization by HTML is available as an alternative. Either sequences pre-aligned against a profile (IRD) or a custom alignment (IRD and ViPR) can be used as input. Inputs can be selected from search results, working sets or an uploaded file.

6.2.4 Identify Short Peptides in Proteins (IRD and ViPR)

Find short amino acid strings or oligopeptides in target proteins using an exact match, a fuzzy match or a pattern. A search can be applied to uploaded sequences, a working set, or custom databases prepared by the BRC team (e.g. Betacoronavirus proteins).
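The three matching modes can be illustrated with a short sketch. The source does not define how the tool implements a "fuzzy" match. Here it is assumed to mean a bounded number of substitutions (Hamming distance) with no insertions or deletions, and the "pattern" mode is assumed to be a regular expression. All function names are hypothetical.

```python
import re


def exact_matches(protein, peptide):
    """0-based start positions of every exact (possibly overlapping) occurrence."""
    return [i for i in range(len(protein) - len(peptide) + 1)
            if protein.startswith(peptide, i)]


def fuzzy_matches(protein, peptide, max_mismatches=1):
    """(start, mismatch_count) for windows within max_mismatches substitutions."""
    hits = []
    for i in range(len(protein) - len(peptide) + 1):
        window = protein[i:i + len(peptide)]
        mismatches = sum(a != b for a, b in zip(window, peptide))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


def pattern_matches(protein, pattern):
    """(start, matched_text) for each non-overlapping regular-expression match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, protein)]
```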

6.2.5 Identify Point Mutations in Proteins (IRD)

Scan proteins from type A influenza virus (with the option to specify a subtype) in the database for the presence of specific amino acids at up to 10 user-designated positions. Mutations can be named to simplify interpreting the results. The tool can also analyze uploaded external sequences.
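A minimal sketch of this kind of scan is shown below. The function name, the input layout, and the limit parameter are assumptions for illustration. The sketch uses 1-based positions in each unaligned protein and does not reproduce any subtype-specific numbering the IRD tool may apply.

```python
def scan_for_mutations(sequences, named_mutations, max_positions=10):
    """Report which named mutations each protein carries.

    sequences maps a sequence id to a protein string.
    named_mutations maps a user-chosen label to (1-based position, residue).
    """
    if len(named_mutations) > max_positions:
        raise ValueError(f"at most {max_positions} positions can be scanned")
    report = {}
    for seq_id, protein in sequences.items():
        report[seq_id] = sorted(
            label
            for label, (position, residue) in named_mutations.items()
            if 1 <= position <= len(protein) and protein[position - 1] == residue
        )
    return report
```

For example, with the made-up labels {"K3R": (3, "R"), "L7A": (7, "A")}, a protein "MERIVLAL" would be reported as carrying both mutations and "MEKIVLLL" as carrying neither.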

6.2.6 Analyze Sequence Variation (SNP) (IRD and ViPR)

Variation from a consensus sequence in a set of nucleotide or amino acid sequences can be determined, either in a pre-computed set specified by type, subtype, or host (IRD), or after a custom alignment calculation (IRD and ViPR). Inputs can be search results, working sets, or external sequences.
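Variation from a consensus can be illustrated on a small alignment. The sketch below takes the most frequent residue in each column as the consensus and reports the fraction of sequences that differ from it. This is an assumed, simplified scoring and not necessarily the statistic the SNP tool reports. Ties between equally frequent residues are broken by first appearance in the column.

```python
from collections import Counter


def column_variation(aligned):
    """For each alignment column return (consensus_residue, fraction_differing)."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    summary = []
    for column in zip(*aligned):
        counts = Counter(column)
        consensus, top = counts.most_common(1)[0]
        summary.append((consensus, 1 - top / len(column)))
    return summary


def variable_positions(aligned):
    """1-based alignment coordinates where at least one sequence differs."""
    return [i + 1 for i, (_, fraction) in enumerate(column_variation(aligned))
            if fraction > 0]
```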

6.2.7 Generate Phylogenetic Trees (IRD and ViPR)

Generate phylogenetic trees from nucleotide or amino acid sequences using a selection of algorithms, evolutionary models, etc. Inputs can be search results, working sets, uploaded external sequences, or the result of a prior analysis. A customized Archaeopteryx tree viewer is available to visualize the tree and decorate its leaves using metadata values.

When a user is logged into the workbench and generates a tree from a very large dataset, the system recommends running the job with the RAxML algorithm at the Cyberinfrastructure for Phylogenetic Research (CIPRES) Project. The sequences are automatically transferred to CIPRES, which first runs a multiple sequence alignment and then runs RAxML for phylogenetic analysis. Both steps are invoked via web service calls to the CIPRES RESTful tool API, and the results are returned to the workbench when completed.

6.2.8 Metadata Sequence Analysis (IRD and ViPR)

The metadata driven comparative analysis tool (meta-CATS) was developed by Brett Pickett, a member of the JCVI IRD team. The meta-CATS tool provides the capability to perform customized comparative genomics analyses with minimal manual manipulation. Inputs can be search results, working sets, or external sequences. You can perform a statistical analysis on sequences assigned either manually or programmatically (based on metadata values, e.g. host, country, genotype) to up to 10 different groups, to determine which residues significantly correlate with one or more metadata fields. The meta-CATS tool looks for positions that significantly differ between user-defined groups of sequences. However, biological biases due to covariation, codon biases, and differences in genotype, geography, time of isolation, or others may affect the robustness of the underlying statistical assumptions.
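The kind of per-position group comparison described here can be sketched with a chi-square test of independence on the residue counts at one alignment column. The text above does not name the statistic, so the choice of chi-square is an assumption for this illustration, and the function name and input layout are hypothetical. The sketch returns only the statistic and its degrees of freedom. A p-value would be obtained from the chi-square distribution with a statistics library.

```python
from collections import Counter


def chi_square_column(groups, position):
    """Chi-square statistic and degrees of freedom for one alignment column.

    groups maps a group label (e.g. a host) to a non-empty list of aligned
    sequences. position is a 0-based alignment coordinate.
    """
    observed = {label: Counter(seq[position] for seq in seqs)
                for label, seqs in groups.items()}
    residues = sorted(set().union(*observed.values()))
    row_totals = {label: sum(counts.values()) for label, counts in observed.items()}
    col_totals = {res: sum(counts[res] for counts in observed.values())
                  for res in residues}
    total = sum(row_totals.values())
    statistic = 0.0
    for label, counts in observed.items():
        for res in residues:
            expected = row_totals[label] * col_totals[res] / total
            statistic += (counts[res] - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(residues) - 1)
    return statistic, dof
```

A column where every group shows the same residue yields a statistic of 0 with 0 degrees of freedom, while a column that separates the groups perfectly yields a large statistic. As the section notes, covariation and sampling biases can undermine the independence assumption behind such a test.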

6.2.9 Annotate Nucleotide Sequences (IRD)

This tool is an interactive version of the IRD team's automated influenza annotation pipeline. You can submit any number of nucleotide sequences in FASTA format for validation and annotation. The pipeline will align the sequences against a consensus sequence profile to identify possible sequencing errors; determine the influenza type, the segment number and, for segments 4 and 6 of type A, the subtype; and translate the nucleotide sequence. A report is provided containing these results and a list of possible sequencing errors or, if critical errors are encountered, a description of the errors. This pipeline was originally developed and is maintained in collaboration with IRD co-investigator Dr. Catherine Macken.

6.2.10 Identify Sequence Features in Segments (IRD)

This tool is an interactive version of the SFVT-PVT pipeline developed by IRD. Users can upload nucleotide sequences and generate/save a report describing the presence of experimentally characterized SFVTs and whether the sequences carry a particular variant type giving rise to a phenotype such as increased virulence.

6.2.11 Antiviral Resistance Risk Assessment Tool (select ViPR taxa)

This tool allows users to scan their private viral sequences, and assess their response to a set of antiviral reagents, based on identity of the amino acids at positions which are known to be associated with resistance/sensitivity in other virus isolates. The computation is based on phenotype of all resistance/sensitivity Sequence Feature Variant Types (Section 6.1.8) which map to the query sequence.

6.2.12 Sequence Format Conversion (IRD and ViPR)

This tool uses the ReadSeq algorithm to convert any format nucleotide or amino acid sequence file to any of 19 available file formats. The ReadSeq algorithm was developed by Don Gilbert of Indiana University, Bloomington, Indiana, and has been implemented in both IRD and ViPR for simple file format conversions.

6.2.13 Genome Annotator (GATU) (ViPR)

GATU, a Genome Annotation Transfer Utility (Tcherepanov, et al., BMC Genomics 2006, 7:150 PubMed: 16772042), is an initial-stage tool to transfer annotations from a previously annotated reference to a new, closely related target genome. The GATU interface provides controls for uploading a reference .gb file of the relevant viral family, along with the target genome in .gb or FASTA format. When done, a table summarizes the similarities of transferred annotations and provides users with checkbox control over which to accept. GATU also detects ORFs in the target and provides bioinformatics tools to assess whether these should be annotated. The annotated target genome can be saved in multiple file formats. Originally developed at the University of Victoria, GATU was adapted for use with ViPR.

6.2.14 Pandemic H1N1 Classification (IRD)

As a service to the influenza research community, the IRD team makes available an interactive version of our novel algorithm for identifying nucleotide sequences closely related to the 2009 pandemic H1N1 strain. The procedure is a robust application of BLAST and was developed by IRD team member Catherine Macken. The user can upload a FASTA format file containing nucleotide sequences to be analyzed or paste them directly into a box. One or more sequences from any of the 8 influenza segments can be submitted. Each is run through the classification pipeline and its relationship to the pandemic sequences determined as either closely related (Y) or not (N).

6.2.15 HPAI H5N1 Clade Classification (IRD)

The IRD team has implemented an algorithm for classifying the clade of the hemagglutinin gene of influenza A viruses whose HA belongs to the A/goose/Guangdong/1/96 (H5N1) lineage, that is, the HA lineage of the so-called highly pathogenic Avian Influenza (HPAI) H5 viruses. This algorithm was developed by IRD co-investigator Catherine Macken. It uses phylogenetic analysis to place HA (H5) sequences within the WHO classification scheme. The IRD algorithm has been verified as highly accurate (> 99%) for sequences of at least 300 nucleotides of HA1. This tool only handles segment 4 (HA) sequences with confirmed H5 serotype and lengths greater than 300 nucleotides. Sequences from other serotypes of HA, or other segments yield unpredictable and likely incorrect results. The IRD team has used this tool to make a clade assignment to all relevant HA sequences in the database and also provides an interactive version of the tool that allows users to classify their own private sequences. Any number of sequences can be submitted for clade classification by either uploading a FASTA format file or pasting sequences into a box on the web interface.

6.2.16 US and Global Swine H1 Clade Classification (IRD)

An IRD-developed algorithm classifies the clade of the HA of H1 viruses, from any host and for any NA subtype, with reference to the USDA classification of US swine H1 viruses. This algorithm is based on phylogenetic analysis, and was developed by IRD co-investigator Catherine Macken, in conjunction with Tavis Anderson and other swine influenza experts at the USDA. It has been verified as highly accurate (> 99%) for sequences of at least 300 nucleotides of HA1. An interactive version of the tool allows users to classify their own private sequences by either uploading a FASTA format file or pasting sequences into the web interface. The analysis can be saved and a report for each individual protein can be downloaded.

6.2.17 PCR Primer Design (IRD and ViPR)

IRD and ViPR provide a tool that uses the Primer3 algorithm to assist in predicting the optimal primer set(s) for a specified sequence.

6.2.18 HA Subtype Numbering Conversion (IRD)

This tool allows users to renumber HA sequences according to a cross-subtype numbering scheme proposed by Burke and Smith (PLoS One 9:e112302). The computation uses analysis of known HA structures to identify amino acids that are structurally and functionally equivalent across all HA subtypes. Users paste or upload query HA protein sequences in FASTA format, select a preferred numbering scheme, and click "Convert residue numbering" to renumber their query sequence according to the selected subtype. Individual amino acids in the query may then be contextualized by comparison to proteins with 3-D structures solved in the presence or absence of ligands.
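The renumbering idea can be illustrated with a short sketch. It assumes that a gapped pairwise alignment between the query and a reference carrying the target numbering is already available; the IRD tool derives residue equivalence from HA structures, which this sketch does not attempt. The function name `renumber` and its parameters are hypothetical.

```python
from typing import Dict, Optional


def renumber(aligned_query: str, aligned_ref: str, ref_start: int = 1) -> Dict[int, Optional[int]]:
    """Map 1-based query residue positions to reference numbering.

    Both inputs are rows of the same alignment, with '-' marking gaps.
    Query residues aligned to a reference gap (insertions) map to None.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    mapping: Dict[int, Optional[int]] = {}
    q_pos = 0                 # position in the ungapped query
    r_num = ref_start - 1     # number of the last reference residue seen
    for q_char, r_char in zip(aligned_query, aligned_ref):
        if r_char != "-":
            r_num += 1
        if q_char != "-":
            q_pos += 1
            mapping[q_pos] = r_num if r_char != "-" else None
    return mapping
```

For example, `renumber("MK-AIL", "MKTA-L")` maps query residue 3 to reference number 4, because the reference has one residue that the query lacks, and maps query residue 4 to `None`, because it has no reference equivalent.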

6.2.19 Genotype Recombination-Detection (selected ViPR taxa)

The ViPR system uses an annotation pipeline that takes an alignment of sequences containing at least two representatives from each taxon. This reference alignment is then used to construct a distance-based tree, which is then parsed in order to find the closest relatives for any query sequence using a Branch Indexing method. By incorporating a static window size, this pipeline can also identify any recombinant query sequence. When the analysis is completed, a graphical representation of the score corresponding to the genotype classification for each region of the "sliding window" is shown. A spreadsheet file with the results is also available for download. This tool is based on the Genotype Determination Tool developed by Carla Kuiken's group at Los Alamos National Laboratory for the HCV database.
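The per-window classification described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ViPR pipeline: it scores each window by plain sequence identity against one pre-aligned reference per genotype, in place of the distance-based tree and Branch Indexing method, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def window_genotypes(
    query: str, references: Dict[str, str], window: int, step: int
) -> List[Tuple[int, str, float]]:
    """Return (window start, best genotype, score) for each window position.

    The query and every reference are assumed to be rows of one alignment,
    so the same slice can be compared across all of them.
    """
    calls = []
    for start in range(0, len(query) - window + 1, step):
        segment = query[start:start + window]
        scores = {
            genotype: identity(segment, ref[start:start + window])
            for genotype, ref in references.items()
        }
        best = max(scores, key=scores.get)
        calls.append((start, best, scores[best]))
    return calls


def is_possible_recombinant(calls: List[Tuple[int, str, float]]) -> bool:
    """Flag a query whose windows are assigned to more than one genotype."""
    return len({genotype for _, genotype, _ in calls}) > 1
```

The list of per-window scores corresponds to the information plotted in the graphical output: a query whose best-scoring genotype changes along its length is a candidate recombinant.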

The current version operates on Dengue, West Nile, Japanese encephalitis, St Louis encephalitis, Tick-borne encephalitis, Yellow Fever, Murray Valley encephalitis, and Bovine viral diarrhea virus.

For HCV, an improved pipeline to genotype untyped sequences was developed, based on a RAxML phylogenetic tree of sequences from a reference alignment maintained by the Flaviviridae Study Group of the International Committee on Taxonomy of Viruses, supplemented with confirmed subtype sequences from Dr. Donald Smith. Query sequences are aligned to the reference and placed on the tree using pplacer (Matsen, 2010), and their placement is used to assign a genotype using cladinator, developed by C. Zmasek of the Virus-BRC team.

6.2.20 Rotavirus A Genotype Determination (selected ViPR taxa)

ViPR developed bioinformatics support for genotyping Rotavirus A (RVA) sequences and strains in collaboration with Karla Stucker and Danny Katzel at JCVI. User-submitted sequences are BLASTed against the RVA RefSeq set to assign the correct segment and map to a ViPR_Gene_Name. A web-based version of RotaC (http://rotac.regatools.be/), adapted by Danny Katzel at JCVI, is embedded into the analysis to determine the genotype of the query and report the closest reference strain with measures of identity and confidence values. A separate report is generated for sequences which fail the typing pipeline.

6.2.21 View Genomes in GBrowse (selected ViPR taxa)

The ViPR team has made the annotated RefSeq genomes from the double-stranded DNA genome families Poxviridae and Herpesviridae available in GBrowse. Many annotations are linked to their comprehensive Details pages in ViPR. Currently 284 sequences from the Poxviridae family and 536 sequences from the Herpesviridae family are adapted for viewing using the GBrowse genome browser.

6.3 WORKBENCH

As described above (Section 3.2.3), the IRD and ViPR databases are designed to provide each user with their own secure, personal workspace. Any user can create their own Workbench and manage their profile by requesting a free account. Within the workbench, a user can save working sets, search results, analyses, and uploaded data files. All of a user’s search and analysis results are initially stored in a Temporary Workbench. Users can choose to make them permanent after logging in; otherwise they are deleted when the browser session is closed.

6.3.1 Working Sets

IRD and ViPR allow a user to create working sets of various data types (strain, segment, genome, protein, host factors, serology and surveillance data) on their private workbench. Working sets are created by selecting results from database searches and saving them to a working set using a “Save to Workbench” function button found on every search result page. The database identifies the items in a working set by a list of pointers to the actual data and does not store the data shown in the search result. When the database is updated and annotations associated with a working set member are changed, the new data is automatically associated with that working set member. The contents of an IRD or ViPR working set may be submitted to any analysis tool (see Section 6.2) that uses that type of data.
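The pointer-based design can be illustrated with a minimal sketch. The class, the in-memory dictionary standing in for the database, and the record identifier are all hypothetical; the sketch shows only why a working set reflects updated annotations without being edited itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkingSet:
    """A named list of record identifiers; no record content is stored here."""
    name: str
    data_type: str
    member_ids: List[str] = field(default_factory=list)

    def resolve(self, database: Dict[str, dict]) -> List[dict]:
        """Look up current records at view time; skip identifiers no longer present."""
        return [database[i] for i in self.member_ids if i in database]


# Hypothetical database content keyed by a made-up identifier.
database = {"REC001": {"host": "Avian"}}
working_set = WorkingSet("demo", "segment", ["REC001"])
before = working_set.resolve(database)[0]["host"]

# A database update changes the annotation; the working set itself is untouched.
database["REC001"] = {"host": "Swine"}
after = working_set.resolve(database)[0]["host"]
```

Because only identifiers are kept, `before` holds the old annotation and `after` holds the updated one, even though `working_set` was never modified.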

Another feature of the workbench, the More Actions control (Figure 6-7), allows other types of actions on working sets. The available actions vary depending on the type of working set and include:

Downloading the content to local directories in several formats. If choosing to download database sequences as FASTA format files, the content and order of information appearing in the file definition lines is user controlled.

Working sets of database search results can be combined with, intersected with, or subtracted from other working sets of the same data type.

The Convert operation can be used to create new working sets by converting among strain, nucleotide and protein data types in IRD and converting genome to protein working sets in ViPR. For example, a protein working set can be created from a nucleotide working set in IRD and a protein working set can be created from a genome working set in ViPR.

The sequences in uploaded FASTA format files can be combined with database sequences in a working set containing the same type of sequence data.

In IRD, the contents of a segment type working set can be edited (pruned) using a customized interactive version of the Archaeopteryx tree viewer (Section 6.2.7).
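The combine, intersect, and subtract actions listed above behave like set operations on the identifiers held in two working sets of the same data type. A minimal sketch, assuming working sets are represented as sets of identifiers (the `IdSet` type and function names are hypothetical):

```python
from typing import FrozenSet, NamedTuple


class IdSet(NamedTuple):
    """A working set reduced to its data type and member identifiers."""
    data_type: str
    ids: FrozenSet[str]


def _require_same_type(a: IdSet, b: IdSet) -> None:
    if a.data_type != b.data_type:
        raise ValueError("working sets must hold the same data type")


def combine(a: IdSet, b: IdSet) -> IdSet:
    """Members found in either working set."""
    _require_same_type(a, b)
    return IdSet(a.data_type, a.ids | b.ids)


def intersect(a: IdSet, b: IdSet) -> IdSet:
    """Members found in both working sets."""
    _require_same_type(a, b)
    return IdSet(a.data_type, a.ids & b.ids)


def subtract(a: IdSet, b: IdSet) -> IdSet:
    """Members of the first working set that are absent from the second."""
    _require_same_type(a, b)
    return IdSet(a.data_type, a.ids - b.ids)
```

The type check reflects the restriction that these actions apply only between working sets of the same data type; moving between data types is the role of the Convert operation.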

Figure 6-7 Actions available on working sets via More Actions control.

6.3.2 Searches

The workbench stores search criteria rather than the results, allowing a user to re-run a saved search at any time. Any database search is automatically saved to the user’s workbench if they are logged in, or to a Temporary Workbench if they are not. These searches are retained permanently only if the user designates them to be saved; all other searches are removed from the workbench when the browser session is closed. The systems also provide a “Subscribe to Search” feature that runs a saved search at specified intervals and reports database entries that newly meet the search criteria after scheduled updates of IRD or ViPR data.
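Storing criteria rather than results, and reporting only new matches on each scheduled run, can be sketched as follows. This is an illustration under assumptions: records are plain dictionaries with an `id` field, criteria are exact-match field values, and the class and method names are hypothetical rather than taken from IRD or ViPR.

```python
from typing import Dict, List, Set

Record = Dict[str, str]


class SavedSearch:
    """Keeps search criteria only; results are recomputed on every run."""

    def __init__(self, criteria: Dict[str, str]) -> None:
        self.criteria = criteria
        self._seen: Set[str] = set()  # identifiers already reported to the subscriber

    def run(self, database: List[Record]) -> List[Record]:
        """Return all records that currently match every criterion."""
        return [
            record for record in database
            if all(record.get(key) == value for key, value in self.criteria.items())
        ]

    def new_matches(self, database: List[Record]) -> List[Record]:
        """Return matches not reported by an earlier call, then remember them."""
        hits = self.run(database)
        fresh = [record for record in hits if record["id"] not in self._seen]
        self._seen.update(record["id"] for record in hits)
        return fresh
```

Calling `new_matches` after each data update would return only the entries added since the previous call, while `run` always returns the full current result.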

6.3.3 Analysis Tool Results

When any analysis tool is run (see Section 6.2), IRD and ViPR automatically save the results to the user’s workbench if they are logged in, or to a Temporary Workbench if they are not. These analysis results are retained permanently only if the user designates them to be saved; all other analyses are removed from the workbench when the browser session is closed. Unlike working sets, which consist of pointers, and saved searches, which consist of the search criteria, the entire analysis result is saved and can be retrieved for review at a future time. Since the underlying data can change, re-executing the same analysis could return different results.

6.3.4 Uploaded Data Files

To support use of private data sets, external files in specified formats can be uploaded for use with most IRD and ViPR analysis tools. Currently, the following upload formats are supported:

FASTA – protein or nucleotide sequence file

Aligned FASTA – aligned protein or nucleotide sequence file

PDB – 3D protein structure file

Phylip (interleaved) – for phylogenetic tree display

Newick – for phylogenetic tree display

PhyloXML

Tab delimited

Private FASTA format files can be combined with working sets containing the same type of sequence data, with the result being a FASTA format file containing the sequences from both sources. Also, metadata associated with the private sequences, uploaded either in file headers or accompanying templates, can be used in analyses to decorate (Phylogenetic Trees - Section 6.2.7) or manage (mCAT - Section 6.2.8) the uploaded sequences. Phylip, PhyloXML, and Newick files can be used with the phylogenetic tree generation and display tools. PDB files can be displayed using the built-in Jmol or JSmol viewers in IRD and ViPR.

6.3.5 Organization and Management of the Workbench

A hierarchical folder structure can be used to group working sets together thematically, or according to sharing privileges, on the My Workbench page in IRD and ViPR. Figure 6-8 shows a representative IRD workbench containing working sets, searches and analyses. Unsaved workbench items are preserved until the end of the browser session.

Figure 6-8 IRD My Workbench page (ViPR page is similar)

Controls along the left side of the My Workbench page allow a customized view according to content, access, or special characteristics. The workbench also allows private sharing of working sets, searches, analysis results, and uploaded files with selected colleagues and collaborators (Figure 6-9). The owner retains control over who views the item, including the option to withdraw sharing privileges. The owner can also choose to publicly share an item with the general research community, either under its original name, or with a public name, by making it public. All IRD or ViPR users have access to a working set, search, analysis result, or uploaded file that is made public. To facilitate sharing among groups of collaborators, a sharing “group” can be created consisting of any number of IRD or ViPR users who have created their own workbenches.
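The sharing rules described above (private by default, shared with selected users, optionally public, and revocable by the owner) can be sketched as a small access check. This is an illustrative model only; the class and method names are hypothetical and do not describe how IRD or ViPR store permissions.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class SharedItem:
    """A workbench item (working set, search, analysis result, or uploaded file)."""
    owner: str
    public: bool = False
    shared_with: Set[str] = field(default_factory=set)

    def share(self, users: Iterable[str]) -> None:
        """Grant view access to individual users or to the members of a sharing group."""
        self.shared_with.update(users)

    def withdraw(self, user: str) -> None:
        """Remove a previously granted sharing privilege."""
        self.shared_with.discard(user)

    def can_view(self, user: str) -> bool:
        """Public items are visible to everyone; otherwise only to owner and sharees."""
        return self.public or user == self.owner or user in self.shared_with
```

A sharing group fits this model as a named collection of users passed to `share` in one call.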

Figure 6-9 Workbench Sharing menu lists options for controlling access to selected working sets

Items in a workbench can also be grouped into folders so that all working sets, searches, analysis results, and uploaded files associated with a particular scientific study can be quickly identified and viewed.

If a user selects a working set by clicking on the information icon, a representative set of data associated with each member of the working set is retrieved and displayed in the same format as in a search result. Items can then be selected from the working set for input directly into any of the analysis tools appropriate to that type of data. Figure 6-10 shows analysis options available in the IRD workbench for a segment-type (nucleotide) working set.

Figure 6-10 Workbench showing analysis options for a segment (nucleotide) sequence working set
