Two Examples of Datanomic
David Du, Digital Technology Center
Intelligent Storage Consortium, University of Minnesota
Datanomic Computing (Autonomic Storage)
System behavior driven by characteristics of the data and the changing environment
Automatic optimization for ever-changing data requirements
Allocate resources according to increases in data demand
Transform data formats to support different applications
Seamless data access from anywhere at any time
Location- and context-aware access to data
Content-based search
Adaptive performance
Consistent view of each user's data
Independent of platforms, operating systems, and data formats
Exploit active objects and active, intelligent disks
Solve the data explosion and provenance issues
Three Possible Approaches
Semantic Web: the Web is the key
Grid Computing: services offered by middleware
Intelligent Storage Devices: reduce layers by adding features to storage devices
Two Examples
E2E QoS Provisioning for Network-Attached Storage Systems
Solutions to the Data Provenance Problem
Motivation of E2E QoS Provisioning
OSD supports diverse applications
Different applications require different performance guarantees: bandwidth, response time, throughput
Objects in OSD carry application semantics
Objects in OSD have full knowledge of their current storage condition
QoS Challenges in Network Attached Storage
QoS requirements come from applications
Data are accessed from remote storage devices via IP network connections
How to ensure QoS delivery within storage devices?
How to ensure QoS over networks?
How to ensure QoS end-to-end?
Feedback Based QoS Control
Use a controller between clients and the storage server (client side vs. ISP)
Clients provide performance goals
A feedback mechanism in the controller compares the measured performance against the expected performance
The controller throttles user requests if a performance goal is violated
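The throttling loop above can be sketched as a simple proportional controller. This is a minimal illustration, not the talk's implementation: the class name, the gain value, and the rate-adjustment rule are all assumptions.

```python
# Sketch of the feedback-based QoS control loop: compare measured
# performance against the client's goal and throttle the admitted
# request rate on violation. All names/constants are illustrative.

class FeedbackController:
    """Adjusts the admitted request rate from measured response time."""

    def __init__(self, perf_goal_ms, initial_rate, gain=0.5):
        self.perf_goal_ms = perf_goal_ms   # performance goal from the client
        self.rate = initial_rate           # admitted requests per second
        self.gain = gain                   # how aggressively to react

    def update(self, measured_ms):
        # Feedback step: normalized error between goal and measurement
        # drives a proportional adjustment of the request rate.
        error = (self.perf_goal_ms - measured_ms) / self.perf_goal_ms
        self.rate = max(1.0, self.rate * (1.0 + self.gain * error))
        return self.rate

ctl = FeedbackController(perf_goal_ms=20.0, initial_rate=100.0)
ctl.update(40.0)   # goal violated: rate is throttled below 100
ctl.update(10.0)   # goal met with slack: rate is allowed to grow
```

A real controller would also smooth measurements and bound the step size, but the essential compare-and-throttle cycle is the same.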
A Feedback Control
[Diagram: the controller sits between clients and the storage server, connected over TCP/IP networks on both sides; a measurement module computes the delivered performance, the controller compares it against the performance goal, and the difference yields an adjusted request rate.]
Possible Control Along The End User Access Data Path
[Diagram: clients (1) connect through their ISPs (2) and TCP/IP network segments to the storage server (3), marking the possible control points along the end-user access data path.]
Storage QoS Control
Motivation
Multimedia applications require guaranteed timely delivery
Different applications have different QoS requirements
Storage access has a lot of variation
QoS provisioning
QoS-aware disk scheduling to guarantee real-time requirements
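One common way to realize QoS-aware disk scheduling with real-time guarantees is earliest-deadline-first (EDF) ordering of pending requests. The sketch below is illustrative of the idea only; the talk does not specify its scheduling algorithm.

```python
import heapq

# EDF sketch: serve pending disk requests in order of their QoS
# deadlines. Request names and deadlines are invented for illustration.

def edf_order(requests):
    """requests: list of (deadline_ms, request_id); returns ids by deadline."""
    heap = list(requests)
    heapq.heapify(heap)                 # min-heap keyed on deadline
    order = []
    while heap:
        _deadline, rid = heapq.heappop(heap)
        order.append(rid)
    return order

edf_order([(30, "video"), (10, "audio"), (100, "backup")])
# serves "audio" first and "backup" last
```

A production scheduler would additionally batch requests by disk position to limit seek overhead, trading some deadline fidelity for throughput.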
Storage Brick
[Diagram: a storage brick target — initiators 1…n, each with network bandwidth BW1…BWn, connect over Gigabit Ethernet and a TCP network to the brick controller (CPU and memory), whose SATA HBAs drive disks d1–d8.]
Challenges in QoS Provisioning in Storage Brick
Storage bricks often use striping and replication to improve performance
RAID systems make disk scheduling more difficult and previous algorithms inappropriate
Client connections have different network bandwidths
iSCSI has upper-level flow control
Issue 1: Network-Aware Scheduling
Goal: exploit knowledge of the underlying network conditions to efficiently schedule object requests
Environment: storage brick attached to the Internet
Assumptions:
Multiple initiators with different bandwidths access the storage brick through iSCSI
Each session's bandwidth can be acquired
Objects are striped over multiple disks in the brick for performance and load balancing
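Under these assumptions, one natural network-aware policy is max-min fair allocation: never grant a session more disk service than its iSCSI connection can drain, and redistribute the surplus to faster sessions. This sketch names that policy explicitly; it is not taken from the talk, and all identifiers and numbers are illustrative.

```python
# Max-min fair split of the brick's disk bandwidth across iSCSI
# sessions, capped by each session's measured network bandwidth.

def allocate_disk_service(disk_rate_mbps, session_bw_mbps):
    """Split total disk bandwidth across sessions; no session gets more
    than its network bandwidth, and surplus flows to the remaining ones."""
    alloc = {}
    remaining = disk_rate_mbps
    # Serve the most constrained sessions first so leftover capacity
    # is redistributed among the less constrained ones.
    for sid, bw in sorted(session_bw_mbps.items(), key=lambda kv: kv[1]):
        sessions_left = len(session_bw_mbps) - len(alloc)
        fair_share = remaining / sessions_left
        alloc[sid] = min(bw, fair_share)
        remaining -= alloc[sid]
    return alloc

# A slow DSL initiator is capped at its own bandwidth; the two faster
# sessions split what remains of the 300 Mbps disk rate.
allocate_disk_service(300, {"init1": 50, "init2": 200, "init3": 1000})
```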
Issue 2: QoS-Aware Storage Scheduling
Motivations:
Different QoS requirements from different applications
Different network bandwidths from different sessions
Different RAID configurations in the storage brick
Objectives:
Propose a framework to support different application QoS for different sessions
Scheduling with Object Replicas
Previous scheduling assumes there is only one copy of the requested object
An object can have multiple replicas
Locate the most favorable replica of a requested object
Schedule the disk access on that favorable replica
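The replica-selection step above can be as simple as picking, among the disks holding a copy, the one with the shortest pending queue. This is a sketch of one possible "favorable replica" criterion; disk names and queue depths are invented.

```python
# Pick the replica on the least-loaded disk. "Favorable" here means
# shortest request queue; seek position or deadline slack could be
# folded into the key as well.

def pick_replica(replica_disks, queue_depth):
    """Return the disk holding a replica with the fewest queued requests."""
    return min(replica_disks, key=lambda d: queue_depth.get(d, 0))

pick_replica(["d1", "d5", "d7"], {"d1": 12, "d5": 3, "d7": 9})  # -> "d5"
```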
Issue 3: End-to-End QoS Support
Storage QoS support only provides guarantees within the storage devices
The TCP/IP network is best-effort; no hard guarantee is provided
The TCP/IP network is shared by a variety of users, not just storage-access users
Feedback control is not practical given the variety of clients and their diverse distribution
Integration of network QoS and storage QoS is needed to provide true end-to-end QoS
A List of Related Projects
iSCSI Performance Study
iSCSI Simulation Implementation and Study
Adaptive iSCSI Storage Access
QoS for iSCSI with OSD Support
Implementation and Evaluation of a DMAPI-Based Data Backup Prototype
Network-Aware Resource Scheduling
QoS Support for OSD Implementation
What is data provenance?
Provenance is a relationship between data objects that explains how a particular object has been derived.
A workflow of data processes usually captures this relationship.
Using provenance, a user can trace the "workflow" that led to the aggregation of processes producing a particular object.
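Tracing that derivation relationship amounts to walking a dependency graph from an object back to its original inputs. A minimal sketch, with an invented two-step workflow as the example graph:

```python
# Provenance as a graph: each object records which objects it was
# derived from; tracing walks back to every contributing ancestor.
# The example workflow (masking, then gene calling) is illustrative.

derived_from = {
    "gene_model": ["masked_contig", "hmm_params"],
    "masked_contig": ["raw_contig"],
}

def trace_provenance(obj, graph):
    """Return every ancestor object that contributed to `obj`."""
    ancestors = []
    stack = [obj]
    seen = set()
    while stack:
        cur = stack.pop()
        for parent in graph.get(cur, []):
            if parent not in seen:       # avoid revisiting shared inputs
                seen.add(parent)
                ancestors.append(parent)
                stack.append(parent)
    return ancestors

trace_provenance("gene_model", derived_from)
# includes "masked_contig", "hmm_params", and "raw_contig"
```

In practice each edge would also carry the process and parameter set that produced the derivation, as the pipeline slides below illustrate.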
EnsEMBL Pipeline (Workflow)
Search targets / models / parameter sets (examples: NCBI NR, PFAM models)
•Some updates (BLAST targets) are additive
•Some represent retraining and cannot easily be added incrementally (new models for HMMs, contig sets from TIGR)
•Update frequency currently driven by computational limits
Genomic sequence data
•Regular (daily) addition of new data
•Occasional updates to existing data
•Corrections take the form of updates
•Also, assembly data (partial chromosome locations)
[Workflow: download & import scripts and update scripts bring genomic sequence data and target sets into a preliminary pipeline; primary data (Contig, Clone, Assembly, dna, …) yield features (dna_align_feature, protein_align_feature, …); gene calling, external gene calls & xrefs, and a protein pipeline producing protein annotations feed the EnsEMBL genes (Transcript, Exon, Gene, Xref, …).]
CCGB-DeCIFR analysis pipeline
Analysis steps:
1. RepeatMasker
2. Genscan (Ath smat) [2b]
3. Fgenesh (dicots smat) [2a]
4. BLASTX vs
• PIR-NREF (soon to be replaced by UniProt)
5. BLASTN vs
• NCBI_dbEST
• NCBI_nt
• NCBI Mt cDNA
• NCBI Ath genome
• NCBI Lj HTGs
• TIGR latest unigenes (Arabidopsis thaliana, Lotus japonicus, Glycine max, Medicago truncatula)
• CCGB unigenes (Medicago truncatula, peanut, Robinia pseudoacacia (black locust))
Nightly download [1] of genomic sequences (HTG FASTA from GenBank) to be put into the pipeline: new phase-3 and phase-2 (at most 2 ordered pieces) HTG sequences
•Check GenBank accession and version numbers against the CCGB-DeCIFR contents to avoid duplication
•If an Acc# is already present in CCGB-DeCIFR with an earlier version#, drop all its analysis results from the database tables
•For the same Acc#, keep only the latest version for analysis
•Use MtBR linkage group information to assign the BAC display to a chromosome
Create a pipeline analysis queuing job (SubmitContig, ContigStartState)
Upon completion of all analyses for a BAC, push that BAC's results from the private CCGB-DeCIFR instance to the public production instance (http://decifr.ccgb.umn.edu)
Incremental BLAST [3]: target updates and query updates
MtBR (University of Minnesota Mt BAC Registry, Young lab): linkage groups; BAC ordering (to come)
Project status: http://decifr.ccgb.umn.edu/Medicago_truncatula/project_status
(The labels 1, 2a, 2b, and 3 mark the pipeline stages referenced by the development points below.)
Suggestions for test development for provenance using the CCGB-DeCIFR genome annotation pipeline
As the annotation pipeline currently stands, three development points in the pipeline are suggested. The first two are immediately available. The third will be available in the near future; it requires us to write a fair amount of new code, and that project needs to be integrated into our development schedule.
1. Provenance of sequences downloaded from NCBI on a nightly basis
Every night a cron job checks for NCBI releases of new Medicago genome sequences that fit specific criteria.
A list of the sequence IDs (Acc# and GI) is made and compared with the contents of the CCGB-DeCIFR database.
Sequences that are downloaded are:
- New accessions (that fit the specific criteria)
- Old accessions with a new GI [sequence updates]
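The download decision above reduces to a version check against the local database. A minimal sketch, with invented accession numbers; here the "new GI / sequence update" case is modeled as a higher GenBank version number for the same accession:

```python
# Decide whether the nightly cron job should fetch a sequence:
# new accession -> yes; known accession with a newer version -> yes
# (old analysis results are then dropped); otherwise skip.

def should_download(acc, version, local_versions):
    """local_versions maps accession -> latest version already analyzed."""
    if acc not in local_versions:
        return True                        # new accession
    return version > local_versions[acc]   # sequence update (new GI)

local = {"AC123456": 2}                    # illustrative DB contents
should_download("AC999999", 1, local)      # True: new accession
should_download("AC123456", 3, local)      # True: updated, redo analysis
should_download("AC123456", 2, local)      # False: already processed
```

Recording each (accession, version, download date) tuple is exactly the provenance this development point asks for.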
2. Provenance of gene prediction analysis (result features, parameters used, DAS source?)
Gene prediction programs may have been trained on different training sets (by different research groups in the US and EU)
Focus on FGENESH (trained for dicots) [2a] and Genscan (trained for Arabidopsis) [2b]
3. Provenance for incremental updates of target databases for homology searches [BLAST, HMM]
How to solve the data provenance problem in bioinformatics?
Workflow of Functional Genomics
Data-Dependent Relationships Between Data Objects
Analysis tools take several input data sets with a set of parameter values to produce a version of an output data object
Results and generated knowledge are presented as annotations and fed back to the system
Generalized Black Box for An Analysis Tool
[Diagram: a generalized black box — any input object (target/DB/query, …) with metadata enters an analysis instance, i.e., an analysis model with its database (gene-calling algorithm, matching algorithm, filters, general DB search, user scripts, …), governed by @parameters and all necessary configuration sets (e.g., version information), and produces output with metadata; intermediate data are included.]
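One way to make that black box concrete is to record every run as a provenance object capturing its tool, version, parameters, inputs, and outputs. This schema and its field names are illustrative assumptions, not a design from the talk:

```python
from dataclasses import dataclass, field

# A provenance record for one execution of an analysis tool:
# the @parameters, configuration/version info, and the input and
# output object identifiers, whose metadata lives with the objects.

@dataclass
class AnalysisInstance:
    tool: str                        # e.g. a gene-calling algorithm
    tool_version: str                # configuration/version information
    parameters: dict                 # the @parameters of this instance
    inputs: list                     # input object ids (target/DB/query)
    outputs: list = field(default_factory=list)

run = AnalysisInstance(
    tool="genscan",                  # illustrative values throughout
    tool_version="1.0",
    parameters={"matrix": "Ath smat"},
    inputs=["contig_42"],
)
run.outputs.append("gene_model_42.v1")
```

Linking each output id back to the `AnalysisInstance` that produced it gives exactly the derivation edges a provenance trace walks.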
Our Proposed Solution
Taking the intelligent storage approach to demonstrate its power
Provenance information is part of the metadata or attributes associated with data
An effectively infinite number of versions of a data object can exist
What is an efficient way to store and maintain these many versions?
How does a change to one object affect the others?