Two Examples of Datanomic
David Du, Digital Technology Center
Intelligent Storage Consortium, University of Minnesota
Datanomic Computing (Autonomic Storage)
System behavior driven by characteristics of the data and the changing environment
Automatic optimization for ever-changing data requirements
Allocate resources according to increases in data demand
Transform data formats to support different applications
Seamless data access from anywhere at any time
Location- and context-aware access to data
Content-based search
Adaptive performance
Consistent view of each user's data
Independent of platforms, operating systems, and data formats
Exploit active objects and active, intelligent disks
Solve the data explosion and provenance issues
Three Possible Approaches
Semantic Web: the Web is the key
Grid Computing: services offered by middleware
Intelligent Storage Devices: reduce layers by adding features to storage devices
Two Examples
E2E QoS Provisioning for Network-Attached Storage Systems
Solutions to the Data Provenance Problem
Motivation of E2E QoS Provisioning
OSD supports diverse applications
Different applications require different performance guarantees: bandwidth, response time, throughput
Objects in OSD carry application semantics
Objects in OSD have full knowledge of their current storage condition
QoS Challenges in Network Attached Storage
QoS requirements come from applications
Data are accessed from remote storage devices via IP network connections
How to ensure QoS delivery within storage devices?
How to ensure QoS over networks?
How to ensure QoS end-to-end?
Feedback Based QoS Control
Use a controller between clients and the storage server (client side vs. ISP)
Clients provide performance goals
A feedback mechanism in the controller compares the measured performance against the expected performance
The controller throttles user requests if a performance goal is violated
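The throttling loop above can be sketched as a simple proportional controller. This is a minimal illustration, not the talk's implementation: the class name, the gain value, and the rate-adjustment rule are all assumptions.

```python
# Sketch of the feedback-based QoS control loop: compare measured
# performance against the client's goal and throttle the admitted
# request rate on violation. All names/constants are illustrative.

class FeedbackController:
    """Adjusts the admitted request rate from measured response time."""

    def __init__(self, perf_goal_ms, initial_rate, gain=0.5):
        self.perf_goal_ms = perf_goal_ms   # performance goal from the client
        self.rate = initial_rate           # admitted requests per second
        self.gain = gain                   # how aggressively to react

    def update(self, measured_ms):
        # Feedback step: normalized error between goal and measurement
        # drives a proportional adjustment of the request rate.
        error = (self.perf_goal_ms - measured_ms) / self.perf_goal_ms
        self.rate = max(1.0, self.rate * (1.0 + self.gain * error))
        return self.rate

ctl = FeedbackController(perf_goal_ms=20.0, initial_rate=100.0)
ctl.update(40.0)   # goal violated: rate is throttled below 100
ctl.update(10.0)   # goal met with slack: rate is allowed to grow
```

A real controller would also smooth measurements and bound the step size, but the essential compare-and-throttle cycle is the same.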
A Feedback Control
[Diagram: the controller sits between clients and the storage server, connected over TCP/IP networks on both sides; a measurement module computes the delivered performance, the controller compares it against the performance goal, and the difference yields an adjusted request rate.]
Possible Control Along The End User Access Data Path
[Diagram: clients (1) connect through their ISPs (2) and TCP/IP network segments to the storage server (3), marking the possible control points along the end-user access data path.]
Storage QoS Control
Motivation
Multimedia applications require guaranteed timely delivery
Different applications have different QoS requirements
Storage access has a lot of variation
QoS provisioning
QoS-aware disk scheduling to guarantee real-time requirements
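One common way to realize QoS-aware disk scheduling with real-time guarantees is earliest-deadline-first (EDF) ordering of pending requests. The sketch below is illustrative of the idea only; the talk does not specify its scheduling algorithm.

```python
import heapq

# EDF sketch: serve pending disk requests in order of their QoS
# deadlines. Request names and deadlines are invented for illustration.

def edf_order(requests):
    """requests: list of (deadline_ms, request_id); returns ids by deadline."""
    heap = list(requests)
    heapq.heapify(heap)                 # min-heap keyed on deadline
    order = []
    while heap:
        _deadline, rid = heapq.heappop(heap)
        order.append(rid)
    return order

edf_order([(30, "video"), (10, "audio"), (100, "backup")])
# serves "audio" first and "backup" last
```

A production scheduler would additionally batch requests by disk position to limit seek overhead, trading some deadline fidelity for throughput.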
Storage Brick
[Diagram: a storage brick target — initiators 1…n, each with network bandwidth BW1…BWn, connect over Gigabit Ethernet and a TCP network to the brick controller (CPU and memory), whose SATA HBAs drive disks d1–d8.]
Challenges in QoS Provisioning in Storage Brick
Storage bricks often use striping and replication to improve performance
RAID systems make disk scheduling more difficult and previous algorithms inappropriate
Client connections have different network bandwidths
iSCSI has upper-level flow control
Issue 1: Network-Aware Scheduling
Goal: exploit knowledge of the underlying network conditions to efficiently schedule object requests
Environment: storage brick attached to the Internet
Assumptions:
Multiple initiators with different bandwidths access the storage brick through iSCSI
Each session's bandwidth can be acquired
Objects are striped over multiple disks in the brick for performance and load balancing
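Under these assumptions, one natural network-aware policy is max-min fair allocation: never grant a session more disk service than its iSCSI connection can drain, and redistribute the surplus to faster sessions. This sketch names that policy explicitly; it is not taken from the talk, and all identifiers and numbers are illustrative.

```python
# Max-min fair split of the brick's disk bandwidth across iSCSI
# sessions, capped by each session's measured network bandwidth.

def allocate_disk_service(disk_rate_mbps, session_bw_mbps):
    """Split total disk bandwidth across sessions; no session gets more
    than its network bandwidth, and surplus flows to the remaining ones."""
    alloc = {}
    remaining = disk_rate_mbps
    # Serve the most constrained sessions first so leftover capacity
    # is redistributed among the less constrained ones.
    for sid, bw in sorted(session_bw_mbps.items(), key=lambda kv: kv[1]):
        sessions_left = len(session_bw_mbps) - len(alloc)
        fair_share = remaining / sessions_left
        alloc[sid] = min(bw, fair_share)
        remaining -= alloc[sid]
    return alloc

# A slow DSL initiator is capped at its own bandwidth; the two faster
# sessions split what remains of the 300 Mbps disk rate.
allocate_disk_service(300, {"init1": 50, "init2": 200, "init3": 1000})
```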
Issue 2: QoS-Aware Storage Scheduling
Motivations:
Different QoS requirements from different applications
Different network bandwidths from different sessions
Different RAID configurations in the storage brick
Objectives:
Propose a framework to support different application QoS for different sessions
Scheduling with Object Replicas
Previous scheduling assumes there is only one copy of the requested object
An object can have multiple replicas
Locate the most favorable replica of a requested object
Schedule the disk access on that favorable replica
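The replica-selection step above can be as simple as picking, among the disks holding a copy, the one with the shortest pending queue. This is a sketch of one possible "favorable replica" criterion; disk names and queue depths are invented.

```python
# Pick the replica on the least-loaded disk. "Favorable" here means
# shortest request queue; seek position or deadline slack could be
# folded into the key as well.

def pick_replica(replica_disks, queue_depth):
    """Return the disk holding a replica with the fewest queued requests."""
    return min(replica_disks, key=lambda d: queue_depth.get(d, 0))

pick_replica(["d1", "d5", "d7"], {"d1": 12, "d5": 3, "d7": 9})  # -> "d5"
```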
Issue 3: End-to-End QoS Support
Storage QoS support only provides guarantees within the storage devices
The TCP/IP network is best-effort; no hard guarantee is provided
The TCP/IP network is shared by a variety of users, not just storage-access users
Feedback control is not practical given the variety of clients and their diverse distribution
Integration of network QoS and storage QoS is needed to provide true end-to-end QoS
A List of Related Projects
iSCSI Performance Study
iSCSI Simulation Implementation and Study
Adaptive iSCSI Storage Access
QoS for iSCSI with OSD Support
Implementation and Evaluation of a DMAPI-Based Data Backup Prototype
Network-Aware Resource Scheduling
QoS Support for OSD Implementation
What is data provenance?
Provenance is a relationship between data objects that explains how a particular object has been derived.
A workflow of data processes usually captures this relationship.
Using provenance, a user can trace the "workflow" that led to the aggregation of processes producing a particular object.
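Tracing that derivation relationship amounts to walking a dependency graph from an object back to its original inputs. A minimal sketch, with an invented two-step workflow as the example graph:

```python
# Provenance as a graph: each object records which objects it was
# derived from; tracing walks back to every contributing ancestor.
# The example workflow (masking, then gene calling) is illustrative.

derived_from = {
    "gene_model": ["masked_contig", "hmm_params"],
    "masked_contig": ["raw_contig"],
}

def trace_provenance(obj, graph):
    """Return every ancestor object that contributed to `obj`."""
    ancestors = []
    stack = [obj]
    seen = set()
    while stack:
        cur = stack.pop()
        for parent in graph.get(cur, []):
            if parent not in seen:       # avoid revisiting shared inputs
                seen.add(parent)
                ancestors.append(parent)
                stack.append(parent)
    return ancestors

trace_provenance("gene_model", derived_from)
# includes "masked_contig", "hmm_params", and "raw_contig"
```

In practice each edge would also carry the process and parameter set that produced the derivation, as the pipeline slides below illustrate.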
EnsEMBL Pipeline (Workflow)
Search targets / models / parameter sets (examples: NCBI NR, PFAM models)
•Some updates (BLAST targets) are additive
•Some represent retraining and cannot easily be added incrementally (new models for HMMs, contig sets from TIGR)
•Update frequency currently driven by computational limits
Genomic sequence data
•Regular (daily) addition of new data
•Occasional updates to existing data
•Corrections take the form of updates
•Also, assembly data (partial chromosome locations)
[Workflow: download & import scripts and update scripts bring genomic sequence data and target sets into a preliminary pipeline; primary data (Contig, Clone, Assembly, dna, …) yield features (dna_align_feature, protein_align_feature, …); gene calling, external gene calls & xrefs, and a protein pipeline producing protein annotations feed the EnsEMBL genes (Transcript, Exon, Gene, Xref, …).]
CCGB-DeCIFR analysis pipeline
Analysis steps:
1. RepeatMasker
2. Genscan (Ath smat) [2b]
3. Fgenesh (dicots smat) [2a]
4. BLASTX vs
• PIR-NREF (soon to be replaced by UniProt)
5. BLASTN vs
• NCBI_dbEST
• NCBI_nt
• NCBI Mt cDNA
• NCBI Ath genome
• NCBI Lj HTGs
• TIGR latest unigenes (Arabidopsis thaliana, Lotus japonicus, Glycine max, Medicago truncatula)
• CCGB unigenes (Medicago truncatula, peanut, Robinia pseudoacacia (black locust))
Nightly download [1] of genomic sequences (HTG FASTA from GenBank) to be put into the pipeline: new phase-3 and phase-2 (at most 2 ordered pieces) HTG sequences
•Check GenBank accession and version numbers against the CCGB-DeCIFR contents to avoid duplication
•If an Acc# is already present in CCGB-DeCIFR with an earlier version#, drop all its analysis results from the database tables
•For the same Acc#, keep only the latest version for analysis
•Use MtBR linkage group information to assign the BAC display to a chromosome
Create a pipeline analysis queuing job (SubmitContig, ContigStartState)
Upon completion of all analyses for a BAC, push that BAC's results from the private CCGB-DeCIFR instance to the public production instance (http://decifr.ccgb.umn.edu)
Incremental BLAST [3]: target updates and query updates
MtBR (University of Minnesota Mt BAC Registry, Young lab): linkage groups; BAC ordering (to come)
Project status: http://decifr.ccgb.umn.edu/Medicago_truncatula/project_status
(The labels 1, 2a, 2b, and 3 mark the pipeline stages referenced by the development points below.)
Suggestions for test development for provenance using the CCGB-DeCIFR genome annotation pipeline
As the annotation pipeline currently stands, three development points in the pipeline are suggested. The first two are immediately available. The third will be available in the near future; it requires us to write a fair amount of new code, and that project needs to be integrated into our development schedule.
1. Provenance of sequences downloaded from NCBI on a nightly basis
Every night a cron job checks for NCBI releases of new Medicago genome sequences that fit specific criteria.
A list of the sequence IDs (Acc# and GI) is made and compared with the contents of the CCGB-DeCIFR database.
Sequences that are downloaded are:
- New accessions (that fit the specific criteria)
- Old accessions with a new GI [sequence updates]
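The download decision above reduces to a version check against the local database. A minimal sketch, with invented accession numbers; here the "new GI / sequence update" case is modeled as a higher GenBank version number for the same accession:

```python
# Decide whether the nightly cron job should fetch a sequence:
# new accession -> yes; known accession with a newer version -> yes
# (old analysis results are then dropped); otherwise skip.

def should_download(acc, version, local_versions):
    """local_versions maps accession -> latest version already analyzed."""
    if acc not in local_versions:
        return True                        # new accession
    return version > local_versions[acc]   # sequence update (new GI)

local = {"AC123456": 2}                    # illustrative DB contents
should_download("AC999999", 1, local)      # True: new accession
should_download("AC123456", 3, local)      # True: updated, redo analysis
should_download("AC123456", 2, local)      # False: already processed
```

Recording each (accession, version, download date) tuple is exactly the provenance this development point asks for.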
2. Provenance of gene prediction analysis (result features, parameters used, DAS source?)
Gene prediction programs may have been trained on different training sets (by different research groups in the US and EU)
Focus on FGENESH (trained for dicots) [2a] and Genscan (trained for Arabidopsis) [2b]
3. Provenance for incremental updates of target databases for homology searches [BLAST, HMM]
How to solve the data provenance problem in bioinformatics?
Workflow of Functional Genomics
Data-Dependent Relationships Between Data Objects
Analysis tools take several input data sets with a set of parameter values to produce a version of an output data object
Results and generated knowledge are presented as annotations and fed back to the system
Generalized Black Box for An Analysis Tool
[Diagram: a generalized black box — any input object (target/DB/query, …) with metadata enters an analysis instance, i.e., an analysis model with its database (gene-calling algorithm, matching algorithm, filters, general DB search, user scripts, …), governed by @parameters and all necessary configuration sets (e.g., version information), and produces output with metadata; intermediate data are included.]
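One way to make that black box concrete is to record every run as a provenance object capturing its tool, version, parameters, inputs, and outputs. This schema and its field names are illustrative assumptions, not a design from the talk:

```python
from dataclasses import dataclass, field

# A provenance record for one execution of an analysis tool:
# the @parameters, configuration/version info, and the input and
# output object identifiers, whose metadata lives with the objects.

@dataclass
class AnalysisInstance:
    tool: str                        # e.g. a gene-calling algorithm
    tool_version: str                # configuration/version information
    parameters: dict                 # the @parameters of this instance
    inputs: list                     # input object ids (target/DB/query)
    outputs: list = field(default_factory=list)

run = AnalysisInstance(
    tool="genscan",                  # illustrative values throughout
    tool_version="1.0",
    parameters={"matrix": "Ath smat"},
    inputs=["contig_42"],
)
run.outputs.append("gene_model_42.v1")
```

Linking each output id back to the `AnalysisInstance` that produced it gives exactly the derivation edges a provenance trace walks.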
Our Proposed Solution
Taking the intelligent storage approach to demonstrate its power
Provenance information is part of the metadata or attributes associated with data
An effectively infinite number of versions of a data object can exist
What is an efficient way to store and maintain these many versions?
How does a change to one object affect the others?