EVS Data Curation

The processing and publication of data for web browsing and programmatic access


Data Curation Flowchart

Gene Ontology and Zebrafish

Downloaded as OBO from web sites
Processed with C++ program into Ontylog XML (OBO2TDE.exe)
Processed with C++ program into OWL (ontyxToOWL.exe)
Loaded using LoadNCIThesOWL.sh
Metadata loaded using LoadMetadata
Hierarchy and Sources manually edited

HL7 and VA_NDFRT

Retrieved from sources
Processed by Apelon into Ontylog XML
Loaded into LexBIG using LoadNCIThesOwl and manifest
Metadata loaded using LoadMetadata

MGED

OWL file downloaded from source web site
Loaded into Protégé
Classified
Inferred version exported as OWL file
Loaded into LexBIG using LoadNCIThesOwl
Metadata loaded using LoadMetadata
Hierarchy and Sources manually edited

Snomed, MedDRA and LOINC

Extracted from the UMLS into RRF files
Loaded into LexBIG using LoadUMLSFiles
Metadata loaded using LoadMetadata

UMLS Semnet

Downloaded from UMLS Semnet web site
Loaded using LoadUMLSSemnet
Metadata loaded using LoadMetadata

Metathesaurus

Loaded from UMLS into MEME
NCI Thesaurus imported monthly
Other vocabs added or removed
NCI specific edits made to data and relations
Exported as RRF
Imported to LexBIG using LoadNCIMeta
Metadata loaded using LoadMetadata
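The load paths on the preceding slides can be condensed into a lookup table. This is only a summary sketch: the names `LOADERS` and `load_steps` are hypothetical and not part of the EVS tooling, while the loader names are copied as they appear on the slides.

```python
# Hypothetical summary table; keys and loader names are copied from the slides.
LOADERS = {
    "Gene Ontology": "LoadNCIThesOWL.sh",
    "Zebrafish": "LoadNCIThesOWL.sh",
    "HL7": "LoadNCIThesOwl",
    "VA_NDFRT": "LoadNCIThesOwl",
    "MGED": "LoadNCIThesOwl",
    "Snomed": "LoadUMLSFiles",
    "MedDRA": "LoadUMLSFiles",
    "LOINC": "LoadUMLSFiles",
    "UMLS Semnet": "LoadUMLSSemnet",
    "Metathesaurus": "LoadNCIMeta",
}


def load_steps(vocabulary):
    """Return the loader for a vocabulary followed by the metadata step.

    Every load path on the slides ends with LoadMetadata.
    """
    return [LOADERS[vocabulary], "LoadMetadata"]


print(load_steps("Metathesaurus"))
```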

Preparing TDE Thesaurus for MEME

Thesaurus Ontylog XML baseline is processed through C++ app publishMEME.exe
Current baseline compared to previous to get summary of new properties or roles
Summary used to create import configuration file
Baseline imported into MEME
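The comparison step amounts to a set difference between the property and role names of two baselines. The sketch below assumes the names have already been extracted from the Ontylog XML into plain lists; the function name, the dictionary layout, and the sample names are all illustrative, not taken from publishMEME.exe.

```python
def summarize_new_names(previous, current):
    """Return property and role names present in current but not in previous.

    Both arguments are dicts with optional "properties" and "roles" lists.
    """
    return {
        kind: sorted(set(current.get(kind, ())) - set(previous.get(kind, ())))
        for kind in ("properties", "roles")
    }


# Illustrative names only; real baselines would supply their own.
previous_baseline = {"properties": ["Preferred_Name", "Synonym"], "roles": ["Example_Role"]}
current_baseline = {
    "properties": ["Preferred_Name", "Synonym", "Example_New_Property"],
    "roles": ["Example_Role"],
}
summary = summarize_new_names(previous_baseline, current_baseline)
print(summary)
```

Anything listed in the summary would need an entry in the import configuration file before the baseline is imported.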

Preparing Thesaurus for MEME

NCI Thesaurus from TDE

Edited in TDE and exported to Ontylog XML by name
Run through publishTDE to remove unpublishable properties
Run through OntyxToOwl.exe to create OWL file by code
Loaded into LexBIG using LoadNCIThesOWL
Metadata loaded using LoadMetadata
History generated from TDE baseline
History loaded using LoadNCIHistory


NCI Thesaurus from Protégé

Run OWL through application to get Ontylog XML by name
Run Ontylog XML through publishTDE to remove unpublishable properties
Run through OntylogtoOWL to get OWL by code
Generate history using the Ontylog XML

NCI Thesaurus History Processing

evs_history records concept modifications made in the editor
These records are extracted monthly to consolidate them and to remove identifying information
Cleaned records are loaded into concept_history
Full concept_history is loaded into LexBIG for NCI Thesaurus History
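Removing identifying information can be pictured as dropping the user and host columns from each history record. The sketch below assumes the nine-field, pipe-delimited layout seen in the sample log records later in this deck; the field meanings (record id, code, name, action, timestamp, user, host, reference, trailing flag) are inferred from those samples, and `CleanRecord` and `clean_record` are hypothetical names. The consolidation rule is not described on the slides, so it is not modeled here.

```python
from collections import namedtuple

# Hypothetical cleaned shape: only the non-identifying fields are kept.
CleanRecord = namedtuple("CleanRecord", "code action timestamp reference")


def clean_record(line):
    """Parse one pipe-delimited history line and drop the user and host fields."""
    fields = line.strip().split("|")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    _rec_id, code, _name, action, timestamp, _user, _host, reference, _flag = fields
    # "(null)" is how the sample records show an empty reference.
    return CleanRecord(code, action, timestamp, None if reference == "(null)" else reference)


sample = "668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0"
print(clean_record(sample))
```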

TDE to DTS

log.out

New concepts created through Create or Split actions:
C72675|Feet_First
.
Concepts merged into other concepts:
C17841|Oncologic_Surgeon
.
Retired concepts (including merged):
C17841|Oncologic_Surgeon
.
New concepts not found in BSLN2:
C73140|Ethaverine_
.
Retired concepts not found in BSLN2
C73401|Maqui_Berry_Flavor
.
Modify records correponding to Retired_Kind are discarded:
667487|C62920|Medical_Device_Unsafe_to_Use|Modify|2008-03-05 …
.
Modify records correponding to new codes are discarded:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 …
.
Modify records correponding to merged codes are discarded:
668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0
.
Records correponding to codes not found in BSLN2 are discarded:
671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
WARNING: New codes created, then retired, but still found in BSLN2: (to be edited manually)
C72675|Feet_First
.
List of all remaining records
.
List of all discarded records:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
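The discard rules named in log.out can be read as a simple filter. The following sketch is a reconstruction from the log headings only, not the actual TDE-to-DTS code: the function name and the (code, action) record shape are assumptions, and the baseline membership in the example is made up for illustration. The codes themselves are the ones shown in the log.

```python
def split_records(records, new_codes, merged_codes, retired_codes, baseline_codes):
    """Partition (code, action) records into kept and discarded lists.

    Rules taken from the log headings:
    - any record whose code is not in the baseline is discarded
    - Modify records for new, merged, or retired codes are discarded
    """
    superseded = new_codes | merged_codes | retired_codes
    kept, discarded = [], []
    for code, action in records:
        drop = code not in baseline_codes or (action == "Modify" and code in superseded)
        (discarded if drop else kept).append((code, action))
    return kept, discarded


records = [("C72831", "Modify"), ("C3824", "Modify"), ("C73140", "New"), ("C62171", "Modify")]
kept, discarded = split_records(
    records,
    new_codes={"C72831"},
    merged_codes={"C3824"},
    retired_codes={"C62920"},
    baseline_codes={"C72831", "C3824", "C62171"},  # illustrative membership
)
print(kept)
print(discarded)
```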

tde_history_report.txt

Spilanthes_oleracea (Code: C72446)

Number of modelers: 3
Modeler: shaiu
Modeler: thomas
Modeler: creech

Modeler: shaiu
Action: modify time: 2008-03-05 05:03:58.0

Modeler: thomas
Action: modify time: 2008-03-06 02:03:05.0
Action: modify time: 2008-03-14 10:03:06.0

Modeler: creech
Action: modify time: 2008-03-06 02:03:06.0

------------------------------------------------------------------.

Edited actions for the following concepts are discarded:

Concept codes requiring manual review:
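The per-concept section of tde_history_report.txt is a grouping of edit actions by modeler. A minimal sketch of that grouping follows, using the Spilanthes_oleracea entries shown above; the function name and the tuple layout are assumptions, not the report generator's real interface.

```python
from collections import defaultdict


def group_by_modeler(records):
    """Group (modeler, action, time) tuples for one concept by modeler.

    Modelers appear in order of their first action.
    """
    grouped = defaultdict(list)
    for modeler, action, time in records:
        grouped[modeler].append((action, time))
    return dict(grouped)


# Entries for Spilanthes_oleracea (Code: C72446) as listed in the report.
actions = [
    ("shaiu", "modify", "2008-03-05 05:03:58.0"),
    ("thomas", "modify", "2008-03-06 02:03:05.0"),
    ("thomas", "modify", "2008-03-14 10:03:06.0"),
    ("creech", "modify", "2008-03-06 02:03:06.0"),
]
by_modeler = group_by_modeler(actions)
print("Number of modelers:", len(by_modeler))
```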

DTS_history

DTS_history_script.sql

insert into concept_history(concept, editaction, editdate, reference)
values ('C72675', 'create', '28-MAR-08', null);
insert into concept_history(concept, editaction, editdate, reference)
values ('C72676', 'create', '28-MAR-08', null);

.

.
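Statements of this shape could be generated from cleaned history records along the following lines. This is a sketch, not the script that produced DTS_history_script.sql: the helper names are hypothetical, and the month abbreviations are hard-coded so the output does not depend on the locale. For loading directly from a program, bind parameters would normally be preferable to building SQL text.

```python
from datetime import date

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def format_edit_date(d):
    """Format a date as DD-MON-YY, matching the sample script (for example 28-MAR-08)."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year % 100:02d}"


def insert_statement(concept, editaction, editdate, reference=None):
    """Build one concept_history insert statement as a single line of SQL text."""
    # Single quotes are doubled so a reference value cannot break the literal.
    ref = "null" if reference is None else "'" + reference.replace("'", "''") + "'"
    return (
        "insert into concept_history(concept, editaction, editdate, reference) "
        f"values ('{concept}', '{editaction}', '{format_edit_date(editdate)}', {ref});"
    )


print(insert_statement("C72675", "create", date(2008, 3, 28)))
```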

DTS_history_out.txt

666540|C72675|create|28-MAR-08|(null)
666541|C72676|create|28-MAR-08|(null)
666542|C62171|modify|28-MAR-08|(null)

.

.

DTS_history_out.out

Lists complete contents of both baselines.
Number of codes in {baseline A} : 65265
Number of codes in {baseline B} : 66022

Concepts found in {baseline B}: but not in {baseline A}
C72675
C72676
.
Concepts found in {baseline A}: but not in {baseline B} (should be empty)
.
Verify DTS_history_out.txt against baseline data.
New Concepts: 757
(1) C72675
(2) C72676
.
Concepts created through Split: 0
Split Concepts: 0
Retired Concepts: 4
(1) C20920
(2) C62920
Concepts retired through Merge: 5
(1) C14142
Merge Concepts: 5
(1) C1363
Modified Concepts: 1364
Invalid actions: 0
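The baseline check in DTS_history_out.out is a two-way set difference over concept codes. The sketch below uses a hypothetical `compare_baselines` helper and a tiny illustrative pair of baselines. The reported counts are consistent with this reading: with nothing missing from baseline B, 66022 - 65265 = 757, which matches the New Concepts figure.

```python
def compare_baselines(codes_a, codes_b):
    """Return the codes found only in baseline B and only in baseline A."""
    a, b = set(codes_a), set(codes_b)
    return {"only_in_b": sorted(b - a), "only_in_a": sorted(a - b)}


# Tiny illustrative baselines; the real ones hold tens of thousands of codes.
baseline_a = ["C62171", "C1363"]
baseline_b = ["C62171", "C1363", "C72675", "C72676"]
diff = compare_baselines(baseline_a, baseline_b)

print("New concepts:", diff["only_in_b"])
print("Missing from B (should be empty):", diff["only_in_a"])
```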

Tiered Deployments

NCICB uses 4-tiered deployments
Dev tier: used internally by EVS team to test software and data
QA tier: used by QA and other software teams to test against new EVS software or data
Stage tier: used to test software deployments in a near-production environment
Production: available to outside users