v3NLP Framework
dbAnnotation Database Schema
(created 12/2011; revised 10/09/2012, 10/18/2012, 10/23/2012, 10/25/2012)
Tables (original)
document
document_id BIGINT
referenceSystem VARCHAR(120)
referenceLocator VARCHAR(120)
documentAnnotations
documentAnnotation_id BIGINT
document_id BIGINT
annotation_id BIGINT
annotation
annotation_id BIGINT
entityDefinition_id BIGINT
entityDefinition
entityDefinition_id BIGINT
Name VARCHAR(120)
provenance VARCHAR(120)
span
span_id BIGINT
documentAnnotation_id BIGINT
filter VARCHAR(50)
startOffset INTEGER
endOffset INTEGER
feature
feature_id BIGINT
annotation_id BIGINT
entityDefinition_id BIGINT
featureElementText
featureElement_id BIGINT
feature_id BIGINT
value VARCHAR(6300)
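The tables above can be sketched as executable DDL. This is a minimal sketch using Python's built-in sqlite3 module: the column names and types come from the slide, while the PRIMARY KEY and REFERENCES clauses are assumptions inferred from the `*_id` naming, not something the slide states.

```python
import sqlite3

# Column names and types follow the slide. PRIMARY KEY and REFERENCES
# clauses are assumptions inferred from the *_id column names.
DDL = """
CREATE TABLE document (
    document_id      BIGINT PRIMARY KEY,
    referenceSystem  VARCHAR(120),
    referenceLocator VARCHAR(120)
);
CREATE TABLE entityDefinition (
    entityDefinition_id BIGINT PRIMARY KEY,
    Name                VARCHAR(120),
    provenance          VARCHAR(120)
);
CREATE TABLE annotation (
    annotation_id       BIGINT PRIMARY KEY,
    entityDefinition_id BIGINT REFERENCES entityDefinition(entityDefinition_id)
);
CREATE TABLE documentAnnotations (
    documentAnnotation_id BIGINT PRIMARY KEY,
    document_id           BIGINT REFERENCES document(document_id),
    annotation_id         BIGINT REFERENCES annotation(annotation_id)
);
CREATE TABLE span (
    span_id               BIGINT PRIMARY KEY,
    documentAnnotation_id BIGINT REFERENCES documentAnnotations(documentAnnotation_id),
    "filter"              VARCHAR(50),  -- quoted because FILTER is also an SQL keyword
    startOffset           INTEGER,
    endOffset             INTEGER
);
CREATE TABLE feature (
    feature_id          BIGINT PRIMARY KEY,
    annotation_id       BIGINT REFERENCES annotation(annotation_id),
    entityDefinition_id BIGINT REFERENCES entityDefinition(entityDefinition_id)
);
CREATE TABLE featureElementText (
    featureElement_id BIGINT PRIMARY KEY,
    feature_id        BIGINT REFERENCES feature(feature_id),
    value             VARCHAR(6300)
);
"""


def create_schema(conn):
    """Create the seven dbAnnotation tables on the given connection."""
    conn.executescript(DDL)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```

SQLite is used here only because it ships with Python; the production database engine is not named in the slides.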
Annotation Notes
• There is a one-to-one relationship between rows in the documentAnnotations table and the Annotation table.
• It is recognized that these tables should be folded into one table. There is an explanation why they are not.
• Other resources, including annotation admin and chart reader, use a schema similar to this. dbAnnotation's schema is meant to be isomorphic with the schemas for annotation admin and chart reader.
• These two tables mirror external tables set up for other tools within VINCI's data.
• Chart reader and annotation admin allow for an annotation that spans across documents. Under such circumstances, a single annotation_id would be associated with more than one documentAnnotation_id.
• The dbAnnotation schema does not handle this pathological circumstance, resulting in the one-to-one relationship rather than the n-to-1 relationship in the other schemas.
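The one-to-one rule in the notes above could be enforced in the database itself. The following is a minimal sketch, assuming a UNIQUE constraint on documentAnnotations.annotation_id; the slides do not say whether any such constraint exists, and `link_is_accepted` is a hypothetical helper introduced only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    annotation_id       BIGINT PRIMARY KEY,
    entityDefinition_id BIGINT
);
CREATE TABLE documentAnnotations (
    documentAnnotation_id BIGINT PRIMARY KEY,
    document_id           BIGINT,
    -- Assumption: UNIQUE is one possible way to enforce the one-to-one rule.
    annotation_id         BIGINT UNIQUE REFERENCES annotation(annotation_id)
);
""")
conn.execute("INSERT INTO annotation VALUES (1, 10)")
conn.execute("INSERT INTO documentAnnotations VALUES (100, 7, 1)")


def link_is_accepted(conn, document_annotation_id, document_id, annotation_id):
    """Try to link an annotation to a document; report whether it was allowed."""
    try:
        conn.execute(
            "INSERT INTO documentAnnotations VALUES (?, ?, ?)",
            (document_annotation_id, document_id, annotation_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# A second documentAnnotations row for annotation 1 (the cross-document
# case that chart reader and annotation admin permit) is rejected here.
second_link_ok = link_is_accepted(conn, 101, 8, 1)
print(second_link_ok)
```

Dropping the UNIQUE keyword would give the n-to-1 behavior of the other schemas.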
documentAnnotations
documentAnnotation_id BIGINT
document_id BIGINT
annotation_id BIGINT
Annotation [see notes]
annotation_id BIGINT
entityDefinition_id BIGINT
Additional Tables (revised)
Corpus [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(20)
documentName VARCHAR(120)
documentTitle VARCHAR(120)
patient_id VARCHAR(20)
tiu_id VARCHAR(20)
annotationConceptIndex [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(20)
tiu_id VARCHAR(20)
patient_id VARCHAR(20)
documentTitle VARCHAR(120)
annotation_id BIGINT
startOffset INTEGER
endOffset INTEGER
annotation_name VARCHAR(60)
content VARCHAR(2100)
negationStatus VARCHAR(20)
sectionName VARCHAR(40)
conceptNames VARCHAR(160)
cuis VARCHAR(12)
semanticTypes VARCHAR(20)
semanticGroups VARCHAR(20)
featureNames VARCHAR(2100)
featureValues VARCHAR(2100)
Corpus Notes
• This table is needed to track the same document as it passes through the software multiple times, as when the software gets revised.
• documentName is equivalent to referenceLocator in the document table, but is only filled out with a full path to the location of the document. (referenceLocator might instead be filled out with the query that created the record.)
• tiu_id is the record id from the table (TIU_NOTES) from which the document came. This might differ from the document name.
• patient_id is the link to groups of documents. It is not propagated to the normalized tables, to keep a firewall between potentially de-identified records and patient-sensitive data.
• documentTitle is a slot for the document title, if known.
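As an illustration of the tracking role described in the notes above, the sketch below looks up every run that processed one source note. It assumes that each run stores its own document_id and that tiu_id stays stable across runs; the slides do not spell this out. The rows and the `runs_for_note` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE corpus (
    corpus_id     BIGINT,
    document_id   BIGINT,
    run_id        VARCHAR(20),
    documentName  VARCHAR(120),
    documentTitle VARCHAR(120),
    patient_id    VARCHAR(20),
    tiu_id        VARCHAR(20)
)""")

# Hypothetical rows: the same TIU note processed by two software runs.
rows = [
    (1, 501, "run-A", "/notes/n1.txt", None, "P1", "T9"),
    (1, 502, "run-B", "/notes/n1.txt", None, "P1", "T9"),
    (1, 503, "run-B", "/notes/n2.txt", None, "P2", "T10"),
]
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def runs_for_note(conn, tiu_id):
    """Return (run_id, document_id) pairs for one source note, ordered by run."""
    cur = conn.execute(
        "SELECT run_id, document_id FROM corpus "
        "WHERE tiu_id = ? ORDER BY run_id",
        (tiu_id,),
    )
    return cur.fetchall()


print(runs_for_note(conn, "T9"))
```

Because patient_id lives only in this table, queries against the normalized tables alone never expose it.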
Corpus [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(20)
documentName VARCHAR(120)
documentTitle VARCHAR(120)
patient_id VARCHAR(20)
tiu_id VARCHAR(20)
annotationConceptIndex Notes
• This table is a flattened view of the corpus for information retrieval purposes.
• One row per annotation, and one table for query purposes.
• It is just one of a number of indexes/views that could be made from the normalized tables.
• Includes patient and TIU ids.
• One-to-one relationship between corpus, document and run id.
• The (normalized) text between offsets is represented in this table within the content field.
• Annotation names will contain labels that are kinds of concepts, for example Symptom.
• Includes slots for documentTitle and sectionName.
• Concept attributes are represented as explicit fields including conceptNames, cuis, semanticTypes, and semanticGroups.
• Concept attributes are pipe-delimited fields.
• featureNames is a pipe-delimited string, with each field being a feature name, as a catch-all for other attributes.
• featureValues is a pipe-delimited string, with each field being a feature value, as a catch-all for other attributes.
• One-to-one correspondence between the featureNames and featureValues fields.
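The pairing of the pipe-delimited featureNames and featureValues fields can be sketched as follows. `parse_features` is a hypothetical helper, not part of v3NLP, and the feature names in the example are made up; the sketch also assumes that names and values never contain a literal pipe character.

```python
def parse_features(feature_names, feature_values, sep="|"):
    """Pair pipe-delimited feature names with their values, position by position."""
    if not feature_names:
        return {}
    names = feature_names.split(sep)
    values = (feature_values or "").split(sep)
    # The notes require a one-to-one correspondence between the two fields.
    if len(names) != len(values):
        raise ValueError(
            f"{len(names)} feature names but {len(values)} feature values"
        )
    return dict(zip(names, values))


features = parse_features("severity|laterality", "mild|left")
print(features)
```

Raising on a length mismatch is a design choice for this sketch; a loader could instead log the row and skip it.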
annotationConceptIndex [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(20)
tiu_id VARCHAR(20)
patient_id VARCHAR(20)
documentTitle VARCHAR(120)
annotation_id BIGINT
startOffset INTEGER
endOffset INTEGER
annotation_name VARCHAR(60)
content VARCHAR(2100)
negationStatus VARCHAR(20)
sectionName VARCHAR(40)
conceptNames VARCHAR(160)
cuis VARCHAR(160)
semanticTypes VARCHAR(160)
semanticGroups VARCHAR(160)
featureNames VARCHAR(2100)
featureValues VARCHAR(2100)
View to be created from dbAnnotation to annotations-dbd
• The annotations-dbd schema is an agreed-upon schema for interoperability between several systems at the Salt Lake City VA, including annotationAdmin and ChartReader.
• When the need arises, a database view can be created to make dbAnnotation look like the annotations-dbd tables, preserving interoperability between systems.
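A view of the kind described above might look like the following sketch. The annotations-dbd table and field names (reference with id and uri, feature_element_text with id and text_value) are taken from the compatibility listing in this document; pairing them column by column with the dbAnnotation fields is an assumption, and the inserted rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entityDefinition (
    entityDefinition_id BIGINT,
    Name                VARCHAR(120),
    provenance          VARCHAR(120)
);
CREATE TABLE featureElementText (
    featureElement_id BIGINT,
    feature_id        BIGINT,
    value             VARCHAR(6300)
);

-- Assumed mapping: entityDefinition -> reference (id, uri).
CREATE VIEW reference AS
    SELECT entityDefinition_id AS id, provenance AS uri
    FROM entityDefinition;

-- Assumed mapping: featureElementText -> feature_element_text (id, text_value).
CREATE VIEW feature_element_text AS
    SELECT featureElement_id AS id, value AS text_value
    FROM featureElementText;
""")

# Hypothetical rows, written through the dbAnnotation tables.
conn.execute("INSERT INTO entityDefinition VALUES (5, 'Symptom', 'v3NLP')")
conn.execute("INSERT INTO featureElementText VALUES (1, 10, 'mild')")

# A tool that expects annotations-dbd names can read through the views.
dbd_rows = conn.execute(
    "SELECT id, text_value FROM feature_element_text"
).fetchall()
print(dbd_rows)
```

Views for span and annotation would need a separate schema or database, since those table names are identical in both schemas.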
Tables (revised)
document
document_id BIGINT
referenceSystem VARCHAR(120)
referenceLocator VARCHAR(120)
documentAnnotations
documentAnnotation_id BIGINT
document_id BIGINT
annotation_id BIGINT
Annotation [see notes]
annotation_id BIGINT
entityDefinition_id BIGINT
entityDefinition
entityDefinition_id BIGINT
Name VARCHAR(120)
provenance VARCHAR(120)
span
span_id BIGINT
documentAnnotation_id BIGINT
filter VARCHAR(50)
startOffset INTEGER
endOffset INTEGER
feature
feature_id BIGINT
annotation_id BIGINT
entityDefinition_id BIGINT
featureElementText
featureElement_id BIGINT
feature_id BIGINT
value VARCHAR(6300)
Corpus [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(120)
documentName VARCHAR(120)
documentTitle VARCHAR(120)
patient_id VARCHAR(20)
tiu_id VARCHAR(20)
annotationConceptIndex [see notes]
corpus_id BIGINT
document_id BIGINT
run_id VARCHAR(20)
tiu_id VARCHAR(20)
patient_id VARCHAR(20)
documentTitle VARCHAR(120)
annotation_id BIGINT
startOffset INTEGER
endOffset INTEGER
annotation_name VARCHAR(60)
content VARCHAR(2100)
negationStatus VARCHAR(20)
sectionName VARCHAR(40)
conceptNames VARCHAR(160)
cuis VARCHAR(160)
semanticTypes VARCHAR(160)
semanticGroups VARCHAR(160)
featureNames VARCHAR(2100)
featureValues VARCHAR(2100)
annotations-dbd Schema
Compatibility with the annotations-dbd schema
| annotations-dbd table (fields) | dbAnnotations table (fields) | Compatibility notes |
|---|---|---|
| analyte_reference (id) | document (document_id, run_id) | Both have reference_system and reference_locator fields. v3NLP tools do not fill these fields out. The annotations-dbd schema does not have a run_id. |
| annotation_analyte_reference (analyte_reference_id) | documentAnnotations (documentAnnotation_id) | Both have the field filter. v3NLP tools do not fill this field out. |
| span (id) | span (span_id) | Offsets in annotations-dbd are longs, but ints in the dbAnnotations schema. |
| annotation (id, resource_id) | annotation (annotation_id, entityDefinition_id) | |
| reference (id, uri) | entityDefinition (entityDefinition_id, provenance) | |
| feature (id, resource_id) | feature (feature_id, entityDefinition_id) | 1. annotations-dbd contains a parent id field not replicated in the dbAnnotations schema. 2. The annotations-dbd features table can reference other features. v3NLP tools have not implemented this relationship. |
| feature_element_text (id, text_value) | featureElementText (featureElement_id, value) | The feature_id, resource_id pair is redundant and not replicated in dbAnnotations. |
| feature_element_numeric | [TBD] | |
| feature_element_blob | [TBD] | |