
Semantic annotation, clustering and visualization


"Practise" presentation of my MSc thesis I did for the Leiden University Bio-imaging Group. More information @ http://graus.nu/category/thesis/


Page 1: Semantic annotation, clustering and visualization

Semantic annotation, clustering and visualization

David Graus

Media Technology MSc Programme

Graduation Project

Supervisor: Joris Slob

Page 2: Semantic annotation, clustering and visualization

Introduction

David Graus, Media Technology MSc Programme, 07/02/2012

Page 3: Semantic annotation, clustering and visualization

Cyttron DB entry

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."


Page 4: Semantic annotation, clustering and visualization

Tasks

1. Semantic annotation

Identify and tag the most important concepts in the text [NLP]

2. Topic extraction

Relate concepts and find clusters [Linked Data]

3. Visualization

Draw the resulting graphs and clusters [Data visualization]


Page 5: Semantic annotation, clustering and visualization

1. Semantic Annotation

Method I: Find words

Method II: Compare texts


Page 6: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."


Page 7: Semantic annotation, clustering and visualization

Formal knowledge: Biomedical Ontology


Page 8: Semantic annotation, clustering and visualization

NCI Thesaurus

89,129 unique concepts
50,804 definitions
258,051 synonyms
Relations!

Concept: Agrobacterium tumefaciens

Definition: A species of Gram negative, rod shaped bacteria assigned to the phylum Proteobacteria. This bacteria is motile by flagella and mediates the horizontal gene transfer of its Ti plasmid to infect plants. A. tumefaciens is commonly found in soil and around the root surfaces of plants and is the causative agent of crown gall disease.

Synonyms: RHIZOBIUM RADIOBACTER, CDC GROUP VD-3


Page 9: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

Page 10: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

Page 11: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."

Is a: Body Part, Organ, or Organ Component

subClass Of: Organ, Central Nervous System Part

Definition: An organ composed of grey and white matter containing billions of neurons that is the center for intelligence and reasoning. It is protected by the bony cranium.


Page 12: Semantic annotation, clustering and visualization

Example

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."

Most, Brain, A, Inferior, Data, And, With, Volume, Volume, Three, Temporal, Superior, Study, Scale, Parietal, Number, Lobe, Line, Into, Frontal Lobe, Deep, Color, At


Page 13: Semantic annotation, clustering and visualization

Example

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."

Page 14: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

2 'modifiers' of representations:

1. (Porter) stemming (text & ontology concepts)

Lobes – lobe
Brains – brain
Etc.

2. Generate synonyms (using WordNet)
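
A minimal sketch of how Method I with the stemming modifier might look. It assumes a toy list of concept labels and uses a crude plural-stripping function as a stand-in for the Porter stemmer (the real Porter algorithm has many more rules). The names `crude_stem`, `find_concepts` and `labels` are hypothetical, and synonym generation with WordNet is not shown.

```python
import re

def crude_stem(word):
    """Rough stand-in for Porter stemming: only strips a plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def find_concepts(text, labels):
    """Return labels whose stemmed words all occur among the stemmed text words.

    This is a bag-of-words match: the words of a label do not have to be adjacent.
    """
    text_stems = {crude_stem(w) for w in re.findall(r"[a-zA-Z]+", text)}
    matches = []
    for label in labels:
        label_stems = [crude_stem(w) for w in label.split()]
        if all(stem in text_stems for stem in label_stems):
            matches.append(label)
    return matches

# Hypothetical concept labels; the slides use the NCI Thesaurus.
labels = ["Brain", "Temporal Lobe", "Parietal Lobe", "Mammary Gland"]
text = "The most inferior regions of the temporal lobes of the brain."
found = find_concepts(text, labels)
print(found)
```

Stemming both the text and the labels is what lets "lobes" in the text match the label "Temporal Lobe".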


Page 15: Semantic annotation, clustering and visualization

Different text representations

Combo: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe. brain, color, deep, imaging, insufficient, […]

Most frequent words: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe'

Most frequent nouns: 'brain, color, deep, imaging, insufficient, line, lobe, number, rendering, scale, significance, study, volume'

Bigrams: 'also compromise, artifacts may, cm deep, color scale, compromise significance, deep line, dimensional rendering, imaging artifacts, may also, mm voxels, represents number, scale represents, significance results, subjects along, data least, data obtained, evaluated study, frontal lobe, frontal parietal, inferior portions'

Trigrams: 'also compromise significance, artifacts may also, cm deep line, color scale represents, compromise significance results, imaging artifacts may, may also compromise, scale represents number, insufficient data obtained, mm voxels data, portions frontal lobe, […]
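
A rough sketch of how keyword representations like these could be derived, assuming a small hand-picked stopword list and a shortened input text. The names `STOPWORDS`, `tokens` and `ngrams` are hypothetical. The noun representation would need a part-of-speech tagger and is not shown.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "in", "a", "is", "were", "and", "to",
             "with", "this", "at", "into", "where", "not"}

def tokens(text):
    """Lowercase, keep alphabetic runs, drop stopwords and one-letter tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS and len(w) > 1]

def ngrams(words, n):
    """All runs of n consecutive words, joined by a space.

    Sentence boundaries are ignored in this sketch.
    """
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = ("The color scale represents the number of 4-mm voxels. "
        "Imaging artifacts may also compromise the significance of results.")
words = tokens(text)
top_words = [w for w, _ in Counter(words).most_common(5)]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
```

Because stopwords are removed before the n-grams are built, pairs such as "represents number" appear even though "the" sits between those words in the original text.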

Page 16: Semantic annotation, clustering and visualization

Semantic Annotation: Method I

6 representations (literal + 5 keyword variations)

4 treatments (literal + stem + synonyms + both)

24 results
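
The 24 results follow from pairing every representation with every treatment. A small sketch of that enumeration is shown here; the labels in both lists are assumptions based on the names used on the preceding slides.

```python
from itertools import product

# Assumed labels: the literal text plus the five keyword variations.
representations = ["literal", "combo", "most frequent words",
                   "most frequent nouns", "bigrams", "trigrams"]
# Assumed labels: untreated, stemmed, with synonyms, and both combined.
treatments = ["literal", "stemmed", "synonyms", "stemmed + synonyms"]

runs = list(product(representations, treatments))
print(len(runs))  # 6 * 4 combinations
```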


Page 17: Semantic annotation, clustering and visualization

Method II: Text Comparison

Find concepts that might not occur in text

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."


Page 18: Semantic annotation, clustering and visualization

Compare text to definitions

Find relevant concepts based on their (textual) definitions

Compare: Cyttron entry vs. NCI Thesaurus definitions

Parietal Lobe: One of the lobes of the cerebral hemisphere located superiorly to the occipital lobe and posteriorly to the frontal lobe. Cognition and visuospatial processing are its main functions.

"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dim

Page 19: Semantic annotation, clustering and visualization

Method II: Text Comparison

Find concepts that might not occur in text

Frontotemporal Dementia: A syndrome caused by progressive degeneration of the frontal or temporal lobes of the brain. It is manifested with personality changes and deterioration of the language skills.

Parietal Lobe: One of the lobes of the cerebral hemisphere located superiorly to the occipital lobe and posteriorly to the frontal lobe. Cognition and visuospatial processing are its main functions.

Area of Broca: A region of the brain located in the frontal lobe of the dominant hemisphere; it plays an essential role in the motor control of speech.


"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."

Page 20: Semantic annotation, clustering and visualization

Compare how?

Bag of Words + TF-IDF

Dictionary: BioMedCentral Corpus

> 100,000 articles

> 8 GB raw data

Process corpus:

1. Clean (strip tags, store only article body)

2. Tokenize (create list of words)

3. Remove common words (stopwords)

4. Stem remaining words
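
The processing steps above can be sketched roughly as follows, assuming a three-document toy corpus, a tiny stopword list, and crude plural stripping in place of a real stemmer. All names (`preprocess`, `build_idf`, `corpus`) are hypothetical. The IDF weight used here is the common log(N / document frequency) variant, which may differ from the weighting used in the thesis.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "in", "a", "is", "and", "to"}  # illustrative only

def crude_stem(word):
    """Stand-in for a real stemmer: only strips a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(raw):
    """Clean, tokenize, remove stopwords, stem."""
    body = re.sub(r"<[^>]+>", " ", raw)             # strip tags
    words = re.findall(r"[a-z]+", body.lower())     # tokenize
    words = [w for w in words if w not in STOPWORDS]
    return [crude_stem(w) for w in words]

def build_idf(docs):
    """Map each term to log(N / number of documents containing it)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

corpus = ["<p>The lobes of the brain</p>",
          "<p>Brain imaging artifacts</p>",
          "<p>Plant roots in soil</p>"]
docs = [preprocess(raw) for raw in corpus]
idf = build_idf(docs)
print(docs[0], round(idf["brain"], 3), round(idf["lobe"], 3))
```

Terms that occur in fewer documents ("lobe") get a higher weight than terms that occur in many ("brain"), which is the point of the IDF part of TF-IDF.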


Page 21: Semantic annotation, clustering and visualization

Method II: Text Comparison

Convert both texts to vector space using the dictionary, compute similarity.

Return most similar concepts.


"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."

1. Frontotemporal Dementia
2. Parietal Lobe
3. Area of Broca
4. Anterior Cranial Fossa
5. Brain Lobectomy
6. Anterior Parietal Artery
7. Mammary Gland
8. Frontal Lobe
9. Interlobar
10. Lobar
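
A minimal sketch of the comparison step, assuming texts that are already tokenized and stemmed and a handful of made-up IDF weights. Cosine similarity is used here as the similarity measure; the slides do not name the measure, so that choice is an assumption, as are all names and numbers in the example.

```python
import math
from collections import Counter

def tfidf(words, idf):
    """Sparse TF-IDF vector: term count times IDF; unknown terms are dropped."""
    counts = Counter(words)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up IDF weights; in the slides they come from the corpus dictionary.
idf = {"brain": 0.5, "lobe": 1.0, "frontal": 1.5,
       "parietal": 2.0, "gland": 2.0, "mammary": 2.5}

entry = ["brain", "frontal", "lobe", "parietal", "lobe"]
definitions = {
    "Parietal Lobe": ["lobe", "parietal", "frontal", "lobe"],
    "Area of Broca": ["brain", "frontal", "lobe"],
    "Mammary Gland": ["mammary", "gland"],
}

entry_vec = tfidf(entry, idf)
scores = {name: cosine(entry_vec, tfidf(words, idf))
          for name, words in definitions.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A definition can score well without sharing the exact concept name with the entry, which is how this method can surface concepts that do not occur literally in the text.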

Page 22: Semantic annotation, clustering and visualization

Method II: Text Comparison

Different cut-off rules:

1. Anything over x% similar

2. 5 most similar

3. 10 most similar

4. 20% most similar

5. 10% most similar
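
The cut-off rules could be expressed as three small filters over a ranked list. This is a sketch under assumptions: the similarity scores are invented, and "20% most similar" is read here as the top 20% of the ranked list, which is one possible interpretation. The function names are hypothetical.

```python
import math

def over_threshold(scored, threshold):
    """Rule 1: keep everything with a similarity above the threshold (0..1)."""
    return [name for name, score in scored if score > threshold]

def top_n(scored, n):
    """Rules 2 and 3: keep the n most similar concepts."""
    return [name for name, _ in scored[:n]]

def top_fraction(scored, fraction):
    """Rules 4 and 5: keep the top share of the ranked list, rounded up."""
    return top_n(scored, math.ceil(len(scored) * fraction))

# Invented scores, sorted from most to least similar.
scored = [("Frontotemporal Dementia", 0.61), ("Parietal Lobe", 0.55),
          ("Area of Broca", 0.48), ("Anterior Cranial Fossa", 0.30),
          ("Brain Lobectomy", 0.22), ("Anterior Parietal Artery", 0.20),
          ("Mammary Gland", 0.15), ("Frontal Lobe", 0.12),
          ("Interlobar", 0.08), ("Lobar", 0.05)]

print(over_threshold(scored, 0.5))
print(top_n(scored, 5))
print(top_fraction(scored, 0.2))
```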


Page 23: Semantic annotation, clustering and visualization

Result

Long list of (linked) concepts

Relevancy?


Page 24: Semantic annotation, clustering and visualization

Find clusters

Measure semantic similarity between concepts

- Shortest paths
- Shared parents
- Node's 'depth'
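
The three ingredients listed above can be illustrated on a toy is-a hierarchy. This sketch assumes every concept has exactly one parent and that there is a single root; real ontologies such as the NCI Thesaurus can have several parents per concept, which would need a graph search instead. The hierarchy and all function names are hypothetical.

```python
# Hypothetical is-a hierarchy (child -> parent); not taken from the NCI Thesaurus.
PARENT = {
    "Frontal Lobe": "Brain Lobe",
    "Parietal Lobe": "Brain Lobe",
    "Brain Lobe": "Brain",
    "Brain": "Organ",
    "Mammary Gland": "Organ",
}

def ancestors(node):
    """The node itself followed by its chain of parents up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    """Number of is-a steps from the node up to the root."""
    return len(ancestors(node)) - 1

def shared_parents(a, b):
    """Ancestors common to both nodes, nearest first."""
    b_chain = set(ancestors(b))
    return [n for n in ancestors(a) if n in b_chain]

def path_length(a, b):
    """Shortest path in a single-rooted tree: up to the nearest shared parent, then down."""
    nearest = shared_parents(a, b)[0]
    return ancestors(a).index(nearest) + ancestors(b).index(nearest)

print(depth("Frontal Lobe"))
print(shared_parents("Frontal Lobe", "Parietal Lobe"))
print(path_length("Frontal Lobe", "Mammary Gland"))
```

Concepts that are close together (short path, deep shared parent) would be candidates for the same cluster.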


Page 25: Semantic annotation, clustering and visualization


Page 26: Semantic annotation, clustering and visualization

To do

Get data!

Analyse algorithms