A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain. 7th December 2010. Elma Akand*, Mike Bain, Mark Temple; *CSE, UNSW/School of Biomedical and Health Sciences, UWS. The Sixth Australasian Ontology Workshop, Adelaide University of South Australia


Page 1: Elma  Akand *, Mike Bain, Mark Temple *CSE,  UNSW/School  of Biomedical and Health  Sciences,UWS

A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain

7th December 2010

Elma Akand*, Mike Bain, Mark Temple

*CSE, UNSW/School of Biomedical and Health Sciences, UWS

The Sixth Australasian Ontology Workshop, Adelaide University of South Australia

Page 2: Outline

Machine learning and data mining in bioinformatics

Domain ontologies in biomedical applications

Formal Concept Analysis

MCW algorithm (Mining Closed itemsets for Web apps)

BioLattice: a web-based browser

Experimental application: systems biology
- Part 1: Concept ranking by gene interaction
- Part 2: Relational learning of multiple-stress rules

Page 3: Machine learning & Data mining in Bioinformatics

Bioinformatics

"Bioinformatics is the study of information content and information flow in biological systems and processes" (Michael Liebman, 1995)

Machine learning & data mining
- Can offer automatic knowledge acquisition
- Process to discover knowledge by analyzing data from different perspectives; can contribute greatly to building a knowledge base

Our work: focus on knowledge-based machine learning
- Previous work: learning from ontologies
- Current work: ontology construction by learning
- Potential application areas: ontologies are central to eCommerce and eHealth
- Current application area: systems biology (predict gene function, data integration)

Page 4: Ontology

In philosophy: concerned with the nature and relations of being

In knowledge representation: study of the categorization of things

[Diagram: ontologies arranged along two axes. Informal ontology (natural language) versus formal ontology (first-order logic or a variant); upper ontology (general) versus domain ontology (specific).]

Ontology: "specification of a conceptualization" (Gruber, 1993)

Conceptualization: "formalization of knowledge in declarative form" (Genesereth and Nilsson, 1987)

Page 5: Gene Ontology

Missing concepts and relations

One gene annotated with different GO terms, with one term a specialization of the other

[Diagram: nodes a, b, x and y]

gene: x
concepts: a, b
relations: (i) x to a, (ii) x to b and (iii) b to a

Page 6: Formal Concept Analysis (FCA)

Mathematical order theory (Rudolf Wille in the early 80s)
- Derives conceptual structures out of data
- Method for data analysis, knowledge representation and information management

Components
- Formal context, concept, concept lattice

          | four-legged | hair-covered | intelligent | marine | thumbed
cats      | x           | x            |             |        |
dogs      | x           | x            |             |        |
dolphins  |             |              | x           | x      |
gibbons   |             | x            | x           |        | x
humans    |             |              | x           |        | x
whales    |             |              | x           | x      |

Page 7: Formal concepts in a concept lattice

({cats, gibbons, dogs, dolphins, humans, whales}, {})

({gibbons, dolphins, humans, whales}, {intelligent})

({dolphins, whales}, {intelligent, marine})

({cats, gibbons, dogs}, {hair-covered})

({cats, dogs}, {hair-covered, four-legged})

({gibbons, humans}, {intelligent, thumbed})

({gibbons}, {intelligent, hair-covered, thumbed})

({}, {intelligent, hair-covered, thumbed, marine, four-legged})

[Lattice diagram: the eight concepts drawn as nodes labelled Top, 1 to 6, and Bottom]

Formal context: an n by m Boolean matrix, with n objects O as rows and m attributes A as columns

Formal concept: Galois connection <X, Y>, where X is a subset of A and Y is a subset of O

Concept lattice loosely interpretable in ontology terms:
- concept definitions and sub-concept relations: cf. T-box
- concept membership by objects: cf. A-box
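The eight concepts listed on this slide can be reproduced from the animal context with the two derivation operators of the Galois connection. The following is a minimal brute-force sketch, not the method used in the talk; the names `CONTEXT`, `common_attributes`, `common_objects` and `formal_concepts` are illustrative assumptions.

```python
from itertools import combinations

# Formal context from the animal example: object -> set of attributes.
CONTEXT = {
    "cats": {"four-legged", "hair-covered"},
    "dogs": {"four-legged", "hair-covered"},
    "dolphins": {"intelligent", "marine"},
    "gibbons": {"hair-covered", "intelligent", "thumbed"},
    "humans": {"intelligent", "thumbed"},
    "whales": {"intelligent", "marine"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def common_attributes(objects):
    """Attributes shared by every object (all attributes if `objects` is empty)."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)


def common_objects(attributes):
    """Objects that carry every attribute in `attributes`."""
    wanted = set(attributes)
    return frozenset(o for o, attrs in CONTEXT.items() if wanted <= attrs)


def formal_concepts():
    """Close every attribute subset; duplicates collapse in the result set."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for subset in combinations(sorted(ATTRIBUTES), size):
            extent = common_objects(subset)
            intent = common_attributes(extent)
            concepts.add((extent, intent))
    return concepts
```

Enumerating all attribute subsets is exponential in the number of attributes, which is fine for five attributes but is exactly the cost that closed itemset miners are designed to avoid.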

Page 8: FCA in data mining

FCA can be seen as a clustering technique in machine learning
- Most of the work is in a propositional framework

In data mining, closed itemset mining is an efficient alternative to FCA

A frequent itemset X is closed if there exists no proper superset Y ⊃ X with support(Y) = support(X)

E.g., if X = {a,b,c,d} and Y = {a,b,c,d,e} and support(Y) = support(X), then X is not closed

Parameters to avoid building the entire lattice
- Extent size must be greater than minsup

Existing closed itemset mining algorithms
- Data structures to speed up closed itemset mining
- But may not build the lattice, or include extents
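The closedness definition above can be checked directly by brute force. The sketch below does so on the six transactions implied by the tidsets shown on the MCW slides. It is illustrative only: the function names are assumptions, and the threshold is applied as support >= minsup, a common convention that may differ from the "greater than minsup" wording on the slide.

```python
from itertools import combinations

# Horizontal view of the six transactions behind the MCW example tidsets.
TRANSACTIONS = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
ITEMS = sorted(set().union(*TRANSACTIONS.values()))


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in TRANSACTIONS.values() if wanted <= items)


def is_closed(itemset):
    """Closed means every single-item extension strictly lowers the support."""
    base = support(itemset)
    return all(
        support(set(itemset) | {item}) < base
        for item in ITEMS
        if item not in itemset
    )


def frequent_closed_itemsets(minsup):
    """Map each closed itemset with support >= minsup to its support."""
    result = {}
    for size in range(1, len(ITEMS) + 1):
        for candidate in combinations(ITEMS, size):
            count = support(candidate)
            if count >= minsup and is_closed(candidate):
                result[frozenset(candidate)] = count
    return result
```

Testing only single-item extensions is sufficient: if some larger superset kept the same support, so would the extension by any one of its extra items.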

Page 9: MCW algorithm (Mining Closed itemsets for Web apps)

Vertical data format

IT-tree (itemset-tidset tree) search space
- node has X × t(X) and all children have prefix X

Pruning
- 4 set difference closure operators

Subsumption check
- A look-up table to record all attributes and their occurrences in closed concepts

Lattice
- adding concepts following a general to specific order

item | tidset
D    | 2 4 5 6
A    | 1 3 4 5
C    | 1 2 3 4 5 6
T    | 1 3 5 6
W    | 1 2 3 4 5

attribute | Concept_id
D         | C1,C2
T         | C3,C4
A         | C4,C5
W         | C2,C4,C5,C6
C         | C1,C2,C3,C4,C5,C6,C7

Is {TA}{135} closed? i(135) = {TAWC}
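The closedness question on this slide can be worked through in the vertical format with two small helpers: one mapping an itemset to its tidset, and one mapping a tidset back to the items common to those transactions. This is a minimal sketch; `tidset_of`, `items_of` and `is_closed` are hypothetical names, not part of MCW.

```python
# Vertical data format: each item maps to the ids of the transactions containing it.
TIDSETS = {
    "D": {2, 4, 5, 6},
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def tidset_of(itemset):
    """t(X): intersection of the tidsets of the items in a non-empty itemset."""
    return set.intersection(*(TIDSETS[item] for item in itemset))


def items_of(tidset):
    """i(T): items that occur in every transaction listed in `tidset`."""
    wanted = set(tidset)
    return {item for item, tids in TIDSETS.items() if wanted <= tids}


def is_closed(itemset):
    """X is closed exactly when i(t(X)) gives back X."""
    return items_of(tidset_of(itemset)) == set(itemset)
```

With these definitions `tidset_of("TA")` is {1, 3, 5} and `items_of({1, 3, 5})` is {T, A, W, C}, so {TA} is not closed, while {TAWC} is.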

Page 10: Closure operators

{TA}{135} = {TW}{135} -> {TAW}{135}

{D}{2456} ⊂ {C}{123456} -> {DC}{2456}

{D}{2456} and {W}{12345} -> {DW}{245}

item | tidset
D    | 2 4 5 6
A    | 1 3 4 5
C    | 1 2 3 4 5 6
T    | 1 3 5 6
W    | 1 2 3 4 5

Based on CHARM (Zaki, 2005)
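The three examples on this slide follow the tidset comparisons that CHARM uses when it combines two itemset-tidset pairs: equal tidsets, one tidset strictly contained in the other (in either direction), or incomparable tidsets. The sketch below only classifies a pair and builds the combined node. Whether MCW applies these cases in exactly this form is an assumption, and `combine` is a hypothetical helper name.

```python
def combine(items1, tids1, items2, tids2):
    """Compare two IT-pairs and return (case, merged itemset, merged tidset)."""
    items1, items2 = set(items1), set(items2)
    tids1, tids2 = set(tids1), set(tids2)
    merged_items = items1 | items2
    merged_tids = tids1 & tids2
    if tids1 == tids2:
        case = "equal"         # both itemsets always co-occur: merge them
    elif tids1 < tids2:
        case = "subset"        # items1 never occurs without items2
    elif tids1 > tids2:
        case = "superset"      # items2 never occurs without items1
    else:
        case = "incomparable"  # neither implies the other: new candidate node
    return case, merged_items, merged_tids
```

Applied to the slide's pairs, {TA} with {TW} falls in the "equal" case and yields {TAW}{135}; {D} with {C} falls in the "subset" case and yields {DC}{2456}; {D} with {W} is "incomparable" and yields {DW}{245}.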

Page 11: Concept lattice as a visual analytics approach

Visual analytics
- combination of information visualization with machine learning and data analysis (Keim et al., 2008)

Visualization of concept lattice
- provides overview of the structure of the domain
- means for further data analysis, e.g., classification, clustering, implication discovery, rule learning

Previous work
- lattice navigation since Godin et al. (1993)
- browsable concept lattice, e.g., Kim & Compton (2004)

Our current work
- augmenting the concept lattice by integrating multiple sources of knowledge (Gene Ontology, protein interactions) for further analysis & machine learning

Page 12: Case study: Yeast systems biology

Page 13: Browsable concept lattice

[Figure: the browsable lattice, with one direction labelled "more general"]

Page 14: Biological validation (1): synthetic lethality

Synthetic lethal interaction: the cell is viable when either gene A or gene B is individually deleted, but cannot grow when both are deleted.

Our results show that 72 (119) concepts in the lattice are more likely than random chance at p < 0.01 (p < 0.05) to contain synthetic lethal pairs.
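The slide does not say which statistical test produced these p-values. As one plausible reading, the sketch below computes a hypergeometric tail probability: the chance that a concept covering a given number of gene pairs would contain at least the observed number of synthetic lethal pairs if its pairs were drawn at random from all pairs. The choice of test, the function name and its parameters are assumptions, not taken from the talk.

```python
from math import comb


def hypergeom_tail(total_pairs, lethal_pairs, concept_pairs, observed):
    """P(X >= observed) when concept_pairs are drawn without replacement
    from total_pairs, of which lethal_pairs are synthetic lethal."""
    upper = min(lethal_pairs, concept_pairs)
    favourable = sum(
        comb(lethal_pairs, k) * comb(total_pairs - lethal_pairs, concept_pairs - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(total_pairs, concept_pairs)
```

With made-up numbers, `hypergeom_tail(10, 3, 4, 3)` is 7/210, about 0.033; a concept would then count as enriched at p < 0.05 but not at p < 0.01.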

Page 15: Biological validation (2): ILP learning of concept definitions

Protein-protein interaction data

Microarray gene-expression data

Transcription factor binding data (ChIP-chip)

Ontology data

Biochemical pathway data

Inductive Logic Programming

First-order rule:

concept(A) :- ppi(B,A,C), ppi(B,A,E), ppi(B,C,E), tfbinds(D,C), tfbinds(F,E).

Page 16: Example rule

RSM19 is required for H2O2 response; RSM19, RSM22 and MRPS17 are in the "mitochondrial ribosomal small subunit" stable complex; and RSM22 and MRPS17 are bound by transcription factors under amino acid starvation.

[Diagram label: Transcription factors]
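To make the reading of the first-order rule concrete, the sketch below checks its body against a handful of toy facts built from this example. The argument meanings, ppi(complex, gene1, gene2) and tfbinds(factor, gene), are an assumption inferred from the example; the transcription factor names `tf_1` and `tf_2` are hypothetical placeholders; and the Python encoding is only an illustration, not the ILP system used in the talk.

```python
from itertools import product

# Toy facts. Argument order is an assumption: ppi(complex, gene1, gene2).
COMPLEX = "mitochondrial ribosomal small subunit"
PPI = {
    (COMPLEX, "RSM19", "RSM22"),
    (COMPLEX, "RSM19", "MRPS17"),
    (COMPLEX, "RSM22", "MRPS17"),
}
# tfbinds(factor, gene); the factor names are hypothetical placeholders.
TFBINDS = {("tf_1", "RSM22"), ("tf_2", "MRPS17")}


def covers(gene):
    """True if the rule body holds for A = gene under some binding of B..F:
    ppi(B,A,C), ppi(B,A,E), ppi(B,C,E), tfbinds(D,C), tfbinds(F,E)."""
    bound_genes = {target for _, target in TFBINDS}
    for (b1, a1, c), (b2, a2, e) in product(PPI, repeat=2):
        if a1 != gene or a2 != gene or b1 != b2:
            continue
        if (b1, c, e) in PPI and c in bound_genes and e in bound_genes:
            return True
    return False
```

On these facts `covers("RSM19")` holds with C = RSM22 and E = MRPS17, while `covers("RSM22")` does not, because no third interaction closes the triangle.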

Page 17: Conclusions

Many real-world domains are data-intensive

Machine learning and data mining applications are required to generate predictive and useful outputs

We focus on knowledge-based learning for comprehensibility, using ontologies

Formal concept analysis as a framework for ontology structure

Use data mining techniques for efficient concept lattice generation

Visual analytics approach: browsable lattice, added background knowledge

Initial validation on a case study from yeast systems biology

Page 18: Future work

Investigate pseudo-intents to simplify the concept lattice

Investigate variants of concept lattice structures
- e.g., concept lattice of the inverse context

Add concept definitions to background knowledge in ILP