Clustering of yeast genes using Literature Mining. CS 6910 Project. Advisor: Dr. Venu Dasigi. Student: Maitreyee Bhise


Page 1: Clustering of yeast genes using Literature Mining

Clustering of yeast genes using Literature Mining

CS 6910 Project

Advisor: Dr. Venu Dasigi

Student: Maitreyee Bhise

Page 2: Clustering of yeast genes using Literature Mining

Contents

• Text mining
• Overview of the project
• Weight Metrics
• Ohio Supercomputer (MEDLINE database)
• Implementation
• Results and Analysis
• Limitations
• Future Scope

Page 3: Clustering of yeast genes using Literature Mining

What is text mining?

• Provides a mechanism for handling large amounts of data

• Integrates all sources and adds meaning to them

Page 4: Clustering of yeast genes using Literature Mining

Background Set

• Reference set against which the query set is compared
• Dictionary of documents and their respective words of importance
• Restricted Background set (used in this project)
• Unrestricted Background set

• Restricted Background set: Only those documents that satisfy a condition

• Unrestricted Background set: All the documents

Page 5: Clustering of yeast genes using Literature Mining

Query Set and Background Set used

• Query Set: 44 tables, one for each gene
• Each table holds the abstracts that contain the gene name in the title
• Restricted Background set: union of the 44 gene tables
• Background set is a table with all keywords from the 44 tables
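The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `documents` list, the gene names chosen, and the helper names `build_query_sets` and `build_restricted_background` are hypothetical stand-ins, not the tables or code used in the project.

```python
# Hypothetical in-memory stand-in for MEDLINE records: (doc_id, title, abstract words).
documents = [
    (1, "Cwp1 localizes to the cell wall", ["cwp1", "cell", "wall", "mannoprotein"]),
    (2, "Regulation of Cln1 and Cln2 in G1", ["cln1", "cln2", "cyclin", "g1"]),
    (3, "Cln2 cyclin stability", ["cln2", "cyclin", "degradation"]),
    (4, "Unrelated study of mouse liver", ["mouse", "liver"]),
]
genes = ["Cwp1", "Cln1", "Cln2"]


def build_query_sets(documents, genes):
    """Map each gene to the documents whose title contains the gene name."""
    query_sets = {}
    for gene in genes:
        needle = gene.lower()
        query_sets[gene] = {
            doc_id: words
            for doc_id, title, words in documents
            if needle in title.lower().split()
        }
    return query_sets


def build_restricted_background(query_sets):
    """Union of all gene tables: every document that appears in any query set."""
    background = {}
    for table in query_sets.values():
        background.update(table)
    return background


query_sets = build_query_sets(documents, genes)
background = build_restricted_background(query_sets)
print(sorted(background))
```

Document 4 never enters the background set because its title names none of the genes, which is what makes the set "restricted".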

Page 6: Clustering of yeast genes using Literature Mining

Overview

• Restricted background set created using MEDLINE (collection of biological documents)

• The background set is the reference against which a word's frequency in the query set is compared

• Keywords for 44 genes are extracted, analyzed and clustered using weight metrics

• Frequency of each keyword in a query set is compared with background set

• Gene characteristics are discovered on the basis of computed weights

Page 7: Clustering of yeast genes using Literature Mining

Overview (cont'd)

• The entire MEDLINE collection had been preprocessed in earlier work, and that preprocessed data is used here

• This project uses a different background set
• Implementation on the Ohio Supercomputer

Page 8: Clustering of yeast genes using Literature Mining

MEDLINE

• MEDLINE is a collection of biological documents provided by the National Library of Medicine (NLM)

• Consists of 23 million abstracts
• Data provided in XML format
• XML data parsed and loaded onto the Ohio Supercomputer
• Oakley is the newly built server, an HP Intel Xeon machine with 8,300+ cores

Page 9: Clustering of yeast genes using Literature Mining

Portable Batch Script

• #PBS -l walltime=hh:mm:ss (running time required for the job)

• #PBS -l nodes=2:ppn=12 (number of nodes and processors per node required)

• #PBS -m abe (emails the user when the job aborts/begins/ends)

• To submit the batch job: qsub Job_Name

Page 10: Clustering of yeast genes using Literature Mining

MEDLINE Table Structure

• 3 tables:
  • N_Word
  • N_Document
  • N_WordDocument

• This project uses only N_Document and N_WordDocument tables

Page 11: Clustering of yeast genes using Literature Mining

MEDLINE Table Structure (cont'd)

Page 12: Clustering of yeast genes using Literature Mining

Weight Metrics

• Two statistical parameters:
  • Z-Score
  • TF-IDF (Term Frequency-Inverse Document Frequency)
• Calculate the frequency of a keyword in the query set in comparison with some reference set
• Help to discriminate the high-information-content words of a document

Page 13: Clustering of yeast genes using Literature Mining

Stop Words Removal

• Words which are commonly used but don't add meaning

• Used the stop word list provided by PubMed

• A separate table was created to store the stop word list

• Stop words are removed using a simple join with this table
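A join-based removal like the one described can be sketched with Python's built-in `sqlite3` module. The slides do not show the actual SQL, so the table names `gene_keywords` and `stop_words`, their columns, and the choice of a LEFT JOIN that keeps unmatched rows (an anti-join) are all assumptions made for illustration.

```python
import sqlite3

# In-memory database with hypothetical table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene_keywords (word TEXT, doc_id INTEGER);
    CREATE TABLE stop_words (word TEXT PRIMARY KEY);
    INSERT INTO gene_keywords VALUES
        ('the', 1), ('cell', 1), ('wall', 1), ('of', 2), ('mannoprotein', 2);
    INSERT INTO stop_words VALUES ('the'), ('of'), ('and');
""")

# Join each keyword against the stop-word table and keep only the rows
# that found no match, i.e. the words that are not stop words.
rows = conn.execute("""
    SELECT k.word, k.doc_id
    FROM gene_keywords AS k
    LEFT JOIN stop_words AS s ON s.word = k.word
    WHERE s.word IS NULL
    ORDER BY k.doc_id, k.word
""").fetchall()
print(rows)
conn.close()
```

Keeping the stop words in their own table means the list can be swapped or extended without touching the keyword data.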

Page 14: Clustering of yeast genes using Literature Mining

Z-Score

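• Z-Score of a word a in a gene n:

$$Z_a^n = \frac{F_a^n - \bar{F}_a}{\sigma_a}$$

where

$$\sigma_a = \sqrt{\frac{\sum_{m=1}^{N}\left(F_a^m - \bar{F}_a\right)^2}{N}}$$

$F_a^n$ = document frequency of word a for gene n
$\bar{F}_a$ = mean frequency of word a = sum of $F_a^m$ over all genes, divided by N
$\sigma_a$ = standard deviation of word a
N = number of genes (or sets of grouped documents)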

Page 15: Clustering of yeast genes using Literature Mining

Z-Score (cont'd) (Document Frequency calculation)

• Captures the strength of a keyword in a collection set

• Emphasis on distribution of the keyword across genes

• It is the number of documents that contain the word

• Calculated with respect to a gene (or related group of documents)

Page 16: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd) (Document Frequency calculation)

Page 17: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd) (Mean Frequency calculation)

• For each keyword, the sum of its document frequencies across all genes, divided by the total number of genes.

Page 18: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd) (Numerator calculation)

Page 19: Clustering of yeast genes using Literature Mining

Sum of squares of the numerator for gene Cwp1

Page 20: Clustering of yeast genes using Literature Mining

Z-Score (cont’d) (Standard Deviation Calculation)

• Tells the deviation from the normal (mean) value
• The smaller the standard deviation, the more likely the word is a high-information-content word

• For a smaller standard deviation, the z-score will be higher
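The steps on the preceding slides (document frequency, mean, standard deviation, z-score) can be put together in a short Python sketch. It assumes the z-score is the document frequency minus the mean, divided by a population standard deviation (dividing by the number of genes N); the function name `z_scores`, the input layout and the toy numbers are hypothetical.

```python
import math


def z_scores(doc_freq):
    """Compute z-scores from per-gene document frequencies.

    doc_freq: {gene: {word: number of the gene's documents containing word}}
    Returns {gene: {word: z-score}} for the words that occur in that gene.
    Assumption: the standard deviation divides by N (population form).
    """
    genes = list(doc_freq)
    n_genes = len(genes)
    vocabulary = set().union(*(doc_freq[g].keys() for g in genes))
    scores = {g: {} for g in genes}
    for word in vocabulary:
        # A gene that never mentions the word contributes a frequency of 0.
        freqs = [doc_freq[g].get(word, 0) for g in genes]
        mean = sum(freqs) / n_genes
        sigma = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n_genes)
        if sigma == 0:
            continue  # identical frequency in every gene: no z-score defined
        for gene, freq in zip(genes, freqs):
            if freq > 0:
                scores[gene][word] = (freq - mean) / sigma
    return scores


# Toy document frequencies for four genes (invented numbers).
doc_freq = {
    "Cwp1": {"mannoprotein": 3, "cell": 5},
    "Cln1": {"cyclin": 4, "cell": 5},
    "Cln2": {"cyclin": 6, "cell": 5},
    "Ste2": {"pheromone": 1, "cell": 4},
}
scores = z_scores(doc_freq)
for word, z in sorted(scores["Cwp1"].items()):
    print(word, round(z, 3))
```

In this toy data "mannoprotein" (unique to Cwp1) scores far higher than "cell", which occurs in every gene with nearly the same frequency.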

Page 21: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd) (Standard Deviation Calculation)

Page 22: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd)

Page 23: Clustering of yeast genes using Literature Mining

TF-IGF Calculations

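• TF-IGF of a word a in gene n:

$$\mathrm{TFIGF}_a^n = \mathrm{TF}_a^n \times \mathrm{IGF}_a$$

where $\mathrm{TF}_a^n = \sum_{d} tf_a^d$ (term frequency of a, summed over the documents d of gene n)

and $\mathrm{IGF}_a = \log\frac{N}{\mathrm{GF}_a}$, where $\mathrm{GF}_a$ is the number of genes that contain the word a and N is the number of genes

• Emphasis on the importance of a keyword within a gene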
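A minimal Python sketch of this weighting follows. It assumes the usual inverse-frequency form IGF = log(N / GF) with a natural logarithm (the base is not stated in the slides); the function name `tf_igf`, the input layout and the toy counts are hypothetical.

```python
import math


def tf_igf(term_counts):
    """Compute TF-IGF weights.

    term_counts: {gene: [ {word: count in one document}, ... ]}
    Returns {gene: {word: TF * IGF}}.
    Assumption: IGF = ln(N / GF), N = number of genes.
    """
    n_genes = len(term_counts)

    # TF: term frequency summed over all documents of the gene.
    tf = {}
    for gene, docs in term_counts.items():
        totals = {}
        for doc in docs:
            for word, count in doc.items():
                totals[word] = totals.get(word, 0) + count
        tf[gene] = totals

    # GF: number of genes whose documents contain the word at least once.
    gf = {}
    for totals in tf.values():
        for word in totals:
            gf[word] = gf.get(word, 0) + 1

    return {
        gene: {word: total * math.log(n_genes / gf[word])
               for word, total in totals.items()}
        for gene, totals in tf.items()
    }


# Toy counts for three genes (invented numbers).
term_counts = {
    "Cwp1": [{"mannoprotein": 2, "cell": 1}, {"mannoprotein": 1, "cell": 3}],
    "Cln1": [{"cyclin": 2, "cell": 1}],
    "Cln2": [{"cyclin": 4, "cell": 2}],
}
weights = tf_igf(term_counts)
for word, w in sorted(weights["Cwp1"].items()):
    print(word, round(w, 3))
```

A word found in every gene ("cell" here) gets IGF = log 1 = 0 and drops out, while a word confined to one gene keeps its full term frequency scaled by log N.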

Page 24: Clustering of yeast genes using Literature Mining

TF-IGF for gene “Cwp1” (cont’d) (Term Frequency Calculations)

Page 25: Clustering of yeast genes using Literature Mining

TF-IGF (cont’d) (Group Frequency Calculations for gene “Cwp1”)

Page 26: Clustering of yeast genes using Literature Mining

TF-IGF (cont’d) (Inverse Group Frequency Calculations for gene “Cwp1”)

Page 27: Clustering of yeast genes using Literature Mining

TF-IGF for gene “Cwp1” (cont’d)

Page 28: Clustering of yeast genes using Literature Mining

Results

• Determine high-information content words for a gene

• The higher the z-score of a keyword in a gene, the more uniquely it describes the functionality of the gene

• The higher the TF-IGF value of a keyword in a gene, the more unique it is to that gene compared with other genes

• TF-IGF yields better-quality keywords because it filters out unwanted keywords

Page 29: Clustering of yeast genes using Literature Mining

Top keywords for gene “Cwp1” using z-score

• Irrespective of document frequency, the top 75 of 1612 keywords share the same high z-score, 6.245

• Top 75 keywords are unique to Cwp1
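A short derivation suggests why keywords unique to one gene all receive the same z-score. It assumes the standard deviation is the population form (dividing by the number of genes N); that normalization is an assumption here. Let word a occur in d documents of gene n and in no document of the other N − 1 genes:

```latex
\bar{F}_a = \frac{d}{N}
\qquad
\sigma_a = \sqrt{\frac{\left(d - \frac{d}{N}\right)^2 + (N-1)\left(\frac{d}{N}\right)^2}{N}}
         = \frac{d\sqrt{N-1}}{N}
\qquad
Z_a^n = \frac{d - \frac{d}{N}}{\sigma_a}
      = \frac{d\,(N-1)/N}{d\sqrt{N-1}/N}
      = \sqrt{N-1}
```

The document frequency d cancels, so every unique keyword gets the z-score √(N − 1) whatever its frequency. Under this assumption the value 6.245 equals √39, which would correspond to N = 40, the number of genes listed in the TF-IGF grouping.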

Page 30: Clustering of yeast genes using Literature Mining

Limitations

• Some types of parsing errors result in false positives

Page 31: Clustering of yeast genes using Literature Mining

Top keywords for gene “Cwp1” using TF-IGF

• Better quality keywords obtained from TF-IGF

Page 32: Clustering of yeast genes using Literature Mining

Cluster 3.0 Clustering Software

• Used Cluster 3.0, open source clustering software designed specifically for gene expression data analysis

• Cluster was originally developed at Stanford University
• Runs on Windows/Mac/Linux
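To feed keyword weights to a tool such as Cluster 3.0, the per-gene weights have to be laid out as a gene-by-keyword matrix. The sketch below writes a tab-delimited file with one row per gene and one column per keyword. The layout (an identifier column followed by data columns, with a header row) is an assumption based on the general tab-delimited format Cluster 3.0 reads and should be checked against its manual; the function name `write_cluster_input`, the header label `GENE` and the choice to fill absent keywords with 0.0 are hypothetical.

```python
import csv
import io


def write_cluster_input(weights, out):
    """Write {gene: {keyword: weight}} as a tab-delimited gene x keyword matrix.

    Assumptions: the first column holds gene names, the first row holds
    keyword names, and a keyword absent from a gene gets weight 0.0.
    """
    genes = sorted(weights)
    keywords = sorted({word for gene in genes for word in weights[gene]})
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["GENE"] + keywords)
    for gene in genes:
        writer.writerow([gene] + [weights[gene].get(word, 0.0) for word in keywords])


# Toy weights for two genes (invented numbers).
toy_weights = {
    "Cwp1": {"mannoprotein": 3.3},
    "Cln1": {"cyclin": 0.8},
}
buffer = io.StringIO()
write_cluster_input(toy_weights, buffer)
print(buffer.getvalue())
```

Passing an open file object instead of the `io.StringIO` buffer writes the same matrix to disk.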

Page 33: Clustering of yeast genes using Literature Mining

Yeast Genes Grouped using Z-Score

Cluster  Genes
1        Gic2, Rad27, Dun1, Tel2, Cdc20, Far1
2        Cln1, Cln2, Cdc6
3        Gic1, Ace2, Mcm3
4        Exg1, Htb2, Cts1
5        Mcm2
6        Mnn1, Och1, Hho1, Mcm6
7        Msb2, Rsr1, Bud9, Kre6, Cwp1, Clb5, Clb6, Rnr1, Cdc21, Cdc45, Htb1, Hta1, Hta2, Hht1, Tem1
8        Rad51
9        Ste2

Page 34: Clustering of yeast genes using Literature Mining

Yeast Genes Grouped using TF-IGF

Cluster  Genes
1        Cdc20
2        Cln1, Cln2
3        Swi5, Ace2
4        Cdc6, Mcm3, Mcm6, Cdc46
5        Mcm2
6        Cdc45
7        Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1, Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Htb1, Htb2, Hta1, Hta2, Hho1, Hht1, Tel2, Tem1, Clb2, Cts1, Gic1, Gic2
8        Rad51
9        Ste2, Far1

Page 35: Clustering of yeast genes using Literature Mining

Analysis

• The z-score is independent of document frequency only when the word is unique to the gene

[Tables: Cln1 gene, Cln2 gene]

Page 36: Clustering of yeast genes using Literature Mining

Analysis (cont'd.)

• Words found with low frequency but with high z-score

Page 37: Clustering of yeast genes using Literature Mining

Analysis (cont'd.)

• Words with high frequency have low z-scores

Page 38: Clustering of yeast genes using Literature Mining

Future Scope

• A new background set can be used. Examples:
  • Abstracts that contain the gene name and its related words in the title
  • Unrestricted Background set

• Applying a stemming algorithm to the MEDLINE database

• Concepts of Latent Semantic Metrics can be applied by preserving the order of words

Page 39: Clustering of yeast genes using Literature Mining

Special Thanks To…

Dr. Venu Dasigi
Dr. Vipa Phuntumart
Dr. Ray Kresman
Pukar Hamal