Clustering of yeast genes using Literature Mining
CS 6910 Project
Advisor: Dr. Venu Dasigi
Student: Maitreyee Bhise
Contents
• Text mining
• Overview of the project
• Weight Metrics
• Ohio Supercomputer (MEDLINE database)
• Implementation
• Results and Analysis
• Limitations
• Future Scope
What is text mining?
• Provides a mechanism to handle large amounts of data
• Integrates all sources and adds meaning to them
Background Set
• Reference set against which the query set is compared
• Dictionary of documents and their respective words of importance
• Restricted Background set (used in this project)
• Unrestricted Background set
• Restricted Background set: Only those documents that satisfy a condition
• Unrestricted Background set: All the documents
Query Set and Background Set used
• Query Set: 44 tables, one for each gene
• Each table holds the abstracts that contain the gene name in the title
• Restricted Background set: union of the 44 gene tables
• Background set is a table with all keywords from the 44 tables
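The union step can be sketched in a few lines. This is a minimal illustration, assuming each gene table is reduced to a mapping from keyword to count; the function name `build_background` and the toy counts are hypothetical and not taken from the project.

```python
from collections import Counter


def build_background(gene_tables):
    """Merge per-gene keyword counts into one background table (the union)."""
    background = Counter()
    for keyword_counts in gene_tables.values():
        # Counter.update adds counts, so shared keywords accumulate.
        background.update(keyword_counts)
    return background


# Toy data: hypothetical counts for two genes only.
gene_tables = {
    "Cwp1": Counter({"wall": 3, "cell": 5}),
    "Cln1": Counter({"cyclin": 4, "cell": 2}),
}
background = build_background(gene_tables)
print(sorted(background.items()))
```

Every keyword from every gene table appears in the background, which is what makes the set "restricted": nothing outside the query tables contributes.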
Overview
• Restricted background set created using MEDLINE (collection of biological documents)
• Used to compare frequency of a word in query set against background set
• 44 gene keywords extracted, analyzed and clustered using weight metrics
• Frequency of each keyword in a query set is compared with background set
• Gene characteristics are discovered on the basis of computed weights
Overview (cont'd)
• The entire MEDLINE collection had been preprocessed earlier; that preprocessed data was incorporated into this work
• Project uses a different background set
• Implementation on Ohio Supercomputer
MEDLINE
• MEDLINE is a collection of biological documents provided by the National Library of Medicine (NLM)
• Consists of 23 million abstracts
• Data provided in XML format
• XML data parsed and loaded on the Ohio Supercomputer
• Oakley is the newly built server: an HP Intel Xeon machine with 8,300+ cores
Portable Batch Script
• #PBS -l walltime=hh:mm:ss (running time required for the job)
• #PBS -l nodes=2:ppn=12 (number of nodes and processors per node required)
• #PBS -m abe (emails the user when the job aborts/begins/ends)
• To submit the batch job: qsub Job_Name
MEDLINE Table Structure
• 3 Tables:
  • N_Word
  • N_Document
  • N_WordDocument
• This project uses only N_Document and N_WordDocument tables
MEDLINE Table Structure (cont'd)
Weight Metrics
• Two statistical parameters:
  • Z-Score
  • TF-IDF (Term Frequency-Inverse Document Frequency)
• Calculates the frequency of a keyword in the query set in comparison with some reference set
• Helps to discriminate the high-information-content words of a document
Stop Words Removal
• Words which are commonly used but do not add meaning
• Used the stop word list provided by PubMed
• Separate table created to store the stop word list
• Stop words are removed using a simple join with this table
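One way to express stop-word removal as a join is an anti-join: keep only keyword rows that find no match in the stop-word table. The sketch below uses an in-memory SQLite database; the table names `gene_keywords` and `stop_words`, their columns, and the sample rows are all hypothetical, and the project's actual query is not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene_keywords (gene TEXT, word TEXT);  -- hypothetical table
    CREATE TABLE stop_words (word TEXT PRIMARY KEY);    -- hypothetical table
    INSERT INTO gene_keywords VALUES
        ('Cwp1', 'the'), ('Cwp1', 'wall'),
        ('Cwp1', 'of'), ('Cwp1', 'mannoprotein');
    INSERT INTO stop_words VALUES ('the'), ('of'), ('and');
""")

# LEFT JOIN + IS NULL keeps only rows with no match in the stop-word table.
rows = conn.execute("""
    SELECT k.gene, k.word
    FROM gene_keywords AS k
    LEFT JOIN stop_words AS s ON s.word = k.word
    WHERE s.word IS NULL
    ORDER BY k.word
""").fetchall()
print(rows)
```

Keeping the stop words in their own table means the list can be replaced or extended without touching the keyword data.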
Z-Score
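• Z-Score of a word ‘a’ in a gene ‘n’:
$Z_a^n = \dfrac{F_a^n - \bar{F}_a}{\sigma_a}$
Where
$\sigma_a = \sqrt{\dfrac{\sum_{n} \left(F_a^n - \bar{F}_a\right)^2}{N}}$
$F_a^n$ = document frequency of a word a for gene n
$\bar{F}_a$ = mean frequency of a word a = $\dfrac{1}{N}\sum_{n} F_a^n$, where the sum runs over all genes
$\sigma_a$ = standard deviation of word a
$N$ = number of genes (or sets of grouped documents)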
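A small sketch of the calculation for a single word may help. It assumes the z-score is the word's document frequency for a gene minus its mean frequency across genes, divided by the standard deviation across genes, and that the standard deviation divides by N (population form); the divisor is an assumption, as are the function name `z_scores` and the toy frequencies.

```python
import math


def z_scores(doc_freq):
    """doc_freq maps gene -> document frequency of ONE word for that gene.

    Returns gene -> z-score. Assumes a population standard deviation
    (division by N), which is an assumption of this sketch.
    """
    n = len(doc_freq)
    mean = sum(doc_freq.values()) / n
    variance = sum((f - mean) ** 2 for f in doc_freq.values()) / n
    sigma = math.sqrt(variance)
    if sigma == 0:
        # The word is equally frequent in every gene: no discriminating power.
        return {gene: 0.0 for gene in doc_freq}
    return {gene: (f - mean) / sigma for gene, f in doc_freq.items()}


# Toy frequencies (hypothetical): the word occurs only in geneA's documents.
freqs = {"geneA": 4, "geneB": 0, "geneC": 0, "geneD": 0}
scores = z_scores(freqs)
for gene, z in scores.items():
    print(gene, round(z, 3))
```

Genes that never mention the word get a negative score, so sorting by z-score within a gene brings that gene's distinctive words to the top.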
Z-Score (cont'd) (Document Frequency Calculation)
• Captures the strength of a keyword in a collection set
• Emphasis on distribution of the keyword across genes
• It is the number of documents that contain the word
• Calculated with respect to a gene (or related group of documents)
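Counting document frequency per gene can be sketched as follows. The input is assumed to be rows of (gene, document id, word), loosely modelled on a word-document table; the row layout, the function name `document_frequency`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def document_frequency(word_documents):
    """Count, per (gene, word), how many distinct documents contain the word.

    word_documents: iterable of (gene, doc_id, word) rows.
    """
    docs = defaultdict(set)
    for gene, doc_id, word in word_documents:
        # A set ignores repeated occurrences of a word within one document.
        docs[(gene, word)].add(doc_id)
    return {key: len(doc_ids) for key, doc_ids in docs.items()}


# Toy rows (hypothetical document ids and words).
rows = [
    ("Cwp1", 101, "wall"),
    ("Cwp1", 101, "wall"),   # repeated in the same abstract: counted once
    ("Cwp1", 102, "wall"),
    ("Cwp1", 102, "cell"),
    ("Cln1", 201, "cell"),
]
df = document_frequency(rows)
print(df[("Cwp1", "wall")])
```

Because the count is per document rather than per occurrence, a word repeated many times in one abstract contributes only once.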
Z-Score for gene Cwp1 (cont'd) (Document Frequency Calculation)
Z-Score for gene Cwp1 (cont'd) (Mean Frequency Calculation)
• Sum of the document frequencies of a keyword across the different genes, divided by the total number of genes.
Z-Score for gene Cwp1 (cont’d) (Numerator Calculation)
Sum of Squares of the Numerator for gene Cwp1
Z-Score (cont’d) (Standard Deviation Calculation)
• Measures deviation from the mean
• The lower the standard deviation, the more likely the word is a high-information-content word
• For a lower standard deviation, the z-score will be higher
Z-Score for gene Cwp1 (cont’d) (Standard Deviation Calculation)
Z-Score for gene Cwp1 (cont’d)
TF-IGF Calculations
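• TF-IGF of a word ‘a’ in gene ‘n’:
$\mathrm{TFIGF}_a^n = \mathrm{TF}_a^n \times \mathrm{IGF}_a$
Where $\mathrm{TF}_a^n = \sum_{d} tf_a^d$, summed over the documents $d$ of gene $n$
And $\mathrm{IGF}_a = \log\dfrac{N}{\mathrm{GF}_a}$, where $\mathrm{GF}_a$ is the number of genes that contain the word ‘a’ and $N$ is the number of genes
• Emphasis on importance of a keyword within a gene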
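The weighting can be sketched as below. It assumes TF is the total count of a word over a gene's documents and IGF is the logarithm of the number of genes divided by the number of genes containing the word; the natural logarithm, the function name `tf_igf`, and the toy counts are assumptions of this sketch.

```python
import math


def tf_igf(term_counts):
    """term_counts maps gene -> {word: total occurrences in that gene's documents}.

    Returns gene -> {word: TF * log(N / GF)}, using the natural logarithm
    (the log base is an assumption of this sketch).
    """
    n_genes = len(term_counts)
    # Group frequency: number of genes whose documents contain the word.
    group_freq = {}
    for counts in term_counts.values():
        for word in counts:
            group_freq[word] = group_freq.get(word, 0) + 1
    return {
        gene: {
            word: tf * math.log(n_genes / group_freq[word])
            for word, tf in counts.items()
        }
        for gene, counts in term_counts.items()
    }


# Toy counts (hypothetical): "cell" occurs in both genes, "wall" in one.
counts = {
    "geneA": {"wall": 3, "cell": 2},
    "geneB": {"cell": 5},
}
weights = tf_igf(counts)
print(weights["geneA"])
```

A word present in every gene gets weight zero regardless of how often it occurs, which is how this metric filters out common vocabulary.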
TF-IGF for gene “Cwp1” (cont’d) (Term Frequency Calculations)
TF-IGF (cont’d) (Group Frequency Calculations for gene “Cwp1”)
TF-IGF (cont’d) (Inverse Group Frequency Calculations for gene “Cwp1”)
TF-IGF for gene “Cwp1” (cont’d)
Results
• Determine high-information content words for a gene
• The higher the z-score of a keyword in a gene, the more specifically it describes the functionality of the gene
• The higher the TF-IGF value of a keyword in a gene, the more unique it is to that gene compared with other genes
• TF-IGF yields better-quality keywords because it filters out unwanted keywords
Top keywords for gene “Cwp1” using z-score
• Irrespective of their document frequency, the top 75 of 1612 keywords share the same high z-score of 6.245
• Top 75 keywords are unique to Cwp1
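A short derivation shows why keywords unique to one gene can share a single z-score whatever their document frequency. It assumes the z-score is (document frequency minus mean) divided by a population standard deviation over $N$ genes; the divisor $N$ is an assumption, and $f$ denotes the word's document frequency in the one gene that contains it.

```latex
\begin{align*}
\bar{F}_a &= \frac{f}{N}, \qquad
F_a^n - \bar{F}_a = \frac{f\,(N-1)}{N} \quad \text{(for the gene containing the word)} \\
\sum_{m}\left(F_a^m - \bar{F}_a\right)^2
  &= \frac{f^2 (N-1)^2}{N^2} + (N-1)\,\frac{f^2}{N^2}
   = \frac{f^2 (N-1)}{N} \\
\sigma_a &= \sqrt{\frac{1}{N}\cdot\frac{f^2 (N-1)}{N}} = \frac{f\sqrt{N-1}}{N} \\
Z_a^n &= \frac{f\,(N-1)/N}{f\sqrt{N-1}/N} = \sqrt{N-1}
\end{align*}
```

The frequency $f$ cancels, so under these assumptions every unique keyword receives $\sqrt{N-1}$. As a hedged observation, $\sqrt{39} \approx 6.245$, which would correspond to 40 gene groups entering the calculation; the slides do not state this, so it should be read as an inference only.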
Limitations
• Some types of parsing errors result in false positives
Top keywords for gene “Cwp1” using TF-IGF
• Better quality keywords obtained from TF-IGF
Cluster 3.0 Clustering Software
• Used Cluster 3.0, open-source clustering software designed specifically for gene expression data analysis
• Builds on the original Cluster program developed at Stanford University
• Runs on Windows/Mac/Linux
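Clustering tools of this kind generally take a matrix with one row per gene and one column per feature. The sketch below arranges per-gene keyword weights into a generic tab-delimited matrix; the `GENE` header, the function names, and the toy weights are hypothetical, and the exact column headers Cluster 3.0 expects are not given in the slides and should be taken from its manual.

```python
import csv
import io


def weight_matrix(weights):
    """weights maps gene -> {keyword: weight}. Returns (keywords, rows)."""
    keywords = sorted({word for per_gene in weights.values() for word in per_gene})
    rows = [
        [gene] + [per_gene.get(word, 0.0) for word in keywords]
        for gene, per_gene in sorted(weights.items())
    ]
    return keywords, rows


def to_tsv(weights):
    """Serialise the gene-by-keyword matrix as tab-delimited text."""
    keywords, rows = weight_matrix(weights)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["GENE"] + keywords)  # header name is an assumption
    writer.writerows(rows)
    return buffer.getvalue()


# Toy weights (hypothetical values).
weights = {
    "Cwp1": {"wall": 2.1, "cell": 0.4},
    "Cln1": {"cyclin": 3.0},
}
tsv = to_tsv(weights)
print(tsv)
```

Keywords missing from a gene are filled with 0.0 so that every row has the same length.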
Yeast Genes Grouped using Z-Score
• Cluster 1: Gic2, Rad27, Dun1, Tel2, Cdc20, Far1
• Cluster 2: Cln1, Cln2, Cdc6
• Cluster 3: Gic1, Ace2, Mcm3
• Cluster 4: Exg1, Htb2, Cts1
• Cluster 5: Mcm2
• Cluster 6: Mnn1, Och1, Hho1, Mcm6
• Cluster 7: Msb2, Rsr1, Bud9, Kre6, Cwp1, Clb5, Clb6, Rnr1, Cdc21, Cdc45, Htb1, Hta1, Hta2, Hht1, Tem1
• Cluster 8: Rad51
• Cluster 9: Ste2
Yeast Genes Grouped using TF-IGF
• Cluster 1: Cdc20
• Cluster 2: Cln1, Cln2
• Cluster 3: Swi5, Ace2
• Cluster 4: Cdc6, Mcm3, Mcm6, Cdc46
• Cluster 5: Mcm2
• Cluster 6: Cdc45
• Cluster 7: Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1, Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Htb1, Htb2, Hta1, Hta2, Hho1, Hht1, Tel2, Tem1, Clb2, Cts1, Gic1, Gic2
• Cluster 8: Rad51
• Cluster 9: Ste2, Far1
Analysis
• Z-Score is independent of document frequency only when the word is unique to the gene
• [Figure panels: Cln1 Gene, Cln2 Gene]
Analysis (cont'd.)
• Words found with low frequency but with high z-score
Analysis (cont'd.)
• Words with high frequency have low z-scores
Future Scope
• A new Background set can be used, for example:
  • Abstracts that contain the gene name and its related words in the title
  • Unrestricted Background set
• Applying a stemming algorithm to the MEDLINE database
• Concepts of Latent Semantic Metrics can be applied by preserving the order of words
Special Thanks To…
Dr. Venu Dasigi
Dr. Vipa Phuntumart
Dr. Ray Kresman
Pukar Hamal