SUNZ2004
TEXT MINING: An Example using Enterprise Miner
SUNZ 2004
Duxton Hotel, Wellington, 26 November 2004
SIVA GANESH
Centre for Data Mining
s.ganesh@massey.ac.nz
http://www-ist.massey.ac.nz/sganesha
STATISTICS
The Intention …
• Brief Overview of Text Mining
  - What is Text Mining?
  - Who needs Text Mining?
  - Text Mining process…
• Demo: SAS/Enterprise Miner
Acknowledgements … Some material is courtesy of authors of books, journal and conference presentations, and software manufacturers…
What is Text Mining?
• Think of all the text-based documents/data you've amassed within the last year…
  - client e-mails, corporate Web pages, customer surveys, prospect résumés, medical records, DNA sequences, professional papers, incident reports, EIL-corpora and essays, news stories and more…
• There may not be enough time or patience to read, explore and examine everything…
  - …to extract the most vital kernels of information.
• So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first!
  - However, some texts always need careful reading…
  - Others (e.g. DNA sequences) are hard to comprehend!
What is Text Mining?
• Typical data mining solutions focus on capturing and analysing "structured data" (usually quantitative or equivalent).
• The analysis of "free-form text", also referred to as "unstructured data", is rarely incorporated into typical data mining initiatives…
  - often because successful categorisation of such data is a difficult and time-consuming task…
• Clearly, there is significant value in analysing unstructured data, again to discover hidden patterns that result in actionable information…
  - preferably not in isolation from what other important structured data yields…
  - i.e. combine free-form text and structured data to derive valuable, actionable information…
Text is everywhere…
Examples (SAS)…
• The human resources department at a European financial services firm uses text mining software to review 15,000 employee satisfaction surveys collected over the last five years. The company's CEO is particularly interested in how the company's "top talent" responds to open-ended questions regarding job satisfaction.
• An international pharmaceutical company uses text mining to evaluate 500 text-based responses from patients participating in a clinical study of a new allergy medication. The process detects a cluster of 50 patients whose responses include words such as nausea, loss, appetite, fatigue, clear and dry. Further examination indicates that patients in this cluster received a high dosage of the drug, and a decision-tree analysis reveals that women older than 40 are especially sensitive to the high dosage. As a result, dosage levels are adjusted, and warnings are attached to the medication for women over 40.
Text is everywhere…
Examples…
• E-mail filtering: discriminate and classify as "spam" or "good"…
• News item routing: generate classification codes from descriptions…
• Survey data, customer satisfaction (e.g. open-ended questions): segment data into natural groupings…
• Predict stock market prices from business news announcements…
• DNA sequencing projects generate large amounts of data: text mining offers practical solutions to identify patterns and simplify views of such data (classification and clustering)...
• Language tasks (EIL-corpora): each student in five EIL groups was asked to write an essay (of at least 250 words) entitled "The advantages of living in a large city..." (classification, clustering, association/link/correspondence)…
Text Mining Process…
Three Basic Steps to Text Mining:
• Pre-processing: to distil the unstructured textual data into a structured format…
• Dimension-reduction: to reduce the structured text data to a more practical and manageable size, to facilitate more efficient processing…
• Data mining: apply traditional data mining (descriptive, predictive and statistical) techniques to the reduced data to discover patterns and trends…
  - also, add conventional structured data to enrich the analysis…
Text Mining Process…
Three Basic Steps to Text Mining: "Pre-processing"
• The easiest way to handle textual data is to transform it into an information-rich "unit-by-term" matrix…
  (unit: document or similar; term: word, phrase or similar)
• This "large" matrix usually gives the "frequency" of every term within the collection of textual data...
• During this stage, feature extraction is also used to locate specific bits of information, such as customer names, organizations and addresses…
  - Also, data cleansing…
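A minimal sketch of the pre-processing step, assuming plain word tokens as terms and raw counts as frequencies. The toy documents and the helper names (`tokenize`, `unit_by_term_matrix`) are hypothetical and only for illustration; this is not the SAS text-parsing node.

```python
import re
from collections import Counter

# Hypothetical toy "units" (documents); not taken from the presentation.
documents = [
    "The new drug caused nausea and fatigue",
    "Fatigue and loss of appetite after the drug",
    "Clear skin and no nausea",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def unit_by_term_matrix(units):
    """Return (terms, matrix) where matrix[i][j] counts term j in unit i."""
    counts = [Counter(tokenize(u)) for u in units]
    terms = sorted(set().union(*counts))  # vocabulary, in alphabetical order
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

terms, matrix = unit_by_term_matrix(documents)
print(terms)
for row in matrix:
    print(row)
```

Each row is one unit and each column one term, so the matrix grows by a column for every new word in the collection. That growth is why the matrix quickly becomes large and sparse.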
Text Mining Process…
Three Basic Steps to Text Mining: "Dimension-reduction"
• Usually, a technique called "singular value decomposition (SVD)" is used to replace the original unit-by-term matrix with a much smaller matrix ⇒ dimension reduction…
• During this process, unimportant terms get discarded or ignored, and more important or highly relevant terms are singled out…
  - Concepts underlying the terms (obscured by term/word choice) can be expressed as collections or combinations of terms…
• Note that, in textual data, the number of terms is often larger than the number of units or documents ⇒ SVD is a better choice for dimension reduction than other techniques…
• This step may be skipped, i.e. DM techniques can be applied directly to the unit-by-term frequency matrix (e.g. e-mail spam data)…
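The reduction can be sketched with NumPy's `numpy.linalg.svd`, assuming NumPy is available. The 4-by-6 frequency matrix and the helper `reduce_dimension` are hypothetical, and q = 2 is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical 4-unit by 6-term frequency matrix (rows: units, columns: terms).
X = np.array([
    [2, 1, 0, 0, 1, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 1, 2, 0, 1],
], dtype=float)

def reduce_dimension(X, q):
    """Keep the first q singular directions of X.

    Returns the n x q matrix of reduced unit coordinates (U_q D_q)
    and the p x q matrix V_q of the first q right singular vectors.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :q] * d[:q], Vt[:q].T

scores, V_q = reduce_dimension(X, q=2)
print(scores.shape)  # one q-dimensional row per unit
```

The reduced coordinates equal `X @ V_q`, so a new unit with the same term columns can be mapped into the reduced space with one matrix product.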
Text Mining Process…
Singular value decomposition: (for those hungry for theory!)
The singular value decomposition of an n×p matrix X is a factorization of X into three new matrices U, D and V such that

    X = U D V^T

Here, U is an n×p matrix with orthonormal columns (i.e. U^T U = I, where I is the identity matrix), called the left singular vectors; V is a p×p orthogonal matrix (V^T V = I) whose columns are called the right singular vectors; and D is a p×p diagonal matrix with diagonal elements d_1 ≥ d_2 ≥ … ≥ d_p ≥ 0, known as the singular values…
In practice only the first q columns of these matrices (U, D and V) are calculated, where q is the rank of X or a smaller chosen number; this is called the truncated decomposition of the original matrix. The columns of the matrix UD are called principal components (strictly, when the columns of X are centred)…
If V_q is the p×q matrix of the first q columns of V, then V_q V_q^T is a projection matrix that maps each point x_i in X onto the subspace spanned by the columns of V_q…
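A short derivation of why V_q V_q^T acts as a projection, and how it relates to the reduced coordinates. The symbols U_q and D_q (the first q columns of U and the leading q×q block of D) and the name P are notation introduced here for illustration.

```latex
% Notation follows the slide: X = U D V^T.
% U_q, V_q: first q columns of U and V; D_q: leading q x q block of D.
\begin{align*}
V_q^{T} V_q &= I_q
  && \text{(the columns of $V$ are orthonormal)} \\
P = V_q V_q^{T}, \quad
P^{2} &= V_q \,(V_q^{T} V_q)\, V_q^{T} = V_q V_q^{T} = P
  && \text{($P$ is symmetric and idempotent, i.e. a projection)} \\
X V_q &= U D V^{T} V_q = U_q D_q
  && \text{(reduced coordinates of the units)} \\
X P &= U_q D_q V_q^{T}
  && \text{(rank-$q$ approximation of $X$)}
\end{align*}
```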
Text Mining using SAS/Ent.Miner…
• Text-parsing node: enables the decomposition of textual input into tables of the frequency of occurrence of terms in documents/units and across the whole text data set...
  - creates a unit-term frequency matrix (may become very large!)
• SVD (singular value decomposition) node: enables reduction in dimensionality of the frequency matrix, typically to 100 columns or fewer, depending on the application...
• Results can be input into other SAS/EM nodes for predictive and descriptive modelling: clustering, neural networks, tree or other (statistical) techniques…
• Good help facilities…
• Demo…
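To illustrate the final data mining step outside SAS/EM, here is a rough sketch of clustering on reduced document coordinates. It is a plain k-means written for illustration, not the Enterprise Miner clustering node. The `points` are hypothetical two-column SVD scores, and the deterministic start (first k points as centroids) is a simplifying assumption.

```python
import math

# Hypothetical reduced coordinates (e.g. two SVD columns per document).
points = [(0.9, 0.1), (0.1, 1.0), (1.0, 0.2), (0.0, 0.9), (0.8, 0.0), (0.2, 1.1)]

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

labels, centroids = kmeans(points, k=2)
print(labels)
```

In a real analysis the cluster labels would then be profiled against the original terms, or against conventional structured variables, to interpret each group.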
Conclusions…
• SAS/Enterprise Miner…
  - Good and Bad…
• Thought questions…
  - Is "text mining" really useful?…
  - Example: How would you handle analysing a questionnaire with several open-ended questions?...
Thank you!