2004 Siva Ganesh

Page 1: 2004 Siva Ganesh

SUNZ2004

TEXT MINING: An Example using Enterprise Miner

SUNZ 2004
Duxton Hotel, Wellington, 26 November 2004

SIVA GANESH
Centre for Data Mining

[email protected]
http://www-ist.massey.ac.nz/sganesha

STATISTICS

Page 2: 2004 Siva Ganesh

The Intention …

• Brief Overview of Text Mining
  - What is Text Mining?
  - Who needs Text Mining?
  - Text Mining process…
• Demo: SAS/Enterprise Miner

Acknowledgements … Some material is courtesy of authors of books, journal and conference presentations, and software manufacturers…

Page 3: 2004 Siva Ganesh

What is Text Mining?

• Think of all the text-based documents/data you've amassed within the last year…
  - client e-mails, corporate Web pages, customer surveys, prospect résumés, medical records, DNA sequences, professional papers, incident reports, EIL-corpora and essays, news stories and more…
• There may not be enough time or patience to read, explore and examine everything!
  - … to extract the most vital kernels of information…
• So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first!
  - However, some texts always need careful reading…
  - Some others (e.g. DNA sequences) are hard to comprehend!

Page 4: 2004 Siva Ganesh

What is Text Mining?

• Typical data mining solutions focus on capturing and analysing "structured data" (usually quantitative or equivalent).
• The analysis of "free-form text", also referred to as "unstructured data", is rarely incorporated into typical data mining initiatives…
  - often because successful categorisation of such data is a difficult and time-consuming task…
• Clearly, there is significant value in analysing unstructured data, again to discover hidden patterns that result in actionable information…
  - preferably, not in isolation from what other important structured data yields…
  - i.e. combine free-form text and structured data to derive valuable, actionable information…

Page 5: 2004 Siva Ganesh

Text is everywhere…
Examples (SAS)…

• The human resources department at a European financial services firm uses text mining software to review 15,000 employee satisfaction surveys collected over the last five years. The company's CEO is particularly interested in how the company's "top talent" responds to open-ended questions regarding job satisfaction.

• An international pharmaceutical company uses text mining to evaluate 500 text-based responses from patients participating in a clinical study of a new allergy medication. The process detects a cluster of 50 patients whose responses include words such as nausea, loss, appetite, fatigue, clear and dry. Further examination indicates that patients in this cluster received a high dosage of the drug, and a decision-tree analysis reveals that women older than 40 are especially sensitive to the high dosage. As a result, dosage levels are adjusted, and warnings are attached to the medication for women over 40.

Page 6: 2004 Siva Ganesh

Text is everywhere…
Examples…

• E-mail filtering: discriminate and classify as "spam" or "good"…
• News item routing: generate classification codes from descriptions…
• Survey data, customer satisfaction (e.g. open-ended questions): segment data into natural groupings…
• Predict stock market prices from business news announcements…
• DNA sequencing projects generate large amounts of data: text mining offers practical solutions to identify patterns and simplify views of such data (classification and clustering)...
• Language tasks (EIL-corpora): each student in five EIL groups was asked to write an essay (with at least 250 words) entitled "The advantages of living in a large city..." (classification, clustering, association/link/correspondence).

Page 7: 2004 Siva Ganesh

Text Mining Process…
Three Basic Steps to Text Mining:

• Pre-processing: to distil the unstructured textual data into a structured format…
• Dimension-reduction: to reduce the structured text data to a more practical and manageable size to facilitate more efficient processing…
• Data mining: apply traditional data mining (descriptive, predictive and statistical) techniques to the reduced data to discover patterns and trends…
  - also, add conventional structured data to enrich the analysis…

Page 8: 2004 Siva Ganesh

Text Mining Process…
Three Basic Steps to Text Mining: "Pre-processing"

• The easiest way to handle textual data is to transform it into an information-rich, "unit-by-term" matrix…
  - (unit: document or similar; term: word or phrase or similar)
• This "large" matrix usually gives the "frequency" of every term within the collection of textual data...
• During this stage, feature extraction is also used to locate specific bits of information, such as customer names, organizations and addresses…
  - Also, data cleansing…
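The pre-processing step can be illustrated with a minimal Python sketch that turns a few documents into a unit-by-term frequency matrix. The documents, the stop list and the function names `tokenize` and `unit_by_term_matrix` are hypothetical and chosen only for illustration; this is not how SAS/Enterprise Miner parses text.

```python
import re
from collections import Counter

# Hypothetical toy documents ("units"); not taken from the talk.
documents = [
    "The new drug caused nausea and fatigue.",
    "Fatigue and loss of appetite after the high dosage.",
    "Great service, very happy with the support team.",
]

# A tiny illustrative stop list; real tools ship much longer ones.
STOP_WORDS = {"the", "and", "of", "with", "after", "very", "a", "an"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def unit_by_term_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


terms, matrix = unit_by_term_matrix(documents)
print(terms)
for row in matrix:
    print(row)
```

Even with three short documents most entries are zero, which hints at why the matrix becomes large and sparse on real collections.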

Page 9: 2004 Siva Ganesh

Text Mining Process…
Three Basic Steps to Text Mining: "Dimension-reduction"

• Usually, a technique called "singular value decomposition (SVD)" is used to replace the original unit-by-term matrix with a much smaller matrix ⇒ dimension reduction…
• During this process, unimportant terms get discarded or ignored, and more important or highly relevant terms are singled out…
  - Concepts underlying the terms (obscured by term/word choice) can be expressed as collections or combinations of terms…
• Note that, in textual data, the number of terms is often larger than the number of units or documents ⇒ SVD is a better choice for dimension reduction than others…
• This step may be skipped! i.e. apply DM techniques directly to the unit-by-term frequency matrix (e.g. e-mail spam data)…
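As a rough illustration of this step, the sketch below uses NumPy's `numpy.linalg.svd` to replace a small unit-by-term matrix with a two-column matrix of document scores. The 4 x 6 matrix, the choice of k = 2 and the helper name `reduce_dimensions` are assumptions made for the example.

```python
import numpy as np

# Hypothetical 4-document x 6-term frequency matrix (rows: units, columns: terms).
X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 0],
], dtype=float)


def reduce_dimensions(X, k):
    """Project each row of X onto its first k right singular vectors."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :k] * d[:k]   # equals X @ Vt[:k].T
    return scores, Vt[:k], d


scores, components, singular_values = reduce_dimensions(X, k=2)
print(scores.shape)       # (4, 2): one short row per document
print(components.shape)   # (2, 6): each "concept" is a weighted combination of terms
print(singular_values)    # non-increasing; small values mark directions that are dropped
```

Each row of `components` mixes several terms, which is one way to read the remark that underlying concepts can be expressed as combinations of terms.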

Page 10: 2004 Siva Ganesh

Text Mining Process…
Singular value decomposition: (for those hungry for theory!)

The singular value decomposition of an $n \times p$ matrix $X$ is a factorization of $X$ into three new matrices $U$, $D$ and $V$ such that

$$X = U D V^{T}$$

Here, $U$ is an $n \times p$ matrix with orthonormal columns (i.e. $U^{T}U = I$, $I$: identity matrix), whose columns are called left singular vectors; $V$ is a $p \times p$ orthogonal matrix ($V^{T}V = I$) with columns called right singular vectors; and $D$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$ known as singular values…

In practice only the first $q$ columns of $U$, $D$ and $V$ are calculated (with $q$ at most the rank of $X$), and this is called the truncated decomposition of the original matrix. The columns of $UD$ are called principal components…

If $V_q$ is the $p \times q$ matrix of the first $q$ columns of $V$, then $V_q V_q^{T}$ is a projection matrix and maps each point $x_i$ (a row of $X$) onto the subspace spanned by the columns of $V_q$…
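The properties stated above can be checked numerically. The sketch below is only a sanity check on an arbitrary small matrix (n = 5, p = 3, q = 2, all chosen for illustration), using NumPy's `numpy.linalg.svd` with `full_matrices=False` so that U has the n x p shape used on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(5, 3)).astype(float)   # hypothetical n=5, p=3 matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # U: 5x3, d: 3 values, Vt: 3x3
V = Vt.T

assert np.allclose(U @ np.diag(d) @ Vt, X)          # X = U D V^T
assert np.allclose(U.T @ U, np.eye(3))              # orthonormal columns of U
assert np.allclose(V.T @ V, np.eye(3))              # V is orthogonal
assert np.all(d[:-1] >= d[1:]) and np.all(d >= 0)   # d1 >= d2 >= ... >= dp >= 0

q = 2
Vq = V[:, :q]                                       # p x q: first q right singular vectors
P = Vq @ Vq.T                                       # p x p projection matrix
assert np.allclose(P @ P, P)                        # projecting twice changes nothing
assert np.allclose(P, P.T)                          # orthogonal projections are symmetric
assert np.allclose(X @ Vq, (U * d)[:, :q])          # scores are the first q columns of UD

print("all SVD checks passed")
```

The last check shows that the coordinates of the projected rows are the first q columns of UD, the quantities the slide calls principal components.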

Page 11: 2004 Siva Ganesh

Text Mining using SAS/Ent.Miner…

• Text-parsing node: enables the decomposition of textual input into tables of the frequency of occurrence of terms in documents/units and across the whole text data set...
  - creates a unit-term frequency matrix (may become very large!)
• SVD (singular value decomposition) node: enables reduction in dimensionality of the frequency matrix, typically to 100 columns or fewer, depending on the application...
• Results can be input into other SAS/EM nodes for predictive and descriptive modelling: clustering, neural networks, tree or other (statistical) techniques…
• Good help facilities…
• Demo…
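The flow on this slide (frequency matrix, then SVD, then a modelling node) can be mimicked outside SAS. The Python sketch below is not Enterprise Miner code: it takes a small made-up frequency matrix, keeps two SVD columns per document, and feeds them to a bare-bones 2-cluster k-means. The helper `two_means` and its farthest-point start are assumptions made for the example, and it assumes neither cluster becomes empty.

```python
import numpy as np

# Hypothetical unit-by-term counts: documents 0-1 and 2-3 use mostly different terms.
X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 0],
], dtype=float)

# SVD step: keep 2 columns per document.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :2] * d[:2]


def two_means(points, n_iter=10):
    """Minimal 2-cluster k-means with a deterministic farthest-point start."""
    first = points[0]
    second = points[np.argmax(np.linalg.norm(points - first, axis=1))]
    centres = np.stack([first, second])
    for _ in range(n_iter):
        # Distance of every point to both centres, then nearest-centre labels.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean of its members.
        centres = np.stack([points[labels == c].mean(axis=0) for c in (0, 1)])
    return labels


labels = two_means(scores)
print(labels)   # documents sharing vocabulary should receive the same label
```

In a real analysis the cluster labels would then be profiled against the original terms and any structured variables, as in the pharmaceutical example earlier in the talk.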

Page 12: 2004 Siva Ganesh

Conclusions…

• SAS/Enterprise Miner…
  - Good and Bad…
• Thought questions…
  - Is "text mining" really useful?…
  - Example: How would you handle analysing a questionnaire with several open-ended questions?...

Thank you!