Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Dörre, Peter Gerstl, and Roland Seiffert
Overview
Introduction to Mining Text
How Text Mining Differs from Data Mining
Mining Within a Document: Feature Extraction
Mining in Collections of Documents: Clustering and Categorization
Text Mining Applications
Exam Questions/Answers
Introduction to Mining Text
Reasons for Text Mining
[Bar chart: percentage (axis 0 to 90) for Collections of Text versus Structured Data]
Corporate Knowledge “Ore”
Email
Insurance claims
News articles
Web pages
Patent portfolios
Customer complaint letters
Contracts
Transcripts of phone calls with customers
Technical documents
Challenges in Text Mining
Information is in unstructured textual form.
Not readily accessible to be used by computers.
Dealing with huge collections of documents.
Two Mining Phases
Knowledge Discovery: Extraction of codified information (features)
Information Distillation: Analysis of the feature distribution
How Text Mining Differs from Data Mining
Comparison of Procedures
Data Mining
  Identify data sets
  Select features
  Prepare data
  Analyze distribution
Text Mining
  Identify documents
  Extract features
  Select features by algorithm
  Prepare data
  Analyze distribution
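The text mining procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: whitespace tokenization stands in for feature extraction, and a document-frequency cutoff (the hypothetical `min_df` parameter) stands in for "select features by algorithm". None of the function names come from the original slides.

```python
from collections import Counter

def extract_features(doc):
    """Extract features: here simply lowercase whitespace tokens (an assumption)."""
    return doc.lower().split()

def select_features(docs_features, min_df=2):
    """Select features by algorithm: keep those found in at least min_df documents."""
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # count each feature once per document
    return sorted(f for f, n in df.items() if n >= min_df)

def prepare(docs_features, vocabulary):
    """Prepare data: one count vector per document over the selected vocabulary."""
    return [[feats.count(f) for f in vocabulary] for feats in docs_features]

# Identify documents (made-up sample texts).
documents = [
    "late delivery of the order",
    "order arrived late",
    "great service",
]
features = [extract_features(d) for d in documents]
vocabulary = select_features(features)
vectors = prepare(features, vocabulary)

# Analyze distribution: total count of each selected feature across documents.
totals = [sum(column) for column in zip(*vectors)]
```

Note that no human picks the features here; the cutoff does, which is the step that the data mining procedure does not need.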
IBM Intelligent Miner for Text
SDK: Software Development Kit
Contains necessary components for “real text mining”
Also contains more traditional components:
  IBM Text Search Engine
  IBM Web Crawler
  drop-in Intranet search solutions
Mining Within a Document: Feature Extraction
Feature Extraction
To recognize and classify significant vocabulary items in unrestricted natural language texts.
Let’s see an example…
Example of Vocabulary found
Certificate of deposit
CMOs
Commercial bank
Commercial paper
Commercial Union Assurance
Commodity Futures Trading Commission
Consul Restaurant
Convertible bond
Credit facility
Credit line
Debt security
Debtor country
Detroit Edison
Digital Equipment
Dollars of debt
End-March
Enserch
Equity warrant
Eurodollar
…
Implementation of Feature Extraction relies on
Linguistically motivated heuristics
Pattern matching
Limited amounts of lexical information, such as part-of-speech information.
Not used: huge amounts of lexicalized information
Not used: in-depth syntactic and semantic analyses of texts
Goals of Feature Extraction
Very fast processing to be able to deal with mass data
Domain-independence for general applicability
Extracted information categories
Names of persons, organizations and places
Multiword terms
Abbreviations
Relations
Other useful stuff
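As a rough illustration of extraction by surface pattern matching, the sketch below pulls multiword terms and abbreviations out of a sentence with two regular expressions. Both patterns and the sample sentence are hypothetical and far cruder than any production extractor; they only show that no syntactic or semantic analysis is needed for this style of extraction.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# Two or more consecutive capitalized words, e.g. "Detroit Edison".
MULTIWORD = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")
# An all-caps token of 2 to 5 letters with an optional plural "s", e.g. "CMOs".
ABBREVIATION = re.compile(r"\b[A-Z]{2,5}s?\b")

def extract_features(text):
    """Return candidate vocabulary items found purely by pattern matching."""
    return {
        "multiword": MULTIWORD.findall(text),
        "abbreviation": ABBREVIATION.findall(text),
    }

sample = "Shares of Detroit Edison rose after Digital Equipment sold CMOs."
features = extract_features(sample)
```

A heuristic like this is fast and domain-independent, at the cost of false positives such as capitalized sentence openers followed by another capitalized word.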
Canonical Forms
Normalized forms of dates, numbers, …
Allows applications to use information very easily
Abstracts from different morphological variants of a single term
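A minimal sketch of what normalizing dates and numbers to a canonical form might look like. The list of accepted date formats and both function names are assumptions made for this example, and the month names assume an English (default) locale.

```python
from datetime import datetime

# Hypothetical list of surface formats; a real extractor would cover many more.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")

def canonical_date(text):
    """Map a date string to ISO 8601 (YYYY-MM-DD), or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next surface format
    return None

def canonical_number(text):
    """Strip thousands separators so "1,250" and "1250" compare equal."""
    return int(text.replace(",", ""))
```

With a single canonical form, an application can compare or index values directly, for example treating "March 5, 1999" and "5 March 1999" as the same date.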
Canonical Names
President Bush
Mr. Bush
George Bush
Canonical Name: George Bush
The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
Reduces ambiguity of variants
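The Bush example can be mimicked with a very small string heuristic: drop leading titles, group variants by their last token, and take the longest variant in each group as the canonical name. This is an assumed simplification for illustration, not the Nominator algorithm, and the title list is hypothetical.

```python
TITLES = {"president", "mr.", "mrs.", "ms.", "dr."}  # hypothetical title list

def strip_title(name):
    """Return the name's tokens with any leading titles removed."""
    tokens = name.split()
    while tokens and tokens[0].lower() in TITLES:
        tokens = tokens[1:]
    return tokens

def canonical_names(variants):
    """Map each variant to the longest title-free variant sharing its last token."""
    best = {}  # last token -> most explicit token list seen so far
    for variant in variants:
        tokens = strip_title(variant)
        if tokens and len(tokens) > len(best.get(tokens[-1], [])):
            best[tokens[-1]] = tokens
    return {
        variant: " ".join(best[strip_title(variant)[-1]])
        for variant in variants
        if strip_title(variant)
    }

mapping = canonical_names(["President Bush", "Mr. Bush", "George Bush"])
```

All three variants map to "George Bush". The heuristic works on strings only, so two different people who share a surname within one document would be merged.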
Disambiguating Proper Names: Nominator Program
Principles of Nominator Design
Apply heuristics to strings, instead of interpreting semantics.
The unit of context for extraction is a document.
The unit of context for aggregation is a corpus.
The heuristics represent English naming conventions.
Mining in Collections of Documents: Clustering and Categorization
1. Clustering
Partitions a given collection into groups of documents similar in content, i.e., in their feature vectors.
Two clustering engines:
  Hierarchical Clustering tool
  Binary Relational Clustering tool
Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group.
This provides an overview of the contents of a collection of documents.
Groups documents similar in their feature vectors.
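To make the idea of grouping documents by feature-vector similarity concrete, here is a small agglomerative (bottom-up) clustering sketch. It assumes bag-of-words vectors, cosine similarity, single-link merging and a hypothetical `threshold` parameter; it is not a description of how either IBM clustering tool works internally.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Single-link agglomerative clustering: repeatedly merge the most similar
    pair of clusters until no pair reaches the threshold."""
    vectors = [vectorize(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while True:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if sim >= best:
                    best, pair = sim, (x, y)
        if pair is None:
            return clusters
        x, y = pair
        clusters[x] += clusters.pop(y)

def common_terms(docs, cluster_ids):
    """Terms shared by every document in a cluster, usable as a topic label."""
    shared = set(vectorize(docs[cluster_ids[0]]))
    for i in cluster_ids[1:]:
        shared &= set(vectorize(docs[i]))
    return sorted(shared)

docs = [
    "refund delayed refund request",
    "refund request pending",
    "login password reset",
    "password reset failed",
]
groups = cluster(docs)
labels = [common_terms(docs, g) for g in groups]
```

The shared terms per group play the role of the term lists that help a user identify what each cluster is about.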
2. Categorization
Topic Categorization Tool
Assigns documents to preexisting categories (“topics” or “themes”).
Categories are chosen to match the intended use of the collection.
Categories are defined by providing a set of sample documents for each category.
2. Categorization (cont.)
This “training” phase produces a special index, called the categorization schema.
The categorization tool returns a list of category names and confidence levels for each document.
If the confidence level is low, the document is put aside for a human categorizer.
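The train-then-categorize flow described above can be sketched with a simple centroid classifier. Everything here is an assumption for illustration: the "schema" is just one summed term vector per category, confidence is cosine similarity, and `min_confidence` is a hypothetical cutoff. The actual Topic Categorization tool may work quite differently.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: build one summed term vector per category
    from that category's sample documents."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema

def categorize(schema, text, min_confidence=0.5):
    """Return (ranked list of (category, confidence), needs_human_review)."""
    vec = vectorize(text)
    ranked = sorted(((c, cosine(vec, centroid)) for c, centroid in schema.items()),
                    key=lambda pair: pair[1], reverse=True)
    needs_review = not ranked or ranked[0][1] < min_confidence
    return ranked, needs_review

samples = {
    "billing": ["refund request pending", "invoice refund delayed"],
    "access": ["password reset failed", "login password reset"],
}
schema = train(samples)
ranked, review = categorize(schema, "refund for invoice")
ranked_unknown, review_unknown = categorize(schema, "hello there")
```

The second document shares no terms with either category, so its best confidence is zero and it is flagged for a human categorizer.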
2. Categorization (cont.)
Effectiveness:
Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.
[Diagram labels, in order: Set of sample documents; Training phase; Special index used to categorize new documents; Returns list of category names and confidence levels for each document]
Text Mining Applications
Main Advantages of mining technology over traditional ‘information broker’ business
Ability to quickly process large amounts of textual data
“Objectivity” and customizability
Automation
Applications used to:
Gain insights about trends, relations between people/places/organizations
Classify and organize documents according to their content
Organize repositories of document-related meta-information for search and retrieval
Retrieve documents
Main Applications
Knowledge Discovery
Information Distillation
CRI: Customer Relationship Intelligence
Appropriate documents are selected
Documents are converted to a common format
Feature extraction and clustering tools are used to create a database
User may select parameters for the preprocessing and clustering steps
Clustering produces groups of feedback that share important linguistic elements
Categorization tool is used to assign new incoming feedback to identified categories
CRI (continued)
Knowledge Discovery
  Clustering used to create a structure that can be interpreted
Information Distillation
  Refinement and extension of the clustering results
  Interpreting the results
  Tuning of the clustering process
  Selecting meaningful clusters
Exam Question #1
Name an example of each of the two main classes of applications of text mining.
Knowledge Discovery: Discovering a common customer complaint among much feedback.
Information Distillation: Filtering future comments into pre-defined categories.
Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
Adds feature extraction function
Not feasible to have humans select features
Highly dimensional, sparsely populated feature vectors
Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
Does not perform in-depth syntactic or semantic analyses of texts
THE END
http://www-3.ibm.com/software/data/iminer/fortext/