Introduction to Informatics - Fall 02
Alternative methods of accessing digital information: Data mining
I. Introduction
• What is it?
II. How does it work?
• The virtuous circle of data mining
• Techniques of data mining
III. Data mining applications
• What is it good for?
• DM and CRM
I. Introduction
• What is it?
Data mining is a process of knowledge discovery in databases
It involves the extraction of interesting information, patterns, or rules from data in large databases
This information is non-trivial, implicit, previously unknown, and potentially useful
It is a search for valuable information in large volumes of data
It uses statistical techniques to explore and analyze large quantities of data in order to discover meaningful patterns and rules
Why has data mining become so popular?
Large amounts of data are being produced as more functions become automated
Many algorithms require large data sets for training and learning
Data are being warehoused
They are being extracted from various systems (accounting, billing, ordering etc) and stored in a central location
They are stored in a common format
Consistent definitions for fields and keys
Computer power is increasing and costs are decreasing
And
Strong competitive pressures
Information-intensive activities (business, science) are competing for market share, funding etc.
Realization of the increasing value of information (especially as a source of revenue)
There is value in what can be discovered in data
For business, there is value in customization
Commercial data mining software is now available
There are off-the-shelf products
Data mining can be directed
The goal is to use the available data to build a model that describes a variable of interest in relation to the data set
Given what we know about people in Bloomington, which types of people are likely to subscribe to DSL?
Data mining can also be undirected
There is no variable of interest
The goal is to search through the available data to look for patterns and relationships
What can we learn about students at IU who default on their student loans?
Data mining provides an organization with “memory” and “intelligence”
Noticing
Uses on-line transaction processing systems (OLTP)
Remembering
Capturing as much of the transaction process as possible
Phone records, communications, CRM exchanges
Learning
The records must be organized into “data warehouses”
Data mining is used to analyze these data
Intelligence involves patterns, rules, and predictions
Data mining typically involves six activities
1. Classification
Examining the features of a data instance and assigning it to a predefined class
Records in a database are updated by filling in a field with a “class code”
The process uses a “training set” to sort unclassified data into discrete classes
Assigning keywords to articles as they arrive
Sorting credit card applicants according to risk levels
Assigning people to demographic categories
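As an illustration of classification with a training set, here is a minimal sketch that assigns each unclassified record the class of its closest training example (a 1-nearest-neighbour rule, chosen here only for brevity; the slides do not name a specific algorithm). The fields, values, and the names `TRAINING_SET` and `nearest_class` are hypothetical.

```python
from math import dist

# Hypothetical training set: (income in $1000s, years at current job) -> risk class.
TRAINING_SET = [
    ((25.0, 1.0), "high"),
    ((30.0, 2.0), "high"),
    ((60.0, 8.0), "low"),
    ((75.0, 10.0), "low"),
]

def nearest_class(features):
    """Return the class code of the closest training example (1-nearest neighbour)."""
    _, label = min(TRAINING_SET, key=lambda example: dist(example[0], features))
    return label

# Fill in the "class code" field for records that have not been classified yet.
applicants = [{"features": (28.0, 1.5)}, {"features": (70.0, 9.0)}]
for record in applicants:
    record["risk_class"] = nearest_class(record["features"])

print([record["risk_class"] for record in applicants])
```

The classes are fixed in advance by the training set; the code only decides which predefined class each new record falls into.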
2. Estimation
This process deals with continuously valued outcomes
Using new data to predict whether a given data instance is above or below a threshold
This requires a model to determine the threshold level
It can be used to make predictions
Use customer data to determine churn rates
Estimate how long a person is likely to remain a customer
Assess the probability that people will respond to an offer of a home equity loan
The model produces a score between 0 and 1, with a threshold of 0.83
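The idea of a 0 to 1 score compared against a threshold can be sketched as follows. The logistic scoring function, the weights, and the customer fields are assumptions made for illustration; only the 0.83 threshold comes from the slide.

```python
from math import exp

THRESHOLD = 0.83  # threshold quoted on the slide

# Hypothetical model weights; a real model would be fitted to historical data.
WEIGHTS = {"owns_home": 1.5, "years_as_customer": 0.2}
BIAS = -1.0

def response_score(customer):
    """Estimate a response probability between 0 and 1 with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + exp(-z))

def above_threshold(customer):
    """True when the estimated score reaches the threshold."""
    return response_score(customer) >= THRESHOLD

owner = {"owns_home": 1, "years_as_customer": 10}
renter = {"owns_home": 0, "years_as_customer": 2}
print(round(response_score(owner), 3), above_threshold(owner))
print(round(response_score(renter), 3), above_threshold(renter))
```

The estimate itself is continuous; the threshold only turns it into a yes/no decision about whom to send the offer to.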
3. Prediction
Similar to estimation but with the expectation that there will be some check in the future
Uses a training set with historical data and a known predicted variable
Predicting the size of a balance that is likely to be transferred when a person accepts a credit card offer
Determining which customers will leave in a given time period
Predicting which customers will add a new service such as caller ID in a given area
4. Affinity grouping or association rules
The goal is to explore an available data set to determine which data instances should be grouped together
This involves discovering relationships among data
Which items should be placed near each other in a supermarket?
Which products can be grouped for cross-selling?
5. Clustering
The task is to sort undifferentiated data into like groups
This process does not begin with predefined classes
What do the book and music purchases tell us about our customers?
6. Description and visualization
Developing a preliminary understanding of the data
This is a first step in developing an explanation
What can we tell about the people who shop in a food coop?
Visualization is the graphic representation of the data
Directed data mining
Classification, estimation, prediction
Undirected data mining
Affinity grouping, clustering, description
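As an illustration of affinity grouping (activity 4 above), the sketch below counts how often pairs of items appear together in a basket. Support and confidence are the usual measures for association rules; the slides do not define them, so treat the formulas and the basket data as assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Of the baskets containing a, the fraction that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("bread", "butter"))     # share of baskets with both items
print(confidence("butter", "bread"))  # how often butter buyers also buy bread
```

Pairs with high support and confidence are candidates for shelf placement or cross-selling; there is no target variable, which is what makes this undirected.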
Classes of data mining activity
Information Discovery, Inc. (2001). A Characterization of Data Mining Technologies and Processes. http://www.datamining.com/dm-tech.htm
Types of data mining
Hypothesis testing
Top-down approach designed to test careful guesses
Process
Hypotheses are formulated to be falsified
This is done in scientific and business applications
Specific kinds of data are proposed to test the hypothesis
A data requirements document is created
The data are gathered and prepared
Profile the data, especially if it is derived from heterogeneous sources
Process continued:
Data preparation (cont)
The transformation is very important and will vary with the type of software being used
Computer model is built based on the data
The model is evaluated to reject or fail to reject the hypothesis
This is done by applying it to the data set
The end result is an analysis which statistically tests the hypothesis
The results are stated with the appropriate margin of error
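A statistical test with a margin of error might look like the sketch below, which compares response rates in two groups. The normal approximation, the 95% level (z = 1.96), and the counts are assumptions for illustration, not part of the slides.

```python
from math import sqrt

def compare_rates(successes_a, n_a, successes_b, n_b, z=1.96):
    """Difference between two proportions with an approximate 95% margin of error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_a - p_b
    # Standard error of the difference (normal approximation, unpooled).
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = z * se
    # Reject the "no difference" hypothesis only if the interval excludes zero.
    reject_null = abs(diff) > margin
    return diff, margin, reject_null

diff, margin, reject = compare_rates(150, 1000, 100, 1000)
print(f"difference = {diff:.3f} +/- {margin:.3f}, reject null: {reject}")
```

When the difference is smaller than the margin, the outcome is "fail to reject", which is not the same as showing that the hypothesis is true.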
Issues in data preparation
Summarization: developing the appropriate level of detail
The original data should not be summarized at all
The fine grained data may be irrelevant to the question
There may be too few examples at the finest level of detail
Incompatible computer architectures
Data transport software can translate among different languages and formats
(COBOL, C, C++, ASCII encoding, single- and double-precision floating-point numbers…)
Inconsistent data encoding
Different sources represent the same data in different ways
If not caught, these can introduce error into later analysis
Textual data
Mostly not useful
What is useful should be encoded in another form
This is best done by hand (e.g., keying UK, Wales, and Scotland to country code “44”)
Missing values
Most software is not good at handling these
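The encoding and missing-value issues above can be sketched together. The lookup table stands in for the hand-keyed mapping the slide describes; the names `COUNTRY_CODES` and `encode_country`, the record layout, and the extra "united kingdom" entry are hypothetical.

```python
# Hypothetical lookup table, keyed by hand as the slide suggests.
COUNTRY_CODES = {
    "uk": "44",
    "united kingdom": "44",
    "wales": "44",
    "scotland": "44",
}

def encode_country(raw):
    """Map a free-text country value to a code; return None if missing or unknown."""
    if raw is None or not raw.strip():
        return None
    return COUNTRY_CODES.get(raw.strip().lower())

records = [
    {"country": "UK"},
    {"country": " Scotland "},
    {"country": ""},
    {"country": None},
]
for record in records:
    record["country_code"] = encode_country(record["country"])

# Count missing values explicitly instead of letting them pass silently downstream.
missing = sum(1 for record in records if record["country_code"] is None)
print([record["country_code"] for record in records], missing)
```

Making the missing values visible at this stage leaves the decision about how to handle them (drop, impute, flag) to the analyst rather than to the mining software.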
Knowledge discovery: takes two major forms
Directed
Goal is to explain the value of some field (income, genetic information) or a specific relationship
Analysis seeks to estimate, classify, and predict the target field
This is an explanatory function
Finding patterns in data to explain the past and predict the future
What type of person is likely to default on a loan?
If these genetic markers are found, what future predispositions are indicated?
Process
Identify available data sources
It’s best to have preclassified data
Preparing the data for analysis
Similar issues are involved
Also involves adding fields to the data to clarify what we take for granted but that software cannot
Based on our experience with the data
Improves the chances of finding patterns
Training set: build the initial model
Test set: adjust to improve generality
Evaluation set: test the model
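The three sets can be produced by shuffling the preclassified records and cutting them into disjoint parts. The 60/20/20 proportions and the function name `split_records` are assumptions; the slides do not give proportions.

```python
import random

def split_records(records, train=0.6, test=0.2, seed=0):
    """Shuffle preclassified records and split them into three disjoint sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training_set = shuffled[:n_train]                # build the initial model
    test_set = shuffled[n_train:n_train + n_test]    # adjust to improve generality
    evaluation_set = shuffled[n_train + n_test:]     # final check of the model
    return training_set, test_set, evaluation_set

training_set, test_set, evaluation_set = split_records(range(100))
print(len(training_set), len(test_set), len(evaluation_set))
```

Keeping the evaluation set untouched until the end is what makes its error rate a fair estimate of performance on new data.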
Process (cont)
Building and training the model
Toss in as many variables as seem relevant and let the algorithm sort them out
Goal is to develop an explanation of dependent (target) variables based on independent (input) variables
The test set is used to minimize the problem of overfitting
Evaluating the model
Error rate of the evaluation set is a good indicator of the error rate with new data
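Computing the error rate on the evaluation set is straightforward once a model exists. In this sketch the "model" is a hypothetical one-line rule and the evaluation records are invented; only the calculation itself is the point.

```python
def error_rate(model, labelled_records):
    """Fraction of records for which the model's prediction differs from the known value."""
    wrong = sum(1 for features, actual in labelled_records if model(features) != actual)
    return wrong / len(labelled_records)

def income_rule(income):
    """Hypothetical model: predict default for incomes below 30 (in $1000s)."""
    return "default" if income < 30 else "repay"

# Hypothetical evaluation set: (income, known outcome), never used during training.
evaluation_records = [(20, "default"), (25, "repay"), (40, "repay"), (50, "repay")]
print(error_rate(income_rule, evaluation_records))
```

A much lower error rate on the training set than on the evaluation set is the usual sign of overfitting.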
Undirected knowledge discovery
There is no target field to serve as the focus of analysis
The goal is to search for meaningful patterns
Question might be: what goes together?
Process
Similar to directed knowledge discovery
Identify potential targets for directed knowledge discovery analysis
At the end of the process, one result is often new variables
Generate new hypotheses to test
II. How does it work?
The virtuous cycle of data mining
Transform data into useful information with DM
Act on the information
Measure the results to reuse the data
Identify problems where DM can provide value
In business applications, data mining does not seek to replicate previous efforts
The goal is to discover new markets, not saturate old ones
In science, replication of results is more important
Data mining is a creative activity
Many patterns will be found, but the art is in focusing on the meaningful ones
Data mining results can change over time
Models can become less useful over time as data changes and markets change
Characteristics of DM systems
The focus is on the analysis of current and historical data to predict future action
The analytic work depends on the flow of data (which is not regular)
Typically the emphasis is on working with large data sets
The purpose is to support decision making in business and hypothesis testing in science
Response times are slower due to the computing cycles involved in analysis
Another way to think about it
Aggregate the data
Prepare it in a common format
Find patterns in the data
There is a range of techniques that can be used
Respond to the patterns (what do they mean?)
Data → information
Act on the patterns
Information → action
Action generates value
Building a DM model (flowchart)
Steps:
Identify data requirements
Obtain data
Validate, explore, clean data
Transpose data
Add derived variables
Create model set
Choose modeling technique
Train model
Check model performance
Choose best model
Feedback loops back to earlier steps:
If data are not available
If values don’t look correct (shown on two loops)
If improvements, obtain more data
If new derived variable improves performance
If new segmentation improves performance
If a new technique improves performance
Data mining depends on three main elements
DM techniques
These are algorithmic approaches to problem solving that are statistically based
Data
DM data should be clean, simple, and in a table with well-defined columns
Data modeling
This is a process of developing predictive models for directed DM
The method for building these models is based on principles of experimental design
Techniques of DM
Automatic cluster detection
Used for undirected DM to find groupings in data
Algorithms sort the data set top-down (divisive) or bottom-up (agglomerative)
k-means cluster detection is a common method
Start with an arbitrary number of “seeds” (the initial clusters)
Assign the records to the closest seed (“centroid”)
Then recalculate the mean of each cluster and move its centroid to that mean
Continue until each is in the center of a cluster of records and there are clear boundaries
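The k-means steps above can be sketched in a few lines. The points, the seeds, and the function name `k_means` are hypothetical, and the stopping rule used here ("no centroid moved") is one common choice rather than something the slides specify.

```python
from math import dist

def k_means(points, seeds, max_iterations=100):
    """Plain k-means on points given as tuples, starting from the given seeds."""
    centroids = list(seeds)
    for _ in range(max_iterations):
        # Assign each record to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Different seeds can lead to different final clusters, which is one reason the results need to be checked against other techniques.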
Figure: Results of a cluster analysis (Seeds 1 to 3 and New centroids 1 to 3)
This is a discovery technique but is difficult to interpret
It shows that some records are closer together given the arbitrary starting points
The number of initial seeds is important
The number should minimize the distance between members of a cluster and maximize the distance between clusters
The results should be combined with other techniques to see if they have any meaning
Use it when you think there are patterns that you can’t see
Or when there are too many patterns and you want to reduce complexity
Decision trees
A classification tree labels records and assigns them to classes
The data is split iteratively until the groupings become useful
The point of the initial split is critical
Each branch cuts the space into two or more pieces and is a test on a record
Each record is tested at each node of the tree until it reaches the “leaf” or terminal node (categorical - yes/no)
Is record X greater than Y? - if yes, keep moving
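Testing a record at each node until it reaches a leaf can be sketched with a small hand-built tree. The tree, its fields, and the labels are hypothetical and are not meant to reproduce any particular figure; a real tree would be grown from a training set.

```python
# Hypothetical tree: each internal node tests a single field; leaves hold a class label.
TREE = {
    "field": "owns_home",
    "branches": {
        False: {"leaf": "unlikely to respond"},
        True: {
            "field": "income",
            "branches": {
                "low": {"leaf": "unlikely to respond"},
                "high": {"leaf": "likely to respond"},
            },
        },
    },
}

def walk_tree(record, node=TREE):
    """Test the record at each node until a leaf is reached; return the label and the path."""
    path = []
    while "leaf" not in node:
        value = record[node["field"]]
        path.append((node["field"], value))
        node = node["branches"][value]
    return node["leaf"], path

label, path = walk_tree({"owns_home": True, "income": "high"})
print(label, path)
```

The returned path lists the tests the record passed on its way down, which is the raw material for stating the tree's result as a rule.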
Figure: A sample decision tree (http://www.spss.com/datamine/trees.htm)
Splits shown: Own home? Family income? Mortgage? Savings account?
Root: Respond 8%
Own home? Rent: Respond 5%; Own: Respond 15%
Family income? Low: Respond 9%; Medium: Respond 13%; High: Respond 24%
Yes/No branches (Mortgage? and Savings account?): Yes: Respond 16%; No: Respond 45%; Yes: Respond 4%; No: Respond 18%
Decision trees are good when the goal is to develop a set of rules to organize data for predictive purposes
This works best when the tree has a manageable number of branches
The rule is formulated by tracing the branches back to the root
They are not good for discovering relationships among variables
Each split in the tree is a test of a single variable
They also may produce errors when the training set is too small
Neural networks
They use data models that simulate the structure of the brain to generalize and learn from data
They learn from a set of inputs and adjust the parameters of the model using new knowledge to find patterns in data
They fit a model to a set of historical data to classify or make predictions
They can find interaction effects among variables
You do not need to have any specific model in mind when running the analysis
They require extensive preparation of data
They also require a lengthy training period
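The idea of adjusting parameters from examples can be shown with the smallest possible case: a single sigmoid unit trained by gradient descent. This is a deliberately reduced sketch, not a multi-layer network, and the training data, learning rate, and number of epochs are arbitrary assumptions.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid unit for the given inputs."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))

def train_neuron(examples, epochs=2000, learning_rate=0.5):
    """Fit one sigmoid unit by gradient descent on (inputs, target) pairs."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Nudge each parameter in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical historical data: the target is 1 when either input is 1 (logical OR).
history = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(history)
print([round(predict(weights, bias, inputs), 2) for inputs, _ in history])
```

Even in this tiny case the learned result is just a handful of numbers; with many units and layers the parameters multiply and training takes correspondingly longer.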
Neural network model
Inputs: Floor space, Size of garage, Age of house, Acreage, Other factors
Output: Appraised value
Figure: Example of a neural network model
Berry, M.J. and Linoff, G. (1997). Data Mining Techniques. Wiley. p290
Neural networks are useful in predicting a target variable when the data is highly non-linear with interactions
A disadvantage is that it is difficult to interpret the resultant model with its layers of weights and transformations
The result is a set of weights distributed throughout the network
The weights provide no insight into why the solution is valid
This makes the use of neural nets a “black box” process
They are not very useful when the relationships in the data need to be explained
III. What is it good for?
Data mining is used for
Research
Pharmaceutical companies use DM to predict which chemicals are likely to produce powerful drugs
Process improvement
Using DM to determine thresholds for manufacturing (to separate good from bad product)
Marketing
Learning about customers to refine and target marketing campaigns and save money
The federal government uses DM to search for criminals and terrorists
Analyse FBI field agent reports
Look for patterns in international funds transfer
Customer relationship management
Developing sophisticated customer profiles shared across the business
Learning from customer behavior