© 2012 IBM Corporation
Predictive Analytics Workshop using IBM SPSS Modeler
Objectives
Smarter Software for Smarter Cities
Agenda
8:30-8:40 Welcome and Introductions
8:40-8:55 Introduction to Predictive Analytics
8:55-9:05 Exercise: Navigating IBM SPSS Modeler
9:05-9:25 Exercise: Predictive in 20 Minutes
9:25-9:40 Data Mining Methodology and Application
9:40-9:55 Break
9:55-10:55 Exercise: Data Mining Techniques
10:55-11:25 Exercise: Text Analytics
11:25-11:50 Deployment
11:50-12:00 Wrap-up
Purpose of the Workshop
Introduction to predictive analytics and data mining
Stimulate thinking about how data mining would benefit your organization
Demonstrate ease of use of powerful technology
Get experience in “doing” data mining
See examples of how other organizations are benefitting from deploying
predictive analytics
Instrumented
Interconnected
Intelligent
Smarter Planet
What is Predictive Analytics?
Predictive Analytics helps connect data
to effective action by drawing reliable
conclusions about current conditions
and future events
Gareth Herschel, Research Director, Gartner Group
Our Ultimate Goal is to Ensure Your Success
Madrid reduced
emergency response
time by 25%
Analytics decreased
fraud in the Federal
government by
50%
Analytics helped
reduce Hamilton
County’s dropout
rate by 25%
Lancaster uses
predictive policing
models to reduce
crime, saving the city
$1.3M a year.
Florida
Juvenile
Justice reduced
delinquency in
schools by 34%
Predictive analytics helped
slash Memphis’s crime
rate by 40% in one
year
Predictive Analytics in Public Sector
Crime Analyses
Force Deployment
Lead Generation
Hot Spot Analyses
Fraud detection and prevention
• Money laundering
• Network intrusion
• Tax audits & collection
Entity Resolution
Text Analytics
Social Network Analysis
Education
• Which among our students are at risk?
• Who are the most promising applicants?
• Which alumni will donate and how much?
• How can my institution plan for future development?
Corrections
• Which inmates are at risk of recidivism?
• Why do some inmates return?
• Which programs are successful?
IBM SPSS Modeler: A Quick Overview
IBM SPSS Modeler
High performance data mining and text analytics workbench
Used for the proactive
• Identification of fraud, waste and abuse
• Reduction of costs
• Identification of risk
• Students at risk of failing
• Inmates at risk of returning
• Patients at risk of relapse
• Forecasting demographic shifts and migration
Allows analytics to be repeated and integrated
Predictive in 20 Minutes: A Quick Exercise
Exercise: Predictive in 20 Minutes
Goal:
Create a model to identify who is at risk of a heart attack
Approach:
Use patient data which contains various health and behavioral
information
Define which fields to use
Choose the modeling technique
Automatically generate a model to identify who is at risk
Review results
Why?
For public health policy implications, by proactively identifying and
quantifying high-risk behaviors and practices
Break - Please Return in 15 Minutes
IBM SPSS Modeler: One Analytical Workbench – Endless Techniques
Data Mining Methodology
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Describes Components of Complete Data Mining Project Cycle
Shows Iterative Nature of Data Mining
Vendor and Industry Neutral
Data Mining Techniques
Technique: Classification (or prediction)
Usage:
• Used to predict group membership (e.g., will this employee leave?) or a number (e.g., how many widgets will I sell?)
Algorithms:
• Auto Classifiers, Decision Trees, Logistic, SVM, Time Series, etc.
Technique: Segmentation
Usage:
• Used to classify data points into groups that are internally homogeneous and externally heterogeneous
• Identify cases that are unusual
Algorithms:
• Auto Clustering, K-means, etc.
• Anomaly detection
Technique: Association
Usage:
• Used to find events that occur together or in a sequence (e.g., market basket)
Algorithms:
• Apriori, CARMA, Sequence
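The association technique listed above (finding events that occur together, as in a market basket) can be illustrated with a minimal Python sketch. It is not the Apriori or CARMA implementation in IBM SPSS Modeler; it only shows the core idea of counting support and pruning infrequent items before counting pairs. The transactions, the function names, and the 0.5 support threshold are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (invented for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk", "eggs"},
    {"butter", "eggs"},
    {"bread", "milk", "butter"},
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs with support >= min_support.

    Support is the fraction of transactions that contain both items.
    """
    n = len(transactions)
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count pairs built only from frequent items (the pruning idea
    # behind Apriori: a frequent pair can only contain frequent items).
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(t & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    with_a = [t for t in transactions if antecedent in t]
    if not with_a:
        return 0.0
    return sum(consequent in t for t in with_a) / len(with_a)

pairs = frequent_pairs(transactions, min_support=0.5)
bread_to_milk = confidence(transactions, "bread", "milk")
```

In this toy data only bread and milk co-occur often enough to pass the threshold, and every basket with bread also contains milk. Real association models also handle longer item sets and sequences, which this sketch leaves out.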
Additional Data Mining Techniques
Technique: Text Analytics
Usage:
• Used to discover patterns resident in text or other unstructured data (e.g., sentiment analysis)
Algorithms:
• Natural Language Processing
• Parts of Speech Analysis
Technique: Entity Analytics
Usage:
• Used to determine which cases are likely the same actor, and which seemingly identical cases are actually independent
Algorithms:
• Context Accumulation
Technique: Social Network Analysis
Usage:
• Used to uncover associations which may exist between cases, and identify central or influential actors
IBM SPSS Modeler: Segmentation Modeling
Segmentation Modeling
Goal:
Discover natural groupings or clusters of alumni donors
Approach:
Alumni data from a university
Define which fields to use
Use K-Means Clustering to generate a model to group alumni
Appendix: Use these clusters to predict donation
Why?
Better alumni understanding (demographics, socio-economic status, etc.)
Tailored messages for each group/segment
Personal and more relevant for alumni
Institutional Planning
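The K-Means step in this exercise can be sketched in plain Python. This is a minimal version of the standard algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its members), not the K-Means node in IBM SPSS Modeler. The alumni fields (years since graduation, number of gifts in the last five years), the data values, and the choice of two clusters are hypothetical. A real analysis would normally standardize the fields first so that no single field dominates the distance.

```python
import math

# Hypothetical alumni rows: (years since graduation, gifts in last 5 years).
alumni = [(2, 0), (30, 9), (3, 1), (28, 8), (4, 0), (32, 10)]

def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Deterministic start for reproducibility: the first k points are
    # the initial centroids (real tools use smarter initialization).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(alumni, k=2)
```

Here the toy data separates into recent graduates with few gifts and long-standing alumni with many gifts. Each centroid acts as a profile of its segment, which is the kind of summary the tailored messaging described above would build on.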
IBM SPSS Modeler: Entity Analytics
Entity Analytics
Suppose that you have the following records from two different sources,
and are not sure whether they refer to the same person or different people.
Source 1
Record no.: 70001
Name: Jon Smith
Address: 123 Main Street
Driv. License: 0001133107
Source 2
Record no.: 9103
Name: JOHNATHAN Smith
Date of Birth: 06/17/1934
Telephone: 555-1212
Email: [email protected]
IP address: 9.50.18.77
Source 3
Record no.: 6251
Name: Jon Smith
Telephone: 555-1212
Driv. License: 0001133107
No exact matches between the first two records.
However, if we introduce a third source, we find some common attributes:
the driver's license number (DL) matches Source 1, and the telephone
number matches Source 2.
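The linking logic in this example can be sketched with a small union-find in Python. The records are the three from the slide, limited to the fields needed here. The rule that a shared driver's license or telephone number is enough to link two records is an assumption for illustration, and the function and field names are hypothetical. This is not how the Entity Analytics component of IBM SPSS Modeler is implemented.

```python
records = [
    {"source": 1, "record_no": "70001", "name": "Jon Smith",
     "address": "123 Main Street", "driver_license": "0001133107"},
    {"source": 2, "record_no": "9103", "name": "JOHNATHAN Smith",
     "date_of_birth": "06/17/1934", "telephone": "555-1212"},
    {"source": 3, "record_no": "6251", "name": "Jon Smith",
     "telephone": "555-1212", "driver_license": "0001133107"},
]

# Assumption: a shared value in either field is strong enough to link records.
MATCH_FIELDS = ("driver_license", "telephone")

def resolve(records, match_fields=MATCH_FIELDS):
    """Group record numbers into entities via union-find on shared identifiers."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for field in match_fields:
            value = rec.get(field)
            if value is None:
                continue  # a missing field can never produce a match
            key = (field, value)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(records[idx]["record_no"])
    return sorted(groups.values())

two_sources = resolve(records[:2])   # Source 1 and Source 2 only
three_sources = resolve(records)     # Source 3 added
```

With only the first two sources the records stay separate. Adding the third source links all three into one entity, because it shares an identifier with each of the others.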
Entity Analytics
Fields Used in EA Resolution    % Missing in PD Database
LAST                            0.7
FIRST                           0.4
MIDDLE                          63.9
RACE                            0.1
SEX                             0.1
DOB                             2.4
ADDR                            0.9
DRLIC                           63.8
PHONE                           59.1
3634 Suspects Results
IBM SPSS Modeler: Classification Modeling
Classification model
Goal:
Identify students likely to persist
Approach:
Use student performance scores and other demographics
Define which fields to use
Use the Auto Classifier to choose the appropriate modeling technique
Review results
Why?
Identify students likely to persist into their second year
Conversely, same methods can be used to identify students at risk of
attrition (or prisoners at risk of recidivism, or patients likely to respond to
treatment)
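One way to picture what "choosing the appropriate modeling technique" involves is to fit several candidate models and compare them on held-out data. The Python sketch below does this with two deliberately simple candidates. It does not reproduce the Auto Classifier node in IBM SPSS Modeler, whose internals the workshop does not describe. The student data, the single GPA field, and all function names are hypothetical.

```python
# Hypothetical toy data: (first-term GPA, persisted into second year? 1 or 0).
train = [(3.6, 1), (3.1, 1), (2.9, 1), (2.2, 0),
         (1.8, 0), (3.4, 1), (2.0, 0), (2.7, 1)]
holdout = [(3.3, 1), (2.1, 0), (2.8, 1), (1.9, 0)]

def fit_majority(rows):
    """Baseline: always predict the most common class in the training rows."""
    ones = sum(label for _, label in rows)
    majority = 1 if ones * 2 >= len(rows) else 0
    return lambda gpa: majority

def fit_stump(rows):
    """One-split decision stump: predict 1 when GPA >= the best threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({gpa for gpa, _ in rows}):
        # Count training rows this threshold would classify correctly.
        correct = sum((gpa >= threshold) == bool(label) for gpa, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda gpa: int(gpa >= best_threshold)

def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(model(gpa) == label for gpa, label in rows) / len(rows)

# Fit every candidate on the training rows, score it on the holdout rows,
# and keep the best scorer.
candidates = {"majority": fit_majority, "stump": fit_stump}
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
```

Scoring on rows that were not used for fitting keeps the comparison honest, since a model can always look good on the data it was trained on. The same pattern applies to the attrition, recidivism, and treatment-response variants mentioned above.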
IBM SPSS Modeler: Text Analytics
The Importance of Text
Because people communicate with
words, not numbers, it has become
critical to be able to mine text for its
meaning and to sort, analyse, and
understand it in the same way that data
has been tamed. In fact, the two basic
types of information complement each
other, with data supplying the “what”
and text supplying the “why”.
Source: IDC, “Text Analytics: Software’s Missing Piece?”
Text Analytics
Turn unstructured officer notes and narratives into useable and searchable
context-rich content with Text Analytics.
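As a rough illustration of making free-text notes searchable, the Python sketch below builds a keyword index over a few notes. This is plain keyword indexing, well short of the natural language processing that a text analytics product applies. The notes, the stopword list, and the function names are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical officer notes (invented for illustration only).
notes = {
    "N1": "Suspect fled on foot near Main Street after the burglary.",
    "N2": "Witness reported a burglary; blue sedan seen leaving Main Street.",
    "N3": "Traffic stop, blue sedan, no further action.",
}

# Assumption: a tiny hand-picked stopword list is enough for this example.
STOPWORDS = {"a", "the", "on", "no", "after", "near"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_index(notes):
    """Map each term to the set of note ids that contain it."""
    index = defaultdict(set)
    for note_id, text in notes.items():
        for term in tokenize(text):
            index[term].add(note_id)
    return index

def search(index, query):
    """Return the sorted note ids that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

index = build_index(notes)
sedan_notes = search(index, "blue sedan")
burglary_notes = search(index, "burglary Main Street")
```

Once the notes are indexed, the extracted terms can also be joined to structured records as extra fields, which is the kind of integration with data mining that the workshop describes.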
Data Mining and Text Analytics
Data Mining
• Use advanced analytical techniques on data
• Discover key relationships between variables
• Model effect of variables on outcomes
• Determine influence on outcomes
• Predict outcomes
• Apply models to new data
Text Analytics
• Extract, analyze and create structure for unstructured data
• Integrate analysis results into operational systems
• Integrate analysis results into Business Intelligence applications
• Integrate analysis results with structured data and use as input for Data Mining
• Improves model accuracy
Deployment: Many Options
Why IBM SPSS?
Workshop Takeaways
Easy to use, visual interface
• Short timeframe to be productive with actionable results
• Does not require knowledge of programming language
Business results focused
• Cost effective solution that delivers powerful results across organization
• Flexible licensing and deployment options
• Full range of algorithms for your business problems
End-to-end solution
• Data preparation through real time interactions
• Use structured, unstructured and survey data
• Full suite of products, from data collection through deployment
Flexible architecture
• Leverages the investments already made in technology
• Does not require data in a proprietary format or DB
• Structured and unstructured data
• Open architecture (both inputs and outputs)
• SQL Pushback
94% of clients achieved a positive ROI, with an
average payback period of 10.7 months
Key benefits achieved include reduced costs, increased
productivity, improved citizen & employee satisfaction and
safety.
81% of projects deployed on time, 75% on or under budget
Nucleus Research: The Real ROI from IBM
“This is one of the highest ROI scores Nucleus has ever seen in its Real ROI series of research reports.”
Rebecca Wettemann, Vice President of Research, Nucleus Research
Appendix
Data Mining Overview
From Amazon.com
– Paperback: 512 pages
– Publisher: Wiley; 1 edition (December 28, 1999)
– Language: English
– ISBN-10: 0471331236
– ISBN-13: 978-0471331230
Good introductory text on data mining for marketing from two top communicators in the field
Statistical Analysis and Data Mining
Handbook of Statistical Analysis and
Data Mining Applications
Robert Nisbet, John Elder IV, and Gary
Miner
Academic Press (2009)
ISBN-10: 0123747651
An excellent guide to many aspects of
data mining including Text mining.
Data Mining Algorithms
From Amazon.com
– Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
– by Eibe Frank, Ian H. Witten
– Paperback, 416 pages (October 13, 1999)
– Morgan Kaufmann Publishers
– ISBN: 1558605525
Best book I’ve found in between highly technical and introductory books. Good coverage of topics, especially trees and rules, but no neural networks.