Predictive Analytics Workshop using IBM SPSS Modeler · Purpose of the Workshop Introduction to predictive analytics and data mining Stimulate thinking about how data mining would

© 2012 IBM Corporation

Predictive Analytics Workshop using IBM SPSS Modeler

ObjectivesSmarter Software for Smarter Cities

Agenda

8:30-8:40 Welcome and Introductions

8:40-8:55 Introduction to Predictive Analytics

8:55-9:05 Exercise: Navigating IBM SPSS Modeler

9:05-9:25 Exercise: Predictive in 20 Minutes

9:25-9:40 Data Mining Methodology and Application

9:40-9:55 Break

9:55-10:55 Exercise: Data Mining Techniques

10:55-11:25 Exercise: Text Analytics

11:25-11:50 Deployment

11:50-12:00 Wrap-up

Purpose of the Workshop

Introduction to predictive analytics and data mining

Stimulate thinking about how data mining would benefit your organization

Demonstrate ease of use of powerful technology

Get experience in “doing” data mining

See examples of how other organizations are benefitting from deploying

predictive analytics

Instrumented

Interconnected

Intelligent

Smarter Planet

What is Predictive Analytics?

Predictive Analytics helps connect data

to effective action by drawing reliable

conclusions about current conditions

and future events

Gareth Herschel, Research Director, Gartner Group

Our Ultimate Goal is to Ensure Your Success

Madrid reduced

emergency response

time by 25%

Analytics decreased

fraud in the Federal

government by

50%

Analytics helped

reduce Hamilton

County’s dropout

rate by 25%

Lancaster, uses

predictive policing

models to reduced

crime saving the city

$1.3M a year.

Florida

Juvenile

Justice reduced

delinquency in

schools by 34% Predictive analytics helped

slash Memphis’s crime

rate by 40% in one

year

Predictive Analytics in Public Sector

Crime Analyses Force Deployment

Lead Generation

Hot Spot Analyses

Fraud detection and prevention• Money laundering

• Network intrusion

• Tax audits & collection

Entity Resolution

Text Analytics

Social Network Analysis

Education• Which among our students are at

risk?

• Who are the most promising

applicants?

• Which alumni will donate and how

much?

• How can my institution plan for

future development?

Corrections• Which inmates are at risk of

recidivism?

• Why do some inmates return?

• Which programs are successful?

IBM SPSS Modeler A Quick Overview

IBM SPSS Modeler

High performance data mining and text analytics workbench

Used for the proactive

• Identification of fraud, waste and abuse

• Reduction of costs

• Identification of risk

• Students at risk of failing

• Inmates at risk of returning

• Patients at risk of relapse

• Forecasting demographic shifts and migration

Allows analytics to be repeated and integrated

IBM SPSS Modeler

IBM SPSS Modeler

IBM SPSS Modeler

IBM SPSS Modeler

Predictive in 20 Minutes A Quick Exercise

Exercise: Predictive in 20 Minutes

Goal:

Create a model to identify who are at risk of heart attack

Approach:

Use patient data which contains various health and behavioral

information

Define which fields to use

Choose the modeling technique

Automatically generate a model to identify who are at risk

Review results

Why?

For public health policy implications, by proactively identifying and

quantifying high-risk behaviors and practices

Break - Please Return in 15 Minutes

IBM SPSS ModelerOne Analytical Workbench – Endless Techniques

Data Mining Methodology

Cross-Industry Standard

Process Model for Data Mining

Describes Components of

Complete Data Mining Project

Cycle

Shows Iterative Nature of Data

Mining

Vendor and Industry Neutral

Technique Usage Algorithms

Classification

(or prediction)

• Used to predict group membership

(e.g., will this employee leave?) or a

number (e.g., how many widgets

will I sell?)

• Auto Classifiers,

Decision Trees,

Logistic, SVM, Time

Series, etc.

Data Mining Techniques


Classification

(or prediction)




will I sell?)


Decision Trees,

Logistic, SVM, Time

Series, etc.

Segmentation • Used to classify data points into

groups that are internally

homogenous and externally

heterogeneous

• Identify cases that are unusual

• Auto Clustering, K-

means, etc.

• Anomoly detection



Classification

(or prediction)




will I sell?)


Decision Trees,

Logistic, SVM, Time

Series, etc.

Segmentation • Used to classify data points into

groups that are internally

homogenous and externally

heterogeneous.

• Identify cases that are unusual

• Auto Clustering, K-

means, etc.

• Anomoly detection

Association • Used to find events that occur

together or in a sequence (e.g.,

market basket)

• APRIORI, Carma,

Sequence



Text Analytics • Used to discover patterns resident

in text or other unstructured data

(e.g., sentiment analysis)

• Natural Language

Processing

• Parts of Speech

Analysis

Entity Analytics • Used to determine which cases are

likely the same actor, and which

seemingly identical cases are

actually independent

• Context

Accumulation

Social Network

Analysis

• Used to uncover associations which

may exist between cases, and

identify central or influential actors

Additional Data Mining Techniques

IBM SPSS ModelerSegmentation Modeling

Segmentation Modeling

Goal:

Discover natural groupings or clusters of alumni donors

Approach:

Alumni data from a university


Use K-Means Clustering to generate a model to group alumni

Appendix: Use these clusters to predict donation

Why?

Better alumni understanding (demographics, socio-economic etc)

Tailored messages for each group/segment

Personal and more relevant for alumni

Institutional Planning

IBM SPSS ModelerEntity Analytics

Entity Analytics Suppose that you have the following records from two different sources,

and are not sure whether they refer to the same person or different people.

Source 1

Record no.: 70001

Name: Jon Smith

Address: 123 Main Street

Driv. License: 0001133107

Source 2

Record no.: 9103

Name: JOHNATHAN Smith

Date of Birth: 06/17/1934

Telephone: 555-1212

Email: [email protected]

IP address: 9.50.18.77.

Source 3

Record no.: 6251

Name: Jon Smith

Telephone: 555-1212

Driv. License: 0001133107

DL

Telephone

No exact matches between the

two records.

However, if we introduce a third

source, we find some common

attributes

Entity Analytics

Fields Used in

EA Resolution

% Missing in

PD Database

LAST 0.7

FIRST 0.4

MIDDLE 63.9

RACE 0.1

SEX 0.1

DOB 2.4

ADDR 0.9

DRLIC 63.8

PHONE 59.1

3634 Suspects Results

IBM SPSS ModelerClassification Modeling

Classification model

Goal:

Identify students likely to persist

Approach:

Use student performance scores and other demographics


Use the Auto Classifier to choose the appropriate modeling technique

Review results

Why?

Identify students likely to persist into their second year

Conversely, same methods can be used to identify students at risk of

attrition (or prisoners at risk of recidivism, or patients likely to respond to

treatment)

IBM SPSS ModelerText Analytics

The Importance of Text

Because people communicate with

words, not numbers, it has become

critical to be able to mine text for its

meaning and to sort, analyse, and

understand it in the same way that data

has been tamed. In fact, the two basic

types of information complement each

other, with data supplying the “what”

and text supplying the “why”.

Source IDC: “Text Analytics: Software’s Missing Piece?”

Text Analytics

Turn unstructured officer notes and narratives into useable and searchable

context-rich content with Text Analytics.

Data Mining and Text Analytics

Use advanced analytical

techniques on data

Discover key relationships

between variables

Model effect of variables on

outcomes

Determine influence on outcomes

Predict outcomes

Apply models to new data

Extract, analyze and create

structure for unstructured data

Integrate analysis results into

operational systems

Integrate analysis results into

Business Intelligence applications

Integrate analysis results with

structured data and use as input

for Data Mining

Improves model accuracy

Data Mining Text Analytics

DeploymentMany Options

Why IBM SPSS?

Workshop Takeaways

Easy to use, visual interface Short timeframe to be productive with actionable results Does not require knowledge of programming language

Business results focused Cost effective solution that delivers powerful results across organization Flexible licensing and deployment options Full range of algorithms for your business problems

End-to-end solution Data preparation through real time interactions Use structured, unstructured and survey data Full suite of products, from data collection through deployment

Flexible architecture Leverages the investments already made in technology Does not require data in a proprietary format or DB Structured and unstructured data Open architecture (both inputs and outputs) SQL Pushback

94% of clients achieved a positive ROI, with an

average payback period of 10.7 months

Key benefits achieved include reduced costs, increased

productivity, improved citizen & employee satisfaction and

safety.

81% of projects deployed on time, 75% on or under budget

Nucleus Research: The Real ROI from IBM

“This is one of the highest ROI scoresNucleus has ever seen in its Real ROI series of research reports.”Rebecca Wettemann, Vice President of Research, Nucleus Research

Appendix

Data Mining Overview

From Amazon.com

– Paperback: 512 pages – Publisher: Wiley; 1 edition

(December 28, 1999) – Language: English – ISBN-10: 0471331236 – ISBN-13: 978-0471331230 ;

Good introductory text on data mining for marketing from two top communicators in the field

Statistical Analysis and Data Mining

Handbook of Statistical Analysis and

Data Mining Applications

Robert Nisbet, John Elder IV, and Gary

Miner

Academic Press (2009)

ISBN-10: 0123747651

An excellent guide to many aspects of

data mining including Text mining.

http://www.amazon.com/gp/reader/0123747651/ref=sib_dp_pt

Data Mining Algorithms

From Amazon.com

– Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations

– by Eibe Frank, Ian H. Witten– Paperback - 416 pages

(October 13, 1999) – Morgan Kaufmann Publishers; – ISBN: 1558605525;

Best book I’ve found in between highly technical and introductory books. Good coverage of topics, especially trees and rules, but no neural networks.

Documents

Predictive Analytics Workshop using IBM SPSS Modeler · Purpose of the Workshop Introduction to predictive analytics and data mining Stimulate thinking about how data mining would