A New Era for Predictive Analytics with SPSS


© 2012 IBM Corporation

The Mining Metaphor

Gold Mining, Diamond Mining, Data Mining

What is Data Mining?

An early definition: finding patterns in your data which you can use to do your business better

– It's about patterns
– It's about something you can use: practical things
– It's about business

A recent definition

▪ Business-oriented discovery of patterns across all forms of data
▪ Produces insight and a predictive capability
▪ Deployment of predictions throughout the enterprise

What is Data Mining?

Information Retrieval + Information Extraction + Information Analysis

Discover new, previously unknown information

IBM SPSS Supports the Predictive Enterprise Delivering Profitable Revenue Growth & Operational Efficiency

▪ Capture a complete perspective
– Survey customers & constituents
– Leverage structured, semi-structured & unstructured data

▪ Predict behavior and preferences
– Statistics for deeper insight
– Data & text mining for predictive modeling

▪ Act on results
– Deploy scoring models for dynamic decisions
– Directly affect business process with event integration

IBM SPSS: Our core value proposition

SPSS' goal is to apply analytics to optimize decisions at every contact point, made possible by enabling pervasive, predictive real-time decisions at the point of impact

SPSS Predictive Analytic Platform

▪ SPSS Data Collection
– Collects additional attitudinal data for advanced analytics, typically gathered through surveys

▪ SPSS Statistics
– Expands analytics capabilities to the professional business user / statistician
– Adds advanced statistical analysis to PM

▪ SPSS Modeler
– Provides predictive analytics using data mining & text mining methods for key parts of the business
– Predicts future outcomes and explains what influences them

▪ SPSS Deployment & Collaboration Services
– Analytical asset management across multiple analysts
– Audit, security, refresh
– Provides a web service interface

▪ SPSS Analytic Server
– Provides Big Data connectivity to SPSS Modeler
– Translates SPSS Modeler Server requests into Hadoop jobs

▪ SPSS Analytical Decision Manager
– Business scenario analysis
– Complex rules for operational decision management

SPSS Modeler 16 Editions

• SPSS Modeler GOLD: Enables organizations to build predictive models to improve business processes and help people or systems make the right decisions each time. It combines and integrates predictive analytics, rules, scoring, and optimization techniques to deliver recommended actions at the point of impact.

SPSS Modeler Premium + C&DS + Analytical Decision Management

• SPSS Modeler Premium: Offers a range of advanced algorithms and capabilities including text analytics, entity analytics, social network analysis, and automated modeling and preparation techniques to address a multitude of business problems and analytic requirements on almost any type of data.

SPSS Modeler Professional + Text Analytics Workbench

• SPSS Modeler Professional: Includes a range of advanced algorithms, data manipulation, and automated modeling and preparation techniques to build predictive models and uncover hidden patterns in structured data.

R is gaining in popularity. Do not walk away from R opportunities: it is not a competitor.

Are you ready?

▪ EMBRACE:
– Integrate R algorithms (e.g. Random Forest)
– Generate R charts
– Use R functions for data preparation
– Make R available for non-programmers

▪ EXTEND:
– Scalability (e.g. database pushback)
– Leverage R engines of other vendors like SAP HANA
– Enterprise deployment
– Big Data (Analytic Server)

Introducing CRISP-DM Methodology & SPSS Modeling Techniques

Modeler Interface

Stream Canvas

Stream, Outputs & Model Manager

Palettes Nodes


Visual Programming with Modeler

- Visual programming
- Based on icons ("nodes")
- Pick nodes from palette & place them on the bench
- Edit their attributes
- Connect to specify flow of data ("streams")

Yellow Nugget or Yellow Diamond

Is the result of generating a predictive model

Can be exported to PMML to be reused outside of Modeler, for example in Java applications, SAS, or IBM InfoSphere Streams using the Data Mining Toolkit, …

CRoss-Industry Standard Process for Data Mining

1. Business Understanding: Project objectives and requirements understanding, data mining problem definition

2. Data Understanding: Initial data collection and familiarization, data quality problems identification

3. Data Preparation: Table, record and attribute selection, data transformation and cleaning

4. Modeling: Modeling techniques selection and application, parameters calibration

5. Evaluation: Business objectives & issues achievement evaluation

6. Deployment: Result model deployment, repeatable data mining process implementation

2. Data Understanding

Initial data collection and familiarization, data quality problems identification

Reading Data

Modeler reads a variety of different file types, including data stored in spreadsheets and databases, using the nodes within the Sources palette.

Getting to Know your Data

Data Audit Node, Distribution Node, Histogram Node, …
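To give a feel for what a data audit reports, here is a minimal Python sketch that summarizes each field of a small record set (valid and missing counts, plus min, max and mean for numeric fields). It is not the Data Audit Node itself; the function name `audit`, the sample `customers` records and the rule that `None` and empty strings count as missing are assumptions made for illustration.

```python
from statistics import mean


def audit(records):
    """Summarize each field: valid/missing counts, plus basic stats.

    Hypothetical helper for illustration, not Modeler's Data Audit Node.
    """
    fields = sorted({name for rec in records for name in rec})
    report = {}
    for name in fields:
        values = [rec.get(name) for rec in records]
        # Assumption: None and empty strings are treated as missing.
        valid = [v for v in values if v is not None and v != ""]
        summary = {"valid": len(valid), "missing": len(values) - len(valid)}
        numeric = [v for v in valid
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if numeric and len(numeric) == len(valid):
            # Numeric field: report range and mean.
            summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        else:
            # Categorical field: report number of distinct values.
            summary["distinct"] = len(set(valid))
        report[name] = summary
    return report


# Made-up sample data.
customers = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": None, "region": "south"},
    {"id": 3, "age": 51, "region": ""},
    {"id": 4, "age": 29, "region": "north"},
]

audit_report = audit(customers)
for field, summary in audit_report.items():
    print(field, summary)
```

A report like this makes data quality problems (here, one missing age and one missing region) visible before any modeling starts.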

3. Data Preparation

Table, record and attribute selection, data transformation and cleaning

Data Manipulation in Modeler

To prepare the data before analysis:
• Eliminate missing values
• Remove unwanted fields from analysis
• Derive new fields
• Merge and match data

Intermediate nodes in Modeler:
• Record operation nodes
• Field operation nodes

▪ CLEM language is case sensitive
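The four preparation steps on this slide can be sketched outside Modeler in plain Python. This is only an illustration of the ideas, not how record and field operation nodes are implemented; the function `prepare`, the field names (`age`, `fax`, `total_spend`, `age_band`) and the sample data are all hypothetical.

```python
def prepare(customers, orders, drop_fields=("fax",)):
    """Apply the four preparation steps to lists of dict records.

    Hypothetical helper; field names are assumptions for illustration.
    """
    # Merge and match: total order amount per customer id.
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]

    prepared = []
    for rec in customers:
        # Eliminate missing values: skip records with no age.
        if rec.get("age") is None:
            continue
        # Remove unwanted fields from analysis.
        row = {k: v for k, v in rec.items() if k not in drop_fields}
        # Attach the merged order total (0.0 when the customer has no orders).
        row["total_spend"] = totals.get(rec["id"], 0.0)
        # Derive a new field from an existing one.
        row["age_band"] = "under_40" if rec["age"] < 40 else "40_plus"
        prepared.append(row)
    return prepared


# Made-up sample data.
customers = [
    {"id": 1, "age": 34, "fax": "n/a"},
    {"id": 2, "age": None, "fax": "n/a"},
    {"id": 3, "age": 51, "fax": "n/a"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.5},
    {"customer_id": 3, "amount": 12.0},
]

prepared = prepare(customers, orders)
for row in prepared:
    print(row)
```

Note that dictionary keys such as `age` are case sensitive here as well, which mirrors the slide's reminder about CLEM.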

CLEM language: The Expression Builder


4. Modeling

Modeling techniques selection and application, parameters calibration

Sampling or Partitioning your Data

• May not want to use all records
• Score your model with the remaining data
• May wish to examine a subgroup separately
• May assist us with building a predictive model (oversampling)
• Keep in mind that the sampling method must fit the problem at hand

- If customers are similar and you want to reduce the size of the dataset for modelling, you can use simple sampling.
- But if you want to sample directly from a database with customers of different types, you may want to draw a complex sample.
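The difference between a simple partition and a sample that respects customer types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Modeler's Sample or Partition node: the function names, the 70/30 split, the fixed seed and the `segment` field are hypothetical, and stratified sampling is shown as just one simple form of what the slide calls a complex sample.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Simple random split into a training set and a held-out testing set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def stratified_sample(records, key, per_group, seed=42):
    """Draw the same number of records from each group defined by `key`.

    Taking equal counts per group also balances rare groups, which is
    the effect oversampling aims for.
    """
    rng = random.Random(seed)
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Made-up data: 90 retail customers and 10 corporate customers.
records = [
    {"id": i, "segment": "corporate" if i % 10 == 0 else "retail"}
    for i in range(100)
]

train, test = partition(records)
balanced = stratified_sample(records, "segment", per_group=10)
print(len(train), len(test), len(balanced))
```

The model is built on `train` and scored on `test`, the remaining data; `balanced` shows how a rare customer type can be given equal weight.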

Matching Data to the Modeling Tool

• For example, if we want to use Rule Induction, we will need to think about:
- How the algorithm handles missing data
- The output that is created (binary versus larger splits)
- What we are trying to predict (numeric target or binary?)
- Which format the input predictors have to be in

Modeling Techniques in Modeler

• Supervised Techniques (Predictive Models): model an output variable based on several input variables, to predict future cases where the outcome is unknown
- Neural Networks, Rule Induction (C5.0, CHAID, QUEST & C&RT)
- Decision List, Binary Classifier
- Linear Regression, Logistic Regression and Discriminant
- Generalized Linear Models

• Unsupervised Techniques (Clustering): no field to predict, used to group similar records within the data
- Kohonen Networks, K-Means, Two Step, Anomaly

• Association Rules: search for things that typically occur together
- APRIORI, CARMA, GRI and SLRM

• Data Reduction:
- PCA/Factor Analysis, Feature Selection

• Sequence Detection Models:
- Sequence

• Time Series

• Text Mining

SPSS Modeling Techniques: Association Models

Association Models

– Association rules search for things (events, purchases, attributes) that typically occur together in the data
– They find the patterns in data that you could manually find using visualization techniques such as the web node (yikes!) but can do so much faster and can explore more complex patterns
– Used to answer questions such as:
• Do customers who buy fruit usually buy cheese?
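The fruit-and-cheese question can be made concrete with the three standard measures of an association rule: support, confidence and lift. The sketch below evaluates a single rule on made-up baskets; it does not search for rules the way Apriori or CARMA do, and the function name `rule_metrics` and the basket data are assumptions for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent in b)
    has_c = sum(1 for b in baskets if consequent in b)
    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n == 0 or has_a == 0 or has_c == 0:
        raise ValueError("rule cannot be evaluated on this data")
    support = has_both / n            # share of all baskets containing both items
    confidence = has_both / has_a     # share of antecedent baskets that also hold the consequent
    lift = confidence / (has_c / n)   # above 1 means they co-occur more often than by chance
    return support, confidence, lift


# Made-up shopping baskets.
baskets = [
    {"fruit", "cheese", "bread"},
    {"fruit", "cheese"},
    {"fruit", "milk"},
    {"cheese", "wine"},
    {"bread", "milk"},
    {"fruit", "cheese", "wine"},
    {"milk"},
    {"fruit", "bread"},
]

support, confidence, lift = rule_metrics(baskets, "fruit", "cheese")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.375 confidence=0.60 lift=1.20
```

In this toy data, 60% of fruit buyers also buy cheese, against 50% of all customers, so the lift of 1.2 suggests a mild positive association.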

Output

SPSS Modeling Techniques: Segmentation Models

Segmentation or Clustering Models

– Clustering techniques segment data into groups of cases/records/customers that have similar patterns of input fields
– Used in market segmentation studies whose aim is to find distinct types of customers so they can be targeted more effectively
– Used to answer questions such as:
• How can I group my customers to target the right marketing campaign?
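As an illustration of how records with similar input fields end up in the same group, here is a bare-bones k-means loop in Python. It is a sketch of the general algorithm, not Modeler's K-Means node: the naive initialization (first k points), the two input fields and the sample data are assumptions chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Assign each point to the nearest of k centroids, then update the centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: naive init from first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Made-up customers: (visits per month, average basket value).
points = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.4)]

assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
print(centroids)
```

The two resulting groups (occasional low spenders and frequent high spenders) are the kind of segments a marketing campaign could then address separately.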

Clusters Output

SPSS Modeling Techniques: Classification & Statistical Models

Predictive or Classification Models

– Algorithms that are used to make predictions or forecasts based on historical data
– Automatic classification lets the software determine the best algorithm, or customers can choose a specific algorithm such as Neural Networks, Logistic Regression, Time Series, etc.
– Used to answer questions such as:
• What predicts whether a customer will leave?
• What predicts whether this employee will be a super-star?
• How many umbrellas will I sell in the next three months in Chicago?

Output


5. Evaluation

Business objectives & issues achievement evaluation

6. Deployment

Result model deployment, repeatable data mining process implementation

Deployment Family: Products

▪ IBM SPSS Collaboration and Deployment Services
– A foundation for managing and deploying analytics

▪ IBM SPSS Analytical Decision Management
– Integrates analytics and business knowledge to deliver optimal outcomes

IBM SPSS Modeler Deployment Options

▪ Client (Desktop)
– Access local files
– Connect to operational databases
– Connect to Cognos BI
– Processing performed on local installation

▪ Client/Server
– Data operations/processing on server
– In-database data mining
– SQL pushback for PureData and Hadoop platforms
– Modeler Batch
– SuSE Linux Enterprise Server 10 (zLinux)
– Inclusion in Smart Analytics System for Power (AIX)

What’s New & Hot


Predictive Analytics for Big Data: Get more accurate models with bigger volume and variety of data

- Read data from Hadoop
- Write back to Hadoop
- Export your models to Streams
- Prepare your data on Hadoop
- A few models can run on Hadoop
- R analytic capabilities in SPSS

Bring Analytics on Big Data for Everyone

SPSS Analytics Catalyst

Automatic Summarization
• Top findings in data ranked by “interestingness” and association strength
• Plain language synopsis

Automatic Exploration
• Guided presentation by selecting fields of interest
• Dynamic visual insights
• Users can refine auto-generated parameters

Automatic Modeling
• Auto selection of best models and detection of strongest relationships: Decision Tree (CHAID) and Key Driver Reports (based on linear and logistic regression)

Sharing of Output
• Collaboration with peers
• Tablet optimization

Monte Carlo Simulation

- Generate simulated data
- Fit distributions from existing data
- Evaluate the simulation

Example Use Cases:
- A retailer wants to simulate alternative sales scenarios to identify which strategy will make them most likely to hit their targets
- A parts manufacturer is interested in modeling storage costs based on simulating different scenarios for future part orders against stock supplies and excess order fees
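The retailer use case can be sketched as a small Monte Carlo simulation in Python: fit a distribution to past monthly sales, generate many simulated years under each scenario, and evaluate how often the target is reached. Everything specific here is an assumption for illustration: the sales history, the choice of a normal distribution, the "promotion" scenario (5% higher mean, three times the variability) and the target of 1200.

```python
import random
from statistics import mean, stdev


def hit_probability(mu, sigma, target, months=12, trials=5000, seed=1):
    """Estimate the probability that simulated annual sales reach the target."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(trials):
        # Generate simulated data: one draw per month, floored at zero.
        total = sum(max(0.0, rng.gauss(mu, sigma)) for _ in range(months))
        if total >= target:
            hits += 1
    return hits / trials


# Fit a distribution from existing data (made-up monthly sales history).
history = [96, 104, 99, 101, 110, 90, 105, 95, 100, 100]
mu, sigma = mean(history), stdev(history)

# Hypothetical scenarios to compare.
scenarios = {
    "baseline": (mu, sigma),
    "promotion": (mu * 1.05, sigma * 3),
}
target = 1200.0

# Evaluate the simulation: share of simulated years that hit the target.
results = {
    name: hit_probability(m, s, target) for name, (m, s) in scenarios.items()
}
for name, probability in results.items():
    print(f"{name}: P(annual sales >= {target:.0f}) ~ {probability:.2f}")
```

With these assumed inputs, the promotion scenario reaches the target more often than the baseline even though it is riskier month to month, which is the kind of comparison the use case describes.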

Geospatial Data Mining: Understanding Geohashes

▪ Space-time Boxes use geohashes and timestamps to locate where and when entities exist

▪ A geohash is a unique identifier that uses latitude and longitude to create an alphanumeric string

▪ Its precision depends on its length; longer geohash = better precision

▪ For example, geohash dr5ru7 is midtown Manhattan...but how do we know?
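One way to answer "how do we know?" is to decode the geohash by hand. Each character carries 5 bits from a base-32 alphabet, and the bits alternately halve the longitude and latitude ranges, so the string narrows down to a bounding box. The sketch below follows the publicly documented geohash scheme; the function name `decode_geohash` is our own and is not a Modeler function.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return ((lat_min, lat_max), (lon_min, lon_max)) for a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half
            else:
                rng[1] = mid  # keep the lower half
            is_lon = not is_lon
    return tuple(lat_range), tuple(lon_range)


(lat_min, lat_max), (lon_min, lon_max) = decode_geohash("dr5ru7")
print(f"latitude  {lat_min:.4f} .. {lat_max:.4f}")
print(f"longitude {lon_min:.4f} .. {lon_max:.4f}")

# A shorter geohash covers a larger box, i.e. it is less precise.
(short_lat_min, short_lat_max), _ = decode_geohash("dr5ru")
print(short_lat_max - short_lat_min > lat_max - lat_min)  # True
```

The box for dr5ru7 spans roughly 40.754 to 40.759 degrees north and 73.993 to 73.982 degrees west, a cell of about a kilometre across in midtown Manhattan.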


What Exactly is a Space-time Box?

▪ Space-time Boxes extend geohashes to include a third dimension: time

dr5ru7|2013-01-01 00:00:00|2013-01-01 00:15:00
(Geohash | Start timestamp | End timestamp)

▪ Space-time Boxes ‘bin’ events in 3-D space and time
▪ Density (i.e. size) of the Space-time Box is a required input
▪ Can help analysts understand proximity between entities, verify relationships
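The binning idea can be sketched in Python: encode the position as a geohash, floor the timestamp to the start of its time bin, and join the parts in the layout shown on the slide. This is a hedged illustration, not Modeler's own Space-time Box function; the names `encode_geohash` and `space_time_box`, and the way density is expressed here (geohash length plus bin width in minutes), are assumptions.

```python
from datetime import datetime, timedelta

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def encode_geohash(lat, lon, length=6):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars, value, bits, is_lon = [], 0, 0, True
    while len(chars) < length:
        rng, coord = (lon_range, lon) if is_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value = value << 1
            rng[1] = mid
        is_lon = not is_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def space_time_box(lat, lon, when, length=6, minutes=15):
    """Build a 'geohash|start|end' key; layout copied from the slide example."""
    width = timedelta(minutes=minutes)
    day_start = when.replace(hour=0, minute=0, second=0, microsecond=0)
    start = day_start + ((when - day_start) // width) * width  # floor to the bin
    end = start + width
    fmt = "%Y-%m-%d %H:%M:%S"
    return f"{encode_geohash(lat, lon, length)}|{start.strftime(fmt)}|{end.strftime(fmt)}"


# Two made-up events, close together in space and time.
box_a = space_time_box(40.7565, -73.9874, datetime(2013, 1, 1, 0, 7, 30))
box_b = space_time_box(40.7580, -73.9855, datetime(2013, 1, 1, 0, 14, 59))
print(box_a)           # dr5ru7|2013-01-01 00:00:00|2013-01-01 00:15:00
print(box_a == box_b)  # True: both events fall into the same box
```

Because both events map to the same key, a simple equality join on the key is enough to find entities that were in the same place at about the same time.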

IBM SPSS Modeler Embraces R

1. SPSS Modeler allows the user to build and score R models within the Modeler interface

2. SPSS Modeler allows the use of R functions for data preparation and chart/output creation

3. The Custom Dialog Builder for R allows the user to create custom nodes that run R algorithms, functions, or outputs

4. These custom nodes can be shared with other users, who do not need to know any R code

Use R to build a custom node

The world of analytics made easy for everyone

Bouchra Denis Antoine Danil

I am Sandra, a data analyst.

USER

CODE

Sadly, SPSS Modeler cannot do EVERYTHING

SPSS Modeler Marketplace: App Store for Analytics

Spatial

Plot insightful interactive maps to explore your data

Visualize new patterns

Social

Enhance your client understanding with social data!

Analyse the public opinion!

Databases

Connect to noSQL databases!

Connect to Bluemix in 2 clicks!

Connect to bigSQL and Hadoop!

Models

For our Business Partner

Predict which customers will come back and how much they will spend

Implemented in a BI solution for a large retailer and generates enterprise-grade reporting

And many more!

Come to our booth to try them out

More than 30 new functionalities

Potential growth

A lot of code already available in packages

R is a widely used language

[Chart: Survey of use. Tools compared: R, IBM SPSS Statistics, Rapid Miner, SAS, Weka, Microsoft SQL Server, Matlab, IBM SPSS Modeler. Axis from 0 % to 70 %.]

Value

SPSS Modeler Marketplace

SPSS Modeler BRAND

SPSS Modeler USERS

IBM PARTNERS

NODE DEVELOPERS


Q&A