Data Mining with Analysis Services Carlos Bossy Principal Consultant MCTS, MCITP BI Aabcom Solutions

Data Mining with Analysis Services

Carlos Bossy
Principal Consultant
MCTS, MCITP BI

Aabcom Solutions
www.aabcomsolutions.com
www.carlosbossy.com

http://www.aabcomsolution.com/

http://www.carlosbossy.com/

My Background

25 years Software Development
15 years at software companies as Programmer, Architect, Manager, Director, VP, CTO
5+ years as Business Intelligence Consultant

Data Warehouse deployment for State of WA Child Welfare
Data Integration/Warehouse architecture for Solar Energy company
Data Warehouse model for State of OR
Data Mining model for Houston sports league

Current Projects

Experience

Applications

Reports

Data Integration

Overview

Make Data Mining an integral component of your Data Architecture.

Session Overview

What is Data Mining
Data Mining Algorithms in SQL Server
Creating a Data Mining Model
Using the Model in an Application
Using the Model in SSIS
Awareness of Data Mining

Architecture and Process

Application Architecture

Application

Service Business Data

Relational Database

Application Architecture with BI

Application

Service Business Data

Relational Database

Data Integration

Data Warehouse Cube Text

Warehouse

Performance Management

Reports

Analysis

Data Mining

Text Mining

Ad-hoc

Data Mining

If you want to…

Predict the future
Get rich buying stocks
Win the lottery
Win your Fantasy Football League

Data Mining is for you!

Why Mine Data?

Really - Why Mine Data?

Uncover Hidden Relationships

Find something unusual or unexpected

Improve upon domain expert's knowledge

Manage large datasets

Create Predictive Analytics platform

Maximize value of Data

Competitive Edge

Data Mining Defined

Data Mining is the process of sorting through large amounts of data and picking out relevant information (Wikipedia).

Data Mining is the extraction of hidden predictive information from large databases.

The Process of Knowledge Discovery
Sophisticated Statistical Model
Discovery of Patterns and Relationships

It is not Dredging, Snooping or an Invasion of Privacy

Today vs. Yesterday

Explosion of Data: volume doubles every 3 years (a growth curve like Moore's Law)

Data Volumes can’t be comprehended by humans

Uncover complex and difficult to find patterns for competitive edge

Improve professional judgment of Domain Expert (small but valuable)

Knowledge Discovery
Converting Data to Information
N → Infinity

Future

Legacy

Manageable Volumes of Data
The power of SQL
Domain Experts could grasp and analyze a complete Database
Limited CPU Horsepower
Finite

Outcomes

Predict the number of runs the Colorado Rockies would score in their next game using a Neural Network.

Input: Previous 12 games runs scored, Home/Away

Predict: Number of runs next game

Model:
Home Runs = (G7*.142)+(G1*.118)+…
Away Runs = (G7*.129)+(G1*.091)+…

Non-linear Results

Fitting a line to a set of data points to measure the effect of a single independent variable.

y = mx + b

Statistical Methods
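A minimal sketch of fitting y = mx + b to a set of points by ordinary least squares. The function name `fit_line` and the sample points are assumptions made for illustration; they are not from the deck.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical sample points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

A single straight line like this measures the effect of one independent variable, which is the limitation the slide contrasts with non-linear data mining results.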

Data Mining

Outcomes – Decision Tree

Annual Income = 108,491.880+6.394*(Savings Balance-27,898.703)-2,443.789*(Avg Check Size-59.555)-0.247*(Credit Card Limit-8,250.000)+33.430*(Age-38.500)-1,032.285*(Over Drafts-1.000)
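The regression formula above can be evaluated directly. The sketch below only transcribes the slide's coefficients; the function and parameter names are assumptions, and the sample customer is hypothetical.

```python
def predicted_annual_income(savings_balance, avg_check_size,
                            credit_card_limit, age, over_drafts):
    """Evaluate the node regression formula shown on the slide."""
    return (108491.880
            + 6.394 * (savings_balance - 27898.703)
            - 2443.789 * (avg_check_size - 59.555)
            - 0.247 * (credit_card_limit - 8250.000)
            + 33.430 * (age - 38.500)
            - 1032.285 * (over_drafts - 1.000))


# A customer sitting exactly at the reference values gets the base income.
baseline = predicted_annual_income(27898.703, 59.555, 8250.000, 38.500, 1.000)

# Hypothetical customer: same profile, but one additional overdraft.
one_more_overdraft = predicted_annual_income(27898.703, 59.555, 8250.000, 38.500, 2.000)
print(round(baseline, 3), round(one_more_overdraft, 3))
```

Each coefficient reads as the change in predicted income per unit change in that input, with the other inputs held at their reference values.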

Outcomes – Time Series

Total Gen-1 < 36276.430 and Flow Date < 3/2/2007 4:58:07 AM and Flow Date < 8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM

Total Gen = 31452.534 + 0.126 * Total Gen(-1)

Total Gen-1 < 36276.430 and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date < 5/16/2007 7:46:52 PM and Flow Date >= 3/15/2007 12:56:15 PM

Total Gen = 6186.992 + 0.361 * Total Gen(-2) + 0.602 * Total Gen(-1)

Applications

o Credit Risk Analysis
o Churn Analysis
o Customer Retention
o Targeted Marketing
o Market Basket Analysis
o Sales Forecasting
o Stock Predictions
o Medical Diagnosis
o Bioscience Research

o Surveys
o Insurance Rate Quotes
o Credit Card Fraud
o Web Site Events
o Loan Applications
o Hiring and Recruiting
o Cross-Marketing
o Social Science
o Economics

Data Mining Sources

SSAS Cube

DW

Data Mining

OLTP Database

Table

Col 1, Col 2, Col 3

SQL Server

Oracle

MySQL

Etc.

Cleanse

Massage

Select Training Set

Apply DM Algorithm

Train

Test

Training the Model

Using the Model

Customer: CustomerID, CustName

Mining Model: CheckingBalance, SavingsBalance, NumTransactions

MonthlyBankAccount: BankAccountID, CheckingBalance, SavingsBalance, NumTransactions, CustomerID

Prediction: Income

SQL Join

DMX Prediction Join

Result

Prediction Join (DMX)

-- Predict the top 3 movie recommendations for a single new case.
SELECT Predict([Movie Purchases], 3) AS Movies
FROM [Customer Movie Association]
NATURAL PREDICTION JOIN
(SELECT 44 AS [Age],
        'Female' AS [Gender],
        'Rent' AS [Home Ownership],
        'Divorced' AS [Marital Status],
        -- Nested table: movies this customer has already purchased.
        (SELECT 'Mrs. Doubtfire' AS [Movie Title]
         UNION SELECT 'My Big Fat Greek Wedding' AS [Movie Title]
         UNION SELECT 'Patriot Games' AS [Movie Title]) AS [Movie Purchases]) AS t

Approaches

Clustering
Classification
Regression
Market Basket Analysis

Mining Algorithms

Time Series
Naïve Bayes
Association
Clustering
Decision Trees
Logistic Regression
Sequence Clustering
Neural Networks

Data Mining Algorithms

Analytical problem | Examples | Algorithms

Classification: Assign cases to predefined classes | Credit risk analysis; Churn analysis; Customer retention | Decision Trees; Naive Bayes; Neural Nets

Segmentation: Taxonomy for grouping similar cases | Customer profile analysis; Mailing campaign | Clustering; Sequence Clustering

Association: Advanced counting for correlations | Market basket analysis; Advanced data exploration | Decision Trees; Association

Time Series Forecasting: Predict the future | Forecast sales; Predict stock prices | Time Series

Prediction: Predict a value for a new case based on values for similar cases | Quote insurance rates; Predict customer income | All

Deviation analysis: Discover how a case or segment differs from others | Credit card fraud detection; Network intrusion analysis | All

* Andy Cheung, Microsoft

Time Series

Uses Autoregression + Decision Tree to build model
Each time series is a single case
No prediction join with test or actual cases
Prediction is always the same for given time slots

Analyzes how a variable changes over time.

Time Series Tree

All

Stock(t-5) <= 22.83

Stock(t-5) > 22.83

Stock(t-1) <= 30.15

Stock(t-1) > 30.15

Stock = 6.85 + .62*Stock(t-1) + .21*Stock(t-2)

Node Regression Formula
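A minimal sketch of applying the node regression formula above to produce a forecast. It uses only the one leaf formula shown on the slide and ignores the tree splits; the function names and the starting prices are assumptions for illustration.

```python
def next_stock(stock_t1, stock_t2):
    """Node formula from the slide: Stock = 6.85 + .62*Stock(t-1) + .21*Stock(t-2)."""
    return 6.85 + 0.62 * stock_t1 + 0.21 * stock_t2


def forecast(history, steps):
    """Roll the formula forward, feeding each prediction back in as a lag."""
    values = list(history)
    for _ in range(steps):
        values.append(next_stock(values[-1], values[-2]))
    return values[len(history):]


# Hypothetical last two observed prices, oldest first.
predictions = forecast([10.0, 10.0], steps=2)
print([round(p, 3) for p in predictions])
```

Because the forecast depends only on the stored series and the fitted coefficients, the same time slot always yields the same prediction, as noted on the Time Series slide.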

Naïve Bayes

Probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions.

Simple Classification Algorithm
Good starting point for better understanding of your data
Uses only discrete data

Naïve Bayes Example

Cell Phone Service

Gender | Share of Customers | Premium Service | Custom Ring Tones | International Calls
Female | 53% | 19% | 56% | 27%
Male | 47% | 41% | 14% | 38%

Premium Svc = Yes, Ring Tones = Yes, International = No

Likelihood of Female = .53 * .19 * .56 * .73 = .0412
Likelihood of Male = .47 * .41 * .14 * .62 = .0167

P(Female) = .0412/(.0412 + .0167) = 71.2%
P(Male) = .0167/(.0412 + .0167) = 28.8%
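The arithmetic above can be reproduced with a short sketch. The percentages come from the slide; the dictionary layout and names are assumptions. An attribute observed as "No" contributes 1 minus its "Yes" probability, which is where .73 and .62 come from.

```python
# Share of customers by gender (the prior), from the slide.
priors = {"Female": 0.53, "Male": 0.47}

# Probability that each attribute is "Yes", given gender, from the slide.
p_yes = {
    "Female": {"premium": 0.19, "ring_tones": 0.56, "international": 0.27},
    "Male": {"premium": 0.41, "ring_tones": 0.14, "international": 0.38},
}

# The case to classify: Premium = Yes, Ring Tones = Yes, International = No.
observed = {"premium": True, "ring_tones": True, "international": False}


def likelihood(gender):
    """Prior times the product of per-attribute probabilities (naive independence)."""
    result = priors[gender]
    for attribute, is_yes in observed.items():
        p = p_yes[gender][attribute]
        result *= p if is_yes else 1.0 - p
    return result


likelihoods = {gender: likelihood(gender) for gender in priors}
total = sum(likelihoods.values())
posteriors = {gender: value / total for gender, value in likelihoods.items()}

for gender in priors:
    print(gender, round(likelihoods[gender], 4), f"{posteriors[gender]:.1%}")
```

Working from unrounded likelihoods can shift the final percentages by about a tenth of a point compared with the slide, which divides the rounded values.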

Association Rules

Detect relationships or associations between specific values of categorical variables in large data sets.

Uses only Discrete Data

Rule - Attribute value conditions that occur frequently together in a given dataset, e.g. {Male, IT, Star Wars} → {Star Trek}

Itemset - A set of attribute values.

Support - Number of transactions that contain the itemset.

Confidence - Probability that {Y} occurs given {X}, for a rule {X} → {Y}.

Importance (Lift) - Interestingness. Measure of whether the correlation is positive, negative or none.
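A minimal sketch of support, confidence and lift for one rule, assuming support is counted as the number of transactions containing an itemset. The five transactions are hypothetical. The lift here is the classic ratio of confidence to the baseline probability of the right-hand side; the importance score reported by Analysis Services may be computed differently.

```python
# Hypothetical transactions; each is a set of attribute values.
transactions = [
    {"Male", "IT", "Star Wars", "Star Trek"},
    {"Male", "IT", "Star Wars", "Star Trek"},
    {"Male", "IT", "Star Wars"},
    {"Female", "Sales", "Star Trek"},
    {"Female", "IT", "Titanic"},
]


def support(itemset):
    """Number of transactions that contain every value in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs):
    """Estimated probability of rhs given lhs."""
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    """Above 1: positive correlation; below 1: negative; near 1: none."""
    return confidence(lhs, rhs) / (support(rhs) / len(transactions))


lhs = {"Male", "IT", "Star Wars"}
rhs = {"Star Trek"}
print(support(lhs | rhs), round(confidence(lhs, rhs), 3), round(lift(lhs, rhs), 3))
```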

Logistic Regression

Predict the probability of a discrete outcome from a set of variables that are continuous, discrete or both.

Non-linear regression model that produces results between 0 and 1.
Popular in health science for disease prediction.
Marketing uses it for dichotomous predictions (buy or not buy, renew or cancel).
Same as Neural Network without the hidden layer.

Probability = 1 / (1 + e^(-z)) where z = c + y1*x1 + y2*x2 + …

If x1 = Weight and x2 = Age then
Risk of heart attack = 1 / (1 + e^(-z)) where z = -2.4 + 1.3*x1 - .7*x2
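A minimal sketch of the logistic formula on this slide, read as 1 / (1 + e^(-z)). The coefficients -2.4, 1.3 and -.7 are the slide's; the function names are assumptions, and the inputs are assumed to be on a small, already-scaled range, since the slide does not give units for Weight or Age.

```python
import math


def logistic(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def heart_attack_risk(weight, age):
    """Slide example: z = -2.4 + 1.3*x1 - .7*x2, with x1 = Weight and x2 = Age."""
    z = -2.4 + 1.3 * weight - 0.7 * age
    return logistic(z)


# Hypothetical scaled inputs (units are not given in the deck).
print(round(heart_attack_risk(0.0, 0.0), 4))
print(round(heart_attack_risk(2.0, 1.0), 4))
```

Whatever the inputs, the result stays between 0 and 1, which is what lets it be read as a probability for a two-way outcome.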

Decision Tree

Graphical representation displaying options, risks and the decision-making sequence.

Most popular data mining model.
Easy to visualize because of its graphical representation.
Branches represent choices with associated risks, costs, results, or probabilities.
Each test examines the value of a single column in the data and uses it to determine the next test to apply.
The results of all tests determine which label to predict.
Similar to human thought process when making a decision.
Finds non-linear relationships.
Supports classification, regression and association within the model.
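The test-by-test walk described above can be sketched with a small hand-built tree. The columns, thresholds and labels are hypothetical and were not learned from data; a real model derives them during training.

```python
# Each internal node tests one column against a threshold; leaves carry a label.
tree = {
    "column": "Age", "threshold": 40,
    "low": {
        "column": "Savings Balance", "threshold": 10000,
        "low": {"label": "Low Income"},
        "high": {"label": "Medium Income"},
    },
    "high": {"label": "High Income"},
}


def predict(node, case):
    """Follow one branch per test until a leaf label is reached."""
    while "label" not in node:
        branch = "low" if case[node["column"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["label"]


print(predict(tree, {"Age": 30, "Savings Balance": 5000}))
print(predict(tree, {"Age": 30, "Savings Balance": 20000}))
print(predict(tree, {"Age": 50, "Savings Balance": 0}))
```

Only the columns along the chosen path are examined, so the third case is labeled without its savings balance ever being tested.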

Neural Network

Classifies large and complex data sets by grouping cases together in a way loosely based on the brain.

Most sophisticated algorithm but difficult to interpret.
Works well with non-linear data and finds smooth non-linear relationships.
Modeled as a group of interconnected nodes.
No agreed upon definition. Microsoft algorithm is one of many techniques.
Can build multiple models based on discrete inputs.

[Diagram: input layer (I, I, I), hidden layer (H, H, H, H), output layer (O, O)]

Back to Input layer after weights adjusted for error

Clustering

Places data elements into related groups without advance knowledge of the group definitions.

Good starting point for better understanding of your data.
Finds the hidden variable that accurately classifies data.
Data grouped into clusters have a high similarity based on the attribute values.
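A minimal sketch of grouping values without predefined group definitions, using one-dimensional k-means as a stand-in. The deck does not say which clustering technique Analysis Services applies, so treat the method, the function name, the sample balances and the starting centers as assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical account balances with two obvious groups.
balances = [100, 120, 110, 5000, 5200, 5100]
centers, groups = kmeans_1d(balances, centers=[0, 1000])
print(centers)
print(groups)
```

The two groups emerge from the similarity of the values alone; no one told the algorithm where small balances end and large balances begin.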

Sequence Clustering

Discovers the properties of sequences by grouping them into clusters and assigning them to one of the clusters.

Hybrid of sequence and clustering techniques.
Typically used with web and event logs as data sources.

Demo

Conclusion

Slow Adoption
Where do you start?
Science + Art
Not quite A.I. … yet!

More Info and References
TDWI: The Data Warehousing Institute
ACM, IEEE
Books:
Data Mining with SQL Server 2005/8 (Wiley)
Mining the Talk (IBM Press)
Data Mining: Know It All (Morgan Kaufmann)

