Upload
kelly-satterthwaite
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
Data Mining with Analysis Services
Carlos BossyPrincipal ConsultantMCTS, MCITP BI
Aabcom Solutionswww.aabcomsolutions.comwww.carlosbossy.com
My Background
25 years Software Development 15 years software companies as Programmer, Architect, Manager,
Director, VP, CTO 5+ years Business Intelligence Consultant
Data Warehouse deployment for State of WA Child Welfare Data Integration/Warehouse architecture for Solar Energy company Data Warehouse model for State of OR Data Mining model for Houston sports league
Current Projects
Experience
Applications
ReportsData Integration
Overview
Make Data Mining an integral component of your Data Architecture.
Session Overview
What is Data Mining Data Mining Algorithms in SQL Server Creating a Data Mining Model Using the Model in an Application Using the Model in SSIS Awareness of Data Mining
Architecture and Process
Application Architecture
Application
Service Business Data
Relational Database
Application Architecture with BI
Application
Service Business Data
Relational Database
Data Integration
Data Warehouse Cube Text
Warehouse
Performance Management
Reports
Analysis
Data Mining
Text Mining
Ad-hoc
Data Mining
If you want to…
Predict the future Get rich buying stocks Win the lottery Win your Fantasy Football League
Data Mining is for you!
Why Mine Data?
Really - Why Mine Data?
U n c o v e r H i d d e n R e l a ti o n s h i p s
F i n d s o m e t h i n g u n u s u a l o r u n e x p e c t e d
I m p r o v e u p o n d o m a i n e x p e r t ’ s k n o w l e d g e
M a n a g e l a r g e d a t a s e t s
C r e a t e P r e d i c ti v e A n a l y ti c s p l a tf o r m
Maximize value of Data
Competitive Edge
Data Mining Defined
Data Mining is the process of sorting through large amounts of data and picking out relevant information (Wikipedia).
Data Mining is the extraction of hidden predictive information from large databases.
The Process of Knowledge Discovery Sophisticated Statistical Model Discovery of Patterns and Relationships
It is not Dredging, Snooping or an Invasion of Privacy
Today vs. Yesterday
Explosion of Data doubles every 3 years (Moore’s Law)
Data Volumes can’t be comprehended by humans
Uncover complex and difficult to find patterns for competitive edge
Improve professional judgment of Domain Expert (small but valuable)
Knowledge Discovery Converting Data to Information N Infinity
FutureLegacy
Manageable Volumes of Data The power of SQL Domain Experts could grasp and
analyze a complete Database Limited CPU Horsepower Finite
Outcomes
Predict number of runs Colorado Rockies would score in their next game using Neural Network.
InputPrevious 12 games runs scoredHome/Away
PredictNumber runs next game
ModelHome Runs = (G7*.142)+(G1*.118)+…Away Runs = (G7*.129)+(G1*.091)+…
Non-linear Results
Fitting a line to a set of data points to measure the effect of a single independent variable.
y = mx + b Statistical Methods
Data Mining
Outcomes – Decision Tree
Annual Income = 108,491.880+6.394*(Savings Balance-27,898.703)-2,443.789*(Avg Check Size-59.555)-0.247*(Credit Card Limit-8,250.000)+33.430*(Age-38.500)-1,032.285*(Over Drafts-1.000)
Outcomes – Time Series
Total Gen-1 < 36276.430 and Flow Date < 3/2/2007 4:58:07 AM and Flow Date < 8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM
Total Gen = 31452.534 + 0.126 * Total Gen(-1)
Total Gen-1 < 36276.430 and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date < 5/16/2007 7:46:52 PM and Flow Date >= 3/15/2007 12:56:15 PM
Total Gen = 6186.992 + 0.361 * Total Gen(-2) + 0.602 * Total Gen(-1)
Applications
o Credit Risk Analysis o Churn Analysiso Customer Retentiono Targeted Marketingo Market Basket Analysiso Sales Forecastingo Stock Predictionso Medical Diagnosiso Bioscience Research
o Surveyso Insurance Rate Quoteso Credit Card Fraudo Web Site Eventso Loan Applicationso Hiring and Recruitingo Cross-Marketingo Social Scienceo Economics
Data Mining Sources
SSASCube
DW
Data Mining
OLTP Database
Table
Col 1Col 2Col 3
SQL Server
Oracle
MySQL
Etc.
Cleanse
Massage
Select Training Set
Apply DM Algorithm
Train
Test
Training the Model
Using the Model
Customer
CustomerIDCustName
Mining Model
CheckingBalanceSavingsBalanceNumTransactions
MonthlyBankAccount
BankAccountIDCheckingBalanceSavingsBalanceNumTransactionsCustomerID
PredictionIncome
SQL Join
DMX Prediction Join
Result
Prediction JoinDMX
SELECT Predict([Movie Purchases],3) as MoviesFrom [Customer Movie Association]NATURAL PREDICTION JOIN(SELECT 44 AS [Age], 'Female' AS [Gender], 'Rent' AS [Home Ownership], 'Divorced' AS [Marital Status], (SELECT 'Mrs. Doubtfire' AS [Movie Title] UNION SELECT 'My Big Fat Greek Wedding' AS [Movie Title] UNION SELECT 'Patriot Games' AS [Movie Title]) AS [Movie Purchases]) AS t
Approaches
Clustering Classification
Regression Market Basket Analysis
Mining Algorithms
Time Series Naïve Bayes Association Clustering Decision Trees Logistic Regression Clustering Sequence Clustering Neural Networks
Data Mining Algorithms
Analytical problem Examples AlgorithmsClassification: Assign cases to predefined classes
Credit risk analysisChurn analysisCustomer retention
Decision TreesNaive BayesNeural Nets
Segmentation: Taxonomy for grouping similar cases
Customer profile analysisMailing campaign
ClusteringSequence Clustering
Association: Advanced counting for correlations
Market basket analysisAdvanced data exploration
Decision TreesAssociation
Time Series Forecasting: Predict the future
Forecast salesPredict stock prices
Time Series
Prediction: Predict a value for a new case based on values for similar cases
Quote insurance ratesPredict customer income
All
Deviation analysis: Discover how a case or segment differs from others
Credit card fraud detectionNetwork infusion analysis
All
* Andy Cheung, Microsoft
Time Series
Uses Autoregression + Decision Tree to build model Each time series is a single case No prediction join with test or actual cases Prediction is always the same for given time slots
Analyzes how a variable changes over time.
Time Series Tree
All
Stock(t-5) <= 22.83
Stock(t-5) >
22.83
Stock(t-1) <= 30.15
Stock(t-1) >
30.15
Stock = 6.85 + .62*Stock(t-1) + .21*Stock(t-2)
Node Regression Formula
Naïve Bayes
Probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions.
Simple Classification Algorithm Good starting point for better understanding of your data Uses only discrete data
Naïve Bayes Example
Cell Phone Service
Gender Premium Service
Custom Ring Tones International Calls
Female 53% 19% 56% 27%
Male 47% 41% 14% 38%
Premium Svc = Yes Ring Tones = Yes International = No
Likelihood of Female = .53 * .19 * .56 * .73 = .0412Likelihood of Male = .47 * .41 * .14 * .62 = .0167
P(Female) = .0412/(.0412 + .0167) = 71.2%P(Male) = .0167/(.0412 + .0167) = 28.8%
Association Rules
Detect relationships or associations between specific values of categorical variables in large data sets.
Uses only Discrete Data
Rule - Attribute value conditions that occur frequently together in a given dataset {Male, IT, Star Wars} {Star Trek}
Itemset - A set of attribute values.
Support - Total number of transactions.
Confidence - Probability that {X} {Y}
Importance (Lift) –Interestingness. Measure of whether Correlation is positive, negative or none.
Logistic Regression
Predict the probability of a discrete outcome from a set of variables that are continuous, discrete or both.
Non-linear regression model that produces results between 0 and 1. Popular in health science for disease prediction. Marketing uses for dichotomous predictions (buy or not buy, renew or
cancel). Same as Neural Network without the hidden layer.
Probability = 1/1 + e-z where z = c + yx1 + yx2 + …
If x1 = Weight and x2 = Age thenRisk of heart attack = 1/1 + e-z where z = -2.4 + 1.3x1 - .7x2
Decision Tree
Graphical representation displaying options, risks and the decision-making sequence.
Most popular data mining model. East to visualize because of it’s graphical representation. Branches represent
choices with associated risks, costs, results, or probabilities. Each test examines the value of a single column in the data and uses it to
determine the next test to apply. The results of all tests determine which label to predict.
Similar to human thought process when making a decision. Finds non-linear relationships. Supports classification, regression and association within the model.
Neural Network
Classifies large and complex data sets by grouping cases together in a way loosely based on the brain.
Most sophisticated algorithm but difficult to interpret. Works well with non-linear data and finds smooth non-linear relationships. Modeled as a group of interconnected nodes. No agreed upon definition. Microsoft algorithm is one of many techniques. Can build multiple models based on discrete inputs.
I
I
I
H
H
H
H
O
O
Back to Input layer after weights adjusted for error
Clustering
Places data elements into related groups without advance knowledge of the group definitions.
Good starting point for better understanding of your data. Finds the hidden variable that accurately classifies data. Data grouped into clusters have a high similarity based on
the attribute values.
Sequence Clustering
Discovers the properties of sequences by grouping them into clusters and assigning them to one of the clusters.
Hybrid of sequence and clustering techniques. Typically used with web and event logs as data sources.
Demo
Conclusion
Slow Adoption Where do you start? Science + Art Not quite A.I.
… yet!
More Info and ReferencesTDWI – The Data Warehousing InstituteACM, IEEEBooks: Data Mining with SQL Server 2005/8 (Wiley)
Mining the Talk (IBM Press)Data Mining know it all (Morgan/Kaufman)