SAS Enterprise Miner™: What does the future hold?

David Duling, EM Development Director, SAS Inc.
Sascha Schubert, Product Manager Data Mining, SAS International

Copyright © 2002, SAS Institute Inc. All rights reserved.



Topics for Discussion:

• EM 4.2 / SAS 9.0: AF/SCL architecture
• EM 5.0 / SAS 9.1: 3-tier architecture
• Demo of the Alpha EM 5.0 Java UI


EM – Two Paths for Two Goals

• Evolutionary Development of Data Mining Functionality
  - Keep up the quality
  - Upgrade release for current sites
  - Stay on top of the market
• Revolutionary Development of Data Mining Architecture
  - Address scalability and performance
  - Address the limitations of current architecture
  - Make new architecture future-proof


Time Line: Project Mercury + DM

[Figure: release timeline. Dates shown: Apr 02, Jun 02, Nov 02, Feb 03. Milestones: DP, EA, LA, GA. Tracks: SAS V9 with EM 4.2 (Evolutionary Release); SAS V9.1 with EM 5.0 (Revolutionary Release).]


Goals for EM 4.2

• Maintain current product
• Fix known defects
• Evolve beta tools to production status
• Interactive Grouping
• Improve scalability (parallel processing)


EM 4.2: Evolve Beta Tools to Production Status

• Memory Based Reasoning
• DM Neural
• Two-Stage Model
• Time Series
• Link Analysis
• J-Score, XML


Interactive Grouping Node

• Was developed as part of Credit Scoring Solution
• Will be fully integrated in EM 4.2 / 5.0
• Used to calculate weights of evidence
  - Also useful for general interactive grouping
• Interactive grouping of variables into natural groups in relation to target
  - Now possible for class and interval variables
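
The weight-of-evidence calculation mentioned above can be sketched as follows. This is a minimal illustration, not the Interactive Grouping node's implementation: the class name, method name, and counts are hypothetical, and the sign convention (non-events over events) is one common choice that some texts reverse.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: weight of evidence (WOE) for each group of a grouped input. */
class WeightOfEvidence {

    /**
     * WOE for one group:
     *   ln( (nonEvents / totalNonEvents) / (events / totalEvents) ).
     * Assumption: non-events over events; the reciprocal convention also exists.
     */
    static double woe(long nonEvents, long events, long totalNonEvents, long totalEvents) {
        double shareNonEvents = (double) nonEvents / totalNonEvents;
        double shareEvents = (double) events / totalEvents;
        return Math.log(shareNonEvents / shareEvents);
    }

    public static void main(String[] args) {
        // Illustrative counts only: {non-events, events} per group.
        Map<String, long[]> groups = new LinkedHashMap<>();
        groups.put("age < 30", new long[] {200, 60});
        groups.put("age 30-50", new long[] {500, 30});
        groups.put("age > 50", new long[] {300, 10});

        long totalNonEvents = 0;
        long totalEvents = 0;
        for (long[] counts : groups.values()) {
            totalNonEvents += counts[0];
            totalEvents += counts[1];
        }

        for (Map.Entry<String, long[]> e : groups.entrySet()) {
            long[] c = e.getValue();
            System.out.printf("%-10s WOE = %.4f%n",
                    e.getKey(), woe(c[0], c[1], totalNonEvents, totalEvents));
        }
    }
}
```

A group whose share of non-events equals its share of events gets a WOE of zero. Groups with a zero count need separate handling (for example merging with a neighbor), which this sketch leaves out.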


Publishing Enterprise Miner Models via the Open Meta Server

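[Diagram: Enterprise Miner registers and saves models to the Open Meta Server. A WWW Server reads from the Open Meta Server and serves WWW clients over HTTP/JSP. WWW clients can search models and retrieve models (reports, score code).]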


Mining Model Repository

• SAS Code, C Code, Java Code
• Statistics, Charts, Reports
• Input and Output Variables described in XML
  - Fit and assessment statistics in SAS data sets
  - Formats, score, and macro code as SAS code
  - Metadata info about the model in a SAS catalog
  - Cscore code
  - Cscore meta information stored in XML
  - Fit and assessment statistics stored in CSV
  - Target and input data set info stored in text
• Process flow report in HTML format


Performance and Scalability

• XOT
  - Enables parallel input (read) of partitioned data sets
  - Using XOT for data I/O
• TK (Threaded Kernel)
  - Multi-threading, making use of multiple CPUs
  - TK for PROC DMDB, PROC DMINE (Vsel), PROC DMREG
  - Optional for all listed procedures


Scale-Up: PROC DMINE, Stones (S64)

[Chart: time versus number of threads (2, 4, 6, 8), comparing XOT-TK with unthreaded execution. 64-bit Solaris, 8 CPUs.]


Benchmarking TK (PROC DMDB)

                                             Multi-Threaded (4 Threads)   Single Threaded
Data set                                     real time    cpu time        real time    cpu time
5M obs, 2 interval vars                       1.51 s       4.92 s          6.50 s       6.50 s
100K obs, 50 class vars                      12.48 s      29.00 s         22.69 s      22.68 s
100K obs, 50 interval vars, 50 class vars     1.95 s       4.82 s         26.80 s      26.81 s
100K obs, 100 interval vars                   1.95 s       4.82 s          7.77 s       7.77 s
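
The benchmark figures on this slide can be turned into speedups with a little arithmetic. The sketch below is illustrative: the class and method names are hypothetical, and each multi-threaded/single-threaded pair is matched with the data set label that follows it on the slide, which is an assumption about the original layout.

```java
/** Hypothetical helper: derives speedup and CPU usage from the TK benchmark timings. */
class TkSpeedup {

    /** Speedup = single-threaded real time / multi-threaded real time. */
    static double speedup(double singleRealSec, double multiRealSec) {
        return singleRealSec / multiRealSec;
    }

    /** Average number of busy CPUs = cpu time / real time. */
    static double busyCpus(double cpuSec, double realSec) {
        return cpuSec / realSec;
    }

    public static void main(String[] args) {
        String[] labels = {
            "5M obs, 2 interval vars",
            "100K obs, 50 class vars",
            "100K obs, 50 interval + 50 class vars",
            "100K obs, 100 interval vars"
        };
        // Columns: multi-threaded real, multi-threaded cpu, single-threaded real (seconds).
        double[][] rows = {
            {1.51, 4.92, 6.50},
            {12.48, 29.00, 22.69},
            {1.95, 4.82, 26.80},
            {1.95, 4.82, 7.77}
        };
        for (int i = 0; i < rows.length; i++) {
            System.out.printf("%-40s speedup %.2fx, busy CPUs %.2f%n",
                    labels[i],
                    speedup(rows[i][2], rows[i][0]),
                    busyCpus(rows[i][1], rows[i][0]));
        }
    }
}
```

A cpu time larger than the real time, as in the multi-threaded column, indicates that more than one CPU was busy on average.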


EM 5.0 – The Future of Enterprise Miner™


Plans for EM 5.0

• Create a new 3-tier architecture
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java foundation services
    - Metadata services
    - Configuration management
  - Java client
    - API – Integration projects
    - GUI – Swing-based

Data Mining from everywhere


Goals for EM 5.0

• Create a new EM 5.0
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java middleware
    - Metadata services
    - Configuration management
  - Java client
    - API – Integration projects
    - GUI – Swing-based
• New procedures
  - PATH – production
  - ARBOR – production (replace split)
  - TAXONOMY – experimental
  - SVM – experimental
• Production version of MFC Tree viewer
  - PROC ARBOR
  - IOM procedure interface for interactive training
• Production Model Repository
• EM 5.0 model registration
• EM 4.2 model registration
• Web GUI
• Warehouse Admin. Scoring


Current AF / SCL Architecture

[Diagram components: SAS EM Client (EM 4.x classes, SAS Version 8.2); SAS Server (SAS Version 8.2); Project persistence; Data persistence.]

• SAS AF/SCL infrastructure
• Project stored locally on the Windows client as well as the SAS installation
• EM models trained on EM server (single threaded)


Distributed Architecture in EM 5.0: Data Mining

[Diagram components: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware); Project Data Persistence; Metadata Persistence; Java EM Client (EM 5.0 Java UI, EM 5.0 Java API).]


Distributed Architecture in EM 5.0: Reporting

[Diagram components: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware); Project Data Persistence; Metadata Persistence; JSP Server (EM 5.0 Java UI, EM 5.0 Java API); SAS Open Metadata Server; Web Client.]


Distributed Architecture in EM 5.0: Warehousing

[Diagram components: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware); Project Data Persistence; Metadata Persistence; JSP Server (EM 5.0 Java UI, EM 5.0 Java API); SAS Open Metadata Server; Web Client; Data Builder Java Client.]


EM 5.0 – Configuration Options

• Stand-alone client
  - SAS server, Java middleware, and GUI on the same machine
• Client–server
  - SAS server, Java middleware server; clients connect through the Java GUI
• Distributed computing
  - All components on different machines; users connect from anywhere


Reasons for n-tier Architecture

• Central administration
• Easier thin-client deployment
• Reduce client footprint
• Offers centralized location for file storage
• Improved security – control of all login processes
• Easier configuration
• More persistence options – controlled by administrator
• Better resource monitoring
  - Who’s using the system
  - How many processes are running

[Diagram: two configurations, each with Client 1, Client 2, SAS Server, and OMS; the second also includes an EM Server.]


New GUI Based on Java Swing

• Improved graphics
• Deployed through the web, allowing multiple user access
• Platform independent
• Server independent
• Configurable
• On-line help
• Extendable
• XML import/export of diagrams
• Start and stop processes


Sample EM 5.0 Results

[Figures: Exploratory Plots; Assessment Plots]


Interactive Tree Results Viewer


EM 5.0 Reporting

• SPK = SAS Publish and Subscribe
• SAS distributes a package reader
• Tables stored as CSV files => activate MS Excel
• Can be registered in OMS and Model Repository


Enhanced Performance

• Uses MP CONNECT technologies to distribute mining processes across multiple CPUs, providing the ability to run nodes in parallel.
• DMINE and DMREG procedures have been reengineered to take advantage of the TK and XOT frameworks of V9.
• Supports Stop Processing of an EM process.


EM 5.0 Performance

Event                                  Threads   Total
User 1 Connects                              1       1
User 2 Connects                              1       2
User 2 Starts process                        1       3
User 2 Disconnects                          -1       2
Process starts model 1 training              1       3
Process starts model 2 training              1       4
Model 2 starts four threads running          4       8
Model 2 completes                           -4       4
Process completes                           -3       1
User 2 Reconnects                            1       2

[Diagram: User 1 and User 2 connect through the Middleware to the SAS Server (four CPUs, server operating system). The server hosts IOM user sessions for user1 and user2, an IOM process session for user2, the SAS processes Train Model 1 and Train Model 2, and threads tk 1 to tk 4.]

• GUI sessions get a dedicated SAS/IOM workspace
• Model training gets a dedicated SAS/IOM workspace
• Parallel branches in a process flow run in dedicated SAS/IOM workspaces
• XOT procedures with the SPDS libname engine start multiple data read threads
• TK-enabled procedures start multiple computational threads
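
The Total column in the event table on this slide is a running sum of the Threads column. A minimal sketch of that bookkeeping follows; the class and method names are hypothetical and only the event deltas come from the slide.

```java
/** Hypothetical ledger: running total of server threads across session events. */
class ThreadLedger {

    /** Returns the number of active threads after each event, given per-event deltas. */
    static int[] runningTotals(int[] deltas) {
        int[] totals = new int[deltas.length];
        int active = 0;
        for (int i = 0; i < deltas.length; i++) {
            active += deltas[i];
            totals[i] = active;
        }
        return totals;
    }

    /** Returns the highest number of simultaneously active threads. */
    static int peak(int[] deltas) {
        int max = 0;
        for (int total : runningTotals(deltas)) {
            max = Math.max(max, total);
        }
        return max;
    }

    public static void main(String[] args) {
        String[] events = {
            "User 1 Connects", "User 2 Connects", "User 2 Starts process",
            "User 2 Disconnects", "Process starts model 1 training",
            "Process starts model 2 training", "Model 2 starts four threads running",
            "Model 2 completes", "Process completes", "User 2 Reconnects"
        };
        int[] deltas = {1, 1, 1, -1, 1, 1, 4, -4, -3, 1};
        int[] totals = runningTotals(deltas);
        for (int i = 0; i < events.length; i++) {
            System.out.printf("%-38s %+d -> %d%n", events[i], deltas[i], totals[i]);
        }
        System.out.println("Peak threads: " + peak(deltas));
    }
}
```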


EM 5.0 Batch Processing

• Java API/UI for batch processing
  - Runs in middleware
  - Opens existing workspace and starts training process
  - Loads XML diagram files
• XML files API
  - Save entire diagrams as XML files
  - Mail from one user to another
  - Scheduled execution
  - %EM5(xmlfile=) macro for running diagrams
• Data set API
  - Nodes data set: all nodes and properties
  - Connections data set: flow of logic from one node to another
  - Actions data set: nodes and actions to perform on nodes
  - Workspace data set: library and file locations
  - Variables metadata sets: input, target, rejected, etc…
  - %EM5(nodes=,connect=,…) macro for running diagrams
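
A nodes data set plus a connections data set is enough to derive an order in which a diagram's nodes can run. The sketch below illustrates that idea with Kahn's topological sort over in-memory lists. It is not how %EM5 processes its inputs; the class name, method name, and node names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derive a run order from nodes and connections tables. */
class DiagramOrder {

    /**
     * Kahn's algorithm. Each connection is a {from, to} pair of node ids.
     * Returns the nodes so that every node comes after all of its predecessors.
     */
    static List<String> runOrder(List<String> nodes, List<String[]> connections) {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        Map<String, List<String>> successors = new HashMap<>();
        for (String n : nodes) {
            inDegree.put(n, 0);
            successors.put(n, new ArrayList<>());
        }
        for (String[] c : connections) {
            successors.get(c[0]).add(c[1]);
            inDegree.merge(c[1], 1, Integer::sum);
        }

        // Nodes with no unprocessed predecessors are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String s : successors.get(n)) {
                if (inDegree.merge(s, -1, Integer::sum) == 0) {
                    ready.add(s);
                }
            }
        }
        if (order.size() != nodes.size()) {
            throw new IllegalStateException("diagram contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
                "Datasource", "Partition", "Regression", "Tree", "Assessment");
        List<String[]> connections = Arrays.asList(
                new String[] {"Datasource", "Partition"},
                new String[] {"Partition", "Regression"},
                new String[] {"Partition", "Tree"},
                new String[] {"Regression", "Assessment"},
                new String[] {"Tree", "Assessment"});
        System.out.println(runOrder(nodes, connections));
    }
}
```

In this example the Regression and Tree nodes have no dependency on each other, so they are candidates for running as parallel branches.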


EM 5.0 Batch Processing

• Compatible with all EM5 file structures
• Run the same diagram from UI or batch
• Automate model training from diagrams built in the GUI
• All SAS language capabilities
• Encapsulates EM processing
• BATCH.SAS always created for every node
• Automate creation of new diagrams
• Distribute diagrams
• Consulting: initial setup and delivery
• May include results, or not


EM 5.0 Batch Processing

• API to allow Java programs to call EM

    String ids_id = myWorkspace.addNode("Datasource");
    String reg_id = myWorkspace.addNode("Regression");
    myWorkspace.connectNode(ids_id, reg_id);
    myWorkspace.runNode(reg_id);
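
To make the call sequence on this slide concrete, here is a self-contained mock of a workspace object. It is a hypothetical stand-in, not the EM 5.0 Java API: only the method names addNode, connectNode, and runNode come from the slide, while the class name, the id format, and the run log are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the workspace object shown on the slide. */
class MockWorkspace {
    private final Map<String, String> nodeTypes = new LinkedHashMap<>();
    private final Map<String, List<String>> predecessors = new LinkedHashMap<>();
    private final List<String> runLog = new ArrayList<>();
    private int counter = 0;

    /** Adds a node of the given type and returns its generated id (format is an assumption). */
    String addNode(String type) {
        String id = type + "_" + (++counter);
        nodeTypes.put(id, type);
        predecessors.put(id, new ArrayList<>());
        return id;
    }

    /** Records that the output of fromId flows into toId. */
    void connectNode(String fromId, String toId) {
        if (!nodeTypes.containsKey(fromId) || !nodeTypes.containsKey(toId)) {
            throw new IllegalArgumentException("unknown node id");
        }
        predecessors.get(toId).add(fromId);
    }

    /** Runs all upstream nodes first, then the requested node. Assumes an acyclic diagram. */
    void runNode(String id) {
        if (runLog.contains(id)) {
            return; // each node runs at most once
        }
        for (String upstream : predecessors.get(id)) {
            runNode(upstream);
        }
        runLog.add(id);
    }

    /** Returns the ids of the nodes that have run, in execution order. */
    List<String> runLog() {
        return runLog;
    }

    public static void main(String[] args) {
        MockWorkspace myWorkspace = new MockWorkspace();
        String ids_id = myWorkspace.addNode("Datasource");
        String reg_id = myWorkspace.addNode("Regression");
        myWorkspace.connectNode(ids_id, reg_id);
        myWorkspace.runNode(reg_id);
        System.out.println(myWorkspace.runLog());
    }
}
```

Running a downstream node pulls in its upstream nodes, which is one plausible reading of why the slide only calls runNode on the regression node.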


Integrated with OMS and Data Builder

• OMS persists metadata about SAS servers, EM project locations, results packages, and data dictionaries for training tables.
• Scoring processes as well as input/output data sets can be defined and exchanged with other SAS companion products through registration of EM metadata and processes within the SAS OMR.


Other Major Enhancements

• New mining algorithms:
  - Support Vector Machines – popular algorithm for general classification problems
  - Web Path Analysis – provides efficient and scalable mining of frequent paths from click-stream data
  - Taxonomy – supports hierarchical associations to populate rules at different levels in the hierarchy
• Improved decision tree algorithm to enable interactive training on the server and provide improved performance on disk-resident data


New Procedures

• PROC PATH
• PROC SVM
• PROC ARBOR
• PROC TAXONOMY


New Path node (production)

• PROC PATH – a new procedure to mine frequent paths from preprocessed click-stream data
• Features:
  - Efficient, scalable and fast
  - Path completion – reintroduce missing requests (e.g., back button clicks)
  - Detecting path breaks – identify separate sub-paths
  - Generating longest contiguous sub-paths
  - Correctly handling page reload requests
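
Two of the ideas listed above, handling page reloads and counting frequent contiguous sub-paths, can be illustrated with a small sketch. This is not PROC PATH's algorithm: the class name, method names, and page names are hypothetical, reloads are approximated as consecutive repeats of the same page, and support is counted per session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of click-stream path counting. */
class PathSketch {

    /** Collapses consecutive repeats of the same page, treating them as reloads. */
    static List<String> collapseReloads(List<String> session) {
        List<String> out = new ArrayList<>();
        for (String page : session) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(page)) {
                out.add(page);
            }
        }
        return out;
    }

    /** Counts in how many sessions each contiguous sub-path of the given length occurs. */
    static Map<String, Integer> countSubPaths(List<List<String>> sessions, int length) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> raw : sessions) {
            List<String> pages = collapseReloads(raw);
            Set<String> seenInSession = new HashSet<>();
            for (int i = 0; i + length <= pages.size(); i++) {
                String key = String.join(" > ", pages.subList(i, i + length));
                if (seenInSession.add(key)) {
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sessions = Arrays.asList(
                Arrays.asList("home", "search", "cart"),
                Arrays.asList("home", "home", "search"),
                Arrays.asList("search", "cart", "cart"));
        System.out.println(countSubPaths(sessions, 2));
    }
}
```

A minimum support threshold applied to the resulting counts would then select the frequent paths.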


Path Analysis

• Improved customer experience
  - Tuning web-site structure based on browsing patterns
• Build customer relationships
  - Customizing content at individual or segment level
• Real-time target marketing
  - Cross-sell, up-sell product recommendations
  - Ad/Rebate placement
• Predict site abandonment
  - Browsing behavior as input to predictive modeling
  - Segmentation based on browsing behavior


Support Vector Machines (experimental)

• Supervised learning tool for creating functions from a set of labeled training data
  - A binary classifier
  - A general regression function
• Applications
  - Suitable for general classification problems
  - Text Categorization
  - Biosequence Analysis; Micro Arrays


SVM classification is achieved by a linear or nonlinear separating surface in the input space of the data set.

• Linear SVMs operate by finding a hypersurface (a hyperplane) in the space of possible inputs. This hypersurface attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples.

• If the training examples are not linearly separable, SVMs work by mapping the training data into a higher-dimensional feature space using an appropriate kernel function.
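
The two points above can be illustrated numerically. The sketch below evaluates a linear decision function and the distance of a point from the separating hyperplane, then shows an XOR-style data set that no line separates in two dimensions but that a hyperplane separates after an explicit feature map. It does not train an SVM and is unrelated to PROC SVM: the class name, the weights, and the feature map (x1, x2) -> (x1, x2, x1*x2) are all chosen by hand for illustration.

```java
/** Hypothetical sketch: linear decision function, margin distance, and a feature map. */
class SvmSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Linear decision value w.x + b; its sign is the predicted class. */
    static double decision(double[] w, double b, double[] x) {
        return dot(w, x) + b;
    }

    /** Geometric distance of x from the hyperplane w.x + b = 0. */
    static double distance(double[] w, double b, double[] x) {
        return Math.abs(decision(w, b, x)) / Math.sqrt(dot(w, w));
    }

    /** Explicit feature map (x1, x2) -> (x1, x2, x1 * x2). */
    static double[] featureMap(double[] x) {
        return new double[] {x[0], x[1], x[0] * x[1]};
    }

    public static void main(String[] args) {
        // XOR-style data in +/-1 coding: the label is the product of the two inputs.
        double[][] points = {{1, 1}, {-1, -1}, {1, -1}, {-1, 1}};
        int[] labels = {1, 1, -1, -1};

        // Hand-picked hyperplane in the mapped space: only the product feature matters.
        double[] w = {0, 0, 1};
        double b = 0;

        for (int i = 0; i < points.length; i++) {
            double value = decision(w, b, featureMap(points[i]));
            System.out.printf("x = (%.0f, %.0f), label %+d, decision %+.1f, distance %.1f%n",
                    points[i][0], points[i][1], labels[i], value,
                    distance(w, b, featureMap(points[i])));
        }
    }
}
```

In practice a kernel function computes inner products in the mapped space without building the mapped vectors explicitly; the explicit map here just keeps the example short.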


Other new Nodes/Procedures

• Taxonomy – Hierarchical associations (exp)
• ARBOR – Replacement for SPLIT
  - Support client/server interactive training
    - As an interactive procedure
    - As an engine for a client-side Windows application
  - Improved performance of disk-resident data
  - Documented at the level of SAS/STAT procedures
• All procedures will use a dynamic DMDB
  - No permanent physical DMDB data set is created


Early Adopters for EM 5

• Looking for Early Adopters in SeUGI time frame
• 5 – 20 sites worldwide – recommended from local offices
• Different regions and different industries
• Following scenarios


Early Adopters for EM 5

• Following scenarios desired
  - Distribute the EM Java thin client to multiple users that are geographically dispersed to test 3-tier architecture
  - Small to medium sized firm to evaluate EM 5.0 running entirely on a local client
  - Site to test Java API to integrate EM analytics and scoring services into site-specific mining applications
  - Site to test EM analytical deployment – test Model Repository
  - Sites with excellent statistical/AI modeling skills and applications to evaluate the new algorithms (SVM, Path analysis node, Interactive Tree, Hierarchical Associations)


EM 5.0 Summary

• Delivered as a modern, distributed client-server system for data mining
• Enables wide-area collaboration on data mining projects and extensive integration opportunities
• SAS server uses new parallel and multi-processing features of the SAS V9.0 system and includes an API for running data mining processes and for adding new data mining tools.
• Java middleware manages SAS server sessions, user identity, metadata, and report delivery.
• Data mining sessions can be created and managed through a Java API.
• The user interface is based on Java Swing libraries containing advanced graphics and visualization techniques
• New mining algorithms


EM Summary

• Provide renowned data mining functionality based on modern future-proof architecture
• Clear differentiation between data processing, metadata management and flexible user interface
• Architecture open for integration with other SAS and 3rd party applications
• Ensure backward compatibility by parallel maintenance of traditional AF solution


Other Data Mining Presentations at SeUGI

• Wed, 16:25, TKC, “Distributed Data Mining with SAS Enterprise Miner”
• Wed, 11:40, Analytical Expertise stream, “SAS Text Miner”
• Wed, 17:05, TKC, “SAS Text Mining”
• Analytical Demo Station in TKC


DEMO
