Copyright © 2002, SAS Institute Inc. All rights reserved.
SAS Enterprise Miner™: What does the future hold?
David Duling, EM Development Director, SAS Inc.
Sascha Schubert, Product Manager Data Mining, SAS International

Topics for Discussion:
• EM 4.2 / SAS 9.0: AF/SCL architecture
• EM 5.0 / SAS 9.1: 3-tier architecture
• Demo of the alpha EM 5.0 Java UI

EM – Two Paths for Two Goals
• Evolutionary development of data mining functionality
  - Keep up the quality
  - Upgrade release for current sites
  - Stay on top of the market
• Revolutionary development of data mining architecture
  - Address scalability and performance
  - Address the limitations of current architecture
  - Make new architecture future-proof

Time Line: Project Mercury + DM
[Figure: timeline with date markers Apr 02, Jun 02, Nov 02, and Feb 03 and milestone labels DP, EA, LA, and GA. One track shows SAS V9 with EM 4.2 (evolutionary release); the other shows SAS V9.1 with EM 5.0 (revolutionary release).]

Goals for EM 4.2
• Maintain current product
• Fix known defects
• Evolve beta tools to production status
• Interactive Grouping
• Improve scalability (parallel processing)

EM 4.2: Evolve Beta Tools to Production Status
• Memory Based Reasoning
• DM Neural
• Two-Stage Model
• Time Series
• Link Analysis
• J-Score, XML

Interactive Grouping Node
• Was developed as part of the Credit Scoring Solution
• Will be fully integrated in EM 4.2 / 5.0
• Used to calculate weights of evidence
  - Also useful for general interactive grouping
• Interactive grouping of variables into natural groups in relation to the target
  - Now possible for class and interval variables
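
The slide mentions weights of evidence without giving a formula. As a minimal sketch, assuming the definition commonly used in credit scoring (the log of a group's share of non-events divided by its share of events), the calculation for one group could look like this. The class and method names are hypothetical, and the sign convention is an assumption; some tools use the inverse ratio.

```java
// Hypothetical helper, not part of Enterprise Miner.
final class WeightOfEvidence {
    /**
     * Weight of evidence for one group, assuming the convention
     * ln( (groupNonEvents / totalNonEvents) / (groupEvents / totalEvents) ).
     */
    static double woe(long groupNonEvents, long groupEvents,
                      long totalNonEvents, long totalEvents) {
        if (groupNonEvents <= 0 || groupEvents <= 0
                || totalNonEvents <= 0 || totalEvents <= 0) {
            // Zero counts make the logarithm undefined; real tools usually adjust them.
            throw new IllegalArgumentException("all counts must be positive");
        }
        double shareNonEvents = (double) groupNonEvents / totalNonEvents;
        double shareEvents = (double) groupEvents / totalEvents;
        return Math.log(shareNonEvents / shareEvents);
    }
}
```

A group that holds the same share of events and non-events gets a weight of zero; a positive weight means the group is relatively richer in non-events.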

Publishing Enterprise Miner Models via the Open Meta Server
[Diagram: Enterprise Miner, Open Meta Server, WWW Server, and WWW clients, connected by arrows labelled Register, Save, Read, and HTTP/JSP.]
WWW clients:
• Search models
• Retrieve models
• Reports
• Score code

Mining Model Repository
• SAS code, C code, Java code
• Statistics, charts, reports
• Input and output variables described in XML
  - Fit and assessment statistics in SAS data sets
  - Formats, score, and macro code as SAS code
  - Metadata info about the model in a SAS catalog
  - Cscore code
  - Cscore meta information stored in XML
  - Fit and assessment statistics stored in CSV
  - Target and input data set info stored in text
  - Process flow report in HTML format

Performance and Scalability
• XOT
  - Enables parallel input (read) of partitioned data sets
  - Using XOT for data I/O
• TK (Threaded Kernel)
  - Multi-threading, making use of multiple CPUs
  - TK for PROC DMDB, PROC DMINE (Vsel), PROC DMREG
  - Optional for all listed procedures

Scale-Up: PROC DMINE, Stones (S64)
[Chart: time versus number of threads (2, 4, 6, 8) for XOT-TK and unthreaded runs; 64-bit Solaris, 8 CPUs.]

Benchmarking TK (PROC DMDB)

Data set                                     Multi-threaded (4 threads)   Single-threaded
                                             real / cpu time (seconds)    real / cpu time (seconds)
5M obs, 2 interval vars                      1.51 / 4.92                  6.50 / 6.50
100K obs, 50 class vars                      12.48 / 29.00                22.69 / 22.68
100K obs, 50 interval vars + 50 class vars   1.95 / 4.82                  26.80 / 26.81
100K obs, 100 interval vars                  1.95 / 4.82                  7.77 / 7.77
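
To read the benchmark figures above, a small sketch can help: speedup is the single-threaded real time divided by the multi-threaded real time, and the ratio of CPU time to real time gives a rough idea of how many CPUs were busy on average. The class and method names here are hypothetical; only the input numbers come from the slide.

```java
// Hypothetical helper for reading the benchmark figures; not SAS code.
final class TkSpeedup {
    /** Wall-clock speedup: single-threaded real time over multi-threaded real time. */
    static double speedup(double singleRealSeconds, double multiRealSeconds) {
        return singleRealSeconds / multiRealSeconds;
    }

    /** Rough average number of busy CPUs: CPU time over real time. */
    static double cpuToRealRatio(double cpuSeconds, double realSeconds) {
        return cpuSeconds / realSeconds;
    }

    public static void main(String[] args) {
        // 5M observations, 2 interval variables: 6.50 s single-threaded, 1.51 s with 4 threads.
        System.out.printf("speedup: %.2f%n", speedup(6.50, 1.51));
        // The same multi-threaded run used 4.92 s of CPU time.
        System.out.printf("cpu/real: %.2f%n", cpuToRealRatio(4.92, 1.51));
    }
}
```

For the 5M-observation case this gives a speedup of roughly 4.3 with four threads, and a CPU-to-real ratio of about 3.3.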
EM 5.0 – The Future of Enterprise Miner™

Plans for EM 5.0
• Create a new 3-tier architecture
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java foundation services
    - Metadata services
    - Configuration management
  - Java client
    - API: integration projects
    - GUI: Swing-based
Data Mining from everywhere

Goals for EM 5.0
• Create a new EM 5.0
  - SAS server
    - Batch and interactive modes
    - Use existing tools and expertise
  - Java middleware
    - Metadata services
    - Configuration management
  - Java client
    - API: integration projects
    - GUI: Swing-based
• New procedures
  - PATH: production
  - ARBOR: production (replaces SPLIT)
  - TAXONOMY: experimental
  - SVM: experimental
• Production version of MFC Tree viewer
  - PROC ARBOR
  - IOM procedure interface for interactive training
• Production Model Repository
  - EM 5.0 model registration
  - EM 4.2 model registration
  - Web GUI
  - Warehouse Admin. scoring

Current AF / SCL Architecture
[Diagram: SAS EM Client (EM 4.x classes, SAS Version 8.2) and SAS Server (SAS Version 8.2), with Project persistence and Data persistence stores.]
• SAS AF/SCL infrastructure
• Project stored locally on the Windows client as well as the SAS installation
• EM models trained on EM server (single threaded)

Distributed Architecture in EM 5.0: Data Mining
[Diagram: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware) with Project Data Persistence and Metadata Persistence; Java EM Client (EM 5.0 Java UI, EM 5.0 Java API).]

Distributed Architecture in EM 5.0: Reporting
[Diagram: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware) with Project Data Persistence and Metadata Persistence; JSP Server (EM 5.0 Java UI, EM 5.0 Java API); SAS Open Metadata Server; Web Client.]

Distributed Architecture in EM 5.0: Warehousing
[Diagram: Compute Server (SAS System); Middleware Server (EM 5.0 Java Middleware) with Project Data Persistence and Metadata Persistence; JSP Server (EM 5.0 Java UI, EM 5.0 Java API); SAS Open Metadata Server; Web Client; Data Builder Java Client.]

EM 5.0 – Configuration Options
• Stand-alone client
  - SAS server, Java middleware, and GUI on the same machine
• Client-server
  - SAS server and Java middleware server; clients connect through the Java GUI
• Distributed computing
  - All components on different machines; users connect from anywhere

Reasons for n-tier Architecture
• Central administration
• Easier thin-client deployment
• Reduce client footprint
• Offers centralized location for file storage
• Improved security: control of all login processes
• Easier configuration
• More persistence options, controlled by administrator
• Better resource monitoring
  - Who's using the system
  - How many processes are running
[Diagram: two layouts of Client 1, Client 2, SAS Server, and OMS; the second layout adds an EM Server.]

New GUI Based on Java Swing
• Improved graphics
• Deployed through the web, allowing multiple user access
• Platform independent
• Server independent
• Configurable
• On-line help
• Extendable
• XML import/export of diagrams
• Start and stop processes

Sample EM 5.0 Results
[Screenshots: Exploratory Plots; Assessment Plots]
Interactive Tree Results Viewer

EM 5.0 Reporting
• SPK = SAS Publish and Subscribe
• SAS distributes a package reader
• Tables stored as CSV files, which can be opened in MS Excel
• Can be registered in OMS and Model Repository

Enhanced Performance
• Uses MP CONNECT technologies to distribute mining processes across multiple CPUs, providing the ability to run nodes in parallel.
• DMINE and DMREG procedures have been reengineered to take advantage of the TK and XOT frameworks of V9.
• Supports Stop Processing of an EM process.

EM 5.0 Performance

Event                                  Threads   Total
User 1 connects                         1         1
User 2 connects                         1         2
User 2 starts process                   1         3
User 2 disconnects                     -1         2
Process starts model 1 training         1         3
Process starts model 2 training         1         4
Model 2 starts four threads running     4         8
Model 2 completes                      -4         4
Process completes                      -3         1
User 2 reconnects                       1         2
[Diagram: User 1 and User 2 connect through the middleware to a SAS server with four CPUs. On the server operating system run IOM user sessions for user1 and user2, an IOM process session for user2, the SAS processes "Train Model 1" and "Train Model 2", and TK threads tk 1 to tk 4.]
• GUI sessions get a dedicated SAS/IOM workspace
• Model training gets a dedicated SAS/IOM workspace
• Parallel branches in a process flow run in dedicated SAS/IOM workspaces
• XOT procedures with the SPDS libname engine start multiple data read threads
• TK-enabled procedures start multiple computational threads
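
The Total column in the event table on this slide is simply a running sum of the Threads column. A minimal sketch of that bookkeeping, with a hypothetical class name and the deltas copied from the slide:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the thread accounting; not part of EM 5.0.
final class ThreadAccounting {
    /** Returns the running total of server threads after each event. */
    static List<Integer> runningTotals(int[] threadDeltas) {
        List<Integer> totals = new ArrayList<>();
        int total = 0;
        for (int delta : threadDeltas) {
            total += delta;   // each event adds or releases threads
            totals.add(total);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Threads column from the slide, in event order.
        int[] deltas = {1, 1, 1, -1, 1, 1, 4, -4, -3, 1};
        System.out.println(runningTotals(deltas));
    }
}
```

The peak of eight threads occurs while model 2 runs its four TK threads alongside the two training workspaces, the process session, and user 1's session.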

EM 5.0 Batch Processing
• Java API/UI for batch processing
  - Runs in middleware
  - Opens existing workspace and starts training process
  - Loads XML diagram files
• XML files API
  - Save entire diagrams as XML files
  - Mail from one user to another
  - Scheduled execution
  - %EM5(xmlfile=) macro for running diagrams
• Data set API
  - Nodes data set: all nodes and properties
  - Connections data set: flow of logic from one node to another
  - Actions data set: nodes and actions to perform on nodes
  - Workspace data set: library and file locations
  - Variables metadata sets: input, target, rejected, etc.
  - %EM5(nodes=,connect=,…) macro for running diagrams

EM 5.0 Batch Processing
• Compatible with all EM5 file structures
• Run the same diagram from UI or batch
• Automate model training from diagrams built in the GUI
• All SAS language capabilities
• Encapsulates EM processing
• BATCH.SAS always created for every node
• Automate creation of new diagrams
• Distribute diagrams
• Consulting: initial setup and delivery
• May include results, or not

EM 5.0 Batch Processing
• API to allow Java programs to call EM:
    String ids_id = myWorkspace.addNode("Datasource");
    String reg_id = myWorkspace.addNode("Regression");
    myWorkspace.connectNode(ids_id, reg_id);
    myWorkspace.runNode(reg_id);
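
The slide shows only the call sequence (addNode, connectNode, runNode); the actual EM 5.0 classes are not given. To make the sequence concrete, here is an in-memory stand-in, assuming that running a node first runs everything connected upstream of it. `SketchWorkspace`, its id format, and `runLog()` are hypothetical and are not the EM 5.0 API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a workspace; it only records the order in which nodes would run.
final class SketchWorkspace {
    private final Map<String, List<String>> predecessors = new HashMap<>();
    private final List<String> runLog = new ArrayList<>();
    private final Set<String> inProgress = new HashSet<>();
    private int counter = 0;

    /** Adds a node of the given type and returns a generated id (format is an assumption). */
    String addNode(String type) {
        String id = type + "_" + (++counter);
        predecessors.put(id, new ArrayList<>());
        return id;
    }

    /** Declares that the output of fromId flows into toId. */
    void connectNode(String fromId, String toId) {
        if (!predecessors.containsKey(fromId) || !predecessors.containsKey(toId)) {
            throw new IllegalArgumentException("unknown node id");
        }
        predecessors.get(toId).add(fromId);
    }

    /** Runs all upstream nodes that have not run yet, then the node itself. */
    void runNode(String id) {
        if (!predecessors.containsKey(id)) {
            throw new IllegalArgumentException("unknown node id");
        }
        if (runLog.contains(id)) {
            return; // already run
        }
        if (!inProgress.add(id)) {
            throw new IllegalStateException("cycle in diagram at " + id);
        }
        for (String upstream : predecessors.get(id)) {
            runNode(upstream);
        }
        inProgress.remove(id);
        runLog.add(id);
    }

    List<String> runLog() {
        return new ArrayList<>(runLog);
    }
}
```

With this stand-in, the four calls from the slide would record the data source node first and the regression node second.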

Integrated with OMS and Data Builder
• OMS persists metadata about SAS servers, EM project locations, results packages, and data dictionaries for training tables
• Scoring processes as well as input/output data sets can be defined and exchanged with other SAS companion products through registration of EM metadata and processes within the SAS OMR.

Other Major Enhancements
• New mining algorithms:
  - Support Vector Machines: popular algorithm for general classification problems
  - Web Path Analysis: provides efficient and scalable mining of frequent paths from click-stream data
  - Taxonomy: supports hierarchical associations to populate rules at different levels in the hierarchy
• Improved decision tree algorithm to enable interactive training on the server and provide improved performance on disk-resident data

New Procedures
• PROC PATH
• PROC SVM
• PROC ARBOR
• PROC TAXONOMY

New Path node (production)
• PROC PATH: a new procedure to mine frequent paths from preprocessed click-stream data
• Features:
  - Efficient, scalable and fast
  - Path completion: reintroduce missing requests (e.g., back button clicks)
  - Detecting path breaks: identify separate sub-paths
  - Generating longest contiguous sub-paths
  - Correctly handling page reload requests
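
The slide does not say how PROC PATH performs path completion. One common referrer-based heuristic, sketched here purely as an illustration, is: when a request's referrer is not the page visited last, assume the user pressed the back button until reaching the referrer, and reinsert those pages. The class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative heuristic only; not the PROC PATH algorithm.
final class PathCompletion {
    /** One logged request: the page asked for and the page it was reached from (may be null). */
    static final class Request {
        final String referrer;
        final String page;

        Request(String referrer, String page) {
            this.referrer = referrer;
            this.page = page;
        }
    }

    /** Rebuilds the browsing path, reinserting pages revisited through the back button. */
    static List<String> complete(List<Request> requests) {
        List<String> path = new ArrayList<>();
        for (Request r : requests) {
            if (!path.isEmpty() && r.referrer != null) {
                String last = path.get(path.size() - 1);
                if (!last.equals(r.referrer)) {
                    int back = path.lastIndexOf(r.referrer);
                    if (back >= 0) {
                        // Walk backwards from the page before the last one down to the referrer.
                        for (int j = path.size() - 2; j >= back; j--) {
                            path.add(path.get(j));
                        }
                    }
                }
            }
            path.add(r.page);
        }
        return path;
    }
}
```

For a log with requests A, B (from A), C (from B), and D (from A), this sketch yields the path A, B, C, B, A, D: the two back-button steps that never reached the server log are restored.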

Path Analysis
• Improved customer experience
  - Tuning web-site structure based on browsing patterns
• Build customer relationships
  - Customizing content at individual or segment level
• Real-time target marketing
  - Cross-sell, up-sell product recommendations
  - Ad/rebate placement
• Predict site abandonment
  - Browsing behavior as input to predictive modeling
  - Segmentation based on browsing behavior

Support Vector Machines (experimental)
• Supervised learning tool for creating functions from a set of labeled training data
  - A binary classifier
  - A general regression function
• Applications
  - Suitable for general classification problems
  - Text categorization
  - Biosequence analysis; microarrays

SVM classification is achieved by a linear or nonlinear separating surface in the input space of the data set.
• Linear SVMs operate by finding a hyperplane in the space of possible inputs. This hyperplane attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hyperplane to the nearest of the positive and negative examples.
• If the training examples are not linearly separable, SVMs work by mapping the training data into a higher-dimensional feature space using an appropriate kernel function.
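
The two ideas on this slide can be sketched in a few lines: a linear classifier assigns a class by the sign of w·x + b, the distance of an example from the hyperplane is its margin, and a kernel function replaces the dot product for data that is not linearly separable. This is a generic textbook sketch, not PROC SVM; the class and method names are hypothetical, and the Gaussian (RBF) kernel is just one example of a kernel.

```java
// Generic textbook sketch; not the PROC SVM implementation.
final class SvmSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Linear decision rule: +1 on one side of the hyperplane w.x + b = 0, -1 on the other. */
    static int classify(double[] w, double b, double[] x) {
        return dot(w, x) + b >= 0 ? 1 : -1;
    }

    /** Signed distance of a labeled example (label is +1 or -1) from the hyperplane. */
    static double margin(double[] w, double b, double[] x, int label) {
        return label * (dot(w, x) + b) / Math.sqrt(dot(w, w));
    }

    /** Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2). */
    static double rbfKernel(double[] x, double[] z, double gamma) {
        double squaredDistance = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - z[i];
            squaredDistance += d * d;
        }
        return Math.exp(-gamma * squaredDistance);
    }
}
```

Training an SVM means choosing w and b so that the smallest margin over all training examples is as large as possible; the sketch only evaluates a given hyperplane and does not perform that optimization.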

Other New Nodes/Procedures
• Taxonomy: hierarchical associations (experimental)
• ARBOR: replacement for SPLIT
  - Supports client/server interactive training
    - As an interactive procedure
    - As an engine for a client-side Windows application
  - Improved performance on disk-resident data
  - Documented at the level of SAS/STAT procedures
• All procedures will use a dynamic DMDB
  - No permanent physical DMDB data set is created

Early Adopters for EM 5
• Looking for early adopters in the SeUGI time frame
• 5 to 20 sites worldwide, recommended by local offices
• Different regions and different industries
• Following scenarios desired:
  - Distribute the EM Java thin client to multiple, geographically dispersed users to test the 3-tier architecture
  - Small to medium-sized firm to evaluate EM 5.0 running entirely on a local client
  - Site to test the Java API to integrate EM analytics and scoring services into site-specific mining applications
  - Site to test EM analytical deployment (test Model Repository)
  - Sites with excellent statistical/AI modeling skills and applications to evaluate the new algorithms (SVM, Path analysis node, Interactive Tree, Hierarchical Associations)

EM 5.0 Summary
• Delivered as a modern, distributed client-server system for data mining
• Enables wide-area collaboration on data mining projects and extensive integration opportunities
• SAS server uses new parallel and multi-processing features of the SAS V9.0 system and includes an API for running data mining processes and for adding new data mining tools
• Java middleware manages SAS server sessions, user identity, metadata, and report delivery
• Data mining sessions can be created and managed through a Java API
• The user interface is based on Java Swing libraries containing advanced graphics and visualization techniques
• New mining algorithms

EM Summary
• Provide renowned data mining functionality based on modern future-proof architecture
• Clear differentiation between data processing, metadata management and flexible user interface
• Architecture open for integration with other SAS and 3rd party applications
• Ensure backward compatibility by parallel maintenance of traditional AF solution

Other Data Mining Presentations at SeUGI
• Wed, 16:25, TKC: “Distributed Data Mining with SAS Enterprise Miner”
• Wed, 11:40, Analytical Expertise stream: “SAS Text Miner”
• Wed, 17:05, TKC: “SAS Text Mining”
• Analytical Demo Station in TKC
DEMO