algdocs.ncsa.uiuc.edu

Page 1: algdocs.ncsa.uiuc.edu

Michael Welge, Loretta Auvil

Automated Learning Group

National Center for Supercomputing Applications

University of Illinois

[email protected], [email protected]

217 244-1999

October 8, 2002

A Briefing Given to the Access Grid Community on D2K – Data To Knowledge

Page 2: algdocs.ncsa.uiuc.edu

alg | Automated Learning Group

Presentation Overview

• Brief Introduction to Knowledge Discovery in Databases and Data Mining

• Knowledge Discovery in Databases Framework

• Primer on Using the D2K – Data To Knowledge Framework

• Questions?

Page 3: algdocs.ncsa.uiuc.edu

Goals

• Understanding of the Knowledge Discovery in Databases Process

• Gain Knowledge of Basic Data Mining Operations and Techniques

• Understanding the Role of the Knowledge Discovery Framework

• Key Issues in Utilization of D2K Framework

• Understanding the Role of Information Visualization in Data Mining

Page 4: algdocs.ncsa.uiuc.edu

Motivation: “Necessity is the Mother of Invention”

• Data Explosion Problem
  • Automated Data Collection Tools and Mature Database Technology Lead to Tremendous Amounts of Data Stored in Databases, Data Warehouses, and Other Information Repositories

• We Are Drowning In Data, But Starving For Knowledge

• Solution: Data Management Environments and Data Mining Frameworks
  • Data Warehousing and On-Line Analytical Processing
  • Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) from Large Data and Large Databases

Page 5: algdocs.ncsa.uiuc.edu

Why Data Mining? - Potential Applications

• Eliminating Waste, Fraud, Abuse
  • Taxpayer Non-compliance
  • Medicaid Claims Fraud
  • Food Stamp Program
  • Auditor “Interestingness” Tool

• Corporate Analysis and Risk Management
  • Resource Planning
  • Competitive Analysis
  • Finance Planning and Asset Evaluation

Page 6: algdocs.ncsa.uiuc.edu

Why Data Mining? - Potential Applications

• Crisis Management
  • Anticipatory Models
  • Topic Detection
  • Text Extraction
  • Network Intrusion
  • Multi-Objective Optimization

• Workforce/Education
  • Constituent Relationship Management
  • Real-time Profiling
  • Peer Review Analysis
  • Curriculum Generator
  • Retention Programs

Page 7: algdocs.ncsa.uiuc.edu

Why Data Mining? - Potential Applications

• Managing Natural Resources
  • Land Usage
  • Water Resource Management
  • Surveillance
  • Biometrics for Identification

• Other Applications
  • Astronomy
  • Computational Biology

Page 8: algdocs.ncsa.uiuc.edu

Data Mining: On What Kind of Data?

• Relational Databases

• Data Warehouses

• Transactional Databases

• Advanced Database Systems
  • Object-Relational
  • Spatial
  • Temporal
  • Text
  • Heterogeneous, Legacy, and Distributed
  • WWW

Page 9: algdocs.ncsa.uiuc.edu

Data Mining: Confluence of Multiple Disciplines

• Database Systems, Data Warehouses, and OLAP

• Machine Learning

• Statistics

• Mathematical Programming

• Visualization

• High Performance Computing

Page 10: algdocs.ncsa.uiuc.edu

Why Do We Need Data Mining?

• Data volumes are too large for classical analysis approaches:

  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)

How do you explore millions of records, tens or hundreds of fields, and find patterns?

Page 11: algdocs.ncsa.uiuc.edu

Why Do We Need Data Mining?

• As databases grow, supporting the decision-making process with traditional query languages becomes infeasible

• Many queries of interest are difficult to state in a query language (query formulation problem)

• “Find all cases of fraud”

• “Find all individuals likely to need Education Credit Assistance”

• “Find all documents that are similar to this customer's problem”

Page 12: algdocs.ncsa.uiuc.edu

What is It?

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

• The understandable patterns are used to:
  • Make predictions or classifications about new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Provide graphical data visualization to aid humans in discovering deeper patterns

Page 13: algdocs.ncsa.uiuc.edu

Three Primary Data Mining Paradigms

• Predictive Modeling

• Classification – (Categorical or Discrete)

• Regression – (Continuous)

• Discovery

• Association Rules, Link Analysis, Sequences, Clustering

• Deviation Detection/ Monitoring

Page 14: algdocs.ncsa.uiuc.edu

Knowledge Discovery In Databases Process

Page 15: algdocs.ncsa.uiuc.edu

Need for Data Mining Framework

• Visual Programming Environment

• Robust Computational Infrastructure

• Flexible And Extensible Architecture

• Rapid Application Development Environment

• Integrated Environment For Models And Visualization

• Workflow and Group Use Interface

Page 16: algdocs.ncsa.uiuc.edu

D2K - Data To Knowledge

D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.

Page 17: algdocs.ncsa.uiuc.edu

D2K – Infrastructure, Toolkit, Modules, and Applications

• Data Selection
  • Distributed Knowledge Sources

• Data Transformation
  • Feature Selection/Construction
  • Example Selection

• Data Modeling
  • Scalable Algorithms
    – Predictive
    – Discovery
    – Anomaly Detection
  • Bias Optimization
  • Layer Learning

• Model Evaluation
  • Information Visualization

Page 18: algdocs.ncsa.uiuc.edu

D2K/T2K/I2K - Data, Text, and Image Analysis

Page 19: algdocs.ncsa.uiuc.edu

Summary

• Data mining: discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Importance of data mining framework

Page 20: algdocs.ncsa.uiuc.edu

D2K Toolkit

[Screenshot of the D2K Toolkit with labeled regions: Tool Menu, Tool Bar, Workspace, Side Tab Panes, Jump Up Panes]

Page 21: algdocs.ncsa.uiuc.edu

D2K - Software Environment for Data Mining

• Visual programming system employing a scalable framework

• Robust computational infrastructure
  • Enable processor intensive apps, support distributed computing
  • Enable data intensive apps, support multi-processor, shared memory architectures, thread pooling
  • Very low granularity, fast data flow paradigm, integrated control flow

• Reduction of “time to market”
  • Increase code reuse and sharing
  • Expedite custom software developments
  • Relieve distributed computing burden

• Flexible and extensible architecture
  • Create plug and play subsystem architectures, and standard APIs

• Rapid application development (RAD) environment

• Integrated environment for models and visualization

Page 22: algdocs.ncsa.uiuc.edu

D2K Components

• D2K Infrastructure
  • D2K API, data flow environment, distributed computing framework and runtime system

• D2K Modules
  • Computational units written in Java that follow the D2K API

• D2K Itineraries
  • Modules that are connected to form an application

• D2K Toolkit
  • User interface for specification of itineraries and execution which provides the rapid application development environment

• D2K-Driven Applications
  • Applications that use D2K modules, but do not need to run in the D2K Toolkit
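
The idea of an itinerary as modules connected to form an application can be illustrated with a minimal sketch. Everything named here (`Itinerary`, `connect`, `run`, and the two toy modules) is hypothetical and is not the D2K API. The sketch also assumes a simple linear chain, whereas a real data flow environment would allow a general graph of connections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Demo entry point: builds a two-module chain and runs it once.
class ItinerarySketchDemo {
    public static void main(String[] args) {
        Itinerary itinerary = new Itinerary()
            .connect("DropMissing", ItinerarySketchDemo::dropMissing)
            .connect("Sort", ItinerarySketchDemo::sorted);
        List<Double> result = itinerary.run(Arrays.asList(3.0, Double.NaN, 1.0, 2.0));
        System.out.println(itinerary.moduleNames() + " -> " + result);
    }

    // Toy "data prep" module: removes NaN values, treated here as missing data.
    static List<Double> dropMissing(List<Double> values) {
        List<Double> kept = new ArrayList<>();
        for (Double v : values) {
            if (!v.isNaN()) {
                kept.add(v);
            }
        }
        return kept;
    }

    // Toy "compute" module: returns a sorted copy of its input.
    static List<Double> sorted(List<Double> values) {
        List<Double> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }
}

// Hypothetical stand-in for an itinerary: an ordered chain of named modules.
class Itinerary {
    private final List<String> names = new ArrayList<>();
    private final List<UnaryOperator<List<Double>>> modules = new ArrayList<>();

    // Appends a module so that it receives the output of the previous one.
    Itinerary connect(String name, UnaryOperator<List<Double>> module) {
        names.add(name);
        modules.add(module);
        return this;
    }

    // Passes the data through every module in connection order.
    List<Double> run(List<Double> data) {
        List<Double> current = data;
        for (UnaryOperator<List<Double>> module : modules) {
            current = module.apply(current);
        }
        return current;
    }

    List<String> moduleNames() {
        return new ArrayList<>(names);
    }
}
```

Because the chain is built in code rather than in a visual editor, the same sketch also hints at how an application could reuse modules without running inside a toolkit.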

Page 23: algdocs.ncsa.uiuc.edu

D2K Infrastructure

• D2K Module API Specification

• Distributed Computing Framework

• Uses Socket Based Connections to communicate to remote machines

• Uses Grid Services to deploy on the Grid

• Local D2K
  • Controls the execution of an itinerary
  • Manages the passing of data between modules and machines (if necessary)

• Remote D2K
  • Executes a module on a remote machine

Page 24: algdocs.ncsa.uiuc.edu

D2K Modules

Input Module: Loads data from the outside world.
  • Flat files, databases, etc.

Data Prep Module: Performs functions to select, clean, or transform the data.
  • Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations.
  • Naïve Bayesian, Decision Tree, Apriori, etc.

User Input Module: Requires interaction with the user.
  • Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world.
  • Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user.
  • Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
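
As an illustration of what a data prep step such as normalizing or binning might compute, here is a minimal sketch. The slide names the operations only. The specific choices (min-max normalization to [0, 1] and equal-width binning) and the class and method names are assumptions made for this example, not D2K code.

```java
import java.util.Arrays;

class DataPrepSketch {
    public static void main(String[] args) {
        double[] values = {2.0, 4.0, 6.0, 10.0};
        System.out.println(Arrays.toString(normalize(values)));  // [0.0, 0.25, 0.5, 1.0]
        System.out.println(Arrays.toString(bin(values, 4)));     // [0, 1, 2, 3]
    }

    // Min-max normalization: rescales values linearly so min -> 0 and max -> 1.
    // A constant column (max == min) is mapped to all zeros.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    // Equal-width binning: splits [min, max] into `bins` intervals of equal
    // width and returns the interval index (0 .. bins - 1) for each value.
    static int[] bin(double[] values, int bins) {
        double[] scaled = normalize(values);
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // The maximum value scales to exactly 1.0, so clamp it into the last bin.
            out[i] = Math.min(bins - 1, (int) (scaled[i] * bins));
        }
        return out;
    }
}
```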

Page 25: algdocs.ncsa.uiuc.edu

D2K Module Icon Description

Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.

Input Trigger: Specifies the input for control flow.

Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.

Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Output Trigger: Specifies the output for control flow.

Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.

Serializable Symbol: If an “S” is shown in the lower right corner of the module, then the module is serializable and can be saved.

Page 26: algdocs.ncsa.uiuc.edu

D2K Itineraries

• An itinerary is an application formed by connecting modules and setting their properties.

• D2K Core Itineraries include:

  • Prediction
  • Discovery
  • Anomaly Detection
  • Data Selection
  • Transformation
  • Visualization

Page 27: algdocs.ncsa.uiuc.edu

D2K-Driven Applications

• Advantages of Building D2K-Driven Applications
  • Code reuse shortens development time
  • Use the distributed computing features implemented in D2K

• Current Application Development By the ALG
  • Text Analysis (ThemeWeaver uses T2K - Text to Knowledge)

• Other Potential Application Areas
  • Image Analysis (I2K – Image to Knowledge)

D2K-driven applications are those that use D2K modules and/or itineraries but do not require interaction with the D2K Toolkit to function. They can operate as standalone applications.

Page 28: algdocs.ncsa.uiuc.edu

New D2K 3.0 Features

• Extension of existing API
  • Includes the ability to programmatically connect modules and set properties.
  • Allows D2K-driven applications to be developed.
  • Ability to pause and restart an itinerary.

• Enhanced Distributed Computing
  • Modules that are re-entrant can be executed remotely.
  • Use of Jini services to look up distributed resources.
  • Allows specifying the runtime layout of a distributed itinerary, which can be changed dynamically during runtime.

• Processor Status Overlay
  • Shows user how distributed computing resources are being used.
  • Shows how many resources are ready to compute on each machine.

• Distributed Checkpointing

• Resource Manager
  • Provides an API for indicating data structures to be stored by the resource manager.
  • Resource manager provides these data structures to distributed machines.

Page 29: algdocs.ncsa.uiuc.edu

Processor Status Overlay

• Represents each machine being used.

• Multiple lines represent multiple processors per machine.

Page 30: algdocs.ncsa.uiuc.edu

Let's look at D2K…

Demos

• D2K Toolkit

• Prediction
  • Naïve Bayesian
  • Decision Tree

• Discovery
  • Rule Association
  • Text Analysis (D2K)
  • Image Analysis (I2K)

• Visualization

Page 31: algdocs.ncsa.uiuc.edu

D2K SL

• Intuitive interfaces into a subset of D2K functionality for users who are not data mining professionals.

• Transparent access to mine data stored in databases.

• Extensible from desktop to cluster to grid.

• Visualization support at all stages of the data mining process.

• Support for very large data sets.

Page 32: algdocs.ncsa.uiuc.edu

New D2K User Interface – D2K SL

• Provides a step-by-step interface to guide the user in data analysis

• Uses the same D2K modules

• Provides a way to capture different experiments (streams)

Page 33: algdocs.ncsa.uiuc.edu

Another View of the New D2K User Interface – D2K SL

• Helps users keep track of data

• Define templates that can be reused in different experiments (streams)

Page 34: algdocs.ncsa.uiuc.edu

How To Write A Module

• How hard is it to write a module?

• We have an API to define what a given module is.

• Most modules need the following methods implemented:
  • Module Info (getModuleInfo)
  • Input and Output Info (getInputInfo and getOutputInfo)
  • Input and Output Types (getInputTypes and getOutputTypes)
  • Names (getModuleName, getInputName, getOutputName)
  • Module execution (doit)

• Flexibility exists for other methods to be overridden to provide different functionality.

• Optional methods exist for providing more information about properties, module icon, etc.
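
To give a feel for the method list above, here is a minimal, self-contained sketch of a module. Only the method names come from the slide. The base class `SketchModule`, all signatures and return types, and the helpers `offerInput`, `takeOutput`, `pullInput`, and `pushOutput` are assumptions made so the example runs on its own, and they should not be read as the actual D2K API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Demo entry point: feeds one value to the module, runs it, and prints the result.
class ModuleSketchDemo {
    public static void main(String[] args) {
        UppercaseModule module = new UppercaseModule();
        module.offerInput("data to knowledge");
        module.doit();
        System.out.println(module.getModuleName() + " -> " + module.takeOutput());
    }
}

// Hypothetical stand-in for a module base class (signatures are assumed).
abstract class SketchModule {
    private final Deque<Object> inputs = new ArrayDeque<>();
    private final Deque<Object> outputs = new ArrayDeque<>();

    abstract String getModuleInfo();
    abstract String getModuleName();
    abstract String getInputInfo(int index);
    abstract String getOutputInfo(int index);
    abstract String[] getInputTypes();
    abstract String[] getOutputTypes();
    abstract String getInputName(int index);
    abstract String getOutputName(int index);
    abstract void doit();

    // Assumed helpers that play the role of the framework's data passing.
    void offerInput(Object value) { inputs.addLast(value); }
    Object takeOutput() { return outputs.removeFirst(); }
    protected Object pullInput() { return inputs.removeFirst(); }
    protected void pushOutput(Object value) { outputs.addLast(value); }
}

// A toy module with one String input and one String output.
class UppercaseModule extends SketchModule {
    @Override String getModuleInfo() { return "Converts the input text to upper case."; }
    @Override String getModuleName() { return "Uppercase"; }
    @Override String getInputInfo(int index) { return "Text to convert."; }
    @Override String getOutputInfo(int index) { return "Upper-case text."; }
    @Override String[] getInputTypes() { return new String[] {"java.lang.String"}; }
    @Override String[] getOutputTypes() { return new String[] {"java.lang.String"}; }
    @Override String getInputName(int index) { return "Text"; }
    @Override String getOutputName(int index) { return "Upper-case Text"; }

    // Module execution: read the input, transform it, emit the output.
    @Override void doit() {
        String text = (String) pullInput();
        pushOutput(text.toUpperCase(Locale.ROOT));
    }
}
```

In this sketch the descriptive methods only return metadata, so the real work of writing a module is concentrated in `doit`.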

Page 35: algdocs.ncsa.uiuc.edu

The ALG Team

Staff: Loretta Auvil, Ruth Aydt, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Yair Even-Zohar, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Tyler Alumbaugh, Bradley Berkin, Martin Butz, Peter Groves, Nazan Khan, Alexander Kosorukoff, Kiran Lakkaraju, Sang-Chul Lee, Sameer Mathur, Sunayana Saha, Arun Srinivasan, Bei Yu