algdocs.ncsa.uiuc.edu

Michael Welge, Loretta Auvil

Automated Learning Group
National Center for Supercomputing Applications
University of Illinois

217 244-1999

October 8, 2002

A Briefing Given to the Access Grid Community on D2K – Data To Knowledge

alg | Automated Learning Group

Presentation Overview

• Brief Introduction to Knowledge Discovery in Databases and Data Mining

• Knowledge Discovery in Databases Framework

• Primer on Using the D2K – Data To Knowledge Framework

• Questions?


Goals

• Understanding of the Knowledge Discovery in Databases Process

• Gain Knowledge of Basic Data Mining Operations and Techniques

• Understanding the Role of the Knowledge Discovery Framework

• Key Issues in Utilization of D2K Framework

• Understanding the Role of Information Visualization in Data Mining


Motivation: “Necessity is the Mother of Invention”

• Data Explosion Problem
  – Automated Data Collection Tools and Mature Database Technology Lead to Tremendous Amounts of Data Stored in Databases, Data Warehouses, and Other Information Repositories

• We Are Drowning In Data, But Starving For Knowledge

• Solution: Data Management Environments and Data Mining Frameworks
  – Data Warehousing and On-Line Analytical Processing
  – Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) from Large Data and Large Databases


Why Data Mining? - Potential Applications

• Eliminating Waste, Fraud, Abuse
  – Taxpayer Non-compliance
  – Medicaid Claims Fraud
  – Food Stamp Program
  – Auditor “Interestingness” Tool

• Corporate Analysis and Risk Management
  – Resource Planning
  – Competitive Analysis
  – Finance Planning and Asset Evaluation



• Crisis Management
  – Anticipatory Models
  – Topic Detection
  – Text Extraction
  – Network Intrusion
  – Multi-Objective Optimization

• Workforce/Education
  – Constituent Relationship Management
  – Real-time Profiling
  – Peer Review Analysis
  – Curriculum Generator
  – Retention Programs



• Managing Natural Resources
  – Land Usage
  – Water Resource Management

• Surveillance
  – Biometrics for Identification

• Other Applications
  – Astronomy
  – Computational Biology


Data Mining: On What Kind of Data?

• Relational Databases

• Data Warehouses

• Transactional Databases

• Advanced Database Systems
  – Object-Relational
  – Spatial
  – Temporal
  – Text
  – Heterogeneous, Legacy, and Distributed
  – WWW


Data Mining: Confluence of Multiple Disciplines

• Database Systems, Data Warehouses, and OLAP

• Machine Learning

• Statistics

• Mathematical Programming

• Visualization

• High Performance Computing


Why Do We Need Data Mining?

• Data volumes are too large for classical analysis approaches:
  – Large number of records (10^8 – 10^12 bytes)
  – High dimensional data (10^2 – 10^4 attributes)

How do you explore millions of records, tens or hundreds of fields, and find patterns?


Why Do We Need Data Mining?

• As databases grow, supporting the decision-making process with traditional query languages becomes infeasible

• Many queries of interest are difficult to state in a query language (query formulation problem)

• “Find all cases of fraud”

• “Find all individuals likely to need Education Credit Assistance”

• “Find all documents that are similar to this customer's problem”


What is It?

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

• The understandable patterns are used to:
  – Make predictions or classifications about new data
  – Explain existing data
  – Summarize the contents of a large database to support decision making
  – Visualize data graphically to aid humans in discovering deeper patterns


Three Primary Data Mining Paradigms

• Predictive Modeling
  – Classification (Categorical or Discrete)
  – Regression (Continuous)

• Discovery
  – Association Rules, Link Analysis, Sequences, Clustering

• Deviation Detection/Monitoring
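The three paradigms can be illustrated with one tiny function each. This is a minimal sketch in Python, not D2K code: the function names, the toy data, and the choice of techniques (nearest-neighbor classification, support and confidence of one association rule, a z-score test for deviations) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def nearest_neighbor_label(train, query):
    """Predictive modeling (classification): copy the label of the closest training point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda row: sq_dist(row[0], query))
    return label


def rule_support_confidence(transactions, antecedent, consequent):
    """Discovery (association rules): support and confidence of antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def deviations(values, threshold=2.0):
    """Deviation detection: values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((5.0, 5.0), "high")]
    print(nearest_neighbor_label(train, (4.5, 4.0)))

    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "butter"}, {"milk"}]
    print(rule_support_confidence(baskets, {"bread"}, {"milk"}))

    print(deviations([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))
```

Regression, the continuous counterpart of classification, would replace the returned label with a predicted number; it is left out to keep the sketch short.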


Knowledge Discovery In Databases Process


Need for Data Mining Framework

• Visual Programming Environment

• Robust Computational Infrastructure

• Flexible And Extensible Architecture

• Rapid Application Development Environment

• Integrated Environment For Models And Visualization

• Workflow and Group Use Interface


D2K - Data To Knowledge

D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.


D2K – Infrastructure, Toolkit, Modules, and Applications

• Data Selection
  – Distributed Knowledge Sources

• Data Transformation
  – Feature Selection/Construction
  – Example Selection

• Data Modeling
  – Scalable Algorithms: Predictive, Discovery, Anomaly Detection
  – Bias Optimization
  – Layer Learning

• Model Evaluation
  – Information Visualization


D2K/T2K/I2K - Data, Text, and Image Analysis


Summary

• Data mining: discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Importance of data mining framework
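The KDD steps named in the summary (cleaning, integration, selection, transformation, mining, pattern evaluation, presentation) can be read as a chain of small functions. The sketch below is a Python illustration under stated assumptions: the record layout, field names, thresholds, and the co-occurrence count used as the "mining" step are all invented for the example and are not taken from D2K.

```python
from collections import Counter


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records, field, cut):
    """Transformation: discretize a numeric field into 'low' / 'high'."""
    return [{**r, field: "high" if r[field] >= cut else "low"} for r in records]


def mine(records, field_a, field_b):
    """Data mining: count how often each pair of values occurs together."""
    return Counter((r[field_a], r[field_b]) for r in records)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    """Knowledge presentation: render the surviving patterns as readable lines."""
    return [f"{a} & {b}: {c} records" for (a, b), c in sorted(patterns.items())]


def kdd(sources, fields, numeric_field, label_field, cut, min_count):
    """Run the steps in the order listed in the summary."""
    records = integrate(*(clean(s) for s in sources))
    records = transform(select(records, fields), numeric_field, cut)
    return present(evaluate(mine(records, numeric_field, label_field), min_count))


if __name__ == "__main__":
    # Made-up claim records from two hypothetical sources.
    source_a = [{"amount": 120, "region": "north", "flagged": "no"},
                {"amount": 950, "region": "north", "flagged": "yes"},
                {"amount": None, "region": "south", "flagged": "no"}]
    source_b = [{"amount": 880, "region": "south", "flagged": "yes"},
                {"amount": 90, "region": "south", "flagged": "no"},
                {"amount": 1000, "region": "north", "flagged": "yes"}]
    for line in kdd([source_a, source_b], ["amount", "flagged"],
                    "amount", "flagged", cut=500, min_count=2):
        print(line)
```

In practice each step is far richer than one function, and the process is iterative rather than a single pass; the chain only shows how the output of one step feeds the next.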


D2K Toolkit

[Screenshot of the D2K Toolkit with labeled regions: Tool Menu, Tool Bar, Workspace, Side Tab Panes, Jump Up Panes]


D2K - Software Environment for Data Mining

• Visual programming system employing a scalable framework

• Robust computational infrastructure
  – Enable processor intensive apps, support distributed computing
  – Enable data intensive apps, support multi-processor, shared memory architectures, thread pooling
  – Very low granularity, fast data flow paradigm, integrated control flow

• Reduction of “time to market”
  – Increase code reuse and sharing
  – Expedite custom software developments
  – Relieve distributed computing burden

• Flexible and extensible architecture
  – Create plug and play subsystem architectures, and standard APIs

• Rapid application development (RAD) environment

• Integrated environment for models and visualization


D2K Components

• D2K Infrastructure
  – D2K API, data flow environment, distributed computing framework and runtime system

• D2K Modules
  – Computational units written in Java that follow the D2K API

• D2K Itineraries
  – Modules that are connected to form an application

• D2K Toolkit
  – User interface for specification of itineraries and execution which provides the rapid application development environment

• D2K-Driven Applications
  – Applications that use D2K modules, but do not need to run in the D2K Toolkit
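The idea of modules wired into an itinerary and run by a data flow environment can be sketched as follows. Real D2K modules are written in Java against the D2K API; this Python sketch is only a conceptual stand-in, and the `Module` and `Itinerary` classes, their methods, and the firing rule (a module runs once all of its inputs have arrived) are assumptions, not the D2K implementation.

```python
from collections import deque


class Module:
    """Hypothetical stand-in for a module: a named unit of computation."""

    def __init__(self, name, n_inputs, func):
        self.name = name
        self.n_inputs = n_inputs
        self.func = func  # takes n_inputs arguments, returns a tuple of outputs


class Itinerary:
    """Hypothetical stand-in for an itinerary: modules plus port-to-port connections."""

    def __init__(self):
        self.modules = {}
        self.wires = {}  # (source name, output port) -> [(destination name, input port)]

    def add(self, module):
        self.modules[module.name] = module
        return self

    def connect(self, src, out_port, dst, in_port):
        self.wires.setdefault((src, out_port), []).append((dst, in_port))
        return self

    def run(self):
        """Data flow execution: a module fires once all of its inputs are available."""
        pending = {name: {} for name in self.modules}
        ready = deque(n for n, m in self.modules.items() if m.n_inputs == 0)
        results = {}
        while ready:
            name = ready.popleft()
            module = self.modules[name]
            args = [pending[name][i] for i in range(module.n_inputs)]
            outputs = module.func(*args)
            results[name] = outputs
            # Push each output along its wires and wake modules that are now complete.
            for port, value in enumerate(outputs):
                for dst, in_port in self.wires.get((name, port), []):
                    pending[dst][in_port] = value
                    if len(pending[dst]) == self.modules[dst].n_inputs:
                        ready.append(dst)
        return results


if __name__ == "__main__":
    itinerary = (
        Itinerary()
        .add(Module("load", 0, lambda: ([2.0, 4.0, 6.0],)))
        .add(Module("scale", 1, lambda xs: ([x / max(xs) for x in xs],)))
        .add(Module("summarize", 1, lambda xs: (sum(xs) / len(xs),)))
        .connect("load", 0, "scale", 0)
        .connect("scale", 0, "summarize", 0)
    )
    print(itinerary.run()["summarize"][0])
```

Connecting modules programmatically like this, rather than by drawing them in a visual workspace, is also the shape a stand-alone application built on reusable modules would take.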


D2K Infrastructure

• D2K Module API Specification

• Distributed Computing Framework

• Uses Socket-Based Connections to communicate with remote machines

• Uses Grid Services to deploy on the Grid

• Local D2K
  – Controls the execution of an itinerary
  – Manages the passing of data between modules and machines (if necessary)

• Remote D2K
  – Executes a module on a remote machine


D2K Modules

Input Module: Loads data from the outside world.
  – Flat files, databases, etc.

Data Prep Module: Performs functions to select, clean, or transform the data.
  – Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations.
  – Naïve Bayesian, Decision Tree, Apriori, etc.

User Input Module: Requires interaction with the user.
  – Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world.
  – Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user.
  – Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
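Two of the data preparation operations named above, normalizing and binning, are simple enough to show directly. This is a minimal Python sketch, assuming min-max normalization and equal-width bins; the slides do not say which variants the D2K modules implement, and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Normalizing: rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Binning: replace each value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    ages = [20, 30, 40, 60]  # made-up values
    print(min_max_normalize(ages))
    print(equal_width_bins(ages, 4))
```

Binning turns a continuous attribute into a discrete one, which is what algorithms such as Naïve Bayesian classifiers and Apriori typically expect as input.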


D2K Module Icon Description

Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.

Input Trigger: Specifies the input for control flow.

Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.

Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Output Trigger: Specifies the output for control flow.

Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.

Serializable Symbol: If an “S” is shown in the lower right corner of the module, then the module is serializable and can be saved.


D2K Itineraries

• Itineraries are applications that have connected modules with their properties set.

• D2K Core Itineraries include:

  – Prediction
  – Discovery
  – Anomaly Detection
  – Data Selection
  – Transformation
  – Visualization


D2K-Driven Applications

• Advantages of Building D2K-Driven Applications
  – Code reuse shortens development time
  – Use the distributed computing features implemented in D2K

• Current Application Development By the ALG
  – Text Analysis (ThemeWeaver uses T2K – Text to Knowledge)

• Other Potential Application Areas
  – Image Analysis (I2K – Image to Knowledge)

D2K-Driven applications are those that use D2K modules and/or itineraries but do not require interaction with the D2K Toolkit to function. They can operate as standalone applications.


New D2K 3.0 Features

• Extension of existing API
  – Include the ability to programmatically connect modules and set properties.
  – Allows D2K-driven applications to be developed.
  – Ability to pause and restart an itinerary.

• Enhanced Distributed Computing
  – Modules that are re-entrant can be executed remotely.
  – Use of Jini services to look up distributed resources.
  – For specifying the runtime layout of a distributed itinerary, which can be changed dynamically during runtime.

• Processor Status Overlay
  – Shows user how distributed computing resources are being used.
  – Shows how many resources are ready to compute on each machine.

• Distributed Checkpointing

• Resource Manager
  – Provides an API for indicating data structures to be stored by the resource manager.
  – Resource manager provides these data structures to distributed machines.


Processor Status Overlay

• Represents each machine being used.

• Multiple lines represent multiple processors per machine.


Let's look at D2K…

Demos

• D2K Toolkit

• Prediction
  – Naive Bayesian
  – Decision Tree

• Discovery
  – Rule Association
  – Text Analysis (D2K)
  – Image Analysis (I2K)

• Visualization


D2K SL

• Intuitive interfaces into a subset of D2K functionality for non-data mining professionals.

• Transparent access to mine data stored in databases.

• Extensible from desktop to cluster to grid.

• Visualization support at all stages of the data mining process.

• Support for very large data sets.


New D2K User Interface – D2K SL

• Provides a step-by-step interface to guide the user in data analysis

• Uses the same D2K modules

• Provides a way to capture different experiments (streams)


Another View of the New D2K User Interface – D2K SL

• Helps users keep track of data

• Defines templates that can be reused in different experiments (streams)


How To Write A Module

• How hard is it to write a module?

• We have an API to define what a given module is.

• Most modules need the following methods implemented:
  – Module Info (getModuleInfo)
  – Input and Output Info (getInputInfo and getOutputInfo)
  – Input and Output Types (getInputTypes and getOutputTypes)
  – Names (getModuleName, getInputName, getOutputName)
  – Module execution (doit)

• Flexibility exists for other methods to be overridden to provide different functionality.

• Optional methods exist for providing more information about properties, module icon, etc.
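To give a feel for how little a module needs, here is a sketch of a class that carries the method names listed above. It is written in Python purely for illustration: real D2K modules are Java classes, and the parameter lists, return types, and the way `doit` receives and returns data shown here are assumptions, since the slides give only the method names.

```python
class ColumnMeanModule:
    """Hypothetical module sketch; method names follow the slide, signatures are assumed."""

    def getModuleName(self):
        return "Column Mean"

    def getModuleInfo(self):
        return "Computes the mean of a list of numbers."

    def getInputTypes(self):
        return ["list of float"]

    def getOutputTypes(self):
        return ["float"]

    def getInputName(self, index):
        return ["Values"][index]

    def getOutputName(self, index):
        return ["Mean"][index]

    def getInputInfo(self, index):
        return ["The numbers to average."][index]

    def getOutputInfo(self, index):
        return ["The arithmetic mean of the input values."][index]

    def doit(self, inputs):
        """Module execution: read each input by port index, return one value per output port."""
        values = inputs[0]
        return [sum(values) / len(values)]


if __name__ == "__main__":
    module = ColumnMeanModule()
    print(module.getModuleName(), module.doit([[1.0, 2.0, 3.0]]))
```

Everything except `doit` is descriptive metadata, which is what lets a toolkit display ports, names, and help text for a module without running it.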


The ALG Team

Staff: Loretta Auvil, Ruth Aydt, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Yair Even-Zohar, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Tyler Alumbaugh, Bradley Berkin, Martin Butz, Peter Groves, Nazan Khan, Alexander Kosorukoff, Kiran Lakkaraju, Sang-Chul Lee, Sameer Mathur, Sunayana Saha, Arun Srinivasan, Bei Yu

Documents

algdocs.ncsa.uiuc.edu