Mining Large Data at SDSC Natasha Balac, Ph.D.

A Deluge of Data

Astronomy

Life Sciences

Modeling and Simulation

Data Management and Mining

Geosciences

Preservation and Archiving

• Today, data comes from everywhere
– Scientific instruments
– Experiments
– Sensors and sensor nets
– New devices

• And is used by everyone
– Scientists
– Consumers
– Educators
– General public

• IT environments must support unprecedented diversity, globalization, integration, scale, and use

• Turning the deluge of data into usable information requires an unprecedented level of integration, globalization, scale, and access

Why DATA MINING?

• Necessity is the mother of invention
• Huge amounts of data
• Electronic records of our decisions
– Choices in the supermarket
– Financial records
– Our comings and goings

• We swipe our way through the world – every swipe is a record in a database

• Data rich – but information poor
• Lying hidden in all this data is information!

What is DATA MINING?

• Extracting or “mining” knowledge from large amounts of data

• Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data

• Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

• Fundamental idea: learn rules/patterns/relationships automatically from the data
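As a rough illustration of learning a rule automatically from data, here is a minimal sketch of a one-attribute rule learner in Python (in the spirit of the classic 1R method). It is not taken from the talk: the function name `learn_one_rule` and the toy `purchases` records are hypothetical, chosen only to echo the supermarket example.

```python
from collections import Counter, defaultdict

def learn_one_rule(rows, target):
    """Pick the attribute whose value -> majority-class rule makes the fewest errors."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Candidate rule: each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy records (illustrative only, not real data).
purchases = [
    {"day": "weekday", "coupon": "yes", "buys": "yes"},
    {"day": "weekday", "coupon": "no", "buys": "no"},
    {"day": "weekend", "coupon": "yes", "buys": "yes"},
    {"day": "weekend", "coupon": "no", "buys": "no"},
    {"day": "weekend", "coupon": "no", "buys": "yes"},
    {"day": "weekday", "coupon": "yes", "buys": "yes"},
]

attribute, rule, errors = learn_one_rule(purchases, target="buys")
print(attribute, rule, errors)
```

On this toy set the learner picks `coupon` as the most predictive single attribute. Nobody wrote the rule by hand; it was read off the records.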

Terminology

• Gold Mining vs. Sand Mining
• Knowledge mining from databases
• Knowledge extraction
• Data/pattern analysis
• Knowledge Discovery in Databases (KDD)
• Predictive Modeling
• Machine Learning
• Business Intelligence

CRISP-DM (Cross Industry Standard Process for Data Mining)

CRISP-DM Process Model

Data Mining Driven Engineering Product Design

• Incorporate parallel computing and data mining capabilities into engineering and optimizing product design models

• Complex challenges in new product design
– Accurate acquisition/interpretation of raw customer data
– Integrating newly found knowledge into the engineering design process
– Developing analytical techniques that help reduce the computational time required to generate product portfolios

• Mining paid search on-line customer preference data

A Java-based Data Driven Product Design (DDPD) Platform

• The platform integrates the supercomputing resources at SDSC with complex engineering design simulation platforms such as Matlab in an effort to streamline the product design and development process

Tools in the GUI

• Data Mining algorithms: Weka, Parallel Weka and Parallel C4.5, Parallel K-means

• The Data Driven Product Design platform uses Matlab’s computation engine directly from the GUI. Optimization choices available from the user interface include Matlab, Tomlab, Excel Solver, Star-P, Parallel Matlab, Parallel CPLEX, etc.
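To give a feel for what a parallel K-means, as listed among the algorithms above, might look like, here is a minimal data-parallel sketch in Python. It is an assumption-laden illustration, not the SDSC implementation: the names `partial_sums` and `parallel_kmeans`, the thread pool, and the sample points are all hypothetical. Each worker computes per-cluster sums for its own partition of the data (map step), and the partial results are combined to update the centroids (reduce step).

```python
from concurrent.futures import ThreadPoolExecutor

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def partial_sums(chunk, centroids):
    """Map step: per-cluster coordinate sums and counts for one data partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def parallel_kmeans(points, centroids, workers=2, iterations=10):
    # Split the data into one partition per worker (round-robin).
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            results = list(pool.map(lambda ch: partial_sums(ch, centroids), chunks))
            # Reduce step: combine partial sums, then recompute each centroid.
            new_centroids = []
            for k, old in enumerate(centroids):
                total = sum(counts[k] for _, counts in results)
                if total == 0:
                    new_centroids.append(old)  # keep the centroid of an empty cluster
                    continue
                new_centroids.append([
                    sum(sums[k][d] for sums, _ in results) / total
                    for d in range(len(old))
                ])
            if new_centroids == centroids:
                break  # assignments are stable, so stop early
            centroids = new_centroids
    return centroids

# Hypothetical sample: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = parallel_kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centers)
```

Threads are used here only to keep the sketch self-contained; in CPython they will not speed up pure-Python arithmetic. On a real cluster the same map/reduce structure would more likely be spread across processes or nodes.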

Visual Representation of Data Mining results linking with serial optimization models

Thank You