Knowledge and Data Management in Grids

Domenico Talia

CoreGRID / University of Calabria

CoreGRID Summer School – Budapest – 3-7 September, 2007

(or using Grids to climb data mountains and find knowledge nuggets)

European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies

- 2 -

AGENDA

• Introduction

• Objectives

• Some Data Grid Projects

• Distributed Data Mining

• The KDM Institute

• Conclusions


- 3 -

Introduction (1)

• Data and knowledge management is becoming a key element in Grids, as important as or more important than high-performance delivery.

Distributed Knowledge and Data Management: dealing with issues concerning the representation, storage, querying, mining, exchange and integration of data (and resulting knowledge) in dynamic distributed environments.

• These issues are addressed today by exploiting features offered by Grid/P2P/GC/UC (Distributed) Technologies.


- 4 -

Introduction (2)

• Many activities all over the world on

– GRID/P2P databases and distributed repositories

– Distributed metadata management

– Pervasive information systems

– GRID-based digital libraries

– Distributed data streaming management

– Distributed knowledge management

– Data-oriented services

• A more important role is expected in the near future.


- 5 -

Introduction (3)

• Today the information stored in digital data archives is enormous and its size is still growing very rapidly.

The world created 161 exabytes (161 billion gigabytes) of digital information in 2006.

(source: IDC)


- 6 -

Introduction (4)

• Lots of data is being collected and warehoused.

– Data is collected and stored at enormous speeds in local databases, from remote sources, from the environment and from the sky.

– Traditional techniques are infeasible for large volumes of raw data.

• Scientific simulations generate terabytes of data.

– Huge data sets are hard to understand.

– Most data will never be examined by humans; it is analyzed and summarized by computers.

• Storage costs are currently decreasing faster than computing costs: this trend makes things worse.


- 7 -

Introduction (5)

Whereas until a few decades ago the main problem was the shortage of information, the challenge now seems to be

– the very large volume of information to deal with and

– the associated complexity of processing it and extracting significant and useful parts or summaries.


- 8 -

Data Management in Science

• Data intensive applications are those that explore, query, analyze, visualize and, in general, process very large-scale data sets.

• Computational science is evolving toward data intensive applications that include data integration and analysis, information management, and knowledge discovery.

• Data intensive applications in science help scientists in hypothesis formation; data intensive applications also help companies to provide better, customized services and support decision making.


- 9 -

Evolution of Science Methods

Jim Gray's formulation of the evolution of science methodologies:

– A thousand years ago: science was empirical, describing natural phenomena.

– Last few hundred years: a theoretical branch, using models and generalizations.

– Last few decades: a computational branch, simulating complex phenomena.

– Today: data exploration (eScience), unifying theory, experiment, and simulation.

(Data captured by instruments, or generated by simulator; processed by software; information/knowledge stored in computer; scientist analyzes database/files, using data management and statistics.)


- 10 -

GRIDS: From Computing to Data (1)

• The use of computers is changing the way we make discoveries and is improving both the speed and the quality of scientific discovery processes.

• In this scenario the Grid provides effective computational support for distributed data intensive applications and for knowledge discovery from large and distributed data sets.

• Grid services can be the basic element for composing software and data elements, and for executing complex applications on Grid and Web systems.


- 11 -

GRIDS: From Computing to Data (2)

• The Grid makes it possible to federate and share heterogeneous resources and services, such as software, computers, storage, data and networks, in a dynamic way.

• Today the Grid is not just compute cycles; it is also a distributed data management infrastructure. By integrating these two features with “smart” algorithms we can obtain a knowledge-intensive platform.

• In recent years many significant Grid-based data intensive applications and infrastructures have been implemented.

• The service-based approach allows the integration of Grid and Web for handling data.


- 12 -

The EUROPEAN DATA GRID

The LHC generates 1 GB/s, or 10 PB/year

Applications: particle physics, earth observation, bioinformatics & medical

http://cern.ch/eu-datagrid/
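As a rough consistency check on the two LHC figures quoted on this slide, the short Python sketch below computes how much data-taking time per year turns 1 GB/s into 10 PB. Decimal units (1 PB = 1,000,000 GB) are an assumption made here; the slide does not say which convention it uses.

```python
# Figures taken from the slide.
RATE_GB_PER_S = 1        # 1 GB/s
YEARLY_VOLUME_PB = 10    # 10 PB/year

# Assumption (not stated on the slide): decimal units.
GB_PER_PB = 1_000_000

# Time needed at the quoted rate to accumulate the quoted yearly volume.
seconds_of_data_taking = YEARLY_VOLUME_PB * GB_PER_PB / RATE_GB_PER_S
days_of_data_taking = seconds_of_data_taking / 86_400

# Fraction of a 365-day calendar year spent recording data.
seconds_per_year = 365 * 86_400
duty_cycle = seconds_of_data_taking / seconds_per_year

print(f"{seconds_of_data_taking:.0f} s of data taking")
print(f"{days_of_data_taking:.1f} days")
print(f"duty cycle: {duty_cycle:.0%}")
```

Under these assumptions the two figures agree if data is recorded for about 10 million seconds per year, that is, roughly a third of the calendar year.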


- 13 -

NASA INFORMATION POWER GRID

First “production quality” Grid

Linking NASA and academic supercomputing resources at 10 sites

Applications: computational fluid dynamics, meteorological data mining, Grid benchmarking

http://www.ipg.nasa.gov/


- 14 -

TERAGRID

Linking supercomputers through a high-speed network (4 × 10 Gbps) between SDSC, Caltech, Argonne & NCSA

Compute and data services for science applications & users

http://www.teragrid.org/


- 15 -

GOOGLE & NASA

Source: Charlie Catlett blog – Jan 2007


- 16 -

TerraService.NET

• A photo of the United States

– 1 meter resolution (photographic/topographic)

– USGS data

– Some demographic data (BestPlaces.net)

– Home sales data

– Linked to Encarta Encyclopedia

• 15 TB raw, 6 TB cooked (grows 10 GB/week)

• Offered as a Web Service to many applications (business, public administrations, scientists)


- 17 -

OPEN SCIENCE GRID

• The Open Science Grid links storage and computing resources at more than 30 sites across the United States

• It supports a variety of services and applications, many concerned with large-scale data analysis.

• Thousands of computers and tens of terabytes of storage.


- 18 -

myExperiment

• myExperiment is a collaborative research environment which enables scientists to share, re-use and repurpose experiments.

• myExperiment has been influenced by social networking programs such as Wired and Flickr, and is based on the mySpace infrastructure.

• myExperiment creates an environment for scientists to adopt Grid technologies, where they can define when they share data, with whom they share it, and how much of it can be accessed.

"Scientists would rather share their toothbrush than their data" Mike Ashburner, University of Cambridge


- 19 -

DATA MINING ON GRIDS


- 20 -

Data Mining and Computational Needs

• It is not uncommon to have sequential data mining applications that require days or weeks to complete their task.

• Parallel computing can bring significant benefits to the implementation of data mining and knowledge discovery applications by exploiting the inherent parallelism of data mining algorithms.

Main goals: performance improvements of existing techniques, implementation of new (parallel) techniques and algorithms, concurrent analysis with different data mining techniques, and result integration to get a more accurate model → Ensemble Learning


- 21 -

Data Mining and Computational Needs

• Today much data is distributed, either geographically or locally.

• When

– large data sets are coupled with

– geographic distribution of data, users, and systems,

it is necessary to combine different technologies for implementing high-performance distributed knowledge discovery systems.


- 22 -

Parallel and Distributed DATA MINING

– Parallel data mining

• Task or control parallelism

• Independent parallelism

• SPMD parallelism

• Hybrid parallelism

– Distributed data mining

• Voting

• Meta-learning, ensemble learning etc.

Parallel data mining can be a component of distributed data mining.
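Voting, listed on this slide under distributed data mining, can be sketched as follows: each site trains a model on its own data, and the local predictions are combined by majority. This is a minimal illustration, not a method taken from the slides; the three threshold models and all function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most local models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical local models: each stands for a classifier trained at a
# different site on its own data, reduced here to a simple threshold rule.
def model_site_a(x):
    return "high" if x > 10 else "low"


def model_site_b(x):
    return "high" if x > 20 else "low"


def model_site_c(x):
    return "high" if x > 30 else "low"


LOCAL_MODELS = [model_site_a, model_site_b, model_site_c]


def ensemble_predict(x):
    """Ask every local model, then combine the answers by majority vote."""
    return majority_vote([model(x) for model in LOCAL_MODELS])


print(ensemble_predict(25))  # two of three models say "high"
print(ensemble_predict(15))  # two of three models say "low"
```

Only the predictions travel between sites in this scheme, which is one reason voting suits data that cannot be moved to a single location.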


- 23 -

Parallel and Distributed DATA MINING

Three main strategies in the exploitation of parallelism in data mining algorithms: independent parallelism, control parallelism and SPMD parallelism.

• Independent parallelism: processes are executed in parallel in an independent way; generally each process has access to the whole data set.

• Control parallelism (or task parallelism): each process executes different operations on (a different partition of) a data set.

• SPMD parallelism: a set of processes executes in parallel the same algorithm on different partitions of a data set; processes exchange partial results.

D. Talia, "Parallelism in Knowledge Discovery Techniques", Proc. Sixth Int. Conference on Applied Parallel Computing, Helsinki, LNCS 2367, pp. 127-136, June 2002.
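The three strategies can be contrasted on a toy data set. The sketch below is only an illustration under stated assumptions: Python threads stand in for parallel processes, and the data, the operations and the helper `run_parallel` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor

DATA = [4, 8, 15, 16, 23, 42]


def run_parallel(tasks):
    """Run zero-argument callables concurrently; return results in task order."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# Independent parallelism: every worker runs on its own and sees the whole
# data set (here: the same counting job with a different threshold each).
def count_above(threshold):
    return sum(1 for x in DATA if x > threshold)


independent = run_parallel([lambda t=t: count_above(t) for t in (5, 15, 30)])

# Control (task) parallelism: each worker executes a different operation.
control = run_parallel([lambda: min(DATA), lambda: max(DATA), lambda: len(DATA)])

# SPMD parallelism: the same operation on different partitions, followed by
# an exchange of partial results (partial sums and counts give the mean).
partitions = [DATA[0:2], DATA[2:4], DATA[4:6]]
partials = run_parallel([lambda p=p: (sum(p), len(p)) for p in partitions])
total, count = map(sum, zip(*partials))
spmd_mean = total / count

print(independent, control, spmd_mean)
```

The SPMD case is the only one that needs a combination step: no worker can compute the global mean from its own partition alone.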


- 24 -

Task and SPMD Parallelism

[Figure: Work1, Work2, Work3 and Data1, Data2, Data3 on a uniprocessor, compared with task parallelism and SPMD parallelism on processors P0, P1, P2.]


- 25 -

Parallel DM Strategies

• These three basic strategies for parallelizing data mining algorithms are not necessarily alternatives to one another.

• They can be combined to improve the performance, scalability and accuracy of results.

• With parallel strategies, different data partitioning strategies can be used:

– sequential partitioning: separate partitions without overlapping

– cover-based partitioning: some data can be replicated on different partitions

– range-based query partitioning: partitions based on queries that select data according to attribute values.
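The three partitioning strategies can be sketched with plain Python lists. The function names, the `overlap` parameter and the sample records are assumptions made for this illustration; the slides do not prescribe how the overlap of a cover-based partition is chosen.

```python
def sequential_partitions(data, p):
    """Split data into p contiguous partitions without overlapping."""
    size = -(-len(data) // p)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(p)]


def cover_partitions(data, p, overlap):
    """Like sequential partitioning, but each partition also holds the first
    `overlap` items of the next one, so some data is replicated."""
    size = -(-len(data) // p)
    return [data[i * size:(i + 1) * size + overlap] for i in range(p)]


def range_partitions(records, attribute, bounds):
    """One partition per (low, high) query: records with low <= value < high."""
    return [[r for r in records if low <= r[attribute] < high]
            for low, high in bounds]


items = list(range(6))
print(sequential_partitions(items, 3))
print(cover_partitions(items, 3, overlap=1))

people = [{"age": 25}, {"age": 40}, {"age": 61}]
print(range_partitions(people, "age", [(0, 30), (30, 60), (60, 120)]))
```

Replication in the cover-based case trades extra storage for fewer remote accesses near partition boundaries.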


- 26 -

Parallelism in DATA MINING Techniques

• Parallel Decision Trees

– tree construction in parallel (one process per subtree)

• Discovery of Association Rules in Parallel

– rule and/or data partitioning on different processors

• Parallel Neural Networks

– parallelism exploitation: training, layers, neurons, weights

• Parallel Rough Set mining

– parallel computing of reducts (construction of the rows of the discernibility matrix)

• Parallel Cluster Analysis

– different clusterings in parallel, data partitioning, computing the similarity matrix in parallel.


- 27 -

Parallel Decision Trees

Classification: assigning new items to predefined classes.

Tree leaves represent classes and tree nodes represent attribute values.

Task parallel approach

One process is associated with each sub-tree.

– The search occurs in parallel in each sub-tree.

– The degree of parallelism P is equal to the number of active processes at a given time.

[Figure: a decision tree whose sub-trees are assigned to processes P1, P2 and P3.]


- 28 -

Parallel Decision Trees

SPMD approach

Each process classifies the items of a subset of data.

– The P processes search in parallel on the whole tree, each using a partition of the data set (D/P).

– The global result is obtained by exchanging partial results.

[Figure: processes P1, P2 and P3.]

– The data set can be partitioned by:

• partitioning the tuples of the data set: (D/P) tuples per processor.

• partitioning the N attributes of each tuple: D tuples of (N/P) attributes per processor.
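A minimal sketch of the SPMD approach to classification, under stated assumptions: the tiny tree, the attribute names and the data are hypothetical, and the P processes are simulated one after another instead of running in parallel. Both ways of partitioning the data set are shown.

```python
from collections import Counter


# Hypothetical tree: inner nodes test an attribute value, leaves hold a class.
def classify(item):
    if item["outlook"] == "sunny":
        return "play" if item["humidity"] == "normal" else "stay"
    return "play"


def partition_tuples(dataset, p):
    """Tuple partitioning: about D/P tuples per process (round-robin)."""
    return [dataset[i::p] for i in range(p)]


def partition_attributes(dataset, attributes, p):
    """Attribute partitioning: all D tuples, about N/P attributes per process."""
    groups = [attributes[i::p] for i in range(p)]
    return [[{a: row[a] for a in group} for row in dataset] for group in groups]


def spmd_classify(dataset, p):
    """Each simulated process classifies its own partition with the whole tree;
    the partial class counts are then merged into the global result."""
    partial = [Counter(classify(item) for item in part)
               for part in partition_tuples(dataset, p)]
    return sum(partial, Counter())


DATASET = [
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "sunny", "humidity": "normal"},
    {"outlook": "rain", "humidity": "high"},
    {"outlook": "overcast", "humidity": "normal"},
]

print(spmd_classify(DATASET, 2))
print(partition_attributes(DATASET, ["outlook", "humidity"], 2)[0])
```

With tuple partitioning every process can classify its items without help; with attribute partitioning a process holds only part of each tuple, so evaluating the tree needs more communication.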


- 29 -

Parallel Cluster Analysis

SPMD approach: each processor executes the same algorithm on a different partition of the data set to compute partial clustering results.

Local results are then exchanged among all the processors to get global values on every processor.

The global values are used on all processors to start the next clustering step, until convergence is reached or a given number of steps has been executed.

The SPMD strategy can also be used to implement clustering algorithms in which each processor generates a local approximation of a model (classification) that, at each iteration, can be passed to the other processors, which can use it to improve their own clustering model.
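The iteration described on this slide (local step, exchange, global values, repeat) can be sketched as follows. One-dimensional k-means is used as a stand-in clustering algorithm, which is an assumption of this example and not the algorithm of the slides; the processors are simulated sequentially and all names are hypothetical.

```python
def local_step(partition, centroids):
    """Runs on every processor: assign local points to the nearest centroid
    and return the partial sums and counts per cluster."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def global_exchange(partials, centroids):
    """Combine the local results into global centroids. In a real SPMD run
    this would be an all-to-all exchange so every processor gets the values."""
    updated = []
    for k, old in enumerate(centroids):
        total = sum(sums[k] for sums, _ in partials)
        n = sum(counts[k] for _, counts in partials)
        updated.append(total / n if n else old)  # keep empty clusters in place
    return updated


def spmd_kmeans(partitions, centroids, max_steps=20, tol=1e-9):
    """Iterate until convergence or until max_steps have been executed."""
    for _ in range(max_steps):
        partials = [local_step(p, centroids) for p in partitions]
        updated = global_exchange(partials, centroids)
        if max(abs(a - b) for a, b in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids


partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
print(spmd_kmeans(partitions, [0.0, 5.0]))
```

Only the per-cluster sums and counts are exchanged at each step, so the communication volume depends on the number of clusters and not on the size of the data set.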


- 30 -

The SPMD approach has been used in P-AutoClass.

Execution times



- 31 -

The SPMD approach has been used in P-AutoClass.

Speedup
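For reference, speedup and efficiency are conventionally defined as below. The symbols $T_1$ (execution time on one processor) and $T_P$ (execution time on $P$ processors) are notation introduced here; the measured P-AutoClass values from the original charts are not reproduced in this text.

```latex
S(P) = \frac{T_1}{T_P}, \qquad E(P) = \frac{S(P)}{P}
```

Ideal (linear) speedup corresponds to $S(P) = P$, that is, $E(P) = 1$.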



- 32 -

Towards Data and Knowledge Services

To support the process of unification of data management and knowledge discovery systems with Grid technologies for providing knowledge-based Grid services.

• This objective can be achieved through

– development of techniques and tools for supporting data mining applications and

– integration of Data and Computation Grids with Knowledge Grids.


- 33 -

Parallel & Distributed KDD on Grids

The basic principles that motivate the architecture design of Grid-aware KDD systems:

• Data heterogeneity and large data size management

• Algorithm integration and independence

• Grid awareness

• Openness

• Scalability

• Security and data privacy.


- 34 -

What The Grid Offers

• Grid tools, such as the Globus Toolkit, Legion and UNICORE, provide basic services that can be effectively exploited in the development of distributed data and knowledge management applications.

• Data Grid middleware (e.g. Globus RLS, RFT, …) implements data management architectures based on two main services: storage system and metadata management.

• Additional services are needed.


- 35 -

Parallel & Distributed KDD on Grids

• By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and manage data and knowledge distribution.

• Efforts are ongoing for the development of

– data access and management

– knowledge discovery tools and services

on the Grid.

Examples:

- OGSA-DAI, OGSA-DQP, Discovery Net, KNOWLEDGE GRID, Data Cutter, GDIS, …


- 36 -

KDD PROJECTS and PROTOTYPES

• KNOWLEDGE GRID (University of Calabria)

• Discovery Net (Imperial College, e-Science)

• DataMiningGrid (FP6 EU project)

• ADaM (University of Alabama)

• Terra Wide Data Mining Testbed (University of Illinois at Chicago)

• Grid Miner (University of Vienna)

• Data Cutter (University of Maryland)

• Weka4WS (University of Calabria)


- 37 -

COREGRID KDM INSTITUTE


- 38 -

CoreGRID KDM Institute

• The KDM Institute is providing a collaborative environment for 13 research teams working on:

– Distributed storage management on GRIDs

– Data Access and Semantic GRID techniques and tools for supporting data intensive applications

– Knowledge discovery and data mining in GRIDs.

• With focus also on

– Service Level Agreement Negotiation and

– Security Requirements for Data Management


- 39 -

Objectives for the Institute (1)

MAIN OBJECTIVE:

• Strengthen joint activities of European research groups and promote larger leading teams, operating as a Research Institute working on models and tools for KNOWLEDGE and DATA MANAGEMENT in GRIDs and P2P SYSTEMS.

• Consolidate the research activities carried out so far by partners in the KDM Institute and among the CoreGRID Institutes.

• Establish cooperation with future partners and look for interaction with other players in the area of KDM.


- 40 -

Objectives

• Discuss R&D issues in Data Management in Grids scenarios.

• Identify:

– Missing Solutions in Distributed Data Management

– Research Challenges in Global Data Management

– Potential Overlaps and Gaps in current Research Activities

– Common vision of Data Management and research interests

– Industrial needs and transfer

– Synergies and future common work


- 41 -

Partners and Tasks

• JPA: three tasks in the areas

1. Distributed Storage Management

• Storage Infrastructure

• Storage Management Mechanisms

• Specifying Management Policies

2. Information and Knowledge Management

• Semantic Modeling

• Semantic Representation

• Standardization and Integration

• Data Integration and Query reformulation in OGSA Grids

3. Data Mining and Knowledge Discovery

• Semantic Mapping of KDD entities

• Intelligent Queries

• Distributed Knowledge Grid Services

• Monitoring Services

Partners

• CETIC (Belgium)

• CNR-ISTI (Italy)

• ICS-FORTH (Greece)

• INFN (Italy)

• PSNC (Poland)

• STFC-RAL (UK)

• SZTAKI (Hungary)

• University of Calabria (Italy)

• University of Cyprus (Cyprus)

• University of Manchester (UK)

• University of Newcastle (UK)

• CNR-ICAR (Italy)

• Universidade Nova de Lisboa (Portugal)

More than 50 active researchers and PhD students are involved.


- 42 -

Institute Roadmap (1)

• The research tasks that compose the KDM Institute give a unified vision of data and knowledge management in Grids through a layered approach that goes from efficient data storage techniques (Task 2.1) up to information management (Task 2.2) and knowledge representation and discovery (Task 2.3).

• The main vision of this Institute is based on common models and frameworks that can integrate the research results of the involved partners and result in common activities that advance the present results and systems.


- 43 -

Institute Roadmap (2)

• We started with Phase 1: “Exchanging partner information, experiences, and knowledge about techniques, tools and systems for Data and Knowledge Grids.”

• Then moved to Phase 2: “Sharing and integration of common goals, research results, projects and system prototypes of Environments and Services for Data and Knowledge-based Grids.”

• Now we are in Phase 3: use of the results of the previous phases for providing a set of solutions in the KDM area and for envisioning a unified framework for handling data, information and knowledge on GRIDs.


- 44 -

KDM Research Groups

• The Joint Research Groups are working on:

1. GRID Data Storage Access and Management Architecture

• Partners: FORTH, PSNC, SZTAKI, UCY, INFN

2. Storage security

• Partners: INFN, FORTH, STFC

3. GRID Data Integration Models and Architectures

• Partners: UNICAL, UoM

4. Methods for Deriving GRID Trust and Security Policies for Managing VOs

• Partners: CETIC, STFC

5. Distributed Data Mining in GRIDs and P2P Systems

• Partners: UNICAL, ISTI-CNR, UCY

6. Adaptivity in Distributed Query and Workflow

• Partners: UoM, UNCL


- 45 -

Achievements – KDM Research Work

• The KDM Institute developed new research results on:

– Grid Services for distributed data mining (Weka4WS, Knowledge Grid)

– Dynamic loading of services: extensions to support loading at different granularities (DYNASOAR)

– Data integration and query reformulation in Grids (GDIS on OGSA-DQP)

– A methodology for deriving Grid Trust and Security Policies for VOs

– A distributed storage virtualization architecture

– Adaptive scalable data mining algorithms (Frequent Itemsets Mining - FIM)

– A model for self-configuring Storage Area Networks (Conductor)

– The design of an ontology for Grid scheduling

– Scalable information services for large-scale Grids


- 46 -

Cooperation With Other Institutes

• Semantic support for Meta-scheduling in Grids (University of Manchester & Fraunhofer SCAI & JUELICH) WP2-WP6

• P2P Models for Resource Discovery, Data Management and Reliable Grid Services (University of Calabria & SICS & KTH) WP2-WP4

• Grid Scheduling for data-intensive applications (University of Dortmund & University of Calabria) WP2-WP6

• Dynamic adjustment of block size for minimum data transfer cost in Data Grids (University of Cyprus & University of Manchester) WP2-WP4

• Trust management in Grids (STFC, University of Coimbra) WP2-WP4

• Public resource computing for Data Management (University of Calabria & University of South Wales) WP2-WP7


- 47 -

Joint publications

• Many joint papers and technical reports on the KDM topics have been published in journals and conferences by KDM researchers in the second year of activities.

• A book has been published in the Springer CoreGRID series as the post-proceedings of the First Workshop on Knowledge and Data Management.

• The post-proceedings of the First CoreGRID Middleware workshop have been published by Springer in the LNCS series.


- 48 -

Conclusions (1)

• Science and industry must be able to handle very large data sources (archives, databases, flat files).

• Data management and knowledge discovery tools are necessary to find what is interesting in them.

• Grids may be used as a distributed infrastructure for service-based data intensive applications.


- 49 -

Conclusions (2)

• We are much more able to store data than to extract knowledge from it.

• The integration of knowledge discovery and Grid technologies can help in this task.

• Future Applications in Science and Business:

• Collection of world-wide Grid/Web services implementing complex applications.

Internet-scale distributed computing integrating data and knowledge services + computing services


- 50 -

THANKS

www.coregrid.net
