Analytics and Fusion for Distributed Big Data

Arjun Shankar, Ph.D.

Thanks also to: James Horey and Arvind Ramanathan

ORNL Computational Sciences and Engineering Division

SOS, Jekyll Island, Georgia, March 2013



2 Managed by UT-Battelle for the U.S. Department of Energy

Outline

• Context

• Examples of Distributed Big Data Analytics

• Emerging needs as Big Compute and Big Data analytics converge


Context


Volume and rates

• Twitter updates: 400 M/d

• Facebook likes/comments: 2.7 B/d

• Facebook shared contents: 30 B/m

• World emails: 419 B/d

• YouTube storage: 76 PB/yr

• YouTube traffic: 16.2 EB/yr

• World social media: 1.8 ZB (x2 every 2 years)


Volume and rates

• 2.5 m Telescope: 200 GB/d

• Ion Mobility Spectroscopy: 10 TB/d

• 3D X-ray Diffraction Microscopy: 24 TB/d

• Boeing 737 cross-country flight: 240 TB

• Personal Location Data: 1 PB/yr

• Astrophysics Data: 10 PB (2014)

• Square Kilometer Array: 480 PB/d


Big Data = Volume, Variety, Velocity


Adam Jacobs, “The Pathologies of Big Data” (2009)

“Data is typically acquired in a transactional fashion... The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it.”

Learning from data is a major problem

How do we harness our infrastructure to do data management and data analytics?


Data Systems and Analytics

Pre-History 1900-1970’s

Databases 1970→

Networking and Analytics 1980→

• Machine learning

• Large-scale cluster computing

Infrastructure 2000→

• Large scale commodity processing (Google, Hadoop)

• NoSQL systems

• Virtualization

Web 3.0 2005→

• Mobile

• Semantics, linked-data (IBM – Watson)

• Apps, social-media


Big Computing Developments

Prehistory (1930-1970): ENIAC, ILLIAC, CDC, Cray

Vector, Pipelined, Compilers, Interconnects, FLOPS

Shared/Distributed Memory Hierarchies, Storage

Multicore, Heterogeneity

Big Data

Programming Models

Flexible Data Access


Examples of Distributed Big Data Analytics


Large Scale Infrastructures for Analytics

• Centralized (aka Big Compute) HPC

• Centralized Big Compute/Data platforms

• Distributed Big Compute/Data platforms

• Wide-Area Distributed Big Data platforms


Big Data Analytics

1. Centralized (aka Big Compute) HPC

– First Principles, Floating Point Solvers, Prospective Data Generation

2. Centralized Big Compute/Data platforms

– Discrete/Combinatorial, Retrieval, Indexing Lookup, Query, Graph-lookups, Data Ingest (ETL)

3. Distributed Big Compute/Data platforms

– Virtualization and Utility Computing, Machine Learning Stack

4. Wide-Area Distributed Big Data platforms

– Distributed Sensing, Correlate, and Actuate – Real-time, HPC Resource Use


(#2) eBay – Big Data Analytics Example

Killer app: mining web logs

Slides from: Tom Fastener, eBay Principal Architect, @ High-Performance Transaction Systems, Asilomar, c. 2011


(#3) Healthcare Overpayments Analysis (@ORNL)

[Figure: inaccurate payments (preliminary estimates); labels: $10M, $7M, Deceased beneficiaries, Omitted from consolidated billing]

Hadoop infrastructure (NAS, compute nodes, name node and job tracker)

• Data is loaded into Hadoop filesystem (HDFS)

• User submits SQL to Hive

• Hive transforms SQL to MapReduce (one or more MapReduce jobs)

• MapReduce is executed in Hadoop cluster

Fraud patterns
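The SQL-to-MapReduce step on this slide can be pictured with a toy example. The sketch below assumes a hypothetical query, `SELECT beneficiary, SUM(paid) FROM claims WHERE deceased GROUP BY beneficiary`, over a made-up `claims` table; neither the query nor the schema comes from the ORNL study. The WHERE filter and the GROUP BY key go into the map phase and the aggregate goes into the reduce phase. It runs in one process and only shows the shape of the computation, not how Hive actually plans or executes a job.

```python
from collections import defaultdict

# Hypothetical claim records; field names and values are illustrative only.
CLAIMS = [
    {"claim_id": 1, "beneficiary": "A", "deceased": True, "paid": 120.0},
    {"claim_id": 2, "beneficiary": "B", "deceased": False, "paid": 80.0},
    {"claim_id": 3, "beneficiary": "A", "deceased": True, "paid": 30.0},
    {"claim_id": 4, "beneficiary": "C", "deceased": True, "paid": 50.0},
]


def map_phase(record):
    """Apply the WHERE filter and emit (GROUP BY key, value) pairs."""
    if record["deceased"]:
        yield record["beneficiary"], record["paid"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Apply the aggregate, here SUM(paid)."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


totals = run_job(CLAIMS)
```

On a real cluster the map calls run in parallel on the nodes that hold the HDFS blocks, and the shuffle moves data across the network; that is the part this single-process sketch leaves out.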


(#4) Real-Time Distributed Analytics

Centralized data analysis across infrastructure and data modalities is prohibitive (and often too late)!


Distributed Pattern Storage and Correlation

If (Sensor-noticed-X && Report-said-Y && Within-Last-Day) Notify!

Message flow up a hierarchy

Middleware Optimizes In-Network Storage
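The notification rule on this slide can be sketched as a small predicate. The event format (dictionaries with `kind` and `time` fields) and the function name `should_notify` are hypothetical, and the sketch assumes that "Within-Last-Day" means both the sensor observation and the report fall inside the last 24 hours. A node in the hierarchy could evaluate such a rule on the events it has stored locally before deciding whether to pass a message upward.

```python
from datetime import datetime, timedelta


def should_notify(sensor_events, reports, now, window=timedelta(days=1)):
    """Return True if a sensor saw X and a report said Y within the window."""
    cutoff = now - window
    sensor_hit = any(
        e["kind"] == "X" and e["time"] >= cutoff for e in sensor_events
    )
    report_hit = any(
        r["kind"] == "Y" and r["time"] >= cutoff for r in reports
    )
    return sensor_hit and report_hit


# Illustrative events only; timestamps are made up.
now = datetime(2013, 3, 19, 12, 0)
sensor_events = [{"kind": "X", "time": now - timedelta(hours=3)}]
recent_reports = [{"kind": "Y", "time": now - timedelta(hours=20)}]
stale_reports = [{"kind": "Y", "time": now - timedelta(days=2)}]

notify_recent = should_notify(sensor_events, recent_reports, now)
notify_stale = should_notify(sensor_events, stale_reports, now)
```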


Processing in the Network

Application specific open questions

• Value of Information

– What to keep and what to drop (or send later)?

– When to take notice?

• Infrastructure

– Where in the hierarchy to compute?

– How to set up the infrastructure to enable the forward joins?


Big Data Analytics on Big Compute - Emerging Applications

1. Graph analytics

2. Hypothesis driven to data driven analytics

3. Compute-analyze-compute paradigm


(1) Data Analytics Beyond Data-Parallelism

Data-Parallel (Map Reduce)

• Cross Validation

• Feature Extraction

• Computing Sufficient Statistics

Graph-Parallel

• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.

• Semi-Supervised Learning: Label Propagation, CoEM

• Graph Analysis: PageRank, Triangle Counting

• Collaborative Filtering: Tensor Factorization

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012


PageRank

What’s the rank of this user?

Rank?

Depends on rank of who follows her

Depends on rank of who follows them…

Loops in graph → Must iterate!

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012


PageRank Iteration

Iterate until convergence:

$$R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} w_{ji} \, R[j]$$

– α is the random reset probability

– w_ji is the probability of transitioning (similarity) from j to i; the sum runs over the edges (j, i) into vertex i

“My rank is weighted average of my friends’ ranks”

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012
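A minimal sketch of this iteration in plain Python follows. It assumes the update has the form R[i] = α + (1 − α) Σ w_ji R[j], summed over the edges (j, i) into vertex i, which is how the slide's fragments and definitions read. The function name `pagerank`, the edge-dictionary format, the default α = 0.15, and the stopping tolerance are illustrative choices, not part of the slide. Every vertex is updated from the previous round's ranks (a synchronous sweep), whereas GraphLab also supports asynchronous updates.

```python
def pagerank(edges, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] to convergence.

    edges maps (j, i) pairs to the weight w_ji of the edge from j to i.
    """
    vertices = {v for edge in edges for v in edge}
    rank = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Every vertex starts from the reset term alpha.
        new_rank = {v: alpha for v in vertices}
        # Each edge (j, i) passes a weighted share of j's rank to i.
        for (j, i), w in edges.items():
            new_rank[i] += (1.0 - alpha) * w * rank[j]
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Small made-up graph; outgoing weights of each vertex sum to 1.
example_edges = {
    ("a", "b"): 1.0,
    ("b", "a"): 0.5,
    ("b", "c"): 0.5,
    ("c", "a"): 1.0,
}
example_ranks = pagerank(example_edges)
```

Because each rank depends on the ranks of its in-neighbors, and those depend on theirs, loops in the graph make a single pass insufficient; the loop above repeats until the largest change in any rank drops below the tolerance.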


Properties of Graph Parallel Algorithms

• Dependency Graph

• Iterative Computation (My Rank, Friends Rank)

• Local Updates

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012


Addressing Graph-Parallel ML

Data-Parallel (Map Reduce)

• Cross Validation

• Feature Extraction

• Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Graph-Parallel Abstraction)

• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.

• Semi-Supervised Learning: Label Propagation, CoEM

• Data-Mining: PageRank, Triangle Counting

• Collaborative Filtering: Tensor Factorization

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012

GraphLab software tailors abstractions for asynchronous updates, “natural” graphs, in-memory computations, etc.


(2) Scaling Analytics: Let a Million Hypotheses Bloom ...

• Classic scientific discovery scenario:

– Choose relevant D dimensions, evaluate on N samples

• Modern data-driven analysis:

– Measure everything, hope to discover relevant D using N samples

• The progression to ultrascale:

– Characterized by interdependency (not necessarily redundancy); many inter-related subsystems

[Figure: error plotted against dimensionality D and number of samples N, with panels comparing D and N (one with N fixed), and dependency graphs over variables i1 to i4 and i1 to i10]

Legend: Error; D: Dimensionality; N: #Samples


(3) Big Data Analytics Integrated with Big Compute: In Situ Machine Learning

Biological Data: High resolution all-atom simulations (Jaguar/Titan), > 1 petabyte

Infer biological function?

• Run MD (Big Compute)

• Save state (Big Data)

• Analyze state (Big Data Analytics)

• Save?
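The compute-analyze-compute loop on this slide can be sketched as follows. Everything here is a stand-in: `md_step` is a one-dimensional random walk rather than molecular dynamics, `is_interesting` is a hypothetical threshold test rather than a machine learning model, and the names are illustrative. The point is only the control flow, in which analysis runs between simulation steps and decides what is worth saving instead of writing every state to disk.

```python
import random


def md_step(state, rng):
    """Stand-in for one chunk of simulation: a random walk on one coordinate."""
    return state + rng.gauss(0.0, 1.0)


def is_interesting(state, threshold=2.0):
    """Stand-in for in situ analytics: flag states far from the origin."""
    return abs(state) > threshold


def run_in_situ(n_steps, seed=0):
    """Alternate compute and analysis, saving only the flagged states."""
    rng = random.Random(seed)
    state = 0.0
    saved = []
    for step in range(n_steps):
        state = md_step(state, rng)       # Run MD (Big Compute)
        if is_interesting(state):         # Analyze state (Big Data Analytics)
            saved.append((step, state))   # Save state (Big Data), selectively
    return saved


saved_states = run_in_situ(1000)
```

A post hoc workflow would save all 1000 states and analyze them afterwards; the in situ variant trades that flexibility for a smaller output, which matters when the full trajectory exceeds a petabyte.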


Big-Data Analytic Needs from Big-Compute

• Better programming abstractions and environments for

– Machine learning suites

– Automatic memory-hierarchy management libraries

– Automatic storage management libraries

– Simplified communication and synchronization abstractions

DARPA is asking for programming techniques to simplify big data analytics… (19th March 2013)


Thank you!

Discussion, Questions?