Analytics and Fusion for Distributed Big Data · •Examples of Distributed Big Data Analytics...

Arjun Shankar, Ph.D.

Thanks also to: James Horey and Arvind Ramanathan ORNL Computational Sciences and Engineering Division SOS, Jekyll Island, Georgia March 2013

Analytics and Fusion for

Distributed Big Data

2 Managed by UT-Battelle for the U.S. Department of Energy

Outline

• Context

• Examples of Distributed Big Data Analytics

• Emerging needs as Big Compute and Big Data analytics converge

Context

Volume and rates

Twitter Updates • 400 M/d

Facebook • Likes/Comments: 2.7 B/d

• Shared Contents: 30 B/m

World Emails • 419 B/d

YouTube • Storage: 76 PB/yr

• Traffic: 16.2 EB/yr

World Social Media • 1.8 ZB (x2 every 2 years)

Volume and rates

2.5 m Telescope 200 GB/d

Ion Mobility Spectroscopy 10 TB/d

3D X-ray Diffraction Microscopy 24 TB/d

Boeing 737 cross-country flight 240 TB

Personal Location Data 1 PB/yr

Astrophysics Data 10 PB (2014)

Square Kilometer Array 480 PB/d

Big Data = Volume, Variety, Velocity

Adam Jacobs, “The Pathologies of Big Data” 2009

“Data is typically acquired in a transactional fashion...The

trouble comes when we want to take that accumulated

data, collected over months or years, and learn something

from it.”

Learning from data is a major problem

How we harness our infrastructure to do data management

and data analytics?

Data Systems and Analytics

Pre-History 1900-1970’s

Databases 1970→

Networking and Analytics

1980→

• Machine learning

• Large-scale cluster computing

Infrastructure 2000→

• Large scale commodity processing (Google, Hadoop)

• NoSQL systems

• Virtualization

Web 3.0 2005→

• Mobile

• Semantics, linked-data (IBM – Watson)

• Apps, social-media

Big Computing Developments

Prehistory: ENIAC, ILLIAC, CDC, Cray

1930-1970

Vector, Pipelined, Compilers,

Interconnects, FLOPS

Shared/ Distributed

Memory Hierarchies,

Storage

Multicore, Heterogeneity

Big Data

Programming

Models

Flexible Data

Access

Examples of Distributed Big Data

Analytics

Large Scale Infrastructures for

Analytics

• Centralized (aka Big Compute) HPC

• Centralized Big Compute/Data platforms

• Distributed Big Compute/Data platforms

• Wide-Area Distributed Big Data platforms

Big Data Analytics

1. Centralized (aka Big Compute) HPC

– First Principles, Floating Point Solvers, Prospective Data Generation

2. Centralized Big Compute/Data platforms

– Discrete/Combinatorial, Retrieval, Indexing Lookup, Query, Graph-lookups, Data Ingest (ETL)

3. Distributed Big Compute/Data platforms

– Virtualization and Utility Computing, Machine Learning Stack

4. Wide-Area Distributed Big Data platforms

– Distributed Sensing, Correlate, and Actuate – Real-time, HPC Resource Use

(#2) Ebay – Big Data Analytics Example

Killer app: mining web logs

Slides from:Tom Fastener, 2011 Ebay

Principal Architect, @ High-Performance

Transaction Systems, Asilomar

c. 2011

(#3) Healthcare Overpayments

Analysis (@ORNL)

Deceased beneficiaries

Omitted from consolidated

billing

Inaccurate payments (preliminary estimates)

User submits SQL to Hive

Hive transforms SQL to MapReduce

MapReduce is executed in Hadoop cluster

SQL Hive

MapReduce job

Data is loaded into Hadoop filesystem (HDFS)

Compute nodes

Name node and job tracker

Hadoop infrastructure

Fraud patterns

(#4) Real-Time Distributed Analytics

Centralized data analysis across infrastructure and data modalities is prohibitive (and often too late)!

Distributed Pattern Storage and

Correlation

If (Sensor-noticed-X && Report-said-Y && Within-Last-Day)

Notify!

ssage f

Middleware

Optimizes In-

Network Storage

Processing in the Network

Application specific open questions

• Value of Information

– What to keep and what to drop (or send later)?

– When to take notice?

• Infrastructure

– Where in the hierarchy to compute?

– How to set up the infrastructure to enable the forward joins?

Big Data Analytics on Big Compute -

Emerging Applications

1. Graph analytics

2. Hypothesis driven to data driven analytics

3. Compute-analyze-compute paradigm

(1) Data Analytics Beyond Data-

Parallelism

Data-Parallel Graph-Parallel

Cross Validation

Feature Extraction

Map Reduce

Computing Sufficient Statistics

Graphical Models Gibbs Sampling

Belief Propagation

Variational Opt.

Semi-Supervised

Learning Label Propagation

Graph Analysis PageRank

Triangle Counting

Collaborative

Filtering Tensor Factorization

Slide courtesy: Prof.

Carlos Guestrin’s

GraphLab Workshop,

July 2012

PageRank

What’s the rank of this user?

Depends on rank of who follows her

Depends on rank of who follows them…

Loops in graph Must iterate!

Carlos Guestrin’s

GraphLab Workshop,

July 2012

PageRank Iteration

– α is the random reset probability

– wji is the prob. transitioning (similarity) from j to i

R[j] wji Iterate until convergence:

“My rank is weighted

average of my friends’ ranks”

Carlos Guestrin’s

GraphLab Workshop,

July 2012

Properties of Graph Parallel Algorithms

Dependency

Graph Iterative

Computation

My Rank

Friends Rank

Updates

Carlos Guestrin’s

GraphLab Workshop,

July 2012

Addressing Graph-Parallel ML

Data-Parallel Graph-Parallel

Cross Validation

Feature Extraction

Map Reduce

Computing Sufficient Statistics

Graphical Models Gibbs Sampling

Belief Propagation

Variational Opt.

Semi-Supervised

Learning Label Propagation

Data-Mining PageRank

Triangle Counting

Collaborative

Filtering Tensor Factorization

Map Reduce? Graph-Parallel Abstraction

Carlos Guestrin’s

GraphLab Workshop,

July 2012

GraphLab software tailors abstractions for asynchronous

updates, “natural” graphs, in-memory computations, etc.

(2) Scaling Analytics: Let a Million Hypotheses

Bloom ...

• Classic scientific discovery scenario:

– Choose relevant D dimensions, evaluate on N samples

• Modern data-driven analysis:

– Measure everything, hope to discover relevant D using N samples

• The progression to ultrascale:

– Characterized by interdependency (not necessarily redundancy); many inter-related subsystems

: Error

D : Dimensionality

N :#Samples

N : fixed

(3) Big Data Analytics Integrated with Big

Compute: In Situ Machine Learning

Biological Data

High resolution all-atom

simulations (Jaguar/Titan)

Infer biological function?

> 1 petabyte

Save state (Big Data)

Analyze state (Big Data

Analytics)

Run MD (Big Compute)

Big-Data Analytic Needs from Big-

Compute

• Better programming abstractions and environments for

– Machine learning suites

– Automatic memory hierarchies management libraries

– Automatic storage management libraries

– Simplify communication and synchronization abstractions

DARPA is asking for programming

techniques to simplify big data

analytics… (19th March 2013)

Thank you!

Discussion, Questions?

Analytics and Fusion for Distributed Big Data · •Examples of Distributed Big Data Analytics...

Documents

Big Data Analytics - 7. Resilient Distributed Datasets ...€¦ · Big Data Analytics 2. Apache Spark ApacheSparkStack Data platform: Distributedﬁlesystem/database I Ex: HDFS,HBase,Cassandra

Analytics MA-INF 3241- Lab Distributed Big Data · MA-INF 3241- Lab Distributed Big Data Analytics. Smart Data Analytics (SDA ) ... transport, meteorological, ... Teradata) are not

Analytics MA-INF 4223- Lab Distributed Big Data · 2018-02-08 · A generic stack for (big) Linked Data o Build on top of a state-of-the-art distributed frameworks (Spark, Flink)

Optimizing Big Data Analytics Frameworks in Geographically ...€¦ · Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters by Shuhao Liu A thesis submitted

E6893 Big Data Analytics Lecture 4: Big Data Analytics

Analytics and Big Data Analytics

Big Data & Analytics · Concepts, technologies and the IBM perspective Alberto Ortiz Big Data & Analytics Architect January 2015 Big Data & Analytics

Introduction - CAIDA · 2018-05-14 · Time Series Analytics Big Data Analytics Query Engine Tra !c Flow Analytics distributed time series DB distributed "ow tuple DB KAFKA databus

E6893 Big Data Analytics Lecture 2: Big Data Analytics

Machine Learning: from flight data to claims management · Analytics for BIG Data, Mobile Data Mining Semantic Computing Text Mining Distributed Analytics . ... sophisticated underwriting

Privacy Preservation in the Context of Big Data …ey204/pubs/talks/BIGDATA_PRIVACY.pdfOperations on big data Analytics – Realtime Analytics 10 Distributed Infrastructure 19 Computing

Introduction to Big Data & Architectures · Distributed Semantic Analytics Aims to develop scalable analytics algorithms based on Apache Spark and Apache Flink for analysing large

Hadoop, Big Data and Big Analytics 2014 - SAS...Hadoop, Big Data and Big Analytics 2014 3 waves of Big Analytics The Business Improvement Frameworks Big Analytics Use Cases - The data,

Analytics MA-INF 4223- Lab Distributed Big Datasewiki.iai.uni-bonn.de/_media/teaching/labs/...fundamentals_ii.pdf · MA-INF 4223-Lab Distributed Big Data ... Created directly from

Wide Area Analytics: Efficient Analytics for a Geo ...ijsrcseit.com/paper/CSEIT1833607.pdfKeywords: Big Data, Analytics, Geo-Distributed Datacenters I. INTRODUCTION Many large organizations

Big Data Analytics - York Universitypapaggel/docs/talks/talk... · 2019-05-10 · Big Data Technology & Analytics Computing Platforms Distributed Commodity, Clustered High-Performance,

Big Data and Analytics in Governmentmedia.govtech.net/GOVTECH_WEBSITE/EVENTS/PRESENTATION... · 2016-10-07 · “The promise of Big Data ... Distributed File Systems NoSQL Data Streams

Distributed Ledger Analytics

Big Analytics Without Big Hassles

Big Data Analytics with MATLAB - MathWorks · Big Data Analytics with MATLAB Dmitrij Martynenko, Application Engineer, ... • DDS (Data Distribution Service) ... Distributed Computing