CSIRO DIGITAL PRODUCTIVITY FLAGSHIP

Platform for Big Data Analytics and Visual Analytics: CSIRO Use Cases

Tomasz Bednarz | Research Team Leader

23rd February 2015 | Statistical Modelling and Analysis of Big Data Workshop 2015

The ARC Centre of Excellence in Mathematical and Statistical Frontiers in Big Data, Big Models and New Insights

Project Team: Piotr Szul, Yulia Arzhaeva, Luke Domanski, Ryan Lagerstrom, Surya Nepal, John Zic, John Taylor


Platform for Big Data Analytics and Visual Analytics

CSIRO Computational Simulation Sciences TCP project, Digital Productivity Flagship Platform for Big Data Analytics and Visual Analytics

Dual use of Platform:

• Support and foster a community around Big Data processing and visualisation

• Provide computing tools and services supporting CSIRO specific Big Data Analytics needs

What will the tools be:

• Facility (software + hardware)

• Portable VM or container image (run everywhere)

Platform for Big Data Analytics and Visual Analytics: Definition

The Platform is a software solution stack (on hardware infrastructure) that supports the development of big data analytics and visual analytics applications.

It is:

• Scalable: given appropriate hardware, it can scale to petabytes of data and thousands of nodes.

• Universal: it can be deployed on a variety of computational platforms (clouds, HPC clusters, dedicated clusters) and can use GPGPUs transparently.

• Integrated: is integrated with relevant CSIRO systems (e.g. Digital Access Portal, Bowen Clouds).


Isn’t Big Data a solved problem?

Can’t we just install the most popular software and be done with it?

No. For CSIRO, it is more complex:

Science has a different set of needs from the commercial sector

CSIRO = many disciplines/applications = different tool requirements

CSIRO = diverse large scale storage facilities, discipline specific/optimised data cubes, HPC parallel storage systems

CSIRO = diverse set of compute infrastructure

Platform for Big Data Analytics and Visual Analytics: Why?

What does Big Data Analytics mean to Science?

• Big data software survey and analysis

• R Big data package survey and analysis

Conceptual Platform Design

• Planning layered architecture

– Big picture view: available software, CSIRO Infrastructure + Science

• Plan of attack

Assessment of user requirements

• User and project group outreach

• Workshop Questionnaires and Abstracts

What we’ve been doing? (Understanding)

Understand Big Data Analytics in Science?

Scientist & CSIRO specific needs

Tools and software landscape

Big Picture Design

• Forest from the trees

• Layering: General to Specific, extensible, clear boundaries/responsibility/interfaces

• Portable & Interoperable: share nothing/minimum, technology adapters, diverse infrastructure, diverse applications, extensible

Refine Design + Implementation (Plan of attack)

• Driven by real business/use cases

Goals + Progress: Tools to empower scientists


What is “Big Data” processing?

“Python is like the jazz movement in machine learning to R is like classical music.”


Definition:

Collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications (Wikipedia)

Simple right? But private sector has the loudest Big Data voice.

Most popular tools and resources lean heavily towards: Unstructured data, high number of small loosely related data elements

Hadoop, HDFS, NoSQL, Hadoop, HDFS, NoSQL, Hadoop, HDFS… etc.

Big Data definition vs discussion? (Understand)

Some science problems fit the commercial mold. Many don’t, e.g.:

Highly regular and structured data samples

Single large datasets of tightly coupled samples

Streaming data from sensors

Getting data from domain specific data cubes

The right tools do exist, they are just not as visible in the community.

Which ones do we need?

How do we integrate them with popular tools?

Can we still use commercially driven tools for science problems that break the mold???

Big Data: where does science fit? (Understand)

Definition:

The discovery and communication of meaningful patterns in data (Wikipedia)

Wow that’s broad! But the commercial world has the loudest voice again:

Analytics = [predictions] used to recommend action or to guide decision making rooted in business context (Wikipedia)

Fortunately, this requires tools commonly used in science also: data modeling, machine learning, optimization algorithms, visualisation etc.

Analytics definition vs discussion? (Understand)

Who is REALLY doing Big Data? What are their needs?

Application/tools

– Linear Algebra? Machine Learning? Image Processing? Text/pattern matching/mining?

Data

– Streaming vs Persistent+(Static||Dynamic)? Unstructured vs Structured? SQL vs NoSQL vs Text vs Binary

Human Workflow

– Prototype vs Production, Exploratory vs Directed, Interactive vs Batch

– Scale code+tools from Interactive+Prototype => Production+Batch

What will they need to work on? How much can we support!?

CSIRO infrastructure: Storage + Compute

– Where is (should be) the data? Don’t move it!!

– What/Where is the compute?

Possible?? Transparency + Interoperability + Portability over Infrastructure

– HPC + Internal Cloud + Dedicated System


The punters want this!

Scientist and CSIRO specific needs (Understand)

What is out there?

What delivers our scientists’ requirements?

Does it support CSIRO infrastructure?

How does it all fit together?

Inter layer: Does product X work with product Y?

Intra layer: Can data stored by A be easily abstracted/ingested by B?

Tools and Software Landscape (Understand)

Data A, B, C + Infrastructure 1, 2, 4 + Tool/Software α, β, γ + Science App/Domain l, m, n

How to deal with Complexity!!!

1. Define the Forest

2. Map the Trees to Forest

3. Pick which Trees to keep/use

Seeing the Forest from the Trees (Design)

Big Data:
• Petabytes
• Storage of low-value data
• H/W failure common
• Code: frequency, graphs, machine-learning, rendering
• Ingress/egress problems
• Dense storage of data
• Mix CPU and data
• Spindle:core ratio

HPC:
• Petaflops
• Storage for checkpointing
• Surprised by H/W failure
• Code: simulation, rendering
• Less persistent data, ingress & egress
• Dense compute
• CPU + GPU
• Bandwidth to other servers

Design consequences:
• Failure is inevitable: fault tolerance built in
• Bandwidth and IO are precious: topology-aware scheduling
• Linear scalability: massive parallelisation, minimal communication
• Hide the complexities from developers: expressive programming model

Big Data versus HPC (Understand)

• Become a Big Data Excellence Centre with the vision/mission to be a hub for big data analytics and processing technology and to provide technical expertise in this area.

• Achieve a step change in the size of big data problems that are being tackled in CSIRO.

• Decrease the effort and time required for CSIRO to discover new patterns in massive datasets.

• Simplify scientists’ workflows with big data sets.

• Develop solution architectures and software components to support specific needs of big data processing and visualisation in CSIRO.

• Deliver a CSIRO shared "big data facility" supporting the integration and processing of data from different data sources. This would be more of an infrastructure project, built together with IM&T (Bowen Clouds), for certain types of in-house big data processing scenarios.

Platform for Big Data Analytics and Visual Analytics: Vision

• Connect data analytics, simulations, statistical modeling, image & video analytics, machine learning, visualisation into one stack of reusable solutions supporting various science domains.

• Build more interactive solutions that connect users with analytical models to improve business decisions.

• Create new business cases.

Platform for Big Data Analytics and Visual Analytics: Mission

• Uptake of the technology in CSIRO, transforming the way we do science.

• Contribution to Big Data Science globally.

• International collaborations.

• Enable new discoveries.

• Reduce time to new discovery.

• Global outreach.

• External grants, engagements with industry.

Platform for Big Data Analytics and Visual Analytics: Success factors

• Data discovery

• Quantitative visualisation focus:

• Measurement on visualisation

• Uncertainty - from data to display

• Integration

• Interaction

• Views of the data

• Collaboration across virtual environments

• Annotated 3D videos

• Augmented Reality

• Immersive Virtual Reality

• Wearables + Visual Analytics

Platform for Big Data Analytics and Visual Analytics: Visual Analytics

RAVE @ NIST/USA


Platform for Big Data and Visual Analytics

Our project is oriented towards providing incremental, use-case driven development of technical capabilities, including skills, software and infrastructure, to facilitate scientists’ access to big data processing.

Come talk to us! https://wiki.csiro.au/display/bigdata/PBDAVA+Collaboration


• Funded from CAPEX and built in collaboration with IM&T

• Deployed on Bowen Cloud

• 16 nodes, each with 128 GB RAM and 16 CPU cores

• Infiniband network

• ~100 TB of storage (planned)

• Various storage options being considered: OSM/NFS, HDFS, GPFS+FPO

• YARN cluster (CDH5): Hadoop MR, Spark, h2o … (any YARN-compatible framework)

• Status: storage testing

• For more see: https://wiki.csiro.au/display/ICTCRC/DP+Research+Big+Data+Cluster

The DP Research Big Data Cluster is a dedicated hardware cluster intended both to support big data related computer science research and to provide experimental big data processing capabilities for scientific projects within DP.

DP Big Data Cluster

DP Big Data Cluster - Architecture

[Architecture diagram. Labelled components: Edge Node (Clients, Compiler, Staging, Monitor); Master nodes (Yarn Master, HDFS Master); Worker nodes (Yarn Worker, HDFS Worker); hosts hadoop1-01-cdc, hadoop1-02-cdc, hadoop1-{03..16}-cdc; Infiniband Network; storage: OSM/NFS, GPFS, DAS, Bowen Storage; Bowen Compute; Bragg, Pearcey; Nexus Authentication; Ganglia Monitor; CSIRO Intranet; Workstations.]

Hadoop: What is it?

● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

● Designed to scale up from single servers to thousands of machines, each offering local computation and storage.

● Designed to detect and handle failures at the application layer.

http://hadoop.apache.org

Hadoop: Components

● Hadoop components:

○ Hadoop Distributed File System (HDFS)

○ MapReduce

● Handles any data type:

○ Structured

○ Unstructured

○ Schema

○ No schema

○ High volume

○ Low volume

Hadoop: Hadoop Distributed File System

● Breaks incoming files into blocks and stores them redundantly across the cluster

● A single large file is split into blocks, and the blocks are distributed among the nodes

● Blocks in HDFS are large – typically 128MB in size

● Files in HDFS are ‘write once’ (no random writes allowed) and are processed by the MR framework. Results are stored back in HDFS.

● Original data file not modified during lifecycle

Hadoop: HDFS

● Data replication (to enhance reliability and availability) – default is threefold

● HDFS optimised for large, streaming reads of files (rather than random reads)

● A master node, the NameNode, keeps track (as metadata) of the blocks that make up a file and their locations

Hadoop: Example

● NameNode holds metadata for files

● DataNodes hold the actual blocks
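The block splitting and replication described on the HDFS slides can be pictured with a small single-machine sketch. This is a toy model, not how HDFS is implemented: the function `place_blocks`, the node names and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware). Only the 128 MB block size and threefold replication come from the slides.

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size quoted on the slides
REPLICATION = 3       # default replication factor quoted on the slides


def place_blocks(file_size_mb, datanodes):
    """Toy NameNode metadata: block index -> DataNodes holding a replica.

    Assumes len(datanodes) >= REPLICATION so that replicas land on distinct nodes.
    """
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    metadata = {}
    for block in range(n_blocks):
        # Round-robin placement, purely illustrative (real HDFS is rack-aware).
        metadata[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return metadata


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNode names
meta = place_blocks(1000, nodes)       # a 1000 MB file
for block, holders in meta.items():
    print(block, holders)
```

The dictionary plays the role of the NameNode metadata, while the node names stand in for the DataNodes that would hold the actual bytes.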

MapReduce: Word count example

Map: reads the text one line at a time, splits each line into separate words, and for each word outputs the word and a 1 to indicate it has seen the word one time.

Shuffle: uses the word as the key, hashing the records to reducers.

Reduce: sums up the number of times each word was seen and writes that together with the word as output.
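The three phases above can be sketched on a single machine in plain Python. This is only an illustration of the data flow, not Hadoop code: the function names `map_phase`, `shuffle_phase` and `reduce_phase`, the sample lines and the number of reducers are all assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one line of text."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs, n_reducers):
    """Shuffle: hash each key to a reducer so equal words meet at the same reducer."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for word, count in pairs:
        partitions[hash(word) % n_reducers][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in partition.items()}


lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for line in lines for pair in map_phase(line)]

word_counts = {}
for partition in shuffle_phase(pairs, n_reducers=2):
    word_counts.update(reduce_phase(partition))

print(word_counts)
```

Because every occurrence of a word hashes to the same reducer, each reducer can sum its words independently, which is what lets the real framework run the reducers in parallel.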


Big Volume Processing

Architectures

– Share nothing

– Traditional: compute + storage

Parallel file systems

– HDFS, GPFS + FPO,

– S3, Swift, Lustre, Gluster

Processing

– Out of core (MapReduce)

– In memory

Scheduling:

– Yarn, Mesos

A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster + a parallel filesystem

[Diagram: frameworks grouped by programming model. Models: MapReduce Model, DAG Model, Graph Model, BSP/Collective Model; Iterative. Frameworks shown: Twister, Hadoop, MPI, Dryad, Spark, Giraph, Hama, GraphLab, Harp, GraphX, HaLoop, Stratosphere, Reef.]

Pig: Philosophy

● Pigs eat anything

○ Input data can come in any format – popular formats, such as tab-delimited are natively supported. Users can add functions to support other data formats.

○ Operates on data: relational, nested, semi-structured, or unstructured

● Pigs live anywhere

● Pigs are domestic animals

● Pigs fly

○ Pig processes data quickly.

Pig: What is it?

● Pig provides an engine for executing data flows in parallel on Hadoop

● Pig includes a language called Pig Latin for expressing data flows

● Pig Latin includes operators for many of the traditional data operations (so they need not be re-implemented, as in plain Hadoop): JOIN, SORT, FILTER, FOREACH, GROUP, LOAD and STORE.

● Pig makes use of the Hadoop Distributed File System (HDFS) and the MapReduce processing system

Why?

Faster Development (increases productivity 10x), Flexible,

Express data transformation tasks in just a few lines of code

Don’t reinvent the wheel, 10 lines of Pig Latin = ~200 lines of Java

Pig: Workflow

● A LOAD statement reads data from the file system.

● A series of transformation statements process the data.

● A STORE statement writes output to the file system, or a DUMP statement displays output on the screen.

● Pig first validates the syntax and semantics of all statements, and executes them only when it encounters a DUMP or STORE statement.
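The deferred execution described above (statements are recorded first and run only on DUMP or STORE) can be mimicked with a toy Python class. This is not Pig or Pig Latin; `LazyFlow` and its methods are hypothetical names invented for this sketch.

```python
class LazyFlow:
    """Toy stand-in for a Pig script: statements are recorded, not run, until dump()/store()."""

    def __init__(self, loader):
        self._loader = loader      # plays the role of LOAD
        self._steps = []           # recorded transformation statements
        self.executions = 0        # how many times the plan actually ran

    def transform(self, fn):
        self._steps.append(fn)     # only recorded here; nothing is executed
        return self

    def _run(self):
        self.executions += 1
        data = self._loader()
        for fn in self._steps:
            data = fn(data)
        return data

    def dump(self):
        """Plays the role of DUMP: triggers execution and returns the rows."""
        return list(self._run())

    def store(self, sink):
        """Plays the role of STORE: triggers execution and appends the rows to `sink`."""
        sink.extend(self._run())


flow = LazyFlow(lambda: [3, 8, 1, 9])
flow.transform(lambda rows: (r for r in rows if r > 2))   # a FILTER-like step
flow.transform(lambda rows: sorted(rows))                 # a SORT-like step
ran_before_dump = flow.executions                         # still 0: nothing has run
result = flow.dump()                                      # execution happens only now
print(ran_before_dump, result)
```

Deferring execution this way gives the engine the whole data flow before any work starts, which is what allows it to validate and plan the script as a unit.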

Pig: The whole picture

Pig: Pig Latin

● Pig Latin is a dataflow language: it allows users to describe how data from one or more inputs should be read, processed and stored to one or more outputs in parallel.

● Data flows can be:

○ Linear: as in the word count example

○ Complex: multiple inputs are joined, and data is split into multiple streams to be processed by different operators

● A Pig Latin script describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process the data

● Pig Latin has no if statements or for loops (= it focuses on data flow)

○ Traditional procedural and OO programming languages describe control flow; data flow is a side effect of the program.

Pig: Running Pig / Starting Grunt

● Pig supports local mode: useful for prototyping and debugging Pig Latin scripts. Test on small data and move to large data.

● Pig also runs in mapreduce mode: it does parsing, checking and planning locally, but executes MapReduce jobs on a Hadoop cluster (it needs to know where the NameNode and JobTracker are located).

You can execute Pig Latin statements:

● Using command line / Grunt shell

● In local mode or mapreduce mode (to interact with HDFS on your cluster)

● Either interactively or in batch

● Embedded Pig

Pig: Data types: scalar and complex

Pig: Schemas

● Pig eats everything: it has a lax attitude towards schemas

● If a schema for the data is available, Pig will use it

● If a schema for the data is not available, Pig will still process the data and make its best guesses (based on how the script treats the data)

Pig: Commands

Pig: Word count example

Pig: User Defined Functions (UDF)

● Benefits

○ Use legacy code

○ Use library in scripting language

○ Leverage Hadoop for non-Java programmers

● Extensible Interface

○ Minimum effort to support another language

● Currently supported languages

○ Python

○ JavaScript

○ Ruby

Pig: DataFu

● DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig.

● This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.

● Used at LinkedIn in many off-line workflows for data-derived products like "People You May Know" and "Skills". It contains functions for:

○ PageRank

○ Statistics (e.g. quantiles, median, variance, etc.)

○ Sampling (e.g. weighted, reservoir, etc.)

○ Convenience bag functions (e.g. enumerating items)

○ Convenience utility functions (e.g. assertions, etc.)

○ Set operations (intersect, union)

Pig: ABC Radio Stations and Toilets example

● We have a list of local ABC Radio stations in Australia

● We have a list of all Public Toilets across Australia

● We want to find the closest toilet to a Radio Station

Demonstration of:

● Data Schemas

● Use of external libraries

● Google Maps API

https://github.com/tomaszbednarz/pig-abc-toilets
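The core of the "closest toilet" question is a nearest-neighbour search over coordinates. The sketch below is a brute-force, single-machine stand-in in plain Python and is not taken from the linked repository, which uses Pig and may compute distances differently. The haversine formula is a standard great-circle approximation; the records and the names `haversine_km` and `closest_toilet` are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_toilet(station, toilets):
    """Return the (name, lat, lon) toilet record nearest to the station record."""
    _, slat, slon = station
    return min(toilets, key=lambda t: haversine_km(slat, slon, t[1], t[2]))


# Made-up records for illustration only, not taken from the real data sets.
stations = [("Station A", -27.47, 153.02), ("Station B", -33.87, 151.21)]
toilets = [
    ("Toilet 1", -27.50, 153.00),
    ("Toilet 2", -33.90, 151.20),
    ("Toilet 3", -37.81, 144.96),
]

nearest = {s[0]: closest_toilet(s, toilets)[0] for s in stations}
print(nearest)
```

In a Pig script the distance function would typically be supplied as a UDF and applied after joining the two inputs; the brute-force comparison here only shows the logic of the final step.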

Apache Spark: Fast, general engine for large-scale data processing and analysis

• Open source, developed at UC Berkeley

• Written in Scala (functional programming language that runs in a JVM)

• Key Concepts

• Avoid the data bottleneck by distributing data when it is stored

• Bring the processing to the data

• Data is stored in memory

• Improves efficiency through (up to 100x faster):

In-memory computing primitives

General computation graphs

• Improves usability through:

Rich APIs in Java, Scala, Python

Interactive shell in Python, Scala

Up to 2-10x less code

• API: Spark

• Cluster Computing: Spark Standalone, YARN, Mesos

• Storage: HDFS

Apache Spark: RDD (Resilient Distributed Dataset)

• RDD (Resilient Distributed Dataset)

• Resilient – if data in memory is lost, it can be recreated

• Distributed – stored in memory across the cluster

• Dataset – initial data can come from a file or be created programmatically

• RDDs are the fundamental unit of data in Spark

• Concept: Resilient Distributed Datasets (RDDs)

Immutable collections of objects spread across a cluster

Built through parallel transformations (map, filter, etc)

Automatically rebuilt on failure

Controllable persistence (e.g. caching in RAM)


From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley

Operations: Two types, transformations and actions

Transformations (e.g. map, filter, groupBy, join, flatMap) Lazy operations to build RDDs from other RDDs

Actions (e.g. count, collect, reduce) Return a result or write it to storage

From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley


RDDs can hold any type of element:

- Primitive types: integers, characters, strings, etc.

- Sequence types: lists, arrays, dicts, etc.

- Scala/Java objects

- Mixed types
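The split between lazy transformations and eager actions can be illustrated without a cluster. The class below is a toy, single-machine stand-in written for this sketch; `ToyRDD` is a hypothetical name and this is not the Spark API, although its method names mirror the operations listed above. Each transformation only extends a recorded lineage, and an action replays that lineage over the source data.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD; not the Spark API."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)   # immutable input data
        self._lineage = lineage        # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage over the source; this is also how lost data could be rebuilt.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


log = ToyRDD(["ERROR\tdisk\tfoo failed", "INFO\tok\tfine", "ERROR\tnet\tbar timeout"])
messages = log.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
n_foo = messages.filter(lambda s: "foo" in s).count()   # action
all_messages = messages.collect()                       # action
print(n_foo, all_messages)
```

Because a `ToyRDD` is never modified in place, `messages` can be reused for several queries, which mirrors the immutability and "rebuilt on failure" properties listed for real RDDs.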

Apache Spark: API

http://www.slideshare.net/frodriguezolivera/apache-spark-41601032

Example: Mining Console Logs

Load error messages from a log into memory, then interactively search for patterns

    lines = spark.textFile("hdfs://...")                      # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
    messages = errors.map(lambda s: s.split('\t')[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()             # action
    messages.filter(lambda s: "bar" in s).count()
    . . .

[Diagram: the Driver sends tasks to three Workers and receives results; the Workers read Block 1, Block 2 and Block 3 and keep Cache 1, Cache 2 and Cache 3.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley

Automated Big Video Analysis

Integrated video camera systems have been installed on fishing boats to trial 24/7 fishery monitoring of tuna longline operations in Australia.

• QI group has developed an algorithm to extract significant frames

• A single 30 day trip produces 720 hours or 180 GB of video footage – single CPU processing takes about 9 hours

• We developed:

○ Sparkle: a prototype integration of SPARK and OpenCV

○ A video reduction tool on top of Sparkle

Results

• Processing (reduction) of 256 x 0.5GB = 128GB video files on bragg with SPARK-HPC

• Resources requested: 128 nodes with 4 processes per node = 512 CPU cores

• Execution time: 137s
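A rough back-of-the-envelope comparison of the quoted figures is sketched below. It assumes that the single-CPU figure (180 GB in about 9 hours) and the cluster figure (128 GB in 137 s) refer to the same algorithm on comparable footage, which the slide does not state, so the resulting speedup is indicative only.

```python
# Figures quoted on the slide.
single_cpu_gb = 180.0             # one 30-day trip of footage
single_cpu_seconds = 9 * 3600     # "about 9 hours" on a single CPU
cluster_gb = 128.0                # 256 files x 0.5 GB
cluster_seconds = 137.0           # execution time with SPARK-HPC
cluster_cores = 512               # 128 nodes x 4 processes per node

single_rate = single_cpu_gb / single_cpu_seconds   # GB/s on one CPU
cluster_rate = cluster_gb / cluster_seconds        # GB/s on the cluster

# Assumption: both runs do comparable work per GB of video.
speedup = cluster_rate / single_rate
efficiency = speedup / cluster_cores

print(f"single CPU: {single_rate * 1000:.1f} MB/s")
print(f"cluster:    {cluster_rate * 1000:.1f} MB/s")
print(f"speedup:    {speedup:.0f}x, parallel efficiency: {efficiency:.0%}")
```

Under that assumption the cluster run is roughly 170 times faster than a single CPU, about a third of ideal linear scaling on 512 cores.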

WebVR: Virtual Reality in Web Browsers

Collaboration with NIST (Sandy Ressler)

SPARK-HPC

SPARK-HPC is an open-source adapter for running Spark on PBS clusters

Well suited for compute and memory intensive applications (e.g., large scale machine learning)

Enables Spark computation on CSIRO HPC clusters, including bragg (128 Dual Xeon 8-core E5-2650 nodes with 384 Kepler Tesla K20 GPUs)

Open source, see: https://github.com/csirobigdata/spark-hpc

Status on CSIRO HPC Clusters:

Needs to be migrated to SLURM and redeployed


www.bdva.net

For even more discussions: Directions

• Connect Big Data and Science

• Infrastructure

• Data Provenance

• How to link data centers together

• Visual Analytics

• Real time data processing

• Internet of Things

• Art + Science: communication

• Spark + GPUs http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/

Thank you

CONTACT

Tomasz Bednarz

E: [email protected]

(07) 3833 5544

CSIRO DIGITAL PRODUCTIVITY FLAGSHIP