Pervasive Partner Presentation KNIME + DataRush · 2017-05-23 · Big ETL SCADA manufacturi ng...

www.pervasivebigdata.com

Pervasive Partner Presentation

KNIME + DataRush Mike Hoskins, GM - Pervasive Big Data

KNIME Conf, Zurich Technopark, 1 Feb 2012

Big Data Pipeline

Data Scientists

Data Analysts

Business Analysts

Decision Makers

Operational Intelligence

Data Integrators

App Developers

Collect: monitor, log, ingest, event capture, decrypt

Prepare: profile, match, cleanse, aggregate, audit

Analyze: sample, model, discover, visualize, predict

Consume: report, chart, dashboard, alert, closed loop

Big Data Challenges

Volume

Collect: monitor, log, ingest, event capture, decrypt

Prepare: profile, match, cleanse, aggregate, audit

Analyze: sample, model, discover, visualize, predict

Consume: report, chart, dashboard, alert, closed loop

Pervasive DataRush

Full Core and Memory Utilization

Legacy Applications
• Single Threaded
• In-Memory

DataRush
• Dynamic Scaling Multi-Threaded
• Full Resource Utilization
• Data Flow
• Overcome Memory Heap Sizes

© Copyright 2011 Pervasive Software. All rights reserved

Auto-Scaling

Run-time in minutes by core count:

Core Count    Run-time (minutes)
2 cores       370.0
4 cores       192.4  (3.2 hours)
8 cores        90.3  (1.5 hours)
16 cores       51.6  (under 1 hour)
32 cores       31.5
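The run times in the Auto-Scaling chart can be turned into speedup and parallel-efficiency figures. The sketch below is only arithmetic on the chart values; the function name `scaling_table` and the choice of the 2-core run as the baseline are assumptions made for illustration.

```python
# Run times read off the Auto-Scaling chart (minutes, keyed by core count).
run_minutes = {2: 370.0, 4: 192.4, 8: 90.3, 16: 51.6, 32: 31.5}


def scaling_table(times):
    """Return (cores, minutes, speedup, efficiency) rows.

    Speedup is measured against the smallest core count in `times`;
    efficiency is speedup divided by the ideal (linear) speedup.
    """
    base_cores = min(times)
    base_time = times[base_cores]
    rows = []
    for cores in sorted(times):
        speedup = base_time / times[cores]
        ideal = cores / base_cores
        rows.append((cores, times[cores], speedup, speedup / ideal))
    return rows


for cores, minutes, speedup, efficiency in scaling_table(run_minutes):
    print(f"{cores:>2} cores: {minutes:6.1f} min  "
          f"speedup {speedup:5.2f}x  efficiency {efficiency:4.0%}")
```

An efficiency close to 100% means that doubling the cores roughly halves the run time.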

Full-Featured Data Preparation Functions

Analytics Functions For Deep Insights

DataRush & Hadoop

Malstone Benchmark – Logfile Processing

• Web site logs
• 10 billion rows (nearly 1 terabyte)
• Aggregates site intrusion information

Chart: Run Time vs. Total Cost of Ownership (TCO)

• Hadoop: 20-node cluster, 4 cores per node, 14 hours
• DataRush: single machine, 32 cores, 31.5 minutes

26X difference!

*www.opencloudconsortium.org/benchmarks
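The 26X figure follows directly from the two run times on this slide. The short calculation below reproduces it and, as an extra derived number that is not on the slide, compares total core-minutes consumed; the variable names are illustrative.

```python
# Figures from the slide: 20-node cluster vs. a single 32-core machine.
cluster_nodes, cores_per_node, cluster_hours = 20, 4, 14
single_cores, single_minutes = 32, 31.5

cluster_cores = cluster_nodes * cores_per_node   # 80 cores in total
cluster_minutes = cluster_hours * 60             # 840 minutes

# Wall-clock ratio quoted on the slide.
runtime_ratio = cluster_minutes / single_minutes

# Derived here (not on the slide): ratio of cores x minutes consumed.
core_minutes_ratio = (cluster_cores * cluster_minutes) / (single_cores * single_minutes)

print(f"run-time ratio:     {runtime_ratio:.1f}x")
print(f"core-minutes ratio: {core_minutes_ratio:.1f}x")
```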

Malstone Benchmark – Price/Performance

DataRush & Hadoop & KNIME

Pervasive DataRush Plug-in for KNIME

Screenshot callouts: DataRush Plug-Ins; Drag and Drop to call DataRush for KNIME; Retrospective Analytics

What’s new since 2011 KNIME Conference

• Major Additions:

– New “DeriveFields” Operator

– Two new Join types from our Hive (SQL in Hadoop) work

• Semi-Join and Anti-Join

– Range Partitioning

• New Functions:

– Many Data Preparation functions

• Hadoop & Big Data Operators:

– Extreme high-performance HBase read/write

– Other Hadoop reader/writers

• Avro, Syslog, Netflow, Flume HBase sink

– KNIME nodes for HBase and HDFS read/write
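For readers unfamiliar with the join types and partitioning scheme named above, here is a minimal sketch of their semantics on in-memory rows. It is plain Python, not the DataRush operator API; the function names, the sample data, and the rule that a key equal to a split point goes to the right-hand partition are all assumptions made for illustration.

```python
from bisect import bisect_right


def semi_join(left, right, key):
    """Keep each left row whose key has at least one match on the right.

    Unlike an inner join, a left row is never duplicated, no matter how
    many right rows share its key.
    """
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep each left row whose key has no match on the right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


def range_partition(rows, key, split_points):
    """Route rows into len(split_points) + 1 partitions by key range.

    split_points must be sorted ascending; a key equal to a split point
    goes to the partition on its right.
    """
    partitions = [[] for _ in range(len(split_points) + 1)]
    for row in rows:
        partitions[bisect_right(split_points, row[key])].append(row)
    return partitions


# Hypothetical sample data.
calls = [
    {"caller": "alice", "seconds": 30},
    {"caller": "bob", "seconds": 600},
    {"caller": "carol", "seconds": 95},
    {"caller": "bob", "seconds": 12},
]
flagged = [{"caller": "bob"}, {"caller": "bob"}, {"caller": "dave"}]

matched = semi_join(calls, flagged, "caller")        # both of bob's calls, once each
unmatched = anti_join(calls, flagged, "caller")      # alice's and carol's calls
by_duration = range_partition(calls, "seconds", [60, 300])
```

Here `by_duration` holds three partitions: under 60 seconds, 60 up to (but not including) 300 seconds, and 300 seconds or more.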

What’s new since 2011 KNIME Conference (2)

• DataRush v6 (releasing later in 2012)

– Unified API/Composition model for scale-up SMP or scale-out Clusters

– Full Integration with NextGen MapReduce (DataRush as embedded dataflow computational alternative to coarse-grained MapReduce programming)

• DataRush for KNIME integration

– Continue the Krunner work (high-speed execution of contiguous DataRush nodes in a KNIME flow); make it work for DDR6 (Distributed DataRush v6, summer 2012)

– Standalone server or cluster execution of KNIME flows that contain only DataRush nodes

Pervasive Big Data Stack

Moving from SDK to Consumable Products

Stack diagram layers: Platform, Tools, Products, Solutions

Diagram labels: Pervasive Big Data Profiler; Pervasive Big Data Matcher; Pervasive Big Miner; Telecom Analyzer; Pervasive Big ETL; SCADA manufacturing; Cyber security; Marketing/advertising; Pervasive BigOLAP; Time series, event, analytics; Pervasive Big BI; Pervasive Big Viz; Hadoop add-ons (TurboRush); Eco system add-ons; Azure; BigTable…

Pervasive DataRush
Big Data Integration and Analytics Platform

Hardware
• Single server or cluster
• On-premises or in cloud

Data Sources
• Flat files
• Relational databases
• NoSQL databases
• Hadoop

Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator (aka PDI)

– BigETL (aka KNIME)

– BigBI

• Report, Chart, OLAP, Query

– BigMiner (aka KNIME)

Pervasive Data Integrator™ v10

• All Service Oriented / ESB

• Browser-based UI

• Deploy On-premises or Cloud

• Extensible and Embeddable

• New management capabilities

WEB INTERFACE

Drag and drop palette; Flexible workflow; Auto or drag and map

Predictive Analytics in DataRush for KNIME

Big Data Capture and Analysis for Telecom

Customer Churn

Network Performance

Fraud detection

Revenue Assurance

Customer Experience

Least-Cost Routing

Vendor Performance

SaaS apps
Server/Web/App logs
In-house apps
Sensors/Switches/Routers
Partner data

Flume, Snort, Esper

Collect, Prepare, Analyze

Monitor
Decrypt
Add timestamps
Log receipt
Store CSV, XLS
Store HDFS, HBase
Event ingest

What does it mean? Where is the fit good?

• KNIME is ready for Big Data! Just add DataRush

– Extreme scaling on modern commodity hardware: scale-up on Servers/Appliances, and scale-out on Clusters

– Native support for Hadoop and NoSQL

• Use cases already worked with DataRush for KNIME

– Telecomms CDR (Call Detail Records)

– Cybersecurity (Network and Weblog analytics)

– Life Sciences (Gene alignment and assembly)

– Financial Services and Healthcare

– General Data Mining (Clustering, Linear Regression, Decision Tree)

– Almost no limit to the use cases

• Well suited for:

– Machine generated “event” data (aka: log events)

– Long-running Analytic workloads (including Matching)

– Heavy “Data Prep” pre-processing

• Lacking Operators (today) for text, multimedia

Thanks! Q&A

Big Data Benchmarks on Hadoop

• Developed by the Open Cloud Consortium

• Benchmark related to web site visits and cyber infection status

• 10 billion row dataset with 100 bytes/row for a total of 1 Terabyte

1. The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing – Robert Grossman, http://rgrossman.com/2009/05/25/malstone-benchmark. Java code probably not optimized.

2. Subject to further review and potential optimization

3. Early test results – all subject to further optimization

Log file processing – Malstone benchmark

NOT FOR PUBLICATION

Configuration                                           Rows/sec   Rows/watt          Rows/$

20 nodes x 4 cores - Open Cloud Consortium cluster
  Grossman (Hadoop + Java MapReduce) [1]                 187,266      62,422      46,816,479

Single server: 48-core, 64-disk "Hadoop Appliance"
  Pervasive 1 - Hadoop + Java MapReduce [2]               75,597      88,938     110,630,075
  Pervasive 2 - Flat file + DataRush [3]               3,267,974   3,844,675   4,782,400,765
  Pervasive 3 - HDFS/HBase + DataRush [3]              6,024,096   7,087,172   8,815,750,808

Performance ratio - Pervasive 3 vs Hadoop/MR cluster         32x        114x            188x

Read-only performance - HDFS/HBase + DataRush [3]     12,800,000  15,058,824  18,731,707,317
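The performance-ratio row can be re-derived from the Grossman and Pervasive 3 rows. The snippet below is a simple check of that arithmetic using the figures as printed in the table; the helper name `performance_ratios` is illustrative.

```python
# (rows/sec, rows/watt, rows/$) copied from the benchmark table.
hadoop_cluster = (187_266, 62_422, 46_816_479)        # Grossman, 20 nodes x 4 cores
pervasive_3 = (6_024_096, 7_087_172, 8_815_750_808)   # HDFS/HBase + DataRush


def performance_ratios(candidate, baseline):
    """Divide each metric of `candidate` by the same metric of `baseline`."""
    return tuple(c / b for c, b in zip(candidate, baseline))


ratios = performance_ratios(pervasive_3, hadoop_cluster)
for label, ratio in zip(("rows/sec", "rows/watt", "rows/$"), ratios):
    print(f"{label:>9}: {ratio:6.1f}x")
```

Rounded to whole numbers, the three results agree with the 32x, 114x and 188x shown in the table.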

Big Data Platform

Architecture diagram labels: Hadoop; HDFS; HBase; Structured Data; Events; ERP; CRM; APPs; Devices; Syslog; Event Collection Framework; Collector (x2); HBase Sink (x2); Data Prep; ETL; Integration; Aggregates (RDBMS); OLAP Engine; SQL/MED; JDBC; XMLA; KNIME Wrapper; Query

End User Tools: Real-time Visualization; Reporting; OLAP; Data Mining; ETL

Big Data Solutions

Telecom Provider Challenges

Switches / Network Elements
Off-net Usage
OSS/BSS Data

Corporate
Sales/Marketing
Network OPS
Customer Care
Information Technology

Vendor Performance
Pricing optimization
Product/Service Offers
Operational Performance
Profitability Analysis
Customer Experience
Capacity Optimization
Network Performance
Churn
Segment Insights
Usage Trends

Continuously Integrate
Problem Solving

Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable

• High Throughput

• Cost Efficient

• Easy to Implement

• Extensible

Business Issues

• Time to decision is critical

– Missed opportunities; wasted resources

– Customer issue reaction is too slow

• Deeper granularity of data is critical

– Understanding of trends is needed

– Pricing optimization

– Vendor performance

• Decision time - from days to minutes

– Deeper understanding of operational issues

– Which situations are problematic (or not)

Pervasive DataRush and Hadoop

• DataRush embedded within Hadoop

– Reduce complexities of MapReduce experience

– Increased efficiencies = significantly faster run times

– Cloudera Certification

Diagram: four Mappers and two Reducers on the Hadoop Distributed File System, each with DataRush embedded.

Malstone B, 0.5 TB: 33 mins with DataRush in Hadoop vs. 135 mins with Hadoop

Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable: Performance dynamically scales with increased core/server counts. No change to the code.

• High Throughput: Patented parallel dataflow technology enables fast, deep analysis of large data sets with no limit on input data size.

• Cost Efficient: Fully exploit commodity multicore servers – save significant capital and energy costs via efficient node utilization.

• Easy to Implement: DataRush takes care of complex parallel processing issues at design time: hides threading complexity; no deadlocks; runs on any platform – including Hadoop; etc.

• Extensible: DataRush is a component-based platform with an open API so you can easily extend it for your own needs.

DataRush Release Timeline

Timeline axis: CQ1 2011, CQ2 2011, CQ3 2011, CQ4 2011, CQ1 2012, CQ2 2012

DataRush 5.0 (January 2011)
• Distributed DR
• KNIME
• Performance

DataRush 5.0.1 (March 2011, ongoing …)
• Bug fixes
• Targeted features

DataRush 5.1 (December 2011)
• Hadoop and Hive integration
• I-Labs connectivity
• KNIME 2.4.1
• Bug fixes

DataRush 6 (TBD)
• Fully distributed composition and library
• Distributed execution in KNIME
• Next Gen MapReduce (?)

TurboRush for Hive 0.9
• Hive accelerator
• Limited release

DataRush & KNIME

KNIME Introduction

• Open source workflow for data mining

• Desktop designer

– Eclipse based (RCP app and plug-in)

– Node based architecture

• Nodes provide connectivity, transformations, algorithms, …

• Extensible model: user developed nodes supported

– Drag and drop, graphical editing of projects

– Project execution from GUI

– Workflow model – each node executes completely before next node is invoked

Predictive Analytics in DR-KNIME

Profiling in DR-KNIME

NextGen Sequencing and Genomic Pipelines

NGS data explosion

Convert/filter FastA/FastQ files

Align/order/assemble

Report/visualize matching/coverage

Q & A

Big Data Products

Pervasive Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator

– BigBI

• Report, Chart, OLAP, Query

– BigMiner

– BigSearch

BigIntegrator: HBase as Source or Target

BigIntegrator: Visual Mapping to/from HBase

BigBI (aka BigQuery)

DataRush & KNIME

DataRush + KNIME – what is it?

• Plug-in of DataRush v5.1 to KNIME v3.2?

• Adds extreme high-performance data preparation and analytic functions

• Adds support for Hadoop data sources (both HDFS and HBase)

• Adds special dataflow “k-runner” mode that recognizes adjacent DataRush nodes and executes entirely in memory by “flowing” data from node to node

• KNIME functionality can be further extended with the DataRush SDK and Scripting
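The difference between running each node to completion and "flowing" data between adjacent nodes can be illustrated with Python generators. This is a conceptual, single-threaded sketch only: it does not use the KNIME or DataRush APIs, it leaves out the parallelism a real dataflow engine adds, and the three stage functions are hypothetical.

```python
def read_rows(n):
    """Source stage: produce n rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def filter_even_ids(rows):
    """Transform stage: pass through rows with an even id."""
    for row in rows:
        if row["id"] % 2 == 0:
            yield row


def add_label(rows):
    """Transform stage: attach a derived field to each row."""
    for row in rows:
        yield {**row, "label": f"row-{row['id']}"}


def run_node_at_a_time(n):
    """Each stage finishes and materializes its full output before the next starts."""
    table = list(read_rows(n))
    table = list(filter_even_ids(table))
    table = list(add_label(table))
    return table


def run_streaming(n):
    """Adjacent stages are chained; each row flows through all of them
    before the next row is read, so no intermediate table is built."""
    return add_label(filter_even_ids(read_rows(n)))


# Both strategies produce the same rows; only the execution pattern differs.
assert run_node_at_a_time(6) == list(run_streaming(6))
```

The streaming variant keeps only one row in flight at a time, which is the property that lets a chain of adjacent nodes avoid writing intermediate results between steps.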

Pervasive RushMiner

Visual Environment for Big Data Analytics and Preparation

• Quickly cleanse, profile and aggregate big data

• Use data mining, predictive analytics, and machine learning to uncover actionable intelligence

• Works with flat files, relational databases, NoSQL databases, and the Hadoop filesystem (HDFS)

• High performance, scales up to terabytes of data

• Design on your desktop using a simple drag-and-drop interface

• Execute on desktop, remote server, or clusters, including Hadoop clusters

Event Processing with DataRush

• Capture ALL data

• Discover previously unavailable patterns, correlations, etc.

• Scalable to meet growing needs

Processed 100 million Syslog events in 58 seconds on a 48-core system, a sustained run rate of 14 Tb per day.
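The throughput behind this claim is easy to work out. The calculation below derives events per second and per day from the slide's figures, then backs out the average event size that the daily volume would imply. Reading "14 Tb" as 14 terabytes (14 x 10^12 bytes) is an assumption; the slide does not say whether bits or bytes are meant.

```python
# Figures from the slide.
events = 100_000_000
seconds = 58

SECONDS_PER_DAY = 86_400

events_per_sec = events / seconds
events_per_day = events_per_sec * SECONDS_PER_DAY

# ASSUMPTION: "14 Tb per day" is read as 14 terabytes (14e12 bytes).
assumed_bytes_per_day = 14e12
implied_bytes_per_event = assumed_bytes_per_day / events_per_day

print(f"{events_per_sec:,.0f} events/sec")
print(f"{events_per_day:,.0f} events/day")
print(f"about {implied_bytes_per_event:.0f} bytes per event under the terabyte reading")
```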
