Pervasive Partner Presentation KNIME + DataRush · 2017-05-23 · Big ETL SCADA manufacturi ng...

www.pervasivebigdata.com

Pervasive Partner Presentation

KNIME + DataRush

Mike Hoskins, GM - Pervasive Big Data

KNIME Conf, Zurich Technopark, 1 Feb 2012

Big Data Pipeline

Data Scientists

Data Analysts

Business Analysts

Decision Makers

Operational Intelligence

Data Integrators

App Developers

• Collect: monitor, log, ingest, event capture, decrypt

• Prepare: profile, match, cleanse, aggregate

• Analyze: sample, model, discover, visualize, predict

• Consume: report, chart, dashboard, alert, closed loop

Big Data Challenges

Volume

• Collect: monitor, log, ingest, event capture, decrypt

• Prepare: profile, match, cleanse, aggregate

• Analyze: sample, model, discover, visualize, predict

• Consume: report, chart, dashboard, alert, closed loop

Pervasive DataRush

Full Core and Memory Utilization

Legacy Applications

• Single Threaded

• In-Memory

DataRush

• Dynamic Scaling Multi-Threaded

• Full Resource Utilization

• Data Flow

• Overcome Memory Heap Sizes

Auto-Scaling

[Chart: run-time by core count (2, 4, 8, 16, 32 cores)]

• 3.2 hours using 4 cores

• 1.5 hours using 8 cores

• Under 1 hour using 16 cores

Full-Featured Data Preparation Functions

Analytics Functions For Deep Insights

DataRush & Hadoop

Malstone Benchmark – Logfile Processing

• Web site logs

• 10 billion rows (nearly 1 terabyte)

• Aggregates site intrusion information

Run Time

l Cost

• 20-node cluster

• 4 cores per node

• 14 hours

• 32 cores

• single machine

• 31.5 minutes
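The two run times above can be compared with a few lines of arithmetic. This is only a sketch over the figures quoted on the slide; the variable names are illustrative, and core-hours is a rough measure that ignores differences in hardware generation, disks, and software tuning.

```python
# Figures quoted on the slide; names are illustrative.
cluster_nodes = 20
cluster_cores_per_node = 4
cluster_hours = 14.0

single_cores = 32
single_hours = 31.5 / 60.0  # 31.5 minutes expressed in hours

cluster_cores = cluster_nodes * cluster_cores_per_node

# Wall-clock speedup of the single machine over the cluster.
speedup = cluster_hours / single_hours

# Core-hours consumed by each run (cores multiplied by elapsed hours).
cluster_core_hours = cluster_cores * cluster_hours
single_core_hours = single_cores * single_hours

print(f"cluster cores: {cluster_cores}")
print(f"wall-clock speedup: {speedup:.1f}x")
print(f"core-hours: {cluster_core_hours:.0f} vs {single_core_hours:.1f}")
```

On these numbers the single machine finishes roughly 27 times sooner while using fewer than half as many cores.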

*www.opencloudconsortium.org/benchmarks

Difference

Malstone Benchmark – Price/Performance

DataRush & Hadoop & KNIME

Pervasive DataRush Plug-in for KNIME

DataRush Plug-Ins

Drag and Drop to DataRush

Retrospective Analytics

What’s new since 2011 KNIME Conference

• Major Additions:

– New “DeriveFields” Operator

– Two new Join types from our Hive (SQL in Hadoop) work

• Semi-Join and Anti-Join

– Range Partitioning

• New Functions:

– Many Data Preparation functions

• Hadoop & Big Data Operators:

– Extreme high-performance HBase read/write

– Other Hadoop reader/writers

• Avro, Syslog, Netflow, Flume HBase sink

– KNIME nodes for HBase and HDFS read/write
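For readers unfamiliar with the join types and the partitioning scheme named in the list above, here is a minimal sketch of their semantics in plain Python. It is not the DataRush operator API; the function names, the `cust` key, and the sample rows are all hypothetical.

```python
from bisect import bisect_right


def semi_join(left, right, key):
    """Keep rows of `left` whose key appears in `right`; take no columns from `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep rows of `left` whose key does NOT appear in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


def range_partition(rows, key, boundaries):
    """Place each row in a partition according to where its key falls
    among the sorted `boundaries` (len(boundaries) + 1 partitions)."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


# Hypothetical sample data.
customers = [
    {"cust": 1, "name": "Ann"},
    {"cust": 2, "name": "Bo"},
    {"cust": 3, "name": "Cy"},
    {"cust": 4, "name": "Di"},
]
orders = [
    {"cust": 2, "amount": 10},
    {"cust": 2, "amount": 5},
    {"cust": 4, "amount": 7},
]

with_orders = semi_join(customers, orders, "cust")
without_orders = anti_join(customers, orders, "cust")
partitions = range_partition(customers, "cust", [2, 3])

print([c["name"] for c in with_orders])
print([c["name"] for c in without_orders])
print([[c["name"] for c in part] for part in partitions])
```

Unlike an inner join, the semi-join returns each matching left row once, even when the right side holds several matches for it.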

What’s new since 2011 KNIME Conference (2)

• DataRush v6 (releasing later in 2012)

– Unified API/Composition model for scale-up SMP or scale-out Clusters

– Full Integration with NextGen MapReduce (DataRush as embedded dataflow computational alternative to coarse-grained MapReduce programming)

• DataRush for KNIME integration

– Continue the Krunner work (high-speed execution of contiguous DataRush nodes in a KNIME flow); make it work for DDR6 (Distributed DataRush v6, summer 2012)

– Standalone server or cluster execution of KNIME flows that contain only DataRush nodes

Pervasive Big Data Stack

BigTable…

Pervasive Big Data Profiler

Pervasive Big Data Matcher

Moving from SDK to Consumable Products

Pervasive Telecom Analyzer

Pervasive Big ETL

manufacturing

security

Marketing/advertising

Pervasive BigOLAP

Time series, event, analytics

Platform

Products

Solutions

Pervasive DataRush

Big Data Integration and Analytics Platform

Hardware

• Single server or cluster

• On-premises or in cloud

Sources

• Flat files

• Relational databases

• NoSQL databases

• Hadoop

Pervasive Big BI

Pervasive Big Viz

Hadoop add-

(TurboRush)

Eco system add-ons

Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator (aka PDI)

– BigETL (aka KNIME)

– BigBI

• Report, Chart, OLAP, Query

– BigMiner (aka KNIME)

Pervasive Data Integrator™ v10

• All Service Oriented / ESB

• Browser-based UI

• Deploy On-premises or Cloud

• Extensible and Embeddable

• New management capabilities

WEB INTERFACE

Drag and drop palette

Flexible workflow

Auto or drag and map

Predictive Analytics in DataRush for KNIME

Big Data Capture and Analysis for Telecom

Customer Churn

Network Performance

Fraud detection

Revenue Assurance

Customer Experience

Least-Cost Routing

Vendor Performance

SaaS apps

Server/Web/App

In-house apps

Sensors/Switches/Routers

Partner data

Flume,

Snort,

Collect Prepare Analyze

Monitor

Decrypt

Add timestamps

Log receipt

Store CSV, XLS

Store HDFS, HBase

Event ingest

What does it mean? Where is the fit good?

• KNIME is ready for Big Data! Just add DataRush

– Extreme scaling on modern commodity hardware: scale-up on Servers/Appliances, and scale-out on Clusters

– Native support for Hadoop and NoSQL

• Use cases already worked with DataRush for KNIME

– Telecomms CDR (Call Detail Records)

– Cybersecurity (Network and Weblog analytics)

– Life Sciences (Gene alignment and assembly)

– Financial Services and Healthcare

– General Data Mining (Clustering, Linear Regression, Decision Tree)

– Almost no limit to the use cases

• Well suited for:

– Machine generated “event” data (aka: log events)

– Long-running Analytic workloads (including Matching)

– Heavy “Data Prep” pre-processing

• Lacking Operators (today) for text, multimedia

Thanks! Q&A

Big Data Benchmarks on Hadoop

• Developed by the Open Cloud Consortium

• Benchmark related to web site visits and cyber infection status

• 10 billion row dataset with 100 bytes/row for a total of 1 Terabyte

1. The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing – Robert Grossman

http://rgrossman.com/2009/05/25/malstone-benchmark. Java code probably not optimized.

2. Subject to further review and potential optimization

3. Early test results – all subject to further optimization

Log file processing – Malstone benchmark

NOT FOR PUBLICATION

                                                        Rows/sec    Rows/watt          Rows/$
20-nodes x 4 cores - Open Cloud Consortium cluster
  Grossman (Hadoop + Java MapReduce) [1]                 187,266       62,422      46,816,479
Single server: 48-core, 64-disk "Hadoop Appliance"
  Pervasive 1 - Hadoop + Java MapReduce [2]               75,597       88,938     110,630,075
  Pervasive 2 - Flat file + DataRush [3]               3,267,974    3,844,675   4,782,400,765
  Pervasive 3 - HDFS/HBase + DataRush [3]              6,024,096    7,087,172   8,815,750,808
Performance ratio - Pervasive 3 vs Hadoop/MR cluster         32x         114x            188x
Read-only performance - HDFS/HBase + DataRush [3]     12,800,000   15,058,824  18,731,707,317
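The performance-ratio row can be reproduced from the other rows of the table. The sketch below divides the "Pervasive 3" figures by the Grossman cluster figures; the dictionary keys are illustrative names, and the numbers are copied from the slide as given.

```python
# Values copied from the benchmark table; key names are illustrative.
hadoop_cluster = {
    "rows_per_sec": 187_266,
    "rows_per_watt": 62_422,
    "rows_per_dollar": 46_816_479,
}
pervasive_3 = {
    "rows_per_sec": 6_024_096,
    "rows_per_watt": 7_087_172,
    "rows_per_dollar": 8_815_750_808,
}

# Ratio of the single-server DataRush run to the Hadoop/MapReduce cluster run.
ratios = {name: pervasive_3[name] / hadoop_cluster[name] for name in hadoop_cluster}

for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
```

Rounded to whole numbers, the three ratios agree with the 32x, 114x, and 188x shown in the table.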

[Architecture diagram. Labels: Hadoop; Structured Events; Devices; Syslog; Collection Framework; Collector; End User Tools; Aggregates (RDBMS); Engine; Real-time Visualization; Reporting; Data Mining; HBase Sink; SQL/MED; KNIME Wrapper]

Big Data Platform

Integration

Big Data Solutions

Telecom Provider Challenges

Switches / Network Elements

Off-net Usage OSS/BSS Data

Corporate

Sales/Marketing

Network OPS

Customer Care

Information Technology Vendor Performance

Pricing optimization

Product/Service Offers

Operational Performance

Profitability Analysis

Customer Experience

Capacity Optimization

Network Performance

Segment Insights

Usage Trends

Continuously

Integrate

Problem Solving

Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable

• High Throughput

• Cost Efficient

• Easy to Implement

• Extensible

Business Issues

• Time to decision is critical

– Missed opportunities; wasted resources

– Customer issue reaction is too slow

• Deeper granularity of data is critical

– Understanding of trends is needed

– Pricing optimization

– Vendor performance

• Decision time - from days to minutes

– Deeper understanding of operational issues

– Which situations are problematic (or not)

Pervasive DataRush and Hadoop

• DataRush embedded within Hadoop

– Reduce complexities of MapReduce experience

– Increased efficiencies = significantly faster run times

– Cloudera Certification

Mapper Mapper Mapper Mapper

Reducer Reducer

Hadoop

Distributed

File System

DataRush DataRush DataRush DataRush

DataRush DataRush

Malstone B

0.5 TB

DataRush in Hadoop

Hadoop

Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable: Performance dynamically scales with increased core/server counts. No change to the code.

• High Throughput: Patented parallel dataflow technology enables fast, deep analysis of large data sets with no limit on input data size.

• Cost Efficient: Fully exploit commodity multicore servers and save significant capital and energy costs via efficient node utilization.

• Easy to Implement: DataRush takes care of complex parallel processing issues at design time: hides threading complexity; no deadlocks; runs on any platform, including Hadoop; etc.

• Extensible: DataRush is a component-based platform with an open API, so you can easily extend it for your own needs.

DataRush Release Timeline

Timeline: CQ1 2011, CQ2 2011, CQ3 2011, CQ4 2011, CQ1 2012, CQ2 2012

• DataRush 5.0 (January 2011): Distributed DR; KNIME; Performance

• DataRush 5.0.1 (March 2011, ongoing …): Bug fixes; Targeted features

• DataRush 5.1 (December 2011): Hadoop and Hive integration; I-Labs connectivity; KNIME 2.4.1; Bug fixes

• DataRush 6: Fully distributed composition and library; Distributed execution in KNIME; Next Gen MapReduce (?)

• TurboRush for Hive 0.9: Hive accelerator; Limited release

DataRush & KNIME

KNIME Introduction

• Open source workflow for data mining

• Desktop designer

– Eclipse based (RCP app and plug-in)

– Node based architecture

• Nodes provide connectivity, transformations, algorithms, …

• Extensible model: user developed nodes supported

– Drag and drop, graphical editing of projects

– Project execution from GUI

– Workflow model – each node executes completely before next node is invoked

Predictive Analytics in DR-KNIME

Profiling in DR-KNIME

NextGen Sequencing and Genomic Pipelines

NGS data explosion

Convert/filter FastA/FastQ files

Align/order/assemble

Report/visualize matching/coverage
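As an illustration of the convert/filter step, the following sketch turns FASTQ records into FASTA while dropping short reads. It assumes the common four-line FASTQ layout (header, sequence, separator, quality); the function names and the minimum-length filter are hypothetical and are not taken from the DataRush pipeline.

```python
def fastq_records(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if not header.startswith("@") or qual is None or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def to_fasta(records, min_length=0):
    """Emit FASTA lines for reads at least `min_length` bases long."""
    for read_id, seq, _qual in records:
        if len(seq) >= min_length:
            yield f">{read_id}"
            yield seq


# Hypothetical two-read input; the second read is too short to keep.
sample = [
    "@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read2\n", "ACG\n", "+\n", "III\n",
]
fasta_lines = list(to_fasta(fastq_records(sample), min_length=5))
print("\n".join(fasta_lines))
```

Because both functions are generators, reads pass through one at a time, so file size is not limited by memory.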

Big Data Products

Pervasive Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator

– BigBI

• Report, Chart, OLAP, Query

– BigMiner

– BigSearch

BigIntegrator: HBase as Source or Target

BigIntegrator: Visual Mapping to/from HBase

BigBI (aka BigQuery)

DataRush & KNIME

DataRush + KNIME – what is it?

• Plug-in of DataRush v5.1 to KNIME v2.4.1

• Adds extreme high-performance data preparation and analytic functions

• Adds support for Hadoop data sources (both HDFS and HBase)

• Adds special dataflow “k-runner” mode that recognizes adjacent DataRush nodes and executes entirely in memory by “flowing” data from node to node

• KNIME functionality can be further extended with the DataRush SDK and Scripting
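The difference between KNIME's default model (each node executes completely before the next is invoked) and a dataflow mode that "flows" data from node to node can be sketched with Python generators. This is a conceptual illustration only, not how the k-runner is implemented; every function name and the sample rows are hypothetical.

```python
def read_rows(n):
    """Source step: produce n rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 1.5}


def derive(rows):
    """Transformation step: add a derived field to each row."""
    for row in rows:
        yield {**row, "double": row["value"] * 2}


def keep_even(rows):
    """Filter step: keep rows with an even id."""
    for row in rows:
        if row["id"] % 2 == 0:
            yield row


def run_node_at_a_time(n):
    """Each step finishes and stores its full output before the next starts."""
    table = list(read_rows(n))
    table = list(derive(table))
    table = list(keep_even(table))
    return table


def run_dataflow(n):
    """Rows flow through all steps one by one; no intermediate tables are built."""
    return keep_even(derive(read_rows(n)))


materialized = run_node_at_a_time(6)
streamed = list(run_dataflow(6))
print(materialized == streamed)
print([row["id"] for row in streamed])
```

Both runs produce the same rows; the dataflow version simply never holds a complete intermediate table, which is what allows adjacent steps to overlap in time.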

Pervasive RushMiner

Visual Environment for Big Data Analytics and Preparation

• Quickly cleanse, profile and aggregate big data

• Use data mining, predictive analytics, and machine learning to uncover actionable intelligence

• Works with flat files, relational databases, NoSQL databases, and the Hadoop filesystem (HDFS)

• High performance, scales up to terabytes of data

• Design on your desktop using a simple drag-and-drop interface

• Execute on desktop, remote server, or clusters, including Hadoop clusters

Event Processing with DataRush

• Capture ALL data

• Discover previously unavailable patterns, correlations, etc.

• Scalable to meet growing needs

Processed 100 million Syslog events in 58 seconds on a 48-core system, a sustained run rate of 14 TB per day
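The event rate behind that claim can be worked out directly. The sketch below assumes the quoted daily figure means 14 terabytes (10^12 bytes each) and derives the average event size that would imply; the slide itself does not state an event size.

```python
# Figures quoted on the slide.
events = 100_000_000
seconds = 58

# Assumption: the daily rate is 14 terabytes, with 1 TB = 10**12 bytes.
bytes_per_day = 14 * 10**12

events_per_sec = events / seconds
events_per_day = events_per_sec * 86_400  # seconds in a day

# Average event size implied by the two quoted figures taken together.
implied_bytes_per_event = bytes_per_day / events_per_day

print(f"events/sec: {events_per_sec:,.0f}")
print(f"events/day: {events_per_day:.3e}")
print(f"implied bytes/event: {implied_bytes_per_event:.0f}")
```

An implied size of a little under 100 bytes per event is plausible for a Syslog line, which supports reading the unit as terabytes rather than terabits.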
