KNIME Meetup 2016-04-16

Creating Insights at the Speed of Business

W. Daniel Cox, III CPA, CMA, CFM

Chief Executive Officer

WELCOME to

Meet Up Group

Energise Organisational Advantage through Awareness and Insight

Registration & Networking

Keynote – Dan Cox, CEO of Data Transformed

KNIME & Harvest Analytics – Tom Park

Office of State Revenue Case Study – Anand Antony

Using Spark with KNIME – Chhitesh Shrestha

Networking & Drinks

Journey to Best in Class Analytics

We Help our Clients along this Path

[Chart: Value against Time, four stages of analytics maturity]

Static: Report and Drill-down (Laggards)

Reactive: Monitor and Alert (Followers)

Proactive: Discover and Predict (Performers)

Dynamic: Analytics-enabled business processes (Innovators)

YOUR DATA. CLEARLY

Source Your Data

Realise Data Value

Prepare Your Data: Data Preparation

Plan With Data: Budget/Planning

Visualise All Data: Visualisation

BUDGET PLANNING

Budgeting

Forecasting

Planning

Demand Planning

Workforce Management

Accounting

Financing

Cashflow

Sales Forecasting

Modelling

Campaign Forecasting

DATA PREPARATION

Data Governance

Data Quality

Master Data Management

Data Warehousing

Data Science

ETL Applications

Data Analytics

SQL Language

Python Language

Scripting

Database Management

Application Development

Database Development

Textual ETL

Text Analytics

Hadoop Ecosphere

Analytical Databases

Relational Databases

Microsoft Analysis Server

OLAP

OLTP

Multi-Dimensional Databases

Data Vault Architectures

Star-Schema Architectures

Data Marting

Data Transformed Skill Sets

VISUALISATION 30%

BUDGET PLANNING 20%

DATA PREPARATION 50%

VISUALISATION

Dashboarding

Reporting

Charting

Location Analytics

Statistical Analytics

Data Analytics

Business Analysis

Story Telling

Semantic Layer

Presentation Layer

Collaboration

[Chart: Enterprise Readiness (Immature to Industrial Strength) against Performance (Slow to Fast)]

Good Enough

Production Ready

Traditional Operational

Open Source

Vortex

Actian – Fast, Industrialized, Open

Superior Big Data SQL with Industrialized strength

Do YOU Have a BIG DATA Role?

Global Data Snapshot

7,254,549,796 Total World Population

3,035,749,340 Internet Users

2,078,680,860 Active Social Network Users

6,572,950,124 Mobile Subscribers

• Challenges

• Constrains data to app

• Can’t manage new data

• Costly to Scale

Business Value

Clickstream

Geolocation

Web Data

Internet of Things

Docs, emails

Server logs

2012: 2.8 Zettabytes

2020: 44 Zettabytes

LAGGARDS

INDUSTRY

LEADERS

1

2 New Data

ERP CRM SCM

New

Traditional

Traditional systems under pressure

12 Zettabytes

Volume: Exponential Growth

Variety: New Data Types

Velocity: Time To Value

The Digital Floodgates have opened…

and will never be turned off again

Big Data equals Big Opportunity

Data Source & Type, Untouched: 88% OF BIG DATA

Value, New Possibilities: $15 TRILLION

Universal Access, Time To Value: 1% OF COMPANIES

Trends for BIG DATA

In the Cloud

Personal ETL

NoSQL

Hadoop

Data Lake

Ecosystem

Internet of Things

Big Data Trends

1. Big Data in the Cloud

2. Personal ETL

3. NoSQL

4. Hadoop

5. Data Lakes

6. Big Data Ecosystem

7. Internet of Things

BIG DATA is STILL just Data

It needs to be translated into Answers

Acquire, Grow & Retain Customers

Who are your best customers and how can you keep them satisfied?

Where can you find more customers like them?

Big data holds the insights into who your customers are and what motivates them.

Optimise Operations & Reduce Fraud

Are your operational processes and systems as efficient as they could be?

Could you reduce waste and fraud if you had real-time visibility into your business?

Adopting a big data and analytics strategy can help you plan, manage and maximise operations, supply chains and the use of infrastructure assets.

Transform Financial Processes

Do you have real-time access to reliable information about all aspects of your business?

Do you have the visibility, insight and control over financial performance to better measure, monitor and shape business outcomes?

Analysing all of your data, including big data, can drive enterprise agility and provide insights to help you make better decisions.

Manage Risk

How can you mitigate the financial and operational risks that could devastate your organisation?

How can you manage regulatory change and reduce the risk of non-compliance?

Proactively identifying, understanding and managing financial and operational risk can enable more risk-aware, confident decision making.

Create New Business Models

Are your competitors making bigger strides in changing your industry or creating new markets than you?

Does your organisation’s culture support innovative thinking and exploration?

Explore strategic options for business growth, using new perspectives gained from exploiting big data and analytics.

Improve IT Economics

Is your existing IT infrastructure able to provide the insights that decision makers need?

Are you doing enough to protect your data centre and data from potential criminal activity or fraud?

Lead the creation of new value and agility for your business by optimising big data and analytics for faster insight at a lower cost.

Analytics Trends

1. Data Governance

2. Social Intelligence

3. Analytics Organisation-Wide

4. Community Collaboration

5. Integration of Everything

6. Cloud Analytics

7. Conversational Data

8. Journalism Data

9. Mature Mobility

10. Smart Analytics

Areas BIG DATA is Helping

1. Operations & Optimising

2. Product Development

3. Customer Experience

4. Understanding and Targeting Customers

Performance Examples

Actian is Helping These Companies Achieve Leadership

Digital Marketing: Hyper-segmentation every hour

Banking: Enterprise Risk every 2 minutes

Retail: Enterprise Market Basket Analysis every minute

Defense: Network intrusion models every second

Fraud: Adjustments every nano-second

Amazon Redshift – Actian Matrix Cloud-based, Petabyte Scale Data Warehouse

The Value of Business Intelligence

Organisations competing with Analytics

Substantially OUTPERFORM their peers by 220%

Data Transformed

Actian Vector: Example

https://youtu.be/dYTF5ZNioEI

Identical 150 Million Transaction Query

Comparison between Actian Vector & Oracle DBMS

Harvest Analytics

Tom Park

Overview KNIME & Big Data

Tom Park

Gartner 2016 Magic Quadrant Advanced Analytics Platforms

Niche Players (5): FICO, Lavastorm, Megaputer, Prognoz, Accenture

Leaders (5): SAS, IBM, KNIME, RapidMiner, Dell

Visionaries (4): Microsoft, Alteryx, Alpine Data Labs, Predixion

Challengers (2): SAP, Angoss

Changes from 2015 to 2016

X Salford & TIBCO

Dropped due to not satisfying the visual composition

Main Big Data Technologies

NoSQL

Big Data Architecture

KNIME Big Data Extensions

Future Trends

Missing Ingredient to Success?

www.dataroos.com

Office of State Revenue

Anand Antony

KNIME @ OSR

Anand Antony

Senior Data Analyst, Operations Analytics and Intelligence

Office of State Revenue

anandjantony@gmail.com

Ph. 0414491765

OSR: Who are we?

As NSW’s principal revenue agency, OSR administers state taxation and revenue for, and on behalf of, the people of NSW

◦ Payroll tax

◦ Land tax

◦ Duties

◦ Grants such as First Home Benefits

Data Analytics Team: Who are we?

Operations Analytics & Intelligence is the analytics wing of the Operations Division in OSR

◦ Three teams – Business Intelligence, Data Analytics and Data Team

Data Analytics team consists of 10 analysts

Supports tax auditors by detecting possible non-compliant clients

◦ Via matching data from various sources and analysing them

◦ 60+ data sources

Data Analytics Scenario - Past

Data matching, preparation and analysis

◦ SPSS Clementine, SAS Enterprise Guide

Data mining

◦ Salford Systems

Reporting/Dashboards

◦ Excel

Fuzzy data matching

◦ SSA Name (Informatica)

Data Analytics Scenario - Current

Data matching, preparation and analysis

◦ KNIME (around 70% transitioned from Clementine/SAS)

Data mining

◦ Salford Systems

◦ Will be evaluating KNIME

Reporting/Dashboards

◦ Excel

Fuzzy data matching

◦ SSA Name (Informatica)

Future: Unified Analytic & Data Management Platform

[Architecture diagram; recoverable labels follow]

Internal & External Data Sources

Data Governance

Data Quality

Data Matching

Metadata Management

MapR Hadoop Distribution

Data Lake

Vortex / MapR

Advanced Data Analytics: Actian/KNIME

Machine Learning: H2O/Spark

Actian/KNIME

Governance

Visualisation

Presentation Layer

Datamart

On the fly / Sandpit

Spotfire/Tableau/Graph DBs

Why KNIME?

Enrich with coding via coding snippets

◦ Mostly Java Snippet at the moment

Start with canvas programming

Fast and easy learning curve for data scientists

Can tackle almost any analytic task

KNIME - Having the best of both worlds!

◦ Canvas programming + Coding

What do we use KNIME for?

Pretty much for everything! (except reporting and data mining)

◦ Data reading (text files, databases, non-standard formats)

◦ Data merging (potentially fuzzy matching too in future)

◦ Data manipulation

◦ Creating new variables

◦ Data Output

◦ Modelling (possibly in future)

Key nodes/functionalities

◦ Sorter, Column Reorder, Column Filter, Column Rename

◦ Concatenate, Joiner, Reference Row Filter (anti-join)

◦ Missing Value

◦ Math Formula, String Manipulation, Rule Engine, Java Snippet

◦ GroupBy (aggregate, dedupe)

◦ Value Counter, Pivoting

◦ Looping

◦ Regular expressions/wildcards in various nodes
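For readers more used to code than to canvas nodes, the anti-join done with Reference Row Filter and the aggregate/dedupe role of GroupBy can be sketched in plain Python. This is only an illustration of the two operations, not how KNIME implements them; the sample records and field names (`client_id`, `amount`) are invented.

```python
from collections import defaultdict

# Hypothetical sample data: field names and values are illustrative only.
clients = [
    {"client_id": "A1", "amount": 100},
    {"client_id": "B2", "amount": 250},
    {"client_id": "A1", "amount": 40},
    {"client_id": "C3", "amount": 75},
]
already_audited = [{"client_id": "B2"}]


def anti_join(rows, reference, key):
    """Keep rows whose key value does NOT appear in the reference table."""
    reference_keys = {ref[key] for ref in reference}
    return [row for row in rows if row[key] not in reference_keys]


def group_by_sum(rows, key, value):
    """Return one row per key, summing the value column (aggregate + dedupe)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: k, value: total} for k, total in totals.items()]


not_audited = anti_join(clients, already_audited, "client_id")
per_client = group_by_sum(not_audited, "client_id", "amount")
print(per_client)
```

The anti-join drops client B2 because it appears in the reference table, and the grouping step then collapses the two A1 rows into one.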

Data Preparation Example

Case study 1

Officers fill in a questionnaire on the entity audited – one Excel spreadsheet for one entity

Collate all the spreadsheets stored in a location

Massage the data to produce an analysis dataset with one row per entity

Key KNIME nodes/functionalities used

◦ List Files

◦ Table Row to Variable Loop Start, Loop End

◦ Java Snippet

Questionnaire data for one client

Overview of KNIME flow

Bring data to tabular form

Within this Meta node, there is one Java Snippet for each question in the questionnaire

Details of a Java Snippet

Result of the Meta Node

To get a single record for a client

- Just take the last row for a “client block”!

- Explained in the next slide

For each “client block” aggregate the variables

End result

1000 spreadsheets → 1000 rows
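The "last row of a client block" idea can be sketched in plain Python. The workflow itself appears only in screenshots, so this is a guess at the pattern rather than the OSR flow: it assumes each spreadsheet contributes several rows, that answers accumulate row by row within a block, and that the last row is therefore the complete record. The entity names, question labels and function names are hypothetical.

```python
# Hypothetical long-form questionnaire rows: one row per (entity, question).
rows = [
    ("entity_001", "Q1", "Yes"),
    ("entity_001", "Q2", "No"),
    ("entity_001", "Q3", "12"),
    ("entity_002", "Q1", "No"),
    ("entity_002", "Q2", "Yes"),
    ("entity_002", "Q3", "7"),
]


def fill_forward(rows):
    """Emit one record per input row; answers accumulate within a client block."""
    out = []
    current_entity = None
    answers = {}
    for entity, question, answer in rows:
        if entity != current_entity:
            # A new client block starts: reset the retained state.
            current_entity, answers = entity, {}
        answers[question] = answer
        out.append({"entity": entity, **answers})
    return out


def last_row_per_block(records):
    """Keep only the final (complete) record of each consecutive client block."""
    result = []
    for i, rec in enumerate(records):
        is_last = i == len(records) - 1 or records[i + 1]["entity"] != rec["entity"]
        if is_last:
            result.append(rec)
    return result


analysis = last_row_per_block(fill_forward(rows))
for record in analysis:
    print(record)
```

Each input block ends up as exactly one output row, which is the one-row-per-entity analysis dataset described above.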

Case study 2 – Use of Flow variables

Technique

◦ Input metadata rules into a file

◦ Read and convert into flow variables

Example

◦ Reorder variables in a dataset as per the order in the data dictionary

◦ We use the “Flow Variables” tab in the Column Reorder node to achieve this
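The effect of driving the column order from metadata instead of a manual setting can be illustrated in plain Python. This is a sketch of the idea only; it does not use any KNIME API. The column names are invented, and placing columns missing from the dictionary at the end is an assumption.

```python
# Hypothetical data dictionary order, as it might be read from a metadata file.
dictionary_order = ["client_id", "tax_type", "amount", "audit_date"]

# Hypothetical dataset whose columns arrive in a different order.
columns = ["amount", "client_id", "notes", "audit_date", "tax_type"]
rows = [[120.0, "A1", "n/a", "2016-03-01", "Payroll tax"]]


def reorder(columns, rows, order):
    """Put columns listed in `order` first; unlisted columns keep their
    relative order and go to the end (an assumed convention)."""
    listed = [c for c in order if c in columns]
    unlisted = [c for c in columns if c not in order]
    new_columns = listed + unlisted
    index = [columns.index(c) for c in new_columns]
    return new_columns, [[row[i] for i in index] for row in rows]


new_columns, new_rows = reorder(columns, rows, dictionary_order)
print(new_columns)
print(new_rows)
```

Because the order comes from a file, a change to the data dictionary changes the output layout without anyone editing the workflow by hand.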

Use of flow variables

Use this tab

Do not use this “manual” tab

KNIME wishlist!

An offset function in some nodes, e.g. Rule Engine and Math Formula.

An offset function gives the value of a variable in a previous row. E.g. in SPSS Clementine, @OFFSET(var,1) gives the value in the previous row.

Note: Within a Java Snippet this is readily achieved, since a variable retains its value until it is overwritten. We can therefore first use the value carried over from the previous row inside a formula, and then update it with the value from the current row so that it can be used in the next row.
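The retained-variable trick in the note can be sketched as follows. It is written in Python rather than as a Java Snippet, purely to show the read-then-update order; the function name `with_previous` and the sample values are hypothetical.

```python
def with_previous(values, default=None):
    """Pair each value with the value from the previous row (offset of 1)."""
    previous = default  # retained between rows, like a field in a snippet
    pairs = []
    for current in values:
        pairs.append((current, previous))  # first use the carried-over value...
        previous = current                 # ...then update it for the next row
    return pairs


amounts = [10, 12, 9, 15]

# Row-on-row change; the first row has no previous value, so it is skipped.
changes = [cur - prev for cur, prev in with_previous(amounts) if prev is not None]
print(changes)  # [2, -3, 6]
```

The order of the two statements inside the loop is the whole technique: reading before updating yields the previous row's value, while updating first would only return the current one.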

Questions?

Data Transformed

Chhitesh Shrestha

Apache Spark on KNIME

Unleash the power of Big Data on Hadoop

The Big Data Problem: Data Volume

1. Storage is getting cheaper

2. Data sources are increasing

3. Thus, data is growing faster

YARN

But processing it is still a problem. Why?

The Big Data Problem: Processing

Now, as the memory is cheaper.

Why Apache Spark?

Apache Spark is an open source parallel processing framework that enables users to run large scale data analytics across clustered computers.

• Speed

• Flexible with programming platform

• Generality

• Run Everywhere

Spark Components

Spark Comparison on Calculation of Average

List of Spark Nodes

Getting the data in and out of Spark

Data into Spark Data out of Spark

Statistics and Data Manipulation Nodes

Statistics Data Manipulation

Mining Nodes

Learners Predictors

Other Nodes

KNIME Spark Executor Architecture

Currently Supported Hadoop and KNIME Versions

Hadoop Versions

• Hortonworks HDP 2.2 with Spark 1.2.x

• Hortonworks HDP 2.3 with Spark 1.3.x

• Cloudera CDH 5.3 with Spark 1.2.x

• Cloudera CDH 5.4 with Spark 1.3.x

KNIME Versions

• KNIME Analytics Platform 3.1

• KNIME Server 4.2

Data Transformed

YOUR DATA. CLEARLY.

info@DataTransformed.com.au

02 9956 3781

Actian Vortex on Hadoop 10 minute Demo

http://videos.actian.com/watch/6iEZqvJrEKL2btoqIDImcg

Demonstration of Vortex, Dataflow & Vector

Comparison between Actian Vortex & Cloudera Impala
