Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

The Data Scientists Workplace of the Future - Data Science Connect 22nd of July, 2014

Romeo Kienzler

IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH)

Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg

What is DataScience?

Source: Statoo.com http://slidesha.re/1kmNiX0

DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)

What is BIG data?

Big Data

Hadoop

What is BIG data?

Business Intelligence

Data Warehouse

BigData == Hadoop?

Hadoop BigData

Hadoop

What is beyond “Data Warehouse”?

Data Lake

Data Warehouse

First “BigData” UseCase?
● Google Index
  ● 40 x 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-Memory vs. disk

Map-Reduce → Hadoop → BigInsights

BigData Analytics – Predictive Analytics

"sometimes it's not who has the best algorithm that wins; it's who has the most data."

(C) Google Inc.

The Unreasonable Effectiveness of Data¹

¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

No Sampling => Work with full dataset => No p-Value/z-Scores anymore

Aggregated Bandwidth between CPU, Main Memory and Hard Drive

1 TB (at 10 GByte/s)

- 1 Node - 100 sec

- 10 Nodes - 10 sec

- 100 Nodes - 1 sec

- 1000 Nodes - 100 msec
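The scan times above follow from dividing the data size by the aggregated bandwidth. A minimal sketch of that arithmetic, assuming the data is spread perfectly evenly and ignoring coordination overhead (the function name `scan_seconds` is illustrative, not from the talk):

```python
TB = 10**12  # bytes in one terabyte (decimal)
GB = 10**9   # bytes in one gigabyte (decimal)

def scan_seconds(data_bytes: float, node_bandwidth: float, nodes: int) -> float:
    """Seconds to scan data_bytes when every node reads its share in parallel.

    Assumes an even data distribution and no coordination overhead.
    """
    return data_bytes / (node_bandwidth * nodes)

# Reproduce the node counts used on the slide: 1 TB at 10 GByte/s per node.
for n in (1, 10, 100, 1000):
    print(f"{n:>4} nodes: {scan_seconds(1 * TB, 10 * GB, n):g} s")
```

In practice skewed data and scheduling overhead keep real clusters below this ideal linear scaling.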

Fault Tolerance / Commodity Hardware

AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,

3TB SEAGATE Barracuda 7200.14

< CHF 500

100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD

MTBF ~ 365 d > 1,5 d

Source: http://www.cloudcomputingpatterns.org/Watchdog

“Elastic” Scale-Out

Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload

CPU Cores

CPU Cores Storage

CPU Cores Storage Memory

linear

Source: http://www.cloudcomputingpatterns.org/Elastic_Platform

How do Databases Scale-Out?

Shared Disk Architectures

How do Databases Scale-Out?

Shared Nothing Architectures

Hadoop?

Shared Nothing Architecture?

Shared Disk Architecture?

6 Node Hadoop Cluster 4 Free: http://bluemix.net/

Data Science on Hadoop

SQL (42%)

R (33%)

Python (26%)

Excel (25%)

Java, Ruby, C++ (17%)

SPSS, SAS (9%)

Data Science Hadoop

SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● HIVE, Presto
● Cloudera Impala
● Lingual
● Shark
● ...

SQL Hadoop

Two types of SQL Engines
● Type I
  ● Compiler and Optimizer SQL->MapReduce
● Type II
  ● Brings own distributed execution engine on Data Nodes
  ● Brings own Task Scheduler
● The Hadoop SQL Ecosystem is evolving very fast

Hive
● Runs on top of MapReduce
● → Type I

Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg

Lingual
● ANSI SQL Layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I

Limits of MapReduce
● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
  ● Only sequential data access
  ● Only tuple-wise data access
  ● Map-Side joins have sort and size constraints
  ● Reduce-Side joins require secondary sorting of values
  ● …
● ...
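To illustrate why joins are awkward in this model, here is a minimal in-memory sketch of a reduce-side join. It only mimics the map, shuffle and reduce steps in plain Python and is not Hadoop code; all function names and the tags "L" and "R" are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_phase(tag: str, rows: Iterable[tuple]) -> Iterator[Tuple[object, Tuple[str, tuple]]]:
    # Emit (join_key, (table_tag, row)) so rows of both inputs meet at one reducer.
    for row in rows:
        yield row[0], (tag, row)

def shuffle(pairs: Iterable[Tuple[object, Tuple[str, tuple]]]) -> Dict[object, List[Tuple[str, tuple]]]:
    # Group values by key. In Hadoop this step writes intermediate data to disk and sorts it.
    groups: Dict[object, List[Tuple[str, tuple]]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: Dict[object, List[Tuple[str, tuple]]]) -> Iterator[tuple]:
    # For each key, separate the rows by origin and emit their cross product.
    for key in sorted(groups):
        lefts = [row for tag, row in groups[key] if tag == "L"]
        rights = [row for tag, row in groups[key] if tag == "R"]
        for left in lefts:
            for right in rights:
                yield left + right[1:]  # drop the duplicated join key

def reduce_side_join(left_rows: Iterable[tuple], right_rows: Iterable[tuple]) -> List[tuple]:
    pairs = list(map_phase("L", left_rows)) + list(map_phase("R", right_rows))
    return list(reduce_phase(shuffle(pairs)))

sales = [(1, "apple"), (2, "pear")]
stores = [(1, "Zurich"), (3, "Bern")]
print(reduce_side_join(sales, stores))
```

This sketch buffers both sides of each key in memory. A real reduce-side join relies on secondary sorting so that the rows of one table arrive first and only that side has to be buffered, which is the extra complexity the slide refers to.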

Impala (Type II)

http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png

Presto (Type II)

https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

Spark / Shark (Type II)

Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png

BigSQL V3.0 (Type II)

Like in Spark, MapReduce has been kicked out :) (No JobTracker, no TaskTracker, but HDFS/GPFS remains)

BigSQL V3.0 – Architecture

Putting the story together…
● Big SQL shares a common SQL dialect with DB2
● Big SQL shares the same client drivers with DB2

BigSQL V3.0 – Performance
Query rewrites
● Exhaustive query rewrite capabilities
● Leverages additional metadata such as constraints and nullability
Optimization
● Statistics and heuristic driven query optimization
● Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
● Highly detailed explain plans and query diagnostic tools
● Extensive number of available performance metrics

SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
  PERIOD.PERKEY=DAILY_SALES.PERKEY AND
  PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND
  STORE.STOREKEY=DAILY_SALES.STOREKEY AND
  CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND
  STORE_NUMBER='03' AND
  CATEGORY=72
GROUP BY ITEM_DESC

Query transformation
● Dozens of query transformations
Access plan generation
● Hundreds or thousands of access plan options
[Figure: alternative access plans for the query, joining Product, Store, Daily Sales and Period with NLJOIN, HSJOIN and ZZJOIN operators]
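To give a feeling for why even a four-table query yields hundreds of access plans, the sketch below counts left-deep plans only: every join order combined with every choice of join operator. This is an illustrative simplification and not how the Big SQL optimizer enumerates or prunes plans; the function name is hypothetical.

```python
from itertools import permutations, product
from typing import Iterator, Sequence, Tuple

TABLES = ("PERIOD", "DAILY_SALES", "PRODUCT", "STORE")
JOIN_METHODS = ("NLJOIN", "HSJOIN", "ZZJOIN")

def left_deep_plans(tables: Sequence[str], methods: Sequence[str]) -> Iterator[Tuple[tuple, tuple]]:
    """Yield (join order, join operators) for every left-deep plan."""
    for order in permutations(tables):
        # n tables need n - 1 joins, each with an independent operator choice.
        for chosen in product(methods, repeat=len(tables) - 1):
            yield order, chosen

plan_count = sum(1 for _ in left_deep_plans(TABLES, JOIN_METHODS))
print(plan_count)  # 4! join orders * 3^3 operator choices = 648
```

Bushy join trees, different access paths per table and the placement of the aggregation would increase the number further, which is why a cost-based optimizer has to prune the search space.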

BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce

IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

BigSQL V3.0 – Query Federation

[Figure: a Head Node running Big SQL and several Compute Nodes, each with Task Tracker, Data Node and BigSQL]

BigSQL V1.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigSQL V1.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/biadmin/32Gtest';

BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
| 11416740 |
+----------+
1 row in results (first row: 39.78s; total: 39.78s)

select count(hour), hour from trace group by hour order by hour
30 rows in results (first row: 37.98s; total: 37.99s)

[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour;
+--------+
| 477340 |
+--------+

BigSQL V3.0 – Demo (small)
CREATE HADOOP TABLE trace3 (
  hour int, employeeid int,
  departmentid int, clientid int,
  date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;

BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 12014733 |
+----------+

[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour;
+--------+
| 504360 |
+--------+

[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour;
29 rows in results (first row: 1.88s; total: 1.89s)

R on Hadoop
● IBM BigR (based on the SystemML project from IBM Research Almaden)
● RHadoop
● RHIPE
● ...

“R” Hadoop

Goal: Find column mean
Problems:
● Column vector cannot fit into memory
You have to partition and parallelize
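A minimal sketch of "partition and parallelize" for the column mean: each partition is reduced to a (sum, count) pair, and only these small pairs are combined at the end. The helper names and the use of a thread pool are assumptions for illustration; on Hadoop the partitions would be file splits processed on different nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Sequence, Tuple

def partial_sum_count(partition: Iterable[float]) -> Tuple[float, int]:
    """Reduce one partition to (sum, count) in a single streaming pass."""
    total = 0.0
    count = 0
    for value in partition:
        total += value
        count += 1
    return total, count

def column_mean(partitions: Sequence[Iterable[float]], workers: int = 4) -> float:
    """Mean over all partitions, combining per-partition (sum, count) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("column is empty")
    return total / count

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(column_mean(partitions))  # 3.5
```

Carrying the count along matters: averaging the per-partition means directly would give a wrong result whenever partitions differ in size.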

● Sampling
  ● Full dataset > RAM
  ● Example: use 1% vs 100% of dataset
  ● Precision loss from skewed/sparse data
● Numerical Stability
  ● Limitation from finite precision in computing
  ● Algorithms must be carefully implemented
  ● Instability causes errors to cascade throughout your analysis

Catastrophic Cancellation Error: 6.375 – 5.625
● 6.375 rounds to 6.0
● 5.625 rounds to 6.0
● True value: 0.75, Computed: 0, Relative Error: 1.0
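The cancellation example can be reproduced by forcing both operands to one significant digit before subtracting. This sketch uses Python's `decimal` module to simulate the reduced precision; the function name `rounded_difference` is illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

def rounded_difference(a: str, b: str, digits: int) -> Decimal:
    """Subtract b from a after rounding both to the given number of significant digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    rounded_a = ctx.create_decimal(a)  # 6.375 -> 6 at one digit
    rounded_b = ctx.create_decimal(b)  # 5.625 -> 6 at one digit
    return ctx.subtract(rounded_a, rounded_b)

true_value = Decimal("6.375") - Decimal("5.625")
computed = rounded_difference("6.375", "5.625", digits=1)
relative_error = abs(true_value - computed) / abs(true_value)

print(true_value, computed, relative_error)
```

Both inputs carry a small rounding error on their own, but subtracting two nearly equal values removes the leading digits and leaves only that error, so the result loses all significance.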

Data in Hadoop

R User

Data in distributed memory

Data in Hadoop: Can run R on a single node

R User

Data in distributed memory

BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute
● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size)
  ● for arbitrary complex ML programs
● Key Technical Innovations
  ● CP & MR Runtime: Single machine & MR operations, integrated runtime
  ● Caching: Reuse and eviction of in-memory objects
  ● Cost Model: Accurate time and worst-case memory estimates
  ● Optimizer: Cost-based runtime plan generation
  ● Dyn. Recompiler: Re-optimization for initial unknowns

Hybrid Plans
[Figure: with growing data size, plans move from CP via CP/MR to MR]
● Gradually exploit MR parallelism
● High performance computing for small data sizes.
● Scalable computing for large data sizes.
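The idea of choosing between single-machine (CP) and cluster (MR) execution from a worst-case memory estimate can be sketched as follows. This is a deliberately naive stand-in and not SystemML's actual cost model or optimizer; the function names, the dense 8-bytes-per-cell estimate and the 2 GiB budget are all assumptions.

```python
def worst_case_dense_bytes(rows: int, cols: int, bytes_per_cell: int = 8) -> int:
    """Worst-case size of a dense matrix of doubles (hypothetical estimate)."""
    return rows * cols * bytes_per_cell

def choose_runtime(rows: int, cols: int, memory_budget_bytes: int) -> str:
    """Return 'CP' if the worst-case estimate fits the memory budget, else 'MR'."""
    if worst_case_dense_bytes(rows, cols) <= memory_budget_bytes:
        return "CP"
    return "MR"

budget = 2 * 1024**3  # hypothetical JVM budget of 2 GiB

print(choose_runtime(1_000_000, 10, budget))    # 80 MB estimate fits: CP
print(choose_runtime(650_000_000, 6, budget))   # about 31 GB estimate: MR
```

A real compiler makes this decision per operation rather than per program and revisits it at runtime once initially unknown sizes become available, which is what the hybrid plans and the dynamic recompiler address.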

BigR Architecture
[Figure: R Clients, SystemML Statistics Engine, Data Sources, Embedded R Execution, IBM R Packages]
● Pull data (summaries) to R client
● Or, push R functions right on the data

Big R Data Structures: Proxy to entire dataset

data <- bigr.frame(…)

Appears and acts like all of the data is on your laptop

BigR Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigR Demo (small)
library(bigr)
bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()
tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)
h <- bigr.histogram.stats(tbr$V1, nbins=24)

BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000

BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()

SPSS on Hadoop

BigSheets Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigSheets Demo (small)

This command runs on 32 GB / ~650.000.000 rows in HDFS

Text Extraction (SystemT, AQL)

If this is not enough? → BigData AppStore

BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● PigLatin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● hdfs files
    ● BigSQL tables
    ● BigSheets collections

Questions?

http://www.ibm.com/software/data/bigdata/

Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
