BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers - Keyrus

BIG DATA ANALYTICS

BUSINESS INTELLIGENCE

INFORMATION MANAGEMENT

PERFORMANCE MANAGEMENT

© Copyright 2015 – Keyrus 2

DIVING INTO WEBLOG DATA WITH SAS ON HADOOP

Lisa Truyers, Data Scientist Consultant at Keyrus

March 24, 2016

Project summary

WHO HAS EVER TRIED TO OPEN A 1 GB FILE ON A COMPUTER?

AGENDA

• What is Hadoop?
• Project summary
• Components of the Hadoop-SAS framework
• Setup to load data
• Benchmarks
• Lessons learned

What is Hadoop?

PROS

• Open-source software framework
• Storage and large-scale data processing
• Easy and economical scaling
• Handles both structured and unstructured data
• Runs on low-cost commodity hardware
• Starts multiple copies of the same task for the same block of data, so a slow or failed node does not stall the job

51% OF COMPANIES ARE THINKING ABOUT INTEGRATING HADOOP BY 2016
Philip Russom, TDWI Best Practices Report: Integrating Hadoop into Business

What is Hadoop?

CONS

• Management and high-availability capabilities are just starting to emerge
• Data security is fragmented
• MapReduce is very batch-oriented
• No easy-to-use, full-featured tools for data integration, data cleansing, governance and metadata
• Skilled professionals are scarce

MANAGE THE DATA AND USE ANALYTICS TO QUICKLY IDENTIFY PREVIOUSLY UNKNOWN INSIGHTS: ACCESS THE DIFFERENT TOOLS OF SAS

WHAT ARE COMPANIES DOING WITH HADOOP?
What is Hadoop?

The percentages mentioned here cover the whole world, not only Europe.

What?                                                        Percentage
Data warehouse extensions                                    46 %
Data exploration and discovery                               46 %
Data staging for data warehousing and data integration       39 %
Data lake                                                    39 %
Queryable archive for non-traditional data                   36 %
Computational platform and sandbox for advanced analytics    33 %

WHY IS HADOOP (NOT) IMPORTANT?
What is Hadoop?

“Cost savings. Linear scalability. Evaluate ‘the hype’ practically. Complement BI.”
BI architect, telecom, Europe

“Reduces cost of data. New ability to query big data sets. Supply chain improvements. Predictive analytics.”
Vice president, food and beverage, Asia

“Our existing infrastructure cannot handle the tenfold increase in data volumes.”
Data strategy manager, hospitality, US

“It’s important to realize the potential of big data and to explore new business opportunities.”
Data specialist, consulting, Asia

INTRODUCTION
Project summary

1. Discover web traffic data

• The sheer volume of data makes it impossible to analyse at the moment
• Prove the added value of a combined Hadoop-SAS environment

2. Lead generation

• More business-oriented: scoring a neural network model takes one hour every day
• Goal: reduce this scoring time

HADOOP COMPONENTS
Components of the Hadoop-SAS framework

Hadoop components shown in the diagram:

• HDFS
• YARN
• MapReduce
• HBase
• Pig
• Hive & HCatalog
• Ambari
• Oozie
• Flume
• Sqoop
• NFS
• WebHDFS

SAS components shown in the diagram:

• Base SAS & SAS/ACCESS® to Hadoop™
• SAS Metadata
• SAS IMSTAT for Hadoop
• SAS® Visual Analytics & Statistics
• SAS® LASR™ Analytic Server
• SAS® High-Performance Analytic Procedures
• SAS® Enterprise Guide®

SAS COMPONENTS
Components of the Hadoop-SAS framework

• Base SAS & SAS/ACCESS® to Hadoop™
• SAS Metadata
• SAS® LASR™ Analytic Server
• SAS® High-Performance Analytic Procedures
• SAS IMSTAT for Hadoop
• SAS® Visual Analytics & Statistics
• SAS® Enterprise Guide®

FULL PROCESS
Setup to load data

Day

• A: Partitioned, non-parsed, for day-files
• C: Partitioned, parsed, for day-files

Hour

• B: Partitioned, non-parsed, for hour-files
• D: Partitioned, parsed, for hour-files
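The difference between day-files and hour-files can be illustrated with a minimal Python sketch. It assumes Hive-style partition paths of the form `day=.../hour=...`; the helper names `partition_key` and `partition_records` and the sample hits are hypothetical and not taken from the project.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(timestamp: datetime, granularity: str) -> str:
    """Return a Hive-style partition path for a day or an hour partition."""
    day = timestamp.strftime("%Y-%m-%d")
    if granularity == "day":
        return f"day={day}"
    if granularity == "hour":
        return f"day={day}/hour={timestamp.hour:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def partition_records(records, granularity):
    """Group (timestamp, raw_line) pairs by their partition key."""
    partitions = defaultdict(list)
    for timestamp, line in records:
        partitions[partition_key(timestamp, granularity)].append(line)
    return dict(partitions)


# Hypothetical sample: three hits spread over two hours of one day.
sample = [
    (datetime(2016, 3, 24, 9, 15), "hit-1"),
    (datetime(2016, 3, 24, 9, 50), "hit-2"),
    (datetime(2016, 3, 24, 10, 5), "hit-3"),
]
by_day = partition_records(sample, "day")    # one partition (setups A and C)
by_hour = partition_records(sample, "hour")  # two partitions (setups B and D)
```

Hour partitions produce more, smaller files than day partitions for the same data, which is one reason to benchmark both granularities.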

PROCESS C
Setup to load data

• Delete Hive table
• Transfer to Hadoop
• Parse data
• Merge
• Loop
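The steps of process C could be sketched in Python as follows. This is an illustration only, not the project's code: the order of the steps follows the slide as extracted, the log layout is assumed to be Common Log Format (the real weblog format is not given), and plain Python lists stand in for the Hive staging table and the merged table. The names `parse_line` and `run_process` are hypothetical.

```python
import re

# Assumed log layout (Common Log Format); the real weblog format is not given.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    return row


def run_process(day_files):
    """Run the delete, transfer, parse and merge steps once per day-file."""
    merged = []   # stands in for the final parsed table
    staging = []  # stands in for a temporary Hive staging table
    for day, raw_lines in day_files.items():             # loop
        staging.clear()                                   # delete Hive table
        staging.extend(raw_lines)                         # transfer to Hadoop
        parsed = [parse_line(line) for line in staging]   # parse data
        merged.extend({**row, "day": day} for row in parsed if row)  # merge
    return merged


# Hypothetical day-files, including one malformed line that is skipped.
day_files = {
    "2016-03-23": [
        '10.0.0.1 - - [23/Mar/2016:09:15:02 +0100] "GET /index.html HTTP/1.1" 200 5120',
        "malformed line",
    ],
    "2016-03-24": [
        '10.0.0.2 - - [24/Mar/2016:10:05:44 +0100] "GET /contact HTTP/1.1" 404 312',
    ],
}
rows = run_process(day_files)
```

Clearing the staging area at the start of each iteration keeps one day's raw lines from leaking into the next day's parse.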

AGENDA

• Project summary
• What is Hadoop?
• Components of the Hadoop-SAS framework
• SAS-tools used in this project
• Setup to load data
• Benchmarks
• Lessons learned

HADOOP COMPARED TO SERVER
Benchmarks

                              Server                Hadoop
Query test on one day         35 seconds            35 seconds
Parsing data for one day      15 minutes            15 minutes
Parsing data for one week     4 hours 30 minutes    53 minutes

MORE TIME NEEDED FOR EXTRA BENCHMARKS
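To make the comparison concrete, the speedup per task can be computed directly from the figures on this slide. The sketch below is only arithmetic on those reported timings; the dictionary keys and the `speedup` helper are illustrative names.

```python
from datetime import timedelta

# Timings as reported on the benchmark slide.
server = {
    "query_one_day": timedelta(seconds=35),
    "parse_one_day": timedelta(minutes=15),
    "parse_one_week": timedelta(hours=4, minutes=30),
}
hadoop = {
    "query_one_day": timedelta(seconds=35),
    "parse_one_day": timedelta(minutes=15),
    "parse_one_week": timedelta(minutes=53),
}


def speedup(server_time: timedelta, hadoop_time: timedelta) -> float:
    """Return how many times faster Hadoop is than the server."""
    return server_time / hadoop_time


speedups = {task: speedup(server[task], hadoop[task]) for task in server}
for task, factor in speedups.items():
    print(f"{task}: {factor:.2f}x")
```

Only the week-long parsing job shows a gain (roughly a factor of five); the one-day tasks run equally fast on both environments.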

Lessons learned

Teamwork is key

• Set up the Hadoop cluster together with Hadoop experts
• Install SAS with experts from the company

SAS ON HADOOP

• In SAS, take your time to set the correct variable length
• Choose the strength of the cluster rationally
• Create benchmarks on both environments (server vs. Hadoop) early on, so that a good comparison can be made and the correct decision taken
• Data must be large enough on Hadoop to see a difference
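Why variable length matters can be shown with a rough, hedged calculation. SAS stores character variables at their declared fixed width, so an oversized length is paid for on every row of an uncompressed table. The sketch below assumes that string columns arrive with a default length of 32,767 bytes (an assumption about the SAS/ACCESS setup, not something stated in the talk); the column names, tuned lengths and row count are hypothetical.

```python
def row_width(lengths: dict) -> int:
    """Bytes per row when every character variable uses its declared length."""
    return sum(lengths.values())


def table_size_gb(lengths: dict, rows: int) -> float:
    """Uncompressed table size in GiB for a given number of rows."""
    return row_width(lengths) * rows / 1024**3


# Assumed default: each string column declared at 32,767 bytes.
default_lengths = {"url": 32767, "user_agent": 32767, "referrer": 32767}
# Hypothetical tuned lengths chosen after inspecting the data.
tuned_lengths = {"url": 2048, "user_agent": 512, "referrer": 2048}

rows = 1_000_000  # hypothetical number of weblog records
default_gb = table_size_gb(default_lengths, rows)
tuned_gb = table_size_gb(tuned_lengths, rows)
print(f"default: {default_gb:.1f} GiB, tuned: {tuned_gb:.1f} GiB")
```

Under these assumptions the tuned table is more than twenty times smaller, which directly affects transfer and load times.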

THANK YOU FOR YOUR ATTENTION

To contact us

www.keyrus.com

[email protected]