Big Data Analytics with Hadoop, MongoDB and SQL Server

SQL Server and Big Data Projects in the Real World

Mark KromerPentaho Big Data Analytics Product Manager

@mssqldude

@kromerbigdata

http://www.kromerbigdata.com

http://www.kromerbigdata.com/

1. The Big Data Technology Landscape2. Big Data Analytics3. 3 Big Data Analytics Scenarios:

❯ Digital Marketing Analytics• Hadoop, Aster Data, SQL Server

❯ Sentiment Analysis• MongoDB, SQL Server

❯ Data Refinery• Hadoop, MPP, SQL Server, Pentaho

4. SQL Server in the Big Data world

What we’ll (try) to cover today

Big Data 101

3 V’s❯ Volume – Terabyte records, transactions, tables, files❯ Velocity – Batch, near-time, real-time (analytics), streams.❯ Variety – Structures, unstructured, semi-structured, and all the above in a

mix

Text Processing❯ Techniques for processing and analyzing unstructured (and structured)

LARGE files

Analytics & InsightsDistributed File System & Programming

• Batch Processing• Commodity Hardware• Data Locality, no

shared storage• Scales linearly• Great for large text file

processing, not so great on small files

• Distributed programming paradigm

Hadoop 1.x

© Hortonworks Inc. 2014

Hadoop 1 vs Hadoop 2

HADOOP 1.0

HDFS(redundant, reliable storage)

MapReduce(cluster resource management

& data processing)

HDFS2(redundant, highly-available & reliable storage)

YARN(cluster resource management)

MapReduce(data processing)

Others

HADOOP 2.0

Single Use SystemBatch Apps

Multi Purpose PlatformBatch, Interactive, Online, Streaming, …


YARN: Taking Hadoop Beyond Batch

Applications Run Natively in Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

BATCH(MapReduce)

INTERACTIVE(Tez)

STREAMING(Storm, S4,…)

GRAPH(Giraph)

IN-MEMORY(Spark)

HPC MPI(OpenMPI)

ONLINE(HBase)

OTHER(Search)

(Weave…)

Store ALL DATA in one place…

Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service


YARN Eco-system

Page 7

Applications Powered by YARN

Apache Giraph – Graph Processing

Apache Hama - BSP

Apache Hadoop MapReduce – Batch

Apache Tez – Batch/Interactive

Apache S4 – Stream Processing

Apache Samza – Stream Processing

Apache Storm – Stream Processing

Apache Spark – Iterative applications

Elastic Search – Scalable Search

Cloudera Llama – Impala on YARN

DataTorrent – Data Analysis

HOYA – HBase on YARN

Frameworks Powered By YARN

Apache Twill

REEF by Microsoft

Spring support for Hadoop 2

Apache SparkHigh- Speed In- Memory Analytics over Hadoop

● Open Source● Alternative to Map Reduce for certain applications● A low latency cluster computing system● For very large data sets● May be 100 times faster than Map Reduce for

– Iterative algorithms

– Interactive data mining

● Used with Hadoop / HDFS● Released under BSD License

Popular Hadoop Distributions

Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure

Popular NoSQL Distributions

Transactional-based, not analytics schemas

Popular MPP Distributions

Big Data as distributed, scale-out, sharded data stores

Big Data Analytics Web Platform – RA 1

Data Source

s

Data Maste

ring

Data

Warehouse &

Analytics

Presentatio

n

AttributionSegmentationStacking Effect

…

Media Level Data Warehouse Audience Level

Data WarehouseBig Data SandboxesData Mapping

Business RulesExternal &

Extended Data

Tableau & Pentaho

MapReduceJobs

Sentiment AnalysisReference Architecture 2

Big Data Platforms

Hadoop

PDW

MongoDB

Social Media Sources

Data Orchestration

Data ModelsAnalytical

Models

OLAP Cubes

Data Mining

OLAP Analytics Tools,

Reporting Tools,

Dashboards

Streamlined Data RefineryReference Architecture 3

Transactions – Batch & Real-time

Enrollments & Redemptions

Location, Email, Other Data

Hadoop Cluster

Analytics

Reports

Data Orchestration

Big Data Analytics

• Distributed Data (Data Locality)❯ HDFS / MapReduce❯ YARN / TEZ❯ Replicated / Sharded Data

• MPP Databases❯ Vertica, Aster, Microsoft, Greenplum … In-database analytics that can scale-out

with distributed processing across nodes• Distributed Analytics

❯ SAS: Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” http://www.sas.com/resources/whitepaper/wp_46345.pdf

• In-memory Analytics❯ Microsoft PowerPivot (Tabular models)❯ SAP HANA❯ Tableau

Big Data AnalyticsCore Tenets

http://www.sas.com/resources/whitepaper/wp_46345.pdf

SQL on HadoopHortonworks and Cloudera DW Engine Approaches

SQL on Hadoop LandscapeGartner Research on SQL on Hadoop

Not Quite Real Time

Many vendors market their SQL interfaces to Hadoop as providing so called "real-time access" to

data stored in a Hadoop cluster … SQL on Hadoop provides a purely interactive data query and data manipulation experience — faster than batch, but not truly real time. In the case of Hadoop and the types of tasks it performs, we define interactive time frames as between 30 milliseconds and 10 minutes. If

your usage truly needs realtime, a different set of technologies and vendors may be required.

SQL on HadoopVendor Perspective: MapR

Batch SQL

Hive is used primarily for queries on very large data sets and large ETL jobs. The queries can take anywhere between a few minutes to several hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities in future. MapR ships and supports Apache Hive today.

Interactive SQLTechnologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. The response times vary between milliseconds to minutes depending on the query complexity. Users expect SQL-on-Hadoop technologies to support common BI tools such as Tableau and MicroStrategy (to name a couple) for reporting and ad-hoc queries. MapR supports customers using Impala on the MapR distribution of Hadoop today. Apache Drill will be available Q2 2014.

using Microsoft.Hadoop.MapReduce;using System.Text.RegularExpressions;public class TotalHitsForPageMap : MapperBase{ public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, "\\s+"); if (parts.Length != expected) //only take records with all values { return; } context.EmitKeyValue(parts[pagePos], hit); } }

MapReduce Framework (Map)

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } }

MapReduce Framework (Reduce & Job)

Linux shell commands to access data in HDFSPut file in HDFS: hadoop fs -put sales.csv /import/sales.csvList files in HDFS:c:\Hadoop>hadoop fs -ls /import

Found 1 items-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv

View file in HDFS:c:\Hadoop>hadoop fs -cat /import/sales.csvKromer,123,5,55Smith,567,1,25Jones,123,9,99James,11,12,1Johnson,456,2,2.5Singh,456,1,3.25Yu,123,1,11

Now, we can work on the data with MapReduce, Hive, Pig, etc.

Get Data into Hadoop

create external table ext_sales( lastname string, productid int, quantity int, sales_amount float)row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input';LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;

Use Hive for Data Schema and Analysis

sqoop import –connect jdbc:sqlserver://localhost –username sqoop -password password –table customers -m 1

> hadoop fs -cat /user/mark/customers/part-m-00000

> 5,Bob Smith

sqoop export –connect jdbc:sqlserver://localhost –username sqoop -password password -m 1 –table customers –export-dir /user/mark/data/employees3

12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)

12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

SqoopData transfer to & from Hadoop & SQL Server

Role of NoSQL in a Big Data Analytics Solution‣ Use NoSQL to store data quickly without the overhead of RDBMS

‣ Hbase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few

‣ Why NoSQL?

‣ In the world of “Big Data”

‣ “Schema later”

‣ Ignore ACID properties

‣ Drop data into key-value store quick & dirty

‣ Worry about query & read later

‣ Why NOT NoSQL?

‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface

‣ SQL Server and NoSQL

‣ Not a natural fit

‣ Use HDFS or your favorite NoSQL database

‣ Consider turning off SQL Server locking mechanisms

‣ Focus on writes, not reads (read uncommitted)

MongoDB and Enterprise IT Stack

EDWHadoop

Man

agem

ent

& M

on

ito

rin

gS

ecurity &

Au

ditin

g

RDBMS

CRM, ERP, Collaboration, Mobile, BI

OS & Virtualization, Compute, Storage, Network

RDBMS

Applications

Infrastructure

Data Management

Online Data Offline Data

{ _id : ObjectId("4e2e3f92268cdda473b628f6"),sourceIDs: {

ABCSystemIDPart1: 8397897, ABCSystemIDPart2: 2937430,ABCSystemIDPart3: 932018 }

accountType: “Checking”,accountOwners: [

{ firstName : ”John", lastName: “Smith”, contactMethods: [

{ type: “phone”, subtype: “mobile”, number: 8743927394},{ type: “mail”, address: “58 3rd St.”, city: …} ]

possibleMatchCriteria: { govtID: 2938932432, fullName: “johnsmith”, dob: … } }, { firstName : ”Anne",

maidenName: “Collins”, lastName: “Smith”, …} ],

openDate: ISODate("2013-02-15 10:00”), accountFeatures { Overdraft: true, APR: 20, … }

}

General document per customer per account

OR creditCardNumber: 8392384938391293OR mortgageID: 2374389OR policyID: 18374923

Text Search Example (e.g. address typo so do fuzzy match)

// Text search for address filtered by first name and NY

> db.ticks.runCommand(

“text”,

{ search: “vanderbilt ave. vander bilt”,

filter: {name: “Smith”,

city: “New York”} })

//Find total value of each customer’s accounts for a given RM (or Agent) sorted by value

db.accts.aggregate( { $match: {relationshipManager: “Smith”}},

{ $group : { _id : “$ssn”, totalValue: {$sum: ”$value”} }},

{ $sort: { totalValue: -1}} )

Aggregate: Total Value of Accounts

SQL Server Big Data – Data Loading

Amazon HDFS & EMR Data Loading

Amazon S3 Bucket

SQL Server Database❯ SQL 2012 Enterprise Edition❯ Page Compression❯ 2012 Columnar Compression on Fact Tables❯ Clustered Index on all tables❯ Auto-update Stats Asynch❯ Partition Fact Tables by month and archive data with sliding window technique❯ Drop all indexes before nightly ETL load jobs❯ Rebuild all indexes when ETL completes

SQL Server Analysis Services❯ SSAS 2012 Enterprise Edition❯ 2008 R2 OLAP cubes partition-aligned with DW❯ 2012 cubes in-memory tabular cubes❯ All access through MSMDPUMP or SharePoint

SQL Server Big Data Environment

ColumnstoreSqoop adapterPolyBaseHiveIn-memory analyticsScale-out MPPSQL Server APS

SQL Server Big Data Analytics Features

DBA ETL/BI Developer Business Users & Executives

Analysts & Data Scientists

OPERATIONAL DATA BIG DATA DATA STREAMPUBLIC/PRIVATE CLOUDS

Enterprise & Interactive Reporting

Interactive Analysis

Dashboards Predictive Analytics

Pentaho Business Analytics

Data IntegrationInstaview | Visual Map Reduce

DIRECT ACCESS

Pentaho Big Data Analytics

Pentaho Big Data Analytics Accelerate the time to big data value

• Full continuity from data

access to decisions –

complete data integration &

analytics for any big data

store

• Faster development,

faster runtime – visual

development, distributed

execution

• Instant and interactive

analysis – no coding and

no ETL required

Product Components

Pentaho Data Integration

• Visual development for big data• Broad connectivity• Data quality & enrichment• Integrated scheduling• Security integration

• Visual data exploration• Ad hoc analysis• Interactive charts & visualizations

Pentaho Dashboards

• Self-service dashboard builder• Content linking & drill through• Highly customized mash-ups

Pentaho Data Mining & Predictive Analytics

• Model construction & evaluation • Learning schemes• Integration with 3rd part models

using PMML

Pentaho Enterprise & Interactive Reports

• Both ad hoc & distributed reporting• Drag & drop interactive reporting• Pixel-perfect enterprise reports

Pentaho for Big Data MapReduce & Instaview

• Visual Interface for Developing MR

• Self-service big data discovery• Big data access to Data Analysts

Pentaho Analyzer

❯ Simple, easy-to-use visual data exploration

❯ Web-based thin client; in-memory caching

❯ Rich library of interactive visualizations • Geo-mapping, heat grids, scatter plots, bubble

charts, line over bar and more• Pluggable visualizations

❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMs

❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached

Pentaho Interactive Analysis & Data DiscoveryHighly Flexible Advanced Visualizations