Big Data with SQL Server


My updated Big Data with SQL Server presentation from my Razorfish case study. Presented Nov 17 @ Philly Code Camp.


Big Data with SQL Server

Philly Code Camp
November 2012

Mark Kromer
BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude

‣ What is Big Data?

‣ The Big Data and Apache Hadoop environment

‣ Big Data Analytics

‣ SQL Server in the Big Data world

‣ How we utilize Big Data @ Razorfish

What we’ll (try to) cover today


Big Data 101

‣ 3 V’s

‣ Volume – Terabyte records, transactions, tables, files

‣ Velocity – Batch, near-time, real-time (analytics), streams.

‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix

‣ Text Processing

‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights

‣ Distributed File System & Programming

‣ Batch Processing

‣ Commodity Hardware

‣ Data Locality, no shared storage

‣ Scales linearly

‣ Great for large text file processing, not so great on small files

‣ Distributed programming paradigm

using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // Field layout of the input log line; these values are assumptions
    // that depend on the actual W3C log format being processed.
    private const int expected = 14;   // field count of a complete record
    private const int pagePos = 4;     // index of the page (cs-uri-stem) field
    private const string hit = "1";    // each complete record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit (page, 1); the reducer/combiner totals the hits per page
        context.EmitKeyValue(parts[pagePos], hit);
    }
}

MapReduce Framework (Map)

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum all of the per-page hit counts emitted by the mapper
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

using System;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations come from environment variables
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true; // overwrite results from prior runs
        return retVal;
    }
}

MapReduce Framework (Reduce & Job)
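
To actually submit the job, the .NET SDK for Hadoop exposes a cluster connection object. A minimal sketch, assuming the local-cluster Connect() overload (overloads and namespaces varied across early SDK releases):

using Microsoft.Hadoop.MapReduce;

public class Program
{
    public static void Main(string[] args)
    {
        // Connect to the default local cluster and run the job end to end;
        // the W3C_INPUT and W3C_OUTPUT environment variables must be set first.
        var hadoop = Hadoop.Connect();
        var result = hadoop.MapReduceJob.ExecuteJob<TotalHitsJob>();
    }
}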

‣ Big Data ≠ NoSQL

‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack

‣ Big Data ≠ Real Time

‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse

‣ I still refer to large multi-TB DWs as “VLDB”

‣ Big Data is about crunching stats in text files for discovery of new patterns and insights

‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Mark’s Big Data Myths

‣ Web Analytics

‣ Big Data Analytics

‣ Digital Marketing – Ad Server Analytics

‣ Multiple TBs of online data per client per year

‣ Elastic Web-scale MapReduce & Hadoop

‣ Increase ROI of digital marketing campaigns

Razorfish & Big Data

Big Data Analytics Web Platform

[Platform diagram: Data Sources → Data Mastering → Data Warehouse & Analytics → Presentation. Components: Attribution, Segmentation, Stacking Effect; Media Level Data Warehouse; Audience Level Data Warehouse; Big Data Sandboxes; Data Mapping; Business Rules; External & Extended Data; MapReduce Jobs; Tableau, SAS & Excel]

In-Database Analytics (Teradata Aster)

Prepackaged Analytics Functions (including Attribution)

‣ Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.

SQL Server Big Data – Data Loading

[Diagram: data loading between an Amazon S3 bucket and Amazon HDFS & Elastic MapReduce (EMR)]

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000

‣ > 5,Bob Smith

‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Sqoop: Data transfer to & from Hadoop & SQL Server

‣ SQL Server Database

‣ SQL Server 2008 R2 or 2012 Enterprise Edition

‣ Page Compression

‣ 2012 Columnar Compression on Fact Tables

‣ Clustered Index on all tables

‣ Auto-update statistics asynchronously

‣ Partition Fact Tables by month and archive data with sliding window technique

‣ Drop all indexes before nightly ETL load jobs

‣ Rebuild all indexes when ETL completes (see the sketch after this slide)

‣ SQL Server Analysis Services

‣ SSAS 2008 R2 or 2012 Enterprise Edition

‣ 2008 R2 OLAP cubes partition-aligned with DW

‣ 2012 cubes: in-memory tabular models

‣ All access through MSMDPUMP or SharePoint

SQL Server Big Data Environment
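
One way to automate the drop/rebuild pattern above from the ETL driver; a minimal sketch using System.Data.SqlClient, where the connection string, index name, and the dbo.FactClicks table are hypothetical stand-ins:

using System;
using System.Data.SqlClient;

public static class EtlIndexMaintenance
{
    // Hypothetical connection string; real deployments would read this from config
    private const string ConnStr =
        "Server=localhost;Database=MediaDW;Integrated Security=true";

    public static void RunNightlyLoad(Action loadFactTables)
    {
        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();
            // Disable the nonclustered indexes before the nightly ETL load...
            Exec(conn, "ALTER INDEX IX_FactClicks_DateKey ON dbo.FactClicks DISABLE");
            loadFactTables();
            // ...then rebuild all indexes once the load completes
            Exec(conn, "ALTER INDEX ALL ON dbo.FactClicks REBUILD");
        }
    }

    private static void Exec(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
        {
            cmd.ExecuteNonQuery();
        }
    }
}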

‣ Columnstore

‣ Sqoop adapter

‣ PolyBase

‣ Hive

‣ In-memory analytics

‣ Scale-out MPP

SQL Server Big Data Analytics Features
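
As an illustration of the columnstore feature called out above, SQL Server 2012 supports nonclustered columnstore indexes over fact tables; a minimal sketch reusing the hypothetical dbo.FactClicks table and SqlClient helper pattern from the previous slide:

using System.Data.SqlClient;

public static class ColumnstoreSetup
{
    // Hypothetical fact table and columns; a 2012 columnstore index is
    // nonclustered and makes the table read-only until it is dropped or disabled.
    private const string CreateIndexSql = @"
        CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactClicks
        ON dbo.FactClicks (DateKey, PageKey, CampaignKey, Hits)";

    public static void Create(string connStr)
    {
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (var cmd = new SqlCommand(CreateIndexSql, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}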

‣ What is a Big Data approach to Analytics?

‣ Massive scale

‣ Data discovery & research

‣ Self-service

‣ Reporting & BI

‣ Why did we take this Big Data Analytics approach?

‣ Each Web client produces an average of 6 TBs of ICA data in a year

‣ The data in the sources are variable and unstructured

‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity

‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data

‣ With the configs mentioned previously, SQL Server is working great

‣ Analytics on Big Data also requires Big Data Analytics tools

‣ Aster, Tableau, PowerPivot, SAS

Wrap-up
