Big Data with SQL Server Philly SQL Server User Group November 2012 Mark Kromer Razorfish BI & Big Data Technology Director http://www.kromerbigdata.com @kromerbigdata @mssqldude

PSSUG Nov 2012: Big Data with SQL Server


DESCRIPTION

Mark Kromer's presentation of Big Data Analytics with Hadoop, Teradata, SQL Server, Tableau, SAS & PowerPivot


Page 1: PSSUG Nov 2012: Big Data with SQL Server

Big Data with SQL Server

Philly SQL Server User Group
November 2012

Mark Kromer
Razorfish BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude

Page 2: PSSUG Nov 2012: Big Data with SQL Server

‣What is Big Data?

‣The Big Data and Apache Hadoop environment

‣Big Data Analytics

‣SQL Server in the Big Data world

‣How we utilize Big Data @ Razorfish

What we’ll (try to) cover tonight


Page 3: PSSUG Nov 2012: Big Data with SQL Server

Big Data 101

‣ 3 V’s

‣ Volume – Terabyte records, transactions, tables, files

‣ Velocity – Batch, near-time, real-time (analytics), streams.

‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix

‣ Text Processing

‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights

‣ Distributed File System & Programming

Page 4: PSSUG Nov 2012: Big Data with SQL Server

‣ Big Data ≠ NoSQL

‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack

‣ Big Data ≠ Real Time

‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse

‣ I still refer to large multi-TB DWs as “VLDB”

‣ Big Data is about crunching stats in text files for discovery of new patterns and insights

‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Mark’s Big Data Myths

Page 5: PSSUG Nov 2012: Big Data with SQL Server

‣ Batch Processing

‣ Commodity Hardware

‣ Data Locality, no shared storage

‣ Scales linearly

‣ Great for large text file processing, not so great on small files

‣ Distributed programming paradigm
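
The distributed programming paradigm behind Hadoop is MapReduce. The following is a minimal single-process sketch of its three steps (map, shuffle, reduce) using the classic word-count example. It is not Hadoop code: the function names are hypothetical, and a real cluster would run the mappers and reducers on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data with sql server", "big data analytics"]
    print(word_count(sample))
```

Because each mapper only sees its own lines and each reducer only sees one key, both phases can be spread across many machines, which is what lets the approach scale out on commodity hardware.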

Page 6: PSSUG Nov 2012: Big Data with SQL Server

Big Data Analytics Web Platform

[Diagram: pipeline stages are Data Sources, Data Mastering, Data Warehouse & Analytics, and Presentation. Components shown: Attribution, Segmentation, Stacking Effect, Media Level Data Warehouse, Audience Level Data Warehouse, Big Data Sandboxes, Data Mapping, Business Rules, External & Extended Data, MapReduce Jobs, Tableau, SAS & Excel]

Page 7: PSSUG Nov 2012: Big Data with SQL Server

In-Database Analytics (Teradata Aster)

Prepackaged Analytics Functions (including Attribution)

• Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.

Page 8: PSSUG Nov 2012: Big Data with SQL Server

SQL Server Big Data – Data Loading

Amazon HDFS & EMR

Data Loading

Amazon S3 Bucket

Page 9: PSSUG Nov 2012: Big Data with SQL Server

‣ SQL Server Database

‣ SQL Server 2008 R2 or 2012 Enterprise Edition

‣ Page Compression

‣ 2012 Columnar Compression on Fact Tables

‣ Clustered Index on all tables

‣ Auto-update statistics asynchronously

‣ Partition Fact Tables by month and archive data with sliding window technique

‣ Drop all indexes before nightly ETL load jobs

‣ Rebuild all indexes when ETL completes

‣ SQL Server Analysis Services

‣ SSAS 2008 R2 or 2012 Enterprise Edition

‣ 2008 R2 OLAP cubes partition-aligned with DW

‣ 2012 cubes are in-memory tabular models

‣ All access through MSMDPUMP or SharePoint

SQL Server Big Data Environment
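
The sliding window technique mentioned for the monthly fact-table partitions can be sketched as a small date calculation: each month, one new boundary is added at the front and the oldest partition is archived. The helper names and the retention count below are hypothetical; the slide does not say how many months are kept. In SQL Server the "archive" step would typically be a partition switch followed by a merge of the old boundary, and the "add" step a split for the new one.

```python
from datetime import date


def add_months(first_of_month, n):
    """Shift a first-of-month date by n months (n may be negative)."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)


def sliding_window(current_month, months_to_keep):
    """Return (boundaries to keep, boundary to archive, boundary to add).

    months_to_keep is an assumed retention setting, not a value from the slides.
    """
    first_kept = add_months(current_month, -(months_to_keep - 1))
    keep = [add_months(first_kept, i) for i in range(months_to_keep)]
    archive = add_months(first_kept, -1)   # oldest partition, rolled out of the window
    new = add_months(current_month, 1)     # next month's partition, created ahead of the load
    return keep, archive, new


if __name__ == "__main__":
    keep, archive, new = sliding_window(date(2012, 11, 1), months_to_keep=3)
    print("keep:", keep)
    print("archive:", archive)
    print("add:", new)
```

Computing the boundaries separately from the DDL keeps the monthly job easy to test before any partition is actually switched or merged.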

Page 10: PSSUG Nov 2012: Big Data with SQL Server

‣ What is a Big Data approach to Analytics?

‣ Massive scale

‣ Data discovery & research

‣ Self-service

‣ Reporting & BI

‣ Why did we take this Big Data Analytics approach?

‣ Each Web client produces an average of 6 TB of ICA data per year

‣ The data in the sources are variable and unstructured

‣ SSIS ETL alone couldn’t keep up or handle the complexity

‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data

‣ With the configs mentioned previously, SQL Server is working great

‣ Analytics on Big Data also requires Big Data Analytics tools

‣ Aster, Tableau, PowerPivot, SAS

Wrap-up
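
To put the 6 TB per client per year figure in daily terms, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even spread across 365 days; neither assumption comes from the slides, and real volumes are likely burstier.

```python
TB = 10**12  # decimal terabyte (assumption; the slide does not say TB vs. TiB)
GB = 10**9


def daily_ingest_gb(tb_per_year, days=365):
    """Average daily volume in GB for a given yearly volume in TB."""
    return tb_per_year * TB / days / GB


if __name__ == "__main__":
    print(f"{daily_ingest_gb(6):.1f} GB/day per client")
```

Under those assumptions the nightly load has to absorb roughly 16 GB per client.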