PSSUG Nov 2012: Big Data with SQL Server


Philly SQL Server User Group, November 2012

Mark Kromer, Razorfish BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata, @mssqldude


‣What is Big Data?

‣The Big Data and Apache Hadoop environment

‣Big Data Analytics

‣SQL Server in the Big Data world

‣How we utilize Big Data @ Razorfish

What we’ll (try to) cover tonight


Big Data 101

‣ 3 V’s

‣ Volume – Terabyte records, transactions, tables, files

‣ Velocity – Batch, near-real-time, real-time (analytics), streams

‣ Variety – Structured, unstructured, semi-structured, and a mix of all of the above

‣ Text Processing

‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights

‣ Distributed File System & Programming

‣ Big Data ≠ NoSQL

‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.), but it is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack

‣ Big Data ≠ Real Time

‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse

‣ I still refer to large multi-TB DWs as “VLDB”

‣ Big Data is about crunching stats in text files for discovery of new patterns and insights

‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Mark’s Big Data Myths

‣ Batch Processing

‣ Commodity Hardware

‣ Data Locality, no shared storage

‣ Scales linearly

‣ Great for large text file processing, not so great on small files

‣ Distributed programming paradigm
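
To make the distributed programming paradigm above concrete, here is a minimal, single-process sketch of the MapReduce pattern (map, shuffle, reduce) as a word count in Python. It is illustrative only: the function names are hypothetical, nothing here calls Hadoop, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data with sql server", "big data analytics"]
    print(word_count(sample))
```

Because each mapper sees only its own lines and each reducer sees only one key, both steps can be spread across many machines, which is what lets this style of batch job scale out on commodity hardware.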

Big Data Analytics Web Platform

[Diagram: platform layers: Data Sources, Data Mastering, Data Warehouse & Analytics, Presentation]

[Diagram labels: Attribution, Segmentation, Stacking Effect, …; Media Level Data Warehouse; Audience Level Data Warehouse; Big Data Sandboxes; Data Mapping; Business Rules; External & Extended Data; Tableau, SAS & Excel; MapReduce Jobs]

In-Database Analytics (Teradata Aster)

Prepackaged Analytics Functions (including Attribution)

‣ Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.

SQL Server Big Data – Data Loading

[Diagram labels: Amazon S3 Bucket; Amazon HDFS & EMR; Data Loading]

‣ SQL Server Database

‣ SQL Server 2008 R2 or 2012 Enterprise Edition

‣ Page Compression

‣ 2012 columnstore (columnar) compression on fact tables

‣ Clustered Index on all tables

‣ Auto-update statistics asynchronously

‣ Partition Fact Tables by month and archive data with sliding window technique

‣ Drop all indexes before nightly ETL load jobs

‣ Rebuild all indexes when ETL completes

‣ SQL Server Analysis Services

‣ SSAS 2008 R2 or 2012 Enterprise Edition

‣ 2008 R2 OLAP cubes partition-aligned with DW

‣ 2012 cubes are in-memory tabular models

‣ All access through MSMDPUMP or SharePoint
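
The monthly sliding-window partitioning mentioned in the list above can be sketched as follows. This is a hedged illustration, not the configuration used in the talk: it assumes a RANGE RIGHT partition function on a date column, and the names `pf_FactByMonth` and `ps_FactByMonth`, the `[PRIMARY]` filegroup, and the 12-month retention are all hypothetical. The script only computes boundaries and prints the T-SQL it would run; the archive (switch-out) step is left as a comment.

```python
from datetime import date


def add_months(d, n):
    """Return the first day of the month n months after (or before) d."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)


def sliding_window_plan(current_month, months_to_keep):
    """Return (boundary_to_add, boundary_to_retire) for one monthly slide.

    boundary_to_add opens a partition for next month; boundary_to_retire
    is the start of the oldest month that falls out of the window.
    """
    new_boundary = add_months(current_month, 1)
    old_boundary = add_months(current_month, -months_to_keep)
    return new_boundary, old_boundary


def sliding_window_sql(current_month, months_to_keep,
                       pf="pf_FactByMonth", ps="ps_FactByMonth"):
    """Build the T-SQL statements for one slide (object names are hypothetical)."""
    new_b, old_b = sliding_window_plan(current_month, months_to_keep)
    return [
        f"ALTER PARTITION SCHEME {ps} NEXT USED [PRIMARY];",
        f"ALTER PARTITION FUNCTION {pf}() SPLIT RANGE ('{new_b.isoformat()}');",
        f"-- archive (switch out) the partition starting {old_b.isoformat()} before merging",
        f"ALTER PARTITION FUNCTION {pf}() MERGE RANGE ('{old_b.isoformat()}');",
    ]


if __name__ == "__main__":
    for statement in sliding_window_sql(date(2012, 11, 1), months_to_keep=12):
        print(statement)
```

Splitting ahead of the load and merging only after the oldest partition has been emptied keeps both operations metadata-only, which is the main reason to use a sliding window instead of deleting rows.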

SQL Server Big Data Environment

‣ What is a Big Data approach to Analytics?

‣ Massive scale

‣ Data discovery & research

‣ Self-service

‣ Reporting & BI

‣ Why did we take this Big Data Analytics approach?

‣ Each Web client produces an average of 6 TB of ICA data per year

‣ The data in the sources are variable and unstructured

‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity

‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data

‣ With the configs mentioned previously, SQL Server is working great

‣ Analytics on Big Data also requires Big Data Analytics tools

‣ Aster, Tableau, PowerPivot, SAS

Wrap-up