Mark Kromer's presentation on Big Data analytics with Hadoop, Teradata, SQL Server, Tableau, SAS & PowerPivot
Big Data with SQL Server
Philly SQL Server User Group, November 2012
Mark Kromer, Razorfish BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata @mssqldude
What we’ll (try) to cover tonight
‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ How we utilize Big Data @ Razorfish
Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams.
‣ Variety – Structured, unstructured, semi-structured, and any mix of the above
‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
Mark’s Big Data Myths
‣ Big Data ≠ NoSQL
‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
‣ Use in-memory analytics for real-time insights
‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files to discover new patterns and insights
‣ Use the DW to aggregate and store the summaries of those calculations for reporting
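The split between batch crunching of text files and a warehouse that stores only the summaries can be sketched as follows. This is a minimal illustration, not the pipeline used at Razorfish: the tab-separated log layout, the campaign names, and the `daily_event_summary` table are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw log lines: date <TAB> campaign <TAB> event type.
RAW_LINES = [
    "2012-11-01\tcampaign_a\timpression",
    "2012-11-01\tcampaign_a\tclick",
    "2012-11-01\tcampaign_b\timpression",
    "2012-11-02\tcampaign_a\timpression",
    "2012-11-02\tcampaign_a\timpression",
]


def summarize(lines):
    """Batch step: count events per (day, campaign, event type)."""
    counts = defaultdict(int)
    for line in lines:
        day, campaign, event = line.rstrip("\n").split("\t")
        counts[(day, campaign, event)] += 1
    return sorted((d, c, e, n) for (d, c, e), n in counts.items())


def load_summary(conn, rows):
    """Warehouse step: store only the aggregated rows for reporting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_event_summary "
        "(day TEXT, campaign TEXT, event TEXT, event_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_event_summary VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


summary_rows = summarize(RAW_LINES)
conn = sqlite3.connect(":memory:")  # stand-in for the DW (assumption)
load_summary(conn, summary_rows)

# Reporting query runs against the small summary table, not the raw files.
total = conn.execute(
    "SELECT SUM(event_count) FROM daily_event_summary "
    "WHERE campaign = 'campaign_a'"
).fetchone()[0]
print(summary_rows)
print(total)
```

The raw lines never reach the warehouse; only the aggregated rows do, which is what keeps the reporting side small.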
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
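The distributed programming paradigm behind these bullets is MapReduce. A minimal single-process sketch of its three phases, using word count as the example, is shown below. The function names and the sample chunks are illustrative assumptions; a real Hadoop job would run the map and reduce steps on the nodes that hold the data rather than in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a file block stored on a different node.
chunks = ["big data is big", "data locality matters", "big files scale"]

mapped = chain.from_iterable(map_phase(c) for c in chunks)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each mapper only needs its own chunk, the map step can run where the data already lives, which is the point of data locality without shared storage.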
Big Data Analytics Web Platform
[Diagram. Layers: Data Sources; Data Mastering; Data Warehouse & Analytics; Presentation. Components: Attribution, Segmentation, Stacking Effect, …; Media Level Data Warehouse; Audience Level Data Warehouse; Big Data Sandboxes; Data Mapping; Business Rules; External & Extended Data; Tableau, SAS & Excel; MapReduce Jobs]
In-Database Analytics (Teradata Aster)
Prepackaged Analytics Functions (including Attribution)
• Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.
SQL Server Big Data – Data Loading
[Diagram labels: Amazon S3 Bucket; Amazon HDFS & EMR; Data Loading]
SQL Server Big Data Environment
‣ SQL Server Database
‣ SQL Server 2008 R2 or 2012 Enterprise Edition
‣ Page compression
‣ 2012 columnstore (columnar compression) on fact tables
‣ Clustered index on all tables
‣ Asynchronous auto-update of statistics (AUTO_UPDATE_STATISTICS_ASYNC)
‣ Partition fact tables by month and archive data with the sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
‣ SSAS 2008 R2 or 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with the DW
‣ 2012 cubes are in-memory tabular models
‣ All access through MSMDPUMP or SharePoint
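The monthly sliding window mentioned for the fact tables comes down to simple date arithmetic: each month the oldest partition is archived and a new empty one is added at the front. The sketch below only computes the partition boundaries. The 13-month retention period and the function names are assumptions for illustration; in SQL Server the boundaries would typically be applied with partition SWITCH, MERGE RANGE and SPLIT RANGE operations, which are not shown here.

```python
from datetime import date


def add_months(d, n):
    """Return the first day of the month n months after (or before) d."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)


def sliding_window(current_month, months_to_keep):
    """Compute monthly partition boundaries for one window slide.

    Returns (keep, archive, new):
      keep    - boundaries of the partitions that stay online
      archive - boundary of the oldest partition, to be switched out
      new     - boundary of the empty partition to add for next month
    """
    oldest_kept = add_months(current_month, -(months_to_keep - 1))
    keep = [add_months(oldest_kept, i) for i in range(months_to_keep)]
    archive = add_months(oldest_kept, -1)
    new = add_months(current_month, 1)
    return keep, archive, new


# Assumed retention: 13 months online, evaluated for November 2012.
keep, archive, new = sliding_window(date(2012, 11, 1), months_to_keep=13)
print(keep[0], keep[-1], archive, new)
```

Keeping the boundary calculation separate from the DDL makes it easy to check which month will be archived before the nightly job touches the table.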
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why did we take this Big Data Analytics approach?
‣ Each Web client produces an average of 6 TB of ICA data per year
‣ The data in the sources are variable and unstructured
‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data
‣ With the configuration described earlier, SQL Server is performing well
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS
Wrap-up