DESCRIPTION
My updated Big Data with SQL Server presentation from my Razorfish case study. Presented Nov 17 @ Philly Code Camp.
Big Data with SQL Server
Philly Code Camp, November 2012
Mark Kromer
BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
‣What is Big Data?
‣The Big Data and Apache Hadoop environment
‣Big Data Analytics
‣SQL Server in the Big Data world
‣How we utilize Big Data @ Razorfish
What we’ll (try) to cover today
Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams
‣ Variety – Structured, unstructured, semi-structured, and a mix of all of the above
‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // Field positions in the whitespace-delimited W3C log line;
    // the values here are illustrative assumptions for this example
    private const int expected = 10;   // number of fields in a complete record
    private const int pagePos = 4;     // index of the page/URI field
    private const string hit = "1";    // each record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}
MapReduce Framework (Map)
using System.Collections.Generic;
using System.Linq;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the per-page hit counts emitted by the mapper (and any combiner passes)
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
MapReduce Framework (Reduce & Job)
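As a sketch of what the job above computes, the same map/shuffle/reduce flow can be simulated in plain Python. This is illustrative only — the field count, field positions, and input lines are assumptions, and nothing here touches the actual Hadoop runtime:

```python
import re
from collections import defaultdict

# Assumed layout: whitespace-delimited W3C-style log lines with
# EXPECTED fields per complete record and the page URI at PAGE_POS.
EXPECTED = 3
PAGE_POS = 1

def map_line(line):
    """Mapper: emit (page, 1) for each complete record, skip malformed ones."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED:  # only take records with all values
        return []
    return [(parts[PAGE_POS], 1)]

def reduce_hits(pairs):
    """Shuffle + reduce: sum hit counts per page key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /home 200",
    "10.0.0.1 /products 404",
    "malformed-line",  # dropped by the mapper
]
pairs = [kv for line in lines for kv in map_line(line)]
print(reduce_hits(pairs))  # {'/home': 2, '/products': 1}
```

In the real job, Hadoop performs the shuffle between the mapper and the reducer/combiner; the local dictionary here just stands in for that grouping step.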
‣ Big Data ≠ NoSQL
‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
‣ Use in-memory analytics for real time insights
‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and insights
‣ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark’s Big Data Myths
‣ Web Analytics
‣ Big Data Analytics
‣ Digital Marketing – Ad Server Analytics
‣ Multiple TBs of online data per client per year
‣ Elastic Web-scale MapReduce & Hadoop
‣ Increase ROI of digital marketing campaigns
Razorfish & Big Data
Big Data Analytics Web Platform
[Platform diagram: Data Sources → Data Mastering (data mapping, business rules, external & extended data) → Data Warehouse & Analytics (media-level data warehouse, audience-level data warehouse, Big Data sandboxes, MapReduce jobs) → Presentation (attribution, segmentation, stacking effect, … via Tableau, SAS & Excel)]
In-Database Analytics (Teradata Aster)
Prepackaged Analytics Functions (including Attribution)
• Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.
SQL Server Big Data – Data Loading
[Diagram: data loading from SQL Server into an Amazon S3 bucket and on to Amazon HDFS & EMR]
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop: Data transfer to & from Hadoop & SQL Server
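The `-m 1` flag above runs the import with a single mapper. With more mappers, Sqoop queries the min/max of the split-by column (the primary key by default) and assigns each mapper a contiguous key range. A rough Python sketch of that splitting logic, with illustrative bounds rather than a real table:

```python
# Sketch of how Sqoop divides an import across mappers (-m N):
# each mapper gets a contiguous, roughly equal range of the split-by column.
def split_ranges(lo, hi, num_mappers):
    """Return one (start, end) inclusive range per mapper covering lo..hi."""
    size = (hi - lo + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        start = int(lo + i * size)
        end = int(lo + (i + 1) * size) - 1 if i < num_mappers - 1 else hi
        ranges.append((start, end))
    return ranges

# Hypothetical customers.id spanning 1..100 with -m 4:
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each range becomes a `WHERE id >= start AND id <= end` predicate on a mapper's query, which is why an evenly distributed split column matters for balanced parallel loads.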
‣ SQL Server Database
‣ SQL Server 2008 R2 or 2012 Enterprise Edition
‣ Page Compression
‣ 2012 Columnar Compression on Fact Tables
‣ Clustered Index on all tables
‣ Auto-update statistics asynchronously
‣ Partition Fact Tables by month and archive data with sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
‣ SSAS 2008 R2 or 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with DW
‣ 2012 cubes are in-memory tabular models
‣ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
‣Columnstore
‣Sqoop adapter
‣PolyBase
‣Hive
‣In-memory analytics
‣Scale-out MPP
SQL Server Big Data Analytics Features
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why did we take this Big Data Analytics approach?
‣ Each Web client produces an average of 6 TBs of ICA data in a year
‣ The source data is variable and unstructured
‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data
‣ With the configs mentioned previously, SQL Server is working great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS
Wrap-up