Big Data with SQL Server


My updated Big Data with SQL Server presentation from my Razorfish case study. Presented Nov 17 @ Philly Code Camp.


Big Data with SQL Server

Philly Code Camp
November 2012

Mark Kromer
BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude

‣ What is Big Data?

‣ The Big Data and Apache Hadoop environment

‣ Big Data Analytics

‣ SQL Server in the Big Data world

‣ How we utilize Big Data @ Razorfish

What we’ll (try to) cover today


Big Data 101

‣ 3 V’s

‣ Volume – Terabyte records, transactions, tables, files

‣ Velocity – Batch, near-time, real-time (analytics), streams.

‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix

‣ Text Processing

‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights

‣ Distributed File System & Programming

‣ Batch Processing

‣ Commodity Hardware

‣ Data Locality, no shared storage

‣ Scales linearly

‣ Great for large text file processing, not so great on small files

‣ Distributed programming paradigm

using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // Field layout of the input log line; these values are assumptions
    // that depend on the actual W3C log format being processed.
    private const int expected = 14;   // field count of a complete record
    private const int pagePos = 4;     // index of the page (cs-uri-stem) field
    private const string hit = "1";    // each complete record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit (page, 1); the reducer/combiner totals the hits per page
        context.EmitKeyValue(parts[pagePos], hit);
    }
}

MapReduce Framework (Map)

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum all of the per-page hit counts emitted by the mapper
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

using System;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations come from environment variables
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true; // overwrite results from prior runs
        return retVal;
    }
}

MapReduce Framework (Reduce & Job)
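
To actually submit the job, the .NET SDK for Hadoop exposes a cluster connection object. A minimal sketch, assuming the local-cluster Connect() overload (overloads and namespaces varied across early SDK releases):

using Microsoft.Hadoop.MapReduce;

public class Program
{
    public static void Main(string[] args)
    {
        // Connect to the default local cluster and run the job end to end;
        // the W3C_INPUT and W3C_OUTPUT environment variables must be set first.
        var hadoop = Hadoop.Connect();
        var result = hadoop.MapReduceJob.ExecuteJob<TotalHitsJob>();
    }
}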

‣ Big Data ≠ NoSQL

‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack

‣ Big Data ≠ Real Time

‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse

‣ I still refer to large multi-TB DWs as “VLDB”

‣ Big Data is about crunching stats in text files for discovery of new patterns and insights

‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Mark’s Big Data Myths

‣ Web Analytics

‣ Big Data Analytics

‣ Digital Marketing – Ad Server Analytics

‣ Multiple TBs of online data per client per year

‣ Elastic Web-scale MapReduce & Hadoop

‣ Increase ROI of digital marketing campaigns

Razorfish & Big Data

Big Data Analytics Web Platform

[Platform diagram: Data Sources → Data Mastering → Data Warehouse & Analytics → Presentation. Components: Attribution, Segmentation, Stacking Effect; Media Level Data Warehouse; Audience Level Data Warehouse; Big Data Sandboxes; Data Mapping; Business Rules; External & Extended Data; MapReduce Jobs; Tableau, SAS & Excel]

In-Database Analytics (Teradata Aster)

Prepackaged Analytics Functions (including Attribution)

‣ Because of built-in analytics functions and big data performance, Aster becomes the data scientist’s sandbox and BI’s big data analytics processor.

SQL Server Big Data – Data Loading

[Diagram: data loading between an Amazon S3 bucket and Amazon HDFS & Elastic MapReduce (EMR)]

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000

‣ > 5,Bob Smith

‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Sqoop: Data transfer to & from Hadoop & SQL Server

‣ SQL Server Database

‣ SQL Server 2008 R2 or 2012 Enterprise Edition

‣ Page Compression

‣ 2012 Columnar Compression on Fact Tables

‣ Clustered Index on all tables

‣ Auto-update statistics asynchronously

‣ Partition Fact Tables by month and archive data with sliding window technique

‣ Drop all indexes before nightly ETL load jobs

‣ Rebuild all indexes when ETL completes (see the sketch after this slide)

‣ SQL Server Analysis Services

‣ SSAS 2008 R2 or 2012 Enterprise Edition

‣ 2008 R2 OLAP cubes partition-aligned with DW

‣ 2012 cubes: in-memory tabular models

‣ All access through MSMDPUMP or SharePoint

SQL Server Big Data Environment
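
One way to automate the drop/rebuild pattern above from the ETL driver; a minimal sketch using System.Data.SqlClient, where the connection string, index name, and the dbo.FactClicks table are hypothetical stand-ins:

using System;
using System.Data.SqlClient;

public static class EtlIndexMaintenance
{
    // Hypothetical connection string; real deployments would read this from config
    private const string ConnStr =
        "Server=localhost;Database=MediaDW;Integrated Security=true";

    public static void RunNightlyLoad(Action loadFactTables)
    {
        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();
            // Disable the nonclustered indexes before the nightly ETL load...
            Exec(conn, "ALTER INDEX IX_FactClicks_DateKey ON dbo.FactClicks DISABLE");
            loadFactTables();
            // ...then rebuild all indexes once the load completes
            Exec(conn, "ALTER INDEX ALL ON dbo.FactClicks REBUILD");
        }
    }

    private static void Exec(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
        {
            cmd.ExecuteNonQuery();
        }
    }
}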

‣ Columnstore

‣ Sqoop adapter

‣ PolyBase

‣ Hive

‣ In-memory analytics

‣ Scale-out MPP

SQL Server Big Data Analytics Features
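
As an illustration of the columnstore feature called out above, SQL Server 2012 supports nonclustered columnstore indexes over fact tables; a minimal sketch reusing the hypothetical dbo.FactClicks table and SqlClient helper pattern from the previous slide:

using System.Data.SqlClient;

public static class ColumnstoreSetup
{
    // Hypothetical fact table and columns; a 2012 columnstore index is
    // nonclustered and makes the table read-only until it is dropped or disabled.
    private const string CreateIndexSql = @"
        CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactClicks
        ON dbo.FactClicks (DateKey, PageKey, CampaignKey, Hits)";

    public static void Create(string connStr)
    {
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (var cmd = new SqlCommand(CreateIndexSql, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}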

‣ What is a Big Data approach to Analytics?

‣ Massive scale

‣ Data discovery & research

‣ Self-service

‣ Reporting & BI

‣ Why did we take this Big Data Analytics approach?

‣ Each Web client produces an average of 6 TBs of ICA data in a year

‣ The data in the sources are variable and unstructured

‣ SSIS ETL alone couldn’t keep up with the volume or handle the complexity

‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL Server for Big Data

‣ With the configs mentioned previously, SQL Server is working great

‣ Analytics on Big Data also requires Big Data Analytics tools

‣ Aster, Tableau, PowerPivot, SAS

Wrap-up
