SQL Server Live! Orlando 2012
SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 1
Big Data and NoSQL in Microsoft-Land
Andrew Brust and Lynn Langit
Blue Badge Insights & Data Wrangler
Level: Intermediate
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
Andrew’s New Blog (bit.ly/bigondata)
Meet Lynn
• CEO and Founder, Lynn Langit consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years – 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles
– SQL Azure
– Hadoop on Azure
– MongoDB on Azure
• www.LynnLangit.com
• @LynnLangit
Lynn’s YouTube Channel
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe, Teach a Kid (Ages 10++)
• Java or Microsoft SmallBasic
Read all about it!
Agenda
• Overview / Landscape
– Big Data and Hadoop
– NoSQL
– The Big Data-NoSQL Intersection
• Drilldown on Big Data
• Drilldown on NoSQL
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems
BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
– Number of transactions
– Number of behaviors (collected every minute)
BigData = ‘Next State’ Questions
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
What’s MapReduce?
• “Big” input data as key-value pair series
• Partition the data and send to mappers (nodes in cluster)
• Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
• Reducer aggregates; one output per key, with value
• Map and Reduce code natively written as Java functions
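The steps above can be sketched in a few lines of Python. This is only an in-memory illustration of the map, shuffle, and reduce phases, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) are made up for this sketch, and a real cluster would run many mappers and reducers in parallel on separate nodes.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate all values for one key into a single output pair."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce sequentially over a list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data", "big cluster"]))
```

Each reducer call sees every value for its key and nothing else, which is why the reduce step can be spread across nodes.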
MapReduce, in a Diagram
[Diagram: input splits feed parallel mappers; mapper output is grouped by key (K1, K2, K3) and routed to reducers, each of which produces output]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Collect the tallies
• Merge tallies into one spreadsheet
What’s a Distributed File System?
• One where data gets distributed over commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
– “Shared Nothing”
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”
• Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks
Example Comparison: RDBMS vs. Hadoop
Criterion | Traditional RDBMS | Hadoop / MapReduce
Data Size | Gigabytes (Terabytes) | Petabytes (Exabytes)
Access | Interactive and Batch | Batch – NOT Interactive
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Integrity | High (ACID) | Low
Scaling | Nonlinear | Linear
Query Response Time | Can be near immediate | Has latency (due to batch processing)
Just-in-time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context specific
– If scanning a book, are the values words, lines, or pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
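As a rough illustration of imposing schema at query time, the Python sketch below reads the same raw record two ways. The record layout, the `|` delimiter, and both function names are assumptions invented for this example; nothing here is Pig, Hive, or Hadoop API.

```python
# One raw, schema-less record as it might sit in a flat file (hypothetical layout).
RAW = "2012-12-10 09:15:00|1 Main St, Orlando, FL 32819"


def as_two_fields(record):
    """Coarse schema: date and time are one field, the address is one value."""
    timestamp, address = record.split("|")
    return {"timestamp": timestamp, "address": address}


def as_many_fields(record):
    """Fine schema: date and time are split, as are street, city, state, zip."""
    timestamp, address = record.split("|")
    date, time = timestamp.split(" ")
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split(" ")
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
    }


if __name__ == "__main__":
    print(as_two_fields(RAW))
    print(as_many_fields(RAW))
```

The stored data never changes; only the reading code decides how many fields it contains.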
What’s HBase?
• A Wide-Column Store NoSQL database
• Modeled after Google BigTable
• Uses HDFS
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
NoSQL Confusion
• Many ‘flavors’ of NoSQL data stores
• Easiest to group by functionality, but…
– Dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors
– Type of data
– Quantity of data
– Knowledge of technical staff
– Product maturity
– Tooling
So much wrong information
• Everything is ‘new’
• People are religious about data storage
• Lots of incorrect information
• ‘Try’ before you ‘buy’ (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
Common NoSQL Misconceptions
Problems
• Everything is ‘new’
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL
Solutions
• ‘Try’ before you ‘buy’ (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution
NoSQL + Big Data
• HBase and Cassandra work with Hadoop, are NoSQL databases
• MongoDB brands itself a Big Data technology
• Couchbase does too
• Just-in-time schema
• MapReduce in MongoDB, others
• Hadoop and most NoSQL DBs are partitioned, scale-out technologies
• It’s all about analytics on semi- or un-structured data
DRILLDOWN ON BIG DATA
The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
What’s Hive?
• Began as Hadoop sub-project
– Now top-level Apache project
• Provides a SQL-like (“HiveQL”) abstraction over MapReduce
• Has its own HDFS table file format (and it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data
Hadoop Distributions
• Cloudera
• Hortonworks
– HCatalog: Hive/Pig/MR Interop
• MapR
– Network File System replaces HDFS
• IBM InfoSphere BigInsights
– HDFS<->DB2 integration
• And now Microsoft…
Microsoft HDInsight
• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel Add-In that uses it
• JavaScript MapReduce framework
• Contribute it all back to open source Apache Project
Amenities for Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#: HadoopJob, MapperBase, ReducerBase
Some ways to work
• Microsoft HDInsight
– Cloud: go to www.hadooponazure.com, request invite
– Local: Download Microsoft HDInsight
– Runs on just about anything, including Windows XP
– Get it via the Web Platform installer (WebPI)
– Both are free for now; Azure HDInsight will be fee-based when RTM
• Amazon Web Services Elastic MapReduce
– Create AWS account
– Select Elastic MapReduce in Dashboard
– Cheap for experimenting, but not free
• Cloudera CDH VM image
– Download as .tar.gz file
– “Un-tar” (can use WinRAR, 7zip)
– Run via VMWare Player or Virtual Box
– Everything’s free
Some ways to work
HDInsight EMR CDH 4
Microsoft HDInsight
• Much simpler than the others
• Browser-based portal
– Launch MapReduce jobs
– Azure: Provisioning cluster, managing ports, gather external data
• Interactive JavaScript & Hive console
– JS: HDFS, Pig, light data visualization
– Hive commands and metadata discovery
– New console coming
• Desktop Shortcuts:
– Command window, MapReduce, Name Node status in browser
– Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts
Windows Azure HDInsight
Amazon Elastic MapReduce
• Lots of steps!
• At a high level:
– Set up AWS account and S3 “buckets”
– Generate Key Pair and PEM file
– Install Ruby and EMR Command Line Interface
– Provision the cluster using CLI (a batch file can work very well here)
– Set up and run SSH/PuTTY
– Work interactively at command line
Amazon EMR – Prep Steps
• Create an AWS account
• Create an S3 bucket for log storage
– with list permissions for authenticated users
• Create a Key Pair and save PEM file
• Install Ruby
• Install Amazon Web Services Elastic MapReduce Command Line Interface – aka AWS EMR CLI
• Create credentials.json in EMR CLI folder
– Associate with same region as where key pair created
Amazon – Security and Startup
• Security
– Download PuTTYgen and run it
– Click Load and browse to PEM file
– Save it in PPK format
– Exit PuTTYgen
• In a command window, navigate to EMR CLI folder and enter command:
– ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large]
• In AWS Console, go to EC2 Dashboard and click Instances on left nav bar
• Wait until instance is running and get its Public DNS name
– Use Compatibility View in IE or copy may not work
Connect!
• Download and run PuTTY
• Paste DNS name of EC2 instance into hostname field
• In Treeview, drill down and navigate to Connection\SSH\Auth, browse to PPK file
• Once EC2 instance(s) running, click Open
• Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert
• When prompted for user name, type “hadoop” and hit Enter
• cd bin, then hive, pig, hbase shell
• Right-click to paste from clipboard; option to go full-screen
• (Kill EC2 instance(s) from Dashboard when done)
Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
• Get it for free, in VMWare and Virtual Box versions.
– VMWare player and Virtual Box are free too
• Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP.
• Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to:
– http://192.168.1.59:8888
• Can also use browser in VM and hit:
– http://localhost:8888
• Work in “Hue”…
Hue
• Browser-based UI, with front ends for:
– HDFS (w/ upload & download)
– MapReduce job creation and monitoring
– Hive (“Beeswax”)
• And in-browser command line shells for:
– HBase
– Pig (“Grunt”)
Impala: What it Is
• Distributed SQL query engine over Hadoop cluster
• Announced at Strata/Hadoop World in NYC on October 24th
• In Beta, as part of CDH 4.1
• Works with HDFS and Hive data
• Compatible with HiveQL and Hive drivers
– Query with Beeswax
Impala: What it’s Not
• Impala is not Hive
– Hive converts HiveQL to Java MapReduce code and executes it in batch mode
– Impala executes query interactively over the data
– Brings BI tools and Hadoop closer together
• Impala is not an Apache Software Foundation project
– It is open source and Apache-licensed, but it’s still incubated by Cloudera
– Only in CDH
Cloudera CDH4
Hadoop commands
• HDFS
– hadoop fs -<filecommand>
– Create and remove directories: mkdir, rm, rmr
– Upload and download files to/from HDFS: put, get
– View directory contents: ls, lsr
– Copy, move, view files: cp, mv, cat
• MapReduce
– Run a Java jar-file based job: hadoop jar jarname params
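For scripting these commands, a small helper that assembles the argument list can be handy. The sketch below only builds and prints command lines; it does not run anything. The helper name `hadoop_fs` and the sample paths are invented for illustration. The one Hadoop-specific fact it encodes is that each file command is passed to `hadoop fs` with a leading dash (for example `-ls`).

```python
import shlex


def hadoop_fs(command, *args):
    """Build the argument list for `hadoop fs -<command> <args...>`."""
    return ["hadoop", "fs", f"-{command}", *args]


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the command shape.
    for argv in (
        hadoop_fs("mkdir", "/user/demo/input"),
        hadoop_fs("put", "weblog.txt", "/user/demo/input"),
        hadoop_fs("ls", "/user/demo/input"),
    ):
        print(shlex.join(argv))
```

The resulting lists could be handed to `subprocess.run` on a machine where the `hadoop` executable is on the path.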
Hadoop (directly)
HBase
• Concepts:
– Tables, column families
– Columns, rows
– Keys, values
• Commands:
– Definition: create, alter, drop, truncate
– Manipulation: get, put, delete, deleteall, scan
– Discovery: list, exists, describe, count
– Enablement: disable, enable
– Utilities: version, status, shutdown, exit
– Reference: http://wiki.apache.org/hadoop/Hbase/Shell
• Moreover,
– Interesting HBase work can be done in MapReduce, Pig
HBase Examples
• create 't1', 'f1', 'f2', 'f3'
• describe 't1'
• alter 't1', {NAME => 'f1', VERSIONS => 5}
• put 't1', 'r1', 'f1:c1', 'value'
• get 't1', 'r1'
• count 't1'
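To make the concepts behind these shell commands concrete, here is a toy in-memory model in Python. It is not an HBase client: the `ToyTable` class and its methods are invented for this sketch. It shows that a table declares column families up front, that a column is addressed as `family:qualifier`, and that each cell keeps a bounded number of versions.

```python
class ToyTable:
    """Toy model of a wide-column table: row key -> column -> value versions."""

    def __init__(self, *families, versions=1):
        self.families = set(families)  # column families are fixed at creation
        self.versions = versions       # how many versions each cell retains
        self.rows = {}

    def put(self, row, column, value):
        """Store a value; the column must be named 'family:qualifier'."""
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        history = self.rows.setdefault(row, {}).setdefault(column, [])
        history.insert(0, value)       # newest version first
        del history[self.versions:]    # drop versions beyond the limit

    def get(self, row):
        """Return the newest value of every column in the row."""
        return {col: hist[0] for col, hist in self.rows.get(row, {}).items()}

    def count(self):
        """Number of rows in the table."""
        return len(self.rows)


if __name__ == "__main__":
    t1 = ToyTable("f1", "f2", "f3", versions=5)
    t1.put("r1", "f1:c1", "value")
    print(t1.get("r1"), t1.count())
```

Qualifiers such as `c1` need no declaration, which is what lets rows in the same table carry different, sparse sets of columns.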
HBase
Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use languages other than Java to write MapReduce code
– Python is popular option
– Any executable works, even C# console apps
– On MS HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name and params) or use GUI
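A Streaming job exchanges plain text lines over stdin and stdout, conventionally as tab-separated key and value, and the reducer receives its input sorted by key. The Python sketch below follows that convention with two functions that take and return lines so they can be tried without a cluster. The function names and the log-line format are assumptions for this example; in a real job each function would sit in its own script reading `sys.stdin`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'LEVEL<TAB>1' for the first token of each log line."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield f"{tokens[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    log = ["ERROR disk full", "INFO started", "ERROR timeout"]
    # sorted() stands in for the framework's sort between map and reduce.
    for out in reduce_lines(sorted(map_lines(log))):
        print(out)
```

Because the reducer relies only on sorted input, it never needs to hold more than one key's running total in memory.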
Running MapReduce Jobs
Hive
• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop
– Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of which becomes result set
• Microsoft has Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
Hive, Continued
• Load data from flat HDFS files
– LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
• SQL Queries
– CREATE, ALTER, DROP
– INSERT OVERWRITE (creates whole tables)
– SELECT, JOIN, WHERE, GROUP BY
– SORT BY, but ordering data is tricky!
– MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
Excel Add-In for Hive
Hive
Pig
• Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
– Do a combo of Query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
– As with Hive, a MapReduce job is generated
– Unlike Hive, output is only flat file to HDFS or text at command line console
– With MS Hadoop, can easily convert to JavaScript array, then manipulate
• Use command line (“Grunt”) or build scripts
Example
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
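For readers more at home in procedural code, the intent of this data flow (keep rows where x > 0, group them by x, count the rows in each group) can be approximated in plain Python. This is a single-machine sketch with an invented function name and sample rows, not what Pig generates; Pig would compile the script into a MapReduce job.

```python
from collections import Counter


def pig_like_flow(rows):
    """Filter rows on x > 0, then count the rows for each distinct x."""
    filtered = [row for row in rows if row[0] > 0]  # FILTER ... BY x > 0
    counts = Counter(row[0] for row in filtered)    # GROUP ... BY x, then COUNT
    return sorted(counts.items())                   # one (x, count) pair per group


if __name__ == "__main__":
    sample = [(1, "a", "b"), (1, "c", "d"), (-2, "e", "f"), (3, "g", "h")]
    print(pig_like_flow(sample))
```

Each Pig alias corresponds to one intermediate result here, which is the sense in which Pig Latin describes a data flow and not a single query.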
Pig Latin Examples
• Imperative, file system commands
– LOAD, STORE
– Schema specified on LOAD
• Declarative, query commands (SQL-like)
– xxx = file or data set
– FOREACH xxx GENERATE (SELECT…FROM xxx)
– JOIN (WHERE/INNER JOIN)
– FILTER xxx BY (WHERE)
– ORDER xxx BY (ORDER BY)
– GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
– DISTINCT (SELECT DISTINCT)
• Syntax is assignment statement-based:
– MyCusts = FILTER Custs BY SalesPerson == 15;
• Access HBase
– CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
Pig
Sqoop
sqoop import \
--connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
--table <from_table> \
--target-dir <to_hdfs_folder> \
--split-by <from_table_column>
Sqoop
sqoop export \
--connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
--table <to_table> \
--export-dir <from_hdfs_folder> \
--input-fields-terminated-by "<delimiter>"
Flume NG
• Sources
– Avro (data serialization system that can read JSON-encoded data files and can work over RPC)
– Exec (reads from stdout of long-running process)
• Sinks
– HDFS, HBase, Avro
• Channels
– Memory, JDBC, file
Flume NG (next generation)
• Set up conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
• From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
• Recommendation
– Your info + community info
– Give users/items/ratings; get user-user/item-item
– itemsimilarity
• Classification/Categorization
– Drop into buckets
– Naïve Bayes, Complementary Naïve Bayes, Decision Forests
• Clustering
– Like classification, but with categories unknown
– K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
Workflow, Syntax
• Workflow
– Run the job
– Dump the output
– Visualize, predict
• mahout algorithm
--input folderspec
--output folderspec
--param1 value1
--param2 value2
…
• Example:
– mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data scientists
• You need a staff or a product to visualize, or make into a usable prediction model
• Investigate Predixion Software
– CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
– Excel add-in can use Mahout remotely, visualize its output, run predictive analyses
– Also integrates with SQL Server, Greenplum, MapReduce
– http://www.predixionsoftware.com
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop + BI tools’ coexistence
• Will it stay this way?
DRILLDOWN ON NOSQL
Hitting (Relational) Walls
• CA
– Highly-available consistency
• CP
– Enforced consistency
• AP
– Eventual consistency
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted
So many NoSQL options
• More than just the Elephant in the room
• More than 120 types of NoSQL databases
Flavors of NoSQL
Graph Database
Use for data with
– a lot of many-to-many relationships
– recursive self-joins
– when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
– Examples: Neo4J, FreeBase (Google)
Column Database
• Wide, sparse column sets
• Schema-light
• Examples:
– Cassandra
– HBase
– BigTable
– GAE HR DS
More about Column Databases
• Type A
– Column-families
– Non-relational
– Sparse
– Examples: HBase, Cassandra, xVelocity (SQL 2012 BISM)
• Type B
– Column-stores
– Relational
– Dense
– Example: SQL Server 2012 Columnstore index
Demo - Document Database (MongoDB)
• Use for data that is
– document-oriented (collection of JSON documents) w/semi-structured data
– Encodings include XML, YAML, JSON & BSON
– binary forms (PDF, Microsoft Office documents such as Word, Excel…)
• Examples: MongoDB, CouchDB
Demo MongoDB
Persistent Key / Value Database
• Schema-less
• State - Persistent
• Examples
– AWS DynamoDB
– Azure Tables
– Project Voldemort
Volatile Key / Value Database
• Schema-less
• State - Volatile
• Examples
– Redis
– Memcached
Which type of NoSQL for which type of data?
Type of Data | Type of NoSQL solution | Example
Log files | Wide Column | HBase
Product Catalogs | Key Value on disk | DynamoDB
User profiles | Key Value in memory | Redis
Startups | Document | MongoDB
Social media connections | Graph | Neo4j
LOB w/Transactions | NONE! Use RDBMS | SQL Server
What about the cloud?
Cloud-hosted NoSQL up to 50x CHEAPER
Consumer Storage Buckets
• Dropbox
• Box
• Windows SkyDrive
• Google Drive
• Amazon Cloud Drive
• Apple iCloud
Developer BLOB Storage Buckets
• Amazon – S3 or Glacier
• Google – Cloud Storage
• Microsoft Azure BLOBS
• Others
Cloud-hosted RDBMS
• AWS RDS – SQL Server, MySQL, Oracle
– Medium cost
– Solid feature set, i.e. backup, snapshot
– Use existing tooling
• Google – MySQL
– Lowest cost
– Most limited RDBMS functionality
• Microsoft – Windows Azure SQL Database
– Highest cost
– Azure VMs w/MySQL
Other cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Marketplace, InfoChimps, DataMarket.com
Cloud – RDBMS, NoSQL & Hadoop
Category | AWS | Google | Microsoft
Cloud RDBMS | SQL Server, Oracle / MySQL | MySQL | SQL Azure
NoSQL buckets | S3 or Glacier | Cloud Storage | Azure Storage
NoSQL databases | DynamoDB | H/R Datastore on GAE | Azure Tables
Streaming / Machine Learning | Custom EC2 | Prospective Search & Prediction API | StreamInsight & Mahout with Hadoop
Document or Graph | MongoDB on EC2 | Freebase (g) | MongoDB on Windows Azure
Hadoop | Elastic MapReduce using S3 & EC2 | MapR & GCE | Windows Azure HDInsight
Data sets & other | Karmasphere | Translation API, Full-text search | Azure DataMarket
Demo Amazon RDS
Pick your mix and then…
NoSQL
• Host locally
• Host in the Cloud
RDBMS
• Host locally
• Host in the Cloud
Other Services
• Use Cloud Data Markets
• Use Cloud ETL
What about me?
Common DBA Tasks in NoSQL
RDBMS | NoSQL
Import Data | Import Data
Setup Security | Setup Security
Perform a Backup | Make a copy of the data
Restore a Database | Move a copy to a location
Create an Index | Create an Index
Join Tables Together | Run MapReduce
Schedule a Job | Schedule a (Cron) Job
Run Database Maintenance | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Search BOL | Interpret Documentation
Making Sense – Asking Questions
Data Scientists…
Comparing…
Karmasphere Studio for AWS
Google BigQuery w/Excel
• Dremel-based service
– For massive amounts of data
– BigQuery currently has quota limits
– SQL-like query language
Demo Google Big Query
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs designate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
The Changing Data Landscape
NoSQL
RDBMS
Other Services
NoSQL for .NET Developers
• RavenDB
• MongoDB C#/.NET Driver
• MongoDB on Windows Azure
• CouchBase .NET Client Library
• Riak client for .NET
• AWS Toolkit for Visual Studio
• Google cloud APIs (REST-based)