Technical Architect at Microsoft
Primary focus on data solutions in the cloud
Lara Rubbelke
@sqlgal
www.linkedin.com/in/lararubbelke/
Karen has 20+ years of data and information architecture
experience on large, multi-project programs.
She is a frequent speaker on data modeling, data-driven
methodologies and pattern data models.
She wants you to love your data.
Karen López #TEAMDATA
The only reason for time is so that
everything doesn’t happen at once.
- Albert Einstein*
Session inspired by the book
Seven Databases in Seven Weeks
key concepts for
hybrid database
architectures
database /
datastore types
reasons to go
explore
Outcomes
We want you to leave here understanding:
This
is
NOT…
a deep dive on any technology
a comprehensive list
a roadmap discussion
What We’ll Cover
NoSQL 101
Comparison to relational
Not Only SQL (but really “Not SQL”)
Terminology
Categories
What they are
Why you use them
When you use them
A little of how to use them
CAP
ACID
BASE
SCHEMA
Cloud
Scale
Distributed Systems and the CAP Theorem
Availability
Consistency
Partition Tolerance
Eric Brewer’s
CAP Theorem
and even better
CAP Twelve Years Later
Myth: Eric Brewer On Why Banks Are
BASE Not ACID - Availability Is Revenue
Basically Available
Soft State
Eventually Consistent
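To make the three BASE terms concrete, here is a minimal sketch, assuming two in-memory replicas and a last-write-wins rule. The `Replica` class, the `sync` function, and the key name are hypothetical and only illustrate the idea; real datastores use more involved replication protocols.

```python
class Replica:
    """One copy of the data; keeps the newest write per key (last-write-wins)."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the two replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica(), Replica()
east.write("price:123", 425.99, version=1)   # accepted by one replica only
stale = west.read("price:123")               # soft state: west has not seen it yet
sync(east, west)                             # background replication catches up
fresh = west.read("price:123")
```

Both replicas keep answering reads the whole time (basically available), `west` briefly returns an older answer (soft state), and after `sync` the two copies agree (eventually consistent).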
BASE ACID
Atomic
Consistent
Isolated
Durable
BASE - ACID
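For contrast with BASE, the following sketch shows atomicity and consistency using Python's built-in `sqlite3` module. The `account` table, the `transfer` function, and the balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; roll back both updates on any failure."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 60)       # succeeds
failed = transfer(conn, "a", "b", 60)   # would overdraw "a": CHECK fails, credit is undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

The second transfer credits `b` before the debit on `a` violates the CHECK constraint, yet the final balances show no trace of that credit: the two updates succeed or fail as a unit.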
Polyglot
persistence
• Optimized for data
• Optimized for workload
Not all new
• EAV
• XML
• Architecture paradigm: OLAP/DW
and OLTP
The And
Polyschematic
Multiple schemas over
the same data
Schema on read, not
on write
Data integrity may be
managed elsewhere
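A minimal sketch of schema on read, assuming the raw records are stored as JSON lines. The sample values come from the Person Table slide later in this deck; the two reader functions are hypothetical and show two schemas applied over the same stored data.

```python
import json

# Raw events are stored as written; no schema is enforced on write.
raw_lines = [
    '{"trackingid": 720, "gender": "male", "age": 62}',
    '{"trackingid": 721, "gender": "male", "photo": "image"}',
    '{"trackingid": 723, "video": "stream"}',
]


def read_demographics(line):
    """Schema 1: a demographic view; missing attributes become None."""
    record = json.loads(line)
    return {
        "trackingid": record["trackingid"],
        "gender": record.get("gender"),
        "age": record.get("age"),
    }


def read_media(line):
    """Schema 2: a media view over the same stored data."""
    record = json.loads(line)
    media = [kind for kind in ("photo", "video") if kind in record]
    return {"trackingid": record["trackingid"], "media": media}


demographics = [read_demographics(line) for line in raw_lines]
media = [read_media(line) for line in raw_lines]
```

Nothing validated the records when they were written, so any integrity rule (for example, "every person has a gender") has to be enforced by the readers or elsewhere in the application.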
The Why
* ALL DATA HAS STRUCTURE!
** EMBRACE DENORMALIZATION
Kinect Telemetry Retail Application
Reporting/Analysis: Hadoop Batch Processing
Sensor Data: Column Family
Price Check: Key-Value
Product Catalog: Document Store
Data-Intensive Applications in
the Cloud Computing World
Activity Queue: Azure Storage
Google Analytics Logs: Azure Storage
Email DBs: SQL Azure x 16
Username DBs: SQL Azure x 16
User Profiles: SQL Azure x 400
Activity Table (x 50 partitions): Azure Storage
IIS Logs: Azure Storage
Data Analysis: Staging (Virtual Machine)
Data Warehouse
Reporting Services
Activity Processors: Worker Roles x 2
Cache (Users and Friends Feed, Games and Leader Boards, Resources and References): Distributed Cache x 32
Cache Tasks: Worker Roles x 4
Back Office: Web Roles x 2
Background Tasks DB, Utility DB, Content DB, Taxonomy DB: SQL Azure
Web Application: Web Roles x 180
Web Service/API: Web Roles x 2
Moderation Service/Appliance
CRISP/3rd Party
Database
Key-Value: Sample Use
Table: PriceCompare
LocationID (Partition Key) | ProductBySellerID (Row Key) | ProductProperties (Properties)
123 | 013803204131 | {Seller:"Camera Superstore", Price:425.99, PriceDate:2014-11-06, SellerType:"Online"}
Patterns/What Works:
• Low cost, scalable, highly available and geo-redundant
• Flexible schema
• Fast reads and writes on single key values or partitioned key values
• Log data and cache
Anti-Pattern/Danger:
Anything that requires:
• Joins
• Custom sorting
• Non-key filters
Why Key-Value
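The trade-off on this slide can be sketched with a toy in-memory table. `KeyValueTable` is a hypothetical class, not the Azure Tables API, and every row except the Camera Superstore one is invented for the example.

```python
class KeyValueTable:
    """Toy in-memory table addressed by (partition key, row key)."""

    def __init__(self):
        self.partitions = {}  # partition key -> {row key -> properties}

    def put(self, partition_key, row_key, properties):
        self.partitions.setdefault(partition_key, {})[row_key] = properties

    def get(self, partition_key, row_key):
        """Point lookup: two dictionary hits, no scan."""
        return self.partitions.get(partition_key, {}).get(row_key)

    def scan(self, predicate):
        """Non-key filter: has to visit every entity in every partition."""
        visited, matches = 0, []
        for rows in self.partitions.values():
            for properties in rows.values():
                visited += 1
                if predicate(properties):
                    matches.append(properties)
        return matches, visited


table = KeyValueTable()
table.put("123", "013803204131",
          {"Seller": "Camera Superstore", "Price": 425.99, "SellerType": "Online"})
# The two rows below are hypothetical filler data.
table.put("123", "013803204132",
          {"Seller": "Lens Depot", "Price": 89.50, "SellerType": "Store"})
table.put("456", "013803204131",
          {"Seller": "Photo Barn", "Price": 431.00, "SellerType": "Online"})

hit = table.get("123", "013803204131")
online, visited = table.scan(lambda p: p["SellerType"] == "Online")
```

The key lookup touches one entity, while the filter on `SellerType` visits all of them. That is why non-key filters sit in the anti-pattern column.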
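// Assumes the Microsoft.WindowsAzure.Storage and Microsoft.WindowsAzure.Storage.Table
// namespaces, System.Linq, and an existing CloudStorageAccount named account.
// Create a table client.
CloudTableClient tableClient = account.CreateCloudTableClient();
CloudTable priceCompareTable = tableClient.GetTableReference("pricecompare");
// Query the single entity identified by its partition key and row key.
// Both keys are strings, so the values are quoted.
IQueryable<DynamicTableEntity> query =
    from q in priceCompareTable.CreateQuery<DynamicTableEntity>()
    where q.PartitionKey == "123"
        && q.RowKey == "013803204131"
    select q;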
Azure Tables: LINQ Query
Introduction to Windows Azure Tables
Azure Redis Cache 101 on Channel9
Learn More: Azure Tables and Redis Cache
Patterns/What Works:
• Variable data structures for the same type of entity
• Fast reads and writes on a complete entity set
• Highly nested data stories
• Partially completed workflows
• You love JavaScript
Anti-Pattern/Danger:
Anything that requires:
• Joins
• Complex transactional needs
• Lots of aggregation
Why Document
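A small sketch of the document pattern, using plain Python dictionaries in place of a real document store. The catalog entries and the `get_document` and `get_path` helpers are hypothetical; they are not the DocumentDB API.

```python
import copy

# Two products of the same entity type with different, nested structures.
catalog = {
    "cam-1": {
        "name": "Compact camera",
        "specs": {"sensor": {"megapixels": 20}, "zoom": "10x"},
        "reviews": [{"stars": 5}, {"stars": 4}],
    },
    "bag-7": {
        "name": "Camera bag",
        "specs": {"material": "nylon"},
        # no "reviews" field at all: the store does not require one
    },
}


def get_document(doc_id):
    """Return the whole entity in one read, nested parts included."""
    return copy.deepcopy(catalog[doc_id])


def get_path(document, path, default=None):
    """Walk a dotted path such as 'specs.sensor.megapixels'."""
    node = document
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


camera = get_document("cam-1")
megapixels = get_path(camera, "specs.sensor.megapixels")
bag_reviews = get_path(get_document("bag-7"), "reviews", default=[])
```

Reading one complete entity is a single lookup. Averaging review stars across the whole catalog would mean opening every document, which is the aggregation anti-pattern listed on the slide.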
Azure DocumentDB .NET Code Samples
Azure DocumentDB 101 on Channel9
Azure DocumentDB 102 on Channel9
Build a web application with ASP.NET MVC using
DocumentDB
Learn More: Azure DocumentDB
Sensor Data Analysis
Real-time Query
Web Indexer
Message Systems
Interactive Dashboards
Column Family Use Cases
Apache HBase Features
Random and Consistent Real-Time Read/Write
Automatic Sharding and Linear Scale
Billions of Rows and Millions of Columns
Think About
Person Table
Row Key | Columns
720 | gender -> male, age -> 62
721 | gender -> male, photo -> image
723 | video -> stream
sparse | persistent | distributed | sorted | multidimensional
Understanding BigTable
{
  "trackingid" : 720,
  "gender" : "male",
  "age" : 62
}
Great Reference: Understanding HBase and Big Table
HBase: A map of maps…
{
  "720" : { "age" : "62", "gender" : "male" },
  "721" : { "age" : "40", "gender" : "male", "confidence" : "0.65" },
  "722" : { "gender" : "female" },
  "723" : { "age" : "12", "gender" : "female", "confidence" : "0.65" },
  …
}
Row Key
Sparse
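The map-of-maps idea translates directly into nested dictionaries. This is a sketch of the data model only, using the rows from the slide; `get_cell` and `scan_rows` are hypothetical helpers, not HBase client calls.

```python
# Row key -> {column -> value}; a row stores only the columns it has.
person = {
    "720": {"age": "62", "gender": "male"},
    "721": {"age": "40", "gender": "male", "confidence": "0.65"},
    "722": {"gender": "female"},
    "723": {"age": "12", "gender": "female", "confidence": "0.65"},
}


def get_cell(table, row_key, column):
    """Return a cell value, or None when the row has no such column."""
    return table.get(row_key, {}).get(column)


def scan_rows(table, start_key, stop_key):
    """Range scan in row-key order (start inclusive, stop exclusive)."""
    return [(key, table[key]) for key in sorted(table) if start_key <= key < stop_key]


missing_age = get_cell(person, "722", "age")
middle_rows = [key for key, _ in scan_rows(person, "721", "723")]
stored_cells = sum(len(columns) for columns in person.values())
```

Row 722 has no `age` entry at all, rather than a stored null. Sparse means absent columns take no space.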
HBase: Column Families
{
  "720" : {
    "demographics" : { "age" : "62", "gender" : "male" },
    "interactions" : { "devicestate" : "removed", "duration" : "100" }
  },
  "721" : {
    "demographics" : { "age" : "40", "gender" : "male" },
    "interactions" : { "devicestate" : "replaced", "duration" : "50" }
  },
  …
}
Column families: Demographics, Interactions
Multidimensional
HBase: Physical View of a Sorted Map
Sort Order: Row Key, Column Name, Timestamp
telemetry
Row Key | Column Key | Timestamp | Value
720 | demographics:age | 1423234758774 | 62
720 | demographics:gender | 1423234758711 | male
721 | demographics:age | 1423234758946 | 22
721 | demographics:age | 1423234758725 | 32
721 | demographics:gender | 1423234758950 | female
Cell: Uninterpreted Bytes
{row, column, version}
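The sort order and cell versioning on this slide can be sketched with a sorted list of tuples. The cell values come from the slide; `get_latest` and `get_versions` are hypothetical helpers, and newest-first ordering within a cell is assumed from the order of the two `721` age rows.

```python
# (row key, column key, timestamp, value), using the cells from the slide.
cells = [
    ("721", "demographics:age", 1423234758725, "32"),
    ("720", "demographics:gender", 1423234758711, "male"),
    ("721", "demographics:gender", 1423234758950, "female"),
    ("720", "demographics:age", 1423234758774, "62"),
    ("721", "demographics:age", 1423234758946, "22"),
]

# Row key ascending, column key ascending, timestamp descending (newest first).
ordered = sorted(cells, key=lambda c: (c[0], c[1], -c[2]))


def get_latest(row_key, column_key):
    """Return the newest value for a cell, or None if it was never written."""
    for row, column, _timestamp, value in ordered:
        if row == row_key and column == column_key:
            return value
    return None


def get_versions(row_key, column_key):
    """Return every stored version of a cell, newest first."""
    return [(ts, value) for row, column, ts, value in ordered
            if row == row_key and column == column_key]


latest_age = get_latest("721", "demographics:age")
history = get_versions("721", "demographics:age")
```

A cell is addressed by {row, column, version}: asking for row 721's age without a version returns the newest entry, while the older value stays available as history.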
CREATE TABLE IF NOT EXISTS "kinecttelemetry"("k" VARCHAR primary key, "age" VARCHAR, "gender" VARCHAR) default_column_family='demographics';
Apache Phoenix: SQL Skin over HBase
Phoenix in 15 Minutes or Less
Get started using HBase with Hadoop in HDInsight
Analyze Real-Time Twitter Sentiment with HBase in
HDInsight
Learn More: HBase on Azure
Distributed Storage
(HDFS or Blob Storage)
Distributed Processing
(MapReduce)
Scripting
(Pig)
SQL-like Query
(HiveQL)
SQL-like Query
(Impala)
Resource Scheduling
(YARN)
Hadoop Zoo
Real-Time
(HBase)
Hadoop On Your Terms
Cloudera Selects Microsoft
Azure as a Preferred Cloud
Platform
Hortonworks Data Platform
is now Microsoft Azure
Certified
100% Apache Hadoop-based
Service in the Cloud
Microsoft Azure
HDInsight
Qubole Partners with
Microsoft Azure
Create Table:
CREATE EXTERNAL TABLE irs_data_20082 (
  state string,
  zipcode string,
  agi_class int,
  n1 int,
  mars2 int,
  prep int,
  n2 int,
  numdep int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/all/data/';
Query:
SELECT state, zipcode, agi_class
FROM irs_data_20082;
Hadoop Hive: External Table
Patterns/What Works:
• Batch processing
• Map…and reduce
• Lots of aggregation
• Multiple schemas on same data
• Fast
Anti-Pattern/Danger:
Anything that requires:
• Joins
• Complex transactional needs
• Granular security requirements
• Not a relational database replacement
• Not fast
Why Hadoop
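"Map…and reduce" can be sketched in a few lines of single-process Python. The rows are hypothetical and only shaped like the first columns of the Hive table in this deck; `n1` is treated here simply as a numeric column to sum, which is an assumption about its meaning.

```python
from collections import defaultdict

# Hypothetical rows: (state, zipcode, agi_class, n1).
rows = [
    ("MN", "55401", 1, 120),
    ("MN", "55402", 2, 80),
    ("WA", "98052", 1, 200),
    ("WA", "98101", 3, 50),
    ("MN", "55401", 3, 10),
]


def map_phase(row):
    """Emit one (key, value) pair per input row."""
    state, _zipcode, _agi_class, n1 = row
    yield state, n1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


pairs = [pair for row in rows for pair in map_phase(row)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In Hadoop the map and reduce calls run in parallel across many machines over data in distributed storage. This sketch only shows the shape of the computation, which suits batch aggregation far better than single-row lookups.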
http://azure.microsoft.com/en-us/documentation/services/hdinsight/
http://vision.cloudera.com/cloudera-on-azure/
http://hortonworks.com/labs/microsoft/
Resources for Hadoop on Azure
Patterns/What Works:
• Highly connected data
• Relationships make the data story
• Paths through data
• Finding shortest/longest path
Anti-Pattern/Danger:
• Loosely connected data (e.g., log data)
• Very high number of updates on a regular basis
Why Graph
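"Paths through data" is the core graph workload. Here is a minimal sketch, assuming a small friend-of-a-friend graph held as an adjacency list; the names and the `shortest_path` function are made up for illustration, and a graph database would run this kind of traversal natively.

```python
from collections import deque

# Hypothetical friend-of-a-friend graph as an adjacency list (undirected).
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["ana", "dev"],
    "dev": ["ben", "cho", "eli"],
    "eli": ["dev"],
    "fay": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


path = shortest_path(friends, "ana", "eli")
unreachable = shortest_path(friends, "ana", "fay")
```

In a relational model each hop along the path would typically be another self-join. Here the relationships are the data structure, so the traversal follows them directly.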
FoaF
(Social Graph)
Market Basket Analysis
Forensics
Fraud Detection
Recommendations
Use Cases for Graph Databases
Free Graph Databases E-Book
Project Naiad from Microsoft Research
Learn More: Graph Databases
It’s fun
Database technologies aren’t YES/NO decisions
It’s inexpensive to learn
It’s fast to spin up a learning environment
A data professional needs to know more than one tool
Using the right tool for the right job is key
It’s fun
7 Reasons to Go Explore