Tamir Dresher Senior Software Architect May 19, 2014 Where is my Data? (In the Cloud)


Page 1: Where is my data (in the cloud)   tamir dresher

Tamir Dresher

Senior Software Architect, May 19, 2014

Where is my Data? (In the Cloud)

Page 2: Where is my data (in the cloud)   tamir dresher

About Me

• Software architect, consultant and instructor

• Software Engineering Lecturer @ Ruppin Academic Center

• Technology addict

• 10 years of experience

• .NET and Native Windows Programming

[email protected] · http://www.TamirDresher.com

Page 3: Where is my data (in the cloud)   tamir dresher

Agenda

• Storage

• Blob

• Azure SQL Server

• Azure Tables

• HDInsight

Page 5: Where is my data (in the cloud)   tamir dresher

Storage


Page 6: Where is my data (in the cloud)   tamir dresher

Storage Prices


Page 7: Where is my data (in the cloud)   tamir dresher

Types of information


Page 8: Where is my data (in the cloud)   tamir dresher

Windows Azure Growing Global Presence

[Map: data centers in North America, Europe, and Asia Pacific]

Storage SLA – 99.99% (at most 52.56 minutes of downtime per year)

http://azure.microsoft.com/en-us/support/legal/sla
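
The 52.56-minute figure follows directly from the 99.99% SLA. A quick check in Python, assuming a 365-day year (the function name is illustrative and not part of any Azure tooling):

```python
def max_downtime_minutes_per_year(sla_percent, days_per_year=365):
    """Return the downtime budget, in minutes per year, implied by an availability SLA."""
    minutes_per_year = days_per_year * 24 * 60  # 525,600 for a 365-day year
    return minutes_per_year * (1 - sla_percent / 100)


print(round(max_downtime_minutes_per_year(99.99), 2))  # 52.56
```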

Page 9: Where is my data (in the cloud)   tamir dresher

AZURE BLOBS


Page 10: Where is my data (in the cloud)   tamir dresher

What is a BLOB

• BLOB – Binary Large OBject

• Storage for any type of entity such as binary files and text documents

• Distributed File Service (DFS)

– Scalability and High availability

• A BLOB file is distributed across multiple servers and replicated at least 3 times

Page 11: Where is my data (in the cloud)   tamir dresher

Blob Storage Concepts


Page 12: Where is my data (in the cloud)   tamir dresher

Blob Operations

REST


Page 13: Where is my data (in the cloud)   tamir dresher

DEMO: Creating a Blob

Page 14: Where is my data (in the cloud)   tamir dresher

BLOBS

• Block blob - up to 200 GB in size

• Page blobs – up to 1 TB in size

• Total Account Capacity - 500 TB

• Pricing

– Storage capacity used

– Replication option (LRS, GRS, RA-GRS)

– Number of requests

– Data egress

– http://azure.microsoft.com/en-us/pricing/details/storage/

Page 15: Where is my data (in the cloud)   tamir dresher

SQL AZURE


Page 16: Where is my data (in the cloud)   tamir dresher

SQL Azure

• SQL Server in the cloud

• No administrative overheads

• High Availability

• pay-as-you-grow pricing

• Familiar Development Model*

* Despite missing features and some limitations - http://msdn.microsoft.com/en-us/library/ff394115.aspx


Page 17: Where is my data (in the cloud)   tamir dresher

DEMO: Creating and Using SQL Azure

Page 18: Where is my data (in the cloud)   tamir dresher

SQL Azure – Pricing


Page 19: Where is my data (in the cloud)   tamir dresher

Case Study - https://haveibeenpwned.com/


Page 20: Where is my data (in the cloud)   tamir dresher

Case Study - https://haveibeenpwned.com/

• http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html

• How do I make querying 154 million email addresses as fast as possible?

• If I want 100GB of SQL Server and I want to hit it 10 million times, it’ll cost me $176 a month (now it’s ~$20)

Page 21: Where is my data (in the cloud)   tamir dresher

AZURE TABLES


Page 22: Where is my data (in the cloud)   tamir dresher

Table Storage Concepts


Page 23: Where is my data (in the cloud)   tamir dresher

Table Storage

• Not an RDBMS – no relationships between entities

– NoSQL

• An entity can have up to 255 properties and be up to 1 MB in size

• Mandatory properties for every entity

– PartitionKey & RowKey (the only indexed properties)

  • Together they uniquely identify an entity

  • The same RowKey can be used in different PartitionKeys

  • Together they define the sort order

– Timestamp – used for optimistic concurrency
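
The key rules above can be illustrated with a small in-memory model. This is a sketch only, not the Azure SDK: `ToyTable`, `ConflictError`, and the integer version counter are hypothetical stand-ins (the counter plays the role that the slide assigns to Timestamp), and the email values are made up.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale version token."""


class ToyTable:
    """In-memory stand-in for a table; NOT the Azure storage client."""

    def __init__(self):
        # (partition_key, row_key) -> (version, properties)
        self._rows = {}

    def insert(self, partition_key, row_key, **properties):
        key = (partition_key, row_key)
        if key in self._rows:
            # PartitionKey + RowKey together must be unique.
            raise KeyError(f"entity {key} already exists")
        self._rows[key] = (1, dict(properties))
        return 1

    def get(self, partition_key, row_key):
        return self._rows[(partition_key, row_key)]

    def update(self, partition_key, row_key, expected_version, **properties):
        key = (partition_key, row_key)
        version, _ = self._rows[key]
        if version != expected_version:
            # Optimistic concurrency: someone else changed the entity first.
            raise ConflictError(f"entity {key} was changed by someone else")
        self._rows[key] = (version + 1, dict(properties))
        return version + 1

    def scan(self):
        # Entities come back ordered by PartitionKey, then RowKey.
        return sorted(self._rows)


table = ToyTable()
table.insert("Smith", "Jeff", email="jeff@example.com")
table.insert("Harp", "Walter", email="walter@example.com")
table.insert("Harp", "Jeff", email="jeff.harp@example.com")  # same RowKey, other partition
print(table.scan())
# [('Harp', 'Jeff'), ('Harp', 'Walter'), ('Smith', 'Jeff')]
```

A writer that read version 1 and tries to update after another writer has already moved the entity to version 2 gets a `ConflictError` and has to re-read before retrying.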

Page 24: Where is my data (in the cloud)   tamir dresher

No Fixed Schema


Page 25: Where is my data (in the cloud)   tamir dresher

Table Object Model

• ITableEntity interface – PartitionKey, RowKey, Timestamp, and ETag properties

– Implemented by TableEntity and DynamicTableEntity

// This class defines one additional property of integer type,
// since it derives from TableEntity it will be automatically
// serialized and deserialized.
public class SampleEntity : TableEntity
{
    public int SampleProperty { get; set; }
}

Page 26: Where is my data (in the cloud)   tamir dresher

Sample – Inserting an Entity into a Table

// You will need the following using statements
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// storageAccount is a CloudStorageAccount created earlier (not shown on the slide).
// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");
peopleTable.CreateIfNotExists();

// Create a new customer entity. CustomerEntity is not shown on the slide; it is
// assumed to derive from TableEntity and to take (PartitionKey, RowKey) in its constructor.
CustomerEntity customer1 = new CustomerEntity("Harp", "Walter");
customer1.Email = "[email protected]";
customer1.PhoneNumber = "425-555-0101";

// Create an operation to add the new customer to the people table.
TableOperation insertCustomer1 = TableOperation.Insert(customer1);

// Submit the operation to the table service.
peopleTable.Execute(insertCustomer1);

Page 27: Where is my data (in the cloud)   tamir dresher

Retrieve

// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");

// Retrieve the entity with partition key of "Smith" and row key of "Jeff"
TableOperation retrieveJeffSmith =
    TableOperation.Retrieve<CustomerEntity>("Smith", "Jeff");

// Retrieve entity
CustomerEntity specificEntity =
    (CustomerEntity)peopleTable.Execute(retrieveJeffSmith).Result;

Page 28: Where is my data (in the cloud)   tamir dresher

Table Storage – Important Points

• Azure Tables can store TBs of data

• Table operations are fast

• Tables are distributed – PartitionKey defines the partition

– A table might be stored in different partitions on different storage devices.

Page 29: Where is my data (in the cloud)   tamir dresher

Pricing


Page 30: Where is my data (in the cloud)   tamir dresher

Case Study - https://haveibeenpwned.com/


Page 31: Where is my data (in the cloud)   tamir dresher

Case Study - https://haveibeenpwned.com/

• How do I make querying 154 million email addresses as fast as possible?

• [email protected] – the domain is the partition key and the alias is the row key

• If I want 100GB of storage and I want to hit it 10 million times, it’ll cost me $8 a month

• SQL Server will cost $176 a month – 22 times more expensive
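
A minimal sketch of the key scheme described in this case study: the domain becomes the partition key and the alias becomes the row key. The function name, the sample address, and the lower-casing step are assumptions made for illustration; the slide does not say how the addresses are normalized.

```python
from typing import Tuple


def email_to_keys(email: str) -> Tuple[str, str]:
    """Split an address into (partition_key, row_key) = (domain, alias)."""
    alias, _, domain = email.rpartition("@")
    if not alias or not domain:
        raise ValueError(f"not a valid email address: {email!r}")
    # Lower-casing is an assumption so that lookups are case-insensitive.
    return domain.lower(), alias.lower()


print(email_to_keys("someone@example.com"))  # ('example.com', 'someone')

# Price ratio quoted on the slide: $176 (SQL Server) vs. $8 (Tables) per month.
print(176 / 8)  # 22.0
```

Looking up one address is then a point query on a single (PartitionKey, RowKey) pair, which are the only indexed properties of a table.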

Page 32: Where is my data (in the cloud)   tamir dresher

HDINSIGHT


Page 33: Where is my data (in the cloud)   tamir dresher

Hadoop in the cloud

• Hadoop on Azure Cloud

• Some Facts:

– Bing ingests > 7 petabytes a month

– The Twitter community generates over 1 terabyte of tweets every day

– Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes

Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp

Page 34: Where is my data (in the cloud)   tamir dresher

MapReduce – The BigData Power

• Map – takes input and outputs key/value pairs

(Key1, Value1)
(Key2, Value2)
…
(Keyn, Valuen)

Page 35: Where is my data (in the cloud)   tamir dresher

MapReduce – The BigData Power

• Reduce – takes the group of values for each key and produces a new group of values

Key1: [value1-1, value1-2, …] -> [new_value1-1, new_value1-2, …]
Key2: [value2-1, value2-2, …] -> [new_value2-1, new_value2-2, …]
…
KeyN: [valueN-1, valueN-2, …] -> [new_valueN-1, new_valueN-2, …]
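
The two phases can be simulated on a single machine in a few lines of Python. This is a sketch of the idea only, not how Hadoop is implemented: `run_map_reduce`, the word-count mapper, and the sample lines are all illustrative assumptions, and the reducer here returns a single count per key rather than a group of values.

```python
from collections import defaultdict


def run_map_reduce(lines, mapper, reducer):
    """Run mapper over every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):   # Map: emit (key, value) pairs
            groups[key].append(value)     # Shuffle: collect values per key
    # Reduce: one call per key, keys visited in sorted order
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    return sum(values)


print(run_map_reduce(["the cloud", "the data in the cloud"], word_mapper, count_reducer))
# {'cloud': 2, 'data': 1, 'in': 1, 'the': 3}
```

On a real cluster the map calls run in parallel on different nodes, and the grouping step moves each key's values to the node that runs its reducer.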

Page 36: Where is my data (in the cloud)   tamir dresher

MapReduce - How Does It Work?

Page 37: Where is my data (in the cloud)   tamir dresher

So How Does It Work?

Page 38: Where is my data (in the cloud)   tamir dresher

Finding common friends

• Facebook shows you how many common friends you have with someone

• As of 01.01.2014, Facebook had 1,310,000,000 active users with 130 friends on average

• Calculating the mutual friends

Page 39: Where is my data (in the cloud)   tamir dresher

Finding common friends

• We can represent a friend relationship as:

Someone -> [list of his/her friends]

• Note that a friend relationship is symmetrical

– if A is a friend of B then B is a friend of A

Page 40: Where is my data (in the cloud)   tamir dresher

Example of Friends file

• U1 -> U2 U3 U4

• U2 -> U1 U3 U4 U5

• U3 -> U1 U2 U4 U5

• U4 -> U1 U2 U3 U5

• U5 -> U2 U3 U4


Page 42: Where is my data (in the cloud)   tamir dresher

Designing our MapReduce job - Mapper

• Each line of the file is an input line to the Mapper

• The Mapper will output key-value pairs

• Key: (user, friend)

– Sorted, so friend might come before user

• Value: the user's list of friends

• Sorting the key helps the reducer: identical pairs are delivered together

Page 45: Where is my data (in the cloud)   tamir dresher

Mapper Example – final result

Given the line: U1 -> U2 U3 U4
Mapper output:
(U1 U2) -> U2 U3 U4
(U1 U3) -> U2 U3 U4
(U1 U4) -> U2 U3 U4

Given the line: U2 -> U1 U3 U4 U5
Mapper output:
(U1 U2) -> U1 U3 U4 U5
(U2 U3) -> U1 U3 U4 U5
(U2 U4) -> U1 U3 U4 U5
(U2 U5) -> U1 U3 U4 U5

Given the line: U3 -> U1 U2 U4 U5
Mapper output:
(U1 U3) -> U1 U2 U4 U5
(U2 U3) -> U1 U2 U4 U5
(U3 U4) -> U1 U2 U4 U5
(U3 U5) -> U1 U2 U4 U5

Given the line: U4 -> U1 U2 U3 U5
Mapper output:
(U1 U4) -> U1 U2 U3 U5
(U2 U4) -> U1 U2 U3 U5
(U3 U4) -> U1 U2 U3 U5
(U4 U5) -> U1 U2 U3 U5

Given the line: U5 -> U2 U3 U4
Mapper output:
(U2 U5) -> U2 U3 U4
(U3 U5) -> U2 U3 U4
(U4 U5) -> U2 U3 U4

Page 46: Where is my data (in the cloud)   tamir dresher

Designing our MapReduce job - Reducer

• The input for the reducer will be structured as:

(friend1, friend2) (friend1 friends) (friend2 friends)

• The reducer will find the intersection between the lists

• Output:

(friend1, friend2) (intersection of friend1 and friend2 friends)


Page 47: Where is my data (in the cloud)   tamir dresher

Reducer Example

Given the line: (U1 U2) -> (U1 U3 U4 U5) (U2 U3 U4)
Reducer output: (U1 U2) -> (U3 U4)

Given the line: (U1 U3) -> (U1 U2 U4 U5) (U2 U3 U4)
Reducer output: (U1 U3) -> (U2 U4)

Given the line: (U1 U4) -> (U1 U2 U3 U5) (U2 U3 U4)
Reducer output: (U1 U4) -> (U2 U3)

Given the line: (U2 U3) -> (U1 U2 U4 U5) (U1 U3 U4 U5)
Reducer output: (U2 U3) -> (U1 U4 U5)

Given the line: (U2 U4) -> (U1 U2 U3 U5) (U1 U3 U4 U5)
Reducer output: (U2 U4) -> (U1 U3 U5)

Given the line: (U2 U5) -> (U1 U3 U4 U5) (U2 U3 U4)
Reducer output: (U2 U5) -> (U3 U4)

Given the line: (U3 U4) -> (U1 U2 U3 U5) (U1 U2 U4 U5)
Reducer output: (U3 U4) -> (U1 U2 U5)

Given the line: (U3 U5) -> (U1 U2 U4 U5) (U2 U3 U4)
Reducer output: (U3 U5) -> (U2 U4)

Given the line: (U4 U5) -> (U1 U2 U3 U5) (U2 U3 U4)
Reducer output: (U4 U5) -> (U2 U3)
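
The whole common-friends job can be replayed locally on the five-user example. The sketch below is a single-process simulation in Python, not Hadoop code; it assumes the friends file holds whitespace-separated user names with no "->" arrow, and the names `map_line`, `reduce_pair`, and `FRIENDS_FILE` are illustrative.

```python
from collections import defaultdict

# Assumed file layout: the user first, followed by that user's friends.
FRIENDS_FILE = """\
U1 U2 U3 U4
U2 U1 U3 U4 U5
U3 U1 U2 U4 U5
U4 U1 U2 U3 U5
U5 U2 U3 U4
"""


def map_line(line):
    """Emit (sorted (user, friend) pair, the user's friend list) for each friend."""
    user, *friends = line.split()
    for friend in friends:
        yield tuple(sorted((user, friend))), friends


def reduce_pair(friend_lists):
    """Intersect the two friend lists collected for one pair."""
    first, second = friend_lists  # symmetric input yields exactly two lists per pair
    return sorted(set(first) & set(second))


# Shuffle step: group the mapper output by key.
grouped = defaultdict(list)
for line in FRIENDS_FILE.splitlines():
    for pair, friends in map_line(line):
        grouped[pair].append(friends)

common = {pair: reduce_pair(lists) for pair, lists in sorted(grouped.items())}
for pair, friends in common.items():
    print(pair, "->", friends)
# first line printed: ('U1', 'U2') -> ['U3', 'U4']
```

Because the key is sorted before it is emitted, the entries produced from U1's line and from U2's line land under the same key, which is what lets the reducer see both friend lists together.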

Page 48: Where is my data (in the cloud)   tamir dresher

Creating c# MapReduce


Page 49: Where is my data (in the cloud)   tamir dresher

Creating c# MapReduce - Mapper


using System;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class CommonFriendsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // First token is the user, the remaining tokens are that user's friends.
        var strings = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (strings.Any())
        {
            var currentUser = strings[0];
            var friends = strings.Skip(1);
            foreach (var friend in friends)
            {
                // Sort the pair so (U1 U2) and (U2 U1) produce the same key.
                var keyArr = new[] { currentUser, friend };
                Array.Sort(keyArr);
                var key = String.Join(" ", keyArr);
                context.EmitKeyValue(key, string.Join(" ", friends));
            }
        }
    }
}

Page 50: Where is my data (in the cloud)   tamir dresher

Creating c# MapReduce - Reduce


public class CommonFriendsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> strings, ReducerCombinerContext context)
    {
        // Each key is expected to arrive with exactly two friend lists,
        // one emitted from each user's line (the relationship is symmetrical).
        var friendsLists = strings
            .Select(friendList => friendList.Split(' '))
            .ToList();
        var intersection = friendsLists[0].Intersect(friendsLists[1]);

        context.EmitKeyValue(key, string.Join(" ", intersection));
    }
}

Page 51: Where is my data (in the cloud)   tamir dresher

Creating c# MapReduce – Hadoop Job


HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
myConfig.InputPath = "wasb:///example/data/friends/friends";
myConfig.OutputFolder = "wasb:///example/data/friends/output";

Environment.SetEnvironmentVariable("HADOOP_HOME", @"c:\hadoop");
Environment.SetEnvironmentVariable("Java_HOME", @"c:\hadoop\jvm");

// The connection arguments are variables defined elsewhere (not shown on the slide).
var hadoop = Hadoop.Connect(
    clusterUri, clusterUserName, hadoopUserName, clusterPassword,
    azureStorageAccount, azureStorageKey, azureStorageContainer,
    createContinerIfNotExist);

var jobResult = hadoop.MapReduceJob.Execute<CommonFriendsMapper, CommonFriendsReducer>(myConfig);

int exitCode = jobResult.Info.ExitCode; // (0 – success, otherwise – failure)

Page 52: Where is my data (in the cloud)   tamir dresher

Pricing

10 node cluster that will exist for 24 hours:

• Secure Gateway Node - free

• Head node - 15.36 USD per 24-hour day

• 1 data node - 7.68 USD per 24-hour day

• 10 data nodes - 76.80 USD per 24-hour day

• Total: 92.16 USD
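
The total can be reproduced, and re-run for other cluster sizes, with a few lines of Python. The hourly rates below are derived from the per-day figures on this slide (15.36 / 24 and 7.68 / 24) and reflect the prices quoted in the talk, not current Azure pricing; the function name is illustrative.

```python
HEAD_NODE_USD_PER_HOUR = 0.64  # 15.36 USD per 24-hour day, from the slide
DATA_NODE_USD_PER_HOUR = 0.32  # 7.68 USD per 24-hour day, from the slide


def cluster_cost_usd(data_nodes, hours):
    """Cost of one head node plus data nodes; the secure gateway node is free."""
    hourly = HEAD_NODE_USD_PER_HOUR + data_nodes * DATA_NODE_USD_PER_HOUR
    return round(hourly * hours, 2)


print(cluster_cost_usd(10, 24))  # 92.16
```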

Page 53: Where is my data (in the cloud)   tamir dresher

WRAP UP


Page 54: Where is my data (in the cloud)   tamir dresher

Comparing the alternatives

Storage Type | When Should You Use | Implications

BLOB | Unstructured data; files | Application logic responsibility; consider using HDInsight (Hadoop)

SQL Server | Structured relational data; ACID transactions; max 150GB (500GB in preview) | SQL DML+DDL; could affect scalability; BI abilities; reporting

Azure Tables | Structured data; loose schema; geo replication (high DR); auto sharding | OData, REST; application logic responsibility (multiple schemas)

Page 55: Where is my data (in the cloud)   tamir dresher

What have we seen

• Azure Blobs

• Azure Tables

• Azure SQL Server

• HDInsight

Page 56: Where is my data (in the cloud)   tamir dresher

What’s Next

• NoSQL – MongoDB, Cassandra, CouchDB, RavenDB

• Hadoop ecosystem – Hive, Pig, Sqoop, Mahout

• http://blogs.msdn.com/b/windowsazure/

• http://blogs.msdn.com/b/windowsazurestorage/

• http://blogs.msdn.com/b/bigdatasupport/


Page 57: Where is my data (in the cloud)   tamir dresher

Presenter contact details

c: +972-52-4772946

t: @tamir_dresher

e: [email protected]

TamirDresher.com

w: www.codevalue.net