Big Data – Exercises

Fall 2019 – Week 2 – ETH Zurich

Exercise 1: Storage devices (Optional)

In this exercise, we want to understand the differences between SSD, HDD, and SDRAM in terms of capacity, speed, and price.

Task 1

Fill in the table below by visiting your local online hardware store and choosing the storage device with the largest capacity available while optimizing for read/write speed. For instance, you can visit Digitec.ch to explore the prices of SSDs, HDDs, and SDRAMs. You are free to use any other website to fill in the table.

Storage Device | Maximum capacity, GB | Price, CHF/GB | Read speed, GB/s | Write speed, GB/s | Link
HDD            |                      |               |                  |                   |
SSD            |                      |               |                  |                   |
DRAM           |                      |               |                  |                   |

Task 2

Answer the following questions:

1. Which type of storage device above is the cheapest?
2. Which type of storage device above is the fastest in terms of read speed?

Solution:

Looking at Digitec, we fill the table as follows:

Storage Device | Maximum capacity   | Price, CHF/GB | Read speed, GB/s | Write speed, GB/s | Link
HDD            | 16 TB              | 0.034         | 0.25             | 0.22              | link, link2
SSD            | 2 TB               | 0.32          | 3.5              | 2.5               | link
DRAM           | 64 GB per module   | 12.5          | 60               | 48                | link1, link2

1. HDDs are the cheapest among the mentioned storage devices.
2. DRAM is the fastest among the mentioned storage devices.
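The price gap can be made concrete with a quick calculation; the CHF/GB figures below are the snapshot values from the table above (prices change over time):

```python
# Price per GB (CHF) as sampled in the table above
price_chf_per_gb = {"HDD": 0.034, "SSD": 0.32, "DRAM": 12.5}

# Cost of provisioning 1 TB (1000 GB) on each medium
cost_1tb = {dev: round(p * 1000, 1) for dev, p in price_chf_per_gb.items()}
print(cost_1tb)  # {'HDD': 34.0, 'SSD': 320.0, 'DRAM': 12500.0}
```

Storing the same terabyte is roughly an order of magnitude more expensive on SSD than on HDD, and two further orders of magnitude more expensive in DRAM.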

Exercise 2: Set up an Azure storage account

It is comprised of the following steps:

1. Create a Locally redundant storage account
2. Learn the features of the blob types: Append, Block, Page
3. Test the Locally redundant storage by writing to it.

Step 1: Create a Locally redundant storage

1. First, you need to create a storage account. To do that, go to "Storage accounts" in the left menu, click on "Add" at the top of the blade, and fill out the following form. Choose the "exercise02" resource group or create a new one, select "Locally redundant storage (LRS)" as the replication mode, and leave all other values at their defaults. The specific name is not important; choose whatever name you want.

2. Open the new storage account (you may need to wait a while before it has been created), go to "Access keys"(under "Settings") and copy one of its keys to the clipboard.



3. Paste the key, the account name, and some container name into the following cell and run the cell.

In [1]:

ACCOUNT_NAME = '...'  # see picture above
ACCOUNT_KEY = '...'
CONTAINER_NAME = 'exercise02'

Step 2: Install a management library for Azure storage and import it

1. Run the two cells below.

In [2]:

!pip install azure-storage==0.33.0

Collecting azure-storage==0.33.0
  Downloading https://files.pythonhosted.org/packages/6e/d4/d26d82b468e1c2c7e9bf214dc75c4f0a6d10c4da839cc029e36662569c67/azure_storage-0.33.0-py3-none-any.whl (181kB)
    100% |████████████████████████████████| 184kB 6.1MB/s ta 0:00:01
Requirement already satisfied: python-dateutil in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from azure-storage==0.33.0) (2.7.3)
Requirement already satisfied: azure-common in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from azure-storage==0.33.0) (1.1.15)
Requirement already satisfied: requests in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from azure-storage==0.33.0) (2.19.1)
Requirement already satisfied: cryptography in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from azure-storage==0.33.0) (2.0.3)
Requirement already satisfied: azure-nspkg in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from azure-storage==0.33.0) (2.0.0)
Requirement already satisfied: six>=1.5 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from python-dateutil->azure-storage==0.33.0) (1.11.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from requests->azure-storage==0.33.0) (2018.8.24)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from requests->azure-storage==0.33.0) (3.0.4)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from requests->azure-storage==0.33.0) (1.23)
Requirement already satisfied: idna<2.8,>=2.5 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from requests->azure-storage==0.33.0) (2.7)
Requirement already satisfied: asn1crypto>=0.21.0 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from cryptography->azure-storage==0.33.0) (0.24.0)
Requirement already satisfied: cffi>=1.7 in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from cryptography->azure-storage==0.33.0) (1.7.0)
Requirement already satisfied: pycparser in /home/nbuser/anaconda3_420/lib/python3.5/site-packages (from cffi>=1.7->cryptography->azure-storage==0.33.0) (2.14)


In [3]:

from azure.storage.blob import BlockBlobService
from azure.storage.blob import PageBlobService
from azure.storage.blob import AppendBlobService
from azure.storage.blob import PublicAccess
from azure.storage.models import LocationMode
from azure.storage.blob import ContentSettings
from azure.storage.blob import BlockListType
from azure.storage.blob import BlobBlock
from timeit import default_timer as timer
import uuid
import random

# function for generating unique names for blobs
def get_blob_name():
    return '{}{}'.format('blob', str(uuid.uuid4()).replace('-', ''))

Exercise 3

Step 1: Explore Concepts of Azure Blob Storage

1. A container provides a grouping of a set of blobs. All blobs must be in a container. An account can contain an unlimited number of containers. A container can store an unlimited number of blobs. Note that the container name must be lowercase.

2. We now want to understand the difference between blobs of distinct types. Read the following link, which contains a brief explanation of the different blob types, and then select (by putting an X in the table below) which blob type is the most suitable for each use case.

Use case                                                    | Block Blob | Append Blob | Page Blob
Static content delivery                                     |            |             |
As a disk for a Virtual Machine                             |            |             |
Streaming video                                             |            |             |
Log files                                                   |            |             |
Social network events (e.g., uploading photos to Instagram) |            |             |

Solution:

Use case                                                    | Block Blob | Append Blob | Page Blob
Static content delivery                                     | X          |             |
As a disk for a Virtual Machine                             |            |             | X
Streaming video                                             | X          |             |
Log files                                                   |            | X           |
Social network events (e.g., uploading photos to Instagram) |            | X           |

Block blobs let you upload large blobs efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. You create or modify a block blob by writing a set of blocks and committing them by their block IDs.

Installing collected packages: azure-storage
Successfully installed azure-storage-0.33.0


Each block can be a different size, up to a maximum of 100 MB, and a block blob can include up to 50,000 blocks. The maximum size of a block blob is therefore slightly more than 4.75 TB (100 MB × 50,000 blocks).

An append blob is comprised of blocks and is optimized for append operations. When you modify an append blob, blocks are added to the end of the blob only, via the append_block operation. Updating or deleting existing blocks is not supported. Unlike a block blob, an append blob does not expose its block IDs.

Each block in an append blob can be a different size, up to a maximum of 4 MB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GB (4 MB × 50,000 blocks).

Page blobs are a collection of 512-byte pages optimized for random read and write operations. To create a page blob, you initialize the page blob and specify the maximum size the page blob will grow to. To add or update the contents of a page blob, you write a page or pages by specifying an offset and a range that align to 512-byte page boundaries. A write to a page blob can overwrite just one page, some pages, or up to 4 MB of the page blob. Writes to page blobs happen in-place and are immediately committed to the blob. The maximum size for a page blob is 8 TB.
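The size limits quoted above can be double-checked with a little arithmetic (assuming the usual binary interpretation, i.e. 100 MB = 100 MiB):

```python
MiB = 2**20
GiB = 2**30
TiB = 2**40

# Block blob: up to 50,000 blocks of at most 100 MB each
max_block_blob = 100 * MiB * 50_000
print(max_block_blob / TiB)   # ~4.77 -> "slightly more than 4.75 TB"

# Append blob: up to 50,000 blocks of at most 4 MB each
max_append_blob = 4 * MiB * 50_000
print(max_append_blob / GiB)  # ~195.3 -> "slightly more than 195 GB"

# Page blob: writes must align to 512-byte page boundaries
def is_valid_page_write(offset, length):
    return offset % 512 == 0 and length % 512 == 0 and length <= 4 * MiB

print(is_valid_page_write(0, 1024))   # True
print(is_valid_page_write(100, 512))  # False: offset not page-aligned
```

The `is_valid_page_write` helper is ours, purely for illustration of the 512-byte alignment rule; the service itself rejects misaligned writes.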

Step 2: Test your first container

1. Now we need to create a new container under the specified account. If a container with the same name already exists, the operation fails and returns False.

In [5]:

block_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

status = block_blob_service.create_container(CONTAINER_NAME)
# status = block_blob_service.create_container(CONTAINER_NAME,
#                                              public_access=PublicAccess.Container)  # create the container with public access
if status == True:
    print("{0}: Container created!".format(CONTAINER_NAME))
else:
    print("{0}: Container already exists".format(CONTAINER_NAME))

2. Download a file to your Jupyter virtual machine.

In [6]:

!wget https://www.vetbabble.com/wp-content/uploads/2016/11/hiding-cat.jpg -O cat.jpg

3. Finally, upload the file to the blob storage

Note: The name of the file on the local machine can differ from the name of the blob.

In [7]:

local_file = "cat.jpg"

blob_name = "picture"try: block_blob_service.create_blob_from_path(

exercise02: Container already exists

--2018-09-26 16:10:28-- https://www.vetbabble.com/wp-content/uploads/2016/11/hiding-cat.jpgResolving webproxy (webproxy)... 10.36.60.1Connecting to webproxy (webproxy)|10.36.60.1|:3128... connected.Proxy request sent, awaiting response... 200 OKLength: 136356 (133K) [image/jpeg]Saving to: ‘cat.jpg’

cat.jpg 100%[===================>] 133.16K --.-KB/s in 0.03s

2018-09-26 16:10:28 (4.22 MB/s) - ‘cat.jpg’ saved [136356/136356]


        CONTAINER_NAME,
        blob_name,
        local_file,
        content_settings=ContentSettings(content_type='image/jpg')
    )
    print(block_blob_service.make_blob_url(CONTAINER_NAME, blob_name))
except:
    print("something wrong happened when uploading the data %s" % blob_name)

4. Try to open the link above.

By default, the new container is private, so you must specify your storage access key (as you did earlier) to download blobs from this container. If you want to make the blobs within the container available to everyone, you can create the container and pass the public access level using the following code.

In [8]:

block_blob_service.set_container_acl(CONTAINER_NAME, public_access=PublicAccess.Container)

After this change, anyone on the Internet can see blobs in a public container, but only you can modify or delete them. Now, try to open the link again (it takes a few seconds for the access permissions to change).
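The URL printed by make_blob_url is simply the blob service's public endpoint; a sketch of the format it assembles (the account and container names are the ones from this exercise's sample output):

```python
def blob_url(account, container, blob):
    # Format used by the Azure Blob service public endpoint
    return "https://{}.blob.core.windows.net/{}/{}".format(account, container, blob)

print(blob_url("ex02ethbg2018", "exercise02", "picture"))
# https://ex02ethbg2018.blob.core.windows.net/exercise02/picture
```

Once the container's access policy is "Container", fetching this URL in a browser returns the blob without any key.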

5. List all blobs in the container

To list the blobs in a container, use the list_blobs method. This method returns a generator. The following code outputs the name, type, size, and URL of each blob in the container.

In [9]:

# List all blobs in the container
blobs = block_blob_service.list_blobs(CONTAINER_NAME)
for blob in blobs:
    try:
        print('Name: {} \t\t Type: {} \t\t Size: {}'.format(
            blob.name, blob.properties.blob_type, blob.properties.content_length))
        # We can also print the URL of a blob
        print(block_blob_service.make_blob_url(CONTAINER_NAME, blob.name))
    except:
        print("something went wrong")

6. Download blobs

To download data from a blob, use get_blob_to_path, get_blob_to_stream, get_blob_to_bytes, or get_blob_to_text. They are high-level methods that perform the necessary chunking when the size of the data exceeds 64 MB.
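The chunking these helpers perform internally can be sketched as splitting the byte range into fixed-size pieces (64 MB is the threshold mentioned above; the helper below is illustrative, not the library's actual code):

```python
CHUNK = 64 * 2**20  # 64 MB

def chunk_ranges(total_size, chunk=CHUNK):
    """Split [0, total_size) into (start, end) byte ranges of at most `chunk` bytes."""
    return [(start, min(start + chunk, total_size))
            for start in range(0, total_size, chunk)]

# A 150 MB blob would be fetched in three ranged requests: 64 MB + 64 MB + 22 MB
print(chunk_ranges(150 * 2**20))
```

Each range would then be fetched with an HTTP Range request and the pieces written to the target file in order.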

Note: The name of the file after downloading can differ from the name of the blob.

The following example demonstrates using get_blob_to_path to download the content of your container and store it under names file_i, where i is a sequential index.

In [10]:

LOCAL_DIRECT = "./"

blobs = block_blob_service.list_blobs(CONTAINER_NAME)
i = 0
for blob in blobs:
    local_file = LOCAL_DIRECT + 'file_{}'.format(str(i))

https://ex02ethbg2018.blob.core.windows.net/exercise02/picture

Out[8]:

<azure.storage.blob.models.ResourceProperties at 0x7f97513ac5c0>

Name: picture 		 Type: BlockBlob 		 Size: 136356
https://ex02ethbg2018.blob.core.windows.net/exercise02/picture


    i = i + 1
    try:
        block_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
        print("success for %s" % local_file)
    except:
        print("something went wrong with the blob: %s" % blob.name)

Step 3: Using the REST API

REpresentational State Transfer (REST), or RESTful, web services provide interoperability between computer systems on the Internet. REST-compliant web services allow requesting systems to access and manipulate textual representations of web resources by using a uniform and predefined set of stateless operations.

The most popular operations in REST are GET, POST, PUT, and DELETE. A response may be in XML, HTML, JSON, or some other format. You can find the Azure Blob Service API description in the following link, and the HTTP response codes defined by the World Wide Web Consortium (W3C) here.

We could use tools like Postman, cURL, or others to make REST requests to Azure Storage. In this tutorial we will use Reqbin, a simple and effective website for posting HTTP requests online.
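Whichever tool is used, the List Blobs request is just a GET on the container URL with two query parameters; a sketch of how the request URL is assembled (myaccount/mycontainer are placeholders, as in the task below):

```python
from urllib.parse import urlencode

def list_blobs_url(account, container):
    # GET <container URL>?restype=container&comp=list  (the "List Blobs" operation)
    base = "https://{}.blob.core.windows.net/{}".format(account, container)
    return base + "?" + urlencode({"restype": "container", "comp": "list"})

print(list_blobs_url("myaccount", "mycontainer"))
# https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list
```

Issuing a GET on this URL (once the container is public) returns the XML listing shown later in this exercise.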

Task 1.

1. Use any tool to list all blobs in your container, using the following REST request. To avoid setting up authentication, you can make your container public by changing the access policy to "Container". See the pictures below:

success for ./file_0


2. Explain why the request above does not include a body.
3. What is the response format of the request?

Solution: Using Reqbin.

You can use the following URI, replacing myaccount with the storage account name you chose and mycontainer with the container we set up in this exercise.

https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list

Solution2: Using curl.

In [1]:

!curl "https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list"

1. We are performing a GET request, which does not have a body.
2. XML
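Since the response body is XML, it can be inspected with the standard library; a sketch using a trimmed-down version of the EnumerationResults document this container returns:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample of a List Blobs response body
xml_body = """<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="https://myaccount.blob.core.windows.net/exercise02">
  <Blobs>
    <Blob>
      <Name>picture</Name>
      <Properties>
        <Content-Length>136356</Content-Length>
        <BlobType>BlockBlob</BlobType>
      </Properties>
    </Blob>
  </Blobs>
</EnumerationResults>"""

root = ET.fromstring(xml_body)
names = [blob.findtext("Name") for blob in root.iter("Blob")]
print(names)  # ['picture']
```

The full response (shown further below) carries more properties per blob, such as ETag, Content-MD5, and lease status.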

Exercise 4. Key-Value Vector Clocks

As pointed out in the lecture, the concepts are clearly explained in the Dynamo paper: DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store". In SOSP '07 (Vol. 41, p. 205). DOI

Task 1

Multiple distributed hash tables use vector clocks for capturing causality among different versions of the same object. In Amazon's Dynamo, a vector clock is associated with every version of every object.

Let VC be an N-element array of non-negative integers, initialized to 0, representing the N logical clocks of the N processes (nodes) of the system. VC gets its jth element incremented by one every time node j performs a write operation. Moreover, VC(x) denotes the vector clock of a write event x, and VC(x)z denotes the element of that clock for node z.

Try to formally define the partial ordering that we get from using vector clocks.

Solution:

VC(x) ≤ VC(y) ⟺ ∀z[VC(x)z ≤ VC(y)z]
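This partial order translates directly into code; a minimal sketch (the function names are ours):

```python
def leq(x, y):
    """VC(x) <= VC(y): every component of x is <= the corresponding component of y."""
    return all(a <= b for a, b in zip(x, y))

def concurrent(x, y):
    """Two versions are concurrent (in conflict) if neither dominates the other."""
    return not leq(x, y) and not leq(y, x)

print(leq([0, 0, 1], [0, 1, 1]))         # True: the second version supersedes the first
print(concurrent([1, 1, 1], [0, 2, 1]))  # True: neither dominates -> conflict
```

It is only a partial order precisely because `concurrent` pairs exist: such versions must be kept side by side until reconciled.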

Task 2

<?xml version="1.0" encoding="utf-8"?><EnumerationResults ContainerName="https://ex02ethbg2018.blob.core.windows.net/exercise02"><Blobs><Blob><Name>picture</Name><Url>https://ex02ethbg2018.blob.core.windows.net/exercise02/picture</Url><Properties><Last-Modified>Wed, 26 Sep 2018 16:10:34 GMT</Last-Modified><Etag>0x8D623CA975341E1</Etag><Content-Length>136356</Content-Length><Content-Type>image/jpg</Content-Type><Content-Encoding /><Content-Language /><Content-MD5>8ZGxCm3vBjOkcN4SbNK2IQ==</Content-MD5><Cache-Control /><BlobType>BlockBlob</BlobType><LeaseStatus>unlocked</LeaseStatus></Properties></Blob></Blobs><NextMarker /></EnumerationResults>


The vector clock antisymmetry property is defined as follows:

if VC(x) < VC(y), then ¬ (VC(y) < VC(x))

Prove this property.

Solution:

For VC(x) < VC(y) to hold, VC(x) has to be strictly smaller than VC(y). Thus, the partial-order definition can be extended as follows:

VC(x) < VC(y) ⟺ ∀z[VC(x)z ≤ VC(y)z] ∧ ∃z′[VC(x)z′ < VC(y)z′]

Assume the hypothesis VC(x) < VC(y) holds. By the definition above, there exists an element z′ for which VC(x)z′ < VC(y)z′.

Now suppose, for contradiction, that VC(y) < VC(x) also holds. Then ∀z[VC(y)z ≤ VC(x)z], and in particular VC(y)z′ ≤ VC(x)z′, which contradicts VC(x)z′ < VC(y)z′. Hence ¬(VC(y) < VC(x)).
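The argument can also be checked mechanically by enumerating small clocks; a brute-force sketch of the antisymmetry property:

```python
from itertools import product

def lt(x, y):
    """Strict order: x <= y componentwise and strictly smaller in at least one entry."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

# Check antisymmetry over all pairs of 3-element clocks with entries in {0, 1, 2}
clocks = list(product(range(3), repeat=3))
assert all(not lt(y, x) for x in clocks for y in clocks if lt(x, y))
print("antisymmetry holds on all", len(clocks) ** 2, "pairs")
```

This is of course no substitute for the proof, just a sanity check on a finite domain.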

Task 3

Consider j servers in a cluster where Sj denotes the jth node. In this exercise we adopt a slightly modified notation from the Dynamo paper:

while the Dynamo paper indicates the writing server on the edge, we write it before the colon
for brevity, we index servers by position and omit the server name in the vector clock

For example, aa ([S0,0],[S1,4]) with S1 as the writing server becomes S1 : aa ([0,4]). So, given the following version evolution DAG for a particular object, complete the vector clocks computed at the corresponding versions.
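Under this notation, a write by server j simply increments the jth entry of the clock; a sketch (the helper name is ours):

```python
def write(clock, server_index):
    """Return the vector clock after server `server_index` performs a write."""
    new_clock = list(clock)
    new_clock[server_index] += 1
    return new_clock

# S1 : aa ([0,4]) -> another write by S1 yields [0,5]
print(write([0, 4], 1))  # [0, 5]
```

Filling in the DAG amounts to applying `write` along each edge, using the index of the server noted before the colon.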


Solution:

Task 4

When a get request comes in to Amazon Dynamo with some key, then:

The coordinator node (selected from the preference list as the top node for this key) takes care of this request.


The coordinator node requests the value from the relevant nodes (itself plus the next N-1 healthy ones on the preference list) and receives a whole set of versions of the value associated with the key, modelled as value (vector clock) pairs such as a ([1, 3, 2]).

Task 4.1

Given the following list of versions, draw the version DAG that the coordinator node will build for returning the available versions.

1 ([0,0,1])
1 ([0,1,1])
2 ([1,1,1])
3 ([0,2,1])
10 ([1,3,1])

Solution: We could use the following rules as guidance for creating correct version DAGs.

If no partial ordering exists between nodes, then those nodes should be on the same level, i.e., they are both valid versions.
If there is an edge between two nodes, then the "father" node should be smaller than the child.
You cannot have skip connections, i.e., there cannot be an edge from a grandfather node directly to a child node.
Transitive edges shouldn't be present in the version DAG.
Each edge represents the update of exactly one entry of the vector clock.
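These rules can be applied mechanically; a sketch that derives the edges for the Task 4.1 versions (an edge changes exactly one entry of the clock, and transitive edges are dropped):

```python
def direct_edge(u, v):
    # v must dominate u and the two clocks must differ in exactly one entry
    dominated = all(a <= b for a, b in zip(u, v)) and u != v
    return dominated and sum(a != b for a, b in zip(u, v)) == 1

versions = [(0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 2, 1), (1, 3, 1)]
edges = {(u, v) for u in versions for v in versions if direct_edge(u, v)}

# Remove transitive edges: drop u->v if some w gives u->w and w->v
edges = {(u, v) for (u, v) in edges
         if not any((u, w) in edges and (w, v) in edges for w in versions)}

for u, v in sorted(edges):
    print(u, "->", v)
```

This reproduces the expected DAG: [0,0,1] → [0,1,1], which branches into the concurrent versions [1,1,1] and [0,2,1], with [1,3,1] following [1,1,1].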

Task 4.2

Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.

a ([1,0,0])
b ([0,1,0])
c ([2,1,0])
d ([2,1,1])
e ([3,1,1])
f ([2,2,1])
g ([3,1,2])
h ([3,2,3])
i ([4,2,2])
j ([5,2,2])
k ([4,3,3])
l ([5,2,3])
m ([5,4,3])
n ([6,3,3])
o ([6,4,4])

Solution:


Task 4.3

Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.

a ([1,0,0,0])
b ([0,0,0,1])
aa ([0,0,1,0])
bb ([0,1,0,0])
c ([1,2,0,1])
cc ([0,1,1,2])
d ([1,3,0,1])
f ([1,2,1,3])
e ([2,1,1,2])
g ([2,2,2,3])

Solution:


Task 5 (Optional)

Consider j servers in a cluster where Sj denotes the jth node. The following table denotes the execution of a series of get/put operations. Each line of the table represents the events that happen at the same time. For example, at time 0 (t0), servers S1 and S3 perform operations. Complete the values and vector clocks of each get operation's context, where each context contains (list_values, vector_clock).

S0 S1 S2 S3 S4

t0: Get(1) → _______________, Cxt1 (_______________); Put(1, Cxt1, "a")        Get(1) → _______________, Cxt2 (_______________); Put(1, Cxt2, "bb")
t2: Get(1) → _______________, Cxt4 (_______________); Put(1, Cxt4, "rr")        Get(1) → _______________, Cxt5 (_______________); Put(1, Cxt5, "dd")
t4: Get(1) → _______________, Cxt9 (_______________); Put(1, Cxt9, "ccc")        Get(1) → _______________, Cxt10 (_______________); Put(1, Cxt10, "dd")
t5: Get(1) → _______________, Cxt11 (_______________); Put(1, Cxt11, "fff")

The DAG below shows the interaction among nodes when retrieving values. You can use it for determining the expected values.

Solution:

S0 S1 S2 S3 S4

t0: Get(1) → (), Cxt1; Put(1, Cxt1, "a")        Get(1) → (), Cxt2; Put(1, Cxt2, "bb")
t2: Get(1) → [a, bb], Cxt4 ([0,1,0,1,0]); Put(1, Cxt4, "rr")        Get(1) → [a, bb], Cxt5 ([0,1,0,1,0]); Put(1, Cxt5, "dd")
t4: Get(1) → [rr], Cxt9 ([1,1,0,1,0]); Put(1, Cxt9, "ccc")        Get(1) → [dd], Cxt10 ([0,2,0,1,0]); Put(1, Cxt10, "dd")
t5: Get(1) → [ccc, dd], Cxt11 ([1,2,1,2,0]); Put(1, Cxt11, "fff")

Exercise 5. Merkle Trees

A hash tree or Merkle tree is a binary tree in which every leaf node gets as its label a data block and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. Some key-value stores use Merkle trees for detecting inconsistencies in data between replicas, because differences in the stored data blocks can be found efficiently, which reduces the amount of data that needs to be transferred to compare them. This works by first exchanging the root hash and comparing it with one's own; if it does not match the one the replica contains, the children of the node are retrieved and their hashes compared. If the hash of a node matches, its children are not fetched because they are in sync; if it does not match, its children are fetched and compared.
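The exchange described above can be sketched as a recursive comparison of two hash trees; a minimal sketch over a power-of-two number of blocks (hashlib for the hashes, trees stored as nested tuples; all names are ours):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build(blocks):
    """Build a Merkle tree over a power-of-two list of blocks as (hash, left, right)."""
    nodes = [(h(b), None, None) for b in blocks]
    while len(nodes) > 1:
        nodes = [(h((l[0] + r[0]).encode()), l, r)
                 for l, r in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def _depth(t):
    return 0 if t[1] is None else 1 + _depth(t[1])

def diff(a, b, lo=0, hi=None):
    """Return indices of leaf blocks that differ, descending only on hash mismatch."""
    if hi is None:
        hi = 2 ** _depth(a)
    if a[0] == b[0]:
        return []      # subtrees in sync: children are not fetched
    if a[1] is None:
        return [lo]    # differing leaf -> this data block must be transferred
    mid = (lo + hi) // 2
    return diff(a[1], b[1], lo, mid) + diff(a[2], b[2], mid, hi)

t1 = build([b"A", b"B", b"C", b"D"])
t2 = build([b"A", b"B", b"C", b"X"])   # replicas differ in the last block
print(diff(t1, t2))  # [3]
```

Only O(log n) hashes are compared per differing block, which is exactly the saving the exercise text describes.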

Task 1

The two pictures below depict two Merkle trees, each belonging to one of two different replicas, where each tree node and leaf is labeled with a different letter. However, these replicas have run out of sync. Specify for each pair of trees whether it is a possible configuration, and which nodes (and, if applicable, which chunks of files) have to be exchanged to sync the trees.

Solution 1:

Node A (root hash) is exchanged. Node A is not the same.
Fetch node A's children.
Nodes B and C are exchanged. Node B is the same, so its children are not exchanged, but node C is not.
Fetch node C's children.
Nodes F and G are exchanged. Node F is the same, so its children are not exchanged, but node G is not.
Fetch node G's children.
Nodes N and O are exchanged. They both differ, so the data blocks represented by the leaves' hashes are transferred.


Solution 2:

The configuration of the trees is theoretically impossible, because the hashes of node C match while one of its children (G) differs.

Exercise 6. Virtual nodes

Virtual nodes were introduced to avoid assigning data in an unbalanced manner and to cope with hardware heterogeneity by taking the physical capacity of each server into consideration.

Let us assume we have ten servers (i1 to i10), each with the following amount of main memory: 8 GB, 16 GB, 32 GB, 8 GB, 16 GB, 0.5 TB, 1 TB, 0.25 TB, 10 GB, 20 GB. Calculate the number of virtual nodes/tokens each server should get according to its main memory capacity if we want a total of 256 virtual nodes/tokens.

(Just for the purpose of the exercise: if you get a fractional number of virtual nodes, always round up, even if the total number of nodes then exceeds 256.)

Solution:

The total amount of main memory available is 1902 GB. So if we want a total of 256 virtual nodes/tokens, each should correspond to ~7.43 GB. Thus, the physical servers i1 to i10 should get around 1, 2, 4, 1, 2, 69, 138, 34, 1, and 3 virtual nodes/tokens, respectively.
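The computation can be reproduced as follows; note that the counts listed above actually match nearest-integer rounding rather than the strict round-up the task statement asks for (with rounding up, the smallest servers would get 2 tokens instead of 1):

```python
capacities_gb = [8, 16, 32, 8, 16, 512, 1024, 256, 10, 20]
total_gb = sum(capacities_gb)

# Assign tokens proportionally to capacity, rounding to the nearest integer
tokens = [round(c * 256 / total_gb) for c in capacities_gb]
print(total_gb, tokens, sum(tokens))
# 1902 [1, 2, 4, 1, 2, 69, 138, 34, 1, 3] 255
```

With a 256-token ring, each token thus covers roughly 1902 / 256 ≈ 7.43 GB of capacity.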

In general, whether one rounds up or down to the nearest integer is not decisive, as virtual nodes are positioned in random parts of the ring anyway. However, one thing to take into consideration in practice is that the server with the smallest capacity should have more than a single virtual node, because it might be "unlucky" and have to handle a large domain of responsibility.