From Big Data to Data Lakes (2015)

Copyright Mark Whitehorn



It’s all about me…

Prof. Mark Whitehorn
Emeritus Professor of Analytics
School of Science and Engineering (Computing)
University of Dundee

Consultant
Writer (author)


It’s all about me…

Computing

Teach Masters in:

Data Science
Data Engineering

• Full/Part time
• Remote learning
• Aimed at existing data professionals


From big data to data lakes


“Not waving but drowning” Stevie Smith



• Atomic
• Relational
• Schema
• Schema first, Early binding schema, Schema-on-write
• Schema last, Late binding schema, Schema-on-read
• Schema-less storage
• HDFS, MapReduce, Hadoop
• Sparse data
• Key Value Pairs (KVPs)
• NoSQL
• Document – MongoDB
• Column store – Cassandra
• Graph – Neo4j
• JSON
• Nodes, Edges
• Data lake


What is Big Data?

Data has always existed in two very broad flavours…

1. Data that is inherently atomic and is a good fit with the relational way of storing and querying data

2. Data that is not as above


Taxonomy of Data

[Diagram: Data divides into two branches, Tabular/Relational and Big.]


“Small” Data – relational data





Data is stored in tables

LicenceNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red




Car

Each table has a name




Car

Data is atomic

What does this mean?





We have sub-divided it to the point where we can query it satisfactorily: each cell holds a single value, such as one make or one colour.




Columns

Car



LicenceNo  Make     Model     Year  Colour
CER 162 C  Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Red

Car

Each row represents a unique entity in the ‘real’ world with a set of attributes (the columns)…

Rows



Manipulation typically consists of sub-setting the data by rows and columns and then doing some sums





Indeed, SQL (the language of relational databases) simply allows us to sub-set the data by column and by row



SELECT Make, Colour
FROM Car
WHERE Make = 'Bentley';


Make     Colour
Bentley  Black
Bentley  Red

Note that this is easy because the data is atomic
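To try the query for real, here is a minimal sketch using Python's built-in sqlite3 module. The table name Car and the three rows come from the slides; the choice of SQLite, the column types and the ORDER BY clause (added only to make the row order predictable) are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding the Car table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (LicenceNo TEXT PRIMARY KEY, Make TEXT, Model TEXT, "
    "Year INTEGER, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162 C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk. VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk. VI", 1949, "Red"),
    ],
)

# Sub-set by column (Make, Colour) and by row (Make = 'Bentley').
bentleys = conn.execute(
    "SELECT Make, Colour FROM Car WHERE Make = ? ORDER BY Year", ("Bentley",)
).fetchall()
print(bentleys)
```

Because every value is atomic, the WHERE clause can compare a whole cell with 'Bentley' without any parsing.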


Data is stored in tables

Note also that what we have done is to first define a shape for the data – a ‘schema’ – and then fit the data into that schema. The schema includes our decisions about atomicity and is based on entities and attributes, not querying.

LicenceNo Make Model Year Colour



Data is stored in tables

So we have to do hard work at the beginning to define the schema, but thereafter the querying is very easy.

This is called:

• Schema first

• Early binding schema

• Schema-on-write
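As a rough illustration of schema-on-write, the sketch below fixes the schema before any data arrives and lets the database reject rows that do not fit. It uses Python's sqlite3 module; the NOT NULL constraints and the helper name try_insert are assumptions chosen for the example, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed before any data arrives (schema-on-write).
# The NOT NULL constraints are an illustrative choice.
conn.execute(
    "CREATE TABLE Car ("
    "LicenceNo TEXT PRIMARY KEY, "
    "Make TEXT NOT NULL, "
    "Model TEXT NOT NULL, "
    "Year INTEGER NOT NULL, "
    "Colour TEXT NOT NULL)"
)

def try_insert(row):
    """Return True if the row fits the schema, False if the database rejects it."""
    try:
        conn.execute("INSERT INTO Car VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

ok_valid = try_insert(("EF 8972", "Bentley", "Mk. VI", 1946, "Black"))
ok_missing_model = try_insert(("CER 162 C", "Triumph", None, 1965, "Green"))
ok_duplicate_key = try_insert(("EF 8972", "Bentley", "Mk. VI", 1946, "Black"))
print(ok_valid, ok_missing_model, ok_duplicate_key)
```

The hard work (deciding columns, types and constraints) happens once, at the start; every later write is checked against it.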


Big Data





Examples

Suppose we want to analyse images, video or audio.

These queries are not about sub-setting atomic data.

“Which of these audio files has been plagiarised?”

“Find the logs.”


Sawmills


Examples

Is this picture big data?

It is if the driver sends the image from his phone to the sawmill.


Examples

Trim (the picture, not the logs).


Examples

Find the logs


Examples

Count (53) and measure diameters


Examples

Driver: Olaf Smithson
Location: Adelsburg
Road distance: 235 km
Journey Time: 6:23
Break Due: 2:30
53 Logs
Load classification: Medium
Schedule: 10:20 AM tomorrow

Log No.  Diameter
1        35
2        23
3        68
4        45
5        23
…        …
53       23


Examples

How do we store and process this?

One answer is simply to store the raw files in a file system. This is no more complex than copying the files to a folder on your hard disk in Windows.

This is not the same as putting the data into a database.


Schema

Note that we are not even attempting to apply a schema at this stage. We are going to need a schema in order to query the data, but we apply it at query time, not when we store. So this is called:

• Schema last

• Late binding schema

• Schema-on-read

• Schema-less storage
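A small sketch of schema-on-read, under assumed data: the raw lines are stored untouched, and a schema is applied only when a question is asked. The log lines, the four-field layout and the function name read_with_schema are all hypothetical.

```python
# Raw records are stored exactly as they arrived: no schema applied on write.
raw_store = [
    "2016-01-01 11:23:05 10.0.0.7 /home",
    "2016-01-01 11:23:40 10.0.0.7 /account/close",
    "corrupt line that fits no schema",
    "2016-01-01 11:24:02 10.0.0.9 /home",
]

def read_with_schema(line):
    """Apply a (date, time, client_ip, page) schema to one raw line at query time."""
    parts = line.split()
    if len(parts) != 4:
        return None  # the line does not fit this particular schema
    date, time, client_ip, page = parts
    return {"date": date, "time": time, "client_ip": client_ip, "page": page}

# The schema is bound late, as part of the query itself.
parsed = [rec for rec in map(read_with_schema, raw_store) if rec is not None]
home_visitors = sorted({rec["client_ip"] for rec in parsed if rec["page"] == "/home"})
print(home_visitors)
```

Nothing was lost on the way in (even the corrupt line is still in raw_store), but every query has to bring its own parsing code.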


Forward reference to data lakes

In general, data lakes hold schema-less data.


Schema

Good news – it is very easy to store the data and no information can possibly have been lost.

Bad news – querying the data is now harder because no schema has yet been applied. So every query has to be hand-crafted in a programming language such as Java or C#.


HDFS & MapReduce

Few people would store the files in Windows, because there are file systems that are optimised to:

• store files redundantly on cheap, commodity servers (PCs)

• allow the files to be analysed in parallel.


HDFS & MapReduce

An example of such a store is HDFS, the Hadoop Distributed File System.


HDFS & MapReduce

MapReduce is a ‘programming framework’ – all that means is that you (or your programmers) can write programs within it that query the files you have stored in HDFS.
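To give a feel for the shape of a MapReduce program, here is a single-process sketch in plain Python. It is not Hadoop's API: the file contents and the function names map_phase, shuffle and reduce_phase are hypothetical, and in a real cluster the map and reduce steps would run in parallel on many servers.

```python
from collections import defaultdict

# Hypothetical stand-ins for files stored in HDFS: each holds requested pages.
files = {
    "log1.txt": ["/home", "/products", "/home"],
    "log2.txt": ["/home", "/account/close"],
}

def map_phase(lines):
    """Map: emit a (key, 1) pair for every page request in one file."""
    return [(page, 1) for page in lines]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Each file could be mapped on a different machine; here we simply loop.
mapped = [pair for lines in files.values() for pair in map_phase(lines)]
page_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(page_counts)
```

The point of the split is that map_phase needs only one file at a time and reduce_phase needs only one key at a time, which is what makes the work easy to spread across commodity hardware.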


Hadoop

Good news (again): you can run HDFS and MapReduce on tens, hundreds or thousands of cheap, commodity servers and achieve very safe, cost-effective storage and fast querying.

Together HDFS and MapReduce are

Wyld Stallyns Hadoop!


Examples

But images, audio, video etc. are only one class of big data. Some big data is tabular and atomic. It is the way we want to analyse it that makes it big.

For example, if we look for order in the data.


Big Data often has order

We often want to look for order:

• over time

• identify the best customer

• between rows

• Find the peak in the graph

• Follow the customer in a log file


Relational databases are bad at order

Relational databases are poor at looking for order between rows

• Really, really, shockingly terrible


Examples

• Log file

Order Order!


Field

Date

Time

Client IP Address

User Name

Service Name

Server Name

Server IP Address

etc.

Order Order!


Order Order!

But suppose we want to understand how people use our website to, for example, close an account; in order, ultimately, to understand WHY they close the account. This is no longer a question of sub-setting the data by row and column; it requires working out the order in which specific people visit specific pages.
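A sketch of what "working out the order" can look like in code, using a tiny invented log. The users, times and page names are hypothetical; the point is only that the answer depends on the sequence of rows, not on any single row.

```python
from itertools import groupby
from operator import itemgetter

# (time, user, page) log entries; hypothetical values, deliberately out of order.
log = [
    ("11:25", "sally", "/account/close"),
    ("11:23", "mark", "/home"),
    ("11:21", "sally", "/home"),
    ("11:24", "sally", "/fees"),
    ("11:26", "mark", "/products"),
]

# Sort by user, then time, so each user's rows sit together in visit order.
ordered = sorted(log, key=itemgetter(1, 0))

# Build each user's path: the ORDER of the rows is the information we want.
paths = {
    user: [page for _, _, page in rows]
    for user, rows in groupby(ordered, key=itemgetter(1))
}

# Which page did each person see immediately before closing their account?
page_before_close = {}
for user, path in paths.items():
    if "/account/close" in path:
        position = path.index("/account/close")
        if position > 0:
            page_before_close[user] = path[position - 1]

print(paths)
print(page_before_close)
```

Expressing "the row just before this one, for the same user" is straightforward in a general-purpose language and considerably more awkward as a row-and-column sub-setting query.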




Relational databases are also bad with sparse data


Going back to relational data for a moment


Car

Columns

Rows

All entities have the same set of attributes, and only one of each.

In practical terms we could also say that each row will have data for each column.


Going back to relational data for a moment

LicenceNo  Make     Model     Year  Colour
CER 162 C  Triumph            1965  Green
EF 8972    Bentley  Mk. VI    1946  Black
YSK 114    Bentley  Mk. VI    1949  Blue/Red

Car

Columns

Rows

Nulls are tolerated, but frowned upon:

all cars should have a model.

Duplicates are not tolerated:

a car cannot have more than one colour.


Examples

But some data simply isn’t like this. Think about:

• Sensor data (different sensors send different data)

• Customer interactions


This particular data sits poorly in a table. But note that each reading can be identified by the column name.

So we could store, for each row, only the columns that do have data.

Nulls can be common

SensorID    Manufacturer  TimeDate        Pressure  Humidity  Temp  Wind  Depth  And so on
213342332   34            1/1/2016:11:23                      23
2-BSDEFF76  12            2016/1/1:11:34  1034                            12

[
  {
    "SensorID": "213342332",
    "Manufacturer": "34",
    "TimeDate": "1/1/2016:11:23",
    "Temp": "23"
  },
  {
    "SensorID": "2-BSDEFF76",
    "Manufacturer": "12",
    "TimeDate": "2016/1/1:11:34:43",
    "Pressure": "1034",
    "Depth": "12"
  }
]

[Slide build: the same two records are repeated with labels marking, in each pair, the name on the left (for example "SensorID") as the Key and the entry on the right (for example "213342332") as the Value.]

Nulls can be common

Key Value Pairs (KVPs) are a very effective way of storing sparse data (data where we expect a large number of nulls).

They are also excellent in cases where we know the data that is collected will vary over time.
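The saving can be counted directly. The sketch below holds the two sensor readings from the slides as Python dictionaries (one simple form of key value pairs) and compares them with the same data forced into a fixed set of columns. The column list is taken from the slide's table header; the variable names are illustrative.

```python
COLUMNS = ["SensorID", "Manufacturer", "TimeDate", "Pressure",
           "Humidity", "Temp", "Wind", "Depth"]

# Key-value form: each reading stores only the keys it actually has.
readings = [
    {"SensorID": "213342332", "Manufacturer": "34",
     "TimeDate": "1/1/2016:11:23", "Temp": "23"},
    {"SensorID": "2-BSDEFF76", "Manufacturer": "12",
     "TimeDate": "2016/1/1:11:34", "Pressure": "1034", "Depth": "12"},
]

# Tabular form: every row needs a cell for every column, so gaps become None.
table = [[reading.get(col) for col in COLUMNS] for reading in readings]

cells_in_table = sum(len(row) for row in table)
nulls_in_table = sum(cell is None for row in table for cell in row)
values_stored_as_kvp = sum(len(reading) for reading in readings)
print(cells_in_table, nulls_in_table, values_stored_as_kvp)
```

A new kind of sensor that reports, say, a previously unseen measurement simply adds a new key to its own readings; no existing record has to change.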


Different classes of NoSQL tools

NoSQL Database engines:

These are database engines that are not relational and happen to be very good at holding and manipulating big data.



• Hadoop – is it a database?

• Key Value Pair databases such as:

    • Dynamo

    • SimpleDB

In fact, I would sub-divide KVP NoSQL systems into:

• Document – MongoDB, CouchDB

• Column store/table – Cassandra, HBase, Bigtable

• Graph – Neo4j


[Diagram: Big Data divides into Audio/Images and Atomic/KVP; Atomic/KVP divides into Document (MongoDB), Column Store (Cassandra) and Graph (Neo4j).]




We’ll look at a few of the features of these three classes simply to illustrate WHY there is such a proliferation of NoSQL database engines.

All have their pros and cons.


NoSQL database systems

• MongoDB – document

• Cassandra – column

• Neo4j – graph


Document databases: MongoDB


MongoDB

MongoDB uses KVPs in what are called collections.

A collection in some ways resembles a table in a relational database.

So, in a relational database, we would have a table called employee and a row for Mark and another for Sally.

In MongoDB we would have a collection called Employee, a document for Mark and another for Sally.


MongoDB

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing"
}

Each entry is a key (for example "First Name") paired with a value (for example "Mark").

Documents in a collection do not have to store the same information.


MongoDB

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing"
},
{
  "_id": ObjectId("12345a645ab4dad6456"),
  "First Name": "Sally",
  "Last Name": "Jones",
  "Date of Birth": "02-07-1953",
  "Dept": "Computing"
}

The above is a collection of two documents


MongoDB

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing",
  "Addresses": [
    { "Street": "23 Acacia Gdns", "City": "London" },
    { "Street": "Penguin Towers", "City": "Hereford" }
  ]
}

Fields can have multiple values within one document (multi-valued fields)


MongoDB

MongoDB stores documents as BSON, a binary-encoded serialisation of JSON-like documents; in practice you write and read them in a JSON-like text form.


MongoDB

Query and output

Find all employees with the first name of Mark.

db.Employee.find( { "First Name": "Mark" } );

Find everything and sort by first name.

db.Employee.find().sort( { "First Name": 1 } );
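To make clear what these two queries do, here is an imitation in plain Python over a list of dictionaries. It is not MongoDB or its driver: the helper functions find and sort are hypothetical stand-ins written for this sketch, and only simple equality matching is covered.

```python
# The two documents from the slides, held as plain Python dicts.
employees = [
    {"First Name": "Sally", "Last Name": "Jones",
     "Date of Birth": "02-07-1953", "Dept": "Computing"},
    {"First Name": "Mark", "Last Name": "Whitehorn", "Dept": "Computing"},
]

def find(collection, criteria=None):
    """Return the documents whose fields equal every key/value in criteria."""
    criteria = criteria or {}
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]

def sort(docs, field, direction=1):
    """Sort documents by one field; 1 is ascending, -1 is descending."""
    return sorted(docs, key=lambda doc: doc.get(field, ""),
                  reverse=(direction == -1))

marks = find(employees, {"First Name": "Mark"})
by_first_name = sort(find(employees), "First Name")
print([doc["Last Name"] for doc in marks])
print([doc["First Name"] for doc in by_first_name])
```

Note that the match works even though the two documents do not share the same set of keys, which is the property that makes document stores comfortable with incomplete data.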


MongoDB

Pros and cons

Very good with data:

• which is incomplete

• where the structure may change over time

• where there is a great deal of data of a “reasonably similar kind”

• where the queries are reasonably simple requests to find documents that match criteria.

Not good where:

• the data is not suitable

• there are potential relationships (of various kinds) between collections (think graph and relational)


Column Store/Table: Cassandra


Cassandra

Cassandra uses a partitioned row store.

Designed to partition the data across commodity servers.

A database consists of a set of column families.

Each column family is a set of key-value pairs.

Schema.

Data is modelled on the design of the queries that you want to run, not on entities, attributes and relationships.


Cassandra

Query and output.

The schema in Cassandra is designed to answer the questions that you have from a single table, so there are no joins between tables. The queries are simple subsetting operations run against single tables.

SELECT FName, Job FROM Employee WHERE key IN (19, 401, 617);

SELECT COUNT(*) FROM Sales;
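The idea of designing the table around the query can be sketched with a plain Python dictionary standing in for a column family. This is not Cassandra's API; the keys 19, 401 and 617 come from the query above, while the names, jobs and the select helper are invented for illustration.

```python
# One "table" per query: rows are stored under the key the query will ask for.
# All values here are hypothetical.
employee_by_key = {
    19: {"FName": "Mark", "Job": "Professor"},
    401: {"FName": "Sally", "Job": "Lecturer"},
    617: {"FName": "Gerry"},  # sparse: no Job stored for this row
    702: {"FName": "Olaf", "Job": "Driver"},
}

def select(table, columns, keys):
    """Sub-set by row (keys) and by column, like SELECT cols ... WHERE key IN (...)."""
    return [
        {col: table[key].get(col) for col in columns}
        for key in keys
        if key in table
    ]

rows = select(employee_by_key, ["FName", "Job"], [19, 401, 617])
row_count = len(employee_by_key)  # the analogue of SELECT COUNT(*)
print(rows)
print(row_count)
```

Every lookup goes straight to a key in a single table, so there is nothing to join; a question that needs a different key would get its own table, designed for that question.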


Cassandra

Pros and cons

Very good with data:

• which is incomplete

• where the structure may change over time

• where there is a great deal of data of a “reasonably similar kind”

• where the queries are reasonably simple requests to subset data by row and by column.

Not good:

• for everything else

• where there are potential relationships (of various kinds) between tables (think graph and relational)


Graph databases

Neo4j-specific slides

Slides courtesy of Gerry McNicol (@gerrymcnicol)

[Diagram: a travel graph. Nodes: Exeter, London, S'hampton, Bristol and Taunton, carrying station-code properties (stn: esd, trs, ssm, btm, lpad). Relationships: TRAIN, BUS and HORSE, carrying properties such as time (35, 120, 37, 34, 31, 65, 45, 453), busco: mega and name: buttercup.]

What is a Graph?

• Made up of Nodes and Relationships (Edges)

• Nodes are connected by Relationships

• Every Relationship has ...

• a starting and ending Node

• a direction

• Both Nodes and Relationships can have properties.

• Very flexible data structure


Graph – Neo4J

• It stores KVP data in both nodes and edges

• Both are equally important

• There is no need for nodes (or edges) to store the same data
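A minimal sketch of a property graph in plain Python, loosely based on the travel example. It is not Neo4j or its query language. The city names, relationship types and property names come from the slide; which time belongs to which journey, and which station code belongs to which city, cannot be recovered from the slide text, so those pairings are assumptions made for illustration.

```python
import heapq

# Nodes with key-value properties (city/code pairing is an assumption).
nodes = {
    "Exeter": {"stn": "esd"},
    "Taunton": {"stn": "trs"},
    "Bristol": {"stn": "btm"},
    "S'hampton": {"stn": "ssm"},
    "London": {"stn": "lpad"},
}

# Directed relationships: (start node, end node, properties).
# The assignment of times to journeys is hypothetical.
edges = [
    ("Exeter", "Taunton", {"type": "TRAIN", "time": 35}),
    ("Taunton", "Bristol", {"type": "TRAIN", "time": 34}),
    ("Bristol", "London", {"type": "TRAIN", "time": 120}),
    ("Exeter", "S'hampton", {"type": "BUS", "time": 65, "busco": "mega"}),
    ("S'hampton", "London", {"type": "TRAIN", "time": 45}),
    ("Exeter", "London", {"type": "HORSE", "time": 453, "name": "buttercup"}),
]

def quickest(start, goal):
    """Dijkstra's algorithm over the 'time' property of each relationship."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return total, path
        if node in seen:
            continue
        seen.add(node)
        for src, dst, props in edges:
            if src == node and dst not in seen:
                heapq.heappush(queue, (total + props["time"], dst, path + [dst]))
    return None

print(quickest("Exeter", "London"))
```

Notice that the edges carry as much information as the nodes, and that the bus and horse relationships hold properties (busco, name) that the train relationships do not need.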


Graph – Neo4J

Pros – excellent for examining relationships between objects, think:

• Facebook

• Travel problems

• Customers

• Fraud


Graph Summary

Schema is applied when the data is stored

• But the schema is light(ish) because the nodes and edges don’t all have to store the same data


Data summary

So data is highly variable.

It can be atomic, dense and tabular

• If we want to find and subset it, then relational is the way to go

• If we are interested in order, then relational becomes less attractive and NoSQL systems more so


Data summary


It can be atomic and sparse, or guaranteed to vary over time, and/or our queries are not just sub-setting, and/or it may be HUGE.

KVPs may well be the answer here.

• Document

• Column store/table

• Graph


Data summary


It can be non-atomic such as images, audio, graphs, seismic scans etc.

Some kind of file storage and programming model may be appropriate here – Hadoop.


Data Lakes

Which brings us, finally, to data lakes.

Big repository of data stored:

• schema-less (or with a light schema)

• with high redundancy

• on cheap, commodity hardware

http://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf








Data Swamp

The point about a data lake is that it leans heavily towards schema-less storage.

It is vital to add metadata to describe the data; otherwise your lake will rapidly become a data swamp.
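One possible shape for such metadata is a small catalogue that sits beside the raw files. The sketch below is only an illustration: the class name CatalogueEntry, its fields, the file paths and the tags are all hypothetical, and real lakes typically use a dedicated catalogue service instead of a Python list.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """Metadata describing one raw file in the lake (field names are illustrative)."""
    path: str
    source: str
    received: str
    description: str
    tags: set = field(default_factory=set)

# Hypothetical entries describing files that were stored without a schema.
catalogue = [
    CatalogueEntry("raw/images/truck_0001.jpg", "driver phone", "2016-01-01",
                   "Photo of a log load", {"image", "sawmill"}),
    CatalogueEntry("raw/logs/web_2016_01.log", "web server", "2016-01-31",
                   "Web server log for January", {"log", "web"}),
]

def search(tag):
    """Find files by tag without opening them: this is what the metadata buys us."""
    return [entry.path for entry in catalogue if tag in entry.tags]

sawmill_files = search("sawmill")
print(sawmill_files)
```

Without entries like these, the only way to discover what a file contains is to open it and guess, which is how a lake turns into a swamp.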


Data Swamp

Synonyms for the word swamp include:

quagmire, bog, morass and quicksand…


Thank you.

Mark Whitehorn