Copyright Mark Whitehorn

From Big data to Data Lakes

It’s all about me…

Prof. Mark Whitehorn
Emeritus Professor of Analytics
School of Science and Engineering (Computing)
University of Dundee

Consultant
Writer (author)

Computing

Teach Masters in:
• Data Science
• Data Engineering

• Full/Part time
• Remote learning
• Aimed at existing data professionals

From big data to data lakes

“Not waving but drowning” Stevie Smith

• Atomic
• Relational
• Schema
• Schema first
• Early binding schema
• Schema-on-write
• Schema last
• Late binding schema
• Schema-on-read
• Schema-less storage
• HDFS
• MapReduce
• Hadoop
• Sparse data
• Key Value Pairs (KVPs)
• NoSQL
• Document – MongoDB
• Column store – Cassandra
• Graph – Neo4j
• JSON
• Nodes
• Edges
• Data lake

What is Big Data?

Data has always existed in two very broad flavours:

1. Data that is inherently atomic and is a good fit with the relational way of storing and querying data

2. Data that is not as above

Taxonomy of Data

Data
  • Tabular/Relational
  • Big

“Small” Data – relational data

Data is stored in tables

LicenceNo   Make     Model     Year  Colour
CER 162 C   Triumph  Spitfire  1965  Green
EF 8972     Bentley  Mk. VI    1946  Black
YSK 114     Bentley  Mk. VI    1949  Red

Each table has a name; this one is called Car.

The data is atomic. What does this mean?

We have sub-divided it to the point where we can query it satisfactorily.

Each column holds one attribute.

Each row represents a unique entity in the ‘real’ world with a set of attributes (the columns).

Manipulation typically consists of sub-setting the data by rows and columns and then doing some sums.

Indeed, SQL (the language of relational databases) simply allows us to sub-set the data by column and by row.

SELECT Make, Colour
FROM Car
WHERE Make = 'Bentley';

Make     Colour
Bentley  Black
Bentley  Red

Note that this is easy because the data is atomic
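The query above can be tried out with a minimal sketch such as the following. It assumes Python's built-in sqlite3 module and an in-memory database; the column types and the choice of LicenceNo as primary key are assumptions, and the table is named Car as on the earlier slides.

```python
import sqlite3

# In-memory database; the rows mirror the Car example from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    "LicenceNo TEXT PRIMARY KEY, Make TEXT, Model TEXT, Year INTEGER, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162 C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk. VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk. VI", 1949, "Red"),
    ],
)

# Subset by column (Make, Colour) and by row (only the Bentleys).
# ORDER BY is added here only to make the result order predictable.
bentleys = conn.execute(
    "SELECT Make, Colour FROM Car WHERE Make = ? ORDER BY Year", ("Bentley",)
).fetchall()
print(bentleys)
```

Because Make and Colour are already separate, atomic columns, the subsetting needs no string parsing at all.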

Note also that what we have done is first to define a shape for the data (a ‘schema’) and then to fit the data into that schema. The schema includes our decisions about atomicity and is based on entities and attributes, not on querying.

LicenceNo   Make     Model     Year  Colour

So we have to do hard work at the beginning to define the schema, but thereafter the querying is very easy.

This is called:

• Schema first

• Early binding schema

• Schema-on-write
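A rough illustration of schema-on-write, again assuming Python's sqlite3 module: the schema is declared before any data arrives, and a row that does not fit is rejected at write time. The NOT NULL constraints and the helper name try_insert are assumptions made for this sketch, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed before any data is stored (schema-on-write).
conn.execute(
    """
    CREATE TABLE Car (
        LicenceNo TEXT PRIMARY KEY,
        Make      TEXT NOT NULL,
        Model     TEXT NOT NULL,
        Year      INTEGER NOT NULL,
        Colour    TEXT NOT NULL
    )
    """
)

def try_insert(row):
    """Return True if the row fits the schema, False if the database rejects it."""
    try:
        conn.execute("INSERT INTO Car VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(("CER 162 C", "Triumph", "Spitfire", 1965, "Green"))
rejected = try_insert(("EF 8972", "Bentley", None, 1946, "Black"))  # Model is missing
stored = conn.execute("SELECT COUNT(*) FROM Car").fetchone()[0]
```

The hard work is in the CREATE TABLE statement; once data has passed that gate, every later query can rely on its shape.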

Big Data

Examples

Suppose we want to analyse images, video or audio.

These queries are not about sub-setting atomic data.

“Which of these audio files has been plagiarised?”

“Find the logs.”

Sawmills

Is this picture big data?

It is if the driver sends the image from his phone to the sawmill.

• Trim (the picture, not the logs).
• Find the logs.
• Count (53) and measure diameters.

Driver: Olaf Smithson
Location: Adelsburg
Road distance: 235 km
Journey time: 6:23
Break due: 2:30
53 logs
Load classification: Medium
Schedule: 10:20 AM tomorrow

Log No.  Diameter
1        35
2        23
3        68
4        45
5        23
…        …
53       23

How do we store and process this?

One answer is simply to store the raw files in a file system. This is no more complex than copying the files to a folder on your hard disk in Windows.

This is not the same as putting the data into a database.

Schema

Note that we are not even attempting to apply a schema at this stage. We are going to need a schema in order to query the data, but we apply it at query time, not when we store. So this is called:

• Schema last

• Late binding schema

• Schema-on-read

• Schema-less storage
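As a small sketch of schema-on-read: the raw content below is stored exactly as it arrived, and field names and types are only applied when it is read. The comma-separated layout of the raw text and the function name read_with_schema are assumptions for illustration.

```python
import csv
import io

# Raw content kept exactly as it arrived; nothing was validated when it was stored.
raw_file = (
    "CER 162 C,Triumph,Spitfire,1965,Green\n"
    "EF 8972,Bentley,Mk. VI,1946,Black\n"
)

def read_with_schema(raw, fields, types):
    """Apply a schema (field names and types) while reading, not while storing."""
    for record in csv.reader(io.StringIO(raw)):
        yield {name: cast(value) for name, cast, value in zip(fields, types, record)}

cars = list(
    read_with_schema(
        raw_file,
        fields=["LicenceNo", "Make", "Model", "Year", "Colour"],
        types=[str, str, str, int, str],
    )
)
```

A different query could read the same raw text with a different schema, which is exactly the flexibility (and the extra work) of late binding.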

Forward reference to data lakes

In general, data lakes hold schema-less data.

Good news – it is very easy to store the data and no information can possibly have been lost.

Bad news – querying the data is now harder because no schema has yet been applied. So every query has to be hand-crafted in a programming language such as Java or C#.

HDFS & MapReduce

Few people would store the files in Windows because there are file systems that are optimised to allow us to:

• store files redundantly on cheap, commodity servers (PCs)
• analyse the files in parallel.

An example of such a store is HDFS, the Hadoop Distributed File System.

MapReduce is a ‘programming framework’. All that means is that you (or your programmers) can write programs within it that query the files you have stored in HDFS.
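To give a feel for the programming model, here is a single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop itself; the file names, log lines and function names are hypothetical, and a real cluster would run the map and reduce functions in parallel across many servers.

```python
from collections import defaultdict

# Hypothetical log lines standing in for files stored in HDFS.
files = {
    "log1.txt": ["GET /home", "GET /close-account"],
    "log2.txt": ["GET /home", "GET /home"],
}

def map_phase(line):
    """Map: emit a (key, value) pair for each input record."""
    _method, page = line.split()
    yield page, 1

def reduce_phase(key, values):
    """Reduce: combine all the values that share a key."""
    return key, sum(values)

# Shuffle: group the mapped values by key (the framework does this on a cluster).
grouped = defaultdict(list)
for lines in files.values():
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)

page_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

The programmer writes only the two small functions; distribution, grouping and fault tolerance are the framework's job.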

Hadoop

Good news (again): you can run HDFS and MapReduce on 10s, 100s or 1,000s of cheap, commodity servers and achieve very safe, cost-effective storage and fast querying.

Together HDFS and MapReduce are

Wyld Stallyns Hadoop!

But images, audio, video etc. are only one class of big data. Some big data is tabular and atomic. It is the way we want to analyse it that makes it big.

For example, if we look for order in the data.

Big Data often has order

We often want to look for order:

• over time
    • identify the best customer
• between rows
    • find the peak in the graph
    • follow the customer in a log file

Relational databases are bad at order

Relational databases are poor at looking for order between rows

• Really, really, shockingly terrible

Order Order!

Example: a log file, with fields such as:

• Date
• Time
• Client IP Address
• User Name
• Service Name
• Server Name
• Server IP Address
• etc.

But suppose we want to understand how people use our website to, for example, close an account, in order ultimately to understand WHY they close the account. This is no longer a question of sub-setting the data by row and column; it requires working out the order in which specific people visit specific pages.
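A minimal sketch of that kind of order-based question, in plain Python. The log records, user names and page paths are invented for illustration; only the idea (sort by time, then follow each user) comes from the slides.

```python
from collections import defaultdict

# Hypothetical log records: (time, user name, page). Not taken from the slides.
log = [
    ("11:02", "sally", "/account"),
    ("11:00", "mark", "/home"),
    ("11:01", "sally", "/home"),
    ("11:05", "sally", "/close-account"),
    ("11:03", "mark", "/help"),
]

def paths_by_user(records):
    """Return each user's pages in the order in which they were visited."""
    paths = defaultdict(list)
    for _time, user, page in sorted(records):  # tuples sort by time first
        paths[user].append(page)
    return dict(paths)

paths = paths_by_user(log)

# Users whose visit ended on the close-account page, with the pages that led there.
closers = {
    user: pages[:-1]
    for user, pages in paths.items()
    if pages[-1] == "/close-account"
}
```

The answer depends on the sequence of rows for each person, not on any single row, which is why plain row-and-column subsetting is not enough.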

Relational databases are also bad with sparse data

Going back to relational data for a moment

Car

LicenceNo   Make     Model     Year  Colour
CER 162 C   Triumph  Spitfire  1965  Green
EF 8972     Bentley  Mk. VI    1946  Black
YSK 114     Bentley  Mk. VI    1949  Red

All entities have the same set of attributes, and only one of each.

In practical terms we could also say that each row will have data for each column.

LicenceNo   Make     Model     Year  Colour
CER 162 C   Triumph  (null)    1965  Green
EF 8972     Bentley  Mk. VI    1946  Black
YSK 114     Bentley  Mk. VI    1949  Blue/Red

Nulls are tolerated, but frowned upon: all cars should have a model.

Duplicates are not tolerated: a car cannot have more than one colour.

But some data simply isn’t like this. Think about:

• Sensor data (different sensors send different data)

• Customer interactions

Nulls can be common

SensorID    Manufacturer  TimeDate        Pressure  Humidity  Temp  Wind  Depth  And so on
213342332   34            1/1/2016:11:23                      23
2-BSDEFF76  12            2016/1/1:11:34  1034                            12

This particular data sits poorly in a table. But note that each reading can be identified by the column name.

So we could store, for each row, only the columns that do have data.

{
  "SensorID": "213342332",
  "Manufacturer": "34",
  "TimeDate": "1/1/2016:11:23",
  "Temp": "23"
},
{
  "SensorID": "2-BSDEFF76",
  "Manufacturer": "12",
  "TimeDate": "2016/1/1:11:34:43",
  "Pressure": "1034",
  "Depth": "12"
}

Each entry is a key and a value. In the pair "SensorID": "213342332", for example, SensorID is the key and 213342332 is the value.

Key Value Pairs (KVPs) are a very effective way of storing sparse data (data where we expect a large number of nulls).

They are also excellent in cases where we know the data that is collected will vary over time.
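The move from a sparse table to KVPs can be sketched in a few lines of Python. The column names and readings come from the sensor example; the placement of each reading under a column follows the JSON shown on the slides, and the function name to_kvps is an assumption.

```python
import json

columns = ["SensorID", "Manufacturer", "TimeDate", "Pressure", "Humidity", "Temp", "Wind", "Depth"]

# The sparse table: None marks a column for which the sensor sent nothing.
rows = [
    ["213342332", "34", "1/1/2016:11:23", None, None, "23", None, None],
    ["2-BSDEFF76", "12", "2016/1/1:11:34", "1034", None, None, None, "12"],
]

def to_kvps(row):
    """Keep only the columns that actually hold a value."""
    return {key: value for key, value in zip(columns, row) if value is not None}

documents = [to_kvps(row) for row in rows]
print(json.dumps(documents, indent=2))
```

If a new kind of sensor starts sending a new reading, it simply appears as a new key; no existing record has to change.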

Different classes of NoSQL tools

NoSQL Database engines:

These are database engines that are not relational and happen to be very good at holding and manipulating big data.

• Hadoop – is it a database?
• Key Value Pair databases such as:
    • Dynamo
    • SimpleDB

In fact, I would sub-divide KVP NoSQL systems into:

• Document – MongoDB, CouchDB
• Column store/table – Cassandra, HBase, Bigtable
• Graph – Neo4j

Big Data
  • Audio/Images
  • Atomic/KVP
      • Document (MongoDB)
      • Column store (Cassandra)
      • Graph (Neo4j)

We’ll look at a few of the features of these three classes simply to illustrate WHY there is such a proliferation of NoSQL database engines.

All have their pros and cons.

NoSQL database systems

• MongoDB – document
• Cassandra – column
• Neo4j – graph

Document databases MongoDB

MongoDB uses KVPs in what are called collections.

A collection in some ways resembles a table in a relational database.

So, in a relational database, we would have a table called employee and a row for Mark and another for Sally.

In MongoDB we would have a collection called Employee, a document for Mark and another for Sally.

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing"
}

Here "First Name" is a key and "Mark" is its value.

Documents in a collection do not have to store the same information.

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing"
},
{
  "_id": ObjectId("12345a645ab4dad6456"),
  "First Name": "Sally",
  "Last Name": "Jones",
  "Date of Birth": "02-07-1953",
  "Dept": "Computing"
}

The above is a collection of two documents.

{
  "_id": ObjectId("45645a645ab4dad6456"),
  "First Name": "Mark",
  "Last Name": "Whitehorn",
  "Dept": "Computing",
  "Addresses": [
    { "Street": "23 Acatia Gdns", "City": "London" },
    { "Street": "Penguin Towers", "City": "Hereford" }
  ]
}

Fields can have multiple values within one document (multi-valued fields).

MongoDB accepts documents written as JSON-like text and stores them as BSON, a binary-encoded extension of JSON.

Query and output

Find all employees with the first name of Mark.

db.employees.find( { "First Name": "Mark" } );

Find everything and sort by first name.

db.employees.find().sort( { "First Name": 1 } );
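The behaviour of those two queries can be imitated in plain Python, which may help to show what "find documents that match criteria" means. This is only a sketch: the collection is an ordinary list of dicts, the find function is a hypothetical stand-in written for this example, and no MongoDB server or driver is involved.

```python
# A "collection" modelled as a plain list of dicts; the documents need not share fields.
employees = [
    {"First Name": "Mark", "Last Name": "Whitehorn", "Dept": "Computing"},
    {
        "First Name": "Sally",
        "Last Name": "Jones",
        "Date of Birth": "02-07-1953",
        "Dept": "Computing",
    },
]

def find(collection, criteria=None):
    """Return the documents whose fields equal every key/value pair in criteria."""
    criteria = criteria or {}
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

# Equivalent in spirit to find({"First Name": "Mark"}).
marks = find(employees, {"First Name": "Mark"})

# Equivalent in spirit to find().sort({"First Name": 1}) (ascending order).
by_first_name = sorted(find(employees), key=lambda doc: doc["First Name"])
```

Note that Sally's extra "Date of Birth" field causes no difficulty: criteria are checked only against the keys they mention.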

Pros and cons

Very good with data:
• which is incomplete
• where the structure may change over time
• where there is a great deal of data of a “reasonably similar kind”
• where the queries are reasonably simple requests to find documents that match criteria.

Not good where:
• the data is not suitable
• there are potential relationships (of various kinds) between collections (think graph and relational).

Column Store/Table Cassandra

Cassandra uses a partitioned row store.

Designed to partition the data across commodity servers.

A database consists of a set of column families.

Each column family is a set of key-value pairs.

Schema.

Data is modelled on the design of the queries that you want to run, not on entities, attributes and relationships.

Query and output.

The schema in Cassandra is designed to answer the questions that you have from a single table, so there are no joins between tables. The queries are simple subsetting operations run against single tables.

SELECT FName, Job FROM Employee WHERE key IN (19, 401, 617);

SELECT COUNT(*) FROM Sales;
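The idea of a partitioned row store answering a key lookup can be sketched as follows. This is a toy model in plain Python, not Cassandra: the node count, the use of CRC32 as the hash, the helper names and the Job values are all assumptions made for illustration.

```python
import zlib

NODES = 3  # hypothetical number of commodity servers

def node_for(key):
    """Pick a node from a stable hash of the partition key (CRC32 is a stand-in)."""
    return zlib.crc32(str(key).encode()) % NODES

# Each node holds its own slice of the Employee rows, stored as key-value pairs.
cluster = [dict() for _ in range(NODES)]

def insert(key, row):
    cluster[node_for(key)][key] = row

def select(keys, columns):
    """Roughly: SELECT columns FROM Employee WHERE key IN (keys)."""
    result = []
    for key in keys:
        row = cluster[node_for(key)].get(key)  # only one node is consulted per key
        if row is not None:
            result.append({column: row.get(column) for column in columns})
    return result

insert(19, {"FName": "Mark", "Job": "Professor"})
insert(401, {"FName": "Sally", "Job": "Analyst", "Office": "B12"})
found = select([19, 401, 617], ["FName", "Job"])
```

Because each lookup goes straight to the node that owns the key, there is nothing to join; this is why the table has to be designed around the query in the first place.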

Pros and cons

Very good with data:
• which is incomplete
• where the structure may change over time
• where there is a great deal of data of a “reasonably similar kind”
• where the queries are reasonably simple requests to subset data by row and by column.

Not good where:
• everything else
• there are potential relationships (of various kinds) between tables (think graph and relational).

Graph databases: Neo4j

Slides courtesy of Gerry McNicol (@gerrymcnicol)

[Figure: a graph whose nodes are the cities Exeter, London, S'hampton, Bristol and Taunton, each carrying a station property (stn: esd, trs, ssm, btm, lpad). The nodes are connected by TRAIN, BUS and HORSE relationships that carry properties such as time (31 to 453), busco: mega and name: buttercup.]

What is a Graph?

• Made up of Nodes and Relationships (Edges)

• Nodes are connected by Relationships

• Every Relationship has:
    • a starting and ending Node
    • a direction

• Both Nodes and Relationships can have properties.

• Very flexible data structure
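The definition above can be made concrete with a short Python sketch. It is not Neo4j: the class names, the helper outgoing and the particular connections and times are invented for illustration, and the pairing of the station code with Exeter is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node  # every relationship has a starting node...
    end: Node    # ...and an ending node, which gives it a direction
    kind: str
    properties: dict = field(default_factory=dict)

# Hypothetical data: connections and times are invented for this sketch.
exeter = Node("Exeter", {"stn": "esd"})  # pairing of code and city is assumed
taunton = Node("Taunton")                # nodes need not store the same properties
bristol = Node("Bristol")

relationships = [
    Relationship(exeter, taunton, "TRAIN", {"time": 30}),
    Relationship(taunton, bristol, "TRAIN", {"time": 35}),
    Relationship(exeter, bristol, "BUS", {"time": 120, "busco": "mega"}),
]

def outgoing(node, kind=None):
    """Follow the relationships that start at node, optionally filtered by kind."""
    return [
        rel
        for rel in relationships
        if rel.start is node and (kind is None or rel.kind == kind)
    ]

train_destinations = [rel.end.name for rel in outgoing(exeter, "TRAIN")]
```

Both the nodes and the relationships carry their own key-value properties, which is what makes the structure so flexible.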

Graph – Neo4j

• It stores KVP data in both nodes and edges

• Both are equally important

• There is no need for nodes (or edges) to store the same data

Pros – excellent for examining relationships between objects, think:

• Facebook

• Travel problems

• Customers

• Fraud

Graph Summary

Schema applied when data stored

• But schema is light(ish) because all of the nodes and edges don’t have to store the same data

Data summary

So data is highly variable.

It can be atomic, dense and tabular

• If we want to find and subset it, then relational is the way to go

• If we are interested in order, then relational becomes less attractive and NoSQL systems more so

It can be atomic and sparse, or guaranteed to vary over time, and/or our queries are not just subsetting, and/or it may be HUGE.

KVPs may well be the answer here.

• Document

• Column store/table

• Graph

It can be non-atomic such as images, audio, graphs, seismic scans etc.

Some kind of file storage and programming model may be appropriate here – Hadoop.

Data Swamp

The point about a data lake is that it leans heavily towards schema-less storage.

It is vital to add metadata to describe the data; otherwise your lake will rapidly become a data swamp.
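One simple way to attach metadata is to write a small descriptive file next to each raw file. The sketch below assumes a local folder standing in for the lake and a JSON "sidecar" naming convention; the function name, the field names and the example file are all hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def store_with_metadata(lake_dir, name, payload, description, source):
    """Write the raw bytes unchanged, plus a JSON sidecar that describes them."""
    lake = Path(lake_dir)
    (lake / name).write_bytes(payload)
    metadata = {
        "file": name,
        "description": description,
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "stored_at": datetime.now(timezone.utc).isoformat(),
    }
    (lake / (name + ".meta.json")).write_text(json.dumps(metadata, indent=2))
    return metadata

# A temporary folder stands in for the data lake in this sketch.
with tempfile.TemporaryDirectory() as lake_dir:
    meta = store_with_metadata(
        lake_dir,
        "truck_0001.jpg",
        b"\xff\xd8 raw image bytes",
        description="Photo of a log load sent from a driver's phone",
        source="driver phone upload",
    )
    sidecar = json.loads((Path(lake_dir) / "truck_0001.jpg.meta.json").read_text())
```

The raw file stays schema-less, but anyone who finds it later can tell what it is, where it came from and whether it has changed.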

Synonyms for the word swamp include:

quagmire, bog, morass and quicksand…

Thank you.

Mark Whitehorn