Big Data as Opposed to Small Data Mark Whitehorn


BIG DATA - AS OPPOSED TO SMALL DATA

Mark Whitehorn


What is Big data?

Is it really just a marketing campaign?

http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf

“If you’re like me, the mere mention of Big Data now turns your stomach….Why all the fuss? Why, indeed. Essentially, Big Data is a marketing campaign, pure and simple.” Stephen Few

Big data

Clearly I am not like Stephen Few.

I don’t believe I have a particular axe to grind; I simply find this interesting

This talk is designed to try to explain:
• what Big Data is
• what characteristics we have found useful
• why it may be of interest to you
• a paradox

Data

All computer applications manipulate data

So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application

Which led directly to the development of database engines and, ultimately, relational ones (DB2, Oracle, SQL Server)

Data has always existed in two, very broad, flavours…..

• Data that is treated as small, discrete packages and is a good fit with the relational way of storing and querying data

• Data that is not as above

Data is stored in tables

Car

LicenseNo   Make      Model      Year   Colour
CER 162 C   Triumph   Spitfire   1965   Green
EF 8972     Bentley   Mk. VI     1946   Black
YSK 114     Bentley   Mk. VI     1949   Red

• Each table has a name
• Data is atomic
• Columns
• Rows
• Each row represents a unique entity in the ‘real’ world……


Data

The manipulation consists typically of sub-setting the data by rows and columns and then doing some sums

Note that this kind of manipulation treats the data as atomic, which is fine, because the relational model assumes atomicity of data

Note also that the rows are unordered
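The manipulation described above (sub-setting by rows and columns, then doing some sums) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the Car table and its three rows come from the slides, while the choice of SQLite, the column types and the particular queries are assumptions made for the example.

```python
import sqlite3

# Build the Car table from the slides in an in-memory database.
# (Column types and the primary key are assumptions for this sketch.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    "LicenseNo TEXT PRIMARY KEY, Make TEXT, Model TEXT, Year INTEGER, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162 C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk. VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk. VI", 1949, "Red"),
    ],
)

# Sub-set by rows (WHERE) and by columns (the SELECT list).
# Rows are unordered, so an explicit ORDER BY is needed for a stable result.
bentleys = conn.execute(
    "SELECT LicenseNo, Year FROM Car WHERE Make = 'Bentley' ORDER BY Year"
).fetchall()

# ... and then do some sums.
count, avg_year = conn.execute(
    "SELECT COUNT(*), AVG(Year) FROM Car WHERE Make = 'Bentley'"
).fetchone()
```

Each value is treated as a single atomic item: the queries never look inside a licence number or a model name.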

Data has always existed in two, very broad, flavours…..
• Data that is inherently atomic and is a good fit with the relational way of storing and querying data
• Data that is not as above

Examples

Examples of ‘other’ data:
• Images
• Music
• Word docs
• Sensor data
• Web logs
• Twitter
• Machines
  • Point of Sale
  • Mass spectrometers

What’s in a name?

So, what do we call the ‘rest’?
• Un-structured?
• Semi-structured?
• Multi-structured?
• Non-relational?
• Non-tabular?

What about:
• Big data?

Other definitions?

• V V V v v v v
• Volume
• Variety
• Velocity
• Value
• Very interesting
• Various other words beginning with V…..

Big Data – not new?

• So why have we focused, for the last 30 years, almost exclusively on the first flavour?
• Because it:
  • is easy (relatively easy – Jim Gray*)
  • represents a significant proportion of the available data

*Jim Gray and Andreas Reuter - Transaction Processing: Concepts and Techniques (1993); Turing Award 1998

Big Data has come of age

• Two factors have changed
  • Rise of the Machines
  • Increase in computational power
• There is a great synergy here
  • We are acquiring far more big data and we have the computational power to extract the information it contains

Big Data is hard

• 3 Vs
• It is highly variable
• We often want to look inside the data
  • Frequently non-atomic
• Need custom functions for virtually every operation
  • Find the rotating wing aircraft in the image
  • Identify the best customer
  • What does the blogosphere think of our company?
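As a small illustration of "looking inside" non-atomic data with a custom function, here is a sketch that pulls the parts out of one web log line. It assumes the log is in Common Log Format; the function name, the field names and the sample line are all hypothetical and not taken from the talk.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format, e.g.
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Look inside one log line; return its parts, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
        # A size of "-" means no body was sent; treat it as zero bytes.
        "size": 0 if m.group("size") == "-" else int(m.group("size")),
    }


# Hypothetical sample line, using a documentation-only IP address.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
```

A different log layout, an image or a spectrum would each need a different function, which is the point the slide makes.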

Big Data

• Examples
  • Log file
  • Mass spectrometer
  • Images

Summary so far……

• Just as you can always fit an aircraft engine into a car chassis, you can always put Big Data in a table, but you probably don’t want to

• The analysis is not sub-setting the data by rows and columns

• So each class of big data usually requires a (lovingly hand-crafted) custom analysis

Case Study

Big Data in the Life Sciences World

The massed spectrometers

Why would anyone do that?

Human Genome Project

$3 billion – 13 Years

Sequencing completed (2003).

Our genes define us.

Errr…. how does that work exactly?

What is a protein?

DNA: blueprint
Protein: product

Why study proteins

Genes contain instructions for creating proteins (GENOME)

Proteins carry out functions within a cell (PROTEOME)

Example Proteins

Protein: Actin
Function: Contracts Muscles

Protein: Insulin
Function: Controls Blood Sugar

Protein: Hemoglobin
Function: Carries Oxygen

Protein: Keratin
Function: Forms Hair and Nails

Protein: Antibody
Function: Fights Viruses

20-25,000 genes in the human genome.

Every nucleated cell in the same human has the same genome.

But not all genes are active at the same time.

Perhaps 15-18,000 active proteins in any one cell at any one time.

Slowly changing: over millions of years

Rapidly changing: over a day

Studying Proteins

Proteins are chopped up using an enzyme to make them easier to measure.

A specialised instrument (Mass Spectrometer) is used to measure (‘weigh’) the small protein fragments.

We can use the mass of the small fragments to carry out intelligent database searches to identify which protein was detected.
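The three steps above can be sketched in a few lines of Python. This is a toy illustration, not the pipeline used in the case study. The slides do not name the enzyme, so a trypsin-like rule (cut after K or R unless the next residue is P) is assumed. The residue masses are standard monoisotopic values. The two-entry "database", the protein names and the matching tolerance are hypothetical; protein_A uses the opening residues of the sequence shown on the Protein slide.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def digest(protein):
    """Chop a protein after K or R, except before P (assumed trypsin-like rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """'Weigh' a peptide: sum of its residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify(observed_masses, database, tolerance=0.01):
    """Score each protein by how many observed masses match its predicted peptides."""
    scores = {}
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in digest(sequence)]
        scores[name] = sum(
            any(abs(obs - m) <= tolerance for m in predicted)
            for obs in observed_masses
        )
    return max(scores, key=scores.get), scores


# Hypothetical two-protein database.
toy_db = {
    "protein_A": "MKLNISFPATGCQKLIEVDDER",
    "protein_B": "GASPVTCKLINDQR",
}
# Pretend the instrument measured two of protein_A's fragments.
observed = [peptide_mass("LNISFPATGCQK"), peptide_mass("LIEVDDER")]
best, scores = identify(observed, toy_db)
```

Real searches work on noisy spectra and far larger databases, so the scoring is much more involved than this count of matches.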


Protein

MKLNISFPATGCQKLIEVDDERKLRTFYEKRMATEVAADALGEEWKGYVVRISGGNDKQGFPMKQGVLTHGRVRLLLSKGHSCYRPRRTGERKRKSVRGCIVDANLSVLNLVIVKKGEKDIPGLTDTTVPRRLGPKRASRIRKLFNLSKEDDVRQYVVRKPLNKEGKKPRTKAPKIQRLVTPRVLQHKRRRIALKKQRTKKNKEEAAEYAKLLAKRMKEAKEKRQEQIAKRRRLSSLRASTSKSESSQK

Amino Acids

Peptides


Mass Spectrometry

An analytical technique for the determination of the elemental composition of a sample.

Spectra

[Figure: spectra P1, P2, P3]

Mass Spectra

File Sizes: typically several gigabytes per MS run.

Identifications: range from 500 to 8,000 protein identifications.

pep TRACKER

TRACK. VISUALISE. DISCOVER.

• Protein Peptide Alignment Map
• Normalised Profiles for Synthesis, Degradation and Turnover
• Localisation
• Comparison Between Compartments

Custom analysis and custom visualisation – vital tools in understanding big data


Proteomics Volume 3, Issue 8, Article first published online: 12 AUG 2003

• Deisotoping
• Base Line Correction
• Peak Detection

Bioconductor PROcess R Package

Intensive Data Processing Required to derive Information from the raw data

“proteomics is much more complicated than genomics . . . while an organism's genome is more or less constant, the proteome differs from cell to cell and over time”

Computationally, perhaps three orders of magnitude more complex than HGP

Why bother trying to quantify it?

Because this is payback time.

Documenting the proteome opens the door to a whole new world.


So, what is a data scientist?

My favourite description comes from Twitter:
“Yeah, so I'm actually a data scientist. I just do this barista thing in between gigs.”

More cynically:
“A data scientist is just an analyst who lives in California.”

Possibly more accurate is that a data scientist (DS) is “a better software engineer than any statistician and a better statistician than any software engineer”.


DSs are also part artist and part engineer. They need a toolbox of techniques, skills, processes and abilities from which to construct novel solutions. And they need the ability to create a UI that turns their abstract findings into something that the users of the system can understand, so DSs also need the skills to create elegant visualisations that turn raw data into information.

And (yes, there’s more) they need to be able to communicate well with people. There is little use in creating a superb analytical process if you can’t communicate how and why it works to the board members.


And then there is the curiosity. Duncan Ross (Director of Data Sciences at Teradata) characterised data scientists well:

The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens “if I push that button”. Data Science selects for it.

So, what are the general characteristics of a DS? They include:
• insatiable curiosity (see above)
• interdisciplinary interests
• excellent communication skills
• excellent analytical capabilities

DSs also need a good working knowledge of:
• machine learning techniques
• data mining
• statistics
• maths
• algorithm development
• code development
• data visualisation
• multi-dimensional database design and implementation

Specific skills include the technologies to handle big data:
• NoSQL databases
• Hadoop and related technologies
• MapReduce and its implementation on differing software platforms
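For readers who have not met MapReduce, here is a minimal single-machine sketch of the idea in plain Python, using the customary word-count example. It is not Hadoop and uses no MapReduce library; the function names and the input lines are hypothetical and serve only to show the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (word, 1) pair for every word in a line of text."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key, here by summing them."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["big data is big", "small data is small"])
```

On a real platform the map and reduce calls run in parallel across many machines and the framework handles the shuffle; only the two small functions are written by the analyst.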


DSs also have an intimate knowledge of languages such as:
• SQL
• MDX
• R
• Functional and OOP languages such as Erlang and Java

Most of all, no matter what they are called, all true data scientists have started playing with some data at 8:00PM and suddenly found it is 3:00AM.


Case Study

Twitter

Who loves you?

Social/text/sentiment

Consider the humble tweet…

As, indeed, Sally Bercow should have done *Innocent Face*

I’d just like to apologise for that last slide but I would point out that it “contained no accusation whatsoever … Mischievous but not libellous.”

Case Study

Oil Rig data

Gone fishing

Sensor data

Lessons learned

• Engagement
• Choose your battles – look for an area where you can gain competitive advantage
• Choose your platform carefully
• Programming – algorithm development
• Data scientists
• Custom algorithms
• Custom visualisations

Any Questions?

Mark Whitehorn (MarkWhitehorn@computing.dundee.ac.uk)

Thank you very much for listening

BIG DATA - AS OPPOSED TO SMALL DATA

60 minutes

Mark Whitehorn
