Big Data: Information, Data, Events, Analytics at Scale Prof Peter Triantafillou Chair of Data Systems Associate Director UBDC IDEAS Research Group School of Computing Science University of Glasgow http://dcs.gla.ac.uk/ideas/ Scottish Competition Forum: Big Data 03/04/2017

Page 1

Big Data: Information, Data, Events, Analytics at Scale

Prof Peter Triantafillou
Chair of Data Systems

Associate Director UBDC

IDEAS Research Group
School of Computing Science

University of Glasgow
http://dcs.gla.ac.uk/ideas/

Scottish Competition Forum: Big Data
03/04/2017

Page 2

http://dilbert.com/strip/2012-09-05

A Bird’s Eye View of Big Data Research

Page 3

What is big data ?

Page 4

You know you have big data when …

• “you get a call from the utility company, asking you not to run that query again” … disruptive queries!

• “your IT spends most of its time purchasing storage”

• “a query is long enough to require a couple of generations of DBAs to see the first results”

Frequently, one has to redefine what one’s big data problem is

Page 5

Take home message

Structured & Unstructured Data

Information

Knowledge

Page 6

Wows

• GBs, TBs, Petabytes, Exabytes, …
  • New vocabulary: Exabytes, Zettabytes, Yottabytes, …

• @CERN: ≈ 50 TB per day

• @FB: 250 M photo uploads each day, PBs …

• ca. 2011: 1.8 Zettabytes
  • Grows by a factor of about 3 per year …

• Open Library Project:
  • Have online every book ever written …

• How big is the web?
  • ≈ 1 billion domains registered

• Think of web archiving …

Page 7

St Peter’s Square, Rome.

UK Data Services – Dr. Nathan Cunningham

Page 8

Page 9

Overwhelming!

Page 10

Keys

• Big Data Infrastructures
  • Modern File Systems -- HDFS
  • Modern DBs
    • HBase, Cassandra, MongoDB, Neo4j, …
  • Analytics platforms
    • Hadoop, Spark, SparkML, GraphLab, SpatialHadoop, …

• Ingest and export/querying
  • Handling different data/query types/formats
    • Tabular, Graph, Documents/Text, Images/Video
    • Spatial, Temporal, Spatio-temporal
    • Streaming and/or at-rest data

• Statistical & machine learning for analytics tasks

Page 11

A High Level View of a Modern Big Data Box

[Figure: architecture diagram. Surviving labels: RDBMS, EDW, MPP, Manage & Monitor, Build & Test.]

Page 12

Big Data: The complete story: V5C

• Volume: …

• Variety: structured, semi-structured, unstructured
  • DB tables, CSV files, …
  • Text, video, audio, photos
  • Wikipedia pages: text + infoboxes
  • Microformats, microdata (schema.org), …

• Velocity: near-real time
  • Storage, querying, analytics, …

• Variability: data flows: peaks & valleys …

• Veracity: Is it really the “true” data? Errors? Alterations?

• Complexity: entities, data, hierarchies, links, relationships, …

Page 13

A working definition of Big Data

UK Data Services – Dr. Nathan Cunningham

Page 14

Big Data: The real story

[Figure: plot of data size against time (t1 to t4). Labels: “Big Data”, “Relevant Data”, “Mind the Gap”, “Ride the Right Curve”.]

Page 15

So what can big data boxes do for me ?

Page 16

Big Data: Take home message

3 End-user Tasks:

• Collect
• Understand
• Exploit

3+ System Layers / Services:

• Storage Resources Management
• Data Management, IR
• HCI, Analytics, Visualization

Page 17

The value is in Smart Data

“In 2016, the world of big data will focus more on smart data, regardless of size. Smart data are wide data (high variety), not necessarily deep data (high volume).

Data are “smart” when they consist of feature-rich content and context (time, location, associations, links, interdependencies, etc.) that enable intelligent and even autonomous data-driven processes, discoveries, decisions, and applications.”

Kirk Borne, Principal Data Scientist at Booz Allen Hamilton

Page 18

So what can big data boxes do for me ?

• Everything your small data box was doing for you

• PLUS … scale !

• Can now access and analyze data I could not before !

• ALSO: scale leads to more knowledge !

• Size matters.

• ALSO: linking data silos !

Page 19

Big Data Usefulness

• Data Integrity
• Reproducibility
• Provenance
• Quality
• Curation
• Preservation
• Long-term access and value
• Context
• Ethics and legal frameworks
• Publication and Citation
• Licensing Conditions

Page 20

Specific examples

• Collect, store, manage, analyze, mine / learn / predict …

• Marketing:
  • Which items are bought together? Supermarkets, travel, …

• Recommendations: e.g., Netflix
  • Given your previous history of purchases and that of people like you …

• Energy analytics:
  • Aggregate / drill down on consumption per home
  • Specific time intervals, geographical regions
  • Aggregate over many households
  • Link with education / income data
  • Find patterns / correlations …

• Text analytics: given a book (or a corpus of books):
  • Can I summarize it?
  • Find main characters/entities?
  • Their relationships?
  • Identify contradictions?
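The marketing question above ("which items are bought together?") can be illustrated with a minimal sketch that counts co-purchased pairs. This is only one simple way to approach it, not a method taken from the talk; the function name `co_purchase_counts` and the toy baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops repeated items and gives every pair
        # a canonical (alphabetical) order, so (a, b) == (b, a).
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter", "milk"],
]

pair_counts = co_purchase_counts(baskets)
top_pairs = pair_counts.most_common(2)  # the two most frequent pairs
```

For these toy baskets, bread and milk appear together three times. At supermarket scale the same counting would typically be spread over many machines, which is where platforms such as Hadoop or Spark come in.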

Page 21

Specific examples

• Science – bio-informatics – poly-omics
  • Given a graph pattern describing sample protein-to-protein interactions or metabolomic pathways, have I seen this before in my database?

• Science – Urban informatics
  • Given traffic patterns, land contamination, etc., predict house prices?
  • Use surveys (on education, income, work, …) and life-logging data (user journeys with pix/videos) to find patterns in transport modes, habits, etc.
  • Can I use social media posts (Twitter, FB, etc.) to identify urban events of interest and annotate them accordingly?
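The "have I seen this graph pattern before?" question can be sketched as a small subgraph search. The code below is a naive backtracking matcher under stated assumptions: graphs are undirected and stored as adjacency sets, extra edges in the database graph are allowed, and the protein names `P1` to `P4` are hypothetical. It is not the approach of any particular system, and its worst-case cost grows exponentially, which is why real graph stores rely on indexes and pruning.

```python
def find_pattern(pattern, graph):
    """Return one mapping pattern-node -> graph-node that preserves every
    pattern edge, or None if the pattern does not occur in the graph.

    Both arguments are dicts mapping a node to the set of its neighbours
    (undirected, so adjacency must be symmetric).
    """
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return dict(mapping)  # every pattern node has been placed
        p = p_nodes[len(mapping)]  # next pattern node to place
        for g in graph:
            if g in mapping.values():
                continue  # keep the mapping one-to-one
            # Each already-placed neighbour of p must be adjacent to g.
            if all(mapping[q] in graph[g] for q in pattern[p] if q in mapping):
                mapping[p] = g
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[p]  # backtrack and try the next candidate
        return None

    return extend({})


# Hypothetical interaction graph, invented for illustration only.
graph = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3"},
}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

match = find_pattern(triangle, graph)     # a triangle exists: P1, P2, P3
no_match = find_pattern(square, graph)    # no 4-cycle in this graph
```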

Page 22

Barriers: The Big Data Hubris

• Google Flu Trends: no longer good at predicting flu, scientists find

• Researchers warn of 'big data hubris' and the importance of updating analytical models, claiming Google has made inaccurate forecasts for 100 of 108 weeks.

Google's own autosuggest feature may have driven more people to make flu-related searches, and misled its Flu Trends forecasting system. Source: the Guardian.

Page 23

Barriers: Big Data Risks

• The ‘five safes’ framework (Desai et al., 2014; see Camden, 2014, or Sullivan, 2011, for examples of use) is a way of identifying sources of risk in data access:

1. Safe projects – whether the data use is lawful

2. Safe people – whether the researchers can be trusted to hold and use the data appropriately

3. Safe settings – whether the manner of accessing the data offers protection

4. Safe data – whether there is any inherent protection in the data

5. Safe outputs – whether the outputs from the research pose a disclosure risk
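As a hedged illustration of the "safe outputs" check, the sketch below applies a simple minimum-count rule to a frequency table before release, one common style of rules-based output checking. The function name `unsafe_cells`, the threshold of 10 and the table values are all assumptions made for this example; actual thresholds and rules vary by data provider.

```python
def unsafe_cells(frequency_table, min_count=10):
    """Return the cells whose count is non-zero but below min_count.

    min_count=10 is an assumed threshold for illustration only.
    Zero cells are not flagged here, which is a simplification.
    """
    return {cell: n for cell, n in frequency_table.items() if 0 < n < min_count}


# Hypothetical output table (area, income band) -> number of households.
table = {
    ("area A", "low income"): 42,
    ("area A", "high income"): 3,
    ("area B", "low income"): 17,
    ("area B", "high income"): 0,
}

flagged = unsafe_cells(table)  # cells to suppress or merge before release
```

Here only the cell with three households is flagged, since so small a group could make individuals identifiable.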

Ritchie, F. and Elliott, M. (2015) Principles- versus rules-based output statistical disclosure control in remote access environments. Working Paper. University of the West of England, Bristol. Available from: http://eprints.uwe.ac.uk/25376

Page 24

Barriers: Human in the Loop…

• Getting to the Data
  • Humans: digital divide and related social exclusions remain
  • Data: acquisition

• Shareability of Obtained Data / Information / Services

Page 25

Barriers: Human in the Loop…

• Acquiring data can be costly and time-consuming!

• Example: Zoopla
  • Purchased a data pipeline
  • ~3,000 calls to data access APIs per hour
  • Physically acquiring the whole historical DB
    • can take a long time
    • requires dedicated human resources
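A back-of-the-envelope calculation shows why a cap of about 3,000 API calls per hour makes bulk acquisition slow. Only the call rate comes from the slide; the database size and the number of records returned per call are hypothetical values chosen for illustration.

```python
CALLS_PER_HOUR = 3_000  # rate quoted on the slide


def hours_to_acquire(total_records, records_per_call,
                     calls_per_hour=CALLS_PER_HOUR):
    """Hours needed to fetch total_records at the given call rate."""
    # Ceiling division: a partly filled final call still costs one call.
    calls_needed = -(-total_records // records_per_call)
    return calls_needed / calls_per_hour


# Hypothetical sizes, not figures from the talk.
hours_batched = hours_to_acquire(total_records=30_000_000, records_per_call=100)
hours_single = hours_to_acquire(total_records=30_000_000, records_per_call=1)
days_single = hours_single / 24
```

Under these assumptions, 30 million records at 100 per call take 100 hours of continuous calls. At one record per call the same download takes 10,000 hours, which is well over a year.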

Page 26

Barriers: Human in the Loop…

• Sharing data is not easy!

• Licensing restrictions
  • Who can use it and how much of it
  • Legal expertise needed – cost: £ and time
  • UBDC is a broker: need one license
    • between UBDC and data owner, and
    • between UBDC and end-user
  • Too many possible end users
    • Hard to come up with a single EULA

• Liability risks:
  • Pass them on to end-users?
  • What if they cannot afford these? (e.g., private citizens)
  • How can we know if an organisation or citizen can afford these?

Page 27