Big Data: Information, Data, Events, Analytics at Scale
Prof Peter Triantafillou
Chair of Data Systems
Associate Director UBDC
IDEAS Research Group
School of Computing Science
University of Glasgow
http://dcs.gla.ac.uk/ideas/
Scottish Competition Forum: Big Data, 03/04/2017
http://dilbert.com/strip/2012-09-05
A Bird’s Eye View of Big Data Research
What is big data?
You know you have big data when …
• “you get a call from the utility company, asking you not to run that query again” … disruptive queries!
• “your IT spends most of its time purchasing storage”
• “a query is long enough to require a couple of generations of DBAs to see the first results”
Frequently, one has to redefine what one's big data problem is
Take home message
Structured & Unstructured Data
Information
Knowledge
Wows
• GBs, TBs, Petabytes, Exabytes, …
• New vocabulary: Exabytes, Zettabytes, Yottabytes, …
• @CERN: ≈ 50 TBs per day
• @FB: 250 M photo uploads each day, PBs …
• ca 2011: 1.8 Zettabytes
• Grows by a factor of about 3 per year …
• Open Library Project:
• Have online every book ever written …
• How big is the web?
• ≈ 1 billion domains registered
• Think of web archiving …
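To put the growth figures on this slide in perspective, the following minimal Python sketch compounds the numbers quoted above (1.8 zettabytes around 2011, growing by a factor of about 3 per year) and converts the CERN daily rate to a yearly one. The function and constant names are illustrative assumptions, decimal (SI) units are assumed, and the projection is only as reliable as the slide's rough growth factor.

```python
# Illustrative only: compounds the rough figures quoted on the slide.
ZB = 10**21  # bytes in a zettabyte (decimal SI prefix)
YB = 10**24  # bytes in a yottabyte

BASE_YEAR = 2011
BASE_VOLUME_ZB = 1.8     # "ca 2011: 1.8 Zettabytes"
GROWTH_PER_YEAR = 3.0    # "grows by a factor of about 3 per year"
CERN_TB_PER_DAY = 50     # "@CERN: ~= 50 TBs per day"


def projected_volume_zb(year: int) -> float:
    """Projected volume in zettabytes, assuming constant yearly growth."""
    return BASE_VOLUME_ZB * GROWTH_PER_YEAR ** (year - BASE_YEAR)


def cern_pb_per_year() -> float:
    """CERN's daily rate expressed in petabytes per (365-day) year."""
    return CERN_TB_PER_DAY * 365 / 1000


if __name__ == "__main__":
    for year in range(2011, 2018):
        zb = projected_volume_zb(year)
        print(f"{year}: {zb:,.1f} ZB ({zb * ZB / YB:.3f} YB)")
    print(f"CERN: about {cern_pb_per_year():.2f} PB per year")
```

Under these assumptions the volume crosses from zettabytes into yottabytes within a handful of years, which is why the "new vocabulary" is needed.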
St Peter's Square, Rome.
UK Data Services – Dr. Nathan Cunningham
Overwhelming!
Keys
• Big Data Infrastructures
• Modern File Systems: HDFS
• Modern DBs
• HBase, Cassandra, MongoDB, Neo4j, …
• Analytics platforms
• Hadoop, Spark, SparkML, GraphLab, SpatialHadoop, …
• Ingest and export/querying
• Handling different data/query types/formats
• Tabular, Graph, Documents/Text, Images/Video
• Spatial, Temporal, spatio-temporal
• Streaming and/or at-rest data
• Statistical & machine learning for analytics tasks
A High Level View of a Modern Big Data Box
[Figure: architecture diagram of a modern big data box. Surviving labels: RDBMS, EDW, MPP; Manage & Monitor; Build & Test.]
Big Data: The complete story: V5C
• Volume: …
• Variety: Structured, semi-structured, unstructured
• DB Tables, csv files, …
• Text, video, audio, photos
• Wikipedia pages: text + infoboxes
• Microformat, microdata, (schema.org), …
• Velocity: near-real time
• Storage, querying, analytics, …
• Variability: data flows: peaks & valleys …
• Veracity: Is it really the “true” data: errors? alterations?
• Complexity: entities, data, hierarchies, links, relationships, …
A working definition of Big Data
UK Data Services – Dr. Nathan Cunningham
Big Data: The real story
[Figure: data size plotted against time (t1 to t4), with two curves labelled “Big Data” and “Relevant Data”. Annotations: “Mind the Gap” and “Ride the Right Curve”.]
So what can big data boxes do for me ?
Big Data: Take home message
3 End-user Tasks: Collect, Understand, Exploit
3+ System Layers / Services: Storage Resources Management; Data Management, IR; HCI, Analytics, Visualization
The value is in Smart Data
“In 2016, the world of big data will focus more on smart data, regardless of size. Smart data are wide data (high variety), not necessarily deep data (high volume).
Data are “smart” when they consist of feature-rich content and context (time, location, associations, links, interdependencies, etc.) that enable intelligent and even autonomous data-driven processes, discoveries, decisions, and applications.”
Kirk Borne, Principal Data Scientist at Booz Allen Hamilton
So what can big data boxes do for me?
• Everything your small data box was doing for you
• PLUS … scale!
• Can now access and analyze data I could not before!
• ALSO: scale leads to more knowledge!
• Size matters.
• ALSO: linking data silos!
Big Data Usefulness
• Data Integrity
• Reproducibility
• Provenance
• Quality
• Curation
• Preservation
• Long term access and value
• Context
• Ethics and legal frameworks
• Publication and Citation
• Licensing Conditions
Specific examples
• Collect, store, manage, analyze, mine / Learn / Predict …
• Marketing:
• Which items are bought together? Supermarkets, Travel, …
• Recommendations: e.g., Netflix
• Given your previous history of purchases and that of people like you …
• Energy analytics:
• Aggregate / drill down on consumption per home
• Specific time intervals, geographical regions
• Aggregate over many households
• Link with education / income data
• Find patterns / correlations …
• Text analytics: Given a (corpus of) books:
• Can I summarize it?
• Find main characters/entities?
• Their relationships?
• Identify contradictions?
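The marketing question above ("which items are bought together?") can be illustrated with a small co-occurrence count. This is a minimal Python sketch, not how any particular retailer or platform does it: the `pair_counts` helper and the toy baskets are hypothetical, and at real scale the same counting would run on one of the distributed analytics platforms rather than in a single process.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data, not taken from any real retailer.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]

if __name__ == "__main__":
    for (a, b), n in pair_counts(baskets).most_common(3):
        print(f"{a} + {b}: bought together in {n} of {len(baskets)} baskets")
```

Pairs that co-occur often are candidates for joint promotion or recommendation; a fuller analysis would also normalise by how popular each item is on its own.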
Specific examples
• Science – bio-informatics – poly-omics
• Given a graph pattern describing a sample of protein-to-protein interactions, or metabolomic pathways, have I seen this before in my database?
• Science – Urban informatics
• Given traffic patterns, land contamination, etc., predict house prices?
• Use surveys (on education, income, work, …) and life-logging data (user journeys with pictures/videos) to find patterns in transport modes, habits, etc.
• Can I use social media posts (Twitter, FB, etc.) to identify urban events of interest and annotate them accordingly?
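The bio-informatics question above ("have I seen this graph pattern before in my database?") is a subgraph-matching problem. The Python sketch below shows the idea by brute force on a toy labelled graph. Everything in it is an illustrative assumption: the `contains_pattern` helper, the node labels, and the choice of non-induced matching on undirected edges. Production systems rely on indexing and specialised algorithms, since this approach is exponential in the pattern size.

```python
from itertools import permutations


def contains_pattern(graph_edges, graph_labels, pattern_edges, pattern_labels):
    """Return True if the labelled pattern occurs somewhere in the graph.

    Edges are undirected pairs of node ids; labels map node id -> label.
    Brute force: try every assignment of pattern nodes to distinct graph
    nodes with matching labels. Suitable for toy inputs only.
    """
    graph = {frozenset(edge) for edge in graph_edges}
    p_nodes = list(pattern_labels)
    for candidate in permutations(graph_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, candidate))
        if any(pattern_labels[p] != graph_labels[g] for p, g in mapping.items()):
            continue  # labels disagree, skip this assignment
        if all(frozenset((mapping[u], mapping[v])) in graph for u, v in pattern_edges):
            return True  # every pattern edge is present in the graph
    return False


# Hypothetical toy "interaction database": a chain a - b - c - d.
graph_labels = {"a": "kinase", "b": "receptor", "c": "ligand", "d": "kinase"}
graph_edges = [("a", "b"), ("b", "c"), ("c", "d")]

# Pattern 1: kinase - receptor - ligand chain (present in the graph).
chain_labels = {"x": "kinase", "y": "receptor", "z": "ligand"}
chain_edges = [("x", "y"), ("y", "z")]

# Pattern 2: the same three nodes forming a triangle (absent from the graph).
triangle_edges = chain_edges + [("x", "z")]

if __name__ == "__main__":
    print(contains_pattern(graph_edges, graph_labels, chain_edges, chain_labels))
    print(contains_pattern(graph_edges, graph_labels, triangle_edges, chain_labels))
```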
Barriers: The Big Data Hubris
• Google Flu Trends: no longer good at predicting flu, scientists find
• Researchers warn of 'big data hubris' and the importance of updating analytical models, claiming Google has made inaccurate forecasts for 100 of 108 weeks.
Google's own autosuggest feature may have driven more people to make flu-related searches, and misled its Flu Trends forecasting system. (Photograph caption, the Guardian)
Barriers: Big Data Risks
• The ‘five safes’ framework (Desai et al., 2014; see Camden, 2014, or Sullivan, 2011, for examples of use) is a way of identifying sources of risk in data access:
1. Safe projects – whether the data use is lawful
2. Safe people – whether the researchers can be trusted to hold and use the data appropriately
3. Safe settings – whether the manner of accessing the data offers protection
4. Safe data – whether there is any inherent protection in the data
5. Safe outputs – whether the outputs from the research pose a disclosure risk
Ritchie, F. and Elliott, M. (2015) Principles- versus rules-based output statistical disclosure control in remote access environments. Working Paper. University of the West of England, Bristol. Available from: http://eprints.uwe.ac.uk/25376
Barriers: Human in the Loop…
• Getting to the Data
• Humans: Digital divide and related social exclusions remain
• Data: acquisitions
• Shareability of Obtained Data / Information / Services
Barriers: Human in the Loop…
Acquiring data can be costly and time-consuming!
• Example: Zoopla
• Purchased a data pipeline
• ~3,000 calls to data access APIs per hour
• Physically acquiring the whole historical DB
• can take a long time
• requires dedicated human resources
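A back-of-the-envelope calculation shows why acquiring a whole historical database through an API can take a long time. The Python sketch below assumes the ~3,000 calls per hour figure acts as a throughput cap; the database size and the number of records returned per call are hypothetical, since the slide gives neither.

```python
import math

CALLS_PER_HOUR = 3000  # "~3,000 calls to data access APIs per hour" (from the slide)


def hours_to_acquire(total_records: int, records_per_call: int,
                     calls_per_hour: int = CALLS_PER_HOUR) -> float:
    """Hours needed to pull total_records through a rate-limited API."""
    calls_needed = math.ceil(total_records / records_per_call)
    return calls_needed / calls_per_hour


if __name__ == "__main__":
    # Hypothetical sizes: neither number comes from the slide.
    hours = hours_to_acquire(total_records=30_000_000, records_per_call=10)
    print(f"{hours:,.0f} hours, about {hours / 24:,.1f} days")
```

With these made-up inputs the download alone takes 1,000 hours of uninterrupted calls, before any retries, validation, or storage work, which is where the dedicated human resources come in.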
Barriers: Human in the Loop…
• Sharing data is not easy!!!
• Licensing restrictions
• Who can use it and how much of it
• Legal expertise needed – cost: £ and time
• UBDC is a broker: need one license
• Between UBDC and data owner and
• Between UBDC and end-user
• Too many possible end users
• Hard to come up with a single EULA
• Liability risks:
• Pass them on to end-users ?
• What if they cannot afford these ? (e.g., private citizens)
• How can we know if an organisation or citizen can afford these?