BIG DATA
Big Data are data that cannot be handled using standard database systems and standard data analysis tools.
Big Data sources: sensors, surveillance systems, mobile phones, GPS devices, RFID readers, social networks, computer networks, web logs, scientific data …
Big Data characteristics (3 V’s or 4 V’s)
Volume: the size of Big Data goes beyond standard data storage and manipulation techniques
Velocity: Big Data is often available in real time
Variety: Big Data contains not only structured data (e.g. in tabular or relational form) but also texts, images, audio or video
Veracity: the quality and reliability of Big Data can vary
The Big Data pipeline
Data generation
Data acquisition: data collection, data transmission, data pre-processing (integration, cleansing, redundancy elimination)
Data storage
Data analysis
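The pre-processing step of the pipeline (integration, cleansing, redundancy elimination) can be illustrated with a minimal sketch. The record layout (`sensor`, `time`, `value`) and the function name `preprocess` are assumptions made for illustration only; they do not come from the slides.

```python
from typing import Iterable


def preprocess(*sources: Iterable[dict]) -> list[dict]:
    """Integrate several record sources, cleanse them, and drop duplicates.

    The record fields used here are hypothetical.
    """
    seen = set()
    result = []
    for source in sources:                       # integration: merge all sources
        for record in source:
            sensor = str(record.get("sensor", "")).strip().lower()
            value = record.get("value")
            if not sensor or value is None:      # cleansing: skip incomplete records
                continue
            key = (sensor, record.get("time"), value)
            if key in seen:                      # redundancy elimination
                continue
            seen.add(key)
            result.append({"sensor": sensor, "time": record.get("time"), "value": value})
    return result


# Two made-up sources: one with a duplicate reading, one with a missing value.
gps = [
    {"sensor": " GPS-1 ", "time": 1, "value": 50.1},
    {"sensor": "gps-1", "time": 1, "value": 50.1},
]
rfid = [
    {"sensor": "rfid-7", "time": 2, "value": None},
    {"sensor": "rfid-7", "time": 3, "value": 4},
]
cleaned = preprocess(gps, rfid)
```

Here the duplicate GPS reading and the RFID record without a value are dropped, so two records remain. Real pipelines would use far richer cleansing rules; this only shows where each sub-step sits.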
Challenges for collecting, storing and manipulating Big Data
New forms of data storage: file systems, NoSQL databases
New forms of computation: parallel computing, distributed computing, grid computing, cloud computing
Batch processing vs. stream processing
Apache Hadoop
Software platform that supports data-intensive distributed applications
Hadoop distributed file system
Map/Reduce: divide-and-conquer approach that breaks down an intractable problem into tractable sub-problems
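The Map/Reduce idea can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this sketch, and in a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks: list[str]) -> dict[str, int]:
    # Each chunk is an independent sub-problem; here they are mapped one after another.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big", "data stream"])
```

Because the map step looks at one chunk at a time and the reduce step at one key at a time, both can be spread over many machines without changing the result.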
Challenges for analyzing Big Data
New forms of data: heterogeneous data, unstructured data, stream data
New properties of data: non-stationarity, concept drift
New forms of learning: real-time learning, incremental learning, sequential learning
New forms of computation: distributed computation, cloud computation
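Incremental learning on a non-stationary stream can be illustrated with a very small sketch: an estimate of the mean that is updated one observation at a time and gradually forgets old data. The class name `ForgettingMean`, the parameter `alpha` and the toy stream are assumptions chosen for illustration; exponential forgetting is only one simple way to cope with concept drift.

```python
class ForgettingMean:
    """Incrementally updated mean with exponential forgetting."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha        # weight of the newest observation (0 < alpha <= 1)
        self.estimate = None      # no estimate before the first observation

    def update(self, x: float) -> float:
        if self.estimate is None:
            self.estimate = float(x)
        else:
            # Move the estimate a fraction alpha towards the new value.
            self.estimate += self.alpha * (x - self.estimate)
        return self.estimate


# Toy stream whose level jumps from 0 to 10 (a crude stand-in for concept drift).
stream = [0, 0, 0, 10, 10, 10]

model = ForgettingMean(alpha=0.5)
for x in stream:
    model.update(x)               # one pass, constant memory, no stored history

batch_mean = sum(stream) / len(stream)
```

After the jump the incremental estimate moves towards the new level, while the batch mean over the whole history stays halfway between the old and the new concept.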
Areas related to Big Data Analysis
Analysis of Big Data is grounded in Knowledge Discovery in Databases and Data Mining. However, new names, used by different people, have appeared:
Ubiquitous Knowledge Discovery
Reality Mining
UBIQUITOUS KNOWLEDGE DISCOVERY
data mining in mobile systems, wireless communication networks, calm technologies,
distributed architectures: distributed data mining, grid, P2P, autonomic computing, agents,
learning components: statistical learning (incl. online learning), evolutionary computing, anytime algorithms,
data types: spatio-temporal, stream, multimedia,
security and privacy: privacy preserving data mining, intrusion detection,
HCI and cognitive modelling: user interfaces of ubiquitous discovery systems.
EU-funded project KDUbiq (2005-2008, FP6 FET IST)
Knowledge discovery process in mobile, distributed, dynamic environments, in presence of massive amounts of data
REALITY MINING
Tackles some of the most challenging data mining problems: scaling up for high-dimensional data and high-speed streams; mining sequence data and time series data; data mining in a network setting
Collection and analysis of machine-sensed environmental data pertaining to human social behavior, with the goal of identifying predictable patterns of behavior. (Pentland 2004)
Example reality mining projects
Complex social systems
Public health and medicine
Traffic monitoring and control
Smart homes and ambient assisted living
Environmental monitoring
Complex social systems (1/3)
Data from mobile phones of MIT students and researchers used to analyze collective human behaviour
Proximity pattern (left) and inferred friendship network (right)
Eagle and Pentland, 2004
Complex social systems (2/3)
Same data as before, used to investigate how people’s social relations affect their encounters (friends or strangers)
Miklas et al., 2007
Number of encounters in two-week period (left), and number of pairs of people and number of encounters for friends and strangers (right)
Complex social systems (3/3)
Daily travel patterns of 14 816 521 individuals across Kenya, used to study human mobility
Wesolowski et al., 2013
Relation between mobility and expenditure (left), and between income and expenditure (right)
Public health and medicine
Data from mobile phones used to study the role of human mobility in the dissemination of malaria in Kenya
Buckee et al., 2011
The parasite rate (left), and locations of mobile phone towers overlaid on settlement and parasite rate maps (right)
Traffic monitoring and control (1/2)
Analyze data from GPS-enabled mobile phones as a proof of concept for a traffic monitoring system
Herrera et al., 2010
Snapshot of Mobile Millennium Traffic in San Francisco and the Bay Area
Traffic monitoring and control (2/2)
Data on taxi locations and booking requests in a GPS-enabled taxi dispatch system in Singapore
Santani et al., 2008
Taxi Observations by Location and Booking Frequency of Zone
https://www.novinky.cz/ekonomika/412841-prvni-samoridici-taxiky-vozi-v-singapuru-zakazniky.html
Smart homes and ambient assisted living
Monitor and analyze data from mobile phones, wearable sensors and other devices integrated in the residential infrastructure
O’Grady et al., 2010
A generic scheme of Ambient Assistive Living Systems
Environmental monitoring
NoiseTube: a low-cost approach involving the general public to monitor noise pollution using their mobile phones as noise sensors
Maisonneuve et al., 2010
Collective noise map for part of Paris
https://play.google.com/store/apps/details?id=net.noisetube
Lessons from reality mining projects
Using mobile phones we can gather significantly larger and more reliable data sets than by querying the users
Mobile phones are a cheap alternative to more complex sensor systems
The data analysis often does not go beyond data description and summarization tasks