View
2.887
Download
2
Embed Size (px)
DESCRIPTION
Presentation at the AAAI 2013 Fall Symposium on Semantics for Big Data, Arlington, Virginia, November 15-17, 2013 Additional related material at: http://wiki.knoesis.org/index.php/Smart_Data Related paper at: http://www.knoesis.org/library/resource.php?id=1903 Abstract: We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the five V's of Big Data, where four of the Vs are harnessed to produce the fifth V - value. To handle the challenge of Volume, we advocate semantic perception that can convert low-level observational data to higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we resort to the use of semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics capability to dynamically create event or situation specific models and recognize new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. The above four Vs of Big Data are harnessed by the semantics-empowered analytics to derive Value for supporting practical applications transcending physical-cyber-social continuum.
Citation preview
Semantics-empowered Big Data Processing for PCS ApplicationsKrishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435
Prasad 2
Outline
• 5 V’s of Big Data Research
• Semantic Perception for Scalability
• Lightweight semantics to manage heterogeneity – Cost-benefit trade-off and continuum
• Hybrid Knowledge Representation and Reasoning– Anomaly, Correlation, Causation
11/15/2013
Prasad 3
5V’s of Big Data Research
Volume
Velocity
Variety
Veracity
Value
11/15/2013
Big Data => Smart Data
Prasad 4
Volume : Assorted Examples
• 25+ billion sensors have been deployed.
• About 250TB of sensor data are generated for a NY-LA flight on Boeing 737.
• Parkinson disease dataset that tracked 16 patients with mobile phone using 7 sensors over 8 weeks is 12GB.
Check engine light analogy11/15/2013
Prasad 5
Volume : Semantic Perception
• Abstracting machine-sensed data – E.g., fine-grained to coarse-grained– E.g., average, peak, rate of change
• Extracting human-comprehensible features/entities• Machine perception
– Derive conclusions using domain models
and hybrid abductive/deductive reasoning
Goal: Human accessible situational awareness and actionable intelligence for decision making
11/15/2013
Prasad 6
Weather Use Case
• Machine-sensed phenomenon– temperature, precipitation, humidity, wind speed, etc.
• Human perceived features– blizzard, flurry, rain storm, clear, etc.– categories of hurricanes (SSHWS)
• Machine perception– Using domain models from NOAA
• Ultimately, generate weather alerts …
11/15/2013
Prasad 7
Parkinson’s Disease Use Case
• Data from machine-sensors– accelerometer, GPS, compass, microphone, etc.
• Human perceived features– tremors, walking style, balance, slurred speech, etc.
• Machine perception– Using domain models to be created to diagnose and
monitor disease progression
• Ultimately, recommend options to control chronic conditions …
11/15/2013
Prasad 8
Heart Failure Use Case
• Machine-sensed data – weight, heart rate, blood pressure, oxygen level, etc.
• Human perceived features– Risk-level for hospital readmission of CHF/ADHR patient
• Machine perception– Using domain models to be created to monitor heart
condition of a cardiac patient post hospital discharge
• Ultimately, recommend treatments to reduce preventable hospital readmissions …
11/15/2013
Prasad 9
Asthma Use Case
• Data from machine-sensors– Environmental sensors, physiological sensors, etc.
• Human perceived features– Asthma severity gleaned from frequency of asthma
attacks, wheezing, coughing, sleeplessness, etc.
• Machine perception– Using domain models to be created to monitor asthma
patients and their surroundings
• Ultimately, recommend prevention and control options …
11/15/2013
Prasad 10
Traffic Use Case
• Data from machine-sensors, social media stream, and planned event schedules– Traffic flow sensors : link speed, link volume, Event-
specific tweets, etc.
• Human perceived features– traffic delays and congestion, etc.
• Machine perception– Using domain models to be created to understand traffic
patterns in response to events
• Ultimately, recommend traffic management options …
11/15/2013
Slow moving traffic
Link Description
Scheduled Event
Scheduled Event
511.org
511.org
Schedule Information
511.org
Traffic Monitoring
11
Heterogeneity in a Physical-Cyber-Social System
Prasad 12
Volume with a Twist
Resource-constrained reasoning on mobile-devices
Goal: Boolean encodings to ensure feasibility, efficiency, and economy
11/15/2013
13* based on Neisser’s cognitive model of perception
ObserveProperty
PerceiveFeature
Explanation
Discrimination
1
2
Perception Cycle* that exploits background knowledge / domain models
Abstracting raw data for human
comprehension
Focus generation for disambiguation and action(incl. human in the loop)
Prior Knowledge
Virtues of Our Approach to Semantic Perception
Blends simplicity, effectiveness, and scalability.
• Declarative specification of explanation and discrimination;
• With applications (e.g., to healthcare) that are of contemporary relevance and interdisciplinary;
• Using encodings/algorithms that are significant (asymptotic order of magnitude gain) and necessary (“tractable” due to time/memory reduction for typical problem sizes); and
• Prototyped using extant PCs and mobile devices.
O(n3) < x < O(n4) O(n)
Efficiency Improvement
• Problem size increased from 10’s to 1000’s of nodes• Time reduced from minutes to milliseconds• Complexity growth reduced from polynomial to
linear
Evaluation on a mobile device
15
Prasad 16
Volume and Velocity
• Lightweight semantics-based Adaptive/Continuous Filtering
E.g.,: Track evolution of crowd-sourced and verified Wikipedia event pages for relevance ranking of Twitter hashtags in Disaster response use-case
• Building domain models dynamically
11/15/2013
Heliopolis is a suburb of
Cairo.
Dynamic Model Creation
Continuous Semantics 17
Variety
Syntactic and semantic heterogeneity • in textual and sensor data, • in (legacy) materials data• in (long tail) geosciences data
Idea: Semantics-empowered integration
11/15/2013 Prasad 18
Prasad 19
Variety (What?): Materials/Geosciences Use Case
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents (e.g., Publications and technical specs, which usually include text, numerics, maps and images)
• Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)
11/15/2013
20
Variety (How?/Why?): Granularity of Semantics & Applications
• Lightweight semantics: File and document-level annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and extraction for semantic search and summarization
• Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Data
Cost-benefit trade-off and continuum
Prasad 21
Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption • Irregular :
– Not simple rectangular grid• Heterogeneous
– All rows not interpreted similarly• Complex
– Meaning of each row and each column context dependent • Footnotes modify meaning of entries (esp. in materials
and process specifications)
11/15/2013
22
Prasad 23
Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that can be used to formalize tabular information.– Provide a natural expression of data – Provide semantics to data, thereby removing potential
ambiguities– Enable automatic translation
• USE: Manual population of regular tables and automatic translation into LOD
11/15/2013
Variety (What?) : Sensor Data Use Case
Develop/learn domain models to exploit complementary and corroborative information
• To relate patterns in multimodal data to “situation”
• To integrate machine sensed and human sensed data
11/15/2013 Prasad 24
Variety: Hybrid KRR
Blending data-driven models with declarative knowledge – Data-driven: Bottom-up, correlation-based,
statistical– Declarative: Top-down, causal/taxonomical,
logical– Refine structure to better estimate parameters
E.g., Traffic Analytics using PGMs + KBs
11/15/2013 Prasad 25
Variety (Why?): Hybrid KRR
Data can help compensate for our overconfidence in our own intuitions and reduce the extent to which our desires distort our perceptions.
-- David Brooks of New York Times
However, inferred correlations require clear justification that they are not coincidental, to inspire confidence.
11/15/2013 Prasad 26
• Correlations due to common cause or origin
• Coincidental due to data skew or misrepresentation
• Coincidental new discovery
• Strong correlation vs causation
• Anomalous and accidental
• Correlation turning into causations
Correlations vs Causation vs Anomalies
11/15/2013 Prasad 27
• Correlations Due to common cause or origin– E.g., Planets: Copernicus > Kepler > Newton > Einstein
• Coincidental due to data skew or misrepresentation – E.g., Tall policy claims made by politicians!
• Coincidental new discovery– E.g., Hurricanes and Strawberry Pop-Tarts Sales
• Strong correlation vs causation– E.g., Spicy foods vs Helicobacter Pyroli : Stomach Ulcers
• Anomalous and accidental– E.g., CO2 levels and Obesity
• Correlation turning into causations– E.g., Pavlovian learning: conditional reflex
Correlations vs Causation vs Anomalies
11/15/2013 Prasad 28
• Correlations Due to common cause or origin– E.g., Planets: Copernicus > Kepler > Newton > Einstein
• Coincidental due to data skew or misrepresentation – E.g., Tall policy claims made by politicians!
• Coincidental new discovery– E.g., Hurricanes and Strawberry Pop-Tarts Sales
• Strong correlation vs causation– E.g., Spicy foods vs Helicobacter Pyroli : Stomach Ulcers
• Anomalous and accidental– E.g., CO2 levels and Obesity
• Correlation turning into causations– E.g., Pavlovian learning: conditional reflex
Paradoxes: The Seeds of Progress
Correlations vs Causation vs Anomalies
11/15/2013 Prasad 29
Veracity
Lot of existing work on Trust ontologies, metrics and models, and on Provenance tracking
• Homogeneous data: Statistical techniques• Heterogeneous data: Semantic models
Open Problem: Develop semantics of trust using expressive frameworks that are both declarative and computational • To make explicit all aspects that go into trust
formation, to inspire confidence in inferences
11/15/2013 Prasad 30
Veracity
Machine sensing: objective, quantitative,
but prone to environmental effects, battery life, …
Human sensing: subjective, qualitative,
but prone to bias, perceptual errors, rumors, …
Open problem: Improving trustworthiness by combining machine sensing and human sensing– E.g., 2002 Überlingen mid-air collision :Pilot incorrectly
using Traffic controller advice over electronic TCAS system recommendation
11/15/2013 Prasad 31
(More on) Value
Learning domain models from “big data” for prediction
E.g., Harnessing Twitter "Big Data" for Automatic Emotion Identification
Idea: Exploit “emotion-hashtagged” tweets as training dataset
11/15/2013 Prasad 32
(More on) Value
Discovering gaps and enriching domain models using data
E.g., Data driven knowledge acquisition method for domain knowledge enrichment in the healthcare
Idea: Use associations between diseases, symptoms and medications in EMR documents
11/15/2013 Prasad 33
Prasad 34
Conclusions
• Glimpse of our research organized around
the 5 V’s of Big Data• Discussed role in harnessing Value
– Semantic Perception (Volume)– Continuum of Semantic models to manage
Heterogeneity (Variety)– Hybrid KRR: Probabilistic + Logical (Variety)– Continuous Semantics (Velocity)– Trust Models (Veracity)
11/15/2013
Prasad35
thank you, and please visit us at
http://knoesis.org/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled ComputingWright State University, Dayton, Ohio, USA
Kno.e.sis
11/15/2013
Special Thanks to: Pramod Anantharam and Cory Henson