Knowledge Streams: Stream Processing of Semantic Web Content
Mike Dean
Principal Engineer
Raytheon BBN Technologies
[email protected]
Assumptions
• Technology – Intermediate
– Familiarity with RDF and OWL
• Interest in
– Stream processing
– Scalability
Presenter Background
• Principal Engineer at Raytheon BBN Technologies (1984-present)
• Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005)
– Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL
• Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present)
• Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups
– Co-editor of the W3C OWL Reference
• Local co-chair for ISWC 2009
• Other SemTech presentations
– Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher)
– Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher)
– Use of SWRL for Ontology Translation (2008)
– Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)
– How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus (2009)
– Finding a Good Ontology: The Open Ontology Repository Initiative (2010, w/ Peter Yim and Todd Schneider)
Outline
• Motivation
• Vision
• Building Blocks
• Demonstration
Motivations
• Timeliness
• Performance
Timeliness
• Streaming minimizes latency
– Processing elements see events as they occur
– Resources are expended only when an event occurs
• This is in contrast to polling
– Latency averages half the polling interval
– Resources are expended on every poll
– Popular web syndication mechanisms such as RSS and Atom involve polling
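The half-interval figure follows if events are assumed to arrive uniformly within a polling period: an event at offset t waits T - t for the next poll, which averages to T/2. A minimal Python check of that arithmetic is sketched here; the function name and its parameters are illustrative and do not come from the talk.

```python
def mean_polling_latency(poll_interval, n_events=10_000):
    """Average wait until the next poll for events spread evenly over one interval."""
    total = 0.0
    for i in range(n_events):
        # Assumption: event i occurs at the midpoint of the i-th of n_events equal slots.
        event_time = (i + 0.5) * poll_interval / n_events
        # The event is only noticed at the next poll, at time poll_interval.
        total += poll_interval - event_time
    return total / n_events


# With a 60-second polling interval the mean latency comes out at about 30 seconds.
latency = mean_polling_latency(60.0)
```

Under a push model the same event would be seen after only the delivery delay, independent of any polling interval.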
Performance
• Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access
– Analogous to XML SAX vs. DOM
• For suitable applications, this can be 10x faster than loading all statements into memory or a KB
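The SAX vs. DOM analogy can be illustrated with Python's standard library. This is only a sketch of the two access styles on a toy document: the `DOC` sample and the `stmt` element name are made up for illustration, and it makes no claim about the 10x figure.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical toy document; "stmt" is an illustrative element name.
DOC = b"<root><stmt/><stmt/><stmt/></root>"


class StatementCounter(xml.sax.ContentHandler):
    """Event-based handler: sees each element as it is parsed, keeps only a counter."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "stmt":
            self.count += 1


def count_streaming(data):
    # SAX style: no tree is built; memory use does not grow with document size.
    handler = StatementCounter()
    xml.sax.parseString(data, handler)
    return handler.count


def count_in_memory(data):
    # DOM style: the whole document is loaded into a tree before it is queried.
    dom = xml.dom.minidom.parseString(data)
    return len(dom.getElementsByTagName("stmt"))
```

Both functions return the same count; the difference is that the streaming version never holds the full document model in memory.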
2 Streaming Stories
• dumpont of OpenCyc (circa 2003)
– HTML-based ontology visualization tool periodically bogged down daml.org server
– Reimplementation using event-based Jena ARP parser yielded 10x performance and scalability improvements
• Billion Triples Challenge 2009
– Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk
– Compare to loading 10-20K statements/second on a server
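As a rough illustration of what a streaming analysis can look like, the following Python sketch tallies predicates from N-Triples input one line at a time. It is not the code used for the Billion Triples Challenge analysis; the sample data and the `count_predicates` name are assumptions, and the line splitting relies on the fact that N-Triples subjects and predicates contain no whitespace.

```python
import io
from collections import Counter

# Hypothetical sample input in N-Triples syntax.
SAMPLE = """\
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""


def count_predicates(lines):
    """Tally predicate usage while holding only one statement in memory at a time."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate are the first two whitespace-separated tokens.
        _subject, predicate, _rest = line.split(None, 2)
        counts[predicate] += 1
    return counts


# Any iterable of lines works, including an open file or a decompression stream.
counts = count_predicates(io.StringIO(SAMPLE))
```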
Stream Processing Examples
• Unix pipes
• Dataflow architectures
• StreamBase
• IBM System S/InfoSphere Streams
Vision: Knowledge Streams
[Diagram: three stages labeled Data Sources, Distribution And Processing Elements, and Users. Labels appearing in the diagram: aggregation, persistent queries, augmentation, context filter, alerts, correlation, translation, inference, distribution; CEP, NLP, Archive; Sensor Network, Imagery, RSS, IM, Gazetteer, Sensor, Semantic Web, Database; User 1, User 2, User 3; Community of Interest 1, Community of Interest 2.]
Persistent pipelines
• Streams of statements comprising object subgraphs
• URI naming allows drill-down
• Provenance, timestamps
Processing elements
• Consume and produce subgraphs
• Multiple functions may be combined
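One way to picture processing elements that consume and produce subgraphs is as chained Python generators. The sketch below is an assumption-laden illustration, not the architecture from the talk: subgraphs are modeled as lists of (subject, predicate, object) tuples, and the `context_filter` and `augment` functions, along with the `ex:` names, are hypothetical.

```python
def context_filter(subgraphs, wanted_predicate):
    """Pass along only subgraphs containing a statement with the given predicate."""
    for graph in subgraphs:
        if any(p == wanted_predicate for _s, p, _o in graph):
            yield graph


def augment(subgraphs, predicate, value):
    """Append one statement about the first subject of each (non-empty) subgraph."""
    for graph in subgraphs:
        subject = graph[0][0]
        yield graph + [(subject, predicate, value)]


# Hypothetical source stream of two small object subgraphs.
source = [
    [("ex:img1", "rdf:type", "ex:Image"), ("ex:img1", "ex:region", "ex:Boston")],
    [("ex:msg1", "rdf:type", "ex:Message")],
]

# Elements compose like Unix pipes: each consumes a stream and produces a stream.
pipeline = augment(context_filter(iter(source), "ex:region"), "ex:reviewed", "false")
results = list(pipeline)
```

Because each stage is lazy, a subgraph flows through the whole chain as soon as it arrives, which matches the event-at-a-time behavior described under Timeliness.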
Goals
• Web-scale
– Decentralized among multiple sites
– Heterogeneous implementations
• Long-lived, persistent connections
– User accountability
• Introspection over the processing network for control and optimization
– E.g. aggregating subscriptions
– Balance with security, privacy, and autonomy concerns
Building Blocks
• RDF Content
• Existing stream processing frameworks
• Workflow systems
• Publish/subscribe message-oriented middleware
RDF Payloads
• Malleable data
– Standards-based graph structure
– Can easily add, remove, and transform statements
• Self-describing
– Unique naming via URIs
– References to vocabularies and ontologies
• Potential for inference
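The "malleable data" point can be shown with a deliberately simple model: an RDF graph treated as a Python set of (subject, predicate, object) tuples. This is a sketch under that assumption, not a real RDF library, and the `ex:` names are placeholders.

```python
# Assumption: a graph is modeled as a plain set of (subject, predicate, object) tuples.
graph = {
    ("ex:doc1", "ex:label", "Quarterly report"),
    ("ex:doc1", "ex:draft", "true"),
}

# Add a statement.
graph.add(("ex:doc1", "rdf:type", "ex:Document"))

# Remove a statement.
graph.discard(("ex:doc1", "ex:draft", "true"))

# Transform statements: rewrite one predicate, leave everything else untouched.
graph = {
    (s, "rdfs:label" if p == "ex:label" else p, o)
    for s, p, o in graph
}
```

No schema migration is needed for any of these steps, since each statement stands on its own.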
Workflow Systems
• Graphical environments for developing processing pipelines
– Yahoo Pipes, DERI Pipes, SPARQLMotion
– Nice user interfaces for development and execution
http://pipes.deri.org
Semantic Complex Event Processing
• Complex Event Processing
– One of the leading edges of rules technology
– Formal specification of higher-level events in terms of lower-level events (e.g. alert if the moving average increases 15% within a 10 minute window)
– Engine can be compiled/optimized for a specific rule set
– High-volume deployments in finance and other industries
– Most implementations focus on self-contained tuples
• Semantic Complex Event Processing
– Enrich CEP using Semantic Web technology
– Emerging topic at recent conferences
• Early implementations
– Wrappers around open source CEP engines
– Native implementation
• Provides a powerful set of operators and engines for Knowledge Streams
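The moving-average rule mentioned above can be hand-coded in Python to show what a CEP engine would derive from a declarative rule. This is one possible reading, stated as an assumption: the moving average covers the last `span` readings, and an alert fires when the current average is at least 15% above the lowest average seen in the preceding 10 minutes. The class and parameter names are hypothetical.

```python
from collections import deque


class MovingAverageAlert:
    """Raise a higher-level alert from a stream of lower-level (timestamp, value) events."""

    def __init__(self, span=3, window_seconds=600, threshold=0.15):
        self.values = deque(maxlen=span)  # last `span` raw readings
        self.averages = deque()           # (timestamp, moving average) within the window
        self.window_seconds = window_seconds
        self.threshold = threshold

    def observe(self, timestamp, value):
        """Process one event; return True if the alert condition holds."""
        self.values.append(value)
        average = sum(self.values) / len(self.values)
        # Forget averages that fall outside the time window.
        while self.averages and timestamp - self.averages[0][0] > self.window_seconds:
            self.averages.popleft()
        # Baseline is the lowest earlier average still inside the window.
        baseline = min((a for _t, a in self.averages), default=average)
        self.averages.append((timestamp, average))
        return baseline > 0 and average >= baseline * (1 + self.threshold)


detector = MovingAverageAlert(span=2)
events = [(0, 100), (60, 100), (120, 120), (180, 130), (2000, 126)]
alerts = [detector.observe(t, v) for t, v in events]
```

In this run only the fourth event triggers an alert; the last event arrives after the earlier averages have aged out of the window, so it has no baseline to exceed.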
Implementation Approach
• Well-defined APIs for implementing operators
• Operator execution containers
– Could encapsulate existing engines
• Start with manual processing network configuration, then automate
Use Cases
• Dissemination of metadata for new satellite imagery
• Social network changes
• Alerting of friends’ new publications
• …
Demo
• Processing using DERI Pipes with new operators
– Ingest of #SemTechBiz tweets using Twitter Streaming API
– Conversion of JSON to RDF
– Mapping to SIOC vocabulary using SWRL rules
– Enrich by matching Twitter @handles with contacts
– Persistent buffering using Java Message Service
– Monitoring
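The JSON-to-RDF and SIOC mapping steps might look roughly like the Python sketch below. It is not the demo's implementation: the demo performs the mapping with SWRL rules inside DERI Pipes, whereas this sketch hard-codes the mapping in a hypothetical `tweet_to_triples` function. The input is a trimmed, made-up tweet, the resource URLs are illustrative, and both URIs and literals are represented as plain strings.

```python
import json

SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def tweet_to_triples(raw_json):
    """Convert one tweet (JSON text) into (subject, predicate, object) tuples."""
    tweet = json.loads(raw_json)
    handle = tweet["user"]["screen_name"]
    # Illustrative URIs for the post and its author.
    author = f"https://twitter.com/{handle}"
    post = f"{author}/status/{tweet['id_str']}"
    return [
        (post, RDF_TYPE, SIOC + "Post"),
        (post, SIOC + "content", tweet["text"]),
        (post, SIOC + "has_creator", author),
        (author, RDF_TYPE, SIOC + "UserAccount"),
    ]


# Made-up sample payload with only the fields this sketch reads.
raw = '{"id_str": "1", "text": "Hello #SemTechBiz", "user": {"screen_name": "example"}}'
triples = tweet_to_triples(raw)
```

Expressing the same mapping as rules rather than code keeps it declarative, so it can be changed without redeploying the operator.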