A Higher-Order Data Flow Model for Heterogeneous Big Data
Simon Price and Peter Flach
Outline of this presentation
1. Introduction
2. JSONMatch
3. Data Flow Model
4. Example
5. Summary
2. JSONMatch
JSONMatch
JSON is the de facto data format for Web 2.0 and mobile apps.
JSON is the 'record' in many NoSQL databases.
JSONMatch compares the similarity of JSON documents.
Use case: interactive web applications for profiling and matching Big (Variety) Data.
http://jsonmatch.com
JSONMatch
• A web service for analyzing and integrating data from heterogeneous sources in these formats:
  • JSON (default)
  • CSV
  • HTML
  • RDF
  • XML
  • YAML
  • Plain text
  • Prolog terms
  • Weka ARFF machine learning datasets
JSONMatch
• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.
• Processes data using data flows defined dynamically in JSON using the REST API.
• Aims to produce results:
  o quickly for small datasets
  o eventually for larger datasets.
3. Data Flow Model
Data Model
• Each dataset is a relation. E.g. S
• Each relation is a set of key-value pairs. E.g. S1, S2, ..., Sn
• Values can be 'unstructured', semi-structured or structured data.
• In JSONMatch: value = JSON document
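As a rough illustration of this data model, a relation can be pictured as a Python dict whose values are JSON documents. This is only a sketch: the keys, the sample values and the helper is_json_document are made up for the example and are not taken from JSONMatch.

```python
import json

# A relation modelled as a mapping from key to value.
# Keys and values here are invented purely for illustration.
S = {
    "k1": {"person": {"title": "Dr", "name": "A. Example"}},  # structured
    "k2": ["free text", 42],                                  # semi-structured
    "k3": "an unstructured string",                           # 'unstructured'
}

def is_json_document(value):
    """Return True if value can be serialised as JSON."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False

all_valid = all(is_json_document(v) for v in S.values())
```

Any value that serialises to JSON fits the model, whether it is a nested object, an array or a bare string.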
Example Data Flow
w = Φ3(Φ1(s), Φ2(t))
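The composition w = Φ3(Φ1(s), Φ2(t)) can be mimicked with ordinary function calls. In this sketch the three transformations (phi1, phi2, phi3) and the sample relations are hypothetical stand-ins chosen only to make the nesting visible; the slides do not say what Φ1, Φ2 and Φ3 actually compute.

```python
def phi1(s):
    # Hypothetical transformation: upper-case every value of s.
    return {k: v.upper() for k, v in s.items()}

def phi2(t):
    # Hypothetical transformation: replace every value of t by its length.
    return {k: len(v) for k, v in t.items()}

def phi3(a, b):
    # Hypothetical transformation: combine items whose keys occur in both inputs.
    return {k: {"text": a[k], "length": b[k]} for k in a.keys() & b.keys()}

s = {"p1": "data flow", "p2": "json"}
t = {"p1": "higher-order", "p3": "model"}

# The outputs of phi1 and phi2 are themselves relations, so they can feed phi3.
w = phi3(phi1(s), phi2(t))
```

Because every transformation maps relations to a relation, data flows of any depth can be built by nesting in this way.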
Another Example Data Flow
Higher-Order Transformation
v = Φ(g)(h)(s, t, u, ...)
Function Φ transforms relations s,t,u,... into relation v.
Functions g and h are the higher-order parameters.
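The curried form v = Φ(g)(h)(s, t, u, ...) maps naturally onto nested closures. The sketch below is one possible reading, not the JSONMatch implementation: it assumes g yields a key plus the selected input items, and that h is an ordinary function applied to those items. The names by_shared_key and merge are hypothetical.

```python
def phi(g):
    """Take generator g, then template h, then the input relations."""
    def with_template(h):
        def transform(*relations):
            # g selects the "current" items; h turns them into one output item.
            return {key: h(*items) for key, items in g(*relations)}
        return transform
    return with_template

def by_shared_key(*relations):
    # Hypothetical generator: select items that share a key in every input.
    shared = set(relations[0]).intersection(*relations[1:])
    for key in sorted(shared):
        yield key, tuple(r[key] for r in relations)

def merge(*items):
    # Hypothetical template: merge the selected items into one document.
    merged = {}
    for item in items:
        merged.update(item)
    return merged

s = {"a": {"name": "Ann"}, "b": {"name": "Bob"}}
t = {"a": {"dept": "CS"}, "c": {"dept": "Maths"}}

v = phi(by_shared_key)(merge)(s, t)
```

Fixing g and h first gives a reusable transformation that can then be applied to different input relations.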
Generator Function (g)
• Choose one of three:
  o Map
  o Product
  o Lambda
Generator Function (g=map)
Generator Function (g=product)
Generator Function (g=lambda)
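The three generator slides were diagrams and their content did not survive in this transcript, so the following is only a guess at the intended behaviour. It assumes map steps through the input relations in parallel, product yields every combination of items, and lambda lets the caller supply a selection predicate (here applied over the product). All function names are hypothetical.

```python
from itertools import product

def _rows(combine, relations):
    """Yield (keys, items) for each combination chosen by `combine`."""
    for row in combine(*(r.items() for r in relations)):
        yield tuple(k for k, _ in row), tuple(v for _, v in row)

def g_map(*relations):
    # Assumed semantics: one item from each relation, taken in step.
    return _rows(zip, relations)

def g_product(*relations):
    # Assumed semantics: Cartesian product of the input relations.
    return _rows(product, relations)

def g_lambda(select):
    # Assumed semantics: keep only the combinations accepted by `select`.
    def generator(*relations):
        return (row for row in g_product(*relations) if select(*row[1]))
    return generator

s = {"s1": 1, "s2": 2}
t = {"t1": 10, "t2": 20}

mapped = list(g_map(s, t))
crossed = list(g_product(s, t))
filtered = list(g_lambda(lambda a, b: a * 10 == b)(s, t))
```

With a single input relation, g_map simply visits each item once, which matches the g=map case used in the template example.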
Template Function (h)
• Template data item with embedded functions that are expanded by Φ to produce an output item.
• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.
• The embedded functions use JSONPath expressions (i.e. simplified XPath for JSON) to access sub-parts of the input items.
  • $.person.title
  • $.person.paper[*].author[0].name
  • $[0][3][1].foo
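To make the three path expressions concrete, here is a minimal evaluator for just the syntax they use: .name, [index] and [*]. It is a sketch for illustration, not a full JSONPath implementation and not the one JSONMatch uses; the function name jsonpath and the sample document are assumptions.

```python
import re

# One path step: .name, [digits] or [*]
TOKEN = re.compile(r"\.(\w+)|\[(\d+)\]|\[(\*)\]")

def jsonpath(path, doc):
    """Return the list of values matched by a small JSONPath subset."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    rest, pos, nodes = path[1:], 0, [doc]
    while pos < len(rest):
        m = TOKEN.match(rest, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at: {rest[pos:]!r}")
        name, index, star = m.groups()
        if name is not None:      # .name selects an object member
            nodes = [n[name] for n in nodes if isinstance(n, dict) and name in n]
        elif index is not None:   # [i] selects one array element
            i = int(index)
            nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
        else:                     # [*] selects every array element
            nodes = [x for n in nodes if isinstance(n, list) for x in n]
        pos = m.end()
    return nodes

doc = {"person": {"title": "Dr", "paper": [
    {"author": [{"name": "A"}, {"name": "B"}]},
    {"author": [{"name": "C"}]},
]}}

title = jsonpath("$.person.title", doc)
first_authors = jsonpath("$.person.paper[*].author[0].name", doc)
nested = jsonpath("$[0][3][1].foo", [[0, 1, 2, [0, {"foo": "bar"}]]])
```

A path always returns a list here because [*] can match several values; a path without a wildcard yields a list of at most one element.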
Example JSONMatch template data item (h)
• One input relation S. Each item si is an array like this.
[ "Ad Feelders",
  "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
  "Rankings_and_Partial_Orders",
  "Active_Learning; Bioinformatics; ..." ]
• g=map and h is:
{ "name": "$.items[0][0]",
  "url": "$.items[0][1]",
  "text": ["jm:http_get", "$.items[0][1]"],
  "primary": "$.items[0][2]",
  "keywords": ["jm:split", ";", "$.items[0][3]"] }
• Output relation V has items vi like this.
{ "name": "Ad Feelders",
  "url": "http://dblp.uni-trier.de/...",
  "text": "<html><title>A. J. Feeld...</html>",
  "primary": "Rankings_and_Partial_Orders",
  "keywords": [ "Active_Learning", "Bioinformatics", ... ] }
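The expansion of a template item like the one above can be sketched in a few lines. Everything here is an assumption made for illustration: the expand function, the handling of only $.items[i][j] paths, the whitespace trimming in jm:split, and the stand-in for jm:http_get, which returns a fixed string so the sketch runs without network access. The input item is invented rather than copied from the slide.

```python
import re

PATH = re.compile(r"\$\.items\[(\d+)\]\[(\d+)\]")

def fake_http_get(url):
    # Stand-in for jm:http_get; a real version would fetch the page.
    return f"<html>fetched {url}</html>"

def split(sep, text):
    # Assumed behaviour for jm:split: split, then trim surrounding whitespace.
    return [part.strip() for part in text.split(sep) if part.strip()]

FUNCTIONS = {"jm:split": split, "jm:http_get": fake_http_get}

def expand(template, items):
    """Recursively expand template h against the current input items."""
    if isinstance(template, str):
        m = PATH.fullmatch(template)
        return items[int(m.group(1))][int(m.group(2))] if m else template
    if isinstance(template, list):
        if template and isinstance(template[0], str) and template[0] in FUNCTIONS:
            args = [expand(arg, items) for arg in template[1:]]
            return FUNCTIONS[template[0]](*args)
        return [expand(part, items) for part in template]
    if isinstance(template, dict):
        return {key: expand(value, items) for key, value in template.items()}
    return template

h = {
    "name": "$.items[0][0]",
    "url": "$.items[0][1]",
    "text": ["jm:http_get", "$.items[0][1]"],
    "primary": "$.items[0][2]",
    "keywords": ["jm:split", ";", "$.items[0][3]"],
}

# Hypothetical input relation with a single item.
S = {"r1": ["Ada Example", "http://example.org/ada", "Topic_A", "Topic_B; Topic_C"]}

# g=map over one relation: each item in turn becomes items[0].
V = {key: expand(h, [item]) for key, item in S.items()}
```

Because the template is itself plain JSON, an application can send a different h at runtime and obtain a different output shape without any change on the server side.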
4. Example
SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website: http://subsift.ilrt.bris.ac.uk
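The slides do not state which similarity measure is used for matching. Purely as an illustration of the idea, the sketch below ranks reviewers for a paper by cosine similarity over bag-of-words vectors; the measure, the function names and the sample texts are all assumptions.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def rank_reviewers(paper, reviewers):
    """Return (reviewer, score) pairs, most similar first."""
    pv = vector(paper)
    scores = {name: cosine(pv, vector(text)) for name, text in reviewers.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Invented data: reviewer name -> text of their published works.
reviewers = {
    "r1": "ranking partial orders learning",
    "r2": "protein folding bioinformatics",
}
ranked = rank_reviewers("learning to rank with partial orders", reviewers)
```

In a data flow, the reviewer texts would come from one relation and the submitted papers from another, with the pairing done by a product-style generator.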
Recreating SubSift in JSONMatch
• All the nice features of SubSift are preserved.
• JSONMatch implementation adds other advantages:
  • Functionality defined by application as data flow at runtime.
  • REST API much smaller and simpler because functionality defined in item template h.
  • Does not require a separate web harvester robot.
  • External web services can be embedded in data flow.
  • Handles much larger numbers of reviewers and papers.
5. Summary
Higher-Order Data Flow Model
Concise formalism for Big Variety data flows specified dynamically from interactive web applications.
JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com
Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.
Get in touch: http://simonprice.info