Steffen [email protected]
2WeST
Do you want more free Web Science education on the Web?
Vote for our course at https://moocfellowship.org/ now!
Web Science & Technologies
University of Koblenz ▪ Landau, Germany
The Challenges of Building Interoperable Web Observatories
http://wow.west.webobservatory.org/
Steffen Staab
What to observe?
[Figure: what can be observed on the Web. Diagram labels: Produce, Consume; Cognition, Emotion, Behavior, Socialisation, Knowledge; Observable micro-interactions in the Web; Apps, Protocols, Data & Information, Governance; WWW; Observable macro-effects in the Web.]
Why to observe?
Understanding
Collecting
Describing
Analyzing
Modeling
Predicting
Repeating!
What to observe?
[Figure: the same diagram as before, annotated with two ways of observing: Web Crawling and Usage Logging.]
Challenges – Data Collection Issues
Legal and/or Ethical
  Crawling
    May be disallowed by provider
  Usage logging
    Privacy of individuals
Even if it is allowed...
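One machine-readable way a provider can state that crawling is disallowed is a robots.txt file. The following is a minimal sketch, assuming the provider publishes such a file; the helper name `allowed_by_robots`, the user agent `wow-crawler` and the example rules are hypothetical and not part of the talk. Passing this check does not settle the legal or ethical question; it only respects the provider's stated wishes.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/public/page", "/private/data"):
        url = "https://example.org" + path
        print(url, allowed_by_robots(EXAMPLE_ROBOTS, "wow-crawler", url))
```

In a real crawler the file content would be downloaded from the site first; it is passed in as a string here so the check itself stays easy to test.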
Challenges – Data Collection Issues
Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
    • Unreachability
    • Time outs
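Because unreachable hosts and time outs make every crawl incomplete, it helps to record what was missed alongside what was collected. The sketch below is one possible way to do that; `CrawlReport`, `crawl` and `fake_fetch` are hypothetical names, and the fetch function is injected so that the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class CrawlReport:
    """Collected pages plus an explicit record of what could not be collected."""
    pages: Dict[str, str] = field(default_factory=dict)
    failures: Dict[str, str] = field(default_factory=dict)

    def coverage(self) -> float:
        """Fraction of attempted URLs that were actually retrieved."""
        total = len(self.pages) + len(self.failures)
        return len(self.pages) / total if total else 0.0


def crawl(urls: Iterable[str], fetch: Callable[[str], str]) -> CrawlReport:
    """Fetch each URL once; log time outs and unreachable hosts instead of aborting."""
    report = CrawlReport()
    for url in urls:
        try:
            report.pages[url] = fetch(url)
        except TimeoutError as exc:  # must come first: TimeoutError is a subclass of OSError
            report.failures[url] = f"timeout: {exc}"
        except OSError as exc:  # includes ConnectionError
            report.failures[url] = f"unreachable: {exc}"
    return report


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP client, for illustration only."""
    if "slow" in url:
        raise TimeoutError("no response in time")
    if "down" in url:
        raise ConnectionError("host unreachable")
    return "<html>ok</html>"


if __name__ == "__main__":
    urls = ["http://ok.example/", "http://slow.example/", "http://down.example/"]
    report = crawl(urls, fake_fetch)
    print(report.failures)
    print(f"coverage: {report.coverage():.2f}")
```

Publishing the failure log together with the data lets later users judge how incomplete a given crawl is.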
Challenges – Data Collection Issues
Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
    • We cannot observe everything!
      – Even just for data size!
      – Which starting points appear to be most fruitful?
Challenges – Data Collection Issues
Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
    • Each crawl is a view
      – Twitter
        » Tweet
        » URL
        » Web Page
        » Subweb
        » Followers
        » Followers' Followers
        » ...
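The point that each crawl is only a view can be illustrated with a depth-limited breadth-first traversal: the same seed yields a different dataset depending on where the crawl stops. This is a minimal sketch over a made-up in-memory follower graph; `crawl_view`, `FOLLOWERS` and the account names are hypothetical, and a real crawler would query a service instead.

```python
from collections import deque
from typing import Dict, List, Set


def crawl_view(graph: Dict[str, List[str]], seed: str, max_depth: int) -> Set[str]:
    """Return all nodes reachable from seed within max_depth hops (breadth-first)."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # stopping rule: do not expand nodes at the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# Toy data: account -> followers. Invented for illustration only.
FOLLOWERS = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": ["erin"],
}

if __name__ == "__main__":
    for depth in range(4):
        print(depth, sorted(crawl_view(FOLLOWERS, "alice", depth)))
```

Depth 1 corresponds to "Followers" and depth 2 to "Followers' Followers" in the list above; the choice of depth is part of the dataset's description, not a neutral detail.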
Challenges – Data Collection Issues
Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
  Synchronous vs asynchronous
    • Strictly speaking: only asynchronous crawling possible
      – But in [Dellschaft&Staab] we targeted the construction of models for streams of tags
Challenges – Data Publishing Issues
Legal and/or Ethical
Example Issues
  AOL query log
  Netflix challenge
  Delicious
    http://www.tagora-project.eu/data/
  Twitter
    Collecting, but no sharing
      • SocialSensor project
Challenges – Data Publishing Issues
Technical/Modelling issues
  Generic format, e.g. RDF
  Format ready for digestion by a certain software, e.g. for Matlab processing
  Openness to other data
    E.g. references to DBPedia/Wikipedia
  Accuracy of publishing
    http://me.org showed "..."
    http://me.org showed "..."@2013-05-01:0900CEST
    http://me.org showed "..."@2013-05-01:0900CEST called from IP 193.99.144.85 using browser...version...history...
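The three statements about http://me.org differ only in how much context accompanies the observation. A minimal sketch of that idea as a record type follows; the class `Observation`, its field names, the `accuracy_level` helper and the user-agent string are assumptions made for illustration, and the elided content stays elided.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed-offset stand-in for CEST (UTC+2), enough for this example.
CEST = timezone(timedelta(hours=2), "CEST")


@dataclass(frozen=True)
class Observation:
    """What a URL showed, optionally with when and from where it was observed."""
    url: str
    content: str
    observed_at: Optional[datetime] = None
    client_ip: Optional[str] = None
    user_agent: Optional[str] = None

    def accuracy_level(self) -> int:
        """1: content only; 2: plus timestamp; 3: plus client context."""
        if self.observed_at is None:
            return 1
        if self.client_ip is None or self.user_agent is None:
            return 2
        return 3


if __name__ == "__main__":
    when = datetime(2013, 5, 1, 9, 0, tzinfo=CEST)
    statements = [
        Observation("http://me.org", "..."),
        Observation("http://me.org", "...", observed_at=when),
        Observation("http://me.org", "...", observed_at=when,
                    client_ip="193.99.144.85",
                    user_agent="ExampleBrowser/1.0"),  # hypothetical browser string
    ]
    for s in statements:
        print(s.accuracy_level(), s)
```

The more context a record carries, the more repeatable the observation becomes, but fields such as the client IP also raise the privacy issues noted earlier.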
Sharing Software
Software
  For crawling or usage logging
  Rather than sharing the data, share the code for observing
  Example: code for crawling Twitter in a certain way
Issues
  Limited repeatability
  Disturbance liability ("Störerhaftung"), at least in Germany
    • If you provide source code for crawling, e.g., Facebook, then Facebook can sue you even if you do not crawl Facebook yourself
Ongoing discussion
What to do about sharing Web Science datasets?
Let's do simple things first
  Collect pointers!
  Publish whatever you can publish; others will reuse
  Make it more archival
In a way that makes it easy to expand to handle more complex issues
  Semantic Wiki!
Web Observatory Wiki
• Main Goals:
  • Registry of Web Science datasets
  • Compiled by Web Observatory participants: YOU!
• Minor Goals
  • Semantically store all information about datasets
  • Make it
    • Explorable
    • Queryable
    • Reusable
Quick Facts - 1
Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Main classes          Examples
Dataset_Repository    KONECT
Dataset               Slashdot Zoo
Organization          WeST
Quick Facts - 2
Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Class hierarchy example    Attributes
Dataset                    Dublin Core + size, license, URL, …
  Network                  Node Count
    Social Network         …
Schema (Excerpt)
[Figure: RDF graph of the schema. Node labels: ko:konect, ko:slashdot-zoo, ko:twitter, wow:dataset-repository, wow:dataset, wow:network, wow:social-network, 1944, 120000000. Edge labels: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range, wow:contains, wow:size, wow:network-volume.]
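To show what such a schema makes possible, here is a small sketch of subclass inference over a handful of triples. The triples are one plausible reading of the figure and the class hierarchy from the quick facts, not a confirmed export of the wiki; in particular, typing ko:slashdot-zoo as wow:social-network is an assumption, and the numeric literals are left out because their attachment in the figure is unclear. The function names are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Assumed reading of the schema excerpt (see the caveats in the text above).
TRIPLES: Set[Triple] = {
    ("wow:social-network", SUBCLASS_OF, "wow:network"),
    ("wow:network", SUBCLASS_OF, "wow:dataset"),
    ("ko:konect", RDF_TYPE, "wow:dataset-repository"),
    ("ko:konect", "wow:contains", "ko:slashdot-zoo"),
    ("ko:slashdot-zoo", RDF_TYPE, "wow:social-network"),
}


def superclasses(cls: str, triples: Set[Triple]) -> Set[str]:
    """All classes reachable from cls via rdfs:subClassOf (transitive)."""
    result: Set[str] = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in result:
                result.add(o)
                frontier.append(o)
    return result


def inferred_types(resource: str, triples: Set[Triple]) -> Set[str]:
    """Directly asserted types of resource plus every superclass of those types."""
    direct = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


if __name__ == "__main__":
    print(sorted(inferred_types("ko:slashdot-zoo", TRIPLES)))
```

With this kind of inference, a query for all datasets would also return resources that are only typed as social networks, which is one way the registry becomes queryable.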
Discussion & Q&A
Access to wiki
  Current model:
    • Edits allowed by IPs and users
    • Everyone can be blocked, including IPs
Contribute:
  Content
  Modeling requirements
  ...
Let us know!
Sanity Check
Understanding
Collecting (to some extent: commodity service)
Describing (WOW)
Analyzing
Modeling
Predicting
Repeating!
So far ad hoc; needs much more:
• Experience
• Guidelines
• Processing workflow
• Executable code shares (on big data!)
• ...