Steffen Staab [email protected] 1 WeST Vote for free Web Science MOOC!

Challenges of Building Web Observatories

DESCRIPTION

Invited Talk at WebSci workshop on Building Web Observatories

Page 1: Challenges of Building Web Observatories

Steffen [email protected]

1WeST

Vote for free Web Science MOOC!

Page 2: Challenges of Building Web Observatories

You want to have more free Web Science Education on the Web?

Vote for our course at https://moocfellowship.org/ now!

Page 3: Challenges of Building Web Observatories

Web Science & Technologies

University of Koblenz ▪ Landau, Germany

The Challenges of Building Interoperable Web Observatories

http://wow.west.webobservatory.org/

Steffen Staab

Page 4: Challenges of Building Web Observatories

What to observe?

[Diagram labels]
• Observable micro-interactions in the Web: Produce, Consume, Cognition, Emotion, Behavior, Socialisation, Knowledge
• WWW: Apps, Protocols, Data & Information, Governance
• Observable macro-effects in the Web

Page 5: Challenges of Building Web Observatories

Why to observe?

Understanding
Collecting
Describing
Analyzing
Modeling
Predicting
Repeating!

Page 7: Challenges of Building Web Observatories

What to observe?

[Diagram labels]
• Observable micro-interactions in the Web: Produce, Consume, Cognition, Emotion, Behavior, Socialisation, Knowledge
• WWW: Apps, Protocols, Data & Information, Governance
• Observable macro-effects in the Web

Web Crawling
Usage Logging

Page 8: Challenges of Building Web Observatories

Challenges – Data Collection Issues

Legal and/or Ethical
  Crawling
    May be disallowed by provider
  Usage logging
    Privacy of individuals
    Even if it is allowed....
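One concrete form of "disallowed by provider" is a site's robots.txt file. The following is a minimal sketch, assuming a hypothetical robots.txt and a hypothetical crawler name `observatory-bot`; it only checks the robots.txt rules and says nothing about terms of service or legal constraints, which must be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(policy: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this agent to fetch url."""
    return policy.can_fetch(user_agent, url)


policy = build_policy(EXAMPLE_ROBOTS_TXT)
```

A crawler would call `may_crawl(policy, "observatory-bot", url)` before each request and skip URLs for which it returns False.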

Page 9: Challenges of Building Web Observatories

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
    • Unreachability
    • Time outs
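Because a crawl is always incomplete, it may help to record what could not be fetched alongside what was. The sketch below is one possible way to do that, not a prescribed method; `CrawlReport`, `crawl`, and `fake_fetch` are hypothetical names, and the fetcher is injected so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class CrawlReport:
    """What a crawl obtained and what it missed (hypothetical structure)."""
    fetched: Dict[str, str] = field(default_factory=dict)
    failed: Dict[str, str] = field(default_factory=dict)

    def coverage(self) -> float:
        """Fraction of attempted URLs that were actually fetched."""
        total = len(self.fetched) + len(self.failed)
        return len(self.fetched) / total if total else 0.0


def crawl(urls: Iterable[str], fetch: Callable[[str], str]) -> CrawlReport:
    """Fetch each URL, recording time outs and unreachable hosts."""
    report = CrawlReport()
    for url in urls:
        try:
            report.fetched[url] = fetch(url)
        except TimeoutError as exc:  # must precede OSError, its parent class
            report.failed[url] = f"timeout: {exc}"
        except OSError as exc:  # includes ConnectionError
            report.failed[url] = f"unreachable: {exc}"
    return report


def fake_fetch(url: str) -> str:
    """Stand-in fetcher for illustration; a real one would do HTTP requests."""
    if "slow" in url:
        raise TimeoutError("no response within limit")
    if "down" in url:
        raise ConnectionError("host not reachable")
    return "<html>ok</html>"


report = crawl(
    ["http://example.org/a", "http://example.org/slow", "http://example.org/down"],
    fake_fetch,
)
```

Publishing the `failed` map together with the data lets others judge how partial the resulting view is.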

Page 10: Challenges of Building Web Observatories

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
    • We cannot observe everything!
      – Even just for data size!
      – What appear to be the most fruitful starting points?

Page 11: Challenges of Building Web Observatories

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
    • Each crawl is a view
      – Twitter
        » Tweet
        » URL
        » Web Page
        » Subweb
        » Followers
        » Followers' Followers
        » ...
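The idea that each crawl is a view can be illustrated with a depth-bounded breadth-first traversal: the seed and the depth limit together define which part of the Web ends up in the dataset. This is a minimal sketch; the link graph below is a hypothetical toy arrangement of the items on the slide, not real Twitter data or an actual API.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy graph; how the items link to each other is an assumption.
GRAPH: Dict[str, List[str]] = {
    "tweet": ["url", "followers"],
    "url": ["web page"],
    "web page": ["subweb"],
    "followers": ["followers' followers"],
}


def crawl_view(graph: Dict[str, List[str]], seed: str, max_depth: int) -> Dict[str, int]:
    """Breadth-first crawl from seed; returns each reached node with its depth."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # stopping rule: do not expand beyond the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


shallow_view = crawl_view(GRAPH, "tweet", 1)
deeper_view = crawl_view(GRAPH, "tweet", 2)
```

Two crawls with the same seed but different limits yield different views, which is one reason to publish the stopping rule together with the data.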

Page 12: Challenges of Building Web Observatories

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
  Synchronous vs asynchronous
    • Strictly speaking: only asynchronous crawling possible
      – But in [Dellschaft&Staab] we targeted the construction of models for streams of tags

Page 13: Challenges of Building Web Observatories

Challenges – Data Publishing Issues

Legal and/or Ethical
  Example Issues
    AOL query log
    Netflix challenge
    Delicious
      http://www.tagora-project.eu/data/
    Twitter
      Collecting, but no sharing
        • SocialSensor project

Page 14: Challenges of Building Web Observatories

Challenges – Data Publishing Issues

Technical/Modelling issues
  Generic format, e.g. RDF
  Format ready for digestion by a certain software, e.g. for Matlab processing
  Openness to other data
    E.g. references to DBPedia/Wikipedia
  Accuracy of publishing
    http://me.org showed „...“
    http://me.org showed „...“@2013-05-01:0900CEST
    http://me.org showed „...“@2013-05-01:0900CEST called from IP 193.99.144.85 using browser...version...history...
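The three "accuracy of publishing" variants can be read as increasing levels of provenance attached to the same observation. The sketch below illustrates that reading; `Observation`, `publish`, and the numeric levels are hypothetical, the timestamp is the slide's 09:00 CEST written in ISO 8601 form, and the browser, version, and history fields are left out because the slide elides them.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Observation:
    """One observed statement plus optional provenance (hypothetical model)."""
    source: str
    content: str
    observed_at: Optional[str] = None  # ISO 8601 timestamp
    client_ip: Optional[str] = None    # where the observation was made from


def publish(obs: Observation, level: int) -> Dict[str, Optional[str]]:
    """Return the published record at detail level 1, 2, or 3.

    1: source and content only; 2: adds the time; 3: adds the client IP.
    """
    record: Dict[str, Optional[str]] = {"source": obs.source, "content": obs.content}
    if level >= 2:
        record["observed_at"] = obs.observed_at
    if level >= 3:
        record["client_ip"] = obs.client_ip
    return record


obs = Observation(
    source="http://me.org",
    content="...",
    observed_at="2013-05-01T09:00:00+02:00",
    client_ip="193.99.144.85",
)
```

Higher levels make an observation easier to reproduce and interpret, but they also expose more about the observer, which ties back to the legal and ethical issues of publishing.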

Page 15: Challenges of Building Web Observatories

Sharing Software

Software
  For crawling or usage logging
  Rather than sharing the data, share the code for observing
    Example: code for crawling Twitter in a certain way

Issues
  Limited repeatability
  Disturbance liability („Störerhaftung“) – at least in DE
    • If you provide source code for crawling, e.g., Facebook, even if you do not crawl FB, FB can sue you

Page 16: Challenges of Building Web Observatories

Why to observe?

Understanding
Collecting
Describing
Analyzing
Modeling
Predicting
Repeating!

Page 17: Challenges of Building Web Observatories

In spite of all this....

WEB OBSERVATORY WIKI

Page 18: Challenges of Building Web Observatories

Ongoing discussion

What to do about sharing Web Science datasets?

Let's do simple things first
  Collect pointers!
  Publish whatever you can publish – others will reuse
  Make it more archival

In a way that makes it easy to expand to handle more complex issues
  Semantic Wiki!

Page 19: Challenges of Building Web Observatories

Web Observatory Wiki

• Main Goals:
  • Registry of Web Science datasets
  • Compiled by Web Observatory participants – YOU!

• Minor Goals
  • Semantically store all information about datasets
  • Make it
    • Explorable
    • Queryable
    • Reusable

Page 20: Challenges of Building Web Observatories

Quick Facts - 1

Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Main classes:         Examples:
Dataset_Repository    KONECT
Dataset               Slashdot Zoo
Organization          WeST

Page 21: Challenges of Building Web Observatories

Quick Facts - 2

Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Class Hierarchy Example:    Attributes:
Dataset                     Dublin Core + Size, license, URL,…
  Network                   Node Count
    Social Network          …

Page 22: Challenges of Building Web Observatories

Semantic Exploration by Views

Page 23: Challenges of Building Web Observatories

Semantic Forms: Providing Data

Page 24: Challenges of Building Web Observatories

Schema (Excerpt)

[RDF graph diagram. Labels recoverable from the figure:]
• Resources: ko:konect, ko:slashdot-zoo, ko:twitter
• Classes: wow:dataset-repository, wow:dataset, wow:network, wow:social-network
• Properties: wow:contains, wow:size, wow:network-volume
• RDF/RDFS vocabulary: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range
• Literal values: 1944, 120000000
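To show how such a schema makes the registry queryable, here is a minimal sketch that stores a few triples as plain tuples and follows rdfs:subClassOf to find all instances of a class. The edges are an assumed reading of the diagram, since the transcript preserves only the labels and not which nodes they connect; the helper functions are hypothetical and no RDF library is used.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Assumed reading of the diagram; the actual edges may differ.
TRIPLES: Set[Triple] = {
    ("ko:konect", "rdf:type", "wow:dataset-repository"),
    ("ko:konect", "wow:contains", "ko:slashdot-zoo"),
    ("ko:slashdot-zoo", "rdf:type", "wow:social-network"),
    ("wow:social-network", "rdfs:subClassOf", "wow:network"),
    ("wow:network", "rdfs:subClassOf", "wow:dataset"),
}


def superclasses(cls: str, triples: Set[Triple]) -> Set[str]:
    """All classes reachable from cls via rdfs:subClassOf (transitive)."""
    found: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


def instances_of(cls: str, triples: Set[Triple]) -> Set[str]:
    """Resources typed as cls directly or through one of its subclasses."""
    result: Set[str] = set()
    for s, p, o in triples:
        if p == "rdf:type" and (o == cls or cls in superclasses(o, triples)):
            result.add(s)
    return result


datasets = instances_of("wow:dataset", TRIPLES)
```

Under these assumed triples, asking for all datasets also returns resources typed only as social networks, because the class hierarchy is followed upward.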

Page 25: Challenges of Building Web Observatories

Discussion & Q&A

Access to wiki
  Current model:
    • Edits allowed by IPs and users
    • Everyone can be blocked, including IPs

Contribute:
  Content
  Modeling requirements
  ...
  Let us know!

Page 26: Challenges of Building Web Observatories

Sanity Check

Understanding

Collecting (to some extent: commodity service)

Describing (WOW)

Analyzing

Modeling

Predicting

Repeating!

So far ad hoc – needs much more:
• Experience
• Guidelines
• Processing workflow
• Executable code shares (on big data!)
• ...

Page 27: Challenges of Building Web Observatories

What else do we need?

Page 28: Challenges of Building Web Observatories

Vote at: https://moocfellowship.org/