10 September 2014, Yves Rocher Data Acquisition and Extraction from the Variety of Web Sources Pierre Senellart


10 September 2014, Yves Rocher

Data Acquisition and Extraction from the Variety of Web Sources

Pierre Senellart


2 / 74 Télécom ParisTech Pierre Senellart

Outline

The World Wide Web

Acquiring Various Forms of Web Content

Exploiting Acquired Information

Opportunities for Market Insights


Internet and the Web

Internet: physical network of computers (or hosts)

World Wide Web, Web, WWW: logical collection of hyperlinked documents

    static and dynamic
    public Web and private Webs
    each document (or Web page, or resource) identified by a URL


Uniform Resource Locators

https://www.example.com:443/path/to/doc?name=foo&town=bar#para

    https                 scheme
    www.example.com       hostname
    :443                  port
    path/to/doc           path
    ?name=foo&town=bar    query string
    #para                 fragment

scheme: way the resource can be accessed; generally http or https

hostname: domain name of a host (cf. DNS); the hostname of a website may start with www., but this is not a rule.

port: TCP port; defaults: 80 for http and 443 for https

path: logical path of the document

query string: additional parameters (dynamic documents), optional

fragment: subpart of the document, optional

Relative URIs with respect to a context (e.g., the URI above):

    /titi    https://www.example.com/titi
    tata     https://www.example.com/path/to/tata
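The decomposition and the relative-URI resolution above can be checked with Python's standard urllib.parse module. This is a minimal sketch; the helper names url_components and resolve are illustrative and not part of the talk.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# The example URL from the slide.
BASE = "https://www.example.com:443/path/to/doc?name=foo&town=bar#para"


def url_components(url):
    """Return the components named on the slide as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parse_qs(parts.query),  # query string as name -> list of values
        "fragment": parts.fragment,
    }


def resolve(relative, base=BASE):
    """Resolve a relative URI against a context URI."""
    return urljoin(base, relative)
```

Note that urljoin keeps the explicit :443 of the context, so resolve("/titi") yields https://www.example.com:443/titi; this denotes the same resource as the slide's https://www.example.com/titi, since 443 is the default port for https.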


(X)HTML

Format of choice for Web pages

Dialect of SGML (the ancestor of XML), but seldom parsed as is

HTML 4.01: most common version, W3C recommendation

XHTML 1.0: XML-ization of HTML 4.01, minor differences

HTML5: most recent version, still in development, adds some better structuring

Actual situation of the Web: tag soup


XHTML example

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Example XHTML document</title>
  </head>
  <body>
    <p>This is a <a href="http://www.w3.org/">link to the <strong>W3C</strong>!</a></p>
  </body>
</html>


HTTP

Client-server protocol for the Web, on top of TCP/IP

Example request/response

Request:

    GET /myResource HTTP/1.1
    Host: www.example.com

Response:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <html>
    <head><title>myResource</title></head>
    <body><p>Hello world!</p></body>
    </html>

HTTPS: secure version of HTTP
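To make the wire format concrete, here is a minimal sketch that builds the GET request of the example and splits a raw response into status, headers, and body. It works on byte strings only and opens no network connection; the function names are illustrative, and a real client would rather rely on a library such as http.client.

```python
def build_get_request(host, path):
    """Build a minimal HTTP/1.1 GET request (request line plus Host header)."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(status), reason, headers, body
```

This sketch ignores chunked transfer encoding, repeated headers, and persistent connections, all of which a production client has to handle.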


Features of HTTP/1.1

virtual hosting: different Web content for different hostnames on a single machine

login/password protection

content negotiation: same URL identifying several resources, client indicates preferences

cookies: chunks of information persistently stored on the client

keep-alive connections: several requests using the same TCP connection

etc.


Outline

The World Wide Web

Acquiring Various Forms of Web Content
    Regular Web Content
    CMS-based Web Content
    Social Networking Sites
    The Deep Web
    The Semantic Web

Exploiting Acquired Information

Opportunities for Market Insights


Web Crawlers

crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web

Basics of crawling:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL

No real termination condition (virtually unlimited number of Web pages!)

Graph-browsing problem
    depth-first: not well adapted, can get lost in robot traps
    best: breadth-first, with limited-depth depth-first on each discovered website
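The four steps above can be sketched as a breadth-first traversal with a depth limit. This is a simplification of the strategy on the slide (a single global depth limit rather than a per-website one), and the get_links callback is a hypothetical stand-in for "retrieve the page and extract its URLs".

```python
from collections import deque


def crawl(seeds, get_links, max_depth=2, max_pages=1000):
    """Breadth-first crawl from the seed URLs.

    get_links(url) is assumed to fetch the page and return the URLs it
    contains. Pages at depth max_depth are visited but not expanded, which
    bounds the damage a robot trap can do.
    """
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)                      # step 2: retrieve and process
        if depth >= max_depth:
            continue
        for link in get_links(url):              # step 3: discover new URLs
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))  # step 4: repeat
    return visited
```

With an in-memory link graph such as {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"], "e": []} and max_depth=2, the crawl starting from "a" visits a, b, c, d in that order and never reaches e.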


Sources of new URLs

From HTML pages:
    hyperlinks <a href="...">...</a>
    media <img src="..."> <embed src="..."> <object data="...">
    frames <frame src="..."> <iframe src="...">
    JavaScript links window.open("...")
    etc.

Other hyperlinked content (e.g., PDF files)

Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them

Referrer URLs

Sitemaps [sitemaps.org, 2008]
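As an illustration of the first and third sources, the following sketch collects URLs from the link-carrying attributes listed above using the standard html.parser module, then adds absolute URLs found anywhere in the text with a regular expression. The regular expression is deliberately crude and is an assumption of this sketch; JavaScript links, referrers, and sitemaps are not handled.

```python
import re
from html.parser import HTMLParser

# (tag, attribute) pairs that carry URLs, as listed on the slide.
LINK_ATTRS = {
    ("a", "href"), ("img", "src"), ("embed", "src"),
    ("object", "data"), ("frame", "src"), ("iframe", "src"),
}

# Crude pattern for absolute http(s) URLs appearing as plain text.
URL_RE = re.compile(r'''https?://[^\s"'<>]+''')


class LinkCollector(HTMLParser):
    """Collect attribute values that are known to hold URLs."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in LINK_ATTRS:
                self.urls.append(value)


def extract_urls(html):
    """Return URLs from link attributes, then plain-text URLs, without repeats."""
    collector = LinkCollector()
    collector.feed(html)
    found = list(collector.urls)
    for url in URL_RE.findall(html):
        if url not in found:
            found.append(url)
    return found
```

The returned URLs may be relative; a crawler would resolve them against the URL of the page before queueing them.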


Scope of a crawler

Web-scale
    The Web is infinite! Avoid robot traps by putting depth or page number limits on each Web server
    Focus on important pages [Abiteboul et al., 2003]

Web servers under a list of DNS domains: easy filtering of URLs

A given topic: focused crawling techniques [Chakrabarti et al., 1999, Diligenti et al., 2000, Gouriten et al., 2014] based on classifiers of Web page content and predictors of the interest of a link.

The national Web (cf. legal deposit, national libraries): what is this? [Abiteboul et al., 2002]

A given Web site: what is a Web site? [Senellart, 2005]


Identification of duplicate Web pages

Problem
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing

trivial duplicates: same resource at the same canonicalized URL:
    http://example.com:80/toto
    http://example.com/titi/../toto

exact duplicates: identification by hashing

near-duplicates: (timestamps, tip of the day, etc.) more complex!
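The first two cases can be sketched as follows: URL canonicalization maps both example URLs to the same string, and a hash of the body identifies exact duplicates. The normalization rules chosen here (lowercasing, dropping default ports and fragments, collapsing dot segments) are one reasonable set, not an exhaustive one, and the function names are illustrative.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Normalize a URL so that trivial duplicates become identical strings."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is absent or is the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Collapse "." and ".." segments; keep a trailing slash if there was one.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def content_fingerprint(body):
    """Hash of the raw page body, for exact-duplicate detection."""
    return hashlib.sha256(body).hexdigest()
```

Both http://example.com:80/toto and http://example.com/titi/../toto canonicalize to http://example.com/toto. Near-duplicates need softer comparisons than a cryptographic hash, since a single changed byte yields a different fingerprint.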


Crawling ethics

Standard for robot exclusion: robots.txt at the root of a Web server [Koster, 1994].

    User-agent: *
    Allow: /searchhistory/
    Disallow: /search

Per-page exclusion.

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

Per-link exclusion.

<a href="toto.html" rel="nofollow">Toto</a>

Avoid Denial of Service (DoS): wait about 1 s between two repeated requests to the same Web server
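Python's standard urllib.robotparser module can evaluate the robots.txt rules shown above. The sketch below parses the rules from a string instead of fetching them; the user agent name ExampleBot and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the slide.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""


def make_checker(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="ExampleBot"):
    """Tell whether the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, a URL under /searchhistory/ is allowed, a URL under /search is not, and a path that matches no rule is allowed by default. A polite crawler would in addition pause (for instance with time.sleep) between two requests to the same server.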


Parallel processing

Network delays, waits between requests:

Per-server queue of URLs

Parallel processing of requests to different hosts:

    multi-threaded programming
    asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead

Use of keep-alive to reduce connection overheads
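The per-server queue idea can be sketched as a small frontier structure that hands out the next URL only for hosts whose politeness delay has elapsed. The class name PoliteFrontier and the explicit now argument (a timestamp in seconds, passed in so that the logic does not depend on the wall clock) are choices of this sketch.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues with a minimum delay between requests to a host."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready_at = {}                # host -> earliest time of next request

    def add(self, url):
        host = urlsplit(url).hostname
        self.queues[host].append(url)
        self.ready_at.setdefault(host, 0.0)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time now, or None."""
        for host, queue in self.queues.items():
            if queue and self.ready_at[host] <= now:
                self.ready_at[host] = now + self.delay
                return queue.popleft()
        return None
```

Worker threads or an asynchronous event loop can then call next_url repeatedly: while one host is cooling down, requests to other hosts proceed in parallel.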


General Architecture [Chakrabarti, 2003]


Refreshing URLs

Content on the Web changes

Different change rates:
    online newspaper main page: every hour or so
    published article: virtually no change

Continuous crawling, and identification of change rates for adaptive crawling: how to know the time of last modification of a Web page?


Estimating the Freshness of a Page

1. Check HTTP timestamp.

2. Check content timestamp.

3. Compare a hash of the page with a stored hash.

4. Non-significant differences (ads, fortunes, request timestamp):

    only hash text content, or “useful” text content;
    compare distribution of n-grams (shingling);
    or even compute edit distance with previous version.

Adapting strategy to each different archived website?
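Steps 3 and 4 can be sketched as follows: an exact hash comparison first, then a shingling-based similarity that tolerates small differences. The Jaccard coefficient over sets of word n-grams used here is one common way of comparing shingles, and the 0.9 threshold is an arbitrary assumption of this sketch.

```python
import hashlib


def content_hash(text):
    """Hash of the text content, for exact comparison with a stored hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, n=2):
    """Set of word n-grams (shingles) of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(old, new, n=2):
    """Jaccard similarity between the shingle sets of two versions."""
    a, b = shingles(old, n), shingles(new, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def has_changed(old, new, threshold=0.9):
    """Report a change only if the versions differ significantly."""
    if content_hash(old) == content_hash(new):
        return False
    return similarity(old, new) < threshold
```

For instance, "the quick brown fox jumps" and "the quick brown fox sleeps" share 3 of 5 distinct 2-grams, for a similarity of 0.6.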


Crawling Modern Web Sites

Some modern Web sites only work when cookies are activated (session cookies), or when JavaScript code is interpreted

Regular Web crawlers (wget, Heritrix, Apache Nutch) usually don’t do cookie management and don’t interpret JavaScript code

Crawling some Web sites therefore requires more advanced tools


Advanced crawling tools

Web scraping frameworks such as scrapy (Python) or WWW::Mechanize (Perl) simulate a Web browser interaction and cookie management (but no JS interpretation)

Headless browsers such as htmlunit simulate a Web browser, including simple JavaScript processing

Browser instrumentors such as Selenium allow full instrumentation of a regular Web browser (Chrome, Firefox, Internet Explorer)

OXPath: a full-fledged navigation and extraction language for complex Web sites [Sellers et al., 2011] Demo


Templated Web Site

Many Web sites (especially Web forums and blogs) use one of a few content management systems (CMS)

Web sites that use the same CMS will be similarly structured, present a similar layout, etc.

Information is somewhat structured in CMSs: publication date, author, tags, forums, threads, etc.

Some structure differences may exist when Web sites use different versions, or different themes, of a CMS


Crawling CMS-Based Web Sites

Traditional crawling approaches crawl Web sites independently of the nature of the sites and of their CMS

When the CMS is known:
    Potential for much more efficient crawling strategies (avoid pages with redundant information, uninformative pages, etc.)
    Potential for automatic extraction of structured content

Two ways of approaching the problem:
    Have a handcrafted knowledge base of known CMSs, their characteristics, how to crawl and extract information [Faheem and Senellart, 2013b,a] (AAH) Demo
    Automatically infer the best way to crawl a given CMS [Faheem and Senellart, 2014] (ACE)

Need to be robust w.r.t. template change


Detecting CMSs

One main challenge in intelligent crawling and content extraction is to identify the CMS and then apply the best crawling strategy accordingly

Detecting CMS using:
1. URL patterns,
2. HTTP metadata,
3. textual content,
4. XPath patterns, etc.

These can be manually described (AAH), or automatically inferred (ACE)

For instance, the vBulletin Web forum content management system can be identified by searching for a reference to a vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
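The vBulletin check can be approximated with the standard library alone. The sketch below does not evaluate XPath; it uses html.parser to collect the same values that //script/@src would select, then looks for vbulletin_global.js among them. It is not the detection code of AAH or ACE, and the function names are illustrative.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every script element (cf. //script/@src)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def script_sources(html):
    collector = ScriptSrcCollector()
    collector.feed(html)
    return collector.sources


def looks_like_vbulletin(html):
    """Heuristic: the page references the vbulletin_global.js script."""
    return any("vbulletin_global.js" in src for src in script_sources(html))
```

A knowledge base of CMSs would hold many such patterns, over URLs, HTTP metadata, and page content, and select the crawling strategy of the first CMS that matches.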


Crawling http://www.rockamring-blog.de/ [Faheem and Senellart, 2014]

[Figure: number of distinct 2-grams (×1,000, from 0 to 300) as a function of the number of HTTP requests (from 0 to 6,000), for ACE, AAH, and wget.]


Most popular Web sites

 1 google.com
 2 facebook.com
 3 youtube.com
 4 yahoo.com
 5 baidu.com
 6 wikipedia.org
 7 live.com
 8 twitter.com
 9 qq.com
10 amazon.com
11 blogspot.com
12 linkedin.com
13 google.co.in
14 taobao.com
15 sina.com.cn
16 yahoo.co.jp
17 msn.com
18 wordpress.com
19 google.com.hk
20 t.co
21 google.de
22 ebay.com
23 google.co.jp
24 googleusercontent.com
25 google.co.uk
26 yandex.ru
27 163.com
28 weibo.com

(Alexa)

Social networking sites

Sites with social networking features (friends, user-shared content, user profiles, etc.)


Social data on the Web

Huge numbers of users (2012):

    Facebook        900 million
    QQ              540 million
    Windows Live    330 million
    Weibo           310 million
    Google+         170 million
    Twitter         140 million
    LinkedIn        100 million

Huge volume of shared data:

    250 million tweets per day on Twitter (3,000 per second on average!)...
    ...including statements by heads of state, revelations of political activists, etc.


Crawling Social Networks

Theoretically possible to crawl social networking sites using a regular Web crawler

Sometimes not possible: https://www.facebook.com/robots.txt

Often very inefficient, considering politeness constraints

Better solution: Use provided social networking APIs
    https://dev.twitter.com/docs/api/1.1
    https://developers.facebook.com/docs/graph-api/reference/v2.1/
    https://developer.linkedin.com/apis
    https://developers.google.com/youtube/v3/

Also possible to buy access to the data, directly from the social network or from brokers such as http://gnip.com/


Social Networking APIs

Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content

Usually a RESTful API, occasionally SOAP-based

Usually require a token identifying the application using the API, sometimes a cryptographic signature as well

May access the API as an authenticated user of the social network, or as an external party

APIs seriously limit the rate of requests: https://dev.twitter.com/docs/api/1.1/get/search/tweets
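Because of these rate limits, an API client typically throttles itself. Here is a minimal sketch of a sliding-window limiter; the class name and the explicit now argument (a timestamp in seconds) are choices of this sketch, and the actual quota of a given API must be taken from its documentation.

```python
from collections import deque


class RateLimiter:
    """Allow at most max_calls requests in any window of period seconds."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of the calls still inside the window

    def allow(self, now):
        """Record and allow a call at time now, or refuse it."""
        # Forget calls that have left the sliding window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```

A client would call allow(time.monotonic()) before each request and sleep or switch to another task when it returns False.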


REST

Mode of interaction with a Web service

Follow the KISS (Keep it Simple, Stupid) principle

Each request to the service is a simple HTTP request (a GET when reading data)

Base URL is the URL of the service

Parameters of the service are sent as HTTP parameters (in the URL)

HTTP response code indicates success or failure

Response contains structured output, usually as JSON or XML

GET requests have no side effect; each request is independent of previous ones

Example: http://graph.facebook.com:80/?ids=7901103
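A minimal sketch of the client side of such an interaction: parameters are encoded into the URL, and the response is interpreted from its status code and JSON body. No request is actually sent here; the function names are illustrative, and the JSON body in the usage note is made up for the example.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, params):
    """Append the service parameters to the base URL as a query string."""
    return f"{base_url}?{urlencode(params)}"


def decode_response(status, body):
    """Return the decoded JSON body, or raise if the status signals failure."""
    if status != 200:
        raise RuntimeError(f"service returned HTTP {status}")
    return json.loads(body)
```

For example, build_request_url("http://graph.facebook.com:80/", {"ids": "7901103"}) gives the URL of the slide, and decode_response(200, '{"id": "7901103"}') returns a dictionary. The URL can then be fetched with any HTTP client, such as urllib.request.urlopen.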


The Case of Twitter

Two main APIs:
    REST APIs, including search, getting information about a user, a list, followers, etc. https://dev.twitter.com/docs/api/1.1
    Streaming API, providing real-time results

Very limited history available

Search can be on keywords, language, geolocation (for a small portion of tweets)


Cross-Network Crawling

Often useful to combine results from different social networks

Numerous libraries facilitating SN API accesses (twipy, Facebook4J, FourSquare VP C++ API...) incompatible with each other...

Some efforts at generic APIs (OneAll, APIBlender [Gouriten et al., 2014]) Demo

Example use case: No API to get all check-ins from FourSquare, but a number of check-ins are available on Twitter; given results of Twitter Search/Streaming, use FourSquare API to get information about check-in locations.


The Deep Web

Definition (Deep Web, Hidden Web, Invisible Web)
All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.

Size estimate: 500 times more content than on the surface Web! [BrightPlanet, 2000]. Hundreds of thousands of deep Web databases [Chang et al., 2004]


Sources of the Deep Web

Example

Yellow Pages and other directories;

Library catalogs;

Weather services;

US Census Bureau data;

etc.


Discovering Knowledge from the Deep Web [Nayak et al., 2012]

Content of the deep Web is hidden from classical Web search engines (they just follow links)

But very valuable and high quality!

Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed from the deep Web

How to benefit from this information?

How to analyze, extract and model this information?

Focus here: automatic, unsupervised methods, for a given domain of interest


Extensional Approach

[Figure: extensional approach. Labels: WWW, discovery, siphoning, bootstrap, indexing, Index.]


Notes on the Extensional Approach

Main issues:
    Discovering services
    Choosing appropriate data to submit forms
    Use of data found in result pages to bootstrap the siphoning process
    Ensure good coverage of the database

Approach favored by Google, used in production [Madhavan et al., 2006]

Not always feasible (huge load on Web servers)
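The bootstrapping idea can be illustrated on a toy example: words found in result records are reused as new form inputs until no new record appears. This is only a sketch under strong assumptions (a single keyword field, records returned as plain strings); the submit callback is a hypothetical stand-in for filling in the form and parsing the result page, and this is not the method of the cited systems.

```python
def siphon(submit, seed_terms, max_queries=100):
    """Siphon a form-accessible database, starting from a few seed terms.

    submit(term) is assumed to submit the form with the given keyword and
    return the matching records as strings. Returns the set of records
    collected and the set of terms that were submitted.
    """
    pending = list(seed_terms)
    tried = set()
    records = set()
    while pending and len(tried) < max_queries:
        term = pending.pop(0)
        if term in tried:
            continue
        tried.add(term)
        for record in submit(term):
            if record not in records:
                records.add(record)
                # Bootstrap: words of new records become candidate queries.
                pending.extend(w for w in record.lower().split() if w not in tried)
    return records, tried
```

On a toy database containing "red apple", "green apple", "green pear", and "blue plum", starting from the seed "red" retrieves the first three records but never reaches "blue plum", which illustrates the coverage issue listed above. The max_queries bound reflects the load that siphoning puts on the Web server.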


Intensional Approach

[Figure: intensional approach. Labels: WWW, discovery, probing, analyzing, form wrapped as a Web service, query.]


Notes on the Intensional Approach

More ambitious [Chang et al., 2005, Senellart et al., 2008]

Main issues:
    Discovering services
    Understanding the structure and semantics of a form
    Understanding the structure and semantics of result pages
    Semantic analysis of the service as a whole
    Query rewriting using the services

No significant load imposed on Web servers


The Semantic Web

A Web in which the resources are semantically described
    annotations give information about a page, explain an expression in a page, etc.

More precisely, a resource is anything that can be referred to by a URI
    a web page, identified by a URL
    a fragment of an XML document, identified by an element node of the document,
    a web service,
    a thing, an object, a concept, a property, etc.

Semantic annotations: logical assertions that relate resources to some terms in associated ontologies


Ontologies

Formal descriptions providing human users a shared understanding of a given domain

A controlled vocabulary

Formally defined so that it can also be processed by machines

Logical semantics that enables reasoning

Reasoning is the key for different important tasks of Web data management, in particular:

    to answer queries (over possibly distributed data)
    to relate objects in different data sources, enabling their integration
    to detect inconsistencies or redundancies
    to refine queries with too many answers, or to relax queries with no answer


Where Do Ontologies Come From?

Manually crafted to represent the knowledge of a specific domain (e.g., life sciences)

Exported from classical Web databases

Through information extraction from the Web, Wikipedia, etc. (e.g., DBpedia, YAGO)

Private to a company or public

Some ontologies focus on instances, others on a schema (see further)

Value of the Semantic Web: bits of one ontology can be reused in another, and ontologies can be mapped through owl:sameAs links
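To illustrate what an owl:sameAs link buys, the sketch below merges statements about resources declared equivalent, using a small union-find structure over plain (subject, predicate, object) tuples. Only the owl:sameAs URI is standard; the function name and the example.org URIs in the usage note are hypothetical, and a real application would use an RDF library and a reasoner.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def merge_same_as(triples):
    """Rewrite triples so that owl:sameAs-equivalent resources share one name."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # First pass: union the two sides of every owl:sameAs statement.
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    # Second pass: rewrite all other statements with the representatives.
    return {(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAME_AS}
```

If one source states a label for http://example.org/kb1/Paris, another states a location for http://example.org/kb2/Paris, and an owl:sameAs link relates the two, then after merging both statements are about a single resource and can be queried together.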


[Figure: Linking Open Data cloud diagram, as of September 2011. Data sets (e.g., DBpedia, YAGO, Freebase, GeoNames, MusicBrainz, UniProt) are shown as linked nodes, grouped by category: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Classes and class hierarchy

Backbone of the ontology

AcademicStaff is a Class (a class will be interpreted as a set of objects)

AcademicStaff isa Staff (isa is interpreted as set inclusion)

FacultyComponent
    Course
        MathCourse
            Probabilities
            Algebra
            Logic
        CSCourse
            DB
            AI
            Java
    Student
        UndergraduateStudent
        MasterStudent
        PhDStudent
    Department
        PhysicsDept
        MathsDept
        CSDept
    Staff
        AcademicStaff
            Lecturer
            Researcher
            Professor
        AdministrativeStaff

Relations

Declaration of relations with their signature

(Relations will be interpreted as binary relations between objects)

TeachesIn(AcademicStaff, Course)

if one states that “X TeachesIn Y”, then X belongs to AcademicStaff and Y to Course

TeachesTo(AcademicStaff, Student)

Leads(Staff, Department)

Instances

Classes have instances

Dupond is an instance of the class Professor

corresponds to the fact: Professor(Dupond)

Relations also have instances

(Dupond,CS101) is an instance of the relation TeachesIn

corresponds to the fact: TeachesIn(Dupond,CS101)

The instance statements can be seen as (and stored in) a database

Ontology = schema + instance

Schema (TBox)

The set of class and relation names

The signatures of relations and also constraints

The constraints are used for two purposes:

– checking data consistency (like dependencies in databases)

– inferring new facts

Instance (ABox)

The set of facts

The set of base facts together with the inferred facts should satisfy the constraints

Ontology (i.e., Knowledge Base) = Schema + Instance
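
As a minimal sketch of the schema/instance split, the Python snippet below encodes a fragment of the university example: the class hierarchy and relation signatures play the role of the schema, and the facts about Dupond play the role of the instance. The function names (`superclasses`, `infer`, `is_consistent`) and the disjointness constraint are assumptions introduced for illustration only; they do not come from the slides or from any particular ontology language.

```python
# Schema (TBox): class hierarchy (child -> parent) and relation signatures.
ISA = {
    "Professor": "AcademicStaff",
    "Lecturer": "AcademicStaff",
    "AcademicStaff": "Staff",
    "AdministrativeStaff": "Staff",
    "CSCourse": "Course",
}
SIGNATURES = {
    "TeachesIn": ("AcademicStaff", "Course"),
    "Leads": ("Staff", "Department"),
}
# Illustrative constraint (an assumption, not stated on the slides):
# these two classes are declared to share no instance.
DISJOINT = [("AcademicStaff", "AdministrativeStaff")]


def superclasses(cls):
    """Return cls followed by all its ancestors (isa read as set inclusion)."""
    result = [cls]
    while cls in ISA:
        cls = ISA[cls]
        result.append(cls)
    return result


def infer(class_facts, relation_facts):
    """Return all (class, object) facts implied by the base facts and the schema."""
    inferred = set()
    for cls, obj in class_facts:
        for ancestor in superclasses(cls):
            inferred.add((ancestor, obj))
    for rel, x, y in relation_facts:
        domain, range_ = SIGNATURES[rel]
        # The signature tells us which classes the two arguments belong to.
        for ancestor in superclasses(domain):
            inferred.add((ancestor, x))
        for ancestor in superclasses(range_):
            inferred.add((ancestor, y))
    return inferred


def is_consistent(facts):
    """Check that no object belongs to two classes declared disjoint."""
    return not any(
        (a, obj) in facts and (b, obj) in facts
        for a, b in DISJOINT
        for (_, obj) in facts
    )


# Instance (ABox): base facts taken from the "Instances" slide.
class_facts = {("Professor", "Dupond")}
relation_facts = {("TeachesIn", "Dupond", "CS101")}

facts = infer(class_facts, relation_facts)
print(sorted(facts))
print(is_consistent(facts))
```

Here the constraints serve both purposes named above: the signature of TeachesIn lets us infer that CS101 is a Course, and the disjointness check would flag an administrative staff member stated to teach a course.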

Where can Semantic Content be Found?

In linked data, through Web-available RDF data:

– dumps of an entire ontology, in one of the RDF serialization formats (RDF/XML, Turtle, N-Triples)

– crawlable RDF content, with small fragments pointing to other fragments

– a SPARQL endpoint

– HTML annotated with RDFa, cf. http://www.w3.org/TR/rdfa-syntax/

Other popular semantic content embedded in Web pages: microformats (hCard, vCard, etc.), microdata (cf. http://www.schemas.org/). Not directly in the spirit of the Semantic Web, but heavily used.

RDF content used internally in a company

How to Acquire Semantic Content?

Much easier to exploit, as it is already semantically described

Individual resources (dumps, SPARQL endpoints) that have been identified as valuable can be directly exploited

RDFa content, microformats, microdata, can be discovered from regular Web crawls

Not perfect! There are errors, lies, etc.

Outline

The World Wide Web

Acquiring Various Forms of Web Content

Exploiting Acquired Information

– Information Extraction

– Graph Mining

– Opinion Mining

Opportunities for Market Insights

Information Extraction

See Parts “Instance Extraction” and “Fact Extraction” from my colleague Fabian Suchanek’s lecture: http://suchanek.name/work/teaching/IE2010a.pdf

The Web Graph

The World Wide Web seen as a (directed) graph:

Vertices: Web pages

Edges: hyperlinks

Same for other interlinked environments:

dictionaries

encyclopedias

scientific publications

social networks

Google’s PageRank [Brin and Page, 1998]

Idea: Important pages are pages pointed to by important pages.

Transition matrix $G = (g_{ij})$ of the Web graph:

$$
g_{ij} =
\begin{cases}
0 & \text{if there is no link from page } i \text{ to page } j;\\
\dfrac{1}{n_i} & \text{otherwise, with } n_i \text{ the number of outgoing links of page } i.
\end{cases}
$$

Definition (Tentative): Probability that the surfer following the random walk in $G$ has arrived on page $i$ at some distant given point in the future.

$$
\mathrm{pr}(i) = \left( \lim_{k \to +\infty} (G^T)^k v \right)_i
$$

where $v$ is some initial column vector.
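
The tentative definition can be sketched as a power iteration: build the transition matrix from the link structure, then multiply repeatedly by its transpose. The code below is a minimal illustration under stated assumptions: the three-page graph, the uniform initial vector, and the fixed number of iterations are all choices made for the example, and every page is assumed to have at least one outgoing link.

```python
def transition_matrix(links, n):
    """Build G with g[i][j] = 1/n_i if page i links to page j, and 0 otherwise."""
    g = [[0.0] * n for _ in range(n)]
    for i, targets in links.items():
        for j in targets:
            g[i][j] = 1.0 / len(targets)
    return g


def pagerank(g, iterations=100):
    """Approximate the limit of (G^T)^k v, starting from the uniform vector v."""
    n = len(g)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # (G^T v)_j = sum over i of g[i][j] * v[i]
        v = [sum(g[i][j] * v[i] for i in range(n)) for j in range(n)]
    return v


# Hypothetical graph with three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
scores = pagerank(transition_matrix(links, 3))
print(scores)
```

On this small graph the iteration settles on scores of 0.4, 0.2 and 0.4: page 1 receives only half of page 0's importance, while page 2 collects the other half plus all of page 1's.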

Illustrating PageRank Computation

[Figure: animation of the PageRank iteration on an example graph of 10 pages. All scores start at 0.100; over the 13 subsequent frames they stabilize at approximately 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, and 0.019.]

PageRank With Damping

The iteration may not always converge, or the limit may not be unique. To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability $d$ ($1 - d$: damping factor).

$$
\mathrm{pr}(i) = \left( \lim_{k \to +\infty} \left( (1 - d)\, G^T + d\, U \right)^k v \right)_i
$$

where $U$ is the matrix with all entries equal to $\frac{1}{N}$, with $N$ the number of vertices.
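
A minimal sketch of the damped iteration follows. The example graph is hypothetical and chosen so that the undamped iteration would oscillate forever between pages 0 and 1; the value d = 0.15 (a damping factor of 0.85) is an assumption reflecting a common choice, not something fixed by the slides. As before, every page is assumed to have at least one outgoing link.

```python
def damped_pagerank(links, n, d=0.15, iterations=200):
    """Iterate v <- ((1 - d) G^T + d U) v, starting from the uniform vector.

    d is the probability of a random jump; U has all entries equal to 1/N.
    """
    v = [1.0 / n] * n
    for _ in range(iterations):
        # d U v: every page receives the same share of the jumping surfers.
        new_v = [d * sum(v) / n] * n
        # (1 - d) G^T v: the remaining surfers follow outgoing links uniformly.
        for i, targets in links.items():
            for j in targets:
                new_v[j] += (1 - d) * v[i] / len(targets)
        v = new_v
    return v


# Hypothetical graph: 0 -> 1, 1 -> 0, 2 -> 0 (page 2 has no incoming link).
links = {0: [1], 1: [0], 2: [0]}
scores = damped_pagerank(links, 3)
print(scores)
```

With the random jump, page 2 keeps the minimal score d/N even though nothing links to it, and the scores of pages 0 and 1 converge instead of alternating.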

Using PageRank to Score Search Results

PageRank: global score, independent of the query

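Can be used to raise the weight of important pages, in association with some scoring function dependent on the query:

$$
\mathrm{final}(q, d) = \mathrm{score}(q, d) \times \mathrm{pr}(d)
$$

(here $d$ denotes a document, not the jump probability)

PageRank is only useful in directed graphs! In an undirected graph, it is proportional to the degree.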

HITS [Kleinberg, 1999]

Idea: Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

$G'$: adjacency matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with $a$ and $h$ vectors of norm 1):

$$
\begin{cases}
a := \dfrac{1}{\lVert {G'}^T h \rVert}\, {G'}^T h\\[1ex]
h := \dfrac{1}{\lVert G' a \rVert}\, G' a
\end{cases}
$$

Converges under some technical assumptions to authority and hub scores.
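
The iterative process can be sketched directly from the two update rules. The snippet below is an illustration under assumptions: the four-page graph is hypothetical, the Euclidean norm is used for normalization, and the number of iterations is fixed rather than tested for convergence. The graph must contain at least one link, otherwise the normalization divides by zero.

```python
import math


def normalize(vec):
    """Scale a vector to Euclidean norm 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def hits(adj, iterations=50):
    """adj[i][j] = 1 if page i links to page j. Returns (authority, hub) vectors."""
    n = len(adj)
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(iterations):
        # a := G'^T h / ||G'^T h||: sum the hub scores of the pages linking in.
        a = normalize([sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)])
        # h := G' a / ||G' a||: sum the authority scores of the pages linked to.
        h = normalize([sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h


# Hypothetical subgraph: page 0 links to 2 and 3, page 1 links to 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adj)
print(authority)
print(hub)
```

Page 2, pointed to by both hubs, ends up as the best authority, and page 0, which points to both authorities, as the best hub.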

Using HITS to Order Web Query Results

1. Retrieve the set D of Web pages matching a keyword query.

2. Retrieve the set $D^*$ of Web pages obtained from $D$ by adding all linked pages, as well as all pages linking to pages of $D$.

3. Build from $D^*$ the corresponding subgraph $G'$ of the Web graph.

4. Compute iteratively hub and authority scores.

5. Sort documents from $D$ by authority scores.

Less efficient than PageRank, because the scores are local: they are computed for each query rather than once for the whole graph.

Discovery of communities

Classical problem in social networks: identifying communities of users (or of content) using the graph structure

Two subproblems:

1. Given some initial vertex or vertex set, finding the corresponding community

2. Given the graph as a whole, finding a partition into communities

Maximum Flow / Minimum Cut

[Figure: a flow network between a source and a sink, with edges labeled by their capacities.]

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the rest of the graph

Complexity: $O(n^2 m)$ ($n$: vertices, $m$: edges)
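
To make the idea concrete, here is a minimal sketch that separates a seed from the rest of a graph through a minimum cut. Note that it does not implement the push-relabel algorithm of Goldberg and Tarjan cited above: for brevity it uses the simpler Edmonds-Karp method (augmenting paths found by breadth-first search), whose complexity is O(n m²). The graph, its capacities, and the vertex names are hypothetical.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """Return (max flow value, set of vertices on the source side of a minimum cut).

    capacity maps a directed edge (u, v) to its capacity.
    """
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)   # reverse edges allow flow to be undone
    neighbors = {}
    for (u, v) in residual:
        neighbors.setdefault(u, []).append(v)

    def reachable():
        """Breadth-first search in the residual graph; returns the parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors.get(u, []):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:
            # No augmenting path is left: the vertices still reachable
            # from the source form the source side of a minimum cut.
            return flow, set(parent)
        # Recover the path, find its bottleneck, and push flow along it.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck


# Hypothetical network: seed "s" is strongly tied to a and b, weakly tied to c and "t".
capacity = {
    ("s", "a"): 5, ("s", "b"): 5, ("a", "b"): 3,
    ("a", "c"): 1, ("b", "c"): 1, ("c", "t"): 5,
}
value, community = min_cut(capacity, "s", "t")
print(value, sorted(community))
```

The two weak edges towards c saturate first, so the cut has value 2 and the community found around the seed is {s, a, b}. For an undirected social graph, each edge can be entered in both directions.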

Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clustering algorithm

Also based on maximum flow simulation, in the whole graph

Iteration of a matrix computation alternating:

– Expansion (matrix multiplication, corresponding to flow propagation)

– Inflation (non-linear operation to increase heterogeneity)

Complexity: $O(n^3)$ for an exact computation, $O(n)$ for an approximate one
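
The alternation of expansion and inflation can be sketched with plain dense matrices, which corresponds to the exact (cubic) variant. Several details below are assumptions made for the illustration rather than prescriptions from the slides: self-loops are added before normalization, inflation raises entries to the power 2, the number of iterations is fixed, and clusters are read off by grouping each vertex with the rows that still hold non-negligible flow in its column. The example graph (two triangles joined by one edge) is hypothetical.

```python
def normalize_columns(m):
    """Rescale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adj, inflation=2.0, iterations=50):
    """Cluster an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops, then turn the adjacency matrix into a random walk matrix.
    m = normalize_columns(
        [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        # Expansion: squaring the matrix propagates flow over two steps.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise power, then renormalization, favours strong flows.
        m = normalize_columns([[x ** inflation for x in row] for row in m])

    # Group each vertex j with every row that keeps flow in column j.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for j in range(n):
        for i in range(n):
            if m[i][j] > 1e-6:
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1
clusters = mcl(adj)
print(clusters)
```

On this graph the flow across the single bridging edge dies out after a few rounds of inflation, leaving the two triangles as separate clusters.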

Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]

Top-down graph clustering algorithm

Betweenness of an edge: number of shortest paths between two arbitrary vertices going through this edge

General principle:

1. Compute the betweenness of each edge in the graph

2. Remove the edge with the highest betweenness

3. Redo the whole process, betweenness computation included

Complexity: $O(n^3)$ for a sparse graph
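
The three steps above can be sketched as follows for an unweighted, undirected graph. Betweenness is computed here with a breadth-first accumulation in the style of Brandes' algorithm, which is one possible way to do it and an assumption of this sketch; when several shortest paths connect a pair, each contributes a fraction. The function names and the example graph are hypothetical, and the loop stops as soon as the graph gains one connected component.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge of an undirected graph."""
    score = {}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0.0
                    preds[v] = []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest vertices, crediting edges on the way.
        delta = {v: 0.0 for v in order}
        for v in reversed(order):
            for u in preds.get(v, []):
                share = sigma[u] / sigma[v] * (1 + delta[v])
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + share
                delta[u] += share
    # Every pair of vertices was counted once from each endpoint.
    return {e: c / 2 for e, c in score.items()}


def components(adj):
    """Connected components, each as a sorted list of vertices."""
    seen, result = set(), []
    for s in adj:
        if s not in seen:
            comp, stack = set(), [s]
            while stack:
                u = stack.pop()
                if u not in comp:
                    comp.add(u)
                    stack.extend(adj[u])
            seen |= comp
            result.append(sorted(comp))
    return sorted(result)


def split_once(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)            # recomputed after each removal
        if not scores:
            break                                 # no edge left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(graph)[frozenset((2, 3))]
communities = split_once(graph)
print(bridge_score, communities)
```

All nine shortest paths between the two triangles go through the edge 2-3, so it has the highest betweenness and is removed first, which splits the graph into its two communities.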

Opinion Mining

See my colleague Chloé Clavel’s lecture http://pierre.senellart.com/enseignement/2013-2014/inf344/10-opinion-mining.pdf

Opportunities for Market Insights

Crawl a competitor’s Web site, apply a wrapper to extract structured information, regularly refresh this crawl ⇒ a local database of a competitor’s products and prices, ready to be analyzed

Crawl Web forums, blogs, social networking sites, for opinions about a brand, and mine the obtained social network ⇒ identify opinion leaders, and target them for marketing

Exploit Deep Web forms to crawl all patents pertaining to a particular topic, perform instance extraction to identify all molecules cited in these patents, use linked open data ontologies to connect these molecules to known metabolic pathways ⇒ get more insight into which biological phenomena are targeted by competitors’ inventions

Bibliography

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Rome, Italy, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.

BrightPlanet. The deep Web: Surfacing hidden value. White Paper, July 2000.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, USA, 2003.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):1623–1640, 1999.

Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured databases on the Web: Observations and implications. SIGMOD Record, 33(3):61–70, September 2004.

Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the Web. In Proc. CIDR, Asilomar, USA, January 2005.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. In Proc. VLDB, Cairo, Egypt, September 2000.

Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.

Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.

Muhammad Faheem and Pierre Senellart. Adaptive crawling driven by structure-based link classification, July 2014. Preprint available at http://pierre.senellart.com/publications/faheem2015adaptive.pdf.

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc. Hypertext, Santiago, Chile, September 2014. Douglas Engelbart Best Paper Award.

Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5):604–632, 1999.

Martijn Koster. A standard for robot exclusion. http://www.robotstxt.org/orig.html, June 1994.

Jayant Madhavan, Alon Y. Halevy, Shirley Cohen, Xin Dong, Shawn R. Jeffery, David Ko, and Cong Yu. Structured data meets the Web: A few observations. IEEE Data Engineering Bulletin, 29(4):19–26, December 2006.

Richi Nayak, Pierre Senellart, Fabian M. Suchanek, and Aparna Varde. Discovering interesting information with advances in Web technology. SIGKDD Explorations, 14(2), December 2012.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.

Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Exploring the Web with OXPath. In LWDM, 2011.

Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

Pierre Senellart, Avin Mittal, Daniel Muschick, Rémi Gilleron, and Marc Tommasi. Automatic wrapper induction from hidden-Web sources with domain knowledge. In Proc. WIDM, pages 9–16, Napa, USA, October 2008.

sitemaps.org. Sitemaps XML format. http://www.sitemaps.org/protocol.php, February 2008.

Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.