10 September 2014, Yves Rocher
Data Acquisition and Extraction from the Variety of Web Sources
Pierre Senellart
2 / 74 Télécom ParisTech Pierre Senellart
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Opportunities for Market Insights
Internet and the Web
Internet: physical network of computers (or hosts)
World Wide Web, Web, WWW: logical collection of hyperlinked documents
  static and dynamic
  public Web and private Webs
  each document (or Web page, or resource) identified by a URL
Uniform Resource Locators
https://www.example.com:443/path/to/doc?name=foo&town=bar#para
  scheme: https
  hostname: www.example.com
  port: 443
  path: /path/to/doc
  query string: name=foo&town=bar
  fragment: para
scheme: way the resource can be accessed; generally http or https
hostname: domain name of a host (cf. DNS); the hostname of a Web site may start with www., but this is not a rule.
port: TCP port; defaults: 80 for http and 443 for https
path: logical path of the document
query string: additional parameters (dynamic documents), optional
fragment: subpart of the document, optional
Relative URIs, resolved with respect to a context (e.g., the URI above):
  /titi → https://www.example.com/titi
  tata → https://www.example.com/path/to/tata
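These components and the resolution of relative URIs are handled by Python's standard library; a minimal sketch (the context URI here has no explicit port):

```python
from urllib.parse import urlsplit, urljoin

url = "https://www.example.com:443/path/to/doc?name=foo&town=bar#para"
parts = urlsplit(url)  # exposes scheme, hostname, port, path, query, fragment

# Relative URI resolution with respect to a context URI
context = "https://www.example.com/path/to/doc"
absolute = urljoin(context, "/titi")  # path replaced from the root
sibling = urljoin(context, "tata")    # resolved against /path/to/
```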
(X)HTML
Format of choice for Web pages
Dialect of SGML (the ancestor of XML), but seldom parsed as is
HTML 4.01: most common version, W3C recommendation
XHTML 1.0: XML-ization of HTML 4.01, minor differences
HTML5: most recent version, still in development, adds some better structuring
Actual situation of the Web: tag soup
XHTML example

<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      lang="en" xml:lang="en">
<head>
  <meta http-equiv="Content-Type"
        content="text/html; charset=utf-8" />
  <title>Example XHTML document</title>
</head>
<body>
  <p>This is a
    <a href="http://www.w3.org/">link to the
    <strong>W3C</strong>!</a></p>
</body>
</html>
HTTP
Client-server protocol for the Web, on top of TCP/IP
Example request/response:

GET /myResource HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

<html><head><title>myResource</title></head>
<body><p>Hello world!</p></body>
</html>
HTTPS: secure version of HTTP
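The exchange above can be illustrated by hand-building a request and parsing a canned response; this is a didactic sketch (real code would use an HTTP client library), and `build_request`/`parse_response` are illustrative helpers:

```python
def build_request(host, resource):
    """Compose a minimal HTTP/1.1 GET request; HTTP/1.1 makes the
    Host header mandatory (enabling virtual hosting, cf. next slide)."""
    return (f"GET {resource} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "\r\n")

def parse_response(raw):
    """Split a raw HTTP response into status code, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status, *header_lines = head.split("\r\n")
    _version, code, _reason = status.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), headers, body
```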
Features of HTTP/1.1
virtual hosting: different Web content for different hostnames on a single machine
login/password protection
content negotiation: same URL identifying several resources, client indicates preferences
cookies: chunks of information persistently stored on the client
keep-alive connections: several requests using the same TCP connection
etc.
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Web Crawlers
crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
Basics of crawling:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!)
Graph-browsing problem:
  depth-first: not well adapted, can be lost in robot traps
  best: breadth-first, with limited-depth depth-first on each discovered website
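The four steps above can be sketched as a breadth-first crawl with a depth limit; `fetch_links` is a stand-in for retrieving a page and extracting its outgoing URLs (assumption: it returns a list of URLs):

```python
from collections import deque

def crawl(seeds, fetch_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl with a depth limit, a simple guard
    against robot traps (there is no natural termination otherwise)."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue and len(seen) < max_pages:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Used on an in-memory link graph, the depth limit stops the crawl even though new links keep appearing.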
Sources of new URLs
From HTML pages:
  hyperlinks <a href="...">...</a>
  media <img src="..."> <embed src="..."> <object data="...">
  frames <frame src="..."> <iframe src="...">
  JavaScript links window.open("...")
  etc.
Other hyperlinked content (e.g., PDF files)
Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them
Referrer URLs
Sitemaps [sitemaps.org, 2008]
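A rough sketch of the regular-expression approach to extracting absolute URLs from arbitrary text (real crawlers use more careful patterns and canonicalize the results):

```python
import re

# Matches absolute http(s) URLs, stopping at whitespace, quotes, or
# angle brackets (a heuristic, not a full RFC 3986 parser)
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

text = 'Links: <https://example.com/a?x=1> and see http://example.org/b here'
urls = URL_RE.findall(text)
```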
Scope of a crawler
Web-scale: the Web is infinite! Avoid robot traps by putting depth or page-number limits on each Web server; focus on important pages [Abiteboul et al., 2003]
Web servers under a list of DNS domains: easy filtering of URLs
A given topic: focused crawling techniques [Chakrabarti et al., 1999, Diligenti et al., 2000, Gouriten et al., 2014] based on classifiers of Web page content and predictors of the interest of a link
The national Web (cf. public deposit, national libraries): what is this? [Abiteboul et al., 2002]
A given Web site: what is a Web site? [Senellart, 2005]
Identification of duplicate Web pages
Problem
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
trivial duplicates: same resource at the same canonized URL:
  http://example.com:80/toto
  http://example.com/titi/../toto
exact duplicates: identification by hashing
near-duplicates: (timestamps, tip of the day, etc.) more complex!
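A minimal sketch of the first two cases: canonizing URLs so that trivial duplicates collapse, and hashing content to spot exact duplicates (real canonization handles many more cases):

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Sketch of URL canonization: lowercase the host, drop default
    ports, resolve ./.. path segments, drop the fragment."""
    s = urlsplit(url)
    netloc = s.netloc.lower()
    if (s.scheme, s.port) in (("http", 80), ("https", 443)):
        netloc = s.hostname  # drop the redundant default port
    path = posixpath.normpath(s.path) if s.path else "/"
    return urlunsplit((s.scheme, netloc, path, s.query, ""))

def fingerprint(content: bytes) -> str:
    """Hash-based identification of exact duplicates."""
    return hashlib.sha1(content).hexdigest()
```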
Crawling ethics
Standard for robot exclusion: robots.txt at the root of a Web server [Koster, 1994].
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion.
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Per-link exclusion.
<a href="toto.html" rel="nofollow">Toto</a>
Avoid Denial of Service (DoS): wait ≥ 1 s between two repeated requests to the same Web server
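Python's standard library implements the robot exclusion standard; a sketch using the robots.txt excerpt above:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
# Before each request, a polite crawler checks can_fetch() and waits
# at least one second between requests to the same server.
ok_history = rp.can_fetch("mybot", "http://www.example.com/searchhistory/")
ok_search = rp.can_fetch("mybot", "http://www.example.com/search")
```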
Parallel processing
Network delays, waits between requests:
Per-server queue of URLs
Parallel processing of requests to different hosts:
multi-threaded programming
asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead
Use of keep-alive to reduce connection overheads
General Architecture [Chakrabarti, 2003]
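The per-server queues with politeness delays can be sketched as follows; `PoliteFrontier` is an illustrative name, and the injectable `clock` is only there to make the logic testable without real waiting:

```python
import time
from collections import deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """Per-host URL queues enforcing a minimum delay between two
    requests to the same host."""
    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay, self.clock = delay, clock
        self.queues = {}   # host -> deque of URLs waiting to be fetched
        self.next_ok = {}  # host -> earliest time of the next request
    def push(self, url):
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)
    def pop(self):
        """Return a URL whose host is ready, or None if all hosts
        are still in their politeness delay."""
        now = self.clock()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0) <= now:
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None
```

Threads (or an asynchronous event loop) then pop URLs from the frontier, so that requests to different hosts proceed in parallel while each host is still queried politely.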
Refreshing URLs
Content on the Web changes
Different change rates:
  online newspaper main page: every hour or so
  published article: virtually no change
Continuous crawling, and identification of change rates for adaptive crawling: how to know the time of last modification of a Web page?
Estimating the Freshness of a Page
1. Check HTTP timestamp.
2. Check content timestamp.
3. Compare a hash of the page with a stored hash.
4. Non-significant differences (ads, fortunes, request timestamp):
  only hash text content, or “useful” text content;
  compare distribution of n-grams (shingling);
  or even compute edit distance with previous version.
Adapting strategy to each different archived website?
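A minimal sketch of the shingling idea: compare the sets of word n-grams of two versions with a Jaccard coefficient, so that non-significant differences barely affect the score:

```python
def shingles(text, n=3):
    """Set of word n-grams (shingles) of a text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 1.0
```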
Crawling Modern Web Sites
Some modern Web sites only work when cookies are activated (session cookies), or when JavaScript code is interpreted
Regular Web crawlers (wget, Heritrix, Apache Nutch) usually don't do cookie management and don't interpret JavaScript code
Crawling some Web sites therefore requires more advanced tools
Advanced crawling tools
Web scraping frameworks such as scrapy (Python) or WWW::Mechanize (Perl) simulate Web browser interaction and cookie management (but no JS interpretation)
Headless browsers such as htmlunit simulate a Web browser, including simple JavaScript processing
Browser instrumentors such as Selenium allow full instrumentation of a regular Web browser (Chrome, Firefox, Internet Explorer)
OXPath: a full-fledged navigation and extraction language for complex Web sites [Sellers et al., 2011] Demo
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Templated Web Site
Many Web sites (especially Web forums and blogs) use one of a few content management systems (CMSs)
Web sites that use the same CMS will be similarly structured, present a similar layout, etc.
Information is somewhat structured in CMSs: publication date, author, tags, forums, threads, etc.
Some structure differences may exist when Web sites use different versions, or different themes, of a CMS
Crawling CMS-Based Web Sites
Traditional crawling approaches crawl Web sites independently of the nature of the sites and of their CMS. When the CMS is known:
  Potential for much more efficient crawling strategies (avoid pages with redundant information, uninformative pages, etc.)
  Potential for automatic extraction of structured content
Two ways of approaching the problem:
  Have a handcrafted knowledge base of known CMSs, their characteristics, how to crawl them and extract information [Faheem and Senellart, 2013b,a] (AAH) Demo
  Automatically infer the best way to crawl a given CMS [Faheem and Senellart, 2014] (ACE)
Need to be robust w.r.t. template change
Detecting CMSs
One main challenge in intelligent crawling and content extraction is to identify the CMS, and then apply the best crawling strategy accordingly
Detecting CMSs using:
1. URL patterns,
2. HTTP metadata,
3. textual content,
4. XPath patterns, etc.
These can be manually described (AAH), or automatically inferred (ACE)
For instance, the vBulletin Web forum content management system can be identified by searching for a reference to a vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
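A sketch of fingerprint-based CMS detection on raw HTML; the fingerprint table here is an illustrative assumption, not the actual AAH knowledge base, and a regular expression stands in for the XPath test:

```python
import re

# Hypothetical fingerprints mapping a CMS name to a content pattern
FINGERPRINTS = {
    "vBulletin": re.compile(r'src="[^"]*vbulletin_global\.js"'),
    "WordPress": re.compile(r'/wp-content/'),
}

def detect_cms(html):
    """Return the names of CMSs whose fingerprint appears in the page."""
    return [name for name, pat in FINGERPRINTS.items() if pat.search(html)]
```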
Crawling http://www.rockamring-blog.de/ [Faheem and Senellart, 2014]
[Figure: number of distinct 2-grams retrieved (×1,000) as a function of the number of HTTP requests (up to 6,000), comparing ACE, AAH, and wget.]
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Most popular Web sites (Alexa):
1. google.com  2. facebook.com  3. youtube.com  4. yahoo.com  5. baidu.com  6. wikipedia.org  7. live.com  8. twitter.com  9. qq.com  10. amazon.com  11. blogspot.com  12. linkedin.com  13. google.co.in  14. taobao.com  15. sina.com.cn  16. yahoo.co.jp  17. msn.com  18. wordpress.com  19. google.com.hk  20. t.co  21. google.de  22. ebay.com  23. google.co.jp  24. googleusercontent.com  25. google.co.uk  26. yandex.ru  27. 163.com  28. weibo.com
Social networking sites
Sites with social networking features (friends, user-shared content, user profiles, etc.)
Social data on the Web
Huge numbers of users (2012):
Facebook 900 million
QQ 540 million
W. Live 330 million
Weibo 310 million
Google+ 170 million
Twitter 140 million
LinkedIn 100 million
Huge volume of shared data:
250 million tweets per day on Twitter (3,000 per second on average!)...
...including statements by heads of state, revelations of political activists, etc.
Crawling Social Networks
Theoretically possible to crawl social networking sites using a regular Web crawler
Sometimes not possible: https://www.facebook.com/robots.txt
Often very inefficient, considering politeness constraints
Better solution: use the provided social networking APIs
  https://dev.twitter.com/docs/api/1.1
  https://developers.facebook.com/docs/graph-api/reference/v2.1/
  https://developer.linkedin.com/apis
  https://developers.google.com/youtube/v3/
Also possible to buy access to the data, directly from the social network or from brokers such as http://gnip.com/
Social Networking APIs
Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content
Usually a RESTful API, occasionally SOAP-based
Usually requires a token identifying the application using the API, sometimes a cryptographic signature as well
May access the API as an authenticated user of the social network, or as an external party
APIs seriously limit the rate of requests: https://dev.twitter.com/docs/api/1.1/get/search/tweets
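Client-side rate limiting can be sketched with a sliding-window counter; `RateLimiter` is an illustrative helper (the actual limits and windows come from each API's documentation), and the injectable `clock` only makes the logic testable without waiting:

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.calls = deque()  # timestamps of recent calls
    def acquire(self):
        """True if a call may proceed now, False if over the limit."""
        now = self.clock()
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()  # forget calls outside the window
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```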
REST
Mode of interaction with a Web service
Follows the KISS (Keep It Simple, Stupid) principle
Each request to the service is a simple HTTP GET method
Base URL is the URL of the service
Parameters of the service are sent as HTTP parameters (in the URL)
HTTP response code indicates success or failure
Response contains structured output, usually as JSON or XML
No side effect; each request is independent of previous ones
Example: http://graph.facebook.com:80/?ids=7901103
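A sketch of such a REST call: the request is a plain URL with HTTP parameters, and the response is structured JSON output. The response body below is canned and purely illustrative of the general shape; the real service's schema may differ:

```python
import json
from urllib.parse import urlencode

# Building the request URL: base URL of the service + HTTP parameters
url = "http://graph.facebook.com/" + "?" + urlencode({"ids": "7901103"})

# Canned response body, standing in for what the service would return
body = '{"7901103": {"id": "7901103", "name": "Example Page"}}'
data = json.loads(body)  # structured output, ready to be exploited
```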
The Case of Twitter
Two main APIs:
  REST APIs, including search, getting information about a user, a list, followers, etc. https://dev.twitter.com/docs/api/1.1
  Streaming API, providing real-time results
Very limited history available
Search can be on keywords, language, geolocation (for a small portion of tweets)
Cross-Network Crawling
Often useful to combine results from different social networks
Numerous libraries facilitating SN API access (twipy, Facebook4J, FourSquare VP C++ API...), incompatible with each other... Some efforts at generic APIs (OneAll, APIBlender [Gouriten et al., 2014]) Demo
Example use case: no API to get all check-ins from FourSquare, but a number of check-ins are available on Twitter; given results of Twitter Search/Streaming, use the FourSquare API to get information about check-in locations.
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
The Deep Web
Definition (Deep Web, Hidden Web, Invisible Web)
All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.
Size estimate: 500 times more content than on the surface Web! [BrightPlanet, 2000]. Hundreds of thousands of deep Web databases [Chang et al., 2004]
Sources of the Deep Web
Example
Yellow Pages and other directories;
Library catalogs;
Weather services;
US Census Bureau data;
etc.
Discovering Knowledge from the Deep Web [Nayak et al., 2012]
Content of the deep Web is hidden from classical Web search engines (they just follow links)
But very valuable and high quality!
Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed via the deep Web
How to benefit from this information?
How to analyze, extract, and model this information?
Focus here: automatic, unsupervised methods, for a given domain of interest
Extensional Approach
[Diagram: services are discovered on the WWW, their content is siphoned by submitting forms and indexed; data found in the index bootstraps further siphoning.]
Notes on the Extensional Approach
Main issues:
  Discovering services
  Choosing appropriate data to submit to forms
  Use of data found in result pages to bootstrap the siphoning process
  Ensuring good coverage of the database
Approach favored by Google, used in production [Madhavan et al., 2006]
Not always feasible (huge load on Web servers)
Intensional Approach
[Diagram: forms discovered on the WWW are probed and analyzed; each form is wrapped as a Web service that can then be queried.]
Notes on the Intensional Approach
More ambitious [Chang et al., 2005, Senellart et al., 2008]
Main issues:
  Discovering services
  Understanding the structure and semantics of a form
  Understanding the structure and semantics of result pages
  Semantic analysis of the service as a whole
  Query rewriting using the services
No significant load imposed on Web servers
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
The Semantic Web
A Web in which the resources are semantically described: annotations give information about a page, explain an expression in a page, etc.
More precisely, a resource is anything that can be referred to by a URI:
  a Web page, identified by a URL
  a fragment of an XML document, identified by an element node of the document
  a Web service
  a thing, an object, a concept, a property, etc.
Semantic annotations: logical assertions that relate resources tosome terms in associated ontologies
Ontologies
Formal descriptions providing human users a shared understanding of a given domain
A controlled vocabulary
Formally defined so that it can also be processed by machines
Logical semantics that enables reasoning
Reasoning is the key for different important tasks of Web data management, in particular:
  to answer queries (over possibly distributed data)
  to relate objects in different data sources, enabling their integration
  to detect inconsistencies or redundancies
  to refine queries with too many answers, or to relax queries with no answer
Where Do Ontologies Come From?
Manually crafted to represent the knowledge of a specific domain (e.g., life sciences)
Exported from classical Web databases
Through information extraction from the Web, Wikipedia, etc. (e.g., DBpedia, YAGO)
Private to a company, or public
Some ontologies focus on instances, others on a schema (see further)
Value of the Semantic Web: bits of one ontology can be re-used in another, and ontologies can be mapped through owl:sameAs links
[Figure: Linking Open Data cloud diagram, as of September 2011, by Richard Cyganiak and Anja Jentzsch (http://lod-cloud.net/); hundreds of interlinked datasets grouped into the domains media, geographic, publications, government, cross-domain, life sciences, and user-generated content.]
Classes and class hierarchy
Backbone of the ontology
AcademicStaff is a Class (a class will be interpreted as a set of objects)
AcademicStaff isa Staff (isa is interpreted as set inclusion)
Example hierarchy:
  FacultyComponent
    Course
      MathCourse: Probabilities, Algebra, Logic
      CSCourse: DB, AI, Java
    Student
      UndergraduateStudent, MasterStudent, PhDStudent
    Department
      PhysicsDept, MathsDept, CSDept
    Staff
      AcademicStaff: Lecturer, Researcher, Professor
      AdministrativeStaff
Relations
Declaration of relations with their signature
(Relations will be interpreted as binary relations between objects)
TeachesIn(AcademicStaff, Course)
  if one states that “X TeachesIn Y”, then X belongs to AcademicStaff and Y to Course
TeachesTo(AcademicStaff, Student)
Leads(Staff, Department)
Instances
Classes have instances
Dupond is an instance of the class Professor
corresponds to the fact: Professor(Dupond)
Relations also have instances
(Dupond,CS101) is an instance of the relation TeachesIn
corresponds to the fact: TeachesIn(Dupond,CS101)
The instance statements can be seen as (and stored in) a database
Ontology = schema + instance
Schema (TBox)
  The set of class and relation names
  The signatures of relations, and also constraints
  The constraints are used for two purposes:
    – checking data consistency (like dependencies in databases)
    – inferring new facts
Instance (ABox)
  The set of facts
  The set of base facts together with the inferred facts should satisfy the constraints
Ontology (i.e., Knowledge Base) = Schema + Instance
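A toy sketch of this schema/instance separation, using names from the previous slides: relation signatures and the isa hierarchy of the TBox let us infer new unary facts from the base facts of the ABox (a deliberately minimal saturation, far from a full reasoner):

```python
# Toy schema (TBox): isa hierarchy and relation signatures
isa = {"Professor": "AcademicStaff", "AcademicStaff": "Staff",
       "MathCourse": "Course"}
signatures = {"TeachesIn": ("AcademicStaff", "Course")}

def superclasses(cls):
    """A class together with all its ancestors in the isa hierarchy."""
    out = []
    while cls is not None:
        out.append(cls)
        cls = isa.get(cls)
    return out

def infer(facts):
    """Saturate a set of facts: binary facts are (Relation, x, y),
    unary facts are (Class, x)."""
    inferred = set(facts)
    for rel, x, y in [f for f in facts if len(f) == 3]:
        domain, rng = signatures[rel]
        for c in superclasses(domain):  # x is in the domain classes
            inferred.add((c, x))
        for c in superclasses(rng):     # y is in the range classes
            inferred.add((c, y))
    for f in list(inferred):            # close unary facts under isa
        if len(f) == 2:
            for c in superclasses(f[0]):
                inferred.add((c, f[1]))
    return inferred
```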
Where can Semantic Content be Found?
In the linked data, through Web-available RDF data:
  dumps of an entire ontology, in one of the RDF serialization formats (RDF/XML, Turtle, N-Triples)
  crawlable RDF content, with small fragments pointing to other fragments
  a SPARQL endpoint
  HTML annotated with RDFa, cf. http://www.w3.org/TR/rdfa-syntax/
Other popular semantic content embedded in Web pages: microformats (hCard, vCard, etc.), microdata (cf. http://www.schemas.org/). Not directly in the spirit of the Semantic Web, but heavily used.
RDF content used internally in a company
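As an example of consuming such a dump, a minimal N-Triples line parser; this is a sketch handling only URI and plain-literal objects (real N-Triples also has blank nodes, typed and language-tagged literals):

```python
import re

# <subject> <predicate> (<object-uri> | "literal") .
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.')

def parse_ntriples(text):
    """Return (subject, predicate, object) tuples from N-Triples lines."""
    triples = []
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, o_uri, o_lit = m.groups()
            triples.append((s, p, o_uri if o_uri is not None else o_lit))
    return triples
```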
How to Acquire Semantic Content?
Much easier to exploit, as it is already semantically described
Individual resources (dumps, SPARQL endpoints) that have beenidentified as valuable can be directly exploited
RDFa content, microformats, microdata, can be discovered from regular Web crawls
Not perfect! There are errors, lies, etc.
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired InformationInformation ExtractionGraph MiningOpinion Mining
Opportunities for Market Insights
Information Extraction
See parts “Instance Extraction” and “Fact Extraction” from my colleague Fabian Suchanek's lecture: http://suchanek.name/work/teaching/IE2010a.pdf
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired InformationInformation ExtractionGraph MiningOpinion Mining
Opportunities for Market Insights
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries
encyclopedias
scientific publications
social networks
Google’s PageRank [Brin and Page, 1998]
Idea
Important pages are pages pointed to by important pages.

g_{ij} = 0 if there is no link from page i to page j;
g_{ij} = 1/n_i otherwise, with n_i the number of outgoing links of page i.

Definition (Tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future:

pr(i) = ( lim_{k→+∞} (G^T)^k v )_i

where v is some initial column vector.
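The limit above can be computed by power iteration; a minimal sketch assuming every page has at least one outgoing link (dangling pages and damping come later):

```python
def pagerank(links, iterations=100):
    """links: page -> list of successor pages. Plain power iteration
    of the transition matrix G^T, starting from a uniform vector."""
    n = len(links)
    pr = {u: 1.0 / n for u in links}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in links}
        for u, succs in links.items():
            share = pr[u] / len(succs)  # g_uv = 1/n_u for each successor
            for v in succs:
                nxt[v] += share
        pr = nxt
    return pr
```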
Illustrating PageRank Computation
[Figure: step-by-step PageRank computation on a 10-vertex example graph; starting from a uniform score of 0.100 per page, the iteration converges after about a dozen steps to scores of approximately 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, and 0.019.]
PageRank With Damping
May not always converge, or convergence may not be unique. To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).

pr(i) = ( lim_{k→+∞} ((1 − d) G^T + d U)^k v )_i

where U is the matrix with all values equal to 1/N, N being the number of vertices.
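A sketch of the damped variant, following the slide's notation (jump to a uniformly chosen page with probability d, so 1 − d is the damping factor); it still assumes every page has at least one outgoing link:

```python
def pagerank_damped(links, d=0.15, iterations=100):
    """links: page -> list of successor pages. Power iteration of
    (1 - d) G^T + d U, with U the uniform-jump matrix."""
    n = len(links)
    pr = {u: 1.0 / n for u in links}
    for _ in range(iterations):
        nxt = {u: d / n for u in links}  # the d*U part: uniform jump
        for u, succs in links.items():
            for v in succs:              # the (1 - d)*G^T part
                nxt[v] += (1 - d) * pr[u] / len(succs)
        pr = nxt
    return pr
```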
Using PageRank to Score Search Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages, combined with some query-dependent scoring function:

final(q, d) = score(q, d) · pr(d)

PageRank is only useful in directed graphs! It is proportional to the degree otherwise.
HITS [Kleinberg, 1999]
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

Let G′ be the adjacency matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with a and h vectors of norm 1):

a := G′^T h / ‖G′^T h‖
h := G′ a / ‖G′ a‖

Converges under some technical assumptions to authority and hub scores.
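The iteration can be sketched directly on an edge set (a naive O(|V|·|E|) version per step, fine for the small subgraphs HITS is run on):

```python
import math

def hits(edges, nodes, iterations=50):
    """edges: set of (u, v) pairs of the subgraph G'.
    Returns (authority, hub) score dictionaries."""
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority of n: sum of hub scores of pages pointing to n
        a = {n: sum(h[u] for (u, v) in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(x * x for x in a.values())) or 1.0
        a = {n: x / norm for n, x in a.items()}
        # hub of n: sum of authority scores of pages n points to
        h = {n: sum(a[v] for (u, v) in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(x * x for x in h.values())) or 1.0
        h = {n: x / norm for n, x in h.items()}
    return a, h
```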
Using HITS to Order Web Query Results
1. Retrieve the set D of Web pages matching a keyword query.
2. Retrieve the set D′ of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3. Build from D′ the corresponding subgraph G′ of the Web graph.
4. Compute iteratively hub and authority scores.
5. Sort documents from D by authority score.
Less efficient than PageRank, because the scores are local (computed per query).
Discovery of communities
Classical problem in social networks: identifying communities of users (or of content) using the graph structure
Two subproblems:
1. Given some initial vertex or vertex set, finding the corresponding community
2. Given the graph as a whole, finding a partition into communities
Maximum Flow / Minimum Cut
[Figure: a small flow network with edge capacities between a source and a sink; a maximum flow saturates a minimum cut, which separates the source's community from the rest of the graph.]
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the rest of the graph

Complexity: O(n²m) (n: vertices, m: edges)
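As a sketch of this separation step, here is a plain Edmonds-Karp augmenting-path algorithm (the slides cite the more efficient Goldberg-Tarjan push-relabel method); the dictionary encoding of capacities is an illustrative assumption:

```python
from collections import deque

def min_cut_community(capacity, source, sink):
    """capacity: dict {(u, v): c} of directed edge capacities.
    Returns the source side of a minimum cut, i.e. the community
    containing the seed vertex `source`."""
    nodes = {u for u, _ in capacity} | {v for _, v in capacity}
    res = dict(capacity)  # residual capacities
    for u, v in capacity:
        res.setdefault((v, u), 0)

    def bfs_path():
        # shortest augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if u == sink:
                return parent
            for w in nodes:
                if w not in parent and res.get((u, w), 0) > 0:
                    parent[w] = u
                    queue.append(w)
        return None

    while True:  # Edmonds-Karp: augment until no path remains
        parent = bfs_path()
        if parent is None:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[(u, w)] for u, w in path)
        for u, w in path:
            res[(u, w)] -= bottleneck
            res[(w, u)] += bottleneck

    # vertices still reachable from the seed form the source side of the cut
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for w in nodes:
            if w not in reachable and res.get((u, w), 0) > 0:
                reachable.add(w)
                queue.append(w)
    return reachable
```

Once the flow is maximal, the saturated edges form the minimum cut, and residual reachability from the seed delimits its community.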
Markov Cluster Algorithm (MCL) [van Dongen, 2000]
Graph clustering algorithm
Also based on flow simulation, over the whole graph
Iteration of a matrix computation alternating:
Expansion (matrix multiplication, corresponding to flow propagation)
Inflation (non-linear operation to increase heterogeneity)

Complexity: O(n³) for an exact computation, O(n) for an approximate one
[van Dongen, 2000]
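The expansion/inflation alternation can be sketched as follows. This is an illustration, not van Dongen's implementation: the adjacency-matrix encoding, inflation value, iteration count, and the cluster read-out convention (assigning each vertex to its strongest attractor row) are all assumptions:

```python
def mcl(adj, inflation=2.0, iterations=30):
    """adj: symmetric 0/1 adjacency matrix as a list of lists.
    Returns a partition of the vertices, as a list of sets."""
    n = len(adj)

    def normalize(mat):  # make each column sum to 1 (column-stochastic)
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat

    # add self-loops (as is usual for MCL) and normalize
    m = normalize([[adj[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # expansion: squaring the matrix propagates flow
        p = [[sum(m[i][k] * m[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
        # inflation: raising entries to a power favours strong currents
        m = normalize([[p[i][j] ** inflation for j in range(n)]
                       for i in range(n)])
    # assign each vertex to its attractor (row with the largest value)
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, set()).add(j)
    return list(clusters.values())
```

On the classical example of two triangles joined by a single edge, the process converges to one attractor per triangle, splitting the graph into its two natural communities.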
Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]
Top-down graph clustering algorithm
Betweenness of an edge: number of shortest paths between two arbitrary vertices going through this edge
General principle:
1. Compute the betweenness of each edge in the graph
2. Remove the edge with the highest betweenness
3. Redo the whole process, betweenness computation included

Complexity: O(n³) for a sparse graph
[Newman and Girvan, 2004]
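The general principle can be sketched with a Brandes-style betweenness computation. This is an illustration, not the authors' implementation; the dict-of-neighbour-sets graph encoding is an assumption, and only a single deletion step is shown:

```python
from collections import deque

def edge_betweenness(adj):
    """adj: dict mapping each vertex to the set of its neighbours
    (undirected graph). Returns {edge: betweenness}, with each edge
    a sorted tuple; accumulates dependencies over per-source BFS trees."""
    bet = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate path dependencies from the leaves
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = tuple(sorted((v, w)))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    # each path was counted from both of its endpoints
    return {e: b / 2 for e, b in bet.items()}

def girvan_newman_step(adj):
    """One step of the top-down algorithm: remove the edge with the
    highest betweenness, then recompute before the next step."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return (u, v)
```

On two triangles joined by a bridge, the bridge carries all nine inter-triangle shortest paths, so it is removed first and the two communities fall apart.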
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Information Extraction
Graph Mining
Opinion Mining
Opportunities for Market Insights
Opinion Mining
See my colleague Chloé Clavel's lecture: http://pierre.senellart.com/enseignement/2013-2014/inf344/10-opinion-mining.pdf
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Opportunities for Market Insights
Opportunities for Market Insights
Crawl a competitor's Web site, apply a wrapper to extract structured information, regularly refresh this crawl ⇒ a local database of a competitor's products and prices, ready to be analyzed

Crawl Web forums, blogs, social networking sites for opinions about a brand, and mine the obtained social network ⇒ identify opinion leaders, and target them for marketing

Exploit deep Web forms to crawl all patents pertaining to a particular topic, perform instance extraction to identify all molecules cited in the patents, use linked open data ontologies to connect these molecules to known metabolic pathways ⇒ get more insight into which biological phenomena are targeted by competitors' inventions
Bibliography I

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Rome, Italy, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.

BrightPlanet. The deep Web: Surfacing hidden value. White Paper, July 2000.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, USA, 2003.

Bibliography II

Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):1623–1640, 1999.

Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured databases on the Web: Observations and implications. SIGMOD Record, 33(3):61–70, September 2004.

Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the Web. In Proc. CIDR, Asilomar, USA, January 2005.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. In Proc. VLDB, Cairo, Egypt, September 2000.

Bibliography III

Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.

Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.

Muhammad Faheem and Pierre Senellart. Adaptive crawling driven by structure-based link classification, July 2014. Preprint available at http://pierre.senellart.com/publications/faheem2015adaptive.pdf.

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Bibliography IV

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc. Hypertext, Santiago, Chile, September 2014. Douglas Engelbart Best Paper Award.

Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

Martijn Koster. A standard for robot exclusion. http://www.robotstxt.org/orig.html, June 1994.

Jayant Madhavan, Alon Y. Halevy, Shirley Cohen, Xin Dong, Shawn R. Jeffery, David Ko, and Cong Yu. Structured data meets the Web: A few observations. IEEE Data Engineering Bulletin, 29(4):19–26, December 2006.

Bibliography V

Richi Nayak, Pierre Senellart, Fabian M. Suchanek, and Aparna Varde. Discovering interesting information with advances in Web technology. SIGKDD Explorations, 14(2), December 2012.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.

Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Exploring the Web with OXPath. In LWDM, 2011.

Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

Bibliography VI

Pierre Senellart, Avin Mittal, Daniel Muschick, Rémi Gilleron, and Marc Tommasi. Automatic wrapper induction from hidden-Web sources with domain knowledge. In Proc. WIDM, pages 9–16, Napa, USA, October 2008.

sitemaps.org. Sitemaps XML format. http://www.sitemaps.org/protocol.php, February 2008.

Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.