10 September 2014, Yves Rocher
Data Acquisition and Extraction from the Variety of Web Sources
Pierre Senellart
2 / 74 Télécom ParisTech Pierre Senellart
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Opportunities for Market Insights
Internet and the Web
Internet: physical network of computers (or hosts)
World Wide Web, Web, WWW: logical collection of hyperlinked documents
  static and dynamic
  public Web and private Webs
  each document (or Web page, or resource) identified by a URL
Uniform Resource Locators
https://www.example.com:443/path/to/doc?name=foo&town=bar#para
  scheme: https
  hostname: www.example.com
  port: 443
  path: /path/to/doc
  query string: name=foo&town=bar
  fragment: para
scheme: way the resource can be accessed; generally http or https
hostname: domain name of a host (cf. DNS); the hostname of a Web site may start with www., but this is not a rule.
port: TCP port; defaults: 80 for http and 443 for https
path: logical path of the document
query string: additional parameters (dynamic documents), optional
fragment: subpart of the document, optional
Relative URIs, resolved with respect to a context (e.g., the URI above):
  /titi → https://www.example.com/titi
  tata → https://www.example.com/path/to/tata
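These components and the resolution of relative URIs are handled by Python's standard library; a minimal sketch (the context URI here has no explicit port):

```python
from urllib.parse import urlsplit, urljoin

url = "https://www.example.com:443/path/to/doc?name=foo&town=bar#para"
parts = urlsplit(url)  # exposes scheme, hostname, port, path, query, fragment

# Relative URI resolution with respect to a context URI
context = "https://www.example.com/path/to/doc"
absolute = urljoin(context, "/titi")  # path replaced from the root
sibling = urljoin(context, "tata")    # resolved against /path/to/
```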
(X)HTML
Format of choice for Web pages
Dialect of SGML (the ancestor of XML), but seldom parsed as is
HTML 4.01: most common version, W3C recommendation
XHTML 1.0: XML-ization of HTML 4.01, minor differences
HTML5: most recent version, still in development, adds some better structuring
Actual situation of the Web: tag soup
XHTML example

<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      lang="en" xml:lang="en">
<head>
  <meta http-equiv="Content-Type"
        content="text/html; charset=utf-8" />
  <title>Example XHTML document</title>
</head>
<body>
  <p>This is a
    <a href="http://www.w3.org/">link to the
    <strong>W3C</strong>!</a></p>
</body>
</html>
HTTP
Client-server protocol for the Web, on top of TCP/IP
Example request/response:

GET /myResource HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

<html><head><title>myResource</title></head>
<body><p>Hello world!</p></body>
</html>
HTTPS: secure version of HTTP
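The exchange above can be illustrated by hand-building a request and parsing a canned response; this is a didactic sketch (real code would use an HTTP client library), and `build_request`/`parse_response` are illustrative helpers:

```python
def build_request(host, resource):
    """Compose a minimal HTTP/1.1 GET request; HTTP/1.1 makes the
    Host header mandatory (enabling virtual hosting, cf. next slide)."""
    return (f"GET {resource} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "\r\n")

def parse_response(raw):
    """Split a raw HTTP response into status code, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status, *header_lines = head.split("\r\n")
    _version, code, _reason = status.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), headers, body
```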
Features of HTTP/1.1
virtual hosting: different Web content for different hostnames on a single machine
login/password protection
content negotiation: same URL identifying several resources, client indicates preferences
cookies: chunks of information persistently stored on the client
keep-alive connections: several requests using the same TCP connection
etc.
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Web Crawlers
crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
Basics of crawling:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!)
Graph-browsing problem:
  depth-first: not well adapted, can be lost in robot traps
  best: breadth-first, with limited-depth depth-first on each discovered website
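The four steps above can be sketched as a breadth-first crawl with a depth limit; `fetch_links` is a stand-in for retrieving a page and extracting its outgoing URLs (assumption: it returns a list of URLs):

```python
from collections import deque

def crawl(seeds, fetch_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl with a depth limit, a simple guard
    against robot traps (there is no natural termination otherwise)."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue and len(seen) < max_pages:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Used on an in-memory link graph, the depth limit stops the crawl even though new links keep appearing.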
Sources of new URLs
From HTML pages:
  hyperlinks <a href="...">...</a>
  media <img src="..."> <embed src="..."> <object data="...">
  frames <frame src="..."> <iframe src="...">
  JavaScript links window.open("...")
  etc.
Other hyperlinked content (e.g., PDF files)
Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them
Referrer URLs
Sitemaps [sitemaps.org, 2008]
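A rough sketch of the regular-expression approach to extracting absolute URLs from arbitrary text (real crawlers use more careful patterns and canonicalize the results):

```python
import re

# Matches absolute http(s) URLs, stopping at whitespace, quotes, or
# angle brackets (a heuristic, not a full RFC 3986 parser)
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

text = 'Links: <https://example.com/a?x=1> and see http://example.org/b here'
urls = URL_RE.findall(text)
```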
Scope of a crawler
Web-scale: the Web is infinite! Avoid robot traps by putting depth or page-number limits on each Web server; focus on important pages [Abiteboul et al., 2003]
Web servers under a list of DNS domains: easy filtering of URLs
A given topic: focused crawling techniques [Chakrabarti et al., 1999, Diligenti et al., 2000, Gouriten et al., 2014] based on classifiers of Web page content and predictors of the interest of a link
The national Web (cf. public deposit, national libraries): what is this? [Abiteboul et al., 2002]
A given Web site: what is a Web site? [Senellart, 2005]
Identification of duplicate Web pages
Problem
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
trivial duplicates: same resource at the same canonized URL:
  http://example.com:80/toto
  http://example.com/titi/../toto
exact duplicates: identification by hashing
near-duplicates: (timestamps, tip of the day, etc.) more complex!
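A minimal sketch of the first two cases: canonizing URLs so that trivial duplicates collapse, and hashing content to spot exact duplicates (real canonization handles many more cases):

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Sketch of URL canonization: lowercase the host, drop default
    ports, resolve ./.. path segments, drop the fragment."""
    s = urlsplit(url)
    netloc = s.netloc.lower()
    if (s.scheme, s.port) in (("http", 80), ("https", 443)):
        netloc = s.hostname  # drop the redundant default port
    path = posixpath.normpath(s.path) if s.path else "/"
    return urlunsplit((s.scheme, netloc, path, s.query, ""))

def fingerprint(content: bytes) -> str:
    """Hash-based identification of exact duplicates."""
    return hashlib.sha1(content).hexdigest()
```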
Crawling ethics
Standard for robot exclusion: robots.txt at the root of a Web server [Koster, 1994].
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion.
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Per-link exclusion.
<a href="toto.html" rel="nofollow">Toto</a>
Avoid Denial of Service (DoS): wait ≥ 1 s between two repeated requests to the same Web server
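Python's standard library implements the robot exclusion standard; a sketch using the robots.txt excerpt above:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
# Before each request, a polite crawler checks can_fetch() and waits
# at least one second between requests to the same server.
ok_history = rp.can_fetch("mybot", "http://www.example.com/searchhistory/")
ok_search = rp.can_fetch("mybot", "http://www.example.com/search")
```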
Parallel processing
Network delays, waits between requests:
Per-server queue of URLs
Parallel processing of requests to different hosts:
multi-threaded programming
asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead
Use of keep-alive to reduce connection overheads
General Architecture [Chakrabarti, 2003]
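The per-server queues with politeness delays can be sketched as follows; `PoliteFrontier` is an illustrative name, and the injectable `clock` is only there to make the logic testable without real waiting:

```python
import time
from collections import deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """Per-host URL queues enforcing a minimum delay between two
    requests to the same host."""
    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay, self.clock = delay, clock
        self.queues = {}   # host -> deque of URLs waiting to be fetched
        self.next_ok = {}  # host -> earliest time of the next request
    def push(self, url):
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)
    def pop(self):
        """Return a URL whose host is ready, or None if all hosts
        are still in their politeness delay."""
        now = self.clock()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0) <= now:
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None
```

Threads (or an asynchronous event loop) then pop URLs from the frontier, so that requests to different hosts proceed in parallel while each host is still queried politely.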
Refreshing URLs
Content on the Web changes
Different change rates:
  online newspaper main page: every hour or so
  published article: virtually no change
Continuous crawling, and identification of change rates for adaptive crawling: how to know the time of last modification of a Web page?
Estimating the Freshness of a Page
1. Check HTTP timestamp.
2. Check content timestamp.
3. Compare a hash of the page with a stored hash.
4. Non-significant differences (ads, fortunes, request timestamp):
  only hash text content, or “useful” text content;
  compare distribution of n-grams (shingling);
  or even compute edit distance with previous version.
Adapting strategy to each different archived website?
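A minimal sketch of the shingling idea: compare the sets of word n-grams of two versions with a Jaccard coefficient, so that non-significant differences barely affect the score:

```python
def shingles(text, n=3):
    """Set of word n-grams (shingles) of a text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a or b else 1.0
```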
Crawling Modern Web Sites
Some modern Web sites only work when cookies are activated (session cookies), or when JavaScript code is interpreted
Regular Web crawlers (wget, Heritrix, Apache Nutch) usually don't do cookie management and don't interpret JavaScript code
Crawling some Web sites therefore requires more advanced tools
Advanced crawling tools
Web scraping frameworks such as scrapy (Python) or WWW::Mechanize (Perl) simulate Web browser interaction and cookie management (but no JS interpretation)
Headless browsers such as htmlunit simulate a Web browser, including simple JavaScript processing
Browser instrumentors such as Selenium allow full instrumentation of a regular Web browser (Chrome, Firefox, Internet Explorer)
OXPath: a full-fledged navigation and extraction language for complex Web sites [Sellers et al., 2011] Demo
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Templated Web Site
Many Web sites (especially Web forums and blogs) use one of a few content management systems (CMSs)
Web sites that use the same CMS will be similarly structured, present a similar layout, etc.
Information is somewhat structured in CMSs: publication date, author, tags, forums, threads, etc.
Some structure differences may exist when Web sites use different versions, or different themes, of a CMS
Crawling CMS-Based Web Sites
Traditional crawling approaches crawl Web sites independently of the nature of the sites and of their CMS. When the CMS is known:
  Potential for much more efficient crawling strategies (avoid pages with redundant information, uninformative pages, etc.)
  Potential for automatic extraction of structured content
Two ways of approaching the problem:
  Have a handcrafted knowledge base of known CMSs, their characteristics, how to crawl them and extract information [Faheem and Senellart, 2013b,a] (AAH) Demo
  Automatically infer the best way to crawl a given CMS [Faheem and Senellart, 2014] (ACE)
Need to be robust w.r.t. template change
Detecting CMSs
One main challenge in intelligent crawling and content extraction is to identify the CMS, and then apply the best crawling strategy accordingly
Detecting CMSs using:
1. URL patterns,
2. HTTP metadata,
3. textual content,
4. XPath patterns, etc.
These can be manually described (AAH), or automatically inferred (ACE)
For instance, the vBulletin Web forum content management system can be identified by searching for a reference to a vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
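A sketch of fingerprint-based CMS detection on raw HTML; the fingerprint table here is an illustrative assumption, not the actual AAH knowledge base, and a regular expression stands in for the XPath test:

```python
import re

# Hypothetical fingerprints mapping a CMS name to a content pattern
FINGERPRINTS = {
    "vBulletin": re.compile(r'src="[^"]*vbulletin_global\.js"'),
    "WordPress": re.compile(r'/wp-content/'),
}

def detect_cms(html):
    """Return the names of CMSs whose fingerprint appears in the page."""
    return [name for name, pat in FINGERPRINTS.items() if pat.search(html)]
```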
Crawling http://www.rockamring-blog.de/ [Faheem and Senellart, 2014]
[Figure: number of distinct 2-grams retrieved (×1,000) as a function of the number of HTTP requests (up to 6,000), comparing ACE, AAH, and wget.]
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Most popular Web sites (Alexa):
1. google.com  2. facebook.com  3. youtube.com  4. yahoo.com  5. baidu.com  6. wikipedia.org  7. live.com  8. twitter.com  9. qq.com  10. amazon.com  11. blogspot.com  12. linkedin.com  13. google.co.in  14. taobao.com  15. sina.com.cn  16. yahoo.co.jp  17. msn.com  18. wordpress.com  19. google.com.hk  20. t.co  21. google.de  22. ebay.com  23. google.co.jp  24. googleusercontent.com  25. google.co.uk  26. yandex.ru  27. 163.com  28. weibo.com
Social networking sites
Sites with social networking features (friends, user-shared content, user profiles, etc.)
Social data on the Web
Huge numbers of users (2012):
Facebook 900 million
QQ 540 million
W. Live 330 million
Weibo 310 million
Google+ 170 million
Twitter 140 million
LinkedIn 100 million
Huge volume of shared data:
250 million tweets per day on Twitter (3,000 per second on average!)...
...including statements by heads of state, revelations of political activists, etc.
Crawling Social Networks
Theoretically possible to crawl social networking sites using a regular Web crawler
Sometimes not possible: https://www.facebook.com/robots.txt
Often very inefficient, considering politeness constraints
Better solution: use the provided social networking APIs
  https://dev.twitter.com/docs/api/1.1
  https://developers.facebook.com/docs/graph-api/reference/v2.1/
  https://developer.linkedin.com/apis
  https://developers.google.com/youtube/v3/
Also possible to buy access to the data, directly from the social network or from brokers such as http://gnip.com/
Social Networking APIs
Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content
Usually a RESTful API, occasionally SOAP-based
Usually requires a token identifying the application using the API, sometimes a cryptographic signature as well
May access the API as an authenticated user of the social network, or as an external party
APIs seriously limit the rate of requests: https://dev.twitter.com/docs/api/1.1/get/search/tweets
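Client-side rate limiting can be sketched with a sliding-window counter; `RateLimiter` is an illustrative helper (the actual limits and windows come from each API's documentation), and the injectable `clock` only makes the logic testable without waiting:

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.calls = deque()  # timestamps of recent calls
    def acquire(self):
        """True if a call may proceed now, False if over the limit."""
        now = self.clock()
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()  # forget calls outside the window
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```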
REST
Mode of interaction with a Web service
Follows the KISS (Keep It Simple, Stupid) principle
Each request to the service is a simple HTTP GET method
Base URL is the URL of the service
Parameters of the service are sent as HTTP parameters (in the URL)
HTTP response code indicates success or failure
Response contains structured output, usually as JSON or XML
No side effect; each request is independent of previous ones
Example: http://graph.facebook.com:80/?ids=7901103
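A sketch of such a REST call: the request is a plain URL with HTTP parameters, and the response is structured JSON output. The response body below is canned and purely illustrative of the general shape; the real service's schema may differ:

```python
import json
from urllib.parse import urlencode

# Building the request URL: base URL of the service + HTTP parameters
url = "http://graph.facebook.com/" + "?" + urlencode({"ids": "7901103"})

# Canned response body, standing in for what the service would return
body = '{"7901103": {"id": "7901103", "name": "Example Page"}}'
data = json.loads(body)  # structured output, ready to be exploited
```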
The Case of Twitter
Two main APIs:
  REST APIs, including search, getting information about a user, a list, followers, etc. https://dev.twitter.com/docs/api/1.1
  Streaming API, providing real-time results
Very limited history available
Search can be on keywords, language, geolocation (for a small portion of tweets)
Cross-Network Crawling
Often useful to combine results from different social networks
Numerous libraries facilitating SN API access (twipy, Facebook4J, FourSquare VP C++ API...), incompatible with each other... Some efforts at generic APIs (OneAll, APIBlender [Gouriten et al., 2014]) Demo
Example use case: no API to get all check-ins from FourSquare, but a number of check-ins are available on Twitter; given results of Twitter Search/Streaming, use the FourSquare API to get information about check-in locations.
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
The Deep Web
Definition (Deep Web, Hidden Web, Invisible Web)
All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.
Size estimate: 500 times more content than on the surface Web! [BrightPlanet, 2000]. Hundreds of thousands of deep Web databases [Chang et al., 2004]
Sources of the Deep Web
Example
Yellow Pages and other directories;
Library catalogs;
Weather services;
US Census Bureau data;
etc.
Discovering Knowledge from the Deep Web [Nayak et al., 2012]
Content of the deep Web is hidden from classical Web search engines (they just follow links)
But very valuable and high quality!
Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed via the deep Web
How to benefit from this information?
How to analyze, extract, and model this information?
Focus here: automatic, unsupervised methods, for a given domain of interest
Extensional Approach
[Diagram: services are discovered on the WWW, their content is siphoned by submitting forms and indexed; data found in the index bootstraps further siphoning.]
Notes on the Extensional Approach
Main issues:
  Discovering services
  Choosing appropriate data to submit to forms
  Use of data found in result pages to bootstrap the siphoning process
  Ensuring good coverage of the database
Approach favored by Google, used in production [Madhavan et al., 2006]
Not always feasible (huge load on Web servers)
Intensional Approach
[Diagram: forms discovered on the WWW are probed and analyzed; each form is wrapped as a Web service that can then be queried.]
Notes on the Intensional Approach
More ambitious [Chang et al., 2005, Senellart et al., 2008]
Main issues:
  Discovering services
  Understanding the structure and semantics of a form
  Understanding the structure and semantics of result pages
  Semantic analysis of the service as a whole
  Query rewriting using the services
No significant load imposed on Web servers
Outline
The World Wide Web
Acquiring Various Forms of Web ContentRegular Web ContentCMS-based Web ContentSocial Networking SitesThe Deep WebThe Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
The Semantic Web
A Web in which the resources are semantically described: annotations give information about a page, explain an expression in a page, etc.
More precisely, a resource is anything that can be referred to by a URI:
  a Web page, identified by a URL
  a fragment of an XML document, identified by an element node of the document
  a Web service
  a thing, an object, a concept, a property, etc.
Semantic annotations: logical assertions that relate resources tosome terms in associated ontologies
Ontologies
Formal descriptions providing human users a shared understanding of a given domain
A controlled vocabulary
Formally defined so that it can also be processed by machines
Logical semantics that enables reasoning
Reasoning is the key for different important tasks of Web data management, in particular:
  to answer queries (over possibly distributed data)
  to relate objects in different data sources, enabling their integration
  to detect inconsistencies or redundancies
  to refine queries with too many answers, or to relax queries with no answer
Where Do Ontologies Come From?
Manually crafted to represent the knowledge of a specific domain (e.g., life sciences)
Exported from classical Web databases
Through information extraction from the Web, Wikipedia, etc. (e.g., DBpedia, YAGO)
Private to a company, or public
Some ontologies focus on instances, others on a schema (see further)
Value of the Semantic Web: bits of one ontology can be re-used in another, and ontologies can be mapped through owl:sameAs links
[Figure: Linking Open Data cloud diagram, as of September 2011, by Richard Cyganiak and Anja Jentzsch (http://lod-cloud.net/); hundreds of interlinked datasets grouped into the domains media, geographic, publications, government, cross-domain, life sciences, and user-generated content.]
Classes and class hierarchy
Backbone of the ontology
AcademicStaff is a Class (a class will be interpreted as a set of objects)
AcademicStaff isa Staff (isa is interpreted as set inclusion)
Example hierarchy:
  FacultyComponent
    Course
      MathCourse: Probabilities, Algebra, Logic
      CSCourse: DB, AI, Java
    Student
      UndergraduateStudent, MasterStudent, PhDStudent
    Department
      PhysicsDept, MathsDept, CSDept
    Staff
      AcademicStaff: Lecturer, Researcher, Professor
      AdministrativeStaff
Relations
Declaration of relations with their signature
(Relations will be interpreted as binary relations between objects)
TeachesIn(AcademicStaff, Course)
  if one states that “X TeachesIn Y”, then X belongs to AcademicStaff and Y to Course
TeachesTo(AcademicStaff, Student)
Leads(Staff, Department)
Instances
Classes have instances
Dupond is an instance of the class Professor
corresponds to the fact: Professor(Dupond)
Relations also have instances
(Dupond,CS101) is an instance of the relation TeachesIn
corresponds to the fact: TeachesIn(Dupond,CS101)
The instance statements can be seen as (and stored in) a database
Ontology = schema + instance
Schema (TBox)
  The set of class and relation names
  The signatures of relations, and also constraints
  The constraints are used for two purposes:
    – checking data consistency (like dependencies in databases)
    – inferring new facts
Instance (ABox)
  The set of facts
  The set of base facts together with the inferred facts should satisfy the constraints
Ontology (i.e., Knowledge Base) = Schema + Instance
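A toy sketch of this schema/instance separation, using names from the previous slides: relation signatures and the isa hierarchy of the TBox let us infer new unary facts from the base facts of the ABox (a deliberately minimal saturation, far from a full reasoner):

```python
# Toy schema (TBox): isa hierarchy and relation signatures
isa = {"Professor": "AcademicStaff", "AcademicStaff": "Staff",
       "MathCourse": "Course"}
signatures = {"TeachesIn": ("AcademicStaff", "Course")}

def superclasses(cls):
    """A class together with all its ancestors in the isa hierarchy."""
    out = []
    while cls is not None:
        out.append(cls)
        cls = isa.get(cls)
    return out

def infer(facts):
    """Saturate a set of facts: binary facts are (Relation, x, y),
    unary facts are (Class, x)."""
    inferred = set(facts)
    for rel, x, y in [f for f in facts if len(f) == 3]:
        domain, rng = signatures[rel]
        for c in superclasses(domain):  # x is in the domain classes
            inferred.add((c, x))
        for c in superclasses(rng):     # y is in the range classes
            inferred.add((c, y))
    for f in list(inferred):            # close unary facts under isa
        if len(f) == 2:
            for c in superclasses(f[0]):
                inferred.add((c, f[1]))
    return inferred
```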
Where can Semantic Content be Found?
In the linked data, through Web-available RDF data:
  dumps of an entire ontology, in one of the RDF serialization formats (RDF/XML, Turtle, N-Triples)
  crawlable RDF content, with small fragments pointing to other fragments
  a SPARQL endpoint
  HTML annotated with RDFa, cf. http://www.w3.org/TR/rdfa-syntax/
Other popular semantic content embedded in Web pages: microformats (hCard, vCard, etc.), microdata (cf. http://www.schemas.org/). Not directly in the spirit of the Semantic Web, but heavily used.
RDF content used internally in a company
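As an example of consuming such a dump, a minimal N-Triples line parser; this is a sketch handling only URI and plain-literal objects (real N-Triples also has blank nodes, typed and language-tagged literals):

```python
import re

# <subject> <predicate> (<object-uri> | "literal") .
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.')

def parse_ntriples(text):
    """Return (subject, predicate, object) tuples from N-Triples lines."""
    triples = []
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, o_uri, o_lit = m.groups()
            triples.append((s, p, o_uri if o_uri is not None else o_lit))
    return triples
```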
How to Acquire Semantic Content?
Much easier to exploit, as it is already semantically described
Individual resources (dumps, SPARQL endpoints) that have beenidentified as valuable can be directly exploited
RDFa content, microformats, microdata, can be discovered from regular Web crawls
Not perfect! There are errors, lies, etc.
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired InformationInformation ExtractionGraph MiningOpinion Mining
Opportunities for Market Insights
Information Extraction
See parts “Instance Extraction” and “Fact Extraction” from my colleague Fabian Suchanek's lecture: http://suchanek.name/work/teaching/IE2010a.pdf
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired InformationInformation ExtractionGraph MiningOpinion Mining
Opportunities for Market Insights
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries
encyclopedias
scientific publications
social networks
Google’s PageRank [Brin and Page, 1998]
Idea
Important pages are pages pointed to by important pages.

g_{ij} = 0 if there is no link from page i to page j;
g_{ij} = 1/n_i otherwise, with n_i the number of outgoing links of page i.

Definition (Tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future:

pr(i) = ( lim_{k→+∞} (G^T)^k v )_i

where v is some initial column vector.
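The limit above can be computed by power iteration; a minimal sketch assuming every page has at least one outgoing link (dangling pages and damping come later):

```python
def pagerank(links, iterations=100):
    """links: page -> list of successor pages. Plain power iteration
    of the transition matrix G^T, starting from a uniform vector."""
    n = len(links)
    pr = {u: 1.0 / n for u in links}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in links}
        for u, succs in links.items():
            share = pr[u] / len(succs)  # g_uv = 1/n_u for each successor
            for v in succs:
                nxt[v] += share
        pr = nxt
    return pr
```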
Illustrating PageRank Computation
[Figure: step-by-step PageRank computation on a 10-vertex example graph; starting from a uniform score of 0.100 per page, the iteration converges after about a dozen steps to scores of approximately 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, and 0.019.]
PageRank With Damping
May not always converge, or convergence may not be unique. To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).

pr(i) = ( lim_{k→+∞} ((1 − d) G^T + d U)^k v )_i

where U is the matrix with all values equal to 1/N, N being the number of vertices.
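A sketch of the damped variant, following the slide's notation (jump to a uniformly chosen page with probability d, so 1 − d is the damping factor); it still assumes every page has at least one outgoing link:

```python
def pagerank_damped(links, d=0.15, iterations=100):
    """links: page -> list of successor pages. Power iteration of
    (1 - d) G^T + d U, with U the uniform-jump matrix."""
    n = len(links)
    pr = {u: 1.0 / n for u in links}
    for _ in range(iterations):
        nxt = {u: d / n for u in links}  # the d*U part: uniform jump
        for u, succs in links.items():
            for v in succs:              # the (1 - d)*G^T part
                nxt[v] += (1 - d) * pr[u] / len(succs)
        pr = nxt
    return pr
```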
Using PageRank to Score Search Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages, combined with some query-dependent scoring function:

final(q, d) = score(q, d) · pr(d)

PageRank is only useful in directed graphs! It is proportional to the degree otherwise.
HITS [Kleinberg, 1999]
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

Let G′ be the adjacency matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with a and h vectors of norm 1):

a := G′^T h / ‖G′^T h‖
h := G′ a / ‖G′ a‖

Converges under some technical assumptions to authority and hub scores.
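The iteration can be sketched directly on an edge set (a naive O(|V|·|E|) version per step, fine for the small subgraphs HITS is run on):

```python
import math

def hits(edges, nodes, iterations=50):
    """edges: set of (u, v) pairs of the subgraph G'.
    Returns (authority, hub) score dictionaries."""
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority of n: sum of hub scores of pages pointing to n
        a = {n: sum(h[u] for (u, v) in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(x * x for x in a.values())) or 1.0
        a = {n: x / norm for n, x in a.items()}
        # hub of n: sum of authority scores of pages n points to
        h = {n: sum(a[v] for (u, v) in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(x * x for x in h.values())) or 1.0
        h = {n: x / norm for n, x in h.items()}
    return a, h
```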
Using HITS to Order Web Query Results
1. Retrieve the set D of Web pages matching a keyword query.
2. Retrieve the set D′ of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3. Build from D′ the corresponding subgraph G′ of the Web graph.
4. Compute iteratively hub and authority scores.
5. Sort documents from D by authority score.
Less efficient than PageRank, because the scores are local (computed per query).
Discovery of communities
Classical problem in social networks: identifying communities of users (or of content) using the graph structure
Two subproblems:
1. Given some initial vertex or vertex set, finding the corresponding community
2. Given the graph as a whole, finding a partition into communities
Maximum Flow / Minimum Cut
[Figure: a small flow network with edge capacities between a source and a sink; a maximum flow saturates a minimum cut, which separates the source's community from the rest of the graph.]
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the rest of the graph

Complexity: O(n²m) (n: vertices, m: edges)
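As a sketch of this separation step, here is a plain Edmonds-Karp augmenting-path algorithm (the slides cite the more efficient Goldberg-Tarjan push-relabel method); the dictionary encoding of capacities is an illustrative assumption:

```python
from collections import deque

def min_cut_community(capacity, source, sink):
    """capacity: dict {(u, v): c} of directed edge capacities.
    Returns the source side of a minimum cut, i.e. the community
    containing the seed vertex `source`."""
    nodes = {u for u, _ in capacity} | {v for _, v in capacity}
    res = dict(capacity)  # residual capacities
    for u, v in capacity:
        res.setdefault((v, u), 0)

    def bfs_path():
        # shortest augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if u == sink:
                return parent
            for w in nodes:
                if w not in parent and res.get((u, w), 0) > 0:
                    parent[w] = u
                    queue.append(w)
        return None

    while True:  # Edmonds-Karp: augment until no path remains
        parent = bfs_path()
        if parent is None:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[(u, w)] for u, w in path)
        for u, w in path:
            res[(u, w)] -= bottleneck
            res[(w, u)] += bottleneck

    # vertices still reachable from the seed form the source side of the cut
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for w in nodes:
            if w not in reachable and res.get((u, w), 0) > 0:
                reachable.add(w)
                queue.append(w)
    return reachable
```

Once the flow is maximal, the saturated edges form the minimum cut, and residual reachability from the seed delimits its community.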
Markov Cluster Algorithm (MCL) [van Dongen, 2000]
Graph clustering algorithm
Also based on flow simulation, over the whole graph
Iteration of a matrix computation alternating:
Expansion (matrix multiplication, corresponding to flow propagation)
Inflation (non-linear operation to increase heterogeneity)

Complexity: O(n³) for an exact computation, O(n) for an approximate one
[van Dongen, 2000]
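The expansion/inflation alternation can be sketched as follows. This is an illustration, not van Dongen's implementation: the adjacency-matrix encoding, inflation value, iteration count, and the cluster read-out convention (assigning each vertex to its strongest attractor row) are all assumptions:

```python
def mcl(adj, inflation=2.0, iterations=30):
    """adj: symmetric 0/1 adjacency matrix as a list of lists.
    Returns a partition of the vertices, as a list of sets."""
    n = len(adj)

    def normalize(mat):  # make each column sum to 1 (column-stochastic)
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat

    # add self-loops (as is usual for MCL) and normalize
    m = normalize([[adj[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # expansion: squaring the matrix propagates flow
        p = [[sum(m[i][k] * m[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
        # inflation: raising entries to a power favours strong currents
        m = normalize([[p[i][j] ** inflation for j in range(n)]
                       for i in range(n)])
    # assign each vertex to its attractor (row with the largest value)
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, set()).add(j)
    return list(clusters.values())
```

On the classical example of two triangles joined by a single edge, the process converges to one attractor per triangle, splitting the graph into its two natural communities.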
Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]
Top-down graph clustering algorithm
Betweenness of an edge: number of shortest paths between two arbitrary vertices going through this edge
General principle:
1. Compute the betweenness of each edge in the graph
2. Remove the edge with the highest betweenness
3. Redo the whole process, betweenness computation included

Complexity: O(n³) for a sparse graph
[Newman and Girvan, 2004]
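The general principle can be sketched with a Brandes-style betweenness computation. This is an illustration, not the authors' implementation; the dict-of-neighbour-sets graph encoding is an assumption, and only a single deletion step is shown:

```python
from collections import deque

def edge_betweenness(adj):
    """adj: dict mapping each vertex to the set of its neighbours
    (undirected graph). Returns {edge: betweenness}, with each edge
    a sorted tuple; accumulates dependencies over per-source BFS trees."""
    bet = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate path dependencies from the leaves
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = tuple(sorted((v, w)))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    # each path was counted from both of its endpoints
    return {e: b / 2 for e, b in bet.items()}

def girvan_newman_step(adj):
    """One step of the top-down algorithm: remove the edge with the
    highest betweenness, then recompute before the next step."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return (u, v)
```

On two triangles joined by a bridge, the bridge carries all nine inter-triangle shortest paths, so it is removed first and the two communities fall apart.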
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Information Extraction
Graph Mining
Opinion Mining
Opportunities for Market Insights
Opinion Mining
See my colleague Chloé Clavel's lecture: http://pierre.senellart.com/enseignement/2013-2014/inf344/10-opinion-mining.pdf
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Opportunities for Market Insights
Opportunities for Market Insights
Crawl a competitor's Web site, apply a wrapper to extract structured information, regularly refresh this crawl ⇒ a local database of a competitor's products and prices, ready to be analyzed

Crawl Web forums, blogs, social networking sites for opinions about a brand, and mine the obtained social network ⇒ identify opinion leaders, and target them for marketing

Exploit deep Web forms to crawl all patents pertaining to a particular topic, perform instance extraction to identify all molecules cited in the patents, use linked open data ontologies to connect these molecules to known metabolic pathways ⇒ get more insight into which biological phenomena are targeted by competitors' inventions
Bibliography I

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Rome, Italy, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.

BrightPlanet. The deep Web: Surfacing hidden value. White Paper, July 2000.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, USA, 2003.

Bibliography II

Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):1623–1640, 1999.

Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured databases on the Web: Observations and implications. SIGMOD Record, 33(3):61–70, September 2004.

Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the Web. In Proc. CIDR, Asilomar, USA, January 2005.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. In Proc. VLDB, Cairo, Egypt, September 2000.

Bibliography III

Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.

Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.

Muhammad Faheem and Pierre Senellart. Adaptive crawling driven by structure-based link classification, July 2014. Preprint available at http://pierre.senellart.com/publications/faheem2015adaptive.pdf.

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Bibliography IV

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc. Hypertext, Santiago, Chile, September 2014. Douglas Engelbart Best Paper Award.

Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

Martijn Koster. A standard for robot exclusion. http://www.robotstxt.org/orig.html, June 1994.

Jayant Madhavan, Alon Y. Halevy, Shirley Cohen, Xin Dong, Shawn R. Jeffery, David Ko, and Cong Yu. Structured data meets the Web: A few observations. IEEE Data Engineering Bulletin, 29(4):19–26, December 2006.

Bibliography V

Richi Nayak, Pierre Senellart, Fabian M. Suchanek, and Aparna Varde. Discovering interesting information with advances in Web technology. SIGKDD Explorations, 14(2), December 2012.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.

Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Exploring the Web with OXPath. In LWDM, 2011.

Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

Bibliography VI

Pierre Senellart, Avin Mittal, Daniel Muschick, Rémi Gilleron, and Marc Tommasi. Automatic wrapper induction from hidden-Web sources with domain knowledge. In Proc. WIDM, pages 9–16, Napa, USA, October 2008.

sitemaps.org. Sitemaps XML format. http://www.sitemaps.org/protocol.php, February 2008.

Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.