The promises of web scraping: Mining the web for relational data about artists
ResTo study day, 14 April 2016
Guillaume Cabanac
@gcabanac
Web Scraping
1. Why?
2. How?
3. Case studies
4. Why not?
Web Scraping: Why?
Purpose: fetch the Impact Factor of indexed journals in Computer Science
Web Scraping: How?
Source: http://www.dartlang.org/docs/tutorials/connect-dart-html/
HTML Page Structure: the Document Object Model
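A minimal sketch of how a scraper can walk a page's element tree to pull values out of a table, using only Python's standard `html.parser` module. The page fragment, the journal names, and the numbers are made up for illustration, and `TableScraper` and `scrape_impact_factors` are hypothetical names, not the tooling used in the talk.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: journal names and values are made up.
SAMPLE_HTML = """
<html><body>
  <table id="journals">
    <tr><th>Journal</th><th>Impact Factor</th></tr>
    <tr><td>Journal A</td><td>1.23</td></tr>
    <tr><td>Journal B</td><td>4.56</td></tr>
  </table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows of data cells
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None


def scrape_impact_factors(html):
    """Return a {journal name: impact factor} mapping from a two-column table."""
    parser = TableScraper()
    parser.feed(html)
    return {name.strip(): float(value) for name, value in parser.rows}


impact_factors = scrape_impact_factors(SAMPLE_HTML)
```

On a real site the same idea applies after downloading the page, but the tag names and table layout would have to be read off the actual page structure.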
Study 1: scientists and workaholism
Sunday! Even on bank holidays in many countries!
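A rough sketch of the kind of check behind such an observation, assuming each scraped record carries a date in ISO format. The dates below are invented for illustration and are not data from the study; `weekday_counts` and `weekend_share` are hypothetical helper names.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical dates for illustration only; not data from the study.
SAMPLE_DATES = ["2016-04-14", "2016-04-16", "2016-04-17", "2016-04-17"]


def weekday_counts(iso_dates):
    """Count how many dates fall on each day of the week."""
    return Counter(DAY_NAMES[date.fromisoformat(d).weekday()] for d in iso_dates)


def weekend_share(iso_dates):
    """Fraction of dates that fall on a Saturday or Sunday."""
    counts = weekday_counts(iso_dates)
    weekend = counts["Saturday"] + counts["Sunday"]
    return weekend / len(iso_dates)


counts = weekday_counts(SAMPLE_DATES)
share = weekend_share(SAMPLE_DATES)
```

Spotting bank holidays would additionally need a per-country holiday calendar, which this sketch leaves out.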
Study 2: networks of references via Google Scholar
Study 3: The world of arts – work in progress …
Result
Web Scraping: Why not?
• Scraping is usually forbidden
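One machine-readable signal a scraper can consult before fetching a page is the site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the rules, the `my-bot` user agent, the example.org URLs, and the `is_allowed` helper are all assumptions for illustration. A site's terms of use may still prohibit scraping even where robots.txt allows a path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it
# from the target site before requesting any other page.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


about_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.org/about")
search_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.org/search?q=x")
```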
• … but things are changing, at least in the UK
http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust
• Data quality issues, especially on Google Scholar?