The promises of web scraping: Mining the web for relational data about artists

Journée d’étude ResTo (ResTo study day), 14 April 2016

Guillaume Cabanac (@gcabanac)



Web Scraping

1. Why?

2. How?

3. Case studies

4. Why not?


Web Scraping: Why?

Purpose: fetch the Impact Factor of indexed journals in Computer Science


Web Scraping: How?

Source: http://www.dartlang.org/docs/tutorials/connect-dart-html/

HTML Page Structure: the Document Object Model
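As a rough illustration of how the nested structure of an HTML page can be exploited, here is a minimal Python sketch that pulls journal names and Impact Factors out of a table. It is not the tool used in the talk. The sample page, the journal names, the values, and the helper names `TableCellParser` and `scrape_impact_factors` are all invented for the example. It relies on the standard-library event parser, which follows the nesting of tags without building a full DOM tree.

```python
from html.parser import HTMLParser

# Invented sample page: the journal names and values are placeholders,
# not real Impact Factors.
SAMPLE_HTML = """
<html>
  <body>
    <table id="journals">
      <tr><th>Journal</th><th>Impact Factor</th></tr>
      <tr><td>Journal A</td><td>1.25</td></tr>
      <tr><td>Journal B</td><td>0.50</td></tr>
    </table>
  </body>
</html>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_impact_factors(html):
    """Return a {journal name: impact factor} dict from a two-column table."""
    parser = TableCellParser()
    parser.feed(html)
    return {name: float(value) for name, value in parser.rows}


impact_factors = scrape_impact_factors(SAMPLE_HTML)
print(impact_factors)
```

On a real site the page would first be downloaded, and the tag names and table layout would have to be adapted to the actual markup.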


Study 1: scientists and workaholism


Sunday! Even on bank holidays in many countries!


Study 2: networks of references via Google Scholar


Study 3: The world of arts – work in progress …


Result


Web Scraping: Why not?

Scraping is usually forbidden


… but things are changing, at least in the UK

http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust
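Whether a given site forbids scraping is mostly stated in its terms of use, which have to be read by a person. One machine-readable signal that can be checked automatically is the site's robots.txt file. The sketch below assumes that file has already been downloaded. The sample rules, the example.org URLs, and the helper name `is_allowed` are invented for illustration. A positive answer here does not by itself mean that scraping is legally permitted.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, standing in for a file fetched from a site.
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.org/artists/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.org/search"))
```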


Data quality issues, especially on Google Scholar?
