Crawling the Web
Fabrizio Celli
Rome, 25th September 2014

SemaGrow demonstrator: “Web Crawler + AgroTagger”




Outline

• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
  – What’s next?


Purpose of this Webinar

• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission

• Algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance
• http://www.semagrow.eu
• One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar


The demonstrator

• It is based on two command line applications (no user interface):
  – Web Crawler
  – AgroTagger
• Goal:
  – discover resources on the Web
  – tag resources with AGROVOC URIs
  – filter only resources about agriculture and interlink to AGRIS


What we expect from the Webinar

• Comments, suggestions, opinions
• Other real case scenarios for the demonstrator
• You can send your feedback to [email protected]


THE WEB-CRAWLER


Apache Nutch

• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs


How it works

• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
  – Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!
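As a rough illustration of the depth parameter, the following Python sketch walks a small in-memory link graph breadth-first and stops following links once a page is the given number of hops away from the ROOT. It is not how Apache Nutch is implemented; the `crawl` function and the `LINKS` graph are hypothetical names used only for this example.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

def crawl(root, depth, links):
    """Return {url: hops from root} for every URL within `depth` hops."""
    seen = {root: 0}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if seen[url] == depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = seen[url] + 1
                queue.append(target)
    return seen

discovered = crawl("ROOT", 3, LINKS)
print(sorted(discovered.items(), key=lambda item: item[1]))
```

With depth = 3, the page four hops away from the ROOT is never visited.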


Example: depth = 3

[Figure: tree of crawled links. ROOT (URL) at the top; depth = 1: URL_1_1, URL_1_2, …, URL_1_n; depth = 2: URL_2_2_1, …, URL_2_2_m; depth = 3: URL_3_2_1_1, …, URL_3_2_1_p]


The application

• https://github.com/agrisfao/agrotagger/tree/master/crawler/application

• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
  – depth = 5
  – output directory = work/output
  – directory with source URLs = work/urls

crawler_exec.sh 5 work/output work/urls


The output

URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
  outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
  outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
  outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
  outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
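A listing like this can be read back with a few lines of Python. The sketch below assumes one entry per line, with each `URL::` line followed by zero or more `outlink: toUrl: … anchor: …` lines; that layout is inferred from the slide and may not match the actual file exactly. `parse_crawler_output` is a hypothetical helper, not part of the crawler.

```python
def parse_crawler_output(text):
    """Map each 'URL::' entry to its list of (target_url, anchor_text) outlinks."""
    pages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("URL::"):
            current = line[len("URL::"):].strip()
            pages.setdefault(current, [])
        elif line.startswith("outlink: toUrl:") and current is not None:
            rest = line[len("outlink: toUrl:"):].strip()
            # The anchor text may be empty, so split on the "anchor:" label.
            target, _, anchor = rest.partition(" anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages

SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
  outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

pages = parse_crawler_output(SAMPLE)
for url, outlinks in pages.items():
    print(url, len(outlinks))
```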


THE AGROTAGGER


AGROVOC

• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc


The AgroTagger

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources at a set of URLs
• Or rather… to extract AGROVOC URIs
• It is based on MAUI


MAUI

• Maui is named after the Polynesian mythological hero and demigod, who would transform himself into different kinds of birds to perform many of his exploits

• Maui automatically identifies main topics in text documents

• It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)

• https://code.google.com/p/maui-indexer


How it works

• Input:
  – A text file with a list of URLs
  – The output file of an Apache Nutch crawler
• Output:
  – A set of triples
    <URL> dcterms:subject <AGROVOC_URI>
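To make the shape of these triples concrete, here is a minimal Python sketch that writes them in N-Triples syntax, one possible RDF serialization. `dcterms:subject` expands to `http://purl.org/dc/terms/subject`; the `to_ntriples` helper and the example URIs are placeholders invented for illustration, not actual AgroTagger output.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, concept_uris):
    """Return one N-Triples line per (url, dcterms:subject, concept) statement."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in concept_uris]

# Placeholder URIs for illustration only; real AGROVOC URIs differ.
lines = to_ntriples(
    "http://example.org/page.html",
    ["http://example.org/agrovoc/c_1", "http://example.org/agrovoc/c_2"],
)
print("\n".join(lines))
```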


The algorithm

• For each URL in the input file
  – Download the resource
  – Run the MAUI indexer trained with AGROVOC
  – Create a set of triples
• Multi-threaded
• Currently, MAUI is trained only for English
  – It can be trained in other languages that use Latin characters
  – Other solutions are needed for Chinese, Arabic, Russian, etc.
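The loop above could be sketched in Python as follows. This is only an outline of the control flow under stated assumptions: the real AgroTagger is a Java application, and `tag_urls`, `fake_fetch` and `fake_extract` are hypothetical stand-ins for the download step and the MAUI indexer, which are passed in as plain functions.

```python
from concurrent.futures import ThreadPoolExecutor

DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def tag_urls(urls, fetch, extract_uris, workers=4):
    """Download and tag each URL on a thread pool; return a list of triples."""
    def process(url):
        try:
            text = fetch(url)
        except Exception:
            return []  # skip resources that cannot be downloaded
        return [(url, DCTERMS_SUBJECT, uri) for uri in extract_uris(text)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the URLs.
        per_url = list(pool.map(process, urls))
    return [triple for triples in per_url for triple in triples]

# Stand-ins for the real download step and the MAUI indexer (both hypothetical).
FAKE_WEB = {"http://example.org/a": "maize and rice yields"}
FAKE_VOCAB = {"maize": "http://example.org/agrovoc/c_maize",
              "rice": "http://example.org/agrovoc/c_rice"}

def fake_fetch(url):
    return FAKE_WEB[url]  # raises KeyError for unknown URLs

def fake_extract(text):
    return [uri for word, uri in FAKE_VOCAB.items() if word in text.split()]

triples = tag_urls(["http://example.org/a", "http://example.org/missing"],
                   fake_fetch, fake_extract)
print(triples)
```

Passing the fetch and tagging steps as arguments keeps the threading logic independent of any particular indexer or language model.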


The application

• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
  – directory with source files = work/source
  – output directory = work/output
  – type of source files = nutchOutput
  – output format = rdfnt

taggerDir.sh /work/source /work/output nutchOutput rdfnt


The output

[Figure: Input → AgroTagger → Output]


THE AGRIS USE CASE


AGRIS

• http://agris.fao.org
• A collection of more than 7.8 million bibliographic references in agriculture
• AGRIS records come with AGROVOC descriptors
• An RDF-aware system
  – the AGRIS database is publicly exposed as RDF
  – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)


SemaGrow demonstrator

• The core idea is to harvest the Web
  – Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
  – Store triples in the “crawler” database

• Definition of combinations between the “crawler” database and the AGRIS database

• New widget in AGRIS mashup pages!


Related resources available on the Web

• http://...
• https://...


Current status

• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations


What’s next

• Processing phase
• Discover meaningful combinations between the AGRIS core database and the “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?


Naïve Algorithm

• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
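One way to read this rule is sketched below in Python: an AGRIS record and a crawled URL are associated when they share at least N AGROVOC URIs. The `associations` function, the inverted-index approach and the toy identifiers are assumptions made for illustration; the slides do not describe how the demonstrator actually computes the combinations.

```python
from collections import Counter, defaultdict

def associations(agris, crawled, n):
    """Return {(record, url): shared} for pairs sharing at least n AGROVOC URIs."""
    # Inverted index: AGROVOC URI -> crawled URLs tagged with it.
    by_uri = defaultdict(set)
    for url, uris in crawled.items():
        for uri in uris:
            by_uri[uri].add(url)

    result = {}
    for record, uris in agris.items():
        shared = Counter()
        for uri in set(uris):
            for url in by_uri.get(uri, ()):
                shared[url] += 1
        for url, count in shared.items():
            if count >= n:
                result[(record, url)] = count
    return result

# Toy data with made-up identifiers; c1..c4 stand for AGROVOC URIs.
agris = {"record_A": {"c1", "c2", "c3"}, "record_B": {"c1", "c4"}}
crawled = {"http://example.org/x": {"c1", "c2", "c3", "c4"},
           "http://example.org/y": {"c1"}}
pairs = associations(agris, crawled, 2)
print(pairs)
```

Raising N makes the rule stricter, so the number of associations can only stay the same or shrink.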


Example

• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

Number of AGRIS records | N: common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
900 K | 3 | 17 MLN
900 K | 4 | 3.2 MLN
1 MLN | 5 | 0.6 MLN


Your feedback

• Comments, suggestions, other real case scenarios

• Ideas about the meaning of “meaningful combinations”

• If you test the application, any comments on how to improve it

• Can the demonstrator help to overcome data problems?

• You can send your feedback to [email protected]


谢谢

σας ευχαριστώ

Gracias