Automatic Preservation Watch Using Information Extraction on the Web
Luis Faria [email protected], KEEP SOLUTIONS, www.keep-solutions.com
Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho
iPRES 2013, Lisbon, September 2, 2013

Automatic Preservation Watch


DESCRIPTION

At the iPRES 2013 conference in Lisbon, Portugal, in September 2013, Luís Faria of KEEP SOLUTIONS LDA presented SCAPE work on monitoring digital repositories and Scout, the tool developed for that purpose. Scout is a web-based service that helps content holders monitor their digital repositories and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.


Page 1: Automatic Preservation Watch

Automatic Preservation Watch

Using Information Extraction on the Web

Luis Faria [email protected]

KEEP SOLUTIONS www.keep-solutions.com

Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho

iPRES 2013, Lisbon, September 2, 2013

Page 2: Automatic Preservation Watch

Why do we need monitoring?

Repository

• Format obsolescence

• Emerging technology

• Consumer trends

• New standards

• Organisation mission

• Bit rot

• Resource capability

• System availability

• Security breach

• Economical limitations

• Social and political factors

• Producer trends

• Organisation policies

Page 3: Automatic Preservation Watch

Why do we need monitoring?

Risks

Opportunities

Page 4: Automatic Preservation Watch

Survey on: Risk Assessment

[Pie chart: 60% and 40%; legend: "Yes, but manual and ad hoc", "None"]

Page 5: Automatic Preservation Watch

Scout: a preservation watch system

Monitors aspects of the world to detect preservation risks and opportunities

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

Page 6: Automatic Preservation Watch


Information Sources

• Format registries & software catalogues

• Digital repositories & web archives

• Organizational objectives

• Experiments

• Simulation

• Human knowledge

Page 7: Automatic Preservation Watch


Currently supported information sources

• PRONOM

• Repository content and events

• Web archive content

• Web archive renderability experiments

• SCAPE Policy model

Page 8: Automatic Preservation Watch


Define triggers

• Notify me when there are tools that can render the format X.

Page 9: Automatic Preservation Watch

Define triggers

Simple query with templates

Page 10: Automatic Preservation Watch


Receive notifications

Email

HTTP Push API

There are tools that can render format X.
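The trigger and notification flow on these slides can be pictured with a small sketch. This is not Scout's API: `Tool`, `Trigger`, `can_render` and the in-memory `outbox` are hypothetical names invented for illustration, and a plain list of tools stands in for the knowledge base.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Tool:
    """Hypothetical knowledge-base entry: a tool and the formats it renders."""
    name: str
    renders: FrozenSet[str]


@dataclass
class Trigger:
    """Hypothetical trigger: a condition over the knowledge base plus a notifier."""
    message: str
    condition: Callable[[List[Tool]], bool]
    notify: Callable[[str], None]

    def evaluate(self, tools: List[Tool]) -> bool:
        # Send the notification only when the condition holds.
        if self.condition(tools):
            self.notify(self.message)
            return True
        return False


def can_render(fmt: str) -> Callable[[List[Tool]], bool]:
    """Query template: true when at least one tool renders the given format."""
    return lambda tools: any(fmt in tool.renders for tool in tools)


# The outbox list stands in for an email or HTTP push channel.
outbox: List[str] = []
trigger = Trigger("There are tools that can render format X.",
                  can_render("X"), outbox.append)

# No tool renders X yet, so nothing is sent.
fired_before = trigger.evaluate([Tool("viewer-a", frozenset({"Y"}))])
# Once a tool that renders X appears, the trigger fires.
fired_after = trigger.evaluate([
    Tool("viewer-a", frozenset({"Y"})),
    Tool("viewer-b", frozenset({"X"})),
])
```

Separating the query template (`can_render`) from the delivery channel (`notify`) mirrors the split between defining a trigger and receiving a notification.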

Page 11: Automatic Preservation Watch

Automatic Watch Limitations

Machine-readable data

• Explicitly and formally specified information

• Controlled vocabulary

• Ontology

• All instances use same structure and set of values
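As a rough illustration of what "same structure and set of values" demands, the sketch below checks records against a fixed field list and a controlled vocabulary. The field names, the status vocabulary, and the sample records are all assumptions made up for this example; they do not come from Scout or any registry.

```python
from typing import Dict, List

# Hypothetical schema: every record must carry these fields ...
REQUIRED_FIELDS = {"title", "issn", "publisher", "status"}
# ... and "status" must come from this hypothetical controlled vocabulary.
STATUS_VOCABULARY = {"active", "ceased", "transferred"}


def validate(record: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the record is usable."""
    missing = sorted(REQUIRED_FIELDS - set(record))
    problems = [f"missing field: {name}" for name in missing]
    status = record.get("status")
    if status is not None and status not in STATUS_VOCABULARY:
        problems.append(f"status not in vocabulary: {status}")
    return problems


# Placeholder records, not real journal data.
good = {"title": "Example Journal", "issn": "placeholder",
        "publisher": "Example Press", "status": "active"}
bad = {"title": "Example Journal", "publisher": "Example Press",
       "status": "sold"}

good_problems = validate(good)
bad_problems = validate(bad)
```

A record that fails such a check cannot be queried reliably by an automatic watch, which is the limitation this slide points at.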

Page 12: Automatic Preservation Watch

Case study: e-Depot coverage

[Chart: "Publishers" and "Titles per publisher" plotted against "% of journal titles" (x-axis 40% to 100%, y-axis 0 to 600)]

Callout: 97% publishers, 1-10 titles

Page 13: Automatic Preservation Watch

e-journal coverage questions

• Which publisher provides which journal titles

• Publisher changes:

  • Ceases to provide journal

  • Transfers journal to other publisher(s)

  • Publishers merge

• Journal changes:

  • Name changes

  • ISSN changes

  • Ceased to exist

Page 14: Automatic Preservation Watch

Where is this information?

“In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK.”

“The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer.”

“Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet.”

On the publisher's website!

Page 15: Automatic Preservation Watch

Where is this information?

On the publisher's website!

Not machine readable!

Page 16: Automatic Preservation Watch

Information Extraction

• Extract structured information from unstructured data

• Pattern-based information extraction

• Some training and supervision may be needed

“[X] acquired [Y]”
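To make the pattern concrete, here is a minimal sketch that applies "[X] acquired [Y]" to one of the sentences quoted earlier. It is a naive regular-expression stand-in, not the extraction system used in the experiment: treating a run of capitalised words as an entity name is an assumption, and `NAME`, `ACQUIRED` and `extract_acquisition` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Naive stand-in for entity recognition: a run of capitalised words.
NAME = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"
ACQUIRED = re.compile(rf"\b(?P<x>{NAME}) acquired (?P<y>{NAME})")


def extract_acquisition(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return an (X, relation, Y) triple, or None if the pattern is absent."""
    match = ACQUIRED.search(sentence)
    if match is None:
        return None
    return (match.group("x"), "acquired", match.group("y"))


sentence = ("In 1991, two years before the merger with Reed, "
            "Elsevier acquired Pergamon Press in the UK.")
triple = extract_acquisition(sentence)

# A sentence without the pattern yields no triple.
no_triple = extract_acquisition(
    "Acta Chirurgica Iugoslavica is available free of charge "
    "as an Open Access journal on the Internet.")
```

A capitalisation heuristic like this breaks easily on real text, which is one reason some training and supervision may be needed.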

Page 17: Automatic Preservation Watch


Experiment

1. Data acquisition and pre-processing

2. Relation discovery

3. Information extraction

4. Validation of results


Page 18: Automatic Preservation Watch

1. Data acquisition and pre-processing

• Focused crawler with seed words (12,000 entries)

  • Publisher names

  • Journal titles

➡ 500,000 Web pages

• Pre-process with NLP tools

➡ 18 million sentences

➡ 8 GB

Page 19: Automatic Preservation Watch

2. Relation discovery

Prominent pattern                    Rank
[X] journal of [Y]                   1
[X] published by [Y]                 2
[X] journal on [Y]                   3
[X] journal published by [Y]         4
[X] available as [Y] journal         5
PubMed [X] [Y]                       9
[X] science proceedings of [Y]       25
[X] subscription available to [Y]    30
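One simple way such a ranked list could be produced is to count the text that connects known entity pairs. The slides do not say how the ranks were computed, so the frequency-based ranking here is an assumption, and `discover_patterns` and the toy sentences are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def discover_patterns(sentences: Iterable[str],
                      pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count the text between each known (X, Y) pair and rank by frequency."""
    pairs = list(pairs)
    counts: Counter = Counter()
    for sentence in sentences:
        for x, y in pairs:
            start = sentence.find(x)
            if start == -1:
                continue
            # Look for Y only after the end of X.
            end = sentence.find(y, start + len(x))
            if end == -1:
                continue
            between = sentence[start + len(x):end].strip()
            if between:
                counts[f"[X] {between} [Y]"] += 1
    return counts.most_common()


# Toy data with invented names, for illustration only.
toy_sentences = [
    "Alpha Review published by Acme Press",
    "Beta Letters published by Bolt House",
    "Gamma Notes journal of Acme Press",
]
toy_pairs = [
    ("Alpha Review", "Acme Press"),
    ("Beta Letters", "Bolt House"),
    ("Gamma Notes", "Acme Press"),
]
ranked = discover_patterns(toy_sentences, toy_pairs)
```

Patterns that recur across many pairs rise to the top, while one-off phrasings fall to the bottom of the list.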


Page 21: Automatic Preservation Watch

3. Information extraction

2,000 journal titles

500 journal-publisher attributions

Page 22: Automatic Preservation Watch

4. Validation of results

[Chart: Journal titles in e-Depot: 4%, 10%, 86%]

[Chart: Title-publisher in the Keepers registry: 15%, 50%, 35%]

Legend: Should add, Existing, False-positives

Page 23: Automatic Preservation Watch

False-positives

• Detecting boundaries of titles and publisher names

• Use of abbreviations in titles and publisher names

• Technical problems such as encoding

“European Journal of Nuclear Medicine and Molecular Imaging”

IAAE - “International Association of Agricultural Economists”

“├ó╦å┼buda University”
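The abbreviation problem can be reduced with a simple matching heuristic, sketched below under stated assumptions: an abbreviation is taken to be the initials of the full name with a few short function words skipped. `STOPWORDS`, `acronym` and `same_entity` are hypothetical helpers, not part of the experiment described here.

```python
# Hypothetical list of words that usually do not contribute an initial.
STOPWORDS = {"of", "and", "the", "for", "in", "on"}


def acronym(name: str) -> str:
    """Build initials from the words of a name, skipping stopwords."""
    return "".join(word[0].upper() for word in name.split()
                   if word.lower() not in STOPWORDS)


def same_entity(a: str, b: str) -> bool:
    """True if the names match exactly or one is the acronym of the other."""
    a, b = a.strip(), b.strip()
    if a.casefold() == b.casefold():
        return True
    return a == acronym(b) or b == acronym(a)


match = same_entity("IAAE",
                    "International Association of Agricultural Economists")
mismatch = same_entity("IAAE",
                       "European Journal of Nuclear Medicine and Molecular Imaging")
```

Initials can collide between unrelated organisations, so a match of this kind is better treated as a candidate for review than as a confirmed link.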

Page 24: Automatic Preservation Watch

Conclusions

• We need data to support digital preservation

• The data must be explicitly and formally specified for automation

• Registries tend to be incomplete and outdated

• Information extraction technologies can help

• Still, some supervision may be needed

Page 25: Automatic Preservation Watch

Send us your use cases!

Alan [email protected]

Luis [email protected]

Preservation Watch: What risks to monitor?

Information Extraction: What to extract from the web?

Page 26: Automatic Preservation Watch

Thank you, questions?

• Scout - a preservation watch system

  • Site: http://openplanets.github.io/scout/

  • Demo: http://scout.scape.keep.pt

• SCAPE Planning and Watch suite iPRES poster

  • http://bit.ly/scape-pw

• SCAPE

  • http://www.scape-project.eu