Metadata curation: hands-on session - CLARIN · 6/5/2018

Metadata curation: hands-on session

CMDI and Metadata Curation Task Forces

CLARIN Centre & Developers meeting 4-5 June 2018

Utrecht, The Netherlands

CLARIN 1

Prerequisites

• Java 1.8 - https://java.com/en/download/
• Internet connection


Menu

1. View your records in the VLO
2. View your harvest (and its log)
3. Get your records
   1. From the tarball
   2. Harvest them
4. Curation module
   1. Look at the website
   2. Run locally
5. CMDI best practices
   1. Check your profiles
   2. Check your records
6. Structural queries
   1. Load your records/validation reports into BaseX
   2. Some useful XQueries
7. Inspect the mapping
8. VLO
9. Fixing problems, but where?
10. What’s missing?


View your records in the VLO

• Filter the records based on your endpoint:
  - _oaiEndpointURI:
    • https://vlo.clarin.eu/search?q=_oaiEndpointURI:https://clarin-pl.eu/oai/request
  - Endpoints? centres.clarin.eu/oai_pmh
• Filter the records based on a profile:
  - _componentProfile:
    • https://vlo.clarin.eu/search?q=_componentProfile:LINDAT_CLARIN
    • Note: use the profile name instead of its ID!
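
The two filter URLs above follow one pattern: the VLO search URL, a field name, a colon, and a value. A minimal shell sketch of that pattern is given here. The function name vlo_filter_url is hypothetical (it is not part of any CLARIN tool), and the value is passed through without URL encoding, as in the examples on the slide.

```shell
#!/bin/sh
# Hypothetical helper (not part of any CLARIN tool): build a VLO search URL
# that filters on a single field, following the URL pattern shown above.
vlo_filter_url() {
    field="$1"   # e.g. _oaiEndpointURI or _componentProfile
    value="$2"   # endpoint URL, or profile name (not the profile ID)
    printf 'https://vlo.clarin.eu/search?q=%s:%s\n' "$field" "$value"
}

# Usage: print the filter URLs for one endpoint and one profile name.
vlo_filter_url _oaiEndpointURI https://clarin-pl.eu/oai/request
vlo_filter_url _componentProfile LINDAT_CLARIN
```

Values containing spaces or other reserved characters would probably need URL encoding before being passed in.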


View your harvest (and its log)

• Not in production yet, but local preview
  - will replace https://vlo.clarin.eu/data/
• Paged lists
• Filter on endpoints and/or records
• See the log of a harvest


Get your records

1. From the tarball
   1. https://vlo.clarin.eu/data/resultsets/
   2. tar xjf clarin.tar.bz2 results/cmdi/DANS_CMDI_Provider
   3. Note: just clicking the tarball might freeze your Mac!
2. Harvest them
   1. https://github.com/clarin-eric/oai-harvest-manager/releases
   2. Edit providers section of resources/config-test.xml
   3. run-harvester.sh workdir=`pwd` resources/config-test.xml
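
After unpacking or harvesting, a quick sanity check is to count the records that arrived per provider. The sketch below assumes that each record is a file ending in .xml somewhere below results/cmdi/<provider>. That matches the paths used in these slides but is not guaranteed for every setup. The helper name count_records is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the XML files below one provider directory.
# Assumption: every harvested record is a file whose name ends in .xml.
count_records() {
    dir="$1"
    find "$dir" -type f -name '*.xml' | wc -l | tr -d ' '
}

# Usage: report the count for each provider directory that exists.
for provider in results/cmdi/*/; do
    [ -d "$provider" ] || continue   # skip when the glob matched nothing
    printf '%s\t%s\n' "$provider" "$(count_records "$provider")"
done
```

A count of 0 for your own endpoint is a hint to look at the harvest log before going on to the curation steps.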


Curation module

1. Look at the website
   1. https://clarin.oeaw.ac.at/curate/
2. Run locally
   1. https://github.com/clarin-eric/clarin-curation-module
   2. curation.jar (goo.gl/Cx4h3N)
   3. Create your own specific copy of config.properties
   4. java -jar curation.jar -config config.properties -c -path results/cmdi/ARCHE


CMDI best practices

1. https://www.clarin.eu/content/cmdi-best-practice-guide
2. Schematron rules (schematron.com)
   1. https://github.com/TheLanguageArchive/SchemAnon/releases
   2. Also supported by oXygen or other XML editors
   3. Also easy to define your own rules
3. Check your profiles
   1. Identify the profiles you’re using
      1. https://github.com/clarin-eric/FindProfiles/releases
      2. java -jar findProfiles.jar -e=xml clarin/results/cmdi/The_Language_Archive/
   2. wget -O profile.xml https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1505397653795/xml && java -jar SchemAnon.jar https://raw.githubusercontent.com/clarin-eric/cmdi-toolkit/develop/src/main/resources/toolkit/sch/cmd-component-best-practices.sch profile.xml
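
When FindProfiles reports several profile IDs, the download step can be repeated per ID. The sketch below only derives the Component Registry URL from a profile ID, reusing the URL pattern of the wget command above. The helper name profile_xml_url is hypothetical. The commented loop assumes a file profiles.txt with one profile ID per line, which you would have to create yourself.

```shell
#!/bin/sh
# Hypothetical helper: turn a profile ID into the Component Registry URL
# of its XML, following the pattern used in the wget command above.
profile_xml_url() {
    printf 'https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/%s/xml\n' "$1"
}

# Usage: print the URL for the profile ID used on the slide.
profile_xml_url clarin.eu:cr1:p_1505397653795

# Possible use (assumes profiles.txt holds one profile ID per line):
# while read -r id; do
#     wget -O "$id.xml" "$(profile_xml_url "$id")"
# done < profiles.txt
```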


CMDI best practices

4. Check the records
   1. java -jar SchemAnon.jar https://raw.githubusercontent.com/clarin-eric/cmdi-toolkit/develop/src/main/resources/toolkit/sch/cmd-record-best-practices.sch clarin/results/cmdi/The_Language_Archive/ xml
   2. Note: use the -s option to save the SVRL report
5. Validate the records
   1. https://github.com/clarin-eric/cmdi-instance-validator/releases
   2. cmdi-validator results/cmdi/IMS_Repository/
   3. Note: use the -s option to use another Schematron file


Structural queries

1. Load your records/validation reports into BaseX
   1. basex.org or brew install basex
   2. Create a new database and import your records/reports
2. XQuery (w3.org/XML/Query)
   declare namespace cmd="http://www.clarin.eu/cmd/1";
   declare namespace svrl="http://purl.oclc.org/dsdl/svrl";
   …
   - goo.gl/CEZtTm

Notes
1. You can use the namespace wildcard (*:element) to deal with (many) profile specific namespaces
2. You can use base-uri() to get the filename of a matching record
3. BaseX has useful modules, but also FunctX (xqueryfunctions.com)
4. A problem that occurs often might be a candidate for a Schematron rule


Inspect the mapping

1. Identify the profiles you’re using
   1. https://github.com/clarin-eric/FindProfiles/releases
   2. java -jar findProfiles.jar -e=xml clarin/results/cmdi/The_Language_Archive/
2. Inspect the mapping
   1. https://github.com/clarin-eric/VLO-mapping
   2. https://cmdi.clarin.eu/mapping/


VLO

1. Curation VLO (to be updated)
   1. https://vlo.minerva.arz.oeaw.ac.at/vlo
2. Request an import in the beta VLO
   1. vlo@clarin.eu
3. Do a local VLO import
   1. https://gitlab.com/CLARIN-ERIC/compose_vlo#run-the-importer-to-ingest-cmdi-metadata-into-the-vlo
4. Run the importer on one record
   1. https://github.com/clarin-eric/VLO/blob/master/vlo-importer/src/main/java/eu/clarin/cmdi/vlo/importer/MetadataMapper.java
      • CLASSPATH="vlo-importer-4.2-SNAPSHOT-importer.jar" java eu.clarin.cmdi.vlo.importer.MetadataMapper -c VloConfig.xml -r test.xml


Fixing problems, but where?

• Your records
  - Typos in your records
  - Inconsistencies in your records
    • Consider adopting a common (CLARIN/CLAVAS) vocabulary
  - Facet mapping problems
    • Can you fix them in your profile(s)?
    • Or provide feedback to the Metadata Curation TF (cmdi@clarin.eu)
  - Value mapping problems
    • Provide feedback to the Metadata Curation TF (cmdi@clarin.eu)
• Others' records
  - report them via the VLO feedback button


What’s missing?

• OAI Viewer
  - history
  - include a local curation run
  - general log
  - mail to technical contact when the number of harvested records drops
• VLO importer
  - report showing the mappings applied
• VLO
  - _componentProfileURI
    • profile name might not be unique!
  - Center facet
    • filter to all records from one center, possibly multiple endpoints
  - show original value
• More?


Questions

• Metadata Curation Taskforces
  - tf-curation@lists.clarin.eu
• CMDI Taskforce
  - cmdi@clarin.eu
• CMDI first aid kit
  - clarin.eu/sites/default/files/CMDI-first-aid-kit.pdf
• Menzo Windhouwer
  - menzo.windhouwer@di.huc.knaw.nl
