
Konrad cedem praesi


Page 1: Konrad cedem praesi

Assessment and Visualization of Metadata Quality for Open Government Data

Konrad Johannes Reiche*, Edzard Höfig, Ina Schieferdecker**, presented by Nikolay Tcholtchev**

[email protected]*, {firstname.lastname}@fokus.fraunhofer.de**

Page 2: Konrad cedem praesi
Page 3: Konrad cedem praesi

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/

Page 4: Konrad cedem praesi

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/

License

Page 5: Konrad cedem praesi

Government

Data Citizens

DOMAIN

Page 6: Konrad cedem praesi

Government

Data Citizens

DOMAIN

DESIGN

Repositories

Metadata: XML, JSON, RDF

Resources: PDF, XLS, CSV, DOC

Page 7: Konrad cedem praesi

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: Margaret Jarmon
Maintainer Email: [email protected]
Author: Office for National Statistics
Author Email: [email protected]
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: Spring 2014
Format: CSV

Page 8: Konrad cedem praesi

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: (empty)
Format: CSV

Page 9: Konrad cedem praesi

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: (empty)
Format: CSV

CSV

HTML

Page 10: Konrad cedem praesi

Quality. What could possibly go wrong?

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: (empty)
Format: CSV

CSV

Page 11: Konrad cedem praesi

Quality. What could possibly go wrong?

Metadata Record

Name: (empty)
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: (empty)
Author Email: (empty)
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: (empty)
Format: CSV

CSV

Page 12: Konrad cedem praesi

QUALITY LOSS

Information Loss, Reputation Loss

- Missing Fields
- Dead Links
- Inaccurate Information
- False Information
- Outdated Values
- Missing Information
- Bad Spelling
- Non-Schema Compliant

Bad Searchability, Unreliable, Untrustworthy

Page 13: Konrad cedem praesi

Meta·da·ta Qual·i·ty /ˈmɛtədeɪtə kwɒlɪti/

The fitness to describe the data (resources), supporting the task dimensions of finding, identifying, selecting and eventually obtaining the resources. The quality is inversely proportional to the uncertainty of the user about the actual data.

Page 14: Konrad cedem praesi

Assessing Metadata Quality is HARD

Highly Subjective

Metadata ? Resource

Wrong

1. Manual: Qualified Process, Principles + Guidelines. Postulated as being not feasible anymore due to the large number of metadata records.

2. Automated:
- Algorithms?
- Procedures?
- Oracle?
- Machine Learning?

Page 15: Konrad cedem praesi

Automated Quality Assessment

Empirical Analysis + Visual Aid
- Field Usage
- Field Values

Framework
- Based on Information Quality
- Three Dimensions:
  - Intrinsic
  - Relational / Contextual
  - Reputational
- Evaluation Criteria
  - Completeness
  - Accuracy
  - Provenance
  - Logical Consistency
  - Timeliness …

Page 16: Konrad cedem praesi

QUALITY METRICS

Page 17: Konrad cedem praesi

$q_m : \mathit{record}_t \longrightarrow V \in [0, 1]$

Measurement. Assigning a symbolic value to an object to enable the characterization of a certain attribute of that object.

Process P

Quality. Complex Attribute. No single measure. Highly Subjective. Use of Proxies.

Page 18: Konrad cedem praesi

Completeness. How many fields have been completed?

Record contains all the information required to have an ideal representation of the described resource.

Metadata Record

Name: uk-civil-service-high-earners
ID: 68addaac-59ae-4230-bb67-c5a8f6a76285
Maintainer: (empty)
Maintainer Email: (empty)
Author: Civil Service Capability Group
Author Email: [email protected]
License ID: uk-ogl

Resources

Size: 40959
Description: Civil Servants Salaries 2010
Format: CSV

Size: (empty)
Description: Civil Servants Salaries 2011
Format: CSV
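
The completeness idea can be sketched as the fraction of fields that carry a value. This is a minimal illustration, not the tool's actual code: the function names, the rule for what counts as "filled", the field list, and the placeholder e-mail address are all assumptions.

```python
from typing import Any, Mapping, Sequence


def is_filled(value: Any) -> bool:
    """Assumed rule: a field is completed if it is neither None nor empty/blank."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True


def completeness(record: Mapping[str, Any], fields: Sequence[str]) -> float:
    """Share of the expected fields that are filled, in [0, 1]."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return filled / len(fields)


# Record modelled on the slide; the e-mail value is a placeholder.
record = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "author@example.org",
    "license_id": "uk-ogl",
}
score = completeness(record, list(record))
print(f"completeness = {score:.2f}")
```

Five of the seven top-level fields are filled here, so the sketch reports 5/7. Resource-level fields such as Size could be scored the same way per resource.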

Page 19: Konrad cedem praesi

Weighted Completeness. Not all fields are equally relevant.

Each field is assigned a weight that expresses its relative importance.

Metadata Record

Name: uk-civil-service-high-earners
ID: 68addaac-59ae-4230-bb67-c5a8f6a76285
Maintainer: (empty)
Maintainer Email: (empty)
Author: Civil Service Capability Group
Author Email: [email protected]
License ID: uk-ogl

Resources

Size: 40959
Description: Civil Servants Salaries 2010
Format: CSV

Size: (empty)
Description: Civil Servants Salaries 2011
Format: CSV
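
A weighted variant can be sketched as the sum of the weights of the filled fields divided by the sum of all weights. The weights below are invented purely for illustration, and the function name and the "filled" rule are assumptions rather than the presented implementation.

```python
from typing import Any, Mapping


def weighted_completeness(record: Mapping[str, Any],
                          weights: Mapping[str, float]) -> float:
    """Weight of the filled fields divided by the total weight, in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    achieved = 0.0
    for field, weight in weights.items():
        value = record.get(field)
        # Assumed rule: None and blank strings count as missing.
        if value is not None and str(value).strip() != "":
            achieved += weight
    return achieved / total_weight


# Hypothetical weights: name and license matter most, contact details least.
weights = {
    "name": 3,
    "license_id": 3,
    "author": 2,
    "maintainer": 2,
    "maintainer_email": 1,
}
record = {
    "name": "uk-civil-service-high-earners",
    "license_id": "uk-ogl",
    "author": "Civil Service Capability Group",
    "maintainer": "",
    "maintainer_email": "",
}
score = weighted_completeness(record, weights)
print(f"weighted completeness = {score:.2f}")
```

With these weights the missing maintainer fields cost 3 of 11 weight units, so the score is 8/11, higher than the unweighted 3/5 for the same five fields.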

Page 20: Konrad cedem praesi

Accuracy. How accurately is the resource represented?

Semantic distance. The difference between the information a user can extract from the record and the information in the resource itself.

Metadata Record

Name: regional-household-income
ID: 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer: (empty)
Maintainer Email: (empty)
Author: Office for National Statistics
Author Email: (empty)
License ID: uk-ogl

Resources

URL: http://www.ons.gov.uk/ons/rhi13
Description: Spring 2013
Format: CSV

URL: http://www.ons.gov.uk/ons/rhi14
Description: (empty)
Format: CSV

CSV

HTML

Page 21: Konrad cedem praesi

Richness of Information. How much value is added?

$$q_i(\mathit{record}) = \frac{\sum_{i=1}^{n} I(\mathit{field}_i)}{n}$$

Vocabulary terms and descriptions should be meaningful. Information should be unique and not redundant.

Figure labels: $m$ (Number of Documents), $n$ (Number of Words)
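
The averaging in the richness formula can be sketched as follows. The slides do not define the information function I, so this example substitutes a simple stand-in: the normalized self-information of a field value across a corpus of records. That choice, the function names, and the sample corpus are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Mapping, Sequence


def value_information(value: str, counts: Counter, total: int) -> float:
    """Stand-in for I(field): normalized self-information, in [0, 1].

    A value shared by every record scores 0, a unique value scores 1.
    """
    if not value or total < 2 or counts[value] == 0:
        return 0.0
    probability = counts[value] / total
    return max(0.0, -math.log(probability) / math.log(total))


def richness(record: Mapping[str, str],
             corpus: Sequence[Mapping[str, str]],
             fields: Sequence[str]) -> float:
    """Mean information content over the given fields of one record."""
    if not fields:
        return 0.0
    total = len(corpus)
    score = 0.0
    for field in fields:
        counts = Counter(r.get(field, "") for r in corpus)
        score += value_information(record.get(field, ""), counts, total)
    return score / len(fields)


# Hypothetical corpus: names are unique, the license is identical everywhere.
corpus = [
    {"name": "regional-household-income", "license_id": "uk-ogl"},
    {"name": "uk-civil-service-high-earners", "license_id": "uk-ogl"},
    {"name": "dataset-c", "license_id": "uk-ogl"},
    {"name": "dataset-d", "license_id": "uk-ogl"},
]
score = richness(corpus[0], corpus, ["name", "license_id"])
print(f"richness = {score:.2f}")
```

The unique name contributes 1 and the redundant license contributes 0, which matches the slide's point that information should be unique and not redundant.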

Page 22: Konrad cedem praesi

Readability. How readable are the descriptions? Readable in terms of cognitive accessibility.

Flesch-Kincaid Reading Ease
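
As a rough sketch, the Flesch Reading Ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The syllable counter below is a crude vowel-group heuristic, and the clamping of the score to [0, 1] is an assumption about how such a score might be turned into a metric value; neither is taken from the presented implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def readability_metric(text: str) -> float:
    """Assumed normalization: clamp the score to 0..100 and scale to [0, 1]."""
    return min(max(flesch_reading_ease(text), 0.0), 100.0) / 100.0


simple = "The cat sat on the mat."
dense = ("Comprehensive intergovernmental expenditure categorization "
         "necessitates meticulous documentation.")
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Short words in short sentences score high, while long multi-syllable descriptions score low, which is the sense of cognitive accessibility the slide refers to.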

Page 23: Konrad cedem praesi

Availability. Are the links working?

Metadata only links to the resources. Without working links the actual data is not available.

The indicator for a resource is true if that resource is reachable through its URL.
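
A minimal sketch of the availability check: the score is the share of resource URLs that respond. The function names are hypothetical, and treating any 2xx or 3xx answer to a HEAD request as "reachable" is an assumption; the checker is injectable so the example below runs without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Assumed rule: reachable means a HEAD request returns a 2xx/3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def availability(urls: Sequence[str],
                 check: Callable[[str], bool] = is_reachable) -> float:
    """Fraction of resource URLs for which the check succeeds, in [0, 1]."""
    if not urls:
        return 0.0
    return sum(1 for url in urls if check(url)) / len(urls)


# Offline demonstration with a fake checker: pretend only rhi13 still exists.
urls = ["http://www.ons.gov.uk/ons/rhi13", "http://www.ons.gov.uk/ons/rhi14"]
score = availability(urls, check=lambda url: url.endswith("rhi13"))
print(f"availability = {score:.2f}")
```

In a real run the default `is_reachable` would be used; some servers reject HEAD requests, so a fallback to GET may be needed.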

Page 24: Konrad cedem praesi

Implementation.

Metadata Census

Page 25: Konrad cedem praesi

REQUIREMENTS

Functional
- Metadata Harvester
- Schemaless Data Store
- Quality Metrics
- Visualization
- Leaderboard

Non-functional
- Scalability
- Extensibility

Page 26: Konrad cedem praesi

DESIGN.

Class diagram:

Repository
+ url : String
+ name : String
+ type : Symbol

Snapshot
+ date : Date

MetaMetadata
+ metadata_record : Hash
+ score : Float

Further members shown in the diagram:
+ statistics : Hash
+ completeness : Hash
+ weighted_completeness : Hash
+ richness_of_information : Hash
...
+ latitude : String
+ longitude : String
+ best_record() : MetaMetadata
+ worst_record() : MetaMetadata
+ score() : Float

Multiplicities: 0..*, 1..*

Page 27: Konrad cedem praesi

Class diagram:

<<Interface>> Metric
+ compute(record)

Metrics: CompletenessMetric, WeightedCompleteness, OpennessMetric

MetricWorker
+ perform(snapshot, metric)

Workers: GenericMetricWorker, CompletenessMetricWorker

Relations: three <<use>> dependencies
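
The metric and worker design can be sketched as below. Only the names `Metric`, `compute(record)`, `MetricWorker`, `perform(snapshot, metric)`, and `CompletenessMetric` come from the diagram; the Python rendering, the treatment of a snapshot as a plain list of records, and the returned list of scores are assumptions.

```python
from typing import Any, List, Mapping, Protocol, Sequence

Record = Mapping[str, Any]


class Metric(Protocol):
    """Interface from the diagram: a metric scores a single record."""

    def compute(self, record: Record) -> float: ...


class CompletenessMetric:
    """Share of the configured fields that carry a non-blank value."""

    def __init__(self, fields: Sequence[str]) -> None:
        self.fields = list(fields)

    def compute(self, record: Record) -> float:
        if not self.fields:
            return 0.0
        filled = sum(
            1 for name in self.fields
            if record.get(name) not in (None, "", [], {})
        )
        return filled / len(self.fields)


class MetricWorker:
    """Applies one metric to every record of a snapshot."""

    def perform(self, snapshot: Sequence[Record], metric: Metric) -> List[float]:
        # Assumption: a snapshot is simply the list of harvested records.
        return [metric.compute(record) for record in snapshot]


snapshot = [
    {"name": "regional-household-income", "author": "Office for National Statistics"},
    {"name": "uk-civil-service-high-earners", "author": ""},
]
scores = MetricWorker().perform(snapshot, CompletenessMetric(["name", "author"]))
print(scores)
```

Because the worker only depends on the `compute` method, a new metric can be added without touching the worker, which is the extensibility the requirements slide asks for.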

Page 28: Konrad cedem praesi

Architecture diagram: Metadata Harvester; JSON Archives; API Requests; Records (JSON)

Page 29: Konrad cedem praesi

Architecture diagram:

Metadata Harvester: JSON Archives; API Requests; Records (JSON)

Metadata Census: Dump Importer; Preliminary Analyzer; Database

Edges: Imports; Persist

Page 30: Konrad cedem praesi

Architecture diagram:

Metadata Harvester: JSON Archives; API Requests; Records (JSON)

Metadata Census: Dump Importer; Preliminary Analyzer; Database; Metric Processor; Scheduler; Analyzer

Edges: Imports; Persist; Query; Records

Page 31: Konrad cedem praesi

Architecture diagram:

Metadata Harvester: JSON Archives; API Requests; Records (JSON)

Metadata Census: Dump Importer; Preliminary Analyzer; Database; Metric Processor; Scheduler; Analyzer; View

User

Edges: Imports; Persist; Query; Records; Generates; Investigates

Page 32: Konrad cedem praesi
Page 33: Konrad cedem praesi
Page 34: Konrad cedem praesi

Open Government Data.

Evaluation

Page 35: Konrad cedem praesi

Implementation focused exclusively on CKAN repositories.

Page 36: Konrad cedem praesi

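| Rank | Repository | Score | Misspelling | Richness of Information | Openness | Completeness | Availability | Weighted Completeness | Readability | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | data.gc.ca | 74 | 97 | 86 | 80 | 79 | 79 | 81 | 71 | 20 |
| 2 | data.sa.gov.au | 71 | 98 | 63 | 94 | 77 | 86 | 82 | 72 | 0 |
| 3 | GovData.de | 67 | 99 | 4 | 38 | 55 | 81 | 87 | 79 | 56 |
| 4 | data.qld.gov.au | 66 | 99 | 67 | 96 | 73 | 60 | 78 | 59 | 0 |
| 4 | PublicData.eu | 66 | 98 | 84 | 69 | 64 | 70 | 67 | 42 | 32 |
| 4 | data.gov.uk | 66 | 97 | 85 | 69 | 62 | 74 | 67 | 44 | 28 |
| 4 | africaopendata.org | 66 | 100 | 20 | 78 | 70 | 87 | 68 | 55 | 53 |
| 5 | datos.codeandomexico.org | 65 | 100 | 55 | 84 | 65 | 100 | 75 | 37 | 0 |
| 6 | catalogodatos.gub.uy | 63 | 100 | 64 | 1 | 70 | 74 | 78 | 65 | 52 |
| 6 | data.openpolice.ru | 63 | 100 | 0 | 0 | 58 | 100 | 81 | 100 | 64 |
| 7 | dados.gov.br | 61 | 100 | 87 | 36 | 53 | 57 | 72 | 44 | 39 |
| 8 | opendata.admin.ch | 59 | 100 | 12 | 0 | 58 | 100 | 68 | 35 | 100 |
| 9 | data.gv.at | 57 | 100 | 21 | 99 | 51 | 68 | 65 | 59 | 0 |
| 10 | data.gov.sk | 49 | 100 | 51 | 0 | 48 | 92 | 58 | 37 | 7 |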

Page 37: Konrad cedem praesi

Conclusion

Page 38: Konrad cedem praesi

What is good about this approach?

Metadata quality is quantified, with each quality aspect measured on its own. The metric scores are then aggregated to make repositories comparable.

Every additional quality metric is supposed to complete the quality puzzle.

Automated — Generic — Quantifiable — Repeatable
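
The aggregation step can be sketched as an unweighted mean of the individual metric scores. Whether the tool uses exactly this rule is an assumption; the slides only say that scores are aggregated. The example values are the eight metric scores listed for data.gc.ca in the ranking table, and the function name is hypothetical.

```python
from typing import Mapping


def aggregate_score(metric_scores: Mapping[str, float]) -> float:
    """Assumed aggregation: plain arithmetic mean of all metric scores."""
    if not metric_scores:
        return 0.0
    return sum(metric_scores.values()) / len(metric_scores)


# Metric scores of data.gc.ca as given in the ranking table (percent).
data_gc_ca = {
    "misspelling": 97,
    "richness_of_information": 86,
    "openness": 80,
    "completeness": 79,
    "availability": 79,
    "weighted_completeness": 81,
    "readability": 71,
    "accuracy": 20,
}
score = aggregate_score(data_gc_ca)
print(round(score))
```

For this row the mean is 74.125, which rounds to the listed overall score of 74. A weighted mean would be a natural alternative if some metrics should count more than others.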

Page 39: Konrad cedem praesi

The platform has the advantage that it acts as a beacon...

If your metadata goes bad, everyone will see it.

Page 40: Konrad cedem praesi

What is not so good about this approach?

- Too few quality metrics
- No empirical analysis beforehand
- Overvalues problems with the metadata

More quality metrics are necessary. Current metrics need to consider more special cases in the metadata records.

Page 41: Konrad cedem praesi

Final Thought. Do not aim for excellence, aim for low-quality metadata.

Page 42: Konrad cedem praesi
Page 43: Konrad cedem praesi

Quality Feed. Monitor metadata changes live and record changes in a timeline.

Repository Support. Other repository software also offers public APIs, Socrata being the most prominent.

More Quality Metrics
- Duplicate Detection
- Discoverability
- Coherence
- Advancement
- Reputation

Page 44: Konrad cedem praesi

Metadata Revision System. Store only the change set between snapshots instead of whole snapshots.

Domain-Specific Language. Make it even easier to add individual quality metrics.
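
The change-set idea behind the metadata revision system could look roughly like this: compare two versions of a record, keep only what was added, removed, or changed, and rebuild the newer version on demand. The function names and the flat dictionary structure are assumptions; nested resources would need a recursive variant.

```python
from typing import Any, Dict, Mapping


def change_set(old: Mapping[str, Any], new: Mapping[str, Any]) -> Dict[str, dict]:
    """Describe how `new` differs from `old` for a flat record."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


def apply_change_set(old: Mapping[str, Any], delta: Mapping[str, dict]) -> Dict[str, Any]:
    """Rebuild the newer record from the older one plus the change set."""
    result = {k: v for k, v in old.items() if k not in delta["removed"]}
    result.update(delta["added"])
    result.update({k: pair[1] for k, pair in delta["changed"].items()})
    return result


old = {"name": "regional-household-income", "maintainer": "Margaret Jarmon",
       "license_id": "uk-ogl"}
new = {"name": "regional-household-income", "license_id": "uk-ogl-2",
       "format": "CSV"}
delta = change_set(old, new)
print(delta)
```

Only three entries are stored for this update instead of a full copy of the record, and `apply_change_set(old, delta)` reproduces the new version.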

Page 45: Konrad cedem praesi

DEMO

metadata-census.com