Let the Music Live / que viva la música: Techniques for Managed Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories
Peter M. Broadwell (@peterbroadwell, broadwell@library.ucla.edu) and Martin Klein (@mart1nkle1n, [email protected]), #OR2016


Integration of a Unique Multimedia Collection

into Public Linked Open Data Repositories

@peterbroadwell, @mart1nkle1n – #OR2016

Peter M. Broadwell

@peterbroadwell

[email protected]

Martin Klein

@mart1nkle1n

[email protected]

Let the Music Live / que viva la música

Techniques for Managed Integration of a Unique Multimedia Collection into Public Linked Open Data Repositories


The collection

http://frontera.library.ucla.edu


The collection

• 116,000 songs digitized and made available as audio files to date, out of an estimated 160,000 in total
• Originally recorded from 1905 to the 1990s on ~2,000 commercial record labels
• Storage footprint of streaming MP3s: 460 GB

Format                          Number of songs
33 RPM (1955-1990)              14,741
45 RPM (1955-1990)              51,220
78 RPM (1905-1955)              33,191
Cassette tape (1955-1990)       7,879
Reel-to-reel tape (1955-1990)   368

• ~300,000 album images (covers and media)


The collection

• 13,752 unique artists or groups on album covers
• 7,035 unique names from album sleeves
• 24,221 unique composers
• 2,000-2,500 labeled song types/genres

Record label    # of songs
Victor          8,591
Columbia        8,196
Ideal           4,819
Falcon          4,532
Peerless        3,336
Bego            2,411
Vocalion        2,164
Del Valle       2,145

Song type          # of songs
ranchera           21,947
bolero             10,522
corrido            7,393
canción            5,410
polka              4,742
canción ranchera   2,736
cumbia             2,055
vals               1,399


The collection

• ~700 unique song tags/keywords (prior to translation)
• All songs tagged with 1-20 keywords (avg ~4.5)


Chris Strachwitz


Arhoolie Records


Supporters of the collection


Research using the Frontera collection as a primary source


A “multimedia encyclopedia”


More metadata, more problems

• No authority values employed for person and group names; “name hacking” used to approximate uniqueness
• Relationship between song, album, and “release” is not consistent
• Authority data for song entities is better: matrix numbers and catalog numbers are available
• Collection is entirely “siloed” on its current site, largely due to its homegrown metadata scheme


Goal: incorporate Frontera into the broader semantic web

• Adopt metadata structures of open online music encyclopedias (MusicBrainz)
• Use unique IDs from linked open data knowledge bases to identify people, groups, companies, songs, albums, etc.
• Adopting IDs from external LOD sites lets us link out to these related records
• When records are missing from external LOD knowledge bases, add them to those sites automatically
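To illustrate how adopted external IDs could become outbound links, here is a minimal sketch. The record layout, the field names, and the sample identifiers are all hypothetical. The URL templates follow the public entity-page addresses of each site as commonly seen, and they are assumptions that should be verified before use.

```python
# URL templates for public entity pages (assumed patterns; verify before use).
LOD_URL_TEMPLATES = {
    "musicbrainz_artist": "https://musicbrainz.org/artist/{id}",
    "wikidata": "https://www.wikidata.org/wiki/{id}",
    "discogs_artist": "https://www.discogs.com/artist/{id}",
    "viaf": "https://viaf.org/viaf/{id}",
}


def link_out(record):
    """Return {source: url} for every external ID stored on a local record."""
    links = {}
    for source, ext_id in record.get("external_ids", {}).items():
        template = LOD_URL_TEMPLATES.get(source)
        # Skip unknown sources and IDs that have not been matched yet.
        if template and ext_id:
            links[source] = template.format(id=ext_id)
    return links


# Hypothetical local artist record; the IDs are placeholders, not real entries.
artist = {
    "local_id": "frontera-artist-0001",
    "name": "Example Artist",
    "external_ids": {
        "musicbrainz_artist": "00000000-0000-0000-0000-000000000000",
        "wikidata": "Q0",
        "viaf": None,  # no match found yet
    },
}

links = link_out(artist)
```

Storing only the external ID and deriving the URL at display time keeps the local records stable if a site changes its page addresses.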


LOD records and relations



Inspiration: Linked Jazz, NYPL Labs’ ECCO, LD4L


LOD integration: phase 1

Initial metadata cleaning and preparation

• Identify likely unique entities (names, etc.) via “fuzzy matching,” e.g., MD5 hash comparisons
• Challenge: finding methods that scale to >100,000 rows (many approaches must be scripted)
• May necessitate creation of Yet Another Database
• Generate audio fingerprints of music files
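One way the hash-comparison idea could work is sketched below. It is an assumption about the approach, not the project's actual script: each name is normalized (accents stripped, case folded, punctuation dropped, tokens sorted) and the MD5 digest of the normalized form serves as a grouping key. The sample names are illustrative only. Note that this catches formatting variants, not misspellings, since two strings must normalize identically to share a hash.

```python
import hashlib
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and sort the tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def name_key(name):
    """MD5 digest of the normalized name, used as a grouping key."""
    return hashlib.md5(normalize_name(name).encode("utf-8")).hexdigest()


def group_variants(names):
    """Return lists of names that collapse to the same key (likely one entity)."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [variants for variants in groups.values() if len(variants) > 1]


# Illustrative input only; not taken from the collection's database.
sample_names = [
    "Lydia Mendoza",
    "Mendoza, Lydia",
    "LYDIA MENDOZA",
    "Los Alegres de Terán",
    "Los Alegres de Teran",
    "Flaco Jiménez",
]

variant_groups = group_variants(sample_names)
```

Because the key is computed in a single pass per row, the approach scales linearly and suits tables with more than 100,000 names.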


LOD integration: phase 2

Discovery and linking of existing records

• Entity lookup in LOD knowledge bases
• Audio fingerprint lookups in AcoustID database, which links to MusicBrainz
• Search for artist, group, and composer names in service APIs (note: these work better with English than Spanish)
    • DBpedia Spotlight
    • MusicBrainz
    • Discogs
    • VIAF, LCNAF (worth a try)
• Combination of automated and crowd-sourced verification of links, integration into site
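The fingerprint lookup step might be sketched as follows. The code only builds the request URL for the AcoustID v2 lookup endpoint and extracts artist MusicBrainz IDs from an already-decoded JSON response; it sends nothing over the network. The client key, the fingerprint string, the score threshold, and the sample response are placeholders, and the response layout is an assumption based on a lookup with `meta=recordings`.

```python
from urllib.parse import urlencode

ACOUSTID_LOOKUP = "https://api.acoustid.org/v2/lookup"


def build_lookup_url(client_key, fingerprint, duration_seconds):
    """Build a lookup URL for one audio file's fingerprint and duration."""
    params = {
        "client": client_key,
        "meta": "recordings",
        "duration": int(duration_seconds),
        "fingerprint": fingerprint,
    }
    return ACOUSTID_LOOKUP + "?" + urlencode(params)


def artist_mbids(response, min_score=0.9):
    """Collect {artist MBID: name} from results at or above min_score."""
    found = {}
    if response.get("status") != "ok":
        return found
    for result in response.get("results", []):
        if result.get("score", 0.0) < min_score:
            continue  # too weak to treat as a positive identification
        for recording in result.get("recordings", []):
            for artist in recording.get("artists", []):
                found[artist["id"]] = artist["name"]
    return found


# Placeholder response; the IDs and names are invented for illustration.
sample_response = {
    "status": "ok",
    "results": [
        {"score": 0.95, "recordings": [
            {"id": "rec-1", "artists": [{"id": "art-1", "name": "Example Artist"}]}]},
        {"score": 0.40, "recordings": [
            {"id": "rec-2", "artists": [{"id": "art-2", "name": "Weak Match"}]}]},
    ],
}

url = build_lookup_url("CLIENT_KEY", "AQAAexample", 187.6)
artists = artist_mbids(sample_response)
```

The threshold of 0.9 is an arbitrary starting point; lower-scoring results would be candidates for the crowd-sourced verification mentioned above.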


LOD integration: phase 3

Contributing/creating new records

• Unsolicited bulk record generation may be seen as linked data spam and rejected (“notability” problem)
• Direct communication and participation in knowledge base’s community is the most promising approach
• Case study: discussion with MusicBrainz community
• Voting/editorial review system can be incompatible with bulk updates, but the community may be willing to accommodate
• Data records should be well formed and clean; upload methods must be tested and the upload coordinated with LOD admins
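As a rough illustration of what "well formed and clean" could mean in practice, the sketch below screens candidate records before any upload is attempted. The required fields and the duplicate rule are hypothetical choices, not requirements of MusicBrainz or any other knowledge base.

```python
# Hypothetical minimum fields for a candidate record.
REQUIRED_FIELDS = ("name", "type", "source_url")


def validate_records(records):
    """Split records into (clean, problems) before a coordinated upload."""
    clean, problems, seen = [], [], set()
    for index, record in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS
                   if not str(record.get(f) or "").strip()]
        # Case- and whitespace-insensitive key for spotting duplicates.
        key = " ".join(str(record.get("name") or "").lower().split())
        if missing:
            problems.append((index, "missing: " + ", ".join(missing)))
        elif key in seen:
            problems.append((index, "duplicate name"))
        else:
            seen.add(key)
            clean.append({f: str(record[f]).strip() for f in REQUIRED_FIELDS})
    return clean, problems


# Invented candidates for illustration.
candidates = [
    {"name": " Example Artist ", "type": "person", "source_url": "http://frontera.library.ucla.edu"},
    {"name": "example  artist", "type": "person", "source_url": "http://frontera.library.ucla.edu"},
    {"name": "Example Group", "type": "", "source_url": "http://frontera.library.ucla.edu"},
]

clean, problems = validate_records(candidates)
```

Records that land in `problems` would be held back for manual review rather than submitted.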


LOD integration: the “bot” option


LOD integration: crosswalks between repositories and records


Progress to date

• Used metadata cleaning approaches to identify most likely unique names in the DB
• Applied acoustic fingerprinting to all 116,000 audio files
    • matched 1,313 songs
    • following the AcoustID links to MusicBrainz positively identifies ~287 artists with their records in MusicBrainz (as well as Discogs and DBpedia)
• Ran DBpedia Spotlight on all artists and composer names, correlated matched entities with MusicBrainz, Wikidata IDs
• Searched for artist and composer names via MusicBrainz, Discogs, and VIAF APIs
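A name search against the MusicBrainz web service could look roughly like the sketch below. It builds the search URL and filters an already-decoded JSON response by the relevance score that MusicBrainz search results carry; it makes no network call. The score threshold and the sample response are placeholders. A real client should also send a descriptive User-Agent and respect the service's rate limits.

```python
from urllib.parse import urlencode

MB_ARTIST_SEARCH = "https://musicbrainz.org/ws/2/artist"


def build_artist_search_url(name, limit=5):
    """Build a quoted artist-name query; accents are kept and UTF-8 encoded."""
    # Escape characters that would break the quoted search phrase.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    params = {"query": f'artist:"{escaped}"', "fmt": "json", "limit": limit}
    return MB_ARTIST_SEARCH + "?" + urlencode(params)


def confident_matches(response, min_score=95):
    """Return (mbid, name, score) for candidates at or above min_score."""
    return [
        (a["id"], a["name"], int(a["score"]))
        for a in response.get("artists", [])
        if int(a.get("score", 0)) >= min_score
    ]


# Placeholder response; IDs and names are invented for illustration.
sample_response = {
    "artists": [
        {"id": "mbid-1", "name": "Trío Example", "score": 100},
        {"id": "mbid-2", "name": "Trio Exemplar", "score": 62},
    ]
}

search_url = build_artist_search_url("Trío Example")
matches = confident_matches(sample_response)
```

A high threshold trades recall for precision, which fits the observation that name searches are less reliable than fingerprint matches.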


Entity matching to LOD sites

Method                    Artists on label    Artists on sleeve   Composers
                          (out of 13,752)     (out of 7,035)      (out of 24,211)
Acoustic fingerprinting   287 (for all names)
DBpedia Spotlight         272                 27                  72
MusicBrainz lookup        620                 434                 1,151
Discogs search API        4,929               3,502               9,423
VIAF search API           3,707               3,057               8,889

*These are likely in order of decreasing accuracy!
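To put the counts above in proportion, the sketch below converts them to percentages of each column's total. The numbers are copied from the table; the dictionary keys are shorthand introduced here. These are raw hit rates, not verified matches, so the accuracy caveat still applies.

```python
# Column totals and per-method hit counts, copied from the table above.
TOTALS = {"label": 13752, "sleeve": 7035, "composers": 24211}

MATCHES = {
    "DBpedia Spotlight": {"label": 272, "sleeve": 27, "composers": 72},
    "MusicBrainz lookup": {"label": 620, "sleeve": 434, "composers": 1151},
    "Discogs search API": {"label": 4929, "sleeve": 3502, "composers": 9423},
    "VIAF search API": {"label": 3707, "sleeve": 3057, "composers": 8889},
}


def match_rates(matches, totals):
    """Return the percentage of names matched, per method and column."""
    return {
        method: {col: round(100.0 * hits / totals[col], 1)
                 for col, hits in cols.items()}
        for method, cols in matches.items()
    }


rates = match_rates(MATCHES, TOTALS)

for method, cols in rates.items():
    print(f"{method:20s} " + "  ".join(f"{col}: {pct}%" for col, pct in cols.items()))
```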


Concerns/next steps

• Scalable approaches for Q/A of data (new and old)
• Discoverability and usability for humans and machines (APIs)
• Repository integration: adopting a linked data model will help
• Trusted channels for upload to existing knowledge bases: design a formal model?
• Work with specialized sub-collections of knowledge bases (topics, regions)?
• Test DBpedia Spotlight w/Spanish data pack
• Does using links to existing LOD entries just reinforce inequality of artist exposure (“rich get richer”/LOD “echo chamber”)?


Thanks!

UCLA Digital Library

• Lisa McAulay

• Kristian Allen

• T-Kay Sangwand

• …everyone else (past and present)

Arhoolie Foundation

• Tom Diamant

• Chris Strachwitz (obviously)
