Integration of a Unique Multimedia Collection
into Public Linked Open Data Repositories
@peterbroadwell, @mart1nkle1n – #OR2016
Peter M. Broadwell
@peterbroadwell
Martin Klein
@mart1nkle1n
Let the Music Live/
que viva la música
Techniques for Managed Integration of a
Unique Multimedia Collection into Public
Linked Open Data Repositories
The collection
http://frontera.library.ucla.edu
The collection
• 116,000 songs digitized and made available as audio
files to date, out of an estimated 160,000 in total
• Originally recorded from 1905 to the 1990s on ~2,000
commercial record labels
• Storage footprint of streaming MP3s: 460 GB

Format                          Number of songs
33 RPM (1955-1990)              14,741
45 RPM (1955-1990)              51,220
78 RPM (1905-1955)              33,191
Cassette tape (1955-1990)       7,879
Reel-to-reel tape (1955-1990)   368

• ~300,000 album images (covers and media)
The collection
• 13,752 unique artists or groups on album covers
• 7,035 unique names from album sleeves
• 24,221 unique composers
• 2,000-2,500 labeled song types/genres

Record label    # of songs
Victor          8,591
Columbia        8,196
Ideal           4,819
Falcon          4,532
Peerless        3,336
Bego            2,411
Vocalion        2,164
Del Valle       2,145

Song type           # of songs
ranchera            21,947
bolero              10,522
corrido             7,393
canción             5,410
polka               4,742
canción ranchera    2,736
cumbia              2,055
vals                1,399
The collection
• ~700 unique song tags/keywords (prior to translation)
• All songs tagged with 1-20 keywords (avg ~4.5)
Chris Strachwitz
Arhoolie Records
Supporters of the collection
Research using the Frontera
collection as a primary source
A “multimedia encyclopedia”
More metadata, more problems
• No authority values employed for person and group
names; “name hacking” used to approximate
uniqueness
• Relationship between song, album, and “release” is not
consistent
• Authority data for song entities is better: matrix numbers
and catalog numbers are available
• Collection is entirely “siloed” on its current site, largely due
to its homegrown metadata scheme
Goal: incorporate Frontera into
the broader semantic web
• Adopt metadata structures of open online music
encyclopedias (MusicBrainz)
• Use unique IDs from linked open data knowledge
bases to identify people, groups, companies,
songs, albums, etc.
• Adopting IDs from external LOD sites lets us link
out to these related records
• When records are missing from external LOD
knowledge bases, add them to those sites
automatically
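To make the "adopt external IDs, then link out" idea concrete, here is a minimal sketch of a local artist record that stores identifiers from outside knowledge bases and turns them into public record URLs. The class, field names, and source keys are hypothetical and not taken from the Frontera site; the URL patterns follow the public page layouts of MusicBrainz, Wikidata, VIAF, and Discogs, and the IDs in the example are placeholders.

```python
from dataclasses import dataclass, field

# Public record-page URL patterns, keyed by a short (hypothetical) source name.
URL_PATTERNS = {
    "musicbrainz_artist": "https://musicbrainz.org/artist/{id}",
    "wikidata": "https://www.wikidata.org/wiki/{id}",
    "viaf": "https://viaf.org/viaf/{id}",
    "discogs_artist": "https://www.discogs.com/artist/{id}",
}


@dataclass
class ArtistRecord:
    """A local artist entry plus any external identifiers adopted for it."""
    local_id: int
    name: str
    external_ids: dict = field(default_factory=dict)

    def link_out(self) -> dict:
        """Return {source: URL} for every external ID with a known URL pattern."""
        return {
            source: URL_PATTERNS[source].format(id=ext_id)
            for source, ext_id in self.external_ids.items()
            if source in URL_PATTERNS
        }


# Placeholder identifiers, for illustration only.
example = ArtistRecord(
    local_id=1,
    name="Example Artist",
    external_ids={
        "musicbrainz_artist": "00000000-0000-0000-0000-000000000000",
        "viaf": "000000000",
    },
)
for source, url in example.link_out().items():
    print(source, url)
```

Storing only the identifier, rather than the full URL, keeps the local record stable if a knowledge base changes its page layout.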
LOD records and relations
Inspiration: Linked Jazz, NYPL
Labs’ ECCO, LD4L
LOD integration: phase 1
Initial metadata cleaning and preparation
• Identify likely unique entities (names, etc.) via “fuzzy
matching,” e.g., MD5 hash comparisons
• Challenge: finding methods that scale to >100,000 rows
(many approaches must be scripted)
• May necessitate creation of Yet Another Database
• Generate audio fingerprints of music files
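One way to read "fuzzy matching via MD5 hash comparisons" is as a key-collision approach: normalize each name (strip accents, case, punctuation, and word order), hash the normalized form, and treat names with the same hash as likely the same entity. The sketch below illustrates that reading only; the normalization rules are assumptions, not the project's actual pipeline, and the sample names are illustrative.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict


def name_key(name: str) -> str:
    """Return an MD5 hex digest of a normalized form of `name`."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and replace anything that is not a letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", unaccented.lower())
    # Sort the unique tokens so "Last, First" and "First Last" collide.
    tokens = sorted(set(cleaned.split()))
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()


def group_names(names):
    """Group name variants that share the same normalized-name hash."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return list(groups.values())


variants = [
    "Lydia Mendoza",
    "Mendoza, Lydia",
    "LYDIA  MENDOZA",
    "Los Alegres de Terán",
    "Los Alegres De Teran",
]
for group in group_names(variants):
    print(group)
```

Because this is a single pass with a dictionary lookup per row, it scales to well over 100,000 rows. It will not catch misspellings, which would need an edit-distance or phonetic method on top.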
LOD integration: phase 2
Discovery and linking of existing records
• Entity lookup in LOD knowledge bases
• Audio fingerprint lookups in AcoustID database, which
links to MusicBrainz
• Search for artist, group, and composer names in service
APIs (note: these work better with English than Spanish)
  • DBpedia Spotlight
  • MusicBrainz
  • Discogs
  • VIAF, LCNAF (worth a try)
• Combination of automated and crowd-sourced verification
of links, integration into site
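As an illustration of the name-search step, the following sketch builds a query URL for the MusicBrainz artist search endpoint and filters a JSON response by its relevance score. The helper names, the 90-point threshold, and the sample response are assumptions made for this example; the response shape (an `artists` list with `id`, `name`, and `score`) follows the MusicBrainz web service's JSON search output.

```python
from urllib.parse import urlencode

MB_ARTIST_SEARCH = "https://musicbrainz.org/ws/2/artist/"


def artist_search_url(name: str, limit: int = 5) -> str:
    """Build a MusicBrainz artist search URL for an exact-phrase name query."""
    # Quote the name so a multi-word name is searched as one phrase.
    escaped = name.replace('"', '\\"')
    params = {"query": f'artist:"{escaped}"', "fmt": "json", "limit": limit}
    return MB_ARTIST_SEARCH + "?" + urlencode(params)


def best_candidates(response: dict, min_score: int = 90):
    """Return (mbid, name, score) tuples scoring at least `min_score`."""
    candidates = []
    for artist in response.get("artists", []):
        score = int(artist.get("score", 0))
        if score >= min_score:
            candidates.append((artist["id"], artist["name"], score))
    return candidates


# Hand-written sample in the shape of a search response (placeholder IDs).
sample_response = {
    "artists": [
        {"id": "00000000-0000-0000-0000-000000000001", "name": "Lydia Mendoza", "score": 100},
        {"id": "00000000-0000-0000-0000-000000000002", "name": "Lidia Mendez", "score": 61},
    ]
}
print(artist_search_url("Lydia Mendoza"))
print(best_candidates(sample_response))
```

A real run over tens of thousands of names would also need to send an identifying User-Agent header and throttle requests, since MusicBrainz rate-limits its web service. Candidates that pass the threshold are still only candidates and would go on to the verification step.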
LOD integration: phase 3
Contributing/creating new records
• Unsolicited bulk record generation may be seen as linked
data spam and rejected (“notability” problem)
• Direct communication and participation in knowledge
base’s community is the most promising approach
• Case study: discussion with MusicBrainz community
• Voting/editorial review system can be incompatible with
bulk updates, but the community may be willing to
accommodate
• Data records should be well formed and clean; upload
methods must be tested and the upload coordinated
with LOD admins
LOD integration: the “bot” option
LOD integration: crosswalks
between repositories and records
Progress to date
• Used metadata cleaning approaches to identify most likely
unique names in the DB
• Applied acoustic fingerprinting to all 116,000 audio files
  • Matched 1,313 songs
  • Following the AcoustID links to MusicBrainz positively
  identifies ~287 artists with their records in MusicBrainz
  (as well as Discogs and DBpedia)
• Ran DBpedia Spotlight on all artist and composer names,
correlated matched entities with MusicBrainz, Wikidata IDs
• Searched for artist and composer names via MusicBrainz,
Discogs, and VIAF APIs
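The step from fingerprint matches to identified artists can be sketched as follows: send a fingerprint and duration to the AcoustID lookup endpoint, then collect the distinct MusicBrainz artists attached to the matched recordings. This is a sketch under assumptions, not the project's code. The function names and the 0.8 score cutoff are hypothetical, the fingerprint itself is assumed to come from a separate tool such as Chromaprint, and the sample response is hand-written in the shape the lookup returns when `meta=recordings` is requested.

```python
from urllib.parse import urlencode

ACOUSTID_LOOKUP = "https://api.acoustid.org/v2/lookup"


def lookup_url(client_key: str, duration: float, fingerprint: str) -> str:
    """Build an AcoustID lookup URL that asks for linked recordings."""
    params = {
        "client": client_key,        # application API key (placeholder below)
        "duration": int(duration),   # track length in whole seconds
        "fingerprint": fingerprint,  # produced by a separate fingerprinting tool
        "meta": "recordings",
    }
    return ACOUSTID_LOOKUP + "?" + urlencode(params)


def artists_from_lookup(response: dict, min_score: float = 0.8) -> dict:
    """Return {artist MBID: name} for results scoring at least `min_score`."""
    artists = {}
    if response.get("status") != "ok":
        return artists
    for result in response.get("results", []):
        if result.get("score", 0.0) < min_score:
            continue
        for recording in result.get("recordings", []):
            for artist in recording.get("artists", []):
                artists[artist["id"]] = artist["name"]
    return artists


# Hand-written sample response with placeholder identifiers.
sample_lookup = {
    "status": "ok",
    "results": [
        {
            "id": "acoustid-placeholder-1",
            "score": 0.97,
            "recordings": [
                {
                    "id": "recording-placeholder-1",
                    "title": "Example Song",
                    "artists": [{"id": "artist-placeholder-1", "name": "Example Artist"}],
                }
            ],
        },
        {"id": "acoustid-placeholder-2", "score": 0.41, "recordings": []},
    ],
}
print(lookup_url("YOUR_CLIENT_KEY", 187.4, "AQAAexamplefingerprint"))
print(artists_from_lookup(sample_lookup))
```

Keying the result on artist MBID deduplicates artists who appear on many matched songs, which is why the count of identified artists is far smaller than the count of matched songs.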
Entity matching to LOD sites

Method                    Artists on label    Artists on sleeve   Composers
                          (out of 13,752)     (out of 7,035)      (out of 24,211)
Acoustic fingerprinting   287 (for all names)
DBpedia Spotlight         272                 27                  72
MusicBrainz lookup        620                 434                 1,151
Discogs search API        4,929               3,502               9,423
VIAF search API           3,707               3,057               8,889

*These are likely in order of decreasing accuracy!
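To compare the methods on a common scale, the raw counts in the table can be converted to match rates. The short sketch below does that arithmetic using the counts and denominators exactly as they appear in the table; the dictionary keys are labels chosen for this example. The acoustic fingerprinting row is left out because its single figure covers all name types together.

```python
# Denominators as given in the table's column headers.
TOTALS = {
    "artists_on_label": 13752,
    "artists_on_sleeve": 7035,
    "composers": 24211,
}

# Match counts per method, in the same column order as TOTALS.
MATCHES = {
    "DBpedia Spotlight": (272, 27, 72),
    "MusicBrainz lookup": (620, 434, 1151),
    "Discogs search API": (4929, 3502, 9423),
    "VIAF search API": (3707, 3057, 8889),
}


def match_rates() -> dict:
    """Return {method: {category: percent matched}} rounded to one decimal."""
    rates = {}
    for method, counts in MATCHES.items():
        rates[method] = {
            category: round(100.0 * count / total, 1)
            for (category, total), count in zip(TOTALS.items(), counts)
        }
    return rates


for method, by_category in match_rates().items():
    row = ", ".join(f"{cat}: {pct}%" for cat, pct in by_category.items())
    print(f"{method}: {row}")
```

A higher rate here means more candidate matches, not more correct ones; per the note under the table, the broader search APIs are likely the least accurate.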
Concerns/next steps
• Scalable approaches for Q/A of data (new and old)
• Discoverability and usability for humans and machines
(APIs)
• Repository integration: adopting a linked data model will
help
• Trusted channels for upload to existing knowledge bases:
design a formal model?
• Work with specialized sub-collections of knowledge bases
(topics, regions)?
• Test DBpedia Spotlight w/Spanish data pack
• Does using links to existing LOD entries just reinforce
inequality of artist exposure (“rich get richer”/LOD “echo
chamber”)?
Thanks!
UCLA Digital Library
• Lisa McAulay
• Kristian Allen
• T-Kay Sangwand
• …everyone else (past and present)
Arhoolie Foundation
• Tom Diamant
• Chris Strachwitz (obviously)