Supporting the Digital Humanities
Vienna, 19–20 October 2010
Findings and Outcomes of the musicSpace Project
Speaker: David Bretherton
Co-Authors: Daniel A. Smith, mc schraefel, Joe Lambert
Presentation overview
I am going to focus on one particular outcome of musicSpace: a successor project called ‘MusicNet’.
I will concentrate on how musicSpace provided the motivation for MusicNet
musicSpace
3-year project that concluded in September 2010
http://musicspace.mspace.fm
musicSpace’s goals
To integrate access to leading online music resources using the mSpace faceted browser.
Demonstrate that integration could support rapid exploration & knowledge building.
Enable complex, multipart queries.
MusicNet’s goals
Mint URIs for composers so that content providers can unambiguously identify them.
– Hope to expand to include all music-related entities.
Publish alignment data to back-link into our data partners’ catalogues, and to other resources.
Build a suite of tools to support the alignment and integration of new linked data resources.
Build a demonstration service to illustrate the uses and benefits of the URIs and alignment data.
Contents
1. Brief overview of musicSpace
2. How musicSpace provided the motivation for ‘MusicNet’
3. MusicNet’s alignment tool
1. Brief overview of musicSpace
Problem
Centuries of material ...
... is now increasingly digitised
Yet data is often ‘siloed’.
Geographical dispersal has been replaced by virtual dispersal on the web. Data is now segregated into countless online repositories by:
– Media type (text, image, audio, video)
– Date of creation/publication
– Subject
– Language
– Copyright holder
– Ad hoc/insecure nature of project funding
Interoperability has generally not been given a high enough priority.
Using current online music data resources presents barriers at all stages of the research process:
It is hard to speculatively browse around a subject area.
‘Real-world’ multipart queries are effectively intractable.
The barriers to tractability and their solutions
Barriers:
– Need to consult several sources … and metadata from one source cannot guide searches of another source.
– Insufficient granularity of data and/or search options.
– Multi-part queries have to be broken down and results collated manually.
Solutions:
– Integration
– Increased granularity
– Optimally interactive UI (‘mSpace’)
Solution
‘musicSpace’ is a faceted browser
Demonstration
‘What recordings of works by Cage exist, which performers have recorded a particular work by Cage, and what else by Cage have they recorded?’
Screencast 1:
http://www.youtube.com/watch?v=keTN12OWies&hd=1
2. How musicSpace provided the motivation for MusicNet
Data is not ‘clean’...
Schubert Schubert, Franz Schubert, Franz Peter Shu-po-tʻe, ‡d 1797-1828 Schubert ‡d 1797-1828 F. P. Schubert Schubert, ... ‡d 1797-1828 Schubert, F. Schubert, F. ‡d 1797-1828 Schubert, Fr. Schubert, Fr. ‡d 1797-1828 Schubert, Franciszek. Schubert, Franc. ‡d 1797-1828 Schubert, Francois ‡d 1797-1828 Schubert, Franz P. ‡d 1797-1828
Schubert, Franz Peter Schubert, Franz Peter, ‡d 1797-1828 Schubert, Franz Peter ‡d 1797-1828 Schubert, Francois, ‡d 1797-1828 Schubert. Schubert ‡d 1797-1828 Shu-po-tʿe ‡d 1797-1828 Shubert, F. (Frant $s% ) ‡d 1797-1828 Shubert, F. ‡q (Frant $s% ), ‡d 1797-1828 Shubert, Frant $s% , ‡d 1797-1828 Shubert, Frant $s% ‡d 1797-1828 Shūberuto, F. Shūberuto, Furantsu ‡d 1797-1828 Subert, Franc ‡d 1797-1828 Subertas, F. (Francas), ‡d 1797-1828
Subertas, Francas Peteris, 1797-1828‡d Subert, F.
, .Subertas F ‡d 1797-1828 פרנץ, שוברט
シューベルト, F., 1797-1828 シューベルト , フランツ ‡d 1797-1828 舒柏特 , 弗朗茨 Schubert, Francois 1797-1828‡d
, Schubert Franz Peter 1797-1828‡d
Causes of dirty data
Different naming conventions;
– e.g. ‘Bach, Johann Sebastian’ or ‘J. S. Bach’
Inclusion of non-name data in name field;
– e.g. ‘Schubert, Franz, 1797-1828. Songs’, or ‘Allen, Betty (Teresa)’
Different languages (and alphabets);
User input errors.
– e.g. ‘Bach, Johan Sebastien’
Dirty data degrades the user experience
Searching for compositions by the composer Franz Schubert (1797–1828)...
Screencast 2:
http://www.youtube.com/watch?v=pFsYfz1vlAg&hd=1
3. MusicNet’s alignment tool
Prototype 1 (musicSpace era)
Used Alignment API & Google Docs
We used Alignment API to compare the names as strings, using WordNet to enable word stemming, synonym support, etc.
Alignment API produces a similarity measure for each possible match.
We planned to set a threshold for automatic approval.
Matches below that threshold would be sent to a Google Docs spreadsheet for expert review.
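The planned workflow can be sketched roughly as follows. This is only an illustration: Python's `difflib.SequenceMatcher` stands in for the Alignment API similarity measure (which it does not reproduce), and the function names `similarity` and `route_matches` and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; NOT the Alignment API measure."""
    return SequenceMatcher(None, a, b).ratio()


def route_matches(pairs, threshold):
    """Split candidate pairs into auto-approved and needs-expert-review."""
    approved, for_review = [], []
    for a, b in pairs:
        score = similarity(a, b)
        # At or above the threshold: approve automatically; below: send to review.
        target = approved if score >= threshold else for_review
        target.append((a, b, score))
    return approved, for_review


# Hypothetical candidate pairs and threshold, for illustration only.
candidates = [
    ("Bach, Johann Sebastian", "Bach, Johann Sebastian"),
    ("Schubert, Franz", "シューベルト, F."),
]
approved, for_review = route_matches(candidates, threshold=0.9)
print(len(approved), "approved;", len(for_review), "sent for expert review")
```

In the prototype the review queue was a Google Docs spreadsheet; here it is just a list.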
Shortcoming 1: no threshold
It was not possible to identify a threshold for automatic approval.
Terms that differ by, say, just one character are judged to be similar, but a difference of one character is usually significant in a name.
Names are proper nouns, and so are unsuited to WordNet’s assumptions about misspelling.
False matches with high similarity measures:
True matches with low similarity measures:
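Both failure modes are easy to reproduce with any generic string metric. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in (it is not the Alignment API measure), and the name pairs are illustrative choices rather than the project's own data: two different composers whose names differ only in the last few letters, and two renderings of the same composer in different scripts.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; NOT the Alignment API measure."""
    return SequenceMatcher(None, a, b).ratio()


# Two different composers with nearly identical name strings.
false_match = similarity("Bach, Johann Christian", "Bach, Johann Christoph")

# The same composer, written in Latin script and in Japanese katakana.
true_match = similarity("Schubert, Franz", "シューベルト, F.")

print(f"different people: {false_match:.2f}")
print(f"same person:      {true_match:.2f}")
```

With this stand-in metric the two different people score far higher than the two renderings of one person, so no single cut-off can approve the true match without also approving the false one.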
Shortcoming 2: no context
Alignment API compares names as strings, and the system strips the names of their context (i.e. additional metadata).
– Lack of context meant the musicologist had no way to verify the match.
Significant flaw: automation had failed, so we were relying on manual review.
Prototype 2 (building a custom tool for MusicNet)
Lessons learned
From Prototype 1:
– A completely automated solution is out of the question (for the moment...).
– We needed a custom tool with a human-friendly UI (we also wanted keyboard shortcuts for speed).
– We needed access to additional metadata (i.e. context), so matches can be researched by the reviewer.
From experience with faceted browsers:
– Alphabetically sorted columns enable one to spot synonymous names at a glance.
  · Normally sources give names surname first; duplication arises from the different representation of given names.
Alignment process
Data*
Algorithm compares hash of alpha-only, lower-case version of name:
– Suggested groups: user verified* or rejected*
– No groups suggested: manual grouping (research*)
Synonym groups
URIs, alternative names, back links*
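A minimal sketch of the group-suggestion step, assuming names arrive as plain strings. The choice of SHA-256 and the function names `name_key` and `suggest_groups` are assumptions for illustration; the slides say only that a hash of the alpha-only, lower-case name is compared.

```python
import hashlib
from collections import defaultdict


def name_key(name: str) -> str:
    """Hash of the lower-cased name with every non-letter removed."""
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    # SHA-256 is an assumed choice; any stable hash would do here.
    return hashlib.sha256(letters.encode("utf-8")).hexdigest()


def suggest_groups(names):
    """Return lists of names sharing a key; singletons are not suggested."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name_key(name)].append(name)
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical input for illustration.
names = [
    "Schubert, Franz Peter",
    "SCHUBERT, Franz Peter.",
    "Schubert, F.",
    "Schubert, F",
    "Bach, Johann Sebastian",
]
for group in suggest_groups(names):
    print(group)
```

Keys match only when the letters are identical, so variants that differ in punctuation, spacing or case are suggested together, while ‘Schubert, Franz Peter’ and ‘Schubert, F.’ are not; pairs like that are left for manual grouping.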
UI of Prototype 2
Prototype 2 demo
Screencast 3:
http://www.youtube.com/watch?v=5f8iaryZMk0&hd=1
Indicative use cases
Composer URIs:
– Music(ological) content providers
– Basis of a (re)search portal
Alignment tool:
– Aligning databases with no authorities;
– Or where authorities are inconsistent.
Thank you for listening!