IBE312: Information Architecture 2013
Ch. 9 – Metadata
Many of the slides in this slideset are reproduced and/or modified content from publicly available slidesets by Paul Jacobs (2012),
The iSchool, University of Maryland http://terpconnect.umd.edu/~psjacobs/s12/INFM700s12.htm.
These materials were made available and licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Metadata
• “Data about data” - Definitional and descriptive documentation/information about data…
• From Free On-line Dictionary of Computing: Data about data. In data processing, meta-data is definitional data that provides information about or documentation of other data managed within an application or environment.
For example, meta-data would document data about data elements or attributes, (name, size, data type, etc) and data about records or data structures (length, fields, columns, etc) and data about data (where it is located, how it is associated, ownership, etc.). Meta-data may include descriptive information about the context, quality and condition, or characteristics of the data.
• (Some other definitions.)
Metadata
• Why do we need this?
• Types of metadata
– Descriptive/subjective/content (e.g. author, subject, keywords, …)
– Administrative (e.g. owner, rights, cost, creation date, version, …)
– Technical (e.g. format, size, dependencies, programs)
– . . . .
• In practical terms:
– Metadata helps users locate, navigate, interpret content
– Metadata helps organizations manage content
– Metadata helps systems manipulate content
Data without Metadata…
7/1/1988 OL 950 20.3 13 0.8 -0.1 33.1 27.8 5.3 5.92
7/2/1988 OL 950 24.2 12.6 1 -0.1 27.8 23.9 3.8 4.56
7/3/1988 OL . . . . . . . . .
7/4/1988 OL 950 0.4 16.3 0.4 0.2 41 34.5 6.5 15.5
7/5/1988 OL 1005 32.9 18.9 1.4 0.3 29.8 23.7 6.1 14.23
7/6/1988 OL 1020 32.3 20.5 1.4 0.3 23.4 18.9 4.5 12.97
7/7/1988 OL 1015 36.8 24.9 1.7 0.5 18.6 15.3 3.2 13.92
7/8/1988 OL 925 42.8 25.6 2.5 0.6 23.7 19.9 3.9 15.18
7/9/1988 OL 945 23.3 27.8 0.7 0.8 27.7 23.5 4.3 12.33
7/10/1988 OL 1030 49.8 26.2 2.6 0.6 40.3 34 6.3 22.14
7/11/1988 OL 940 44.8 25.2 2.5 0.8 34 29.2 4.8 16.76
7/12/1988 OL 1010 47.6 26.9 2.6 0.7 47.3 39.6 7.7 16.13
7/13/1988 OL 945 36.5 22.6 1.9 0.6 36.7 32.6 4 15.5
7/14/1988 OL 950 19.5 18.6 0.4 0.5 302 39.1 262.9 11.07
7/15/1988 OL 955 31.7 15.7 1.5 0.4 29.7 25 4.7 9.49
7/16/1988 OL 955 23.3 14.5 1.8 0.8 23.4 20.7 2.7 8.14
7/17/1988 OL 1015 23.8 16.6 1.6 0.6 27.7 24.1 3.7 9.17
7/18/1988 OL 934 32.9 16.7 2.1 0.7 34 28.9 5.1 9.49
7/19/1988 OL 1010 29.2 20.4 1.9 0.7 26 22.3 3.7 10.44
7/20/1988 OL 952 44.8 24.8 2.1 0.8 31.7 27.5 4.2 10.75
7/21/1988 OL 1029 33.7 37.1 1.9 0.6 34.5 30.1 4.3 12.02
7/22/1988 OL 1017 34.3 32.9 2 0.7 31.4 26.2 5.1 12.65
7/23/1988 OL 1040 35.7 24.6 2 0.8 23.7 20.4 3.3 15.5
7/24/1988 OL 923 47.6 28.9 2.9 0.8 67.3 58.9 8.4 20.87
7/25/1988 OL 1030 58.3 32.6 2.9 0.7 68 59.3 8.7 22.14
7/26/1988 OL 950 49.3 29.2 3.4 0.6 86 75.1 10.9 21.19
7/27/1988 OL 1006 54.1 20.9 3.9 0.6 94 82.8 11.2 25.06
7/28/1988 OL 1010 40.5 16.5 1.7 0.3 41 34.4 6.6 6.54
7/29/1988 OL 1000 25.5 23.6 1.4 0.1 41 35.4 5.6 3.82
7/30/1988 OL 1005 47.9 17.6 0.8 0.1 18.3 15.9 2.3 4.19
7/31/1988 OL 1015 38 22.5 1.5 0.1 30 25.3 4.7 4.44
8/1/1988 OL 1018 21.2 8.8 1.1 -0.1 24.7 21.1 3.6 4.81
8/2/1988 OL 1004 38.5 22.8 2.1 0.3 54 46.8 7.2 9.8
8/3/1988 OL 1011 94 32.6 2.1 0.3 45.5 38.9 6.6 9.49
8/4/1988 OL 955 58.3 43.1 2.5 1.1 41 33.1 7.9 9.8
8/5/1988 OL 951 55.8 42.2 2.1 0.8 38 31 7 8.86
Who: authored it? to contact about data?
What: are contents of database?
When: was it collected? processed? finalized?
Where: was the study done?
Why: was the data collected?
How: were data collected? processed? Verified?
… can be pretty useless!
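One way to picture the who/what/when/where/why/how questions is as a small metadata record stored next to the raw rows. The sketch below is illustrative only: the slide never says what the columns mean, so every field value and every column name other than the date is a made-up placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Answers to the who/what/when/where/why/how questions for one dataset."""
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str
    columns: list[str] = field(default_factory=list)


def describe_row(row: str, meta: DatasetMetadata) -> dict[str, str]:
    """Pair each whitespace-separated value in a raw row with its column name."""
    values = row.split()
    if len(values) != len(meta.columns):
        raise ValueError("row does not match the documented columns")
    return dict(zip(meta.columns, values))


# Hypothetical metadata: the slide gives no column meanings, so these
# descriptions and names are placeholders, not facts about the data.
meta = DatasetMetadata(
    who="(author and contact person)",
    what="(description of the contents)",
    when="(collection and processing dates)",
    where="(study location)",
    why="(purpose of the collection)",
    how="(collection, processing and verification methods)",
    columns=["date", "station", "time_of_day"]
    + [f"measure_{i}" for i in range(1, 9)],
)

labelled = describe_row(
    "7/1/1988 OL 950 20.3 13 0.8 -0.1 33.1 27.8 5.3 5.92", meta
)
print(labelled["date"], labelled["measure_1"])
```

With the column list in place, a row stops being an anonymous string of numbers and each value can be looked up by name.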
Menagerie of Terms
• Classification
• Hierarchies
• Epistemology
• Directories
• Controlled vocabularies
• Knowledge representation
Let’s focus on significant differences.
Let’s focus on advantages/disadvantages.
Let’s focus on how each is useful.
Controlled Vocabulary
• Any defined subset of natural language
• List of equivalent terms (synonym rings)
– Use search logs.
• List of preferred terms (authority files)
– Commonly also include variant terms
– Educating users, enabling browsing
– Term rotation (pointers in index) p.201
• Classification scheme / taxonomy
– Hierarchical relationships (narrower/broader)
Controlled Vocabulary
• Authority file – inclusive; the preferred term can serve as the unique identifier for a collection of terms and helps educate users
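The difference between a synonym ring and an authority file can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the term "portable computer" and the choice of "laptop" as the preferred term are assumptions made for the example.

```python
# Synonym ring: a set of terms treated as equivalent, with no preferred member.
SYNONYM_RINGS = [
    {"laptop", "notebook", "portable computer"},  # example terms (assumed)
]

# Authority file: every variant term points to exactly one preferred term,
# which can then serve as the identifier for the whole group.
AUTHORITY_FILE = {
    "laptop": "laptop",
    "notebook": "laptop",
    "portable computer": "laptop",
}


def expand_query(term: str) -> set[str]:
    """Return every term in the same synonym ring, or just the term itself."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return set(ring)
    return {term}


def preferred_term(term: str) -> str:
    """Map a variant onto its preferred term; unknown terms pass through."""
    term = term.lower()
    return AUTHORITY_FILE.get(term, term)


print(expand_query("Notebook"))
print(preferred_term("Notebook"))
```

A synonym ring only widens a search, while the authority file also tells the user which term the site itself uses.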
Related Terms & Techniques
• Taxonomies
– Anything organized in some sort of hierarchical structure
• Tagging
– Adding almost any kind of metadata to content, but now often descriptive and user-provided
• Thesauri
– Focus on relations between terms
– Focus on “concepts”
• Ontologies
– Usually model a specific domain or part of the world
– Generally machine-readable
Increasing complexity and richness
Metadata
Taxonomies & Thesauri
Practical Uses
How are taxonomies, tagging, controlled vocabularies and thesauri used?
• The semantic gap: What’s the problem?
– Synonymy – roughly, different words or phrases can be used to express similar ideas (e.g. “notebook”, “laptop”)
– Polysemy – roughly, the same word can have different meanings (e.g., “line” (fishing, code, queue, . . .))
• Taxonomies try to group similar concepts
• “Tags” often assign words to concepts, making it easier to find related concepts
• Controlled vocabularies avoid ambiguity (like a specific tag set)
• Thesauri represent attempts to better organize mappings between words and concepts
Do these present precision or recall problems?
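To make the precision/recall question concrete, here is a small sketch using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The three documents and the relevance judgments are invented for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query result."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection (hypothetical). d3 uses "notebook" in its paper sense.
DOCS = {
    "d1": "laptop battery review",
    "d2": "notebook battery life",
    "d3": "paper notebook with lined pages",
}
RELEVANT = {"d1", "d2"}  # the documents about portable computers


def search(terms: set[str]) -> set[str]:
    """Return ids of documents containing at least one query term."""
    return {doc_id for doc_id, text in DOCS.items() if terms & set(text.split())}


plain = precision_recall(search({"laptop"}), RELEVANT)
expanded = precision_recall(search({"laptop", "notebook"}), RELEVANT)
print(plain, expanded)
```

In this toy case, searching only for "laptop" misses d2 (synonymy lowers recall), while adding "notebook" finds d2 but also pulls in d3 (polysemy lowers precision).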
Taxonomies
– Organization of objects according to some principle
– Familiar examples:
• Linnaean taxonomy (for living organisms)
• Web directories (e.g., Yahoo or ODP)
• Corporate directories
• Organization charts
• Organizational structures previously discussed
Thesauri: Motivation
• “Semantic gap” between concepts and words
• Online thesauri help map many synonyms or word variants onto one preferred term – improve precision in retrieval (p.203)
• Words are used to evoke concepts
– Concrete objects: MacBook Pro, iPhone
– Abstract ideas: freedom, peace
[Diagram labels: Concepts, Words, Ideas, Meaning]
Thesauri
• Book of synonyms, often including related and contrasting words and antonyms.
• In this class:
– A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval.
• Technical lingo …
• Thesauri standards: ISO 2788, …
Applying IA Principles
• Focus on users and user needs – users are different, and have different models
• Focus on content – concepts are different, too – different levels, words, complexity, vagueness
• Examples:
– What’s the difference between laptop, PDA, phone, and convergence device?
– When is “cancer research” “oncology”?
– When a user browses a furniture catalog for chairs, do you show them ottomans and footstools?
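Standard Thesaurus Structure
[Diagram: “Computer” is the broader term; “Notebook” and “Laptop” are synonyms (variants, AKA), one of which is marked as the preferred term; “Desktop Replacement”, “Ultraportable” and “Tablet PC” are narrower terms. Broader and narrower terms are linked by IS-A relationships.]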
Semantic relationships in a thesaurus
• (pp. 204-205) Abbreviations:
– PT: Preferred Term
– VT: Variant Term
– BT: Broader Term
– NT: Narrower Term
– RT: Related Term
– Use (U): points from a VT to its PT
– Used For (UF): full list of VTs on the PT record
– Scope Note (SN): meaning of the term, to rule out ambiguity
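A thesaurus entry built from these abbreviations can be modelled as a simple record. The sketch below reuses the computer example from the structure slide; treating "Laptop" as the preferred term and the wording of the scope note are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    preferred: str                                      # PT
    variants: list[str] = field(default_factory=list)   # listed as UF on the PT record
    broader: list[str] = field(default_factory=list)    # BT
    narrower: list[str] = field(default_factory=list)   # NT
    related: list[str] = field(default_factory=list)    # RT
    scope_note: str = ""                                 # SN


THESAURUS = {
    "Laptop": TermRecord(
        preferred="Laptop",               # assumed to be the preferred term
        variants=["Notebook"],
        broader=["Computer"],
        narrower=["Desktop Replacement", "Ultraportable", "Tablet PC"],
        scope_note="A portable personal computer.",  # invented scope note
    ),
}


def use(term: str) -> str:
    """Follow the 'Use' reference from a variant term to its preferred term."""
    for record in THESAURUS.values():
        if term == record.preferred or term in record.variants:
            return record.preferred
    raise KeyError(term)


def format_record(record: TermRecord) -> list[str]:
    """Render a record in the conventional abbreviated layout."""
    lines = [f"PT {record.preferred}"]
    lines += [f"UF {v}" for v in record.variants]
    lines += [f"BT {b}" for b in record.broader]
    lines += [f"NT {n}" for n in record.narrower]
    lines += [f"RT {r}" for r in record.related]
    if record.scope_note:
        lines.append(f"SN {record.scope_note}")
    return lines


print("\n".join(format_record(THESAURUS[use("Notebook")])))
```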
Some Real Examples
• Content tagging and social media (e.g. Flickr, del.icio.us)
• Special-purpose classification schemes and thesauri (e.g. art & architecture thesaurus – AAT, UMLS)
• General semantic tools and classification schemes (e.g., Princeton WordNet, Roget’s Thesaurus)
Art & Architecture Thesaurus
http://www.getty.edu/research/conducting_research/vocabularies/aat/
UMLS (Unified Medical Language System)
Source: National Library of Medicine (NIH)
3 Knowledge Sources, used separately or together:
• Metathesaurus – 1 million+ biomedical concepts from over 100 sources
• Semantic Network – 135 broad categories and 54 relationships between them
• SPECIALIST Lexicon + Tools – lexical information and programs for language processing
E.g. UMLS (Unified Medical Language System)
Source: National Library of Medicine (NIH)
• Began in 1986 as long-term R&D project
• Designed for systems developers
• Develop multi-purpose tools to enhance understanding of medical meaning across systems
• Overcome barriers to effective retrieval of machine-readable information
• Overcome variety of ways the same concepts are expressed in machine-readable and human language
UMLS Uses
Source: National Library of Medicine (NIH)
• Information retrieval
• Thesaurus construction
• Natural language processing
• Automated indexing
• Electronic health records (EHR)
• Distribution mechanism for HIPAA, CHI, PHIN regulatory standards
• SNOMED CT
Semantic Relationships
• Equivalence (PT = VT)
• Hierarchical: generic (Bird NT Magpie), whole-part (Foot NT Big toe) or instance (Seas NT Mediterranean Sea)
– Faceted / multiple hierarchies
• Associative
– Related terms (hammer RT nail)
• Preferred terms:
– Form, selection, definition and specificity
• Polyhierarchy (Medline cross-lists viral pneumonia under both ... Fig 9-25, p. 220)
• Faceted classification – multiple taxonomies that focus on different dimensions of the content (e.g. wine.com, pp. 223-224)
Poly-Hierarchies
• Concepts can have multiple parents
• Example (terms from the diagram):
– Cracow (Poland : Voivodship)
– German death camps
– Auschwitz II-Birkenau (Poland : Death Camp)
– Block 25 (Auschwitz II-Birkenau)
– Kanada (Auschwitz II-Birkenau)
• What are the advantages and disadvantages?
• What’s the relationship to polysemy?
From Shoah Foundation’s thesaurus of Holocaust terms
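A polyhierarchy can be stored as a mapping from each term to a list of parents rather than a single parent. In the sketch below the edges are an assumption read off the term labels (the diagram itself did not survive extraction): the camp is filed under both a place and a category, and the two sub-locations are filed under the camp.

```python
# Each term maps to ALL of its parents. The specific edges are assumed
# from the term labels, since the original diagram is not available.
PARENTS = {
    "Block 25 (Auschwitz II-Birkenau)": [
        "Auschwitz II-Birkenau (Poland : Death Camp)",
    ],
    "Kanada (Auschwitz II-Birkenau)": [
        "Auschwitz II-Birkenau (Poland : Death Camp)",
    ],
    "Auschwitz II-Birkenau (Poland : Death Camp)": [
        "Cracow (Poland : Voivodship)",
        "German death camps",
    ],
}


def ancestors(term: str) -> set[str]:
    """Collect every broader term reachable through any parent."""
    found: set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


print(sorted(ancestors("Block 25 (Auschwitz II-Birkenau)")))
```

Because a term can be reached through more than one parent, a user browsing by place and a user browsing by category both arrive at the same record.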
Faceted Hierarchies
• Alternative to single and poly-hierarchies
• Basic idea:
– Describe objects along multiple facets
– Each facet has its associated hierarchy
• Issues:
– What’s a facet?
– How do you navigate faceted hierarchies?
Advantages of Facets
• Integrates searching and browsing
• Easy to build complex queries
• Easy to narrow, broaden, shift focus
• Helps users avoid getting lost
• Helps to prevent “categorization wars”
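The "easy to narrow, broaden, shift focus" point can be illustrated with a short sketch in which each item is described along several independent facets. The wine records and the facet names (type, region, price) are hypothetical, loosely modelled on a wine catalog.

```python
# Hypothetical catalog: each item is described along independent facets.
WINES = [
    {"id": 1, "type": "red", "region": "France", "price": "under $20"},
    {"id": 2, "type": "red", "region": "Italy", "price": "$20-$40"},
    {"id": 3, "type": "white", "region": "France", "price": "under $20"},
    {"id": 4, "type": "white", "region": "Italy", "price": "under $20"},
]


def filter_by_facets(items: list[dict], **selected: str) -> list[dict]:
    """Keep items whose value matches every selected facet."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


# Narrow: add a facet. Broaden: drop one. Shift focus: swap one.
narrow = filter_by_facets(WINES, type="red", region="France")
broad = filter_by_facets(WINES, type="red")
shifted = filter_by_facets(WINES, region="France")
print([w["id"] for w in narrow], [w["id"] for w in broad], [w["id"] for w in shifted])
```

No single hierarchy has to decide whether "type" sits above "region" or the other way round, which is one reason facets sidestep categorization disputes.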
Relationship to IA?
[Diagram: Database, Web Server, Application Server, Network]
Ontologies are implicitly “hidden” here!!!
[Diagram: a small flight ontology]
• Trip – From:, To:
• Flight (Part-of Trip) – Departure Time:, Arrival Time:, Origin:, Destination:
• Airplane (Equipment of Flight) – Type:, Capacity:
Rule: Arrival Time is always after Departure Time
Rule: Distance from Origin to Destination is typically > 100 miles
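The flight example shows what an ontology adds over a taxonomy: typed classes, named relationships, and rules. A rough sketch of the same model follows. The `distance_miles` field is an added assumption (the diagram only names Origin and Destination), the sample values are invented, and times are assumed to share one time zone.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Airplane:
    type: str
    capacity: int


@dataclass
class Flight:
    origin: str
    destination: str
    departure_time: datetime
    arrival_time: datetime
    equipment: Airplane       # the "Equipment" relationship
    distance_miles: float     # assumed field, needed to check the second rule


@dataclass
class Trip:
    from_place: str
    to_place: str
    flights: list[Flight] = field(default_factory=list)  # Flight is part-of Trip


def check_flight(flight: Flight) -> list[str]:
    """Apply the two rules; the first is a hard rule, the second a typicality check."""
    problems = []
    if flight.arrival_time <= flight.departure_time:
        problems.append("error: arrival time must be after departure time")
    if flight.distance_miles <= 100:
        problems.append("warning: distance is typically over 100 miles")
    return problems


# Invented sample data that breaks both rules.
plane = Airplane(type="turboprop", capacity=70)
bad = Flight("City X", "City Y",
             datetime(2013, 1, 1, 12, 0), datetime(2013, 1, 1, 11, 0),
             plane, 50.0)
print(check_flight(bad))
```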
Putting it all together…
Three-Layer Architecture: Database, Application Server, Web Server, Network
Two-Layer Architecture: Database, Web Server, Network
Example stack: Apache, MySQL, PHP
Content Presentation
[Tree diagram with nodes A; B, C; D, E, F; G, H]
You are here: A > C > D
Contents at D
Related - D - E
Hierarchy(child, parent)
Content(id, attribute1, attribute2, attribute3, …)
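The breadcrumb "You are here: A > C > D" can be produced directly from a Hierarchy(child, parent) table by walking parent links up to the root. The sketch below uses Python's built-in sqlite3 module as a stand-in for the database layer; apart from D sitting under C under A, the parent assignments are guesses, since the tree diagram did not survive extraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hierarchy (child TEXT PRIMARY KEY, parent TEXT)")
conn.executemany(
    "INSERT INTO Hierarchy (child, parent) VALUES (?, ?)",
    [
        ("A", None),             # root
        ("B", "A"), ("C", "A"),
        ("D", "C"),              # matches the breadcrumb A > C > D
        ("E", "C"), ("F", "C"),  # assumed parents
        ("G", "D"), ("H", "D"),  # assumed parents
    ],
)


def breadcrumb(node: str) -> str:
    """Walk parent links from the node to the root (assumes no cycles)."""
    path = []
    current = node
    while current is not None:
        path.append(current)
        row = conn.execute(
            "SELECT parent FROM Hierarchy WHERE child = ?", (current,)
        ).fetchone()
        current = row[0] if row else None
    return "You are here: " + " > ".join(reversed(path))


def children(node: str) -> list[str]:
    """List the direct children of a node, for the navigation menu."""
    rows = conn.execute(
        "SELECT child FROM Hierarchy WHERE parent = ? ORDER BY child", (node,)
    )
    return [row[0] for row in rows]


print(breadcrumb("D"))
print(children("C"))
```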
Faceted Browsing
Matching Results
Filter by
- Facet1 (possible values)
- Facet2 (possible values)
Hierarchy(child, parent)
Content(id, attribute1, attribute2, attribute3, …)
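The "Filter by" panel needs, for each facet, the possible values among the current matching results. One way to sketch this over a Content(id, attribute1, attribute2, …) table is a grouped count, again using sqlite3 as a stand-in database. The sample rows are invented, and only two attribute columns are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Content (id INTEGER PRIMARY KEY, attribute1 TEXT, attribute2 TEXT)"
)
conn.executemany(
    "INSERT INTO Content VALUES (?, ?, ?)",
    [(1, "pdf", "2012"), (2, "pdf", "2013"), (3, "html", "2013")],  # invented rows
)

FACETS = ("attribute1", "attribute2")  # columns that may be used as facets


def facet_counts(facet: str, filters: dict[str, str]) -> dict[str, int]:
    """Count the values of one facet among rows matching the active filters."""
    # Column names cannot be bound as SQL parameters, so check them
    # against a fixed list before building the statement.
    if facet not in FACETS or any(name not in FACETS for name in filters):
        raise ValueError("unknown facet")
    sql = f"SELECT {facet}, COUNT(*) FROM Content"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    sql += f" GROUP BY {facet} ORDER BY {facet}"
    return dict(conn.execute(sql, tuple(filters.values())).fetchall())


print(facet_counts("attribute1", {}))
print(facet_counts("attribute2", {"attribute1": "pdf"}))
```

Showing counts next to each value lets users see where a filter leads before they click it.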