i r Lecture 150513


  • 7/28/2019 i r Lecture 150513

    1/57

    12/11/98 SIMS Affiliates Meeting

    Organizing Information: Metadata and Controlled Vocabularies

    Ray R. Larson
    University of California, Berkeley
    School of Information Management and Systems


    Overview: Metadata and Controlled Vocabularies
      Definitions
      Origins and Uses of Controlled Vocabularies for Information Retrieval
      Metadata
      Types of Indexing Languages, Thesauri and Classification Systems
      Process of Design and Development of Thesauri


    Information Organization and Retrieval
      To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
      Knowledge is knowing, familiarity gained by experience; a person's range of information; a theoretical or practical understanding of; the sum of what is known.
      To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
      Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
      The Oxford English Dictionary, cf. Rowley


    Information Properties
      Information can be communicated electronically
        Broadcasting
        Networking
      Information can be easily duplicated and shared
        Problems of Ownership
        Problems of Control
      Adapted from Silicon Dreams by Robert W. Lucky


    Information Hierarchy
      Data
        The raw material of information
      Information
        Data organized and presented by someone
      Knowledge
        Information read, heard or seen and understood
      Wisdom
        Distilled and integrated knowledge and understanding


    Information Hierarchy

    Wisdom

    Knowledge

    Information

    Data


    Information Life Cycle
      [Figure: life-cycle diagram. Labels: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard; Using/Creating; Authoring/Modifying; Organizing/Indexing; Storing/Retrieval; Distribution/Networking; Accessing/Filtering]


    Information Life Cycle

    Authoring/Modifying

    Organizing/Indexing

    Storing/Retrieving

    Distribution/Networking

    Accessing/Filtering

    Using/Creating


    Origins

    Very early history of content representation

    Sumerian tokens and envelopes

    Alexandria - pinakes

    Indices


    Origins
      Biblical Indexes and Concordances (Hugo de St. Caro & 500 monks, 1247 -- KWIC)
      Journal Indexes
      Information Explosion following WWII
      Cranfield Studies of indexing languages and information retrieval
      Development of bibliographic databases
      Index Medicus -- production and Medlars searching


    Origins
      Communication theory revisited
      Problems with transmission of meaning
      [Figure: two communication models]
      Source -> Encoding -> Message -> Channel (Noise) -> Message -> Decoding -> Destination
      Source -> Encoding (writing/indexing) -> Message -> Storage -> Message -> Decoding (Retrieval/Reading) -> Destination


    Structure of an IR System
      [Figure: Information Storage and Retrieval System. Adapted from Soergel, p. 19]
      Search Line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store1: Profiles/Search requests
      Storage Line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store2: Document representations
      Comparison/Matching (Store1 with Store2) -> Potentially Relevant Documents
      Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)


    Metadata

    Data about data

    Information about Information

    Description of information structure and contents for individual information items, or entire collections of information


    Types of Metadata

    Element names.

    Element description.

    Element representation.

    Element coding.

    Element semantics.

    Element classification.


    Metadata Systems

    AACRII/MARC

    Dublin Core

    RDF (Resource Description Framework)

    SGML/XML

    DBMS Metadata

    Controlled vocabularies


    Goals of Descriptive Cataloging (AACRII/MARC)
      1. To enable a person to find a document of which
           the author, or
           the title, or
           the subject is known
      2. To show what a library has
           by a given author
           on a given subject (and related subjects)
           in a given kind (or form) of literature.
      3. To assist in the choice of a document
           as to its edition (bibliographically)
           as to its character (literary or topical)
      Charles A. Cutter, 1876


    Dublin Core Elements
      Title
      Creator
      Subject
      Description
      Publisher
      Other Contributors
      Date
      Resource Type
      Format
      Resource Identifier
      Source
      Language
      Relation
      Coverage
      Rights Management
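
A small record structure is one way to make the element set concrete. The sketch below is illustrative only: `make_record` and the sample values are hypothetical, and the element names are simply the fifteen listed on the slide.

```python
# The fifteen element names as listed on the slide.
DUBLIN_CORE_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
]


def make_record(values):
    """Build a record from {element name: [values]} (hypothetical helper).

    Element names outside the set are rejected. Elements may be omitted
    or repeated, so each one maps to a list of values.
    """
    unknown = set(values) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {name: list(vals) for name, vals in values.items()}


# Sample values are for illustration only.
record = make_record({
    "Title": ["Organizing Information: Metadata and Controlled Vocabularies"],
    "Creator": ["Larson, Ray R."],
    "Subject": ["Metadata", "Controlled vocabularies"],
})
```

Storing every element as a list keeps repeated values (two subjects here) from needing a special case.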


    RDF (W3C)
      A model for representing named properties and property values
        Resources (the things described)
        Properties (aspects, attributes, characteristics of resources)
        Statements (Resource + Property + Value of Property for the Resource)
      Expressed in XML
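
The resource/property/value model can be sketched as plain three-part statements. This is not an RDF library and does not produce RDF/XML; `Statement`, `properties_of`, and the example.org identifier are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    resource: str  # the thing described
    prop: str      # a named property of the resource
    value: str     # the value of that property for the resource


# Illustrative statements about one made-up resource identifier.
statements = [
    Statement("http://example.org/lecture", "Title", "Organizing Information"),
    Statement("http://example.org/lecture", "Creator", "Ray R. Larson"),
    Statement("http://example.org/other", "Title", "Another resource"),
]


def properties_of(resource, stmts):
    """Collect the property/value pairs stated about one resource."""
    return {s.prop: s.value for s in stmts if s.resource == resource}


lecture = properties_of("http://example.org/lecture", statements)
```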


    SGML & XML

    What is SGML/XML?

    Document Type Definitions

    Document Markup

    Sources and Resources


    Databases & Metadata
      Particularly in the Relational Model, metadata is part of the database, providing information about the structure and contents of the database
        What relations (tables) are in the DB
        Relation (table) attributes (domains)
        Attribute representation and storage
        Other information (indexes, etc.)
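
As a concrete instance, SQLite keeps this kind of metadata in its own catalog. The sketch below uses Python's standard `sqlite3` module; the `document` table and its index are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A made-up relation and index, just so the catalog has something in it.
conn.execute(
    "CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT, creator TEXT)"
)
conn.execute("CREATE INDEX idx_document_creator ON document (creator)")

# What relations (tables) and indexes are in the database?
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# What attributes does a relation have, and how are they represented?
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(document)")
]
```

The data tables are never touched here: everything is answered from the database's description of itself.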


    Controlled Vocabularies
      Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.


    Controlled Vocabularies

    Names and name authorities

    Design of controlled vocabularies for

    subject access -- Thesaurus design


    Names
      Cutter's (1876) objectives of bibliographic description:
        To enable a person to find a document of which the author is known.
        To show what the library has by a given author.
      The first serves access. The second serves collocation.


    Problems with Names
      How many names should be associated with a document?
      Which of these should be the main entry?
      What form should each of the names take?
      What references should be made from other possible forms of names that haven't been used?


    The problem

    Proliferation of the forms of names

    Different names for the same person

    Different people with the same names

    Examples
      from Books in Print (semi-controlled but not consistent)
      ERIC author index (not controlled)


    Rules for description

    AACR II and other sets of descriptive

    cataloging rules provide guidelines for:

    Determining the number of name entries

    Choosing a main entry

    Deciding on the form of name to be used

    Deciding when to make references


    Authority control
      Authority control is concerned with creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as "established") based on some set of rules.
      If you have rules, why do you need to keep track of all of the headings?


    Conditions of Authorship?

    Single person or single corporate entity

    Unknown or anonymous authors

    Shared responsibility

    Collections or editorially assembled works

    Works of mixed responsibility (e.g. translations)

    Related Works


    Added Entries
      Personal names
        Collaborators
        Editors, compilers, writers
        Translators (in some cases)
        Illustrators (in some cases)
        Other persons associated with the work (such as the honoree in a Festschrift).
      Corporate Names
        Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.


    Choice of Name
      AACR II says that the predominant form of the name used in a particular author's writings should be chosen as the form of name.
      References should be made from the other forms of the name.


    Form of the Name
      When names appear in multiple forms, one form needs to be chosen. Criteria for choice are:
        Fullness (e.g. full names vs. initials only)
        Language of the name.
        Spelling (choose predominant form)
        Entry element:
          John Smith or Smith, John?
          Mao Zedong or Zedong, Mao? (Mao Tse Tung?)


    Name Authority Files

      ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
      KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
      RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
      VST:d 08-21-91 Other Versions: earlier
      040 DLC$cDLC$dDLC$dOCoLC
      053 PR6005.R517
      100 10 Creasey, John
      400 10 Cooke, M. E.
      400 10 Cooke, Margaret,$d1908-1973
      400 10 Cooper, Henry St. John,$d1908-1973
      400 00 Credo,$d1908-1973
      400 10 Fecamps, Elise
      400 10 Gill, Patrick,$d1908-1973
      400 10 Hope, Brian,$d1908-1973
      400 10 Hughes, Colin,$d1908-1973
      400 10 Marsden, James
      400 10 Matheson, Rodney
      400 10 Ranger, Ken
      400 20 St. John, Henry,$d1908-1973
      400 10 Wilde, Jimmy
      500 10 $wnnnc$aAshe, Gordon,$d1908-1973

    Different names for the same person
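
The record above can be read as a lookup table: the 100 field carries the established heading and each 400 field is a "see" reference from a variant form. The sketch below parses a few of those lines into such a table. It is a simplification written for this example, not a MARC parser; `parse_fields` and `build_see_references` are hypothetical helpers, and only the tag, indicators, and name part of each field are handled.

```python
import re

# A few fields copied from the Creasey record shown above.
RECORD = """\
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Marsden, James
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
"""


def parse_fields(text):
    """Return (tag, name) pairs, keeping only the name part of each field."""
    fields = []
    for line in text.splitlines():
        tag, _indicators, data = line.split(" ", 2)
        data = re.sub(r"^\$w\w+\$a", "", data)  # drop a leading $w...$a prefix
        name = data.split("$")[0].rstrip(",")    # cut dates and other subfields
        fields.append((tag, name))
    return fields


def build_see_references(fields):
    """Map every variant form (400) to the established heading (100)."""
    heading = next(name for tag, name in fields if tag == "100")
    return {name: heading for tag, name in fields if tag == "400"}


see = build_see_references(parse_fields(RECORD))
```

A searcher who enters "Cooke, M. E." is led to "Creasey, John". The 500 field ("Ashe, Gordon") is a "see also" reference to another established heading, so it is deliberately left out of the table.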


    Name Authority Files

      ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
      KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
      RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
      VST:d 08-19-91
      040 OCoLC$cOCoLC
      100 10 Marric, J. J.,$d1908-1973
      500 10 $wnnnc$aCreasey, John
      663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
      670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
      670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
      670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)


    Name authority files

      ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
      KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
      RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
      VST:d 06-06-91 Other Versions: earlier
      040 DLC$cDLC$dDLC$dOCoLC
      100 10 Butler, William Vivian,$d1927-
      400 10 Butler, W. V.$q(William Vivian),$d1927-
      400 10 Marric, J. J.,$d1927-
      670 His The durable desperadoes, 1973.
      670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
      670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

    Different people writing with the same name


    Controlled Vocabularies for Information Access
      "The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W.H. Auden)
      Similarly, there are too many ways of expressing or explaining the topic of a document.
      Controlled vocabularies are sets of Rules for topic identification and indexing, and a THESAURUS, which consists of lead-in vocabulary and a limited and selective Indexing Language, sometimes with special coding or structures.


    Structure of an IR System
      [Figure: Information Storage and Retrieval System, showing the Search Line, the Storage Line, Comparison/Matching, and the Rules of the game (Rules for subject indexing + Thesaurus). Adapted from Soergel, p. 19]


    Uses of Controlled Vocabularies
      Library Subject Headings, Classification and Authority Files.
      Commercial Journal Indexing Services and databases
      Yahoo, and other Web classification schemes
      Online and Manual Systems within organizations
        SunSolve
        MacArthur


    Types of Indexing Languages
      Uncontrolled Keyword Indexing
      Indexing Languages
        Controlled, but not structured
      Thesauri
        Controlled and Structured
      Classification Systems
        Controlled, Structured, and Coded
      Faceted Classification Systems


    Indexing Languages
      An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.
      An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.


    Indexing Languages
      Library of Congress Subject Headings
      Yellow Pages Topics
      Wilson Indexes (Readers' Guide)


    Thesauri
      A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
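
One possible in-memory shape for such a collection is sketched below. The class and method names (`Term`, `Thesaurus`, `link_broader`, `link_related`) and the sample terms are assumptions made for illustration; real thesaurus software and the standards cited later carry more detail, such as scope notes.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    preferred: str
    used_for: set = field(default_factory=set)  # synonymous/equivalent terms
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, preferred):
        """Return the entry for a preferred term, creating it if needed."""
        return self.terms.setdefault(preferred, Term(preferred))

    def link_broader(self, narrow, broad):
        # Broader/narrower links are reciprocal, so record both directions.
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def link_related(self, a, b):
        # Related-term links are symmetric.
        self.add(a).related.add(b)
        self.add(b).related.add(a)


# Sample terms are made up for the example.
t = Thesaurus()
t.link_broader("Cats", "Mammals")
t.link_related("Cats", "Pets")
t.add("Cats").used_for.add("Felines")
```

Recording each link in both directions means a searcher can move up, down, or sideways from whichever term they happen to start at.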


    Thesauri (cont.)
      National and International Standards for Thesauri
        ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
        ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
        ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
        ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri


    Thesauri (cont.)
      Examples:
        The ERIC Thesaurus of Descriptors
        The Art and Architecture Thesaurus
        The Medical Subject Headings (MESH) of the National Library of Medicine


    Why develop a thesaurus?
      To provide a conceptual structure or "space" for a body of information
      To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
      To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material).
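
The two halves of that last goal correspond to the standard retrieval measures usually called recall (how much of the relevant material was retrieved) and precision (how much of the retrieved material is relevant). The slide does not name them; the sketch below is a minimal illustration with made-up document identifiers.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


# Made-up document identifiers for one search.
relevant = {1, 2, 3, 4}
retrieved = {2, 3, 4, 9, 10}

r = recall(retrieved, relevant)     # 3 of the 4 relevant documents found
p = precision(retrieved, relevant)  # 3 of the 5 retrieved documents relevant
```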


    Why develop a thesaurus?
      To provide vocabulary (or terminological) control.
      When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with.
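
In its simplest form this is a lookup from any lead-in term to the one preferred term. The vocabulary in the sketch below is invented for the example, and `preferred_term` is a hypothetical helper.

```python
# Lead-in vocabulary: every entry term points at one preferred term.
# The terms are made up for illustration.
LEAD_IN = {
    "automobiles": "Cars",
    "autos": "Cars",
    "motor cars": "Cars",
    "cars": "Cars",
}


def preferred_term(entry):
    """Lead the indexer or searcher from whatever term they start with
    to the preferred term, or return None if the term is unknown."""
    return LEAD_IN.get(entry.strip().lower())
```

Indexer and searcher both pass through the same table, so a document indexed by someone who thought "automobiles" is still found by someone who typed "autos".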


    Preliminary considerations
      What is used now?
        Continue using an existing thesaurus?
        Ad hoc modification of existing thesaurus?
        Develop a new well-structured thesaurus?
      What is the scope and complexity of the subject field?
      What kind of retrieval objects or data will be dealt with?
      How exhaustive and specific is the desired description of objects?


    Preliminary Considerations
      The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
      It is better to plan for a larger and more comprehensive system than a smaller system that rapidly will become inadequate as the database grows.
      Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.


    Development of a Thesaurus
      Term Selection.
      Merging and Development of Concept Classes.
      Definition of Broad Subject Fields and Subfields.
      Development of Classificatory structure
      Review, Testing, Application, Revision.


    Flow of Work in Thesaurus Construction
      [Figure: flowchart. Based on Soergel, pp 327-333. Boxes in the order extracted:]
      Select Sources
      Assign codes
      Select Terms
      Record Selected Terms
      Sort Terms
      Merge identical Terms
      Define Broad Subject Fields
      Merge Terms in Same Concept class
      Sort Terms into Broad Subject Fields
      Define Subfields within one Subject Field
      Work out detailed structure of the Subject Field
      Select Preferred Terms
      All Subfields of Broad Subject finished? (Yes/No)
      All Broad Subjects finished? (Yes/No)
      Improve Class Structure
      Print Classified Index and review
      Discuss with Experts and Users
      Select descriptors and checklist items
      Produce Full Thesaurus and Check references
      Assign Notation
      Review and Test
      Many Modifications? (Yes/No)
      Revise as needed


    The Indexing Process

    Concept identification

    term selection (via thesaurus)

    term assignment


    Application: The Indexing Process (Manual)
      [Figure: flowchart of the manual indexing process, from Start to End. Adapted from ISO 5963, p. 5]
      Process steps: Examine Document and Identify Significant Concepts; Consider First Concept; Establish Term Denoting Concept; Consider Preferred Term; Select Preferred Term; Consider any associated terms in Thesaurus (NT, BT); Consider Each of These Terms; Prefer Alternative Term(s); Select Alternative term to represent Concept; Admit New Term Into Thesaurus; Assign Terms to Document.
      Decisions (YES/NO): Does Thesaurus contain term for Concept? Preferred Term? Is Term suitable? Would Concept be better represented by one of these terms? Can Concept be expressed combining terms? Is There Another Concept?
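
The flowchart's branch order cannot be recovered from this transcript, so the sketch below is a deliberately simplified reading and not the ISO 5963 procedure. It keeps four of the decisions (is the term preferred, is it a lead-in term, can the concept be expressed by combining terms, should a new term be admitted) and omits the step that considers broader and narrower terms. The function name, its arguments, and the sample vocabulary are all assumptions.

```python
def index_document(concepts, preferred, lead_in):
    """Assign index terms to one document (simplified sketch).

    concepts:  terms the indexer identified in the document
    preferred: set of preferred terms already in the thesaurus
    lead_in:   dict mapping non-preferred terms to preferred terms
    Returns (assigned terms, newly admitted terms).
    """
    vocabulary = set(preferred)  # work on a copy; the caller's set is untouched
    assigned, admitted = [], []
    for concept in concepts:
        if concept in vocabulary:                # thesaurus has a preferred term
            terms = [concept]
        elif concept in lead_in:                 # follow the lead-in reference
            terms = [lead_in[concept]]
        else:
            parts = concept.split()
            if len(parts) > 1 and all(p in vocabulary for p in parts):
                terms = parts                    # express by combining terms
            else:
                vocabulary.add(concept)          # admit a new term
                admitted.append(concept)
                terms = [concept]
        for term in terms:
            if term not in assigned:
                assigned.append(term)
    return assigned, admitted


# Made-up vocabulary and concepts.
assigned, admitted = index_document(
    ["thesaurus", "indexing", "thesauri design", "metadata"],
    preferred={"thesauri", "design", "indexing"},
    lead_in={"thesaurus": "thesauri"},
)
```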


    Classification Systems
      A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.


    Classification Systems (cont.)
      Examples:
        The Library of Congress Classification System
        The Dewey Decimal Classification System
        The ACM Computing Reviews Categories
        The American Mathematical Society Classification System


    Automatic Indexing and Classification
      Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
      More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
      Automatic classification attempts to automatically group similar documents using either:
        A fully automatic clustering method.
        An established classification scheme and set of documents already indexed by that scheme.
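
The simple form of automatic indexing described above amounts to building an inverted index. In the sketch below the stopword list, the tokenizing rule, and the sample documents are assumptions added for illustration; the slide itself speaks of access to all of the words.

```python
import re
from collections import defaultdict

# A tiny stopword list (an assumption; the slide does not mention one).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def keywords(text):
    """Derive keywords: lowercase alphanumeric words minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def build_index(docs):
    """Map each keyword to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index


# Made-up documents.
docs = {1: "Design of a thesaurus", 2: "The thesaurus and the index"}
index = build_index(docs)
```

Nothing here controls the vocabulary: "thesaurus" and "thesauri" would be separate entries, which is the gap the more complex systems try to close.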


    Automatic Class Assignment
      [Figure: a set of documents (Doc) fed to a Search Engine]
      1. Create pseudo-documents representing intellectually derived classes.
      2. Search using document contents
      3. Obtain ranked list
      4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
      Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered
      clusters are order-independent, usually based on an intellectually derived scheme
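
The four steps can be sketched as follows. The slide does not say how the search engine ranks, so cosine similarity over raw word counts is used here purely as a stand-in; the class labels, pseudo-document texts, and function names are likewise made up.

```python
import math
import re
from collections import Counter


def vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(document, pseudo_docs):
    """Steps 2-3: search with the document's contents, get a ranked list."""
    doc_vec = vector(document)
    return sorted(((cosine(doc_vec, vector(text)), label)
                   for label, text in pseudo_docs.items()), reverse=True)


def assign(document, pseudo_docs, n=1, threshold=None):
    """Step 4: up to n classes scoring at or over the threshold,
    or just the top-ranked class when no threshold is given."""
    ranked = rank(document, pseudo_docs)
    if threshold is None:
        return [ranked[0][1]] if ranked else []
    return [label for score, label in ranked[:n] if score >= threshold]


# Step 1: pseudo-documents standing for intellectually derived classes.
pseudo_docs = {
    "medicine": "disease patient treatment clinical drug",
    "computing": "algorithm software computer program data",
    "law": "court statute legal contract judge",
}
doc = "A clinical trial of drug treatment using computer software"
top = assign(doc, pseudo_docs)
over = assign(doc, pseudo_docs, n=3, threshold=0.25)
```

With a threshold the assignment can overlap (a document may land in several classes); without one it is exclusive.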


    References
      Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
      Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
      Standards:
        ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
        ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
        ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri