Automatic Metadata Generation Using Associative-Networks
Marko A. Rodriguez
CCS-3 ‘Tech Talk’
December 7, 2005
http://www.soe.ucsc.edu/~okram
Resources and Metadata
• A resource is any digital-object (e.g. manuscripts, images, video, audio, etc.).
• A resource’s metadata record is a list of attributes describing the resource.
[ EXAMPLE MANUSCRIPT METADATA ] Authors, Institutions, Keywords, Subject Categories, Citations, Year, Publishing Journal, Usage Data
Metadata Record
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <responseDate>2005-09-07T15:25:04Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0412047" metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0412047</identifier>
        <datestamp>2004-12-14</datestamp>
        <setSpec>cs</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
          <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
          <dc:creator>Rodriguez, Marko</dc:creator>
          <dc:creator>Steinbock, Daniel</dc:creator>
          <dc:subject>Computers and Society</dc:subject>
          <dc:subject>Data Structures and Algorithms</dc:subject>
          <dc:subject>Human-Computer Interaction</dc:subject>
          <dc:subject>H.4.2</dc:subject>
          <dc:subject>J.7</dc:subject>
          <dc:subject>K.4.m</dc:subject>
          <dc:description>In societal-scale decision-making systems the collective is faced ...</dc:description>
          <dc:description>Comment: Dynamically Distributed Democracy algorithm</dc:description>
          <dc:date>2004-12-10</dc:date>
          <dc:type>text</dc:type>
          <dc:identifier>http://arxiv.org/abs/cs/0412047</dc:identifier>
          <dc:identifier>North American Association for Computational Social and Organizational Science Conference Proceedings 2004</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
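As a minimal sketch of how a record like this becomes a flat attribute list, the snippet below parses a trimmed copy of the Dublin Core portion with Python's standard library. The trimmed XML, the explicit `xmlns:dc` declaration (the Dublin Core namespace, which the excerpt on the slide does not show), and the helper name `dc_fields` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Dublin Core element namespace (assumption: not shown in the slide's excerpt).
DC = "http://purl.org/dc/elements/1.1/"

# Trimmed copy of the record's <oai_dc:dc> element, for illustration only.
RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A Social Network for Societal-Scale Decision-Making Systems</dc:title>
  <dc:creator>Rodriguez, Marko</dc:creator>
  <dc:creator>Steinbock, Daniel</dc:creator>
  <dc:subject>Computers and Society</dc:subject>
  <dc:date>2004-12-10</dc:date>
</oai_dc:dc>
"""

def dc_fields(xml_text):
    """Return {field name: [values]} for every Dublin Core element found."""
    fields = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    for elem in root.iter():
        if elem.tag.startswith("{" + DC + "}"):
            name = elem.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
            fields[name].append(elem.text)
    return dict(fields)

metadata = dc_fields(RECORD)
print(metadata["creator"])
```

Repeated elements such as `dc:creator` are kept as lists, which is the shape the later co-occurrence comparisons need.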
Problem Statement
• Metadata is costly to generate by hand
• Metadata is hard to extract from raw resources (e.g. audio, video)
• How can we automatically generate metadata for atrophied resource records?
General System Overview
• Generate resource relations with existing metadata in the repository.
– occurrence and/or co-occurrence networks
• Propagate metadata from metadata-rich resources to metadata-limited resources.
– encapsulate metadata in discrete particles and disseminate them over the generated associative network
HEP-TH 2003 Semantic Network
[Figure: semantic network diagram. Nodes: A1–A4, P1–P5, O1–O2, J1–J2, K1–K2, T1–T2. Edge labels: "Author of", "Organization of", "Published journal", "Published time", "Has keyword", "cites".]
Transforming the Semantic Network
Convert the multi-node network into a collection of manuscripts with their associated attributes (metadata record).
– manuscript (resource), with its metadata record:
• Authors
• Citations
• Publication Date
• Keywords
• Organizations
• Journal
Occurrence/Co-Occurrence
• Citation: two manuscripts are connected if one manuscript cites the other.
• Co-Author: two manuscripts are connected if they share the same authors.
• Co-Citation: two manuscripts are connected if they share the same citations.
• Co-Keyword: two manuscripts are connected if they share the same keywords.
• Co-Organization: two manuscripts are connected if they share the same organizations.
• Co-Date: two manuscripts are connected if they share the same publication date.
• Co-Journal: two manuscripts are connected if they share the same journal.
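The edge definitions above can be sketched in a few lines of Python. The toy records, the field names, and the choice to weight a co-occurrence edge by the number of shared values are all assumptions for illustration; the slides only say that two manuscripts "are connected".

```python
from itertools import combinations

# Hypothetical toy records: manuscript id -> metadata record.
records = {
    "P1": {"authors": {"A1", "A2"}, "keywords": {"K1"}, "cites": {"P2"}},
    "P2": {"authors": {"A2"}, "keywords": {"K1", "K2"}, "cites": set()},
    "P3": {"authors": {"A3"}, "keywords": {"K2"}, "cites": {"P2"}},
}

def occurrence_edges(records, field="cites"):
    """Directed edges from a manuscript to each manuscript it references."""
    return {(src, dst)
            for src, rec in records.items()
            for dst in rec[field] if dst in records}

def co_occurrence_edges(records, field):
    """Undirected edges, weighted by the number of shared values of `field`."""
    edges = {}
    for a, b in combinations(sorted(records), 2):
        shared = records[a][field] & records[b][field]
        if shared:
            edges[(a, b)] = len(shared)
    return edges

citation = occurrence_edges(records)
co_author = co_occurrence_edges(records, "authors")
co_keyword = co_occurrence_edges(records, "keywords")
print(citation, co_author, co_keyword)
```

The same `co_occurrence_edges` helper covers the other co-occurrence networks by passing a different field name.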
Network Generation Running Times
• Occurrence: O(N)
– Each resource’s metadata record must be checked once and only once for a direct reference to another resource.
• Co-occurrence: O((N^2 - N) / 2)
– Each resource’s metadata record must be checked against every other resource’s (N^2), except itself (-N), once and only once (1/2).
[Figure: two small diagrams, one with nodes A and B, one with nodes A, B, and C.]
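To make the gap between the two running times concrete, the sketch below counts the comparisons each one implies. The function name `pair_comparisons` and the repository size `n = 1000` are hypothetical choices for illustration.

```python
from itertools import combinations

def pair_comparisons(n):
    """Number of unordered record pairs among n resources: (n^2 - n) / 2."""
    return (n * n - n) // 2

n = 1000  # hypothetical repository size
occurrence_checks = n                      # each record is checked once
co_occurrence_checks = pair_comparisons(n)

# Cross-check the formula against an explicit enumeration of pairs.
enumerated = sum(1 for _ in combinations(range(n), 2))
print(occurrence_checks, co_occurrence_checks, enumerated)
```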
Particle Propagation
• Every resource is given one particle, p_i. This particle contains all the metadata associated with its resource.
• A particle also has an energy value, e_i. The further the particle travels (edge steps), the more its energy value decays.
e_i(t+1) = e_i(t) * (1-\delta)
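Unrolling the decay rule gives a closed form for the energy after t edge steps. The worked numbers below use an initial energy of 1 and \delta = 0.15, both of which are assumed values for illustration and do not come from the slides.

```latex
e_i(t) = e_i(0)\,(1-\delta)^{t}
\qquad
\text{e.g. } e_i(0) = 1,\ \delta = 0.15:\quad
e_i(3) = 0.85^{3} \approx 0.614
```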
Particle Propagation
• At each step the particle takes an outgoing edge of its current node, chosen according to the probability distribution over that node’s outgoing edge set. If the resource it reaches lacks metadata of a particular type, the particle recommends its own metadata of that type to the resource, weighted by the particle’s energy value.
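A minimal sketch of this propagation step follows. It is one possible reading of the slides, not the authors' implementation: the initial energy of 1.0, the default `delta`, the fixed number of `steps`, applying the decay before recommending, and repeating each particle's walk `walks` times to average out randomness are all assumptions, and the names `propagate`, `graph`, and `metadata` are hypothetical.

```python
import random
from collections import defaultdict

def propagate(graph, metadata, field, delta=0.15, steps=3, walks=200, seed=0):
    """Random-walk one particle per resource and collect recommendations.

    graph:    {node: {neighbor: edge weight}}
    metadata: {node: {field: set of values}}
    Returns   {node lacking `field`: {value: accumulated strength}}.
    """
    rng = random.Random(seed)
    recs = defaultdict(lambda: defaultdict(float))
    for source, record in metadata.items():
        values = record.get(field)
        if not values:
            continue  # this particle carries nothing of the requested type
        for _ in range(walks):
            node, energy = source, 1.0
            for _ in range(steps):
                neighbors = graph.get(node)
                if not neighbors:
                    break
                # Choose an outgoing edge in proportion to its weight.
                node = rng.choices(list(neighbors),
                                   weights=list(neighbors.values()))[0]
                energy *= 1 - delta  # e(t+1) = e(t) * (1 - delta)
                if not metadata.get(node, {}).get(field):
                    for value in values:
                        recs[node][value] += energy / walks
    return {node: dict(vals) for node, vals in recs.items()}

# Hypothetical chain A - B - C in which only B lacks journal metadata.
graph = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
metadata = {
    "A": {"journal": {"Journal of Complexity"}},
    "B": {},
    "C": {"journal": {"Journal of Information Science"}},
}
recs = propagate(graph, metadata, "journal")
print(recs)
```

In this toy chain every walk from A or C reaches B at steps 1 and 3, so B receives both journals with equal strength; on a real network the strengths differ with edge weights and distance.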
Metadata Recommendations
• Manuscript A
– Journal (recommendation strength in brackets)
• Journal of Complexity [0.2457]
• Journal of Information Science [0.1]
• Information Processing and Management [0.001]
Mini-Break
Terrorist Alert
System Parameters
• Metadata Density: to validate the algorithm we kill a percentage of the metadata in the system and see if we can reconstruct it using the algorithm (d \in [0,1])
• Metadata Percentile: only those metadata tags in the pth percentile are accepted as valid metadata (p \in [0,1])
** Validation is based on Precision and Recall values
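The validation loop described by these parameters might be sketched as follows. Several readings here are assumptions: `density` is taken as the fraction of resources that keep their metadata, the percentile filter drops the lowest-ranked fraction `p` of a resource's recommendations while always keeping the strongest one, and the function names are hypothetical. The "true" journal in the usage lines is likewise invented for illustration.

```python
import random

def kill_metadata(metadata, field, density, seed=0):
    """Keep `field` on roughly a `density` fraction of resources; hide the rest.

    Returns (reduced metadata, hidden ground truth).
    """
    rng = random.Random(seed)
    reduced, hidden = {}, {}
    for node, record in metadata.items():
        record = dict(record)
        if field in record and rng.random() > density:
            hidden[node] = set(record.pop(field))
        reduced[node] = record
    return reduced, hidden

def accept(recommendations, p):
    """Keep the values whose strength ranks at or above the p-th percentile."""
    ranked = sorted(recommendations.items(), key=lambda kv: kv[1])
    if not ranked:
        return set()
    cutoff = min(int(p * len(ranked)), len(ranked) - 1)  # keep the top value
    return {value for value, _ in ranked[cutoff:]}

def precision_recall(accepted, truth):
    """Precision and recall of an accepted value set against the hidden truth."""
    hits = len(accepted & truth)
    precision = hits / len(accepted) if accepted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Recommendation strengths taken from the Manuscript A example.
recs = {
    "Journal of Complexity": 0.2457,
    "Journal of Information Science": 0.1,
    "Information Processing and Management": 0.001,
}
truth = {"Journal of Complexity"}  # hypothetical hidden value
accepted = accept(recs, 0.5)
print(accepted, precision_recall(accepted, truth))
```

Raising `p` trades recall for precision: with `p = 1.0` only the strongest recommendation survives.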
Results for Co-Author Network (Citation Metadata)
Results for Co-Author Network (Organization Metadata)
Results for Co-Author Network (Keyword Metadata)
Results for Co-Keyword Network (Citation Metadata)
Results for Co-Keyword Network (Journal Metadata)
Results for Citation Network (Author Metadata)
Results for Citation Network (Keyword Metadata)
Results for Citation Network (Journal Metadata)
Take Home Points
• Different edge types are better at propagating different metadata types.
• Can work for any resource type as long as there exists some preliminary vetted metadata and a way to create resource relations. (If there is pre-existing metadata, then resource relations can be created automatically.)
Future Work (part 1)
• What about path types? e.g. take a co-author edge, then a citation edge, etc. Better precision and recall?
• Explore usage metadata. It is applicable to any resource type and allows for cross-resource relations (e.g. manuscripts connected to audio). The weight between two resources is a function of the interval between their downloads from the same IP (Bollen et al. 2004).
Future Work (part 2)
• Application to social networks? Given an unknown individual, infer his attributes according to his social relationships.
– How does ‘work_with’ differ from ‘married_to’? They share the same income metadata and religious belief metadata, respectively.
Conclusion
• Good life…
Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks”, [unpublished], 2005.
Know of a good journal venue?