Content-based retrieval in digital libraries


January 1998 93

Technical Activities Forum

With the recent developments in multimedia and telecommunication technologies, content-based information is becoming increasingly important for various areas such as digital libraries, interactive video, and multimedia publishing (see P. Aigrain, H. Zhang, and D. Petkovic, “Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review,” Multimedia Tools and Applications, Vol. 3). Multimedia data refers to simple structured data (such as numbers and short strings), large unstructured data (such as text documents, images, audio, and video data), and complex structured data (such as maps, graphs, charts, and tables).

Here we briefly address content-based retrieval and the issues of representation, storage, and retrieval of multimedia objects in digital libraries. We then very briefly identify some open areas of research. Our expanded version of this column and a more extensive list of references are available from us (“Content-Based Multimedia Information Retrieval in Digital Libraries,” tech. report, Center for Information Management, Integration and Connectivity, Rutgers University, 1997). For more on content-based retrieval, see http://ciir.cs.umass.edu/info/ciirbiblo.html and http://www.mitre.org/resources/centers/advanced_info/g04f/bnn/mmhomeext.html.

OBJECT RETRIEVAL

Digital libraries must store and retrieve multimedia data on the basis of feature similarity. A feature is a set of characteristics. Features include text strings for text documents; color, texture, and objects for images; and objects, frame sequences, and camera operations for videos. Content-based retrieval uses content-representative metadata to both store data and retrieve it in response to user queries. Metadata is data about the media objects stored. Manually collecting metadata is not only inefficient but also infeasible for large document spaces, so we need automatic metadata generation.

Once collected, these content descriptors are linked to the physical location of data. Data-storage strategies are key to efficient retrieval. To facilitate retrieval, we classify metadata as

• Content-dependent. Metadata based on some characteristics specific to the content of the media objects. For example, text strings in text documents; the color, texture, and position of objects in an image; and individual frame characteristics, such as color histograms, for video objects.

• Content-independent. Metadata that is not based on the content. For example, names of authors and years of publication.

• Content-descriptive. Metadata that describes the characteristics of the media objects but cannot be generated automatically. For example, image characteristics like the mood reflected by a facial expression and camera shot distance.

Once retrieved, data should be presented to the users in decreasing order of retrieval status value, which is a function of the relevancy and uniqueness of the features used for indexing and retrieval.

TEXT RETRIEVAL

Metadata for text documents includes content description, storage information, and historical status information. Several methods have been suggested for deriving metadata from SGML (Standard Generalized Markup Language) and digitized text documents. These include TextTiling algorithms used for topic identification, multiresolution morphology for text line identification for keywords, and the Hidden Markov Model algorithm for keyword spotting.

Several methods are available for identifying relevant text documents in response to user queries. These include searching for key index words by full-text scans, using index files, and using document clusters. Full-text scanning methods have been designed using finite state transition diagrams that locate index words in full-text documents. Inverted index files are also used to link keywords with the documents in which they occur. The index file itself can be structured using a B-tree linking the feature with the physical storage locations of the documents. Rich Holowczak has described how information extraction was applied to the problem of information retrieval of documents (“Extractors for Digital Library Objects,” PhD thesis, Rutgers University, 1997).

Nabil R. Adam, Rutgers University

Aryya Gangopadhyay, University of Maryland, Baltimore County

Technical Activities Forum coordinator: Deborah Scherrer, Stanford University, HEPL-4085, Stanford, CA 94305-4085; fax (650) 725-2333; [email protected]

IMAGE RETRIEVAL

Image metadata includes raster data, data about the data types and data sets that represent the image data, and image processing history. Metadata extraction is intended to use the image features in storing and retrieving the images efficiently. This involves identifying the image features, locating objects in images, and classifying the images based on the features extracted. Image-segmentation algorithms are used for locating an object within an image. Two approaches for object location are boundary detection, which isolates objects by detecting their boundaries, and region approaches, which identify the region that falls within the object.

Retrieving image data is accomplished by comparing the spatial layout and relationships among objects in the image and image features such as color and texture similarity. Retrieval methods that rely on the structural layout of the image use object boundaries and the spatial relationships among the objects contained in those boundaries. Boundary-based methods include those using minimum bounding rectangles (MBR) and the plane sweep method.
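
As an illustration of the minimum bounding rectangle idea, the following Python sketch computes an MBR from the pixel coordinates of a segmented object and tests two simple spatial relationships. The names `MBR`, `bounding_rectangle`, `overlaps`, and `left_of` are hypothetical and chosen only for this example; real systems would use richer relationship sets.

```python
from typing import Iterable, NamedTuple, Tuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def bounding_rectangle(points: Iterable[Tuple[int, int]]) -> MBR:
    """Return the smallest axis-aligned rectangle enclosing all points."""
    xs, ys = zip(*points)
    return MBR(min(xs), min(ys), max(xs), max(ys))


def overlaps(a: MBR, b: MBR) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


def left_of(a: MBR, b: MBR) -> bool:
    """True when rectangle a lies entirely to the left of rectangle b."""
    return a.x_max < b.x_min


# Two objects described by a few of their boundary pixels.
house = bounding_rectangle([(2, 1), (5, 1), (5, 4), (2, 4)])
tree = bounding_rectangle([(7, 0), (9, 6), (8, 3)])
```

Comparing rectangles instead of full object outlines keeps the spatial test cheap, at the cost of treating every object as if it filled its rectangle.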

Plane sweep algorithms work by sweeping an image with a horizontal and a vertical line and identifying the coordinates of the points of intersection between the lines and the objects in an image. These points, called event points, identify the spatial extents of the objects in an image. Methods for identifying spatial relationships among the objects contained in an image include 2D strings and 2D-C strings.

Image data can also be retrieved on the basis of features such as color and the coarseness, contrast, and directionality of textures. Feature-based retrieval methods require a feature mapping function that measures the distance between a query image and the image data. For efficient retrieval, image data is clustered and indexed on the basis of feature similarity. Methods for color similarity are based on color histograms and correlations.
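
A minimal sketch of one possible feature mapping function follows: each image is reduced to a normalized color histogram, and the distance between two images is the L1 distance between their histograms. The quantization into four levels per channel, the choice of L1 distance, and all function names are assumptions made for illustration, not a method prescribed by the column.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue), each 0-255
Histogram = Dict[Tuple[int, int, int], float]


def color_histogram(pixels: Sequence[Pixel], levels: int = 4) -> Histogram:
    """Quantize each channel into `levels` bins and return normalized counts."""
    counts = Counter(
        (r * levels // 256, g * levels // 256, b * levels // 256)
        for r, g, b in pixels
    )
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_distance(h1: Histogram, h2: Histogram) -> float:
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    bins = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in bins)


def rank_by_color(query: Sequence[Pixel],
                  library: Dict[str, Sequence[Pixel]]) -> List[str]:
    """Return image names ordered from most to least similar to the query."""
    q = color_histogram(query)
    return sorted(library,
                  key=lambda name: histogram_distance(q, color_histogram(library[name])))


red = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue = [(0, 0, 255)] * 4
ranking = rank_by_color(red, {"blue": blue, "mostly_red": mostly_red})
```

In a larger collection the histograms would be computed once at indexing time and stored as metadata, so that a query only pays for the distance computations.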

VIDEO RETRIEVAL

Video object metadata can be divided into that which describes a sequence of frames and that which describes individual frames. For frame sequences, the metadata includes information corresponding to the whole video, such as camera shot heights, distances, and motions. For individual frames, the metadata includes color histograms, textures, and objects covered.

Metadata about video objects is gathered by a video parser that identifies the shot boundaries between consecutive frames. To detect the shot boundaries, a quantitative metric gauges the information content of each frame. Whenever the difference between the metrics of two consecutive frames exceeds a predetermined threshold, a boundary is detected. Examples of metrics include comparisons of pixels or blocks, and color or intensity histograms. Camera operations and object motions are captured using motion vectors and motion analysis of each block in a frame.
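
The threshold test described above can be sketched in Python as follows. Frames are assumed to be flat sequences of grayscale intensities (0 to 255), the metric is an intensity histogram, and the difference is the normalized sum of absolute bin differences; the function names and the threshold value are hypothetical.

```python
from typing import List, Sequence


def intensity_histogram(frame: Sequence[int], bins: int = 8) -> List[int]:
    """Count grayscale pixel values (0-255) falling into each of `bins` bins."""
    hist = [0] * bins
    for value in frame:
        hist[value * bins // 256] += 1
    return hist


def detect_shot_boundaries(frames: Sequence[Sequence[int]],
                           threshold: float,
                           bins: int = 8) -> List[int]:
    """Return every index i where a boundary falls between frame i-1 and frame i."""
    boundaries = []
    previous = None
    for i, frame in enumerate(frames):
        hist = intensity_histogram(frame, bins)
        if previous is not None:
            # Histogram difference, normalized by the number of pixels per frame.
            diff = sum(abs(a - b) for a, b in zip(hist, previous)) / len(frame)
            if diff > threshold:
                boundaries.append(i)
        previous = hist
    return boundaries


# Two dark frames followed by two bright frames: one cut, before frame 2.
video = [
    [10, 20, 30, 15],
    [12, 22, 28, 18],
    [200, 220, 240, 210],
    [205, 215, 235, 212],
]
cuts = detect_shot_boundaries(video, threshold=0.5)
```

A single fixed threshold finds abrupt cuts like this one; gradual transitions such as fades would need a different test, which this sketch does not attempt.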

Metadata about video objects typically spans several frames and is therefore stored as intervals. Such information can be used to index and retrieve video data in response to user queries. Data structures such as the segment index tree have been suggested for indexing and retrieval of frame-based video information.
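
To make the interval view of video metadata concrete, the sketch below stores each annotation as a frame interval and answers a query by a linear scan. This is only a stand-in for the query a segment index tree would answer without scanning every interval; it is not an implementation of that structure, and the `Annotation` type and `overlapping` function are hypothetical.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """Metadata that holds for a run of consecutive frames."""
    start: int   # first frame, inclusive
    end: int     # last frame, inclusive
    label: str


def overlapping(annotations: List[Annotation], first: int, last: int) -> List[str]:
    """Return labels of all annotations whose interval intersects [first, last]."""
    return [a.label for a in annotations if a.start <= last and first <= a.end]


annotations = [
    Annotation(0, 120, "zoom in"),
    Annotation(100, 300, "car in view"),
    Annotation(301, 450, "pan left"),
]
at_frame_110 = overlapping(annotations, 110, 110)
```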

Feature selection

Feature selection is the process of identifying individual characteristics of stored objects that can be used in their efficient retrieval. It facilitates both indexing and retrieval.

The indexing function is used to create description vectors for stored objects. This is accomplished by assigning weights to some predefined features (such as keywords for text documents). When a query is processed, a similar method is used to generate a query description vector.

The retrieval function creates a retrieval status value (RSV) for each object-query pair. The retrieved objects are then presented to the user in decreasing order of the RSV. At this point, users may input their feedback about the relevancy of the objects retrieved. This feedback can be applied using relevance feedback techniques, and the query can be further refined and reprocessed.

The weights in the object and query vectors are calculated using feature frequencies and inverted object frequencies. The feature frequency of an object denotes the number of times an indexing feature occurs in it. Inverted object frequency is an inverse function of the number of objects in which the feature occurs.
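
The indexing and retrieval functions can be sketched together in a few lines of Python. Two choices here are assumptions, since the text only says that inverted object frequency is "an inverse function" and does not define the RSV formula: the sketch uses log(N / n_f) for the inverted object frequency and cosine similarity for the RSV. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def inverted_object_frequencies(objects: Dict[str, List[str]]) -> Vector:
    """Assumed form: log(N / n_f), where n_f is the number of objects containing f."""
    n = len(objects)
    counts = Counter(f for feats in objects.values() for f in set(feats))
    return {f: math.log(n / c) for f, c in counts.items()}


def description_vector(features: List[str], iof: Vector) -> Vector:
    """Weight = feature frequency in the object times inverted object frequency."""
    return {f: freq * iof.get(f, 0.0) for f, freq in Counter(features).items()}


def rsv(query: Vector, obj: Vector) -> float:
    """Assumed retrieval status value: cosine similarity of the two vectors."""
    dot = sum(w * obj.get(f, 0.0) for f, w in query.items())
    norm = (math.sqrt(sum(w * w for w in query.values()))
            * math.sqrt(sum(w * w for w in obj.values())))
    return dot / norm if norm else 0.0


def retrieve(query_features: List[str],
             objects: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (object name, RSV) pairs in decreasing order of RSV."""
    iof = inverted_object_frequencies(objects)
    q = description_vector(query_features, iof)
    scored = [(name, rsv(q, description_vector(feats, iof)))
              for name, feats in objects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


library = {
    "doc_a": ["video", "index", "video"],
    "doc_b": ["image", "index"],
    "doc_c": ["image", "color"],
}
ranked = retrieve(["video"], library)
```

Relevance feedback would fit into this sketch as an adjustment of the query vector between two calls to `retrieve`; that step is not shown.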

About the Task Force on Digital Libraries

The IEEE Computer Society Task Force on Digital Libraries promotes activities and furthers the growth of the theory and practice of all aspects of digital libraries. The task force sponsors a semiannual newsletter and the International Journal on Digital Libraries. Issues of interest include acquiring and storing information, finding and filtering information, securing information and auditing access, providing universal access, cost management and financial instruments, and socioeconomic impact (see N. Adam and Y. Yesha, “Digital Libraries: Introduction,” Int’l J. Digital Libraries, Apr. 1996). To contribute to the newsletter, contact Sue Feldman at [email protected] or see http://cimic.rutgers.edu/ieeedln/. To join, follow the links off the CS Web page, http://computer.org.

Data storage

The storage strategy of a digital library is dictated by the type of objects stored and the real-time retrieval requirements of the stored objects. The different storage strategies can be divided into single disk, multiple disks, and multiple disks with striping. Single-disk storage involves storing all media objects on a single disk. Retrieval speed in such storage strategies is determined by the disk bandwidth. In the single-disk storage method, data blocks can be arranged contiguously, scattered randomly across disk blocks, or distributed over the disk blocks in a constrained or log-structured manner.

Multiple-disk storage systems distribute the objects across multiple disks through disk striping, which facilitates concurrent data access. In staggered striping techniques, a disk is treated as an individual storage unit. The number of disks over which a sub-object is striped depends on the bandwidth requirement. This method facilitates retrieval of media objects with different data retrieval rates. In the network striping method, multiple servers are used to manage a group of clusters that are connected to a network. Both simple and staggered techniques are used to stripe the data. Although this method improves data transfer rates for multimedia objects, its performance is dependent on network bandwidth.
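
The two striping layouts can be illustrated with a short Python sketch. Simple striping is shown as round-robin block placement. The staggered layout assumes that sub-object i starts at disk (i × stride) mod D and spans `width` consecutive disks, with `width` chosen from the object's bandwidth requirement; these parameter names and the exact placement rule are assumptions made for the example, not details given in the column.

```python
from typing import List


def stripe_blocks(num_blocks: int, num_disks: int) -> List[List[int]]:
    """Simple striping: block i is placed on disk i mod num_disks."""
    disks: List[List[int]] = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disks[block % num_disks].append(block)
    return disks


def staggered_layout(num_subobjects: int, width: int,
                     stride: int, num_disks: int) -> List[List[int]]:
    """For each sub-object, list the disks that hold its fragments.

    Assumed rule: sub-object i starts at disk (i * stride) mod num_disks
    and is striped over `width` consecutive disks.
    """
    return [
        [(i * stride + j) % num_disks for j in range(width)]
        for i in range(num_subobjects)
    ]


simple = stripe_blocks(num_blocks=7, num_disks=3)
staggered = staggered_layout(num_subobjects=3, width=2, stride=1, num_disks=4)
```

Under these assumptions, an object that needs more bandwidth is given a larger `width`, so objects with different data rates can share the same set of disks.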

Although there have been many developments in multimedia technology, there remain many problems in the field of content-based retrieval. Open problems include the automatic extraction of multimedia object features and indexing, querying, and searching on the basis of multiple similarity features. The automatic extraction of features requires content analysis. To develop a content analyzer, we must determine the level of understanding of the content that is required to perform such analysis.

New methods of similarity-based indexing such as statistical and multidimensional analysis should be investigated. As a first step toward developing generalized solutions, theoretical research in this area should look into specific application-oriented problems. ❖

Nabil Adam is a professor at Rutgers University and chair of the Task Force on Digital Libraries. Contact him at [email protected].

Aryya Gangopadhyay is an assistant professor at the University of Maryland, Baltimore County. Contact him at [email protected].

The IEEE Computer Society and Silicon Graphics/Cray Research have announced that they will honor Seymour Cray’s legacy of innovation and genius with an annual award for innovation in high-performance computing. The award, announced at Supercomputing 97, has been endowed by a $280,000 gift from Silicon Graphics/Cray Research.

“Seymour Cray,” said 1997 CS President Barry W. Johnson, “was an outstanding engineer and a true pioneer in the computer industry. His many innovations helped create the supercomputing field and contributed substantially to the improvement of society in general. This award will recognize individuals who, like Cray, work creatively and diligently to provide innovative solutions to problems in computing engineering.”

The Computer Society is the ideal host for this award, said Irene Qualters, president of Cray Research and senior vice president of Silicon Graphics. “The society’s passion and dedication toward profound advances in engineering are exemplified in Seymour Cray’s career. This award is a celebration of that legacy.”

The IEEE Computer Society Seymour Cray Computer Engineering Award will be given to individuals “for innovative contributions to high-performance computing systems that best exemplify the creative spirit demonstrated by Seymour Cray.” The award will be given annually, with the first winner selected in 1998. Recipients, who will receive a cash prize and a memento, will be selected through an extensive and thorough review process. The award nomination procedures will be publicized in Computer, in other publications, and on the society’s Web site, http://computer.org.

Widely considered to be the founder of supercomputing, Cray was known for his passion for technological creativity and his constant search for new ideas. He founded Cray Research in 1972, with a proclaimed mission of designing and building the world’s most powerful and usable computers. When it was introduced in 1976, the Cray-1 supercomputer set a new standard in super-

Send CS news to Computer, 10662 Los Vaqueros Cir., PO Box 3014, Los Alamitos, CA 90720-1314; [email protected]

CS Update

New Award to Honor Seymour Cray
