
Web Engineering

Semantic Annotation and Transcoding: Making Web Content More Accessible

Katashi Nagao, IBM Research
Yoshinari Shirai, NTT Communication Science Labs
Kevin Squire, University of Illinois

We propose a method for constructing a superstructure on the Web using XML and external annotations to Web documents. We have three approaches for annotating documents—linguistic, commentary, and multimedia. The result is annotated documents that computers can understand and process more easily, allowing content to reach a wider audience with minimal overhead.

Most of us have created Web content using only human-friendly information. Of course, this makes sense—after all, humans are the intended users of the information. However, with the continuing explosion of Web content—especially multimedia content—we argue that contents should be created in a more machine-friendly format—one that lets machines better understand and process documents. In this article, we propose a system to annotate documents externally with more information in order to make them easier for computers to process. Hori et al.1 suggested the original concept of external annotation. Here we use this term in a more general sense.

Why would we want to make content more machine friendly? Well, annotated documents are easier to personalize. For example, a computer can process textual information annotated with parts of speech and word senses much better than plain text and can therefore produce a nice, grammatically correct personalized summary formatted for a cellular phone—or translate a document from English to Japanese. Normally, when dealing with nonannotated text, transcoding to many different formats requires a lot of task-specific effort for each format. However, by using document annotation, content providers put in some extra work early on, but receive the benefits of being able to transcode to an endless variety of formats and personal tastes easily, thus reaching a much wider audience with less overall effort.

Overview

Our idea of a Web superstructure consists of layers of content and metacontent. The first layer corresponds to the set of metacontents of base documents. The second layer corresponds to the set of metacontents of the first layer, and so on. We generally consider such metacontent as external annotation. A good example of external annotation is external links that can be defined outside of the set of link-connected documents. We've discussed these external links in the Extensible Markup Language (XML) community, but they haven't yet been implemented in the current Web architecture (see http://www.w3.org/XML/).

Another popular example of external annotation is comments or notes on Web documents created by people other than the author. This kind of annotation is useful for readers evaluating the documents. For example, images without alternative descriptions aren't understandable to visually challenged people. If comments accompany these images, these people can understand the image contents by listening via speech transcoding.

We can easily imagine that an open platform for creating and sharing annotations would greatly extend the Web's expressive power and value. Many of the benefits our system offers could be obtained by converting the underlying content into suitably annotated documents. While this conversion may eventually take place on the Web, our system allows easy adaptation of existing Web documents for transcoding, even when the annotator isn't the owner of the original document.

Recent activities on the Semantic Web (http://www.semanticweb.org) try to annotate Web documents with ontology-based semantic descriptions based on the Resource Description Framework, or RDF (see http://www.w3.org/TR/REC-rdf-syntax/). Their approach seems top-down, and there's a big gap between the current Web and the Semantic Web. Our approach is bottom-up: our annotation system lets ordinary people annotate their Web documents with some additional and useful information by using our user-friendly tools.

Content adaptation

Annotations don't just increase the expressive power of the Web—they also play an important role in content reuse. An example of content reuse is the transformation of content depending on user preferences.

Content adaptation is a type of transcoding that considers a user's environment—devices, network bandwidth, profiles, and so on. In addition, such adaptation sometimes requires a good understanding of the original document contents. If the transcoder fails to analyze a document's semantic structure, then the transcoding results may not be accurate and may cause user misunderstanding.

Our technology assumes that external annotations help machines understand document contents so that transcoding can be of higher quality. We call such transcoding based on annotation semantic transcoding. Figure 1 shows the overall configuration of the semantic transcoding system.

Our system features three new parts: an annotation editor, an annotation server, and a transcoding proxy server. The remaining parts of the system consist of a conventional Web server and a browser.

IBM published previous work on device-dependent adaptation of Web documents at http://www-4.ibm.com/software/webservers/transcoding/. The system it developed can dynamically filter, convert, or reformat data for content sharing across disparate systems, users, and pervasive computing devices.

IBM claims that the transcoding benefits include

1. eliminating the expense of reauthoring or porting data and content-generating applications for multiple systems and devices,

2. improving mobile employee communications and effectiveness, and

3. creating easier access for customers who use a variety of devices to purchase products and services.

The technology enables the modification of Web documents, such as converting images to links to retrieve images, converting simple tables to bulleted lists, removing features not supported by a device such as JavaScript or Java applets, removing references to image types not supported by a device, and removing comments. It can also transform XML documents by selecting and applying the right stylesheet for the current request based on information in the relevant profiles. These profiles for preferred transcoding services are defined for an initial set of devices.

Our transcoding system involves deeper levels of document understanding. Therefore, it needs human intervention in the machine understanding of documents. External annotation is a guide for machine understanding. Of course, some profiles for user contexts will work as a guide for transcoding, but it's clear that such profiles are insufficient for transcoders to recognize and manipulate fundamental document features.

Figure 1. Semantic transcoding system.

Various researchers have studied content personalization for some specific purposes. Brusilovsky et al.2 developed an online textbook authoring tool that adapts to students' level of understanding. Their technique involves embedding tags into the original document. Milosavljevic et al.3 developed systems for dynamically producing documents from abstract data and information about the user. In both cases, the amount and type of personalization or adaptation is limited compared to our system. It's also much more difficult to adapt current contents to fit these systems, whereas our system will work immediately with content anywhere on the Web.

Fink et al.4 discussed a technique for adapting Web information for handicapped individuals. We're also planning to extend our transcoding work for accessibility, which means all people—including seniors and people with any kind of disability—can easily use information systems.

Knowledge discovery

Another use of annotations is in knowledge discovery, where intelligent software agents automatically mine huge amounts of Web contents for essential points. Unlike conventional search engines that retrieve Web pages using user-specified keywords, knowledge miners create a single document that satisfies a user's request. For example, the knowledge miner may generate a summary document on a certain company's product strategy for the year from many kinds of information resources about its products on the Web.

Currently, we're developing an information collector that gathers documents related to a topic and generates a document containing a summary of each document.

External annotation

We developed a simple method for associating external annotations with any element of any HTML document. We use URLs, XPaths, and document hash codes (digest values) to identify HTML elements in documents. We also developed an annotation server that maintains the relationship between contents and annotations and transfers requested annotations to a transcoder.

We represent our annotations as XML-formatted data and divide them into three categories—linguistic, commentary, and multimedia annotation. Multimedia annotation (especially of video) uses a combination of the other two types of annotation.
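For illustration, a minimal Python sketch of such an identifier record follows; the field names and the choice of MD5 as the digest function are our assumptions, not details from the actual system.

import hashlib

def make_annotation_record(url, html, xpath, annotation, ann_type):
    # The digest pins the annotation to one version of the document,
    # so the server can detect later modifications to the page.
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    return {
        "url": url,            # which document
        "xpath": xpath,        # which element inside it
        "digest": digest,      # which version of the document
        "annotation": annotation,
        "type": ann_type,      # "linguistic", "commentary", or "multimedia"
    }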

Annotation environment

Our annotation environment consists of a client-side editor for the creation of annotations and a server for the management of annotations. Figure 2 shows the annotation environment.

The process flows as follows (in this example, the server processes HTML files):

1. The user runs the annotation editor and requests a URL as a target of annotation.

2. The annotation server accepts the request and sends it to the Web server.

3. The annotation server receives the Web document.

4. The server calculates the document hash code (digest value) and registers the URL with the code in its database (see the sketch after these steps).

5. The server returns the Web document to the editor.

6. The user annotates the requested document and sends the result to the server, along with some personal data (name, professional areas, and so on).

7. The server receives the annotation data and relates it to its URL in the database.

8. The server also updates the annotator profiles.
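A minimal sketch of the server-side bookkeeping in steps 4, 7, and 8, assuming in-memory dictionaries in place of the real database:

import hashlib

url_digests = {}   # URL -> document hash code (step 4)
annotations = {}   # URL -> list of annotation records (step 7)
profiles = {}      # annotator name -> personal data (step 8)

def register_url(url, html):
    url_digests[url] = hashlib.md5(html.encode("utf-8")).hexdigest()

def register_annotation(url, record, annotator, personal_data):
    annotations.setdefault(url, []).append(record)
    profiles[annotator] = personal_data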

Figure 2. Annotation environment.

Annotation editor

Our annotation editor, implemented as a Java application, can communicate with the annotation server (explained in the next section).

The annotation editor

1. registers targets of annotation with the annotation server by sending URLs,

2. specifies elements in the document using a Web browser,

3. generates and sends annotation data to the annotation server, and

4. reuses previously created annotations when the target contents are updated.

Figure 3 shows an example screen of our annotation editor. The top left window of the editor shows the HTML document's object structure. The right window shows some text selected in the Web browser. The editor automatically assigns the selected area an XPath—that is, a location identifier in the document (see http://www.w3.org/TR/xpath.html). The bottom left window shows the linguistic structure of the sentence in the selected area.

Figure 3. Screen shot of the annotation editor.

Using the editor, the user annotates text with the linguistic structure (grammatical and semantic structure, described later) and adds a comment to an element in the document. The editor is capable of basic natural language processing and interactive disambiguation. The user should verify and modify the results of the automatically analyzed sentence structure (Figure 4).

Figure 4. Screen shot of the annotation editor with linguistic structure editor.

Annotation server

Our annotation server receives annotation data from an annotator and classifies it according to the annotator's name. The server retrieves documents from the URLs in annotation data and registers the document hash codes with their URLs in its annotation database. Since a document's author may modify the document after the initial retrieval, the hash code of a document's internal structure, or DOM (Document Object Model), enables the server to discover modified elements in the annotated document.5

The annotation server makes a table of annotator names, URLs, XPaths, and document hash codes. When the server accepts a URL as a request from a transcoding proxy, the server returns a list of XPaths with associated annotation files, their types (linguistic or commentary), and a hash code. If the server receives an annotator's name as a request, it responds with the set of annotations that the specified annotator created.
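One plausible way to realize the DOM-based change detection, sketched under the assumption that the document is available as well-formed XML, is a recursive hash per element, so an edit anywhere below an element changes that element's hash as well; this is our reading of the technique, not necessarily the published implementation.

import hashlib
import xml.etree.ElementTree as ET

def subtree_hash(element):
    # Hash an element from its tag, its text, and its children's
    # hashes; any change in the subtree changes this value.
    h = hashlib.md5()
    h.update(element.tag.encode("utf-8"))
    h.update((element.text or "").strip().encode("utf-8"))
    for child in element:
        h.update(subtree_hash(child).encode("utf-8"))
    return h.hexdigest()

# Comparing stored and recomputed hashes top-down then locates
# the modified elements of an annotated document.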

We're currently developing a mechanism for access control between annotation servers and normal Web servers. If authors of original documents don't want anyone to annotate their documents, they can add a statement to that effect in the documents, and annotation servers won't retrieve such contents for the annotation editors.

Linguistic annotation

The purpose of linguistic annotation is to make text on the Web machine understandable (on the basis of a new tag set) and to develop better quality content-based presentation, retrieval, question-answering, summarization, and translation systems. The Global Document Annotation (GDA) project proposed a new tag set (see http://www.i-content.org/GDA/). It's based on XML and is designed to be as compatible as possible with HTML; the Text Encoding Initiative (TEI, see http://www.uic.edu:80/orgs/tei); the Corpus Encoding Standard (CES, http://www.cs.vassar.edu/CES/); the Expert Advisory Group on Language Engineering Standards (Eagles, http://www.ilc.pi.cnr.it/EAGLES/home.html); and the Linguistic Annotation Language (LAL).6 It specifies modifier–modifiee relations, anaphor-referent relations, word senses, and so on.

A GDA-tagged sentence looks like

<su><np rel="agt" sense="time0">Time</np>
<v sense="fly1">flies</v>
<adp rel="eg">
<ad sense="like0">like</ad>
<np>an <n sense="arrow0">arrow</n></np>
</adp>.</su>

The tag <su> refers to a sentential unit. The other tags above—<n>, <np>, <v>, <ad>, and <adp>—mean noun, noun phrase, verb, adnoun or adverb (including preposition and postposition), and adnominal or adverbial phrase, respectively. A more detailed description of the GDA tag set can be found at http://www.i-content.org/GDA/tagset.html.

The rel attribute encodes a relationship in which the current element stands with respect to the element on which it semantically depends. Its value is called a relational term. A relational term denotes a binary relation—which may be a thematic role such as agent, patient, or recipient—or a rhetorical relation such as cause, concession, and so on. For instance, in the GDA-tagged sentence above, the first element <np rel="agt" sense="time0">Time</np> depends on the second element <v sense="fly1">flies</v>; rel="agt" means that Time has the agent role with respect to the event denoted by flies. Finally, the sense attribute encodes a word sense.
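Because GDA markup is well-formed XML, a transcoder can read these relations directly. The following illustration (ours, not the GDA project's code) parses the example sentence with Python's standard XML parser and lists each element's relation and word sense:

import xml.etree.ElementTree as ET

tagged = ('<su><np rel="agt" sense="time0">Time</np>'
          '<v sense="fly1">flies</v>'
          '<adp rel="eg"><ad sense="like0">like</ad>'
          '<np>an <n sense="arrow0">arrow</n></np></adp>.</su>')

for elem in ET.fromstring(tagged).iter():
    print(elem.tag, elem.get("rel"), elem.get("sense"))
# Prints, e.g., "np agt time0", showing that Time fills the agent role.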

Automatic morphological analysis, interactive sentence parsing, and word sense disambiguation by selecting the most appropriate paraphrase generate the linguistic annotation. Some research issues on linguistic annotation relate to reducing the annotation cost to a feasible level. We've been developing machine-guided annotation interfaces that conceal the complexity of annotation. Machine-learning mechanisms also contribute to reducing the cost because they can gradually increase the accuracy of automatic annotation.

In principle, the tag set doesn't depend on language, but as a first step we implemented a semiautomatic tagging system for English and Japanese.

Commentary annotation

Commentary annotation mainly annotates nontextual elements like images and sounds with some additional information. Each comment can include tagged texts and other images and links. Currently, this type of annotation appears in a subwindow overlaid on the original document window when a user positions the mouse pointer over a comment-added element (Figure 5).

Figure 5. Comment overlay on the document.

Users can also annotate text elements with information such as paraphrases, correctly spelled words, and underlines. We use this type of annotation for text transcoding that combines such comments with the original texts.

Our system also features commentary annotation on hyperlinks. This contributes to quick introduction of target documents before the user clicks the links. If there are linguistic annotations on the target documents, the transcoders can generate summaries of these documents and relate them to hyperlinks in the source document.

Previously, some research has been published concerning sharing comments over the Web. ComMentor is a general metainformation architecture for annotating documents on the Web.7 This architecture includes a basic client-server protocol, metainformation description language, server system, and remodeled NCSA Mosaic browser with interface augmentations to provide access to its extended functionality. ComMentor provides a general mechanism for shared annotations, which lets people annotate arbitrary documents at any position in-place, share comments or pointers with other people (either publicly or privately), and create shared landmark reference points in the information space. Several annotation systems have a similar direction, such as the group annotation transducer.8

These systems are often limited to particular documents or documents shared only among a few people. Our annotation and transcoding system can handle multiple comments on any element of any document on the Web. Also, we can add a community-wide access control mechanism to our transcoding proxy. If a user isn't a member of a particular group, then the user can't access a transcoding proxy that's for group use only. In the future, transcoding proxies and annotation servers will communicate with a secure protocol that prevents other servers or proxies from accessing the annotation data.

Our main focus is the adaptation of Web contents to users, and sharing comments in a community is one of our additional features. We apply both commentary and linguistic annotations to semantic transcoding.

Multimedia annotation

We can also apply our annotation technique to multimedia data, such as digital video. Digital video is becoming a prevalent information source. Since these collections are growing to a huge number of hours, it's necessary to summarize and browse them in a short time without losing significant content. We've developed techniques for semiautomatic video annotation that use text to describe a video's content. Our techniques use some video analysis methods, such as automatic cut detection, characterization of frames in a cut, and scene recognition using similarity between several cuts.

Other approaches to video annotation exist. For example, MPEG-7 is an effort within the Moving Picture Experts Group (MPEG) of the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) that's dealing with multimedia content description (see http://www.drogo.cselt.stet.it/mpeg/standards/mpeg-7/mpeg-7.htm).

Using content descriptions, MPEG-7 is concerned with the transcoding and delivery of multimedia content to different devices. MPEG-7 will potentially allow greater input from content publishers in guiding how to transcode multimedia content in different situations and for different client devices. MPEG-7 also provides object-level descriptions of multimedia content. This allows a higher granularity of transcoding in which individual regions, segments, objects, and events in image, audio, and video data can be transcoded differentially—depending on publisher and user preferences, network bandwidth, and client capabilities.

Our method will be integrated into tools for authoring MPEG-7 data. However, we don't currently know when MPEG-7 technology will be widely available.

Our video annotation includes automatic segmentation of video, semiautomatic linking of video segments with corresponding text segments, and interactive naming of people and objects in video frames.

Video annotation occurs through the following steps:

1. For each video clip, the annotation system creates the text corresponding to its content. We employ speech recognition for the automatic generation of a video transcript. The speech recognition module also records correspondences between the video frames and the words. The transcript isn't required to describe the whole video content. The resolution of the description affects the final quality of the transcoding (for example, summarization).

2. The software applies some video analysis techniques to characterize scenes, segments (cuts and shots), and individual frames in video. For example, by detecting significant changes in the color histogram of successive frames, the application can separate frame sequences into cuts and shots (see the sketch after these steps). Also, by searching and matching prepared templates to individual regions in the frame, the annotation system identifies objects. The user can specify significant objects in a scene to reduce the time for identifying target objects and to obtain a higher recognition success ratio. The user can name objects in a frame by selecting words in the corresponding text.

3. The user relates video segments to text segments such as paragraphs, sentences, and phrases, based on scene structures and object-name correspondences. The system helps the user select appropriate segments by prioritizing based on the number of objects detected and camera movement, and by showing a representative frame of each segment.
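As an illustration of the histogram comparison in step 2, the sketch below flags a cut when the color-histogram difference between successive frames exceeds a threshold. Frames are assumed to arrive as arrays of pixel values, and the bin count and threshold are placeholders rather than values from our implementation.

import numpy as np

def detect_cuts(frames, bins=16, threshold=0.4):
    # Return the indices of frames that start a new cut.
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()   # normalize away the frame size
        # A large L1 distance between successive histograms
        # suggests a significant visual change, that is, a cut.
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)
        prev_hist = hist
    return cuts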

We developed a video annotation editor capable of scene change detection, object tracking, speech recognition, and correlation of scenes and words. Figure 6 shows a screen shot of our video annotation editor.

Figure 6. Screen shot of the video annotation editor.

On the editor screen, the user can specify a particular object in a frame by dragging a rectangle. Using automatic object-tracking techniques, the annotation editor can generate descriptions of an object in a video frame. XML data represents the description and contains object coordinates in start and end frames, time codes of the start and end frames, and motion trails (series of coordinates for the interpolation of object movement).

The object descriptions connect with linguistic annotation by adding appropriate XPaths to the tags of corresponding names and expressions in the video transcript.

Semantic transcoding

Semantic transcoding is a transcoding technique based on external annotations, used for content adaptation according to user preferences. The transcoders here are implemented as an extension to an HTTP proxy server. Such an HTTP proxy is called a transcoding proxy. Figure 7 shows the environment of semantic transcoding.

Figure 7. Transcoding environment.

The information flow in transcoding occurs as follows (a sketch follows the steps):

1. The transcoding proxy receives a request URL with a client ID.


2. The proxy sends the URL's request to the Web server.

3. The proxy receives the document and calculates its hash code.

4. The proxy also asks the annotation server for annotation data related to the URL.

5. If the server finds the URL's annotation data in its database, it returns the data to the proxy.

6. The proxy accepts the data and compares the document hash code with that of the already retrieved document.

7. The proxy also searches for the user preferences associated with the client ID. If no preference data exist, the proxy uses a default setting until the user gives a preference.

8. If the hash codes match, the proxy attempts to transcode the document based on the annotation data by activating the appropriate transcoders.

9. The proxy returns the transcoded document to the client Web browser.
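Compressed into code, the flow might look like the following sketch; the in-memory stores and the run_transcoders stub are hypothetical stand-ins for the real modules.

import hashlib
import urllib.request

DEFAULT_PREFS = {"summary_ratio": 0.5}   # assumed default setting
preferences = {}          # client ID -> preference dict (step 7)
annotation_catalog = {}   # URL -> {"digest": ..., "data": ...}

def run_transcoders(html, annotation, prefs):
    # Placeholder: the real proxy dispatches to the text, image,
    # voice, and video transcoders here (step 8).
    return html

def handle_request(url, client_id):
    html = urllib.request.urlopen(url).read()            # steps 1-3
    digest = hashlib.md5(html).hexdigest()
    annotation = annotation_catalog.get(url)             # steps 4-5
    prefs = preferences.get(client_id, DEFAULT_PREFS)    # step 7
    if annotation and annotation["digest"] == digest:    # steps 6, 8
        html = run_transcoders(html, annotation["data"], prefs)
    return html                                          # step 9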

Transcoding proxy

We employed IBM's Web Intermediaries (WBI) as a development platform to implement the semantic transcoding system (see http://www.almaden.ibm.com/cs/wbi). WBI is a customizable and extendable HTTP proxy server. WBI provides application programming interfaces (APIs) for user-level access control and easy manipulation of the proxy's input/output data.

The transcoding proxy based on WBI has the following functionality:

1. maintenance of personal preferences,

2. gathering and management of annotation data, and

3. activation and integration of transcoders.

User preference management. For the maintenance of personal preferences, we use the Web browser's cookie to identify the user. The cookie holds a user ID assigned by the transcoding proxy on the first access; the proxy uses the ID to identify the user and select previously defined user preferences.

Storing the ID as a cookie value lets the user change access points using the dynamic host configuration protocol (DHCP) while keeping the same preference setting. However, there's one technical problem. Generally, only the HTTP servers that set a cookie's values can access it, and ordinary proxies don't use cookies for user identification; instead, conventional proxies identify the client by hostname and IP address. Thus, when the user accesses our proxy and sets or updates the preferences, the proxy server acts as an HTTP server to access the browser's cookie data and associates the user ID (cookie value) with the hostname or IP address. When the transcoding proxy works as a conventional proxy, it receives the client's hostname and IP address, retrieves the user ID, and then obtains the preference data. If the user changes the access point, hostname, or IP address, our proxy performs as a server again and reassociates the user ID with the new client IDs.
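The double role described above reduces to maintaining two small associations; this sketch is our simplification, with invented names:

cookie_to_prefs = {}     # user ID (cookie value) -> preference dict
address_to_cookie = {}   # client hostname or IP -> user ID

def on_preference_update(client_addr, user_id, prefs):
    # Proxy acting as an HTTP server: the cookie is visible here.
    cookie_to_prefs[user_id] = prefs
    address_to_cookie[client_addr] = user_id

def prefs_for_proxy_request(client_addr, default=None):
    # Proxy acting as an ordinary proxy: only the address is
    # visible, so recover the user ID from the stored association.
    user_id = address_to_cookie.get(client_addr)
    return cookie_to_prefs.get(user_id, default)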

Collecting and indexing annotation data. The transcoding proxy communicates with annotation servers that hold the annotation database. The second step of semantic transcoding is to collect annotations distributed among several servers.

The transcoding proxy creates a multiserver annotation catalog by traversing distributed annotation servers and gathering their annotation indexes. The annotation catalog consists of a server name (for example, hostname and IP address) and its annotation index (a set of annotator names and identifiers of the original document and its annotation data). The proxy uses the catalog to decide which annotation server to access for annotation data when it receives a user's request.
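Building the catalog can be as simple as merging the index each annotation server exports, as in this sketch (the index layout is our assumption):

def build_catalog(servers):
    # servers: iterable of (name, index) pairs, where index maps
    # a URL to the identifiers of its annotation data.
    catalog = {}
    for name, index in servers:
        for url, entries in index.items():
            catalog.setdefault(url, []).append((name, entries))
    return catalog

# On a request, catalog[url] names every server holding
# annotations for that URL.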

Integrating the results of multiple transcoders. The final stage of semantic transcoding is to transcode requested contents depending on user preferences and then to return them to the user's browser. This stage involves activating appropriate transcoders and integrating their results.

As mentioned previously, several types of transcoding exist. In this article, we describe four types—text, image, voice, and video.

Text transcoding

Text transcoding is the transformation of text contents based on linguistic annotations. As a first step, we implemented text summarization.

Our text summarization method employs a spreading activation technique to calculate the importance values of elements in the text.9 Since the method doesn't employ any heuristics dependent on the domain and style of documents, it's applicable to any linguistically annotated document. The method can also trim sentences in the summary because it assigns importance scores to elements smaller than sentences.

A linguistically annotated document naturally defines an intradocument network in which nodes correspond to elements and links represent the semantic relations. This network consists of sentence trees (syntactic head–daughter hierarchies of subsentential elements such as words or phrases); coreference or anaphora links; document, subdivision, or paragraph nodes; and rhetorical relation links. Figure 8 shows the intradocument network.

Figure 8. Intradocument network.

The summarization algorithm works as follows (a sketch follows the steps):

1. Spreading activation is performed in such a way that two elements have the same activation value if they're coreferent or one is the syntactic head of the other.

2. The algorithm marks the unmarked element with the highest activation value for inclusion in the summary.

3. When the algorithm marks an element, it recursively marks the following elements as well, until it can't find any more elements:

❚ the marked element's head,

❚ the marked element's antecedent,

❚ the marked element's compulsory or a priori important daughters—those whose relational attribute values are agt (agent), pat (patient), rec (recipient), sbj (syntactic subject), obj (syntactic object), pos (possessor), cnt (content), cau (cause), cnd (condition), sbm (subject matter), and so on—and

❚ the antecedent of a zero anaphor in the marked element with one of the above values for the relational attribute.

4. The algorithm generates all marked elements in the intradocument network, preserving the order of their positions in the original document.

5. If the size of the summary reaches the user-specified value, the algorithm terminates; otherwise, it goes back to step 2.
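The sketch below condenses steps 2 through 5, assuming the activation values from step 1 are already computed and that each node records its head, antecedent, and important daughters; the data layout is our assumption, not the system's.

def summarize(nodes, target_size):
    # nodes: dict id -> {"activation": float, "head": id or None,
    # "antecedent": id or None, "daughters": [ids],
    # "text": str, "position": int}
    marked = set()

    def mark(nid):
        # Step 3: marking an element recursively marks its head,
        # antecedent, and compulsory or important daughters.
        if nid is None or nid in marked:
            return
        marked.add(nid)
        node = nodes[nid]
        mark(node["head"])
        mark(node["antecedent"])
        for d in node["daughters"]:
            mark(d)

    def text_so_far():
        # Step 4: emit marked elements in document order.
        ordered = sorted(marked, key=lambda n: nodes[n]["position"])
        return " ".join(nodes[n]["text"] for n in ordered)

    while len(marked) < len(nodes):
        # Step 2: take the unmarked element with highest activation.
        best = max((n for n in nodes if n not in marked),
                   key=lambda n: nodes[n]["activation"])
        mark(best)
        if len(text_so_far()) >= target_size:   # step 5
            break
    return text_so_far()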

Simple user interaction can change the summary's size. Thus, the user can see the summary in a preferred size by using an ordinary Web browser without any additional software. The user can also input any words of interest. The algorithm assigns corresponding words in the document numeric values that reflect degrees of interest, and it uses these values while spreading activation for calculating importance scores.

Figure 9 shows the summarization result in a normal Web browser. The top document is the original and the bottom one is the summarized version.

Figure 9. Original (top) and summarized (bottom) documents.

Another kind of text transcoding is language translation. We can predict that translation based on linguistic annotations will produce a much better result than many existing systems. This is because the major difficulties of present machine translation come from syntactic and word sense ambiguities in natural languages, which we can easily clarify in annotation. We show an example of the result of English-to-Japanese translation in Figure 10.

Figure 10. Screen shot of a translated document.

Image transcoding

Image transcoding converts images into different sizes, colors (full color or gray scale), and resolutions (for example, the compression ratio) depending on a user's device and communication capability. Links to these converted images are made from the original images. Therefore, users will notice that the images they're viewing aren't the originals if there are links to similar images.

Figure 11 shows a document that's summarized to half the original size, with images reduced to one-third the original size. The preference-setting subwindow appears on the right-hand side of the figure. The window appears when the user double-clicks the icon in the lower right corner (the transcoding proxy automatically inserts the icon). Using this window, the user can easily modify the parameters for transcoding. For example, by combining image and text transcoding, the system can convert contents to fit the client's screen size.

Figure 11. Screen shot of a document that has gone through image transcoding. The preference-setting subwindow appears on the right.
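A sketch of these conversions using the Pillow imaging library; the scale factor, grayscale flag, and JPEG quality are stand-ins for values taken from the user preferences.

from PIL import Image

def transcode_image(path, out_path, scale=1/3, grayscale=False, quality=60):
    img = Image.open(path).convert("RGB")   # normalize for JPEG output
    # Reduce the size, e.g., to one-third for a small screen.
    w, h = img.size
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    # Full color to gray scale, if the device or user calls for it.
    if grayscale:
        img = img.convert("L")
    # A lower JPEG quality raises the compression ratio.
    img.save(out_path, "JPEG", quality=quality)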

Voice transcoding

Voice synthesis also works better if the content has linguistic annotation. For example, some Web engineers are discussing a speech synthesis markup language (see http://www.cstr.ed.ac.uk/projects/ssml.html). A typical example is processing proper nouns and technical terms. Word-level annotations on proper nouns let transcoders recognize not only their meanings but also their readings.

Voice transcoding generates the spoken-language version of documents. Two types of voice transcoding exist. In the first, the transcoder synthesizes sound data in audio formats such as MPEG-1 audio layer 3 (MP3). This is useful for devices without voice synthesis capability, such as cellular phones and personal digital assistants (PDAs). In the second, the transcoder converts documents into a more appropriate style for voice synthesis. This case requires a voice-synthesis program installed on the client side. Of course, the synthesizer uses the output of the voice transcoder; the mechanism of document conversion is therefore common to both types of voice transcoding.

Documents annotated for voice include some text in commentary annotation for nontextual elements and some word information in linguistic annotation for the reading of proper nouns and of words unknown to the dictionary. A document may also contain phrase and sentence boundary information so that pauses appear in the appropriate positions.

Figure 12 shows an example of a voice-transcoded document, where we inserted icons that represent a speaker. When the user clicks on a speaker icon, it invokes the MP3 player software and starts playing the synthesized voice data.

Figure 12. Screen shot of a voice-transcoded document.

Video transcoding

Video transcoding employs video annotation that consists of linguistically marked-up transcripts—such as closed captions, time stamps of scene changes, and representative images (key frames) of each scene—and additional information such as program names. Our video transcoding has several variations, including video summarization, video-to-document transformation, video translation, and so on.

We perform video summarization as a by-product of text summarization. Since a summarized video transcript contains important information, the corresponding video sequences will produce a collection of significant scenes in the video. We developed a video player that plays summarized videos. Figure 13 shows a screen shot of our video player.

Figure 13. Screen shot of the video player with the summarization function.

To play the summarized video, the player accepts a Synchronized Multimedia Integration Language (SMIL) file (see http://www.w3.org/TR/REC-smil/), which specifies the beginning and ending times of each clip in <anchor> tags. We extended the <anchor> tag specification slightly with an additional attribute, insummary, which tells whether a particular clip is in the previously created summary. Here's a sample of the SMIL format with our extension:

<smil>
  <head> ... </head>
  <body>
    <video region="..." src="somefile.mpg">
      <anchor title="..." begin="00:00:02" end="00:05:24"/>
      <anchor title="Web..." begin="00:20:08" end="00:38:01" insummary="true"/>
      ...
  </body>
</smil>
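Given such a file, a player or preprocessor only needs the anchors whose insummary attribute is set. A sketch with Python's standard XML parser (assuming the file is well-formed, with the ellipses above filled in):

import xml.etree.ElementTree as ET

def summary_clips(smil_path):
    clips = []
    for anchor in ET.parse(smil_path).iter("anchor"):
        # Keep only the clips flagged by our insummary extension.
        if anchor.get("insummary") == "true":
            clips.append((anchor.get("begin"), anchor.get("end")))
    return clips

# For the sample above: [("00:20:08", "00:38:01")]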

There's been previous work performed on video summarization—for example, Informedia10 and CueVideo.11 These programs create a video summary based on automatically extracted features in video such as scene changes; speech, text, and human faces in frames; and closed captions. Such video summarization applications also transcode the video data without annotations. Currently, however, the quality of their summarization isn't practical because of the failure of automatic video analysis: automatic scene detection sometimes extracts unnecessary scenes, and automatic speech recognition often misrecognizes words. These errors result in noisy summaries. Our approach to video summarization is sufficient if the data has enough semantic annotation. Our tool helps annotators create semantic annotation data for multimedia data. Since our annotation data is task independent and versatile, annotations on video are worth creating if the video will be used in different applications (such as automatic editing and information extraction from video).


Video-to-document transformation is another type of video transcoding. If the client device doesn't have video playing capability, the user can't access video contents. In this case, the video transcoder creates a document including important images of scenes and texts related to each scene. The text transcoder can also summarize the resulting document.

Our system implements two types of video translation. One is a translation of automatically generated subtitle text. The transcript with time codes generates the subtitle text. The format of the text looks like

<subtitle duration="00:01:19">
  <time begin="00:00:00"/>No speech
  <clear/>
  <time begin="00:00:05"/>....
  <time begin="00:00:07"/>....
  <time begin="00:00:12"/>....
  <clear/>
  ....
</subtitle>

The text transcoder can translate the subtitle text into different languages as the user wants, and the video player shows the results synchronized with the video.
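To translate the subtitles, a transcoder first has to recover (time, text) pairs from the format above. A hedged sketch, treating the fragment as the XML it appears to be:

import xml.etree.ElementTree as ET

def subtitle_entries(xml_text):
    # <time> marks when the text that follows it is shown;
    # <clear/> blanks the subtitle line.
    entries, current = [], None
    for node in ET.fromstring(xml_text).iter():
        if node.tag == "time":
            current = node.get("begin")
            if node.tail and node.tail.strip():
                entries.append((current, node.tail.strip()))
        elif node.tag == "clear":
            entries.append((current, ""))   # clear the display
    return entries

# Each recovered text segment can then pass through the text
# transcoder for translation, keeping its time code.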

The other type of video translation uses a combination of text and voice transcoding. First, the text transcoder translates a video transcript with linguistic annotation. Then, the voice transcoder converts the translation result into voice-suitable text. Synchronizing video playback and voice synthesis creates another language version of the original video clip. The video player adjusts the duration of each piece of voice data according to the length of its corresponding video segment by changing the speed of the synthesized voice.

Using object-level annotations, the video transcoder can create an interactive hyperlinked video in which video objects link to information such as names, times, locations,12 related Web sites, and so on. The transcoder uses the annotations for retrieving objects from multiple video clips and generating object-featured video summaries. For example, a user might watch a news clip showing Alan Greenspan's (chairman of the US Federal Reserve Board) comments on the economy. To find out who Greenspan is, the user might pause the video and click on his face to find his biography or a list of links to other text or video clips concerning his announcements and their effect on the economy.

Our system automatically combines the text, image, voice, and video transcodings according to user demand. This gives the transcoding proxy a planning mechanism to determine the order of activation of each transcoder necessary for the requested content and user preferences, including client device constraints.

Future plans

We plan to apply our technology to knowledge discovery from huge online resources. Annotations will help extract essential points in documents. For example, if an annotator adds comments to several documents and seems to be a specialist in a particular field, then the machine automatically collects documents annotated by this annotator and generates a single document including summaries of the annotated documents.

We're also pursuing content-based retrieval of Web documents, including multimedia data. Such retrieval lets users ask questions in natural language (either spoken or written). We imagine that in the near future we won't use search engines; instead, we'll use knowledge discovery engines that give us a personalized summary of multiple documents—not hyperlinks.

While our current prototype system is running locally, we're also planning to evaluate our system in open experiments with some universities in Japan. Once we've evaluated our system, we'll distribute our annotation editor—with natural language processing capabilities—for free. This is one step toward dealing with the oncoming information deluge. MM

Acknowledgments

We'd like to acknowledge the contributors to this project. Shingo Hosoya and Ryuichiro Higashinaka developed the prototype of some plug-ins of the semantic transcoder. Koiti Hasida gave us some helpful advice on applying the GDA tag set to our linguistic and multimedia annotations. Paul Maglio and his group at IBM Almaden Research Center provided us with the WBI developer's kit. Hideo Watanabe developed the prototype of the English-to-Japanese language translator. Masashi Nishimura provided us with the speech recognition software developer's kit. Shinichi Torihara developed the voice synthesizer prototype.


References

1. M. Hori et al., "Annotation-Based Web Content Transcoding," Proc. Ninth Int'l WWW Conf., Elsevier, Amsterdam, 2000, pp. 197-211.

2. P. Brusilovsky, E. Schwarz, and G. Weber, "A Tool for Developing Adaptive Electronic Textbooks on WWW," Proc. World Conf. of the Web Soc. (WebNet 96), 1996, pp. 64-69.

3. M. Milosavljevic et al., "Virtual Museums on the Information Superhighway: Prospects and Potholes," Proc. Ann. Conf. Int'l Committee for Documentation of the Int'l Council of Museums (CIDOC 98), 1998, pp. 10-14.

4. J. Fink, A. Kobsa, and A. Nill, "Adaptable and Adaptive Information Provision for All Users, Including Disabled and Elderly People," New Review of Hypermedia and Multimedia, vol. 4, 1998, pp. 163-188, http://www.zeus.gmd.de/kobsa/papers/1998-NRMH-kobsa.ps.

5. H. Maruyama, K. Tamura, and N. Uramoto, XML and Java: Developing Web Applications, Addison-Wesley, Reading, Mass., 1999.

6. H. Watanabe, Linguistic Annotation Language: The Markup Language for Assisting NLP Programs, TRL research report RT0334, IBM Tokyo Research Lab, 1999.

7. M. Roscheisen, C. Mogensen, and T. Winograd, Shared Web Annotations as a Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples, tech. report CSDTR/DLTR, Computer Science Dept., Stanford Univ., Stanford, Calif., 1995.

8. M.A. Schickler, M.S. Mazer, and C. Brooks, "Pan-Browser Support for Annotations and Other Meta-Information on the World Wide Web," Computer Networks and ISDN Systems, vol. 28, 1996.

9. K. Nagao and K. Hasida, "Automatic Text Summarization Based on the Global Document Annotation," Proc. Int'l Conf. Computational Linguistics (COLING-ACL 98), Morgan Kaufmann, San Francisco, 1998, pp. 917-921.

10. M.A. Smith and T. Kanade, Video Skimming for Quick Browsing Based on Audio and Image Characterization, tech. report CMU-CS-95-186, School of Computer Science, Carnegie Mellon Univ., Pittsburgh, Pa., 1995.

11. A. Amir et al., "CueVideo: Automated Indexing of Video for Searching and Browsing," Proc. 22nd Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 99), ACM Press, New York, 1999.

12. M.G. Christel, A.M. Olligschlaeger, and C. Huang, "Interactive Maps for a Digital Video Library," IEEE MultiMedia, vol. 7, no. 1, Jan.-Mar. 2000, pp. 60-67.

Katashi Nagao received his BE, ME, and PhD degrees in computer science from the Tokyo Institute of Technology in 1985, 1987, and 1994, respectively. Since 1987, he has been researching natural language processing and machine translation systems at IBM Research, Tokyo Research Laboratory. He has conducted research projects on natural language dialogue, multiagent systems, and human-computer interaction at Sony Computer Science Labs. He rejoined the IBM Tokyo Research Laboratory in 1999, where he is currently working on semantic transcoding and semantic discovery projects.

Yoshinari Shirai received his BE and ME degrees from the Graduate School of Media and Governance, Keio University in 1998 and 2000, respectively. He's currently a researcher at the NTT Communication Science Laboratories. His research interests include human-computer interaction and mobile computing.

Kevin Squire received his BS in computer engineering from Case Western Reserve University in Cleveland, Ohio, and his MS in electrical engineering from the University of Illinois at Urbana-Champaign (UIUC). He's currently a PhD candidate in electrical engineering at UIUC, studying multimodal language acquisition. His research interests include visual pattern recognition and speech and language processing.

Readers may contact Nagao at IBM Research, Tokyo Research Laboratory, Kanagawa, Japan, email [email protected].
