
Semantic Annotation and Transcoding:

Making Text and Multimedia Contents More Usable on the Web

Katashi Nagao
IBM Research, Tokyo Research Laboratory

1623–14 Shimotsuruma, Yamato, Kanagawa 242–8502, Japan
[email protected]

Abstract

This paper proposes an easy and simple method for constructing a super-structure on the Web which provides current Web contents with new value and new means of use. The super-structure is based on external annotations to Web documents. We have developed a system for any user to annotate any element of any Web document with additional information. We have also developed a proxy that transcodes requested contents by considering annotations assigned to them. In this paper, we classify annotations into three categories. One is linguistic annotation, which helps the transcoder understand the semantic structure of textual elements. The second is commentary annotation, which helps the transcoder manipulate non-textual elements such as images and sounds. The third is multimedia annotation, which is a combination of the above two types. All types of annotation are described using XML, and correspondence between annotations and document elements is defined using URLs and XPaths. We call the entire process “semantic transcoding” because we deal with the deep semantic content of documents with annotations. The current semantic transcoding process mainly handles text and video summarization, language translation, and speech synthesis of documents including images. Another use of annotation is for knowledge discovery from contents. Using this idea, we have also developed a system which discovers knowledge from Web documents and generates a document which includes the discovered knowledge and summaries of multiple documents related to the same topic.

1 Introduction

The conventional Web structure can be considered as a graph on a plane. In this paper, we propose a method for extending such a planar graph to a three-dimensional structure consisting of multiple planar layers. Such a metalevel structure is based on external annotations on documents on the Web. (The original concept of external annotation was given in [5]. We use this term in a more general sense.)

Figure 1 represents the concept of our approach.

Figure 1: Super-structure on the Web

A super-structure on the Web consists of layers of content and metacontent. The first layer corresponds to the set of metacontents of base documents. The second layer corresponds to the set of metacontents of the first layer, and so on. We generally consider such metacontent an external annotation.


A famous example of external annotations is external links that can be defined outside of the set of link-connected documents. These external links have been discussed in the XML (Extensible Markup Language) community, but they have not yet been implemented in the current Web architecture [17].

Another popular example of external annotation is comments or notes on Web documents created by people other than the author. This kind of annotation is helpful for readers evaluating the documents. For example, images without alternative descriptions are not understandable for visually challenged people. If there are comments on these images, these people will understand the image contents by listening to them via speech transcoding. This example is explained later in more detail.

We can easily imagine that an open platform for creating and sharing annotations would greatly extend the expressive power and value of the Web.

1.1 Content Adaptation

Annotations do not just increase the expressive power of the Web but also play an important role in content reuse. One example of content reuse is the transformation of content depending on user preferences.

Content adaptation is a type of transcoding that considers a user's environment such as devices, network bandwidth, profiles, and so on. Such adaptation sometimes also involves a deep understanding of the original document contents. If the transcoder fails to analyze the semantic structure of a document, then the results may cause user misunderstanding.

Our technology assumes that external annotations help machines to understand document contents so that transcoding can have higher quality. We call such transcoding based on annotation “semantic transcoding.”

The overall configuration of semantic transcoding can be viewed in Figure 2.

There are three main new parts in this system: an annotation editor, an annotation server, and a transcoding proxy server. The remaining parts of the system are a conventional Web server and a browser.

There is previous work on device-dependent adaptation of Web documents [7].

The developed system can dynamically filter, convert, or reformat data for content sharing across disparate systems, users, and emerging pervasive computing devices.

They claim that the transcoding benefits include:

1. eliminating the expense of re-authoring or porting data and content-generating applications for multiple systems and devices,

2. improving mobile employee communications and effectiveness, and

3. creating easier access for customers who are using a variety of devices to purchase products and services.

The technology enables the modification of HTML (HyperText Markup Language) documents, such as converting images to links to retrieve images, converting simple tables to bulleted lists, removing features not supported by a device such as JavaScript or Java applets, removing references to image types not supported by a device, and removing comments. It can also transform XML documents by selecting and applying the right stylesheet for the current request based on information in the relevant profiles. These profiles for preferred transcoding services are defined for an initial set of devices.

Our transcoding involves deeper levels of document understanding. Therefore, human intervention in machine understanding of documents is required. External annotation with additional information serves as a guide for machines to understand documents. Of course, some profiles of user contexts will work as a guide for transcoding, but it is clear that such profiles are insufficient for transcoders to recognize deep document characteristics.

1.2 Knowledge Discovery

Another use of annotations is in knowledge discovery, where huge amounts of Web contents are automatically mined for some essential points. Unlike conventional search engines that retrieve Web pages using user-specified keywords, knowledge miners create a single document that satisfies a user's request.


Figure 2: Configuration of semantic transcoding

For example, the knowledge miner may generate a summary document on a certain company's product strategy for the year from many kinds of information resources of its products on the Web.

Currently, we are developing an information collector that gathers documents related to a topic and generates a document containing a summary of each document.

There are many unresolved issues before we can realize true knowledge discovery, but we can say that annotations facilitate this activity.

2 External Annotation

We have developed a simple method to associate external annotations with any element of any HTML document. We use URLs (Uniform Resource Locators), XPaths (location identifiers in the document) [18], and document hash codes (digest values) to identify HTML elements in documents. We have also developed an annotation server that maintains the relationship between contents and annotations and transfers requested annotations to a transcoder.
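
To make this identification scheme concrete, the following minimal sketch (our own illustration, not the system's actual code; the file name and XPath below are hypothetical) resolves a stored XPath against a parsed document to locate the annotated element:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.File;

// Hypothetical sketch: locate the annotated element inside a well-formed,
// XHTML-like document, given the XPath stored with the annotation.
public class AnnotationTarget {
    public static Element resolve(String xhtmlFile, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(xhtmlFile));
        XPath xp = XPathFactory.newInstance().newXPath();
        // Returns the single element the annotation points at, or null if absent.
        return (Element) xp.evaluate(xpathExpr, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Element target = resolve("page.xhtml", "/html/body/p[3]");  // illustrative values
        System.out.println(target == null ? "not found" : target.getTextContent());
    }
}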

Our annotations are represented as XML-formatted data and divided into three categories: linguistic, commentary, and multimedia annotation. Multimedia (especially video) annotation is a combination of the other two types of annotation.

2.1 Annotation Environment

Our annotation environment consists of a client-side editor for the creation of annotations and a server for the management of annotations.

The annotation environment is shown in Figure 3.

Figure 3: Annotation environment

The process flows as follows (in this example case, an HTML file is processed):

1. The user runs the annotation editor and requests a URL as a target of annotation.

2. The annotation server accepts the request and sends it to the Web server.

3. The annotation server receives the Web document.


4. The server calculates the document hash code (digest value) and registers the URL with the code in its database.

5. The server returns the Web document to the editor.

6. The user annotates the requested document and sends the result to the server with some personal data (name, professional areas, etc.).

7. The server receives the annotation data and relates it to its URL in the database.

8. The server also updates the annotator profiles.

Below we explain the editor and the server in more detail.

2.2 Annotation Editor

Our annotation editor, implemented as a Java application, can communicate with the annotation server explained below.

The annotation editor has the following functions:

1. To register targets of annotation with the annotation server by sending URLs

2. To specify any element in the document using the Web browser

3. To generate and send annotation data to the annotation server

4. To reuse previously created annotations when the target contents are updated

An example screen of our annotation editor is shown in Figure 4.

The left window of the editor shows the document object structure of the HTML document. The right window shows some text that was selected on the Web browser (shown on the right-hand side). The selected area is automatically assigned an XPath.

Using the editor, the user annotates text with linguistic structure (grammatical and semantic structure, described later) and adds comments to elements in the document.

Figure 4: Annotation editor with Web browser

The editor is capable of natural language processing and interactive disambiguation. The user can modify the result of the automatically analyzed sentence structure, as shown in Figure 5.

Figure 5: Annotation editor with linguistic structure editor

2.3 Annotation Server

Our annotation server receives annotation data from any annotator and classifies it according to the annotator. The server retrieves documents from the URLs in the annotation data and registers the document hash codes with their URLs in its annotation database. The hash codes are used to find differences between annotated documents and updated documents identified by the same URL. A hash code of the document's internal structure, or DOM (Document Object Model), enables the server to discover modified elements in the annotated document [8].
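
The paper does not give the hashing details, but roughly, a per-element digest over the serialized DOM subtree could be compared between the annotated version and an updated version of a page. The following sketch is our own illustration (not the authors' implementation) and assumes a well-formed, XHTML-like input; MD5 is an assumption:

import java.io.*;
import java.security.MessageDigest;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.*;

// Hypothetical sketch: hash every element of a parsed document so that a
// changed subtree yields a digest different from the one stored when the
// annotation was created. Real HTML would first need tidying into XML.
public class DomHasher {
    public static String subtreeHash(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter w = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(w));   // serialize the subtree
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(w.toString().getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : d) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args[0]));                        // an XHTML/XML file
        NodeList elems = doc.getElementsByTagName("*");
        for (int i = 0; i < elems.getLength(); i++) {
            Element e = (Element) elems.item(i);
            System.out.println(e.getTagName() + " " + subtreeHash(e));
        }
    }
}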


The annotation server makes a table of annotator names, URLs, XPaths, and document hash codes. When the server accepts a URL as a request from a transcoding proxy (described below), the server returns a list of XPaths with associated annotation files, their types (linguistic or commentary), and a hash code. If the server receives an annotator's name as a request, it responds with the set of annotations created by the specified annotator.

We are currently developing a mechanism for access control between annotation servers and normal Web servers. If authors of original documents do not want to allow anyone to annotate their documents, they can add a statement to that effect in the documents, and annotation servers will not retrieve such contents for the annotation editors.

2.4 Linguistic Annotation

The purpose of linguistic annotation is to make WWW texts machine-understandable (on the basis of a new tag set), and to develop content-based presentation, retrieval, question-answering, summarization, and translation systems with much higher quality than is currently available. The new tag set was proposed by the GDA (Global Document Annotation) project [4]. It is based on XML, and designed to be as compatible as possible with HTML, TEI [15], CES [2], EAGLES [3], and LAL [16]. It specifies modifier-modifiee relations, anaphor-referent relations, word senses, etc.

An example of a GDA-tagged sentence is as follows:

<su><np rel="agt" sense="time0">Time</np>
<v sense="fly1">flies</v>
<adp rel="eg"><ad sense="like0">like</ad>
<np>an <n sense="arrow0">arrow</n></np></adp>.</su>

<su> means sentential unit. <n>, <np>, <v>, <ad>, and <adp> mean noun, noun phrase, verb, adnoun or adverb (including preposition and postposition), and adnominal or adverbial phrase, respectively.¹

¹ A more detailed description of the GDA tag set can be found at http://www.etl.go.jp/etl/nl/GDA/tagset.html.

The rel attribute encodes a relationship in which the current element stands with respect to the element that it semantically depends on. Its value is called a relational term. A relational term denotes a binary relation, which may be a thematic role such as agent, patient, recipient, etc., or a rhetorical relation such as cause, concession, etc. For instance, in the above sentence, <np rel="agt" sense="time0">Time</np> depends on the second element <v sense="fly1">flies</v>. rel="agt" means that Time has the agent role with respect to the event denoted by flies. The sense attribute encodes a word sense.
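
As a concrete illustration of how a transcoder might read these attributes (this sketch is ours, not part of the GDA tools), the following code parses the example sentence and lists each element's rel and sense values:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

// Hypothetical sketch: walk a GDA-tagged sentence and print each element's
// dependency relation (rel) and word sense (sense), as a transcoder might
// do before computing importance scores or translations.
public class GdaReader {
    public static void main(String[] args) throws Exception {
        String gda = "<su><np rel=\"agt\" sense=\"time0\">Time</np>"
                   + "<v sense=\"fly1\">flies</v>"
                   + "<adp rel=\"eg\"><ad sense=\"like0\">like</ad>"
                   + "<np>an <n sense=\"arrow0\">arrow</n></np></adp>.</su>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(gda.getBytes("UTF-8")));
        NodeList elems = doc.getElementsByTagName("*");
        for (int i = 0; i < elems.getLength(); i++) {
            Element e = (Element) elems.item(i);
            System.out.printf("%-4s rel=%-4s sense=%-7s text=%s%n",
                    e.getTagName(), e.getAttribute("rel"),
                    e.getAttribute("sense"), e.getTextContent().trim());
        }
    }
}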

Linguistic annotation is generated by automatic morphological analysis, interactive sentence parsing, and word sense disambiguation by selecting the most appropriate paraphrase.

Some research issues on linguistic annotation are related to how the annotation cost can be reduced to a feasible level. We have been developing machine-guided annotation interfaces that conceal the complexity of annotation. Machine learning mechanisms also contribute to reducing the cost because they can gradually increase the accuracy of automatic annotation.

In principle, the tag set does not depend on language, but as a first step we implemented a semi-automatic tagging system for English and Japanese.

2.5 Commentary Annotation

Commentary annotation is mainly used to annotate non-textual elements like images and sounds with some additional information. Each comment can include not only tagged texts but also other images and links. Currently, this type of annotation appears in a subwindow that is overlaid on the original document window when the user places the mouse pointer over the area of a comment-added element, as shown in Figure 6.

Users can also annotate text elements with information such as paraphrases, correctly spelled words, and underlines. This type of annotation is used for text transcoding that combines such comments with the original texts.


Figure 6: Comment overlay on the document

Commentary annotation on hyperlinks is also available. This contributes to a quick introduction of target documents before clicking the links. If there are linguistic annotations on the target documents, the transcoders can generate summaries of these documents and relate them to the hyperlinks in the source document.

There is some previous work on sharing comments on the Web. ComMentor is a general meta-information architecture for annotating documents on the Web [11]. This architecture includes a basic client-server protocol, a meta-information description language, a server system, and a remodeled NCSA Mosaic browser with interface augmentations to provide access to its extended functionality. ComMentor provides a general mechanism for shared annotations, which enables people to annotate arbitrary documents at any position in place, share comments/pointers with other people (either publicly or privately), and create shared “landmark” reference points in the information space. There are several annotation systems with a similar direction, such as CoNote and the Group Annotation Transducer [13].

These systems are often limited to particular documents or documents shared only among a few people. Our annotation and transcoding system can also handle multiple comments on any element of any document on the Web. Also, a community-wide access control mechanism can be added to our transcoding proxy. If a user is not a member of a particular group, then the user cannot access the transcoding proxy that is for that group's use only.

In the future, transcoding proxies and annotation servers will communicate using a secure protocol that prevents other servers or proxies from accessing the annotation data.

Our main focus is the adaptation of WWW contents to users, and sharing comments in a community is one of our additional features. We apply both commentary and linguistic annotations to semantic transcoding.

2.6 Multimedia Annotation

Our annotation technique can also be applied to multimedia data such as digital video. Digital video is becoming a necessary information source. Since the size of these collections is growing to huge numbers of hours, summarization is required to effectively browse video segments in a short time without losing the significant content. We have developed techniques for semi-automatic video annotation using a text describing the content of the video. Our techniques also use some video analysis methods such as automatic cut detection, characterization of frames in a cut, and scene recognition using similarity between several cuts.

There is another approach to video annotation. MPEG-7 is an effort within the Moving Picture Experts Group (MPEG) of ISO/IEC that is dealing with multimedia content description [9].

Using content descriptions, video coded in MPEG-7 can be transcoded and delivered to different devices. MPEG-7 will potentially allow greater input from content publishers in guiding how multimedia content is transcoded in different situations and for different client devices. Also, MPEG-7 provides object-level description of multimedia content, which allows a higher granularity of transcoding in which individual regions, segments, objects, and events in image, audio, and video data can be differentially transcoded depending on publisher and user preferences, network bandwidth, and client capabilities.

Our method will be integrated into tools for authoring MPEG-7 data. However, we do not currently know when MPEG-7 technology will be widely available.


Our video annotation includes automatic segmentation of video, semi-automatic linking of video segments with corresponding text segments, and interactive naming of people and objects in video frames.

Video annotation is performed through the following three steps.

First, for each video clip, the annotation system creates the text corresponding to its content. We employed speech recognition for the automatic generation of a video transcript. The speech recognition module also records correspondences between the video frames and the words. The transcript is not required to describe the whole video content. The resolution of the description affects the final quality of the transcoding (e.g., summarization).

Second, some video analysis techniques are applied to characterize scenes, segments (cuts and shots), and individual frames in video. For example, by detecting significant changes in the color histogram of successive frames, frame sequences can be separated into cuts and shots.
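
As a rough illustration of this kind of cut detection (the bin sizes and threshold below are our assumptions, not values from the paper):

import java.awt.image.BufferedImage;

// Hypothetical sketch of histogram-based cut detection: frames whose color
// histograms differ from the previous frame by more than a threshold are
// treated as the start of a new cut.
public class CutDetector {
    static double[] histogram(BufferedImage frame) {
        double[] bins = new double[64];               // 4x4x4 RGB bins
        for (int y = 0; y < frame.getHeight(); y++) {
            for (int x = 0; x < frame.getWidth(); x++) {
                int rgb = frame.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                bins[(r / 64) * 16 + (g / 64) * 4 + (b / 64)]++;
            }
        }
        double total = frame.getWidth() * frame.getHeight();
        for (int i = 0; i < bins.length; i++) bins[i] /= total;  // normalize
        return bins;
    }

    // Returns true if the histogram difference suggests a cut boundary.
    static boolean isCut(BufferedImage prev, BufferedImage cur, double threshold) {
        double[] h1 = histogram(prev), h2 = histogram(cur);
        double diff = 0.0;
        for (int i = 0; i < h1.length; i++) diff += Math.abs(h1[i] - h2[i]);
        return diff > threshold;                      // e.g. a threshold around 0.4
    }
}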

Also, by searching for and matching prepared templates to individual regions in the frame, the annotation system identifies objects. The user can specify significant objects in a scene in order to reduce the time needed to identify target objects and to obtain a higher recognition success ratio. The user can name objects in a frame simply by selecting words in the corresponding text.

Third, the user relates video segments to text segments such as paragraphs, sentences, and phrases, based on scene structures and object-name correspondences. The system helps the user to select appropriate segments by prioritizing them based on the number of objects detected and camera movement, and by showing a representative frame of each segment.

We developed a video annotation editor capable of scene change detection, speech recognition, and correlation of scenes and words. An example screen of our video annotation editor is shown in Figure 7.

On the editor screen, the user can specify a particular object in a frame by dragging a rectangle. Using automatic object tracking techniques, the annotation editor can generate descriptions of an object in a video frame.

Figure 7: Video annotation editor

The description is represented as XML data, and contains object coordinates in the start and end frames, time codes of the start and end frames, and motion trails (series of coordinates for interpolation of object movement).
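
A minimal sketch of such an object description and the interpolation it enables is shown below; the class and field names are illustrative, since the actual XML schema is not given in the paper:

// Hypothetical sketch of an object description: a time-ordered motion trail
// (including the start and end frames) from which the object's position at
// any intermediate time can be linearly interpolated.
public class ObjectTrack {
    public static class Point {
        final double t, x, y;
        Point(double t, double x, double y) { this.t = t; this.x = x; this.y = y; }
    }

    private final Point[] trail;   // time-ordered coordinates of the tracked object
    public ObjectTrack(Point[] trail) { this.trail = trail; }

    // Linearly interpolate the object position at time t (seconds from clip start).
    public double[] positionAt(double t) {
        if (t <= trail[0].t) return new double[] { trail[0].x, trail[0].y };
        for (int i = 1; i < trail.length; i++) {
            if (t <= trail[i].t) {
                double f = (t - trail[i - 1].t) / (trail[i].t - trail[i - 1].t);
                return new double[] {
                    trail[i - 1].x + f * (trail[i].x - trail[i - 1].x),
                    trail[i - 1].y + f * (trail[i].y - trail[i - 1].y) };
            }
        }
        Point last = trail[trail.length - 1];
        return new double[] { last.x, last.y };
    }
}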

The object descriptions are connected with linguistic annotation by adding appropriate XPaths to the tags of corresponding names and expressions in the video transcript.

As mentioned later, the annotation of video objects can be used for the creation of hyper-video, in which annotated objects are hyperlinked with external information and objects can be retrieved with keywords.

3 Semantic Transcoding

Semantic transcoding is a transcoding technique based on external annotations, used for content adaptation according to user preferences. The transcoders here are implemented as an extension to an HTTP (HyperText Transfer Protocol) proxy server. Such an HTTP proxy is called a transcoding proxy.

Figure 8 shows the environment of semantic transcoding.

The information flow in transcoding is as follows (a code sketch of the proxy-side steps appears after the list):

1. The transcoding proxy receives a request URL with a client ID.

2. The proxy sends the request for the URL to the Web server.


Figure 8: Transcoding environment

3. The proxy receives the document and calculates its hash code.

4. The proxy also asks the annotation server for annotation data related to the URL.

5. If the server finds the annotation data of the URL in its database, it returns the data to the proxy.

6. The proxy accepts the data and compares the document hash code with that of the already retrieved document.

7. The proxy also searches for the user preferences using the client ID. If there is no preference data, the proxy uses a default setting until the user gives a preference.

8. If the hash codes match, the proxy attempts to transcode the document based on the annotation data by activating the appropriate transcoders.

9. The proxy returns the transcoded document to the client Web browser.
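
The following sketch illustrates the decision logic of steps 3 to 8 on the proxy side. It is our own simplification and is not based on the WBI API; the digest, method names, and preference keys are assumptions:

import java.net.URI;
import java.net.http.*;
import java.util.*;

// Hypothetical sketch of the proxy-side handling of one request.
public class TranscodingStep {
    static final HttpClient http = HttpClient.newHttpClient();

    static String fetch(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    // annotation: XPath -> annotation XML, as returned by the annotation server.
    // prefs: user preferences looked up by client ID (defaults when unknown).
    static String handle(String url, String storedHash,
                         Map<String, String> annotation,
                         Map<String, String> prefs) throws Exception {
        String document = fetch(url);
        String currentHash = Integer.toHexString(document.hashCode()); // placeholder digest
        if (!currentHash.equals(storedHash) || annotation.isEmpty()) {
            return document;            // annotations may be stale; pass the page through
        }
        String result = document;
        if ("true".equals(prefs.getOrDefault("summarize", "false"))) {
            result = summarize(result, annotation);
        }
        if (prefs.containsKey("language")) {
            result = translate(result, annotation, prefs.get("language"));
        }
        return result;                  // returned to the client browser
    }

    // Stubs standing in for the transcoders described in Section 3.
    static String summarize(String doc, Map<String, String> ann) { return doc; }
    static String translate(String doc, Map<String, String> ann, String lang) { return doc; }
}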

We explain in more detail the transcoding proxy and the various kinds of transcoding.

3.1 Transcoding Proxy

We employed IBM's WBI (Web Intermediaries) as a development platform to implement the semantic transcoding system [6].

WBI is a customizable and extendable HTTP proxy server. WBI provides APIs (Application Programming Interfaces) for user-level access control and easy manipulation of the input/output data of the proxy.

The transcoding proxy based on WBI has the following functionality:

1. Maintenance of personal preferences

2. Gathering and management of annotation data

3. Activation and integration of transcoders

3.1.1 User preference management

For the maintenance of personal preferences, we use the Web browser's cookie to identify the user. The cookie holds a user ID assigned by the transcoding proxy on the first access, and the ID is used to identify the user and to select the user preferences defined last time. The ID stored as a cookie value allows the user, for example, to change access points using DHCP (Dynamic Host Configuration Protocol) while keeping the same preference settings. There is one technical problem. Generally, cookies can be accessed only by the HTTP servers that have set their values, and ordinary proxies do not use cookies for user identification. Instead, conventional proxies identify the client by the hostname and IP address.


Thus, when the user accesses our proxy and sets or updates the preferences, the proxy server acts as an HTTP server to access the browser's cookie data and associates the user ID (cookie value) with the hostname/IP address. When the transcoding proxy works as a conventional proxy, it receives the client's hostname and IP address, retrieves the user ID, and then obtains the preference data. If the user changes access point and hostname/IP address, our proxy acts as a server again and reassociates the user ID with the new client IDs.
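
A minimal sketch of the two lookup tables implied by this scheme (our own illustration; names are hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the cookie value (user ID) issued when the proxy acts
// as an HTTP server, and the mapping from the client's current hostname/IP
// to that ID used on ordinary proxy requests.
public class PreferenceRegistry {
    private final Map<String, String> hostToUserId = new ConcurrentHashMap<>();
    private final Map<String, Map<String, String>> userPrefs = new ConcurrentHashMap<>();

    // Called when the browser talks to the proxy directly (cookie is visible).
    public void associate(String clientHost, String cookieUserId) {
        hostToUserId.put(clientHost, cookieUserId);
    }

    // Called on ordinary proxy requests, where only hostname/IP is available.
    public Map<String, String> preferencesFor(String clientHost) {
        String userId = hostToUserId.get(clientHost);
        if (userId == null) return Map.of();          // fall back to defaults
        return userPrefs.getOrDefault(userId, Map.of());
    }

    public void updatePreferences(String userId, Map<String, String> prefs) {
        userPrefs.put(userId, prefs);
    }
}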

3.1.2 Collecting and indexing annotation data

The transcoding proxy communicates with annotation servers that hold the annotation database. The second step of semantic transcoding is to collect annotations distributed among several servers.

The transcoding proxy creates a multi-server annotation catalog by crawling the distributed annotation servers and gathering their annotation indices. The annotation catalog consists of each server's name (e.g., URL) and its annotation index (a set of annotator names and identifiers of the original documents and their annotation data). The proxy uses the catalog to decide which annotation server should be accessed to get annotation data when it receives a user's request.

3.1.3 Integrating the results of multiple transcoders

The final stage of semantic transcoding is to transcode the requested contents depending on user preferences and then to return them to the user's browser. This stage involves the activation of appropriate transcoders and the integration of their results.

As mentioned previously, there are several types of transcoding. In this paper we describe four types: text, image, voice, and video transcoding.

3.2 Text Transcoding

Text transcoding is the transformation of text contents based on linguistic annotations. As a first step, we implemented text summarization.

Our text summarization method employs a spreading activation technique to calculate the importance values of elements in the text [10]. Since the method does not employ any heuristics dependent on the domain and style of documents, it is applicable to any linguistically annotated document. The method can also trim sentences in the summary because importance scores are assigned to elements smaller than sentences.

A linguistically annotated document naturally defines an intra-document network in which nodes correspond to elements and links represent the semantic relations. This network consists of sentence trees (syntactic head-daughter hierarchies of subsentential elements such as words or phrases), coreference/anaphora links, document/subdivision/paragraph nodes, and rhetorical relation links.

Figure 9 shows a graphical representation of the intra-document network.

Figure 9: Intra-document network

The summarization algorithm works as follows (a simplified code sketch appears after the steps):

1. Spreading activation is performed in such a way that two elements have the same activation value if they are coreferent or one of them is the syntactic head of the other.

2. The unmarked element with the highest activation value is marked for inclusion in the summary.

3. When an element is marked, the following elements are recursively marked as well, until no more elements are found:


• the marker's head

• the marker's antecedent

• the marker's compulsory or a priori important daughters, the values of whose relational attributes are agt (agent), pat (patient), rec (recipient), sbj (syntactic subject), obj (syntactic object), pos (possessor), cnt (content), cau (cause), cnd (condition), sbm (subject matter), etc.

• the antecedent of a zero anaphor in the marker with some of the above values for the relational attribute

4. All marked elements in the intra-document network are generated, preserving the order of their positions in the original document.

5. If the size of the summary reaches the user-specified value, then terminate; otherwise go back to Step 2.
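
The following greatly simplified sketch illustrates the marking loop of steps 2 to 5. It assumes the activation values have already been computed by spreading activation over the intra-document network, and the class and field names are ours, not the authors':

import java.util.*;

// Hypothetical sketch: each Node is a document element; "head", "antecedent",
// and "importantDaughters" mirror the recursive marking rules above.
class Node {
    double activation;
    Node head, antecedent;
    List<Node> importantDaughters = new ArrayList<>();  // agt, pat, rec, sbj, obj, ...
    boolean marked;
    String text;
    Node(String text, double activation) { this.text = text; this.activation = activation; }
}

public class Summarizer {
    public static List<Node> summarize(List<Node> nodes, int maxElements) {
        List<Node> summary = new ArrayList<>();
        while (summary.size() < maxElements) {
            // Step 2: pick the unmarked element with the highest activation.
            Node best = null;
            for (Node n : nodes)
                if (!n.marked && (best == null || n.activation > best.activation)) best = n;
            if (best == null) break;                    // nothing left to mark
            mark(best, summary);
        }
        summary.sort(Comparator.comparingInt(nodes::indexOf));  // Step 4: keep document order
        return summary;
    }

    // Step 3: recursively mark the head, the antecedent, and the important daughters.
    private static void mark(Node n, List<Node> summary) {
        if (n == null || n.marked) return;
        n.marked = true;
        summary.add(n);
        mark(n.head, summary);
        mark(n.antecedent, summary);
        for (Node d : n.importantDaughters) mark(d, summary);
    }
}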

The size of the summary can be changed by simple user interaction. Thus the user can see the summary at a preferred size using an ordinary Web browser without any additional software. The user can also input any words of interest. The corresponding words in the document are assigned numeric values that reflect the degrees of interest. These values are used during spreading activation for calculating importance scores.

Figure 10 shows a summarization result in a normal Web browser. The top document is the original and the bottom one is the summarized version.

Another kind of text transcoding is language translation. We can predict that translation based on linguistic annotations will produce a much better result than many existing systems. This is because the major difficulties of present machine translation come from syntactic and word sense ambiguities in natural languages, which can be easily clarified in annotation. An example of the result of English-to-Japanese translation is shown in Figure 11.

Furthermore, we are developing dictionary-based text paraphrasing as another form of text transcoding.

Figure 10: Original and summarized documents

Figure 11: Translated document


Using word sense attributes in linguistic annotation and dictionary definitions, difficult words are replaced with more readable expressions. These expressions are generated by modifying the dictionary definitions for word senses according to the local contexts of the target words.

3.3 Image Transcoding

Image transcoding converts images into versions of different size, color depth (full color or grayscale), and resolution (e.g., compression ratio) depending on the user's device and communication capability. Links to these converted images are made from the original images. Therefore, users will notice that the images they are looking at are not the originals if there are links to similar images.
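
As a rough sketch of such a conversion (our own illustration; the file names are hypothetical), an image can be scaled and reduced to grayscale with standard Java imaging classes before the proxy links it to the original:

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Hypothetical sketch of the image conversions mentioned above: scaling an
// image and reducing it to grayscale.
public class ImageTranscoder {
    public static BufferedImage convert(BufferedImage src, double scale, boolean grayscale) {
        int w = Math.max(1, (int) (src.getWidth() * scale));
        int h = Math.max(1, (int) (src.getHeight() * scale));
        int type = grayscale ? BufferedImage.TYPE_BYTE_GRAY : BufferedImage.TYPE_INT_RGB;
        BufferedImage dst = new BufferedImage(w, h, type);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, w, h, null);   // scales (and converts the color space)
        g.dispose();
        return dst;
    }

    public static void main(String[] args) throws Exception {
        BufferedImage original = ImageIO.read(new File("figure.png"));  // illustrative name
        ImageIO.write(convert(original, 0.5, true), "jpg", new File("figure-half-gray.jpg"));
    }
}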

Figure 12 shows a document that has been summarized to one third the size of the original and whose images have been reduced to half size. In this figure, the preference setting subwindow is shown on the right-hand side. The window appears when the user double-clicks the icon in the lower right corner (the transcoding proxy automatically inserts the icon). Using this window, the user can easily modify the parameters for transcoding.

Figure 12: Image transcoding (and preference setting window)

By combining image and text transcoding, the system can, for example, convert contents to just fit the client screen size.

3.4 Voice Transcoding

Voice synthesis also works better if the content has linguistic annotation. For example, a speech synthesis markup language is being discussed in [12]. A typical example is the processing of proper nouns and technical terms. Word-level annotations on proper nouns allow the transcoders to recognize not only their meanings but also their readings.

Voice transcoding generates a spoken-language version of documents. There are two types of voice transcoding. One is when the transcoder synthesizes sound data in an audio format such as MP3 (MPEG-1 Audio Layer 3). This case is useful for devices without voice synthesis capability such as cellular phones and PDAs (Personal Digital Assistants). The other is when the transcoder converts documents into a style more appropriate for voice synthesis. This case requires that a voice synthesis program be installed on the client side. Of course, that synthesizer uses the output of the voice transcoder. Therefore, the mechanism of document conversion is a common part of both types of voice transcoding.

Documents annotated for voice include some text in commentary annotations for non-textual elements and some word information in linguistic annotations for the readings of proper nouns and words not in the dictionary. The document also contains phrase and sentence boundary information so that pauses appear in appropriate positions.

transcoded document in which icons that rep-resent the speaker are inserted. When the userclicks the speaker icon, the MP3 player soft-ware is invoked and starts playing the synthe-sized voice data.

3.5 Video Transcoding

Video transcoding employs video annotation that consists of linguistically marked-up transcripts such as closed captions, time stamps of scene changes, representative images (key frames) of each scene, and additional information such as program names. Our video transcoding has several variations, including video summarization, video-to-document transformation, and video translation.


Figure 13: Voice transcoding

Video summarization is performed as a by-product of text summarization. Since a summarized video transcript contains the important information, the corresponding video sequences produce a collection of significant scenes from the video. Summarized video is played by a player we developed. An example screen of our video player is shown in Figure 14.

Figure 14: Video player with summarization function

There is some previous work on video summarization, such as Infomedia [14] and CueVideo [1]. These systems create a video summary based on automatically extracted features in the video such as scene changes, speech, text and human faces in frames, and closed captions. They can transcode video data without annotations. However, the accuracy of their summarization is currently not practical because automatic video analysis often fails. Our approach to video summarization has sufficient quality for use if the data has enough semantic annotation.

As mentioned earlier, we have developed a tool to help annotators create semantic annotation data for multimedia data. Since our annotation data is task-independent and versatile, annotations on video are worth creating if the video will be used in different applications such as automatic editing and information extraction from video.

Video-to-document transformation is another type of video transcoding. If the client device does not have video playing capability, the user cannot access video contents. In this case, the video transcoder creates a document including important images of scenes and the texts related to each scene. Also, the resulting document can be summarized by the text transcoder.

Our system implements two types of video translation. One is translation of automatically generated subtitle text. The subtitle text is generated from the transcript with time codes. The format of the text is as follows:

<subtitle duration="00:01:19">
<time begin="00:00:00"/><clear/>No speech
<time begin="00:00:05"/>....
<time begin="00:00:07"/>....
<time begin="00:00:12"/><clear/>....
</subtitle>

The text transcoder can translate the subtitle text into different languages as the user wants, and the video player shows the results synchronized with the video.
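
As a small illustration (ours, not the system's code), the <time begin="..."/> markers of the subtitle format above can be turned into offsets in seconds that a player could use for synchronization:

import java.util.*;
import java.util.regex.*;

// Hypothetical sketch: extract the begin time codes from the subtitle text
// and convert them to offsets (in seconds) from the start of the clip.
public class SubtitleTimes {
    private static final Pattern TIME =
            Pattern.compile("<time begin=\"(\\d{2}):(\\d{2}):(\\d{2})\"/>");

    public static List<Integer> beginOffsets(String subtitleXml) {
        List<Integer> seconds = new ArrayList<>();
        Matcher m = TIME.matcher(subtitleXml);
        while (m.find()) {
            int h = Integer.parseInt(m.group(1));
            int min = Integer.parseInt(m.group(2));
            int s = Integer.parseInt(m.group(3));
            seconds.add(h * 3600 + min * 60 + s);
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println(beginOffsets("<time begin=\"00:01:19\"/>"));  // prints [79]
    }
}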

An example screen of the video player with the subtitle window is shown in Figure 15.

Figure 15: Video player with subtitle window

The other type of video translation is performed using a combination of text and voice transcoding.


First, a video transcript with linguistic annotation is translated by the text transcoder. Then, the result of the translation is converted into voice-suitable text by the voice transcoder. Synchronization of video playing and voice synthesis makes another language version of the original video clip. The duration of each piece of voice data is adjusted to the length of its corresponding video segment by changing the speed of the synthesized voice.

Using object-level annotations, the video transcoder can create an interactive hyper-video in which video objects are hyperlinked with information such as names, times, locations, related Web sites, etc. The annotations are used for object retrieval from multiple video clips and for the generation of object-featured video summaries.

The above-described text, image, voice, and video transcodings are automatically combined according to user demand, so the transcoding proxy has a planning mechanism to determine the order of activation of each transcoder necessary for the requested content and user preferences (including client device constraints).

4 Future Plans

We are planning to apply our technology to knowledge discovery from huge online resources. Annotations will be very useful for extracting the essential points of documents. For example, suppose an annotator adds comments to several documents and appears to be a specialist in some particular field. Then, the machine automatically collects documents annotated by this annotator and generates a single document including summaries of the annotated documents.

Also, content-based retrieval of Web documents including multimedia data is being pursued. Such retrieval enables users to ask questions in natural language (either spoken or written).

While our current prototype system is running locally, we are also planning to evaluate our system in open experiments jointly with Keio University in Japan. In addition, we will distribute our annotation editor, with natural language processing capabilities, for free.

5 Concluding Remarks

We have discussed a full architecture for creating and utilizing external annotations. Using the annotations, we realized semantic transcoding that automatically customizes Web contents depending on user preferences.

This technology also contributes to commentary information sharing, as in ComMentor, and to device-dependent transformation for any device. One of our future goals is to make the contents of the WWW intelligent enough to answer our questions asked in natural language. We imagine that in the near future we will not use search engines but will instead use knowledge discovery engines that give us a personalized summary of multiple documents instead of hyperlinks. The work in this paper is one step toward a better solution for dealing with the coming information deluge.

Acknowledgments

The author would like to acknowledge the contributors to the project described in this paper. Koiti Hasida gave the author helpful advice on applying the GDA tag set to our linguistic and multimedia annotations. Hideo Watanabe developed the prototype English-to-Japanese language translator. Shinichi Torihara developed the prototype voice synthesizer. Shingo Hosoya, Yoshinari Shirai, Ryuichiro Higashinaka, Mitsuhiro Yoneoka, and Kevin Squire contributed to the implementation of the prototype semantic transcoding system.

References

[1] A. Amir, S. Srinivasan, D. Ponceleon, and D. Petkovic. CueVideo: Automated indexing of video for searching and browsing. In Proceedings of SIGIR'99, 1999.

[2] Corpus Encoding Standard (CES). Corpus Encoding Standard. http://www.cs.vassar.edu/CES/.

[3] Expert Advisory Group on Language Engineering Standards (EAGLES). EAGLES online. http://www.ilc.pi.cnr.it/EAGLES/home.html.


[4] Koiti Hasida. Global Document Annotation. http://www.etl.go.jp/etl/nl/gda/.

[5] Masahiro Hori et al. Annotation-based Web content transcoding. In Proceedings of the Ninth International WWW Conference, 2000.

[6] IBM Almaden Research Center. Web Intermediaries (WBI). http://www.almaden.ibm.com/cs/wbi/.

[7] IBM Corporation. IBM WebSphere Transcoding Publisher. http://www-4.ibm.com/software/webservers/transcoding/.

[8] Hiroshi Maruyama, Kent Tamura, and Naohiko Uramoto. XML and Java: Developing Web Applications. Addison-Wesley, 1999.

[9] Moving Picture Experts Group (MPEG). MPEG-7 Context and Objectives. http://drogo.cselt.stet.it/mpeg/standards/mpeg-7/mpeg-7.htm.

[10] Katashi Nagao and Koiti Hasida. Automatic text summarization based on the Global Document Annotation. In Proceedings of COLING-ACL'98, 1998.

[11] Martin Roscheisen, Christian Mogensen, and Terry Winograd. Shared Web annotations as a platform for third-party value-added information providers: Architecture, protocols, and usage examples. Technical Report CSDTR/DLTR, Computer Science Department, Stanford University, 1995.

[12] The SABLE Consortium. A Speech Synthesis Markup Language. http://www.cstr.ed.ac.uk/projects/ssml.html.

[13] Matthew A. Schickler, Murray S. Mazer, and Charles Brooks. Pan-browser support for annotations and other meta-information on the World Wide Web. Computer Networks and ISDN Systems, Vol. 28, 1996.

[14] Michael A. Smith and Takeo Kanade. Video skimming for quick browsing based on audio and image characterization. Technical Report CMU-CS-95-186, School of Computer Science, Carnegie Mellon University, 1995.

[15] The Text Encoding Initiative (TEI). Text Encoding Initiative. http://www.uic.edu:80/orgs/tei/.

[16] Hideo Watanabe. Linguistic Annotation Language: The markup language for assisting NLP programs. TRL Research Report RT0334, IBM Tokyo Research Laboratory, 1999.

[17] World Wide Web Consortium. Extensible Markup Language (XML). http://www.w3.org/XML/.

[18] World Wide Web Consortium. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath.html.