32
SOFTWARE—PRACTICE AND EXPERIENCE Softw. Pract. Exper. 2008; 38:1499–1530 Published online 5 March 2008 inWiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.877 Migrating legacy video lectures to multimedia learning objects Andrea De Lucia, Rita Francese, Ignazio Passero , and Genoveffa Tortora Dipartimento di Matematica e Informatica, Universit` a degli Studi di Salerno, via Ponte don Melillo 1, Fisciano (SA), Italy SUMMARY Video lectures are an old distance learning approach that offers only basic interaction and retrieval features to the user. Thus, to follow the new learning paradigms, we need to re-engineer the e-learning processes while preserving the investments made in the past. In this paper we present an approach for migrating video lectures to multimedia learning objects. Two essential problems are tackled: the detection of slide transitions and the generation of the learning objects. To this aim, the video of the lecture is scanned to detect the slide changes, while the learning object metadata and the slide pictures are extracted from the presentation document. A tool named VLMigrator (video lecture migrator) has been developed to support the migration of video lectures and the restructuring of their contents in terms of learning objects. Both the migration strategy and the tool have been experimented in a case study. Copyright © 2008 John Wiley & Sons, Ltd. Received 1 June 2007; Revised 7 January 2008; Accepted 8 January 2008 KEY WORDS: e-learning; learning object; SCORM; slide change detection 1. INTRODUCTION Today, the main challenge in the e-learning research field concerns the accessibility, usability and exploitability of digital contents, as stated by the European research program eContentplus [1]. Indeed, creating lectures in digital format, organizing and restructuring contents are very boring and expensive tasks. To accommodate audience time and/or space conflicts, many of these lectures can be provided online. In particular, oral expositions supported by slides are typical teaching and learning activities providing Information/Knowledge Dissemination that might be usefully transferred from the classroom to online mode [2], as also proposed by the Classroom 2000 project [3]. In addition, video lectures are often taught by famous ‘gurus’ and are broadcasted by satellite Correspondence to: Ignazio Passero, Dipartimento di Matematica e Informatica, Universit` a degli Studi di Salerno, via Ponte don Melillo 1, Fisciano (SA), Italy. E-mail: [email protected] Copyright 2008 John Wiley & Sons, Ltd.

Migrating legacy video lectures to multimedia learning objects

Embed Size (px)

Citation preview

Page 1: Migrating legacy video lectures to multimedia learning objects

SOFTWARE—PRACTICE AND EXPERIENCESoftw. Pract. Exper. 2008; 38:1499–1530Published online 5March 2008 inWiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.877

Migrating legacy video lecturesto multimedia learning objects

Andrea De Lucia, Rita Francese, Ignazio Passero∗,†and Genoveffa Tortora

Dipartimento di Matematica e Informatica, Universita degli Studi di Salerno, viaPonte don Melillo 1, Fisciano (SA), Italy

SUMMARY

Video lectures are an old distance learning approach that offers only basic interaction and retrievalfeatures to the user. Thus, to follow the new learning paradigms, we need to re-engineer the e-learningprocesses while preserving the investments made in the past. In this paper we present an approach formigrating video lectures to multimedia learning objects. Two essential problems are tackled: the detectionof slide transitions and the generation of the learning objects. To this aim, the video of the lecture isscanned to detect the slide changes, while the learning object metadata and the slide pictures are extractedfrom the presentation document. A tool named VLMigrator (video lecture migrator) has been developedto support the migration of video lectures and the restructuring of their contents in terms of learningobjects. Both the migration strategy and the tool have been experimented in a case study. Copyright ©2008 John Wiley & Sons, Ltd.

Received 1 June 2007; Revised 7 January 2008; Accepted 8 January 2008

KEY WORDS: e-learning; learning object; SCORM; slide change detection

1. INTRODUCTION

Today, the main challenge in the e-learning research field concerns the accessibility, usability andexploitability of digital contents, as stated by the European research program eContentplus [1].Indeed, creating lectures in digital format, organizing and restructuring contents are very boringand expensive tasks. To accommodate audience time and/or space conflicts, many of these lecturescan be provided online. In particular, oral expositions supported by slides are typical teachingand learning activities providing Information/Knowledge Dissemination that might be usefullytransferred from the classroom to online mode [2], as also proposed by the Classroom 2000 project[3]. In addition, video lectures are often taught by famous ‘gurus’ and are broadcasted by satellite

∗Correspondence to: Ignazio Passero, Dipartimento di Matematica e Informatica, Universita degli Studi di Salerno, via Pontedon Melillo 1, Fisciano (SA), Italy.

†E-mail: [email protected]

Copyright q 2008 John Wiley & Sons, Ltd.

Page 2: Migrating legacy video lectures to multimedia learning objects

1500 A. DE LUCIA ET AL.

television or are available and mainly sold through Web sites, in a videotape or CD format. Theseold distance teaching approaches create a passive situation in which the user follows the classicallecture, but at distance. The learner passively receives the knowledge transmitted by the teacherwhich is at the center of the learning process. The user cannot interact in any way with the materialand (s)he has almost to search the entire video to find a specific content. Moreover, filming a teacherin the classroom while (s)he is giving a traditional lecture, without special constraints to his/hermovements or speaking, has still several advantages. First of all, this approach does not requirethe teacher to change his/her didactical practice, as the lecture is located in the classroom whereteaching is more natural than in studio; moreover it enables the universities to obtain, in a short time,a rich repository of good quality learning contents they can offer on the e-Learning market [4].Video lectures follow the format of traditional face-to-face lectures, with a teacher who gives

his/her 1/2 h lecture with the support of slides. A lot of this material has already been producedand there is the need of preserving the investments made in the past. As for legacy systems [5], itcan be advantageous to migrate the video lectures into a more modern format that, on the one hand,enables the learner to become an active subject and, on the other hand, embraces largely adoptedstandards to facilitate the reuse of contents. In this way, the existing live experiences can be reusedfor defining e-learning processes following the new learning paradigms, such as blended e-learning[6]. The actual trend is to create short, at most 20min [7–9], online learning content including:

• text, graphics and movies;• a navigation scheme (generally in terms of a table of contents and/or buttons);• assessments.

Learning management systems need to interact with contents to identify the learner and recordinformation about the learning experience. To suitably reuse existing video materials exploiting newfeatures offered by advanced learning systems such as learning management systems and learningcontent management systems, we need to structure them as learning objects [8], the building blockson which the new learning technologies are based. Learning objects are characterized by differentgranularity levels, are appropriately combined, often adapting navigation on the learner profile, anddeployed into an online course.Starting from a document presentation used as the main teaching resource and provided as

input together with the video of the lecture, we aim at generating learning objects having thefollowing characteristics: the flow of the slides is synchronized with the video of the lecture;a navigational schema enables to surf between the contents; and a portion of the screen showsadditional information extracted from the notes of the associated presentation document. In theabsence of any automatic support, the migration of old video lectures to this format requires one tovisually examine the video to synchronize it with the slides and an index structure, a very tedious,error-prone and time-consuming activity.In this paper, we present an approach for migrating video lectures to multimedia learning objects

consisting of two main activities: the detection of slide transitions and the generation of the learningobjects. The video of the lecture is scanned to detect the changing of slides, while the learning objectmetadata and the slide pictures are extracted from the presentation. To support the migration of videolectures and the restructuring of their contents in terms of learning objects, a tool named VLMigrator(video lecture migrator) has been developed. VLMigrator adopts a slide change detection approachbased on the creation of a mask for highlighting the portion of the video depicting the slideand several specific metrics for the detection of particular conditions which occur during slide

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 3: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1501

transitions and camera motions. Users do not need to perform any programming task, but actingon a graphical user interface (GUI) they can generate one or more learning objects. The proposeddetection approach has been validated in a case study which discusses the results concerning themigration of an entire course and of sample lectures of five courses showing different situations ofrooms, lighting, teaching gestures and slide templates.The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3

describes the migration process, and Section 4 presents the features VLMigrator offers to supportthis process. Section 5 describes a case study, and finally Section 6 concludes.

2. RELATED WORK

Several commercial environments, such as Lotus Freelance [10] or PowerPointTM, provide supportfor recording the change times of slides. Exploiting this feature, simple applications can producesynchronized multimedia lectures by using timing information. This approach cannot be adoptedwhen the video lectures have been taken without recording the slide change times. When changetimes are not available, slide change detection is required to synchronize video and contents.Several approaches have been proposed in literature to detect slide transitions. Some of them are

general purpose and are based on video indexing techniques, such as video shots or scene changedetection, and others are ad hoc methodologies explicitly designed for slide change detection.The main methodology for detecting video shots or scene boundary consists of extracting and

then comparing one or more features from each frame of the video. In particular, different metricsare often used to evaluate the changes between subsequent frames, while thresholds are adopted todetermine whether changes take place. As an example, five metrics for scene change detection invideo sequences have been presented in [11].The critical issue with this kind of solutions is in tuning the thresholds: high thresholds increase

the number of misses and low thresholds increase the number of false alarms. To avoid this problem,Dugad and Ratakonda [12] proposed an interesting robust statistical approach for fixing the thresh-olds. The authors computed the frame-to-frame difference and compared this value with a rangelimited by two thresholds as follows: if the value is under the lower threshold, no transition isdetected; if it is over the higher threshold, a transition is signaled. In case the value is between thetwo thresholds, a deeper analysis is directly executed on the frames.Nagasaka and Tanaka [13] also described various frame similarity techniques, such as difference

between gray-level sums, sum of gray-level differences, difference between gray-level histograms,colored template matching, difference between color histograms, and �2 comparison of colorhistograms. They concluded that the most robust frame-to-frame difference method for detect shotchanges is �2.Mostefaoui [14]mainly addressed two issues: the shot change detection, presented as the segmen-

tation approach, and the semantic enrichment of video contents, presented as the stratificationapproach. It is important to note that the segmentation approach proposed is not enough sensibleto be used for slide change detection. At the semantic level, similar to that in our approach, slidesare used to annotate video segments.In [15] a system for analysis indexing and retrieval of video has been applied to the TREC-2002

video retrieval benchmark [16]. Shot boundary detection is also investigated. Because transitionsoften occur in a gradual way, the examination of subsequent frames can cause missing some

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 4: Migrating legacy video lectures to multimedia learning objects

1502 A. DE LUCIA ET AL.

detections. Thus, in this approach frames are compared at greater temporal distances, especiallywhen dealing with low-quality video materials.All these traditional shot boundary detection techniques have been applied for detecting ‘slide

transitions’ [17,18], yielding poor results in terms of recall and precision, two well-known metricsof the information retrieval field [19]. Indeed, this problem is far more complex: a slide transitionis often characterized by small changes; it does not present significant color changes [18], becausethe slides of a presentation generally have the same background. As a consequence, the use oftraditional approaches designed for the detection of scene or shot transitions causes the missing ofslide change detections. On the other hand, the teacher or camera movement can cause the detectionof false positives. This issue highlights the need to better exploit the typical characteristics of videolectures.Specialized techniques for detecting slide changes are adopted by systems such as the Classroom

2000 project [3], the Cornell Lecture Browser [17] and the Digital Lectures project [20]. Thesesystems aim at automatically capturing and indexing lecture contents and at producing a structuredmultimedia document. They require to capture the lecture in a constrained way, as the detectionapproach proposed in [21], where the camera position is fixed and a GUI is used to manually selectthe projected document area. These solutions are useful to create new learning material, but do notenable to reuse existing video lectures.The research proposed in [22] also aims at indexing videos of presentations that use electronic

slides. Similar to our approach, the system receives the video lecture and the slides as input anddetects slide transitions by matching the content differences in video frames to the content differ-ences in the slides. This solution is not appropriate for the legacy materials we need to manage,because it requires the lecture video to be taken by fixed cameras and the frame to contain the entireslide. The output is an indexed video lecture enhanced by replacing the content area of frames withthe projective transformed original slide images.To automate structuring and indexing, major research issues prefer to investigate layout and

content of video frames using various techniques, such as the detection of text regions in viewgraph,characters and words recognition, tracking of pointers and animations, gesture analysis and speechrecognition.Video optical character recognition (OCR) is a recent area of intensive exploration, not only for

detecting slide transitions but also for facilitating the matching of videos and electronic slides [13].The process of video OCR mainly includes the detection, segmentation and recognition of videotexts, not always balancing the greater computational efforts with better results.Syeda-Mahmood and Srinivasan [23] proposed a detection approach based on a combination of

word and phonetic recognition of speech, which exploits the order of occurrence of words in aphrase to return points in video where one or more subphrases used in the foil were heard. It isrequired that the teachers read the most significant words in the slide. The approach can correctlylink and index 65% of the electronic slides.Two cameras are adopted in [17]; an overview camera, capturing the entire lecture dais, and a

tracking camera, following the speaker. Since the first camera and the projector are fixed, problemsconcerning camera motion are not addressed and the screen area is known in advance.In our approach we also address the composition of contents in terms of learning objects. Many

of the previously cited works also aim at building structured hypermedia documents from videolectures and PowerPointTM presentations [3,13,24,25]. Similarly, numerous commercial, freewareand open-source tools have been developed in order to edit learning objects and their metadata

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 5: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1503

[26,27]. As an example, the learning object metadata generator [28] automatically extracts themetadata with minimal user intervention from HTML pages. It also creates a keyword/key phrasedatabase. SCORM [7] learning objects can be created using RELOAD, an open-source learningobject editor [27].The University of British Columbia proposed the learning tools project, aiming at supporting

teachers in the generation of learning objects for educational or research purposes [29]. It offersseveral tools, among them is the multimedia learning object authoring tool, which enables thecreation of learning objects by manually combining videos, audios, images and texts withoutprogramming. The approach we propose also gives to the content author the capability of composinglearning object and performs the synchronization between the learning object elements identifiedexploiting the slide change detection features offered by the proposed tool.

3. THE MIGRATION PROCESS

In this section, we present an overview of our approach for the migration of video lectures towardsmultimedia learning objects.Figure 1 illustrates the legacy lecture reengineering process in terms of activity diagram with

object flow, where the rounded rectangles represent process activities, while the rectangles represent

Figure 1. The video lecture re-engineering process.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 6: Migrating legacy video lectures to multimedia learning objects

1504 A. DE LUCIA ET AL.

the intermediate artifacts generated during the process phases. An actor symbol denotes that anactivity is interactive.The materials provided as input to the process are a video lecture and the associated presentation

document. The Information Extraction phase groups several simple operations enabling to takeout the table of contents and the slide images from the presentation document and to produce aresized version of the video aiming at reducing the bandwidth needed to play the learning object.The Mask Setting phase receives the video lecture as input and interactively creates a mask forhighlighting the part of the video depicting the slides. During this phase a threshold is set to cut theslide projection area and is provided as input, together with the video lecture, to the Slide ChangeDetection phase which identifies the frames where a slide change occurs. A list of these frames isprovided as output in the form of playable video segments. The user is then involved in a validationactivity consisting in verifying the association between images and video segments proposed bythe tool. In particular, the user can discard false positives or split video segments containing missedtransitions. The Learning Object Generation phase composes a learning object by selecting thevideo segments. The associated information (metadata, slides, notes, etc.) are also used to create it.The symbol ∗ on learning object provided as output denotes that multiple occurrences of this objectcan be generated according to user directives. The phases composing the re-engineering processare better detailed in the following subsections.

3.1. Information extraction

The input materials, a video lecture in AVI format and the corresponding presentation document,have to be manipulated to obtain the required learning objects. To this aim, we translate the educa-tional contents into an XML-based document. From this document we generate a representationcompliant with a highly accepted standard, the learning object metadata [28,30].A presentation document is the source of descriptive information about the learning contents.In particular, to create the navigational schema, we extract the table of contents from it. Other

information such as author name, title, date of creation, etc. can also be derived. Moreover, we obtaina snapshot of each slide in jpeg format; thus we are able to rewrite the slides in a cross-platformfashion. The presentation formats that have been considered are PowerPointTM and Open Document[31]. In case the presentation document is in PowerPointTM format, to obtain this information weexploited the component object model framework [32], defining how objects interact within a singleMicrosoft application or between applications. The tools also support the PDF format, even if itis more difficult to extract the slide title. Thus, when using this format the table of contents isgenerated in a semi-automatic way and has to be validated by the user in the validation phase.To obtain an efficient transmission in streaming modality, the video lecture has to be exported in

a low-resolution format. It is important to point out that even if the video has not a direct contributeto the understanding of the lecture, it strongly reduces the loneliness sensation of remote students.As a matter of fact, communication involves several aspects such as the body and facial languages.Thus, the video is resized to be shown in a little window during the learning object deployment.

3.2. Mask setting

During a lecture, a teacher often makes some movements that could incorrectly influence theslide change detection. To solve this problem, the slide detection has to be performed by taking

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 7: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1505

into consideration only the slide area. To properly exploit the brightness characteristics of slideprojections, the mask setting phase requires the user to interactively set the brightness threshold, �.Let us define a video frame as a matrix of n×m×3 pixels, ranging from 0 to 255, in the RGB

color space. The mask is created by applying � to the video frame and excluding pixels whoseaverage intensities, computed on the three-color band, are lower than this threshold. When settingthe value �, the user can examine how a short portion of the video is masked. Figure 2 shows someexamples of the application of a mask to frames containing the same slide but characterized bydifferent teacher positions.To retail the slide from the rest of the scene, we exploit a common peculiarity of the images

representing the slides that are characterized by higher pixel intensity values. As an example,Figure 3(a) shows the result of the intensity analysis on the frame in Figure 3(b). It is interesting totake the value of the pixels corresponding to the lines in Figure 3(b) and analyze their brightness.The circled regions in Figure 3(a) correspond to the screen area and have the highest intensityvalues. This is a common feature of this kind of video materials because of the low classroomambient lighting needed to have a good readability of the screen. As a consequence, to detect theportion of the image depicting the slide, we adopt a cut threshold � on the minimum brightnessvalue of the pixels to analyze.It is important to point out that the direct application of the threshold � to the circled regions

in Figure 3(a) has as drawback the exclusion of many pixels representing words, pointed out by

Figure 2. Examples of frame masking.

Figure 3. Frame brightness analysis.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 8: Migrating legacy video lectures to multimedia learning objects

1506 A. DE LUCIA ET AL.

Figure 4. The masking of a frame.

local minima. This could cause to miss the detection of transitions. Thus, to avoid losing interestinginformation, the image is blurred before applying the threshold �. In particular, we adopt a low-pass averaging filter to hide details [33]. Figure 4 shows the process performed to mask a frame.In particular, Figure 4(b) shows the blurred version of the image depicted in Figure 4(a). Let usnote that high-frequency details, such as words or diagrams, are hidden. This result is obtained byapplying to each pixel imgt (x, y) of the frame numbered t , the blurring function blur(imgt (x, y))representing the average of the pixel values in the square centered on it. The blurring filter is adaptedto the dimension of the frames by sizing it on the square root of the shortest frame dimension. Ifthe pixel imgt (x, y) is on the border line, the square is padded with 127, the average of the values apixel may assume. At this point, the threshold � is applied to the blurred frame and a binary maskis created, hiding the pixels whose brightness value is lower than �. As a result, the teacher andhis/her movement are masked, while the thin dark details, often representing textual informationon slides, are not masked. In particular, given an image imgt (x, y), representing the frame numbert , the mask is defined as follows:

maskt (x, y)={1 if blur(imgt (x, y))≥�

0 if blur(imgt (x, y))<�

As an example, Figure 4(c) shows the binary mask obtained by applying � to the blurred frame inFigure 4(b). Figure 4(d) shows the masked frame that highlights only the slide area.We experimented several masking construction algorithms with the aim of allowing some degrees

of freedom about the input video resolution. The mask shape and size are robust with respectto the selected resolution, as shown in Figure 5, where the same sample is masked at differentresolutions. The sizing function chosen for our approach is a good compromise between robustnessand execution time. Indeed, we experimented other sizing functions, such as the cube root, whichdoes not properly retail the frame, even if better performances in terms of execution time can beobtained.It is important to point out that the masking technique is also applicable when lectures are

characterized by content expressed with bright text on a dark background, as in the case of Figure 6.In particular, Figure 6(b) depicts the intensities of pixels underlying the line r in Figure 6(a).Examining the intensity of the pixels, the portion of them depicting the screen area, in the circledregion, can be easily identified.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 9: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1507

Figure 5. The masking of a frame with different video resolutions.

Figure 6. Frame brightness analysis in the case of dark background.

Figure 7. Examples of frame masking with dark background.

In this case, the masking is even more effective than in the case of white background, becausebright characters are not cut away by high thresholds as the dark ones do causing the hiding oftextual details. Indeed, when the text is brighter than the background, high thresholds do not excludedetails, as depicted in Figure 7(c).

3.3. Slide change detection

The slide change detection phase, together with the video lecture, receives the mask parameter asinput in such a way to apply the detection process only to the slide area.As Figure 8 shows, this phase is organized in three sub-phases: start detection, end detection and

validation.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 10: Migrating legacy video lectures to multimedia learning objects

1508 A. DE LUCIA ET AL.

Figure 8. The slide change detection sub-process.

A slide change is an event that is spanned among several frames. Thus, the start detectionand the end detection sub-phases aim at identifying both the initial and the last frames where atransition occurs, respectively. The final validation phase requires the user intervention to examinethe segments’ list and eventually correct the results, as better detailed in Section 4.1, where theVLMigrator tool is presented. This approach enables one to obtain a transition list composed ofpairs of times which will be used to properly associate a video portion with the corresponding slide.To correctly detect slide transitions, our approach takes into account two kinds of indicators, the

slide change and camera motion indicators. The former is useful to determine the slide changes,while the latter detects the presence of a camera motion which could be responsible for erroneousslide change detection.

3.3.1. Slide change indicators

Slide or scene change detection approaches generally use a testing metric computed from twoimages and a decision is made by applying a threshold to the measured value [34]. Each metricis specific for the detection of particular conditions that occur during a slide transition and has adifferent sensitivity to camera motion. Thus, we decided to explore several approaches combiningthe more effective of them considering the video lecture characteristics of our materials. It is worthmentioning that each frame, following the RGB model, is originally composed of one image foreach color band, red, green and blue. The slide change detection indicators are evaluated consideringthe picture avg imgt (i, j), where t is the frame number, and obtained calculating the average of thethree color bands as follows:

avg imgt (i, j)=imgt (i, j)[Red]+ imgt (i, j)[Green]+ imgt (i, j)[Blue]

3

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 11: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1509

Figure 9. Examples of frame differences.

The first indicator of dissimilarity between two frames is the sum of pixel-to-pixel difference, whichis very sensitive to camera motion and also to fluctuations in pixel sampling error. As shown inFigure 9(a), when the difference metric is applied to two frames depicting the same slide, no nullresults are obtained only in the teacher’s moving area. Vice versa, if the difference is computedduring a slide transition, as depicted in Figure 9(b), the result of the comparison shows both teachermovements and slide differences. Even if Figure 9 does not show the mask, let us note that, in bothcases, the teacher movement area does not influence the detection, because it is hidden as shownin Figure 2.We started considering two difference metrics �1(t) and �2(t) defined as follows:

�1(t) =n∑

i=1

m∑j=1

(maskt (i, j))(maskt−1(i, j))|avg imgt (i, j)−avg imgt−1(i, j)|, t>1

�2(t) =n∑

i=1

m∑j=1

(maskt (i, j))(maskt−1(i, j))I (|avg imgt (i, j)−avg imgt−1(i, j)|>�), t>1

I (p) ={1, p= true

0, p= false

In practice, �1(t) quantifies the amount of the variation between two consecutive frames, while �2(t)counts the number of pixels whose intensity difference is greater than a threshold � (computed asdetailed in Section 3.3.2). The two metrics �1(t) and �2(t) are computed by analyzing the video ofthe lecture and a transition is detected if �1(t) or �2(t) exceeds their detection threshold, computed asmedian(�1(t))+4�(�1(t)) and median(�2(t))+4�(�2(t)),1≤ t≤number of frames, respectively[12]. Let us note that both metrics are independent from the kind of background (dark or white)because they are computed considering the absolute value of the difference in the pixel intensities.Indeed, for every image img(i, j), let us define the negative image n img(i, j) as 255− img(i, j).Thus, img(i, j) and n image(i, j) have opposite pixel intensities and if the former depicts whitetext on a darker background, the latter depicts dark text on a white background. Consequently, theuse of img(x, y) or n image(x, y) does not affect the metrics �1(t) and �2(t).It is also worth mentioning that the considered metrics obtain satisfactory results also with low-

resolution frames, enabling to balance the detection sensitivity with the computational effort. As anexample, Figure 10 applies the metric �2 to six samples, representing the same video at differentresolutions. All the transitions, represented by a cross on the x-axes, are detected starting from thehighest resolution until a resolution of 45×36 is considered.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 12: Migrating legacy video lectures to multimedia learning objects

1510 A. DE LUCIA ET AL.

Figure 10. The metric �2(t) computed on different resolutions.

3.3.2. Setting the difference threshold �

The value of � has been experimentally determined considering that classroom projectors displaythe slides in terms of a discrete set of intensities belonging to the RGB color space. A requirementfor an effective projection of slides is to take into account the eye limits in discriminating amongthe different intensities which depicts text, images and background. To this aim, the layout of thepresentation should be designed in such a way to improve readability and to obtain a good contrastbetween contents and background. In any case, a lecturer could associate a wrong combinationof colors with presentation. In our experience, limited to our classroom environments, the bestprojection results are reached adopting black text on a bright background. The results presented inthe following are also valid in case the projected slides have a template characterized by bright texton a darker background.Videos, as all digital sampled quantities, are affected by quantization noise that cause a little

oscillation of the pixel intensities. � is introduced to avoid to interpret this fluctuation as a variation.To determine a first upper bound to the value of �, it is necessary to examine the difference

between the pixel intensities of two frames. In Figure 11(a) the difference between two frames

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 13: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1511

depicting different slides is shown. The intensities of pixels corresponding to the white line inFigure 11(a) are measured and their values are shown in Figure 11(b). It is evident, as shown inthe lower part of Figure 11(b), that the quantization added a little noise on the background (lessthan 3), but text rows are easy to distinguish if � is less than 20.All previous consideration on the values of � are confirmed in an experimental way. For the sake

of simplicity, let us consider the analysis of four video samples containing only one transition andthe results obtained varying the value of �.Table I reports for each sample the number of frames composing it, the frame number in which

the transition occurs and the frame size.Figure 12 reports the �2 values obtained by analyzing the four video samples described in Table I.

In particular, the rows of Figure 12 are numbered from 1 to 4 and depict the experiments performedon each video sample. Each column represents the experimental results on the video samples, when �varies in {1,7,15,30,70}. In each picture the horizontal line is drawn at median(�2(t))+4�(�2(t)),where 1≤ t≤number of frames [12], and represents the detection threshold.The column labeled (�=1) clearly shows how the quantization noise greatly affects the detection.

Indeed, the results obtained on all the samples show that the adaptive detection threshold is toohigh because of the influence of the noise on standard deviation. Consequently, the slide changesare missed in all the four cases. The results obtained setting � equal to 30 and much more in caseof �=70 confirm that � cannot exceed the value of 20 without impacting on the sensitivity of thedetection. Results reported in columns labeled with (�=7) and with (�=15) of Figure 8 show how

Figure 11. An example of frame difference and its pixel intensity graph.

Table I. The four video samples used for the computation of �2 in Figure 12.

Sample Number of frames Transition frame Frame size

Sample 1.avi 441 234 304×224Sample 2.avi 378 241 304×224Sample 3.avi 234 201 608×248Sample 4.avi 467 268 608×248

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 14: Migrating legacy video lectures to multimedia learning objects

1512 A. DE LUCIA ET AL.

Figure 12. The metric �2(t) computed on the four samples in Table I for �=1,7,15,30,70.

the value of �=10, chosen in the range [7,15], can effectively cut off the quantization noise withoutdegrading the detection performances.

3.3.3. The statistical metric �2

Erroneous slide detections can occur in two cases: a false-positive detection is generated when weidentify a transition that does not occur and a false-negative detection occurs when we fail to observea slide transition. To overcome this problem, the slide detection sub-process includes an interactivefinal phase in which the user, starting from a preview of the slide and the associated frame, correctsthe erroneous transitions. In particular, if a false positive is detected, the transition is discarded,whereas if the user identifies a missing detection, the process is run again on the portion of videocontaining the transition. This video segment is reanalyzed reducing the detection threshold anddisabling the camera motion indicators, as better detailed in Section 3.3.4. The detection sensitivityis improved by decreasing the multiplier of �. This action may cause the detection of false positives,but the reduced detection threshold is used only on a very small portion of the video lecture so thatthe number of new false positives is quite small.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 15: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1513

It is important to point out that, in general, video materials are taken with a mobile camera. As aconsequence, difference metrics, as �1(t) and �2(t), could induce false-positive detections [23] incase of camera motion.To overcome this problem, we combine them with the statistical metric �2 [34,35], which is

computed as follows:

�2(t) =255∑i=0

(Hi (t)−Hi (t−1))2

Hi (t)+Hi (t−1), t>1

Hi (t) =m∑i=1

n∑j=1

(maskt (i, j))I (imgt (i, j)= i)

This metric associates an histogram with each frame, whose columns depict the number of pixelshaving a brightness value x , x ∈[0,255]. These histograms are used to better establish the variationsbetween two frames, obtaining robustness against camera motion.Figure 13 depicts an experiment conducted on a sample of about 6000 frames, where one slide

transition occurs together with four strong shot variations.In particular, the upper row of Figure 13 shows, from left to right, the results obtained from

the application of �1(t), �2(t) and �2, respectively, to the same video sample containing oneslide transition and four camera perturbations. To detect a slide change, our approach uses theincremental ratio computed on these three metrics, because providing a measure of the changespeed is more suitable to detect slide transitions. Indeed, we experimentally verified that cameramovements and shoot zoom changes happen more slowly than in slide changes. The lower rowof Figure 13 shows the evaluation of the incremental ratios ��1(t)=|�1(t+1)−�1(t)|, ��2(t)=|�2(t+1)−�2(t)| and ��2(t)=|�2(t+1)−�2(t)|. The horizontal line in each diagram of Figure 13represents the detection threshold at median(��1(t))+4�(��1(t)), median(��2(t))+4�(��2(t))and median(��2(t))+4�(��2(t)). Comparing the graphics representing the metric value and thecorresponding incremental ratio, it is evident that the shot variations erroneously considered as slidechange are less relevant when the incremental ratio is considered.It is important to point out that the slides of a lecture presentation generally adopt the same

background, and the text occupies the same percentage of the screen. As a consequence, thehistogram technique could be inadequate to detect slide changes or to decide whether a strongvariation of the metrics is due to a change or due to camera motion. Conversely, camera motions orshot variations can easily alter the distribution of pixel values generating histogram which evidencefalse positives, in particular when the screen is not fully contained in the frame or when the operatorzooms in/out the scene.

3.3.4. Camera motion indicators

As shown in the right column of Figure 13, the �2 metric is useful to discard only one falsepositive on four. In fact, �2 is not able to resolve all false positives, due to camera motions or shotvariations. To avoid this kind of false positives, we decided to improve the detection by adoptingfurther indicators to inhibit the detection sensitivity when a perturbation occurs. These indicatorsare evaluated on the mask, extending its usage to understand the relative position of the screen

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 16: Migrating legacy video lectures to multimedia learning objects

1514 A. DE LUCIA ET AL.

Figure 13. Examples of metrics sensitivity to camera motion.

Figure 14. Mask regularization.

inside the frame. To this aim we operate a closing operation on the mask to regularize its shape[33]. In this way, possible intruding objects are excluded from the masking area, as depicted inFigure 14(b), where the shape of the teacher in Figure 14(a) is not visible anymore.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 17: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1515

Figure 15. Application of camera motion indicators.

The regularized mask of Figure 14(b) is segmented in its 4-connected components‡. Figure 14(c)shows these components, which are labeled from 1 to 4. Successively, the largest component, labeled1 in Figure 14(c), is selected and its position is evaluated.In particular, the mask position is described by the minimum and maximum coordinates of

unmasked pixels, representing the opposite corners of the smallest rectangle containing the mask.An approximation of the mask position and size is given by the coordinates of the rectangle centerand the diagonal length. Thus, it is possible to deduce whether the video operator is changingthe zoom factor or panning/tilting the camera by comparing the mask centers and diagonals ofthe considered two frames. In the case camera motion indicators signal camera perturbation, thesensitivity of the slide change indicators is proportionally inhibited. The enrichment of the slidechange detection with the mask position indicators can avoid several false positives, as shown inFigure 15.Indeed, the application of metric �1 in Figure 15(a) detects four false positives; Figure 15(b)

shows the results obtained weighting ��1 incremental ratio, with the distance between the twomask centers and the diagonal difference.

3.3.5. The slide change detection algorithm

All considerations made in the previous subsections contribute to formulate an algorithm for theslide change detection. The activities are performed in two phases: the first one exploits the cameramotion indicators and the second one starts after the user had found a missed detection and requireda correction.In this section, we describe only the first phase of the adopted algorithm because the second one

consists of only a subset of the operations executed during the first phase. In particular, the secondphase inhibits camera motion indicators and increases the detection sensitivity by acting on thedetection thresholds.The organization of the pseudo-code reflects the structure of the java implementation. Some

actions, which are not significant for the comprehension of the algorithm, are resumed via the callcompute, assuming to have a global scope for the results of these operations.

‡A pixel p(x, y) has four horizontal and vertical neighbors: p(x+1, y), p(x−1, y), p(x, y+1) and p(x, y−1). For a binaryimage, two pixels are connected if they are neighbors and have the same value 0,1.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 18: Migrating legacy video lectures to multimedia learning objects

1516 A. DE LUCIA ET AL.

In Figures 16 and 17, the detect change and the detect time list functions areshown.The function detect time list receives the brightness threshold � and the video lecture

as input and scans it by invoking detect change and the functions create mask,mask center, diagonal length and camera motion used to detect camera perturbations.The function is transition, sketched in Figure 18, compares detection parameters with

thresholds and determines whether a transition occurs. The relevance among parameters deducedfrom experimental results is implemented in the proposed comparing strategy. In particular, the

Figure 16. The function detect change.

Figure 17. The function detect time list.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 19: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1517

Figure 18. The decision tree underlying the function is transition() of theslide change detection algorithm.

highest relevance is given to the parameter ��2. As better detailed in Section 3.3.3, the �2 metricis partially useful to discern slide transitions because the intensity distribution of text among slidesdoes not always differ.For this reason, the metric �2 is used only to resolve conflicts among �1 and �2, as shown in

Figure 18. This is also possible because the first sub-phase of the algorithm adopts the cameramotion indicators to avoid false detection due to camera perturbations.

3.4. Learning object generation

The learning object generation phase associates a slide of the presentation with the correspondingpart of video, according to the transitions previously detected. The transition list is composed ofpairs of times representing the begin and the end of a transition. We associate the slides 1,2, . . . ,nwith the portions of video [0, t1end], [t1start, t2end], . . . , [tn−1start, tnend]. In this way, we obtainan overlap between the portions of video we associate adjacent slides to avoid the lost of speechinformation.The data extracted during the information extraction phase are used to fill the learning object

metadata. To obtain the desired granularity, this phase also enables to fragment the table of contents.In this way a video, which lasts about 1 or 2 h, can be partitioned into smaller units.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 20: Migrating legacy video lectures to multimedia learning objects

1518 A. DE LUCIA ET AL.

4. THE MIGRATION TOOL

In this section we present an overview of VLMigrator, a tool developed in Java for the migrationof video lectures towards multimedia learning objects. Figure 19 shows the layered architecture ofthe system that includes the presentation layer, the application logic layer and a repository.The presentation layer is implemented by a GUI, which supports the definition of the mask,

the validation of the slide change detected and the generation of learning objects. In particular,VLMigrator adopts a fast way of showing slides, associating them with the detected video segments,discarding possible erroneous or missing detections.The GUI enables to access to the application layer of VLMigrator. This layer is composed of

three subsystems, namely the Slide Change Detection, the Learning Object Generation moduleand RELOAD [27], an open-source learning object editor. In particular, the slide change detectionmodule semi-automatically detects slide transitions. To this aim it scans the video lecture andproduces a list of times where a change of slide is detected. This sub-system uses the libraries offeredby JavaMedia Framework (JMF) [36], an open-source framework that provides a unified architectureand messaging protocol for managing the acquisition, processing and delivery of time-based media.VLMigrator exploits these libraries to manipulate video lectures in AVI format. In particular,VLMigrator instantiates a JMF pass-through codec to access the frames of the video lectures and tocontrol the video playing [36]. The learning object generation module allows fragmenting the tableof contents in order to create several learning objects. For each learning object, the correspondingvideo lecture portion and the associated slides are then synchronized according to the detected slidetransitions. To reach this result, we adopted the markup language SMIL (Synchronized MultimediaIntegration Language) [37]. RELOAD receives the SMIL description of the multimedia presentationas input provided by the learning object generation module and creates a SCORM [7] compliantlearning object.

Figure 19. VLMigrator architecture.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 21: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1519

The materials provided as input to VLMigrator are video lectures and the associated presentationdocument, both taken from the repository, which also stores all the outputs produced byVLMigrator.In the following two subsections we describe the main functionalities provided by VLMigrator.

4.1. The slide change detection module

During the mask setting phase, the user involvement is required to restrict the detection to the pixelsof the video lecture belonging to the slide area. The user sets the value of the threshold � interactingwith a slider in a standard java GUI and receives an immediate feedback about his/her choices ona running preview of the masked video frame, as shown in Figure 20.It is worth noting that also a non-expert user is able to select an effective threshold. Indeed, the

operator has to set the mask in such a way to highlight only the slide area. As shown in Figure 21,the preview of the mask drives the user to choose value around 126.If we apply the different masks in Figure 21 to a sample lecture and observe how the detection

algorithm behaves, we obtain different detection performances in terms of precision and recall [19].In our case, the recall is the ratio of the number of slide transitions correctly identified by the toolto the total number of slide changes depicted in the video lecture. The precision is the ratio of thenumber of slide transitions correctly identified to the total number of retrieved slide transitions.

Figure 20. Parameter settings for frame masking.

Figure 21. Different masks obtained at different � values.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 22: Migrating legacy video lectures to multimedia learning objects

1520 A. DE LUCIA ET AL.

The considered sample contains five transitions in 8120 frames. Figure 22 depicts the detectionresults obtained in correspondence to each value of � selected in Figure 21. In Figure 22 the x-axisrepresents the frame number and the y-axis represents ��2(x). Crosses on the x-axis represent validtransitions, while the horizontal line represents the detection threshold. The choice of a � value near126, as shown in the right-hand lower part of Figure 22, produces a recall 1 and precision 0.63.As reported in Figure 22, in the subplot correlating precision and recall while � varies, recall is

more tolerant to wrong mask settings than precision.As a consequence, false negatives are obtained when the mask is chosen very badly, such as

for � values around 168. Conversely, precision is more sensitive to the selected � values, but theconsequences of a wrong choice are less invasive, thanks to the validation interface described below.Once � has been selected, the user can start the slide change detection phase, which is performed

in an automatic way. A list of the video segments delimited by transitions is provided as output.VLMigrator automatically associates each video segment with a slide, following the sequentialordering. Erroneous bindings can occur in case of false positive, false negative or whether theteacher breaks the slide sequential order during the lecture.As shown in Figure 23, the validation interface is composed of three parts.The player in Figure 23(a) shows the first frame of the video segment, also highlighted in the

list view, depicted in Figure 23(b). The player enables all user actions on the selected video. InFigure 23(c) a preview of all the slides is shown. When a video segment (or a slide) is selected,the associated slide (or video segment) is highlighted. To enable the validation of the associationsautomatically performed by VLMigrator, an enlarged view of the picture is shown to the user whenthe mouse pointer passes over a slide in Figure 23(c).

Figure 22. Detection performances obtained by varying �.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 23: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1521

Figure 23. The validation interface.

The player is useful when the user needs to examine a video segment. This interface providessupport to visually compare the slides with the associated video segment and make correctiveactions, such as merging video segments for discarding a false positive by merging two adjacentsegments, manipulating the associations freeing the two elements of the association (Unbind) orcreating a new association (Bind).In case the user supposes that a false negative occurs in the considered video segment, the tool

allows for running again the slide change detection only on the segment (Reanalyze). Reanalyzinga small portion of the video enables one to increase the sensitivity of the detection algorithmby reducing the detection threshold. In particular, this deeper analysis of the video is carried outdecreasing the multiplier of � by a unit.To avoid to lose transitions happening during camera panning or zooming, VLMigrator inhibits

camera motion indicators when reanalyzing a video segment to search for a missed transition.If the user needs to manually search for a missed transition, (s)he can activate the player on thesegment containing it and split the selected segment into two parts, which can be associated withnew selected slides.

4.2. The learning object generation module

The learning object generation module enables the organization of the available learning resourcesin terms of learning objects. During the previous phases, the input materials, a video lecture inAVI format and the corresponding presentation document have been manipulated to obtain theassociation between basic video segments and slides.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 24: Migrating legacy video lectures to multimedia learning objects

1522 A. DE LUCIA ET AL.

In addition, the video lecture (audio and video tracks) has to be rearranged to obtain an efficienttransmission in streaming modality. To this aim we need to obtain a low-resolution video and high-quality audio. When available, we extract the timing of the lecture from the presentation documentto synchronize the table of contents, video and slides. When this information is not recorded, it isnecessary to gather it by performing the slide change detection.VLMigrator allows the user to freely organize the entire lecture in several self-consistent learning

units. In this way, it is easier to reuse and compose small contents. On the other hand, this choiceenables one to better trace the student activities and evaluate his/her progress by providing a testsession at the end of each unit. This fine-grained organization also enables one to create personalizedlearning paths. To this purpose, VLMigrator provides a GUI aiming at composing learning objects.Contents are easily organized, as shown in Figure 24, by accessing the extracted table of contents

and indicating the entries in this list (s)he needs to include in each produced learning object, asshown in Figure 24(a).Figure 24(b) graphically represents the selected video segment and the associated slide. Author,

title, paragraphs and other textual data representing the information extracted to fill in the learningobject metadata are shown in Figure 24(c). The notes contained in the presentation document areavailable for editing in Figure 24(e) and further comments can be added in the text area as shownin Figure 24(f).To compose and synchronize video and extracted contents, VLMigrator automatically builds a

time-based, streaming multimedia presentations that combine audio, video, images and text usingSMIL [37]. The input lecture is converted into RealMediaTM format, because it is easy to embed

Figure 24. The learning object composition GUI.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 25: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1523

Figure 25. Learning object deployed by Moodle.

RealMediaTM player in HTML pages. The advantage mainly consists in adopting the same tech-nology able to play both SMIL synchronized contents and RealMediaTM video files.VLMigrator has been integrated with RELOAD [27], an open-source software tool for the

authoring of standard-compliant learning objects. RELOAD is particularly attractive because it isuser friendly and, especially, because it adheres to SCORM [7], a reference standard for instruc-tional content. In particular, SCORM prescribes the use of metadata descriptors to store additionalinformation enabling to search, reuse and evaluate learning resources. The process of manuallyentering metadata is time consuming and further requires the administrator/author to be familiarwith the learning object content. Thus, VLMigrator extracts information from the data sources.Contents are aggregated and are used to partially fill in metadata in such a way that the contentmanager can easily hide information or integrate contents when needed.In this way, the video of the lecture is embedded as a resource in the resulting learning object

and the main html page hosts a standard player to render the SMIL files. An index provides directaccess to the slides, which are synchronized with the video of the lecture.The produced learning objects can be played by using any SCORM compliant system. During

our project we adopt the Moodle course management system [38]. A screen shot of Moodle isshown in Figure 25.

5. CASE STUDY

The University of Salerno is involved in the ‘Campus Campania’ project, founded by the RegioneCampania, aiming at promoting first-level degree university programs concerning priority fieldssuch as e-business and e-commerce. The novelty and strength of the Campus Campania project are

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 26: Migrating legacy video lectures to multimedia learning objects

1524 A. DE LUCIA ET AL.

warranted by the adoption of pure e-learning didactical modality. Video lectures and the relatedpresentation documents were available for different courses. However, to provide an effectivelearning experience via Web, the need of restructuring these contents was evident. To respect theproject deadline, it was necessary to reuse this material, which was in MPEG2 format and had tobe converted into AVI, because we experienced some trouble handling MPEG2 with JMF.Actually, the migration of the video lectures of five courses of the computer science program at

the University of Salerno is ongoing. In particular, the software engineering and operating systemscourses, each composed of 24 video lectures lasting about 2 h, have been entirely migrated. Theperformance results have been collected in terms of precision and recall for the entire SoftwareEngineering course. To better validate the proposed method and the tool considering differentsituations of rooms, lighting, teaching gestures, slide template and shooting style, for each of thefive courses, we have detailed the performance results obtained on a sample lecture and tracked thevalidation time of 10 sample lecture hours.Concerning the migration of the software engineering course, Figure 26(a) and (b) depicts the

results obtained without and with the adoption of camera motion indicators, respectively.As shown in Figure 26(b), camera motion indicators cause a significant improvement in precision

with a light worsening of recall due to the presence of some transitions occurring during the cameramotion.Tables II and III show some statistics concerning a sample composed of a single lecture for each

of the considered five courses, while the validation times concerning the entire sample composedof 10 lecture hours for each of the five courses are compared in Figure 27. As shown in Table II,each lecture is divided into two parts, part A and part B, because the original videotape lasted 1 h.As depicted in the first row of Table II, the selected samples of video lectures are characterized bydifferent environmental and shooting characteristics.The results of the detection performed by VLMigrator on the lectures considered in Table II are

summarized in Table III, where statistics on the detection based only on ��1, ��2 and ��2 metricsare compared with the results obtained adopting also camera motion indicators.As reported in Table III, the detection made using ��1, ��2 and ��2 metrics on the lecture of

software engineering I course (part A) produced 1 as recall, but this result is affected by a lowprecision (0.27). The adoption of camera motion indicators improves the precision to 0.64. As a

0 5 10 15 20 25 30 35 40 450

0.2

0.4

0.6

0.8

1

Lecture

precision

recall

0 5 10 15 20 25 30 35 40 450

0.2

0.4

0.6

0.8

1

Lecture

precision

recall

(a) (b)

Figure 26. Slide change detection results concerning the software engineering course.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 27: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1525

Table II. Details of a sample video lecture for each course.

Software Software Programming Programming Operatingengineering engineering languages languages systems,I, course A II, course B I, course C II, course D course E

Part A Part B Part A Part B Part A Part A Part A Part B Part A Part B

Number offrames 63976 46833 75866 66133 54363 58445 53775 53650 54363 40871Number ofslides 16 9 38 40 28 34 28 29 17 18Number oftransitions 18 33 40 42 32 34 30 31 29 29

consequence, the number of user interventions needed to correct the results is reduced. On the otherhand, the recall is reduced to 0.88, because of the combination of camera perturbation and slidetransitions, which results in the detection of false negatives. Concerning the usage of light writingon a dark background, the lectures of the course D reached satisfying results in terms of precisionand recall when compared with the results of the other lectures.The performance improvement due to the adoption of camera motion indicators has also been

reached in all the other samples. Indeed, as Table III shows, precision always increases when theyare applied. This enhancement has as drawback: a recall decreasing when a transition occurs duringa camera motion. Let us note that the worst results concerning the recall have been obtained by theoperating system course, due to the presence of very similar slides, angular view of the screen andfrequent camera motions, as a manual examination revealed.We also noted that course D gradually explains algorithms adopting animations. The detection

algorithm is not sensitive enough in case of line or word changes. However, it detects a transitionwhen the entire slide has been shown and changed.To compare the performance differences obtained in the various courses, we compute for each

lecture the validation effort metric as the number of user interventions on the lecture normalizedby the number of transitions, as shown in Table III.The number of user interventions is due to the correction of both the false positives, performed

by merging adjacent segments, and the false negatives, which require the segment reanalyzing.The duration of a reanalyze operation is proportional to the segment length and is, in general,more expensive. Moreover, reanalyzing disables the camera motion indicators and additional falsepositive could be generated and have to be corrected. As an example, the gray column in Table IIIreports two false negatives and nine false positives. The 16 user corrections reported were due tonine ‘merge’ to correct the original false positives, two ‘reanalyze’ of the video segments in which afalse negatives occur and five additional correction of false positives generated by the reanalyzing.The validation effort metric enables one to compare the detection performances when the algo-

rithm is applied to different environments and shooting styles. In particular, by examining Table III,

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 28: Migrating legacy video lectures to multimedia learning objects

1526 A. DE LUCIA ET AL.

TableIII.Slidedetectionresults

concerning

thelectures

inTableII.

Softwareengineering

Softwareengineering

Programminglanguages

Programminglanguages

Operatin

gsystem

s,I,course

AII,course

BI,course

CII,course

Dcourse

E

PartA

PartB

PartA

PartB

PartA

PartB

PartA

PartB

PartA

PartB

#+

#+

#+

#+

#+

#+

#+

#+

#+

#+

Transitions

1833

4042

3234

3031

2929

Right

transitio

nsdetected

1816

3333

4038

4040

3231

3434

3030

3029

2928

2525

Falsepositiv

e48

913

531

1328

710

413

410

514

619

824

7Precision

0.27

0.64

0.71

0.87

0.56

0.74

0.54

0.85

0.76

0.88

0.72

0.89

0.75

0.86

0.68

0.83

0.6

0.77

0.51

0.78

Recall

10.88

11

10.95

0.95

0.95

10.97

11

11

0.97

0.93

10.96

0.86

0.86

Userinterventio

ns48

1613

531

1833

1210

713

410

517

1019

1136

19Userinterventio

ns/

transitio

nsnumber

2.7

0.9

0.4

0.2

0.7

0.4

0.8

0.3

0.3

0.2

0.4

0.1

0.3

0.2

0.5

0.3

0.6

0.4

1.2

0.6

Validationtim

e(inmin)

—21

—6

—22

—17

—10

—5

—6

—15

—15

—27

Validationtim

e/lectureduratio

n—

0,5

—0,2

—0,43

—0,37

—0,28

—0,13

—0,17

—0,40

—0,42

—0,98

#,detectionusing

�� 1

,�

� 2and

��2;

+,detectionusing

�� 1

,�

� 2,

��2,andcameramotionindicators.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 29: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1527

A B C D E0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Val

idat

ion

Tim

e/Le

ctur

e D

urat

ion

Courses

Figure 27. Validation times on the selected samples.

it is possible to argue that the less error-prone video style is obtained for courses B and C. Amanual examination of these video lectures revealed that, in both cases, the operator took the entireslide together with the teacher for the entire lecture, and he/she did not change the shooting duringthe lecture. Moreover, both these lectures are characterized by a frontal shooting, and as a conse-quence, each element composing the slide is represented with the same pixel resolution. Conversely,the sample videos of the other courses are characterized by an angle view of the slides, whichcauses a different detection sensitivity for the slide side closest to the camera with respect to thefar one.The average number, of user interventions for each transition, for the video samples considered

in Table III, is 0.36. This value signals that the user has to correct about one error for every threetransitions. As it is possible to deduct by the last row of Table III, the average time required tovalidate 1 h lecture is about 22min.In Figure 27 a box plot summarizes the validation time normalized with respect to the lecture

duration in the sample composed of the first 10 video hours for every course considered in Table III.It is worth noting that the differences among the detection performance is associated with eachsample course are narrow enough, except for course E. Its worst performance corresponding to asingle lecture in Table III is still confirmed on the larger sample of Figure 27. In particular, theseresults on course E in Table III correspond to the worst performances, as it is possible to argue byexamining its box plot. Indeed, a manual examination of videos revealed that the other lectures ofthis sample are characterized by fewer camera motions and a frontal slide view, framing the entireslide, enabling the detection to obtain better results.

6. CONCLUSION

In this paper, we have presented an approach to migrate legacy video lectures into multimedialearning objects. The method concurrently detects slide transitions and extracts information from apresentation document to obtain the slides images, to fill the learning object metadata, and extractthe table of contents of the presentation.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 30: Migrating legacy video lectures to multimedia learning objects

1528 A. DE LUCIA ET AL.

Our approach to slide change detection is derived from shot boundary detection and addressesthe two main problems that occur when applying this techniques to video lectures: low sensitivitywhen detecting a slide change and low robustness in the case of camera motion. The frames ofthe video lecture are masked to select the slide area. Two frames represent a slide transition iftheir similarity, deduced by applying some comparison metrics to unmasked pixels, is lower thana given threshold. The detection is inhibited in the case of camera motion, zooming or panning bythe adoption of camera motion indicators derived by spatial consideration on the mask. The timerequired to process a lecture is linearly dependent on the length of the video.A final phase involves the user in the validation of the detected slide transitions and in the

association among video segments and slides. This is necessary to reach the maximum value forboth precision and recall. In fact, a wrong detection of a slide transition produces the loss of thecorrespondence between the slide, audio and video. This phase naturally ends with the compositionof one or more multimedia learning objects.The proposed approach is supported by VLMigrator, a tool that integrates both the slide change

detection and learning object generation features. Concerning the former feature, it is important topoint out that the VLMigrator does not impose any constraint on the kind of video lecture to process.Indeed, the video we had at disposal was taken without particular expertise of the cameraman,which often zoomed in and out. Camera motion also occurred very frequently. The learning objectgeneration feature enables one to compose synchronized learning objects by using a friendly GUI,without performing any programming tasks.The tool and the detection approach have been validated in a case study describing the results

obtained with the migration of an entire course (48 h) and the results concerning the migration of10 h of lectures for each one of the five courses (50 h). The examined video lectures show differentsituations of rooms, lighting, teaching gestures and slide template, both with light writing on a darkbackground and vice versa.At present, we are extending the tool with optical character recognition to propose associations

between slide and video segments during the validation phase. We are also refining the tool withseveral features contributing to obtain a better quality in terms of the synchronization of the audiotrack with the associated slide. In addition to the table of contents and the general informationwe are already able to extract, we will enrich the contents by using semantic Web techniques.Automatic speech recognition technologies can enrich our approach to obtain a textual draft of thespeaker discussion to each video segment. In this way the learning object will be enriched with theverbal contents provided by the teacher during the lecture.

ACKNOWLEDGEMENTS

The authors would like to thank anonymous referees and the SPE editor for their careful reading and for theirgreat help in improving this paper.

REFERENCES

1. Econtentplus Programme. http://ec.europa.eu/information society/activities/econtentplus/index en.htm [31 January 2008].2. Pincas A. Gradual and Simple Changes to Incorporate ICT into the Classroom. http://www.elearningeuropa.info/index.php?

page=doc&doc id=4519&doclng=5 [8 August 2005].

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 31: Migrating legacy video lectures to multimedia learning objects

MIGRATING LEGACY VIDEO LECTURES 1529

3. Abowd G, Atkeson CG, Feinstein A, Hmelo C, Kooper R, Long S, Savhney N, Tani M. Teaching and learning asmultimedia authoring: The Classroom 2000 project. ACM Multimedia 1996; 187–198.

4. Gerhard J, Mayr P. Competing in the E-learning environment: Strategies for universities. Proceedings of the 35th HawaiiInternational Conference on System Sciences, Hilton Waikoloa Village, Island of Hawaii, 7–10 January 2002; 262–265.

5. Brodie ML, Stonebraker M. Migrating Legacy Systems. Morgan Kaufmann Publishers Inc.: Los Altos, CA, 1995.6. Bersin J. What works in blended learning. Learning Circuits, 2003. Available at: http://www.learningcircuits.org/2003/

jul2003/ [July retrieved 5 September 2005].7. ADL, The Advanced Distributed Learning Initiative. http://www.adlnet.org [31 January 2008].8. IEEE LTSC. The IEEE Learning Technology Standards Committee. http://ltsc.ieee.org [31 January 2008].9. IMS Global Learning Consortium. http://www.imsproject.org [31 January 2008].

10. Lotus Freelance IBM. http://www-142.ibm.com/software/sw-lotus/products/product2.nsf/wdocs/freelance [31 January2008].

11. Robson C, Ford RM, Temple D, Gerlach M. Metrics for scene change detection in digital video sequence. ICMCS ’97:Proceedings of the 1997 International Conference on Multimedia Computing and Systems. IEEE Computer Society:Washington, DC, U.S.A., 1997; 610–611.

12. Dugad R, Ratakonda K. Robust video shot change detection. IEEE Workshop on Multimedia Signal Processing, RedondoBeach, CA, 1998.

13. Nagasaka A, Tanaka Y. Automatic video indexing and fullmotion search for object appearances. Proceedings of IFIPTC2NG2.6, Second Working Conference on Visual Database Systems, Budapest, Hungary, 30 September–3 October 1991;113–127.

14. Mostefaoui A. A modular and adaptive framework for large scale video indexing and content-based retrieval: TheSIRSALE system. Software: Practice and Experience 2006; 36(8):871–890.

15. Adams B, Iyengar G, Neti C, Nock H, Amir A, Permuter H, Srinivasan S, Dorai C, Jaimes A, Lang C, Lin CY, NatsevA, Naphade M, Smith J, Tseng B, Ghosal S, Singh R, Ashwin T, Zhang D. IBM research TREC 2002 video retrievalsystem. Information Technology: The Eleventh Text Retrieval Conference, Gaithersburg, MD, 19–22 November 2002,Voorhees E, Buckland L (eds.), TREC 2002, NIST Special Publication 500-251, 2003; 289–298.

16. The Open Video Project. 2001 TREC Video Retrieval Test Collection. http://www.open-video.org/collection detail.php?cid=7 [31 January 2008].

17. Mukhopadhyay S, Smith B. Passive capture and structuring of lectures. ACM Multimedia, Orlando, FL, U.S.A.,30 October–5 November 1999; 477–487.

18. Ngo CW, Pong TC, Huang TS. Detection of slide transition for topic indexing. International Conference on MultimediaExpo, 2002; 533–536.

19. Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley: Reading, MA, 1999.20. Bell T, Cockburn A, McKenzie B, Vargo J. Flexible delivery damaging to learning? Lessons from the Canterbury Digital

Lectures Project. Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications,2001, Kommers P, Richards G (eds.). AACE: Chesapeake, VA, 2001; 117–122.

21. Behera A, Lalanne D, Ingold R. Looking at projected documents: Event detection & document identification. IEEEInternational Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 27–30 June 2004; 2127–2130.

22. Tiecheng L, Hjelsvold R, Kender JR. Analysis and enhancement of videos of electronic slide presentations. Proceedingsof 2002 IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 26–29 August 2002.

23. Syeda-Mahmood T, Srinivasan S. Detecting topical events in digital video. Proceedings of the Eighth ACM InternationalConference on Multimedia (Marina del Rey, CA, United States). MULTIMEDIA ’00. ACM Press: New York, NY, 2000;85–94.

24. Deshpande SG, Hwang JN. A real-time interactive virtual classroom multimedia distance learning system. IEEETransactions on Multimedia 2001; 3(4):432–444.

25. He L, Sanocki E, Gupta A, Grudin J. Auto-summarization of audio–video presentations. ACM Multimedia, Orlando, FL,U.S.A., 30 October–5 November 1999; 489–498.

26. KOM. LOM-Editor Version 1.0. Technische Universitat Darmstadt. Available at: http://www.multibook.de/lom/en/index.html [19 January 2007].

27. RELOAD. Reusable Learning Object Authoring and Delivering. http://www.reload.ac.uk/ [7 September 2005].28. LOMgen. Learning Object Metadata Generator. Available at: http://www.cs.unb.ca/agentmatcher/LOMGen.html

[6 September 2005].29. Learning Tools Project. http://www.learningtools.arts.ubc.ca/index.htm [31 January 2008].30. Singh A, Boley H, Bhavsar VC. A learning object metadata generator applied to computer science terminology.

Presentation in Learning Objects Summit, Fredericton, CA, 2004.31. ISO/IEC 26300:2006. Open Document Format for Office Applications. http://www.iso.org/iso/iso catalogue/catalogue tc/

catalogue detail.htm?csnumber=43485 [31 January 2008].32. COM, Component Object Model. http://www.microsoft.com/com/default.mspx [31 January 2008].33. Gonzalez RC, Woods RE. Digital Image Processing (2nd edn). Prentice-Hall: Englewood Cliffs, NJ, 2002; 75–141.

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe

Page 32: Migrating legacy video lectures to multimedia learning objects

1530 A. DE LUCIA ET AL.

34. Ford RM, Robson C, Temple D, Gerlach M. Metrics for scene change detection in digital video sequences. Proceedingsof the IEEE International Conference on Multimedia Computing and Systems, Ottawa, Ont., Canada, 3–6 June 1997;610–611.

35. Sethi IK, Patel NV. A statistical approach to scene change detection. IS&T SPIE Proceedings on Storage and Retrievalfor Image and Video Databases III, San Jose, CA, vol. 2420, February 1995.

36. JMF Java Media Framework. http://java.sun.com/products/java-media/jmf/ [31 January 2008].37. SMIL. http://www.w3.org/AudioVideo/ [31 January 2008].38. Moodle. http://moodle.org/ [31 January 2008].

Copyright q 2008 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2008; 38:1499–1530DOI: 10.1002/spe