
P. Muneesawang et al. (Eds.): PCM 2009, LNCS 5879, pp. 887–897, 2009. © Springer-Verlag Berlin Heidelberg 2009

Spatio-temporal Just Noticeable Distortion Model Guided Video Watermarking

Yaqing Niu1,2, Jianbo Liu1, Sridhar Krishnan2, and Qin Zhang1

1 Information Engineering School, Communication University of China, Beijing, China 2 Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada

Abstract. Perceptual video watermarking needs to take full advantage of the results from human visual system (HVS) studies. Since motion is a specific feature of video, temporal HVS properties need to be taken into account. In this paper, we exploit a combined Spatio-Temporal Just Noticeable Distortion (JND) model which incorporates the spatial CSF, a temporal modulation factor, retinal velocity, luminance adaptation and contrast masking to guide watermarking for digital video. The proposed watermarking scheme, in which visual models providing an accurate perceptual visibility threshold are fully used to determine scene-driven upper bounds on watermark insertion, allows us to provide a maximum-strength transparent watermark. Experimental results confirm the improved performance of our combined Spatio-Temporal JND model guided watermarking scheme: it allows higher injected-watermark energy without jeopardizing picture quality and is considerably more robust than algorithms based on the relevant existing perceptual models.

Keywords: HVS, Spatio-Temporal JND model, perceptual watermarking for video.

1 Introduction

The rapid growth of the Internet has created the need for digital watermarking that can be used for copyright protection of digital images and video. A well-designed watermark involves a complex trade-off between three parameters: imperceptibility, robustness, and capacity. In order to maintain image quality and at the same time increase the probability of watermark detection, it is necessary to take the human visual system (HVS) into consideration when engaging in watermarking research [1].

The HVS makes the final evaluation of the quality of video that is processed and displayed. Just noticeable distortion (JND), which refers to the maximum distortion that the HVS does not perceive, gives us a way to model the HVS accurately and can serve as a perceptual visibility threshold to guide video watermarking. JND estimation for still images is relatively well developed. An early perceptual threshold estimation in the DCT domain was proposed by Ahumada [2], which gives the threshold for each DCT component by incorporating the spatial Contrast Sensitivity Function


(CSF). This scheme was improved by Watson [3], who added the luminance adaptation effect to the base threshold and calculated contrast masking [4] as an elevation factor. In [5], Zhang considered an additional block-classification-based contrast masking and luminance adaptation for digital images. Since motion is a specific feature of video, the temporal dimension needs to be taken into account: JND estimation for video sequences needs to incorporate not only the spatial CSF but the temporal CSF as well. A spatio-temporal CSF model was proposed by Kelly [6] from experiments on visibility thresholds under stabilized viewing conditions. Based on Kelly's model, Jia [7] estimated JND thresholds for video by combining other visual effects such as luminance adaptation and contrast masking. An improved temporal modulation factor proposed by Wei [8] incorporates not only the temporal CSF but also the directionality of motion.

Previous watermarking schemes for video have only partially used the results of HVS studies [9][10]; the perceptual adjustment of the watermark is mainly based on Watson's spatial JND model [11]. As discussed above, motion is a specific feature of video, so the temporal properties of the HVS need to be taken into account. In this paper, we exploit a combined Spatio-Temporal JND model which incorporates the spatial CSF, a temporal modulation factor, retinal velocity, luminance adaptation and contrast masking to guide watermarking for digital video. The watermarking scheme, in which visual models providing an accurate perceptual visibility threshold are fully used to determine scene-driven upper bounds on watermark insertion, allows us to provide a maximum-strength transparent watermark that is, in turn, highly robust to common video processing and attacks.

This paper is organized as follows. In Section 2, the new combined Spatio-Temporal JND model is presented. In Section 3, the combined Spatio-Temporal JND model guided watermarking scheme is described in detail. In Section 4, a series of experiments evaluates the performance of the proposed scheme. Finally, conclusions are drawn in Section 5.

2 Combined Spatio-temporal JND Model

Spatio-Temporal Just Noticeable Distortion (JND) is an efficient model to represent the accurate perceptual redundancy of digital video. Here we compute a combined Spatio-Temporal JND model which incorporates the spatial CSF, a temporal modulation factor, retinal velocity, luminance adaptation and contrast masking all together.

2.1 Retinal Velocity

Motion is a specific feature of video imagery. Human eyes tend to track a moving object to keep the retinal image of the object in the fovea [12][13][14][15]. For perceptual visibility analysis, it is therefore necessary to take the observer's eye movements into account to determine how well tracked objects can be seen during the presentation of motion imagery.


There are three types of eye movements [12]: natural drift eye movements, smooth pursuit eye movements (SPEM) and saccadic eye movements. The natural drift eye movements are present even when the observer intentionally fixates on a single position, and these movements are responsible for the perception of static imagery during fixation. The saccadic eye movements rapidly move the fixation point from one location to another; during saccades, HVS sensitivity is very low. The smooth pursuit eye movements tend to track the moving object and reduce the retinal velocity, and thus compensate for the loss of sensitivity due to motion. Future experiments should test the limit of smooth pursuit [14].

Based on the eye-movement model in [12], the retinal velocity $v_R$ differs from the image velocity $v_I$, which can be obtained through motion estimation. The retinal velocity $v_R$ can be expressed as (1)

$$v_{R_h} = v_{I_h} - v_{E_h}, \quad h \in \{x, y\} \quad (1)$$

where the eye movement velocity $v_E$ is determined as (2)

$$v_{E_h} = \min\left[g_{sp} \cdot v_{I_h} + v_{MIN},\; v_{MAX}\right] \quad (2)$$

where $g_{sp}$ is the gain of the SPEM, $v_{MIN}$ is the minimum eye velocity due to drift, and $v_{MAX}$ is the maximum eye velocity before saccadic movements. The average $g_{sp}$ over all observers was 0.956 ± 0.017 [14]. The values $v_{MIN}$ and $v_{MAX}$ are set to 0.15 and 80.0 deg/s, respectively.

The image velocity $v_I$ can be obtained with a motion estimation technique as follows (3)

$$v_{I_h} = f_{frame} \times MV_h \times \theta_h \quad (3)$$

where $f_{frame}$ is the frame rate of the video and $MV$ is the motion vector of each block; $\theta$ is the visual angle of a pixel, obtained by (4)

$$\theta_h = 2 \cdot \arctan\left(\frac{\Lambda_h}{2 \cdot l}\right) \quad (4)$$

where $l$ is the viewing distance and $\Lambda$ stands for the display width/length of a pixel on the monitor [6].
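Equations (1)-(4) chain together mechanically, so a small sketch makes the data flow explicit. The following minimal Python sketch is ours, not the paper's implementation: the frame rate, viewing distance and pixel pitch defaults are illustrative assumptions, and motion-vector components are treated as per-axis magnitudes.

```python
import numpy as np

def retinal_velocity(mv, f_frame=25.0, viewing_distance=0.6,
                     pixel_pitch=(2.8e-4, 2.8e-4),
                     g_sp=0.956, v_min=0.15, v_max=80.0):
    """Retinal velocity of one block from its motion vector, Eqs. (1)-(4).

    mv: (mv_x, mv_y) block motion vector in pixels/frame.
    viewing_distance, pixel_pitch: metres (illustrative values, not
    from the paper).  Returns (v_Rx, v_Ry) in deg/s.
    """
    v_r = []
    for mv_h, lam_h in zip(mv, pixel_pitch):
        # Eq. (4): visual angle subtended by one pixel, in degrees
        theta_h = 2.0 * np.degrees(np.arctan(lam_h / (2.0 * viewing_distance)))
        # Eq. (3): image-plane velocity in deg/s (magnitude per axis)
        v_i = f_frame * abs(mv_h) * theta_h
        # Eq. (2): smooth-pursuit eye velocity, clipped at the saccade limit
        v_e = min(g_sp * v_i + v_min, v_max)
        # Eq. (1): retinal velocity is the residual the eye fails to track
        v_r.append(v_i - v_e)
    return tuple(v_r)
```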

2.2 Joint Spatio-Temporal CSF

In comparison with the various spatial CSF models for still images [3][5], the CSF model for video needs to take into account the temporal dimension in addition to the spatial properties. The joint Spatio-Temporal CSF describes the effect of spatial frequency and temporal frequency on HVS sensitivity. The Spatio-Temporal CSF model is the reciprocal of the base distortion threshold which can be tolerated for each DCT coefficient. The literature [16][8] shows that the base threshold $T_{BASE}$ for the DCT domain, corresponding to the Spatio-Temporal CSF model, can be expressed by (5)

$$T_{BASE}(k,n,i,j) = T_{BASEs}(k,n,i,j) \times F_T(k,n,i,j) \quad (5)$$


where $T_{BASEs}$ is the base threshold corresponding to the spatial CSF model and $F_T$ is the temporal modulation factor; $k$ is the index of the frame in the video sequence, and $n$ is the index of a block in the $k$th frame; $i$ and $j$ are the DCT coefficient indices.

In [16] the base threshold $T_{BASEs}$ is computed by (6)

$$T_{BASEs}(n,i,j) = \frac{s}{\phi_i \phi_j} \cdot \frac{\exp(c\,\omega_{ij})}{a + b\,\omega_{ij}} \cdot \frac{1}{r + (1-r)\cos^2\varphi_{ij}} \quad (6)$$

where $a = 1.33$, $b = 0.11$, $c = 0.18$ and $s = 0.25$. $\phi_i$ and $\phi_j$ are DCT normalization factors given by (7); $\omega_{ij}$ is the spatial frequency, which can be calculated by (8); $N$ is the dimension of the DCT block; $\theta_x$ and $\theta_y$ are the horizontal and vertical visual angles of a pixel from (4). $r$ is set to 0.6, and $\varphi_{ij}$ stands for the directional angle of the corresponding DCT component, given by (9).

$$\phi_m = \begin{cases} \sqrt{1/N}, & m = 0 \\ \sqrt{2/N}, & m > 0 \end{cases} \quad (7)$$

$$\omega_{ij} = \frac{1}{2N}\sqrt{(i/\theta_x)^2 + (j/\theta_y)^2} \quad (8)$$

$$\varphi_{ij} = \arcsin\left(\frac{2\,\omega_{i0}\,\omega_{0j}}{\omega_{ij}^2}\right) \quad (9)$$
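The base threshold of Eqs. (6)-(9) depends only on the block size and the per-pixel visual angles, so it can be precomputed once per viewing setup. Below is a minimal Python transcription of these four equations; the function name and the scalar loop are our own choices, not the paper's implementation.

```python
import numpy as np

def base_threshold_spatial(theta_x, theta_y, N=8,
                           a=1.33, b=0.11, c=0.18, s=0.25, r=0.6):
    """Spatial base threshold T_BASEs(n,i,j) for an N x N DCT block,
    a direct transcription of Eqs. (6)-(9)."""
    def phi(m):                          # Eq. (7): DCT normalization factor
        return np.sqrt((1.0 if m == 0 else 2.0) / N)

    T = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            w_i0 = i / (2.0 * N * theta_x)   # horizontal frequency component
            w_0j = j / (2.0 * N * theta_y)   # vertical frequency component
            w_ij = np.hypot(w_i0, w_0j)      # Eq. (8): spatial frequency, cpd
            if w_ij == 0.0:
                phi_ij = 0.0                 # DC term carries no orientation
            else:
                # Eq. (9): directional angle (clip guards rounding error)
                phi_ij = np.arcsin(np.clip(2.0 * w_i0 * w_0j / w_ij**2,
                                           -1.0, 1.0))
            # Eq. (6): exponential CSF shape with an oblique-effect term
            T[i, j] = (s / (phi(i) * phi(j))
                       * np.exp(c * w_ij) / (a + b * w_ij)
                       / (r + (1.0 - r) * np.cos(phi_ij)**2))
    return T
```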

In [8] the temporal modulation factor $F_T$ is computed by (10)

$$F_T(k,n,i,j) = \begin{cases} 1, & f_s < 5\ \mathrm{cpd}\ \text{and}\ f_t < 10\ \mathrm{Hz} \\ 1.07^{\,f_t - 10}, & f_s < 5\ \mathrm{cpd}\ \text{and}\ f_t \ge 10\ \mathrm{Hz} \\ 1.07^{\,f_t}, & f_s \ge 5\ \mathrm{cpd} \end{cases} \quad (10)$$

where cpd denotes cycles per degree. The temporal frequency $f_t$, which depends not only on the motion but also on the spatial frequency of the object, is given by (11)

$$f_t = f_{sx} v_{Rx} + f_{sy} v_{Ry} \quad (11)$$

where $f_{sx}$ and $f_{sy}$ are the horizontal and vertical components of the spatial frequency, which can be calculated by (12), with $i_x = i$ and $i_y = j$.

$$f_{s_h} = \frac{i_h}{2N\theta_h}, \quad h \in \{x, y\} \quad (12)$$

As discussed in Section 2.1, human eyes automatically move to track an observed object; the retinal velocities $v_{Rx}$ and $v_{Ry}$ can be calculated by (1).
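A sketch of the temporal modulation factor, combining Eqs. (10)-(12) with the retinal velocity of Section 2.1, is given below. This is our reading of the formulas, assuming the retinal-velocity components are supplied in deg/s; the function name and signature are illustrative.

```python
import numpy as np

def temporal_modulation(i, j, theta_x, theta_y, v_rx, v_ry, N=8):
    """Temporal modulation factor F_T for DCT component (i, j),
    Eqs. (10)-(12)."""
    # Eq. (12): horizontal/vertical spatial-frequency components, in cpd
    f_sx = i / (2.0 * N * theta_x)
    f_sy = j / (2.0 * N * theta_y)
    f_s = np.hypot(f_sx, f_sy)      # overall spatial frequency, as in Eq. (8)
    # Eq. (11): temporal frequency induced by the residual retinal motion
    f_t = f_sx * v_rx + f_sy * v_ry
    # Eq. (10): low spatio-temporal frequencies are fully visible (factor 1);
    # sensitivity falls off exponentially elsewhere, raising the threshold.
    if f_s < 5.0 and f_t < 10.0:
        return 1.0
    elif f_s < 5.0:                  # f_t >= 10 Hz
        return 1.07 ** (f_t - 10.0)
    else:                            # f_s >= 5 cpd
        return 1.07 ** f_t
```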

2.3 Luminance Adaptation

Human eyes are more sensitive to noise in medium-gray regions, so the visibility threshold is higher in very dark or very light regions. Because our base threshold is detected at the intensity value 128, a modification factor needs to be included for other intensity values. This effect is called the luminance adaptation effect. The curve of the luminance adaptation factor is U-shaped, which means the factor at the lower and higher intensity regions is larger than in the middle intensity region. An empirical formula for the luminance adaptation factor $a_{Lum}$ in [16] is given as (13), where $I(k,n)$ is the average intensity value of the $n$th block in the $k$th frame.

$$a_{Lum}(k,n) = \begin{cases} \dfrac{60 - I(k,n)}{150} + 1, & I(k,n) \le 60 \\[4pt] 1, & 60 < I(k,n) < 170 \\[4pt] \dfrac{I(k,n) - 170}{425} + 1, & I(k,n) \ge 170 \end{cases} \quad (13)$$
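Equation (13) is a simple piecewise elevation, sketched below; the 0-255 intensity range is an assumption consistent with the 128 mid-grey anchor mentioned above.

```python
def luminance_adaptation(mean_intensity):
    """Luminance adaptation factor a_Lum(k,n) of Eq. (13).

    mean_intensity: average pixel value I(k,n) of the block (0-255 assumed).
    """
    I = mean_intensity
    if I <= 60:
        return (60.0 - I) / 150.0 + 1.0   # dark region: raised threshold
    elif I >= 170:
        return (I - 170.0) / 425.0 + 1.0  # bright region: raised threshold
    return 1.0                            # mid-grey: most sensitive, no elevation
```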

2.4 Contrast Masking

Contrast masking refers to the reduction in the visibility of one visual component in the presence of another. The masking is strongest when both components have the same spatial frequency, orientation, and location. To incorporate the contrast masking effect, we employ the contrast masking factor $a_{contrast}$ [4], measured as (14), where $C(k,n,i,j)$ is the $(i,j)$-th DCT coefficient in the $n$th block and $\varepsilon = 0.7$.

$$a_{contrast}(k,n,i,j) = \max\left(1, \left(\frac{|C(k,n,i,j)|}{T_{BASE}(k,n,i,j) \cdot a_{Lum}(k,n)}\right)^{\varepsilon}\right) \quad (14)$$
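A one-line transcription of Eq. (14) follows; the absolute value on the coefficient reflects the usual Watson-style masking formulation and is our reading of the equation.

```python
def contrast_masking(C, T_base, a_lum, eps=0.7):
    """Contrast masking elevation a_contrast of Eq. (14).

    C: DCT coefficient C(k,n,i,j); T_base, a_lum: from Eqs. (5) and (13).
    """
    # A coefficient already larger than its (luminance-adapted) threshold
    # masks further distortion; the elevation grows as a power law [4].
    return max(1.0, (abs(C) / (T_base * a_lum)) ** eps)
```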

2.5 Complete JND Estimator

The overall JND (15) is determined by the base threshold $T_{BASE}$, the luminance adaptation factor $a_{Lum}$ and the contrast masking factor $a_{contrast}$:

$$T_{JND}(k,n,i,j) = T_{BASE}(k,n,i,j) \cdot a_{Lum}(k,n) \cdot a_{contrast}(k,n,i,j) \quad (15)$$

$T_{JND}(k,n,i,j)$ is the complete scene-driven Spatio-Temporal JND estimator, which represents the accurate perceptual visibility threshold profile of a video and is used to guide watermarking.
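Putting the pieces together, the per-coefficient threshold can be sketched as follows. This reuses luminance_adaptation() and contrast_masking() from the sketches above and assumes the spatial base threshold and temporal factor have already been computed via Eqs. (6)-(12).

```python
def jnd_threshold(C, T_base_s, f_temporal, mean_intensity):
    """Complete scene-driven threshold T_JND(k,n,i,j), Eqs. (5) and (15).

    C: DCT coefficient; T_base_s: from Eqs. (6)-(9); f_temporal: from
    Eqs. (10)-(12); mean_intensity: per-block average pixel value.
    """
    T_base = T_base_s * f_temporal                 # Eq. (5)
    a_lum = luminance_adaptation(mean_intensity)   # Eq. (13)
    a_con = contrast_masking(C, T_base, a_lum)     # Eq. (14)
    return T_base * a_lum * a_con                  # Eq. (15)
```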

3 The Spatio-temporal JND Model Guided Video Watermarking Scheme

We exploit the combined Spatio-Temporal JND model to guide both watermark embedding and extraction. A diagram of the combined Spatio-Temporal JND model guided watermark embedding is shown in Fig. 1.

The scheme first constructs a set of sub-regions of approximately equal energy using the Improved Longest Processing Time (ILPT) algorithm [17], and then enforces an energy difference between every two sub-regions to embed the watermark bits under the control of our combined Spatio-Temporal JND model [18][19].
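The ILPT algorithm itself is specified in [17] and is not reproduced here. As a hedged illustration of the balancing idea, the sketch below uses the classical Longest-Processing-Time heuristic to split blocks into two sub-regions of roughly equal energy; it is a stand-in for ILPT, not the paper's algorithm.

```python
def partition_lpt(block_energies):
    """Split blocks into two sub-regions A and B of roughly equal total
    energy.  Plain LPT heuristic standing in for the ILPT of [17]: place
    each block (heaviest first) into the currently lighter sub-region.
    Returns two lists of block indices."""
    order = sorted(range(len(block_energies)),
                   key=lambda n: block_energies[n], reverse=True)
    A, B, e_a, e_b = [], [], 0.0, 0.0
    for n in order:
        if e_a <= e_b:
            A.append(n); e_a += block_energies[n]
        else:
            B.append(n); e_b += block_energies[n]
    return A, B
```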


Fig. 1. Diagram of combined Spatio-Temporal JND model guided watermark embedding

The embedding procedure of the scheme consists of the following steps:

a) Decompose the original video frames into non-overlapping 8×8 blocks and compute the energy of the low-frequency DCT coefficients in the zigzag sequence.

b) Obtain sub-regions of approximately equal energy using the ILPT algorithm. The watermark capacity is determined by the number of blocks in a sub-region, which is used to embed one watermark bit.

c) Map the index of the DCT blocks in a sub-region according to ILPT.

d) Use our combined Spatio-Temporal JND model described in Section 2 to calculate the perceptual visibility threshold profile for the video frames, which makes the watermark imperceptible and yet strong.

e) If the watermark bit to be embedded is 1, the energy of sub-region A should be increased and the energy of sub-region B should be decreased. If the watermark bit to be embedded is 0, the energy of sub-region A should be decreased and the energy of sub-region B should be increased. The energy of each sub-region is modified by adjusting its low-frequency DCT coefficients accordingly, under the control of our combined Spatio-Temporal JND model, as (16) (a code sketch of this modulation follows the extraction steps below)

$$C(k,n,i,j)_m = \begin{cases} C(k,n,i,j) + Sign(C(k,n,i,j)) \cdot f(C(k,n,i,j), T_{JND}(k,n,i,j)), & PM \\ C(k,n,i,j) - Sign(C(k,n,i,j)) \cdot f(C(k,n,i,j), T_{JND}(k,n,i,j)), & NM \end{cases} \quad (16)$$

where $C(k,n,i,j)_m$ is the modified DCT coefficient, $Sign(\cdot)$ is the sign function, PM is positive modulation (the energy is increased) and NM is negative modulation (the energy is decreased); $T_{JND}(k,n,i,j)$ is the perceptual visibility threshold given by our combined Spatio-Temporal JND model, and $f(\cdot)$ can be expressed by (17)

$$f(C(k,n,i,j), T_{JND}(k,n,i,j)) = \begin{cases} 0, & |C(k,n,i,j)| < T_{JND}(k,n,i,j) \\ T_{JND}(k,n,i,j), & |C(k,n,i,j)| \ge T_{JND}(k,n,i,j) \end{cases} \quad (17)$$


f) Conduct the IDCT on the energy-modified result to obtain the watermarked video frames.

A diagram of the combined Spatio-Temporal JND model guided watermark extraction is shown in Fig. 2. The extraction procedure consists of the following steps:

a) Decompose the watermarked video frames into non-overlapping 8×8 blocks and compute the energy of the low-frequency DCT coefficients in the zigzag sequence.

b) Calculate the energy of each sub-region according to the index map.

c) Compare the energy of sub-region A with that of sub-region B. If the energy of sub-region A is greater than the energy of sub-region B, the embedded watermark bit is 1; if it is smaller, the embedded watermark bit is 0. The watermark is thus extracted.
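To make step e), step f) and the detector concrete, here is a minimal Python sketch of the JND-controlled modulation of Eqs. (16)-(17) and the differential-energy detection. Sub-region energy is taken as the sum of squared low-frequency coefficients, which is our assumption; the function names are illustrative.

```python
import numpy as np

def modulate(C, T_jnd, positive):
    """JND-controlled modulation of one low-frequency DCT coefficient,
    Eqs. (16)-(17): move C outward (PM) or inward (NM) by one JND step,
    but only when C already exceeds its threshold."""
    step = T_jnd if abs(C) >= T_jnd else 0.0                    # Eq. (17)
    return C + np.sign(C) * step if positive else C - np.sign(C) * step  # Eq. (16)

def embed_bit(coeffs_A, coeffs_B, T_A, T_B, bit):
    """Raise the energy of one sub-region and lower the other (step e).

    coeffs_* / T_*: sequences of low-frequency coefficients and their
    per-coefficient JND thresholds for sub-regions A and B."""
    pm_A = (bit == 1)                        # bit 1: A up, B down; bit 0: reverse
    A = [modulate(c, t, pm_A) for c, t in zip(coeffs_A, T_A)]
    B = [modulate(c, t, not pm_A) for c, t in zip(coeffs_B, T_B)]
    return A, B

def extract_bit(coeffs_A, coeffs_B):
    """Differential-energy detection: compare sub-region energies (step c)."""
    e_a = sum(c * c for c in coeffs_A)
    e_b = sum(c * c for c in coeffs_B)
    return 1 if e_a > e_b else 0
```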

Fig. 2. Diagram of combined Spatio-Temporal JND model guided watermark extraction

4 Experimental Results and Performance Analysis

In these experiments, the generated Spatio-Temporal JND profile is used to guide watermarking in video sequences in order to evaluate the performance of the JND model. Watson's JND model [3] (referred to as Model 1 hereinafter), Zhang's JND model [5] (Model 2) and Wei's JND model [8] (Model 3) were also implemented and compared with the proposed JND estimator. We constructed a series of tests to observe the performance of our combined Spatio-Temporal JND model guided watermarking in terms of visual quality, capacity and robustness. The 720×576 walk_pal video sequence is used for this series of experiments. The original watermark is a 20×20 binary image of the logo of Communication University of China.

4.1 Visual Quality

A better JND model allows higher injected-noise energy (corresponding to lower PSNR) without jeopardizing picture quality. Since our combined Spatio-Temporal JND model correlates very well with the HVS, we can use it to guide watermark embedding in each DCT coefficient of a digital video while keeping the difference hardly noticeable. Fig. 3 (a) shows the first frame of the walk_pal video sequence, and Fig. 3 (b)-(e) show the first frame of the watermarked video sequence using the four JND models. Comparing Fig. 3 (a)-(e), we can see no obvious degradation in Fig. 3 (b)-(e), where the PSNR values are 35.5 dB, 47.9 dB, 43.9 dB and 34.4 dB, respectively.

Fig. 3. (a) original frame (b) watermarked by Model 1 (c) watermarked by Model 2 (d) watermarked by Model 3 (e) watermarked by our Model (f) watermark

4.2 Capacity

The watermark capacity is determined by the number of blocks in a sub-region, which is used to embed one watermark bit. We set the number of blocks to 8 (i.e., n = 8) in the following experiments.

4.3 Robustness

In practice, watermarked content has to face a variety of distortions before reaching the detector. We present robustness results under different attacks: MPEG2 compression, MPEG4 compression, Gaussian noise and valumetric scaling. Robustness results of the algorithms based on Models 1 to 3 are compared with those of the algorithm based on our JND model in Fig. 4, Fig. 5, Fig. 6 and Fig. 7. For each category of distortion, the watermarked frames were modified with a varying magnitude of distortion and the Bit Error Rate (BER) was then computed.

As the robustness results in Fig. 4, Fig. 5, Fig. 6 and Fig. 7 show, the watermarking scheme based on our combined Spatio-Temporal JND model performs slightly better than the algorithm based on Model 1 and evidently better than the algorithms based on Model 2 and Model 3. Our model correlates with the HVS better than the other relevant perceptual models; owing to this improved correlation, it allows higher injected-watermark energy without jeopardizing picture quality and thus obtains better robustness in digital video watermarking.


Fig. 4. Robustness versus MPEG2 Compression

Fig. 5. Robustness versus MPEG4 Compression

Fig. 6. Robustness versus Gaussian noise


Fig. 7. Robustness versus Valumetric Scaling

5 Conclusion

Perceptual video watermarking needs to take full advantage of the results from HVS studies. Since motion is a specific feature of video, temporal HVS properties need to be taken into account. In this paper, we exploited a combined Spatio-Temporal JND model which incorporates the spatial CSF, a temporal modulation factor, retinal velocity, luminance adaptation and contrast masking to guide watermarking for digital video. The proposed watermarking scheme, in which visual models providing an accurate perceptual visibility threshold are fully used to determine scene-driven upper bounds on watermark insertion, allows us to provide a maximum-strength transparent watermark. Experimental results with subjective viewing confirm the improved performance of our combined Spatio-Temporal JND model guided watermarking scheme: by allowing higher injected-watermark energy without jeopardizing picture quality, it performs much better in terms of robustness than algorithms based on the relevant existing perceptual models.

Acknowledgments

We acknowledge the funding provided by the National Natural Science Foundation of China (Grant No. 60832004) and Key Construction Program of the Communication University of China “211” Project.

References

1. Wolfgang, R.B., Podilchuk, C.I., Delp, E.J.: Perceptual watermarks for digital images and video. Proceedings of the IEEE (Special Issue on Identification and Protection of Multimedia Information) 87(7), 1108–1126 (1999)

2. Ahumada Jr., A.J., Peterson, H.A.: Luminance-model-based DCT quantization for color image compression. In: Proc. SPIE, vol. 1666, pp. 365–374 (1992)

3. Watson, A.B.: DCTune: A technique for visual optimization of DCT quantization matrices for individual images. In: Soc. Inf. Display Dig. Tech. Papers XXIV, pp. 946–949 (1993)

4. Legge, G.E.: A power law for contrast discrimination. Vision Research 21, 457–467 (1981)

5. Zhang, X., Lin, W.S., Xue, P.: Improved estimation for just-noticeable visual distortion. Signal Processing 85(4), 795–808 (2005)

6. Kelly, D.H.: Motion and vision. II. Stabilized spatio-temporal threshold surface. J. Opt. Soc. Amer. 69(10), 1340–1349 (1979)

7. Jia, Y., Lin, W., Kassim, A.A.: Estimating just-noticeable distortion for video. IEEE Transactions on Circuits and Systems for Video Technology 16(7), 820–829 (2006)

8. Wei, Z., Ngan, K.N.: A temporal just-noticeable distortion profile for video in DCT domain. In: Proc. ICIP 2008, pp. 1336–1339 (October 2008)

9. Delaigle, J.F., Devleeschouwer, C., Macq, B., Langendijk, I.: Human visual system features enabling watermarking. In: Proc. IEEE ICME 2002, vol. 2, pp. 489–492 (2002)

10. Wolfgang, R.B., Podilchuk, C.I., Delp, E.J.: Perceptual watermarks for digital images and video. Proceedings of the IEEE 87(7), 1108–1126 (1999)

11. He-Fei, Zheng-Ding, Fu-Hao, Rui-Xuan: An energy modulated watermarking algorithm based on Watson perceptual model. Journal of Software 17(5), 1124–1132 (2006)

12. Daly, S.: Engineering observations from spatiovelocity and spatiotemporal visual models. In: Vision Models and Applications to Image and Video Processing, ch. 9. Kluwer, Norwell (2001)

13. Tourancheau, S., Le Callet, P., Barba, D.: Influence of motion on contrast perception: supra-threshold spatio-velocity measurements. In: Proc. SPIE, vol. 6492, p. 64921M (2007)

14. Laird, J., Rosen, M., Pelz, J., Montag, E., Daly, S.: Spatio-velocity CSF as a function of retinal velocity using unstabilized stimuli. In: Proc. SPIE Conf. Human Vision and Electronic Imaging XI, vol. 6057 (2006)

15. Schütz, A.C., Delipetkos, E., Braun, D.I., Kerzel, D., Gegenfurtner, K.R.: Temporal contrast sensitivity during smooth pursuit eye movements. Journal of Vision 7(13), Article 3, 1–15 (2007)

16. Wei, Z., Ngan, K.N.: Spatial just noticeable distortion profile for image in DCT domain. In: Proc. IEEE Int. Conf. Multimedia and Expo (2008)

17. Fuhao, Z.: Research of Robust Video Watermarking Algorithms and Related Techniques. Ph.D. dissertation, Huazhong University of Science & Technology (2006)

18. Langelaar, G.C., Lagendijk, R.L.: Optimal differential energy watermarking of DCT encoded images and video. IEEE Transactions on Image Processing 10(1), 148–158 (2001)

19. Hefei, L., Zhengding, L., Fuhao, Z.: Improved Differential Energy Watermarking (IDEW) algorithm for DCT-encoded images and video. In: Proc. of the Seventh International Conference on Signal Processing (ICSP 2004), pp. 2326–2329 (2004)