PARTIAL ENCRYPTION OF VIDEO FOR COMMUNICATION AND STORAGE
A THESIS SUBMITTED TO
THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF
THE MIDDLE EAST TECHNICAL UNIVERSITY
BY
TURAN YUKSEL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
THE DEPARTMENT OF COMPUTER ENGINEERING
SEPTEMBER 2003
Approval of the Graduate School of Natural and Applied Sciences.
Prof. Dr. Canan Ozgen
Director
I certify that this thesis satisfies all the requirements as a thesis for the degree of Master of Science.
Prof. Dr. Ayse Kiper
Head of Department
This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science.
Assoc. Prof. Dr. Gozde Bozdag Akar
Co-Supervisor

Prof. Dr. Fatos T. Yarman Vural
Supervisor
Examining Committee Members
Prof. Dr. A. Enis Cetin
Prof. Dr. Fatos T. Yarman Vural
Assoc. Prof. Dr. Gozde Bozdag Akar
Assoc. Prof. Dr. M. Volkan Atalay
Dr. Cevat Sener
ABSTRACT
PARTIAL ENCRYPTION OF VIDEO FOR COMMUNICATION AND STORAGE
Yuksel, Turan
M.S., Department of Computer Engineering
Supervisor: Prof. Dr. Fatos T. Yarman Vural
Co-Supervisor: Assoc. Prof. Dr. Gozde Bozdag Akar
SEPTEMBER 2003, 66 pages
In this study, a new method is proposed to protect video data through partial en-
cryption. Unlike previous methods, the bit rate of the encrypted portion can be
controlled. In order to accomplish this task, a simple model for the time to break
the partial encryption by a ciphertext-only attack is defined. Then, the encrypted
bit budget distribution strategy maximizing the time subject to the bitrate constraint
is found. An algorithm to estimate the model parameters is constructed and it is
then implemented over an MPEG-4 natural video codec together with the bit budget
distribution strategy. The encoder is tested with various image sequences and the
output is analyzed.
In addition to the developed video encryption method, a file format is defined to
store encryption related side information.
Keywords: Video Encryption, MPEG-4, IPMP.
ÖZ

İLETİŞİM VE SAKLAMA İÇİN KISMİ VİDEO ŞİFRELEME

Yüksel, Turan

Yüksek Lisans, Bilgisayar Mühendisliği Bölümü

Tez Yöneticisi: Prof. Dr. Fatoş T. Yarman Vural

Ortak Tez Yöneticisi: Doç. Dr. Gözde Bozdağ Akar

EYLÜL 2003, 66 sayfa

Bu çalışmada, video verisinin kısmi şifreleme yoluyla korunması için yeni bir yöntem önerilmiştir. Daha önceki yöntemlerden farklı olarak, şifrelenmiş kısmın boyutunun kontrolü sağlanmıştır. Bunu sağlayabilmek için kısmi şifrelemeyi kırmak için gereken zamanın basit bir modeli tanımlanmıştır. Şifrelenen kısmın büyüklüğü kısıtı altında modeli enbüyükleyen bit bütçesi dağıtım stratejisi bulunmuştur. Çalışma, model parametrelerinin kestirimi için de bir algoritma önermektedir. Algoritma ve şifrelenmiş bit bütçesi dağıtım stratejisi bir MPEG-4 doğal video kodlayıcı/çözücü üzerinde gerçeklenmiş ve çeşitli imge dizilerindeki bit dağılımı gözlenmiştir.

Video şifreleme yönteminin yanı sıra, çalışmada şifreleme yan bilgilerinin saklanması için bir dosya biçimi de tanımlanmıştır.

Anahtar Kelimeler: Video Şifreleme, MPEG-4, IPMP.
ACKNOWLEDGMENTS
I am grateful to my advisors Dr. Fatos T. Yarman Vural and Dr. Gozde Bozdag
Akar for their unique support. My family-at-large and friends (in alphabetical order)
Nafiz, Murat, Pınar, Faruk, Caglar, Emre, Oguz, Barış, Ersan and Ulas get equivalent
credits for their academic and motivational support. My thesis implementation
is based on MPEG-4 reference software by MoMuSys and Microsoft teams, which
eliminated the need to write a from-scratch MPEG-4 natural video codec, although
making me feel regret at times.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
OZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributed Work . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 BACKGROUND ON VIDEO COMPRESSION AND ENCRYPTION . . 4
2.1 Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 MPEG-4 Natural Video Coding Standard . . . . . . . . . . . . . 6
2.2.1 Natural Video Coding Tools Provided by MPEG-4 . . 6
2.2.1.1 Shape Coding . . . . . . . . . . . . . . . . 7
2.2.1.2 Motion Estimation and Compensation . . 8
2.2.1.3 Texture Coding . . . . . . . . . . . . . . . . 9
2.2.1.4 Sprites . . . . . . . . . . . . . . . . . . . . . 10
2.2.1.5 Scalable Video . . . . . . . . . . . . . . . . 11
2.2.1.6 Static Textures . . . . . . . . . . . . . . . . 12
2.2.2 Error Resilience and Concealment Tools . . . . . . 13
2.2.3 MPEG-4 Visual Profiles and Levels . . . . . . . . . . . 14
2.3 MPEG-4 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Cryptography and Cryptanalysis . . . . . . . . . . . . . . . . . 16
2.4.1 Cryptosystems . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Cryptanalysis . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Image and Video Encryption . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Application of encryption in the encoding process . . 18
2.5.2 Syntactical entities for encryption . . . . . . . . . . . . 19
2.5.3 Combined image encryption and compression frameworks . . . . . . 20
2.5.4 Data analysis and attacks on the core cipher . . . . 21
2.5.5 Error concealment attacks . . . . . . . . . . . . . . . . 21
2.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 PROPOSED ENCRYPTION TECHNIQUE . . . . . . . . . . . . . . . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Dependency Through Error Propagation . . . . . . . . . 23
3.3 The Bit Allocation Strategy . . . . . . . . . . . . . . . . . . . . . 24
3.4 Levels and Estimation of ci . . . . . . . . . . . . . . . . . . . . . 27
3.5 Encryption Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Encryption Side-Information . . . . . . . . . . . . . . . . . . . . 30
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 EXPERIMENTS AND RESULTS . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Implementation and Test Platform . . . . . . . . . . . . . . . . . 32
4.2 Implementation of SET-WEIGHTS and Budget Distribution . . 32
4.2.1 Core Cipher . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Restrictions of the Implementation . . . . . . . . . . . 33
4.3 Test Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Encoding Parameters . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.1 Bit Distribution Plots . . . . . . . . . . . . . . . . . . . 35
4.5.2 Encryption Ratios . . . . . . . . . . . . . . . . . . . . . 56
4.5.3 Bit Allocation with Changing GOV size and Bitrate . 57
4.5.4 Side Information Characteristics . . . . . . . . . . . . 58
4.5.5 Perceptual Quality . . . . . . . . . . . . . . . . . . . . 59
5 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . 61
5.1 Features of the Proposed Method . . . . . . . . . . . . . . . . . 61
5.2 Main Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Suggested Future Work . . . . . . . . . . . . . . . . . . . . . . . 62
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
LIST OF TABLES
TABLE
3.1 Algorithm SET-WEIGHTS . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 IPMP SelectiveDecryptionMessage structure, specific to the proposed system . . . 30
4.1 Bit distribution for Carphone at 1700 bits/frame encryption, 12-VOP GOVs . . . 49
4.2 Bit distribution for Foreman at 1700 bits/frame encryption, 12-VOP GOVs . . . 50
4.3 Bit distribution for Foreman at 2500 bits/frame encryption, 12-VOP GOVs . . . 51
4.4 Bit distribution for Foreman at 3400 bits/frame encryption, 12-VOP GOVs . . . 52
4.5 Bit distribution for Foreman at 4200 bits/frame encryption, 12-VOP GOVs . . . 53
4.6 Bit distribution for Foreman at 5000 bits/frame encryption, 12-VOP GOVs . . . 54
4.7 Bit distribution for Miss America at 1700 bits/frame encryption, 12-VOP GOVs . . . 55
4.8 Length of side information for various sequences . . . . . . . . . . . . . 58
LIST OF FIGURES
FIGURES
2.1 Block diagram for encoding process . . . . . . . . . . . 5
2.2 Some of the possible prediction configurations for temporally scalable video . . . 12
2.3 Decoder elements and IPMP control points . . . . . . . . 16

3.1 Macroblock interdependence . . . . . . . . . . . . . . . 24
3.2 Error propagation from frame 268 to frame 271 of foreman . . . 24
3.3 VOP dependence stacks . . . . . . . . . . . . . . . . . . 26
3.4 Referenced block areas . . . . . . . . . . . . . . . . . 29
4.1 Inter (above) and intra (below) bit distributions in Carphone with 1700 bits/frame encryption and 12-VOP GOVs . . . 36
4.2 Inter (above) and intra (below) bit distributions in Carphone with 2500 bits/frame encryption and 12-VOP GOVs . . . 37
4.3 Inter (above) and intra (below) bit distributions in Carphone with 3400 bits/frame encryption and 12-VOP GOVs . . . 38
4.4 Inter (above) and intra (below) bit distributions in Carphone with 4200 bits/frame encryption and 12-VOP GOVs . . . 39
4.5 Inter (above) and intra (below) bit distributions in Carphone with 5000 bits/frame encryption and 12-VOP GOVs . . . 40
4.6 Inter (above) and intra (below) bit distributions in Carphone encoded at 384 kbps with 4200 bits/frame encryption . . . 41
4.7 Inter (above) and intra (below) bit distributions in Carphone encoded at 576 kbps with 4200 bits/frame encryption . . . 42
4.8 Inter (above) and intra (below) bit distributions in Carphone encoded at 768 kbps with 4200 bits/frame encryption . . . 43
4.9 Inter (above) and intra (below) bit distributions in Carphone with 3400 bits/frame encryption and 24-VOP GOVs . . . 44
4.10 Inter (above) and intra (below) bit distributions in Carphone with 3400 bits/frame encryption and 36-VOP GOVs . . . 45
4.11 Inter (above) and intra (below) bit distributions in Carphone with 3400 bits/frame encryption and 48-VOP GOVs . . . 46
4.12 Inter (above) and intra (below) bit distributions in Foreman with 3400 bits/frame encryption and 24-VOP GOVs . . . 47
4.13 Inter (above) and intra (below) bit distributions in Miss America with 3400 bits/frame encryption and 24-VOP GOVs . . . 48
4.14 Distribution of the segment lengths for Carphone, Foreman and Miss America, encrypted at 1700 bits/frame, 24-VOP GOVs. y-axis is logarithmically scaled and samples with segment lengths greater than 2500 are discarded. . . . 57
4.15 Foreman original (left) and encrypted at 2500 bits/frame (right), frame 184 . . . 59
4.16 Miss America original (left) and encrypted at 1700 bits/frame (right), frame 89 . . . 59
LIST OF ABBREVIATIONS
2-D Two-dimensional

3-D Three-dimensional

4:2:0 Color subsampling technique in which the luminance component is sampled at full rate whereas the chrominance components are sampled at half (horizontal and vertical) resolution.

AC Non-DC (coefficient of the transformation)

AES Advanced Encryption Standard

BIFS Binary Format for Scenes

CAE Context-based Arithmetic Encoding

CIF Common Intermediate Format (352 by 288)

DC The lowest frequency (coefficient of the transformation)

DCT Discrete Cosine Transform

DES Data Encryption Standard

DVB Digital Video Broadcast

fps Frames Per Second

GOP Group of Pictures

GOV Group of Video Object Planes

iDCT Inverse DCT

IEC International Electrotechnical Commission

IPMP Intellectual Property Management and Protection

ISO International Standards Organization

ITU International Telecommunications Union

JPEG Joint Photographic Experts Group

MB Macroblock

MPEG Moving Picture Experts Group

MV Motion Vector

MVD Motion Vector Difference

PSNR Peak Signal to Noise Ratio

QCIF Quarter CIF (176 by 144)

RGB Color space in which colors are represented as combinations of red, green and blue light.

RSA Rivest, Shamir and Adleman (the names of the inventors of the algorithm)

RVLC Reversible Variable Length Coding/Codeword

SAD Sum of Absolute Differences

SPIHT Set Partitioning in Hierarchical Trees

VLC Variable Length Coding/Codeword

VO Video Object

VOL Video Object Layer

VOP Video Object Plane

VS Visual Object Sequence

YUV Color space in which colors are represented in one luminance (Y) and two chrominance (U and V) components
CHAPTER 1
INTRODUCTION
1.1 Motivation
Advances in compression, delivery and presentation technologies of digital video in
recent years have broadened the share of digital video in (audio)visual communica-
tion and entertainment, changing the ways that the end users create, access, store
and copy video. In contrast to analog technologies, digital technology offers:

Computer-aided content creation and manipulation,

Transmission over computer networks,

Storage in a computer environment,

Production of identical copies without any specialized hardware.
However, the listed benefits bring along a problem: access control. Video is transmitted
over insecure networks, where a malicious party can acquire any packet, including
those carrying private communication or commercially valuable entertainment data.
The network, in particular the Internet, also allows peers to share their files, resulting
in an exponentially increasing number of copies, a phenomenon called superdistribution [1].
The path between the content creator and the viewer must be secured, so that
only viewers authorized by the content creator (or presenter) can access the video.
This corresponds to the preservation of privacy in one-to-one communication and
the prevention of piracy in broadcasting. It is also desirable that the viewer be able
to produce copies as long as a policy established by the content creator permits.
Encryption of video, combined with access control logic implemented in the player,
is essential to prevent unwanted content acquisition. There are a number of issues
to be considered while designing an access control mechanism, as pointed out by
previous works [2, 3]:
1. Encrypting (and decrypting) a video stream in its entirety takes a considerable
amount of time, which can be comparable to the decoding time. Therefore, only
a carefully selected portion of the video should be encrypted, to limit the cost
of the operation.

2. The protection level for the content must be identified. Considering the business
of trading copyrighted items, in particular entertainment, an increase in
piracy boosts the demand for legitimate items1. Therefore, paranoid protection
may offend the end user and reduce demand; on the other hand, a loose
protection mechanism may harm the business setup, reducing revenues.
3. The protected video may have a limited lifetime, in the sense that it is of no
value after some time. For example, piracy makes sense only if a protected live
soccer broadcast can be broken before excerpts from the match are broadcast
publicly in the succeeding sports programs. Therefore, a protection scheme that
takes just more time to break than the lifetime of the content is robust.

4. The difficulty of breaking an encryption mechanism is usually estimated considering
current computational resources. Upgrades or reconfigurations are required
to keep protection schemes robust.
(1) is a well-studied problem, in the sense that different syntactical entities of
the video stream have been tried for partial encryption, without considering (2) or (3).
Examples of such studies are discussed in Section 2.5. (4) is a design problem at a
coarser level and solutions do exist; MPEG-4 IPMP is an example, which is discussed
in Section 2.3.
1 The curious reader may have a look at [4] and works referenced there.
1.2 Contributed Work
This work proposes a method in which the video stream is partially encrypted and the
distribution of the encrypted bits over the different syntactical entities of the stream is
optimized, subject to a constraint on the number of encrypted bits, based on a simple
model of the time to break the protection, so that the average time to break the
encryption over a temporal sample is maximized. The developed partial encryption
method can therefore be configured in a straightforward way, with regard to the value
of the data, providing solutions for (2) and (3).
A method to estimate the parameters of the model is also proposed. The estima-
tion method produces parameters depending on the video stream to be encrypted
and it can be used simultaneously with encoding.
Additionally, the layout of the encryption side information conforming to the MPEG-4
IPMP Final Proposed Draft Amendment is described.
1.3 Organization
The succeeding chapters of this thesis are organized as follows: Chapter 2 gives
background on how digital video is encoded compactly, focusing on MPEG-4
compression tools. Chapter 3 proposes a model for the cryptanalytic complexity of
an MPEG-4 natural video stream and finds the bit budget distribution maximizing
the time required to break the encryption, constrained by the number of bits to be
encrypted. The side information format is also presented in that chapter. Chapter 4
contains experimental estimation of the model parameters and quantitative information
regarding the encrypted video streams. The thesis is concluded in Chapter 5, with directions
for future studies in this area.
The CD includes the extended codec source files, scripts that make up the exper-
imental setup and raw video sequences used as test data.
CHAPTER 2
BACKGROUND ON VIDEO COMPRESSION AND
ENCRYPTION
This chapter describes state-of-the-art video compression and encryption algorithms
to complement the succeeding chapters, concluding with a summary of previously
developed encryption methods for video.
2.1 Video Compression
Video data requires a large amount of space for storage in its raw form. For example,
a one-minute sequence of 352x288 RGB frames at 25 fps is approximately 430
megabytes. Fortunately, a large amount of spatial and temporal redundancy resides
in such raw sequences, which can be reduced by compression. The succeeding paragraphs
of this section describe the sources of redundancy and the basic approaches
used in current video compression techniques.
The human visual system is less sensitive to the chrominance information than
the luminance information, since there are more luminance-sensing cells than
chrominance-sensing cells in the retina. Therefore, one can downsample the chrominance
information in every individual frame to reduce the amount of data needed to represent a
perceptually equivalent frame [5].
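The idea can be illustrated by the following sketch, which downsamples a chrominance plane by averaging each 2x2 neighborhood, the layout known as 4:2:0 (real codecs may use different filters and sample phases; this is an illustrative simplification):

```python
import numpy as np

def subsample_420(chroma: np.ndarray) -> np.ndarray:
    """Downsample a chrominance plane by 2 in each direction (4:2:0),
    averaging each 2x2 neighborhood. A simple sketch; actual codecs
    may use other filters and phase alignments."""
    h, w = chroma.shape
    c = chroma[:h - h % 2, :w - w % 2].astype(np.float64)  # even-sized crop
    return ((c[0::2, 0::2] + c[0::2, 1::2] +
             c[1::2, 0::2] + c[1::2, 1::2]) / 4.0)

u = np.arange(16, dtype=np.float64).reshape(4, 4)
print(subsample_420(u).shape)  # (2, 2): a quarter of the original samples
```

With both chrominance planes reduced this way, the frame needs half the raw data of full-resolution YUV.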
A well-known approach for compression is to eliminate the spatial redundancy
by transform coding, which involves transforming the image into a domain where it
can be approximated with all but a few coefficients set to zero.

Figure 2.1: Block diagram for encoding process (Raw Video -> Motion Estimation -> Transform Coding -> Quantization -> Entropy Coding -> Encoded Video).

The discrete cosine transform (DCT) and the wavelet transform are the most commonly used
transformations [6]. Although inferior in compression, the DCT is more commonly used
than the wavelet transform, since blockwise DCT of the image is more suitable for
block-based motion estimation, which is also more popular (and more economical) than
alternative motion estimation methods.
Consecutive frames of a video sequence are usually similar (except for the loca-
tions where the scene changes), with slight differences due to motion. The redun-
dancies due to this similarity can be eliminated by modeling the motion.
Any source of symbols can be compressed by entropy coding. The symbols are
coded in a way that a symbol is mapped to a codeword with the length depending
on the frequency of the symbol. Most of the video coding schemes prefer using prefix
codes with predefined symbol to codeword mappings to eliminate the overhead due
to transmission of the tables. An alternative method is arithmetic coding [7], which
maps the string to be encoded to a number in a subinterval of [0, 1), using the frequencies
of the symbols to be encoded. Arithmetic coding achieves the optimal codeword
assignment, but it requires more computational power compared to prefix
coding with predefined tables.
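The prefix-code idea can be illustrated with a minimal Huffman construction; this is a sketch of the general principle, not the predefined VLC tables MPEG-4 actually ships:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman prefix code:
    frequent symbols get short codewords, rare symbols long ones."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    if len(heap) == 1:                       # degenerate single-symbol source
        return {s: 1 for s in freqs}
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)       # two least frequent subtrees
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths(Counter("aaaaabbcd"))
print(lengths)   # {'b': 2, 'c': 3, 'd': 3, 'a': 1}
```

The lengths satisfy the Kraft inequality with equality, so the code is a complete prefix code.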
The entire process of video encoding can be summarized as a block diagram, as
shown in Figure 2.1.
One can also encode the video in a scalable way, so that a range of decoders with
different capabilities can decode the video at different qualities and/or spatiotemporal
resolutions. Scalable encoding involves encoding a basic bitstream and enhancement
bitstreams that depend on the basic bitstream [5].
2.2 MPEG-4 Natural Video Coding Standard
MPEG-4 is a standard for coding audiovisual objects; it enables re-use of audiovisual
content, mixtures of natural and synthetic content, and spatiotemporal arrangements
of objects to form scenes. Thus, the natural video coding tools were designed to be used
with such compositions as well as with ordinary rectangular image sequences. Most of
these tools are specialized and practically applicable for a number of configurations.
For example, robust and fast segmentation algorithms are required to encode nonrectangular
video objects from a natural scene, whereas it is much easier with
chroma keying in a studio environment. The remainder of this section is an
overview of natural video coding in MPEG-4 and a description of the bitstream syn-
tax, as a summary of [8] and [9].
2.2.1 Natural Video Coding Tools Provided by MPEG-4
The audiovisual object is the basic entity in an MPEG-4 scene; the scene description,
as well as the transmission of the video object to the decoder, is specified in
ISO/IEC 14496-1. Each video object is characterized by spatial and temporal informa-
tion in the form of texture, motion and shape. Texture is the spatial and motion is
the temporal relation between the video samples, and the spatiotemporal boundary
of the samples is set by the shape information. An MPEG-4 scene may consist of one
or more video objects. The visual bitstream provides a hierarchical description of a vi-
sual scene from video objects down to temporal samples of the video objects and the
decoder can access any entity in the hierarchy by seeking certain codewords called
start codes, which are not generated elsewhere in the bitstream. The hierarchy levels
with their commonly used abbreviations are:
Visual Object Sequence (VS): The sequence of 2D or 3D natural or synthetic ob-
jects.
Video Object (VO): A video object is the atomic entity that the user can access
(by seeking and browsing) and manipulate (by cuts, pastes and relocations in
the scene).
Video Object Layer (VOL): Each VO can be encoded in a non-scalable (single-layer)
or scalable (multi-layer) way, depending on the application. The VOL provides
support for scalability. There are two types of VOLs, the VOL with full MPEG-
4 functionality and the reduced functionality VOL, also called the VOL with
short headers. The latter provides bitstream compatibility with baseline H.263,
an ITU standard for video coding.
Video Object Plane (VOP): A VOP is a temporal sample of a video object. VOPs
can be encoded independently from each other or dependent on other VOPs by
motion compensation. A conventional video frame can be represented with a
rectangle-shaped VOP.
Group of Video Object Planes (GOV): GOVs group video object planes to provide
points in the bitstream where video object planes are encoded independently
from each other. Therefore GOVs provide random access points. GOVs are
optional.
A video object plane is divided into macroblocks, which contain a section of the
luminance (Y) component and spatially subsampled chrominance components (Cr and
Cb). In the MPEG-4 visual standard, a macroblock is a 16x16 section of a VOP containing
four luminance and two chrominance blocks of size 8x8 pixels, which is also
referred to as 4:2:0 subsampling, with associated motion and shape information. The
texture in each 8x8 block is encoded using DCT.
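The macroblock layout described above can be sketched as follows; the frame size and the dictionary layout are illustrative choices, not part of the standard syntax:

```python
import numpy as np

def macroblocks(y, cb, cr):
    """Split a 4:2:0 frame into macroblocks. Each macroblock carries
    four 8x8 luminance blocks (a 16x16 region of Y) and one 8x8 block
    from each of the half-resolution Cb and Cr planes."""
    mbs = []
    for i in range(0, y.shape[0], 16):
        for j in range(0, y.shape[1], 16):
            luma = y[i:i + 16, j:j + 16]
            mbs.append({
                "Y":  [luma[:8, :8], luma[:8, 8:], luma[8:, :8], luma[8:, 8:]],
                "Cb": cb[i // 2:i // 2 + 8, j // 2:j // 2 + 8],
                "Cr": cr[i // 2:i // 2 + 8, j // 2:j // 2 + 8],
            })
    return mbs

# A QCIF-sized (176x144) luminance plane yields (176/16)*(144/16) = 99 macroblocks.
y, cb, cr = np.zeros((144, 176)), np.zeros((72, 88)), np.zeros((72, 88))
print(len(macroblocks(y, cb, cr)))   # 99
```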
2.2.1.1 Shape Coding
MPEG-4 provides support for bitmap shape representation, for both binary and grayscale
shapes. In order to code the binary shape for a nonrectangular VOP, the VOP is
bounded by a rectangle which can be chosen so that it contains the minimum num-
ber of 16x16 nontransparent blocks. The shape compression algorithm provides sev-
eral modes to encode a shape block; the basic tool is the Context-based Arithmetic
Encoding (CAE) algorithm, which involves estimation of a context number, computed
from spatiotemporally neighboring pixels, to initialize the arithmetic coder. Motion
compensation can be used to encode shape blocks depending on previously encoded
blocks. Coding with motion compensation and without motion compensation use
different variants of CAE; namely InterCAE and IntraCAE, respectively. The motion
vectors themselves are differentially coded. Every shape block can be coded in one
of these ways:
Entire block is transparent or opaque. No shape coding is required. Texture is
coded for opaque blocks.
The block is coded using IntraCAE without use of past information.
Motion vector difference (MVD) for the shape is zero, but the block is not updated.
The block update is coded with InterCAE. MVD may be zero or nonzero.
MVD is nonzero and the block is not coded.
Grayscale shapes correspond to the notion of alpha plane in computer graphics.
MPEG-4 provides syntax to code 8-bit grayscale shapes where a value of 0 corre-
sponds to a completely transparent pixel, a value of 255 corresponds to a completely
opaque pixel and intermediate values correspond to different values of transparency.
Grayscale shapes are encoded in a similar way to that of textures, with use of mo-
tion compensation and DCT; only lossy coding of grayscale shapes is allowed. The
grayscale shape coding also makes use of binary shape coding to code the regions
where grayscale shape is nonzero; the DCT coded grayscale shape belongs to this
coded region.
2.2.1.2 Motion Estimation and Compensation
The motion estimation and compensation tools in the MPEG-4 standard are similar to
those used in other video coding standards such as MPEG-2 and H.263 [5], adapting
the block-based techniques to the VOP structure. MPEG-4 provides three modes to
encode an input VOP:
A VOP can be encoded independently of any other VOP; it is then called an
intra VOP (I-VOP). The first coded VOP should be an I-VOP.

A VOP may be predicted from another previously decoded VOP. Such VOPs
are called predicted VOPs (P-VOPs).

A VOP may be bidirectionally predicted from a past VOP and a future VOP
(B-VOP). B-VOPs may only be predicted from I-VOPs or P-VOPs.
When a VOL contains B-VOPs, VOPs are rearranged before transmission so that
the decoder needs to keep at most three VOPs at a time. If a B-VOP is received, it is
decoded directly. If a P-VOP or I-VOP is received, the decoder outputs the frame
constructed from the previous I-VOP or P-VOP.
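The reordering idea can be sketched as follows: each B-VOP is held back until its future reference has been sent. This is an illustrative reconstruction of the principle, not the normative MPEG-4 buffering rules:

```python
def transmission_order(display_order):
    """Reorder VOP type labels from display order to transmission
    order: every B-VOP is sent after both of its references (the
    surrounding I/P-VOPs), so the decoder need not wait for future
    data while decoding a B-VOP."""
    out, pending_b = [], []
    for vop in display_order:
        if vop.startswith("B"):
            pending_b.append(vop)    # wait until the future reference is sent
        else:
            out.append(vop)          # send the I/P reference first
            out.extend(pending_b)    # then the B-VOPs that depend on it
            pending_b = []
    return out + pending_b

print(transmission_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```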
Encoding P-VOPs and B-VOPs requires motion estimation. Motion estimation is
performed only for macroblocks in the bounding box of the VOP. If a macroblock is
entirely within a VOP, the motion vector is estimated by minimizing the sum of absolute
differences (SAD) of the 16x16 macroblock, as well as of its 8x8 luminance blocks in
advanced prediction mode, which results in a motion vector for the entire macroblock
and a vector per luminance block. The motion vectors represent the translations of
the blocks, i.e. the motion estimation model is f(x, y, t) = f(x + c_x, y + c_y, t_ref) + e(x, y, t),
where f(x, y, t) is the pixel (x, y) at time t, t_ref is the time of the reference VOP, e is the
estimation error and c = (c_x, c_y) is the translation parameter. c is constant within a
macroblock, or within the 8x8 luminance blocks of a macroblock in advanced prediction
mode. Motion vectors are computed to half-pixel precision. Motion vectors for the
macroblocks that are only partially in the VOP are estimated using a modified block
matching technique.
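The SAD criterion for a fully interior macroblock can be sketched as an exhaustive integer-pixel search; MPEG-4 encoders additionally refine to half-pixel precision and use faster search patterns, which are omitted here:

```python
import numpy as np

def full_search(ref, cur, top, left, search=7, block=16):
    """Exhaustive integer-pixel block matching: find the displacement
    (dy, dx) minimizing the sum of absolute differences (SAD) between
    the current macroblock and candidate blocks in the reference
    frame. A sketch only; half-pixel refinement is not shown."""
    target = cur[top:top + block, left:left + block].astype(np.int64)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + block, x:x + block].astype(np.int64)
            sad = int(np.abs(target - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

# Shift a synthetic reference frame by (2, 3) pixels; the search recovers it.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))
print(full_search(ref, cur, top=24, left=24))  # ((-2, -3), 0)
```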
A motion vector is predictively coded based on three previously coded blocks.
The VLC word corresponding to this differential is then placed into the bitstream.
2.2.1.3 Texture Coding
Texture information of a video object plane is implicitly represented by the luminance
(Y) and two chrominance (Cb and Cr) channels of the video signal. In the case of an
I-VOP, the texture is the luminance and chrominance components of the signal, whereas
it is the residual error after motion compensation in B-VOPs and P-VOPs. In order to
encode the texture information, an 8x8 grid is superimposed on the VOP and the blocks
of the grid are transformed using DCT. Blocks that reside entirely in the VOP are
transformed directly, whereas boundary blocks are padded before the DCT.
Blocks containing residual error after motion compensation are padded with zeros,
and intra blocks are padded by the use of a low-pass extrapolation filter.
Transformation of the blocks is followed by quantization as a lossy compression
step, involving division of the DCT coefficients by a quantization step size. The quantization
step size can be held fixed within a block or varied in a way specified by a
quantization matrix.
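The transform-and-quantize step can be sketched for a single 8x8 block using an orthonormal DCT matrix and a uniform quantizer; this is a simplification, since actual codecs use integer-friendly DCT approximations and per-coefficient quantization matrices:

```python
import numpy as np

N = 8
# Orthonormal DCT-II matrix: row k holds the k-th cosine basis vector.
k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

def dct2(block):
    """Forward 2-D DCT of an 8x8 block (separable: rows, then columns)."""
    return C @ block @ C.T

def idct2(coeff):
    """Inverse 2-D DCT (C is orthonormal, so its transpose inverts it)."""
    return C.T @ coeff @ C

def quantize(coeff, q):
    """Uniform quantization: divide by the step size and round."""
    return np.round(coeff / q)

block = np.full((8, 8), 128.0)  # a flat block: all energy goes to the DC term
coeff = dct2(block)
print(round(coeff[0, 0]))       # 1024: DC = 8 * 128 for the orthonormal DCT
```

Because the transform concentrates the block's energy in a few low-frequency coefficients, most quantized coefficients are zero and compress well in the entropy coding stage.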
The quantization step and the quantized coefficients can be encoded using prediction
from neighboring blocks. Prediction can be performed from either the block above
or the block to the left. The prediction direction is adaptive, selected depending on
the derivative of the DC (lowest frequency) coefficient in the horizontal and vertical
directions. Only the DC coefficient or the first row/column of the AC (non-DC)
coefficients can be predicted.
Coefficients are ordered and coded based on the prediction direction; if there is no
prediction, a zigzag ordering is used. The ordered coefficients are then run-length encoded
using VLC. The DC coefficient can be coded in the same way as the AC coefficients, by using
a different VLC table or by using a fixed-length code. The last alternative is used when
encoding a bitstream with short headers.
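The zigzag scan and the run-length step can be sketched as follows; the (run, level) pairs are a simplified stand-in for the standard's VLC events (no last-coefficient flag, sign handling or escape codes):

```python
def zigzag_indices(n=8):
    """Index pairs of an n x n block in zigzag scan order: traverse
    anti-diagonals, alternating direction, so that low-frequency
    coefficients come first."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else reversed(diag))
    return order

def run_length(block):
    """Encode a scanned block as (zero-run, level) pairs; trailing
    zeros are simply dropped in this simplified sketch."""
    pairs, run = [], 0
    for i, j in zigzag_indices(len(block)):
        v = block[i][j]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 5
print(run_length(block))   # [(0, 12), (0, -3), (1, 5)]
```

Since quantized blocks are mostly zeros, long zero runs collapse into single pairs, which is where most of the entropy-coding gain comes from.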
2.2.1.4 Sprites
A sprite consists of the regions of a VO that are present in the scene throughout the
video segment, e.g. a panoramic background scene parts of which are visible in any
temporal sample of the VO. MPEG-4 allows sprite coding because it provides high
coding efficiency in cases like the example given. For any given time instant, the
background VOP can be extracted by warping and cropping the sprite appropriately.
The shape and texture of the background are encoded in the same way as those of an
I-VOP. MPEG-4 supports three modes of sprite encoding: basic, low-latency and scalable.
In basic sprite encoding, the background sprite is encoded as an I-VOP and all other
VOPs as S-VOPs, i.e. VOPs coded dependent on the sprite and possibly on another VOP.
This I-VOP is not displayed; instead, it is stored in a sprite memory and used by all
succeeding S-VOPs in the same VOL.
Since receiving a large I-VOP before starting the decoding process causes a delay,
a low-latency sprite mode is also provided. In this case, an initial sprite sufficient
to reconstruct the first few VOPs is transmitted. Sprite pieces and updates can then be
transmitted in succeeding S-VOPs: pieces are coarsely quantized replacements for
specified portions of the sprite, and updates are residuals for specified portions of
the sprite. Sprite pieces in a VOP are terminated by either a stop signal, indicating
that all the sprite information for the VOL has been transmitted, or a pause signal,
indicating that all the sprite information packed with the current VOP has been
transmitted.
Enhancements to sprites can also be encoded, as described in Section 2.2.1.5.
2.2.1.5 Scalable Video
MPEG-4 offers both temporal and spatial scalability, which increase the temporal and
spatial resolution, respectively. Both methods are implemented using more than one
VOL. A mid-processor connects the base-layer decoder to the enhancement-layer decoder,
performing any spatiotemporal conversions required before the base layer can be used
as a reference in decoding the enhancement layer. Finally, a postprocessor combines the
decoded layers prior to rendering. An enhancement layer cannot provide both spatial and
temporal enhancement at the same time: spatial enhancements must have the same temporal
resolution as the base layer, and temporal enhancements must have the same spatial
resolution as the base layer.
Spatial scalability tools support only rectangular video objects. The base layer is
encoded as described in the preceding subsections. VOPs in the enhancement layer can be
encoded predictively from the most recently decoded enhancement VOP, the most recent
VOP of the reference layer, the next VOP of the reference layer, or the temporally
coinciding VOP of the reference layer; in the last case, no motion vectors are
transmitted. Bidirectional prediction is also possible, allowing prediction from four
combinations of the possible reference entities. Independently coded VOPs are not
allowed in enhancement layers, i.e. all VOPs in the enhancement layer must be P-VOPs
or B-VOPs.
Unlike the spatial scalability tools, the tools for temporal scalability support
nonrectangular layers and partial enhancements; e.g. a fast-moving car in an almost
still scene can be selected for enhancement. For P-VOPs, prediction from the most
recently decoded VOP of the same layer, the most recent VOP of the reference layer or
the next VOP of the reference layer is possible. B-VOPs can be predicted in three
different reference configurations, which are combinations of the possible references
for P-VOPs. A number of prediction configurations are illustrated in Figure 2.2, where
the arrows point from the reference frames.
[Figure omitted: base-layer frames I, P, B, B, P, B, P at t = 0..6 with arrows showing temporal-enhancement predictions.]

Figure 2.2: Some of the possible prediction configurations for temporally scalable video
2.2.1.6 Static Textures
MPEG-4 allows the encoding of 2-D or 3-D meshes, and static textures may be mapped onto
these meshes. The textures are encoded in a way that provides a higher degree of
scalability than the DCT-based texture coding techniques mentioned previously. The
static texture coding technique is based on the wavelet transform: the DC and AC bands
of the transform are coded separately, using a zerotree algorithm and arithmetic
coding.
The texture is separated into subbands by applying the discrete wavelet transform to
the data. The number of decomposition levels can be adjusted on the encoder side. The
bitstream includes information on whether the transform is an integer or floating-point
transform and whether the default filter banks or filter banks specified in the
bitstream are used. The wavelet transform allows a natural form of scalability: the
more bands the decoder processes, the closer the decoded image approximates the
original. The lowest-resolution subband is called the DC subband and is coded using a
predictive scheme depending on the horizontal and vertical gradients of the
coefficient. The differential is then quantized and entropy coded using arithmetic
coding. AC coefficients are encoded exploiting the fact that most of the coefficients
are zero and that the zeros are correlated: a zero at a coarse scale means that zeros
are likely in the same spatial position at finer scales, forming a tree. Special
symbols are used to encode isolated zeros and zerotree roots; the latter indicates that
the descendants in the tree are not encoded. The resulting symbol sequence is encoded
using arithmetic coding. Packetization of the encoded data, which is the only error
resilience tool provided for static textures, is supported by MPEG-4 Version 2 only.
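The zerotree classification just described can be sketched on a toy coefficient quadtree. This is a generic EZW-style classifier, not the exact MPEG-4 procedure, and all names are illustrative.

```python
def classify(levels, lvl, r, c, threshold=0):
    """Classify coefficient (lvl, r, c) of a wavelet quadtree for zerotree
    coding: 'significant', 'zerotree_root' (the coefficient and all of its
    descendants are insignificant, so the descendants need not be coded) or
    'isolated_zero' (insignificant, but some descendant is significant).
    `levels` is a list of 2-D lists, levels[l] having shape 2^l x 2^l, so
    the children of (l, r, c) are the four coefficients (l+1, 2r+dr, 2c+dc)."""
    def all_insignificant(l, rr, cc):
        if abs(levels[l][rr][cc]) > threshold:
            return False
        if l + 1 == len(levels):
            return True  # leaf level: no descendants left to check
        return all(all_insignificant(l + 1, 2 * rr + dr, 2 * cc + dc)
                   for dr in (0, 1) for dc in (0, 1))

    if abs(levels[lvl][r][c]) > threshold:
        return "significant"
    return "zerotree_root" if all_insignificant(lvl, r, c) else "isolated_zero"
```

With a three-level tree that is zero everywhere except one finest-scale coefficient, the mid-level ancestor of that coefficient is an isolated zero while its sibling subtrees are zerotree roots.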
Static textures support only binary shapes, handled in the same way as in video
encoding.
2.2.2 Error Resilience and Concealment Tools
Every undecrypted piece of the bitstream is treated as a bitstream error by a standard
player. Therefore, the encryption scheme should be robust against any concealment tool
that the nature of the video stream makes available.
Bit errors in VLC-encoded data cause a loss of synchronization, and the bitstream up to
the next synchronization marker or start code cannot be decoded. In this way the error
is localized, and more precise localization results in more correct decoding. MPEG-4
resynchronization markers are placed into the bitstream so that the number of bits
spent on the macroblocks between two markers is just above a predetermined threshold.
Data is thus packetized so that each packet is equally important, since all packets
contain nearly the same amount of compressed bitstream. A packet contains a variable
number of macroblocks, unlike the packetization schemes of H.263 or MPEG-2, where a
number of rows of macroblocks are packetized together. The resynchronization marker is
followed by the number of the first macroblock in the packet, its absolute quantization
scale, optionally redundant header information, and the macroblocks in the packet. The
predictive coding used for the macroblocks in a packet does not use prediction
information from macroblocks in other packets.
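The marker placement rule can be sketched as a simple packetizer (illustrative only; header fields and marker emulation are ignored, and the function name is mine).

```python
def packetize(mb_bit_counts, threshold_bits):
    """Group macroblocks into video packets: a new packet is started as
    soon as the bits accumulated since the last resynchronization marker
    exceed the threshold, so every packet carries roughly the same amount
    of compressed data regardless of how many macroblocks it holds."""
    packets, current, bits = [], [], 0
    for mb_index, nbits in enumerate(mb_bit_counts):
        current.append(mb_index)
        bits += nbits
        if bits > threshold_bits:   # packet size just exceeded the threshold
            packets.append(current)
            current, bits = [], 0
    if current:                     # flush the final, possibly short packet
        packets.append(current)
    return packets
```

Note how a single expensive macroblock (e.g. 900 bits) can form most of a packet on its own, which is exactly why packets contain a variable number of macroblocks.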
In addition to the packet approach, MPEG-4 also adopts a second method called fixed
interval resynchronization. This method requires that VOP start codes and
resynchronization markers appear only at fixed locations in the bitstream, which avoids
most of the problems due to start code emulation. However, it carries an overhead of
stuffing bits used to align the bitstream.
An error in the motion estimation residual encoded as texture can be concealed by
assuming zero estimation error. In a similar way, erroneous motion vectors can be
concealed by motion compensating with zero motion vectors. MPEG-4 provides an encoding
mode called data partitioning, where the motion and texture information in a packet are
separated by a marker, providing further error localization and a method to conceal
errors. MPEG-4 provides still finer error localization by the use of reversible VLCs,
whose codewords can be decoded in both the forward and reverse directions.
Another error resilience tool in MPEG-4 is the inclusion of intra-coded macroblocks in
non-I-VOPs. The encoder can choose to encode a macroblock in intra mode if the motion
prediction error exceeds a predetermined threshold. This technique is called adaptive
intra refresh.
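A minimal sketch of this decision rule, assuming a SAD-based prediction-error measure and an illustrative function name:

```python
def refresh_decisions(sad_values, threshold):
    """Adaptive intra refresh, simplified: a macroblock of a P-VOP is coded
    in intra mode whenever its motion prediction error (here a SAD value
    per macroblock) exceeds the threshold; otherwise it stays inter coded,
    so intra macroblocks appear exactly where prediction fails."""
    return ["intra" if sad > threshold else "inter" for sad in sad_values]
```

An intra macroblock placed this way stops error propagation at that position, since it no longer references possibly impaired data in the previous VOP.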
2.2.3 MPEG-4 Visual Profiles and Levels
In order to classify the conformance of encoders, decoders and encoded bitstreams,
subsets of the standard defining conformance points are specified by means of profiles
and levels. A profile is a subset of the MPEG-4 coding tools, and a level is a set of
restrictions on the parameters of those tools, e.g. the number of macroblocks per
second, the bitrate, etc. Profile and level information is signaled in the bitstream so
that a decoder can deduce whether it is capable of processing the stream.
The simple object type is an error-resilient rectangular natural video object of
arbitrary height/width ratio, developed for low-bitrate applications. It uses I-VOPs
and P-VOPs with simple and inexpensive coding tools. The simple scalable object type is
built on top of simple, adding spatial and temporal scalability tools. The advanced
simple object type is also built on top of simple, by the addition of B-VOP coding
tools and interlaced video support. The advanced simple profile is popular among video
codecs for desktop computers, such as DivX.
The core object type is also built on simple, by the addition of tools supporting
binary shapes and B-VOPs. The N-bit object type is built on core, adding support for
pixel depths in the 4-12 bit range. The main object type supports sprites, interlacing
and greylevel shape, in addition to the tools supported by core.
Still textures are supported by the scalable still texture profile, and the mapping of
these textures onto 2-D dynamic meshes is supported by the animated 2-D mesh profile.
Readers interested in profiles and levels are directed to [9, 10].
2.3 MPEG-4 Systems
This section describes the systems layer of the standard, which defines the way that
audiovisual objects are delivered to the decoder in synchronization and the way that an
MPEG-4 scene is described. The systems layer also defines the means of intellectual
property management and protection (IPMP) in MPEG-4. The standard only defines control
points for the IPMP tool and the structure of the container for IPMP data, including
tool identification and the container for tool-specific data, permitting the
integration of proprietary conditional access methods into the standard.
The final committee draft [11] does not specify a file format for MPEG-4, but a file
format based on that of QuickTime was later adopted in an amendment [12]. An interface
for IPMP tools has also been added in an amendment [13], in the same way as in
MPEG-2 [14].
The components of the systems layer are shown in Figure 2.3, which is adapted from
[11]. The demultiplexing framework acquires the elementary streams (ES), each of which
contains data of only one kind. Elementary streams are not required to reside in the
same medium; i.e. some of them can be downloaded while others are read from a file.
Decoders are fed with elementary streams from demultiplex buffers (DB), and their
outputs are placed into composition buffers (CB), which hold decoded content prior to
scene composition using the description from the scene description ES, encoded in a
format called BIFS (Binary Format for Scenes). The scene composer gets the descriptions
of the objects in the scene from the object descriptor (OD) stream; the required
objects are then acquired from the audio and video composition buffers using this
object description information. The composed scene is finally rendered.
The IPMP control system can manipulate the decoding process at a number of control
points using the information from the IPMP-ES, which, for example, can be used to carry
decryption keys. In Figure 2.3, the control points are shown with gray circles. The
standard is flexible in the sense that it does not define any IPMP tools, allowing
proprietary IPMP systems to be implemented. In this way, MPEG-4 is protected from
becoming obsolete due to changes in technology (of cryptanalysis) and business models
(affecting the way that users purchase and view content). IPMP tool acquisition,
authentication and operation (as a black box) are defined in amendments to the
standard [13].
[Figure omitted: the demultiplexer feeds the audio, video, OD, BIFS and IPMP demultiplex buffers and decoders; decoded audio and video pass through composition buffers into composition and rendering, the decoded BIFS forms the BIFS tree, and the IPMP system exercises control at the points marked with gray circles.]

Figure 2.3: Decoder elements and IPMP control points
2.4 Cryptography and Cryptanalysis
Cryptography is the science of encoding data (encryption) so that it can be decoded
(decryption) only by specific individuals. A system for encrypting and decrypting data
is a cryptosystem. Encryption usually involves an algorithm for combining the original
data (plaintext) with one or more keys: numbers or strings of characters known only to
the sender and/or recipient. The resulting output of encryption is known as ciphertext.
There are two main classes of cryptosystems, with different practical application areas
in today's technology. Public-key methods use two different keys for encryption and
decryption. Secret-key methods, on the other hand, use the same key for both encryption
and decryption.
2.4.1 Cryptosystems
Secret-key methods can be classified into two groups, namely block and stream ciphers.
Block ciphers encrypt and decrypt in multiples of a block, whereas stream ciphers
encrypt and decrypt at arbitrary data sizes. Block ciphers are mostly based on
Shannon's idea that the sequential application of confusion and diffusion will obscure
redundancies in the plaintext, where confusion involves substitutions that conceal
redundancies and statistical patterns in the plaintext, and diffusion involves
transformations (or permutations) that dissipate the redundancy of the plaintext by
spreading it out over the ciphertext. DES and Rijndael are examples of algorithms based
on this idea, which allows simple hardware implementations or fast software
implementations using simple arithmetic; however, they are not fast enough to encrypt
large volumes of data in real time. An ANSI C implementation of Rijndael, which has
been adopted as AES by the US Government, requires 950 processor cycles per block on
the x86 architecture [15].¹
Most stream ciphers rely on the fact that XORing the plaintext with a string known only
to the sender and the receiver provides strong encryption. To generate this string, one
can use a block cipher to encrypt a sequence known to both parties, as suggested in the
Rijndael specification. A stream can also be encrypted by a block cipher after being
aligned to block boundaries, in cipher block chaining (CBC) mode, where the encryption
of a block depends on the previous block through the XORing of the previous ciphertext
with the plaintext of the block.
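A keystream construction in this spirit can be sketched as follows, using SHA-256 over a counter as a stand-in pseudorandom function, since no real block cipher is wired in here. This is illustrative only, not a vetted cryptographic construction, and all names are mine.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Generate a keystream by repeatedly 'encrypting' a counter, in the
    manner of a block cipher run in counter mode; SHA-256 stands in for the
    block cipher purely for illustration."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(data: bytes, key: bytes, nonce: bytes) -> bytes:
    """XOR the data with the keystream; because XOR is its own inverse, the
    same call performs both encryption and decryption."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Applying the function twice with the same key and nonce recovers the plaintext, which is exactly the symmetry that makes XOR keystreams attractive for encrypting streams of arbitrary size.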
The most popular public-key method is RSA, which uses large prime numbers and modular
arithmetic to encrypt a given text. RSA is slower and more complicated to implement in
hardware, since the primes are usually greater than 512 bits in size and the algorithm
requires the computation of powers and remainders with those large numbers. The
benchmark in Slagell's thesis [16] concludes that RSA is at least three times slower
than secret-key methods, and that its processing time increases cubically with key size
on the x86 architecture, whereas larger keys cause only slight increases for secret-key
methods. However, the private key is not predictable given the public key and vice
versa; therefore a sender-receiver pair can establish a one-way secure channel by
transferring the encryption key from the receiver to the sender. A common application
of public-key methods is to transfer a secret key, which then encrypts a larger amount
of data.
2.4.2 Cryptanalysis
Cryptanalysis is the science concerned with breaking cryptosystems. It generally
involves the following main methods:

- A cryptanalyst can inspect a number of particular ciphertexts for certain patterns
and correlations. This method of attempting to break a cryptosystem is called a
ciphertext-only attack.
¹ An MMX implementation of the inverse DCT requires no fewer than a thousand processor cycles per 8x8 block, and the iDCT accounts for about one third of the decoding effort.
- The cryptanalyst may have the plaintexts besides the ciphertexts. In this case, it
may be possible to investigate the relation between the plaintexts and the
corresponding ciphertexts. This type of attack is called a known-plaintext attack.

- In a chosen-plaintext attack, the cryptanalyst has access to the cryptosystem and is
able to obtain the ciphertexts for plaintexts he/she provides.

- As a last method, one can exhaustively try a set of keys until a decryption judged to
be valid is achieved, which is impractical for large amounts of data or large key
spaces.
In addition to these attacks, Section 2.5.4 presents two more example attacks specific
to video data.
2.5 Image and Video Encryption
Video encryption has two major fields of application. The first is access control to
commercial multimedia content, where the requirement is to minimize illegal access to
the content while keeping the cost of encryption, in terms of increased player
complexity and decreased player usability, low. The second is the protection of video
distributed from a source to one or a few destinations, e.g. in videoconferencing,
where privacy is essential.
This survey includes both image and video encryption schemes proposed prior
to this work. Image encryption schemes are also included since the presented ideas
may be helpful in texture encryption for video.
2.5.1 Application of encryption in the encoding process
As pointed out in [17], data can be encrypted at any stage of the encoding process.
However, not every point is equally advantageous in terms of format compliance,
encryption overhead, compression efficiency, processability, syntax awareness and
transmission friendliness, which form a set of important criteria for many existing
applications. Encryption prior to encoding is not suitable, because encrypting a
bitstream increases its entropy and therefore renders further compression impossible.
Encryption before variable-length coding also increases the encoded bitstream size,
though less than the former, and it results in a format-compliant bitstream: the
bitstream contains neither syntactical errors (e.g. invalid VLC codes) nor semantical
ones (e.g. more than 64 DCT coefficients in an 8x8 DCT block). The work by Wen et
al. [18, 2], encrypting the indices of VLC and FLC entries, is a good example. Their
work also proposes other methods, such as the shuffling of higher-level structures like
macroblocks and run-level codewords; the main drawback of shuffling is that it delays
decoding, since the entire shuffled area must be retrieved before higher-level
operations, e.g. the inverse DCT, can be conducted.
Encryption prior to multiplexing and packetization can be conducted in a syntax-aware
manner, so that any fault-tolerant but non-decrypting player can handle the bitstream:
the video stream remains browsable, and the layers and video objects in the stream
remain separable; these abilities may be necessary to support transcoding or traffic
shaping. To achieve this, the video cryptosystem must not output bits emulating the
special codes that signal the structure of the video stream. Such methods reduce
compression efficiency less than the formerly stated methods. One advantage of
encrypting a high-entropy bitstream is that fewer encrypted bits suffice for high
security, as presented in Qiao and Nahrstedt's work [19].
Any encryption following the packetization step at the systems layer is harder to
implement efficiently in a syntax-aware way. A syntax-unaware encryption, which simply
encrypts randomly or uniformly spaced fragments of the bitstream, does not provide the
facilities mentioned for pre-packetization encryption. Besides this, it may be
insecure, since it takes into account neither the error resilience tools nor the data
interdependencies of the video coding scheme. An example is Griwodz's work [20].
Another example is the work of Wee and Apostolopoulos [21], a combined scalable
encoding and packetization framework optimized for transcoding.
2.5.2 Syntactical entities for encryption
There are a few basic ideas for selective encryption. Selecting a segment of the video
sequence on which other parts are coded dependently reduces the size of the data to be
encrypted. For example, one can encrypt the I-VOPs, on which the encoding of P-VOPs and
B-VOPs depends. However, encoders can be designed to encode a single I-VOP at the
beginning and place intra MBs adaptively in P-VOPs [9].
In the same way, one can apply encryption to the base layer of a scalably encoded video
stream to protect the entire stream. In order to provide different qualities of service
with access control, one can also encrypt the enhancement layer with a different key,
so that only those possessing both keys can decode the full-quality video [22, 23, 24].
The DCT is known to output coefficients with small correlation, so one can alter the
coefficients depending on the output of a cipher to encrypt the data; the work by Shi
and Bhargava is such an example [25]. DCT coefficients can also be permuted, as in
[26]; however, this has been shown to be insecure [19] and it reduces compression
efficiency. The works by Tosun and Feng [27, 22] propose a scheme where a portion of
the DCT coefficients is encrypted. Qiao and Nahrstedt's work, presenting a way to halve
the number of bits to encrypt, is also based on DCT encryption [19].
Encryption of motion vectors is infeasible in most cases, since encrypting a single
motion vector would require markers as encryption side information. It is only feasible
in a setting like the VLC index encryption of Wen et al. [18] or the MPEG-4 data
partitioning mode. Moreover, Wen et al. have demonstrated that the errors due to
motion-vector-only encryption are concealed to an acceptable degree.
Encrypting headers, on the other hand, does not provide security, since header
information can be guessed most of the time.
2.5.3 Combined image encryption and compression frameworks
Bourbakis and Maniccam have proposed an image encryption/compression framework based on
a traversal of the image plane in a way suitable for run-length coding [28]. The
traversal is encoded in a context-free grammar previously developed in Bourbakis'
work [29]. Both lossless compression and encryption are achieved by run-length encoding
the traversed pixels together with the encrypted traversal description. The main
disadvantage of their scheme is that it requires much greater encoding effort than
JPEG.
In [30], Chang et al. have proposed a method that involves building and encoding a
quantization table, which is encrypted afterwards. They argue that their scheme is hard
to break using known attacks. Their work does not include any experimentation or
application to a transform coding scheme.
Quadtree encoding with encryption was first proposed by Chang et al. [30]: a square
image is divided into subimages until every subimage is homogeneous. Homogeneous
subimages are leaves of the quadtree hierarchy, which is formed according to image
inclusion, parent nodes containing their children. The image is then encoded as the
tree traversal and the leaf values, and it can be encrypted by applying encryption to
this tree structure. In later works by Cheng [31, 32], encrypting certain traversals is
presented as a method for image protection. Cheng, in his Master's thesis [32], has
also proposed a method to encrypt SPIHT-encoded images.
2.5.4 Data analysis and attacks on the core cipher

The curious reader can find attacks on ciphers such as DES in the literature; however,
attacks on the core cipher are considered infeasible in most practical cases: in the
case of video encrypted for entertainment purposes, legitimate access to the bitstream
costs much less than the computational power required to break the core cipher. On the
other hand, parts of the data may still be guessed even after encryption, as previously
discussed in Section 2.4.2 and elaborated in the next paragraph.
Video data is known to be spatiotemporally smooth, so one can speed up breaking the
ciphertext if part of the plaintext is known; this technique is called a nearest
neighbor attack in [30]. The same work also defines the jigsaw puzzle attack, which
speeds up the breaking process by dividing the ciphertext into small portions
constrained by smoothness and by similarity to the neighbors at the boundaries.
2.5.5 Error concealment attacks
When the decoder is unable to retrieve them, undecodable fields can be set to default
values: motion vectors and the quantizer step difference can be set to zero, and the
intra DC coefficient to a fixed value. Alternatively, values from previous frames can
be used, since these values tend to change in small steps. Both approaches are
suggested as simple means of error concealment in the literature [9, 18]. Besides
these, the reader can find various studies on other techniques that predict undecodable
values from the syntax or from previously decoded values.
2.5.6 Discussion
Two classes of cryptosystem breaks can be considered for the video encryption schemes
described in this chapter. In the first class, the entire cryptosystem fails, so that
the entire video sequence can be broken by a one-time effort; this is called
simultaneous cryptanalysis. In the second class, the attacker breaks one individual
video element at a time; this is called progressive cryptanalysis. Simultaneous
cryptanalysis attacks the core cipher or the way the decryption key is kept or
transmitted, which requires a systems-level attack or the use of cryptanalytic
techniques against the core cipher. Simultaneous cryptanalysis techniques require a
study of data security and general cryptanalysis, and are therefore left outside the
scope of this thesis. Partially encrypted video, on the other hand, is prone to attacks
of the types discussed earlier in this chapter; e.g. run-length encryption can be
broken by toggling bits until a valid VLC sequence is found, optionally constrained to
give an output resembling a given fragment of the video. In a similar way, index
encryption can be broken by trying a subset of the possible codewords. Neither
technique is feasible against low-cost video: anyone wishing to break the encryption in
real time to watch a live soccer broadcast needs hardware far more expensive than the
cost of watching the broadcast legitimately. Recent works like [18] note this situation
and propose partial encryption of syntactic entities, e.g. partial encryption of MBs,
or encryption of MVs whose magnitude lies in a predefined interval. The current
literature does not propose any reasonable means to adjust the level of encryption
depending on the value of the video stream, although the encryption schemes in [27] or
[18] can be applied at multiple levels.
CHAPTER 3
PROPOSED ENCRYPTION TECHNIQUE
3.1 Introduction
Having attempted to encrypt every syntactical entity in encoded video, recent studies
of video encryption have turned to syntax compliance and to the processability of the
unencrypted bitstream by third parties, in order to manipulate transmission rates and
to allow searches. However, limiting the bit rate of the encrypted portion of the video
stream while keeping security maximized remains an open problem; it requires
distributing a budget of encrypted bits over the syntactical entities of the video.
Another unaddressed problem is encoding the encryption side information compactly and
in an error-resilient way. An approximate, yet efficiently computable, solution to the
first problem is presented in this chapter. A storage format complying with the
amendments to the MPEG-4 standard is also given at the end of the chapter.
A solution that limits the bitrate of the encrypted stream while keeping security
maximized has clear practical impact when low-resource hardware that cannot decrypt
faster than a certain rate is considered, e.g. a wireless player constrained by limited
battery life, or a DVB box constrained by production costs.
3.2 Dependency Through Error Propagation
As briefly described in Chapter 2, the VOPs in a video are encoded dependently on one
another through the estimation of translational motion. A P macroblock is encoded as
texture and motion information depending on the reconstruction of at least one and at
most four macroblocks in the previously encoded I-VOP or P-VOP, as illustrated in
Figures 3.1 and 3.2. Because natural video sequences contain motion of a more complex
nature, more than one macroblock may depend on a given macroblock in the reference VOP,
particularly macroblocks residing in a region of the VOP where the motion flux is
large. Moreover, the texture and motion in the same video packet are encoded
predictively, hence in-VOP dependency also exists, which is also worth considering
while designing a video encryption scheme.
[Figure omitted: macroblocks of the VOP at T = t+1 pointing back to the overlapping macroblocks of the reference VOP at T = t.]

Figure 3.1: Macroblock interdependence

Figure 3.2: Error propagation from frame 268 to frame 271 of foreman
3.3 The Bit Allocation Strategy
Achieving maximal security can be defined as maximizing the computational power
required to break the encryption scheme. In the context of this work, the constraint on
this maximization is the number of bits that can be encrypted. To simplify matters,
only break attempts that recover the exact cleartext are considered successful; such an
attempt requires an exhaustive search of a subset of the codeword space constrained by
some criteria, e.g. the codeword space for the DC component of a DCT-transformed block
can be constrained by the energy of a portion of a known plaintext and by the set of
valid codewords. Searching the reduced space has a complexity of

\[ f(x) = 2^{kx} = e^{k'x}, \qquad k' = k \ln 2, \; k \in [0, 1] \]

in terms of the number of encrypted bits x and a constant k. The factor k represents
the reduction of the search space by syntax, heuristics or data analysis; hence it
captures both the "smartness" of the attacker and the weakness of the underlying
encryption method.

Because the encrypted portions are treated as errors by a non-decrypting decoder, the
problem of breaking the encryption is equivalent to the recovery of bitstream errors,
and maximizing security in the sense described is equivalent to maximizing the recovery
effort, constrained by the number of erroneous bits. Therefore, the time required for
cryptanalysis can be modeled once the error propagation is modeled.
A model for error propagation is established in the studies by Zhang et al. [33]. In
their study, MPEG-2 frames are classified into levels stacked on one another so that
errors propagate from bottom to top. The levels are numbered, and the propagation of
errors from level i to level j is found by experimentation and organized into a matrix
E, using as the error metric the number of impaired macroblocks in level j due to the
propagation of an intrinsic (i.e. not propagated) error at level i. Considering
rectangular VOPs, one can use this data to assign importances to macroblocks, since an
average of m propagated errors in level j due to an intrinsic error in level i, i.e.
E_ij = m, can be interpreted as m macroblocks in level j depending on a macroblock in
level i. Zhang et al. have worked with sequences encoded with periodic I-frames and
subsequent P-B combinations, corresponding to Figure 3.3(a), which is adapted from
their work. However, other stack structures can be established for different encoder
configurations; Figure 3.3(b) is the stack for the encoder configuration with a single
initial I-frame, and Figure 3.3(c) is the stack for bilayer video with periodic intra
refresh.
Figure 3.3: VOP dependence stacks. (a) I, independently coded; P3, dependent on I; P6, dependent on P3; P9, dependent on P6; B-frames, dependent on I, P3, P6 and P9. (b) B-VOPs, depending on P+ and P−; P+, coded depending on P−; P−, coded depending on P2−; P2−, not depending on any VOP in the stack. (c) I-VOP, independently coded; P-VOPs and B-VOPs of the base layer; P-VOPs and B-VOPs of the enhancement layer.

Once the encrypted and therefore undecodable portions are detected and localized, the average amount of time required to cryptanalyze a given VOP becomes
$$C(x_1, \dots, x_N) = \frac{1}{N} \sum_{i=1}^{N} c_i f_i(x_i) \qquad (3.1)$$

where $c_i = \sum_{j=1}^{N} E_{ij}$ is a weight representing the importance of layer $i$, $N$ is the number
of layers, and the $f_i(x_i)$ are assumed to be of the form $e^{kx_i}$. Equation (3.1) is constrained
by the number of encrypted bits:

$$\sum_{i=1}^{N} x_i = B \qquad (3.2)$$
where $B > 0$ is the number of encrypted bits. Although not taken into account, $x_i$
is also bounded: $0 \le x_i \le B_i$, where $B_i$ is the number of bits in which the syntactical
entity $i$ is encoded. Equation (3.1) constrained with (3.2) has only one extreme point,

$$x_i = \frac{B}{N} - \frac{1}{Nk} \sum_{j=1,\, j \neq i}^{N} \ln\frac{c_i}{c_j} \qquad (3.3)$$

which is a minimum. Hence, the maximizing solution is (as is in fact intuitive)
on the boundary:

$$x_i = B, \quad i = \arg\max_i c_i \qquad (3.4)$$

$$x_i = 0, \quad i \neq \arg\max_i c_i \qquad (3.5)$$
However, the budget $B$ is not entirely spent if $B_i < B$. Therefore, the maximizing
solution first requires sorting the $c_i$ in descending order, $c_{(1)} \ge c_{(2)} \ge \dots \ge c_{(N)}$.
Then, the minimum of the number of bits left in hand and $B_{(i)}$ must be reserved for
syntactic entity $(i)$:

$$B'_{(i)} = \min\Big(B_{(i)},\; B - \sum_{j=1}^{i-1} B'_{(j)}\Big) \qquad (3.6)$$
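The boundary solution above amounts to a greedy allocation: fill the syntactic entities in order of decreasing weight, each up to its own size in bits, until the budget is exhausted. A minimal sketch of this strategy (function and variable names are mine, not the thesis code):

```python
def distribute_budget(budget, weights, capacities):
    """Greedy encrypted-bit allocation: spend `budget` bits on entities in
    order of decreasing weight c_i, giving entity i at most capacities[i]
    bits (its encoded size B_i).  Returns the per-entity bit allocation."""
    allocation = [0] * len(weights)
    # Visit entities by descending weight c_i.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    remaining = budget
    for i in order:
        allocation[i] = min(capacities[i], remaining)  # reserve min(B_i, bits left)
        remaining -= allocation[i]
        if remaining == 0:
            break
    return allocation
```

Note that, as in the text, the budget may remain partly unspent when the total size of the entities is smaller than the budget.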
3.4 Levels and Estimation of $c_i$
An enhancement for frame-based leveling can be constructed by defining subsequences
of DCT runs as the syntactical entities. Starting with [22], blocks of DCT
coefficients are separately considered as entities to be encrypted; hence frame-based
leveling can be refined to subdivide the DCT coefficients into sublevels, adapting the
models of Section 3.3 to encryption. In this study, DCT coefficients are divided into
three sublevels; in intra-coded blocks, the DC coefficient is the first sublevel¹ and the
sequence of AC coefficients is divided into two, in scan order. Inter-coded blocks
are divided into three almost-equal sublevels. Consequently, all $c_i$ are replaced with
tuples $(c_{i1}, c_{i2}, c_{i3})$, where the $c_{ij}$ are scalars.
Experimental estimation of $c_i$ for a sequence is found to be impractical, as it requires
a statistically sufficient number of error simulations in the decoder. Instead, $c_i$
are estimated per GOV of the video stream. In order to estimate $c_i$, intrinsic weights
$\omega_{i,x,y} = (\omega_{i1,x,y}, \omega_{i2,x,y}, \omega_{i3,x,y})$ are assigned to every block at level $i$. The intrinsic weight
for a block is proportional to the mean squared error between the block and the block
with the coefficients affecting $\omega_{ij,x,y}$ set to zero. The weights are normalized in the sense
that the sum of $\omega_{ij,x,y}$ for a block is one if the block is intra, and equal to the ratio of the
energy of the estimation error block to the energy of the reconstruction block for a
nonintra block. With every $\omega_{i,x,y}$, a reference count $r_{i,x,y}$ is associated, which is set
to zero initially. The motion vectors are used to alter the reference counts to reflect
propagation.
A predictively coded macroblock refers to one or more macroblocks in the reference
VOP. Assuming the macroblock is uniform (which becomes more realistic as VOPs
get larger in spatial size), the referred macroblocks have an effect on the error proportional
both to the size of the area overlapping with the reference area and to the intrinsic
weight of the prediction for block (x, y).

¹ Intra DC coefficients are not encoded differently from AC coefficients in all experiments.
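Under the uniform-macroblock assumption, the four overlap areas are separable in the horizontal and vertical displacements. The following sketch (names are hypothetical, and half-sample precision is ignored) computes the normalized areas a, b, c, d of the referenced block areas:

```python
def overlap_weights(mvx: int, mvy: int, block: int = 16):
    """Fractional areas of the four reference blocks covered by a block
    displaced by motion vector (mvx, mvy), normalized so that
    a + b + c + d = 1 (cf. the referenced block areas of Figure 3.4)."""
    dx, dy = abs(mvx) % block, abs(mvy) % block
    a = (block - dx) * (block - dy)   # dominant (aligned) reference block
    b = dx * (block - dy)             # horizontal neighbour
    c = (block - dx) * dy             # vertical neighbour
    d = dx * dy                       # diagonal neighbour
    total = float(block * block)
    return a / total, b / total, c / total, d / total
```

A zero motion vector puts all of the weight on a single reference block; a displacement of half a block in each direction splits the weight evenly over the four blocks.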
After scaling $\omega_{i,x,y}$ with $r_{i,x,y}$, $c_i$ is estimated as the average of all $\omega_{i,x,y}$. Two $c_i$
values are estimated per level, one from intra blocks and one from inter blocks. The inter and
intra $c_i$ values are sorted altogether, and the available encrypted bits are distributed
in descending $c_i$ order.
Table 3.1 contains the algorithm updating the weights of the layers depending on
the reference area in a more complete format.
Table 3.1: Algorithm SET-WEIGHTS

SET-WEIGHTS()
 1  for each block b(x, y, t) in the estimation set
 2      do for each level i containing data in b(x, y, t)
 3          do b_j ← b(x, y, t) with the coefficients of level i zeroed, j = 1, 2, 3
 4             ω_ij,x,y ← MSE(b_j, b(x, y, t)) / Σ_k MSE(b_k, b(x, y, t))
 5             r_ij ← 0
 6             if the block is nonintra, i.e. predicted
 7                 then
 8                     b̃ ← prediction, for which b(x, y, t) is the prediction error
 9                     α(x, y, t) ← MSE(b(x, y, t), 0) / (MSE(b(x, y, t), 0) + MSE(b̃, 0))
10                     ω_ij,x,y ← α(x, y, t) · ω_ij,x,y
11
12  for each level i in descending order
13      do for each block b(x, y, t) in level i
14          do find a, b, c, d and the overlapping blocks b_a, b_b, b_c, b_d as in Figure 3.4
15             normalize a, b, c, d so that a + b + c + d = 1
16             r_k ← r_k + k · (1 − α(x, y, t)), k = a, b, c, d
17
18  for each level i in descending order
19      do for each block b(x, y, t) in level i
20          do ω_ij ← ω_ij · r_ij
21             find c_i (intra) and c_i (inter)
3.5 Encryption Strategy
Figure 3.4: Referenced block areas. A block at time T = t + 1, displaced by the motion vector MV(x, y), overlaps the four reference block areas a, b, c and d at time T = t.

As Wen et al. pointed out in their studies [18], encryption of indexes is advantageous
over direct encryption of the bitstream, because it is more error resilient, preserves
syntax compliance, and is compatible with players that do not have the decryption
facility. However, a direct-encryption tool is implemented in this work, because
of the following reasons:
Compatibility with other players is of little value if the content is provided as a
commercial service (e.g. pay-TV broadcast). In this case, the service agreement
may require the use of a supported player in order not to void the service warranty.

Direct encryption requires fewer bits to protect a syntactical entity of
the video, and a level of error resilience can be achieved by the use of the side
information, if the side information is designed appropriately.

Implementation of direct encryption over a previously implemented codec was
found to be less complicated. In this study, the base codec was the
MPEG-4 reference implementation, which was not well documented.
The problems with direct encryption are resolved in the following ways:

Start code emulations in the encrypted stream are eliminated by the introduction of
stuffing bits (a value of 1) after 20 zeroes.²

Encryption side-information is synchronized with the bitstream, as described
in Section 3.6.

² MPEG-4 start codes begin with a byte-aligned 00000000 00000000 00000001.
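The stuffing rule can be sketched as follows. This is an illustrative bit-level model (function names are mine), whereas the actual codec operates on the byte-aligned bitstream:

```python
def stuff_bits(bits):
    """Insert a '1' after every run of 20 zero bits so that the encrypted
    payload can never emulate an MPEG-4 start code prefix."""
    out, zero_run = [], 0
    for b in bits:
        out.append(b)
        zero_run = zero_run + 1 if b == 0 else 0
        if zero_run == 20:
            out.append(1)  # stuffing bit breaks the zero run
            zero_run = 0
    return out

def unstuff_bits(bits):
    """Inverse operation, applied by the decryptor before decoding."""
    out, zero_run, i = [], 0, 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        zero_run = zero_run + 1 if b == 0 else 0
        if zero_run == 20:
            i += 1  # skip the stuffing '1'
            zero_run = 0
        i += 1
    return out
```

Since a stuffing bit is forced after every 20 consecutive zeros, the stuffed stream can never contain the 23 zero bits that open a start code.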
Table 3.2: IPMP_SelectiveDecryptionMessage structure, specific to the proposed system

class IPMP_SelectiveDecryptionMessage extends IPMP_ToolMessageBase
    : bit(8) tag = IPMP_SelectiveDecryptionMessage_tag
{
    bit(8) mediaTypeExtension;
    bit(8) mediaTypeIndication;
    bit(8) profileLevelIndication;
    const bit(8) compliance = 0x01;
    const bit(8) numBufs = 1;
    struct bufInfoStruct {
        bit(128) cipher_Id;
        bit(8)   syncBoundary;
        bit(1)   isBlock;
        const bit(7) reserved = 0b0000.000;
        bit(8)   mode;
        bit(16)  blockSize;
        bit(16)  keySize;
    }
    const bit(1) isContentSpecific = 0;
    const bit(7) reserved = 0b0000.000;
    bit(16) nSegments;
    bit(16) RLE_Data[nSegments];
}
3.6 Encryption Side-Information
Although a more compact side-information format is possible, the suggested side-information
storage format is an IPMP_SelectiveDecryptionMessage data structure,
as described in [13]. The structure specific to this work is given in Table 3.2.
The sequence of encrypted and unencrypted segments is encoded into the array
RLE_Data as the lengths of the segments, starting with the length of an unencrypted
segment; the array thus contains nSegments RLE-encoded segment lengths.
The video cryptosystem requires a single buffer to decrypt the data, and the cipher
is synchronized at the start of the syntactic entity (e.g. VOP) specified in the
syncBoundary field, for error resilience.
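The run-length encoding of the segment map can be sketched as follows (the helper names are mine; the field name RLE_Data follows Table 3.2):

```python
def segments_to_rle(total_len, encrypted_segments):
    """Encode (start, length) encrypted segments as alternating run lengths,
    starting with an unencrypted run (possibly of length 0), as in the
    RLE_Data array of the side information."""
    rle, pos = [], 0
    for start, length in sorted(encrypted_segments):
        rle.append(start - pos)      # unencrypted run
        rle.append(length)           # encrypted run
        pos = start + length
    if pos < total_len:
        rle.append(total_len - pos)  # trailing unencrypted run
    return rle

def rle_to_segments(rle):
    """Recover the (start, length) encrypted segments from the run lengths."""
    segments, pos, encrypted = [], 0, False
    for run in rle:
        if encrypted and run > 0:
            segments.append((pos, run))
        pos += run
        encrypted = not encrypted
    return segments
```

For example, two encrypted segments of 5 and 20 bits at offsets 10 and 50 of a 100-bit stream are stored as the five run lengths 10, 5, 35, 20, 30.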
The side information can be incorporated into the MPEG-4 file as an ipsm track, multiplexed
with the other streams, after the file creation and multiplexing utilities are modified; the
mp4creator utility that comes with the mpeg4ip package [34] is a suitable platform
to apply this idea.
3.7 Summary
A model for the cryptanalytic complexity of video streams is presented. The equations
to find the encrypted bit distribution maximizing the cryptanalytic complexity are
derived, and an algorithm depending on a set of parameters is defined using the outcomes
of the equations. The parameters $c_i$ can be estimated experimentally from
video sequences of a similar nature; however, this is considered to be costly. A method to
estimate these parameters is proposed in Section 3.4.
CHAPTER 4
EXPERIMENTS AND RESULTS
4.1 Implementation and Test Platform
The proposed method is implemented over the MoMuSys video codec, which was developed
as the MPEG-4 Verification Model. The implementation also uses previously implemented
AES functionality in a separate encryption module. Red Hat Linux 9
with the GNU C compiler and GNU make is used as the development platform.
4.2 Implementation of SET-WEIGHTS and Budget Distribution
SET-WEIGHTS is implemented for the VOP hierarchies of Figure 3.3(a) and Figure 3.3(c);
however, only the results regarding hierarchy (a) are discussed in this chapter. The algorithm
is able to find the weights for a set of VOPs only after the VOPs are encoded. This
does not matter for configurations where pre-encoded content is served; on the
other hand, encryption of data with a delay of a few VOPs can be a problem in live
broadcasts or videoconferencing.

The functions that make up SET-WEIGHTS are found to consume around 0.8% of
the CPU time, but this share of CPU time is expected to increase in more optimized
codecs.
4.2.1 Core Cipher

Although many stream ciphers are available, a new one is constructed at the expense
of efficiency. The main reason is that the author was unable to find a stream cipher
implementation that encrypts in units of bits. The AES implementation in ANSI C by Brian
Gladman, which is publicly available, is used for the stream cipher.
The stream cipher is implemented by XORing the input bits with a random sequence.
The random sequence is obtained by encrypting an increasing sequence with AES,
essentially a counter (CTR) mode construction. The sequence is initialized using the
encryption key for AES, and new blocks filled
with the increasing sequence are encrypted whenever needed. An application of the
Berlekamp-Massey algorithm over the stream shows that the sequence is not linear,
so one cannot break the sequence by finding the linear recurrence that generates it.
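The construction can be sketched as follows. SHA-256 of a keyed counter stands in here for the AES block encryptions of the actual implementation (Python's standard library has no AES), but the structure, encrypting an increasing counter sequence and XORing the output bits with the plaintext, is the same:

```python
import hashlib

def keystream_bits(key: bytes, n_bits: int):
    """Pseudorandom bit sequence obtained by 'encrypting' an increasing
    counter under a key (SHA-256 here stands in for AES)."""
    bits, counter = [], 0
    while len(bits) < n_bits:
        block = hashlib.sha256(key + counter.to_bytes(16, "big")).digest()
        for byte in block:
            for i in range(7, -1, -1):       # most significant bit first
                bits.append((byte >> i) & 1)
        counter += 1
    return bits[:n_bits]

def xor_bits(data_bits, key: bytes):
    """Bit-level stream cipher: XOR the data with the keystream.  Applying
    it twice with the same key recovers the plaintext bits."""
    ks = keystream_bits(key, len(data_bits))
    return [b ^ k for b, k in zip(data_bits, ks)]
```

Because encryption is a plain XOR, decryption is the same operation with the same key, and an arbitrary number of bits, not only whole bytes or blocks, can be encrypted.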
Once the encoder generates the sequence of segments for the visual stream, the
encryption program is run to encrypt the given segments of the bitstream with the
specified key.
4.2.2 Restrictions of the Implementation
The video encryption implementation does not support suitable encryption schemes
for all the natural video coding tools, nor does it support any particular profile. A
few of the tools are considered not to be suitable for encryption. A few others cannot
be rate controlled by the developed model, due to restrictions of the implemented
codec. The remaining tools can be supported by slight and straightforward modifications,
and they are considered as future work.

B-VOP encryption is not implemented, because B-VOPs are assumed to have lower
reference counts than P-VOPs.

RVLC coded video is not supported, since it is believed by the author that it is difficult
to output a compact ciphertext that can be divided into reversible codewords.
Interlaced video encryption is not implemented as the underlying codec does not
provide full support.
Still textures are not supported since they do not have temporal extent.
Sprite encryption is not implemented. A specification or draft of GMC was not
available to the author, either.
Grayscale shapes are not supported. Because grayscale shapes are encoded in the
same way as texture, reference maps can also be kept for grayscale shapes;
however, combined rate control for shape/texture encryption is left as an open
problem.
Binary shape maps are not supported. A binary shape has no effect on macroblock
addressing, hence a coarser shape is always available for an attacker.
4.3 Test Sequences
Video sequences that are commonly used in the literature are selected for the
experiments. The sequences are obtained in (or later converted to) uncompressed 4:2:0
YUV. Only QCIF (176×144) sequences are used for the tests, due to implementation
problems. The first 300 frames of each sequence are used.
Carphone Single object QCIF sequence with high motion foreground and background.
Foreman Talking head QCIF sequence with high motion foreground and a camera
pan at the end.
Miss America QCIF sequence containing almost still foreground and background
with motion.
The sequences are included in the compact disc as AVI files with uncompressed YV12
video tracks.
4.4 Encoding Parameters
The files are encoded with periodic I-VOP refreshes followed by sequences of B and
P-VOPs, so that every third VOP is a P-VOP. Every sequence is coded at 30 fps. In
order to simplify the implementation, every nonintra macroblock is coded with one
motion vector and regular motion compensation. Motion vectors are computed to
half sample precision. Qp is initially set to 4 for all texture coding schemes. Video
is packetized so that every packet includes macroblock-aligned data just exceeding
20 bits, hence avoiding spatially predictive coding. The rate-controlled sequences
are coded using the Q2 rate control algorithm with the default parameters of the MoMuSys
implementation. The first 300 frames of Foreman and news are used in the experimentation.
The first 150 frames of Miss America are used, as its
length is less than that of the others.
4.5 Experimental Results
The test sequences are encoded and encrypted by the implemented encoder, and the
effects of encryption are measured in the relative size of the encryption side information
and the distribution of encrypted bits over the various syntactical structures of the
video. The time consumed by the index extraction and encryption/decryption functions
in terms of CPU time is not measured, since the implementation is not optimal; the
reader should note that three additional iDCTs per block are performed to find
the bit distribution.
Tests are conducted to investigate the nature of the bit selection strategy when
1. A constant Qp is used with fixed GOV size.
2. GOV sizes are changed, holding Qp constant.
3. Bit rate of the encoded video is restricted by a rate control algorithm, while Qp
is changed by the algorithm and GOV sizes change due to skipped VOPs.
4.5.1 Bit Distribution Plots
The plots are gathered here, rather than aligned with the relevant text, as the
graphs are large in size. A grid is put onto each plot to identify the GOVs.

Each plot has three different entities. In the intra plots, DC, AC1 and AC2
are the bitrates of the DC coefficient, the 30 coefficients succeeding the
DC coefficient (in zigzag order) and the remaining coefficients of intra coded blocks, respectively. In
the inter plots, AC1, AC2 and AC3 are the bitrates of the first 20 (in zigzag order),
the succeeding 20 and the remaining coefficients of inter coded blocks.
Plots for which a GOV size is specified are obtained without rate control; plots
for which a bitrate is specified are obtained with the 12-VOP GOV setting, although a number
of frames are skipped to meet the bitrate constraint.
Encoding parameters for the plots are specified in Section 4.4.
[Plots: encrypted inter data (bits), curves AC1/AC2/AC3, and encrypted intra data (bits), curves DC/AC1/AC2, versus frame number]
Figure 4.1: Inter (above) and intra (below) bit distributions in Carphone with 1700 bits/frame encryption and 12-VOP GOVs
[Plots: encrypted inter data (bits), curves AC1/AC2/AC3, and encrypted intra data (bits), curves DC/AC1/AC2, versus frame number]
Figure 4.2: Inter (above) and intra (below) bit distributions in Carphone with 2500 bits/frame encryption and 12-VOP GOVs
[Plots: encrypted inter data (bits), curves AC1/AC2/AC3, and encrypted intra data (bits), curves DC/AC1/AC2, versus frame number]
Figure 4.3: Inter (above) and intra (below) bit distributions in Carphone with 3400 bits/frame encryption and 12-VOP GOVs
[Plots: encrypted inter data (bits), curves AC1/AC2/AC3, and encrypted intra data (bits), versus frame number]