where information lives
Page. 1
Seamless Audio Splicing for ISO/IEC 13818 Transport Streams
A New Framework for Audio Elementary Stream Tailoring and Modeling
Seyfullah Halit Oguz, Ph.D. and Sorin Faibish
EMC Corporation
Media Solutions Group Engineering
EMC Media Solutions Group Profile
The EMC Media Solutions Group is part of EMC Engineering.
—EMC Engineering will spend well over $1 billion on research and development in FY 2001.
—Over 300 engineers concentrate on Celerra Server development and Rich Media products.
—The Media Solutions Group Lab has in excess of $300 million of hardware for development of Rich Media solutions.
Mission Statement

The EMC Media Solutions Group is tasked with:
—Deploying EMC products into the Rich Media market.
—Developing products and solutions that meet the requirements of Rich Media customers.
—Developing partnerships with key companies to develop and deploy customer solutions.
—Providing professional services in the Rich Media environment.
Outline
- Splicing in brief
- Objective
- Problem description
- Basic algorithm
- A model for the audio elementary streams
- Enhanced algorithm
- Additional implementation details
- Conclusions
Splicing
- Splicing is the act of switching from one MPEG-2 program (embedded in a transport stream) to another MPEG-2 program (again embedded in a transport stream).
- Commercial insertion, camera or content switching, and content editing all require splicing to be performed on compressed bit-streams.
- The structure of the compressed data makes a seamless splicing algorithm far from trivial.
Objective
- A generic method to process the audio elementary streams during the splicing of ITU-T Rec. H.222.0 | ISO/IEC 13818-1 transport streams (TS) to achieve a seamless audio splice.

* "generic": no constraining assumptions are made about signal formats (e.g. the video frame rate (PAL, NTSC) or the audio sampling frequency) or about various encoding parameters (e.g. the audio bit rate or the layer of the audio encoding algorithm employed).
* "audio elementary streams": the current focus is on the audio elementary streams; ultimately, audio and video splicing should be considered jointly.
* "transport streams": audio elementary stream splicing is achieved directly on transport streams with the lowest possible complexity.
Definitions and Notation

Encoded data domain → decoder → decoded data domain:
- Audio Access Unit, AAU (an audio frame, 576 bytes*) → Audio Decoder → Audio Presentation Unit, APU (a block of contiguous audio samples, 24 ms*)
- Video Access Unit, VAU (variable size) → Video Decoder → Video Presentation Unit, VPU (a video frame, 1/29.97 s**)

* ISO/IEC 11172-3 Layer-II audio coding with sampling frequency 48 kHz and audio bitrate 192 kbit/s, assumed only for illustrative purposes.
** NTSC frame rate, assumed only for illustrative purposes.
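The illustrative figures above (576-byte AAUs, 24 ms APUs) follow directly from the assumed parameters. A minimal sketch, assuming only Layer-II framing (1152 PCM samples per audio frame):

```python
# Derive the slide's illustrative AAU/APU figures.
# Assumption: MPEG-1 Layer II framing (1152 PCM samples per audio frame).
SAMPLES_PER_FRAME = 1152
sampling_hz = 48_000      # illustrative sampling frequency
bitrate_bps = 192_000     # illustrative audio bit rate

apu_duration_s = SAMPLES_PER_FRAME / sampling_hz                        # 0.024 s
aau_size_bytes = bitrate_bps * SAMPLES_PER_FRAME // (8 * sampling_hz)   # 576 bytes

print(apu_duration_s, aau_size_bytes)
```

With PAL video or 44.1 kHz audio the same two lines give the corresponding durations and sizes, which is exactly why the method makes no assumption about these parameters.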
VPU and APU alignment

[Figure: two timelines of VPU and APU presentation intervals, one "at the beginning" of a stream (VPU 0, 1, 2 against APU 0, 1, 2) and one "later" (VPU (k-2) through (k+2) against APU (j-2) through (j+3)). Each VPU lasts (1/29.97) s ≈ 0.03337 s; each APU lasts 0.024 s.]

The start of a VPU will be aligned with the start of an APU possibly at the beginning of a stream, and then only at multiples of 5-minute increments in time. This implies that, for all practical purposes, they will never be aligned again later.
The setting for splicing

[Figure: the ending stream (time base #1) and the beginning stream (time base #2), each shown as a row of VPUs and a row of APUs around the splice point.]

The splicing point is naturally defined with respect to VPUs.
Audio processing at splicing

APUs are available only through the decoding of their corresponding AAUs. Fractional (i.e. truncated) AAUs in the encoded data domain are useless.

[Figure: the ending stream's VPUs and APUs around the splice point, followed by the beginning stream's APUs after the splice.]

The time base of the beginning stream is shifted to achieve video presentation continuity.
So far…
- Decoding, time-domain editing and re-encoding:
  high computational complexity.
- Gaps in the audio stream:
  audio mutes, uncontrolled audio-visual skew.
- Overlaps in the scopes of APUs:
  uncontrolled audio-visual skew, inconsistent ES structure.
Observations
- Audio truncation should always be done at AAU boundaries, i.e. no fractional AAUs!
- Audio truncation for the ending stream should be done with respect to the end of its last VPU's presentation interval.
- Audio truncation for the beginning stream should be done relative to the beginning of its first VPU's presentation interval.

"BEST ALIGNED APUs"
Best aligned APUs

[Figure: the 24 ms interval formed by the 12 msec. windows on either side of a VPU boundary (VPUs (k-1) through (k+2)), identifying one "best aligned" APU in each stream; each candidate APU is classified as "short" or "long" relative to the boundary.]

The APU of the ending stream whose presentation interval ends within the identified 24 ms interval is called the "best aligned final APU".

The APU of the beginning stream whose presentation interval starts within the identified 24 ms interval is called the "best aligned initial APU".

There is a comprehensive list of 8 possible cases that can be identified regarding the alignment of the ending and beginning audio streams, based on the above definitions.
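The two definitions can be sketched as a search over APU presentation times. This is a hypothetical illustration (the function names and example times are not from the slides), assuming the identified window extends 12 ms to either side of the relevant VPU boundary:

```python
# Hypothetical sketch: locate the "best aligned" APUs around a splice point.
# Times are in seconds; APU duration is 24 ms as in the illustrative setting.
# The 24 ms window straddles the VPU boundary by 12 ms on each side, which is
# what bounds the resulting A/V skew by half an APU duration.
APU = 0.024

def best_aligned_final_apu(apu_end_times, vpu_end_time):
    """Index of the ending stream's APU whose presentation interval
    ends within 12 ms of the last VPU's presentation end."""
    for i, t in enumerate(apu_end_times):
        if vpu_end_time - APU / 2 < t <= vpu_end_time + APU / 2:
            return i
    return None

def best_aligned_initial_apu(apu_start_times, vpu_start_time):
    """Index of the beginning stream's APU whose presentation interval
    starts within 12 ms of the first VPU's presentation start."""
    for i, t in enumerate(apu_start_times):
        if vpu_start_time - APU / 2 < t <= vpu_start_time + APU / 2:
            return i
    return None

# Example: APU boundaries every 24 ms; the relevant VPU boundary is at 1.000 s.
ends = [0.008 + APU * k for k in range(50)]    # APU end times, ending stream
starts = [0.010 + APU * k for k in range(50)]  # APU start times, beginning stream
idx = best_aligned_final_apu(ends, 1.000)      # ends[idx] = 0.992 s
jdx = best_aligned_initial_apu(starts, 1.000)  # starts[jdx] = 0.994 s
```

Since consecutive APU boundaries are exactly 24 ms apart, exactly one boundary falls inside each half-open 24 ms window, so both searches always succeed.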
How to make use of best aligned APUs

ACTION:
- Truncate the ending audio stream at the end of the best aligned final APU.
- Start the beginning audio stream at the beginning of the best aligned initial APU.
- Re-stamp the audio PTSs of the beginning stream to generate an immediate continuation of the ending audio stream.

REQUIRED PROCESSING AT ELEMENTARY STREAM LEVEL:
- In the audio PES packet carrying the best aligned final APU:
  - truncate after the AAU associated with the best aligned final APU,
  - modify the PES packet size information accordingly.
- In the audio PES packet carrying the best aligned initial APU:
  - delete the AAU data preceding the AAU associated with the best aligned initial APU,
  - modify the PES packet size information accordingly.
- Modify the PTS values associated with the first and all subsequent audio PES packets accordingly.
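The "modify the PES packet size information" steps amount to rewriting the 16-bit PES_packet_length field, which counts the bytes that follow it in the PES packet. A minimal sketch; the helper name and the example values are hypothetical, and real code would also re-segment the affected transport packets:

```python
# Sketch: after dropping AAU bytes from a PES packet, shrink its
# PES_packet_length field (bytes 4-5 of the PES header, big-endian,
# counting the bytes that follow the field) by the same amount.
import struct

def shrink_pes_packet_length(pes_header: bytearray, removed_bytes: int) -> None:
    assert pes_header[0:3] == b"\x00\x00\x01", "not a PES packet start code"
    (old_len,) = struct.unpack_from(">H", pes_header, 4)
    assert old_len != 0 and removed_bytes <= old_len
    struct.pack_into(">H", pes_header, 4, old_len - removed_bytes)

# Example: an audio PES packet (stream_id 0xC0) loses one 576-byte AAU.
hdr = bytearray(b"\x00\x00\x01\xc0" + struct.pack(">H", 2886))
shrink_pes_packet_length(hdr, 576)
```

The same helper serves both the truncation of the final PES packet and the front-end deletion in the first PES packet of the beginning stream.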
Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.

[Figure: the 12 msec. windows on either side of the VPU boundary (VPUs (k-1) through (k+2)) identifying the best aligned final and initial APUs, together with the resulting spliced APU sequence.]

SOLUTION: A/V skew of at most 12 msec.
Minimal Achievable Skew Algorithm
- Immediately applicable to 6 out of the 8 possible best aligned APU relative position classes.
- In the remaining 2 classes of relative position, a slight modification to the proposed algorithm is needed to achieve an A/V skew bounded by half the APU duration.
Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.

[Figure: the 12 msec. windows on either side of the VPU boundary (VPUs (k-1) through (k+2)) identifying the best aligned APUs, and two alternative spliced APU sequences.]

SOLUTION (a): A/V skew of at most 12 msec.
SOLUTION (b): A/V skew of at most 12 msec.
Facts - I
- An audio elementary stream construction with no holes and no audio PTS discontinuity is possible.
- As a consequence, an A/V skew of magnitude at most half the APU duration will be induced in the beginning stream. This is below the sensitivity limits of human perception.
- The proposed algorithm can be applied repeatedly, an arbitrary number of times, with neither a failure to meet its structural assumptions nor a degradation in its promised A/V skew performance.
Facts - II
- The A/V skews induced as a result of the proposed processing do not accumulate; i.e., irrespective of the number of consecutive splices, the worst A/V skew at any point in time will be half of the APU duration.
- At each splice point, at the termination of the APUs of the ending stream, the total audio and video presentation durations up to that point almost match each other, i.e.

  |video_duration - audio_duration| <= (1/2) APU_duration

  i.e. the correct amount of audio data is always provided.
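The non-accumulation claim can be sanity-checked with a toy model: if every splice cuts the audio at the AAU boundary nearest the video cut, the running audio duration is always the video duration rounded to a whole number of APUs, so the deviation never exceeds half an APU. A sketch under that simplifying assumption (clip lengths are random, not from the slides):

```python
# Toy model of repeated splicing: the audio is always cut at the AAU
# boundary nearest the video cut, so the cumulative audio duration is the
# cumulative video duration rounded to a whole number of APUs.  The bound
# |video - audio| <= APU/2 then holds after any number of splices.
import random

APU = 0.024  # seconds

random.seed(7)
video_total = 0.0
for _ in range(1000):                             # 1000 consecutive splices
    video_total += random.uniform(0.5, 30.0)      # arbitrary clip lengths
    audio_total = round(video_total / APU) * APU  # whole-APU audio duration
    assert abs(video_total - audio_total) <= APU / 2 + 1e-9
```

The key point mirrored here is that each splice re-rounds the *cumulative* duration, so individual skews replace rather than add to one another.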
Facts - III
- The resulting audio stream is error-free and fully ISO/IEC 11172-3 and ISO/IEC 13818-3 compliant.
Implementation at TS level
- TS-level implementation necessitates further considerations:
  1) Editing of transport packets
  2) Transport packet buffering and re-multiplexing
  3) Audio buffer management
  4) Metadata generation
Audio buffer management
- We have to control the dynamic behavior of the audio buffer during the transient induced by the splicing, as well as subsequent to splicing.
- We need a simple model to characterize the audio elementary stream within a TS, such that
  1) the model parameters are easy to estimate, and
  2) the model accurately provides the following desired information:
     a) the mean audio buffer fullness,
     b) the extent of buffer fullness variation around its mean value.
Audio elementary stream model
- The audio elementary stream will be modeled on the basis of AAUs, which determine its granularity with respect to audio decoder actions.
- The proposed framework for the model is an arrival process to a FIFO queue with a deterministic service time (0.024 s per AAU in the illustrative setting).

[Figure: AAUs arriving at times a_j, a_(j+1), a_(j+2), a_(j+3) and leaving the queue at presentation times p_j, p_(j+1), p_(j+2), p_(j+3).]

a_j: arrival time of the j-th AAU
p_j: presentation time of the j-th AAU
Audio elementary stream model
- The presentation times p_j can be easily computed for each AAU.
- The arrival times a_j, however, are not uniquely defined, since each AAU arrives in a distributed fashion owing to transport packet encapsulation.

Solution: use "weighted-average arrival times".
Audio elementary stream model

[Figure: TS packet type visualization plot, TS packet indices 850-1050, packet types color-coded.]

The weighted-average arrival time is defined for each AAU as follows:

a_j = Σ over all TS packets whose payloads carry some of the j-th AAU's data of [ (TS packet arrival time) × (fraction of the j-th AAU in that packet's payload) ]

Note: the TS packet arrival time is defined with respect to the packet's 94th byte, through a PCR extrapolation process.
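The definition above can be sketched directly; the packet timings and payload fractions here are made-up example values, not measurements from the slides:

```python
# Sketch of the weighted-average arrival time a_j.  Each entry is
# (TS packet arrival time in seconds, fraction of the j-th AAU carried in
# that packet's payload); the fractions over one AAU sum to 1.  In a real
# implementation the arrival times would come from PCR extrapolation to
# each packet's 94th byte, as noted on the slide.
def weighted_arrival_time(packets):
    assert abs(sum(f for _, f in packets) - 1.0) < 1e-9
    return sum(t * f for t, f in packets)

# One AAU spread over four TS packets:
a_j = weighted_arrival_time([(0.100, 0.10), (0.102, 0.32),
                             (0.104, 0.32), (0.106, 0.26)])
```

The result is a single representative arrival instant per AAU, which is what makes the FIFO-queue model on the previous slide well defined.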
Audio elementary stream model
- Based on these definitions we let

  w_j = p_j - a_j

  where w_j is the waiting time of the j-th AAU in the audio buffer.
- The mean and variance of w_j, viewed as a random variable w, provide very important information about the audio elementary stream within the TS multiplex, and hence about the audio buffer.
Audio elementary stream model

Specifically,

E[w] · (audio bitrate) = mean audio buffer fullness
σ_w · (audio bitrate) = a measure of the variation in the audio buffer fullness around its mean value

(E[·]: expectation operator; σ: standard deviation.)
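A sketch of these two estimates, using the standard-library `statistics` module; all numeric values are illustrative, not taken from the slides:

```python
# Estimate mean audio buffer fullness and its spread from AAU waiting
# times w_j = p_j - a_j (seconds) and a constant audio bitrate (bits/s).
import statistics

def buffer_stats(arrivals, presentations, bitrate_bps):
    w = [p - a for a, p in zip(arrivals, presentations)]
    mean_bits = statistics.fmean(w) * bitrate_bps     # E[w] * bitrate
    spread_bits = statistics.pstdev(w) * bitrate_bps  # sigma_w * bitrate
    return mean_bits, spread_bits

# AAUs presented every 24 ms, arriving with slightly jittered lead times:
APU, bitrate = 0.024, 192_000
pres = [1.0 + APU * j for j in range(5)]
arr = [p - w for p, w in zip(pres, [0.080, 0.075, 0.085, 0.090, 0.070])]
mean_bits, spread_bits = buffer_stats(arr, pres, bitrate)
```

With a mean waiting time of 80 ms at 192 kbit/s this predicts a mean buffer fullness of 15360 bits, the same kind of figure as the "predicted mean audio buffer fullness" quoted in Example 1 below.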
Audio elementary stream model
- Example 1.
  Encoder B characteristics:
  - Long mean waiting time
  - Highly irregular, bursty arrival regime
[Figure, Example 1: TS packet type visualization plot (TS packet indices 6500-7500, packet types color-coded) and interleaved audio-video alignment in the bitstream w.r.t. PTS values (interpolated play-out times for individual TS packets, in seconds).]
[Figure, Example 1: audio buffer analysis; audio buffer level (% of max) versus time (seconds). Predicted mean audio buffer fullness: 15585.8 bits = 54.36 %; 2-STD interval around the mean: 4504.2 bits = 15.71 %.]
An important observation
- PTS re-stamping in the beginning audio stream ⇒ the waiting times of AAUs will be modified in the beginning stream.
- ⇒ The mean waiting time of AAUs in the beginning stream will change (decrease or increase) by at most half the APU duration.
- ⇒ A corresponding change in the mean audio buffer fullness level for the beginning stream will be induced.
Improvement to the proposed processing
- For audio elementary streams structured with a mean audio buffer fullness level bounded away from both underflow and overflow, the already proposed methodology with the minimal achievable A/V skew is the best choice.
- For audio elementary streams with a mean audio buffer fullness level close to either the underflow or the overflow state, the requirement of minimal resultant A/V skew should be relaxed; instead of the "best aligned APUs", those APUs introducing the minimal safe A/V skew should be employed.
Outline of the improvement - I
- Let the type of A/V skew in which the audio signal component is delayed with respect to its associated video signal be referred to as the forward (in time) skew.
- Forward skew is the result of a uniform increment in the beginning stream's AAU presentation time stamps and is therefore associated with an increase in the mean audio buffer fullness level.
- Hence, for beginning audio elementary streams with mean audio buffer fullness levels close to underflow, the minimal achievable forward skew is the minimal safe skew and should be the one employed.
Outline of the improvement - II
- Let the type of A/V skew in which the audio signal component moves earlier in time with respect to its associated video signal be referred to as the backward (in time) skew.
- Backward skew is the result of a uniform decrement in the beginning stream's AAU presentation time stamps and is therefore associated with a decrease in the mean audio buffer fullness level.
- Hence, for beginning audio elementary streams with mean audio buffer fullness levels close to overflow, the minimal achievable backward skew is the minimal safe skew and should be the one employed.
Minimal Safe Skew Algorithm Summary
- For all 8 possible relative position classes of best aligned APUs, three solutions are defined:
  1. the minimal achievable skew solution
  2. the minimal safe forward skew solution
  3. the minimal safe backward skew solution
- The minimal achievable skew is upper-bounded by half the APU duration.
- The minimal safe skew (forward or backward) is upper-bounded by one APU duration. (Still below the sensitivity limits of human perception.)
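The selection among the three solutions can be sketched as a simple rule on the estimated mean buffer fullness. The numeric thresholds are illustrative assumptions (the slides only say "close to underflow / overflow"); levels are fractions of the audio buffer size:

```python
# Hedged sketch of the solution-selection rule.  The low/high thresholds
# are illustrative assumptions, not values given in the presentation.
def choose_solution(mean_fullness, low=0.25, high=0.75):
    """Pick one of the three defined solutions for a splice."""
    if mean_fullness < low:    # near underflow: raising fullness is safe
        return "minimal safe forward skew"   # bounded by one APU duration
    if mean_fullness > high:   # near overflow: lowering fullness is safe
        return "minimal safe backward skew"  # bounded by one APU duration
    return "minimal achievable skew"         # bounded by half an APU

choice = choose_solution(0.54)  # e.g. Example 1's ~54 % mean fullness
```

The forward/backward direction follows the buffer reasoning on the two preceding slides: forward skew raises the mean fullness, backward skew lowers it.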
Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.

[Figure: the 12 msec. windows on either side of the VPU boundary (VPUs (k-1) through (k+2)) identifying the best aligned final and initial APUs among APUs (m-1), m and (j+1), (j+2).]
[Figure: three alternative spliced APU sequences, shown against VPUs (k-1) through (k+2).]

SOLUTION I: A/V skew of at most 12 msec. Minimal achievable skew implementation, for clips with originally moderate audio buffer levels.
Note: An alternate implementation that omits APU (j+2) and includes APU (m-1) is also possible and achieves exactly the same resultant A/V skew; this latter solution is preferable in terms of ease of implementation.

SOLUTION II: A/V skew of at most 12 msec. Minimal achievable safe (forward) skew implementation, for clips with originally low audio buffer levels.
Note: An alternate implementation that omits APU (j+2) and includes APU (m-1) is also possible and achieves exactly the same resultant A/V skew; this latter solution is preferable in terms of ease of implementation.

SOLUTION III: A/V skew of at most 24 msec. Minimal achievable safe (backward) skew implementation, for clips with originally high audio buffer levels.
Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.

[Figure: the 12 msec. windows on either side of the VPU boundary (VPUs (k-1) through (k+2)) identifying the best aligned final and initial APUs among APUs (j+1), (j+2) and (m-1), m.]
[Figure: three alternative spliced APU sequences, shown against VPUs (k-1) through (k+2).]

SOLUTION I: A/V skew of at most 12 msec. Minimal achievable skew implementation, for clips with originally moderate audio buffer levels.

SOLUTION II: A/V skew of at most 12 msec. Minimal achievable safe (forward) skew implementation, for clips with originally low audio buffer levels.

SOLUTION III: A/V skew of at most 24 msec. Minimal achievable safe (backward) skew implementation, for clips with originally high audio buffer levels.
Note: An alternate implementation that omits APU (j+1) and includes APU m is also possible and achieves exactly the same resultant A/V skew; this latter solution is preferable in terms of ease of implementation.
Implementation at TS level
- TS-level implementation necessitates further considerations:
  1) Editing of transport packets
  2) Transport packet buffering and re-multiplexing
  3) Audio buffer management
  4) Metadata generation
Editing of transport packets
- The truncation of the final PES packet of the ending audio stream will typically necessitate the insertion of some adaptation-field padding into its last transport packet.
- The deletion of some AAU data from the front end of the beginning audio stream's first PES packet will typically necessitate the editing of at most two audio transport packets.
- The possible use of a causal bit reservoir in Layer III encoding typically dictates a structural constraint on the beginning audio stream.
Transport packet buffering and re-multiplexing

In the transport stream,
- the audio bit rate is constant, and
- the VAUs are of varying sizes;

therefore the relative positions of VAUs and AAUs associated with VPUs and APUs that are almost aligned in time cannot be kept constant. Almost always, the AAUs are significantly delayed with respect to the VAUs whose decoded representations are almost synchronous with them.
Legend
- Black: TS video packets initiating I-type pictures.
- Blue: TS video packets initiating P- or B-type pictures.
- Cyan: TS video packets.
- Red: TS audio packets initiating PES packets.
- Magenta: TS audio packets.
- Yellow: null TS packets.
- Green: TS packets carrying various system information.
[Figure, Example 1, Encoder B: TS packet type visualization plot (TS packet indices 9800-9825, packet types color-coded) and interleaved audio-video alignment in the bitstream w.r.t. PTS values (interpolated play-out times for individual TS packets), with TS packet indices 9812 and 11005 marked.]
Transport packet buffering and re-multiplexing
- This TS multiplex structure necessitates
  1) locating and temporarily storing (buffering) the delayed audio packets when the ending stream is truncated based on its last VAU,
  2) TS re-multiplexing in the form of
     a) deletion of the now-obsolete audio packets in the beginning stream,
     b) insertion of the buffered audio packets from step 1 into the beginning stream.
Metadata Generation
- During the ingest of MPEG-2 transport streams to storage systems,
  1. estimate the parameters of the proposed audio elementary stream model, and
  2. record this descriptive information within the metadata associated with the asset.
- In splicing scenarios with live streams, known characteristics and/or settings of the encoder employed can be used.
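The ingest-time record could be as small as the model parameters themselves. A sketch; the type and field names are hypothetical, not from any EMC product:

```python
# Hypothetical ingest-time metadata record for the audio ES model:
# just the constant bitrate and the waiting-time statistics estimated
# during ingest, from which buffer-fullness figures can be re-derived.
from dataclasses import dataclass

@dataclass
class AudioStreamModelMetadata:
    audio_bitrate_bps: int       # constant audio ES bitrate
    mean_waiting_time_s: float   # E[w] from the FIFO model
    std_waiting_time_s: float    # sigma_w from the FIFO model

    @property
    def mean_buffer_fullness_bits(self) -> float:
        return self.mean_waiting_time_s * self.audio_bitrate_bps

meta = AudioStreamModelMetadata(192_000, 0.080, 0.007)
```

Storing the waiting-time statistics rather than derived buffer levels keeps the record valid even if the splicer later re-derives fullness under different assumptions.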
Conclusions
- Audio elementary stream tailoring based on the minimal achievable or minimal safe A/V skew concepts.
- Audio splicing without any artifacts is made possible.
- A simple and efficient model characterizes the audio elementary stream structure embedded in the transport stream. (Prediction, and possibly control, of audio buffer behavior are within reach.)
- Audio splicing cannot be considered independently of video splicing.