Jigsaw: Robust Live 4K Video Streamingjianhe/jigsaw-mobicom19.pdf · Uncompressed 4K video streaming requires around 3Gbps of data rate. WiGig is the commodity wireless card that

Jigsaw Robust Live 4K Video Streaming

Ghufran Baig1lowast Jian He1lowast Mubashir Adnan Qureshi1 Lili Qiu1 Guohai Chen2Peng Chen2 Yinliang Hu2

1University of Texas at Austin 2Huawei

ABSTRACTThe popularity of 4K videos has grown significantly in thepast few years Yet coding and streaming live 4K videos incursprohibitive cost to the network and end system Motivatedby this observation we explore the feasibility of supportinglive 4K video streaming over wireless networks using com-modity devices Given the high data rate requirement of 4Kvideos 60 GHz is appealing but its large and unpredictablethroughput fluctuation makes it hard to provide desirableuser experience In particular to support live 4K video stream-ing we should (i) adapt to highly variable and unpredictablewireless throughput (ii) support efficient 4K video codingon commodity devices To this end we propose a novel sys-tem Jigsaw It consists of (i) easy-to-compute layered videocoding to seamlessly adapt to unpredictable wireless linkfluctuations (ii) efficient GPU implementation of video cod-ing on commodity devices and (iii) effectively leveragingboth WiFi and WiGig through delayed video adaptation andsmart scheduling Using real experiments and emulation wedemonstrate the feasibility and effectiveness of our systemOur results show that it improves PSNR by 6-15dB and im-proves SSIM by 0011-0217 over state-of-the-art approachesMoreover even when throughput fluctuates widely between02Gbps-2Gbps it can achieve an average PSNR of 33dB

CCS CONCEPTSbull Human-centered computing rarr Ubiquitous and mo-bile computing systems and tools bull Information sys-temsrarr Multimedia streaming

lowastCo-primary authors

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page Copyrights for componentsof this work owned by others than ACMmust be honored Abstracting withcredit is permitted To copy otherwise or republish to post on servers or toredistribute to lists requires prior specific permission andor a fee Requestpermissions from permissionsacmorgMobiCom rsquo19 October 21ndash25 2019 Los Cabos Mexicocopy 2019 Association for Computing MachineryACM ISBN 978-1-4503-6169-91910 $1500httpsdoiorg10114533000613300127

KEYWORDS4k Video Live streaming 60 GHz Layered coding

ACM Reference FormatGhufran Baig1lowast Jian He1lowast Mubashir Adnan Qureshi1 Lili Qiu1Guohai Chen2 Peng Chen2 YinliangHu2 2019 Jigsaw Robust Live4K Video Streaming In The 25th Annual International Conferenceon Mobile Computing and Networking (MobiCom rsquo19) October 21ndash25 2019 Los Cabos Mexico ACM New York NY USA 16 pageshttpsdoiorg10114533000613300127

1 INTRODUCTIONMotivation The popularity of 4K videos has grown rapidlythanks to a wide range of display options more affordabledisplays and widely available 4K content The first 4K TVscost $20K in 2012 but now we can buy a good 4K TV under$200 Sony PS 4 Xbox One Apple TV 4K Amazon Fire TVRoku and NVIDIA Shield all stream 4K videos DirecTV DishNetwork Netflix and YouTube all provide 4K programmingGaming Virtual Reality (VR) and Augmented Reality (AR) allcall for 4K videos since the resolution has profound impact onthe immersive user experience Moreover these interactiveapplications in particular VR and gaming demand not onlyhigh resolution but also low latency (eg within 60 ms [1114 21])

Many websites offer 4K video streaming These videosare compressed in advance and require 35-68 Mbps [8] ofthroughput which is affordable in many network conditionsIn comparison streaming live raw 4K videos (eg in gamingVR AR) is muchmore challenging to support Let us considera raw 4K video is streamed at 30 frames per second (FPS)and uses 12-bit YUV color space Without any compressionit requires sim3 Gbps Our commodity devices with the latest60 GHz technology can only achieve 24 Gbps in the bestcase With mobility the throughput can quickly diminishand sometimes the link can be completely broken Thereforeit is prohibitive to send raw 4K videos over current wirelessnetworks

Video coding has been widely used to significantly reducethe amount of data to transmit Traditionally a video is en-coded using a few bitrates at the server side The server thensends the video at an appropriate bitrate chosen based onthe network bandwidth The client receives the video and

decodes it to recover the raw video and then feeds it to theplayerHowever coding 4K video is expensive The delay re-

quirement for streaming a live 4K video at 30 FPS meansthat we should finish encoding transmitting and decod-ing each video frame (having 4096 x 2160 pixels) within 60ms [11 14 21] Even excluding the transmission delay sim-ply encoding and decoding a 4K video can take long timeExisting commodity devices are not powerful enough toperform encoding and decoding in real-time Liu et al [25]develop techniques to speed up 4K video coding using high-end GPUs However these GPUs cost over $1200 are bulkyconsume high energy and are not available for mobile de-vices How to support live 4K videos on commodity devicesstill remains open

Our approach To effectively support live 4K video stream-ing we identify the following requirements

bull Adapting tohighly variable andunpredictablewire-less links A natural approach is to adapt the videoencoding bitrate according to the available bandwidthHowever as many measurement studies have shownthe data rate of a 60 GHz link can fluctuate widely andis hard to predict Even small obstruction or movementcan significantly degrade the throughput This makes itchallenging to adapt video quality in advance

bull Fast encoding and decoding on commodity devicesIt is too expensive to stream the raw pixels in 4K videosEven the latest 60 GHz cannot meet its bandwidth re-quirement On the other hand the time to stream livevideos includes not only the transmission delay but alsoencoding and decoding delay While existing video cod-ing (eg H264 and HEVC) achieves high compressionrates they are too slow for real-time 4K video encod-ing and decoding Fouladi et al [15] show that YouTubeH264 encoder takes 365 minutes to encode a 15-minute4K video at 24 FPS using H264 which is too slow There-fore a fast video coding algorithm is needed to streamlive 4K videos

bull Exploiting different links WiGig alone is often in-sufficient to support 4K video streaming since its datarate may reduce by orders of magnitude even with smallmovement or obstruction Sur et al [38] have developedapproaches to detect when WiGig is broken based on thethroughput in a recent window and then switch to WiFireactively However it is challenging to select the rightwindow size a too small window results in unnecessaryswitch toWiFi and being constrained by the limitedWiFithroughput and a too large window results in long linkoutage In addition even when WiGig link is not brokenWiFi can also complement WiGig by increasing the total

throughput TheWiFi throughput can be significant com-pared with the WiGig throughput since the latter can bearbitrarily small depending on the distance orientationand movement In addition existing works mainly focuson bulk data transfer and do not consider delay sensitiveapplications like live video streaming

We develop a novel system to address the above challengesOur system has the following unique characteristicsbull Easy-to-compute layered video coding To supportlinks with unpredictable data rates we develop a simpleyet effective layered coding It can effectively adapt tovarying data rates on demand by first sending the baselayer and then opportunistically sending more layerswhenever the link allows Different from traditional lay-ered coding our coding is simple and easy to parallelizeon GPU

bull Fast video coding using GPUs and pipelining Tospeed up video encoding and decoding our system ef-fectively exploits GPUs by parallelizing the encodingand decoding of multiple layers within a video frameOur encoding and decoding algorithms can be split intomultiple independent tasks that can be computed usingcommodity GPUs in real time We further use pipelin-ing to overlap transmission with encoding and overlapreception with decoding which effectively reduces theend-to-end delay

bull Effectively using bothWiGig andWiFiWhileWiGigoffers up to 24 Gbps of throughput its data rate fluctu-ates a lot and is hard to predict In comparison WiFi ismore stable even though its data rate is more limited(eg 11ac offers up to 120 Mbps in our cards) Using bothWiFi and WiGig not only improves the data rate but alsominimizes the chances of video freezing under link out-age To accommodate the highly unpredictable variationin 60 GHz throughput we defer video rate adaptationand scheduling which determine how much data to sendand which interface to use as close to the transmissiontime as possible so that we do not require throughputpredictionWe implement our system on ACER laptops equipped

with WiFi and WiGig cards without any extra hardware Ourevaluation shows that it can stream 4K live videos at 30 FPSusing WiFi and WiGig under various mobility patterns

2 MOTIVATIONUncompressed 4K video streaming requires around 3Gbps ofdata rate WiGig is the commodity wireless card that comesclosest to such a high data rate In this section we examinethe feasibility and challenges of streaming live 4K videosover WiGig from both system and network perspectives

This study identifies major issues that should be addressedin supporting live 4K video streaming

21 4K Videos Need CompressionWiGig throughput in our wireless card varies from 0 to 24Gbps Even in the best case it is lower than the data rate of4K raw videos at 30 FPS which is 3 Gbps Therefore sending4K raw videos is too expensive and video coding is necessaryto reduce the amount of data to transmit

22 Rate Adaptation RequirementWiGig links are sensitive to mobility Even minor movementat the transmitter (TX) or receiver (RX) side can induce adrastic change in throughput In the extreme cases wherean obstacle is blocking the line-of-sight path throughputcan reduce to 0 Evidently such a throughput drop results insevere degradation in the video qualityLarge fluctuations Fig 1(a) shows theWiGig link through-put in our system when both the transmitter and receiver arestatic (Static) and when the TX rotates around a human bodyslowly while the RX remains static (Rotating) The WiGiglink can be blocked when the human stands between the RXand TX As you can see the amount of throughput fluctu-ation can be up to 1 Gbps in the Static case This happensdue to multipath beam searching and adaptation [31] SinceWiGig links are highly directional and have large attenuationfactor due to high frequency the impacts of multipath andbeam direction on throughput are severe In the Rotating casethe throughput reduces from 2 Gbps to 0 Gbps We observe0 Gbps when the user completely blocks the transmissionbetween the TX and RX Therefore drastic throughput fluctu-ation is common in 60 GHz and a desirable video streamingscheme should adapt to such fluctuation

0 30 60 90 120 150

ughput (G

Sampling Interval (10ms)

StaticRotating

(a) Throughput Variation

0 5 10 15 20 25 30 35 40 45 50

tion E

(Gbps)

Prediction Window (30ms)

StaticRotating

(b) Prediction Error

Figure 1 Example 60GHz Throughput Traces

Large prediction error In order to predict the through-put fluctuation existing streaming approaches use historicalthroughput measurements to predict the future through-put The predicted throughput is then used to determine theamount of data to be sent Due to large throughput fluctu-ation in WiGig links the throughput prediction error can

be large In Fig 1(b) we evaluate the accuracy of using theaverage throughput of the previous 40ms window to predictthe average throughput of the next 30 ms window We ob-serve large prediction error when there is a drastic changein the throughput Even if throughput remains relativelystable in the Static case the prediction error can reach upto 05 Gbps When link blockage happens the predictionerror can be even higher than 1 Gbps in Rotating case Suchlarge fluctuation will cause selecting too low video rate ortoo high rate The former degrades the video quality whilethe latter results in partially blank video frame because onlypart of the video frame can arrive in timeInsight The large and unpredictable fluctuation of 60 GHzlink suggests that we should adapt promptly to the unantici-pated throughput variation Layered coding is promising sincethe sender can opportunistically send less or more data depend-ing on the network condition instead of selecting a fixed videorate in advance

23 Limitations of Existing Video CodecsWe explore the feasibility of using traditional codecs likeH264 and HEVC for 4K video encoding and decoding Thesecodecs are attractive due to their high compression ratiosWe also study the feasibility of the state-of-the-art layeredencoding scheme since it is designed to adapt to varyingnetwork conditionsCost of conventional softwarehardware video codecsRe-cent work [15] shows that YouTubersquos H264 software encodertakes 365 minutes to encode a 4K video of 15 minutes andVP8 [10] software encoder takes 149 minutes to encode 15minute video It is clear these coding schemes are too slowto support live 4K video streamingTo test the performance of hardware encoding we use

NVIDIA NVENCNVDEC [4] hardware codec for H264 en-coding We use a desktop equipped with 36GHz quad-coreIntel Xeon Processor 8GB RAM 256GB SSD and NVIDIAGeforce GTX 1050 GPU to collect encoding and decodingtime We measure the 4K video encoding time where oneframe is fed to the encoder every 33 ms Table 1 shows theaverage encoding and decoding time of a single 4K videoframe played at 30 FPS We use ffmpeg [1] for encoding anddecoding using NVENC and NVDEC codecs The maximumtolerable encoding and decoding time is 60ms as mentionedearlier We observe the encoding and decoding time is wellbeyond this threshold even with GPU support This meansthat even using the latest hardware codecs and commodityhardware video content cannot be encoded in time and re-sult in unacceptable user experience Such large coding timeis not surprising In order to achieve high compression ratethese codecs use sophisticated motion compensation [37 40]which are computationally expensive

A recent work [25] uses a very powerful desktop equippedwith an expensive NVIDIA GPU ($1200) to stream 4K videosin real time This work is interesting but such GPUs arenot available on common devices due to their high costbulky size (eg 4376primeprime times 105primeprime [3]) and large system powerconsumption (600W [3]) This makes it hard to deploy onlaptops and smartphones Therefore we aim to bring thecapability of live 4K video streaming to devices with limitedGPU capabilities like laptops and other mobile devices

Codec Video Enc(ms) Dec(ms) Total(ms)H264 Ritual 160 50 210H264 Barscene 170 60 230HEVC Ritual 140 40 180HEVC Barscene 150 50 200

Table 1 Codecs encoding and decoding time per frame

Cost of layered video codecs Scalable video coding (SVC)is an extension of the H264 standard It divides the videodata into base layer and enhancement layers The video qual-ity improves as more enhancement layers are received TheSVC uses layered code to achieve robustness under varyingnetwork conditions But this comes at a cost of high compu-tational cost since SVC needs more prediction mechanismslike inter-layer prediction [32] Due to its higher complexitySVC has rarely been used commercially even though it hasbeen standardized as an H264 extension [26] None of themobile devices have hardware SVC encodersdecoders

To compare the performance of SVC with standard H264we use OpenH264 [5] which has software implementationof both SVC and H264 Our results show that SVC with 3layers takes 13x encoding time as that of H264 Since H264alone is not feasible for live 4K encoding we conclude thatthe SVC is not suitable for live 4K video streamingInsight Traditional video codecs are too expensive for 4Kvideo encoding and decoding We need a cheap 4K video codingthat is fast to run on commodity devices

24 WiGig and WiFi InteractionSince WiGig links may not be stable and can be broken weshould seek an alternative way of communication As WiFiis widely available and low-cost it is an attractive candidateThere have been several interesting works that use WiFiand WiGig together however they are not sufficient for ourpurpose as they do not consider delay sensitive applicationslike video streamingReactive use of WiFi Existing works [38] use WiFi in re-active manner (only when WiGig fails) Reactive use of WiFifalls short for two reasons First WiFi is not used at all aslong as WiGig is not completely disconnected Howevereven when WiGig is not disconnected its throughput can be

quite low and WiFi throughput becomes significant Secondit is hard to accurately detect the WiGig outage A too earlydetection may result in unnecessary switch to WiFi whichtends to have lower throughput than WiGig while a too latedetection may lead to a long outageWiFi throughput Second it is non-trivial to use WiFi andWiGig links simultaneously since both throughput may fluc-tuate widely This is shown in Fig 2 As we can see theWiFi throughput also fluctuates in both static and mobilescenarios In the static case the fluctuation is mainly dueto contention In the mobile cases the fluctuation mainlycomes from both wireless interference and signal variationdue to multipath effect caused by mobility

0 30 60 90 120 150

StaticRotating

Figure 2 Example WiFi Throughput Traces

Insight It is desirable to use WiFi in a proactive manner alongwith the WiGig link Moreover it is important to carefullyschedule the data across the WiFi andWiGig links to maximizethe benefit of WiFi

3 JIGSAWIn this section we describe the important components ofour system for live 4K video streaming (i) light-weight lay-ered coding (ii) efficient implementation using GPU andpipelining and (iii) effective use of WiGig and WiFi

31 Light-weight Layered CodingMotivation Existing video codecs such as H264 and HEVCare too expensive for live 4K video streaming A naturalapproach is to develop a faster codec albeit with a lowercompression rate If a sender sends more data than the net-work can support the data arriving at the receiver beforethe deadline may be incomplete and insufficient to constructa valid frame In this case the user will see a blank screenThis can be quite common for WiGig links On the otherhand if a sender sends at a rate lower than the supported

data rate (eg also due to prediction error) the video qualitydegrades unnecessarily

In comparison layered coding is robust to throughput fluc-tuation The base layer is small and usually delivered evenunder unfavorable network conditions so that the user cansee some version of a video frame albeit at a lower resolutionEnhancement layers can be opportunistically sent based onnetwork conditions While layered coding is promising theexisting layered coding schemes are too computationallyexpensive as shown in Section 2 We seek a layered codingthat can (i) be easily computed (ii) support parallelism toleverage GPUs (iii) compress the data and (iv) take advan-tage of partial layers which are common since a layer canbe large

311 Our Design A video frame can be seen as a 2D array ofpixels There are two raw video formats RGB and YUVwhereYUV is becoming more popular due to better compressionthan RGB Each pixel in RGB format can be represented usingthree 8-bit unsigned integers in RGB (one integer for red onefor green and one for blue) In the YUV420 planar formatfour pixels are represented using four luminance (Y) valuesand two chrominance (UV) values Our implementation usesYUV but the general idea applies to RGB

We divide a video frame into non-overlapping 8x8 blocksof pixels We further divide each 8x8 block into 4x4 2x2 and1x1 blocks We then compute the average of all pixels in each8x8 block which makes up the base layer or Layer 0 Weround each average into an 8-bit unsigned integer denotedas A0 (i j ) where (i j ) denotes the block index Layer 0 hasonly 164 of the original data size which translates to aroundsim50 Mbps Since we use both WiGig and WiFi it is very rareto get the total throughput below sim50 Mbps Therefore layer0 is almost always delivered 1 While only receiving layer0 gives a 512x270 video which is a very low resolution itis still much better than partial blank screen which mayhappen if we try to send a higher resolution video than thelink can supportNext we go down to the 4x4 block level and compute

averages of these smaller blocks Let A1 (i jk ) denote theaverage of a 4x4 block where (i jk ) is the index of the k-th 4x4 block within the (i j )-th 8x8 block and k = 1 2 3 4D1 (i jk ) = A1 (i jk ) minusA0 (i j ) forms layer 1 Using three ofthese differences we can reconstruct the fourth one sincethe 8x8 block contains the average of the four 4x4 blocksThis reconstruction is not perfect due to rounding errorThe rounding error is small MSE is 11 in our videos wherethe maximum pixel value is 255 This has minimum impacton the final video quality Therefore the layer 1 consists ofD1 (i jk ) for k = 1 to 3 from all 8x8 blocks We callD1 (i j 1)1If the typical throughput is even lower one can construct 16x16 blocksand use average of these blocks to form layer 0

A0(00)

D1(000)

D1(001)

D1(002) D1

D2(0020)

D2(0021)

D2(0022) D2

(0023)

D3(00210)

D3(00211)

D3(00212) D3

(00213)

D3(ijkl) = A3

(ijkl) - A2

Figure 3 Our layered coding

the first sublayer in layer 1 D1 (i j 2) the second sublayerand D1 (i j 3) the third

Following the same principle we form layer 2 by dividingthe frame into 2x2 blocks and computing A2 minusA1 and formlayer 3 by dividing into 1x1 blocks and computingA3minusA2 Asbefore we omit the fourth value in layer 2 and layer 3 blocksThe complete layering structure of an 8x8 block is shownin Fig 3 The values marked in grey are not sent and can bereconstructed at the receiver This layering strategy gives us1 average value for layer 0 3 difference values for layer 112 difference values for layer 2 and 48 difference values forlayer 3 which gives us a total of 64 values per 8x8 block to betransmitted Note that due to spatial locality the differencevalues are likely to be small and can be represented usingless than 8 bits In YUV format an 8x8 blockrsquos representationrequires 96 bytes Our strategy allows us to use fewer than96 bytesEncoding Difference values calculated in the layers 1 - 3vary in magnitude If we use the minimum number of bits torepresent each difference value the receiver needs to knowthat number However sending that number for every dif-ference value defeats the purpose of compression and wewill end up sending more data To minimize the overheadwe use the same number of bits to represent the differencevalues that belong to the same layer of an 8x8 block Fur-thermore we group eight spatially co-located 8x8 blocksreferred to as block-group and use the same number of bits torepresent their difference values for every layer This servestwo purposes(i) It further reduces the overhead and (ii) thecompressed data per block-group is always byte aligned Tounderstand (ii) consider if b bits are used to represent differ-ence values for a layer in a block-group b lowast 8 will always bebyte-aligned This enables more parallelism in GPU threads

as explained in Sec 32 Our meta data is less than 03 ofthe original dataDecoding Each pixel at the receiver side is constructed bycombining the data from all the received layers correspond-ing to each block So the pixel value at location (i jk l m)where (i j )k l m correspond to 8x8 4x4 2x2 and 1x1 blockindices respectively can be reconstructed as

A0 (i j ) + D1 (i jk ) + D2 (i jk l ) + D3 (i jk l m)

If some differences are not received they are assumed tobe zero When partial layers are received we first constructthe pixels for which we received more layers and then usethem to interpolate the pixels with fewer layers based on theaverage of the larger block and the current blocks receivedso farEncoding EfficiencyWe evaluate our layered coding effi-ciency using 7 video traces We classify the videos into twocategories based on the spatial correlation of the pixels in thevideos A video is considered rich if it has low spatial localityFig 4(a) shows the distribution across block-groups for theminimum number of bits requires to encode difference val-ues for all layers For less rich videos 4 bits are sufficient toencode more than 80 of difference values For more detailedor rich videos 6 bits are needed to encode 80 of differencevalues Fig 4(b) shows the average compression ratio foreach video where compression ratio is defined by the ratiobetween the encoded frame size and the original frame sizeError bars represent maximum and minimum compressionratio across all frames of a video Our layering strategy canreduce the size of data to be transmitted by 40-62 Less richvideos achieve higher compression ratio due to relativelysmaller difference values

2 3 4 5 6 7 8

Number of Bits

High-Richness VideoLow-Richness Video

(a) Difference bits

1 2 3 4 5 6 7

atio (

Video ID

(b) Compression ratio

Figure 4 Compression Efficiency

32 Layered Coding ImplementationOur layered coding encodes a frame as a combination ofmultiple independent blocks This allows encoding of a fullframe to be divided into multiple independent computationsand makes it possible to leverage GPU GPU consists of many

cores with very simple control logic that can efficiently sup-port thousands of threads to perform the same task on dif-ferent inputs to improve throughput In comparison CPUconsists of a few cores with complex control logic that canhandle only a small number of threads running differenttasks to minimize the latency of a single task Thus to min-imize our encoding and decoding latency we implementour schemes using GPU However achieving maximum ef-ficiency over GPU is not straightforward To maximize theefficiency of GPU implementation we address several signif-icant challengesGPU threads synchronization An efficient GPU imple-mentation should have minimal synchronization betweendifferent threads for independent tasks Synchronization re-quires data sharing which greatly impacts the performanceWe divide computation into multiple independently com-putable blocks However blocks still have to synchronizesince blocks vary in size and have variable writing offsetsin memory We design our scheme such that we minimizethe synchronization need among the threads by putting thecompression of a block-group in a single threadMemory copying overheadWe transmit encoded data assublayers The receiver has to copy the received data pack-ets of sublayers from CPU to GPU memory A single copyoperation has non-negligible overhead Copying packets in-dividually incurs too much overhead while delaying copy-ing packets until all packets are received incurs significantstartup delay for the decoding process We design a mecha-nism to determine when to copy packets for each sublayerWe manage a buffer to cache sublayers residing in GPUmem-ory Because copying a sublayer takes time GPU is likelyto be idle if the buffer is small Therefore we try to copypackets of a new sublayer while we are decoding a sublayerin the buffer Our mechanism can reduce GPU idle time byparallelizing copying new sublayers and decoding bufferedsublayersOur implementation works for a wide range of GPUs

including low to mid-range GPUs [2] which are commonon commodity devices

321 GPU Implementation GPU background To maxi-mize the efficiency of GPU implementation we follow thetwo guidelines below (i) A good GPU implementation shoulddivide the job into many small independently computabletasks (ii) It should minimize memory synchronization acrossthreads as memory operations can be expensive

Memory optimization is critical to the performance of GPUimplementation GPU memory can be classified into threetypes Global Shared and Register Global memory is stan-dard off-chip memory accessible to GPU threads through businterface Shared memory and register are located on-chip so

their access time is sim100times faster than global memory How-ever they are very small But they give us an opportunity tooptimize performance based on the type of computationIn GPU terms a function that is parallelized among mul-

tiple GPU threads is called kernel If different threads in akernel access contiguous memory global memory access fordifferent threads can be coalesced such that memory for allthreads can be accessed in a single read instead of individualreads This can greatly speed up memory access and enhancethe performance Multiple threads can work together usingthe shared memory which is much faster than the globalmemory So the threads with dependency can first read andwrite to the shared memory thereby speeding up the com-putation Ultimately the results are pushed to the globalmemoryImplementation Overview To maximize the efficiency ofGPU implementation we divide the encoder and decoderinto many small independently computable tasks In orderto achieve independence across threads we should satisfythe following two propertiesNo write byte overlap If no two threads write to the samebyte there is no need for memory synchronization amongthreads This is a desired property as thread synchronizationin GPU is expensive Since layers 1-3 may use fewer than 8bits to represent the differences a single thread processing a8x8 block may generate output that is not a multiple of 8-bitsIn this case different threads may write to the same byteand require memory synchronization Instead we use onethread to encode difference values per block-group where ablock-group consists of 8 spatially co-located blocks Sinceall blocks in the block-group use the same number of bits torepresent difference values the output size from block-groupis an multiple of 8 and always byte alignedReadwrite independence amongblocks EachGPU threadshould know in advance where to read the data from orwhere to write the data to As the number of bits used torepresent the difference values vary across block-groups itswriting offset depends on how many bits were used by theprevious block-groups Similarly the decoder should knowthe read offsets Before spawning GPU threads for encodingor decoding we derive the readwrite memory offsets forthat thread using cumulative sum of the number of bits usedper block-group

322 Jigsaw GPU Encoder Overview Encoder consists ofthree major tasks (i) calculating averages and differences(ii) constructing meta-data and (iii) compressing differencesSpecifically consider a YUV video frame of dimensions a timesb The encoder first spawns ab

64 threads for Y values and05ab64 for UV values Each thread operates on an 8x8 block

to (i) compute the average of an 8x8 block (ii) compute the

averages of all 4x4 blocks within the 8x8 block and take thedifference between the averages of 8x8 block and 4x4 blockand similarly compute the hierarchical differences for 2x2blocks and 1x1 blocks and (iii) coordinate with the otherthreads in its block-group using the shared memory to getthe minimum number of bits required to represent differencevalues for its block-group Next the encoder derives the meta-data and memory offsets Then it spawns xlowastylowast15

64lowast8 threadseach of which compresses the difference values for every 8blocksCalculating averages anddifferencesThe encoder spawnsmultiple threads to compute averages and differences Eachthread processes an 8x8 block It calculates all the averagesand difference values required to generate all layers for thatblock It reads eight 8-byte chunks for a block one by one tocompute the average In GPU architecture multiple threadsof the same kernel execute the same instruction for a givencycle In our implementation successive threads work oncontiguous 8x8 blocks This makes successive threads accessand operate on contiguous memory so the global memoryreads are coalesced and memory access is optimizedAfter computing the average at each block level each

thread computes the differences for each layer andwrites it toan output buffer The thread also keeps track of the minimumnumber of bits required to represent the differences for eachlayer Let bi denote the number of bits required for layer i All 8 threads for a block-group use atomic operations in ashared memory location to get the number of bits requiredto represent differences for that block-group We use Bij todenote the number of bits used by block-group j for layer i Meta data processing We compute the memory offsetwhere the compressed value for the ith layer of block-groupj should be written to using a cumulative sumCi

j =sumk=j

k=0 Bik

Based on the cumulative sum the encoding kernel generatesmultiple threads to process the difference values concur-rently without write byte overlap Bij values are transmittedbased on which the receiver can compute the memory offsetsof compressed difference values 3 values per block-group perlayer are transmitted except the base layer and we need 4bits to represent one meta-data value Therefore a total of3lowastxlowastylowast1564lowast8lowast2 bytes are sent as meta-data which is within 03

of the original dataEncoding A thread is responsible for encoding all the dif-ference values for a block-group in a layer A thread for block-group j uses Bij to compress and combine the difference val-ues for layer i from all its blocks It then writes the com-pressed value in an output buffer at the given memory offsetOur design is to ensure consecutive threads read and write tocontiguous memory locations in the global GPU memory totake advantage ofmemory coalescing Fig 5(a) shows averagerunning time of each step in our compression algorithm On

average all steps finish around 10ms Error bars representthe maximum and minimum running times across differentruns

323 Jigsaw GPU Decoder Overview The receiver firstreceives the layer 0 and meta-data It then processes themeta-data to compute the read offsets Each layer except thebase layer has 3 sublayers as described in Section 31 Onceall data corresponding to a sublayer are received a threadis spawned to decode the sublayer and add the differencevalue to the previously reconstructed lower level averageThis process is repeated for every sublayer that is receivedcompletely

The decoder consists of the following steps (i) processingmeta-data (ii) decoding and (iii) reconstructing a frameMeta-data processingMeta-data contains the number ofbits used for difference values per layer per block group Thekernel calculates the cumulative sum Ci

j from the meta-datavalues similarly to the encoder described previously Thiscumulative sum indicates the read offset at which block-groupj can read the compressed difference values for any sublayerin layer i Since all sublayers in a layer are generated usingthe same number of bits their relative offset is the same

r-3Tot

(a) Encoder Modules

onstru

(b) Decoder Modules

Figure 5 Codec GPU Modules Running Time

Decoding A decoding thread decodes all the difference val-ues for a block-group corresponding to a sublayer This threadis responsible for decompressing and adding the differencevalue to the previously reconstructed lower level averageThis process is repeated for every sublayer that is receivedcompletely Each block is composed of 4 smaller sub-blocksDifference values for three of these sub-blocks are transmit-ted Once sublayers corresponding to the three sub-blocksof the same block are received the fourth sublayer is con-structed based on the average principle

Received sublayers reside in the main memory and shouldbe copied to GPU memory before they can be decoded Eachmemory copy incurs overhead so copying every packet in-dividually to GPU has significant overhead On the otherhand memory copy for a sublayer cannot be deferred to thepoint at which its decoding kernel is scheduled Otherwise

it would stall the kernel till the sublayer is completely copiedto GPU memory

We implement a smart delayed copy mechanism It par-allelizes memory copy for one sub-layer with the decoding ofanother sublayer We always keep a maximum of 4 sublayersin GPU memory They are next in line to decode As soon asone of these 4 sublayers is scheduled to be decoded in GPUwe choose a new complete sublayer to copy from CPU toGPU memory If no new complete sublayer is available wecopy a partial sublayer to GPU In the latter case all futureincoming packets for that sublayer are directly copied toGPU without any delay Our smart delayed copy mechanismallows us to reduce the memory copy time between GPUand main memory by 4x at the cost of only 1 of the kernelsexperiencing 015ms stall on average due to delayed memorycopying The total running time of the decoding process ofa sublayer consists of the memory copy time and the kernelrunning time Because of the large memory copy time ourmechanism significantly reduces the total running timeFrame reconstruction Once the deadline for a frame isimminent we stop decoding its sublayers and prepare forreconstruction We interpolate for the partial sublayers asexplained in Section 31 The receiver reconstructs a pixelbased on all the received sublayers Finally reconstructedpixels are organized as a frame and rendered on the screenFig 5(b) shows the average running time of each step in

our decompression algorithm where the error bars representthe maximum and minimum values On average the totalrunning time is around 19ms

Frame 1 Decoding

Layer-1

Metadata

Average amp Difference Calculation

Frame Reconstruction

Layer-3

Layer-2

TRANSMISSION

Frame 1 Encoding Frame 2 Encoding

Figure 6 Jigsaw Pipeline

324 Pipelining A 4K video frame takes significant amountof time to transmit Starting transmission after the wholeframe finishes encoding increases delay similarly decodingafter receiving a whole frame is also inefficient Pipeliningcan be potentially used to reduce the delay However not allencoders and decoders can be pipelined Many encoders haveno intermediate point where transmission can be pipelinedDue to data dependency it is not possible to start decodingbefore all dependent data has been received

Our layering strategy has nice properties that (i) the lowerlayers are independent of the higher layers and can be en-coded or decoded independently and (ii) sublayers within alayer can be independently computed This means our en-coding can be pipelined with data transmission as shownin Fig 6 Transmission can start as soon as the average anddifference calculation is done which can happen as soon as2ms after the frame is generated This is possible since thebase layer consists of only average values and does not re-quire any additional computation Subsequent sublayers areencoded in order and scheduled for transmission as soon asany sublayer is encoded Average encoding time for a singlesublayer is 200-300usA sublayer can be decoded once all its lower layers have

been received As the data are sent in order by the time asublayer arrives its dependent data are already available Soa sublayer is ready to be decoded as soon as it is receivedAs shown in Fig 6 we pipeline sublayer decoding with datareception Final frame reconstruction cannot happen until allthe data of the frame have been received and decoded As theframe reconstruction takes around 3-4ms we stop receivingdata for that frame before the frame playout deadline minusthe reconstruction time Pipelining reduces the delay from10ms to 2ms at the sender and reduces the delay from 20msto 3ms at the receiver

33 Video TransmissionWe implement a driver module on top of the standard net-work interface bonding driver in Linux that bonds WiFi andWiGig interfaces It is responsible for the following impor-tant tasks (i) adapting video quality based on the wirelessthroughput (ii) selecting the appropriate interface to trans-mit the packets (ie intra video frame scheduling) and (iii)determining how much data to send for each video framebefore switching to the next one (ie inter video frame sched-uling) Our high-level approach is to defer the decision tilltransmission to minimize the impact of prediction errorsBelow we describe these components in detail

Delayed video rate adaptation A natural approach totransmit data is to let the application decide how much datato generate and pass it to the lower network layers basedon its throughput estimate This is widely used in the ex-isting video rate adaptation However due to unpredictablethroughput fluctuation it is easy to generate too much ortoo little data This issue is exacerbated by the fact that the60 GHz link throughput can fluctuate widely and rapidlyTo fully utilize network resources with minimized effect

on user quality we delay the transmission decision as lateas possible to remove the need for throughput prediction atthe application layer Specifically for each video frame welet the application generate all layers and pass them to the

driver in the order of layers 0 1 2 3 The driver will transmitas much data as the network interface card allows before thesender switches to the next video frame

While the concept is intuitive realizing it involves signifi-cant effort as detailed below First our module determineswhether a packet belongs to a video stream based on the des-tination port Instead of forwarding all video packets directlyto the interfaces they are added to a large circular bufferThis allows us to make decisions based on any optimizationobjective Packets are dequeued from this buffer and addedinto the transmission queue of an interface whenever a trans-mission decision is made Moreover to minimize throughputwastage this module drops the packets that are past theirdeadline The module uses the header information attachedto each video packet to determine which video frame thepacket belongs to and its deadline Before forwarding thepackets to the interface the module estimates whether thatpacket can be delivered within the deadline based on currentthroughput estimate and queue length of the interfaces Theinterface queue depletion rate is used to estimate the achiev-able throughput If a packet cannot make the deadline allthe remaining packets belonging to the same video frameare marked for deletion since they have the same deadlineThe buffer head is moved to the start of next frame Packetsmarked for deletion are removed from the socket buffer in aseparate low priority thread since the highest priority is toforward packets to interfaces to maintain throughput

Intra-frame scheduling Interface firmware beneath thewireless driver is responsible for forwarding packets Trans-mission is out of control of the driver once the packet is putto interface queue of an interface This means that the packetcannot be dropped even if it misses its deadline To minimizesuch waste we only enqueue a minimum number of packetsto both interfaces to sustain the maximum throughput Theinterface card notifies the driver whenever a packet has beensuccessfully received and the driver removes it from thatinterface queue automatically

Our module keeps track of the queue lengths for bothWiFiand WiGig interfaces As soon as the queue size of any inter-face goes below the threshold it forwards packets from thepacket buffer to that interface As shown in Section 2 bothinterfaces have frequent inactivity intervals during whichthey are unable to transmit any packets This can result insignificant delay in the packets sent to the inactive interfaceSuch delay can sometimes cause the lower layer packets toget stuck at one interface queue and the higher layer packetsto be transmitted at the other interfaces Since our layeredencoding requires the lower layer packets to be received inorder to decode the higher layer packets it is important toensure the lower layer packets to be transmitted before thehigher layer

To achieve this goal the sender monitors the queue onboth interfaces If no packet is removed from that queue forT duration it declares that interface is inactive When thequeue of the active interface reduces which means that itis now sending packets and is ready to accept more packetswe compare the layer of the new packet with that of thepacket queued at the inactive interface If the latter is a lowerlayer (which means it is required in order to decode thehigher layer packets) we move the packet from the inactiveinterface to the active interface No rescheduling is donefor the packets in the same layer since these packets havethe same priority In this way we ensure that all the lowerlayer packets are received prior to the reception of higherlayer packetsT is adapted dynamically based on the currentframersquos deadline If the deadline is more than 10 ms awaywe setT = 4ms otherwise it is set to 2ms This is because weneed to react more quickly if the deadline is closer so thatall the enqueued packets can be transmitted on the activeinterface and received before the deadline

Inter-frame scheduling If the playout deadline for a frameis larger than the frame generation interval the module canhave packets from different video frames at a given instantFor example in a 30 FPS video a new frame is generatedevery 33 msec If the deadline to play a frame is set to 60 msfrom its generation as suggested in recent studies [11] twoconsecutive frames will overlap for 27 msIn this case we have an opportunity to decide when to

start transmitting the next video frame Certainly no pack-ets should be transmitted after their deadline However wesometimes need to move on to transmit the next video framebefore the deadline of the previous frame arrives to ensuresimilar numbers of packets are transmitted for two consecu-tive frames and there is no significant variation in the qualityof consecutive framesTo achieve this whenever the interface queue has room

to accept a new packet our module predicts the number ofpackets that can be transmitted for the next frame If theestimate is smaller than the number of packets already sentfor the current frame minus a threshold (80 packets in oursystem) we switch to transmitting the next video frameThis reduces variation in the quality of the two consecutivevideo frames Note that a frame is never preempted beforeits base layer finishes transmission (unless it passes the dead-line) to ensure at least the lowest resolution frame is fullytransmitted

4 EVALUATIONIn this section we first describe our evaluation methodologyThen we perform micro-benchmarks to understand the im-pact of our design choices Finally we compare Jigsaw withthe existing approaches

41 Evaluation MethodologyEvaluationmethodsWe classify our experiments into twocategories Real-Experiment and Emulation Experiments aredone on two laptops equipped with QCA9008 AC+AD WiFimodule that supports 80211ad on 60 GHz and 80211ac on245 GHz band Each laptop has 24GHz dual-core processor8 GB RAM and NVIDIA Geforce 940M GPU One laptopserves as the sender and the other serves as the receiver Thesender encodes the data while the receiver decodes the dataand displays the videoWhile real experiments are valuable the network may

vary significantly when we run different schemes even forthe same movement pattern This makes it hard to com-pare different schemes Emulation allows us to run differentschemes using exactly the same throughput trace For emula-tion we connect two desktops each with 36GHz quad-coreprocessor 8 GB RAM 256GB SSD and NVIDIA Geforce GTX1050 GPU a 10 Gbps fiber optic link We collect packet tracesover WiFi and WiGig using our laptops and use these tracesto run trace-driven emulation The sender and receiver coderun on two desktops in real time We emulate two interfacesand delay the packets according to the packet trace fromWiFi and WiGig before sending them over the 10 Gbps fiberoptic linkWe verify the correctness of our emulation by com-paring the instantaneous throughput between the real tracesand the emulated experiment and find that the emulatedthroughput is within 05 of the real tracersquos throughputVideo TracesWe use 7 uncompressed videos with YUV420format from Derfrsquos collection under Xiph [7] with resolu-tion 4096x2160 We choose videos with different motion andspatial locality to evaluate their impact Videos are streamedat 30 FPS and each video is concatenated to itself to generatedesired duration We classify the videos into two categoriesbased on spatial correlation of pixels We use variance inthe Y values in 8x8 blocks for all video frames of a video toquantify spatial correlation a video Videos that have highvariance are classified as high richness (HR) videos and videowith low variance are classified as low richness (LR) videosWe use 2 HR videos and 5 LR videosMobility Patterns We collect traces by fixing the senderand moving the receiver We use the following movementpatterns in our experiment (i) Static No movement in senderor receiver (ii) Watching a 360 video A user watches a 360ovideo and rotating the receiver laptop up and down rightand left according to the video displayed on the laptop andthe static sender is around 15m away (iii) Walking The userwalks around with the receiver laptop in hand within a 3mradius from the sender (iv) Blockage Sender and receiverare static but another user moves between them and mayblock the link from time to time thus inducing environmentmobility

For our real experiments we run each experiment 5 timesfor each mobility pattern and then average the results Foremulation we collect 5 different packet level traces for eachmobility patternPerformance metricsWe use Peak Signal to Noise Ratio(PSNR) and Structural Similarity Index (SSIM) to quantifythe video quality PSNR is a widely used video quality metricVideos with PSNR greater than 45 dB are considered to haveexcellent quality between 33-45 are considered good andbetween 27-33 are considered fair 1 dB difference in PSNRis already visible and 3 dB difference indicates that the videoquality is doubled Videos with SSIM greater than 099 areconsidered to have excellent quality 095-099 are consideredgood and 088-095 are considered fair [29]In all our results the error bars represent maximum and

minimum values unless otherwise specified

42 Micro-benchmarksWe use emulation to quantify the impact of different sys-tem components since we can keep the network conditionidenticalImpact of layers on video quality Our layered codingallows us to exploit spatial correlation It can not only adaptthe video quality on the fly but also reduce the amount of datato transmit We quantify the compression rate for differenttypes of videos Fig 7 shows the frame PSNR as we vary thenumber of layers For the LR videos receiving only 2 layerscan achieve an average PSNR close to 40dB and receivingonly 3 layers gives PSNR of 42 dB For the HR videos theaverage PSNR is around 7dB lower than the LR video whenreceiving 3 layers Our scheme can achieve similar PSNRvalues for both kind of videos when all the layers are receivedMoreover as we would expect less rich videos have highercompression ratios due to smaller difference values whichcan be represented using fewer bits

1 2 3 4

No Received Layers

LR VideoHR Video

Figure 7 Video quality vs no of received layers

Impact of using WiGig and WiFi Fig 8 shows the PSNRper frame when using WiGig only We use the throughput

trace collected when a user is watching 360o videos WithoutWiFi PSNR drops below 10dB when disconnection happensWhen WiGig has drastic changes PSNR can decrease bymore than 10dB even if WiGig is not completely discon-nected WiFi improves PSNR by over 25dB when WiGig isdisconnected and by 2dB even when WiGig is not discon-nected The disconnection of WiGig can result in partialblank frames because the available throughput may not beable to transmit the whole layer 0 in time The complemen-tary throughput from WiFi can remove partial blank frameseffectively so we observe large improvement in PSNR whenWiGig disconnection happens We can transmit more layerswhen using both WiGig and WiFi because of higher totalavailable throughput Thus WiFi still improves PSNR evenif WiGig is not disconnected

0 50 100 150 200 250 300 350 400 450

WiFiWiGig

0 50 100 150 200 250 300 350 400 450

Frame Number

Wo WiFiWith WiFi

Figure 8 Impacts of using WiGig and WiFi

Impact of interface schedulerWhen the WiGig through-put reduces data at the WiGig interface gets stuck Suchsituation is especially bad when the data packets queued atWiGig are from the lower layer than those queued at WiFiIn this case even if WiFi successfully delivers the data theyare not decodable without the complete lower layer dataTo avoid this situation our scheduler can effectively movethe data from the inactive interface to the active interfacewhenever the inactive interface has lower layer data WhenLayer-0 data get stuck the user will see part of the receivedframe in blank Fig 9(a) shows the percentage of partialframes that do not have complete Layer-0 under variousmobility patterns Our scheduler reduces the percentage ofpartial frames by 90 82 and 49 for watching walkingand blockage traces respectively In the static case we donot observe any partial frame Our scheme gives a partiallyblank frame only when the throughput is below the mini-mum required ndash 50 Mbps As shown in Fig 9(a) the numberof partially blank frames for our scheme are within 01 ofwhat we receive in ideal case for these tracesImpact of Inter-frame SchedulerWhen throughput fluc-tuates the video quality can vary significantly across frames

Blockage Watching Walking

Mobility Patterns

IdealWo ReschedulingWith Rescheduling

(a) Impacts of interface scheduler

0 10 20 30 40 50

PSNR (dB)

With Inter-frameWo Inter-frame

(b) Impacts of inter-frame scheduler

16 26 36 46 56 66

Frame Deadline (ms)

(c) Impacts of frame deadline

Figure 9 Microbenchmarks

Our inter-frame scheduler tries to balance throughput allo-cated to consecutive frames Without this scheduling wejust send frame data until its deadline As shown in Fig 9(b)this improves the quality of frames around 10dB when thethroughput is low The impact is minimal when the through-put is already high and stableImpact of GPU Next we study the impact of GPU on theencoding and decoding time Table 2 shows the performanceof Jigsaw for three types of GPUs with different number ofcores More powerful GPU has more cores and reduces theencoding and decoding time significantly leaving more timeto transmit data However this comes at the cost of highpower consumption Even the low to mid-end GPUs thatare generally available on mobile devices can successfullyencode and decode the data in real time

GPU Cores Power Encoding DecodingRating (W) Time (ms) Time (ms)

384 30 101 193768 75 17 572816 250 11 53

Table 2 Performance over different GPUs

Impact of Frame Deadline Frame deadline is determinedby the desired user experience of the application Fig 9(c)shows the video PSNR when varying frame deadline from 16ms to 66 ms The performance of Jigsaw using 36 ms dead-line is similar to that using 66 ms deadline 36 ms is muchlower than the 60 ms delay threshold for interactive mobileapplications [11 14 21] We use 60ms as our frame deadlinethreshold because this is the threshold that is deemed tolera-ble by the earlier works [11 14 21] However as we showthat we can tolerate frame delay as low as 36ms so even iffuture applications require lower delay our system can stillsupport

43 System ResultsWe compare our system with the following three baselineschemes using real experiments as well as emulation

bull Pixel Partitioning(PP) Video is divided into 8x8blocks The first pixels from all blocks are transmittedfollowed by the second pixels in all blocks and so onIf not all pixels are received for a block the remainingones are replaced by average of the remaining pixelswhich can be computed based on the average of thelarger block and the current blocks received so farbull Raw Raw video frame is transmitted This is uncom-pressed video streaming Packets after their deadlineare dropped to avoid wastagebull Rate adaptation (Rate) Throughput is estimated us-ing historical information Based on this estimate theuncompressed video is down-sampled accordingly be-fore transmission The receiver up-samples it to 4K anddisplays it This is similar in concept to DASH stream-ing which is the current standard of video-on-demandstreaming

Benefits of Jigsaw Fig 10 shows the video quality for allschemes under various mobility patterns and videos Weperform real experiments by running each video 5 timesover each mobility traces and report the average We makethe following observationsFirst Jigsaw achieves much higher PSNR than the other

schemes across all mobility scenarios and videos Raw alwayshas partial-blank frames and has a very low PSNR of around10dB in all settings as the throughput can not support full 4Kframe transmission even in the best case PP also transmitscomplete 4K frame so it is never able to transmit a completeframe in any setting however the impact of not receivingall data is not as severe as Raw This is because some datafrom different parts of the frame is received and frames areno longer partially blank Moreover using interpolation toestimate the unreceived pixels also improves quality Rateachieves high PSNR in static scenarios since the throughputprediction is more accurate and it can adapt the resolutionHowever it cannot transmit at full 4K resolutionSecond the benefit of Jigsaw is higher under mobility

scenarios Jigsaw improves the median PSNR by up to 6 dBover Rate 12 dB over PP and 35 dB over Raw in static settings

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(a) HR Video Static

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(b) HR Video Watching

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(c) HR Video Walking

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(d) HR Video Blockage

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(e) LR Video Static

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(f) LR Video Watching

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(g) LR Video Walking

0 10 20 30 40 50

PSNR (dB)

Jigsaw Rate PP Raw

(h) LR Video Blockage

Figure 10 Performance of Jigsaw under various mobility patterns (HR High-Richness LR Low-Richness)

The corresponding improvement is 15 dB over Rate 16 dBover PP and 38 dB over Raw because Jigsaw can quicklyadapt to the throughput changes in the mobile case due toits layered coding design delayed video rate adaptation andsmart schedulingThird for all schemes HR videos suffer more when less

data can be transmitted All schemes achieve higher PSNRfor the LR videos The median PSNR for the LR (HR) videosis 46dB (41dB) 38dB (31dB) 33dB (26dB) and 10dB (9dB)for Jigsaw Rate PP and Raw respectively These resultsshow that Jigsaw significantly out-performs the existingapproaches for a variety of network conditions and videosTable 3 shows the median video SSIM for all schemes

We can observe similar trend as PSNR Jigsaw achieves thehighest SSIM Jigsaw can achieve at least good video qualityfor all videos and mobility traces

0 50 100 150 200 250 300 350 400 450

0 50 100 150 200 250 300 350 400 450Thpt (G

0 20 40 60

0 50 100 150 200 250 300 350 400 450

layers

Frame Number

Figure 11 Frame quality correlation with throughput

Effectiveness of layer adaptation Fig 11 shows the cor-relation between throughput and video frame quality for

Jigsaw We can see that the changes of video quality hasvery similar pattern as the throughput changes Jigsaw onlyreceives Layer-0 and partial Layer-1 when throughput isclose to 0 In those cases the frame quality drops to around30dB When the throughput stays close to 18Gbps the framequality reaches around 49dB Because we keep our interfacequeues small our packet transmission rate closely followsthe packet depletion rate from the interface queues Henceour layer adaptation can quickly respond to any throughputfluctuation

Jigsaw Rate PP RawHRStatic 0993 0978 0838 0575

HRWatching 0965 0749 0818 0489HRWalking 0957 0719 0805 0421HRBlockage 0971 0853 0811 0511LRStatic 0996 0985 0949 0584

LRWatching 0971 0785 0897 0559LRWalking 0965 0748 0903 0481LRBlockage 0984 0862 0907 0560

Table 3 Video SSIM under various mobility patterns

44 Emulation ResultsIn this section we compare time series of Jigsaw with theother schemes when using the same throughput trace Weuse emulation to compare different schemes under the samecondition Fig 12 shows the quality of each frame usingan example throughput trace collected from the Watchingmobility pattern As we can see Jigsaw achieves the highestquality for frames and least variation

0 50 100 150 200 250 300 350 400 450

Frame Number

Jigsaw Rate PP Raw

Figure 12 Frame quality comparison

5 RELATEDWORK

Video Streaming over Internet DASH [36] is the currentstandard protocol for Internet video streaming services Avideo is divided into chunks each of which contains a fewseconds video content The client adapts the bitrate of videochunks according to the network condition Significant workhas been done on understanding and improving the perfor-mance of video rate adaptation algorithms [18 20 22 28 43]However most of them only investigate video-on-demand(VoD) services in which entire videos are encoded and storedat the server side before users start downloading Comparedwith VoD live video streaming [13 30 42] is more delaysensitive Existing live video streaming approaches streamvideo content in chunks and users have to wait for a fewseconds before a video chunk is ready to play In this workwe focus on supporting live video streaming in which thedelay of a video frame is as small as only tens of millisecondsSuch live streaming is crucial for interactive applicationslike gaming virtual reality and augmented reality [11 21]Video Streaming over non-60GHzWireless NetworksWireless links are not reliable and have large fluctuation inavailable throughput Flexcast [9] incorporates rateless codeinto the existing video codec such that video quality does notdrop drastically when network condition varies Softcast [19]exploits 3D DCT to achieve similar target Parcast [27] inves-tigates how to transmit more important video data to morereliable wireless channels PiStream [41] uses physical layerinformation from LTE links to enhance video rate adaptationHowever these works do not study 4K video streaming Dueto large size existing video coding cannot code 4K videocontent in real time Furion [23] tries to use multiple codecinstances to encode and decode VR video content in par-allel but it supports a much lower resolution than the 4Kresolution Rubiks [16] designs a layered coding scheme tominimize bandwidth for 360-degree video streaming How-ever it does not consider 4K video encoding cost and hencecannot support 4K live video Some dedicated hardware can

support 4K videos in real time [25] but it is generally notavailable on laptops and mobile devices Our work focuseson supporting live 4K video encoding transmission anddecoding in real time on commodity devices

Video Streaming over WiGig Owing to the large band-width in WiGig streaming uncompressed video has becomea key application for WiGig [35 39] Choi et al [12] pro-poses a link adaptation policy to allocate different amountof resource to different data bits in a pixel such that thetotal amount of allocated resource is minimized while main-taining good video quality He et al [17] try to encode anuncompressed video into multiple descriptions using RS cod-ing The video quality improves as more descriptions arereceived [12 17] use unequal error protection to protectdifferent bits in a pixel based on importance Shao et al [33]compresses pixel difference values using run length codingwhich is difficult to parallize since run length codes since dif-ferent pixels can not be known beforehand Singh et al [34]propose to partition adjacent pixels into different packetsand adapt the number of pixels to send based on throughputestimation But its video quality degrades significantly whenthroughput drops since it has no compression Li et al [24]develops an interface that notifies the application layer ofthe parallelize between WiGig and WiFi so that the videoserver can determine an appropriate resolution of the videoto send In general these works do not consider encodingtime which can be significant for 4K videos They also do notexplore how to leverage multiple links for video streamingwhich our system addresses

Otherwireless approachesDeveloping other high-bandwidthreliable wireless networks is also an interesting solution tosupport 4K video streaming A recent router [6] claims tosupport Gbps-level throughput by exploiting one 24GHz andtwo 5GHz bands However it still cannot support raw 4Kvideo transmission Our system can efficiently fit 4K videointo this throughput range

6 CONCLUSIONIn this paper we study the feasibility of streaming live 4Kvideos and identify the major challenge in realizing this goalis to tolerate large and unpredictable throughput fluctua-tion To address this challenge we develop a novel system tostream live 4K videos using commodity devices Our designconsists of a new layered video coding an efficient imple-mentation using GPU and pipelining and implicit video rateadaptation and smart scheduling data acrossWiFi andWiGigwithout requiring throughput prediction Moving forwardwe are interested in exploring applications of live 4K videostreaming such as VR AR and gaming

REFERENCES[1] ffmpeg httpswwwffmpegorg[2] Nvidia geforce 940m httpswwwnotebookchecknet

NVIDIA-GeForce-940M1380270html[3] Nvidia titan x specs httpswwwnvidiacomen-usgeforceproducts

10seriestitan-x-pascal[4] Nvidia video codec httpsdevelopernvidiacom

nvidia-video-codec-sdk[5] Openh264 httpswwwopenh264org[6] Tp-link archer c5400x mu-mimo tri-band gam-

ing router httpsventurebeatcom20180906tp-link-launches-gaming-router-for-4k-video-stream-era

[7] Video dataset url=httpsmediaxiphorgvideoderf[8] Youtube 4k bitrates httpssupportgooglecomyoutubeanswer

1722171hl=en[9] S Aditya and S Katti Flexcast Graceful wireless video streaming

In Proceedings of the 17th annual international conference on Mobilecomputing and networking pages 277ndash288 ACM 2011

[10] J Bankoski J Koleszar L Quillio J Salonen P Wilkins and Y XuVp8 data format and decoding guide RFC 6386 Google Inc November2011

[11] C-M Chang C-H Hsu C-F Hsu and K-T Chen Performancemeasurements of virtual reality systems Quantifying the timing andpositioning accuracy In Proceedings of the 2016 ACM on MultimediaConference pages 655ndash659 ACM 2016

[12] M Choi G Lee S Jin J Koo B Kim and S Choi Link adaptationfor high-quality uncompressed video streaming in 60-ghz wirelessnetworks IEEE Transactions on Multimedia 18(4)627ndash642 2016

[13] L De Cicco S Mascolo and V Palmisano Feedback control for adap-tive live video streaming In Proceedings of the second annual ACMconference on Multimedia systems pages 145ndash156 ACM 2011

[14] J Deber R Jota C Forlines and D Wigdor How much faster is fastenough User perception of latency amp latency improvements in directand indirect touch In Proceedings of the 33rd Annual ACM Conferenceon Human Factors in Computing Systems pages 1827ndash1836 ACM 2015

[15] S Fouladi R S Wahby B Shacklett K V Balasubramaniam W ZengR Bhalerao A Sivaraman G Porter and K Winstein Encodingfast and slow Low-latency video processing using thousands of tinythreads In 14th USENIX Symposium on Networked Systems Design andImplementation (NSDI 17) pages 363ndash376 Boston MA 2017 USENIXAssociation

[16] J He M A Qureshi L Qiu J Li F Li and L Han Rubiks Practical360-degree streaming for smartphones In Proceedings of the 16thAnnual International Conference on Mobile Systems Applications andServices ACM 2018

[17] Z He and S Mao Multiple description coding for uncompressedvideo streaming over 60ghz networks In Proceedings of the 1st ACMworkshop on Cognitive radio architectures for broadband pages 61ndash68ACM 2013

[18] T-Y Huang R Johari N McKeown M Trunnell and M Watson Abuffer-based approach to rate adaptation Evidence from a large videostreaming service ACM SIGCOMM Computer Communication Review44(4)187ndash198 2015

[19] S Jakubczak and D Katabi A cross-layer design for scalable mobilevideo In Proceedings of the 17th annual international conference onMobile computing and networking pages 289ndash300 ACM 2011

[20] J Jiang V Sekar H Milner D Shepherd I Stoica and H Zhang CfaA practical prediction system for video qoe optimization In NSDIpages 137ndash150 2016

[21] T Kaumlmaumlraumlinen M Siekkinen A Ylauml-Jaumlaumlski W Zhang and P Hui Dis-secting the end-to-end latency of interactive mobile video applicationsIn Proceedings of the 18th International Workshop on Mobile Computing

Systems and Applications pages 61ndash66 ACM 2017[22] R Kuschnig I Kofler and H Hellwagner An evaluation of tcp-based

rate-control algorithms for adaptive internet streaming of h 264svcIn Proceedings of the first annual ACM SIGMM conference on Multimediasystems pages 157ndash168 ACM 2010

[23] Z Lai Y C Hu Y Cui L Sun and N Dai Furion Engineering high-quality immersive virtual reality on todayrsquos mobile devices In Proceed-ings of the 23rd Annual International Conference on Mobile Computingand Networking pages 409ndash421 ACM 2017

[24] Y-Y Li C-Y Li W-H Chen C-J Yeh and KWang Enabling seamlesswigigwifi handovers in tri-band wireless systems InNetwork Protocols(ICNP) 2017 IEEE 25th International Conference on pages 1ndash2 IEEE2017

[25] L Liu R Zhong W Zhang Y Liu J Zhang L Zhang and M GruteserCutting the cord Designing a high-quality untethered vr system withlow latency remote rendering In Proceedings of the 16th Annual In-ternational Conference on Mobile Systems Applications and Servicespages 68ndash80 ACM 2018

[26] X Liu Q Xiao V Gopalakrishnan B Han F Qian and M Varvello360ampdeg innovations for panoramic video streaming In Proceedingsof the 16th ACM Workshop on Hot Topics in Networks HotNets-XVIpages 50ndash56 New York NY USA 2017 ACM

[27] X L LiuW Hu Q Pu FWu and Y Zhang Parcast Soft video deliveryin mimo-ofdm wlans In Proceedings of the 18th annual internationalconference on Mobile computing and networking pages 233ndash244 ACM2012

[28] HMao R Netravali andM Alizadeh Neural adaptive video streamingwith pensieve In Proceedings of the Conference of the ACM SpecialInterest Group on Data Communication pages 197ndash210 ACM 2017

[29] A Moldovan and C H Muntean Qoe-aware video resolution thresh-olds computation for adaptive multimedia In 2017 IEEE InternationalSymposium on BroadbandMultimedia Systems and Broadcasting (BMSB)pages 1ndash6 June 2017

[30] K Pires and G Simon Youtube live and twitch a tour of user-generatedlive streaming systems In Proceedings of the 6th ACM MultimediaSystems Conference pages 225ndash230 ACM 2015

[31] T Rappaport Wireless Communications Principles and Practice Pren-tice Hall PTR Upper Saddle River NJ USA 2nd edition 2001

[32] H Schwarz D Marpe and T Wiegand Overview of the scalablevideo coding extension of the h264avc standard IEEE Transactions onCircuits and Systems for Video Technology 17(9)1103ndash1120 Sept 2007

[33] H-R Shao J Hsu C Ngo and C Kweon Progressive transmission ofuncompressed video overmmwwireless InConsumer Communicationsand Networking Conference (CCNC) 2010 7th IEEE pages 1ndash5 IEEE2010

[34] H Singh J Oh C Kweon X Qin H-R Shao and C Ngo A 60 ghzwireless network for enabling uncompressed video communicationIEEE Communications Magazine 46(12) 2008

[35] H Singh X Qin H-r Shao C Ngo C Y Kwon and S S Kim Supportof uncompressed video streaming over 60ghz wireless networks InConsumer Communications and Networking Conference 2008 CCNC2008 5th IEEE pages 243ndash248 IEEE 2008

[36] T Stockhammer Dynamic adaptive streaming over httpndash standardsand design principles In Proceedings of the second annual ACM confer-ence on Multimedia systems pages 133ndash144 ACM 2011

[37] G J Sullivan J-R Ohm W-J Han T Wiegand et al Overview ofthe high efficiency video coding(hevc) standard IEEE Transactions oncircuits and systems for video technology 22(12)1649ndash1668 2012

[38] S Sur I Pefkianakis X Zhang and K-H Kim Wifi-assisted 60 ghzwireless networks In Proceedings of the 23rd Annual InternationalConference on Mobile Computing and Networking MobiCom rsquo17 pages28ndash41 New York NY USA 2017 ACM

[39] T Wei and X Zhang Pose information assisted 60 ghz networksTowards seamless coverage and mobility support In Proceedings ofthe 23rd Annual International Conference on Mobile Computing andNetworking pages 42ndash55 ACM 2017

[40] T Wiegand G J Sullivan G Bjontegaard and A Luthra Overviewof the h 264avc video coding standard IEEE Transactions on circuitsand systems for video technology 13(7)560ndash576 2003

[41] X Xie X Zhang S Kumar and L E Li pistream Physical layerinformed adaptive video streaming over lte In Proceedings of the 21stAnnual International Conference on Mobile Computing and Networking

pages 413ndash425 ACM 2015[42] H Yin X Liu T Zhan V Sekar F Qiu C Lin H Zhang and B Li

Design and deployment of a hybrid cdn-p2p system for live videostreaming experiences with livesky In Proceedings of the 17th ACMinternational conference on Multimedia pages 25ndash34 ACM 2009

[43] X Yin A Jindal V Sekar and B Sinopoli A control-theoretic approachfor dynamic adaptive video streaming over http In ACM SIGCOMMComputer Communication Review volume 45(4) pages 325ndash338 ACM2015

Abstract

1 Introduction

2 Motivation

21 4K Videos Need Compression

22 Rate Adaptation Requirement

23 Limitations of Existing Video Codecs

24 WiGig and WiFi Interaction

3 Jigsaw

31 Light-weight Layered Coding

32 Layered Coding Implementation

33 Video Transmission

4 Evaluation

41 Evaluation Methodology

42 Micro-benchmarks

43 System Results

44 Emulation Results

5 Related Work

6 Conclusion

References