
LEARN Codes: Inventing Low-Latency Codes via Recurrent Neural Networks

Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan, Sewoong Oh, Pramod Viswanath

Abstract—Designing channel codes under low-latency constraints is one of the most demanding requirements in 5G standards. However, a sharp characterization of the performance of traditional codes is available only in the large block-length limit. Guided by such asymptotic analysis, code designs require large block lengths as well as latency to achieve the desired error rate. Tail-biting convolutional codes and other recent state-of-the-art short block codes, while promising reduced latency, are neither robust to channel mismatch nor adaptive to varying channel conditions. When codes designed for one channel (e.g., the Additive White Gaussian Noise (AWGN) channel) are used for another (e.g., non-AWGN channels), heuristics are necessary to achieve non-trivial performance.

In this paper, we first propose an end-to-end learned neural code, obtained by jointly designing a Recurrent Neural Network (RNN) based encoder and decoder. This code outperforms the canonical convolutional code in a block coding setting. We then leverage this experience to propose a new class of codes under low-latency constraints, which we call Low-latency Efficient Adaptive Robust Neural (LEARN) codes. These codes outperform state-of-the-art low-latency codes and exhibit robustness and adaptivity properties. LEARN codes show the potential to design new versatile and universal codes for future communications via tools of modern deep learning coupled with communication engineering insights.

Index Terms—Channel Coding, Low Latency, Communications, Deep Learning, Robustness, Adaptivity.

I. INTRODUCTION

Reliable channel codes have had a huge impact on communications in the modern information age. Since the inception of the field in [1], powered by the mathematical insights of information theory and the principles of modern engineering, several capacity-achieving codes such as Polar, Turbo, and LDPC codes [2][3][4] have come close to the Shannon limit when operated at large block lengths over Additive White Gaussian Noise (AWGN) channels. These codes were successfully adopted and applied in LTE and 5G data planes [5]. Since 5G is under intensive

Y. Jiang and S. Kannan are with the Department of Electrical and Computer Engineering (ECE) at the University of Washington (UW), Seattle, USA. Email: [email protected] (Y. Jiang), [email protected] (S. Kannan). H. Kim is with Samsung AI Center Cambridge, Cambridge, United Kingdom. Email: [email protected]. H. Asnani is with the School of Technology and Computer Science (STCS) at the Tata Institute of Fundamental Research (TIFR), Mumbai, India. Email: [email protected]. S. Oh is with the Allen School of Computer Science & Engineering at the University of Washington (UW), Seattle, USA. Email: [email protected]. P. Viswanath is with the Coordinated Science Lab (CSL) and the Department of Electrical Engineering at the University of Illinois at Urbana-Champaign (UIUC). Email: [email protected]. This paper is an extended version of work that appeared in the 53rd IEEE International Conference on Communications (ICC 2019).

development, designing codes that have features such as low latency, robustness, and adaptivity has become increasingly important.

A. Motivation

An Ultra-Reliable Low Latency Communication (URLLC) code [6] must operate under stringent delay constraints, thereby enabling scenarios such as vehicular communication, virtual reality, and remote surgery. When considering low latency requirements, it is instructive to observe the interplay of three different types of delay: processing delay, propagation delay, and structural delay. Processing and propagation delays are affected mostly by computing resources and varying environments [7]. Low-latency channel coding, the focus of this paper, aims to reduce the structural delay caused by the encoder and/or decoder. Encoder structural delay refers to the delay between the encoder receiving an information bit and sending it out. Decoder structural delay refers to the delay between the decoder receiving bits from the channel and decoding them. Traditional AWGN capacity-achieving codes, such as LDPC and Turbo codes at small block lengths, show poor performance under URLLC requirements [7], [11]. There has also been recent interest in establishing theoretical limits and bounds on the reliability of codes at small to medium block lengths [12].

We note that latency is proportional to block length when using a block code: the decoder waits until it receives the entire (noisy) codeword to start decoding. When using a convolutional code, however, latency is given by the decoding window length. Thus, there is an inherent difference between block codes and convolutional codes when considering latency: since convolutional codes incorporate locality in encoding, they can also be decoded locally. While convolutional codes with small constraint lengths are not capacity achieving, they can possibly be optimal under a low latency constraint. Indeed, this possibility was raised in [7] and observed in [8], [9], [10]. In this work, we advance this hypothesis by showing that we can invent codes, similar in spirit to convolutional codes, that outperform handcrafted codes in the low latency regime. While convolutional codes are the state of the art in this regime, in the moderate latency regime extended Bose-Chaudhuri-Hocquenghem (eBCH) codes also perform well [13].

In addition, channel coding under a low latency constraint must take channel effects into account in fading channels, because pilot bits used for accurate channel estimation increase latency [7]. This calls for incorporating both robustness and adaptivity as desired features of URLLC codes. Robustness refers to the ability to perform with acceptable degradation, without retraining, when model mismatches occur; adaptivity refers to the ability to learn to adapt to different channel models with retraining. Current and traditional channel coding systems require heuristics to compensate for channel effects, which leads to sub-optimality under model mismatch [14]. In general, channels that do not admit a clean mathematical analysis lack a theory of the optimal communication algorithm and thus rely on sub-optimal heuristics [15]. In short, current channel coding schemes fail to deliver under the challenges of low latency, robustness, and adaptivity.

B. Channel Coding Inspired by Deep Learning: Prior Art

Over the past decade, advances in deep learning (DL) have greatly benefited engineering fields such as computer vision, natural language processing, and gaming technology [16]. This has generated recent excitement about applying DL methods to communication system design [19][20]. These methods have typically been successful in settings where there is a significant model deficiency, i.e., the observed data cannot be well described by a clean mathematical model. Thus, many initial proposals applying DL to communication systems have also focused on problems where there is model uncertainty due to the lack of, say, channel knowledge [21], [22]. In developing codes for the AWGN channel under low latency constraints, there is no model deficiency, since the channel is well defined mathematically and simple to describe. The main challenge, rather, is that optimal codes and decoding algorithms are not known; we refer to this as algorithm deficiency.

For algorithm-deficient problems, two categories of work apply deep learning to communication systems: (1) designing a neural network decoder (a neural decoder) for a given canonical encoder such as LDPC or Turbo codes, and (2) jointly designing both the neural network encoder and decoder, referred to as a Channel Autoencoder (Channel AE) [19], as illustrated in Figure 1.

Fig. 1. Channel AE block diagram.

Neural decoders show promising performance by mimicking and modifying existing optimal decoding algorithms. Learnable Belief Propagation (BP) decoders for BCH and High-Density Parity-Check (HDPC) codes have been proposed in [24] and [25]. Polar decoding via neural BP is proposed in [26] and [27]. Since mimicking learnable Tanner graphs requires a fully connected neural network, generalizing to longer block lengths is prohibitive. Capacity-achieving performance for Turbo codes under an AWGN channel is achieved via Recurrent Neural Networks (RNNs) for arbitrary block lengths [28]. The joint design of neural code (encoders) and decoders via a Channel AE, relevant to the problem under consideration in this paper,

has witnessed scant progress. Deep autoencoders have been successfully applied to various problems, such as dimensionality reduction, representation learning, and graph generation [18]. However, the Channel AE differs significantly from typical deep autoencoder models in the following two aspects, making its design and training highly challenging:

1) The number of possible messages, 2^b for b information bits, grows exponentially with the block length. Hence, the Channel AE must generalize to unseen messages with capacity-restricted encoders and decoders [26].

2) Channel models add noise between the encoder and decoder, and the encoder needs to satisfy power constraints, thus requiring high robustness in the code.

For Channel AE training, [19] and [20] introduce learning tricks emphasizing both channel coding and modulation. Learning a Channel AE without a channel gradient is shown in [32]. Modulation gain is reported in [33]. Beyond AWGN and fading channels, [34] extends RNNs to design a code for the feedback channel that outperforms existing state-of-the-art codes [29], [30]. Extending the Channel AE to MIMO settings is reported in [23]. Despite these successes, existing research on the Channel AE under canonical channels is restricted to very short block lengths (for example, achieving the same performance as a rate-4/7 Hamming code with 4 information bits). Furthermore, existing works do not focus on the low latency, robustness, and adaptivity requirements.

With this backdrop, this paper poses the following fundamental question: Can we improve the Channel AE design to construct new codes that comply with low latency requirements?

We answer this in the affirmative, as described next.

C. Our Contribution

Our primary goal is to design a low latency code under extremely low structural-latency requirements. As noted, convolutional codes outperform block codes given low structural latency [7], [8], [9], [10]. An RNN is a constrained neural structure with a natural connection to convolutional codes, since each encoded symbol has a locality of memory and is most strongly influenced by the recent past of the input bits. Furthermore, RNN-based codes have shown natural generalization across different block lengths in prior work [28], [26]. With a carefully designed learnable structure that uses a Bidirectional RNN (Bi-RNN) for both the encoder and decoder, as well as a novel training methodology developed specifically for the Channel AE model, we demonstrate that our Bi-RNN-based neural code outperforms convolutional codes.

We then propose the Low-Latency Efficient Adaptive Robust Neural (LEARN) code, which applies learnable RNN structures to both the encoder and decoder with an additional low latency constraint. LEARN improves performance under extremely low latency constraints. Ours is the first work we know of that creates an end-to-end design for a neural code that improves performance on the AWGN channel (in any regime). In summary, the contributions of this paper include:

1. Outperforming convolutional codes. We propose a bi-directional RNN network structure and a tailored learning methodology for the Channel AE that outperform canonical convolutional codes. The proposed training methodology results in smoother training dynamics and better generalization. (Section II)

2. Improving performance in low latency settings. We design the LEARN code for low latency requirements with specific network designs. LEARN improves performance in extremely low latency settings. (Section III)

3. Showing robustness and adaptivity. When channel conditions vary, LEARN codes demonstrate robustness (the ability to work well under unseen channels) and adaptivity (the ability to adapt to a new channel with few training symbols), showing an order of magnitude improvement in reliability compared to existing state-of-the-art codes. (Section III)

4. Interpreting the neural code. We provide interpretations to aid in the fundamental understanding of why the jointly trained code works better than canonical codes. (Section IV)

II. DESIGNING NEURAL CODES THAT OUTPERFORM CONVOLUTIONAL CODES

The reliability of a neural code depends heavily on two factors: (1) the network architecture and (2) the training methodology. In this section, we provide guidelines for these two factors and demonstrate that neural codes designed and trained using our guidelines achieve superior reliability compared to convolutional codes in a block coding setting.

A. Network Architecture

The performance (coding gain) of traditional channel codes improves with block length. Recent research on the Channel AE model does not show coding gain even for moderate block lengths [19], [26] with fully connected neural networks, even with nearly unlimited training examples. We argue that a Recurrent Neural Network (RNN) architecture is a more suitable DL structure for the Channel AE. For a brief introduction to RNNs and their variants, such as the Bidirectional RNN (Bi-RNN), Gated Recurrent Unit (GRU), and Long Short Term Memory (LSTM), please refer to Appendix A. In this paper, we use the terms GRU and RNN interchangeably.

RNN-based Encoder and Decoder Design
Our empirical results comparing different Channel AE structures (Figure 2) show that for long enough block lengths, an RNN outperforms a Fully Connected Neural Network (FCNN) for the Channel AE. The FCNN curve in Figure 2 uses an FCNN for both encoder and decoder; the RNN curve uses a Bi-RNN for both encoder and decoder. The training steps are kept the same for a fair comparison. The repetition code and extended Hamming code performances are shown as references for short and long block lengths.

Figure 2 (left) shows that for a short block length (4), the performance of the FCNN and RNN are close to each other, since enumerating all possible codewords is not prohibitive. On the other hand, for a longer block length (100), Figure 2 (right) shows that with an FCNN the Bit Error Rate (BER) is even worse than that of repetition codes, which indicates a failure to generalize; the RNN outperforms the FCNN thanks to its generalization via parameter sharing and an adaptive, learnable dependency length. Hence, in this paper, we model the encoder and decoder as RNNs in order to generalize across block lengths. Figure 2 also shows that RNNs with our tailored training methodology outperform simply applying an RNN or FCNN to the Channel AE; we describe this training methodology in Section II-B.

Power Constraint Module
The output of the RNN encoder can take arbitrary values and does not necessarily satisfy the power constraint. To impose the power constraint, we add a power constraint layer after the RNN encoding layer, as shown in Figure 1. Power normalization enforces that the output code has unit power by normalizing the output so that E[x²] = 1. More detail is in the Appendix.
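As an illustration (our own sketch, not the authors' released implementation), such a bit-wise power normalization can be written as a differentiable PyTorch layer that standardizes each code position using the statistics of the current batch:

import torch
import torch.nn as nn

class PowerNormalization(nn.Module):
    # Bit-wise power constraint: normalize each code position to zero mean and
    # unit power using batch statistics, so that E[x^2] = 1 at the channel input.
    # A sketch; the paper's exact normalization (e.g., statistics saved for
    # test time) may differ in detail.
    def forward(self, x):
        # x: (batch, block_length, channel_uses_per_bit)
        mean = x.mean(dim=0, keepdim=True)
        std = x.std(dim=0, keepdim=True)
        return (x - mean) / (std + 1e-8)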

B. Training Methodology

We find in this paper that the following training methods result in a faster learning trajectory and better generalization with the learnable structure discussed above.
• Training with a large batch size
• Using Binary Cross-Entropy (BCE) loss
• Training the encoder and decoder separately
• Adding a minimum distance regularizer on the encoder
• Using the Adam optimizer
• Adding more capacity (parameters) to the decoder than to the encoder

Some of these training methods are not common in deep learning due to the unique structure of the Channel AE. Appendix C shows the empirical evidence and reasons about the better performance of the above training methods.
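To make the schedule concrete, the following PyTorch-style sketch shows one possible realization of a training epoch with the settings of Figure 4 (large batches, BCE loss, Adam, encoder trained once and decoder five times per step). The module interfaces encoder, decoder, and channel are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

def train_epoch(encoder, decoder, channel, opt_enc, opt_dec,
                batches=100, batch_size=1000, block_len=100, dec_steps=5):
    # One epoch of alternating Channel AE training (a sketch under assumed
    # module interfaces; opt_enc and opt_dec are Adam optimizers).
    bce = nn.BCELoss()

    def sample_bits():
        return torch.randint(0, 2, (batch_size, block_len, 1)).float()

    for _ in range(batches):
        # Train the encoder once, keeping the decoder weights fixed.
        bits = sample_bits()
        loss = bce(decoder(channel(encoder(bits))), bits)
        opt_enc.zero_grad(); loss.backward(); opt_enc.step()

        # Train the decoder several times, keeping the encoder weights fixed.
        for _ in range(dec_steps):
            bits = sample_bits()
            with torch.no_grad():
                received = channel(encoder(bits))
            loss = bce(decoder(received), bits)
            opt_dec.zero_grad(); loss.backward(); opt_dec.step()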

C. Performance of RNN-Based Channel AE: AWGN Setting

Applying the network architecture guidelines and the training methodology improvements proposed thus far, we design a neural code with a Bi-GRU for both the encoder and decoder, as shown in Figure 3. The hyperparameters are given in Figure 4.

The Tail-biting Convolutional Code (TBCC) has proven to be the state of the art in the short block length regime [7], [8], [9], [10]. We compare the performance of TBCC with the RNN-based channel code in a block code setting. The BER performance over the AWGN channel under various code rates is shown in Figure 5. The TBCC BER curves are generated from convolutional codes with constraint lengths up to m = 7 using the best generator functions from Figure 7, with traceback length equal to 5(m + 1); we measure the BERs of all TBCCs in Figure 7 empirically using the CommPy simulator [?] and plot the best performing one. Figure 5 shows that the RNN-based Channel AE outperforms all convolutional codes up to constraint length 7, empirically demonstrating the advantage of jointly optimizing the encoder and decoder over the AWGN channel.
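For reference, the convolutional-code baseline can be reproduced with the open-source CommPy package along the following lines. This is our own sketch (function and argument names may differ slightly across CommPy versions, and plain terminated encoding is shown rather than true tail-biting), not the authors' evaluation script.

import numpy as np
from commpy.channelcoding import Trellis, conv_encode, viterbi_decode

def conv_code_ber(ebno_db, n_bits=100000, memory=2, generators=(0o5, 0o7)):
    # Approximate BER of a rate-1/2 convolutional code (m = 2, generators 5, 7
    # in octal) over AWGN with BPSK and traceback depth 5*(m + 1).
    trellis = Trellis(np.array([memory]), np.array([generators]))
    msg = np.random.randint(0, 2, n_bits)
    coded = conv_encode(msg, trellis)          # terminated, not tail-biting
    tx = 2.0 * coded - 1.0                     # BPSK mapping
    rate = 0.5
    sigma = np.sqrt(1.0 / (2.0 * rate * 10 ** (ebno_db / 10.0)))
    rx = tx + sigma * np.random.randn(len(tx))
    decoded = viterbi_decode(rx, trellis, tb_depth=5 * (memory + 1),
                             decoding_type='unquantized')
    return np.mean(decoded[:n_bits] != msg)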

Note that the RNN-based Channel AE code is continuous; the performance gain therefore comes from both coding gain and high-order modulation gain, as shown in Figure 5 (right), where the performance gap is larger at higher SNR. A binarized code would lead to a fairer


Fig. 2. Channel AE performance at code rate 1/2, where the block length (number of information bits) is 4 (left) and 100 (right). TBCC is m = 2, with g11 = 5, g12 = 7 and feedback 7.

Encoder layer          Output dimension
Input                  (K, 1)
Bi-GRU (2 layers)      (K, 25)
FCNN (linear)          (K, 1/R)

Decoder layer          Output dimension
Input                  (K, 1/R)
Bi-GRU (2 layers)      (K, 100)
FCNN (sigmoid)         (K, 1)

Fig. 3. RNN-based Channel AE encoder (top left), decoder (top right), and network structures (bottom).
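To make the structure in Figure 3 concrete, a minimal PyTorch sketch of the two modules could look as follows (dimensions follow the tables above; this is an illustrative re-implementation, not the authors' code, and the power normalization layer is applied to the encoder output separately):

import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    # Bi-GRU encoder: (batch, K, 1) message bits -> (batch, K, 1/R) code symbols.
    def __init__(self, n_per_bit=2, hidden=25):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_per_bit)     # linear output layer

    def forward(self, bits):
        h, _ = self.rnn(bits)
        return self.fc(h)                              # power-normalize afterwards

class BiGRUDecoder(nn.Module):
    # Bi-GRU decoder: (batch, K, 1/R) noisy symbols -> (batch, K, 1) bit estimates.
    def __init__(self, n_per_bit=2, hidden=100):
        super().__init__()
        self.rnn = nn.GRU(n_per_bit, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, y):
        h, _ = self.rnn(y)
        return torch.sigmoid(self.fc(h))               # matches the BCE loss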

Encoder                     2-layer Bi-GRU w/ 25 units
Decoder                     2-layer Bi-GRU w/ 100 units
Power constraint            bit-wise normalization
Batch size                  1000
Learning rate               0.001, × 1/10 when saturated
Num epochs                  250
Block length                100
Batches per epoch           100
Optimizer                   Adam
Loss                        Binary Cross-Entropy (BCE)
Min Dist Regularizer        0.0
Train SNR at rate 1/2       mixture of 0 to 8 dB
Train SNR at rate 1/3       mixture of -1 to 2 dB
Train SNR at rate 1/4       mixture of -2 to 2 dB
Train method                train ENC once, then DEC 5 times
Min Distance Regularizer    0.001 (s = 10)

Fig. 4. RNN-based Channel AE hyperparameters.

comparison; however, binarizing with the sign function is not differentiable. Bypassing the non-differentiability using a straight-through estimator (STE) [50] degrades performance in channel coding [51]. Binarizing the code and comparing against higher-order modulation are deferred to future research.

D. Performance of RNN-Based Channel AE: Non-AWGN Setting

We test the robustness and adaptivity of the RNN-based Channel AE on three families of channels:

1) AWGN channel: y = x + z, where z ∼ N(0, σ²).
2) Additive T-distribution Noise (ATN) channel: y = x + z, where z ∼ T(ν, σ²), for ν = 3, 5. This heavy-tailed noise distribution models bursty interference.
3) Radar channel: y = x + w + z, where z ∼ N(0, σ1²) and w ∼ N(0, σ2²) with probability p = 0.05 (assume σ1 ≪ σ2). This noise model appears when there is bursty interference, for example, when radar interferes with LTE [39], [15].
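These three channels are straightforward to simulate. The NumPy sketch below is our own illustration; the parameter names are assumptions.

import numpy as np

def awgn(x, sigma):
    # AWGN: y = x + z, z ~ N(0, sigma^2)
    return x + sigma * np.random.randn(*x.shape)

def atn(x, sigma, nu=3.0):
    # Additive T-distribution Noise: y = x + z with z drawn from a (scaled)
    # Student-t distribution with nu degrees of freedom (heavy-tailed).
    return x + sigma * np.random.standard_t(nu, size=x.shape)

def radar(x, sigma1, sigma2, p=0.05):
    # Radar channel: y = x + z + w, with background noise z ~ N(0, sigma1^2)
    # and a strong sparse pulse w ~ N(0, sigma2^2) occurring w.p. p (sigma1 << sigma2).
    z = sigma1 * np.random.randn(*x.shape)
    pulse = (np.random.rand(*x.shape) < p) * sigma2 * np.random.randn(*x.shape)
    return x + z + pulse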

Robustness
Robustness refers to the property that when the RNN-based Channel AE is trained for the AWGN channel, its test performance on a different channel (ATN or Radar), with no retraining, should not degrade significantly. Most existing codes are designed for AWGN, since it has a clean mathematical abstraction and is the worst-case noise under a given power constraint [1]. When both the encoder and decoder are unaware of the non-AWGN channel statistics, the BER performance degrades. Robustness ensures that both the encoder and decoder perform well under channel mismatch, a typical use case for low-latency schemes when channel estimation and the corresponding receiver calibration are not accurate [5].

Adaptivity
Adaptivity allows the RNN-based Channel AE to learn a decoding algorithm from sufficient data (including retraining) even when the channel statistics do not admit a simple mathematical characterization [28]. We train the RNN-based Channel AE under the ATN and Radar channels with the same hyperparameters shown


Fig. 5. Comparing RNN-based Channel AE with convolutional code on rate 1/4 (left), 1/3 (middle), 1/2 (right). Block length = 100.

in Figure 4 and the same amount of training data, ensuring that the RNN-based Channel AE converges. With both a learnable encoder and decoder, two cases of adaptivity are tested. The first is decoder adaptivity, where the encoder is fixed and only the decoder is further trained. The second is full adaptivity of both the encoder and decoder. In our experiments, encoder-only adaptivity shows no further advantage and is thus omitted.
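In implementation terms, decoder-only adaptivity amounts to freezing the encoder parameters and fine-tuning the decoder on samples drawn from the new channel, while full adaptivity also steps an encoder optimizer. A hedged PyTorch sketch under assumed module interfaces:

import torch
import torch.nn as nn

def adapt_decoder(encoder, decoder, new_channel, opt_dec,
                  steps=1000, batch_size=1000, block_len=100):
    # Decoder-only adaptation: the encoder is frozen and the decoder is
    # fine-tuned on the mismatched channel (a sketch, not the authors' code).
    bce = nn.BCELoss()
    for p in encoder.parameters():
        p.requires_grad_(False)                    # freeze the encoder
    for _ in range(steps):
        bits = torch.randint(0, 2, (batch_size, block_len, 1)).float()
        with torch.no_grad():
            received = new_channel(encoder(bits))  # samples from the new channel
        loss = bce(decoder(received), bits)
        opt_dec.zero_grad(); loss.backward(); opt_dec.step()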

We now evaluate the RNN-based Channel AE for robustness and adaptivity on the ATN and Radar channels. The BER performance is shown in Figure 6. Note that under non-AWGN channels, the RNN-based Channel AE trained on the AWGN channel outperforms the convolutional code with the Viterbi decoder; it decodes more robustly under channel mismatch than the best convolutional code. As shown in Figure 6, the RNN-based Channel AE with decoder-only adaptivity improves over the robust (non-retrained) decoder, while the RNN-based Channel AE with full adaptivity of both the encoder and decoder performs best.

The fully adapted RNN-based Channel AE outperforms the convolutional code even with CSIR, which utilizes the log-likelihood of the T-distributed noise. Thus, jointly designing the encoder and decoder further optimizes the code for the given channel. Even when the underlying mathematical model is far from a clean abstraction, the RNN-based Channel AE can learn a functional code via self-supervised back-propagation.

The RNN-based Channel AE is, to our knowledge, the first neural code that outperforms existing canonical codes in the AWGN channel coding setting, which opens a new direction for constructing efficient neural codes in canonical settings. Furthermore, the RNN-based Channel AE can be applied especially to channels for which closed-form mathematical analysis is not available.

III. DESIGNING LOW LATENCY NEURAL CODES: LEARN

Designing codes under low latency constraints is challenging, since many existing block codes inevitably require long block lengths. To address this challenge, we now propose a new RNN-based encoder and decoder architecture that satisfies a low latency constraint, which we call LEARN. While our encoder and decoder architectures are based on RNNs, we introduce a new block in the decoder architecture so that the decoder can satisfy the extreme low latency constraint. We show that the LEARN code is: (1) significantly more reliable than convolutional codes, which are state-of-the-art under extreme low latency constraints [7], and (2) more robust and adaptive for various channels beyond the AWGN channel. In the following, we first define latency and review the literature on the low latency setting.

A. Low Latency Convolutional Code

Formally, decoder structural delay D is understood in the following setting: to send message bit bt at time t, the causal encoder sends code xt, and the decoder has to decode bt as soon as it receives yt+D. The decoder structural delay D is therefore the number of bits the decoder can look ahead before decoding. A convolutional code has zero encoder delay due to its causal encoder, and the decoder delay is controlled by the optimal Viterbi decoder [35] with a decoding window of length w, which uses only the next w future branches in the trellis to compute the current output. For a rate R = k/n convolutional code, the structural decoder delay is D = k − 1 + kw [11]. We use code rates R = 1/2, 1/3, 1/4 (k = 1 with n = 2, 3, 4); the structural decoder delay is then D = w. Convolutional codes are the state-of-the-art codes under extreme low latency, where D ≤ 50 [7].

In this paper, we confine our scope to investigating extreme low latency with no encoder delay and a low structural decoder delay from D = 1 to D = 10, at code rates 1/2, 1/3, and 1/4. The benchmark we use is convolutional codes with variable memory length. In the unbounded block length setting, longer memory improves performance; under a low latency constraint, however, longer memory does not necessarily mean better performance, since the decoding window is short [7]. Hence, we test all memory lengths up to 7 to obtain the state-of-the-art performance of the Recursive Systematic Convolutional (RSC) code, whose generating functions are shown in Figure 7 (top), with the convolutional code encoder shown in Figure 7 (bottom). The decoder is Viterbi with a decoding window w = D.


Fig. 6. RNN-based Channel AE vs. TBCC: ATN (ν = 3) at rate 1/2 (left), Radar (σ2² = 5.0) at rate 1/3 (middle), ATN (ν = 3) at rate 1/4 (right). TBCC is m = 2, with g11 = 5, g12 = 7 and feedback 7.

m    R=1/2         R=1/3               R=1/4                      Feedback
     g11   g12     g11   g12   g13     g11   g12   g13   g14
1    2     3       1     3     3       1     1     3     3        3
2    5     7       5     7     7       5     7     7     7        7
3    15    17      13    15    17      13    15    15    17       17
4    23    35      25    33    37      25    27    33    37       37
5    53    75      47    53    75      53    67    71    75       75
6    133   171     133   145   175     135   135   147   163      163
7    247   371     225   331   367     237   275   313   357      357

Fig. 7. Convolutional code generator matrices in octal representation (top) [7] and encoder (bottom).

B. LEARN Network Architecture

Using the network design proposed in the previous section, we propose a novel RNN-based neural network architecture for LEARN (both encoder and decoder) that satisfies the low latency constraint. Our proposed LEARN encoder is illustrated in Figure 8 (top left). The causal neural encoder is a causal RNN with two layers of GRU followed by a Fully Connected Neural Network (FCNN). This neural structure ensures that the optimal temporal storage can be learned and extended to a non-linear regime. The power constraint module is bit-wise normalization, as described in the previous section.

Applying a Bi-RNN decoder to a low latency code requires computing lookahead instances for each received information bit, which is computationally expensive in both time and memory. To improve efficiency, the LEARN decoder uses two GRU structures instead of a Bi-RNN: one GRU runs up to the current time slot, and the other GRU runs a further D steps; the outputs of the two GRUs are then summarized by an FCNN. The LEARN decoder ensures that all information bits satisfying the delay constraint can be utilized with a forward pass only. When decoding a received signal, each GRU needs to process only one step ahead, which

Encoder layer          Output dimension
Input                  (K, 1)
GRU (2 layers)         (K, 25)
FCNN (linear)          (K, 1/R)

Decoder layer          Output dimension
Input                  (K, 1/R)
GRU1 (2 layers)        (K, 100)
GRU2 (2 layers)        (K, 100)
FCNN (sigmoid)         (K, 1)

Fig. 8. LEARN encoder (top left), LEARN decoder (top right), and network structures (bottom).

results in a decoding computational complexity of O(1) per bit. Viterbi and BCJR low latency decoders need to traverse the trellis and backtrack to the desired position, which requires going forward one step and backward by the number of steps allowed by the delay constraint, resulting in O(D) computation for decoding each bit. Although the GRU has a large computational constant due to the complexity of the neural network, the computation time is expected to diminish with emerging AI chips [38]. The hyperparameters of LEARN differ from the block code settings. To summarize the differences: (1) the encoder and decoder use GRUs instead of Bi-GRUs, (2) the number of training epochs is reduced to 120, and (3) we do not use a partial minimum distance regularizer.
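A minimal PyTorch sketch of this two-GRU decoder is shown below (layer sizes follow Figure 8). It is one plausible reading of the description above, not the authors' exact implementation: GRU1 summarizes the received symbols up to time t, GRU2 is read D steps ahead to capture the lookahead allowed by the delay constraint, and an FCNN merges the two.

import torch
import torch.nn as nn

class LearnDecoder(nn.Module):
    # LEARN decoder sketch: two unidirectional GRUs plus an FCNN merge.
    def __init__(self, code_rate_inv=3, delay=10, hidden=100):
        super().__init__()
        self.delay = delay
        self.gru1 = nn.GRU(code_rate_inv, hidden, num_layers=2, batch_first=True)
        self.gru2 = nn.GRU(code_rate_inv, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, y):                          # y: (batch, K, 1/R)
        h1, _ = self.gru1(y)                       # summary up to time t
        h2, _ = self.gru2(y)
        # Read GRU2's output D steps ahead of the bit being decoded,
        # repeating the final state near the end of the block.
        h2 = torch.cat([h2[:, self.delay:, :],
                        h2[:, -1:, :].repeat(1, self.delay, 1)], dim=1)
        return torch.sigmoid(self.fc(torch.cat([h1, h2], dim=-1)))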

C. Performance of LEARN: AWGN Setting

Figure 9 shows the BER of the LEARN code and the state-of-the-art RSC codes with varying memory lengths, as listed in Figure 7 (top), for rates 1/2, 1/3, and 1/4 as a function of SNR under latency constraints D = 1 and D = 10. As the figure shows, for rates 1/3 and 1/4 under the AWGN channel, the LEARN code under extreme delay (D = 1 to D = 10) achieves a better Bit Error Rate (BER) than the state-of-the-art RSC codes with varying memory lengths from Figure 7 (top). LEARN outperforms all RSC codes listed in Figure 7 (top) for D ≤ 10 at code rates 1/3 and 1/4, demonstrating a promising application of neural codes under the low latency constraint.

For higher code rates (such as R = 1/2 with D ≥ 5), LEARN shows comparable performance to convolutional codes but degrades at high SNR. We expect that further improvements can be made via improved structure design and hyperparameter optimization, especially at higher rates.

D. Robustness and Adaptivity

The performance of LEARN with respect to robustness and adaptivity is shown in Figure 10 for three different settings: (1) delay D = 10, code rate R = 1/2, on the ATN (ν = 3) channel; (2) delay D = 2, code rate R = 1/3, on the ATN (ν = 3) channel; and (3) delay D = 10, code rate R = 1/4, on the Radar (p = 0.05, σ2 = 5.0) channel. As shown in Figure 10 (left), on ATN (ν = 3), which has heavy-tailed noise, LEARN with robustness (no retraining) outperforms convolutional codes. Improved adaptivity is achieved when both the encoder and decoder are trained, compared to training only the decoder. By exploring a larger space of codes, neurally designed coding schemes can match canonical convolutional codes with Channel State Information at the Receiver (CSIR) at code rate R = 1/2 and outperform convolutional codes with CSIR at code rate R = 1/3.

The same trend holds for Figure 10 (middle), the ATN (ν = 3) channel at code rate R = 1/3, and Figure 10 (right), the Radar (σ2 = 5.0) channel at code rate R = 1/4. Note that under the Radar channel, we apply the heuristic proposed in [15] to the convolutional code baseline. We observe that LEARN with full adaptation gives an order-of-magnitude improvement in reliability over the convolutional code with this heuristic [15]. This experiment shows that, by jointly designing both the encoder and decoder, LEARN can adapt to a broad family of channels. LEARN offers an end-to-end low latency code design method that can be applied to any statistical channel while ensuring good performance.

IV. INTERPRETABILITY OF DEEP MODELS

The promising performance of LEARN and the RNN-based Channel AE leads to a question: how do we interpret what the encoder and decoder have learned? Answering this can inspire future research as well as help us find caveats and limitations. In this section, we present an interpretation analysis via local perturbation for the LEARN and RNN-based Channel AE encoders and decoders.

A. Encoder Interpretability

The significant recurrent length of an RNN is an indicator of its recurrent capacity that helps to interpret neural encoders and decoders. The RNN encoder's significant recurrent length is defined, at time t, as how long a stretch of the output the input ut can impact as the RNN operates recurrently. Assume two input sequences, u1 = u1,1, ..., u1,t, ..., u1,T and u2 = u2,1, ..., u2,t, ..., u2,T, where only u1,t and u2,t differ. Taking a batch of u1 and u2 as input to the RNN encoder, we compare the absolute output difference along the whole block between x1 = f(u1) and x2 = f(u2) to investigate how far the input flip at time t propagates. To investigate LEARN's RNN encoder, we flip only the first bit (position 0) of u1 and u2. The code position refers to the bit position in the block, and the y-axis shows the averaged difference between the two sequences. Figure 11 (top left) shows that for an extremely short delay, D = 1, the encoder's significant recurrent length is short: the effect of the current bit diminishes after 2 bits. As the delay constraint increases, the encoder's significant recurrent length increases accordingly. The LEARN encoder learns to encode locally in order to optimize under the low latency constraint.

For the RNN-based Channel AE with a Bi-RNN encoder, the block length is 100, and the flip is applied at the middle (50th) bit position. Figure 11 (top right) shows the encoder trained under the AWGN and ATN channels. The encoder trained on ATN shows a longer significant dependency length: ATN is a heavy-tailed noise distribution, and increasing the dependency length of the encoder helps to alleviate the effect of extreme values. Note that even the longest significant recurrent length is only 10 steps backward and 16 steps forward; thus, the GRU encoder did not actually learn a very long recurrent dependency. AWGN capacity-achieving codes have built-in mechanisms to improve long-term dependency; for example, the Turbo encoder uses an interleaver [3]. Improving the significant recurrent length via a more learnable structure design is an interesting direction for future research.
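The flip-based perturbation analysis itself is easy to reproduce; a sketch under an assumed encoder interface:

import torch

def encoder_influence(encoder, block_len=100, flip_pos=0, batch=10000):
    # Estimate the encoder's significant recurrent length: flip a single input
    # bit and average the absolute change of the code along the block.
    u1 = torch.randint(0, 2, (batch, block_len, 1)).float()
    u2 = u1.clone()
    u2[:, flip_pos, :] = 1.0 - u2[:, flip_pos, :]       # flip one bit
    with torch.no_grad():
        diff = (encoder(u1) - encoder(u2)).abs()        # (batch, K, 1/R)
    return diff.mean(dim=(0, 2))                        # influence per code position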

B. Decoder Interpretability

The decoder's significant recurrent length can illustrate how it copes with different constraints and channels. Assume two noiseless coded sequences, y1 = y1,1, ..., y1,t, ..., y1,T and y2 = y2,1, ..., y2,t, ..., y2,T, where y2 equals y1 except at time t, with y2,t = y1,t + p, where p is a large deterministic pulse; p = 5.0 in our experiment. We compare the absolute output difference along the whole block between u1 = g(y1) and u2 = g(y2) to investigate how far the pulse noise propagates. For the LEARN decoder, we inject the pulse noise at the starting position. Figure 11 (bottom left) shows that for all delay cases, the noise most significantly affects the position equal to the delay constraint. This shows that the LEARN decoder learns to coordinate with the causal LEARN encoder: when D = 1, the maximal decoder difference along the block is at position 1; when D = 10, it is at position 10. Other code bits show a smaller but non-zero decoder difference.

The LEARN decoder's significant recurrent length implies that it not only learns to utilize the causal encoder's immediate output, but also utilizes outputs in other time slots to help decoding. Note that the maximal significant recurrent length is approximately twice the delay constraint; after less than


Fig. 9. BER curves comparing low latency convolutional code vs LEARN under the AWGN channel with rate 1/4, 1/3, and 1/2.

Fig. 10. LEARN robustness and adaptivity in ATN (ν = 3, R = 1/2, D = 10) (left), ATN (ν = 3, R = 1/3, D = 2) (middle), and Radar (p = 0.05, σ2 = 5.0, R = 1/4, D = 10) (right) channels.

approximately 2D steps, the impact diminishes. The LEARN decoder learns to decode locally to optimize under different low latency constraints. For the RNN-based Channel AE with the Bi-RNN encoder, the perturbation is applied at the middle (50th) position, again with block length 100. Figure 11 (bottom right) shows the decoder trained under the AWGN and ATN channels. The AWGN-trained decoder is more sensitive to pulse noise with extreme values than the ATN-trained decoder. By reducing its sensitivity to extreme noise, the ATN-trained decoder learns to alleviate the effect of non-Gaussian noise. The RNN-based Channel AE decoder thus learns to optimize under different channel settings.

V. CONCLUSION

In this paper, we demonstrated the power of neural network-based architectures to achieve state-of-the-art performance via joint encoder and decoder design. We showed that our learned codes significantly outperform convolutional codes at short to medium block lengths. However, in order to outperform state-of-the-art codes such as Turbo or LDPC codes, we require additional mechanisms, such as interleaving, to introduce long-term dependencies. This promises to be a fruitful direction for future exploration.

In the low-latency regime, we achieved state-of-the-art performance with LEARN codes. Furthermore, LEARN codes beat existing codes by an order of magnitude in reliability when

[Figure 11 plots: the x-axis is the code position; the y-axis is the averaged output difference. Top left: LEARN encoder interpretation (delay = 1, 5, 10). Top right: RNN-based Channel AE encoder interpretation (AWGN trained vs. ATN trained). Bottom left: LEARN decoder interpretation (delay = 1, 5, 10). Bottom right: RNN-based Channel AE decoder interpretation (AWGN trained vs. ATN trained).]

Fig. 11. Encoder interpretation for LEARN (top left) and RNN-based Channel AE (top right). Decoder interpretation for LEARN (bottom left) and RNN-based Channel AE (bottom right).

there is a channel mismatch. Our present design is restricted to very low structural latency; however, with additional mechanisms for introducing longer-term dependencies [40], [41], we believe it is possible to extend these designs to cover a larger range of delays. This is another interesting direction for future work. Finally, we have focused only on very short structural delay. Latency depends on other factors as well (e.g., computational complexity). Optimizing neural decoders to reduce other contributors to latency poses an interesting open problem.

ACKNOWLEDGMENT

This work was supported in part by NSF awards CCF-1651236, CIF-1908003, CIF-1703403, CNS-2002932, CNS-2002664, and by Intel. We also thank the reviewers for their helpful and supportive comments.

REFERENCES

[1] Shannon, C. E. "A mathematical theory of communication." Bell System Technical Journal 27.3 (1948): 379-423.

[2] Arikan, Erdal. "A performance comparison of polar codes and Reed-Muller codes." IEEE Communications Letters 12.6 (2008).

[3] Berrou, Claude, Alain Glavieux, and Punya Thitimajshima. "Near Shannon limit error-correcting coding and decoding: Turbo-codes." Communications, 1993. ICC '93.

[4] MacKay, D. J. C., and Radford M. N. "Near Shannon limit performance of low density parity check codes." Electronics Letters 32.18 (1996).

[5] Richardson, Tom, and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[6] Sybis, Michal, et al. "Channel coding for ultra-reliable low-latency communication in 5G systems." Vehicular Technology Conference (VTC-Fall), IEEE 84th. IEEE, 2016.

[7] Rachinger, Christoph, Johannes B. Huber, and Ralf R. Muller. "Comparison of convolutional and block codes for low structural delay." IEEE Transactions on Communications 63.12 (2015): 4629-4638.

[8] Liva, G., Gaudio, L., Ninacs, T., and Jerkovits, T. "Code design for short blocks: A survey." arXiv preprint arXiv:1610.00873, 2016.

[9] Liva, G., Durisi, G., Chiani, M., Ullah, S. S., and Liew, S. C. "Short codes with mismatched channel state information: A case study." 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2017, pp. 1-5.

[10] Durisi, G., Koch, T., and Popovski, P. "Toward massive, ultrareliable, and low-latency wireless communication with short packets." Proceedings of the IEEE 104.9 (2016): 1711-1726.

[11] Maiya, Shashank V., Daniel J. Costello, and Thomas E. Fuja. "Low latency coding: Convolutional codes vs. LDPC codes." IEEE Transactions on Communications 60.5 (2012): 1215-1225.

[12] Polyanskiy, Yury, H. Vincent Poor, and Sergio Verdu. "Channel coding rate in the finite blocklength regime." IEEE Transactions on Information Theory 56.5 (2010): 2307-2359.

[13] Shirvanimoghaddam, M., Mohammadi, M. S., Abbas, R., Minja, A., Yue, C., Matuz, B., Han, G., Lin, Z., Liu, W., Li, Y., and Johnson, S. "Short block-length codes for ultra-reliable low latency communications." IEEE Communications Magazine 57.2 (2018): 130-137.

[14] Li, Junyi, Xinzhou Wu, and Rajiv Laroia. OFDMA Mobile Broadband Communications: A Systems Approach. Cambridge University Press, 2013.

[15] Safavi-Naeini, H. A., Ghosh, C., Visotsky, E., Ratasuk, R., and Roy, S. "Impact and mitigation of narrow-band radar interference in down-link LTE." 2015 IEEE International Conference on Communications (ICC), 2015, pp. 2644-2649.

[16] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

[17] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv:1510.00149 (2015).

[18] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

[19] O'Shea, Timothy J., Kiran Karra, and T. Charles Clancy. "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention." 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2016.

[20] O'Shea, Timothy, and Jakob Hoydis. "An introduction to deep learning for the physical layer." IEEE Transactions on Cognitive Communications and Networking 3.4 (2017): 563-575.

[21] Farsad, Nariman, and Andrea Goldsmith. "Neural network detection of data sequences in communication systems." IEEE Transactions on Signal Processing, Jan 2018.

[22] Dörner, S., Cammerer, S., Hoydis, J., and ten Brink, S. "Deep learning based communication over the air." IEEE Journal of Selected Topics in Signal Processing 12.1 (2017): 132-143.

[23] O'Shea, Timothy J., Tugba Erpek, and T. Charles Clancy. "Deep learning based MIMO communications." arXiv preprint arXiv:1707.07980 (2017).

[24] Nachmani, Eliya, Yair Be'ery, and David Burshtein. "Learning to decode linear codes using deep learning." 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016.

[25] Nachmani, Eliya, et al. "Deep learning methods for improved decoding of linear codes." IEEE Journal of Selected Topics in Signal Processing 12.1 (2018): 119-131.

[26] Gruber, T., Cammerer, S., Hoydis, J., and ten Brink, S. "On deep learning-based channel decoding." 2017 51st Annual Conference on Information Sciences and Systems (CISS), 2017, pp. 1-6.

[27] Cammerer, S., Gruber, T., Hoydis, J., and ten Brink, S. "Scaling deep learning-based decoding of polar codes via partitioning." GLOBECOM 2017 - 2017 IEEE Global Communications Conference, 2017, pp. 1-6.

[28] Kim, Hyeji, Yihan Jiang, Ranvir Rana, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. "Communication algorithms via deep learning." Sixth International Conference on Learning Representations (ICLR), 2018.

[29] Schalkwijk, J., and Kailath, T. "A coding scheme for additive noise channels with feedback–I: No bandwidth constraint." IEEE Transactions on Information Theory 12.2 (1966): 172-182.

[30] Chance, Z., and Love, D. J. "Concatenated coding for the AWGN channel with noisy feedback." IEEE Transactions on Information Theory 57.10 (2011): 6633-6649.

[31] Hammer, Barbara. "On the approximation capability of recurrent neural networks." Neurocomputing, Vol. 31, pp. 107-123, 2010.

[32] Aoudia, F. A., and Hoydis, J. "End-to-end learning of communications systems without a channel model." 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 298-303.

[33] Felix, A., Cammerer, S., Dörner, S., Hoydis, J., and ten Brink, S. "OFDM-Autoencoder for end-to-end learning of communications systems." 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1-5.

[34] Kim, Hyeji, Yihan Jiang, Ranvir Rana, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. "Deepcode: Feedback codes via deep learning." NeurIPS 2018.


[35] Viterbi, Andrew. "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm." IEEE Transactions on Information Theory 13.2 (1967): 260-269.

[36] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. "Empirical evaluation of gated recurrent neural networks on sequence modeling." NIPS 2014 Workshop on Deep Learning, December 2014.

[37] Ioffe, S., and Szegedy, C. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:448-456.

[38] Ovtcharov, Kalin, et al. "Accelerating deep convolutional neural networks using specialized hardware." MSR Whitepaper 2.11 (2015).

[39] Sanders, Geoffrey A. "Effects of radar interference on LTE (FDD) eNodeB and UE receiver performance in the 3.5 GHz band." US Department of Commerce, National Telecommunications and Information Administration, 2014.

[40] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NeurIPS 2014.

[41] Jaderberg, Max, Karen Simonyan, Andrew Zisserman, et al. "Spatial transformer networks." NeurIPS 2015.

[42] Rifai, Salah, et al. "Contractive auto-encoders: Explicit invariance during feature extraction." Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011.

[43] Hoffer, Elad, Itay Hubara, and Daniel Soudry. "Train longer, generalize better: Closing the generalization gap in large batch training of neural networks." Advances in Neural Information Processing Systems, 2017.

[44] Poole, Ben, Jascha Sohl-Dickstein, and Surya Ganguli. "Analyzing noise in autoencoders and deep networks." arXiv preprint arXiv:1406.1831 (2014).

[45] Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. "Don't decay the learning rate, increase the batch size." International Conference on Learning Representations, 2018.

[46] Buja, Andreas, Werner Stuetzle, and Yi Shen. "Loss functions for binary class probability estimation and classification: Structure and applications." Working draft, November 3, 2005.

[47] Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in Neural Information Processing Systems, 2012.

[48] Sabri, Motaz, and Takio Kurita. "Effect of additive noise for multi-layered perceptron with autoencoders." IEICE Transactions on Information and Systems 100.7 (2017): 1494-1504.

[49] Kingma, D. P., and J. Ba. "Adam: A method for stochastic optimization." International Conference on Learning Representations (ICLR), 2015.

[50] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. "Binarized neural networks." Advances in Neural Information Processing Systems, 2016, pp. 4107-4115.

[51] Jiang, Y., Kim, H., Asnani, H., Kannan, S., Oh, S., and Viswanath, P. "Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels." Advances in Neural Information Processing Systems, 2019, pp. 2754-2764.

Yihan Jiang Yihan Jiang is a Ph.D. candidate in the Electrical and Computer Engineering Department, University of Washington, Seattle, Washington. He received his M.S. degree in Electrical and Computer Engineering from UC San Diego in 2014 and a B.S. degree from the Beijing Institute of Technology in 2012. His research interests are in the areas of channel coding, information theory, deep learning, and federated learning.

Hyeji Kim Hyeji Kim is a researcher at Samsung AI Research Cambridge. Prior to joining Samsung AI Research in 2018, she was a postdoctoral research associate at the University of Illinois, Urbana-Champaign. She received her Ph.D. and M.S. degrees in Electrical Engineering from Stanford University in 2016 and 2013, respectively, and her B.S. degree with honors in Electrical Engineering from KAIST in 2011. She is a recipient of the Stanford Graduate Fellowship and a participant in the Rising Stars in EECS Workshop, 2015.

Himanshu Asnani Himanshu Asnani is currently a Reader (equivalent to tenure-track Assistant Professor) in the School of Technology and Computer Science at the Tata Institute of Fundamental Research, Mumbai. He is also an Affiliate Assistant Professor in the Electrical and Computer Engineering Department at the University of Washington, Seattle, where he previously held a Research Associate position. His current research interests include machine learning, causal inference, and information and coding theory. He received his M.S. and Ph.D. degrees from the Electrical Engineering Department at Stanford University in 2011 and 2014, respectively, where he was a Stanford Graduate Fellow. He received his B.Tech. from IIT Bombay in 2009. Following his graduation, Himanshu worked briefly as a System Architect at Ericsson Silicon Valley, after which he was involved in a couple of education startups. In the past, he has also held visiting faculty appointments at Stanford University and IIT Bombay. He was the recipient of the 2014 Marconi Society Paul Baran Young Scholar Award, the 2018 Amazon Catalyst Fellow Award, the Best Paper Award at MobiHoc 2009, and was a finalist for the Student Paper Award at ISIT 2011.

Sreeram Kannan Sreeram Kannan is currently an Assistant Professor at the University of Washington, Seattle. He was a postdoctoral scholar at the University of California, Berkeley between 2012 and 2014, before which he received his Ph.D. in Electrical Engineering and M.S. in Mathematics from the University of Illinois, Urbana-Champaign. He is a recipient of the 2019 UW ECE Outstanding Teaching Award, the 2017 NSF Faculty Early CAREER Award, the 2013 Van Valkenburg Outstanding Dissertation Award from UIUC, a co-recipient of the 2010 Qualcomm Cognitive Radio Contest first prize, a recipient of the 2010 Qualcomm (CTO) Roberto Padovani Outstanding Intern Award, a recipient of the SVC Aiya Medal from the Indian Institute of Science, 2008, and a co-recipient of the Intel India Student Research Contest first prize, 2006. His research interests include the applications of information theory and learning to blockchains, computational biology, and wireless networks.

Sewoong Oh Sewoong Oh is an Associate Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Prior to joining the University of Washington in 2019, he was an Assistant Professor in the Department of Industrial and Enterprise Systems Engineering at the University of Illinois, Urbana-Champaign, starting in 2012. He received his Ph.D. from the Department of Electrical Engineering at Stanford University in 2011 under the supervision of Andrea Montanari. Following his Ph.D., he worked as a postdoctoral researcher at the Laboratory for Information and Decision Systems (LIDS) at MIT under the supervision of Devavrat Shah. Sewoong's research interest is in theoretical machine learning. He was co-awarded the ACM SIGMETRICS Best Paper Award in 2015, the NSF CAREER Award in 2016, the ACM SIGMETRICS Rising Star Award in 2017, and the Google Faculty Research Award in 2017 and 2020.

Pramod Viswanath Pramod Viswanath is a Professor of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. He received his Ph.D. degree in Electrical Engineering and Computer Science from the University of California, Berkeley in 2000. His current research interests include blockchain technologies from a variety of angles: networking protocols, consensus algorithms, payment channels, distributed coded storage, and incentive designs. He is a co-founder and CEO of Applied Protocol Research, a startup focused on developing core blockchain technologies. He has received the Eliahu Jury Award, the Bernard Friedman Prize, an NSF CAREER award, and the Best Paper Award at the Sigmetrics conference in 2015. He is a co-author, with David Tse, of the text Fundamentals of Wireless Communication, which has been used in over 60 institutions worldwide.



APPENDIX A
RECURRENT NEURAL NETWORKS: AN INTRODUCTION

As illustrated in Figure 12 (top left), an RNN is defined as a general function $f(\cdot)$ such that $(y_t, h_t) = f(x_t, h_{t-1})$ at time $t$, where $x_t$ is the input, $y_t$ is the output, $h_t$ is the state passed to the next time slot, and $h_{t-1}$ is the state from the previous time slot. An RNN can only emulate causal sequential functions. Indeed, it is known that an RNN can capture a general family of measurable functions from the input time sequence to the output time sequence [31]. As illustrated in Figure 12 (top right), a Bidirectional RNN (Bi-RNN) combines a forward and a backward RNN and can infer the current state from both past and future inputs. A Bi-RNN is defined as $(y_t, h^f_t, h^b_t) = f(x_t, h^f_{t-1}, h^b_{t+1})$, where $h^f_t$ and $h^b_t$ stand for the states at time $t$ of the forward and backward RNNs, respectively [16].

An RNN is a restricted structure that shares parameters across time slots over the whole block, which makes it naturally generalizable to longer block lengths. Moreover, RNNs can be viewed as over-parameterized, non-linear convolutional codes for both the encoder and the decoder, since a convolutional code encoder can be represented by a causal RNN and the BCJR forward-backward algorithm can be emulated by a Bi-RNN [28]. There are several choices of the parametric function $f(\cdot)$ for an RNN, such as the vanilla RNN, the Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM). The vanilla RNN is known to be hard to train due to vanishing gradients; LSTM and GRU are the most widely used RNN variants, which use gating schemes to alleviate this problem [16]. We empirically compare the vanilla RNN with GRU and LSTM via their test loss trajectories, i.e., the test loss as a function of training time, averaged over 5 independent experiments. Figure 12 (bottom) shows that the vanilla RNN converges more slowly, while GRU converges quickly, and GRU and LSTM reach similar final test performance. Since GRU has lower computational complexity, we use GRU as our primary network architecture [36]. In this paper, we use the terms GRU and RNN interchangeably.
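For concreteness, the sketch below shows a small bidirectional GRU block of the kind described above, written in PyTorch. The class name `BiGRUBlock`, the layer sizes, and the rate-1/2 input shape are illustrative assumptions rather than the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class BiGRUBlock(nn.Module):
    """Minimal bidirectional GRU block: maps a length-T input sequence to a
    length-T output sequence using forward and backward states, as in the
    Bi-RNN definition above."""
    def __init__(self, in_dim=2, hidden_size=100, out_dim=1):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_size, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, out_dim)   # concatenation of h^f_t and h^b_t

    def forward(self, x):            # x: (batch, T, in_dim)
        y, _ = self.rnn(x)           # y: (batch, T, 2 * hidden_size)
        return self.out(y)           # per-time-step output y_t

# Example: a rate-1/2 noisy codeword of block length 100 (2 channel symbols per time step)
noisy = torch.randn(32, 100, 2)
decoder = BiGRUBlock(in_dim=2, hidden_size=100, out_dim=1)
logits = decoder(noisy)              # (32, 100, 1) pre-sigmoid bit estimates
```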

APPENDIX B
POWER CONSTRAINT MODULE

Power normalization enforces unit power on the code by normalizing the encoder output. In the following, we investigate three differentiable implementations of the power constraint layer. We let $b$, $x$, and $\tilde{x}$ denote the message bit sequence, the output of the encoder, and the normalized codeword, respectively.

1. Hard power constraint: we use a hyperbolic tangent function in training ($\tilde{x} = \tanh(x)$) and threshold to $-1$ and $+1$ at test time, which allows only discrete coding schemes.

2. Bit-wise normalization: $\mathbb{E}\|X_i\|_2^2 \le 1, \forall i$. We set $\tilde{x}_i = \frac{x_i - \mathrm{mean}(x_i)}{\mathrm{std}(x_i)}, \forall i$, i.e., the power is normalized separately for each coded bit position.

Fig. 12. Basic RNN structure (top left), Bi-RNN (top right), and test loss trajectories for RNN variant selection (bottom)

3. Block-wise normalization: $\mathbb{E}\|X\|_2^2 \le 1$. We set $\tilde{x} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$, which allows power to be reallocated across the code block.

For bit-wise and block-wise normalization, the behavior differs between training and testing. During training, the normalization uses the statistics of the current mini-batch. During testing, the mean and standard deviation are precomputed over many samples so that their estimates are accurate.

Figure 13 (top) shows that block-wise normalization offers a slightly better test loss trajectory, while bit-wise normalization performs slightly worse. The hard power constraint using tanh results in high test error due to saturating gradients. To enforce the power constraint most strictly, we use bit-wise normalization in this paper.
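A minimal sketch of the bit-wise normalization layer follows, assuming a PyTorch implementation with a `precompute_stats` helper for the test-time statistics; this illustrates the idea rather than reproducing the paper's released code.

```python
import torch
import torch.nn as nn

class BitwiseNormalize(nn.Module):
    """Bit-wise power normalization: zero mean and unit power per coded bit position.
    Training uses mini-batch statistics; testing uses statistics precomputed over many samples."""
    def __init__(self):
        super().__init__()
        self.register_buffer("mu", torch.zeros(1))
        self.register_buffer("sigma", torch.ones(1))

    @torch.no_grad()
    def precompute_stats(self, x):           # x: a large batch of raw encoder outputs (B, T, D)
        self.mu = x.mean(dim=0, keepdim=True)
        self.sigma = x.std(dim=0, keepdim=True) + 1e-8

    def forward(self, x):                    # x: (batch, T, D) raw encoder output
        if self.training:
            mu = x.mean(dim=0, keepdim=True)            # per-position mean over the mini-batch
            sigma = x.std(dim=0, keepdim=True) + 1e-8   # per-position std over the mini-batch
            return (x - mu) / sigma
        return (x - self.mu) / self.sigma               # test time: precomputed statistics
```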

APPENDIX C
TRAINING METHODOLOGY

In this section, we give empirical evidence for the effectiveness of the training techniques used in this paper.

A. Using Large Batch Sizes

Deep learning models typically use mini-batching to improve generalization and reduce memory usage. A small random mini-batch usually yields better generalization [43][45], while a large batch size requires more training to generalize.


However, Figure 13 (middle) shows that a larger batch size results in much better generalization for Channel AE, while a small batch size tends to saturate at a high test loss. A large batch size is required for the following reasons:

(1) A large batch offers better power constraint statistics [37]. With a large batch size, the power constraint module's normalization gives a better estimate of the mean and standard deviation, which makes the encoder output less noisy and thus allows the decoder to be trained more accurately.

(2) A large batch size gives a better gradient estimate. Since Channel AE is trained with self-supervision, improvements come from back-propagating errors; because rare extreme errors can push the gradient in the wrong direction, a large batch size mitigates this issue.

Randomness in the training mini-batch also improves generalization [16]. Figure 13 (bottom) shows that even for a small block length ($L = 10$), where enumerating all possible messages is feasible, random mini-batching outperforms a fixed full batch that enumerates all possible messages in each step. Training with all possible codewords leads to worse test performance, while training with a large random batch (5000) outperforms the full-batch setting. We therefore conclude that a large random batch gives better results. In this paper, due to GPU memory limitations, we use a batch size of 1000.

B. Using Binary Cross-Entropy Loss

The input $u \in \{0,1\}^k$ and output $\hat{u} \in \{0,1\}^k$ are binary, which makes training Channel AE a binary self-supervised classification problem [19]. Binary cross-entropy (BCE) is preferable by the surrogate loss argument [46]. MSE and its variants can also be used as the loss function for Channel AE [23][26], since MSE implicitly regularizes over-confident decoding. The comparison of MSE and BCE loss is shown in Figure 14 (top).

Although the final test losses for BCE and MSE both converge, MSE converges more slowly. This is because MSE penalizes over-confident bits, which makes the learning gradient sparser. Since faster convergence and better generalization are desired, BCE is used as the primary loss function.
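For illustration, the two losses can be computed as below in PyTorch; the tensor shapes and the placeholder `logits` are assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()     # numerically stable BCE on pre-sigmoid decoder outputs
mse = nn.MSELoss()               # alternative loss discussed above

u = torch.randint(0, 2, (1000, 100, 1)).float()    # message bits: batch 1000, block length 100
logits = torch.randn(1000, 100, 1)                 # placeholder for decoder outputs before the sigmoid

loss_bce = bce(logits, u)                          # primary loss used in this paper
loss_mse = mse(torch.sigmoid(logits), u)           # MSE on bit probabilities, for comparison
```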

C. Separately Training Encoder and Decoder

Training the encoder and decoder jointly with end-to-end back-propagation can get stuck at saddle points. We argue that training Channel AE benefits from training the encoder and decoder separately [32]. An accurate gradient for the encoder can be computed only when the decoder is optimal for the given encoder; thus, after each encoder update, training the decoder toward convergence makes the encoder gradient more trustworthy. However, training the decoder to convergence at every step is computationally expensive. We empirically compare different training schedules in Figure 14 (bottom): joint training saturates easily, whereas training the encoder once for every 5 decoder updates performs best and is used in this paper.
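The alternating schedule can be sketched as follows; `encoder`, `decoder`, `channel`, `sample_bits`, and `num_epochs` are assumed placeholders, and this outline is a simplification of the actual training pipeline.

```python
import torch

# encoder, decoder, channel, and sample_bits are assumed to be defined elsewhere
enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
dec_opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

def train_step(opt):
    """One update of either the encoder or the decoder, selected by the optimizer passed in."""
    u = sample_bits(batch_size=1000, block_len=100)    # fresh random message bits
    x = encoder(u)                                      # encode + power constraint
    y = channel(x)                                      # noise injection, e.g., AWGN
    loss = bce(decoder(y), u)
    encoder.zero_grad(); decoder.zero_grad()            # clear both, step only one side
    loss.backward()
    opt.step()
    return loss.item()

for epoch in range(num_epochs):
    train_step(enc_opt)                                 # one encoder update ...
    for _ in range(5):
        train_step(dec_opt)                             # ... followed by 5 decoder updates
```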

Fig. 13. Test loss trajectories for different power constraints (top), different batch sizes with block length 100 (middle), and random batching with block length 10 (bottom)

D. Adding Minimum Distance Regularizer

Naively optimizing Channel AE can result in a paired local optimum: a sub-optimal encoder and decoder pair can be locked into a saddle point. Adding a regularizer to the loss is a common way to escape local optima [16]. Coding theory suggests that maximizing the minimum distance between the codewords of all possible input messages [5] improves coding performance. However, since the number of possible messages grows exponentially with the code block length, computing a loss term for the minimum distance of a long block code becomes prohibitive.

Fig. 14. Test loss trajectories for different loss functions (top) and joint vs. separate training (bottom)

Exploiting the locality inherent to RNN codes, we introduce a loss term solely for the encoder, which we refer to as the partial minimum code distance regularizer. The partial minimum code distance $d(u_L) = \min_{u_1, u_2 \in \{0,1\}^L,\, u_1 \neq u_2} \{\|f_\theta(u_1) - f_\theta(u_2)\|_2^2\}$ is the minimum distance among all messages of length $L$. Computing the pairwise distances requires $O\!\left(\binom{2^L}{2}\right)$ computations. For a long block code with block length $L_B \gg L$, if the RNN-encoded portion of length $L$ has minimum distance $M$, then the block code of length $L_B$ has minimum distance at least $M$, while the minimum distance can be as large as $\frac{L_B}{L} M$. The partial minimum code distance is thus a compromise on computation: it guarantees a large minimum distance over the small block length $L$, while hoping that the minimum distance over the longer block length remains large. The loss objective of Channel AE with the partial minimum code distance regularizer is $\mathcal{L}(u) = \mathbb{E}_z \|g_\phi(h(f_\theta(u)) + z) - u\|_2^2 + \lambda\, d(u_L)$. To beat the convolutional code via the RNN autoencoder, we add the minimum distance regularizer for block length 100, with $L = 10$ and $\lambda = 0.001$. The performance is shown in Figure 15: the top left panel shows the test loss trajectory, and the top right panel shows the BER.

Fig. 15. Effect of the partial minimum distance regularizer: test loss trajectory (top left) and BER (top right). Test loss trajectories for different optimizers with learning rate 0.001 (bottom)
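The partial minimum code distance $d(u_L)$ introduced above can be computed for small $L$ roughly as follows; `encoder` plays the role of $f_\theta$, and the function name and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import itertools
import torch

def partial_min_distance(encoder, L=10):
    """Minimum pairwise squared distance between the codewords of all 2^L length-L messages."""
    msgs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=L)))   # (2^L, L)
    codes = encoder(msgs.unsqueeze(-1))                  # (2^L, L, code_dim)
    codes = codes.reshape(codes.shape[0], -1)            # flatten each codeword
    d = torch.cdist(codes, codes, p=2) ** 2              # pairwise squared L2 distances
    d.fill_diagonal_(float("inf"))                       # exclude self-distances
    return d.min()

# Enters the encoder loss weighted by lambda, as in the objective L(u) above.
```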

E. Using the Adam Optimizer

The learning rates for the encoder and the decoder have to be adaptive to compensate for noise realizations of different magnitudes. Also, as the training loss decreases, decoding errors become less likely, which makes the gradients for both encoder and decoder sparse. Adam [49] has an adaptive learning rate and uses exponential moving averages suited to sparse, non-stationary gradients, which makes it a suitable optimizer in theory. The comparison of optimizers in Figure 15 (bottom) shows that Adam empirically outperforms the alternatives, with faster convergence and better generalization; SGD with learning rate 0.001 fails to converge and is highly unstable. We therefore use Adam for training Channel AE.

F. Adding More Capacity (Parameters) to the Decoder than the Encoder

Channel AE can be viewed as an overcomplete autoencoder with noise injected in the middle layer. In what follows, we analyze the effect of the noise as in [39]; for simplicity we consider only the mean squared error (MSE) loss (the binary cross-entropy (BCE) loss follows the same procedure). Assume $z \sim N(0, \sigma^2)$ is the added Gaussian noise, $u$ is the input binary bit sequence, $f_\theta$ is the encoder, $g_\phi$ is the decoder, and $h(\cdot)$ is the power constraint module. Applying a Taylor expansion (see Appendix D for the derivation), the loss of Channel AE can be written as $\mathcal{L} = \|g_\phi(h(f_\theta(u))) - u\|_2^2 + \sigma^2 \|J_{g_\phi}(u)\|_F^2$, where $\|J_{g_\phi}(u)\|_F^2 = \sum_{i=1}^{k} \sum_{j=1}^{n} \left(\frac{\partial g_\phi(c)_i}{\partial c_j}\right)^2$ is the squared Frobenius norm of the Jacobian of $g_\phi$. The reconstruction error $\|g_\phi(h(f_\theta(u))) - u\|_2^2$ can be interpreted as the coding error when no noise is added. With a smaller $\frac{\partial g_\phi(c)_i}{\partial c_j}$, the decoder becomes invariant and robust to small variations of the input [42]; thus the Jacobian term $\sigma^2 \sum_{i=1}^{k} \sum_{j=1}^{n} \left(\frac{\partial g_\phi(c)_i}{\partial c_j}\right)^2$ acts as a regularizer encouraging the decoder to be locally invariant to noise [44].


The Jacobian term reduces the sensitivity of the decoder, which improves its generalization.

Empirically, there exists an optimal training SNR for neural decoders [28] and for Channel AE [19]. When training with large noise $\sigma^2$, the Jacobian term dominates, so the reconstruction error becomes non-zero, which degrades performance. When training with too little noise, the decoder is not locally invariant, which reduces its generalization capability.
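For reference, one common convention for mapping a training Eb/N0 (in dB) and code rate to the noise standard deviation of a unit-power code is shown below; this is a standard textbook conversion given as an assumption about the setup, not a formula taken from this paper.

```python
import math

def ebn0_db_to_sigma(ebn0_db: float, rate: float) -> float:
    """Noise standard deviation for a unit-power rate-`rate` code at a given Eb/N0 (dB)."""
    return math.sqrt(1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0)))

print(ebn0_db_to_sigma(0.0, rate=0.5))   # 1.0
print(ebn0_db_to_sigma(2.0, rate=0.5))   # ~0.794
```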

Moreover, since the Jacobian term regularizes only the decoder, the decoder needs more capacity; empirically, the neural decoder has to be more complex than the encoder. After training for 120 epochs, the encoder/decoder sizes of 25/100 and 100/400 units show an improved test loss compared to the other configurations in the table below. Since the 25/100 configuration works as well as the 100/400 configuration, we use a 25-unit encoder and a 100-unit decoder for most of our applications. Further hyperparameter optimization, such as Bayesian optimization [47], could lead to even better performance.

Encoder and Decoder Hyperparameter Design after 120 Epochs

Enc Units    Dec Units    Test Loss
25           100          0.180
25           400          0.640
100          100          0.690
100          400          0.181

APPENDIX D
FURTHER EMPIRICAL RESULTS

A. Alternative Minimum Distance Regularizer

Since RNN encoders have small dependency lengths (see Section IV), messages with small Hamming distance may produce codewords with small distance. An alternative regularizer therefore directly targets the minimum distance between the codewords of messages with small Hamming distance. The method enumerates all messages within Hamming distance $s$ of a random message $u$ of length $L_B$, which gives $\sum_{i=1}^{s} \binom{L_B}{i} + 1$ messages, and then uses the minimum codeword distance among the enumerated messages as a regularization term, as sketched below. However, this method does not guarantee any minimum distance property even for short block lengths, and its computational complexity is high even for small $s$. Empirically, this method does not work well.
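A sketch of this alternative regularizer under the same assumptions as before (`encoder` stands in for $f_\theta$; names and shapes are illustrative). Even for $s = 2$ and $L_B = 100$, the enumeration already contains 5,051 messages, which illustrates the complexity issue noted above.

```python
import itertools
import torch

def hamming_ball_min_distance(encoder, block_len=100, s=2):
    """Minimum codeword distance among all messages within Hamming distance s
    of a single random length-block_len message."""
    u = torch.randint(0, 2, (block_len,)).float()
    msgs = [u]
    for d in range(1, s + 1):                                # flip every subset of up to s positions
        for idx in itertools.combinations(range(block_len), d):
            v = u.clone()
            v[list(idx)] = 1.0 - v[list(idx)]
            msgs.append(v)
    msgs = torch.stack(msgs)                                  # sum_{i<=s} C(block_len, i) + 1 messages
    codes = encoder(msgs.unsqueeze(-1)).reshape(len(msgs), -1)
    dist = torch.cdist(codes, codes, p=2) ** 2
    dist.fill_diagonal_(float("inf"))
    return dist.min()
```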

B. Loss Analysis for Encoder and Decoder Size

This section gives the derivation of the loss analysis used in the main text. Using the mean squared error (MSE), the loss of Channel AE is $\mathcal{L} = \mathbb{E}_z \|g_\phi(h(f_\theta(u)) + z) - u\|_2^2$. The output of the encoder is denoted as $c = h(f_\theta(u))$. Using a first-order Taylor expansion following [48], under the assumption that the noise $z$ is small and ignoring all higher-order terms, the decoder is approximated as
$$g_\phi(c + z) = g_\phi(c) + \frac{\partial g_\phi(c)}{\partial c} z + O(z).$$
Note that the first-order Taylor expansion is a local approximation of a function; hence, ignoring the higher-order terms is valid only for small $z$. Expanding the MSE loss then gives
$$\mathcal{L} = \|g_\phi(h(f_\theta(u))) - u\|_2^2 + \mathbb{E}_z\, \mathrm{tr}\!\left(z^T \frac{\partial g_\phi(c)}{\partial c}^T \frac{\partial g_\phi(c)}{\partial c}\, z\right),$$
which yields
$$\mathcal{L} = \|g_\phi(h(f_\theta(u))) - u\|_2^2 + \sigma^2 \|J_{g_\phi}(u)\|_F^2.$$

C. Low Latency Benchmark: Convolutional Codes with Different Memory Lengths

The benchmark results for convolutional codes under the low latency constraint are shown in Figure 16. Note that no single convolutional code is universally best across delay constraints and code rates. The convolutional codes reported in the main text are the best-performing codes shown in Figure 16.


Fig. 16. Convolutional codes with different delay constraints at rate 1/2 (top) and rate 1/3 (bottom)