15
1 Scalable, Practical VoIP Teleconferencing with End-to-End Homomorphic Encryption Kurt Rohloff, Member, IEEE, David Bruce Cousins, Senior Member, IEEE, Daniel Sumorok, Member, IEEE Abstract—We present an approach to scalable, secure Voice over IP (VoIP) teleconferencing on commodity mobile devices and data networks with end-to-end Homomorphic Encryption (HE). We assume an honest-but-curious threat model where an adversary, despite observing all communications between all teleconference participants and having full access to teleconfer- encing servers, is unable to obtain unencrypted data and subse- quently listen to the conversation. Prior secure VoIP teleconfer- encing services have relied on 1) all teleconferencing clients to maintain point-to-point encrypted links with other clients or 2) a teleconferencing server which can access and manipulate VoIP streams unencrypted. Our approach mixes VoIP data streams at a single VoIP teleconferencing server only while encrypted and data streams are never decrypted at the teleconferencing server. Our approach is based on a previously known encryption scheme we implement. Innovation in our approach comes from a novel efficient VoIP encoding scheme to greatly reduce the circuit depth needed to support homomorphic mixing of the en- crypted VoIP data, parameterization for relatively low bandwidth usage and implementation and integration of our designs into an existing open-source VoIP infrastructure. We experimentally evaluate our end-to-end encrypted VoIP teleconferencing ap- plication running on commodity iPhones, mixing at the VoIP servers on the lowest-cost Amazon AWS cloud server instances and communicating on commercial data networks with com- mercial cellular data and commodity 802.11n wireless access points. K. Rohloff is with the Cybersecurity Research Center in the Col- lege of Computing Sciences, New Jersey Institute of Technology, Newark, NJ, 07102. E-mail: [email protected] D. Cousins and D. Sumorok are with Raytheon BBN Technolo- gies, Cambridge, MA, 02138. E-mail: {dcousins,dsumorok}@bbn.com Sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under Contract No. FA8750-11-C-0098. The views expressed are those of the authors and do not necessarily reflect the official policy or position of the Department of Defense or the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlim- ited.) 1 I NTRODUCTION Teleconferencing is an important aspect of mod- ern professional life that supports long-distance commerce and collaboration. With the increased prevalence and reliance on teleconferencing tech- nologies, there is an increased need to provide se- cure, scalable teleconferencing technologies. Voice- over-IP (VoIP) teleconferencing is becoming increas- ingly important due to the prevalence of low-cost portable computing device and easy, economical data network access. There is a perceived risk that compromised data networks or VoIP servers could be used by adversaries to steal and leak sensitive information communicated through VoIP in tele- conferences. Secure VoIP teleconferencing has been partially served through point-to-point communication links which are made secure with point-to-point encryp- tion technologies [1], [2]. This approach does not scale. When there are more than a handful of partic- ipants in a teleconference call, O(n 2 ) point-to-point encryption links are needed, leading to bandwidth issues which degrades the quality of user experi- ences. To address the scaling limitation, voice sig- nals could be sent to a centralized teleconferencing server that mixes VoIP signals to support interactive teleconferencing and the mixed signals are returned to the participants. Up to now, VoIP signals on a teleconference server needed to be mixed in the clear, mean- ing that VoIP signals need to be unencrypted at the server. VoIP teleconferencing servers are often hosted in a semi-secure environment, such as by commodity cloud providers such as Amazon AWS 1 or Microsoft Azure 2 . This creates opportunity 1. https://aws.amazon.com/ 2. https://azure.microsoft.com/

Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

1

Scalable, Practical VoIP Teleconferencing withEnd-to-End Homomorphic EncryptionKurt Rohloff, Member, IEEE, David Bruce Cousins, Senior Member, IEEE,

Daniel Sumorok, Member, IEEE

F

Abstract—We present an approach to scalable, secure Voiceover IP (VoIP) teleconferencing on commodity mobile devicesand data networks with end-to-end Homomorphic Encryption(HE). We assume an honest-but-curious threat model wherean adversary, despite observing all communications between allteleconference participants and having full access to teleconfer-encing servers, is unable to obtain unencrypted data and subse-quently listen to the conversation. Prior secure VoIP teleconfer-encing services have relied on 1) all teleconferencing clients tomaintain point-to-point encrypted links with other clients or 2) ateleconferencing server which can access and manipulate VoIPstreams unencrypted. Our approach mixes VoIP data streamsat a single VoIP teleconferencing server only while encryptedand data streams are never decrypted at the teleconferencingserver. Our approach is based on a previously known encryptionscheme we implement. Innovation in our approach comes froma novel efficient VoIP encoding scheme to greatly reduce thecircuit depth needed to support homomorphic mixing of the en-crypted VoIP data, parameterization for relatively low bandwidthusage and implementation and integration of our designs intoan existing open-source VoIP infrastructure. We experimentallyevaluate our end-to-end encrypted VoIP teleconferencing ap-plication running on commodity iPhones, mixing at the VoIPservers on the lowest-cost Amazon AWS cloud server instancesand communicating on commercial data networks with com-mercial cellular data and commodity 802.11n wireless accesspoints.

• K. Rohloff is with the Cybersecurity Research Center in the Col-lege of Computing Sciences, New Jersey Institute of Technology,Newark, NJ, 07102.E-mail: [email protected]

• D. Cousins and D. Sumorok are with Raytheon BBN Technolo-gies, Cambridge, MA, 02138.E-mail: {dcousins,dsumorok}@bbn.com

Sponsored by the Defense Advanced Research Projects Agency(DARPA) and the Air Force Research Laboratory (AFRL) underContract No. FA8750-11-C-0098. The views expressed are those ofthe authors and do not necessarily reflect the official policy or positionof the Department of Defense or the U.S. Government. DistributionStatement “A” (Approved for Public Release, Distribution Unlim-ited.)

1 INTRODUCTION

Teleconferencing is an important aspect of mod-ern professional life that supports long-distancecommerce and collaboration. With the increasedprevalence and reliance on teleconferencing tech-nologies, there is an increased need to provide se-cure, scalable teleconferencing technologies. Voice-over-IP (VoIP) teleconferencing is becoming increas-ingly important due to the prevalence of low-costportable computing device and easy, economicaldata network access. There is a perceived risk thatcompromised data networks or VoIP servers couldbe used by adversaries to steal and leak sensitiveinformation communicated through VoIP in tele-conferences.

Secure VoIP teleconferencing has been partiallyserved through point-to-point communication linkswhich are made secure with point-to-point encryp-tion technologies [1], [2]. This approach does notscale. When there are more than a handful of partic-ipants in a teleconference call, O(n2) point-to-pointencryption links are needed, leading to bandwidthissues which degrades the quality of user experi-ences. To address the scaling limitation, voice sig-nals could be sent to a centralized teleconferencingserver that mixes VoIP signals to support interactiveteleconferencing and the mixed signals are returnedto the participants.

Up to now, VoIP signals on a teleconferenceserver needed to be mixed in the clear, mean-ing that VoIP signals need to be unencrypted atthe server. VoIP teleconferencing servers are oftenhosted in a semi-secure environment, such as bycommodity cloud providers such as Amazon AWS1 or Microsoft Azure 2. This creates opportunity

1. https://aws.amazon.com/2. https://azure.microsoft.com/

Page 2: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

2

for adversaries to snoop on VoIP data such that ifthe server is compromised, voice data could leakto adversaries. As such, teleconferencing solutionshave been vulnerable to man-in-the-middle attacksof various types [3].

Taken together, there is a need for a VoIP tele-conferencing capability with end-to-end encryptionwhere VoIP signals are never accessible in the clearexcept by clients with decryption keys. We addressthese limitation to provide a teleconferencing ap-proach where data is mixed while encrypted. Thisapproach prevents an adversary from obtaining thecommunications of teleconference participants evenif the adversary observes all communication anddata manipulations performed by the teleconfer-ence server.

To use the cryptosystem, teleconferencing clientsencode their voice samples with an additive encod-ing scheme, encrypt their encoded voice data underan additive homomorphic encryption scheme, sendtheir encrypted voice samples to a VoIP telecon-ferencing server. The VoIP teleconferencing serverperforms an encrypted signal mixing operation onthe encrypted VoIP signals. In our scheme, this en-crypted mixing operation consists of homomorphicaddition on the encrypted voice signals. The tele-conference server sends the result of the encryptedmixing back to the clients. The clients then decrypt,decode and play back the result. Our scheme relieson the pre-sharing of a common private key for ouradditive homomorphic encryption scheme.

The encoding operation converts raw VoIP sig-nal into a data representation which is nativelyencrypted by the additive homomorphic encryp-tion scheme. Our homomorphic scheme nativelyencrypts the integer coefficients of a polynomial,and the encode operation converts raw floating-point VoIP signals into the integer coefficients of apolynomial. The decode operation is the inverse ofthe encode operation.

In modern VoIP teleconferencing schemes, themixing operation merges VoIP signals. In partic-ular, a mixing operation takes the encoded VoIPsignals, each representing a single client speaking,and outputs a single signal that encodes all of theclients’ voices. Different VoIP encoding schemes usedifferent encoding approaches.

Innovation in our approach comes from:

• A novel efficient VoIP encoding scheme thatgreatly reduces the circuit depth needed tosupport homomorphic mixing of the en-crypted VoIP data.

• A homomorphic mixing operation thatsolely uses homomorphic additions on en-crypted data.

• An additive homomorphic encryption thatis a simplification of a prior well-knownscheme.

• Parameterizations to provide security, highvoice quality and relatively low bandwidthusage.

• Implementation and integration of our de-signs into an existing open-source VoIP in-frastructure.

• Experimentation of our prototype on low-cost mobile and cloud computing infrastruc-ture and commercial data networks.

The cryptographic basis of our approach is anefficient implementation of an additive homomor-phic encryption scheme. The cryptosystem buildsfrom recent advances in practical homomorphicencryption [4]. In particular, our approach is basedon recent advances in lattice-based homomorphicencryption and is a simplification of the LTV scheme[5] which is itself an NTRU-like cryptosystem [6].

Many modern VoIP communication systems relyon logarithmic encoding schemes which dedicatemore bits of the encoding to the most significantbits of the encoded data. Our approach relies onan additive encoding scheme which greatly reducesthe circuit depth of the VoIP mixing operation.

We integrate our voice data encoding and en-cryption implementations with the existing Mum-ble/Murmur open-source VoIP teleconferencingframework 3. We experimentally evaluate our end-to-end encrypted VoIP teleconferencing applicationrunning on commodity iOS-based clients such asiPod Touches and iPhones, mixing on the lowest-cost Amazon AWS cloud server instances andcommercial cellular data networks or commodity802.11n wireless access points.

A high-level overview of our approach is men-tioned in [7]. The underlying cryptosystem withoutimplementation is presented in [5] and an earlyimplementation of this cryptosystem without appli-cation to VoIP is in [4].

The paper is organized as follows. Section 2discusses related teleconferencing approaches andapplied homomorphic encryption efforts. Section3 discusses our security model, assumptions wemake of the adversary and design goals for ourscalable VoIP teleconferencing capability with end-to-end encryption. Section 4 discusses the overall

3. http://www.mumble.info/

Page 3: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

3

architecture of our end-to-end encrypted VoIP tele-conferencing capability. In Section 5 we describeour simplification of the LTV-based homomorphiccryptosystem [5] that we build from. We introduceour custom VoIP coding scheme in Section 6 andthe homomorphic mixing operation in Section 7. InSection 8 we describe integrating the cryptosystemwith the Mumble open-source VoIP framework.Section 9 discusses our engineering trade-offs inparameterizing the coding scheme, the cryptosys-tem and the VoIP mixer. In Section 10 we discussexperimentation. We present a security analysis inSection 11. We conclude the paper with a discussionof our insights and future work in Section 12.

2 RELATED WORK

Advances in secure VoIP technologies have focusedon providing security for data in transit [2], address-ing identity and key management [8], among manyother security issues [9].

There has been some recent work to addressthese prior challenges, including using SecureMulti-Party Computation (SMC) [10] to mix en-crypted VoIP signals over multiple servers, withthe restriction that each client needs to trust aserver. This prior SMC-based approach is exper-imentally demonstrated by integrated with theMumble/Murmur software, and our team sharedintegration and implementation advice with theauthors of [10].

The basis of our approach builds on lattice-basedHomomorphic Encryption (HE) which has previ-ously been considered inefficient for broad practicaluse [11]. Solutions to HE runtime challenges havebeen explored through several means, including byimproving the theoretical efficiency of the under-lying scheme [12], [13], and by developing moreefficient implementations of these schemes [14]–[18]. Despite these advances in HE schemes andimplementations, there are few applications of thesetechnologies.

Besides the runtime challenges of HE designs,there is limited experience with data structures andrepresentations of homomorphic encrypted data[19]. Modern lattice-based HE provides a very dif-ferent computation model as compared to the morewell understood RAM compute models. The port-ing of familiar data structures and algorithms (suchas for VoIP mixing) for efficient use with HE isan ongoing challenge, especially for highly efficientencrypted execution of these algorithms over the

encrypted input data. We expand beyond single-bit-per-ciphertext encodings by placing entire VoIPdata frames into each ciphertext. These designs arein some sense much simpler than existing currentindustry practice, such as the mu-law encoders [20]which are common in modern VoIP systems.

3 THREAT MODEL AND FUNCTIONAL GOALS

To frame the functional and security goals of ourend-to-end encrypted VoIP teleconferencing con-cept, we present our high-level design considera-tions:

1) Encryption Work Factor: We should pro-vide an encryption work factor adequateto protect teleconferences. For our lattice-based scheme, current security estimatesindicate that a root Hermite factor δ < 1.007generally provides 80 bits of security and anadequate work factor [21].

2) Server Compromise: The confidentiality ofthe data should be preserved even if theserver is fully compromised by an honest-but-curious adversary who observes all in-ternal server operations.

3) Latency: There should usable end-to-endlatency of less than 150 ms, which is con-sidered acceptable in practice [22].

4) Sound Quality: There should be acceptablesound quality, preferably with full-duplexso users could listen while simultaneouslyspeaking.

5) Scalability: The teleconferencing capabilityshould scale to support more than two par-ticipants without noticeable degradation insound quality or latency.

6) Bandwidth Usage: The end-to-end en-crypted VoIP teleconferencing capabilityshould ideally require less than an orderof magnitude ciphertext expansion over un-encrypted or even current AES-encryptedapproaches.

7) Wide Geographic Area: The capabilityshould operate over a wide geographicarea, ideally trans-continental if not inter-continental, without an unacceptable degra-dation in sound quality or latency.

4 END-TO-END ENCRYPTED VOIP TELECON-FERENCING ARCHITECTURE

We now describe the high-level architecture of ourend-to-end encrypted VoIP teleconferencing capa-

Page 4: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

4

Fig. 1: VoIP Encoding

Fig. 2: VoIP Decoding

bility. Multiple clients, possibly from multiple ac-cess points, connect over a commodity data net-work, such as the public Internet, to a VoIP server.Each client samples a stream of voice data, encodesthe voice data stream and encrypts it. The clientsends its stream of encrypted VoIP data to theVoIP server. The VoIP server receives the encryptedVoIP streams from multiple clients. The VoIP serverperforms homomorphic mixing operations on the en-crypted voice streams and sends the encrypted re-sults back to the clients. The mixed encrypted VoIPdata received by each client is then decrypted bythe clients, decoded and played back to the clients’users.

Figure 2 shows how the returning stream ofencrypted VoIP data is decrypted into plaintext, andthe plaintext is decoded into a stream of digitizedaudio samples. The audio samples are then playedback to the user through the devices’ speakers.

A common approach to merge digital audiosignals in a teleconference session at the server isto add the encoded samples of unencrypted (plain-text) digital audio signals from the client. The func-tion which merges the encoded, but unencrypted,sampled digital audio signals at the teleconferenceserver is called the Mix function. Although thisdesign relies on nontrivial encoding and decod-ing processes, the Mix function can be a simple,low-depth operation at the teleconference server.This suggests an approach to encrypted VoIP tele-conferencing to design a homomorphic encryptedversion of the plaintext Mix function. We call thishomomorphic encryption version of the Mix func-tion as the EvalMix function, and the design ofMix suggests that EvalMix could be built by solelyhomomorphically adding encrypted audio signals.Hence, both an additive homomorphic encryption

scheme and an additive encoding scheme wouldbe needed for this to approach to work. (We saythat an encoding scheme Encode() is additive ifEncode(a+ b) = Encode(a) + Encode(b).)

A primary challenge is then to design a ho-momorphic evaluation function EvalMix such thatfor two voice samples v1 and v2 and their respec-tive encodings Enc(pk, v1) and Enc(pk, v2), we knowthat the result of mixing the signals, Mix(v1, v2)is equal to Dec(sk,EvalMix(Enc(pk, v1),Enc(pk, v2)))for an encoding of the digital audio stream intoplaintext. We should be able to parameterize theEnc and EvalMix operations so that they are suffi-ciently secure and resource efficient to make end-to-end encrypted VoIP teleconferencing practical. If{Enc(pk, v1),Enc(pk, v2),Enc(pk, v3)} are encryptedaudio blocks received by the server at a specific timefrom three clients, then the server should return toclient 1 a ciphertext c′1 that when decrypted equalsMix(v2, v3).

5 ADDITIVE HOMOMORPHIC LATTICE-BASEDCRYPTOSYSTEM

Our design builds on a recent efficient FHE design[5] and an implementation [4]. Our approach is alattice-based cryptographic scheme. Mathematicalpreliminaries for lattice-based cryptography can befound in [23], and a general survey on the design oflattice-based schemes can be found in [24].

Our HE cryptosystem provides the core pub-lic key encryption primitives of Key Generation(KeyGen → (pk, sk)), Encryption (Enc(pk, µ) → c)and Decryption (Dec(sk, c) → µ) where pk is apublic key, sk is a secret key, µ is a plaintextand c is a ciphertext. Our scheme is defined tobe an additive homomorphic encryption schemebecause it provides an Evaluation Addition oper-ation (EvalAdd(c1, c2)→ c3) where Dec(Enc(a+ b)) =Dec(EvalAdd(Enc(a),Enc(b))) if the scheme is prop-erly parameterized to satisfy its correctness con-straints.

Our primary simplification of the scheme in[4], [5] is that we remove the ability to supportEvaluation Multiplication (EvalMult). Because wesimplify the scheme in this way, the correctnessand security constraints for the general scheme in[4], [5] hold for our simplified scheme, and we areable to choose smaller concrete parameters whilemaintaining security and correctness.

5.1 Plaintext and Ciphertext RepresentationConcretely, we represent our plaintext and cipher-text as vectors of unsigned integers. We use power-

Page 5: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

5

of-2 cyclotomics, meaning that plaintext and cipher-text vectors represent the coefficients of integer modp polynomials of degree 2x [23]. Almost all of theoperations we need to support on the plaintext andciphertext are linear transforms, and by restrictingourselves to power-of-2 dimensions, we greatly re-duce the implementation complexity required inour cryptosystem. Reduced implementation com-plexity translates directly into more efficient imple-mentation. Current HE implementations designedfor non-power-of-2 cyclotomics can support moregeneral capabilities, but we intentionally choose tofocus on a simpler scheme for improved runtimeand reduced ciphertext expansion for limited-depthapplications.

For n a power of 2, we define the ring R =Z[x]/(xn + 1) (i.e., integer polynomials modulo xn +1) where, for any positive integer q, define the ci-phertext space Rq = R/qR (i.e., integer polynomialsmodulo xn + 1, with mod-q coefficients). Additionaldetails on mathematical preliminaries can be foundin [23]. The plaintext space is Rp for some integerp ≥ 2, meaning that plaintext are length-n vectorsof integers modulus p. Typically p is much smallerthan 264 and we typically choose p to be between2 and 212. Except for special applications, such asfor our secure VoIP application, we often need onthe order of several hundred bits to represent q tosupport deep computations at a reasonable level ofsecurity.

5.2 LTV-based Scheme with LSB EncodingBased on the mathematical preliminaries, we nowprovide a description of our simplified LTV scheme[5]. Concrete implementation and parameter discus-sions are given after the mathematical sketches.

In lattice-based encryption schemes, plaintext,ciphertext and other ring elements can be repre-sented in either evaluation or Chinese RemainderTransform (CRT) representation [23]. These rep-resentations are equivalent. A ring element canbe converted from an evaluation representation toa CRT representation by application of a specialNumber Theoretic Transform called the Chinese Re-mainder Transform (CRT) [23]. More explicitly, wecan convert a ciphertext from its evaluation repre-sentation ce to its CRT representation ccrt computingccrt = CRTq(ce) for appropriately chosen parame-terizations of the CRT as in [14]. The inverse CRT(CRT-1) converts ring Elements from the CRT repre-sentation back to the evaluation representation. Inthis case, ce can be recovered from ccrt by applyingthe inverse CRT transform, ce = CRT-1

q(ccrt).

Our simplified LTV encryption scheme is ag-nostic with respect to the representation of the ci-phertext. We describe the operation of this schemeassuming a CRT representation of ciphertext un-less otherwise stated. It is advantageous to keepciphertext in CRT representation because additionsand multiplications on ciphertext are performedelement-wise in this representation. Otherwise, inevaluation representation, ciphertext multiplicationrequires an operation similar to a convolution oper-ation to perform.

Our cryptosystem operations are described asfollows:

• KeyGen: choose a “short” f ∈ R such thatf = 1 mod p and f is invertible modulo q.Concretely, f is a vector of n integers and fis invertible modulo q if and only if each ofits mod-q CRT evaluations is nonzero. The“short” elements f can be chosen from dis-crete Gaussians. E.g., we can let f = p · f ′+ 1(where 1 is a length-n with all entries setto 1) for some Gaussian-distributed f ′. theGaussian distribution is zero-centered with adistribution parameter denoted as r. We out-put sk = f . The secret key, as defined above,is most efficiently represented in its eval-uation representation. Similarly, we choosea “short” g ∈ R, possibly from discreteGaussians. The g is sampled in it evaluationrepresentation, but then converted to its CRTrepresentation. The CRT representations off−11 (modulo q) is the mod-q inverse of f1. Weoutput pk = h = g · f−1 mod q in its double-CRT representation.

• Enc(pk = h, µ ∈ Rp): choose a “short”r ∈ R and a “short” m ∈ R such thatm = µ mod p. The vectors r and m are sam-pled in their evaluation representations, butthen converted to their CRT representations.We output c = p · r · h+m mod q. Concretely,m can be chosen as m = p · m′ + µ for aGaussian-distributed m′ sampled in its eval-uation representation, then converted to itsCRT representation.

• Dec(sk = f, c ∈ Rq): compute b = f · c mod q,and lift it to the integer polynomial b ∈ Rwith coefficients in [−q/2, q/2). Output µ =b mod p. The key multiplication operationcan be performed in either the evaluation orCRT representation.

The scheme supports the additive homomorphic

Page 6: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

6

operation:

EvalAdd(c0, c1) = c0 + c1 mod q

The EvalAdd operation can be performed using bothevaluation or CRT representations, but our imple-mentation always uses a CRT representation so thescheme supports expected forward compatibilitywith additional encrypted operations supported onthe ciphertext.

5.3 Security of Scheme

The proofs of security in [5] are directly applicableto the simpler additive homomorphic scheme wediscuss here. In particular, Lemma 3.6 in [5] demon-strates the security of this scheme, and we do notprovide a detailed security proofs. In Section 9 weleverage further parameter selection and noise man-agement optimizations from [21], [25]–[27] whichdo not affect the security proofs given in [5].

6 VOIP SIGNAL MIXER AND ADDITIVE EN-CODING

We consider the sampled and digitized representa-tion of VoIP audio signals as being a sequence of y-bit integers [v1, v2, . . .]. Define vi = [vi1, v

i2, . . . , v

im] to

represent a block of samples, Encode([vi1, . . . , vim]) =

[zi1, . . . , zin] = zi where zi is a mod p plaintext and

Enc([zi1, . . . , zin]) = [ci1, . . . , c

in] = ci where ci is a mod

q ciphertext.We define the Mix function as the function which

takes unencrypted encoded sampled audio signalsas input and outputs a single audio signal in whichall input signals are overlayed on one another andcan be heard. The most common (and simplest)mixing functions in deployed VoIP systems is asample-adder, and we use this as our model plain-text mixing operation which we define here.

Definition 1. Mix: We define the Mix operation asfollows where j = bt/2c:

Mix(z1, . . . , zt) =z1, if t = 1

Mix(c1, . . . , zj)+

Mix(zj+1, . . . , zt), otherwise

Our goal is to show that for our Encode andDecode functions, when used in conjunction withthe Mix, the following property holds for proper pa-rameterizations and arbitrary samples v1, v2, . . . , vt

where t is a variable that represents the number ofactive speakers:

Decode(Mix

(Encode(v1), . . . ,Encode(vt)

))(1)

= v1 + v2 + · · ·+ vt

When designing the Encode and Decode opera-tions, wrap-around in the plaintext signal shouldbe avoided when the Mix function is applied toencoded (but not encrypted) plaintext data. Specifi-cally, for the plaintext samples z1j , z

2j , . . ., if

∑i z

ij > p

for some j, then the encoded plaintext data outputby the Mix operation wraps around p. If this wrap-around occurs on the plaintext data, there will bedistortion in the resulting mixed audio signal, evenif there is no decryption. If this distortion occurs inthe plaintext data, then the distortion will also bepresent in encrypted output signal resulting fromapplying the EvalMix operation to the encryptedinput signal due to the property of Equation 2.

To guarantee there is no distortion due to plain-text wrap-around, we design Encode and Decodeto guarantee that Equation 1 always holds whenused with our defined EvalMix operation under theassumption that there are no more than t telecon-ference participants speaking simultaneously. Notethat t is a variable which can be changed as the VoIPcapability is used, as long as clients agree on a tvalue to guarantee a common encoding and decod-ing operation. We design the Encode and Decode tooperate with any arbitrarily chosen t. Our vision isthat the parameter t will be set for every session. Weexperimentally evaluate scenarios below for variousvalues of t and where that are more than t activespeakers on a session to investigate the effect thisvariable has on the usability of the scheme.

Definition 2. Encode: We assume without loss ofgenerality that the input is sampled mod 2y. Wedefine the Encode operation as [z1, z2, . . . , zn] :=Encode([v1, v2, . . . , vm]) where

zi =∑

j∈{kn+i|0≤c,kn+i≤m}

vj ∗ 2k(y+s) mod p

Note that we use (y + s)-bit representations ofthe data samples in the encoding, even though weassume the samples from the audio source have y-bit representations. This is to add extra padding bitsso that when we add encoded samples of up to t ≤2s speakers, we have extra most-significant bits toavoid

∑i z

ij > p for some j. Generally we set t to

be a power of 2, so that for a scenario where wesupport up to 4 active speakers in a session, we sett = 4, so s = 2.

Page 7: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

7

Fig. 3: Encrypted VoIP Encoding

A graphical representation of this scheme canbe seen in Figure 3, with the following concreteexecution steps:- We start by splitting the length m VoIP digitizedaudio sample input [v1, v2, . . . , vm] into blocks[v1, v2, . . . , vn], [vn+1, vn+2, . . . , v2n], . . . , [v1+kn, . . . , vm].Each block has n samples except for the last blockwhich is of length m mod n if m mod n > 0, andthe last block is of length n otherwise. This meansthat unless m is a multiple of n, the last block of< n samples is the leftover samples which are notfit into the first length-n samples.- For the ith block, we multiply all samples in theblock by 2k(y+s) so that in a binary representation,the k(y+ s) least significant bits are all zero and they most significant bits is a left-shift of the originalsample, with two padding bits.- We add all of the samples for each index in theblocks.

Note that we can encode (y + s) integer inputsto the Encode function the way it is formulated. In abinary representation of the encoded data elementzi, the bits in locations [k(y + s) + 1 : k(y + s) + y]represents the (kn + i)th sample. For the sake ofnotational simplicity, we say that zij [a, b] representsthe ath through bth bits in the j entry of plaintextencoding vector zi where the least significant bit isthe 1st bit. We insert additional binary 00 paddingbetween the binary representation of encoding out-put so that zij [k(y+s)+y+1 : k(y+s)+y+s] = [00]for all i, j and k. We do this so for four encodedsamples z1i , z

2i , z

3i , . . . , z

ti , we know that for the sum-

mation z′i = z1i + z2i + z3i + · · · + zti , which is out-put by the mixing operation, has bits in locations[k(y+s)+1 : k(y+s)+y+s] to hold the potential over-flow resulting from summations of the (kn + i)thsamples. We call these two padding bits which

Fig. 4: Encrypted VoIP Decoding

catches the overflow as overflow bits. As such, ourencoding scheme induces the constraint that theplaintext modulus p must satisfy the condition thatp ≥ 2b(y+s) where b = dm/ne.

The Decode operation behaves as the inverse ofthe Encode operation. The Decode operation returnsthe bits in locations [k(y + s) + 1 : (k + 1)(y + s)]of zi as the (kn + i)th returned sample. Note thatthe output of the Decode operation is mod 2y+s,and this requires 2 additional bits. We do this so theoutput of the decoding can hold the summation ofthe encoding input without wrapping around.

Definition 3. Decode: We define the Decode operationas follows: [v1, v2, . . . , vm] := Decode([z1, z2, . . . , zn])where

vkn+i = zi[k(y + s) + 1 : (k + 1)(y + s)].

Figure 4 shows how our decoding process couldbe implemented. On the right hand side of thisfigure we take the input vector. We make copies ofthis block and perform bit selection. Similar to theencoding operation, these operations are all highlyefficient as they only involve bit selection, which isextremely efficient to implement.

We are now ready to show that Equation 1 holdswith the following Theorems 1 and 2. Theorems 1shows that the encoding scheme is additive andTheorem 2 shows that the composition of the de-coding and encoding schemes is additive. We usethese properties to prove that the encrypted VoIPsystem, when used with a homomorphic encryptionequivalent of the plaintext Mix operation, does notdistort the mixed encrypted signals.

Page 8: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

8

Theorem 1. Given v1, . . . , vt,

Encode(v1) + · · ·+ Encode(vt) mod p

= Encode(v1 + · · ·+ vt mod 2y+s).

Proof. By definition for vh = [vh1 , vh2 , . . . , v

hm], zh =

[zh1 , zh2 , . . . , z

hn],

zhi =∑

j∈{kn+i|0≤c,kn+i≤m}

vkj ∗ 2k(y+s) mod p

Define A = {kn+ i|0 ≤ c, kn+ i ≤ m} By addingthese two equations,∑

h∈{1,...,t}

zhi mod p

=∑

h∈{1,...,t}

∑j∈A

vhj ∗ 2k(y+s)

mod p

=∑j∈A

∑h∈{1,...,t}

vhj

∗ 2k(y+s)

mod p

Hence,

Encode(v1) + · · ·+ Encode(vt) mod p

= Encode(v1 + · · ·+ vt mod 2y+s).

Theorem 2. If t ≤ 2s, Decode(Encode(v1) + · · · +Encode(vt) mod p) = (v1 + · · ·+ vt mod 2y+s).

Proof. From Theorem 1 that Encode(v1) + · · · +Encode(vt) mod p = Encode(v1 + · · ·+vt mod 2y+s)so,

Decode(Encode(v1) + · · ·+ Encode(vt) mod p)

= Decode(Encode(v1 + · · ·+ vt mod 2y+s))

If we assign z′ = Encode(v1 + · · ·+ vt mod 2y+s)and v′ = Decode(Encode(v1 + · · · + vt mod 2y+s)),then from the definition of the Decode operation, weknow that v′kn+i = z′i[k(y + s) + 1 : (k + 1)(y + s)] =(∑

l∈{1,...,t} vlkn+i

)mod 2y+s.

7 LOW-DEPTH HOMOMORPHIC ENCRYPTIONVOIP SIGNAL MIXER

As above, define vi = [vi1, vi2, . . . , v

im] to repre-

sent a block of samples, Encode([vi1, . . . , vim]) =

[zi1, . . . , zin] = zi where zi is a mod p plaintext and

Enc(pk, [zi1, . . . , zin]) = [ci1, . . . , c

in] = ci where ci is a

mod q ciphertext.We abuse notation and assume without loss

of generality that all encryption operations aredone with a common public key pk, so we denote

Enc(pk, zi) as Enc(zi). We take a similar notationalshortcut with the decryption operation and an as-sumed common secret key sk such that Dec(sk, ci)is denoted for notational simplicity as Dec(ci).

To support end-to-end encrypted VoIP telecon-ferencing, we use our encoding function Encode,decoding function Decode and mixing function Mixwith our encryption function Enc and decryptionfunction Dec to define an operation EvalMix thatis the homomorphic encrypted equivalent of theplaintext Mix operation. Our goal is to design theEvalMix operation such that the following propertyholds for proper parameterizations and arbitrarysamples v1, v2, . . . , vt where t is a variable that rep-resents the number of active speakers:

Decode(Dec

(EvalMix

(ci, . . . , ct

)))(2)

= Decode(Mix(v1, v2, . . . , vt)

)In this equation, ci = Enc

(Encode

(vi))

repre-sents the encryption of the encoding of the samplesvi. The EvalMix function performs an encrypted ho-momorphic mixing of these ciphertext, which whendecoded and then decrypted, the result is the sameas if the sampled data had been mixed withoutencoding and encryption.

Definition 4. EvalMix: We define our homomorphicmixing function EvalMix as a generalization of ourpreviously defined EvalAdd operation where j = bt/2c.

EvalMix(c1, . . . , ct) =c1, if t = 1

EvalAdd(EvalMix(c1, . . . , cj),

EvalMix(cj+1, . . . , ct)), otherwise

Note that our EvalMix operation can be opti-mized for parallel execution by evaluating the re-cursive branches on different threads.

We show in Theorem 4 that Equation 2 holds forour EvalMix operation. To show this proof, we firstneed to show that if encrypted signals are input tothe EvalMix function, then the decrypted output isthe summation of the plaintext signals. We then usethis property in the proof of Theorem 4 to show ifencoded VoIP signals are fed as input to the EvalMixoperation, then the decrypted and decoded outputis a mix of the VoIP signals.

Theorem 3. For the input plaintext z1, . . . , zt andplaintext modulus p assuming proper parameterization

Page 9: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

9

to enable correct decryption after repeated application ofthe EvalAdd operation,

Dec(EvalMix(Enc(z1), . . . ,Enc(zt)))

= (z1 + · · ·+ zt mod p).

Proof. We demonstrate this with a generalized proofby induction on t.

For the base case, t = 1, EvalMix(c1) = c1

mod q by definition. By the correctness of decryp-tion of our encryption scheme, Dec(Enc(z1)) = z1

mod p, so Dec(EvalMix(Enc(z1))) = (Dec(Enc(z1))mod p) = z1.

Assume that for t′ < t,

Dec(EvalMix(Enc(z1), . . . ,Enc(zt′))

= (z1 + · · ·+ zt′

mod p).

From the definition of EvalMix, we know that

EvalMix(c1, . . . , ct)

= EvalAdd(EvalMix(c1, . . . , cj),EvalMix(cj+1, . . . , ct)).

By substitution,

EvalMix(Enc(z1), . . . ,Enc(zt))

= EvalAdd(EvalMix(Enc(z1), . . . ,Enc(zj)),

EvalMix(Enc(zj+1), . . . ,Enc(zt))).

Therefore,

Dec(EvalMix(Enc(z1), . . . ,Enc(zt))

= Dec(EvalAdd(EvalMix(Enc(z1), . . . ,Enc(zj)),

EvalMix(Enc(zj+1), . . . ,Enc(zt)))).

From the correctness of the EvalAdd operation,

Dec(EvalMix(Enc(z1), . . . ,Enc(zt))

= Dec(EvalMix(Enc(z1), . . . ,Enc(zj)))

+ Dec(EvalMix(Enc(zj+1), . . . ,Enc(zt)))

mod p.

From the induction hypothesis,

Dec(EvalMix(Enc(z1), . . . ,Enc(zt))

= (z1 + · · ·+ zj + zj+1 + · · ·+ zt mod p).

Theorem 4. For t ≤ 2s,

Decode(Dec(EvalMix(Enc(Encode(v1)), (3). . . ,Enc(Encode(vt)))))

= Mix(v1, v2, . . . , vt)

Proof. We know from Theorem 3 that

Dec(EvalMix(Enc(z1), . . . ,Enc(zt)))

= z1 + · · ·+ zt mod p.

By substitution,

Decode(Dec(EvalMix(Enc(Encode(v1)),

. . . ,Enc(Encode(vt)))))

= Encode(v1) + · · ·+ Encode(vt) mod p).

We know from Theorem 1 that (Encode(v1)+· · ·+Encode(vt) mod p) = Encode(v1+· · ·+vt mod 2y+s)so,

Decode(Dec(EvalMix(Enc(Encode(v1)),

. . . ,Enc(Encode(vt)))))

= Encode(v1 + · · ·+ vt mod 2y+s).

We know from Theorem 2 that for t ≤ 2s,Decode(Encode(v1)+· · ·+Encode(vt) mod p) = (v1+· · ·+ vt mod 2y+s), so,

Decode(Dec(EvalMix(Enc(Encode(v1)),

. . . ,Enc(Encode(vt)))))

= (v1 + · · ·+ vt mod 2y+s).

By definition, (v1 + · · · + vt mod 2y+s) =Mix(v1, v2, . . . , vt) so if t ≤ 2s,

Decode(Dec(EvalMix(Enc(Encode(v1)),

. . . ,Enc(Encode(vt)))))

= Mix(v1, v2, . . . , vt) .

Using Theorem 4 we can now provide an en-crypted VoIP mixing capability so that a teleconfer-ence participant can listen to multiple mixed audiosignals from other participants on a single chan-nel. An important feature of our scheme is that itprovides full-duplex communication, meaning thatparticipants can transmit while they receive mixedsignals.

From a practical usability perspective, a tele-conference participant would not want their voicefed back to them through the mixing operation.Even with a small return lag as low as 10 ms, thefeedback signal can become disorienting to speak-ers, even if such a lag from other participants isgenerally unnoticeable during normal conversation.We therefore want to support multiple mixing oper-ations simultaneously, so that client voice streams(v1, . . . , vt), the ith client receives the mixed sig-nal

∑j∈{1,...,t}\{i}(v

i). The EvalAdd operations in the

Page 10: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

10

EvalMix function are all highly efficient as they onlyinvolve splitting vectors, multiplication by two andbitwise concatenation, which are all extremely effi-cient to implement. This summation for multiple re-cipients is computationally efficient, relying only onsummation operations, and can also be performedin a parallel manner. We discuss the engineeringtradeoffs associated with concrete parameter selec-tion in greater depth in Section 9.

8 END-TO-END ENCRYPTED VOIP TELECON-FERENCING PROTOTYPE

We implemented the additive homomorphic cryp-tosystem portion of the secure VoIP teleconferenc-ing prototype in the Mathworks Matlab environ-ment because it provides an interpreted compu-tation environment with native support for vectorand matrix manipulation for rapid prototyping. Wefound the Matlab syntax to be a natural fit for theprimitive lattice operations needed for our LTV-based cryptosystem design. We used the Matlabcoder toolkit [28] to generate an ANSI C version ofour implementation. For early experimentation wecompiled the generated ANSI C using gcc to run asan executable in a Linux environment for parametertuning.

Rather than construct a VoIP capability fromthe ground up, we constructed an end-to-end en-crypted VoIP teleconferencing capability by inte-grating our HE library and the Encode/Decode func-tionality with an existing open-source VoIP tele-conferencing library. We selected the Mumble VoIPlibrary for this integration because Mumble is ma-ture, offers high sound quality and runs on a varietyof platforms. We focused on iOS clients becausethese clients were the most mature open sourceversions of the Mumble clients.

By integrating with the Mumble library, ourend-to-end encrypted VoIP library has the sameuser interface, usage model and deployment modelas the standard Mumble capability. An image ofthe Graphical User Interface (GUI) of the modifiedclient running on an iPod Touch can be seen inFigure 5 where the client is running in push-to-talkmode. Although we did not make this app publiclyavailable, the modified Mumble software can alsobe deployed through an app store model, or asbinaries which can be loaded onto iOS devices withdevelopment tools.

Client integration was relatively straight for-ward with several notable exceptions to reducepacket drops and improve sound quality:

Fig. 5: The Push-To-Talk Client GUI

1) The client application generated voice pack-ets that contained 480 samples at 48 kHz,or 10 ms worth of sound. The sound driveron the client, however, generated slightlylarger packets. As a result, the period of thesound packets was slightly larger than 10ms, and every so often two sound packetswere generated back to back. The originalserver set a 10 ms timer and accepted onepacket every 10 ms. We added a smallqueue at the server so we did not droppackets when we received two packets inless than 10 ms.

2) We generated new frame numbers at theserver as opposed to re-using the clientframe numbers. The clients correlated theframe numbers with time. This modificationreduces time jitter.

3) The encryption and decryption operationsfor our applications were processor inten-sive, and were run in batches of severalaudio packets at once. We moved the en-cryption and decryption operations to a lowpriority thread and had a higher prioritythread accept and queue new audio packets(both from the network, and from the mi-crophone). This helped prevent a situationwhere audio packets are dropped becausethe client was busy decrypting or encrypt-

Page 11: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

11

ing.4) We improved the efficiency noise sample

generation by using an approximate dis-crete Gaussian operation that mapped 32-bit uniform distributed samples from theclient pseudo-random generator to multiplediscrete Gaussian samples using a look-uptable.

Our modifications, by improving sound qualityand efficiency and reducing packet drops, enablesincreases in audio sampling rates and and loweredencryption lag. As a result, the client is capableof generating 16 bit audio samples at a rate of48 kHz, which is a standard configuration of theMumble VoIP client. This throughput provides asound quality substantially better than PSTN.

9 PARAMETERIZATION AND ADAPTATION FOREND-TO-END ENCRYPTED VOIP TELECONFER-ENCING

We make our initial high-level end-to-end en-crypted VoIP teleconferencing capability goals moreconcrete by choosing parameters for the encod-ing/decoding and cryptosystem operations so that:- The VoIP signal data is sampled at a high enoughrate that adequately high voice quality is encodedinto plaintext.- The end-to-end VoIP transmission cannot haveand end-to-end lag of more than the classical 150mslimit on end-to-end lag [22].- We target a throughput of not too much larger thanwould be needed for AES-encrypted operations,meaning a ciphertext expansion of less than 10, or atotal bandwidth usage of less than 6.4Mbps whichis supportable by most commodity wireless datatransmission protocols [29].- The plaintext is securely encrypted into VoIP ci-phertext with sufficiently large work factor prevent-ing unauthorized decryption. For our encryptionscheme, this means δ ≤ 1.007 for 80 bits of security[21].- The homomorphic mixing of multiple encryptedVoIP feeds can be successfully decrypted and thendecoded. We discuss the case for at most t = 4 activespeakers at any instance in time can be supportedduring a session, but this parameter generalizes.- All these operations need to be run efficientlyon commodity hardware, such as iOS and Linuxdevices with 64 bit processors. This mean that thelargest ciphertext moduli has to be less than 264.

To achieve these more concrete goals in thecontext of our introduced cryptosystem and encod-

ing and decoding functionality, the parameters weadjust are:

• q, the ciphertext modulus.• p, the plaintext modulus.• n, the ring dimension.• y, the VoIP sample bit depth.• φ, the VoIP sample rate.• m, the number of samples per VoIP block.• s, the number of bits of padding in the en-

coding scheme.

We consequently set our optimization con-straints as:

• q < 264, the maximum ciphertext modulusfor optimal execution on 64-bit processors.

• p > 2dm/ne(y+2), to guarantee lossless encod-ing from the definition of the Encode op-eration in Section 9 for 4 actively speakingclients.

• q > 4pr√nw, to guarantee correct decryption

where r = 3 is a discrete Gaussian noiseparameter and w = 4.5 is an assurance pa-rameter for the encryption system we usefrom [4], [5].

• n > (log q)/(4 log δ), to guarantee security[26], [30].

• s ≥ log2(t), to guarantee that the bits of 0padding between samples in the encodingand our correctness proofs hold when themaximum number of speakers are talking.

With these optimization constraints we want to:- Minimize Encode, Enc, EvalMix, Dec and Decoderuntime. We find in [4] that the runtimes of theencryption operations are linear in n. Naive algo-rithms for the runtime of Encode and Decode are alsolinear in n. We thus attempted to minimize n whilesatisfying all other constraints.- Minimize (qn)/(2y+2m), the ciphertext expansion.- Guarantee throughput φ2y does not cause suffi-cient packet drops to negatively affect voice quality.We essentially fix φ and evaluate voice quality ex-perimentally.

Given the large number of variables, the highlynonlinear optimization constraints for those that wecan analyze (such as ciphertext expansion) and theprevalence of experimentally evaluated conditions,such as packet drop rates, voice quality and soft-ware runtime, we performed initial experimenta-tion using plaintext operations to find initial datasampling rates and encoding parameters. From apragmatic standpoint for the VoIP teleconferencingapplication, we observed experimentally that there

Page 12: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

12

is little practical sense to accommodate more than4 active speakers. It is very difficult for any one lis-tener to follow a conversation with so many simul-taneously speaking participants. We identified ex-perimentally that plaintext mixing with 40ms blocksof 48KHz VoIP data of 5-bit depth provided goodsound quality adequate for normal conversation.The native app provides sampled data in groupingsof 10ms of data, but we merge 4 blocks of 10ms ofdata together for an m of 1920 samples per audioblock.

We use a ring dimension n = 1024 with 1920samples per block, and we have a resulting plaintextmodulus of p = 214 = 16384. We then set q to be1236950597633, which requires 41 bits to represent.These parameters results in a root Hermite factorupper-limit of δ = 1.00696 which is currently be-lieved to be adequately secure and provide at least80 bits of security [21]. From the audio input (5bits of 1920 samples) and ciphertext output (41 bitsof 1024 samples), we have a resulting ciphertextexpansion of roughly 4.37.

With these parameter settings we observed thatwhen running on an iPhone 5s, the encoding andencryption operation took a mean time of 9.2msand decryption and decoding took 4.6ms. The sum-mation on the VoIP server took 0.5ms. Transport ofencrypted VoIP traffic from Cambridge MA to theNorthern Virginia Amazon AWS servers took anaverage of 15ms. This results in a total mean latencyof less than 150ms for VoIP traffic. Taken together,these parameters meet our performance goals, al-though experimental use indicates that latency ison the high side and users who speak quicklywould need to speak more slowly than normal tohave a conversation, although not so slow as to beunnatural.

10 END-TO-END ENCRYPTED VOIP TELE-CONFERENCING EXPERIMENTATION

We experimentally evaluated the performance ofthe VoIP service deployed in each of the AmazonAWS data centers across the world. We then in-stalled the client software in iPod Touch and iPhone5s clients. We connected each client to each of theservers through various connection types in themetro area of Boston, a major city in the NewEngland region of the United States. These con-nections included 802.11n wireless enterprise gate-way connected to a high-speed enterprise Internetconnection, 4G LTE, 3G and 2G connections overthe T-mobile commercial wireless service and an

AT&T DSL connection in a rural area in the stateof Connecticut.

We measured the upload and downloadthroughput of the connections, the drop rate of VoIPpackets routed through the various server locationsand the subjective quality of the VoIP teleconfer-ence session as defined by the experimenters. Theupload and download throughput was measuredby Ookla throughput measurement app [31] onthe client devices. VoIP drop rates were measuredexperimentally by instrumenting the VoIP serversto measure drop rates. Voice quality was measuredin comparison to PSTN voice quality where “Ex-cellent” means the VoIP conversation was betterthan PSTN, “Good” means the VoIP conversationwas comparable PSTN, “Poor” means the VoIPconversation was worse than PSTN but still usablefor communication, and “Unusable” means the con-nection was useless for communication.

All of the experiments were run over a 2 hourperiod on a weekday evening using two iPod Touchclients with servers deployed on the Amazon AWSt1.micro instances [32]. The clients were each onindependent connections to the Internet at all times,so there was low likelihood of one client contribut-ing substantially to congestion or packet drops forthe other client.

Table 1 shows the upload and downloadthroughput observed by each of the clients for eachof the connections. Note that the rural DSL serviceprovided better throughput than the 2G connectionand better download throughput than the 3G con-nection.

TABLE 1: Experimentally Measured Data Through-put in Mb/s for Connection Types

Connection Type Upload RateMb/s

Download RateMb/s

Enterprise 802.11n 38.22 36.534G LTE 35.82 17

3G 6.31 0.432G 0.2 0.16

Rural DSL 2.55 0.47

We observed that all of the last-mile wirelessconnections supported acceptable VoIP teleconfer-ence capabilities except for the 2G connections.Over all of the acceptable connections, the lowestupload or download throughput observation wason the 3G download: 0.43Mb/s Because the VoIPdownload and upload data flows are symmetric,this implies at least a 0.43Mb/s upload and down-load throughput connection is required to supportVoIP teleconferencing using our prototype.

Page 13: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

13

Table 2 shows the packet drop rates observedat each of the servers at the various Amazon AWSlocations for the various client connection types.Note that distance between the client and serverhad only a minor impact on drop rates, while theconnection type had a very large impact on droprates. This implies that the connection could be abottleneck for the VoIP service.

The subjective VoIP teleconference quality ob-served through each of the servers at the variousAmazon AWS locations for the various client con-nection types had no observed impact on voicequality. Conversely, we found that connection typehad a very large impact on voice quality. Enterprise802.11n, 4G LTE, 3G all provided adequate voicequality, while 2G and Rural DSL connections pro-vided poor to unusable voice quality.

In addition to our tests of connection-serverpairings, we also tested the impact of having morethan t = 2s clients actively speaking at any onetime.. For this experiment we established a ses-sion with 3 iPod Touch and 4 iPhone 5s clientsat various locations on the eastern United Statesseaboard to a single VoIP server in the AmazonAWS Northern Virginia data center. With these 7client connections running simultaneously with 4users speaking simultaneously we were able to holdmeaningful conversation among the 4 simultaneousspeakers and no voice distortion was observed bythe 3 non-speaking client users. With 6 users speak-ing simultaneously and 1 silent listening client, weexperimentally observed that no clipping due tomixing unless the 6 clients are talking so loudlyas to nearly be screaming. However, when morethan 4 participants attempt to speak simultaneously,the conversation was ineffective not because of lim-itations in the homomorphic mixing, but due tothe inability of participants to mentally track thesimultaneous speaking of more than 4 people. Assuch, limiting the number of speakers to 4 is areasonable choice for practical teleconferencing usecases.

11 SECURITY ANALYSIS

Although innovation in our prototype is intendedto address honest-but-curious adversaries who canobserve all server behavior, our design can be fur-ther augmented to provide broader security fea-tures. For instance, if a client is compromised, theadversary at the compromised client, even withfull observation of network traffic from all otherclients, will be unable to determine from which

other clients specific VoIP signals are coming fromif there is at least two other clients. This featurearises from the inability to map mixed signals totheir sources based on ciphertext observations. Thisnon-identification feature comes from the featureof the lattice encryption system in that encryptionsof two different signals are indistinguishable, andthese two encrypted signals are indistinguishablefrom the encryption of white noise because of thenoise added to ciphertext during encryption.

Recent publications have introduced the conceptof subfield lattice attacks [33], [34]. Although theseattacks are relevant for some variants of the NTRUcryptosystems that rely on the DSPR [5], our pa-rameter settings do not meet the conditions of theseattacks.

A possible effective attack on the end-to-endencrypted VoIP teleconferencing activity would befor a compromised client to inject noise signalsinto the teleconferencing mixer, thus reducing theability of the client participants to converse withone another. This attack is a vulnerability of generalteleconferencing systems. Any participant can injectnoise into the conversation. A feasible longer-termapproach to reduce the effectiveness of this threatwould be to perform homomorphic filtering againstnoise injection, and possibly provide some function-ality for the clients to tell the mixer which otherparticipants they want to listen to. Over the shortterm, the application of authentication techniquesto identify likely adversaries could be effective inaddressing these issues.

Another issue is one of key distribution. Thekeys in our cryptosystem are relatively small:41*1024 bits which is a little more than 5MB. Thesekeys could easily be distributed through encryptedflat files using existing public key infrastructures,or even using Diffie-Hellman type techniques togenerate shared secret keys. Furthermore, our as-sumption of prior key coordination can be removedby generalizing the scheme to rely on multi-keyfeatures of the LTV scheme [5].

12 DISCUSSION AND ONGOING ACTIVITIES

In this paper we present and discuss a scalableapproach to VoIP teleconferencing that providesend-to-end encryption. Our assessment is that wemet our design goals for practical and secure tele-conferencing as discussed in Section 3.

A further aspect of our layered architecturevision is an ability to mix-and-match a comput-ing substrate at the server for increased scalability

Page 14: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

14

TABLE 2: Packet Drop Rates For Various Server Locations and Client Internet Connection Types.

Server Location Client Location Enterprise 802.11n 4G LTE 3G 2G Rural DSLN. Virginia Connecticut 0% 10% 10% 66% 33%

Oregon Connecticut 0% 2% 3% 71% 35%N. California Connecticut 0% 7% 8% 67% 34%

Ireland Connecticut 0% 7% 7% 73% 38%Singapore Connecticut 5% 2% 2% 68% 39%

Tokyo Connecticut 1% 3% 4% 69% 37%Sydney Connecticut 5% 3% 3% 67% 34%

Sao Paulo Connecticut 0.30% 4% 6% 76% 34%

and throughput. Although we only utilize limited-depth Additive Homomorphic Encryption capabil-ities, our encryption system implementation is ascaled-down version of a Fully Homomorphic En-cryption (FHE) system [4]. As such, our teleconfer-ence implementation can be generalized to supportFHE operations with software modifications andchanges in parameter selection to maintain security.The engineering trade-off is that as deeper circuitsare computed at the server, end-to-end latency in-creases.

Prior general implementation efforts [4] showsEvalMult runtimes are currently too large for com-fortable interactive conversations if these operationsare to be performed at the VoIP server. However, ifruntime performance of EvalMult were to improveby only an order of magnitude or two, the latency toexecute deeper circuits than the homomorphic mix-ing would become feasible. This provides a designpath for deeper operations on the encrypted data.An initial feasible application would be encryptednoise rejection at the VoIP server by implementinga low pass filter. Low-pass filters, such as weightedmoving averages, can be supported in relativelyshallow depth circuits with a few multiply op-erations. This more general design would enableprotection against some potentially practical attacksthat could be made by an adversary such as noiseinjection attacks where an adversary inserts noiseinto a VoIP teleconferencing session to reduce theability of participants to hear one another.

Even deeper operations, such as voice recog-nition operations run on encrypted data, wouldrequire circuits so deep that current HE imple-mentations would need to use an effective boot-strapping operation to maintain security. Bootstrap-ping in current FHE implementations takes mul-tiple orders of magnitude longer to perform thanEvalMult operations, and multiple breakthroughs inthe implementation of bootstrapping, or even com-pletely new and more efficient FHE schemes, will beneeded for even deeper circuits to be practical to sat-

isfy VoIP latency constraints. Bootstrapping is likelyto be first implemented practically in custom high-performance hardware, such as in a custom FPGA-or ASIC-based co-processor [35]–[37]. When usedin conjunction of a variation of a not previouslyimplemented bootstrapping scheme [38], our designoffers the possibility for a much more general VoIPteleconferencing capability that incorporates signaldetection and noise filtering operations on the en-crypted VoIP channels.

Our end-to-end encrypted VoIP capability pro-vides a basis for integration with other hardware-accelerated co-processors due to our use of a layeredsoftware services stack to provide high-level homo-morphic encryption operations supported by lower-level lattice-based primitive implementations. Partof our longer-term vision is to provide software in-terfaces in our design for our highly optimized im-plementations of the basic FHE operations (KeyGen,Enc, EvalAdd, EvalMult, Dec) for both general andspecific applications.

ACKNOWLEDGEMENT

The authors wish to acknowledge the support ofChris Peikert in designing and implementing thecryptosystem, advice on the VoIP integration fromDavid Archer, suggestions related to VoIP mixingfrom abhi shelat and suggestions on parameter se-curity from Yuriy Polyakov.

REFERENCES

[1] M. Baugher, D. McGrew, M. Naslund, E. Carrara, andK. Norrman, “The secure real-time transport protocol(SRTP),” 2004.

[2] P. Zimmermann, A. Johnston, and J. Callas, “ZRTP: Mediapath key agreement for secure RTP,” draft-zimmermann-avt-zrtp-04 (work in progress), 2007.

[3] R. Zhang, X. Wang, R. Farley, X. Yang, and X. Jiang, “Onthe feasibility of launching the man-in-the-middle attackson VoIP from remote attackers,” in Proceedings of the4th International Symposium on Information, Computer, andCommunications Security, ser. ASIACCS ’09. New York,NY, USA: ACM, 2009, pp. 61–69. [Online]. Available:http://doi.acm.org/10.1145/1533057.1533069

Page 15: Scalable, Practical VoIP Teleconferencing with End-to-End …rohloff/papers/2017/Rohloff... · 2016-10-04 · crypted VoIP signals over multiple servers, with the restriction that

15

[4] K. Rohloff and D. B. Cousins, “A scalable implementationof fully homomorphic encryption built on NTRU,” inProceedings of the 2nd Workshop on Applied HomomorphicCryptography (WAHC), 2014.

[5] A. Lopez-Alt, E. Tromer, and V. Vaikuntanathan, “On-the-fly multiparty computation on the cloud via multikey fullyhomomorphic encryption,” in STOC, 2012, pp. 1219–1234.

[6] J. Hoffstein, J. Pipher, and J. H. Silverman, “NTRU:A ring-based public key cryptosystem,” in AlgorithmicNumber Theory, ser. Lecture Notes in Computer Science,J. P. Buhler, Ed. Springer Berlin Heidelberg, 1998,vol. 1423, pp. 267–288. [Online]. Available: http://dx.doi.org/10.1007/BFb0054868

[7] D. W. Archer and K. Rohloff, “Computing with data pri-vacy: Steps toward realization,” IEEE Security & Privacy,no. 1, pp. 22–29, 2015.

[8] M. Bellare and P. Rogaway, “Entity authentication andkey distribution,” in Advances in CryptologyCRYPTO93.Springer, 1994, pp. 232–249.

[9] A. D. Keromytis, “Voice-over-IP security: Research andpractice,” Security & Privacy, IEEE, vol. 8, no. 2, pp. 76–78, 2010.

[10] J. Launchbury, D. Archer, T. DuBuisson, and E. Mertens,“Application-scale secure multiparty computation,” inProgramming Languages and Systems, ser. Lecture Notesin Computer Science, Z. Shao, Ed. Springer BerlinHeidelberg, 2014, vol. 8410, pp. 8–26. [Online]. Available:http://dx.doi.org/10.1007/978-3-642-54833-8 2

[11] C. Gentry and S. Halevi, “Implementing Gentry’s fullyhomomorphic encryption scheme,” in Advances in Cryp-tology, EUROCRYPT 2011, ser. Lecture Notes in ComputerScience, K. Paterson, Ed. Springer Berlin / Heidelberg,2011, vol. 6632, pp. 129–148.

[12] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(Leveled)fully homomorphic encryption without bootstrapping,”ACM Transactions on Computation Theory (TOCT), vol. 6,no. 3, p. 13, 2014.

[13] C. Gentry, A. Sahai, and B. Waters, “Homomorphic en-cryption from learning with errors: Conceptually-simpler,asymptotically-faster, attribute-based,” in Advances inCryptology–CRYPTO 2013. Springer, 2013, pp. 75–92.

[14] C. Gentry, S. Halevi, and N. P. Smart, “Homomorphicevaluation of the AES circuit,” in Advances in Cryptology–CRYPTO 2012. Springer, 2012, pp. 850–867.

[15] M. Naehrig, K. Lauter, and V. Vaikuntanathan, “Canhomomorphic encryption be practical?” in Proceedingsof the 3rd ACM workshop on Cloud computing securityworkshop, ser. CCSW ’11. New York, NY, USA:ACM, 2011, pp. 113–124. [Online]. Available: http://doi.acm.org/10.1145/2046660.2046682

[16] Y. Doroz, Y. Hu, and B. Sunar, “Homomorphic AES eval-uation using NTRU.” IACR Cryptology ePrint Archive, vol.2014, p. 39, 2014.

[17] C. Gentry and S. Halevi, “HElib,” https://github.com/shaih/HElib, 2014.

[18] L. Ducas and D. Micciancio, “FHEW: Bootstrapping ho-momorphic encryption in less than a second,” in Advancesin Cryptology–EUROCRYPT 2015. Springer, 2015, pp. 617–640.

[19] D. Wu and J. Haven, “Using homomorphic encryption forlarge scale statistical analysis,” 2012.

[20] L. Liu, M. Fukumoto, and S. Saiki, “An improved mu-law proportionate NLMS algorithm,” in Acoustics, Speechand Signal Processing, 2008. ICASSP 2008. IEEE InternationalConference on. IEEE, 2008, pp. 3797–3800.

[21] T. Lepoint and M. Naehrig, A Comparisonof the Homomorphic Encryption Schemes FV andYASHE. Cham: Springer International Publish-ing, 2014, pp. 318–335. [Online]. Available:http://dx.doi.org/10.1007/978-3-319-06734-6 20

[22] S. Na and S. Yoo, “Allowable propagation delay for VoIPcalls of acceptable quality,” in Advanced Internet Servicesand Applications. Springer, 2002, pp. 47–55.

[23] V. Lyubashevsky, C. Peikert, and O. Regev, “A toolkitfor ring-lwe cryptography.” in EUROCRYPT, vol. 7881.Springer, 2013, pp. 35–54.

[24] C. Peikert, “A decade of lattice cryptography,” Foundationsand Trends R© in Theoretical Computer Science, vol. 10, no. 4,pp. 283–424, 2016.

[25] N. Gama and P. Q. Nguyen, “Predicting lattice reduction,”in EUROCRYPT, 2008, pp. 31–51.

[26] R. Lindner and C. Peikert, “Better key sizes (and attacks)for LWE-based encryption,” in CT-RSA, 2011, pp. 319–339.

[27] Y. Chen and P. Q. Nguyen, “BKZ 2.0: Better lattice se-curity estimates,” in ASIACRYPT, ser. Lecture Notes inComputer Science, vol. 7073. Springer, 2011, pp. 1–20.

[28] MATLAB, R2012b. Natick, Massachusetts: The Math-Works Inc., 2012.

[29] E. Perahia and R. Stacey, Next Generation Wireless LANS:802.11 n and 802.11 ac. Cambridge University Press, 2013.

[30] D. Micciancio and O. Regev, “Lattice-based cryptogra-phy,” in Post Quantum Cryptography. Springer, February2009, pp. 147–191.

[31] Ookla, 2014. [Online]. Available: https://www.ookla.com/[32] Amazon AWS Instance Types, 2014. [Online]. Available:

http://aws.amazon.com/ec2/instance-types/[33] M. Albrecht, S. Bai, and L. Ducas, “A subfield lattice attack

on overstretched NTRU assumptions,” IACR CryptologyePrint Archive, vol. 2016, p. 127, 2016.

[34] J. H. Cheon, J. Jeong, and C. Lee, “An algorithm for NTRUproblems and cryptanalysis of the GGH multilinear mapwithout a low level encoding of zero,” IACR CryptologyePrint Archive, vol. 2016, p. 139, 2016.

[35] X. Cao, C. Moore, M. O’Neill, E. O’Sullivan, and N. Han-ley, “Accelerating fully homomorphic encryption over theintegers with super-size hardware multiplier and modularreduction.” IACR Cryptology ePrint Archive, vol. 2013, p.616, 2013.

[36] D. B. Cousins, J. Golusky, K. Rohloff, and D. Sumorok,“An fpga co-processor implementation of homomorphicencryption,” in High Performance Extreme Computing Con-ference (HPEC), 2014 IEEE. IEEE, 2014, pp. 1–6.

[37] E. Ozturk, Y. Doroz, B. Sunar, and E. Savas, “Acceleratingsomewhat homomorphic evaluation using FPGAs,” Cryp-tology ePrint Archive, Report 2015/294, Tech. Rep., 2015.

[38] J. Alperin-Sheriff and C. Peikert, “Practical bootstrappingin quasilinear time,” in Advances in Cryptology CRYPTO2013, ser. Lecture Notes in Computer Science, R. Canettiand J. Garay, Eds. Springer Berlin Heidelberg, 2013, vol.8042, pp. 1–20. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-40041-4 1