  • Group testing for connected communities

    Pavlos Nikolopoulos† Sundara Rajan Srinivasavaradhan‡ Tao Guo‡

    Christina Fragouli‡ Suhas Diggavi‡

    †EPFL, Switzerland ‡University of California Los Angeles, USA

    Abstract

    In this paper, we propose algorithms that leverage a known community structure to make group testing more efficient. We consider a population organized in disjoint communities: each individual participates in a community, and its infection probability depends on the community (s)he participates in. Use cases include families, students who participate in several classes, and workers who share common spaces. Group testing reduces the number of tests needed to identify the infected individuals by pooling diagnostic samples and testing them together. We show that if we design the testing strategy taking into account the community structure, we can significantly reduce the number of tests needed for adaptive and non-adaptive group testing, and can improve the reliability in cases where tests are noisy.

    1 Introduction

    Group testing pools together diagnostic samples to reduce the number of tests needed to identify infected members in a population. In particular, if in a population of n members we have a small fraction infected (say $k \ll n$ members), we can identify the infected members using as few as $O(k \log(n/k))$ group tests, as opposed to n individual tests [Du and Hwang, 1993, Aldridge et al., 2019, Kucirka et al., 2020]. Triggered by the need for widespread testing, such techniques are already being explored in the context of Covid-19 [Gollier and Gossner, 2020, Broadfoot, 2020, Ellenberg, 2020, Verdun et al., 2020, Ghosh et al., 2020, Kucirka et al., 2020]. Group testing has a rich history of several decades dating back to R. Dorfman in 1943, and a number of variations and setups have been examined in the literature [Dorfman, 1943, Du and Hwang, 1993, Aldridge et al., 2019, Yaakov Malinovsky, 2016].

    This work was accepted at the Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA.

    The observation we make in this paper is that we can leverage a known community structure to make group testing more efficient. The work in group testing that we know of assumes “independent” infections and ignores that an infection may be governed by community spread; we argue that taking into account the community structure can lead to significant savings. As a use case, consider an apartment building consisting of F families that have practiced social distancing; clearly there is a strong correlation on whether members of the same family are infected or not. Assume that the building management would like to test all members to enable access to common facilities. We ask: what is the most test-efficient way to do so?

    Our approach enlarges the regime where group testing can offer benefits over individual testing. Indeed, a limitation of group testing is that it offers fewer or no benefits when k grows linearly with n [Riccio and Colbourn, 2000, Hu et al., 1981, Ungar, 1960, Aldridge, 2019, Aldridge et al., 2019]. Taking into account the community structure allows us to identify and remove from the population large groups of infected members, thus reducing their proportion and converting a linear-regime identification problem into a sparse-regime one. Essentially, the community structure can guide us on when to use individual testing and when to use group testing.

    Our main results are as follows. Assume that n population members are partitioned into F groups that we call families, out of which $k_f$ families have at least one infected member.
    • We derive a lower bound on the number of tests, which for some regimes increases (almost) linearly with $k_f$ (the number of infected families) as opposed to k (the number of infected members).
    • We propose an adaptive algorithm that achieves the lower bound in some parameter regimes.
    • We propose a nonadaptive algorithm that accounts for the community structure to reduce the number of tests when some false positive errors can be tolerated.
    • We propose a new decoder based on loopy belief propagation that is generic enough to accommodate any community structure and can be combined with any test matrix (encoder) to achieve low error rates.
    • We numerically show that leveraging the community structure can offer benefits both when the tests used have perfect accuracy and when they are noisy.

    arXiv:2007.08111v5 [cs.IT] 17 Mar 2021

    We present our models in Section 2, the lower bound in Section 3, our algorithms for the noiseless case in Section 4, and loopy belief propagation (LBP) decoding in Section 5. Numerical results are in Section 6.

    Note: The proofs of our theoretical results (in Sections 3–4) are in the Appendix, along with an extended explanation of the rationale behind our algorithms.

    2 Background and notation

    2.1 Traditional group testing

    Our work extends traditional group testing to infection models that are based on community spread. For this reason, we review here known results from prior work.

    Traditional group testing typically assumes a population of n members out of which some are infected. Two infection models are considered: (i) in the combinatorial model, a fixed number k of infected members is selected uniformly at random among all sets of size k; (ii) in the probabilistic model, each item is infected independently of all others with probability p, so that the expected number of infected members is $\bar{k} = np$. A group test τ takes as input samples from $n_\tau$ individuals, pools them together and outputs a single value: positive if any one of the samples is infected, and negative if none is infected. More precisely, let $U_i = 1$ when individual i is infected and 0 otherwise. Then the traditional group testing output $Y_\tau$ takes a binary value calculated as $Y_\tau = \bigvee_{i \in \delta_\tau} U_i$, where $\bigvee$ stands for the OR operator (disjunction) and $\delta_\tau$ is the group of people participating in the test.
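As a minimal illustration of the noiseless test model above, the sketch below computes the outcome $Y_\tau$ as the OR of the infection indicators in a pool. The function name `group_test` and the example population are assumptions made for illustration; they do not come from the paper.

```python
from typing import Iterable, Sequence


def group_test(infected: Sequence[int], pool: Iterable[int]) -> int:
    """Noiseless group test: 1 if any pooled member is infected, else 0."""
    return int(any(infected[i] for i in pool))


# Hypothetical population of 8 members; members 2 and 5 are infected.
U = [0, 0, 1, 0, 0, 1, 0, 0]
print(group_test(U, [0, 1, 3]))  # prints 0: no infected member in the pool
print(group_test(U, [1, 2, 4]))  # prints 1: member 2 is in the pool
```
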

    The performance of a group testing algorithm is measured by the number of group tests $T = T(n)$ needed to identify the infected members (for the probabilistic model, the expected number of tests needed). Setups that have been explored in the literature include:
    • Adaptive vs. non-adaptive testing: In adaptive testing, we use the outcome of previous tests to decide what tests to perform next. An example of adaptive testing is binary splitting, which implements a form of binary search. Non-adaptive testing constructs, in advance, a test matrix $G \in \{0, 1\}^{T \times n}$ where each row corresponds to one test, each column to one member, and the non-zero elements determine the set $\delta_\tau$. Although adaptive testing uses fewer tests than non-adaptive testing, non-adaptive testing is more practical as all tests can be executed in parallel.
    • Scaling regimes of operation: assume $k = \Theta(n^\alpha)$; we say we operate in the linear regime if α = 1, in the sparse regime if 0 ≤ α < 1, and in the very sparse regime if k is constant.

    Known results. The following are well-established results (see [Johnson, 2017, Du and Hwang, 1993, Aldridge et al., 2019] and references therein):
    • In the combinatorial model, since T tests allow us to distinguish among $2^T$ combinations of test outputs, to identify all k infected members without error we need $2^T \ge \binom{n}{k} \Leftrightarrow T \ge \log_2\binom{n}{k}$. This is known as the counting bound [Johnson, 2017, Du and Hwang, 1993, Aldridge et al., 2019] and implies that the number of tests cannot be smaller than the order of $k \log(n/k)$. In the probabilistic model, a similar bound has been derived for the number of tests needed on average: $T \ge n\,h_2(p)$, where $h_2$ is the binary entropy function.
    • Noiseless adaptive testing can achieve the counting bound for $k = \Theta(n^\alpha)$ and α ∈ [0, 1); for non-adaptive testing, this is also true for α ∈ [0, 0.409], if we allow a vanishing (with n) error [Aldridge et al., 2019, Coja-Oghlan et al., 2020, Coja-Oghlan et al., 2020].
    • In the linear regime (α = 1), group testing offers little benefit over individual testing. In particular, if the infection rate k/n is more than 0.38, group testing does not use fewer tests than 1-by-1 (individual) testing unless high identification-error rates are acceptable [Riccio and Colbourn, 2000, Hu et al., 1981, Ungar, 1960].
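To give a feel for the two lower bounds quoted above, the following sketch evaluates the counting bound $\log_2\binom{n}{k}$ and the probabilistic bound $n\,h_2(p)$ for one hypothetical population. The function names and the numbers (n = 1000, k = 10) are illustrative assumptions, not values from the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h2(p) in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def counting_bound(n: int, k: int) -> float:
    """Combinatorial model: T >= log2 C(n, k)."""
    return math.log2(math.comb(n, k))


def probabilistic_bound(n: int, p: float) -> float:
    """Probabilistic model: T >= n * h2(p) tests on average."""
    return n * binary_entropy(p)


n, k = 1000, 10  # hypothetical population with 1% infected
print(f"counting bound:      {counting_bound(n, k):.1f} tests")
print(f"probabilistic bound: {probabilistic_bound(n, k / n):.1f} tests")
print(f"individual testing:  {n} tests")
```

Both bounds are far below n in this sparse example, which is the regime where group testing is expected to help.
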

    2.2 Community and infection models

    In this paper, we additionally assume a known community structure: the population can be decomposed into F disjoint groups of individuals that we call families. Each family j has $M_j$ members, so that $n = \sum_{j=1}^{F} M_j$. In the symmetric case, $M_j = M$ for all j and $n = FM$. Note that the term “families” is not limited to real families; we use the same term for any group of people that happen to interact, so that they get infected according to some common infection principle.

    We consider the following infection models, which parallel the ones in the traditional setup:
    • Combinatorial Model (I). $k_f$ of the families are infected, namely they have at least one infected member. The rest of the families have no infected members. In each infected family j, there exist $k_m^j$ infected members, with $0 \le k_m^j \le M_j$. The infected families (resp. infected family members) are chosen uniformly at random out of all families (resp. members of the same family). For our analysis, we sometimes consider only the symmetric case, where $k_m^j = k_m$ for each family j.
    • Probabilistic Model (II). A family is infected with probability q, i.i.d. across the families. A member of an infected family j is infected, independently from the other members (and other families), with probability $p_j > 0$. If a family j is not infected, then $p_j = 0$. When $k_m^j = p_j M_j$ the two models behave similarly.
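The probabilistic model (II) can be simulated directly. The sketch below is one possible sampler, written under the assumption that family states and member states are drawn exactly as described above; the function name, the seed, and the example parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_probabilistic_model(
    family_sizes: Sequence[int],
    q: float,
    p: Sequence[float],
    rng: random.Random,
) -> Tuple[List[int], List[List[int]]]:
    """Draw family states V_j ~ Ber(q) and member states U_i as in Model (II)."""
    V: List[int] = []
    U: List[List[int]] = []
    for M_j, p_j in zip(family_sizes, p):
        v_j = 1 if rng.random() < q else 0
        # Members of a non-infected family are never infected (p_j acts as 0).
        members = [1 if (v_j and rng.random() < p_j) else 0 for _ in range(M_j)]
        V.append(v_j)
        U.append(members)
    return V, U


rng = random.Random(0)  # fixed seed so the sketch is reproducible
V, U = sample_probabilistic_model([4, 4, 3, 5], q=0.5, p=[0.8, 0.8, 0.6, 0.9], rng=rng)
k_f = sum(1 for fam in U if any(fam))  # families with at least one infected member
print("family states:", V)
print("member states:", U)
print("families with at least one infected member:", k_f)
```

Note that a family labeled infected can still end up with no infected members; this is the "peculiar event" that the model allows with probability $(1 - p_j)^{M_j}$.
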

    Our goal is two-fold: (a) provide new lower bounds for the number of tests T needed to identify all infected members without error; and (b) design community-aware testing algorithms that are more efficient than traditional group-testing ones, in the sense that they can achieve the same identification accuracy using significantly fewer tests and they can also perform close to the lower bounds in some cases.

    2.3 Noisy testing and error probability

    In this work we assume that there is no dilution noise, that is, the performance of a test does not depend on the number of samples pooled together. This is a reasonable assumption with genetic RT-PCR tests where even small amounts of viral nucleotides can be amplified to be detectable [Saiki et al., 1985, Kucirka et al., 2020]. However, we do consider noisy tests in our numerical evaluation (Section 6) using a Z-channel noise model¹. We remark that this is simply a model one may use; our algorithms are agnostic to this and can be used with any other model.

    Additionally, some of our identification algorithms may return with errors. For this, we use the following terminology: Let $\hat{U}_i$ denote the estimate of the state of $U_i$ after group testing. Zero error captures the requirement that $\hat{U}_i = U_i$ for all members i of the population N. Vanishing error requires that all error probabilities go to zero with n. Sometimes we also distinguish between False Negative (FN) and False Positive (FP) errors: FN errors occur when infected members are identified as non-infected (and vice-versa for FP).
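The Z-channel noise model and the FN/FP terminology can be made concrete with a short simulation. This is a sketch under the assumption that a truly positive pooled outcome flips to negative with probability z and a negative outcome never flips; `z_channel_test` and `count_errors` are hypothetical helper names.

```python
import random
from typing import Sequence, Tuple


def z_channel_test(
    infected: Sequence[int], pool: Sequence[int], z: float, rng: random.Random
) -> int:
    """Pooled test whose positive outcome flips to negative with probability z."""
    clean = int(any(infected[i] for i in pool))
    if clean == 1 and rng.random() < z:
        return 0  # a positive outcome was missed
    return clean  # a negative outcome never flips


def count_errors(truth: Sequence[int], estimate: Sequence[int]) -> Tuple[int, int]:
    """Return (number of false negatives, number of false positives)."""
    fn = sum(1 for u, u_hat in zip(truth, estimate) if u == 1 and u_hat == 0)
    fp = sum(1 for u, u_hat in zip(truth, estimate) if u == 0 and u_hat == 1)
    return fn, fp


rng = random.Random(0)
U = [0, 1, 0, 0]  # hypothetical population; member 1 is infected
outcomes = [z_channel_test(U, [0, 1], z=0.2, rng=rng) for _ in range(1000)]
print("fraction of positive outcomes:", sum(outcomes) / len(outcomes))
print("(FN, FP) for a sample estimate:", count_errors(U, [0, 0, 0, 1]))
```
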

    2.4 Other related work

    The idea of community-aware group testing is explored to some extent in our preprint [Nikolopoulos et al., 2020]. Also, a similar idea of using side-information from contact tracing in decoding is proposed by [Zhu et al., 2020, Goenka et al., 2020], independently from our work. That work is complementary to ours; we focus more on test designs rather than decoding, for which we use well-known algorithms such as COMP and LBP. Finally, test designs, lower bounds and decoding algorithms for independent but not identical priors are investigated by [Li et al., 2014].

    ¹ In a Z-channel noise model, a test output that should be positive flips and appears as negative with probability z, while a test output that is negative cannot flip. Thus: $P(Y_\tau = 1 \mid U_{\delta_\tau}) = \left(\bigvee_{i \in \delta_\tau} U_i\right)(1 - z)$.

    The line of work on graph-constrained group testing (see for example [Cheraghchi et al., 2012, Karbasi and Zadimoghaddam, 2012, Luo et al., 2019]) solves the problem of how to design group tests when there are constraints on which samples can be pooled together, provided in the form of a graph; in our case, individuals can be pooled together into tests freely.

    3 Lower bound on the number of tests

    We compute the minimum number of tests needed to identify all infected members under the zero-error criterion in both community models (I) and (II).

    Theorem 1 (Combinatorial community bound). Consider the combinatorial model (I) (of Section 2.2). Any algorithm that identifies all k infected members without error requires a number of tests T satisfying:

    $$T \ge \log_2\binom{F}{k_f} + \sum_{j=1}^{k_f} \log_2\binom{M_j}{k_m^j}. \tag{1}$$

    For the symmetric case: $T \ge \log_2\binom{F}{k_f} + k_f \log_2\binom{M}{k_m}$.
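The bound of Theorem 1 is easy to evaluate numerically and to compare against the community-agnostic counting bound of Section 2.1. The sketch below does this for one hypothetical symmetric instance (F = 200, M = 5, $k_f$ = 4, $k_m$ = 5); the function names and the instance are assumptions chosen only for illustration.

```python
import math
from typing import Sequence


def counting_bound(n: int, k: int) -> float:
    """Community-agnostic counting bound log2 C(n, k)."""
    return math.log2(math.comb(n, k))


def community_bound(
    F: int, infected_sizes: Sequence[int], infected_counts: Sequence[int]
) -> float:
    """Right-hand side of (1).

    infected_sizes[j] is M_j and infected_counts[j] is k_m^j,
    listed for the k_f infected families only.
    """
    k_f = len(infected_sizes)
    bound = math.log2(math.comb(F, k_f))
    for M_j, k_j in zip(infected_sizes, infected_counts):
        bound += math.log2(math.comb(M_j, k_j))
    return bound


# Hypothetical symmetric instance: every member of each infected family is infected.
F, M, k_f, k_m = 200, 5, 4, 5
community = community_bound(F, [M] * k_f, [k_m] * k_f)
agnostic = counting_bound(F * M, k_f * k_m)
print(f"community bound (1): {community:.1f} tests")
print(f"counting bound:      {agnostic:.1f} tests")
print(f"ratio:               {agnostic / community:.2f}")
```

In this heavily infected, family-sparse instance the community bound is several times smaller than the counting bound.
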

    Observations: We make two observations regarding the combinatorial community bound, in the case where the number of infected family members follows a “strongly” linear regime ($k_m \approx M_j$) and the number of infected families $k_f$ follows a sparse regime (i.e., $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$):

    (a) The bound increases almost linearly with $k_f$ (the number of infected families), as opposed to k (the overall number of infected members). This is because the second term of (1) is small when $k_m \approx M_j$ (each $\binom{M_j}{k_m^j}$ is then close to 1), and, if the infection regime about families is sparse, the following asymptotic equivalence holds: $\log_2\binom{F}{k_f} \sim k_f \log_2\frac{F}{k_f} \sim (1 - \alpha_f)\, k_f \log_2 F$.

    (b) If, in addition to the sparse regime about families, an overall sparse regime ($k = \Theta(n^\alpha)$ for α ∈ [0, 1)) holds, then the community bound may be significantly lower than the counting bound that does not take into account the community structure. Consider, for example, the symmetric case. The asymptotic behavior of the counting bound in the sparse regime is: $\log_2\binom{n}{k} \sim k \log_2\frac{n}{k} \sim k_f k_m \log_2\frac{F}{k_f}$, where the latter is because $k_m \approx M$. So, the ratio of the counting bound to the combinatorial bound scales (as F gets large) as:

    $$\frac{\log_2\binom{n}{k}}{\log_2\binom{F}{k_f} + k_f \log_2\binom{M}{k_m}} \sim \frac{k_f k_m \log_2\frac{F}{k_f}}{k_f \log_2\frac{F}{k_f}} = k_m. \tag{2}$$

    Although simplistic, observation (b) is important for practical reasons. Many times, the population is composed of a large number of families with members that have close contacts (e.g., relatives, work colleagues, students who attend the same classes, etc.). In such cases, we do expect that almost all members of infected families are infected (i.e., $k_m \approx M_j$), even though the overall infection regime may still be sparse. Eq. (2) shows the benefits of taking the community structure into account in the test design, in such a case.

    Theorem 2 (Probabilistic Community bound). Consider the probabilistic model (II) (of Section 2.2). Any algorithm that identifies all k infected members without error requires a number of tests T satisfying:

    $$T \ge F\,h_2(q) + \sum_{j=1}^{F} \left[ q M_j h_2(p_j) - w_j\, h_2\!\left(\frac{1 - q}{w_j}\right) \right] \tag{3}$$

    where $w_j = 1 - q + q(1 - p_j)^{M_j}$.
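The right-hand side of (3) can be evaluated with a few lines of code. The sketch below assumes the reading of (3) in which both terms sit inside the sum over families, and it treats $w_j = 0$ (possible only when q = 1 and $p_j$ = 1) as contributing no correction term; the function names and example parameters are hypothetical.

```python
import math
from typing import Sequence


def h2(x: float) -> float:
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def probabilistic_community_bound(
    q: float, family_sizes: Sequence[int], p: Sequence[float]
) -> float:
    """Right-hand side of (3) for family sizes M_j and member probabilities p_j."""
    total = len(family_sizes) * h2(q)
    for M_j, p_j in zip(family_sizes, p):
        w_j = 1.0 - q + q * (1.0 - p_j) ** M_j  # P(family j has no infected member)
        total += q * M_j * h2(p_j)
        if w_j > 0.0:
            total -= w_j * h2((1.0 - q) / w_j)
    return total


# Hypothetical instance: 200 families of 5 members, 2% infected families,
# and each member of an infected family infected with probability 0.8.
bound = probabilistic_community_bound(0.02, [5] * 200, [0.8] * 200)
print(f"probabilistic community bound: {bound:.1f} tests")
```

A quick sanity check: for single-member families the member states are i.i.d. Bernoulli with parameter $q p$, and the expression reduces to $F\,h_2(qp)$.
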

    Two observations: (a) If for each family j, $p_j$ and $M_j$ are such that $q(1 - p_j)^{M_j} \to 0$ (i.e., the probability of the peculiar event, where a family is labeled “infected” and yet has no infected members, is negligible), the combinatorial and probabilistic bounds are asymptotically equivalent. In particular, using the standard estimates of the binomial coefficient [Ash, 1990, Sec. 4.7], the combinatorial bound in (1) is asymptotically equivalent to $F\,h_2(k_f/F) + \sum_{j=1}^{k_f} M_j h_2(k_m^j/M_j)$, which matches the probabilistic bound in (3): $F\,h_2(q) + q \sum_{j=1}^{F} M_j h_2(p_j) = F\,h_2(\bar{k}_f/F) + \sum_{j=1}^{\bar{k}_f} M_j h_2(\bar{k}_m^j/M_j)$, with $k_f = \bar{k}_f + o(1)$ and $k_m^j = \bar{k}_m^j + o(1)$ in place of their expected values $\bar{k}_f = Fq$ and $\bar{k}_m^j$.

    (b) Theorem 2 extends from zero-error recovery to constant-probability recovery by applying Fano’s inequality (similarly to Thm 1 of [Li et al., 2014]), and in doing so, the right-hand side of (3) gets multiplied by the desired probability of success P(suc).

    4 Algorithms

    4.1 Adaptive algorithm

    Alg. 1 describes our algorithm for the fully adaptive case, which consists of two parts (the interested reader may find the detailed rationale for our algorithm in Appendix B). In both parts, we make use of a classic adaptive-group-testing algorithm AdaptiveTest(), which is an abstraction for any existing (or future) adaptive group-testing algorithm. We distinguish between 2 different kinds of input for AdaptiveTest(): (a) a set of selected members, which is the typical input of group-testing algorithms; (b) a set of selected mixed samples. A mixed sample is created by pooling together samples from multiple members that usually have some common characteristic. For example, mixed sample $x(r_j)$ denotes an aggregate sample of a set of representative members $r_j$ from family j. A mixed sample is “positive” if at least one of the members that compose it is infected, and “negative” otherwise. Because in some cases we only care about mixed samples, we can treat them in the same way as individual samples, and hence use group testing to identify the infection state of mixed samples as we do for individuals.

    Algorithm 1 Adaptive Community Testing

    Û_i is the estimated infection status of member i.
    Û_x is the estimated infection status of a mixed sample x.
    SelectRepresentatives() is a function that selects a representative subset from a set of members.
    AdaptiveTest() is an adaptive algorithm that tests a set of items (mixed samples or members).

     1: for j = 1, ..., F do
     2:     r_j = SelectRepresentatives({i : i ∈ j})
     3: end for
     4: [Û_x(r_1), ..., Û_x(r_F)] = AdaptiveTest(x(r_1), ..., x(r_F))
     5: Set A := ∅
     6: for j = 1, ..., F do
     7:     if Û_x(r_j) = “positive” then
     8:         Use a noiseless, individual test for each family member: Û_i = U_i, ∀i ∈ j.
     9:     else
    10:         A := A ∪ {i : i ∈ j}
    11:     end if
    12: end for
    13: {Û_i : i ∈ A} = AdaptiveTest(A)
    14: return [Û_1, ..., Û_n]

    Part 1 (lines 1-4): The goal of this part is to detect the infection regime inside each family j, so that the family is tested accordingly in the next part: using group testing if j is “lightly” infected, or individual testing otherwise. Our idea is motivated by the result presented in Section 2.1 that group testing is preferable to individual testing only if the infection rate is low (i.e., $p_j \le 0.38$). Therefore, the challenge is to accurately detect the infection regime while spending only a limited number of tests. In this paper, we limit our exploration to using only one mixed sample in this regard, but more sophisticated techniques are also possible, some of which are discussed in Appendix B.2.

    First, a representative subset $r_j$ of family-j members is selected using a sampling function SelectRepresentatives() (lines 1-3). Then, a mixed sample $x(r_j)$ is produced for each subset $r_j$, and an adaptive group-testing algorithm is performed on top of all representative mixed samples (line 4). If our choice of AdaptiveTest() offers exact reconstruction (which is usually the case), then: $\hat{U}_{x(r_j)} = U_{x(r_j)}$.

    Part 2 (lines 5-13): We treat $\hat{U}_{x(r_j)}$ as an estimate of the infection regime inside family j: if $\hat{U}_{x(r_j)}$ is positive, then we consider the family to be heavily infected (i.e., $k_m^j/M_j$ or $p_j \ge 0.38$), otherwise lightly infected (i.e., $k_m^j/M_j$ or $p_j < 0.38$). Since group testing performs better than individual testing only in the latter case (Section 2.1), we use individual testing for each heavily-infected family (lines 7-8), and adaptive group testing for all lightly-infected ones (line 13).

    Analysis for the number of tests. We now compute the maximum expected number of tests needed by our algorithm to detect the infection status of all members without error. For simplicity of notation, we present our results through the symmetric case, where $M_j = M$, $k_m^j = k_m$ (combinatorial case) or $p_j = p$ (probabilistic case), and $|r_j| = R$ for all families. Let SelectRepresentatives() be a simple function that performs uniform (random) sampling without replacement, and consider 2 choices for the AdaptiveTest() algorithm: (i) Hwang’s generalized binary splitting algorithm (HGBSA) [Hwang, 1972], which is optimal if the number of infected members of the tested group is known in advance; and (ii) the traditional binary-splitting algorithm (BSA) [Sobel and Groll, 1959], which performs well even if little is known about the number of infected members.

    Lemma 1 (Expected number of tests - Symmetric combinatorial model). Consider the choices (i) and (ii) for the AdaptiveTest() defined above. Alg. 1 succeeds using a maximum expected number of tests:

    $$\bar{T}_{(i)} \le k_f \phi_c \left( \log_2\frac{F}{k_f \phi_c} + 1 + M \right) + k (1 - \phi_c) \left( \log_2\frac{n - k_f M \phi_c}{k (1 - \phi_c)} + 1 \right) \tag{4}$$

    $$\bar{T}_{(ii)} \le k_f \phi_c \left( \log_2 F + 1 + M \right) + k (1 - \phi_c) \left( \log_2 (n - k_f M \phi_c) + 1 \right), \tag{5}$$

    where the inequalities are because of the worst-case performance of HGBSA and BSA, and $\phi_c$ is the expected fraction of infected families whose mixed sample is positive:

    $$\phi_c = \begin{cases} 0, & \text{if } R = 0 \\ 1 - \binom{M - k_m}{R} \big/ \binom{M}{R}, & \text{if } 1 \le R \le M - k_m \\ 1, & \text{if } M - k_m < R \le M. \end{cases}$$

    Lemma 2 (Expected number of tests - Symmetric probabilistic model). If Alg. 1 uses BSA in place of AdaptiveTest(), then it succeeds using a maximum expected number of tests:

    $$\bar{T} \le F q \phi_p \left( \log_2 F + 1 + M \right) \tag{6}$$
    $$\qquad + \; n q p (1 - \phi_p) \left( \log_2 (n (1 - q \phi_p)) + 1 \right), \tag{7}$$

    where the inequality is due to the worst-case performance of BSA, and $\phi_p = 1 - (1 - p)^R$ is the expected fraction of infected families whose mixed sample is positive.

    Lemmas 1 and 2 are derived (in Appendix B) as a repeated application of the performance bounds of HGBSA and BSA: if out of n members, k are infected uniformly at random, then HGBSA (resp. BSA) achieves exact identification using at most $\log_2\binom{n}{k} + k$ (resp. $k \log_2 n + k$) tests [Aldridge et al., 2019, Baldassini et al., 2013].
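The upper bounds of Lemmas 1 and 2 for the BSA choice can be evaluated numerically. The sketch below implements $\phi_c$, $\phi_p$, the right-hand side of (5), and the right-hand side of (6)-(7), and compares (5) with the classic BSA bound $k \log_2 n + k$ on a hypothetical instance. Function names and the example values are assumptions; terms that are multiplied by zero are skipped so that no logarithm of zero is taken.

```python
import math


def phi_c(M: int, k_m: int, R: int) -> float:
    """Expected fraction of infected families whose mixed sample is positive (Lemma 1)."""
    if R == 0:
        return 0.0
    if R <= M - k_m:
        return 1.0 - math.comb(M - k_m, R) / math.comb(M, R)
    return 1.0


def phi_p(p: float, R: int) -> float:
    """Probabilistic counterpart of phi_c (Lemma 2)."""
    return 1.0 - (1.0 - p) ** R


def bsa_bound_combinatorial(F: int, M: int, k_f: int, k_m: int, R: int) -> float:
    """Right-hand side of (5) in the symmetric combinatorial model."""
    n, k = F * M, k_f * k_m
    phi = phi_c(M, k_m, R)
    part1 = k_f * phi * (math.log2(F) + 1 + M)
    part2 = 0.0
    if phi < 1.0:
        part2 = k * (1.0 - phi) * (math.log2(n - k_f * M * phi) + 1)
    return part1 + part2


def bsa_bound_probabilistic(F: int, M: int, q: float, p: float, R: int) -> float:
    """Right-hand side of (6)-(7) in the symmetric probabilistic model."""
    n = F * M
    phi = phi_p(p, R)
    part1 = F * q * phi * (math.log2(F) + 1 + M)
    part2 = 0.0
    if phi < 1.0 and q * phi < 1.0:
        part2 = n * q * p * (1.0 - phi) * (math.log2(n * (1.0 - q * phi)) + 1)
    return part1 + part2


def classic_bsa_bound(n: int, k: int) -> float:
    """Community-agnostic BSA bound: k * log2(n) + k."""
    return k * math.log2(n) + k


# Hypothetical heavily infected instance: F = 200, M = 5, k_f = 4, k_m = 5, R = 1.
F, M, k_f, k_m, R = 200, 5, 4, 5, 1
print(f"phi_c:                 {phi_c(M, k_m, R):.2f}")
print(f"community-aware (5):   {bsa_bound_combinatorial(F, M, k_f, k_m, R):.1f} tests")
print(f"classic BSA bound:     {classic_bsa_bound(F * M, k_f * k_m):.1f} tests")
```

Setting R = 0 gives $\phi_c = 0$, in which case (5) coincides with the classic BSA bound.
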

    Observations: (a) If heavily/lightly infected families are detected without errors in Part 1, our algorithm can asymptotically achieve (up to a constant) the lower combinatorial bound of Theorem 1 in particular community structures. We show this via 2 examples:

    First, consider a sparse regime for families (i.e., $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$) and a moderately linear regime within each family (i.e., $k_m/M \approx 0.5$). In this case, $\log_2\binom{F}{k_f} \sim k_f \log_2(F/k_f)$ and $\log_2\binom{M}{k_m} \sim M\,h_2(k_m/M) \sim M$, so the bound in (1) becomes $k_f \left( \log_2(F/k_f) + M \right)$. If R is chosen such that all infected families (which are also heavily infected as $k_m/M > 0.38$) are detected without errors (e.g., if $R > M - k_m$), then $\phi_c = 1$; thus, the RHS of (4) becomes almost equal (up to an additive term $k_f$) to the lower bound (1).

    Second, consider the opposite example, where the infection regime for families is very high, while each separate family is lightly infected. In this case, $k = k_f k_m \approx k_f$; therefore, the lower bound becomes: $T \sim k_f \log_2(F/k_f) + k_f k_m \log_2(M/k_m) \approx k \log_2(n/k)$. If R is chosen such that none of the (lightly infected) families is marked as heavily infected in Part 1 (e.g., if R = 0, which reduces to using traditional community-agnostic group testing), then $\phi_c = 0$, and the RHS of (4) is almost equal (up to an additive term k) to the bound in (1).

    (b) The upper bound in (5) shows that our algorithm achieves significant benefits compared to classic BSA when the infected families are heavily infected and R is chosen such that $\phi_c = 1$ (e.g., $R > M - k_m$); this is because $\bar{T}_{(ii)} \le k_f (\log_2 F + 1 + M) \ll k \log_2 n + k$. Also, it achieves the same performance as BSA when families are lightly infected and R is chosen such that $\phi_c = 0$ (e.g., R = 0); this is because $\bar{T}_{(ii)} \le k \log_2 n + k$. Since the former case (heavy infection) is more realistic, our algorithm is expected to perform a lot better than classic group testing in practice.

    The examples in observation (a) and the above analysis indicate two things: First, the knowledge of the community structure is more beneficial when families are heavily infected; traditional group testing performs equally well in low infection rates. Our experiments showed that the community structure helps whenever p > 0.15 and the benefits increase with p. Second, a rough estimate of the families’ infection rate has to be known a priori in order to optimally choose R. In Appendix B, we demonstrate that this is unavoidable in the symmetric scenario we examine and when only one mixed sample per family is used to identify which families are heavily/lightly infected.

    (c) In the most favorable regime for our community-aware group testing, where very few families have almost all their members infected (i.e., $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$ and $k_m \approx M$), even if R is chosen optimally such that $\phi_c = 1$, the ratio of the expected number of tests needed by Algorithm 1 (see (4)) and HGBSA cannot be less than $1/\log(n/k)$, which upper bounds the benefits one may get. In Appendix B.2, we detail this observation and provide an optimized version of our algorithm that improves upon the gain of $1/\log(n/k)$.

    4.2 Two stage algorithm

    The adaptive algorithm can be easily implemented as a two-stage algorithm, where we first perform one round of tests, see the outcomes, and then design and perform a second round of tests. The first round of tests implements part 1, checking whether a family is highly infected or not; the second round of tests implements part 2, performing individual tests for the members of the highly infected families, and in parallel, group testing for the members of the remaining families.

    As we did before for the adaptive case, we here make use of a classic non-adaptive group-testing algorithm, which we call NonAdaptiveTest() and which abstracts any existing (or future) non-adaptive algorithm in the group-testing literature. Thus, to translate Alg. 1 to a two-stage algorithm, lines 4 and 13 simply become:

     4: [Û_x(r_1), ..., Û_x(r_F)] = NonAdaptiveTest(x(r_1), ..., x(r_F))
    13: {Û_i : i ∈ A} = NonAdaptiveTest(A).    (8)

    Number of tests: In some regimes, the two-stage algorithm can operate with the same order of tests as the adaptive algorithm, at the cost of a vanishing error probability: for example, for the tests in line 4, if $k_f = \Theta(F^{\alpha_f})$ with $\alpha_f < 0.409$, we can use approximately $(1 - \alpha_f) F^{\alpha_f} \log_2 F$ tests and achieve vanishing error probability by leveraging non-adaptive algorithms from the literature [Aldridge et al., 2019, Scarlett and Cevher, 2016, Johnson et al., 2019, Coja-Oghlan et al., 2020, Coja-Oghlan et al., 2020].

    4.3 Non-adaptive algorithm

    For simplicity of notation, we describe our non-adaptive algorithm using again the symmetric case.

    Test Matrix Structure. Our test matrix G is divided into two sub-matrices: $G = \begin{bmatrix} G_1 \\ G_2 \end{bmatrix}$.

    • The sub-matrix $G_1$ of size $T_1 \times n$ identifies the infected families using one mixed sample from each family, similar to line 4 of Alg. 1. We want $G_1$ to identify all (non-)infected families with small error probability. If the number of tests available is high, we set $T_1 = F$, i.e., we use one row for each family test. Otherwise, in sparse $k_f$ regimes, we set $T_1$ closer to $O(k_f \log\frac{F}{k_f})$.

    • The sub-matrix $G_2$ of size $T_2 \times n$ has a block matrix structure and contains F identity matrices $I_M$, one for each family. $G_2$ is designed as follows: (i) each block column contains only one identity matrix $I_M$, i.e., each member is tested only once; (ii) each block row i ($i \in \{1, 2, \dots, b\}$) contains $c_i$ identity matrices $I_M$, i.e., there are $c_i$ members included in the corresponding tests. As a result: $T_2 = bM$. An example with F = 6, b = 3, $c_1 = 2$, $c_2 = 1$, $c_3 = 3$ is:

    $$G_2 = \begin{bmatrix} I_M & 0_{M \times M} & 0_{M \times M} & I_M & 0_{M \times M} & 0_{M \times M} \\ 0_{M \times M} & I_M & 0_{M \times M} & 0_{M \times M} & 0_{M \times M} & 0_{M \times M} \\ 0_{M \times M} & 0_{M \times M} & I_M & 0_{M \times M} & I_M & I_M \end{bmatrix}.$$

    Decoding. From the outcome of the tests in $G_1$ we identify the $F - k_f$ non-infected families, and proceed to remove the corresponding columns (non-infected members) from $G_2$. We use the remaining columns of $G_2$ to identify infected members according to the following rules (which follow the logic of combinatorial orthogonal matching pursuit (COMP) decoding [Chan et al., 2014, Cai et al., 2017]):
    (i) A member is identified as non-infected if it is included in at least one negative test in $G_2$.
    (ii) All other members, which are only included in positive tests in $G_2$, are identified as infected.
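The block structure of $G_2$ and the two decoding rules can be sketched as follows. The sketch assumes noiseless tests and that the infected families have already been identified exactly from $G_1$; `build_G2`, `run_tests`, and `comp_decode` are hypothetical names, families are indexed from 0, and M = 2 is chosen only to keep the example small.

```python
from typing import List, Sequence, Set


def build_G2(M: int, block_rows: Sequence[Sequence[int]]) -> List[List[int]]:
    """Build G2: block row i places an identity block I_M in the columns of each listed family."""
    F = sum(len(families) for families in block_rows)
    G2: List[List[int]] = []
    for families in block_rows:
        for m in range(M):  # one test per member position inside the block row
            row = [0] * (F * M)
            for j in families:
                row[j * M + m] = 1  # diagonal entry of family j's identity block
            G2.append(row)
    return G2


def run_tests(G: Sequence[Sequence[int]], U: Sequence[int]) -> List[int]:
    """Noiseless outcomes: a test is positive if it includes any infected member."""
    return [int(any(g and u for g, u in zip(row, U))) for row in G]


def comp_decode(
    G2: Sequence[Sequence[int]], outcomes: Sequence[int], infected_families: Set[int], M: int
) -> List[int]:
    """Apply rules (i) and (ii) to members of the families flagged as infected."""
    n = len(G2[0])
    estimate = [0] * n
    for i in range(n):
        if i // M not in infected_families:
            continue  # column removed: the member's family is non-infected
        in_negative_test = any(G2[t][i] and outcomes[t] == 0 for t in range(len(G2)))
        estimate[i] = 0 if in_negative_test else 1
    return estimate


# The example above (F = 6, b = 3, c = (2, 1, 3)) with 0-based family indices.
M = 2
G2 = build_G2(M, [[0, 3], [1], [2, 4, 5]])

# Infected families 0 and 1 sit in different block rows: decoding is exact.
U_a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
est_a = comp_decode(G2, run_tests(G2, U_a), {0, 1}, M)
print("exact:", est_a == U_a)

# Infected families 0 and 3 share a block row: false positives can appear.
U_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
est_b = comp_decode(G2, run_tests(G2, U_b), {0, 3}, M)
print("estimate:", est_b)
```

In the second case no infected member is missed, but the two non-infected members of families 0 and 3 are flagged, which is the FP mechanism the design of $G_2$ tries to make unlikely.
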

    Error Probability. It is perhaps not hard to see that, after the removal of the columns, the block structure of $G_2$ helps us obtain a test matrix that is close to an identity matrix, and hence perform “almost” individual testing². Also, note that our decoding strategy for $G_2$ leads to zero FN errors. Building on these ideas, the following lemmas guide us through a design of $G_2$ that minimizes the (FP) error probability.

    • Requiring zero-error decoding is too rigid: the optimal solution is the trivial solution that tests each member individually, but this would require $T_2 \ge n$.

    • The symmetric choice $c_i = c$ minimizes the error probability. As said, we design $G_2$ such that FP errors are minimized. A FP may happen if identity matrices $I_M$ corresponding to two or more infected families appear in the same block row of $G_2$. In this case, some non-infected members may be included in the same test with infected members from other families and identified as infected by mistake.

    ² An extended analysis about $G_2$ is in Appendix C.2.

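    Lemma 3. Under models (I) and (II), the probability that there is some block row containing two or more infected families is:

    $$P^{I}_{\text{joint}} = 1 - \frac{\sum_{B \subseteq \{1, 2, \dots, b\} :\, |B| = k_f} \; \prod_{i \in B} c_i}{\binom{F}{k_f}}, \tag{9}$$

    $$P^{II}_{\text{joint}} = 1 - \prod_{i=1}^{b} \left[ (1 - q)^{c_i} + c_i\, q\, (1 - q)^{c_i - 1} \right]. \tag{10}$$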
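Both expressions of Lemma 3 are straightforward to compute for small designs. The sketch below evaluates (9) by enumerating the subsets B and (10) as a product over block rows; the function names are hypothetical, and the block sizes in the example are illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def p_joint_combinatorial(c: Sequence[int], k_f: int) -> float:
    """Eq. (9): probability that some block row holds two or more infected families."""
    F = sum(c)
    # Number of ways to place the k_f infected families in k_f distinct block rows.
    separated = sum(
        math.prod(c[i] for i in B) for B in combinations(range(len(c)), k_f)
    )
    return 1.0 - separated / math.comb(F, k_f)


def p_joint_probabilistic(c: Sequence[int], q: float) -> float:
    """Eq. (10): each block row must contain at most one infected family."""
    prob_all_rows_ok = 1.0
    for c_i in c:
        prob_all_rows_ok *= (1.0 - q) ** c_i + c_i * q * (1.0 - q) ** (c_i - 1)
    return 1.0 - prob_all_rows_ok


# Two designs with F = 6 families and b = 3 block rows, k_f = 2 infected families.
print("c = (2, 1, 3):", p_joint_combinatorial([2, 1, 3], 2))
print("c = (2, 2, 2):", p_joint_combinatorial([2, 2, 2], 2))
print("c = (2, 2, 2), q = 0.1:", p_joint_probabilistic([2, 2, 2], 0.1))
```

For this instance the balanced design gives the smaller probability under model (I).
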

    The following lemma offers a test-matrix design that minimizes the system FP probability, defined as:

    $$P(\text{any-FP}) \triangleq P(\exists\, i : \hat{u}_i = 1 \text{ and } u_i = 0). \tag{11}$$

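    Lemma 4. The P(any-FP) is minimized for both models (I) and (II), if $c_i = c$ for all $i \in \{1, \dots, b\}$.

    Lemma 5. For $G_2$ as in Lemma 4, the system FP probability for models (I) and (II) equals:

    $$P^{I}(\text{any-FP}) = \left[ 1 - \frac{1}{\binom{M}{k_m}} \right] \left[ 1 - \frac{\binom{T_2/M}{k_f} \left( FM/T_2 \right)^{k_f}}{\binom{F}{k_f}} \right],$$

    $$P^{II}(\text{any-FP}) = \left[ 1 - \sum_{i=1}^{M} \left[ p^i (1 - p)^{M - i} \right]^2 \frac{1}{\binom{M}{i}} \right] \cdot \left[ 1 - \left( (1 - q)^{\frac{FM}{T_2} - 1} \left( 1 - q + \frac{FMq}{T_2} \right) \right)^{T_2/M} \right].$$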

    P(any-FP) can be pessimistic; a more practical metric is the average fraction of members that are misidentified (error rate): $R(\text{error}) \triangleq |\{i : \hat{u}_i \ne u_i\}| / n$.

    Lemma 6. For $G_2$ as in Lemma 4, the error rate is calculated for models (I) and (II) as:

    $$R^{I}(\text{error}) < \frac{k_f (M - k_m)}{FM} \cdot P^{I}_{\text{joint}}, \tag{12}$$

    $$R^{II}(\text{error}) < (1 - p)\, q \left[ 1 - (1 - q)^{c - 1} \right]. \tag{13}$$
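As a small numerical companion to Lemma 6, the sketch below evaluates the two error-rate bounds for a symmetric design with $c = F/b$ families per block row. It assumes that b divides F and uses the symmetric form of (9); the function names and example values are hypothetical.

```python
import math


def p_joint_symmetric(F: int, b: int, k_f: int) -> float:
    """Eq. (9) with c_i = c = F / b for every block row (b is assumed to divide F)."""
    c = F // b
    return 1.0 - math.comb(b, k_f) * c ** k_f / math.comb(F, k_f)


def error_rate_bound_combinatorial(F: int, M: int, k_f: int, k_m: int, b: int) -> float:
    """Right-hand side of (12)."""
    return k_f * (M - k_m) / (F * M) * p_joint_symmetric(F, b, k_f)


def error_rate_bound_probabilistic(p: float, q: float, c: int) -> float:
    """Right-hand side of (13)."""
    return (1.0 - p) * q * (1.0 - (1.0 - q) ** (c - 1))


# Hypothetical design: F = 60 families of M = 5 members, b = 20 block rows (c = 3).
print("model (I): ", error_rate_bound_combinatorial(F=60, M=5, k_f=3, k_m=4, b=20))
print("model (II):", error_rate_bound_probabilistic(p=0.8, q=0.05, c=3))
```

With c = 1 every family has its own block row, the tests are individual, and both bounds are zero.
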

    5 Loopy belief propagation decoder

    We now describe our new algorithm for decoding the infection status of the individuals (and families). This is accomplished by estimating the posterior probability of the corresponding individual (or family) being infected via loopy belief propagation (LBP). LBP computes the posterior marginals exactly when the underlying factor graph describing the joint distribution is a tree (which is rarely the case) [Kschischang et al., 2001]. Nevertheless, it is an algorithm of practical importance and has achieved success on a variety of applications. Also, LBP offers soft information (posterior distributions), which can prove more useful than hard decisions in the context of disease-spread management.

    We use LBP for our probabilistic model, because it is fast and can be easily configured to take into account the community structure, leading to more reliable identification. Many inference algorithms exist that estimate the posterior marginals, some of which have also been employed for group testing. For example, GAMP [Zhu et al., 2020] and Monte-Carlo sampling [Cuturi et al., 2020] yield more accurate decoders. However, taking into account the statistical information provided by the community structure proved nontrivial with such decoders. Moreover, the focus of this work is to examine whether benefits from accounting for the community structure (both at the test design and the decoder) exist; hence we think that considering a simple (possibly sub-optimal) decoder based on LBP is a good first step; we defer more complex designs to future work.

    We next describe the factor graph and the belief propagation update rules for our probabilistic model (II). Let the infection status of each family $j$ be $V_j \sim \mathrm{Ber}(q)$. Moreover, let $V(U_i)$ denote the family that $U_i$ belongs to. The joint distribution factorizes as:

    $$P(V_1, \dots, V_F, U_1, \dots, U_n, Y_1, \dots, Y_T) = \prod_{j=1}^{F} P(V_j) \prod_{i=1}^{n} P(U_i \mid V(U_i)) \prod_{\tau=1}^{T} P(Y_\tau \mid U_{\delta_\tau}), \tag{14}$$

    where $\delta_\tau$ is the group of people participating in test $\tau$. Equation (14) can be represented by a factor graph, where variable nodes correspond to each random variable $V_j$, $U_i$, $Y_\tau$ and factor nodes correspond to $P(V_j)$, $P(U_i \mid V(U_i))$, $P(Y_\tau \mid U_{\delta_\tau})$.

    Given that the result of each test is $y_\tau$, i.e., $Y_\tau = y_\tau$, LBP computes the marginals $P(V_j = v \mid Y_1 = y_1, \dots, Y_T = y_T)$ and $P(U_i = u \mid Y_1 = y_1, \dots, Y_T = y_T)$ by iteratively exchanging messages across the variable and factor nodes. The messages are viewed as beliefs about that variable or distributions (a local estimate of P(variable | observations)). Since all random variables are binary, each message is a 2-dimensional vector.

    [Figure 1: Noiseless case: Average number of tests. x-axis: p, from 0.4 to 0.8; y-axis: average # of tests, from 0 to 500; curves: Alg.1: R = 1, Alg.1: R = M, binary splitting, lower bound, non-adaptive, counting bound.]

    We use the factor graph framework from [Kschischang et al., 2001] to compute the messages: each variable node $Y_\tau$ continually transmits the message [0, 1] if $Y_\tau = 1$ and [1, 0] if $Y_\tau = 0$ on its incident edge, at every iteration. Every other variable node ($V_j$ and $U_i$) uses the following rule: for each incident edge $e$, the node computes the elementwise product of the messages from every other incident edge $e'$ and transmits this along $e$. For the factor node messages, we derive closed-form expressions for the sum-product update rules (akin to equation (6) in [Kschischang et al., 2001]). The exact messages are described in Appendix D.
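
    The variable-node rule described above can be sketched as follows. This is a minimal illustration with hypothetical function names; the normalization of each outgoing message (and the uniform fallback when all products vanish) is an assumption commonly made in practice and is not stated in the text. The factor-node messages of Appendix D are not reproduced here.

```python
def observed_test_message(y):
    """Message sent by an observed test node Y_tau: [0, 1] if y == 1, else [1, 0]."""
    return [0.0, 1.0] if y == 1 else [1.0, 0.0]


def variable_to_factor_messages(incoming):
    """Outgoing messages of a binary variable node (V_j or U_i).

    incoming maps each incident edge to the 2-vector [m(0), m(1)] last received
    on that edge. The message sent along edge e is the elementwise product of
    the messages received on every other incident edge.
    """
    outgoing = {}
    for e in incoming:
        m0, m1 = 1.0, 1.0
        for e_other, (a0, a1) in incoming.items():
            if e_other != e:
                m0 *= a0
                m1 *= a1
        total = m0 + m1
        # Assumption: normalize for numerical stability; fall back to uniform.
        outgoing[e] = [m0 / total, m1 / total] if total > 0 else [0.5, 0.5]
    return outgoing
```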

    6 Numerical evaluation

    In this section, we evaluate the benefits (in terms of number of tests and error rate) from taking the community structure into account in practical scenarios, where noiseless or noisy tests are used.

    Experimental setup I: Symmetric. In our simulations, we consider 2 different use cases about the community structure: (Community 1) a neighborhood with F = 200 families of M = 5 members each, and (Community 2) a university department with F = 20 classes of M = 50 students each. In each use case, we also examine 2 different infection regimes: (a) a linear regime, where k̄/n = 0.1; and (b) a sparse regime, where k̄ = √n ≈ 32. Finally, we consider both noiseless tests that have perfect accuracy and noisy tests that follow the Z-channel model from Section 2.3. For each scenario, we average over 500 randomly generated community structures, in which the members/students are infected according to the symmetric probabilistic model (II): first, each family/class is infected at random w.p. q, and then each member/student of an infected family/class gets randomly infected w.p. p.
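
    The sampling step of the symmetric model (II) can be sketched as below. The function name and the seeded generator are hypothetical choices made for illustration; the sketch covers only the infection draw, not the test design or the Z-channel noise.

```python
import random


def sample_symmetric_model_II(F, M, q, p, rng):
    """Draw one infection pattern from the symmetric probabilistic model (II).

    Each of the F families is infected w.p. q; each of the M members of an
    infected family is then infected w.p. p. Members of non-infected families
    stay non-infected.
    """
    family_status = [1 if rng.random() < q else 0 for _ in range(F)]
    member_status = [
        [1 if (family_status[j] and rng.random() < p) else 0 for _ in range(M)]
        for j in range(F)
    ]
    return family_status, member_status
```

    Under this model the expected number of infected members is F·M·q·p, so one way to target a given k̄ (an assumption here, since the text does not say how q was set) is q = k̄/(F·M·p); for Community 1 with k̄ = 32 and p = 0.8 this gives q = 0.04.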

    Results. Our results were similar in all scenarios; for brevity, we show here only the sparse regime. Further results can be found in the Appendix of the supplementary submitted document.

    (i) Noiseless testing – Average number of tests: In this experiment, we measure the average number of tests needed by 3 algorithms that achieve zero-error reconstruction (Alg. 1 with R = 1, Alg. 1 with R = M, and classic BSA), and a nonadaptive algorithm (Section 4.3) that uses T1 = F tests for G1 and has an FP rate around 0.5%. Alg. 1 assumes no prior knowledge of the number of infected families/classes or members/students, hence uses BSA for the AdaptiveTest().

    Fig. 1 depicts our results about Community 2 and for p ∈ [0.4, 0.8]. Both versions of Alg. 1 need significantly fewer tests compared to classic BSA, while staying below the counting bound. This indicates the potential benefits from the community structure, even when the number of infected members is unknown. More interestingly, when R = M, Alg. 1 performs close to the lower bound in most realistic scenarios p ∈ [0.5, 0.8] (as also shown in Section 4.1). The corresponding result in the linear regime was slightly worse: 50-70 tests above the lower bound. Last, the grey line shows the number of tests needed by our nonadaptive algorithm; we observe that even that algorithm can perform better than BSA when p > 0.55 and small FP rates are tolerated.

    (ii) Noiseless testing – Average error rate: We here quantify the additional cost in terms of error rate when one goes from a two-stage adaptive algorithm that achieves zero-error identification to much faster single-stage nonadaptive algorithms. In each run, we first run our two-stage algorithm (Section 4.2), which uses a classic constant-column-weight test design at each stage, and measure the number of tests it requires to achieve zero errors. Then, we use the same number of tests to infer the members’ infection status through 2 nonadaptive algorithms that account for the community structure either at the test matrix (encoding) part or at the decoding part, and a traditional one that does not consider it at all: “COMP with C-encoder” is our nonadaptive algorithm that uses a COMP decoder as described in Section 4.3; “C-LBP with NC-encoder” is an algorithm that uses a classic constant-column-weight test design combined with our LBP decoder from Section 5; and “COMP with NC-encoder” is a traditional nonadaptive algorithm that we use as a benchmark and that uses a constant-column-weight test matrix with a COMP decoder. “C” denotes that the community is taken into account, while “NC” denotes that it is ignored. It is important to note that the number of tests needed by the two-stage algorithm (and therefore by all other algorithms) gets lower as p gets large, something that affects the results (as discussed further below).

    Fig. 2 depicts the FP and FN error rates³ (averaged over 500 runs) as a function of p ∈ [0.3, 0.9] for Community 1. We observe that any community-aware nonadaptive algorithm performs better than traditional nonadaptive group testing (red line) when p > 0.4; the absolute performance gap ranges from 0.4% (when p = 0.3) to 5.5% (when p = 0.9). “COMP with C-encoder” has a stable FP rate of close to 1% across all p values, and a zero FN rate by construction. Our LBP decoder may yield both FN and FP errors. Also, being an approximate inference algorithm, it may produce worse results than COMP when p ∈ [0.42, 0.67], but performs better when the infection rate is higher.
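
    To make the metrics concrete, here is a minimal sketch of a noiseless COMP decoder together with the FP and FN rates. The function names are hypothetical; COMP is implemented in its standard form (anyone who appears in a negative test is declared non-infected, everyone else infected), and the denominators follow our reading of footnote 3, i.e., FN is measured over infected individuals and FP over non-infected ones.

```python
def comp_decode(test_matrix, results):
    """Noiseless COMP decoding.

    test_matrix[t][i] is 1 if member i takes part in test t; results[t] is 0 or 1.
    A member is declared infected unless it appears in at least one negative test.
    """
    n = len(test_matrix[0])
    estimate = [1] * n
    for row, y in zip(test_matrix, results):
        if y == 0:
            for i, included in enumerate(row):
                if included:
                    estimate[i] = 0
    return estimate


def error_rates(truth, estimate):
    """Return (FP rate, FN rate) as fractions of non-infected / infected members."""
    infected = sum(truth)
    healthy = len(truth) - infected
    fn = sum(1 for u, u_hat in zip(truth, estimate) if u == 1 and u_hat == 0)
    fp = sum(1 for u, u_hat in zip(truth, estimate) if u == 0 and u_hat == 1)
    fp_rate = fp / healthy if healthy else 0.0
    fn_rate = fn / infected if infected else 0.0
    return fp_rate, fn_rate
```

    With noiseless tests an infected member can never appear in a negative test, which is why COMP has a zero FN rate by construction.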

    Fig. 3 examines the effect of the number of tests. Starting from the average number of tests used by the two-stage algorithm when p = 0.6, we compute the FP and FN rates for larger numbers of tests. Our experiment shows a transition around T = 240, after which point “C-LBP with NC-encoder” performs better than “COMP with C-encoder”. In fact, “COMP with C-encoder” seems to converge to zero FP errors much more slowly. This result was common for other p values; the transition just occurred at a different T. We therefore conclude that one may use our “COMP with C-encoder” when the number of tests available is limited or when a simple decoder is preferred; otherwise, if the testing budget is larger, “C-LBP with NC-encoder” is the better choice.

    ³ FN rate is the percentage of infected individuals identified as negative, and vice versa for FP.

    Figure 2: Noiseless case: Average error rate with few tests.

    Figure 3: Noiseless case: Average error rate (p = 0.6).

    Figure 4: Noisy case: Average error rate (p = 0.8).

    Figure 5: Asymmetric case: Ratio of the number of tests needed to the lower bound (2). [Box plot; categories: Alg.1: R=1, Alg.1: R=M, BSA; y-axis from 0 to 12.]

    (iii) Noisy testing: Assuming the Z-channel noise of Section 2.3 with parameter z = 0.15, we evaluate the performance of our community-based LBP decoder of Section 5 against an LBP decoder that does not account for the community structure, i.e., one whose factor graph has no Vj nodes.

    Fig. 4 depicts our results for Community 1 and for a selected p = 0.8 and a number of tests as given from the two-stage algorithm of the previous experiments. We observe that the knowledge of the community structure (in C-LBP) reduces both the FP and FN rates achieved by the community-unaware NC-LBP. In particular, FN error rates drop significantly (up to 80% when tests are few), which is important in our context since FN errors lead to further infections. Our results were similar for other p values as well.

    Experimental setup II: Asymmetric. In our asymmetric setup, infections follow again the probabilistic model (II), but this time for each family j, Mj and pj are selected uniformly at random from the intervals [5, 50] and [0.4, 0.8], respectively.

    Fig. 5 is a box plot depicting our results for the sparse regime (q = 3%) over 500 randomly generated instances, as described above. The middle line in the box represents the mean and the ends of the box represent the lower and upper quartiles, respectively. The crosses represent outlier points. BSA needs on average 5.23× (and up to 13×) more tests compared to the probabilistic bound, while the two versions of Algorithm 1 with R = 1 and R = M need only 2.4× and 1.11× (and up to 9.85× and 1.8×) more tests, respectively. Also, the significantly smaller range between the 25th and 75th percentiles of the box plots related to Algorithm 1 indicates a more predictable performance w.r.t. BSA.

    7 Conclusions

    The new observation we make in this paper is that taking into account infection correlations, as dictated by a known community structure, makes it possible to reduce the number of group tests required to identify the infected members of a population and can improve the identification accuracy when the number of tests is fixed.

    In this paper we make this point assuming a nonoverlapping community structure, a specific noise model and binary group testing. We considered a combinatorial and a probabilistic model, derived lower bounds on the number of tests needed, and explored adaptive, two-stage and non-adaptive algorithms for the noiseless case, as well as algorithms for the noisy case. Our algorithms are not always optimal w.r.t. the lower bounds, but perform significantly better than community-agnostic group testing; per our experiments, they need up to 55-75% fewer tests (on average) to achieve the same identification accuracy.

    We posit that such benefits are possible in a number of other community, noise, or group test models; as an example, the follow-up work in [Nikolopoulos et al., 2021] illustrates benefits when the families overlap. Understanding what the benefits are in more sophisticated models remains an open question.


    Acknowledgments

    This work was supported in part by NSF grants #2007714, #1705077 and UC-NL grant LFR-18-548554. We would also like to thank Katerina Argyraki for her ongoing support and the valuable discussions we have had about this project, as well as the anonymous reviewers (especially Reviewer 1) for their constructive comments that helped improve our paper.

    References

    [Aldridge, 2019] Aldridge, M. (2019). Individual testing is optimal for nonadaptive group testing in the linear regime. IEEE Trans. Inf. Theory, 65(4).

    [Aldridge et al., 2019] Aldridge, M., Johnson, O., and Scarlett, J. (2019). Group testing: an information theory perspective. CoRR, abs/1902.06002.

    [Ash, 1990] Ash, R. (1990). Information theory. Dover Publications Inc., New York, NY.

    [Baldassini et al., 2013] Baldassini, L., Johnson, O., and Aldridge, M. (2013). The capacity of adaptive group testing. In 2013 IEEE International Symposium on Information Theory, pages 2676–2680.

    [Broadfoot, 2020] Broadfoot, M. (2020). Coronavirus test shortages trigger a new strategy: Group screening. See https://www.scientificamerican.com/article/coronavirus-test-shortages-trigger-a-new-strategy-group-screening2/.

    [Cai et al., 2017] Cai, S., Jahangoshahi, M., Bakshi, M., and Jaggi, S. (2017). Efficient algorithms for noisy group testing. IEEE Trans. Inf. Theory, 63(4):2113–2136.

    [Chan et al., 2014] Chan, C. L., Jaggi, S., Saligrama, V., and Agnihotri, S. (2014). Non-adaptive group testing: Explicit bounds and novel algorithms. IEEE Trans. Inf. Theory, 60(5):3019–3035.

    [Cheraghchi et al., 2012] Cheraghchi, M., Karbasi, A., Mohajer, S., and Saligrama, V. (2012). Graph-constrained group testing. IEEE Trans. Inf. Theory, 58(1):248–262.

    [Coja-Oghlan et al., 2020] Coja-Oghlan, A., Gebhard, O., Hahn-Klimroth, M., and Loick, P. (2020). Information-theoretic and algorithmic thresholds for group testing. IEEE Trans. Inf. Theory.

    [Coja-Oghlan et al., 2020] Coja-Oghlan, A., Gebhard, O., Hahn-Klimroth, M., and Loick, P. (2020). Optimal group testing. Volume 125 of Proceedings of Machine Learning Research, pages 1374–1388.

    [Cuturi et al., 2020] Cuturi, M., Teboul, O., and Vert, J.-P. (2020). Noisy adaptive group testing using bayesian sequential experimental design. arXiv preprint arXiv:2004.12508.

    [Dorfman, 1943] Dorfman, R. (1943). The detection of defective members of large population. The Annals of Mathematical Statistics, 14:436–440.

    [Du and Hwang, 1993] Du, D.-Z. and Hwang, F. (1993). Combinatorial Group Testing and Its Applications. Series on Applied Mathematics.

    [Ellenberg, 2020] Ellenberg, J. (2020). Five people. one test. this is how you get there. NYtimes.

    [Ghosh et al., 2020] Ghosh, S. et al. (2020). Tapestry: A single-round smart pooling technique for covid-19 testing. medRxiv.

    [Goenka et al., 2020] Goenka, R., Cao, S.-J., Wong, C.-W., Rajwade, A., and Baron, D. (2020). Contact tracing enhances the efficiency of covid-19 group testing. arXiv preprint arXiv:2011.14186.

    [Gollier and Gossner, 2020] Gollier, C. and Gossner, O. (2020). Group testing against covid-19. See https://www.tse-fr.eu/articles/group-testing-against-covid-19.

    [Hu et al., 1981] Hu, M. C., Hwang, F. K., and Wang, J. K. (1981). A boundary problem for group testing. SIAM Jour. on Algebraic Discrete Methods.

    [Hwang, 1972] Hwang, F. K. (1972). A method for detecting all defective members in a population by group testing. Journal of the American Statistical Association, 67(339):605–608.

    [Johnson et al., 2019] Johnson, O., Aldridge, M., and Scarlett, J. (2019). Performance of group testing algorithms with near-constant tests per item. IEEE Trans. Inf. Theory, 65(2):707–723.

    [Johnson, 2017] Johnson, O. T. (2017). Strong converses for group testing from finite block-length results. IEEE Trans. Inf. Theory, 63(9).

    [Karbasi and Zadimoghaddam, 2012] Karbasi, A. and Zadimoghaddam, M. (2012). Sequential group testing with graph constraints. In 2012 IEEE Information Theory Workshop, pages 292–296. IEEE.

    [Kschischang et al., 2001] Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory, 47(2):498–519.


    [Kucirka et al., 2020] Kucirka, L. M., Lauer, S. A., Laeyendecker, O., Boon, D., and Lessler, J. (2020). Variation in false-negative rate of reverse transcriptase polymerase chain reaction–based sars-cov-2 tests by time since exposure. Annals of Internal Medicine, 173:262–267.

    [Li et al., 2014] Li, T., Chan, C. L., Huang, W., Kaced, T., and Jaggi, S. (2014). Group testing with prior statistics. In 2014 IEEE International Symposium on Information Theory, pages 2346–2350.

    [Luo et al., 2019] Luo, S., Matsuura, Y., Miao, Y., and Shigeno, M. (2019). Non-adaptive group testing on graphs with connectivity. Journal of Combinatorial Optimization, 38(1):278–291.

    [Nikolopoulos et al., 2020] Nikolopoulos, P., Srinivasavaradhan, S. R., Guo, T., Fragouli, C., and Diggavi, S. (2020). Community aware group testing. arXiv preprint arXiv:2007.08111.

    [Nikolopoulos et al., 2021] Nikolopoulos, P., Srinivasavaradhan, S. R., Guo, T., Fragouli, C., and Diggavi, S. (2021). Group testing for overlapping communities. In Proc. of the IEEE International Conference on Communications, ICC 2021.

    [Riccio and Colbourn, 2000] Riccio, L. and Colbourn, C. J. (2000). Sharper bounds in adaptive group testing. Taiwanese Journal of Mathematics, pages 669–673.

    [Saiki et al., 1985] Saiki, R. et al. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science, 230(4732):1350–1354.

    [Scarlett and Cevher, 2016] Scarlett, J. and Cevher, V. (2016). Phase transitions in group testing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, pages 40–53. SIAM.

    [Sobel and Elashoff, 1975] Sobel, M. and Elashoff, R. (1975). Group testing with a new goal, estimation. Biometrika, 62(1):181–193.

    [Sobel and Groll, 1959] Sobel, M. and Groll, P. A. (1959). Group testing to eliminate efficiently all defectives in a binomial sample. The Bell System Technical Journal, 38(5):1179–1252.

    [Ungar, 1960] Ungar, P. (1960). Cutoff points in group testing. Comm. Pure Appl. Math, 13:49–54.

    [Verdun et al., 2020] Verdun, C. et al. (2020). Group testing for sars-cov-2 allows up to 10-fold efficiency increase across realistic scenarios and testing strategies. medRxiv.

    [Walter et al., 1980] Walter, S. D., Hildreth, S. W., and Beaty, B. J. (1980). Estimation of infection rates in populations of organisms using pools of variable size. American Journal of Epidemiology, 112(1):124–128.

    [Yaakov Malinovsky, 2016] Yaakov Malinovsky, P. S. A. (2016). Revisiting nested group testing procedures: new results, comparisons, and robustness. American Statistician. See also https://arxiv.org/abs/1608.06330.

    [Zhu et al., 2020] Zhu, J., Rivera, K., and Baron, D. (2020). Noisy pooled pcr for virus testing. arXiv preprint arXiv:2004.02689.


    Appendix

    A Appendix for Section 3: The lower bounds

    A.1 Proof of Theorem 1

    Proof. Ineq. (1) follows from a counting argument: there are only $2^T$ combinations of test results. But, because of the community model I, there are $\binom{F}{k_f} \cdot \prod_{j=1}^{k_f} \binom{M_j}{k_m^j}$ possible sets of infected members, each of which must give a different set of results. Thus,

    $$2^T \ge \binom{F}{k_f} \prod_{j=1}^{k_f} \binom{M_j}{k_m^j},$$

    which reveals the result. The RHS of the latter inequality holds because there are $\binom{F}{k_f}$ combinations of infected families, and for each infected family $j$, there are $\binom{M_j}{k_m^j}$ possible combinations of infected family members; hence, for each combination of $k_f$ infected families, there are $\prod_{j=1}^{k_f} \binom{M_j}{k_m^j}$ possible combinations of infected family members. The symmetric bound is obtained as a corollary by taking $M_j = M$ and $k_m^j = k_m$ for each infected family $j$.
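
    The counting argument translates directly into a small computation. The sketch below (hypothetical function name) returns the smallest integer T satisfying the inequality above; rounding up to an integer is our addition, since T counts tests.

```python
from math import comb


def counting_bound(F, k_f, family_sizes, infected_per_family):
    """Smallest integer T with 2**T >= C(F, k_f) * prod_j C(M_j, k_m^j).

    family_sizes[j] = M_j and infected_per_family[j] = k_m^j, listed for the
    k_f infected families only.
    """
    configurations = comb(F, k_f)
    for M_j, k_j in zip(family_sizes, infected_per_family):
        configurations *= comb(M_j, k_j)
    # For an integer x >= 1, (x - 1).bit_length() equals ceil(log2(x)) exactly.
    return (configurations - 1).bit_length()
```

    Working with exact integers avoids floating-point issues when the number of configurations is very large.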

    A.2 Proof of Theorem 2

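    Proof. Let $\mathbf{V}$ be the indicator random vector for the infection status of all families. By rephrasing [Li et al., 2014, Theorem 1], any probabilistic group testing algorithm using $T$ noiseless tests can achieve a zero-error reconstruction of $\mathbf{U}$ if:

    $$T \ge H(\mathbf{U}) = H(\mathbf{V}) + H(\mathbf{U} \mid \mathbf{V}) - H(\mathbf{V} \mid \mathbf{U}). \tag{A.1}$$

    The first term is: $H(\mathbf{V}) = \sum_{j=1}^{F} H(V_j) = F\, h_2(q)$.

    The second term is calculated as:

    $$\begin{aligned}
    H(\mathbf{U} \mid \mathbf{V}) &= \sum_{v=1}^{n} H(U_v \mid V_{E_v}) \\
    &= \sum_{v=1}^{n} \sum_{x \in \{0,1\}} P(V_{E_v} = x)\, H(U_v \mid V_{E_v} = x) \\
    &= \sum_{v=1}^{n} \big( q\, H(U_v \mid V_{E_v} = 1) + (1-q)\, H(U_v \mid V_{E_v} = 0) \big) \\
    &= \sum_{v=1}^{n} q\, h_2(p_{E_v}) = q \sum_{j=1}^{F} M_j\, h_2(p_j),
    \end{aligned}$$

    where $E_v$ is the family containing vertex $v$.

    Finally, we compute the third term as:

    $$\begin{aligned}
    H(\mathbf{V} \mid \mathbf{U}) &= \sum_{j=1}^{F} H(V_j \mid \mathbf{U}) = \sum_{j=1}^{F} H(V_j \mid U_{S_j}) \\
    &= \sum_{j=1}^{F} P(U_{S_j} = 0)\, h_2\big( P(V_j = 0 \mid U_{S_j} = 0) \big) \\
    &= \sum_{j=1}^{F} \left( 1 - q + q (1 - p_j)^{|S_j|} \right) h_2\!\left( \frac{1-q}{1 - q + q (1 - p_j)^{|S_j|}} \right),
    \end{aligned}$$

    where $S_j$ is the set of members who belong to family $j$ and $|S_j| = M_j$. Combining all 3 terms concludes the proof.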
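
    The three entropy terms can be evaluated numerically with a short sketch such as the one below. The function names are hypothetical, h2 is the binary entropy function in bits, and the guard against a zero probability is our own addition for degenerate inputs.

```python
from math import log2


def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def entropy_lower_bound(q, sizes, probs):
    """H(U) = H(V) + H(U|V) - H(V|U) for model (II).

    q: probability that a family is infected; sizes[j] = M_j; probs[j] = p_j.
    """
    H_V = len(sizes) * h2(q)
    H_U_given_V = q * sum(M_j * h2(p_j) for M_j, p_j in zip(sizes, probs))
    H_V_given_U = 0.0
    for M_j, p_j in zip(sizes, probs):
        all_negative = 1 - q + q * (1 - p_j) ** M_j  # P(U_{S_j} = 0)
        if all_negative > 0:  # guard for the degenerate case q = 1, p_j = 1
            H_V_given_U += all_negative * h2((1 - q) / all_negative)
    return H_V + H_U_given_V - H_V_given_U
```

    For instance, calling entropy_lower_bound(q, [50] * 20, [p] * 20) corresponds to the symmetric Community 2 setting for given q and p.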

    B Appendix for Section 4.1: The noiseless adaptive case

    B.1 Rationale for Alg. 1

    Group testing already has a rich body of literature with near-optimal test designs in the case of independent infections; we do not try to improve upon them. Instead, we adapt these ideas to incorporate the correlations arising from the community structure. All test designs described in this section are conceptually divided into two parts. This split is guided by the community structure and attempts to identify the different infection regimes inside the community, so that the best testing method (individual or classic group testing) is used. We show that such a two-part design is enough to significantly reduce the cost of group testing and also achieve the lower bound in some cases.

    Two-part design: The two parts of Algorithm 1 serve complementary goals:

    The goal of Part 1 is to detect the infection regime inside each family $j$: i.e., to accurately estimate which of the $F$ families have a high infection rate (“heavily” infected) and which have a low or zero infection rate (“lightly” infected). Our interest in detecting the infection regime is motivated by prior work [Riccio and Colbourn, 2000, Hu et al., 1981], which has shown that group testing offers benefits over individual testing only if the infection rate is low ($p_j \le 0.38$). This allows us to define the two regimes as follows: In the combinatorial model I (resp. probabilistic model II), a family is considered heavily infected when $k_m^j / M_j \ge 0.38$ (resp. $p_j \ge 0.38$); conversely, it is considered lightly infected when $k_m^j / M_j < 0.38$ (resp. $p_j < 0.38$).

    Figure 6: Expected number of tests from (7) as a function of size of representative set and probability of infection inside a family.

    For each family $j$, we regard $\hat{U}_{x(r_j)}$ as an estimate of the family’s infection regime. If $\hat{U}_{x(r_j)}$ is positive, we consider the family to be heavily infected and therefore perform individual testing for all of its members. Otherwise, if $\hat{U}_{x(r_j)}$ is negative, we consider the family to be lightly infected and group test its members together with all other lightly infected families. The challenge is therefore to produce accurate enough regime estimates, such that the overall number of tests needed by Alg. 1 to achieve exact infection-status reconstruction for all members $i = 1, \dots, n$ is minimal. We discuss this challenge further below.

    Given all estimates $\hat{U}_{x(r_j)}$ from Part 1, the goal of Part 2 is then to identify all infected members by using the appropriate testing method (group or individual testing) according to the infection regime of each family (light or heavy). In this way, at the end of Part 2, the algorithm returns an estimate $\hat{U}_i$ of the true infection status $U_i$ of each individual member $i$.

    Selection of family representatives: Function SelectRepresentatives() at line 2 refers to any sampling function on a set of family members, as long as it returns a fixed number of members from family $j$. That is, one may use their own sampling function, as long as the accuracy of Part 1 is well defined. In this paper, we consider only random-sampling functions without replacement (i.e. $|r_j|$ members are randomly chosen from the family members and each subset of that size has the same probability of being selected as the representative subset). But perhaps more elaborate sampling functions may be considered in other contexts. For example, if the internal structure of family $j$ can be represented through a contact graph, in which only specific family nodes have external contacts with other families, it may make sense to include (some of) these nodes into the representative group with certainty.
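
    A uniform sampling function without replacement, as considered in this paper, can be sketched in a few lines. The signature below is hypothetical (the paper only names SelectRepresentatives()), and capping R at the family size is our own convenience choice.

```python
import random


def select_representatives(family_members, R, rng=random):
    """Return R members of the family, chosen uniformly at random without replacement.

    Every subset of size R is equally likely to be returned. If R exceeds the
    family size, the whole family is returned (in random order).
    """
    R = min(R, len(family_members))
    return rng.sample(family_members, R)
```

    Passing a seeded random.Random instance as rng makes simulation runs reproducible.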

    When only one mixed sample per family is used to identify the heavily/lightly infected families, the cardinality of the representative subset $|r_j|$ is essential, but the optimal choice of it is not trivial. $|r_j|$ affects the accuracy of the regime estimate, and hence the performance of our algorithm in terms of the expected number of tests that it uses. Unfortunately, choosing the number of representatives optimally is not easy even in the symmetric case that is examined in Section 4.1. Ideally, in the symmetric case, we would like to choose $|r_j| = R$ such that the bounds in Lemmas 1 and 2 are minimized. However, this requires solving equations of the form $y e^y = x$, which is generally possible through Lambert functions for $x \ge -1/e$, but the latter does not hold in our case. Fig. 6 demonstrates that there exists no unique $R$ that is optimal for any infection probability $p$ in (0, 1) through an example of $F = 50$ families with $M = 60$ members each. The figure plots the bound of Lemma 2 as a function of $p$ and $R$. As we can see, there is no single minimizer $R^*$: if $p < 0.15$, then $R$ must be picked equal to 0 (which yields traditional group testing); otherwise, if $p > 0.15$, then $R$ must be selected equal to $M$.

    Therefore, in order to optimally choose R, a rough estimate about p has to be known a priori. If the latter is not possible, then one may use a few more tests at the first stage of our algorithm to better detect whether a family is heavily infected. We provide such an optimization in the next section.

    Function AdaptiveTest(): In both parts of our algorithm, we make use of a classic adaptive-group-testing algorithm, which we call AdaptiveTest(). This may be regarded as an abstraction for any existing (or future) adaptive algorithm in the group-testing literature. In our analysis, however, we mostly focus on the classic binary splitting algorithm because of its good performance in realistic cases, where the numbers of infected families and/or members ($k_f$, $k_m^j$) are unknown [Sobel and Groll, 1959].

    In this section, we consider only adaptive algorithms that offer noiseless (zero-error) reconstruction. Note, however, that the fact that AdaptiveTest() offers exact reconstruction is not enough to guarantee an accurate detection of any family’s infection regime in Part 1. For example, consider the case where the true infection rate within a family $j$ is not very low (say $p_j = 0.6$), yet none of the family representatives in set $r_j$ happened to be infected. Intuitively, the error probability of detection in Part 1 should depend on the number of selected representatives $|r_j|$ from each family $j$ and the infection rate among its members $p_j$. In our analysis, we examine different scenarios w.r.t. these parameters and discuss which parametrization (i.e. value of $|r_j|$) optimizes the expected number of tests required by our algorithm.

    B.2 Modified/Optimized versions of Alg. 1

    • One modification of our algorithm is the following: In Part 1, instead of selecting only one representative group for each family, we select $m_s$ representative subgroups, each of size $s$, and we treat each of these subgroups as a single “(super)-member”. That is, we identify whether each subgroup is positive (has at least one positive member) or not, and based on this information, using for example majority vote, we can classify the family as heavily or lightly infected; essentially we can solve an estimation problem as in [Aldridge et al., 2019] (see Chapter 5.3), [Walter et al., 1980, Sobel and Elashoff, 1975]. In this regard, Alg. 1 is just a special case of this approach, with $m_s = 1$ and $s = |r_j|$.

    Intuitively, we expect that such a modification would increase the estimation accuracy of $\hat{p}_j$ and reduce the error of the related hypothesis test, at the cost of a few more tests. As a result, it could need fewer tests in expectation than Alg. 1, hence perform better in some cases. However, the potential improvement would depend on parameters such as the family size; for instance, for small families it is not expected to be large. To keep things simple, we prefer not to analyze this algorithm in this paper and defer it to future work.

    • Another modification could be the following: instead of leveraging the community structure to perform individual tests where needed, we could use it to improve the traditional binary splitting algorithm by running it on multiple testing groups that are related to the community structure. For example, consider a symmetric case where we split all $n = FM$ members into $M$ groups of $F$ people (one from each family), and then run binary splitting on each of these groups.

    This modification is also related to Hwang’s binary splitting algorithm, but achieves only logarithmic benefits compared to binary splitting, as opposed to our algorithm that may perform much better in real cases (see Section 4.1). In fact, the expected number of tests needed by this modified algorithm would be at most $k \log_2(n/M) + O(k)$: each group $g$ has $k_g$ infected members and binary splitting needs $k_g \log_2(n/M) + O(k_g)$ tests to identify all of them. By adding together the number of tests for each group $g$, we deduce the result.

    • A last modification occurred to us after a related comment of one of our reviewers, whom we thank. As discussed in Section 4.1, when a sparse regime holds for families (i.e. $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$) and a heavily linear regime holds within each family (i.e. $k_m \approx M$), the benefits of Lemma 1 with regard to Hwang’s binary splitting (HBSA) cannot be more than $1/\log(n/k)$. This is because, in Eq. (4), we get the additive term $k_f M \ge k_f k_m = k$, which comes from the second stage of Algorithm 1.

    Nevertheless, if $k_m > M - k_m$ (i.e., the infection rate inside each family is more than 0.5), then at the second stage of our algorithm it makes more sense to look for not-infected members and stop testing once we find them. In that case, we need at least $k_f (M - k_m)$ tests, which can be less than $k$, and therefore could lead to more benefits on average.

    For example, consider the case where $k_m = M - 1$. Then the expected number of individual tests needed to find the 1 not-infected member inside each infected family can be computed as follows: Without loss of generality, suppose that we test the members at some fixed ordering without replacement and the not-infected member has a uniformly random position in that ordering. Then, the probability of the not-infected item being at a given position $i$ in the ordering is equal to $1/M$ and we need $i$ tests to find it. As a result, the expected number of tests is $\sum_{i=1}^{M} i \cdot \frac{1}{M} = (M + 1)/2$. From linearity of expectation, the expected number of tests for all infected families at the second stage of our algorithm (if we further assume that all infected families are identified without error at the first stage, i.e., $\phi_c = 1$) will be: $k_f (M + 1)/2 < k_f (M - 1) = k_f k_m = k$. Hence, in this particular regime, the modification of our algorithm can achieve benefits more than $1/\log(n/k)$.

    In the more general case, where $M - k_m > 1$, the relevant probabilities for the computation of the expected number of tests can be obtained from the negative hypergeometric distribution (since sampling is without replacement).
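    The expectation discussed above can be sketched numerically. The snippet below is only an illustration under stated assumptions: the function name is hypothetical, the number $r = M - k_m$ of not-infected members is assumed known, and members are tested one by one in a uniformly random order until all $r$ not-infected members have been found. It uses the fact that the last not-infected member sits at position $i$ with probability $\binom{i-1}{r-1} / \binom{M}{r}$.

```python
from fractions import Fraction
from math import comb


def expected_tests_to_find_noninfected(M: int, r: int) -> Fraction:
    """Expected number of individual tests until all r not-infected members
    of a family of size M are found (random order, without replacement)."""
    if not 1 <= r <= M:
        raise ValueError("need 1 <= r <= M")
    total = comb(M, r)  # equally likely placements of the r not-infected members
    # The last not-infected member is at position i when the other r - 1
    # occupy r - 1 of the i - 1 earlier positions.
    return sum(Fraction(i * comb(i - 1, r - 1), total) for i in range(r, M + 1))


if __name__ == "__main__":
    M = 5
    print(expected_tests_to_find_noninfected(M, 1))  # the k_m = M - 1 case
    print(expected_tests_to_find_noninfected(M, 2))  # a case with M - k_m = 2
```

    For $r = 1$ the function reduces to the $(M + 1)/2$ value derived for $k_m = M - 1$.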

    In the extreme case, where for each infected family $k_m$ is known and equal to $M$, all we need to do is to identify the infected families and label all their members as infected. In that case the benefit would be $k_f / k$. Note that, to achieve the higher benefits described above, knowledge of the number of infected members per family is required, but this is also the case for HBSA.

    ³ The symmetric example is used here only to better illustrate the advantages of the proposed modification. The idea is similar for the asymmetric case.

    B.3 Proof of Lemma 1

    Proof. Let $\phi_c$ be the expected fraction of infected families whose mixed sample is positive. Since SelectRepresentatives() is uniform random sampling without replacement, we can compute $\phi_c$ when $1 \le R \le M - k_m$ using the hypergeometric distribution $\mathrm{Hyper}(M, k_m, R)$, as follows: the probability of a random mixed sample $x(r_j)$ being negative (i.e., all members of $r_j$ are negative) is given by the PMF of $\mathrm{Hyper}(M, k_m, R)$ evaluated at 0, and it is therefore equal to $\binom{M - k_m}{R} / \binom{M}{R}$, which yields
    $$
    \phi_c = 1 - \binom{M - k_m}{R} \Big/ \binom{M}{R}.
    $$

    We also define the following for completeness: $\phi_c = 0$ when $R = 0$, and $\phi_c = 1$ when $M - k_m < R \le M$.

    Fixing the number of positive mixed samples in Part 1 of Alg. 1 to its expected value, $k_f \phi_c$, we now compute the maximum number of tests needed by the algorithm to succeed.

    Alg. 1 performs testing at lines 4, 8, 13.

    • At line 4, it identifies the positive mixed samples to mark the corresponding families as heavily infected and all others as lightly infected. If HGBSA is used for AdaptiveTest(), then Alg. 1 is expected to succeed at this step using $k_f \phi_c \log_2 \frac{F}{k_f \phi_c} + k_f \phi_c$ tests. Similarly, if BSA is used for AdaptiveTest(), then Alg. 1 is guaranteed to succeed at this step using at most $k_f \phi_c \log_2 F + k_f \phi_c$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

    • At line 8, the expected number of individual tests is equal to $M k_f \phi_c$. This is the same irrespective of whether AdaptiveTest() is binary splitting or Hwang's algorithm, as it only depends on $\phi_c$.

    • At line 13, the expected number of items that are tested is $n - k_f \phi_c M$, and the expected number of infected members is $k - k_f \phi_c k_m = k(1 - \phi_c)$. So, if HGBSA is used for AdaptiveTest(), then Alg. 1 is guaranteed to succeed at this step using $k(1 - \phi_c) \log_2 \frac{n - k_f \phi_c M}{k(1 - \phi_c)} + k(1 - \phi_c)$ tests. Similarly, if BSA is used, then Alg. 1 is expected to succeed in at most $k(1 - \phi_c) \log_2(n - k_f \phi_c M) + k(1 - \phi_c)$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

    We add together all the above terms that are related to HGBSA or BSA, and the result follows.
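    The three terms of this proof can be added up numerically. The sketch below is an illustration only: the function names and the `hwang` flag are hypothetical, and the code simply evaluates the expressions stated for lines 4, 8 and 13 of Alg. 1, treating a term as zero when no infected item is expected at that step.

```python
from math import comb, log2


def phi_c(M: int, km: int, R: int) -> float:
    """Expected fraction of infected families whose mixed sample is positive."""
    if R == 0:
        return 0.0
    if R > M - km:
        return 1.0
    return 1.0 - comb(M - km, R) / comb(M, R)


def lemma1_tests(F, M, kf, km, R, hwang=False):
    """Sum of the test counts at lines 4, 8 and 13, as stated in the proof."""
    n, k = F * M, kf * km
    phi = phi_c(M, km, R)
    positives = kf * phi               # expected positive mixed samples
    remaining_items = n - positives * M
    remaining_infected = k * (1.0 - phi)

    def adaptive(items, infected):
        # HGBSA-style term if hwang is True, BSA-style term otherwise.
        if infected == 0:
            return 0.0
        argument = items / infected if hwang else items
        return infected * log2(argument) + infected

    line4 = adaptive(F, positives)
    line8 = M * positives
    line13 = adaptive(remaining_items, remaining_infected)
    return line4 + line8 + line13


if __name__ == "__main__":
    for R in (0, 1, 5):
        print(R, lemma1_tests(F=64, M=5, kf=6, km=4, R=R))
```

    Comparing the output for different values of $R$ shows how the choice of representatives trades individual tests at line 8 against adaptive tests at line 13.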

    B.4 Proof of Lemma 2

    Proof. Let $\phi_p$ be the expected fraction of infected families whose mixed sample is positive. Then, because of the probabilistic setting, $\phi_p = 1 - (1 - p)^R$.

    Alg. 1 performs testing at lines 4, 8, 13.

    • At line 4, the expected number of mixed samples that are positive is $F q \phi_p$. So, if BSA is used in the place of AdaptiveTest(), then the maximum number of tests needed to identify all positive mixed samples is, in expectation, $F q \phi_p \log_2 F + F q \phi_p$ [Aldridge et al., 2019, Baldassini et al., 2013].

    • At line 8, the expected number of individual tests is equal to $F q \phi_p M$.

    • At line 13, the expected number of items that are tested is $n - F q \phi_p M$, and the expected number of infected members is equal to the expected number of all infected members minus the expected number of the ones that are identified through individual testing at line 8, i.e., $F q M p - F q \phi_p M p = F q M p (1 - \phi_p) = n q p (1 - \phi_p)$. So, if BSA is used in the place of AdaptiveTest(), it is expected to succeed using at most $n q p (1 - \phi_p) \log_2(n - F q \phi_p M) + n q p (1 - \phi_p)$ tests [Aldridge et al., 2019, Baldassini et al., 2013].

    We add together all the above terms and the result follows.
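    As with Lemma 1, the sum can be evaluated directly. The following sketch (hypothetical function names, BSA only, probabilistic model) adds the three terms stated in this proof and treats the line-13 term as zero when no infected member is expected to remain.

```python
from math import log2


def phi_p(p: float, R: int) -> float:
    """Expected fraction of infected families whose mixed sample is positive."""
    return 1.0 - (1.0 - p) ** R


def lemma2_tests(F: int, M: int, q: float, p: float, R: int) -> float:
    """Sum of the expected test counts at lines 4, 8 and 13 when BSA is used."""
    n = F * M
    phi = phi_p(p, R)
    positives = F * q * phi                      # expected positive mixed samples
    line4 = positives * log2(F) + positives
    line8 = positives * M
    remaining_infected = n * q * p * (1.0 - phi)
    if remaining_infected > 0:
        line13 = remaining_infected * log2(n - positives * M) + remaining_infected
    else:
        line13 = 0.0
    return line4 + line8 + line13


if __name__ == "__main__":
    for R in (0, 1, 3, 5):
        print(R, lemma2_tests(F=64, M=5, q=1 / 8, p=0.8, R=R))
```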

    C Appendix for Section 4.3: The Noiseless Non-adaptive case

    C.1 Zero error requirements

    For our design of $G_2$, we have the following lemma and observation.

    Lemma 7. To achieve zero error w.r.t. $G_2$, we need $T_2 \ge n$.

    Proof. A trivial implementation for $G_2$ is to use an identity matrix of size $n$; since each member is tested individually, we can identify all the infected members correctly. We next argue that $T_2 \ge n$ for the zero-error case. We prove this by contradiction. Assume that $T_2 < n$. Then, from the pigeonhole principle, there exists one member, say $m_1$, that does not participate in any test alone; it always participates together with one or more members from a set $S_1$. Assume that all members in $S_1$ are infected, while $m_1$ belongs to an infected family but is not infected; our decoding will then result in a FP.

    Observation: $G_2$ leads to zero error if and only if it has the following property.
    Zero Error Property: Any subset of $\{1, 2, \dots, n\}$ of size $(F - k_f)M + k_f(M - k_m)$ equals the union of some testing rows of $G_2$. Namely, the members of the not-infected families, together with the not-infected members of the infected families, need to be the only participants in some rows of $G_2$, for all possible not-infected families and not-infected members. This requirement can lead to an alternative proof of Lemma 7.

    C.2 Rationale for the structure of G2

    Our goal is to design a non-trivial matrix $G_2$ that can identify almost all the infected members with high probability and a small number of tests. We next discuss two intuitive properties we would like our designs to have in order to minimize the error probability.

    Desirable Property 1: Use identity matrices as building blocks.
    Intuition: ideally, after removing the $(F - k_f)M$ columns corresponding to the members in non-infected families, we would like the remaining columns to form an identity matrix, so that we can identify all the infected members correctly. To reduce the number of tests, there should be more than one member included in each test. Thus we use overlapping identity matrices, one corresponding to each family. We assume the members are indexed family by family, i.e., the indexes of the members in the same family are consecutive. Then each family corresponds to an identity sub-matrix $I_M$ in $G_2$. The problem now becomes how to arrange the identity sub-matrices.

    Desirable Property 2: The identity matrices corresponding to different families either appear in the same set of $M$ rows in $G_2$ or they do not appear in any shared rows.
    Intuition: otherwise, a family would share tests with more other families. Then the probability that this family shares tests with infected families becomes larger. This would increase the probability that two infected families share tests after removing all the non-infected family columns, which in turn would increase the FP probability.
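    A matrix with both properties can be sketched as follows. This is an illustration under assumptions that are not fixed by the text above: the function name is hypothetical, $b$ is assumed to divide $F$, and consecutive families are grouped into the same block row (any other assignment of $c = F/b$ families per block row would satisfy the two properties equally well).

```python
def build_g2(F: int, M: int, b: int):
    """Build a (b*M) x (F*M) 0/1 matrix made of identity blocks I_M.

    Families are indexed 0..F-1 and family f owns columns f*M .. f*M + M - 1.
    Each block row spans M tests and hosts c = F // b families.
    """
    if F % b != 0:
        raise ValueError("this sketch assumes that b divides F")
    c = F // b
    G2 = [[0] * (F * M) for _ in range(b * M)]
    for family in range(F):
        block_row = family // c          # assumed grouping: consecutive families
        for member in range(M):
            # Place I_M for this family inside its block row.
            G2[block_row * M + member][family * M + member] = 1
    return G2


if __name__ == "__main__":
    for row in build_g2(F=4, M=2, b=2):
        print(row)
```

    In this sketch every member takes part in exactly one test and every test pools $c$ members, one from each family of the block row.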

    C.3 Proof of Lemma 3

    Proof. The probabilities can be explained as follows:

    (i) For $P^{I}_{\text{joint}}$ in (9), the numerator gives the number of possibilities in which each block row contains at most one infected family, which is obtained by choosing $k_f$ block rows (the summation) and then choosing, from each chosen block row, one family to be infected ($c_i$ possible choices for the $i$-th block row). The denominator is the total number of infection possibilities, so the fraction is the probability that each block row contains at most one infected family. Thus, $P^{I}_{\text{joint}}$ is obtained as the probability that there is some block row that contains two or more infected families.

    (ii) For $P^{II}_{\text{joint}}$ in (10), $(1 - q)^{c_i}$ is the probability that there is no infected family in the $i$-th block row, and $c_i q (1 - q)^{c_i - 1}$ is the probability that there is exactly one infected family in the $i$-th block row. The product $\prod$ is the probability that every block row contains at most one infected family. Thus, $P^{II}_{\text{joint}}$ is obtained as the probability that there is some block row that contains two or more infected families.
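    Both probabilities can be evaluated for arbitrary block-row sizes $c_1, \dots, c_b$. The sketch below follows the verbal description in this proof (it does not quote (9) and (10) themselves); the function names are hypothetical, and the sum over subsets of block rows is computed as an elementary symmetric polynomial.

```python
from math import comb


def elementary_symmetric(values, k: int) -> int:
    """Sum, over all size-k subsets of `values`, of the product of the subset."""
    e = [1] + [0] * k
    for v in values:
        for j in range(k, 0, -1):   # go downwards so each value is used once
            e[j] += v * e[j - 1]
    return e[k]


def p_joint_combinatorial(c, kf: int) -> float:
    """Probability that some block row holds two or more of the kf infected families."""
    F = sum(c)
    return 1 - elementary_symmetric(c, kf) / comb(F, kf)


def p_joint_probabilistic(c, q: float) -> float:
    """Same event when every family is infected independently with probability q."""
    at_most_one_everywhere = 1.0
    for ci in c:
        at_most_one_everywhere *= (1 - q) ** ci + ci * q * (1 - q) ** (ci - 1)
    return 1 - at_most_one_everywhere


if __name__ == "__main__":
    print(p_joint_combinatorial([8] * 8, kf=6))
    print(p_joint_probabilistic([8] * 8, q=1 / 8))
```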

    C.4 Proof of Lemma 4

    Proof. Consider $c_i > c_j + 1$, and let $c'_i = c_i - 1$ and $c'_j = c_j + 1$ (with $c'_\ell = c_\ell$ for all other $\ell$). For the combinatorial model, we can verify the difference between the numerators for $c'$ and $c$:
    $$
    \sum_{\substack{|B| = k_f : \\ B \subseteq \{1, 2, \dots, b\}}} \prod_{\ell \in B} c'_\ell
    - \sum_{\substack{|B| = k_f : \\ B \subseteq \{1, 2, \dots, b\}}} \prod_{\ell \in B} c_\ell
    = (c'_i c'_j - c_i c_j) \cdot X = (c_i - c_j - 1) \cdot X > 0,
    $$
    where $X$ is a positive value independent of $c_i$ and $c_j$.

    This implies that the probability $P^{I}_{\text{joint}}$ in (9) attains its minimum roughly at the symmetric case where all $c_i$'s are equal, i.e., $c_i = c$ for all $i \in \{1, 2, \dots, b\}$.

    Similarly, for the probabilistic model, consider the probability in (10). We can calculate that
    $$
    \prod_{\ell=1}^{b} \left[ (1 - q)^{c'_\ell} + c'_\ell q (1 - q)^{c'_\ell - 1} \right]
    - \prod_{\ell=1}^{b} \left[ (1 - q)^{c_\ell} + c_\ell q (1 - q)^{c_\ell - 1} \right] \tag{C.1}
    $$
    $$
    = (c_i - c_j - 1)\, q^2 (1 - q)^{c_i + c_j - 2} \cdot Y > 0, \tag{C.2}
    $$
    where $Y = \prod_{\ell \ne i, j} \left[ (1 - q)^{c_\ell} + c_\ell q (1 - q)^{c_\ell - 1} \right] > 0$ is independent of $c_i$ and $c_j$. This implies that the probability in (10) attains its minimum roughly at the symmetric case where all $c_i$'s are equal, i.e., $c_i = c$ for all $i \in \{1, 2, \dots, b\}$.
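    The balancing argument for the probabilistic model can be checked by brute force on small instances. The sketch below is only a numerical sanity check with hypothetical function names: it verifies that moving one family from a fuller block row to an emptier one increases the probability that both rows hold at most one infected family, and it searches all block-row sizes (each at least 1) for the split that minimizes the joint probability.

```python
from itertools import product


def at_most_one(c: int, q: float) -> float:
    """Probability that a block row with c families has at most one infected family."""
    return (1 - q) ** c + c * q * (1 - q) ** (c - 1)


def balancing_gain(ci: int, cj: int, q: float) -> float:
    """Change in the two-row product when one family moves from row i to row j."""
    return at_most_one(ci - 1, q) * at_most_one(cj + 1, q) - at_most_one(ci, q) * at_most_one(cj, q)


def p_joint(sizes, q: float) -> float:
    prob = 1.0
    for c in sizes:
        prob *= at_most_one(c, q)
    return 1 - prob


def best_split(F: int, b: int, q: float):
    """Exhaustively find block-row sizes summing to F that minimize p_joint."""
    best = None
    for sizes in product(range(1, F + 1), repeat=b):
        if sum(sizes) != F:
            continue
        value = p_joint(sizes, q)
        if best is None or value < best[0] - 1e-15:
            best = (value, tuple(sorted(sizes)))
    return best


if __name__ == "__main__":
    print(balancing_gain(5, 2, 0.2))
    print(best_split(F=7, b=3, q=0.3))
```

    On these small instances the minimizing split is the most balanced one, in line with the statement of the lemma.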

    C.5 Proof of Lemma 5

    The lemma is obtained under the assumption that the number of families $F$ is a multiple of $b$ and $c$. If $F$ cannot be factorized, the error probabilities in Lemma 5 can be viewed as an upper bound for the corresponding error probabilities. This can be seen by simply adding $F'$ auxiliary families so that $F + F' = bc$.

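    Proof. In the symmetric case, i.e., $c_i = c$ for all $i \in \{1, 2, \dots, b\}$, the probabilities in (9) and (10) become
    $$
    P^{I}_{\text{joint}} = 1 - \frac{\binom{b}{k_f} c^{k_f}}{\binom{F}{k_f}}, \tag{C.3}
    $$
    $$
    P^{II}_{\text{joint}} = 1 - \left( (1 - q)^{c - 1} (1 - q + cq) \right)^{b}. \tag{C.4}
    $$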

    For the symmetric combinatorial model, the number of infected members in an infected family is $k^{j}_{m} = k_m$ for all infected families $j$. If two families appear in the same set of $M$ tests, the probability that all infected members in one family share the same $k_m$ tests as the other family is simply
    $$
    P(\text{no FP} \mid \text{joint}) = \frac{1}{\binom{M}{k_m}}. \tag{C.5}
    $$
    Thus the probability that FPs happen is
    $$
    P_e = P(\text{FP} \mid \text{joint}) \cdot P^{I}_{\text{joint}}
    = \left[ 1 - \frac{1}{\binom{M}{k_m}} \right] \left[ 1 - \frac{\binom{b}{k_f} c^{k_f}}{\binom{F}{k_f}} \right]. \tag{C.6}
    $$

    For the symmetric probabilistic model, the infection probability in an infected family is $p_j = p$ for all infected families $j$. If two families appear in the same set of $M$ tests, then there are no false positives only when the two families have the same number of infected members and the infected (non-infected) members in one family appear in the same set of tests as the infected (non-infected) members of the other family. The probability that two families both have $i$ infected members is $\left[ p^{i} (1 - p)^{M - i} \right]^2$, and the probability that all infected members in one family share tests with only infected members in the other family is simply $1 / \binom{M}{i}$. Thus, the probability that there are no false positives is given by
    $$
    P(\text{no FP} \mid \text{joint}) = \sum_{i=1}^{M} \left[ p^{i} (1 - p)^{M - i} \right]^2 \frac{1}{\binom{M}{i}}. \tag{C.7}
    $$
    Thus the probability that a false positive happens can be obtained as
    $$
    P_e = P(\text{FP} \mid \text{joint}) \cdot P^{II}_{\text{joint}}
    = \left[ 1 - \sum_{i=1}^{M} \left[ p^{i} (1 - p)^{M - i} \right]^2 \frac{1}{\binom{M}{i}} \right] \cdot \left[ 1 - \left( (1 - q)^{c - 1} (1 - q + cq) \right)^{b} \right]. \tag{C.8}
    $$

    Replacing $b$ by $T_2 / M$ and $c$ by $FM / T_2$ yields the result.
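    For the symmetric combinatorial model, the resulting FP probability can be evaluated exactly as a function of $T_2$. The sketch below is illustrative: the function name is hypothetical, and it assumes that $b = T_2/M$ and $c = FM/T_2$ are integers, as in the statement of the lemma.

```python
from fractions import Fraction
from math import comb


def fp_probability_combinatorial(F: int, M: int, kf: int, km: int, T2: int) -> Fraction:
    """P_e = P(FP | joint) * P_joint for the symmetric combinatorial model."""
    if T2 % M or (F * M) % T2:
        raise ValueError("this sketch assumes that T2/M and FM/T2 are integers")
    b, c = T2 // M, (F * M) // T2
    # comb(b, kf) is 0 when kf > b: some block row must then hold two infected families.
    p_joint = 1 - Fraction(comb(b, kf) * c**kf, comb(F, kf))
    p_fp_given_joint = 1 - Fraction(1, comb(M, km))
    return p_fp_given_joint * p_joint


if __name__ == "__main__":
    for T2 in (40, 80, 160, 320):
        print(T2, float(fp_probability_combinatorial(F=64, M=5, kf=6, km=4, T2=T2)))
```

    With $T_2 = n$ (so $c = 1$) the expression evaluates to zero, which agrees with the zero-error requirement of Lemma 7.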

    C.6 Proof of Lemma 6 and Discussions

    Proof. For the combinatorial model (I), it is hard to explicitly calculate the expected error rate. The upper bound in (12) is obtained by assuming that, if there exist errors (FPs), then all non-infected members in infected families are misidentified as infected in the decoding of $G_2$. (Note that all non-infected members in non-infected families are correctly identified by the decoding of $G_1$.)

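    For the probabilistic model (II), the upper bound for the expected error rate in (13) is obtained by
    $$
    R^{II}(\text{error}) = \frac{1}{n} \cdot b \cdot \left[ \sum_{j=2}^{c} \binom{c}{j} q^{j} (1 - q)^{c - j} \cdot \left( \sum_{i=1}^{j} \binom{j}{i} p^{i} (1 - p)^{j - i} (j - i) \right) \cdot M \right] \tag{C.9}
    $$
    $$
    = \frac{bM}{n} \left[ \sum_{j=2}^{c} \binom{c}{j} q^{j} (1 - q)^{c - j} \cdot \left( j(1 - p) - j(1 - p)^{j} \right) \right] \tag{C.10}
    $$
    $$
    < \frac{(1 - p) T_2}{n} \sum_{j=2}^{c} \binom{c}{j} q^{j} (1 - q)^{c - j} \cdot j
    = \frac{(1 - p) T_2}{n} \cdot \left[ cq - cq(1 - q)^{c - 1} \right]
    = (1 - p)\, q \left[ 1 - (1 - q)^{c - 1} \right], \tag{C.11}
    $$
    where the expression in the bracket in (C.9) for each $j$ denotes the expected number of FPs in one block row if there are $j$ families infected in this block row, (C.10) is obtained from the expected value of the binomial distribution (together with $bM = T_2$), and (C.11) follows by substituting $c = n / T_2$.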
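    The chain of (in)equalities above can be checked numerically. The following sketch uses hypothetical function names and assumes that $b = T_2/M$ and $c = n/T_2$ are integers; it evaluates the double sum of (C.9) directly and compares it with the closed-form bound of (C.11).

```python
from math import comb


def error_rate_exact(F: int, M: int, q: float, p: float, T2: int) -> float:
    """Direct evaluation of the double sum in (C.9)."""
    n = F * M
    b, c = T2 // M, n // T2
    total = 0.0
    for j in range(2, c + 1):
        prob_j_infected = comb(c, j) * q**j * (1 - q) ** (c - j)
        # Expected non-infected members per test that are pooled with >= 1 infected one.
        fps_per_test = sum(
            comb(j, i) * p**i * (1 - p) ** (j - i) * (j - i) for i in range(1, j + 1)
        )
        total += prob_j_infected * fps_per_test * M
    return b * total / n


def error_rate_bound(F: int, M: int, q: float, p: float, T2: int) -> float:
    """Closed-form upper bound of (C.11)."""
    c = (F * M) // T2
    return (1 - p) * q * (1 - (1 - q) ** (c - 1))


if __name__ == "__main__":
    args = dict(F=64, M=5, q=1 / 8, p=0.8, T2=40)
    print(error_rate_exact(**args), error_rate_bound(**args))
```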

    We make the following observation about the system FP probability P(any-FP). As we explore further in Section 6, non-adaptive group testing requires more tests than adaptive testing. Assume that $k_f = \Theta(F^{\alpha_f})$ for $\alpha_f \in [0, 1)$ and choose $R = M - 1$ in Algorithm 1. Adaptive testing allows us to achieve zero error with $k_f \log_2 F + k_f M$ tests; if we use the same (order) number of tests with a non-adaptive strategy, i.e., $T_1 = k_f \log_2 \frac{F}{k_f}$ and $T_2 = k_f (\log_2 k_f + M)$, we get P(any-FP) in Lemma 5 approximately equal to
    $$
    \left( 1 - \frac{1}{M} \right) \left[ 1 - \frac{\binom{T_2/M}{k_f}}{\left( \frac{T_2/M}{k_f} \right)^{k_f}} \cdot \frac{(F/k_f)^{k_f}}{\binom{F}{k_f}} \right],
    $$
    which is bounded away from 0. The latter can be seen as follows: i) $T_2/M \approx k_f \ll F$; ii) the ratio
    $$
    \frac{\binom{n}{k}}{(n/k)^{k}} \Big/ \frac{\binom{n+m}{k}}{\left( (n+m)/k \right)^{k}}
    = \left( \frac{n+m}{n} \right)^{k} \cdot \prod_{i=1}^{m} \frac{n + i - k}{n + i}
    $$
    is decreasing with $m$ and can be very small when $m \gg n$.

    Fig. 7 depicts P(any-FP) and R(error) for parameters $F = 64$, $k_f = 6$, $k_m = 4$, $M = 5$, $q = 1/8$, and $p = 0.8$.

    Figure 7: System FP probability and FP error rate.

    D Appendix for Section 5: Loopy Belief Propagation algorithm

    We here describe our loopy belief propagation (LBP) algorithm and update rules for our probabilistic model (II). We use the factor graph framework of [Kschischang et al., 2001] and derive closed-form expressions for the sum-product update rules (see equations (5) and (6) in [Kschischang et al., 2001]).

    The LBP algorithm on a factor graph iteratively exchanges messages across the variable and factor nodes. The messages to and from a variable node $V_j$ or $U_i$ are beliefs about the variable, or distributions (a local estimate of $P(V_j \mid \text{observations})$ or $P(U_i \mid \text{observations})$). Since all the random variables are binary, in our case each message is a 2-dimensional vector $[a, b]$ where $a, b \ge 0$. Suppose the result of each test is $y_\tau$, i.e., $Y_\tau = y_\tau$, and we wish to compute the marginals $P(V_j = v \mid Y_1 = y_1, Y_2 = y_2, \dots, Y_T = y_T)$ and $P(U_i = u \mid Y_1 = y_1, Y_2 = y_2, \dots, Y_T = y_T)$ for $v, u \in \{0, 1\}$. The LBP algorithm proceeds as follows:

    1. Initialization: The variable nodes $V_j$ and $U_i$ transmit the message $[0.5, 0.5]$ on each of their incident edges. Each variable node $Y_\tau$ transmits the message $[1 - y_\tau, y_\tau]$, where $y_\tau$ is the observed test result, on its incident edge.

    2. Factor node messages: Each factor node receives the messages from the neighboring variable nodes and computes a new set of messages to send on each incident edge. The rules on how to compute these messages are described next.

    3. Iteration and completion: The algorithm alternates between the variable node and factor node message updates for a fixed number of iterations (in practice, 10 or 20 iterations work well) and then computes an estimate of the posterior marginals as follows: for each variable node $V_j$ and $U_i$, we take the coordinatewise product of the incoming factor messages and normalize to obtain an estimate of $P(V_j = v \mid y_1 \dots y_T)$ and $P(U_i = u \mid y_1 \dots y_T)$ for $v, u \in \{0, 1\}$.

    Next we describe the simplified variable and factor node message update rules. We use equations (5) and (6) of [Kschischang et al., 2001] to compute the messages.

    Leaf node messages: At every iteration, the variable node $Y_\tau$ continually transmits the message $[0, 1]$ if $Y_\tau = 1$ and $[1, 0]$ if $Y_\tau = 0$ on its incident edge. The factor node $P(V_j)$ continually transmits $[1 - q, q]$ on its incident edge; see Fig. 8 (a) and (b).

    Variable node messages: The other variable nodes $V_j$ and $U_i$ use the following rule to transmit messages along the incident edges: for each incident edge $e$, a variable node takes the elementwise product of the messages from every other incident edge $e'$ and transmits this along $e$; see Fig. 8 (c).
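    The variable node rule, and the final posterior estimate of step 3, can be sketched in a few lines. The function names below are hypothetical; messages are plain two-element lists $[a, b]$, and they are normalized before being returned, which is an optional convenience assumed here and does not change the resulting beliefs.

```python
def normalize(message):
    """Scale a two-element message so that its entries sum to 1."""
    total = message[0] + message[1]
    # total is 0 only if the incoming messages rule out both values.
    return [message[0] / total, message[1] / total]


def variable_to_factor(incoming, edge: int):
    """Message sent along `edge`: elementwise product of the messages
    received on every other incident edge."""
    out = [1.0, 1.0]
    for e, (a, b) in enumerate(incoming):
        if e != edge:
            out[0] *= a
            out[1] *= b
    return normalize(out)


def belief(incoming):
    """Posterior estimate: normalized elementwise product of all incoming messages."""
    out = [1.0, 1.0]
    for a, b in incoming:
        out[0] *= a
        out[1] *= b
    return normalize(out)


if __name__ == "__main__":
    incoming = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
    print(variable_to_factor(incoming, edge=2))
    print(belief(incoming))
```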

    Factor node messages: For the factor node messages, we calculate closed-form expressions for the sum-product update rule (equation (6) in [Kschischang et al., 2001]). The simplified expressions are summarized in Fig. 8 (d) and (e). Next we briefly describe these calculations.

    First, we note that each message represents a probability distribution. One could, without loss of generality, normalize each message before transmission. Therefore, we assume that each message $\mu = [a, b]$ is such that $a + b = 1$. The leaf nodes labeled $P(V_j)$ perennially transmit the prior distribution corresponding to $V_j$.

    Next, consider the factor node $P(U_i \mid V_j)$ as shown in Fig. 8 (d). The message sent to $U_i$ is calculated as
    $$
    \mu_u = \sum_{v \in \{0, 1\}} P(U_i = u \mid V_j = v)\, w_v
    = w_0 (1 - u) + w_1 p_j^{u} (1 - p_j)^{1 - u}.
    $$

    Similarly, the message sent to $V_j$ is
    $$
    \nu_v = \sum_{u \in \{0, 1\}} P(U_i = u \mid V_j = v)\, s_u
    = s_0 \left( v (1 - p_j) + 1 - v \right) + s_1 v p_j.
    $$
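    Written out for $u, v \in \{0, 1\}$, the two updates at this factor node are short enough to state as code. The function names are hypothetical; `w` and `s` are the two-element messages arriving from the family node and the member node respectively, and `pj` is the infection probability inside an infected family.

```python
def factor_to_member(w, pj: float):
    """Message [mu_0, mu_1] sent to the member node, given w = [w_0, w_1]."""
    mu0 = w[0] + w[1] * (1 - pj)   # u = 0: family healthy, or infected but member spared
    mu1 = w[1] * pj                # u = 1: only possible if the family is infected
    return [mu0, mu1]


def factor_to_family(s, pj: float):
    """Message [nu_0, nu_1] sent to the family node, given s = [s_0, s_1]."""
    nu0 = s[0]                             # v = 0 forces the member to be healthy
    nu1 = s[0] * (1 - pj) + s[1] * pj      # v = 1: member infected with probability pj
    return [nu0, nu1]


if __name__ == "__main__":
    print(factor_to_member([0.5, 0.5], pj=0.8))
    print(factor_to_family([0.6, 0.4], pj=0.8))
```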

    Figure 8: The update rules for the factor and variable node messages. Panels: (a) messages from $P(V_j)$ factor nodes; (b) messages from $Y_\tau$ variable nodes; (c) messages from $V_j$ and $U_i$ variable nodes; (d) messages from $P(U_i \mid V_j)$ factor nodes; (e) messages from $P(Y_\tau \mid U_{\delta_\tau})$ factor nodes.

    Finally, for the factor nodes $P(Y_\tau \mid U_{\delta_\tau})$ shown in Fig. 8 (e), note that the messages to $Y_\tau$ play no role, since they are never used to recompute the variable messages. The messages to the $U_i$ nodes are expressed as

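    $$
    \begin{aligned}
    \mu_u &= \sum_{\substack{y \in \{0, 1\}, \\ \{u_{i'} \in \{0, 1\} : i' \in \delta_\tau \setminus \{i\}\}}} \left( P(Y_\tau = y \mid U_{\delta_\tau} = u_{\delta_\tau}) \, (1 - y_\tau)^{1 - y} y_\tau^{y} \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} \right) \\
    &= (1 - y_\tau) \sum_{\{u_{i'} \in \{0, 1\} : i' \in \delta_\tau \setminus \{i\}\}} \left( P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} \right) \\
    &\quad + y_\tau \sum_{\{u_{i'} \in \{0, 1\} : i' \in \delta_\tau \setminus \{i\}\}} \left( P(Y_\tau = 1 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} \right).
    \end{aligned}
    $$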

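    From our Z-channel model, recall that $P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) = 1$ if $u_i = 0$ for all $i \in \delta_\tau$, and $P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) = z$ otherwise. Thus we split the summation terms into two cases: one where $u_{i'} = 0$ for all $i'$, and the other its complement. Combining this with the assumption that the messages are normalized, i.e., $s^{(i)}_0 + s^{(i)}_1 = 1$, we get
    $$
    \sum_{\{u_{i'} \in \{0, 1\} : i' \in \delta_\tau \setminus \{i\}\}} \left( P(Y_\tau = 0 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} \right)
    = \mathbb{1}_{u = 1}\, z + \mathbb{1}_{u = 0} \left\{ 1 - (1 - z) \Big( 1 - \prod_{\substack{i' \in \delta_\tau \\ i' \ne i}} s^{(i')}_0 \Big) \right\},
    $$
    and
    $$
    \sum_{\{u_{i'} \in \{0, 1\} : i' \in \delta_\tau \setminus \{i\}\}} \left( P(Y_\tau = 1 \mid U_{\delta_\tau} = u_{\delta_\tau}) \prod_{i' \in \delta_\tau \setminus \{i\}} s^{(i')}_{u_{i'}} \right)
    = \mathbb{1}_{u = 1} (1 - z) + \mathbb{1}_{u = 0} \left( (1 - z) \Big( 1 - \prod_{\substack{i' \in \delta_\tau \\ i' \ne i}} s^{(i')}_0 \Big) \right).
    $$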

    Substituting $u = 0$ and $u = 1$, we obtain the messages
    $$
    \mu_0 = (1 - y_\tau) \left\{ 1 - (1 - z) \Big( 1 - \prod_{\substack{i' \in \delta_\tau \\ i' \ne i}} s^{(i')}_0 \Big) \right\}
    + y_\tau (1 - z) \Big( 1 - \prod_{\substack{i' \in \delta_\tau \\ i' \ne i}} s^{(i')}_0 \Big),
    $$
    and
    $$
    \mu_1 = (1 - y_\tau) z + y_\tau (1 - z).
    $$

    For our probabilistic model, the complexity of computing the factor node messages increases only linearly with the factor node degree.

    Figure 9: Experiment (i): Noiseless case, average number of tests. Panels: (a) Community 1, sparse regime; (b) Community 1, linear regime; (c) Community 2, sparse regime; (d) Community 2, linear regime. Each panel plots the average number of tests against $p \in [0.4, 0.8]$ for Alg. 1 with $R = 1$, Alg. 1 with $R = M$, binary splitting, the lower bound, the non-adaptive algorithm, and the counting bound.
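    The closed-form message for the test factor node can be compared against the defining sum over all configurations of the other pool members. The sketch below uses hypothetical function names; `s0_others` holds the entries $s^{(i')}_0$ received from the other members of the pool, `y` is the observed test result, and `z` is the Z-channel parameter. The closed form needs a single pass over the pool, whereas the brute-force version enumerates $2^{|\delta_\tau| - 1}$ configurations.

```python
from itertools import product


def pooled_factor_to_member(y: int, z: float, s0_others):
    """Closed-form message [mu_0, mu_1] from a test factor node to one member."""
    all_others_healthy = 1.0
    for s0 in s0_others:
        all_others_healthy *= s0
    some_other_infected = 1.0 - all_others_healthy
    mu0 = (1 - y) * (1 - (1 - z) * some_other_infected) + y * (1 - z) * some_other_infected
    mu1 = (1 - y) * z + y * (1 - z)
    return [mu0, mu1]


def pooled_factor_brute_force(y: int, z: float, s0_others):
    """Same message computed by summing over every configuration of the others."""
    mu = [0.0, 0.0]
    for u in (0, 1):
        for others in product((0, 1), repeat=len(s0_others)):
            weight = 1.0
            for s0, ui in zip(s0_others, others):
                weight *= s0 if ui == 0 else 1.0 - s0
            pool_is_clean = u == 0 and not any(others)
            p_y0 = 1.0 if pool_is_clean else z   # Z-channel: P(Y = 0 | pool)
            mu[u] += weight * (p_y0 if y == 0 else 1.0 - p_y0)
    return mu


if __name__ == "__main__":
    print(pooled_factor_to_member(1, 0.1, [0.5, 0.5]))
    print(pooled_factor_brute_force(1, 0.1, [0.5, 0.5]))
```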

    E Appendix for Section 6: Other Results

    We next provide additional experimental results to the ones provided in Section 6.

    (i) Noiseless testing: average number of tests. In Figure 9, we report additional numerical results akin to the ones in Section 6 for the number of tests in the noiseless-testing case. As earlier, we measure the average number of tests needed by 3 algorithms that achieve zero-error reconstruction (Alg. 1 with $R = 1$, Alg. 1 with $R = M$, and classic BSA), and a version of our non-adaptive algorithm (Section 4.3) that uses $T_1 = F$ tests for submatrix $G_1$ and has an overall FP rate around 0.5%. Alg. 1 assumes no prior knowledge of the number of infected families/classes or members/students, and hence uses BSA as the group-testing algorithm for the AdaptiveTest() function.

    Fig. 9 depicts our results: We observe that both versions of Alg. 1 (black and magenta lines) need significantly fewer tests compared to classic BSA (green line), while staying below the counting bound. This indicates the potential benefits from the community structure, even when the number of infected members is unknown. More interestingly, when $R = M$, Alg. 1 performs close to the lower bound in most realistic scenarios $p \in [0.5, 0.8]$ (as also shown in Section 4.1). The grey line