
Federated Unsupervised Representation Learning

Fengda Zhang 1, Kun Kuang 1, Zhaoyang You 1, Tao Shen 1, Jun Xiao 1, Yin Zhang 1, Chao Wu 2*, Yueting Zhuang 1, Xiaolin Li 3
1 College of Computer Science and Technology, Zhejiang University

2 School of Public Affairs, Zhejiang University
3 Tongdun Technology

{fdzhang, kunkuang, zhaoyangyou, tao.shen}@zju.edu.cn, [email protected], {zhangyin98, yzhuang, chao.wu}@zju.edu.cn, [email protected]

    Abstract

To leverage enormous unlabeled data on distributed edge devices, we formulate a new problem in federated learning called Federated Unsupervised Representation Learning (FURL) to learn a common representation model without supervision while preserving data privacy. FURL poses two new challenges: (1) data distribution shift (Non-IID distribution) among clients would make local models focus on different categories, leading to the inconsistency of representation spaces; (2) without unified information among clients in FURL, the representations across clients would be misaligned. To address these challenges, we propose the Federated Contrastive Averaging with dictionary and alignment (FedCA) algorithm. FedCA is composed of two key modules: (1) a dictionary module to aggregate the representations of samples from each client and share them with all clients for consistency of the representation space; and (2) an alignment module to align the representation of each client with a base model trained on public data. We adopt the contrastive loss for local model training. Through extensive experiments with three evaluation protocols in IID and Non-IID settings, we demonstrate that FedCA outperforms all baselines with significant margins.

Introduction

Federated Learning (FL) is proposed as a paradigm that enables distributed clients to collaboratively train a shared model while preserving data privacy (McMahan et al. 2017). Specifically, in each round of federated learning, clients obtain the global model and update it on their own private data to generate the local models, and then the central server aggregates these local models into a new global model. Most of the existing works focus on supervised federated learning, in which clients train their local models with supervision. However, the data generated on edge devices are typically unlabeled. Therefore, learning a common representation model for various downstream tasks from decentralized and unlabeled data while keeping private data on devices, i.e., Federated Unsupervised Representation Learning (FURL), still remains an open problem.

    *Corresponding Author

Figure 1: Illustration of challenges in FURL: (a) inconsistency of representation spaces: data distribution shift among clients causes local models to focus on different categories; and (b) misalignment of representations: without unified information, the representations across clients would be misaligned (e.g., rotated by a certain angle). The hyperspheres are representation spaces encoded by different local models in federated learning.

It's a natural idea that we can combine federated learning with unsupervised approaches, which means that clients can train their local models via unsupervised methods. There are a lot of highly successful works on unsupervised representation learning. Particularly, contrastive learning methods, which train models by reducing the distance between representations of positive pairs (e.g., different augmented views of the same image) and increasing the distance between representations of negative pairs


(e.g., augmented views from different images), have been outstandingly successful in practice (Chen et al. 2020a,b; He et al. 2020; Oord, Li, and Vinyals 2018). However, their successes rely heavily on abundant data for representation training; for example, contrastive learning methods need a large number of negative samples for training (Chen et al. 2020a; Sohn 2016). Moreover, few of these unsupervised methods take the problem of data distribution shift into account, which is a common practical problem in federated learning. Hence, it is no easy task to combine federated learning with unsupervised approaches for the problem of FURL.

In federated learning applications, however, the collected data of each client is limited and the data distributions of clients might differ from each other (Jeong et al. 2018; Kairouz et al. 2019; Sattler et al. 2019; Yang et al. 2019a; Zhao et al. 2018). Hence, we face the following challenges in combining federated learning with unsupervised approaches for FURL:

• Inconsistency of representation spaces. In federated learning, the limited data of each client would lead to variation of the data distribution from client to client, resulting in the inconsistency of representation spaces encoded by different local models. For example, as shown in Figure 1(a), client 1 has only images of cats and dogs, and client 2 has only images of cars and planes. Then, the locally trained model on client 1 only encodes a feature space of cats and dogs, failing to map cars or planes to appropriate representations, and the same goes for the model trained on client 2. Intuitively, the performance of the global model aggregated from these inconsistent local models may fall short of expectations.

• Misalignment of representations. Even if the training data of clients are IID and the representation spaces encoded by different local models are consistent, there may be misalignment between representations due to randomness in the training process. For instance, for a given input set, the representations generated by one model may be equivalent to the representations generated by another model rotated by a certain angle, as shown in Figure 1(b). It should be noted that the misalignment between local models may have drastic detrimental effects on the performance of the aggregated model.

To address these challenges, we propose a contrastive loss-based federated unsupervised representation learning algorithm called FedCA, which consists of two main novel modules: a dictionary module for addressing the inconsistency of representation spaces and an alignment module for aligning the representations across clients. Specifically, the dictionary module, which is maintained by the server, aggregates abundant representations of samples from clients and can be shared with each client for local model optimization. In the alignment module, we first train a base model on a small public dataset (e.g., a subset of the STL-10 dataset (Coates, Ng, and Lee 2011)), and then require all local models to mimic the base model such that the representations generated by different local models can be aligned. Overall, in each round, FedCA involves two stages: (i) clients train local representation models on their own unlabeled data via

contrastive learning with the two modules above, and then generate local dictionaries; and (ii) the server aggregates the trained local models to obtain a shared global model and integrates the local dictionaries into a global dictionary.

To the best of our knowledge, FedCA is the first algorithm designed for the FURL problem. Our experiments show that FedCA has better performance than naive methods which solely combine federated learning with unsupervised approaches. We believe that FedCA will serve as a critical foundation for this novel and challenging problem.

Related Work

Federated Learning

Federated learning enables distributed clients to train a shared model collaboratively while keeping private data on devices (McMahan et al. 2017). Li et al. add a proximal term to the loss function to keep local models close to the global model (Li et al. 2018). Wang et al. propose a layer-wise federated learning algorithm to deal with permutation invariance of neural network parameters (Wang et al. 2020). However, the existing works only focus on the consistency of parameters, while we emphasize the consistency of representations in this paper. Some works also focus on reducing the communication cost of federated learning (Konečnỳ et al. 2016). To further protect the data privacy of clients, cryptography technologies are applied to federated learning (Bonawitz et al. 2017).

Unsupervised Representation Learning

There are two main types of unsupervised learning methods: generative and discriminative. Generative approaches learn representations by generating pixels in the input space (Hinton and Salakhutdinov 2006; Kingma and Welling 2013; Radford, Metz, and Chintala 2015). Discriminative approaches train representation models by performing pretext tasks where the labels are generated for free from unlabelled data (Gidaris, Singh, and Komodakis 2018; Pathak et al. 2017). Among them, contrastive learning methods achieve excellent performance (Chen et al. 2020a,b; He et al. 2020; Oord, Li, and Vinyals 2018). The contrastive loss was proposed by Hadsell et al. (Hadsell, Chopra, and LeCun 2006). Wu et al. propose an unsupervised contrastive learning approach based on a memory bank to learn visual representations (Wu et al. 2018). Recently, Wang et al. point out two key properties, closeness and uniformity, related to the contrastive loss (Wang and Isola 2020). Other works also apply contrastive learning to video (Sermanet et al. 2018; Tian, Krishnan, and Isola 2019), NLP (Logeswaran and Lee 2018; Mikolov et al. 2013; Yang et al. 2019b), audio (Baevski et al. 2020), and graphs (Hassani and Khasahmadi 2020; Qiu et al. 2020).

Federated Unsupervised Learning

Some concurrent works (Jin et al. 2020; van Berlo, Saeed, and Ozcelebi 2020) also focus on federated learning from unlabeled data.


Figure 2: Illustrations of FedCA. (a) In each round, clients generate local models and dictionaries, and then the server gathers them to obtain the global model and dictionary. (b) Clients update local models by contrastive learning with the dictionary and alignment modules. x_other is a sample different from sample x, and x_align is a sample from the additional public dataset for alignment. f is the encoder and g is the projection head. (c) Clients generate local dictionaries via temporal ensembling.

Different from these works, which all simply combine federated learning with unsupervised approaches, we explore and identify the main challenges in federated unsupervised representation learning and design an algorithm to deal with these challenges.

Preliminary

In this section, we discuss the primitives needed for our approach.

Federated Learning

In federated learning, each client $u \in U$ has a private dataset $D_u$ of training samples with $D = \cup_{u \in U} D_u$, and our aim is to train a shared model while keeping private data on devices. There are a lot of algorithms designed for aggregation in federated learning (Li et al. 2018; Wang et al. 2020). Here, for simplicity, we introduce a standard and popular aggregation method named FedAvg (McMahan et al. 2017). In each round of FedAvg, the server randomly selects a subset of clients $U^t \subseteq U$, and each client $u \in U^t$ locally updates the model $f$ with parameters $\theta^t$ on dataset $D_u$ via the stochastic gradient descent rule:

$$\theta_u^{t+1} \leftarrow \theta^t - \eta \nabla L_f(D_u, \theta^t) \quad (1)$$

where $\eta$ is the stepsize. Then the server gathers the parameters of the local models $\{\theta_u^{t+1} \mid u \in U^t\}$ and aggregates these local models via a weighted average to generate a new global model:

$$\theta^{t+1} \leftarrow \sum_{u \in U^t} \frac{|D_u|}{\sum_{i \in U^t} |D_i|} \, \theta_u^{t+1} \quad (2)$$

The training process above is repeated until the global model converges.
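To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch-style sketch of one FedAvg round; the helper names (`local_update`, `fedavg_aggregate`) and the loss-function interface are our own illustration, not part of the original algorithm description.

```python
import copy
import torch

def local_update(model, loader, loss_fn, lr=1e-3, epochs=1):
    """One client's local SGD update (Eq. 1), starting from the global weights."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:
            opt.zero_grad()
            loss_fn(local, x).backward()
            opt.step()
    return local.state_dict()

def fedavg_aggregate(local_states, num_samples):
    """Weighted average of local models (Eq. 2), weights proportional to |D_u|."""
    total = float(sum(num_samples))
    global_state = {}
    for key in local_states[0]:
        global_state[key] = sum(
            state[key] * (n / total) for state, n in zip(local_states, num_samples)
        )
    return global_state
```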

Unsupervised Contrastive Learning

Unsupervised contrastive representation learning methods learn representations from unlabeled data by reducing the distance between representations of positive samples and increasing the distance between representations of negative samples. Among them, SimCLR achieves outstanding performance and can be applied to federated learning easily (Chen et al. 2020a). SimCLR randomly samples a minibatch of $N$ samples and executes two random data augmentations for each sample to obtain $2N$ views. Typically, the views augmented from the same image are treated as positive samples and the views augmented from different images are treated as negative samples (Dosovitskiy et al. 2014). The loss function for a positive pair of samples $(i, j)$ is defined as:

$$l_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}, \quad (3)$$

where $\tau$ is a temperature and $\mathbb{1}_{[k \neq i]} = 1$ iff $k \neq i$. The model (consisting of a base encoder network $f$ to extract representation vectors $h$ from augmented views and a projection head $g$ to map representations $h$ to $z$) is trained by minimizing the loss function above. Finally, we use the representations $h$ to perform downstream tasks.
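For readers mapping Eq. (3) onto code, below is a minimal sketch of a SimCLR-style contrastive (NT-Xent) loss over a batch. It assumes `z1` and `z2` are the L2-normalized projections of the two augmented views; this is our own simplification, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """Contrastive loss of Eq. (3) over 2N L2-normalized projections."""
    z = torch.cat([z1, z2], dim=0)                 # (2N, d)
    sim = z @ z.t() / tau                          # cosine similarities / temperature
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))          # exclude self-similarity (k != i)
    # the positive of view i is its other augmented view: i + N (or i - N)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```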

Method

In this section, we analyze the two challenges mentioned above and detail the dictionary module and alignment module designed for these challenges. Then we introduce the Federated Contrastive Averaging with dictionary and alignment (FedCA) algorithm for FURL.

Dictionary Module for Inconsistency Challenge

FURL aims to learn a shared model mapping data to representation vectors such that similar samples are mapped to nearby points in the representation space, so that the features are well clustered by class. However, the presence of Non-IID data presents a great challenge to FURL.


Figure 3: T-SNE visualization results of representations on CIFAR-10. In federated learning with the Non-IID setting, we use the local model of the client who only has samples of classes 0 and 1 to generate representations. We compare two methods: (a) FedSimCLR (SimCLR combined with FedAvg directly) and (b) FedCA (ours). A and B are the regions where representations of samples of classes 0 and 1 cluster, respectively, and C is the rest of the space.

Since the local dataset $D_u$ of a given client $u$ likely contains samples of only a few classes, the local models may encode inconsistent spaces, causing bad effects on the performance of the aggregated model.

To empirically verify this, we visualize the representations of images from CIFAR-10 via the T-SNE method. To be specific, we split the training data of CIFAR-10 into 5 Non-IID sets, and each set consists of 10000 samples from 2 classes. Then the FedAvg algorithm is solely combined with an unsupervised approach (SimCLR) to learn representations from these subsets. We use the local model in the 20th round of the client who only has samples of classes 0 and 1 to extract features from the test set of CIFAR-10 and visualize the representations after dimensionality reduction by T-SNE, as shown in Figure 3(a). We find that the scattered representations of samples from classes 0 and 1 spread over a very large area of the representation space, and it is difficult to distinguish samples of classes 0 and 1 from others. This suggests that the local model encodes a representation space of samples of classes 0 and 1 and cannot map samples of other classes to suitable positions. The visualization results support our hypothesis that the representation spaces encoded by different local models are inconsistent in the Non-IID setting.

We argue that the cause of the inconsistency is that the clients can only use their own data to train the local models, but the distribution of data varies from client to client. To address this issue, we design a dictionary module, as shown in Figure 2(b).

Specifically, in each communication round, clients use the global model (including the encoder and the projection head) to obtain the normalized projections $\{\tilde{z}_i\}$ of their own samples and send the normalized projections to the server along with the trained local models. Then the server gathers the normalized projections into a shared dictionary. For each client, the global dictionary $\tilde{z}_{dict}$ with $K$ projections is treated as a normalized projection set of negative samples for local contrastive learning. Specifically, in the local training process, for a given minibatch $x_{batch}$ with $N$ samples, we randomly augment them to obtain $x_i$, $x_j$ and generate normalized projections $\tilde{z}_i$, $\tilde{z}_j$. Then we calculate

$$\mathrm{logits}_{batch} = \tilde{z}_i \cdot \tilde{z}_j^T, \quad (4)$$

$$\mathrm{logits}_{dict} = \tilde{z}_i \cdot \tilde{z}_{dict}^T, \quad (5)$$

$$\mathrm{logits}_{total} = \mathrm{concat}([\mathrm{logits}_{batch}, \mathrm{logits}_{dict}], \mathrm{dim}=1), \quad (6)$$

where $\mathrm{concat}()$ denotes concatenation and the size of the logits is $N \times (N+K)$. Now we turn the unsupervised problem into an $(N+K)$-classification problem and define

$$\mathrm{label} = [0, 1, 2, ..., N-2, N-1] \quad (7)$$

as a class indicator. Then the loss function is given as

$$\mathrm{loss}_{contrastive} = \mathrm{CE}(\mathrm{logits}/t, \mathrm{labels}), \quad (8)$$

where $\mathrm{CE}()$ denotes the cross-entropy loss and $t$ is a temperature term.
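A minimal sketch of how Eqs. (4)-(8) can be computed in PyTorch is given below; it assumes `z_i`, `z_j` are the L2-normalized projections of the two augmented views of the current minibatch and `z_dict` is the global dictionary of K normalized projections. The function name and tensor layout are our own illustration.

```python
import torch
import torch.nn.functional as F

def dictionary_contrastive_loss(z_i, z_j, z_dict, t=0.5):
    """Contrastive loss with a global dictionary of negatives (Eqs. 4-8)."""
    logits_batch = z_i @ z_j.t()                            # (N, N), Eq. (4)
    logits_dict = z_i @ z_dict.t()                          # (N, K), Eq. (5)
    logits = torch.cat([logits_batch, logits_dict], dim=1)  # (N, N+K), Eq. (6)
    labels = torch.arange(z_i.size(0), device=z_i.device)   # Eq. (7): positives on the diagonal
    return F.cross_entropy(logits / t, labels)              # Eq. (8)
```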

Note that, in each round, the shared dictionary is generated by the global model from the previous round, but the projections of local samples are encoded by the current local models. The inconsistencies in representations may affect the function of the dictionary module, especially in the Non-IID setting. We use temporal ensembling to alleviate this problem, as shown in Figure 2(c). To be specific, each client maintains a local ensemble dictionary consisting of a projection set $\{Z_i^{t-1} \mid x_i \in D_u\}$. In each round, client $u$ uses the trained local model to obtain projections $\{z_i^t \mid x_i \in D_u\}$ and accumulates them into the ensemble dictionary by updating

$$Z_i^t \leftarrow \alpha Z_i^{t-1} + (1-\alpha) z_i^t, \quad (9)$$

and then the normalized ensemble projection is given as

$$\tilde{z}_i^t = \frac{Z_i^t/(1-\alpha^t)}{\|Z_i^t/(1-\alpha^t)\|_2} = \frac{Z_i^t}{\|Z_i^t\|_2}, \quad (10)$$

where $\alpha \in [0, 1)$ is a momentum parameter and $Z_i^0 = \vec{0}$.

We visualize the representations encoded by the local model trained via federated contrastive learning with the dictionary module in the same setting as the vanilla federated unsupervised approach. As shown in Figure 3(b), we find that the points of classes 0 and 1 are clustered in a small subspace of the representation space, which means that the dictionary module works well as we expected.
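The temporal-ensembling update of Eqs. (9)-(10) amounts to an exponential moving average of each sample's projection followed by L2 renormalization (under which the bias-correction factor cancels). A small sketch under these assumptions, with a dense tensor holding one ensemble vector per local sample as our own storage choice:

```python
import torch
import torch.nn.functional as F

class LocalEnsembleDictionary:
    """Per-client ensemble dictionary updated by temporal ensembling (Eqs. 9-10)."""

    def __init__(self, num_samples, dim, alpha=0.5):
        self.Z = torch.zeros(num_samples, dim)   # Z_i^0 = 0
        self.alpha = alpha

    def update(self, indices, z_new):
        # Eq. (9): Z_i^t <- alpha * Z_i^{t-1} + (1 - alpha) * z_i^t
        self.Z[indices] = self.alpha * self.Z[indices] + (1 - self.alpha) * z_new

    def normalized(self):
        # Eq. (10): the (1 - alpha^t) correction cancels under L2 normalization
        return F.normalize(self.Z, dim=1)
```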

Alignment Module for Misalignment Challenge

Figure 4: Boxplots of angles between representations encoded by local models on CIFAR-10 in federated learning with the IID setting: (a) FedSimCLR; (b) FedCA.

Due to randomness in the training process, there might be a certain angular difference between the representations generated by two models trained on the same dataset, even though the two models encode consistent spaces. This misalignment of representations may have an adverse effect on model aggregation.

To verify this, we record the angles between normalized representations generated by different local models in federated learning. We randomly split the training data of CIFAR-10 into 5 IID sets, and each set consists of 10000 samples from all 10 classes. We randomly select 2 local models trained by the vanilla federated unsupervised approach (FedSimCLR is used as an example) and use them to obtain normalized representations on the test set of CIFAR-10. As shown in Figure 4(a), there is always a large angular difference (beyond 20°) between the representations encoded by the local models during the learning process.
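Measuring this misalignment only requires the angle between the two models' normalized representations of the same inputs; a minimal sketch (our own helper, not from the paper):

```python
import torch
import torch.nn.functional as F

def representation_angles(model_a, model_b, x):
    """Per-sample angle (in degrees) between normalized representations of two encoders."""
    with torch.no_grad():
        h_a = F.normalize(model_a(x), dim=1)
        h_b = F.normalize(model_b(x), dim=1)
    cos = (h_a * h_b).sum(dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))
```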

We introduce an alignment module to tackle this challenge. As shown in Figure 2(b), we prepare an additional public dataset $D_{align}$ of small size and train a model $g_{align}(f_{align}(\cdot))$ (called the alignment model) on it. The local models are then trained via the contrastive loss with a regularization term that replicates the outputs of the alignment model on a subset of the alignment dataset. For a given client $u$, the loss function is defined as

$$\mathrm{loss}_{h_{align}} = \sum_{i=1}^{|D_{align}^{sub}|} \|h_{align}^i - h_u^i\|_2^2, \quad (11)$$

$$\mathrm{loss}_{z_{align}} = \sum_{i=1}^{|D_{align}^{sub}|} \|z_{align}^i - z_u^i\|_2^2, \quad (12)$$

$$\mathrm{loss}_{align} = \mathrm{loss}_{h_{align}} + \mathrm{loss}_{z_{align}}, \quad (13)$$

where $h_{align}^i = f_{align}(x^i)$, $z_{align}^i = g_{align}(h_{align}^i)$, $h_u^i = f_u(x^i)$, $z_u^i = g_u(h_u^i)$, and $x^i \in D_{align}^{sub} \subseteq D_{align}$.

We also calculate the angles between the representations of local models trained via federated contrastive learning with the alignment module (3200 images randomly sampled from STL-10 are used for alignment) in the same setting as the vanilla federated unsupervised approach. As shown in Figure 4(b), the angles can be kept within 10° after 10 training rounds, which suggests that the alignment module helps to align the local models.
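Below is a minimal sketch of the alignment regularizer of Eqs. (11)-(13), assuming `f_u`/`g_u` are the client's encoder and projection head and `f_align`/`g_align` are the fixed alignment model; the function name and batching are our own illustration.

```python
import torch

def alignment_loss(f_u, g_u, f_align, g_align, x_align):
    """Alignment regularizer of Eqs. (11)-(13) on a batch from the public dataset."""
    with torch.no_grad():                      # the alignment model is kept fixed
        h_ref = f_align(x_align)
        z_ref = g_align(h_ref)
    h = f_u(x_align)
    z = g_u(h)
    loss_h = ((h - h_ref) ** 2).sum()          # Eq. (11)
    loss_z = ((z - z_ref) ** 2).sum()          # Eq. (12)
    return loss_h + loss_z                     # Eq. (13)
```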

FedCA Algorithm

From the above, the total loss function for the local model update is given as

$$\mathrm{loss} = \mathrm{loss}_{contrastive} + \beta \, \mathrm{loss}_{align}, \quad (14)$$

where $\beta$ is a scale factor controlling the influence of the alignment module. We now have a complete algorithm named Federated Contrastive Averaging with Dictionary and Alignment (FedCA), which can handle the challenges of FURL well, as shown in Figure 2.

Algorithm 1: Federated Contrastive Averaging with Dictionary and Alignment (FedCA).

Require: The n clients are indexed by u; parameters of the global model (encoder and projection head) θ_t; parameters of the local model θ_t^u; global dictionary dict_t; local dictionary dict_t^u; the proportion of selected clients C; the number of local epochs E; local dataset D_u; and learning rate η.

Server executes:
1: Initialize θ_0
2: Prepare a public dataset D_align and an alignment model with parameters θ_align
3: for each round t = 0, 1, 2, ... do
4:   m ← max(C · n, 1)
5:   U_t ← (random set of m clients)
6:   for each client u ∈ U_t in parallel do
7:     θ_{t+1}^u, dict_{t+1}^u ← ClientUpdate(u, θ_t, dict_t)
8:   θ_{t+1} ← Σ_{u∈U_t} (|D_u| / Σ_{i∈U_t} |D_i|) · θ_{t+1}^u
9:   dict_{t+1} ← concat([{dict_{t+1}^u | u ∈ U_t}], dim = 1)

ClientUpdate(u, θ, dict):  // Run on client u
1: for each local epoch i from 1 to E do
2:   for batch b ∈ D_u do
3:     // Update θ with Eq. (14)
4:     θ ← θ − η∇L(θ; b, dict, D_align, θ_align)
5: Generate dict_u by Eqs. (9) and (10)
6: return θ, dict_u

Algorithm 1 summarizes the proposed approach.
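A high-level sketch of one server round of Algorithm 1 is shown below. The client interface (`local_update` returning a state dict, local dictionary, and sample count) is our own illustration of the ClientUpdate step, and we concatenate local dictionaries along the sample dimension under the assumption that each dictionary is stored as a (num_samples, dim) tensor.

```python
import copy
import torch

def fedca_round(global_model, global_dict, clients, C=1.0):
    """One communication round of FedCA (Algorithm 1), sketched at a high level."""
    m = max(int(C * len(clients)), 1)
    selected = torch.randperm(len(clients))[:m].tolist()

    states, dicts, sizes = [], [], []
    for idx in selected:
        state, local_dict, n = clients[idx].local_update(
            copy.deepcopy(global_model), global_dict
        )
        states.append(state)
        dicts.append(local_dict)
        sizes.append(n)

    # Line 8: weighted averaging of local models
    total = float(sum(sizes))
    new_state = {
        k: sum(s[k] * (n / total) for s, n in zip(states, sizes)) for k in states[0]
    }
    global_model.load_state_dict(new_state)

    # Line 9: gather local dictionaries into the new global dictionary
    new_global_dict = torch.cat(dicts, dim=0)
    return global_model, new_global_dict
```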

Experiments

FURL aims to learn a representation model from decentralized and unlabeled data. In this section, we present an empirical study of FedCA.

Experimental Setup

Baselines. AutoEncoder is a generative method that learns representations in an unsupervised manner by reconstructing, from the reduced encoding, an output as close as possible to its original input (Hinton and Salakhutdinov 2006). Predicting Rotation is a proxy task of self-supervised learning that rotates samples by random multiples of 90 degrees and predicts the degree of rotation (Gidaris, Singh, and Komodakis 2018).

Setting  | Method        | CIFAR10       | CIFAR100      | MiniImageNet
IID      | FedAE         | 61.23 / 65.47 | 34.07 / 36.56 | 28.21 / 31.97
IID      | FedPR         | 55.75 / 63.52 | 29.74 / 30.89 | 24.76 / 26.63
IID      | FedSimCLR     | 61.62 / 68.10 | 34.18 / 39.75 | 29.84 / 32.18
IID      | FedCA (ours)  | 64.87 / 71.25 | 39.47 / 43.30 | 35.27 / 37.12
Non-IID  | FedAE         | 60.14 / 63.74 | 33.94 / 37.27 | 29.00 / 30.44
Non-IID  | FedPR         | 54.94 / 60.31 | 30.70 / 32.39 | 24.74 / 25.91
Non-IID  | FedSimCLR     | 59.21 / 64.06 | 33.63 / 38.70 | 29.24 / 30.47
Non-IID  | FedCA (ours)  | 63.02 / 68.01 | 38.94 / 42.34 | 34.95 / 35.01

Table 1: Top-1 accuracies (%) of algorithms for FURL on linear evaluation. Each dataset column reports 5-layer CNN / ResNet-50.

We solely combine FedAvg with AutoEncoder (named FedAE), Predicting Rotation (named FedPR), and SimCLR (named FedSimCLR), respectively, and use them as baselines for FURL.

Dataset. The CIFAR-10/CIFAR-100 dataset (Krizhevsky, Hinton et al. 2009) consists of 60000 32x32 colour images in 10/100 classes, with 6000/600 images per class; there are 50000 training images and 10000 test images in each of CIFAR-10 and CIFAR-100. The MiniImageNet dataset (Deng et al. 2009; Vinyals et al. 2016) is extracted from the ImageNet dataset and consists of 60000 84x84 colour images in 100 classes; we split it into a training dataset with 50000 samples and a test dataset with 10000 samples. We implement FedCA and the baseline methods on the three datasets above in PyTorch (Paszke et al. 2019).

Federated Setting. We deploy our experiments in a simulated federated learning environment where we set one centralized node as the server and 5 distributed nodes as clients. The number of local epochs E is 5, and in each round all of the clients obtain the global model and execute local training, i.e., the proportion of selected clients C is 1. For each dataset, we consider two federated settings: IID and Non-IID. In the IID setting, each client randomly samples 10000 images from the entire training dataset, while in the Non-IID setting, samples are split among clients by class, which means that each client has 10000 samples from 2/20/20 classes of CIFAR10/CIFAR100/MiniImageNet.

Training Details. We compare our approach with the baseline methods on different encoders, including a 5-layer CNN (Krizhevsky, Sutskever, and Hinton 2012) and ResNet-50 (He et al. 2016). The encoder maps input samples to 2048-dimensional representations, and then a multilayer perceptron translates the representations into a 128-dimensional vector used to calculate the contrastive loss. Adam is used as the optimizer and the initial learning rate is 1e-3 with 1e-6 weight decay. We train models for 100 epochs with a minibatch size of 128. We set the dictionary size K = 1024, the momentum term of temporal ensembling α = 0.5, and the scale factor β = 0.01. 3200 images randomly sampled from STL-10 are used for the alignment module. Data augmentation for contrastive representation learning includes random cropping and resizing, random color distortion, random flipping, and Gaussian blurring.
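As a rough illustration of this setup, the sketch below builds the ResNet-50 encoder, the projection head, and the optimizer described above; the hidden width of the MLP head beyond the stated 2048-to-128 mapping is our assumption.

```python
import torch
import torch.nn as nn
import torchvision

class ContrastiveModel(nn.Module):
    """Encoder f producing 2048-d representations h; projection head g producing 128-d z."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        resnet.fc = nn.Identity()                 # expose the 2048-d pooled features
        self.encoder = resnet
        self.head = nn.Sequential(
            nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128)
        )

    def forward(self, x):
        h = self.encoder(x)
        z = nn.functional.normalize(self.head(h), dim=1)
        return h, z

model = ContrastiveModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```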

Evaluation Protocols and Results

Linear Evaluation. We first study our method by linear classification on a fixed encoder to verify the representations learned in FURL. We perform FedCA and the baseline methods to learn representations on CIFAR10, CIFAR100, and MiniImageNet without labels in the federated setting. Then we fix the encoder and train a linear classifier with supervision on the entire dataset. We train this classifier with Adam as the optimizer for 100 epochs and report top-1 classification accuracy on the test sets of CIFAR10, CIFAR100, and MiniImageNet.

As shown in Table 1, federated averaging with contrastive learning works better than the other unsupervised approaches. Moreover, our method outperforms all of the baseline methods thanks to the modules designed for FURL, as we expected.
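A minimal sketch of this linear-evaluation protocol (freeze the learned encoder, train only a linear classifier on top of the 2048-d representations) is given below; the data-loader wiring is omitted and the helper name is ours.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, test_loader, num_classes, epochs=100):
    """Train a linear classifier on frozen representations and report top-1 accuracy."""
    encoder.eval()
    for p in encoder.parameters():              # freeze the encoder
        p.requires_grad = False
    clf = nn.Linear(2048, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                h = encoder(x)
            loss = nn.functional.cross_entropy(clf(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = clf(encoder(x)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return 100.0 * correct / total
```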

Semi-Supervised Learning. In federated scenarios, the private data at clients may be only partly labeled, so we can learn a representation model without supervision and finetune it on labeled data. We assume that each client has 1% and 10% labeled data, respectively. First, we train a representation model in the FURL setting. Then we finetune it (followed by an MLP consisting of a hidden layer and a ReLU activation function) on the labeled data for 100 epochs with Adam as the optimizer and learning rate lr = 1e-3.

Table 2 reports the top-1 accuracy of the various methods on CIFAR10, CIFAR100, and MiniImageNet. We observe that the accuracy of the global model trained by federated supervised learning on limited labeled data is significantly lower, and using the representation model trained in FURL as the initial model improves performance to varying degrees. Our method outperforms the other approaches, suggesting that federated unsupervised representation learning benefits from the designed modules of FedCA, especially in the Non-IID setting.

Transfer Learning. A main goal of FURL is to learn a representation model from decentralized and unlabeled data for personalized downstream tasks. To verify whether the features learned in FURL are transferable, we use the models trained in FURL as initial models and then train an MLP together with the encoder on other datasets. The image size of CIFAR (32x32x3) is resized to be the same as MiniImageNet (84x84x3) when we fine-tune the model learned from MiniImageNet on CIFAR. We train for 100 epochs with Adam as the optimizer and set the learning rate lr = 1e-3.

Table 3 shows that the model trained by FedCA achieves excellent performance and outperforms all of the baseline methods in the Non-IID setting.

Label Fraction | Setting | Method              | CIFAR10       | CIFAR100      | MiniImageNet
1%             | IID     | FedAvg (Supervised) | 31.84 / 26.68 | 9.35 / 8.09   | 5.83 / 5.42
1%             | IID     | FedAE               | 35.98 / 36.86 | 13.36 / 14.53 | 11.71 / 12.84
1%             | IID     | FedPR               | 34.51 / 36.47 | 13.15 / 14.20 | 11.52 / 12.34
1%             | IID     | FedSimCLR           | 43.95 / 50.00 | 22.16 / 23.01 | 19.14 / 19.67
1%             | IID     | FedCA (ours)        | 45.05 / 50.67 | 22.37 / 23.32 | 19.20 / 20.22
1%             | Non-IID | FedAvg (Supervised) | 20.99 / 17.72 | 6.22 / 5.37   | 3.92 / 3.03
1%             | Non-IID | FedAE               | 23.08 / 23.43 | 9.96 / 9.63   | 8.45 / 8.43
1%             | Non-IID | FedPR               | 22.83 / 23.17 | 9.83 / 9.38   | 8.30 / 8.58
1%             | Non-IID | FedSimCLR           | 26.08 / 26.03 | 14.30 / 14.02 | 11.02 / 10.89
1%             | Non-IID | FedCA (ours)        | 28.96 / 28.50 | 17.02 / 16.48 | 13.39 / 13.03
10%            | IID     | FedAvg (Supervised) | 50.87 / 40.44 | 16.18 / 14.47 | 13.46 / 12.76
10%            | IID     | FedAE               | 51.88 / 53.64 | 21.77 / 22.45 | 21.73 / 21.96
10%            | IID     | FedPR               | 51.38 / 53.32 | 21.30 / 21.21 | 21.67 / 21.58
10%            | IID     | FedSimCLR           | 59.27 / 60.67 | 31.11 / 31.56 | 28.45 / 28.79
10%            | IID     | FedCA (ours)        | 59.91 / 61.02 | 31.37 / 32.09 | 28.93 / 29.44
10%            | Non-IID | FedAvg (Supervised) | 30.62 / 21.69 | 14.90 / 13.98 | 11.88 / 10.13
10%            | Non-IID | FedAE               | 32.07 / 32.19 | 18.77 / 18.98 | 13.48 / 13.65
10%            | Non-IID | FedPR               | 31.04 / 31.78 | 18.39 / 18.34 | 13.30 / 13.24
10%            | Non-IID | FedSimCLR           | 32.52 / 33.83 | 19.91 / 20.01 | 15.90 / 16.03
10%            | Non-IID | FedCA (ours)        | 35.78 / 36.28 | 21.98 / 22.46 | 18.67 / 18.89

Table 2: Top-1 accuracies (%) of algorithms for FURL on semi-supervised learning. Each dataset column reports 5-layer CNN / ResNet-50.

Setting | Method       | CIFAR100 → CIFAR10 | MiniImageNet → CIFAR10 | MiniImageNet → CIFAR100
-       | Random init  | 86.70 / 93.79      | 86.60 / 93.05          | 58.05 / 70.52
IID     | FedAE        | 87.33 / 94.23      | 86.74 / 94.23          | 58.82 / 71.36
IID     | FedPR        | 87.22 / 93.89      | 87.33 / 93.55          | 58.23 / 70.78
IID     | FedSimCLR    | 87.80 / 94.88      | 88.03 / 94.87          | 59.08 / 71.85
IID     | FedCA (ours) | 88.04 / 95.03      | 87.91 / 94.94          | 58.91 / 71.98
Non-IID | FedAE        | 87.37 / 94.35      | 87.00 / 94.06          | 58.56 / 71.17
Non-IID | FedPR        | 86.97 / 93.91      | 86.92 / 93.55          | 58.39 / 70.25
Non-IID | FedSimCLR    | 87.04 / 94.02      | 86.81 / 93.97          | 58.11 / 70.91
Non-IID | FedCA (ours) | 87.75 / 94.69      | 87.66 / 94.16          | 58.93 / 71.32

Table 3: Top-1 accuracies (%) of algorithms for FURL on transfer learning. Each column reports 5-layer CNN / ResNet-50.

Ablation Study

We perform an ablation study on CIFAR-10 in the IID and Non-IID settings to demonstrate the effectiveness of the alignment module and the dictionary module (with temporal ensembling). We implement (i) FedSimCLR, (ii) federated contrastive learning with only the alignment module, (iii) federated contrastive learning with only the dictionary module, (iv) federated contrastive learning with only the dictionary module based on temporal ensembling, and (v) FedCA, respectively, and then a linear classifier is used to evaluate the performance of the frozen representation model with supervision. Figure 5 shows the results.

Figure 5: Ablation study of the modules designed for FURL by linear classification on CIFAR-10 (ResNet-50).

We observe that the alignment module improves the performance by 1.4% in both the IID and Non-IID settings. With the help of the dictionary module (without temporal ensembling), there is a 2.5% and 2.7% increase in accuracy under the IID and Non-IID settings, respectively. Moreover, we note that the representation model learned in FURL benefits more from the temporal ensembling technique in the Non-IID setting than in the IID setting, probably because the features learned in the IID setting are stable enough that temporal ensembling plays a far less important role there. Fortunately, the model achieves excellent performance when we combine federated contrastive learning with the alignment module and the dictionary module based on temporal ensembling, which suggests that these two modules can work collaboratively and help to tackle the challenges in FURL.

Conclusions

We formulate a significant and challenging problem, Federated Unsupervised Representation Learning (FURL), and show two main challenges of this problem: inconsistency of representation spaces and misalignment of representations. In this paper, we propose a contrastive learning-based federated learning algorithm named FedCA, composed of the dictionary module and the alignment module, to tackle the above challenges. Thanks to these two modules, FedCA enables distributed local models to learn consistent and aligned representations while protecting data privacy. Our experiments demonstrate that FedCA outperforms algorithms that solely combine federated learning with unsupervised approaches and provides a stronger baseline for FURL.

In future work, we plan to extend FedCA to cross-modal scenarios where different clients may have data in different modalities such as images, videos, texts, and audio.

References

Baevski, A.; Zhou, H.; Mohamed, A.; and Auli, M. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.

Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H. B.; Patel, S.; Ramage, D.; Segal, A.; and Seth, K. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 1175–1191.

Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020b. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 215–223.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.

Dosovitskiy, A.; Springenberg, J. T.; Riedmiller, M.; and Brox, T. 2014. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, 766–774.

Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.

Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, 1735–1742. IEEE.

Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582.

He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hinton, G. E.; and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786): 504–507.

Jeong, E.; Oh, S.; Kim, H.; Park, J.; Bennis, M.; and Kim, S.-L. 2018. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv preprint arXiv:1811.11479.

Jin, Y.; Wei, X.; Liu, Y.; and Yang, Q. 2020. Towards utilizing unlabeled data in federated learning: A survey and prospective. arXiv: Learning.

Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.

Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Konečnỳ, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2018. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.

Logeswaran, L.; and Lee, H. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8026–8037.

Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 16–17.

Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; and Tang, J. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1150–1160.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Sattler, F.; Wiedemann, S.; Müller, K.-R.; and Samek, W. 2019. Robust and communication-efficient federated learning from non-IID data. IEEE Transactions on Neural Networks and Learning Systems.

Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S.; and Brain, G. 2018. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 1134–1141. IEEE.

Sohn, K. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, 1857–1865.

Tian, Y.; Krishnan, D.; and Isola, P. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

van Berlo, B.; Saeed, A.; and Ozcelebi, T. 2020. Towards federated unsupervised representation learning. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, 31–36.

Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.

Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; and Khazaeni, Y. 2020. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440.

Wang, T.; and Isola, P. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242.

Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3733–3742.

Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019a. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10(2): 1–19.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 5753–5763.

Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; and Chandra, V. 2018. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582.
