
Journal of Communications and Information Networks, Vol.5, No.1, Mar. 2020 Research paper

Sparse Optimization for Green Edge AI Inference

Xiangyu Yang, Sheng Hua, Yuanming Shi, Hao Wang, Jun Zhang, Khaled B. Letaief

Abstract—With the rapid upsurge of deep learning tasks at the network edge, effective edge artificial intelligence (AI) inference becomes critical to providing low-latency intelligent services for mobile users by leveraging the edge computing capability. In such scenarios, energy efficiency becomes a primary concern. In this paper, we present a joint inference task selection and downlink beamforming strategy to achieve energy-efficient edge AI inference by minimizing the overall power consumption, consisting of both computation and transmission power consumption, which yields a mixed combinatorial optimization problem. By exploiting the inherent connection between the task selection set and the group sparsity structure of the transmit beamforming vector, we reformulate the optimization as a group sparse beamforming problem. To solve this challenging problem, we propose a log-sum function based three-stage approach. By adopting the log-sum function to enhance group sparsity, a proximal iteratively reweighted algorithm is developed. Furthermore, we establish the global convergence analysis and provide the ergodic worst-case convergence rate for this algorithm. Simulation results demonstrate the effectiveness of the proposed approach for improving energy efficiency in edge AI inference systems.

Manuscript received Jan. 05, 2020; revised Feb. 08, 2020; accepted Feb. 16, 2020. Part of this work was presented at the IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, Hawaii, USA, Sept. 2019 [1]. This work was supported in part by the National Nature Science Foundation of China under Grant 61601290 (Yuanming Shi) and a start-up fund of The Hong Kong Polytechnic University (Project ID P0013883) (Jun Zhang). The associate editor coordinating the review of this paper and approving it for publication was R. Wang.

X. Y. Yang, S. Hua, Y. M. Shi, H. Wang. School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

X. Y. Yang, S. Hua. Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China. University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).

J. Zhang. Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China (e-mail: [email protected]).

K. B. Letaief. Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China. Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).

Keywords—AI, edge inference, cooperative transmission, energy efficiency, group sparse beamforming, proximal iteratively reweighted algorithm

I. INTRODUCTION

The availability of big data and computing power, along with advances in optimization algorithms, has triggered a booming era of artificial intelligence (AI). Notably, deep learning [2] is regarded as the most popular sector of modern AI and has achieved exciting breakthroughs in applications such as speech recognition and computer vision [3]. Benefiting from these achievements, AI is becoming a promising tool that streamlines people's decision-making processes and facilitates the development of diversified intelligent services (e.g., virtual personal assistants, recommendation systems, etc.). Meanwhile, with the proliferation of mobile computing and Internet of things (IoT) devices, massive real-time data are generated locally [4]. However, it is widely acknowledged that traditional cloud-based computing [5,6] faces challenges (e.g., latency, privacy and network congestion) in supporting ubiquitous AI-empowered applications on mobile devices [7].

In contrast, edge AI is a promising approach that can tackle the above concerns by fusing mobile edge computing [8] with AI techniques (e.g., deep neural networks (DNNs)). By pushing AI models to the network edge, it brings the edge servers close to the requesting mobile devices and thus enables low-latency and privacy-preserving services. Notably, edge AI is envisioned as a key ingredient of future intelligent 6G networks [9-12], which will fully unleash the potential of mobile communication and computation. Typically, edge AI consists of two phases: edge training and edge inference. As for the edge training phase, the training of AI models can be performed on cloud, edge or end devices [13]; however, this is beyond the scope of this work. This paper mainly focuses on edge inference, i.e., deploying trained AI models and performing model inference at the network edge. Following Refs. [7,14], edge AI inference architectures are generally classified into three major types.

• On-device inference. It performs the model inference directly on end devices where DNN models are deployed. While some enabling techniques (e.g., model compression [15,16], hardware speedup [17]) have been proposed to facilitate the deployment of DNN models, it still poses challenges for resource-limited (e.g., memory, power budget and computation) end devices [18]. To mitigate such concerns, on-device distributed computing is envisioned as a promising solution for on-device inference, which enables AI model inference across multiple distributed end devices [19].

• Joint device-edge inference. This mode carries out the AI model inference in a device-edge cooperation fashion [7] with model partition and model early-exit techniques [20]. While device-edge cooperation is flexible and enables low-latency edge inference, it may still place high resource requirements on end devices due to the resource-demanding nature of DNNs [21].

• Edge server inference. Such methods transfer the raw input data to edge servers for processing, which then return the inference results to end users [22,23]. Edge server inference is particularly suitable for computation-intensive tasks. Nonetheless, the inference performance relies mainly on the channel bandwidth between the edge server and the end devices. Cooperative transmission [24] thus becomes promising for communication-efficient delivery of inference results.

To support computation-intensive tasks on resource-limited end devices, edge server inference stands out as a viable solution that fulfills the key performance requirements; the main focus of this paper is therefore AI model inference for mobile devices under the edge server inference architecture. For an edge AI inference system, energy efficiency is a key performance indicator [14], which motivates us to focus on energy-efficient edge inference design. This is achieved by optimizing the overall network power consumption, including the computation power consumption for performing inference tasks and the transmission power consumption for returning inference results. In particular, cooperative transmission [24] is a widely recognized technique for reducing the downlink transmit power consumption and providing low-latency transmission services by exploiting high beamforming gains. In this work, we thus consider that multiple edge base stations (BSs) collaboratively transmit the inference results to the end devices [22]. To enable transmission cooperation, we apply the computation replication principle [25], i.e., the inference tasks from end devices can be performed by several neighboring edge BSs to create multiple copies of the inference results. However, computation replication greatly increases the power consumed in performing inference tasks. Therefore, it is necessary to select the inference tasks to be performed by each edge BS so as to achieve an optimal balance between communication and computation power consumption.

In this paper, we propose a joint inference task selection and downlink beamforming strategy for achieving energy-efficient edge AI inference by optimizing the overall network power consumption, consisting of the computation power consumption and the transmission power consumption, under quality-of-service (QoS) constraints. However, the resulting formulation contains combinatorial variables and nonconvex constraints, which make it computationally intractable. To address this issue, we observe that the transmit beamforming vector has an intrinsic connection with the set of selected inference tasks (i.e., the tasks that the edge servers opt to execute). Based on this crucial observation, we present a group sparse beamforming (GSBF) reformulation, followed by a log-sum function based three-stage GSBF approach. In particular, in the first stage, we adopt a weighted log-sum relaxation to enhance the group sparsity of the structured solutions.

Nonetheless, the log-sum function minimization problem poses challenges in computation and analysis. To resolve these issues, we present a proximal iteratively reweighted algorithm, which solves a sequence of weighted convex subproblems. Moreover, we establish the global convergence analysis and a worst-case convergence rate analysis of the presented proximal iteratively reweighted algorithm. Specifically, by leveraging the Frechet subdifferential [26], we characterize the first-order necessary optimality conditions of the formulated convex-constrained log-sum problem. We then show that the iterates generated by the proposed algorithm make the objective values steadily decrease, and prove that, for any feasible initial point, any cluster point of the generated sequence is a critical point of the original objective. Finally, we show that the defined optimality residual has an O(1/t) ergodic worst-case convergence rate, where t is the iteration counter.

In the following, we summarize the major contributions of this paper.

• We propose a joint task selection and downlink beamforming strategy to optimize the trade-off between computation and communication power consumption for an energy-efficient edge AI inference system. In particular, task selection is achieved by controlling the group sparsity structure of the transmit beamforming vector, thereby formulating a group sparse beamforming problem under the target QoS constraints.

• To solve the resulting optimization problem, we propose a log-sum function based three-stage GSBF approach. In particular, we adopt a weighted log-sum approximation to enhance the group sparsity of the transmit beamforming vector in the first stage. Moreover, we propose a proximal iteratively reweighted algorithm to solve the log-sum minimization problem.

• For the presented proximal iteratively reweighted algorithm, we establish the global convergence analysis. We prove that every cluster point generated by the presented algorithm satisfies the first-order necessary optimality condition of the original nonconvex log-sum problem. Furthermore, a worst-case O(1/t) convergence rate is established for this algorithm in an ergodic sense.

• Numerical experiments are conducted to demonstrate the effectiveness and competitive performance of the log-sum function based three-stage GSBF approach for designing a green edge AI inference system.

A. Related Works

The study of inducing sparsity generally falls into the sparse optimization category [27,28]. Sparse optimization, emerging as a powerful tool, has recently contributed to the effective design of wireless networks, e.g., group sparse beamforming for energy-efficient cloud radio access networks [29,30] and sparse signal processing for IoT networks [31,32]. In particular, to induce the group sparsity structure of the beamforming vector, the works in Refs. [22,23] adopted the mixed $\ell_{1,2}$-norm. As illustrated in Ref. [27], the mixed $\ell_{1,q}$-norms ($q > 1$) can induce the group sparsity structure of the solution of interest, among which the mixed $\ell_{1,2}$-norm and the $\ell_{1,\infty}$-norm [33] are the most commonly adopted. However, the sparsity achieved by convex sparsity-inducing norms is not always satisfactory, since some small nonzero elements typically remain in the obtained solutions [34]. In contrast to these works, some works applied nonconvex sparsity-inducing functions to seek sparser solutions [35]. Notably, the work in Ref. [34] reported the capability of the log-sum function to enhance the sparsity of the solutions.

Motivated by their superior performance in inducing sparsity, we adopt log-sum functions to promote the sparsity pattern of the solutions. However, adopting the log-sum function to enhance sparsity usually makes the problem difficult to compute and analyze. In Ref. [34], the authors first proposed an iteratively reweighted $\ell_1$ algorithm (IRL1) for tackling nonconvex and nonsmooth log-sum functions with linear constraints; nonetheless, they did not further conduct a convergence analysis of the proposed method. Under reasonable assumptions, the work in Ref. [36] established convergence results for a class of unconstrained nonconvex nonsmooth problems based on the limiting-subgradient tool; these results apply to the log-sum model in an unconstrained setting. In Ref. [37], a proximal iteratively reweighted algorithm was proposed and any accumulation point was proven to be a critical point. The work in Ref. [38] further showed that, for any feasible starting point, the sequence generated by their proximal iteratively reweighted algorithm converges to a critical point under the Kurdyka-Łojasiewicz property [39]. However, these works focused on unconstrained formulations or linearly constrained cases of the log-sum model. The theoretical analysis of the log-sum function with general convex-set constraints has not been investigated.

B. Organization

The remainder of this paper is organized as follows. Section II presents the system model of edge AI inference, followed by the problem formulation and analysis; section II.C provides the group sparse beamforming formulation. The log-sum function based three-stage GSBF approach is proposed in section III. Section IV provides the global convergence and convergence rate analysis of the proposed proximal iteratively reweighted algorithm. Section V demonstrates the performance of the proposed approach. Concluding remarks are made in section VI. To keep the main text coherent and free of technical details, we defer most of the mathematical proofs to the appendices.

C. Notation

Throughout this paper, we use the following notation. $\mathbb{C}^n$ and $\mathbb{R}^n$ denote the complex vector space and the real Euclidean n-space, respectively. Boldface lower-case letters and upper-case letters represent vectors (e.g., $x$) and matrices (e.g., $I$) of appropriate size, respectively. The inner product between $x, y\in\mathbb{C}^n$ is denoted as $\langle x, y\rangle$. $\|\cdot\|_1$ and $\|\cdot\|_2$ are the conventionally defined $\ell_1$-norm and $\ell_2$-norm for vectors in $\mathbb{C}^n$, respectively. In addition, we use $(\cdot)^H$ and $(\cdot)^T$ to denote the Hermitian and transpose operators, respectively. $\mathfrak{R}(\cdot)$ is the real part of a complex scalar. $\mathbf{1}$ is a vector with all components equal to 1 and $\mathbf{0}$ denotes the zero vector of appropriate size. In particular, $\|v\|_{\mathcal{G},2} = [\|v_1\|_2, \dots, \|v_n\|_2]^T\in\mathbb{R}^n$ represents a vector whose nth element is the $\ell_2$-norm of the structured vector $v_n\in\mathbb{C}^n$. We use $\circ$ to denote the composition of two functions, and the symbol $\odot$ denotes the element-wise product of two vectors $x, y\in\mathbb{C}^n$.

For any closed convex set $\mathcal{C}\subset\mathbb{C}^n$, we use $\delta_{\mathcal{C}}(c)$ to denote the characteristic function associated with $\mathcal{C}$, which is defined as

\[
\delta_{\mathcal{C}}(c) := \begin{cases} 0, & c\in\mathcal{C}, \\ +\infty, & c\notin\mathcal{C}. \end{cases}
\]

Similarly, $I(\cdot)$ defines the indicator function associated with a given condition, i.e., if the condition is met, it returns the value 1; otherwise, it returns the value 0. Moreover, $z\sim\mathcal{CN}(\mu,\sigma^2)$ denotes a complex Gaussian random variable $z$ with mean $\mu$ and variance $\sigma^2$.

II. SYSTEM MODEL AND PROBLEM FORMULATION

This section describes the overall system model and the power consumption model for performing intelligent tasks in the considered edge AI inference system, followed by the problem formulation and analysis.

A. System Model

Figure 1 System model illustration of the edge AI inference for intelligent tasks: N multi-antenna BSs serve K mobile users, with raw inputs $d_k$ collected in the uplink and inference results $\phi_k(d_k)$ returned in the downlink (this paper considers the scenario in which each neighboring edge BS has collected the raw input data $d_k$ from the mobile users)

We consider an edge computing system consisting of $N$ $L$-antenna BSs collaboratively serving $K$ single-antenna mobile users (MUs), as illustrated in Fig. 1. These deployed BSs are used as dedicated edge nodes and have access to enormous computation and storage resources [8]. For convenience, define $\mathcal{K} = \{1,\dots,K\}$ and $\mathcal{N} = \{1,\dots,N\}$ as the index sets of MUs and BSs, respectively. MUs have inference tasks whose results can be inferred from task-related DNNs. For ease of exposition, we use $d_k$ to denote the raw input data collected from MU $k$, and the corresponding inference result is represented as $\phi_k(d_k)$. As performing intelligent tasks on DNNs is typically resource-demanding, it is usually impractical to perform the tasks locally on resource-constrained mobile devices. In the proposed edge AI inference system, by exploiting computation replication [25], we consider the scenario in which each neighboring edge BS has collected the raw input data $d_k$ from all MUs. The edge BSs then process the data $d_k$ for model inference. After the edge BSs complete the model inference, the inference results $\phi_k(d_k)$ are returned to the corresponding MUs via the downlink channels. We assume that all edge BSs have been equipped with the pre-trained deep network models for all inference tasks [23].

In the downlink transmission, the edge BSs that perform the inference task for the same MU cooperatively return the inference result to that MU. We assume that perfect channel state information (CSI) is available at all edge BSs to enable cooperative transmission of the inference results [24]. Let $\mathcal{A}_n\subseteq\mathcal{K}$ denote the index set of MUs whose tasks are selectively performed by BS $n$, and let $\mathscr{A} = (\mathcal{A}_1,\dots,\mathcal{A}_N)$ represent the task selection strategy.

1) Downlink Transmission Model: Let $s_k\in\mathbb{C}$ denote the encoded scalar of the requested output $\phi_k(d_k)$ for MU $k$, and let $v_{nk}\in\mathbb{C}^L$ be the transmit beamforming vector at BS $n$ for $s_k$. For convenience, and without loss of generality, we assume that $\mathrm{E}(|s_k|^2) = 1$, i.e., the power of $s_k$ is normalized to unity. The transmitted signal $x_n$ at BS $n$ can be expressed as

\[
x_n = \sum_{k\in\mathcal{A}_n} v_{nk}s_k. \tag{1}
\]

Let $h_{nk}\in\mathbb{C}^L$ be the propagation channel coefficient vector between BS $n$ and MU $k$. The received signal at MU $k$, denoted as $y_k$, is then given by

\[
y_k = \sum_{n\in\mathcal{N}} h_{nk}^H x_n + z_k = \sum_{n\in\mathcal{N}} h_{nk}^H \sum_{l\in\mathcal{A}_n} v_{nl}s_l + z_k = \sum_{n\in\mathcal{N}} h_{nk}^H \Big[ I(k\in\mathcal{A}_n)\,v_{nk}s_k + \sum_{l\in\mathcal{A}_n,\, l\ne k} v_{nl}s_l \Big] + z_k, \tag{2}
\]

where $z_k\sim\mathcal{CN}(0,\sigma_k^2)$ is the isotropic additive white Gaussian noise.

We assume that all data symbols $s_k$ are mutually independent of each other as well as of the noise. Based on (2), the signal-to-interference-plus-noise ratio (SINR) for MU $k$ is therefore given as

\[
\mathrm{SINR}_k(\mathscr{A}) = \frac{\big|\sum_{n\in\mathcal{N}} I(k\in\mathcal{A}_n)\,h_{nk}^H v_{nk}\big|^2}{\sum_{l\ne k}\big|\sum_{n\in\mathcal{N}} I(l\in\mathcal{A}_n)\,h_{nk}^H v_{nl}\big|^2 + \sigma_k^2}. \tag{3}
\]
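For later reference, (3) is straightforward to evaluate numerically. The sketch below is ours (added for illustration, not code from the paper); names such as `H`, `v`, `A` and `sigma2` are illustrative.

import numpy as np

def sinr_k(H, v, A, sigma2, k):
    """Evaluate (3): H[n, k] is h_nk, v[(n, k)] is v_nk, A[n] is the set A_n."""
    N, K = H.shape[0], H.shape[1]
    def rx(l):  # sum_n I(l in A_n) h_nk^H v_nl
        return sum(H[n, k].conj() @ v[(n, l)] for n in range(N) if l in A[n])
    interference = sum(abs(rx(l)) ** 2 for l in range(K) if l != k)
    return abs(rx(k)) ** 2 / (interference + sigma2[k])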

2) Power Consumption Model: The computation and transmission power consumption for model inference is generally large. Energy efficiency is therefore of significant importance for an energy-efficient edge AI inference system design, and the overall network power consumed in computation and communication at the edge BSs is our main interest. Specifically, we express the total transmission power of all edge BSs in the downlink as

\[
P^{\mathrm{trans}}(\mathscr{A}, \{v_{nk}\}) := \sum_{n=1}^{N} \frac{1}{\eta_n}\,\mathrm{E}\Big[\sum_{k\in\mathcal{A}_n}\|v_{nk}s_k\|_2^2\Big] = \sum_{n=1}^{N}\sum_{k\in\mathcal{A}_n}\frac{1}{\eta_n}\|v_{nk}\|_2^2, \tag{4}
\]

where $\eta_n$ is the radio-frequency power amplifier efficiency coefficient of edge BS $n$.

In addition to the downlink transmission power consumption, the power consumed in performing AI inference tasks should be taken into consideration as well, owing to the power-demanding nature of running DNNs. We use $P^c_{nk}$ to denote the computation power consumption of BS $n$ in performing inference task $\phi_k$. Then the computation power consumed by all BSs is given by

\[
P^{\mathrm{comp}}(\mathscr{A}) := \sum_{n\in\mathcal{N}}\sum_{k\in\mathcal{A}_n} P^c_{nk}. \tag{5}
\]

For the estimation of the computation energy consumed in executing task $\phi_k$, the works [40,41] stated that the energy consumption of a deep neural network layer for inference mainly includes computation energy consumption and data movement energy consumption. For illustration, we take GoogLeNet v1 [42] as a concrete example of the energy consumed in performing inference tasks. Specifically, we use GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43]. With the help of the energy estimation online tool introduced in Ref. [41], we can visualize the energy consumption breakdown of GoogLeNet v1, as illustrated in Fig. 2. We obtain an estimate of the computation power consumption by dividing the total energy consumption by the computation time. In particular, the computation time is determined by the total number of multiply-and-accumulate (MAC) operations and the peak throughput of the Eyeriss chip.

Therefore, the overall power consumption for edge AI inference, including transmission and computation power consumption, is calculated as

\[
P^{\mathrm{overall}}(\mathscr{A},\{v_{nk}\}) = P^{\mathrm{trans}}(\mathscr{A},\{v_{nk}\}) + P^{\mathrm{comp}}(\mathscr{A}) = \sum_{n\in\mathcal{N}}\sum_{k\in\mathcal{A}_n}\frac{1}{\eta_n}\|v_{nk}\|_2^2 + \sum_{n\in\mathcal{N}}\sum_{k\in\mathcal{A}_n} P^c_{nk}. \tag{6}
\]
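For concreteness, (6) can be evaluated directly once a task selection and a set of beamformers are given; a minimal sketch (ours, with illustrative names `A`, `v`, `eta`, `P_c`) follows.

import numpy as np

def overall_power(A, v, eta, P_c):
    """Evaluate (6): A[n] is the set of MU indices in A_n served by BS n,
    v[(n, k)] is the beamformer v_nk, eta[n] follows (4), P_c[n, k] follows (5)."""
    p_trans = sum(np.linalg.norm(v[(n, k)]) ** 2 / eta[n]   # transmission term (4)
                  for n, users in enumerate(A) for k in users)
    p_comp = sum(P_c[n, k] for n, users in enumerate(A) for k in users)  # (5)
    return p_trans + p_comp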

B. Problem Formulation and Analysis

Note that there is a fundamental trade-off between transmission and computation power consumption. To be specific, having more edge BSs perform the same task for an MU can significantly reduce the transmission power by exploiting higher transmit beamforming gains. However, this inevitably increases the computation power consumed in performing inference tasks. Therefore, an energy-efficient edge inference system can be achieved by minimizing the overall network power consumption to strike a balance between these two parts of the power consumption.

Let $\gamma_k$ be the target SINR for MU $k$ to successfully receive reliable AI inference results in the downlink. In our proposed energy-efficient edge AI inference system, the overall power minimization problem is thus formulated as

\[
\begin{aligned}
\min_{\mathscr{A},\{v_{nk}\}}\quad & P^{\mathrm{overall}}(\mathscr{A},\{v_{nk}\}) \\
\mathrm{s.t.}\quad & \mathrm{SINR}_k(\mathscr{A}) \ge \gamma_k,\ \forall k\in\mathcal{K}, \\
& \sum_{k\in\mathcal{K}}\|v_{nk}\|_2^2 \le P^{\max}_n,\ \forall n\in\mathcal{N},
\end{aligned} \tag{7}
\]

where $P^{\max}_n > 0$ denotes the maximum transmit power of edge BS $n$.

Unfortunately, problem (7) turns out to be a mixed combinatorial optimization problem due to the presence of the combinatorial variable $\mathscr{A}$, which makes it computationally intractable [44]. On the other hand, the nonconvex SINR constraints also pose troublesome challenges for solving (7). To address these issues, we recast problem (7) into a tractable formulation by inducing the group sparsity of the beamforming vector in the following section.

Figure 2 The estimated energy consumption breakdown [41] of GoogLeNet v1 performing image classification tasks on the Eyeriss chip [43] (normalized energy consumption per layer, broken into input feature map, output feature map, weight, and computation energy)

C. A Group Sparse Beamforming Representation Framework

One naive approach to cope with the combinatorial variable $\mathscr{A}$ is exhaustive search. However, it is often computationally prohibitive owing to its exponential complexity. As a practical alternative, a critical observation is that the combinatorial variable $\mathscr{A}$ can be eliminated by exploiting the inherent connection between task selection and the group sparsity structure of the beamforming vectors. Specifically, if edge BS $n$ does not perform the inference task of MU $k$ (i.e., $k\notin\mathcal{A}_n$), then it will not deliver the inference result $\phi_k(d_k)$ in the downlink transmission (i.e., $v_{nk}=\mathbf{0}$). In other words, if $k\notin\mathcal{A}_n$, all coefficients in the beamforming vector $v_{nk}$ are zero simultaneously. Mathematically, we have $\mathcal{A}_n = \{k \mid v_{nk}\ne\mathbf{0},\ k\in\mathcal{K}\}$ for all $n\in\mathcal{N}$, meaning that the task selection strategy $\mathscr{A}$ can be uniquely determined by the group sparsity structure of $\{v_{nk}\}$. In this respect, the overall network power consumption in problem (7) can be rewritten as

\[
P^{\mathrm{sparse}}(\{v_{nk}\}) = \sum_{n=1}^{N}\sum_{k=1}^{K} \frac{1}{\eta_n}\|v_{nk}\|_2^2 + \sum_{n=1}^{N}\sum_{k=1}^{K} I(v_{nk}\ne\mathbf{0})\, P^c_{nk}. \tag{8}
\]

By considering the sparsity structure of the beamforming vectors, the SINR expression (3) is transformed into

\[
\mathrm{SINR}_k = \frac{\big|\sum_{n\in\mathcal{N}} h_{nk}^H v_{nk}\big|^2}{\sum_{l\ne k}\big|\sum_{n\in\mathcal{N}} h_{nk}^H v_{nl}\big|^2 + \sigma_k^2} = \frac{|h_k^H v_k|^2}{\sum_{l\ne k}|h_k^H v_l|^2 + \sigma_k^2},\quad \forall k\in\mathcal{K}, \tag{9}
\]

where $h_k = [h_{1k}^H,\dots,h_{Nk}^H]^H$ and $v_k = [v_{1k}^T,\dots,v_{Nk}^T]^T$ are the aggregated channel vector and downlink transmit beamforming vector for MU $k$, respectively.

On the other hand, since an arbitrary phase rotation of the transmit beamforming vectors $v_{nk}$ affects neither the downlink SINR constraints nor the objective function value, we can always find proper phases to equivalently transform the SINR constraints in (7) into convex second-order cone constraints [45]. We thus have the following convex-constrained sparse optimization framework for network power minimization

\[
\begin{aligned}
\min_{\{v_{nk}\}}\quad & P^{\mathrm{sparse}}(\{v_{nk}\}) \\
\mathrm{s.t.}\quad & \sum_{k\in\mathcal{K}}\|v_{nk}\|_2^2 \le P^{\max}_n,\ \forall n\in\mathcal{N}, \\
& \sqrt{\sum_{l\ne k}|h_k^H v_l|^2 + \sigma_k^2} \le \frac{1}{\sqrt{\gamma_k}}\,\mathfrak{R}(h_k^H v_k),\ \forall k\in\mathcal{K}.
\end{aligned} \tag{10}
\]

However, problem (10) is still nonconvex due to the indicator function in the objective. As presented in Proposition 1 of Ref. [29], a weighted mixed $\ell_{1,2}$-norm serves as the tightest convex surrogate of the objective in (10), i.e.,

\[
P(\{v_{nk}\}) = 2\sum_{n=1}^{N}\sum_{k=1}^{K} \sqrt{P^c_{nk}/\eta_n}\;\|v_{nk}\|_2. \tag{11}
\]

In this paper, we instead propose to adopt a new group-sparsity-inducing function for inference task selection that enhances sparsity, thereby further reducing the network power consumption.

III. A LOG-SUM FUNCTION BASED THREE-STAGE GROUP SPARSE BEAMFORMING FRAMEWORK

In this section, we propose to adopt the log-sum function to enhance the group sparsity of the beamforming vector, and then describe the log-sum function based three-stage GSBF approach. In particular, we propose a proximal iteratively reweighted algorithm to address the log-sum minimization problem in the first stage.

A. Log-Sum Function for Enhancing Group Sparsity

Let $v = [v_{11}^T,\dots,v_{1K}^T,\dots,v_{N1}^T,\dots,v_{NK}^T]^T\in\mathbb{C}^{LNK}$ denote the aggregated beamforming vector. To promote the group sparsity of the beamforming vector $\{v_{nk}\}$, we propose to use the following weighted nonconvex log-sum function as an approximation of the objective $P^{\mathrm{sparse}}(v)$:

\[
\Omega(v) := \sum_{n=1}^{N}\sum_{k=1}^{K} \rho_{nk}\log(1 + p\|v_{nk}\|_2), \tag{12}
\]

where $\rho_{nk} = \sqrt{P^c_{nk}/\eta_n} > 0$ is a weight coefficient and $p > 0$ is a tunable parameter. The main motivation for adopting such a log-sum penalty among the various types of sparsity-inducing functions [27] is based on the following considerations.

• The mixed $\ell_{1,2}$-norm acts like an $\ell_1$-norm on the vector of group norms and thereby offers the tightest convex relaxation of the $\ell_0$-norm. In contrast to the mixed $\ell_{1,2}$-norm, it has been reported that the log-sum function can enhance the sparsity of the solution significantly more than the conventional $\ell_1$-norm [27,34] (see the quick numerical check at the end of this subsection).

• From the perspective of performance and theoretical analysis of the designed algorithm, the log-sum function brings more practicability due to its coercivity and the boundedness of its first derivative.
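To make the first bullet concrete, the following quick numerical check (ours, not from the paper) compares the $\ell_1$-style penalty $t$ with the normalized log-sum penalty $\log(1+pt)/\log(1+p)$ on a shrinking group norm $t$: the log-sum penalty keeps charging small nonzero groups a substantial cost, which is what pushes them to exactly zero.

import numpy as np

p = 100.0  # the tunable parameter in (12)
for t in [1.0, 0.1, 0.01, 0.001]:
    # normalize so that both penalties equal 1 at t = 1
    log_sum = np.log1p(p * t) / np.log1p(p)
    print(f"t = {t:6.3f}: l1 penalty = {t:6.3f}, log-sum penalty = {log_sum:.3f}")

For instance, at t = 0.01 the $\ell_1$ penalty is 0.01 while the normalized log-sum penalty is about 0.15, i.e., small residual groups are penalized far more heavily.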

B. A Log-Sum Function Based Three-Stage Group Sparse Beamforming Approach

We now present the proposed log-sum based three-stage GSBF framework. Specifically, the first stage solves the log-sum convex-constrained problem via the proposed proximal iteratively reweighted algorithm to obtain a solution $v^{\mathrm{sparse}}$; the second stage prioritizes the tasks based on the obtained solution $v^{\mathrm{sparse}}$ and the system parameters, and then obtains the optimal task selection strategy $\mathscr{A}^*$; with $\mathscr{A}^*$ fixed, we refine $v$ in the third stage. The details are as follows.

Stage 1: Log-Sum Function Minimization. In this first stage, we obtain the group sparsity structure of the beamformer $v$ by solving the following nonconvex program

\[
\min_{v}\ \Omega(v) \quad \mathrm{s.t.}\ v\in\mathcal{C}, \tag{13}
\]

where $\mathcal{C}$ denotes the convex constraints in (10).

However, the nonconvex and nonsmooth objective in (13), together with the presence of the convex constraints, usually poses challenges in computation and analysis. Inspired by the work of Ref. [34], we can iteratively minimize the objective by solving a sequence of tractable convex subproblems. The main idea of our presented algorithm is to solve a well-constructed convex surrogate subproblem instead of directly solving the original nonconvex problem.

Let $f_p(v_{nk}) := \log(1 + p\|v_{nk}\|_2)$. First observe that $f_p(v_{nk})$ is a composite function with $z(v_{nk}) = \|v_{nk}\|_2$ convex and $f_p(z) = \log(1 + pz)$ nonconvex (indeed, concave). At the $i$th iterate $v^{[i]}_{nk}$, for any feasible $v_{nk}$, we have

\[
f_p(z(v_{nk})) \le f_p(z(v^{[i]}_{nk})) + \langle w^{[i]}, z(v_{nk}) - z(v^{[i]}_{nk})\rangle \le f_p(z(v^{[i]}_{nk})) + \langle w^{[i]}, z(v_{nk}) - z(v^{[i]}_{nk})\rangle + \frac{\beta}{2}\|v_{nk} - v^{[i]}_{nk}\|_2^2, \tag{14}
\]

where $w^{[i]}\in\partial f_p(z(v^{[i]}_{nk}))$ is the subgradient of $f_p(z)$ at $z = z(v^{[i]}_{nk})$, $\beta > 0$ is a prescribed proximity parameter, and the first inequality follows from the concavity of $f_p$ (cf. (31)). Hence, a convex subproblem is derived as an approximation of $f_p(v_{nk})$ at the current iterate $v^{[i]}_{nk}$, which reads

\[
\begin{aligned}
\min_{\{v_{nk}\}}\quad & \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\|v_{nk}\|_2 + \frac{\beta}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\|v_{nk} - v^{[i]}_{nk}\|_2^2 \\
\mathrm{s.t.}\quad & v\in\mathcal{C}
\end{aligned} \tag{15}
\]


Algorithm 1 A proximal iteratively reweighted algorithm for solving (13)

1: Input: p > 0, β > 0, IterMax and v^[0].
2: Initialization: w^[0] = 1 and set i = 0.
3: while not converged and IterMax not attained do
4:   (Solve the reweighted subproblem for v^[i+1]) Compute v^[i+1] according to (15).
5:   (Reweighting) Update the weights w^[i]_nk according to (16).
6:   Set i ← i + 1.
7: end while
8: Output: v^sparse = v^[i].

with weights

\[
w^{[i]}_{nk} = \rho_{nk}\cdot\partial f_p(z(v^{[i]}_{nk})) = \frac{p\,\rho_{nk}}{p\|v^{[i]}_{nk}\|_2 + 1}. \tag{16}
\]

As presented in Ref. [34], a smaller $\|v^{[i]}_{nk}\|_2$ yields a larger $w^{[i]}_{nk}$, which drives the nonzero components of $v_{nk}$ toward zero more aggressively. Overall, to enhance the group sparsity structure of the beamforming vector, the proposed proximal iteratively reweighted algorithm is summarized in Algorithm 1.
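For readers who want to prototype Algorithm 1, the following is a minimal sketch assuming cvxpy is available as a generic solver for the convex subproblem (15); the SINR constraints are written in the second-order-cone form of (10), the reweighting follows (16), and the stopping rule anticipates (29) below. All names (`H`, `P_c`, `eta`, `gamma`, `sigma2`, `P_max`) are illustrative; this is not the authors' implementation.

import numpy as np
import cvxpy as cp

def stage1_log_sum(H, P_c, eta, gamma, sigma2, P_max,
                   p=100.0, beta=0.1, iter_max=25, eps=1e-5):
    """Proximal iteratively reweighted algorithm for (13).
    H[n, k] is the L-dim channel h_nk from BS n to MU k."""
    N, K, L = H.shape
    rho = np.sqrt(P_c / eta[:, None])                 # rho_nk in (12)
    w = np.ones((N, K))                               # w^[0] = 1
    v_prev = {(n, k): np.zeros(L, dtype=complex)
              for n in range(N) for k in range(K)}
    for _ in range(iter_max):
        v = {(n, k): cp.Variable(L, complex=True)
             for n in range(N) for k in range(K)}
        # subproblem (15): reweighted group-l2 term plus proximal term
        obj = sum(w[n, k] * cp.norm(v[(n, k)], 2)
                  + (beta / 2) * cp.sum_squares(v[(n, k)] - v_prev[(n, k)])
                  for n in range(N) for k in range(K))
        cons = [sum(cp.sum_squares(v[(n, k)]) for k in range(K)) <= P_max[n]
                for n in range(N)]                    # per-BS power budgets
        for k in range(K):
            hk = H[:, k, :].reshape(-1)               # aggregated channel h_k
            vecs = {l: cp.hstack([v[(n, l)] for n in range(N)]) for l in range(K)}
            interf = [cp.sum(cp.multiply(np.conj(hk), vecs[l]))
                      for l in range(K) if l != k]
            # second-order-cone form of the SINR constraint in (10)
            cons.append(cp.norm(cp.hstack(interf + [np.sqrt(sigma2[k])]), 2)
                        <= cp.real(cp.sum(cp.multiply(np.conj(hk), vecs[k])))
                           / np.sqrt(gamma[k]))
        cp.Problem(cp.Minimize(obj), cons).solve()
        v_new = {key: var.value for key, var in v.items()}
        # reweighting step (16)
        norms = np.array([[np.linalg.norm(v_new[(n, k)]) for k in range(K)]
                          for n in range(N)])
        w_new = p * rho / (p * norms + 1)
        if np.abs(w_new - w).sum() <= eps:            # stopping rule, cf. (29)
            return v_new
        w, v_prev = w_new, v_new
    return v_prev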

Stage 2: Task Selection. In this second stage, an ordering guideline is applied to determine the priority of the inference tasks, guided by the solution obtained in Stage 1. For ease of notation, let $\mathcal{S} = \{(n,k) \mid n\in\mathcal{N},\ k\in\mathcal{K}\}$ denote the set of all tasks. By considering the key system parameters (e.g., $h_{nk}$, $P^c_{nk}$ and $\eta_n$), the priority of task $(n,k)$ is heuristically given as

\[
\theta_{nk} = \sqrt{\frac{\|h_{nk}\|_2^2\,\eta_n}{P^c_{nk}}}\;\|v_{nk}\|_2. \tag{17}
\]

Intuitively, if edge BS $n$ has a lower aggregated beamforming gain, lower power amplifier efficiency and lower channel power gain, but a higher computation power consumption for MU $k$, then task $(n,k)$ has a lower priority. A lower $\theta_{nk}$ indicates that the task of MU $k$ has lower priority and may not be performed by BS $n$. Thus, tasks are arranged according to rule (17) in descending order, i.e., $\theta_{\pi_1} \ge \theta_{\pi_2} \ge \dots \ge \theta_{\pi_{NK}}$, where $\pi$ denotes the permutation of task indexes.

We then solve a sequence of convex feasibility detection problems to obtain the task selection strategy $\mathscr{A}$,

\[
\mathrm{find}\ v \quad \mathrm{s.t.}\ v_{\pi(t)} = \mathbf{0},\ v\in\mathcal{C}, \tag{18}
\]

where $\pi(t) = \{\pi_{t+1},\dots,\pi_{NK}\}$ and $t$ increases from $K$ to $NK$ until (18) becomes feasible. Here $v_{\pi(t)} = \mathbf{0}$ are convex constraints, meaning that all coefficients of $v_{nk}$ are zero for every task $(n,k)\in\pi(t)$. The support set of the beamformer $v$ is defined as $\mathcal{T}(v) = \{(n,k) \mid \|v_{nk}\|_2 \ne 0,\ n\in\mathcal{N},\ k\in\mathcal{K}\}$; the optimal task selection strategy $\mathscr{A}^*$ can then be derived from $\mathcal{T}(v) = \{\pi_1,\dots,\pi_t\}$.

Stage 3: Solution Refinement. At this point, we have determined the task selection for each BS. Then, after fixing the obtained task index set, we solve the following convex program to refine the beamforming vectors

\[
\begin{aligned}
\min_{v}\quad & \sum_{n=1}^{N}\sum_{k=1}^{K} \frac{1}{\eta_n}\|v_{nk}\|_2^2 + \sum_{n=1}^{N}\sum_{k\in\mathcal{A}^*_n} P^c_{nk} \\
\mathrm{s.t.}\quad & v_{\pi(t)} = \mathbf{0}, \\
& v\in\mathcal{C}.
\end{aligned} \tag{19}
\]

Algorithm 2 A log-sum function based three-stage GSBF framework for solving (7)

1: Stage 1: Log-sum function minimization. Call Algorithm 1 to obtain v^sparse.
2: Stage 2: Task selection.
3: Sort the θ_nk in descending order by (17).
4: Set t = K.
5: while not feasible do
6:   (Determine the optimal task selection) Solve (18) with π(t) = {π_{t+1}, ..., π_{NK}}.
7:   Set t ← t + 1.
8: end while
9: Stage 3: Solution refinement.
10: Refine {v_nk} by solving (19).
11: Output {v_nk} and A*.

Overall, our proposed log-sum function based three-stage GSBF framework for solving (7) is summarized in Algorithm 2.
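Continuing the sketch from Stage 1, Stages 2 and 3 reduce to sorting by (17) plus a short feasibility loop over (18) followed by the refinement (19). In the sketch below, `solve_feasibility` and `solve_refinement` are hypothetical callbacks standing for the convex programs (18) and (19) (e.g., built with the same cvxpy machinery as above); all other names are illustrative.

import numpy as np

def stages2_and_3(v_sparse, H, P_c, eta, solve_feasibility, solve_refinement):
    """Task selection by priority rule (17), then refinement via (19)."""
    N, K = P_c.shape
    theta = {(n, k): np.sqrt(np.linalg.norm(H[n, k]) ** 2 * eta[n] / P_c[n, k])
                     * np.linalg.norm(v_sparse[(n, k)])
             for n in range(N) for k in range(K)}
    order = sorted(theta, key=theta.get, reverse=True)  # pi_1 >= ... >= pi_NK
    for t in range(K, N * K + 1):
        zero_set = order[t:]            # pi(t): tasks whose beamformers are forced to 0
        if solve_feasibility(zero_set): # convex feasibility problem (18)
            A_star = order[:t]          # support of v gives the selection A*
            return solve_refinement(zero_set, A_star)   # convex program (19)
    return None                         # (18) infeasible even with all NK tasks kept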

IV. GLOBAL CONVERGENCE ANALYSIS

In this section, we provide the global convergence analysis of Algorithm 1. Specifically, we derive the first-order necessary optimality condition to characterize the optimal solutions. We then establish convergence results for the sequence generated by Algorithm 1. Furthermore, we show that, for any initial feasible point, the entire sequence must have cluster points, and any cluster point satisfies the established first-order optimality condition. Finally, the ergodic worst-case convergence rate of the optimality residual is derived.

A. First-Order Necessary Optimality Condition

In this subsection, we derive the first-order necessary conditions that characterize the optimal solutions of (13). Problem (13) is equivalently rewritten as

\[
\min_{v}\ J(v) := \Omega(v) + \delta_{\mathcal{C}}(v). \tag{20}
\]

Similarly, for the derived subproblem (15), we have

\[
\min_{v}\ G(v; v^{[i]}) := \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\|v_{nk}\|_2 + \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \delta_{\mathcal{C}}(v). \tag{21}
\]

Due to the nonconvex and nonsmooth nature of the log-sum function, we make use of the Frechet subdifferential as the major tool in our analysis. Its definition is introduced as follows.


Definition 1 (Frechet subdifferential [26]) Let $\mathcal{X}$ be a real Banach space, let $\mathcal{X}^*$ denote the corresponding topological dual, and let $f$ be a function from $\mathcal{X}$ into the extended real line $\bar{\mathbb{R}} = \mathbb{R}\cup\{+\infty\}$, finite at $r$. The set

\[
\partial_F f(r) = \Big\{ r^*\in\mathcal{X}^* \;\Big|\; \liminf_{u\to r} \frac{f(u) - f(r) - \langle r^*, u - r\rangle}{\|u - r\|_2} \ge 0 \Big\}
\]

is called the Frechet subdifferential of $f$ at $r$. Its elements are referred to as Frechet subgradients.

Several important properties of the Frechet subdifferential [26] are listed below; they are used to characterize the optimal solutions of (13).

Proposition 1 Let $\mathcal{X}$ be a closed and convex set. Then the following properties of Frechet subdifferentials hold true.
(i) If $f$ is Frechet subdifferentiable at $x$ and attains a local minimum at $x$, then $\mathbf{0}\in\partial_F f(x)$.
(ii) Let $h(\cdot)$ be Frechet subdifferentiable at $z = z(x)$ with $z(x)$ convex; then $h\circ z(x)$ is Frechet subdifferentiable at $x$ such that

\[
y\,\partial z(x) \subset \partial_F\, h\circ z(x)
\]

for any $y\in\partial_F h(z)$.
(iii) $N_{\mathcal{X}}(x) = \partial_F\,\delta_{\mathcal{X}}(x)$ for closed and convex sets $\mathcal{X}$.

The following Fermat's rule [46] describes the necessary optimality condition of problem (13).

Theorem 1 (Fermat's rule) If (20) attains a local minimum at $v$, then it holds true that

\[
\mathbf{0} \in \partial_F J(v) := \partial_F \Omega(v) + N_{\mathcal{C}}(v). \tag{22}
\]

We next investigate the properties of $\partial_F\Omega(v)$ in the following Proposition 2, which indicates that the Frechet subdifferentials of $f_p(\cdot)$ at $\mathbf{0}$ are bounded.

Proposition 2 If $v_{nk}\ne\mathbf{0}$, then $\gamma\,\partial z(v_{nk}) \subset \partial_F\, f_p\circ z(v_{nk})$ for any $\gamma\in\partial_F f_p(z)$. In particular, $\partial_F\, f_p\circ z(\mathbf{0})$ is any element of $\{y\in\mathbb{C}^L \mid \|y\|_2 \le p\}$.

To explore the behavior of the proposed proximal iteratively reweighted algorithm, based on Theorem 1 and Proposition 2, we define the optimality residual associated with (20) at a point $v^{[i]}$ as

\[
r^{[i]} := w^{[i]}\odot x^{[i]} + u^{[i]}, \tag{23}
\]

where $x^{[i]} \in \partial\|v^{[i]}\|_{\mathcal{G},2} = [(\partial\|v^{[i]}_{11}\|_2)^T, \dots, (\partial\|v^{[i]}_{NK}\|_2)^T]^T$ and $u^{[i]}\in N_{\mathcal{C}}(v^{[i]})$. Since $r^{[i]}\in\partial_F J(v^{[i]})$, it follows that if $r^{[i]}=\mathbf{0}$ then $v^{[i]}$ satisfies the first-order necessary optimality condition (22). We adopt $r^{[i]}$ to measure the convergence rate of our algorithm.

Moreover, we provide the first-order optimality condition of subproblem (21) as follows: $\mathbf{0}\in\partial G(v^{[i+1]}; v^{[i]})$, i.e.,

\[
\mathbf{0} = \beta(v^{[i+1]} - v^{[i]}) + w^{[i]}\odot x^{[i+1]} + u^{[i+1]}, \tag{24}
\]

where $v^{[i+1]} = \arg\min_v G(v; v^{[i]})$, $x^{[i+1]}\in\partial\|v^{[i+1]}\|_{\mathcal{G},2}$ and $u^{[i+1]}\in N_{\mathcal{C}}(v^{[i+1]})$. Note that the existence of an optimal solution to (21) simply follows from the convexity and coercivity of the objective $G(\cdot\,; v^{[i]})$.

We now show that an optimal solution of (21) also satisfies the first-order necessary optimality condition of (20) in the following lemma.

Lemma 1 $v^{[i]}$ satisfies the first-order necessary optimality condition of (20) if and only if

\[
v^{[i]} = \arg\min_{v}\ G(v; v^{[i]}).
\]

Proof Please refer to Appendix A).

Define the model reduction achieved by $v^{[i+1]}$ at a point $v^{[i]}$ as

\[
\Delta G(v^{[i+1]}; v^{[i]}) := G(v^{[i]}; v^{[i]}) - G(v^{[i+1]}; v^{[i]}). \tag{25}
\]

The new iterate $v^{[i+1]}$ causes a decrease in the objective $J(v)$, and the model reduction (25) converges to zero in the limit; both results are revealed in the following Lemma 2.

Lemma 2 Suppose $\{v^{[i]}\}_{i=0}^{\infty}$ is generated by Algorithm 1 with $v^{[0]}\in\mathcal{C}$. The following statements hold true.
(i) $J(v^{[i+1]}) - J(v^{[i]}) \le G(v^{[i+1]}; v^{[i]}) - G(v^{[i]}; v^{[i]}) \le 0$.
(ii) $\lim_{i\to\infty} \Delta G(v^{[i+1]}; v^{[i]}) = 0$.
(iii) The subproblem objective is sufficiently decreased at each iteration; indeed,

\[
\Delta G(v^{[i+1]}; v^{[i]}) \ge \frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2.
\]

Proof Please refer to Appendix B).

We now provide our main result in the following Theorem 2.

Theorem 2 Suppose $\{v^{[i]}\}_{i=0}^{\infty}$ is generated by Algorithm 1 with $v^{[0]}\in\mathcal{C}$. Then $\{v^{[i]}\}$ must be bounded, and any cluster point of $\{v^{[i]}\}$ satisfies the first-order necessary optimality condition of (20).

Proof Please refer to Appendix C).

B. Ergodic Worst-Case Convergence Rate

In this subsection, we show that the presented proximal iteratively reweighted algorithm has an O(1/t) ergodic worst-case convergence rate in terms of the optimality residual. The following lemma states that the optimality residual is bounded above by the displacement of the iterates.

Lemma 3 The optimality residual associated with problem (20) satisfies

\[
\|r^{[i+1]}\|_2^2 \le (\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4)\,\|v^{[i]} - v^{[i+1]}\|_2^2
\]

with $\kappa = \rho_{\max}$, where $\rho_{\max}$ denotes the maximum element among $[\rho_{nk}]$ over all $n\in\mathcal{N}$, $k\in\mathcal{K}$.


Figure 3 Convergence of the proximal iteratively reweighted algorithm for the log-sum minimization problem ($\Omega(v)$ in log scale versus the number of iterations)

Proof Please refer to Appendix D).

The subproblem (21) is referred to as the primal problem, and by exploiting the conjugate function [46], the associated Fenchel-Rockafellar dual is constructed as

\[
\begin{aligned}
\max_{\lambda,\mu}\quad & Q(\lambda,\mu; v^{[i]}) \\
\mathrm{s.t.}\quad & \|\lambda_{nk}\|_2 \le 1,\ \forall n\in\mathcal{N},\ k\in\mathcal{K},
\end{aligned} \tag{26}
\]

where the dual objective is given as $Q(\lambda,\mu; v^{[i]}) = -\frac{1}{2\beta}\|\lambda\odot w^{[i]} + \mu - \beta v^{[i]}\|_2^2 + \frac{\beta}{2}\|v^{[i]}\|_2^2 - \delta^*_{\mathcal{C}}(\mu)$; the technical details of the construction of (26) are provided in Appendix E).

The Fenchel-Rockafellar duality theorem [46] states that the optimal value of (26) provides a lower bound on the optimal value of (21). Moreover, the gap between the primal objective value of (21) and the corresponding dual objective value of (26) at the $i$th iterate is defined as

\[
g(v,\lambda,\mu; v^{[i]}) := G(v; v^{[i]}) - Q(\lambda,\mu; v^{[i]}). \tag{27}
\]

If this gap $g(v,\lambda,\mu; v^{[i]})$ is zero, then strong duality holds. That is, at the optimal solution $(v^{[i+1]}; \lambda^{[i+1]}, \mu^{[i+1]})$, we have

\[
\Delta G(v^{[i+1]}; v^{[i]}) = g(v^{[i+1]}, \lambda^{[i+1]}, \mu^{[i+1]}; v^{[i]}). \tag{28}
\]

We now show that the duality gap vanishes asymptotically in the following theorem.

Theorem 3 Let $\{v^{[i]}\}_{i=0}^{\infty}$ be the sequence generated by Algorithm 1 with $v^{[0]}\in\mathcal{C}$. Then $\|r^{[i+1]}\|_2^2$ has an O(1/t) ergodic worst-case convergence rate.

Proof Please refer to Appendix F).
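For intuition, the rate in Theorem 3 follows by chaining Lemma 3 with Lemma 2(iii) and the telescoping bound (34); a sketch of the resulting inequality chain (the full argument is in Appendix F)) reads

\[
\min_{0\le i<t}\ \|r^{[i+1]}\|_2^2
\;\le\; \frac{1}{t}\sum_{i=0}^{t-1}\|r^{[i+1]}\|_2^2
\;\le\; \frac{C}{t}\sum_{i=0}^{t-1}\|v^{[i]}-v^{[i+1]}\|_2^2
\;\le\; \frac{2C}{\beta t}\sum_{i=0}^{t-1}\Delta G(v^{[i+1]};v^{[i]})
\;\le\; \frac{2C}{\beta t}\big(J(v^{[0]})-\underline{J}\big)
\;=\; O(1/t),
\]

where $C = \beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4$ is the constant of Lemma 3 and $\underline{J}$ is the lower bound of $J(\cdot)$ appearing in (34).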

V. NUMERICAL EXPERIMENTS

In this section, we use numerical experiments to validate the effectiveness of our proposed algorithms and illustrate the presented theoretical results. We compare the log-sum function based three-stage GSBF approach with the coordinated beamforming (CB) approach [47] and the mixed $\ell_{1,2}$ GSBF beamforming approach [29] (Mixed $\ell_{1,2}$ GSBF). These two baselines are listed below.

• The CB approach minimizes the total transmit power consumption only. In other words, all BSs are required to perform the inference tasks from all MUs.

• The Mixed $\ell_{1,2}$ GSBF approach adopts the mixed $\ell_{1,2}$-norm (i.e., the objective function in (13) is replaced with $\sum_{n=1}^{N}\sum_{k=1}^{K}\rho_{nk}\|v_{nk}\|_2$) to induce group sparsity of the beamforming vector in Stage 1 of Algorithm 2.

As for the experimental set-up, we consider an edge AI inference system with 8 two-antenna BSs and 15 single-antenna MUs, all uniformly and independently distributed in a $[-0.5, 0.5]$ km $\times$ $[-0.5, 0.5]$ km square region. The channel between BS $n$ and MU $k$ is set as $h_{nk} = 10^{-L(d_{nk})/20}\,\xi_{nk}$, where the path-loss model is given by $L(d_{nk}) = 128.1 + 37.6\lg d_{nk}$, $d_{nk}$ is the Euclidean distance between BS $n$ and MU $k$, and $\xi_{nk}$ is the small-scale fading coefficient, i.e., $\xi_{nk}\sim\mathcal{CN}(\mathbf{0}, I)$. We set $P^c_{nk} = 0.45$ W and specify $P^{\max}_n = 1$ W, $\eta_n = 25\%$ and $\sigma_k^2 = 1$. Furthermore, for the proposed log-sum function based three-stage GSBF approach, we set $p = 100$, $\beta = 0.1$ and initialize $w^{[0]} = \mathbf{1}$. In particular, we terminate the proximal iteratively reweighted algorithm either when it hits the predefined maximum number of iterations IterMax = 25 or when it satisfies

\[
\|w^{[i+1]} - w^{[i]}\|_1 \le \varepsilon, \tag{29}
\]

where $\varepsilon = 10^{-5}$ is a prescribed tolerance.
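As a reproducibility aid, the set-up above can be generated as follows (a sketch with our own variable names; the random seed and the 4 dB example target SINR are arbitrary choices, not values from the paper).

import numpy as np

rng = np.random.default_rng(0)
N, K, L = 8, 15, 2                           # 8 two-antenna BSs, 15 single-antenna MUs
bs_pos = rng.uniform(-0.5, 0.5, (N, 2))      # positions in the square region (km)
mu_pos = rng.uniform(-0.5, 0.5, (K, 2))
d = np.linalg.norm(bs_pos[:, None] - mu_pos[None], axis=-1)   # distances d_nk (km)
path_loss_db = 128.1 + 37.6 * np.log10(d)                     # L(d_nk)
xi = (rng.standard_normal((N, K, L))
      + 1j * rng.standard_normal((N, K, L))) / np.sqrt(2)     # xi_nk ~ CN(0, I)
H = (10 ** (-path_loss_db / 20))[..., None] * xi              # h_nk = 10^(-L/20) xi_nk
P_c = np.full((N, K), 0.45)                  # computation power P^c_nk (W)
P_max = np.ones(N)                           # per-BS power budget (W)
eta = np.full(N, 0.25)                       # PA efficiency
sigma2 = np.ones(K)                          # noise power
gamma = np.full(K, 10 ** (4 / 10))           # e.g., a 4 dB target SINR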

A. Convergence of the Proximal Iteratively Reweighted Algorithm

The goal of this subsection is to illustrate the convergence behavior of the proposed proximal iteratively reweighted algorithm. The presented result is obtained for a typical channel realization. Fig. 3 illustrates the convergence of the proximal iteratively reweighted algorithm. We can see that $\Omega(v)$ steadily decreases along the iterations, which is consistent with our analysis in Lemma 2. Interestingly, we observe that the objective value $\Omega(v)$ drops quickly within the first few iterations (fewer than 5), which indicates that the proposed proximal iteratively reweighted algorithm converges very fast. In view of this, we may suggest terminating Algorithm 1 early in practice to obtain an approximate solution that speeds up the entire algorithm while maintaining the overall performance.

B. Effectiveness of the Proposed Approach

We evaluate the performance of the three algorithms in terms of the overall network power consumption, the transmit power consumption and the number of performed computation tasks. The presented results are averaged over 100 randomly and independently generated channel realizations.


Figure 4 Average total network power consumption (W) versus target SINR (dB) for the three approaches (Proposed, Mixed $\ell_{1,2}$ GSBF, and Coordinated beamforming) in the edge AI inference system

Fig. 4 depicts the overall network power consumption of the three approaches with different target SINRs. First, we observe that all three approaches incur higher total power consumption as the required SINR becomes more stringent. This is because more edge BSs are required to transmit the inference results for a higher QoS. In addition, the CB approach has the highest power consumption among the three approaches, and the relative power difference between CB and the other two approaches reaches approximately 67% when the target SINR is 2 dB and approximately 18% when it is 8 dB, indicating the effectiveness of the joint task selection and group sparse beamforming approach in minimizing the overall network power consumption. On the other hand, the proposed log-sum function based three-stage GSBF approach outperforms the mixed $\ell_{1,2}$ GSBF approach, which demonstrates that enhancing group sparsity further reduces the overall network power consumption. In particular, the performance gap between the two approaches remains at approximately 9% as the target SINR ranges from 4 dB to 8 dB, which indicates that the proposed log-sum function based three-stage GSBF approach remains attractive in the high-SINR regime.

Table 1 The average number of performed inference tasks with different approaches

Target SINR (dB)   Proposed   Mixed ℓ1,2 GSBF   CB
0                  16.49      17.79             120.00
2                  21.00      22.50             120.00
4                  30.08      34.47             120.00
6                  43.57      52.17             120.00
8                  67.76      77.36             120.00

Table 2 The total transmit power consumption (W) with different approaches

Target SINR (dB)   Proposed   Mixed ℓ1,2 GSBF   CB
0                  4.16       4.68              0.41
2                  9.12       8.31              0.88
4                  12.24      11.76             1.99
6                  15.81      14.70             4.57
8                  18.81      18.58             10.79

Tabs. 1 and 2 further report the number of inference tasks performed by the edge BSs and the transmission power consumption, respectively. To be specific, in Tab. 1 we observe that the number of performed inference tasks differs among the three approaches under various SINRs, which demonstrates the role of the task selection strategy. Besides, the log-sum function based three-stage GSBF approach always performs fewer inference tasks than the mixed $\ell_{1,2}$ GSBF approach across the target SINRs, which indicates that the log-sum function based three-stage GSBF approach enhances the group sparsity pattern of the beamforming vector. Meanwhile, as observed in Tab. 2, the CB approach has the lowest transmission power among the three approaches because it only optimizes the transmission power while performing all inference tasks. On the other hand, the transmission power consumption of the log-sum function based three-stage GSBF approach is slightly higher than that of the mixed $\ell_{1,2}$ GSBF approach under most SINRs. This is because more edge BSs participate in performing inference tasks in the mixed $\ell_{1,2}$ GSBF approach, resulting in a higher transmit beamforming gain that reduces the transmission power. In other words, performing fewer inference tasks further reduces the computation power consumption of the edge BSs but increases the transmission power consumption. Viewing Fig. 4 and Tabs. 1 and 2 together, the proposed joint task selection and GSBF approach finds a good balance between computation and transmission power consumption, yielding the lowest network power consumption.

VI. CONCLUSION

In this paper, we developed an energy-efficient edge AI inference system through the joint selection of inference tasks and optimization of the transmit beamforming vectors, minimizing the computation power consumption and the downlink transmission power consumption, respectively. Based on the critical insight that inference task selection can be achieved by controlling the group sparsity structure of the transmit beamforming vectors, we developed a group sparse optimization framework for network power minimization, for which a log-sum function based three-stage group sparse beamforming algorithm was developed to enhance the group sparsity of the solutions. To solve the resulting nonconvex and nonsmooth log-sum function minimization problem, we further proposed a proximal iteratively reweighted algorithm. Furthermore, the global convergence analysis was provided, and a worst-case O(1/t) convergence rate in an ergodic sense was derived for this algorithm.

APPENDIX

A) Proof of Lemma 1 Let $x^{[i]}\in\partial\|v^{[i]}\|_{\mathcal{G},2}$ and $u^{[i]}\in N_{\mathcal{C}}(v^{[i]})$. If $v^{[i]} = \arg\min_v G(v; v^{[i]})$, then by (24) we have

\[
\mathbf{0} = w^{[i]}\odot x^{[i]} + u^{[i]}. \tag{30}
\]

We conclude that $v^{[i]}$ satisfies (22), indicating that $v^{[i]}$ is first-order optimal for (20).

Conversely, if $v^{[i]}$ satisfies (22), then $v^{[i]}$ satisfies (30) by Proposition 2. Thus $v^{[i]}$ must be the optimal solution to subproblem (15). This completes the proof.

B) Proof of Lemma 2 First, $v^{[i+1]} = \arg\min_v G(v; v^{[i]})$ and $G(v; v^{[i]})$ is convex, so that $G(v^{[i+1]}; v^{[i]}) - G(v^{[i]}; v^{[i]}) \le 0$. Since $f_p(\cdot)$ is concave, we have

\[
f_p(z) \le f_p(z_0) + \langle\nabla f_p(z_0),\, z - z_0\rangle,\quad \forall z, z_0 \in \mathbb{R}^L_+. \tag{31}
\]

Therefore,

\[
\begin{aligned}
J(v^{[i+1]}) = \Omega(v^{[i+1]}) &\le \Omega(v^{[i]}) + \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\big(\|v^{[i+1]}_{nk}\|_2 - \|v^{[i]}_{nk}\|_2\big) + \frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2 \\
&= J(v^{[i]}) + G(v^{[i+1]}; v^{[i]}) - G(v^{[i]}; v^{[i]}), \tag{32}
\end{aligned}
\]

where the first inequality follows from (31). This completes the first statement (i).

On the other hand, by (32), we have

\[
\Delta G(v^{[i+1]}; v^{[i]}) \le J(v^{[i]}) - J(v^{[i+1]}). \tag{33}
\]

Summing both sides of (33) over $i = 0,\dots,t$ yields

\[
0 \le \sum_{i=0}^{t} \Delta G(v^{[i+1]}; v^{[i]}) \le J(v^{[0]}) - J(v^{[t+1]}) \le J(v^{[0]}) - \underline{J}, \tag{34}
\]

where $\underline{J}$ is the lower bound of $J(\cdot)$. Letting $t\to\infty$, we have

\[
\lim_{i\to\infty} \Delta G(v^{[i+1]}; v^{[i]}) = 0. \tag{35}
\]

This completes the second statement (ii).

For the last statement (iii), we are inspired by the proof line presented in Ref. [37]. By (24), we have

\[
\mathbf{0} = \beta(v^{[i+1]} - v^{[i]}) + x^{[i+1]} + u^{[i+1]}, \tag{36}
\]

where $x^{[i+1]}\in\partial\langle w^{[i]}, \|v^{[i+1]}\|_{\mathcal{G},2}\rangle$ and $u^{[i+1]}\in N_{\mathcal{C}}(v^{[i+1]})$. Taking the inner product with $(v^{[i+1]} - v^{[i]})$ on both sides of (36) yields

\[
0 = \beta\|v^{[i+1]} - v^{[i]}\|_2^2 + \langle x^{[i+1]},\, v^{[i+1]} - v^{[i]}\rangle + \langle u^{[i+1]},\, v^{[i+1]} - v^{[i]}\rangle. \tag{37}
\]

By the definition of the subgradient of a convex function, we have

\[
\langle w^{[i]},\, \|v^{[i]}\|_{\mathcal{G},2} - \|v^{[i+1]}\|_{\mathcal{G},2}\rangle \ge \langle x^{[i+1]},\, v^{[i]} - v^{[i+1]}\rangle = \langle u^{[i+1]},\, v^{[i+1]} - v^{[i]}\rangle + \beta\|v^{[i+1]} - v^{[i]}\|_2^2, \tag{38}
\]

where the equality follows from (37).

Therefore,

\[
\begin{aligned}
G(v^{[i]}; v^{[i]}) - G(v^{[i+1]}; v^{[i]}) &= \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\big(\|v^{[i]}_{nk}\|_2 - \|v^{[i+1]}_{nk}\|_2\big) - \frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2 \\
&= \langle w^{[i]},\, \|v^{[i]}\|_{\mathcal{G},2} - \|v^{[i+1]}\|_{\mathcal{G},2}\rangle - \frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2 \\
&\ge \langle u^{[i+1]},\, v^{[i+1]} - v^{[i]}\rangle + \frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2 \\
&= \frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2 - \langle u^{[i+1]},\, v^{[i]} - v^{[i+1]}\rangle \\
&\ge \frac{\beta}{2}\|v^{[i]} - v^{[i+1]}\|_2^2, \tag{39}
\end{aligned}
\]

where the first inequality is obtained by (38) and the lastinequality holds since u[i+1] ∈ NC (v

[i+1]), completing theproof.C) Proof of Theorem 2 By Lemma 2, the sequenceJ(v[i]) is monotonically decreasing. Since J(·) is coer-cive, we conclude that the sequence v[i] is bounded. Weconclude that v[i] must have cluster points. Let v∗ be acluster point of v[i]. From Lemma 1, it suffices to showthat v∗ = argminvG(v;v∗). We prove this by contradic-tion. Assume that there exists a point v such that ε :=G(v∗;v∗)−G(v;v∗) > 0. Suppose v[i]S → v∗, S ⊂ N.Based on Lemma 2, there must exist k1 > 0 such that for allk > k1,k ∈S
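To see where the reweighting comes from, consider as a concrete instance the log-sum surrogate $f_\theta(z) = \sum_{l}\log(\theta + z_l)$ (our illustrative parameterization; the paper's $f_p$ and its smoothing parameter may be scaled differently). Concavity of the logarithm gives exactly an inequality of the form (31):
$$\log(\theta + z_l) \leqslant \log(\theta + z_{0,l}) + \frac{z_l - z_{0,l}}{\theta + z_{0,l}}, \quad \forall z_l, z_{0,l} \geqslant 0,$$
so the linearization slope evaluated at the current block norms, $w^{[i]}_{nk} = 1/(\theta + \|v^{[i]}_{nk}\|_2)$, is precisely the reweighting coefficient: each iteration of the proximal iteratively reweighted algorithm minimizes a convex majorant of $J$ that is tight at $v^{[i]}$, which is why the descent chain (32) goes through.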

C) Proof of Theorem 2  By Lemma 2, the sequence $J(v^{[i]})$ is monotonically decreasing. Since $J(\cdot)$ is coercive, the sequence $v^{[i]}$ is bounded and thus must have cluster points. Let $v^*$ be a cluster point of $v^{[i]}$. From Lemma 1, it suffices to show that $v^* = \arg\min_{v} G(v; v^*)$. We prove this by contradiction. Assume that there exists a point $\bar{v}$ such that $\varepsilon := G(v^*; v^*) - G(\bar{v}; v^*) > 0$. Suppose the subsequence $v^{[i_j]} \to v^*$, $\{i_j\} \subset \mathbb{N}$. Based on Lemma 2, there must exist $j_1 > 0$ such that for all $j > j_1$,
$$G(v^{[i_j]}; v^{[i_j]}) - G(v^{[i_j+1]}; v^{[i_j]}) \leqslant \varepsilon/4. \tag{40}$$
Note that $v^{[i_j]}_{nk} \to v^*_{nk}$ and $w^{[i_j]}_{nk} \to w^*_{nk}$, so there exists $j_2$ such that for all $j > j_2$,
$$\sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i_j]}_{nk}\|v^{[i_j]}_{nk}\|_2 - w^*_{nk}\|v^*_{nk}\|_2 \geqslant -\varepsilon/10,$$
and
$$\frac{\beta}{2}\|\bar{v} - v^*\|_2^2 + \sum_{n,k}\left(w^*_{nk} - w^{[i_j]}_{nk}\right)\|\bar{v}_{nk}\|_2 - \frac{\beta}{2}\|\bar{v} - v^{[i_j]}\|_2^2 \geqslant -\varepsilon/10.$$
Therefore, for all $j > j_2$,
$$\begin{aligned} G(v^*; v^*) - G(\bar{v}; v^{[i_j]}) &= \sum_{n=1}^{N}\sum_{k=1}^{K} w^*_{nk}\|v^*_{nk}\|_2 - \frac{\beta}{2}\|\bar{v} - v^{[i_j]}\|_2^2 - \sum_{n=1}^{N}\sum_{k=1}^{K}\left(w^*_{nk} - (w^*_{nk} - w^{[i_j]}_{nk})\right)\|\bar{v}_{nk}\|_2 \\ &= \left[G(v^*; v^*) - G(\bar{v}; v^*)\right] + \frac{\beta}{2}\|\bar{v} - v^*\|_2^2 + \sum_{n=1}^{N}\sum_{k=1}^{K}\left(w^*_{nk} - w^{[i_j]}_{nk}\right)\|\bar{v}_{nk}\|_2 - \frac{\beta}{2}\|\bar{v} - v^{[i_j]}\|_2^2 \\ &\geqslant \varepsilon - \varepsilon/10 = 9\varepsilon/10, \end{aligned} \tag{41}$$
and that
$$G(v^{[i_j]}; v^{[i_j]}) - G(v^*; v^*) = \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i_j]}_{nk}\|v^{[i_j]}_{nk}\|_2 - \sum_{n=1}^{N}\sum_{k=1}^{K} w^*_{nk}\|v^*_{nk}\|_2 \geqslant -\varepsilon/10. \tag{42}$$
Hence, for all $j > \max(j_1, j_2)$, it holds that
$$G(v^{[i_j]}; v^{[i_j]}) - G(\bar{v}; v^{[i_j]}) = G(v^{[i_j]}; v^{[i_j]}) - G(v^*; v^*) + G(v^*; v^*) - G(\bar{v}; v^{[i_j]}) \geqslant -\varepsilon/10 + 9\varepsilon/10 = 4\varepsilon/5, \tag{43}$$
contradicting (40), since $G(\bar{v}; v^{[i_j]}) \geqslant G(v^{[i_j+1]}; v^{[i_j]})$ by the optimality of $v^{[i_j+1]}$. Hence, $v^* = \arg\min_{v} G(v; v^*)$. By Lemma 1, $v^*$ satisfies the first-order necessary optimality condition for (20). This completes the proof.
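The descent guarantees above are easy to observe numerically. The sketch below (again unconstrained, with our assumed log-sum surrogate $J(v) = \sum_{n,k}\log(\theta + \|v_{nk}\|_2)$, so the minimizer is trivially $v = 0$; the point is only to watch Lemma 2's monotone descent and the vanishing proximal steps behind Theorem 2) runs the proximal iteratively reweighted iteration and asserts that $J$ never increases.

import numpy as np

# Unconstrained toy check of Lemma 2 (monotone descent) and the vanishing
# steps used in Theorem 2. Assumed surrogate: J(v) = sum_g log(theta + ||v_g||).
rng = np.random.default_rng(1)
theta, beta = 1e-1, 2.0
v = rng.standard_normal((6, 4))                 # 6 groups of 4 coordinates

def J(v):
    return float(np.sum(np.log(theta + np.linalg.norm(v, axis=1))))

prev = J(v)
for i in range(30):
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    w = 1.0 / (theta + norms)                   # reweighting (majorization slope)
    shrink = np.maximum(0.0, 1.0 - w / (beta * np.maximum(norms, 1e-12)))
    v_next = shrink * v                         # block soft-thresholding prox step
    cur = J(v_next)
    assert cur <= prev + 1e-12                  # Lemma 2 (i): J(v[i+1]) <= J(v[i])
    print(f"i={i:2d}  J={cur:.4f}  step={np.linalg.norm(v_next - v):.2e}")
    prev, v = cur, v_next

The printed step sizes $\|v^{[i+1]} - v^{[i]}\|_2$ decay to zero, matching statement (iii) of Lemma 2: every accumulated decrease of $J$ pays for at least $\frac{\beta}{2}\|v^{[i+1]} - v^{[i]}\|_2^2$.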

D) Proof of Lemma 3  Recall the first-order necessary optimality condition (24) of subproblem (15). Rearranging terms, we have
$$v^{[i]} - v^{[i+1]} = \frac{1}{\beta}\left(w^{[i]} x^{[i+1]} + u^{[i+1]}\right), \tag{44}$$
where $x^{[i+1]} \in \partial\|v^{[i+1]}\|_{\mathcal{G},2}$ and $u^{[i+1]} \in \mathcal{N}_{\mathcal{C}}(v^{[i+1]})$. Squaring both sides of (44), we have
$$\begin{aligned} \|v^{[i]} - v^{[i+1]}\|_2^2 &= \frac{1}{\beta^2}\|w^{[i]} x^{[i+1]} + u^{[i+1]}\|_2^2 \\ &= \frac{1}{\beta^2}\|(w^{[i]} - w^{[i+1]}) x^{[i+1]} + r^{[i+1]}\|_2^2 \\ &= \frac{1}{\beta^2}\|(w^{[i]} - w^{[i+1]}) x^{[i+1]}\|_2^2 + \frac{1}{\beta^2}\|r^{[i+1]}\|_2^2 + \frac{2}{\beta^2}\langle (w^{[i]} - w^{[i+1]}) x^{[i+1]}, r^{[i+1]}\rangle \\ &= \frac{1}{\beta^2}\|(w^{[i]} - w^{[i+1]}) x^{[i+1]}\|_2^2 + \frac{1}{\beta^2}\|r^{[i+1]}\|_2^2 - \frac{2}{\beta^2}\langle (w^{[i]} - w^{[i+1]}) x^{[i+1]}, (w^{[i]} - w^{[i+1]}) x^{[i+1]} + \beta(v^{[i+1]} - v^{[i]})\rangle \\ &= -\frac{1}{\beta^2}\|(w^{[i]} - w^{[i+1]}) x^{[i+1]}\|_2^2 + \frac{1}{\beta^2}\|r^{[i+1]}\|_2^2 - \frac{2}{\beta}\langle (w^{[i]} - w^{[i+1]}) x^{[i+1]}, v^{[i+1]} - v^{[i]}\rangle, \end{aligned} \tag{45}$$
where we exploit $u^{[i+1]} = \beta(v^{[i]} - v^{[i+1]}) - w^{[i]} x^{[i+1]}$ to obtain the fourth equality. Then we have
$$\begin{aligned} \|r^{[i+1]}\|_2^2 &= \beta^2\|v^{[i]} - v^{[i+1]}\|_2^2 + \|(w^{[i]} - w^{[i+1]}) x^{[i+1]}\|_2^2 + 2\beta\langle (w^{[i]} - w^{[i+1]}) x^{[i+1]}, v^{[i+1]} - v^{[i]}\rangle \\ &\leqslant \beta^2\|v^{[i]} - v^{[i+1]}\|_2^2 + \|w^{[i]} - w^{[i+1]}\|_2^2 + 2\beta\|(w^{[i]} - w^{[i+1]}) x^{[i+1]}\|_2\|v^{[i]} - v^{[i+1]}\|_2 \\ &\leqslant \beta^2\|v^{[i]} - v^{[i+1]}\|_2^2 + \|w^{[i]} - w^{[i+1]}\|_2^2 + 2\beta\|w^{[i]} - w^{[i+1]}\|_2\|v^{[i]} - v^{[i+1]}\|_2 \\ &\leqslant \beta^2\|v^{[i]} - v^{[i+1]}\|_2^2 + \kappa^2 p^4\|v^{[i]} - v^{[i+1]}\|_2^2 + 2\beta\kappa p^2\|v^{[i]} - v^{[i+1]}\|_2^2 \\ &= (\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4)\|v^{[i]} - v^{[i+1]}\|_2^2, \end{aligned} \tag{46}$$
where the last inequality holds since $f_p'(\cdot)$ is $p^2$-Lipschitz continuous and $\kappa = \rho_{\max}$, which together give $\|w^{[i]} - w^{[i+1]}\|_2 \leqslant \kappa p^2\|v^{[i]} - v^{[i+1]}\|_2$. This completes the proof.

E) Construction of Fenchel-Rockafellar Dual (26)  This construction of the dual program owes to the conjugate function [46]. We first rewrite the objective $G(v; v^{[i]})$ by exploiting the conjugate function as
$$G(v; v^{[i]}) = \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\left[\sup_{\lambda_{nk}} \langle \lambda_{nk}, v_{nk}\rangle - \delta_{\mathcal{B}}(\lambda_{nk})\right] + \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \sup_{\mu}\left[\langle \mu, v\rangle - \delta^*_{\mathcal{C}}(\mu)\right], \tag{47}$$
where $\mathcal{B} := \{y \in \mathbb{C}^L \mid \|y\|_2 \leqslant 1\}$ denotes the dual norm unit ball of $\|\cdot\|_2$. Then the primal subproblem (21) is simply
$$\inf_{v} G(v; v^{[i]}) = \inf_{v}\sup_{\lambda,\mu}\ \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\left[\langle \lambda_{nk}, v_{nk}\rangle - \delta_{\mathcal{B}}(\lambda_{nk})\right] + \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \langle \mu, v\rangle - \delta^*_{\mathcal{C}}(\mu). \tag{48}$$
Swapping the order of inf and sup gives the corresponding Lagrangian dual, i.e.,
$$\sup_{\lambda,\mu}\inf_{v}\ \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\left[\langle \lambda_{nk}, v_{nk}\rangle - \delta_{\mathcal{B}}(\lambda_{nk})\right] + \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \langle \mu, v\rangle - \delta^*_{\mathcal{C}}(\mu). \tag{49}$$
The dual objective at $v^{[i]}$ is then given by
$$Q(\lambda, \mu; v^{[i]}) = \inf_{v}\ \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\left[\langle \lambda_{nk}, v_{nk}\rangle - \delta_{\mathcal{B}}(\lambda_{nk})\right] + \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \langle \mu, v\rangle - \delta^*_{\mathcal{C}}(\mu). \tag{50}$$
By exploiting the first-order necessary optimality condition of (50), we have
$$\sup_{\lambda,\mu}\ -\frac{1}{2\beta}\|\lambda w^{[i]} + \mu - \beta v^{[i]}\|_2^2 + \frac{\beta}{2}\|v^{[i]}\|_2^2 - \delta^*_{\mathcal{C}}(\mu) - \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\delta_{\mathcal{B}}(\lambda_{nk}), \tag{51}$$
which is consistent with the Fenchel-Rockafellar dual presented in (26). This completes the construction.
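As a sanity check on the passage from (50) to (51) (this verification is ours, not part of the original appendix), write $c := \lambda w^{[i]} + \mu$ for the vector with blocks $w^{[i]}_{nk}\lambda_{nk} + \mu_{nk}$. The inner infimum in (50) is then an unconstrained quadratic in $v$:
$$\inf_{v}\ \frac{\beta}{2}\|v - v^{[i]}\|_2^2 + \langle c, v\rangle = \langle c, v^{[i]}\rangle - \frac{1}{2\beta}\|c\|_2^2 = -\frac{1}{2\beta}\|c - \beta v^{[i]}\|_2^2 + \frac{\beta}{2}\|v^{[i]}\|_2^2,$$
attained at $v = v^{[i]} - c/\beta$. Substituting $c = \lambda w^{[i]} + \mu$ and carrying along the $-\delta^*_{\mathcal{C}}(\mu)$ and $-\sum_{n,k} w^{[i]}_{nk}\delta_{\mathcal{B}}(\lambda_{nk})$ terms recovers (51).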

F) Proof of Theorem 3  We evaluate $g(v, \lambda, \mu; v^{[i]})$ at $v^{[i]}$ with $\lambda^{[i+1]} \in \partial\|v^{[i+1]}\|_{\mathcal{G},2}$ and $\mu^{[i+1]} \in \mathcal{N}_{\mathcal{C}}(v^{[i+1]})$. Then, we have
$$\begin{aligned} g(v^{[i]}, \lambda^{[i+1]}, \mu^{[i+1]}; v^{[i]}) &= G(v^{[i]}; v^{[i]}) - Q(\lambda^{[i+1]}, \mu^{[i+1]}; v^{[i]}) \\ &= \sum_{n=1}^{N}\sum_{k=1}^{K} w^{[i]}_{nk}\|v^{[i]}_{nk}\|_2 + \delta_{\mathcal{C}}(v^{[i]}) - \frac{\beta}{2}\|v^{[i]}\|_2^2 + \frac{1}{2\beta}\|\lambda^{[i+1]} w^{[i]} + \mu^{[i+1]} - \beta v^{[i]}\|_2^2 + \delta^*_{\mathcal{C}}(\mu^{[i+1]}) \\ &\geqslant \frac{1}{2\beta}\|w^{[i]}\lambda^{[i+1]} + \mu^{[i+1]}\|_2^2 + \langle w^{[i]}, \|v^{[i]}\|_{\mathcal{G},2} - v^{[i]}\lambda^{[i+1]}\rangle + \langle \mu^{[i+1]}, v^{[i]} - v^{[i]}\rangle \\ &\geqslant \frac{1}{2\beta}\|w^{[i]}\lambda^{[i+1]} + \mu^{[i+1]}\|_2^2, \end{aligned} \tag{52}$$
where the first inequality holds due to the Fenchel-Young inequality and the second inequality is obtained since $\|\lambda^{[i+1]}_{nk}\|_2 \leqslant 1$ makes $\|v^{[i]}\|_{\mathcal{G},2} - v^{[i]}\lambda^{[i+1]} \geqslant 0$.

Next, by making use of (24) to replace $\mu^{[i+1]}$ with $\beta(v^{[i]} - v^{[i+1]}) - w^{[i]}\lambda^{[i+1]}$ in (52), we have
$$\begin{aligned} \frac{1}{2\beta}\|w^{[i]}\lambda^{[i+1]} + \mu^{[i+1]}\|_2^2 &= \frac{1}{2\beta}\|w^{[i+1]}\lambda^{[i+1]} + \mu^{[i+1]} + (w^{[i]} - w^{[i+1]})\lambda^{[i+1]}\|_2^2 \\ &= \frac{1}{2\beta}\left(\|r^{[i+1]}\|_2^2 + \|(w^{[i]} - w^{[i+1]})\lambda^{[i+1]}\|_2^2 + 2\langle r^{[i+1]}, (w^{[i]} - w^{[i+1]})\lambda^{[i+1]}\rangle\right) \\ &= \frac{1}{2\beta}\|r^{[i+1]}\|_2^2 - \frac{1}{2\beta}\|(w^{[i]} - w^{[i+1]})\lambda^{[i+1]}\|_2^2 - \langle v^{[i+1]} - v^{[i]}, (w^{[i]} - w^{[i+1]})\lambda^{[i+1]}\rangle \\ &\geqslant \frac{1}{2\beta}\|r^{[i+1]}\|_2^2 - \|v^{[i]} - v^{[i+1]}\|_2\|w^{[i]} - w^{[i+1]}\|_2 - \frac{1}{2\beta}\|w^{[i]} - w^{[i+1]}\|_2^2, \end{aligned} \tag{53}$$
where we exploit the Cauchy-Bunyakovskii-Schwarz (CBS) inequality to obtain the last inequality.

On the other hand, by strong duality, we have
$$g(v^{[i]}, \lambda^{[i+1]}, \mu^{[i+1]}; v^{[i]}) = G(v^{[i]}; v^{[i]}) - G(v^{[i+1]}; v^{[i]}). \tag{54}$$
Combining (52), (53) and (54), we have
$$\begin{aligned} \|r^{[i+1]}\|_2^2 &\leqslant 2\beta\,\Delta G(v^{[i+1]}; v^{[i]}) + \|w^{[i]} - w^{[i+1]}\|_2^2 + 2\beta\|v^{[i]} - v^{[i+1]}\|_2\|w^{[i]} - w^{[i+1]}\|_2 \\ &\leqslant 2\beta\,\Delta G(v^{[i+1]}; v^{[i]}) + (\kappa^2 p^4 + 2\beta\kappa p^2)\|v^{[i]} - v^{[i+1]}\|_2^2 \\ &\leqslant 2\beta\,\Delta G(v^{[i+1]}; v^{[i]}) + \frac{2}{\beta}(\kappa^2 p^4 + 2\beta\kappa p^2)\Delta G(v^{[i+1]}; v^{[i]}) \\ &= \frac{2}{\beta}(\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4)\Delta G(v^{[i+1]}; v^{[i]}) \\ &\leqslant \frac{2}{\beta}(\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4)\left(J(v^{[i]}) - J(v^{[i+1]})\right), \end{aligned} \tag{55}$$
where the second inequality uses the $p^2$-Lipschitz continuity of $f_p'(\cdot)$ (with $\kappa = \rho_{\max}$), the third inequality holds due to Lemma 2 (iii), and the last follows from (33).

Summing both sides of the above inequality from $i = 0$ to $t$, we have
$$C\left(J(v^{[0]}) - J(v^{[t+1]})\right) \geqslant \sum_{i=0}^{t}\|r^{[i+1]}\|_2^2 \geqslant t\min_{i=0,\cdots,t}\|r^{[i+1]}\|_2^2 \tag{56}$$
with $C = 2(\beta^2 + 2\beta\kappa p^2 + \kappa^2 p^4)/\beta$. This indicates that
$$\min_{i=0,\cdots,t}\|r^{[i+1]}\|_2^2 \leqslant \frac{C}{t}\left(J(v^{[0]}) - J(v^{[t+1]})\right) \leqslant \frac{C}{t}\left(J(v^{[0]}) - J(v^*)\right). \tag{57}$$
Hence
$$\min_{i=0,\cdots,t}\|r^{[i+1]}\|_2^2 = O\left(\frac{1}{t}\right). \tag{58}$$
This completes the proof.
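For completeness (this one-line restatement is ours), the $O(1/t)$ rate in (58) can also be read in the ergodic sense: the telescoped bound (56) controls the running average of the squared residuals, and the minimum can only be smaller,
$$\min_{i=0,\cdots,t}\|r^{[i+1]}\|_2^2 \leqslant \frac{1}{t+1}\sum_{i=0}^{t}\|r^{[i+1]}\|_2^2 \leqslant \frac{C\left(J(v^{[0]}) - J(v^*)\right)}{t+1},$$
so the best iterate among the first $t+1$ proximal steps is an $O(1/t)$-approximate stationary point of (20).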

REFERENCES

[1] HUA S, YANG X, YANG K, et al. Deep learning tasks processing in fog-RAN[C]//Proceedings of the IEEE 90th Vehicular Technology Conference. Piscataway: IEEE Press, 2019: 1-5.

[2] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.

[3] VOULODIMOS A, DOULAMIS N, DOULAMIS A, et al. Deep learning for computer vision: A brief review[J]. Computational Intelligence and Neuroscience, 2018, 2018: 1-13.

[4] C. V. N. Index. Global mobile data traffic forecast update, 2017-2022[R]. White Paper, 2019.

[5] ZHANG Q, CHENG L, BOUTABA R. Cloud computing: State-of-the-art and research challenges[J]. Journal of Internet Services and Applications, 2010, 1(1): 7-18.

[6] YAN Z S, CHEN C W. Cloud-based video communication and networking: An architectural overview[J]. Journal of Communications and Information Networks, 2016, 1(1): 27-43.

[7] CHEN J S, RAN X K. Deep learning with edge computing: A review[J]. Proceedings of the IEEE, 2019, 107(8): 1655-1674.

[8] MAO Y Y, YOU C S, ZHANG J, et al. A survey on mobile edge computing: The communication perspective[J]. IEEE Communications Surveys & Tutorials, 2017, 19(4): 2322-2358.


[9] LETAIEF K B, CHEN W, SHI Y M, et al. The roadmap to 6G: AI empowered wireless networks[J]. IEEE Communications Magazine, 2019, 57(8): 84-90.

[10] ZHANG Z Q, XIAO Y, MA Z, et al. 6G wireless networks: Vision, requirements, architecture, and key technologies[J]. IEEE Vehicular Technology Magazine, 2019, 14(3): 28-41.

[11] ZHANG J, LETAIEF K B. Mobile edge intelligence and computing for the Internet of vehicles[J]. Proceedings of the IEEE, 2020, 108(2): 246-261.

[12] HUANG Y M, XU C M, ZHANG C, et al. An overview of intelligent wireless communications using deep reinforcement learning[J]. Journal of Communications and Information Networks, 2019, 4(2): 15-29.

[13] YANG K, JIANG T, SHI Y M, et al. Federated learning via over-the-air computation[J]. IEEE Transactions on Wireless Communications, 2020, 19(3): 2022-2035.

[14] ZHOU Z, CHEN X, LI E, et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing[J]. Proceedings of the IEEE, 2019, 107(8): 1738-1762.

[15] JIANG T, YANG X, SHI Y, et al. Layer-wise deep neural network pruning via iteratively reweighted optimization[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2019: 5606-5610.

[16] REN A, ZHANG T, YE S, et al. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers[C]//Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2019: 925-938.

[17] SZE V, CHEN Y H, YANG T J, et al. Efficient processing of deep neural networks: A tutorial and survey[J]. Proceedings of the IEEE, 2017, 105(12): 2295-2329.

[18] PLASTIRAS G, TERZI M, KYRKOU C, et al. Edge intelligence: Challenges and opportunities of near-sensor machine learning applications[C]//IEEE 29th International Conference on Application-Specific Systems, Architectures and Processors. Piscataway: IEEE Press, 2018: 1-7.

[19] YANG K, SHI Y M, DING Z. Data shuffling in wireless distributed computing via low-rank optimization[J]. IEEE Transactions on Signal Processing, 2019, 67(12): 3087-3099.

[20] LI E, ZENG L K, ZHOU Z, et al. Edge AI: On-demand accelerating deep neural network inference via edge computing[J]. IEEE Transactions on Wireless Communications, 2020, 19(1): 447-457.

[21] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.

[22] HUA S, ZHOU Y, YANG K, et al. Reconfigurable intelligent surface for green edge inference[J]. arXiv:1912.00820, 2019.

[23] YANG K, SHI Y M, YU W, et al. Energy-efficient processing and robust wireless cooperative transmission for edge inference[J]. arXiv:1907.12475, 2019.

[24] GESBERT D, HANLY S, HUANG H, et al. Multi-cell MIMO cooperative networks: A new look at interference[J]. IEEE Journal on Selected Areas in Communications, 2010, 28(9): 1380-1408.

[25] LI K K, TAO M X, CHEN Z Y. Exploiting computation replication for mobile edge computing: A fundamental computation-communication tradeoff study[J]. arXiv:1903.10837, 2019.

[26] KRUGER A Y. On Frechet subdifferentials[J]. Journal of Mathematical Sciences, 2003, 116(3): 3325-3358.

[27] BACH F, JENATTON R, MAIRAL J, et al. Optimization with sparsity-inducing penalties[J]. Foundations and Trends in Machine Learning, 2012, 4(1): 1-106.

[28] JENATTON R, AUDIBERT J Y, BACH F. Structured variable selection with sparsity-inducing norms[J]. Journal of Machine Learning Research, 2011, 12(10): 2777-2824.

[29] SHI Y M, ZHANG J, LETAIEF K B. Group sparse beamforming for green cloud-RAN[J]. IEEE Transactions on Wireless Communications, 2014, 13(5): 2809-2823.

[30] DAI B B, YU W. Energy efficiency of downlink transmission strategies for cloud radio access networks[J]. IEEE Journal on Selected Areas in Communications, 2016, 34(4): 1037-1050.

[31] LIU L, LARSSON E G, YU W, et al. Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the Internet of things[J]. IEEE Signal Processing Magazine, 2018, 35(5): 88-99.

[32] SHI Y M, DONG J L, ZHANG J. Low-overhead communications in IoT networks-structured signal processing approaches[M]. Singapore: Springer, 2020.

[33] MEHANNA O, SIDIROPOULOS N D, GIANNAKIS G B. Joint multicast beamforming and antenna selection[J]. IEEE Transactions on Signal Processing, 2013, 61(10): 2660-2674.

[34] CANDES E J, WAKIN M B, BOYD S P. Enhancing sparsity by reweighted $\ell_1$ minimization[J]. Journal of Fourier Analysis and Applications, 2008, 14(5-6): 877-905.

[35] SHI Y M, CHENG J K, ZHANG J, et al. Smoothed $\ell_p$-minimization for green cloud-RAN with user admission control[J]. IEEE Journal on Selected Areas in Communications, 2016, 34(4): 1022-1036.

[36] OCHS P, DOSOVITSKIY A, BROX T, et al. On iteratively reweighted algorithms for nonsmooth nonconvex optimization in computer vision[J]. SIAM Journal on Imaging Sciences, 2015, 8(1): 331-372.

[37] LU C, WEI Y, LIN Z, et al. Proximal iteratively reweighted algorithm with multiple splitting for nonconvex sparsity optimization[C]//Proceedings of the 28th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2014: 1251-1257.

[38] SUN T, JIANG H, CHENG L Z. Global convergence of proximal iteratively reweighted algorithm[J]. Journal of Global Optimization, 2017, 68(4): 815-826.

[39] ABSIL P A, MAHONY R, ANDREWS B. Convergence of the iterates of descent methods for analytic cost functions[J]. SIAM Journal on Optimization, 2005, 16(2): 531-547.

[40] YANG T J, CHEN Y H, EMER J, et al. A method to estimate the energy consumption of deep neural networks[C]//Proceedings of the 51st Asilomar Conference on Signals, Systems, and Computers. Piscataway: IEEE Press, 2017: 1916-1920.

[41] YANG T J, CHEN Y H, SZE V. Designing energy-efficient convolutional neural networks using energy-aware pruning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 5687-5695.

[42] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2015: 1-9.

[43] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2016, 52(1): 127-138.

[44] SHEN Y F, SHI Y M, ZHANG J, et al. LORM: Learning to optimize for resource management in wireless networks with few training samples[J]. IEEE Transactions on Wireless Communications, 2020, 19(1): 665-679.

[45] WIESEL A, ELDAR Y C, SHAMAI S. Linear precoding via conic optimization for fixed MIMO receivers[J]. IEEE Transactions on Signal Processing, 2005, 54(1): 161-176.

[46] ROCKAFELLAR R T. Convex analysis[M]. Princeton: Princeton University Press, 2015.

[47] DAHROUJ H, YU W. Coordinated beamforming for the multicell multi-antenna wireless system[J]. IEEE Transactions on Wireless Communications, 2010, 9(5): 1748-1759.


ABOUT THE AUTHORS

Xiangyu Yang (S'17) received his B.S. degree in communication engineering from Zhengzhou University, Zhengzhou, China, in 2017. He is currently working towards his Ph.D. degree in computer science with the School of Information Science and Technology at ShanghaiTech University. His research interests include nonlinear optimization and machine learning.

Sheng Hua (S'19) received his B.S. degree in Internet of things engineering from Xidian University, Xi'an, China, in 2018. He is currently pursuing his graduate degree at the School of Information Science and Technology, ShanghaiTech University, Shanghai, China. His current research interests include mobile edge computing and edge artificial intelligence.

Yuanming Shi (S'13-M'15) [corresponding author] received his B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2011. He received his Ph.D. degree in electronic and computer engineering from the Hong Kong University of Science and Technology (HKUST), in 2015. Since September 2015, he has been with the School of Information Science and Technology at ShanghaiTech University, where he is currently a tenured associate professor. He visited the University of California, Berkeley, CA, USA, from October 2016 to February 2017. Dr. Shi is a recipient of the 2016 IEEE Marconi Prize Paper Award in Wireless Communications, and the 2016 Young Author Best Paper Award by the IEEE Signal Processing Society. He is an editor of IEEE Transactions on Wireless Communications. His research areas include optimization, statistics, machine learning, signal processing, and their applications to 6G, IoT, AI and FinTech.

Hao Wang (M'18) is currently an assistant professor with the School of Information Science and Technology at ShanghaiTech University, Shanghai, China. He received his B.S. and M.S. degrees in mathematics from Beihang University, Beijing, China, and his Ph.D. degree from the Industrial and Systems Engineering Department, Lehigh University, Bethlehem, PA, USA in 2015. He was with ExxonMobil Corporate Strategic Research Lab, Mitsubishi Electric Research Lab, and GroupM R&D. His research interests include nonlinear optimization and machine learning.

Jun Zhang (S'06-M'10-SM'15) received his B.Eng. degree in electronic engineering from the University of Science and Technology of China in 2004, his M.Phil. degree in information engineering from the Chinese University of Hong Kong in 2006, and his Ph.D. degree in electrical and computer engineering from the University of Texas at Austin in 2009. He is an assistant professor in the Department of Electronic and Information Engineering at the Hong Kong Polytechnic University. His research interests include wireless communications and networking, mobile edge computing and edge learning, distributed learning and optimization, and big data analytics.

Dr. Zhang co-authored the books Fundamentals of LTE (Prentice-Hall, 2010) and Stochastic Geometry Analysis of Multi-Antenna Wireless Networks (Springer, 2019). He is a co-recipient of the 2019 IEEE Communications Society & Information Theory Society Joint Paper Award, the 2016 Marconi Prize Paper Award in Wireless Communications, and the 2014 Best Paper Award for the EURASIP Journal on Advances in Signal Processing. He also received the 2016 IEEE ComSoc Asia-Pacific Best Young Researcher Award. He is an editor of IEEE Transactions on Wireless Communications, IEEE Transactions on Communications, and Journal of Communications and Information Networks.

Khaled B. Letaief is the New Bright Professor of Engineering and Chair Professor at the Hong Kong University of Science and Technology (HKUST). He is also a distinguished scientist at Peng Cheng Laboratory in Shenzhen, China. Dr. Letaief is an internationally recognized leader in wireless communications and networks with research interests in machine learning, mobile edge computing, 5G systems and beyond. In these areas, he has over 620 journal and conference papers with over 34,300 citations. He also has a large number of patents, including 11 US inventions, and has made 6 technical contributions to international standards.

He served as a consultant for different organizations including Huawei, ASTRI, ZTE, Nortel, PricewaterhouseCoopers, and Motorola. He is the founding editor-in-chief of the prestigious IEEE Transactions on Wireless Communications and has been involved in organizing many flagship international conferences.

He is the recipient of many distinguished awards including the 2019 Distinguished Research Excellence Award by the HKUST School of Engineering (highest research award), the 2019 IEEE Communications Society and Information Theory Society Joint Paper Award, the 2018 IEEE Signal Processing Society Young Author Best Paper Award, the 2017 IEEE Cognitive Networks Technical Committee Publication Award, the 2016 IEEE Marconi Prize Award in Wireless Communications, the 2011 IEEE Harold Sobol Award, and the 2010 Purdue University Outstanding Electrical and Computer Engineer Award.

Dr. Letaief is well recognized for his dedicated service to professional societies and in particular IEEE, where he has served in many leadership positions, including IEEE Communications Society Vice-President for Conferences and Vice-President for Technical Activities. He also served as President of the IEEE Communications Society (2018-19), the world's leading organization for communications professionals, with headquarters in New York City and members in 162 countries.

Since 1993, he has been with HKUST, where he has held many administrative positions, including Head of the Electronic and Computer Engineering Department and founding Director of the Huawei-HKUST Innovation Laboratory. He also served as Dean of Engineering. Under his leadership, the HKUST School of Engineering rose in international rankings from #26 in 2009 to #14 in the world in 2015, according to the QS World University Rankings.

Dr. Letaief received his B.S. degree with distinction, and his M.S. and Ph.D. degrees, in electrical engineering from Purdue University at West Lafayette, Indiana, USA.

He is a fellow of IEEE and a fellow of HKIE. He is also recognized by Thomson Reuters as an ISI Highly Cited Researcher.