



Learning Combinatorial Optimization Algorithms over Graphs
Hanjun Dai*, Elias B. Khalil*, Yuyu Zhang, Bistra Dilkina, Le Song (*equal contribution)

Georgia Institute of Technology
Email: {hanjun.dai, elias.khalil, yuyu.zhang, bdilkina, lsong}@cc.gatech.edu

Problem Statement

Implement Q-function with structure2vec

Reinforcement Learning Formulation

Experiments

Given a graph optimization problem G and a distribution D of problem instances, can we learn better greedy heuristics that generalize to unseen instances from D?

Minimum Vertex Cover

Tackling NPC problems    | Design rationale               | Example
Exact algorithms         | Tight formulations, IP solvers | CPLEX
Approximation algorithms | Worst-case guarantees          | Edge-picking
Heuristics               | Empirical performance          | Degree-greedy
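To make the "Edge-picking" entry in the table above concrete, here is a small sketch of the classic edge-picking 2-approximation for Minimum Vertex Cover; the function name and interface are illustrative additions, not taken from the poster.

```python
def mvc_two_approx(edges):
    """Classic edge-picking 2-approximation for Minimum Vertex Cover:
    scan the edges, and whenever an edge is still uncovered, add BOTH of
    its endpoints. The resulting cover is at most twice the optimum."""
    cover = set()
    for i, j in edges:
        if i not in cover and j not in cover:  # edge (i, j) is still uncovered
            cover.update((i, j))
    return cover

# On the path 0-1-2-3 this picks edges (0, 1) and (2, 3), giving the cover
# {0, 1, 2, 3}; an optimal cover {1, 2} has size 2, so the ratio here is 2.
print(mvc_two_approx([(0, 1), (1, 2), (2, 3)]))
```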

Can classical algorithms exploit the common distribution of instances?

                       | Minimum Vertex Cover (MVC)               | Maximum Cut (MAXCUT)                 | Traveling Salesman Problem (TSP)
Synthetic data
  Random graphs        | Erdos-Renyi (ER) or Barabasi-Albert (BA) | ER or BA                             | DIMACS generator; uniform grid or clustered
  Statistics (# nodes) | training: 15 ~ 500, test: 15 ~ 1200      | training: 15 ~ 300, test: 15 ~ 1200  | training: 15 ~ 300, test: 15 ~ 1200
Real-world data
  Datasets             | MemeTracker                              | Physics                              | TSPLIB
  Visualization        | (network drawings on the poster)         |                                      |
  Statistics           | 1 graph, 960 nodes, 5000 edges           | 10 graphs, 125 nodes, 375 edges      | 38 graphs, 51 to 318 nodes
  Opt obtained         | ILP with CPLEX                           | IQP with CPLEX                       | Concorde

MVC - BA   MAXCUT - BA   TSP - Clustered   [generalization bar charts on the poster]

We can see that S2V-DQN achieves a very low approximation ratio. Note that the "optimal" value used in the computation of approximation ratios may not be truly optimal (due to the solver time cutoff at 1 hour), and so CPLEX's and Concorde's solutions do typically get worse as problem size grows. This is why we can sometimes even get a better approximation ratio on larger graphs.

5.3 Scalability & Trade-off between running time and approximation ratio

To construct a solution on a test graph, our algorithm has polynomial complexity of O(k|E|), where k is the number of greedy steps (at most the number of nodes |V|) and |E| is the number of edges. For instance, on graphs with 1200 nodes, we can find the solution of MVC within 11 seconds using a single GPU, while getting an approximation ratio of 1.0062. For dense graphs, we can also sample the edges for the graph embedding computation to save time, a measure we will investigate in the future.

Figure 3 illustrates the approximation ratios of various approaches as a function of running time. All algorithms report a single solution at termination, whereas CPLEX reports multiple improving solutions, for which we recorded the corresponding running time and approximation ratio. Figure D.3 (Appendix D.7) includes other graph sizes and types, where the results are consistent with Figure 3.

(a) MVC BA 200-300 (b) MAXCUT BA 200-300

Figure 3: Time-approximation trade-off for MVC and MAXCUT. In this figure, each dot represents a solution found for a single problem instance. For CPLEX, we also record the time and quality of each solution it finds, e.g., CPLEX-1st means the first feasible solution found by CPLEX.

Figure 3 shows that, for MVC, we are slightly slower than the approximation algorithms but enjoy a much better approximation ratio. Also note that although CPLEX found the first feasible solution quickly, it also has a much worse ratio; the second improved solution found by CPLEX takes a similar or longer time than our S2V-DQN, but is still of worse quality. For MAXCUT, the observations are still consistent. One should be aware that sometimes our algorithm can obtain better results than 1-hour CPLEX, which gives ratios below 1.0. Furthermore, sometimes S2V-DQN is even faster than MaxcutApprox, although this comparison is not exactly fair, since we use GPUs; however, we can still see that our algorithm is efficient.

5.4 Experiments on real-world datasets

In addition to the experiments on synthetic data, we identified sets of publicly available benchmark or real-world instances for each problem, and performed experiments on them. A summary of results is in Table 3, and details are given in Appendix C. S2V-DQN significantly outperforms all competing methods for MVC, MAXCUT and TSP.

Table 3: Realistic data experiments, results summary. Values are average approximation ratios.

Problem | Dataset     | S2V-DQN | Best Competitor           | 2nd Best Competitor
MVC     | MemeTracker | 1.0021  | 1.2220 (MVCApprox-Greedy) | 1.4080 (MVCApprox)
MAXCUT  | Physics     | 1.0223  | 1.2825 (MaxcutApprox)     | 1.8996 (SDP)
TSP     | TSPLIB      | 1.0475  | 1.0947 (2-opt)            | 1.1771 (Cheapest)

5.5 Discovery of interesting new algorithms

We further examined the algorithms learned by S2V-DQN, and tried to interpret what greedy heuristics have been learned. We found that S2V-DQN is able to discover new and interesting algorithms which intuitively make sense but have not been analyzed before. For instance, S2V-DQN discovers an algorithm for MVC where nodes are selected to balance between their degrees and the connectivity of the remaining graph (Appendix Figures D.4 and D.7). For MAXCUT, S2V-DQN discovers an algorithm where nodes are picked to avoid cancelling out existing edges in the cut set (Appendix Figure D.5). These results suggest that S2V-DQN may also be a good assistive tool for discovering new algorithms, especially in cases when the graph optimization problems are new and less well-studied.

6 Conclusions

We presented an end-to-end machine learning framework for automatically designing greedy heuristics for hard combinatorial optimization problems on graphs. Central to our approach is the combination of a deep graph embedding with reinforcement learning. Through extensive experimental evaluation, we demonstrate the effectiveness of the proposed framework in learning greedy heuristics as compared to manually-designed greedy algorithms. The excellent performance of the learned heuristics is consistent across multiple different problems, graph types, and graph sizes, suggesting that the framework is a promising new tool for designing algorithms for graph problems.


MVC - BA
• Train on small graphs with 50-100 nodes, generalize to larger graphs
• Generate 200 Barabasi-Albert networks with 300 nodes
• Let CPLEX produce the 1st, 2nd, 3rd, 4th feasible solutions
[Bar chart: approximation ratios (y-axis 1.000-1.007) for S2V-DQN, CPLEX 1st-4th feasible solutions, 2-approx, and PN-AC.]

Minimum Vertex Cover: NP-Complete Problems on Graphs

Background

Definition: Find the smallest vertex subset S s.t. S covers all the edges

Application: Advertising optimization in social networks

Real-world: Same problem is solved repeatedly with slightly different data

[Illustration: scale-free network instances arriving repeatedly over time (Apr. 14, Apr. 17, ...) from a common distribution]

Not automatically! Typical approach: customize branch-and-bound / approximation / heuristic algorithms.

Graph Opt. Prob.         | Greedy Procedures          | Illustration
Minimum Vertex Cover     | Insert nodes into cover    | (poster figure)
Maximum Cut              | Insert nodes into subset   | (poster figure)
Traveling Salesman Prob. | Insert nodes into sub-tour | (poster figure)

Proposed Framework

$$\min_{x \in \{0,1\}^{|\mathcal{V}|}} \; \sum_{v \in \mathcal{V}} x_v \quad \text{s.t.} \quad x_i + x_j \ge 1, \;\; \forall\, (i,j) \in \mathcal{E}$$
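For illustration, this integer program can be written down directly with an off-the-shelf modeler. The sketch below uses the open-source PuLP package and its bundled CBC solver (the poster's reference optima were computed with CPLEX); the function name and interface are ours.

```python
import pulp

def solve_mvc_ilp(nodes, edges):
    """Exact MVC via the integer program: minimize sum_v x_v subject to
    x_i + x_j >= 1 for every edge (i, j), with x_v binary."""
    prob = pulp.LpProblem("min_vertex_cover", pulp.LpMinimize)
    x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in nodes}
    prob += pulp.lpSum(x.values())              # objective: size of the cover
    for i, j in edges:
        prob += x[i] + x[j] >= 1                # edge (i, j) must be covered
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [v for v in nodes if x[v].value() > 0.5]

# A 4-cycle needs a cover of size 2, e.g. two opposite corners.
print(solve_mvc_ilp(nodes=range(4), edges=[(0, 1), (1, 2), (2, 3), (3, 0)]))
```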

Repeat until all edges are covered:

1. Compute score for each vertex

2. Select vertex with largest score

3. Add best vertex to cover

Reward: $r = -1$

State $S$: currently selected nodes

Action value function: $\hat{Q}(S, v)$

Greedy policy: $v^* = \arg\max_{v} \hat{Q}(S, v)$

Update state $S$
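Putting these pieces together, the greedy decoding loop for MVC looks roughly like the sketch below. Here `q_fn` stands in for the learned action-value function Q̂(S, v) (e.g., the structure2vec network described in the excerpt that follows); the function name and the array-based interface are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def greedy_construct_mvc(adj_w, q_fn):
    """Greedy MVC construction driven by a scoring function q_fn.

    adj_w : (n, n) symmetric edge-weight matrix, 0 where there is no edge
    q_fn  : callable(adj_w, x) -> length-n array of scores Q-hat(S, v),
            where x is the binary vector marking nodes already in S
    """
    n = adj_w.shape[0]
    x = np.zeros(n)                                        # state S: empty cover
    uncovered = {(i, j) for i in range(n)
                 for j in range(i + 1, n) if adj_w[i, j] > 0}
    cover, total_reward = [], 0.0

    while uncovered:                                       # repeat until all edges are covered
        scores = np.asarray(q_fn(adj_w, x), dtype=float)   # 1. compute a score for each vertex
        scores[x == 1] = -np.inf                           #    (skip nodes already selected)
        v = int(np.argmax(scores))                         # 2. v* = argmax_v Q-hat(S, v)
        x[v] = 1                                           # 3. add v to the cover, update S
        cover.append(v)
        total_reward += -1.0                               # reward r = -1 per added node
        uncovered = {e for e in uncovered if v not in e}
    return cover, total_reward
```

With `q_fn` returning, say, each node's degree, this loop reduces to a simple degree-based greedy baseline; S2V-DQN instead learns the scores end-to-end with reinforcement learning.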

…degree distribution, triangle counts, distance to tagged nodes, etc. In order to represent such complex phenomena over combinatorial structures, we will leverage a deep learning architecture over graphs, in particular the structure2vec of [5], to parameterize $\hat{Q}(h(S), v; \Theta)$.

3.1 Structure2Vec

We first provide an introduction to structure2vec. This graph embedding network will compute a p-dimensional feature embedding $\mu_v$ for each node $v \in V$, given the current partial solution S. More specifically, structure2vec defines the network architecture recursively according to an input graph structure G, and the computation graph of structure2vec is inspired by graphical model inference algorithms, where node-specific tags or features $x_v$ are aggregated recursively according to G's graph topology. After a few steps of recursion, the network will produce a new embedding for each node, taking into account both graph characteristics and long-range interactions between these node features. One variant of the structure2vec architecture will initialize the embedding $\mu_v^{(0)}$ at each node as 0, and for all $v \in V$ update the embeddings synchronously at each iteration as

$$\mu_v^{(t+1)} \leftarrow F\big(x_v,\; \{\mu_u^{(t)}\}_{u \in \mathcal{N}(v)},\; \{w(v,u)\}_{u \in \mathcal{N}(v)};\; \Theta\big), \quad (2)$$

where $\mathcal{N}(v)$ is the set of neighbors of node v in graph G, and F is a generic nonlinear mapping such as a neural network or kernel function.

Based on the update formula, one can see that the embedding update process is carried out based on the graph topology. A new round of embedding sweeping across the nodes will start only after the embedding update for all nodes from the previous round has finished. It is easy to see that the update also defines a process where the node features $x_v$ are propagated to other nodes via the nonlinear propagation function F. Furthermore, the more update iterations one carries out, the farther away the node features will propagate and get aggregated nonlinearly at distant nodes. In the end, if one terminates after T iterations, each node embedding $\mu_v^{(T)}$ will contain information about its T-hop neighborhood as determined by graph topology, the involved node features and the propagation function F. An illustration of 2 iterations of graph embedding can be found in Figure 1.

3.2 Parameterizing $\hat{Q}(h(S), v; \Theta)$

We now discuss the parameterization of $\hat{Q}(h(S), v; \Theta)$ using the embeddings from structure2vec. In particular, we design F to update a p-dimensional embedding $\mu_v$ as

$$\mu_v^{(t+1)} \leftarrow \mathrm{relu}\Big(\theta_1 x_v + \theta_2 \sum_{u \in \mathcal{N}(v)} \mu_u^{(t)} + \theta_3 \sum_{u \in \mathcal{N}(v)} \mathrm{relu}\big(\theta_4\, w(v,u)\big)\Big), \quad (3)$$

where $\theta_1 \in \mathbb{R}^p$, $\theta_2, \theta_3 \in \mathbb{R}^{p \times p}$ and $\theta_4 \in \mathbb{R}^p$ are the model parameters, and relu is the rectified linear unit ($\mathrm{relu}(z) = z$ if $z > 0$ and 0 otherwise) applied elementwise to its input. The summation over neighbors is one way of aggregating neighborhood information invariant to the permutation of neighbor ordering. For simplicity of exposition, $x_v$ here is a binary scalar as described earlier; it is straightforward to extend $x_v$ to a vector representation by incorporating useful node information. To make the nonlinear transformations more powerful, we can add some more layers of relu before we pool over the neighboring embeddings $\mu_u$.

Once the embedding for each node is computed after T iterations, we will use these embeddings to define the $\hat{Q}(h(S), v; \Theta)$ function. More specifically, we will use the embedding $\mu_v^{(T)}$ for node v and the pooled embedding over the entire graph, $\sum_{u \in V} \mu_u^{(T)}$, as the surrogates for v and h(S), respectively, i.e.

$$\hat{Q}(h(S), v; \Theta) = \theta_5^\top \,\mathrm{relu}\Big(\big[\theta_6 \sum_{u \in V} \mu_u^{(T)},\; \theta_7\, \mu_v^{(T)}\big]\Big), \quad (4)$$

where $\theta_5 \in \mathbb{R}^{2p}$, $\theta_6, \theta_7 \in \mathbb{R}^{p \times p}$ and $[\cdot, \cdot]$ is the concatenation operator. Since the embedding $\mu_u^{(T)}$ is computed based on the parameters from the graph embedding network, $\hat{Q}(h(S), v)$ will depend on a collection of 7 parameters $\Theta = \{\theta_i\}_{i=1}^{7}$. The number of iterations T for the graph embedding computation is usually small, such as T = 4. The parameters $\Theta$ will be learned. Previously, [5] required a ground truth label for every input graph G in order to train the structure2vec architecture. There, the output of the embedding is linked with a softmax-layer, so that the parameters can be trained end-to-end by minimizing the cross-entropy loss. This approach is not applicable to our case due to the lack of training labels. Instead, we train these parameters together end-to-end using reinforcement learning.
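The following NumPy sketch mirrors Eqs. (2)-(4): T synchronous rounds of the embedding update followed by the Q̂ readout. It is a toy re-implementation for illustration only; the parameter shapes follow the text above, but the dictionary layout and function name are ours, not the released code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def s2v_q_values(adj_w, x, theta, T=4):
    """Return Q-hat(h(S), v) for every node v.

    adj_w : (n, n) symmetric edge-weight matrix, 0 where there is no edge
    x     : (n,) binary node tags (1 if the node is already in the partial solution S)
    theta : parameters theta1 (p,), theta2/theta3 (p, p), theta4 (p,),
            theta5 (2p,), theta6/theta7 (p, p), as in Eqs. (3)-(4)
    """
    n = adj_w.shape[0]
    adj = (adj_w > 0).astype(float)              # neighbor indicator for the sums over N(v)
    mu = np.zeros((n, theta["theta1"].shape[0]))  # mu_v^{(0)} = 0

    # Edge-weight term of Eq. (3): sum_{u in N(v)} relu(theta4 * w(v, u)), shape (n, p).
    edge_term = relu(adj_w[:, :, None] * theta["theta4"]).sum(axis=1)

    for _ in range(T):                           # T synchronous embedding rounds
        neigh_sum = adj @ mu                     # sum_{u in N(v)} mu_u^{(t)}
        mu = relu(np.outer(x, theta["theta1"])   # theta1 * x_v
                  + neigh_sum @ theta["theta2"].T
                  + edge_term @ theta["theta3"].T)

    pooled = theta["theta6"] @ mu.sum(axis=0)    # graph part: theta6 * sum_u mu_u^{(T)}
    q = np.empty(n)
    for v in range(n):                           # Eq. (4): per-node readout
        node_part = theta["theta7"] @ mu[v]
        q[v] = theta["theta5"] @ relu(np.concatenate([pooled, node_part]))
    return q
```

This `(adj_w, x)` interface plugs directly into the greedy loop sketched earlier, e.g. via `q_fn = lambda a, x: s2v_q_values(a, x, theta)`.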




[Poster diagram of structure2vec: starting from the input example graph with binary node tags, each node's feature vector is updated from its own tag x_v, its neighbors' features, and its neighbors' edge weights through nonlinear mappings; the embedding step is repeated T times. Θ = {θ_i}_{i=1}^{7} are the model parameters, and the final layer produces the prediction Q̂.]


Compute Q-value. State representation: sum-pooling over node embeddings; action representation: the embedding of the candidate node.

Experiment Settings

Generalization on new random graphs from the same distribution

approximation ratio ≈ 1

Methods:
• S2V-DQN: our proposed algorithm
• PN-AC: Bello et al. 2016
• Greedy heuristics
• Approximation algorithms

Approximation ratio:

Lower is better
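For reference, the per-instance ratio can be written symmetrically so that it is at least 1 for both minimization and maximization problems. The poster does not reproduce the exact formula, so the form below is stated as an assumption consistent with the reported values:

$$R(S, G) \;=\; \max\!\left(\frac{c(h(S))}{\mathrm{OPT}(G)},\; \frac{\mathrm{OPT}(G)}{c(h(S))}\right) \;\ge\; 1,$$

where $c(h(S))$ is the objective value of the constructed solution and $\mathrm{OPT}(G)$ is the best value found by the exact solver (CPLEX or Concorde).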

Generalization to a different distribution

[Bar charts: approximation ratios on MAXCUT - BA (y-axis 1.000-1.025) and TSP - Clustered (y-axis 1.00-1.12).]

Convergence

[Convergence plots: approximation ratio vs. number of minibatch training steps for MVC - BA (10^3-10^5 steps, ratio 1-1.35), MAXCUT - BA (10^3-10^5 steps, ratio 1-1.8), and TSP - Clustered (10^2-10^4 steps, ratio 1-1.14); curves for training graph sizes #node-15-20, #node-40-50, #node-50-100, #node-100-200, #node-200-300 (#node-400-500 for MAXCUT), plus a pre-trained model for MVC and MAXCUT.]

Experiments on real-world graphs

Time-Solution Tradeoff

Unique Heuristic Discovered
[Illustration: snapshots (1)-(10) of the learned MVC heuristic on an example graph.]

Our method keeps the connectivity!