circuits - arXivA generative modeling approach for benchmarking and training shallow quantum circuits Marcello Benedetti,1,2 Del na Garcia-Pintos,3 Oscar Perdomo,3,4,5 Vicente Leyton-Ortega,3,4

A generative modeling approach for benchmarking and training shallow quantumcircuits

Marcello Benedetti,1, 2 Delfina Garcia-Pintos,3 Oscar Perdomo,3, 4, 5

Vicente Leyton-Ortega,3, 4 Yunseong Nam,6 and Alejandro Perdomo-Ortiz1, 3, 4, 7, 8, ∗

1Department of Computer Science, University College London, WC1E 6BT London, UK2Cambridge Quantum Computing Limited, CB2 1UB Cambridge, UK

3Qubitera, LLC., Mountain View, CA 94041, USA4Rigetti Computing, 2919 Seventh Street, Berkeley, CA 94710-2704, USA

5Department of Mathematics, Central Connecticut State University, New Britain, CT 06050, USA6IonQ, Inc., College Park, MD 20740, USA

7Quantum Artificial Intelligence Lab., NASA Ames Research Center, Moffett Field, CA 94035, USA8Zapata Computing Inc., 439 University Avenue, Office 535, Toronto, ON M5G 1Y8, Canada

(Dated: June 2, 2019)

Hybrid quantum-classical algorithms provide ways to use noisy intermediate-scale quantum com-puters for practical applications. Expanding the portfolio of such techniques, we propose a quantumcircuit learning algorithm that can be used to assist the characterization of quantum devices and totrain shallow circuits for generative tasks. The procedure leverages quantum hardware capabilitiesto its fullest extent by using native gates and their qubit connectivity. We demonstrate that ourapproach can learn an optimal preparation of the Greenberger-Horne-Zeilinger states, also knownas “cat states”. We further demonstrate that our approach can efficiently prepare approximate rep-resentations of coherent thermal states, wave functions that encode Boltzmann probabilities in theiramplitudes. Finally, complementing proposals to characterize the power or usefulness of near-termquantum devices, such as IBM’s quantum volume, we provide a new hardware-independent metriccalled the qBAS score. It is based on the performance yield in a specific sampling task on one ofthe canonical machine learning data sets known as Bars and Stripes. We show how entanglement isa key ingredient in encoding the patterns of this data set; an ideal benchmark for testing hardwarestarting at four qubits and up. We provide experimental results and evaluation of this metric toprobe the trade off between several architectural circuit designs and circuit depths on an ion-trapquantum computer.

Keywords: quantum-assisted machine learning, generative models, quantum circuit learning

I. INTRODUCTION

What is a good metric for the computational powerof noisy intermediate-scale quantum [1] (NISQ) devices?Can machine learning (ML) provide ways to benchmarkthe power and usefulness of NISQ devices? How canwe capture the performance scaling of these devices asa function of circuit depth, gate fidelity, and qubit con-nectivity? In this work, we design a hybrid quantum-classical framework called data-driven quantum circuitlearning (DDQCL) and address these questions throughsimulations and experiments.

Hybrid quantum-classical algorithms, such as the vari-ational quantum eigensolver [2, 3] (VQE) and the quan-tum approximate optimization algorithm [4, 5] (QAOA),provide ways to use NISQ computers for practical ap-plications. For example, VQE is often used in quan-tum chemistry when searching the ground state of theelectronic Hamiltonian of a molecule [6–8]. QAOA isused in combinatorial optimization to find approximatesolutions of a classical Ising model [9] and it has beendemonstrated on a 19-qubit device for the task of clus-

∗ Correspondence: [email protected]

tering synthetic data [10]. Other successful hybrid ap-proaches based on genetic algorithms were proposed forapproximating quantum adders and training quantumautoencoders [11–13]. In all these examples, there is aclear-cut objective function describing the cost associatedwith each candidate solution. The task is then to opti-mize it exploiting both quantum and classical resources.

Although optimization tasks offer a great niche of ap-plications, probabilistic tasks involving sampling havemost of the potential to prove quantum advantage inthe near-term [14–16]. For example, learning probabilis-tic generative models is in many cases an intractabletask, and quantum-assisted algorithms have been pro-posed for both gate-based [17, 18] and quantum annealingdevices [19–23]. Differently from optimization, the learn-ing of a generative model is not always described by aclear-cut objective function. All we are given as input isa data set, and there are several cost functions that couldbe used as a guide, each one with their own advantages,disadvantages, and assumptions.

Here we present a hybrid quantum-classical approachfor generative modeling on gate-based NISQ deviceswhich heavily relies on sampling. We use the 2N am-plitudes of the wave function obtained from a N -qubitquantum circuit to construct and capture the correla-tions observed in a data set. As Born’s rule determines

arX

iv:1

801.

0768

6v4

[qu

ant-

ph]

2 J

un 2

019

mailto:Correspondence: [email protected]

2

the output probabilities, this model belongs to the classof models called Born machines [24]. Previous imple-mentations of Born machines [25–28] often relied on theconstruction of tensor networks and their efficient ma-nipulation through graphical-processing units. Our workdiffers in that Born’s rule is naturally implemented by aquantum circuit executed on a NISQ hardware. Follow-ing the notation of subsequent work [29], we refer to ourgenerative model as a quantum circuit Born machine.

By employing the quantum circuit itself as a model forthe data set, we also differentiate from quantum algo-rithms that target specific probability distributions. Forexample, in Ref. [30] the authors developed a hybrid al-gorithm to approximately sample from a Boltzmann dis-tribution. The samples are used to update a classicalgenerative model, which requires running the hybrid al-gorithm for each training iteration. In contrast, our workdoes not assume a Boltzmann distribution and thereforedoes not require a specific sampling beyond Born’s rule.

Our work is also in contrast with other quantum-assisted ML work (see e.g. Refs. [31–34]) requiring fault-tolerant quantum computers, which are not expected tobe readily available in the near-term [35]. Instead, ourcircuits are carefully designed to exploit the full powerof the underlying NISQ hardware without the need for acompilation step.

On the benchmarking of NISQ devices, quantum vol-ume [9, 36] has been proposed as an architecture-neutralmetric. It is based on the task of approximating a specificclass of random circuits and estimating the associated ef-fective error rate. This is very general and it is indeeduseful for estimating the computational power of a quan-tum computer. In this paper, we propose the qBAS score,a complementary metric designed for benchmarking hy-brid quantum-classical systems. The score is based onthe generative modeling performance on a canonical syn-thetic data set which is easy to generate, visualize, andvalidate for sizes up to hundreds of qubits. Yet, imple-menting a shallow circuit that can uniformly sample suchdata is hard; we will show that some candidate solutionsrequire large amount of entanglement. Hence, any mis-calibration or environmental noise will affect this singleperformance number, enabling comparison between dif-ferent devices or across different generations of the samedevice. Moreover, the score depends on the classical re-sources, hyper-parameters, and various design choices,making it a good choice for the assessment of the hybridsystem as a whole.

Our design choices are based on the setup of existingion trap architectures. We choose the ion trap becauseof its full connectivity among qubits [37] which allowsus to study several circuit layouts on the same device.Our experiments are carried out on an ion trap quantumcomputer hosted at the University of Maryland [38].

II. RESULTS

The learning pipeline

In this Section, we present a hybrid quantum-classicalalgorithm for the unsupervised machine learning taskof approximating an unknown probability distributionfrom data. This task is also known as generative model-ing. First, we describe the data and the model, then wepresent a training method.

The data set D = (x(1), · · · ,x(D)) is a collection of Dindependent and identically distributed random vectors.The underlying probabilities are unknown and the targetis to create a model for such distribution. For simplicity,we restrict our attention to N -dimensional binary vectorsx(d) ∈ {−1,+1}N , e.g. black and white images. Thisgives us an intuitive one-to-one mapping between obser-vation vectors and the computational basis of an N -qubitquantum system, that is x↔ |x〉 = |x1x2 · · ·xN 〉. Notethat standard binary encodings can be used to imple-ment integer, categorical, and approximate continuousvariables.

Provided with the data set D, our goal is now toobtain a good approximation to the target probabilitydistribution PD. A quantum circuit model with fixeddepth and gate layout, parametrized by a vector θ, pre-pares a wave function |ψ(θ)〉 from which probabilities areobtained according to Born’s rule Pθ(x) = |〈x|ψ(θ)〉|2.Following a standard approach from generative ma-chine learning [39], we can minimize the Kullback-Leibler (KL) divergence [40] DKL[PD|Pθ] from the cir-cuit probability distribution in the computational ba-sis Pθ to the target probability distribution PD. Min-imization of this quantity is directly related to the min-imization of a well known cost function: the negative

log-likelihood C(θ) = − 1D

∑Dd=1 ln(Pθ(x(d))). However,

there is a caveat; as probabilities are estimated fromfrequencies of a finite number of measurements, low-amplitude states could lead to incorrect assignements.For example, an estimate Pθ(x(d)) = 0 for some x(d) inthe data set would lead to infinite cost. To avoid singu-larities in the cost function, we use a simple variant

Cnll(θ) = − 1

D

D∑d=1

ln(max(ε, Pθ(x(d)))), (1)

where ε > 0 is a small number to be chosen. Note that thenumber of measurements needed to obtain an unbiasedestimate of the relevant probabilities may not scale favor-ably with the number of qubits N . In the SupplementaryMaterial we suggest alternative cost functions such as themoment matching error and the earth mover’s distance.

After estimating the cost, we update the parametervector θ to further minimize the cost. This can in princi-ple be done by any suitable classical optimizer. We useda gradient-free algorithm called particle swarm optimiza-tion (PSO) [41, 42] as previously done in the context ofquantum chemistry [7]. The algorithm iterates for a fixed

3

number of steps, or until a local minimum is reached andthe cost does not decrease.

We chose the layout of the model circuit to be ofthe following form. Let us consider a general circuit

parametrized by single qubit rotations {θ(l,k)i } and two-

qubit entangling rotations {θ(l)ij }. The subscripts denotequbits involved in the operation, l denotes the layer num-ber and k ∈ {1, 2, 3} denotes the rotation identifier. Thelatter is needed as we decompose an arbitrary single qubitrotation into three simpler rotations (see Section IV forfurther details about some exceptions and potential sim-plifications depending on the native gates available ineach specific hardware). Inspired by the gates readilyavailable in ion trap quantum computers, we use alter-nating layers of arbitrary single qubit gates (odd layers)and Mølmer-Sørensen XX entangling gates [43–46] (evenlayers) as our model. All parameters are initialized atrandom.

It is important to note that in our model the number ofparameters is fixed and is independent of the size D of thedata set. This means we can hope to obtain a good ap-proximation to the target distribution only if the modelis flexible enough to capture its complexity. Increasingthe number of layers or changing the topology of the en-tangling layer alter this flexibility, potentially improvingthe quality of the approximation. However, we antici-pate that such flexible models are more challenging tooptimize because of their larger number of parameters.Principled ways to choose the circuit layout and to reg-ularize its parameters could significantly help in such acase.

As summarized in Figure 1, the learning algorithm iter-atively adjusts all the parameters to minimize the value ofthe cost function. At any iteration the user-defined cost isapproximated using both samples from the data set andmeasurements from the quantum hardware, hence thename data-driven quantum circuit learning (DDQCL).

The qBAS score

Bars and stripes (BAS) [47] is a synthetic data setof images that has been widely used to study generativemodels for unsupervised machine learning. For n×mpixels, there are NBAS(n,m) = 2n + 2m − 2 images be-longing to BAS, and they can be efficiently produced andvisualized. The probability distribution is 1/NBAS(n,m)

for each pattern belonging to BAS(n,m), and zero forany other pattern. Figure 2 on the top left panel showspatterns belonging to BAS(2, 2), while the top centralpanel shows the remaining patterns.

We use DDQCL to learn a circuit that encodes all theBAS patterns in the wave function of a quantum state.This also allows us to design the qBAS(n,m) score: atask specific figure of merit to assess the performance ofshallow quantum circuits. In a single number, it cap-tures the model capacity of the circuit layout and in-trinsic hardware strengths and limitations in solving a

. Repeat 2 through 4 until convergence

…

…

|0i|0i

|0i

U(✓L)U(✓1) U(✓2) … U(✓l)

Outcome from quantum circuitReference data to be learned

Update

Mea

sure

men

ts

✓

1 Initialize circuit with random parameters

2

3 Estimate mismatch between data and quantum outcomes

4

✓ = (✓1, · · · ,✓L)

FIG. 1. General framework for data-driven quantum circuitlearning (DDQCL). Data vectors are interpreted as represen-tative samples from an unknown probability distribution andthe task is to model such distribution. The 2N amplitudesof the wave function resulting from an N -qubit quantum cir-cuit are used to capture the correlations observed in the data.Training of the quantum circuit is achieved by successive up-dates of the parameters θ, corresponding to specifications ofsingle qubit operations and entangling gates. In this work,we use arbitrary single qubit rotations for the odd layers, andMølmer-Sørensen XX gates for the even layers. At each iter-ation, measurements from the quantum circuit are collectedand contrasted with the data through evaluation of a costfunction which tracks the learning progress.

complex sampling task that requires a fair amount ofentanglement. It takes into account the model circuitdepth, gate fidelities, and any other architectural designaspects, such as the quantum hardware’s qubit-to-qubitconnectivity and native set of single and two-qubit gates.It also takes classical resources such as the choice of costfunction, optimizer, and hyper-parameters, into account.Therefore, it can be used to benchmark the performanceof the components of the hybrid quantum-classical sys-tem.

The qBAS(n,m) score is an instantiation of the F1

score widely used in the context of information retrieval.The F1 score is defined as the harmonic mean of the pre-cision p and the recall r, i.e. F1 = 2pr/(p + r). Theprecision p indicates the ability to retrieve states whichbelong to the data set of interest [48]. In our contextthis is the number of measurements that belong to theBAS(n,m) data set, divided by the total number of mea-surements Nreads performed. The recall r is the capacityof the model to retrieve the whole spectrum of patternsbelonging to the desired data set. In our case, if we de-

4

1

34

2

BAS patterns Non-BAS pa�erns

1

34

2

1 2

4 3

1

34

2 1

34

2

1

34

2

1

34

2

FIG. 2. DDQCL on the BAS data set. The top left panel shows patterns that belong to BAS(2, 2) our quantum circuit is togenerate. The top central panel shows undesired patterns. On the top right panel, we show a possible mapping of the 4 pixelsto N = 4 qubits, and we show some of the qubit-to-qubit connectivity topologies that can be set up in entangling layer andnatively implemented by the ion trap quantum computer (e.g chain, star, and all). The bottom left panel shows the resultsof DDQCL simulations of shallow circuits with different topologies. We show the bootstrapped median and 90% confidenceinterval over the distribution of medians of the KL divergence as learning progresses for 100 iterations. The mean-field-likecircuit L = 1 (green crosses) severely underperforms. A significant improvement is obtained with L = 2, where most of theparameters for XX gates have been learned to their maximum entangling value. These observations indicate that entanglementis a key resource for learning the BAS data set. Note that for L = 2 the choice of topology becomes a key factor for improvingthe performance. The chain topology (purple squares) performs slightly better than the star topology (red stars) even thoughthey have the same number of parameters. The all-to-all topology (orange circles) significantly outperform all the others asit has more expressive power. The bottom central image extends the previous analysis to deeper circuits with L = 4 andapproximatively twice the number of parameters. All the topologies achieve a lower median KL divergence and the confidenceintervals shrink. The bottom right panel shows the bootstrapped mean qBAS(2, 2) score and 95% confidence interval forsimulations (green bars) and experiments on the ion trap quantum computer hosted at University of Maryland (pink bars).

note the number of unique patterns that were measuredas d(Nreads), then r = d(Nreads)/NBAS(n,m). To scorehigh (F1 ≈ 1.0), both high precision (p ≈ 1.0) and highrecall (r ≈ 1.0) are required.

The F1 score is a useful measure for the quality of in-formation retrieval and classification algorithms, but forour purposes it has a caveat: the dependence of r onthe total number of measurements. As an example, con-sider a model that generates only BAS patterns, i.e. itsprecision is 1.0, but with highly heterogeneous distribu-tion. If some of the BAS patterns have infinitesimallysmall probability, we can still push the recall to 1.0 bytaking a large number of measurements Nreads → ∞.This is not desirable since our purpose is to evaluate cir-cuits on the task of uniformly sampling all the patternsfrom BAS(n,m). Therefore, to define a unique score

it is important to fix Nreads to a reasonable value suchthat r ≈ 1.0 under the assumption of the model distri-bution being equal to the target distribution PBAS(n,m),but not so large as to make the score insensitive to de-viations from the target distribution. Assuming a per-fect model distribution PBAS(n,m) = 1/NBAS(n,m), theexpected number of measurements to obtain a value ofr = 1.0 can be estimated using the famous coupon collec-tor’s problem. For our purposes here, we set Nreads to beequal to the expected number of samples that need to bedrawn to collect all the NBAS(n,m) patterns (“coupons”).That is, Nreads = NBAS(n,m)HNBAS(n,m)

, where Hk is thek-th harmonic number. Computed values of Nreads areprovided in Table I for different values of n and m upto 100 qubits. As shown in the table, the number ofreadouts required to determine qBAS(n,m) are within

5

experimental capabilities of current NISQ devices [49].For statistical robustness, we recommend as a good

practice to perform R repetitions of the Nreads measure-ments leading to R independent estimates of the recall(each denoted as ri). For estimating the precision p, allthe samples collected should be used to robustly estimatethis quantity. Using this value of p one can compute Rindependent values of the qBAS(n,m) score from eachof the ri. These are subsequently bootstrapped to ob-tain a more robust average for the final reported value ofqBAS(n,m) (see details in Section IV).

We note that a more general performance indicatorthan qBAS(n,m) score is indeed the KL divergence,DKL[PBAS(n,m)|Pθ]. However, this would not be robustin terms of scalability; as n ×m becomes large, it is ex-pected that the KL divergence is frequently undefined[50]. This is true when measurements yield distributionssuch that Pθ(x(d)) = 0 for any of the x(d) in BAS(n,m).In all these cases, the qBAS score can still be computedand the number of measurements Nreads necessary for ob-taining a robust estimate continues to remain relativelysmall to be practical for intermediate size n×m.

Experiments

We investigated three different examples, namely,GHZ state preparation, coherent thermal state prepa-ration, and BAS(2, 2). We implemented each exampleof DDQCL using both numerical simulations and exper-iments. We explored the parameter space of DDQCLby varying the qubit-to-qubit connectivity topology (seetop-right panel of Figure 2) and the number of layers. Weevaluated the performance of each instance by using theKL divergence from the circuit probability distributionin the computational basis to the target probability dis-tribution. While explicitly computing the KL divergenceis generally intractable, due to its demanding resource re-quirement as the size of instances to consider increases,we were able to compute this quantity explicitly for allcases considered in this paper.

GHZ state preparation

To test the capabilities of DDQCL, we started withthe preparation of GHZ states, also known as “catstates” [51]. Besides their importance in quantum in-formation, the choice is motivated by their simple de-scription and by the availability of many studies abouttheir preparation (see e.g. Refs. [52–54]). From theDDQCL perspective, we explored whether it is possibleto learn any of the known recipes for GHZ state prepa-ration starting only from classical data. More specifi-cally, the input data consists of samples from a distri-bution corresponding to the two desired computationalbasis states; P0 = 0.5 for the state |0 . . . 0〉 and P1 = 0.5for the state |1 . . . 1〉. Using a layer of single qubit rota-

|0〉

GMS2n(π2 )

|0〉...

...

|0〉

|0〉 Rx(π2 )

GMS2n+1(π2 )

|0〉 Rx(π2 )

......

|0〉 Rx(π2 )

FIG. 3. GHZ-like state preparation assisted by DDQCL. Left(right) panel shows a recipe obtained by a human expert as-sisted by DDQCL for the even (odd) cat state preparation.Rx stands for the single qubit rotation about the x axis. GMSstands for a global Mølmer-Sørensen gate [46] acting on all theN = 2n (N = 2n + 1) qubits and is equivalent to the appli-cation of local XX gates to all N(N − 1)/2 pairs of qubits.All parameters attained values very close to π

2. The human

expert rounded the parameters to some precision and foundthese patterns.

tions followed by an entangling layer with the all-to-alltopology, DDQCL yielded many degenerate preparationsof GHZ-like states differing only by a relative phase.

In particular, we first ran particle swarm optimizationon 25 random initializations for 3, 4, 5 and 6-qubit in-stances. Then, a human expert inspected the best set ofparameters learned for each size and after rounding theparameters to some precision, spotted a clear pattern. In-stances of 3 and 5 qubits yielded a recipe, while instancesof 4 and 6 qubit yielded another recipe. The recipes ob-tained are summarized in Figure 3 and were verified forlarger number of qubits, both odd and even. Indeed,DDQCL successfully reproduced the recipes previouslyused on ion trap quantum computers [55], and, to thebest of our knowledge, they correspond to the most com-pact and efficient protocols for GHZ state preparationusing XX gates (see Figure 3). Another commonly usedapproach consists of cascading entangling gates, with al-ternations of single qubit rotations [54]. DDQCL pro-duced approximate recipes of this kind in some of thetest cases for 3 and 4 qubits when using a single entan-gling layer with chain topology.

It is interesting to note that in DDQCL all the param-eters are learned independently and not constrained tobe the same. As shown in Figure 3, the learning processunveiled that these converge to the same value. This isnot necessarily the case for the other data sets consid-ered below. We also note that the simulations assumednoiseless hardware, making the analysis of parameterseasier for the human expert. It would be much moredifficult to analyse parameters found with noisy hard-ware, as DDQCL can learn to compensate certain typesof noise, e.g., systematic parameter offsets, in non-trivialways. The upside is that learning can be successful evenin the presence of such systematic errors.

Finally, it is reassuring that DDQCL obtained circuitsfor cat state preparation starting from samples of a clas-sical distribution. While this target distribution couldhave been modeled by a zero-temperature ferromagnet,there apparently was no other way for our circuit to re-

6

produce such a distribution, if not by preparing a GHZ-like state. One can obtain more general solutions by al-lowing DDQCL to prepare mixed states. For example,consider a circuit acting on both the main qubit registerand an additional ancilla register. By tracing out theancilla register, e.g. ignoring it during measurement, themain register can implement a mixed state and can betrained to simulate a zero-temperature ferromagnet. An-other example which does not resort to an ancilla registeris to use decoherence as a mechanism to prepare mixedstates that explain the data.

Coherent thermal states

Thermal states play an important role in statisticalphysics, quantum information, and machine learning.Using DDQCL, we trained quantum circuits with dif-ferent number of layers L ∈ {1, 2, 3} and using all-to-alltopology, to approximate a target Boltzmann distribu-tion. In particular, we considered data sets sampled fromthe Boltzmann distribution of 25 synthetic instances withN = 5 qubits. By decreasing the temperature T of thetarget distribution, we can increase the difficulty of thelearning task. Figure 4 shows the bootstrapped medianand 90% confidence interval over the distribution of me-dians of the KL divergence during learning. Deeper cir-cuits such as L = 3 (purple pentagons) consistently out-performed shallower circuits such as L = 2 (red circles)and L = 1 (yellow triangles). This became more evidentas we went from easy learning tasks (Figure 4 (a)) tohard learning tasks (Figure 4 (c)). Results for instancesof N = 6 qubits are shown in Figure 6 in the Supplemen-tary Material.

To assess how well DDQCL performs on the generativetask, we compared DDQCL to the inverse Bethe approx-imation [56] (see also Eqs. (3.21) and (3.22) in Ref. [57]),a classical closed-form approach widely used in statisti-cal physics to solve the inverse Ising problem. As shownin Figure 4, the inverse Bethe approximation (green bar)performed extremely well in the easy task (a), matchedthe L = 3 quantum circuit in the intermediate task (b),and underperformed on the difficult task (c). The latterobservation comes from the fact that the median perfor-mance of the inverse Bethe approximation has very largeconfidence intervals. We emphasize that this is not aform of quantum supremacy as the two methods are fun-damentally different. DDQCL prepares a quantum statewithout the assumption of an underlying Boltzmann dis-tribution, while the inverse Bethe approximation infersthe parameters with such assumption. Furthermore, theerror in the inverse Bethe approximation is expected togo to zero with system size, and only above the referencetemperature Tc (see Section IV for details). Thus, it isnot surprising that we obtained bad performance in Fig-ure 4 (c) with the inverse Bethe approximation. Otherclassical methods based on machine learning and MarkovChain Monte Carlo, such as the Boltzmann machine [58],

could achieve higher accuracy by requiring more compu-tational resources than the inverse Bethe approximationused here. A thorough comparison is beyond the scopeof this work.

BAS(2,2)

For the purposes of benchmarking and measuring thepower of NISQ devices with DDQCL, it is insufficientto have an easy-to-generate target data set; we also re-quire the data set to represent a useful quantum statein quantum computing, while simultaneously proving tobe sufficiently challenging for the quantum computer togenerate. Because of the importance of entanglementin quantum information processing, we considered theentanglement entropy averaged over all two-qubit sub-sets [59] as a proxy measure of a specific quantum state’susefulness for benchmarking purposes. We start by not-ing that the four-qubit cat state, whose rich entanglednature makes it ideal for studying decoherence and de-cay of quantum information [52, 54], has entanglemententropy SGHZ = 1. Now consider states that encodeBAS(2, 2) in the computational basis. The minimumvalue of entanglement entropy that any such state canhave is SBAS(2,2) = 1.25163. Furthermore, the maxi-mum value that a quantum representation of BAS(2, 2)can reach is SBAS(2,2) = 1.79248, which happens to bethe maximum entanglement entropy known for any four-qubit state [59] (see also Figure 9 and the correspondingSection in the Supplementary Material).

We also note that one of the quantum representa-tions of BAS(2, 2) found by DDQCL reached a remark-able value of SBAS(2,2) = 1.69989 (see Figure 7 in theSupplementary Material). This shows the power of ourframework, in that DDQCL is capable of handling usefulquantum states that are rich in entanglement. This isan important observation, since we know, based on ourempirical results, that (i) single layer circuits with noentangling gates severely underperform in producing theoutput state probability distribution that is close to thetarget data set, and (ii) when inspecting the parameterslearned for circuits with all-to-all topology with L = 2layers, we found that most of the XX gates reached theirmaximum entangling setting.

We now discuss results for the qBAS(2, 2) score, whichwe computed experimentally and theoretically in orderto compare the entangling topologies sketched in the topright panel of Figure 2 and for different number of layers.The process consists of two steps; first, DDQCL is usedto encode BAS(2, 2) in the wave function of the quantumstate. Second, the best circuits, i.e. those with lowestcost, are compared using the qBAS(2, 2) score.

The bottom left and bottom central panels in Figure 2show the bootstrapped median of the KL divergence and90% confidence interval over 25 random initializations ofDDQCL in silico. The all-to-all topology (orange cir-cles) always outperforms sparse topologies (red stars and

7

(a) T = 2 TC (b) T = TC (c) T = TC /1.5

FIG. 4. DDQCL preparation of coherent thermal states. We generated 25 random instances of size N = 5 and varied thedifficulty of the learning task by decreasing the temperature in T ∈ {2Tc, Tc, Tc/1.5} where Tc is the reference temperature (seeSection IV for details). The model is a quantum circuit with five qubits and an all-to-all qubit connectivity for the entanglinglayer. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence ofDDQCL as learning progresses for 50 iterations. (a) When T > Tc, the learning task is easy and shallow quantum circuits suchas L = 1 (yellow triangles) and L = 2 (red circles) perform very well. (b) When T ≈ Tc, a gap in performance between circuitsof different depth becomes evident. (c) When T < Tc, the learning task becomes hard and deeper circuits perform much betterthan shallow ones. We also report results for the inverse Bethe approximation, which does not actually prepare a state, butproduces a classical model in closed-form. The classical model so obtained (green band) is excellent for the easy task in (a),matches the best quantum model in (b), and underperforms for the hard task in (c).

purple squares). However, deeper circuits do not alwaysprovide significant improvements, as it is the case for all-to-all L = 4 (dark green circles) versus all-to-all L = 2(orange circles). A possible explanation is that, whengoing from two to four layers, we approximately doublethe number of parameters, and particle swarm optimiza-tion struggles to find enhanced local optima. Anotherplausible explanation is that for this small data set, theall-to-all circuits with L = 2 are already close to optimalperformance (we show supporting evidence in Figure 7in the Supplementary Material).

As the best performing circuits to compare using theqBAS score, we chose all-to-all L = 2 and star L ∈ {2, 4}circuits. While they represent very different approximatesolutions to the same problem, they may be comparedwith the help of qBAS(2, 2) score. For each setting, wecomputed 25 scores from batches of size Nreads = 15 sam-ples, as described in Section II. The bottom right panelin Figure 2 shows the bootstrapped mean qBAS(2, 2)score and 95% confidence interval for simulations (greenbars) and experiments on the ion trap quantum com-puter hosted at University of Maryland (pink bars). Thescore is sensitive to the depth of the circuit as shownby the performance improvement of L = 4 compared toL = 2 in the star topology. Note that the theoreticalimprovement for using L = 4 is larger than that ob-served experimentally in the ion trap. This is becausethe quantum computer accumulated errors while execut-ing the deeper circuit. The score is also sensitive to thechoice of topology as shown by the drop in performanceof star compared to all-to-all when the same number oflayers L = 2 is used.

Although we compared circuits implemented on thesame ion trap hardware, the score may be used to com-pare different device generations or even completely dif-ferent architectures (e.g. superconductor-based versusatomic-based). Similarly, one may use the score to com-pare classical resources of the hybrid system (e.g. differ-ent optimizers).

III. DISCUSSION

Data is an essential ingredient of any machine learningtask. In this work, we presented a data-driven quantumcircuit learning algorithm (DDQCL) as a framework thatcan assist in the characterization of NISQ devices and toimplement simple generative models. The success of thisapproach is evidenced by the results on three differentdata sets.

To summarize, first, we learned a GHZ state prepara-tion recipe for an ion trap quantum computer. Minimalintervention by a human expert allowed to generalize therecipe to any number of qubits. This is not an exampleof compilation, but rather an illustration of how simpleclassical probability distributions can guide the synthesisof interesting non-trivial quantum states. Depending onthe level or type of noise in the system, the same algo-rithm could lead to a different circuit fulfilling the sameprobability distribution as that of the data. The messagehere is that machine learning can teach us that “there ismore than one way to skin a cat (state)”.

Second, we trained circuits to prepare approximationsof thermal states. This illustrates the power of Born

8

machines [24] to approximate Boltzmann machines [58]when the data require thermal-like features.

Finally, tapping into the real power of near-term quan-tum devices and approximate algorithms implementableon them, we designed a task-specific performance esti-mator based on a canonical data set. The bars andstripes data is easy to generate, visualize and verify clas-sically, while modeling it still requires significant quan-tum resources in the form of entanglement. Errors inthe device will affect this single performance measure,the qBAS score, which can be used to compare differentdevice generations, or completely different architectures.The qBAS score can also be used to benchmark the typ-ical performance of optimizers used in hybrid quantum-classical systems. Selecting the method and optimiz-ing the hyper-parameters can be a daunting task andis a key challenge towards a successful implementationas the number of qubits increases. Therefore, having thisunique metric for benchmarking could help reduce thecomplexity of this fine-tuning stage. The score can becomputed in any of the NISQ architectures available todate.

DDQCL is a modular framework and its performancewill ultimately depend on the choices made for such mod-ules. In this article we explored the impact of circuit lay-out and cost function, while subsequent work has anal-ysed other modules and suggested extensions to the al-gorithm. In Ref. [29] the authors trained Born machinesusing a differentiable cost function and exploiting gradi-ent calculations proposed in Ref. [60]. In Refs. [61, 62]the authors compared several optimizers, and in Ref. [63]the authors focused on the impact of hardware noise.The expressive power of shallow circuits was investigatedin Ref. [64] and it was shown to outweigh that of someclasses of artificial neural networks. Finally, recent workhas shown successfull implementation of DDQCL on theIBM Q 20 Tokyo processor [63], on a five-qubit ion-traphosted at University of Maryland [61], and on the Rigetti16Q Aspen-1 processor [62].

It is left to future work to demonstrate more realisticmachine learning by allowing more flexible models andemploying regularization. At a finite and fixed low cir-cuit depth, the power of the generative model can be en-hanced by including ancilla qubits, in analogy to the roleof hidden units in probabilistic graphical models. Regu-larization can be included in the cost function. In thispaper, we used the negative log-likelihood which quicklybecomes expensive to estimate as the size of the systemincreases. We reported preliminary results on alternativecost-functions that overcome the caveat and still producesatisfactory results. Layer-wise pre-training of the quan-tum circuit inspired by deep learning [39] could initializeparameters to near-optimal locations in the cost land-scape. Finally, DDQCL could be generalized to learnquantum distributions or states, assuming experimentaldata coming from quantum experiments, e.g. quantummeasurements beyond the computational basis. We thinkthese are the most promising directions to be explored in

future work.Our approach has the bidirectional capability of using

NISQ devices for machine learning, and machine learningfor the characterization of NISQ devices. We hope theideas presented here contribute to the development offurther concrete metrics to help guide the architecturalhardware design, while tapping into the computationalpower of NISQ devices.

IV. METHODS

Simulation of quantum circuits in silico

We simulated quantum circuits using the QuTiP2 [65]Python library and implemented the constraints dictatedby the ion trap experimental setting. In the current ex-perimental setup, we can perform arbitrary single qubitrotations and Mølmer-Sørensen XX entangling gates in-volving any two qubits. We used only these gates henceavoiding the need of further compilation. For the simula-tions in silico, we also assume perfect gate fidelities anderror-free measurements.

In the ion trap setting, the implementation of singlequbit rotations Rz is very convenient. Therefore, weperform arbitrary single qubit operations relying on the

decomposition U(l)i = Rz(θ

(l,3)i )Rx(θ

(l,2)i )Rz(θ

(l,1)i ),

where l is the layer number, i is the qubit in-

dex, and θ(l,k)i ∈ [−π,+π] are Euler angles. The

rotations are then expressed as exponentials of

Pauli operators Rz(θ(l,·)i ) = exp(− i

2θ(l,·)i σzi ) and

Rx(θ(l,·)i ) = exp(− i

2θ(l,·)i σxi ).

Because we execute circuits always starting from the|0 · · · 0〉 state, the first set of Rz rotations would have noeffect and, therefore, is not needed. When an odd num-ber of layers is used, a similar exception occurs in the lastlayer. There, the last set of Rz rotations would only add aphase that becomes irrelevant when taking the amplitudesquared required for the Born machine. In other words,we can slightly reduce the number of parameter withoutchanging the expressive power of the circuit. Every otherlayer of arbitrary single qubit operations would in gen-eral require 3N parameters, where N is the number ofqubits. By using an alternative decomposition, namelyU = RxRzRx, we could apply commutation rules withXX gates and obtain a reduction to 2N parameters inall odd layers. We decided not to do the former stepfor two reasons. First, there is no effective reduction ofthe number of parameters for experiments up to L = 5layers considered here. Second, in the ion trap quantumcomputer used here [38] it is experimentally convenientto use a larger number of Rz rather than Rx rotations.

For the case of the entangling gates, we use the nota-

tion U(l)ij = XX(θ

(l)ij ), which in exponential form reads as

9

XX(θ(l)ij ) = exp(− i

2θ(l)ij σ

xi σ

xj ). Recalling that states that

differ by a global phase are indistinguishable, a directcomputation shows that the tunable parameters can be

taken as θ(l)ij ∈ [−π,+π]. Also, there is no need to set

up an order for these gates within an entangling layer asthey commute with one another.

The total number of parameters per entangling layerdepends on the chosen topology: all is a fully-connectedgraph and has N(N − 1)/2 parameters, chain is a one-dimensional nearest neighbor graph with N − 1 parame-ters, and star is a star-shaped graph with N − 1 param-eters. The top right panel of Figure 2 shows a graphicalrepresentation of these topologies for the case of N = 4qubits.

When executing DDQCL, we always estimated the re-quired quantities from 1000 measurements in the compu-tational basis.

Gradient-free optimization

Once the number of layers and topology of entanglinggates is fixed, the quantum circuits described above pro-vide a template; by adjusting the parameters we canimplement a small subset of the unitaries that are inprinciple allowed in the Hilbert space. The variationalapproach aims at finding the best parameters by mini-mizing a cost function. For all our tests, we choose tominimize a clipped version of the negative log-likelihood.

We use a global-best particle swarm optimization al-gorithm [42] implemented in the PySwarms [66] Pythonlibrary. A ‘particle’ corresponds to a candidate solutioncircuit; the position of a particle is a point θ in parameterspace, the velocity is a vector determining how to updatethe position in parameter space. Position and velocity ofall the particles are initialized at random and updated ateach iteration following the schema shown in Figure 1.There are three hyper-parameters controlling the swarmdynamics: a cognition coefficient c1, a social coefficientc2 and an inertia coefficient w. After testing a grid ofvalues, we chose to use a constant value of 0.5 for allthree hyper-parameters, which we found to work well forour purpose. To avoid large jumps in parameter space,we further restrict position updates in each dimension toa maximum magnitude of π.

Finally, we set the number of particles to twice thenumber of parameters of the circuit. This is a conserva-tive value compared to previous work [8], also because ofthe large number of parameters in the circuits exploredhere.

Data sets details

We worked with three synthetic data sets: zero-temperature ferromagnet, random thermal, and bars andstripes (BAS). In all our numerical experiments we use

1000 data points sampled exactly from these distribu-tions.

GHZ state preparation

The zero-temperature ferromagnet distribution isequivalent to assigning 1/2 probability to both |0 . . . 0〉and |1 . . . 1〉 states of the computational basis. This dis-tribution can be easily prepared as a mixed state, butour study uses pure states prepared by the circuit. Theonly way to reproduce the zero-temperature ferromag-net distribution in our setting is to implement a unitarytransformation that prepares a GHZ-like state.

Thermal states

A thermal data set in N dimensions is generated by ex-act sampling realizations of x ∈ {−1,+1}N from the dis-tribution P (x) = Z−1 exp((

∑ij Jijxixj +

∑i hixi)T

−1)where Z is the normalization constant, Jij and hi arerandom coefficients sampled from a normal distributionwith zero mean and

√N standard deviation, and T is

the temperature. In the large system-size limit, a phasetransition is expected at Tc ≈ 1. Although this is nottrue for the small-sized systems considered here, we takethis value as a reference temperature. In our study, wevary T ∈ {2Tc, Tc, Tc/1.5} in order to generate increas-ingly complex instances.

Bars and Stripes

BAS [47] is a canonical machine learning data set fortesting generative models. It consists of n×m pixel pic-tures generated by setting each row (or column) to eitherblack (−1) or white (+1), at random. In generative mod-eling, a handful of patterns are input to the algorithmand the target is to train a model to capture correlationsin the data. Assuming a successful training, the modelcan reconstruct and generate previously unseen patternsfrom partial or corrupted data. On the other hand, if weprovide the algorithm with all the patterns we are inter-ested in and aim to a model that generates only those,this would amount to an associative memory. Althoughboth tasks can be done with our DDQCL pipeline, forthe qBAS(n,m) score we focus on the latter task.

We now determine the number of patterns and pro-vide an easy identification of the bitstring belonging tothe BAS(n,m) class. For the total count of the number ofpatterns, we first count the number of single stripes, dou-ble stripes, etc. that can fit into the n rows. This numberis the sum of binomial coefficients

∑nk=0

(nk

)= 2n. The

same expression holds for the number of patterns withbars that can be placed in the m columns, that is 2m.Note that empty (all-white) and full (all-black) patternsare counted in both the bars and the stripes. Therefore,

10

we obtain the total count for the BAS patterns by sub-tracting the two extra patterns from this double count:

NBAS(n,m) = 2n + 2m − 2. (2)

In the main text, we use the BAS data set to design atask-specific performance indicator for hybrid quantum-classical systems. Table I shows the requirements forsome values of n and m.

(n,m) Nqubits NBAS(n,m) Nreads

(2,2) 4 6 15(2,3) 6 10 30(3,3) 9 14 46(4,4) 16 30 120(7,7) 49 254 1554(8,8) 64 510 3475

(10,10) 100 2046 16780

TABLE I. Example of experimental requirements for near-term quantum computers with up to 100 qubits. As describedin the main text, Nreads is the number of readouts requiredfor every estimation of the qBAS score.

Bootstrapping analysis

To obtain error bars for the KL divergence, we usedthe following procedure. DDQCL was always executed25 times with random initialization of the parameters.From the 25 repetitions, we sampled 10,000 data sets ofsize 25 with replacement and computed the median KLdivergence for each. From the distribution of 10,000 me-dians, we computed the median and obtained error barsfrom the 5-th and 95-th percentiles as the lower and up-per limits, respectively, accounting for a 90% confidenceinterval.

For the case of qBAS score, we did the following boot-strap analysis. qBAS score was always computed 25times from batches of samples Nreads. From the 25 rep-etitions, we sampled 10,000 data sets of size 25 with re-placement and computed the mean for each. From thedistribution of 10,000 means, we computed mean and ob-

tained error bars from two standard deviations, account-ing for a 95% confidence interval.

DATA AVAILABILITY

All data needed to evaluate the conclusions are avail-able from the corresponding author upon request.

ACKNOWLEDGEMENTS

M.B. was supported by the UK Engineering and Phys-ical Sciences Research Council (EPSRC) and by Cam-bridge Quantum Computing Limited (CQCL). The au-thors are very grateful to Prof. Christopher Monroe andhis team at the University of Maryland (UMD) for theirsupport in running the experiments presented here. Spe-cial thanks to N.M. Linke, C. Figgatt, K.A. Landsman,and D. Zhu for useful discussions and for running the sev-eral experiments used to test and validate the pipeline ofthis work, and the ones presented in the Results section.The authors acknowledge E. Edwards from the communi-cations/publicity division at the Joint Quantum Instituteat UMD for the rendering of the ion-trap graphic usedin Figure 2, and would like to thank J. Realpe-Gomez,A.M. Wilson, and G. Paz-Silva for useful discussions andfeedback on an early version of this manuscript. The au-thors would like to thank J.I. Latorre for pointing outRef. [59].

AUTHOR CONTRIBUTIONS

M.B. and A.P-O designed the generative model and thelearning algorithm. M.B., Y.N., and A.P-O designed thedata sets and the experiments. M.B. and V.L-O wrotethe code and performed the in silico experiments. D.G-Pwrote code for the analysis and figures. A.P-O designedthe qBAS score. O.P. performed the analysis related toentanglement entropies. All authors analyzed the exper-imental results and contributed to the final version of themanuscript.

COMPETING INTERESTS

The authors declare no competing interests.

[1] John Preskill, “Quantum computing in the NISQ era andbeyond,” Quantum 2, 79 (2018).

[2] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alan Aspuru-Guzik, and Jeremy L. O’Brien, “A variational eigenvaluesolver on a photonic quantum processor,” Nature Com-

munications 5, 4213 EP – (2014).[3] Jarrod R McClean, Jonathan Romero, Ryan Babbush,

and Alan Aspuru-Guzik, “The theory of variationalhybrid quantum-classical algorithms,” New Journal ofPhysics 18, 023023 (2016).

http://dx.doi.org/ 10.22331/q-2018-08-06-79

http://dx.doi.org/ 10.1038/ncomms5213

http://dx.doi.org/ 10.1038/ncomms5213

http://stacks.iop.org/1367-2630/18/i=2/a=023023


11

[4] Sam Gutmann Edward Farhi, Jeffrey Goldstone,“A quantum approximate optimization algorithm,”arXiv:1411.4028 (2014).

[5] Stuart Hadfield, Zhihui Wang, Bryan O’Gorman,Eleanor G Rieffel, Davide Venturelli, and Rupak Biswas,“From the quantum approximate optimization algorithmto a quantum alternating operator ansatz,” Algorithms12, 34 (2019).

[6] PJJ O’Malley, Ryan Babbush, ID Kivlichan, JonathanRomero, JR McClean, Rami Barends, Julian Kelly, Pe-dram Roushan, Andrew Tranter, Nan Ding, et al., “Scal-able quantum simulation of molecular energies,” PhysicalReview X 6, 031007 (2016).

[7] J. I. Colless, V. V. Ramasesh, D. Dahlen, M. S. Blok,M. E. Kimchi-Schwartz, J. R. McClean, J. Carter, W. A.de Jong, and I. Siddiqi, “Computation of molecular spec-tra on a quantum processor with an error-resilient algo-rithm,” Phys. Rev. X 8, 011021 (2018).

[8] Abhinav Kandala, Antonio Mezzacapo, Kristan Temme,Maika Takita, Markus Brink, Jerry M Chow, andJay M Gambetta, “Hardware-efficient variational quan-tum eigensolver for small molecules and quantum mag-nets,” Nature 549, 242 (2017).

[9] Nikolaj Moll, Panagiotis Barkoutsos, Lev S Bishop,Jerry M Chow, Andrew Cross, Daniel J Egger, Stefan Fil-ipp, Andreas Fuhrer, Jay M Gambetta, Marc Ganzhorn,Abhinav Kandala, Antonio Mezzacapo, Peter Mller, Wal-ter Riess, Gian Salis, John Smolin, Ivano Tavernelli,and Kristan Temme, “Quantum optimization using varia-tional algorithms on near-term quantum devices,” Quan-tum Science and Technology 3, 030503 (2018).

[10] J. S. Otterbach, R. Manenti, N. Alidoust, A. Bestwick,M. Block, B. Bloom, S. Caldwell, N. Didier, E. SchuylerFried, S. Hong, P. Karalekas, C. B. Osborn, A. Pa-pageorge, E. C. Peterson, G. Prawiroatmodjo, N. Ru-bin, C. A. Ryan, D. Scarabelli, M. Scheer, E. A. Sete,P. Sivarajah, R. S. Smith, A. Staley, N. Tezak, W. J.Zeng, A. Hudson, B. R. Johnson, M. Reagor, M. P. daSilva, and C. Rigetti, “Unsupervised machine learn-ing on a hybrid quantum computer,” arXiv preprintarXiv:1712.05771 (2017).

[11] Jonathan Romero, Jonathan P Olson, and Alan Aspuru-Guzik, “Quantum autoencoders for efficient compressionof quantum data,” Quantum Sci. Technol. 2, 045001(2017).

[12] Lucas Lamata, Unai Alvarez-Rodriguez, Jose Martın-Guerrero, Mikel Sanz, and Enrique Solano, “Quan-tum autoencoders via quantum adders with genetic al-gorithms,” Quantum Science and Technology (2018),10.1088/2058-9565/aae22b.

[13] Rui Li, Unai Alvarez-Rodriguez, Lucas Lamata, andEnrique Solano, “Approximate quantum adders with ge-netic algorithms: An IBM quantum experience,” Quan-tum Measurements and Quantum Metrology 4, 1–7(2017).

[14] Edward Farhi and Aram W. Harrow, “Quantumsupremacy through the quantum approximate optimiza-tion algorithm,” arXiv:1602.07674 (2016).

[15] Alejandro Perdomo-Ortiz, Marcello Benedetti, JohnRealpe-Gomez, and Rupak Biswas, “Opportunities andchallenges for quantum-assisted machine learning innear-term quantum computers,” Quantum Science andTechnology 3, 030502 (2018).

[16] Aram W. Harrow and Ashley Montanaro, “Quantumcomputational supremacy,” Nature 549, 203–209 (2017).

[17] Krysta M. Svore Nathan Wiebe, Ashish Kapoor, “Quan-tum deep learning,” arXiv:1412.3489 (2015).

[18] Maria Kieferova and Nathan Wiebe, “Tomography andgenerative training with quantum boltzmann machines,”Phys. Rev. A 96, 062327 (2017).

[19] Marcello Benedetti, John Realpe-Gomez, Rupak Biswas,and Alejandro Perdomo-Ortiz, “Estimation of effectivetemperatures in quantum annealers for sampling appli-cations: A case study with possible applications in deeplearning,” Phys. Rev. A 94, 022308 (2016).

[20] Mohammad H Amin, Evgeny Andriyash, Jason Rolfe,Bohdan Kulchytskyy, and Roger Melko, “Quantumboltzmann machine,” Physical Review X 8, 021050(2018).

[21] Marcello Benedetti, John Realpe-Gomez, Rupak Biswas,and Alejandro Perdomo-Ortiz, “Quantum-assisted learn-ing of hardware-embedded probabilistic graphical mod-els,” Phys. Rev. X 7, 041052 (2017).

[22] Marcello Benedetti, John Realpe-Gomez, and AlejandroPerdomo-Ortiz, “Quantum-assisted helmholtz machines:A quantum–classical deep learning framework for indus-trial datasets in near-term devices,” Quantum Scienceand Technology 3, 034007 (2018).

[23] Peter Wittek and Christian Gogolin, “Quantum en-hanced inference in Markov logic networks,” ScientificReports 7 (2017), 10.1038/srep45672.

[24] Song Cheng, Jing Chen, and Lei Wang, “Informationperspective to probabilistic modeling: Boltzmann ma-chines versus born machines,” Entropy 20, 583 (2018).

[25] Edwin Stoudenmire and David J Schwab, “Supervisedlearning with tensor networks,” in Advances in NeuralInformation Processing Systems 29 , edited by D. D. Lee,M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett(Curran Associates, Inc., 2016) pp. 4799–4807.

[26] Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, andPan Zhang, “Unsupervised generative modeling usingmatrix product states,” Physical Review X 8 (2018),10.1103/physrevx.8.031012.

[27] Ding Liu, Shi-Ju Ran, Peter Wittek, Cheng Peng,Raul Blazquez Garcıa, Gang Su, and Maciej Lewenstein,“Machine learning by two-dimensional hierarchical tensornetworks: A quantum information theoretic perspectiveon deep architectures,” arXiv preprint arXiv:1710.04833(2017).

[28] Xun Gao, Zhengyu Zhang, and Luming Duan, “Aquantum machine learning algorithm based on genera-tive models,” Science Advances 4 (2018), 10.1126/sci-adv.aat9004.

[29] Jin-Guo Liu and Lei Wang, “Differentiable learning ofquantum circuit born machines,” Phys. Rev. A 98,062324 (2018).

[30] Guillaume Verdon, Michael Broughton, and Jacob Bia-monte, “A quantum algorithm to train neural networksusing low-depth circuits,” arXiv:1712.05304 (2017).

[31] Seth Lloyd, Masoud Mohseni, and Patrick Reben-trost, “Quantum principal component analysis,” NaturePhysics 10, 631–633 (2014).

[32] Iordanis Kerenidis and Anupam Prakash, “Quantum rec-ommendation systems,” arXiv preprint arXiv:1603.08675(2016).

[33] Fernando G. S. L. Brandao, Amir Kalev, TongyangLi, Cedric Yen-Yu Lin, Krysta M. Svore, and Xiaodi

http://dx.doi.org/ 10.3390/a12020034

http://dx.doi.org/ 10.3390/a12020034

http://dx.doi.org/10.1103/physrevx.6.031007


http://dx.doi.org/10.1103/PhysRevX.8.011021

http://dx.doi.org/ 10.1038/nature23879



http://dx.doi.org/10.1088/2058-9565/aa8072

http://dx.doi.org/10.1088/2058-9565/aa8072

http://dx.doi.org/10.1088/2058-9565/aae22b

http://dx.doi.org/10.1088/2058-9565/aae22b

http://dx.doi.org/10.1515/qmetro-2017-0001



http://dx.doi.org/10.1088/2058-9565/aab859

http://dx.doi.org/10.1088/2058-9565/aab859

http://dx.doi.org/10.1038/nature23458

http://dx.doi.org/ 10.1103/PhysRevA.96.062327

http://dx.doi.org/10.1103/PhysRevA.94.022308



http://dx.doi.org/10.1103/PhysRevX.7.041052

http://dx.doi.org/10.1088/2058-9565/aabd98

http://dx.doi.org/10.1088/2058-9565/aabd98

http://dx.doi.org/ 10.1038/srep45672

http://dx.doi.org/ 10.1038/srep45672

http://dx.doi.org/10.3390/e20080583

http://papers.nips.cc/paper/6211-supervised-learning-with-tensor-networks.pdf

http://papers.nips.cc/paper/6211-supervised-learning-with-tensor-networks.pdf



http://dx.doi.org/ 10.1126/sciadv.aat9004

http://dx.doi.org/ 10.1126/sciadv.aat9004



http://dx.doi.org/10.1038/nphys3029

http://dx.doi.org/10.1038/nphys3029

12

Wu, “Exponential quantum speed-ups for semidefiniteprogramming with applications to quantum learning,”arXiv:1710.02581 (2017).

[34] Maria Schuld, Mark Fingerhuth, and Francesco Petruc-cione, “Implementing a distance-based classifier with aquantum interference circuit,” EPL (Europhysics Let-ters) 119, 60002 (2017).

[35] Masoud Mohseni, Peter Read, Hartmut Neven, SergioBoixo, Vasil Denchev, Ryan Babbush, Austin Fowler,Vadim Smelyanskiy, and John Martinis, “Commercializequantum technologies in five years,” Nature 543, 171–174(2017).

[36] Lev S Bishop, Sergey Bravyi, Andrew Cross, Jay M Gam-betta, and John Smolin, “Quantum volume,” (2017).

[37] Norbert M Linke, Dmitri Maslov, Martin Roetteler,Shantanu Debnath, Caroline Figgatt, Kevin A Lands-man, Kenneth Wright, and Christopher Monroe, “Ex-perimental comparison of two quantum computing archi-tectures,” Proceedings of the National Academy of Sci-ences , 201618020 (2017).

[38] Shantanu Debnath, Norbert M Linke, Caroline Figgatt,Kevin A Landsman, Kevin Wright, and ChristopherMonroe, “Demonstration of a small programmable quan-tum computer with atomic qubits,” Nature 536, 63 EP– (2016).

[39] Ian Goodfellow Yoshua Bengio and Aaron Courville,“Deep learning,” (2016), MIT Press.

[40] S. Kullback and R. A. Leibler, “On information and suffi-ciency,” The Annals of Mathematical Statistics 22, 79–86(1951).

[41] James Kennedy and Russell Eberhart, “Particle swarmoptimization,” in Proceedings of ICNN’95 - InternationalConference on Neural Networks, Vol. 4 (1995) pp. 1942–1948 vol.4.

[42] Yuhui Shi and Russell Eberhart, “A modified particleswarm optimizer,” in 1998 IEEE International Confer-ence on Evolutionary Computation Proceedings. IEEEWorld Congress on Computational Intelligence (Cat.No.98TH8360) (1998) pp. 69–73.

[43] Anders Sørensen and Klaus Mølmer, “Quantum compu-tation with ions in thermal motion,” Phys. Rev. Lett. 82,1971–1974 (1999).

[44] Anders Sørensen and Klaus Mølmer, “Entanglement andquantum computation with ions in thermal motion,”Phys. Rev. A 62, 022311 (2000).

[45] Jan Benhelm, Gerhard Kirchmair, Christian F. Roos,and Rainer Blatt, “Towards fault-tolerant quantum com-puting with trapped ions,” Nature Physics 4, 463 EP –(2008).

[46] Dmitri Maslov and Yunseong Nam, “Use of global inter-actions in efficient quantum circuit constructions,” NewJournal of Physics (2017), 10.1088/1367-2630/aaa398.

[47] David J. C. MacKay, Information Theory, Inference &Learning Algorithms (Cambridge University Press, NewYork, NY, USA, 2002).

[48] The meaning and usage of precision in the field of infor-mation retrieval differs from the definition of precisionwithin other branches of science and statistics.

[49] It is an interesting problem to optimize Nreads such thatit maximizes the sensitivity of the score towards differ-entiating two probability distributions. This problem isleft for future work.

[50] One may argue that the same scalability issue applies tothe cost function in Eq. (1) used for learning. However,

we can choose alternative scalable cost functions, as weshow in the Supplementary Material. A comprehensivestudy is left for future work.

[51] Daniel M. Greenberger, Michael A. Horne, Abner Shi-mony, and Anton Zeilinger, “Bell’s theorem without in-equalities,” American Journal of Physics 58, 1131–1143(1990).

[52] Thomas Monz, Philipp Schindler, Julio T. Barreiro,Michael Chwalla, Daniel Nigg, William A. Coish, Max-imilian Harlander, Wolfgang Hansel, Markus Hennrich,and Rainer Blatt, “14-qubit entanglement: Creation andcoherence,” Phys. Rev. Lett. 106, 130506 (2011).

[53] Andrea Rocchetto, Scott Aaronson, Simone Severini,Gonzalo Carvacho, Davide Poderini, Iris Agresti, MarcoBentivegna, and Fabio Sciarrino, “Experimental learn-ing of quantum states,” Science Advances 5 (2019),10.1126/sciadv.aau1946.

[54] Asier Ozaeta and Peter L McMahon, “Decoherence of upto 8-qubit entangled states in a 16-qubit superconductingquantum processor,” Quantum Science and Technology4, 025015 (2019).

[55] Thomas Monz, Philipp Schindler, Julio T. Barreiro,Michael Chwalla, Daniel Nigg, William A. Coish, Max-imilian Harlander, Wolfgang Hansel, Markus Hennrich,and Rainer Blatt, “14-qubit entanglement: Creation andcoherence,” Phys. Rev. Lett. 106, 130506 (2011).

[56] Federico Ricci-Tersenghi, “The bethe approximation forsolving the inverse ising problem: a comparison withother inference methods,” Journal of Statistical Mechan-ics: Theory and Experiment 2012, P08015 (2012).

[57] Iacopo Mastromatteo, “On the typical properties of in-verse problems in statistical mechanics,” arXiv preprintarXiv:1311.0190 (2013).

[58] David H Ackley, Geoffrey E Hinton, and Terrence J Se-jnowski, “A learning algorithm for boltzmann machines,”Cognitive science 9, 147–169 (1985).

[59] Atsushi Higuchi and Anthony W. Sudbery, “How entan-gled can two couples get?” Physics Letters A 273, 213 –217 (2000).

[60] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa,and Keisuke Fujii, “Quantum circuit learning,” Phys.Rev. A 98, 032309 (2018).

[61] D. Zhu, N. M. Linke, M. Benedetti, K. A. Landsman,N. H. Nguyen, C. H. Alderete, A. Perdomo-Ortiz, N. Ko-rda, A. Garfoot, C. Brecque, L. Egan, O. Perdomo, andC. Monroe, “Training of quantum circuits on a hybridquantum computer,” arXiv preprint arXiv:1812.08862(2018).

[62] Vicente Leyton-Ortega, Alejandro Perdomo-Ortiz, andOscar Perdomo, “Robust implementation of generativemodeling with parametrized quantum circuits,” arXivpreprint arXiv:1901.08047 (2019).

[63] Kathleen E. Hamilton, Eugene F. Dumitrescu,and Raphael C. Pooser, “Generative model bench-marks for superconducting qubits,” arXiv preprintarXiv:1811.09905 (2018).

[64] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, andDacheng Tao, “The expressive power of parameter-ized quantum circuits,” arXiv preprint arXiv:1810.11922(2018).

[65] J.R. Johansson, P.D. Nation, and Franco Nori, “Qutip 2:A python framework for the dynamics of open quantumsystems,” Computer Physics Communications 184, 1234– 1240 (2013).



http://dx.doi.org/10.1038/543171a

http://dx.doi.org/10.1038/543171a

http://dx.doi.org/10.1073/pnas.1618020114

http://dx.doi.org/10.1073/pnas.1618020114



http://www.deeplearningbook.org

http://dx.doi.org/10.1214/aoms/1177729694

http://dx.doi.org/10.1214/aoms/1177729694

http://dx.doi.org/10.1109/ICNN.1995.488968

http://dx.doi.org/10.1109/ICNN.1995.488968

http://dx.doi.org/10.1109/ICEC.1998.699146




http://dx.doi.org/ 10.1103/PhysRevLett.82.1971

http://dx.doi.org/ 10.1103/PhysRevLett.82.1971


http://dx.doi.org/ 10.1038/nphys961

http://dx.doi.org/ 10.1038/nphys961

http://dx.doi.org/ 10.1088/1367-2630/aaa398

http://dx.doi.org/ 10.1088/1367-2630/aaa398

http://dx.doi.org/10.1119/1.16243

http://dx.doi.org/10.1119/1.16243

http://dx.doi.org/10.1103/PhysRevLett.106.130506

http://dx.doi.org/10.1126/sciadv.aau1946

http://dx.doi.org/10.1126/sciadv.aau1946

http://dx.doi.org/10.1088/2058-9565/ab13e5

http://dx.doi.org/10.1088/2058-9565/ab13e5

http://dx.doi.org/10.1103/PhysRevLett.106.130506

http://dx.doi.org/ 10.1088/1742-5468/2012/08/p08015

http://dx.doi.org/ 10.1088/1742-5468/2012/08/p08015

http://dx.doi.org/10.1016/S0375-9601(00)00480-1

http://dx.doi.org/10.1016/S0375-9601(00)00480-1



http://dx.doi.org/ 10.1016/j.cpc.2012.11.019

http://dx.doi.org/ 10.1016/j.cpc.2012.11.019

13

[66] Lester James V. Miranda, “Pyswarms, a research-toolkitfor particle swarm optimization in python,” (2017),10.5281/zenodo.986300.

[67] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas,“The earth mover’s distance as a metric for image re-trieval,” International journal of computer vision 40, 99–121 (2000).

[68] Elizaveta Levina and Peter Bickel, “The earth mover’sdistance is the mallows distance: Some insights fromstatistics,” in Computer Vision, 2001. ICCV 2001.Proceedings. Eighth IEEE International Conference on,Vol. 2 (IEEE, 2001) pp. 251–256.

[69] Ofir Pele and Michael Werman, “Fast and robust earthmover’s distances,” in 2009 IEEE 12th International Con-ference on Computer Vision (IEEE, 2009) pp. 460–467.

Supplementary Material for “A generative modeling approach for benchmarking andtraining shallow quantum circuits”

Comparison of cost functions

In realistic machine learning scenarios, we typically do not have access to the complete target probability dis-tribution, nor to that obtained from the output state of a quantum circuit. Hence, we need to compare the twodistributions at the level of histograms and using a finite number of samples and measurements. Here we comparethree cost functions via simulations in silico.

First, we defined the clipped negative log-likelihood as

Cnll(θ) = − 1

D

D∑d=1

ln(max(ε, Pθ(x(d)))), (3)

where probabilities are estimated from samples and ε > 0 is a small number that avoids an infinite cost whenPθ(x(d)) = 0. This may happen if for example entanglement in the circuit prevents us from measuring configurationx(d). Moreover, when the number of variables N is large, all except few configurations will ever be measured due tothe finite number of samples. Note that by re-normalizing all the probabilities after the clipping, we could interpretthis variant as a Laplace additive smoothing. All the experiments in the main text are carried out using the clippednegative log-likelihood with ε = 10−8.

Second, we defined the earth mover’s distance [67] as

Cemd(θ) = minF〈d(x,y)〉F , (4)

where F (x,y) is a joint probability distribution such that∑

y F (x,y) = PD(x) and∑

x F (x,y) = Pθ(y), that is, itsmarginals correspond to the data and circuit distributions, respectively. Intuitively, this is the minimum cost of turningone histogram into the other where the ground metric d(x,y) specifies the cost of transporting a single unit from xto y. We chose d(x,y) to be the Hamming distance between strings x ∈ {−1,+1}N and y ∈ {−1,+1}N . Since wenormalize histograms to sum up to one, the Earth Mover’s Distance is equivalent to the 1-st Wasserstein distance [68].In our simulations, we use the PyEMD Python library for fast computation of the earth moving distance [69].

Third, we defined the moment matching as

Cmm(θ) =1

N

N∑i

(〈xi〉PD − 〈xi〉Pθ)2 +

2

N(N − 1)

N∑i>j

(〈xixj〉PD − 〈xixj〉Pθ)2, (5)

where the expectation values PD and Pθ are taken with respect to data and circuit distributions, respectively. Thiscost function can be generalized to include moments beyond the second as well as using different positive exponentsfor the error.

We compared the cost functions on the task of learning thermal states of size N = 5, with L = 3 layers and alltopology. Figure 5 shows the bootstrapped median KL divergence on 25 realizations and 90% confidence intervalas learning progress for 100 iterations. The fact that Cnll (red diamonds) outperforms other cost functions does notcome as a surprise; minimization of the negative log-likelihood is indeed directly related to minimization of the KLdivergence. However, we expect the performance of DDQCL based on this cost function to degrade quickly as the sizeof the problem increases. In realistic applications, the relevant probabilities in Eq. (3), i.e. those associated with the

http://dx.doi.org/10.5281/zenodo.986300

http://dx.doi.org/10.5281/zenodo.986300

http://dx.doi.org/10.1109/iccv.2009.5459199

http://dx.doi.org/10.1109/iccv.2009.5459199

14

data, are a vanishing fraction of the 2N probabilities. Moreover, they need to be estimated from a finite number ofmeasurements. The earth mover’s distance Cemd (green pentagons) performs well, but it suffers from similar scalabilityissues. Fast algorithms for the computation of this distance may struggle when the number of bins in the histogramincreases exponentially as in our case. However, it is reassuring to see that alternative cost functions with no relationto the KL divergence can still produce satisfactory results. Surprisingly, the moment matching Cmm (purple crosses)closely tracks the other cost functions while retaining computational efficiency. In fact, even though a large numberof samples may be needed to obtain low-variance estimates for the moments, only O(N2) terms are computed at eachiteration. We expect this cost function to be a good heuristic for DDQCL on large systems.

0 20 40 60 80 100Iteration

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

KL

div

erg

en

ce

Cemd

Cnll

Cmm

FIG. 5. Bootstrapped median KL divergence and 90% confidence interval of circuits learned by minimizing different costfunctions. Both the moment matching (Cmm) and earth mover’s distance (Cemd) closely track the clipped negative log-likelihood(Cnll) used in the main text.

Approximate preparation of coherent thermal states for N = 6

FIG. 6. DDQCL preparation of approximate coherent thermal states for the case of six qubits, different circuit depths L ∈{1, 2, 3}, and different temperatures T ∈ {2Tc, Tc, Tc/1.5}. In these simulations, an all-to-all topology was used for the entanglinggate layer (l = 2). We report the bootstrapped median and 90% confidence interval. For the low temperature case shown inpanel (c), the inverse Bethe approximation converged only in 7 out of 25 instances. Hence, no median value was extracted.We plotted a KL divergence of 2.0 as a reference, but it is clear that for all those instances DDQCL outperformed the inverseBethe approximation.

15

Details for theoretical and experimental results

FIG. 7. Detailed comparison at the level of output states between simulation of circuits and the respective experimentalimplementations in the ion trap quantum computer. These are the best circuits in terms of KL divergence that were obtainedby DDQCL for BAS(2, 2) under three different setting: (a) all-to-all with L = 2 layers, (b) star with L = 4, and (c) star withL = 2. The theoretical state obtained in (a) is close to optimal and attains an entanglement entropy averaged over all two-qubitsubsets of SBAS(2,2) = 1.69989. Circuit diagrams for (a-c) are shown in Figure 8.

(a)

|0〉 Rx(+.00π) Rz(+.00π) XX XX XX

|0〉 Rx(+.00π) Rz(+.50π) +.50π XX XX

|0〉 Rx(+.00π) Rz(+.25π) −απ −απ XX

|0〉 Rx(+.00π) Rz(+.00π) +απ −απ −.50π

(b)

|0〉 Rx(+.15π) Rz(−.01π) XX XX XX Rz(−.63π) Rx(−.42π) Rz(+.26π) XX XX XX

|0〉 Rx(+.30π) Rz(+.31π) +.48π Rz(+.01π) Rx(+.21π) Rz(+.77π) −.01π

|0〉 Rx(+.61π) Rz(+.24π) −.76π Rz(+.95π) Rx(−.62π) Rz(+.99π) +.51π

|0〉 Rx(−.31π) Rz(−.99π) −.13π Rz(+.65π) Rx(−.56π) Rz(+.74π) −.46π

(c)

|0〉 Rx(+.11π) Rz(−.02π) XX XX XX

|0〉 Rx(−.88π) Rz(+.28π) −.48π

|0〉 Rx(+.20π) Rz(−.48π) −.32π

|0〉 Rx(−.23π) Rz(−.15π) −.32π

FIG. 8. Circuit diagrams for the analytical solution and for the best circuits found by DDQCL under three different qubit-to-qubit connectivity topologies. (a) the all-to-all circuit with L = 2 layers can achieve zero KL divergence for BAS(2, 2) by

setting α = π−1 arctan(2−1/2). All single-qubit rotations can be set to zero. DDQCL found an almost optimal solution withα = 0.2 and two non-zero Rz rotations. These Rz gates act as the identity on the |0000〉 state. (b) the star circuit with L = 4.(c) the star circuit with L = 2. The qBAS scores for (a-c) are shown in Figure 2 in the Main Text.

Entanglement entropy of BAS(2,2)

The measure of entanglement entropy used in this work is the average von Neumann entropy over all 2-qubitsubsets [59]. Consider a 4-qubit pure state ρ = |ψ〉〈ψ| and label the four qubits as A, B, C, and D. Then, the entropycan be computed as

Sψ = −1

3

[Tr(ρAB log2 ρAB) + Tr(ρAC log2 ρAC) + Tr(ρAD log2 ρAD)

], (6)

16

where ρXY is the reduced density matrix for the subset XY . As an example, the 4-qubit cat state has an entanglemententropy of SGHZ = 1. Now consider a pure state encoding the uniform probability distribution over the BAS(2, 2)data set in the computational basis

|BAS(2, 2)〉 =1√6

(eiu1 |0000〉+ eiu2 |0011〉+ eiu3 |0101〉+ eiu4 |1010〉+ eiu5 |1100〉+ |1111〉

). (7)

A direct computation shows that the entropy of this state is

SBAS(2,2) = − 19

[2

ln(2)

√cos(u2−u3−u4+u5)+1

2 tanh−1(√

cos(u2−u3−u4+u5)+12

)+(cos(u1−u3−u4

2

)+ 1)

log2

(23 cos2

(u1−u3−u4

4

))+(cos(u1−u2−u5

2

)+ 1)

log2

(23 cos2

(u1−u2−u5

4

))+ log2

(4 + 2

√2√

cos (u2 − u3 − u4 + u5) + 1)

+ log2

(4− 2

√2√

cos (u2 − u3 − u4 + u5) + 1)

−(cos(u1−u3−u4

2

)− 1)

log2

(23 sin2

(u1−u3−u4

4

))−(cos(u1−u2−u5

2

)− 1)

log2

(23 sin2

(u1−u2−u5

4

))− log2(31104)

].

(8)

Defining new variables v1 = u2 − u3 − u4 + u5 and v2 = u1 − u3 − u4, the expression above reduces to

SBAS(2,2) = − 19

[2

ln(2)

√cos2

(v12

)tanh−1

(√cos2

(v12

))+ 2 cos2

(v24

)log2

(23 cos2

(v24

))+ 2 cos2

(v2−v1

4

)log2

(23 cos2

(v2−v1

4

))+ log2

(4 + 2

√2√

cos (v1) + 1)

+ log2

(4− 2

√2√

cos (v1) + 1)

+ 2 sin2(v24

)log2

(23 sin2

(v24

))+ 2 sin2

(v2−v1

4

)log2

(23 sin2

(v2−v1

4

))− log2(31104)

].

(9)

In Figure 9 we graphically show the entropy SBAS(2,2) as a function of the new variables v1 and v2. Such a functionhas extrema

minSBAS(2,2) =1

3log2

(27

2

)≈ 1.25163,

maxSBAS(2,2) =1

2log2(12) ≈ 1.79248.

(10)

For the minimum value, v1 = v2 = 0, which can be obtained setting u1 = · · · = u5 = 0. For the maximum value,v1 = 4π/3 and v2 = 2π/3, which can be obtained setting u1 = u2 = u3 = 0 and u4 = −u5 = 2π/3. Interestingly, themaximum of SBAS(2,2) happens to coincide with the maximum entanglement entropy known for any 4-qubit state [59].

FIG. 9. Entanglement entropy SBAS(2,2) as a function of variables v1 and v2. Points in the domain represent states that encodethe BAS(2, 2) data set in the computational basis. The maximum entropy (black dots) attained is 1.79248 which coincides withthe maximum entanglement entropy known for any 4-qubit state [59].

Documents

circuits - arXivA generative modeling approach for benchmarking and training shallow quantum circuits Marcello Benedetti,1,2 Del na Garcia-Pintos,3 Oscar Perdomo,3,4,5 Vicente Leyton-Ortega,3,4