Representation Learning via Quantum Neural Tangent Kernels

Junyu Liu,1, 2, 3, ∗ Francesco Tacchino,4, † Jennifer R. Glick,5, ‡ Liang Jiang,1, 2, § and Antonio Mezzacapo5, ¶

1 Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL 60637, USA
2 Chicago Quantum Exchange, Chicago, IL 60637, USA
3 Kadanoff Center for Theoretical Physics, The University of Chicago, Chicago, IL 60637, USA
4 IBM Quantum, IBM Research – Zurich, 8803 Rüschlikon, Switzerland
5 IBM Quantum, IBM T. J. Watson Research Center, Yorktown Heights, New York 10598, USA
(Dated: November 10, 2021)
arXiv:2111.04225v2 [quant-ph] 9 Nov 2021

∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]
¶ [email protected]

Variational quantum circuits are used in quantum machine learning and variational quantum simulation tasks. Designing good variational circuits or predicting how well they perform for given learning or optimization tasks is still unclear. Here we discuss these problems, analyzing variational quantum circuits using the theory of neural tangent kernels. We define quantum neural tangent kernels, and derive dynamical equations for their associated loss function in optimization and learning tasks. We analytically solve the dynamics in the frozen limit, or lazy training regime, where variational angles change slowly and a linear perturbation is good enough. We extend the analysis to a dynamical setting, including quadratic corrections in the variational angles. We then consider a hybrid quantum-classical architecture and define a large-width limit for hybrid kernels, showing that a hybrid quantum-classical neural network can be approximately Gaussian. The results presented here show limits for which analytical understandings of the training dynamics for variational quantum circuits, used for quantum machine learning and optimization problems, are possible. These analytical results are supported by numerical simulations of quantum machine learning experiments.

I. INTRODUCTION

The idea of using quantum computers for machine learning has recently received attention both in academia and industry [1–13]. While proof-of-principle studies have shown that quantum computers are useful for some problems of mathematical interest [13], quantum advantage in machine learning algorithms for practical applications is still unclear [14]. On classical architectures, a first-principles theory of machine learning, especially the so-called deep learning that uses a large number of layers, is still in development. Early developments of statistical learning theory provide rigorous guarantees on the learning capability of generic learning algorithms, but theoretical bounds obtained from information theory are sometimes weak in practical settings.

The theory of the neural tangent kernel (NTK) has been deemed an important tool for understanding deep neural networks [15–21]. In the large-width limit, a generic neural network becomes nearly Gaussian when averaging over the initial weights and biases, and the learning capabilities become predictable. The NTK theory allows one to derive an analytical understanding of neural network dynamics, improving on statistical learning theory and shedding light on the underlying principles of deep learning [22–26]. In the quantum machine learning community, a similar first-principles theory would help in understanding the training dynamics and selecting appropriate variational quantum circuits to target specific problems.

A step in this direction has been considered recently for quantum-classical neural networks [27]. However, in the framework considered there, no variational parameters were included in the quantum circuits, leaving open the problem of understanding and designing the quantum dynamical training.

In this paper, we address this problem, focusing on the limit where the learning rate is sufficiently small, inspired by the classical theory of the NTK. We first define a quantum analogue of a classical NTK. In the limit where the variational angles do not change much, the so-called lazy training [28], the frozen QNTK leads to an exponential decay of the loss function on the training set. We furthermore compute the leading-order perturbation above the static limit, where we define a quantum version of the classical meta-kernel. We derive closed-form formulas for the dynamics of the training in terms of parameters of variational quantum circuits (see Fig. 1).

We then move to a hybrid quantum-classical neural network framework, and find that it becomes approximately Gaussian, as long as the quantum outputs are sufficiently orthogonal. We present an analytic derivation of the large-width limit, where the non-Gaussian contribution to the neuron correlations is suppressed by the large width. Interestingly, we observe that now the width is defined by the number of independent Hermitian operators in the variational ansatz, which is upper-bounded by (a polynomial of) the dimension of the Hilbert space. Thus, a large Hilbert space will naturally bring our neural network to the large-width limit. Moreover, the orthogonality assumption on the variational ansatz could be achieved statistically using randomized assumptions. If not, the hybrid quantum-classical neural networks could still learn features even at large width, indicating a significant difference compared to classical neural networks.


[Figure 1: schematic plots of the loss versus gradient-descent iterations, with panels (a) Quantum NTK, (b) Frozen NTK, and (c) Representation Learning.]

FIG. 1. A cartoon illustration of the QNTK theory. (a): the QNTK characterizes the gradient descent dynamics in the variational quantum circuit. The quantum state changes according to the QNTK prediction. (b): Around the end of the training, the QNTK is frozen and almost constant. (c): The gradient descent dynamics could be highly non-linear, and the QNTK itself runs (changes) during gradient descent, which is a property of representation learning.

We test the analytical derivations of our theory by comparing them against numerical experiments with the IBM quantum device simulator [29], on a classification problem in the supervised learning setting, finding good agreement with the theory. The structure of this paper and the ideas presented are summarized in Fig. 2.

[Figure 2: flowchart of the paper, relating the classical NTK theory to the QNTK theory. General QNTK (general nonlinear dynamics): Sec. II.A, SM II.A for optimization and Sec. III.A, SM III/IV.A for learning. Frozen QNTK limit: Sec. II.B, SM II.B and Sec. III.B, SM III/IV.B. dQNTK (quantum representation learning): Sec. II.C, SM II.C and Sec. III.C, SM III/IV.C. Hybrid quantum-classical neural networks, the large-width limit, and the barren plateau problem: Sec. IV and SM V, VI. Qiskit simulations: Sec. V, SM VII.]

FIG. 2. Structure of our paper. In Section II we establish the theory of the QNTK in the context of optimization without data for a generic variational quantum ansatz, which is the typical task in quantum simulation. In Section III, we establish the theory of quantum machine learning with the help of the QNTK. In Section IV, we define the hybrid quantum-classical neural network model, and we prove that in the large-width limit the model is approximated by a Gaussian process. In Section V, we give numerical examples to demonstrate our quantum representation theory. In Section VI, we discuss the implications of this work and outline open problems for future work. More technical details are given in the Supplementary Material (SM).

II. THEORY OF QUANTUM OPTIMIZATION

A. QNTK for optimization

We start from a relatively simple example about the optimization of a quantum cost function, without a model to be learned from data associated with it. Let a variational quantum wavefunction [30–35] be given as

$$|\phi(\theta)\rangle = U(\theta)\,|\Psi_0\rangle = \left( \prod_{\ell=1}^{L} W_\ell \exp\left(i \theta_\ell X_\ell\right) \right) |\Psi_0\rangle. \tag{1}$$

Here we have defined L unitary operators of the type U_ℓ(θ_ℓ) = exp(iθ_ℓ X_ℓ), each with a variational parameter θ_ℓ and an associated Hermitian operator X_ℓ. We denote the collection of all variational parameters as θ = {θ_ℓ} and the initial state as |Ψ0⟩. Moreover, our ansatz also includes constant gates W_ℓ that do not depend on the variational angles.
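As a concrete illustration, the following minimal Python sketch builds a state of the form of Eq. (1) for a two-qubit toy example. The specific fixed gates W_ℓ, generators X_ℓ, and initial state below are illustrative assumptions, not the circuits studied in the paper.

```python
# A minimal numerical sketch of the layered ansatz in Eq. (1),
# |phi(theta)> = prod_l W_l exp(i theta_l X_l) |Psi_0>, for a two-qubit toy
# example. The specific W_l, X_l, and |Psi_0> are illustrative choices only.
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
H = (X + Z) / np.sqrt(2)                                   # Hadamard
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def kron(*ops):
    """Kronecker product of a list of operators."""
    out = np.eye(1, dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out

# Hermitian generators X_l and fixed gates W_l (illustrative choices)
generators = [kron(X, I2), kron(I2, X), kron(Z, Z)]
fixed_gates = [kron(H, H), CNOT, CNOT]

def ansatz_state(theta, psi0):
    """Apply prod_l W_l exp(i theta_l X_l) to |Psi_0>, as in Eq. (1)."""
    psi = psi0
    for W_l, X_l, th in zip(fixed_gates, generators, theta):
        psi = W_l @ expm(1j * th * X_l) @ psi
    return psi

psi0 = np.zeros(4, dtype=complex)
psi0[0] = 1.0                                              # |00>
theta = 0.1 * np.random.default_rng(0).standard_normal(len(generators))
phi = ansatz_state(theta, psi0)
print("norm of |phi(theta)>:", np.linalg.norm(phi))        # should be 1
```

The helpers kron, generators, fixed_gates, ansatz_state, psi0, and theta defined here are reused in the later sketches.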

We introduce the following mean squared error (MSE) loss function when we wish to optimize the expectation value of a Hermitian operator O towards its minimal eigenvalue O_0, which is assumed to be known here, over the class of states |φ(θ)⟩:

$$\mathcal{L}(\theta) = \frac{1}{2}\left( \langle \Psi_0 | U^\dagger(\theta)\, O\, U(\theta) | \Psi_0 \rangle - O_0 \right)^2 \equiv \frac{1}{2}\,\varepsilon^2. \tag{2}$$

Here we have defined the residual optimization error ε ≡ ⟨Ψ0|U†(θ) O U(θ)|Ψ0⟩ − O_0. When using gradient

descent to optimize Eq. (2), the difference equation for the dynamics of the training parameters is given by

$$d\theta_\ell = -\eta\, \frac{d\mathcal{L}(\theta)}{d\theta_\ell} = -\eta\, \varepsilon\, \frac{d\varepsilon}{d\theta_\ell}. \tag{3}$$

We use the notation do to denote the difference between step t + 1 and step t during gradient descent for a quantity o, do = o(t + 1) − o(t), associated with a learning rate η. Then we also have, to linear order in θ,

$$d\varepsilon = \sum_\ell \frac{d\varepsilon}{d\theta_\ell}\, d\theta_\ell = -\eta \sum_\ell \frac{d\varepsilon}{d\theta_\ell} \frac{d\varepsilon}{d\theta_\ell}\, \varepsilon. \tag{4}$$

The object Σ_ℓ (dε/dθ_ℓ)(dε/dθ_ℓ) serves to construct a toy version of the NTK in the quantum setup, in the sense that it can be seen as a 1-dimensional kernel matrix with training data O_0. We can make our definition of a QNTK associated to an optimization problem more precise as follows:

Definition 1 (QNTK for optimization). The quantum neural tangent kernel (QNTK) associated to the optimization problem of Eq. (2) is given by

$$K = \sum_\ell \frac{d\varepsilon}{d\theta_\ell}\frac{d\varepsilon}{d\theta_\ell} = -\sum_\ell \left\langle \Psi_0 \left| U_{+,\ell}^\dagger \left[ X_\ell,\; U_\ell^\dagger W_\ell^\dagger U_{-,\ell}^\dagger\, O\, U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \Psi_0 \right\rangle^2, \tag{5}$$

where

$$U_{-,\ell} \equiv \prod_{\ell'=1}^{\ell-1} W_{\ell'} U_{\ell'}, \qquad U_{+,\ell} \equiv \prod_{\ell'=\ell+1}^{L} W_{\ell'} U_{\ell'}. \tag{6}$$

It is easy to show that the quantity squared in Eq. (5) is imaginary, hence K is always non-negative, K ≥ 0. A derivation of Eq. (5) can be found in the SM.
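The QNTK of Definition 1 can also be estimated numerically without the analytic commutator form. The sketch below, reusing kron, ansatz_state, psi0, and theta from the sketch under Eq. (1), evaluates the residual error of Eq. (2) and K = Σ_ℓ (dε/dθ_ℓ)² by central finite differences; the observable O is again an illustrative choice.

```python
# Sketch: the residual error eps of Eq. (2) and the QNTK
# K = sum_l (d eps / d theta_l)^2 of Definition 1, evaluated for the toy
# ansatz above (reuses kron, ansatz_state, psi0, theta from that sketch).
# Central finite differences replace the analytic commutator form of Eq. (5).
import numpy as np

O_obs = kron(Z, Z)                                # illustrative observable O
O_0 = np.linalg.eigvalsh(O_obs).min()             # its minimal eigenvalue O_0

def residual(theta):
    """eps = <Psi_0| U^dag(theta) O U(theta) |Psi_0> - O_0."""
    phi = ansatz_state(theta, psi0)
    return float(np.real(phi.conj() @ O_obs @ phi) - O_0)

def qntk(theta, h=1e-5):
    """Return (K, grad) with grad_l ~ d eps / d theta_l by central differences."""
    grad = np.zeros(len(theta))
    for l in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[l] += h
        tm[l] -= h
        grad[l] = (residual(tp) - residual(tm)) / (2 * h)
    return float(np.sum(grad ** 2)), grad          # K is a sum of squares, so K >= 0

K, grad = qntk(theta)
print("eps =", residual(theta), "  QNTK K =", K)
```

On hardware, the derivatives dε/dθ_ℓ could instead be obtained with parameter-shift rules; finite differences are used here only to keep the sketch short.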

B. Frozen QNTK limit for optimization

An analytic theory of the NTK is established when the learning rate is sufficiently small. It is defined by solving the coupled difference equations Eqs. (3) and (4), which we report here:

$$d\theta_\ell = -\eta\,\varepsilon\,\frac{d\varepsilon}{d\theta_\ell}, \qquad d\varepsilon = -\eta \sum_\ell \frac{d\varepsilon}{d\theta_\ell}\frac{d\varepsilon}{d\theta_\ell}\,\varepsilon = -\eta K \varepsilon. \tag{7}$$

In the continuum learning rate limit η → 0, Eqs. (7) become coupled non-linear ordinary differential equations, which are hard to solve in general. Note that this system of equations stems from a quantum optimization problem and in general it is classically hard to even instantiate.

Nevertheless, in the following we build an analytic model for a quantum version of the frozen NTK (frozen QNTK) in the regime of lazy training, where the variational angles do not change too much. To be more precise, we assume that at a certain value θ∗ our variational angles θ change by a small amount, θ∗ + δϕ. A typical scenario is to perform the Taylor expansion around such values θ∗ during the convergence regime, for instance. Here δ is a small scaling parameter. We will call the limit δ → 0+ the frozen QNTK limit.

In this limit, one can write W_ℓ U_ℓ = W_ℓ exp(iθ∗_ℓ X_ℓ) exp(iδϕ_ℓ X_ℓ), so that the θ∗ dependence is absorbed into the non-variational part of the unitary by defining W_ℓ(θ∗_ℓ) ≡ W_ℓ exp(iθ∗_ℓ X_ℓ), and we have W_ℓ U_ℓ → W_ℓ(θ∗_ℓ) exp(iδϕ_ℓ X_ℓ). In what follows, we drop the θ∗ notation and understand the variational angles as small parameters that change by δ around a value θ∗. Then, expanding linearly for small δ we can define

Definition 2 (Frozen QNTK for quantum optimization). In the optimization problem Eq. (2) the frozen QNTK limit is

$$K = -\delta^2 \sum_\ell \left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\; W_\ell^\dagger W_{-,\ell}^\dagger\, O\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle^2, \tag{8}$$

with

$$W_{-,\ell} \equiv \prod_{\ell'=1}^{\ell-1} W_{\ell'}, \qquad W_{+,\ell} \equiv \prod_{\ell'=\ell+1}^{L} W_{\ell'}. \tag{9}$$

In the frozen kernel limit, we can state the following result about the dependency of the residual error ε, solving Eq. (7) linearly for small δ.

Theorem 1 (Performance guarantee of optimization within the frozen QNTK approximation). When using standard gradient descent for the optimization problem Eq. (2) within the frozen QNTK limit, the residual optimization error ε decays exponentially as

$$\varepsilon(t) = (1-\eta K)^t\,\varepsilon(0) = \varepsilon(0) \times \left( 1 + \eta\delta^2 \sum_\ell \left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\; W_\ell^\dagger W_{-,\ell}^\dagger\, O\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle^2 \right)^t, \tag{10}$$

with a convergence rate defined as

$$\tau_c = -\log(1-\eta K) \approx \eta K = \eta\delta^2 \sum_\ell \left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\; W_\ell^\dagger W_{-,\ell}^\dagger\, O\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle^2 \leq 2\eta\delta^2 L\, \|O\|^2 \max_\ell \|X_\ell\|^2, \tag{11}$$

with the L2 norm.

The derivation is given in the SM. An immediate consequence is that the residual error will converge to zero,

$$\varepsilon(\infty) = 0. \tag{12}$$
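The exponential decay of Theorem 1 is easy to check numerically in the lazy regime. The sketch below, reusing the residual and qntk helpers from the previous sketch, runs plain gradient descent from small initial angles and compares ε(t) with the frozen-QNTK prediction (1 − ηK)^t ε(0); the learning rate and step count are illustrative choices, and agreement is only expected while the angles stay close to their initial values.

```python
# Sketch: gradient descent in the lazy regime compared with the frozen-QNTK
# prediction eps(t) = (1 - eta K)^t eps(0) of Theorem 1. Reuses residual and
# qntk from the previous sketch.
import numpy as np

eta = 0.05
theta_t = theta.copy()
eps0 = residual(theta_t)
K0, _ = qntk(theta_t)                          # frozen QNTK, evaluated once at t = 0

for t in range(1, 31):
    _, grad = qntk(theta_t)
    theta_t = theta_t - eta * residual(theta_t) * grad   # d theta_l = -eta eps d eps/d theta_l
    if t % 10 == 0:
        predicted = (1 - eta * K0) ** t * eps0
        print(f"t={t:3d}  eps={residual(theta_t):+.4e}  frozen prediction={predicted:+.4e}")
```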

C. dQNTK

The frozen QNTK limit describes the regime of the linear approximation of non-linearities. Therefore, the frozen QNTK cannot reflect the non-linear nature of variational quantum algorithms. In order to formulate an analytical model of the non-linearities, we now analyze the leading order correction in terms of the expansion in the learning rate η and the size of the variational angle δ. We formulate the expansion of dε to second order in dϕ,

$$d\varepsilon = \sum_\ell \frac{d\varepsilon}{d\varphi_\ell}\, d\varphi_\ell + \frac{1}{2}\sum_{\ell_1,\ell_2} \frac{d^2\varepsilon}{d\varphi_{\ell_1} d\varphi_{\ell_2}}\, d\varphi_{\ell_1}\, d\varphi_{\ell_2}. \tag{13}$$

This time dε during gradient descent will follow the equation

$$d\varepsilon = -\eta\,\varepsilon \sum_\ell \frac{d\varepsilon}{d\varphi_\ell}\frac{d\varepsilon}{d\varphi_\ell} + \frac{1}{2}\,\eta^2\varepsilon^2 \sum_{\ell_1,\ell_2} \frac{d^2\varepsilon}{d\varphi_{\ell_1} d\varphi_{\ell_2}}\,\frac{d\varepsilon}{d\varphi_{\ell_1}}\frac{d\varepsilon}{d\varphi_{\ell_2}}. \tag{14}$$


With this expansion at second order, we have two contributing terms in Eq. (13). We label the first term of Eq. (13) the quantum effective kernel, K_E. We use K_E to distinguish it from K, where only a first-order expansion is considered in the description of the dynamics. It is dynamical in the sense that it depends on the value of the training parameter ϕ during the dynamics regulated by gradient descent. We label the variable part of the second term in Eq. (14) the quantum meta-kernel or dQNTK (differential of the QNTK).

Definition 3 (Quantum meta-kernel for optimization). The quantum meta-kernel associated with the optimization problem in Eq. (2) is defined via

$$\mu = \sum_{\ell_1,\ell_2} \frac{d^2\varepsilon}{d\varphi_{\ell_1} d\varphi_{\ell_2}}\,\frac{d\varepsilon}{d\varphi_{\ell_1}}\frac{d\varepsilon}{d\varphi_{\ell_2}}. \tag{15}$$

In the limit of small changes in θ = θ∗ + δϕ for the optimization problem Eq. (2), the quantum meta-kernel is given at leading order in perturbation theory in δ as

$$\begin{aligned}
\mu = \delta^4 \sum_{\ell_1,\ell_2} & \left\langle \Psi_0 \left| W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{-,\ell_1}^\dagger O W_{-,\ell_1} W_{\ell_1} \right] W_{+,\ell_1} \right| \Psi_0 \right\rangle
\left\langle \Psi_0 \left| W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{-,\ell_2}^\dagger O W_{-,\ell_2} W_{\ell_2} \right] W_{+,\ell_2} \right| \Psi_0 \right\rangle \\
& \times \begin{cases}
\left\langle \Psi_0 \left| W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{\ell_1,\ell_2}^\dagger W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{-,\ell_2}^\dagger O W_{-,\ell_2} W_{\ell_2} \right] W_{\ell_2,\ell_1} W_{\ell_1} \right] W_{+,\ell_1} \right| \Psi_0 \right\rangle & : \ell_1 \geq \ell_2 \\
\left\langle \Psi_0 \left| W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{\ell_2,\ell_1}^\dagger W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{-,\ell_1}^\dagger O W_{-,\ell_1} W_{\ell_1} \right] W_{\ell_1,\ell_2} W_{\ell_2} \right] W_{+,\ell_2} \right| \Psi_0 \right\rangle & : \ell_1 < \ell_2
\end{cases}
\end{aligned} \tag{16}$$

The residual error ε in the optimization problem of Eq. (2) can then be computed as

$$\begin{aligned}
\varepsilon = {}& \left\langle \Psi_0 \left| \left( \prod_{\ell'=L}^{1} W_{\ell'}^\dagger \right) O \left( \prod_{\ell=1}^{L} W_\ell \right) \right| \Psi_0 \right\rangle - O_0
- i\delta \sum_\ell \varphi_\ell \left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger O W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle \\
& - \frac{\delta^2}{2} \sum_{\ell_1,\ell_2} \varphi_{\ell_1}\varphi_{\ell_2} \times \begin{cases}
\left\langle \Psi_0 \left| W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{\ell_1,\ell_2}^\dagger W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{-,\ell_2}^\dagger O W_{-,\ell_2} W_{\ell_2} \right] W_{\ell_2,\ell_1} W_{\ell_1} \right] W_{+,\ell_1} \right| \Psi_0 \right\rangle & : \ell_1 \geq \ell_2 \\
\left\langle \Psi_0 \left| W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{\ell_2,\ell_1}^\dagger W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{-,\ell_1}^\dagger O W_{-,\ell_1} W_{\ell_1} \right] W_{\ell_1,\ell_2} W_{\ell_2} \right] W_{+,\ell_2} \right| \Psi_0 \right\rangle & : \ell_1 < \ell_2
\end{cases}
\end{aligned} \tag{17}$$

We are now ready to make a statement about the residual error in the limit of the dQNTK.

Theorem 2 (Performance guarantee of optimization from the dQNTK). In the optimization problem Eq. (2) at the dQNTK order, we split the residual optimization error into two pieces, the free part and the interacting part,

$$\varepsilon = \varepsilon_F + \varepsilon_I. \tag{18}$$

The free part follows the exponentially decaying dynamics

$$\varepsilon_F(t) = (1-\eta K)^t\,\varepsilon(0), \tag{19}$$

and the interacting part is given by

$$\varepsilon_I(t) = -\eta\, t\, (1-\eta K)^{t-1}\, K_\Delta\, \varepsilon(0). \tag{20}$$

Here we have

$$\begin{aligned}
K_\Delta \equiv {}& K_E(0) - K = \left( \sum_\ell \frac{d\varepsilon}{d\theta_\ell}\frac{d\varepsilon}{d\theta_\ell} \right)(0) - \sum_\ell \frac{d\varepsilon_F}{d\theta_\ell}\frac{d\varepsilon_F}{d\theta_\ell} \\
= {}& 2i\delta^3 \sum_\ell \left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger O W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle \\
& \times \sum_{\ell'} \begin{cases}
\left\langle \Psi_0 \left| W_{+,\ell'}^\dagger \left[ X_{\ell'},\, W_{\ell'}^\dagger W_{\ell',\ell}^\dagger W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger O W_{-,\ell} W_\ell \right] W_{\ell,\ell'} W_{\ell'} \right] W_{+,\ell'} \right| \Psi_0 \right\rangle & : \ell' \geq \ell \\
\left\langle \Psi_0 \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{\ell,\ell'}^\dagger W_{+,\ell'}^\dagger \left[ X_{\ell'},\, W_{\ell'}^\dagger W_{-,\ell'}^\dagger O W_{-,\ell'} W_{\ell'} \right] W_{\ell',\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle & : \ell' < \ell
\end{cases} \;\varphi_{\ell'}(0).
\end{aligned} \tag{21}$$

Thus, the residual optimization error ε will always finally approach zero,

$$\varepsilon(\infty) = 0. \tag{22}$$

Thus, the leading-order perturbative correction gives a contribution of order O(δ³).
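The second-order objects entering the dQNTK analysis can also be estimated numerically. The sketch below, reusing residual and qntk from the earlier sketches, approximates the Hessian d²ε/dθ_ℓ1 dθ_ℓ2 by central finite differences and contracts it with the gradients as in Definition 3 (Eq. 15); it is a numerical stand-in for the analytic expression of Eq. (16), applied to the same illustrative toy problem.

```python
# Sketch: second-order (dQNTK) quantities for the toy problem above. The
# Hessian d^2 eps / d theta_l1 d theta_l2 is estimated by central finite
# differences and contracted with the gradients as in Definition 3 (Eq. 15).
# Reuses residual and qntk from the earlier sketches.
import numpy as np

def residual_hessian(theta, h=1e-4):
    n = len(theta)
    hess = np.zeros((n, n))
    for l1 in range(n):
        for l2 in range(n):
            def shifted(s1, s2):
                tt = theta.copy()
                tt[l1] += s1 * h
                tt[l2] += s2 * h
                return residual(tt)
            hess[l1, l2] = (shifted(+1, +1) - shifted(+1, -1)
                            - shifted(-1, +1) + shifted(-1, -1)) / (4 * h ** 2)
    return hess

_, grad = qntk(theta)
hess = residual_hessian(theta)
mu = grad @ hess @ grad                      # quantum meta-kernel mu, Eq. (15)
print("meta-kernel mu =", mu)
```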

III. THEORY OF LEARNING

A. General theory

The results outlined in Section II can be extended to the context of supervised learning from a data space D. In particular, we are given a training set contained in the data space, A ⊂ D. The data can be loaded into quantum states through a quantum feature map [9, 11]. We define the variational quantum ansatz with a single layer by regarding the output of a quantum neural network as

$$z_{i;\delta} \equiv z_i(\theta, x_\delta) = \langle \phi(x_\delta) |\, U^\dagger O_i U \,| \phi(x_\delta) \rangle. \tag{23}$$

Here, we assume that O_i is taken from O(H), a subset of the space of Hermitian operators of the Hilbert space H, and the index i describes the i-th component of the output, associated to the i-th operator O_i. The above expectation-value model for Hermitian operators is a common definition of a quantum neural network. One could also measure the real and imaginary parts directly to define a complexified version of the quantum neural network, useful in the context of amplitude encoding for the z_{i;δ}, as discussed in the Supplementary Material. We

are now in the position of introducing the loss function

$$\mathcal{L}_{\mathcal{A}}(\theta) = \frac{1}{2}\sum_{\alpha,i} \left( y_{i;\alpha} - z_{i;\alpha} \right)^2 = \frac{1}{2}\sum_{\alpha,i} \varepsilon_{i;\alpha}^2. \tag{24}$$

Here, we call ε_{i;α} the residual training error and we assume that y_{i;α} is associated with the encoded data φ_i(x_α). Now, similarly to what is described in Section II A, we have the gradient descent equation

$$dz_{i;\delta} = -\eta \sum_{\ell, i', \alpha} \varepsilon_{i';\alpha}\, \frac{dz_{i;\delta}}{d\theta_\ell}\, \frac{dz_{i';\alpha}}{d\theta_\ell}, \tag{25}$$

with an associated kernel

$$K^{ii'}_{\delta\alpha} = \sum_\ell \frac{dz_{i;\delta}}{d\theta_\ell}\, \frac{dz_{i';\alpha}}{d\theta_\ell}. \tag{26}$$

To ease the notation, we shall define the joint indices

$$(\delta, i) = a, \qquad (\alpha, i') = b, \tag{27}$$

which run in the spaces D × O(H) and A × O(H) respectively (indices associated with the training set A will be distinguished from general data-space indices where needed), and our gradient descent equations are

$$dz_a = -\eta \sum_b K_{ab}\, \varepsilon_b. \tag{28}$$

It is possible to show that this kernel is always positive semidefinite and Hermitian; see the Supplementary Material for a proof. Now, recalling Eq. (1), we are in the position to give an analytical expression for the QNTK for a supervised learning problem, as follows. Details on the derivation can be found in the Supplementary Material.

Definition 4 (QNTK for quantum machine learning). The QNTK for the quantum learning model Eq. (24) is given by

$$K^{ii'}_{\delta\alpha} = \sum_\ell \frac{dz_{i;\delta}}{d\theta_\ell}\, \frac{dz_{i';\alpha}}{d\theta_\ell} = -\sum_\ell \left\langle \phi(x_\delta) \left| U_{+,\ell}^\dagger \left[ X_\ell,\, U_\ell^\dagger W_\ell^\dagger U_{-,\ell}^\dagger\, O_i\, U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \phi(x_\delta) \right\rangle \times \left\langle \phi(x_\alpha) \left| U_{+,\ell}^\dagger \left[ X_\ell,\, U_\ell^\dagger W_\ell^\dagger U_{-,\ell}^\dagger\, O_{i'}\, U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \phi(x_\alpha) \right\rangle. \tag{29}$$

B. Absence of representation learning in the frozen limit

In the frozen QNTK case, the kernel is static, and the learning algorithm cannot learn features from the data. In the same fashion as Section II B, we take the frozen QNTK limit, where the changes of the variational angles θ are small. Using the previous notation, we can define the QNTK for quantum machine learning in the frozen limit, and a performance guarantee for the error on the loss function in this regime, as follows.

Definition 5 (Frozen QNTK for quantum machine learning). In the quantum learning model Eq. (24) with the frozen QNTK limit,

$$K^{ii'}_{\delta\alpha} = -\delta^2 \sum_\ell \left\langle \phi(x_\delta) \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger\, O_i\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \phi(x_\delta) \right\rangle \times \left\langle \phi(x_\alpha) \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger\, O_{i'}\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \phi(x_\alpha) \right\rangle. \tag{30}$$

Theorem 3 (Performance guarantee of quantum machine learning in the frozen QNTK limit). In the quantum learning model Eq. (24) with the frozen QNTK limit, the residual training error decays exponentially during gradient descent as

$$\varepsilon_{a_1}(t) = \sum_{a_2} U_{a_1 a_2}(t)\, \varepsilon_{a_2}(0), \qquad U_{a_1 a_2}(t) = \left[ (1-\eta K)^t \right]_{a_1 a_2}. \tag{31}$$

The convergence rate is defined as

$$\tau_c = \left\| -\log(1-\eta K) \right\| \approx \eta\, \left\| K^{ii'}_{\delta\alpha} \right\|. \tag{32}$$

Then we obtain, for the quantum learning model Eq. (24) with the frozen QNTK limit, the asymptotic dynamics with the D × O(H) index a:

$$z_a(\infty) = z_a(0) - \sum_{a_1, a_2} \tilde{K}^{a_1 a_2} K_{a a_1}\, \varepsilon_{a_2}(0). \tag{33}$$

Here the tilde denotes the inverse of the kernel restricted to the space A × O(H) (note that in general it is different from the kernel inverse defined for the whole space), defined through

$$\sum_{a_2 \in \mathcal{A}\times\mathcal{O}(\mathcal{H})} \tilde{K}^{a_1 a_2} K_{a_2 a_3} = \delta^{a_1}_{\;a_3}. \tag{34}$$

Specifically, if a indicates a data point in the space A × O(H), we will have ε_a(∞) = 0. Proofs and details of these results are given in the SM. Moreover, the asymptotic value is different from the frozen QNTK case of the optimization problem, because of the difference between the training set A and the whole data space.
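The asymptotic formula of Eqs. (33) and (34) has the structure of kernel regression, which can be illustrated with plain linear algebra. In the sketch below the gradient matrices are random placeholders standing in for the circuit-derived quantities dz_a/dθ_ℓ, so only the algebraic structure (training errors driven to zero, general-point outputs given by a kernel-weighted combination of initial errors) is being demonstrated.

```python
# Sketch of the linear-algebra structure of Eqs. (33)-(34): training residuals
# are driven to zero while a general point follows a kernel-regression-like
# formula. The gradient matrices below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_params = 4, 2, 12
G_train = rng.normal(size=(n_train, n_params))   # gradients on the training set A
G_test = rng.normal(size=(n_test, n_params))     # gradients on general data points

K_train = G_train @ G_train.T                    # kernel restricted to A (invertible here)
K_test = G_test @ G_train.T                      # general-point / training-set block

z0_train = rng.normal(size=n_train)              # initial outputs on A
z0_test = rng.normal(size=n_test)                # initial outputs on general points
y_train = rng.normal(size=n_train)               # training labels
eps0 = z0_train - y_train                        # initial residual errors

correction = np.linalg.solve(K_train, eps0)      # tilde-K applied to eps(0)
z_inf_train = z0_train - K_train @ correction    # Eq. (33) restricted to A
z_inf_test = z0_test - K_test @ correction       # Eq. (33) on general points

print("training residuals at t = infinity:", z_inf_train - y_train)  # ~ 0
print("asymptotic outputs on general points:", z_inf_test)
```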

C. Representation learning in the dynamical setting

In the dynamical case, the kernel changes during the gradient descent optimization, due to non-linearity in the unitary operations. In this case the variational quantum circuits could naturally serve as architectures for representation learning in the classical sense.

We generalize the leading-order perturbation theory of optimization naturally to the learning case, and we state the main theorems here. First, we have

Theorem 4 (Performance guarantee of quantum machine learning in the dQNTK limit). In the quantum learning model Eq. (24) at the dQNTK order, the training error is given by two contributions, a free and an interacting part, as follows:

$$\varepsilon_a(t) = \varepsilon^F_a(t) + \varepsilon^I_a(t), \tag{35}$$

where

$$\varepsilon^F_a(t) = \sum_{a_1} U_{a a_1}(t)\, \varepsilon_{a_1}(0), \qquad U_{a_1 a_2}(t) = \left[ (1-\eta K)^t \right]_{a_1 a_2}, \tag{36}$$

and

$$\varepsilon^I_a(t) = \left( -\eta \sum_{s=0}^{t-1} (1-\eta K)^{t-1-s}\, K_\Delta\, (1-\eta K)^{s}\, \varepsilon(0) \right)_a. \tag{37}$$

Here K is the frozen (linear) part of the QNTK. Using a matrix notation for the compact indices a in the space A × O(H), we have

$$\left\| \varepsilon^I(t) \right\| \leq \eta\, t\, \left\| 1-\eta K \right\|^{t-1} \left\| K_\Delta \right\| \left\| \varepsilon(0) \right\|, \tag{38}$$

where K_Δ is defined as

$$K^{\Delta, ii'}_{\delta\alpha} = i\delta^3 \sum_{\ell,\ell'} \varphi_{\ell'}(0)\, G^{\delta,i}_{\ell',\ell}\, \Theta^{\alpha,i'}_{\ell} + i\delta^3 \sum_{\ell,\ell'} \varphi_{\ell'}(0)\, G^{\alpha,i'}_{\ell',\ell}\, \Theta^{\delta,i}_{\ell}, \tag{39}$$

and

$$G^{\delta,i}_{\ell_1,\ell_2} \equiv G_{\ell_1,\ell_2}(\phi(x_\delta), O_i) = \begin{cases}
\left\langle \phi(x_\delta) \left| W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{\ell_1,\ell_2}^\dagger W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{-,\ell_2}^\dagger O_i W_{-,\ell_2} W_{\ell_2} \right] W_{\ell_2,\ell_1} W_{\ell_1} \right] W_{+,\ell_1} \right| \phi(x_\delta) \right\rangle & : \ell_1 \geq \ell_2 \\
\left\langle \phi(x_\delta) \left| W_{+,\ell_2}^\dagger \left[ X_{\ell_2},\, W_{\ell_2}^\dagger W_{\ell_2,\ell_1}^\dagger W_{+,\ell_1}^\dagger \left[ X_{\ell_1},\, W_{\ell_1}^\dagger W_{-,\ell_1}^\dagger O_i W_{-,\ell_1} W_{\ell_1} \right] W_{\ell_1,\ell_2} W_{\ell_2} \right] W_{+,\ell_2} \right| \phi(x_\delta) \right\rangle & : \ell_1 < \ell_2
\end{cases}$$

$$\Theta^{\delta,i}_{\ell} \equiv \Theta_\ell(\phi(x_\delta), O_i) = \left\langle \phi(x_\delta) \left| W_{+,\ell}^\dagger \left[ X_\ell,\, W_\ell^\dagger W_{-,\ell}^\dagger\, O_i\, W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \phi(x_\delta) \right\rangle. \tag{40}$$

For the quantum learning model Eq. (24) at the dQNTK order, the dynamics given by gradient descent on a general data point is given by

$$\begin{aligned}
z_a(\infty) = {}& z_a(0) - \sum_{a_1,a_2} K_{a a_1}\, \tilde{K}^{a_1 a_2}\, \varepsilon_{a_2}(0) \\
& + \sum_{a_1,a_2,a_3,a_4} \left( \mu_{a_1 a a_2} - \sum_{a_5,a_6} K_{a a_5}\, \tilde{K}^{a_5 a_6}\, \mu_{a_1 a_6 a_2} \right) Z_A^{a_1 a_2 a_3 a_4}\, \varepsilon_{a_3}(0)\, \varepsilon_{a_4}(0) \\
& + \sum_{a_1,a_2,a_3,a_4} \left( \mu_{a a_1 a_2} - \sum_{a_5,a_6} K_{a a_5}\, \tilde{K}^{a_5 a_6}\, \mu_{a_6 a_1 a_2} \right) Z_B^{a_1 a_2 a_3 a_4}\, \varepsilon_{a_3}(0)\, \varepsilon_{a_4}(0),
\end{aligned} \tag{41}$$

where the Z_{A,B} are called the quantum algorithm projectors,

$$Z_A^{a_1 a_2 a_3 a_4} \equiv \tilde{K}^{a_1 a_3} \tilde{K}^{a_2 a_4} - \sum_{a_5} \tilde{K}^{a_2 a_5} X_{\parallel}^{a_1 a_5 a_3 a_4}, \qquad Z_B^{a_1 a_2 a_3 a_4} \equiv \tilde{K}^{a_1 a_3} \tilde{K}^{a_2 a_4} - \sum_{a_5} \tilde{K}^{a_2 a_5} X_{\parallel}^{a_1 a_5 a_3 a_4} + \frac{\eta}{2} X_{\parallel}^{a_1 a_2 a_3 a_4}, \tag{42}$$

and X_∥ is defined as

$$X_{\parallel}^{a_1 a_2 a_3 a_4} = \sum_{s=0}^{\infty} \left[ (1-\eta K)^s \right]^{a_1 a_3} \left[ (1-\eta K)^s \right]^{a_2 a_4}, \tag{43}$$

or, equivalently,

$$\delta^{a_1}_{\;a_5}\, \delta^{a_2}_{\;a_6} = \sum_{a_3,a_4} X_{\parallel}^{a_1 a_2 a_3 a_4} \left( K_{a_3 a_5}\, \delta_{a_4 a_6} + \delta_{a_3 a_5}\, K_{a_4 a_6} - \eta\, K_{a_3 a_5} K_{a_4 a_6} \right). \tag{44}$$

Finally, μ is the quantum meta-kernel in the quantum machine learning context,

$$\mu^{i_0 i_1 i_2}_{\delta_0 \delta_1 \delta_2} = \mu_{a_0 a_1 a_2} = \sum_{\ell_1,\ell_2} \frac{d^2 z_{i_0;\delta_0}}{d\varphi_{\ell_1} d\varphi_{\ell_2}} \left( \frac{dz_{i_1;\delta_1}}{d\varphi_{\ell_1}}\, \frac{dz_{i_2;\delta_2}}{d\varphi_{\ell_2}} \right) \Bigg|_{\varphi=0} = \delta^4 \sum_{\ell_1,\ell_2} \Theta^{\delta_1,i_1}_{\ell_1}\, \Theta^{\delta_2,i_2}_{\ell_2}\, G^{\delta_0,i_0}_{\ell_1,\ell_2}. \tag{45}$$

Specifically, if we assume that a is from A × O(H), we will get ε_a(∞) = 0. More details on μ are given in the SM. The existence of the quantum algorithm projectors shows the quantum-algorithm dependence of the variational quantum circuits, which indicates powerful representation learning potential because of the non-linearity.

IV. HYBRID QUANTUM-CLASSICAL NETWORK AND THE LARGE-WIDTH LIMIT

In this section we define a setting in which one can speak of a quantum analog of the large-width limit for NTKs. In such a limit, we expect that the dynamics linearizes during the whole training process, similar to what happens in the frozen regime of lazy training, and the correlation function of the output neurons becomes Gaussian. The classical NTK theory requires a random initialization of weights and biases and takes the large-width limit of neural network architectures. In the quantum setup, the random initialization is a random choice of trainable ansatze.

To see this more clearly, we consider a hybrid quantum-classical neural network model [36, 37]. Starting from a quantum neural network, we measure the output neurons of the quantum architecture and dress them with a single-layer classical neural network. The output of the classical neural network can then be re-encoded into a quantum register via another quantum feature map. A single quantum-to-classical step can be called one hybrid

layer, and one can then construct multiple hybrid layers connected by feature map encoding; see Fig. 3 for an illustration.

FIG. 3. The hybrid quantum-classical neural network considered here. We repetitively apply quantum and classical neural networks in our architecture, with feature map encoding |ψ⟩ and quantum measurements, mapping the data point x_α to the prediction z_α.

For the quantum part of the circuit, we use the same structure of quantum neural networks with Hermitian operator expectation values. Mathematically, the model is defined as

$$z^Q_{1;\alpha;j_1} = \left\langle \phi^1(x_\alpha) \left| U^{\dagger,1}(\theta^1)\, O^1_{j_1}\, U^1(\theta^1) \right| \phi^1(x_\alpha) \right\rangle, \tag{46}$$

$$z^Q_{\omega;\alpha;j_\omega} = \left\langle \phi^\omega(w_{\omega-1;\alpha}) \left| U^{\dagger,\omega}(\theta^\omega)\, O^\omega_{j_\omega}\, U^\omega(\theta^\omega) \right| \phi^\omega(w_{\omega-1;\alpha}) \right\rangle, \tag{47}$$

$$w_{\omega;\alpha;j^C_\omega} = \sigma^\omega_{j^C_\omega}\!\left( \sum_{j_\omega=1}^{\dim \mathcal{O}^\omega(\mathcal{H}^\omega)} W^\omega_{j^C_\omega, j_\omega}\, z^Q_{\omega;\alpha;j_\omega} + b^\omega_{j^C_\omega} \right) \equiv \sigma^\omega_{j^C_\omega}\!\left( z^C_{\omega;\alpha;j^C_\omega} \right). \tag{48}$$

Here, Eq. (46) initializes the quantum neural network, mapping the data x_α to components j_1, labeling the index in the space of Hermitian operators we use, O¹(H¹). The variational ansatz is similar to what we have discussed before, but it might be different in different layers. We use the label ω to denote the order of the hybrid layers, ranging from 1 to the total number of hybrid layers. We introduce the quantum ansatz U^ω(θ^ω) = ∏_{ℓ_ω=1}^{L_ω} W^ω_{ℓ_ω} exp(iθ^ω_{ℓ_ω} X^ω_{ℓ_ω}), the feature map φ^ω, and the operator space O^ω(H^ω) index j_ω. Eq. (47) introduces the recursive encoding from the classical neural network data w_{ω−1;α} = (w_{ω−1;α})_{j^C_{ω−1}} to the space O^ω(H^ω), where the classical data vector w_{ω;α;j^C_ω} is obtained through a single-layer classical neural network with the non-linear activation σ^ω_{j^C_ω}, weight matrix W^ω_{j^C_ω, j_ω}, and bias vector b^ω_{j^C_ω} with the classical index j^C_ω, and the preactivation z^C_{ω;α;j^C_ω}. When we initialize the hybrid network, the classical weights and biases are statistically Gaussian, following the LeCun parametrization [38],

$$\mathbb{E}\!\left( W^\omega_{j^C_{1,\omega}, j_{1,\omega}}\, W^\omega_{j^C_{2,\omega}, j_{2,\omega}} \right) = \delta_{j^C_{1,\omega}, j^C_{2,\omega}}\, \delta_{j_{1,\omega}, j_{2,\omega}}\, \frac{C^\omega_W}{\dim \mathcal{O}^\omega(\mathcal{H}^\omega)}, \qquad \mathbb{E}\!\left( b^\omega_{j^C_{1,\omega}}\, b^\omega_{j^C_{2,\omega}} \right) = \delta_{j^C_{1,\omega}, j^C_{2,\omega}}\, C^\omega_b. \tag{49}$$

Note that in this case, the role of the width in the large-width theory is replaced by the dimension of the operator space, dim(O^ω(H^ω)). The value of the dimension (width) could be arbitrary in principle, but it is upper bounded by the square of the dimension of the Hilbert space, dim(O^ω(H^ω)) ≤ dim(H^ω)², for a qubit system.
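A single hybrid layer of Eqs. (46)–(49) can be mocked up in a few lines. In the sketch below the quantum expectation values z^Q are random placeholders rather than circuit evaluations, and the activation and the constants C_W, C_b are illustrative choices; the point is only the LeCun-style 1/width scaling of the weight variance.

```python
# Sketch of one hybrid layer, Eqs. (46)-(48), with the LeCun-style weight
# statistics of Eq. (49). The quantum outputs z^Q are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
width = 16            # dim O(H): number of measured Hermitian operators
n_neurons = 8         # number of classical neurons j^C
C_W, C_b = 1.0, 0.1

z_q = rng.uniform(-1, 1, size=width)                                # placeholder z^Q
W = rng.normal(0.0, np.sqrt(C_W / width), size=(n_neurons, width))  # variance C_W / width, Eq. (49)
b = rng.normal(0.0, np.sqrt(C_b), size=n_neurons)                   # variance C_b

z_c = W @ z_q + b            # classical preactivations z^C
w_next = np.tanh(z_c)        # activation sigma; re-encoded by the next feature map
print("preactivation std (stays O(1) as width grows):", z_c.std())
```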

We now assume that our quantum training parameters θ^ω are chosen from ensembles (or that the variational ansatze themselves are drawn from some ensembles), similar to the classical assumption. Denoting the expectation value over the quantum ensembles as E, we will show the following statement.

Theorem 5 (Non-Gaussianity from large width). The four-point function of classical preactivations is nearly Gaussian if dim(O^ω(H^ω)) is large,

$$\mathbb{E}_{\mathrm{conn}}\!\left( z^C_{\omega;\alpha_1;j^C_{1,\omega}}\, z^C_{\omega;\alpha_2;j^C_{2,\omega}}\, z^C_{\omega;\alpha_3;j^C_{3,\omega}}\, z^C_{\omega;\alpha_4;j^C_{4,\omega}} \right) = O\!\left( \frac{1}{\dim(\mathcal{O}^\omega(\mathcal{H}^\omega))} \right), \tag{50}$$

as long as

$$\mathbb{E}_{\mathrm{conn}}\!\left( z^Q_{\omega;\alpha_1;j_{1,\omega}}\, z^Q_{\omega;\alpha_2;j_{1,\omega}}\, z^Q_{\omega;\alpha_3;j_{2,\omega}}\, z^Q_{\omega;\alpha_4;j_{2,\omega}} \right) = O(1) \times \delta_{j_{1,\omega}, j_{2,\omega}}, \tag{51}$$

and their permutations, for all ω. Here the notation E_conn means the connected correlators, subtracting the Wick (Gaussian) contractions.
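The 1/width suppression of the connected four-point function can be illustrated with a Monte Carlo toy model. In the sketch below the "quantum outputs" are synthetic random features that are independent across the operator index (mimicking the orthogonality condition Eq. (51)) while remaining O(1)-correlated across data points; the cosine feature model, the widths, and the sample count are all illustrative assumptions.

```python
# Monte Carlo sketch of the width scaling in Theorem 5: the connected
# four-point function of the classical preactivations shrinks roughly like
# 1/width. Larger sample counts sharpen the estimate at large width.
import numpy as np

def connected_four_point(width, n_samples=40000, seed=2):
    rng = np.random.default_rng(seed)
    x = np.array([0.1, 0.5, 0.9, 1.3])              # four fixed data points
    w = rng.normal(size=(n_samples, width, 1))
    p = rng.uniform(0, 2 * np.pi, size=(n_samples, width, 1))
    z_q = np.cos(w * x + p)                          # synthetic z^Q, shape (samples, width, 4)
    W = rng.normal(0, np.sqrt(1.0 / width), size=(n_samples, width))
    z_c = np.einsum('sw,swa->sa', W, z_q)            # one preactivation per data point
    z1, z2, z3, z4 = z_c.T
    mean = lambda v: v.mean()
    full = mean(z1 * z2 * z3 * z4)
    wick = (mean(z1 * z2) * mean(z3 * z4) + mean(z1 * z3) * mean(z2 * z4)
            + mean(z1 * z4) * mean(z2 * z3))
    return full - wick                               # connected part (zero-mean preactivations)

for width in (2, 8, 32):
    print(f"width = {width:3d}   E_conn ~ {connected_four_point(width):+.4e}")
```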

More details are given in the SM. The orthogonality condition Eq. (51) can be naturally achieved by randomized architectures, for instance Haar randomness and k-designs. We interpret the result as follows:

• In this hybrid case, the role of the width in the neural network is upper bounded by the squared dimension of the ω-th Hilbert space. Thus, if we scale up the number of qubits, we are naturally in the large-width limit. However, if our variational ansatz is sparse enough that the operator space dimension dim O(H) is small, then we will have significant finite-width effects.

• The condition Eq. (51) for quantum outputs is naturally satisfied by random architectures. If we assume that our variational ansatz is highly random, we expect Gaussian-process behavior similar to the large-width limit of classical neural networks. However, the same assumption will generically lead to the barren plateau problem [8], where the derivatives of the loss function will move

slowly when we scale up our operator space dimension. Our result shows a possible connection between the large-width limit and the barren plateau problem.

• Moreover, the orthogonality condition in Eq. (51) that we impose does not mean that we have to set the ansatz to be highly random. It could also potentially be satisfied by fixed ansatze, for instance with some error-correction types of orthogonality conditions. If the condition is generally not satisfied, the variational architecture we study could have highly non-Gaussian and representation learning features, although it might be theoretically hard to understand [39].

V. NUMERICAL RESULTS

In this section, we test our QNTK theory in practice, using the Qiskit software library [29] to simulate the implementation of a paradigmatic quantum machine learning task on quantum processors, in both the noiseless and noisy cases. We consider a variational classification problem in supervised learning with three qubits. The data set is generated with the ad_hoc_data functionality provided in qiskit.ml.datasets within the Qiskit Machine Learning module [40]; see Fig. 4 for an illustration.

FIG. 4. A 3D illustration of our data set for training. Here we use three inputs x[0, 1, 2] and label them with 0 or 1, colored red or blue respectively. The data set is generated using ad_hoc_data.

Our numerical experiments are performed first using the noiseless statevector_simulator backend, then including both statistical (nshots = 8192) and simulated hardware noise with the Qiskit qasm_simulator. A simplified model of device noise, featuring qubit relaxation and dephasing, single-qubit and two-qubit gate errors, and readout inaccuracies, is constructed with the NoiseModel.from_backend() Qiskit method and parametrized using calibration data from the ibmq_bogota superconducting processor (accessed on Oct 15, 2021).

We implement supervised learning using a Qiskit Machine Learning NeuralNetworkClassifier with a squared-error loss, obtaining reasonable convergence with gradient descent algorithms (see Fig. 5). The underlying variational quantum classifier is based on the TwoLayerQNN design, with a 3-input ZZFeatureMap and a RealAmplitudes trainable ansatz with 3 repetitions and 12 parameters. Further details on the numerical simulations are given in the SM. Note that we do not demand perfect convergence around the global minimum, since the QNTK theory only cares about the derivatives of the residual learning errors, which are invariant under shifting by a constant or changing the initial condition when solving the training dynamics. In the classical theory of the NTK, in the infinite-width case, for instance, the multilayer perceptron (MLP) model is both over-parametrized and generalizes, and the answer would give the global minimum. Including finite-width corrections, there might be multiple local minima, which is a feature of representation learning. Moreover, we use an error mitigation protocol, applying CompleteMeasFitter from qiskit.ignis.mitigation.measurement to mitigate readout noise.
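For readers who want to reproduce the flavor of this setup, the following is a minimal, simplified sketch of a 3-qubit variational classifier with a ZZFeatureMap and a RealAmplitudes ansatz (3 repetitions, 12 parameters), trained by plain gradient descent on a squared-error loss with statevector simulation. The toy data, the ZZZ readout observable, and the finite-difference gradients are illustrative assumptions; they stand in for the ad_hoc_data set, the TwoLayerQNN/NeuralNetworkClassifier pipeline, and the noisy backends used in the actual experiments.

```python
# Minimal sketch of a 3-qubit variational classifier in the spirit of the
# setup above: ZZFeatureMap + RealAmplitudes (3 reps, 12 parameters),
# statevector simulation, squared-error loss, plain gradient descent.
import numpy as np
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes
from qiskit.quantum_info import Statevector, SparsePauliOp

feature_map = ZZFeatureMap(feature_dimension=3)
ansatz = RealAmplitudes(3, reps=3)            # 12 variational parameters
observable = SparsePauliOp("ZZZ")             # illustrative readout observable

def model_output(theta, x):
    circ = feature_map.assign_parameters(dict(zip(feature_map.parameters, x)))
    circ = circ.compose(ansatz.assign_parameters(dict(zip(ansatz.parameters, theta))))
    return float(np.real(Statevector(circ).expectation_value(observable)))

def loss(theta, xs, ys):
    return 0.5 * sum((y - model_output(theta, x)) ** 2 for x, y in zip(xs, ys))

# Toy labeled data standing in for the ad_hoc_data set
xs = [np.array([0.1, 0.4, 0.8]), np.array([1.1, 0.2, 0.5]), np.array([0.6, 0.9, 0.3])]
ys = [1.0, -1.0, 1.0]

theta = 0.1 * np.random.default_rng(3).standard_normal(ansatz.num_parameters)
eta, h = 0.2, 1e-4
for step in range(50):
    grad = np.zeros_like(theta)
    for l in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[l] += h
        tm[l] -= h
        grad[l] = (loss(tp, xs, ys) - loss(tm, xs, ys)) / (2 * h)
    theta -= eta * grad
print("final loss:", loss(theta, xs, ys))
```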

FIG. 5. The convergence of the objective function during gradient descent. Here we compare the ideal and noisy cases, labeled red and blue respectively.

In Fig. 6, we compute the QNTK eigenvalues for both the noiseless and noisy simulations, comparing them with theoretical predictions. Since we are in the under-parametrized regime, the number of non-zero eigenvalues of the QNTK is the same as the number of variational angles, which is 12 in our experiments. We find agreement between the two at late times, which shows the predictive power of the QNTK theory.

FIG. 6. Kernel eigenvalues during the gradient descent dynamics. Top: noiseless simulation. Bottom: noisy simulation including a model of device errors. The solid curves are the 12 nonzero eigenvalues of the QNTK, while the dashed lines are the theoretical predictions of the frozen QNTK at late times.

VI. CONCLUSIONS AND OPEN PROBLEMS

The results presented here establish a general framework for a QNTK theory, deriving an analytical treatment of the optimization and learning dynamics in that regime. We outline the following open problems for future study.

• Our research gives practical guidance for the design of variational quantum algorithms. One could compute the QNTK, or the kernel itself in the quantum kernel method [11]. If their eigenvalues are large, this indicates faster convergence. It would be interesting to compare those results with other theoretical criteria for the quality of variational quantum algorithms [41, 42].

• It would be interesting to investigate when the frozen QNTK limit is useful in other contexts. In Trotter product formulas, for instance, one uses

\lim_{n \to \infty} \left( e^{iaX/n} e^{ibZ/n} \right)^n ,   (52)

to implement the gate U = e^{i(aX+bZ)}. Thus, small variational angles might widely appear in realistic quantum architectures, even beyond the regime of lazy training around convergence (a short numerical check of Eq. (52) is sketched below).
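```python
# A quick numerical check of Eq. (52), with illustrative values of a and b:
# the Trotterized product approaches U = exp(i(aX + bZ)) as n grows, and each
# factor only involves the small angles a/n and b/n.
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
a, b = 0.7, -0.4
target = expm(1j * (a * X + b * Z))

for n in (1, 10, 100, 1000):
    step = expm(1j * a * X / n) @ expm(1j * b * Z / n)
    error = np.linalg.norm(np.linalg.matrix_power(step, n) - target)
    print(n, error)   # the error shrinks roughly as 1/n
```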

• Connection to the barren plateau problem. Our work suggests a possible connection between the barren plateau problem in variational quantum algorithms and the large-width limit in classical neural networks, by observing the following similarity between the LeCun parametrization above,

\mathbb{E}\left( W_{j^C_1 j_1} W_{j^C_2 j_2} \right) = \delta_{j^C_1, j^C_2} \, \delta_{j_1, j_2} \, \frac{C_W}{\mathrm{width}} ,   (53)

and the 1-design random formula [43]

\mathbb{E}\left( U_{ij} U^{\dagger}_{kl} \right) = \frac{\delta_{il} \, \delta_{jk}}{\dim \mathcal{H}} .   (54)

• It will be interesting to explore the robustness of the QNTK theory against noise. Specifically, we have obtained an exponential convergence when the QNTK is frozen.

• Finally, it will be interesting to dive deeper into the regime where the perturbative analysis fails and the dynamics becomes non-linear. One could draw phase diagrams of quantum machine learning in terms of some order parameters, for instance, the learning rate [44]. Those studies will deepen our theoretical understanding of quantum machine learning.

ACKNOWLEDGEMENT

We thank Bryan Clark, Cliff Cheung, Jens Eisert, Thomas Faulkner, Guy Gur-Ari, Yoni Kahn, Risi Kondor, David Meltzer, Ryan Mishmash, John Preskill, Dan A. Roberts, David Simmons-Duffin, Kristan Temme, Lian-Tao Wang, Rebecca Willett, and Changchun Zhong for useful discussions. JL is supported in part by International Business Machines (IBM) Quantum through the Chicago Quantum Exchange, and the Pritzker School of Molecular Engineering at the University of Chicago through AFOSR MURI (FA9550-21-1-0209). LJ acknowledges support from the ARO (W911NF-18-1-0020, W911NF-18-1-0212), ARO MURI (W911NF-16-1-0349), AFOSR MURI (FA9550-19-1-0399, FA9550-21-1-0209), DoE Q-NEXT, NSF (EFMA-1640959, OMA-1936118, EEC-1941583), NTT Research, and the Packard Foundation (2013-39273).

Note Added: A recent independent paper [45] that addresses the same topic was posted publicly on the arXiv four days prior to the publication of this manuscript.

[1] A. W. Harrow, A. Hassidim, and S. Lloyd, Physical Review Letters 103, 150502 (2009).
[2] N. Wiebe, D. Braun, and S. Lloyd, Physical Review Letters 109, 050505 (2012).
[3] S. Lloyd, M. Mohseni, and P. Rebentrost, Nature Physics 10, 631 (2014).
[4] P. Wittek, Quantum machine learning: what quantum computing means to data mining (Academic Press, 2014).
[5] N. Wiebe, A. Kapoor, and K. M. Svore, arXiv preprint arXiv:1412.3489 (2014).
[6] P. Rebentrost, M. Mohseni, and S. Lloyd, Physical Review Letters 113, 130503 (2014).
[7] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
[8] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, Nature Communications 9, 1 (2018).
[9] M. Schuld and N. Killoran, Physical Review Letters 122, 040504 (2019).
[10] E. Tang, in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (2019) pp. 217–228.
[11] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Nature 567, 209 (2019).
[12] H.-Y. Huang, M. Broughton, M. Mohseni, R. Babbush, S. Boixo, H. Neven, and J. R. McClean, Nature Communications 12, 1 (2021).
[13] Y. Liu, S. Arunachalam, and K. Temme, Nature Physics, 1 (2021).
[14] H.-Y. Huang, R. Kueng, and J. Preskill, Physical Review Letters 126, 190505 (2021).
[15] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, arXiv preprint arXiv:1711.00165 (2017).
[16] A. Jacot, F. Gabriel, and C. Hongler, arXiv preprint arXiv:1806.07572 (2018).
[17] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, Advances in Neural Information Processing Systems 32, 8572 (2019).
[18] S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang, arXiv preprint arXiv:1904.11955 (2019).
[19] J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee, arXiv preprint arXiv:2001.07301 (2020).
[20] G. Yang and E. J. Hu, arXiv preprint arXiv:2011.14522 (2020).
[21] S. Yaida, in Mathematical and Scientific Machine Learning (PMLR, 2020) pp. 165–192.
[22] E. Dyer and G. Gur-Ari, arXiv preprint arXiv:1909.11304 (2019).
[23] J. Halverson, A. Maiti, and K. Stoner, Machine Learning: Science and Technology 2, 035002 (2021).
[24] D. A. Roberts, arXiv preprint arXiv:2104.00008 (2021).
[25] D. A. Roberts, S. Yaida, and B. Hanin, arXiv preprint arXiv:2106.10165 (2021).
[26] J. Liu, Does Richard Feynman Dream of Electric Sheep? Topics on Quantum Field Theory, Quantum Computing, and Computer Science, Ph.D. thesis, Caltech (2021).
[27] K. Nakaji, H. Tezuka, and N. Yamamoto, "Quantum-enhanced neural networks in the neural tangent kernel framework," (2021), arXiv:2109.03786 [quant-ph].
[28] L. Chizat, E. Oyallon, and F. Bach, arXiv preprint arXiv:1812.07956 (2018).
[29] G. Aleksandrowicz et al., "Qiskit: An open-source framework for quantum computing," (2019).
[30] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, Nature Communications 5, 1 (2014).
[31] E. Farhi, J. Goldstone, and S. Gutmann, arXiv preprint arXiv:1411.4028 (2014).
[32] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[33] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Nature 549, 242 (2017).
[34] S. McArdle, S. Endo, A. Aspuru-Guzik, S. C. Benjamin, and X. Yuan, Reviews of Modern Physics 92, 015003 (2020).
[35] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, et al., Nature Reviews Physics, 1 (2021).
[36] J. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, M. Block, B. Bloom, S. Caldwell, N. Didier, E. S. Fried, S. Hong, et al., arXiv preprint arXiv:1712.05771 (2017).
[37] E. Farhi and H. Neven, arXiv preprint arXiv:1802.06002 (2018).
[38] This is somewhat different from the so-called NTK parametrization in some literature. See details in the SM. Moreover, here we assume the weights and biases are real.
[39] Similar analysis could be done on dynamics; see the SM for further comments.
[40] "Qiskit machine learning," https://qiskit.org/documentation/machine-learning/ (2021).
[41] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, Nature Computational Science 1, 403 (2021).
[42] T. Haug, K. Bharti, and M. Kim, arXiv preprint arXiv:2102.01659 (2021).
[43] D. A. Roberts and B. Yoshida, JHEP 04, 121 (2017), arXiv:1610.04903 [quant-ph].
[44] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari, arXiv preprint arXiv:2003.02218 (2020).
[45] N. Shirai, K. Kubo, K. Mitarai, and K. Fujii, "Quantum tangent kernel," (2021), arXiv:2111.02951 [quant-ph].


Supplemental Material: Representation Learning via Quantum Neural Tangent Kernels

Junyu Liu,1, 2, 3, ∗ Francesco Tacchino,4, † Jennifer R. Glick,5, ‡ Liang Jiang,1, 2, § and Antonio Mezzacapo5, ¶

1 Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL 60637, USA
2 Chicago Quantum Exchange, Chicago, IL 60637, USA
3 Kadanoff Center for Theoretical Physics, The University of Chicago, Chicago, IL 60637, USA
4 IBM Quantum, IBM Research – Zurich, 8803 Rüschlikon, Switzerland
5 IBM Quantum, IBM T. J. Watson Research Center, Yorktown Heights, New York 10598, USA
(Dated: November 9, 2021)

CONTENTS

I. Details on the linear and the quadratic models
   A. Linear model
   B. Quadratic model
      1. The residual training error
      2. The asymptotic regime
II. Details on the quantum optimization
   A. General setup
   B. Frozen QNTK
   C. dQNTK
III. Details on the quantum machine learning: Hermitian operator expectation value evaluation
   A. General setup
   B. No representation learning
   C. Representation learning
IV. Details on the quantum machine learning: amplitude encoding
   A. General setup
   B. No representation learning
   C. Representation learning
   D. Reading the amplitude
V. Suppression of non-Gaussianity in the large-width limit
VI. On the LeCun parametrization and the NTK parametrization
VII. Simulation details and additional data
References

I. DETAILS ON THE LINEAR AND THE QUADRATIC MODELS

A. Linear model

Before we discuss the quantum model setup, we start by reviewing the linear and quadratic models in the classical setup. See the discussion in [1, 2] for more details.

[email protected][email protected][email protected]§ [email protected][email protected]


We define the model as

z_{i;\delta}(\theta) = \sum_j W_{ij} \, \phi_j(x_\delta) .   (1)

Here x_\delta is the data point \delta in the space \mathcal{D}, and the W_{ij} are weights and biases. We use the slack notation such that W_{0j} = b_j includes the bias. The model is called the linear model, which is linear in the weights, but we wish to keep the feature map \phi general. Moreover, we write \theta as a compact notation for the vectorized W.

We wish to optimize the following loss function for the sample set \mathcal{A},

L_\mathcal{A}(\theta) = \frac{1}{2} \sum_{i,\alpha} \Big( y_{i;\alpha} - \sum_j W_{ij} \phi_j(x_\alpha) \Big)^2 .   (2)

Again, we will use the notation \alpha such that \alpha is in the sample set \mathcal{A}. We define the kernel

k_{\delta_1 \delta_2} \equiv k(x_{\delta_1}, x_{\delta_2}) \equiv \sum_i \phi_i(x_{\delta_1}) \phi_i(x_{\delta_2}) = \sum_{a,b} \frac{d z_{i;\delta_1}}{d W_{ab}} \frac{d z_{i;\delta_2}}{d W_{ab}} .   (3)

Note that the right-hand side does not contain a sum over i: it is equal for all i. Moreover, we call \delta \in \mathcal{D} the whole data set, while \mathcal{A} \subset \mathcal{D} is the sample set. We write \tilde{k}_{\alpha_1\alpha_2} with a tilde to indicate the submatrix restricted to the sample set. We also define [3]

\sum_{\alpha_2 \in \mathcal{A}} \tilde{k}^{\alpha_1 \alpha_2} \tilde{k}_{\alpha_2 \alpha_3} = \delta^{\alpha_1}_{\alpha_3} ,   (4)

namely, the upper index denotes the inverse. Similar to the main text, we consider the gradient descent algorithm,

d\theta_\mu = -\eta \, \frac{\partial L_\mathcal{A}}{\partial \theta_\mu} .   (5)

The partial derivatives are computed as

\frac{\partial L_\mathcal{A}(W)}{\partial W_{ab}} = -\sum_{\alpha,i,j} \delta_{ia} \delta_{jb} \, \phi_j(x_\alpha) \Big( y_{i;\alpha} - \sum_j W_{ij} \phi_j(x_\alpha) \Big) = \sum_\alpha \phi_b(x_\alpha) \left( z_{a;\alpha} - y_{a;\alpha} \right) = \sum_\alpha \varepsilon_{a;\alpha} \, \phi_b(x_\alpha) ,   (6)

where \varepsilon is the residual training error, so that

dW_{ij} = -\eta \sum_\alpha \phi_j(x_\alpha) \, \varepsilon_{i;\alpha} .   (7)

So we have

dz_{i;\delta} = \sum_{a,b} \frac{\partial z_{i;\delta}}{\partial W_{ab}} \, dW_{ab} = -\eta \sum_\alpha k_{\delta\alpha} \, \varepsilon_{i;\alpha} .   (8)

In the linear model, the kernel k_{\delta\alpha} is static. The solution for the residual training error is

\varepsilon_{i;\alpha_1}(t) = \sum_{\alpha_2} U_{\alpha_1 \alpha_2}(t) \, \varepsilon_{i;\alpha_2}(0) ,   (9)

where

U_{\alpha_t \alpha_0}(t) \equiv \big[ (1 - \eta k)^t \big]_{\alpha_t \alpha_0} = \sum_{\alpha_1, \ldots, \alpha_{t-1}} \left( \delta_{\alpha_t \alpha_{t-1}} - \eta k_{\alpha_t \alpha_{t-1}} \right) \cdots \left( \delta_{\alpha_1 \alpha_0} - \eta k_{\alpha_1 \alpha_0} \right) .   (10)


Now we can predict the model output for an arbitrary \delta. We have

z_{i;\delta}(\infty) = z_{i;\delta}(0) - \sum_\alpha k_{\delta\alpha} \Big( \eta \sum_{s=0}^{\infty} \varepsilon_{i;\alpha}(s) \Big)
 = z_{i;\delta}(0) - \sum_\alpha k_{\delta\alpha} \, \eta \sum_{s=0}^{\infty} \Big[ \sum_{\alpha_1} U_{\alpha\alpha_1}(s) \, \varepsilon_{i;\alpha_1}(0) \Big]
 = z_{i;\delta}(0) - \sum_{\alpha,\alpha_1} k_{\delta\alpha} \, \eta \sum_{s=0}^{\infty} \big[ (1-\eta k)^s \big]_{\alpha\alpha_1} \varepsilon_{i;\alpha_1}(0)
 = z_{i;\delta}(0) - \sum_{\alpha,\alpha_1} k_{\delta\alpha} \, \eta \, \big[ 1 - (1-\eta k) \big]^{-1}_{\alpha\alpha_1} \varepsilon_{i;\alpha_1}(0)
 = z_{i;\delta}(0) - \sum_{\alpha,\alpha_1} k_{\delta\alpha} \, \tilde{k}^{\alpha\alpha_1} \varepsilon_{i;\alpha_1}(0) ,   (11)

where we have made use of geometric sums. It is easy to check that

z_{i;\delta}(\infty) = y_{i;\delta} ,   (12)

when we take \delta \in \mathcal{A}.
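The following minimal numpy sketch, with an illustrative random feature map and random targets, checks the solution above: discrete gradient descent on the linear model reproduces Eqs. (9)-(10) on the training set up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 8, 20
phi = rng.normal(size=(n_samples, n_features))   # phi_j(x_alpha), one row per sample
y = rng.normal(size=n_samples)                   # targets y_alpha (single output index i)
W = rng.normal(size=n_features) * 0.1            # weights (bias absorbed as in the text)
eta = 0.01
k = phi @ phi.T                                  # kernel k_{alpha1 alpha2}, Eq. (3)

eps0 = phi @ W - y                               # residual training error at t = 0
for t in range(200):
    eps = phi @ W - y
    W -= eta * phi.T @ eps                       # dW_j = -eta sum_a phi_j(x_a) eps_a, Eq. (7)

predicted = np.linalg.matrix_power(np.eye(n_samples) - eta * k, 200) @ eps0
print(np.max(np.abs((phi @ W - y) - predicted)))  # ~1e-13: dynamics match Eqs. (9)-(10)
```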

B. Quadratic model

Now we start to study perturbative corrections about the linear model. We consider the model definition

z_{i;\delta}(\theta) = \sum_j W_{ij} \phi_j(x_\delta) + \frac{\sigma}{2} \sum_{j_1, j_2} W_{i j_1} W_{i j_2} \psi_{j_1 j_2}(x_\delta) .   (13)

Here \sigma is a small number controlling the perturbative correction. Thus, the model prediction difference up to quadratic order is given by

dz_{i;\delta} = \sum_j \Big( \phi_j(x_\delta) + \sigma \sum_{j_1} W_{i j_1} \psi_{j_1 j}(x_\delta) \Big) dW_{ij} + \frac{\sigma}{2} \sum_{j_1, j_2} \psi_{j_1 j_2}(x_\delta) \, dW_{i j_1} dW_{i j_2} .   (14)

The first term here is the effective feature map,

\phi^E_{ij}(x_\delta) \equiv \frac{d z_{i;\delta}}{d W_{ij}} = \phi_j(x_\delta) + \sigma \sum_k W_{ik} \psi_{kj}(x_\delta) .   (15)

Now, we can collect our dynamical equations for z, \phi^E, W as

dz_{i;\delta} = \sum_j dW_{ij} \, \phi^E_{ij}(x_\delta) + \frac{\sigma}{2} \sum_{j_1, j_2} dW_{i j_1} dW_{i j_2} \, \psi_{j_1 j_2}(x_\delta) ,
d\phi^E_{ij}(x_\delta) = \sigma \sum_k dW_{ik} \, \psi_{kj}(x_\delta) ,
dW_{ij} = -\eta \sum_\alpha \phi^E_{ij}(x_\alpha) \, \varepsilon_{i;\alpha} .   (16)

Moreover, now we have an MSE loss as

L_\mathcal{A} = \frac{1}{2} \sum_{i,\alpha} \Big( y_{i;\alpha} - \sum_j W_{ij} \phi_j(x_\alpha) - \frac{\sigma}{2} \sum_{j_1, j_2} W_{i j_1} W_{i j_2} \psi_{j_1 j_2}(x_\alpha) \Big)^2 .   (17)


So we can write the dynamics of z more explicitly as

dz_{i;\delta} = -\eta \sum_\alpha \Big( \sum_j \phi^E_{ij}(x_\delta) \phi^E_{ij}(x_\alpha) \Big) \varepsilon_{i;\alpha} + \frac{\eta^2}{2} \sum_{\alpha_1, \alpha_2} \Big( \sum_{j_1, j_2} \sigma \, \psi_{j_1 j_2}(x_\delta) \, \phi^E_{i j_1}(x_{\alpha_1}) \phi^E_{i j_2}(x_{\alpha_2}) \Big) \varepsilon_{i;\alpha_1} \varepsilon_{i;\alpha_2} .   (18)

We define the effective kernel,

k^E_{ii;\delta_1 \delta_2} \equiv \sum_j \phi^E_{ij}(x_{\delta_1}) \, \phi^E_{ij}(x_{\delta_2}) ,   (19)

and the meta kernel,

\mu_{\delta_0 \delta_1 \delta_2} \equiv \sigma \sum_{j_1, j_2} \psi_{j_1 j_2}(x_{\delta_0}) \, \phi_{j_1}(x_{\delta_1}) \phi_{j_2}(x_{\delta_2}) = \sigma \sum_{j_1, j_2} \psi_{j_1 j_2}(x_{\delta_0}) \, \phi^E_{i j_1}(x_{\delta_1}) \phi^E_{i j_2}(x_{\delta_2}) + O(\sigma^2) .   (20)

So we have

dz_{i;\delta} = -\eta \sum_\alpha k^E_{ii;\delta\alpha} \, \varepsilon_{i;\alpha} + \frac{\eta^2}{2} \sum_{\alpha_1, \alpha_2} \mu_{\delta \alpha_1 \alpha_2} \, \varepsilon_{i;\alpha_1} \varepsilon_{i;\alpha_2} + O(\sigma^2) .   (21)

The meta kernel is fixed. The effective kernel will satisfy the following dynamics,

dk^E_{ii;\delta_1 \delta_2} = -\eta \sum_\alpha \left( \mu_{\delta_1 \delta_2 \alpha} + \mu_{\delta_2 \delta_1 \alpha} \right) \varepsilon_{i;\alpha} + O(\sigma^2) .   (22)

Now we could try solving the output z. Based on perturbation theory, we could divide the whole output by the free term zF andthe interacting term zI ,

zi;δ(t) ≡ zFi;δ(t) + zIi;δ(t) . (23)

The free part follows the following exponential dynamics,

dzFi;δ = −η∑

α

kδαεFi;α ≡ zFi;δ − η

α

kδα[zFi;α − yi;α

]. (24)

Here, k is the old definition of the kernel without quadratic terms, which is different from the effective kernel byO(σ). Moreover,we have

kEii;δ1δ2(t)

=kEii;δ1δ2(0)−∑

α

(µδ1δ2α + µδ2δ1α) ai;α(t) , (25)

where

ai;α(t) =∑

α2

kαα2(εi;α(t)− εFi;α(t)

). (26)

where εF is the free part of the residual training error zF − y. Then, one could compute the interacting piece. We have

dzIi;δ = −∑

j,α

ηkδαzIj;α(t) + ηFi;δ(t) . (27)


Here F is the damping force

Fi;δ(t) ≡ −∑

α

[kEδα(t)− kδα

]εFi;α(t) +

η

2

α1,α2

µδα1α2εFi;α1

(t)εFi;α2(t) , (28)

and we have

zIi;δ(t) = ηt−1∑

s=0

Fi;δ(s)−

α1,α2

kδα1kα1α2Fi;α2

(s)

+

α1,α2

kδα1kα1α2zIi;α2

(t) . (29)

The whole system is a set of non-linear difference equations.

1. The residual training error

Now we could try to solve the residual training error. We note that, for α ∈ A, we have

εi;α(t)− εFi;α(t) = zIi;α(t) . (30)

So we have

kEii;δ1δ2(t) = kEii;δ1δ2(0)−∑

α,α2

(µδ1δ2α + µδ2δ1α) kαα2zIi;α2(t) . (31)

Moreover, we call

k∆ii;δ1δ2 ≡ k∆

ii;δ1δ2 ≡ kEii;δ1δ2(0)− kδ1δ2 . (32)

We note that the effective kernel looks like,

kEii;δ1δ2 ≡∑

j

φEij (xδ1 ; 0)φEij (xδ2 ; 0)

=∑

j

(φj (xδ1) + σ

k

Wik(0)ψkj (xδ1)

)(φj (xδ2) + σ

k′

Wik(0)ψkj (xδ2)

)

= kδ1δ2 + σ∑

j,k

Wik(0)ψkj (xδ1)φj (xδ2) + σ∑

j,k

Wik(0)ψkj (xδ2)φj (xδ1) +O(σ2) . (33)

Keeping the leading order, we have

k∆ii,δ1δ2 = σ

j,k

Wik(0)ψkj (xδ1)φj (xδ2) + σ∑

j,k

Wik(0)ψkj (xδ2)φj (xδ1) , (34)

and

kEii;δ1δ2(t) = kEii;δ1δ2(0)−∑

α,α2

(µδ1δ2α + µδ2δ1α)kαα2zIi;α2(t)

= kδ1δ2 + k∆ii,δ1δ2 −

α,α2

(µδ1δ2α + µδ2δ1α)kαα2zIi;α2(t) . (35)


Moreover, let us solve the damping force F:

Fi;δ(t) = −∑

α

[kEδα(t)− kδα

]εFi;α(t) +

η

2

α1,α2

µδα1α2εFi;α1

(t)εFi;α2(t)

= −∑

α,α3

kEii;δα(0)−

α1,α2

(µδαα1+ µαδα1

) kα1α2εIi;α2(t)− kδα

Uαα3

(t)εi;α3(0)

2

α1,α2,α3,α4

µδα1α2Uα1α3(t)Uα2α4(t)εi;α3(0)εi;α4(0)

= −∑

α,α3

k∆ii;δαUαα3(t)εi;α3(0) +

α,α1,α2,α3

(µδαα1 + µαδα1) kα1α2Uαα3(t)εi;α3(0)zIi;α2(t)

2

α1,α2,α3,α4

µδα1α2Uα1α3

(t)Uα2α4(t)εi;α3

(0)εi;α4(0) . (36)

The second and the last term in the last equality contributes higher orders. Thus, if the initial k∆ is not vanishing, we have

Fi;δ(t) = −∑

α,α3

k∆ii;δαUαα3

(t)εi;α3(0) +O(ησ) . (37)

The contribution in the first term will dominate at least in the early time when k∆ is not vanishing. In the late time where Udecays significantly, we would have some non-perturbative effects.

Thus, we can plug the expression back to solve zI(t). We have

zIi;α(t) = η

t−1∑

s=0

α1

Uαα1(t− 1− s)Fi;α1(s)

= −ηt−1∑

s=0

α1,α2,α3

Uαα1(t− 1− s)Uα2α3

(s)k∆ii;α1α2

εi;α3(0)

= −ηt−1∑

s=0

(1− ηk)t−1−s

k∆ii (δ − ηk)

sεi(0) , (38)

The last formula is given in the following matrix form:

(1− ηk)α1α2 = δα1α2 − ηkα1α2 = δα1α2 − η∑

j

φj (xα1)φj (xα2) ,

(k∆ii )α1α2

= σ∑

j,k

Wik(0)ψkj (xα1)φj (xα2

) + σ∑

j,k

Wik(0)ψkj (xα2)φj (xα1

) ≡ σ(Mi)α1,α2,

(εi(0))α = εi;α(0) . (39)

Moreover, we can indeed get more information by simply making a bound. We have

\| z^I_{i;\alpha}(t) \| = \Big\| \eta \sum_{s=0}^{t-1} (1-\eta k)^{t-1-s} \, k^\Delta_{ii} \, (1-\eta k)^s \, \varepsilon_i(0) \Big\|
 \le \eta \sum_{s=0}^{t-1} \| 1-\eta k \|^{t-1-s} \, \| k^\Delta_{ii} \| \, \| 1-\eta k \|^s \, \| \varepsilon_i(0) \|
 = \eta \Big( \sum_{s=0}^{t-1} \| 1-\eta k \|^{t-1} \Big) \| k^\Delta_{ii} \| \, \| \varepsilon_i(0) \|
 = \eta t \, \| 1-\eta k \|^{t-1} \| k^\Delta_{ii} \| \, \| \varepsilon_i(0) \| = \sigma \eta t \, \| 1-\eta k \|^{t-1} \| M_i \| \, \| \varepsilon_i(0) \| .   (40)

Now we compare the convergence time, noticing that

\varepsilon^F_{i;\alpha_1}(t) = \sum_{\alpha_2} U_{\alpha_1 \alpha_2}(t) \, \varepsilon_{i;\alpha_2}(0) ,   (41)

so

\| \varepsilon^F_i(t) \| \le \| 1-\eta k \|^{t} \, \| \varepsilon_i(0) \| .   (42)

Schematically, we have

\frac{\| z^I_i(t) \|}{\| \varepsilon^F_i(t) \|} \sim \frac{\sigma \eta t \, \| M_i \| \, \| \varepsilon_i(0) \|}{\| 1-\eta k \| \, \| \varepsilon_i(0) \|} \sim \frac{\sigma \eta t \, \| M_i \|}{\| 1-\eta k \|} .   (43)

The relative perturbative error contribution will grow linearly in time.
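A crude numpy sketch of this scaling is given below, under illustrative choices of the feature maps \phi, \psi and of \sigma: it runs the full quadratic model of Eq. (13) against the purely linear dynamics started from the same weights, and tracks the relative size of the correction \varepsilon - \varepsilon^F = z^I (up to the small O(\sigma) mismatch in the initial condition).

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, sigma, eta = 6, 15, 1e-3, 0.01
phi = rng.normal(size=(n_samples, n_features))
psi = rng.normal(size=(n_samples, n_features, n_features))
psi = 0.5 * (psi + psi.transpose(0, 2, 1))          # symmetrize psi_{j1 j2}(x)
y = rng.normal(size=n_samples)
W = rng.normal(size=n_features) * 0.1
W_lin = W.copy()                                     # free (linear) reference model

def z_quad(W):                                       # Eq. (13), single output index
    return phi @ W + 0.5 * sigma * np.einsum("ajk,j,k->a", psi, W, W)

for t in range(1, 101):
    eps = z_quad(W) - y                              # full quadratic model, loss Eq. (17)
    grad = phi + sigma * np.einsum("ajk,k->aj", psi, W)   # effective feature map, Eq. (15)
    W -= eta * grad.T @ eps
    eps_lin = phi @ W_lin - y                        # free dynamics with the static kernel
    W_lin -= eta * phi.T @ eps_lin
    if t % 25 == 0:
        zI = (z_quad(W) - y) - (phi @ W_lin - y)     # eps - eps^F
        print(t, np.linalg.norm(zI) / np.linalg.norm(phi @ W_lin - y))
```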

2. The asymptotic regime

Now instead of only looking at the training set, we study the asymptotic regime for general inputs. We start from

zIi;δ(t) = ηt−1∑

s=0

Fi;δ(s)−

α1,α2

kδα1kα1α2Fi;α2

(s)

+

α1,α2

kδα1kα1α2zIi;α2

(t) . (44)

At the asymptotic convergence, the interacting perturbative correction on the training set will converge to zero, so we have

zIi;δ(∞) =

[η∞∑

s=0

Fi;δ(s)

]−∑

α1,α2

kδα1kα1α2

[η∞∑

s=0

Fi;α2(s)

]. (45)

So we need to perform the sum

η∞∑

s=0

Fi;δ(s) =∑

α0,α

(µδαα0+ µαδα0

)

η∞∑

s=0

ai;α0(s)εFi;α(s)

2

α1,α2

µδα1α2

η∞∑

s=0

εFi;α1(s)εFi;α2

(s)

. (46)

We note that

η

∞∑

s=0

εFi;α1(t)εFi;α2

(t) = η∑

α3,α4

∞∑

s=0

[(1− ηk)s]α1α3

[(1− ηk)s]α2α4

εi;α3(0)εi;α4(0)

=∑

α3,α4

X α1α2α3α4

‖ εi;α3(0)εi;α4(0) . (47)

Here, we define the following inverting tensor:

X α1α2α3α4

‖ =

∞∑

s=0

[(1− ηk)s]α1α3

[(1− ηk)s]α2α4

, (48)

which is implicitly defined as

δα1

α5δα2

α6=∑

α3,α4

X α1α2α3α4

‖1

η

[δα3α5δα4α6 −

(δα3α5 − ηkα3α5

)(δα4α6 − ηkα4α6

)]

=∑

α3,α4

X α1α2α3α4

(kα3α5δα4α6 + δα3α5 kα4α6 − ηkα3α5 kα4α6

). (49)

Using the inverting tensor, the final expression is given by

zi;δ(∞) = zi;δ(0)−∑

α1,α2

kδα1kα1α2εi;α2

(0)

+∑

α1,...,α4

µα1δα2

−∑

α5,α6∈Akδα5

kα5α6µα1α6α2

Zα1α2α3α4

A εi;α3(0)εi;α3

(0)

+∑

α1,...,α4

µδα1α2

−∑

α5,α6∈Akδα5

kα5α6µα6α1α2

Zα1α2α3α4

B εi;α3(0)εi;α3

(0) , (50)


where Zs are the algorithm projectors,

Zα1α2α3α4

A ≡ kα1α3 kα2α4 −∑

α5

kα2α5X α1α5α3α4

‖ ,

Zα1α2α3α4

B ≡ kα1α3 kα2α4 −∑

α5

kα2α5X α1α5α3α4

‖ +η

2X α1α2α3α4

‖ . (51)

Finally, in the continuum limit, we could drop out the η terms in the algorithm projectors,

Zα1α2α3α4

A = Zα1α2α3α4

B ≡ kα1α3 kα2α4 −∑

α5

kα2α5X α1α5α3α4

‖ ,

α3,α4

X α1α2α3α4

(kα3α5

δα4α6+ δα3α5

kα4α6

)= δα1

α5δα2

α6. (52)

The existence of algorithm projectors shows algorithm dependence in these perturbative corrections. This will typically happen when the model has multiple (local) minima.

II. DETAILS ON THE QUANTUM OPTIMIZATION

A. General setup

The quantum optimization problem we discuss here has a simpler formulation compared to quantum machine learning. Since the loss function does not contain the training data, we do not need to consider the difference between the training set and the whole space. It can be understood as a limiting case of the quantum machine learning problem.

We use the loss function

L(\theta) = \frac{1}{2} \left( \langle \Psi_0 | U^\dagger O U | \Psi_0 \rangle - O_0 \right)^2 \equiv \frac{1}{2} \varepsilon^2 .   (53)

The trainable ansatz is

U = \prod_{\ell=1}^{L} W_\ell U_\ell = \prod_{\ell=1}^{L} W_\ell \exp(i \theta_\ell X_\ell) .   (54)

Here, note that since the operator O is Hermitian, the loss function is always real and non-negative. The gradient descent is

d\theta_\ell = -\eta \, \frac{dL(\theta)}{d\theta_\ell} = -\eta \left( \langle \Psi_0 | U^\dagger O U | \Psi_0 \rangle - O_0 \right) \frac{d \langle \Psi_0 | U^\dagger O U | \Psi_0 \rangle}{d\theta_\ell} = -\eta \, \varepsilon \, \frac{d\varepsilon}{d\theta_\ell} ,   (55)

where

\varepsilon = \langle \Psi_0 | U^\dagger O U | \Psi_0 \rangle - O_0 .   (56)

Thus

d\varepsilon = \sum_\ell \frac{d\varepsilon}{d\theta_\ell} \, d\theta_\ell = -\eta \sum_\ell \frac{d\varepsilon}{d\theta_\ell} \frac{d\varepsilon}{d\theta_\ell} \, \varepsilon .   (57)

The object

K = \sum_\ell \frac{d\varepsilon}{d\theta_\ell} \frac{d\varepsilon}{d\theta_\ell}   (58)

is the optimization analog of the QNTK. More precisely, we have

\frac{d\varepsilon}{d\theta_\ell} = \frac{d \langle \Psi_0 | U^\dagger O U | \Psi_0 \rangle}{d\theta_\ell} = -i \left\langle \Psi_0 \left| U^\dagger_{+,\ell} \left[ X_\ell, \, U^\dagger_\ell W^\dagger_\ell U^\dagger_{-,\ell} O U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \Psi_0 \right\rangle ,   (59)


with the help of the definitions

U_{-,\ell} \equiv \prod_{\ell'=1}^{\ell-1} W_{\ell'} U_{\ell'} , \qquad U_{+,\ell} \equiv \prod_{\ell'=\ell+1}^{L} W_{\ell'} U_{\ell'} .   (60)

So,

d\varepsilon = -\eta \sum_\ell \frac{d\varepsilon}{d\theta_\ell} \frac{d\varepsilon}{d\theta_\ell} \, \varepsilon = \eta \, \varepsilon \sum_\ell \left\langle \Psi_0 \left| U^\dagger_{+,\ell} \left[ X_\ell, \, U^\dagger_\ell W^\dagger_\ell U^\dagger_{-,\ell} O U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \Psi_0 \right\rangle^2 .   (61)

Can we solve this difference equation? We have

d\theta_\ell = i \eta \, \varepsilon \left\langle \Psi_0 \left| U^\dagger_{+,\ell} \left[ X_\ell, \, U^\dagger_\ell W^\dagger_\ell U^\dagger_{-,\ell} O U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \Psi_0 \right\rangle .   (62)

In the continuum limit, we have

\frac{d\theta_\ell}{dt} = i \eta_c \, \varepsilon \left\langle \Psi_0 \left| U^\dagger_{+,\ell} \left[ X_\ell, \, U^\dagger_\ell W^\dagger_\ell U^\dagger_{-,\ell} O U_{-,\ell} W_\ell U_\ell \right] U_{+,\ell} \right| \Psi_0 \right\rangle ,   (63)

where \eta_c = \eta / dt is the continuous version of the learning rate. In principle, this is a coupled nonlinear ODE system, and one could use ODE theory to study it. Moreover, we have

U†+,`

[X`, U

†`W

†` U†−,`OU−,`W`U`

]U+,` = U†+,`U

†`

[X`,W

†` U†−,`OU−,`W`

]U`U+,` . (64)

Defining

A` =[X`,W

†` U†−,`OU−,`W`

], (65)

and if we are expanding θ` when it is slightly deviated from 0 (more generally, around a fixed angle, which is equivalent to aredefinition of constant gates), we have,

U†+,`U†`A`U`U+,` = U†+,`A`U+,` − iθ`U†+,` [X`, A`]U+,` +O(θ2

` ) . (66)

Thus, we note that the second-order expansions of θ` here means second-order commutators among X` and some unitary-addressed versions of the operator O, which is the operator we wish to optimize. Our experiences could be easily generalized tohigh orders.

B. Frozen QNTK

Now, we rescale the original variational angles by a factor \delta, so that the variational angles are around \theta^* + \delta \varphi. The constant term \theta^* produces a constant gate \exp(i \theta^*_\ell X_\ell), such that W_\ell U_\ell \to W_\ell \exp(i \theta^*_\ell X_\ell) \exp(i \delta \varphi_\ell X_\ell). So we can absorb the \theta^* dependence into the definition of the constant gate by defining W_\ell(\theta^*_\ell) \equiv W_\ell \exp(i \theta^*_\ell X_\ell), and we have W_\ell U_\ell \to W_\ell(\theta^*_\ell) \exp(i \delta \varphi_\ell X_\ell). Thus, for simplicity, we can understand the trainable gate as U_\ell \to \exp(i \delta \varphi_\ell X_\ell) with a redefined constant gate. In the calculations below, we drop the \theta^* notation and understand the variational angles as small parameters rescaled by \delta, for notational convenience.

In the frozen QNTK limit, we have

K = -\delta^2 \sum_\ell \left\langle \Psi_0 \left| W^\dagger_{+,\ell} \left[ X_\ell, \, W^\dagger_\ell W^\dagger_{-,\ell} O W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle^2 ,   (67)

and we define

W_{-,\ell} \equiv \prod_{\ell'=1}^{\ell-1} W_{\ell'} , \qquad W_{+,\ell} \equiv \prod_{\ell'=\ell+1}^{L} W_{\ell'} .   (68)


Note that the frozen QNTK does not depend on the variational parameters. The gradient descent dynamics, in this case, is very easy to solve. We have

\varepsilon(t) = (1 - \eta K)^t \, \varepsilon(0) \equiv \Big( 1 + \eta \delta^2 \sum_\ell \left\langle \Psi_0 \left| W^\dagger_{+,\ell} \left[ X_\ell, \, W^\dagger_\ell W^\dagger_{-,\ell} O W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle^2 \Big)^t \varepsilon(0) .   (69)

The convergence rate is given by

\tau_c = -\log(1 - \eta K) \approx \eta K = \eta \delta^2 \sum_\ell \left| \left\langle \Psi_0 \left| W^\dagger_{+,\ell} \left[ X_\ell, \, W^\dagger_\ell W^\dagger_{-,\ell} O W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \Psi_0 \right\rangle \right|^2 \le 2 \eta \delta^2 L \, \| O \|^2 \max_\ell \| X_\ell \|^2 .   (70)
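The following numpy sketch illustrates Eqs. (69)-(70) for a small random layered ansatz: in the lazy regime of small angles, the residual optimization error follows \varepsilon(t) \approx (1-\eta K)^t \varepsilon(0), with K estimated by finite differences at the initial angles. The circuit, target value and hyperparameters are illustrative choices, and agreement is expected only to leading order in the angles.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
dim, L, eta, h = 4, 4, 0.05, 1e-5

def haar_unitary(d):
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

Ws = [haar_unitary(dim) for _ in range(L)]                      # fixed gates W_l
Z = np.diag([1.0, -1.0])
Xs = [np.kron(Z, np.eye(2)), np.kron(np.eye(2), Z)] * (L // 2)  # generators X_l
O = np.diag([1.0, -1.0, -1.0, 1.0])                             # observable
psi0 = np.zeros(dim, dtype=complex); psi0[0] = 1.0

def expectation(theta):
    U = np.eye(dim, dtype=complex)
    for W, X, th in zip(Ws, Xs, theta):
        U = W @ expm(1j * th * X) @ U
    return np.real(psi0.conj() @ U.conj().T @ O @ U @ psi0)

theta = 0.05 * rng.normal(size=L)                 # small initial angles (lazy regime)
O0 = expectation(theta) - 0.1                     # nearby target value, so eps(0) = 0.1
eps = lambda th: expectation(th) - O0
grad = lambda th: np.array([(eps(th + h * e) - eps(th - h * e)) / (2 * h)
                            for e in np.eye(L)])

K = np.sum(grad(theta) ** 2)                      # frozen-QNTK estimate at t = 0
eps0 = eps(theta)
for t in range(1, 101):
    theta -= eta * eps(theta) * grad(theta)       # d theta_l = -eta * eps * d eps/d theta_l
    if t % 25 == 0:
        print(t, eps(theta), (1 - eta * K) ** t * eps0)   # simulation vs frozen prediction
```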

C. dQNTK

Now let us focus on the second order, to develop an analog of representation learning theory in the quantum optimization example. We have

ε =⟨Ψ0

∣∣U†OU∣∣Ψ0

⟩−O0

≈⟨

Ψ0

∣∣∣∣(∏1

`′=LW †`′

)O

(∏L

`=1W`

)∣∣∣∣Ψ0

⟩−O0

− iδ∑

`

ϕ`

⟨Ψ0

∣∣∣W †+,`[X`,W

†`W

†−,`OW−,`W`

]W+,`

∣∣∣Ψ0

− δ2

2

`1,`2

ϕ`1ϕ`2 ×

⟨Ψ0

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣Ψ0

⟩: `1 ≥ `2⟨

Ψ0

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣Ψ0

⟩: `1 < `2

,

(71)

where

W`1,`2 =∏`2−1

`=`1+1W` . (72)

So we could take the derivative as,

dϕ`= −iδ

⟨Ψ0

∣∣∣W †+,`[X`,W

†`W

†−,`OW−,`W`

]W+,`

∣∣∣Ψ0

⟩− δ2 ×

`′

ϕ`′×

⟨Ψ0

∣∣∣W †+,`′[X`′ ,W

†`′W

†`′,`W

†+,`

[X`,W

†`W

†−,`OW−,`W`

]W`,`′W`′

]W+,`′

∣∣∣Ψ0

⟩: `′ ≥ `⟨

Ψ0

∣∣∣W †+,`[X`,W

†`W

†`,`′W

†+,`′

[X`′ ,W

†`′W

†−,`′OW−,`′W`′

]W`′,`W`

]W+,`

∣∣∣Ψ0

⟩: `′ < `

. (73)

Thus

dε =∑

`

−iδ⟨

Ψ0

∣∣∣W †+,`[X`,W

†`W

†−,`OW−,`W`

]W+,`

∣∣∣Ψ0

⟩− δ2

∑`′ϕ`′×

⟨Ψ0

∣∣∣W †+,`′[X`′ ,W

†`′W

†`′,`W

†+,`

[X`,W

†`W

†−,`OW−,`W`

]W`,`′W`′

]W+,`′

∣∣∣Ψ0

⟩: `′ ≥ `⟨

Ψ0

∣∣∣W †+,`[X`,W

†`W

†`,`′W

†+,`′

[X`′ ,W

†`′W

†−,`′OW−,`′W`′

]W`′,`W`

]W+,`

∣∣∣Ψ0

⟩: `′ < `

dϕ`

− δ2

2

`1,`2

dϕ`1 dϕ`2×

⟨Ψ0

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣Ψ0

⟩: `1 ≥ `2⟨

Ψ0

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣Ψ0

⟩: `1 < `2

. (74)


Moreover, we have

dϕ` = −ηε dεdϕ`

. (75)

Thus

dε = −η∑

`

dϕ`

dϕ`ε+

1

2η2ε2

`1,`2

d2ε

dϕ`1dϕ`2

dϕ`1

dϕ`2. (76)

The structure of the gradient descent equation for the residual optimization error is very similar to the case in the quadratic modeland representation learning context. We define the free part and the interacting part of ε as

ε = εF + εI . (77)

The free part is given by,

εF = (1− ηK)tε(0) ,

K = −δ2∑

`

⟨Ψ0

∣∣∣W †+,`[X`,W

†`W

†−,`OW−,`W`

]W+,`

∣∣∣Ψ0

⟩2

, (78)

and the interacting part is given by

εI(t) = −ηt(1− ηK)t−1

K∆ε(0) , (79)

where we define the effective kernel up to the dQNTK order,

KE =∑

`

dϕ`

dϕ`. (80)

And we have

K∆ ≡ KE(0)−K =

(∑

`

dϕ`

dϕ`

)(0)−

`

dεF

dϕ`

dεF

dϕ`

= 2iδ3∑

`

⟨Ψ0

∣∣∣W †+,`[X`,W

†`W

†−,`OW−,`W`

]W+,`

∣∣∣Ψ0

`′

⟨Ψ0

∣∣∣W †+,`′[X`′ ,W

†`′W

†`′,`W

†+,`

[X`,W

†`W

†−,`OW−,`W`

]W`,`′W`′

]W+,`′

∣∣∣Ψ0

⟩: `′ ≥ `⟨

Ψ0

∣∣∣W †+,`[X`,W

†`W

†`,`′W

†+,`′

[X`′ ,W

†`′W

†−,`′OW−,`′W`′

]W`′,`W`

]W+,`

∣∣∣Ψ0

⟩: `′ < `

ϕ`′(0) . (81)

Similarly, one define the quantum meta-kernel (dQNTK) as

µ =∑

`1,`2

d2ε

dϕ`1dϕ`2

dϕ`1

dϕ`2

= δ4∑

`1,`2

⟨Ψ0

∣∣∣W †+,`1[X`1 ,W

†`1W †−,`1OW−,`1W`1

]W+,`1

∣∣∣Ψ0

⟩⟨

Ψ0

∣∣∣W †+,`2[X`2 ,W

†`2W †−,`2OW−,`2W`2

]W+,`2

∣∣∣Ψ0

⟩×

⟨Ψ0

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣Ψ0

⟩: `1 ≥ `2⟨

Ψ0

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣Ψ0

⟩: `1 < `2

. (82)

Finally, we wish to mention that in [1], for the classical neural networks studied there, the leading-order perturbative contribution O(1/width) is given by both the dNTK and the ddNTK in the dynamics. In our frozen QNTK limit (in the context of lazy training), this does not happen because of the power counting in \delta.


III. DETAILS ON THE QUANTUM MACHINE LEARNING: HERMITIAN OPERATOR EXPECTATION VALUE EVALUATION

A. General setup

We define our model as

z_{i;\delta} \equiv z_i(\theta, x_\delta) = \langle \phi(x_\delta) | U^\dagger O_i U | \phi(x_\delta) \rangle ,   (83)

where O_i is the i-th Hermitian observable. We assume that O_i is taken from a subset of Hermitian operators on the Hilbert space \mathcal{H}, denoted by \mathcal{O}(\mathcal{H}). The dimension of \mathcal{O}(\mathcal{H}) is upper bounded by a polynomial of the dimension of the Hilbert space, \dim \mathcal{H}. The trainable ansatz is

U = \prod_{\ell=1}^{L} W_\ell U_\ell = \prod_{\ell=1}^{L} W_\ell \exp(i \theta_\ell X_\ell) .   (84)

One can take the derivative,

dz_{i;\delta} = \sum_\ell \frac{d z_{i;\delta}}{d \theta_\ell} \, d\theta_\ell .   (85)

The loss is

L_\mathcal{A}(\theta) = \frac{1}{2} \sum_{\alpha, i} \left( y_{i;\alpha} - z_{i;\alpha} \right)^2 = \frac{1}{2} \sum_{\alpha, i} \varepsilon^2_{i;\alpha} .   (86)

So

\frac{d L_\mathcal{A}(\theta)}{d\theta_\ell} = \sum_{\alpha, i} \varepsilon_{i;\alpha} \, \frac{d z_{i;\alpha}}{d\theta_\ell} .   (87)

The gradient descent rule is

d\theta_\ell = -\eta \, \frac{d L_\mathcal{A}}{d\theta_\ell} = -\eta \sum_{\alpha, i} \varepsilon_{i;\alpha} \, \frac{d z_{i;\alpha}}{d\theta_\ell} ,   (88)

so we have

dz_{i;\delta} = -\eta \sum_{\ell, i', \alpha} \varepsilon_{i';\alpha} \, \frac{d z_{i;\delta}}{d\theta_\ell} \frac{d z_{i';\alpha}}{d\theta_\ell} .   (89)

Since we measure expectation values of Hermitian operators, our \varepsilon are always real. Defining the kernel

K^{ii'}_{\delta,\alpha} = \sum_\ell \frac{d z_{i;\delta}}{d\theta_\ell} \frac{d z_{i';\alpha}}{d\theta_\ell} ,   (90)

we have

dz_{i;\delta} = -\eta \sum_{\alpha, i'} K^{ii'}_{\delta,\alpha} \, \varepsilon_{i';\alpha} .   (91)

We can also introduce the joint indices

(\delta, i) = a , \qquad (\alpha, i') = b ,   (92)

which run in the spaces \mathcal{D} \times \mathcal{O}(\mathcal{H}) and \mathcal{A} \times \mathcal{O}(\mathcal{H}), respectively: the index b indicates that the data point component belongs to the training set \mathcal{A}, while a means that the data point component is general in \mathcal{D}. And we have

dz_a = -\eta \sum_b K_{ab} \, \varepsilon_b .   (93)

In general, one could prove the following statement.


Theorem 1. The matrix K_{ab} is non-negative and symmetric.

Proof. It is symmetric by definition. Moreover, consider an arbitrary vector f_a. We have

\sum_{a,b} f_a K_{ab} f_b = \sum_{\delta, \delta', i, i'} f_{\delta,i} \, K^{ii'}_{\delta,\delta'} \, f_{\delta',i'} = \sum_{\delta, \delta', i, i', \ell} f_{\delta,i} f_{\delta',i'} \, \frac{d z_{i';\delta'}}{d\theta_\ell} \frac{d z_{i;\delta}}{d\theta_\ell} = \sum_\ell \Big( \sum_{\delta,i} f_{\delta,i} \, \frac{d z_{i;\delta}}{d\theta_\ell} \Big)^2 \ge 0 .   (94)

Thus, the matrix K is a proper version of the positive semi-definite symmetric (PDS) kernel [4] in the sense of classical learning theory.
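The constructive step of the proof, K = J J^T with J_{a,\ell} = dz_a/d\theta_\ell, can be checked numerically with a short sketch; the random Jacobian below is an illustrative stand-in for the circuit derivatives.

```python
import numpy as np

rng = np.random.default_rng(4)
J = rng.normal(size=(10, 6))          # 10 joint indices a = (delta, i), 6 angles
K = J @ J.T                           # kernel of Eq. (90) in matrix form
f = rng.normal(size=10)
print(np.allclose(K, K.T),            # symmetric
      f @ K @ f >= -1e-12,            # quadratic form non-negative, Eq. (94)
      np.min(np.linalg.eigvalsh(K)) >= -1e-12)
```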

Now, putting the variational ansatz inside the kernel, we get

dzi;δdθ`

=⟨φ (xδ)

∣∣∣U†+,`[X`, U

†`W

†` U†−,`OiU−,`W`U`

]U+,`

∣∣∣φ (xδ)⟩,

dzi′;αdθ`

=⟨φ (xα)

∣∣∣U†+,`[X`, U

†`W

†` U†−,`Oi′U−,`W`U`

]U+,`

∣∣∣φ (xα)⟩, (95)

So we have

Kii′δ,α =

`

dzi;δdθ`

dzi′;αdθ`

= −∑

`

⟨φ (xδ)

∣∣∣U†+,`[X`, U

†`W

†` U†−,`OiU−,`W`U`

]U+,`

∣∣∣φ (xδ)⟩×⟨

φ (xα)∣∣∣U†+,`

[X`, U

†`W

†` U†−,`Oi′U−,`W`U`

]U+,`

∣∣∣φ (xα)⟩ . (96)

B. No representation learning

The statement of no representation learning corresponds to the limit where all changes of the variational angles are sufficiently close to zero. In this case, the QNTK becomes frozen (static), similar to the optimization problem. With the variational angles redefined, in the frozen QNTK limit we have

K^{ii'}_{\delta,\alpha} = -\delta^2 \sum_\ell \left\langle \phi(x_\delta) \left| W^\dagger_{+,\ell} \left[ X_\ell, \, W^\dagger_\ell W^\dagger_{-,\ell} O_i W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \phi(x_\delta) \right\rangle \times \left\langle \phi(x_\alpha) \left| W^\dagger_{+,\ell} \left[ X_\ell, \, W^\dagger_\ell W^\dagger_{-,\ell} O_{i'} W_{-,\ell} W_\ell \right] W_{+,\ell} \right| \phi(x_\alpha) \right\rangle .   (97)

Here, \delta is the factor used to redefine the variational angles \theta by \theta^* + \delta \times \varphi. Moreover, one can exactly solve the gradient descent dynamics. We start from the equation

d\varepsilon_a = -\eta \sum_b K_{ab} \, \varepsilon_b .   (98)

Thus we have

\varepsilon_{a_1}(t) = \sum_{a_2} U_{a_1 a_2}(t) \, \varepsilon_{a_2}(0) ,   (99)

where

U_{a_1 a_2}(t) = \big[ (1 - \eta K)^t \big]_{a_1 a_2} .   (100)

One can compute the convergence time as

\tau_c = \big\| -\log(1 - \eta K) \big\| \approx \eta \, \big\| K^{ii'}_{\delta,\alpha} \big\| .   (101)


And we could compute the prediction similarly by

za(∞) = za(0)− η∑

b

Kab

+∞∑

t=0

εb(t)

= za(0)− η∑

b

Kab

+∞∑

t=0,b0

[(1− ηK)

t]bb0

εb0(0)

= za(0)− η∑

b

Kab

b0

[(1− (1− ηK))

−1]bb0εb0(0)

= za(0)−∑

b

Kab

b0

[Ki,−1

]bb0εb0(0) . (102)

Moreover, we could compute the prediction by firstly defining the kernel inverse. We define

a∈A×O(H)

K a1a2Ka2a3 = δa1a3 , (103)

so we get

za(∞) = za(0)−∑

a1,a2

K a1a2Kaa1εa2(0) . (104)

C. Representation learning

Now we start to develop our quantum representation learning theory at the dQNTK order. We make a quadratic expansion:

εi;α =⟨φ(xα)

∣∣U†OiU∣∣φ(xα)

⟩− yi;α

≈⟨φ(xα)

∣∣∣∣∣

(1∏

`′=L

W †`′

)Oi

(L∏

`=1

W`

)∣∣∣∣∣φ(xα)

⟩− yi;α

− iδ∑

`

ϕ`

⟨φ(xα)

∣∣∣W †+,`[X`,W

†`W

†−,`OiW−,`W`

]W+,`

∣∣∣φ(xα)⟩− δ2

2

`1,`2

ϕ`1ϕ`2×

⟨ψ(xα)

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OiW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣ψ(xα)⟩

: `1 ≥ `2⟨ψ(xα)

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OiW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣ψ(xα)⟩

: `1 < `2. (105)

So we could write down the derivatives

dεi;αdϕ`

= −iδ⟨φ(xα)

∣∣∣W †+,`[X`,W

†`W

†−,`OiW−,`W`

]W+,`

∣∣∣φ(xα)⟩− δ2 ×

`′

ϕ`′×

⟨φ(xα)

∣∣∣W †+,`′[X`′ ,W

†`′W

†`′,`W

†+,`

[X`,W

†`W

†−,`OiW−,`W`

]W`,`′W`′

]W+,`′

∣∣∣φ(xα)⟩

: `′ ≥ `⟨φ(xα)

∣∣∣W †+,`[X`,W

†`W

†`,`′W

†+,`′

[X`′ ,W

†`′W

†−,`′OiW−,`′W`′

]W`′,`W`

]W+,`

∣∣∣φ(xα)⟩

: `′ < `. (106)


Similar derivation works for zi;δ in general. So we have

dzi;δ =∑

`

−iδ⟨φ(xδ)

∣∣∣W †+,`[X`,W

†`W

†−,`OiW−,`W`

]W+,`

∣∣∣φ(xδ)⟩− δ2

∑`′ϕ`′×

⟨φ(xδ)

∣∣∣W †+,`′[X`′ ,W

†`′W

†`′,`W

†+,`

[X`,W

†`W

†−,`OiW−,`W`

]W`,`′W`′

]W+,`′

∣∣∣φ(xδ)⟩

: `′ ≥ `⟨φ(xδ)

∣∣∣W †+,`[X`,W

†`W

†`,`′W

†+,`′

[X`′ ,W

†`′W

†−,`′OiW−,`′W`′

]W`′,`W`

]W+,`

∣∣∣φ(xδ)⟩

: `′ < `

dϕ`

− δ2

2

`1,`2

dϕ`1 dϕ`2×

⟨φ(xδ)

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OiW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣φ(xδ)⟩

: `1 ≥ `2⟨φ(xδ)

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OiW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣φ(xδ)⟩

: `1 < `2

.

(107)

The leading order piece is exactly the effective feature map. A more compact version is,

dzi;δ =∑

`

dzi;δdϕ`

dϕ` −1

2δ2∑

`1,`2

dϕ`1 dϕ`2Gδ,i`1,`2

, (108)

where we define

Gδ,i`1,`2 ≡ G`1,`2(φ(xδ), Oi) =

⟨φ(xδ)

∣∣∣W †+,`1[X`1 ,W

†`1W †`1,`2W

†+,`2

[X`2 ,W

†`2W †−,`2OiW−,`2W`2

]W`2,`1W`1

]W+,`1

∣∣∣φ(xδ)⟩

: `1 ≥ `2⟨φ(xδ)

∣∣∣W †+,`2[X`2 ,W

†`2W †`2,`1W

†+,`1

[X`1 ,W

†`1W †−,`1OiW−,`1W`1

]W`1,`2W`2

]W+,`2

∣∣∣φ(xδ)⟩

: `1 < `2

.

(109)

Moreover, we want to define

Θδ,i` ≡ Θ`(φ(xδ), Oi) =

⟨φ(xδ)

∣∣∣W †+,`[X`,W

†`W

†−,`OiW−,`W`

]W+,`

∣∣∣φ(xδ)⟩. (110)

We will now compute the effective kernel. We have,

dzi;δdϕ`

= −iδΘδ,i` − δ2

`′

ϕ`′Gδ,i`′,` ,

dzi′;αdϕ`

= −iδΘα,i′

` − δ2∑

`′

ϕ`′Gα,i′

`′,` . (111)

Thus, we define

KE,ii′

δ,α =∑

`

dzi;δdϕ`

dzi′;αdϕ`

, (112)

and

KE,ii′

δ,α (0) = Kii′δ,α +K∆,ii′

δ,α . (113)

We find

K∆,ii′

δ,α = iδ3∑

`,`′

ϕ`′(0)Gδ,i`′,`Θα,i′

` + iδ3∑

`,`′

ϕ`′(0)Gα,i′

`′,`Θδ,i` . (114)

Now we could write down the prediction on the training set. We have

εa(t) = εFa (t) + εIa(t) , (115)


where

εFa (t) =∑

a1

Uaa1(t)εa1(0) ,

Ua1a2(t) =[(1− ηK)

t]a1a2

, (116)

and

εIa(t) =

(−η

t−1∑

s=0

(1− ηK)t−1−s

K∆(1− ηK)sε(0)

)

a

. (117)

It is the compact matrix product form in the space A×O(H). Moreover, we have∥∥εI(t)

∥∥ ≤ ηt‖1− ηK‖t−1 ∥∥K∆∥∥ ‖ε(0)‖ . (118)

Finally, we discuss the asymptotic convergence regime. We could compute the quantum meta-kernel. We notice that

dzi;δ = −η∑

`,i′,α

dzi;δdϕ`

dzi′;αdϕ`

εi;α + η2∑

`1,`2,α1,i1,α2,i2

d2zi;δdϕ`1dϕ`2

dzi1;α1

dϕ`1

dzi2;α2

dϕ`2εi1;α1

εi2;α2. (119)

So we define

µii1i2δα1α2= µaa1a2 =

`1,`2

d2zi;δdϕ`1dϕ`2

(dzi1;α1

dϕ`1

dzi2;α2

dϕ`2

)∣∣∣∣ϕ=0

, (120)

in the leading order. Note that it is natural to extend the definition to more general inputs

µi0i1i2δ0δ1δ2= µa0a1a2 =

`1,`2

d2zi0;δ0

dϕ`1dϕ`2

(dzi1;δ1

dϕ`1

dzi2;δ2

dϕ`2

)∣∣∣∣∣∣ϕ=0

. (121)

So the asymptotic convergence is given by

za(∞) = za(0)−∑

a1,a2

Kaa1Ka1a2εa2(0)

+∑

a1,a2,a3,a4

µa1aa2 −

a5,a6

Kaa5Ka5a6µa1a6a2

Z a1a2a3a4A εa3(0)εa4(0)

+∑

a1,a2,a3,a4

µaa1a2 −

a5,a6

Kaa5Ka5a6µa6a1a2

Z a1a2a3a4B εa3(0)εa4(0) , (122)

where the algorithm projector ZA,Bs are

Z a1a2a3a4A ≡ K a1a3K a2a4 −∑

a5

K a2a5X a1a5a3a4‖ ,

Z a1a2a3a4B ≡ K a1a3K a2a4 −∑

a5

K a2a5X a1a5a3a4‖ +

η

2X a1a2a3a4‖ , (123)

and

X a1a2a3a4‖ =

∞∑

s=0

[(1− ηK)s]a1a3 [(1− ηK)

s]a2a4 , (124)

which is defined implicitly as

δa1a5 δa2a6

=∑

a3,a4

X a1a2a3a4‖

(Ka3a5δa4a6 + δa3a5Ka4a6 − ηKa3a5Ka4a6

). (125)


Now we write down all components of µ. We have

dzi1;δ1

dϕ`1(ϕ = 0) = −iδΘδ1,i1

`1,

dzi2;δ2

dϕ`2(ϕ = 0) = −iδΘδ2,i2

`2,

d2zi0;δ0

dϕ`1dϕ`2= −δ2Gδ0,i0`1,`2

. (126)

So

µi0i1i2δ0δ1δ2=∑

`1,`2

d2zi0;δ0

dϕ`1dϕ`2

(dzi1;δ1

dϕ`1

dzi2;δ2

dϕ`2

)∣∣∣∣∣∣ϕ=0

= δ4∑

`1,`2

Θδ1,i1`1

Θδ2,i2`2

Gδ0,i0`1,`2. (127)

IV. DETAILS ON THE QUANTUM MACHINE LEARNING: AMPLITUDE ENCODING

A. General setup

Here, we consider alternative quantum machine learning models with amplitude encoding. In this case, we naturally extend the NTK formalism to the complex version. We consider the variational setup where

z_{i;\delta} \equiv z_i(\theta, x_\delta) = \Big\langle i \Big| \prod_{\ell=1}^{L} W_\ell U_\ell \Big| \phi(x_\delta) \Big\rangle .   (128)

One could take the derivative,

dzi;δ =∑

`

dzi;δdθ`

dθ` , (129)

The loss is

LA(θ) =1

2

α,i

|yi;α − zi;α|2 =1

2

α,i

(yi;α − zi;α)(yi;α − zi;α)∗. (130)

So

dLA(θ)

dθ`=

1

2

α,i

εi;αdz∗i;αdθ`

+1

2

α,i

ε∗i;αdzi;αdθ`

= Re∑

α,i

εi;αdz∗i;αdθ`

. (131)

Note that if we count the number of times that U appears in the loss function, the amplitude encoding model is a square root of the operator expectation value model. The gradient descent rule is

dθ` = −η dLAdθ`

= −η2

α,i

εi;αdz∗i;αdθ`

− η

2

α,i

ε∗i;αdzi;αdθ`

, (132)

so we have

dzi;δ = −η∑

`,i′,α

dzi;δdθ`

Re

(εi;α

dz∗i′;αdθ`

)

= −η2

`,i′,α

εi′;αdzi;δdθ`

dz∗i′;αdθ`

− η

2

`,i′,α

ε∗i′;αdzi;δdθ`

dzi′;αdθ`

. (133)

We notice that the variable z is complex in general. So we could write,

dzi;δ = −η2

`,i′,α

εi′;αdzi;δdθ`

dz∗i′;αdθ`

− η

2

`,i′,α

ε∗i′;αdzi;δdθ`

dzi′;αdθ`

,

dz∗i;δ = −η2

`,i′,α

ε∗i′;αdz∗i;δdθ`

dzi′;αdθ`

− η

2

`,i′,α

εi′;αdz∗i;δdθ`

dz∗i′;αdθ`

. (134)


Defining the kernel

(K+,ii′

δ,α K−,ii′

δ,α

K∗,−,ii′

δ,α K∗,+,ii′

δ,α

)=

∑`

dzi;δdθ`

dz∗i′;αdθ`

∑`

dzi;δdθ`

dzi′;αdθ`

∑`

dz∗i;δdθ`

dz∗i′;αdθ`

∑`

dz∗i;δdθ`

dzi′;αdθ`

, (135)

we have

dzi;δ = −η2

α,i′

K+,ii′

δ,α εi′;α −η

2

α,i′

K−,ii′

δ,α ε∗i′;α ,

dz∗i;δ = −η2

α,i′

K∗,−,ii′

δ,α εi′;α −η

2

α,i′

K∗,+,ii′

δ,α ε∗i′;α , (136)

Here we also use K∗ to denote the complex conjugate. We introduce the worldsheet index α, β to denote the two-component ofthe complex variable (dz, dz∗). So we have [5]

dzαi;δ = −η2

α,β,i′

Kαβ,ii′

δ,α εβi′;α , (137)

where

εαi;δ =

(εi;δε∗i;δ

), Kαβ,ii′

δ,α =

(K+,ii′

δ,α K−,ii′

δ,α

K∗,−,ii′

δ,α K∗,+,ii′

δ,α

). (138)

We could also make the joint index

(δ, α, i) = µ , (α, β, i′) = ν , (139)

which is running in the space D × Z2 × H and A × Z2 × H respectively. The notation µ is indicating that the data pointcomponent belongs to the training setA, while the notation µmeans that the data point component is general inD. And we have

dzµ = −η2

ν

Kµνεν . (140)

In general, one could prove the following statement.

Theorem 2. The matrix Kµν is non-negative and Hermitian.

Proof. One could easily check that the matrix form

K†µν =

(K+,ii′

δ,α K−,ii′

δ,α

K∗,−,ii′

δ,α K∗,+,ii′

δ,α

)∗,T=

(K∗,+,ii

δ,α K∗−,ii′

δ,α

K−,ii′

δ,α K+,ii′

δ,α

)T

=

(K∗,+,ii

δ,α )T

(K−,ii′

δ,α )T

(K∗,−,ii′

δ,α )T

(K+,ii′

δ,α )T

=

(K+,ii′

δ,α K−,i′i

δ,α

K∗,−,ii′

δ,α K∗,+,ii′

δ,α

)= Kµν , (141)

is Hermitian. Moreover, consider an arbitrary vector fµ, we have

µ,ν

f∗µKµνfν =∑

δ,δ′,i,i′

(f∗δ,i fδ,i

)(

K+,ii′

δ,δ′ K−,ii′

δ,δ′

K∗,−,ii′

δ,δ′ K∗,+,ii′

δ,δ′

)(fδ′,i′f∗δ′,i′

)

=∑

δ,δ′,i,i′

(f∗δ,ifδ′,i′K

+,ii′

δ,δ′ + fδ,ifδ′,i′K∗,−,ii′δ,δ′ + f∗δ,if

∗δ′,i′K

−,ii′δ,δ′ + fδ,if

∗δ′,i′K

∗,+,ii′δ,δ′

)

=∑

δ,δ′,i,i′,`

(f∗δ,ifδ′,i′

dzi;δdθ`

dz∗i′;δ′dθ`

+ fδ,ifδ′,i′dz∗i;δdθ`

dz∗i′;δ′dθ`

+f∗δ,if∗δ′,i′

dzi;δdθ`

dzi′;δ′dθ`

+ fδ,if∗δ′,i′

dz∗i;δdθ`

dzi′;δ′dθ`

)

=∑

δ,δ′,i,i′,`

(fδ,i

dz∗i;δdθ`

+ f∗δ,idzi;δdθ`

)(fδ′,i′

dz∗i′;δ′

dθ`+ f∗δ′,i′

dzi′;δ′

dθ`

)

=∑

`

δ,i

fδ,idz∗i;δdθ`

+ f∗δ,idzi;δdθ`

2

= 4∑

`

Re

δ,i

fδ,idz∗i;δdθ`

2

≥ 0 . (142)


Thus, the matrix K is a complexified version of the positive semi-definite symmetric (PDS) kernel in the sense of the kernel method of statistical learning theory (see, for instance, [4]), but running during the training dynamics. Moreover, our work provides a complexified version of the NTK theory that could be useful for machine learning itself.

Now, putting the variational ansatz inside the kernel, we get

dzi;αdθ`

=d

dθ`

⟨i

∣∣∣∣∣L∏

`′=1

W`′U`′

∣∣∣∣∣φ (xα)

⟩= i

⟨i

∣∣∣∣∣`−1∏

`′=1

W`′U`′X`

L∏

`′′=`

W`′′U`′′

∣∣∣∣∣φ (xα)

⟩, (143)

dz∗i;αdθ`

=d

dθ`

⟨φ (xα)

∣∣∣∣∣1∏

`′=L

U†`′W†`′

∣∣∣∣∣ i⟩

= −i⟨φ (xα)

∣∣∣∣∣`+1∏

`′=L

U†`′W†`′X`

`′′=1

U†`′′W†`′′

∣∣∣∣∣ i⟩. (144)

And we have

dzi;αdθ`

= i 〈i |U−,`X`W`U`U+,`|φ (xα)〉 ,dz∗i;αdθ`

= −i⟨φ (xα)

∣∣∣U†+,`U†`W

†`X`U

†−,`

∣∣∣ i⟩. (145)

So we have∑

`

dzi;δdθ`

dz∗i′;αdθ`

=∑

`

〈i |U−,`X`W`U`U+,`|φ (xδ)〉⟨φ (xα)

∣∣∣U†+,`U†`W

†`X`U

†−,`

∣∣∣ i′⟩,

`

dzi;δdθ`

dzi′;αdθ`

= −∑

`

〈i |U−,`X`W`U`U+,`|φ (xδ)〉 〈i′ |U−,`X`W`U`U+,`|φ (xα)〉 , (146)

and their conjugates. Now, we define some notations. We define the feature density matrix

ρδα = |φ (xδ)〉 〈φ (xα)| , (147)

and the feature projector

Λi′;δ = |φ (xδ)〉 〈i′| . (148)

We have

K+,ii′

δ,α =∑

`

⟨i∣∣∣U−,`X`W`U`U+,`ρδαU

†+,`U

†`W

†`X`U

†−,`

∣∣∣ i′⟩,

K−,ii′

δ,α = −∑

`

〈i′ |U−,`X`W`U`U+,`Λi;αU−,`X`W`U`U+,`|φ (xδ)〉 . (149)

B. No representation learning

In the frozen QNTK limit with the variational angle redefined, one could compute the expressions of the kernel as

K+,ii′

δ,α = δ2∑

`

⟨i∣∣∣W−,`X`W`W+,`ρδαW

†+,`W

†`X`W

†−,`

∣∣∣ i′⟩,

K−,ii′

δ,α = −δ2∑

`

〈i′ |W−,`X`W`U+,`Λi;αW−,`X`W`W+,`|φ (xδ)〉 . (150)

Here, \delta is the factor used to redefine the variational angles \theta by \theta^* + \delta \times \varphi, and we define

W−,` ≡`−1∏

`′=1

W`′ , W+,` ≡L∏

`′=`+1

W`′ . (151)

Moreover, one could exactly solve the gradient descent dynamics. We start from the equation

dεµ = −η2

ν

Kµνεν . (152)


Thus we have

εµ1(t) =∑

µ2

Uµ1µ2(t)εµ2(0) , (153)

where

Uµ1µ2(t) =

[(1− η

2K)t]

µ1µ2

. (154)

One can compute the convergence time as

τc =∥∥∥− log

(1− η

2K)∥∥∥ ≈ η

2

∥∥∥∥∥

(K+,ii′

δ,α K−,ii′

δ,α

K∗,−,ii′

δ,α K∗,+,ii′

δ,α

)∥∥∥∥∥ . (155)

And we could compute the prediction similarly by

zµ(∞) = zµ(0)− η

2

ν

Kµν

+∞∑

t=0

εν(t)

= zµ(0)− η

2

ν

Kµν

+∞∑

t=0,ν0

[(1− η

2K)t]

νν0

εν0(0)

= zµ(0)− η

2

ν

Kµν

ν0

[(1−

(1− η

2K))−1

]

νν0

εν0(0)

= zµ(0)−∑

ν

Kµν

ν0

[Ki,−1

]νν0εν0(0) . (156)

Moreover, we could compute the prediction by firstly defining the kernel inverse. We define∑

µ∈A×Z2×HKµ1µ2Kµ2µ3

= δµ1

µ3, (157)

so we get

zµ(∞) = zµ(0)−∑

µ1,µ2

Kµ1µ2Kµµ1εµ2

(0) . (158)

C. Representation learning

Now we start to develop our quantum representation learning theory at the dQNTK order. We make a quadratic expansion:

U(θ)→∏

`

W` exp (iδϕ`X`) =∏

`

W` + iδ∑

`

ϕ`W`,−W`X`W`,+

− 1

2δ2∑

`1,`2

ϕ`1ϕ`2

W`1,−W`1X`1W`1,`2W`2X`2W`2,+ : `1 ≤ `2W`2,−W`2X`2W`2,`1W`1X`1W`1,+ : `1 > `2

. (159)

Now, we could call

G(1)` ≡W`X` ,

G(2)`1`2≡W`1,−W`1X`1W`1,`2W`2X`2W`2,+ : `1 ≤ `2W`2,−W`2X`2W`2,`1W`1X`1W`1,+ : `1 > `2

. (160)

So the model will look like

zj;δ =∑

i

(∏

`

W`)jiφi (xδ) + iδ∑

i,`

ϕ`G(1)`,jiφi (xδ)−

1

2δ2∑

i,`1,`2

ϕ`1ϕ`2G(2)`1`2,ji

φi (xδ) . (161)


So we could write down the derivatives

dzj;δdϕ`

= iδ∑

i

G(1)`,jiφi (xδ)−

1

2δ2∑

i,`′

ϕ`′(G(2)``′,ji +G

(2)`′`,ji)φi (xδ) . (162)

Moreover, we have

dzj;δ = iδ∑

i

`

dϕ`G(1)`,jiφi (xδ)−

1

2δ2∑

i

`1,`2

dϕ`1ϕ`2G(2)`1`2,ji

φi (xδ)

− 1

2δ2∑

i

`1,`2

ϕ`1 dϕ`2G(2)`1`2,ji

φi (xδ)−1

2δ2∑

i

`1,`2

dϕ`1 dϕ`2G(2)`1,`2,ji

φi (xδ) . (163)

The leading order piece is exactly the effective feature map. A more compact version is,

dzj;δ =∑

`

dzj;δdϕ`

dϕ` −1

2δ2∑

i

`1,`2

dϕ`1 dϕ`2G(2)`1,`2,ji

φi (xδ) . (164)

Firstly we need to compute the effective kernel, we have

dzj;δdϕ`

= iδ∑

i

G(1)`,jiφi (xδ)−

1

2δ2∑

i,`′

ϕ`′(G

(2)``′,ji +G

(2)`′`,ji

)φi (xδ) ,

dzi′;αdϕ`

= iδ∑

i

G(1)`,i′iφi (xα)− 1

2δ2∑

i,`′

ϕ`′(G

(2)``′,i′i +G

(2)`′`,i′i

)φi (xα) ,

dz∗i′;αdϕ`

= −iδ∑

i

φ∗i (xα)G†,(1)`,ii′ −

1

2δ2∑

i,`′

ϕ`′φ∗i (xα)

(G†,(2)``′,ii′ +G

†,(2)`′`,ii′

). (165)

Thus, we define

KE,αβ,ii′

δ,α =

(KE,+,,ii′

δ,α KE,−,,ii′δ,α

K∗,E,−,,ii′

δ,α K∗,E,+,,ii′

δ,α

)(ϕ) =

∑`

dzi;δdϕ`

dz∗i′;αdϕ`

∑`

dzi;δdϕ`

dzi′;αdϕ`

∑`

dz∗i;δdϕ`

dz∗i′;αdϕ`

∑`

dz∗i;δdϕ`

dzi′;αdϕ`

, (166)

and

KE,αβ,ii′

δ,α (0) = Kαβ,ii′

δ,α +K∆,αβ,ii′

δ,α =

(K+,ii′

δ,α K−,ii′

δ,α

K∗,−,ii′

δ,α K∗,+,jδ,α

)+

(K∆,+,ii′

δ,α K∆,−,ii′δ,α

K∗,∆,−,ii′

δ,α K∗,∆,+,ii′

δ,α

). (167)

We find

K∆,+,ii′

δ,α =i

2δ3∑

j,`′,i′′

ϕ`′(0)(G

(2)``′,ij +G

(2)`′`,ij

)G†,(1)`,i′′i′φj (xδ)φ

∗i′′ (xα)

− i

2δ3∑

j,`′,i′′

ϕ`′(0)(G†,(2)``′,i′′i′ +G

†,(2)`′`,i′′i′

)G

(1)`,ijφj (xδ)φ

∗i′′ (xα) ,

K∆,−,ii′δ,α = − i

2δ3∑

j,`′,i′′

ϕ`′(0)(G

(2)``′,ij +G

(2)`′`,ij

)G

(1)`,i′i′′φj (xδ)φi′′ (xα)

− i

2δ3∑

j,`′,i′′

ϕ`′(0)(G

(2)``′,i′i′′ +G

(2)`′`,i′i′′

)G

(1)`,ijφj (xδ)φi′′ (xα) . (168)

We could also write it in the bracket fashion

K∆,+,ii′

δ,α =i

2δ3∑

`′

ϕ`′(0)

〈i|

(G

(2)``′ +G

(2)`′`

)|φ (xδ)〉 〈φ (xα)|G†,(1)

` |i′〉− 〈φ (xα)|

(G†,(2)``′ +G

†,(2)`′`

)|i′〉 〈i|G(1)

` |φ (xδ)〉

,

K∆,−,ii′δ,α = − i

2δ3∑

`′

ϕ`′(0)

〈i|

(G

(2)``′ +G

(2)`′`

)|φ (xδ)〉 〈i′|G(1)

` |φ (xα)〉+ 〈i′|

(G

(2)``′ +G

(2)`′`

)|φi′′ (xα)〉 〈i|G(1)

` |φ (xδ)〉

. (169)


Now we could write down the prediction on the training set. We have

εµ(t) = εFµ (t) + εIµ(t) , (170)

where

εFµ (t) =∑

µ1

Uµµ1(t)εµ1

(0) ,

Uµ1µ2(t) =

[(1− η

2K)t]

µ1µ2

, (171)

and

εIµ(t) =

(−η

2

t−1∑

s=0

(1− η

2K)t−1−s

K∆(

1− η

2K)sε(0)

)

µ

. (172)

It is the compact matrix product form in the space A× Z2 ×H. Moreover, we have

∥∥εI(t)∥∥ ≤ η

2t∥∥∥1− η

2K∥∥∥t−1 ∥∥K∆

∥∥ ‖ε(0)‖ . (173)

Finally, we could discuss the asymptotic convergence. We could start by computing the meta-kernel. We notice that

dzi;δ = −η2

`,i′,α

εi;αdzi;δdϕ`

dz∗i′;αdϕ`

− η

2

`,i′,α

ε∗i;αdzi;δdϕ`

dzi′;αdϕ`

+(η

2

)2 ∑

`1,`2

d2zi;δdϕ`1dϕ`2

α1,i1

εi1;α1

dz∗i1;α1

dϕ`1+∑

α1,i1

ε∗i1;α1

dzi1;α1

dϕ`2

α2,i2

εi2;α2

dz∗i2;α2

dϕ`2+∑

α2,i2

ε∗i2;α2

dzi2;α2

dϕ`2

, (174)

We have

`1,`2

d2zi;δdϕ`1dϕ`2

α1,i1

εi1;α1

dz∗i1;α1

dϕ`1+∑

α1,i1

ε∗i1;α1

dzi1;α1

dϕ`2

α2,i2

εi2;α2

dz∗i2;α2

dϕ`2+∑

α2,i2

ε∗i2;α2

dzi2;α2

dϕ`2

=∑

`1,`2,α1,i1,α2,i2,α1,α2

d2zi;δdϕ`1dϕ`2

dz1−α1

i1;α1

dϕ`1

dz1−α2

i2;α2

dϕ`2εα1

i1;α1εα2

i2;α2. (175)

So we define

µii1i2;βα1α2

δα1α2= µµµ1µ2

=∑

`1,`2

d2zβi;δdϕ`1dϕ`2

(dz1−α1

i1;α1

dϕ`1

dz1−α2

i2;α2

dϕ`2

)∣∣∣∣∣ϕ=0

, (176)

in the leading order. Moreover, in general we have,

µi0i1i2;α0α1α2

δ0δ1δ2= µµ0µ1µ2 =

`1,`2

d2zα0

i0;δ0

dϕ`1dϕ`2

(dz1−α1

i1;δ1

dϕ`1

dz1−α2

i2;δ2

dϕ`2

)∣∣∣∣∣∣ϕ=0

. (177)

So the asymptotic convergence is given by

zµ(∞) = zµ(0)−∑

µ1,µ2

Kµµ1Kµ1µ2εµ2

(0)

+∑

µ1,µ2,µ3,µ4

µµ1µµ2

−∑

µ5,µ6

Kµµ5Kµ5µ6µµ1µ6µ2

Zµ1µ2µ3µ4

A εµ3(0)εµ4

(0)

+∑

µ1,µ2,µ3,µ4

µµµ1µ2

−∑

µ5,µ6

Kµµ5Kµ5µ6µµ6µ1µ2

Zµ1µ2µ3µ4

B εµ3(0)εµ4

(0) , (178)


where the algorithm projector ZA,Bs are

Zµ1µ2µ3µ4

A ≡ Kµ1µ3Kµ2µ4 −∑

µ5

Kµ2µ5X µ1µ5µ3µ4

‖ ,

Zµ1µ2µ3µ4

B ≡ Kµ1µ3Kµ2µ4 −∑

µ5

Kµ2µ5X µ1µ5µ3µ4

‖ +η

4X µ1µ2µ3µ4

‖ , (179)

and

X µ1µ2µ3µ4

‖ =∞∑

s=0

[(1− η

2K)

s]µ1µ3

[(1− η

2K)

s]µ2µ4

, (180)

which is defined implicitly as

δµ1

µ5δµ2

µ6=∑

µ3,µ4

X µ1µ2µ3µ4

(Kµ3µ5

δµ4µ6+ δµ3µ5

Kµ4µ6− η

2Kµ3µ5

Kµ4µ6

). (181)

Now we write down all components of µ. We have

d2zi0;δ0

dϕ`1dϕ`2= −1

2δ2∑

i

(G

(2)`1`2,i0i

+G(2)`2`1,i0i

)φi (xδ0) ,

d2z∗i0;δ0

dϕ`1dϕ`2= −1

2δ2∑

i

φ∗i (xδ0)(G†,(2)`1`2,ii0

+G†,(2)`2`1,ii0

),

dz∗i1;δ1

dϕ`1= −iδ

i′

φ∗i′ (xδ1)G†,(1)`1,i′i1

,dzi1;δ1

dϕ`1= iδ

i′

G(1)`1,i1i′

φi′ (xδ1) ,

dz∗i2;δ2

dϕ`2= −iδ

i′′

φ∗i′′ (xδ2)G†,(1)`2,i′′i2

,dzi2;δ2

dϕ`2= iδ

i′′

G(1)`2,i2i′′

φi′′ (xδ2) . (182)

So

µi0i1i2;000δ0δ1δ2

=1

2δ4

i,i′,i′′,`1,`2

(G

(2)`1`2,i0i

+G(2)`2`1,i0i

)G†,(1)`1,i′i1

G†,(2)`2,i′′i2

φi (xδ0)φ∗i′ (xδ1)φ∗i′′ (xδ2) ,

µi0i1i2;001δ0δ1δ2

= −1

2δ4

i,i′,i′′,`1,`2

(G

(2)`1`2,i0i

+G(2)`2`1,i0i

)G†,(1)`1,i′i1

G(2)`2,i2i′′

φi (xδ0)φ∗i′ (xδ1)φi′′ (xδ2) ,

µi0i1i2;010δ0δ1δ2

= −1

2δ4

i,i′,i′′,`1,`2

(G

(2)`1`2,i0i

+G(2)`2`1,i0i

)G

(1)`1,i1i′

G†,(2)`2,i′′i2

φi (xδ0)φi′ (xδ1)φ∗i′′ (xδ2) ,

µi0i1i2;011δ0δ1δ2

=1

2δ4

i,i′,i′′,`1,`2

(G

(2)`1`2,i0i

+G(2)`2`1,i0i

)G

(1)`1,i1i′

G(2)`2,i2i′′

φi (xδ0)φi′ (xδ1)φi′′ (xδ2) ,

µi0i1i2;100δ0δ1δ2

=1

2δ4

i,i′,i′′,`1,`2

(G†,(2)`1`2,ii0

+G†,(2)`2`1,ii0

)G†,(1)`1,i′i1

G†,(2)`2,i′′i2

φ∗i (xδ0)φ∗i′ (xδ1)φ∗i′′ (xδ2) ,

µi0i1i2;101δ0δ1δ2

= −1

2δ4

i,i′,i′′,`1,`2

(G†,(2)`1`2,ii0

+G†,(2)`2`1,ii0

)G†,(1)`1,i′i1

G(2)`2,i2i′′

φ∗i (xδ0)φ∗i′ (xδ1)φi′′ (xδ2) ,

µi0i1i2;110δ0δ1δ2

= −1

2δ4

i,i′,i′′,`1,`2

(G†,(2)`1`2,ii0

+G†,(2)`2`1,ii0

)G

(1)`1,i1i′

G†,(2)`2,i′′i2

φ∗i (xδ0)φi′ (xδ1)φ∗i′′ (xδ2) ,

µi0i1i2;111δ0δ1δ2

=1

2δ4

i,i′,i′′,`1,`2

(G†,(2)`1`2,ii0

+G†,(2)`2`1,ii0

)G

(1)`1,i1i′

G(2)`2,i2i′′

φ∗i (xδ0)φi′ (xδ1)φi′′ (xδ2) . (183)


One could also write them in the bracket notations.

µi0i1i2;000δ0δ1δ2

=1

2δ4∑

`1,`2

〈i0|(G

(2)`1`2

+G(2)`2`1

)|φ (xδ0)〉 〈φ (xδ1)|G†,(1)

`1|i1〉 〈φ (xδ2)|G†,(1)

`2|i2〉 ,

µi0i1i2;001δ0δ1δ2

= −1

2δ4∑

`1,`2

〈i0|(G

(2)`1`2

+G(2)`2`1

)|φ (xδ0)〉 〈φ (xδ1)|G†,(1)

`1|i1〉 〈i2|G(1)

`2|φ (xδ2)〉 ,

µi0i1i2;010δ0δ1δ2

= −1

2δ4∑

`1,`2

〈i0|(G

(2)`1`2

+G(2)`2`1

)|φ (xδ0)〉 〈i1|G(1)

`1|φ (xδ1)〉 〈φ (xδ2)|G†,(1)

`2|i2〉 ,

µi0i1i2;011δ0δ1δ2

=1

2δ4∑

`1,`2

〈i0|(G

(2)`1`2

+G(2)`2`1

)|φ (xδ0)〉 〈i1|G(1)

`1|φ (xδ1)〉 〈i2|G(1)

`2|φ (xδ2)〉 ,

µi0i1i2;100δ0δ1δ2

=1

2δ4∑

`1,`2

〈φ (xδ0)|(G†,(2)`1`2

+G†,(2)`2`1

)|i0〉 〈φ (xδ1)|G†,(1)

`1|i1〉 〈φ (xδ2)|G†,(1)

`2|i2〉 ,

µi0i1i2;101δ0δ1δ2

= −1

2δ4∑

`1,`2

〈φ (xδ0)|(G†,(2)`1`2

+G†,(2)`2`1

)|i0〉 〈φ (xδ1)|G†,(1)

`1|i1〉 〈i2|G(1)

`2|φ (xδ2)〉 ,

µi0i1i2;110δ0δ1δ2

= −1

2δ4∑

`1,`2

〈φ (xδ0)|(G†,(2)`1`2

+G†,(2)`2`1

)|i0〉 〈i1|G(1)

`1|φ (xδ1)〉 〈φ (xδ2)|G†,(1)

`2|i2〉 ,

µi0i1i2;111δ0δ1δ2

=1

2δ4∑

`1,`2

〈φ (xδ0)|(G†,(2)`1`2

+G†,(2)`2`1

)|i0〉 〈i1|G(1)

`1|φ (xδ1)〉 〈i2|G(1)

`2|φ (xδ2)〉 . (184)

D. Reading the amplitude

Here, we review the amplitude protocol for evaluating the inner product of two states, \langle x | y \rangle [6]. The protocol is very similar to the celebrated Hadamard test. For a given pair of quantum states |x\rangle and |y\rangle, we need access to the state

|\varphi\rangle = \frac{1}{\sqrt{2}} \left( |0\rangle |x\rangle + |1\rangle |y\rangle \right) .   (185)

The first qubit serves as an ancillary qubit. Applying a Hadamard gate to the first qubit, we get

|\varphi\rangle \to \frac{1}{2} \big( |0\rangle (|x\rangle + |y\rangle) + |1\rangle (|x\rangle - |y\rangle) \big) .   (186)

Now we can measure the probability of obtaining |0\rangle in the ancillary system. We have

p = \frac{1}{2} \left( 1 + \mathrm{Re} \, \langle x | y \rangle \right) .   (187)

Thus, we can use this probability to estimate the real part of the inner product. Moreover, we can add a phase rotation to |\varphi\rangle to get

|\varphi\rangle = \frac{1}{\sqrt{2}} \left( |0\rangle |x\rangle - i |1\rangle |y\rangle \right) .   (188)

Then, we apply the same Hadamard gate and measure the first qubit in the Pauli-Z basis. We obtain |0\rangle with probability

p = \frac{1}{2} \left( 1 + \mathrm{Im} \, \langle x | y \rangle \right) .   (189)

Similar to the Hadamard test, where we need access to a controlled unitary acting on an arbitrary state, here we need access to the state |\varphi\rangle. The inner product evaluation has a statistical error coming from the measurement: if we measure N times, the error scales as 1/\sqrt{N}.
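A short numpy sketch of this protocol is given below: it builds the state of Eq. (185) for two random single-qubit states, applies a Hadamard on the ancilla, and checks that the probability of measuring the ancilla in |0\rangle reproduces Eq. (187). The random states are illustrative; on hardware one would estimate p from N shots with the stated 1/\sqrt{N} error.

```python
import numpy as np

rng = np.random.default_rng(5)

def random_state(d=2):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

x, y = random_state(), random_state()
phi = (np.kron([1, 0], x) + np.kron([0, 1], y)) / np.sqrt(2)   # (|0>|x> + |1>|y>)/sqrt(2)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
out = np.kron(H, np.eye(2)) @ phi                              # Hadamard on the ancilla

p0 = np.linalg.norm(out[:2]) ** 2                              # P(ancilla = 0)
print(p0, 0.5 * (1 + np.real(np.vdot(x, y))))                  # the two agree, Eq. (187)
```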


V. SUPPRESSION OF NON-GAUSSIANITY IN THE LARGE-WIDTH LIMIT

In this section, we revisit the statistics of hybrid quantum-classical neural networks. The model is defined as follows. First, we initialize the neural network with a quantum model,

z^Q_{1;\alpha;j_1} = \left\langle \phi_1(x_\alpha) \left| U^{1,\dagger}(\theta^1) \, O^1_{j_1} \, U^1(\theta^1) \right| \phi_1(x_\alpha) \right\rangle .   (190)

Here we use the notation j_\omega to denote the index of the operator space \mathcal{O}_\omega(\mathcal{H}_\omega), where \omega \in [1, 2, \cdots, \Omega] denotes the layer of the hybrid quantum-classical neural network. Here we are at the first layer, so \omega = 1. We use z^Q_{\omega;\alpha;j_\omega} to denote the quantum model output. We also use

U^\omega(\theta^\omega) = \prod_{\ell_\omega=1}^{L_\omega} W^\omega_{\ell_\omega} \exp\!\left( i \theta^\omega_{\ell_\omega} X^\omega_{\ell_\omega} \right)   (191)

to denote the \omega-th quantum ansatz, and \phi_\omega denotes the \omega-th feature map. In each layer, after the quantum network, we connect it with a classical neural network by

w^C_{\omega;\alpha;j^C_\omega} = \sigma^\omega_{j^C_\omega} \Big( \sum_{j_\omega=1}^{\dim \mathcal{O}_\omega(\mathcal{H}_\omega)} W^\omega_{j^C_\omega, j_\omega} \, z^Q_{\omega;\alpha;j_\omega} + b^\omega_{j^C_\omega} \Big) \equiv \sigma^\omega_{j^C_\omega} \big( z^C_{\omega;\alpha;j^C_\omega} \big) .   (192)

Here, we call

z^C_{\omega;\alpha;j^C_\omega} = \sum_{j_\omega=1}^{\dim \mathcal{O}_\omega(\mathcal{H}_\omega)} W^\omega_{j^C_\omega, j_\omega} \, z^Q_{\omega;\alpha;j_\omega} + b^\omega_{j^C_\omega}   (193)

the classical preactivation in each layer, and we use the non-linear activation \sigma. At initialization, we set all W's and b's distributed randomly from the following Gaussian statistics:

\mathbb{E}\left( W^\omega_{j^C_{1,\omega}, j_{1,\omega}} W^\omega_{j^C_{2,\omega}, j_{2,\omega}} \right) = \delta_{j^C_{1,\omega}, j^C_{2,\omega}} \, \delta_{j_{1,\omega}, j_{2,\omega}} \, \frac{C^\omega_W}{\dim \mathcal{O}_\omega(\mathcal{H}_\omega)} ,
\mathbb{E}\left( b^\omega_{j^C_{1,\omega}} b^\omega_{j^C_{2,\omega}} \right) = \delta_{j^C_{1,\omega}, j^C_{2,\omega}} \, C^\omega_b .   (194)

Moreover, after the classical network in each layer, we can move to the next quantum layer through the following encoding,

z^Q_{\omega;\alpha;j_\omega} = \left\langle \phi_\omega(w_{\omega-1;\alpha}) \left| U^{\omega,\dagger}(\theta^\omega) \, O^\omega_{j_\omega} \, U^\omega(\theta^\omega) \right| \phi_\omega(w_{\omega-1;\alpha}) \right\rangle ,   (195)

where we use the vector notation w_{\omega;\alpha} = (w_{\omega;\alpha})_{j^C_\omega}. Moreover, we will assume that the initial distribution of quantum variational angles is given by a statistical ensemble. We denote all such ensemble averages by \mathbb{E}. One can, for instance, compute the two-point function as

E(zCω;α;jC1,ω

zCω;β;jC2,ω

)= E

dimOω(Hω)∑

j1,ω=1

WωjC1,ωj1,ω

zQω;α;j1,ω+ bωjC1,ω

dimOω(Hω)∑

j2,ω=1

WωjC2,ω,j2,ω

zQω;β;j2,ω+ bωjC2,ω

=

dimOω(Hω)∑

j1,ω,j2,ω=1

E(WωjC1,ωj1,ω

WωjC2,ω,j2,ω

)E(zQω;α;j1,ω

zQω;β;j2,ω

)+ E

(bωjC1,ω

bωjC2,ω

)

=CωW

dimOω(Hω)

dimOω(Hω)∑

j1,ω,j2,ω=1

δjC1,ω,jC2,ωδj1,ω,j2,ωE(zQω;α;j1,ωzQω;β;j2,ω

) + δjC1,ω,jC2,ωCωb

= δjC1,ω,jC2,ω

CωW

dimOω(Hω)

dimOω(Hω)∑

jω=1

E(zQω;α;jωzQω;β;jω

) + Cωb

. (196)


In order to compute non-Gaussianities of preactivations, we start from the connected part of the two-point function, which isgiven by

Econn

(zCω;α1;jC1,ω

zCω;α2;jC2,ωzCω;α3;jC3,ω

zCω;α4;jC4,ω

)

≡ E(zCω;α1;jC1,ω

zCω;α2;jC2,ωzCω;α3;jC3,ω

zCω;α4;jC4,ω

)− E

(zCω;α1;jC1,ω

zCω;α2;jC2,ω

)E(zCω;α3;jC3,ω

zCω;α4;jC4,ω

)

− E(zCω;α1;jC1,ω

zCω;α3;jC3,ω

)E(zCω;α2;jC2,ω

zCω;α4;jC4,ω

)− E

(zCω;α1;jC1,ω

zCω;α4;jC4,ω

)E(zCω;α2;jC2,ω

zCω;α3;jC3,ω

). (197)

We proceed by direct computation. We have

E(zCω;α1;jC1,ω

zCω;α2;jC2,ωzCω;α3;jC3,ω

zCω;α4;jC4,ω

)

= E

(dimOω(Hω)∑j1,ω=1

WωjC1,ω,j1,ω

zQω;α1;j1,ω+ bω

jC1,ω

)(dimOω(Hω)∑j2,ω=1

WωjC2,ω,j2,ω

zQω;α2;j2,ω+ bω

jC2,ω

)

(dimOω(Hω)∑j3,ω=1

WωjC3,ω,j3,ω

zQω;α3;j3,ω+ bω

jC3,ω

)(dimOω(Hω)∑j4,ω=1

WωjC4,ω,j4,ω

zQω;α4;j4,ω+ bω

jC4,ω

)

= E

dimOω(Hω)∑

j1,ω,j2,ω,j3,ω,j4,ω=1

WωjC1,ω,j1,ω

WωjC2,ω,j2,ω

WωjC3,ω,j3,ω

WωjC4,ω,j4,ω

zQω;α1;j1,ωzQω;α2;j2,ω

zQω;α3;j3,ωzQω;α4;j4,ω

+ E

bωjC3,ωb

ωjC4,ω

dimOω(Hω)∑

j1,ω,j2,ω=1

WωjC1,ω,j1,ω

WωjC2,ω,j2,ω

zQω;α1;j1,ωzQω;α2;j2,ω

+ (5 perms.)

+ E(bωjC1,ω

bωjC2,ωbωjC3,ω

bωjC4,ω

). (198)

The notation 5 perms means that(

42

)− 1 = 5 permutations of indices 1,2,3,4. Now, combining with the disconnected

four-point function result,

E(zCω;α1;jC1,ω

zCω;α2;jC2,ω

)E(zCω;α3;jC3,ω

zCω;α4;jC4,ω

)

= δjC1,ω,jC2,ωδjC3,ω,jC4,ω

(CωW

dimOω(Hω)

dimOω(Hω)∑j1,ω=1

E(zQω;α1;j1,ω

zQω;α2;j1,ω

)+ Cωb

)

(CωW

dimOω(Hω)

dimOω(Hω)∑j2,ω=1

E(zQω;α3;j2,ω

zQω;α4;j2,ω

)+ Cωb

)

= δjC1,ω,jC2,ωδjC3,ω,jC4,ω

(CωW

dimOω(Hω)

)2(

dimOω(Hω)∑j1,ω=1

E(zQω;α1;j1,ω

zQω;α2;j1,ω

))(dimOω(Hω)∑j2,ω=1

E(zQω;α3;j2,ω

zQω;α4;j2,ω

))

+CωWC

ωb

dimOω(Hω)

(dimOω(Hω)∑j1,ω=1

E(zQω;α1;j1,ω

zQω;α2;j1,ω

)+

dimOω(Hω)∑j2,ω=1

E(zQω;α3;j2,ω

zQω;α4;j2,ω

))

+(Cωb )2

,

(199)

and its two other t (14-23) and u-channel (13-24) permutations, we have

Econn

(zCω;α1;jC1,ω

zCω;α2;jC2,ωzCω;α3;jC3,ω

zCω;α4;jC4,ω

)

=

(CωW

dimOω(Hω)

)2

δjC1,ω,jC2,ωδjC3,ω,jC4,ω

dimOω(Hω)∑j1,ω,j2,ω=1

Econn

(zQω;α1;j1,ω

zQω;α2;j1,ωzQω;α3;j2,ω

zQω;α4;j2,ω

)

+δjC1,ω,jC3,ωδjC2,ω,jC4,ω

dimOω(Hω)∑j1,ω,j2,ω=1

Econn

(zQω;α1;j1,ω

zQω;α3;j1,ωzQω;α2;j2,ω

zQω;α4;j2,ω

)

+δjC1,ω,jC4,ωδjC2,ω,jC3,ω

dimOω(Hω)∑j1,ω,j2,ω=1

Econn

(zQω;α1;j1,ω

zQω;α4;j1,ωzQω;α2;j2,ω

zQω;α3;j2,ω

)

. (200)


Here, the connected piece made by quantum circuits is given by

Econn

(zQω;α1;j1,ω

zQω;α2;j1,ωzQω;α3;j2,ω

zQω;α4;j2,ω

)

≡ E(zQω;α1;j1,ω

zQω;α2;j1,ωzQω;α3;j2,ω

zQω;α4;j2,ω

)

− E(zQω;α1;j1,ω

zQω;α2;j1,ω

)E(zQω;α3;j2,ω

zQω;α4;j2,ω

). (201)

Now, we note that since we have O\big( (\dim \mathcal{O}_\omega(\mathcal{H}_\omega))^2 \big) terms in the sum, the connected part will at most scale as O(1), since the quantum outputs are made by normalized states whose norms are bounded by 1. However, if we assume

Econn

(zQω;αω;j1,ω

zQω;αω;j1,ωzQω;α3;j2,ω

zQω;α4;j2,ω

)= O(1)× δj1,ω,j2,ω , (202)

and their permutations for all ωs, we have

Econn

(zQω;α1;j1,ω

zQω;α2;j1,ωzQω;α3;j2,ω

zQω;α4;j2,ω

)

=

(CωW

dimOω(Hω)

)2 (δjC1,ω,jC2,ωδjC3,ω,jC4,ω + δjC1,ω,jC3,ωδjC2,ω,jC4,ω + δjC1,ω,jC4,ωδjC2,ω,jC3,ω

dimOω(Hω)∑

j1,ω,j2,ω=1

δj1,ω,j2,ω ×O(1)

=

(CωW

dimOω(Hω)

)2

×O(dimOω(Hω))×(δjC1,ω,jC2,ωδjC3,ω,jC4,ω + δjC1,ω,jC3,ωδjC2,ω,jC4,ω + δjC1,ω,jC4,ωδjC2,ω,jC3,ω

)

=(CωW )

2

dimOω(Hω)×O(1)×

(δjC1,ω,jC2,ωδjC3,ω,jC4,ω + δjC1,ω,jC3,ωδjC2,ω,jC4,ω + δjC1,ω,jC4,ωδjC2,ω,jC3,ω

). (203)

Thus, the connected part is suppressed by the large width, 1/\dim \mathcal{O}_\omega(\mathcal{H}_\omega). The orthogonality condition we impose for quantum neural networks might be satisfied by random assumptions on the choices of operators O or by random variational ansatze (say, averaging over the Pauli group or a 1-design). Similar arguments can be made for the dynamical NTK. It is beyond the scope of this work to discuss how to connect the suppressed non-Gaussian correlations, randomized over initialization, with the suppression of the dNTK (dQNTK) in dynamics at large width. Thus, we hereby make more general arguments and leave the detailed analysis for the future. Consider the schematic formula of the dQNTK in the hybrid network,

\mu^{i_0 i_1 i_2}_{\delta_0 \delta_1 \delta_2} = \sum_{\ell_1, \ell_2} \frac{d^2 z_{i_0;\delta_0}}{d\theta_{\ell_1} d\theta_{\ell_2}} \left( \frac{d z_{i_1;\delta_1}}{d\theta_{\ell_1}} \frac{d z_{i_2;\delta_2}}{d\theta_{\ell_2}} \right) ,   (204)

where we schematically use the notation \theta_\ell to denote both classical and quantum training variables. The sum leads to a combination of three cases: a classical contribution, a quantum contribution, and a mixed classical-quantum contribution. For the purely classical contribution, the Gaussian correlations lead to a 1/width suppression. For the quantum contribution and its mixture with classical ones, the meta-kernel is naturally suppressed by 1/\dim \mathcal{O}(\mathcal{H}) for the corresponding operator space dimension \dim \mathcal{O}(\mathcal{H}). If these suppressions do not hold, there are non-negligible terms that give O(1) modifications to the NTK during gradient descent, leading to a non-linear regime of training dynamics and representation learning; otherwise, the training dynamics linearize, and we get an exponential convergence of the residual training error. These observations indicate a possible connection between barren plateaus, random unitaries, and the large-width limit of classical neural networks, whose details we leave to future work.
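The 1/width suppression of the connected four-point function can be illustrated with a short numpy sketch. Conditioned on the quantum outputs z^Q, the preactivation of Eqs. (192)-(194) is exactly Gaussian, so its fourth cumulant equals three times the variance of the conditional variance, which we estimate by sampling only z^Q; the distribution of z^Q and the constants used here are illustrative stand-ins for a bounded quantum output.

```python
import numpy as np

rng = np.random.default_rng(6)
C_W, C_b, n_samples = 1.0, 0.1, 200_000

for width in (4, 16, 64, 256):
    zQ = rng.uniform(-1.0, 1.0, size=(n_samples, width))    # bounded "quantum" outputs
    cond_var = C_W / width * np.sum(zQ ** 2, axis=1) + C_b  # Var(z^C | z^Q), LeCun scaling
    kappa4 = 3.0 * np.var(cond_var)                         # connected 4-pt of z^C
    print(width, kappa4)                                    # decays roughly as 1/width
```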

VI. ON THE LECUN PARAMETRIZATION AND THE NTK PARAMETRIZATION

In this section, we clarify the initialization (parametrization) convention used in this work compared with different conventions in the literature. In the original papers on the classical NTK [7], there is a convention called the NTK parametrization, which defines the Gaussian correlation of the weights as

E\!\left( W_{ij} W_{kl} \right) = \delta_{ik}\,\delta_{jl}\, C_{W} .   (205)

This initialization leads to a rescaling of the parameters, as summarized in Table I below.

In our work we use the standard (LeCun) scaling, with which we still naturally obtain the NTK and the dNTK suppression results [1]. There are also studies pointing out alternative parametrizations [8].


Type    Weight                                    Single-layer         Learning rate
LeCun   E(W_ij W_kl) = δ_ik δ_jl C_W / width      W x + b              η = O(1)/width
NTK     E(W_ij W_kl) = δ_ik δ_jl C_W              (W x + b)/√width     η = O(1)

TABLE I. Comparison between the LeCun (standard) parametrization and the NTK parametrization.
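As a quick check of Table I, the following minimal numpy sketch (our illustration; the bias is omitted for brevity and all names are ours) shows that both parametrizations give pre-activations with the same O(1) variance at initialization; what differs is where the 1/width factor sits.

import numpy as np

rng = np.random.default_rng(1)
width, C_W = 512, 1.0
x = rng.normal(size=width)                     # O(1)-normalized input vector

# LeCun (standard) parametrization: Var(W_ij) = C_W / width, layer acts as W x + b.
W_lecun = rng.normal(0.0, np.sqrt(C_W / width), size=(width, width))
z_lecun = W_lecun @ x                          # trained with learning rate O(1)/width

# NTK parametrization: Var(W_ij) = C_W, layer acts as (W x + b) / sqrt(width).
W_ntk = rng.normal(0.0, np.sqrt(C_W), size=(width, width))
z_ntk = (W_ntk @ x) / np.sqrt(width)           # trained with learning rate O(1)

print(np.var(z_lecun), np.var(z_ntk))          # both variances are O(1) and agree statistically

The difference between the two conventions therefore shows up only in training, where the LeCun convention requires a learning rate of order O(1)/width to reproduce the O(1) learning rate of the NTK convention.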

VII. SIMULATION DETAILS AND ADDITIONAL DATA

We simulate the NTK dynamics in a concrete example using the Qiskit Python library. In the main text, we consider a binary classification task carried out with a 3-qubit quantum neural network (QNN) in a supervised learning setting, using an ad hoc data set provided in Qiskit Machine Learning. The input data points have 3-dimensional features and are partitioned into a training set (20 elements) and a test set (5 elements). Each input x = (x[0], x[1], x[2]) is encoded through the ZZFeatureMap

ZZFeatureMap: each qubit q_i is prepared with a Hadamard H followed by the phase gate φ(2x_i); then, for each pair (i, j) ∈ {(0, 1), (0, 2), (1, 2)}, an entangling block consisting of CX_{i→j}, the phase gate φ(2 x̄_i x̄_j) on q_j, and CX_{i→j} is applied,   (206)

where φ(θ) is the single-qubit phase gate and x̄_i = π − x_i. The trainable part of the quantum neural network, which is appended to the quantum circuit after the feature map, is represented by a 3-layer RealAmplitudes variational ansatz, combining parametrized Ry rotations and entangling CNOT operations:

RealAmplitudes: four layers of single-qubit rotations Ry(θ_{3k+i}) acting on qubit q_i (i = 0, 1, 2) for layers k = 0, 1, 2, 3, with each of the first three rotation layers followed by the entangling gates CX_{0→1}, CX_{0→2}, CX_{1→2}, for a total of 12 variational angles θ_0, ..., θ_{11}.   (207)

The QNN output is obtained as the expectation value of the 3-qubit observable Z_0 Z_1 Z_2, where Z_i is the single-qubit Pauli Z operator. Our model is

z_δ = ⟨φ(x_δ)| U† O U |φ(x_δ)⟩ ,   (208)

where O is given by

O = diag(1,−1,−1, 1,−1, 1, 1,−1) . (209)
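For concreteness, the model of Eqs. (206)-(209) can be assembled with Qiskit's circuit library roughly as in the following sketch (our illustration, not the paper's code; it assumes a Qiskit version where ZZFeatureMap and RealAmplitudes are available, and the feature-map repetition count and entanglement patterns are assumptions chosen to match the circuits drawn above).

import numpy as np
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes
from qiskit.quantum_info import SparsePauliOp, Statevector

n_qubits = 3
# Feature map of Eq. (206); reps and entanglement are our assumptions.
feature_map = ZZFeatureMap(feature_dimension=n_qubits, reps=1, entanglement="full")
# Trainable ansatz of Eq. (207): four Ry layers and three entangling layers, 12 angles.
ansatz = RealAmplitudes(num_qubits=n_qubits, reps=3, entanglement="full")
# Observable of Eq. (209): Z0 Z1 Z2 = diag(1, -1, -1, 1, -1, 1, 1, -1).
observable = SparsePauliOp("ZZZ")

def qnn_output(x, theta):
    """z_delta = <phi(x_delta)| U(theta)^dagger O U(theta) |phi(x_delta)>, Eq. (208)."""
    circuit = feature_map.assign_parameters(x).compose(ansatz.assign_parameters(theta))
    return float(np.real(Statevector.from_instruction(circuit).expectation_value(observable)))

x_example = np.array([0.1, 0.4, -0.7])      # a hypothetical 3-dimensional input
theta0 = np.zeros(ansatz.num_parameters)    # initial variational angles
print(qnn_output(x_example, theta0))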

Note that we have only one observable O, so our quantum neural network has a single scalar output. In this setup, the QNTK is given by

K_{\delta,\alpha} = \sum_{\ell} \frac{d z_{\delta}}{d\theta_{\ell}}\, \frac{d z_{\alpha}}{d\theta_{\ell}}
= - \sum_{\ell} \Big\langle \phi(x_{\delta}) \Big| U^{\dagger}_{+,\ell} \Big[ X_{\ell},\; U^{\dagger}_{\ell} W^{\dagger}_{\ell} U^{\dagger}_{-,\ell} O U_{-,\ell} W_{\ell} U_{\ell} \Big] U_{+,\ell} \Big| \phi(x_{\delta}) \Big\rangle
\times \Big\langle \phi(x_{\alpha}) \Big| U^{\dagger}_{+,\ell} \Big[ X_{\ell},\; U^{\dagger}_{\ell} W^{\dagger}_{\ell} U^{\dagger}_{-,\ell} O U_{-,\ell} W_{\ell} U_{\ell} \Big] U_{+,\ell} \Big| \phi(x_{\alpha}) \Big\rangle ,   (210)

and the frozen QNTK is given by

K_{\delta,\alpha} = - \delta^{2} \sum_{\ell} \Big\langle \phi(x_{\delta}) \Big| W^{\dagger}_{+,\ell} \Big[ X_{\ell},\; W^{\dagger}_{\ell} W^{\dagger}_{-,\ell} O W_{-,\ell} W_{\ell} \Big] W_{+,\ell} \Big| \phi(x_{\delta}) \Big\rangle
\times \Big\langle \phi(x_{\alpha}) \Big| W^{\dagger}_{+,\ell} \Big[ X_{\ell},\; W^{\dagger}_{\ell} W^{\dagger}_{-,\ell} O W_{-,\ell} W_{\ell} \Big] W_{+,\ell} \Big| \phi(x_{\alpha}) \Big\rangle ,   (211)

with the help of the initial angle redefinition. All the theory we have discussed applies in a straightforward way to our simulation. For the theoretical prediction, we take the variational angles θ* from the last step of our training.
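As a sketch of how individual QNTK entries can be estimated numerically and compared against the frozen prediction, the following routine (our illustration; it assumes the hypothetical qnn_output helper defined in the sketch after Eq. (209)) evaluates K_{δ,α} of Eq. (210) via the parameter-shift rule, which gives exact derivatives for the Ry rotations of the RealAmplitudes ansatz.

import numpy as np

def qntk_entry(qnn_output, x_delta, x_alpha, theta, shift=np.pi / 2):
    """K_{delta,alpha} = sum_l (dz_delta/dtheta_l) (dz_alpha/dtheta_l), cf. Eq. (210)."""
    grads = []
    for x in (x_delta, x_alpha):
        g = np.zeros(len(theta))
        for l in range(len(theta)):
            plus, minus = np.array(theta, dtype=float), np.array(theta, dtype=float)
            plus[l] += shift
            minus[l] -= shift
            # Parameter-shift rule: exact for Pauli-rotation gates such as Ry.
            g[l] = 0.5 * (qnn_output(x, plus) - qnn_output(x, minus))
        grads.append(g)
    return float(np.dot(grads[0], grads[1]))

# Example usage with hypothetical data points and 12 initial angles:
# K = qntk_entry(qnn_output, np.array([0.1, 0.4, -0.7]), np.array([0.3, -0.2, 0.5]),
#                theta=np.zeros(12))

Evaluating such entries at the trained angles θ* and holding them fixed corresponds to the frozen-QNTK prediction, while re-evaluating them along the gradient-descent trajectory tracks the actual kernel dynamics.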


FIG. 1. Diagonal and off-diagonal entries of the QNTK during noiseless gradient-descent dynamics. We compute the gradient-descent evolution of five random diagonal and off-diagonal elements of the QNTK. The solid lines are the actual values of the QNTK entries during the experiment, and the dashed lines are the theoretical predictions of the frozen QNTK.

FIG. 2. Diagonal and off-diagonal entries of the QNTK during noisy gradient-descent dynamics. We compute the gradient-descent evolution of five random diagonal and off-diagonal elements of the QNTK. The solid lines are the actual values of the QNTK entries during the simulation of a noisy quantum processor, and the dashed lines are the theoretical predictions of the frozen QNTK.

In addition to the plots reported in the main text, here we show more data on the properties of the QNTK during gradient descent; see Fig. 1 and Fig. 2. We can see that, even including a model of device noise, the simulation is successful and the kernel is stable at late times once the effect of error mitigation is included. This indicates an exponential decay of the residual training error at late times for noisy quantum circuits.

[1] D. A. Roberts, S. Yaida, and B. Hanin, arXiv preprint arXiv:2106.10165 (2021).
[2] D. A. Roberts and S. Yaida, Deep Learning Theory Summer School at Princeton (2021).
[3] Note that the inverse has to be taken in the over-parametrized regime, where the number of variational angles is larger than the number of training samples, to keep the formalism simple; this is not the case in our numerical simulation in this paper.
[4] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (MIT Press, 2018).
[5] The convention factor 1/2 appearing in η comes from the worldsheet index defined for holomorphic and anti-holomorphic variables. Similar conventions appear in two-dimensional conformal field theories, see [9].
[6] L. Zhao, Z. Zhao, P. Rebentrost, and J. Fitzsimons, Quantum Machine Intelligence 3, 1 (2021).
[7] A. Jacot, F. Gabriel, and C. Hongler, arXiv preprint arXiv:1806.07572 (2018).
[8] J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee, arXiv preprint arXiv:2001.07301 (2020).
[9] J. Polchinski, String Theory: Volume 1, An Introduction to the Bosonic String (Cambridge University Press, 1998).