
Normalizing Flows with Multi-Scale Autoregressive Priors

Shweta Mahajan* 1 Apratim Bhattacharyya* 2 Mario Fritz3 Bernt Schiele2 Stefan Roth1

1Department of Computer Science, TU Darmstadt2Max Planck Institute for Informatics, Saarland Informatics Campus

3CISPA Helmholtz Center for Information Security, Saarland Informatics Campus

Abstract

Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.

1. Introduction

Deep generative models aim to learn complex dependencies within very high-dimensional input data, e.g. natural images [3, 28] or audio data [7], and enable generating new samples that are representative of the true data distribution. These generative models find application in various downstream tasks like image synthesis [11, 20, 38] or speech synthesis [7, 39]. Since it is not feasible to learn the exact distribution, generative models generally approximate the underlying true distribution. Popular generative models for capturing complex data distributions are Generative Adversarial Networks (GANs) [11], which model the distribution implicitly and generate (high-dimensional) samples by transforming a noise distribution into the desired space with complex dependencies; however, they may not cover

*Authors contributed equally.

Figure 1. Our mAR-SCF model combines normalizing flows with autoregressive (AR) priors to improve modeling power while ensuring that the computational cost grows linearly with the spatial image resolution N × N. (Figure panels: PixelCNN, autoregressive layers, computational cost O(N²); Glow, flow layers with Gaussian priors, computational cost O(1); our mAR-SCF, flow layers with AR priors, computational cost O(N).)

all modes of the underlying data distribution. Variational Autoencoders (VAEs) [20] optimize a lower bound on the log-likelihood of the data. This implies that VAEs can only approximately optimize the log-likelihood [30].

Autoregressive models [10, 37, 38] and normalizing flow-based generative models [8, 9, 18] are exact inference models that optimize the exact log-likelihood of the data. Autoregressive models can capture complex and long-range dependencies between the dimensions of a distribution, e.g. in case of images, as the value of a pixel is conditioned on a large context of neighboring pixels. The main limitation of this approach is that image synthesis is sequential and thus difficult to parallelize. Recently proposed normalizing flow-based models, such as NICE [8], RealNVP [9], and Glow [18], allow exact inference by mapping the input data to a known base distribution, e.g. a Gaussian, through a series of invertible transformations. These models leverage invertible split coupling flow (SCF) layers in which certain dimensions are left unchanged by the invertible transformation as well as SPLIT operations following which certain dimensions do not undergo subsequent transformations. This allows for considerably easier parallelization of both inference and generation processes. However, these models lag behind autoregressive models for density estimation.
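To make the split coupling idea concrete, the following is a minimal PyTorch sketch of a RealNVP-style affine coupling layer (the class name and the small conditioning network are illustrative, not the paper's architecture): one half of the channels passes through unchanged and parameterizes an affine transform of the other half, so both the inverse and the log-determinant of the Jacobian are cheap to compute.

```python
import torch

class AffineCoupling(torch.nn.Module):
    """Affine split coupling layer (RealNVP-style): half the channels
    pass through unchanged and parameterize an affine transform of the
    other half, making the inverse and log|det J| tractable."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(channels // 2, hidden, 3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(hidden, channels, 3, padding=1),  # emits scale and shift
        )

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)           # channel-wise split
        log_s, t = self.net(x_a).chunk(2, dim=1)
        log_s = torch.tanh(log_s)              # stabilize the scale
        y_b = x_b * torch.exp(log_s) + t       # transform only one half
        log_det = log_s.flatten(1).sum(dim=1)  # log|det J| of the coupling
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        log_s, t = self.net(y_a).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x_b = (y_b - t) * torch.exp(-log_s)
        return torch.cat([y_a, x_b], dim=1)
```

Because the conditioning network only ever sees the untransformed half, inversion never requires inverting the network itself, which is what makes both inference and sampling parallelizable.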

To appear in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 2020.

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:2004.03891v1 [cs.LG] 8 Apr 2020


In this work, we (i) propose multi-scale autoregressive priors for invertible flow models with split coupling flow layers, termed mAR-SCF, to address the limited modeling power of non-autoregressive invertible flow models [9, 14, 18, 28] (Fig. 1); (ii) we apply our multi-scale autoregressive prior after every SPLIT operation such that the computational cost of sampling grows linearly in the spatial dimensions of the image compared to the quadratic cost of traditional autoregressive models (given sufficient parallel resources); (iii) our experiments show that we achieve state-of-the-art density estimation results on MNIST [22], CIFAR10 [21], and ImageNet [31] compared to prior invertible flow-based approaches; and finally (iv) we show that our multi-scale autoregressive prior leads to better sample quality as measured by the FID metric [13] and the Inception score [32], significantly lowering the gap to GAN approaches [27, 40].

2. Related Work

In this work, we combine the expressiveness of autoregressive models with the efficiency of non-autoregressive flow-based models for exact inference.

Autoregressive models [6, 10, 12, 15, 26, 37, 38] are a class of exact inference models that factorize the joint probability distribution over the input space as a product of conditional distributions, where each dimension is conditioned on the previous ones in a pre-defined order. Recent autoregressive models, such as PixelCNN and PixelRNN [37, 38], can generate high-quality images but are difficult to parallelize since synthesis is sequential. It is worth noting that autoregressive image models, such as that of Domke et al. [10], significantly pre-date their recent popularity. Various extensions have been proposed to improve the performance of the PixelCNN model. For example, Multiscale-PixelCNN [29] extends PixelCNN to improve the sampling runtime from linear to logarithmic in the number of pixels, exploiting conditional independence between the pixels. Chen et al. [6] introduce self-attention in PixelCNN models to improve the modeling power. Salimans et al. [33] introduce skip connections and a discrete logistic likelihood model. WaveRNN [17] leverages customized GPU kernels to improve the sampling speed for audio synthesis. Menick et al. [23] synthesize images by sequential conditioning on sub-images within an image. These methods, however, still suffer from slow sampling speed and are difficult to parallelize.
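The raster-scan conditioning used by PixelCNN-style models is typically enforced with masked convolutions: zeroing the kernel weights at and after the center pixel restricts each output to depend only on pixels that precede it. A minimal PyTorch sketch of a mask-'A' convolution (illustrative, not the exact PixelCNN implementation):

```python
import torch

class MaskedConv2d(torch.nn.Conv2d):
    """PixelCNN-style mask-'A' convolution: zeroes kernel weights at the
    center pixel and after it, so each output position depends only on
    input pixels that precede it in raster-scan order."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0   # center pixel and everything to its right
        mask[kH // 2 + 1:, :] = 0     # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Apply the mask at every forward pass so gradients also respect it.
        return torch.nn.functional.conv2d(
            x, self.weight * self.mask, self.bias,
            self.stride, self.padding)
```

Stacking such layers keeps the full joint log-likelihood tractable as a sum of conditional log-probabilities, but sampling still has to visit the pixels one by one, which is the sequential bottleneck discussed above.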

Flow-based generative models, first introduced in [8], also allow for exact inference. These models are composed of a series of invertible transformations, each with a tractable Jacobian and inverse, which maps the input distribution to a known base density, e.g. a Gaussian. Papamakarios et al. [25] proposed autoregressive invertible transformations using masked decoders. However, these are difficult to parallelize just like PixelCNN-based approaches. Kingma et al. [19] propose inverse autoregressive flow (IAF), where the means and variances of pixels depend on random variables and not on previous pixels, making it easier to parallelize. However, the approach offers limited generalization [39].

Recent extensions. Recent work [1, 9, 18] extends normalizing flows [8] to multi-scale architectures with split couplings, which allow for efficient inference and sampling. For example, Kingma et al. [18] introduce additional invertible 1 × 1 convolutions to capture non-linearities in the data distribution. Hoogeboom et al. [16] extend this to d × d convolutions, increasing the receptive field. [4] improve the residual blocks of flow layers with memory-efficient gradients based on the choice of activation functions. A key advantage of flow-based generative models is that they can be parallelized for inference and synthesis. Ho et al. [14] propose Flow++ with various modifications in the architecture of the flows in [9], including attention and a variational quantization method to improve the data likelihood. The resulting model is computationally expensive as non-linearities are applied along all the dimensions of the data at every step of the flow, i.e. all the dimensions are instantiated with the prior distribution at the last layer of the flow. While comparatively efficient, such flow-based models have limited expressiveness compared to autoregressive models, which is reflected in their lower data log-likelihood. It is thus desirable to develop models that have the expressiveness of autoregressive models and the efficiency of flow-based models. This is our goal here.
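The invertible 1 × 1 convolution of Glow [18] mixes channels at every spatial position with a single C × C matrix W, so its Jacobian log-determinant is simply h · w · log|det W|. A minimal sketch, assuming PyTorch (Glow additionally parameterizes W via an LU decomposition for efficiency, which is omitted here):

```python
import torch

def invertible_1x1_conv(x, W):
    """Glow-style invertible 1x1 convolution: every spatial position of
    x (shape B x C x H x W) is multiplied by the same CxC matrix W.
    Since W acts once per pixel, log|det J| = h * w * log|det W|."""
    b, c, h, w = x.shape
    y = torch.einsum("ij,bjhw->bihw", W, x)
    log_det = h * w * torch.slogdet(W)[1]  # [1] is log|det W|; sign ignored
    return y, log_det

def invert_1x1_conv(y, W):
    """Inverse pass: apply W^{-1} at every spatial position."""
    return torch.einsum("ij,bjhw->bihw", torch.inverse(W), y)
```

Initializing W as a random rotation (as Glow does) makes the initial transform volume-preserving, since an orthogonal matrix has |det W| = 1 and hence zero log-determinant.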

Methods with complex priors. Recent work [5] develops complex priors to improve the data likelihoods. VQ-VAE2 integrates autoregressive models as priors [28] with discrete latent variables [5] for high-quality image synthesis and proposes latent graph-based models in a VAE framework. Tomczak et al. [36] propose mixtures of Gaussians with pre-defined clusters, and [5] use neural autoregressive model priors in the latent space, which improves results for image synthesis. Ziegler et al. [41] learn a prior based on normalizing flows to capture multimodal discrete distributions of character-level texts in the latent spaces with nonlinear flow layers. However, this invertible layer is difficult to optimize in both directions. Moreover, these models do not allow for exact inference. In this work, we propose complex autoregressive priors to improve the power of invertible split coupling-based normalizing flows [9, 14, 18].

3. Overview and Background

In this work, we propose multi-scale autoregressive priors for split coupling-based flow models, termed mAR-SCF, where we leverage autoregressive models to improve the modeling flexibility of invertible normalizing flow models without sacrificing sampling efficiency. As we build upon normalizing flows and autoregressive models, we first provide an overview of both.
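The training objective shared by all of these flow models is the exact log-likelihood given by the change-of-variables formula, log p(x) = log p_prior(z) + Σ log|det J|; the prior over the latent z is exactly where mAR-SCF substitutes an autoregressive model for the usual Gaussian. A minimal PyTorch sketch of this objective (illustrative only; the actual mAR prior is defined later in the paper):

```python
import math
import torch

def flow_log_likelihood(x, flow_layers, prior_log_prob):
    """Exact log-likelihood of a normalizing flow via change of variables:
    log p(x) = log p_prior(z) + sum of per-layer log|det J|.
    `prior_log_prob` may be a standard Gaussian or, as motivated here,
    a learned autoregressive prior over the latent z."""
    z, total_log_det = x, 0.0
    for layer in flow_layers:
        z, log_det = layer(z)  # each layer returns (output, log|det J|)
        total_log_det = total_log_det + log_det
    return prior_log_prob(z) + total_log_det

def standard_gaussian_log_prob(z):
    """Log-density of an isotropic standard Gaussian, summed per sample."""
    return (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).flatten(1).sum(dim=1)
```

Replacing `standard_gaussian_log_prob` with a conditional, channel-wise autoregressive density keeps this objective exact while adding the latent dependencies that plain SCF models lack.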




l1i<latexit sha1_base64="ecvxS75HhrKB7G4eL+FCC2tp2hQ=">AAACY3icbVHLSsQwFE3re3zVx06E4jDgxqFVQZeiG5cKjgozY0kzt04wTUtyK46hP+nOnRv/w3SmyPi4EDg551zuzUmcC64xCN4dd2Z2bn5hcamxvLK6tu5tbN7qrFAMOiwTmbqPqQbBJXSQo4D7XAFNYwF38dNFpd89g9I8kzc4yqGf0kfJE84oWiryXlu9lOIwTsywjOSDCctGqwe55iKTjW9NlJGRB2H5YHoIL2gu6nv5nyWcYlXNNqZtfGKKvGbQDsbl/wVhDZqkrqvIe+sNMlakIJEJqnU3DHLsG6qQMwF2RqEhp+yJPkLXQklT0H0zzqj0W5YZ+Emm7JHoj9npDkNTrUdpbJ3Vqvq3VpH/ad0Ck9O+4TIvECSbDEoK4WPmV4H7A66AoRhZQJnidlefDamiDO23VCGEv5/8F9wetsOj9uH1cfPsvI5jkeyQPbJPQnJCzsgluSIdwsiHM++sO57z6S67m+72xOo6dc8W+VHu7hdwjrgh</latexit>

ri<latexit sha1_base64="IntGalcq7C5e/6qsABvGT1cA6ZU=">AAACcnicbVHLSgMxFM2M7/qqihsFHS0FF1pmVNCl6MalglWhrUMmvdMGM5khuSOWMB/g77nzK9z4AaZ1kPq4EDg559zk5iTKBNfo+2+OOzE5NT0zO1eZX1hcWq6urN7qNFcMmiwVqbqPqAbBJTSRo4D7TAFNIgF30ePFUL97AqV5Km9wkEEnoT3JY84oWiqsvtTbCcV+FJt+EcoHExSVehsyzUUqK9+aKEIjD4LiwbQRntFclPviP0swxqqS/enjX65xEy/Cas1v+KPy/oKgBDVS1lVYfW13U5YnIJEJqnUr8DPsGKqQMwH2+FxDRtkj7UHLQkkT0B0ziqzw6pbpenGq7JLojdjxDkMTrQdJZJ3DKfVvbUj+p7VyjE87hsssR5Ds66I4Fx6m3jB/r8sVMBQDCyhT3M7qsT5VlKH9pYoNIfj95L/g9rARHDUOr49rZ+dlHLNkk+ySPRKQE3JGLskVaRJG3p11Z8vZdj7cDXfHLbNznbJnjfwod/8TvxW+1A==</latexit>

lC11

<latexit sha1_base64="nJfwarp14/3SuBphIRvRqLpmSRY=">AAACkXicbVFdS8MwFE3rd/2a+uhLcQx8cbQqqA/K0BfBFwWnwjZLmt26YJqW5FYcof/H3+Ob/8ZsFtl0FwIn55zLubmJc8E1BsGX487NLywuLa94q2vrG5u1re0HnRWKQZtlIlNPMdUguIQ2chTwlCugaSzgMX69GumPb6A0z+Q9DnPopfRF8oQzipaKah+NbkpxECdmUEby2YSl1+hCrrnIpPeriTIy8iAsn00X4R3NVXUvZ1nCCVZV7LSPz3Dx0pu0TGfZpKhWD5rBuPz/IKxAnVR1G9U+u/2MFSlIZIJq3QmDHHuGKuRMgE0rNOSUvdIX6FgoaQq6Z8YbLf2GZfp+kil7JPpjdrLD0FTrYRpb52ho/VcbkbO0ToHJac9wmRcIkv0EJYXwMfNH3+P3uQKGYmgBZYrbWX02oIoytJ/o2SWEf5/8HzwcNsOj5uHdcb11Wa1jmeySPbJPQnJCWuSa3JI2Yc6Gc+ycOxfujnvmttzK6zpVzw6ZKvfmG10qyi8=</latexit>

l11<latexit sha1_base64="4bSR2ZXb81VwRwOEYUePVL7K0eU=">AAACiHicbVFdS8MwFE3rd/2a+uhLcQwEcbR+oL6Jvvio4FTYZkmzWxeWpiW5FUfob/E/+ea/MZ1Fpu5Cwsk55+be3MS54BqD4NNx5+YXFpeWV7zVtfWNzcbW9oPOCsWgwzKRqaeYahBcQgc5CnjKFdA0FvAYj64r/fEVlOaZvMdxDv2UvkiecEbRUlHjvdVLKQ7jxAzLSD6bsPRaPcg1F5n0fjRRRkYehuWz6SG8obmuz+UsSzjFqpr97eMzXLz0pi3VRXYro0YzaAeT8P+DsAZNUsdt1PjoDTJWpCCRCap1Nwxy7BuqkDMBtkihIadsRF+ga6GkKei+mQyy9FuWGfhJpuyS6E/Y6QxDU63HaWydVa/6r1aRs7Rugcl533CZFwiSfRdKCuFj5le/4g+4AoZibAFlittefTakijK0f+fZIYR/n/wfPBy1w+P20d1J8/KqHscy2SV7ZJ+E5IxckhtySzqEOQvOgXPinLqeG7hn7sW31XXqnB3yK9yrLzAQxho=</latexit>

r1<latexit sha1_base64="JkZ7Gz44b+3N4MSildVhTHCaRBA=">AAACmHicbVFdT9swFHUCDOg+KKC9jJeIqhIvq2I2CSReKmBie2MTBaS2RI57Qy0cJ7JvEJWV38R/4Y1/g1OiqZReydLxOed++DrOpTAYhs+ev7S88mF1bb3x8dPnLxvNza1LkxWaQ49nMtPXMTMghYIeCpRwnWtgaSzhKr47qfSre9BGZOoCJzkMU3arRCI4Q0dFzcf2IGU4jhM7LiN1Y2nZaA8gN0JmqvFfk2Vk1Xda3tgBwgPak/peLrLQGVbX7FufWOASc56qkq06zJpoGTVbYSecRvAe0Bq0SB3nUfNpMMp4kYJCLpkxfRrmOLRMo+ASXPnCQM74HbuFvoOKpWCGdrrYMmg7ZhQkmXZHYTBlZzMsS42ZpLFzVlOaea0iF2n9ApPDoRUqLxAUf22UFDLALKh+KRgJDRzlxAHGtXCzBnzMNOPo/rLhlkDnn/weXO536I/O/t+fre5xvY41skN2yR6h5IB0yW9yTnqEe1+9I+/U++V/87v+mf/n1ep7dc42eRP+vxd5gMyp</latexit>

f�1✓n

<latexit sha1_base64="ceoEBAmvXla9c13lAl4jB56aMXg=">AAACq3icbVHLTuMwFHXCc8JjCizZRFSV2FAlgDQsEWxmhUBQWtGUyHFvqIXjRPYNorLyc/MJs5u/wSnRiJZeydLxOec+fJ0UgmsMgn+Ou7K6tr6x+cPb2t7Z/dna23/UeakY9FgucjVIqAbBJfSQo4BBoYBmiYB+8npd6/03UJrn8gGnBYwy+iJ5yhlFS8WtP50oozhJUjOpYvlswsrrRFBoLnLp/ddEFRt5ElbPJkJ4R3Pd3KtllvALqxp23seXuPiCp65k5jqoGeulsR1iAkhjaS22dtxqB91gFv53EDagTZq4jVt/o3HOygwkMkG1HoZBgSNDFXImoPKiUkNB2St9gaGFkmagR2a268rvWGbsp7myR6I/Y79mGJppPc0S66zn1otaTS7ThiWmFyPDZVEiSPbZKC2Fj7lff5w/5goYiqkFlCluZ/XZhCrK0H6vZ5cQLj75O3g87YZn3dO78/blVbOOTXJIjsgxCckvckl+k1vSI8w5dm6cvjNwT9x798mNPq2u0+QckLlw4QPd19RG</latexit>

f�1✓n�1

<latexit sha1_base64="c7G++aevJkdOCFyUToiqkLDLAV8=">AAACr3icbVFdT9swFHWy8bGMjzIe9xKtqrSHUcUwaTwieOGRTZQiNW3muDfUwnEi+wZRWfl7+wF749/glAjR0ivZOj7nXN/r67SUwmAUPXn+h48bm1vbn4LPO7t7+52DLzemqDSHAS9koW9TZkAKBQMUKOG21MDyVMIwvb9o9OEDaCMKdY3zEsY5u1MiE5yho5LOv16cM5ylmZ3ViZpYWge9GEojZKGCV03WiVVHtJ7YGOER7UV7rtdZ6BtWt+yyT6xxiRVPc5NdqqAXbJAlrokZIGtbmFi3J51u1I8WEb4HtAVd0sZV0vkfTwte5aCQS2bMiEYlji3TKLiEOogrAyXj9+wORg4qloMZ28W867DnmGmYFdotheGCfZthWW7MPE+ds+ndrGoNuU4bVZidjq1QZYWg+EuhrJIhFmHzeeFUaOAo5w4wroXrNeQzphlH98WBGwJdffJ7cHPcpyf9498/u2fn7Ti2yVfyjXwnlPwiZ+SSXJEB4d4P74838mKf+kN/4v99sfpem3NIlsIXz6O81cQ=</latexit>

f�1✓i

<latexit sha1_base64="EGIsJm0Y1FhKKYwSntDWXtDiSw0=">AAACrXicbVFdT9swFHUyNiBjo4xHXiKqShPSuoRNgke0vvBYJApsTRs57k1r4TiRfYOorPy7/QLe+Dc4bYRo6ZUsHZ9z7oevk0JwjUHw7Lgftj5+2t7Z9T7vffm63zr4dqPzUjEYsFzk6i6hGgSXMECOAu4KBTRLBNwm971av30ApXkur3FewCijU8lTzihaKm7970QZxVmSmlkVy7EJK68TQaG5yKX3qokqNvJHWI1NhPCIptfcq02W8A2rGnbVxze4+JqnrmRWOqgF66WxHWIGSOsca7LV41Y76AaL8N+DsAFt0kQ/bj1Fk5yVGUhkgmo9DIMCR4Yq5ExA5UWlhoKyezqFoYWSZqBHZrHtyu9YZuKnubJHor9g32YYmmk9zxLrrCfX61pNbtKGJabnI8NlUSJItmyUlsLH3K+/zp9wBQzF3ALKFLez+mxGFWVoP9izSwjXn/we3Jx2w1/d06vf7Ys/zTp2yBE5Jt9JSM7IBbkkfTIgzDlx+s5f55/70x24kTteWl2nyTkkK+FOXwAvI9VN</latexit>

f�1✓1

<latexit sha1_base64="l1ndQNTsICS2U9LTB9emtbh0vE8=">AAACrXicbVFNT9tAEF27tIBLSyhHLhZRpAqpqU0rwRE1F45BIkAbJ9Z6M05WrNfW7hgRrfzv+gu48W9YJxYiISOt9ObNm4+dSQrBNQbBs+N+2Pr4aXtn1/u89+Xrfuvg243OS8VgwHKRq7uEahBcwgA5CrgrFNAsEXCb3Pfq+O0DKM1zeY3zAkYZnUqeckbRUnHrfyfKKM6S1MyqWI5NWHmdCArNRS6915ioYiN/hNXYRAiPaHqNX22ShG9Y1bCrOr5Bxdc0dSWz0kEtWC+N7RAzQFp7VmSrx6120A0W5r8HYQPapLF+3HqKJjkrM5DIBNV6GAYFjgxVyJmAyotKDQVl93QKQwslzUCPzGLbld+xzMRPc2WfRH/Bvs0wNNN6niVWWU+u12M1uSk2LDE9HxkuixJBsmWjtBQ+5n59On/CFTAUcwsoU9zO6rMZVZShPbBnlxCuf/k9uDnthr+6p1e/2xd/mnXskCNyTL6TkJyRC3JJ+mRAmHPi9J2/zj/3pztwI3e8lLpOk3NIVsydvgDYrNUV</latexit>

✏<latexit sha1_base64="LebZ8miRV0JsqtgipoMV87075vE=">AAACtnicbVFLT+MwEHayyyssUNjjXqKtKnHZKgEkOCK4cGQlWor6iBx30lp1nMieICorP5ELN/7NOiWs2tKRLH3zzTcPz8S54BqD4N1xv33f2t7Z3fP2fxwcHjWOT7o6KxSDDstEpnox1SC4hA5yFNDLFdA0FvAYz26r+OMzKM0z+YDzHIYpnUiecEbRUlHjtTVIKU7jxEzLSI5MWHqfxAByzUUmS++/RpSRkX/CcmQGCC9obmt/oyRcYlXNrur4BhVf01SVzEoHtWC9VhLZKaaAtHKtypaPGs2gHSzM/wrCGjRJbfdR420wzliRgkQmqNb9MMhxaKhCzgTYXRQacspmdAJ9CyVNQQ/NYu2l37LM2E8yZZ9Ef8EuZxiaaj1PY6usRtfrsYrcFOsXmFwNDZd5gSDZR6OkED5mfnVDf8wVMBRzCyhT3M7qsylVlKG9tGeXEK5/+SvonrXD8/bZ34vm9U29jl3yi/wmpyQkl+Sa3JF70iHMOXeenNhh7pU7csGdfEhdp875SVbMzf8B+tbZNQ==</latexit>

(a) Generative model for mAR-SCF.

SQUEEZE

SPLIT

STEPOFFLOW

SQUEEZE

STEPOFFLOW

SQUEEZE

× (n − 1)

x<latexit sha1_base64="CPn/AH6zyetDuK2aXv+DRQKMHf0=">AAAB8XicbVDLSgMxFL3js9ZX1aWbYBFclZkq6LLoxmUF+8C2lEx6pw3NZIYkI5ahf+HGhSJu/Rt3/o2ZdhbaeiBwOOdecu7xY8G1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZQohg0WiUi1fapRcIkNw43AdqyQhr7Alj++yfzWIyrNI3lvJjH2QjqUPOCMGis9dENqRn6QPk37pbJbcWcgy8TLSRly1Pulr+4gYkmI0jBBte54bmx6KVWGM4HTYjfRGFM2pkPsWCppiLqXzhJPyalVBiSIlH3SkJn6eyOlodaT0LeTWUK96GXif14nMcFVL+UyTgxKNv8oSAQxEcnOJwOukBkxsYQyxW1WwkZUUWZsSUVbgrd48jJpViveeaV6d1GuXed1FOAYTuAMPLiEGtxCHRrAQMIzvMKbo50X5935mI+uOPnOEfyB8/kD/oeRIA==</latexit>

f✓i<latexit sha1_base64="P5UJmnQyw5/e2L+UV+5jztiRc34=">AAACsXicbZHPT9swFMed8GNdGKyD4y7RKiQuqxKKtB3RuHAsEgWmJg2O+0KtOk5kv0yrrPx/O+/Gf4NTItSWPsnS15/3ffbzc1oKrjEInh13Z3dv/0Pno3fw6fDoc/fL8Z0uKsVgxApRqIeUahBcwgg5CngoFdA8FXCfzq+a/P0fUJoX8hYXJcQ5fZI844yiRUn332mUU5ylmZnViZyYsPbeSASl5qKQK0jUiZHfw3piIoS/aK7a/VbL6lmqpes+vsXFNzzNSWbtBrWkXpbYJmaAtKmpk24v6AfL8N+LsBU90sYw6f6PpgWrcpDIBNV6HAYlxoYq5ExA7UWVhpKyOX2CsZWS5qBjs5x47Z9aMvWzQtkl0V/S1QpDc60XeWqdTdN6M9fAbblxhdnP2HBZVgiSvV6UVcLHwm++z59yBQzFwgrKFLe9+mxGFWVoP9mzQwg3n/xe3J33w0H//Oaid/mrHUeHfCXfyBkJyQ9ySa7JkIwIc/rOrRM7E3fg/nYf3fTV6jptzQlZC3f+AlTD14c=</latexit>

SPLIT

mAR prior

l1i<latexit sha1_base64="O9k67+zGctmKvIePEB0ScaSEDDg=">AAADBHicbVJba9RAFJ7EW42XbvVNXwaXQIXtklShBV+KgiwEpEK3LWy2YTI7yY47k4SZiXQd5sEX/4ovPijiqz/CN/+Nk22WXdseCHznO5fvnJNJK0alCoK/jnvj5q3bdzbuevfuP3i42dl6dCzLWmAyxCUrxWmKJGG0IENFFSOnlSCIp4ycpLM3TfzkIxGSlsWRmldkzFFe0IxipCyVbDlP/CzR0xSZM70TmpiRTG3HHKlpmmlskg+9pZNZJxY0n6rnnh8rcq4sOTdJdIZhLClf5GHE9DtjG9RJ1LNszlES2fw8GVzfP+rBNYFoJbCmuvI+mUTPjOcPEBJx7PlHAhUyKwX3lluscs8brCgnMn4Fo52wHVpLOyORsMxM0+FtW7asYlaB2lvo0Jik0w36wcLgVRC2oAtaO0w6f+JJiWtOCoUZknIUBpUaayQUxYxYkVqSCuEZysnIwgLZ4cZ68RMN9C0zgXYZ+xUKLtj1Co24lHOe2sxmVnk51pDXxUa1yvbHmhZVrUiBL4SymkFVwuZFwAkVBCs2twBhQe2sEE+RQFjZd+PZI4SXV74Kjnf74Yv+7vuX3YPX7Tk2wFPwDGyDEOyBAzAAh2AIsPPZ+ep8d364X9xv7k/310Wq67Q1j8F/5v7+BynJ9Ow=</latexit>

}<latexit sha1_base64="7iKKFhijwPJkM5hhP7UDcqTxMCA=">AAACLXicbZDNbtNAFIXHoZSQlpLCko1FHCksiOywoMtIZcGySOSnyo81nlwno4xnrJnrqpHrF+qGV0GVuihCbHkNJk4WNOmRRvrm3Hs1c0+UCm7Q9x+cyrOD54cvqi9rR8evTl7XT9/0jco0gx5TQulhRA0ILqGHHAUMUw00iQQMouX5uj64Am24kt9xlcIkoXPJY84oWiusf2l6aWsxFSG7WUz1B6/W9MZsptCUBKnhQsmSEa4xvy7WHIc5L6b5x8DevHHhhfWG3/ZLufsQbKFBtroI63fjmWJZAhKZoMaMAj/FSU41ciagqI0zAyllSzqHkUVJEzCTvNy2cJvWmbmx0vZIdEv3/4mcJsasksh2JhQXZre2Np+qjTKMzyY5l2mGINnmoTgTLip3HZ074xoYipUFyjS3f3XZgmrK0AZcsyEEuyvvQ7/TDj61O986je7lNo4qeUfekxYJyGfSJV/JBekRRm7JT/JAfjk/nHvnt/Nn01pxtjNvySM5f/8BKr2kgw==</latexit>

4Ci<latexit sha1_base64="ulKr4XrO3F+qLT5GZx8FIbEQtKc=">AAACPHicbZBLbxMxFIU9KY8QKISyZDMiEyksGs0EpLKMlA3LVJBHlcfI49xJrHrskX2najSdH9YNP4IdKzZdgBBb1jiTLCDplWx9Pude2T5RKrhB3//mVI4ePHz0uPqk9vTZ8fMX9ZcnQ6MyzWDAlFB6HFEDgksYIEcB41QDTSIBo+iyt/FHV6ANV/IzrlOYJXQpecwZRSuF9U9NL22t5iJkN6u5fuvVmt6ULRSakiA1XChZMsI15tfFhuMw58U8Pw3K09Tu3vut3ytC7oX1ht/2y3IPIdhBg+yqH9a/TheKZQlIZIIaMwn8FGc51ciZgKI2zQyklF3SJUwsSpqAmeXl5wu3aZWFGyttl0S3VP+dyGlizDqJbGdCcWX2vY14nzfJMP4wy7lMMwTJthfFmXBRuZsk3QXXwFCsLVCmuX2ry1ZUU4Y275oNIdj/8iEMO+3gXbtz3ml0L3ZxVMlr8oa0SEDOSJd8JH0yIIzcku/kB/npfHHunF/O721rxdnNvCL/lfPnLwIcqeg=</latexit>

2Ci<latexit sha1_base64="1V+8KepGM19fwt2hyA+AaPNo5PA=">AAACPHicbZA7b9swFIWpNE/n5bRjFiGWgWSIIblDMxrw0jFF48SBHwJFX8VEKFIgrwIbin5Ylv6IbJm6dGhRZM0cWvbQxrkAiY/n3AuSJ0oFN+j7T87Kh9W19Y3Nrcr2zu7efvXg46VRmWbQYUoo3Y2oAcEldJCjgG6qgSaRgKvotj3zr+5AG67kBU5TGCT0RvKYM4pWCqvf6156PB6KkN2Ph/rEq9S9PhspNCVBarhQsmSECeaTYsZxmPNimJ8G5alvd68599tFyL2wWvMbflnuMgQLqJFFnYfVx/5IsSwBiUxQY3qBn+Igpxo5E1BU+pmBlLJbegM9i5ImYAZ5+fnCrVtl5MZK2yXRLdV/J3KaGDNNItuZUBybt95MfM/rZRifDXIu0wxBsvlFcSZcVO4sSXfENTAUUwuUaW7f6rIx1ZShzbtiQwjefnkZLpuN4HOj+a1Za10v4tgkh+SIHJOAfCEt8pWckw5h5IH8JL/JH+eH88v56zzPW1ecxcwn8l85L6/+76nm</latexit>

2Ci<latexit sha1_base64="1V+8KepGM19fwt2hyA+AaPNo5PA=">AAACPHicbZA7b9swFIWpNE/n5bRjFiGWgWSIIblDMxrw0jFF48SBHwJFX8VEKFIgrwIbin5Ylv6IbJm6dGhRZM0cWvbQxrkAiY/n3AuSJ0oFN+j7T87Kh9W19Y3Nrcr2zu7efvXg46VRmWbQYUoo3Y2oAcEldJCjgG6qgSaRgKvotj3zr+5AG67kBU5TGCT0RvKYM4pWCqvf6156PB6KkN2Ph/rEq9S9PhspNCVBarhQsmSECeaTYsZxmPNimJ8G5alvd68599tFyL2wWvMbflnuMgQLqJFFnYfVx/5IsSwBiUxQY3qBn+Igpxo5E1BU+pmBlLJbegM9i5ImYAZ5+fnCrVtl5MZK2yXRLdV/J3KaGDNNItuZUBybt95MfM/rZRifDXIu0wxBsvlFcSZcVO4sSXfENTAUUwuUaW7f6rIx1ZShzbtiQwjefnkZLpuN4HOj+a1Za10v4tgkh+SIHJOAfCEt8pWckw5h5IH8JL/JH+eH88v56zzPW1ecxcwn8l85L6/+76nm</latexit>

4Ci<latexit sha1_base64="ulKr4XrO3F+qLT5GZx8FIbEQtKc=">AAACPHicbZBLbxMxFIU9KY8QKISyZDMiEyksGs0EpLKMlA3LVJBHlcfI49xJrHrskX2najSdH9YNP4IdKzZdgBBb1jiTLCDplWx9Pude2T5RKrhB3//mVI4ePHz0uPqk9vTZ8fMX9ZcnQ6MyzWDAlFB6HFEDgksYIEcB41QDTSIBo+iyt/FHV6ANV/IzrlOYJXQpecwZRSuF9U9NL22t5iJkN6u5fuvVmt6ULRSakiA1XChZMsI15tfFhuMw58U8Pw3K09Tu3vut3ytC7oX1ht/2y3IPIdhBg+yqH9a/TheKZQlIZIIaMwn8FGc51ciZgKI2zQyklF3SJUwsSpqAmeXl5wu3aZWFGyttl0S3VP+dyGlizDqJbGdCcWX2vY14nzfJMP4wy7lMMwTJthfFmXBRuZsk3QXXwFCsLVCmuX2ry1ZUU4Y275oNIdj/8iEMO+3gXbtz3ml0L3ZxVMlr8oa0SEDOSJd8JH0yIIzcku/kB/npfHHunF/O721rxdnNvCL/lfPnLwIcqeg=</latexit>

Ni/2<latexit sha1_base64="P+4d+u2ER0yCxMehDRlp8OEY0ww=">AAACTHicbZDNT9swGMadMqAEBh07conWVILDSlImjSNSLztNTKLA1I/Icd9Qq44d2W8mqpA/cJcddttfscsOm6ZJuGkOfOyVbP38vM8r20+cCW4wCH44jbUX6xubzS13e+fl7l7r1f6lUblmMGBKKH0dUwOCSxggRwHXmQaaxgKu4nl/2b/6AtpwJS9wkcE4pTeSJ5xRtFLUYh0/O5xNRMTuZhN95Lsdf8SmCk1FkBkulKwY4RaL23LJSVTwclK8DavTqNrfrQz9MuK+W7s/2sNxz49a7aAbVOU9h7CGNqnrPGp9H00Vy1OQyAQ1ZhgGGY4LqpEzAaU7yg1klM3pDQwtSpqCGRdVGKXXscrUS5S2S6JXqQ8nCpoas0hj60wpzszT3lL8X2+YY3I6LrjMcgTJVhclufBQectkvSnXwFAsLFCmuX2rx2ZUU4Y2f9eGED798nO47HXDk27vU6999rmOo0kOyBtySELynpyRD+ScDAgjX8lP8pv8cb45v5y/zr+VteHUM6/Jo2ps3AO76q6Q</latexit>

Ni/2<latexit sha1_base64="P+4d+u2ER0yCxMehDRlp8OEY0ww=">AAACTHicbZDNT9swGMadMqAEBh07conWVILDSlImjSNSLztNTKLA1I/Icd9Qq44d2W8mqpA/cJcddttfscsOm6ZJuGkOfOyVbP38vM8r20+cCW4wCH44jbUX6xubzS13e+fl7l7r1f6lUblmMGBKKH0dUwOCSxggRwHXmQaaxgKu4nl/2b/6AtpwJS9wkcE4pTeSJ5xRtFLUYh0/O5xNRMTuZhN95Lsdf8SmCk1FkBkulKwY4RaL23LJSVTwclK8DavTqNrfrQz9MuK+W7s/2sNxz49a7aAbVOU9h7CGNqnrPGp9H00Vy1OQyAQ1ZhgGGY4LqpEzAaU7yg1klM3pDQwtSpqCGRdVGKXXscrUS5S2S6JXqQ8nCpoas0hj60wpzszT3lL8X2+YY3I6LrjMcgTJVhclufBQectkvSnXwFAsLFCmuX2rx2ZUU4Y2f9eGED798nO47HXDk27vU6999rmOo0kOyBtySELynpyRD+ScDAgjX8lP8pv8cb45v5y/zr+VteHUM6/Jo2ps3AO76q6Q</latexit>

Ci<latexit sha1_base64="8kLZ3xs7OABmNnz+vpdDmcxrxOE=">AAACS3icbZDNT9swGMadDkYpbHTjyCWirQQHuqQc4IjEhdNUpJUP9SNy3DfUwrEj+w2iyvL/7bILt/0TXHYYQhzmpDnw9Uq2fu/zvJbtJ0wEN+h5f5zah6Xljyv11cba+qfPG80vX8+MSjWDAVNC6YuQGhBcwgA5CrhINNA4FHAeXh8X/vkNaMOV/IHzBMYxvZI84oyilYJm2GknO7OJCNjP2UTvthud9ohNFZqSIDFcKFkywi1mt3nBUZDxfJLt+WU3sntlH+cBt9Ki+W6bb712I2i2vK5XlvsW/ApapKp+0LwbTRVLY5DIBDVm6HsJjjOqkTMBeWOUGkgou6ZXMLQoaQxmnJVZ5G7HKlM3UtouiW6pPj+R0diYeRzayZjizLz2CvE9b5hidDjOuExSBMkWF0WpcFG5RbDulGtgKOYWKNPcvtVlM6opQxt/EYL/+stv4azX9fe7vdNe6+iyiqNOtsg22SE+OSBH5IT0yYAw8ovck3/kwfnt/HUenafFaM2pzmySF1Vb/g8Ki644</latexit>

Ni<latexit sha1_base64="ABGXnieStoWuRnfhiC1QXNfIe+E=">AAACS3icbZDNb9MwGMadQkcpbCtw5BLRVtoO65LuAMdKXDihItGPqR+R475prTl2ZL9Bq0L+Py5cuPFPcOEAQhxw0khs617J1s/P876y/YSJ4AY977tTe/CwfvCo8bj55Onh0XHr2fOxUalmMGJKKD0NqQHBJYyQo4BpooHGoYBJePW28CefQBuu5EfcJrCI6VryiDOKVgpaYbeTnGyWImCfN0t92ml2O3O2UmhKgsRwoWTJCNeYXecFR0HG82V25penud0r+30ecCv9P5z3O82g1fZ6XlnuPvgVtElVw6D1bb5SLI1BIhPUmJnvJbjIqEbOBOTNeWogoeyKrmFmUdIYzCIrs8jdrlVWbqS0XRLdUr05kdHYmG0c2s6Y4sbc9QrxPm+WYvRmkXGZpAiS7S6KUuGicotg3RXXwFBsLVCmuX2ryzZUU4Y2/iIE/+6X92Hc7/kXvf6HfntwWcXRIC/JK3JCfPKaDMg7MiQjwsgX8oP8Ir+dr85P54/zd9dac6qZF+RW1er/ABwTrkM=</latexit>

Ni<latexit sha1_base64="ABGXnieStoWuRnfhiC1QXNfIe+E=">AAACS3icbZDNb9MwGMadQkcpbCtw5BLRVtoO65LuAMdKXDihItGPqR+R475prTl2ZL9Bq0L+Py5cuPFPcOEAQhxw0khs617J1s/P876y/YSJ4AY977tTe/CwfvCo8bj55Onh0XHr2fOxUalmMGJKKD0NqQHBJYyQo4BpooHGoYBJePW28CefQBuu5EfcJrCI6VryiDOKVgpaYbeTnGyWImCfN0t92ml2O3O2UmhKgsRwoWTJCNeYXecFR0HG82V25penud0r+30ecCv9P5z3O82g1fZ6XlnuPvgVtElVw6D1bb5SLI1BIhPUmJnvJbjIqEbOBOTNeWogoeyKrmFmUdIYzCIrs8jdrlVWbqS0XRLdUr05kdHYmG0c2s6Y4sbc9QrxPm+WYvRmkXGZpAiS7S6KUuGicotg3RXXwFBsLVCmuX2ryzZUU4Y2/iIE/+6X92Hc7/kXvf6HfntwWcXRIC/JK3JCfPKaDMg7MiQjwsgX8oP8Ir+dr85P54/zd9dac6qZF+RW1er/ABwTrkM=</latexit>

hn<latexit sha1_base64="6Grj4UggFtVlyjwoy/TGWDI7tzY=">AAACW3icbZFBT9swFMedAKPL2CibdtolWlMJDitJOYwjEpedJiatwNS0keO+EAvHjuyXiSrLl+QEB74KmhMqMWBPsvR7/+dnP/+dloIbDMNbx11b33i12Xvtvdl6+267v/P+1KhKM5gwJZQ+T6kBwSVMkKOA81IDLVIBZ+nlcVs/+w3acCV/4rKEWUEvJM84o2ilpK+HQbmbz0XC/uRzvRd4wyBmC4WmIygNF0p2jHCF9VXTcpbUvJnXX6Iui5vH+vcm4TZ7TPbHgRfEBcU8zeq8SexZSX8QjsIu/JcQrWBAVnGS9K/jhWJVARKZoMZMo7DEWU01ciag8eLKQEnZJb2AqUVJCzCzuvOm8YdWWfiZ0nZJ9Dv1346aFsYsi9TubKc0z2ut+L/atMLscFZzWVYIkj1clFXCR+W3RvsLroGhWFqgTHM7q89yqilD+x2tCdHzJ7+E0/EoOhiNf4wHR79WdvTIJ/KZ7JKIfCVH5Bs5IRPCyA25dzadnnPnrrmeu/Ww1XVWPR/Ik3A//gUM9K9a</latexit>

li<latexit sha1_base64="6dg92o6a2Y8m6Kku2rHM59lZeYo=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqypuFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//AO2Y9Rs=</latexit>

ri<latexit sha1_base64="c25ijodwC2EdGehxdf1LZf30SEY=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqyphFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//APb49SE=</latexit>

li<latexit sha1_base64="6dg92o6a2Y8m6Kku2rHM59lZeYo=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqypuFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//AO2Y9Rs=</latexit>

ri<latexit sha1_base64="c25ijodwC2EdGehxdf1LZf30SEY=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqyphFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//APb49SE=</latexit>

ri<latexit sha1_base64="c25ijodwC2EdGehxdf1LZf30SEY=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqyphFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//APb49SE=</latexit>

li<latexit sha1_base64="6dg92o6a2Y8m6Kku2rHM59lZeYo=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqypuFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//AO2Y9Rs=</latexit>

li<latexit sha1_base64="6dg92o6a2Y8m6Kku2rHM59lZeYo=">AAADBXicbVLLahsxFNVMX+n05bTLbETNQAqOmUkLLXQTWiiGgZJCnAQ8ttDImrFizQNJU+IKbbrpr3TTRUvptv/QXf+mGmeM3SQXBOee+zj3SkoqzqQKgr+Oe+Pmrdt3tu569+4/ePios/34WJa1IHRISl6K0wRLyllBh4opTk8rQXGecHqSzN828ZOPVEhWFkdqUdFxjrOCpYxgZSm07ez4KdKzBJuJ3gtNzGmqduMcq1mSamLQWW/lpNaJBctm6pnnx4qeK0suDIomBMaS5cs8grl+b2yDGkU9y2Y5RpHNz9Dg+v5RD24IRGuBDdW198kgPTeeP8BYxLHnHwlcyLQUubfaYp173mDFcirj1zDaC9uhtbQzUgnL1DQd3rVlqypuFZjxJ1qHxqBON+gHS4NXQdiCLmjtEHX+xNOS1DktFOFYylEYVGqssVCMcGpVakkrTOY4oyMLC2ynG+vlKxroW2YK7Tb2FAou2c0KjXMpF3liM5th5eVYQ14XG9UqfTXWrKhqRQtyIZTWHKoSNl8CTpmgRPGFBZgIZmeFZIYFJsp+HM9eQnh55avgeL8fPu/vf3jRPXjTXscW2AFPwS4IwUtwAAbgEAwBcT47X53vzg/3i/vN/en+ukh1nbbmCfjP3N//AO2Y9Rs=</latexit>

lCii

<latexit sha1_base64="xoYISwkAGqbFl3Ol2VnToSoMTRg=">AAADDXicbVJLa9tAEF6pr1R9Oe2xl6VGkIJjrLTQQi+hgWIQlBTiJGA5y2q9kjdePdhdlbjL/oFe+ld66aGl9Np7b/03XckydpMMCGa++Wa+mdHGJWdSDQZ/HffGzVu372zd9e7df/DwUWf78bEsKkHoiBS8EKcxlpSznI4UU5yeloLiLOb0JJ4f1PmTj1RIVuRHalHSSYbTnCWMYGUhtO10/QTpWYzNmd4NTMRponaiDKtZnGhi0HlvFSQ2iARLZ+q550eKXigLLgwKzwiMJMsaHsFcvze2QYXCnkXTDKPQ8lM0vL5/2IMbAuFaYEN1HX0ySM+N5w8xFlHk+UcC5zIpROattlhzL2pfsYzK6A0Md4N2aC3tjFTCIjF1h3dt2aqKWwVmb7HkHjSRQZ3uoD9oDF51gtbpgtYOUedPNC1IldFcEY6lHAeDUk00FooRTq1aJWmJyRyndGzdHNspJ7r5mwb6FplCu5X9cgUbdLNC40zKRRZbZj20vJyrwety40olryea5WWlaE6WQknFoSpg/TTglAlKFF9YBxPB7KyQzLDARNkH5NkjBJdXvuoc7/WDF/29Dy+7+2/bc2yBp+AZ2AEBeAX2wRAcghEgzmfnq/Pd+eF+cb+5P91fS6rrtDVPwH/m/v4HLsr5OQ==</latexit>

l2i<latexit sha1_base64="g2a4AJiloM+w0pa9bBxshT7b6kA=">AAADBHicbVJbi9NAFJ7E2xov29U3fRksgRW6JamCgi+LghQCssJ2d6Hphsl0ko6dScLMRLYO8+CLf8UXHxTx1R/hm//GSTeldXcPBL7znct3zsmkFaNSBcFfx712/cbNW1u3vTt3793f7uw8OJJlLTAZ4ZKV4iRFkjBakJGiipGTShDEU0aO0/mbJn78kQhJy+JQLSoy4SgvaEYxUpZKdpxHfpboWYrMqd4LTcxIpnZjjtQszTQ2yYfeysmsEwuaz9RTz48VOVOWXJgkOsUwlpQv8zBi+p2xDeok6lk25yiJbH6eDK/uH/XghkC0FthQXXufTKLnxvOHCIk49vxDgQqZlYJ7qy3WuWcNVpQTGb+C0V7YDq2lnZFIWGam6fC2LVtVMatA7S30wJik0w36wdLgZRC2oAtaO0g6f+JpiWtOCoUZknIcBpWaaCQUxYxYkVqSCuE5ysnYwgLZ4SZ6+RMN9C0zhXYZ+xUKLtnNCo24lAue2sxmVnkx1pBXxca1yl5ONC2qWpECnwtlNYOqhM2LgFMqCFZsYQHCgtpZIZ4hgbCy78azRwgvrnwZHA364bP+4P3z7v7r9hxb4DF4AnZBCF6AfTAEB2AEsPPZ+ep8d364X9xv7k/313mq67Q1D8F/5v7+BytP9O0=</latexit>

(b) SCF flow with mAR prior.

Figure 2. Flow-based generative models with multi-scale autoregressive priors (mAR-SCF). The generative model (left) shows the multi-scaleautoregressive sampling of the channel dimensions of li at each level. The spatial dimensions of each channel are sampled in parallel. ri arecomputed with invertable transformations. The mAR-SCF model (right) shows the complete multi-scale architecture with the mAR priorapplied along the channels of li, i.e. at each level i after the SPLIT operation.

Normalizing flows. Normalizing flows [8] are a class of exact inference generative models. Here, we consider invertible flows, which allow for both efficient exact inference and sampling. Specifically, invertible flows consist of a sequence of n invertible functions f_θi, which transform a density on the data x to a density on latent variables z,

$$\mathbf{x} \;\overset{f_{\theta_1}}{\longleftrightarrow}\; \mathbf{h}_1 \;\overset{f_{\theta_2}}{\longleftrightarrow}\; \mathbf{h}_2 \;\cdots\; \overset{f_{\theta_n}}{\longleftrightarrow}\; \mathbf{z}. \tag{1}$$

Given that we can compute the likelihood of p(z), the likelihood of the data x under the transformation f can be computed using the change of variables formula,

$$\log p_\theta(\mathbf{x}) = \log p(\mathbf{z}) + \sum_{i=1}^{n} \log \left|\det J_{\theta_i}\right|, \tag{2}$$

where $J_{\theta_i} = \partial \mathbf{h}_i / \partial \mathbf{h}_{i-1}$ is the Jacobian of the invertible transformation f_θi going from h_{i-1} to h_i with h_0 ≡ x. Note that most prior work [4, 8, 9, 14, 18] considers i.i.d. Gaussian likelihood models of z, e.g. $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \,|\, \mu, \sigma)$.
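Eq. (2) can be made concrete with a toy flow. The sketch below is not the paper's model: it chains two invertible elementwise affine transforms under a standard-Gaussian base distribution, with arbitrary illustrative scales and shifts.

```python
import numpy as np

# Toy illustration of Eq. (2): a chain of two invertible elementwise affine
# transforms h_i = s_i * h_{i-1} + b_i under a standard-Gaussian base p(z).
# For an elementwise affine map, log|det J| = sum(log|s_i|).
rng = np.random.default_rng(0)
D = 4                                    # data dimensionality
scales = [np.full(D, 2.0), np.full(D, 0.5)]
shifts = [np.ones(D), -np.ones(D)]

def log_likelihood(x):
    h, log_det = x, 0.0
    for s, b in zip(scales, shifts):     # x -> h_1 -> ... -> z
        h = s * h + b
        log_det += np.sum(np.log(np.abs(s)))
    log_pz = -0.5 * np.sum(h**2) - 0.5 * D * np.log(2 * np.pi)
    return log_pz + log_det              # Eq. (2)

x = rng.normal(size=D)
print(log_likelihood(x))
```

Because the transforms are invertible, the exact likelihood of x is available in closed form; this is the property that SCF models preserve while scaling to images.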

These models, however, have limitations. First, the requirement of invertibility constrains the class of functions f_θi to be monotonically increasing (or decreasing), thus limiting expressiveness. Second, of the three possible variants of f_θi to date [41], MAF (masked autoregressive flows), IAF (inverse autoregressive flows), and SCF (split coupling flows), MAFs are difficult to parallelize due to sequential dependencies between dimensions and IAFs do not perform well in practice. SCFs strike the right balance with respect to parallelization and modeling power. In detail, SCFs partition the dimensions into two equal halves and transform one of the halves r_i conditioned on l_i, leaving l_i unchanged and thus not introducing any sequential dependencies (making parallelization easier). Examples of SCFs include the affine couplings of RealNVP [9] and the MixLogCDF couplings of Flow++ [14].

In practice, SCFs are organized into blocks [8, 9, 18] to maximize efficiency such that each f_θi typically consists of SQUEEZE, STEPOFFLOW, and SPLIT operations. SQUEEZE trades off spatial resolution for channel depth. Suppose an intermediate layer h_i is of size [C_i, N_i, N_i]; then the SQUEEZE operation transforms it into size [4C_i, N_i/2, N_i/2] by reshaping 2 × 2 spatial neighborhoods into 4 channels. STEPOFFLOW is a series of (possibly several) SCF coupling layers and invertible 1 × 1 convolutions [8, 9, 18].

The SPLIT operation (distinct from the split couplings) splits an intermediate layer h_i into two halves {l_i, r_i} of size [2C_i, N_i/2, N_i/2] each. Subsequent invertible layers f_θj>i operate only on r_i, leaving l_i unchanged. In other words, the SPLIT operation fixes some dimensions of the latent representation z to l_i, as they are not transformed any further. This leads to a significant reduction in the amount of computation and memory needed. In the following, we denote the spatial resolutions at the n different levels as N = {N_0, ..., N_n}, with N = N_0 being the input resolution. Similarly, C = {C_0, ..., C_n} denotes the number of feature channels, with C = C_0 being the number of input channels.
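The shape bookkeeping of SQUEEZE and SPLIT can be sketched with plain array reshapes. This is an illustration only; real implementations use invertible framework ops such as space-to-depth.

```python
import numpy as np

def squeeze(h):
    """SQUEEZE: [C, N, N] -> [4C, N/2, N/2] by moving each 2x2 spatial
    block into the channel dimension (illustration via plain reshapes)."""
    C, N, _ = h.shape
    h = h.reshape(C, N // 2, 2, N // 2, 2)
    h = h.transpose(0, 2, 4, 1, 3)          # 2x2 factors become channels
    return h.reshape(4 * C, N // 2, N // 2)

def split(h):
    """SPLIT: [4C', N', N'] -> two halves l, r of [2C', N', N'] each."""
    C = h.shape[0]
    return h[: C // 2], h[C // 2 :]

h = np.arange(3 * 8 * 8, dtype=np.float32).reshape(3, 8, 8)
h = squeeze(h)                               # (12, 4, 4)
l, r = split(h)                              # each (6, 4, 4)
print(h.shape, l.shape, r.shape)
```

Both operations are trivially invertible: the inverse of SQUEEZE undoes the reshape (depth to space), and the inverse of SPLIT concatenates the halves back along the channel axis.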

In practice, due to limited modeling flexibility, prior SCF-based models [9, 14, 18] require many SCF coupling layers in f_θi to model complex distributions, e.g. images. This in turn leads to high memory requirements and less efficient sampling procedures.


Page 4: arXiv:2004.03891v1 [cs.LG] 8 Apr 2020 · Normalizing Flows with Multi-Scale Autoregressive Priors Shweta Mahajan*1 Apratim Bhattacharyya*2 Mario Fritz3 Bernt Schiele2 Stefan Roth1

Autoregressive models. Autoregressive generative models are another class of powerful and highly flexible exact inference models. They factorize complex target distributions by decomposing them into a product of conditional distributions, e.g. images with N × N spatial resolution as $p(\mathbf{x}) = \prod_{i=1}^{N} \prod_{j=1}^{N} p\big(x_{i,j} \,\big|\, \mathbf{x}_{\mathrm{pred}(i,j)}\big)$ [10, 12, 25, 37, 38]. Here, pred(i, j) denotes the set of predecessors of pixel (i, j). The functional form of these conditionals can be highly flexible, and allows such models to capture complex multimodal distributions. However, such a dependency structure only allows for image synthesis via ancestral sampling by generating each pixel sequentially, conditioned on the previous pixels [37, 38], making parallelization difficult. This is also inefficient since autoregressive models, including PixelCNN and PixelRNN, require O(N²) time steps for sampling.
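The O(N²) cost of ancestral sampling can be sketched as follows; the learned conditional is replaced by a stand-in function, since only the sequential structure matters here.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional(x, i, j):
    # Stand-in for a learned conditional p(x_ij | x_pred(i,j)); here simply a
    # Gaussian whose mean depends on the already-sampled raster-scan context.
    context = x[:i].sum() + x[i, :j].sum()
    return 0.01 * context, 1.0               # (mean, std)

def ancestral_sample(N):
    x = np.zeros((N, N))
    for i in range(N):                        # one pixel per sequential step,
        for j in range(N):                    # so N*N = O(N^2) steps in total
            mu, sigma = conditional(x, i, j)
            x[i, j] = rng.normal(mu, sigma)
    return x

sample = ancestral_sample(8)                  # 64 sequential steps for 8x8
print(sample.shape)
```

Each pixel must wait for all of its predecessors, so the inner loops cannot be parallelized; this is the bottleneck the mAR prior is designed to avoid.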

4. Multi-scale Autoregressive Flow Priors

We propose to leverage the strengths of autoregressive models to improve invertible normalizing flow models such as [9, 18]. Specifically, we propose novel multi-scale autoregressive priors for split coupling flows (mAR-SCF). Using them allows us to learn complex multimodal latent priors p(z) in multi-scale SCF models, cf. Eq. (2). This is unlike [9, 14, 18, 28], which rely on Gaussian priors in the latent space. Additionally, we also propose a scheme for interpolation in the latent space of our mAR-SCF models.

The use of our novel autoregressive mAR priors for invertible flow models has two distinct advantages over both vanilla SCF and autoregressive models. First, the powerful autoregressive prior helps mitigate the limited modeling capacity of the vanilla SCF flow models. Second, as only the prior is autoregressive, flow models with our mAR prior are an order of magnitude faster with respect to sampling time than fully autoregressive models. Next, we describe our multi-scale autoregressive prior in detail.

Our mAR-SCF model uses an efficient invertible split coupling flow f_θi(x) to map the distribution over the data x to a latent variable z and then models an autoregressive mAR prior over z, parameterized by φ. The likelihood of a data point x of dimensionality [C, N, N] can be expressed as

$$\log p_{\theta,\phi}(\mathbf{x}) = \log p_\phi(\mathbf{z}) + \sum_{i=1}^{n} \log \left|\det J_{\theta_i}\right|. \tag{3}$$

Here, J_θi is the Jacobian of the invertible transformations f_θi. Note that, as f_θi(x) is an invertible function, z has the same total dimensionality as the input data point x.

Formulation of the mAR prior. We now introduce our mAR prior p_φ(z) along with our mAR-SCF model, which combines the split coupling flows f_θi with an mAR prior. As shown in Fig. 2, our mAR prior is applied after every SPLIT operation of the invertible flow layers as well as at the smallest spatial resolution. Let l_i = {l_i^1, ..., l_i^C_i} be the C_i channels of size [C_i, N_i, N_i], which do not undergo further transformation f_θi after the SPLIT at level i. Following the SPLIT at level i, our mAR prior is modeled as a conditional distribution, p_φ(l_i | r_i); at the coarsest spatial resolution it is an unconditional distribution, p_φ(h_n). Thereby, we assume that our mAR prior at each level i autoregressively factorizes along the channel dimension as

$$p_\phi(\mathbf{l}_i \,|\, \mathbf{r}_i) = \prod_{j=1}^{C_i} p_\phi\big(\mathbf{l}_i^j \,\big|\, \mathbf{l}_i^1, \cdots, \mathbf{l}_i^{j-1}, \mathbf{r}_i\big). \tag{4}$$

Furthermore, the distribution at each spatial location (m, n) within a channel l_i^j is modeled as a conditional Gaussian,

$$p_\phi\big(\mathbf{l}_{i,(m,n)}^j \,\big|\, \mathbf{l}_i^1, \cdots, \mathbf{l}_i^{j-1}, \mathbf{r}_i\big) = \mathcal{N}\big(\mu_{i,(m,n)}^j,\, \sigma_{i,(m,n)}^j\big). \tag{5}$$

Thus, the mean μ_i,(m,n)^j and variance σ_i,(m,n)^j at each spatial location are autoregressively modeled along the channels. This allows the distribution at each spatial location to be highly flexible and capture multimodality in the latent space. Moreover, from Eq. (4), our mAR prior can model long-range correlations in the latent space, as the distribution of each channel is dependent on all previous channels.

This autoregressive factorization allows us to employ Conv-LSTMs [34] to model the distributions p_φ(l_i^j | l_i^1, ..., l_i^{j-1}, r_i) and p_φ(h_n). Conv-LSTMs can model long-range dependencies across channels in their internal state. Additionally, long-range spatial dependencies within channels can be modeled by stacking multiple Conv-LSTM layers with a wide receptive field. This formulation allows all pixels within a channel to be sampled in parallel, while the channels are sampled in a sequential manner,

$$\mathbf{l}_i^j \sim p_\phi\big(\mathbf{l}_i^j \,\big|\, \mathbf{l}_i^1, \cdots, \mathbf{l}_i^{j-1}, \mathbf{r}_i\big). \tag{6}$$
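The channel-sequential, pixel-parallel sampling of Eqs. (4)–(6) can be sketched as below. The Conv-LSTM is replaced by a toy stand-in function that maps the already-sampled channels and the conditioning r_i to per-location Gaussian parameters; only the loop structure reflects the paper's prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_params(prev_channels, r):
    # Stand-in for the Conv-LSTM of the mAR prior: maps the already-sampled
    # channels l_i^1..l_i^{j-1} and the conditioning r_i to per-location
    # (mu, sigma) for the next channel, as in Eq. (5).
    context = r.mean(axis=0)
    for c in prev_channels:
        context = context + 0.1 * c
    return context, np.ones_like(context)      # (mu, sigma) per location

def sample_mar_prior(r, num_channels):
    channels = []
    for _ in range(num_channels):               # channels: sequential (Eq. 6)
        mu, sigma = predict_params(channels, r)
        channels.append(rng.normal(mu, sigma))  # all pixels: parallel
    return np.stack(channels)                   # l_i of shape [C_i, N_i, N_i]

r_i = rng.normal(size=(6, 4, 4))
l_i = sample_mar_prior(r_i, num_channels=6)     # 6 sequential steps, not 6*4*4
print(l_i.shape)
```

The critical path is the number of channels C_i rather than the number of pixels, which is what makes the overall sampling cost linear in N.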

This is in contrast to PixelCNN/RNN-based models, whichsample one pixel at a time.The mAR-SCF model. We illustrate our mAR-SCF modelarchitecture in Fig. 2b. Our mAR-SCF model leveragesthe SQUEEZE and SPLIT operations for invertible flows in-troduced in [8, 9] for efficient parallelization. Following[8, 9, 18], we use several SQUEEZE and SPLIT operationsin a multi-scale setup at n scales (Fig. 2b) until the spatialresolution at hn is reasonably small, typically 4× 4. Notethat there is no SPLIT operation at the smallest spatial res-olution. Therefore, the latent space is the concatenation ofz = {l1, . . . , ln−1,hn}. The split coupling flows (SCF) fθiin the mAR-SCF model remain invertible by construction.We consider different SCF couplings for fθi , including theaffine couplings of [9, 18] and MixLogCDF couplings [14].

Given the parameters φ of our multimodal mAR prior modeled by the Conv-LSTMs, we can compute p_φ(z) using the formulation in Eqs. (4) and (5). We can thus express


Page 5: arXiv:2004.03891v1 [cs.LG] 8 Apr 2020 · Normalizing Flows with Multi-Scale Autoregressive Priors Shweta Mahajan*1 Apratim Bhattacharyya*2 Mario Fritz3 Bernt Schiele2 Stefan Roth1

Algorithm 1: MARPS: Multi-scale Autoregressive Prior Sampling for our mAR-SCF models

1  Sample h_n ∼ p_φ(h_n);
2  for i ← n−1 to 1 do
       /* SplitInverse */
3      r_i ← h_{i+1};                // Assign previous
4      l_i ∼ p_φ(l_i | r_i);         // Sample mAR prior
5      h_i ← {l_i, r_i};             // Concatenate
       /* StepOfFlowInverse */
6      Apply f_i^{−1}(h_i);          // SCF coupling
       /* SqueezeInverse */
7      Reshape h_i;                  // Depth to space
8  end
9  x ← h_1;
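Algorithm 1 can be sketched at the shape level as follows; `sample_prior_h`, `sample_prior_l`, and `inverse_flow` are placeholder stand-ins (standard-normal samples and an identity map) for the learned mAR prior and SCF couplings, and we apply the top level's flow inverse and depth-to-space reshape before entering the split loop so that the tensor shapes line up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_h(shape):
    # Placeholder for p_phi(h_n); a learned Conv-LSTM prior in the paper.
    return rng.standard_normal(shape)

def sample_prior_l(r):
    # Placeholder for p_phi(l_i | r_i); l_i has the same shape as r_i.
    return rng.standard_normal(r.shape)

def inverse_flow(h):
    # Placeholder for f_i^{-1}; an invertible SCF coupling in the paper.
    return h

def unsqueeze(h):
    # Depth-to-space reshape: [4C, N, N] -> [C, 2N, 2N].
    c4, n, _ = h.shape
    c = c4 // 4
    return h.reshape(c, 2, 2, n, n).transpose(0, 3, 1, 4, 2).reshape(c, 2 * n, 2 * n)

def marps_sample(n_levels=3, c=3, n_out=32):
    # h_n has shape [2^{n+1} C, N/2^n, N/2^n] (Lemma A.1); no SPLIT at the top.
    h = sample_prior_h((2 ** (n_levels + 1) * c,
                        n_out // 2 ** n_levels, n_out // 2 ** n_levels))
    h = unsqueeze(inverse_flow(h))          # top level: flow inverse + reshape only
    for i in range(n_levels - 1, 0, -1):    # levels n-1 .. 1
        r = h                               # SplitInverse: r_i <- h_{i+1}
        l = sample_prior_l(r)               # sample l_i from the mAR prior
        h = np.concatenate([l, r], axis=0)  # h_i <- {l_i, r_i}
        h = unsqueeze(inverse_flow(h))      # StepOfFlowInverse + SqueezeInverse
    return h

x = marps_sample()
print(x.shape)  # (3, 32, 32)
```

Running the sketch with three levels recovers the original image shape, confirming that the SPLIT/SQUEEZE bookkeeping inverts cleanly.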

Eq. (3) in closed form and directly maximize the likelihood of the data under the multimodal mAR prior distribution learned by the Conv-LSTMs.

Next, we show that the computational cost of our mAR-SCF model is O(N) for sampling an image of size [C, N, N]; this is in contrast to the O(N²) computational cost of purely autoregressive models.

Analysis of sampling time. We now formally analyze the computational cost in the number of steps T required for sampling with our mAR-SCF model. First, we describe the sampling process in detail in Algorithm 1 (the forward training process follows the sampling process in reverse order). Next, we derive the worst-case number of steps T required by MARPS, given sufficient parallel resources to sample a channel in parallel. Here, the number of steps T can be seen as the length of the critical path during sampling.

Lemma 4.1. Let the sampled image x be of resolution [C, N, N]; then the worst-case number of steps T (length of the critical path) required by MARPS is O(N).

Proof. At the first sampling step (Fig. 2a) at layer f_θn, our mAR prior is applied to generate h_n, which is of shape [2^{n+1} C, N/2^n, N/2^n]. Therefore, the number of sequential steps required at the last flow layer h_n is

T_n = C · 2^{n+1}.   (7)

Here, we assume that each channel can be sampled in parallel in one time-step.

From f_θ{n−1} to f_θ1, each f_θi contains a SPLIT operation. Therefore, at each f_θi we use our mAR prior to sample l_i, which has shape [2^i C, N/2^i, N/2^i]. Therefore, the number of sequential steps required for sampling at layers h_i, 1 ≤ i < n, of our mAR-SCF model is

T_i = C · 2^i.   (8)

Therefore, the total number of sequential steps (length of the critical path) required for sampling is

T = T_n + T_{n−1} + … + T_i + … + T_1
  = C · (2^{n+1} + 2^{n−1} + … + 2^i + … + 2^1)
  = C · (3 · 2^n − 2).   (9)

Now, the total number of layers in our mAR-SCF model is n ≤ log₂(N), since each layer reduces the spatial resolution by a factor of two. Therefore, the total number of time-steps required is

T ≤ 3 · C · N.   (10)

In practice, C ≪ N, with C = C₀ = 3 for RGB images. Therefore, the total number of sequential steps required for sampling in our mAR-SCF model is T = O(N).
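The step counts in the proof are easy to sanity-check numerically; the helper below sums T_n and T_i directly and compares against the closed form of Eq. (9):

```python
def sampling_steps(c, n_levels):
    # T_n = C * 2^{n+1} for the top level (Eq. 7), T_i = C * 2^i for i < n (Eq. 8).
    t = c * 2 ** (n_levels + 1)
    for i in range(1, n_levels):
        t += c * 2 ** i
    return t

for c, n in [(3, 3), (3, 5), (1, 4)]:
    assert sampling_steps(c, n) == c * (3 * 2 ** n - 2)   # closed form, Eq. (9)
    assert sampling_steps(c, n) <= 3 * c * 2 ** n         # T <= 3CN with N = 2^n, Eq. (10)
print(sampling_steps(3, 3))  # 66
```

For an RGB image with 3 levels (C = 3, N = 32 implies N/2^n = 4), this gives 66 sequential steps, well within the 3CN bound.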

It follows that with the multi-scale autoregressive mAR priors in our mAR-SCF model, sampling can be performed in a linear number of time-steps, in contrast to fully autoregressive models like PixelCNN, which require a quadratic number of time-steps [38].

Interpolation. A major advantage of invertible flow-based models is that they provide latent spaces that are useful for downstream tasks like interpolation – smoothly transforming one data point into another. Interpolation is simple in typical invertible flow-based models because the latent space is modeled as a unimodal i.i.d. Gaussian. To allow interpolation in the space of our multimodal mAR priors, we develop a simple method based on [2].

Let x_A and x_B be the two images (points) to be interpolated and z_A and z_B the corresponding points in the latent space. We begin with an initial linear interpolation between the two latent points, {z_A, z_{A,B}^1, …, z_{A,B}^k, z_B}, such that z_{A,B}^i = (1 − α_i) z_A + α_i z_B. The initial linearly interpolated points z_{A,B}^i may not lie in a high-density region under our multimodal prior, leading to non-smooth transformations. Therefore, we next project the interpolated points z_{A,B}^i to a high-density region, without deviating too much from their initial position. This is possible because our mAR prior allows for exact inference. However, the image corresponding to the projected point must also not deviate too far from either x_A or x_B, to allow for smooth transitions. To that end, we define the projection operation as

z̄_{A,B}^i = arg min_z ( ‖z − z_{A,B}^i‖ − λ₁ log p_φ(z) + λ₂ min( ‖f^{−1}(z) − x_A‖, ‖f^{−1}(z) − x_B‖ ) ),   (11)

where λ₁ and λ₂ are regularization parameters. The term controlled by λ₁ pulls the interpolated points back to high-density regions, while the term controlled by λ₂ keeps the result close to the two images x_A and x_B. Note that this reduces to linear interpolation when λ₁ = λ₂ = 0.
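As a toy illustration, the sketch below linearly interpolates two latent codes and then runs a few gradient steps realizing only the λ₁ term of Eq. (11), with a standard-normal stand-in for p_φ (so grad log p(z) = −z); the λ₂ image-space term is omitted for brevity:

```python
import numpy as np

def interpolate_latents(z_a, z_b, k=6):
    # Initial linear interpolation between the two latent codes (k interior points).
    alphas = np.linspace(0.0, 1.0, k + 2)
    return [(1 - a) * z_a + a * z_b for a in alphas]

def project(z_init, log_prob_grad, lam1=0.1, lr=0.01, steps=50):
    # Toy gradient ascent for the lambda_1 term of Eq. (11): pull z toward
    # high-density regions while staying close to its initial position.
    z = z_init.copy()
    for _ in range(steps):
        z += lr * (lam1 * log_prob_grad(z) - (z - z_init))
    return z

# Example with a standard-normal prior stand-in: grad log p(z) = -z.
z_a, z_b = np.full(4, -2.0), np.full(4, 2.0)
path = [project(z, lambda z: -z) for z in interpolate_latents(z_a, z_b)]
print(len(path))  # 8
```

With the real mAR prior, `log_prob_grad` would be obtained by differentiating the exact log-likelihood of Eqs. (4) and (5) with respect to z.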



                                          MNIST                               CIFAR10
Method              Coupling          Levels  |SCF|  Channels  bits/dim (↓)   Levels  |SCF|  Channels  bits/dim (↓)
PixelCNN [38]       Autoregressive    –       –      –         –              –       –      –         3.00
PixelCNN++ [37]     Autoregressive    –       –      –         –              –       –      –         2.92
Glow [18]           Affine            3       32     512       1.05           3       32     512       3.35
Flow++ [14]         MixLogCDF         –       –      –         –              3       –      96        3.29
Residual Flow [4]   Residual          3       16     –         0.97           3       16     –         3.28
mAR-SCF (Ours)      Affine            3       32     256       1.04           3       32     256       3.33
mAR-SCF (Ours)      Affine            3       32     512       1.03           3       32     512       3.31
mAR-SCF (Ours)      MixLogCDF         3       4      96        0.88           3       4      96        3.27
mAR-SCF (Ours)      MixLogCDF         –       –      –         –              3       4      256       3.24

Table 1. Evaluation of our mAR-SCF model on MNIST and CIFAR10 (using uniform dequantization for fair comparison with [4, 18]).

5. Experiments

We evaluate our approach on the MNIST [22], CIFAR10 [21], and ImageNet [38] datasets. In comparison to datasets like CelebA, CIFAR10 and ImageNet are highly multimodal, and the performance of invertible SCF models has lagged behind autoregressive models in density estimation and behind GAN-based generative models in image quality.

5.1. MNIST and CIFAR10

Architecture details. Our mAR prior at each level f_θi consists of three convolutional LSTM layers, each of which uses 32 convolutional filters to compute the input-to-state and state-to-state components. Keeping the mAR prior architecture constant, we experiment with different SCF couplings in f_θi to highlight the effectiveness of our mAR prior. We experiment with the affine couplings of [9, 18] and MixLogCDF couplings [14]. Affine couplings have limited modeling flexibility; the more expressive MixLogCDF coupling applies the cumulative distribution function of a mixture of logistics. In the following, we include experiments varying the number of couplings and the number of channels in the convolutional blocks of the neural networks used to predict the affine/MixLogCDF transformation parameters.†

Hyperparameters. We use Adamax (as in [18]) with a learning rate of 8 × 10⁻⁴. We use a batch size of 128 with affine couplings and 64 with MixLogCDF couplings (following [14]).

Density estimation. We report density estimation results on MNIST and CIFAR10 in Table 1 using the per-pixel log-likelihood metric in bits/dim. We also include the architecture details (number of levels, coupling type, number of channels). We compare to the state-of-the-art Flow++ [14] method with SCF couplings and to Residual Flows [4]. Note that in terms of architecture, our mAR-SCF model with affine couplings is closest to Glow [18]; the comparison with Glow therefore serves as an ideal ablation to assess the effectiveness of our mAR prior. Flow++ [14], on the other hand, uses the more powerful MixLogCDF transformations, and their

†Code available at: https://github.com/visinf/mar-scf

model architecture does not include SPLIT operations. Because of this, Flow++ has higher computational and memory requirements for a given batch size compared to Glow. Furthermore, for fair comparison with Glow [18] and Residual Flows [4], we use uniform dequantization, unlike Flow++, which proposes to use variational dequantization.

In comparison to Glow, we achieve improved density estimation results on both MNIST and CIFAR10. In detail, we outperform Glow (1.04 vs. 1.05 bits/dim on MNIST and 3.33 vs. 3.35 bits/dim on CIFAR10) with |SCF| = 32 affine couplings and 3 levels, while using parameter prediction networks with only half (256 vs. 512) the number of channels. We observe that increasing the capacity of our parameter prediction networks to 512 channels boosts the log-likelihood further, to 1.03 bits/dim on MNIST and 3.31 bits/dim on CIFAR10. As this setting with 512 channels is identical to the setting reported in [18], this shows that our mAR prior improves accuracy by ∼0.04 bits/dim on CIFAR10. To place this performance gain in context, it is competitive with the ∼0.03 bits/dim boost reported in [18] (cf. Fig. 3 in [18]) from the introduction of the 1 × 1 convolution. We train our model for ∼3000 epochs, similar to [18]. Also note that we only require a batch size of 128 to achieve state-of-the-art likelihoods, whereas Glow uses batches of size 512. Thus, our mAR-SCF model improves density estimates while requiring significantly fewer computational resources (∼48 vs. ∼128 GB memory). Overall, we also observe competitive sampling speed (see also Table 2). This firmly establishes the utility of our mAR-SCF model.

For fair comparison with Flow++ [14] and Residual Flows [4], we employ the more powerful MixLogCDF couplings. Our mAR-SCF model uses 4 MixLogCDF couplings at each level with 96 channels, but includes SPLIT operations, unlike Flow++. Here, we outperform Flow++ and Residual Flows (3.27 vs. 3.29 and 3.28 bits/dim on CIFAR10) while being equally fast to sample as Flow++ (Table 2). A baseline model without our mAR prior has performance comparable to Flow++ (3.29 bits/dim). Similarly, on MNIST, our mAR-SCF model again outperforms Residual Flows (0.88 vs.



(a) Residual Flows [4] (3.28 bits/dim, 46.3 FID) (b) Flow++ with variational dequantization [14] (3.08 bits/dim)

(c) Our mAR-SCF Affine (3.31 bits/dim, 41.0 FID) (d) Our mAR-SCF MixLogCDF (3.24 bits/dim, 41.9 FID)

Figure 3. Comparison of random samples from our mAR-SCF model with state-of-the-art models.

0.97 bits/dim). Finally, we train a more powerful mAR-SCF model with 256 channels whose sampling speed is competitive with [4]; it achieves a state-of-the-art 3.24 bits/dim on CIFAR10.‡ This is attained after ∼400 training epochs (comparable to the ∼350 epochs required by [4] to achieve 3.28 bits/dim). Next, we compare the sampling speed of our mAR-SCF model with that of Flow++ and Residual Flows.

Method              Coupling         Levels  |SCF|  Ch.   Speed (ms, ↓)
Glow [18]           Affine           3       32     512   13
Flow++ [14]         MixLogCDF        3       –      96    19
Residual Flow [4]   Residual         3       16     –     34
PixelCNN++ [33]     Autoregressive   –       –      –     5 × 10³
mAR-SCF (Ours)      Affine           3       32     256   6
mAR-SCF (Ours)      Affine           3       32     512   17
mAR-SCF (Ours)      MixLogCDF        3       4      96    19
mAR-SCF (Ours)      MixLogCDF        3       4      256   32

Table 2. Evaluation of sampling speed with batches of size 32.

Sampling speed. We report the sampling speed of our mAR-SCF model in Table 2 in terms of sampling one image on CIFAR10. We report the average over 1000 runs using a batch size of 32. We performed all tests on a single Nvidia V100 GPU with 32 GB of memory. First, note that our mAR-

‡ mAR-SCF with 256 channels trained to convergence on CIFAR10 (∼1000 epochs) achieves 3.22 bits/dim.

SCF model with affine coupling layers in 3 levels with 512 channels needs 17 ms on average to sample an image. This is comparable with Glow, which requires 13 ms. This shows that our mAR prior causes only a slight increase in sampling time – particularly because our mAR-SCF requires only O(N) steps to sample and the prior has far fewer parameters than the invertible flow network. Moreover, our mAR-SCF model with affine coupling layers and 256 channels is considerably faster (6 vs. 13 ms) with an accuracy advantage. Similarly, our mAR-SCF with MixLogCDF and 96 channels is competitive in speed with [14] with an accuracy advantage, and considerably faster than [4] (19 vs. 34 ms). This is because Residual Flows are slower to invert (sample), as there is no closed-form expression for the inverse. Furthermore, our mAR-SCF with MixLogCDF and 256 channels is competitive with [4] in terms of sampling speed while having a large accuracy advantage. Finally, note that these sampling speeds are two orders of magnitude faster than state-of-the-art fully autoregressive approaches, e.g. PixelCNN++ [33].

Sample quality. Next, we analyze the sample quality of our mAR-SCF model in Table 3 using the FID metric [13] and Inception scores [32]. The analysis of sample quality is important, as it is well known that visual fidelity and test log-likelihoods are not necessarily indicative of each other [35].



Figure 4. Interpolations of our mAR-SCF model on CIFAR10.

(a) Residual Flows [4] (3.75 bits/dim)

(b) Our mAR-SCF (Affine, 3.80 bits/dim)

Figure 5. Random samples on ImageNet (64× 64).

We achieve an FID of 41.0 and an Inception score of 5.7 with our mAR-SCF model with affine couplings, significantly better than Glow with the same specifications and than Residual Flows. While our mAR-SCF model with MixLogCDF couplings also performs comparably, empirically we find affine couplings to lead to better image quality, as in [4]. We show random samples from our mAR-SCF model with both affine and MixLogCDF couplings in Fig. 3. Here, we compare to the version of Flow++ with MixLogCDF couplings and variational dequantization (which gives even better log-likelihoods) and to Residual Flows. Our mAR-SCF model achieves better sample quality with more clearly defined objects. Furthermore, we also obtain improved sample quality over both PixelCNN and PixelIQN and close the gap to adversarial approaches like DCGAN [27] and WGAN-GP [40]. This highlights that our mAR-SCF model is able to better capture long-range correlations.

Interpolation. We show interpolations on CIFAR10 in Fig. 4, obtained using Eq. (11). We observe smooth interpolation between images belonging to distinct classes. This shows that the latent space of our mAR prior can potentially be used for downstream tasks, similarly to Glow [18]. We include additional analyses in Appendix E.

5.2. ImageNet

Finally, we evaluate our mAR-SCF model on ImageNet (32 × 32 and 64 × 64) against the best performing models on

Method              Coupling         FID (↓)  Inception Score (↑)
PixelCNN [38]       Autoregressive   65.9     4.6
PixelIQN [24]       Autoregressive   49.4     –
Glow [18]           Affine           46.9     –
Residual Flow [4]   Residual         46.3     5.2
mAR-SCF (Ours)      MixLogCDF        41.9     5.7
mAR-SCF (Ours)      Affine           41.0     5.7
DCGAN [27]          Adversarial      37.1     6.4
WGAN-GP [40]        Adversarial      36.4     6.5

Table 3. Evaluation of sample quality on CIFAR10. Other results are quoted from [4, 24].

Method              Coupling    |SCF|  Ch.   bits/dim (↓)  Mem (GB, ↓)
Glow [18]           Affine      32     512   4.09          ∼128
Residual Flow [4]   Residual    32     –     4.01          –
mAR-SCF (Ours)      Affine      32     256   4.07          ∼48
mAR-SCF (Ours)      MixLogCDF   4      460   3.99          ∼80

Table 4. Evaluation on ImageNet (32 × 32).

MNIST and CIFAR10 in Table 4, i.e. Glow [18] and Residual Flows [4]. Our model with affine couplings outperforms Glow while using fewer channels (4.07 vs. 4.09 bits/dim). For comparison with the more powerful Residual Flow models, we use four MixLogCDF couplings at each layer f_θi with 460 channels. We again outperform Residual Flows [4] (3.99 vs. 4.01 bits/dim). These results are consistent with the findings in Table 1, highlighting the advantage of our mAR prior. Finally, we also evaluate on the ImageNet (64 × 64) dataset. Our mAR-SCF model with affine flows achieves 3.80 vs. 3.81 bits/dim in comparison to Glow [18]. We show qualitative examples in Fig. 5 and compare to Residual Flows. We see that although the powerful Residual Flows obtain better log-likelihoods (3.75 bits/dim), our mAR-SCF model achieves better visual fidelity. This again highlights that our mAR prior is able to better capture long-range correlations.

6. Conclusion

We presented mAR-SCF, a flow-based generative model with novel multi-scale autoregressive priors for modeling long-range dependencies in the latent space of flow models. Our mAR prior considerably improves the accuracy of flow-based models with split coupling layers. Our experiments show that our mAR-SCF model not only improves density estimation (in terms of bits/dim), but also considerably improves the sample quality of the generated images compared to previous state-of-the-art exact inference models. We believe the combination of complex priors with flow-based models, as demonstrated by our mAR-SCF model, provides a path toward efficient exact inference models that approach the fidelity of GAN-based approaches.



Acknowledgement. SM and SR acknowledge support by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1.

References

[1] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In ICML, pages 573–582, 2019.
[2] Christoph Bregler and Stephen M. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS, pages 973–980, 1994.
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[4] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In NeurIPS, pages 9913–9923, 2019.
[5] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In ICLR, 2017.
[6] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, pages 864–872, 2018.
[7] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. The challenge of realistic music generation: Modelling raw audio at scale. In NeurIPS, pages 7989–7999, 2018.
[8] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In ICLR Workshop, 2015.
[9] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
[10] Justin Domke, Alap Karapurkar, and Yiannis Aloimonos. Who killed the directed model? In CVPR, 2008.
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[12] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
[14] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In ICML, pages 2722–2730, 2019.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. In ICML, pages 2771–2780, 2019.
[17] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, pages 2410–2419, 2018.
[18] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, pages 10215–10224, 2018.
[19] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NIPS, pages 4743–4751, 2016.
[20] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[22] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[23] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In ICLR, 2019.
[24] Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. In ICML, pages 3936–3945, 2018.
[25] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In NeurIPS, pages 2338–2347, 2017.
[26] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In ICML, pages 4055–4064, 2018.
[27] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[28] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019.
[29] Scott E. Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. In ICML, pages 2912–2921, 2017.
[30] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[32] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pages 2234–2242, 2016.
[33] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
[34] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In NeurIPS, pages 5617–5627, 2017.
[35] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016.
[36] Jakub M. Tomczak and Max Welling. VAE with a VampPrior. In AISTATS, pages 1214–1223, 2018.
[37] Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with PixelCNN decoders. In NIPS, pages 4790–4798, 2016.
[38] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, pages 1747–1756, 2016.
[39] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, pages 3918–3926, 2018.
[40] Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In ICLR, 2018.
[41] Zachary M. Ziegler and Alexander M. Rush. Latent normalizing flows for discrete sequences. In ICML, pages 7673–7682, 2019.


Appendices

We provide additional details of Lemma 4.1 in the main paper, additional details of our mAR-SCF model architecture, as well as additional results and qualitative examples.

A. Channel Dimensions in mAR-SCF

We begin by providing additional details of Lemma 4.1 in the main paper. We formally prove the claim that the number of channels at the last layer n (which does not have a SPLIT operation) is C_n = 2^{n+1} · C for an image of size [C, N, N]. Thereby, we also show that the number of channels at any layer i (with a SPLIT operation) is C_i = 2^i · C.

Note that the number of time-steps required for sampling from our mAR-SCF flow model with MARPS (Algorithm 1) depends on the number of channels.

Lemma A.1. Let the size of the sampled image be [C, N, N]. The number of channels at the last layer h_n is C_n = 2^{n+1} · C.

Proof. The shape of the image to be sampled is [C, N, N]. Since the network is invertible, the size of the image at level h_0 is the size of the sampled image, i.e. [C, N, N].

First, let the number of layers n be 1. During the forward pass, the network applies a SQUEEZE operation, which reshapes the image to [4C, N/2, N/2], followed by STEPOFFLOW, which does not modify the shape of the input, i.e. the shape remains [4C, N/2, N/2]. Thus the number of channels at the last layer h_1 is C_1 = 4 · C = 2^{1+1} · C when n = 1.

Next, let us assume that the number of channels at the last layer h_{k−1} for a flow with k − 1 layers is

C_{k−1} = 2^k · C.   (12)

We aim to prove by induction that the number of channels at the last layer h_k for a flow with k layers is then

C_k = 2^{k+1} · C.   (13)

To that end, we note that the number of channels of the output at layer k − 1 after SQUEEZE and STEPOFFLOW is, by the assumption of Eq. (12), C_{k−1} = 2^k · C. For a flow with k layers, the (k−1)-st layer has a SPLIT operation resulting in {l_{k−1}, r_{k−1}}, each of size [2^{k−1} C, N/2^{k−1}, N/2^{k−1}]. The number of channels of r_{k−1} at layer k − 1 is thus 2^k · C / 2 = 2^{k−1} · C. At layer k, the input with 2^{k−1} · C channels is transformed by SQUEEZE to 2^{k−1} · 4 · C = 2^{(k−1)+2} · C = 2^{k+1} · C channels.

Therefore, by induction, Eq. (13) holds for k = n. Thus the number of channels at the last layer h_n is given as C_n = 2^{n+1} · C.
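The induction can be checked mechanically by tracking channel counts through the forward pass (SQUEEZE multiplies channels by 4, SPLIT halves them, and there is no SPLIT at the last level):

```python
def channel_dims(c, n_levels):
    # Returns the number of channels of h_i (before SPLIT) at each level i.
    dims = []
    ch = c
    for i in range(1, n_levels + 1):
        ch *= 4                 # SQUEEZE: [C, N, N] -> [4C, N/2, N/2]
        dims.append(ch)         # channels of h_i
        if i < n_levels:
            ch //= 2            # SPLIT keeps r_i with half the channels
    return dims

# Lemma A.1: h_i has 2^{i+1} C channels, so h_n has 2^{n+1} C.
for c, n in [(3, 1), (3, 3), (1, 4)]:
    assert channel_dims(c, n) == [2 ** (i + 1) * c for i in range(1, n + 1)]
print(channel_dims(3, 3))  # [12, 24, 48]
```

For C = 3 and n = 3, the last level indeed has 2⁴ · 3 = 48 channels, matching the shape of h_n used in the sampling-time analysis.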

B. mAR-SCF Architecture

In Fig. 6, we show the architecture of our mAR prior in detail for a layer i of mAR-SCF. The network is a convolutional LSTM with three LSTM layers. The input at each time-step is the previous channel of l_i, concatenated with the output of a convolution on r_i. During inference, our mAR prior autoregressively outputs the probability of each channel l_i^j, given by the distribution p_φ(l_i^j | l_i^{j−1}, …, l_i^1, r_i), modeled as N(µ_i^j, σ_i^j). Because of the internal state (memory) of the convolutional LSTM, our mAR prior can learn long-range dependencies.

In Fig. 2 in the main paper, the STEPOFFLOW operation consists of an activation normalization layer and an invertible 1 × 1 convolution layer, followed by split coupling layers (32 layers for affine couplings or 4 layers of MixLogCDF at each level).
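A minimal numerical sketch of one such step and its inverse (with toy stand-ins for the learned actnorm parameters, the 1 × 1 convolution, and the coupling's parameter-prediction network) is:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8

# Toy parameters (stand-ins for the learned ones).
s_act = np.exp(rng.standard_normal(C) * 0.1)   # positive actnorm scales
b_act = rng.standard_normal(C)                 # actnorm biases
Q, _ = np.linalg.qr(rng.standard_normal((C, C)))  # orthogonal 1x1 conv => invertible

def nn(x1):
    # Placeholder for the coupling's parameter-prediction network; it only
    # sees x1, which the coupling leaves unchanged, so inversion is exact.
    scale = np.tanh(x1.mean(axis=0, keepdims=True)) * np.ones((C // 2, H, W))
    shift = x1[:C // 2] * 0.5
    return scale, shift

def step_of_flow(x):
    # ActNorm -> invertible 1x1 convolution -> affine split coupling.
    y = s_act[:, None, None] * x + b_act[:, None, None]
    y = np.einsum('ij,jhw->ihw', Q, y)
    x1, x2 = y[:C // 2], y[C // 2:]
    scale, shift = nn(x1)
    return np.concatenate([x1, x2 * np.exp(scale) + shift])

def step_of_flow_inverse(y):
    x1, x2 = y[:C // 2], y[C // 2:]
    scale, shift = nn(x1)
    x = np.concatenate([x1, (x2 - shift) * np.exp(-scale)])
    x = np.einsum('ij,jhw->ihw', Q.T, x)  # Q orthogonal: inverse is transpose
    return (x - b_act[:, None, None]) / s_act[:, None, None]

x = rng.standard_normal((C, H, W))
assert np.allclose(step_of_flow_inverse(step_of_flow(x)), x)
```

The round-trip assertion verifies the invertibility-by-construction property that the SCF couplings rely on; a MixLogCDF coupling would replace the affine transform with a mixture-of-logistics CDF applied elementwise.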

C. Empirical Analysis of Sampling Speed

In Fig. 7, we analyze the real-world sampling time of our state-of-the-art mAR-SCF model with MixLogCDF couplings using varying image sizes [C, N, N]. In particular, we vary the input spatial resolution N. We report the mean and variance over 1000 runs with a batch size of 8 on an Nvidia V100 GPU with 32 GB memory. Using a smaller batch size of 8 ensures that we always have enough parallel resources for all image sizes. Under these conditions, we see that the sampling time increases linearly with the image size. This is because the number of sampling steps of our mAR-SCF model scales as O(N), cf. Lemma 4.1.

D. Additional Qualitative Examples

We include additional qualitative examples of our generative model trained on the MNIST, CIFAR10, and ImageNet (32 × 32) datasets. In Fig. 9, we see that the visual sample quality of our mAR-SCF model with MixLogCDF couplings is competitive with that of the state-of-the-art Residual Flows [4]. Moreover, note that we achieve better test log-likelihoods (0.88 vs. 1.00 bits/dim).

We additionally provide qualitative examples for ImageNet (32 × 32) in Figure 11. We observe that, in comparison to the state-of-the-art Flow++ [14] and Residual Flows [4], the images generated by our mAR-SCF model are detailed, with competitive visual quality.

Finally, on CIFAR10, we compare with the fully autoregressive PixelCNN model [38] in Figure 10. We see that although the fully autoregressive PixelCNN achieves significantly better test log-likelihoods (3.00 vs. 3.24 bits/dim), our mAR-SCF model achieves better visual sample quality (also shown by the FID and Inception metrics in Table 3 of the main paper).



[Figure 6: Architecture of our mAR prior at layer i. A convolutional LSTM with three layers takes the channels of l_i (concatenated with a convolution of r_i) as input and autoregressively outputs p_φ(l_i^1 | r_i), p_φ(l_i^2 | l_i^1, r_i), …, p_φ(l_i^{C_i} | l_i^{C_i−1}, …, l_i^1, r_i), together modeling p_φ(l_i | r_i).]

l1i<latexit sha1_base64="O9k67+zGctmKvIePEB0ScaSEDDg=">AAADBHicbVJba9RAFJ7EW42XbvVNXwaXQIXtklShBV+KgiwEpEK3LWy2YTI7yY47k4SZiXQd5sEX/4ovPijiqz/CN/+Nk22WXdseCHznO5fvnJNJK0alCoK/jnvj5q3bdzbuevfuP3i42dl6dCzLWmAyxCUrxWmKJGG0IENFFSOnlSCIp4ycpLM3TfzkIxGSlsWRmldkzFFe0IxipCyVbDlP/CzR0xSZM70TmpiRTG3HHKlpmmlskg+9pZNZJxY0n6rnnh8rcq4sOTdJdIZhLClf5GHE9DtjG9RJ1LNszlES2fw8GVzfP+rBNYFoJbCmuvI+mUTPjOcPEBJx7PlHAhUyKwX3lluscs8brCgnMn4Fo52wHVpLOyORsMxM0+FtW7asYlaB2lvo0Jik0w36wcLgVRC2oAtaO0w6f+JJiWtOCoUZknIUBpUaayQUxYxYkVqSCuEZysnIwgLZ4cZ68RMN9C0zgXYZ+xUKLtj1Co24lHOe2sxmVnk51pDXxUa1yvbHmhZVrUiBL4SymkFVwuZFwAkVBCs2twBhQe2sEE+RQFjZd+PZI4SXV74Kjnf74Yv+7vuX3YPX7Tk2wFPwDGyDEOyBAzAAh2AIsPPZ+ep8d364X9xv7k/310Wq67Q1j8F/5v7+BynJ9Ow=</latexit>

lCi�1i

<latexit sha1_base64="gKZN+gxCfKaZkH4Ehnw16jR8KnI=">AAADs3icfVLbitswEHXsXrbuLds+9kU0BBLIBjttaaEvSxdKwFC2sMkuxIkqK3KiRr4gyWVTrT6wr33r31ROvMTNhgobZs6cmTMzTJQzKqTn/WnYzr37Dx4ePXIfP3n67Hnz+MVYZAXHZIQzlvGrCAnCaEpGkkpGrnJOUBIxchmtzsr45Q/CBc3SC7nOyTRBi5TGFCNpIHjc+NWOoVpGSM/Uia9DRmLZCRMkl1GssIbfe7dObJyQ08VSdt12KMm1NOBaw2CGQShosuFhxNQXbQoUMOgZdJEgGBj+Ag4P1w96oCYQ7ARqqjvvp4Zqpd32ECEehm77gqNUxBlP3Nspdtzr0pY0ISL8CIITv2paCdMjESCLdVnh834aMxLULEMNtEFzqMJ8SXVH7eIztS10piE13w04GDJFzLyQmvnwPJOizvI3cOXzsky3JlXrhN7UWV33v0qw2fL63uaBu4ZfGS2reuew+TucZ7hISCoxQ0JMfC+XU4W4pJgR7YaFIDnCK7QgE2OmyOxyqjY3p0HbIHNgdm/+VIINWs9QKBFinUSGWTYt9mMleCg2KWT8YapomheSpHgrFBcMyAyUBwzmlBMs2doYCHNqegV4iTjC0py5a5bg74981xgP+v6b/uDr29bpp2odR9Yr67XVsXzrvXVqDa1za2Rh27PHNrS/Oe+ciRM58y3VblQ5L61/npP8BaYSNp4=</latexit>

Figure 6. Architecture of our multi-scale autoregressive (mAR) prior.
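The channel-wise factorization above can be sketched as ancestral sampling: each channel of l_i is drawn conditioned on the previously sampled channels and on r_i. The snippet below is a minimal numpy illustration of this structure only; `cond_params` is a hypothetical stand-in for the learned conditional network of the mAR prior, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_params(prev_channels, r, c):
    """Hypothetical stand-in for the learned network that predicts the
    mean and scale of channel c from the channels sampled so far and
    the conditioning features r. A real mAR prior uses a trained
    autoregressive network here."""
    ctx = r + sum(prev_channels) if prev_channels else r
    mu = 0.1 * c + 0.01 * ctx          # toy conditional mean
    sigma = np.ones_like(ctx)          # toy conditional scale
    return mu, sigma

def sample_mar_prior(r, n_channels):
    """Sample l ~ p(l | r) channel by channel, i.e.
    p(l | r) = prod_c p(l^c | l^{c-1}, ..., l^1, r)."""
    channels = []
    for c in range(n_channels):
        mu, sigma = cond_params(channels, r, c)
        channels.append(mu + sigma * rng.standard_normal(r.shape))
    return np.stack(channels)          # shape [C, N, N]

r = np.zeros((8, 8))                   # conditioning features at this scale
l = sample_mar_prior(r, n_channels=4)
print(l.shape)                         # (4, 8, 8)
```

Because each channel depends only on earlier channels, the sampling cost grows with the number of channels C rather than with the number of pixels, which is what keeps the overall sampling cost of mAR-SCF at O(N) in the spatial resolution.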

Figure 7. Analysis of real-world sampling time of our mAR-SCF model with varying image size [C, N, N]. For reference, we show the line y = 2x in black.

E. Additional Interpolation Results

Figure 8. Inception scores of random samples generated with our interpolation scheme versus linear interpolation.

Finally, in Fig. 12 we show qualitative examples to compare the effect of using our proposed interpolation method (Eq. 11) versus simple linear interpolation in the multimodal latent space of our mAR prior. We use the Adamax optimizer to optimize Eq. (11) with a learning rate of 5 × 10−2 for ∼ 100 iterations. The computational requirement per iteration is approximately equivalent to a standard training iteration of our mAR-SCF model, i.e. ∼ 1 sec for a batch of size 128. We observe that interpolated images obtained from our mAR-SCF affine model using our proposed interpolation method (Eq. 11) have better visual quality, especially for the interpolations in the middle of the interpolating path. Note that we find our scheme to be stable in practice over a reasonably wide range of the hyperparameters λ2 ≈ λ1 ∈ [0.2, 0.5], as it interpolates in high-density regions between xA and xB. We support the better visual quality of the interpolations of our scheme compared to the linear method with Inception scores computed for interpolations obtained from both methods in Fig. 8.
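The optimization loop described above can be sketched as follows. Since Eq. (11) itself is not reproduced in this appendix, the objective below is a hypothetical stand-in consistent with the description: a quadratic penalty (weight λ1) keeping the interpolant near the linear path, plus a log-density term (weight λ2) pulling it toward high-density regions; the Adamax update follows its standard definition.

```python
import numpy as np

def adamax_step(x, g, m, u, t, lr=5e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adamax update with bias-corrected first moment."""
    m = b1 * m + (1 - b1) * g
    u = np.maximum(b2 * u, np.abs(g))
    x = x - lr * m / ((1 - b1 ** t) * (u + eps))
    return x, m, u

def interpolate(zA, zB, alpha, log_p_grad, lam1=0.3, lam2=0.3, iters=100):
    """Refine the linear interpolant z = (1 - alpha) zA + alpha zB so it
    lies in a high-density region: minimize
        lam1/2 * ||z - z_lin||^2  -  lam2 * log p(z)
    (a hypothetical surrogate for Eq. 11 of the paper)."""
    z_lin = (1 - alpha) * zA + alpha * zB
    z = z_lin.copy()
    m, u = np.zeros_like(z), np.zeros_like(z)
    for t in range(1, iters + 1):
        grad = lam1 * (z - z_lin) - lam2 * log_p_grad(z)  # gradient of the loss
        z, m, u = adamax_step(z, grad, m, u, t)
    return z

# Toy prior: standard Gaussian, so grad log p(z) = -z. The solution is
# pulled from the linear interpolant toward the prior mode at 0.
zA, zB = np.full(16, -2.0), np.full(16, 2.0)
z_mid = interpolate(zA, zB, alpha=0.25, log_p_grad=lambda z: -z)
```

For this toy quadratic objective the minimizer is z = λ1/(λ1 + λ2) · z_lin, i.e. the refined point sits between the linear interpolant and the density mode; with a learned mAR prior, `log_p_grad` would instead come from backpropagating through the model's log-likelihood.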


arXiv:2004.03891v1 [cs.LG] 8 Apr 2020

(a) Residual Flows [4] (b) Our mAR-SCF (MixLogCDF)

Figure 9. Random samples when trained on MNIST.

(a) PixelCNN [38] (b) Our mAR-SCF (MixLogCDF)

Figure 10. Random samples when trained on CIFAR-10 (32 × 32).


(a) Real Data (b) Flow++ [14]

(c) Residual Flows [4] (d) Our mAR-SCF (MixLogCDF)

Figure 11. Random samples when trained on ImageNet (32× 32).


Figure 12. Comparison of interpolation quality of our interpolation scheme (Eq. 11) with a standard linear interpolation scheme. The top row of each result shows the interpolations between real images for the linear method, and the corresponding bottom rows are the interpolations with our proposed scheme.
