
RePr: Improved Training of Convolutional Filters

Aaditya Prakash∗
Brandeis University
[email protected]

James Storer
Brandeis University
[email protected]

Dinei Florencio, Cha Zhang
Microsoft Research
{dinei,chazhang}@microsoft.com

    Abstract

A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context and introduce inter-filter orthogonality as the ranking criterion to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves performance across a variety of tasks, especially when applied to smaller networks.

    1. Introduction

Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Much of this success is due to innovations in novel, task-specific network architectures [3, 4]. Despite variation in network design, the same core optimization techniques are used across tasks. These techniques consider each individual weight as its own entity and update it independently. Limited progress has been made towards developing a training process specifically designed for convolutional networks, in which filters are the fundamental unit of the network. A filter is not a single weight parameter but a stack of spatial kernels.
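As a concrete illustration (a minimal PyTorch snippet with arbitrary channel sizes, not taken from the paper), a single filter in a convolutional layer is an entire stack of spatial kernels rather than one scalar weight:

```python
import torch.nn as nn

# Hypothetical layer chosen only to illustrate the shapes involved.
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

# The weight tensor has shape [out_channels, in_channels, kH, kW].
print(conv.weight.shape)   # torch.Size([32, 16, 3, 3])

# One "filter" is a full slice along the output dimension: a stack of
# 16 spatial 3x3 kernels that together produce a single output channel.
one_filter = conv.weight[0]
print(one_filter.shape)    # torch.Size([16, 3, 3])
```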

∗ Part of the research was done while the author was an intern at MSR.

[Figure 1 shows two panels of accuracy (0.5 to 1.0) against training epoch (0 to 100): the left panel plots Train - RePr vs. Train - Standard with annotations A and B; the right panel plots Test - RePr vs. Test - Standard with annotations C, D, E, and F.]

Figure 1: Performance of a three-layer ConvNet with 32 filters each over 100 epochs using the standard scheme and our method, RePr, on CIFAR-10. The shaded regions denote periods when only part of the network is trained. Left: Training Accuracy, Right: Test Accuracy. Annotations [A-F] are discussed in Section 4.

Because models are typically over-parameterized, a trained convolutional network will contain redundant filters [5, 6]. This is evident from the common practice of pruning filters [7, 8, 6, 9, 10, 11], rather than individual parameters [12], to achieve model compression. Most of these pruning methods are able to drop a significant number of filters with only a marginal loss in the performance of the model. However, a model with fewer filters cannot be trained from scratch to achieve the performance of a large model that has been pruned to be roughly the same size [6, 11, 13]. Standard training procedures tend to learn models with extraneous and prunable filters, even for architectures without any excess capacity. This suggests that there is room for improvement in the training of Convolutional Neural Networks (ConvNets).
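For context on what filter-level pruning (as opposed to parameter-level pruning) looks like in practice, the short sketch below ranks a layer's filters by the L1 norm of their weights, a common data-free criterion from the pruning literature; this is a generic illustration, not the criterion proposed in this paper:

```python
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Return filter indices sorted from least to most important,
    scored by the L1 norm of each filter's weights (a generic criterion)."""
    # weight shape: [out_channels, in_channels, kH, kW]
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)  # smallest-norm filters first

# Example: mark the lowest-scoring 30% of filters as pruning candidates.
conv = nn.Conv2d(16, 32, kernel_size=3)
order = rank_filters_by_l1(conv)
num_prune = int(0.3 * conv.out_channels)
prune_candidates = order[:num_prune]
```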

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This
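A minimal sketch of how such a drop-and-reintroduce cycle could be organized is given below. This is PyTorch-style code under my own assumptions, not the authors' reference implementation: the zeroing of filter weights as a stand-in for "dropping", the `train_for` and `make_optimizer` callables, the hyperparameter names (`s1`, `s2`, `drop_ratio`, `cycles`), and the `filter_orthogonality_scores` helper (one plausible reading of the inter-filter orthogonality criterion named in the abstract) are all illustrative.

```python
import torch
import torch.nn as nn

def filter_orthogonality_scores(conv: nn.Conv2d) -> torch.Tensor:
    """Per-filter overlap score: how far each normalized, flattened filter is
    from being orthogonal to the other filters in the same layer.
    Higher score = more overlap = better candidate to drop (assumed form)."""
    W = conv.weight.detach().flatten(1)               # [out_channels, in*kH*kW]
    W = W / (W.norm(dim=1, keepdim=True) + 1e-8)      # unit-normalize each filter
    gram = W @ W.t()                                  # pairwise cosine similarities
    off_diag = gram - torch.eye(W.size(0), device=W.device)
    return off_diag.abs().sum(dim=1)

def repr_style_training(model, train_for, make_optimizer,
                        drop_ratio=0.3, s1=20, s2=10, cycles=3):
    """Illustrative schedule: S1 epochs of full training, mask out the most
    overlapping filters, S2 epochs on the reduced network, then re-initialize
    the dropped filters with fresh weights and repeat."""
    for _ in range(cycles):
        train_for(model, s1, make_optimizer(model))          # standard training

        dropped = {}                                         # layer name -> indices
        for name, m in model.named_modules():
            if isinstance(m, nn.Conv2d):
                scores = filter_orthogonality_scores(m)
                k = int(drop_ratio * m.out_channels)
                idx = torch.topk(scores, k).indices
                dropped[name] = idx
                m.weight.data[idx] = 0.0                     # temporarily "prune"

        # Train the reduced network. (A faithful version would also mask the
        # gradients of the dropped filters so they stay at zero during S2.)
        train_for(model, s2, make_optimizer(model))

        for name, m in model.named_modules():
            if isinstance(m, nn.Conv2d):
                idx = dropped[name]
                fresh = torch.empty_like(m.weight.data[idx])
                nn.init.kaiming_normal_(fresh)               # new weights
                m.weight.data[idx] = fresh                   # reintroduce filters

    train_for(model, s1, make_optimizer(model))              # final standard run
```

The sketch only fixes the overall schedule; how filters are ranked, how gradients of dropped filters are handled, and how reintroduced weights are initialized are the design choices the rest of the paper specifies.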


    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydropped. After additional training of the reduced network,we reintroduce the previously dropped filters, initializedwith new weights, and continue standard training. We ob-serve that following the reintroduction of the dropped fil-ters, the model is able to achieve higher performance thanwas obtained before the drop. Repeated application of thisprocess obtains models which outperform those obtained bystandard training as seen in Figure 1 and discussed in Sec-tion 4. We observe this improvement across various tasksand over various types of convolutional networks. This

    1

    arX

    iv:1

    811.

    0727

    5v3

    [cs

    .CV

    ] 2

    5 Fe

    b 20

    19RePr: Improved Training of Convolutional Filters

    Aaditya Prakash∗

    Brandeis [email protected]

    James StorerBrandeis [email protected]

    Dinei Florencio, Cha ZhangMicrosoft Research

    {dinei,chazhang}@microsoft.com

    Abstract

    A well-trained Convolutional Neural Network can easilybe pruned without significant loss of performance. This isbecause of unnecessary overlap in the features captured bythe network’s filters. Innovations in network architecturesuch as skip/dense connections and Inception units havemitigated this problem to some extent, but these improve-ments come with increased computation and memory re-quirements at run-time. We attempt to address this problemfrom another angle - not by changing the network structurebut by altering the training method. We show that by tem-porarily pruning and then restoring a subset of the model’sfilters, and repeating this process cyclically, overlap in thelearned features is reduced, producing improved general-ization. We show that the existing model-pruning criteriaare not optimal for selecting filters to prune in this con-text and introduce inter-filter orthogonality as the rankingcriteria to determine under-expressive filters. Our methodis applicable both to vanilla convolutional networks andmore complex modern architectures, and improves the per-formance across a variety of tasks, especially when appliedto smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Muchof this success is due to innovations of a novel, task-specific network architectures [3, 4]. Despite variation innetwork design, the same core optimization techniques areused across tasks. These techniques consider each indi-vidual weight as its own entity and update them indepen-dently. Limited progress has been made towards developinga training process specifically designed for convolutionalnetworks, in which filters are the fundamental unit of thenetwork. A filter is not a single weight parameter but a stackof spatial kernels.

    Because models are typically over-parameterized, a

    ∗Part of the research was done while the author was an intern at MSR

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0

    Accu

    racy

    Train - RePrTrain - Standard

    AB

    0 20 40 60 80 100Epoch

    0.5

    0.6

    0.7

    0.8

    0.9

    1.0Test - RePrTest - Standard

    C

    D

    EF

    Figure 1: Performance of a three layer ConvNet with 32 filterseach over 100 epochs using standard scheme and our method -RePr on CIFAR-10. The shaded regions denote periods when onlypart of the network is trained. Left: Training Accuracy, Right: TestAccuracy. Annotations [A-F] are discussed in Section 4.

    trained convolutional network will contain redundant fil-ters [5, 6]. This is evident from the common practice ofpruning filters [7, 8, 6, 9, 10, 11], rather than individualparameters [12], to achieve model compression. Most ofthese pruning methods are able to drop a significant num-ber of filters with only a marginal loss in the performanceof the model. However, a model with fewer filters can-not be trained from scratch to achieve the performance ofa large model that has been pruned to be roughly the samesize [6, 11, 13]. Standard training procedures tend to learnmodels with extraneous and prunable filters, even for ar-chitectures without any excess capacity. This suggests thatthere is room for improvement in the training of Convolu-tional Neural Networks (ConvNets).

    To this end, we propose a training scheme in which,after some number of iterations of standard training, weselect a subset of the model’s filters to be temporarilydroppe