Ensemble Knowledge Distillation for Learning Improved and Efficient Networks

Abstract

Ensemble models comprising deep Convolutional Neural Networks (CNNs) have shown significant improvements in model generalization, but at the cost of large computation and memory requirements. In this paper, we present a framework for learning compact CNN models with improved classification performance and model generalization. To this end, we propose a compact student CNN architecture with parallel branches, which are trained using ground-truth labels and information from high-capacity teacher networks in an ensemble learning fashion. Our framework provides two main benefits: i) Distilling knowledge from different teachers into the student network promotes heterogeneity in feature learning at different branches of the student network and enables the network to learn diverse solutions to the target problem. ii) Coupling the branches of the student network through ensembling encourages collaboration and improves the quality of the final predictions by reducing variance in the network outputs. Experiments on the well-established CIFAR-10 and CIFAR-100 datasets show that our Ensemble Knowledge Distillation (EKD) improves classification accuracy and model generalization, especially in situations with limited training data. Experiments also show that our EKD-based compact networks achieve higher mean accuracy on the test datasets than state-of-the-art knowledge-distillation-based methods.
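
The training signal described above can be illustrated with a minimal sketch. The abstract does not specify the exact objective, so everything below is an assumption made for illustration: each student branch is paired with one teacher, the ensemble output is taken as the mean of the branch logits, and the distillation term is the common temperature-scaled KL divergence. The names `ekd_style_loss`, `temperature`, and `alpha` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of a single example against an integer class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ekd_style_loss(branch_logits, teacher_logits, label,
                   temperature=2.0, alpha=0.5):
    """Illustrative loss for a multi-branch student (assumed form, not the paper's).

    branch_logits:  one logit list per student branch
    teacher_logits: one logit list per teacher, paired with the branches
    """
    assert len(branch_logits) == len(teacher_logits)
    num_classes = len(branch_logits[0])

    # Assumption: the ensemble prediction is the mean of the branch logits.
    ensemble = [
        sum(branch[c] for branch in branch_logits) / len(branch_logits)
        for c in range(num_classes)
    ]
    loss = cross_entropy(ensemble, label)

    for student, teacher in zip(branch_logits, teacher_logits):
        p_teacher = softmax(teacher, temperature)
        p_student = softmax(student, temperature)
        # Ground-truth supervision for this branch.
        loss += (1.0 - alpha) * cross_entropy(student, label)
        # Distillation from the paired teacher, scaled by T^2 as is customary.
        loss += alpha * temperature ** 2 * kl_divergence(p_teacher, p_student)
    return loss


if __name__ == "__main__":
    branches = [[2.0, 0.5, -1.0], [1.5, 0.0, -0.5]]
    teachers = [[3.0, 0.0, -2.0], [2.5, -0.5, -1.0]]
    print(ekd_style_loss(branches, teachers, label=0))
```

In this sketch, the per-branch distillation terms are what would push each branch toward a different teacher, while the cross-entropy on the averaged logits couples the branches; how the paper actually weights and combines these terms may differ.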