Publication | Closed Access
Variational Information Distillation for Knowledge Transfer
Citations: 639
References: 31
Year: 2019
Unknown Venue
Artificial Intelligence, Teacher Neural Network, Convolutional Neural Network, Engineering, Machine Learning, Knowledge Transfer, Data Science, Entropy, Knowledge Distillation, Computer Engineering, Variational Information Distillation, Knowledge Transfer Methods, Multi-task Learning, Computer Science, Transfer Learning, Deep Learning, Neural Architecture Search, Mixture Of Expert
Knowledge transfer from a pretrained teacher network can markedly improve a student network's performance, yet existing methods typically match the teacher's and student's activations or hand-crafted features derived from them. The authors propose an information-theoretic framework that formulates knowledge transfer as maximizing the mutual information between the teacher and student networks. They evaluate the approach on knowledge distillation and transfer-learning tasks, where it consistently outperforms existing methods, and apply it to transfer knowledge from a CNN to an MLP on CIFAR-10. The resulting MLP significantly outperforms state-of-the-art methods and performs similarly to a CNN with a single convolutional layer.
Transferring knowledge from a teacher neural network pretrained on the same or a similar task to a student neural network can significantly improve the performance of the student neural network. Existing knowledge transfer approaches match the activations or the corresponding hand-crafted features of the teacher and the student networks. We propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks. We compare our method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that our method consistently outperforms existing methods. We further demonstrate the strength of our method on knowledge transfer across heterogeneous network architectures by transferring knowledge from a convolutional neural network (CNN) to a multi-layer perceptron (MLP) on CIFAR-10. The resulting MLP significantly outperforms state-of-the-art methods and achieves performance similar to that of a CNN with a single convolutional layer.
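The abstract does not say how the mutual information is maximized in practice. One standard route, assumed here rather than taken from this page, is the variational lower bound I(t; s) ≥ H(t) + E[log q(t | s)], where t and s are teacher and student features and q is a tractable approximation to p(t | s). If q is taken to be a Gaussian with a student-predicted mean and a learned per-channel variance (an illustrative choice), maximizing the bound reduces to minimizing a negative log-likelihood. The function name `gaussian_vid_loss` and its parameters are hypothetical.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)); used to keep the variance positive.
    return np.logaddexp(0.0, x)


def gaussian_vid_loss(teacher_feat, student_pred, alpha, eps=1e-6):
    """Negative Gaussian log-likelihood of teacher features given the student.

    Assumed setup (not confirmed by the abstract):
      teacher_feat: (batch, channels) teacher activations t
      student_pred: (batch, channels) mean mu(s) predicted from student features
      alpha:        (channels,) unconstrained parameters, variance = softplus(alpha) + eps
    The additive constant 0.5 * log(2 * pi) is dropped.
    """
    var = softplus(alpha) + eps
    sq_err = (teacher_feat - student_pred) ** 2
    nll = 0.5 * np.log(var) + sq_err / (2.0 * var)
    # Average over batch and channels; lower loss means a tighter bound on I(t; s).
    return nll.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.normal(size=(8, 16))
    noisy_pred = t + 0.5 * rng.normal(size=t.shape)
    alpha = np.zeros(16)
    print("loss with noisy prediction:", gaussian_vid_loss(t, noisy_pred, alpha))
    print("loss with exact prediction:", gaussian_vid_loss(t, t, alpha))
```

In a training loop, a term like this would typically be added to the student's task loss with a weighting coefficient, and `alpha` would be optimized together with the student's weights. Both choices are assumptions of this sketch.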