Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Abstract

We construct an unsupervised learning model that achieves nonlinear disentanglement of underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they will transfer to natural scenes. Here we provide evidence that objects in segmented natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for unsupervised representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several established benchmark datasets, often surpassing the current state-of-the-art. We additionally demonstrate transferability towards video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.
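
To make "typically small in magnitude with occasional large jumps" concrete, the following is a minimal sketch, not the SlowVAE implementation. It assumes a Laplace distribution as one example of a sparse transition distribution and compares it with a Gaussian of equal variance by measuring the excess kurtosis of the frame-to-frame changes. The function names `simulate_factor` and `excess_kurtosis`, the step count, and the scale are all hypothetical choices made for illustration.

```python
import numpy as np


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0


def simulate_factor(num_steps, scale, sparse, seed=0):
    """Random-walk trajectory of one hypothetical generative factor.

    sparse=True  -> Laplace-distributed steps (assumed sparse example)
    sparse=False -> Gaussian steps with the same variance
    """
    rng = np.random.default_rng(seed)
    if sparse:
        steps = rng.laplace(loc=0.0, scale=scale, size=num_steps)
    else:
        # A Laplace distribution with scale b has variance 2 * b**2,
        # so the matching Gaussian standard deviation is b * sqrt(2).
        steps = rng.normal(loc=0.0, scale=scale * np.sqrt(2.0), size=num_steps)
    return np.cumsum(steps)


# Illustrative settings only; these values do not come from the paper.
sparse_traj = simulate_factor(200_000, scale=1.0, sparse=True)
dense_traj = simulate_factor(200_000, scale=1.0, sparse=False)

# Temporal differences between adjacent time steps.
sparse_kurt = excess_kurtosis(np.diff(sparse_traj))
dense_kurt = excess_kurtosis(np.diff(dense_traj))

print(f"excess kurtosis, Laplace steps:  {sparse_kurt:.2f}")
print(f"excess kurtosis, Gaussian steps: {dense_kurt:.2f}")
```

Under these assumptions, the Laplace transitions show clearly positive excess kurtosis (heavy tails, i.e. rare large jumps), while the Gaussian transitions stay near zero. A similar statistic computed on real object trajectories is one plausible way to check for temporal sparsity in data.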