Computational visual neuroscience has come a long way in the past ten years. For the first time, we have fully explicit, image-computable models that can recognise objects with near-human accuracy and predict brain activity in high-level visual regions. I will present evidence that diverse deep neural network architectures all predict brain representations well, and that task-training and subsequent reweighting of model features are critical to this high performance. However, vision is not yet explained. The most successful models are deep neural networks that have been supervised using ground-truth labels for millions of images. Brains have no such access to the ground truth and must instead learn directly from sensory data. Unsupervised deep learning, in which networks learn statistical regularities in their data by compressing, extrapolating or predicting images and videos, is an ecologically feasible alternative. I will show that an unsupervised deep network, trained on an environment of 3D-rendered surfaces with varying shape, material and illumination, spontaneously comes to encode those factors in its internal representations. Most strikingly, the network makes patterns of errors in its perception of material that follow, on an image-by-image basis, the patterns of errors made by human observers. Unsupervised deep learning may provide a coherent framework for how our perceptual dimensions arise.