When deep neural networks (DNNs) are trained with ground-truth labels, they can match or exceed human performance in many visual tasks. However, nothing like these massive labelled training sets is available during human visual development. I will present two projects in which we use unsupervised deep learning as a framework for understanding how brains learn rich scene representations without ground-truth information about the world. In the first project, I train an unsupervised autoregressive PixelVAE network to generate new images of rendered surfaces. In the second, I train an unsupervised recurrent PredNet network to predict the next frame in videos of moving objects. In both cases, the networks spontaneously learn to cluster images according to underlying scene properties such as illumination, shape, and material. Material properties decoded from the models predict human perception and misperception on an image-by-image basis, whereas a supervised DNN fails to predict human patterns of (mis)perception. Unsupervised DNNs are therefore exciting models of how brains might learn about the physical world from sensory data alone.
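
To make the first training setup concrete, here is a minimal sketch of the kind of unsupervised objective involved. The abstract names PixelVAE (a VAE with an autoregressive PixelCNN decoder); for brevity, the sketch below is a plain convolutional VAE trained on the same reconstruction-plus-KL objective family, not the full PixelVAE architecture, and all shapes, hyperparameters, and data are illustrative assumptions.

```python
# Simplified stand-in for the PixelVAE setup: a convolutional VAE trained
# only to reconstruct images, with no ground-truth scene labels anywhere.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    def __init__(self, channels=3, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(channels, 32, 4, 2, 1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),         # 32x32 -> 16x16
            nn.Flatten(),
        )
        self.mu = nn.Linear(64 * 16 * 16, z_dim)
        self.logvar = nn.Linear(64 * 16 * 16, z_dim)
        self.fc = nn.Linear(z_dim, 64 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        recon = self.dec(self.fc(z).view(-1, 64, 16, 16))
        return recon, mu, logvar

model = ConvVAE()
x = torch.rand(8, 3, 64, 64)   # toy batch standing in for rendered surfaces
recon, mu, logvar = model(x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(recon, x) + 1e-3 * kl   # reconstruction + KL (ELBO-style)
loss.backward()
```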
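
The second project's training signal can be sketched the same way. The block below is not the PredNet architecture (which uses stacked ConvLSTM layers propagating layer-wise prediction errors); it is a simplified recurrent predictor included only to show the key point: the network's sole "label" for frame t is frame t+1 itself. The model, shapes, and data are illustrative assumptions.

```python
# Simplified stand-in for next-frame prediction training a la PredNet:
# encode each frame, update a recurrent state, decode a prediction of the
# NEXT frame, and penalise the prediction error.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, channels=1, feat=64):
        super().__init__()
        self.feat = feat
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, 2, 1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, feat, 3, 2, 1), nn.ReLU(),       # 32x32 -> 16x16
        )
        self.rnn = nn.GRUCell(feat * 16 * 16, 512)          # recurrent state
        self.decoder = nn.Linear(512, feat * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(feat, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        h = torch.zeros(B, 512, device=frames.device)
        preds = []
        for t in range(T - 1):
            z = self.encoder(frames[:, t]).flatten(1)  # encode current frame
            h = self.rnn(z, h)                         # update memory
            d = self.decoder(h).view(B, self.feat, 16, 16)
            preds.append(self.deconv(d))               # predict next frame
        return torch.stack(preds, dim=1)               # (B, T-1, C, H, W)

model = NextFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(8, 10, 1, 64, 64)   # toy batch: 8 clips of 10 frames
opt.zero_grad()
pred = model(frames)
loss = nn.functional.mse_loss(pred, frames[:, 1:])  # target = the next frame
loss.backward()
opt.step()
```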
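
Finally, a sketch of the decoding analysis behind the image-by-image comparison: fit a linear readout from a network's latent representation to a scene property (gloss is used here as an example), then ask whether the readout's held-out predictions track human judgments. Every array below (latents, ground-truth gloss, human responses) is synthetic stand-in data, and the linear readout is an assumption; the abstract does not specify the actual decoding method.

```python
# Hypothetical decoding analysis on synthetic data: does a linear readout of
# the latent representation predict human gloss judgments image by image?
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
true_gloss = rng.uniform(size=500)                   # renderer parameter
w = rng.normal(size=128)
latents = np.outer(true_gloss, w) + rng.normal(scale=0.5, size=(500, 128))
human_gloss = true_gloss + rng.normal(scale=0.1, size=500)  # noisy observers

# Fit the readout on half the images, evaluate on the held-out half.
train, test = slice(0, 250), slice(250, 500)
readout = LinearRegression().fit(latents[train], true_gloss[train])
decoded = readout.predict(latents[test])

# Key comparison: do decoded values track ground truth and human judgments?
r_truth, _ = pearsonr(decoded, true_gloss[test])
r_human, _ = pearsonr(decoded, human_gloss[test])
print(f"decoded vs ground truth:    r = {r_truth:.2f}")
print(f"decoded vs human judgments: r = {r_human:.2f}")
```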