Perceiving the glossiness of a surface is a challenging visual inference that requires disentangling the contributions of reflectance, lighting, and shape to the retinal image. How do our visual systems develop this ability? We suggest that brains learn to infer distal properties, like gloss, by learning to model the structure in proximal images. To test this, we trained unsupervised generative neural networks on renderings of glossy surfaces and compared their representations with human gloss judgments. The networks spontaneously cluster images according to distal properties such as specular reflectance, shape, and illumination, despite receiving no explicit information about them. Linearly decoding specular reflectance from the model’s internal code predicts human perception of gloss better than the ground-truth reflectance, supervised networks, or control models do, and predicts, on an image-by-image basis, illusions of gloss perception caused by interactions between material, shape, and lighting. Unsupervised learning may underlie many perceptual dimensions in vision, and beyond.
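The linear decoding step mentioned above can be sketched in a few lines. The code below is an illustrative stand-in, not the paper's pipeline: the latent codes and the reflectance targets are synthetic placeholders, and the readout is an ordinary least-squares fit from latent dimensions to a scalar property.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 rendered images, each summarized by a
# 10-dimensional latent code from an unsupervised generative model.
# Both the codes and the reflectance values here are synthetic.
n_images, n_latents = 500, 10
latents = rng.normal(size=(n_images, n_latents))

# Synthetic "ground-truth" specular reflectance: a linear mixture of
# latent dimensions plus noise, standing in for the rendering parameter.
true_reflectance = latents @ rng.normal(size=n_latents)
true_reflectance += 0.1 * rng.normal(size=n_images)

# Linear decoding: least-squares readout of reflectance from the code.
X = np.column_stack([latents, np.ones(n_images)])  # add intercept column
weights, *_ = np.linalg.lstsq(X, true_reflectance, rcond=None)
decoded = X @ weights

# Quantify decoding quality with a Pearson correlation; in the study,
# decoded values would instead be compared against human gloss judgments.
r = np.corrcoef(decoded, true_reflectance)[0, 1]
print(f"decoding correlation: {r:.3f}")
```

Because the synthetic target is (noisily) linear in the latent code, the readout recovers it almost perfectly; with real model codes and human judgments, the correlation quantifies how much gloss-relevant structure the unsupervised representation carries.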