Video Autoencoder: Self-Supervised Disentanglement of 3D Structure and Motion