We found that our model can learn to generate masks for the parts of a scene that a VAE
prefers to reconstruct in distinct components, and that these parts correspond to semantically
meaningful parts of scene images (such as walls, floors, and individual objects). Furthermore,
MONet learned disentangled representations of the scene elements, with latents corresponding
to distinct interpretable features. Interestingly, MONet also learned to complete
partially occluded scene elements in the VAE's reconstructions, in a purely unsupervised
fashion, as this provided a more parsimonious representation of the data. This confirms
that our approach is able to distill the structure inherent in the dataset it is trained on.
Furthermore, we found that MONet trained on a dataset with 1-3 objects could readily be
applied to scenes with fewer or more objects, or to novel scene configurations, and still
performed accurately. This robustness to the complexity of a scene, at least in the number
of objects it can contain, is very promising for leveraging the learned representations in
downstream tasks and reinforcement learning agents.
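The recursive decomposition that produces these per-component masks can be sketched compactly. Below is a minimal, illustrative sketch (not the paper's implementation; the function name and use of NumPy are our own assumptions): at each attention step, a value in [0, 1] claims a fraction of the still-unexplained "scope" of the image, and the final component absorbs whatever scope remains, so the masks sum to one at every pixel by construction.

```python
import numpy as np

def decompose_masks(alphas):
    """Hypothetical sketch of recursive scope-based mask decomposition.

    `alphas` holds per-pixel attention values in [0, 1] for all but the
    last step. The running `scope` tracks the portion of the image not
    yet explained; the final slot absorbs the remaining scope, so the
    returned masks sum to one at every pixel.
    """
    scope = np.ones_like(alphas[0])
    masks = []
    for alpha in alphas:
        masks.append(scope * alpha)      # portion claimed at this step
        scope = scope * (1.0 - alpha)    # portion still unexplained
    masks.append(scope)                  # last component takes the rest
    return masks
```

Because each mask is carved out of the remaining scope, a model trained with K steps can in principle leave later slots nearly empty on simpler scenes, which is consistent with the robustness to object count noted above.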
Although we believe this work provides an exciting step for unsupervised scene
decomposition, there is still much work to be done. For example, we have not tackled datasets
with increased visual complexity, such as natural images or large images with many objects. As
the entire scene has to be 'explained' and decomposed by MONet, such complexity may pose
a challenge and warrant further model development. In that vein, it would be interesting to
support partial decomposition of a scene, e.g. representing only some of the objects, or
reconstructing different regions at different levels of precision.
We also have not explored decomposition of videos, though in principle we expect temporal
continuity to amplify the benefits of decomposition, which may make MONet more robust
on video data. Finally, although we showed promising results for the disentangling properties
of our representations in Figure 5, more work should be done to assess them fully, especially
in the context of reusing latent dimensions between semantically different components.
Acknowledgments
We thank Ari Morcos, Daniel Zoran and Luis Piloto for helpful discussions and insights.