Below is the list of papers accepted to our workshop. Reviews and meta-reviews of accepted papers are available on our OpenReview venue.

Efficient and Interpretable Robot Manipulation with Graph Neural Networks ORAL
Yixin Lin, Austin S Wang, Eric Undersander, Akshara Rai

Manipulation tasks like loading a dishwasher can be seen as a sequence of spatial constraints and relationships between different objects. We aim to discover these rules from demonstrations by posing manipulation as a classification problem over a graph, whose nodes represent task-relevant entities like objects and goals. In our experiments, a single GNN policy trained using imitation learning (IL) on 20 expert demonstrations can solve block-stacking and rearrangement tasks both in simulation and on hardware, generalizing over the number of objects and goal configurations. These experiments show that graphical IL can solve complex long-horizon manipulation problems without requiring detailed task descriptions.
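As an illustrative sketch only (not the authors' implementation), the graph-classification view can be mocked up with a tiny message-passing network that scores each node of an object/goal graph; all features, weights, and sizes below are toy assumptions:

```python
# Toy sketch: manipulation as node classification on a graph whose
# nodes are task-relevant entities (objects, goals). Not the paper's code.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gnn_node_scores(node_feats, adj, w_msg, w_upd, w_out, rounds=2):
    """Return one scalar logit per node: which entity to act on next."""
    h = node_feats
    for _ in range(rounds):
        msgs = adj @ (h @ w_msg)      # aggregate messages from neighbors
        h = relu(h @ w_upd + msgs)    # update node embeddings
    return (h @ w_out).ravel()

rng = np.random.default_rng(0)
n_nodes, d = 4, 3                     # e.g. 3 blocks + 1 goal marker (toy)
feats = rng.normal(size=(n_nodes, d))
adj = np.ones((n_nodes, n_nodes)) - np.eye(n_nodes)  # fully connected
scores = gnn_node_scores(feats, adj,
                         rng.normal(size=(d, d)),
                         rng.normal(size=(d, d)),
                         rng.normal(size=(d, 1)))
action_node = int(np.argmax(scores))  # node chosen by the (untrained) policy
```

Because the network operates on a graph rather than a fixed-size state vector, the same weights apply to scenes with any number of objects, which is the property the abstract's generalization claim rests on.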
Vision-based system identification and 3D keypoint discovery using dynamics constraints ORAL
Miguel Jaques, Martin Asenov, Michael Burke, Timothy Hospedales

This paper introduces V-SysId, a novel method that enables simultaneous keypoint discovery, 3D system identification, and extrinsic camera calibration from an unlabeled video taken from a static camera, using only the family of equations of motion of the object of interest as weak supervision. V-SysId takes keypoint trajectory proposals and alternates between maximum likelihood parameter estimation and extrinsic camera calibration, before applying a suitable selection criterion to identify the track of interest. This is then used to train a keypoint tracking model using supervised learning. Results on a range of settings (robotics, physics, physiology) highlight the utility of this approach.
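The alternate-and-select idea can be illustrated with a deliberately simplified example: fit an assumed equation of motion (here 1-D free fall, standing in for the paper's general families, with no camera calibration step) to each candidate keypoint track, then keep the best-fitting track. The tracks and `fit_free_fall` helper are hypothetical:

```python
# Hypothetical sketch of the track-selection idea: fit the assumed
# physics to each candidate track; the physically consistent one wins.
import numpy as np

def fit_free_fall(t, y):
    """Least-squares fit of y(t) = y0 + v0*t - 0.5*g*t^2.
    Returns fitted (y0, v0, g) and the sum of squared residuals."""
    A = np.stack([np.ones_like(t), t, -0.5 * t**2], axis=1)
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = float(np.sum((A @ params - y) ** 2))
    return params, residual

t = np.linspace(0.0, 1.0, 50)
true_track = 2.0 + 1.0 * t - 0.5 * 9.8 * t**2   # obeys the physics
noise_track = np.sin(7 * t)                      # spurious keypoint track
tracks = [noise_track, true_track]

fits = [fit_free_fall(t, y) for y in tracks]
best = int(np.argmin([r for _, r in fits]))      # selection criterion
params, _ = fits[best]                           # recovered (y0, v0, g)
```

The selected track also yields the identified parameters for free, mirroring how V-SysId couples keypoint discovery with system identification.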
Playful Interactions for Representation Learning ORAL
Sarah Young, Jyotish Pari, Pieter Abbeel, Lerrel Pinto

One of the key challenges in visual imitation learning is collecting large amounts of expert demonstrations for a given task. While collecting human demonstrations is becoming easier with teleoperation methods and low-cost assistive tools, we often still require 100-1000 demonstrations per task to learn a visual representation and policy. To address this, we turn to an alternate form of data that does not require task-specific demonstrations -- play. Play is a fundamental way children acquire skills, behaviors, and visual representations early in development. Importantly, play data is diverse, task-agnostic, and relatively cheap to obtain. In this work, we propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks. We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations. Given these representations, we train policies using imitation learning for two downstream tasks: pushing and stacking. Our representations, which are trained from scratch, compare favorably against ImageNet pretrained representations. Finally, we provide an experimental analysis of the effects of different pretraining modes on downstream task learning.
TorchDyn: Implicit Models and Neural Numerical Methods in PyTorch
Michael Poli, Stefano Massaroli, Atsushi Yamashita, Hajime Asama, Jinkyoo Park, Stefano Ermon

Computation in traditional deep learning models is determined by the explicit linking of select primitives, e.g., layers or blocks, arranged in a computational graph. Implicit neural models instead follow a declarative approach. First, a desideratum relating the inputs and outputs of a neural network is encoded into constraints; then, a numerical method is applied to solve the resulting optimization problem as part of the inference pass. Existing open-source software frameworks focus on explicit models and do not offer implementations of the numerical routines required to study and benchmark this new class of models. We introduce TorchDyn, a PyTorch library dedicated to implicit learning. TorchDyn provides a standardized implementation of implicit models and the underlying numerical methods, designed to serve as stable baselines. Beyond models and numerics, the library further offers a collection of step-by-step tutorials and benchmarks designed to accelerate research and improve the robustness of experimental evaluations.
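The declarative idea can be sketched independently of TorchDyn's actual API: define a "layer" not by a forward computation but by the constraint z = tanh(Wz + x), and let inference solve that fixed-point problem numerically. The weight scale below is assumed small so the iteration contracts; this is a toy stand-in, not library code:

```python
# Minimal implicit-model sketch (not TorchDyn's API): the layer output
# is *defined* as the solution of z = tanh(W z + x), found numerically.
import numpy as np

def implicit_layer(x, W, tol=1e-8, max_iter=500):
    """Solve the fixed-point constraint by simple iteration."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + x)   # one solver step of the inference pass
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z_next

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(5, 5))     # small norm -> contraction (assumed)
x = rng.normal(size=5)
z = implicit_layer(x, W)
# The output satisfies the defining constraint up to solver tolerance:
residual = np.linalg.norm(z - np.tanh(W @ z + x))
```

Swapping the naive iteration for a better numerical method changes accuracy and cost without changing the model's definition, which is the separation the library is built around.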
AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition
Arijit Dasgupta, Jiafei Duan, Marcelo H Ang, Cheston Tan

Recent work in cognitive reasoning and computer vision has made the Violation-of-Expectation (VoE) paradigm increasingly popular in synthetic datasets. Inspired by work in infant psychology, researchers have started evaluating a model's ability to discriminate between expected and surprising scenes as a sign of its reasoning ability. Existing VoE-based 3D datasets in physical reasoning provide only vision data. However, current cognitive models of physical reasoning by psychologists reveal that infants create high-level abstract representations of objects and interactions. Capitalizing on this knowledge, we propose AVoE, a synthetic 3D VoE-based dataset that presents stimuli from multiple novel sub-categories for five event categories of physical reasoning. Compared to existing work, AVoE is armed with ground-truth labels of abstract features and rules augmented to vision data, paving the way for high-level symbolic predictions in physical reasoning tasks.
3D Neural Scene Representations for Visuomotor Control
Yunzhu Li, Shuang Li, Vincent Sitzmann, Pulkit Agrawal, Antonio Torralba

Humans have a strong intuitive understanding of the 3D environment around us. The mental model of physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we aim to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from the one the robot operates from. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.
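The time-contrastive component can be illustrated with a toy triplet-style loss standing in for the paper's actual formulation; the embeddings below are hypothetical values, not outputs of a real encoder:

```python
# Toy sketch of time-contrastive learning: embeddings of temporally
# nearby frames are pulled together, temporally distant ones pushed apart.
import numpy as np

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: positive = frame near the anchor in time,
    negative = frame far away in time."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])     # embedding of frame t (toy values)
positive = np.array([1.1, 0.0])   # embedding of frame t + 1
negative = np.array([-1.0, 0.5])  # embedding of a distant frame
loss = time_contrastive_loss(anchor, positive, negative)
```

Minimizing such a loss across camera views is one way to encourage the viewpoint-invariant, temporally smooth representations the abstract describes.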
Learning Graph Search Heuristics
Michal Pandy, Rex Ying, Gabriele Corso, Petar Veličković, Jure Leskovec, Pietro Liò

Searching for a path between two nodes in a graph is one of the most well-studied and fundamental problems in computer science. In numerous domains such as robotics, AI, or biology, practitioners develop search heuristics to accelerate their pathfinding algorithms. However, it is a laborious and complex process to hand-design heuristics based on the problem and the structure of a given use case. Here we present PHIL (Path Heuristic with Imitation Learning), a novel neural architecture and a training algorithm for discovering graph search and navigation heuristics from data by leveraging recent advances in imitation learning and graph representation learning. At training time, we aggregate datasets of search trajectories and ground-truth shortest path distances, which we use to train a specialized graph neural network-based heuristic function using backpropagation through steps of the pathfinding process. Our heuristic function learns graph embeddings useful for inferring node distances, runs in constant time independent of graph sizes, and can be easily incorporated in an algorithm such as A* at test time. Experiments show that PHIL reduces the number of explored nodes compared to state-of-the-art methods on benchmark datasets by 40.8% on average and allows for fast planning in time-critical robotics domains.
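How a learned heuristic plugs into A* can be sketched as follows: the search routine is agnostic to whether `h` is hand-designed or a trained network like PHIL's. The graph and the zero-heuristic stand-in below are toy assumptions:

```python
# Sketch of A* with a pluggable heuristic h(node) -> estimated cost-to-go.
# A learned model (as in PHIL) would supply h; a zero heuristic stands in here.
import heapq

def a_star(graph, start, goal, h):
    """graph: node -> list of (neighbor, edge_cost). Returns (path, cost)."""
    frontier = [(h(start), 0.0, start, [start])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue                      # already reached more cheaply
        best_g[node] = g
        for nbr, cost in graph.get(node, []):
            heapq.heappush(frontier,
                           (g + cost + h(nbr), g + cost, nbr, path + [nbr]))
    return None, float("inf")

graph = {"A": [("B", 1), ("C", 4)],
         "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)]}
path, cost = a_star(graph, "A", "D", h=lambda n: 0.0)
```

With `h = 0` this degenerates to Dijkstra's algorithm; the point of a learned heuristic is that a tighter, still-admissible `h` pops far fewer nodes from the frontier while returning the same path.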
Physics-guided Learning-based Adaptive Control on the SE(3) Manifold
Thai P Duong, Nikolay A Atanasov

In real-world robotics applications, accurate models of robot dynamics are critical for safe and stable control in rapidly changing operational conditions. This motivates the use of machine learning techniques to approximate robot dynamics and their disturbances over a training set of state-control trajectories. This paper demonstrates that inductive biases arising from physics laws can be used to improve the data efficiency and accuracy of the approximated dynamics model. For example, the dynamics of many robots, including ground, aerial, and underwater vehicles, are described using their $SE(3)$ pose and satisfy conservation of energy principles. We design a physically plausible model of the robot dynamics by imposing the structure of Hamilton's equations of motion in the design of a neural ordinary differential equation (ODE) network. The Hamiltonian structure guarantees satisfaction of $SE(3)$ kinematic constraints and energy conservation by construction. It also allows us to derive an energy-based adaptive controller that achieves trajectory tracking while compensating for disturbances. Our learning-based adaptive controller is verified on an under-actuated quadrotor robot.
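The energy-conservation bias can be illustrated on a toy system (a pendulum rather than the paper's SE(3) model, with assumed parameters): integrating Hamilton's equations with a structure-preserving step keeps the energy nearly constant over long horizons:

```python
# Toy illustration of the Hamiltonian inductive bias: a symplectic step
# for dq/dt = dH/dp, dp/dt = -dH/dq keeps energy (nearly) constant.
import numpy as np

def hamiltonian(q, p, m=1.0, l=1.0, g=9.8):
    """Pendulum energy: kinetic + potential (toy parameters)."""
    return p**2 / (2 * m * l**2) + m * g * l * (1 - np.cos(q))

def symplectic_step(q, p, dt, m=1.0, l=1.0, g=9.8):
    p = p - dt * m * g * l * np.sin(q)   # dp/dt = -dH/dq
    q = q + dt * p / (m * l**2)          # dq/dt =  dH/dp (uses updated p)
    return q, p

q, p = 0.5, 0.0                          # initial angle and momentum
e0 = hamiltonian(q, p)
for _ in range(10_000):                  # 10 seconds at dt = 1e-3
    q, p = symplectic_step(q, p, dt=1e-3)
energy_drift = abs(hamiltonian(q, p) - e0)
```

A generic network fit to the same trajectories has no such guarantee; baking Hamilton's equations into the architecture is what makes the learned model conserve energy by construction.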
Neural NID Rules
Luca Viano, Johanni Brea

Abstract object properties and their relations are deeply rooted in human common sense, allowing people to predict the dynamics of the world even in situations that are novel but governed by familiar laws of physics. Standard machine learning models in model-based reinforcement learning are inadequate to generalize in this way. Inspired by the classic framework of noisy indeterministic deictic (NID) rules, we introduce here Neural NID, a method that learns abstract object properties and relations between objects with a suitably regularized graph neural network. We validate the greater generalization capability of Neural NID on simple benchmarks specifically designed to assess the transition dynamics learned by the model.
Efficient Partial Simulation Quantitatively Explains Deviations from Optimal Physical Predictions
Ilona Bass, Kevin Smith, Elizabeth Bonawitz, Tomer Ullman

Humans are adept at planning actions in real-time dynamic physical environments. Machine intelligence struggles with this task, and one cause is that running simulators of complex, real-world environments is computationally expensive. Yet recent accounts suggest that humans use mental simulation in order to make intuitive physical judgments. How is human physical reasoning so accurate, while maintaining computational tractability? We suggest that human behavior is well described by partial simulation, which moves forward in time only parts of the world deemed relevant. We take as a case study Ludwin-Peery, Bramley, Davis, and Gureckis (2020), in which a conjunction fallacy was found in the domain of intuitive physics. This phenomenon is difficult to explain with full simulation, but we show it can be quantitatively accounted for with partial simulation. We discuss how AI research could make use of efficient partial simulation in implementations of commonsense physical reasoning.
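A deliberately minimal rendition of the partial-simulation idea, with a hypothetical relevance mask: only objects deemed relevant to the query are stepped forward in time, while the rest of the world stays frozen:

```python
# Toy partial simulation: advance only the objects flagged as relevant.
# The relevance mask here is a hypothetical stand-in for a real
# relevance judgment about the query.
import numpy as np

def partial_simulate(positions, velocities, relevant, dt, steps):
    pos = positions.copy()
    for _ in range(steps):
        pos[relevant] += dt * velocities[relevant]  # step relevant objects only
    return pos

positions = np.zeros(4)
velocities = np.ones(4)
relevant = np.array([True, True, False, False])     # e.g. near the query region
out = partial_simulate(positions, velocities, relevant, dt=0.1, steps=10)
```

The computational saving scales with the fraction of the scene left frozen, which is what makes this account of human judgments tractable.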
DLO@Scale: A Large-Scale Meta Dataset for Learning Non-Rigid Object Pushing Dynamics
Robert Gieselmann, Alberta Longhini, Alfredo Reichlin, Danica Kragic, Florian T. Pokorny

The ability to quickly understand our physical environment and make predictions about interacting objects is fundamental to humans. To equip artificial agents with similar reasoning capabilities, machine learning can be used to approximate the underlying state dynamics of a system. In this regard, deep learning has gained much popularity but relies on the availability of sufficiently large datasets. In this work, we present DLO@Scale, a new dataset for studying future state prediction in the context of multi-body deformable linear object pushing. It contains a large collection of 100 million simulated interactions, enabling thorough statistical analysis and algorithmic benchmarks. Our data is generated using a high-fidelity physics engine that simulates complex mechanical phenomena such as elasticity, plastic deformation, and friction. An important aspect is the large variation of the physical parameters, making the dataset suitable for testing meta-learning algorithms. We describe DLO@Scale and present a first empirical evaluation using neural network baselines.
3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators
Hsiao-Yu Tung, Zhou Xian, Mihir Prabhudesai, Shamit Lal, Katerina Fragkiadaki

We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos. In this 3D feature space, objects do not interfere with one another and their appearance persists over time and across viewpoints. This permits our model to predict scenes far into the future by simply "moving" 3D object features based on cumulative object motion predictions. Object motion predictions are computed by a graph neural network that operates over the object features extracted from the 3D neural scene representation. Our model generalizes well across varying numbers and appearances of interacting objects as well as across camera viewpoints, outperforming existing 2D and 3D dynamics models, and enables successful sim-to-real transfer.