Abstract
Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) - an end-to-end unsupervised framework to reason about object interactions based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations for downstream robotic pushing manipulation tasks.
Original language | English (US) |
---|---|
Pages (from-to) | 6195-6202 |
Number of pages | 8 |
Journal | IEEE Robotics and Automation Letters |
Volume | 8 |
Issue number | 10 |
DOIs | |
State | Published - Oct 1 2023 |
Externally published | Yes |
Bibliographical note
Publisher Copyright:© 2016 IEEE.
Keywords
- Representation learning
- deep learning methods
- manipulation planning