Abstract
In this paper, we study the problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects, including hands, constitute a large proportion of visual scenes. These challenges limit the performance of existing models learned from large indoor datasets, such as ScanNet [6] and NYUv2 [36], which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier, which often produces an excessive perspective warp for egocentric images, the multimodal spatial rectifier learns multiple reference directions that minimize the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method for single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) [18] and EPIC-KITCHENS [7].
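As a rough illustration of the rectification idea described in the abstract, the sketch below picks the reference direction closest to a measured gravity vector and warps the image by the homography induced by the aligning rotation. This is a minimal sketch, not the authors' code: the mode set `modes`, the intrinsics `K_int`, and the gravity convention are illustrative assumptions, and the paper's exact warp and sign conventions may differ.

```python
# Minimal sketch of gravity-based rectification, not the authors' code.
# modes, K_int, and the gravity convention are illustrative assumptions.
import numpy as np
import cv2


def rotation_between(a, b):
    """Rodrigues formula: smallest rotation taking unit vector a onto unit vector b."""
    v, c = np.cross(a, b), float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Antipodal case: 180-degree rotation about any axis orthogonal to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)


def rectify(image, g, K_int, modes):
    """Warp image so the measured gravity g aligns with the closest mode.

    Returns the warped image plus the rotation R and homography H, so that
    predictions on the rectified image can be mapped back afterwards.
    """
    g = g / np.linalg.norm(g)
    mode = max(modes, key=lambda m: float(np.dot(g, m)))  # closest reference direction
    R = rotation_between(g, mode)
    H = K_int @ R @ np.linalg.inv(K_int)  # homography induced by the pure rotation R
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h)), R, H


# Two hypothetical modes: upright (gravity along +y in camera coordinates)
# and tilted 45 degrees toward the floor, as when looking down at a table.
modes = [np.array([0.0, 1.0, 0.0]),
         np.array([0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)])]
K_int = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])  # illustrative pinhole intrinsics
```

Because each image only needs to absorb the residual rotation to its nearest reference direction, the warp stays mild even for strongly tilted views, which is the stated motivation for using multiple modes rather than a single upright one.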
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 |
Publisher | IEEE Computer Society |
Pages | 2822-2831 |
Number of pages | 10 |
ISBN (Electronic) | 9781665469463 |
DOIs | |
State | Published - 2022 |
Event | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration | Jun 19 2022 → Jun 24 2022
Publication series
Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
---|---|
Volume | 2022-June |
ISSN (Print) | 1063-6919 |
Conference
Conference | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 |
---|---|
Country/Territory | United States |
City | New Orleans |
Period | 6/19/22 → 6/24/22 |
Bibliographical note
Funding Information: In this paper, we present a new multimodal spatial rectifier for egocentric scene understanding, i.e., predicting depths and surface normals from a single-view egocentric image. The multimodal spatial rectifier identifies multiple reference directions to learn a geometrically coherent representation from tilted egocentric images. This rectifier enables warping the image to the closest mode such that the geometry predictor in this mode can accurately estimate the geometry of the rectified scene. To facilitate the learning of our multimodal spatial rectifier, we introduce a new dataset called EDINA that comprises 550K synchronized RGBD frames with gravity measurements of diverse indoor activities. We show that EDINA is complementary to ScanNet, allowing us to learn a strong multimodal spatial rectifier. We evaluate our method on egocentric datasets including our EDINA, FPHA, and EPIC-KITCHENS, on which it outperforms the baselines. Acknowledgements: This work is partially supported by NSF CAREER IIS-1846031.
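To make the rectifier's role in the pipeline concrete, here is a hedged sketch of how it might wrap a geometry predictor: rectify, predict in the stabilized frame, then map the prediction back. It reuses `rectify` from the sketch above; `predict_normals` is a hypothetical stand-in for any pretrained surface-normal network, and the back-mapping conventions below are assumptions, not the paper's exact formulation.

```python
import numpy as np
import cv2

# Builds on rectify() from the earlier sketch. predict_normals is a
# hypothetical stand-in for a pretrained network returning an (H, W, 3)
# unit-normal map for its input image.
def estimate_normals(image, g, K_int, modes, predict_normals):
    rectified, R, H = rectify(image, g, K_int, modes)
    n_rect = predict_normals(rectified)        # normals in the rectified frame
    n = (n_rect @ R).astype(np.float32)        # per-pixel R^T n: back to the original frame
    h, w = image.shape[:2]
    # Undo the pixel warp so each normal lands on its original pixel.
    return cv2.warpPerspective(n, np.linalg.inv(H), (w, h))
```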
Publisher Copyright:
© 2022 IEEE.
Keywords
- 3D from single images
- Scene analysis and understanding