Recent advances in instance-level detection tasks lay a strong foundation for genuine comprehension of visual scenes. However, the ability to fully comprehend a social scene is still at a preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in social scene images, a task that is challenging as a research problem and increasingly useful in practical applications. When performing social tasks that involve interacting with objects, humans direct their attention and move their bodies according to their intention. Based on this observation, we provide a unique computational perspective to explore human intention in HOI detection. Specifically, the proposed human intention-driven HOI detection (iHOI) framework models human pose with the relative distances from body joints to the object instances. It also utilizes human gaze to guide the attended contextual regions in a weakly-supervised setting. In addition, we propose a hard negative sampling strategy to address the problem of mis-grouping. We perform extensive experiments on two benchmark datasets, V-COCO and HICO-DET, and validate the efficacy of each proposed component.
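As a rough illustration of the pose cue described in the abstract, the sketch below computes the relative distance from each body joint to an object instance. It is a minimal sketch under stated assumptions, not the iHOI implementation: the function names, the use of the object box center as the reference point, and the normalization by human box height are all hypothetical choices that the abstract does not specify.

```python
import math


def joint_to_object_offsets(joints, obj_box, human_box):
    """Return normalized (dx, dy) offsets from each joint to the object center.

    joints    : list of (x, y) joint coordinates in pixels
    obj_box   : (x1, y1, x2, y2) object bounding box
    human_box : (x1, y1, x2, y2) human bounding box

    Assumption: offsets are measured to the object box center and divided
    by the human box height so that they are roughly scale-invariant.
    """
    obj_cx = (obj_box[0] + obj_box[2]) / 2.0
    obj_cy = (obj_box[1] + obj_box[3]) / 2.0
    # Guard against a degenerate (zero-height) human box.
    scale = max(human_box[3] - human_box[1], 1e-6)
    return [((obj_cx - jx) / scale, (obj_cy - jy) / scale) for jx, jy in joints]


def joint_to_object_distances(joints, obj_box, human_box):
    """Return the Euclidean length of each normalized joint-to-object offset."""
    offsets = joint_to_object_offsets(joints, obj_box, human_box)
    return [math.hypot(dx, dy) for dx, dy in offsets]
```

For example, with joints `[(0, 0), (10, 0)]`, an object box `(8, -2, 12, 2)`, and a human box `(0, 0, 10, 20)`, the second joint coincides with the object center, so its distance is zero, while the first joint lies half a body height away. A feature vector of this kind could then be fed to an interaction classifier alongside appearance features.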
Bibliographical note
Funding Information:
Manuscript received July 20, 2018; revised January 3, 2019 and June 17, 2019; accepted September 16, 2019. Date of publication September 25, 2019; date of current version May 21, 2020. This work was supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Federica Battisti. (Corresponding author: Bingjie Xu.) B. Xu and J. Li are with the Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore 119077 (e-mail: firstname.lastname@example.org; email@example.com).
© 1999-2012 IEEE.
- Human-Object Interactions (HOIs)
- Intention-Driven Analysis
- Visual Relationships