TY - GEN
T1 - Hierarchical spatio-temporal context modeling for action recognition
AU - Sun, Ju
AU - Wu, Xiao
AU - Yan, Shuicheng
AU - Cheong, Loong Fah
AU - Chua, Tat Seng
AU - Li, Jintao
PY - 2009
Y1 - 2009
N2 - The problem of recognizing actions in realistic videos is challenging yet absorbing owing to its great potentials in many practical applications. Most previous research is limited due to the use of simplified action databases under controlled environments or focus on excessively localized features without sufficiently encapsulating the spatiotemporal context. In this paper, we propose to model the spatio-temporal context information in a hierarchical way, where three levels of context are exploited in ascending order of abstraction: 1) point-level context (SIFT average descriptor), 2) intra-trajectory context (trajectory transition descriptor), and 3) inter-trajectory context (trajectory proximity descriptor). To obtain efficient and compact representations for the latter two levels, we encode the spatiotemporal context information into the transition matrix of a Markov process, and then extract its stationary distribution as the final context descriptor. Building on the multichannel nonlinear SVMs, we validate this proposed hierarchical framework on the realistic action (HOHA) and event (LSCOM) recognition databases, and achieve 27% and 66% relative performance improvements over the state-of-the-art results, respectively. We further propose to employ the Multiple Kernel Learning (MKL) technique to prune the kernels towards speedup in algorithm evaluation.
AB - The problem of recognizing actions in realistic videos is challenging yet absorbing owing to its great potentials in many practical applications. Most previous research is limited due to the use of simplified action databases under controlled environments or focus on excessively localized features without sufficiently encapsulating the spatiotemporal context. In this paper, we propose to model the spatio-temporal context information in a hierarchical way, where three levels of context are exploited in ascending order of abstraction: 1) point-level context (SIFT average descriptor), 2) intra-trajectory context (trajectory transition descriptor), and 3) inter-trajectory context (trajectory proximity descriptor). To obtain efficient and compact representations for the latter two levels, we encode the spatiotemporal context information into the transition matrix of a Markov process, and then extract its stationary distribution as the final context descriptor. Building on the multichannel nonlinear SVMs, we validate this proposed hierarchical framework on the realistic action (HOHA) and event (LSCOM) recognition databases, and achieve 27% and 66% relative performance improvements over the state-of-the-art results, respectively. We further propose to employ the Multiple Kernel Learning (MKL) technique to prune the kernels towards speedup in algorithm evaluation.
UR - http://www.scopus.com/inward/record.url?scp=70450214829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70450214829&partnerID=8YFLogxK
U2 - 10.1109/CVPRW.2009.5206721
DO - 10.1109/CVPRW.2009.5206721
M3 - Conference contribution
AN - SCOPUS:70450214829
SN - 9781424439935
T3 - 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009
SP - 2004
EP - 2011
BT - 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009
PB - IEEE Computer Society
T2 - 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009
Y2 - 20 June 2009 through 25 June 2009
ER -