Multi-modal gesture recognition using integrated model of motion, audio and video

Abstract

Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation, and sign language. With the development of motion sensors, multiple data sources have become available, giving rise to multi-modal gesture recognition. Because our previous approach to gesture recognition relied on a unimodal system, it had difficulty distinguishing similar motion patterns. To solve this problem, a novel approach that integrates motion, audio, and video models is proposed, using a dataset captured with Kinect. The proposed system recognizes an observed gesture with the three models, and their individual recognition results are integrated by the proposed framework to produce the final result. The motion and audio models are learned with Hidden Markov Models, while a Random Forest classifier is used to learn the video model. In experiments evaluating the proposed system, the motion and audio models best suited for gesture recognition are chosen by varying the feature vectors and learning methods. In addition, the unimodal and multi-modal models are compared in terms of recognition accuracy. All experiments are conducted on the dataset provided by the organizers of MMGRC, the workshop for the Multi-Modal Gesture Recognition Challenge. The comparison shows that the multi-modal model composed of the three models achieves the highest recognition rate, indicating that the complementary relationship among the three models improves the accuracy of gesture recognition. The proposed system thus provides a technology for understanding human actions in daily life more precisely.
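
To make the integration step concrete, the following is a minimal sketch of the kind of decision-level fusion the abstract describes: each unimodal model (the motion and audio HMMs and the video Random Forest) is assumed to output a per-class score vector, and the three vectors are combined into a single decision. The fusion rule, weights, and class count here are illustrative assumptions, not the paper's published framework.

```python
import numpy as np

N_CLASSES = 20  # hypothetical number of gesture classes

def to_distribution(scores: np.ndarray) -> np.ndarray:
    """Map raw per-class scores (e.g., HMM log-likelihoods) to a
    probability distribution via a numerically stable softmax."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def fuse(motion: np.ndarray, audio: np.ndarray, video: np.ndarray,
         weights=(1.0, 1.0, 1.0)) -> int:
    """Weighted late fusion: combine the three models' class
    posteriors and return the most likely gesture class."""
    combined = (weights[0] * to_distribution(motion)
                + weights[1] * to_distribution(audio)
                + weights[2] * to_distribution(video))
    return int(np.argmax(combined))

# Stand-in scores for one observed gesture (random placeholders).
rng = np.random.default_rng(0)
motion_ll = rng.normal(size=N_CLASSES)          # HMM log-likelihoods
audio_ll = rng.normal(size=N_CLASSES)           # HMM log-likelihoods
video_prob = rng.dirichlet(np.ones(N_CLASSES))  # Random Forest probabilities

print("Predicted gesture class:", fuse(motion_ll, audio_ll, video_prob))
```

Equal weights are used here for simplicity; in practice the weights would be tuned on validation data.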

Author information

Corresponding authors

Correspondence to Yusuke Goutsu, Wataru Takano or Yoshihiko Nakamura.

Additional information

Supported by a Grant-in-Aid for Young Scientists (A) (Grant No. 26700021) from the Japan Society for the Promotion of Science, and by the Strategic Information and Communications R&D Promotion Programme (Grant No. 142103011) of the Ministry of Internal Affairs and Communications, Japan.

GOUTSU Yusuke is a PhD candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS and MS degrees in mechano-informatics from the University of Tokyo, Japan, in 2011 and 2013, respectively. His research interests include the artificial intelligence of humanoid robots. He is a student member of IEEE and the Robotics Society of Japan.

KOBAYASHI Takaki is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include intelligent vehicles. He is a student member of IEEE and the Robotics Society of Japan.

OBARA Junya is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in control and systems engineering from the Tokyo Institute of Technology, Japan, in 2013. His research interests include the artificial intelligence of humanoid robots.

KUSAJIMA Ikuo is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include the artificial intelligence of humanoid robots.

TAKEICHI Kazunari is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include human neuromusculoskeletal modeling and simulation. He is a student member of IEEE and the Robotics Society of Japan.

TAKANO Wataru, born in 1976, is an assistant professor at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. His research interests include kinematics, dynamics, the artificial intelligence of humanoid robots, and intelligent vehicles. He is a member of IEEE, the Robotics Society of Japan, and the Information Processing Society of Japan. He has served as chair of the Technical Committee on Robot Learning, IEEE RAS.

NAKAMURA Yoshihiko, born in 1954, is a professor at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. His research interests include the kinematics, dynamics, control, and intelligence of robots, particularly robots with non-holonomic constraints, computational brain information processing, humanoid robots, human-figure kinetics, and surgical robots. He is a member of IEEE, ASME, SICE, the Robotics Society of Japan, the Institute of Systems, Control and Information Engineers, and the Japan Society of Computer Aided Surgery. He was honored with a fellowship from the Japan Society of Mechanical Engineers. Since 2005, he has been the president of the Japan IFToMM Congress. He is an international member of the Academy of Engineering in Serbia and Montenegro.

An erratum to this article is available at http://dx.doi.org/10.1007/s10033-017-0136-y.

Cite this article

Goutsu, Y., Kobayashi, T., Obara, J. et al. Multi-modal gesture recognition using integrated model of motion, audio and video. Chin. J. Mech. Eng. 28, 657–665 (2015). https://doi.org/10.3901/CJME.2015.0202.053
