Multi-modal gesture recognition using integrated model of motion, audio and video

Abstract

Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation, and sign language. With the development of motion sensors, multiple data sources have become available, giving rise to multi-modal gesture recognition. Because our previous approach to gesture recognition relied on a unimodal system, it had difficulty distinguishing similar motion patterns. To solve this problem, a novel approach that integrates motion, audio, and video models is proposed, using a dataset captured with Kinect. The proposed system recognizes an observed gesture with the three models, and their individual recognition results are integrated by the proposed framework to produce the final result. The motion and audio models are learned with Hidden Markov Models, while a Random Forest classifier is used to learn the video model. In experiments evaluating the proposed system, the motion and audio models best suited for gesture recognition are chosen by varying the feature vectors and learning methods. In addition, the unimodal and multi-modal models are compared in terms of recognition accuracy. All experiments are conducted on the dataset provided by the organizers of MMGRC, the workshop for the Multi-Modal Gesture Recognition Challenge. The comparison shows that the multi-modal model composed of the three models achieves the highest recognition rate, indicating that the complementary relationship among the three models improves the accuracy of gesture recognition. The proposed system thus provides a technology for understanding human actions in daily life more precisely.
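
To make the integration step concrete, the following is a minimal sketch of the kind of decision-level fusion the abstract describes: each unimodal model (the motion and audio HMMs and the video Random Forest) is assumed to output a per-class score vector, and the three vectors are combined into a single decision. The fusion rule, weights, and class count here are illustrative assumptions, not the paper's published framework.

```python
import numpy as np

N_CLASSES = 20  # hypothetical number of gesture classes

def to_distribution(scores: np.ndarray) -> np.ndarray:
    """Map raw per-class scores (e.g., HMM log-likelihoods) to a
    probability distribution via a numerically stable softmax."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def fuse(motion: np.ndarray, audio: np.ndarray, video: np.ndarray,
         weights=(1.0, 1.0, 1.0)) -> int:
    """Weighted late fusion: combine the three models' class
    posteriors and return the most likely gesture class."""
    combined = (weights[0] * to_distribution(motion)
                + weights[1] * to_distribution(audio)
                + weights[2] * to_distribution(video))
    return int(np.argmax(combined))

# Stand-in scores for one observed gesture (random placeholders).
rng = np.random.default_rng(0)
motion_ll = rng.normal(size=N_CLASSES)          # HMM log-likelihoods
audio_ll = rng.normal(size=N_CLASSES)           # HMM log-likelihoods
video_prob = rng.dirichlet(np.ones(N_CLASSES))  # Random Forest probabilities

print("Predicted gesture class:", fuse(motion_ll, audio_ll, video_prob))
```

Equal weights are used here for simplicity; in practice the weights would be tuned on validation data.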

Author information

Corresponding authors

Correspondence to Yusuke Goutsu, Wataru Takano or Yoshihiko Nakamura.

Additional information

Supported by a Grant-in-Aid for Young Scientists (A) (Grant No. 26700021) from the Japan Society for the Promotion of Science, and by the Strategic Information and Communications R&D Promotion Programme (Grant No. 142103011) of the Ministry of Internal Affairs and Communications, Japan.

GOUTSU Yusuke is a PhD candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS and MS degrees in mechano-informatics from the University of Tokyo, Japan, in 2011 and 2013, respectively. His research interests include the artificial intelligence of humanoid robots. He is a student member of IEEE and the Robotics Society of Japan.

KOBAYASHI Takaki is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include intelligent vehicles. He is a student member of IEEE and the Robotics Society of Japan.

OBARA Junya is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in control and systems engineering from the Tokyo Institute of Technology, Japan, in 2013. His research interests include the artificial intelligence of humanoid robots.

KUSAJIMA Ikuo is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include the artificial intelligence of humanoid robots.

TAKEICHI Kazunari is a master's candidate at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. He received his BS degree in mechano-informatics from the University of Tokyo, Japan, in 2013. His research interests include human neuromusculoskeletal modeling and simulation. He is a student member of IEEE and the Robotics Society of Japan.

TAKANO Wataru, born in 1976, is an assistant professor at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. His research interests include kinematics, dynamics, the artificial intelligence of humanoid robots, and intelligent vehicles. He is a member of IEEE, the Robotics Society of Japan, and the Information Processing Society of Japan. He has served as chair of the Technical Committee on Robot Learning, IEEE RAS.

NAKAMURA Yoshihiko, born in 1954, is a professor at the Department of Mechano-Informatics, School of Information Science and Technology, University of Tokyo, Japan. His research interests include the kinematics, dynamics, control, and intelligence of robots, particularly robots with non-holonomic constraints, computational brain information processing, humanoid robots, human-figure kinetics, and surgical robots. He is a member of IEEE, ASME, SICE, the Robotics Society of Japan, the Institute of Systems, Control and Information Engineers, and the Japan Society of Computer Aided Surgery. He was honored with a fellowship from the Japan Society of Mechanical Engineers. Since 2005, he has been the president of the Japan IFToMM Congress. He is an international member of the Academy of Engineering in Serbia and Montenegro.

An erratum to this article is available at http://dx.doi.org/10.1007/s10033-017-0136-y.

Cite this article

Goutsu, Y., Kobayashi, T., Obara, J. et al. Multi-modal gesture recognition using integrated model of motion, audio and video. Chin. J. Mech. Eng. 28, 657–665 (2015). https://doi.org/10.3901/CJME.2015.0202.053
