Human Action Recognition is the task of recognizing a set of actions being performed in a video sequence. Reliably and efficiently detecting and identifying actions in video could have a significant impact in the surveillance, security, healthcare, and entertainment domains.
The problem addressed in this paper is to explore different engineered spatial and temporal image and video features (and combinations thereof) for Human Action Recognition, as well as different Deep Learning architectures for learned (non-engineered) features and classification that may be used in tandem with the handcrafted features. The different feature combinations are then compared, and the most discriminative feature set is identified.
This paper proposes the development and implementation of a robust framework for Human Action Recognition. The motivation behind the proposed research is, firstly, the high effectiveness of gradient-based descriptors such as HOG, HOF, and N-Jets for video-based human action recognition: they capture the salient spatial and temporal information in video sequences while discarding much of the redundant information that is not pertinent to the action. Combining these features in a hierarchical fashion further improves performance.
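Below is a minimal sketch of this kind of spatial-plus-temporal descriptor extraction and combination, assuming OpenCV and NumPy. The HOG window size, HOF bin count, Farneback flow parameters, simple concatenation scheme, and the input file name are all illustrative assumptions, not the exact configuration used in the paper.

```python
# Per-frame HOG (spatial) + HOF (temporal) descriptors, concatenated into
# one combined feature vector per frame pair.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()  # default 64x128 window, 9 orientation bins

def hog_descriptor(frame_gray):
    """Spatial descriptor: HOG computed on a resized grayscale frame."""
    resized = cv2.resize(frame_gray, (64, 128))
    return hog.compute(resized).flatten()

def hof_descriptor(prev_gray, curr_gray, bins=9):
    """Temporal descriptor: histogram of optical-flow orientations,
    weighted by flow magnitude (a simple HOF variant)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)  # normalize so flow scale does not dominate

def combined_descriptor(prev_gray, curr_gray):
    """Early fusion: concatenate the spatial (HOG) and temporal (HOF) parts."""
    return np.concatenate([hog_descriptor(curr_gray),
                           hof_descriptor(prev_gray, curr_gray)])

# Usage: accumulate one combined descriptor per consecutive frame pair.
cap = cv2.VideoCapture("action_clip.avi")  # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
descriptors = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    curr_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    descriptors.append(combined_descriptor(prev_gray, curr_gray))
    prev_gray = curr_gray
cap.release()
```

Concatenation is only the simplest ("early fusion") way of combining descriptors; the per-frame vectors produced here can also be aggregated or fed into a learned model, as discussed in the chapters below.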
Table of Contents
- Abstract
- 1 Introduction
- 2 Background and Related Work
- 2.1 Feature Extraction and Descriptor Representation
- 2.1.1 Space-Time Interest Points (STIP)
- 2.1.2 Dense Sampling
- 2.1.3 Histogram of Oriented Gradients (HOG)
- 2.1.4 N-Jets
- 2.1.5 Histograms of Oriented Optical Flow (HOF)
- 2.1.6 Feature Combination
- 2.2 Learning Algorithms
- 2.2.1 Support Vector Machines (SVM)
- 2.2.2 Convolutional Neural Networks (CNN)
- 2.2.3 Recurrent Neural Networks (RNNs)
- 2.3 Conclusion
- 3 Research Method
- 3.1 Research Hypothesis
- 3.2 Methodology
- 3.2.1 Phase 1: Implementation
- 3.2.2 Phase 2: Training
- 3.2.3 Phase 3: Testing
- 3.3 Motivation for Method
- 3.3.1 Features
- 3.3.2 Classifier
- 3.4 Conclusion
- 4 Research Plan
- 4.1 Deliverables
- 4.1.1 Phase 1: Implementation
- 4.1.2 Phase 2: Training
- 4.1.3 Phase 3: Testing
- 4.2 Potential Issues
- 4.2.1 Lengthy Training Time
- 4.2.2 Low Accuracies
- 4.3 Conclusion
Objectives and Key Themes
This research aims to explore and evaluate various engineered spatial and temporal image and video features, as well as deep learning architectures, for the purpose of Human Action Recognition in video sequences. The goal is to identify the most effective feature set and combination of approaches to achieve reliable and efficient action detection and identification.
- Feature Extraction and Descriptor Representation
- Deep Learning Architectures
- Human Action Recognition in Video Sequences
- Feature Combination and Optimization
- Performance Evaluation and Comparison
Chapter Summaries
- Chapter 1: Introduction - This chapter introduces the topic of Human Action Recognition and its significance in various applications such as surveillance, security, healthcare, and entertainment. It also highlights the challenge of accurately and efficiently detecting and identifying actions in video sequences.
- Chapter 2: Background and Related Work - This chapter provides a comprehensive overview of existing approaches to human action recognition, focusing on feature extraction, descriptor representation, and learning algorithms. It examines various methods like Space-Time Interest Points (STIP), dense sampling, Histogram of Oriented Gradients (HOG), N-Jets, and Histograms of Oriented Optical Flow (HOF). It also discusses machine learning techniques such as Support Vector Machines (SVM), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNNs); a minimal feature-plus-classifier sketch in this spirit is given after these summaries.
- Chapter 3: Research Method - This chapter outlines the research hypothesis, methodology, and motivation for the chosen approach. It describes the three phases of the research process: implementation, training, and testing. It also discusses the rationale behind the selection of specific features and classifiers for the study.
- Chapter 4: Research Plan - This chapter details the specific deliverables and potential issues associated with the research project. It includes a breakdown of the implementation, training, and testing phases, as well as potential challenges such as lengthy training times and low accuracies.
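As a rough illustration of the feature-plus-classifier pipeline surveyed in Chapter 2 and followed in Chapter 3, the sketch below trains a linear SVM on clip-level descriptors, assuming scikit-learn and NumPy. The mean pooling of per-frame descriptors into one clip-level vector, the synthetic placeholder data, and the descriptor dimensionality (3780 HOG values from the default OpenCV window plus 9 HOF bins) are illustrative assumptions rather than the paper's actual aggregation or dataset.

```python
# Clip-level classification of combined HOG/HOF descriptors with a linear SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def clip_level_vector(per_frame_descriptors):
    """Aggregate per-frame descriptors (n_frames x dim) into one clip vector."""
    return np.asarray(per_frame_descriptors).mean(axis=0)

# Placeholder data standing in for real extracted descriptors:
# 60 clips, each with 20-40 per-frame descriptors of dimension 3789,
# and 3 action classes.
rng = np.random.default_rng(0)
all_clip_descriptors = [rng.normal(size=(rng.integers(20, 40), 3789))
                        for _ in range(60)]
all_clip_labels = rng.integers(0, 3, size=60)

X = np.stack([clip_level_vector(d) for d in all_clip_descriptors])
y = np.asarray(all_clip_labels)

svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
scores = cross_val_score(svm, X, y, cv=5)  # 5-fold cross-validated accuracy
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A deep model (e.g. a CNN over frames feeding an RNN over time) could replace the SVM in this pipeline; the pooling-plus-linear-classifier setup shown here is simply the least elaborate baseline against which such architectures are typically compared.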
Keywords
Human Action Recognition, Deep Learning, Feature Extraction, Descriptor Representation, Space-Time Interest Points, Histogram of Oriented Gradients, Histograms of Oriented Optical Flow, Convolutional Neural Networks, Recurrent Neural Networks, Feature Combination, Performance Evaluation.
Citation
- Mike Nkongolo (Author), 2018, Demystifying Human Action Recognition in Deep Learning with Space-Time Feature Descriptors, Munich, GRIN Verlag, https://www.grin.com/document/413235