Human Action Recognition is the task of identifying the actions being performed in a video sequence. Reliable and efficient detection and identification of actions in video could have a major impact on the surveillance, security, healthcare, and entertainment spaces.
The problem addressed in this paper is to explore different engineered spatial and temporal image and video features (and combinations thereof) for Human Action Recognition, as well as Deep Learning architectures that learn non-engineered features (and perform classification) and may be used in tandem with the handcrafted features. The different feature combinations are then compared, and the best, most discriminative feature set is identified.
This paper proposes the development and implementation of a robust framework for Human Action Recognition. The proposed research is motivated, firstly, by the high effectiveness of gradient-based descriptors, such as HOG, HOF, and N-Jets, for video-based human action recognition: they capture the salient spatial and temporal information in a video sequence while discarding much of the redundant information that is not pertinent to the action. Combining these features in a hierarchical fashion further improves performance.
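As an illustration of the kind of feature combination described above (a minimal sketch, not the author's pipeline), a per-frame descriptor can concatenate a spatial HOG vector with a coarse temporal histogram built from frame differences; the frame size, bin count, and HOG parameters here are illustrative assumptions.

```python
# Sketch: concatenating a spatial descriptor (HOG) with a simple temporal
# descriptor (orientation histogram of frame-to-frame intensity change).
# Parameters and the frame-difference "flow" proxy are illustrative only.
import numpy as np
from skimage.feature import hog

def frame_descriptor(prev_frame, frame, flow_bins=8):
    """prev_frame, frame: 2D grayscale arrays of the same shape."""
    # Spatial component: HOG of the current frame
    spatial = hog(frame, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    # Temporal component: histogram of gradient orientations of the
    # frame difference (a crude stand-in for optical flow)
    dy, dx = np.gradient(frame.astype(float) - prev_frame.astype(float))
    angles = np.arctan2(dy, dx)
    temporal, _ = np.histogram(angles, bins=flow_bins, range=(-np.pi, np.pi))
    temporal = temporal / (temporal.sum() + 1e-8)  # normalise the histogram
    # Hierarchical combination by concatenation
    return np.concatenate([spatial, temporal])
```

Descriptors of this form, stacked over the frames of each clip, could then be fed to a classifier such as an SVM, in the spirit of the combined-feature approaches surveyed later in the document.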
Table of Contents
1 Introduction
2 Background and Related Work
2.1 Feature Extraction and Descriptor Representation
2.1.1 Space-Time Interest Points (STIP)
2.1.2 Dense Sampling
2.1.3 Histogram of Oriented Gradients (HOG)
2.1.4 N-Jets
2.1.5 Histograms of Oriented Optical Flow (HOF)
2.1.6 Feature Combination
2.2 Learning Algorithms
2.2.1 Support Vector Machines (SVM)
2.2.2 Convolutional Neural Networks (CNN)
2.2.3 Recurrent Neural Networks (RNN)
2.3 Conclusion
3 Research Method
3.1 Research Hypothesis
3.2 Methodology
3.2.1 Phase 1: Implementation
3.2.2 Phase 2: Training
3.2.3 Phase 3: Testing
3.3 Motivation for Method
3.3.1 Features
3.3.2 Classifier
3.4 Conclusion
4 Research Plan
4.1 Deliverables
4.1.1 Phase 1: Implementation
4.1.2 Phase 2: Training
4.1.3 Phase 3: Testing
4.2 Potential Issues
4.2.1 Lengthy Training Time
4.2.2 Low Accuracies
4.3 Conclusion
5 Conclusion
Research Objectives and Core Topics
The primary objective of this research is to develop and evaluate a robust, efficient framework for Human Action Recognition by integrating advanced, engineered spatial-temporal features with sequential deep learning architectures to improve classification accuracy and robustness against real-world video complexities.
- Human Action Recognition in video sequences
- Engineered spatial-temporal feature descriptors (HOG, HOF, N-Jets)
- Deep Learning architectures for automated feature learning
- Integration of sequential models (Recurrent Neural Networks)
- Benchmarking against standard datasets (KTH, UCF)
Excerpt from the Book
2.1.1 Space-Time Interest Points (STIP)
The Harris interest point detector was extended into the temporal domain by Laptev (2005). The resulting detector can localise spatio-temporal interest points in an image sequence, and this detection mechanism has been widely employed in Human Action Recognition (Laptev, 2005; Kovashka and Grauman, 2010). When used, it is typically the first stage of the feature engineering process that yields the eventual full video features. Interest points are computed by finding the local maxima (over a range of spatial and temporal scales) of H = det(µ) − k · trace³(µ), where H is the interest point operator; µ is the spatio-temporal second-moment matrix; k ∈ ℝ is a hyperparameter; trace is the matrix trace operator (Nkongolo, 2017); and det is the matrix determinant operator (Nkongolo and Kalonji, 2017; Laptev and Lindeberg, 2006). The multi-scale approach is often taken because it is more computationally efficient than scale selection, as used in regular Harris interest point detection. The core difference between the 2D and 3D Harris algorithms is that convolution at the various stages uses a 3D anisotropic separable Gaussian kernel with independent spatial and temporal variances, given by:

g(x, y, t; σ_ℓ², τ_ℓ²) = exp(−(x² + y²)/(2σ_ℓ²) − t²/(2τ_ℓ²)) / √((2π)³ σ_ℓ⁴ τ_ℓ²),

where σ_ℓ² and τ_ℓ² are the spatial and temporal scale (variance) parameters, respectively.
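The response H = det(µ) − k · trace³(µ) described above can be sketched in a few lines of NumPy/SciPy (an illustration only, not the book's implementation); the smoothing scales and the value of k below are illustrative assumptions.

```python
# Sketch of the 3D Harris (STIP) response H = det(mu) - k * trace(mu)**3,
# where mu is the spatio-temporal second-moment matrix built from smoothed
# products of gradients. Scales sigma, tau and k are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=1.5, tau=1.5, k=0.005):
    """video: 3D array (t, y, x) of grayscale frames."""
    # Anisotropic Gaussian smoothing: independent temporal and spatial scales
    smoothed = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(smoothed)  # gradients along t, y, x
    # Second-moment matrix entries, integrated with a larger Gaussian window
    window = (2 * tau, 2 * sigma, 2 * sigma)
    grads = (Lx, Ly, Lt)
    mu = np.empty(video.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = gaussian_filter(grads[i] * grads[j], sigma=window)
    # Interest operator; its local maxima over space, time (and scales)
    # give the spatio-temporal interest points
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
```

A full detector would additionally search for local maxima of this response over positions and over the range of (σ, τ) scale pairs, as the excerpt notes.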
Chapter Summaries
1 Introduction: Provides an overview of Human Action Recognition, its practical applications, and the challenges caused by viewpoint variation and occlusions.
2 Background and Related Work: Reviews existing literature on feature extraction techniques and various learning algorithms, including traditional SVMs and modern Deep Learning approaches.
3 Research Method: Details the proposed research hypothesis and the methodological approach, including implementation phases and the choice of Recurrent Neural Networks.
4 Research Plan: Outlines the project timeline, key deliverables, and identifies potential challenges such as training time and accuracy issues, along with mitigation strategies.
5 Conclusion: Summarizes the research findings, highlighting the motivation for combining hierarchical feature extraction with sequential modeling to achieve state-of-the-art performance.
Keywords
Human Action Recognition, Deep Learning, Space-Time Interest Points, STIP, Histogram of Oriented Gradients, HOG, Optical Flow, HOF, Support Vector Machines, SVM, Recurrent Neural Networks, RNN, Computer Vision, Machine Learning, Feature Extraction
Frequently Asked Questions
What is the primary focus of this research paper?
The paper focuses on solving the Human Action Recognition problem by investigating both hand-crafted feature engineering and automated feature learning using Deep Learning techniques.
What are the central themes discussed in the work?
The central themes include spatial-temporal feature extraction, such as HOG and HOF, and the application of machine learning classifiers like SVMs, CNNs, and RNNs to video analysis.
What is the main research goal?
The goal is to identify the most discriminative feature sets and neural architectures that enable robust and efficient recognition of human actions in varying video environments.
Which scientific methodology is employed?
The methodology involves a systematic pipeline: collecting benchmark datasets, implementing hierarchical feature extraction, training sequential neural network models, and evaluating performance metrics.
What does the main body cover?
The main body covers a comprehensive survey of related work, a detailed description of the proposed research methodology, and a structured implementation and testing plan.
Which keywords characterize the work?
Key terms include Human Action Recognition, Deep Learning, Space-Time Interest Points (STIP), Histogram of Oriented Gradients (HOG), and Recurrent Neural Networks (RNN).
Why are Recurrent Neural Networks chosen for this problem?
RNNs are specifically suited for processing sequential data, and since video frames are inherently ordered and correlated over time, RNNs allow the model to learn necessary temporal dependencies.
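The temporal-dependency idea in the answer above can be made concrete with a minimal NumPy sketch of a vanilla (Elman-style) recurrent cell run over per-frame feature vectors; the weight shapes and sizes are illustrative, and this is a toy forward pass, not the architecture studied in the book.

```python
# Minimal sketch of why an RNN suits video: a hidden state h carries
# information from earlier frames forward through the sequence.
import numpy as np

def rnn_forward(frame_features, W_xh, W_hh, b_h):
    """frame_features: (T, d) array, one d-dim descriptor per frame."""
    h = np.zeros(W_hh.shape[0])
    for x in frame_features:                     # frames in temporal order
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # state depends on the past
    return h                                     # summary of the whole clip
```

The final state (or the sequence of states) could then be passed to a classification layer; trained variants such as LSTMs or GRUs mitigate the vanishing-gradient problems of this vanilla cell.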
How does the author address the problem of "Lengthy Training Time"?
The author suggests reducing the dimensionality of features, optimizing the network architecture to use fewer trainable parameters, or utilizing computing clusters to increase processing power.
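The first mitigation mentioned, reducing feature dimensionality, can be sketched with PCA (one common choice; the descriptor and component sizes below are illustrative assumptions, not figures from the book).

```python
# Sketch: shrinking high-dimensional video descriptors with PCA so that
# downstream training sees far fewer inputs. Sizes are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1764))      # e.g. 500 clips, 1764-dim descriptors
pca = PCA(n_components=64).fit(X)     # keep the top 64 components
X_small = pca.transform(X)            # (500, 64): cheaper to train on
```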
- Cite this work
- Mike Nkongolo (Author), 2018, Demystifying Human Action Recognition in Deep Learning with Space-Time Feature Descriptors, Munich, GRIN Verlag, https://www.grin.com/document/413235