Abstract
Action recognition, a field within human-centered computing, is vital for identifying and understanding human actions, with applications in surveillance, autonomous vehicles, and human-computer interaction. An enduring goal of artificial intelligence is to develop robust models that perceive and understand the visual world. Deep neural networks have shown exceptional performance across many tasks and have driven significant recent advances in action prediction and recognition. Most work in visual human action recognition focuses on a single viewpoint or modality and assumes complete observation. Yet much of the real-world significance lies in predicting future actions from incomplete observations, which can help prevent real-world tragedies. With multiple cameras and multiple data modalities (RGB, depth, and skeleton) now available, it becomes possible to model human action in a multi-view, multi-modality context, reducing the data lost to occlusions and signal quality issues and thereby improving the recognition accuracy of state-of-the-art deep learning models. However, deep neural networks are susceptible to adversarial attacks, in which imperceptible perturbations can compromise the performance of an action recognition model. This thesis focuses on identifying latent vulnerabilities and proposing a defense mechanism against such threats in a multi-modality, multi-view setting. It introduces an efficient and effective attack mechanism that perturbs skeleton data by targeting key joints and segments, and that employs a graph attention mechanism to learn the semantics needed to perturb other modalities. Additionally, it develops an approach that not only adds noise but also alters the visual spatial structure of skeleton data through generative modeling.
Furthermore, this dissertation introduces a defense mechanism, the Collaborative Knowledge Distillation Network, which is built on graph attention and knowledge distillation techniques. The network draws on knowledge from compromised multi-view data and integrates information from clean data to address incomplete observations and noisy action videos, enhancing the robustness of action recognition models for real-world applications.