
Research: Video Analysis

In today's society, an enormous amount of video data is generated every day from streaming services such as YouTube and Netflix, surveillance cameras, medical imaging, and sports broadcasts. Manually analyzing this vast amount of video data, however, is simply impractical.

This is where Video Analysis comes in. Video analysis is a research field that primarily leverages deep learning to automatically understand video content and extract meaningful information.

Video analysis technology has a wide range of applications across various fields, including:

  • Security & Surveillance: Detecting suspicious behavior in real-time from surveillance camera footage

  • Healthcare & Medicine: Analyzing surgical videos and detecting abnormal patient behavior

  • Sports Analytics: Automatically extracting highlight scenes from game footage

  • Autonomous Driving: Recognizing pedestrians and obstacles to support safe navigation

  • Entertainment: Automating video content search and editing

Advancements in video analysis technology enable the discovery of information that might be easily overlooked by the human eye, improve operational efficiency, and enhance AI’s ability to interpret the real world.

However, understanding video content differs significantly from analyzing still images, as it involves the critical element of time. This makes video analysis far more complex than simple image recognition.

In particular, accurately recognizing what happens in a video and when it happens is studied as Temporal Action Localization, one of the most challenging problems in video analysis. Several key difficulties arise in this task:

  • Ambiguity in Action Boundaries

    • Sign language movements transition seamlessly from one word to the next in a continuous flow.

    • For example, the sign language gestures for "hello" and "thank you" have distinct motions, but in natural signing, it is extremely difficult to pinpoint exactly where one word ends and the next begins.

    • Additionally, experienced signers tend to produce even more fluid transitions, making word boundaries less distinct.

    • Some sign words are followed by a brief pause, while others smoothly connect to the next motion, requiring advanced AI techniques to accurately recognize where one word ends and another begins.

  • Difficulty in Differentiating Similar Actions

    • Some gestures, such as waving and throwing, appear visually similar but have entirely different meanings.

    • In sign language, many words share similar hand movements but convey different meanings, making it necessary to consider context and preceding/following gestures for accurate recognition.

  • Challenges in Annotating Video Data

    • Accurately labeling human actions in video requires extensive manual effort, making it costly and time-consuming to create large-scale datasets.

    • This challenge makes supervised learning approaches less scalable, highlighting the need for self-supervised or weakly supervised learning techniques in video understanding.
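To make the annotation-cost gap above concrete, the minimal Python sketch below contrasts fully supervised segment labels (a start and end time for every action instance) with the single-timestamp "point" labels used by point-supervised methods. The syllable names and timings are purely illustrative, not taken from our dataset.

from dataclasses import dataclass

# Fully supervised annotation: every action instance needs a precise
# start and end time, which is expensive to label at scale.
@dataclass
class SegmentLabel:
    label: str    # action class, e.g. a sign-language syllable
    start: float  # seconds
    end: float    # seconds

# Point supervision: the annotator marks a single timestamp somewhere
# inside each action instance -- far cheaper, but the model must infer
# the boundaries itself during training.
@dataclass
class PointLabel:
    label: str
    t: float      # one timestamp inside the action (seconds)

# Hypothetical labels for the same short clip (values are illustrative).
full_annotation = [
    SegmentLabel("no", 0.4, 1.1),
    SegmentLabel("ri", 1.3, 2.0),
    SegmentLabel("mo", 2.2, 2.9),
    SegmentLabel("no", 3.1, 3.8),
]
point_annotation = [PointLabel(s.label, (s.start + s.end) / 2) for s in full_annotation]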

Our Research Approach

To tackle these challenges, our research lab is leveraging state-of-the-art deep learning techniques to enhance the accuracy of video analysis, with a particular focus on sign language recognition through Temporal Action Localization.

Specifically, we are developing a Japanese finger-spelling dataset and applying Temporal Action Localization models to identify which syllables are present in a video and when they occur within the timeline. Our goal is to build a model that can accurately predict the temporal boundaries of sign language syllables within video sequences.
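As a minimal sketch of how such predictions are evaluated, the snippet below computes the temporal Intersection-over-Union (tIoU) between a predicted segment and a ground-truth boundary; a prediction is typically counted as correct when the labels match and the tIoU exceeds a threshold such as 0.5. The syllable name, timings, and threshold here are illustrative assumptions, not values from our experiments.

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A localization model typically emits (label, start, end, confidence) tuples.
prediction   = ("ri", 1.25, 2.05, 0.87)   # hypothetical model output
ground_truth = ("ri", 1.30, 2.00)         # hypothetical reference boundary

iou = temporal_iou(prediction[1:3], ground_truth[1:3])
correct = prediction[0] == ground_truth[0] and iou >= 0.5
print(f"tIoU = {iou:.2f}, correct at tIoU 0.5: {correct}")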

As this research progresses, the potential social impact of video analysis technology will expand significantly. In particular, improving the accuracy and practical application of sign language recognition is expected to facilitate smoother communication with individuals with hearing impairments, contributing to the realization of a more inclusive society.


Example of our Japanese Sign Language dataset

Related Papers

Point-Supervised Temporal Localization for Sign Language Recognition Using Hierarchical Reliability Propagation

Ryota Murai, Tamon Kondo and Yousun Kang

International Conference on Electronics, Information, and Communication (ICEIC) 2025

Abstract: In recent years, advances in deep learning technology have significantly contributed to improving communication tools for the hearing impaired, particularly by enhancing sign language recognition. In this study, we apply a Point-Supervised Temporal Action Localization method with Hierarchical Reliability Propagation to Japanese sign language recognition. First, features are extracted from video using I3D, a 3D CNN architecture. These features are then processed by a two-stage model: snippet-level learning followed by instance-level learning. The effectiveness of this approach was validated through recognition experiments on Japanese sign language videos, achieving an average mAP of 27.21%.

This paper addresses the prediction of the start and end timing of each character in sign language videos through point-supervised learning. This approach efficiently captures the critical timings in a sign language video and significantly reduces annotation cost compared with traditional frame-by-frame labeling. In our experiments on a Japanese sign language video dataset, we verified that the model can effectively capture the boundaries between characters, achieving an average mAP of 27.21%.
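The rough Python/PyTorch outline below sketches this two-stage idea under simplifying assumptions: it takes pre-extracted I3D snippet features, scores each snippet per class (snippet level), then groups confident consecutive snippets into candidate instances (instance level). The module names, feature dimension, class count, and thresholding rule are placeholders and do not reproduce the actual hierarchical reliability propagation used in the paper.

import torch
import torch.nn as nn

class SnippetScorer(nn.Module):
    """Snippet-level stage (simplified): score each I3D feature snippet per class."""
    def __init__(self, feat_dim=2048, num_classes=46):  # 46 = assumed number of kana classes
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, num_classes)
        )

    def forward(self, feats):        # feats: (T, feat_dim) snippet features
        return self.head(feats)      # (T, num_classes) per-snippet class scores

def scores_to_segments(scores, threshold=0.5):
    """Instance-level stage (greatly simplified): merge consecutive confident
    snippets of the same class into candidate (class, start, end) segments."""
    probs, labels = scores.softmax(dim=-1).max(dim=-1)
    segments, start, cls = [], None, None
    for t, (p, c) in enumerate(zip(probs.tolist(), labels.tolist())):
        if start is not None and (p < threshold or c != cls):
            segments.append((cls, start, t))
            start = None
        if start is None and p >= threshold:
            start, cls = t, c
    if start is not None:
        segments.append((cls, start, len(probs)))
    return segments

# Hypothetical usage with random tensors standing in for real I3D features.
feats = torch.randn(120, 2048)            # 120 snippets from one video
segments = scores_to_segments(SnippetScorer()(feats))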

Architecture of the point-supervised localization model (HR-Pro)

© 2024 by KangLab at Tokyo Polytechnic University 
