Research: Japanese Sign Language Recognition
Our lab focuses on research in Japanese Sign Language (JSL) recognition. Specifically, we are developing AI technologies to recognize sign language through hand and finger movements. This work aims to bridge communication gaps between sign language users and non-signers. By employing innovative approaches using deep learning and computer vision, we strive to improve the accuracy of sign language recognition and explore possibilities for real-time processing.
Related Work
Researchers have devised various methods for sign language recognition, including physical approaches such as wearable devices [1]. Two recent research streams are continuous sign language recognition (CSLR) and isolated sign language recognition (ISLR). CSLR aims to recognize continuous signing, which requires identifying and interpreting individual signs in a stream where each sign flows into the next. Within CSLR, Lianyu Hu et al. [2] focus on inter-frame correlation. More recently, Neena Aloysius et al. [3] applied Conformer, a state-of-the-art model for speech recognition, to CSLR and proposed a framework called ConSignformer. ConSignformer has a bimodal pipeline consisting of a CNN as a feature extractor and a Conformer for sequence learning, and it introduces Cross-Modal Relative Attention (CMRA) to improve context learning. This approach achieved state-of-the-art performance on PHOENIX-2014 and PHOENIX-2014T.
[1] Ambar, R.; Fai, C.K.; Wahab, M.H.A.; Jamil, M.M.A.; Ma’radzi, A.A. “Development of a Wearable Device for Sign Language Recognition.” JPCS 2018, 1019, 012017.
[2] Hu, L.; Gao, L.; Liu, Z.; Feng, W. “Continuous Sign Language Recognition with Correlation Network.” In Proceedings of the CVPR, 20–22 June 2023; pp. 2529–2539.
[3] Aloysius, N.; Geetha, M.; Nedungadi, P. “Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining.” arXiv:2405.12018v1.

Our Research
Our research aims to improve the accuracy of automatic recognition and translation systems for Japanese fingerspelling by focusing on the following four points. Our work therefore covers the whole pipeline, from dataset creation to the proposal of a recognition system.
1, Video recording for the creation of a Japanese fingerspelling dataset
2, Feature extraction from fingerspelling videos using pose estimation
3, Classification of fingerspelling using deep learning
4, Segmentation and recognition of fingerspelled word videos
1, Video recording for the creation of a Japanese fingerspelling dataset
We created a sign language dataset with the help of both experienced and inexperienced signers. The recording session is also covered in NEWS. Please also visit the website of the recording session!
The video data recorded at the session that we received permission to release is available on the following website. If you are a researcher working on Japanese fingerspelling, please make use of this site!
2, Feature extraction from fingerspelling videos using pose estimation
In angle-based feature extraction, the joint coordinates obtained from MediaPipe are used to express the tilt of the fingers and of the whole hand as cosine values. For finger joint angles, three points are selected around the target joint, and the cosine of the angle between the two vectors formed by these points is computed. For the overall hand tilt, the cosine is computed between each vector from the wrist to a finger joint and a reference vector along the x-axis. This process yields 20 features for the finger joints and 20 for the overall hand tilt, for a total of 40 features.
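The angle features described above could be sketched as follows. This is a minimal illustration, assuming each hand is given as 21 (x, y, z) landmarks in MediaPipe order with the wrist at index 0. The helper names `cosine`, `joint_cosine`, and `hand_tilt_cosines` are hypothetical, and since the exact point triples behind the 20 joint features are not listed here, only the per-joint computation is shown.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def joint_cosine(prev_pt, joint_pt, next_pt):
    """Cosine of the angle at joint_pt, formed by the vectors to its two neighbours."""
    u = [p - j for p, j in zip(prev_pt, joint_pt)]
    v = [n - j for n, j in zip(next_pt, joint_pt)]
    return cosine(u, v)

def hand_tilt_cosines(landmarks):
    """Cosine between each wrist-to-joint vector and the x-axis (20 values for 21 landmarks)."""
    wrist = landmarks[0]          # assumption: wrist is landmark 0, as in MediaPipe Hands
    x_axis = (1.0, 0.0, 0.0)      # reference vector along the x-axis
    return [
        cosine([c - w for c, w in zip(pt, wrist)], x_axis)
        for pt in landmarks[1:]
    ]
```

For example, with MediaPipe's landmark numbering, `joint_cosine(lm[5], lm[6], lm[7])` would give the cosine at the middle joint of the index finger, and a fully straight finger gives a value of -1.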
3, Classification of fingerspelling using deep learning
To classify fingerspelling with deep learning, the angular feature data is converted into a two-dimensional array with time on the x-axis and the angular features on the y-axis. The resulting arrays are fed into a Vision Transformer (ViT) and a convolutional neural network (CNN) to classify 46 classes of fingerspelling: the ViT learns more abstract features as encoder layers are added, while the CNN extracts spatial patterns for classification. Both models achieve high recognition accuracy.
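The conversion to a two-dimensional array could look roughly like the sketch below. It assumes the input is a list of per-frame feature vectors (for example, 40 angle features per frame) and that clips are truncated or zero-padded to a fixed number of frames. The function name `to_feature_image`, the `target_len` parameter, and the padding strategy are all illustrative assumptions, not details given in the text.

```python
def to_feature_image(frames, target_len, pad_value=0.0):
    """Arrange per-frame features as rows = features (y-axis), columns = time (x-axis).

    frames: list of equal-length feature vectors, one per video frame.
    target_len: fixed number of time steps (assumed; longer clips are cut,
                shorter clips are padded with pad_value).
    """
    if not frames:
        raise ValueError("frames must contain at least one frame")
    n_features = len(frames[0])
    if any(len(f) != n_features for f in frames):
        raise ValueError("all frames must have the same number of features")

    clipped = frames[:target_len]
    # Transpose: one row per feature, one column per time step.
    image = [[frame[i] for frame in clipped] for i in range(n_features)]
    # Pad the time axis so every clip has the same width.
    for row in image:
        row.extend([pad_value] * (target_len - len(row)))
    return image
```

Fixing the width this way gives every clip the same input shape, which image-style models such as a ViT or CNN generally require; resampling to a fixed length would be an alternative design.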
4, Segmentation and recognition of fingerspelled word videos
Conventional Transformers perform well on videos with large frame-to-frame differences, but they struggle with sign language videos, which contain few large movements. This reduces the accuracy of segmenting fingerspelled word videos into individual letters. This study therefore applies the existing change point detection methods by Truong et al. to the angular features extracted earlier, in order to verify whether they can detect the timing of transitions between signs.
The following four methods were used.
・PELT (Pruned Exact Linear Time)
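To illustrate how PELT can locate change points in a sequence of angle features, here is a minimal pure-Python sketch. It assumes a least-squares (L2) segment cost and a hand-chosen penalty; the cost function and penalty actually used in this study are not stated here. The function name `pelt` is hypothetical. In practice, the ruptures library by Truong et al. provides a ready-made PELT implementation.

```python
def pelt(signal, penalty):
    """Return change point indices for a list of feature vectors (one per frame).

    Assumption: L2 cost, i.e. the squared deviation from the segment mean.
    An index k in the result means a new segment starts at frame k.
    """
    n, dims = len(signal), len(signal[0])
    # Prefix sums let us evaluate the cost of any segment in O(dims).
    s1, s2 = [[0.0] * dims], [0.0]
    for frame in signal:
        s1.append([a + x for a, x in zip(s1[-1], frame)])
        s2.append(s2[-1] + sum(x * x for x in frame))

    def cost(a, b):
        """L2 cost of frames a..b-1."""
        sq_sum = sum((s1[b][d] - s1[a][d]) ** 2 for d in range(dims))
        return (s2[b] - s2[a]) - sq_sum / (b - a)

    best = [0.0] * (n + 1)   # best[t]: minimal penalised cost of frames 0..t-1
    best[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start index of the final segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        values = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(values)
        # Pruning step: drop starts that can never be optimal again.
        candidates = [s for v, s in values if v - penalty <= best[t]]
        candidates.append(t)

    # Backtrack through the stored segment starts.
    change_points, t = [], n
    while last[t] > 0:
        t = last[t]
        change_points.append(t)
    return sorted(change_points)
```

A larger `penalty` yields fewer change points, so in a segmentation setting it would need tuning so that detected boundaries line up with transitions between letters.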