
Research
Our research areas include:
- Sign Language Recognition: Advancing temporal action localization and multimodal feature integration for real-time Japanese sign language recognition.
- Medical AI: Utilizing Vision Transformers and other state-of-the-art methods to analyze medical imaging data, with applications in diagnostics and segmentation.
- Multimodal Learning: Exploring the synergy between visual, spatial, and temporal data for enhanced recognition systems.
- Object Detection and Image Generation: Pioneering techniques to improve accuracy and efficiency in real-world applications.
Recent contributions from our lab include novel models for Japanese sign language recognition based on frame clustering and point-supervised methods, as well as the integration of angular features for improved performance. Our work on robotics and color analysis systems has further demonstrated the versatility of AI in industrial and consumer applications.
Our mission is to push the boundaries of AI-driven solutions, fostering advancements that bridge technology and human-centric applications. We invite you to explore our research and join us in shaping the future of intelligent systems.
Japanese Sign Language Recognition
Japanese Sign Language Recognition with Vision Transformer and CNN Using Angular Features
In recent years, advances in deep learning have driven significant progress in research aimed at facilitating communication with individuals who have hearing impairments, with a particular focus on automatic sign language recognition and translation systems. This study proposes a novel approach to Japanese Sign Language recognition using a Vision Transformer (ViT) combined with a CNN. Our method employs the pose-estimation library MediaPipe to extract the positional coordinates of each finger joint in every video frame and generates one-dimensional angular feature data from those coordinates.
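Below is a minimal sketch of this angular-feature step, assuming MediaPipe's 21-landmark hand model; the joint triplets and helper functions are illustrative choices rather than the study's exact implementation.

```python
import numpy as np
import mediapipe as mp

def joint_angle(a, b, c):
    """Angle (radians) at landmark b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Consecutive landmark triplets along each finger (MediaPipe hand indexing).
FINGER_TRIPLETS = [
    (1, 2, 3), (2, 3, 4),        # thumb
    (5, 6, 7), (6, 7, 8),        # index
    (9, 10, 11), (10, 11, 12),   # middle
    (13, 14, 15), (14, 15, 16),  # ring
    (17, 18, 19), (18, 19, 20),  # little
]

def angular_features(frame_rgb, hands):
    """Return a 1-D vector of finger-joint angles for the first detected hand."""
    result = hands.process(frame_rgb)
    if not result.multi_hand_landmarks:
        return np.zeros(len(FINGER_TRIPLETS))      # no hand in this frame
    lm = result.multi_hand_landmarks[0].landmark
    pts = np.array([[p.x, p.y, p.z] for p in lm])  # 21 x 3 coordinates
    return np.array([joint_angle(pts[a], pts[b], pts[c])
                     for a, b, c in FINGER_TRIPLETS])

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
```

Stacking these per-frame angle vectors over time yields the one-dimensional angular feature sequence fed to the recognizer.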
Action Recognition using Video Analysis
Point-Supervised Temporal Localization using Hierarchical Reliability Propagation


In recent years, advances in deep learning technology have significantly contributed to improving communication tools for the hearing impaired, particularly by enhancing sign language recognition. In this study, we apply a Point-Supervised Temporal Action Localization method with Hierarchical Reliability Propagation to Japanese sign language recognition. First, features are extracted from video using I3D, a 3D CNN architecture. These features are then processed by a two-stage model: snippet-level learning followed by instance-level learning. The effectiveness of this approach was validated through recognition experiments on Japanese sign language videos, achieving an average mAP of 27.21%.
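The sketch below illustrates the two-stage idea in simplified form, assuming pre-extracted I3D features of shape (T, 1024); the head sizes, class count, and thresholding heuristic are illustrative stand-ins, not the exact Hierarchical Reliability Propagation procedure.

```python
import torch
import torch.nn as nn

class SnippetHead(nn.Module):
    """Stage 1: per-snippet class scores over pre-extracted I3D features."""
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(512, num_classes, kernel_size=1),
        )

    def forward(self, feats):               # feats: (B, T, feat_dim)
        x = feats.transpose(1, 2)           # -> (B, feat_dim, T)
        return self.net(x).transpose(1, 2)  # -> (B, T, num_classes)

def snippets_to_instances(scores, cls, thresh=0.5):
    """Stage 2 (simplified): merge contiguous high-confidence snippets
    into (start, end) candidate action instances for one class."""
    probs = scores.softmax(dim=-1)[0, :, cls]   # (T,) class probabilities
    mask = (probs > thresh).tolist() + [False]  # sentinel closes a final run
    instances, start = [], None
    for t, on in enumerate(mask):
        if on and start is None:
            start = t                           # a run of snippets begins
        elif not on and start is not None:
            instances.append((start, t - 1))    # run ends -> one instance
            start = None
    return instances
```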
Medical AI
Pathology Report Generation Using Giga-pixel Whole Slide Images for Bladder Tumors

Our objective was to generate pathology reports from Giga-pixel Whole Slide Images (WSIs) of bladder tumors. A key feature of our model is the use of a foundation model to extract slide image features as vision-encoding vectors. To enhance report generation, we restructured pathology report content into six categories and adopted a two-step strategy, first analyzing the images and then generating the reports, to address performance challenges in vision-language alignment. For slide image analysis, we built two models: an attention-based multiple-instance learning (MIL) model and a knowledge-distillation model built on Transformer-based MIL; the best multi-label classification results were obtained by ensembling the two. For report generation, we developed a text-to-text model using T5, simplifying the input format to reduce complexity and applying data augmentation techniques to improve training stability and overall performance.
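As a concrete illustration of the attention-based MIL component, here is a minimal sketch of gated attention pooling (after Ilse et al.) over patch embeddings, assuming each WSI has already been tiled and encoded by the foundation model into a bag of feature vectors; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Gated attention pooling over a bag of patch features, followed by
    a multi-label head (one logit per report category)."""
    def __init__(self, feat_dim=1024, hidden=256, num_labels=6):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)
        self.head = nn.Linear(feat_dim, num_labels)

    def forward(self, bag):                    # bag: (N, feat_dim) patch features
        a = self.attn_w(self.attn_v(bag) * self.attn_u(bag))  # (N, 1)
        a = torch.softmax(a, dim=0)            # attention weights over patches
        slide_vec = (a * bag).sum(dim=0)       # (feat_dim,) slide-level embedding
        return self.head(slide_vec)            # multi-label logits
```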
Multimodal AI4TB Challenge 2024

This challenge aims to leverage multimodal data to develop models that accurately predict the Time to Positivity (TTP) category for patients undergoing TB treatment. TTP reflects bacterial load and disease burden, playing a crucial role in determining treatment duration and monitoring patient response. Accurate prediction of TTP categories enables patient stratification, identifying those who respond well to treatment and those requiring extended or shortened therapy. The challenge objective is to develop a machine learning model that predicts the TTP category for TB patients using multimodal data, including chest X-rays (CXRs), clinical data, and radiology reports.
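A minimal late-fusion baseline for this task might look like the sketch below, assuming a CXR image encoder, a tabular clinical-feature vector, and a precomputed text embedding of the radiology report; the module sizes and the number of TTP categories are assumptions, not the challenge specification.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TTPFusionModel(nn.Module):
    """Late fusion of image, clinical, and report-text features."""
    def __init__(self, clin_dim=16, text_dim=768, num_categories=3):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()                 # expose 512-d image features
        self.image_encoder = cnn               # CXR replicated to 3 channels
        self.clin_mlp = nn.Sequential(nn.Linear(clin_dim, 64), nn.ReLU())
        self.text_mlp = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(512 + 64 + 64, num_categories)

    def forward(self, cxr, clinical, report_emb):
        z = torch.cat([self.image_encoder(cxr),    # (B, 512)
                       self.clin_mlp(clinical),    # (B, 64)
                       self.text_mlp(report_emb)], # (B, 64)
                      dim=1)
        return self.classifier(z)              # TTP category logits
```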
Object Detection and Image Segmentation
Color Assessment System using Object Detection for Plastic Components of Automobiles


The objective of this research is to evaluate the hue of plastic components in cars and identify discrepancies from the reference color established by the vehicle manufacturing facility. Many plastic components have contoured surfaces, which complicates positioning the color sensor close to the component for automated assessment. Moreover, because the color sensor is highly sensitive to its distance from the component's surface, measurements are conducted with a robotic arm capable of precise distance adjustments, ensuring stable, vibration-free color evaluation. To identify the component's area for color assessment, an Intel RealSense depth camera captures images, and the color image is processed with the Mask R-CNN framework, which segments the region of interest. Based on the segmented region, 3D coordinate values are extracted from the depth camera and transmitted to the robotic arm, enabling it to approach the plastic component precisely. Multiple positions within the segmented region are selected so that the hue of the component can be evaluated comprehensively. By combining Mask R-CNN segmentation with robotic precision, the proposed system improves the efficiency and accuracy of color inspections in automobile manufacturing facilities.
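The pixel-to-3D step can be sketched as follows, assuming pyrealsense2 and a binary mask produced by Mask R-CNN for the detected component; the within-region sampling strategy is an illustrative choice.

```python
import numpy as np
import pyrealsense2 as rs

def mask_to_3d_points(mask, depth_frame, num_samples=5):
    """Sample pixels inside the segmented region and deproject them to
    3D camera coordinates (meters) for the robot arm to approach."""
    intrin = depth_frame.profile.as_video_stream_profile().intrinsics
    ys, xs = np.nonzero(mask)                   # pixel indices inside the mask
    if len(xs) == 0:
        return []                               # nothing was segmented
    idx = np.linspace(0, len(xs) - 1, num_samples, dtype=int)
    points = []
    for x, y in zip(xs[idx], ys[idx]):
        depth = depth_frame.get_distance(int(x), int(y))
        if depth > 0:                           # skip invalid depth pixels
            points.append(rs.rs2_deproject_pixel_to_point(
                intrin, [float(x), float(y)], depth))
    return points                               # list of [X, Y, Z] in meters
```

Each returned 3D point corresponds to one measurement position the robotic arm visits within the segmented region.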