In this research paper, the authors aim to improve the quality of machine-generated American Sign Language (ASL) signs by developing a novel approach that combines vision and language models. The proposed method leverages pose and expression embeddings, transformers, and a back-translation network to generate more realistic and diverse ASL signs.
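As a rough illustration of this kind of pipeline (not the authors' exact architecture), the sketch below assumes a PyTorch-style encoder-decoder in which text tokens are encoded by a transformer and decoded into per-frame pose and facial-expression embeddings. All module names, dimensions, and the learned frame queries are hypothetical choices made for the example.

```python
# Minimal sketch of a text-to-sign transformer with separate pose and
# expression heads. Hypothetical layout; not the paper's implementation.
import torch
import torch.nn as nn

class TextToSignTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, pose_dim=137, expr_dim=50,
                 n_heads=4, n_layers=4, max_frames=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Learned per-frame queries that the decoder turns into sign frames.
        self.frame_queries = nn.Parameter(torch.randn(max_frames, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True)
        # Separate regression heads for body/hand pose and facial expression.
        self.pose_head = nn.Linear(d_model, pose_dim)
        self.expr_head = nn.Linear(d_model, expr_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer text tokens
        src = self.token_emb(token_ids)
        batch = token_ids.size(0)
        queries = self.frame_queries.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.transformer(src, queries)
        return self.pose_head(decoded), self.expr_head(decoded)

# Usage: generate pose/expression sequences for a toy batch of token ids.
model = TextToSignTransformer(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 12))
pose_seq, expr_seq = model(tokens)
print(pose_seq.shape, expr_seq.shape)  # (2, 128, 137), (2, 128, 50)
```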
The authors evaluate their approach using several metrics, including mean per vertex position error (MPVPE), mean per joint position error (MPJPE), dynamic time warping (DTW), Fréchet inception distance (FID), and recognition accuracy from a transformer-based back-translation network. The results show that the proposed method significantly outperforms existing state-of-the-art methods, reaching a top-1 accuracy of 40% and a top-2 accuracy of 80% in recognizing the generated ASL signs.
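For concreteness, the snippet below sketches how two of these metrics are commonly computed: MPJPE as the mean Euclidean distance between predicted and ground-truth joints (MPVPE is analogous, over mesh vertices), and top-k accuracy from a back-translation recognizer's class scores. The array shapes and use of NumPy are assumptions for illustration, not the paper's evaluation code.

```python
# Sketch of two evaluation metrics: MPJPE and top-k back-translation
# accuracy. Shapes and data here are illustrative only.
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: (frames, joints, 3) arrays of 3D joint positions.
    Returns the mean Euclidean distance over all frames and joints.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def top_k_accuracy(scores, labels, k=1):
    """Top-k accuracy of a back-translation recognizer.

    scores: (samples, classes) recognizer scores for each generated sign.
    labels: (samples,) ground-truth class indices.
    """
    top_k = np.argsort(scores, axis=-1)[:, -k:]      # k highest-scoring classes
    hits = (top_k == labels[:, None]).any(axis=-1)   # is the label among them?
    return hits.mean()

# Toy example with random data.
pred = np.random.randn(100, 55, 3)
gt = pred + 0.01 * np.random.randn(100, 55, 3)
scores = np.random.randn(20, 500)
labels = np.random.randint(0, 500, size=20)
print(mpjpe(pred, gt), top_k_accuracy(scores, labels, k=1))
```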
To further validate the approach, the authors conduct a user study with 15 ASL-fluent participants who rate the generated signs on their alignment with the text transcripts, as well as their fidelity and readability. The results show that the proposed method outperforms the baseline methods in terms of alignment and overall performance.
The authors also analyze the generated ASL signs and find that they are more diverse and realistic than those produced by existing methods, demonstrating that their approach can generate signing with varied styles, vocabularies, and sentence structures.
In conclusion, the proposed method has the potential to significantly improve the quality of machine-generated ASL signs, making it easier for deaf and hard-of-hearing individuals to communicate with hearing people. The quantitative results show that the approach generates more realistic and diverse ASL signs, and the user study confirms its advantage over existing methods.