Evaluation of Co-Speech Gesture Tracking Techniques in Naturalistic Interactions
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Hand gestures convey a significant portion of communicative meaning, making multimodal datasets essential for interaction research. However, annotating gestures remains a time-consuming and challenging task. To speed up the process, semi-automatic methods have been developed that identify segments containing hand movement for annotators to refine. These methods typically combine a pose estimation model with a rule-based or statistical movement detection algorithm, but most have been validated only on idealised, non-naturalistic datasets with minimal hand occlusion. We benchmark combinations of four pose estimation methods (OpenPose, MediaPipe, DeepLabCut, and Kinect) and two rule-based movement detection algorithms on two naturalistic, conversational datasets. The best pipelines combine the SPUDNIG displacement algorithm with OpenPose on MULTISIMO and with DeepLabCut on ECOLANG. These pipelines achieved Tversky scores of 0.57 on MULTISIMO and 0.65 on ECOLANG, with recall scores of 0.73 and 0.78, respectively. While off-the-shelf gesture detection systems can support annotation, performance remains limited on naturalistic data, and a careful camera setup minimising occlusions is essential.
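To make the pipeline concrete, the following is a minimal sketch of the two components the abstract names: a displacement-style movement detector (in the spirit of SPUDNIG, but with an illustrative pixel threshold, not SPUDNIG's actual parameters) applied to per-frame wrist keypoints, and a frame-level Tversky index for scoring detections against gold annotations. Function names, the threshold value, and the alpha/beta weights are assumptions for illustration only.

```python
import numpy as np

def detect_movement(keypoints, threshold=2.0):
    """Flag frames whose keypoint (e.g. a wrist from a pose estimator)
    moves more than `threshold` pixels since the previous frame.
    A displacement-based detector sketch; threshold is illustrative."""
    keypoints = np.asarray(keypoints, dtype=float)  # shape (frames, 2)
    disp = np.linalg.norm(np.diff(keypoints, axis=0), axis=1)
    moving = disp > threshold
    # The first frame has no predecessor, so it is marked as not moving.
    return np.concatenate([[False], moving])

def tversky(pred, gold, alpha=0.5, beta=0.5):
    """Frame-level Tversky index between predicted and gold movement
    masks. alpha weights false positives, beta false negatives;
    alpha = beta = 0.5 reduces to the Dice coefficient."""
    pred = np.asarray(pred, dtype=bool)
    gold = np.asarray(gold, dtype=bool)
    tp = np.sum(pred & gold)
    fp = np.sum(pred & ~gold)
    fn = np.sum(~pred & gold)
    return tp / (tp + alpha * fp + beta * fn)

# Toy example: five frames of a single wrist keypoint (x, y) in pixels.
pts = np.array([[0, 0], [0, 1], [0, 5], [0, 5], [10, 5]])
pred = detect_movement(pts)            # [F, F, T, F, T]
gold = [False, False, True, True, True]
score = tversky(pred, gold)
```

In practice the detector would run per hand over keypoints emitted by OpenPose, MediaPipe, DeepLabCut, or Kinect, with short gaps between flagged frames merged into candidate gesture segments before annotators refine them.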