End-to-End Learned Visual Odometry Based on Vision Transformer
Vyas, Aman Manishbhai (2024-07-30)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Commercial use is prohibited.
Open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2024080263484
Abstract
Estimating camera pose from the images of a single camera, a task known as monocular visual odometry, is fundamental to mobile robots and autonomous vehicles. Traditional approaches often rely on geometric methods that require significant engineering effort tailored to specific scenarios. Deep learning methods, which generalize well given extensive training data, have shown promising results. Recently, transformer-based architectures, which have been highly successful in natural language processing and computer vision, have proven superior for this task as well. In this study, we introduce a Vision Transformer (ViT) based model that leverages spatio-temporal self-attention to extract features from image sequences and estimate camera motion in an end-to-end manner.
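As a rough illustration of this kind of architecture, the minimal PyTorch-style sketch below stacks two consecutive frames, embeds them as patches, applies a transformer encoder with self-attention over the patch tokens, and regresses a 6-DoF relative pose. The module names, dimensions, and the stacked-frame input format are illustrative assumptions for this sketch, not the thesis implementation.

# Minimal sketch of an end-to-end ViT-style visual odometry model.
# Assumptions (not from the thesis): two consecutive RGB frames stacked
# channel-wise, 192x192 input, 16x16 patches, mean-pooled tokens, 6-DoF output.
import torch
import torch.nn as nn

class ViTOdometry(nn.Module):
    def __init__(self, img_size=192, patch=16, dim=256, depth=6, heads=8):
        super().__init__()
        # Patchify the 6-channel frame pair and linearly embed each patch.
        self.patch_embed = nn.Conv2d(6, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regress relative motion: 3 translation + 3 rotation components.
        self.head = nn.Linear(dim, 6)

    def forward(self, frame_pair):            # (B, 6, H, W)
        x = self.patch_embed(frame_pair)      # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (B, N, dim) patch tokens
        x = self.encoder(x + self.pos_embed)  # self-attention over patches
        return self.head(x.mean(dim=1))       # (B, 6) relative pose

# Usage: relative pose for a batch of stacked consecutive frames.
model = ViTOdometry()
pose = model(torch.randn(2, 6, 192, 192))     # -> torch.Size([2, 6])

In practice, the predicted relative poses would be composed frame by frame to recover the full trajectory, which is how odometry results are typically evaluated on KITTI sequences.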
Extensive experimentation on the KITTI visual odometry dataset demonstrates that ViT achieves state-of-the-art performance, surpassing both traditional geometry-based methods and existing deep learning approaches, including DeepVO, MagicVO, and PoseNet. This improvement underscores the effectiveness of transformer-based architectures in capturing the complex spatio-temporal dependencies essential for accurate visual odometry. Across five route trajectories with varying environmental conditions, ViT achieves up to an 8% reduction in translation error and a 4% reduction in rotation error compared to previous deep learning methods. These results highlight ViT's potential to enhance pose estimation in dynamic environments and to advance autonomous navigation technologies.