Equipped with Monocular Depth Estimation and Efficient and Accurate Visual Tracking via Cross-Attention Transformer for Human-Following System
Abstract:
With the advancement of technology, intelligent service robots have gradually been integrated into people's daily lives, serving as assistants in family life, medical and health care, entertainment, and social security.
In this paper, we propose a novel, efficient, and accurate transformer-based tracking framework named SiamCATR.
Considering the speed, robustness, and accuracy required in real applications, we design a concise tracking network with lightweight backbone variants and a simple feature fusion mechanism to achieve a good balance between speed and accuracy.
We also deploy a human-following tracking system with our proposed tracker.
To achieve human tracking using only RGB images, a monocular depth estimation network is combined with the tracking network.
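One way to picture how a depth map and a tracker box could be combined is sketched below. This is only an illustration under stated assumptions, not the method used in this work: the function name `target_distance`, the `(x, y, w, h)` box layout, and the choice of the median as the aggregation are all hypothetical, and the depth values are made-up numbers with no particular unit.

```python
from statistics import median


def target_distance(depth_map, box):
    """Estimate the target's distance as the median depth inside a box.

    depth_map: 2-D list (rows of floats), standing in for the output of a
               monocular depth network (assumed format).
    box:       (x, y, w, h) in pixels, standing in for the tracker's box
               (assumed format).
    Returns None if the box does not overlap the image.
    """
    x, y, w, h = box
    rows, cols = len(depth_map), len(depth_map[0])
    # Clip the box to the image so partially visible targets still work.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(cols, x + w), min(rows, y + h)
    if x0 >= x1 or y0 >= y1:
        return None
    values = [depth_map[r][c] for r in range(y0, y1) for c in range(x0, x1)]
    # The median is less sensitive than the mean to background pixels
    # that fall inside the box around the person.
    return median(values)


# Toy 4x4 depth map: a near object (about 2.0) in front of a far wall (5.0).
depth = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 2.1, 5.0],
    [5.0, 1.9, 2.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]
print(target_distance(depth, (1, 1, 2, 2)))  # 2.0
```

A follower could then compare this estimate against a desired following distance to decide whether to move forward or stop; that control step is outside the scope of the sketch.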
According to the experimental results, our proposed tracker achieves the best tracking results on numerous benchmarks and surpasses the second-best tracker by more than 5.9%.
Moreover, our tracker achieves real-time tracking on the embedded system. Finally, the human tracking network achieves an accuracy of 97.9% on our custom dataset.
The human-following system achieves 5.56 FPS and 14.4 FPS on NVIDIA Jetson AGX Xavier and NVIDIA GeForce RTX 2080Ti GPU, respectively.
Architecture: The proposed transformer-based tracking framework is shown in Figure 1.
Our framework is compact and consists of three components: a feature extraction backbone, a transformer-like feature fusion network, and a prediction head.
The backbone extracts the features of the template image and of the search region image.
Then, the transformer-like feature fusion network fuses both sets of features using cross-attention transformer modules (CATM) and channel-attention depthwise cross-correlation.
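The two fusion operations named above can be illustrated with a minimal pure-Python sketch. It is not the SiamCATR implementation: the function names are hypothetical, the attention has a single head and no learned projections, the channel-attention weighting is omitted, and nested lists stand in for the tensors a real network would use.

```python
import math


def depthwise_xcorr(search, template):
    """Correlate each search channel with the same-index template channel.

    search:   [C][Hs][Ws] nested lists.
    template: [C][Ht][Wt] nested lists, with Ht <= Hs and Wt <= Ws.
    Returns a [C][Hs-Ht+1][Ws-Wt+1] response; channels are never mixed.
    """
    out = []
    for s_ch, t_ch in zip(search, template):
        ht, wt = len(t_ch), len(t_ch[0])
        oh, ow = len(s_ch) - ht + 1, len(s_ch[0]) - wt + 1
        # Slide the template channel over the search channel.
        resp = [[sum(s_ch[i + u][j + v] * t_ch[u][v]
                     for u in range(ht) for v in range(wt))
                 for j in range(ow)] for i in range(oh)]
        out.append(resp)
    return out


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over token lists.

    queries come from one branch (e.g. search tokens); keys and values
    come from the other branch (e.g. template tokens).
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors.
        fused.append([sum(w * v[c] for w, v in zip(weights, values))
                      for c in range(len(values[0]))])
    return fused


# One-channel toy example: the response peaks where the template fits best.
search = [[[0, 0, 0],
           [0, 1, 1],
           [0, 1, 1]]]
template = [[[1, 1],
             [1, 1]]]
print(depthwise_xcorr(search, template))  # [[[1, 2], [2, 4]]]

# The query matches the first key, so the output is close to the first value.
fused = cross_attention([[1.0, 0.0]],
                        [[10.0, 0.0], [0.0, 10.0]],
                        [[1.0, 0.0], [0.0, 1.0]])
```

In the correlation example the largest response sits at the bottom-right position, which is where the block of ones in the search channel lines up with the template.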
Finally, the prediction head performs binary classification and bounding box regression on the fused features to predict the tracking result.
Implementation: The mobile robot with the proposed human-following system is shown in Figure 2.
Fig 1: The overall architecture
Fig 2: Mobile robot with the proposed human-following system
Made
by §õ«TÀM