In this paper, we propose PTT, a novel multi-scale temporal fusion method for improving 3D object detection in videos. Our approach combines features from multiple scales and time frames to increase detection accuracy. We evaluate the method on several benchmark datasets and show that it outperforms existing methods at a lower computational cost.
The main idea behind PTT is to leverage both long-term and short-term memory to capture the temporal context of an object: long-term memory retains information about objects over extended periods, while short-term memory tracks recent state changes. By combining the two, the model can more reliably identify objects and follow their motion through a video sequence.
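To make the memory fusion concrete, the sketch below shows one plausible way to combine a long-term and a short-term per-object feature with a learned gate. The abstract does not specify the fusion operator, so the class name TemporalMemoryFusion, the gating design, and the feature shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalMemoryFusion(nn.Module):
    # Hypothetical sketch: the paper combines long-term memory (object
    # history over many frames) with short-term memory (recent state
    # changes) but does not give the operator; a gated blend is one
    # plausible instantiation.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, long_term: torch.Tensor, short_term: torch.Tensor) -> torch.Tensor:
        # long_term, short_term: (batch, dim) per-object feature vectors.
        g = self.gate(torch.cat([long_term, short_term], dim=-1))
        # The gate decides, per channel, how much to trust accumulated
        # history versus the most recent observations.
        return g * long_term + (1.0 - g) * short_term
```

With dim=256, for example, the module maps two (batch, 256) memory vectors to a single fused (batch, 256) feature that downstream detection heads can consume.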
To implement PTT, we combine convolutional layers with attention mechanisms: the convolutional layers extract features across scales and time frames, and the attention mechanism concentrates computation on the most relevant of those features. We also introduce a new technique called "Self AttnMax Pooling" that improves the efficiency of the attention mechanism.
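The abstract names but does not define "Self AttnMax Pooling"; one natural reading is self-attention over the multi-scale, multi-frame tokens followed by a max pool that collapses the token sequence to a single descriptor. The sketch below follows that reading; the class name SelfAttnMaxPool and the use of torch.nn.MultiheadAttention are assumptions made for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn

class SelfAttnMaxPool(nn.Module):
    # Illustrative reading of "Self AttnMax Pooling": self-attention over
    # the token sequence, then max pooling across tokens so that later
    # stages process one vector instead of the full sequence.
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim), e.g. features gathered from
        # several scales and time frames.
        attended, _ = self.attn(tokens, tokens, tokens)
        # Max pooling keeps the strongest response per channel, cutting
        # the cost of everything downstream from O(seq_len) to O(1).
        return attended.max(dim=1).values
```

Max pooling is used here because it is the cheapest permutation-invariant reduction over the token axis; the paper's actual pooling operator may differ.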
Our experimental results show that PTT outperforms existing methods in both accuracy and speed. Specifically, we achieve an average improvement of 3.2% in 3D mAPH (mean average precision weighted by heading) over the state-of-the-art method. In addition, our method reduces computational cost by a factor of 4 while maintaining comparable performance.
In summary, PTT fuses features from multiple scales and time frames to improve 3D object detection in videos. It achieves better accuracy than existing methods at a reduced computational cost, which makes it an attractive choice for real-world applications.