Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10660097
We propose V-APT, a method that adds a lightweight LSTM block to model temporal relationships across frames before average temporal pooling. It also includes a deep visual and text prompt interaction block, in which text prompts guide the visual prompts so that the vision and language branches stay coordinated.
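Until the official code is released, the two ideas above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names `TemporalLSTMPool` and `TextToVisualPrompt`, the hidden size, the residual connection, and the use of a single linear layer to map text prompts to visual prompts are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalLSTMPool(nn.Module):
    """Hypothetical block: an LSTM over per-frame features, then average temporal pooling."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # A small single-layer LSTM keeps the added parameter count low (assumed sizes).
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim), e.g. per-frame image-encoder embeddings.
        out, _ = self.lstm(frames)            # (batch, time, hidden)
        frames = frames + self.proj(out)      # residual connection (an assumption)
        return frames.mean(dim=1)             # average temporal pooling -> (batch, dim)


class TextToVisualPrompt(nn.Module):
    """Hypothetical coupling: derive visual prompts from learnable text prompts."""

    def __init__(self, text_dim: int, visual_dim: int):
        super().__init__()
        # One linear map per layer is assumed here; the paper's block may differ.
        self.couple = nn.Linear(text_dim, visual_dim)

    def forward(self, text_prompts: torch.Tensor) -> torch.Tensor:
        # text_prompts: (num_prompts, text_dim) -> (num_prompts, visual_dim)
        return self.couple(text_prompts)


if __name__ == "__main__":
    video = torch.randn(2, 8, 512)            # 2 clips, 8 frames, 512-d features
    pooled = TemporalLSTMPool(dim=512)(video)
    print(pooled.shape)                        # one video-level feature per clip

    text_prompts = nn.Parameter(torch.randn(4, 512))
    visual_prompts = TextToVisualPrompt(512, 768)(text_prompts)
    print(visual_prompts.shape)                # visual prompts derived from text prompts
```

In this sketch the visual prompts are a function of the text prompts, so gradients from the visual branch also update the text prompts. That is one simple way to keep the two modalities coupled.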
The experimental code and data will be released after the paper is accepted. Please stay tuned.
This research is supported by the Tianjin University of Technology Postgraduate Scientific Research Innovation Project (No. YJ2390). We thank Tianjin University of Technology for its funding and the Technical College for the Deaf for providing resources.