MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval (ECCV 2022)
We visualize the self-attention map from the video encoder by computing the self-attention of the [CLS] token in the last block. Our pre-trained model pays high attention to the significant local regions in the video.
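As a reference, the sketch below shows one way such a [CLS] attention map could be extracted with PyTorch; the `return_attention` flag, the 14x14 patch grid, and the tensor shapes are illustrative assumptions rather than the repo's actual interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cls_attention_map(video_encoder, video, patch_grid=(14, 14)):
    """video: (B, T, C, H, W) frames; returns per-frame [CLS] attention maps."""
    # Assumes the encoder can return the per-head attention of its last block,
    # e.g. feats, attn = video_encoder(video, return_attention=True), with
    # attn of shape (B, heads, N, N); this flag is an assumption, not the repo's API.
    _, attn = video_encoder(video, return_attention=True)
    attn = attn.mean(dim=1)              # average over heads -> (B, N, N)
    cls_to_patches = attn[:, 0, 1:]      # [CLS] row, excluding the [CLS] column
    h, w = patch_grid
    maps = cls_to_patches.reshape(video.shape[0], -1, h, w)  # (B, T, h, w)
    # Upsample to the input resolution for overlaying on the frames.
    return F.interpolate(maps, size=video.shape[-2:], mode="bilinear",
                         align_corners=False)
```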
We visualize the cross-modality alignment between text and video tokens by calculating the similarity map between features embedded by the text encoder and the video encoder. Our pre-trained model accurately aligns words with their corresponding visual regions.
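A minimal sketch of this word-to-patch similarity map is given below, assuming the two encoders return per-token features in a shared embedding space; the function name, shapes, and patch grid are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_video_similarity(text_feats, video_feats, patch_grid=(14, 14)):
    """text_feats: (L, D) word embeddings; video_feats: (N, D) patch embeddings
    from one frame. Returns an (L, h, w) similarity map per word."""
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)
    sim = text_feats @ video_feats.t()   # (L, N) cosine similarities
    h, w = patch_grid
    return sim.reshape(text_feats.shape[0], h, w)
```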
Our pre-trained model can be downloaded from Pre-trained Model; it contains the weights of both the Video Encoder and the Text Encoder.
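To inspect the downloaded checkpoint, a hedged sketch is below; the file name and the "video_model."/"text_model." key prefixes are assumptions about the state-dict layout, so adapt them to the actual checkpoint.

```python
import torch

ckpt = torch.load("MILES.pth", map_location="cpu")   # path is illustrative
state_dict = ckpt.get("state_dict", ckpt)

video_weights = {k: v for k, v in state_dict.items() if k.startswith("video_model.")}
text_weights = {k: v for k, v in state_dict.items() if k.startswith("text_model.")}
print(len(video_weights), "video-encoder tensors,", len(text_weights), "text-encoder tensors")
```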
Our video encoder is exactly the same as that of Frozen: a stack of divided space-time self-attention blocks. Compared with the video encoder of MCQ, the video encoder of MILES adds temporal attention to enable reasoning among the visible regions along the temporal dimension for masked video modeling.
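For intuition, the following is a minimal sketch of a divided space-time self-attention block (temporal attention followed by spatial attention); it is an illustration under simplified assumptions (no [CLS] token, hypothetical class name), not the repo's implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, T, N):
        """x: (B, T*N, D) patch tokens for T frames with N patches each,
        ordered frame-major; [CLS] handling is omitted for brevity."""
        B, _, D = x.shape
        # Temporal attention: each spatial position attends across frames.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = self.attn_t(self.norm_t(xt), self.norm_t(xt), self.norm_t(xt),
                         need_weights=False)[0]
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        # Spatial attention: each frame's patches attend to one another.
        xs = self.norm_s(x.reshape(B * T, N, D))
        xs = self.attn_s(xs, xs, xs, need_weights=False)[0]
        x = x + xs.reshape(B, T * N, D)
        # Feed-forward.
        return x + self.mlp(self.norm_mlp(x))
```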
- Download our pre-trained model from Pre-trained Model.
- Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_MILES.json" (see the config sketch after the command below).
bash sctripts/test_retrieval_MILES.sh
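Before running the script above, the checkpoint path has to be set in the config; the sketch below edits the JSON programmatically, where the "arch"/"args"/"load_checkpoint" key names are assumptions about the config layout and should be adapted to the actual file.

```python
import json

cfg_path = "configs/zero_msrvtt_4f_i21k_MILES.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# Point the model at the downloaded checkpoint (key names are assumptions).
cfg.setdefault("arch", {}).setdefault("args", {})["load_checkpoint"] = "MILES.pth"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```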
Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" (https://github.com/m-bain/frozen-in-time.git).
If our code is helpful to your work, please cite:
@article{ge2022miles,
  title={MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Wang, Alex Jinpeng and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2204.12408},
  year={2022}
}