# MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval (ECCV 2022)

Paper | Pre-trained Model

## Main Results on Downstream Tasks

### Text-to-video Retrieval on MSR-VTT


### Text-to-video Retrieval on MSVD, LSMDC and DiDeMo


## Visualization

### Local Visual Semantics Capture

We visualize the self-attention map from the video encoder by computing the self-attention of the [CLS] token in the last block. Our pre-trained model attends strongly to the significant local regions in the video.
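Such a map can be reproduced from the attention weights alone. Below is a minimal sketch, assuming the final block's softmax-ed attention weights have been captured (e.g., with a forward hook) and that token 0 is [CLS]; the function name and tensor shapes are illustrative, not the repo's actual API:

```python
import torch

def cls_attention_map(attn_weights, num_patches_per_frame):
    """Average the [CLS] token's attention over heads in the last block.

    attn_weights: (batch, heads, tokens, tokens) softmax-ed attention
    from the final self-attention block; token 0 is assumed to be [CLS].
    Returns one spatial map per frame for overlay on the video.
    """
    # Attention paid by [CLS] (query 0) to every patch token, averaged over heads.
    cls_attn = attn_weights[:, :, 0, 1:].mean(dim=1)          # (batch, tokens - 1)
    b = cls_attn.shape[0]
    # Group patch tokens frame by frame so each frame gets its own heat map.
    per_frame = cls_attn.reshape(b, -1, num_patches_per_frame)
    # Normalize each frame's map to [0, 1] for visualization.
    per_frame = per_frame - per_frame.amin(dim=-1, keepdim=True)
    per_frame = per_frame / per_frame.amax(dim=-1, keepdim=True).clamp(min=1e-6)
    return per_frame                                           # (batch, frames, patches)
```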


### Fine-grained Video-text Alignment

We visualize the cross-modal alignment between text and video tokens by computing a similarity map between the token features produced by the text encoder and the video encoder. Our pre-trained model accurately aligns words with their corresponding visual regions.
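Concretely, with L2-normalized token features from the two encoders, the map is just a word-by-patch cosine-similarity matrix. A minimal sketch (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity(text_tokens, video_tokens):
    """Token-level similarity map between text and video features.

    text_tokens:  (num_words, dim)   word embeddings from the text encoder
    video_tokens: (num_patches, dim) patch embeddings from the video encoder
    Returns a (num_words, num_patches) cosine-similarity map; the row for a
    word highlights the visual regions it aligns with.
    """
    text_tokens = F.normalize(text_tokens, dim=-1)
    video_tokens = F.normalize(video_tokens, dim=-1)
    return text_tokens @ video_tokens.T
```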


## Pre-trained Model

Our pre-trained model can be downloaded via the Pre-trained Model link above; the checkpoint contains the weights of both the Video Encoder and the Text Encoder.
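A hedged sketch of inspecting the downloaded checkpoint in PyTorch follows; the `state_dict` nesting and the `video_model`/`text_model` key prefixes are assumptions carried over from the Frozen in Time codebase this repo builds on, so verify them against the actual file:

```python
import torch

# Path to the downloaded checkpoint (adjust to wherever you saved it).
ckpt = torch.load("MILES.pth", map_location="cpu")

# Frozen-style checkpoints often nest the weights under 'state_dict';
# fall back to the top level if this one does not (an assumption).
state = ckpt.get("state_dict", ckpt)

# Split the weights by encoder. The prefixes below are assumptions,
# not guaranteed key names in the released checkpoint.
video_weights = {k: v for k, v in state.items() if "video_model" in k}
text_weights = {k: v for k, v in state.items() if "text_model" in k}
print(f"{len(video_weights)} video-encoder tensors, {len(text_weights)} text-encoder tensors")
```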

### Video Encoder

Our video encoder is identical to that of Frozen, consisting of a stack of divided space-time self-attention blocks. Compared to the video encoder of MCQ, the video encoder of MILES adds temporal attention, enabling reasoning among the visible regions along the temporal dimension for masked video modeling.
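For reference, here is a condensed sketch of one divided space-time block in the TimeSformer style that Frozen adopts: temporal self-attention across frames at each spatial location, followed by spatial self-attention within each frame. The module names and the omission of the [CLS] token are simplifications, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
from einops import rearrange

class DividedSpaceTimeBlock(nn.Module):
    """One transformer block with divided space-time self-attention."""

    def __init__(self, dim=768, heads=12, frames=4, patches=196):
        super().__init__()
        self.frames, self.patches = frames, patches
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        # x: (batch, frames * patches, dim); [CLS] handling omitted for brevity.
        b = x.shape[0]
        # Temporal attention: each spatial location attends across frames.
        t = rearrange(x, "b (f p) d -> (b p) f d", f=self.frames, p=self.patches)
        h = self.norm_t(t)
        t = self.attn_t(h, h, h, need_weights=False)[0]
        x = x + rearrange(t, "(b p) f d -> b (f p) d", b=b, p=self.patches)
        # Spatial attention: each frame's patches attend to one another.
        s = rearrange(x, "b (f p) d -> (b f) p d", f=self.frames, p=self.patches)
        h = self.norm_s(s)
        s = self.attn_s(h, h, h, need_weights=False)[0]
        x = x + rearrange(s, "(b f) p d -> b (f p) d", b=b, f=self.frames)
        return x + self.mlp(self.norm_mlp(x))
```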

## Downstream Retrieval (Zero-shot on MSR-VTT)

1. Download our pre-trained model via the Pre-trained Model link above.

2. Set the path to the pre-trained model in `configs/zero_msrvtt_4f_i21k_MILES.json`, then run:

   ```bash
   bash sctripts/test_retrieval_MILES.sh
   ```

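For intuition, the zero-shot evaluation boils down to ranking every candidate video by the similarity of its pooled embedding to each caption's embedding. A minimal sketch of that computation (the paired-data layout and Recall@K bookkeeping here are illustrative, not the script's exact logic):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recall_at_k(text_emb, video_emb, k=1):
    """Text-to-video Recall@K with paired rows (text i matches video i).

    text_emb, video_emb: (num_pairs, dim) pooled outputs of the two encoders.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.T                     # (num_pairs, num_pairs)
    topk = sims.topk(k, dim=-1).indices               # top-K video indices per query
    targets = torch.arange(sims.shape[0], device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```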

## Acknowledgement

Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" (https://github.com/m-bain/frozen-in-time.git).

## Citation

If our code is helpful to your work, please cite:

```bibtex
@article{ge2022miles,
  title={MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Wang, Alex Jinpeng and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2204.12408},
  year={2022}
}
```