MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval (ECCV 2022)
We visualize the self-attention map from the video encoder by computing the self-attention of the [CLS] token in the last block. Our pre-trained model pays high attention to the significant local regions in the video.
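As a reference, the sketch below shows one way such a [CLS] attention map could be extracted with PyTorch; the `return_attention` flag, the 14x14 patch grid, and the tensor shapes are illustrative assumptions rather than the repo's actual interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cls_attention_map(video_encoder, video, patch_grid=(14, 14)):
    """video: (B, T, C, H, W) frames; returns per-frame [CLS] attention maps."""
    # Assumes the encoder can return the per-head attention of its last block,
    # e.g. feats, attn = video_encoder(video, return_attention=True), with
    # attn of shape (B, heads, N, N); this flag is an assumption, not the repo's API.
    _, attn = video_encoder(video, return_attention=True)
    attn = attn.mean(dim=1)              # average over heads -> (B, N, N)
    cls_to_patches = attn[:, 0, 1:]      # [CLS] row, excluding the [CLS] column
    h, w = patch_grid
    maps = cls_to_patches.reshape(video.shape[0], -1, h, w)  # (B, T, h, w)
    # Upsample to the input resolution for overlaying on the frames.
    return F.interpolate(maps, size=video.shape[-2:], mode="bilinear",
                         align_corners=False)
```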
We visualize the cross-modality alignment between text and video tokens by calculating the similarity map between features embedded by the text encoder and the video encoder. Our pre-trained model accurately aligns words with their corresponding visual regions.
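A minimal sketch of this word-to-patch similarity map is given below, assuming the two encoders return per-token features in a shared embedding space; the function name, shapes, and patch grid are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def text_video_similarity(text_feats, video_feats, patch_grid=(14, 14)):
    """text_feats: (L, D) word embeddings; video_feats: (N, D) patch embeddings
    from one frame. Returns an (L, h, w) similarity map per word."""
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)
    sim = text_feats @ video_feats.t()   # (L, N) cosine similarities
    h, w = patch_grid
    return sim.reshape(text_feats.shape[0], h, w)
```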
Our pre-trained model can be downloaded from Pre-trained Model; it contains the weights of both the Video Encoder and the Text Encoder.
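To inspect the downloaded checkpoint, a hedged sketch is below; the file name and the "video_model."/"text_model." key prefixes are assumptions about the state-dict layout, so adapt them to the actual checkpoint.

```python
import torch

ckpt = torch.load("MILES.pth", map_location="cpu")   # path is illustrative
state_dict = ckpt.get("state_dict", ckpt)

video_weights = {k: v for k, v in state_dict.items() if k.startswith("video_model.")}
text_weights = {k: v for k, v in state_dict.items() if k.startswith("text_model.")}
print(len(video_weights), "video-encoder tensors,", len(text_weights), "text-encoder tensors")
```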
Our video encoder is exactly the same as that of Frozen: a stack of divided space-time self-attention blocks. Compared with the video encoder of MCQ, the video encoder of MILES adds temporal attention to enable reasoning among the visible regions along the temporal dimension for masked video modeling.
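For intuition, the following is a minimal sketch of a divided space-time self-attention block (temporal attention followed by spatial attention); it is an illustration under simplified assumptions (no [CLS] token, hypothetical class name), not the repo's implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, T, N):
        """x: (B, T*N, D) patch tokens for T frames with N patches each,
        ordered frame-major; [CLS] handling is omitted for brevity."""
        B, _, D = x.shape
        # Temporal attention: each spatial position attends across frames.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = self.attn_t(self.norm_t(xt), self.norm_t(xt), self.norm_t(xt),
                         need_weights=False)[0]
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        # Spatial attention: each frame's patches attend to one another.
        xs = self.norm_s(x.reshape(B * T, N, D))
        xs = self.attn_s(xs, xs, xs, need_weights=False)[0]
        x = x + xs.reshape(B, T * N, D)
        # Feed-forward.
        return x + self.mlp(self.norm_mlp(x))
```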
- Download our pre-trained model from Pre-trained Model.
- Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_MILES.json" (see the config sketch after the command below).
bash sctripts/test_retrieval_MILES.sh
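Before running the script above, the checkpoint path has to be set in the config; the sketch below edits the JSON programmatically, where the "arch"/"args"/"load_checkpoint" key names are assumptions about the config layout and should be adapted to the actual file.

```python
import json

cfg_path = "configs/zero_msrvtt_4f_i21k_MILES.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# Point the model at the downloaded checkpoint (key names are assumptions).
cfg.setdefault("arch", {}).setdefault("args", {})["load_checkpoint"] = "MILES.pth"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```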
Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" (https://github.com/m-bain/frozen-in-time.git).
If our code is helpful to your work, please cite:
@article{ge2022miles,
  title={MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Wang, Alex Jinpeng and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2204.12408},
  year={2022}
}