Abstract

This work proposes a framework that uses a convolution-free Transformer architecture to learn multimodal representations from unlabeled data. Specifically, the proposed Video-Audio-Text Transformer (VATT) takes raw signals as input and outputs multimodal representations rich enough to benefit a variety of downstream tasks. VATT is trained end-to-end with multimodal contrastive losses, and its performance is evaluated on downstream tasks: video action recognition, audio event classification, image classification, and text-to-video retrieval.

Introduction

Convolutional Neural Networks (CNNs) in Computer Vision community
- CNNs are widely used across computer vision tasks and have proven to be an effective architecture for visual data
- The convolution operation gives them a strong inductive bias
Paradigm shift in Natural Language Processing (NLP) community
- NLP has recently been shifting from models with built-in inductive biases (e.g. CNNs, RNNs) toward general architectures based on self-attention
- In particular, the Transformer has become the most widely used model in the NLP domain
- Pre-training Transformers on large text corpora has achieved SOTA performance on many downstream tasks
Transformer in other domains
- Following the success of Transformers in NLP, a number of studies have applied Transformers to computer vision
  - Convolutions + Attention module
  - Convolution-free architectures
    - Surpass the performance of CNNs
    - This confirms that training on large-scale labeled data can trump inductive bias
Transformer with supervised training
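- However, Transformers trained with supervised learning have two problems
  1. They cannot exploit the vast amount of unstructured and unlabeled visual data
  2. Data labeling is costly and time-consuming, which makes the approach impractical for real applications
⇒ The goal of this work is therefore to study how to make use of large amounts of unlabeled visual data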
Overview of the VATT architecture
- Nearly identical to the BERT and ViT architectures, except for the modality-specific tokenization and linear projection layers
- Modality-Specific or Modality-Agnostic Transformer
  - Modality-Specific
    - A separate Transformer is trained for each modality
  - Modality-Agnostic
    - A single Transformer is trained for all modalities (shared weights)
- DropToken
  - Reduces computational complexity by randomly dropping a subset of tokens
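
The DropToken idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `drop_token`, the `drop_rate` parameter, and the choice to preserve the original token order are all hypothetical. Because self-attention cost grows quadratically with sequence length, keeping half of the tokens cuts the attention cost to roughly a quarter.

```python
import random


def drop_token(tokens, drop_rate=0.5, rng=None):
    """Randomly keep a subset of tokens, preserving their original order.

    Hypothetical helper for illustration; `drop_rate` is the fraction of
    tokens to discard and is an assumed parameter name.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Always keep at least one token so the sequence is never empty.
    num_keep = max(1, int(len(tokens) * (1.0 - drop_rate)))
    # Sample indices without replacement, then sort to keep token order.
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx


if __name__ == "__main__":
    patches = [f"patch_{i}" for i in range(8)]
    kept, idx = drop_token(patches, drop_rate=0.5, rng=random.Random(0))
    print(len(kept), idx)
```

Returning the kept indices alongside the tokens lets the caller look up the matching positional embeddings for the surviving tokens.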

Approach

Each modality's data is projected through a tokenization layer to produce embeddings
Two settings exist: a separate backbone Transformer for each modality, and a single backbone Transformer shared across all modalities

Tokenization and Positional Encoding

https://www.youtube.com/watch?v=rgXxAFIBido&t=585s

Vision-modality
- 3-channel RGB pixels of video frames
Audio-modality
- Air density amplitudes (waveforms)
Text-modality
- Sequence of words


Partition an entire video clip of size $T \times H \times W$ into a sequence of $\lceil T/t \rceil \cdot \lceil H/h \rceil \cdot \lceil W/w \rceil$ patches, where $t \times h \times w$ is the patch size
Positional embeddings encode the position of each of the $\lceil T/t \rceil \cdot \lceil H/h \rceil \cdot \lceil W/w \rceil$ patches in a video clip
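
As a rough check of the patch arithmetic, the sketch below computes the patch grid and enumerates one position index per patch. It assumes separate patch sizes `t`, `h`, `w` along time, height, and width; the function names and the example clip and patch sizes are illustrative choices, not values taken from these notes.

```python
import math


def video_patch_grid(T, H, W, t, h, w):
    """Number of patches along (time, height, width); ceil covers partial patches."""
    return (math.ceil(T / t), math.ceil(H / h), math.ceil(W / w))


def patch_positions(grid):
    """Enumerate the (i, j, k) index of every patch; each index needs a positional embedding."""
    n_t, n_h, n_w = grid
    return [(i, j, k) for i in range(n_t) for j in range(n_h) for k in range(n_w)]


if __name__ == "__main__":
    # Assumed example: a 32-frame 224x224 clip cut into 4x16x16 patches.
    grid = video_patch_grid(T=32, H=224, W=224, t=4, h=16, w=16)
    positions = patch_positions(grid)
    print(grid, len(positions))  # (8, 14, 14) 1568
```

With these example sizes the clip yields 8 · 14 · 14 = 1568 patches, which is the sequence length the Transformer would see before any token dropping.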