Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

Abstract

A common approach to processing long videos is to apply a short-form video model over uniformly sampled clips of fixed temporal length and aggregate the outputs. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. We formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
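At its core, KTS detects change points by minimizing the within-segment variance of frame features in a kernel space using dynamic programming. The sketch below is a minimal, self-contained version of vanilla KTS (Potapov et al., 2014), not the paper's exact implementation; the linear kernel, the fixed segment count, and the names `kts_change_points`, `features`, and `n_segments` are illustrative assumptions.

```python
import numpy as np

def kts_change_points(features, n_segments):
    """Split a (T, d) feature sequence into n_segments pieces by
    minimizing within-segment kernel variance (vanilla KTS sketch)."""
    T = features.shape[0]
    K = features @ features.T  # linear kernel (Gram matrix); assumption
    diag = np.diag(K)

    # Cumulative sums let us evaluate any segment's scatter in O(1).
    cum_diag = np.concatenate([[0.0], np.cumsum(diag)])
    cum_K = np.zeros((T + 1, T + 1))
    cum_K[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

    def scatter(i, j):
        # Within-segment variance of frames [i, j).
        n = j - i
        sum_diag = cum_diag[j] - cum_diag[i]
        sum_block = cum_K[j, j] - cum_K[i, j] - cum_K[j, i] + cum_K[i, i]
        return sum_diag - sum_block / n

    # dp[k][j]: best cost of splitting frames [0, j) into k segments.
    dp = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, T + 1):
            for t in range(k - 1, j):
                cost = dp[k - 1][t] + scatter(t, j)
                if cost < dp[k][j]:
                    dp[k][j] = cost
                    back[k][j] = t

    # Backtrack to recover segment boundaries (change points).
    cps, j = [], T
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        cps.append(j)
    return sorted(cps)[1:]  # drop the leading 0

# Example: segment 200 random 64-dim frame features into 5 clips.
cps = kts_change_points(np.random.rand(200, 64), n_segments=5)
```

The resulting change points yield variable-length clips aligned with content changes, which can then replace uniformly sampled fixed-length clips as inputs to a short-form video model. The exhaustive dynamic program shown here is O(n_segments * T^2); practical implementations typically restrict the maximum segment length for scalability.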

Publication
ICCVW