TY  - JOUR
TI  - Feature distillation from vision-language model for semisupervised action classification
AB  - The training of supervised machine learning approaches is critically dependent on annotating large-scale datasets. Semisupervised learning approaches aim to achieve performance comparable to supervised methods using relatively little annotation, without sacrificing good generalization capacity. In line with this objective, ways of leveraging unlabeled data have been the subject of intense research. However, semisupervised video action recognition has received relatively less attention than its image-domain counterparts. Existing semisupervised video action recognition methods trained from scratch rely heavily on augmentation techniques, complex architectures, and/or the use of other modalities, while distillation-based methods use models that have only been trained for 2D computer vision tasks. In another line of work, pretrained vision-language models have shown very promising results for generating general-purpose visual features, with reports of high zero-shot performance on many downstream tasks. In this work, we exploit a language-supervised visual encoder for learning video representations for video action classification tasks. We propose a teacher-student learning paradigm based on feature distillation and pseudo-labeling. Our experimental results are a proof of concept showing that multimodal feature extractors can be utilized for spatiotemporal feature extraction in a semisupervised learning context and achieve performance comparable to SOTA methods, especially in a low-label regime.
AU  - Çelik, Aslı
AU  - Küçükmanisa, Ayhan
AU  - Urhan, Oğuzhan
DO  - 10.55730/1300-0632.4038
PY  - 2023
JO  - Turkish Journal of Electrical Engineering and Computer Sciences
VL  - 31
IS  - SI-1 (6)
SN  - 1300-0632
SP  - 1129
EP  - 1145
DB  - TRDizin
UR  - http://search/yayin/detay/1208582
ER  -