TY  - JOUR
TI  - Feature distillation from vision-language model for semisupervised action classification
AB  - The training of supervised machine learning approaches is critically dependent on annotating large-scale datasets. Semisupervised learning approaches aim to achieve compatible performance with supervised methods using relatively less annotation without sacrificing good generalization capacity. In line with this objective, ways of leveraging unlabeled data have been the subject of intense research. However, semisupervised video action recognition has received relatively less attention compared to image domain implementations. Existing semisupervised video action recognition methods trained from scratch rely heavily on augmentation techniques, complex architectures, and/or the use of other modalities while distillation-based methods use models that have only been trained for 2D computer vision tasks. In another line of work, pretrained vision-language models have shown very promising results for generating general-purpose visual features with reports of high zero-shot performance for many downstream tasks. In this work, we exploit a language-supervised visual encoder for learning video representations for video action classification tasks. We propose a teacher-student learning paradigm through feature distillation and pseudo-labeling. Our experimental results are a proof-of-concept revealing that multimodal feature extractors can be utilized for spatiotemporal feature extraction in a semisupervised learning context and show compatible performance with SOTA methods, especially in a low-label regime.
AU  - Çelik, Aslı
AU  - Küçükmanisa, Ayhan
AU  - Urhan, Oğuzhan
DO  - 10.55730/1300-0632.4038
PY  - 2023
JO  - Turkish Journal of Electrical Engineering and Computer Sciences
VL  - 31
IS  - SI-1 (6)
SN  - 1300-0632
SP  - 1129
EP  - 1145
DB  - TRDizin
UR  - http://search/yayin/detay/1208582
ER  - 