Year: 2023 | Volume: 31 | Issue: SI-1 (6) | Page Range: 1129-1145 | Text Language: English | DOI: 10.55730/1300-0632.4038 | Index Date: 22-11-2023

Feature distillation from vision-language model for semisupervised action classification

Abstract:
The training of supervised machine learning approaches is critically dependent on annotating large-scale datasets. Semisupervised learning approaches aim to achieve performance comparable to supervised methods using relatively little annotation, without sacrificing generalization capacity. In line with this objective, ways of leveraging unlabeled data have been the subject of intense research. However, semisupervised video action recognition has received relatively little attention compared to image-domain implementations. Existing semisupervised video action recognition methods trained from scratch rely heavily on augmentation techniques, complex architectures, and/or the use of other modalities, while distillation-based methods use models that have only been trained for 2D computer vision tasks. In another line of work, pretrained vision-language models have shown very promising results for generating general-purpose visual features, with reports of high zero-shot performance on many downstream tasks. In this work, we exploit a language-supervised visual encoder to learn video representations for video action classification tasks. We propose a teacher-student learning paradigm based on feature distillation and pseudo-labeling. Our experimental results serve as a proof of concept: they reveal that multimodal feature extractors can be utilized for spatiotemporal feature extraction in a semisupervised learning context, and they show performance comparable to state-of-the-art (SOTA) methods, especially in the low-label regime.
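The abstract describes a teacher-student scheme in which a frozen, language-supervised image encoder (such as the visual branch of a pretrained vision-language model like CLIP) supplies target features that a spatiotemporal student network regresses on unlabeled clips, combined with pseudo-labeling of confident predictions. The sketch below is a minimal PyTorch illustration of that general recipe, not the authors' implementation: the toy architectures, the temporal average pooling of teacher features, the cosine distillation loss, the confidence threshold tau, and the loss weights are all assumptions introduced for illustration.

```python
# Minimal sketch of the teacher-student recipe outlined in the abstract:
# a frozen language-supervised image encoder distills per-frame features
# into a spatiotemporal student, while confident student predictions on
# unlabeled clips are reused as pseudo-labels (FixMatch-style).
# All architectures, thresholds, and loss weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTeacher(nn.Module):
    """Stand-in for a pretrained vision-language image encoder (frozen)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        for p in self.parameters():
            p.requires_grad = False  # teacher weights are never updated

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1))     # per-frame features
        return f.view(b, t, -1).mean(dim=1)         # temporal average pooling

class Student(nn.Module):
    """Spatiotemporal student: 3D conv trunk + feature head + classifier."""
    def __init__(self, feat_dim=512, num_classes=101):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.feat_head = nn.Linear(32, feat_dim)    # matched to teacher dim
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):                       # clips: (B, 3, T, H, W)
        feat = self.feat_head(self.trunk(clips))
        return feat, self.classifier(feat)

def semi_supervised_step(student, teacher, labeled, labels, unlabeled,
                         tau=0.95, w_distill=1.0, w_pseudo=1.0):
    """Supervised CE on labeled clips + feature distillation and
    confidence-masked pseudo-label CE on unlabeled clips."""
    _, logits_l = student(labeled)
    loss_sup = F.cross_entropy(logits_l, labels)

    feat_u, logits_u = student(unlabeled)
    with torch.no_grad():
        target = teacher(unlabeled.transpose(1, 2))  # to (B, T, 3, H, W)
    # cosine feature distillation (MSE would be an equally plausible choice)
    loss_distill = (1 - F.cosine_similarity(feat_u, target, dim=-1)).mean()

    # pseudo-labels: keep only predictions above the confidence threshold
    probs = logits_u.softmax(dim=-1).detach()
    conf, pseudo = probs.max(dim=-1)
    mask = (conf >= tau).float()
    loss_pseudo = (F.cross_entropy(logits_u, pseudo,
                                   reduction="none") * mask).mean()

    return loss_sup + w_distill * loss_distill + w_pseudo * loss_pseudo

# usage sketch with random tensors (B=2, T=8, 112x112 frames):
# x_l = torch.randn(2, 3, 8, 112, 112); y = torch.randint(0, 101, (2,))
# x_u = torch.randn(2, 3, 8, 112, 112)
# loss = semi_supervised_step(Student(), FrozenTeacher(), x_l, y, x_u)
```

In practice the frozen teacher would presumably be the visual encoder of the pretrained vision-language model and the student a standard video backbone; keeping the teacher frozen means only per-frame feature extraction is required from the 2D multimodal model, which is what makes transferring its features to the video domain cheap.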
Keywords: video action classification, multimodal learning, semisupervised learning, feature distillation

Document Type: Article | Article Type: Research Article | Access Type: Open Access
APA Çelik, A., Küçükmanisa, A., & Urhan, O. (2023). Feature distillation from vision-language model for semisupervised action classification. Turkish Journal of Electrical Engineering and Computer Sciences, 31(SI-1 (6)), 1129-1145. https://doi.org/10.55730/1300-0632.4038
Chicago Çelik, Aslı, Ayhan Küçükmanisa, and Oğuzhan Urhan. "Feature distillation from vision-language model for semisupervised action classification." Turkish Journal of Electrical Engineering and Computer Sciences 31, no. SI-1 (6) (2023): 1129-1145. https://doi.org/10.55730/1300-0632.4038
MLA Çelik, Aslı, et al. "Feature distillation from vision-language model for semisupervised action classification." Turkish Journal of Electrical Engineering and Computer Sciences, vol. 31, no. SI-1 (6), 2023, pp. 1129-1145. https://doi.org/10.55730/1300-0632.4038
AMA Çelik A, Küçükmanisa A, Urhan O. Feature distillation from vision-language model for semisupervised action classification. Turkish Journal of Electrical Engineering and Computer Sciences. 2023;31(SI-1 (6)):1129-1145. doi:10.55730/1300-0632.4038
Vancouver Çelik A, Küçükmanisa A, Urhan O. Feature distillation from vision-language model for semisupervised action classification. Turkish Journal of Electrical Engineering and Computer Sciences. 2023;31(SI-1 (6)):1129-1145. doi:10.55730/1300-0632.4038
IEEE A. Çelik, A. Küçükmanisa, and O. Urhan, "Feature distillation from vision-language model for semisupervised action classification," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 31, no. SI-1 (6), pp. 1129-1145, 2023, doi: 10.55730/1300-0632.4038.
ISNAD Çelik, Aslı et al. "Feature distillation from vision-language model for semisupervised action classification". Turkish Journal of Electrical Engineering and Computer Sciences 31/SI-1 (6) (2023), 1129-1145. https://doi.org/10.55730/1300-0632.4038