WORKSHOP OVERVIEW
Multi-modal understanding plays a crucial role in enabling machines to perceive the physical world through multiple sensory cues, as humans do. Recently, large-scale pre-trained models (PTMs) have become a research hotspot in the field of artificial intelligence. Existing techniques following the self-supervised learning paradigm have achieved great success in uni-modal settings, such as computer vision (CV) and natural language processing (NLP). These advances in large-scale pre-trained models inspire researchers to explore deeper pre-training techniques for the multi-modal understanding problem. In this workshop, we aim to bring together researchers from the multimedia community to discuss recent research and future directions on pre-trained models with self-supervised learning for multimedia understanding.
In recent years, we have witnessed the great success of pre-trained models (PTMs) in natural language processing (NLP), such as GPT-3, BERT, RoBERTa, and DeBERTa. This motivates researchers in the multimedia community to leverage the idea of PTMs to address multi-modal tasks. The scope of this workshop is focused on pre-trained models with self-supervised learning for multimedia understanding. Potential topics include architecture design for multi-modal PTMs, pretext task design for self-supervised learning, multi-modal data modeling, efficiency enhancement of PTMs, and interpretability of PTMs.
Invited Speakers

Jifeng Dai
Ph.D., Senior Researcher. An area chair of CVPR 2021, CVPR 2023, and ECCV 2020, a publication chair of ICCV 2019, and a senior PC member of AAAI 2018 and 2022. He is a Young Scientist at the Beijing Academy of Artificial Intelligence (BAAI).

Si Liu
Professor, Doctoral Supervisor. She is currently an associate editor of IEEE TMM and IEEE TCSVT, and has served many times as an area chair of ICCV, CVPR, ECCV, ACM MM, and other top conferences.

Jiajun Deng
Ph.D., Postdoctoral Research Associate. He is serving as a Guest Editor of IEEE Transactions on Multimedia for the 2022 Special Issue on Pre-trained Models for Multi-modality Understanding.

Zhengyuan Yang
Ph.D., Senior Researcher. Senior researcher at Microsoft. He is a member of the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) community and of the AAAI 2023 Senior Program Committee (SPC).
Organizers

Wengang Zhou
Ph.D., Professor, EEIS Department, University of Science and Technology of China
Email: zhwg@ustc.edu.cn

Jiaxin Shi
Ph.D., Senior Researcher, Huawei Cloud Computing Technologies Co., Ltd.
Email: shijiaxin3@huawei.com

Lingxi Xie
Ph.D., Senior Researcher, Huawei Cloud Computing Technologies Co., Ltd.
Email: 198808xc@gmail.com