
Regular Seminar

How to Leverage Multi-modal Foundation Models for Visual Grounding
Speaker: Prof. Jeany Son (POSTECH) · Seminar date: 2025.10.01 (Wed)

[Abstract]

Referring Image Segmentation (RIS), also known as Visual Grounding (VG), aims to generate segmentation masks for the regions of an image specified by a natural language expression. However, collecting labeled datasets for this task is costly and labor-intensive. In this talk, I will present two recent methods designed to address this challenge by reducing reliance on manual annotations and advancing zero-shot and pseudo-supervised learning techniques.

The first method leverages the pre-trained CLIP model to perform zero-shot RIS, capturing global and local contexts in the visual and textual domains to achieve precise segmentation masks guided by text expressions. The second method generates high-quality segmentation masks and distinctive referring expressions as pseudo-supervisions, using segmentation and captioning foundation models with distinctiveness-based filtering. These approaches enable RIS models to segment target instances without manual annotations, outperforming zero-shot baselines and even fully supervised methods in unseen domains. I believe that these techniques offer promising solutions for open-world RIS tasks.
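The core of the first method is matching candidate regions to the text expression in CLIP's joint embedding space, mixing global and local context. The following is a minimal sketch of that scoring step only, using placeholder embeddings in place of a real CLIP encoder; the mixing weight `alpha` and the function names are illustrative assumptions, not details from the talk:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def select_region(region_global, region_local, text_emb, alpha=0.5):
    """Pick the candidate region whose blended global/local embedding
    best matches the text embedding.

    region_global: (N, D) embeddings of each region with full-image context
    region_local:  (N, D) embeddings of each masked/cropped region alone
    text_emb:      (D,)   embedding of the referring expression
    alpha:         blend weight between global and local scores (assumed)
    """
    g = l2_normalize(region_global)
    l = l2_normalize(region_local)
    t = l2_normalize(text_emb)
    scores = alpha * (g @ t) + (1 - alpha) * (l @ t)
    return int(np.argmax(scores)), scores

# Toy example: region 1 aligns with the text embedding in both views.
text = np.array([1.0, 0.0, 0.0, 0.0])
g = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
l = np.array([[0.1, 0.9, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
best, scores = select_region(g, l, text)  # best == 1
```

In an actual zero-shot RIS pipeline, the region embeddings would come from CLIP's image encoder applied to masked or cropped proposals, and the winning region's mask would be returned as the segmentation output.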


[Biography]

Jeany Son received the B.S. and M.S. degrees in Computer Science and Engineering from Ewha Womans University, Seoul, South Korea, in 2008 and 2010, respectively, and the Ph.D. degree in Computer Science and Engineering from POSTECH, Pohang, South Korea, in 2018. She was a postdoctoral researcher at Seoul National University (2018–2019) and a researcher at ETRI (2019–2021). From 2021 to 2025, she served as an assistant professor at the AI Graduate School, Gwangju Institute of Science and Technology (GIST). She is currently an assistant professor with the Graduate School of AI (GSAI) and the Department of Computer Science and Engineering (CSE) at POSTECH.

Her research interests include computer vision, machine learning, and deep learning, with a focus on visual tracking, object detection, semantic segmentation, multi-modal learning, and learning with minimal supervision. More recently, she has also been working on trustworthy AI.