Improving Medical Multi-modal Contrastive Learning with Expert Annotations
Access rights
openAccess
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
Authors
Kumar, Y.; Marttinen, P.
Date
2025
Language
en
Pages
468-486
Series
Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XX, pp. 468-486, Lecture Notes in Computer Science; Volume 15078
Abstract
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" – a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
Description
openaire: EC/H2020/101016775/EU//INTERVENE
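The abstract attributes eCLIP's gains to a reduced modality gap and to improved alignment and uniformity of the learned embeddings. As a point of reference only (this is not the authors' released code), the Python/PyTorch sketch below shows one common way to compute these quantities for paired image/text embeddings: the gap is taken as the distance between modality centroids, and alignment/uniformity follow the standard definitions of Wang and Isola (2020). All function names, parameter defaults, and the centroid-based gap definition are assumptions for illustration.

import torch
import torch.nn.functional as F

def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    # Distance between the centroids of L2-normalized image and text embeddings,
    # one common way to quantify the gap between the two modalities.
    img = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img - txt).norm().item()

def alignment(img_emb: torch.Tensor, txt_emb: torch.Tensor, alpha: float = 2.0) -> float:
    # Mean distance between normalized embeddings of paired samples (lower is better).
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return (img - txt).norm(dim=-1).pow(alpha).mean().item()

def uniformity(emb: torch.Tensor, t: float = 2.0) -> float:
    # Log of the mean Gaussian potential over all pairs; lower values mean the
    # embeddings are spread more evenly over the unit hypersphere.
    z = F.normalize(emb, dim=-1)
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log().item()

if __name__ == "__main__":
    # Random stand-in embeddings: 256 paired samples, 512 dimensions.
    img_z, txt_z = torch.randn(256, 512), torch.randn(256, 512)
    print("modality gap:", modality_gap(img_z, txt_z))
    print("alignment:   ", alignment(img_z, txt_z))
    print("uniformity:  ", uniformity(torch.cat([img_z, txt_z])))

With random embeddings these values are uninformative; the intended use is to compare two trained encoders (e.g., a CLIP baseline against an eCLIP-style model) on the same held-out image-report pairs, where a smaller gap and lower alignment/uniformity values indicate better-structured joint embeddings.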
Keywords
Contrastive Learning, Deep Neural Networks, Large Language Models (LLMs), Medical Imaging, Zero-shot Inference
Citation
Kumar, Y & Marttinen, P 2025, Improving Medical Multi-modal Contrastive Learning with Expert Annotations. in A Leonardis, E Ricci, S Roth, O Russakovsky, T Sattler & G Varol (eds), Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XX. Lecture Notes in Computer Science, vol. 15078, Springer, pp. 468-486, European Conference on Computer Vision, Milano, Italy, 29/09/2024. https://doi.org/10.1007/978-3-031-72661-3_27