Journal: IEEE Access, vol. 11, pp. 123209–123222, 2023
Cross-modal representation learning aims to learn a shared representation space in which data from multiple modalities can be effectively compared, fused, and understood. This paper investigates whether increasing the diversity of the similarity score matrix improves the performance of CLIP (Contrastive Language-Image Pretraining), a multi-modal learning model that connects images and text within a joint embedding space. Two transformation approaches, sine and sigmoid (including two versions), are incorporated into the CLIP model to amplify larger values and diminish smaller values within the similarity matrix (logits). Hardware limitations are addressed by using a more compact text encoder (DistilBERT) and a pre-trained ResNet50 image encoder. The proposed adaptations are evaluated on image classification and image/text retrieval tasks using 10 benchmark datasets, including Food101, Flickr30k, and COCO. The adapted models are compared to the base CLIP model using accuracy, mean per-class accuracy, and Recall@k. Relative to the CLIP baseline, the results show improvements in accuracy (up to 5.32% on the PatchCamelyon dataset), mean per-class accuracy (up to 14.48% on the FGVCAircraft dataset), and retrieval performance (up to a 45.20% increase in Recall@1 on the COCO dataset).
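The idea of reshaping the similarity matrix before the contrastive loss can be illustrated with a minimal NumPy sketch. The abstract does not state the exact transformations, so the functional forms below (`sin(pi/2 * s)` and a rescaled logistic with a hypothetical `steepness` parameter), the function names, and the temperature value are assumptions chosen only for illustration. Both forms are monotonic, so they leave the ranking within each row unchanged while pushing similarities away from zero.

```python
import numpy as np

def cosine_logits(image_emb, text_emb):
    """Cosine similarity matrix between image and text embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T  # shape (n_images, n_texts), values in [-1, 1]

def sine_transform(sim):
    # Hypothetical form (not taken from the paper): for s in [-1, 1],
    # sin(pi/2 * s) has the same sign as s and a magnitude >= |s|.
    return np.sin(0.5 * np.pi * sim)

def sigmoid_transform(sim, steepness=5.0):
    # Hypothetical form (not taken from the paper): logistic function
    # rescaled to (-1, 1); `steepness` is an illustrative parameter.
    return 2.0 / (1.0 + np.exp(-steepness * sim)) - 1.0

def clip_loss(logits, temperature=0.07):
    """Symmetric cross-entropy over a square logits matrix whose
    diagonal entries are the matching image-text pairs."""
    scaled = logits / temperature

    def cross_entropy(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        log_prob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(scaled) + cross_entropy(scaled.T))
```

In this sketch a transformed matrix would simply replace the raw one in the loss, for example `clip_loss(sine_transform(cosine_logits(image_emb, text_emb)))`, with the encoders left unchanged.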