Exploring Transformer-Based Architectures for Large-Scale Multimodal Information Retrieval Systems
DOI: https://doi.org/10.64137/XXXXXXXX/IJCSEI-V1I1P103

Keywords: Transformer architectures, Multimodal information retrieval, Self-attention, Cross-modal fusion, Deep learning, Large-scale retrieval, Embedding, Modality-agnostic, Multi-head attention, Contrastive learning

Abstract
Transformer-based models have driven a major shift in large-scale information retrieval systems that must handle diverse data types such as text, images, audio, and video. The defining feature of transformers is self-attention, which lets every input token interact with every other token across the entire sequence, regardless of modality. As a result, transformers can capture fine-grained relationships among different types of data, which makes them effective on tasks that require combining heterogeneous information. Recent advances in attention mechanisms and fusion methods allow transformer models to exchange information across modalities, and shared embedding spaces trained with contrastive learning have proven effective in document retrieval, video search, and medical image analysis. Because they are largely modality-agnostic, transformers handle varied input formats more flexibly and effectively than CNNs and RNNs. The main challenges in building transformer-based multimodal retrieval systems are tokenizing and embedding each modality correctly, achieving scalability, and designing architectures that can accommodate additional kinds of data. As self-supervised pretraining, multi-head attention, and network optimization continue to advance, transformers form the core components of next-generation information retrieval systems.
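To make the shared-embedding idea concrete, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) objective over paired text and image embeddings, in which matched pairs are pulled together and all other in-batch pairs act as negatives. It is an illustrative sketch under simplifying assumptions (precomputed embeddings, a hypothetical batch, temperature 0.07), not the specific training procedure of any particular system discussed here.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    Row i of each matrix is a matched pair (the positive); every other
    in-batch combination serves as a negative example.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    logits = t @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(logits))            # diagonal entries are the positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # for numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the text->image and image->text retrieval directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice the two encoders (e.g. a text transformer and a vision transformer) are trained jointly so that this loss shapes a single modality-agnostic embedding space; at retrieval time, nearest-neighbor search over that space replaces any modality-specific matching logic.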