Introduction

In recent years, the breakthroughs of large language models (LLMs) in natural language processing have not only demonstrated superior semantic comprehension and generative capabilities, but have also provided new technological impetus for the development of generative recommender systems (RSs)1. Leveraging their extensive world knowledge and robust contextual modeling capacities, LLMs offer significant advantages in capturing the dynamic changes of users’ interest preferences and in handling complex item content, which is of great significance for enhancing the accuracy and personalization of sequential recommendation (SR). However, existing LLM-based RSs predominantly rely on single-modality textual information (e.g., user historical behaviors, textual reviews, etc.), thus failing to take full advantage of the complementary value inherent in multimodal data2. The widespread adoption of smart terminals has made user-generated content increasingly multimodal, and data spanning textual, visual, auditory, and other modalities provide RSs with richer dimensions for representing user preferences. Multimodal information fusion can effectively uncover latent cross-modal associations and differences in user preferences by integrating heterogeneous data such as text, images, and videos3, compensating for the limitations of single-modal approaches in information characterization4. These multimodal representations not only enable a more comprehensive understanding of user intent and item characteristics, but also significantly mitigate data sparsity and cold-start problems. Consequently, these techniques facilitate the construction of a more accurate and diverse multimodal recommendation system (MRS)5, as illustrated in Fig. 1.

Fig. 1

An example of a multimodal interaction sequential recommendation system. After Winder has bought a smart phone, an earbud, a pad and a laptop computer, what would he buy next?

In particular, recent advances6 in multimodal large language models (MLLMs), represented by GPT-4o, Gemini, etc., provide a new technological path for the development of multimodal SR by virtue of their powerful cross-modal comprehension and generative capabilities7. These MLLMs can not only reformulate recommendation tasks as natural language processing problems for direct recommendation generation, but also integrate with conventional recommendation models to enhance system performance via unified multimodal representation learning. This synergistic framework achieves more comprehensive user preference modeling and significantly improves the adaptability and robustness of the system in complex scenarios.

Nevertheless, existing MLLM-based RSs still confront several critical challenges8,9: (i) they lack the ability to understand user-item interaction patterns specific to recommendation scenarios and have difficulty handling noisy multimodal sequence data; (ii) their modeling of dynamic user behavior sequences is limited by the model’s context length, making it difficult to capture the temporal evolution of user interests; (iii) their understanding of non-textual modalities (e.g., visual/audio) remains coarse-grained (e.g., image description generation), failing to fully explore users’ differentiated preferences across multimodal interaction modes; and (iv) existing fine-tuning strategies insufficiently account for the multimodal joint optimization requirements specific to recommendation tasks. These limitations substantially hinder the practical effectiveness of MLLMs in real-world RSs.

To address these challenges, this study proposes a novel multimodal SR framework based on a multimodal large language model, called MLLM-SRec, which aligns SR tasks with the multimodal comprehension capabilities of MLLMs. The principal contributions of our work are threefold.

  • The first systematic study of the inference mechanism of MLLMs in multimodal sequential recommendation (MSR), which combines supervised fine-tuning (SFT) with a multi-step Chain-of-Thought (CoT) prompting optimization strategy to achieve knowledge transfer between recommendation tasks and pretrained multimodal models, effectively alleviating the underutilization of multimodal data in generative recommendation.

  • An innovative multimodal item fusion mechanism that effectively mitigates cross-modal discrepancies and visual noise, together with a temporal-aware preference comprehension module that captures the dynamic evolution of user preferences.

  • Extensive experiments on four benchmark datasets and in practical application scenarios demonstrate that our method substantially outperforms traditional SR models and current state-of-the-art LLM-based approaches, delivering significant advances in both recommendation accuracy and system robustness.

Related work

In this section, we discuss advances in fields relevant to our work.

LLM for recommendation

In recent years, LLMs such as GPT-4, LLaMA, and Gemma have brought about a significant paradigm shift in the field of RSs, leveraging their extensive world knowledge and exceptional reasoning capabilities10,11. LLMs effectively address domain knowledge gaps in traditional RSs, demonstrating immense potential in tasks such as rating prediction, ranking generation, and sequential recommendation. This has led to the emergence of numerous works, including P512, GPT4Rec13, E4SRec14, Chat-Rec15, VIP516, RecMind17 and OpenP518. Some studies directly employ LLMs as RSs, using their powerful semantic understanding and reasoning capabilities to learn rich semantic information and user behavior patterns from large-scale data through pre-trained models, thereby achieving personalized recommendations. For example, Bao et al.19 introduced a personalization-aware LLM framework for few-shot scenarios, highlighting that fine-tuning LLMs can enhance recommendation performance and personalization. Sanner et al.4 developed an efficient recommendation system adaptable to new domains by combining the zero-shot and few-shot learning capabilities of LLMs, a capability particularly suited to cold-start scenarios. These methods transform various recommendation tasks into language understanding or template generation, using the reasoning and knowledge-transfer capabilities of LLMs to achieve personalized recommendations. In addition, researchers have explored the application of LLMs in conversational RSs, achieving efficient personalized recommendations through a deep understanding of users’ historical interaction data. For example, Gao et al.15 proposed the Chat-Rec model, which enhances ChatGPT’s user dialogue and recommendation explanation capabilities through prompting techniques, improving the performance of conversational recommendation systems. Using LLMs as feature extraction tools for traditional RSs is another important direction; this approach leverages the text descriptions generated by LLMs and their feature-encoding capabilities to provide richer input for recommendation models. For example, GenRec20 directly fine-tunes the LLaMA model on plain text for generative recommendation. Another line of work focuses on generating and supplementing missing features: Wang et al.21 use LLMs to process user text inputs, analyzing user interests, emotions, and preferences to enhance recommendation accuracy and personalization, while Yuan et al.22 compared ID-based RSs with those based on different modality encodings, demonstrating that integrating multimodal information can further enhance RS performance.

LLMs thus hold broad promise for generative RSs23: they can enhance the robustness and precision of traditional RSs in contextual understanding and personalized recommendation without substantially increasing computational complexity. However, most existing approaches rely solely on textual information, neglecting multimodal auxiliary information such as images. Furthermore, current LLMs still face efficiency and performance challenges when processing large-scale multimodal sequential information.

Multimodal recommendation

Compared to traditional unimodal RSs, MRSs can discover and represent hidden relationships between different modalities and user preference patterns across modalities24. They enable the learning of deeper semantic relationships between users and items, providing a more comprehensive understanding of user preferences and item characteristics and thereby delivering more personalized and diverse recommendations. Additionally, by integrating multimodal information, MRSs effectively alleviate the data sparsity and cold-start problems common in traditional RSs. Early research on MRSs, such as matrix factorization and MLP-based methods, aimed to integrate multimodal content for more reasonable representation learning. For example, the pioneering work VBPR25 improved recommendation performance by combining visual features with ID embeddings to leverage multimodal information. With the advancement of deep learning, methods such as auto-encoders and variational auto-encoders were gradually introduced into RSs, significantly improving recommendation performance through the use of modal information26. To explore higher-order relationships between modal information and users/items, techniques such as graph convolutional networks (GCNs) have further strengthened the representation capabilities of multimodal data. For example, MMGCN27 introduced a modality-aware GCN into multimodal recommendation tasks, aggregating and propagating multimodal information on user-item bipartite graphs to learn embeddings for each modality; these modality representations are then fused with ID embeddings to form the final item representation.

Currently, MRSs based on MLLMs are gradually becoming a frontier in the development of next-generation MRSs. Leveraging the powerful visual-text understanding capabilities and extensive general knowledge of MLLMs, models such as CLIPViT, GPT4V, and LLaMA can effectively integrate and process multimodal data, including text, images, and audio, through pre-training and fine-tuning strategies, demonstrating significant zero-shot recommendation capabilities across various domains28. For example, VIP516 proposes a multimodal recommendation framework built on the P5 foundation model, which unifies visual, textual, and personalized modalities to accomplish various recommendation tasks. The Rec-GPT4V29 framework leverages the visual summarization capabilities of large vision-language models (LVLMs) for multimodal recommendation, using user history as contextual user preferences to prompt LVLMs to generate item image summaries and combining image understanding in natural language space with item titles to query user preferences for candidate items. Although these models achieve basic visually grounded recommendations through image captioning, partially enhancing system interpretability and user experience, their potential applications in multimodal-assisted sequential recommendation tasks remain underexplored.

Sequential recommendation

SR is a significant research direction in recommender systems, predicting users’ future interests by analyzing their historical behavior sequences. Early Markov chain-based methods30 captured user behavior patterns through simple sequence models but faced limitations in handling long-term dependencies and variable-length sequences. The integration of multimodal data (such as text and images) has significantly advanced research in SR. Traditional methods relied on item IDs or attributes, which limited their generalization; methods based on pre-trained language models have improved cross-domain adaptability by learning unified item representations. For instance, UniSRec31 enhances recommendation performance by leveraging multimodal information. Dynamic interest modeling remains a core challenge in SR. Traditional methods like GRU4Rec32 and SASRec33 mainly capture short-term dependencies and struggle with long-term interest evolution. Approaches based on Graph Neural Networks (GNNs) and self-attention mechanisms are widely adopted, as they better model dynamic changes in user interests. For example, BERT4Rec34 significantly improves recommendation performance by capturing global dependencies through bidirectional Transformers.

In recent years, the introduction of LLMs has provided new directions for SR. Techniques such as instruction tuning, prompt learning, and mixture-of-experts (MoE) adapt the top-layer structures of LLMs to recommendation tasks, further enhancing the model’s ability to understand user intent. For example, the Rec-GPT4V29 framework generates image summaries of items through visual summarization, improving the explainability and user experience of recommendations. However, existing LLM-based MSR approaches still exhibit shortcomings in multimodal fusion and dynamic interest modeling, leaving substantial room for improving the accuracy, real-time performance, and explainability of recommendation systems.

Preliminaries

In this section, we first define our research problem and then discuss the framework of the MLLM-based multimodal sequential recommendation method through an initial exploration.

Problem definition

In this paper, each user \(u\in U\) is assumed to have profile information such as ID, sex, age, and occupation. The interaction history of user u over items \(i\in I\) is organized in chronological order into a sequence \(H_u=\left\{ i_1,i_2,\cdots ,i_n \right\}\), together with a candidate item \(i_{n+1}\), where U and I denote the user set and item set respectively, \(i_n\) denotes the n-th item with which user u interacts (through click, cart, collect, review, rating, etc.) and is associated with an image and text (e.g., title or review), denoted as \(\left( image_n,text_n \right)\), and n denotes the length of the user interaction sequence. The goal of the MLLM-based MSR algorithm is then to fine-tune35 the MLLMs on users’ multimodal interaction sequences, via a designed multimodal prompting method, to predict the probability that user u clicks the next item, i.e., the candidate item \(i_{n+1}\). The candidate is selected from the complete item set I, excluding items with which the user has already interacted.
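For concreteness, the following minimal Python sketch shows one way the multimodal interaction sequence and its prediction target could be represented; all names and fields are illustrative, not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    item_id: str
    image_path: str   # image_n: path or URL of the item image
    text: str         # text_n: e.g., title or review


@dataclass
class UserSequence:
    user_id: str
    profile: dict                                        # e.g., {"sex": "M", "age": 23, "occupation": "student"}
    history: List[Item] = field(default_factory=list)    # H_u = {i_1, ..., i_n}, in chronological order


def build_sample(seq: UserSequence, candidate: Item, label: int) -> dict:
    """One prediction sample: will user u click candidate i_{n+1}? label in {0, 1}."""
    return {
        "user": seq.user_id,
        "profile": seq.profile,
        "history": [(it.image_path, it.text) for it in seq.history],
        "candidate": (candidate.image_path, candidate.text),
        "label": label,
    }
```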

MLLM-based base model for SR

The prevailing multimodal recommendation paradigms typically consist of multiple components, such as perception modules, adapters, and recommenders3. As a preliminary concept, we propose a base framework that employs an LLM and an MLLM as dual backbones to handle the textual and visual modalities, as illustrated in Fig. 2.

Fig. 2

Base model of MLLM-based sequential recommendation.

Here, each module is defined as follows: (i) Item image captioning. Leveraging the powerful visual understanding capabilities of MLLMs, we employ image captioning to interpret item images and generate descriptive textual summaries. (ii) User preference understanding. Utilizing LLMs, we analyze the image descriptions of historical interactions and the corresponding user reviews to summarize user preferences and interests. (iii) MLLM-based recommender. Based on the resulting user interests, together with the image and user review of the target item, we use MLLMs to comprehensively understand and infer whether the user will like the target item.

The Base Model framework utilizes MLLMs as the core component for item visual description and recommendation tasks, with LLMs supporting user preference understanding. This establishes a fundamental MSR framework in which MLLM image captioning extracts comprehensive item features. Building upon this foundation, the framework employs the causal language modeling paradigm of MLLMs to fine-tune individual components through instruction tuning, thereby optimizing the pipeline and enhancing recommendation performance.
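A minimal sketch of this three-stage flow is given below. The callables standing in for the captioning MLLM, the preference-summarizing LLM, and the MLLM recommender are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable, List, Tuple

ImageText = Tuple[str, str]  # (image_path, review/title)


def base_model_recommend(
    history: List[ImageText],
    target: ImageText,
    caption_image: Callable[[str], str],               # (i) MLLM item image captioning
    summarize_prefs: Callable[[List[str]], str],       # (ii) LLM user preference understanding
    mllm_recommend: Callable[[str, ImageText], str],   # (iii) MLLM-based recommender -> "Yes"/"No"
) -> str:
    """Base Model pipeline: item captions -> user preference summary -> like/dislike decision."""
    enriched = [
        f"Item description: {caption_image(img)} | User review: {rev}"
        for img, rev in history
    ]
    preferences = summarize_prefs(enriched)
    return mllm_recommend(preferences, target)
```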

Methodology

In this section, we first introduce our proposed MLLM-SRec sequential recommendation framework. Subsequently, we provide a detailed discussion of each component, as illustrated in Fig. 3c. Furthermore, we explore how parameter-efficient fine-tuning (PEFT)36 and CoT37 can effectively enhance the model performance.

Fig. 3

A schematic diagram of the MLLM-SRec framework. (a) Dynamic user multimodal interaction sequence construction with a special window and step. (b) On the basis of VQA-based item understanding, multimodal summarization is carried out through the Item Multimodal Summary Generator Unit. (c) Our proposed MLLM-based SR framework based on the base model.

In the MLLM-SRec framework, to address the perception of multimodal information and the understanding of multimodal sequential user behaviors, we developed three key components: VQA-based image understanding, the item multimodal summary generator, and temporal multimodal user behavior understanding. On this basis, we further align the recommendation task through QLoRA-based PEFT and four-step CoT prompt learning38.

VQA-based image understanding

Accurately inferring user preferences requires analyzing historical interaction sequences, which is a primary task in MSR. Most existing MLLM-based image understanding tasks generate textual descriptions of given static images through simple prompts like “describe the image”. However, in multimodal recommendation scenarios, user interaction image sequences are discrete, often contain various kinds of noise such as backgrounds or borders, and lack correlation between images. Such an approach fails to capture the key features and personalized semantic information of each image, making it even more difficult to discern relationships between sequential images. In addition, such generic prompts lack specificity and cannot reflect the true semantics of items, making them unsuitable for the unique requirements of MSR. To alleviate these challenges, we propose a VQA-based Image Understanding (VIU) method for image summarization, which fundamentally diverges from the aforementioned Base Model, as shown in Fig. 4.

Fig. 4

An illustration of VQA-based image understanding.

This method leverages the robust visual comprehension and generation capabilities of MLLMs to avoid visual noise and useless information, extract the key semantic features of each image, and deeply understand the personalized content of the image to provide detailed responses. Specifically, we formally define the VIU summary for any given \(image_i\) as follows:

$$\begin{aligned} vs_i=VIU(p_1,image_i). \end{aligned}$$
(1)

Where \(p_1\) = “What is in the image and describe its category, type, color, style, brand, specifications, and feature?” Through the question prompt \(p_1\), key information such as item name, category, type, color, style, brand, specifications, and features will be obtained and summarized into a unified textual description. Such precise guidance is crucial for filtering information and capturing elements that are directly consistent with user interests, thereby laying the foundation for subsequent integration with the text modality and the understanding and processing of user preferences.
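Since the implementation uses Qwen2-VL-7B-Instruct as the MLLM backbone (see Implementation details), the VIU call of Eq. (1) can be sketched roughly as follows. This is a minimal sketch following the publicly documented Qwen2-VL generation recipe; the model id, generation length, and the qwen_vl_utils helper are assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL model card examples

P1 = ("What is in the image and describe its category, type, color, "
      "style, brand, specifications, and feature?")

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")


def viu_summary(image_path: str) -> str:
    """Return the VIU visual summary vs_i for a single item image (Eq. 1)."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": P1},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    out = out[:, inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```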

Item multimodal summary generator

Exploring the hidden relationships between different modalities and users’ preference patterns across modalities helps us learn the deeper, complementary semantic relationships between users and items, understand user preferences and item features more comprehensively, and better integrate multimodal information. For multimodal item information, including image and text content, we select appropriate LLMs as the backbone for each modality to ensure that fine-grained features and a thorough understanding are captured from each modality; specific fusion prompts are then used to integrate the independently processed visual summaries and text content, achieving a comprehensive, multifaceted summary and understanding of the items. To this end, we further improve the base model. After generating visual summaries for the images of each item in the user interaction sequence, we propose the Item Multimodal Summary Generator (IMSG) method to fuse the two modalities, which leverages the semantic understanding capabilities of LLMs to jointly combine the textual and visual modalities of the items, as illustrated in Fig. 3b.

This method achieves multimodal summarization of items by designing joint prompts for LLMs that integrate multimodal information, where the independently summarized visual abstract of the image modality and the corresponding textual modality are jointly used as inputs. More specifically, we formalize the multimodal summarization process through personalized prompts \(p_2\), which take the visual summary \(vs_i\) obtained via VIU and the corresponding \(text_i\) as input. The multimodal summarization is formally defined as follows:

$$\begin{aligned} ms_i=IMSG(p_2,vs_i,text_i). \end{aligned}$$
(2)

Where \(p_2\) = “Please combine visual summary and text to summarize the item.” It should be emphasized that we adopt this step-by-step summarization and fusion learning approach to ensure that the output captures both the individuality and the commonality of the item as modeled by the two different modalities. This method is consistent with the traditional multimodal recommendation strategy, which emphasizes integrating various data modalities to create a comprehensive semantic overview of the item.
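The fusion step of Eq. (2) reduces to prompt construction plus one LLM call. The sketch below uses a caller-supplied generation callable as a stand-in for the LLM backbone; the prompt layout is illustrative.

```python
from typing import Callable

P2 = "Please combine visual summary and text to summarize the item."


def imsg_summary(vs_i: str, text_i: str, llm_generate: Callable[[str], str]) -> str:
    """Eq. (2): fuse the VIU visual summary vs_i with the item's text_i via prompt p_2."""
    prompt = (f"{P2}\n"
              f"Visual summary: {vs_i}\n"
              f"Item text (title/review): {text_i}\n"
              f"Multimodal summary:")
    return llm_generate(prompt)   # ms_i
```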

Temporal user behavior comprehend

Understanding users’ sequential behaviors and summarizing their interest preferences have always been significant challenges in SR, particularly when dealing with multimodal user interaction sequences. Effective personalized recommendation is based on an accurate understanding of user interest preferences, and the emergence of MLLMs has brought remarkable advances in the comprehension and fusion of multimodal information. However, as mentioned above, MLLMs still face difficulties in processing sequential multimodal data. Although our multimodal item summary method effectively integrates multimodal information into a unified item profile, it lacks the capability to mine dynamic user interests and understand immediate preferences. Additionally, the complexity of handling lengthy historical interaction sequences often leads to unstable outputs and hallucinations when inferring user preferences. To cope with these difficulties, we make two key improvements to the Base Model. First, when initializing the user behavior sequence data, a sliding window with a given size and step is used to construct dynamic user behavior sequences, as illustrated in Fig. 3a. This approach not only enables the tuning of MLLMs to uncover dynamic user preferences and understand interest drift, but also avoids the preference forgetting or distortion caused by excessively long user interaction sequences. Second, a Temporal User Behavior Comprehend (TUBC) layer assisted by LLMs is added to summarize and refine historical user preferences.

More specifically, this method employs the prompt \(p_3\) to iteratively model the constructed dynamic multimodal behavior sequences in chronological order, enhancing cross-sequence context awareness to refine the current user interest preferences. This effectively overcomes the limitations of MLLMs in processing multimodal data sequences and dynamic interest preferences, and thus copes more effectively with multimodal interaction sequences. Here \(p_3\) = “Please summarize the user’s multimodal summaries sequence in chronological order to refine the user’s interests and preferences.”
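The sketch below illustrates the two TUBC ingredients under stated assumptions: a sliding-window construction of dynamic sub-sequences (window size and step are tunable hyperparameters, not values from the paper) and the assembly of the \(p_3\) prompt, optionally carrying the preference summary refined from earlier windows for iterative updating.

```python
from typing import List

P3 = ("Please summarize the user's multimodal summaries sequence in "
      "chronological order to refine the user's interests and preferences.")


def sliding_windows(ms_seq: List[str], window: int = 10, step: int = 1) -> List[List[str]]:
    """Split a chronologically ordered list of item multimodal summaries ms_i
    into overlapping sub-sequences (cf. Fig. 3a)."""
    return [ms_seq[s:s + window]
            for s in range(0, max(len(ms_seq) - window + 1, 1), step)]


def tubc_prompt(window_summaries: List[str], prev_preference: str = "") -> str:
    """Build the TUBC prompt p_3 for one window; prev_preference carries the
    refinement obtained from earlier windows when iterating chronologically."""
    history = "\n".join(f"{t + 1}. {ms}" for t, ms in enumerate(window_summaries))
    prefix = f"Previously refined preference: {prev_preference}\n" if prev_preference else ""
    return f"{prefix}{P3}\nMultimodal summaries (oldest first):\n{history}"
```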

Supervised fine-tuning and CoT prompt optimization

Supervised fine-tuning of QLoRA

Through the aforementioned three steps, we obtain the user’s multimodal interaction sequence preferences, which are subsequently integrated with the user profile and the multimodal information of the next target item to construct manually annotated instruction-tuning data with a {task instruction, input, response} structure. Using the widely adopted instruction-following SFT paradigm, we optimize the model parameters to develop a multimodal sequential recommender based on open-source MLLMs39. This approach effectively minimizes the discrepancy between predictions and actual user interactions while enhancing the personalized sequential recommendation capabilities of the MLLM. Specifically, when performing instruction tuning, each data sample \(\left( x_i,y_i\right)\) is transformed into fine-tuning data structured in the instruction-input-response format. This data structure simulates instruction fine-tuning scenarios and can be used to guide MLLMs to understand and learn the instruction task40. Here, the instruction is the description of the task and the problem statement provided by the user or system to the model. The input is the initialized data or additional context \(x_i\) provided to the model, comprising the user profile, the user behavior sequence, and the target item; each item in the sequence contains text information (e.g., title and reviews) and pictures (e.g., item pictures and poster covers), denoted as \(\left( image_i,text_i \right)\). The response is the answer or action generated by the model based on the instruction and input of the recommendation task; here, it refers to the label \(y_i\) (\(y_i\in \left\{ 0,1 \right\}\), also converted into Yes or No), consisting of positive and negative samples, indicating whether the model predicts that the user will like the next item, as shown in Table 1.

Table 1 Illustration of instruction data formulation for instruction tuning.
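For illustration, one such {task instruction, input, response} sample might be assembled as follows; the template wording is hypothetical and not the exact prompt shown in Table 1.

```python
def build_instruction_sample(profile: dict, user_preference: str,
                             target_summary: str, label: int) -> dict:
    """Assemble one {task instruction, input, response} SFT sample; label 1 -> 'Yes', 0 -> 'No'."""
    instruction = ("Given the user's profile, the preferences refined from the user's "
                   "multimodal interaction history, and the multimodal summary of the "
                   "target item, predict whether the user will like the target item. "
                   "Answer with 'Yes' or 'No'.")
    task_input = (f"User profile: {profile}\n"
                  f"User preferences: {user_preference}\n"
                  f"Target item: {target_summary}")
    return {"instruction": instruction, "input": task_input,
            "response": "Yes" if label == 1 else "No"}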

Subsequently, the MLLM-based recommender is trained on our constructed instruction dataset using the following SFT objectives:

$$\begin{aligned} \mathcal {L} _{MLLMsft}=-\sum _{i=1}^L{log\,\,P\left( y_i\mid y_{<i},M \right) }. \end{aligned}$$
(3)

Where \(y_i\) is the i-th word in the prompt text, L is the length of the prompt text, and M represents the input multimodal conditions. The probability \(P\left( y_i\mid y_{<i},M \right)\) is computed by the MLLM under the next-token prediction paradigm of causal language modeling, which maximizes the likelihood of the ground-truth tokens given the prompt. This ensures that the model learns to accurately predict subsequent tokens based on the provided context, which is crucial for generating precise MSR. During training, we use QLoRA41 for PEFT, which greatly reduces the number of trainable parameters and speeds up training. Ultimately, we use MLLM-SRec for inference based on the given multimodal personalized prompts. In the test phase, we constrain the next token predicted by the model to either “yes” or “no” to avoid the influence of irrelevant information on the predicted labels. Finally, based on the probability scores of the first newly predicted token, we compute the probability of item interaction using the following function:

$$\begin{aligned} p=\frac{exp\left( p\left( yes \right) \right) }{exp\left( p\left( yes \right) \right) +exp\left( p\left( no \right) \right) } \end{aligned}$$
(4)
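A minimal PyTorch sketch of Eq. (4), assuming "Yes" and "No" each map to a single token id in the backbone's tokenizer and that the scores p(yes), p(no) are the first-token logits for those two answer tokens:

```python
import torch


def interaction_probability(first_token_logits: torch.Tensor,
                            yes_token_id: int, no_token_id: int) -> float:
    """Eq. (4): binary softmax over the logits of the first generated token,
    restricted to the 'Yes'/'No' answer tokens."""
    scores = first_token_logits[[yes_token_id, no_token_id]]
    return torch.softmax(scores, dim=-1)[0].item()

# Illustrative usage: take logits of shape [vocab_size] from model(...).logits[0, -1],
# and the (assumed single-token) ids of "Yes"/"No" from the backbone tokenizer.
```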

Prompt optimization of four-step CoT

As an improvement on the aforementioned PEFT, the MSR model is further optimized using a four-step CoT (4-step CoT) prompt learning approach. Specifically, the prompt and returned answer of each VIU, IMSG, and TUBC step are taken sequentially as the prefix of the next step’s prompt. These are combined with the current step’s information as input, explicitly guiding the MLLMs to generate a preference prediction (like or dislike) for the next target item based on the given user profile, the user preferences summarized from historical interactions, and the multimodal information of the target item. This approach further optimizes the model’s multimodal understanding and enhances its personalization capabilities in MSR.
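A minimal sketch of this chaining, with a hypothetical `generate` callable standing in for the fine-tuned MLLM:

```python
from typing import Callable, List


def four_step_cot(step_prompts: List[str], final_prompt: str,
                  generate: Callable[[str], str]) -> str:
    """Chain VIU -> IMSG -> TUBC -> recommendation: each step's prompt and answer
    are prepended to the next step's prompt as an explicit reasoning prefix."""
    context = ""
    for prompt in step_prompts:              # e.g., [p_1-based, p_2-based, p_3-based prompts]
        answer = generate(context + prompt)
        context += f"{prompt}\n{answer}\n\n"
    return generate(context + final_prompt)  # final preference prediction: "Yes"/"No"
```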

Experiments

In this section, we first introduce the datasets, baselines, evaluation protocols, and parameter settings of the experiments, and subsequently conduct extensive experiments and ablation studies, with the aim of addressing the following research questions:

  • RQ1: Does the proposed MLLM-SRec perform well compared with the state-of-the-art LLM-based recommendation methods in our task?

  • RQ2: Can the modalities and their different combinations and designs more accurately capture the dynamic preferences of users?

  • RQ3: How do the components of the proposed MLLM-SRec affect its effectiveness?

  • RQ4: How do different configurations of the fine-tuning strategies affect the performance of MLLM-SRec?

Experimental settings

Dataset description

In this paper, we adopt four widely used open-source real-world datasets from Amazon Review to evaluate the performance of our proposed MLLM-SRec method on SR tasks. Table 2 reports detailed statistics of these datasets.

Table 2 Statistics of the experimental datasets. Data sparsity is calculated by dividing the number of interactions by the product of the number of users and items, and then subtracting the result from 1. Samples refer to the total number of positive and negative samples collected.

These selections allow in-depth analysis across different recommendation scenarios. The Amazon dataset collects users’ historical behaviors, such as ratings and reviews, on 34 product categories from the Amazon e-commerce platform, and primarily records multimodal features such as item titles, text descriptions, images, and video URLs. We choose four categories, namely baby, sports, beauty, and toys, to evaluate our method, defining samples with ratings no less than 4 as positive and the rest as negative. Meanwhile, a 1:1 negative sampling ratio is implemented based on time order, and each dataset is randomly divided into a training set (\(80\%\)), a validation set (\(10\%\)), and a test set (\(10\%\)) by convention, ensuring that each user and item has at least one instance in both the training set and the test set.
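The sketch below shows a rough version of the labeling and splitting step described above, assuming each interaction record carries a numeric `rating` field; the paper's exact time-ordered 1:1 negative sampling and the per-user/per-item coverage constraint are not reproduced here.

```python
import random
from typing import Dict, List, Tuple


def label_and_split(interactions: List[Dict], seed: int = 42
                    ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Label Amazon interactions (rating >= 4 -> positive, else negative) and
    randomly split into 80% train / 10% validation / 10% test."""
    for rec in interactions:
        rec["label"] = 1 if rec["rating"] >= 4 else 0
    random.Random(seed).shuffle(interactions)
    n = len(interactions)
    train = interactions[: int(0.8 * n)]
    valid = interactions[int(0.8 * n): int(0.9 * n)]
    test = interactions[int(0.9 * n):]
    return train, valid, test
```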

Baseline methods

To conduct an extensive evaluation of our proposed MLLM-SRec model, we comprehensively compare it with representative methods of traditional SR, multimodal recommendation, LLM-based SR, and MLLM-based recommendation:

  • Conventional SR models: SASRec33 is a Transformer-based sequential recommendation model that employs a self-attention mechanism to model global dependencies within user behavior sequences, effectively capturing both long-term and short-term user interest preferences. BERT4Rec34 utilizes BERT’s bidirectional Transformer encoder for recommendation, simultaneously considering contextual information in the user behavior sequence and thus improving the precision and personalization of recommendations.

  • Multimodal recommendation model: MMSR42 proposes a graph-based adaptive fusion method for multimodal features. After constructing a multimodal sequence graph for each user, it employs a dual attention mechanism to independently aggregate heterogeneous and homogeneous node information to achieve adaptive adjustment of the sequence, thereby facilitating recommendations that simultaneously consider both sequence and cross-modal aspects. MMGCN27 is a multimodal recommendation framework based on Graph Convolutional Networks (GCN). Using the message-passing paradigm, it integrates multimodal features into the recommendation system. By modeling specific modal representations of users and items, it effectively captures user preferences across different modalities and fuses this information to enhance recommendation accuracy.

  • SR models based on LLMs: GPTRec43 is an SR model based on the GPT-2 architecture. It generates complex, interdependent recommendation lists item by item through a novel SVD tokenization algorithm and a Next-K recommendation strategy. TALLRec44 is an efficient and effective framework for aligning LLMs with recommendation tasks; it focuses on fine-tuning LLMs with recommendation data to build a large recommendation language model.

  • MLLMs-based models: VIP516 proposes a parameter-efficient multimodal base recommendation model that unifies visual, linguistic, and personalized information. Using the designed multimodal personalized prompts, visual signals are integrated with text and personalized information to enhance recommendations across multiple modalities. Rec-GPT4V29 is a VST inference scheme that generates item image summaries through prompt MLLMs. It combines image understanding in natural language space with item titles to query user preferences for candidate items, utilizing large vision-language models for multimodal recommendation.

Evaluation metrics

We adopt the widely used ranking-based evaluation metrics12 HR@K (H@K), NDCG@K (N@K), and Recall@K (R@K), together with AUC, to evaluate the performance of the baseline methods and our proposed MLLM-SRec for multimodal SR. Higher values of HR, NDCG, Recall, and AUC indicate better model performance. To ensure a fair comparison, we standardize the size of the candidate item set across all baseline methods and our method.
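For reference, minimal implementations of the ranking metrics with binary relevance are sketched below; function names and the binary-relevance assumption are ours.

```python
import math
from typing import Sequence, Set


def hr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """HR@K: 1 if any relevant item appears in the top-K ranked list, else 0."""
    return float(any(item in relevant for item in ranked[:k]))


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Recall@K: fraction of relevant items retrieved within the top-K."""
    return sum(item in relevant for item in ranked[:k]) / max(len(relevant), 1)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@K with binary relevance."""
    dcg = sum(1.0 / math.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```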

Implementation details

All experiments were conducted on an Ubuntu 22.04 LTS server equipped with three RTX 4080 64GB GPUs. For open-source models, we selected Qwen2-VL-7B-Instruct and Llama-3.1-8B-Instruct (for summarizing user preferences), both released on Hugging Face, as the backbone MLLM and LLM of MLLM-SRec, respectively. Based on the Hugging Face Transformers library, parameter-efficient fine-tuning (PEFT), distributed acceleration, and low-precision computing techniques were adopted. All methods were implemented with Python 3.11.7 and PyTorch 2.2.1+cu118. All reported results are the average performance of at least 5 repeated runs to mitigate the impact of randomness.

Overall performance (RQ1)

To comprehensively evaluate the effectiveness of our proposed MLLM-SRec model, we take traditional SR, MSR, popular LLM-based SR, and emerging MLLM-based recommendation methods as baselines and conduct comparative experiments on four datasets for SR tasks, reporting their performance on the HR@K, NDCG@K, and Recall@K evaluation metrics. The experimental results are shown in Table 3.

Table 3 Performance comparison of different baseline methods. The best and second-best results are highlighted in bold and italics, respectively.

From the experimental results, we observe that MLLM-SRec consistently outperforms the MLLM-based methods and LLM-based SR methods, and significantly surpasses traditional SR models and multimodal SR models, achieving superior performance on all metrics. Specifically, compared to the strongest baseline, MLLM-SRec demonstrates an average improvement of \(9.90\%\) across the four datasets, with average improvements of \(11.4\%\), \(12.83\%\), and \(18.07\%\) in the Recall, NDCG, and HR metrics, respectively. This verifies the accuracy and superiority of MLLM-SRec in leveraging multimodal data through MLLMs for SR tasks.

Research indicates that pre-trained MLLMs possess extensive world knowledge and advanced visual and linguistic comprehension capabilities to interpret complex user intents and contextual nuances. This enables them to effectively extract intricate multimodal information, align visual and textual representations, fully capture dynamic user behavior preferences, and significantly enhance recommendation performance through instruction tuning on multimodal sequential interaction data. In contrast to VIP5 and Rec-GPT4V, the MLLM-SRec framework can comprehensively understand textual and visual information through our IMSG component designed for multimodal data, and accurately fuse visual and textual representations. This alignment allows for a deeper understanding of the multifaceted features of items while modeling their intrinsic correlations, simplifying the information-processing pipeline, and significantly improving item representation quality. Additionally, the TUBC component can dynamically balance explicit user preferences and implicit semantic information. Unlike conventional PEFT or basic CoT prompt optimization techniques, our approach synergistically integrates QLoRA-based local parameter fine-tuning with multistep CoT output alignment, leading to significant improvements in MLLMs’ comprehension of multimodal interaction data and in recommendation performance. Compared to LLM-based recommendation methods that rely solely on textual information, our MLLM-SRec model successfully integrates multimodal information and unleashes the potential of MLLMs. This fully demonstrates the value of integrating advanced MLLM fine-tuning techniques to enhance SR using multimodal interaction sequence data.

Ablation study

Impact of different input modalities (RQ2)

To comprehensively explore the impact of different modalities on the performance of MLLM-SRec and further investigate the contribution of multimodal fusion and various complementary knowledge of modalities to SR, we conducted in-depth experiments on different input modalities. Performance results are shown in Table 4.

Table 4 Performance comparison with different modality inputs. The best results are in bold and the second-best results are in italics. The improvement is calculated over the second-best result.

This table compares the impact of different inputs—image-only, text-only, and image+text—on the recommendation performance of the MLLM-SRec model in SR tasks. The analysis reveals that the combination of text and image modalities consistently achieves the best performance on all evaluation metrics. Compared to using only the text modality, MLLM-SRec significantly enhances recommendation performance by incorporating multimodal information, particularly image information. In contrast, using only images generally performs slightly worse than using only text. Although visual information tends to capture user attention more effectively, images may contain noise such as background content, and additional irrelevant information may be captured when interpreting images, negatively impacting recommendation performance. By contrast, the multimodal input comprising text and images achieves alignment between visual and textual content during image understanding through our components, effectively enhancing the accuracy of the model. These findings suggest that our method has further potential for improvement in multimodal fusion. Since each modality provides unique knowledge that cannot be captured by the others, the fusion of multiple modalities outperforms single-modality approaches. By leveraging their extensive world knowledge and effectively integrating complementary information within and across modalities, MLLMs demonstrate the strong potential of MLLM-SRec as an SR method enhanced by multimodal information.

Impact of different components of the MLLM-SRec (RQ3)

To evaluate the individual contributions of each component in our MLLM-SRec framework, we developed several variants of MLLM-SRec, each described as follows:

  • MLLM-SV: In this version, we employ a VQA-based approach instead of image captioning to understand the item images. Specifically, the VIU component replaces the item image captioning in Base Model, enabling noise-free comprehension of visual data through a unified questioning framework.

  • MLLM-SVI: This variant utilizes both VIU and IMSG components to process multimodal interaction sequences. It relies entirely on textual and visual summaries, using LLMs for item multimodal summary generation.

As shown in Table 5, we conducted ablation experiments on all variants, including the Base Model, using the corresponding fine-tuning strategies to validate their effectiveness in sequential recommendation.

Table 5 The performance of MLLM-SRec and its variants. The improvement is calculated over the base model result.

It is readily observable that our primary model, MLLM-SRec, consistently outperforms its variants, MLLM-SV and MLLM-SVI, achieving the best performance and highlighting the importance of each component to the overall effectiveness of the model. The variant MLLM-SVI, which uses IMSG to produce multimodal item summaries, achieves the second-best performance. This result verifies the importance of integrating multimodal data through IMSG, which is crucial for understanding user preferences across different modalities. This approach significantly compensates for the incompleteness of textual information, thereby enabling a more comprehensive understanding of the multimodal information of items. Additionally, while visual information tends to capture user attention more effectively than pure text, images themselves may contain excessive redundant information, potentially introducing additional noise into recommendation tasks. The MLLM-SV variant, which excludes dynamic user behavior preference understanding and relies solely on VIU, effectively mitigates noise and extracts more relevant fine-grained features from images. Its performance lies between the Base Model and MLLM-SVI, indicating that the VQA approach can achieve a more accurate understanding of visual data than image captioning. By further incorporating our proposed TUBC component, we can accurately capture users’ dynamic interests. As shown in the table, MLLM-SRec, which incorporates all of these components, achieves significant performance improvements, demonstrating that our approach can effectively utilize multimodal data, more accurately reflect users’ current interests, and reduce the negative impacts of data noise, cross-modal semantic gaps, and lengthy sequences. This further underscores the importance of capturing the dynamic evolution of user preferences.

Impact of fine-tuning strategy selection (RQ4)

In this section, we discuss the application of QLoRA-based PEFT and 4-step CoT prompt learning to demonstrate the impact of different tuning strategies on model performance. To better align MLLMs with multimodal sequential recommendation tasks and obtain more robust personalized recommendation capabilities, we tried two fine-tuning approaches: (i) PEFT via QLoRA to optimize the model parameters and (ii) a further 4-step CoT prompting technique to optimize the output results.

Fig. 5

Performance comparison of different fine-tuning strategies. The AUC score for MLLM-SRec is calculated on four datasets using QLoRA and further 4-step CoT strategy.

As illustrated in Fig. 5, the average AUC scores of the two fine-tuning approaches are 0.81 and 0.84, respectively, demonstrating that both fine-tuning strategies are necessary to achieve better recommendation performance. The AUC score of the QLoRA fine-tuning method alone remains around 0.81, whereas after 4-step CoT prompting, MLLM-SRec shows a significant improvement in AUC on all datasets, with a maximum improvement of 5.54% over the former. This demonstrates that, unlike conventional PEFT or basic CoT prompt optimization techniques, our approach synergistically integrates QLoRA-based local parameter fine-tuning with multistep CoT output alignment, thereby significantly enhancing the MLLM’s comprehension of multimodal interaction data and the effectiveness of the model’s outputs on top of parameter optimization. The experimental results further validate that there is a considerable gap between multimodal and recommendation tasks, and demonstrate the importance of multimodal data utilization and model fine-tuning in eliciting the recommendation capability of MLLMs.

Impact of MLLMs hallucinations on recommendation performance

The superior multimodal understanding and language generation abilities of MLLM-based RSs are compromised by input hallucination phenomena, particularly when handling multiple images, irrelevant sequential image-text pairs, or images with visual noise. Their frequent inability to properly interpret visual contexts or discern temporal/logical connections between items often results in recommendations that are irrelevant or deviate from user intentions, potentially hindering their effectiveness in SR scenarios. We employ the widely used CLIP cross-modal semantic distance metric45 to evaluate the impact of input hallucination on MLLMs. For given sequential image-text pairs, we assess the hallucination effect by measuring the average semantic similarity between the MLLM-generated image interpretation and the input text. Higher CLIP scores indicate stronger hallucination resistance of the model.
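A minimal sketch of this CLIP-based similarity check using Hugging Face's CLIP implementation is shown below; the specific checkpoint and the choice to compare the two texts with CLIP's text encoder are our assumptions about the measurement, not details stated in the paper.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_text_similarity(generated_interpretation: str, input_text: str) -> float:
    """Cosine similarity between CLIP text embeddings of the MLLM-generated
    image interpretation and the corresponding input item text."""
    batch = tokenizer([generated_interpretation, input_text],
                      padding=True, truncation=True, return_tensors="pt")
    emb = model.get_text_features(**batch)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # L2-normalize both embeddings
    return float(emb[0] @ emb[1])
```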

Fig. 6

Impact analysis of hallucinations on recommendation performance. (a) Hallucination analysis of individual models on different datasets. (b) Multimodal anomalous input data for verifying hallucinations.

As illustrated in Fig. 6a, MLLM-SRec outperforms MLLM-based RSs and their variants in hallucination resistance across all benchmark datasets. In the hallucination robustness evaluation, when exposed to singular or sequential data anomalies (e.g. complex backgrounds, noisy/distorted images, or erroneous image-text pairs) as depicted in Fig. 6b, MLLM-SRec consistently maintains stable recommendation performance, whereas competing models either generate hallucinations or exhibit reasoning failures. The study demonstrates that for specific task scenarios and interaction data, SFT and effective CoT prompting can significantly mitigate hallucinations in MLLMs. Our approach enables MLLMs to generate reasoning steps and produce more authentic and useful responses, thereby enhancing recommendation performance.

Case study

In this section, we conduct a case study to visually demonstrate the effectiveness of incorporating multimodal information into MLLM-based SR. We present an SR example for multimodal item interaction data obtained from a self-built campus services recommendation platform and further analyze how MLLM-SRec assists MLLMs in accurately understanding multimodal information and better capturing dynamic user behavior preferences. We input user multimodal sequential interaction data into MLLM-SRec to predict the user’s preference for the next item, thereby providing personalized recommendations. The results are shown in Fig. 7.

Fig. 7

The case study of MLLM-SRec on item recommendation. The left half of the figure shows the user’s multimodal interaction records, while the right side displays the next recommended item. Each row represents the multimodal interaction behavior of a specific user and its corresponding next recommended item.

We observe that MLLM-SRec accurately understands the multimodal semantic information of items across different multimodal interaction sequences, generates a unified dynamic user behavior preference, and accurately recommends the next item to the user. Compared with text-only input, our method accurately grasps the internal semantics of different modalities, especially the visual modality, captures the correlations between multimodal sequences, and deeply understands users’ temporal behavior preferences, which contributes significantly to personalization. These results demonstrate the effectiveness of integrating and understanding multimodal interaction sequences in MLLM-based SR.

Conclusion and future work

The latest advances in MLLMs have opened new avenues for investigating intelligent RSs. This study systematically explores the feasibility of MLLMs for processing sequential multimodal interaction data and proposes a promising MLLM-SRec framework. By deeply fusing dynamic multimodal sequential information with the semantic understanding capabilities of MLLMs, the framework can effectively mitigate cross-modal discrepancies and visual noise, comprehensively capture the dynamic evolution of user preferences, and effectively alleviate the underutilization of multimodal interaction data in generative recommendation.

Future research will systematically investigate the trade-off between inference latency and computational cost in MLLM-based RSs. Through the development of an end-to-end joint training framework that integrates advanced MLLMs with enhanced parameter capacity, optimized prompt engineering strategies, and extended multimodal data coverage, our aim is to establish more robust, energy-efficient, and scalable multimodal recommendation architectures.