Introduction

In recent years, the breakthroughs of large language models (LLMs) in natural language processing have not only demonstrated superior semantic comprehension and generative capabilities, but have also provided new technological impetus for the development of generative recommender systems (RSs)1. Leveraging their extensive world knowledge and robust contextual modeling capacities, LLMs offer significant advantages in capturing the dynamic changes of users’ interest preferences and in handling complex item content, which is of great significance for enhancing the accuracy and personalization of sequential recommendation (SR). However, existing LLM-based RSs predominantly rely on single-modality textual information (e.g., user historical behaviors, textual reviews, etc.), thus failing to take full advantage of the complementary value inherent in multimodal data2. The widespread adoption of smart terminals has made user-generated content increasingly multimodal, and data spanning textual, visual, auditory, and other modalities provide RSs with richer dimensions for representing user preferences. Multimodal information fusion can effectively uncover latent cross-modal associations and differences in user preferences by integrating heterogeneous data such as text, images, and videos3, compensating for the limitations of single-modal approaches in information characterization4. These multimodal representations not only enable a more comprehensive understanding of user intent and item characteristics, but also significantly mitigate data sparsity and cold-start problems. Consequently, these techniques facilitate the construction of a more accurate and diverse multimodal recommendation system (MRS)5, as illustrated in Fig. 1.

Fig. 1

An example of a multimodal interaction sequential recommendation system. After Winder has bought a smart phone, an earbud, a pad and a laptop computer, what would he buy next?

In particular, recent advances6 in multimodal large language models (MLLMs), represented by GPT-4o, Gemini, etc., provide a new technological path for the development of multimodal SR by virtue of their powerful cross-modal comprehension and generative capabilities7. These MLLMs can not only reformulate recommendation tasks as natural language processing problems for direct recommendation generation, but also integrate with conventional recommendation models to enhance system performance via unified multimodal representation learning. This synergistic framework achieves more comprehensive user preference modeling and significantly improves the adaptability and robustness of the system in complex scenarios.

Nevertheless, existing MLLM-based RSs still confront several critical challenges8,9: (i) they lack the ability to understand user-item interaction patterns specific to recommendation scenarios and have difficulty handling noisy multimodal sequence data; (ii) their modeling of dynamic user behavior sequences is limited by the model’s context length, making it difficult to capture the temporal evolution of user interests; (iii) their understanding of non-textual modalities (e.g., visual/audio) remains coarse-grained (e.g., image description generation), failing to fully explore users’ differentiated preferences across multimodal interaction modes; and (iv) existing fine-tuning strategies insufficiently account for the multimodal joint optimization requirements specific to recommendation tasks. These limitations substantially hinder the practical effectiveness of MLLMs in real-world RSs.

To address these challenges, this study proposes a novel multimodal SR framework based on a multimodal large language model, called MLLM-SRec, which aligns SR tasks with the multimodal comprehension capabilities of MLLMs. The principal contributions of our work are threefold.

  • The first systematic study of the inference mechanism of MLLMs in multimodal sequential recommendation (MSR), which combines supervised fine-tuning (SFT) with a multi-step Chain-of-Thought (CoT) prompting optimization strategy to achieve knowledge transfer between recommendation tasks and pretrained multimodal models, effectively alleviating the underutilization of multimodal data in generative recommendation.

  • An innovative multimodal item fusion mechanism that effectively mitigates cross-modal discrepancies and visual noise, together with a temporal-aware preference comprehension module that captures the dynamic evolution of user preferences.

  • Extensive experiments on four benchmark datasets and in practical application scenarios demonstrate that our method substantially outperforms traditional SR models and current state-of-the-art LLM-based approaches, delivering significant advances in both recommendation accuracy and system robustness.

Related work

In this section, we discuss advances in fields relevant to our work.

LLM for recommendation

In recent years, LLMs such as GPT-4, LLaMA, and Gemma have brought about a significant paradigm shift in the field of RSs, leveraging their extensive world knowledge and exceptional reasoning capabilities10,11. LLMs effectively address domain knowledge gaps in traditional RSs, demonstrating immense potential in tasks such as rating prediction, ranking generation, and sequential recommendation. This has led to the emergence of numerous works, including P512, GPT4Rec13, E4SRec14, Chat-Rec15, VIP516, RecMind17 and OpenP518. Some studies directly employ LLMs as RSs, using their powerful semantic understanding and reasoning capabilities to learn rich semantic information and user behavior patterns from large-scale data through pre-trained models, thereby achieving personalized recommendations. For example, Bao et al.19 introduced a personalization-aware LLM framework for few-shot scenarios, highlighting that fine-tuning LLMs can enhance recommendation performance and personalization. Sanner et al.4 developed an efficient recommendation system adaptable to new domains by combining the zero-shot and few-shot learning capabilities of LLMs, a capability particularly suited to cold-start scenarios. These methods transform various recommendation tasks into language understanding or template generation, using the reasoning and knowledge-transfer capabilities of LLMs to achieve personalized recommendations. In addition, researchers have explored the application of LLMs in conversational RSs, achieving efficient personalized recommendations through a deep understanding of users’ historical interaction data. For example, Gao et al.15 proposed the Chat-Rec model, which enhances ChatGPT’s user dialogue and recommendation explanation capabilities through prompting techniques, improving the performance of conversational recommendation systems. Using LLMs as feature extraction tools for traditional RSs is another important direction; this approach leverages the text descriptions generated by LLMs and their feature-encoding capabilities to provide richer input for recommendation models. For example, GenRec20 directly fine-tunes the LLaMA model on plain text for generative recommendation. Another line of work focuses on generating and supplementing missing features: Wang et al.21 use LLMs to process user text inputs, analyzing user interests, emotions, and preferences to enhance recommendation accuracy and personalization, while Yuan et al.22 compared ID-based RSs with those based on different modality encodings, demonstrating that integrating multimodal information can further enhance RS performance.

LLMs thus hold broad promise for generative RSs23: they can enhance the robustness and precision of traditional RSs in contextual understanding and personalized recommendation without substantially increasing computational complexity. However, most existing approaches rely solely on textual information, neglecting multimodal auxiliary information such as images. Furthermore, current LLMs still face efficiency and performance challenges when processing large-scale multimodal sequential information.

Multimodal recommendation

Compared to traditional unimodal RSs, MRSs can discover and represent hidden relationships between different modalities and user preference patterns across modalities24. They enable the learning of deeper semantic relationships between users and items, providing a more comprehensive understanding of user preferences and item characteristics and thereby delivering more personalized and diverse recommendations. Additionally, by integrating multimodal information, MRSs effectively alleviate the data sparsity and cold-start problems common in traditional RSs. Early research on MRSs, such as matrix factorization and MLP-based methods, aimed to integrate multimodal content for more reasonable representation learning. For example, the pioneering work VBPR25 improved recommendation performance by combining visual features with ID embeddings to leverage multimodal information. With the advancement of deep learning, methods such as auto-encoders and variational auto-encoders were gradually introduced into RSs, significantly improving recommendation performance through the use of modal information26. To explore higher-order relationships between modal information and users/items, techniques such as graph convolutional networks (GCNs) have further strengthened the representation capabilities of multimodal data. For example, MMGCN27 introduced a modality-aware GCN into multimodal recommendation tasks, aggregating and propagating multimodal information on user-item bipartite graphs to learn embeddings for each modality; these modality representations are then fused with ID embeddings to form the final item representation.

Currently, MRSs based on MLLMs are gradually becoming a frontier in the development of next-generation MRSs. Leveraging the powerful visual-text understanding capabilities and extensive general knowledge of MLLMs, models such as CLIPViT, GPT4V, and LLaMA can effectively integrate and process multimodal data, including text, images, and audio, through pre-training and fine-tuning strategies, demonstrating significant zero-shot recommendation capabilities across various domains28. For example, VIP516 proposes a multimodal recommendation framework built on the P5 foundation model, which unifies visual, textual, and personalized modalities to accomplish various recommendation tasks. The Rec-GPT4V29 framework leverages the visual summarization capabilities of large vision-language models (LVLMs) for multimodal recommendation, using user history as contextual user preferences to prompt LVLMs to generate item image summaries and combining image understanding in natural language space with item titles to query user preferences for candidate items. Although these models achieve basic visually grounded recommendations through image captioning, partially enhancing system interpretability and user experience, their potential applications in multimodal-assisted sequential recommendation tasks remain underexplored.

Sequential recommendation

SR is a significant research direction in recommender systems, predicting users’ future interests by analyzing their historical behavior sequences. Early Markov chain-based methods30 captured user behavior patterns through simple sequence models but faced limitations in handling long-term dependencies and variable-length sequences. The integration of multimodal data (such as text and images) has significantly advanced research in SR. Traditional methods relied on item IDs or attributes, which limited their generalization; methods based on pre-trained language models have improved cross-domain adaptability by learning unified item representations. For instance, UniSRec31 enhances recommendation performance by leveraging multimodal information. Dynamic interest modeling remains a core challenge in SR. Traditional methods like GRU4Rec32 and SASRec33 mainly capture short-term dependencies and struggle with long-term interest evolution. Approaches based on Graph Neural Networks (GNNs) and self-attention mechanisms are widely adopted, as they better model dynamic changes in user interests. For example, BERT4Rec34 significantly improves recommendation performance by capturing global dependencies through bidirectional Transformers.

In recent years, the introduction of LLMs has provided new directions for SR. Techniques such as instruction tuning, prompt learning, and mixture-of-experts (MoE) adapt the top-layer structures of LLMs to recommendation tasks, further enhancing the model’s ability to understand user intent. For example, the Rec-GPT4V29 framework generates image summaries of items through visual summarization, improving the explainability and user experience of recommendations. However, existing LLM-based MSR approaches still exhibit shortcomings in multimodal fusion and dynamic interest modeling, leaving substantial room for improving the accuracy, real-time performance, and explainability of recommendation systems.

Preliminaries

In this section, we first define our research problem and then discuss the framework of the MLLM-based multimodal sequential recommendation method through an initial exploration.

Problem definition

In this paper, each user \(u\in U\) is assumed to have profile information such as ID, sex, age, and occupation. The interaction history of user u over items \(i\in I\) is organized in chronological order into a sequence \(H_u=\left\{ i_1,i_2,\cdots ,i_n \right\}\), together with a candidate item \(i_{n+1}\), where U and I denote the user set and item set respectively, \(i_n\) denotes the n-th item with which user u interacts (through click, cart, collect, review, rating, etc.) and is associated with an image and text (e.g., title or review), denoted as \(\left( image_n,text_n \right)\), and n denotes the length of the user interaction sequence. The goal of the MLLM-based MSR algorithm is then to fine-tune35 the MLLMs on users’ multimodal interaction sequences, via a designed multimodal prompting method, to predict the probability that user u clicks the next item, i.e., the candidate item \(i_{n+1}\). The candidate is selected from the complete item set I, excluding items with which the user has already interacted.
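For concreteness, the following minimal Python sketch shows one way the multimodal interaction sequence and its prediction target could be represented; all names and fields are illustrative, not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    item_id: str
    image_path: str   # image_n: path or URL of the item image
    text: str         # text_n: e.g., title or review


@dataclass
class UserSequence:
    user_id: str
    profile: dict                                        # e.g., {"sex": "M", "age": 23, "occupation": "student"}
    history: List[Item] = field(default_factory=list)    # H_u = {i_1, ..., i_n}, in chronological order


def build_sample(seq: UserSequence, candidate: Item, label: int) -> dict:
    """One prediction sample: will user u click candidate i_{n+1}? label in {0, 1}."""
    return {
        "user": seq.user_id,
        "profile": seq.profile,
        "history": [(it.image_path, it.text) for it in seq.history],
        "candidate": (candidate.image_path, candidate.text),
        "label": label,
    }
```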

MLLM-based base model for SR

The prevailing multimodal recommendation paradigms typically consist of multiple components, such as perception modules, adapters, and recommenders3. As a preliminary concept, we propose a base framework that employs an LLM and an MLLM as dual backbones to handle the textual and visual modalities, as illustrated in Fig. 2.

Fig. 2

Base model of MLLM-based sequential recommendation.

Here, each module is defined as follows: (i) Item image captioning. Leveraging the powerful visual understanding capabilities of MLLMs, we employ image captioning to interpret item images and generate descriptive textual summaries. (ii) User preference understanding. Utilizing LLMs, we analyze the image descriptions of historical interactions and the corresponding user reviews to summarize user preferences and interests. (iii) MLLM-based recommender. Based on the resulting user interests, together with the image and user review of the target item, we use MLLMs to comprehensively understand and infer whether the user will like the target item.

The Base Model framework utilizes MLLMs as the core component for item visual description and recommendation tasks, with LLMs supporting user preference understanding. This establishes a fundamental MSR framework in which MLLM image captioning extracts comprehensive item features. Building upon this foundation, the framework employs the causal language modeling paradigm of MLLMs to fine-tune individual components through instruction tuning, thereby optimizing the pipeline and enhancing recommendation performance.
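A minimal sketch of this three-stage flow is given below. The callables standing in for the captioning MLLM, the preference-summarizing LLM, and the MLLM recommender are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable, List, Tuple

ImageText = Tuple[str, str]  # (image_path, review/title)


def base_model_recommend(
    history: List[ImageText],
    target: ImageText,
    caption_image: Callable[[str], str],               # (i) MLLM item image captioning
    summarize_prefs: Callable[[List[str]], str],       # (ii) LLM user preference understanding
    mllm_recommend: Callable[[str, ImageText], str],   # (iii) MLLM-based recommender -> "Yes"/"No"
) -> str:
    """Base Model pipeline: item captions -> user preference summary -> like/dislike decision."""
    enriched = [
        f"Item description: {caption_image(img)} | User review: {rev}"
        for img, rev in history
    ]
    preferences = summarize_prefs(enriched)
    return mllm_recommend(preferences, target)
```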

Methodology

In this section, we first introduce our proposed MLLM-SRec sequential recommendation framework. Subsequently, we provide a detailed discussion of each component, as illustrated in Fig. 3c. Furthermore, we explore how parameter-efficient fine-tuning (PEFT)36 and CoT37 can effectively enhance the model performance.

Fig. 3

A schematic diagram of the MLLM-SRec framework. (a) Dynamic user multimodal interaction sequence construction with a special window and step. (b) On the basis of VQA-based item understanding, multimodal summarization is carried out through the Item Multimodal Summary Generator Unit. (c) Our proposed MLLM-based SR framework based on the base model.

In the MLLM-SRec framework, to address the perception of multimodal information and the understanding of multimodal sequential user behaviors, we developed three key components: VQA-based image understanding, the item multimodal summary generator, and temporal multimodal user behavior understanding. On this basis, we further align the recommendation task through QLoRA-based PEFT and four-step CoT prompt learning38.

VQA-based image understanding

Accurately inferring user preferences requires analyzing historical interaction sequences, which is a primary task in MSR. Most existing MLLM-based image understanding tasks generate textual descriptions of given static images through simple prompts like “describe the image”. However, in multimodal recommendation scenarios, user interaction image sequences are discrete, often contain various kinds of noise such as backgrounds or borders, and lack correlation between images. Such an approach fails to capture the key features and personalized semantic information of each image, making it even more difficult to discern relationships between sequential images. In addition, such generic prompts lack specificity and cannot reflect the true semantics of items, making them unsuitable for the unique requirements of MSR. To alleviate these challenges, we propose a VQA-based Image Understanding (VIU) method for image summarization, which fundamentally diverges from the aforementioned Base Model, as shown in Fig. 4.

Fig. 4

An illustration of VQA-based image understanding.

This method leverages the robust visual comprehension and generation capabilities of MLLMs to avoid visual noise and useless information, extract the key semantic features of each image, and deeply understand the personalized content of the image to provide detailed responses. Specifically, we formally define the VIU summary for any given \(image_i\) as follows:

$$\begin{aligned} vs_i=VIU(p_1,image_i). \end{aligned}$$
(1)

Where \(p_1\) = “What is in the image and describe its category, type, color, style, brand, specifications, and feature?” Through the question prompt \(p_1\), key information such as item name, category, type, color, style, brand, specifications, and features will be obtained and summarized into a unified textual description. Such precise guidance is crucial for filtering information and capturing elements that are directly consistent with user interests, thereby laying the foundation for subsequent integration with the text modality and the understanding and processing of user preferences.
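Since the implementation uses Qwen2-VL-7B-Instruct as the MLLM backbone (see Implementation details), the VIU call of Eq. (1) can be sketched roughly as follows. This is a minimal sketch following the publicly documented Qwen2-VL generation recipe; the model id, generation length, and the qwen_vl_utils helper are assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL model card examples

P1 = ("What is in the image and describe its category, type, color, "
      "style, brand, specifications, and feature?")

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")


def viu_summary(image_path: str) -> str:
    """Return the VIU visual summary vs_i for a single item image (Eq. 1)."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": P1},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    out = out[:, inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```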

Item multimodal summary generator

Exploring the hidden relationships between different modalities and users’ preference patterns across modalities helps us learn the deeper, complementary semantic relationships between users and items, understand user preferences and item features more comprehensively, and better integrate multimodal information. For multimodal item information, including image and text content, we select appropriate LLMs as the backbone for each modality to ensure that fine-grained features and a thorough understanding are captured from each modality; specific fusion prompts are then used to integrate the independently processed visual summaries and text content, achieving a comprehensive, multifaceted summary and understanding of the items. To this end, we further improve the base model. After generating visual summaries for the images of each item in the user interaction sequence, we propose the Item Multimodal Summary Generator (IMSG) method to fuse the two modalities, which leverages the semantic understanding capabilities of LLMs to jointly combine the textual and visual modalities of the items, as illustrated in Fig. 3b.

This method achieves multimodal summarization of items by designing joint prompts for LLMs that integrate multimodal information, where the independently summarized visual abstract of the image modality and the corresponding textual modality are jointly used as inputs. More specifically, we formalize the multimodal summarization process through personalized prompts \(p_2\), which take the visual summary \(vs_i\) obtained via VIU and the corresponding \(text_i\) as input. The multimodal summarization is formally defined as follows:

$$\begin{aligned} ms_i=IMSG(p_2,vs_i,text_i). \end{aligned}$$
(2)

Where \(p_2\) = “Please combine visual summary and text to summarize the item.” It should be emphasized that we adopt this step-by-step summarization and fusion learning approach to ensure that the output captures both the individuality and the commonality of the item as modeled by the two different modalities. This method is consistent with the traditional multimodal recommendation strategy, which emphasizes integrating various data modalities to create a comprehensive semantic overview of the item.
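The fusion step of Eq. (2) reduces to prompt construction plus one LLM call. The sketch below uses a caller-supplied generation callable as a stand-in for the LLM backbone; the prompt layout is illustrative.

```python
from typing import Callable

P2 = "Please combine visual summary and text to summarize the item."


def imsg_summary(vs_i: str, text_i: str, llm_generate: Callable[[str], str]) -> str:
    """Eq. (2): fuse the VIU visual summary vs_i with the item's text_i via prompt p_2."""
    prompt = (f"{P2}\n"
              f"Visual summary: {vs_i}\n"
              f"Item text (title/review): {text_i}\n"
              f"Multimodal summary:")
    return llm_generate(prompt)   # ms_i
```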

Temporal user behavior comprehend

Understanding users’ sequential behaviors and summarizing their interest preferences have always been significant challenges in SR, particularly when dealing with multimodal user interaction sequences. Effective personalized recommendation is based on an accurate understanding of user interest preferences, and the emergence of MLLMs has brought remarkable advances in the comprehension and fusion of multimodal information. However, as mentioned above, MLLMs still face difficulties in processing sequential multimodal data. Although our multimodal item summary method effectively integrates multimodal information into a unified item profile, it lacks the capability to mine dynamic user interests and understand immediate preferences. Additionally, the complexity of handling lengthy historical interaction sequences often leads to unstable outputs and hallucinations when inferring user preferences. To cope with these difficulties, we make two key improvements to the Base Model. First, when initializing the user behavior sequence data, a sliding window with a given size and step is used to construct dynamic user behavior sequences, as illustrated in Fig. 3a. This approach not only enables the tuning of MLLMs to uncover dynamic user preferences and understand interest drift, but also avoids the preference forgetting or distortion caused by excessively long user interaction sequences. Second, a Temporal User Behavior Comprehend (TUBC) layer assisted by LLMs is added to summarize and refine historical user preferences.

More specifically, this method employs the prompt \(p_3\) to iteratively model the constructed dynamic multimodal behavior sequences in chronological order, enhancing cross-sequence context awareness to refine the current user interest preferences. This effectively overcomes the limitations of MLLMs in processing multimodal data sequences and dynamic interest preferences, and thus copes more effectively with multimodal interaction sequences. Here \(p_3\) = “Please summarize the user’s multimodal summaries sequence in chronological order to refine the user’s interests and preferences.”
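The sketch below illustrates the two TUBC ingredients under stated assumptions: a sliding-window construction of dynamic sub-sequences (window size and step are tunable hyperparameters, not values from the paper) and the assembly of the \(p_3\) prompt, optionally carrying the preference summary refined from earlier windows for iterative updating.

```python
from typing import List

P3 = ("Please summarize the user's multimodal summaries sequence in "
      "chronological order to refine the user's interests and preferences.")


def sliding_windows(ms_seq: List[str], window: int = 10, step: int = 1) -> List[List[str]]:
    """Split a chronologically ordered list of item multimodal summaries ms_i
    into overlapping sub-sequences (cf. Fig. 3a)."""
    return [ms_seq[s:s + window]
            for s in range(0, max(len(ms_seq) - window + 1, 1), step)]


def tubc_prompt(window_summaries: List[str], prev_preference: str = "") -> str:
    """Build the TUBC prompt p_3 for one window; prev_preference carries the
    refinement obtained from earlier windows when iterating chronologically."""
    history = "\n".join(f"{t + 1}. {ms}" for t, ms in enumerate(window_summaries))
    prefix = f"Previously refined preference: {prev_preference}\n" if prev_preference else ""
    return f"{prefix}{P3}\nMultimodal summaries (oldest first):\n{history}"
```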

Supervised fine-tuning and CoT prompt optimization

Supervised fine-tuning of QLoRA

Through the aforementioned three steps, we obtain the user’s multimodal interaction sequence preferences, which are subsequently integrated with the user profile and the multimodal information of the next target item to construct manually annotated instruction-tuning data with a {task instruction, input, response} structure. Using the widely adopted instruction-following SFT paradigm, we optimize the model parameters to develop a multimodal sequential recommender based on open-source MLLMs39. This approach effectively minimizes the discrepancy between predictions and actual user interactions while enhancing the personalized sequential recommendation capabilities of the MLLM. Specifically, when performing instruction tuning, each data sample \(\left( x_i,y_i\right)\) is transformed into fine-tuning data structured in the instruction-input-response format. This data structure simulates instruction fine-tuning scenarios and can be used to guide MLLMs to understand and learn the instruction task40. Here, the instruction is the description of the task and the problem statement provided by the user or system to the model. The input is the initialized data or additional context \(x_i\) provided to the model, comprising the user profile, the user behavior sequence, and the target item; each item in the sequence contains text information (e.g., title and reviews) and pictures (e.g., item pictures and poster covers), denoted as \(\left( image_i,text_i \right)\). The response is the answer or action generated by the model based on the instruction and input of the recommendation task; here, it refers to the label \(y_i\) (\(y_i\in \left\{ 0,1 \right\}\), also converted into Yes or No), consisting of positive and negative samples, indicating whether the model predicts that the user will like the next item, as shown in Table 1.

Table 1 Illustration of instruction data formulation for instruction tuning.
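For illustration, one such {task instruction, input, response} sample might be assembled as follows; the template wording is hypothetical and not the exact prompt shown in Table 1.

```python
def build_instruction_sample(profile: dict, user_preference: str,
                             target_summary: str, label: int) -> dict:
    """Assemble one {task instruction, input, response} SFT sample; label 1 -> 'Yes', 0 -> 'No'."""
    instruction = ("Given the user's profile, the preferences refined from the user's "
                   "multimodal interaction history, and the multimodal summary of the "
                   "target item, predict whether the user will like the target item. "
                   "Answer with 'Yes' or 'No'.")
    task_input = (f"User profile: {profile}\n"
                  f"User preferences: {user_preference}\n"
                  f"Target item: {target_summary}")
    return {"instruction": instruction, "input": task_input,
            "response": "Yes" if label == 1 else "No"}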

Subsequently, the MLLM-based recommender is trained on our constructed instruction dataset using the following SFT objectives:

$$\begin{aligned} \mathcal {L} _{MLLMsft}=-\sum _{i=1}^L{log\,\,P\left( y_i\mid y_{<i},M \right) }. \end{aligned}$$
(3)

Where \(y_i\) is the i-th word in the prompt text, L is the length of the prompt text, and M represents the input multimodal conditions. The probability \(P\left( y_i\mid y_{<i},M \right)\) is computed by the MLLM under the next-token prediction paradigm of causal language modeling, which maximizes the likelihood of the ground-truth tokens given the prompt. This ensures that the model learns to accurately predict subsequent tokens based on the provided context, which is crucial for generating precise MSR. During training, we use QLoRA41 for PEFT, which greatly reduces the number of trainable parameters and speeds up training. Ultimately, we use MLLM-SRec for inference based on the given multimodal personalized prompts. In the test phase, we constrain the next token predicted by the model to either “yes” or “no” to avoid the influence of irrelevant information on the predicted labels. Finally, based on the probability scores of the first newly predicted token, we compute the probability of item interaction using the following function:

$$\begin{aligned} p=\frac{exp\left( p\left( yes \right) \right) }{exp\left( p\left( yes \right) \right) +exp\left( p\left( no \right) \right) } \end{aligned}$$
(4)
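A minimal PyTorch sketch of Eq. (4), assuming "Yes" and "No" each map to a single token id in the backbone's tokenizer and that the scores p(yes), p(no) are the first-token logits for those two answer tokens:

```python
import torch


def interaction_probability(first_token_logits: torch.Tensor,
                            yes_token_id: int, no_token_id: int) -> float:
    """Eq. (4): binary softmax over the logits of the first generated token,
    restricted to the 'Yes'/'No' answer tokens."""
    scores = first_token_logits[[yes_token_id, no_token_id]]
    return torch.softmax(scores, dim=-1)[0].item()

# Illustrative usage: take logits of shape [vocab_size] from model(...).logits[0, -1],
# and the (assumed single-token) ids of "Yes"/"No" from the backbone tokenizer.
```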

Prompt optimization of four-step CoT

As an improvement on the aforementioned PEFT, the MSR model is further optimized using a four-step CoT (4-step CoT) prompt learning approach. Specifically, the prompt and returned answer of each VIU, IMSG, and TUBC step are taken sequentially as the prefix of the next step’s prompt. These are combined with the current step’s information as input, explicitly guiding the MLLMs to generate a preference prediction (like or dislike) for the next target item based on the given user profile, the user preferences summarized from historical interactions, and the multimodal information of the target item. This approach further optimizes the model’s multimodal understanding and enhances its personalization capabilities in MSR.
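A minimal sketch of this chaining, with a hypothetical `generate` callable standing in for the fine-tuned MLLM:

```python
from typing import Callable, List


def four_step_cot(step_prompts: List[str], final_prompt: str,
                  generate: Callable[[str], str]) -> str:
    """Chain VIU -> IMSG -> TUBC -> recommendation: each step's prompt and answer
    are prepended to the next step's prompt as an explicit reasoning prefix."""
    context = ""
    for prompt in step_prompts:              # e.g., [p_1-based, p_2-based, p_3-based prompts]
        answer = generate(context + prompt)
        context += f"{prompt}\n{answer}\n\n"
    return generate(context + final_prompt)  # final preference prediction: "Yes"/"No"
```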

Experiments

In this section, we first introduce the datasets, baselines, evaluation protocols, and parameter settings of the experiments, and subsequently conduct extensive experiments and ablation studies, with the aim of addressing the following research questions:

  • RQ1: Does the proposed MLLM-SRec perform well compared with the state-of-the-art LLM-based recommendation methods in our task?

  • RQ2: Can the modalities and their different combinations and designs more accurately capture the dynamic preferences of users?

  • RQ3: How do the components of the proposed MLLM-SRec affect its effectiveness?

  • RQ4: How do different configurations of the fine-tuning strategies affect the performance of MLLM-SRec?

Experimental settings

Dataset description

In this paper, we adopt four widely used open-source real-world datasets from Amazon Review to evaluate the performance of our proposed MLLM-SRec method on SR tasks. Table 2 reports detailed statistics of these datasets.

Table 2 Statistics of the experimental datasets. Data sparsity is calculated by dividing the number of interactions by the product of the number of users and items, and then subtracting the result from 1. Samples refer to the total number of positive and negative samples collected.

These selections allow in-depth analysis across different recommendation scenarios. The Amazon dataset collects users’ historical behaviors, such as ratings and reviews, on 34 product categories from the Amazon e-commerce platform, and primarily records multimodal features such as item titles, text descriptions, images, and video URLs. We choose four categories, namely baby, sports, beauty, and toys, to evaluate our method, defining samples with ratings no less than 4 as positive and the rest as negative. Meanwhile, a 1:1 negative sampling ratio is implemented based on time order, and each dataset is randomly divided into a training set (\(80\%\)), a validation set (\(10\%\)), and a test set (\(10\%\)) by convention, ensuring that each user and item has at least one instance in both the training set and the test set.
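The sketch below shows a rough version of the labeling and splitting step described above, assuming each interaction record carries a numeric `rating` field; the paper's exact time-ordered 1:1 negative sampling and the per-user/per-item coverage constraint are not reproduced here.

```python
import random
from typing import Dict, List, Tuple


def label_and_split(interactions: List[Dict], seed: int = 42
                    ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Label Amazon interactions (rating >= 4 -> positive, else negative) and
    randomly split into 80% train / 10% validation / 10% test."""
    for rec in interactions:
        rec["label"] = 1 if rec["rating"] >= 4 else 0
    random.Random(seed).shuffle(interactions)
    n = len(interactions)
    train = interactions[: int(0.8 * n)]
    valid = interactions[int(0.8 * n): int(0.9 * n)]
    test = interactions[int(0.9 * n):]
    return train, valid, test
```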

Baseline methods

To conduct an extensive evaluation of our proposed MLLM-SRec model, we comprehensively compare it with representative methods of traditional SR, multimodal recommendation, LLM-based SR, and MLLM-based recommendation:

  • Conventional SR models: SASRec33 is a Transformer-based sequential recommendation model that employs a self-attention mechanism to model global dependencies within user behavior sequences, effectively capturing both long-term and short-term user interest preferences. BERT4Rec34 utilizes BERT’s bidirectional Transformer encoder for recommendation, simultaneously considering contextual information in the user behavior sequence and thus improving the precision and personalization of recommendations.

  • Multimodal recommendation model: MMSR42 proposes a graph-based adaptive fusion method for multimodal features. After constructing a multimodal sequence graph for each user, it employs a dual attention mechanism to independently aggregate heterogeneous and homogeneous node information to achieve adaptive adjustment of the sequence, thereby facilitating recommendations that simultaneously consider both sequence and cross-modal aspects. MMGCN27 is a multimodal recommendation framework based on Graph Convolutional Networks (GCN). Using the message-passing paradigm, it integrates multimodal features into the recommendation system. By modeling specific modal representations of users and items, it effectively captures user preferences across different modalities and fuses this information to enhance recommendation accuracy.

  • SR models based on LLMs: GPTRec43 is an SR model based on the GPT-2 architecture. It generates complex, interdependent recommendation lists item by item through a novel SVD tokenization algorithm and a Next-K recommendation strategy. TALLRec44 is an efficient and effective framework for aligning LLMs with recommendation tasks; it focuses on fine-tuning LLMs with recommendation data to build a large recommendation language model.

  • MLLMs-based models: VIP516 proposes a parameter-efficient multimodal base recommendation model that unifies visual, linguistic, and personalized information. Using the designed multimodal personalized prompts, visual signals are integrated with text and personalized information to enhance recommendations across multiple modalities. Rec-GPT4V29 is a VST inference scheme that generates item image summaries through prompt MLLMs. It combines image understanding in natural language space with item titles to query user preferences for candidate items, utilizing large vision-language models for multimodal recommendation.

Evaluation metrics

We adopt the widely used ranking-based evaluation metrics12 HR@K (H@K), NDCG@K (N@K), and Recall@K (R@K), together with AUC, to evaluate the performance of the baseline methods and our proposed MLLM-SRec for multimodal SR. Higher values of HR, NDCG, Recall, and AUC indicate better model performance. To ensure a fair comparison, we standardize the size of the candidate item set across all baseline methods and our method.
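For reference, minimal implementations of the ranking metrics with binary relevance are sketched below; function names and the binary-relevance assumption are ours.

```python
import math
from typing import Sequence, Set


def hr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """HR@K: 1 if any relevant item appears in the top-K ranked list, else 0."""
    return float(any(item in relevant for item in ranked[:k]))


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Recall@K: fraction of relevant items retrieved within the top-K."""
    return sum(item in relevant for item in ranked[:k]) / max(len(relevant), 1)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@K with binary relevance."""
    dcg = sum(1.0 / math.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```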

Implementation details

All experiments were conducted on an Ubuntu 22.04 LTS server equipped with three RTX 4080 64GB GPUs. For open-source models, we selected Qwen2-VL-7B-Instruct and Llama-3.1-8B-Instruct (for summarizing user preferences), both released on Hugging Face, as the backbone MLLM and LLM of MLLM-SRec, respectively. Based on the Hugging Face Transformers library, parameter-efficient fine-tuning (PEFT), distributed acceleration, and low-precision computing techniques were adopted. All methods were implemented with Python 3.11.7 and PyTorch 2.2.1+cu118. All reported results are the average performance of at least 5 repeated runs to mitigate the impact of randomness.

Overall performance (RQ1)

To comprehensively evaluate the effectiveness of our proposed MLLM-SRec model, we take traditional SR, MSR, popular LLM-based SR, and emerging MLLM-based recommendation methods as baselines and conduct comparative experiments on four datasets for SR tasks, reporting their performance on the HR@K, NDCG@K, and Recall@K evaluation metrics. The experimental results are shown in Table 3.

Table 3 Performance comparison of different baseline methods. The best and second-best results are highlighted in bold and italics, respectively.

From the experimental results, we observe that MLLM-SRec consistently outperforms the MLLM-based methods and LLM-based SR methods, and significantly surpasses traditional SR models and multimodal SR models, achieving superior performance on all metrics. Specifically, compared to the strongest baseline, MLLM-SRec demonstrates an average improvement of \(9.90\%\) across the four datasets, with average improvements of \(11.4\%\), \(12.83\%\), and \(18.07\%\) in the Recall, NDCG, and HR metrics, respectively. This verifies the accuracy and superiority of MLLM-SRec in leveraging multimodal data through MLLMs for SR tasks.

Research indicates that pre-trained MLLMs possess extensive world knowledge and advanced visual and linguistic comprehension capabilities to interpret complex user intents and contextual nuances. This enables them to effectively extract intricate multimodal information, align visual and textual representations, fully capture dynamic user behavior preferences, and significantly enhance recommendation performance through instruction tuning on multimodal sequential interaction data. In contrast to VIP5 and Rec-GPT4V, the MLLM-SRec framework can comprehensively understand textual and visual information through our IMSG component designed for multimodal data, and accurately fuse visual and textual representations. This alignment allows for a deeper understanding of the multifaceted features of items while modeling their intrinsic correlations, simplifying the information-processing pipeline, and significantly improving item representation quality. Additionally, the TUBC component can dynamically balance explicit user preferences and implicit semantic information. Unlike conventional PEFT or basic CoT prompt optimization techniques, our approach synergistically integrates QLoRA-based local parameter fine-tuning with multistep CoT output alignment, leading to significant improvements in MLLMs’ comprehension of multimodal interaction data and in recommendation performance. Compared to LLM-based recommendation methods that rely solely on textual information, our MLLM-SRec model successfully integrates multimodal information and unleashes the potential of MLLMs. This fully demonstrates the value of integrating advanced MLLM fine-tuning techniques to enhance SR using multimodal interaction sequence data.

Ablation study

Impact of different input modalities (RQ2)

To comprehensively explore the impact of different modalities on the performance of MLLM-SRec and further investigate the contribution of multimodal fusion and various complementary knowledge of modalities to SR, we conducted in-depth experiments on different input modalities. Performance results are shown in Table 4.

Table 4 Performance comparison with different modality inputs. The best results are in bold and the second-best results are in italics. The improvement is calculated over the second-best result.

This table compares the impact of different inputs—image-only, text-only, and image+text—on the recommendation performance of the MLLM-SRec model in SR tasks. The analysis reveals that the combination of text and image modalities consistently achieves the best performance on all evaluation metrics. Compared to using only the text modality, MLLM-SRec significantly enhances recommendation performance by incorporating multimodal information, particularly image information. In contrast, using only images generally performs slightly worse than using only text. Although visual information tends to capture user attention more effectively, images may contain noise such as background content, and additional irrelevant information may be captured when interpreting images, negatively impacting recommendation performance. By contrast, the multimodal input comprising text and images achieves alignment between visual and textual content during image understanding through our components, effectively enhancing the accuracy of the model. These findings suggest that our method has further potential for improvement in multimodal fusion. Since each modality provides unique knowledge that cannot be captured by the others, the fusion of multiple modalities outperforms single-modality approaches. By leveraging their extensive world knowledge and effectively integrating complementary information within and across modalities, MLLMs demonstrate the strong potential of MLLM-SRec as an SR method enhanced by multimodal information.

Impact of different components of the MLLM-SRec (RQ3)

To evaluate the individual contributions of each component in our MLLM-SRec framework, we developed several variants of MLLM-SRec, each described as follows:

  • MLLM-SV: In this version, we employ a VQA-based approach instead of image captioning to understand the item images. Specifically, the VIU component replaces the item image captioning in Base Model, enabling noise-free comprehension of visual data through a unified questioning framework.

  • MLLM-SVI: This variant utilizes both VIU and IMSG components to process multimodal interaction sequences. It relies entirely on textual and visual summaries, using LLMs for item multimodal summary generation.

As shown in Table 5, we conducted ablation experiments on all variants, including the Base Model, using the corresponding fine-tuning strategies to validate their effectiveness in sequential recommendation.

Table 5 The performance of MLLM-SRec and its variants. The improvement is calculated over the base model result.

It is readily observable that our primary model, MLLM-SRec, consistently outperforms its variants, MLLM-SV and MLLM-SVI, achieving the best performance and highlighting the importance of each component to the overall effectiveness of the model. The variant MLLM-SVI, which uses IMSG to produce multimodal item summaries, achieves the second-best performance. This result verifies the importance of integrating multimodal data through IMSG, which is crucial for understanding user preferences across different modalities. This approach significantly compensates for the incompleteness of textual information, thereby enabling a more comprehensive understanding of the multimodal information of items. Additionally, while visual information tends to capture user attention more effectively than pure text, images themselves may contain excessive redundant information, potentially introducing additional noise into recommendation tasks. The MLLM-SV variant, which excludes dynamic user behavior preference understanding and relies solely on VIU, effectively mitigates noise and extracts more relevant fine-grained features from images. Its performance lies between the Base Model and MLLM-SVI, indicating that the VQA approach can achieve a more accurate understanding of visual data than image captioning. By further incorporating our proposed TUBC component, we can accurately capture users’ dynamic interests. As shown in the table, MLLM-SRec, which incorporates all of these components, achieves significant performance improvements, demonstrating that our approach can effectively utilize multimodal data, more accurately reflect users’ current interests, and reduce the negative impacts of data noise, cross-modal semantic gaps, and lengthy sequences. This further underscores the importance of capturing the dynamic evolution of user preferences.

Impact of fine-tuning strategy selection (RQ4)

In this section, we discuss the application of QLoRA-based PEFT and 4-step CoT prompt learning to demonstrate the impact of different tuning strategies on model performance. To better align MLLMs with multimodal sequential recommendation tasks and obtain more robust personalized recommendation capabilities, we tried two fine-tuning approaches: (i) PEFT via QLoRA to optimize the model parameters and (ii) a further 4-step CoT prompting technique to optimize the output results.

Fig. 5

Performance comparison of different fine-tuning strategies. The AUC score for MLLM-SRec is calculated on four datasets using QLoRA and further 4-step CoT strategy.

As illustrated in Fig. 5, the average AUC scores of the two fine-tuning approaches are 0.81 and 0.84, respectively, demonstrating that both fine-tuning strategies are necessary to achieve better recommendation performance. The AUC score of the QLoRA fine-tuning method alone remains around 0.81, whereas after 4-step CoT prompting, MLLM-SRec shows a significant improvement in AUC on all datasets, with a maximum improvement of 5.54% over the former. This demonstrates that, unlike conventional PEFT or basic CoT prompt optimization techniques, our approach synergistically integrates QLoRA-based local parameter fine-tuning with multistep CoT output alignment, thereby significantly enhancing the MLLM’s comprehension of multimodal interaction data and the effectiveness of the model’s outputs on top of parameter optimization. The experimental results further validate that there is a considerable gap between multimodal and recommendation tasks, and demonstrate the importance of multimodal data utilization and model fine-tuning in eliciting the recommendation capability of MLLMs.

Impact of MLLMs hallucinations on recommendation performance

The superior multimodal understanding and language generation abilities of MLLM-based RSs are compromised by input hallucination phenomena, particularly when handling multiple images, irrelevant sequential image-text pairs, or images with visual noise. Their frequent inability to properly interpret visual contexts or discern temporal/logical connections between items often results in recommendations that are irrelevant or deviate from user intentions, potentially hindering their effectiveness in SR scenarios. We employ the widely used CLIP cross-modal semantic distance metric45 to evaluate the impact of input hallucination on MLLMs. For given sequential image-text pairs, we assess the hallucination effect by measuring the average semantic similarity between the MLLM-generated image interpretation and the input text. Higher CLIP scores indicate stronger hallucination resistance of the model.
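A minimal sketch of this CLIP-based similarity check using Hugging Face's CLIP implementation is shown below; the specific checkpoint and the choice to compare the two texts with CLIP's text encoder are our assumptions about the measurement, not details stated in the paper.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_text_similarity(generated_interpretation: str, input_text: str) -> float:
    """Cosine similarity between CLIP text embeddings of the MLLM-generated
    image interpretation and the corresponding input item text."""
    batch = tokenizer([generated_interpretation, input_text],
                      padding=True, truncation=True, return_tensors="pt")
    emb = model.get_text_features(**batch)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # L2-normalize both embeddings
    return float(emb[0] @ emb[1])
```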

Fig. 6

Impact analysis of hallucinations on recommendation performance. (a) Hallucination analysis of individual models on different datasets. (b) Multimodal anomalous input data for verifying hallucinations.

As illustrated in Fig. 6a, MLLM-SRec outperforms MLLM-based RSs and their variants in hallucination resistance across all benchmark datasets. In the hallucination robustness evaluation, when exposed to singular or sequential data anomalies (e.g. complex backgrounds, noisy/distorted images, or erroneous image-text pairs) as depicted in Fig. 6b, MLLM-SRec consistently maintains stable recommendation performance, whereas competing models either generate hallucinations or exhibit reasoning failures. The study demonstrates that for specific task scenarios and interaction data, SFT and effective CoT prompting can significantly mitigate hallucinations in MLLMs. Our approach enables MLLMs to generate reasoning steps and produce more authentic and useful responses, thereby enhancing recommendation performance.

Case study

In this section, we conduct a case study to visually demonstrate the effectiveness of incorporating multimodal information into MLLM-based SR. We present an SR example for multimodal item interaction data obtained from a self-built campus services recommendation platform and further analyze how MLLM-SRec assists MLLMs in accurately understanding multimodal information and better capturing dynamic user behavior preferences. We input user multimodal sequential interaction data into MLLM-SRec to predict the user’s preference for the next item, thereby providing personalized recommendations. The results are shown in Fig. 7.

Fig. 7

The case study of MLLM-SRec on item recommendation. The left half of the figure shows the user’s multimodal interaction records, while the right side displays the next recommended item. Each row represents the multimodal interaction behavior of a specific user and its corresponding next recommended item.

We observe that MLLM-SRec accurately understands the multimodal semantic information of items across different multimodal interaction sequences, generates a unified dynamic user behavior preference, and accurately recommends the next item to the user. Compared with text-only input, our method accurately grasps the internal semantics of different modalities, especially the visual modality, captures the correlations between multimodal sequences, and deeply understands users’ temporal behavior preferences, which contributes significantly to personalization. These results demonstrate the effectiveness of integrating and understanding multimodal interaction sequences in MLLM-based SR.

Conclusion and future work

The latest advances in MLLMs have opened new avenues for investigating intelligent RSs. This study systematically explores the feasibility of MLLMs for processing sequential multimodal interaction data and proposes a promising MLLM-SRec framework. By deeply fusing dynamic multimodal sequential information with the semantic understanding capabilities of MLLMs, the framework can effectively mitigate cross-modal discrepancies and visual noise, comprehensively capture the dynamic evolution of user preferences, and effectively alleviate the underutilization of multimodal interaction data in generative recommendation.

Future research will systematically investigate the trade-off between inference latency and computational cost in MLLM-based RSs. Through the development of an end-to-end joint training framework that integrates advanced MLLMs with enhanced parameter capacity, optimized prompt engineering strategies, and extended multimodal data coverage, our aim is to establish more robust, energy-efficient, and scalable multimodal recommendation architectures.