Introduction

The rapid growth of e-commerce has revolutionized the way consumers shop, offering unparalleled convenience and accessibility. However, one persistent challenge in online retail, particularly in the fashion industry, is that customers cannot try clothing on before purchasing it. This limitation frequently leads to dissatisfaction, increased return rates, and additional operational costs for retailers. Addressing this issue is critical not only for improving customer satisfaction but also for enhancing the efficiency and sustainability of online shopping systems.

Image-based Virtual Try-On (VITON) technology leverages deep learning and computer vision to allow users to visualize garments on their bodies without physically trying them on1,2,3,4,5,6,7,8,9. Early systems relied on Convolutional Neural Networks (CNNs)17 for feature extraction and Generative Adversarial Networks (GANs)18 to generate realistic images, whereas recent approaches have adopted Diffusion Models19,20 to enhance accuracy10,11,12,13,14,15. Despite these advancements, VITON systems still struggle with misalignment, occlusion, and the loss of fine garment details such as textures, patterns, and colors, particularly when handling intricate garments or diverse body shapes. These limitations affect the overall realism and practicality of the generated try-on results. This research introduces an enhanced VITON model that addresses these limitations by incorporating depth maps and multi-head attention mechanisms. The inclusion of depth maps16 adds a layer of spatial understanding, allowing the model to achieve a more accurate alignment of garments with the user’s body shape and pose. To further refine the garment representation, a garment refinement module removes irrelevant sections, ensuring a cleaner and more precise depiction of the garment. Additionally, multi-head attention mechanisms21 enable robust feature extraction and representation, preserving intricate details such as texture, patterns, and structure throughout the generation process. These advancements have been achieved while maintaining computational efficiency, making the model a practical and scalable solution for real-world applications.

The model was designed using a multistage architecture. In the first stage, the garment and body features are extracted using dual encoders that incorporate convolutional and attention-based layers. These features are then combined in the second stage to generate a semantic garment representation and a refined body segmentation map. Finally, a high-resolution generator synthesizes the try-on image, achieving a realistic appearance that closely mimics a physical try-on experience. The pre-trained generator was loaded directly, which illustrates the transfer learning approach used in this study.

Beyond its technical contributions, this study also examines the broader implications of VITON technology in the e-commerce sector. By improving the accuracy and realism of try-on results, the proposed model can significantly enhance the online shopping experience, increase customer satisfaction, and reduce product returns. Moreover, the model promotes sustainable practices by minimizing the environmental impact associated with high return rates and excessive production.

Related work

Image-based virtual try-on

Image-based virtual try-on aims to generate realistic images of individuals wearing target garments using visual inputs from the person and clothing. This approach eliminates the need for physical trials, leveraging computer vision to address key challenges, such as garment alignment, occlusion handling, and detail preservation.

Pose-guided methods

Early works, such as VITON3 and CP-VTON4, used pose estimation maps to align garments with body features. These approaches provide basic alignment, but struggle with complex poses and occlusions, leading to artifacts and misalignments.

Garment warping

Garment warping techniques, such as thin-plate spline (TPS)22,23, transform clothing images to fit the target body. While computationally efficient, TPS-based methods lack the flexibility required for complex garment shapes. To improve accuracy, more advanced techniques like Spatial Transformer Networks (STN)24 and FlowNet25 were introduced. STN enables spatial transformations by learning global parameters, avoiding direct pixel-wise operations. On the other hand, FlowNet generates pixel-wise displacement maps, providing finer control for image warping. Despite these advancements, challenges remain in seamlessly adapting loose or flowing garments to the target body.

Segmentation maps

Human segmentation maps play a key role in disentangling body and garment features, thereby enabling spatially consistent synthesis. Methods such as VITON-HD1 and HR-VITON26 enhance segmentation quality, improving garment-body separation and overall image realism.

Challenges in existing methods

Despite advancements, current virtual try-on systems face the following challenges:

  • Alignment errors: Accurate garment alignment remains difficult, especially with occlusions or complex poses.

  • Detail loss: Fine garment textures and patterns are often not preserved.

  • Artifacts: Imperfections in warping and segmentation lead to distortions, reducing realism.

To overcome these challenges, further innovation is required in alignment techniques, detail preservation mechanisms, and integration strategies to ensure high-quality and realistic virtual try-on results.

Methodology

This study introduces the Depth-Attention Virtual Try-On (DA VITON) model, a novel framework designed to overcome the limitations of traditional virtual try-on systems. The proposed method uses a multistage architecture in which each stage refines and uses the input information to produce realistic virtual try-on outputs. Key innovations include depth maps for spatial context, a garment refinement module for improved segmentation, and multi-head attention mechanisms to enhance detail preservation. An overview of the proposed Depth-Attention Virtual Try-On model architecture is presented in Fig. 1.

Fig. 1. Overview of the proposed depth-attention virtual try-on model architecture.

Preprocessing

The preprocessing step is a critical component of the pipeline that ensures that vital information is categorized and provided to the model in distinct sections. At this stage, the Garment Refinement Module plays a vital role. It begins by parsing the input garment image to identify and isolate relevant regions while removing unnecessary internal sections, such as the inner parts of collars or sleeves. The Garment Refinement Module process is illustrated in Fig. 2.

Fig. 2. Garment refinement module process.

This module uses depth-based calculations and a garment depth mask, as shown in Fig. 2, to exclude irrelevant sections accurately. By refining the clothing at the very start, this module eliminates extraneous details and provides a cleaner and more accurate representation for the model to process. This step is essential for improving the segmentation accuracy and reducing noise in subsequent stages.

The module uses depth maps to remove irrelevant internal sections (e.g., the inner parts of collars or sleeves), generating a refined garment mask for cleaner preprocessing.

The Garment Refinement Module operates in three stages:

  1. The input garment image is processed by a depth-estimation model (Depth Anything) to generate a depth map.

  2. Using the depth map, irrelevant internal sections (e.g., the inner parts of collars or sleeves) are identified and removed, producing a refined binary mask.

  3. The refined binary mask is applied to the input garment image, isolating only the relevant garment regions for cleaner and more accurate preprocessing.
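The three stages above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the depth map is assumed to have already been produced by the depth estimator, larger depth values are assumed to mean "farther from the camera", and interior regions are assumed to be separable by a simple threshold relative to the median garment depth. The names `refine_garment_mask`, `apply_mask`, and the `margin` parameter are hypothetical.

```python
import numpy as np

def refine_garment_mask(garment_mask, depth, margin=0.1):
    """Drop garment pixels that lie noticeably behind the front surface.

    garment_mask: (H, W) boolean array, True where the garment is.
    depth:        (H, W) float array, larger = farther away (assumed convention).
    margin:       hypothetical tolerance, as a fraction of the garment depth range.
    """
    garment_mask = garment_mask.astype(bool)
    values = depth[garment_mask]
    if values.size == 0:
        return garment_mask.copy()
    front = np.median(values)               # typical depth of the visible front
    span = values.max() - values.min()      # depth range inside the garment
    threshold = front + margin * span
    # Keep only garment pixels that are not much deeper than the front surface.
    return garment_mask & (depth <= threshold)

def apply_mask(image, mask):
    """Zero out every pixel outside the refined mask (image is H x W x C)."""
    return image * mask[..., None].astype(image.dtype)
```

In this sketch a pixel inside a collar opening, which sits deeper than the front of the garment, is removed from the mask, while the rest of the garment is kept.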

Feature extraction

Once the preprocessing is complete, the model begins with the feature extraction stage. As shown in the architectural diagram, this stage utilizes two parallel encoders to extract essential features.

  1. Garment encoder. The model processes the garment image using a ResNet-based architecture enhanced with attention layers. This architecture not only extracts high-level features but also focuses on intricate details such as textures, patterns, and shapes, ensuring a more realistic and accurate representation of the garment.

  2. Body encoder. The model processes the input person image, incorporating pose, body shape, and an auxiliary depth map to provide spatial context. This depth map enables the encoder to capture spatial relationships between the garment and the target body for a precise and realistic fit.

The outputs of these encoders provide the foundational feature maps required for accurate garment-body integration.
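To make the role of the attention layers more concrete, the following is a minimal NumPy sketch of standard multi-head self-attention applied to a flattened feature map. How the attention layers are actually wired into the encoders is not specified in the text, so the input layout (one row per spatial location), the projection matrices, and the function name `multi_head_self_attention` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Standard scaled dot-product attention with several heads.

    tokens: (N, D) array, e.g. a feature map flattened to N = H*W locations.
    w_q, w_k, w_v, w_o: (D, D) projection matrices (learned in practice).
    Returns the (N, D) output and the (heads, N, N) attention weights.
    """
    n, d = tokens.shape
    if d % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = d // num_heads

    def split(x):  # (N, D) -> (heads, N, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, N, N)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, N, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(n, d)        # concatenate heads
    return merged @ w_o, weights
```

Each head attends over all spatial locations independently, which is one way a layer can relate distant parts of a pattern or texture.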

Garment-body integration

In the integration stage, the extracted garment and body features are combined to align the clothing with the target body. This stage leverages the following processes:

  1. Target garment transformation: The model integrates detailed features of the target garment with pose and body structure information extracted from the person’s image. This process transforms the garment into a form that aligns naturally and seamlessly with the target body, ensuring a realistic fit and appearance.

  2. Semantic body segmentation map generation: Simultaneously, the model generates a segmentation map that outlines the body structure in the target state, identifying key regions for garment placement.

Image generation

In the final stage, the generator synthesizes a virtual try-on image. This stage utilizes refined features from earlier stages and employs a pre-trained generator for high-resolution image synthesis. By following this structured pipeline, the DA VITON model effectively transforms input images into realistic virtual try-on results while addressing challenges, such as alignment, occlusions, and detail preservation.

Loss functions

Training the Depth-Attention VITON model involves a combination of loss functions, each designed to target a specific stage of the virtual try-on pipeline. Applying these loss functions to distinct model components promotes accurate representation, precise alignment, and high-quality image synthesis. This process can be divided into two main stages: garment-body representation (Fig. 3) and final image generation.

Garment-body representation stage

Since this step involves synthesizing the garment in the context of the person’s body, it is crucial during the training phase to evaluate the accuracy of the garment representation in relation to the primary objective. Additionally, a similar assessment should be performed on the semantic segmentation maps to ensure that the model’s performance aligns with the intended outcomes. The garment-body integration process is outlined in Fig. 3.

Fig. 3. Overview of the Garment-Body Integration Process: The model employs separate garment and body encoders to extract features from the input images. The garment-body integration module fuses these features to generate the final output, guided by multiple loss functions, including VGG Loss, Total Variation Loss, L1 Loss, Cross Entropy Loss, and GAN Loss, to ensure realism and alignment.

Cross entropy loss

This loss is employed to enhance the semantic accuracy of predicted body segmentation maps, which guide the alignment process. It improves segmentation precision by penalizing incorrect class predictions at the pixel level:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\left(\hat{y}_{i,c}\right)$$
(1)
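As a small worked illustration of Eq. (1), the sketch below computes the summed pixel-wise cross entropy with NumPy. It assumes the predictions are already softmax probabilities flattened to one row per pixel and that the ground truth is given as integer class indices (equivalent to one-hot $y_{i,c}$); the function name and the `eps` stabilizer are illustrative choices.

```python
import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Summed cross entropy over pixels, following Eq. (1).

    probs:  (N, C) predicted class probabilities, one row per pixel.
    labels: (N,) integer class index of each pixel (one-hot y in Eq. (1)).
    """
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]   # probability assigned to the true class
    return float(-np.log(picked + eps).sum())
```

Many training frameworks average over pixels instead of summing; that only rescales the loss by a constant factor.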

L1 loss

The L1 loss is employed to minimize pixel-wise differences between the intermediate garment-body representations and the ground-truth maps. By enforcing pixel-level similarity, this loss encourages smooth and consistent alignment of garment and body features:

$$\mathcal{L}_{L1} = \frac{1}{N}\sum_{i=1}^{N}\left\| \hat{y}_{i} - y_{i} \right\|_{1}$$
(2)
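Eq. (2) can be written directly in NumPy. The sketch below assumes that $N$ indexes samples along the first axis and that the L1 norm sums absolute differences over all remaining axes; the function name is illustrative.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean over samples of the per-sample L1 norm, following Eq. (2).

    pred, target: arrays of shape (N, ...) with identical shapes.
    """
    diff = np.abs(pred - target).reshape(pred.shape[0], -1)
    return float(diff.sum(axis=1).mean())
```

Note that common library implementations average over every element rather than over samples only, which differs from this form by a constant factor.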

VGG loss

Perceptual loss evaluates feature-level similarity between the predicted and target representations using a pre-trained VGG network. This loss focuses on preserving high-level semantic features and texture information essential for realistic garment-body alignment:

$$\mathcal{L}_{VGG} = \sum_{i=1}^{L}\frac{1}{C_{i}H_{i}W_{i}}\left\| \varphi_{i}\left(\hat{y}\right) - \varphi_{i}\left(y\right) \right\|_{2}^{2}$$
(3)
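The structure of Eq. (3) is easy to see in code once the feature maps are available. The sketch below assumes the activations $\varphi_i(\hat{y})$ and $\varphi_i(y)$ have already been extracted from the same $L$ layers of a fixed pre-trained network; the extraction itself and the choice of layers are not shown, and the function name is hypothetical.

```python
import numpy as np

def perceptual_loss(feats_pred, feats_target):
    """Sum of size-normalized squared feature differences, following Eq. (3).

    feats_pred, feats_target: lists of (C_i, H_i, W_i) arrays, one per layer,
    assumed to come from the same layers of a frozen feature extractor.
    """
    total = 0.0
    for fp, ft in zip(feats_pred, feats_target):
        c, h, w = fp.shape
        total += np.sum((fp - ft) ** 2) / (c * h * w)
    return float(total)
```

Dividing by $C_i H_i W_i$ keeps layers with large feature maps from dominating the sum.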

GAN loss

To ensure that the generated garment-body representation appears realistic, the GAN loss is employed. It introduces a discriminator to distinguish between real and generated representations, encouraging the generator to produce outputs indistinguishable from real data:

$$\mathcal{L}_{GAN} = -\log\left(D\left(G\left(x\right)\right)\right)$$
(4)
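A minimal numerical sketch of Eq. (4) follows. It assumes the discriminator outputs probabilities in (0, 1) and averages the per-sample term over a batch; the averaging, the `eps` stabilizer, and the function name are assumptions for illustration.

```python
import numpy as np

def generator_gan_loss(d_fake, eps=1e-12):
    """Batch mean of -log D(G(x)), following Eq. (4).

    d_fake: array of discriminator probabilities for generated samples.
    """
    return float(-np.log(d_fake + eps).mean())
```

The loss approaches zero as the discriminator assigns generated samples a probability close to one, and grows as they are recognized as fake.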

Total variation loss (TV)

The TV loss is applied to reduce visual artifacts and enforce spatial smoothness in the intermediate garment-body representations. By minimizing abrupt intensity changes, it encourages spatial coherence and smoother textures across garment regions:

$$\mathcal{L}_{TV}(x) = \sum_{i,j}\sqrt{\left(x_{i+1,j}-x_{i,j}\right)^{2}+\left(x_{i,j+1}-x_{i,j}\right)^{2}}$$
(5)
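The isotropic total variation of Eq. (5) can be sketched for a single-channel image as follows. The sum is taken over pixels that have both a lower and a right neighbour, which is one common boundary convention and an assumption here; the function name is illustrative.

```python
import numpy as np

def total_variation(x):
    """Isotropic total variation of a 2-D array, following Eq. (5).

    x: (H, W) array. Border pixels without both neighbours are skipped.
    """
    dv = x[1:, :-1] - x[:-1, :-1]   # x[i+1, j] - x[i, j]
    dh = x[:-1, 1:] - x[:-1, :-1]   # x[i, j+1] - x[i, j]
    return float(np.sqrt(dv ** 2 + dh ** 2).sum())
```

A constant image has zero total variation, while sharp intensity jumps between neighbouring pixels increase it.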

Loss weighting

To balance the contribution of different objectives during training, we used a weighted sum of the individual loss terms. The generator loss was defined as:

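$$\mathcal{L}_{G} = \left(10\,\mathcal{L}_{L1} + \mathcal{L}_{VGG} + \lambda_{TV}\,\mathcal{L}_{TV}\right) + \left(\lambda_{CE}\,\mathcal{L}_{CE} + \lambda_{GAN}\,\mathcal{L}_{GAN}\right)$$
(6)

Here $\lambda_{TV}$, $\lambda_{CE}$, and $\lambda_{GAN}$ are the weights set by the training options opt.tvlambda, opt.CElamda, and opt.GANlambda, respectively, and $\mathcal{L}_{L1}$ is the L1 loss computed on the garment (loss_l1_cloth).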
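The weighted sum in Eq. (6) amounts to the following small function. Only the fixed factor of 10 on the L1 term comes from the equation; the numeric values of the other weights are not stated in the text, so any values passed in are placeholders, and the function and argument names are hypothetical.

```python
def generator_loss(l1, vgg, tv, ce, gan, tv_lambda, ce_lambda, gan_lambda, l1_weight=10.0):
    """Combine the individual loss terms as in Eq. (6).

    l1, vgg, tv, ce, gan: scalar values of the individual losses.
    tv_lambda, ce_lambda, gan_lambda: weights (values are not given in the text).
    """
    appearance = l1_weight * l1 + vgg + tv_lambda * tv
    semantic_and_adversarial = ce_lambda * ce + gan_lambda * gan
    return appearance + semantic_and_adversarial
```

With every loss term equal to 1 and placeholder weights of 2, 10, and 1, the result is 10 + 1 + 2 + 10 + 1 = 24.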

The weights were selected based on default settings used throughout the training process and were not fine-tuned. A higher weight was assigned to the L1 loss to compensate for its naturally smaller magnitude, ensuring it remains comparable in scale to other loss components. Similarly, the cross-entropy loss was weighted more heavily to enforce accurate semantic segmentation, while perceptual, smoothness, and adversarial losses contributed at balanced scales to optimizing overall visual quality and realism.

The discriminator is trained with the standard adversarial loss, defined as the sum of real and fake classification errors:

$$\mathcal{L}_{D} = \mathcal{L}_{D}^{fake} + \mathcal{L}_{D}^{real}$$
(7)
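Eq. (7) can be illustrated with a binary cross-entropy form of the two terms. The text does not state which adversarial formulation is used for the real and fake errors, so the log-based form below (chosen to match Eq. (4)), the batch averaging, and the function name are assumptions.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Sum of real and fake classification errors, following Eq. (7).

    d_real: discriminator probabilities for real samples (should approach 1).
    d_fake: discriminator probabilities for generated samples (should approach 0).
    """
    loss_real = -np.log(d_real + eps).mean()
    loss_fake = -np.log(1.0 - d_fake + eps).mean()
    return float(loss_fake + loss_real)
```

A discriminator that cannot tell the two apart and outputs 0.5 everywhere incurs a loss of 2 log 2 under this form.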

Final image generation stage

In the final stage, the try-on image is synthesized from the aligned garment-body features. We employed the pre-trained generator from the HR-VITON framework26, without applying any additional fine-tuning or re-training. This generator was originally trained to produce high-resolution virtual try-on images and was directly integrated into our pipeline.

Although this component was used as-is, without gradient updates, the overall system still produces high-quality and realistic outputs, as confirmed by both quantitative results and human evaluation. This demonstrates that the preceding modules — particularly our depth-guided garment-body integration — provide accurate and detailed representations, enabling the fixed generator to perform effectively within our framework.

Experiments

Experimental setup

Training setup

All models were trained on a single NVIDIA RTX 3090 GPU with 24 GB of memory. The Garment-Body Integration module was trained for approximately 150 epochs with a batch size of 16, which took around 32 hours in total.

Datasets

For the experiments, we used the high-resolution virtual try-on dataset introduced by VITON-HD1, which contains 13,679 pairs of frontal-view images of women and top garments. The original resolution of the images is 1024 × 768, and the images are bicubically downsampled to the desired resolution when needed. We split the dataset into a training set and a test set with 11,647 and 2,032 pairs, respectively30.

Compared methods

We evaluated our model on the VITON-HD dataset by comparing it with several state-of-the-art virtual try-on (VITON) methods. These include GAN-based approaches such as VITON-HD1 and HR-VITON26, LDM-based methods such as LaDI-VTON10 and StableVITON28, as well as the most recent framework, CatV2TON29. This comparison provides a comprehensive perspective on how our method performs relative to both established baselines and innovative models.

Evaluation metrics

We evaluated the results generated in both paired and unpaired settings. In the paired setting, the input person and the corresponding target garment are provided to reconstruct the original appearance. In the unpaired setting, the garment is intentionally replaced with a different one, simulating realistic try-on scenarios with diverse clothing types. This dual evaluation setup allows for a more comprehensive assessment of each model’s practical applicability and generalization capability.

Quantitative evaluation was carried out on images synthesized at a resolution of 1024 × 768 pixels. In the paired setting, we used LPIPS and SSIM to measure perceptual and structural similarity between the generated images and the ground truth.

LPIPS was computed using the official PyTorch implementation from27 with the AlexNet backbone, which offers a favorable trade-off between computational efficiency and perceptual sensitivity. This choice is consistent with prior work in virtual try-on, where AlexNet has been widely adopted for its ability to effectively evaluate human-centric image synthesis with low complexity. In the unpaired setting, we used FID and KID to evaluate the realism and distributional fidelity of the synthesized images, following standard practices established in recent literature.

Results

Quantitative results

Our model outperforms state-of-the-art methods on the paired-setting metrics. Notably, it obtains the lowest LPIPS score among all compared methods, indicating the highest perceptual similarity to the ground truth and demonstrating effective alignment between garments and body features. Furthermore, it achieves the highest SSIM score, highlighting its robustness in preserving structural details and visual consistency. These results indicate that our method generates more coherent and perceptually faithful try-on images than both GAN-based and diffusion-based baselines.

Although our FID and KID scores are not the best overall, they remain competitive and clearly outperform all non-diffusion-based baselines, such as VITON-HD and HR-VITON. The slight performance gap in FID and KID compared to diffusion-based models (e.g., LaDI-VTON) can be attributed to the known strengths of diffusion architectures in modeling fine-grained textures and photorealism. Nevertheless, our model strikes a practical balance by delivering high visual quality with significantly lower computational cost and faster inference. A detailed comparison of these results is presented in Table 1, and a visual comparison is illustrated in Fig. 4.

Table 1. Quantitative comparison of the proposed model with baseline virtual try-on methods on the VITON-HD dataset at resolution 1024 × 768.

Fig. 4. Graphical comparison of the proposed method with the baselines.

Qualitative results

The qualitative results highlight the effectiveness of the proposed Depth-Attention VITON model in generating high-quality try-on images across a range of scenarios, including complex poses, occlusions, and intricate garment patterns. A comparative analysis with state-of-the-art models such as VITON-HD, HR-VITON, and LaDI-VTON demonstrates the significant advantages of our approach. A representative qualitative example comparing our method with baselines is depicted in Fig. 5.

Alignment and occlusions

Unlike baseline models, our method successfully aligns garments with complex body poses while minimizing occlusion-related artifacts. For instance, in challenging cases where the arms or torso partially obstruct the garment, competing models often produce distorted results, whereas our model maintains garment integrity and natural alignment.

Detail preservation

The integration of depth maps and multi-head attention mechanisms ensures superior preservation of garment details. Intricate textures and patterns that appear blurred or inconsistent in other models are accurately replicated in our outputs. A clear example is the accurate rendering of fine embroidery and lace patterns, which are often lost in traditional methods.

Collar accuracy

The introduction of the garment refinement module significantly improves the accuracy of garment representation around critical areas like the collar. By removing unnecessary internal sections of the garment, our model achieves precise alignment and realistic detail in the collar region, outperforming competing approaches.

Fig. 5. A representative qualitative example comparing with baselines.

Overall realism

Side-by-side visual comparisons illustrate the enhanced realism of our model. The outputs exhibit natural transitions between the garment and the body, with realistic folds, shadows, and textures. Competing models frequently display artifacts such as unnatural edges or color mismatches, which are notably absent in our results.

The improvements achieved by our model not only enhance the aesthetic quality but also expand the practical applications of virtual try-on systems. The robust handling of diverse garment types and body shapes makes our method particularly suitable for e-commerce scenarios, where visual accuracy is paramount. A qualitative comparison with baselines at 1024 × 768 resolution is illustrated in Fig. 6.

Fig. 6. Qualitative comparison with baselines at 1024 × 768 resolution.

Human evaluation

To complement the quantitative evaluation metrics, we conducted a user-centered perceptual study to assess the visual quality of try-on results from different methods. A diverse set of test image series was used, and ratings were collected from individuals outside the research team to ensure impartiality and mitigate bias.

Participants were asked to evaluate the generated images on two key criteria:

  • Visual realism: The degree to which the image appears natural, photorealistic, and free of artifacts.

  • Similarity to source: How well the generated image preserves the identity, pose, and body structure of the original person.

Each image was rated on a scale from 1 to 10, and the total scores were normalized based on the maximum possible value. Table 2 summarizes the normalized average scores across methods, and Fig. 7 visualizes the comparative results.
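The normalization described above can be sketched as follows. The text only states that total scores were divided by the maximum possible value, so the exact aggregation (summing all ratings for a method and dividing by the maximum rating times the number of ratings) and the function name are assumptions for illustration.

```python
def normalized_score(ratings, max_rating=10):
    """Normalize a method's total rating by the maximum achievable total.

    ratings: iterable of individual ratings on a 1..max_rating scale.
    Returns a value in (0, 1], where 1 means every rating was the maximum.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / (max_rating * len(ratings))
```

For example, two ratings of 10 and 5 give a total of 15 out of a possible 20, that is, a normalized score of 0.75.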

Table 2 Normalized results of human evaluation based on perceptual similarity to the source and overall realism.

As shown in Table 2, our method achieved the highest score in terms of similarity to the source image, indicating superior preservation of person-specific features and alignment. It also performed competitively in visual realism, demonstrating its effectiveness in generating perceptually convincing outputs.

Fig. 7. Comparison of normalized human evaluation scores across methods.

Interestingly, although our model does not rely on diffusion-based generation, it achieved a strong realism score while maintaining the highest structural similarity to the source. The best realism score was obtained by LaDI-VTON, a latent diffusion-based model that excels in producing visually rich and photorealistic images. This result aligns with the well-known strengths of diffusion models in capturing fine textures, lighting, and natural details, which often lead to higher perceived realism, albeit sometimes at the expense of structural fidelity and identity consistency.

This user-centered assessment complements automated metrics such as LPIPS and FID, offering a more direct and perceptually grounded measure of the generated images’ realism from the end-user perspective.

Conclusions

In this study, we introduced the Depth-Attention Virtual Try-On (DA VITON) model, a novel framework designed to overcome the limitations of existing VITON systems. By incorporating depth maps and multi-head attention mechanisms, our model achieves significant improvements in garment alignment, detail preservation, and overall visual quality. Extensive quantitative and qualitative evaluations on the VITON-HD dataset demonstrate the robustness and effectiveness of our approach in handling complex poses, occlusions, and intricate garment details.

A distinctive aspect of our approach is the introduction of depth maps and garment refinement as preprocessing steps, enabling more accurate and consistent outputs. Moreover, the generator used in the final stage was pre-trained and loaded directly, illustrating the application of transfer learning in our model design.

The proposed framework not only enhances the realism of virtual try-on results but also contributes to the broader adoption of VITON technologies in e-commerce. By addressing key challenges such as misalignment and artifact generation, our model paves the way for more reliable and user-friendly virtual try-on systems. Future work will focus on expanding the scope of the model to support more complex scenarios, including multi-garment try-ons, and integrating the framework with augmented reality technologies to provide an immersive user experience.