Abstract
This paper presents a layer-wise training procedure for neural networks that minimizes a Variance-Invariance-Covariance Regularization (VICReg) loss at each layer. The procedure is beneficial when annotated data are scarce but sufficient unlabeled data are available. Updating the parameters locally at each layer also mitigates problems of backpropagation such as vanishing gradients and initialization sensitivity. The procedure utilizes two forward passes instead of the forward and backward pass used in backpropagation: one forward pass works on the original data and the other on an augmented version of the data. It is shown that this procedure progressively constructs more compact yet informative representation spaces at each layer. The architecture of the model is selected to be pyramidal, enabling effective feature extraction. In addition, we optimize the weights of the variance, invariance, and covariance terms of the loss function so that the model can capture higher-level semantic information optimally. After training the model, we assess its learned representations by measuring clustering quality metrics and performance on classification tasks using a few labeled data. To evaluate the proposed approach, we conduct experiments on several datasets: MNIST, EMNIST, Fashion MNIST, and CIFAR-100. The experimental results show that the training procedure enhances the classification accuracy of Deep Neural Networks (DNNs) trained on MNIST, EMNIST, Fashion MNIST, and CIFAR-100 by approximately 7%, 16%, 1%, and 7%, respectively, compared to baseline models of similar architectures.
Introduction
In deep learning, having enough representative data is essential for a model to generalize well. However, in many cases, obtaining large amounts of labeled data and ensuring annotation quality is challenging. Due to the increased availability of information on the Internet and large unlabeled datasets, advanced techniques such as self-supervised learning are utilized to exploit this large amount of information without relying on labels1. Learning image representations using self-supervised methods maximizes the agreement between embeddings of images from different viewpoints or augmented versions. Variance-Invariance-Covariance Regularization (VICReg) is a self-supervised method that uses variance, invariance, and covariance loss terms to produce augmentation-invariant and non-redundant representations2. Consequently, training a Deep Neural Network (DNN) to construct an abstract and informative representation space using VICReg loss has been proven to be useful, leading to outstanding performance on various downstream tasks.
Although the presence of large datasets is important for DNNs to generalize well, other factors such as the choice of model architectures, hyperparameters, and learning procedures also play a significant role in achieving good performance3. DNNs are generally trained on large labeled datasets using backpropagation4. Although the recent success of deep learning is enabled by the use of backpropagation to train a deep model to optimize its parameters based on only one loss term, it has several limitations. First, backpropagation is biologically implausible: no convincing evidence has been found that the cortex stores neural activities or explicitly propagates error derivatives in a backward pass5,6. Another disadvantage of backpropagation is that it needs to know exactly what computation is performed in the forward pass; it is impossible to use backpropagation after inserting black boxes or non-differentiable components into the forward pass. Furthermore, backpropagation suffers from vanishing gradient problems that require the use of batch normalization or carefully chosen weight initialization techniques and activation functions7,8. This learning procedure also needs to store forward-pass computations and error derivatives, which makes it memory inefficient, and calculating and propagating derivatives backward is time-consuming. The Forward-Forward algorithm can alleviate these problems by enabling deep models to learn layer-wise with a simple objective without backpropagating error derivatives5. Advantages of this learning method over backpropagation include the ability to run on low-powered analog hardware without relying on reinforcement learning and its plausibility as a model of learning in the cortex.
Although the availability of large unlabeled datasets has increased in recent years, ensuring annotation quality and precision remains challenging due to the time-consuming and labor-intensive nature of the process. Utilizing extensive unlabeled datasets via self-supervised learning can be a solution, but it typically requires training large-scale DNNs with numerous parameters using backpropagation. This study aims to integrate self-supervised methods with a layer-wise training strategy to combine the advantages of both approaches. The combined procedure allows the model to learn features layer by layer, addressing the drawbacks of backpropagation and leveraging large unlabeled datasets effectively. The motivation behind this work is to utilize large unlabeled datasets while alleviating the limitations of backpropagation through layer-wise training. This work investigates the possibility of using VICReg loss as the layer-wise objective function for DNNs trained in a layer-wise manner, combined with simple data augmentation techniques. Overall, the contributions of this study are as follows:
1. This proposed learning procedure utilizes two forward passes, similar to the Forward-Forward algorithm. No label information or data corruption is needed in this approach. It requires feeding two batches of data to the model, one with the original data and the other with an augmented version of the same data.
2. The procedure incorporates VICReg loss at each layer and minimizes it to construct useful abstract representation spaces. The experiments demonstrate that minimizing VICReg loss at each layer of a deep model can construct more informative representation spaces, which can later be used for tasks such as classification by fine-tuning with very limited labeled data.
Related works
The success of DNNs greatly depends on the availability of large datasets. It is pivotal for a deep model to learn representations good enough to generalize. Researchers have found that the performance of DNNs on computer vision tasks increases logarithmically with the size of the training dataset9. One approach to making use of massive unlabeled datasets without relying on semantic annotations is self-supervised learning (SSL). Various SSL methods employ much larger datasets using pseudo-labels that enable models to recognize uncommon, more subtle representations10. DNNs trained with SSL techniques can detect patterns with the help of pretext tasks, such as predicting missing parts of an image, maximizing the agreement between augmented views of an image, predicting the order of a sequence, etc. SSL also improves uncertainty estimation and handles problems such as adversarial examples and label corruption well, making it more robust in practical real-world scenarios11.
SSL methods are widely used in the natural language processing domain. For example, Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer 3 (GPT-3) are trained in a self-supervised fashion12,13. In the computer vision domain, models such as Barlow Twins, Bootstrap Your Own Latent (BYOL), Momentum Contrast (MoCo), Simple Contrastive Learning of Representations (SimCLR), Swapping Assignments between Views (SwAV), and VICReg are examples of self-supervised learning methods14,15,16,17,18. Since the goal of learning in self-supervised methods is to minimize the distance between embeddings of different viewpoints, the encoders may produce identical and non-informative representation vectors. This phenomenon is called collapse: when it happens, the encoders of a joint embedding model ignore the inputs and produce identical output vectors, which do not capture useful information19.
To prevent this problem, self-supervised representation learning using a joint embedding architecture can be done via two families of methods, namely contrastive and non-contrastive self-supervised learning. First, contrastive methods explicitly push dissimilar images away and pull semantically similar images closer together in the representation space. This technique often requires searching for offending dissimilar images in a memory bank or in the current batch, which is costly and memory intensive2. SimCLR and MoCo are examples of contrastive self-supervised methods: SimCLR maximizes the agreement between augmented views of the same image while minimizing the agreement between different images, and MoCo incorporates a dynamic dictionary of negative samples for better contrasting, which yields better representations14,15.
Next, non-contrastive methods focus on maximizing the information content of the output embeddings. An example of a non-contrastive learning method is VICReg. VICReg works well without requiring weight sharing, batch or feature-wise normalization, stop-gradient operations, memory banks, or contrastive samples. One of the advantages of VICReg is that it does not require the two encoders to share the same architecture or input modality; therefore, it is possible to use this method for multimodal data, such as text and audio. This approach utilizes three loss terms to construct a convenient and useful representation space, namely variance, invariance, and covariance. The variance term keeps the standard deviation of each variable of the embedding vectors above a threshold to restrict the model from producing identical embeddings, preventing collapse. The invariance term minimizes the mean squared distance between embedding vectors from the two encoders to pull similar data points closer together in the space. To avoid informational collapse, the covariance term ensures that the covariances between every pair of embedding variables are close to zero. This keeps the variables of the embeddings from being highly correlated, allowing them to span the entire space without losing useful information.
Training a joint embedding architecture model using a VICReg loss consists of four parts. First, encoder networks take different views of an image as input and output their representations. Second, the representation vectors are fed into expander networks that map the representations into an embedding space. Third, the VICReg loss is computed based on the embeddings from both expander networks. Finally, the parameters of the encoder and expander networks are updated. The expander networks are used to de-correlate the embedding variables, reducing the dependencies between the variables of the representation vector by expanding the dimensions in a non-linear fashion, and eliminating the information responsible for the two representations being different. Another method that relies on maximizing the information content of the embeddings does not require any negative samples; it makes the normalized cross-correlation matrix of the embeddings from the two encoders as close to the identity matrix as possible through training18.
Recent works have adopted VICReg across diverse domains. VICReg, as a pre-training loss, is used in Brugada ECG detection, improving Brugada-syndrome classification with a standard convolutional neural network (CNN)20. DA-VICReg abstains from using augmentations by pairing time-synchronous engine-vibration signals as positives21; it also uses attention to extract features of faults that are invariant across various operating conditions. In VIbCReg (Variance-Invariance-better-Covariance), the covariance term of the VICReg loss is refined and IterNorm is added to the projector, resulting in more rapid convergence and higher accuracy on time series data, including ECG data22. VICRegL is a variant of VICReg that focuses especially on local features, applying its loss term to both local and global features23. In a recent study, researchers have demonstrated that ConvNeXt pretrained via the VICRegL loss learns wound structures better, outperforming ResNet benchmarks24. JOSENet, a video violence detection system, also utilizes a custom VICReg loss25 and achieves higher accuracy with reduced memory and computation requirements. The potential use of VICReg and other SSL methods in reinforcement learning has also been studied; recent work shows that incorporating RL-specific augmentations such as replay weighting or state masking into the VICReg approach significantly enhances data efficiency26. For MXene property prediction, graph contrastive learning with a VICReg-inspired approach is used to embed structures for property regression27. More recently, IMSVD has been introduced, which discretizes latent variables for mutual information maximization28 and explicitly enforces invariance with minimal redundancy, as in VICReg. These contributions adapt the technique to new domains and data types, such as time series signals, satellite imagery, reinforcement learning, and graphs29.
Learning using backpropagation consists of two passes through the neural network. A forward pass computation outputs some predictions based on the combination of the features learned so far, and a loss is computed based on the predicted and target outputs. The error derivatives flow backward to the layers, and based on those derivatives, the parameters are updated accordingly to train the model4. There are several attempts to mitigate the drawbacks of backpropagation using layer-wise training. A greedy layer-wise learning approach that scales to large datasets, such as ImageNet, constructs a deep convolutional neural network by sequentially solving 1-hidden-layer auxiliary problems that inherit both the advantages of shallow networks and the representational power of deep networks30. Another layer-wise CNN using local loss for human action recognition (HAR) tasks approaches state-of-the-art performance on several HAR benchmarks31. Research has explored complex interactions between layers of deep neural networks. Shallower layers of a DNN tend to converge faster than the deeper layers because shallow layers are responsible for detecting evenly distributed low-level features, whereas deeper layers combine these features to do specific tasks. This results in a comparatively flatter loss landscape in shallower layers than in deeper layers, ensuring faster convergence32.
The Forward-Forward algorithm is a greedy multi-layer learning procedure that tries to mitigate two limitations of backpropagation: the implausibility of backpropagation being involved in learning in the cortex and the computational inefficiency of backpropagation. This learning procedure is based on Boltzmann machines and Noise Contrastive Estimation33,34. It utilizes two forward passes to update the parameters. The first forward pass works on real data and optimizes the weights of each hidden layer to increase goodness, whereas the second forward pass uses negative data to adjust the weights and decrease goodness at every hidden layer. Samples paired with correct labels are referred to as positive data, and samples combined with incorrect labels are called negative data. The measure of goodness is calculated as the sum of squared neural activities; training increases it for positive data and decreases it for negative data at every hidden layer. Taking this sum of squares as goodness, the goal of learning is to train the model to keep the goodness above some threshold for positive data and below that threshold for negative data. Therefore, the model can classify inputs as either positive or negative data based on the logistic output of the goodness score. A goodness value above the threshold implies the input is paired with the correct label, and in this way the model can associate images with true labels5.
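As a concrete illustration of the goodness measure described above (not the authors' implementation), the following NumPy sketch computes the per-sample goodness of a hidden layer and the logistic probability that an input is positive data; the layer width and the threshold value are assumptions.

```python
import numpy as np

def goodness(activations):
    """Goodness of a layer: sum of squared neural activities per sample."""
    return np.sum(np.square(activations), axis=-1)

def positive_probability(activations, threshold):
    """Logistic probability that the input was positive data, based on how far
    the layer's goodness exceeds the threshold."""
    return 1.0 / (1.0 + np.exp(-(goodness(activations) - threshold)))

# After training, positive inputs (samples paired with correct labels) should
# yield probabilities near 1, and negative inputs probabilities near 0.
hidden = np.random.default_rng(0).normal(size=(8, 500))  # hypothetical hidden activities
print(positive_probability(hidden, threshold=500.0))     # threshold choice is an assumption
```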
Adaptation of the Forward-Forward algorithm to spiking neural networks (SNNs) is demonstrated in a recent study, showing that SNNs trained using the algorithm can achieve accuracy comparable to traditional SNNs trained with backpropagation35. In another study, the Forward-Forward algorithm is integrated with a layer-wise supervised contrastive objective36. The approach gradually tightens representations of the same classes, resulting in substantial accuracy gains and much faster convergence. In conclusion, these extensions demonstrate the effectiveness of the Forward-Forward algorithm coupled with biologically inspired and contrastive learning approaches, narrowing the performance gap between backpropagation and local-update, biologically plausible learning paradigms. Table 1 shows the summary of the related works.
Methodology
The research investigates the plausibility of using local VICReg losses for layer-wise training of DNNs. At each layer, a local VICReg loss is calculated from the output representations of two augmented versions of the data. The parameters of that layer are then updated based on the local gradient. Figure 1 illustrates the proposed approach for the training process of DNNs. Initially, the model receives a batch of images denoted as (I). It then processes two batches of images for a joint embedding architecture where two branches share their weights. The first input to the architecture is the same as (I), denoted as (X). Then, from a uniform distribution of augmentation techniques (T), one random augmentation (t) is applied to each of the images of (I), producing the second input (\(X^{\prime }\)). The inputs (X) and (\(X^{\prime }\)) pass through each layer of the model (f1, f2, f3, f4), producing embeddings (Z1, Z2, Z3, Z4) and (\(Z1^{\prime }\), \(Z2^{\prime }\), \(Z3^{\prime }\), \(Z4^{\prime }\)), respectively. A VICReg loss is calculated based on the output embeddings at each layer independently, unrelated to the previous and subsequent layers. This local loss value is used to compute gradients and update the parameters of that specific layer. This approach encourages the formation of increasingly abstract and refined representation spaces at deeper layers. Algorithm 1 summarizes the proposed layer-wise training procedure for DNNs.
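A minimal TensorFlow sketch of one such layer-wise update step is shown below, assuming each layer is a Keras Dense layer with its own optimizer; the compact vicreg_loss helper, the stop-gradient placement, and all names are our own illustration rather than the authors' code, with the loss-term weights set to the (25, 25, 1) configuration used in this paper.

```python
import tensorflow as tf

def vicreg_loss(z, z_prime, var_w=25.0, inv_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Compact VICReg loss on two batches of embeddings of shape (batch, dim)."""
    n = tf.cast(tf.shape(z)[0], tf.float32)
    d = tf.cast(tf.shape(z)[1], tf.float32)

    # Invariance: mean squared distance between paired embeddings.
    inv = tf.reduce_mean(tf.reduce_sum(tf.square(z - z_prime), axis=1))

    # Variance: hinge loss keeping the std of every embedding dimension above gamma.
    def variance(t):
        std = tf.sqrt(tf.math.reduce_variance(t, axis=0) + eps)
        return tf.reduce_mean(tf.nn.relu(gamma - std))

    # Covariance: penalize squared off-diagonal entries of the covariance matrix.
    def covariance(t):
        t = t - tf.reduce_mean(t, axis=0, keepdims=True)
        cov = tf.matmul(t, t, transpose_a=True) / (n - 1.0)
        off_diag = cov - tf.linalg.diag(tf.linalg.diag_part(cov))
        return tf.reduce_sum(tf.square(off_diag)) / d

    return (var_w * (variance(z) + variance(z_prime))
            + inv_w * inv
            + cov_w * (covariance(z) + covariance(z_prime)))

def layerwise_step(layers, optimizers, x, x_aug):
    """One training step: every layer is updated from its own local VICReg loss."""
    z, z_aug = x, x_aug
    for layer, opt in zip(layers, optimizers):
        # Block gradients from flowing into earlier layers.
        z_in, z_aug_in = tf.stop_gradient(z), tf.stop_gradient(z_aug)
        with tf.GradientTape() as tape:
            z, z_aug = layer(z_in), layer(z_aug_in)
            loss = vicreg_loss(z, z_aug)
        grads = tape.gradient(loss, layer.trainable_variables)
        opt.apply_gradients(zip(grads, layer.trainable_variables))
```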
After pre-training the DNN with the proposed layer-wise procedure, the informativeness of the learned feature space is measured using different clustering quality metrics. The intuition is that the learned features should form meaningful and distinct clusters in the representation space. Also, the effectiveness of the features is assessed by fine-tuning the DNN using a small labeled subset of the data and comparing it with a linear classifier (freezing the pre-trained layers) and a classifier of the same architecture trained with the same small subsets.
Dataset description
To evaluate the proposed approach, we utilize the MNIST37 dataset (10 classes), the EMNIST38 balanced dataset (47 classes), the Fashion MNIST39 dataset (10 classes), and the CIFAR-10040 dataset consisting of 100 classes. These datasets are then processed into training pairs, where each pair consists of an original image and an augmented version produced by one of the defined augmentation strategies. We use four data augmentation techniques: random shifting, random rotation, random zooming, and blurring.
During the augmentation of an image, one of the four augmentation techniques mentioned above is picked uniformly at random and applied. The dataset is then normalized to the range 0-1. Random augmentation is applied at each epoch to improve the robustness of the model and expose the model to diverse augmented versions of the data. In addition, a random seed (42) is used to ensure reproducibility.
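A sketch of this per-image augmentation step might look as follows; the shift, rotation, zoom, and blur magnitudes are assumptions, since the exact ranges are not specified above.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)  # fixed seed for reproducibility, as in the paper

def augment(image):
    """Apply one of the four augmentations, chosen uniformly at random."""
    choice = rng.integers(4)
    if choice == 0:                                   # random shifting
        return ndimage.shift(image, rng.uniform(-2, 2, size=2), mode="nearest")
    if choice == 1:                                   # random rotation
        return ndimage.rotate(image, rng.uniform(-15, 15), reshape=False, mode="nearest")
    if choice == 2:                                   # random zooming
        zoomed = ndimage.zoom(image, rng.uniform(0.9, 1.1))
        out = np.zeros_like(image)                    # paste back into the original shape
        h = min(image.shape[0], zoomed.shape[0])
        w = min(image.shape[1], zoomed.shape[1])
        out[:h, :w] = zoomed[:h, :w]
        return out
    return ndimage.gaussian_filter(image, sigma=rng.uniform(0.5, 1.0))  # blurring
```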
We evaluate our approach on multiple datasets using consistent split ratios. For MNIST and Fashion MNIST, which each contain 70,000 images, we allocate 50,000 for training (71.43%), 5,000 for validation (7.14%), 5,000 for fine-tuning (7.14%), and 10,000 for testing (14.29%). The EMNIST Balanced dataset, comprising 131,600 images, is split using the same proportions, with 94,000 for training (71.43%), 9,400 for validation (7.14%), 9,400 for fine-tuning (7.14%), and 18,800 for testing (14.29%). For the more complex CIFAR-100 dataset, which contains 60,000 images, we use 40,000 for training (66.67%), 5,000 for validation (8.33%), 5,000 for fine-tuning (8.33%), and 10,000 for testing (16.67%). These splits ensure consistency while enabling robust evaluation across datasets of varying complexity. Figure 2 presents examples of original and augmented MNIST images.
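For MNIST, one way to realize this split is sketched below, under the assumption that the standard 60,000/10,000 Keras partition supplies the test set and the remaining 60,000 images are shuffled into the train, validation, and fine-tune portions.

```python
import numpy as np
import tensorflow as tf

# Load MNIST, flatten, and normalize to the range 0-1.
(x_all, y_all), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_all = x_all.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# 70,000 images in total: 50,000 train / 5,000 validation / 5,000 fine-tune / 10,000 test.
rng = np.random.default_rng(42)
perm = rng.permutation(len(x_all))
train_idx, val_idx, ft_idx = perm[:50_000], perm[50_000:55_000], perm[55_000:60_000]
x_train, x_val = x_all[train_idx], x_all[val_idx]
x_ft, y_ft = x_all[ft_idx], y_all[ft_idx]   # labels are used only for fine-tuning
```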
Model description
The research utilizes four different DNN models for the MNIST, EMNIST, Fashion MNIST, and CIFAR-100 datasets. For MNIST, a model with four dense layers containing 500, 400, 300, and 200 neurons, respectively, is used. Another architecture with four layers of 1000, 800, 600, and 400 neurons is utilized for more complex datasets like EMNIST, Fashion MNIST, and CIFAR-100, which require models capable of capturing intricate patterns and handling a larger number of classes. The suitable activation function and kernel initializer are determined through empirical analysis. These DNNs use an L2 kernel regularizer at each layer with a regularization factor of 0.1. Although VICReg can handle heterogeneity in encoding networks and multimodal inputs, in this case, the model uses the same set of parameters and receives the same type of input. The effect of different activation functions and kernel initializers is negligible, as expected. However, the Scaled Exponential Linear Unit (SELU) activation function and Glorot Uniform initialization do slightly better than other activation and initialization combinations. Therefore, this combination is used in the DNNs for experimental analysis.
SELU activation function: The SELU is an activation function that enables self-normalizing neural networks41. The main objective of SELU is to keep the mean and variance of activations stable when training deep neural networks. This helps ensure a smooth and effective learning process. Equation (1) shows the function used in SELU, where x is the input tensor, and \(\lambda\) and \(\alpha\) are predefined constants (\(\alpha = 1.67326324\) and \(\lambda = 1.05070098\)).
\[ \text{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases} \tag{1} \]
Glorot uniform: Glorot uniform initialization is a technique used to set the initial values of weights in a neural network. Its goal is to prevent vanishing or exploding gradients during training by ensuring the variance of activations remains roughly the same across all layers8. Equation (2) is used to calculate the Glorot uniform initialization, where \(n_{\text{in}}\) is the number of input units to the layer, \(n_{\text{out}}\) is the number of output units of the layer, and U denotes a uniform distribution whose limits are determined by \(n_{\text{in}}\) and \(n_{\text{out}}\).
\[ W \sim U\!\left(-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right) \tag{2} \]
Table 2 describes the model architecture, including layer types, output shapes, and parameter counts for the MNIST model. The models we use have an input layer and four dense layers with progressively lower output dimensions. The pyramidal structure is chosen so that the upper layers retain fewer but more abstract components, preserving the information content while learning increasingly high-level patterns. The model for the MNIST dataset, with a total of 773,400 parameters, is relatively small due to the simplicity of the dataset. The EMNIST model is more complex, containing 2,306,800 parameters to accommodate the larger number of classes and intricate patterns in the data. Similarly, the Fashion MNIST model mirrors the EMNIST architecture with the same 2,306,800 parameters, as it also needs greater capacity to process the visual complexity inherent to fashion images. The CIFAR-100 model is the most complex, with a total of 4,594,800 parameters. Its layers accommodate higher-dimensional features, capturing the rich diversity and complexity of the CIFAR-100 dataset, which includes 100 classes and more intricate image structures.
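A minimal Keras sketch of the MNIST encoder described above, with four dense layers of 500, 400, 300, and 200 units, SELU activation, Glorot uniform initialization, and an L2 factor of 0.1; the function and layer names are illustrative rather than taken from the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_pyramidal_encoder(input_dim=784, widths=(500, 400, 300, 200)):
    """Pyramidal stack of dense layers, each trained with its own local VICReg loss."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for i, width in enumerate(widths):
        x = layers.Dense(
            width,
            activation="selu",
            kernel_initializer="glorot_uniform",
            kernel_regularizer=regularizers.l2(0.1),
            name=f"f{i + 1}",
        )(x)
    return tf.keras.Model(inputs, x)

encoder = build_pyramidal_encoder()
encoder.summary()  # 773,400 trainable parameters, matching Table 2
```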
Hyperparameter optimization
We determine suitable weights for the variance, invariance, and covariance terms, which play a significant role in training. To find the best weights for the loss terms, five different combinations of the weights are evaluated, as shown in Table 3. The model is trained layer-wise on MNIST three times for 50 epochs for each combination to ensure that the choice is robust. The learning rate at the first layer is 0.01, and it is decreased by a factor of 10 in the subsequent layers. Other configurations include a batch size of 2048 and the Adam optimizer with an exponential learning rate decay. In our study, we assess several combinations of variance, invariance, and covariance weights to enhance feature representation quality and clustering performance. To identify the optimal weight configuration for our tasks, we first replicate all weight combinations reported in the VICReg study2. This allows us to assess the impact of each configuration on clustering quality and representation performance in our model. Through this replication process, we evaluate each combination on the MNIST dataset using clustering metrics such as the Davies-Bouldin (DB) index and Calinski-Harabasz (CH) score. After comprehensive testing, we find that the (25, 25, 1) combination yields the best results, achieving the highest clustering performance and representation quality. We also calculate and report the standard deviation (SD) of DB values across the trials to check whether the performance variance throughout the trials is consistent. Here, we find that the values span a consistent range, showing that our approach achieves low DB values (good cluster separability) with little variability between random trials.
These metrics evaluate cluster separability and compactness. The DB index measures cluster quality based on how low the variation within each cluster is and how well the different clusters are separated42. Lower DB index values indicate better clustering performance. The CH score indicates cluster quality by measuring the ratio of the sum of between-cluster dispersion to within-cluster dispersion, with higher values indicating better clusters43. It is evident from the scores that, despite having fewer neurons in the deeper layers, the final model can capture the same amount of information. The scores of each layer start to converge at a point, meaning that even with a lower-dimensional space, the model is still able to capture the information effectively. The DB index is averaged over three runs for each configuration, and the combination that gives the best average score is chosen. In this case, it is 25 for the variance term, 25 for the invariance term, and 1 for the covariance term of the VICReg loss. Table 3 summarizes the weight combinations for the loss terms, along with their best and average DB indices after testing each combination three times.
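The two metrics can be computed directly with scikit-learn, as in the sketch below; using the ground-truth class labels as cluster assignments when scoring the layer embeddings is our assumption about the evaluation protocol.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

def cluster_quality(embeddings, labels):
    """Lower DB means more compact, better-separated clusters; higher CH is better."""
    return (davies_bouldin_score(embeddings, labels),
            calinski_harabasz_score(embeddings, labels))

# Example with random data standing in for real layer outputs.
rng = np.random.default_rng(42)
z = rng.normal(size=(1000, 200))      # e.g., embeddings from the 200-unit layer
y = rng.integers(0, 10, size=1000)    # class labels used as cluster assignments
print(cluster_quality(z, y))
```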
After finalizing the optimal hyperparameter combination, learning rate, activation, and kernel initializer, the model is trained again to evaluate the effectiveness of the proposed method. The learning rate is initially set to 0.1 and is reduced by a factor of 10 in each successive deeper layer. This choice of learning rates ensures stability in the deeper layers: deeper layers extract abstract features and are more sensitive to large, abrupt changes, whereas shallower layers need to learn low-level features quickly, so a higher learning rate helps. A learning rate schedule is also applied that decreases the learning rate exponentially, starting at the initial learning rates mentioned above and decaying by a factor of 0.1 every 50 epochs. Additionally, the parameters of the layers are updated at each iteration instead of training a layer fully and then moving to the next layer.
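A short sketch of these per-layer optimizers and schedules is given below; translating "decay every 50 epochs" into optimizer steps via the training-set size and batch size is an assumption.

```python
import tensorflow as tf

num_layers = 4
steps_per_epoch = 50_000 // 2048 + 1      # MNIST training size / batch size (assumption)

optimizers = []
for i in range(num_layers):
    initial_lr = 0.1 / (10 ** i)          # 0.1, 0.01, 0.001, 0.0001 for layers 1..4
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=initial_lr,
        decay_steps=50 * steps_per_epoch,  # decay by a factor of 0.1 every 50 epochs
        decay_rate=0.1,
    )
    optimizers.append(tf.keras.optimizers.Adam(learning_rate=schedule))
```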
Mathematical formulation of VICReg-based layer-wise representation learning
During training, to ensure that the model encounters diverse sample pairs to enhance robustness, shuffling and data augmentation are applied at each epoch. In the proposed method, each layer receives both the original input and augmented versions of the original data, and transforms them into two batches of embeddings. The embeddings are then used to compute the VICReg loss, as shown in Fig. 3.
Each layer \(n\) processes the embedding outputs of the previous layer \(n-1\): \({\text{Z}}_{n-1}\) and \({\text{Z}}'_{n-1}\) are fed into layer \(n\) to compute the embedding outputs of the current layer, \({\text{Z}}_n\) and \({\text{Z}}'_n\). The three loss terms are then calculated separately. Local updates in each layer aim to minimize the VICReg loss, which includes variance \(v({\text{Z}}_n)\), invariance \(s({\text{Z}}_n, {\text{Z}}'_n)\), and covariance \(c({\text{Z}}_n)\) terms. First, the variance term tries to keep the embeddings of samples in a batch different from one another, so that the risk of collapse is minimized. It keeps the standard deviation of the variables of the embedding vectors above a defined threshold. \({z}_{ij}\) denotes the j-th feature of the i-th sample in the batch, \({\bar{z}}_{j}\) is the mean of the j-th feature over all n samples, and \({Var(z_{j})}\) is the variance of the j-th feature in the current batch. Second, the invariance term focuses on minimizing the mean squared distance between the original and its augmented version. By doing so, the model pulls data of similar character closer together. Here, \({z}_{i}\) is the representation of the i-th original sample, and \({z}'_{i}\) is the representation of its augmented version. The loss minimizes the distance between the two representations, reinforcing that the model learns useful augmentation-invariant patterns. Third, the covariance term tries to make the covariance between pairs of embedding variables close to zero so that the embeddings are spread around the whole representation space. This operation prevents the informational collapse that is commonly seen in self-supervised learning settings. Here, the indices for data samples in a mini-batch are \(i = 1, 2, \dots , n\), and the embedding dimensions are \(j = 1, 2, \dots , d\). This strategy creates compact and informative representation spaces, improving the model's performance, particularly when annotated data are scarce. VICReg is a self-supervised learning method for image representation. It avoids collapse without requiring negative examples or momentum encoders by using a loss function composed of three terms, shown in Equations (3)–(8). Here, \({\text{Z}}\) and \({\text{Z}}'\) are the embeddings of the input images, \(d\) is the dimensionality of the embeddings, \(\gamma\) is the target standard deviation, and \(\epsilon\) is a small positive constant that ensures numerical stability.
Variance loss term:
\[ v({\text{Z}}) = \frac{1}{d} \sum_{j=1}^{d} \max\!\left(0,\; \gamma - \sqrt{\operatorname{Var}(z_{j}) + \epsilon}\,\right), \qquad \operatorname{Var}(z_{j}) = \frac{1}{n-1} \sum_{i=1}^{n} \left(z_{ij} - \bar{z}_{j}\right)^{2} \]
Invariance loss term:
\[ s({\text{Z}}, {\text{Z}}') = \frac{1}{n} \sum_{i=1}^{n} \left\lVert z_{i} - z'_{i} \right\rVert_{2}^{2} \]
Covariance loss term:
\[ C({\text{Z}}) = \frac{1}{n-1} \sum_{i=1}^{n} \left(z_{i} - \bar{z}\right)\left(z_{i} - \bar{z}\right)^{\top}, \qquad c({\text{Z}}) = \frac{1}{d} \sum_{j \ne k} \left[C({\text{Z}})\right]_{j,k}^{2} \]
The combined loss function is:
\[ \ell({\text{Z}}, {\text{Z}}') = \lambda\, s({\text{Z}}, {\text{Z}}') + \mu \left[ v({\text{Z}}) + v({\text{Z}}') \right] + \nu \left[ c({\text{Z}}) + c({\text{Z}}') \right] \]
where \(\lambda\), \(\mu\), and \(\nu\) are the weights of the invariance, variance, and covariance terms, respectively (set to 25, 25, and 1 in our experiments, following the standard VICReg formulation2).
The VICReg loss is used to update the parameters of the layers locally. Each of the losses is unrelated to the preceding and subsequent layers. By locally updating the parameters with the help of local VICReg losses and the pyramidal structure of the architecture, the model captures useful information in increasingly compact representation spaces.
Experimental results
The evaluation of the model trained layer-wise with the proposed approach is done in two ways. First, by training a linear classifier using the representations from the pre-trained model (by freezing all the pre-trained layers). Several subsets of the processed datasets are used to evaluate the efficiency of the trained model. Second, classification accuracy is compared between a baseline trained from scratch and a fine-tuned model (all layers of the pre-trained model are fine-tuned), both trained using the same subsets. The findings suggest that the linear classifier works well, meaning that the pre-trained model has learned important features. Furthermore, the training procedure yields approximately 7%, 16%, 1%, and 7% accuracy gains compared to baselines when the models are fine-tuned on small subsets of MNIST, EMNIST, Fashion MNIST, and CIFAR-100, respectively. While more labeled samples reduce the need for the proposed training procedure, it will be useful when a small labeled dataset but a large number of unlabeled samples are available, since it still improves performance with limited labels.
Experiment on MNIST
We use clustering quality metrics to evaluate the feature informativeness of the pre-trained model. Clustering quality is assessed from the representations of the pre-trained model, and classification performance is evaluated using different-sized subsets of the labeled data.
Learned representation quality assessment with cluster quality metrics: After training the model layerwise on MNIST for 50 epochs with weights (25, 25, 1) for the loss terms variance, invariance, and covariance respectively, we obtain the cluster quality metrics shown in Table 4. The DB and CH indices assess clustering quality across layers, with each metric offering a different perspective on clustering performance. The DB index measures how well the clusters are separated from each other, with lower DB values indicating better clusters. This is achieved by evaluating the ratio of the intra-cluster dispersion to the inter-cluster separation. The gradual reduction in the DB index across layers suggests that the deeper layers form more compact clusters with lower intra-cluster variance, resulting in refined and informative representations.
The CH index measures the ratio of the sum of between-cluster dispersion to within-cluster dispersion. Higher CH values indicate better clustering, with well-separated and compact clusters. As shown in Table 4, the CH score peaks at the third layer, indicating improved representation learning during training. However, the slight dip at the last layer suggests that deeper layers may sacrifice some separation for more abstract representations, as expected in hierarchical models.
Figure 4a shows the epochs versus the DB index values in each layer as training progresses for 50 epochs. The DB index of each layer converges at a point near epoch 50, indicating deeper layers are now able to retain similar information as lower layers despite having fewer neurons. Figure 4b shows the loss minimization per layer during training on MNIST using the proposed approach. Note that although lower layers have lower DB index values, they exhibit a slightly higher loss due to their high-dimensional output embeddings.
Testing influence of learned representations in downstream task: The pre-trained model is then fine-tuned with fine-tune data subsets of different sizes, ranging from 500 samples to 5000 samples. A baseline model with the same architecture is also trained from scratch on similarly sized fine-tune datasets of 500, 1000, 2500, and 5000 samples. Training is performed with an initial learning rate of 0.0005 and a batch size of 128.
An exponential decay learning rate scheduler with a decay step of 50 and a decay rate of 0.95 is used. Instead of training for a fixed number of epochs, early stopping with a patience of five is used for fine-tuning and the baseline to allow the models to stop at the optimal point. For the linear evaluation, we set the initial learning rate to 0.005 and the patience to 10, since only the SoftMax layer is trained, and the model may underfit if not enough epochs are utilized. The training curves in Figures 5, 6, and 7 show that none of the models overfit the dataset. Although the gap between the validation and training curves is large, this is because only a small amount of labeled data is available for the models to learn from.
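The two evaluation modes can be set up as in the following Keras sketch, assuming the pre-trained encoder from the earlier sections; the classifier head, the compile settings beyond the learning rates quoted above, and the monitored quantity for early stopping are assumptions.

```python
import tensorflow as tf

def build_classifier(encoder, num_classes=10, freeze_encoder=True):
    """Attach a softmax head to the pre-trained encoder.
    freeze_encoder=True  -> linear evaluation (only the head is trained).
    freeze_encoder=False -> full fine-tuning of all layers."""
    encoder.trainable = not freeze_encoder
    inputs = tf.keras.Input(shape=encoder.input_shape[1:])
    features = encoder(inputs, training=not freeze_encoder)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    lr = 0.005 if freeze_encoder else 0.0005   # learning rates reported in the text
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Patience of 5 for fine-tuning and the baseline, 10 for the linear probe.
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model = build_classifier(encoder, freeze_encoder=False)
# model.fit(x_ft, y_ft, validation_data=(x_val, y_val),
#           batch_size=128, callbacks=[early_stop])
```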
Table 5 shows the macro average precision, recall, and F1 score of MNIST classification on each of the subsets with different numbers of samples, rounded to the nearest whole percent. The macro average of the metrics is used because the dataset is well-balanced. The uniformity in the metrics is evident from the table. This indicates that the trained models are consistent in predicting instances from each class, suggesting that the models are balanced and reliable.
Table 6 presents the performance comparison of various model versions trained on subsets of different sizes from the MNIST dataset. With just 500 samples, fine-tuning achieves an accuracy of 85.13%, while the baseline reaches only 77.78%. Similarly, for 1000 samples, fine-tuning results in 88.66% accuracy, whereas the baseline lags at 81.33%. When there is limited data available (around 500 or 1000 samples), fine-tuning delivers impressive results. It achieves up to 7% higher accuracy than the baseline model. As we increase the sample size to 2500 and 5000, the advantage of fine-tuning becomes less pronounced, but it still maintains a slight edge in terms of accuracy and efficiency. For 2500 samples, fine-tuning achieves an accuracy of 91.69%, whereas the baseline manages 88.52%, and with 5000 samples, fine-tuning reaches an accuracy of 94.23%, compared to the 93.68% accuracy of the baseline. This demonstrates that the proposed approach excels in a low-data regime, offering better generalization and efficiency. Traditional training methods become more competitive as data availability grows. Figure 8 shows the confusion matrices of our baseline and fine-tuned models on an MNIST subset of size 500.
Figure 9 shows the baseline and fine-tuned models trained with 500 samples for 50 epochs without early stopping or any other automatic convergence criterion. Learning of the baseline model plateaus at an early epoch due to insufficient data, halting its ability to improve. In contrast, the model pre-trained with the proposed approach and then fine-tuned on the same small labeled set outperforms the baseline by a significant margin of around 7%. As illustrated in the plot, the accuracy of the baseline model plateaus around 75%, whereas the fine-tuned model continues to improve, ultimately reaching above 85% training accuracy, meaning that the representations initially learned via the proposed approach are useful.
Experiment on EMNIST
The pre-trained EMNIST model is evaluated by the same approach as discussed earlier for the MNIST experiment. Evaluation reports and discussions are given below for the EMNIST experiment.
Learned representation quality assessment with cluster quality metrics: The same loss term weights (25 for variance, 25 for invariance, and 1 for covariance) are applied while training the model on the EMNIST dataset. Table 7 presents the cluster quality metrics for each layer. The DB index and CH score are used to assess the quality of the clusters. The decreasing trend of the DB index across layers indicates that the deeper layers of the model successfully capture meaningful features, ensuring more compact and well-separated clusters. Similarly, the CH score stabilizes in the deeper layers, reflecting the model's capability to manage the higher complexity of EMNIST. These observations confirm that our layer-wise VICReg training remains effective even when applied to larger, multi-class datasets like EMNIST. Figure 10a shows epochs versus DB index values, while Figure 10b shows epochs versus loss minimization of each layer while training on EMNIST.
Testing influence of learned representations in classification task: A pre-trained model (trained on unlabeled EMNIST samples) is fine-tuned, and a baseline model with the same architecture is trained, on five fine-tune subsets containing 1000, 1500, 2000, 5000, and 9400 samples to compare how well the proposed approach works. The initial learning rate is set to 0.0005 with a batch size of 256. A larger batch size is used because EMNIST consists of many classes, and each batch should ideally contain enough samples from each class. As in the MNIST experiment, exponential learning rate decay and early stopping (with the same configuration) are utilized. Figures 11, 12, and 13 show the training accuracy curves of the baseline, linear classifier, and fine-tuned model. Table 8 presents the rounded macro average precision, recall, and F1 score of EMNIST classification on each subset. The closeness of the three metrics indicates consistent performance across all classes.
Table 9 presents the accuracy and number of epochs for the models trained on EMNIST dataset subsets. When labeled data is very limited (e.g., 1000 or 1500 samples), fine-tuning the pre-trained model yields around a 16% accuracy gain compared to baseline models trained on the same subsets. Specifically, with only 1000 labeled samples, the baseline model reaches 40.52% accuracy, whereas our fine-tuned model reaches 56.71% accuracy.
For 1500 samples, the baseline model achieves 42.26% accuracy, while the fine-tuned model reaches 59.03%. As discussed earlier, with sufficient labeled data the baseline can reach performance comparable to the proposed method. However, due to the efficiency of the pre-trained model, it rapidly achieves high accuracy within just a few epochs. Figure 14 shows the confusion matrices of the baseline and fine-tuned models. The baseline's predictions are scattered, whereas the fine-tuned model's predictions are concentrated along the diagonal, indicating that it is better than the baseline and its predictions are more accurate.
To compare, a baseline model is trained with 1000 samples for 50 epochs without using early stopping or any automatic convergence criteria. The pre-trained EMNIST model is then fine-tuned with the same number of samples for 50 epochs, and the results are compared in Fig. 15. Learning of the baseline model plateaus early due to insufficient data, halting its ability to improve. In contrast, the fine-tuned model outperforms the baseline by a significant margin. As shown in the graph, the accuracy of the baseline model plateaus around 40%, while the fine-tuned model continues to improve, ultimately reaching a training accuracy above 50%.
Experiment on more complex datasets
The proposed approach is subsequently evaluated on more complex datasets, namely Fashion MNIST and CIFAR-100. The same weights for the variance, invariance, and covariance loss terms are utilized to train models on the Fashion MNIST and CIFAR-100 datasets. Tables 10 and 11 show the cluster quality metric scores after training the model for 50 epochs on Fashion MNIST and CIFAR-100, respectively. Despite the increased complexity of the datasets, the proposed approach is still able to capture useful features, as evident from the cluster quality metrics.
The pre-trained models are then fine-tuned on the respective datasets (Fashion MNIST and CIFAR-100) and compared to baselines with identical architectures and trained on the same subsets. For Fashion MNIST, four subsets of size 500, 1000, 2500, and 5000 samples are used, while for CIFAR-100, three larger subsets (2500, 3500, and 5000 samples) are used due to the complexity and number of classes in the dataset. CIFAR-100 subsets are chosen to be larger because there are 100 classes, and there must be enough samples from each class in the subsets. A dropout layer with a dropout rate of 50% is added right before the softmax layer to minimize the risk of overfitting, given the small subset sizes and dataset complexity. A learning rate of 0.0001 is used for both models during fine-tuning with an exponential decay learning rate scheduler (decay step: 50 and a decay rate: 0.95). The model is fine-tuned for 200 epochs with a batch size of 128. The baseline is trained under the same configurations and data subsets to ensure a valid comparison.
Table 12 compares the efficiency of the pre-trained model on Fashion MNIST with limited labeled data. From the table, it is evident that fine-tuning the pre-trained model with very few samples can lead to around a 1% accuracy gain. Additionally, adding a softmax layer to the pre-trained model and training only that layer achieves 55% to 66% accuracy with the same small subsets, demonstrating that the representations learned during pre-training are effective.
In the CIFAR-100 experiment, accuracies are low due to the complexity of the dataset and the use of a fully connected neural network composed only of dense layers. A dense-only network cannot fully capture the complex patterns in that dataset. Nevertheless, the approach still helps the model achieve better accuracy. For example, the model fine-tuned with just 500 samples (an average of five samples from each class) reaches 12.03% accuracy, whereas the baseline trained on the same subset manages only 7.05%. With a simple architecture and a very small labeled dataset, the approach improves accuracy by approximately 5% to 8% in the CIFAR-100 experiment, as shown in Table 13.
Figures 16 and 17 show the comparison of the training accuracy of the baseline model and the pre-trained model on the Fashion MNIST and CIFAR-100 datasets. While the baseline model's accuracy lags, the pre-trained model demonstrates a steady increase in accuracy throughout training.
Comparison with self-supervised learning and layer-wise training approaches
The proposed VICReg layer-wise method focuses on enhancing cluster and feature representation quality through a layer-wise optimization strategy. This approach ensures that features remain both compact and well-separated at each layer, as validated by the DB and CH metrics, making it highly effective for multi-class tasks such as MNIST and EMNIST. In comparison, the Forward-Forward (FF) layer-wise algorithm aims to eliminate backpropagation by using local optimization at each layer, making it suitable for resource-constrained hardware and neuromorphic computing. As shown in Table 14, FF’s scalability and transfer performance are limited compared to VICReg’s progressive representation refinement. Similarly, the Forward-Forward in a self-supervised setting (SSL) focuses on learning representations without labels through a goodness-based loss at each layer. While it aligns with VICReg’s layer-wise training concept, it underperforms in transfer learning tasks, especially on more complex datasets like CIFAR-10 and SVHN. The Probabilistic SSL approach (SimVAE) prioritizes style-content retention in its generative framework, making it better suited for fine-grained tasks requiring nuanced features. However, VICReg’s structured clustering optimization provides a practical advantage for tasks needing clear feature separation. Finally, SimCLR and MoCo, with their contrastive learning frameworks, excel in transfer learning but require large datasets and complex augmentations, which can be resource-intensive. In contrast, VICReg achieves robust performance through structured layer-wise optimization, offering a scalable and efficient solution for multiclass classification. This comparison, summarized in Table 14, highlights how VICReg bridges the gap between layer-wise optimization and effective representation learning, making it a strong competitor across various applications. The study explores solving MNIST and Fashion MNIST classification problems with various SSL methods and the SimVAE44. Whereas these methods involve various advanced architectures, our method outperforms or achieves comparable performance with a simple architecture and very few labeled data. Table 15 shows the comparison between various SSL methods and our proposed method.
Discussion on computational efficiency and scalability
Pre-training the model takes approximately three to six seconds per epoch, depending on the complexity and magnitude of the datasets. It takes longer than fine-tuning because pre-training involves calculating the VICReg loss at each layer and performing local updates. However, this approach eliminates the need for massive labeled data and for backpropagation during pre-training. With very little labeled data, the proposed pre-training step allows the model to reach noticeably higher accuracy than a model trained from scratch on the same small dataset. Table 16 shows the time taken (in seconds) for each scenario: linear, baseline, and fine-tune.
Our current experiments involve MLPs as a proof of concept for layer-wise training with local VICReg losses, but the approach can potentially scale to larger architectures and datasets. VICReg is an architecture-agnostic approach, showing successful integration with deeper architectures, including ResNet and Transformers2,23. A recent study shows that large convolutional networks can be trained well with local losses like VICReg47. It demonstrates that deep networks like ResNet-50 can achieve nearly similar performance with local objectives, suggesting that our conceptually similar approach can achieve effective performance as well. In this work, we focus on MLPs to prove the concept, and we leave broader generalizability studies for future work.
Representation space evolution during training
To visualize the evolution of the representation spaces during training, t-distributed Stochastic Neighbor Embedding (t-SNE) is used, a technique for projecting high-dimensional data into a low-dimensional space48. t-SNE converts the similarities between data points into probabilities and then minimizes the divergence between the probability distributions defined in the high-dimensional and low-dimensional spaces, so that similar objects are represented by nearby points and dissimilar objects by distant points with high probability. The high-dimensional output from the model is projected onto a two-dimensional plane to visualize the representation spaces using this dimensionality reduction technique. Figure 18 illustrates how the representation space evolves throughout training. The plots display the data at different epochs (0, 10, 20, 30, 40, and 50) using t-SNE to project the high-dimensional representations onto a two-dimensional space. We describe the details in the following; a minimal sketch of the projection step appears after the list.
- Epoch 0: At the start, the clusters of data overlap, which means the model has not yet figured out how to separate the different classes.
- Epoch 10: As training progresses, the clusters start to become more distinct, but there is still some overlap.
- Epoch 20: The separation between clusters keeps improving, showing that the model is starting to learn meaningful representations.
- Epoch 30: The clusters become even more defined, with less overlap between different classes.
- Epoch 40: The representation space shows well-separated clusters, indicating that the model is learning effectively and doing a better job at separating classes.
- Epoch 50: Finally, at the last epoch, each cluster is clearly defined with minimal overlap. This demonstrates that the model has successfully learned how to represent the data. At this point, the individual VICReg loss and DB index for each layer converge at similar points.
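A minimal sketch of the projection step behind these plots, assuming the embeddings come from the final layer and that default scikit-learn t-SNE settings are adequate:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_representation_space(embeddings, labels, epoch):
    """Project high-dimensional layer outputs to 2-D and color points by class."""
    coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(f"Representation space at epoch {epoch}")
    plt.savefig(f"tsne_epoch_{epoch}.png", dpi=150)
    plt.close()
```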
Conclusion
This study focuses on developing a procedure for layer-wise training of a deep neural network using VICReg as local losses. Being a layer-wise training procedure, it handles the drawbacks of backpropagation by performing two forward passes, one on the original data and another on an augmented version. The procedure constructs compact and informative representation spaces, as evidenced by the improved classification accuracy in linear evaluation. It also enhances accuracy by approximately 7%, 16%, 1%, and 7% on MNIST, EMNIST, Fashion MNIST, and CIFAR-100 classification tasks, respectively, when trained on unlabeled data and fine-tuned on very limited labeled data. These findings highlight the effectiveness of VICReg-based layer-wise training in leveraging large unlabeled datasets, paving the way for more efficient and robust neural network models, and demonstrate its potential to address real-world challenges across multiple domains. Future research will focus on adapting this approach to more diverse datasets and advanced model architectures, further extending its applicability and effectiveness.
Data availability
All data generated or analyzed during this study are included in these published articles: Gradient-based learning applied to document recognition (http://yann.lecun.com/exdb/mnist/), EMNIST: Extending MNIST to handwritten letters (https://www.nist.gov/itl/products-and-services/emnist-dataset), Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms (https://github.com/zalandoresearch/fashion-mnist), CIFAR-10 and CIFAR-100 datasets (https://www.cs.toronto.edu/~kriz/cifar.html).
References
Rani, V., Nabi, S. T., Kumar, M., Mittal, A. & Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng. 30, 2761–2775 (2023).
Bardes, A., Ponce, J. & LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021).
Sarker, I. H. Machine learning: Algorithms, real-world applications and research directions. SN Computer Sci. 2, 160 (2021).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Hinton, G. The forward-forward algorithm: Some preliminary investigations. arXiv preprint arXiv:2212.13345 (2022).
Gandhi, S., Gala, R., Kornberg, J. & Sridhar, A. Extending the forward forward algorithm. arXiv preprint arXiv:2307.04205 (2023).
Ioffe, S. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256 (JMLR Workshop and Conference Proceedings, 2010).
Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, 843–852 (2017).
Gui, J. et al. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
Hendrycks, D., Mazeika, M., Kadavath, S. & Song, D. Using self-supervised learning can improve model robustness and uncertainty. Adv. Neural Inf. Process. Syst. 32 (2019).
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738 (2020).
Grill, J.-B. et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020).
Caron, M. et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33, 9912–9924 (2020).
Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, 12310–12320 (PMLR, 2021).
LeCun, Y. A path towards autonomous machine intelligence. Open Rev. 62, 1–62 (2022).
Ronan, R., Tarabanis, C., Chinitz, L. & Jankelson, L. Self-supervised vicreg pre-training for brugada ecg detection. Sci. Rep. 15, 9396 (2025).
Chen, T., Xiang, Y. & Wang, J. Da-vicreg: A data augmentation-free self-supervised learning approach for diesel engine fault diagnosis. Measur. Sci. Technol. 35, 086109 (2024).
Lee, D. & Aune, E. Vibcreg: Variance-invariance-better-covariance regularization for self-supervised learning on time series. arXiv preprint arXiv:2109.00783 (2021).
Bardes, A., Ponce, J. & LeCun, Y. Vicregl: Self-supervised learning of local visual features. Adv. Neural Inf. Process. Syst. 35, 8799–8810 (2022).
Akay, J. M. & Schenck, W. Transferability of non-contrastive self-supervised learning to chronic wound image recognition. In International Conference on Artificial Neural Networks, 427–444 (Springer, 2024).
Nardelli, P. & Comminiello, D. Josenet: A joint stream embedding network for violence detection in surveillance videos. arXiv preprint arXiv:2405.02961 (2024).
Çağatan, Ö. V. & Akgün, B. Uncovering rl integration in ssl loss: Objective-specific implications for data-efficient rl. arXiv preprint arXiv:2410.17428 (2024).
Vertina, E. W., Sutherland, E., Deskins, N. A. & Mangoubi, O. Mxene property prediction via graph contrastive learning. In 2024 IEEE 14th International Conference Nanomaterials: Applications & Properties (NAP), 1–5 (IEEE, 2024).
Niu, C., Xia, W., Shan, H. & Wang, G. Information-maximized soft variable discretization for self-supervised image representation learning. arXiv preprint arXiv:2501.03469 (2025).
Brack, V. & Koßmann, D. Local representation learning using visual priors for remote sensing. In IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, 8263–8267 (IEEE, 2024).
Belilovsky, E., Eickenberg, M. & Oyallon, E. Greedy layerwise learning can scale to imagenet. In International conference on machine learning, 583–593 (PMLR, 2019).
Teng, Q., Wang, K., Zhang, L. & He, J. The layer-wise training convolutional neural networks using local loss for sensor-based human activity recognition. IEEE Sensors Journal 20, 7265–7274 (2020).
Chen, Y., Yuille, A. & Zhou, Z. Which layer is learning faster? a systematic exploration of layer-wise convergence rate for deep neural networks. In The Eleventh International Conference on Learning Representations (2023).
Hinton, G. E. & Sejnowski, T. J. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 448–453 (1983).
Gutmann, M. & Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 297–304 (JMLR Workshop and Conference Proceedings, 2010).
Ghader, M., Kheradpisheh, S. R., Farahani, B. & Fazlali, M. Backpropagation-free spiking neural networks with the forward-forward algorithm. arXiv preprint arXiv:2502.20411 (2025).
Aghagolzadeh, H. & Ezoji, M. Marginal contrastive loss: A step forward for forward-forward. In 2024 13th Iranian/3rd International Machine Vision and Image Processing Conference (MVIP), 1–6 (IEEE, 2024).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Cohen, G., Afshar, S., Tapson, J. & Van Schaik, A. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), 2921–2926 (IEEE, 2017).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html (2009).
Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-normalizing neural networks. Adv. Neural Inf. Process. Syst. 30 (2017).
Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979).
Caliński, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. -Theory Methods 3, 1–27 (1974).
Bizeul, A., Schölkopf, B. & Allen, C. A probabilistic model behind self-supervised learning. Transactions on Machine Learning Research.
Scodellaro, R., Kulkarni, A., Alves, F. & Schröter, M. Training convolutional neural networks with the forward-forward algorithm. arXiv preprint arXiv:2312.14924 (2023).
Brenig, J. & Timofte, R. A study of forward-forward algorithm for self-supervised learning. arXiv preprint arXiv:2309.11955 (2023).
Siddiqui, S. A., Krueger, D., LeCun, Y. & Deny, S. Blockwise self-supervised learning at scale. arXiv preprint arXiv:2302.01647 (2023).
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 9, 11 (2008).
Funding
This work was supported by King Saud University, Riyadh, Saudi Arabia, under ongoing Research Funding program (ORF-2025-951).
Author information
Contributions
J.D. conceptualized the idea, J.D. conducted the experiment(s), R.R. prepared the visuals, P.S. found the optimal hyperparameter settings, J.D., P.S., and R.R. wrote the manuscript, J.U. and A.Z. supervised the research, M.W. acquired funding. All authors reviewed the manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Datta, J., Rabbi, R., Saha, P. et al. Deep representation learning using layer-wise VICReg losses. Sci Rep 15, 27049 (2025). https://doi.org/10.1038/s41598-025-08504-2