Abstract
Anti-patterns are explicit structures in a design that represent significant violations of software design principles and negatively impact software design quality. Their presence strongly influences the maintainability and perceived quality of software systems. It therefore becomes necessary to predict anti-patterns at an early stage and refactor them to improve software quality in terms of execution cost, maintenance cost, and memory consumption. During our analysis of the anti-pattern prediction domain, we found that very little work has addressed the class imbalance and feature redundancy problems jointly to enhance model performance and prediction accuracy, and the literature also lacks a comprehensive comparative analysis of different sampling and feature selection strategies. To achieve higher precision and better performance, this research constructs web service anti-pattern prediction models over preprocessed software source code metrics, using sampling and feature selection techniques to handle imbalanced data and feature redundancy. Considering the above, we applied different variants of aggregation measures to compute metrics at the system level. Since these extracted metrics are used as input, we also applied different variants of feature selection techniques to remove irrelevant features and select the best combination of features. After identifying important features, we applied different variants of data sampling techniques to overcome the class imbalance problem. Finally, we used thirty-three different classifiers to find important patterns that help identify anti-patterns. All these techniques are compared using accuracy and the Area Under the ROC (receiver operating characteristic) Curve (AUC). The experimental results of the web service anti-pattern prediction models, validated on 226 WSDL files, illustrate that the least squares support vector machine (LSSVM) with RBF kernel attains the best performance among the 33 competing classifiers, with the lowest Friedman mean rank value of 1.18. The comparative analysis of different feature selection techniques indicates that the models developed using significant features achieve a higher mean accuracy of 88.40% and mean AUC of 0.88 than the other techniques. The results also show that the up-sampling (UPSAM) method secures the highest mean accuracy and mean AUC among the sampling techniques, with values of 86.14% and 0.87, respectively. The experimental results indicate that the performance of web service anti-pattern prediction models is adversely impacted by class imbalance and irrelevant features: the performance of the trained models improved to AUC values between 0.805 and 0.99 after applying sampling and feature selection strategies, compared to models developed without them. Overall, UPSAM achieves the best performance among the sampling techniques, and the models developed using significant features deliver the desired effect compared to the other implemented feature selection techniques.
Introduction
System autonomy, heterogeneity, and context adaptability are critical in the software business, leading to the development of web services based on service-oriented architecture (SOA). For successful businesses and contemporary governments, SOA is the progression of distributed computing toward integrating expert departments and IT. Services may be accessed via the internet using the web service implementation of SOA, which is agnostic of platform and programming language. SOA is generally regarded in IT systems as a technology that can improve the responsiveness of both business and IT organizations, since it is self-adaptable to context. Web services may be built in various languages and on various platforms, allowing them to be used on a wide range of devices.
SOA makes Service-Based Systems (SBSs) such as Paytm, DropBox, and Amazon feasible, and the growth of these systems causes many challenges. As new devices and technologies are introduced, SBSs must evolve to keep up with the demands of their users. Like any other large and complicated system, SBSs are prone to ongoing modification to accommodate new user needs and changing execution circumstances. All of these modifications may decrease an SBS's Quality of Service (QoS) and result in degraded design structures, which have been given the name "anti-patterns"1. Structures like these imply a breach of fundamental design principles and a decrease in design quality. Anti-patterns make it challenging to improve and maintain a software system, and they are useful indicators of issues in its design, source code, or overall project management. Therefore, it has become essential to develop prediction models that help detect anti-patterns present in web services. Software quality researchers have used simple models to predict different types of anti-patterns based on source code metrics, which helps improve software quality in terms of execution cost, maintenance cost, and memory consumption. Empirical experiments have been carried out in the past on web service anti-pattern prediction (Travassos et al.2, Marinescu et al.3, Munro et al.4, Ciupke et al.5, Simon et al.6, Rao et al.7, Khomh et al.8, Moha et al.9). Although these research works have raised the need to develop prediction models, very little work has addressed both the class imbalance and feature redundancy problems jointly to enhance model performance and prediction accuracy, and the above work also lacks a comprehensive comparative analysis of different sampling and feature selection strategies.
In this work, we investigate the predictive power of different aggregation measures used for computing file-level metrics, feature selection techniques used for selecting significant features, data sampling techniques used for handling the class-imbalanced nature of the datasets, and different variants of machine learning used for finding patterns. Our focus is on how accurately these techniques help to predict anti-patterns present in web services. Initially, we selected 226 different web services as WSDL files from various domains such as finance, tourism, health, and education. We then applied the WSDL2Java tool to each WSDL file to extract its Java files. After extracting the Java files, we used the CKJM10 tool proposed by Chidamber and Kemerer to compute metrics at the class level. Since our objective is to find the anti-patterns present in a WSDL file, we applied different variants of aggregation measures to compute metrics at the system level. After computing metrics at the system level, we applied feature selection techniques to find a significant set of features, which is later used as input for the anti-pattern prediction models. We also observed that the considered data have an imbalanced class distribution. Hence, to handle the class imbalance problem and its impact on the prediction accuracy of the models, we used five data sampling techniques. We compare the performance of the models generated using these sampling techniques with that of the model developed using the original data (ORGD).
Finally, we applied different categories of machine learning techniques to find important patterns that help identify anti-patterns present in unseen WSDL files. Initially, we applied the most frequently used classifiers, such as different variants of Naive Bayes (Bernoulli, Gaussian, Multinomial), decision trees, logistic regression, support vector machines with different kernels, and artificial neural networks with different back-propagation algorithms; these classifiers are the ones researchers most commonly use to predict software quality parameters. Then, advanced classifiers such as least squares support vector machines with multiple kernels and extreme and weighted extreme learning machines with multiple kernels were used to find better sets of patterns for anti-pattern prediction. Finally, we used ensemble learning and deep learning approaches to find the best patterns for anti-pattern prediction. The predictive power of these techniques is evaluated in terms of accuracy and AUC values and validated with a 5-fold cross-validation approach on the 226 web services. To determine the significance of the differences between techniques, we used the Wilcoxon Signed Rank Test (WSRT) together with the Friedman mean rank (FMR).
The major contributions of this research work are:
-
Proposed a framework to predict web service anti-patterns based on extracted java files of WSDL.
-
Proposed a framework using the aggregation measures concept to extract file-level metrics from class-level metrics.
-
Usage of different sampling approaches to counter the class imbalance problem.
-
Usage of different feature selection techniques to remove irrelevant features and set the right sets of features.
-
Thirty-three different classifiers are considered to develop models that identify the files containing anti-patterns.
-
Various statistical tests were conducted to determine the effectiveness of the proposed anti-pattern detection model.
The paper is organized as follows: Section 2 summarizes related work in the field of anti-pattern prediction. Section 3 explains the methodologies used in our experimentation. The research framework, result analysis, and model performance are presented in Sections 4 and 5. Section 6 covers the comparative analysis. The results discussion and conclusion are presented in Sections 7 and 8, respectively.
Related work
A good number of methods have been proposed by various researchers to predict anti-patterns or code smells present in object-oriented software. A manual procedure to identify anti-patterns or design smells was proposed by Travassos et al.2, who used manual reviews and reading techniques to find smells that do not meet the specification. Similar work was proposed by Marinescu et al.3 to predict design smells present in software systems based on metrics extracted from their source code. They implemented their approach in the IPLASMA tool with the help of detection strategies that find the patterns indicating smells in a software system, applying ten detection strategies to predict anti-patterns or code smells. The major limitations of their approach are that extensive knowledge of metric-based rules is required to detect an anti-pattern successfully, and that varied threshold values lead to varied outcomes. Munro et al.4 proposed a new method with the objective of overcoming the limitations of text-based descriptions for predicting systematically characterized code smells, applying metric-based heuristics to detect anti-patterns.
Ciupke et al.5 presented a method to study legacy code by specifying design problems as queries; their approach extracts occurrences of the problems using models built from metrics extracted from the source code of software systems. Simon et al.6 proposed methods based on visualization concepts that complement fully automated approaches, which are productive and systematic but time-consuming. The major advantage of their strategy is that it does not require extensive manual inspections.
Rao et al.7 introduced a method to detect anti-patterns based on the design change propagation probability concept, designing models that act as detection techniques. Using this concept, they focused on two anti-patterns, Divergent Change and Shotgun Surgery. Similarly, Khomh et al.8 presented a method that combines anti-pattern definitions, the Goal Question Metric (GQM) approach, and Bayesian Detection Expert (BDTEX) to develop Bayesian Belief Networks (BBNs). The BBN method allows quality analysts to incorporate prior probabilities when predicting anti-patterns.
Moha et al.9 proposed an automated method to predict different types of anti-patterns, such as Spaghetti Code, Functional Decomposition, Blob, and Swiss Army Knife. Their proposed method also helps identify 15 underlying code smells. They named their method DECOR, which contains all the necessary steps used to specify and detect code and design smells. The same team also proposed another detection method, called DETEX9, which provides a platform to convert the rules extracted with DECOR into detection algorithms. They clearly explained the correlation between the metrics extracted from code and different categories of anti-patterns.
Hemanta Kumar Bhuyan and Vinayakumar Ravi presented the importance of feature selection techniques in data mining applications11. They proposed an optimization model using a Lagrangian multiplier to find and analyze a new class, and used several classifiers with search and statistical methods to validate the proposed sub-features. Their findings confirm that the proposed methods benefit novel classes based on selected sub-feature data. Hemanta Kumar Bhuyan and Narendra Kumar Kamila also discussed the importance of feature selection techniques in data mining applications12. They used fuzzy probabilities to propose privacy preservation of individual data for both feature and sub-feature selection, and concluded that the fuzzy random variable approach confines the expected range within which the selection of sub-features from the feature database becomes easy. Similar work was done by Hemanta Kumar Bhuyan et al. on the importance of feature selection during model development13. They proposed methods to choose optimal features for classification by utilizing mutual information (MI) and linear correlation coefficients (LCC); their proposed methods offer the best selection on the same data set compared to others.
Motivation
Based on the above survey, profound research has been conducted in the area of web service anti-pattern prediction models using machine learning approaches. However, further analysis indicates that very little effort has been invested in converting class-level metrics to file-level metrics, handling the class imbalance of datasets, removing irrelevant features, and comparing a wide variety of machine learning techniques. As a result, there is a need for in-depth research to evaluate the performance of anti-pattern prediction models by combining aggregation, feature selection, and sampling techniques. This is the primary motivation for our present work. It leads us to focus on addressing the substantial gap identified and improving the performance and predictability of anti-pattern prediction models by engaging aggregation, sampling, and feature selection techniques jointly with a wide variety of machine learning techniques. This research work exploits the implications of sixteen aggregation measures, seven feature selection techniques, five sampling strategies, and thirty-three different classifiers to develop the best web service anti-pattern prediction models. The performance of these developed models is analyzed using AUC and accuracy metrics. This leads to the following research questions (RQ):
-
RQ 1:
Can web-service anti-pattern prediction models be developed using source code metrics and machine learning?
-
RQ 2:
What is the significant impact of considering reduced sets of features as input on the performance of models?
-
RQ 3:
What is the significant impact of sampling techniques on the predictability of anti-pattern prediction models?
-
RQ 4:
What effect do different classifiers have on predicting anti-patterns using source code metrics?
Methodologies
This section describes the components required for our study: the datasets, feature selection techniques, sampling strategies, and classification approaches.
Data collection
We prepared the datasets in this experiment to validate our proposed anti-pattern prediction framework. Figure 1 shows the working procedure used to prepare the datasets. Initially, we applied the WSDL2Java tool to each WSDL file to extract its Java files. These extracted Java files are used as input to the CKJM10 tool to compute the object-oriented metrics listed in Table 1 at the class level. CKJM takes Java files as input and computes metrics at the class level, but we need metrics at the system level because, in this experiment, we predict anti-patterns at the WSDL level. To achieve this, we applied aggregation measures to compute metrics at the system level. Vasilescu et al.14 suggested using multiple aggregation measures to lift metrics to a higher level without losing information and empirically showed that using a single aggregation measure leads to information loss. Therefore, in this work, we applied the 16 aggregation measures listed in Table 2 to the class-level metrics to obtain metrics at the system level; a small aggregation sketch is given below.
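To make the aggregation step concrete, the following is a minimal sketch, assuming Python with NumPy, pandas, and SciPy, of how the class-level CKJM metrics of one WSDL file's Java classes can be collapsed into system-level features using several of the measures of Table 2. The DataFrame name `class_metrics`, the `gini` helper, and the exact subset of measures shown are illustrative assumptions rather than the exact tool chain used in this work.

```python
# Minimal sketch: aggregate class-level CKJM metrics into system-level features
# for one WSDL file. `class_metrics` is a hypothetical DataFrame with one row
# per Java class and one column per CKJM metric (WMC, CBO, RFC, LCOM, Ca, ...).
import numpy as np
import pandas as pd
from scipy import stats

def gini(values):
    """Gini index of a non-negative metric distribution (0 = perfect equality)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    if n == 0 or v.sum() == 0:
        return 0.0
    cum = np.cumsum(v)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

AGGREGATORS = {
    "mean": np.mean, "median": np.median, "variance": np.var,
    "std": np.std, "max": np.max,
    "q1": lambda x: np.percentile(x, 25), "q3": lambda x: np.percentile(x, 75),
    "skewness": stats.skew, "kurtosis": stats.kurtosis, "gini": gini,
}

def system_level_features(class_metrics: pd.DataFrame) -> pd.Series:
    """Collapse per-class metrics into a single feature vector for the WSDL file."""
    features = {}
    for metric in class_metrics.columns:
        values = class_metrics[metric].to_numpy(dtype=float)
        for name, fn in AGGREGATORS.items():
            features[f"{name}({metric})"] = float(fn(values))
    return pd.Series(features)
```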
Experimental dataset
This experiment makes use of the publicly available web-services dataset consisting of 226 WSDL files shared by Ouni et al. on GitHub (https://github.com/ouniali/WSantipatterns). Table 3 shows a detailed description of the considered dataset in terms of the different types of anti-patterns. The first column of the table contains the name of the anti-pattern: Fine-Grained anti-pattern (FGWS), Chatty anti-pattern (CSW), God Object anti-pattern (GOWS), Data anti-pattern (DWS), and Ambiguous anti-pattern (AWS). The second column contains the number of web services not having the pattern, the third column contains the number of web services having the pattern, and the last column contains the percentage of web services having the pattern. From Table 3, we can see that 13 web services (5.75%) have the FGWS anti-pattern.
Data balancing techniques
The information in Table 3 confirms that the considered datasets do not have an equal distribution of anti-patterns; for example, only 9.29% of the WSDL files have the CSW type of anti-pattern. This confirms that the considered datasets have a class imbalance problem. So, we applied five data sampling techniques, namely the Adaptive Synthetic Sampling Technique (ADASYN), the Synthetic Minority Oversampling Technique (SMOTE), SVMSMOTE, Borderline SMOTE (BLSMOTE), and the Up-sampling Technique (UPSAM), to generate balanced data; a brief usage sketch follows the list below. The predictive ability of these techniques is also compared with a model trained on the original data to find the impact of using sampling techniques.
-
SMOTE15: SMOTE is based on the concept of nearest neighbors. It generates synthetic minority class instances by interpolating between a minority instance and its nearest neighbors.
-
Borderline smote (BLSMOTE)16: BLSMOTE creates new instances of the minority class utilizing the closest neighbors of these cases in the border region between classes.
-
SVM-SMOTE (SVMSMOTE)17: SVMSMOTE generates new minority class samples near the borderline identified by an SVM, which is used to establish the boundary between the classes18.
-
Adaptive synthetic sampling technique (ADASYN)19: ADASYN is built on the notion of adaptively producing minority data samples depending on their distributions. More synthetic data is created for minority-class samples that are more difficult to learn than for minority-class samples that are simpler to understand. This strategy helps to lessen the learning bias imposed by the initial unbalanced data distribution. Still, it may also adaptively move the decision boundary to concentrate on samples that are harder to learn, which is very useful when dealing with large datasets. The most significant distinction between SMOTE and ADASYN is how synthetic sample points for minority data points are generated in each system20. In ADASYN, we consider a density distribution \(r_x\), which determines the number of synthetic samples to create for a given point, while in SMOTE, all minority points have the same weight.
-
Upsampling (UPSAM) technique: Upsampling is the technique in which the instances from the minority class are randomly duplicated21.
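As a rough illustration of how the five strategies above can be applied, the sketch below assumes the imbalanced-learn package; the variable names `X` and `y` and the use of `RandomOverSampler` as a stand-in for UPSAM are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of the five sampling strategies compared in this work.
# X holds the system-level metrics and y the binary anti-pattern labels.
from imblearn.over_sampling import (
    ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE, SVMSMOTE,
)

SAMPLERS = {
    "SMOTE": SMOTE(random_state=42),
    "BLSMOTE": BorderlineSMOTE(random_state=42),
    "SVMSMOTE": SVMSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "UPSAM": RandomOverSampler(random_state=42),  # random duplication of minority rows
}

def balanced_variants(X, y):
    """Yield (name, X_resampled, y_resampled) for the baseline and each sampler."""
    yield "ORGD", X, y  # original, imbalanced data as the baseline
    for name, sampler in SAMPLERS.items():
        X_res, y_res = sampler.fit_resample(X, y)
        yield name, X_res, y_res
```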
Selection of relevant metrics
In the process of Knowledge Discovery in Databases (KDD), Feature Selection (FS) is a vital part of the pre-processing step. Feature selection algorithms go by numerous names, including attribute selection, instance selection, data selection, feature construction, variable selection, and feature extraction. They are primarily used to remove unnecessary and redundant material. Feature selection methods22 enhance the quality of the data and boost the accuracy of data mining algorithms by minimizing the data's complexity in terms of space and time. Eliminating duplicate and irrelevant data is the primary goal of feature selection. Several feature selection methods have been released in the last decade; however, the vast majority of them do not perform well on high-dimensional datasets with a significant number of duplicated features. As a result, feature selection is critical for eliminating irrelevant features23,24, so that machine learning algorithms can concentrate on the features required to build a classification model. Two subclasses of feature selection techniques can generally be distinguished:
-
Metrics selection using feature ranking techniques: In this technique, each feature is ranked according to a few key criteria before some features that are appropriate for a particular project are chosen.
-
Metrics selection using feature subset selection techniques: In feature subset selection, our objective is to find a subset of features that has strong predictive power.
Metrics selection using feature ranking techniques
-
Selection of significant features (SIGF): Initially, we applied hypothesis testing to each metric to determine "whether the metric can differentiate WSDL files having an anti-pattern from those that do not"25. In this experiment, we applied the Wilcoxon signed-rank test at the 0.05 level to find the difference between the metric values for files having an anti-pattern and files not having one. This test is used to determine whether two related samples differ significantly.
-
Features ranking using information gain (INFG): Information gain is an attribute ranking approach that is both simple and quick and is widely employed in text classification applications, where the sheer volume of data makes it impossible to utilize more complicated methods26 (a small ranking sketch based on this idea is given after this list). If P is an attribute and Q is a class, then Eqs. 1 and 2 give the entropy of the class before and after the attribute is observed:
$$H(Q) = -\sum_{q \in Q} p(q)\log_2 p(q)$$ (1)

$$H(Q|P) = -\sum_{p \in P} p(p)\sum_{q \in Q} p(q|p)\log_2 p(q|p)$$ (2)

When the entropy of a class lowers by a certain level, it indicates how much new information about that class has been supplied by the attribute; this is referred to as information gain. Based on the information gain value between the class and each \(P_i\), a score is awarded to each \(P_i\):

$$IG_i = H(Q) - H(Q|P_i) = H(P_i) - H(P_i|Q) = H(P_i) + H(Q) - H(P_i, Q)$$ (3)

-
Features ranking using gain ratio (GNR): The gain ratio is a modification of the information gain that decreases its bias. When picking an attribute, the gain ratio considers the number and size of branches27. It corrects the information gain by taking the intrinsic information into account. Intrinsic information is the entropy of the distribution of instances into branches, i.e., how much information is required to determine which branch an instance belongs to. The value of an attribute decreases as its intrinsic information increases.
$$\text{GNR}=\frac{\text{Gain of attribute}}{\text{Intrinsic information of attribute}}$$ (4)
-
Features ranking using OneR attribute evaluation (OneR): OneR, short for "One Rule", is a straightforward but accurate classification algorithm that generates one rule for each predictor in the data and then selects the rule with the smallest total error as its "one rule". A rule for a predictor is created by building a frequency table for that predictor against the target28. Compared to state-of-the-art classification algorithms, OneR has been shown to create rules that are only marginally less accurate while being straightforward for people to comprehend.
-
Features extraction using principal component analysis (PCA): Principal Component Analysis (PCA)29 is applied to find new feature values with high variance. The concept is based on removing highly correlated features and finding new sets of feature values. Here, we applied PCA with the varimax rotation technique to the extracted sets of file-level source code metrics. In this work, we considered all principal components whose eigenvalue is greater than 1.
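The small sketch referenced in the information gain item above follows; it assumes scikit-learn and uses `mutual_info_classif` as an estimator of the class-entropy reduction of Eq. (3), which is an approximation for illustration rather than the exact evaluator used in this work.

```python
# Minimal sketch of information-gain-style feature ranking over the aggregated metrics.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_by_information_gain(X: pd.DataFrame, y, top_k: int = 10) -> pd.Series:
    """Return the top_k metrics ordered by estimated information gain."""
    scores = mutual_info_classif(X, y, random_state=42)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False).head(top_k)
```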
Metrics selection using feature subset selection techniques
-
Selection of features using the correlation coefficient (CORR): Correlation-coefficient-based feature selection is used to remove features that are highly correlated with other features. In this paper, we used Pearson's correlation to determine whether a pair of features is highly correlated, i.e., a coefficient \(\ge\) 0.7 or \(\le\) -0.7 represents high correlation30. After finding highly correlated feature pairs, we retain only one feature of each pair based on certain conditions.
-
Selection of Features using CFS subset Evaluator (CFS): This technique assesses the effectiveness of the subset of features by taking into consideration the predictive ability of each feature. This technique selects the subset of features with low inter-correlation but is highly correlated with the target class31.
-
Selection of features using a genetic algorithm (GA): A genetic algorithm32 helps search for the best set of features to improve the performance of the models. The advantage of this technique is that it allows the best solution to emerge from the best of the earlier solutions. The core idea is to combine solutions from generation to generation, extracting the best features (genes) from each to create new and fitter individuals. Figure 2 shows the flowchart of the GA used to find the best sets of metrics (a simplified sketch of this loop is given after Eq. (5)). Initially, we generated 50 chromosomes, with each gene of a chromosome containing the value 0 or 1, i.e., 0 for not considering a feature and 1 for considering it. Then, we computed the fitness value of each chromosome using Equation 5, which is designed to maximize accuracy and minimize the number of features. After finding the fitness values of all chromosomes, we selected the chromosome with the highest fitness value and compared it with the stopping condition; if satisfied, we stop, otherwise we proceed with the next step. The next step is to apply crossover and mutation to the chromosomes to obtain half the number of chromosomes, i.e., two chromosomes are combined using crossover to produce one chromosome. The remaining half of the chromosomes are generated randomly. This process continues until the stopping condition is met.
$$\text{Fitness}=0.8\times \text{Accuracy}+0.2\times \frac{\text{Total}_{\text{Features}}-\text{Selected}_{\text{Features}}}{\text{Total}_{\text{Features}}}$$ (5)
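The simplified sketch referenced in the GA description above is given below. It follows the loop of Fig. 2 with the fitness of Eq. (5); the population size of 50 matches the text, while the wrapped decision tree, the mutation rate, and the other operator details are illustrative assumptions.

```python
# Minimal sketch of GA-based feature selection with the fitness of Eq. (5).
# X is a NumPy feature matrix, y the binary anti-pattern labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def fitness(chromosome, X, y):
    """Eq. (5): 0.8 * accuracy + 0.2 * fraction of discarded features."""
    if chromosome.sum() == 0:
        return 0.0
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, chromosome.astype(bool)], y, cv=5).mean()
    return 0.8 * acc + 0.2 * (len(chromosome) - chromosome.sum()) / len(chromosome)

def ga_select(X, y, pop_size=50, generations=20, mutation_rate=0.01):
    n_features = X.shape[1]
    population = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness(c, X, y) for c in population])
        parents = population[np.argsort(scores)[::-1][: pop_size // 2]]
        # Single-point crossover of consecutive parents, then bit-flip mutation.
        children = []
        for i in range(len(parents)):
            a, b = parents[i], parents[(i + 1) % len(parents)]
            point = rng.integers(1, n_features)
            child = np.concatenate([a[:point], b[point:]])
            flip = rng.random(n_features) < mutation_rate
            child[flip] = 1 - child[flip]
            children.append(child)
        # The remaining half of the next population is generated randomly.
        randoms = rng.integers(0, 2, size=(pop_size - len(children), n_features))
        population = np.vstack([np.array(children), randoms])
    best = max(population, key=lambda c: fitness(c, X, y))
    return best.astype(bool)  # boolean mask over the metrics
```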
Classification techniques
The primary objective of this research is to find the patterns, based on source code metrics extracted from the Java files of a WSDL file, that help predict anti-patterns present in unseen WSDL files. These patterns are identified using thirty-three different variants of machine learning techniques, as shown in Table 4. These techniques are validated using 5-fold cross-validation, and their ability to predict anti-patterns is computed in terms of accuracy and AUC values.
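As a minimal sketch of how each classifier's accuracy and AUC can be estimated under 5-fold cross-validation (assuming scikit-learn), the snippet below uses a random forest and a small synthetic imbalanced dataset purely as stand-ins for the classifiers of Table 4 and the real WSDL metrics.

```python
# Minimal sketch: 5-fold cross-validated accuracy and AUC for one classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def evaluate(clf, X, y):
    """Return mean accuracy (%) and mean AUC over 5 stratified folds."""
    scores = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "roc_auc"])
    return 100 * scores["test_accuracy"].mean(), scores["test_roc_auc"].mean()

# Synthetic stand-in for the aggregated WSDL metrics (illustration only).
X, y = make_classification(n_samples=226, n_features=30, weights=[0.9, 0.1], random_state=42)
acc, auc = evaluate(RandomForestClassifier(random_state=42), X, y)
print(f"accuracy = {acc:.2f}%, AUC = {auc:.2f}")
```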
Proposed framework
Figure 3 shows our framework, which consists of several steps. The considered dataset is a set of WSDL files used as input. The detailed steps of the proposed framework are given below:
-
As shown in Fig. 3, we have calculated the Chidamber and Kemerer Java Metrics(CKJM) for each Java file generated from the WSDL file. Then, we applied different aggregation measures to the CKJM metrics computed from each Java file to generate file-level metrics.
-
After finding metrics at the system level using different aggregation measures, we have also applied feature selection techniques to find the relevant set of features and remove irrelevant features. This set of metrics is later used as input to generate models for detecting web service anti-patterns. The Min-max normalization approach is used for normalizing the values of all selected features in the range of 0 to 1.
-
While reviewing and inspecting the datasets, we observed that the considered data have an imbalanced nature of classes. Henceforth, to handle the class imbalance problem and its impact on the prediction accuracy of the models, we have also used five data sampling techniques. We compare the performance of the models generated using this sampling technique with the model developed using the original data (ORGD).
-
After obtaining the balanced data with relevant sets of features, as shown in Fig. 3, we used a wide variety of classifiers. These techniques comprise general ML classifiers (LOGR, DT, etc.), advanced ML classifiers (ELM, WELM, etc.), DL with distinct numbers of hidden layers (DL1, DL2, etc.), and ensemble classifiers (BAG, EXTR, etc.) to train the anti-pattern models and find important patterns that help identify anti-patterns in unseen data (an end-to-end sketch of this step follows the list). These models are validated using a 5-fold cross-validation approach. Table 17 contains the hyper-parameters used for model development.
-
Finally, the impact and dependability of these techniques are measured using different performance parameters such as AUC and Accuracy. Table 5 shows the naming conventions used in this work.
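The end-to-end sketch referenced in the list above is shown below; it assumes imbalanced-learn's Pipeline so that min-max scaling and oversampling are fitted only on the training folds of the 5-fold cross-validation, and the choice of SMOTE and an RBF support vector classifier is an illustrative configuration rather than the framework's fixed setup.

```python
# Minimal sketch of the modelling step: scaling, sampling, and classification
# wrapped in one pipeline so no information leaks from the test folds.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def build_model():
    return Pipeline([
        ("scale", MinMaxScaler()),            # min-max normalization to [0, 1]
        ("balance", SMOTE(random_state=42)),  # any sampler from the earlier sketch fits here
        ("clf", SVC(kernel="rbf", probability=True)),
    ])

def run(X, y):
    scores = cross_validate(build_model(), X, y, cv=5, scoring=["accuracy", "roc_auc"])
    return scores["test_accuracy"].mean(), scores["test_roc_auc"].mean()
```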
Results and analysis
In this segment of the paper, we showcase the results and performance obtained from the feature ranking and feature subset selection techniques over class-level metrics on the imbalanced dataset and on the balanced datasets generated by the sampling techniques. To obtain these balanced datasets, we first used the five stated sampling techniques to overcome the class imbalance issue. Then we employed different variants of classifiers to detect anti-patterns. The models' effectiveness was computed using different performance parameters. Considering the space constraint, we include the results of one randomly selected feature ranking technique and one feature subset selection technique.
Feature selection results
Here, in this study, we would like to compare and contrast feature-subset selection and feature ranking techniques to examine if any of the techniques is superior to the others or if all the techniques perform equally well.
Relevant feature sets are generated after applying the feature selection techniques, namely Significant Features (SIGF) obtained by applying the Wilcoxon signed-rank test, Information Gain (INFG), Gain Ratio (GNR), Correlation coefficient (CORR), Genetic Algorithm (GA), CFS subset evaluator (CFS), OneR, and Principal Component Analysis (PCA), along with the 13 aggregation techniques, namely variance, arithmetic mean, skewness, median, quartile 1, Theil index, standard deviation, quartile 3, generalized entropy, maximum, Gini index, kurtosis, and Atkinson index; all of these are used as input for generating the models for detecting web service anti-patterns. Along with this, a model using the original dataset (OD) is also generated for detecting web service anti-patterns. The sets of features selected after applying each of the considered feature selection techniques are given in Tables 6, 7, 8, 9 and 10, which contain the results for anti-pattern types 1 to 5. The information in Table 6 suggests that features such as Q1(WMC), Mean(CBO), Gini index(CBO), Hoover index(CBO), Generalized entropy(RFC), skewness(LCOM), Q1(Ca), and Max(MOA) form the best set of features identified using information gain for AP1.
Accuracy and AUC values analysis
In this work, we used a wide range of classification techniques to find the important patterns that help identify different types of anti-patterns in web services. Initially, we tried the most frequently used classifiers, such as three variants of Naive Bayes, the Support Vector Classifier with a linear kernel (SVC-LIN), SVC with a polynomial kernel (SVC-POLY), SVC with a radial basis function kernel (SVC-RBF), and Logistic Regression Analysis (LOGR). We then used advanced machine learning techniques such as the extreme learning machine (ELM), least squares SVM, and weighted extreme learning machine (WELM) with different kernels, along with ensemble classifiers such as the AdaBoost Classifier (AdaB), Random Forest Classifier (RF), Bagging Classifier (BAG), Extra Trees Classifier (EXTR), and Gradient Boosting Classifier (GraB). Further, deep learning techniques with varying numbers of hidden layers were also used to find the important patterns that help identify the different types of anti-patterns present in web services. These techniques are validated using a 5-fold cross-validation approach and compared using accuracy and AUC values on the testing data. In this work, we also examined the benefit of using different variants of sampling techniques, such as SMOTE, UPSAM, and BLSMOTE, to handle the class-imbalanced nature of the data sets. To deal with the feature redundancy problem, we used different variants of aggregation techniques to compute system-level metrics from class-level metrics without losing important information, and different variants of feature selection techniques to remove irrelevant metrics and find the best combination of relevant metrics. Tables 11, 12, and 13 show the accuracy and AUC values of the models trained using the most frequently used classifiers, the advanced classifiers, and ensemble learning. The rows of the tables represent the input metrics for the models, and the columns represent the classifiers used to train the models; for example, the anti-pattern prediction model trained using MNB with all features as input achieved an accuracy of 84.96% and an AUC value of 0.86. AUC values greater than 0.7 confirm that the trained models are able to predict anti-patterns using source code metrics. The high AUC values obtained with advanced machine learning techniques, such as LSSVM and WELM with different kernels, confirm that these models have better anti-pattern prediction ability than the other techniques. Similarly, the models trained on sampled data predict better than those trained on the original data. Finally, the models developed using selected sets of features as input achieve higher AUC and accuracy values, confirming that models trained on reduced feature sets have better anti-pattern prediction capability than models trained on all features.
RQ 1: Can web-service anti-pattern prediction models be developed using source code metrics and machine learning?
ANS: The high AUC values, i.e., greater than 0.7, shown in Tables 11, 12, and 13 confirm that the developed models are able to predict anti-patterns based on source code metrics. The experimental findings also confirm that the models performed better after applying sampling and feature selection techniques.
Comparative analysis
This research aims to evaluate the impact of feature selection techniques, data sampling techniques, and a wide variety of machine learning techniques on the performance of the web-service anti-pattern prediction models. Considering this, we have applied twenty-two different sets of features, five different data sampling techniques, and thirty-three different classifiers for anti-pattern prediction. The predictive power of these techniques is computed using accuracy and AUC and compared with the help of box-plot diagrams and statistical hypothesis tests. The detailed assessment of the performance of each of these techniques is presented in the subsequent subsections.
Aggregation measures and feature selection techniques
In our experiment, different aggregation measures were used to find the source code metrics at the system level from the class level without losing information. Further, eight feature selection techniques have also been used to remove irrelevant and redundant features. After applying aggregation measures and feature selection techniques along with the original features, all these feature sets are used as input for developing the models for detecting web service anti-patterns. Finally, Statistical and AUC studies were used to determine the significance and reliability of various feature selection strategies on five different types of anti-patterns.
Comparison of different aggregation measures and sets of features: Descriptive statistics and box-plot: Figure 4a and b depict the box-plots for the accuracy and AUC of the different aggregation measures and sets of features. The descriptive statistics of all employed feature selection techniques are presented in Table 14. The following conclusions can be drawn from Fig. 4 and Table 14:
-
All the models give reasonable accuracies ranging between 75-95 % and AUC values between 0.8-0.95.
-
The models trained on all features achieve a mean accuracy of 83.35% and a mean AUC of 0.80.
-
Among the aggregation measures, AG3 shows the best performance, with a mean AUC value of 0.86. At the same time, the model developed using the feature set computed by using AG6 as input shows the worst performance, with a mean AUC value of 0.76.
-
Among all the feature sets considered as input for developing the models to detect web service anti-patterns, SIGF yields the best model, with a mean AUC value of 0.88. In contrast, the model developed with the features selected by PCA as input is the worst model, with a mean AUC value of 0.71. The model developed using AG3 has the second-best performance, with a mean AUC of 0.86.
Comparison of different aggregation measures and sets of features: Wilcoxon Signed Rank Test (WSRT) with Friedman mean rank (FMR): In this experiment, we also employed two statistical tests for hypothesis analysis: the Wilcoxon Signed Rank Test and the Friedman test. Initially, we applied WSRT to find pair-wise significant differences between the predictive capabilities of the models trained with different sets of features as input. This test is used to test the null hypothesis "There is no significant impact on the performance of anti-pattern models after applying feature selection techniques". The null hypothesis is rejected if the p-value computed using WSRT is less than 0.05. Figure 5 shows the result of WSRT on different pairs of feature sets, i.e., the \(\times\) symbol indicates a p-value \(\le\) 0.05 and the remaining symbol indicates a p-value > 0.05. According to Fig. 5, the predictive ability of the models is significantly impacted by using different sets of features. After finding the impact of feature selection techniques, we also applied the Friedman mean rank (FMR) to find the best sets of features for anti-pattern prediction. The last column of Table 14 shows the FMR for the aggregation measures and the various applied feature selection techniques. According to the FMR, SIGF has the lowest mean rank of 5.97. Hence, we conclude that the models trained with the sets of features selected using SIGF have significantly better prediction ability than the other techniques. Similarly, PCA has the highest mean AUC rank, 17.90, indicating that the model developed with features selected by PCA has the worst performance.
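For concreteness, the sketch below (assuming SciPy) shows how the pairwise Wilcoxon signed-rank test and the Friedman mean ranks used throughout this section can be computed; the small AUC matrix is a hypothetical placeholder, not the measured values of this study.

```python
# Minimal sketch of the statistical comparison: pairwise WSRT and Friedman mean ranks.
import numpy as np
from scipy import stats

# auc[i, j] = AUC of technique j on experimental configuration i (hypothetical values).
auc = np.array([[0.88, 0.80, 0.71],
                [0.90, 0.82, 0.75],
                [0.86, 0.79, 0.70],
                [0.91, 0.84, 0.73],
                [0.89, 0.81, 0.72],
                [0.87, 0.83, 0.74]])

# Pairwise Wilcoxon signed-rank test: reject the null hypothesis when p < 0.05.
_, p = stats.wilcoxon(auc[:, 0], auc[:, 1])
print(f"technique 1 vs. technique 2: p = {p:.4f}")

# Friedman test over all techniques, then mean ranks (lower rank = better).
_, p_all = stats.friedmanchisquare(*(auc[:, j] for j in range(auc.shape[1])))
mean_ranks = stats.rankdata(-auc, axis=1).mean(axis=0)
print(f"Friedman p = {p_all:.4f}, mean ranks = {mean_ranks}")
```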
RQ 2: What is the significant impact of considering reduced sets of features as input on the performance of models?
ANS: The experimental findings based on Figs. 4 and 5 and Table 14 confirm that the models trained with selected sets of features predict significantly better than those trained with all features.
Sampling techniques
In this experiment, we have also considered five data sampling techniques, namely SMOTE, BLSMOTE, SVMSMOTE, ADASYN, and UPSAM, to tackle the class imbalance problem, and the resulting balanced datasets are used as training data for the anti-pattern prediction models. The significance and reliability of these sampling strategies were determined using statistical and AUC analyses.
Comparison of sampling techniques using descriptive statistics and box-plots: Figure 6 shows the box-plot diagrams for the accuracy and AUC of the models trained on the sampled datasets generated using the five sampling techniques. The descriptive statistics for AUC and accuracy of the considered sampling techniques are summarized in Table 15. According to Fig. 6 and Table 15, the models trained on data sampled using upsampling (UPSAM) achieved the best results, with a mean AUC of 0.87. In contrast, the model developed with the original data has the worst performance, with a mean AUC of 0.70. Among the data sampling techniques applied, ADASYN and SMOTE showed the weakest performance, with mean AUC values of 0.83 each.
Comparison of different sampling techniques: Wilcoxon Signed Rank Test (WSRT) with Friedman mean rank (FMR):
In this experiment, we again employed the Wilcoxon Signed Rank Test and the Friedman test for hypothesis analysis. Initially, we applied WSRT to verify the impact of sampling techniques on the performance of the anti-pattern prediction models, testing the null hypothesis "There is no significant impact on the performance of anti-pattern models after training on balanced data". Figure 7 shows the result of WSRT on different pairs of sampling techniques, i.e., the \(\times\) symbol indicates a p-value \(\le\) 0.05 and the remaining symbol indicates a p-value > 0.05. The information in Fig. 7 suggests that the null hypothesis is rejected for all comparable pairs of sampling techniques. Hence, we conclude that the predictive ability of the models is significantly impacted by using sampling techniques. After verifying that the performance of the models improves significantly after training on sampled data, we used the Friedman test to find the best sampling technique; a lower Friedman rank represents a better result. Table 15 shows the Friedman test results for the various data sampling techniques. From Table 15, we infer that UPSAM has the best performance with a mean AUC rank of 2.13, whereas the model developed with the original dataset has the worst performance with a mean rank of 5.41.
RQ 3: What is the significant impact of sampling techniques on the predictability of anti-pattern prediction models?
ANS: The experimental findings based on Figs. 6 and 7 and Table 15 confirm that the predictive ability of the models is significantly impacted by using sampling techniques. The performance of the models improves significantly after training on sampled data.
Classification techniques
In this work, 33 classifiers, ranging from general machine learning classifiers to advanced deep learning classifiers, have been employed to train models for detecting web service anti-patterns. We computed the implications and dependability of these classifiers using box plots, descriptive statistics, and statistical test analyses on the different anti-patterns.
Comparison of different classification techniques using descriptive statistics and box plots: Figure 8 shows the AUC and accuracy box plots for the different categories of classifier techniques. According to Fig. 8, we can conclude the following:
-
Among the general classifiers category, KNN shows the best performance with a mean AUC value of 0.92. In contrast, the Support Vector Machine with the linear kernel (SVC-LIN) offers the worst performance, with a mean AUC value of 0.62.
-
Among the ensemble classifiers employed for training the models for detecting anti-patterns, the Extra Trees classifier (EXTR) and Random Forest (RF) classifier show the best performance with a mean AUC value of 0.94, and the Gradient Boosting classifier (GraB) delivers the worst performance with a mean AUC of 0.88.
-
Of all the deep learning algorithms with varying hidden layers, DL2 shows the best performance with a mean AUC value of 0.84, and DL4 offers the worst performance with a mean AUC value of 0.81. DL3 performance is similar to that of the model developed using DL2.
-
Over the advanced ML classifiers used for developing models for detecting web service anti-patterns, LSSVM-RBF shows the best performance with a mean AUC value of 0.99. In contrast, the models trained using ELM-LIN show the worst performance, with a mean AUC value of 0.70.
Figure 9 shows the AUC and accuracy values for all the classifier techniques combined. The descriptive statistics for all the classifier techniques are depicted in Table 16. From Fig. 9 and Table 16, we infer that the models trained using LSSVM-RBF show the best performance with a mean AUC value of 0.99. LSSVM-Poly, RF, and EXTR perform next best, with mean AUC values of 0.95, 0.94, and 0.94, respectively. SVC-LIN delivers the worst performance with a mean AUC value of 0.63.
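Because a least-squares SVM with an RBF kernel is not part of mainstream libraries, the following minimal sketch shows one way to implement it, following Suykens' classical formulation in which the quadratic program of a standard SVM is replaced by a single dense linear system; the hyper-parameter values and the class interface are illustrative assumptions, not the exact implementation used in this work.

```python
# Minimal sketch of a least-squares SVM classifier with an RBF kernel.
import numpy as np

class LSSVMRBF:
    def __init__(self, gamma=1.0, sigma=1.0):
        self.gamma = gamma   # regularization weight
        self.sigma = sigma   # RBF kernel width

    def _kernel(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.sigma ** 2))

    def fit(self, X, y):
        # y must be encoded as {-1, +1}
        n = len(y)
        Omega = np.outer(y, y) * self._kernel(X, X)
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = Omega + np.eye(n) / self.gamma
        rhs = np.concatenate([[0.0], np.ones(n)])
        sol = np.linalg.solve(A, rhs)      # dense linear system instead of a QP
        self.b, self.alpha = sol[0], sol[1:]
        self.X_train, self.y_train = X, y
        return self

    def predict(self, X):
        K = self._kernel(X, self.X_train)
        return np.sign(K @ (self.alpha * self.y_train) + self.b)
```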
Comparison of different classification techniques: Wilcoxon Signed Rank Test (WSRT) with Friedman mean rank (FMR): Similar to the previous subsections, we applied the Wilcoxon test and the Friedman test to compute statistically significant differences among the various pairs of classifier techniques. Initially, we applied WSRT to verify the impact of different classifiers on the performance of the anti-pattern prediction models, testing the null hypothesis "There is no significant impact on the performance of anti-pattern models after changing classifiers". Figure 10 shows the result of WSRT on different pairs of classifier techniques, i.e., the \(\times\) symbol indicates a p-value \(\le\) 0.05 and the remaining symbol indicates a p-value > 0.05. According to Fig. 10, the predictive abilities of the models trained using different classifiers are not significantly the same. Table 16 shows the Friedman test results for the various classifier techniques considered in this work. From Table 16, we infer that LSSVM-RBF has the best performance with a mean rank of 1.18, whereas the SVC-LIN classifier has the worst performance with a mean rank of 27.60.
RQ 4: What effect do different classifiers have on predicting anti-patterns using source code metrics?
ANS: The experimental findings based on Figs. 9 and 10 and Table 16 confirm that the predictive abilities of the models trained using different classification techniques are significantly different; the choice of classification technique significantly affects model performance.
Discussion of results
In this work, extensive experimentation using different variants of aggregation measures, feature selection, data sampling, and classifiers has been carried out, and a solution for developing models that predict anti-patterns from object-oriented metrics with improved performance and predictability is proposed. In general, prediction models with an AUC value above 0.7 are considered able to predict the class of unseen patterns, i.e., models with an AUC above 0.7 are acceptable to the community. The experimental results obtained using the proposed framework confirm that the trained models deliver an AUC greater than 0.7 and are able to predict anti-patterns in unseen WSDL files. We have already presented the AUC values of the models trained for anti-pattern 1 using different sets of features with different classifiers on both original and balanced data. The highest AUC values attained by the classifiers with different combinations of sampling and feature selection techniques exceed 0.9, which demonstrates the strong predictability of the developed anti-pattern prediction models. The best classifier, after the application of feature selection and sampling techniques, attained an AUC value of 1.
Conclusion
The developed web service anti-pattern prediction models using object-oriented metrics help in building quality web-based applications by identifying anti-patterns at the initial stage of Software Development. This research represents a significant step forward in the development of effective anti-pattern prediction models by dealing with feature selection, aggregation measures, and the class imbalance problem efficiently. The developed anti-pattern prediction models use different variants of classifiers, and their performance has been measured against five different variants of anti-patterns. The proposed framework was validated using 226 WSDL files collected from various domains such as finance, tourism, health, education, etc. The focused insights of this research are:
-
For most anti-patterns, adopting sampling methods such as SMOTE, BLSMOTE, SVMSMOTE, and UPSAM improves the predictability of the developed models.
-
Employing the different variants of aggregation measures with feature selection strategies over balanced datasets reduces computational effort and improves the overall performance of anti-pattern prediction models.
-
In comparison to the other sampling techniques employed to handle the data imbalance, the UPSAM technique performed best, attaining the highest mean accuracy of 86.14% and mean AUC of 0.87.
-
Experimental results suggest that the models trained with the sets of features selected using SIGF perform best compared to the other employed feature selection techniques, attaining 88.40% accuracy and 0.88 AUC. This finding confirms that the original feature set contains irrelevant features.
-
The LSSVM with RBF kernel classifier stands first among all other classifiers.
-
The study indicates that, after applying feature selection techniques, up to 97% of the irrelevant features were removed from the original dataset while the anti-pattern models retained improved performance compared to models trained on all metrics.
Data availability
The data used in this paper is available at https://github.com/ouniali/WSantipatterns. The processed data will be made available on request.
References
Kral, J. & Zemlicka, M. The most important service-oriented antipatterns. In International Conference on Software Engineering Advances (ICSEA 2007) pp. 29–29. IEEE (2007).
Travassos, G., Shull, F., Fredericks, M. & Basili, V. R. Detecting defects in object-oriented designs: using reading techniques to increase software quality. ACM Sigplan Notices 34(10), 47–56 (1999).
Marinescu, R. Detection strategies: Metrics-based rules for detecting design flaws. In 20th IEEE International Conference on Software Maintenance, 2004. Proceedings, pp. 350–359. IEEE (2004).
Munro, M. J. Product metrics for automatic identification of “bad smell” design problems in java source-code. In 11th IEEE International Software Metrics Symposium (METRICS’05), pp. 15–15. IEEE (2005).
Ciupke, O. Automatic detection of design problems in object-oriented reengineering. In Proceedings of Technology of Object-oriented Languages and Systems-TOOLS 30 (Cat. No. PR00278), pp. 18–32. IEEE (1999).
Simon, F., Steinbruckner, F. & Lewerentz, C. Metrics based refactoring. In Proceedings fifth European conference on software maintenance and reengineering, pp. 30–38. IEEE (2001).
Ananda Rao, A. & Reddy, K. N. Detecting bad smells in object oriented design using design change propagation probability matrix 1 (2007).
Khomh, F., Vaucher, S., Guéhéneuc, Y.-G. & Sahraoui, H. Bdtex: A gqm-based bayesian approach for the detection of antipatterns. J. Syst. Softw. 84(4), 559–572 (2011).
Moha, N., Guéhéneuc, Y.-G., Duchien, L. & Le Meur, A.-F. Decor: A method for the specification and detection of code and design smells. IEEE Trans. Softw. Eng. 36(1), 20–36 (2009).
Chidamber, S. R. & Kemerer, C. F. A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994).
Bhuyan, H. K. & Ravi, V. Analysis of subfeature for classification in data mining. IEEE Trans. Eng. Manag. 70(8), 2732–2746 (2021).
Bhuyan, H. K. & Kamila, N. K. Privacy preserving sub-feature selection based on fuzzy probabilities. Clust. Comput. 17(4), 1383–1399 (2014).
Bhuyan, H. K., Saikiran, M., Tripathy, M. & Ravi, V. Wide-ranging approach-based feature selection for classification. Multimedia Tools Appl. 82(15), 23277–23304 (2023).
Vasilescu, B., Serebrenik, A. & Van den Brand, M. By no means: A study on aggregating software metrics. In Proceedings of the 2nd International Workshop on Emerging Trends in Software Metrics, pp. 23–26 (2011).
Fernández, A., Garcia, S., Herrera, F. & Chawla, N. V. Smote for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018).
Xiaowei, G., Angelov, P. P. & Soares, E. A. A self-adaptive synthetic over-sampling technique for imbalanced classification. Int. J. Intell. Syst. 35(6), 923–943 (2020).
Tang, Y., Zhang, Y.-Q., Chawla, N. V. & Krasser, S. Svms modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 39(1), 281–288 (2008).
Ghorbani, R. & Ghousi, R. Comparing different resampling methods in predicting students’ performance using machine learning techniques. IEEE Access 8, 67899–67911 (2020).
Zhiquan, H., Wang, L., Qi, L., Li, Y. & Yang, W. A novel wireless network intrusion detection method based on adaptive synthetic sampling and an improved convolutional neural network. IEEE Access 8, 195741–195751 (2020).
He, H., Bai, Y., Garcia, E. A. & Li, S. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008).
Utami, E., Oyong, I., Raharjo, S., Hartanto, A. D. & Adi, S. Supervised learning and resampling techniques on disc personality classification using twitter information in Bahasa Indonesia. Appl. Comput. Inform. (2021).
Vandana, C. P. & Chikkamannur, A. A. Feature selection: An empirical study. Int. J. Eng. Trends Technol. 69(2), 165–170 (2021).
Di Mauro, M., Galatro, G., Fortino, G. & Liotta, A. Supervised feature selection techniques in network intrusion detection: A critical review. Eng. Appl. Artif. Intell. 101, 104216 (2021).
Dogra, V., Singh, A., Verma, S., Jhanjhi, N. Z & Talib, M. N. et al. Understanding of data preprocessing for dimensionality reduction using feature selection techniques in text classification. In Intelligent computing and innovation on data science, pp. 455–464. Springer (2021).
Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D. & Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 1(1), 56–70 (2020).
Stiawan, D. et al. Cicids-2017 dataset feature analysis with information gain for anomaly detection. IEEE Access 8, 132911–132921 (2020).
Hasdyna, N., Sianipar, B., & Zamzami, E. M. Improving the performance of k-nearest neighbor algorithm by reducing the attributes of dataset using gain ratio. J. Phys.: Conf. Ser., vol. 1566, p. 012090. IOP Publishing (2020).
Al Sayaydeha, O. N. & Mohammad, M. F. Diagnosis of the Parkinson disease using enhanced fuzzy min-max neural network and OneR attribute evaluation method. In 2019 International Conference on Advanced Science and Engineering (ICOASE), pp. 64–69. IEEE (2019).
Karamizadeh, S., Abdullah, S. M., Manaf, A. A., Zamani, M. & Hooman, A. An overview of principal component analysis. J. Signal Inf. Process., 4 (2020).
Meng, X.-L., Rosenthal, R. & Rubin, D. B. Comparing correlated correlation coefficients. Psychol. Bull. 111(1), 172 (1992).
Xu, S., Zhang, Z., Wang, D., Hu, J., Duan, X. & Zhu, T. Cardiovascular risk prediction method based on cfs subset evaluation and random forest classification framework. In 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 228–232. IEEE (2017).
Too, J. & Abdullah, A. R. A new and fast rival genetic algorithm for feature selection. J. Supercomput. 77(3), 2844–2874 (2021).
Romano, M., Contu, G., Mola, F. & Conversano, C. Threshold-based naïve bayes classifier. Adv. Data Anal. Classification, pp. 1–37 (2023).
Dey, P. K. Project risk management: A combined analytic hierarchy process and decision tree approach. Cost Eng. 44(3), 13–27 (2002).
Niu, L. A review of the application of logistic regression in educational research: Common issues, implications, and suggestions. Educ. Rev. 72(1), 41–67 (2020).
Alam, S., Sonbhadra, S. K., Agarwal, S. & Nagabhushan, P. One-class support vector classifiers: A survey. Knowl.-Based Syst. 196, 105754 (2020).
Leong, W. C., Bahadori, A., Zhang, J. & Ahmad, Z. Prediction of water quality index (wqi) using support vector machine (svm) and least square-support vector machine (ls-svm). Int. J. River Basin Manag. 19(2), 149–156 (2021).
Wang, J., Siyuan, L., Wang, S.-H. & Zhang, Y.-D. A review on extreme learning machine. Multimedia Tools Appl. 81(29), 41611–41660 (2022).
Liu, Z., Jin, W. & Ying, M. Variances-constrained weighted extreme learning machine for imbalanced classification. Neurocomputing 403, 45–52 (2020).
Heidari, A. A., Faris, H., Mirjalili, S., Aljarah, I. & Mafarja, M. Ant lion optimizer: theory, literature review, and application in multi-layer perceptron neural networks. Nature-Inspired Optimizers: Theories, Literature Reviews and Applications, pp. 23–46 (2020).
Boateng, E. Y., Otoo, J. & Abaye, D. A. Basic tenets of classification algorithms k-nearest-neighbor, support vector machine, random forest and neural network: a review. J. Data Anal. Inf. Process. 8(4), 341–357 (2020).
Khan, Z. et al. Optimal trees selection for classification via out-of-bag assessment and sub-bagging. IEEE Access 9, 28591–28607 (2021).
Ahangari, S., Jeihani, M., Anam Ardeshiri, Md., Rahman, M. & Dehzangi, A. Enhancing the performance of a model to predict driving distraction with the random forest classifier. Transp. Res. Rec. 2675(11), 612–622 (2021).
Abubaker, H., Ali, A., Shamsuddin, S. M. & Hassan, S. Exploring permissions in android applications using ensemble-based extra tree feature selection. Indonesian J. Electr. Eng. Comput. Sci. 19(1), 543–552 (2020).
Malhotra, R. & Jain, J. Handling imbalanced data using ensemble learning in software defect prediction. In 2020 10th International Conference on Cloud Computing, Data Science and Engineering (Confluence), pp. 300–304. IEEE (2020).
Nguyen, H. & Hoang, N.-D. Computer vision-based classification of concrete spall severity using metaheuristic-optimized extreme gradient boosting machine and deep convolutional neural network. Autom. Constr. 140, 104371 (2022).
Sarker, I. H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2(6), 420 (2021).
Author information
Authors and Affiliations
Contributions
LK and ST conceptualize the topic. ST, LK, LBM, SM and AK are involved in Methodology, investigation, and validation. LK, ST and SM supervised the whole work. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical and informed consent for data used:
No ethical approval and consent are required based on the following. (a) This article does not contain any studies with animals performed by any of the authors. (b) This article does not contain any studies with human participants or animals performed by any of the authors. (c) This article does not use any figure or table from any source.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kumar, L., Tummalapalli, S., Murthy, L.B. et al. An empirical analysis on webservice antipattern prediction in different variants of machine learning perspective. Sci Rep 15, 5183 (2025). https://doi.org/10.1038/s41598-025-86454-5