Introduction

Potato (Solanum tuberosum L.) is a staple food in many countries, being the fourth most important agricultural product after wheat, rice and maize, and is grown on 20 million hectares globally, with a production of circa 366 million tons worldwide1,2. In 2022, Colombia had the 28th largest area planted with potatoes globally, at about 114 thousand hectares3, and had a total of approximately 510 thousand hectares suitable for potato growing4. The soils of most of the potato-producing regions are located in high mountain regions, ranging from 2,000 to 3,500 masl along the mountain ranges of the Andean region5, where the soils are highly variable due to volcanic ash generated by continuous volcanic activity6. These kinds of soil are characterized by acid pH values (below 5.5), high levels of active iron and aluminum, low bulk density, variable organic matter (OM) content according to altitude, high cation-exchange capacity and high phosphorus retention capacity and phosphate fixation4,6.

An understanding of soil heterogeneity in producing areas allows the assessment of fertility variation across uses and ecosystems. This knowledge leads to more efficient management of soil resources, contributes to mitigating soil degradation and provides key information for sustainable soil management7. In this regard, understanding the spatial variability of soil properties, which are site-specific, is crucial for proper resource management. One of the most commonly used approaches is the delineation of management zones (MZs)8, based on the integration of techniques such as geostatistics, geographic information systems (GIS) and machine learning (ML) through a variety of methods that allow the characterization of properties from their distribution8,9, enabling strategies focused on the development of precision agriculture10.

Spatial characterization of soil attributes and quantitative spatialization of nutrient contents facilitate understanding of the dynamics of each element in terms of biophysical parameters11, and can better explain the productivity of a soil and the most appropriate ways to manage it based on its variability12. This variability can be analyzed by clustering methods, the K-means algorithm being one of the most widely used such methods for the classification of non-hierarchical data. This method allows data to be divided into two or more groups, associating those that share the same characteristics in each group13.

Soil characterization and spatial analysis are crucial to the representation of field-specific variability, the tracking of changes in soil quality, precise fertilizer dose recommendations for optimal yield, and other aspects of soil management14. This characterization uses physical, chemical and biological parameters to determine soil quality, as these are sensitive to changes in use and cover15,16. In this regard, attributes related to tillage capacity, growth and tuberization, moisture availability, nutrient and oxygen availability are considered4. These processes enhance optimal nutrient supply, leading to improved crop yields while minimizing soil nutrient losses, and help ensure sustainable crop production and environmental sustainability8,10,17.

Soil resource management, despite its importance for the economy and for food production, faces significant challenges in Colombia because potato production, as with other crops, can be hard on soils owing to the intensive tillage and cropping patterns used18, such as planting density (25,000 plants ha−1) and rotation system (e.g. grasses, carrot and corn, among others). The lack of an adequate approach to classifying and managing production zones according to their shared characteristics has led to intensive practices that can degrade soil, with losses in resilience, productivity, sustainability, economic profitability and environmental quality for the crop18. To respond to this concern, strategies are required that identify and select suitable sites for this economic activity, taking into account available resources and their limitations19, and more efficient and sustainable agricultural practices for managing soil areas need to be promoted. Therefore, we ask whether explicitly mapping the spatial variability of soil physicochemical and nutrient properties can improve key production components in potato systems. To this end, we delineate edaphic homogeneous zones by applying unsupervised K-means clustering to multivariate soil data, and then relate these zones to potato yield and yield components. The resulting spatial typology provides an evidence base for site-specific soil management in Colombia’s potato-growing regions.

Materials and methods

Methodological approach

This study focuses on edaphic zoning in the principal potato production areas of Colombia through the use of data science techniques (Fig. 1 and Table S1). Firstly, a database of productive potato plots, which included information on soil physicochemical properties (Table S2, Table S3), was compiled and cleaned. This database was used for the clustering process conducted using the K-means algorithm to establish homologous edaphic zones, understood as areas of land that share similar soil characteristics, focusing here only on physical and chemical properties. This information was used to characterize these zones according to the variability of soil element values, establishing an approximation to their relationship with productivity factors and optimal conditions for the crop. The data analysis and processing were conducted using Python in the Google Colab environment.

Fig. 1

Flowchart illustrating the sequential phases of the proposed methodology, encompassing data preprocessing, exploratory analysis, clustering of productive zones, and the integration of multiapproach models for productivity assessment.

Study area, sample collection and physicochemical analysis

The study area covers the main potato production zones in Colombia (between latitudes 0.36° and 7.47° N and longitudes 78.21° and 72.29° W), comprising 91 municipalities in the departments of Norte de Santander, Santander, Boyacá, Cundinamarca, Antioquia, Cauca and Nariño, with a total area of 20,862.45 km2. These zones are located in the highlands of the Colombian Andes, where constant volcanic activity has generated a great diversity of soils, all of which are influenced by volcanic ash with different characteristics6. The growing system in this region is divided into two semesters, adapted to the climatic and edaphic conditions of each zone, as described in Table S1.

Data on soil physicochemical properties were collected as part of a research and technology transfer program by Fondo Nacional de Fomento de Papa-FNFP, administered by Federación Nacional de Productores de Papa-FEDEPAPA. During 2023 and 2024, 3,137 soil samples were collected from the same number of potato production plots, each with an average area of 0.7 ha. Six composite samples of about 500 g were collected in each plot at a depth of 0–30 cm, before soil tillage and fertilizer application. For soil analysis, the samples were air-dried at room temperature and passed through a 2 mm sieve.

The physical property analyzed was textural class, based on the USDA textural table using the percentages of sand, silt and clay determined with the Bouyoucos method20 after dispersion with sodium hexametaphosphate (NTC 6299). The chemical properties obtained were soil pH at a 1:1 soil-to-water ratio (EPA 9045D); organic carbon (OC) (NTC 5403); cation exchange capacity (CEC) extracted in 1.0 N ammonium acetate solution (pH 7) (NTC 5268); total nitrogen (NTC 5889); available phosphorus determined by the Bray II colorimetric method (NTC 5350); available sulfur determined by extraction with calcium monophosphate and turbidimetric determination (IGAC, 2003); exchangeable bases determined in 1.0 M ammonium acetate solution (pH 7) (NTC 5349); and exchangeable aluminum and micronutrients by ICP-MS. The resulting database contains the following variables: sand, silt, clay, pH, effective cation exchange capacity (CEC), organic matter (OM), organic carbon (OC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), exchangeable calcium (Ca), exchangeable magnesium (Mg), available sulfur (S), available boron (B), zinc (Zn), manganese (Mn), iron (Fe), copper (Cu), exchangeable sodium (Na), exchangeable aluminum (Al), base saturation (BS) (Eq. 1), and K, Ca, Mg, Na, and Al saturations (Eq. 2).

$$\text{Base Saturation}\,(\%) = \frac{\text{Sum of Base Cations}\,(Ca + Mg + K + Na)\,(meq/100\,g)}{CEC\,(meq/100\,g)} \times 100$$
(1)
$$\text{Cation Saturation}\,(\%) = \frac{\text{Cation}\,(meq/100\,g)}{CEC\,(meq/100\,g)} \times 100$$
(2)
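As a minimal sketch of Eqs. 1 and 2, the snippet below computes base saturation and a single-cation saturation for one sample. The function names and the sample values are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the study database.

```python
def saturation_percent(cation_meq, cec_meq):
    """Eq. 2: saturation (%) of a cation, both arguments in meq/100 g."""
    if cec_meq <= 0:
        raise ValueError("CEC must be positive")
    return cation_meq / cec_meq * 100.0


def base_saturation_percent(ca, mg, k, na, cec):
    """Eq. 1: saturation (%) of the summed base cations Ca + Mg + K + Na."""
    return saturation_percent(ca + mg + k + na, cec)


# Hypothetical sample in meq/100 g (illustrative values only).
sample = {"Ca": 4.0, "Mg": 1.0, "K": 0.5, "Na": 0.1, "CEC": 14.0}

bs = base_saturation_percent(
    sample["Ca"], sample["Mg"], sample["K"], sample["Na"], sample["CEC"]
)
ca_sat = saturation_percent(sample["Ca"], sample["CEC"])
print(f"Base saturation: {bs:.1f}%  Ca saturation: {ca_sat:.1f}%")
```

Because Eq. 1 is Eq. 2 applied to the summed bases, the four individual base saturations add up to the base saturation of the sample.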

Exploratory data analysis

This stage consisted of managing the soil database through exploratory analysis and cleaning of the collected information, removing records containing zero values and delimiting outliers, following a methodology used in a previous data science analysis of temporal and spatial data21. In addition, values derived from the basic elements, such as soil cation ratios (Ca/K, Ca/Mg, K/Mg, (Ca + Mg)/K) and exchangeable sodium percentage (ESP), were calculated. The general agricultural aptitude of the soil for potato production was then determined from the individual element aptitude values on a scale of high, medium and low (Table S2), by comparing these values with reference levels22,23,24,25,26.

A synthetic data generation process was carried out as a data science strategy based on an autoencoder model designed for dimension reduction and anomaly detection in scaled data, following the Adversarial Autoencoder (AAE) approach27. The model consists of an encoding layer that compresses the features to half their original size and a decoding layer that attempts to reconstruct the input. It was trained by minimizing the mean squared error (MSE) between the input (variables) and the output (autoencoder reconstruction) using the Adam optimizer27 with a learning rate of 0.0001. Once the autoencoder was trained, it was used to calculate the reconstruction MSE for each sample, and the threshold was set as the mean error plus the standard deviation27. Any sample whose reconstruction error exceeds this threshold is considered a potential outlier and is removed from the dataset, generating a clean dataframe with the remaining observations.
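The thresholding step described above can be sketched as follows, assuming the reconstructions have already been produced by a trained autoencoder. The helper names, the toy arrays and the use of the population standard deviation are assumptions made for illustration; the study's own implementation may differ.

```python
from statistics import mean, pstdev


def reconstruction_mse(original, reconstructed):
    """Per-sample mean squared error between input rows and reconstructions."""
    return [
        mean((a - b) ** 2 for a, b in zip(row, rec))
        for row, rec in zip(original, reconstructed)
    ]


def flag_outliers(errors):
    """Threshold = mean error + standard deviation (population SD assumed)."""
    threshold = mean(errors) + pstdev(errors)
    return threshold, [e > threshold for e in errors]


# Toy scaled data and hypothetical autoencoder reconstructions.
X = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8], [0.3, 0.4]]
X_hat = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8], [0.8, 0.9]]  # last row poorly rebuilt

errors = reconstruction_mse(X, X_hat)
threshold, is_outlier = flag_outliers(errors)

# Keep only the samples whose error does not exceed the threshold.
clean = [row for row, flagged in zip(X, is_outlier) if not flagged]
print(f"threshold={threshold:.4f}, kept {len(clean)} of {len(X)} samples")
```

A mean-plus-one-standard-deviation rule is fairly strict, so it tends to flag a larger share of samples than the more common two or three standard deviations.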

Determination of edaphic homologous zones in potato production systems in Colombia and edaphic characterization of the clusters

This process started with the assembly of a deep autoencoder27, designed specifically to reduce the dimensionality of the data and extract the most relevant features. The model consisted of several dense (fully connected) layers with the ReLU (Rectified Linear Unit) activation function, to ensure a non-linear representation of the data. The encoder layers progressively reduced the dimensionality of the input features, from 128 to 64, then to 32 and 16, until reaching the maximum compression layer, where the dimensionality was reduced to 8. During this process, batch normalization (BN) was applied to stabilize the training, and dropout with a rate of 20% was applied to prevent overfitting29. The final encoder layer transforms the data into a compressed representation that preserves the essential features.

The autoencoder decoder consisted of layers that gradually increased the dimensionality back to the original size, using ReLU activations and BN on each layer. The final output of the decoder used a linear activation to ensure that the reconstructed values were in the same range as the input data. The model was trained using the Adam optimizer with a learning rate of 0.0001 and the MSE loss function, allowing the autoencoder to learn to reconstruct the inputs as closely as possible. The K-means model was then trained on the output of the autoencoder with the scikit-learn library30. Homologous edaphic zones at meso scale (region) were determined by spatial clustering with the K-means algorithm, which classifies and groups samples into different clusters or subsets based on distance, so that all samples in the same cluster have relatively similar properties13,28. This algorithm was selected over other unsupervised methods because of its ability to minimize intra-group variability and generate stable and comparable clusters in multivariate data sets.

The selection of the optimal number of clusters was carried out through a systematic exploration of different combinations of hyperparameters, generating variations in the number of clusters (from 2 to 5) and the number of initializations of the algorithm (from 5 to 50, every 5). Each configuration was evaluated through internal validation metrics widely used in the literature: average silhouette coefficient (ratio scale distances, such as Euclidean distance, are used when seeking compact and clearly separated clusters)31, Calinski-Harabasz index (it is constructed on the basis of the nearest neighbor and divided into groups using the minimum sum of squares criterion within the cluster)32 and entropy (reflects the average uncertainty based on probability)33 (Fig. 3a). These metrics made it possible to quantify the quality of segmentation based on intra-cluster cohesion and inter-cluster separation. The results obtained were stored and the best configuration was selected based on the maximum value of the silhouette coefficient, thus ensuring optimal clustering of the data. The statistical method t-SNE (t-distributed stochastic neighbor embedding) was used for visualization of clustering results34.
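The hyperparameter exploration described above can be sketched with scikit-learn as follows. The synthetic two-dimensional data stand in for the encoded features, the function name `select_kmeans` and the random seed are assumptions, and the entropy metric is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def select_kmeans(features, k_values=range(2, 6),
                  n_init_values=range(5, 51, 5), seed=0):
    """Fit K-means for every (k, n_init) pair and keep the best silhouette."""
    results = []
    for k in k_values:
        for n_init in n_init_values:
            model = KMeans(n_clusters=k, n_init=n_init, random_state=seed)
            labels = model.fit_predict(features)
            results.append({
                "k": k,
                "n_init": n_init,
                "silhouette": silhouette_score(features, labels),
                "calinski_harabasz": calinski_harabasz_score(features, labels),
            })
    best = max(results, key=lambda r: r["silhouette"])
    return best, results


# Synthetic stand-in for the encoded soil features: three separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
encoded = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

best, results = select_kmeans(encoded)
print(best["k"], best["n_init"], round(best["silhouette"], 3))
```

Selecting on the silhouette alone, as in the text, keeps the rule simple; the other metrics stored in `results` can still be inspected to check that they point in the same direction.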

In addition, the variability of soil properties for edaphic characterization and their suitability values for potato cultivation were analyzed by descriptive statistics to obtain frequency values, average, standard deviation, minimum and maximum values and coefficient of variation (CV). Additionally, charts such as ternary plots and violin plots were used to visualize the distribution of the values of each variable by cluster. The variability of the data was classified based on the CV, which indicates how much soil properties vary in relation to the average value and allows comparison of variability within clusters: a CV < 10% indicated weak variability, a CV between 10 and 100% moderate variability, and a CV > 100% high variability, as in the study by20; in the last case the dispersion of the soil property values is greater than the mean, as shown in Fig. 2. The relative frequency of aptitude values per soil cluster was also calculated for each soil property in order to establish site-specific management practices such as selection of areas for cultivation or fertilization practices.
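The CV classification can be illustrated with a short sketch. The function names, the use of the sample standard deviation and the toy values are assumptions; only the 10% and 100% cut-offs come from the text.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV (%) = standard deviation / mean * 100 (sample SD assumed)."""
    avg = mean(values)
    if avg == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / avg * 100.0


def classify_cv(cv):
    """Classes used in the text: <10% weak, 10-100% moderate, >100% high."""
    if cv < 10:
        return "weak"
    if cv <= 100:
        return "moderate"
    return "high"


# Hypothetical values for two soil properties within one cluster.
ph_values = [5.0, 5.2, 5.1, 4.9, 5.3]
al_values = [0.0, 0.1, 0.0, 2.5, 0.1]

ph_class = classify_cv(coefficient_of_variation(ph_values))
al_class = classify_cv(coefficient_of_variation(al_values))
print(ph_class, al_class)
```

A property with many near-zero values and a few large ones, like the toy aluminum series here, reaches a CV above 100% because its standard deviation exceeds its mean.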

Fig. 2

Overview of the database management and preprocessing workflow. (a) Heatmap indicating the presence and distribution of zero values across the dataset, used to assess data sparsity and potential quality issues. (b) Bar plots of soil variable values prior to data cleaning, revealing outliers and irregularities. (c) Boxplot illustrating the cleaned dataset, with improved data consistency and reduced variability after preprocessing. (d) Histogram of reconstruction errors, representing the accuracy and reliability of the data imputation and reconstruction step.

Approximation to productivity based on edaphic variables and implementation of elements for sustainable management using digital tools

This stage aims to implement, analyze and interpret a modeling process for the potato yield dataset considering a basic analysis by soil clusters and climatic clusters (the last of these is not described in detail as it is part of a manuscript in progress) from three productivity modeling approaches.

H2O AutoML35 was used to perform the predictive modeling of potato yield for clusters and soil aptitude, separately. The dataset was divided into training (70%) and validation (30%) sets, ensuring a balance between the classes. AutoML was set up with a maximum training time of 2000 s, employing fivefold cross-validation (nfolds = 5) and restricting the models to gradient boosting machines (GBM), deep learning using a multilayer perceptron (MLP), generalized linear models (GLM) and distributed random forest (DRF). Once the training was completed, the best-performing model was selected according to classification metrics (accuracy (AC), precision (PR), recall (RE), F1 score (F1) and Matthews Correlation Coefficient (MCC)), and was used both to classify samples into the previously identified clusters and to identify variables of importance. The process was replicated for soil aptitude variables. Additionally, a Kruskal–Wallis test36 was performed to evaluate whether there were significant differences between algorithms in each classification metric for predicting cluster and soil aptitude, respectively.

For the spatialization process, edaphic data were downloaded and processed from the SoilGrids system37, delineating an area of interest based on the geographic limits of the municipalities evaluated, which was projected to the EPSG:3857 coordinate system for subsequent spatial analysis. Within these limits, the variables extracted from SoilGrids were bulk density (bdod_mean), cation exchange capacity (cec_mean), volumetric fraction of coarse fragments (cfvo_mean), clay content (clay_mean), nitrogen (nitrogen_mean), organic carbon density (ocd_mean), soil organic carbon stock (ocs_mean), pH in water (phh2o_mean), sand (sand_mean), silt (silt_mean) and soil organic carbon content (soc_mean). Data were downloaded in GeoTIFF format (0–5 cm, 5–15 cm and 15–30 cm depths) and stored in a local directory.

Subsequently, the reference system was transformed to EPSG:4326 to ensure correct hexagon generation with the H3 library38. Then, the municipalities were merged into a single entity to generate a single layer, on which a hexagonal grid of level 7 (approximately 516.1 ha per hexagon) was created using h3pandas39, from which zonal statistics of the variables derived from SoilGrids37 were produced within the generated polygons. The mean values per zone were calculated and the points were associated with the polygons to determine the mode of the cluster variable, eliminating those cells without data. The data generated from the hexagonal grid and the zonal statistics were used for spatial processing and construction of the predictive model for the cluster as the target variable using H2O AutoML35, under the parameters previously described, selecting the model with the best performance in the classification of the modal cluster per polygon, to identify variables of importance and zones associated with each cluster.
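The assignment of a modal cluster to each hexagon can be sketched in plain Python, assuming each sampled plot has already been matched to the identifier of the hexagon that contains it. The cell identifiers and the function name are hypothetical, and the tie-breaking rule (first label encountered) is an assumption, not something stated in the text.

```python
from collections import Counter, defaultdict


def modal_cluster_per_cell(points):
    """Map each cell id to its most frequent cluster label.

    `points` is an iterable of (cell_id, cluster_label) pairs. Cells with no
    points never appear in the result, which mirrors dropping empty cells.
    Ties are resolved in favor of the label seen first within the cell.
    """
    labels_by_cell = defaultdict(list)
    for cell_id, cluster in points:
        labels_by_cell[cell_id].append(cluster)
    return {
        cell_id: Counter(labels).most_common(1)[0][0]
        for cell_id, labels in labels_by_cell.items()
    }


# Hypothetical plots already assigned to hexagons (ids are placeholders).
plots = [
    ("hex_a", 1), ("hex_a", 1), ("hex_a", 3),
    ("hex_b", 2),
    ("hex_c", 3), ("hex_c", 2), ("hex_c", 3),
]

modal = modal_cluster_per_cell(plots)
print(modal)
```

The resulting dictionary can then be joined to the per-hexagon zonal statistics so that each row carries both the predictors and the target cluster.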

A genetic algorithm was adapted to optimize the combination of minimum soil values of plant nutrients (N, P, K, Ca, Mg) that allows for maximum yield, based on the research conducted by40, using Python in the Colab environment with the Pandas, Numpy, Random and Math libraries. This algorithm uses a hybrid approach combining roulette selection (prioritizing individuals with higher performance), genetic operators based on Euclidean distance search with a neighbor-based crossover and mutation method, and dynamic optimization that automatically adjusts the mutation rate and number of generations. It also evaluates fitness by comparing plant nutrients in soil with a tolerance of ± 0.7 units and generates personalized recommendations using an evolutionary loop that retains the best individuals while exploring new combinations. The population size was set equal to the length of the dataset, the number of generations was set to 10, the neighborhood exploration was set to 5, the mutation probability was set to 0.01 and the crossover probability to 0.8.
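A generic evolutionary loop with roulette selection and elitism can be sketched as below. This is not the adapted algorithm of the study: the one-point crossover, the Gaussian mutation, the toy fitness function and the target nutrient vector are all assumptions for illustration, and the neighbor-based operators and dynamic rate adjustment are left out. Only the number of generations (10), the crossover probability (0.8) and the mutation probability (0.01) come from the text.

```python
import math
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def crossover(a, b, rng):
    """One-point crossover (assumed operator, not the study's)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(individual, rng, rate, scale=0.1):
    """Perturb each gene with probability `rate`, keeping values >= 0."""
    return tuple(
        max(0.0, g + rng.gauss(0, scale)) if rng.random() < rate else g
        for g in individual
    )


def evolve(population, fitness, generations=10,
           crossover_prob=0.8, mutation_prob=0.01, seed=0):
    rng = random.Random(seed)
    population = list(population)
    history = []
    for _ in range(generations):
        fits = [fitness(ind) for ind in population]
        history.append(max(fits))
        elite = population[fits.index(max(fits))]
        children = [elite]  # elitism: the best individual always survives
        while len(children) < len(population):
            p1 = roulette_select(population, fits, rng)
            p2 = roulette_select(population, fits, rng)
            child = crossover(p1, p2, rng) if rng.random() < crossover_prob else p1
            children.append(mutate(child, rng, mutation_prob))
        population = children
    fits = [fitness(ind) for ind in population]
    history.append(max(fits))
    return population[fits.index(max(fits))], history


# Hypothetical target levels for (N, P, K, Ca, Mg); illustrative only.
TARGET = (0.5, 40.0, 0.6, 5.0, 1.2)


def toy_fitness(individual):
    """Positive score that grows as the candidate approaches TARGET."""
    return 1.0 / (1.0 + math.dist(individual, TARGET))


init_rng = random.Random(1)
start = [tuple(init_rng.uniform(0, 2) * t for t in TARGET) for _ in range(30)]
best, history = evolve(start, toy_fitness)
print(best, round(history[-1], 4))
```

Because the elite is copied unchanged into every new generation, the best fitness recorded in `history` can never decrease from one generation to the next.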

The information generated by the genetic algorithm was included in the second version of the web platform ‘SOLANA’, developed by FEDEPAPA-FNFP and Universidad Nacional de Colombia, Laboratorio de Agrocomputacion y Analisis Epidemiologico de la Facultad de Ciencias Agrarias sede Bogotá (https://go2cloud-fedepapaiaas.web.app/modelation-app/) for the use of producers and extensionists within the framework of the decision support system for potato production systems in Colombia.

Results

Exploratory data analysis

Database management showed that Al saturation (%), followed by Cu (ppm) and Na saturation (%), had a considerable number of values equal or close to zero (Fig. 2a). The distribution of values showed that some variables, such as N, K, B and Cu, contained outliers (Fig. 2b). On this basis, together with the delimitation of ranges, a total of 162 samples were eliminated, mainly associated with elevated values of P, Fe and Ca/Mg ratio (Fig. 2c). However, we decided not to eliminate some of these values because they may occur under typical natural conditions in the soils of the study area, and their elimination could reduce the accuracy of subsequent analyses and their representativeness of actual soil conditions. From the autoencoder model used for cleaning, a threshold value of 0.008 was generated (Fig. 2d), and samples with a reconstruction error above this value were removed. Finally, a spatial cleaning process was applied based on the coordinates, eliminating 11 more samples and yielding a final database with 2867 records distributed in the main potato-producing areas of Colombia.

Based on this database management, we identified that the properties of potato production soils include average percentages of sand and silt close to 40%. As a result, the predominant textural classes are silty loam (31.32%), followed by sandy loam (23.79%), loam (18.59%) and, to a lesser extent, clay loam (10.85%). Soil pH is heterogeneous, with a range between 3.72 and 7.54, indicating the presence of both acidic and alkaline soils. The OM and OC levels are higher than 6% in both cases, while N is 0.55% on average. P and Fe variables stand out with values higher than 500 ppm in some cases. Other variables with high average levels are S (14.23 ppm), Mn (31.33 ppm) and Ca saturation (40.43%), while average values of Na and Al are less than 2 meq/100 g. Elements such as K, Mg and B present lower average values between 0.5 and 1. From the individual aptitudes of the elements and the calculation of the general aptitude, the soils are classified as high (59.19%) and medium (40.81%) aptitude for potato growing.

Determination of edaphic homologous zones in potato production systems in Colombia and edaphic characterization of the clusters

Clustering analysis

The analysis shows that the best configuration corresponds to three clusters, which gave adequate compactness and separation between the groups. From the application of the K-means algorithm, we obtained an average silhouette index of 0.376, a Calinski-Harabasz index value of 1674.8 and an entropy of 1.0798 (Fig. 3a). The t-SNE result shows a clear differentiation of the clusters, although some samples that are difficult to separate are identified between Clusters 1 and 2 and between Clusters 1 and 3 (Fig. 3b). We analyzed the spatial distribution of the soil samples in the database grouped into clusters for each department (Fig. 3c) and their frequency in a 100% stacked bar diagram (Fig. 3d). Although the distribution of the clusters does not follow a definite pattern, Cluster 1 was more frequent in the departments of Boyacá (n = 300), Cauca (n = 65) and Santander (n = 48), while Cluster 2 was more frequent in Cundinamarca (n = 304) and Norte de Santander (n = 17), and Cluster 3 was more frequent in Antioquia (n = 81) and Nariño (n = 483).

Fig. 3

Clustering analysis of soil data across the study area. (a) Determination of the optimal number of clusters based on average silhouette score metrics, ensuring robust group separation. (b) t-SNE projection displaying the cluster assignments in a reduced-dimensional space, facilitating visual validation of cluster structure. (c) Geographical mapping of the identified soil clusters across the sampled plots, showing their spatial distribution. (d) Histogram representing the frequency of plots grouped by cluster for each department, providing insight into regional soil classification patterns.

Soil texture analysis

The analysis of the physical properties of the soil shows that the textural class with the highest frequency is silty loam for Clusters 1 and 3, followed by loam and clay loam (Fig. 4a), with very similar distribution patterns of the percentages of silt, sand and clay (Fig. 4b), while Cluster 2 presented a clearly different pattern of distribution, with the highest frequency of sandy loam followed by sandy clay loam and clay loam.

Fig. 4

Soil texture distribution across identified clusters. (a) Bar chart showing the frequency of soil texture classes within each soil cluster, highlighting predominant textural compositions. (b) Ternary diagram illustrating the proportions of sand, silt, and clay in the samples, categorized by texture class, providing a visual representation of the variability and overlap among soil types within the study area.

Chemical properties

Figure 5 shows the distribution of data values and Table S3 shows the variability associated with the CV. The analysis of the chemical properties shows high variability, in the following order: Cluster 2 > Cluster 3 > Cluster 1. Soil pH showed low variability for all clusters, while Cu, Fe, Na and Al and their respective saturations showed high variability with CV values greater than 100%, indicating spatial heterogeneity of the soils studied for each cluster. The other properties studied showed moderate variability.

Fig. 5

Edaphic characterization of the study area based on the distribution of soil chemical element concentrations across the defined clusters. The figure illustrates variability in soil fertility parameters, enabling comparative analysis of nutrient profiles among clusters and supporting the identification of functional soil units for site-specific management.

Overall, pH presented average values of 5.1, which is considered strongly acidic for soils. For elements such as B, Cu, S, N, K, Mg, Zn and Na saturation, very low average values close to zero are recorded for all clusters, with no relevant difference between clusters in their average values and their distribution. For Zn, Mn, Na saturation, ESP and for exchangeable bases Ca, Mg and K the average values and CV are similar for all the clusters, but there is a downward trend from Cluster 3 to Cluster 1 (Table S3).

Cluster 1 presents values with low ranges for pH, CEC, K, Ca, and Mg, while for Al saturation, OM, OC and N it has greater dispersion than the other clusters. Cluster 2 presents a low dispersion and lower average values of S, B, Fe, Cu and K saturation, while Na presents a wider range of dispersion compared with other clusters. Cluster 3 presents values with a lower dispersion of Al, while Ca and Mg saturation present the highest dispersion values in a wider range. The distribution of P was on average five times higher in Cluster 3 than Cluster 2 and 14 times higher than Cluster 1 (Fig. 5).

Aptitude classification

For aptitude values analysis, we observed that the highest frequencies for high aptitude are found in the variables OC, P and S for Clusters 1 and 3, N for Cluster 1, K and Al saturation for Clusters 2 and 3, Zn for Cluster 3, and Na saturation for all three clusters (Fig. 6). The highest frequencies for medium aptitude are recorded for Ca in Cluster 3, B for Clusters 1 and 3, and soil texture for all clusters. The highest frequencies for low aptitude are found for pH, CEC, Ca, Mg and Cu for Cluster 1, and S, B, Cu and base saturation for Cluster 2. Based on the total calculation, 59% of the soils presented a high aptitude for potato crop and 41% presented a medium aptitude.

Fig. 6

Relative frequency distribution of soil property aptitude classes across the identified soil clusters. The figure illustrates how key edaphic characteristics vary in terms of suitability for agricultural use, offering insights into the functional performance and potential limitations of each cluster based on standardized aptitude classifications.

Approximation to productivity based on edaphic variables and implementation of elements for sustainable management using digital tools

Yield analysis

The analysis of potato yield distribution (t ha−1) allowed the identification of important differences between soil clusters and climatic clusters (Fig. 7). At the individual cluster level, Soil Cluster 3 presents the highest average yield of 33.81 t ha−1 (Fig. 7a) and Climatic Cluster 4 obtains the highest average yield of 33.86 t ha−1 (Fig. 7b). When the two are combined, Soil Cluster 2 together with Climatic Cluster 4 presents the highest average yield, at 35.27 t ha−1, and the largest distribution range (Fig. 7c), while Climatic Cluster 0 together with Soil Cluster 1 presents the lowest values, which are below 29 t ha−1.

Fig. 7

Distribution of potato yield (t ha−1) across different clustering dimensions. (a) Yield distribution and mean values for each soil cluster, highlighting the influence of edaphic conditions on productivity. (b) Yield variability across climatic clusters, reflecting the impact of agroclimatic conditions. (c) Combined effect of soil and climatic clustering on yield performance, demonstrating the potential of integrated cluster-based analysis to explain spatial yield patterns and support targeted agronomic interventions.

Spatial modeling of soil clusters

All the algorithms evaluated had a satisfactory performance for classification at cluster and aptitude levels (Figure S1). The results of the Kruskal–Wallis tests show significant differences between the algorithms for both classification tasks (clustering and aptitude). In all cases, p-values were less than 0.05, indicating that at least one algorithm performs significantly differently in terms of AC, PR, RE, F1 and MCC.

For cluster classification, the MLP model achieved the best results (Fig. 8a), with an AC of 0.9791 and similarly high PR, RE, and F1 scores, together with an MCC of 0.9683 (Table S4), indicating a good differentiation between soil clusters. Key variables influencing this model included exchangeable aluminum, calcium saturation, aluminum saturation (percentage of the total CEC), and exchangeable calcium (Fig. 8c), highlighting the importance of cation exchange properties in soil grouping.

Fig. 8

Variables of importance based on predictive modeling of potato yield with AutoML. (a) Relative importance scores of input variables observed across multiple model iterations, indicating consistent contributors to predictive performance. (b) Aggregated importance values for the final selected models, highlighting the most influential features in explaining yield variability within the study regions.

In contrast, for aptitude classification, the GBM model performed best (Fig. 8b), achieving AC of 0.9477, with PR, RE, and F1 scores close to this value, but a lower MCC of 0.8919 (Table S4). This reflects greater confusion between classes due to overlapping characteristics. The most important predictors here were Zn, Ca, Mg, and B (Fig. 8d), emphasizing the role of micronutrient availability in determining soil suitability.

Overall, these results demonstrate that while both models perform well, cluster classification benefits from clearer distinctions in soil chemical properties, whereas aptitude classification is more challenging due to subtle differences in micronutrient profiles, as reflected in the slightly lower MCC values. This comprehensive evaluation using multiple metrics ensures a reliable understanding of model strengths and limitations in soil classification tasks.

For the spatialization process, the GBM model was trained to assign spatial units to soil clusters (Fig. 9a). This model achieved a high AC of 0.988, indicating excellent fit to the training data. To assess its performance, the model was tested by comparing its predicted cluster assignments against spatial polygons generated using the h3 library, which segments the study area into discrete spatial units. In this testing phase, the model achieved an overall AC of 0.838, demonstrating good predictive capability on unseen data. PR values were 0.88 for Cluster 1, 0.83 for Cluster 2, and 0.82 for Cluster 3, reflecting the model’s ability to correctly identify members of each cluster without many false positives. RE was 0.73, 0.76, and 0.94 for Clusters 1, 2, and 3 respectively. Cluster 3 showed the strongest balance between PR and RE, with an F1 score of 0.88, while Clusters 1 and 2 had F1 scores of 0.80 each. Despite generally accurate predictions, some confusion occurred, primarily between Clusters 1 and 2 and between Clusters 1 and 3, indicating areas where the model’s differentiation could be improved.
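The per-cluster F1 scores follow from the reported PR and RE values as their harmonic mean. The short check below recomputes them. Because the inputs are already rounded to two decimals, the recomputed values may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (PR, RE) per cluster, as reported for the testing phase.
reported = {1: (0.88, 0.73), 2: (0.83, 0.76), 3: (0.82, 0.94)}

f1_by_cluster = {c: f1_score(p, r) for c, (p, r) in reported.items()}
for cluster, f1 in f1_by_cluster.items():
    print(f"Cluster {cluster}: F1 = {f1:.3f}")
```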

Fig. 9

Results of the spatial modeling and prediction of soil clusters. (a) Confusion matrix evaluating the performance of the spatial classification model, indicating accuracy and misclassification rates across cluster classes. (b) Relative importance of the environmental and edaphic variables used in the spatial prediction process. (c) Spatial distribution map of the predicted cluster assignments across departments, highlighting regional differentiation and supporting geographic extrapolation of cluster-based insights.

The identification of variables of importance (Fig. 9b) revealed that the most influential predictors for the GBM were soc_mean (soil organic carbon), silt_mean, sand_mean, and phh2o_mean (soil pH), each with relative importance above 0.10. These were followed by ocd_mean (organic carbon density) and clay_mean. Variables such as nitrogen_mean, cec_mean (cation exchange capacity), and cfvo_mean had moderate influence, while bdod_mean (bulk density) and ocs_mean (organic carbon stock) had a lower impact on the model’s predictions. Finally, spatial analysis of cluster distributions across departments (Fig. 9c) showed that Cluster 1 covered the largest areas, with 467,541.6 hectares in Nariño, 321,417.1 hectares in Boyacá, and 124,386.6 hectares in Cauca, indicating large, homogeneous soil regions. Cluster 2 had smaller total coverage but was notably present in Boyacá (106,645.4 ha) and Nariño (77,313 ha). Cluster 3 had the smallest spatial extent in most departments, ranging from 12,054.2 hectares in Santander to 106,517.5 hectares in Cundinamarca (Table S5). Despite these differences in total area, the average cluster size within each department remained consistent, ranging from 521.9 to 581.6 hectares, which suggests a relatively uniform spatial segmentation of soil clusters across the study area.

Nutrient recommendation system

The nutrient recommendation system is a tool that determines the amount of fertilizer to be applied to soils in different management zones according to the requirements of the potato crop. Its outputs show individual aptitude values for the given physical and chemical properties, cation ratio values, a soil classification in terms of salinity, and a general aptitude value derived from the individual aptitudes (Fig. 10a). The doses are obtained with a genetic algorithm, which simulates natural selection: a population of nutrient combinations evolves iteratively through operations such as crossover and mutation, and the best-performing solutions are selected over generations to find the lowest nutrient content that allows for maximum yield40. The output of the genetic algorithm is the predicted potential yield (t ha−1) for the optimized combination of nutrient contents. From these values, the doses (kg ha−1) for each of the evaluated elements are calculated (Fig. 10b), which supports the user’s decision on fertilization management. The proposed main window of the platform integrates different submodules (Fig. 10c), among which the “Nutrient Recommendation System” module shows the results of the algorithm with informative charts and tables (Fig. 10b). No further details about the platform are provided because it is the subject of another publication currently in progress.
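The general mechanics of such a genetic algorithm can be sketched as follows. This is an illustrative toy and not the platform's implementation: the nutrient bounds, the saturating yield response, the penalty weight and the selection scheme (truncation selection with elitism) are all assumptions made for the example, and the objective of low nutrient content at maximum yield is approximated by a single score that rewards yield and penalizes the total dose.

```python
import random

# Hypothetical dose bounds in kg ha-1 (illustrative, not the study's values).
BOUNDS = {"N": (0.0, 300.0), "P": (0.0, 200.0), "K": (0.0, 300.0)}
NUTRIENTS = list(BOUNDS)

def predicted_yield(dose):
    """Toy saturating response (t ha-1); stands in for a trained yield model."""
    y = 45.0
    for name in NUTRIENTS:
        high = BOUNDS[name][1]
        y *= 1.0 - 0.5 * (1.0 - dose[name] / high) ** 2  # diminishing returns
    return y

def fitness(dose, penalty=0.01):
    """Reward predicted yield and penalize the total nutrient applied."""
    return predicted_yield(dose) - penalty * sum(dose.values())

def random_dose(rng):
    return {n: rng.uniform(*BOUNDS[n]) for n in NUTRIENTS}

def crossover(a, b, rng):
    """Uniform crossover: each nutrient is inherited from one parent."""
    return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in NUTRIENTS}

def mutate(dose, rng, rate=0.2, scale=0.1):
    """Gaussian mutation, clipped to the allowed range."""
    out = dict(dose)
    for n in NUTRIENTS:
        if rng.random() < rate:
            low, high = BOUNDS[n]
            step = rng.gauss(0.0, scale * (high - low))
            out[n] = min(high, max(low, out[n] + step))
    return out

def run_ga(pop_size=40, generations=60, elite=4, seed=0):
    rng = random.Random(seed)
    population = [random_dose(rng) for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = population[:elite] + children  # elitism keeps the best
    return max(population, key=fitness), history

best_dose, best_history = run_ga()
print({n: round(v, 1) for n, v in best_dose.items()},
      round(predicted_yield(best_dose), 2))
```

Because the best individuals are carried over unchanged, the best score never decreases between generations. In a real application the toy response would be replaced by a yield model fitted to field data.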

Fig. 10

Visualization outputs from the nutrient recommendation system. (a) Example of soil nutrient aptitude classification by element, indicating suitability levels for each parameter. (b) Recommended fertilization doses (kg ha−1) generated using a genetic algorithm optimization model. (c) User interface of the platform’s home page. (d) Proposed frontend design for the nutrient recommendation module, showing an interactive layout for user-driven decision support.

Discussion

This study is a first approach to identifying homologous edaphic zones in the potato-producing areas of Colombia. It integrates concepts from studies in other fields to build a comprehensive soil characterization and to obtain information on soil variability as a basis for sustainable soil management.

Exploratory data analysis and data science tools were essential for assessing the quality of soil and spatial data21. Robust statistical criteria were applied, including the removal of outliers and missing values, as well as the filtering of data ranges that could distort the analysis without compromising regional representativeness. This enabled the identification of consistent patterns and trends, such as elevated levels of P and Fe, in line with previous findings6. These tools streamlined data organization and processing, improving the reliability of subsequent analyses41. The use of the K-Means algorithm further facilitated the classification of the study area into three edaphic zones with similar characteristics. This zoning was key to understanding spatial variability in soil properties, supporting targeted nutrient management, enabling more objective evaluation of fertility, and providing a basis for strategies to prevent soil degradation. Such approaches enhance input efficiency and promote sustainable potato production in Colombia42.
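For readers less familiar with the method, the sketch below shows the two alternating steps of K-means (assignment of each sample to the nearest centroid, then centroid update) on hypothetical standardized values of two soil variables. The data and the deterministic farthest-point initialization are assumptions chosen to keep the example reproducible; they do not describe the configuration used in this study.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic start: first point, then the point farthest from all chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every sample.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids

# Hypothetical standardized samples (e.g. pH, Al saturation) in three groups.
offsets = [(-0.2, -0.2), (0.2, -0.2), (-0.2, 0.2), (0.2, 0.2), (0.0, 0.0)]
centers = [(-2.0, -2.0), (0.0, 2.0), (2.0, -1.0)]
samples = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

labels, centroids = kmeans(samples, k=3)
print(labels)
```

Variables measured on different scales would normally be standardized before clustering, since K-means relies on Euclidean distance.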

The analysis of soil physical properties indicates that Cluster 2 shows the highest frequency of high suitability due to greater clay content. However, it also includes low-suitability areas linked to sandy and loamy sand textures, unlike Clusters 1 and 3, which present more stable textures and medium suitability. Texture is a key factor in land evaluation, as it influences topsoil functionality and water retention43,44. Regarding chemical properties, lower pH in Cluster 1 correlates with higher aluminum saturation, while Clusters 2 and 3 show a broader pH range, from moderately acidic to moderately alkaline8. The low coefficient of variation for pH suggests homogeneous soil conditions, likely due to limited topographic variability11. In contrast, high variability in nutrients such as P, Fe, and CEC reflects the influence of soil management and parent material45. Notably, all clusters exhibit elevated P and CEC levels15, with Cluster 1 showing high Fe and Al concentrations. This highlights the need to consider the integration of soil orders in future analyses. Finally, Cu, Mg, and CEC are limiting factors for potato cultivation, while optimal levels of OC, Na saturation, and Al saturation suggest potential for targeted soil management practices such as liming and micronutrient applications.

Despite the variations in the spatialization process, the average number of polygons per department showed homogeneity, indicating a functional structure adaptable to geographic decision-making platforms. This classification identifies differentiated spatial patterns that might be linked to other variables not included here, such as land use, environmental conditions, or territorial dynamics43, enabling soil macro-characterization and spatial grouping of potato-producing areas. It facilitates the analysis of variations in agricultural aptitude based on thresholds and cluster data, with the resulting maps serving as key tools for farmers, authorities, and extension workers in landscape management strategies (e.g., precise fertilizer application)43, increasing productivity, reducing costs, minimizing soil degradation, and improving agricultural profitability8,11. These maps are also valuable for environmental analysis, differentiated management in agricultural contexts, and multi-purpose evaluation for decision-making11,44,45.

The integration of AutoML tools enabled the development of accurate models to classify edaphic clusters and soil aptitude for potato cropping. The MLP model showed the best performance for cluster prediction (AC > 0.94), while the GBM model was more effective for classifying aptitude categories. Algorithm performance metrics suggest that all models achieved reliable and comparable results, supporting the internal consistency of the clusters as functional soil units. Based on this approach, variable importance analysis revealed that exchangeable aluminum, calcium saturation, and zinc were the most influential predictors, consistent with their agronomic relevance and the chemical characterization of the clusters11,17. These variables can serve as key indicators for soil quality monitoring and for guiding management recommendations. In further studies, yield modeling through AutoML could be considered, integrating environmental variables such as soil dynamics, plant traits, climate, and fertilization in order to enhance decision-making and reduce soil degradation11,17.

The integration of edaphic, climatic, and potato yield data provides a comprehensive understanding of the production system, revealing that yield potential is not solely dependent on soil fertility, but on the interaction of multiple factors. Modeling these interactions enhances the ability to manage soil resources effectively47. Despite variability in soil chemical properties, average yield values across the identified soil clusters showed minimal differences. This suggests that no single soil property exerts a dominant influence, and that yield may be more strongly affected by interactions between factors such as organic matter, pH, and nutrient availability48. For instance, pH and aluminum levels can impact phosphorus availability after fertilization49. This analysis supports the implementation of differential and sustainable soil management strategies tailored to each cluster. While the goal of potato production is to maximize yield and farmer profitability, imbalanced or inefficient nutrient application degrades soil health, reduces long-term productivity, and increases dependence on costly fertilizers, highlighting the need for integrated soil fertility management17.

The genetic algorithm used in the study aimed to establish an adequate combination of element values for application in production systems, reducing conventional doses and the environmental impact generated by the overuse of chemical inputs while maintaining high yields. The element levels obtained from the genetic algorithm are similar to those of the conventional fertilization plan, a finding consistent with previously reported results40. The importance of this approach lies in the adequate supply of nutrients through site-specific management, which increases long-term agricultural yields and reduces the environmental risks caused by uneven application of fertilizers11,40. However, validation under experimental conditions is still needed, as is a larger data set covering different productive plots with yield values, soil type and location40.

It should be noted that this study has certain limitations. Sampling density, temporal variability, and possible biases associated with soil data may influence the accuracy and generalizability of the results obtained. However, the use of robust clustering and classification methodologies partially mitigated these effects, providing a reliable analytical framework for the delimitation of soil zones8,10,13,28. The findings are regional in scope and constitute a useful reference base for other crops; however, their extrapolation to different species or regions requires specific adjustments, as each crop has physiological and management characteristics that must be considered. Similarly, the data management system on the SOLANA platform is designed so that, as new data are generated, the algorithms can adjust the spatial analysis models and thereby increase the accuracy of the results.

Finally, this research provides a basis for the analysis of soil variability in potato-growing areas in Colombia, as an approach to generating soil management strategies that achieve substantial improvements in the characteristics associated with soil health in production systems, coupled with information from hydrological studies, soil and water conservation, physical and chemical soil quality, and natural risk and land degradation management44,50. In addition, agronomic practices such as tillage or irrigation significantly affect a variety of physical, chemical and biological properties of the soil, with effects that tend to increase over subsequent crop cycles, resulting in cumulative and long-lasting residual effects18. In this regard, temporal comparison is necessary, taking into account the analytical methods and the timing and location of sampling, and constant monitoring is required to augment the information collected and analyzed. Future work should move toward greater spatial coverage, the integration of broader time series, and complementary validation with field information, in order to strengthen the applicability and robustness of the results.

Conclusions

This study provides a framework for the spatial analysis of soil properties based on data science tools. Database management and cleaning yielded a reliable dataset of soil samples from Colombia’s main potato-producing regions. K-means analysis identified three edaphic clusters as the optimal configuration and revealed notable differences in soil properties both within and among clusters, which is useful for site-specific and differential management. For cluster classification, the MLP model achieved high accuracy and robust performance metrics, driven by variables related to Ca, Al and their saturations. For aptitude classification, the GBM model showed the best results, though with a slightly lower MCC due to overlapping nutrient profiles; the most important predictors were Zn, Ca, Mg, and B. For the spatial classification of edaphic clusters, the GBM model demonstrated robust performance, with soil organic carbon, texture, and pH as key predictors, highlighting their importance in spatial soil differentiation and reaffirming that soil physicochemical variability depends primarily on the specific content of these properties. Although some confusion occurred between similar clusters, the model provided consistent spatial segmentation across departments. It remains important to increase the number of sampling points, integrate other variables associated with geographical characteristics, and validate the model in similar areas to adjust its parameters. The nutrient recommendation system, based on a genetic algorithm, provides precise fertilizer doses by analyzing soil properties and optimizing nutrient combinations, but it still needs to be calibrated with experimental field studies under the conditions of the study area.

Finally, the main findings of this study were integrated into a digital platform that seeks to establish a solid basis for supporting decision-making in Colombia’s potato production sector, facilitating the development of site-specific management strategies.