Ancient Chinese glass heritage classification based on compositional data and machine learning

Tang, Pengxiang; Gan, Xiaoting; Tang, Jiade

doi:10.1038/s40494-026-02370-5

Article
Open access
Published: 27 February 2026

npj Heritage Science volume 14, Article number: 125 (2026)

Abstract

Ancient Chinese glass resembles foreign glass in appearance but differs in chemical composition, which is further altered by weathering, thereby complicating the classification of artefacts. Using the classification information and compositional data for a set of ancient glass samples, we applied compositional data analysis based on the centered log-ratio (CLR) transformation, combined with chi-square and Fisher’s exact tests, to investigate the relationships between surface weathering, glass type, emblazonry and color. Summary statistics, box plots, normality tests and two-sample t tests were used to compare chemical compositions before and after weathering and to estimate pre-weathering compositions from median ratios. Decision trees, logistic regression, support vector machines and random forests were then used to classify high-potassium and lead-barium glass, and ANOVA, significance tests and K-means clustering were used to divide each type into compositional sub-categories. The resulting models show robust classification performance and provide a reproducible, data-driven framework for the classification of ancient Chinese glass.

Introduction

The ancient Silk Road served as a crucial conduit for material and cultural exchange between China and the West, and glass artifacts were among its hallmark trade goods in the early stages. The earliest glass objects in China were introduced from Western Asia and Egypt, primarily in the form of glass beads¹. After assimilating foreign glass-making techniques, Chinese artisans began to produce glass locally from indigenous raw materials. Consequently, while domestically produced glass often resembled imported glass in appearance, its chemical composition differed markedly². The primary raw material of ancient Chinese glass was quartz sand (SiO₂), with limestone added as a stabilizer that, upon calcination, was converted into calcium oxide (CaO). Owing to the high melting point of pure quartz, various flux agents were required during the refining process to lower the melting temperature. In ancient China, common flux agents included lead ore, plant ash, natural natron, and saltpeter, each contributing distinct compositional characteristics. As a result, ancient Chinese glass can be broadly classified into two main types: lead-barium glass and high-potassium glass. The former, which is produced using lead ore as the flux, typically contains elevated levels of lead oxide (PbO) and barium oxide (BaO), whereas the latter, which is refined with plant ash, is characterized by a higher content of potassium oxide (K₂O). As an important category of archeological material, ancient Chinese glass not only reflects the technological evolution of domestic glassmaking but also provides essential clues regarding raw material sourcing, manufacturing techniques, and Sino-foreign exchange along the Silk Road. Over time, however, glass weathers through environmental exposure: elements inside the glass may be exchanged with elements from the environment, altering the original chemical composition.
As such, identifying glass types solely based on the present concentrations of PbO, BaO, or K₂O can lead to inaccurate conclusions. Currently, the classification of ancient Chinese glass relies heavily on microscopic observation, compositional analysis, and expert interpretation. Cultural heritage specialists often base their judgments on surface characteristics such as emblazonry, color, and degree of weathering, along with their own experience. This approach, however, is highly subjective, labor intensive, and prone to error. More importantly, it tends to overlook deeper structural patterns embedded in the compositional data.

In recent years, the interdisciplinary integration of mathematics with archeology and materials science has opened new avenues and provided unique insights for the study of ancient glass artifacts. Traditional analytical methods in archaeometry rely primarily on high-resolution structural imaging under microscopy and techniques such as scanning electron microscopy with energy dispersive X-ray spectroscopy (SEM‒EDS)³ and inductively coupled plasma‒mass spectrometry (ICP‒MS)⁴ to examine material composition. While these techniques are effective for characterizing archeological materials, the final interpretation still depends heavily on manual comparison and expert judgment. When faced with a large volume of excavated materials or time-sensitive identification tasks, such approaches become increasingly limited in terms of their efficiency. Moreover, inconsistencies in classification criteria and interpretative perspectives across different institutions and experts often make it difficult to reach a consensus or extract deeper patterns. Currently, the application of statistical methods in archeology remains relatively limited and often relies on basic compositional comparisons, which are insufficient for analyzing the high-dimensional data inherent in archeological materials. With the rapid expansion of excavation datasets, there is a pressing need for a mature, replicable analytical framework that can handle both complexity and scale. Mathematical modeling offers a powerful tool for abstracting real-world problems into mathematical frameworks, enabling structured analysis and deeper understanding. In various fields, modeling has proven instrumental in providing practical solutions and critical insights. For instance, in agricultural production, researchers have employed mathematical models to optimize drying conditions for Garcinia fruit, ensuring better preservation without compromising nutritional quality⁵. 
In aerospace engineering, Xu Yeshou et al.⁶ developed a mathematical model to describe the relationship between the macroscopic damping performance of dampers and the microstructure of viscoelastic materials, effectively predicting dynamic behavior under different testing conditions. In chemistry, mathematical models have been used to determine the optimal reaction conditions and yields for the ethanol coupling synthesis of C4 olefins⁷. As a data-driven extension of modeling, machine learning enables computers to automatically learn and improve performance without being explicitly programmed. It addresses the limitations of traditional statistical methods and has been successfully applied in various domains, such as wastewater treatment⁸, biosensor design⁹, and cardiovascular risk prediction¹⁰. In the field of archeology, researchers have begun to apply machine learning and mathematical modeling techniques¹¹, primarily in ceramic analysis, demonstrating the potential of data-driven approaches in material classification and interpretation¹².

Based on experimental data derived from archeologists’ analyses of a collection of ancient Chinese glass samples, this study processes the compositional data to explore the statistical characteristics of chemical weathering in ancient glass. By applying mathematical modeling and machine learning techniques, it identifies the classification patterns of high-potassium and lead-barium glass, with the aim of classifying previously unidentified glass artifacts efficiently and accurately. In the domain of ancient glass research, Li et al.¹³ proposed a joint machine learning algorithm (JMLA) that integrates Daen-LR, ARIMA-LSTM, and MLR for the classification of unknown types of glass. Guo et al.¹⁴ used the slime mold algorithm to optimize the parameters of a support vector machine (SVM) and analyzed a dataset comprising 69 groups of glass chemical compositions. Their results indicated that the SVM combined with the slime mold strategy provides a reliable classification reference for future glass artifacts. Zou¹⁵ systematically studied how weathering changes the composition of ancient glass, analyzed the surface weathering of glass relics and its correlation with three properties, and established a multivariable time series model to predict the chemical composition before weathering. Compared with these studies, our work covers the full workflow from preprocessing to classification. First, the chemical composition data of ancient Chinese glass are preprocessed using the centered log-ratio (CLR) transformation to account for the closed nature of compositional data. Next, the correlations between multiple compositional features and the classification of lead-barium and high-potassium glass are investigated. Several statistical methods are then applied to examine the statistical characteristics of the chemical components within the classified groups.
After confirming the statistical characteristics of the component contents in ancient glass, we estimate the chemical composition of the glass before weathering. We then employ multiple machine learning methods to explore and compare the classification rules of ancient Chinese glass, making it possible to identify unknown glass types. To assess the robustness and practical applicability of the classification system, a sensitivity analysis is performed on the models. Finally, this study proposes a specific classification scheme for ancient Chinese glass and demonstrates its reliability and interpretative value for archeological and materials science research.

Methods

Sources of data

The original experimental data used in this paper are from the attachment of Question C of the 2022 Higher Education Community Cup National Mathematical Contest in Modeling for College Students¹⁶. All the datasets used in this study are provided in Supplementary Tables S1–S3, which collectively document the provenance, typology, and chemical composition of the analyzed ancient glass samples. The descriptive metadata, including the cultural relic number, decorative motif (emblazonry), glass type (such as high-potassium or lead-barium), color, and surface-weathering condition (before or after weathering), are listed in Supplementary Table S1. These categorical attributes serve to contextualize each artifact and provide a qualitative framework for subsequent compositional interpretation. Supplementary Table S2 presents the quantitative elemental compositions (wt%) of the samples used to construct the classification model, encompassing major, minor, and trace oxides—SiO₂, Na₂O, K₂O, CaO, MgO, Al₂O₃, Fe₂O₃, CuO, PbO, BaO, P₂O₅, SrO, SnO₂, and SO₂. This dataset represents the analytical foundation for distinguishing technological groups and glass-making traditions. Finally, Supplementary Table S3 includes compositional measurements of unclassified or newly examined glass specimens, recorded with the same chemical parameters and contextual information. These data were employed to test and validate the performance of the classification model, thereby bridging the analytical dataset with its archeological application.

Symbol description

Handling missing values in compositional glass data

The original dataset comprises various feature data, including elemental compositions, derived from ancient Chinese glass samples and obtained through instrumental analysis or manual inspection. Owing to potential inaccuracies caused by measurement errors, environmental weathering, and sample degradation, data preprocessing is essential to ensure the reliability and consistency of subsequent mathematical modeling and machine learning analyses.

In Supplementary Table S1, the color of samples 19, 40, 48, and 58 is missing. Samples 19 and 48 are after-weathering lead-barium glasses with emblazonry A, and samples 40 and 58 have emblazonry C. The modal color in each of these two groups in Supplementary Table S1 is ‘light blue’, so the missing colors were filled with ‘light blue’. Some chemical component values in Supplementary Tables S2 and S3 are also missing. A component may be present at a level too low for the instrument to detect, or it may be absent from the sample altogether. Although these are two different situations, they are treated as the same class for convenience of analysis. Moreover, because the centered log-ratio transformation requires taking logarithms of the compositional data, missing values, zeros, and undetected components are replaced by a small positive number, 0.0001.
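
The replacement of missing or zero readings can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `replace_missing` and the sample values are hypothetical, and only the constant 0.0001 comes from the text.

```python
# Hypothetical helper: replace missing (None) or zero oxide readings with a
# small positive constant so that logarithms are defined later on.
EPSILON = 0.0001  # replacement value stated in the text


def replace_missing(composition, epsilon=EPSILON):
    """Return a copy of `composition` with None/zero entries set to `epsilon`."""
    return {
        oxide: (epsilon if value is None or value == 0 else value)
        for oxide, value in composition.items()
    }


# Illustrative values only; they are not taken from the supplementary tables.
sample = {"SiO2": 69.33, "Na2O": None, "K2O": 9.99, "SrO": 0.0}
filled = replace_missing(sample)
```

Treating "not detected" and "absent" identically mirrors the simplification described above; a detection-limit-aware imputation would be an alternative, but the source does not describe one.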

Validation and screening of compositional glass data

A sample is regarded as valid when its chemical components sum to between 85% and 105%. The components of samples 15 and 17 sum to 79.47% and 71.89%, respectively, so both are classified as invalid. These two samples were therefore deleted from Supplementary Tables S1 and S2 and excluded from subsequent processing.
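
The validity screen amounts to a range check on the row total. The sketch below assumes each sample is held as a dictionary of oxide percentages with `None` for missing readings; the function name `is_valid` and the example rows are hypothetical.

```python
def is_valid(composition, low=85.0, high=105.0):
    """True when the detected oxide percentages sum to a value within [low, high]."""
    total = sum(v for v in composition.values() if v is not None)
    return low <= total <= high


# Illustrative rows only: the first totals 79.47 (below 85), the second 90.0.
too_low = {"SiO2": 60.0, "PbO": 19.47, "BaO": None}
acceptable = {"SiO2": 70.0, "K2O": 20.0}

flags = [is_valid(too_low), is_valid(acceptable)]
```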

Limiting conditions for compositional analysis of ancient glass

The chemical composition data of ancient glass are compositional: in theory, the oxides sum to 100%. In practice, incomplete detection of some minor elements often causes the total to deviate from 100%. Instead of forcing normalization, which may distort the relative proportions among detected oxides, only variables with acceptable completeness rates were retained for multivariate analysis. Following common practice in archaeometric studies, components detected in more than 75–85% of the samples were considered reliable. In accordance with this criterion, SiO₂, CaO, Al₂O₃, CuO, and P₂O₅ and, to a lesser extent, PbO and BaO (Supplementary Tables S2 and S3) were selected as the representative variables for subsequent statistical modeling. Variables such as PbO, or others close to the detection limit, are handled with univariate methods.
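
A completeness filter of this kind can be sketched as below. The threshold of 0.75 is an assumption taken from the lower end of the 75–85% range quoted above, and the function names and sample rows are hypothetical.

```python
def detection_rate(samples, oxide):
    """Fraction of samples in which `oxide` has a non-missing, non-zero reading."""
    detected = sum(1 for s in samples if s.get(oxide) not in (None, 0))
    return detected / len(samples)


def reliable_oxides(samples, threshold=0.75):
    """Oxides detected in more than `threshold` of the samples, sorted by name."""
    oxides = sorted({ox for s in samples for ox in s})
    return [ox for ox in oxides if detection_rate(samples, ox) > threshold]


# Illustrative rows only; they are not taken from the supplementary tables.
samples = [
    {"SiO2": 65.0, "CaO": 2.1, "SrO": None},
    {"SiO2": 59.8, "CaO": 1.4, "SrO": 0.2},
    {"SiO2": 70.2, "CaO": None, "SrO": None},
    {"SiO2": 33.1, "CaO": 3.0, "SrO": 0},
    {"SiO2": 48.7, "CaO": 0.9, "SrO": None},
]
reliable = reliable_oxides(samples)
```

Here SiO2 is detected in all five rows and CaO in four of five (0.8), so both pass, while SrO (one of five) is dropped.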

Centered log-ratio (CLR) transformation of compositional data

When machine learning and mathematical modeling techniques are applied in archeology, particularly in studies of elemental variation in archeological materials, the compositional data should first undergo a centered log-ratio (CLR) transformation. Elemental data for archeological samples typically represent closed compositions, in which all components sum to 100% (i.e., mass percentages). The components are therefore not independent but proportionally interdependent, which makes conventional Euclidean-based statistical methods inappropriate. Because of this constant-sum constraint, compositional variables are inherently collinear, and traditional multivariate analyses are susceptible to spurious correlations and misleading results. To address this issue, the British statistician John Aitchison¹⁷⁻²⁰ introduced the log-ratio transformation as a methodology for analyzing compositional data. The approach rests on the principle that ratios between components are not affected by the closure constraint and that the logarithms of these ratios often approximate a normal distribution. Conventional statistical and machine learning techniques can then be applied to the transformed data, enabling valid inference and robust classification in compositional domains such as archeological materials analysis. Suitable transformations include the centered log-ratio (CLR), additive log-ratio (ALR), and isometric log-ratio (ILR) transformations.
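
For reference, the standard CLR transformation maps each part x_i of a composition to ln(x_i / g(x)), where g(x) is the geometric mean of all parts. The sketch below implements that textbook definition; the function name `clr` and the example values are illustrative, and the inputs are assumed to be strictly positive (zeros already replaced).

```python
import math


def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]  # requires strictly positive parts
    mean_log = sum(logs) / len(logs)     # log of the geometric mean
    return [value - mean_log for value in logs]


# Illustrative composition; rescaling all parts leaves the CLR unchanged.
raw = [1.0, 2.0, 4.0]
scaled = [10.0, 20.0, 40.0]
clr_raw = clr(raw)
clr_scaled = clr(scaled)
```

Two properties are worth checking on any implementation: the CLR coordinates sum to zero, and they do not change when the whole composition is multiplied by a constant, which is why totals that deviate from 100% do not need forced normalization first.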

Statistical analysis of categorical associations

To assess whether macroscopic attributes correlate with surface weathering, we modeled the relationships among weathering state, glass type, emblazonry, and color using contingency-table methods. First, we constructed contingency tables between surface weathering and each categorical variable (glass type, emblazonry, and color) using the 56 artifacts listed in Supplementary Table S1. Before applying the chi-square test, we verified that the total sample size was greater than 40, that no expected cell count was less than 1, and that at least 80% of the expected cell counts were greater than or equal to 5, ensuring the validity of the chi-square approximation. For surface weathering vs. glass type, these conditions were satisfied and the chi-square test was applied to test independence at the 5% significance level. For cross-tabulations involving low-frequency categories (emblazonry and color), the chi-square assumptions were violated. In these cases, we employed Fisher’s exact test to obtain exact p-values based on the hypergeometric distribution. Because emblazonry has three levels (A, B, C), we performed pairwise Fisher tests on the derived AB, AC, and BC groups to probe potential differences in weathering distributions between motif pairs. Fisher’s exact test was also used to evaluate the relationship between weathering state and color.
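
The decision between the chi-square test and Fisher’s exact test can be sketched as follows. This is an illustrative stdlib-only version with hypothetical function names and made-up tables; it covers only 2×2 tables (as in the pairwise AB, AC, and BC comparisons), and the two-sided p-value uses the common convention of summing all tables no more probable than the observed one. Larger tables, such as weathering versus color, would need an r×c exact test that is not shown here.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts under independence for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_is_appropriate(table):
    """Check the conditions stated in the text: n > 40, no expected count
    below 1, and at least 80% of expected counts >= 5."""
    n = sum(sum(row) for row in table)
    exp = [e for row in expected_counts(table) for e in row]
    share_ge5 = sum(e >= 5 for e in exp) / len(exp)
    return n > 40 and min(exp) >= 1 and share_ge5 >= 0.8


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table (hypergeometric)."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Made-up tables for illustration only.
large = [[20, 10], [10, 20]]
small = [[3, 1], [1, 3]]
use_chi2 = [chi_square_is_appropriate(large), chi_square_is_appropriate(small)]
p_small = fisher_exact_2x2(small)
```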

Statistical analysis of chemical compositions by weathering state

To characterize how weathering affects chemical composition within each glass type, we conducted univariate analyses on CLR-transformed oxide contents. Using the five groups T₁–T₅, we first visualized distributions of SiO₂ and other selected oxides via boxplots to obtain an overview of intergroup variability. For formal hypothesis testing, we examined the distributional assumptions of SiO₂ (and other key oxides as needed) in each group. Normality was assessed using Q–Q plots together with the Lilliefors test, which is suitable for small samples with unknown mean and variance. For pairs of groups representing the same glass type before and after weathering (e.g., T₃ vs. T₄ for lead-barium), we applied two-sample t tests to evaluate whether the mean CLR-transformed oxide content differed significantly between states. To mitigate the influence of potential outliers, we examined the coefficient of variation (CV) for each group.
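
The two group-level quantities named above can be sketched as below. The text does not say whether a pooled-variance or Welch t test was used, so the pooled (Student) form here is an assumption; the function names and numbers are hypothetical. Only the test statistic and degrees of freedom are computed, since the p-value needs the t distribution, which a statistics library would normally supply.

```python
from math import sqrt
from statistics import mean, stdev, variance


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    return stdev(values) / abs(mean(values))


def pooled_t_statistic(a, b):
    """Student two-sample t statistic and degrees of freedom (equal variances assumed)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Made-up values standing in for one oxide before and after weathering.
before = [1.0, 2.0, 3.0]
after = [2.0, 3.0, 4.0]
cv_before = coefficient_of_variation(before)
t_stat, dof = pooled_t_statistic(before, after)
```

One caveat: CLR coordinates can be negative and can average close to zero, in which case the coefficient of variation becomes unstable and should be read with care.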

Estimation of pre-weathering compositions

To obtain approximate pre-weathering compositions for weathered samples, we used group-level statistics from the glass type–weathering categories defined above. For each oxide and for each glass type (high-potassium and lead-barium), we calculated the central tendency (median or mean, depending on the dispersion) in the before-weathering groups and in the corresponding after-weathering groups, and took their ratios. These type-specific ratios were then used as multiplicative correction factors, applied to the measured compositions of after-weathering samples to estimate their pre-weathering values. This procedure was only used for glass types for which both before- and after-weathering data were available, and its performance was evaluated using artifacts that have both interior (before-weathering) and surface (after-weathering) measurements (Table 1).
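
The ratio-based correction can be sketched as follows, using the median as the measure of central tendency. The function names and all values are hypothetical; whether the authors rescale the corrected composition afterwards is not stated, so no renormalization is applied here.

```python
from statistics import median


def correction_factors(before, after):
    """Per-oxide ratio of the before-weathering median to the after-weathering median.

    `before` and `after` map each oxide to the list of values in that group.
    """
    return {ox: median(before[ox]) / median(after[ox]) for ox in before}


def estimate_pre_weathering(sample, factors):
    """Scale each measured oxide of a weathered sample by its correction factor."""
    return {ox: value * factors[ox] for ox, value in sample.items()}


# Made-up group data for a single glass type.
before = {"SiO2": [60.0, 70.0, 80.0], "K2O": [8.0, 10.0, 12.0]}
after = {"SiO2": [80.0, 90.0, 100.0], "K2O": [0.5, 1.0, 2.0]}
factors = correction_factors(before, after)

weathered = {"SiO2": 90.0, "K2O": 1.0}
estimated = estimate_pre_weathering(weathered, factors)
```

Because the factors are computed per glass type, a separate `factors` dictionary would be built for high-potassium and for lead-barium glass.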

Table 1 Symbols and definitions of the parameters
