Background & Summary

The Paris Agreement was adopted by 196 Parties at the UN Climate Change Conference (COP 21) to limit the global average temperature increase1. The global demand for a carbon-neutral future drives research efforts to understand climate technology at various scales2,3. At COP 28, Parties decided to improve climate technology development and transfer through the Technology Mechanism, including the use of artificial intelligence (AI) in technology needs assessments for 2023–20274. The roles of AI in facilitating climate technology implementation are widely discussed for carbon capture, utilization and storage (CCUS) and carbon dioxide removal (CDR) technologies5,6,7,8. Direct air capture (DAC) is a CDR technology that removes CO2 from the atmosphere using chemisorption or physisorption. Common sorbents include alkaline solutions, amine-functionalized adsorbents, ion-exchange resins and metal-organic frameworks (MOFs). Among these materials, the crystalline structural information of MOFs can be stored in CIF (Crystallographic Information File) format, which allows easier archiving, analysis, and visualization for data interoperation, sharing and reuse in compliance with the FAIR Principles9,10,11. In contrast, the datasets available for amine-functionalised solids as DAC adsorbents are mostly confined to the laboratory stage.

Supported polyethyleneimine (PEI) adsorbents have been studied for carbon capture since 200212. Amine-impregnated adsorbents are a class of supported amines in which small amines, polyamines or mixed amines and additives are impregnated on the surface of a porous solid through non-covalent interactions13. They are among the most widely studied DAC adsorbents, owing to their accessibility and reactivity towards CO214. The physical impregnation method offers lower costs, simpler operations, and higher amine content compared to other types of supported amines15. However, challenges remain regarding amine leaching and the stability of the adsorbents16. Amine-grafted adsorbents are another widely studied class that possess higher stability due to the covalent bonding of amines to the solid supports17. Additionally, amine-grafted adsorbents have been widely used in computational studies of CO2 adsorption mechanisms and of kinetic modeling for engineering design18,19,20,21,22,23. Simulation studies of amine-impregnated adsorbents are complex due to the diverse molecular structures of amines, the varied compositions and types of porous solids (supports), the movement of amines on different supports, and the synergistic interactions between amines and the supports24,25,26. The limited availability of datasets for amine-impregnated adsorbents has hindered the application of advanced AI technology in studying their performance for DAC. Although powdered amine-functionalised adsorbents are not suitable for pilot DAC applications, investigation of their compositions has provided valuable insights into CO2 adsorption processes27,28,29. Current research focuses on enhancing the efficiency, stability, and adsorption kinetics of fabricated amine-functionalized adsorbents for direct air capture (DAC)30.
Additionally, efforts are being made to develop simulation techniques and adsorption units suitable for larger-scale applications of related DAC adsorbents31,32. The significance of sharing a computational-ready database for amine-impregnated solid adsorbents includes: (1) Enabling AI models for designing and discovering amine-functionalized solid sorbents. (2) Enhancing the understanding of structure-property relationships of amine-impregnated solid sorbents through data-driven analysis and explanatory machine learning. (3) Offering opportunities for incorporating large language models in the data extraction of related amine-functionalised adsorbents. (4) Accelerating the identification of promising amines and porous solid sorbents for DAC studies and reducing unnecessary laboratory work. (5) Providing examples for creating and training small datasets associated with material science33. (6) Encouraging uniform data reporting, data sharing and communication of research ideas among the DAC communities.

Previous studies demonstrated dataset construction procedures through experimental data collection and data extraction from the literature34,35. These datasets are suitable for training machine learning models that make predictions on the performance of the CO2 adsorbent materials. For example, Zhang et al. constructed a database containing 1200 data points of porous carbon physisorbent for CO2 capture. A deep neural network model (DNN) was trained36. The model prediction strongly correlates with the experimentally measured CO2 capacity of the selected porous carbon. In addition, a private dataset of amine-functionalised MCM-41 and SBA-15 for CO2 capture was constructed by Yıldız et al. The dataset comprises 1039 data points extracted from 28 publications spanning from 2002 to 2017. Statistical analysis and decision tree-based classification over the dataset offered valuable recommendations on optimal amine type and operational conditions for achieving high amine efficiencies on carbon capture37. Another private dataset was constructed based on the properties of CO2 adsorbents prepared from Polyethylenimine (PEI) and Tetraethylenepentamine (TEPA) impregnated separately on six commercially available porous solids38. A random forest model was trained to predict the CO2 capacity, amine efficiency, and cyclic stability of the PEI and TEPA-impregnated mesoporous silica KYX using the private dataset. More recently, Li et al. demonstrated the training of a ChatGPT 4o model with in-context learning to predict CO2 adsorption uptake based on input features of adsorbent properties and adsorption measurement conditions with approximately 200 data points39. The findings suggest that context-based modeling is effective in reducing prediction errors when compared to conventional machine learning models in the predictive task. 
The more comprehensive database derived from this work would further support investigations into the effects of training data size and data diversity on the performance of machine learning models, facilitating a better understanding of how to utilize AI models for lab-based research40.

The previous research provided valuable adsorbent characteristics and performance indicators as effective features to describe different amine-functionalised solid adsorbents. Inspired by the promising results, this study aims to collect the data associated with amine-impregnated solid adsorbents with expansions in the range of data source, data size and data diversity. One of the new categories of amine-impregnated solid adsorbent is the adsorbent for DAC. Even though the research for DAC only started to emerge in the last ten years, the accumulated publications are promising in creating a compact set of data for amine-impregnated solid adsorbents with diverse chemical and structural properties. Based on the manual data collection processes introduced in previous research35,37, the new dataset further considers the specific adsorbent characteristics for DAC, such as the presence of hydroxy groups or surfactants, adsorption kinetics, sorbent stability and performance at sub-ambient temperatures. These characteristics are crucial for understanding carbon capture science and advancing the amine-functionalised adsorbents for DAC.

To the best of our knowledge, there is no publicly available database on amine-impregnated solid adsorbents for CO2 capture from both diluted and concentrated sources. Despite all the expected contributions of a publicly available dataset of amine-impregnated solids for carbon capture, many of the chemical and textural properties, and the performance indicators of those adsorbents were reported in inconsistent units or technical terms. Sometimes, the chemical and textural properties of the adsorbents are missing. Moreover, the kinetic data reported are based on adsorption modeling equations or contained in figures. Furthermore, there are multiple approaches to defining promising amine-impregnated solid sorbents, which can lead to variations in the reported types of performance indicators and missing data. In addition, some research focuses on the mechanisms rather than applications of supported amines, where different types of measurement results are reported. Therefore, the initial data collection and processing of amine-functionalised solid adsorbents relies on human expertise. Based on previous research data, this work demonstrates the construction of an amine-impregnated porous solid database for both lab and AI-related research in carbon capture science. The data were extracted from 52 articles (52 cases), with 1,336 data points and 28 features for each data point. The methods used during data collection and data processing are reported in the following sections (Fig. 1). By aggregating and statistically validating the data from various research teams, the computational-ready dataset is expected to support predictive and exploratory tasks across different laboratories, enhancing collaborative research efforts in DAC technology.

Fig. 1

The workflow of constructing the database of amine-impregnated adsorbents. The identified features of each data point were collected from the bibliometric database, and the data sources were initially allocated based on expert knowledge. Selected articles were screened by ChatGPT. The data collection process involved data extraction from text and figures, derivation of data, literature search and measurement of missing contextual data. The collection methods were validated through statistical analysis, and machine learning models were trained for future applications of the database.

Methods

Data source selection via manual screening and prompt engineering

A comprehensive search was conducted on Web of Science, Scopus, and Google Scholar to identify reliable and relevant sources on amine-impregnated solid adsorbents for DAC (Fig. 2a). The keywords used for the search of targeted literature varied depending on the responses from each search engine. Web of Science yielded 325 articles with the keywords “amine” and “direct air capture” as topic indexes and “amine efficiency” as an added keyword. Scopus yielded 595 research articles with “amine”, “air” and “capture” as topic indexes. The search results suggested that the number of publications on amine-functionalised adsorbents for DAC increased dramatically from 2005. Therefore, the time frame of 2005–2024 was used for the search on Google Scholar. The terms “amine”, “impregnated,” and “direct air capture” resulted in 791 articles overall. The exported literature was filtered with defined rules based on domain knowledge. The exclusion criteria are summarized in descending order of their ease of recognition (Table 1).

Fig. 2

The detailed workflow of the data allocation method. (a) The initial data screening was carried out by scientists in amine-impregnated solid adsorbents. (b) The second round of screening involved prompt engineering and interactions with GPT; the prompt used is shown in panel b. (c) The output of the GPT model was verified and processed by scientists in detail to collect the relevant data. Overall, 71 articles were screened, and 52 articles were selected for final data collection. The accuracy of the GPT model was assessed using the formulas indicated in panel c.

Table 1 Explanation and limitations of excluding specific literature and parameters related to DAC adsorbents.

The first round of data source screening resulted in approximately 30 articles associated with an amine-impregnated adsorbent for DAC. Notably, 10 out of the 30 selected articles involved studies of supported amines not only under DAC conditions but also under simulated flue gas or pure CO2 stream conditions. The partial pressure of CO2 is a crucial factor influencing the performance of amine-impregnated adsorbents15. Since DAC is relatively new compared to point source carbon capture, the performance of supported amines under pure CO2 and flue gas conditions was studied earlier, and more literature is available. These previous works provide valuable insights for investigating novel DAC adsorbents. Therefore, the database could be expanded by including studies on amine-impregnated solid adsorbents under different partial pressures of CO2. As a challenge, fixed-bed breakthrough experiments are one of the major techniques used to assess the performance of amine-impregnated solid adsorbents in earlier publications. In addition, the focus on adsorbent properties changes with the demands of climate technologies over time. These articles are not included in the final dataset to prevent bias introduced by the measurement techniques of the adsorbent. As a trade-off, this reduces the completeness of data ranges in the dataset. Expert-defined rules were applied during the second-round search for expansion of the datasets. The earliest research on amine-grafted solid adsorbents under both 0.04% and 10% volume CO2 conditions was published in 2010, while the first study on amine-impregnated solid adsorbents under these conditions was published in 201121,41. Therefore, the timeframe for the second-round search is from 2011 to 2024. The investigation of the adsorbent properties under varying CO2 conditions is anticipated to yield results that are more consistent with those from earlier studies. The same keywords were used on Web of Science and Scopus.
The keywords used on Google Scholar include “amine”, “impregnation”, “CO2”, “adsorption”, “flue gas” and “thermal gravimetric analysis (TGA)” with exclusions of “graft”, “grafting” and “grafted”, resulting in 724 articles. The second round of searches focuses on literature containing various supports, amines, and additives to enhance dataset diversity. Additionally, it checks for any missed articles from the first round.

To accelerate the second round of data source screening and allow a time-saving validation of all the selected articles, a prompt was developed based on the features of a few selected articles and fed to the large language model GPT-3.5-turbo (Fig. 2b)42. To integrate GPT-3.5-turbo into the experimental pipeline, API calls were configured with a system prompt that defined the assistant’s role, while user queries were formatted as message objects. The titles, keywords, and abstracts of all 71 potential articles identified through web searches were input into the GPT model. Based on the provided prompt, the model identified 54 articles as “Yes” and 17 articles as “No.” The articles assigned “Yes” by the GPT model were thoroughly processed by scientists to extract relevant data. During this review, the scientists found that 11 out of the 54 articles were incorrectly assigned (False Positive: FP = 11, True Positives: TP = 43). The articles assigned “No” by the GPT model were checked by the scientists, and it was determined that 9 out of the 17 were incorrectly classified (False Negatives: FN = 9, True Negatives: TN = 8). After validating and finalizing all responses from the GPT model, 52 articles were selected for database construction. The performance of the GPT model was evaluated using precision and recall, resulting in an F1 score of 0.81, as indicated in Fig. 2c. The input and output for the GPT model, along with information on the selected articles, can be found in Table S1 and Table S2, respectively (see supplementary xlsx file).
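The screening counts reported above can be checked with a few lines of Python. This is a minimal sketch that assumes the standard definitions of precision, recall and F1; the formulas in Fig. 2c are not reproduced here, and the function name `screening_metrics` and the additional accuracy value are illustrative rather than part of the original workflow.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy for a binary screening task."""
    precision = tp / (tp + fp)          # share of "Yes" answers that were correct
    recall = tp / (tp + fn)             # share of relevant articles that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts taken from the text: 54 "Yes" (43 TP, 11 FP) and 17 "No" (9 FN, 8 TN).
metrics = screening_metrics(tp=43, fp=11, fn=9, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, TP + FN = 52 matches the number of articles finally selected, and the F1 value rounds to the reported 0.81.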

Identification of chemical descriptors and unit conversions

The selection of appropriate performance indicators for amine-impregnated adsorbents is crucial for ensuring the dataset’s quality for future research (Table 2). Both CO2 capacity and amine efficiency are essential for evaluating an adsorbent’s performance in laboratory research, and they also offer insights into CO2 interactions with specific amines43,44,45. These two parameters were also frequently reported in former research and investigated by machine learning models35,38. The kinetic parameters and heat of adsorption are performance indicators that assist the understanding of adsorption phenomena46,47. Although these parameters were not widely reported for amine-impregnated adsorbents, they are valuable for the simulation and modeling of adsorption breakthrough curves and regeneration heat during the engineering design processes46,48. The time to half saturation and the time to 90% saturation of the adsorbent were chosen because the adsorption behaviour of most amine-impregnated adsorbents was not represented by kinetic models and the rate coefficients were not widely reported. As a drawback, apart from the time to reach CO2 saturation, the reported energy-related parameters were limited, especially for amine-impregnated adsorbents. The limitations for excluding energy-related factors are summarised in Table 1. The cyclic stability related to thermal, hydrothermal, and chemical parameters of the adsorption process is another crucial performance indicator to assess the sustainability and economic feasibility of an adsorbent49.
Even though the cyclic stability of powdered adsorbents measured by TGA provides insights into the performance of adsorbents under practical conditions, stability-related data may vary under breakthrough conditions and were sometimes not reported in the selected studies50,51. This dataset only includes commonly reported temperatures for the adsorption-desorption process, the number of adsorption-desorption cycles, and the weight and capacity loss of adsorbents.

Table 2 The selected features within the dataset.

Previous lab research has predominantly focused on optimizing adsorbents by employing different amines and solid supports to achieve an optimal balance among several performance indicators. However, the preparation of each individual adsorbent typically requires 6 to 12 hours in the laboratory, with adsorption measurements extending the overall timeframe even further. The selected chemical descriptors of this database focus on the amine and support properties for lab-based research to identify potential combinations of starting materials that enhance target performance indicators (Table 2). Applying machine learning or advanced AI models to this dataset would help chemists identify patterns and optimal feature combinations to achieve target properties, thereby reducing the learning curve for researchers across various disciplines.

The composition and structure of different amine molecules are key features influencing the performance of adsorbents. For the dataset to be deemed suitable for computational tasks, the input datasets should encompass more than abbreviated amine names. Describing the molecular structure of amine molecules with numerical input can make the dataset more practical for scientists without significant coding knowledge. Adsorbent preparation uses various linear polyamines, branched polyamines or small amines, which exhibit different mobility, viscosity, and volatility when impregnated within porous solids14,24. Chemically, the primary, secondary, and tertiary amines within a specific polyamine molecule interact with CO2 through different mechanisms, making their ratio a key factor influencing the efficiency and oxidative stability of the adsorbents52,53. The molecular weight of the polyamines is related to the viscosity of the adsorbents and correlates with the performance in CO2 capacity and amine efficiency54. Furthermore, the presence of additives could also influence the CO2 capacity, amine efficiency, adsorption kinetics and adsorption heat of amine-impregnated adsorbents55. In terms of the porous support, the surface chemistry, pore size and pore geometry are also crucial factors that influence the amine loading, capacity, kinetics and stability of the adsorbent14,56.

Apart from features of the amine and porous support, the preparation and activation of amines and the instrumental setups for CO2 adsorption also influence the properties and performance indicators of the adsorbents. These experimental parameters are critical for achieving the desired amine loading and textural properties of amine-impregnated adsorbents, as well as for optimizing their performance55. The operational parameters such as adsorption temperature, humidity and flow rate of purge gas are critical for the measured adsorbent performances21,57,58. The information about the methods used for measuring adsorption performance is essential for ensuring reliable comparisons and interpretations across different studies. The explanations and limitations for excluding these factors are outlined in Table 1. Additionally, Table S3 (see supplementary xlsx file) summarizes missing experimental and instrumental parameters, highlighting that the availability of data for these factors in constructing a complete dataset involves trade-offs.

The following section addresses inconsistently reported chemical features. The unit conversion and data derivation processes ensure that the dataset is aligned with the original research while enhancing the integrity of the dataset. In some articles, the term “amine loading” was used to describe the percentage weight of amine impregnated on a solid adsorbent; others employ the term “organic content” instead. The terms “amine loading”, “amine content”, and “nitrogen content (N content)” were also used to denote the quantity of nitrogen per gram of the adsorbent. Furthermore, the N content is sometimes reported as a percentage of the adsorbent’s weight (\({N}_{{adsorbent}}\)); these values were converted to moles of nitrogen per gram of adsorbent (Formula 1). The molar mass of nitrogen is 14.01 g/mol. \({M}_{{adsorbent}}\) represents the weight of the adsorbent; it cancels out of the formula and does not affect the calculation.

$$N\,{content}=\frac{{N}_{{adsorbent}}\,\cdot {M}_{{adsorbent}}}{14.01\cdot {M}_{{adsorbent}}}$$
(1)
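As a worked illustration of Formula 1, the sketch below converts a nitrogen weight percentage into an N content. It assumes that the reported percentage is a mass percentage of the whole adsorbent and that the target unit is mmol of N per gram of adsorbent (hence the factor of 1000); the function name is hypothetical.

```python
N_MOLAR_MASS = 14.01  # g/mol, molar mass of nitrogen


def n_content_mmol_per_g(n_weight_percent):
    """Convert N wt% of the adsorbent into mmol N per gram of adsorbent.

    Assumption: n_weight_percent is grams of N per 100 g of adsorbent.
    The adsorbent mass cancels, as in Formula 1.
    """
    mass_fraction = n_weight_percent / 100.0      # g N per g adsorbent
    mol_per_g = mass_fraction / N_MOLAR_MASS      # mol N per g adsorbent
    return mol_per_g * 1000.0                     # mmol N per g adsorbent


# Example: an adsorbent reported with 10 wt% nitrogen
print(f"{n_content_mmol_per_g(10.0):.2f} mmol N/g")
```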

Notably, within our dataset, “organic content” is employed to represent the weight percentage of organic molecules impregnated on the solid adsorbents, a term considered more precise, particularly when the solid adsorbents encompass both amine molecules and additives.

Most of the obtained data was reported in consistent units, but some literature presented inconsistent units that required further processing by scientists. One common inconsistency is the use of “per gram adsorbent” or “per gram solid support” as the denominator for organic content (%), pseudo-equilibrium CO2 uptake (mmol), and textural properties of the solid adsorbents (6 cases). Jones et al. explained that reporting textural properties per gram of solid support allows for more insightful structural comparisons. They also reported CO2 capacities in both units for easier comparison59. Other studies also use different CO2 capacity units such as “mg/gadsorbent” (3 cases) and wt% of the adsorbent (1 case). Since most research uses “per gram adsorbent” as the denominator for representing properties and performance of amine-impregnated adsorbents, CO2 uptake capacities per mass of adsorbent were converted to mmol/g60.
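The capacity conversions described above can be sketched as follows. The helper names are hypothetical, “wt%” is assumed to mean grams of CO2 per 100 g of adsorbent, and the change of denominator assumes that the adsorbent mass is the support mass plus the impregnated organics.

```python
CO2_MOLAR_MASS = 44.01  # g/mol


def mg_per_g_to_mmol_per_g(q_mg_per_g):
    """mg CO2 per g adsorbent -> mmol CO2 per g adsorbent."""
    return q_mg_per_g / CO2_MOLAR_MASS


def wt_percent_to_mmol_per_g(q_wt_percent):
    """g CO2 per 100 g adsorbent -> mmol CO2 per g adsorbent."""
    return q_wt_percent * 10.0 / CO2_MOLAR_MASS


def per_support_to_per_adsorbent(q_per_g_support, organic_fraction):
    """Rescale a value reported per gram of support to per gram of adsorbent.

    organic_fraction is the organic content per gram of adsorbent (0 to 1),
    so each gram of adsorbent contains (1 - organic_fraction) g of support.
    """
    return q_per_g_support * (1.0 - organic_fraction)


print(f"{mg_per_g_to_mmol_per_g(88.02):.2f} mmol/g")
print(f"{wt_percent_to_mmol_per_g(8.802):.2f} mmol/g")
print(f"{per_support_to_per_adsorbent(4.0, 0.5):.2f} mmol/g adsorbent")
```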

As a challenge, the units for the textural properties of the solid adsorbents cannot be converted directly without instrumental measurement and analysis. To address this problem, we only report the textural properties of the pristine solid supports as textural property features in our final datasets. Moreover, utilising the textural properties of pristine solid supports as the input for AI model training is justifiable, as these properties are more fundamental than the textural properties of impregnated solid supports in laboratory-based research. The textural properties of commercially available porous solids may be provided by the vendors, and no experiment is required for such input parameters. When the organic content (x) of an adsorbent is reported as a percentage per gram of the solid support (\({x}_{{support}}\)), conversions are made using the following formula to obtain the percentage value per gram of adsorbent (\({x}_{{adsorbent}}\)). Notably, the mass of solid support (\({M}_{{support}}\)) cancels and can be disregarded (Formula 2).

$${x}_{{adsorbent}}=\frac{{x}_{{support}}\cdot {M}_{{support}}}{{M}_{{support}}(1+{x}_{{support}})}$$
(2)
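Formula 2 reduces to x_adsorbent = x_support / (1 + x_support) once the support mass cancels. A minimal sketch, with both quantities expressed as mass fractions rather than percentages (the function name is illustrative):

```python
def organic_content_per_adsorbent(x_support):
    """Convert organic loading per gram of support to per gram of adsorbent.

    x_support: grams of organics per gram of pristine support (as a fraction).
    Returns grams of organics per gram of adsorbent (support + organics).
    """
    return x_support / (1.0 + x_support)


# 1 g of amine on 1 g of support (100% per gram of support)
# corresponds to 50 wt% of the final adsorbent.
print(organic_content_per_adsorbent(1.0))
```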

Calculations of nitrogen content and amine efficiency

The chemical feature “Nitrogen Content” is often missing; however, this feature can be easily calculated when both the CO2 capacity and the amine efficiency are provided by the article (Formula 3).

$${Amine\; Efficiency}=\frac{{{CO}}_{2}\,{Capacity}}{N\,{Content}}$$
(3)

In some previous research, neither the N content nor the amine efficiency is provided; the N content can still be estimated based on the molecular structure of the amine. When a small amine is used for impregnation, the N content is determined by the number of N atoms (\({N}_{{small\; amine}}\)) in the molecule and the number of moles of amine per gram of adsorbent (Formula 4). The moles of amine per gram of adsorbent can be calculated from the mass of amine (\({Organic\; Content}\cdot {M}_{{adsorbent}}\)) over the molecular weight of the amine (MW).

$$N\,{content}=\frac{{N}_{{small\; amine}}\cdot {Organic\; Content}\cdot {M}_{{adsorbent}}}{{MW}\cdot {M}_{{adsorbent}}}$$
(4)

When a polyamine is used for impregnation, the degree of polymerization of the polymer (DP) is first determined as the number average molecular weight (\({M}_{n}\)) of the polyamine over the molecular weight of the repeating unit (\(\bar{m}\)) (Formula 5). The number of N atoms within a polyamine (\({N}_{{polyamine}}\)) can be calculated from the number of N atoms within the repeating unit (\({N}_{{repeating\; unit}}\)) and DP. The N content can then be estimated following a procedure similar to that for the small amine-impregnated solid adsorbent (Formula 6). Because \({M}_{n}\) cancels, the simplified formula shows that only the molecular weight (\(\bar{m}\)) and the number of N atoms of the repeating unit, together with the organic content, are needed to estimate the N content per gram of the adsorbent (Formula 7).

$${DP}=\frac{{M}_{n}}{\bar{m}}$$
(5)
$$N\,{content}=\frac{({DP}\cdot {N}_{{repeating\; unit}})\cdot {Organic\; Content}\cdot {M}_{{adsorbent}}}{{M}_{n}\cdot {M}_{{adsorbent}}}$$
(6)
$$N\,{content}=\frac{{N}_{{repeating\; unit}}\cdot {Organic\; Content}}{\bar{m}}$$
(7)
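The three routes to the N content (Formulas 3 to 7) can be sketched in Python. Substituting Formula 5 into Formula 6 cancels M_n, so the polyamine estimate reduces to N_repeating unit · Organic Content / m̄. The function names are hypothetical, the organic content is taken as a mass fraction, results are returned in mmol/g, and the example values for TEPA (C8H23N5, 189.30 g/mol, five N atoms) and the PEI repeating unit (C2H5N, 43.07 g/mol, one N atom) are standard molar masses rather than values from the dataset.

```python
def n_content_from_efficiency(co2_capacity, amine_efficiency):
    """Formula 3 rearranged: N content = CO2 capacity / amine efficiency."""
    return co2_capacity / amine_efficiency


def n_content_small_amine(n_atoms, organic_content, mw):
    """Formula 4 with the adsorbent mass cancelled, in mmol N per g adsorbent."""
    return n_atoms * organic_content / mw * 1000.0


def n_content_polyamine_via_dp(n_per_repeat, organic_content, repeat_mw, mn):
    """Formulas 5 and 6: go through the degree of polymerization explicitly."""
    dp = mn / repeat_mw                        # Formula 5
    n_per_chain = dp * n_per_repeat            # N atoms per polymer chain
    return n_per_chain * organic_content / mn * 1000.0


def n_content_polyamine(n_per_repeat, organic_content, repeat_mw):
    """Simplified form: Mn cancels, only the repeating unit is needed."""
    return n_per_repeat * organic_content / repeat_mw * 1000.0


# 50 wt% TEPA and 50 wt% branched PEI (Mn = 800 g/mol used only for the check)
print(f"TEPA: {n_content_small_amine(5, 0.5, 189.30):.2f} mmol N/g")
print(f"PEI:  {n_content_polyamine(1, 0.5, 43.07):.2f} mmol N/g")
print(f"PEI via DP: {n_content_polyamine_via_dp(1, 0.5, 43.07, 800.0):.2f} mmol N/g")
```

The last two lines print the same value, which illustrates that the estimate does not depend on M_n when every repeating unit carries the same number of N atoms; end groups are ignored in this approximation.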

Estimation of additive to amine 1 ratio and OH to amine ratio

When the amine-impregnated solid adsorbent contains additives, the organic content involves both amines and additives. However, a few articles do not contain the organic content of the adsorbents. In one case, the article reported the percentage weight of amine 1 (x1) per gram of the adsorbent (\({M}_{{adsorbent}}\)) and the percentage of additive (\({x}_{2}\)) per gram of the support (\({M}_{{support}}\))61. The Organic Content (Formula 8) and the additive to amine 1 ratio (\({R}_{a}\)) can be calculated (Formula 9). Detailed calculations are provided in Table S4 (see supplementary xlsx file).

$${Organic\; Content}=1-\frac{1-{x}_{1}}{1+{x}_{2}}$$
(8)
$${R}_{a}=\frac{{x}_{2}}{{x}_{1}}\cdot \frac{1-{x}_{1}}{1+{x}_{2}}$$
(9)
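Formulas 8 and 9 can be verified against a simple mass balance. The sketch below assumes x1 is the amine mass fraction per gram of adsorbent and x2 the additive mass per gram of support, as in the case described above; the function names and the example masses are hypothetical.

```python
def organic_content_mixed_basis(x1, x2_support):
    """Formula 8: organic content per gram of adsorbent."""
    return 1.0 - (1.0 - x1) / (1.0 + x2_support)


def additive_to_amine_ratio(x1, x2_support):
    """Formula 9: mass ratio of additive to amine 1."""
    return (x2_support / x1) * (1.0 - x1) / (1.0 + x2_support)


# Hypothetical composition: 1.0 g support, 0.2 g additive, 0.8 g amine.
support, additive, amine = 1.0, 0.2, 0.8
total = support + additive + amine
x1 = amine / total            # 0.4 g amine per g adsorbent
x2 = additive / support       # 0.2 g additive per g support

print(organic_content_mixed_basis(x1, x2))   # compare with (additive + amine) / total
print(additive_to_amine_ratio(x1, x2))       # compare with additive / amine
```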

In a few cases, the OH to amine ratio (\({R}_{{OH}}\)) was not reported. \({R}_{{OH}}\) can be determined from the known percentage weight of amine 2 or additive 2 per gram of adsorbent (\({x}_{2}\)), the molecular weight of the additive (\({{MW}}_{{additive}}\)), the number of OH groups (\({N}_{{OH}}\)) per additive molecule and the N content of the adsorbent (Formula 11). In some instances, the percentage weight loading for amine 2 or additive 2 per gram of adsorbent (\({x}_{2}\)) was not reported. This can be estimated based on the organic content and the ratio of amine 2 or additive 2 to amine 1 (\({R}_{a}\)) as outlined in Formula 10. Notably, \({R}_{a}\) is provided in most articles. Detailed calculations are provided in Table S5 (see supplementary xlsx file).

$${x}_{2}=\frac{{R}_{a}\cdot {Organic\; Content}}{1+{R}_{a}}$$
(10)
$${R}_{{OH}}=\frac{{x}_{2}\cdot {N}_{{OH}}}{{{MW}}_{{additive}}}\cdot \frac{1}{N\,{content}}$$
(11)
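A short sketch of Formulas 10 and 11 follows. It assumes x2 is a mass fraction per gram of adsorbent and that the N content is supplied in mmol/g, so the OH amount is converted to mmol/g before taking the ratio. The function names are illustrative, and the glycerol-like additive (92.09 g/mol, three OH groups) and the N content of 9.0 mmol/g are example inputs only.

```python
def additive_fraction(r_a, organic_content):
    """Formula 10: mass fraction of additive (or amine 2) per gram of adsorbent."""
    return r_a * organic_content / (1.0 + r_a)


def oh_to_amine_ratio(x2, n_oh, mw_additive, n_content_mmol_per_g):
    """Formula 11: moles of OH groups per mole of N in the adsorbent."""
    oh_mmol_per_g = x2 * n_oh / mw_additive * 1000.0   # mmol OH per g adsorbent
    return oh_mmol_per_g / n_content_mmol_per_g


# Example: organic content 0.5 with an additive-to-amine mass ratio of 0.25
x2_example = additive_fraction(0.25, 0.5)
r_oh = oh_to_amine_ratio(x2_example, n_oh=3, mw_additive=92.09,
                         n_content_mmol_per_g=9.0)
print(f"x2 = {x2_example:.3f}, R_OH = {r_oh:.3f}")
```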

The estimation of average pore diameters

The average pore diameter (\({D}_{p}\)) of the pristine solid support was not reported in 15 out of the 52 cases of literature. Those missing values were estimated from the reported BET surface area (S) and the pore volume (\({V}_{p}\)) (Formula 12). The formula applies exclusively to porous supports with cylindrical pore geometries, such as SBA-15, MCM-41, and MCM-48. Three of the fifteen articles did not report average pore diameters for more complex porous supports, including commercial nano silica, mesoporous silica foam, and zeolites, leading to 47 missing data points.

$${D}_{p}=\frac{{4V}_{p}}{S}$$
(12)
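Formula 12 requires consistent units. The sketch below assumes the pore volume is given in cm3/g and the BET surface area in m2/g, which are common reporting units but are an assumption here; in that case the ratio is in micrometres and a factor of 1000 yields nanometres. The function name and the example values are hypothetical.

```python
def average_pore_diameter_nm(pore_volume_cm3_per_g, bet_area_m2_per_g):
    """Estimate the average pore diameter (nm) for cylindrical pores: D = 4V/S.

    1 cm3 / 1 m2 = 1e-6 m = 1000 nm, hence the conversion factor.
    """
    return 4.0 * pore_volume_cm3_per_g / bet_area_m2_per_g * 1000.0


# Example: a mesoporous silica with V = 1.0 cm3/g and S = 800 m2/g
print(f"{average_pore_diameter_nm(1.0, 800.0):.2f} nm")
```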

Calculation of adsorption from kinetic models

The adsorption isotherm is a valuable mathematical representation to describe the kinetics of each type of adsorbent. In the existing studies, exponential curves are typically used to represent the time to half CO2 saturation and the time to 90% CO2 saturation of the adsorbents. Thus, both can be easily extracted from the adsorption isotherm. When only graphical data are available and no numerical information is provided in the text, validating these graphical data becomes challenging. Therefore, by calculating some time data using kinetic models that exhibit a strong fit to the experimental results (R2 ≥ 0.98), we can assess the accuracy of the data extraction process by comparing the extracted graphical data to the calculated values. However, only 7 out of the 52 selected articles presented adsorption kinetic models with fully specified parameters that demonstrated a coefficient of determination (R2) from 0.970 to 0.999. The reported kinetic models from all seven articles are summarized in Table 3. Three out of the seven cases reported only kinetic parameters without the full adsorption isotherms62,63,64. The time to half saturation and the time to 90% CO2 saturation were calculated based on the Avrami model, the fractal-like pseudo-first-order model and the double-exponential model provided in those three studies. The remaining four out of the seven cases reported both the adsorption isotherms and kinetic models. In those four cases, the time parameters were derived from the pseudo-second-order model and the Avrami model and were then compared to the graphical data extracted from the adsorption isotherms for data validation. The parameter \({q}_{e}\) is the CO2 capacity at equilibrium and \({q}_{t}\) is the adsorbed amount at a given time. The adsorption time (t) can be calculated by rearranging most equations to solve for t or by using the fsolve function in Python’s SciPy package.
The detailed calculations are provided in Table S6 (see supplementary xlsx file).
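The time calculations can be sketched as follows. The model forms below are the commonly used textbook expressions for the Avrami, pseudo-second-order and double-exponential models and may differ in notation from those listed in Table 3; all parameter values are hypothetical. To keep the sketch free of dependencies, a simple bisection solver is swapped in for SciPy's fsolve for the model that cannot be rearranged analytically.

```python
import math


def avrami_time(fraction, k_a, n_a):
    """Avrami: q_t = q_e * (1 - exp(-(k_a * t)**n_a)), solved for t."""
    return (-math.log(1.0 - fraction)) ** (1.0 / n_a) / k_a


def pseudo_second_order_time(fraction, k2, q_e):
    """Pseudo-second-order: q_t = k2*q_e**2*t / (1 + k2*q_e*t), solved for t."""
    return fraction / ((1.0 - fraction) * k2 * q_e)


def solve_time(uptake, q_target, t_max, tol=1e-9):
    """Bisection for a monotonically increasing uptake curve on [0, t_max]."""
    lo, hi = 0.0, t_max
    if not (uptake(lo) <= q_target <= uptake(hi)):
        raise ValueError("q_target is not bracketed by [0, t_max]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if uptake(mid) < q_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Double-exponential model with hypothetical parameters (q_e = a1 + a2).
a1, k1, a2, k2 = 1.2, 0.30, 0.8, 0.02
q_e = a1 + a2


def double_exponential(t):
    return q_e - a1 * math.exp(-k1 * t) - a2 * math.exp(-k2 * t)


t_half = solve_time(double_exponential, 0.5 * q_e, t_max=1000.0)
t_90 = solve_time(double_exponential, 0.9 * q_e, t_max=1000.0)
print(f"double exponential: t_half = {t_half:.2f}, t_90 = {t_90:.2f}")
print(f"Avrami: t_half = {avrami_time(0.5, 0.1, 1.5):.2f}")
print(f"pseudo-second-order: t_90 = {pseudo_second_order_time(0.9, 0.05, 2.0):.2f}")
```

The time units follow those of the rate constants, so k values reported per minute yield times in minutes.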

Table 3 The kinetic models used to calculate the time to half saturation and the time to 90% saturation of the adsorbents.

Extraction of graphical data

The graphical data were extracted using WebPlotDigitizer, a widely used data extraction tool in previous research65. For example, Aditya et al. used WebPlotDigitizer to extract critical TGA trace temperatures corresponding to CoRE MOF 2019 structures from 1,886 articles66. Yuan et al. also used it for data collection on biomass-derived porous carbon adsorbents for CO235. In this dataset, the operational parameter features were extracted from approximately 450 figures, including scatter plots and bar charts for amine efficiencies and CO2 capacity, bar charts for cyclic stabilities, and exponential curves for kinetic studies. The most distant data points on the two axes were selected to calibrate each plot for the highest possible accuracy. The uptake values corresponding to half saturation and 90% saturation, derived from the pseudo-equilibrium CO2 uptake capacity, were located on the y-axis to determine the corresponding times on the x-axis of each plot.
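Reading a saturation time off a digitized uptake curve can be sketched with linear interpolation between neighbouring points. This is an illustrative helper, not the procedure used in WebPlotDigitizer itself; the function name and the sample points are hypothetical, and the pseudo-equilibrium uptake is taken as the largest digitized value unless supplied.

```python
def time_to_fraction(times, uptakes, fraction, q_pe=None):
    """Interpolate the time at which uptake first reaches fraction * q_pe.

    times, uptakes: digitized (x, y) points sorted by increasing time.
    q_pe: pseudo-equilibrium uptake; defaults to the maximum digitized uptake.
    """
    if q_pe is None:
        q_pe = max(uptakes)
    target = fraction * q_pe
    points = list(zip(times, uptakes))
    for (t0, q0), (t1, q1) in zip(points, points[1:]):
        if q0 <= target <= q1:
            if q1 == q0:
                return t0
            return t0 + (target - q0) * (t1 - t0) / (q1 - q0)
    raise ValueError("target uptake is not reached within the digitized range")


# Hypothetical digitized curve: time in min, uptake in mmol/g
t_points = [0.0, 10.0, 20.0, 40.0, 80.0]
q_points = [0.0, 0.8, 1.2, 1.5, 1.6]
print(time_to_fraction(t_points, q_points, 0.5))
print(time_to_fraction(t_points, q_points, 0.9))
```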

Experimental measurement of number averaged molecular weights

Molecular weight is correlated with the viscosity of the polymer and is a straightforward input that can be easily used and understood by scientists and engineers across all chemistry disciplines. For polyamines, the reported molecular weight can be either the weight-average molecular weight (Mw) or the number-average molecular weight (Mn). Most studies report the value provided by the vendors. In some cases, when polyamines are synthesized in a specific study, Mn is reported. Those reported Mn values were recorded in the dataset directly. The Mn value is required to obtain the degree of polymerization for amine content estimations. However, some articles only provided the Mw of the polyamines as reported by vendors, with the Mn values missing. The polyamines referenced in these studies include branched PEI with molecular weights of 1200, 1800, and 10,000 Da purchased from Alfa Aesar, as well as branched PEI with a molecular weight of 600 Da from Adam and Aladin (Fig. 4). To avoid bias in the dataset, the Mn values of these polyamines were measured using gel permeation chromatography (GPC) in this study. The GPC analysis of the branched polyamines was performed on a Waters 1515 system equipped with a refractive index detector (RI-2414), a PL MIXED 7.5 × 50 mm guard column and a PL MIXED-C 7.5 × 300 mm column. Each polyamine sample (8 mg) was dissolved in CHCl3 (4 mL) and filtered through a 0.22 μm filter membrane into a 1.5 mL vial. The eluent was CHCl3 at 35 °C, with a flow rate of 1 mL min−1. The system was calibrated with a narrow polystyrene standard.

Data Records

Data storage and data source

The full dataset on amine-impregnated solid adsorbents for CO2 capture is accessible as an XLSX file at Figshare, a public online repository67. This dataset contains 1,336 data points (number of rows within the Excel sheet) gathered from 52 published articles and their supporting information, which results in a total of 29,857 records (number of cells with available data)41,43,54,55,59,61,62,63,64,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110. There are 28 features (number of columns within the Excel sheet) to describe each data point; however, these features are not equally reported. Some operational parameters and DAC performance indicators, such as weight loss stability and heat of adsorption, are often missing. The column names and the related information for each column are summarised in Table 4.

Table 4 Descriptions of the dataset of amine-impregnated solid adsorbents containing 28 features and DOI of the data sources.

These 28 features were mainly recorded from texts, extracted from figures or derived from equations. Approximately 22% of the Mn data were obtained from GPC measurements, and less than 5% were obtained from literature interpretation or estimation. The feature N content is obtained from both literature and calculations because of the different reporting methods used in the literature (Fig. 3). Data obtained from text and figures have the highest accuracy, followed by data obtained from formulas.

Fig. 3

The data source distribution for selected chemical features. The donut charts suggest that most of the data are sourced from the article texts or extracted from figures. A relatively high percentage of the N content and the OH to N ratio data are derived from formulas.

The data sources and data processing methods used to construct the database are outlined in Fig. 4. Some chemical features are contextual and inconsistently reported, but they can be added from PubChem, derived from molecular structures, or identified from spectroscopic data. For example, some primary, secondary, and tertiary amine ratios were sourced from peer-reviewed literature111. The Mn of LPEI (Mw = 2500 g mol−1)112 and LPEI (Mw = 25000 g mol−1)113 were also sourced from existing literature. The Mn of LPEI (Mw = 600 g mol−1) was estimated based on the reported polydispersity (PDI = 1.3) of LPEI. In addition, the Mn of polypropylene guanidine (PPG) was derived from the ion peak in mass spectrometry87. In other cases, the Mn of the pristine polyamine is known, but the Mn of the product polymer after the epoxidation or guanidinylation reaction is not reported. The Mn value of the modified molecules is therefore estimated from the provided NMR analysis and the number of functional groups added to the pristine polyamine86,93. Even though the process of obtaining these contextual data is not standardized, such exceptional cases are rare in the dataset. We present all the derived numerical data of these exceptional cases with their corresponding data processing methods for reference in Fig. 4.
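The polydispersity-based estimate follows from the definition PDI = Mw / Mn. A minimal sketch of this conversion is given below; the function name is hypothetical, and the resulting value is shown only to illustrate the arithmetic, so it should not be read as the exact value recorded in the dataset.

```python
def number_average_mw(mw, pdi):
    """Estimate Mn from Mw using the definition PDI = Mw / Mn."""
    if pdi < 1.0:
        raise ValueError("PDI cannot be smaller than 1")
    return mw / pdi


# LPEI with Mw = 600 g/mol and a reported PDI of 1.3.
mn_lpei_600 = number_average_mw(600.0, 1.3)  # about 462 g/mol
```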

Fig. 4

The scheme for the data collection and data derivation process, with selected validation methods and a presentation of the exceptional cases.

Information on chemical features and abbreviated chemical names

Most of the features in the dataset are numeric data with only four strings to describe the various types of porous solid supports, amines and additives. A classification of the porous solid used for amine impregnations is provided to explain the strings listed under the feature “support” (Table 5).

Table 5 The categorization and explanation on the strings of porous solid supports.

The 31 types of the porous solid supports were classified into 7 categories. Among all the types of solid supports, SBA-15 and MCM-41 showed the two highest frequencies in the dataset, as they are the standard solid supports for DAC research (Fig. 5). The hierarchical silica supports and polyresins have been developed more recently.

Fig. 5

The distribution of types of porous solid supports within the dataset. The inner ring shows the 7 categories of solid support, while the outer ring shows all types of solid supports with abbreviated names.

There are 31 types of amines (Fig. 6) and 20 types of additives included in the dataset (Fig. 7). The structure information of these molecules is provided for future references.

Fig. 6

The molecular structure of amine 1 in the dataset. The primary amines are highlighted with orange and the hydroxy groups are highlighted with blue.

Fig. 7

The molecular structure of the additives (additive 1 to additive 3) in the dataset. The primary amines are highlighted in orange, secondary amines are highlighted in yellow and hydroxy groups are highlighted in blue.

As indicated by the donut charts and box plots, PEI and TEPA are the most extensively studied amines for CO2 capture and exhibit wide ranges of CO2 capacity (Fig. 8a). Apart from the molecular structure, the variation in the molecular weight of polyamines should also be considered, as specific studies were carried out to see how the Mn of BPEI influences the performance of amine-impregnated solid adsorbents. BPEI with an Mn of 600 g/mol from Sigma Aldrich is a commonly used polyamine for CO2 capture under various conditions, as indicated by the donut chart (Fig. 8b). So far, only PEI and TEPA have been used separately with additives. There is one instance where TEPA is impregnated with two types of additives onto the solid support, so the additive “CTAC” is present in a very small portion under the feature “Amine 3 and Additive 3” (Fig. 8c). These box plots are used to describe the distribution of the dataset rather than to analyse the adsorbent properties, since many of the operational parameters are not considered.

Fig. 8

The donut charts and box plots present the distributions of amines and additives used for the impregnation of porous solid supports in the dataset and their corresponding CO2 capacities. (a) The distribution of amine types (donut chart) and the distribution of CO2 capacity data for each amine type (box plot). (b) The distributions of Mn values for BPEI (donut chart) and the distribution of CO2 capacity data for each Mn value (box plot). (c) The percentage for different types of additive 2 and additive 3 (donut chart) and the distribution of CO2 capacity and amine efficiency for all combinations of amines and additives within the dataset (box plot).

Technical Validation

The data collection process was conducted by two lab chemists specializing in DAC research. Consequently, the dataset primarily focuses on the structure and composition of amine-functionalized adsorbents, as well as the CO2 capture performance of powder adsorbents at a milligram scale. Interaction with GPT-4 was employed to assess the completeness of the data source. Additionally, the data derivation process helped identify human errors and typos in the Excel file. The data used for validation, along with a draft dataset from the data collection process, are presented in Tables S7 and S8.

Validation of the extracted and derived data

This section presents the percentage errors (1.3% to 20.8%) calculated by comparing the extracted or derived data with the data provided in the text. The percentage error reflects the inaccuracy of each data processing method and indicates the potential error in other data obtained by the same extraction or derivation method. A considerable number of data points originate from figures within the articles (Fig. 3). However, a substantial portion of these data was presented solely in graphical form and therefore cannot be verified. In addition, data from dot plots tend to be more easily extracted than data from line graphs and bar charts, so the data points from different types of figures are validated separately. The instances where data are presented in both graphical and numerical forms were compiled to calculate the errors associated with the data extraction processes (Table 6). The dot plot is primarily used for presenting CO2 capacity and amine efficiency data and is used less frequently for stability data.
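The comparison described above can be sketched as follows. The exact error definition used for Table 6 is not restated in this section, so the sketch assumes the usual absolute percentage error relative to the value reported in the text; the function names and the paired values are hypothetical.

```python
def percentage_error(extracted, reported):
    """Absolute percentage error of an extracted value relative to the reported value."""
    return abs(extracted - reported) / abs(reported) * 100.0


def mean_percentage_error(pairs):
    """Average percentage error over (extracted, reported) pairs."""
    errors = [percentage_error(extracted, reported) for extracted, reported in pairs]
    return sum(errors) / len(errors)


# Hypothetical CO2 capacities (mmol/g): value read from a figure vs. value stated in the text.
dot_plot_pairs = [(2.05, 2.00), (1.47, 1.50), (0.99, 1.00)]
average_error = mean_percentage_error(dot_plot_pairs)
```

Grouping the pairs by figure type (dot plot, line graph, bar chart) before averaging gives one error estimate per extraction route, which mirrors the separate validation described in the text.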

Table 6 The average percentage errors for the comparisons between the extracted graphical data or derived data with the data reported in texts.

The time to half saturation and the time to reach 90% saturation extracted from the exponential curves can be validated by comparing them with the values calculated from the adsorption kinetic models (Table 3). Notably, the kinetic model can only be used for validation when it fits the experimentally measured adsorption isotherm strongly. Although a coefficient of determination (R²) above 0.99 indicates an overall strong fit, it only reflects the average precision of the kinetic model in characterizing the experimental adsorption process, so discrepancies at specific data points remain possible. Nonetheless, the comparison between the data extracted from exponential curves and the data calculated from kinetic models still provides an indication of the reliability and chemical significance of the extracted data. The extracted graphical data and the data calculated from kinetic models from 4 of the 52 articles were used to determine the percentage error associated with data extraction from the line graphs (Table 6)74,77,104,106.

The N content is primarily derived based on the theoretical moles of N atoms from each of the polyamines and small amines. Since the molecular structures of these molecules are known and most of the organic content data are provided, the N content of each data point can be easily obtained with reasonable accuracy (Table 6). The estimated value was validated by randomly selecting a dataset with reported N content or N content calculated from Formula 3 and comparing it with the N content derived from the molecular structure of amines within the randomly selected dataset (Formula 4 or Formula 7).

During the data collection, we noticed that several articles did not report the OH to N ratio when OH groups are present, although the presence of OH groups is known to have a positive impact on the amine efficiency, adsorption kinetics and optimal desorption temperature of adsorbents75. Therefore, the OH to N ratio is derived from the information in the original articles. As a drawback, the number of articles on mixed amines and additives is limited. Only three cases reported the OH to N ratio, so the reliability and accuracy of the derived OH to N ratio are validated with a small dataset (Table 6)75,76,81. The OH to N ratios within those studies were calculated using Formula 10 and Formula 11 and then compared with the reported data. Even though the validation sample is small, the absolute and percentage errors obtained still support the correctness and reliability of the strategy used to obtain the OH to N ratio.

The average pore diameters for pristine supports were estimated from 12 articles using Formula 12. While the formula is applicable to solid supports with assumed cylindrical pore geometry, the mathematical relationship cannot accurately represent the exact geometries of silica supports from various sources and preparation conditions. To assess the associated percentage error, we identified 7 articles that reported average pore diameters based on assumed cylindrical shapes. The average pore diameters from these articles were calculated using Formula 12 and compared to the reported values, resulting in an average percentage error of 20.8% (Table 6). Although this percentage error is relatively high, the estimated pore sizes remain within an acceptable range for both commercial and lab-made SBA-15, MCM-41, and MCM-48.
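Formula 12 is not reproduced in this section. Assuming it takes the standard cylindrical-pore form d = 4V/S, where V is the pore volume and S is the surface area, the estimate and its comparison with a reported value can be sketched as follows. The function name and the numerical inputs are hypothetical.

```python
def cylindrical_pore_diameter_nm(pore_volume_cm3_g, surface_area_m2_g):
    """Average pore diameter d = 4V/S for cylindrical pores.

    With V in cm3/g and S in m2/g, the factor 4000 returns d in nm
    (1 cm3/m2 = 1e-6 m = 1000 nm).
    """
    return 4000.0 * pore_volume_cm3_g / surface_area_m2_g


# Hypothetical pristine support: V = 1.0 cm3/g, S = 800 m2/g, reported diameter 6.0 nm.
estimated = cylindrical_pore_diameter_nm(1.0, 800.0)
reported = 6.0
error_percent = abs(estimated - reported) / reported * 100.0
```

Deviations of this size are expected whenever the reported diameter comes from a pore size distribution method rather than from the geometric relation, which is consistent with the relatively high average error noted above.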

Overall, the percentage errors suggest a reasonable accuracy for the methods used to obtain the final dataset. The errors for kinetic data are probably due to the inaccuracies of the kinetic models under different operational parameters. These inaccuracies lead to larger deviations from the plotted measurement data and, consequently, to higher percentage errors when the model data are compared with the data extracted from the plot. When there is a noticeable difference between the experimental data extracted from the plot and the model data, the extracted experimental data are retained within the dataset. In addition, most of the stability data (percentage loss in adsorbent weight and percentage loss in CO2 capacity) were obtained from bar graphs. There are limited data from the text to validate the accuracy of the numbers extracted from the bar graphs; however, many of the stability data were reported as rounded numbers or as “no significant reduction in the CO2 capacity of the adsorbent.” This suggests that inaccuracies in the data extraction process have less influence on the reported stability of the adsorbent.

Validation of the dataset via statistical analysis

The statistical analysis not only demonstrates the general information and relationships within the dataset but also identifies outliers and human errors. The data cleaning process and statistical analysis were carried out with Python and generally involved the following steps: 1) Check for empty cells within the dataset. 2) Check the number of distinct data points and their frequency in each feature. 3) Check for data with a frequency lower than 10 in each feature. 4) Check for the expected match between data points and features. In this way, the outliers can be identified and corrected manually. The statistical analysis was conducted by three data scientists to check the technical reusability of the data. The data cleaning process was also checked several times before the final analysis.
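The four checks listed above can be sketched with the Python standard library as shown below. This is an illustrative sketch and not the code deposited on Figshare: the function names, column names and example rows are hypothetical, and the same checks can equally be written with a data-frame library.

```python
from collections import Counter

RARE_THRESHOLD = 10  # values seen fewer than 10 times are flagged for manual review


def is_empty(value):
    """Treat None and empty strings as empty cells."""
    return value is None or value == ""


def count_empty(rows, feature):
    """Step 1: number of empty cells in a feature."""
    return sum(1 for row in rows if is_empty(row.get(feature)))


def value_frequencies(rows, feature):
    """Step 2: distinct values of a feature and how often each occurs."""
    return Counter(row[feature] for row in rows if not is_empty(row.get(feature)))


def rare_values(rows, feature, threshold=RARE_THRESHOLD):
    """Step 3: values occurring fewer than `threshold` times (possible typos)."""
    return {v: n for v, n in value_frequencies(rows, feature).items() if n < threshold}


def mismatched_rows(rows, feature, is_valid):
    """Step 4: row indices whose value does not match what the feature expects."""
    return [i for i, row in enumerate(rows)
            if not is_empty(row.get(feature)) and not is_valid(row[feature])]


# Hypothetical rows for illustration only.
rows = (
    [{"support": "SBA-15", "co2_capacity": 2.1}] * 12
    + [{"support": "SBA15", "co2_capacity": 1.8}]      # likely a typo of "SBA-15"
    + [{"support": "MCM-41", "co2_capacity": None}] * 11
    + [{"support": "MCM-41", "co2_capacity": -0.5}]    # physically implausible value
)

empty_capacity = count_empty(rows, "co2_capacity")
support_counts = value_frequencies(rows, "support")
rare_supports = rare_values(rows, "support")
bad_capacity_rows = mismatched_rows(
    rows, "co2_capacity", lambda v: isinstance(v, (int, float)) and v >= 0
)
```

The flagged values are candidates for manual inspection only; a rare value may be a genuine but seldom-studied material rather than an error.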

This dataset includes the DAC performance of amine-functionalised solid adsorbents for the first time, and this performance is expected to be distinct from that of high-concentration CO2 capture. Three defined CO2 concentration ranges were plotted against each of the six DAC performance indicators in violin plots (Fig. 9a). The extreme values indicated by the box plots and violin plots were checked manually. Notably, the recently developed amine-functionalised solid adsorbents reach a maximum CO2 capacity of 5.8 mmol CO2/g under DAC conditions (0.04 vol% of CO2), which is comparable to the CO2 capacity under flue gas conditions (15 to 25 vol% of CO2). The distribution and range of CO2 capacity are generally in agreement with those of the amine efficiency, suggesting the reliability of the database. The violin plot of CO2 concentration against the time to half saturation showed an extreme maximum of 770 min at 75% to 100% volume of CO2. In general, the maximum value of the half time should be around one third of the time to 90% saturation. This value was checked, but the original study did not provide the time to 90% saturation of the adsorbent. The violin plot of CO2 concentration against adsorbent stability indicates that the weight loss stability (%) of adsorbents under flue gas conditions has not been reported thus far. CO2 capacity loss stability (%) is more frequently reported, and the adsorbents under DAC conditions showed a smaller percentage loss than those under other conditions. This result may also be influenced by temperature and the number of adsorption-desorption cycles in the actual experiments.

Fig. 9

Violin plots of selected adsorbent characteristics and operational parameters (CO2 concentrations) against adsorbent performance indicators. (a) The ranges of CO2 concentration against six features of performance indicators (CO2 capacity, amine efficiency, time to half saturation, time to 90% saturation, weight loss stability (%), capacity loss stability (%)). (b) The fraction of primary, secondary and tertiary amines within adsorbents against amine efficiency. (c) The textural properties of pristine solid supports against CO2 capacity.

Following the previous analysis of the private database of amine-functionalised SBA-15 and MCM-41 by Yıldız et al., the CO2 capacity was plotted against the percentage quantity of primary, secondary and tertiary amines within the solid adsorbents (Fig. 9b)37. Based on the given information, the private dataset contained different ranges of data and data sourced from different time periods. However, the box plots from the previous database showed maximum values that are similar to those of the violin plots. Both plots suggest that the presence of primary amines contributes to high amine efficiencies. The violin plot also indicated that secondary amines contribute to the overall amine efficiency to an extent comparable to primary amines. This statistical result may be influenced by the instrumental method and the actual chemical qualities in use. Nonetheless, this comparison has validated the reproducibility of these types of datasets and demonstrated their potential for the analysis of amine-functionalised solid adsorbents. The violin plots also illustrate the data distribution of the textural properties of solid supports against CO2 capacity (Fig. 9c). They suggest that solid supports with a pore volume between 3 and 6 cm3/g and a pore size larger than 20 nm have a positive effect on the CO2 capacity of the adsorbents.

A correlation heatmap was generated to illustrate the relationships among the numerical features within the dataset (Fig. 10a). The heatmap reveals the direction and relative strength of correlations between pairs of numerical features; however, no strong correlations with R² values exceeding 0.99 are observed. Generally, some features with known correlations exhibit R² values above 0.70, such as organic content and CO2 capacity, as well as CO2 capacity and amine efficiency. In contrast, some expected correlations, such as those between CO2 capacity and humidity, are not evident. The weak correlations among the selected numerical features may be influenced by non-linearities and multicollinearity among the parameters, as the heatmap only indicates linear correlation strengths between pairs of variables. Based on the strongest observed correlation, represented by an R² value of 0.79, a scatter density diagram was created to plot amine efficiency against CO2 capacity for different CO2 concentrations (Fig. 10b). The dot plot data tend to exhibit a linear relationship, with some exceptions.
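A pairwise linear correlation analysis of this kind can be sketched as follows. The sketch assumes that the heatmap is based on the Pearson coefficient r, whose sign gives the direction of the correlation and whose square gives R² for a simple linear relationship; the column names and values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of Pearson r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical complete-case columns for illustration only.
columns = {
    "co2_capacity": [0.5, 1.0, 1.6, 2.1, 2.4],
    "amine_efficiency": [0.05, 0.11, 0.15, 0.22, 0.23],
    "humidity": [30.0, 10.0, 50.0, 20.0, 40.0],
}

matrix = correlation_matrix(columns)
r = matrix["co2_capacity"]["amine_efficiency"]
r_squared = r ** 2  # strength of the linear relationship, without its direction
```

Because many features in the dataset are sparsely reported, each pair has to be evaluated on the rows where both values are present, so different cells of the heatmap may rest on very different sample sizes.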

Fig. 10

Correlation heatmap of features within the dataset. (a) The heatmap illustrates the general strength of linear correlations among each numerical feature in the dataset. (b) The scatter density plot of amine efficiency against CO2 capacity.

The interaction of CO2 with amine-functionalised adsorbents is affected by multiple chemical and physical processes, including adsorption kinetics, thermodynamics, and specific interactions between CO2 and functional groups. The strength of correlations between complex chemical properties may not be accurately observed because many factors are intercorrelated, which complicates the determination of the independent effects of individual variables. More complex correlations should be further explored through modeling studies or explainable machine learning combined with lab-based investigations in future research114. For example, Serna-Guerrero et al. examined how desorption pressure (PD), desorption temperature (TD), purge gas flow (FD), and their respective interactions affected the performance of the adsorbent through a statistical analysis using a 2³ factorial design. The statistical model enables an objective assessment of the impact of each parameter and allows for the analysis of interactions that may not be apparent from experimental observations.

Furthermore, the weaker correlation strength may be influenced by limitations within the dataset, such as its size and diversity. For instance, the correlation between weight loss percentage and pore size of the pristine solid supports appears relatively strong, which is rarely observed in experimental studies79. With only 53 data points for weight loss capacity, the dataset may not adequately capture the full range of variability, leading to potentially misleading correlations. Additionally, since this dataset was manually compiled from various research works, inaccuracies in the measurement or reporting of parameters can introduce noise, obscuring true relationships. The variables may also be influenced by unobserved factors or interactions not included in the heatmap analysis. These considerations highlight the importance of dataset quality and comprehensiveness for effectively applying AI technology in material science.

Comparisons with a private dataset

There is only one other manually curated amine-functionalized solid adsorbent database containing reported experimental values, as introduced in the 2019 research by Yıldız et al.37. Since that dataset is private, it is not possible to make a statistical comparison with the dataset from this work. However, a contextual comparison shows how the database from this study can complement the existing one (Table 7). The major difference between the two datasets is that Yıldız et al.’s dataset contained amine-functionalised solid adsorbents made by both impregnation (Class I) and grafting (Class II) methods. Our dataset is an expansion with a specific focus on adsorbents made by the wet impregnation method, which generally show better performance under DAC conditions14. The other major difference is in the types of amines and supports involved. Yıldız et al.’s dataset only contained two solid supports, SBA-15 and MCM-41, which allows efficient classification and comparison of the preparation methods of amine-functionalised solid adsorbents. Our dataset incorporates diverse types of solid supports, amines and additives, which gives it more potential for prediction and for helping scientists select ideas. Finally, our dataset focuses on the performance of amine-impregnated solid adsorbents from 0.0004 bar to 1 bar of CO2 partial pressure, enabling the exploration and evaluation of adsorbents for negative-emission DAC technology.

Table 7 The general comparison of our database with the database created by Yıldız et al.

Usage Notes

Incorporation into machine learning models

To present a transparent process for the usage of this dataset and to demonstrate its usability, we trained a Random Forest (RF) model and an Extreme Gradient Boosting (XGBoost, XGB) model using this dataset. The code for the data cleaning process and the training of the ML models is uploaded on Figshare. Following common practice34,35,36, the input includes the chemical features of amines, the textural properties of the pristine support and the operational parameters. The CO2 capacity was used as an example output. We constructed a 10-fold cross-validation experiment for each ML model. The root mean squared error (RMSE) values of RF and XGB are 0.42 mmol/g and 0.53 mmol/g, respectively. These values are comparable with the previously reported RMSE value of 0.68 for an XGB model trained on a database of biomass waste-derived porous carbon for CO2 capture35. Moreover, the relative root mean squared error (RRMSE) values for RF and XGB are both 0.02, indicating that accurate machine learning models can be trained using our dataset.
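The 10-fold cross-validation protocol can be sketched independently of the model as shown below. This is not the code deposited on Figshare: a mean-value baseline stands in for the RF and XGBoost regressors so that the sketch has no external dependencies, the data are hypothetical, and averaging the RMSE over folds is an assumption about how the reported values were aggregated.

```python
import math
import random


def k_fold_indices(n_samples, n_splits=10, seed=0):
    """Shuffle the sample indices and deal them into n_splits disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validated_rmse(X, y, fit, predict, n_splits=10, seed=0):
    """Average RMSE over the held-out folds of a k-fold cross-validation."""
    scores = []
    for test_idx in k_fold_indices(len(y), n_splits, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        scores.append(rmse([y[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)


def fit_mean(X_train, y_train):
    """Placeholder model: remember the mean of the training targets."""
    return sum(y_train) / len(y_train)


def predict_mean(model, X_test):
    """Placeholder prediction: return the stored mean for every test sample."""
    return [model for _ in X_test]


# Hypothetical toy data; in practice X holds the input features and y the CO2 capacity.
X = [[float(i)] for i in range(50)]
y = [0.04 * i for i in range(50)]
folds = k_fold_indices(len(y))
baseline_rmse = cross_validated_rmse(X, y, fit_mean, predict_mean)
```

Any regressor can be substituted by passing its own fit and predict functions, and the baseline RMSE gives a reference point against which the reported RF and XGB errors can be judged.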

Users are encouraged to explore initial ideas for the engineering of any amine-impregnated solid adsorbent using the trained ML models. The input data can be easily entered into a model for a quick estimate of adsorbent performance under conditions ranging from DAC to pure CO2. Apart from the provided machine learning models, the dataset also has statistical value and provides other research opportunities. In addition, the dataset can be expanded and upgraded with experimental results or data generation technologies in the future.

Limitations and outlook

The dataset provided in this study presents several potential limitations that should be kept in mind for future research. Firstly, there is potential bias from the data collection methods, as this database is derived from previous literature. Consequently, some selected chemical features may not fully capture the complexities of real experimental settings, particularly as certain features related to instrument configurations have been excluded. Furthermore, many samples are subject to experimental and measurement errors. Notably, standard errors associated with the measurement of DAC performance indicators are inconsistently reported and are not accounted for within the dataset. The errors inherent in our dataset may be comparable to those encountered under actual experimental conditions. Given both the limited data size and the limited chemical features, accurately predicting specific values from this dataset poses significant challenges. Thus, it is imperative to explore how each feature within the dataset may be influenced by experimental measurements and to develop methodologies that align literature-based datasets with real-world conditions. The application of AI technologies for information extraction and of simulation techniques for data generation would further enrich the dataset, avoid human errors and bridge the gap between lab researchers and data scientists.

The diversity of chemical descriptors significantly affects the accuracy and capacity of artificial intelligence (AI) models40. While this dataset facilitates the exploration of time-saving ideas in laboratory settings, the selected features may limit the discovery of adsorbents with novel textural properties or a wider range of polyamines. As a result, models trained on this dataset may be restricted in generative AI applications, relying primarily on the patterns present in the data. In addition, the representation of solid supports and amines could extend beyond the current textual formats to include Simplified Molecular Input Line Entry System (SMILES) notation or other novel descriptors, promoting innovation and enhancing understanding of amine-functionalized adsorbents.

The selected chemical features may not fully capture the adsorption phenomena of amine-impregnated adsorbents, overlooking key thermodynamic, kinetic, and economic parameters. Including these factors would facilitate a more comprehensive analysis of the adsorbents’ performance and viability in practical applications115. The absence of energy-related data limits the dataset’s utility for simulation studies and the engineering design of DAC adsorbents. Consequently, the trained machine learning models may not be suitable for selecting scalable adsorbents for DAC applications. Given the complexity of CO2 adsorption across various scales, the chosen chemical features also affect the results of correlation studies aimed at elucidating the mechanisms associated with amine-impregnated adsorbents.

Despite these limitations, this dataset highlights the potential for integrating machine learning with experimental research data, underscoring the necessity for uniform data reporting and the significance of small datasets. Interdisciplinary research efforts are essential for effectively applying AI technologies to the development of amine-functionalized adsorbents.