Abstract
Lignin-carbohydrate complexes (LCCs) are bioproducts with high potential as alternatives for petrochemicals. However, the complex structure and the lack of protocols for high-yield production limit their usage. Herein, we present data collected from a comprehensive artificial intelligence (AI)-guided optimization of the AquaSolv Omni (AqSO) biorefinery process targeting high-yield production of LCCs. The resulting database, termed SP-LCC, includes structural information extracted from nuclear magnetic resonance measurements (NMR) and data on the molar mass distribution, antioxidant activity, glass transition temperature, thermal degradation, and surface tension. In total, we collected data for 95 LCC-containing samples isolated for different AqSO process conditions. SP-LCC provides a holistic dataset for LCC development, materials understanding, and exploiting the LCC valorization potential. Furthermore, SP-LCC provides valuable data for training machine learning models for further optimization of biorefineries outside the scope of AqSO.
Background & Summary
Lignin is the most abundant aromatic biopolymer and is an attractive candidate for replacing fossil-based feedstocks1,2,3,4. In traditional biorefineries, lignin has mainly been regarded as a low-value by-product, because its complex structure poses challenges to valorization, and has not been used to its full potential5,6,7,8,9. For example, Kraft lignin has been burnt primarily as a low-value energy source10. Lignin largely consists of the three monolignols p-coumaryl (H), coniferyl (G), and sinapyl (S) alcohol, which are randomly arranged and crosslinked within the structure1,11,12. This randomness is one of the main factors contributing to the structural complexity of lignin, which arises chiefly due to the nonenzymatic nature of the last step of the lignin biosynthesis1. In addition, the structural features of lignin are heavily influenced by the raw material, type of pretreatment, isolation method, fractionation method, and process conditions2,4.
One of the primary difficulties in isolating lignin from other biomass components is separating lignin and carbohydrates, largely due to the presence of lignin-carbohydrate complexes (LCCs)13,14,15,16. In LCCs, lignin and carbohydrates – long or short chain – are chemically bonded mainly through phenyl glycoside, glucuronic ester and benzylic ether linkages (Fig. 2)14,16,17,18,19,20,21,22. These linkages make LCCs amphiphilic, which gives rise to promising properties for, e.g., biomedical applications and surfactants in oil-in-water emulsions13,23,24,25,26,27,28,29,30,31,32,33. Although LCCs were initially seen as a significant barrier to the valorization of lignin, they are promising for high-value applications. The amphiphilic characteristic of LCCs provides them with good biocompatibility and biological activities such as immunopotentiation, antiviral, and antioxidant activity13,23,24,25,26,27,28,29. Additionally, the amphiphilic nature of LCCs enables them to stabilize emulsions and enhance their compatibility with various materials, making them appealing as emulsifiers30,31,32,33. A major limitation in the valorization of LCCs is that current extraction methods are often complex, time-consuming, and result in low yields18,34. To address the need for more efficient protocols, we recently showed that the AquaSolv Omni (AqSO) biorefinery can be optimized to provide a scalable and high-yield route to LCC extraction. AqSO is a green and flexible biorefinery process based on hydrothermal treatment followed by solvent extraction, described by Tarasov et al.17 and outlined in Fig. 1a for reference.
Fig. 2 2D HSQC NMR spectra of acetone extracted lignin. The cross-peaks of the main areas of interest were determined according to the corresponding resonances. Each area was colored distinctively to represent the moieties of interest in the oxygenated aliphatic region (a), the occurrence of the LCC linkages (b) and aromatic region (c,d).
Machine learning (ML) played a crucial role in the AqSO optimization process, helping to achieve high LCC yields and customize physicochemical properties. ML techniques have been explored to address a wide range of challenges in bio-based materials science35,36,37,38,39. For process optimization, which typically requires analyzing a large number of samples40,41,42,43, Bayesian optimization (BO) has emerged as an alternative to traditional experimental design methods. In BO, an ML model collaborates with a data collection strategy to determine the processing conditions for new sample isolation. These decisions aim to meet specific objectives, such as maximizing yield with minimal sample use.
As part of our previous work, we used BO-guided data collection to curate a set of 90 LCC-containing samples44. In this paper, we add 5 additional samples and compile comprehensive measurement data (5 physicochemical properties and NMR spectra) for all 95 LCC samples. To allow for the study of structure-property relationships for multiple properties, we favored a detailed characterization of each sample over sheer number of samples. The resulting Structure-Property LCC (SP-LCC) dataset is bolstered with an extensive technical validation and organized for easy access by the community. For the samples included in SP-LCC, key structural moieties characterized by 2D nuclear magnetic resonance (NMR) spectroscopy and selected physicochemical properties are provided. The measured properties are molar mass distribution, antioxidant activity, glass transition temperature, thermal degradation, and surface tension. We selected these properties because they are basic materials parameters that are also important for certain applications of LCCs, e.g., biomedicine23,24, emulsion stabilizers31,33 and fillers in thermoplastic formulations45. To our knowledge, this is the first time a dataset of this scale has been published on lignin that allows for a comprehensive understanding of the behavior of lignin and LCCs. Furthermore, it provides insight into the structural details and property variations of LCCs based on the isolation process conditions. The significance of SP-LCC extends beyond the AqSO biorefinery because: (1) the correlation between structure and properties holds true regardless of the biorefinery process; (2) general conclusions about the impact of processing conditions on structure and properties can be transferred to other biorefinery models. How transferable our dataset from silver birch is to other source materials will have to be clarified in the future with similar datasets.
Since all the data presented here was gathered by the same research group, consistency between data points is ensured, making SP-LCC particularly suitable for ML applications. We hope that future ML-driven studies of SP-LCC will reveal the LCC structure-property relationship and ultimately promote the widespread valorization of LCCs tailored for various applications.
Methods
Sawdust preparation
We debarked a silver birch (Betula pendula) stem and finely ground it with a Wiley Mill M02 grinder. The obtained sawdust was screened to select a sawdust fraction with a size of 0.5–1.5 mm. The fraction was then exposed to air drying. We removed the extractives from the air-dried sawdust with a Soxhlet apparatus using acetone (99.9% purity) as solvent.
AquaSolv Omni biorefinery process
As depicted in Fig. 1a, we performed hydrothermal treatment (HTT) on extractive-free sawdust according to the recently reported procedure17,46. The resulting lignin structure is heavily affected by the severity of the process, which in AqSO is controlled by the reactor temperature (T), liquid-to-solid ratio (L:S), and residence time. To quantitatively express the severity of the reaction, we combined the reaction temperature and the residence time into a single variable, the prehydrolysis factor (P-factor). The P-factor is used to control prehydrolysis in the prehydrolysis-kraft process for dissolving pulp production and reflects the efficiency of hemicellulose removal from the pulp. It is the area under the curve obtained when plotting the relative reaction rate against time. The relative reaction rate describes the reaction rate at a given process temperature relative to the reference reaction rate at 100 °C (Eq. 1)1:
\({k}_{rel}=\frac{{k}_{H,(T)}}{{k}_{H,(100\,^\circ {\rm{C}})}}=\exp \left(\frac{{E}_{A,H}}{R\cdot 373.15\,{\rm{K}}}-\frac{{E}_{A,H}}{R\cdot T}\right)\) (1)
where \({k}_{H,(T)}\) is the rate constant of xylan hydrolysis at the given temperature \(T\), and \({E}_{A,H}\) is the activation energy. Consequently, the P-factor strongly relies on the activation energy of the fast-reacting xylan. The chosen process variables (P-factor, temperature, and liquid-to-solid ratio) were restricted to the following ranges: 250–1000 for the P-factor, 160–195 °C for the temperature, and 0.25–2 for the liquid-to-solid ratio. The P-factor as a single variable is depicted in Eq. 21, where \(k\) is the rate constant, \(T\) is the reaction temperature (K) and \(t\) is the residence time (h). All calculations were carried out under the assumption that the activation energy equals 125.6 kJ mol−1.
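The P-factor definition above can be illustrated with a short numerical sketch. It assumes Arrhenius behavior, so that the relative rate is exp(E_A/(R·373.15 K) − E_A/(R·T)), and it integrates this rate over a temperature profile with the trapezoidal rule. The function names, the trapezoidal integration, and the isothermal shortcut are illustrative assumptions and are not taken from the AqSO procedure itself.

```python
import math

R_GAS = 8.314462618   # universal gas constant, J mol^-1 K^-1
E_A = 125_600.0       # activation energy stated in the text, J mol^-1
T_REF = 373.15        # reference temperature of 100 °C, in K


def relative_rate(temp_k: float) -> float:
    """Hydrolysis rate at temp_k relative to the rate at 100 °C (Arrhenius assumption)."""
    return math.exp(E_A / (R_GAS * T_REF) - E_A / (R_GAS * temp_k))


def p_factor(times_h, temps_k) -> float:
    """Trapezoidal area under the relative-rate curve; times in hours, temperatures in K."""
    rates = [relative_rate(temp) for temp in temps_k]
    return sum(
        0.5 * (rates[i] + rates[i + 1]) * (times_h[i + 1] - times_h[i])
        for i in range(len(rates) - 1)
    )


def isothermal_time_h(target_p: float, temp_c: float) -> float:
    """Hold time (h) needed to reach target_p at constant temperature, ignoring heat-up."""
    return target_p / relative_rate(temp_c + 273.15)
```

For example, `isothermal_time_h(840, 195)` gives the hold time for a P-factor of 840 at 195 °C under these simplifying assumptions; a real reactor run would also accumulate P-factor during heating and cooling.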
After the reaction, the resulting solid fraction was exhaustively washed with deionized water and subsequently exposed to acetone (75 vol%, aq.) extraction yielding acetone-extracted lignin (AEL) solutions. LCC-containing AELs were isolated by removing the solvent (75 vol% acetone aq.) through rotary evaporation (T = 40 °C, p = 20 mbar). Finally, the obtained AELs/LCCs were subjected to vacuum oven drying (T = 40 °C, p = 5 mbar) under P2O5 until a constant weight was reached. The dried AELs contained varying amounts of LCC depending on the biorefinery process variables. The obtained samples were labeled to reflect the process conditions used to produce each specific sample and are represented as follows: P840-195-2.00, where P840 is the P-factor employed, 195 is the reaction temperature and 2.00 is the L:S.
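Because the sample labels encode the process conditions, they can be split back into numbers when working with the dataset. The following is a minimal sketch based only on the example label P840-195-2.00; the class name, the function name, and the handling of any trailing marker are assumptions, and labels in the dataset that deviate from this pattern would need their own handling.

```python
import re
from typing import NamedTuple


class ProcessConditions(NamedTuple):
    p_factor: float
    temperature_c: float
    liquid_to_solid: float
    suffix: str  # anything after the three numbers (assumed layout)


# "P<P-factor>-<temperature>-<L:S>", optionally followed by extra text
_ID_PATTERN = re.compile(r"^P(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(.*)$")


def parse_sample_id(sample_id: str) -> ProcessConditions:
    """Split a label such as 'P840-195-2.00' into its process conditions."""
    match = _ID_PATTERN.match(sample_id.strip())
    if match is None:
        raise ValueError(f"unrecognized sample ID: {sample_id!r}")
    p_factor, temperature, ratio, suffix = match.groups()
    return ProcessConditions(
        float(p_factor), float(temperature), float(ratio), suffix.lstrip("-_")
    )
```

Calling `parse_sample_id("P840-195-2.00")` yields a P-factor of 840, a temperature of 195 °C, and an L:S of 2.0 with an empty suffix.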
Heteronuclear single-quantum coherence nuclear magnetic resonance
We recorded the 2D NMR spectra using a Bruker AVANCE 600 NMR spectrometer equipped with a CryoProbe17,46,47. Approximately 75 mg of each dried LCC-containing sample was dissolved in 0.6 mL of DMSO-d6. We defined the acquisition time for the 1H-dimension as 77.8 ms with 36 scans per block employing 1024 collected complex points, while 3.94 ms was the set time for the 13C-dimension with 256 time increments recorded. We processed the obtained 2D HSQC NMR data using 1024 × 1024 data points and employed the Qsine function for both 1H and 13C dimensions. To calibrate the chemical shifts, we chose the DMSO peak at δC/δH 39.5/2.49 ppm/ppm, and we assigned the cross-peaks according to previous reports14,16,17,46. The normalized quantification of different lignin and LCC signals was carried out assuming that (Eq. 3)17,46:
Molar mass distribution
We determined the molar mass distribution (MMD) with a size exclusion chromatography (SEC) system equipped with a multi-angle light scattering (MALS) detector and a differential refractive index (RI) detector. The MALS detector was equipped with 8 photodiodes with pass filters on every second photodiode. The separation was performed on a Jordi X-stream H2O 1000 Å column (10 × 250 mm i.d.) equipped with a guard column (10 × 50 mm i.d.). The column oven was set to 40 °C. We prepared the samples by dissolving 5 mg of each AEL sample in 1 mL of dimethyl sulfoxide (DMSO) containing 0.05 M lithium bromide. We then shook the samples gently for an extended period to minimize the formation of aggregates. Before the SEC measurements, we filtered the dissolved AELs through a 0.22 µm nylon syringe filter. We used the following parameters for the molecular weight analyses: 0.5 mL min−1 flow rate; 100 µL injection volume; 50 min run time; 0.15 dn/dc value. The obtained data was evaluated using the ASTRA 7.3.2 software. We defined the peaks based on the RI concentration curve and selected the photodiodes 2, 4, and 6. The result fitting was adjusted with forward extrapolation. For the MMD, we are interested in the following parameters: number average molecular weight (Mn), molecular weight at the peak of the distribution curve (Mp), and weight average molecular weight (Mw). Our determination of the MMD was based on the methodology reported by Zinovyev et al.48.
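To make the three reported MMD parameters concrete, the sketch below computes them from a discretized distribution using the textbook definitions: Mn = Σc / Σ(c/M), Mw = Σ(c·M) / Σc, and Mp as the molar mass at the highest concentration. The inputs (per-slice molar masses and mass concentrations) and the function name are assumptions for illustration; this is not the ASTRA evaluation used for the dataset.

```python
def molar_mass_averages(molar_masses, concentrations):
    """Return (Mn, Mw, Mp) from per-slice molar masses and mass concentrations."""
    if len(molar_masses) != len(concentrations) or not molar_masses:
        raise ValueError("inputs must be non-empty and of equal length")
    total_c = sum(concentrations)
    # Number average: total mass divided by total number of moles
    mn = total_c / sum(c / m for c, m in zip(concentrations, molar_masses))
    # Weight average: concentration-weighted mean of the molar mass
    mw = sum(c * m for c, m in zip(concentrations, molar_masses)) / total_c
    # Peak molar mass: slice with the highest concentration
    peak_index = max(range(len(concentrations)), key=lambda i: concentrations[i])
    mp = molar_masses[peak_index]
    return mn, mw, mp
```

A small fraction of very heavy species raises Mw much more than Mp in this calculation, since Mp depends only on the most concentrated slice.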
Antioxidant properties
For the determination of antioxidant properties of the AELs/LCCs, we used an improved normalized radical scavenging index (nRSI) method49. It is based on the standard procedures for antioxidant activity evaluation utilizing 2,2-diphenyl-1-picrylhydrazyl (DPPH) as a reactive free radical50,51,52. Briefly, we prepared a set of LCC/AEL solutions (120–600 mg L−1) and a DPPH solution (75 µmol L−1) using 90 vol% acetone (aq.) as a solvent in each case. Following that, we mixed the AEL/LCC solutions with the DPPH solution in a lignin:DPPH = 1:39 (v/v) ratio. To measure the change of the DPPH concentration in the prepared solutions, we employed UV-vis spectroscopy using a Shimadzu UV-2550 spectrophotometer and a 10 mm path length quartz cuvette at a wavelength of 515 nm. The absorbance of the solutions was monitored immediately after the preparation and after 24 h, when the steady state was reached. In parallel, the absorbance of an LCC/AEL-free (blank) solution containing 75 µmol L−1 DPPH in 90 vol% acetone (aq.) was measured to correct the absorbance values of the AEL/LCC-containing solutions for the degradation of DPPH in the given solvent. The lignin absorbance was corrected during the measurements following our recently reported procedure53. More details regarding the experimental procedure can be found in the Supplementary information.
Glass transition temperature
We determined the glass transition temperature (Tg) with modulated differential scanning calorimetry (MDSC, TA Instruments DSC250, Discovery series). We measured approximately 10 mg of AEL in TZero™ aluminum pans under a flow of nitrogen (50 mL min−1). The samples were heated from 40 °C to 115 °C at 5 °C min−1, followed by cooling to 20 °C. The samples were reheated at 2 °C min−1 until reaching 170 °C. We set the modulation amplitude to 1.20 °C and the modulation period to 60 s. We determined Tg from the last heating ramp as the half-height midpoint of the step-change in the reversible heat flow curve in the TRIOS v5.1.1.5 software.
Thermal degradation
We studied the thermal stability of the AEL samples by thermogravimetric analysis (TGA; Discovery SDT 650 simultaneous DSC/TGA thermal analyzer, TA Instruments). We measured approximately 7 mg of AEL in an aluminum pan against an empty pan as a reference. The samples were heated to 700 °C at 10 °C min−1 under N2 atmosphere. The parameters of interest were the onset temperature of thermal degradation (Tonset), the temperature at which the rate of decomposition reaches its maximum (Tmax), the temperature at which 50% of the sample has degraded (T50), and the char yield. We determined these using the TRIOS v5.1.1.5 software. Tonset and Tmax were determined from the derivative of the weight change curve with respect to temperature. The char yield was determined from the weight change curve at the maximum temperature.
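As an illustration of how such parameters relate to a thermogram, the following sketch estimates Tmax, T50, and the char yield from a list of temperatures and remaining weight percentages. It is a simplified stand-in under stated assumptions (finite-difference derivative, linear interpolation, T50 read as the temperature at 50% remaining weight) and does not reproduce the TRIOS evaluation; Tonset is omitted because its construction is not specified in the text.

```python
def thermogram_metrics(temps_c, weight_pct):
    """Estimate Tmax, T50 and char yield from a TGA curve (weight in % of initial mass)."""
    if len(temps_c) != len(weight_pct) or len(temps_c) < 2:
        raise ValueError("need two equally long sequences with at least two points")

    # Tmax: midpoint of the interval with the steepest weight loss
    steepest_slope, t_max = None, None
    for i in range(len(temps_c) - 1):
        slope = (weight_pct[i + 1] - weight_pct[i]) / (temps_c[i + 1] - temps_c[i])
        if steepest_slope is None or slope < steepest_slope:
            steepest_slope, t_max = slope, 0.5 * (temps_c[i] + temps_c[i + 1])

    # T50: first crossing of 50 % remaining weight, linearly interpolated
    t50 = None
    for i in range(len(temps_c) - 1):
        w0, w1 = weight_pct[i], weight_pct[i + 1]
        if w0 >= 50.0 > w1:
            t50 = temps_c[i] + (w0 - 50.0) / (w0 - w1) * (temps_c[i + 1] - temps_c[i])
            break

    # Char yield: remaining weight at the highest temperature reached
    return {"Tmax": t_max, "T50": t50, "char_yield": weight_pct[-1]}
```

If the curve never drops below 50% remaining weight, `T50` is returned as `None` rather than extrapolated.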
Surface tension
We measured the surface tension at the air/water interface of the AELs in aq. NaOH solution (pH 12.65) with a K100 force tensiometer (Krüss, Germany) using a Wilhelmy plate. In total, we measured five concentrations of each AEL sample (0.5 mg mL−1, 0.4 mg mL−1, 0.25 mg mL−1, 0.1 mg mL−1 and 0.08 mg mL−1). First, we prepared the 0.5 mg mL−1 solution by dissolving 12.5 mg AEL in the aq. NaOH solution in a 25 mL volumetric flask overnight under magnetic stirring. Then we prepared the 0.4 mg mL−1 solution by diluting the 0.5 mg mL−1 solution directly after it had been measured. We prepared the remaining solutions in the same way, by diluting the previous solution directly after its measurement. Measurement points were taken until 5 measurements with a standard deviation below 0.1 mN m−1 were obtained.
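The stopping rule for the surface tension readings can be sketched as follows. The sketch assumes that the criterion applies to the five most recent consecutive readings and uses the sample standard deviation; both points are interpretations, since the text does not specify them, and the function name is illustrative.

```python
import statistics


def readings_until_stable(readings_mn_per_m, window=5, tolerance=0.1):
    """Return how many readings were needed until the last `window` readings
    have a standard deviation below `tolerance` (mN/m), or None if never."""
    for end in range(window, len(readings_mn_per_m) + 1):
        recent = readings_mn_per_m[end - window:end]
        if statistics.stdev(recent) < tolerance:
            return end
    return None
```

With a drifting series such as `[50.0, 48.0, 47.0, 46.5, 46.45, 46.5, 46.48, 46.52]`, the criterion is first met once the early, still-equilibrating readings have left the window.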
Bayesian optimization
The data collection was guided by BO, building on the approach described by Löfgren et al.35. A detailed account of the process will be published elsewhere44. Briefly, we modeled the AEL and carbohydrate content as independent Gaussian processes using the BOSS code54. The models were initialized from a sequence of 12 Sobol points55 and the experimental noise was accounted for by incorporating a Gaussian noise, estimated from the technical validation data, into the models. To maintain an optimal workload in the laboratory, we performed the acquisitions in batches of 8 data points each using a Kriging-believer approach56. The first five batches contained two exploration-modified LCB acquisitions57,58 each for the AEL and carbohydrates, and two exploratory acquisitions where only the standard deviation of the models was minimized. To obtain the optimal workload of eight samples, we added two test data points to these five batches, generated independently from a continuation of the initial Sobol sequence. In the last two batches, these test points were dropped in favor of two additional exploratory acquisitions. In total, we collected 54 data points, including initial points, over seven batches. Furthermore, we collected 14 test points, of which four were collected outside of the seven batches. We used these test points for model validation and to terminate the BO once the prediction errors dropped below 10% relative to the observed measurement range.
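The termination criterion can be expressed compactly. The sketch below assumes the prediction error is a mean absolute error over the test points, normalized by the observed measurement range; the text does not state which error metric was used, so the metric, the function names, and the explicit range argument are assumptions rather than the BOSS-based implementation.

```python
def relative_prediction_error(predicted, observed, measured_range):
    """Mean absolute error of test predictions divided by the observed measurement range."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("predicted and observed must be non-empty and of equal length")
    low, high = measured_range
    if high <= low:
        raise ValueError("measured_range must satisfy low < high")
    mean_abs_error = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)
    return mean_abs_error / (high - low)


def should_stop(predicted, observed, measured_range, threshold=0.10):
    """True once the relative prediction error falls below the threshold (10 % in the text)."""
    return relative_prediction_error(predicted, observed, measured_range) < threshold
```

In a multi-objective setting like the one described, the check would be applied to each modeled quantity (AEL and carbohydrate content) separately.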
Data Records
Our database comprises 95 samples isolated under 72 different process conditions. We characterized 88 of the samples with 2D HSQC NMR. Among the characterized samples, the lignin yield was too low for 13 samples to measure any of the physicochemical properties. To bolster the available property information on low-yield samples, we re-isolated seven such samples. For these samples, we decided to forgo the 2D HSQC NMR characterization in favor of measuring all the properties. We refer to these seven samples as replicates, and these are distinguished by an “-R” added to their sample ID. To validate the biorefinery reproducibility, we isolated four samples under identical process conditions. We included these in the SP-LCC dataset and distinguished them by adding “NMRValidation” in their sample ID. No physicochemical properties were measured for these samples.
The SP-LCC dataset is available on Figshare59. It is structured in the following way:
- The sample ID, processing condition, yield, lignin moieties identified from 2D HSQC NMR, and properties for all samples are available in the CSV file named “SP-LCC_table.csv”. The file is structured as described in Table 1.
Table 1 Description of the scalar data in the SP-LCC dataset and contained in the CSV file “SP-LCC_table.csv”.
- The intensities of the NMR spectra are contained in the archive “NMR_spectra.zip”, which contains one intensity file per sample named “NMRspectrum_{sampleID}.csv”. The corresponding chemical shifts, at which these intensities were measured, are contained in the same archive in two files per sample, “xarr_{sampleID}.csv” for the x-axis (F1 C axis) and “yarr_{sampleID}.csv” for the y-axis (F2 H axis), where {sampleID} refers to the sample ID of the respective sample. The files are structured as described in Table 2.
Table 2 Description of the non-scalar data comprised in the presented SP-LCC dataset.
- The CSV files containing the NMR data were generated from Bruker files, which we also make available in the “NMR_spectra_Bruker.zip” archive. In the archive, every subfolder is named by the sample ID of the respective sample.
- For full transparency, we also include Python scripts to parse and plot the NMR spectra both from the Bruker files, with “parse_and_plot_Bruker_NMR_spectrum.py”, and from the .csv files, with “plot_NMR_spectrum.py”. The dependencies for their use are described in the file “readme.txt”.
- The chromatograms are contained in the archives “RI_Chromatogram.zip” and “LS_Chromatogram.zip” with one file per sample named “RI_Chromatogram_{sampleID}.csv” for the refractive index chromatograms and “LS_Chromatogram_{sampleID}.csv” for the light scattering chromatograms. For parallel measurements, “_2” has been added to the file name. The files are structured as described in Table 2. Additional details of the chromatogram data are described in section 2 of the Supplementary information.
- The thermograms are contained in the archives “TD_Thermogram.zip” and “Tg_Thermogram.zip” with one file per sample named “Tg_Thermogram_{sampleID}.csv” for thermograms obtained from DSC analyses and “TD_Thermogram_{sampleID}.csv” for thermograms obtained from TGA analyses. For parallel measurements, “_2” has been added to the file name. The files are structured as described in Table 2. Additional details of the thermogram data are described in section 2 of the Supplementary information.
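The file naming scheme for the NMR data described in the list above can be turned into a small loader. The sketch below builds the three per-sample file names and reads numeric CSV content with the standard library only. It assumes the archive has been extracted into a local folder and that the files hold plain numeric values, possibly with a header row; the exact column layout is given in Table 2 and is not reproduced here, and the function names are illustrative and unrelated to the scripts shipped with the dataset.

```python
import csv
from pathlib import Path


def nmr_file_names(sample_id: str) -> dict:
    """File names for one sample, following the naming scheme of NMR_spectra.zip."""
    return {
        "intensities": f"NMRspectrum_{sample_id}.csv",
        "c_shifts": f"xarr_{sample_id}.csv",   # x-axis (F1 C axis)
        "h_shifts": f"yarr_{sample_id}.csv",   # y-axis (F2 H axis)
    }


def read_numeric_csv(path):
    """Read a CSV file into a list of float rows, skipping non-numeric rows (assumed headers)."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            try:
                values = [float(cell) for cell in row if cell.strip()]
            except ValueError:
                continue  # header or otherwise non-numeric row
            if values:
                rows.append(values)
    return rows


def load_nmr(folder, sample_id):
    """Load intensities and both chemical-shift axes for one sample from an extracted folder."""
    names = nmr_file_names(sample_id)
    return {key: read_numeric_csv(Path(folder) / name) for key, name in names.items()}
```

For routine use, the parsing and plotting scripts distributed with the dataset are the better starting point, since they reflect the actual file layout.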
Technical Validation
For the technical validation, we consider three sources of uncertainty, as shown in Fig. 1b): (I) uncertainty related to sample preparation, i.e., the degree of reproducibility of the biorefinery process; (II) measurement uncertainty related to NMR measurements; (III) measurement uncertainty related to the physicochemical properties of lignin, i.e., from MMD, antioxidant activity, Tg, surface tension, and thermal degradation measurements.
Evaluation of the biorefinery process reproducibility
To evaluate the reproducibility of the biorefinery process, we investigated four replicates generated for identical processing conditions: P-factor = 500; L:S = 1; T = 195 °C (see Table 1 in section 3 of the Supplementary information). The corresponding spectra in the dataset contain “NMRValidation” as a suffix in their ID.
We compared the structure, as determined by HSQC NMR (Fig. 2), and the biorefinery yield of the replicates. For this analysis, we considered the structure rather than the physicochemical properties for two reasons: (1) the structure of LCCs influences their properties; (2) in contrast to the property measurements, the noise of NMR measurements is small, as seen in subsection (II). This allowed us to conveniently isolate the uncertainty derived from the biorefinery process.
To quantify the deviation in lignin structure, we calculated the standard deviation (SD) and relative standard deviation (RSD) of the identified lignin moieties (Table 3). The RSD is reported relative to the average of the four measured samples. The repeated tests indicate that the outcome of the experiments was very similar in all cases, with an average RSD of 5% (Table 3, entry 26), which is consistent with previous studies17,46,47,60. Notably, for the structural characterization, the observed SD and RSD depend on the moiety under consideration. In particular, the quantification of low concentration lignin units (<1 mol%) is heavily affected by the background noise60. In general, we note that the deviation between replicates depends on the processing conditions. For four moieties of interest (β-O-4, BE, β-β, and carbohydrates), the SD is visualized in Fig. S1 in the Supplementary information (section 4).
The four samples used for validating the quantification of HSQC NMR were also used for evaluating the yield variability across replicates. The corresponding SD and the RSD are reported in Table 3, and their calculations are presented in Table S1 in the Supplementary Information. The RSD of the AEL yield is 4%.
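The SD and RSD used throughout the technical validation follow directly from the replicate values. The sketch below assumes the sample standard deviation (n − 1 denominator) and expresses the RSD in percent of the replicate mean; whether the reported values use the sample or population form is not stated in the text, and the function name is illustrative.

```python
import statistics


def sd_and_rsd(values):
    """Return (SD, RSD in %) of replicate values, with RSD relative to their mean."""
    if len(values) < 2:
        raise ValueError("at least two replicate values are required")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return sd, 100.0 * sd / mean
```

Applied per moiety across the four replicates and then averaged over moieties, this corresponds to the kind of average RSD quoted above.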
Measurement uncertainty related to the HSQC NMR analysis
To quantify the NMR uncertainty, we performed two consecutive HSQC measurements on a single LCC sample obtained at P-factor = 625, L:S = 1.13, and T = 178 °C. The sample was first dissolved in DMSO-d6 and then transferred to an NMR tube. As soon as the first measurement was completed, the sample was removed from the magnet, placed back into the magnet and the second HSQC measurement was recorded on the same tube. We employed the same HSQC conditions used throughout this work (see the Methods section).
The average RSD (Table 3) for the quantified lignin moieties between the two measurements was 2.6%, signifying a low error related to the HSQC NMR analysis.
Measurement uncertainty related to the physicochemical properties of lignin
The technical validation for MMD, nRSI, Tg, and thermal degradation was carried out by repeating measurements four times for four selected AEL samples. For the surface tension, a total of five parallel measurements were performed; however, only four were used for technical validation because the fifth was discarded as an outlier (see Fig. S2 in section 5 of the Supplementary information). Table 4 contains the average SD and the average RSD (relative to the mean of measured properties) across all four validation samples. The calculations are described in Table S2 and Table S3 in the Supplementary information (section 6). In Figs. 3–5 the standard deviation of each individual sample is shown.
Fig. 3 Measurement uncertainties for the (a) normalized radical scavenging index, (b) glass-transition temperature, (c) number average molecular weight, (d) average molecular weight, and (e) molecular weight at the peak of the distribution curve. The error bars indicate the mean standard deviation and were calculated from 4 repeated measurements (crosses) for each of the four samples (x-axis) produced under different processing conditions.
Fig. 4 Measurement uncertainties for properties related to thermal degradation: (a) onset temperature, (b) temperature at maximum degradation, (c) temperature after degradation of 50% material weight, and (d) char yield. The error bars indicate the mean standard deviation, calculated from 4 repeated measurements (crosses) for each of the four samples (x-axis) produced under different processing conditions.
Fig. 5 Measurement uncertainties for the surface tension (y-axis) at four different concentrations (x-axis). The error bars indicate the mean standard deviation and were calculated from 4 repeated measurements (crosses) at each concentration level, for each of the four samples produced under different processing conditions.
As shown in Table 4, the RSD of the repeated measurements varies between physicochemical properties. For most properties, the RSD is low, with values well below 5%. The exceptions are the nRSI, with an RSD of 7.62%, and the MMD parameters, with RSDs of up to 15.79%. The high RSD of the MMD parameters is most likely due to the formation of aggregates during the preparation of the AEL-DMSO/LiBr solutions for the MMD measurements61. The aggregates are present in low concentrations; however, they have a high impact on light scattering and will therefore be detected by the MALS detector62. Mw and Mn are calculated as averages over all detected molecular weights in the sample, which means the aggregates affect these parameters. By contrast, Mp gives the molecular weight at the highest detected concentration. In this molecular weight fraction, the aggregates are not present, and the measurement uncertainty associated with Mp is therefore lower than for Mn and Mw.
Comparing the SD of the measured properties with the literature was not possible, because such a thorough validation of measurements on lignin is typically either not performed or not reported. For all properties, including those that vary more strongly with repeated experiments, the SD is considerably lower than the range of observed values across samples. Thus, all measured property data can be meaningfully interpreted within our dataset. We report the range of the measured property values and compare them with the literature in Table S4 in the Supplementary Information.
We also carried out technical validation for the chromatograms and thermograms obtained from the MMD, Tg, and thermal degradation measurements. We did this by extracting the data from the four measurements performed for each AEL sample. The point-wise SD of the chromatograms and thermograms of all four measurements was calculated based on the extracted CSV files and is reported in Figs. S3–S6 in the Supplementary information (section 8). The validation of the most important points of the chromatograms and thermograms has been done in combination with the technical validation of their corresponding physicochemical properties and has been described above.
Code availability
The code used to guide and analyze the data collection is based on the BOSS Python library for Bayesian optimization and is distributed as a GitLab repository (https://gitlab.com/cest-group/lcc_biorefinery_optimization). The repository includes instructions for installation, including version requirements, and usage.
References
Sixta, H. Handbook of Pulp; Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim (2006).
Gillet, S. et al. Lignin Transformations for High Value Applications: Towards Targeted Modifications Using Green Chemistry. Green Chem. 19, 4200–4233, https://doi.org/10.1039/C7GC01479A (2017).
Cao, L. et al. Lignin Valorization for the Production of Renewable Chemicals: State-of-the-Art Review and Future Prospects. Bioresour. Technol. 269, 465–475, https://doi.org/10.1016/j.biortech.2018.08.065 (2018).
Berlin, A., Balakshin, M. Chapter 18-Industrial Lignins: Analysis, Properties, and Applications, in Bioenergy Research: Advances and Applications. (eds. Gupta, V. K., Tuohy, M. G., Kubicek, C. P., Sadler, J., Xu, F.) 315–336, https://doi.org/10.1016/B978-0-444-59561-4.00018-8 (Elsevier, 2014).
Obydenkova, S. V., Kouris, P. D., Hensen, E. J. M., Heeres, H. J. & Boot, M. D. Environmental Economics of Lignin Derived Transport Fuels. Bioresour. Technol. 243, 589–599, https://doi.org/10.1016/j.biortech.2017.06.157 (2017).
Obydenkova, S. V. et al. Industrial Lignin from 2G Biorefineries – Assessment of Availability and Pricing Strategies. Bioresour. Technol. 291, 121805, https://doi.org/10.1016/j.biortech.2019.121805 (2019).
Dessbesell, L., Paleologou, M., Leitch, M., Pulkki, R. & Xu, C. Global Lignin Supply Overview and Kraft Lignin Potential as an Alternative for Petroleum-Based Polymers. Renewable and Sustainable Energy Rev. 123, 109768, https://doi.org/10.1016/j.rser.2020.109768 (2020).
Ragauskas, A. J. et al. Lignin Valorization: Improving Lignin Processing in the Biorefinery. Science. 344, 1246843, https://doi.org/10.1126/science.1246843 (2014).
Tuck, C. O., Pérez, E., Horváth, I. T., Sheldon, R. A. & Poliakoff, M. Valorization of Biomass: Deriving More Value from Waste. Science. 337, 695–699, https://doi.org/10.1126/science.1218930 (2012).
Argyropoulos, D. D. S. et al. Kraft Lignin: A Valuable, Sustainable Resource, Opportunities and Challenges. ChemSusChem. 16, e202300492, https://doi.org/10.1002/cssc.202300492 (2023).
Ralph, J. et al. Lignins: Natural Polymers from Oxidative Coupling of 4-Hydroxyphenyl- Propanoids. Phytochem. Rev. 3, 29–60, https://doi.org/10.1023/B:PHYT.0000047809.65444.a4 (2004).
Vanholme, R., Demedts, B., Morreel, K., Ralph, J. & Boerjan, W. Lignin Biosynthesis and Structure. Plant Physiol. 153, 895–905, https://doi.org/10.1104/pp.110.155119 (2010).
Tarasov, D., Leitch, M. & Fatehi, P. Lignin–Carbohydrate Complexes: Properties, Applications, Analyses, and Methods of Extraction: A Review. Biotechnol. Biofuels. 11, 1–28, https://doi.org/10.1186/s13068-018-1262-1 (2018).
Balakshin, M. Y., Capanema, E. A. & Chang, H. M. MWL Fraction with a High Concentration of Lignin-Carbohydrate Linkages: Isolation and 2D NMR Spectroscopic Analysis. Holzforschung. 61, 1–7, https://doi.org/10.1515/HF.2007.001 (2007).
Lawoko, M., Henriksson, G. & Gellerstedt, G. Characterisation of Lignin-Carbohydrate Complexes (LCCs) of Spruce Wood (Picea Abies L.) Isolated with Two Methods. Holzforschung. 60, 156–161, https://doi.org/10.1515/HF.2006.025 (2006).
Balakshin, M., Capanema, E., Gracz, H., Chang, H.-min & Jameel, H. Quantification of Lignin-Carbohydrate Linkages with High-Resolution NMR Spectroscopy. Planta. 233, 1097–1110, https://doi.org/10.1007/s00425-011-1359-2 (2011).
Tarasov, D. et al. AqSO Biorefinery: A Green and Parameter-Controlled Process for the Production of Lignin–Carbohydrate Hybrid Materials. Green Chem. 24, 6639–6656, https://doi.org/10.1039/D2GC02171D (2022).
Balakshin, M., Capanema, E. & Berlin, A. Chapter 4-Isolation and Analysis of Lignin–Carbohydrate Complexes Preparations with Traditional and Advanced Methods: A Review in Studies in Natural Products Chemistry Vol. 42 (ed. Atta-ur-Rahman) 83–115, https://doi.org/10.1016/B978-0-444-63281-4.00004-5 (Elsevier, 2014).
Giummarella, N., Zhang, L., Henriksson, G. & Lawoko, M. Structural Features of Mildly Fractionated Lignin Carbohydrate Complexes (LCC) from Spruce. RSC Adv. 6, 42120–42131, https://doi.org/10.1039/C6RA02399A (2016).
Giummarella, N. & Lawoko, M. Structural Insights on Recalcitrance during Hydrothermal Hemicellulose Extraction from Wood. ACS Sustain. Chem. Eng. 5, 5156–5165, https://doi.org/10.1021/acssuschemeng.7b00511 (2017).
Capanema, E., Balakshin, M., Katahira, R., Chang, H. M. & Jameel, H. How Well Do MWL and CEL Preparations Represent the Whole Hardwood Lignin? J. Wood Chem. Technol. 35, 17–26, https://doi.org/10.1080/02773813.2014.892993 (2015).
Nishimura, H., Kamiya, A., Nagata, T., Katahira, M. & Watanabe, T. Direct Evidence for α Ether Linkage between Lignin and Carbohydrates in Wood Cell Walls. Sci. Rep. 8, 1–11, https://doi.org/10.1038/s41598-018-24328-9 (2018).
Sakagami, H. et al. Molecular Requirements of Lignin–Carbohydrate Complexes for Expression of Unique Biological Activities. Phytochemistry. 66, 2108–2120, https://doi.org/10.1016/j.phytochem.2005.05.013 (2005).
Sakagami, H., Kushida, T., Oizumi, T., Nakashima, H. & Makino, T. Distribution of Lignin–Carbohydrate Complex in Plant Kingdom and Its Functionality as Alternative Medicine. Pharmacol. Ther. 128, 91–105, https://doi.org/10.1016/j.pharmthera.2010.05.004 (2010).
Huang, C. et al. Unveiling the Structural Properties of Lignin-Carbohydrate Complexes in Bamboo Residues and Its Functionality as Antioxidants and Immunostimulants. ACS Sustain. Chem. Eng. 6, 12522–12531, https://doi.org/10.1021/acssuschemeng.8b03262 (2018).
Jiang, B. et al. Structure-Antioxidant Activity Relationship of Active Oxygen Catalytic Lignin and Lignin-Carbohydrate Complex. Int. J. Biol. Macromol. 139, 21–29, https://doi.org/10.1016/j.ijbiomac.2019.07.134 (2019).
Xie, D. et al. Structural Characterization and Antioxidant Activity of Water-Soluble Lignin-Carbohydrate Complexes (LCCs) Isolated from Wheat Straw. Int. J. Biol. Macromol. 161, 315–324, https://doi.org/10.1016/j.ijbiomac.2020.06.049 (2020).
Pei, W. et al. Isolation and Identification of a Novel Anti-Protein Aggregation Activity of Lignin-Carbohydrate Complex From Chionanthus Retusus Leaves. Front. Bioeng. Biotechnol. 8, 573991, https://doi.org/10.3389/fbioe.2020.573991 (2020).
Dong, H. et al. Characterization and Application of Lignin-Carbohydrate Complexes from Lignocellulosic Materials as Antioxidants for Scavenging in Vitro and in Vivo Reactive Oxygen Species. ACS Sustain. Chem. Eng. 8, 256–266, https://doi.org/10.1021/acssuschemeng.9b05290 (2020).
Lahtinen, M. H. et al. Lignin-Rich PHWE Hemicellulose Extracts Responsible for Extended Emulsion Stabilization. Front. Chem. 7, 489961, https://doi.org/10.3389/fchem.2019.00871 (2019).
Carvalho, D. M. D., Lahtinen, M. H., Bhattarai, M., Lawoko, M. & Mikkonen, K. S. Active Role of Lignin in Anchoring Wood-Based Stabilizers to the Emulsion Interface. Green Chem. 23, 9084–9098, https://doi.org/10.1039/D1GC02891J (2021).
Li, Y. F. et al. Comparison of Emulsifying Capacity of Two Hemicelluloses from Moso Bamboo in Soy Oil-in-Water Emulsions. RSC Adv. 10, 4657–4663, https://doi.org/10.1039/C9RA08636F (2020).
Lehtonen, M. et al. Phenolic Residues in Spruce Galactoglucomannans Improve Stabilization of Oil-in-Water Emulsions. J. Colloid Interface Sci. 512, 536–547, https://doi.org/10.1016/J.JCIS.2017.10.097 (2018).
Giummarella, N., Pu, Y., Ragauskas, A. J. & Lawoko, M. A Critical Review on the Analysis of Lignin Carbohydrate Bonds. Green Chem. 21, 1573–1595, https://doi.org/10.1039/C8GC03606C (2019).
Löfgren, J. et al. Machine Learning Optimization of Lignin Properties in Green Biorefineries. ACS Sustain. Chem. Eng. 10, 9469–9479, https://doi.org/10.1021/acssuschemeng.2c01895 (2022).
Liao, M. & Yao, Y. Applications of Artificial Intelligence-Based Modeling for Bioenergy Systems: A Review. GCB Bioenergy. 13, 774–802, https://doi.org/10.1111/gcbb.12816 (2021).
Velidandi, A. et al. State-of-the-Art and Future Directions of Machine Learning for Biomass Characterization and for Sustainable Biorefinery. J. Energy Chem. 81, 42–63, https://doi.org/10.1016/j.jechem.2023.02.020 (2023).
Sharma, V. et al. Advances in Machine Learning Technology for Sustainable Biofuel Production Systems in Lignocellulosic Biorefineries. Sci. Total Environ. 886, 163972, https://doi.org/10.1016/j.scitotenv.2023.163972 (2023).
Ge, H., Zheng, J. & Xu, H. Advances in Machine Learning for High Value-Added Applications of Lignocellulosic Biomass. Bioresour. Technol. 369, 128481, https://doi.org/10.1016/j.biortech.2022.128481 (2023).
Pham, V. & El-Halwagi, M. Process Synthesis and Optimization of Biorefinery Configurations. AIChE Journal. 58, 1212–1221, https://doi.org/10.1002/aic.12640 (2012).
Maity, S. K. Opportunities, Recent Trends and Challenges of Integrated Biorefinery: Part I. Renewable and Sustainable Energy Rev. 43, 1427–1445, https://doi.org/10.1016/j.rser.2014.11.092 (2015).
Espinoza Pérez, A. T., Camargo, M., Narváez Rincón, P. C. & Alfaro Marchant, M. Key Challenges and Requirements for Sustainable and Industrialized Biorefinery Supply Chain Design and Management: A Bibliographic Analysis. Renewable and Sustainable Energy Rev. 69, 350–359, https://doi.org/10.1016/j.rser.2016.11.084 (2017).
Singh, N. et al. Global Status of Lignocellulosic Biorefinery: Challenges and Perspectives. Bioresour. Technol. 344, 126415, https://doi.org/10.1016/j.biortech.2021.126415 (2022).
Diment, D. et al. Enhancing Lignin-Carbohydrate Complexes Production and Properties With Machine Learning. ChemSusChem. e202401711, https://doi.org/10.1002/cssc.202401711 (2024).
Sen, S., Patil, S. & Argyropoulos, D. S. Thermal Properties of Lignin in Copolymers, Blends, and Composites: A Review. Green Chem. 17, 4862–4887, https://doi.org/10.1039/C5GC01066G (2015).
Schlee, P., Tarasov, D., Rigo, D. & Balakshin, M. Advanced NMR Characterization of Aquasolv Omni (AqSO) Biorefinery Lignins/Lignin-Carbohydrate Complexes. ChemSusChem. 16, e202300549, https://doi.org/10.1002/cssc.202300549 (2023).
Rigo, D. et al. Upgrading AquaSolv Omni (AqSO) Biorefinery: Access to Highly Ethoxylated Lignins in High Yields through Reactive Extraction (REx). Green Chem. 26, 2623–2637, https://doi.org/10.1039/D3GC03776B (2024).
Zinovyev, G. et al. Getting Closer to Absolute Molar Masses of Technical Lignins. ChemSusChem. 11, 3259–3268, https://doi.org/10.1002/cssc.201801177 (2018).
Diment, D. et al. Study toward a More Reliable Approach to Elucidate the Lignin Structure-Property-Performance Correlation. Biomacromolecules. 25, 200–212, https://doi.org/10.1021/acs.biomac.3c00906 (2024).
Brand-Williams, W., Cuvelier, M. E. & Berset, C. Use of a Free Radical Method to Evaluate Antioxidant Activity. LWT-Food Science and Technology. 28, 25–30, https://doi.org/10.1016/S0023-6438(95)80008-5 (1995).
Dizhbite, T., Telysheva, G., Jurkjane, V. & Viesturs, U. Characterization of the Radical Scavenging Activity of Lignins - Natural Antioxidants. Bioresour. Technol. 95, 309–317, https://doi.org/10.1016/j.biortech.2004.02.024 (2004).
Ponomarenko, J. et al. Antioxidant Activity of Various Lignins and Lignin-Related Phenylpropanoid Units with High and Low Molecular Weight. Holzforschung. 69, 795–805, https://doi.org/10.1515/hf-2014-0280 (2015).
Diment, D., Musl, O., Balakshin, M. & Rigo, D. Guidelines for Evaluating the Antioxidant Activity of Lignin via the 2,2-Diphenyl-1-Picrylhydrazyl (DPPH) Assay. ChemSusChem. e202402383, https://doi.org/10.1002/CSSC.202402383 (2025).
Todorović, M., Gutmann, M. U., Corander, J. & Rinke, P. Bayesian Inference of Atomistic Structure in Functional Materials. npj Comput. Mater. 5, 1–7, https://doi.org/10.1038/s41524-019-0175-2 (2019).
Sobol’, I. M. On the Distribution of Points in a Cube and the Approximate Evaluation of Integrals. USSR Computational Mathematics and Mathematical Physics. 7, 86–112, https://doi.org/10.1016/0041-5553(67)90144-9 (1967).
Ginsbourger, D., Le Riche, R. & Carraro, L. Chapter 6-Kriging Is Well-Suited to Parallelize Optimization, in Computational Intelligence in Expensive Optimization Problems; Adaption Learning and Optimization. (eds. Tenne, Y. & Goh, C.-K.) 131–162, https://doi.org/10.1007/978-3-642-10701-6_6 (Springer, Berlin, Heidelberg; 2010).
Cox, D. D. & John, S. A Statistical Method for Global Optimization. [Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics. 2, 1241–1246, https://doi.org/10.1109/ICSMC.1992.271617 (1992).
Gutmann, M. U. & Corander, J. Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models. Journal of Machine Learning Research. 17, 1–47 (2016).
Alopaeus, M. et al. SP-LCC — a Dataset on the Structure and Properties of Lignin-Carbohydrate Complexes from Hardwood. figshare https://doi.org/10.6084/m9.figshare.28444583 (2025).
Balakshin, M. Y. & Capanema, E. A. Comprehensive Structural Analysis of Biorefinery Lignins with a Quantitative 13C NMR Approach. RSC Adv. 5, 87187–87199, https://doi.org/10.1039/C5RA16649G (2015).
Gidh, A. V., Decker, S. R., Vinzant, T. B., Himmel, M. E. & Williford, C. Determination of Lignin by Size Exclusion Chromatography Using Multi Angle Laser Light Scattering. J. Chromatogr. A. 1114, 102–110, https://doi.org/10.1016/j.chroma.2006.02.044 (2006).
Matson, J. B., Steele, A. Q., Mase, J. D. & Schulz, M. D. Polymer Characterization by Size-Exclusion Chromatography with Multi-Angle Light Scattering (SEC-MALS): A Tutorial Review. Polym. Chem. 15, 127–142, https://doi.org/10.1039/D3PY01181J (2024).
Acknowledgements
The authors gratefully acknowledge the Research Council of Finland for support through projects No. 316601, 341589, and 341596/2021, the FinnCERES BioEconomy flagship, and the Finnish Center for Artificial Intelligence (FCAI). Parts of this work were conducted within the Research Council of Finland Research Infrastructure “Printed Intelligence Infrastructure” (PII-FIRI). In addition, we acknowledge the Aalto Science-IT project and the Aalto Materials Digitalization (AMAD) platform.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
Alopaeus, M.: Conducted experiments (SEC-MALLS, MDSC, TGA, surface tension) and analyzed the results, technical validation, writing and editing. Stosiek, M.: Technical validation, writing, reviewing, editing, data preparation for repository upload. Diment, D.: Produced samples (AqSO biorefinery), conducted experiments (NMR, antioxidant activity) and analyzed the results, technical validation, writing and editing. Löfgren, J.: BO development and validation, writing, reviewing and editing. Cho, M. J.: Produced samples (AqSO biorefinery). Hemming, J.: Assisted with SEC-MALLS analyses. Tirri, T.: Assisted with MDSC method development. Pranovich, A. and Eklund, P.: Discussion and supervision. Rigo, D.: Supervision, writing and editing. Balakshin, M., Xu, C. and Rinke, P.: Conceived the project, supervision, reviewing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Alopaeus, M., Stosiek, M., Diment, D. et al. SP-LCC — a dataset on the structure and properties of lignin-carbohydrate complexes from hardwood. Sci Data 12, 996 (2025). https://doi.org/10.1038/s41597-025-05327-8
DOI: https://doi.org/10.1038/s41597-025-05327-8






