Background & Summary

Lignin is the most abundant aromatic biopolymer and is an attractive candidate for replacing fossil-based feedstocks1,2,3,4. In traditional biorefineries, lignin has mainly been regarded as a low-value by-product and has not been used to its full potential, because its complex structure poses challenges to valorization5,6,7,8,9. For example, Kraft lignin has primarily been burnt as a low-value energy source10. Lignin largely consists of the three monolignols p-coumaryl (H), coniferyl (G), and sinapyl (S) alcohol, which are randomly arranged and crosslinked within the structure1,11,12. This randomness is one of the main factors contributing to the structural complexity of lignin, which arises chiefly from the nonenzymatic nature of the last step of lignin biosynthesis1. In addition, the structural features of lignin are heavily influenced by the raw material, type of pretreatment, isolation method, fractionation method, and process conditions2,4.

One of the primary difficulties in isolating lignin from other biomass components is separating lignin and carbohydrates, largely due to the presence of lignin-carbohydrate complexes (LCCs)13,14,15,16. In LCCs, lignin and carbohydrates – long or short chain – are chemically bonded mainly through phenyl glycoside, glucuronic ester and benzylic ether linkages (Fig. 2)14,16,17,18,19,20,21,22. These linkages make LCCs amphiphilic, which gives rise to promising properties for, e.g., biomedical applications and surfactants in oil-in-water emulsions13,23,24,25,26,27,28,29,30,31,32,33. Although LCCs were initially seen as a significant barrier to the valorization of lignin, they are promising for high-value applications. The amphiphilic character of LCCs provides them with good biocompatibility and biological activities such as immunopotentiation, antiviral, and antioxidant activity13,23,24,25,26,27,28,29. Additionally, the amphiphilic nature of LCCs enables them to stabilize emulsions and enhance their compatibility with various materials, making them appealing as emulsifiers30,31,32,33. A major limitation in the valorization of LCCs is that current extraction methods are often complex, time-consuming, and result in low yields18,34. To address the need for more efficient protocols, we recently showed that the AquaSolv Omni (AqSO) biorefinery can be optimized to provide a scalable and high-yield route to LCC extraction. AqSO is a green and flexible biorefinery process based on hydrothermal treatment followed by solvent extraction, described by Tarasov et al.17 and outlined in Fig. 1a for reference.

Fig. 1 Schematic overview of the workflow for the SP-LCC dataset, where (a) shows the process steps of the AqSO biorefinery and (b) the collection of the experimental data.

Fig. 2 2D HSQC NMR spectra of acetone extracted lignin. The cross-peaks of the main areas of interest were determined according to the corresponding resonances. Each area was colored distinctively to represent the moieties of interest in the oxygenated aliphatic region (a), the occurrence of the LCC linkages (b) and aromatic region (c,d).

Machine learning (ML) played a crucial role in the AqSO optimization process, helping to achieve high LCC yields and customize physicochemical properties. ML techniques have been explored to address a wide range of challenges in bio-based materials science35,36,37,38,39. For process optimization, which typically requires analyzing a large number of samples40,41,42,43, Bayesian optimization (BO) has emerged as an alternative to traditional experimental design methods. In BO, an ML model collaborates with a data collection strategy to determine the processing conditions for new sample isolation. These decisions aim to meet specific objectives, such as maximizing yield with minimal sample use.

As part of our previous work, we used BO-guided data collection to curate a set of 90 LCC-containing samples44. In this paper, we add 5 samples and compile comprehensive measurement data (5 physicochemical properties and NMR spectra) for all 95 LCC samples. To allow for the study of structure-property relationships for multiple properties, we favored a detailed characterization of each sample over sheer number of samples. The resulting Structure-Property LCC (SP-LCC) dataset is bolstered with an extensive technical validation and organized for easy access by the community. For the samples included in SP-LCC, key structural moieties characterized by 2D nuclear magnetic resonance (NMR) spectroscopy and selected physicochemical properties are provided. The measured properties are molar mass distribution, antioxidant activity, glass transition temperature, thermal degradation, and surface tension. We selected these properties because they are basic materials parameters that are also important for certain applications of LCCs, e.g., biomedicine23,24, emulsion stabilizers31,33 and fillers in thermoplastic formulations45. To our knowledge, this is the first time that a dataset of this scale has been published on lignin that allows for a comprehensive understanding of the behavior of lignin and LCCs. Furthermore, it provides insight into the structural details and property variations of LCCs based on the isolation process conditions. The significance of SP-LCC extends beyond the AqSO biorefinery because: (1) the correlation between structure and properties holds true regardless of the biorefinery process; (2) general conclusions about the impact of processing conditions on structure and properties can be transferred to other biorefinery models. How transferable our dataset from silver birch is to other source materials will have to be clarified in the future with similar datasets.
Since all the data presented here was gathered by the same research group, consistency between data points is ensured, making SP-LCC particularly suitable for ML applications. We hope that future ML-driven studies of SP-LCC will reveal the LCC structure-property relationship and ultimately promote the widespread valorization of LCCs tailored for various applications.

Methods

Sawdust preparation

We debarked a silver birch (Betula pendula) stem and finely ground it with a Wiley Mill M02 grinder. The obtained sawdust was screened to select a sawdust fraction with a size of 0.5–1.5 mm. The fraction was then exposed to air drying. We removed the extractives from the air-dried sawdust with a Soxhlet apparatus using acetone (99.9% purity) as solvent.

AquaSolv Omni biorefinery process

As depicted in Fig. 1a, we performed hydrothermal treatment (HTT) on extractive-free sawdust according to the recently reported procedure17,46. The resulting lignin structure is heavily affected by the severity of the process, which in AqSO is controlled by the reactor temperature (T), liquid-to-solid ratio (L:S), and residence time. To quantitatively express the severity of the reaction, we combined the reaction temperature and the residence time into a single variable, the prehydrolysis factor (P-factor). The P-factor is used to control prehydrolysis in the prehydrolysis-kraft process for dissolving pulp production, where it reflects the efficiency of hemicellulose removal from the pulp. It corresponds to the area under the curve obtained when plotting the relative reaction rate against time. The relative reaction rate describes the reaction rate at a given process temperature relative to the reference reaction rate at 100 °C (Eq. 1)1:

$$\ln \frac{{k}_{H,(T)}}{{k}_{100^\circ {\rm{C}}}}=\frac{{E}_{A,H}}{373.15R}-\frac{{E}_{A,H}}{{RT}}$$
(1)

where \({k}_{H,(T)}\) is the rate constant of xylan hydrolysis at the given temperature \(T\), \(R\) is the gas constant, and \({E}_{A,H}\) is the activation energy. Consequently, the P-factor strongly relies on the activation energy of the fast-reacting xylan. The chosen process variables (P-factor, temperature, and liquid-to-solid ratio) were restricted to the following ranges: 250–1000 for the P-factor, 160–195 °C for the temperature, and 0.25–2 for the liquid-to-solid ratio. The P-factor as a single variable is defined as follows (Eq. 2)1, where \(k\) is the rate constant, \(T\) is the reaction temperature (K) and \(t\) is the residence time (h). All calculations were carried out under the assumption that the activation energy equals 125.6 kJ mol−1.

$$\text{P-factor}={\int }_{0}^{{t}_{f}}\frac{k\left(T(t)\right)}{{k}_{100^\circ {\rm{C}}}}{dt}={\int }_{0}^{{t}_{f}}{e}^{40.48-\frac{15106}{T(t)}}{dt}$$
(2)
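As a worked illustration of Eqs. 1 and 2, the following sketch evaluates the relative reaction rate and integrates it over a sampled temperature profile. The activation energy is the value stated in the text; the trapezoidal rule, the function names, and the isothermal helper (which ignores the heating and cooling phases) are assumptions made for illustration and do not reproduce the procedure used in this work.

```python
import math

E_A = 125_600.0        # activation energy in J/mol (125.6 kJ/mol, from the text)
R = 8.314462618        # gas constant in J/(mol K)
T_REF = 373.15         # reference temperature of 100 °C, in K


def relative_rate(temp_k: float) -> float:
    """Relative rate k(T)/k(100 °C) according to Eq. 1."""
    return math.exp(E_A / (R * T_REF) - E_A / (R * temp_k))


def p_factor(times_h, temps_k) -> float:
    """Trapezoidal approximation of Eq. 2 for a sampled temperature profile."""
    rates = [relative_rate(t) for t in temps_k]
    total = 0.0
    for i in range(1, len(times_h)):
        step = times_h[i] - times_h[i - 1]
        total += 0.5 * (rates[i] + rates[i - 1]) * step
    return total


def isothermal_time_h(target_p: float, temp_c: float) -> float:
    """Residence time (h) needed to reach target_p at a constant temperature.

    Assumption: heating and cooling phases are neglected.
    """
    return target_p / relative_rate(temp_c + 273.15)
```

With these constants, `E_A / (R * T_REF)` and `E_A / R` evaluate to approximately 40.48 and 15106, which matches the numerical form of the integrand in Eq. 2.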

After the reaction, the resulting solid fraction was exhaustively washed with deionized water and subsequently exposed to acetone (75 vol%, aq.) extraction yielding acetone-extracted lignin (AEL) solutions. LCC-containing AELs were isolated by removing the solvent (75 vol% acetone aq.) through rotary evaporation (T = 40 °C, p = 20 mbar). Finally, the obtained AELs/LCCs were subjected to vacuum oven drying (T = 40 °C, p = 5 mbar) under P2O5 until a constant weight was reached. The dried AELs contained varying amounts of LCC depending on the biorefinery process variables. The obtained samples were labeled to reflect the process conditions used to produce each specific sample and are represented as follows: P840-195-2.00, where P840 is the P-factor employed, 195 is the reaction temperature and 2.00 is the L:S.
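The labelling scheme can be decoded programmatically. The sketch below splits a label such as P840-195-2.00 into its three process variables; the function and type names are hypothetical, and the assumption that any further hyphen-separated parts of a sample ID can simply be ignored should be checked against the actual IDs in the dataset.

```python
from typing import NamedTuple


class ProcessConditions(NamedTuple):
    p_factor: float
    temperature_c: float
    liquid_to_solid: float


def parse_sample_label(label: str) -> ProcessConditions:
    """Split a label such as 'P840-195-2.00' into its process variables."""
    parts = label.split("-")
    if len(parts) < 3 or not parts[0].startswith("P"):
        raise ValueError(f"unexpected label format: {label!r}")
    # Assumption: anything after the third field is a suffix and is ignored.
    return ProcessConditions(float(parts[0][1:]), float(parts[1]), float(parts[2]))
```

For example, `parse_sample_label("P840-195-2.00")` yields a P-factor of 840, a temperature of 195 °C and an L:S of 2.0.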

Heteronuclear single-quantum coherence nuclear magnetic resonance

We recorded the 2D HSQC NMR spectra using a Bruker AVANCE 600 NMR spectrometer equipped with a CryoProbe17,46,47. Approximately 75 mg of each dried LCC-containing sample was dissolved in 0.6 mL of DMSO-d6. We set the acquisition time for the 1H dimension to 77.8 ms with 36 scans per block and 1024 collected complex points, and the acquisition time for the 13C dimension to 3.94 ms with 256 time increments recorded. We processed the obtained 2D HSQC NMR data using 1024 × 1024 data points and employed the Qsine function for both the 1H and 13C dimensions. To calibrate the chemical shifts, we chose the DMSO peak at δC/δH 39.5/2.49 ppm, and we assigned the cross-peaks according to previous reports14,16,17,46. The normalized quantification of different lignin and LCC signals was carried out assuming that (Eq. 3)17,46:

$${\rm{G}}+{\rm{S}}={{\rm{G}}}_{2}+{{\rm{S}}}_{2,6}/2=100{\rm{Ar}}$$
(3)
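One way to read Eq. 3 is that the aromatic reference integral, G2 + S2,6/2, is set to 100 aromatic units (Ar) and every other cross-peak integral is scaled accordingly. The sketch below shows this scaling under that reading; the dictionary keys and the function name are hypothetical and do not correspond to column names in the dataset.

```python
def per_100_ar(integrals: dict) -> dict:
    """Express HSQC cross-peak integrals per 100 aromatic units (Eq. 3).

    Assumption: 'G2' and 'S2,6' hold the raw integrals of the G2 and S2,6
    cross-peaks, and all other entries are raw integrals on the same scale.
    """
    aromatic = integrals["G2"] + integrals["S2,6"] / 2.0
    if aromatic <= 0:
        raise ValueError("aromatic reference integral must be positive")
    return {name: 100.0 * value / aromatic for name, value in integrals.items()}
```

For illustrative raw integrals of 30 (G2), 140 (S2,6) and 45 (a linkage signal), the reference is 30 + 140/2 = 100, so the linkage amounts to 45 per 100 Ar.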

Molar mass distribution

We determined the molar mass distribution (MMD) with a size exclusion chromatography (SEC) system equipped with a multi-angle light scattering (MALS) detector and a differential refractive index (RI) detector. The MALS detector was equipped with 8 photodiodes with pass filters on every second photodiode. The separation was performed on a Jordi X-stream H2O 1000 Å column (10 × 250 mm i.d.) equipped with a guard column (10 × 50 mm i.d.). The column oven was set to 40 °C. We prepared the samples by dissolving 5 mg of each AEL sample in 1 mL of dimethyl sulfoxide (DMSO) containing 0.05 M lithium bromide. We then shook the samples gently for an extended period to minimize the formation of aggregates. Before the SEC measurements, we filtered the dissolved AELs through a 0.22 µm nylon syringe filter. We used the following parameters for the molecular weight analyses: flow rate, 0.5 mL min−1; injection volume, 100 µL; run time, 50 min; dn/dc value, 0.15. The obtained data was evaluated using the ASTRA 7.3.2 software. We defined the peaks based on the RI concentration curve and selected the photodiodes 2, 4, and 6. The result fitting was adjusted with forward extrapolation. For the MMD, we are interested in the following parameters: number average molecular weight (Mn), molecular weight at the peak of the distribution curve (Mp), and weight average molecular weight (Mw). Our determination of the MMD was based on the methodology reported by Zinovyev et al.48.

Antioxidant properties

For the determination of the antioxidant properties of the AELs/LCCs, we used an improved normalized radical scavenging index (nRSI) method49. It is based on the standard procedures for antioxidant activity evaluation utilizing 2,2-diphenyl-1-picrylhydrazyl (DPPH) as a reactive free radical50,51,52. Briefly, we prepared a set of LCC/AEL solutions (120–600 mg L−1) and a DPPH solution (75 µmol L−1) using 90 vol% acetone (aq.) as a solvent in each case. Following that, we mixed the AEL/LCC solutions with the DPPH solution in a lignin:DPPH = 1:39 (v/v) ratio. To measure the change of the DPPH concentration in the prepared solutions, we employed UV-vis spectroscopy using a Shimadzu UV-2550 spectrophotometer and a 10 mm path length quartz cuvette at a wavelength of 515 nm. The absorbance of the solutions was monitored immediately after the preparation and after 24 h, when the steady state was reached. In parallel, the absorbance of an LCC/AEL-free (blank) solution containing 75 µmol L−1 DPPH in 90 vol% acetone (aq.) was measured to correct the absorbance values of the AEL/LCC-containing solutions for the degradation of DPPH in the given solvent. The lignin absorbance was corrected during the measurements following our recently reported procedure53. More details regarding the experimental procedure can be found in the Supplementary information.

Glass transition temperature

We determined the glass transition temperature (Tg) with modulated differential scanning calorimetry (MDSC, TA Instruments DSC250, Discovery series). We measured approximately 10 mg of AEL in TZero™ aluminum pans under a flow of nitrogen (50 mL min−1). The samples were heated from 40 °C to 115 °C at 5 °C min−1, followed by cooling to 20 °C. The samples were reheated at 2 °C min−1 until reaching 170 °C. We set the modulation amplitude to 1.20 °C and the modulation period to 60 s. We determined Tg from the last heating ramp as the half-height midpoint of the step-change in the reversible heat flow curve in the TRIOS v5.1.1.5 software.

Thermal degradation

We studied the thermal stability of the AEL samples by thermal gravimetric analysis (Discovery SDT 650 simultaneous DSC/TGA thermal analyzer, TA Instruments). We measured approximately 7 mg of AEL in an aluminum pan against an empty pan as a reference. The samples were heated to 700 °C at 10 °C min−1 under N2 atmosphere. The parameters of interest were the onset temperature of the thermal degradation (Tonset), the temperature at which the rate of decomposition reaches its maximum (Tmax), the temperature at which 50% of the sample has degraded (T50), and the char yield. We determined these using the TRIOS v5.1.1.5 software. Tonset and Tmax were determined from the derivative of the weight change curve with respect to temperature. The char residue was determined from the weight change curve at the maximum temperature.
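To make the definitions of Tmax, T50 and char yield concrete, the sketch below extracts them from a sampled weight curve. It is an illustration only: the values in this work were obtained with the TRIOS software, the finite-difference derivative and the function name are assumptions, and Tonset is omitted because its exact construction is not specified here.

```python
def tga_parameters(temps_c, weights_pct):
    """Return (t_max, t50, char_yield) from a sampled TGA weight curve.

    temps_c: increasing temperatures in °C
    weights_pct: remaining sample weight in % at each temperature
    """
    if len(temps_c) != len(weights_pct) or len(temps_c) < 2:
        raise ValueError("need two or more paired points")
    # Finite-difference derivative dW/dT on each interval (assumption).
    rates = [
        (weights_pct[i + 1] - weights_pct[i]) / (temps_c[i + 1] - temps_c[i])
        for i in range(len(temps_c) - 1)
    ]
    # Fastest weight loss corresponds to the most negative derivative.
    i_max = min(range(len(rates)), key=rates.__getitem__)
    t_max = 0.5 * (temps_c[i_max] + temps_c[i_max + 1])
    # First temperature at which half of the sample weight is lost;
    # None if the weight never drops to 50%.
    t50 = next((t for t, w in zip(temps_c, weights_pct) if w <= 50.0), None)
    char_yield = weights_pct[-1]  # weight remaining at the final temperature
    return t_max, t50, char_yield
```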

Surface tension

We measured the surface tension of the AELs at the air/water interface in aq. NaOH solution (pH 12.65) with a K100 force tensiometer (Krüss, Germany) using a Wilhelmy plate. In total, we measured five concentrations of each AEL sample (0.5 mg mL−1, 0.4 mg mL−1, 0.25 mg mL−1, 0.1 mg mL−1 and 0.08 mg mL−1). First, we prepared the 0.5 mg mL−1 solution by dissolving 12.5 mg AEL in the aq. NaOH solution in a 25 mL volumetric flask overnight under magnetic stirring. Then we prepared the 0.4 mg mL−1 solution by diluting the 0.5 mg mL−1 solution directly after it had been measured. We prepared the remaining solutions in the same way, by diluting the previous solution directly after its measurement. Measurement points were taken until 5 measurements with a standard deviation below 0.1 mN m−1 were obtained.

Bayesian optimization

The data collection was guided by BO, building on the approach described by Löfgren et al.35. A detailed account of the process will be published elsewhere44. Briefly, we modeled the AEL and carbohydrate content as independent Gaussian processes using the BOSS code54. The models were initialized from a sequence of 12 Sobol points55, and the experimental noise was accounted for by incorporating Gaussian noise, estimated from the technical validation data, into the models. To maintain an optimal workload in the laboratory, we performed the acquisitions in batches of 8 data points each using a Kriging-believer approach56. The first five batches contained two exploration-modified LCB acquisitions57,58 each for the AEL and carbohydrates, and two exploratory acquisitions where only the standard deviation of the models was minimized. To obtain the optimal workload of eight samples, we added two test data points to these five batches, generated independently from a continuation of the initial Sobol sequence. In the last two batches, these test points were dropped in favor of two additional exploratory acquisitions. In total, we collected 54 data points, including initial points, over seven batches. Furthermore, we collected 14 test points, of which four were collected outside of the seven batches. We used these test points for model validation and to terminate the BO once the prediction errors dropped below 10% relative to the observed measurement range.
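The termination criterion can be sketched as follows. The text only states that prediction errors were compared with the observed measurement range; the use of the mean absolute error as the error measure and the function names below are assumptions, and the actual criterion may have been evaluated differently.

```python
def relative_prediction_error(observed, predicted) -> float:
    """Mean absolute error divided by the range of the observed values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two or more paired values")
    value_range = max(observed) - min(observed)
    if value_range == 0:
        raise ValueError("observed values have zero range")
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
    return mae / value_range


def should_stop(observed, predicted, threshold: float = 0.10) -> bool:
    """True once the relative error falls below the threshold (10% in the text)."""
    return relative_prediction_error(observed, predicted) < threshold
```

Here `observed` would hold the measured values of the test points and `predicted` the corresponding model predictions for one modeled quantity.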

Data Records

Our database comprises 95 samples isolated under 72 different process conditions. We characterized 88 of the samples with 2D HSQC NMR. Among the characterized samples, the lignin yield of 13 samples was too low to measure any of the physicochemical properties. To bolster the available property information on low-yield samples, we re-isolated seven such samples. For these samples, we decided to forgo the 2D HSQC NMR characterization in favor of measuring all the properties. We refer to these seven samples as replicates, and they are distinguished by an “-R” added to their sample ID. To validate the biorefinery reproducibility, we isolated four samples under identical process conditions. We included these in the SP-LCC dataset and distinguished them by adding “NMRValidation” to their sample ID. No physicochemical properties were measured for these samples.

The SP-LCC dataset is available on Figshare59. It is structured in the following way:

  • The sample ID, processing condition, yield, lignin moieties identified from 2D HSQC NMR, and properties for all samples are available in the CSV file named “SP-LCC_table.csv”. The file is structured as described in Table 1.

    Table 1 Description of the scalar data in the SP-LCC dataset and contained in the CSV file “SP-LCC_table.csv”.
  • The intensities of the NMR spectra are contained in the archive “NMR_spectra.zip”, which contains one file for the intensities per sample named “NMRspectrum_{sampleID}.csv”. The corresponding chemical shifts, at which these intensities were measured, are contained in the same archive NMR_spectra.zip in two files per sample, for the x-axis (F1 C axis) “xarr_{sampleID}.csv” and the y-axis (F2 H axis) “yarr_{sampleID}.csv”, where {sampleID} refers to the sample ID of the respective sample. The files are structured as described in Table 2.

    Table 2 Description of the non-scalar data comprised in the presented SP-LCC dataset.
  • The CSV files containing the NMR data were generated from Bruker files, which we make available, too, in the “NMR_spectra_Bruker.zip” archive. In the archive, every subfolder is named by the sample ID of the respective sample.

  • For full transparency, we also include Python scripts to parse and plot the NMR spectra both from the Bruker files, with “parse_and_plot_Bruker_NMR_spectrum.py”, and from the .csv files, with “plot_NMR_spectrum.py”. The dependencies for their use are described in the file “readme.txt”.

  • The chromatograms are contained in the archives “RI_Chromatogram.zip” and “LS_Chromatogram.zip” with one file per sample, named “RI_Chromatogram_{sampleID}.csv” for the refractive index chromatograms and “LS_Chromatogram_{sampleID}.csv” for the light scattering chromatograms. For parallel measurements, “_2” has been added to the file name. The files are structured as described in Table 2. Additional details of the chromatogram data are described in section 2 of the Supplementary information.

  • The thermograms are contained in the archives “TD_Thermogram.zip” and “Tg_Thermogram.zip” with one file per sample named “Tg_Thermogram_{sampleID}.csv” for thermograms obtained from DSC analyses and “TD_Thermogram_{sampleID}.csv” obtained from TGA analyses. For parallel measurements, “_2” has been added to the file name. The files are structured as described in Table 2. Additional details of the thermogram data are described in section 2 of the Supplementary information.
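The naming scheme above can be used to locate the NMR files of a given sample after extracting the archive. The following sketch builds the three file names and reads a purely numeric CSV file. The helper names are hypothetical, and the assumption that the files are comma-separated and contain no header row should be verified against Table 2 and the files themselves; the scripts shipped with the dataset remain the reference for parsing.

```python
import csv
from pathlib import Path


def nmr_file_names(sample_id: str) -> dict:
    """File names inside NMR_spectra.zip for one sample, as described above."""
    return {
        "intensities": f"NMRspectrum_{sample_id}.csv",
        "carbon_shifts": f"xarr_{sample_id}.csv",   # x-axis, F1 (C axis)
        "proton_shifts": f"yarr_{sample_id}.csv",   # y-axis, F2 (H axis)
    }


def read_numeric_csv(path: Path) -> list:
    """Read a CSV of numbers into a list of rows.

    Assumption: comma-separated values without a header; empty cells
    and empty lines are skipped.
    """
    with open(path, newline="") as handle:
        return [
            [float(cell) for cell in row if cell.strip()]
            for row in csv.reader(handle)
            if row
        ]
```

For a sample ID such as P840-195-2.00, `nmr_file_names` returns “NMRspectrum_P840-195-2.00.csv”, “xarr_P840-195-2.00.csv” and “yarr_P840-195-2.00.csv”.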

Technical Validation

For the technical validation, we consider three sources of uncertainty, as shown in Fig. 1b: (I) uncertainty related to sample preparation, i.e., the degree of reproducibility of the biorefinery process; (II) measurement uncertainty related to NMR measurements; (III) measurement uncertainty related to the physicochemical properties of lignin, i.e., from MMD, antioxidant activity, Tg, surface tension, and thermal degradation measurements.

Evaluation of the biorefinery process reproducibility

To evaluate the reproducibility of the biorefinery process, we investigated four replicates generated for identical processing conditions: P-factor = 500; L:S = 1; T = 195 °C (See Table 1 in section 3 of the Supplementary information). The corresponding spectra in the dataset contain “NMRValidation” as suffix in their ID.

We compared the structure, as determined by HSQC NMR (Fig. 2), and the biorefinery yield of the replicates. For this analysis, we considered the structure rather than the physicochemical properties for two reasons: (1) the structure of LCCs influences their properties; (2) in contrast to the property measurements, the noise of NMR measurements is small, as seen in subsection (II). This allowed us to conveniently isolate the uncertainty derived from the biorefinery process.

To quantify the deviation in lignin structure, we calculated the standard deviation (SD) and relative standard deviation (RSD) of the identified lignin moieties (Table 2). The RSD is reported relative to the average of the four measured samples. Repeated tests indicate that the outcome of the experiments was very similar in all cases with an average RSD of 5% (Table 3, entry 26), which is consistent with previous studies17,46,47,60. Notably, for the structural characterization, the observed SD and RSD depend on the moiety under consideration. In particular, the quantification of low concentration lignin units (<1 mol%) is heavily affected by the background noise60. In general, we note that the deviation between replicates depends on the processing conditions. For four moieties of interest (β-O-4, BE, β-β, and carbohydrates), the SD is visualized in Fig. S1 in the Supplementary information (section 4).
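The SD and RSD used throughout the technical validation can be computed as in the sketch below. Whether the sample (n − 1) or the population form of the standard deviation was used is not stated here, so the sample form is an assumption, as is the function name; the replicate values in the example are illustrative and not taken from the dataset.

```python
from statistics import mean, stdev


def sd_and_rsd(values):
    """Sample standard deviation and RSD (in % of the mean) of replicates."""
    average = mean(values)
    sd = stdev(values)  # n - 1 in the denominator (assumption)
    return sd, 100.0 * sd / average
```

For four illustrative replicate values of 50, 52, 48 and 50, the mean is 50, the SD is about 1.63 and the RSD about 3.3%.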

Table 3 The standard deviation and relative standard deviation of the experimental errors in the NMR measurements for different lignin moieties.

The four samples used for validating the quantification of HSQC NMR were also used for evaluating the yield variability across replicates. The corresponding SD and the RSD are reported in Table 3, and their calculations are presented in Table S1 in the Supplementary Information. The RSD of the AEL yield is 4%.

Measurement uncertainty related to the HSQC NMR analysis

To quantify the NMR uncertainty, we performed two consecutive HSQC measurements on a single LCC sample obtained at P-factor = 625, L:S = 1.13, and T = 178 °C. The sample was first dissolved in DMSO-d6 and then transferred to an NMR tube. As soon as the first measurement was completed, the sample was removed from the magnet, placed back into the magnet and the second HSQC measurement was recorded on the same tube. We employed the same HSQC conditions used throughout this work (see the Methods section).

The average RSD (Table 3) for the quantified lignin moieties between the two measurements was 2.6%, signifying a low error related to the HSQC NMR analysis.

Measurement uncertainty related to the physicochemical properties of lignin

The technical validation for MMD, nRSI, Tg, and thermal degradation was carried out by repeating measurements four times for four selected AEL samples. For the surface tension, a total of five parallel measurements were performed; however, only four were used for technical validation, as the fifth was discarded as an outlier (see Fig. S2 in section 5 of the Supplementary information). Table 4 contains the average SD and the average RSD (relative to the mean of measured properties) across all four validation samples. The calculations are described in Table S2 and Table S3 in the Supplementary information (section 6). Figs. 3–5 show the standard deviation of each individual sample.

Table 4 The average standard deviation and average relative standard deviation of selected properties, determined from four parallel measurements on four AEL samples.

Fig. 3 Measurement uncertainties for the (a) Normalized radical scavenging index, (b) glass-transition temperature, (c) number average molecular weight, (d) average molecular weight, and (e) molecular weight at the peak of the distribution curve. The error bars indicate the mean standard deviation and were calculated from 4 repeated measurements (crosses) for each of the four samples (x-axis) produced under different processing conditions.

Fig. 4 Measurement uncertainties for properties related to thermal degradation: (a) onset temperature, (b) temperature at maximum degradation, (c) temperature after degradation of 50% material weight, and (d) char yield. The error bars indicate the mean standard deviation, calculated from 4 repeated measurements (crosses) for each of the four samples (x-axis) produced under different processing conditions.

Fig. 5 Measurement uncertainties for the surface tension (y-axis) at four different concentrations (x-axis). The error bars indicate the mean standard deviation and were calculated from 4 repeated measurements (crosses) at each concentration level, for each of the four samples produced under different processing conditions.

As shown in Table 4, the RSD of the repeated measurements varies between physicochemical properties. For most properties, the RSD is low, with values well below 5%. The exceptions are the nRSI, with an RSD of 7.62%, and the MMD parameters, with RSDs of up to 15.79%. The high RSD of the MMD parameters is most likely due to the formation of aggregates during the preparation of the AEL-DMSO/LiBr solutions for the MMD measurements61. The aggregates are present in low concentrations; however, they have a high impact on light scattering and will therefore be detected by the MALS detector62. The Mw and Mn are calculated based on the averages of all detected molecular weights in the sample, which means the aggregates will affect these parameters. By contrast, the Mp gives the molecular weight at the highest detected concentration. In this molecular weight fraction, the aggregates are not present, and the measurement uncertainty associated with Mp is, therefore, lower than for the Mn and Mw.

Comparing the SD of the measured properties with the literature was not possible, as such a thorough validation of measurements on lignin is typically not performed or not reported. For all properties, including those that vary more strongly with repeated experiments, the SD is considerably lower than the range of observed values across samples. Thus, all measured property data can be meaningfully interpreted within our dataset. We report the range of the measured property values and compare them with the literature in Table S4 in the Supplementary Information.

We also carried out technical validation for the chromatograms and thermograms obtained from the MMD, Tg, and thermal degradation measurements. We did this by extracting the data from the four measurements performed for each AEL sample. The point-wise SD of the chromatograms and thermograms of all four measurements was calculated based on the extracted CSV files and is reported in Figs. S3–S6 in the Supplementary information (section 8). The validation of the most important points of the chromatograms and thermograms was done in combination with the technical validation of their corresponding physicochemical properties and has been described above.
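A point-wise SD of repeated curves can be computed as sketched below. The sketch assumes that all repeated chromatograms or thermograms are sampled on the same x-axis grid, so that values can be compared index by index; if the grids differ, the curves would first have to be interpolated onto a common grid. The function name and the use of the sample standard deviation are assumptions.

```python
from statistics import stdev


def pointwise_sd(curves):
    """Sample SD across repeated curves at each shared x position.

    curves: list of equally long sequences of y values, one per measurement
    """
    if len(curves) < 2:
        raise ValueError("need two or more repeated curves")
    lengths = {len(curve) for curve in curves}
    if len(lengths) != 1:
        raise ValueError("all curves must have the same number of points")
    # zip(*curves) groups the y values of all curves at each x position.
    return [stdev(points) for points in zip(*curves)]
```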