Main

Designing soft materials, such as gels and elastomers, is a complex task. It requires selecting appropriate types and quantities of building blocks (for example, monomers) and determining their arrangement in the material, creating a gigantic design space with countless possible combinations. Moreover, soft materials exhibit intricate behaviours because of the interplay of weak molecular interactions and thermal fluctuations, resulting in complex structure–property relationships across multiple time and length scales, with mesoscale structures playing an important part7.

These complexities hinder the development of accurate predictive theories or computational models, often rendering soft material discovery reliant on experimental trial and error. To reduce experimental demands, data-driven strategies are becoming increasingly essential8,9. Emerging tools, such as data mining (DM) and machine learning (ML), are transforming the field by advancing the analysis of complex behaviours, improving property predictions and driving theory and modelling development5,10,11,12,13.

Effectively integrating these tools into an end-to-end design framework is important for accelerating soft material discovery. An important first step is the creation of high-quality datasets, which is complicated by the vast number of potential material designs and limited experimental throughput14,15. Adhesive hydrogels, for example, are a promising class of soft material widely sought for high-end applications. Yet achieving instant, strong and repeatable underwater adhesion remains a longstanding challenge16,17. Previous studies on this material have used many different monomer types, making it difficult to form a consistent dataset or forge a simple design principle for optimizing performance16.

Biological soft tissues, as naturally evolved soft materials, exemplify complex structures tailored for specific functions18. Studying these systems can help reduce the design space for synthetic soft materials19, such as gecko-inspired dry adhesives20,21. In particular, adhesive proteins, found across diverse organisms (for example, archaea, bacteria, eukaryotes and viruses), enable adhesion in wet environments. Despite their diversity, these proteins share common sequence patterns that offer valuable insights into designing underwater adhesives22. However, identifying meaningful patterns, translating them into synthesis strategies and enabling extrapolative predictions by machine learning remain major challenges to achieving an end-to-end design model.

Here we introduce a new data-driven approach that integrates DM, experimentation and ML for the efficient development of high-performance underwater adhesive hydrogels (Fig. 1a). By mining adhesive protein databases, we extract characteristic sequence features to guide hydrogel design. These features are replicated in 180 synthetic hydrogels using random copolymerization and relative composition strategies, which strike a balance between biological fidelity and practical synthesis. Among these DM-driven hydrogels, several exhibit greater adhesive strength (Fa) than those reported in the literature (Fig. 1b). This set of 180 synthetic hydrogels forms a small yet high-quality dataset for further optimization by ML, leading to ML-driven hydrogels with underwater Fa exceeding 1 MPa—an order-of-magnitude improvement over previously reported underwater adhesive hydrogels and elastomers16 (Supplementary Fig. 1).

Fig. 1: Data-driven de novo design of underwater adhesive hydrogels.

a, Conceptual scheme of the proposed approach integrating DM, experimentation and ML to design high-performance adhesive hydrogels. b, Comparison of underwater adhesive strength (Fa) between previously reported hydrogels (Supplementary Table 1) and newly developed hydrogels in this study (DM-driven and ML-driven). Fa was measured using tack tests, and the testing conditions were optimized for maximum performance.

The resulting super-adhesive hydrogels hold considerable potential across a wide range of applications, offering reliable solutions in settings where traditional adhesives often fall short (Supplementary Fig. 1). They could improve medical procedures, advance biomedical engineering, support marine farming and enable deep-sea exploration. The substantial performance improvements showcase the success of our data-driven approach in designing high-performance hydrogels. Moreover, this approach is versatile and can be adapted to develop other types of functional soft materials, opening new possibilities in various fields.

DM of adhesive proteins

We compiled a dataset containing 24,707 adhesive proteins gathered from the National Center for Biotechnology Information (NCBI) protein database, using the keyword ‘adhesive protein’. This dataset includes proteins from 3,822 different organisms across archaea, bacteria, eukaryotes and viruses, as well as artificial proteins. Statistical analysis shows that the average length of these adhesive proteins ranges from approximately 300 to 500 amino acids (Supplementary Fig. 2).

To identify the most representative protein sequences and minimize the impact of individual variations, we ranked all species by the number of adhesive proteins they contain and selected the top 200 species for further analysis (Fig. 2a and Supplementary Fig. 3). We then performed multiple sequence alignment using Clustal Omega23 to determine consensus sequences for each species (Extended Data Fig. 1), which are believed to play a crucial part in maintaining protein stability and adhesion throughout evolution24,25.

Fig. 2: DM of adhesive proteins and formulation design.

a, Schematic of the amino acids feature extraction process used to derive bioinspired formulations through adhesive protein DM, encoding and relative composition computation. b, Distribution of block length (that is, the number of consecutive residues from the same functional class) for the six functional classes, shown along the horizontal axis, based on the consensus sequences of the top 200 species. c, Pairwise frequency distribution of the 21 functional class pair types along encoded sequences, shown for the entire dataset and for eight representative species, shown along the horizontal axis, categorized by their biological classifications in the database.

To reduce the dimensionality of the variables, the 20 canonical amino acids were grouped into six classes based on their physicochemical properties: hydrophobic, nucleophilic, acidic, cationic, amide and aromatic (Supplementary Fig. 4). The consensus sequences were then encoded into functional class sequences. For consistency in the encoding, glycine, alanine and proline were excluded from the hydrophobic class because of their smaller side chains, which are proposed to have a less important role in interfacial contacts and interactions compared with other amino acids26.

The block length of each functional class in the encoded sequences is typically less than three (Fig. 2b), indicating substantial sequence heterogeneity in adhesive proteins even at the coarse functional class level. Different species exhibited distinct patterns in the pairwise frequencies of these functional classes (Fig. 2c). This suggests preferences for specific functional class pairings within the sequences, hinting at an underlying order beneath the observed sequence heterogeneity.
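The encoding and block-length analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the residue-to-class table `CLASS_OF` is a hypothetical placeholder (the grouping actually used is given in Supplementary Fig. 4), and skipping glycine, alanine and proline entirely is an assumption about how excluded residues are handled.

```python
from itertools import groupby

# HYPOTHETICAL class assignment, for illustration only; the grouping actually
# used is given in Supplementary Fig. 4. G, A and P are left unassigned here.
CLASS_OF = {
    **dict.fromkeys("VLIM", "hydrophobic"),
    **dict.fromkeys("STC", "nucleophilic"),
    **dict.fromkeys("DE", "acidic"),
    **dict.fromkeys("KRH", "cationic"),
    **dict.fromkeys("NQ", "amide"),
    **dict.fromkeys("FYW", "aromatic"),
}


def encode(sequence):
    """Map residues to functional classes, skipping unassigned residues."""
    return [CLASS_OF[aa] for aa in sequence.upper() if aa in CLASS_OF]


def block_lengths(encoded):
    """Return (class, run length) for each run of consecutive identical classes."""
    return [(cls, sum(1 for _ in run)) for cls, run in groupby(encoded)]


# Made-up peptide, not taken from the dataset.
example_blocks = block_lengths(encode("KKSVLDEGFYN"))
print(example_blocks)
```

A histogram of the run lengths returned by `block_lengths`, pooled per class over all consensus sequences, would correspond to the kind of distribution shown in Fig. 2b.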

Based on these insights, we devised a strategy for hydrogel design using six functional monomers to represent the six functional classes of amino acids. Although directly replicating functional class sequences offers a straightforward way to mimic protein primary structures and functions, achieving precise control over monomer sequences in synthetic polymers remains a marked challenge. Therefore, we aimed to statistically replicate the sequence features of functional classes through ideal random copolymerization of the six functional monomers, which has minimal composition drift during polymerization and enables statistically controlled sequences19,27,28,29.

For this purpose, we used a relative composition approach to capture the neighbouring preferences of amino acid functional classes in the synthetic polymer chains. Specifically, we counted the occurrences of 21 distinct pair types for the six functional classes, denoted as \(n_{ij}\) (where i, j = 1, …, 6), along the functional class sequences for each species and ranked them in descending order. The top five pairs, collectively accounting for approximately 50% of all occurrences, were used to compute the monomer proportions of each functional class as \(\phi_{i} = N_{i}/\sum_{k} N_{k}\), where \(N_{i} = \sum_{j} (n_{ij} + n_{ji})\) for each species (Extended Data Fig. 1 and Supplementary Data 1 and 2). These relative compositions served as descriptors for the corresponding species. From the top 200 species, we derived 180 unique compositions after removing 20 duplicates (Supplementary Table 2), which were then used for hydrogel synthesis.
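As a rough sketch of the relative composition calculation, the snippet below counts adjacent class pairs, keeps the five most frequent and converts them into proportions. It reflects one reading of the formula, stated here as an assumption: pairs are treated as unordered (giving 21 pair types for six classes), each pair count is credited to both of its members, and a self-pair is therefore credited twice to its class. The function names are illustrative.

```python
from collections import Counter


def pair_counts(encoded):
    """Count unordered adjacent class pairs along an encoded sequence."""
    return Counter(tuple(sorted(p)) for p in zip(encoded, encoded[1:]))


def relative_composition(encoded, top_k=5):
    """Proportions phi_i computed from the top_k most frequent pairs."""
    totals = Counter()
    for (a, b), n in pair_counts(encoded).most_common(top_k):
        totals[a] += n
        totals[b] += n  # a self-pair (a == b) is therefore counted twice
    norm = sum(totals.values())
    return {cls: n / norm for cls, n in totals.items()}


# Toy encoded sequence with single-letter class labels (not real data).
toy_phi = relative_composition(list("ABABCA"))
print(toy_phi)
```

In the toy sequence the pair A–B occurs three times and B–C and A–C once each, so A and B each receive 4 of the 10 credits and C receives 2.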

Synthesis of DM-driven hydrogels

Six functional monomers (Fig. 3a), each representing one of the six functional classes of amino acids, were selected. Their pairwise reactivity ratios, determined by 1H NMR analysis, were close to unity when copolymerized in the cosolvent dimethyl sulfoxide (DMSO) using free-radical polymerization (Supplementary Fig. 5 and Supplementary Table 3). These near-unity values indicate minimal composition drift during copolymerization in DMSO (Supplementary Figs. 6 and 7).

Fig. 3: DM-driven hydrogels for underwater adhesion.

a, Chemical structures of six functional monomers, each representing one of the six functional classes of amino acids. b, Distribution of monomer block lengths in heteropolymer sequences generated by Monte Carlo simulations based on experimentally determined reactivity ratios and derived formulations. c, Pairwise frequency distribution of monomer pairs in heteropolymer sequences obtained by Monte Carlo simulations, shown for all 180 derived formulations and for eight formulations (denoted by gel index) corresponding to the representative species shown in Fig. 2c. d, Schematic of the tack test for measuring underwater adhesion. e, Adhesive strength (Fa) of the 180 hydrogels. f, Stress–displacement profiles of two G-004 variants in the tack test: (i) statistical sequences synthesized in DMSO and (ii) block-like sequences synthesized in DMS. Inset images show the appearance of the two hydrogels. Adhesion tests were conducted under a 10-N loading force applied for 10 s on a glass substrate in normal saline (0.154 M NaCl). This test condition was used for rapid screening.

Monte Carlo simulations based on the Mayo–Lewis model were performed to analyse the sequence properties of the six functional monomers in the corresponding 180 heteropolymers, using the measured reactivity ratios (Supplementary Table 3) and the derived monomer proportions (ϕi) (ref. 30) (Supplementary Table 2). The resulting distributions of monomer block lengths and pairwise frequencies (Fig. 3b,c) closely matched those observed in adhesive proteins (Fig. 2b,c), confirming that our synthesis protocol effectively captures key statistical features (Supplementary Fig. 8), such as sequence heterogeneity and neighbouring preferences.
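A terminal-model (Mayo–Lewis) Monte Carlo chain growth can be sketched as follows. This is an illustrative simplification rather than the simulation of ref. 30: the feed composition is held constant (no composition drift), `r[i][j]` is taken as the reactivity ratio k_ii/k_ij with `r[i][i] = 1`, and the three-monomer feed is a made-up example.

```python
import random
from itertools import groupby


def grow_chain(feed, r, length, rng):
    """Terminal model: P(add j | chain end i) is proportional to feed[j] / r[i][j]."""
    n = len(feed)
    chain = [rng.choices(range(n), weights=feed)[0]]
    for _ in range(length - 1):
        i = chain[-1]
        weights = [feed[j] / r[i][j] for j in range(n)]
        chain.append(rng.choices(range(n), weights=weights)[0])
    return chain


def mean_block_length(chain):
    """Average length of runs of identical consecutive monomers."""
    runs = [sum(1 for _ in g) for _, g in groupby(chain)]
    return sum(runs) / len(runs)


# Ideal random copolymerization: all reactivity ratios equal to one.
feed = [0.5, 0.3, 0.2]
unity = [[1.0] * 3 for _ in range(3)]
chain = grow_chain(feed, unity, 20000, random.Random(0))
fractions = [chain.count(m) / len(chain) for m in range(3)]
print(fractions, mean_block_length(chain))
```

With all ratios equal to one the addition probabilities reduce to the feed fractions, so the chain composition tracks the feed and the mean block length approaches 1/(1 − Σ f_i²), about 1.6 for this feed. Ratios far from unity in `r` would produce longer, blockier runs.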

Following the derived formulations, 180 DM-driven gels, labelled G-001 to G-180, were synthesized by one-pot free-radical copolymerization of the functional monomers with crosslinkers in DMSO (Methods and Supplementary Fig. 9). After solvent exchange from DMSO to normal saline (0.154 M NaCl), the hydrogels were characterized by volume swelling ratio, rheological behaviour and underwater adhesive strength (Fa). Adhesion was assessed using tack tests (Fig. 3d and Supplementary Fig. 10) on a glass substrate in normal saline, with a loading force of 10 N and a 10-s contact time applied for rapid screening.

Figure 3e shows the measured Fa for all 180 hydrogels (15 mm diameter, 0.3–0.8 mm thickness). Among them, 16 hydrogels exhibited robust adhesion with Fa > 100 kPa, and 83 hydrogels showed Fa > 46 kPa, surpassing the average reported in the literature (Supplementary Table 1). Notably, G-042 (derived from Escherichia, Supplementary Fig. 8), hereafter referred to as G-max, presented the highest adhesive strength of 147 kPa.

The high Fa values demonstrate the effectiveness of our data-driven approach in guiding the de novo design of adhesive hydrogels, highlighting two key insights. First, the functional class sequences extracted through DM capture the essential sequence features of adhesive proteins that are important for wet adhesion. Second, using ideal random copolymerization of functional monomers to statistically replicate these sequence features through relative compositions provides an effective strategy, bridging the gap between de novo design and material fabrication.

To validate the first insight, we examined the adhesion performance of hydrogels formulated using sequences derived from DM of resilin proteins. These hydrogels exhibited poor underwater adhesion (Extended Data Fig. 2 and Supplementary Table 4), underscoring the importance of specific sequence features from adhesive proteins for effective adhesion.

To validate the second insight, we analysed the adhesion performance of hydrogels synthesized by non-ideal copolymerization in dimethyl sulfide (DMS). In DMS, most pairwise reactivity ratios of monomers deviate significantly from unity (Supplementary Table 3), resulting in composition drift during polymerization and the formation of blocky sequences (Supplementary Figs. 6 and 7). Figure 3f compares two variants of G-004, showing that the variant synthesized in DMS appeared more translucent and exhibited markedly lower Fa than its counterpart with statistical sequences synthesized in DMSO. This finding underscores the important role of ideal random copolymerization of functional monomers (with near-unity reactivity ratios) in achieving the statistical sequence features essential for mimicking protein functions19,27.

To improve Fa, we assessed the correlations between Fa and ϕi using Kendall’s τ coefficients31 and characterized the dependence of Fa on the swelling of hydrogels and rheological behaviours (Extended Data Fig. 3). We found that ϕATAC, ϕBA and ϕPEA exhibit weak positive correlations with Fa, whereas ϕHEA, ϕAAm and ϕCBEA show weak negative correlations. Nevertheless, these weak correlations, along with the intricate structure–property relationships (Extended Data Fig. 3), are insufficient to directly predict hydrogel formulations for optimal adhesion, highlighting the complex synergistic effects of monomer species, sequences and network structures.
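For reference, Kendall's rank correlation can be computed directly from concordant and discordant pairs. The sketch below implements the tie-free tau-a variant in plain Python; which variant the authors used is not stated here, and the five data points are invented purely to show the calculation.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # pairs tied in x or y count as neither concordant nor discordant
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented monomer proportions and adhesive strengths (kPa), for illustration.
phi_example = [0.1, 0.2, 0.3, 0.4, 0.5]
fa_example = [20.0, 35.0, 30.0, 60.0, 55.0]
tau = kendall_tau_a(phi_example, fa_example)
print(tau)
```

Of the ten pairs in this example, eight are concordant and two are discordant, giving τ = 0.6.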

Hydrogel optimization by ML

Next, we used ML to explore hydrogel formulations with enhanced adhesive strength, starting with the 180-hydrogel dataset. Among nine ML models benchmarked (Supplementary Tables 5 and 6), Gaussian process (GP)32 and random forest regression (RFR)33 emerged as the most effective base models for predicting Fa from ϕi, achieving low test error while minimizing overfitting (Extended Data Fig. 4).

Based on these models, we implemented sequential model-based optimization (SMBO)33 to propose new hydrogel formulations, taking expected improvement (EI) as the acquisition function. To reduce the number of experimental rounds of hydrogel synthesis and characterization, we designed a batched SMBO workflow, which allows for multiple formulation proposals in a single round.
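The expected improvement criterion has a closed form when the surrogate prediction at a candidate formulation is Gaussian with mean μ and standard deviation σ. The sketch below writes it for maximization; the exploration offset `xi` and the example numbers are assumptions for illustration, not values from this study.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best - xi) * cdf + sigma * pdf


# Two hypothetical candidates compared against an incumbent of 147 kPa.
ei_confident = expected_improvement(mu=140.0, sigma=5.0, best=147.0)
ei_uncertain = expected_improvement(mu=120.0, sigma=40.0, best=147.0)
print(ei_confident, ei_uncertain)
```

The second candidate has a lower predicted mean but a much larger uncertainty, and it receives the larger EI. This is the mechanism by which EI-based selection can favour less explored regions of the formulation space.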

To enhance efficiency, we explored several batched SMBO methods, using trained base models as the hypothetical value providers (P) and GP, RFR, extra trees (ETR)34 and gradient boosting machine (GBM)35 as the EI maximizers (M), with each combination denoted as P-M (for example, RFR-GP). We also implemented traditional Bayesian optimization methods, using kriging believer (GP_KB) (ref. 36), maximum and minimum constant liar (GP_CLmax, GP_CLmin) (ref. 36) and local penalization (GP_LP) (refs. 36,37) as heuristics for determining batch points. For validation, we selected the top 10 formulations (out of 40 proposed per batch), sorted by either EI magnitude or predicted Fa (PRED), as experimental test sets.

All validation followed the same protocol as for the training set to ensure data consistency. Figure 4a shows the true Fa values for formulations proposed by different SMBO methods (Supplementary Table 7). Non-SMBO baselines, GP_enu and RFR_enu, which selected the top five PRED from an enumeration of 10 million random formulations, failed to improve Fa beyond the training data. By contrast, all SMBO methods achieved higher Fa, with GP_KB and RFR-GP as the top performers, and RFR-GP yielding the highest Fa overall.
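The enumeration baselines can be pictured as scoring a large pool of random formulations with a trained model and keeping the best few. The sketch below is a schematic version only: how the random formulations were sampled is not described here, so uniform sampling on the simplex is an assumption, `toy_predict` is a placeholder rather than a fitted GP or RFR, and the pool is far smaller than 10 million.

```python
import heapq
import random


def random_formulation(n_monomers, rng):
    """Draw proportions uniformly from the simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n_monomers)]
    total = sum(draws)
    return [d / total for d in draws]


def top_by_prediction(predict, n_samples, k, rng, n_monomers=6):
    """Stream random formulations and keep the k with the highest prediction."""
    pool = (random_formulation(n_monomers, rng) for _ in range(n_samples))
    return heapq.nlargest(k, pool, key=predict)


def toy_predict(phi):
    # Placeholder surrogate, NOT a fitted model.
    return phi[0] + 0.5 * phi[5]


top_five = top_by_prediction(toy_predict, 10000, 5, random.Random(0))
for phi in top_five:
    print([round(p, 3) for p in phi], round(toy_predict(phi), 3))
```

Because this selection uses only the predicted value and ignores model uncertainty, it tends to stay where the surrogate is already confident, which is one plausible reading of why such baselines did not improve on the training data.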

Fig. 4: ML optimization of underwater adhesive hydrogels.

a, Adhesive strength (Fa) of hydrogels fabricated based on predictions from various models trained on the 180-hydrogel dataset. The model nomenclature and detailed descriptions are provided in the Methods. All adhesion measurements were performed under the same test conditions as the training set: 10 N loading force, 10-s contact time, on a glass substrate and in normal saline. b, UMAP representation of the relationship between Fa and reduced monomer proportions (ϕi), highlighting the formulations proposed by GP_KB and RFR-GP (within the SMBO framework) across different rounds. Symbol size represents the magnitude of adhesive strength. c, SHAP beeswarm plot, ranked by mean absolute SHAP values, showing the influence of ϕi on Fa within the final dataset of 341 samples as analysed by the trained RFR model.

We further tested a ‘warm-start’ strategy using RFR-GP by adding 10 data points generated by RFR to the training set. This variant, termed RFR-GP*, exhibited the highest Fa among all models. Furthermore, formulations chosen through PRED sorting generally outperformed those selected by EI sorting. These findings demonstrate the effectiveness of batched SMBO and suggest the optimal models and strategies for improving workflow efficiency.

The validation outcomes expanded our hydrogel dataset. To assess the exploration abilities of RFR-GP and GP_KB within the SMBO framework, we conducted two additional rounds of ML optimization and experimental validation. Although new high-Fa formulations were identified, none surpassed the maximum Fa achieved in the first round (Extended Data Fig. 5). We suspect that the functionalities of the adopted monomer species may account for the observed performance plateau, and further optimization rounds were not pursued.

The relationship between Fa and ϕi in the final dataset (containing 341 hydrogels) is shown in Fig. 4b, using uniform manifold approximation and projection (UMAP)38 for dimensional reduction (from six to two dimensions). Notably, formulations generated by RFR-GP and GP_KB show minimal overlap with the original 180-hydrogel dataset, indicating extrapolation during optimization. RFR-GP data points are more scattered than those of GP_KB, suggesting broader exploration compared with traditional Bayesian optimization.

To assess the influence of ϕi on Fa, we used SHAP (SHapley Additive exPlanations)39 with the RFR model trained on the final 341-hydrogel dataset. The SHAP summary plot (Fig. 4c) shows that high values of ϕBA and ϕPEA significantly enhance Fa. This is because BA and PEA effectively expel water from the contact interface, and, when neighbouring with ATAC (Supplementary Fig. 11), they could enhance electrostatic interactions with the negatively charged glass surface27,40,41,42,43 (Supplementary Fig. 12). By contrast, high values of ϕHEA, ϕCBEA and ϕAAm tend to reduce Fa. Interestingly, ϕATAC has a dual effect (Supplementary Fig. 13): low levels diminish electrostatic interactions, whereas excessive ϕATAC increases hydrogel swelling, limiting polymer-surface contact and reducing Fa. Therefore, a moderate ϕATAC is crucial.
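The quantity that SHAP attributes to each feature is a Shapley value. For a six-feature problem it is small enough to compute exactly by enumerating coalitions, which the sketch below does for a toy function. This is a conceptual illustration under stated assumptions: it replaces absent features with a single baseline vector, whereas SHAP applied to a random forest typically uses a tree-specific algorithm and a background dataset, and `toy_model` is not the trained RFR.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition take their baseline values.
        return model([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v):
    # One additive term plus one interaction term, for illustration only.
    return 2.0 * v[0] + v[1] * v[2]


contributions = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(contributions)
```

The additive feature receives its full effect (2.0) and the two interacting features split their joint effect equally (0.5 each), so the contributions sum to the difference between the prediction and the baseline prediction.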

These insights, consistent across all three ML rounds, establish a clear design principle for achieving strong underwater hydrogel adhesion to glass surfaces using the selected functional monomers: incorporating BA, PEA and ATAC is key. This combination leverages both hydrophobic effects and electrostatic interactions to enhance underwater adhesion to negatively charged surfaces. The hydrogels with the highest Fa from each ML round, denoted as R1-max, R2-max and R3-max, are exclusively composed of these three monomers (Fig. 5a) and share similar statistical sequence features as indicated by Monte Carlo simulations (Supplementary Figs. 11 and 14).

Fig. 5: Characterization and performance of hydrogels identified by DM (G-max) and ML optimization (R1-max, R2-max and R3-max).

a, Formulations of the gels. b, Photographic images showing the appearance of the gels. c, Uniaxial tensile stress–strain curves of the gels at a stretch rate of 100 mm min−1. d, Fa of hydrogels as a function of contact time (left) and contact force (right) on glass in normal saline. e, Fa of R1-max on various substrates in normal saline. PC, polycarbonate; PMMA, poly(methyl methacrylate); PF, phenol formaldehyde; POM, polyoxymethylene; PP, polypropylene; PTFE, polytetrafluoroethylene; Al, aluminium alloy; Ti, titanium alloy; SS, stainless steel. f, Photographic image showing R1-max (25 mm × 25 mm in size, about 0.4 mm thickness) joining pairs of ceramics (left), glass (middle) and titanium (right) plates under a 1-kg load in normal saline for over 1 year. g, Fa on glass substrate in deionized water, normal saline and artificial seawater (0.7 M NaCl) for hydrogels equilibrated in the corresponding solutions. The asterisk on G-max indicates cohesive failure during testing. Error bars represent the standard deviation of N = 3 measurements.

Performance of super-adhesive hydrogels

We conducted detailed studies on the three top-performing ML-driven hydrogels (R1-max, R2-max and R3-max) and compared them with the best DM-driven hydrogel (G-max) (Fig. 5, Extended Data Fig. 6 and Supplementary Table 8). In their as-prepared state, all gels were transparent and exhibited frequency-independent storage moduli (G′) (Extended Data Fig. 6a), indicating negligible inter- or intramolecular aggregation in DMSO. Despite compositional differences, comparable G′ values suggest similar network topologies.

On equilibration in normal saline, all gels underwent shrinkage (Extended Data Fig. 6c). In contrast to G-max, the ML-driven hydrogels exhibited increased opacity (Fig. 5b), stronger viscoelasticity and higher moduli (Extended Data Fig. 6b). This suggests that their higher hydrophobic BA and aromatic PEA content (Fig. 5a) promotes strong associations of copolymer strands in aqueous media, which facilitate energy dissipation. Moreover, the ML-driven hydrogels exhibited greater mechanical strength and toughness (Supplementary Video 1), as evidenced by the larger area under their stress–strain curves (Fig. 5c). The enhanced viscoelasticity and toughness contributed to their improved adhesion compared with G-max44.

To comprehensively evaluate adhesive performance, we conducted tack tests across a range of test conditions, substrates and solution media. Generally, Fa increased with increasing loading force and contact time, eventually reaching a plateau (Fig. 5d and Extended Data Fig. 7), attributed to enhanced interfacial contact and water drainage at the hydrogel–substrate interface. These plateau values were used to compare maximum adhesion performance across substrates and solutions.

In normal saline, R1-max achieved a maximum Fa exceeding 1 MPa on glass (Fig. 5e) and maintained robust adhesion over 200 attachment–detachment cycles (Extended Data Fig. 8). It also demonstrated strong adhesion to a variety of substrates, including inorganic materials, plastics and metals, as confirmed by lap shear and peeling tests (Extended Data Fig. 9). Notably, R1-max sustained joints of plates made from different materials under a 1-kg shear load for over 1 year, showcasing exceptional durability (Fig. 5f and Supplementary Fig. 15).

In artificial seawater (0.7 M NaCl), all three ML-driven hydrogels exhibited similar levels of strong adhesion (Fig. 5g). In deionized water, however, R2-max outperformed the others, exhibiting cavitation during debonding (Supplementary Fig. 16). These results indicate that small compositional variations can affect adhesion performance in different environments, reflecting a principle observed in nature—adaptability over universal optimization—in which biological systems evolve to perform optimally in their specific environments. This finding underscores the importance of ensuring data consistency in ML optimizations, as hydrogel performance varies with environmental conditions.

To demonstrate practical applicability, several case studies were conducted. R1-max was used to affix a rubber duck to a seaside rock (Extended Data Fig. 10a). Its strong adhesion in saltwater enabled the duck to withstand continuous ocean tides and wave impacts, revealing its suitability for harsh marine environments (Supplementary Video 2). R2-max, exhibiting the highest adhesion in deionized water (Fig. 5g), successfully sealed a 20-mm-diameter hole at the base of a 3-m-tall polycarbonate pipe filled with tap water (Extended Data Fig. 10b). It instantly stopped the high-pressure water leak (Supplementary Video 3), showcasing a level of performance that common adhesives cannot match (Extended Data Fig. 10c). Furthermore, all these hydrogels demonstrated good biocompatibility, as confirmed by subcutaneous implantation in mice (Supplementary Fig. 17), supporting their potential for biomedical applications.

In summary, we introduced a data-driven approach that integrates the extraction of valuable sequence information from proteins, scalable polymer synthesis and iterative ML to address longstanding challenges in the de novo design and development of soft materials. Beyond adhesive hydrogels, this data-driven design framework offers a systematic, scalable end-to-end approach for developing a wide range of functional soft materials. However, challenges remain, primarily because of limitations in monomer diversity, polymer synthesis technologies for controlling monomer sequences to a scale suitable for materials development and dataset scalability. Overcoming these challenges will require expanding modular monomer libraries, advancing polymerization techniques and developing physics-informed ML models that can generalize across sparse, multiscale datasets.

Methods

Hydrogel fabrication

All copolymer gels were synthesized by one-step free-radical copolymerization of monomers with a chemical crosslinker. The crosslinker concentration was fixed at 0.1 mol% relative to the total monomer content to balance the elasticity and deformability of the gels27. DMSO solutions containing functional monomers (total concentration of 2.4 M) with compositions derived from DM and ML (Supplementary Tables 2 and 7), chemical crosslinker (glycerol 1,3-diglycerolate diacrylate, 2.4 mM), and UV initiator (2-oxoglutaric acid, 6 mM) were used. For example, to prepare the G-max gel, 1.819 g of BA, 0.413 g of HEA, 0.264 g of CBEA, 0.561 g of ATAC, 0.441 g of PEA, 8.4 mg of glycerol 1,3-diglycerolate diacrylate and 8.8 mg of 2-oxoglutaric acid were added to a 10 ml volumetric flask, followed by DMSO to reach 10 ml. The precursor solution was transferred to a glove box to remove oxygen, poured into a reaction cell (two 10 cm × 10 cm glass plates, 0.5-mm spacing) and irradiated with UV light (365 nm wavelength, 4 mW cm−2 intensity) for 8 h to form gels (Supplementary Fig. 9a). After UV irradiation, over 99% of the monomers were converted into polymers, as confirmed by NMR (Supplementary Fig. 9b).
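The crosslinker and initiator quantities in the G-max example follow from the stated concentrations and the 10 ml batch volume. The short calculation below checks this arithmetic; the molar masses are computed from the molecular formulas and are assumptions in the sense that they are not given in the text.

```python
def mass_mg(conc_mM, volume_ml, molar_mass):
    """Mass in mg needed for a given concentration (mM) in a given volume (ml)."""
    # mM x ml = micromoles; micromoles x g/mol = micrograms; /1000 -> mg
    return conc_mM * volume_ml * molar_mass / 1000.0


# Molar masses (g/mol) computed from molecular formulas; not stated in the text.
M_CROSSLINKER = 348.35  # glycerol 1,3-diglycerolate diacrylate, C15H24O9
M_INITIATOR = 146.10    # 2-oxoglutaric acid, C5H6O5

crosslinker_mg = mass_mg(2.4, 10.0, M_CROSSLINKER)
initiator_mg = mass_mg(6.0, 10.0, M_INITIATOR)
# 2.4 mM crosslinker relative to 2.4 M total monomer, in mol%
crosslinker_mol_percent = 100.0 * 2.4e-3 / 2.4

print(round(crosslinker_mg, 1), round(initiator_mg, 1), crosslinker_mol_percent)
```

Rounded to one decimal place, the results match the 8.4 mg and 8.8 mg quoted in the recipe, and 2.4 mM against 2.4 M total monomer corresponds to the stated 0.1 mol% crosslinker.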

The as-prepared organogels were then immersed in normal saline (0.154 M NaCl) to remove solvent and residual chemicals, with the saline exchanged every 12 h for at least 2 weeks until swelling equilibrium was reached. Hydrogels were stored in normal saline before use.

Underwater adhesion characterization

The tack test was conducted using a SHIMADZU tester (Autograph AG-X) equipped with Trapezium X software. Hydrogel (0.3–0.8 mm thickness) at swelling equilibrium was adhered to the probe using cyanoacrylate adhesive (super glue). For rapid screening, DM-driven hydrogels from the training round and ML-driven hydrogels from the three optimization rounds were prepared as 15 mm diameter samples. For detailed adhesion studies, 10 mm diameter samples were used to avoid exceeding the force range of the instrument. This change in diameter did not affect the adhesive strength results. The hydrogel on the probe was then immersed in a test solution (for example, normal saline) for 5 min to reach equilibrium. The probe descended towards the substrate at 1 mm min−1 until a loading force of 10 N was reached; this force was maintained for 10 s, and the probe was then withdrawn at 10 mm min−1 (Supplementary Fig. 10). These test conditions were used as a standard protocol unless otherwise specified. For repeated adhesion tests, hydrogels rested underwater for 5 min between cycles, with glass substrates replaced every 100 tests. For prolonged attachment–detachment cycles (Extended Data Fig. 8), a 5 N loading force and a 10 s contact time were used to minimize gel fatigue. Each sample was tested at least three times. For hydrogel dataset construction, the highest adhesive strength recorded for each sample was reported as Fa, representing maximum adhesion performance under the specific conditions.
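The data reduction for the tack test can be sketched as below, assuming that adhesive strength is the peak detachment force divided by the nominal circular contact area (the definition given explicitly for lap shear in this section). The peak forces are invented values, and the force range of the instrument is not stated here, so the last lines only compare the forces implied by a 1 MPa strength for the two sample sizes.

```python
import math


def contact_area_mm2(diameter_mm):
    """Nominal circular contact area."""
    return math.pi * diameter_mm ** 2 / 4.0


def strength_kpa(peak_force_n, diameter_mm):
    """Peak force over contact area; 1 N/mm^2 equals 1000 kPa."""
    return 1000.0 * peak_force_n / contact_area_mm2(diameter_mm)


def dataset_fa(peak_forces_n, diameter_mm):
    """Highest strength over the repeated tests on one sample."""
    return max(strength_kpa(f, diameter_mm) for f in peak_forces_n)


def force_at_strength(strength_kpa_value, diameter_mm):
    """Detachment force (N) implied by a given strength."""
    return strength_kpa_value / 1000.0 * contact_area_mm2(diameter_mm)


# Invented peak forces (N) from three repeats on a 15 mm screening sample.
fa_screen = dataset_fa([15.2, 17.7, 16.4], diameter_mm=15.0)
force_15 = force_at_strength(1000.0, 15.0)
force_10 = force_at_strength(1000.0, 10.0)
print(round(fa_screen, 1), round(force_15, 1), round(force_10, 1))
```

At 1 MPa, a 15 mm sample implies a detachment force of roughly 177 N, against roughly 79 N for a 10 mm sample, which illustrates why a smaller diameter keeps strong gels within a limited force range.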

Lap shear adhesive strength was measured using a universal testing machine (UTM, INSTRON 5965). A hydrogel (10 mm diameter, area A = 78.5 mm2) at swelling equilibrium was sandwiched between two glass slides, pressed at 20 N for 1 min in normal saline. Shear loading was applied at 50 mm min−1. Shear adhesive strength (Fa) was calculated as Fa = Fmax/A, where Fmax is the maximum loading force. For adhesion durability tests (Supplementary Fig. 15), the sandwiched assembly was stored in normal saline for varying durations before testing.

Interfacial toughness was measured by 180° peeling tests using INSTRON 5965. Hydrogel strips (10 mm × 150 mm) were adhered to a glass substrate in normal saline using mild finger pressure, followed by a 2 kg hand roller applied in each direction for 1 min to ensure uniform contact. Polyethylene terephthalate (PET) films (50 μm thickness) served as a stiff backing. Peeling tests were conducted at 50 mm min−1. Interfacial toughness (Gc) was calculated as Gc = 2Fc/w, where Fc is the plateau force and w is the sample width (10 mm).

DM of adhesive proteins

A comprehensive dataset of adhesive proteins was compiled from the NCBI protein database, using ‘adhesive proteins’ as the query keyword. A total of 24,707 protein sequences from 3,822 different organisms (archaea, bacteria, eukaryotes and viruses, as well as artificial proteins) were collected without additional data cleaning. Based on taxonomy annotations, proteins were grouped by species, and a consensus sequence was generated for each species to capture common sequence patterns and reduce the influence of individual variations.

The dataset included 3,111 species, noting that taxonomic overlap results in protein counts not summing to 24,707. For robust analysis, the top 200 species, ranked by the number of distinct proteins identified per species, were selected for further study.

Protein sequences were exported in FASTA format45 using the Bio.SeqIO interface in BioPython46. Consensus sequences were computed with Clustal Omega23, which performs multiple sequence alignment by generating a distance matrix from pairwise alignments, constructing a guide tree based on evolutionary relationships and progressively aligning sequences from the closest to the most distant. The resulting alignment identifies the most frequent residues at each position, yielding a consensus sequence that highlights conserved regions.

Clustal Omega was executed with the command:

    ./clustalo -i "input_file" --outfmt=clu -o "output_aln_file" -v

where “input_file” and “output_aln_file” denote the input protein sequences and output consensus sequences, respectively. The 200 consensus sequences generated were used for subsequent sequence analysis and hydrogel formulation design.
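As an illustration of how a consensus can be read off a finished alignment, the sketch below takes the most frequent residue in each column. It is a simplified stand-in under stated assumptions, not the procedure used by Clustal Omega: gaps are ignored when counting, and ties go to the residue seen first. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

def consensus(aligned, gap="-"):
    """Most frequent non-gap residue per column; a gap if the column holds only gaps."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        # most_common keeps first-seen order among ties
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)

# Toy alignment of three short sequences.
toy_alignment = ["MKT-YG", "MKSAYG", "MRTAY-"]
result = consensus(toy_alignment)
print(result)
```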

ML methods

A six-dimensional feature vector, ϕi = [ϕBA, ϕHEA, ϕCBEA, ϕATAC, ϕAAm, ϕPEA], was used to represent monomer proportions in hydrogels. The target variable was adhesive strength, Fa. To model the relationship between ϕi and Fa, we explored both linear and non-linear ML models (Supplementary Tables 5 and 6).

Linear models included least absolute shrinkage and selection operator regression (Lasso) and ridge regression (Ridge). Non-linear models comprised k-nearest neighbours (KNN), kernel ridge regression (KRR), support vector regression (SVR), random forest regression (RFR), gradient boosting regression with XGBoost (XGB), extra trees regression (ETR) and Gaussian process (GP) with a Matérn kernel32,34. These non-linear models encompass non-parametric (KNN), kernel-based (KRR, SVR and GP) and tree-ensemble (RFR, XGB and ETR) approaches, enabling a comprehensive comparison34,35,47.

XGB was implemented with XGBoost (v.1.6.2), whereas the other models were implemented using Scikit-learn (v.1.0.2) and Scikit-optimize (v.0.9.0). The hyperparameter n_estimators was tuned using Optuna48, whereas the other hyperparameters were optimized using grid search (Supplementary Table 6). A 10-fold cross-validation strategy was used to assess predictive performance on our dataset of 180 hydrogels, with root mean squared error (RMSE) as the metric. GP and RFR gave the lowest RMSE in training and test error under a 90%/10% train/test split (Extended Data Fig. 4), ranking as the top performer and runner-up, respectively, and were subsequently used as the base (surrogate) models.
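The cross-validation protocol can be sketched in plain Python. To stay self-contained, the sketch uses a hand-written KNN regressor rather than the Scikit-learn models named above, and the 180 compositions and their targets are synthetic stand-ins, not the experimental dataset.

```python
import math
import random

def knn_predict(train_x, train_y, x, k=3):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def kfold_rmse(xs, ys, n_splits=10, k=3, seed=0):
    """Shuffle once, split into n_splits folds, return the RMSE of each held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        tr_x = [xs[i] for i in idx if i not in held_out]
        tr_y = [ys[i] for i in idx if i not in held_out]
        sq_err = [(knn_predict(tr_x, tr_y, xs[i], k) - ys[i]) ** 2 for i in fold]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores

# Synthetic data: 180 six-component compositions with a made-up target.
rng = random.Random(1)
xs = []
for _ in range(180):
    raw = [rng.random() for _ in range(6)]
    xs.append([v / sum(raw) for v in raw])
ys = [100.0 * x[0] + 50.0 * x[5] + rng.gauss(0.0, 1.0) for x in xs]

scores = kfold_rmse(xs, ys)
mean_rmse = sum(scores) / len(scores)
print(f"mean 10-fold RMSE: {mean_rmse:.2f}")
```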

To make extrapolative predictions, we evaluated three types of methods:

  1. Exploitation-only enumeration:

    • GP_enu: random sampling in the input space using the fitted GP model.

    • RFR_enu: random sampling in the input space using the fitted RFR model.

    Ten million ϕi vectors were generated from a uniform distribution [0, 1.0) for each monomer, normalized to sum to 1.0. The top five vectors, ranked by predicted Fa from each model, were experimentally validated.

  2. Batched BO:

    • GP_KB: used GP predictions as the hypothetical values for selecting the next data points maximizing EI.

    • GP_CLmax: used the maximum Fa (y_max) from the training set as the hypothetical value for selecting the next data points that maximize EI.

    • GP_CLmin: used the minimum Fa (y_min) from the training set as the hypothetical value for selecting the next data points that maximize EI.

    • GP_LP: incorporated a locally penalized term in EI calculation37.

    GP_KB, GP_CLmax and GP_CLmin simplified the joint q-EI probability calculation36 by substituting a hypothetical value (the GP prediction for GP_KB, a constant for GP_CLmax and GP_CLmin) for each unobserved outcome when selecting the next data points that maximize EI. A batch size of q = 10 was selected.

  3. Batched sequential model-based optimization (SMBO):

    • GP-RFR: GP as the hypothetical value provider and RFR as the EI maximizer.

    • RFR-RFR: RFR as both the hypothetical value provider and the EI maximizer.

    • RFR-GP: RFR as the hypothetical value provider and GP as the EI maximizer.

    • RFR-GP*: RFR-GP with a warm start, in which 10 RFR-generated points were added to the real dataset for GP regression.

    • RFR-ETR: RFR as the hypothetical value provider and ETR as the EI maximizer.

    • RFR-GBM: RFR as the hypothetical value provider and a gradient boosting machine (GBM) as the EI maximizer.

    SMBO iteratively updates the surrogate model while exploring promising data points33. GP and RFR, when used as the hypothetical value providers, balance exploitation and exploration, whereas GP_CLmax and GP_CLmin emphasize exploitation and exploration, respectively49.
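The exploitation-only enumeration (method 1 above) can be sketched as follows. The scoring function `toy_predict` is a hypothetical stand-in for a fitted GP or RFR model, and the sample count is reduced from ten million to 10,000 so that the sketch runs quickly.

```python
import heapq
import random

MONOMERS = ("BA", "HEA", "CBEA", "ATAC", "AAm", "PEA")

def sample_composition(rng):
    """Draw each fraction from U[0, 1) and rescale so the six fractions sum to 1."""
    raw = [rng.random() for _ in MONOMERS]
    total = sum(raw)
    return tuple(v / total for v in raw)

def toy_predict(phi):
    """Hypothetical stand-in for a fitted surrogate's predicted Fa (not the real model)."""
    weights = (1.0, 0.2, 0.5, 0.8, 0.1, 0.9)
    return sum(w * p for w, p in zip(weights, phi)) - 2.0 * (phi[0] - 0.4) ** 2

def top_candidates(predict, n_samples, n_top, seed=0):
    """Score random compositions and keep the n_top highest predictions."""
    rng = random.Random(seed)
    samples = (sample_composition(rng) for _ in range(n_samples))
    return heapq.nlargest(n_top, samples, key=predict)

best = top_candidates(toy_predict, n_samples=10_000, n_top=5)
for phi in best:
    print([round(p, 3) for p in phi], round(toy_predict(phi), 3))
```

Rescaling independent uniform draws does not sample the composition simplex uniformly (a Dirichlet draw would); the sketch follows the procedure as described in the text.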

SMBO (Supplementary Algorithm 1) consists of four components: the true function (f), global domain (X), acquisition function (S) and surrogate model (M). Initial training data (D) are sampled from X, and experimental Fa values are obtained (line 1). The surrogate model M is fitted to D (line 3) and S (EI) identifies the next data point based on predictive uncertainty (line 4). This data point is subsequently validated experimentally (line 5), updating D (line 6) for T iterations (line 2).
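A minimal sketch of this loop on a one-dimensional toy problem is given below. Everything specific in it is an assumption made for illustration: `true_function` stands in for an experiment, the surrogate is a crude distance-weighted model rather than GP or RFR, and EI is maximized by random candidate sampling.

```python
import math
import random

def true_function(x):
    """Hypothetical stand-in for an experiment: a 1-D objective peaking at x = 0.7."""
    return math.exp(-((x - 0.7) ** 2) / 0.02)

def fit_surrogate(data):
    """Toy surrogate M: inverse-distance-weighted mean, with an uncertainty equal to
    the distance to the nearest observation. Not a GP or RFR."""
    def predict(x):
        weights = [1.0 / (abs(x - xi) + 1e-6) ** 2 for xi, _ in data]
        mean = sum(w * yi for w, (_, yi) in zip(weights, data)) / sum(weights)
        return mean, min(abs(x - xi) for xi, _ in data)
    return predict

def expected_improvement(mean, std, best):
    """Closed-form EI for a Gaussian prediction (maximization)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf

def smbo(n_init=4, n_iter=15, n_candidates=500, seed=0):
    rng = random.Random(seed)
    # Initial training data D sampled from the domain X = [0, 1)
    data = [(x, true_function(x)) for x in (rng.random() for _ in range(n_init))]
    for _ in range(n_iter):                      # T iterations
        model = fit_surrogate(data)              # fit M to D
        best = max(y for _, y in data)
        candidates = [rng.random() for _ in range(n_candidates)]
        # Acquisition S (EI) picks the next point
        x_next = max(candidates, key=lambda x: expected_improvement(*model(x), best))
        data.append((x_next, true_function(x_next)))  # evaluate f and update D
    return data

history = smbo()
best_x, best_y = max(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, f(x) = {best_y:.3f}")
```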

EI quantifies expected improvement, \({\int }_{{y}^{* }}^{\infty }(y-{y}^{* })p(y){\rm{d}}y\), over the current best target (y*), where p(y) is the surrogate model's predictive distribution. Owing to the time-intensive nature of hydrogel fabrication (each takes about 2 weeks), GP and RFR were used as the hypothetical value providers, enabling the maximization of the joint q-EI probability without requiring new experiments per iteration. EI maximizers (GP, RFR, ETR and GBM) used hyperparameters from Scikit-optimize (v.0.9.0).
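When the predictive distribution p(y) is Gaussian with mean μ and standard deviation σ, as for a GP, the EI integral has the standard closed form (μ − y*)Φ(z) + σφ(z) with z = (μ − y*)/σ. The sketch below checks that closed form against a direct numerical evaluation of the integral. The numbers are hypothetical and the Gaussian assumption is ours; the text does not state how EI was evaluated for each model.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def ei_closed_form(mu, sigma, y_best):
    """EI = (mu - y*) * Phi(z) + sigma * phi(z), with z = (mu - y*) / sigma."""
    z = (mu - y_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - y_best) * cdf + sigma * normal_pdf(z, 0.0, 1.0)

def ei_numeric(mu, sigma, y_best, n=20_000):
    """Midpoint-rule estimate of the integral, truncated where p(y) is negligible."""
    upper = max(y_best, mu) + 10.0 * sigma
    h = (upper - y_best) / n
    total = 0.0
    for i in range(n):
        y = y_best + (i + 0.5) * h
        total += (y - y_best) * normal_pdf(y, mu, sigma)
    return total * h

# Hypothetical prediction: mean 120, standard deviation 15, current best 130.
closed = ei_closed_form(120.0, 15.0, 130.0)
numeric = ei_numeric(120.0, 15.0, 130.0)
print(closed, numeric)
```

EI stays positive even though the predicted mean is below the current best, because the predictive uncertainty leaves some probability of improvement.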

For GP as the EI maximizer, the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with bound constraints (L-BFGS-B)50 was executed 20 times per iteration (40 iterations total) to identify the point with the highest EI, updating the GP prior. For the other three EI maximizers (RFR, ETR and GBM), 10,000 points were randomly sampled per iteration, because sampling-based search is better suited to tree-ensemble models, which provide no gradient information. SMBO ran for 40 iterations with each EI maximizer, selecting two sets of 10 data points in each iteration: the top 10 ranked by EI values (batch size q = 10), and the top 10 ranked by predicted Fa values for experimental validation. Because these two sets may overlap, the total number of data points may be fewer than 20.
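The two-set selection and its possible overlap can be illustrated with a short sketch. The candidate names and scores are hypothetical, constructed so that three candidates rank in the top 10 on both criteria.

```python
def select_for_validation(candidates, q=10):
    """candidates: list of (name, ei, predicted_fa) tuples.

    Returns the union of the top q by EI and the top q by predicted Fa,
    with duplicates removed, so the result may hold fewer than 2 * q items.
    """
    by_ei = sorted(candidates, key=lambda c: c[1], reverse=True)[:q]
    by_fa = sorted(candidates, key=lambda c: c[2], reverse=True)[:q]
    chosen, seen = [], set()
    for cand in by_ei + by_fa:
        if cand[0] not in seen:
            seen.add(cand[0])
            chosen.append(cand)
    return chosen

# Hypothetical pool of 30 candidates: predicted Fa rises with the index, EI falls with
# it, except that the three highest-index candidates also have a high EI.
pool = []
for i in range(30):
    ei = 5.0 + i if i >= 27 else 1.0 / (i + 1)
    pool.append((f"c{i}", ei, float(i)))

selected = select_for_validation(pool)
print(len(selected))
```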

For BO methods (GP_KB, GP_CLmax, GP_CLmin and GP_LP), the procedure was similar, except that the hypothetical value provider was either GP itself (GP_KB and GP_LP) or constant values (y_max for GP_CLmax and y_min for GP_CLmin).
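The role of the hypothetical value provider can be sketched as a greedy batch loop: after each pick, a hypothetical outcome is appended to the data and the surrogate is refitted before the next pick. The sketch below is a toy under stated assumptions, using a crude distance-weighted surrogate in place of a GP, random candidate sampling to maximize EI, and three made-up observations. The option names `"believer"`, `"max"` and `"min"` are hypothetical labels for the GP_KB, GP_CLmax and GP_CLmin strategies.

```python
import math
import random

def fit_surrogate(data):
    """Toy surrogate (not a GP): inverse-distance-weighted mean and an uncertainty
    equal to the distance to the nearest point in data."""
    def predict(x):
        weights = [1.0 / (abs(x - xi) + 1e-6) ** 2 for xi, _ in data]
        mean = sum(w * yi for w, (_, yi) in zip(weights, data)) / sum(weights)
        return mean, min(abs(x - xi) for xi, _ in data)
    return predict

def expected_improvement(mean, std, best):
    """Closed-form EI for a Gaussian prediction (maximization)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf

def select_batch(data, q, lie="believer", n_candidates=500, seed=0):
    """Pick q points one at a time, adding a hypothetical value after each pick."""
    rng = random.Random(seed)
    y_max = max(y for _, y in data)
    y_min = min(y for _, y in data)
    augmented = list(data)  # real observations plus hypothetical ones
    batch = []
    for _ in range(q):
        model = fit_surrogate(augmented)
        best = max(y for _, y in augmented)
        candidates = [rng.random() for _ in range(n_candidates)]
        x_next = max(candidates, key=lambda x: expected_improvement(*model(x), best))
        hypothetical = {"believer": model(x_next)[0], "max": y_max, "min": y_min}[lie]
        augmented.append((x_next, hypothetical))  # no experiment is run here
        batch.append(x_next)
    return batch

# Made-up observations (x, Fa) on a 1-D domain.
observed = [(0.1, 0.2), (0.5, 0.6), (0.9, 0.3)]
batch_kb = select_batch(observed, q=10, lie="believer")
batch_clmax = select_batch(observed, q=10, lie="max")
batch_clmin = select_batch(observed, q=10, lie="min")
print(sorted(round(x, 2) for x in batch_kb))
```

Only the real observations set `y_max` and `y_min`; the hypothetical values change where the surrogate expects further improvement, which spreads the batch without running new experiments.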

After the first round, 109 validated points expanded the dataset to 289 hydrogels. The second and third rounds added 27 and 25 points, respectively, resulting in a final dataset comprising 341 hydrogels.