Introduction

The ability to direct chemical transformations with atomic/single-bond precision stands as one of the central ambitions of modern chemistry1,2,3. Selectively activating or breaking individual chemical bonds not only deepens our understanding of molecular reactivity but also expands the scope for constructing functional molecular architectures4,5,6,7,8,9. While conventional synthetic methods rely on intrinsic thermodynamic and kinetic control, they remain inherently limited in spatial resolution and often lack the precision necessary to enable single-molecule or bond-specific reactions. Recent developments in on-surface synthesis have demonstrated that surfaces can act as reaction templates and/or catalysts for molecular organizations and reactions, thereby enabling the fabrication of low-dimensional materials with high spatial precision10,11,12,13,14,15,16. Yet, achieving deterministic, bond-level control over reaction pathways, particularly in a scalable and automated fashion, remains a challenge.

Scanning tunneling microscopy (STM) has emerged as a transformative tool in this context, capable of imaging, manipulating, and even inducing on-surface reactions in individual molecules with atomic resolution17,18,19,20,21. STM tip-induced reactions, such as selective bond cleavage22,23,24,25,26, intra- and intermolecular coupling27,28, and carbon skeletal rearrangement6,29,30, have offered insights into molecular reactivity at the single-bond level. These operations are traditionally carried out under expert supervision, relying on iterative manual adjustments and domain-specific knowledge. Although effective for proof-of-concept demonstrations, this mode of operation is limited in scalability, reproducibility, and generalizability across diverse molecular systems.

To overcome these limitations, recent efforts have begun to integrate machine learning with STM techniques, aiming to automate image analysis31,32,33,34, probe conditioning35,36,37,38,39, tip-induced single-molecule reactions40,41, and lateral manipulations42. Specifically, strategies based on deep learning (DL) neural networks have shown promise in navigating parameter spaces of activation operations and predicting reaction outcomes41. However, current progress in deep learning-driven molecular reaction control is constrained to single elementary steps, and extending these capabilities to multi-step reaction pathways is essential for controlling chemical transformations. A milestone toward autonomous chemistry is the development of an intelligent nanofabrication system capable of perceiving and interacting with molecular systems in a goal-directed and data-driven manner.

Here, we report a deep learning-driven STM-based strategy that enables multi-step bond-precise chemical transformations at the single-molecule level. By integrating multiple DL algorithms, including computer vision for reactant and product recognition and deep reinforcement learning to dynamically optimize manipulation parameters for single-bond activation, we demonstrate multi-step C–Br bond dissociations of individual 5,10,15,20-tetra(4-bromophenyl) porphine (TPP-Br4) molecules on a metal surface, and achieve selective reaction pathways towards the fully debrominated species. The DL workflow, trained on STM images and real-time feedback, learns to adaptively induce bond-breaking events with high fidelity and repeatability, without human intervention.

Results and discussion

Autonomous STM workflow and system architecture

To demonstrate the flexibility and autonomy of the single-molecule reaction platform, we selected TPP-Br4 as the model molecular system. This molecule features a four-fold symmetric porphyrin core substituted with four identical bromophenyl groups positioned at its peripheral corners43. The target is to achieve control over different C–Br bond-dissociation pathways in a predefined and automated manner.

As shown in Fig. 1, the four bromine atoms of TPP-Br4 are labeled 1–4 in a counterclockwise direction (in orange, red, blue, and green, respectively). Upon activation, progressive debromination leads to a series of distinct intermediate states: TPP-Br3* (three Br atoms remaining, one radical site), TPP-Br22* (two Br atoms remaining, two radical sites), and TPP-Br3* (one Br atom remaining, three radical sites), ultimately yielding a fully debrominated product denoted as TPP4*. Notably, the TPP-Br22* intermediate can adopt two distinct isomers depending on the relative positions of the remaining bromine atoms: an ortho configuration (TPP-Br22*-ortho) and a para configuration (TPP-Br22*-para).

Fig. 1: AI-driven autonomous bond-selective reaction.

The diagrams at the corners depict chemical structures of four species alongside their corresponding STM topographic maps. The four bromines on TPP-Br4 are color-coded and labeled as 1 (orange), 2 (red), 3 (blue), and 4 (green), respectively. The central diagram illustrates four distinct stepwise bond-dissociation pathways: ortho (1 → 2 → 3 → 4), para (1 → 3 → 2 → 4), ortho* (1 → 4 → 3 → 2), and ortho-Z (1 → 2 → 4 → 3), by which the molecule is converted through radical intermediates to the fully debrominated tetraradical TPP4*. All imaging, site selection, and reaction parameters (bias and setpoint current) are autonomously determined and executed by the AI agent. Scale bar: 1 nm. All images share the same scale bar.

Due to the symmetry of TPP-Br4, a variety of reaction pathways can be accessed when sequentially dissociating C–Br bonds. To capture this diversity while maintaining clarity, we focused on four representative pathways for experimental demonstration. The para pathway involves debrominations in the sequence of 1 → 3 → 2 → 4, leading to the para-type intermediate. In contrast, the ortho and ortho* pathways follow counterclockwise (1 → 2 → 3 → 4) and clockwise (1 → 4 → 3 → 2) pathways, respectively, both yielding the ortho-type intermediate. The fourth pathway, termed ortho-Z, proceeds through a zig-zag sequence of 1 → 2 → 4 → 3, representing a structurally distinct reaction route.

Importantly, the STM manipulation parameters for each bond dissociation event along these paths are determined by a deep reinforcement learning (DRL) algorithm. The trained agent autonomously controls the STM system via a TCP communication interface, enabling atomically precise C–Br bond cleavage without human intervention35. TPP-Br4 molecules were deposited onto a Au(111) single-crystal substrate at room temperature under ultra-high vacuum, yielding a low surface coverage with predominantly isolated adsorbates. STM imaging and tip-induced reactions were conducted at liquid helium temperature.

To perceive and interact with molecular systems, the autonomous workflow comprises three sequential stages: candidate identification, molecular state characterization, and DRL-controlled STM tip manipulation. First, candidate molecules were located using a large-area autonomous STM survey combined with a computer vision algorithm for keypoint detection44. Here, a 30 nm × 30 nm image is recorded to survey the surface and propose candidate sites (Fig. 2a), with regions delineated by red bounding boxes and centers marked in green. Upon detection, the region of interest is tracked, and the scan is zoomed into a 7 nm × 7 nm window for high-resolution imaging and the subsequent tip-induced reaction. As illustrated in Fig. 2b, the STM topography is processed by a cascade of vision algorithms: a keypoint network confirms the molecular center (green dot), semantic segmentation (U-net)45 extracts the molecular outline (yellow dashed contour), and a square-fitting algorithm determines molecular orientation (red dashed box) (see Supplementary Note 2).

Fig. 2: Definition of the RL agent, molecular searching and recognition workflow.

a Candidate molecule searched by STM imaging. Red boxes show detected bounding regions; green dots mark molecular centers. Scale bar: 10 nm. b Molecular orientation determination. Keypoint detection (green dot) and semantic segmentation (yellow dashed contour) yield the molecular center and outline; square fitting (red) defines the orientation of the molecule (7.31°). c The status of the four C–Br bonds defines the molecular state vector \({h}_{s}\). d Example of a new molecular state \({h}_{s}\) = [1 0 0 0] after the tip-induced reaction at site 1. Scale bar: 2 nm for (b–d). e The RL action space is composed of four dimensions: lateral displacement (X, Y) of the tip position, STM bias voltage (V) and tunneling current (I). Tip positions are discretized over 13 sites ((0,0) is centered on the molecule); bias voltages range from 2.4 to 4.2 V in 0.05 V increments; currents range from 0.3 to 1.5 nA in 0.05 nA increments. f SAC policy network architecture. The state \({s}_{t}=[{h}_{s},{h}_{p}]\) concatenates the four-dimensional vector \({h}_{s}\) with the one-dimensional reaction-path code \({h}_{p}\); the network outputs an action distribution \({\pi }_{\theta }(a|{s}_{t})\) from which \({a}_{t}\) is sampled. g Representative tunneling-current (I–t, blue, rolling average, window size: 200 ms) and bias voltage (V–t, red) traces for unreacted, reacted, and degraded outcomes. The blue arrow indicates the setpoint ramp; black arrows mark sudden current changes during the bias hold. Insets show STM images before and after each manipulation. Scale bars: 2 nm. Source data are provided as a Source Data file.

The four C–Br groups are then localized from the fitted square and the molecular centroid. In Fig. 2c, the gray square represents the molecular framework, with its four vertices corresponding to bromines. Four smaller patches of identical size centered at each vertex define the evaluation regions; these were chosen to capture the distinct topographic signatures of intact versus dissociated bromine atoms. A convolutional neural network (CNN) with a ResNet-18 backbone classifies each site's dissociation status based on these patches. To ensure high-precision tip manipulation, the keypoint detection, semantic segmentation, and CNN modules were each trained on separate datasets following data-augmentation protocols. These vision algorithms demonstrate high accuracy and rapid inference.
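The geometric step described above can be illustrated with a minimal sketch. It assumes the square fit returns a center, a side length, and an in-plane angle, and that the evaluation patches are axis-aligned squares; the function names, the vertex ordering, and the patch size are hypothetical and are not taken from the original implementation.

```python
import math

def square_vertices(cx, cy, side, angle_deg):
    """Return the four vertices of a fitted square, counterclockwise.

    Assumption: angle_deg is the rotation of the square's edges relative
    to the image x-axis, so the vertices lie 45 degrees off that direction.
    """
    half_diag = side / math.sqrt(2.0)
    vertices = []
    for k in range(4):
        theta = math.radians(angle_deg + 45.0 + 90.0 * k)
        vertices.append((cx + half_diag * math.cos(theta),
                         cy + half_diag * math.sin(theta)))
    return vertices

def patch_bounds(vertex, patch_size):
    """Axis-aligned evaluation patch (xmin, ymin, xmax, ymax) centered on a vertex."""
    x, y = vertex
    half = patch_size / 2.0
    return (x - half, y - half, x + half, y + half)

if __name__ == "__main__":
    # Illustrative numbers only: a 1.8 nm square rotated by 7.31 degrees.
    for v in square_vertices(0.0, 0.0, 1.8, 7.31):
        print(patch_bounds(v, 0.6))
```

Each patch would then be cropped from the topograph and passed to the site classifier; which vertex is labeled as site 1 is a convention that this sketch does not fix.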

Finally, the deep reinforcement learning agent determines the manipulation operation parameters by combining the detected molecular state with a pre-defined reaction-path encoding. The status of the four C–Br sites is encoded as a 4-bit binary state vector [b₁, b₂, b₃, b₄], where 0 denotes an intact C–Br bond, and 1 denotes dissociation. Thus, the TPP-Br4 molecule is represented as [0, 0, 0, 0], whereas after a dissociation of site 1, the state becomes [1, 0, 0, 0] (Fig. 2d).

Reinforcement learning for autonomous bond dissociation

Selective C–Br bond cleavage by tip manipulation constitutes a sequential decision-making problem under uncertainty, naturally cast as a Markov decision process. Traditional fixed-policy strategies lack the flexibility to generalize across varying tip conditions, environmental noise levels, and sparse, subjective rewards assessed by human operators. Instead, an RL framework enables the agent to learn an optimal, feedback-driven policy directly from interactions with the molecular system. Among modern DRL algorithms, soft actor-critic (SAC) offers distinct advantages that align with the demands of autonomous STM manipulation46. At each discrete time step t, the agent receives a combined state vector \({s}_{t}=[{h}_{s},{h}_{p}]\), where \({h}_{s}\in {\{0,1\}}^{4}\) encodes the four C–Br bond dissociation statuses and \({h}_{p}\in \{0,1\}\) specifies the target reaction path (0 = ortho, 1 = para). For example, to progress from TPP-Br3* (\({h}_{s}=\) [1 0 0 0]) to the ortho-type intermediate TPP-Br22*-ortho (\({h}_{s}=\) [1 1 0 0]), the input is \({s}_{t}\) = [1 0 0 0 0]. Conversely, if the target is the para-type intermediate, then \({s}_{t}\) = [1 0 0 0 1]. Owing to the four-fold symmetry of TPP-Br4, training focuses on the ortho and para pathways; the ortho* and ortho-Z pathways are generated via symmetry operations on the ortho trajectory (see Supplementary Note 3 for more details).
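The state encoding and the four site sequences can be written down compactly. The following is a minimal sketch in which the site orders are those given in the text, while the helper names (`bond_state`, `agent_input`, `next_target`) are hypothetical; how ortho* and ortho-Z are mapped onto the two-valued path code is described in Supplementary Note 3 and is not reproduced here.

```python
# Site sequences for the four demonstrated pathways (sites labeled 1-4).
PATHS = {
    "ortho":   [1, 2, 3, 4],
    "para":    [1, 3, 2, 4],
    "ortho*":  [1, 4, 3, 2],
    "ortho-Z": [1, 2, 4, 3],
}

def bond_state(cleaved_sites):
    """Encode h_s: 1 for a dissociated C-Br bond, 0 for an intact one."""
    return [1 if site in cleaved_sites else 0 for site in (1, 2, 3, 4)]

def agent_input(h_s, path_code):
    """Concatenate h_s with the path code h_p (0 = ortho, 1 = para)."""
    if path_code not in (0, 1):
        raise ValueError("path_code must be 0 (ortho) or 1 (para)")
    return list(h_s) + [path_code]

def next_target(path_name, h_s):
    """Return the next site to cleave along a path, or None if all are cleaved."""
    for site in PATHS[path_name]:
        if h_s[site - 1] == 0:
            return site
    return None

if __name__ == "__main__":
    h = bond_state({1})
    print(agent_input(h, 0), next_target("ortho", h))
    print(agent_input(h, 1), next_target("para", h))
```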

As presented in Fig. 2f, the policy network comprises three fully connected hidden layers of 256 units (Supplementary Fig. 3). From \({s}_{t}\), the network outputs a 4-dimensional action vector \({{{\bf{a}}}}_{t}=[X,Y,V,I]\), specifying the STM tip's lateral offset relative to the molecular center and its operation parameters. The exploration ranges for the bias voltage and tunneling current were determined based on both previous literature and preliminary tests on TPP-Br4/Au(111), which established the operational window between non-reactive and destructive regimes. As depicted in Fig. 2e, tip positions are discretized over 13 candidate sites, with the molecular center defined as the origin (0,0). The bias voltage and tunneling current are discretized over ranges of 2.4–4.2 V (ΔV = 0.05 V) and 0.3–1.5 nA (ΔI = 0.05 nA), respectively. These ranges are selected to balance the flexibility of tip manipulation against the safety and structural integrity of both the tip and the sample29,41,47.
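To make the size of this action space concrete, the grids can be enumerated directly. This is a minimal sketch: the ranges and step sizes are those stated above, whereas the `snap` helper, which maps a squashed network output in [-1, 1] to the nearest grid value, is an assumed discretization scheme rather than the one used in the original work.

```python
def grid(start, stop, step, ndigits=2):
    """Inclusive, evenly spaced grid with rounding to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, ndigits) for i in range(n + 1)]

BIAS_V = grid(2.4, 4.2, 0.05)      # bias voltages in V
CURRENT_NA = grid(0.3, 1.5, 0.05)  # tunneling currents in nA
N_SITES = 13                       # candidate tip positions

def snap(u, values):
    """Map u in [-1, 1] to the nearest entry of a grid (assumed scheme)."""
    u = max(-1.0, min(1.0, u))
    index = round((u + 1.0) / 2.0 * (len(values) - 1))
    return values[index]

N_ACTIONS = N_SITES * len(BIAS_V) * len(CURRENT_NA)

if __name__ == "__main__":
    print(len(BIAS_V), len(CURRENT_NA), N_ACTIONS)
```

With 37 bias values, 25 current values, and 13 sites, the agent faces 12,025 distinct parameter combinations per step, which illustrates why only a small fraction of early exploratory actions are rewarded.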

Upon selecting the action \({a}_{t}\), the sequence is executed via API calls to the STM controller. The tip is first moved to the target site; the setpoint current is then ramped to the desired value for 2 s to ensure stable tip-molecule separation. Subsequently, feedback is disabled, and the bias is swept from the imaging voltage (1.0 V) to the target value over 4.5 s and held for 8 s to initiate the bond cleavage. This process completes one fully autonomous debromination cycle, integrating imaging, vision-based state recognition, and DRL-driven tip manipulation.
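The timing of one manipulation can be expressed as a simple piecewise bias profile. The sketch below uses the durations quoted above and assumes a linear sweep, which the text does not specify; the constant and function names are hypothetical, and no controller calls are shown.

```python
IMAGING_BIAS_V = 1.0   # bias during imaging (V)
SETPOINT_RAMP_S = 2.0  # current-setpoint ramp before feedback is disabled (s)
SWEEP_S = 4.5          # duration of the bias sweep (s)
HOLD_S = 8.0           # duration of the bias hold (s)

def bias_at(t, target_v):
    """Bias (V) at time t (s), measured from the moment feedback is disabled.

    Assumption: the sweep from the imaging bias to the target is linear.
    """
    if t < 0.0 or t > SWEEP_S + HOLD_S:
        raise ValueError("t lies outside the manipulation window")
    if t < SWEEP_S:
        return IMAGING_BIAS_V + (target_v - IMAGING_BIAS_V) * t / SWEEP_S
    return target_v

def sampled_waveform(target_v, dt=0.5):
    """List of (time, bias) samples covering the sweep and the hold."""
    n = int(round((SWEEP_S + HOLD_S) / dt))
    return [(i * dt, bias_at(min(i * dt, SWEEP_S + HOLD_S), target_v))
            for i in range(n + 1)]

if __name__ == "__main__":
    total = SETPOINT_RAMP_S + SWEEP_S + HOLD_S
    print("time per manipulation (s):", total)
    print(sampled_waveform(3.0)[:4])
```

Under these assumptions a single manipulation occupies 14.5 s of tip time, excluding tip travel and the subsequent rescan.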

We continuously recorded the tunneling current (I) and bias voltage (V) throughout each STM manipulation operation to obtain further details of the reaction outcomes. As exemplified in Fig. 2g, the solid blue trace depicts the tunneling current versus time, while the dashed red trace shows the applied bias versus time. The blue arrow marks the initial current ramp. Upon disabling the feedback loop, the current varies in response to the bias sweep. A tunneling current that remains roughly constant during the bias sweep implies an unreacted event (Unreact.). Conversely, a marked drop in I during the bias sweep is indicative of possible C–Br dissociation (React.). Under relatively aggressive manipulation parameters, molecules may suffer irreversible degradation (i.e., the molecule loses its recognizable structural signature in the STM image), which is frequently accompanied by multiple current jumps and increased instability. The inset panels in Fig. 2g present representative outcomes for a single molecule subjected to different cases. Notably, the optimal parameter window that yields a successful reaction tends to be fairly narrow41.

After each manipulation process, the system automatically rescans the same molecule to determine its updated state \({h}_{s}\). We employ a tiered reward function \(r\)(\({s}_{t}\), \({s}_{t+1}\)) that assigns positive reinforcement only when the intended C–Br cleavage occurs, while unreacted events, bond dissociation at wrong sites, unintended multi-bond dissociations (Multi-React.), or failure to detect the molecule due to sample damage or molecule disappearance (Bad) incur distinct negative penalties (see Supplementary Note 4 for a detailed explanation of the reward function). Each tuple (\({s}_{t}\), \({a}_{t}\), \({r}_{t}\), \({s}_{t+1}\)) forms a trajectory fragment that is stored in an off-policy replay buffer. The updated state then serves as \({h}_{s}\) for the next manipulation step.
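The structure of such a tiered reward can be sketched as follows. The outcome categories follow the text, but the numerical reward values are placeholders chosen only for illustration (the actual values are defined in Supplementary Note 4), and the function names and buffer size are hypothetical.

```python
import random
from collections import deque

# Placeholder values for illustration only; not the values used in the paper.
REWARDS = {
    "success": 1.0,
    "no_reaction": -0.1,
    "wrong_site": -0.5,
    "multi_react": -0.7,
    "bad": -1.0,
}

def classify(h_before, h_after, target_site):
    """Classify a manipulation from the bond states before and after.

    h_after is None when the molecule can no longer be detected.
    """
    if h_after is None:
        return "bad"
    newly_cleaved = [i + 1 for i in range(4)
                     if h_before[i] == 0 and h_after[i] == 1]
    if not newly_cleaved:
        return "no_reaction"
    if len(newly_cleaved) > 1:
        return "multi_react"
    return "success" if newly_cleaved[0] == target_site else "wrong_site"

def reward(h_before, h_after, target_site):
    return REWARDS[classify(h_before, h_after, target_site)]

replay_buffer = deque(maxlen=10_000)  # off-policy buffer of (s, a, r, s_next)

def store(s, a, r, s_next):
    replay_buffer.append((s, a, r, s_next))

def sample(batch_size):
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))

if __name__ == "__main__":
    r = reward([1, 0, 0, 0], [1, 1, 0, 0], target_site=2)
    store([1, 0, 0, 0, 0], (0.0, 1.0, 2.8, 0.9), r, [1, 1, 0, 0, 0])
    print(r, sample(1))
```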

Tip manipulation on a given molecule is terminated either when all four C–Br bonds are cleaved (yielding \({h}_{s}\,\)= [1,1,1,1]) or when the molecule is no longer identifiable by the recognition algorithm. Once all candidate molecules within the current 30 nm × 30 nm image frame have been processed, the system zooms out and initiates a new large-area scan of the same size, enabling continuous operation on further candidates.

Selective C–Br single-bond reaction by STM tip manipulation constitutes a high-dimensional, low-tolerance Markov decision process, in which only a narrow range of operation parameters yields the desired outcome. Consequently, during early-stage exploration, the vast majority of experiences incur negative rewards, often leading to spurious convergence48, i.e., the policy collapses to minimal-risk actions that avoid penalties but forgo higher rewards. To address this issue, we adopt invariant transform experience replay (IT-ER)49, which exploits the D4h symmetry of the TPP-Br4 molecule (fourfold rotation about the surface normal and four vertical mirror planes) to generate mathematically equivalent virtual trajectories and thereby enrich the replay buffer (see Supplementary Note 3).
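The idea of generating virtual trajectories by symmetry can be illustrated for the rotational part alone. The sketch below assumes that sites 1–4 are labeled counterclockwise, so that a +90° rotation about the molecular center maps site k to site k+1 (cyclically) and a tip offset (x, y) to (−y, x), with bias and current unchanged. The function names are hypothetical, the mirror operations are omitted, and the rules by which the original implementation assigns targets and rewards to transformed transitions (Supplementary Note 3) are not reproduced.

```python
def rotate_state(h):
    """Rotate the bond-state vector by +90 degrees: site k -> site k+1 (cyclic)."""
    return [h[(i - 1) % 4] for i in range(4)]

def rotate_xy(xy):
    """Rotate a lateral tip offset by +90 degrees about the molecular center."""
    x, y = xy
    return (-y, x)

def rotate_site(site, quarter_turns=1):
    """Relabel a site index (1-4) after a number of +90 degree rotations."""
    return (site - 1 + quarter_turns) % 4 + 1

def virtual_transitions(h, xy, bias_v, current_na, target, h_next):
    """Return the three rotated copies of one recorded transition."""
    copies = []
    h_k, xy_k, h_next_k = list(h), tuple(xy), list(h_next)
    for k in (1, 2, 3):
        h_k = rotate_state(h_k)
        xy_k = rotate_xy(xy_k)
        h_next_k = rotate_state(h_next_k)
        copies.append({
            "h": h_k, "xy": xy_k, "bias_v": bias_v, "current_na": current_na,
            "target": rotate_site(target, k), "h_next": h_next_k,
        })
    return copies

if __name__ == "__main__":
    for c in virtual_transitions([0, 0, 0, 0], (1, 1), 2.5, 0.8, 1, [1, 0, 0, 0]):
        print(c)
```

Each recorded transition therefore yields up to three additional rotated experiences, which is one way a sparse set of successful events can be multiplied in the replay buffer.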

The training process of the DRL agent is summarized in Fig. 3. Training completed 341 episodes, comprising 948 single-bond manipulation processes, over a continuous 36 h run. Here, an episode corresponds to a complete reaction attempt on a single molecule along the predefined debromination sequence. An episode terminates when either all target C–Br bonds have been selectively cleaved to yield TPP4*, or when an irreversible failure (undesired multi-bond dissociation, incorrect-site activation, or molecule degradation) occurs. Throughout the training period, the agent exhibited steady performance improvement. Figure 3a, c, e, g, i, and k display the frequency of tip positions and the resulting outcomes after manipulation operations. Green markers indicate successful reactions; red markers indicate failed reactions, including undesired reaction sites, multiple-bond activation, and sample damage; and gray markers indicate no-reaction outcomes, i.e., cases where the molecular state remains unchanged (\({s}_{t}={s}_{t+1}\)). In these tip-position panels, a site is colored gray when such no-reaction outcomes constitute the dominant event category at that position. The agent finally converged on the sites nearest each targeted C–Br bond for bond dissociation. The positional preference can be attributed to a substantial decrease in the probability of tunneling electron-induced bond excitation with increasing lateral distance from the tip26,47,50,51,52. Consequently, positioning the tip directly above a given C–Br bond maximizes the local bond activation probability.

Fig. 3: Training performance of the RL agent.

a–l Statistics for different reaction outcomes. Panels a, c, e, g, i, and k summarize the tip positions and their frequency across the 13 candidate sites (circle size proportional to frequency). The gray shaded background represents the TPP-Br4 molecular framework. Panels b, d, f, h, j, and l show the choices and frequency of bias voltage and setpoint current. Green spots stand for the desired reactions, red for failed reactions (including undesired reaction sites, multiple-bond reaction, and molecule degradation), and gray for no-reaction cases. The schematic on the left indicates the target reaction: a, b TPP-Br4 → TPP-Br3*; c, d TPP-Br3* → TPP-Br22*-ortho; e, f TPP-Br3* → TPP-Br22*-para; g, h TPP-Br22*-para → TPP-Br3*; i, j TPP-Br22*-ortho → TPP-Br3*; k, l TPP-Br3* → TPP4*. m Evolution of the rolling average (window size: 20 episodes) of the episode rewards. The arrow marks a typical episode at which the tip condition changes. n Upon fixing the optimal tip location, the optimal manipulation parameters (bias and current) corresponding to steps (b, d, f, h, j, and l) are extracted. Source data are provided as a Source Data file.

Statistical analysis in Fig. 3b, d, f, h, j, and l reveals that progressively higher bias voltages are required for bond cleavage as successive Br atoms are detached. During the initial TPP-Br4 → TPP-Br3* transition, the agent tends to apply a bias around 2.5 V, while the setpoint current is relatively scattered at this stage. Once the first C–Br bond is cleaved, the optimal bias shifts upward to ≈2.8 V for the TPP-Br3* → TPP-Br22*-ortho step, whereas targeting the para-type intermediate (TPP-Br3* → TPP-Br22*-para) corresponds to a bias around 2.6 V. As the molecule proceeds through TPP-Br22*-para → TPP-Br3*, the agent centers its bias near 2.85 V. The optimal bias for the TPP-Br22*-ortho → TPP-Br3* conversion is ≈3.0 V, and the bias ultimately reaches 3.2 V for the final TPP-Br3* → TPP4* cleavage. Although these bias windows correlate strongly with high success rates, occasional failures occur if the tip is not positioned accurately. Tip operation sites near the molecular center show a very low probability of selective single-bond cleavage. Notably, in the final step (TPP-Br3* → TPP4*), high-success sites show increased spatial dispersion, indicating a weaker spatial dependence when only a single C–Br bond remains.

Figure 3m plots the rolling average of the episode rewards during the training process. The reward shows an overall increasing trend during training, reflecting enhanced bond-dissociation efficiency. At several points, sudden drops in reward appear. These events coincide with changes in the tip condition that alter the response of the molecule. Through continued training, the agent successfully modifies its policy to recover high reward performance. This characteristic disturbance-recovery pattern demonstrates the model's adaptive capability in response to dynamic tip-sample conditions. Furthermore, as shown in Fig. 3n, we applied a Gaussian fit to the distributions of successful events at the optimal tip position. The required bias voltage increases progressively with each successive debromination step. The spread of the bias distribution is noticeably narrower than that of the tunneling current, indicating that the bias voltage serves as the primary control parameter governing C–Br bond cleavage. More statistical details are given in Supplementary Note 5. To quantify the agent's performance, we specifically analyzed the final 30% of all recorded experiences. The reaction success rates were found to be 51.60% for TPP-Br4 → TPP-Br3*; 55.65% for TPP-Br3* → TPP-Br22*-ortho; 50.00% for TPP-Br3* → TPP-Br22*-para; 56.00% for TPP-Br22*-para → TPP-Br3*; 70.21% for TPP-Br22*-ortho → TPP-Br3*; and 78.95% for the final TPP-Br3* → TPP4* transition. The definition and statistics of the reported success rates are given in Supplementary Note 5.2. We have also performed density functional theory (DFT) calculations and scanning tunneling spectroscopy to provide complementary insight into the electronic properties of the debrominated species (see results in Supplementary Note 7).
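The two summary statistics used above, a rolling average of episode rewards and a success rate over the final 30% of experiences, can be computed as in the following minimal sketch. The window of 20 episodes and the 30% fraction are taken from the text; the treatment of the first incomplete windows and the exact success-rate definition (given in Supplementary Note 5.2) are assumptions, and the function names are hypothetical.

```python
def rolling_mean(values, window=20):
    """Trailing rolling average; early entries average over the samples seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out

def success_rate_tail(outcomes, fraction=0.30):
    """Percentage of 'success' labels among the last `fraction` of outcomes."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    n_tail = max(1, round(len(outcomes) * fraction))
    tail = outcomes[-n_tail:]
    return 100.0 * sum(1 for o in tail if o == "success") / len(tail)

if __name__ == "__main__":
    rewards = [-1.0, -0.5, 0.2, 1.0, 1.5, 0.8]
    print(rolling_mean(rewards, window=3))
    labels = ["no_reaction"] * 14 + ["success", "bad", "success",
                                      "no_reaction", "success", "wrong_site"]
    print(success_rate_tail(labels))
```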

Programmable multi-step reaction pathways

After training the DRL agent, we are able to achieve desired reaction pathways in a selective and automated fashion. The ortho pathway is shown as an example in Fig. 4a–g. In each STM topograph, red markers indicate the algorithm-specified tip positions, and the bias and setpoint current values chosen by the agent are indicated. Following each successful tip manipulation operation, a bright spot emerges adjacent to the porphyrin core, and the corresponding topography is altered, which is characteristic of a successful C–Br dissociation. In the final STM image (Fig. 4g), a square-shaped molecule is clearly resolved, with four Br adatoms adsorbed nearby on the Au(111) surface. Figure 4a displays the corresponding tunneling current and bias voltage traces during the autonomous reaction. Black arrows highlight the characteristic abrupt drops in current that signal bond cleavage events. The success rate for the ortho pathway without molecule degradation was 35.4%, while the para pathway achieved a comparable success rate of 29.2%. The reaction paths of para, ortho* and ortho-Z are illustrated in Fig. 4h–m, n–s, and t–y, respectively. Throughout these experiments, slight rotations and translations of the intermediates were occasionally observed, and liberated Br atoms sometimes migrated beyond the scan window.

Fig. 4: Achieving different reaction pathways.

a I–t (solid blue) and V–t (dashed red) traces for successive C–Br cleavages along the ortho pathway. The I–t traces are presented after rolling averaging (window size: 200 ms). Black arrows mark the abrupt current drops that signal bond dissociation events. b–y Sequential STM images of a single TPP-Br4 molecule undergoing bond dissociations via the b–g ortho, h–m para, n–s ortho* and t–y ortho-Z pathways. In each panel, red dots denote the tip positions; the bias voltage and setpoint current applied for each manipulation are denoted in the corresponding image. The schematics indicate the molecular state after each manipulation. Scale bars in b–y: 1 nm. Source data are provided as a Source Data file.

Although the agent was not explicitly trained for targets of multi-bond dissociations, experimental results provide valuable insights for developing enhanced tip-manipulation strategies. The multi-bond dissociation events shown in Fig. 5 are a subset of the trajectories contained in Fig. 3 and represent outcomes in which more than one C–Br bond is cleaved within a single manipulation event. Figure 5a–f summarizes the statistical analysis of reaction outcomes in response to operation parameters. Figure 5g–i presents representative I–t and V–t traces extracted from these statistics. Specifically, Fig. 5g corresponds to a typical case from the data summarized in Fig. 5a, d where TPP-Br4 molecules may undergo cleavage of two C–Br bonds to yield TPP-Br22*-ortho. The corresponding current-time (I–t) trace displays two characteristic drops in tunneling current (Fig. 5g). No para-type intermediate (TPP-Br22*-para) formation was detected under Multi-React. conditions. Similarly, we also observed events of transforming TPP-Br22*-ortho to the fully debrominated TPP4* (Fig. 5b, e) which typically required bias voltages of ≈3.2 V, higher than the 2.8 V used for the cases of transforming TPP-Br4 to TPP-Br22*-ortho. Although I–t curves for this reaction did not exhibit clear two-step features, comparison with the V–t trace reveals that the tunneling current drop often occurs before the bias reaches its target, a signature frequently associated with Multi-React. events (Fig. 5h).

Fig. 5: Representative multiple-bond reaction (Multi-React.) statistics and events.

Statistical distributions: a–c indicate tip-position frequencies, and d–f show the frequencies of bias voltage and setpoint current values. g–i The corresponding representative I–t (blue) and V–t (red) traces recorded during the respective Multi-React. events. The I–t traces are presented after rolling averaging (window size: 200 ms). Source data are provided as a Source Data file.

Lastly, Fig. 5c, f illustrates a case of simultaneous dissociation of four C–Br bonds. In Fig. 5i, the I–t trace displays pronounced instability accompanied by a multi-step gradual decline of I, characteristic of uncontrolled molecule degradation. Additional examples and statistical analyses of the dissociation of multiple C–Br bonds are provided in Supplementary Note 5.1. Specifically, the central position fails to provide either single-bond or multi-bond selectivity. Bias voltages that trigger multiple-bond reactions generally exceed those associated with precise single-bond dissociations. Overall, although tip-induced excitation decays with distance, remote C–Br bond cleavage still occurs with appreciable probability under sufficiently high biases.

In conclusion, we have demonstrated an AI-driven autonomous molecular activation platform. By integrating keypoint detection, semantic segmentation, and a DRL agent with STM-based tip-induced reactions, the platform executes selective single-bond dissociation with atomic precision and high efficiency. Crucially, we achieved controllable reaction pathways (including ortho, para, ortho*, and ortho-Z), demonstrating the platform's ability to access diverse bond-selective routes. The three-stage workflow of candidate identification, molecular-state characterization, and DRL-driven tip activation reliably converged on the optimal tip positions and parameters for single-bond cleavage. The tunneling current (I–t) and bias voltage (V–t) traces provided clear dissociation signatures.

These results provide a contribution to AI-driven atomically precise single-bond chemistry and nanofabrication, bridging advanced computer-vision techniques, DRL algorithms, and surface chemistry. The demonstrated autonomy, selectivity, and adaptability support future efforts toward scalable, atom-by-atom manufacturing of molecular nanostructures20,53. Looking forward, we envision extending this strategy to diverse molecular systems and substrates through model retraining. We view the present work as a foundation toward such expansion. Such advances will accelerate the automation of single-molecule engineering and broaden the applicability to nanofabrication and quantum-materials design and construction.

Methods

STM measurements

All experiments were performed in a custom low-temperature STM system (Boson Co. Ltd., Beijing) operating under ultra-high vacuum (UHV; base pressure ≈1 × 10⁻¹⁰ mbar). The Au(111) single crystals were cleaned by several cycles of Ar+ sputtering and annealing under UHV conditions until large terraces separated by monatomic steps were achieved. The molecule (TPP-Br4) was commercially obtained from Bidepharm with a stated purity of 95%. Imaging and tip-induced manipulation were carried out at liquid-helium temperature (5.3 K) in constant-current mode using a Pt–Ir tip. Feedback control and the application of bias voltage and tunneling-current setpoint pulses were managed by a Nanonis controller (SPECS GmbH) via its API. The Nanonis software provides a TCP programming interface that enables customized external programs to communicate with the software via a dedicated service port for STM control. Through this interface, the STM can be operated by sending byte-stream commands that conform to the Nanonis TCP protocol. Nanonis supplies the required protocol specifications and programming interface to facilitate this process. Further details and instrumental settings are provided in Supplementary Note 1.2.

Image segmentation

We employed a standard U-net architecture for the image segmentation task. The encoder consists of four convolutional and downsampling layers that progressively extract multi-scale features from the input STM images. These features are passed through a bottleneck before being reconstructed in the decoder, which uses four transposed convolutional layers to up-sample and generate the final segmentation mask. Skip connections between corresponding encoder and decoder layers help preserve both foreground and background information. The model was trained on a dataset comprising 16 original STM images of varying quality and their manually labeled masks. During training, the loss steadily decreased from 0.3661 to 0.0383 over 25 epochs, indicating stable and effective convergence. The final model achieved a pixel-wise accuracy of 98.68%. Details of the network architecture are provided in Supplementary Note 1.1.

SAC algorithm

SAC is an off-policy actor-critic algorithm that optimizes a stochastic policy under the maximum-entropy objective, which augments the expected return with an entropy bonus to encourage exploration and robustness. The objective is

$$J(\pi )={\mathbb{E}}\left[{\sum }_{t=0}^{\infty }{\gamma }^{t}\left({r}_{t}+\alpha {\mathcal{H}}\left(\pi \left(\cdot |{s}_{t}\right)\right)\right)\right]$$
(1)

where \({r}_{t}\) is the reward at time \(t\), \(\gamma \in (0,1)\) is the discount factor, \({\mathcal{H}}\left(\pi \left(\cdot |{s}_{t}\right)\right)\) is the policy entropy, and \(\alpha > 0\) is the temperature that trades off reward and entropy. \(\pi\) represents the stochastic policy, while \({\mathbb{E}}\) denotes the expectation taken over trajectories sampled according to this policy.
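For a single finite trajectory, the quantity inside the expectation of Eq. (1) can be evaluated directly. The sketch below is illustrative only: the function name `soft_return` and the numerical values are hypothetical, and the infinite sum is truncated at the trajectory length.

```python
def soft_return(rewards, entropies, gamma, alpha):
    """Truncated sum over t of gamma**t * (r_t + alpha * H_t) for one trajectory."""
    if len(rewards) != len(entropies):
        raise ValueError("need one entropy value per reward")
    total = 0.0
    for t, (r, h) in enumerate(zip(rewards, entropies)):
        total += gamma ** t * (r + alpha * h)
    return total


# Three-step toy trajectory (made-up numbers).
value = soft_return([1.0, 0.0, 2.0], [0.5, 0.5, 0.1], gamma=0.9, alpha=0.2)
# 1.0*(1.0 + 0.1) + 0.9*(0.0 + 0.1) + 0.81*(2.0 + 0.02) = 2.8262
```

Setting `alpha=0` recovers the ordinary discounted return, which makes explicit that the temperature alone controls how strongly exploration is rewarded.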

We adopted the SAC formulation with two Q-functions (clipped double-Q) and no separate value network. Let the critics be \({Q}_{{\theta }_{1}}\), \({Q}_{{\theta }_{2}}\) with target parameters \({\bar{\theta }}_{1}\), \({\bar{\theta }}_{2}\), and let the stochastic actor \({\pi }_{\phi }\left(a|s\right)\) be parameterized as a squashed Gaussian via the reparameterization trick, \(a={f}_{\phi }\left(\varepsilon ;s\right)\) with \(\varepsilon \sim {\mathcal{N}}\left(0,I\right)\), where \({f}_{\phi }\) applies a \(\tanh\) squashing to the Gaussian sample. The target for the critics is

$${y}_{t}={r}_{t}+\gamma \,{\mathbb{E}}_{{a}_{t+1}\sim {\pi }_{\phi }\left(\cdot |{s}_{t+1}\right)}\left[\min_{i\in \{1,2\}}{Q}_{{\bar{\theta }}_{i}}\left({s}_{t+1},{a}_{t+1}\right)-\alpha \log {\pi }_{\phi }\left({a}_{t+1}|{s}_{t+1}\right)\right]$$
(2)

and each critic is updated by minimizing

$${\mathcal{L}}_{Q}\left({\theta }_{i}\right)={\mathbb{E}}_{\left({s}_{t},{a}_{t},{r}_{t},{s}_{t+1}\right)\sim {\mathcal{D}}}\left[\frac{1}{2}{\left({Q}_{{\theta }_{i}}\left({s}_{t},{a}_{t}\right)-{y}_{t}\right)}^{2}\right],\quad i\in \{1,2\}$$
(3)

with samples drawn from a replay buffer \({{\mathcal{D}}}\). The actor is updated to minimize

$${\mathcal{J}}_{\pi }\left(\phi \right)={\mathbb{E}}_{{s}_{t}\sim {\mathcal{D}},\,\varepsilon \sim {\mathcal{N}}\left(0,I\right)}\left[\alpha \log {\pi }_{\phi }\left({f}_{\phi }\left(\varepsilon ;{s}_{t}\right)|{s}_{t}\right)-\min_{i\in \{1,2\}}{Q}_{{\theta }_{i}}\left({s}_{t},{f}_{\phi }\left(\varepsilon ;{s}_{t}\right)\right)\right]$$
(4)

where \({f}_{\phi }(\varepsilon ;{s}_{t})=\tanh ({\mu }_{\phi }({s}_{t})+{\sigma }_{\phi }({s}_{t})\odot \varepsilon )\). Target networks are updated by Polyak averaging,

$${\bar{\theta }}_{i}\leftarrow \tau {\theta }_{i}+\left(1-\tau \right){\bar{\theta }}_{i},\tau \in \left(0,1\right)$$
(5)
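Equations (2), (3) and (5) can be made concrete with scalar stand-ins for the network outputs. The sketch below uses a single sampled next action in place of the expectation in Eq. (2) and plain Python lists in place of network parameters; all function names and numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def soft_td_target(reward, q1_next, q2_next, logp_next, gamma, alpha):
    """Single-sample estimate of Eq. (2): clipped double-Q minus entropy term."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * logp_next)


def critic_loss(q_pred, target):
    """Per-sample squared error of Eq. (3)."""
    return 0.5 * (q_pred - target) ** 2


def polyak_update(target_params, online_params, tau):
    """Eq. (5): move each target parameter a fraction tau toward the online one."""
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_params, target_params)]


# Target critics predict 2.0 and 3.0 for the next state-action pair;
# the smaller value is used to limit overestimation.
y = soft_td_target(reward=1.0, q1_next=2.0, q2_next=3.0,
                   logp_next=-1.0, gamma=0.9, alpha=0.5)
loss = critic_loss(q_pred=3.0, target=y)
new_target = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

With a small \(\tau\) the target parameters change slowly, which keeps the regression target \({y}_{t}\) nearly stationary between consecutive critic updates.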

The temperature \(\alpha\) is either fixed or tuned automatically to match a target entropy \({{{\mathcal{H}}}}_{{\mbox{target}}}\) by minimizing

$${{\mathcal{J}}}(\alpha )={{\mathbb{E}}}_{{a}_{t}\sim {\pi }_{\phi }(\cdot|{s}_{t}),{s}_{t}\sim {{\mathcal{D}}}}[-\alpha (\log {\pi }_{\phi }({a}_{t}|{s}_{t})+{{{\mathcal{H}}}}_{{{\rm{target}}}})]$$
(6)
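The effect of Eq. (6) is easiest to see from its gradient with respect to \(\alpha\), which is \(-({\mathbb{E}}[\log {\pi }_{\phi }]+{\mathcal{H}}_{\rm{target}})\). A minimal sketch with made-up log-probabilities follows; the function names, the learning rate and the plain gradient step are assumptions for illustration.

```python
def temperature_loss(alpha, log_probs, target_entropy):
    """Minibatch estimate of Eq. (6)."""
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / len(log_probs)


def temperature_grad(log_probs, target_entropy):
    """Derivative of the minibatch loss with respect to alpha."""
    mean_logp = sum(log_probs) / len(log_probs)
    return -(mean_logp + target_entropy)


log_probs = [-0.5, -1.5]   # estimated policy entropy = -mean(log pi) = 1.0
target_entropy = 2.0       # policy is currently less random than desired
alpha = 0.2
loss = temperature_loss(alpha, log_probs, target_entropy)
grad = temperature_grad(log_probs, target_entropy)
alpha_new = alpha - 0.1 * grad  # one gradient-descent step, learning rate 0.1
```

Here the estimated entropy (1.0) lies below the target (2.0), so the step increases \(\alpha\) and strengthens the entropy bonus; when the entropy exceeds the target, the sign flips and \(\alpha\) decreases. Many implementations optimize \(\log \alpha\) rather than \(\alpha\) so that the temperature stays positive.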

In practice, SAC alternates gradient updates of \(\left\{{Q}_{{\theta }_{1}},{Q}_{{\theta }_{2}}\right\}\), \(\phi\), and \(\alpha\) using minibatches from \({{\mathcal{D}}}\).
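The squashed-Gaussian actor used in these updates can be sketched in a few lines. The example below draws \(a=\tanh (\mu +\sigma \odot \varepsilon )\) and evaluates \(\log {\pi }_{\phi }(a|s)\) with the standard change-of-variables correction for the \(\tanh\) squashing. Here `mu` and `sigma` stand in for the actor network outputs \({\mu }_{\phi }(s)\) and \({\sigma }_{\phi }(s)\), and the small constant inside the logarithm is a common numerical safeguard; both are assumptions of this sketch rather than details taken from the implementation.

```python
import math
import random


def sample_squashed_gaussian(mu, sigma, rng):
    """Return (action, log_prob) for a diagonal Gaussian squashed by tanh."""
    action, log_prob = [], 0.0
    for m, s in zip(mu, sigma):
        eps = rng.gauss(0.0, 1.0)  # reparameterization noise
        u = m + s * eps            # pre-squash Gaussian sample
        a = math.tanh(u)           # squash into (-1, 1)
        # Log-density of u under N(m, s^2), written in terms of eps.
        log_prob += -0.5 * eps ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for a = tanh(u): subtract log(1 - a^2).
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


rng = random.Random(0)
action, logp = sample_squashed_gaussian(mu=[0.0, 0.5], sigma=[1.0, 0.3], rng=rng)
```

Because the noise \(\varepsilon\) is sampled independently of the parameters, gradients of Eq. (4) can flow through `mu` and `sigma` when the same computation is written in an automatic-differentiation framework.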

During early-stage development and debugging of the SAC agent, a manually constructed simulated environment based on a TPP-Br4 model system was used for preliminary testing; details are provided in Supplementary Note 6.

Keypoint detection

For molecular keypoint localization, we employed a YOLOv7-based framework adapted for STM images of C–Br sites. The network integrates an efficient layer aggregation network (ELAN) backbone with a feature pyramid (FPN-PAN) to capture both fine spatial details and high-level semantic information. Unlike conventional object detection, the prediction head was modified to directly regress the (x, y) coordinates of molecular keypoints together with confidence scores. A custom dataset was constructed by annotating 28 original STM images of TPP-Br4 molecules with the LabelMe tool. After applying data augmentation (rotation, cropping, flipping, and elastic transformations), the dataset was expanded to 2280 images for training. The model was trained using learning-rate decay and momentum-based optimization to ensure stable convergence. The final YOLOv7 model achieved an accuracy of 0.907, a recall of 0.904, and an mAP@0.5 of 0.96 for keypoint detection. Keypoint-specific evaluation further yielded a PCK@0.5 of 0.943, a mean squared error of 0.82 pixels, and a PVE of 0.065. Detailed hyperparameters and training procedures are provided in Supplementary Note 1.1.
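The PCK metric quoted above counts a predicted keypoint as correct when it lies within a threshold fraction of a reference length from its annotated position. The following sketch illustrates the idea; the reference length `ref_size` (for example a bounding-box dimension), the pairing of predictions with annotations by index, and the coordinates are assumptions for illustration, since the normalization used for the reported values is not specified here.

```python
import math


def pck(pred_points, true_points, ref_size, threshold=0.5):
    """Fraction of keypoints within threshold * ref_size of the ground truth."""
    if len(pred_points) != len(true_points) or not true_points:
        raise ValueError("need equally many, and at least one, keypoints")
    limit = threshold * ref_size
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred_points, true_points)
        if math.hypot(px - tx, py - ty) <= limit
    )
    return hits / len(true_points)


# Made-up pixel coordinates for three C-Br sites.
pred = [(10.0, 10.0), (20.0, 21.0), (40.0, 40.0)]
true = [(10.0, 11.0), (20.0, 20.0), (30.0, 30.0)]
score = pck(pred, true, ref_size=4.0, threshold=0.5)  # limit = 2 pixels
```

In this toy case two of the three predictions fall within 2 pixels of their annotations, so the score is 2/3.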