Fig. 1 | Scientific Data

Fig. 1

From: Labeled dataset of X-ray protein ligand images in 3D point cloud and validated deep learning models

Workflow used to create LigPCDS, train the deep learning (DL) models, and validate the labeling approaches. (a) LigPCDS creation schema. In step 1, a list of PDB entries with resolutions ranging from 1.5 to 2.2 Å was retrieved from RCSB (.pdb and .mtz), and their free organic ligands were downloaded, filtered and validated (.sdf). This yielded a list of 244,226 valid ligands. In step 2, Dimple v2.6.1 was used to refine the PDB entries and calculate their Fo-Fc maps. Next, for each ligand, a grid size was defined to cover its entire blob. Each ligand’s grid was interpolated from its Fo-Fc map to a 3D point cloud and processed to create the final 3D representation of the ligand. In step 3, vocabularies of chemical classes were created and used to label the structure of the valid ligands atom-wise. The vocabularies were based on the chemical atoms themselves and on cyclic substructures of the ligands. Finally, in step 4, the atom-wise labels of each ligand structure were extrapolated pointwise, using an atomic sphere model, to label the final 3D representations of the ligands, resulting in LigPCDS. (b) General schema used to train and obtain the validated DL models. A stratified training dataset of n = 78,902 ligand entries was created from LigPCDS and split into k = 13 similar groups (step 5). The LigPCDS entries of this dataset were used to train DL models on semantic segmentation tasks using the Minkowski Engine (ref. 47) architecture and networks based on the 3D U-Net (ref. 52). Cycles of training, evaluation and changes continued until well-performing DL models were obtained and validated (step 6). (c) Four of the proposed labeling approaches were validated and are illustrated with ligand FUL from the PDB (entry 4Z4T). The average cross-validation performance of the best DL model trained with each vocabulary is reported as the mIoU and mF1 metrics, with the corresponding SEM and confidence interval (CI). k = 1 was used in the tests, except for the model trained with the vocabulary “Generic Atoms and Cycles C347CA56”, which used the average k-fold value and k = 13. Image “Machine Learning” is by Srinivas Agra and image “intelligence” is by Gacem Tachfin from the Noun Project (CC BY 3.0).
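The pointwise extrapolation in step 4 can be illustrated with a minimal sketch. It assumes that a point inherits the label of the nearest atom whose sphere contains it, and that points outside every sphere receive a background label. The function name `label_points`, the single fixed radius of 1.5 Å, the nearest-atom tie-breaking and the `"none"` background label are all illustrative assumptions; the caption does not specify them, and the authors' atomic sphere model may differ.

```python
import math

def label_points(points, atoms, radius=1.5, background="none"):
    """Give each point the label of the nearest atom within `radius` (in Å).

    `atoms` is a list of ((x, y, z), label) pairs. The fixed radius and
    the background label are assumptions made for this sketch only.
    """
    labels = []
    for p in points:
        best_label, best_dist = background, radius
        for center, atom_label in atoms:
            d = math.dist(p, center)  # Euclidean distance, Python 3.8+
            if d <= best_dist:
                best_label, best_dist = atom_label, d
        labels.append(best_label)
    return labels

# Toy ligand with two labeled atoms and three point-cloud points.
atoms = [((0.0, 0.0, 0.0), "C"), ((1.4, 0.0, 0.0), "O")]
points = [(0.1, 0.0, 0.0), (1.3, 0.2, 0.0), (5.0, 5.0, 5.0)]
point_labels = label_points(points, atoms)
```

In this toy case the first point falls closest to the carbon, the second lies inside both spheres but closer to the oxygen, and the third lies outside every sphere and keeps the background label.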
