Extended Data Fig. 2: JPLE captures the association between amino acid sequence and RNA sequence specificity.

a, Illustration of the JPLE training process for n RBPs. Singular value decomposition (SVD) is used to decompose the joint protein representation [P R] into U, Σ, and Vᵀ. The d singular vectors and values contributing the most to the variance of R in [P R] are selected, leading to the submatrices U’, Σ’, and V’ᵀ. The product W of U’ and Σ’ provides the d-dimensional latent embedding of the n RBPs. b, Distribution of the Pearson correlation coefficients (PCCs) between the reconstructed (r*) and measured (r) RNA-binding profiles (that is, the reconstruction similarity), as a function of the number of retained singular vectors d. Note that PCC is multiplied by 100. The median, minimum, and maximum reconstruction similarities are displayed, and the distribution is indicated in gray. A minimum reconstruction PCC of 0.95 across all measured RBPs requires d = 122. The orange line represents the percentage of variance explained in R of [P R]. c, Illustration of the JPLE inference process for RBP_u. The left (in blue) showcases a protein query, where the RBP’s latent embedding w_u* is obtained by deconvolving its peptide profile p_u into a mixture of the singular vectors in V_P’. Its RNA-binding profile r_u* can be reconstructed through either global (labeled G) or local (labeled L) decoding. The right (in brown) showcases an RNA query, where the RBP’s latent embedding w_u* is obtained by deconvolving its RNA-binding profile r_u into a mixture of the singular vectors in V_R’. Its peptide profile p_u* can be reconstructed through global decoding. d, Variance explained in P and R of [P R], as a function of the number of selected singular vectors d. At d = 122, 44% and 96% of the total variance of P and R of [P R] are explained, respectively. e, Relationship between RNA-binding profile PCCs and their JPLE latent distance. 
JPLE was trained with clusters of RBPs sharing the same specificity (PCC > 0.6) held out; the held-out RBPs were then embedded into JPLE, and the cosine distances among them and to the RBPs in the training set (that is, the e-dist) were measured. The e-dist for each pair of RBPs was compared to the similarity of their RNA-binding profiles (orange, right y-axis) and their amino acid sequence identity (AA SID; blue, left y-axis). Lines and shaded areas show the smoothed mean and standard deviation across 50 equally sized bins. RBP pairs with an e-dist < 0.20 possess an average RNA-binding similarity ≥ 0.62 and an average AA SID ≥ 36%.
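
The training step in panel a and the protein query with global decoding in panel c can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the rule used to rank singular vectors (σ² times the squared norm of the R block of each right singular vector) is an assumed reading of "contributing the most to the variance of R", and ordinary least squares is an assumed stand-in for the "deconvolving" step. Local decoding, preprocessing, and normalization are omitted.

```python
import numpy as np


def jple_train(P, R, d):
    """Decompose the joint matrix [P R] and keep d singular vectors.

    P: (n, p) peptide profiles, R: (n, r) RNA-binding profiles.
    Returns the latent embedding W (n, d) and the P and R blocks of V'^T.
    """
    n_p = P.shape[1]
    joint = np.hstack([P, R])
    U, s, Vt = np.linalg.svd(joint, full_matrices=False)

    # Assumed selection rule: variance each component contributes to the R block.
    r_variance = (s ** 2) * np.sum(Vt[:, n_p:] ** 2, axis=1)
    keep = np.argsort(r_variance)[::-1][:d]

    W = U[:, keep] * s[keep]      # W = U' Σ'
    Vt_P = Vt[keep, :n_p]         # peptide block of V'^T, shape (d, p)
    Vt_R = Vt[keep, n_p:]         # RNA block of V'^T, shape (d, r)
    return W, Vt_P, Vt_R


def embed_from_peptides(p_u, Vt_P):
    """Protein query: least-squares solution w of w @ Vt_P ≈ p_u."""
    w, *_ = np.linalg.lstsq(Vt_P.T, p_u, rcond=None)
    return w


def decode_global(w, Vt_R):
    """Global decoding: reconstruct the RNA-binding profile from the embedding."""
    return w @ Vt_R


if __name__ == "__main__":
    # Synthetic data only; dimensions are arbitrary and far smaller than in the figure.
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(20, 3))
    A = rng.normal(size=(3, 30))
    B = rng.normal(size=(3, 15))
    W, Vt_P, Vt_R = jple_train(Z @ A, Z @ B, d=3)

    z_u = rng.normal(size=3)      # an RBP not seen during training
    w_u = embed_from_peptides(z_u @ A, Vt_P)
    r_u_star = decode_global(w_u, Vt_R)
    print(np.corrcoef(r_u_star, z_u @ B)[0, 1])
```

An RNA query would follow the same pattern with the roles of the two blocks exchanged, solving against the R block and decoding through the P block.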