Extended Data Fig. 6: Inferring conceptual and technical activities.

Using the ground-truth dataset described in Extended Data Fig. 5, we train a neural-network model to infer the two author roles (conceptual and technical) in 16,397,750 papers where author contributions are not explicit. The model uses eight variables to predict the dichotomy of author roles: 1) contribution to references, defined as the overlap between the references of the focal paper and all references across the author's previously published papers; 2) contribution to topics, defined as the overlap between the MAG topic keywords of the focal paper and all keywords across the author's previously published papers; 3) contribution to leading the research, defined as the probability of being the first author(s); 4) contribution to managing correspondence and presentation, defined as the probability of being the corresponding author(s); 5) career age, defined as the number of years from the author's first publication to the publication of the focal paper; 6) citation impact, defined as the total number of citations the author has received for all previous publications; 7) topic diversity, defined as the total number of unique MAG topic keywords across the author's previous publications; and 8) publication productivity, defined as the total number of papers the author published before the focal paper. The model achieves a precision of 0.790 and a recall of 0.793. The predicted and empirical values of the fraction of conceptual workers are highly correlated (Pearson correlation coefficient 0.66, P-value < 0.001). The eight predictors and their contributions to the prediction are displayed. The figure is reproduced from our earlier research (ref. 22).
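
As an illustration of how the history-based predictors above could be computed for one author on one focal paper, the following is a minimal sketch, not the pipeline used for the figure. The `Paper` record, the function names, and the choice to normalize each overlap by the size of the focal paper's set are all assumptions made for this example; the caption does not specify how overlap is normalized. Predictors 3 and 4 (probabilities of being first or corresponding author) are omitted because their estimation is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical minimal record for a paper (not an actual MAG schema)."""
    year: int
    references: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    citations: int = 0


def overlap_fraction(focal: set, prior: set) -> float:
    # Assumption: overlap is the share of the focal paper's items that
    # also appear in the author's earlier work.
    return len(focal & prior) / len(focal) if focal else 0.0


def author_features(focal: Paper, prior_papers: list) -> dict:
    """Compute predictors 1, 2, 5, 6, 7 and 8 for one author."""
    # Pool references and keywords over all of the author's earlier papers.
    prior_refs = set().union(*(p.references for p in prior_papers))
    prior_keywords = set().union(*(p.keywords for p in prior_papers))
    first_year = min((p.year for p in prior_papers), default=focal.year)
    return {
        "reference_overlap": overlap_fraction(focal.references, prior_refs),
        "topic_overlap": overlap_fraction(focal.keywords, prior_keywords),
        "career_age": focal.year - first_year,
        "citation_impact": sum(p.citations for p in prior_papers),
        "topic_diversity": len(prior_keywords),
        "productivity": len(prior_papers),
    }
```

Under these assumptions, an author with no earlier papers receives zero for every predictor, which keeps the feature vector well defined for first-time authors.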