Introduction

The Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision (DSM-5-TR) defines substance use disorders (SUDs) as a pattern of substance use resulting in clinically significant impairment or distress1. This encompasses a range of conditions, including tolerance, withdrawal, and persistent, unsuccessful efforts to control or reduce substance consumption.

SUDs pose a substantial global health challenge, significantly contributing to morbidity and mortality2. According to the 2021 United Nations Office on Drugs and Crime (UNODC) report3, an estimated 296 million individuals worldwide engage in drug use, with 29.5% experiencing a SUD, marking a 45% increase in prevalence over the preceding decade. The European Drug Report4 indicates that over 29% of Europeans aged 15 to 64 have used illicit drugs at least once. In Italy, substance use among 15-19-year-olds reached 27.9% in 2023, affecting approximately one million students5. With a cannabis use rate of 21.5%, Italy ranks second in Europe, exceeded only by Czechia, and substantially surpasses the EU average of 12.2%. Cocaine use, reported at 2.1%, aligns closely with the European average of 2.21%, with higher consumption concentrated in Northern Italy in 20234. In 2022, the European Union recorded an estimated mortality rate of 22.5 deaths per million among individuals aged 15–645. Additionally, data from Italian prefectures and law enforcement points to a slight rise in drug-related deaths in 2022 over 20216.

The etiology of SUDs seems to be multifactorial, encompassing neurological, genetic, and sociocultural components7,8. From a genetic standpoint, familial patterns found in twin and family studies indicate that SUDs have a heritable component9. Estimates of heritability for various SUDs usually fall between 50 and 60%, suggesting a high level of polygenicity10,11. Genome-wide association studies (GWAS) have identified specific genomic loci associated with various SUDs, encompassing both licit and illicit substances, including nicotine, cannabis, and opioids12,13,14. While often challenged by limitations in robustness and reproducibility, candidate gene studies have explored genetic variations within dopaminergic, serotonergic, opioid receptor, GABAergic, and nicotinic cholinergic system genes across various SUDs15,16,17,18. Linkage disequilibrium (LD) score regression methods revealed positive genetic correlations between smoking, cannabis use, major depression, and risk-taking behaviors19. Hatoum et al.19 found a shared hereditary risk element for addiction affecting problematic use of opioids, cannabis, tobacco, and alcohol. This risk factor is distinct from general substance use patterns and shows its strongest correlations with opioid and cannabis use disorders, with a weaker association with tobacco use. It also correlates with executive functioning, personality traits such as risk-taking and neuroticism, and several non-substance-related mental health conditions. According to the authors, this addiction-specific genetic risk factor remains a significant predictor of addiction even after controlling for typical substance use patterns and general psychopathology, suggesting a unique genetic architecture that drives addiction, at least partially, independently of these contributing elements.

From a psychological perspective, current addiction models suggest that individuals initiate substance use due to positive reinforcement, which can lead to automated processes and inflexible, compulsive behaviors that resist negative consequences20,21. The prefrontal cortex, which is essential for many executive functions22,23, including inhibitory control, working memory, and attention24,25, is significantly affected by persistent substance use. In people with SUDs, these deficits cause more general cognitive problems in domains outside substance-related reward26,27,28,29.

A significant research gap arises from the fact that Northern European ancestry groups are the most frequently studied populations in GWAS, so more diverse populations need to be investigated. In fact, various studies have demonstrated that several Southern European populations have been influenced by many other populations during their history, causing subtle but significant differences in allele frequencies30,31,32,33. Differences in evolutionary pressures, environmental adaptations, and demographic events have resulted in substantial genetic diversity across human populations. This diversity complicates the identification of universally applicable risk variants for complex traits like SUDs. Therefore, the substantial genetic heterogeneity within the Italian population underscores the importance of its inclusion in SUD research.

Global biobank development has seen a marked increase in recent decades, preserving biological samples from thousands of participants. Their value is amplified by integrating observational data and in-depth questionnaire responses, thereby significantly boosting research potential34,35. Even though they are still relatively new, biobanks have already transformed biomedicine, especially in association studies, and they are expected to provide substantial new insights in the near future36. However, for specific traits, such as SUDs, most biobanks lack sufficient data, particularly for the Italian population, with research primarily focused on alcohol and tobacco addiction or abuse, together with lifetime usage data for other substances.

Due to the high societal costs and complex nature of SUDs, investigating the interplay between genetic and environmental factors is crucial for improving prevention, diagnosis, and personalized interventions37. This article presents BioSUD, a new Biobank project for studying SUDs in Southern Italy. The primary goal is to determine the genetic causes of SUDs and the relationships between treatment outcomes and environmental and genetic factors. Our findings could support genotype-informed SUD treatments, improving patient outcomes and advancing scientific understanding.

Materials and methods

Recruitment

The BioSUD initiative intends to create a genetic resource for understanding the phenotypic characteristics associated with SUDs. We aim to collect and analyze data from 3,000 people, 1,500 of whom are diagnosed with SUD. On the 1st of February 2024, the cohort included 1,806 participants, of whom 1,508 individuals served as control participants, comprising 1,046 males and 462 females, and 298 case participants, 278 males and 20 females. This study defined controls as individuals without a formal SUD diagnosis from either public or private treatment centers. We recruited control participants exclusively from a single blood donor center (Centro Trasfusionale of the University General Hospital, Bari, Italy) between March and October 2021 during their donation process. Before sample collection, we informed them about the project’s aims and rationale and provided an informative document to obtain their written informed consent.

The case group included samples from several private centers and public structures (detailed below). Recruitment is ongoing, aiming to reach the target of 1,500 SUD cases for a balanced case-control ratio. All the cases met the standardized diagnostic criteria for SUDs according to the International Classification of Diseases, 11th Revision (ICD-11)38, or the DSM-5 TR1. Eligible participants under these criteria were recruited from private (N = 71) and public (N = 227) healthcare facilities in Apulia, Southern Italy. Specifically, we collected the private center samples from the Therapeutic Community Emmanuel Onlus - Sector Dependencies (Lecce) and the Therapeutic Community “Fratello Sole” - Social Cooperative (Gioia del Colle, BA). We gathered the samples from public institutions at the SerD of Bari (BA), Bitonto (BA), Brindisi (BR), Campi Salentina (LE), Castellaneta (TA), Casarano (LE), Foggia (FG), Francavilla (TA), Galliano del Capo (LE), Gallipoli (LE), Grumo Appula (BA), Lecce (LE), Maglie (LE), Manduria (TA), Martina Franca (TA), Nardò (LE), Ostuni (BR), Poggiardo (LE), San Cesario di Lecce (LE), San Pietro Vernotico (BR), Taranto (TA), Ugento (LE) and the SerD in the Brindisi Prison (BR).

We used different engagement strategies to encourage volunteer participation in the case group. In all facilities, the BioSUD members presented the project separately to staff and participants, using multimedia tools such as presentations and short demonstrative videos. After the presentation, we recorded the volunteers’ willingness to participate in the study. During scheduled routine examinations in the following weeks, we collected written consent and blood samples to reduce participant burden. A dedicated medical professional or psychologist was on-site to oversee the process, including obtaining written consent and aiding with the questionnaire (detailed below). Healthcare professionals, such as doctors and specialized nurses on the research team, collected blood samples. After transport to the BioSUD lab facilities, all the blood samples were processed as described in the following sections within 72 hours. We entered the questionnaire data from cases and controls into Excel and processed it with R Studio, version 4.5.039.

Sampling

Blood samples from controls were collected by a specialized nurse and from cases by healthcare professionals, including physicians and specialized nurses affiliated with the research team. Specifically, after the written consent was returned, venous blood (8 ml) was drawn using a Vacutainer K2 EDTA and kept refrigerated until arrival at the processing laboratory at the University of Bari. Within 72 h from collection, all samples underwent centrifugation at 800 g for 15 min at a 45-degree angle. Following stratification, 1 mL of plasma was stored at -80 °C, while the remaining sample was preserved at -20 °C for subsequent DNA extraction and analyses.

DNA extraction

We extracted DNA from 250 µL of the layer of nucleated blood cells obtained after centrifugation during the initial processing, using Qiagen DNA Blood Mini Kit according to the manufacturer’s protocol. We evaluated the quality and concentration of the extracted DNA using the NanoDrop 1000 UV Thermo Scientific.

Genotyping and dataset

A total of 1,378 DNA samples meeting the quality control thresholds (concentration ≥ 30 ng/µl, 260/280 ratio > 1.6) were genotyped at the Institute of Genomics, University of Tartu (Estonia) using the Illumina Global Screening Array (GSA, Illumina Inc.). All samples with a call rate lower than 97% were then discarded, resulting in 1,279 genotyped samples with 723,895 SNPs captured. The genotype data were imputed using the TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/#!), which leverages Minimac4 and the TOPMed r3 reference panel to infer additional SNPs. The total number of imputed SNPs obtained after applying an Rsq filter ≥ 0.3 was 34,118,504. Although these imputed variants were available, we did not use them in the present study, because the total number of retained SNPs did not increase significantly after merging with other datasets and performing LD pruning. We combined the newly generated genotypes with publicly available datasets comprising 4,551 individuals from 140 different populations, including 107 Eurasian populations30,40,41,42,43,44,45,46,47,48,49,50,51,52, using the --bmerge function in PLINK version 1.953. Before merging, we removed all markers and individuals with more than 5% missing data. The resulting dataset comprises 5,830 individuals and 85,310 SNPs.
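The two sample-level quality control steps described above can be sketched as a simple filter. This is an illustration only: the thresholds come from the text, while the record layout, field names (`conc_ng_ul`, `ratio_260_280`, `call_rate`), and example values are hypothetical and do not reflect the actual BioSUD pipeline.

```python
from dataclasses import dataclass

# Thresholds stated in the text; everything else here is illustrative.
MIN_CONCENTRATION_NG_UL = 30.0  # concentration >= 30 ng/µl
MIN_RATIO_260_280 = 1.6         # 260/280 ratio must be strictly above 1.6
MIN_CALL_RATE = 0.97            # samples below a 97% call rate are discarded


@dataclass
class Sample:
    sample_id: str        # hypothetical identifier
    conc_ng_ul: float     # DNA concentration
    ratio_260_280: float  # absorbance purity ratio
    call_rate: float      # fraction of SNPs successfully called (0-1)


def passes_dna_qc(s: Sample) -> bool:
    """Pre-genotyping check on DNA concentration and purity."""
    return (s.conc_ng_ul >= MIN_CONCENTRATION_NG_UL
            and s.ratio_260_280 > MIN_RATIO_260_280)


def passes_genotype_qc(s: Sample) -> bool:
    """Post-genotyping check on the per-sample call rate."""
    return s.call_rate >= MIN_CALL_RATE


# Hypothetical records, chosen to exercise each threshold.
samples = [
    Sample("S1", 45.0, 1.85, 0.995),  # passes both steps
    Sample("S2", 28.0, 1.80, 0.990),  # concentration too low
    Sample("S3", 60.0, 1.60, 0.990),  # ratio not strictly above 1.6
    Sample("S4", 52.0, 1.90, 0.960),  # call rate below 97%
]

genotyped = [s for s in samples if passes_dna_qc(s)]
retained = [s for s in genotyped if passes_genotype_qc(s)]
print([s.sample_id for s in genotyped])
print([s.sample_id for s in retained])
```

Keeping the two checks separate mirrors the order in the text, where DNA quality is assessed before genotyping and the call rate only afterwards.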

Principal component analysis (PCA)

To explore the genetic variation of the BioSUD cohort, we performed a PCA. We retained only the Eurasian populations from the complete dataset and discarded variants and individuals with missingness rates higher than 5%. After pruning SNPs in high linkage disequilibrium (--indep-pairwise 200 50 0.4), 69,359 SNPs and 3,530 samples remained, and the PLINK files were converted to EIGENSTRAT format using convertf (version 5722).
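As a rough sketch of the filtering and pruning steps above, the corresponding PLINK 1.9 invocations could be assembled as follows. The file prefixes are hypothetical, and we assume that the 5% missingness filter maps to the standard `--geno` and `--mind` options; only the `--indep-pairwise 200 50 0.4` setting is stated in the text. The code only builds and prints the command lines; it does not run PLINK.

```python
import shlex
from typing import List


def missingness_filter_cmd(in_prefix: str, out_prefix: str,
                           max_missing: float = 0.05) -> List[str]:
    """Drop variants (--geno) and individuals (--mind) above max_missing."""
    return [
        "plink", "--bfile", in_prefix,
        "--geno", str(max_missing),
        "--mind", str(max_missing),
        "--make-bed", "--out", out_prefix,
    ]


def ld_prune_cmd(in_prefix: str, out_prefix: str,
                 window: int = 200, step: int = 50, r2: float = 0.4) -> List[str]:
    """Pairwise LD pruning; PLINK writes <out_prefix>.prune.in and .prune.out."""
    return [
        "plink", "--bfile", in_prefix,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out_prefix,
    ]


def extract_pruned_cmd(in_prefix: str, prune_prefix: str,
                       out_prefix: str) -> List[str]:
    """Keep only the variants listed in <prune_prefix>.prune.in."""
    return [
        "plink", "--bfile", in_prefix,
        "--extract", prune_prefix + ".prune.in",
        "--make-bed", "--out", out_prefix,
    ]


# Hypothetical file prefixes, for illustration only.
commands = [
    missingness_filter_cmd("eurasia", "eurasia_qc"),
    ld_prune_cmd("eurasia_qc", "eurasia_ld"),
    extract_pruned_cmd("eurasia_qc", "eurasia_ld", "eurasia_pruned"),
]
for cmd in commands:
    print(shlex.join(cmd))
```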

We performed the PCA using SmartPCA (version 16000) from the EIGENSOFT package54. Specifically, we projected the BioSUD samples into the principal component space inferred from all other Eurasian individuals (using the poplistname option). Outliers were automatically removed with the numoutlieriter, numoutlierevec, and outliersigmathresh options set to default parameters. This process led to removing 102 samples, reducing the sample size to 3,428 individuals. After visual inspection of the PCA plot, we identified and removed three additional BioSUD participants falling outside the genetic variability of the cohort (Fig. S1 - Supplementary Materials). The final dataset comprised 3,425 individuals, including samples from the BioSUD cohort and the Eurasian populations from the publicly available datasets (Table S1 - Supplementary Materials).

ROH Estimation

We used the --homozyg function in PLINK to detect runs of homozygosity (ROHs) containing at least 50 SNPs. The minimum ROH length was set to 1,500 kb to exclude short ROHs due to linkage disequilibrium (LD). ROHs were detected by scanning the genotypes of each individual in the BioSUD cohort and of all other individuals sharing the same set of SNPs.
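The ROH settings above can be illustrated with a minimal sketch that builds a PLINK 1.9 command line and reads the per-individual summary PLINK writes to its `.hom.indiv` file (columns FID, IID, PHE, NSEG, KB, KBAVG). The file prefixes and the toy file content are hypothetical, and we assume all other `--homozyg` parameters were left at their defaults, which the text does not state.

```python
from typing import Dict, List, Tuple


def roh_cmd(in_prefix: str, out_prefix: str,
            min_snps: int = 50, min_kb: int = 1500) -> List[str]:
    """PLINK 1.9 ROH scan with the minimum SNP count and length from the text."""
    return [
        "plink", "--bfile", in_prefix,
        "--homozyg",
        "--homozyg-snp", str(min_snps),
        "--homozyg-kb", str(min_kb),
        "--out", out_prefix,
    ]


def parse_hom_indiv(text: str) -> Dict[str, Tuple[int, float]]:
    """Map each IID to (number of ROH segments, total ROH length in kb)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    result: Dict[str, Tuple[int, float]] = {}
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        result[record["IID"]] = (int(record["NSEG"]), float(record["KB"]))
    return result


# Toy content in the .hom.indiv layout; the values are invented.
toy_hom_indiv = """
 FID  IID  PHE  NSEG     KB    KBAVG
  F1   I1   -9     5  10000     2000
  F2   I2   -9     8  16000     2000
"""

print(" ".join(roh_cmd("merged", "merged_roh")))
print(parse_hom_indiv(toy_hom_indiv))
```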

Admixture analysis

To increase the number of SNPs analyzed while maintaining a proper sample size for the population, we confined the analysis to data from the 1000 Genomes Project40 and Raveane et al.31. The resulting dataset comprised 2,680 individuals: 1,401 from European (Finnish: FIN; Central Europeans: CEU; British from England and Scotland: GBR; Iberians from Spain: IBS; Italians: ITA; Tuscans: TSI), African (Luhya from Webuye, Kenya: LWK; Yoruba from Nigeria: YRI), and Asian (Gujarati Indians from Houston: GIH; Dai Chinese: CDX; Japanese from Tokyo: JPT) populations, and 1,279 from the BioSUD sample (Table S1 - Supplementary Materials).

We performed ten independent repetitions (using time as a starting point for randomization with the --seed option) for each K value ranging from two to ten using the ADMIXTURE software tool (version 1.3)55. We inferred the “optimal” number of K using the cross-validation (CV) procedure with the --cv option and observed the lowest CV error at K = 7 and 8 (Fig. S2). We first conducted an ADMIXTURE analysis excluding the BioSUD data. After obtaining the initial results, we projected the BioSUD data onto the resulting ADMIXTURE profiles (-P flag) to integrate and analyze their genetic structure within the established framework.
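The choice of K from the cross-validation errors can be sketched as follows. ADMIXTURE run with `--cv` reports a line of the form `CV error (K=7): 0.51234` in its output; the sketch parses such lines and ranks K by mean error across repetitions. Averaging across the ten repetitions is our assumption (the text does not say how runs were combined), and the log lines below are invented.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_error(log_text: str) -> Tuple[int, float]:
    """Extract (K, CV error) from the text of one ADMIXTURE log."""
    match = CV_LINE.search(log_text)
    if match is None:
        raise ValueError("no CV error line found")
    return int(match.group(1)), float(match.group(2))


def rank_k(cv_results: Iterable[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Return (K, mean CV error) pairs sorted from lowest to highest error."""
    by_k: Dict[int, List[float]] = defaultdict(list)
    for k, err in cv_results:
        by_k[k].append(err)
    mean_err = {k: mean(errs) for k, errs in by_k.items()}
    return sorted(mean_err.items(), key=lambda item: item[1])


# Invented log snippets standing in for repeated runs at different K.
toy_logs = [
    "CV error (K=6): 0.53000",
    "CV error (K=7): 0.51000",
    "CV error (K=7): 0.52000",
    "CV error (K=8): 0.52000",
]
ranking = rank_k(parse_cv_error(log) for log in toy_logs)
print(ranking)
```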

For comparison, we also performed an unsupervised ADMIXTURE analysis on the same dataset using the same procedure.

Questionnaire

Each participant completed a tailored paper-and-pencil questionnaire to assess the frequency, amounts, and patterns of substance use, including nicotine, alcohol, cocaine, heroin, cannabis, and other substances (Tables S1 and S2 - Supplementary Materials), also including questions from the DSM-5 TR checklist1. This instrument explored drug-related behaviors, with a specific focus on psychosocial factors, family relationships, peer group influences, substance accessibility, and social contexts experienced during adolescence.

The questionnaire given to the ‘Emmanuel’ group and control participants was longer (233 items) than the one given to other patients (165 items) to accommodate the difficulties and time constraints experienced by individuals with addiction outside formal rehabilitation settings. The survey encompasses three main sections: sociodemographic, psychosocial, and substance use (Fig. 1).

The sociodemographic section gathers a wide range of participant data to account for the possible impact of various life factors on the study’s results. We included critical demographic details such as gender, age, education, marital status (including the number of children), place of residence and birth, income, employment status, self-reported health, and family background (parents, siblings, and other caregivers).

The psychosocial section focuses on life experiences that may have influenced the participant’s psychological and social well-being. It investigates situations such as early separations from parents, parental divorce, and relocations (Tables S2 and S3 - Supplementary Materials). In this section, we also investigated aspects closely related to the substance use section, such as substance exposure in social settings, including both family and peers, substance accessibility in the cities where participants lived, and perceived safety level. Furthermore, we explored adverse events across life stages, including grief, accidents, illness, violent crimes, sexual abuse, and other painful events. Participants reported occurrences in age categories (< 14, 14–18, 18–25, > 25 years), and we quantified the overall incidence by calculating the cumulative events across these categories. Moreover, we included questions about the perceived quality of relationships with fathers, mothers, siblings, and peers on a scale ranging from “Very Poor” to “Very Good.” We converted this categorical representation into a 5-point numerical scale (1 to 5). We aggregated cumulative scores across family members to reclassify them into categories: 1–3 as “Very Poor,” 4–6 as “Poor,” 7–9 as “Average,” 10–12 as “Good,” and 13–15 as “Very Good” for analytical purposes.
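The two scoring rules described above can be sketched as follows. We assume that the intermediate response labels are "Poor", "Average", and "Good", that the cumulative family score sums the ratings for father, mother, and siblings (giving a range of 3 to 15), and that all three ratings are present; the handling of missing answers is not described in the text. Function and variable names are hypothetical.

```python
from typing import Dict, Tuple

# 5-point numerical scale (1 to 5); intermediate labels are assumed.
LIKERT = {"Very Poor": 1, "Poor": 2, "Average": 3, "Good": 4, "Very Good": 5}

# Upper bound of each cumulative-score bin, as given in the text.
BINS = [(3, "Very Poor"), (6, "Poor"), (9, "Average"),
        (12, "Good"), (15, "Very Good")]

AGE_BANDS = ("<14", "14-18", "18-25", ">25")


def family_relationship_category(father: str, mother: str,
                                 siblings: str) -> Tuple[int, str]:
    """Sum the three ratings and reclassify the total into five categories."""
    total = LIKERT[father] + LIKERT[mother] + LIKERT[siblings]
    for upper, label in BINS:
        if total <= upper:
            return total, label
    raise ValueError("score out of range")


def cumulative_adverse_events(events_by_band: Dict[str, int]) -> int:
    """Total number of reported events across the four age categories."""
    return sum(events_by_band.get(band, 0) for band in AGE_BANDS)


print(family_relationship_category("Good", "Very Good", "Average"))
print(cumulative_adverse_events({"<14": 1, "18-25": 2}))
```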

Lastly, the substance use section explores the consumption of nicotine, alcohol, cannabis, cocaine, heroin, and other substances (e.g., amphetamines, MDMA, ecstasy, hallucinogens, etc.). We tailored questions for each substance, adapting the DSM-5 TR checklist1 to thoroughly examine substance use across various categories and identify potential use disorders. The exposure subsection explores family, peers, and accessibility of substance use, considering social settings and craving behaviors. Subjective feelings, including relief, reward, and obsession, are measured. We also used a Visual Analogue Scale (VAS) for the craving assessment. Positive responses to substance consumption-related questionnaire items were assigned a value of 1 and summed to create a ‘family substance consumption’ score. We then categorized the score into five ordinal levels: ‘None’ (0 positive responses), ‘Low’ (1–2 positive responses), ‘Average’ (3–4 positive responses), ‘High’ (5–6 positive responses), and ‘Very High’ (seven or more positive responses), based on average and standard deviation.
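A minimal sketch of the ‘family substance consumption’ score follows. Each positive response counts as 1, and the sum is mapped to the five ordinal levels. We assume here that a score of exactly seven belongs to ‘Very High’, so that the bins leave no gap; the function name and example answers are hypothetical.

```python
from typing import Iterable, Tuple


def family_consumption_level(responses: Iterable[bool]) -> Tuple[int, str]:
    """Count positive responses and map the count to an ordinal level."""
    score = sum(1 for answer in responses if answer)
    if score == 0:
        level = "None"
    elif score <= 2:
        level = "Low"
    elif score <= 4:
        level = "Average"
    elif score <= 6:
        level = "High"
    else:
        # Assumption: seven or more positive responses.
        level = "Very High"
    return score, level


# Hypothetical answers to the family substance use items.
print(family_consumption_level([True, False, True, True, False]))
```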

Fig. 1

Overview of questionnaire sections: sociodemographic, psychosocial, and substance use assessment variables. *Areas investigated in Controls and Emmanuel questionnaire only.

Results

The genetic variation of the BioSUD cohort

Principal component analysis

To genetically characterize the BioSUD samples, we first projected the BioSUD individuals onto the PCA space inferred from 2,150 Eurasian individuals (Fig. 2A). The PCA showed that BioSUD samples form a cluster largely overlapping with individuals from Southern Italy and partially overlapping with those from Central Italy. In contrast, the cluster differed from individuals from Northern Italy and Sardinia. Within Eurasia, the BioSUD samples appeared more similar to Balkan populations and Corsicans than to Iberians. Indeed, on the west side of Europe, the Iberian populations were close to Northern Italians and Central Europeans, and closer to Northern Italians than to the BioSUD samples and Southern Italians. Central European populations were likewise closer to Northern Italians than to the BioSUD cohort.

Runs of homozygosity

To infer the homozygosity pattern within the BioSUD cohort compared to worldwide populations, we performed an ROH analysis for genomic segments extending more than 1,500 kb and encompassing at least 50 SNPs. Our results showed that the BioSUD cohort has an average of 7.64 ROH segments for an average total length of 18,132.644 kb (Standard Deviation (SD) = 21,627.211), which is comparable with those inferred for the other European populations (ITA = 21,637.46, IBS = 22,599.718, GBR = 23,427.666). However, we inferred a large SD for the BioSUD population, possibly due to 53 individuals showing a total ROH length ranging from 32,796 kb to 349,267 kb.

Moreover, when comparing the median total ROH length of the BioSUD cohort with those of the European populations, the BioSUD cohort showed the lowest value (BioSUD = 14,940.7 kb; ITA = 16,892 kb; IBS = 18,337.8 kb; GBR = 19,915.7 kb; FIN = 30,780.85 kb), suggesting that it is the most heterozygous of the European populations analyzed (Fig. 2B). When extending this comparison to the CDX and YRI populations, only the latter showed a higher heterozygosity rate than the BioSUD cohort (CDX = 33,066.85 kb, YRI = 9,379.32 kb).
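The contrast between mean and median total ROH length can be illustrated with a small sketch. A few individuals with very long ROHs inflate the mean and the SD but barely move the median, which is why the median is a useful complement here. The per-individual totals below are invented, and the use of the sample standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, median, stdev
from typing import Dict, List


def summarize_total_roh(total_kb: List[float]) -> Dict[str, float]:
    """Mean, sample SD, and median of per-individual total ROH length (kb)."""
    return {
        "mean": mean(total_kb),
        "sd": stdev(total_kb),
        "median": median(total_kb),
    }


# Invented per-individual totals (kb) for two hypothetical populations
# with the same mean but different spread.
toy_totals = {
    "POP_A": [10000.0, 16000.0, 40000.0],  # one individual with long ROHs
    "POP_B": [20000.0, 22000.0, 24000.0],  # tightly clustered
}

summaries = {pop: summarize_total_roh(v) for pop, v in toy_totals.items()}
for pop, stats in summaries.items():
    print(pop, stats)
```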

Admixture analysis

To infer population structure, we performed an ADMIXTURE analysis for K from 2 to 10, with ten random iterations for each K (Fig. S2 - Supplementary Materials). Cross-validation indicated 7 and 8 as the optimal K values, as they yielded the lowest error (Fig. S3 - Supplementary Materials).

Figure 2C shows the barplot summarizing the ancestral component proportions for K = 7. The BioSUD samples (Fig. 2C, lower panel) showed a composition very similar to the Italian populations (Fig. 2C, upper panel), with the main genetic components being those modal in Southern Europeans (yellow in Fig. 2C, Average (A) = 0.764, Median (M) = 0.766, SD = 0.027), Northern Europeans (blue in Fig. 2C, A = 0.100, M = 0.099, SD = 0.024) and South Asians (GIH, red in Fig. 2C, A = 0.090, M = 0.091, SD = 0.012). The components modal in the African populations account for a minor proportion, with the East African one (LWK) accounting for 1.9% (pink in Fig. 2C, A = 0.019, M = 0.019, SD = 0.009) and the West African one (YRI) for 1.3% (green in Fig. 2C, A = 0.013, M = 0.012, SD = 0.021).

Although SD values suggested that the BioSUD cohort is highly homogeneous, we identified three outliers that deviate from the ancestral component proportions of the other individuals. Outlier 1 showed 66% of a component modal in YRI (green in Fig. S4 - Supplementary Materials), suggesting a predominantly African ancestry. The questionnaire data confirmed this observation, showing that this individual was born in Nigeria. Compared with the rest of the BioSUD cohort, Outlier 2 exhibited higher proportions of the components modal in the Japanese population (purple in Fig. S4 - Supplementary Materials, 0.18) and in Northern Europeans (blue in Fig. S4 - Supplementary Materials, 0.16), alongside the component modal in South Asians (red in Fig. S4 - Supplementary Materials, 0.09). However, the additional information available on this participant did not explain this profile, which we suspect is linked to his family history. Outlier 3 had, as expected, the component modal in Southern Europeans as the major one (yellow in Fig. S4 - Supplementary Materials, 0.52) and additionally showed higher-than-expected proportions of the components modal in Western Africans and Eastern Africans, accounting respectively for 29.0% (green in Fig. S4 - Supplementary Materials) and 12.7% (pink in Fig. S4 - Supplementary Materials). From the questionnaire data, we hypothesized that this individual had an Italian and an African parent, thus explaining these results.

The results obtained from the unsupervised ADMIXTURE analysis were substantially equivalent, with the main exception of detecting a main “Apulian/BioSUD component” at K = 7, probably due to the disproportionate sample size of the Apulian cohort (Fig. S5 - Supplementary Materials).

Fig. 2

Genetic characterization of collected samples. (A) Genetic relationship among BioSUD samples (projected) and other Eurasian populations inferred from PCA; (B) Violin plots of ROH length in different populations; (C) Results of the ADMIXTURE clustering analysis using K = 7. CDX: Dai Chinese; CEU: Central Europeans; FIN: Finnish; GBR: British from England and Scotland; GIH: Gujarati Indians from Houston; IBS: Iberians from Spain; ITA: Italians; JPT: Japanese from Tokyo; LWK: Luhya from Webuye Kenya; TSI: Tuscans; YRI: Yoruba from Nigeria.

Descriptive evaluation of the questionnaire variables

Sociodemographic data and psychosocial factors

All 1,806 sampled individuals completed the questionnaire. The missingness rate per question varied from 0 to 39.6% (A = 11.8%).

Both the control (1,046 males vs. 462 females) and case groups (278 males vs. 20 females) included a higher number of male participants (Fig. 3A). The overall sample had an average age of 40.69 years (SD = 12.31, range 18–72). As summarized in Fig. 3B, the control group had an average age of 40.43 years (SD = 12.66, range 18–72), while the cases had an average age of 42.10 years (SD = 10.15, range 20–71). High school was the most frequent level of education among participants (46.1%), followed by a university degree (25.2%), with the remaining participants distributed across middle school (16.0%), post-graduate studies (10.0%), and primary school (1.9%; Fig. 3C). A comparison of the sample data with the 2020 statistics from the Italian National Institute of Statistics (ISTAT) for Apulia reveals differences in educational distribution. The control group had higher rates of high school (49.9% vs. 31.9%) and university degree attainment (30.4% vs. 12.4%), and a lower middle school rate (5.4% vs. 31.3%), than the Apulian ISTAT averages. In contrast, the cases showed lower rates for high school (31.4%) and degree (3.4%), as well as a higher rate of middle school completion (55.6%), than the ISTAT statistics. For employment status, the case group showed a higher proportion of individuals unemployed for over 12 months (37.7% vs. 12.4%) and a lower rate of full-time employment (34.8% vs. 58.8%; Fig. 3D). Furthermore, participants in the case group experienced more adverse events than controls (Fig. 3E). However, controls reported higher rates of bereavement (68.3%) and violent crime victimization (11.3%) compared with cases (51.2% and 7.3%, respectively). Conversely, serious accidents were more frequent among cases, with a rate of 15.9% compared to 8.7% among controls. Cases also reported higher rates of serious illness (2.4% vs. 0.5%), witnessed violent crimes (8.2% vs. 5.9%), and sexual abuse (4.8% vs. 1.1%), compared with controls.

Regarding family members’ substance use and behaviors (Fig. 3F), most controls reported little or no family drug consumption, with 23.9% reporting no use, 56.4% reporting low consumption, 16.7% reporting average consumption, and minimal representation in the higher categories (2.1% high and 0.9% very high). Conversely, the case group exhibited a divergent pattern: 6.8% reported no familial substance use, 39.6% reported low use, 26.1% reported average use, and a considerable percentage fell in the higher categories (13.0% high and 14.5% very high).

Most participants in the control group reported having “Good” or “Very Good” relationships with their families (71.7%), with only 2.8% reporting “Poor” or “Very Poor” relationships. In contrast, the case group exhibited a lower proportion of “Good” or “Very Good” family relationships (47.4%) and a significantly higher percentage (21.3%) reporting “Poor” or “Very Poor” quality (Fig. 3G). On the other hand, in the control group, a substantial majority reported either “Good” or “Very Good” relationships with peers (76.0%). In comparison, the combined “Poor” and “Very Poor” categories accounted for a relatively small proportion (1.7%). Conversely, the case group demonstrated a lower prevalence of “Good” and “Very Good” relationships with peers (51.2%) and a higher incidence of “Poor” or “Very Poor” categories (10.1%) (Fig. 3H).

Substance use

To assess participants’ nicotine usage, we asked individuals about their smoking habits, which are defined as using at least one unit of tobacco or nicotine-containing products each day. Among controls, 54.7% are non-smokers, 20.1% are former smokers (more than six months before the questionnaire date), 3.2% are former smokers (less than six months before the date of the questionnaire), and 22.0% are current smokers. In contrast, the case group has a different profile, with 91.3% being current smokers (Fig. 3I).

The case group showed a different pattern of alcohol consumption compared to the controls (Fig. 3J). While fewer participants in the case group reported drinking alcohol than controls (70.5% vs. 81.3%), drinkers showed heavier use, particularly “4 times a week or more” (25.7% vs. 6.8%).

The prevalence of cannabis usage among exposed controls was low, with only 8.6% reporting 30 or more uses and a combined 29.6% reporting less frequent use (fewer than 30 times), while 61.7% of the controls had never used cannabis (Fig. 3K). In contrast, the case group showed a different pattern, with a substantial 78.6% reporting 30 or more occasions of cannabis use, a combined 11.7% indicating less frequent consumption (fewer than 30 times), and just 9.7% reporting never using cannabis.

The prevalence of cocaine usage was extremely low in the control group, with only 0.3% of exposed controls reporting 30 or more occasions of use and a combined 2.6% reporting less frequent consumption (fewer than 30 times). Most controls (97.2%) had never used cocaine (Fig. 3L). The case group, on the other hand, showed a markedly different pattern, with 84.0% of participants reporting 30 or more occasions of cocaine use. Just 9.7% of cases reported abstinence, while a smaller share (4.5%) reported less frequent use.

Heroin use varied significantly between cases and controls. While almost all controls (99.9%) reported no heroin use, only 29.1% of cases had never used heroin. Most of the cases (66.02%) reported using heroin 30 times or more, with the remaining 4.9% reporting fewer than 30 occasions of use (Fig. S5E - Supplementary Materials). A similar pattern emerged for “other substances”, with most controls (97.5%) indicating no use, compared to 65.1% of cases. In contrast, 15.5% of cases used other substances 30 times or more, whereas 19.4% indicated less frequent use (Fig. S6 - Supplementary Materials).

Fig. 3

Frequency distributions and descriptive statistics for questionnaire variables, stratified by control and case groups (represented graphically as red and blue, respectively). From left to right: (A) Sex; (B) Age; (C) Education; (D) Employment; (E) Adverse events; (F) Family substance use; (G) Family relationships (Quality); (H) Peer relationship (Quality); (I) Nicotine use; (J) Alcohol use; (K) Cannabis use; (L) Cocaine use.

Substance use classification - DSM-5 TR checklist

As expected, the prevalence of each SUD was higher among cases than among controls (see Table S4 - Supplementary Materials).

For Cannabis Use Disorder (CaUD), most controls (88.6%) did not meet the diagnostic criteria, while 51.3% of cases had no CaUD. Among the cases, 11.4% had mild CaUD, 9.7% had moderate CaUD, and 20.8% had severe CaUD.

In contrast, a small proportion of controls exhibited probable CaUD, with 1.5%, 0.7%, and 0.4% meeting the criteria for mild, moderate, and severe CaUD, respectively.

For Cocaine Use Disorder (CUD), 93.0% of controls did not meet the diagnostic criteria, whereas only 24.5% of cases were classified as having no CUD. Conversely, 4.4% of cases met the criteria for mild CUD, 6.7% for moderate CUD, and 59.4% for severe CUD, while the prevalence of CUD among controls was negligible.

For Heroin Use Disorder (HUD), 93.0% of controls did not meet the criteria, whereas only 38.6% of cases were classified as having no HUD. Among the cases, 3.0% met the criteria for mild HUD, 3.4% for moderate HUD, and 50.7% for severe HUD. No control participants met the criteria for mild, moderate, or severe HUD.

For Other Substance Use Disorder (OSUD), most controls (94.3%) did not meet the diagnostic criteria, compared to 83.9% of cases. The prevalence of mild, moderate, and severe OSUD was relatively lower than that observed for other substances, with 5.0%, 1.7%, and 3.0% of cases meeting these criteria, respectively. Among controls, only 0.1% met the criteria for mild OSUD, while no control participants met the criteria for moderate or severe OSUD.
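The mild, moderate, and severe labels reported above correspond to the DSM-5 symptom-count thresholds: 2 to 3 of the 11 criteria for mild, 4 to 5 for moderate, and 6 or more for severe. The following is a minimal sketch of that mapping, assuming each participant's checklist has already been reduced to a count of endorsed criteria per substance. The function name `sud_severity` is hypothetical and does not represent the study's actual scoring pipeline.

```python
def sud_severity(n_criteria: int) -> str:
    """Map a count of endorsed DSM-5 SUD criteria (0-11) to a severity label.

    Hypothetical helper for illustration only; thresholds follow DSM-5:
    2-3 criteria = mild, 4-5 = moderate, 6 or more = severe.
    """
    if not 0 <= n_criteria <= 11:
        raise ValueError("DSM-5 lists 11 SUD criteria; count must be 0-11")
    if n_criteria >= 6:
        return "severe"
    if n_criteria >= 4:
        return "moderate"
    if n_criteria >= 2:
        return "mild"
    return "none"


# Example: label a few illustrative (made-up) criterion counts
labels = [sud_severity(n) for n in (0, 2, 5, 9)]
```

Applying the same function separately to each substance's checklist would yield one label per substance, from which co-occurring disorders can then be tabulated.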

In the control group (N = 1,508), only a small number of participants reported mild substance use disorder involving combinations of substances. Specifically, one participant (< 0.1%) reported a mild SUD involving cocaine and other substances, while another (< 0.1%) reported a combination of cocaine and cannabis. In the case group (N = 298), the most frequent mild SUD combination was cannabis and cocaine, affecting three participants (1%). For moderate SUD, cannabis and cocaine remained the most common combination, with two cases (0.7%). Severe SUD was more prevalent in the case group and showed a strong pattern of polydrug use. The most common SUD combination was in the severe range for CUD and HUD, affecting 86 participants (28.9%), followed by a combination of cocaine and cannabis use, which affected 55 participants (18.5%). Other combinations of substances were less frequent but still contributed to the overall burden of severe SUD. Specifically, 24 cases (8.1%) involved severe use of cannabis, cocaine, and heroin, five cases (1.7%) involved severe use of other substances and heroin, five cases (1.7%) involved severe use of cannabis, cocaine, and other substances, and three cases (1%) involved cannabis, heroin, and other substances.
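The percentages reported for severe polydrug combinations can be checked against the participant counts and the case-group size (N = 298) given in the text. The short sketch below does this arithmetic. The dictionary keys and the `percent` helper are illustrative names, and the counts are copied from the paragraph above rather than taken from the underlying dataset.

```python
N_CASES = 298  # case-group size reported in the text

# Participant counts for severe SUD combinations, as reported in the text
severe_combinations = {
    "cocaine + heroin": 86,
    "cocaine + cannabis": 55,
    "cannabis + cocaine + heroin": 24,
    "other substances + heroin": 5,
    "cannabis + cocaine + other substances": 5,
    "cannabis + heroin + other substances": 3,
}


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of the case group affected by each combination
severe_pct = {combo: percent(n, N_CASES) for combo, n in severe_combinations.items()}

for combo, pct in severe_pct.items():
    print(f"{combo}: {severe_combinations[combo]} cases ({pct}%)")
```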

Discussion

Here, we present the first analysis of the genetic, psychosocial, sociodemographic, and SUD behavior variability within the BioSUD cohort, which comprises 1,806 individuals.

When evaluating the genomic variation of the BioSUD cohort, PCA shows that most of the samples fall within the genetic variability of Southern Italian individuals, with a few samples showing genetic profiles similar to other Italian or Western European regions. Specifically, only three samples show genetic profiles compatible with substantial ancestry from different continents. The admixture analysis confirms a shared demographic and evolutionary history, indicating that the BioSUD cohort shares a high percentage of ancestry with people from Iberian and, to a lesser extent, other European groups. This ancestral composition is comparable to that of other Italian groups31. However, almost all individuals show a substantial proportion of an ancestry component that is modal in Southern East Asian individuals and absent in Iberians. Moreover, a low proportion of ancestry modal in Sub-Saharan African groups is observed. The complex demographic history of the Italian peninsula is reflected in these two ancestries, contributing to the high heterogeneity found across the European continent. According to recent studies, Italy exhibits the highest level of genetic variation on the European continent, with significant heterogeneity among its various regional populations31,32. Here, we confirm this heterogeneity by observing that, on average, Italian individuals have the lowest number of ROHs in Europe. The BioSUD sample set, mostly of Apulian descent, carries fewer ROHs than other Italian individuals, suggesting that Southern Italians are among the most genetically diverse European populations. However, more research is required to validate these findings.

The survey data reveal the demographic details of the population under study, including gender, age, educational attainment, and employment status in the control and case groups. The pronounced male predominance in substance consumption observed in both the control and case groups aligns with established trends in substance consumption studies: men exhibit higher rates of alcohol consumption, alcohol-related issues, and alcohol use disorder diagnoses, as well as greater use of illicit substances and higher prevalence rates of SUDs55,56,57,58.

The participants’ educational backgrounds differ considerably between groups. Both high school enrollment and high school graduation rates are lower in the case group than in the control group. This is consistent with the established correlation between drug use and diminished educational achievement59. However, comparison with the 2020 ISTAT demographic data for the Apulia region reveals possible sampling biases in the current study. In particular, the mean educational attainment of the control group is higher than the stated regional average, while the case group demonstrates a lower average attainment. Therefore, the observed results necessitate a cautious interpretation: their generalizability to the broader Apulian population is limited, and this bias must be accounted for in subsequent GWAS.

Psychosocial factor analysis reveals distinct patterns between case and control groups. Control participants primarily report low levels of familial substance use, while cases exhibit a more diverse range, suggesting that SUDs are influenced by both environmental60,61,62 and genetic63,64 factors. Additionally, case participants demonstrate greater variability in reported relationship quality and are significantly more likely than control participants to describe challenging or strained familial and peer relationships65,66.

In examining substance use, the control group predominantly consists of non-smokers, whereas the case group exhibits a markedly higher prevalence of current smokers. Individuals with SUDs often face complex challenges related to compulsive substance use, and nicotine, being highly addictive, may become intertwined with other substance use patterns67,68,69. Shared risk factors, common neural pathways, or coping mechanisms may contribute to the increased nicotine consumption observed among participants with SUDs. Whereas the control group exhibited a pattern of regular, moderate alcohol consumption, the case group demonstrated a more polarized distribution, with a higher prevalence of abstinence (roughly 29% vs. 19%) and a significant proportion engaging in heavy drinking among those who did consume alcohol. The supervised environment of the private and public healthcare facilities where participants were recruited may explain the higher abstinence among cases than among controls.

Controls mostly abstained from cannabis and cocaine, except for some exposed controls, while cases showed more frequent and intense substance use, especially cannabis. Heroin use was almost absent among controls (roughly 99.9% reported none) and use of other substances was rare, while consumption rates were significantly higher in the case group.

The case group exhibited a high incidence of severe SUDs, predominantly related to cocaine and heroin use, as evidenced by DSM-5-TR symptom response analysis, which supported the clinical classification. The low occurrence of severe SUD cases among exposed controls further reinforced the distinction between the clinical and control groups. Polydrug use, with cocaine-heroin as the most frequent combination, followed by cocaine-cannabis and other substance use, was common in severe SUD. These findings align with research on cocaine-heroin co-use in severe addiction70. Mild and moderate SUD criteria were less frequent in the clinical group but followed similar patterns, with cannabis and cocaine being the most used substances. Even among controls, a small percentage (≤ 0.1%) met the criteria for mild SUD, indicating that some level of substance use occurs even in individuals not classified as clinical cases, referred to as exposed controls71.

Limitations and future directions

Although this study provides valuable insights, it also has some limitations. The first is the current imbalance in our case-control ratio, which stems from the ongoing recruitment phase. While our target cohort is 3,000 participants, evenly distributed between cases and controls, the present analysis is based on 298 cases and 1,508 controls. This disparity necessitates a cautious interpretation of our findings. To address it, we are actively expanding our recruitment across the Apulia region. Building upon our existing partnerships with initial facilities, we have established new collaborations with centers in the Foggia (FG) and Barletta-Andria-Trani (BAT) provinces. We aim to recruit a further 1,200 cases and achieve our desired 50:50 case-control ratio. To ensure we reach our target sample size, we are also actively pursuing additional agreements with other centers throughout Apulia.
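The size of the recruitment gap follows directly from the figures above. The sketch below works through the arithmetic, assuming the stated target of 3,000 participants split evenly, the 298 cases analyzed here, and the control-group size of 1,508 reported in the Results. All variable names are illustrative.

```python
TARGET_TOTAL = 3000       # stated target cohort size
current_cases = 298       # cases in the present analysis
current_controls = 1508   # control N reported in the Results

# A 50:50 design implies the same target for each group
target_per_group = TARGET_TOTAL // 2

# Additional participants needed per group (never negative)
cases_needed = max(0, target_per_group - current_cases)
controls_needed = max(0, target_per_group - current_controls)

# Current imbalance expressed as controls per case
controls_per_case = current_controls / current_cases

print(f"Cases still needed: {cases_needed}")
print(f"Controls still needed: {controls_needed}")
print(f"Current controls per case: {controls_per_case:.2f}")
```

Under these assumptions the shortfall lies entirely on the case side, in line with the recruitment goal of roughly 1,200 cases.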

Another key limitation is the lower representation of female participants. This gender disparity reflects well-established epidemiological patterns of SUD prevalence and aligns with clinical referral trends: the 2024 European Drug Report4 estimates the average male-to-female ratio among users entering treatment for cannabis, cocaine, heroin, and other drugs to be 4.39:1. However, this imbalance may constrain the generalizability of our findings to female populations. Although the current sample represents existing clinical populations, future phases of our research project will incorporate targeted recruitment strategies to address this sampling bias and enhance the external validity of our findings.

Methodological challenges inherent in control group classification within SUD research necessitate attention. The standard approach of using the absence of a formal SUD diagnosis as the sole criterion for defining controls is susceptible to overlooking subclinical or undiagnosed SUDs. Furthermore, the approach relies heavily on honest self-report, an assumption frequently undermined by social desirability bias72 and response biases73. Misclassification is more likely when there is no external validation, such as biological markers, collateral data, or symptom validity tests. Future research should use stricter screening methods to improve the differentiation between control and clinical groups. Moreover, the accurate assessment of drug use is hindered by its sensitive nature and concerns regarding privacy, social stigma, and legal consequences74. This often leads to underreporting and non-response bias, especially in control groups. Future research should incorporate methodological refinements, such as standardized structured interviews and indirect questioning techniques, to mitigate these issues and promote more reliable reporting.

This study obtained self-report measures of early adverse experiences, family, and socioeconomic status but did not evaluate the effects of these factors on SUD severity and polydrug use. Future studies that examine these associations and add further psychological measures, such as executive functioning and personality, will help identify factors that influence or confound SUDs.

Despite these limitations, the current study identifies the BioSUD cohort as an essential resource for studying complicated behavioral features linked with SUDs. This cohort may be used in future studies to investigate the link between genetic and environmental factors in characterizing SUD phenotypes.