Abstract
HTLV-1-associated myelopathy (HAM) develops in a subset of HTLV-1-infected individuals, while most remain asymptomatic. This complicates the identification of HTLV-1 carriers at elevated risk. In this study, we integrated HTLV-1 proviral load and antibody titers against Tax, Env, Gag p15, p19, and p24 proteins in a machine learning (ML) framework to identify and characterize high-risk individuals likely to develop HAM. We stratified asymptomatic carrier samples using an anomaly detection model. We further developed and validated classifier models capable of distinguishing three clinical subgroups (carrier, ATL, and HAM) and used them to assess the anomaly carrier samples as unseen test data. With most anomaly carrier samples (~76.47%) predicted as HAM, further statistical and interpretative analysis revealed the ‘HAM-like’ characteristics of the anomaly carrier samples, indicating elevated risk. Additionally, significant heterogeneity in immune response was observed among other asymptomatic carriers. As an exploratory, hypothesis-generating study, our findings are preliminary and aim to propose potential biomarkers and computational strategies that warrant validation in future longitudinal investigations. Our machine learning-based approach offers a novel and insightful tool for identifying and evaluating high-risk characteristics for HAM, providing a holistic view of the complex immune dynamics of asymptomatic carriers of HTLV-1.
Introduction
Human T-cell leukemia virus type 1 (HTLV-1) is the first identified retrovirus known to cause chronic lifelong infections in humans1. It can cause adult T-cell leukemia-lymphoma (ATL), a form of blood cancer, and HTLV-1-associated myelopathy (HAM), a chronic inflammatory disease of the central nervous system. While most infected individuals remain asymptomatic, approximately 2 to 5% of HTLV-1 carriers develop ATL2, and 0.25 to 4% develop HAM3. The annual national incidence rate of new HTLV-1 infections in Japan has been reported as 3.8 cases per 100,000 person-years4. The combined strategies of immune evasion and suppression allow HTLV-1 to persist in the host for a prolonged time without causing symptoms, which complicates the identification of carriers at higher risk of developing ATL or HAM5,6,7. Sex and age also correlate with disease onset: ATL occurs more often in males and HAM in females, which might reflect hormonal or immunological differences8,9. ATL has been reported to occur in adults at least 20 to 30 years after infection, whereas the latency period for HAM ranges from 4 months to 30 years8,10; onset therefore typically falls around middle age. While it remains unknown why some HTLV-1 carriers develop disease, it is of interest to develop methods that can identify high-risk asymptomatic individuals.
For HTLV-1-infected individuals, the diversity of the infected cells and immune responses is influenced by the interaction between the virus and the host’s immune system. A disruption of the equilibrium between viral persistence and host immunity can trigger the onset of associated diseases. Distinct immunological responses may arise from impaired immune regulation against the HTLV-1 infection11. As reported previously, HTLV-1 infection evokes both cellular and humoral immunity12,13,14. Therefore, immunological markers can be instrumental in evaluating the risk of associated diseases. Although proviral load (PVL) has been considered a risk factor, its interpretation in asymptomatic carriers is still challenging due to individual variations and the lack of definitive thresholds for disease risk15,16. Additionally, some studies reported that PVL remains constant over several years, regardless of clinical manifestation16,17,18,19. However, antibody profiling, when combined with other markers like PVL, has been reported as useful for distinguishing asymptomatic carriers from ATL or HAM patients20,21,22,23 and as having potential for predicting disease progression. Moreover, serum-based testing is simpler and cheaper than genetic analysis. Hence, analyzing antibody responses can provide a quantitative and specific method for identifying infected individuals at elevated risk.
In our previous study using a modified Luciferase immunoprecipitation system (LIPS) assay, we screened patients for antibodies against each of the viral proteins, HTLV-1 Gag proteins (p15, p19, p24), Env, and Tax. Along with multivariate analysis of the antibody titers followed by targeted sequencing, we identified carriers at high risk for ATL. We also listed important factors for separating the HAM subgroup22; however, this finding might overlook the possibility that some carriers are already showing antibody responses similar to those of HAM, which therefore needs to be explored in more depth. Although longitudinal studies are invaluable for observing disease progression over time, they remain challenging given the prolonged latency period. Moreover, because of treatment effects, they may not fully capture the hidden patterns within the data. Cohort studies usually maintain high data quality and standardized data collection protocols. This reduces noise and inconsistencies, making machine learning (ML) models more compatible and generalizable24. ML models can surpass traditional multivariate statistical methods by handling complex, high-dimensional data and uncovering non-linear patterns. Based on clinical data alone, ML models offer a robust alternative tool within data-rich contexts25,26,27,28.
In the present study, we developed a two-tiered ML-based framework integrating HTLV-1 PVL and antibody titers to Tax, Env, and the immunogenic mature Gag p15, p19, and p24 proteins to identify and characterize asymptomatic carriers with a higher likelihood of developing HAM. Collectively, our results suggest that the ML-based approach can support risk prediction for HAM, and potentially early intervention, in carriers who might otherwise remain undetected through conventional diagnostic approaches.
Results
Detection of potentially high-risk carriers
At the data preprocessing step, before applying ML models to our dataset, we handled the multicollinearity issue29 by excluding Gag p19, which had the highest VIF score (see Determination of key variables in Methods and Supplementary Tables S1A and S1B for details). Out of 264 asymptomatic carriers, the Isolation forest model detected 17 carrier samples as anomalies, which we labeled as anomaly carriers (AC). One carrier who later developed HAM (CDH) was also identified as an anomaly and included in the AC group, suggesting that the Isolation forest model can filter carrier samples with a possible risk of HTLV-1-associated diseases. For details about the Isolation forest algorithm and the choice of hyperparameters, please see Anomaly detection by isolation forest algorithm in Methods.
Next, we compared the performance of four ML models based on PRAUC scores to classify samples into 3 subgroups: non-anomaly carrier, HAM, and ATL. We chose the Random forest (RF) model for its superior performance (see Classification modeling in Methods for details). Table 1 shows each classifier’s average overall mean PRAUC (mean of per-class PRAUCs) across five repeats of one-vs-rest nested CV. The RF classifier was then used to predict the holdout set of AC samples (n = 17). Interestingly, the classifier predicted the majority of AC samples (n = 13 out of 17) as HAM [Fig. 1]. The one CDH sample was also predicted as HAM by the classifier model. Moreover, across the 17 anomaly carrier samples, all the feature values showed a stronger positive correlation with their predicted probabilities of HAM [Supplementary Table S4].
The Partial Least Squares (PLS) projection of the sample groups further revealed that AC samples were localized near the HAM cluster (Fig. 2). This result, combined with the classification result in which most AC samples were predicted as HAM, suggests that AC samples share significant similarities with HAM samples, further supporting our hypothesis that AC samples exhibit ‘HAM-like’ characteristics.
Illustration of the prediction results of anomaly carrier samples by the random forest classifier model. The left bar shows the training and test data for the classifier model. The model was trained on three sample groups and predicted the anomaly carrier samples as unseen test data. The anomaly carrier samples (n = 17) were classified into three prediction groups: around 76.47% of the anomaly carrier data were predicted as HAM (n = 13), whereas only 17.65% and 5.88% of the samples were predicted as carrier (n = 3) and ATL (n = 1), respectively (shown on the right bar).
PLS-based visualization of sample distribution across clinical groups. This PLS plane shows the distribution of all 369 samples from non-anomaly carriers (green), ATL patients (blue), HAM patients (yellow), anomaly carriers (red), and the CDH (dark red as edge color and white-centered). Each dot on the plot represents an individual sample. This plot depicts the clustering of the sample groups based on the analyzed variables, where anomaly carrier samples were positioned near the HAM cluster.
Comparison of feature value distributions among the subgroups
Figure 3 compares the feature values of the 4 subgroups. In the comparison between AC and non-anomaly carriers, all features were significantly higher in AC. In contrast, we found no statistically significant difference between AC and HAM for PVL (p = 1.0), Tax (p = 1.0), Gag p15 (p = 0.34), Gag p19 (p = 0.65), and Gag p24 (p = 0.89); only Env differed significantly (p = 0.0048) (for details about the statistical tests, please see Boxplot visualization and statistical analysis in Methods). These observations collectively indicate that AC samples display a high degree of similarity with HAM/TSP. Consistent with previous research11, we found antibody responses to the immunodominant proteins (Env, Tax, Gags) to be higher in HAM patients (n = 56). Conversely, PVL in ATL patients was significantly higher than in all other subgroups (Fig. 3, Supplementary Table S6), which is consistent with previous studies4,30,31.
The boxplots collectively illustrate the distribution of PVL and Antibody titers to HTLV-1 antigens Tax, Env, Gag p15, Gag p19, and Gag p24 across different clinical subgroups: non-anomaly carrier (green), ATL (blue), HAM (yellow), and anomaly carrier (red). The individual data points overlaid on the boxplots show the actual distribution and density of the data.
Exploring biomarkers for anomaly carrier detection
To determine the crucial factors that characterize the anomaly carrier and HAM, SHapley Additive exPlanations (SHAP) analysis was performed. SHAP analysis is an interpretable machine learning framework that can assess the impact of each feature on the classification of each class (non-anomaly carriers, ATL, HAM, and anomaly carriers). Figure 4 shows the SHAP bar plot from the XGB classifier which performed the best [Supplementary Fig. S6]. We found Tax is the most important feature for HAM, and Gag p15 for anomaly carriers, followed by Env and Gag p24. Although the prediction above indicated high similarity between anomaly carriers and HAM subgroups, the ranking of feature importance differs qualitatively. Gag p15 and Env are influential features in anomaly carriers but their relative rankings are not uniformly elevated in HAM. Also, Tax exhibits a lower ranking in the anomaly carrier.
Discussion
In this study, we developed a machine learning-based approach to capture HTLV-1 carriers at elevated risk of HAM progression. The Isolation forest anomaly detection algorithm identified a subgroup of anomaly samples from the asymptomatic HTLV-1 carrier population. Further characterization through classifier prediction and statistical analysis revealed that the anomaly carrier samples closely resemble the characteristics of HAM, suggesting a similar disease trajectory. Additionally, different patterns of antibody response were observed among the asymptomatic carriers and other clinical subgroups which enabled us to further investigate the risk factors. Finally, we utilized SHAP for comparative feature analysis among the sample groups (non-anomaly carrier, anomaly carrier, ATL, and HAM) to identify the key driving features that characterize each subgroup and contribute to the disease progression.
The main aim of this study was to shed light on asymptomatic carriers who are at high risk of progressing to HAM. Most of the anomaly carrier samples were predicted as HAM by the RF classifier [Fig. 1], and our hypothesis was further supported when the CDH sample, purposely included in the carrier population, was also identified as an anomaly and subsequently predicted as HAM. The potential similarities in the underlying profiles of the anomaly carrier samples are also reflected in their clustering near the HAM samples (Fig. 2). All features were significantly higher in anomaly carriers compared to non-anomaly carriers (Fig. 3). Elevated antibody responses in anomaly carriers might reflect heightened immune activity during disease progression. Interestingly, we found that only the anti-Env antibody titer in anomaly carriers differed significantly from that in HAM (Fig. 3, Supplementary Table S6), whereas other features showed no significant differences. Env is one of the structural proteins of a virion and is necessary for cell-to-cell transmission. Thus it is a primary target of the antibody response32,33,34,35. Furthermore, elevated anti-Env antibody responses have been associated with HAM patients in several studies, which supports our result11,20,36,37. One possible implication is that, before onset, progression accelerates, as suggested by the increased antibody levels. In HAM, the immune response is fully engaged; however, in progressive asymptomatic carriers, this saturation has yet to be achieved20. This phase might represent a snapshot of dynamic host-virus interaction where these rising antibody titers likely reflect the heightened viral activity and the immune system’s escalating response as the disease advances toward clinical manifestation. Ultimately, a saturation point is reached at the onset of the disease, where antibody levels level off as the immune response shifts into a steady-state phase.
This might be well reflected in feature analysis, where the SHAP value of Env is relatively high in the anomaly carrier, but not in HAM and non-anomaly carrier [Fig. 4, Supplementary Fig. S7].
We found Tax to be the predominant feature of HAM, consistent with findings from multiple studies20,38. Furthermore, prior studies have reported significantly higher antibody responses to Env and Gag proteins in HAM patients, reinforcing their potential role in HAM11,20. It is known that during infection, Gag and Env proteins are initially unpolarized in isolated T cells and accumulate at the cell-cell junction upon contact. Gag protein is subsequently transferred from HTLV-1-infected T cells to uninfected T cells39. In line with these previous observations, we found that the feature values of Gag p15, p24, and Env of anomaly carrier samples exhibited a significant inverse relationship with their anomaly scores, i.e., higher feature values correspond to lower (more negative) scores and thus higher anomaly levels [Supplementary Table S2, Table S5, Supplementary Fig. S5]40. Assessment of humoral immunity to Gag demonstrates potential as a biomarker for detecting high-risk individuals. Our results suggest that Gag p15 may play a role relevant to HAM onset [Figs. 3 and 4 and Supplementary Fig. S7]; however, we refrain from drawing specific implications about Gag p15, and further research is required to identify the specific functions of the mature Gag proteins (p15, p19, and p24). It is noteworthy that, although the SHAP value of Gag p24 falls within the high-ranking features that characterize anomaly carriers in some classifiers, we opted to exclude the interpretation of Gags due to their inconsistent contribution patterns observed across the multiple classifiers employed in this study [Supplementary Fig. S7].
Identifying the risk of developing HAM is more challenging than for other HTLV-1-associated diseases. In the case of ATL, for example, the risk can often be characterized by changes in the clonality of infected cells, since a single infected clone expands during viral progression. Also, several driver mutations are reported to stimulate malignancy, promoting the survival of pathogenic cells, which outcompete other infected cells and proliferate monoclonally41. While these promising markers can detect the risk of ATL onset, early diagnostic markers for HAM are less well described, owing to its slow progression42. Moreover, complicated host immune responses against infected cells vary widely between patients with different lifestyles, which makes the prediction more difficult43. With anti-Env at the top of the list, elevated antibody titers might be a key observation for evaluating disease progression.
Of interest is the significant heterogeneity in immune response among the asymptomatic carriers in our study. Surprisingly, antibody responses (against Env, Tax, and Gags) and PVL in many asymptomatic carriers were observed at the same elevated levels as in HTLV-1-related diseases (ATL and HAM). This finding led us to our initial hypothesis of detecting high-risk asymptomatic carriers (i.e., anomaly carriers) who are likely to progress to disease onset. Although heterogeneity seems obvious when considering the various lifestyle backgrounds of patients, it is noteworthy that we could confirm it based on our large number of asymptomatic carrier data. This motivated our distinct approach, which nevertheless aligns with our previous findings using the same dataset, showing the latent and diverse potential of asymptomatic carriers22.
Our work has some limitations. First, we do not have information on whether the anomaly carriers develop HTLV-1-related diseases in the future, except for one sample who was later diagnosed with HAM (CDH). To fully evaluate the prediction and the hypothesis of our result, especially for HAM, further data accumulation would be critical (for example, through a prospective study15). Second, little is known about the relationship between the antibody titers and the host immune defense, as mentioned above. For a dataset from the LIPS assay like ours to be used for clinical diagnosis, these interplays should be explored in more depth. Furthermore, inconsistent results in antibody titers from previous studies have discouraged clinical application, which makes it difficult to choose consensus cutoff values for disease distinction42. Additionally, our dataset exhibits class imbalance (more carriers than ATL and HAM), which reflects the skewed prevalence of the diseases, yet it remains biased toward balance (larger proportions of ATL and HAM than in the real world). We implemented repeated downsampling to prevent the model from inaccurate training, though each downsampling excludes some carrier samples, and thus we risk understating model generalizability when applied to the full population. Integrating additional clinical metadata on clinical background or comorbidities, which we do not have here, may enhance the interpretability of the anomaly-detected carrier subgroup. Finally, as this study was designed as a data-driven investigation focusing primarily on immunological patterns inferred from antibody titer profiles, our findings should be considered exploratory and hypothesis-generating. Given the absence of external model validation, the results remain preliminary.
The identified risk indicators are not clinically actionable conclusions but should be viewed as starting points for further prospective studies involving larger cohorts and independent validation.
Methods
Ethics statement
This study was performed in accordance with the Declaration of Helsinki and was approved by the Ethics Committees of Kumamoto University (accession numbers: G489, G499, and E2214). Written informed consent was waived because of the retrospective design. Consent for publication was obtained from all patients.
Study population
The data used in this study was published previously by Yamada et al.22. PVL and antibody titer data (non-time series) were collected against HTLV-1 antigens Tax, Env, Gag p15, p19, and p24 using LIPS assay. No cut-off was applied to LIPS antibody titers, allowing continuous evaluation of their distribution and diagnostic relevance. In our analysis, the ATL group comprised both individuals who had been diagnosed with ATL at the time of sample collection (n = 25) and those who were carriers at the time of sampling but were later diagnosed with ATL (CDA, n = 24). We also had only one carrier who later developed HAM (CDH), and it was purposefully included in the carrier population. Therefore, we focused our study on 264 asymptomatic carriers, 49 ATL, and 56 HAM patients.
Determination of key variables
Initially, Spearman’s rank correlation revealed a significant correlation between Gag p19 and p24 [Supplementary Fig. S2]. To address the multicollinearity issue and choose the variables to use in the ML analysis, the Variance Inflation Factor (VIF) score was used29. See Supplementary Tables S1A and S1B.
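As an illustration of the VIF screening described above, the following minimal sketch computes a VIF for each feature by regressing it on the remaining features (with an intercept) and taking 1 / (1 - R²). The data are synthetic, the function name `vif_scores` and the column names are hypothetical, and this is not necessarily how the scores in Supplementary Table S1 were computed.

```python
import numpy as np

def vif_scores(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.empty(d)
    for j in range(d):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()           # R^2 of feature j on the others
        scores[j] = 1.0 / (1.0 - r2)
    return scores

# Synthetic stand-in data (NOT the study data): p19 is built to track p24.
rng = np.random.default_rng(0)
n = 200
pvl = rng.normal(size=n)
tax = rng.normal(size=n)
p24 = rng.normal(size=n)
p19 = 0.9 * p24 + 0.1 * rng.normal(size=n)
names = ["PVL", "Tax", "Gag_p19", "Gag_p24"]
vif = vif_scores(np.column_stack([pvl, tax, p19, p24]))
for name, score in zip(names, vif):
    print(f"{name}: VIF = {score:.2f}")
```

In this toy setting the two collinear columns receive large VIF values while the independent ones stay close to 1, which is the pattern that would motivate dropping one member of the correlated pair.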
Anomaly detection by isolation forest algorithm
For identifying potential outliers or anomalous data points from the asymptomatic carrier population (n = 264), we selected the Isolation Forest Anomaly Detection algorithm, an unsupervised machine learning technique based on decision trees, as our primary method because of its unique approach of isolating anomalies rather than profiling normal data. For each datapoint (sample), the following process is repeated until the datapoint is isolated:
1. Randomly select a feature (e.g. PVL).
2. Randomly choose a threshold between the minimum and maximum values of the selected feature (e.g. PVL = 0.1) and divide the data points into those below and those above the threshold.
The key idea is that data points with anomalous feature values are likely to be isolated after only a few iterations. The algorithm constructs an ensemble of isolation trees for a given dataset and uses the path length from the root to the leaf to determine the anomaly score. Given that m is the number of data points, the anomaly score s for a datapoint x is defined as
\(s\left(x,m\right)={2}^{-\frac{E\left({h}_{i}\left(x\right)\right)}{c\left(m\right)}}\),
where \({h}_{i}\left(x\right)\) represents the path length for the \(i\)-th isolation tree, \(E\left({h}_{i}\left(x\right)\right)=\frac{1}{t}{\sum}_{i=1}^{t}{h}_{i}\left(x\right)\) denotes the average path length across the ensemble of \(t\) isolation trees,
\(c\left(m\right)=\begin{cases} 2H\left(m-1\right)-\frac{2\left(m-1\right)}{m} & \left(m>2\right) \\ 1 & \left(m=2\right) \\ 0 & \text{(otherwise)} \end{cases}\)
is the average path length for a dataset with m points, utilized as a normalization factor44, and \(H\left(k\right)\) is the harmonic number. The sklearn implementation of the decision function of Isolation Forest returns shifted scores with the opposite sign convention, so that lower (negative) scores indicate potential anomalies40.
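The normalization factor and score can be checked numerically with a short sketch. It assumes the standard Isolation Forest score \(s(x,m)=2^{-E(h(x))/c(m)}\), computes the harmonic number exactly by summation, and uses made-up path lengths; the function names are hypothetical.

```python
def harmonic(k):
    """Exact harmonic number H(k) = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))

def c(m):
    """Average path length used as the normalization factor for m points."""
    if m > 2:
        return 2.0 * harmonic(m - 1) - 2.0 * (m - 1) / m
    if m == 2:
        return 1.0
    return 0.0

def anomaly_score(path_lengths, m):
    """Score in (0, 1]; values near 1 mean the point is isolated quickly."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / c(m))

# Illustrative path lengths over three trees (made-up numbers), m = 264 carriers.
short_paths = anomaly_score([2, 3, 2], 264)     # isolated after few splits
long_paths = anomaly_score([10, 12, 11], 264)   # needs many splits
print(short_paths, long_paths)
```

A sample whose mean path length equals c(m) scores exactly 0.5, so shorter-than-average paths push the score toward 1 and longer ones toward 0.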
By applying a cutoff threshold at -0.05 to the anomaly scores of the Isolation forest, we isolated the anomaly data points for further investigation45. This threshold was strategically chosen to capture approximately 5% of the most extreme anomalies (inversely corresponding to the 95th percentile of the normal data distribution) from our carrier population [Supplementary Fig. S4]. Since around 4% of the carriers develop HAM3,46, we aimed to mirror this proportion.
The resulting anomaly carrier samples were then removed from the carrier data and considered as a holdout test set (unseen data) for further classification analysis. The remaining non-anomaly carrier, ATL, and HAM samples were used for training and cross-validation of the classifier models. Additionally, the feature values of the anomaly carrier samples were tested for Spearman correlation analysis with their anomaly scores. The difference between the sample groups was evaluated by plotting all the samples in a PLS plane.
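The thresholding and holdout split described above might look as follows in scikit-learn. This is a sketch on synthetic data: the number of trees, the random seed, and the array names are assumptions rather than the settings used in the study; only the -0.05 cutoff on `decision_function` comes from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for 264 carriers x 5 features (NOT the study data).
X_carrier = rng.normal(size=(264, 5))
X_carrier[:10] += 4.0  # plant a few extreme samples for illustration

# n_estimators and random_state are assumed values.
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_carrier)

# decision_function: lower (negative) values indicate likely anomalies.
scores = model.decision_function(X_carrier)

THRESHOLD = -0.05
is_anomaly = scores < THRESHOLD
X_anomaly = X_carrier[is_anomaly]        # held out as unseen test data
X_non_anomaly = X_carrier[~is_anomaly]   # kept for classifier training and CV

print(f"{is_anomaly.sum()} of {len(X_carrier)} samples flagged "
      f"({100 * is_anomaly.mean():.1f}%)")
```

The flagged fraction depends on the data and the fitted forest, so a fixed cutoff will not generally yield exactly 5% of samples.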
Classification modeling
We employed the One-vs-Rest (OvR) approach to address the multiclass classification problem. This approach breaks down the multiclass classification into multiple binary classification tasks, where one classifier is trained for each class against all others. To determine the best-performing model, we evaluated four different classifiers: Random Forest (RF), XGBoost (XGB), Extra Trees (ETC), and Support Vector Machine (SVM). Given our data volume, we relied on these classifiers because they represent diverse and well-established approaches suitable for our classification task. Combining tree-based ensemble models with SVMs has been shown to balance performance, interpretability, and generalization in high-dimensional biomedical data47.
Nested cross-validation (CV) was used to ensure robust performance evaluation and avoid overfitting. To address class imbalance, we first performed bootstrap downsampling of the carrier group: in each of five independent iterations, we randomly sampled 147 carriers with replacement and merged these with the full ATL and HAM cohorts to form a training/validation subset. We then applied nested CV to each subset, using an outer 5-fold loop to estimate model generalization and an inner 5-fold loop within each outer training fold to optimize the hyperparameters of each classifier using GridSearch. The overall mean area under the precision-recall curve (PRAUC)46,48 served as the optimization criterion in the inner loop and as the performance metric in the outer loop.
Next, for each candidate classifier (Random Forest, XGBoost, ExtraTrees, SVM) we averaged its per-class PRAUCs across the five outer folds to obtain an “overall mean PRAUC” per repeat, and then aggregated these values across repeats to yield a mean for each model. The model with the highest average overall mean PRAUC was chosen as the best, retrained on the full training set, and then applied to the held-out anomaly carrier samples. We extracted the predicted probability of HAM for each anomaly carrier sample and then analyzed the correlation between these probabilities and the feature values. The workflow is depicted in [Supplementary Fig. S1]. For the classification models used in this study, the implementation available in the sklearn library was used40.
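The downsampling and nested CV procedure can be sketched with scikit-learn as below. Several choices here are assumptions for illustration only: the data are synthetic, average precision stands in for PRAUC, the hyperparameter grid is a toy grid, a multiclass Random Forest is scored one-vs-rest rather than wrapped in an explicit OvR classifier, and only two repeats are run to keep the example fast (the text describes five).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import label_binarize

def mean_prauc(estimator, X, y):
    """Mean of per-class one-vs-rest average precision (a PRAUC estimate)."""
    proba = estimator.predict_proba(X)
    classes = estimator.classes_
    Y = label_binarize(y, classes=classes)
    per_class = [average_precision_score(Y[:, k], proba[:, k])
                 for k in range(len(classes))]
    return float(np.mean(per_class))

def nested_cv_score(X, y, seed):
    """Outer 5-fold estimate of mean PRAUC with inner 5-fold grid search."""
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=seed),
        param_grid={"max_depth": [3, None]},  # toy grid, assumed
        scoring=mean_prauc,
        cv=inner,
    )
    return float(cross_val_score(search, X, y, scoring=mean_prauc, cv=outer).mean())

# Synthetic stand-ins (NOT the study data): 0 = carrier, 1 = ATL, 2 = HAM.
rng = np.random.default_rng(0)
X_car = rng.normal(0.0, 1.0, size=(247, 5))
X_atl = rng.normal(1.5, 1.0, size=(49, 5))
X_ham = rng.normal(-1.5, 1.0, size=(56, 5))

repeat_scores = []
for rep in range(2):  # two repeats for speed; the text describes five
    idx = rng.choice(len(X_car), size=147, replace=True)  # bootstrap downsample
    X = np.vstack([X_car[idx], X_atl, X_ham])
    y = np.array([0] * 147 + [1] * 49 + [2] * 56)
    repeat_scores.append(nested_cv_score(X, y, seed=rep))

overall = float(np.mean(repeat_scores))
print(f"average overall mean PRAUC: {overall:.3f}")
```

Running the same loop for each candidate classifier and comparing the resulting averages would give the model-selection step; the selected model would then be refit on all training data before `predict_proba` is applied to the held-out anomaly carriers.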
Boxplot visualization and statistical analysis
We employed a combination of visual and statistical methods to facilitate an initial comparison of the feature distributions among different sample groups, including anomaly carriers. The Kruskal-Wallis test was performed, with a significance level set at α = 0.05. Dunn’s post-hoc test was then applied, with p-values adjusted for multiple comparisons using the Bonferroni correction to control the overall type I error rate. The statistical analysis was performed using the Python SciPy package49,50.
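A compact sketch of this testing sequence is given below. The Kruskal-Wallis test uses `scipy.stats.kruskal`; Dunn's test is hand-written here from mean ranks without a tie correction, since SciPy does not provide it directly, so this is an illustrative implementation rather than the routine used in the study. Group values are synthetic and the group sizes merely mimic the cohort.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def dunn_bonferroni(groups):
    """Pairwise Dunn z-tests on joint ranks with Bonferroni-adjusted p-values.

    groups: dict mapping group name -> 1-D array. No tie correction is applied.
    """
    names = list(groups)
    values = np.concatenate([np.asarray(groups[g], dtype=float) for g in names])
    ranks = stats.rankdata(values)
    total = len(values)
    mean_rank, size, start = {}, {}, 0
    for g in names:
        n = len(groups[g])
        mean_rank[g] = ranks[start:start + n].mean()
        size[g] = n
        start += n
    pairs = list(combinations(names, 2))
    adjusted = {}
    for a, b in pairs:
        se = np.sqrt(total * (total + 1) / 12.0 * (1.0 / size[a] + 1.0 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = 2.0 * stats.norm.sf(abs(z))             # two-sided p-value
        adjusted[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni adjustment
    return adjusted

# Synthetic titers for one feature (NOT the study data).
rng = np.random.default_rng(1)
groups = {
    "carrier": rng.normal(0.0, 1.0, size=247),
    "ATL": rng.normal(0.5, 1.0, size=49),
    "HAM": rng.normal(2.0, 1.0, size=56),
    "AC": rng.normal(2.0, 1.0, size=17),
}

h_stat, h_p = stats.kruskal(*groups.values())
adj = dunn_bonferroni(groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {h_p:.3g}")
for pair, p in adj.items():
    print(pair, f"adjusted p = {p:.3g}")
```

Adjusted p-values are capped at 1, which is why comparisons with no detectable difference can be reported as p = 1.0.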
Interpretation with SHapley additive explanations (SHAP) analysis
As an approach to interpreting the model’s behavior, the Shapley Additive exPlanations (SHAP) framework was used51,52. It provides the SHAP value for each feature for all samples and explains how much an increase in each feature value can affect the predicted probability for each clinical subgroup (non-anomaly carriers, ATL, HAM, and anomaly carriers). A higher SHAP value indicates a greater impact on the classification of a sample into a specific subgroup, while a lower SHAP value corresponds to a smaller impact. In this section, four classifiers (RF, ETC, XGB, and SVM) were explored for their performance in terms of PRAUC using nested cross-validation, and SHAP values were calculated for 300 different random seeds (i.e., different values for the parameter random_state). Different random seeds were considered because we wanted SHAP values that remain consistent regardless of the randomness in the learning process. This allows us to evaluate the results with a high degree of confidence. For each random seed, hyperparameters were optimized on all data without cross-validation by GridSearch and used for calculating SHAP values. The absolute median of the SHAP value from all samples was collected for 300 random seeds, and then the absolute median value and its standard deviation were calculated for visualization. Specifically, KernelSHAP was applied for all classifiers in the SHAP Python package (Version 0.45.1)52.
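To make the quantity being summarized concrete, the sketch below computes exact Shapley values by enumerating feature subsets and averaging the model output over a background set. This is a swapped-in technique for illustration: the study used KernelSHAP from the SHAP package, which approximates these values, whereas exact enumeration is only practical here because there are five features. The linear toy model, the data, and the function name `exact_shapley` are assumptions.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(predict, x, background):
    """Exact Shapley values for one sample x.

    Features outside a coalition are filled in from the background rows
    and the model output is averaged over those rows.
    """
    d = x.shape[0]

    def value(subset):
        Z = background.copy()
        cols = list(subset)
        if cols:
            Z[:, cols] = x[cols]  # fix coalition features at the sample's values
        return float(np.mean(predict(Z)))

    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            weight = factorial(r) * factorial(d - r - 1) / factorial(d)
            for subset in combinations(others, r):
                phi[j] += weight * (value(subset + (j,)) - value(subset))
    return phi

# Toy linear "model" over five features (NOT a fitted study classifier).
rng = np.random.default_rng(0)
weights = np.array([0.5, -1.0, 2.0, 0.0, 1.5])

def predict(Z):
    return Z @ weights

background = rng.normal(size=(50, 5))
samples = rng.normal(size=(8, 5))

phi_all = np.array([exact_shapley(predict, s, background) for s in samples])
summary = np.median(np.abs(phi_all), axis=0)  # per-feature median of |Shapley|
print(summary)
```

For a linear model each value reduces to the weight times the feature's deviation from the background mean, which gives a simple correctness check; repeating the per-feature summary over many random seeds and taking its median and standard deviation would mirror the aggregation described in the text.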
Data availability
The dataset used for this study is available from the corresponding author upon reasonable request. All codes and files associated with this study are deposited and publicly available in the GitHub repository: https://github.com/petadimensionlab/HTLV1-machine-learning.
References
Poiesz, B. J. et al. Detection and isolation of type C retrovirus particles from fresh and cultured lymphocytes of a patient with cutaneous T-cell lymphoma. Proc. Natl. Acad. Sci. U S A. 77 (12), 7415–7419 (1980).
Nagasaka, M. et al. Mortality and risk of progression to adult T cell leukemia/lymphoma in HTLV-1-associated myelopathy/tropical spastic paraparesis. Proc. Natl. Acad. Sci. U S A. 117 (21), 11685–11691 (2020).
Saito, M. Neuroimmunological aspects of human T cell leukemia virus type 1-associated myelopathy/tropical spastic paraparesis. J. Neurovirol. 20, 164–174 (2014).
Iwanaga, M. Epidemiology of HTLV-1 infection and ATL in Japan: an update. Front. Microbiol. 11, 1124 (2020).
Martin, F., Fedina, A., Youshya, S. & Taylor, G. P. A 15-year prospective longitudinal study of disease progression in patients with HTLV-1 associated myelopathy in the UK. J. Neurol. Neurosurg. Psychiatry. 81 (12), 1336–1340 (2010).
Cook, L. B. et al. HTLV-1: Persistence and pathogenesis. Virology 435 (1), 131–140 (2013).
Mortreux, F., Gabet, A. S. & Wattel, E. Molecular and cellular aspects of HTLV-1 associated leukemogenesis in vivo. Leukemia 17, 26–38 (2003).
Gonçalves, D. et al. Epidemiology, treatment, and prevention of human T-cell leukemia virus type 1-associated diseases. Clin. Microbiol. Rev. 23 (3), 577–589 (2010).
Matsuura, E. et al. HTLV-1 associated myelopathy/tropical spastic paraparesis (HAM/TSP): A comparative study to identify factors that influence disease progression. J. Neurol. Sci. 371, 112–116 (2016).
Yamano, Y. & Sato, T. Clinical pathophysiology of human T-lymphotropic virus-type 1-associated myelopathy/tropical spastic paraparesis. Front. Microbiol. 3, 389 (2012).
Enose-Akahata, Y., Abrams, A., Johnson, K. R., Maloney, E. M. & Jacobson, S. Quantitative differences in HTLV-I antibody responses: classification and relative risk assessment for asymptomatic carriers and ATL and HAM/TSP patients from Jamaica. Blood 119 (12), 2829–2836 (2012).
Zhang, L. et al. Human T-cell lymphotropic virus type 1 and its oncogenesis. Acta Pharmacol. Sin. 38, 1093–1103 (2017).
Brites, C., Grassi, M. F., Quaresma, J. A. S., Ishak, R. & Vallinoto, A. C. R. Pathogenesis of HTLV-1 infection and progression biomarkers: an overview. Braz J. Infect. Dis. 25 (3), 101594 (2021).
Eusebio-Ponce, E., Anguita, E., Paulino-Ramirez, R. & Candel, F. J. HTLV-1 infection: an emerging risk. Pathogenesis, epidemiology, diagnosis and associated diseases. Rev. Esp. Quimioter. 32 (6), 485–496 (2019).
Iwanaga, M. et al. Human T-cell leukemia virus type I (HTLV-1) proviral load and disease progression in asymptomatic HTLV-1 carriers: a nationwide prospective study in Japan. Blood 116 (8), 1211–1219 (2010).
Matsuzaki, T. et al. HTLV-1 proviral load correlates with progression of motor disability in HAM/TSP: analysis of 239 HAM/TSP patients including 64 patients followed up for 10 years. J. Neurovirol. 7, 228–234 (2001).
Kwaan, N. et al. Long-term variations in human T lymphotropic virus (HTLV)–I and HTLV-II proviral loads and association with clinical data. J. Infect. Dis. 194 (11), 1557–1564 (2006).
Okayama, A. et al. Role of HTLV-1 proviral DNA load and clonality in the development of adult T-cell leukemia/lymphoma in asymptomatic carriers. Int. J. Cancer. 110 (4) (2004).
Harding, D. et al. Refining the risk of HTLV-1-associated myelopathy in people living with HTLV-1: Identification of a HAM-like phenotype in a proportion of asymptomatic carriers. J. Neurovirol. 28 (4–6) (2022).
Burbelo, P. D. et al. Anti-HTLV antibody profiling reveals an antibody signature for HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). Retrovirology 5, 96 (2008).
Akimoto, M. et al. Anti-HTLV-1 tax antibody and tax-specific cytotoxic T lymphocyte are associated with a reduction in HTLV-1 proviral load in asymptomatic carriers. J. Med. Virol. 79 (7), 977–986 (2007).
Yamada, A. et al. Anti-HTLV-1 immunity combined with proviral load as predictive biomarkers for adult T-cell leukemia-lymphoma. Cancer Sci. 115 (1), 310–320 (2024).
Enose-Akahata, Y. et al. Humoral immune response to HTLV-1 basic leucine zipper factor (HBZ) in HTLV-1-infected individuals. Retrovirology 10, 19 (2013).
Esteva, A. et al. Deep learning-enabled medical computer vision. Npj Digit. Med. 4, 5 (2021).
Bock, C. et al. Enhancing the diagnosis of functionally relevant coronary artery disease with machine learning. Nat. Commun. 15, 5034 (2024).
Dietterich, T. G. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Vol. 2396. 15–30 (2002).
Krumholz, H. M. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 33, 1163–1170 (2014).
Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
O’Brien, R. M. A caution regarding rules of thumb for variance inflation factors. Qual. Quant. 41 (5), 673–690 (2007).
Akbarin, M. M. et al. Comparison of HTLV-I proviral load in adult T cell leukemia/lymphoma (ATL), HTLV-I-associated myelopathy (HAM-TSP) and healthy carriers. Iran. J. Basic. Med. Sci. 16 (3), 208–212 (2013).
Grassi, M. F. R. et al. Human T cell lymphotropic virus type 1 (HTLV-1) proviral load of HTLV-associated myelopathy/tropical spastic paraparesis (HAM/TSP) patients according to new diagnostic criteria of HAM/TSP. J. Med. Virol. 83, 1269–1274. https://doi.org/10.1002/jmv (2011).
Igakura, T. et al. Spread of HTLV-I between lymphocytes by virus-induced polarization of the cytoskeleton. Science 299 (5613), 1713–1716 (2003).
Delamarre, L., Rosenberg, A. R., Pique, C., Pham, D. & Dokhélar, M. C. A novel human T-leukemia virus type 1 cell-to-cell transmission assay permits definition of SU glycoprotein amino acids important for infectivity. J. Virol. 71 (1), 259–266 (1997).
Derse, D., Hill, S. A., Lloyd, P. A., Chung, H. K. & Morse, B. A. Examining human T-lymphotropic virus type 1 infection and replication by cell-free infection with recombinant virus vectors. J. Virol. 75 (18), 8461–8468 (2001).
Santana, C. S. et al. Advances in preventive vaccine development against HTLV-1 infection: A systematic review of the last 35 years. Front. Immunol. 14, 1073779 (2023).
Nakamura, M. et al. Elevated antibodies to synthetic peptides of HTLV-1 envelope transmembrane glycoproteins in patients with HAM/TSP. J. Neuroimmunol. 35 (1–3), 167–177 (1991).
Sagara, Y. et al. Novel biomarker of HTLV-1-associated disease: specific appearance of antibody recognizing the receptor-binding site on HTLV-1 envelope protein. Cancer Sci. 95 (10), 835–839 (2004).
Enose-Akahata, Y., Vellucci, A. & Jacobson, S. Role of HTLV-1 tax and HBZ in the pathogenesis of HAM/TSP. Front. Microbiol. 8, 2563 (2017).
Le Blanc, I. et al. Intracellular distribution of human T-cell leukemia virus type 1 gag proteins is independent of interaction with intracellular membranes. J. Virol. 76 (2002).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 (Oct), 2825–2830 (2011). https://scikit-learn.org/
Bangham, C. R. M. HTLV-1 persistence and the oncogenesis of adult T-cell leukemia/lymphoma. Blood 141 (19), 2299–2306 (2023).
Yamauchi, J. et al. An update on human T-cell leukemia virus type I (HTLV-1)-associated myelopathy/tropical spastic paraparesis (HAM/TSP) focusing on clinical and laboratory biomarkers. Pharmacol. Ther. 218, 107669 (2021).
Penova, M. et al. Genome wide association study of HTLV-1–associated myelopathy/tropical spastic paraparesis in the Japanese population. Proc. Natl. Acad. Sci. 118 (11), e2004199118 (2021).
Liu, F. T., Ting, K. M. & Zhou, Z.-H. Isolation forest. In 8th IEEE Int. Conf. Data Min. https://ieeexplore.ieee.org/document/4781136/ (2008).
Chandola, V., Banerjee, A. & Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR). 41 (3), 1–58 (2009).
Enose-Akahata, Y. & Jacobson, S. Immunovirological markers in HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP). Retrovirology 16, 35 (2019).
Fernández-Delgado, M. et al. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (1), 3133–3181 (2014).
Saito, T. & Rehmsmeier, M. The Precision-Recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE. 10 (3), e0118432 (2015).
Jones, E. et al. SciPy: Open Source Scientific Computing Tools for Python. https://scipy.org/ (2001).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 17, 261–272 (2020).
Shapley, L. S. A Value for n-Person Games. 307–317 (1953).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. Vol. 30. Preprint. https://arxiv.org/pdf/1705.07874 (2017).
Acknowledgements
This work was supported by the Japan Agency for Medical Research and Development, Grant/Award Numbers 21fk0108088h0003 and 24fk0108629h0003 (to J.Y., M.M., and S.N.) and 22gm1710004h0001; Japan Science and Technology Agency (JST) Grant Number JPMJCR23J4; JST Moonshot R&D Grant Numbers JPMJMS2021 and JPMJMS2024-9 (to S.N.); and the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.
Author information
Contributions
M-I.R., J.S., A.M., and S.N. analyzed the data, wrote the main manuscript text, and prepared all figures. A.Y., T.W., M.I., K-R.K., T.S., M.M., and J.Y. collected data. J.S., A.M., and S.N. designed the research. All authors reviewed the manuscript. M-I.R. and J.S. contributed equally to this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Rashid, M.I., Sunagawa, J., Matsuki, A. et al. Machine learning based characterization of high risk carriers of HTLV-1-associated myelopathy (HAM). Sci Rep 15, 25111 (2025). https://doi.org/10.1038/s41598-025-09635-2
DOI: https://doi.org/10.1038/s41598-025-09635-2