Abstract
NASA Johnson Space Center has collected more than 54,000 space hardware failure reports. Identifying engineering process trends or performing root cause analysis by manual inspection is impractical at this scale. Fortunately, modern data science tools from Machine Learning and Natural Language Processing (NLP) can be utilized to perform text mining and knowledge extraction. In NLP, the use of taxonomies (classification trees) is key to structuring text data, extracting knowledge and important concepts from documents, and facilitating the identification of correlations and trends within the data set. Usually, these taxonomies and text structures live in the heads of experts in their specific fields. However, when an expert is not available, when taxonomies and ontologies cannot be found in databases, or when the field of study is too broad, an automated approach can provide structure to the text content of a record set. In this paper, an automated taxonomical model is presented that combines Latent Dirichlet Allocation (LDA) with Bidirectional Encoder Representations from Transformers (BERT). Additionally, the limitations and outcomes of causal relationship rule mining models, commercial tools, and deep neural networks are discussed.
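The LDA-BERT combination described above can be sketched in a few lines. The sketch below is illustrative only, not the authors' implementation: the four-document failure-report corpus is invented, and TF-IDF vectors stand in for BERT sentence embeddings so that the example runs without a pretrained model (a real pipeline would substitute embeddings from a BERT encoder such as one from the sentence-transformers library). The idea is to concatenate corpus-level LDA topic proportions with dense per-document vectors and cluster the joint representation into taxonomy nodes.

```python
# Illustrative sketch of an LDA + embedding taxonomy pipeline.
# Assumptions: toy corpus; TF-IDF stands in for BERT embeddings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "valve leak during pressure test",
    "seal leak caused pressure drop",
    "wiring short in avionics harness",
    "harness insulation failure shorted circuit",
]

# 1) LDA topic proportions capture corpus-level themes.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_probs = lda.fit_transform(counts)  # rows are per-document topic mixtures

# 2) Dense document vectors capture local lexical semantics.
#    (TF-IDF here only so the sketch runs; swap in BERT embeddings.)
embeddings = TfidfVectorizer().fit_transform(docs).toarray()

# 3) Concatenate both views and cluster into candidate taxonomy nodes.
features = np.hstack([topic_probs, embeddings])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```

In a full pipeline the cluster labels would seed taxonomy branches, with representative terms per cluster (e.g. via class-based TF-IDF) serving as node names.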
Data availability
The datasets generated and/or analysed during the current study are not publicly available because they contain sensitive documents describing NASA's engineering processes, but they are available from the corresponding author on reasonable request.
Acknowledgements
The authors declare that the work conducted on this project was in support of NASA-internal business practices to understand the effectiveness of standard flight hardware processes. Special thanks to the Langley Research Center Data Science Team: Charles A. Liles for GCP guidance and Jam Session organization; Theodore D. Sidehamer for IBM Watson Explorer support, demo, and access; Hari S. Ilangovan for NLP INDRA-EIDOS discussions and resources. Thanks to the Johnson Space Center: (SA) Ram Pisipati and Robert J. Reynolds for early NLP guidance; (EA IT team) Jacci Bloom, Remyi Cole, Michael Patterson, and Jeffrey Myerson for providing software access and troubleshooting support; (EX Intern) Dianeliz Ortiz Martes for giving Power BI tutorials; (EX Interns) Heriberto Triana, Emanuel Sanchez, Jacquelyne Black, Nathan Berg, Sarah Smith, Rishi K. Chitturi, (GSFC Intern) Alexandra Carpenter, and others for helping me navigate my NASA experience; David Kelldorf and Martin Garcia for early GCP discussions; (Intern Coordinators) Hiba Akram, Jennifer Becerra, Annalise Giuliani, and Rosie Patterson. Additional thanks to the Marshall Space Flight Center: Trevor Gevers, Micheal Steele, Adam Gorski, James Lane, and Frank S. King III for AWS Comprehend guidance and access. Also thanks to Ames Research Center/Arizona State University: Dr. Yongming Liu, Dr. Yan, and Xinyu Zhao for providing useful resources to study BERT. Thanks to David C. Smith, Samantha N. Bianco, and Aref F. Malek for LDA-BERT improvement suggestions from the NASA community GCP AI ML agency presentation. Finally, thanks to the Goddard Space Flight Center, NASA Center for Climate Simulation support: Ellen M. Salmon and Mark L. Carroll for granting a Virtual Machine with a Linux environment to test models.
Funding
NASA's Office of STEM Engagement, Minority University Research and Education Project (MUREP).
Author information
Authors and Affiliations
Contributions
T.H. - Project Conceptualization, Data Curation, Funding, Supervision, Project Administration, General Resources. D.P. - Project Formulation, Formal Analysis, Investigation, Methodology, Visualizations, and Writing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Palacios, D., Hill, T.R. Taxonomical modeling and classification in space hardware failure reporting. Sci Rep (2026). https://doi.org/10.1038/s41598-026-36813-7


