Abstract
Large, freely available, well-maintained data sets have made astronomy a popular playground for machine learning (ML) projects. Nevertheless, the robustness of the insights gained for both ML and physics could be improved by clearer problem definitions and by workflows that critically verify, characterize and calibrate ML models. We provide a collection of guidelines for setting up ML projects that are less time-consuming and resource-intensive and more likely to lead to robust and useful scientific insights. We draw examples and experience from astronomy, but the advice is potentially applicable to other areas of science.
Acknowledgements
The thoughts laid out here have been heavily influenced by conversations with co-workers and by discussions at astronomy machine learning conferences such as EAS2022 in Valencia and ML-IAP2021 in Paris.
Author information
Contributions
The article was conceived, structured and initially written by J.B. S.F. contributed to the writing in all aspects of the article.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Reviews Physics thanks Viviana Acquaviva, Fernanda Psihas and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
About this article
Cite this article
Buchner, J., Fotopoulou, S. How to set up your first machine learning project in astronomy. Nat Rev Phys 6, 535–545 (2024). https://doi.org/10.1038/s42254-024-00743-y
DOI: https://doi.org/10.1038/s42254-024-00743-y


