  • Expert Recommendation

How to set up your first machine learning project in astronomy

Abstract

Large, freely available, well-maintained data sets have made astronomy a popular playground for machine learning (ML) projects. Nevertheless, the robustness of the insights gained for both ML and physics could be improved by clearer problem definitions and by workflows that critically verify, characterize and calibrate ML models. We provide a collection of guidelines for setting up ML projects that are less time-consuming and resource-intensive and more likely to lead to robust and useful scientific insights. We draw examples and experience from astronomy, but the advice is potentially applicable to other areas of science.

Fig. 1: The relation among false positives, false negatives and the receiver operating characteristic curve.
Fig. 2: Uncertainty verification.
Fig. 3: Impact of the training sample distribution.
Fig. 4: Illustration of covariate shift for a sample detected by the X-ray space telescope eROSITA (ref. 52).
Fig. 5: Typical workflow in introductory machine learning tutorials and data challenges.
Fig. 6: A workflow for predicting with trustworthy uncertainties.

References

  1. Storrie-Lombardi, M. C., Lahav, O., Sodre, L. Jr. & Storrie-Lombardi, L. J. Morphological classification of galaxies by artificial neural networks. Mon. Not. R. Astron. Soc. 259, 8P (1992).

  2. Naim, A., Ratnatunga, K. U. & Griffiths, R. E. Galaxy morphology without classification: self-organizing maps. Astrophys. J. Suppl. S. 111, 357–367 (1997).

  3. du Buisson, L., Sivanandam, N., Bassett, B. A. & Smith, M. Machine learning classification of SDSS transient survey images. Mon. Not. R. Astron. Soc. 454, 2026–2038 (2015).

  4. Burke, C. J. et al. Deblending and classifying astronomical sources with mask R-CNN deep learning. Mon. Not. R. Astron. Soc. 490, 3952–3965 (2019).

  5. Sedaghat, N., Smart, B. M., Kalmbach, J. B., Howard, E. L. & Amindavar, H. Stellar Karaoke: deep blind separation of terrestrial atmospheric effects out of stellar spectra by velocity whitening. Mon. Not. R. Astron. Soc. 526, 1559–1572 (2023).

  6. Shearer, C. The CRISP-DM model: the new blueprint for data mining. J. Data Warehous. 5, 13–22 (2000).

  7. Saltz, J. S. The need for new processes, methodologies and tools to support big data teams and improve big data project effectiveness. In 2015 IEEE Int. Conf. Big Data (Big Data), 2066–2071 (IEEE, 2015).

  8. Martinez, I., Viles, E. & Olaizola, I. G. Data science methodologies: current challenges and future approaches. Big Data Res. 24, 100183 (2021).

  9. Artrith, N. et al. Best practices in machine learning for chemistry. Nat. Chem. 13, 505–508 (2021).

  10. Garofalo, M., Botta, A. & Ventre, G. Astrophysics and big data: challenges, methods, and tools. Proc. Int. Astron. Union 12, 345–348 (2016).

  11. Zhang, Y. & Zhao, Y. Astronomy in the big data era. Data Sci. J. 14, 11 (2015).

  12. Lahav, O. Deep machine learning in cosmology: evolution or revolution? Preprint at https://arxiv.org/abs/2302.04324 (2023).

  13. Borne, K. D. in Next Generation of Data Mining (eds Kargupta, H. et al.) Ch. 5 (CRC Press, 2008).

  14. Djorgovski, S. G., Mahabal, A. A., Graham, M. J., Polsterer, K. & Krone-Martins, A. in Artificial Intelligence For Science: A Deep Learning Revolution (eds Choudhary, A. et al.) 81–94 (World Scientific, 2023).

  15. Fluke, C. J. & Jacobs, C. Surveying the reach and maturity of machine learning and artificial intelligence in astronomy. WIREs Data Min. Knowl. 10, e1349 (2020).

  16. Ivezić, Ž., Connolly, A. J., VanderPlas, J. T. & Gray, A. Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data, Updated Edition (Princeton Univ. Press, 2019).

  17. Baron, D. Machine learning in astronomy: a practical overview. Preprint at https://arxiv.org/abs/1904.07248 (2019).

  18. Hackeling, G. Mastering Machine Learning with scikit-learn (Packt, 2017).

  19. Graham, M., Drake, A., Djorgovski, S. G., Mahabal, A. & Donalek, C. Challenges in the automated classification of variable stars in large databases. EPJ Web Conf. 152, 03001 (2017).

  20. Yang, H. et al. Data mining techniques on astronomical spectra data — II. Classification analysis. Mon. Not. R. Astron. Soc. 518, 5904–5928 (2023).

  21. Settles, B. Active Learning Literature Survey. Report No. 1648 (University of Wisconsin–Madison Department of Computer Sciences, 2009).

  22. Lochner, M. & Bassett, B. A. ASTRONOMALY: personalised active anomaly detection in astronomical data. Astron. Comput. 36, 100481 (2021).

  23. Fotopoulou, S. A review of unsupervised learning in astronomy. Astron. Comput. 48, 100851 (2024).

  24. Yang, H. et al. Data mining techniques on astronomical spectra data — I. Clustering analysis. Mon. Not. R. Astron. Soc. 517, 5496–5523 (2022).

  25. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 141–142 (2012).

  26. James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning Vol. 112 (Springer, 2013).

  27. Doran, G. T. et al. There’s a S.M.A.R.T. way to write management’s goals and objectives. Manage. Rev. 70, 35–36 (1981).

  28. Bausell, R. B. & Li, Y.-F. Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences (Cambridge Univ. Press, 2002).

  29. Minkowski, R. Spectra of supernovae. Publ. Astron. Soc. Pac. 53, 224 (1941).

  30. Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms (Poster). In Proc. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Ghani, R. et al.) 847–855 (Association for Computing Machinery, 2013).

  31. Erickson, N. et al. AutoGluon-Tabular: robust and accurate AutoML for structured data. Preprint at https://arxiv.org/abs/2003.06505 (2020).

  32. Lieu, M. et al. Deep learning of astronomical features with big data. In Astronomical Data Analysis Software and Systems XXVII (eds Teuben, P. J. et al.) Vol. 523 (Astronomical Society of the Pacific, 2019).

  33. Molnar, C. Interpretable Machine Learning 2nd edn (2022).

  34. Rudin, C. et al. Interpretable machine learning: fundamental principles and 10 grand challenges. Stat. Surv. 16, 1–85 (2022).

  35. Netflix recommendations: beyond the 5 stars (part 1). Netflix Technology Blog https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429 (2012).

  36. Hamill, T. M. Interpretation of rank histograms for verifying ensemble forecasts. Mon. Weather Rev. 129, 550 (2001).

  37. Ghosh, A. et al. GaMPEN: a machine-learning framework for estimating Bayesian posteriors of galaxy morphological parameters. Astrophys. J. 935, 138 (2022).

  38. Rosenbaum, P. R. & Rubin, D. B. Reducing bias in observational studies using subclassification on the propensity score. J. Am. Stat. Assoc. 79, 516–524 (1984).

  39. Revsbech, E. A., Trotta, R. & van Dyk, D. A. STACCATO: a novel solution to supernova photometric classification with biased training sets. Mon. Not. R. Astron. Soc. 473, 3969–3986 (2018).

  40. Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 2096–2030 (2016).

  41. Perdue, G. N. et al. Reducing model bias in a deep learning classifier using domain adversarial neural networks in the MINERvA experiment. J. Instrum. 13, P11020 (2018).

  42. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data https://doi.org/10.1145/2382577.2382579 (2012).

  43. Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4, 100804 (2023).

  44. Springel, V. Smoothed particle hydrodynamics in astrophysics. Annu. Rev. Astron. Astrophys. 48, 391–430 (2010).

  45. Hopkins, P. F. A new class of accurate, mesh-free hydrodynamic simulation methods. Mon. Not. R. Astron. Soc. 450, 53–110 (2015).

  46. Zine, K. & Salim, S. Systematics in the spectral energy distribution fitting parameter estimation of composite galaxies. Astrophys. J. 929, 91 (2022).

  47. Carleo, G. et al. Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002 (2019).

  48. Zhang, Y., Tiňo, P., Leonardis, A. & Tang, K. A survey on neural network interpretability. IEEE Trans. Emerg. Top. Comput. Intell. 5, 726–742 (2021).

  49. Fan, F.-L., Xiong, J., Li, M. & Wang, G. On interpretability of artificial neural networks: a survey. IEEE Trans. Radiat. Plasma Med. Sci. 5, 741–760 (2021).

  50. Goodfellow, I., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. Preprint at http://arxiv.org/abs/1412.6572 (2015).

  51. Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).

  52. Salvato, M. et al. The eROSITA Final Equatorial-Depth Survey (eFEDS). Identification and characterization of the counterparts to point-like sources. Astron. Astrophys. 661, A3 (2022).

  53. Mandt, S., Hoffman, M. D. & Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18, 1–35 (2017).

Acknowledgements

The thoughts laid out here have been heavily influenced by conversations with co-workers and discussions in astronomy machine learning conferences such as EAS2022 in Valencia and ML-IAP2021 in Paris.

Author information

Contributions

The article was conceived, structured and initially written by J.B. S.F. contributed to the writing in all aspects of the article.

Corresponding author

Correspondence to Johannes Buchner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Physics thanks Viviana Acquaviva, Fernanda Psihas and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Buchner, J., Fotopoulou, S. How to set up your first machine learning project in astronomy. Nat Rev Phys 6, 535–545 (2024). https://doi.org/10.1038/s42254-024-00743-y

  • DOI: https://doi.org/10.1038/s42254-024-00743-y
