Introduction

Orphan crops, also known as “minor”, “neglected”, “underutilised”, and “understudied” crops, are frequently grown in developing countries and are an important source of nutrition for local communities (Table 1)1,2,3,4,5,6. Many of these crops have not benefited from the Green Revolution that improved the productivity of major crops such as wheat and rice. Modern crop improvement techniques, such as marker-assisted breeding (MAB) and genome editing, have not been widely applied to orphan crops due to the lack of resources available. However, genomic technologies offer significant potential for orphan crop improvement. Major crops such as wheat, rice, maize and soybean are widely distributed and produced on an industrial scale, whereas orphan crops vary considerably in their production, from the reasonably wide distribution of sorghum to crops that are only produced in specific regions, such as Ensete. Many orphan crops are tolerant of abiotic and biotic stressors and can be produced in marginal and harsh environments, possessing traits that, if understood, may be transferable to major crops.

Table 1 Production of representative orphan crops

There are currently relatively few genomic resources for orphan crops, though initiatives such as the African Orphan Crops Consortium (AOCC)7 and Crops for the Future (CFF)8 are working towards their improvement. However, there is still much work needed to translate these resources to improve crops. There are orphan crop representatives across major crop types and the relatedness between orphan and major crops can be used to improve crop breeding efforts in both orphan and major crops. Major crops have a wide range of resources available, and this can be transferred to related orphan crop species. Orthologs of agronomically important genes from major crops have been found in orphan crops for traits such as stress tolerance9, and conservation of gene function between major and orphan crops can support their improvement. Similarly, knowledge of novel beneficial genes in orphan crops could be used to enhance traits in the major crops. A major challenge in crop improvement is the continued growth of data. Machine learning based methods are starting to be applied for crop improvement (Fig. 1) and these will have direct applications for orphan crop improvement as well as the translation of knowledge from major crops to orphan crops.

Fig. 1: The proportion of articles available on Europe PMC for the search terms ‘machine learning crops’ and ‘orphan crops’ between 2008 and 2024.

The proportion of articles with the search terms ‘machine learning crops’ (orange) and ‘orphan crops’ (blue) is shown on the Y-axis and the year the papers were published is shown on the X-axis.

In the past 100 years, large gains in crop yield have been made possible by the introduction of statistical methods into plant breeding. R. A. Fisher pioneered statistical methods such as ANOVA and randomised control trials in plant breeding10,11. Since then, statistical approaches have been at the core of plant breeding leading to unprecedented increases in crop yields. These methods include RR (Ridge Regression), BLUP (Best Linear Unbiased Prediction) and its variants such as GBLUP (Genomic Best Linear Unbiased Prediction), all of which fall under the broader category of genomic selection. However, yields are not keeping pace with a growing population and the threat of climate change. To ensure sufficient food production for a warmer world, modern approaches such as CRISPR genome editing and machine learning are needed. Machine learning (ML) is a set of methods that uses large amounts of data to approximate mathematical functions. Deep learning (DL), a subset of ML, utilises deep layers of artificial neural networks to “learn” mathematical functions from training data. ML’s ability to identify complex patterns within large and diverse datasets, from images and genomics to tabular data, makes it a powerful tool for improving trait prediction accuracy and crop breeding efficiency12,13,14,15 (Fig. 2).

Fig. 2: Machine learning models can analyse genome sequencing and related datasets to generate various predictions.

Data inputs are shown on the left, including “genome sequencing” at multiple depths from whole-genome sequencing to exome and SNP sequencing. Genomic datasets can be combined with “+ phenotype observations” collected manually or through a wide range of sensors. The “+ complementary ‘OMIC sequencing” refers to transcriptomics, metabolomics, proteomics and other ‘OMIC datasets that can be integrated into the machine learning model to enrich the dataset information. The potential prediction tasks for each input data type are colour-coded on the right.

The success of ML has been facilitated by an explosion of available data, driven by the ever-decreasing costs of genome sequencing ($200 per human genome in sequencing costs only). Other drivers are the increased availability of compute power in the form of accessible and machine learning-specialised GPUs, high performance computing centres, and accessible cloud computing, leading to ML becoming established as a group of tools in genomics and crop improvement.

One area where ML is having an impact in crops is marker-assisted breeding (MAB)16,17, where ML can be used to link phenotypes of agronomic interest with molecular genetic markers so that they can be applied to accelerate breeding. Another is yield prediction, where several studies have evaluated the accuracy of different machine learning architectures across different datasets to predict crop yield18. When ML is combined with CRISPR genome editing19, it can be used both to identify potential favourable modifications and design accurate guide RNAs (sgRNAs) with few off-target effects20.

The knowledge gained by applying these methods to major crops will also assist in the improvement of orphan crops, and vice versa. In this Review, we will discuss the potential of using ML for orphan crop improvement. We will highlight how ML can improve the knowledge available for orphan crops, find similarities between major and orphan crops, and transfer knowledge from major crops to orphan crops.

Machine learning applications for crop improvement

Machine learning has been extensively applied in crop improvement, with hundreds of publications ranging from identifying markers for MAB to using image recognition for accurate phenotyping and disease resistance recognition (Table 2)19,20,21,22,23,24,25,26,27,28,29,30,31,32. Predicting phenotypes has been one of the main applications of machine learning. One of the earliest examples of yield prediction using machine learning is from 2008, where bread wheat field measurements were used in a simple artificial neural network to predict yield over seasons33. Later studies used different variables to predict yield using ML, such as transplanting parameters in rice34, irrigation and evaporation parameters in sugarcane35, or soil and irrigation data in wheat, barley, and canola across years and locations36. All these examples use environmental data, but do not include information about the genetic composition of crops.

Table 2 Different applications of machine learning in crops

Some studies have used genetic data alone to predict yield directly. One of the earliest examples is DeepGS, a convolutional neural network (CNN) that predicts phenotypes from genotype data, complementing the widely used RR-BLUP26. Other DL architectures have been used successfully to predict mixed phenotypes (binary, ordinal, continuous) from genotypes in bread wheat27 as well as phenotypes while incorporating data from multiple environments37. However, benchmarks reveal that DL on its own usually performed similarly to traditional genomic selection approaches, with ensemble-based approaches that combine several models showing the highest prediction accuracy38. A similar benchmark revealed that in soybean, tree-based machine learning approaches such as XGBoost and Random Forests outperformed deep learning-based approaches for 13 out of 14 phenotypes25, indicating that DL may not be the best machine learning approach for plant phenotype prediction.

Genomic data has been successfully combined with environmental data to improve prediction accuracy. Kick et al.39 utilised genetic data, environmental measurements, and recorded management interventions to predict maize yields, finding that DL models performed similarly to, but with greater consistency than, BLUP models. Måløy et al.40 evaluated the then-novel Performer deep learning architecture using SNPs and environmental data to predict barley yield across locations and years, outperforming other DL architectures and Bayesian approaches. Li et al.41 assessed the accuracy of transfer learning by pre-training DL models using genomic and non-yield phenotypic data in maize, rice, and wheat. The pre-trained layers were then fine-tuned for yield prediction tasks, outperforming established DL and RR BLUP approaches. Image-based phenotyping or drone data is commonly used in conjunction with genetic data to predict yield. In maize, Danilevicz et al.42 combined multispectral imagery with genotyping data to identify high-performing varieties in the field. Later research focuses on multimodal models, as integrating multiple data types has generally shown superior performance compared to single-modality models43.

Once phenotype prediction accuracy has been established, ML can be employed to identify quantitative trait loci (QTLs) or genes underlying traits of interest. An early example in rice used markers identified by genome-wide association studies (GWAS), together with several approaches ranging from RR-BLUP to Random Forest, to predict yield, and showed that these methods outperformed established pedigree-based approaches29. In soybean, predicting yield from genotypic data using XGBoost led to the identification of SNPs linked with prediction accuracy, and these SNPs overlapped with known markers previously linked with yield25. Liu et al.28 trained a convolutional neural network to predict yield based on soybean SNPs, and then drew saliency maps to identify genomic regions with the strongest impact on phenotype prediction. All identified regions overlapped with GWAS-identified SNPs. Another approach is PlantMine, which identified SNPs associated with prediction accuracy using XGBoost, and then used these ‘core’ SNPs to reduce noise in genomic prediction algorithms44. One interesting approach identified nitrogen-use efficiency genes using RNA-seq, and then ranked these genes using an XGBoost model trained on expression levels to identify candidate genes and transcription factors. These genes were functionally validated and are now available for further breeding for nitrogen-use efficiency in maize45. Machine learning is now at the core of crop breeding in companies, leading to improved breeding pipelines and reduced cost, for example through the application of an AI assistant that helps breeders select the best breeding candidates46.

For ML to have a significant practical impact on plant breeding, training programs are essential. ML practitioners in plant breeding operate at the intersection of bioinformatics, plant biology, and breeding. They require a unique combination of skills and experience, including computational abilities, domain expertise, and proficiency in experimental design. Similar recommendations have been made previously in the field of plant breeding46. However, training opportunities for this specific skill set are currently limited.

Machine learning applications for orphan crops

There are two main approaches for crop trait prediction using ML, image-based ML models and genomics-based ML models, with some studies combining these in an ensemble approach12,14. While genomic and image data is increasingly abundant for the major crops, it is rarely available for orphan crops. Publicly available orphan crop data sources include online resource for community annotation of eukaryotes (ORCAE)47, a metabolomics database for roots, tubers, and bananas48, and a collection of 26 transcriptomes for orphan crops and their wild relatives49. ORCAE is a database for the genomes and annotations of the orphan crops assembled by AOCC47, and the genomes available through ORCAE could be used for the construction of genomics-based ML models for orphan crop trait prediction where suitable phenotype data is available. These could be complemented by intermediate phenotypes, for example transcriptomic or metabolomic data48,49.

There are currently no public databases hosting orphan crop images that could be used for image-based ML models. However, two studies have applied ML to orphan crops for trait prediction using the limited data available50,51. Nazari and colleagues52 developed a DL model, a type of ML model, to predict the quality traits of protein, tannin, and total phenolic content (TPC) in sorghum. Determining chemical content through conventional laboratory tests is expensive and time-consuming, so Nazari et al.52 developed an efficient and cost-effective method to predict chemical composition using images and DL. The grains of ten lines of sorghum were harvested at maturity, and the protein, tannin, and TPC content of 100 g of each line was measured using conventional laboratory tests. The remaining sorghum grains were photographed on a black background with consistent lighting, and the colours within each photograph were analysed to determine texture variables. The protein, tannin, and TPC content and the texture variables for all ten sorghum lines were used as input for a multilayer perceptron (MLP) model for trait prediction. A multilayer perceptron is a type of DL model that, in its simplest form, is made up of three layers: the input layer, a hidden layer, and the output layer. The hidden layer is where the model identifies patterns within the data, and these patterns are then used to predict the output. The model learns through interconnected nodes within each layer that are designed to work in a similar way to the neurons in a human brain. Nazari and colleagues52 found a significant difference in the protein, tannin, and TPC content between sorghum lines, and the content measured in the laboratory and that predicted by the DL model had a correlation of greater than 0.9 for each of these traits. Another study used near-infrared reflectance spectroscopy (NIRS) and DL to predict quality traits in the orphan crop Perilla53.
The DL models had high prediction accuracy, with R² values of 0.83, 0.92, 0.78, and 0.82 for the biochemical traits ash, protein, total soluble sugar and phenol content, respectively. By using NIRS and ML, the authors were able to develop a cost-efficient and accurate method for predicting the nutritional content within Perilla germplasm. As the knowledge available for orphan crops grows, more studies could use ML to efficiently predict traits (Fig. 3). However, the limited quantity of public data highlights the need for establishing and supporting databases of image and genomic data for orphan crops that could be applied for ML-based trait prediction.
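The three-layer structure of a multilayer perceptron described above can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions: the weights, the ReLU activation and the two input values standing in for texture variables are hypothetical and are not taken from the study by Nazari and colleagues52. A real model would learn its weights from measured trait values.

```python
def relu(values):
    """Rectified linear activation (an assumed choice of non-linearity)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus a bias for each node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one hidden layer -> output layer."""
    hidden = relu(dense(features, w_hidden, b_hidden))
    return dense(hidden, w_out, b_out)


# Hypothetical inputs: two image texture variables for one grain sample.
texture_features = [1.0, 2.0]
# Hypothetical weights; in practice these are learned during training.
w_hidden = [[0.5, -1.0], [1.0, 1.0]]
b_hidden = [0.0, 0.0]
w_out = [[2.0, 0.5]]
b_out = [0.1]

predicted_trait = mlp_forward(texture_features, w_hidden, b_hidden, w_out, b_out)
print(predicted_trait)
```

The first hidden node receives a negative weighted sum and is switched off by the activation, so only the second node contributes to the prediction. Training consists of adjusting the weights so that predictions like this one match the laboratory measurements.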

Fig. 3: A workflow of machine learning applications in orphan crops improvement.

Genomic and phenotype data is collected from an orphan crop population called a training population. This genomic and phenotype data is used to train ML models. The trained ML models can then be used for trait prediction in orphan crop populations that only have genotype data. These trait predictions are then used to select individuals for crop breeding programs.
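The workflow in Fig. 3 can be sketched in a few lines of code. The example below is a toy illustration, not a method from any of the cited studies: it uses ridge regression on marker codes as a stand-in for the ML model, with a penalty value chosen arbitrarily, and every genotype, yield value and candidate name is hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ridge(genotypes, phenotypes, lam=1.0):
    """Estimate marker effects by solving (X'X + lam * I) b = X'(y - mean)."""
    n, p = len(genotypes), len(genotypes[0])
    mean = sum(phenotypes) / n
    centred = [y - mean for y in phenotypes]
    xtx = [[sum(g[i] * g[j] for g in genotypes) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(g[i] * c for g, c in zip(genotypes, centred)) for i in range(p)]
    return mean, solve(xtx, xty)


def predict(mean, effects, genotype):
    """Predicted trait value for one individual from its marker codes."""
    return mean + sum(b * g for b, g in zip(effects, genotype))


def select_top(candidates, mean, effects, k):
    """Rank genotype-only candidates by predicted trait and keep the best k."""
    ranked = sorted(candidates,
                    key=lambda name: predict(mean, effects, candidates[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical training population: marker codes (-1, 0, 1) and measured yield.
train_genotypes = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
train_yield = [13.0, 11.0, 9.0, 7.0]
# Hypothetical breeding candidates that only have genotype data.
candidates = {"A": [1, 1], "B": [-1, 1], "C": [1, 0], "D": [0, -1]}

mean, effects = fit_ridge(train_genotypes, train_yield, lam=1.0)
selected = select_top(candidates, mean, effects, k=2)
print(effects, selected)
```

In practice the penalty would be tuned by cross-validation or estimated from variance components, and the model could be replaced by any of the ML approaches discussed above without changing the train, predict and select structure.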

Large language model applications for crop improvement

Large language models (LLMs) are a subsection of machine learning, designed to “understand” language and identify patterns from text54. Recently, LLMs have been increasingly applied to analyse biological sequential data, such as gene expression profiles, genomic DNA sequences and protein sequences. In this context, biological language models approach the DNA or amino acid sequence as text strings, splitting the biological sequences into words and finding the relationship between them55. The application of language models to understand plant biological datasets is not a new concept56,57,58,59, but recent technological advances have enabled more powerful LLM architectures to emerge60,61. The application of LLMs can enrich the reduced genomic resources of orphan crops, leading to a better understanding of the diversity in orphan crop genomes.
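The idea of splitting a biological sequence into “words” can be made concrete with k-mer tokenisation. The sketch below is illustrative only: the function names are hypothetical, the value of k is arbitrary, and real genomic language models each define their own tokeniser and vocabulary.

```python
def kmer_tokens(sequence, k=6, stride=1):
    """Split a DNA string into k-mers.

    stride=1 gives overlapping words; stride=k gives non-overlapping words.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(tokens):
    """Assign an integer id to each distinct token in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(tokens, vocabulary):
    """Turn tokens into the integer ids that a language model would consume."""
    return [vocabulary[token] for token in tokens]


overlapping = kmer_tokens("atgcgt", k=3)
non_overlapping = kmer_tokens("atgcgt", k=3, stride=3)
vocabulary = build_vocabulary(overlapping)
token_ids = encode(overlapping, vocabulary)
print(overlapping, non_overlapping, token_ids)
```

Overlapping words preserve more local context at the cost of longer token sequences, whereas non-overlapping words shorten the input so that longer genomic regions fit into the model.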

The capacity of LLMs to transfer knowledge into new domains is particularly valuable in the context of orphan crops, as they can leverage insights from well-studied species to predict gene functions, identify regulatory elements, and uncover genetic patterns in orphan crops. Nucleotide Transformer is a prime example of a collection of foundational LLMs for predicting gene sequence phenotype and function that can be used for transfer learning. The Nucleotide Transformer models were trained using an extensive genomic sequence database with approximately 3202 human genomes and 850 genomes from diverse phyla, which allowed the models to learn context-specific nucleotide sequences and gain a robust understanding of genomic indicators that could be used to support the annotation of orphan plant genomes62. For example, the chia (Salvia hispanica) genome annotation used transcriptome and orthologous gene models from multiple other species, leading to ~94% of genes being identified according to a BUSCO analysis63. Integrating LLMs into the annotation process could further refine the functional annotation of orphan genomes by identifying the genomic patterns and gene context learned during the LLM training. DNABERT is another foundational LLM that was trained with 135 human genomes for predicting gene function, promoter sites, splice sites and transcription factor binding sites based on DNA sequence58. DNABERT demonstrated a high capacity for transferring learning to other species, effectively detecting transcription factor binding sites in genomes with under 50% non-coding similarity to the human genome58. Since transcription factors regulate gene expression, and their binding sites are often found in non-coding regions at varying distances from target genes, DNABERT’s success in identifying these sites suggests it accurately captures conserved semantic relationships within the DNA sequences.
Several studies have leveraged the DNABERT model to advance plant research. A recent study further trained the DNABERT model to identify long non-coding RNA (lncRNA) in six major plant species64. lncRNAs play an important role in regulating gene expression through interactions with DNA, RNA, and proteins that modulate gene activity, making them valuable targets for crop improvement65. The LLM identified lncRNA sequences from genomic DNA sequences with up to 83% accuracy in target species and a high average accuracy in identifying lncRNA sequences in previously unseen crop species64. Multiple models leveraging these foundational LLMs have been proposed for the prediction of DNA methylation sites in plants due to their importance as gene expression regulators. These LLMs were trained on major plant species and tested on previously unseen plant datasets, showing their capacity to capture the species-specific indicators for methylation sites and an ability to generalise across different species, highlighting the LLMs’ effectiveness in identifying critical regulatory elements in less-studied plant genomes66,67,68.

More recently, a foundational LLM focused on crop genome sequences was released. AgroNT uses a similar structure to DNABERT, but it was trained on 48 crop species genomes, including the orphan crops pigeonpea (Cajanus cajan), cassava (Manihot esculenta) and quinoa (Chenopodium quinoa). The AgroNT model has demonstrated high accuracy in predicting regulatory annotations, promoter/terminator strength, lncRNA prediction and tissue-specific gene expression across species, indicating the model’s versatility and potential uses for identifying sites controlling gene expression in orphan crops69. Being trained exclusively with plant datasets may provide an advantage to AgroNT, as it avoids biases towards genomic structures that are exclusive to other organisms. The foundational LLMs above offer a powerful tool for transferring knowledge from major to orphan crops, as the biological annotations and experimental validation from well-curated plant species can be leveraged to detect gene regulation mechanisms in orphan species.

A major limitation for the genomics-based improvement of orphan crops is the lack of genome references and annotated genomic resources for these species. This has hindered the identification of causal genes associated with valuable crop phenotypes. Pre-trained LLMs could be useful to predict gene function from DNA or RNA sequencing datasets58,69. The estimated gene function output could also be applied for prioritising functional variants identified through genome-wide association studies (GWAS), RNA sequencing and other genomic analyses69. In addition, pre-trained LLMs could be fine-tuned for specific orphan crop prediction using a reduced training dataset, leveraging what the model has learned about molecular relationships to focus on species-specific features. Ultimately, integrating pre-trained LLMs with genomic data and focused fine-tuning could help bridge the gap in understanding and harnessing the unique traits of orphan crops, unlocking their full potential for sustainable agriculture.

Transfer of knowledge between major and orphan crops

The limited knowledge and resources available for orphan crops has slowed their development50,51. However, there are many orphan crops that are closely related to major crops. For example, Solanaceae fruit include tomatoes, a major crop, and ground cherries, an orphan crop70,71. For examples like this, their evolutionary relationship can be used to learn about and improve orphan crops through gene homology. Conservation of orthologs and their functions has been found between orphan crops and related major crops9. These conserved genes allow studies to use genes and knowledge available in major crops to identify candidate genes, edit genomes, and predict traits in orphan crops.

Gene homology with major crops or model species can be used to identify genes associated with a trait of interest. Gene homology with Arabidopsis thaliana, a model species, was used to identify 108 candidate genes for seed mucilage production in chia72. Candidate genes for domestication were identified using gene homology with A. thaliana and rice73. While these studies used model species, the same methods could be applied using major crops. An ML approach for identifying candidate genes from sequences associated with a trait of interest is QTG-Finder2 (ref. 74). QTG-Finder2 is a fast and efficient way to identify candidate genes from quantitative trait loci (QTL). The QTG-Finder2 ML model was trained on orthologs of causal genes from major crops and model plant species. Lin et al.74 hypothesised that the QTG-Finder2 model could be applied to species with few or no known causal genes, due to the conservation of orthologs between species. To test this hypothesis, they applied the QTG-Finder2 ML model in sorghum, an orphan crop, to predict causal genes for plant height. QTG-Finder2 correctly identified true plant height causal genes 70% of the time74. QTG-Finder2 improves the efficiency of identifying candidate genes and can be applied to species with few, if any, known causal genes. Machine learning and gene homology can also be used to predict essential genes in species with little knowledge available. Essential genes are required for the reproductive success of a species and are highly conserved75. If ML can identify essential genes using gene homology, it could be applied to predict other conserved genes.

Genome editing using CRISPR can make changes to DNA to improve a trait. To be able to make changes to DNA, information on the gene sequence is needed. The conservation of orthologs between major and orphan crops can be used to identify targets for genome editing. The mutation of tomato orthologs has improved the fruit size and production of ground cherries through genome editing70,71. Lodging resistance in tef, an orphan crop, has been improved by editing a rice ortholog for semi-dwarfism76. Gene conservation between orphan and major crops can be used to identify candidate genes and design genome editing targets when there is no data available for the gene of interest within the orphan crop. Machine learning can also be used to improve the editing efficiency and specificity of genome editing.
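To illustrate why sequence information is a prerequisite for genome editing, the sketch below scans a gene sequence for candidate Cas9 target sites. It assumes the commonly used SpCas9 system, which needs a 20-nucleotide protospacer followed by an NGG motif, and it searches the forward strand only. The sequence and function name are hypothetical, and no efficiency or off-target scoring is attempted. ML-based tools would be applied after this step to rank the candidates.

```python
import re


def find_cas9_targets(sequence, spacer_length=20):
    """Return (position, protospacer, PAM) for every forward-strand NGG site.

    A zero-width lookahead is used so that overlapping sites are all reported.
    """
    sequence = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_length)
    return [(match.start(), match.group(1), match.group(2))
            for match in pattern.finditer(sequence)]


# Hypothetical fragment of an ortholog sequence from an orphan crop.
fragment = "GATTACAGATTACAGATTACAGGTT"
targets = find_cas9_targets(fragment)
for position, protospacer, pam in targets:
    print(position, protospacer, pam)
```

A fuller pipeline would also scan the reverse complement and check each protospacer against the rest of the genome, which again depends on a reference sequence being available for the orphan crop.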

One way to find gene orthologs that can be used for orphan crop studies is to source them from the literature72,76; however, this information is spread across papers and journals, making it challenging to know the extent of gene homology between major and orphan crops and where to find this data. Databases such as NCBI are a source of protein and nucleotide sequences and gene homology for many species77; however, they do not have information specific to orphan crops. Consolidating all major crop orthologs and their presence in orphan crops into a comprehensive database would aid studies identifying candidate genes and improving traits through genome editing in orphan crops.
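One simple way such an ortholog table could be populated is the reciprocal best hit heuristic, in which two genes are paired only if each is the other's highest-scoring match. The sketch below is an illustration of that heuristic, which is not a method described in the studies cited here. The gene identifiers and similarity scores are hypothetical and would normally come from a sequence similarity search run in both directions.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(major_to_orphan, orphan_to_major):
    """Keep gene pairs that are each other's best hit in both directions."""
    forward = best_hits(major_to_orphan)
    backward = best_hits(orphan_to_major)
    return sorted((major, orphan) for major, orphan in forward.items()
                  if backward.get(orphan) == major)


# Hypothetical similarity scores from a major crop to an orphan crop and back.
major_to_orphan = {
    ("majorA", "orphan1"): 0.92,
    ("majorA", "orphan2"): 0.60,
    ("majorB", "orphan2"): 0.75,
    ("majorC", "orphan2"): 0.80,
}
orphan_to_major = {
    ("orphan1", "majorA"): 0.92,
    ("orphan2", "majorC"): 0.80,
    ("orphan2", "majorB"): 0.75,
}

ortholog_pairs = reciprocal_best_hits(major_to_orphan, orphan_to_major)
print(ortholog_pairs)
```

Here majorB is left unpaired because its best match prefers majorC. The heuristic is conservative and can miss orthologs in duplicated gene families, so a curated database would likely combine it with other evidence.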

Transfer learning is a machine learning method that uses pre-trained models and new datasets to fine tune ML models for a new purpose78. Transfer learning can be used to make predictions in a species with little available knowledge by training the model on a species with available data (Fig. 4). Pre-trained models can be transferred from major to related orphan crops due to the conservation of genes and gene functions45. Tomatoes are a major crop with poor quality annotations. Transfer learning was used to improve the prediction accuracy of generalised and specialised metabolism genes in tomatoes79. A model trained on A. thaliana was applied to tomato annotation data, and the prediction accuracy of the transfer learning model was greater than the model trained on the tomato annotation data for generalised metabolism genes. Prediction accuracy did not improve for specialised metabolism genes. The reason the transfer learning model performed better for the generalised metabolism genes is because they are conserved between species while specialised metabolism genes are lineage specific79. While this study focuses on a major crop with poor annotation, the same method can be applied to orphan crops. Transfer learning can be used to link knowledge from resource rich major crops to related orphan crops, for conserved traits. To aid trait prediction in orphan crops a database of trait prediction models trained on major crops should be collated; these pre-trained models could then be applied to related orphan crops using transfer learning.

Fig. 4: A basic workflow for the use of transfer learning in orphan crops.

ML models are trained using data from major crops. These trained ML models can then be used to predict traits in orphan crops, which have limited available data. The trait predictions are used to choose breeding candidates to improve orphan crop varieties.
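The workflow in Fig. 4 can be sketched with a deliberately small model. The example below pre-trains a linear predictor on a hypothetical major crop dataset and then fine-tunes it for a few steps on a small, related orphan crop dataset, comparing it against the same number of steps from an untrained starting point. All data, effect sizes and hyperparameters are assumptions made for illustration, and both models are evaluated on the data used for fine-tuning, so the comparison shows the effect of the starting point rather than generalisation.

```python
def predict(weights, bias, markers):
    """Linear prediction of a trait from marker codes."""
    return bias + sum(w * m for w, m in zip(weights, markers))


def mse(weights, bias, genotypes, phenotypes):
    """Mean squared error of the model on a dataset."""
    return sum((predict(weights, bias, g) - y) ** 2
               for g, y in zip(genotypes, phenotypes)) / len(phenotypes)


def train(genotypes, phenotypes, weights, bias, steps, lr=0.1):
    """Gradient descent on the mean squared error from a given starting point."""
    weights = list(weights)
    n = len(phenotypes)
    for _ in range(steps):
        errors = [predict(weights, bias, g) - y
                  for g, y in zip(genotypes, phenotypes)]
        grad_w = [2 / n * sum(e * g[j] for e, g in zip(errors, genotypes))
                  for j in range(len(weights))]
        grad_b = 2 / n * sum(errors)
        weights = [w - lr * gw for w, gw in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical marker codes shared by both species (conserved loci).
genotypes = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
# Hypothetical yields: the orphan crop has similar, but not identical, effects.
major_yield = [6.0, 8.0, 2.0, 4.0]
orphan_yield = [6.9, 8.5, 2.5, 4.1]

# 1. Pre-train on the data-rich major crop.
pretrained = train(genotypes, major_yield, [0.0, 0.0], 0.0, steps=200)
# 2. Fine-tune briefly on the orphan crop, starting from the pre-trained model.
transferred = train(genotypes, orphan_yield, *pretrained, steps=5)
# 3. Baseline: the same brief training from an untrained model.
scratch = train(genotypes, orphan_yield, [0.0, 0.0], 0.0, steps=5)

mse_transfer = mse(*transferred, genotypes, orphan_yield)
mse_scratch = mse(*scratch, genotypes, orphan_yield)
print(mse_transfer, mse_scratch)
```

The fine-tuned model reaches a lower error because it starts close to the orphan crop solution, which mirrors the argument that transfer learning helps most for traits whose genetic basis is conserved. If the marker effects differed strongly between species, the pre-trained starting point would offer little or no advantage.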

A limitation of transferring knowledge from major to orphan crops, whether through gene prediction, genome editing, or transfer learning, is that all these methods rely on conserved genes. Orphan genes are lineage-specific genes that have no homologues in other species and make up 10–20% of a genome80. Orphan genes have been found to be associated with agronomically important traits such as disease resistance and abiotic stress tolerance81,82,83. These orphan or novel genes cannot be identified or improved without species-specific genomic resources. So, while transferring knowledge from major to orphan crops can be used to improve some traits, orphan crop-specific resources are still needed to reach the maximum potential for crop improvement. Orphan crop genomic resources can identify these orphan genes, which can aid crop improvement in both orphan and major crops.

There are some examples of knowledge transfer between orphan and major crops. For example, abiotic stress resistance genes not present in the bread wheat genome have been identified in the orphan cereal tef84. The salinity-resistant orphan crop groundnut has been identified as a potential source of salinity resistance for soybean85. Other examples involve transfer of knowledge from wild relatives to major crops. An example is the super-pangenome of Cicer, which included several wild relatives of chickpea and led to the discovery of novel disease resistance genes and genes involved in salt resistance, along with novel mutations in vernalisation genes86,87. A similar super-pangenome in tomato identified a cytochrome P450 allele, found only in wild relatives, that is linked with increased yield88. Sequencing of Aegilops accessions has led to the cloning of four novel disease resistance genes not present in bread wheat89. In bread wheat, the Watkins collection of landraces from the 1930s has been a large source of knowledge applied to bread wheat, including novel genes for resistance to tan spot, Fusarium head blight90 and eyespot91,92. Sequencing the entire Watkins collection93 led to the identification and subsequent introgression of 127 QTL alleles from landraces into bread wheat, resulting in yield increases of up to 0.91 t ha⁻¹. These examples show that by focusing on wild or landrace relatives, plant breeders can introduce significant yield gains by introgressions and crossbreeding.

Implementation of machine learning-based improvement of orphan crops

Some of the challenges for orphan crop improvement include the lack of genomic resources, limited uptake of modern crop breeding methods, and the lack of local scientists working on these issues. Collaboration between scientists, local communities, smallholder farmers and international collaborators can help bridge the gap between major and orphan crops. The International Maize and Wheat Improvement Centre (CIMMYT) is a not-for-profit organisation that aims to address the challenges faced by smallholder farmers in marginal environments94. CIMMYT develops high-yielding, nutritious, and abiotic stress-resistant wheat and maize varieties. It works with smallholder farmers in developing countries by providing training, trading knowledge, and exploring market opportunities. With the aid of public and private collaborations, CIMMYT has improved the food security of millions of smallholder farmers in Africa, Asia and Latin America. Similar initiatives aim to improve orphan crops. Feed the Future Innovation for Crop Improvement focuses on accelerating the breeding of local roots, tubers, bananas, millets, legumes and sorghum varieties through the collaboration of scientists, global stakeholders, and local communities95. The AOCC uses a network of public and private collaborators from international, non-government, and academic institutes to collect germplasm reserves, sequence genomes and gather local input96. The AOCC aims to sequence a total of 101 orphan crop species, has completed 6 of these genomes and is in the process of completing an additional 26 genomes. Orphan crop germplasm is held in over 150 gene banks globally, which can be used for sequencing and genotyping by initiatives such as the AOCC97. Important to each of these initiatives is the input of local communities to ensure that the crop varieties are suited to each local environment, that the farmers are willing to adopt the technology and that there is a demand for the product within the local marketplace.
Another method to increase local involvement and to increase the manpower behind orphan crop improvement is to recruit local farmers as citizen scientists. Triadic comparisons of technologies (TRICOT) is a citizen science method that sends volunteer farmers crop varieties or agronomic technologies to trial98. TRICOT is cost effective and does not require training or specialized skills, making it accessible to farmers in marginal communities. TRICOT has been successfully used to trial the climatic response of crops in marginal environments and to determine consumer preference of orphan crop varieties99,100. Given how important local famer and community input is for orphan crop improvement it is required that these communities benefit from the studies that they take part in. All studies in orphan crops in marginal or regional environments should have the consent of the local community and the results should be accessible by the smallholder farmers that participate. Citizen scientist studies and regional and international collaborations should be supported by policy to ensure funding of initiatives to improve orphan crops. The United Nation’s recommendations for supporting orphan crops includes funding and training for farmers willing to adopt new technologies, funding for smallholder farmers to access markets, and policies encouraging the collaborations of local knowledge and science and technology101,102. Policy frameworks should be developed to train and fund the implementation of ML and modern breeding techniques by local farmers in orphan crops and to encourage further collaborations with local communities when developing new orphan crop varieties.

One of the greatest challenges for orphan crop improvement, and for the associated improvement in food security in nations that rely on these crops, is the lack of funding. While the majority of orphan crops will remain orphans due to their niche habitats or limited potential, many have the potential, with appropriate investment, to become major crops either regionally or even globally. The rising tide of genomic technologies should lift the performance of all crops, as knowledge can be transferred to closely related species. However, investment should be focussed on those crops with the greatest potential for improvement, taking into account where machine learning can be used to optimise results. Machine learning models can leverage major crop datasets for training, decreasing the amount of orphan crop data required for trait prediction and the identification of genomic features. Nonetheless, strategic data generation from orphan crops is required to ensure strong alignment with the training datasets. As additional datasets are generated from orphan crops, the models can be fine-tuned to improve their accuracy and specificity over time. Additionally, understanding the genomic basis of traits in orphan crops could benefit the major crops through gene introgression and editing. There is therefore an argument for international seed companies to support orphan crop improvement either directly, or indirectly through technology exchange, as many of them currently do. Increased support from breeding and seed companies for orphan crop improvement could substantially accelerate the use of machine learning, as these enterprises have a wealth of data from genome sequencing and field trials. Providing machine learning models with diverse datasets would allow them to account for intra-species genomic variability and other factors that affect trait prediction outcomes.
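
The idea of training on major crop data and then fine-tuning on a small orphan crop dataset can be illustrated with a minimal sketch. The example below is not a method taken from the studies reviewed here: it uses purely synthetic marker and trait data, and the marker count, sample sizes, noise level and penalty values are all assumptions chosen for illustration. A ridge regression is first fitted on a large simulated "major crop" dataset, and the orphan crop model is then shrunk towards those pre-trained marker effects rather than towards zero.

```python
import numpy as np

def ridge_fit(X, y, lam, w_prior=None):
    """Ridge regression shrunk towards w_prior (towards zero if not given)."""
    p = X.shape[1]
    if w_prior is None:
        w_prior = np.zeros(p)
    # Closed form for argmin_w ||y - Xw||^2 + lam * ||w - w_prior||^2
    residual = y - X @ w_prior
    delta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ residual)
    return w_prior + delta

def simulate(rng, n, w, noise=0.5):
    """Synthetic genotypes coded -1/0/1 and a trait with additive marker effects."""
    X = rng.integers(-1, 2, size=(n, w.size)).astype(float)
    y = X @ w + rng.normal(0.0, noise, size=n)
    return X, y

rng = np.random.default_rng(0)
p = 50                                              # hypothetical number of markers
w_major = rng.normal(0.0, 1.0, size=p)              # simulated effects in the major crop
w_orphan = w_major + rng.normal(0.0, 0.1, size=p)   # assumed to be similar in the orphan crop

X_major, y_major = simulate(rng, 2000, w_major)     # large major crop dataset
X_train, y_train = simulate(rng, 20, w_orphan)      # small orphan crop dataset
X_test, y_test = simulate(rng, 500, w_orphan)       # held-out orphan crop lines

w_pre = ridge_fit(X_major, y_major, lam=1.0)                     # pre-train on the major crop
w_scratch = ridge_fit(X_train, y_train, lam=1.0)                 # orphan crop data only
w_tuned = ridge_fit(X_train, y_train, lam=10.0, w_prior=w_pre)   # fine-tune from pre-trained effects

def mse(w):
    """Mean squared prediction error on the held-out orphan crop lines."""
    return float(np.mean((y_test - X_test @ w) ** 2))

mse_scratch = mse(w_scratch)
mse_tuned = mse(w_tuned)
print(f"orphan data only: {mse_scratch:.2f}, fine-tuned: {mse_tuned:.2f}")
```

In this simulation the fine-tuned model benefits only because the orphan crop effects were constructed to resemble those of the major crop. With real data, that resemblance is exactly what strategic data generation from orphan crops would need to establish.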

Investing in a single technology for data generation is unlikely to deliver sufficient results or support across fields. Boosting orphan crop performance requires diversified investment, from genomics-based breeding through to agronomy, processing and marketing. Moreover, while investing in the countries where orphan crops are predominantly grown can leverage local expertise and has significant social benefits, restricting investment to a geographic boundary may not always be the most efficient pathway to accelerate crop improvement. A strategy that integrates the best available technologies and expertise on both a local and a global scale can enhance the effectiveness and impact of such efforts. This is particularly important for developing ML models, which often require substantial computational resources and specialised skills that can be accessed more cost-effectively on a global scale. Given the limited financial resources available for orphan crop improvement, a balanced approach is important, as the most effective improvements per dollar invested may come from low-cost traditional breeding, education and marketing strategies. Machine learning models can then build on the knowledge generated once these low-cost approaches have been exploited.

Outlook

Machine learning has emerged as a promising tool for advancing research and breeding efforts in orphan crops. These underutilised plant species, often vital for food security in developing regions, have historically received less scientific attention than major staple crops. Here we have described how ML techniques are being applied to analyse genomic data, predict crop traits, optimise breeding strategies, and enhance disease resistance in major crops, and how this knowledge can be transferred to orphan crops. By leveraging large datasets and complex algorithms, ML approaches can accelerate the identification of beneficial genes and help develop improved varieties. This technology shows potential to address challenges specific to orphan crops, such as limited genetic resources and adaptation to local environments, and to help ensure food for a growing population in a warming climate.