Introduction

Ultrarunners participate in running events that exceed the 26.2-mile (42.195 km) marathon distance. The most common ultramarathon race distances are 50 km, 50 miles (80.467 km), 100 km, and 100 miles (160.934 km)1,2. Although the 50-mile ultramarathon is among the most prevalent race distances, there has been limited scientific interest considering 50-mile ultramarathoners or 50-mile ultramarathons2,3.

Age2, age-related performance4, and the sex difference in performance5,6,7 have previously been investigated in 50-mile ultramarathoners. In the literature, a higher age of peak performance was observed in ultramarathons (~ 35 years and older) than in marathons (~ 25–35 years)2,8,9,10,11,12. In particular, Nikolaidis et al.13 examined data from 494,414 finishers who participated in 50-km ultramarathons from 1975 to 2016. When the top 10 athletes were analyzed, men aged 39 and women aged 40 had the fastest mean running speeds. Similarly, in 50-mile races, men at the age of 35 and women at the age of 37 achieved the fastest running speeds2. In longer 100-km ultramarathons, among the top 10 finishers, men around 41 years old and women around 39 years old were the fastest9. Moreover, in even longer 100-mile ultramarathons, Rüst et al.12 reported that the mean age of the fastest men was 37.2 ± 6.1 years and that of the fastest women 39.2 ± 6.2 years.

The sex gap in endurance sports has been investigated extensively and it seems to depend on the race distance7,14. It has been suggested that female predispositions like improved resistance to fatigue, enhanced utilization of substrates, and reduced energy requirements could be advantageous in longer distances, but may only play a role in extreme endurance races14,15. Previously, Waldvogel et al.7 analyzed 50-mile and 100-mile ultramarathons and found a larger sex gap in mean race speed in the shorter 50-mile (9.13%) than in the 100-mile (4.41%) distance. Moreover, when Senefeld et al.16 investigated the top 10 finishers from 20 ultramarathons with a distance between 45 km and 160 km, they reported that the largest sex gap was observed in the shortest distance of 45–50 km (19.3 ± 5.8%). However, the sex differences in the 80-km (18.5 ± 6.0%) and the 100-km distance (14.9 ± 4.2%) were not smaller than in the 160-km distance (18.7 ± 5.1%)16. Furthermore, Knechtle et al.17 examined the sex gap in time-limited ultramarathons ranging from 6 to 240 h and concluded that women did not perform relatively better than men in events with longer race durations. A recently published systematic review that analyzed the literature for evidence regarding sex-specific guidelines for ultramarathoners and their coaches stated that further research is required to develop such guidelines as there is a lack of high-quality evidence3.

Regarding the origin of athletes, it is widely recognized that athletes from specific regions tend to perform better in particular sports. Looking at other endurance running distances, runners from Africa, especially East Africa, dominate worldwide long-distance events up to the marathon distance18,19,20,21,22. Regarding longer ultramarathon distances, only a few studies have analyzed the aspects of nationality on performance23,24,25. Knechtle et al.23 analyzed results from 150,710 finishers of 100-km ultramarathons and found that athletes from Russia and Hungary performed best. A recently published study by Thuany et al.24 on 100-mile ultramarathons with a dataset of 148,169 finishers found that women from Sweden, Hungary, and Russia and men from Brazil, Russia, and Lithuania had the fastest mean speeds in the top 3, top 10 and top 100 respectively. However, when the authors performed a macro-analysis by continent, African runners were the fastest, but their participation was very low. For 50-mile ultramarathons, however, we have no knowledge where the fastest runners were from and where the fastest races were held in the past. Knowledge about the dominance of athletes from specific regions may assist in selecting athletes for further investigation and understanding the factors behind their performance and endurance capabilities. Training26,27, motivation28, mood29, and nutrition30,31 have been investigated in ultrarunners to understand factors that influence performance. Additionally, research was also performed on medical aspects such as the influence on pain sensitivity32,33, the immune system34, skeletal muscles35, and the kidneys36. Such extensive research is justified by the physical demands and the concerns regarding the well-being and health status of athletes in both training and competition settings35,37.

The aim of the present study was to investigate the relationship between race speed and an athlete’s age group, gender, nationality, and the location of the race, using an extensive dataset containing race records from 1863 to 2022 and a machine learning model based on the XGBoost algorithm. Knowledge about these aspects can help athletes and professionals working with ultramarathon runners.

Methods

Ethical approval

This study was approved by the Institutional Review Board of Kanton St. Gallen, Switzerland, with a waiver of the requirement for informed consent of the participants as the study involved the analysis of publicly available data (EKSG 01/06/2010). The study was conducted in accordance with recognized ethical standards according to the Declaration of Helsinki adopted in 1964 and revised in 2013.

Study design, sample, and data preparation

For this cross-sectional study, secondary data were obtained from the official website of the ‘Deutsche Ultramarathon-Vereinigung’38. Race results from athletes who competed in a 50-mile ultramarathon from 1863 until the end of 2022 were considered. Each race record included the athlete’s name, age group, gender, nationality, the race location and year, the race distance, and the athlete’s race time, from which the race speed was calculated in km per hour (km/h). In total, 341,188 race records were identified. We observed that the dataset was imbalanced, with Anglo-Saxon countries vastly dominating the sample, and the United States of America alone accounting for 82% of the sample. In contrast, there were also countries with only a small number of results from a few individual runners. To address these two issues, we discarded any race records from athlete countries with fewer than 15 records or fewer than 5 individual runners, discarded any race records from event countries with fewer than 10 records, and randomly down-sampled the United States of America data to 10% of its full size.
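The preparation steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the record keys ('runner', 'athlete_country', 'event_country'), the 'USA' label, and the random seed are hypothetical, and the sketch assumes that the thresholds are counted on the unfiltered data and that the down-sampling applies to records of athletes from the United States of America.

```python
import random
from collections import Counter, defaultdict

FIFTY_MILES_KM = 80.467  # 50 miles expressed in km, as given in the Introduction


def race_speed_kmh(race_time_hours):
    """Average race speed (km/h) over the 50-mile distance."""
    return FIFTY_MILES_KM / race_time_hours


def filter_and_downsample(records, usa_label="USA", usa_fraction=0.10, seed=0):
    """Apply the exclusion thresholds, then randomly down-sample USA records.

    Each record is a dict with the hypothetical keys 'runner',
    'athlete_country', and 'event_country'.
    """
    # Counts are taken on the unfiltered data (an assumption).
    records_per_athlete_country = Counter(r["athlete_country"] for r in records)
    records_per_event_country = Counter(r["event_country"] for r in records)
    runners_per_athlete_country = defaultdict(set)
    for r in records:
        runners_per_athlete_country[r["athlete_country"]].add(r["runner"])

    kept = [
        r for r in records
        if records_per_athlete_country[r["athlete_country"]] >= 15
        and len(runners_per_athlete_country[r["athlete_country"]]) >= 5
        and records_per_event_country[r["event_country"]] >= 10
    ]

    # Down-sample by athlete country (an assumption; the text does not
    # say whether athlete or event country was used).
    usa = [r for r in kept if r["athlete_country"] == usa_label]
    others = [r for r in kept if r["athlete_country"] != usa_label]
    rng = random.Random(seed)
    usa_sample = rng.sample(usa, round(len(usa) * usa_fraction))
    return others + usa_sample
```

A fixed seed keeps the random sample reproducible; a different seed would select a different 10% of the records.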

Data analysis

Descriptive statistics are presented using mean ± standard deviation, minimum/maximum values, frequencies, and percentages. Machine learning was used because of its ability to handle non-linear, complex interactions between different variables, making it possible to reveal patterns that might not be evident using traditional statistical methods. We built and evaluated a predictive model for the 50-mile race distance based on the popular XGBoost regressor machine learning algorithm. This algorithm is freely available and can handle large datasets within reasonable computing times. It also supports some of the most powerful explainability tools, which are essential for understanding the factors that influence predictions. The following variables were used as predictors, or inputs to the model: Athlete_gender_ID, Age_group_ID, Athlete_country_ID, Event_country_ID. The predictors are numerically encoded versions of the four race record features used for analysis: the athlete’s gender, age group, nationality, and the country where the race took place (see detailed explanation in the section ‘Country rankings and numerical encoding of categorical variables’). The predicted variable, or algorithm output, was the race speed (km/h). We calculated two model evaluation metrics, mean absolute error (MAE) and coefficient of determination (R²). Moreover, the model’s relative feature importances and prediction distribution plots were computed and analyzed. In addition to the predictive model interpretability analysis, a set of descriptive target plots presents the groups’ average (mean) speed (targets), helping to set expectations for the prediction charts.
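For reference, the two evaluation metrics named above follow their standard definitions, which can be written in a few lines of plain Python. The function names are illustrative and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between observed and predicted speeds (km/h)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

An R² of 0 corresponds to a model that always predicts the sample mean, and 1 to perfect predictions, so a value of 0.21 lies much closer to the former.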

Country rankings and numerical encoding of categorical variables

Before the data could be fit into the XGBoost model, the predictors had to be numerically encoded. The Athlete gender variable was encoded as female = 0 and male = 1. The Age group variable was already numerically encoded in 5-year age groups, with the exception of the age group 18, which represents runners under 20 years old, and the age group 75, which represents runners who are 75 years old and older. The Athlete country and Event country variables were encoded based on their position in the respective rankings by number of records (participation).
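The encoding described above could look as follows in Python. This is a sketch under assumptions: the gender labels 'F' and 'M', the choice of rank 1 for the country with the most records, and the alphabetical tie-breaking are all illustrative and are not stated in the text. The age group needs no extra step because it is already numeric.

```python
from collections import Counter


def encode_gender(gender):
    """Encode gender as female = 0 and male = 1 (labels 'F'/'M' are hypothetical)."""
    return {"F": 0, "M": 1}[gender]


def rank_encode_by_participation(countries):
    """Map each country to its rank by number of records.

    Rank 1 goes to the country with the most records; ties are broken
    alphabetically (both choices are assumptions).
    """
    counts = Counter(countries)
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return {country: rank for rank, country in enumerate(ranked, start=1)}
```

The same function would be applied separately to the athlete-country and event-country columns, giving two independent rankings.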

XGBoost model parameters and metrics

After several iterations and tests, the optimal XGBoost model parameters and accuracy scores were:

  • 200 estimators (learners or trees).

  • Learning rate: 0.3.

  • R² score: 0.21 (in-sample with 90,206 race records).

  • MAE: 1.24 km/h.
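A model with the parameters listed above could be set up as in the following sketch, which uses the scikit-learn style `XGBRegressor` class of the xgboost Python package. Only the number of estimators and the learning rate come from the text; leaving every other setting at the library default is an assumption, and `fit_speed_model` is a hypothetical helper name.

```python
# Reported hyperparameters; all other settings stay at library defaults (an assumption).
XGB_PARAMS = {"n_estimators": 200, "learning_rate": 0.3}

# Predictor names as listed in the Data analysis section.
FEATURES = [
    "Athlete_gender_ID",
    "Age_group_ID",
    "Athlete_country_ID",
    "Event_country_ID",
]


def fit_speed_model(X, y, params=XGB_PARAMS):
    """Fit an XGBoost regressor on rows of FEATURES (X) and race speeds in km/h (y)."""
    # Imported lazily so the configuration above can be inspected
    # without the xgboost package installed.
    from xgboost import XGBRegressor

    model = XGBRegressor(**params)
    model.fit(X, y)
    return model
```

Because the reported R² is in-sample, the comparable check would call `model.predict(X)` on the same records that were used for fitting.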

Results

After all processing, the final 50-mile sample consisted of 90,206 race records from 55,213 unique runners from 60 different countries who participated in races in 36 different countries between 1863 and 2022. A total of 20,481 race records were by women and 69,725 by men.

Athlete country and event country rankings

The two variables representing the athlete’s nationality and the country where the race took place were ranked independently by mean race speed (km/h). The fastest mean running speeds were obtained by athletes originating from Bulgaria, Slovenia, New Zealand, Croatia, and Ukraine (see Table 1 for details).

Table 1 Athlete countries ranked by mean race speed (km/h), with corresponding numbers of unique race records, runners, and races.

The fastest mean running speeds were achieved in races held in New Zealand, Serbia, Croatia, Spain, and Hungary (see Table 2 for details).

Table 2 Event countries ranked by mean race speed (km/h), with corresponding numbers of unique race records, runners, and races.

Evaluation metrics and feature importances

The model for the 50-mile race distance obtained an R² of 0.21, which indicates an existing but weak effect of the predictor variables on the model output. Based on data entropy reduction, the model rated Event country (0.34) as the most important predictor, followed by Gender (0.33), Age group (0.17), and Athlete country (0.16).

Prediction distributions and target value plots

The following charts combine target value plots with prediction distribution plots. The target value plots provide a descriptive visualization of the 50-mile race sample, showing the average (mean) race speed (the target value) in red. The actual distribution of the XGBoost predictions of race speed is represented at the top of each figure in the form of boxplots, showing the median with the box spanning the interquartile range (IQR). The model interpretability chart for the feature age group also shows the number of records in each age group at the bottom (violet bars and boxes). The model interpretability charts for athlete and event country are sorted according to the number of race records, which are presented in detail in Tables 1 and 2, respectively. In general, the prediction distribution plots closely matched the descriptive ranking tables. The age group 20–24 was the fastest, followed by age groups 25–29, 30–34, and 35–39, while most athletes were in the age group 40–44 (Fig. 1). In terms of nationality, Slovenia (10.28 km/h), New Zealand (10.09 km/h), and Bulgaria (9.87 km/h) had the fastest median predicted values (Fig. 2). For the event country predictor, New Zealand had the fastest median predicted value (10.75 km/h), followed by Croatia (9.80 km/h) and Serbia (9.53 km/h) (Fig. 3). Men were faster than women, with model predictions showing a median race speed around 0.6 km/h faster for males (median: 7.69 km/h, IQR: 0.3 km/h) than for females (median: 7.08 km/h, IQR: 0.2 km/h), with target values of 7.68 km/h and 7.08 km/h, respectively.
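The boxplot summaries reported above (median and IQR of the predictions within a group) can be reproduced from a list of predicted speeds with the Python standard library. The quartile convention used here (the 'inclusive' method) is an assumption; plotting libraries may use a slightly different rule.

```python
from statistics import median, quantiles


def median_and_iqr(values):
    """Return the median and interquartile range (Q3 - Q1) of predicted speeds."""
    # 'inclusive' treats the data as the full population (an assumption).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1
```

Applied to the predictions of one group, for example all records with Athlete_gender_ID equal to 1, the function returns the two numbers that define the centre line and the height of that group's box.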

Fig. 1

Prediction distributions and target values of race speed and number of race records by age group.

Fig. 2

Prediction distributions and target values of race speed by athlete country.

Fig. 3

Prediction distributions and target values of race speed by event country.

Discussion

The objective of this study was to investigate the relationship of an athlete’s age group, sex, nationality, and the race location on race speed in 50-mile ultramarathons. The main findings according to the XGBoost model’s predictions were (i) model predictions were around 0.6 km/h faster for males than for females, (ii) the age group 20–24 was the fastest, whilst the age-related performance decline became more pronounced starting with age group 40–44, (iii) athletes from Slovenia, New Zealand, and Bulgaria had the fastest race speeds, (iv) across all races within each country, the fastest race speeds were found in New Zealand, Croatia, and Serbia, (v) the most important features with respect to the predictive power of the XGBoost model were the race location and an athlete’s gender.

The faster race speed of men is in accordance with past studies on ultramarathoners6,16 and other endurance sports39,40,41. Men had a median predicted race speed that was about 8.61% faster than the median predicted value for women. Previously, Waldvogel et al.7 reported a comparable mean difference in race speed of 9.13% between both sexes in 50-mile ultramarathons. However, Knechtle et al.17 found that women were able to narrow the performance difference between the sexes across most timed ultramarathons over the years. The main physiological factor for the performance sex gap in endurance sports is probably women’s lower aerobic capacity, which is often measured as maximal oxygen consumption (VO2max)42. The smaller sex gap in ultramarathons, compared to shorter race distances, can be explained by the fact that muscles are exercised at a lower percentage of VO2max15,42. Moreover, while females have a number of traits that are beneficial for ultramarathon running (e.g., increased resistance to muscle fatigue, higher reliance on lipid metabolism, and more even pacing strategies than men), certain aspects of female physiology still clearly hinder performance and reduce the likelihood that the fastest females will outperform the fastest males in longer ultramarathon events in the future14,43.
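The relative sex difference quoted above can be checked against the rounded medians given in the Results (7.69 km/h for men and 7.08 km/h for women). The short sketch below assumes that the difference is expressed relative to the women's median; with the rounded inputs it gives roughly 8.6%, and any small deviation from the reported 8.61% would presumably stem from unrounded medians.

```python
def relative_difference_pct(faster, slower):
    """Percentage by which `faster` exceeds `slower`, relative to `slower`."""
    return (faster - slower) / slower * 100.0


# Rounded median predicted speeds (km/h) taken from the Results section.
sex_gap_pct = relative_difference_pct(7.69, 7.08)
```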

The age group 20–24 had the fastest predicted median race speed (8.01 km/h), closely followed by age groups 25–29 (7.93 km/h), 30–34 (7.83 km/h), and 35–39 (7.78 km/h). Thus, we found a younger age of peak performance in comparison to previous studies on ultramarathon finishers, which reported an age of peak performance of ~ 35 years and older2,9,12,13. However, a pronounced decline in performance was only observed from the age group 40–44 onward. One explanation could be that we have included data over a longer period of time. In past studies, an increase in the age of peak performance across the calendar years was observed in ultramarathons, as well as in other endurance sports13,44,45. Moreover, we used a different methodological approach and considered data from all finishers rather than focusing only on the fastest finishers. The relatively increased performance of older athletes over the last decades in past studies on endurance sports is likely attributable to their higher training volume2,46. As athletes get older, they usually finish more half-marathons, marathons, and short ultramarathon races. With experience, they then tend to compete in longer distances, as a high training volume can better maintain endurance than sprint capacities due to various physiological aspects46,47,48. However, VO2max typically starts to decline between the ages of 20 and 30 years47. This may explain our finding that the model predicted the fastest race speeds for the age group 20–24 and challenges previous studies that used data over a shorter period.

The fastest median predicted race speeds were achieved by athletes from Slovenia (10.28 km/h), New Zealand (10.09 km/h) and Bulgaria (9.87 km/h). Following closely behind were athletes from Croatia (9.80 km/h) and Ukraine (9.61 km/h). It is noteworthy that the fastest runners mainly came from Eastern Europe, even though the participation in the sample was largely dominated by countries from the Anglosphere. One possible explanation for this trend lies in the specific dataset characteristics. Many Eastern European countries have fewer records in the dataset, potentially indicating a narrower overall range of performance levels. Professional athletes could be more willing to participate in ultramarathon events despite the expenses and various challenges, whereas amateur athletes frequently encounter real-life obligations and financial limitations49. This observation applies to all participants from countries far away from the race location, but it is especially remarkable in the case of Eastern Europe. For instance, Slovenia is ranked 23rd out of 50, Bulgaria is 36th, Croatia is 33rd, and Ukraine is 50th in the list of sovereign European countries by nominal gross domestic product per capita (in US$) in 2021 according to the World Bank, highlighting the ‘selectiveness’ of these professional athletes50. In our study, athletes from Africa, who, as previously stated, dominate running competitions worldwide up to the marathon distance, could not be included because participation from these countries in the events analyzed was too low18,19,20,21,22. One reason for their low participation is probably the lower prize money at ultramarathon events compared to marathon events, an important aspect, especially for elite athletes51,52.

The dominance of Anglo-Saxon countries in terms of overall participation and the number of events held can be attributed to various factors. Since our analysis specifically focused on the 50-mile distance and miles are the preferred unit of measurement in Anglo-Saxon countries, it is obvious that races of this distance have predominantly taken place in these regions and are not as popular in countries that do not use the imperial system. Moreover, the United States of America has been reported to be one of the birthplaces of ultramarathon, where a first running boom was seen in the 1960s24,53.

We found that the fastest median race speeds were predicted for 50-mile race courses held in New Zealand (10.75 km/h), followed by Croatia (9.80 km/h) and Serbia (9.53 km/h). As already mentioned above, athletes from New Zealand also had the second-fastest median predicted race speed. New Zealand has a long tradition of 50-mile races. The ‘New Brighton 50 Mile’ was held from 1962 to 1993 as a road-based race54. The ‘50 Miles Track Race Auckland’ was held from 1969 until 1972 and in the ‘Self-Transcendence’ races in Auckland and Christchurch, 50-mile split times were often recorded55,56. Other races such as the ‘Heaphy Five-0 Trail Run’ or the ‘The Old Forest Hanmer 100’57,58 were held as trail runs. However, as we used secondary data and a wide time frame, particular characteristics such as elevation profiles, weather conditions, and competitive environments could not be considered. Further explanations would be speculative and future studies should be carried out on this topic to understand the best conditions and characteristics for ultramarathon events, especially as the most important feature with respect to the XGBoost model’s predictive power was the race location.

Limitations, strengths, and implications for future research

In contrast to shorter endurance races, it is difficult to compare ultramarathon races as they are poorly standardized. To address this issue, we used a very large dataset and highly sophisticated data analysis techniques, but we must acknowledge some limitations. The results are a summary of the observations across the descriptive charts (i.e. target value plots) and ranking tables, as well as the model interpretability charts (prediction distribution plots). Athletes from countries with small sample sizes (fewer than 15 records or fewer than 5 individual runners), as well as records from event countries with fewer than 10 records but with faster race times, may have been excluded from the analysis due to the methodology used. Furthermore, reducing the USA data to 10% of its original size was intended to prevent the predictive model from exhibiting a bias towards USA athlete race speeds, ensuring a more balanced analysis. Aspects such as physiological and anthropometric variables, training, previous experience, motivation, equipment, pre-race nutrition, nutrition during the race, elevation profiles, and environmental conditions could not be considered. As secondary data were used, we must also be aware that these race courses might not all have been exactly measured. Moreover, we did not differentiate between the sexes in our analysis regarding the fastest age group, nationality of the athletes, and location of the races. This could be a subject for future studies, which should also consider the use of a relative performance metric. The value of R² (0.21) obtained with 200 estimators and a learning rate of 0.3 was the best we could get with the available data. This indicates an existing but weak effect of the predictor variables on the model output, acting as a reminder that a better predictive model would require additional features.

In general, the prediction distribution plots closely matched the descriptive ranking tables and charts, demonstrating the ability of the machine learning model to learn the statistical structure of the data. A strength of the present study is its novelty, as it is one of the few studies to examine participation and performance trends in distance-limited ultramarathon races, and it is the first one on the 50-mile distance to use machine learning models. Moreover, our findings offer several practical applications for athletes and professionals working with ultramarathon runners competing at this distance. Identifying peak performance age groups and the onset of age-related performance decline provides valuable insights for data-driven career planning and realistic short- and long-term goal-setting. Additionally, information about locations with fast race times in the past can assist athletes and coaches in selecting races to achieve personal best performances while taking financial and travel considerations into account. It can also help race organizers to select or promote races that might be attractive for record attempts, which can increase participation and prestige. Future studies should analyze why faster race times are observed in certain locations and include more specific data such as the specific race courses.

Conclusions

In summary, this study used a machine learning algorithm to analyze 50-mile ultramarathon races, highlighting several important factors that influence race speed. The model predicted a faster median race speed for male ultramarathoners, and the fastest for athletes of the age group 20–24. The top 3 countries with the fastest predicted median race speeds with regard to nationality were Slovenia, New Zealand, and Bulgaria, and with regard to event location New Zealand, Croatia, and Serbia. The most significant variables in terms of the predictive power of the XGBoost model were the race location and an athlete’s gender.