Introduction

In ultra-marathon running, the 100-km race is among the most popular1 and most traditional events2. This is also reflected by the scientific research in recent years, where studies have been performed regarding different topics about the 100-km race. Several studies investigated the aspect of training3,4, pre-race preparation5, previous experience, prediction of race performance5,6, age-related performance decline7 and age of peak performance8. Other aspects were pacing9,10 and gender differences in performance11.

Apart from sports science, medical topics were also investigated such as the influence of 100-km running on the heart12,13, the kidneys14, the immune system15,16, the endocrine system17,18, the skeletal muscle19, the mood20,21, and the regulation of the acid–base balance22. Also, nutritional aspects and aspects of fluid metabolism were considered, especially during the race23,24. Regarding fluid metabolism, the influence of dehydration25, electrolyte regulation25, fluid intake during the race26,27, and exercise-associated hyponatremia28 were investigated.

The aspect of the origin of the fastest 100-km ultra-marathoners has also been examined29,30. Regarding the origin of the fastest 100-km ultra-marathoners, a study from 2014 concluded that the fastest runners originated from Japan31. A study from 2018 found, however, that runners from Russia were the fastest in 100-km running29. The disparate findings might be explained by the different analytical approaches (i.e., single, and multi-level regression analyses/regression analyses adjusted by gender, age, and year) and/or the different time frames (1998–2011 versus 1959–2016) of these two studies. Nevertheless, an up-to-date analysis is needed to confirm or reject the more recent finding of Russian dominance29.

Furthermore, the kind of race course most probably also influences overall race time. Little is known about the effect of course characteristics on ultra-marathon performance. Knowledge about this influence on 100-km ultra-marathon performance might help athletes and coaches to select the most suitable race. For marathon running, topography has a high influence on race time32. For a very fast marathon, undulations and curves have a high impact on race time33. In marathon running, altitude above sea level is also important: race times in marathons at sea level are considerably faster than race times in marathons held at altitude34,35. Since we have no specific knowledge about a potential influence of changes in elevation and race course characteristics in 100-km ultra-marathon running, more investigation is needed.

Therefore, the present study aimed, first, to re-evaluate the origin of the fastest 100-km ultra-marathoners using a different approach (i.e., machine learning). A second aim was to evaluate where the fastest 100-km races are held. A third aim was to investigate a potential influence of race course characteristics such as elevation changes (i.e., hilly or flat course) and the kind of race course (i.e., mountain, trail, road, or track). Based upon recent findings, we hypothesized that Russians were the fastest 100-km ultra-marathoners. Regarding race course characteristics, we assumed that faster race times would be achieved on flatter race courses.

Method

Ethical approval

This study was approved by the Institutional Review Board of Kanton St. Gallen, Switzerland, with a waiver of the requirement for informed consent of the participants as the study involved the analysis of publicly available data (EKSG 01/06/2010). The study was conducted following recognized ethical standards according to the Declaration of Helsinki adopted in 1964 and revised in 2013.

Data set and data preparation

Official race records from all 100-km ultra-marathons held since 1892 were obtained from the official DUV (Deutsche Ultramarathon Vereinigung) website (https://statistik.d-u-v.org). Each record included the athlete’s first and last name, age group, gender, country of origin, race name, location and year, race distance, and the athlete’s race time. The raw dataset comprised a total of 858,722 race records. This data set was checked for consistency, and any incomplete or impossible records were removed. To minimize the presence of outliers, an upper limit of 21 km/h was set for running speed. Likewise, any countries with fewer than 10 race records in the sample were removed. The resulting cleaned sample used for analysis and modeling consisted of 858,544 race records. The dataset was further augmented with elevation data, and the races were classified into hilly or flat courses (as per the DUV website), since only a few races indicated numerical values for changes in altitude. Furthermore, we checked the race course characteristics indicated on the DUV website and classified the races into mountain, trail, road, and track races.
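The cleaning rules described above can be illustrated with a minimal sketch. It is not the original notebook code: the record layout (`country`, `hours`), the function name `clean_records`, and the order in which the two filters are applied are assumptions made only for illustration.

```python
from collections import Counter

RACE_KM = 100.0
MAX_SPEED_KMH = 21.0       # upper speed limit stated in the text
MIN_COUNTRY_RECORDS = 10   # countries with fewer records are dropped


def clean_records(records):
    """Keep plausible records from sufficiently represented countries."""
    # Step 1: drop incomplete or impossible records and compute the speed.
    plausible = []
    for rec in records:
        hours = rec.get("hours")
        if not rec.get("country") or hours is None or hours <= 0:
            continue
        speed = RACE_KM / hours
        if speed > MAX_SPEED_KMH:
            continue
        plausible.append({**rec, "speed_kmh": speed})
    # Step 2 (assumed to come second): drop countries with too few records.
    counts = Counter(rec["country"] for rec in plausible)
    return [r for r in plausible if counts[r["country"]] >= MIN_COUNTRY_RECORDS]


# Hypothetical toy data: 12 valid records, one impossible time, one rare country.
toy = [{"country": "AAA", "hours": 10.0} for _ in range(12)]
toy.append({"country": "AAA", "hours": 4.0})  # 25 km/h, above the limit
toy.append({"country": "BBB", "hours": 9.0})  # only one record for BBB
cleaned = clean_records(toy)
print(len(cleaned))
```

In this toy example only the 12 plausible records from the well-represented country remain.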

Statistical data analysis

Once the dataset was cleaned, we began the analysis by creating two independent ranking tables: we aggregated the race records by country of origin and by country of event, and then sorted each list of countries by average running speed, with the fastest countries at the top. The results are summarized in two large reference tables that include the number of records, the number of unique runners, and descriptive statistics of the running speed (the target variable), given as mean, standard deviation (std), minimum (min), and maximum (max). Median values are used in the box plots. In addition to giving a descriptive view of performance and participation for each country of origin or event, these ranking tables served to sort the countries and to provide a numerical index that was later used for their encoding. The association of running speed with the type of race course and with elevation was analyzed using a set of boxplot charts. We built and evaluated three different data models: (i) a Multivariate Linear Regression (MLR) model, (ii) a Mixed Effects Linear Regression (MELR) model, and (iii) a non-linear ML (Machine Learning) predictive model based on the XG Boost Regression algorithm. In all cases, the full sample was used for training and evaluating the models (in-sample tests). Not surprisingly, the XG Boost model obtained the highest predictive score, although each model provided some insights. We also examined the logic of the XG Boost model with explainability tools. The relative feature importances measure how the model rates the predictors in their ability to split the sample data into groups of lower entropy, and hence how much they contribute to an accurate prediction. In other words, the model’s rating of “relative importance” often, but not always, matches the importance a human would perceive. In this respect, comparing against other models is useful when evaluating the results.
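The construction of a ranking table can be sketched as follows. This is only an illustration under assumptions: the field names (`runner_id`, `athlete_country`, `speed_kmh`) and the function `rank_countries` are hypothetical, and the population standard deviation is used here so that single-record groups do not raise an error (the original analysis may have used the sample standard deviation).

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_countries(records, key):
    """Aggregate records by `key` and sort by mean speed, fastest first."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    table = []
    for country, recs in groups.items():
        speeds = [r["speed_kmh"] for r in recs]
        table.append({
            "country": country,
            "records": len(recs),
            "runners": len({r["runner_id"] for r in recs}),
            "mean": mean(speeds),
            "std": pstdev(speeds),  # assumption: population std
            "min": min(speeds),
            "max": max(speeds),
        })
    table.sort(key=lambda row: row["mean"], reverse=True)
    return table


# Hypothetical toy records.
toy = [
    {"runner_id": 1, "athlete_country": "AAA", "speed_kmh": 8.0},
    {"runner_id": 1, "athlete_country": "AAA", "speed_kmh": 9.0},
    {"runner_id": 2, "athlete_country": "BBB", "speed_kmh": 7.0},
    {"runner_id": 3, "athlete_country": "BBB", "speed_kmh": 7.5},
]
ranking = rank_countries(toy, "athlete_country")
for position, row in enumerate(ranking):
    print(position, row["country"], row["records"], row["runners"], row["mean"])
```

The position of each country in the sorted list is the kind of numerical index that the text describes as being reused later for encoding.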

Numerical encoding of categorical variables

Before the data models could be trained, the predictors’ data needed to be converted (encoded) into numerical data. The Athlete gender variable was encoded as female = 0 and male = 1. The Age group variable was numerically encoded in 5-year groups (except group 18, which represents runners younger than 20 years, and group 75, which represents runners aged 75 years and older). The Athlete country and Event country variables were encoded based on their position in the ranking tables, with the countries with the fastest average running speeds at the top. The Elevation and Course variables were encoded in increasing order of average running speed, that is, hilly = 0 and flat = 1 for elevation, and mountain = 0, trail = 1, road = 2, and track = 3 for course.
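A minimal sketch of this encoding is given below. The codes for gender, elevation, and course follow the text; everything else is an assumption for illustration: the record field names, the helper names, the use of the lower bound of each 5-year band as the age group label, and a zero-based ranking index for countries.

```python
GENDER_CODE = {"F": 0, "M": 1}
ELEVATION_CODE = {"hilly": 0, "flat": 1}
COURSE_CODE = {"mountain": 0, "trail": 1, "road": 2, "track": 3}


def encode_age_group(age):
    """Map an age to its group label (assumed: lower bound of the 5-year band)."""
    if age < 20:
        return 18
    if age >= 75:
        return 75
    return (age // 5) * 5


def encode_record(rec, athlete_ranking, event_ranking):
    """Encode one record; rankings are country lists sorted fastest first."""
    athlete_index = {c: i for i, c in enumerate(athlete_ranking)}
    event_index = {c: i for i, c in enumerate(event_ranking)}
    return [
        GENDER_CODE[rec["gender"]],
        encode_age_group(rec["age"]),
        athlete_index[rec["athlete_country"]],  # assumption: zero-based index
        event_index[rec["event_country"]],
        ELEVATION_CODE[rec["elevation"]],
        COURSE_CODE[rec["course"]],
    ]


# Hypothetical record and rankings.
record = {
    "gender": "M", "age": 37, "athlete_country": "AAA",
    "event_country": "BBB", "elevation": "flat", "course": "road",
}
encoded = encode_record(record, ["AAA", "BBB"], ["AAA", "BBB"])
print(encoded)
```

Because the country codes are ranking positions, they are ordinal rather than arbitrary labels, which is what allows tree-based and linear models alike to use them as a single numeric feature.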

Models training and evaluation strategy

No hold-out evaluation strategy was used to train and evaluate the models, as it was not our intent to do any out-of-sample predictive work. Our aim was only to train the models with the full sample and then use different applicable methods to obtain answers to our research questions. The MLR model, a multivariate extension of the common linear regressor, does not need any specific configuration. The model achieved an R² = 0.379 and rated all factors as statistically significant (this is likely due to the large sample size). The second model, the MELR model, was built to evaluate the extent to which individual athlete performance influenced our target variable (the running speed) in the context of the other variables under study. The model rated all factors as statistically significant and showed an individual athlete variance coefficient of ~2.231 (km/h)², which is between three and four times larger than the next factors (gender, elevation and course). The third and most complex model, the non-linear XG Boost regressor, was run with different test splits and different combinations of estimators and learning rates. The optimal XG Boost model was finally built with n_estimators = 500 and learning_rate = 0.25 and trained with the full data sample, obtaining a predictive score of R² = 0.51, well ahead of the MLR model (0.379).
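For readers less familiar with the scores reported here, the following sketch shows how an in-sample R² and a mean absolute error (MAE) are computed from observed and predicted running speeds. The function names and the toy values are hypothetical; the sketch uses only the standard definitions and does not reproduce the study's models.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(y_true, y_pred):
    """Average absolute deviation between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted speeds (km/h) on the training sample.
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 7.0, 7.5, 9.0]
r2 = r_squared(observed, predicted)
mae = mean_abs_error(observed, predicted)
print(round(r2, 3), round(mae, 3))
```

Since both scores are evaluated on the same data used for fitting, they describe how well a model reproduces the sample rather than how well it would generalize to unseen races.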

XG Boost model interpretation

Beyond the R² metrics mentioned above, we further exploited the interpretability options of the ML model. We first computed and plotted the model’s relative feature importances. These importances refer to a rather technical aspect of how the ML model works: the effectiveness of using a specific feature to split the data in order to obtain groups with a lower entropy level. They often coincide with real-life importance, but not always. We also calculated and plotted the model prediction distributions using the Partial Dependence Plots (PDP) library. These ML explainability tools allow us to look, for each predictor, into the 51% of the variability in race speed explained by the model. The prediction distribution plots use boxplots to show the distribution of the model predictions of average race speed.
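The idea behind a prediction distribution plot can be sketched without any plotting library: the model predictions are grouped by the value of one predictor, and the quartiles of each group are what the boxplot displays. The function `prediction_summary` and the toy values below are hypothetical and do not use the PDP library's own interface.

```python
from collections import defaultdict
from statistics import quantiles


def prediction_summary(feature_values, predictions):
    """Group predictions by predictor value and return boxplot statistics."""
    groups = defaultdict(list)
    for value, pred in zip(feature_values, predictions):
        groups[value].append(pred)
    summary = {}
    for value, preds in groups.items():
        # Each group needs at least two predictions to compute quartiles.
        q1, q2, q3 = quantiles(preds, n=4, method="inclusive")
        summary[value] = {"n": len(preds), "q1": q1, "median": q2, "q3": q3}
    return summary


# Hypothetical predicted speeds (km/h) for two elevation classes.
elevation = ["flat"] * 5 + ["hilly"] * 5
predicted = [8.0, 8.5, 9.0, 9.5, 10.0, 6.0, 6.5, 7.0, 7.5, 8.0]
summary = prediction_summary(elevation, predicted)
print(summary["flat"]["median"], summary["hilly"]["median"])
```

The group size `n` is kept alongside the quartiles because, as the results emphasize, a high median in a very small group deserves less weight.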

All computation and analysis were done using a Jupyter Notebook (Google Colab) with Python and associated libraries (pandas, numpy, xgboost, pdpbox, scikit-learn, matplotlib, seaborn).

Results

The clean dataset used in the analyses contained a significant sample of 858,544 race records (732,748 from men and 125,796 from women) from 317,312 unique runners from 103 different countries finishing in 2648 100-km races held in 80 different countries worldwide between 1892 and 2022.

Athlete country ranking

The country ranking is shown in Table 1. The fastest athletes originated from African and Eastern-European countries, with Swaziland, Botswana, Belarus, Kazakhstan, and Cape Verde as the top five. Note that, aside from Belarus with 118 unique runners and 396 race records, all the countries among the top five have small samples. In the 6th and 8th positions are two other East-European countries, Lithuania and Russia, with more representative samples, followed by Hungary, Latvia and Slovakia. At the opposite end of the performance axis, we find many Asian countries, including Hong Kong, China, the Philippines, Singapore and others.

Table 1 Origin of the runners sorted by running speed.

Event country ranking

The ranking of countries by race events is shown in Table 2. A combination of African, Middle Eastern, and European countries seems to hold the fastest 100-km races, with Botswana, Qatar, Belarus, Jordan, and Montenegro as the top five (all but Qatar with very small samples). The Netherlands ranks next with a much more sizeable sample.

Table 2 List of hosting countries for 100-km ultra-marathons, sorted by running speed.

MLR model results

The linear model scored an R² value of 0.379, serving as a baseline for comparison. All factors were assessed as being statistically significant.

MELR model results

The main result of the mixed effects model is that the effect of individual athlete performance is between 3 and 4 times larger than that of the next factors under consideration (elevation, course and gender).

XG Boost regression model results

The ‘optimal’ XG Boost model (sample size 858,544, 500 estimators and learning rate 0.25) achieved a score of R² = 0.51 (MAE 1.22 km/h), which indicates a moderate effect of the predictor variables on the model output and is, not surprisingly, higher than that of the linear model. Figure 1 shows the model’s relative feature importances, with elevation (0.85) being overwhelmingly relevant, ahead of the country where the race was held (0.07), gender (0.02), age group (0.02), country of origin of the runner (0.02) and course characteristics (0.02).

Fig. 1 XGB model features relative importances.

Combined prediction distributions and target plots

The PDP library allows a more detailed look at the associations between predictors and target. The so-called target plots represent a descriptive visualization of the 100-km race dataset by predictor and show the group sizes and average running speeds. The prediction plots use boxplots to show the distribution of the XG Boost model output (the predicted running speed) by predictor. For gender (Fig. 2), age group (Fig. 3), course type (Fig. 4) and elevation (Fig. 5), all possible values are displayed. The charts show that men (7.42 km/h) were faster than women (6.68 km/h), that runners in age groups 35–39 years and 40–44 years were the fastest, and that running on track (9.32 km/h) was the fastest, ahead of road (8.11 km/h), trail (6.21 km/h) and mountain (5.74 km/h) running. Furthermore, flat running (8.85 km/h) was faster than running on a hilly course (6.57 km/h). Since the athlete (Fig. 6) and event (Fig. 7) countries have a very high cardinality, only the first 20 elements are shown. The bottom chart in each set shows the number of race records in each predictor group. The red line chart (in the middle) represents each group’s average race speed, while the boxplot at the top represents the predictive model output, with the 2nd quartile (median value) in the box label. In general, the results replicate those obtained in the descriptive analysis. However, the prediction charts show some peaks for Kazakhstan and Kenya in the athlete country chart and for Qatar, the Netherlands and Slovakia in the event country chart. Given the high sensitivity of ML models, it is always wise to weigh the relevance of any observation against the size of the specific group.

Fig. 2 Prediction distributions and target plots for gender.

Fig. 3 Prediction distributions and target plots for age group.

Fig. 4 Prediction distributions and target plots for the kind of race course.

Fig. 5 Prediction distributions and target plots for elevation.

Fig. 6 Prediction distributions and target plots for the origin of the athlete.

Fig. 7 Prediction distributions and target plots for the country where the events were held.

Type of race course and elevation by gender

Figure 8 shows the running speed by gender and elevation: men were always faster than women, and running on a flat course was faster than running on a hilly course for both genders. Figure 9 presents running speed by gender and race course characteristics. Men were always faster than women, and the fastest running speeds were achieved in track running, ahead of road, trail and mountain running.

Fig. 8 Running speed by gender regarding elevation.

Fig. 9 Running speed by gender regarding race course characteristics.

Discussion

The aims of this study were threefold. We wanted to know (i) the origin of the fastest runners and the countries where the fastest 100-km races are held, (ii) a potential influence of race characteristics on performance, and (iii) the relative relevance of the predictors or factors under study against individual athlete performance. The most important findings were that (i) elevation was the most important predictive variable, after discounting individual performance, which weighed between 3 and 4 times as much as the next factor, (ii) running on a track was the fastest, (iii) flat running was faster than running on a hilly course, (iv) the fastest athletes originated from African and, most notably, from East-European countries and (v) the fastest race courses were found in Africa, in the Middle East, and in Europe.

Change in elevation as the most important predictor

Our most important finding was that the model rated elevation as the most important variable, ahead of the country where the race was held. A potential explanation for the high influence of elevation could be anthropometry, where lower body mass might be helpful for ascents. A study investigating performance determinants in trail running races of different distances reported that body fat percentage was a predictor in a trail run of medium distance36. Furthermore, a loss in body mass during ultra-marathon trail running also seemed to be of importance for ultra-running performance37. Also, physiological aspects such as running economy might have a considerable influence on running races that cover changes in altitude38. Furthermore, alterations in neuromuscular function might have an impact on mountain ultra-marathon running39. Overall, more studies are needed to investigate the influence of changes in elevation on ultra-marathon performance.

Running on a track as the fastest race course

A further important finding was that running on a track was faster than road, trail and mountain running. Similar findings were reported for 72-h ultra-marathon running, where the fastest races are held on track, followed by road, and then trail40. We assume that running on a 400-m track is more efficient than road running or running on trails and in the mountains. Athletes can pace better in track running, with a consistent speed and a predictable pace based on running distance per time unit (e.g. km per minute)41. In contrast, in trail running, the pacing is more effort-based and must be adapted to the terrain42. Future studies might evaluate the fastest track races in 100-km ultra-marathon running.

Race location as the second important predictor

A further important finding was that the country where the race was held (event location) was the second most important variable. Africa, the Middle East, and Europe have the fastest 100-km races, with Botswana (23 runners with 47 race finishes), Qatar (143 runners with 319 finishes), Belarus (82 runners with 137 finishes), Jordan (17 runners with 17 finishes), and Montenegro (17 runners with 26 finishes) as the top five.

A very likely explanation is that only a few runners competed in these locations and the performance density was very high. Regarding Botswana, the ‘Salt Pans Ultra Marathon’ is a 100-km race with the fastest runner finishing in 9:11:40 h:min:s and the slowest in 19:08:03 h:min:s, a difference of around 10 h (https://saltpansultra.com/). For Qatar, the explanation is that the IAU 100 km World Championship open race, with 143 finishers, was held in Doha in 2014 (https://worldathletics.org/news/report/max-king-ellie-greenwood-iau-100km-world-cham). Regarding the race times, 13 men finished below 7 h in the World Championship. The best athletes likely competed in the World Championship and achieved one of the best performances of their careers43. However, a total of 319 finishes were recorded in 100-km ultra-marathons held in Qatar, so the World Championship was only one opportunity to achieve a fast race time. These results should be considered in light of our limitations, which include the lack of information on environmental factors and of additional geographical information that could provide a qualitative explanation.
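The arithmetic behind these comparisons, converting a finish time in h:min:s into hours and an average speed and taking the spread between the fastest and the slowest finisher, can be sketched as follows. The helper names are hypothetical; the two finish times are the Botswana values quoted above.

```python
RACE_KM = 100.0


def parse_hms(text):
    """Convert a finish time 'h:min:s' into decimal hours."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60 + seconds / 3600


def speed_kmh(finish_time):
    """Average running speed over 100 km for a finish time 'h:min:s'."""
    return RACE_KM / parse_hms(finish_time)


fastest, slowest = "9:11:40", "19:08:03"
spread_hours = parse_hms(slowest) - parse_hms(fastest)
print(round(spread_hours, 2), round(speed_kmh(fastest), 2), round(speed_kmh(slowest), 2))
```

The spread of just under 10 h matches the "around 10 h" stated in the text. The same calculation applies to the fastest and slowest times quoted for the other countries.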

Apart from Qatar and Belarus, fast 100-km races were held in Jordan, the Netherlands, Montenegro, Slovakia, Egypt, Lithuania, and Croatia. It was common for these countries to hold Championships at the national or international level. In 2019, the IAU 100 km Asia and Oceania Championships were held in Jordan (https://iau-ultramarathon.org). In the Netherlands, the ‘Run Winschoten’ has been held since 1976 (www.runwinschoten.nl). More than 4000 athletes have competed in that event, and by the end of 2022, 235 runners had achieved a race time below 7 h. The race has hosted several Championships, such as the National, the European, and the World Championships. In Montenegro, the ‘Podgorica 100 km Ultramarathon’ was held from 2005 to 2007 as a regular race; however, it was not a Championship. Moreover, Slovakia has a long tradition of running 100-km ultra-marathons. The ‘Medzinárodný cestný beh “Družba” 100 km Košice’ was held from 1974 until 1992 with 476 finishers, and the ‘100 km Self Transcendence Run Nitra’ was held from 2009 to 2019 (https://cs.srichinmoyraces.org/). In Croatia, the ‘100 km Varazdin’ was held from 1979 to 1988, and the ‘Polojska ultra 100 km’ has been held since 2018. Furthermore, in 2018, the 30th IAU 100 km World Championship was held in Sveti Martin with 221 finishers, where 14 runners finished below 7 h. In Lithuania, for example, a 100-km race was held in 2023, with Aleksandr Sorokin winning in 06:05:35 h:min:s (www.ultramarathon.org/). In Egypt, the ‘100 km Pharaonic Race’ is iconic and one of the oldest races in the country (www.sportseventsegypt.com/event/100-km-pharaonic-race/). Future studies might investigate more deeply the influence of the particular races held in these countries.

The fastest runners are from African and East-European countries

We also found that the fastest athletes originated from African and East-European countries, with Swaziland (5 runners with 18 finished races), Botswana (11 runners with 51 finished races), Belarus (118 runners with 396 finished races), Kazakhstan (11 runners with 29 races), and Cape Verde (9 runners with 48 races) as the top five countries. We could, therefore, not confirm recent findings that Russians were the fastest 100-km ultra-marathoners. It is most likely more appropriate to say that these six countries had lower average race times, but do not necessarily have the best or fastest athletes. We assume that the low numbers of athletes reflect a highly selected population and that the high average running speed is due to the small difference between the slowest and the fastest race times.

Regarding Swaziland with 18 finishers, we assume that Swaziland only participates with its best athletes. Regarding Russia (5,180 finishers), Germany (74,162 finishers), France (113,986 finishers) or Japan (205,908 finishers), the average running speed is most likely compromised by the large number of slower participants. Interestingly, only a very small number of athletes originated from these six fastest countries, considering the sample of more than 850,000 race records. We assume that these runners have a very high density in performance. Regarding Swaziland, the fastest male runner achieved 7:05:17 h:min:s while the slowest man had 9:05:35 h:min:s, a difference of 2 h between the fastest and the slowest runner (www.ultramarathon.org/). This could also be because there is no running boom, recreational activity, or money to stimulate the general public to travel and participate in races. Most runners competed in races held in Europe and Japan (www.ultramarathon.org/). Similarly, the fastest male runner from Botswana achieved 7:20:12 h:min:s while the slowest man finished in 9:59:08 h:min:s, a difference of less than 3 h between the fastest and the slowest (www.ultramarathon.org/). Again, they competed mainly in European races held in the Netherlands, Great Britain, Italy, Spain, Belgium, and Russia (www.ultramarathon.org/). Runners from Belarus were also among the fastest, and the second fastest races were held in that country. In contrast to the runners from Swaziland and Botswana, the fastest runner from Belarus finished a 100-km ultra-marathon in 6:33:56 h:min:s and the slowest in 22:55:12 h:min:s, a difference of more than 16 h between the fastest and the slowest (www.ultramarathon.org/). In Belarus, the ‘All-Union 100 km Run Grodno’ was held from 1988 to 1992 as a road-based race with 66 finishers (www.ultramarathon.org/). In 1995, the ‘100 km indoor Minsk’ was held with 6 finishers (www.ultramarathon.org/). From 2000 to 2014, the ‘Molodechno Int. 100 km ultramarathon indoor’ was held in Maladsetschna, again with 66 finishers (www.ultramarathon.org/). Overall, a total of 138 runners were recorded, and 132 athletes competed indoors.

We might assume that a selected sample of runners competed in a 100-km ultra-marathon on an indoor track and that, most probably, a high percentage of the runners were from Belarus. Some studies found that local athletes preferably competed in their own country43,44. Furthermore, in an indoor race, the influence of environmental conditions was eliminated. It has been shown that a high temperature or a high humidity could influence long-distance running speed and fluid loss more than a lower temperature and a lower humidity45. Apart from Belarus, fast 100-km ultra-marathoners originated from Kazakhstan, Cape Verde, and Lithuania. Regarding the density in performance, the fastest man from Kazakhstan finished a 100-km race in 6:31:41 h:min:s whereas the slowest needed 12:21:40 h:min:s, a difference of less than 6 h between the fastest and the slowest (www.ultramarathon.org/). For Cape Verde, the fastest man finished in 7:10:41 h:min:s and the slowest male runner in 12:17:32 h:min:s, a difference of around 5 h (www.ultramarathon.org/). Lithuania, with a higher number of runners, also showed a larger difference between the fastest and the slowest runner, where the fastest man finished in 6:05:35 h:min:s and the slowest in 18:11:52 h:min:s, a difference of more than 12 h (www.ultramarathon.org/). It is important to note that Belarus, Kazakhstan, and Lithuania were part of the former Soviet Union (www.history.com/topics/european-history/history-of-the-soviet-union). In a previous study, Russians were the fastest 100-km ultra-marathoners29. We should be aware that in 1991, Russia emerged from the dissolution of the Soviet Union as the independent Russian Federation, as did the other countries that had composed the Soviet Union, which may confound our findings.

The age of peak performance

We also found that athletes aged 35–44 years were the fastest, which agreed with previous research46,47. A study investigating the age-related performance decline in 100-km ultra-marathoners competing in a single race (100 km Biel, Switzerland) reported that the best 100-km running times were observed in a wider age range, 30–49 years for men and 30–54 years for women48. Another study investigating a large sample (148,017 finishes with 18,998 women and 129,019 men) and a longer time frame (1960–2012) showed that the age of the fastest female and male 100-km ultra-marathoners remained unchanged at 35 years46. In addition, our findings about the age of peak performance were in line with an analysis of the World Athletics (formerly known as the International Association of Athletics Federations) database (1999–2015), which reported an age of 35.9 years in men and 36.6 years in women2.

Analytical considerations and limitations

A study from 2014 concluded that the fastest runners originated from Japan31. The result was based on the ten fastest race times by nationality in races held between 1998 and 201131. A study from 2018 found that runners from Russia were the fastest in 100-km running29. In that analysis, races held between 1959 and 2016 were considered, and finishes with race times of more than 14 h were removed (i.e. truncated data set)29. In the present analysis, data between 1892 and 2022 were considered (longer time frame) and no data were excluded, except incomplete data. The results summarized the observations across the descriptive charts (i.e., target plots) and the model interpretability charts (PDP and prediction plots). Some countries with small sample sizes (fewer than 10 records) in the 100-km sample within the 1892–2022 period but with faster runners may have been excluded from the analysis due to the methodology used. Athletes could change their country of residence/nationality over the years, which was not considered in the present study. Similarly, qualitative information regarding the event location was not available. This was an important limitation because it impaired the generalization of the findings regarding the environmental characteristics that had a positive impact on athletes’ performance. A further limitation was that we have not accounted for repeated measures, since some athletes might have competed several times in the same or in another event. We had to include the change in elevation as a categorical variable (i.e. hilly versus flat) since exact data about changes in elevation in meters were not provided on the website. Other variables such as temperature, humidity, altitude, wind, etc. were also not available. These variables might also have an influence on race performance. Another limitation is that more than one 100-km ultra-marathon could have been held in one country. Future studies need to analyze each single 100-km race.

On the other hand, the strength of the present study was its novel methodological approach, since it was the first time that a machine learning model was used to predict 100-km running performance from age, gender, country of origin, and country of event. Furthermore, considering the popularity of 100-km ultra-marathon races, our findings provide practical information for professionals working with ultra-marathon runners to set optimal performance goals depending on the event country. For athletes and coaches, these findings provide insight into ultra-marathon running and the aspects relevant to performance. Athletes and coaches can now select a race with performance in mind by choosing flat race courses with little change in elevation, preferring track races over road, trail or mountain races, and considering races in particular countries that may be more beneficial for a faster race performance. Future studies should consider factors such as training culture, access to coaching, genetic predisposition, and socioeconomic influences on athletic development.

Conclusion

In summary, elevation is the most important variable in 100-km running, ahead of the country where the race was held, gender, age group, country of origin of the runner and course characteristics. Running on a track was the fastest, ahead of road, trail and mountain running. Flat running was faster than running on a hilly course. Common to the fastest 100-km race courses was the fact that they were mainly indoor races and/or Championships. The fastest runners originated mainly from former republics of the dissolved Soviet Union. Africa, the Middle East, and Europe hold the fastest 100-km races. Future studies should consider investigating the culture of long-distance endurance events in these countries and explore how the natural environment can be used as an important characteristic of training, providing a safe, supportive context for training and participation in competitions.