Abstract
Goodness-of-fit (GOF) tests for selecting probability distributions of climatic variables are pervasive in the statistical literature. However, a combined approach using multiple tests remains underutilized despite evidence supporting its improved precision. Increasingly erratic climatic conditions pose severe threats to economic stability, necessitating robust statistical methods for climate modeling. To address this need, this study evaluates probability distributions for climatic variables using a comprehensive approach that combines multiple tests. A scoring system ranked each distribution’s performance across tests, with a composite score indicating the best fit. To assess robustness, a sensitivity analysis on the best-performing distribution examined the influence of partitioning the data into segments of different lengths (block sizes). The results show that the generalized extreme value (GEV) distribution consistently outperforms the other candidate distributions for both temperature and rainfall data across multiple metrics. Extended block sizes capture long-term climatic patterns but introduce greater uncertainty due to fewer data points, while shorter block sizes tend to overfit. Intermediate block sizes provide a balance, producing reliable parameter estimates and stable return levels. These findings underscore the importance of selecting suitable block sizes and confirm the robustness of the GEV distribution for climate modeling. The study contributes to improved methodologies for risk assessment and climate adaptation strategies, particularly in regions such as Kenya.
Introduction
Kenya’s increasing exposure to the effects of climate variability is a pressing issue, especially with erratic rainfall patterns and rising high-temperature patterns significantly affecting its key sectors. Agriculture, a backbone of Kenya’s economy1,2, is particularly vulnerable, as unpredictable weather disrupts planting and harvesting cycles, reduces yields, and exacerbates food insecurity. Infrastructure, too, faces challenges, with extreme weather events such as floods and droughts causing damage to roads, bridges, and other critical systems. The cumulative effect of these climate-induced challenges undermines the country’s overall economic stability, highlighting the urgent need for robust mitigation and adaptation strategies.
The effects of climate variability are particularly evident in regions like Marsabit, where prolonged droughts and heavy rainfall lead to severe consequences. Droughts reduce water availability, hinder crop growth, and limit pastures, leading to crop failures and livestock losses, exacerbating food insecurity3,4,5. In contrast, intense rainfall causes soil erosion, farmland flooding, and infrastructure damage, imposing significant financial burdens on the government for repairs and diverting resources from development projects.
These recurring events underscore the urgent need for sustainable strategies, such as climate-resilient agricultural practices, improved water management systems, and robust infrastructure design. Investments in early warning systems and community-based adaptation measures are also critical to mitigating the impacts on vulnerable populations.
A deeper understanding of the variability of climate variables, such as rainfall and temperature, can be achieved through probability distributions, which provide valuable tools to analyze climate patterns6. Globally, researchers have identified region- and time-dependent distributions for these variables, with models such as GEV, Gamma, log-normal, and Weibull frequently recommended for climatic data. Notable studies include those by Sharma and Singh7, Dzupire et al.8, Athulya and James9, Ozonur et al.10, Ximenes et al.11, Hussain et al.12, Singirankabo and Iyamuremye13 and Agbonaye and Izinyon14. For example, Ximenes et al.11 found Gamma and Weibull to be optimal for monthly precipitation in Northeast Brazil, while Douka and Karacostas15 identified GEV and log-normal as suitable for extreme precipitation in Thessaloniki, Greece. The differences between the distributions reported by Ximenes et al.11 and Douka and Karacostas15 can be attributed to the different geographical locations; Greece is located at \((40^\circ 37' \text{N}, 22^\circ 95' \text{E})\) and Northeast Brazil at \((34^\circ 47' \text{N}, 48^\circ 45' \text{W})\). The two studies also covered different periods; the data for Greece comprised monthly precipitation records from 1988 to 2017, whereas the study on Northeast Brazil used hourly rainfall data from 1947 to 2003. These studies and a summary in Table 1 demonstrate the importance of selecting appropriate probability distributions for accurate climate modeling.
Extensive research has also been conducted to identify the best-fitting probability distributions for temperature data. Key studies include those by Athulya and James9, Dzupire et al.8, Hasan22, Hossain23, Hussain et al.12 and Ozonur et al.10. These studies have explored various distributions, including the normal, log-normal, Gamma, and Weibull distributions. For instance, Hussain et al.12 identified the Generalized Pareto (GP), Extreme Value (EV), and GEV models as suitable for modeling temperature data. Similarly, Hasan22 employed ten continuous distributions, including the exponential, Gamma, Log-Gamma, Beta, normal, log-normal, Erlang, power function, Rayleigh, and Weibull distributions, with the Beta distribution emerging as the best fit for the temperature data.
This study aims to identify the most appropriate probability distributions for modeling monthly maximum temperatures and total monthly rainfall in Kenya. The analysis is based on a comprehensive data set covering the last 73 years, capturing the impacts of recent climatic changes. By incorporating these extensive and up-to-date data, the study ensures that the models account for evolving climate patterns. For instance, accurate descriptions of climatic data provide a better understanding of the probability distributions of maximum temperatures and total rainfall, which helps capture the frequency and intensity of climatic events, such as heat waves and heavy downpours. These models also enhance predictive capabilities by leveraging historical trends and recent shifts, improving forecasting accuracy and facilitating better preparation for future climatic scenarios. Additionally, by identifying the underlying distributions, the study supports data-driven decision-making, providing a critical foundation for risk assessment and resource allocation in agriculture, water management, and disaster response sectors.
The study makes a significant contribution to modeling climatic events through three key focus areas. First, it provides a comprehensive theoretical framework for understanding and applying statistical distributions in hydrology and climate studies. The framework offers precise definitions of commonly used distributions, facilitating their identification and application to various climatic datasets. It also includes robust parameter estimation methodologies that ensure accurate modeling of climatic variables. Furthermore, the study outlines strategies for selecting extreme values tailored to specific extreme value distributions, enabling the precise focus on significant climatic events.
Second, the research emphasizes the application of GOF tests to identify the most suitable probability distributions for climatic data. Detailed discussions on the implementation of GOF tests enhance the accuracy and reliability of the models. This methodological rigor improves the alignment of models with observed data and bolsters their credibility for practical applications in risk assessment and decision-making.
Lastly, the study emphasizes the significance of temporal pattern analysis through block size selection, a crucial factor in statistical modeling that directly affects how well temporal patterns in climatic data are captured. We conducted a sensitivity analysis to assess the impact of varying block sizes on the GEV distribution. This analysis combined graphical methods, GOF tests, return level estimates for various periods, and confidence intervals. By examining the effect of block size on model performance and on forecasts of extremes, the analysis provides valuable insights into the stability and reliability of the GEV distribution across varying temporal resolutions.
The paper is structured as follows. The “Methods” section provides a detailed description of the data, the procedure for selecting candidate probability distributions, parameter estimation methods, and the implementation of GOF tests, including the combined approach of multiple GOF tests. The “Results and discussion” section presents summary statistics, results from the selection of candidate distributions, findings from the GOF tests, and insights from the sensitivity analysis. Finally, the “Conclusion” section summarizes the key findings and their implications for climate modeling and risk assessment.
Methods
Data
The monthly maximum temperature (Tmax) and total precipitation (Prep) data for Kenya, covering the period 1950–2022, were sourced from the World Bank Climate Change Knowledge Portal24. The precipitation data (Prep), measured in millimeters, represents the total accumulation of monthly rainfall. This provides a comprehensive measure of rainfall intensity and distribution across different months. The temperature data (Tmax), recorded in degrees Celsius, captures the highest daily maximum temperature observed each month, offering valuable insights into extreme temperature events.
Selection of candidate probability distributions
A review of existing literature identified probability distributions commonly applied in hydrological studies: exponential, Gamma, Weibull, log-normal, logistic, Gumbel, GPD, and GEV, as referenced by7,8,9,10,11,12,14,16,17,18,20. Similarly, for temperature data, these distributions, in addition to a normal distribution, were identified as suitable candidates, supported by findings from22 and other related studies. Table 2 describes each probability distribution function. These distributions were selected due to their suitability in modeling skewed, heavy-tailed, or extreme data characteristics commonly found in climatic datasets. The Cullen and Frey graph25 was used to preliminarily assess the shape characteristics of the data, guiding the selection of appropriate distributions for further analysis.
Parameter estimation
In statistical modeling, parameter estimation is essential due to the typically unknown nature of most model parameters. Commonly employed methods include the Method of Moments, L-moments, Maximum Likelihood Estimation (MLE), and LH-moments, as noted in studies by Al Mamoon and Rahman6 and Haddad and Rahman26. In this paper, we employ the MLE method for parameter estimation across the analyzed distributions, as it is one of the most widely applied and robust methods. MLE is favored for its consistency and efficiency, particularly in large samples, as it maximizes the likelihood of the observed data and often yields more reliable results compared to other methods such as Moments, L-moments, and LH-moments, particularly in terms of asymptotic properties. Research, including foundational studies by Fisher27, Zong28 and Naghettini29, has demonstrated that MLE’s variance and bias are comparatively low, thereby enhancing its suitability across a broad range of distributions. These qualities render MLE exceptionally reliable for environmental datasets, including temperature and rainfall measurements, where precision and robustness are critical.
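As a minimal illustration of the MLE principle described above (a sketch, not the fitting code used in the study), the example below takes two of the candidate distributions whose maximum likelihood estimates have closed forms, the exponential and the log-normal, and evaluates the log-likelihood at those estimates. The sample values and function names are hypothetical. Distributions such as the GEV have no closed-form MLE and would require numerical optimization instead.

```python
import math

def fit_exponential(x):
    """Closed-form MLE of the exponential rate: 1 / sample mean."""
    return len(x) / sum(x)

def loglik_exponential(x, rate):
    """Log-likelihood of an exponential(rate) model for the sample x."""
    return len(x) * math.log(rate) - rate * sum(x)

def fit_lognormal(x):
    """Closed-form MLE for the log-normal: mean and (biased) s.d. of log(x)."""
    logs = [math.log(v) for v in x]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def loglik_lognormal(x, mu, sigma):
    """Log-likelihood of a log-normal(mu, sigma) model for the sample x."""
    n = len(x)
    logs = [math.log(v) for v in x]
    return (-sum(logs) - n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi)
            - sum((v - mu) ** 2 for v in logs) / (2 * sigma ** 2))

# Hypothetical monthly rainfall totals (mm), for illustration only.
sample = [12.4, 28.7, 35.2, 51.3, 63.0, 80.1, 95.5, 140.2]

rate_hat = fit_exponential(sample)
mu_hat, sigma_hat = fit_lognormal(sample)
ll_exp = loglik_exponential(sample, rate_hat)
ll_lnorm = loglik_lognormal(sample, mu_hat, sigma_hat)
print(f"exponential: rate={rate_hat:.4f}, logLik={ll_exp:.2f}")
print(f"log-normal : mu={mu_hat:.4f}, sigma={sigma_hat:.4f}, logLik={ll_lnorm:.2f}")
```

Moving either set of estimates away from its fitted value can only lower the corresponding log-likelihood, which is the defining property of the MLE.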
Goodness of fit tests
The suitability of each probability distribution was assessed using a suite of GOF tests, including the Kolmogorov-Smirnov (KS), Anderson-Darling (AD), Cramér-von Mises (CvM), and Chi-Square tests. These tests evaluate the agreement between the theoretical and empirical distributions, with the KS test focusing on overall distributional fit15,30, AD and CvM emphasizing tail behavior15,26,31,32,33, and Chi-Square examining frequency alignment19. Additional evaluation was performed using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to balance model complexity and fit10,12,22,26, along with the Root Mean Square Error (RMSE) to quantify predictive accuracy14.
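The sketch below shows how some of these quantities can be computed from first principles; it is an illustration under stated assumptions rather than the study's code. The function names are hypothetical, and the generic `rmse` helper assumes paired observed and fitted values, since the pairing used in the study is not specified here. The log-likelihood of -1446.15 is not a reported value: it is back-calculated from the reported GEV temperature AIC of 2,898.30 under the assumption of three fitted parameters.

```python
import math

def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov statistic: largest gap between the empirical
    CDF of `data` and the theoretical `cdf` (a callable)."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * loglik

def rmse(observed, fitted):
    """Root mean square error between paired observed and fitted values."""
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed))

# Toy KS check against a uniform(0, 1) CDF (values are illustrative only).
toy = [0.1, 0.4, 0.7]
d_uniform = ks_statistic(toy, lambda x: min(max(x, 0.0), 1.0))

# Assumed log-likelihood, back-calculated from AIC = 2898.30 with k = 3.
loglik_gev = -1446.15
aic_gev = aic(loglik_gev, 3)
bic_gev = bic(loglik_gev, 3, 876)   # 876 monthly observations
print(d_uniform, aic_gev, round(bic_gev, 2))
```

Under these assumptions the BIC comes out at about 2,912.63, in line with the value reported for the GEV fit to the temperature data, which suggests that the AIC and BIC follow the standard definitions.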
Comprehensive scoring methodology
The literature indicates a lack of suitable GOF tests designed to effectively distinguish between empirical and theoretical distributions34. Numerous studies have shown that the best-fit probability distribution can vary significantly between different regions, even for the same variable32. In response to these challenges, we adopt a comprehensive scoring methodology, as outlined in previous studies14,17,22,35. This method employs an integrated scoring approach that incorporates multiple GOF tests, information criteria, and graphical analyses to ensure a robust selection of the optimal probability distribution model. Each distribution model is subjected to several GOF tests, with a scoring system applied whereby the best-performing model in each test receives the top rank (rank 1). To enhance the rigor of the selection process, each model’s rank is determined independently for each GOF test and then summed across all tests to produce a composite score, so that the lowest total indicates the best overall fit. For graphical assessments, rankings are informed by visual inspection of density plots and quantile-quantile (Q–Q) plots, providing additional insight into the best-fitting model.
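A minimal sketch of such a rank-aggregation scheme is given below. It assumes that every metric is oriented so that a lower value means a better fit (test statistics, AIC, BIC, RMSE), that rank 1 is best, and that ties are broken by input order; the distribution labels and metric values are hypothetical placeholders, not results from this study. Ranks assigned by visual inspection of the plots could be added as one more entry in the same structure.

```python
def composite_ranks(metrics):
    """Sum per-metric ranks for each distribution.

    metrics: {metric_name: {distribution: value}}, lower value = better fit.
    Returns {distribution: total rank}; the lowest total is the best fit.
    """
    totals = {}
    for values in metrics.values():
        # Sort distributions from best (smallest value) to worst.
        ordered = sorted(values, key=values.get)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    return totals

# Hypothetical metric values for three placeholder distributions.
metrics = {
    "KS":  {"dist_A": 0.03, "dist_B": 0.05, "dist_C": 0.04},
    "AD":  {"dist_A": 0.90, "dist_B": 2.10, "dist_C": 1.50},
    "AIC": {"dist_A": 2900.0, "dist_B": 2950.0, "dist_C": 2890.0},
}

totals = composite_ranks(metrics)
best = min(totals, key=totals.get)
print(totals, "best:", best)
```

Here `dist_A` wins overall even though `dist_C` has the lowest AIC, which illustrates why aggregating over several criteria can give a different answer from any single test.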
Results and discussion
This section presents the statistical results of the analysis. The analysis assumes that the observations are independent and identically distributed (iid). To verify these assumptions, we tested for stationarity using the Augmented Dickey-Fuller (ADF) test, randomness using the Wald-Wolfowitz runs test, and independence using the Ljung-Box test. All tests were performed at the \(5\%\) significance level. The results indicated that the data were stationary and random but exhibited autocorrelation; therefore, the data were aggregated using block analysis.
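Of the three checks, the Wald-Wolfowitz runs test is simple enough to sketch directly. The version below is one common variant and an assumption about the setup, not the study's implementation: it dichotomizes the series about its median, drops values equal to the median, and uses the normal approximation for a two-sided p-value. The function name and the toy series are hypothetical.

```python
import math
from statistics import median

def runs_test(series):
    """Wald-Wolfowitz runs test about the median (normal approximation).

    Returns (number of runs, z statistic, two-sided p-value).
    """
    med = median(series)
    signs = [v > med for v in series if v != med]  # drop values equal to the median
    n1 = sum(signs)                # observations above the median
    n2 = len(signs) - n1           # observations below the median
    if n1 == 0 or n2 == 0:
        raise ValueError("series must have values on both sides of the median")
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return runs, z, p

# A strictly alternating toy series has far more runs than chance would give.
alternating = [1, 2, 1, 2, 1, 2, 1, 2]
n_runs, z_stat, p_value = runs_test(alternating)
print(n_runs, round(z_stat, 3), round(p_value, 4))
```

A p-value above 0.05 would be read as no evidence against randomness at the significance level used in this section.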
Summary statistics
Table 3 shows the descriptive statistics for the monthly maximum temperature and total monthly rainfall for Kenya.
The maximum temperature (Tmax) for 876 observations has an average of \(26.23 ^\circ C\) with low variability (standard deviation = 1.27) and a range from \(23.16 ^\circ C\) to \(29.97 ^\circ C\). The interquartile range \(25.32 ^\circ C\) to \(27.15 ^\circ C\) highlights a concentration around the median \(26.23 ^\circ C\), with a near-symmetrical distribution (skewness = 0.12) and a relatively flat shape (kurtosis = 2.43). The findings resonate with previous studies in1,2, which indicate that while temperature variability at the national level tends to be low due to data aggregation, an increase in temperature has been observed in most regions across the country.
In contrast, total rainfall (Prep) exhibits much higher variability, with a mean of 63.97 mm and a standard deviation of 42.72 mm, ranging from 2.46 mm to 280.32 mm. This wide range reflects the variability and extreme nature of rainfall. The quartiles (q25 = 35.90, q75 = 81.88) and a median of 50.90, which lies well below the mean, indicate a right-skewed distribution (skewness = 1.46), while a kurtosis above 3 (5.43) points to heavy tails, signifying extreme events. These findings also align with the evidence in1,2.
Choice of candidate distributions
For the temperature data in Fig. 1a, the Cullen and Frey graph shows that the distribution approximates the normal region with a slight platykurtic shape, identifying the normal, uniform, log-normal, Gamma, Weibull, and logistic distributions as potential candidates. Studies, such as12, have shown that extreme value distributions are suitable for modeling temperature data; therefore, these distributions were also considered potential candidates. In the rainfall data in Fig. 1b, the distribution exhibits positive skewness and high kurtosis, suggesting alignment with distributions such as log-normal, Gamma, Weibull, and exponential. Given the presence of extreme values, models that account for extreme behavior, specifically the GPD and GEV distributions, were also included in the analysis.
Model fitting was conducted using MLE for parameter estimation. For extreme value distributions, the Block Maxima (BM) and Peaks Over Threshold (POT) approaches were used to obtain the block maxima and threshold exceedances required to fit the GEV and GPD distributions, respectively. The BM approach is widely used in extreme value analysis to capture maximum events within defined time intervals, such as annual maxima, and it is commonly applied for environmental and climate data30,36,37. For the POT method, which is well-suited to modeling excesses over a specified threshold, the Mean Residual Life (MRL) plot was generated as shown in Fig. 2, and visual inspection was used to determine an appropriate threshold for each variable13,37. The blue curve in Fig. 2 represents the observed mean excess values \(e(u) = E(x_i - u \mid x_i > u)\), the red lines denote the upper and lower \(95\%\) confidence limits, and the threshold \(u\) defines the limit for identifying extreme events \((x_i: x_i > u)\)38. In Fig. 2a, for rainfall, a threshold in the range of 50 to 150 is suitable, as it provides a stable mean excess with narrower confidence intervals. This indicates that values above this threshold exhibit behavior suitable for modeling with a GPD. For temperature, the MRL plot in Fig. 2b did not suggest a clear threshold; an initial threshold of around \(u=25\) was therefore adopted, where the confidence intervals remain relatively narrow, indicating reliable estimates. However, after approximately 28, the confidence intervals begin to widen slightly, indicating increased uncertainty in the mean excess values at higher thresholds. The GPD parameters were estimated based on observations exceeding this threshold.
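The mean excess values plotted in an MRL plot can be computed as sketched below. This is an illustrative sketch only: the function names, the rainfall values, and the grid of thresholds are hypothetical, and the confidence limits shown in the figure are not reproduced.

```python
def mean_excess(data, threshold):
    """Average amount by which observations above `threshold` exceed it."""
    excesses = [x - threshold for x in data if x > threshold]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)

def mean_residual_life(data, thresholds):
    """One (u, e(u), number of exceedances) row per candidate threshold u."""
    return [(u, mean_excess(data, u), sum(x > u for x in data)) for u in thresholds]

# Hypothetical monthly rainfall totals (mm), for illustration only.
rain = [12.0, 35.0, 48.0, 51.0, 66.0, 82.0, 97.0, 120.0, 155.0, 210.0]

mrl_rows = mean_residual_life(rain, thresholds=[50, 75, 100, 125, 150])
for u, e_u, n_exc in mrl_rows:
    print(f"u={u:>4}  e(u)={e_u:7.2f}  exceedances={n_exc}")
```

A suitable threshold is typically one above which e(u) is roughly linear in u while enough exceedances remain for stable GPD estimates, which is the trade-off the visual inspection described above is meant to balance.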
Graphical assessments and GOF tests results
Graphical assessments
Density and Q–Q plots were generated to compare the observed data with several fitted theoretical distributions. For temperature data, the density plot in Fig. 3 shows that the GEV, Gamma, and log-normal distributions provide the best fit, capturing both the central peak and tail behavior. The normal, Weibull, and logistic distributions also perform reasonably well but exhibit slight deviations in the tails. In contrast, the uniform distribution shows significant discrepancies, particularly in the extremes, suggesting its unsuitability for modeling extreme temperature events. The Q–Q plots in Fig. 4 reveal that most distributions demonstrate deviations in the tails, with the GEV and normal distributions showing the closest adherence to the theoretical quantiles. Among the fitted distributions, the GEV, normal, log-normal, and Gamma distributions provide the best fit in that order, followed by the logistic and Weibull distributions, which exhibit moderate deviations. In contrast, the GPD and uniform distributions exhibit a substantial lack of fit, particularly at the lower and upper tails. This visual approach to identifying the best-fitting distribution is inherently subjective and, therefore, cannot be relied upon solely. To enhance robustness, these results were complemented with findings from other GOF tests to improve the reliability of distribution selection.
Similarly, for the rainfall data in Fig. 5, the GEV, Gamma, and log-normal distributions show the closest alignment with the observed data, effectively capturing the shape and spread of the distribution. The Weibull distribution provides a moderate fit, performing well in the central range but diverging in the tails. In contrast, the exponential and GPD distributions exhibit substantial deviations, failing to represent the empirical distribution accurately, especially at the extremes. The Q–Q plots in Fig. 6 reinforce these findings, with the GEV and Gamma distributions displaying the best adherence to the theoretical quantile line, followed by the log-normal and Weibull distributions. The exponential and GPD distributions exhibit the weakest performance. These results are consistent with previous studies, such as21, which identified the GEV distribution as the most appropriate model for extreme rainfall events.
GOF tests
The GOF analysis in Table 4 (a) identifies the GEV distribution as the most suitable model for the maximum temperature data. The GEV distribution achieves the lowest KS (0.0297), AD (0.8890), and CvM (0.1335) statistics, accompanied by high p-values (0.4206, 0.4211, and 0.4442), indicating a strong alignment with the observed data. It also produces the lowest Chi-square statistic (3.5969, p = 0.9637) and achieves superior performance in terms of AIC (2,898.30), BIC (2,912.63) and RMSE (1.5694), highlighting its precision and efficiency. Other distributions, such as the normal, log-normal, and Gamma, provide moderate fits, with non-significant GOF statistics but higher AIC and BIC values, along with RMSE values that reflect less accuracy compared to the GEV. Conversely, the Weibull, Uniform, Logistic, and GPD distributions exhibit poor performance, with high test statistics, low p-values, and significant deviations from the observed data. The Uniform and GPD distributions show extreme misalignment, as evidenced by infinite AD statistics, high Chi-square values, and elevated RMSE scores, confirming their unsuitability for modeling maximum temperature data.
For the rainfall data in Table 4 (b), the GEV distribution also emerges as the most robust model, as reflected in the highest p-values for the KS (0.3487), AD (0.2753), and CvM (0.2897) tests, indicating minimal deviation from the observed data. Furthermore, the GEV achieves among the lowest AIC (8,713.87) and BIC (8,728.19) values, highlighting its parsimony and suitability for modeling rainfall patterns. Its superior predictive accuracy is evident from the lowest RMSE value (58.86), reinforcing its reliability. For the Chi-square test, the log-normal distribution had the lowest statistic, indicating a better fit on that criterion. Yuan et al.17 reported a similar finding when they used Chi-square tests to evaluate the best fit for the frequency analysis of the annual maximum hourly precipitation. In contrast, the GPD and exponential distributions perform poorly, with significant p-values, high Chi-square statistics, and elevated RMSE values, indicating substantial deviation and limited applicability for modeling rainfall data.
A comprehensive scoring method was used to further evaluate the best-fitting distributions, with findings presented in Table 5. Analysis for temperature distributions in Table 5 (a) revealed that the GEV consistently outperformed the others, as observed in39, achieving the highest overall rank with the lowest total score of 17. This was supported by its superior performance in key tests, including the KS, AD, and CvM tests. The Gamma and log-normal distributions ranked second and third, respectively, demonstrating moderate fits across multiple metrics. However, distributions like Weibull, Uniform, Logistic, and GPD performed poorly, accumulating higher total scores and displaying suboptimal results in density plots and Q–Q plots.
For rainfall distributions, the ranking analysis in Table 5 (b) confirms that the GEV distribution again emerged as the top performer, ranking first with a total score of 16. These findings are supported by Agbonaye and Izinyon14, Al Mamoon and Rahman6, Alam et al.18, Coronado-Hernández et al.36, Fadhilah et al.21, Ghosh et al.40, Ng et al.35 and Yuan et al.17. Its strength was evident across most GOF tests, where it outperformed or closely matched the best-performing distributions in each category. The Gamma distribution ranked second, showcasing a strong overall fit with balanced performance across metrics. Log-normal followed in third place, excelling in certain tests but lagging in others, such as AIC and BIC. In contrast, the exponential and Weibull distributions demonstrated weaker fits, while the GPD distribution consistently ranked lowest.
Sensitivity analysis
To evaluate the robustness of the GEV distribution’s fit to the rainfall and temperature data, a sensitivity analysis was performed using various block sizes designed to capture diverse temporal patterns and extremes. Block size refers to the length of each of the independent groups of observations into which the series is divided38. According to Coles38, block sizes are often selected to capture a specific period. In this work, the block sizes included annual, seasonal, monthly, 5-year, 10-year, 12-month moving averages, 6-month intervals, and 4-month intervals. Annual blocks, where maximum values were extracted per year, followed the methodologies outlined in38,41. Seasonal blocks were based on quarterly aggregations, as indicated by41,42. Monthly blocks were used to capture monthly maxima, as discussed in42,43. For longer-term patterns, multi-year blocks of 5-year and 10-year intervals were established, consistent with approaches adopted in studies such as44. A 12-month moving average window assessed rolling maxima, highlighting shifts in trends. Event-based blocks focused on the most extreme events by isolating total rainfall above the 95th percentile following the techniques used in45. For intermediate seasonality, each year was divided into semi-annual blocks covering January–June and July–December, consistent with approaches used by42,43,46. Furthermore, a regional seasonal classification for Kenya was used to account for local climatic variations, with blocks corresponding to the “Hot and Dry”, “Long Rainy”, “Cool”, and “Short Rainy” seasons, building on the framework proposed by47. For each block length, maximum values were extracted and the GEV parameters were estimated and presented in Table 6.
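The extraction of block maxima for the fixed-length, non-overlapping block sizes can be sketched as follows. The sketch assumes a monthly series held as a plain list and drops any incomplete trailing block; the placeholder series and the helper name are hypothetical, and the moving-window, event-based, and regional seasonal blocks are not covered.

```python
def block_maxima(series, block_len):
    """Maximum of each consecutive, non-overlapping block of `block_len`
    observations; an incomplete trailing block is dropped."""
    n_blocks = len(series) // block_len
    return [max(series[i * block_len:(i + 1) * block_len]) for i in range(n_blocks)]

# Placeholder for a monthly series with 876 values (73 years x 12 months);
# real data would replace this list.
monthly = [float(i % 12) for i in range(876)]

# Block lengths expressed in months.
block_lengths = {"quarterly": 3, "semi-annual": 6, "annual": 12,
                 "5-year": 60, "10-year": 120}
counts = {name: len(block_maxima(monthly, months))
          for name, months in block_lengths.items()}
print(counts)
```

With 876 monthly values, the 5-year and 10-year blocks yield 14 and 7 maxima respectively, so longer blocks leave very few points for estimating the three GEV parameters.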
For both rainfall and temperature data, parameter estimates reveal notable differences between block sizes, particularly in the shape parameter, which defines tail behavior. For rainfall, annual, 5-year, and 10-year blocks exhibited negative shape parameters that were not significant at the \(5\%\) level, pointing to a Weibull class of distribution as reported in30 but with uncertainty in tail estimates for these broader temporal aggregations. In contrast, mid-range blocks, such as monthly, quarterly, event-based, and seasonal, yielded significant positive shape parameters, reflecting the heavy-tailed Fréchet class of distributions with well-defined extremal patterns. This is in agreement with Moccia et al.33, although the findings of Onwuegbuche et al.48 and Singirankabo et al.37 revealed that Gumbel is the optimal distribution. The location and scale parameters were consistently significant \((p < 0.05)\) across all block sizes, indicating reliable estimation of central tendency and variability. The event-based block for rainfall, with a high shape estimate (0.3974), suggested a heavier tail and a higher propensity for extreme rainfall events compared to other blocks. For temperature data, location and scale parameters were also consistently significant across all blocks, confirming stable estimates of central tendency and variability. However, the shape parameter was not significant for the 5-year, 10-year, and event-based models, indicating uncertainty in tail estimates, which is likely due to the limited number of data points or the irregular occurrence of extreme events. In contrast, the quarterly, monthly, and seasonal models produced significant shape parameters, suggesting that they provide more robust and reliable tail estimates for predicting rare and extreme values in both temperature and rainfall.
The model diagnostic tests in Table 7 reveal that the 10-year and 5-year blocks provide the best fit for both rainfall and temperature data, achieving the lowest AIC and BIC values (e.g., AIC = 74.406 and 146.985 for rainfall), indicating strong model parsimony and minimal information loss. These longer blocks effectively capture long-term extreme trends but rely on fewer data points (n = 7 and 14), which increase uncertainty in parameter estimates due to increased variances, as demonstrated by46. This finding aligns with studies by38,41, which emphasize the effectiveness of larger blocks in capturing long-term climatic trends by averaging out short-term fluctuations, thereby focusing on extreme patterns. Event-based and annual blocks also perform well for rainfall, with low AIC and BIC values, reflecting their stability in representing extreme events with adequate data, as supported by42. In contrast, higher-frequency blocks, such as monthly and 12-month moving average models, exhibit much higher AIC and BIC values for both rainfall and temperature, suggesting potential overfitting and inefficiency in capturing extreme patterns, a limitation also noted by43. Mid-range blocks, including quarterly, semi-annual, and seasonal, achieve moderate AIC and BIC values for both datasets, offering a balanced approach that captures seasonal variability while maintaining sufficient stability for reliable parameter estimation. This perspective is supported by studies such as15,42,46, which highlight the value of intermediate temporal scales in balancing the trade-offs between long-term trend analysis and sufficient data representation.
In addition, we computed the return levels for different return periods to determine how various models estimate the extremes. The return level represents the magnitude of an event expected to be equaled or exceeded, on average, once within a specified return period38,48. The findings in Fig. 7 for temperature and rainfall data reveal distinct patterns across models when estimating extremes at various return periods. For temperature in Fig. 7a, the 10-year and 5-year models consistently produce the highest return levels, maintaining stability across increasing return periods as observed in48, indicating their robustness in estimating extreme values over longer intervals. In contrast, models with finer resolutions, such as monthly and 12-month moving averages, yield lower return levels with modest increases over time, suggesting a limited capacity to capture rare extremes. The quarterly and semi-annual models show moderate return levels, providing a balanced estimation that captures both seasonal variability and long-term trends. For rainfall in Fig. 7b, a similar pattern emerges, with the 10-year, 5-year, and seasonal models achieving the highest and most stable return levels, while finer models like monthly and 12-month moving averages display lower return levels and less pronounced growth across return periods. The event-based model exhibits high initial return levels but shows a plateau at more extended periods, indicating potential limitations in capturing prolonged extremes. Overall, the 10-year, 5-year, and seasonal models appear to be the most consistent for temperature and rainfall extremes.
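For a fitted GEV model, the return level for a given return period follows from inverting the GEV distribution function. The sketch below uses the standard closed form with the sign convention in which a positive shape parameter corresponds to the heavy-tailed Fréchet case; the parameter values are hypothetical rather than taken from Table 6, and the return period is measured in numbers of blocks, so its meaning in years depends on the block size.

```python
import math

def gev_return_level(mu, sigma, xi, period):
    """Level exceeded on average once every `period` blocks.

    mu, sigma, xi: GEV location, scale and shape (xi > 0 is heavy-tailed).
    """
    y = -math.log(1.0 - 1.0 / period)
    if abs(xi) < 1e-12:                      # Gumbel limit as xi -> 0
        return mu - sigma * math.log(y)
    return mu + (sigma / xi) * (y ** (-xi) - 1.0)

def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function, used here to check the return levels."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0:                               # outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

# Hypothetical GEV parameters for block maxima of rainfall (mm).
mu, sigma, xi = 100.0, 30.0, 0.2
levels = {T: gev_return_level(mu, sigma, xi, T) for T in (10, 50, 100)}
for T, z in levels.items():
    # The fitted CDF at the T-block return level should equal 1 - 1/T.
    print(f"T={T:>3}  return level={z:8.2f}  G(z)={gev_cdf(z, mu, sigma, xi):.4f}")
```

Because the shape parameter enters as an exponent, uncertainty in its estimate is amplified at long return periods, which is one reason block sizes with few maxima give wide intervals around the return levels.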
Finally, we used a density plot to check how each model captures the distribution of maximum temperatures and total rainfall. In the temperature plot in Fig. 8a, the 10-year, 5-year, and event-based models displayed the most concentrated curves, suggesting a narrower range with more pronounced extremes. Models with higher temporal resolutions, like monthly and 12-month moving averages, exhibit wider density curves, indicating a broader distribution that captures more frequent fluctuations but is less focused on extremes. The quarterly and semi-annual models fall between these extremes, striking a balance between stability and variability. For rainfall data in Fig. 8b, a similar pattern emerges: the 10-year and 5-year models show steeper, more concentrated curves, indicating that they effectively capture rare, high-magnitude events. In contrast, finer-resolution models, such as monthly and 12-month moving averages, have flatter curves, capturing a wider range of data with less emphasis on extremes.
Conclusion
In this study, we have assessed various probability distributions for modeling maximum temperature and total rainfall data using a systematic and comprehensive approach that combines several GOF tests and graphical tools. In addition, we have identified the optimal block size for the GEV distribution using return levels across different periods, as well as log-likelihood, AIC, and BIC. Insights from GOF tests highlighted that the GEV, Gamma, and log-normal distributions were well-suited for both maximum temperature and total rainfall datasets, as they consistently aligned with empirical data. On the other hand, distributions such as uniform, Weibull, and logistic showed a poor fit across multiple metrics, underscoring their limitations in capturing the complexities of climatic variables. The GEV distribution emerged as the optimal model for rainfall and temperature data, consistently outperforming others in key metrics such as the AIC, BIC, and RMSE. It also demonstrated superior performance in GOF tests, including the KS, AD, and CvM tests. This strong performance affirms the robustness of the GEV distribution in modeling climatic extremes and its capacity to provide reliable insights into long-term trends.
Block size analysis revealed the effectiveness of longer temporal aggregations, such as 10-year and 5-year blocks, which produced stable and high return levels across return periods, effectively capturing long-term extreme trends. However, these longer blocks increased uncertainty in parameter estimates due to fewer data points. In contrast, intermediate blocks, such as quarterly and seasonal, struck a balance by capturing seasonal variations while maintaining stability and reliable parameter estimates with moderate AIC and BIC values. High-frequency blocks, such as monthly and 12-month moving averages, although rich in data, exhibited higher AIC and BIC values, suggesting potential overfitting and inefficiency in representing extreme values.
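The trade-off between block length and sample size can be sketched as follows. The example assumes a synthetic monthly series, block lengths expressed in months, and a 50-year return level; the helper names `block_maxima` and `gev_block_summary` are hypothetical, and the numbers it prints are not the study's results.

```python
import numpy as np
from scipy import stats


def block_maxima(series, block_size):
    """Split a series into consecutive blocks and keep each block's maximum."""
    series = np.asarray(series, dtype=float)
    n_blocks = len(series) // block_size             # incomplete final block is dropped
    return series[: n_blocks * block_size].reshape(n_blocks, block_size).max(axis=1)


def gev_block_summary(monthly, block_months, return_years=50):
    """Fit a GEV to block maxima and report fit criteria and a return level."""
    maxima = block_maxima(monthly, block_months)
    c, loc, scale = stats.genextreme.fit(maxima)     # SciPy's shape c equals -xi
    loglik = np.sum(stats.genextreme.logpdf(maxima, c, loc, scale))
    n, k = len(maxima), 3
    # Express the return period in units of blocks before taking the quantile.
    blocks_in_period = return_years * 12 / block_months
    level = stats.genextreme.ppf(1 - 1 / blocks_in_period, c, loc, scale)
    return {
        "n": n,
        "aic": 2 * k - 2 * loglik,
        "bic": k * np.log(n) - 2 * loglik,
        "return_level": float(level),
    }


# Synthetic 120-year monthly series with a seasonal cycle (assumed, not real data).
rng = np.random.default_rng(1)
months = np.arange(120 * 12)
monthly = 25 + 3 * np.sin(2 * np.pi * months / 12) + rng.gumbel(0.0, 1.0, size=months.size)

summaries = {b: gev_block_summary(monthly, b) for b in (3, 6, 12, 60)}
for b, s in summaries.items():
    print(f"block={b:>2d} months  n={s['n']:>3d}  AIC={s['aic']:.1f}  "
          f"BIC={s['bic']:.1f}  50-yr level={s['return_level']:.2f}")
```

The `n` column shows how quickly the number of maxima shrinks as blocks lengthen, which is the source of the wider parameter uncertainty described for the 5-year and 10-year blocks.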
The results of this study are important for Kenya and the East African region because the adopted methodology can be applied to other climatic datasets in the region. The comprehensive GOF tests also support more reliable forecasting of temperature and rainfall, which is crucial for risk assessment and the development of climate adaptation strategies. With this knowledge, predictions of and preparations for catastrophic events, such as floods, droughts, or rising temperatures, can be improved. With better forecasts, policymakers and the government can improve infrastructure for water catchment systems and strengthen agricultural activities through proper planning and disaster preparedness.
However, a key limitation of this study is its focus on individual probability distributions for temperature and rainfall without explicitly addressing the interdependence between these variables. Since temperature and rainfall are inherently related, accurate risk assessments and effective climate adaptation strategies require consideration of their associations. Extensive research has been conducted on the dependence between temperature and rainfall; therefore, future studies should prioritize exploring dependence structures within a multivariate framework using the fitted probability distributions identified in this study. Advanced approaches such as copula models or joint distribution analyses could provide deeper insights into the interactions between these variables, particularly under extreme climatic conditions. Such efforts would significantly enhance the reliability of climate models and their applicability to integrated risk assessment frameworks.
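As a pointer to what such a multivariate extension might look like, the sketch below couples two GEV margins through a Gaussian copula and recovers the copula correlation from Kendall's tau using the standard relation rho = sin(pi * tau / 2). The choice of a Gaussian copula, the marginal parameters, the dependence strength of -0.6, and the helper names are all assumptions made for illustration; they are not estimates from the study.

```python
import numpy as np
from scipy import stats


def gaussian_copula_rho(x, y):
    """Estimate the Gaussian-copula correlation from Kendall's tau."""
    tau, _ = stats.kendalltau(x, y)
    return float(np.sin(np.pi * tau / 2))            # rho = sin(pi * tau / 2)


def sample_joint(rho, temp_dist, rain_dist, size, seed=0):
    """Draw dependent (temperature, rainfall) pairs from a Gaussian copula."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=size)
    u = stats.norm.cdf(z)                            # uniform margins, dependence retained
    # Map each uniform margin through the inverse CDF of its fitted distribution.
    return temp_dist.ppf(u[:, 0]), rain_dist.ppf(u[:, 1])


# Assumed GEV margins with made-up parameters (placeholders for fitted values).
temp_dist = stats.genextreme(c=0.1, loc=30.0, scale=2.0)
rain_dist = stats.genextreme(c=-0.1, loc=80.0, scale=25.0)

temp, rain = sample_joint(-0.6, temp_dist, rain_dist, size=2000)
rho_hat = gaussian_copula_rho(temp, rain)
print(f"estimated copula correlation: {rho_hat:.2f}")
```

Because Kendall's tau depends only on ranks, the estimate is unaffected by the choice of margins, which is what allows the univariate fits to be reused unchanged. A Gaussian copula has no tail dependence, so a study focused on joint extremes might prefer a different copula family.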
To build on this work, future research should focus on applying this methodology at finer spatial scales using observed datasets from various regions in Kenya. Conducting probability distribution analyses at regional levels, incorporating block size analysis, and integrating data from multiple weather stations could yield region-specific insights into seasonal rainfall patterns, further informing targeted climate adaptation strategies. From a policy perspective, the results underscore the need for data-driven strategies that take into account both the individual and the joint variability of climatic variables. Policymakers should leverage these insights to design robust adaptation measures, such as improving agricultural planning and water resource management and strengthening infrastructure resilience, tailored to Kenya’s specific climate challenges.
Data availability
The data that support the findings of this study are accessible to registered users (free registration) on the World Bank Climate Change Knowledge Portal (https://climateknowledgeportal.worldbank.org/).
References
GOK. Kenya Climate Smart Agriculture Strategy, 2017–2026 (Ministry of Agriculture, Livestock and Fisheries, 2017).
Jalango, D. et al. Climate smart agriculture investment plan for Kenya. In Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) (2022).
Nyika, J. M. Climate change situation in Kenya and measures towards adaptive management in the water sector. In Research Anthology on Environmental and Societal Impacts of Climate Change, 1857–1872 (IGI Global, 2022).
Ngure, M. W., Wandiga, S. O., Olago, D. O. & Oriaso, S. O. Climate change stressors affecting household food security among Kimandi–Wanyaga smallholder farmers in Murang’a County, Kenya. Open Agric. 6, 587–608 (2021).
Mkonda, M. Y. & He, X. Are rainfall and temperature really changing? Farmer’s perceptions, meteorological data, and policy implications in the Tanzanian semi-arid zone. Sustainability 9, 1412 (2017).
Al Mamoon, A. & Rahman, A. Selection of the best fit probability distribution in rainfall frequency analysis for Qatar. Nat. Hazards 86, 281–296 (2017).
Sharma, M. A. & Singh, J. B. Use of probability distribution in rainfall analysis. N. Y. Sci. J. 3, 40–49 (2010).
Dzupire, N. C., Ngare, P. & Odongo, L. A copula based bi-variate model for temperature and rainfall processes. Sci. Afr. 8, e00365 (2020).
Athulya, P. & James, K. Best fit probability distributions for monthly radiosonde weather data. Int. J. Adv. Manag. Technol. Eng. Sci. 7, 24–31 (2017).
Ozonur, D., Pobocikova, I. & de Souza, A. Statistical analysis of monthly rainfall in central west Brazil using probability distributions. Model. Earth Syst. Environ. 7, 1979–1989 (2021).
Ximenes, P. S. M. P., Silva, A. S. A., Ashkar, F. & Stosic, T. Best-fit probability distribution models for monthly rainfall of northeastern Brazil. Water Sci. Technol. 84, 1541–1556 (2021).
Hussain, B. et al. Interdependence between temperature and precipitation: Modeling using copula method toward climate protection. Model. Earth Syst. Environ. 8, 2753–2766 (2022).
Singirankabo, E. & Iyamuremye, E. Modelling extreme rainfall events in Kigali city using generalized Pareto distribution. Meteorol. Appl. 29, e2076 (2022).
Agbonaye, A. & Izinyon, O. Best-fit probability distribution model for rainfall frequency analysis of three cities in south eastern Nigeria. Niger. J. Environ. Sci. Technol. (NIJEST) 1, 34–42 (2017).
Douka, M. & Karacostas, T. Statistical analyses of extreme rainfall events in Thessaloniki, Greece. Atmos. Res. 208, 60–77 (2018).
Oseni, B. A. & Ayoola, F. J. Fitting the statistical distribution for daily rainfall in Ibadan, based on chi-square and Kolmogorov–Smirnov goodness-of-fit tests. West Afr. J. Ind. Acad. Res. 7, 93–100 (2013).
Yuan, J., Emura, K., Farnham, C. & Alam, M. A. Frequency analysis of annual maximum hourly precipitation and determination of best fit probability distribution for regions in Japan. Urban Clim. 24, 276–286 (2018).
Alam, M. A., Farnham, C. & Emura, K. Best-fit probability models for maximum monthly rainfall in Bangladesh using Gaussian mixture distributions. Geosciences 8, 138 (2018).
Houessou-Dossou, E. A. Y., Mwangi Gathenya, J., Njuguna, M. & Abiero Gariy, Z. Flood frequency analysis using participatory GIS and rainfall data for two stations in Narok town, Kenya. Hydrology 6, 90 (2019).
Coronado-Hernández, Ó. E., Merlano-Sabalza, E., Díaz-Vergara, Z. & Coronado-Hernández, J. R. Selection of hydrological probability distributions for extreme rainfall events in the regions of Colombia. Water 12, 1397 (2020).
Fadhilah, Y. et al. Fitting the best-fit distribution for the hourly rainfall amount in the Wilayah Persekutuan. Jurnal Teknologi 46, 49–58 (2007).
Hasan, R. H. R. Estimating the best-fitted probability distribution for monthly maximum temperature at the Sylhet station in Bangladesh. J. Math. Stat. Stud. 2, 60–67 (2021).
Hossain, M. Fitting the probability distribution of monthly maximum temperature of some selected stations from the northern part of Bangladesh. Int. J. Ecol. Econ. Stat. 39, 80–91 (2018).
WorldBank. Climate change knowledge portal (2024). Accessed 16 Sept 2023.
Cullen, A. C. & Frey, H. C. Probabilistic Techniques in Exposure Assessment (1999).
Haddad, K. & Rahman, A. Selection of the best fit flood frequency distribution and parameter estimation procedure: A case study for Tasmania in Australia. Stoch. Environ. Res. Risk Assess. 25, 415–428 (2011).
Fisher, R. A. On the mathematical foundations of theoretical statistics. In Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 222, 309–368 (1922).
Zong, Z. Information-Theoretic Methods for Estimating of Complicated Probability Distributions Vol. 207 (Elsevier, 2006).
Naghettini, M. Fundamentals of Statistical Hydrology (Springer, 2017).
Chikobvu, D. & Chifurira, R. Modelling of extreme minimum rainfall using generalised extreme value distribution for Zimbabwe. S. Afr. J. Sci. 111, 01–08 (2015).
Sukrutha, A., Dyuthi, S. R. & Desai, S. Probability distribution for monthly precipitation data in India. arXiv preprint arXiv:1708.03144 (2017).
Lima, A. O. et al. Extreme rainfall events over Rio de Janeiro state, Brazil: Characterization using probability distribution functions and clustering analysis. Atmos. Res. 247, 105221 (2021).
Moccia, B., Mineo, C., Ridolfi, E., Russo, F. & Napolitano, F. Probability distributions of daily rainfall extremes in Lazio and Sicily, Italy, and design rainfall inferences. J. Hydrol. Reg. Stud. 33, 100771 (2021).
Razali, N. M. et al. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J. Stat. Model. Anal. 2, 21–33 (2011).
Ng, J. et al. Investigation of the best fit probability distribution for annual maximum rainfall in Kelantan river basin. In IOP Conference Series: Earth and Environmental Science, vol. 476, 012118 (IOP Publishing, 2020).
Coronado-Hernández, Ó. E., Merlano-Sabalza, E., Díaz-Vergara, Z. & Coronado-Hernández, J. R. Selection of hydrological probability distributions for extreme rainfall events in the regions of Colombia. Water 12, 1397 (2020).
Singirankabo, E., Iyamuremye, E., Habineza, A. & Nelson, Y. Statistical modelling of maximum temperature in Rwanda using extreme value analysis. Open J. Math. Sci. 7, 180–195 (2023).
Coles, S. Basics of statistical modeling. In An Introduction to Statistical Modeling of Extreme Values 18–44 (2001).
Ng, J. et al. Statistical modelling of extreme temperature in peninsular Malaysia. In IOP Conference Series: Earth and Environmental Science, vol. 1022, 012072 (IOP Publishing, 2022).
Ghosh, S., Roy, M. K. & Biswas, S. C. Determination of the best fit probability distribution for monthly rainfall data in Bangladesh. Am. J. Math. Stat. 6, 170–174 (2016).
Villarini, G., Smith, J. A., Serinaldi, F. & Ntelekos, A. A. Analyses of seasonal and annual maximum daily discharge records for central Europe. J. Hydrol. 399, 299–312 (2011).
Hasan, H., Radi, N. A. & Kassim, S. Modeling of extreme temperature using generalized extreme value (GEV) distribution: A case study of Penang. Proc. World Congr. Eng. 1, 181–186 (2012).
Ender, M. & Ma, T. Extreme value modeling of precipitation in case studies for China. Int. J. Sci. Innov. Math. Res. (IJSIMR) 2, 23–36 (2014).
Fowler, H. & Kilsby, C. A regional frequency analysis of United Kingdom extreme rainfall from 1961 to 2000. Int. J. Climatol. J. R. Meteorol. Soc. 23, 1313–1334 (2003).
Gilleland, E., Ribatet, M. & Stephenson, A. G. A software review for extreme value analysis. Extremes 16, 103–119 (2013).
Özari, Ç., Eren, Ö. & Saygin, H. A new methodology for the block maxima approach in selecting the optimal block size. Tehnički vjesnik 26, 1292–1296 (2019).
Musyoka, M. M. Spatial–Temporal Characteristics of Rainfall Events in Kenya. Ph.D. thesis, University of Nairobi (2020).
Onwuegbuche, F. C. et al. Application of extreme value theory in predicting climate change induced extreme rainfall in Kenya. Int. J. Stat. Probab. 8, 85–94 (2019).
Acknowledgements
The authors acknowledge with gratitude the support from Strathmore Institute of Mathematical Sciences, Strathmore University and the DAAD [ST32 - PKZ: 91789473] in the production of this manuscript.
Author information
Contributions
K.O., B.O. and L.C. conceived the project. K.O. performed the analysis and drafted the manuscript with substantial contributions from B.O., L.C., and C.O. All authors have read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Otieno, K., Chaba, L., Odhiambo, C. et al. A systematic approach to modeling monthly maximum temperature and total rainfall in Kenya. Sci Rep 15, 31758 (2025). https://doi.org/10.1038/s41598-025-12810-0