Background and Summary

Dengue fever is an acute viral infectious disease transmitted by mosquitoes and is widely distributed in urban or semi-urban areas in more than 100 countries and regions in the tropics and subtropics worldwide, posing a serious threat to people’s lives and health1. In 2024, approximately 90 countries reported active dengue transmission, and by 5 September, the total number of reported cases in the Americas had exceeded 11 million2. The number of people at risk of dengue is predicted to reach 6.1 billion (4.7–6.9 billion) by 2080, accounting for more than 60% of the world’s population3. Brazil faces the worst dengue fever situation globally, with 3,088,723 reported cases in 20234 and over one million cases in the first two months of 20245. Due to the lack of an effective vaccine, accurate and timely dengue risk prediction has become crucial for disease prevention and control5, particularly for the joint management of dengue outbreaks. Utilizing environmental datasets to establish the relationship between dengue and factors such as climate, environment, and human activities can support research on dengue mechanisms and improve risk prediction. Epidemiological data in Brazil are publicly available6, and several projects and studies have highlighted the availability of dengue datasets for researchers to investigate transmission mechanisms and develop predictive models7. For example, the Info Dengue platform is a hybrid data-based arbovirus alert system consisting of three processes: data collection, coordination, and analysis. By combining social network, climate, and epidemiological data, it generates indicators of the epidemiological status of dengue and other arboviruses such as Zika and chikungunya at the municipal level for 788 cities in Brazil8.
However, to date, effective methods for calculating dengue environmental factors and the complete and comprehensive Brazilian environmental datasets specifically for studies on dengue transmission mechanisms and risk prediction remain unavailable. Given the critical dengue epidemic in Brazil, establishing a complete set of environmental datasets is essential for analyzing outbreaks and improving prevention and control efforts.

Environmental datasets are of great importance for modeling the spatial and temporal spread of dengue, especially in a country as diverse as Brazil9, where significant heterogeneity exists in the timing and intensity of dengue transmission10. Previous studies have explored the association of numerous environmental factors (e.g., temperature11,12, precipitation13,14, relative humidity15,16, atmospheric pressure17,18, wind speed18, the Palmer Drought Severity Index (PDSI)19, and the Normalized Difference Vegetation Index (NDVI)20,21) with the incidence of dengue. These factors affect the life cycle, survival, and biting rate of Aedes aegypti, as well as the incubation period of the virus, thereby shaping the spatial and temporal patterns of dengue epidemics. Due to the strong hematophagous nature of dengue vectors22 and their limited range of movement, human-to-vector-to-human transmission serves as a key route for the spread of the dengue virus. Consequently, human factors play a significant role in dengue virus transmission. Previous studies have explored the relationship between urbanization19 and dengue risk, as well as the impact of dengue fever on gross domestic product (GDP)23. Most dengue risk prediction studies use environmental factors from weather stations24. However, the limited spatial representativeness of weather station-based observations poses challenges for efficient and accurate dengue risk prediction. In addition, the time-consuming nature of downloading and processing satellite images and ready-to-use data such as PDSI and NDVI further complicates large-scale dengue studies. To address these challenges, it is essential to develop an efficient methodology for constructing a spatiotemporal dataset of dengue influencing factors, enabling more accurate and timely predictions of dengue outbreaks.

In recent years, geospatial big data has become a global focus, attracting increasing attention from various organizations25. It encompasses vast amounts of specific types of data with location information, offering tremendous opportunities to advance scientific discoveries across diverse fields, such as climate science, disaster management, public health, precision agriculture and smart cities26. However, the inherent complexity of geospatial big data, characterized by its multi-source, multi-scale, high-dimensional, dynamic, heterogeneous, and non-linear nature, introduces numerous challenges spanning the entire data lifecycle, from acquisition and storage to searching, sharing, transmission, analysis, and visualization27. There is an urgent need for advanced methods to address these issues in order to fully leverage the potential of geospatial big data. Currently, the main approaches are based on two platforms: cluster-based high-performance computing (HPC) systems and cloud computing platforms28. The introduction and application of cloud computing have significantly reduced the complexity and labor costs associated with dataset creation29. Cloud computing platforms such as Google Earth Engine (GEE), PIE Engine, and Amazon Web Services are widely used. For example, GEE hosts several petabytes of global satellite imagery (e.g., MODIS, Landsat, and Sentinel) and ready-to-use datasets on various themes, such as climate, land cover, agriculture, urbanization, and population. It also supports various algorithms, including image preprocessing, spatial and temporal analysis, and image classification, as well as parallel computing capabilities30,31. GEE has become an important tool in public health research, particularly in the study of infectious diseases like malaria32,33 and Zika34. However, datasets related to the factors influencing dengue transmission remain underdeveloped.
Therefore, to produce dengue-related datasets based on GEE, it is essential to identify the appropriate data sources for generating spatial-temporal predictions of environmental factors that affect dengue transmission.

Here, this study presents a geospatial big data cloud-based framework for identifying common environmental factors associated with dengue transmission at the microregion scale, per epidemiological week, in Brazil. The framework aims to facilitate access to publicly available geospatial big data and cloud computing platforms for public health researchers. By specifying the spatial and temporal resolution, and the spatial extent required to generate environmental datasets, as well as by making the methodology and code publicly available, this approach can reduce duplication of efforts in dataset creation.

Methods

Brazil, officially known as the Federal Republic of Brazil, is the largest country in South America and the fifth largest in the world, and is located in the eastern part of South America (Fig. 1). In Brazil, the area of dengue transmission has been expanding southward and toward the central regions of the country35,36. Studies have shown the impact of extreme weather events, such as droughts and floods19, on the distribution of dengue cases. Meanwhile, the economic expansion into the Amazon region through road construction and deforestation has also contributed to the widening of the dengue transmission area37,38. In addition, long-term temperature increases have been shown to play a significant role in this expansion39. Therefore, an integrated consideration of environmental factors related to temperature, precipitation, and urbanization is crucial for understanding dengue transmission mechanisms and improving risk prediction models.

Fig. 1

Illustration of 558 microregions and 5 regions in Brazil.

Many studies have used dengue cases reported by the Sistema de Informação de Agravo de Notificação (SINAN)6,19,40, which has reported dengue cases since 2001. This dataset provides clinical, sociodemographic, and laboratory information on patients with confirmed dengue fever. Currently, most studies on dengue risk in Brazil are conducted at the microregion level10,41. In Brazil, 5570 municipalities are aggregated into 558 microregions and 5 regions (Fig. 1), which are commonly used as spatial units for analyzing the spatial patterns of infectious diseases19. Accordingly, this study established a dataset of environmental factors at the spatial scale of 558 microregions, spanning a temporal scale of 1252 epidemiological weeks from 2001 to 2024 (Fig. 2).
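To make the weekly time axis concrete, the following minimal sketch enumerates epidemiological weeks in Python. It assumes a Sunday-to-Saturday week in which week 1 is the week containing 4 January; this convention is an assumption that is not spelled out in the text, and the function names are illustrative. Under this assumption, the number of weeks from 2001 to 2024 comes to 1252, which agrees with the figure reported above.

```python
from datetime import date, timedelta


def epi_week_one_start(year: int) -> date:
    """Sunday that opens epidemiological week 1 of `year`.

    Assumption: weeks run Sunday to Saturday and week 1 is the week
    that contains 4 January (i.e. at least four days fall in the new year).
    """
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    return jan4 - timedelta(days=days_since_sunday)


def epi_week_start(year: int, week: int) -> date:
    """First day (Sunday) of the given epidemiological week."""
    return epi_week_one_start(year) + timedelta(weeks=week - 1)


def weeks_in_year(year: int) -> int:
    """Number of epidemiological weeks (52 or 53) in `year`."""
    span = epi_week_one_start(year + 1) - epi_week_one_start(year)
    return span.days // 7


# Total number of weekly time steps covered by the 2001-2024 period.
total_weeks = sum(weeks_in_year(y) for y in range(2001, 2025))
```

With these helpers, `epi_week_start(2019, 15)` gives the first day of the 15th epidemiological week of 2019, which can then be used to select the daily or hourly images that belong to that week.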

Fig. 2

The proposed framework for generating dengue environmental factors at multiple spatial and temporal scales based on the GEE platform. Part (a) represents the temporal composition of images based on epidemiological weeks and environmental datasets with a population-weighting operation. Part (b) represents the spatial aggregation of image collections to the Brazilian microregional scale.

The dataset presents a list of dengue environmental factors, including (1) total precipitation per week; (2) number of rainy days per week; (3) maximum, mean, and minimum temperatures; (4) mean relative humidity; (5) mean atmospheric pressure; (6) mean wind speed; (7) Palmer Drought Severity Index (PDSI); (8) mean Normalized Difference Vegetation Index (NDVI); (9) Gross Domestic Product (GDP); and (10) building surface area (Table 1).

Table 1 List of global gridded variables used in this study.

Population-weighted spatial aggregation

Because dengue vectors have strong anthropophilic behavior22 and virus-carrying mosquitoes have a limited range of activity, human-to-vector-to-human transmission is the main route of dengue virus transmission. Therefore, it is necessary to fully consider the distribution of the population when computing the environmental factors in each microregion. The population-weighting operation reflects the environmental conditions in the areas where people actually live42. This facilitates further studies, such as subsequent integration with the distribution of dengue cases for driving factor identification or risk mapping. In this study, the Brazilian population distribution data from the WorldPop project dataset43 were used to apply the population-weighting approach to each environmental factor, to reduce the impact of environmental factors in uninhabited areas on the analysis of dengue transmission mechanisms. We selected the Brazilian population data from 2001 to 2020. Because the current WorldPop dataset lacks population data from 2021 to 2024, we used the 2020 population data to implement the population weighting for 2021 to 2024 in this study.
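The population-weighting idea can be illustrated outside GEE with a few lines of Python. This is a minimal sketch under stated assumptions, not the code used to build the dataset: it treats a microregion as a flat list of pixels, uses the pixel population counts as weights, and returns `None` when the total population is zero. The function name and the handling of missing pixels are hypothetical.

```python
def population_weighted_mean(values, population):
    """Population-weighted mean of an environmental factor over one region.

    values      -- per-pixel factor values (None marks a missing pixel)
    population  -- per-pixel population counts aligned with `values`
    Returns None when no populated, valid pixel exists (an assumption).
    """
    weighted_sum = 0.0
    total_population = 0.0
    for value, people in zip(values, population):
        if value is None or people is None:
            continue  # skip pixels with missing data
        weighted_sum += value * people
        total_population += people
    if total_population == 0:
        return None
    return weighted_sum / total_population
```

For example, two pixels at 20 °C and 30 °C with populations of 1 and 3 yield 27.5 °C rather than the unweighted 25 °C, so the result leans toward the conditions experienced by most residents, and uninhabited pixels do not contribute at all.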

Sum precipitation and number of rainy days

We obtained precipitation data for Brazil from 2001 to 2024 at a spatial resolution of 11,132 meters through the Earth Engine Data Catalog website (https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_DAILY_AGGR). This dataset originates from the Copernicus Climate Data Store44. ERA5-Land is a high-resolution reanalysis dataset that provides consistent land surface data over decades, enhanced by atmospheric forcing from ERA5 to ensure accuracy45. In this study, we calculated the total precipitation for each microregion during each epidemiological week. Precipitation here refers to the sum of large-scale and convective precipitation. The precipitation data, measured as depth (in meters), were converted to millimeters (mm) in GEE. This metric represents the depth of water uniformly distributed over the grid cell. To describe precipitation frequency, we used the number of rainy days per week. Previous studies have shown a relationship between the frequency of rainy days in a certain time cycle and dengue fever cases46.
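The two precipitation variables can be sketched for a single location as follows. The unit conversion (meters to millimeters) follows the text; the rainy-day threshold of 1 mm per day is an assumption introduced only for illustration, since the text does not state which threshold was used, and the function name is hypothetical.

```python
# Assumption: a day counts as "rainy" when it receives at least this much
# precipitation. The threshold used for the published dataset is not stated.
RAINY_DAY_THRESHOLD_MM = 1.0


def weekly_precipitation(daily_precip_m, threshold_mm=RAINY_DAY_THRESHOLD_MM):
    """Return (total precipitation in mm, number of rainy days) for one week.

    daily_precip_m -- daily precipitation depths in meters, one value per day
    """
    daily_mm = [depth_m * 1000.0 for depth_m in daily_precip_m]  # m -> mm
    total_mm = sum(daily_mm)
    rainy_days = sum(1 for depth_mm in daily_mm if depth_mm >= threshold_mm)
    return total_mm, rainy_days
```

Keeping the threshold as a parameter makes it easy to test how sensitive the rainy-day count is to the chosen definition.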

Maximum, mean and minimum temperature

We obtained temperature data for Brazil from 2001 to 2024 at a spatial resolution of 11,132 meters through the Earth Engine Data Catalog website (https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_HOURLY). This dataset originates from the Copernicus Climate Data Store45. The temperature represents the hourly air temperature at 2 meters above the surface of the land, ocean, or inland waters. It is calculated by interpolating between the lowest level of the model and the Earth’s surface based on atmospheric conditions45. The dataset provides temperature in kelvin, which we converted to degrees Celsius here.
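As a minimal illustration of the unit conversion and weekly summary, the sketch below converts hourly values from kelvin to degrees Celsius and takes the minimum, mean, and maximum over all hours of a week. Summarizing directly over hourly values (rather than, for example, averaging daily extremes) is an assumption made here for simplicity, and the function name is hypothetical.

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def weekly_temperature_stats(hourly_kelvin):
    """Return (min, mean, max) temperature in degrees Celsius for one week.

    hourly_kelvin -- non-empty sequence of hourly 2 m air temperatures in kelvin
    """
    celsius = [t - KELVIN_OFFSET for t in hourly_kelvin]
    return min(celsius), sum(celsius) / len(celsius), max(celsius)
```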

Mean relative humidity

In this study, since no relative humidity dataset is available at large temporal and spatial scales, we used the dew point temperature and temperature from the ERA5-Land dataset to calculate relative humidity (%). The calculation is based on the Magnus formula:

$${Relative}\,{Humidity}( \% )=100\times (\frac{{\exp }(\frac{17.67\times {T}_{d}}{243.5+{T}_{d}})}{{\exp }(\frac{17.67\times T}{243.5+T})})$$
(1)

where Td represents the dew point temperature (°C), and T represents the temperature (°C). We chose the two parameter values of 17.67 and 243.5 °C based on those used in a NOAA presentation47.
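Equation (1) translates directly into code. The sketch below is a plain Python version for a single pair of values, using the same two constants; the function name is illustrative, and no clipping to 100% is applied because the text does not mention one.

```python
import math

MAGNUS_A = 17.67   # dimensionless coefficient from Eq. (1)
MAGNUS_B = 243.5   # degrees Celsius, from Eq. (1)


def relative_humidity(temp_c: float, dewpoint_c: float) -> float:
    """Relative humidity (%) from air and dew point temperature (both in °C)."""
    vapour_term = math.exp(MAGNUS_A * dewpoint_c / (MAGNUS_B + dewpoint_c))
    saturation_term = math.exp(MAGNUS_A * temp_c / (MAGNUS_B + temp_c))
    return 100.0 * vapour_term / saturation_term
```

When the dew point equals the air temperature the function returns 100%, and the value falls as the gap between the two temperatures widens. Inputs must already be in degrees Celsius, so ERA5-Land values in kelvin need to be converted first.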

Mean atmospheric pressure

We obtained surface pressure data for Brazil from 2001 to 2024 at a spatial resolution of 11,132 meters through the Earth Engine Data Catalog (available from https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_HOURLY). This dataset was sourced from the Copernicus Climate Data Store45. Surface pressure, measured in Pascals (Pa), indicates the atmospheric force per unit area at a given location, representing the weight of the air column above that point.

Mean wind speed

We also obtained wind speed data for Brazil from 2001 to 2024 at a spatial resolution of 11,132 meters through the Earth Engine Data Catalog (https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_HOURLY). This dataset was sourced from the Copernicus Climate Data Store. The data, measured in meters per second (m/s), includes both the eastward (U component) and northward (V component) wind components at a height of 10 meters. To calculate the total wind speed, we combined these components using the following formula:

$${Wind\; Speed}=\sqrt{{U}^{2}+{V}^{2}}$$
(2)

where U represents the eastward wind speed component (m/s), and V represents the northward wind speed component (m/s).
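Equation (2) is the Euclidean norm of the two wind components. A minimal Python sketch, with an illustrative function name:

```python
import math


def wind_speed(u: float, v: float) -> float:
    """Wind speed (m/s) from eastward (u) and northward (v) components (m/s)."""
    return math.hypot(u, v)  # sqrt(u**2 + v**2)
```

Because the components are squared, the sign of each component (that is, the wind direction) does not affect the resulting speed. Note that the speed should be computed from the components at each time step before averaging; averaging the components first would let opposing winds cancel out.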

Palmer drought severity index

The Palmer Drought Severity Index (PDSI) dataset used in this study was sourced from TerraClimate48 in the Earth Engine Data Catalog (https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_TERRACLIMATE). This dataset has a monthly temporal resolution and a spatial resolution of 4638.3 meters.

Mean normalized difference vegetation index

We obtained the Normalized Difference Vegetation Index (NDVI) data for Brazil from 2001 to 2022 at a spatial resolution of 463.3 meters through the Earth Engine Data Catalog (https://developers.google.com/earth-engine/datasets/catalog/MODIS_MOD09GA_006_NDVI). This dataset was sourced from the NASA LP DAAC at the USGS EROS Center. NDVI ranges from -1 to 1 and has been widely used in environmental health studies49. A value of 0 represents the absence of vegetation, values close to 1 represent the highest possible density of vegetation, and values close to -1 represent the presence of water50.

Gross domestic product

The gross domestic product (GDP) dataset used in this study is based on calibrated nighttime light data51. This dataset has a spatial resolution of 1 km and covers the period from 1992 to 2019 at yearly intervals. The unit of the GDP index is millions of US dollars (2017 valuation). For this study, we obtained the 2001–2019 yearly GDP values and aggregated them to the 558 microregions.

Building surface area

In this study, we used built-up surface as an indicator of urbanization. We obtained data on the building surface area for each microregion in Brazil from the GHSL (Global Human Settlement Layer) dataset52. The GHS-BUILT-S product, derived from Sentinel-2 composite and Landsat, includes the data on both the total built-up surface and the built-up surface allocated to dominant non-residential uses53. The dataset provides data for five years (2000, 2005, 2010, 2015 and 2020) with the built-up surface data spatially-temporally interpolated or extrapolated from 1975 to 2030 at five-year intervals. The spatial resolution is 100 meters. In GEE, we calculated the building surface area in 558 microregions by extending the values contained in the GHSL dataset backward by 4 years to obtain complete annual data from 2001 to 2024. Figure 3(l) shows the building surface area in square kilometers (km2) in 2019.

Fig. 3

Illustrative maps of microregion-level dengue-associated factors. The factors include minimum temperature (a), mean temperature (b), maximum temperature (c), total precipitation (d), number of rainy days (e), mean relative humidity (f), mean NDVI (g), mean pressure (h), mean wind speed (i), PDSI (j), GDP (k), and building surface area (l). (a–i) show the factors in a single epidemiological week (15th epidemiological week of 2019); (j) shows the PDSI in April 2019; (k,l) show the GDP and building surface area in 2019.

Fig. 3 shows the spatial distribution of dengue-associated factors across the 558 microregions in the 15th epidemiological week of 2019, and Fig. 4 shows the temporal distribution of dengue-associated factors in the Federal District of Brazil, identified as microregion 53001. The factors shown in Fig. 4 are derived from source data recorded on a daily or hourly basis and are therefore suitable for time series analysis.

Fig. 4

Time series illustration of dengue-associated factors in the Federal District of Brazil (microregion code: 53001). Factors include (a) NDVI, (b) wind speed, (c) atmospheric pressure, (d) relative humidity, (e) precipitation, (f) number of rainy days, and (g) temperature.

Data Records

All output datasets described in this article are publicly and freely available through the Zenodo Repository54 (https://doi.org/10.5281/zenodo.15091652). The dataset contains 12 CSV files of dengue-related factors and a zip file of the 558 microregions in Brazil. The first row of each CSV file lists the time stamps of the variable, and each column contains the values for the 558 microregions at the corresponding time.
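A file with this layout can be read with the Python standard library, as sketched below. Several details are assumptions made for illustration only: that the first column holds the microregion code, the header label, the format of the time stamps, and the numeric values in the sample, which are placeholders and are not taken from the published files.

```python
import csv
import io

# Placeholder content mimicking the described layout (assumed, not real data):
# header row = time stamps, one row per microregion, first column = region code.
SAMPLE_CSV = (
    "microregion,2019-W14,2019-W15\n"
    "53001,21.4,21.9\n"
    "31001,19.8,20.3\n"
)


def load_factor_table(handle):
    """Read one factor file into (time stamps, {region code: {time: value}})."""
    reader = csv.reader(handle)
    header = next(reader)
    times = header[1:]  # skip the assumed region-code column
    table = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        code = row[0]
        table[code] = dict(zip(times, (float(cell) for cell in row[1:])))
    return times, table


times, table = load_factor_table(io.StringIO(SAMPLE_CSV))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, and the column layout should be checked against the downloaded data before relying on this parser.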

Technical Validation

Meteorological grid-based datasets are usually generated through numerical weather prediction models that integrate multiple data sources, such as weather station observations, satellite remote sensing, and radar. These data are interpolated or assimilated to provide meteorological information covering extensive areas55,56. Therefore, the accuracy of such data is affected by both the model used and the data processing methods. Although weather station data are often considered accurate observations, they are limited to the specific locations of the stations and cannot be generalized for large-scale studies. Therefore, for dengue transmission studies at the national microregion level, the use of high-spatial-resolution grid-based datasets is essential. The ERA5-Land dataset used in this study is a land surface reanalysis dataset derived from ECMWF’s ERA5 atmospheric reanalysis data45. It uses the same meteorological forcings (e.g., precipitation, temperature, radiation) as ERA5 but runs the land surface model (HTESSEL)57 at a higher spatial resolution. It should be noted that ERA5-Land does not directly assimilate land observations but instead relies on atmospheric conditions provided by ERA5. To validate the datasets we produced from ERA5-Land, we compared them with daily data from more than 200 conventional weather stations in Brazil, sourced from the National Institute of Meteorology (Instituto Nacional de Meteorologia, INMET). This station dataset contains meteorological data from 1961 to 2019, with 6 variables selected for analysis: (a) precipitation, (b) relative humidity, (c) minimum temperature, (d) maximum temperature, (e) atmospheric pressure, and (f) wind speed (Fig. 5).

Fig. 5

Illustrative maps of Pearson correlation coefficient between conventional weather station data of the National Institute of Meteorology in Brazil and the generated data: (a) precipitation, (b) relative humidity, (c) minimum temperature, (d) maximum temperature, (e) atmospheric pressure, and (f) wind speed.

Because the datasets we generated are population-weighted, the distribution of the population has a large impact on the validation results. We therefore selected data from meteorological stations located in areas that are densely populated relative to the average population density of their microregion. This approach allows for meaningful comparison between the weather station observations and the population-weighted data.

The verification process included the following steps: 1) In the Google Earth Engine platform, we first calculated the average population density of each microregion and the population density at the pixel where each weather station is located. We then removed stations located in sparsely populated areas. 2) Given the temporal resolution of our data, we calculated daily weather values from the station records and resampled them to the epidemiological weeks of 2019. 3) To assess the reliability of the population-weighted dataset obtained in this study, we calculated the Pearson correlation coefficients and p-values between the six variables and their corresponding observations. 4) Since data from conventional weather stations are missing at certain time points and for specific variables, we excluded weather stations with missing values from the calculation of the Pearson correlation coefficients and p-values.
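Steps 3 and 4 above can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions rather than the validation code itself: weekly series are held in dictionaries keyed by a station identifier, missing values are marked with `None`, and the function names are hypothetical. Only the correlation coefficient is computed here; p-values could be obtained with a statistics package such as SciPy (`scipy.stats.pearsonr`).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def station_correlations(station_series, dataset_series):
    """Correlate weekly station observations with the generated weekly values.

    Both arguments map a station identifier to a list of weekly values.
    Stations with any missing value (None) are excluded, mirroring step 4.
    """
    results = {}
    for station_id, observed in station_series.items():
        estimated = dataset_series.get(station_id)
        if estimated is None or len(estimated) != len(observed):
            continue
        if any(v is None for v in observed) or any(v is None for v in estimated):
            continue
        results[station_id] = pearson_r(observed, estimated)
    return results
```

Averaging the values in the returned dictionary gives a per-variable summary comparable in form to the mean coefficients reported for each variable.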

We used the data from 2019 to carry out the above-mentioned steps. We selected 100 weather stations from the 265 conventional weather stations of the National Institute of Meteorology. The average Pearson correlation coefficient and p-value for weekly precipitation totals across 77 stations were 0.61 and 0.01, respectively, while for weekly mean relative humidity they were 0.78 and 0.02 across 70 stations. The average Pearson correlation coefficients for maximum and minimum temperatures were 0.66 and 0.71, with p-values of 0.06 and 0.05, respectively. For atmospheric pressure and wind speed, the average Pearson correlation coefficients were 0.83 and 0.58 across 59 and 53 stations, with p-values of 0.07 and 0.11, respectively. Figure 5 shows the distribution of the Pearson correlation coefficients for the six weather factors across microregions. The verification process showed that the datasets we generated have a moderate to strong positive linear relationship with the weather station data (average coefficients between 0.58 and 0.83) at the spatiotemporal scale. Given that the population distribution factors in our dataset influence the values, we conclude that the dataset produced in this study is reliable. Figure 6 shows an example comparison of the conventional weather station data and the population-weighted data over the 52 epidemiological weeks of 2019 (weather station code: 43428, microregion code: 31001).

Fig. 6

Comparison of station-observed and calculated maximum temperature over the 52 epidemiological weeks of 2019 (weather station code: 43428, microregion code: 31001).

Usage Notes

The datasets compiled in this study can be used for research related to dengue risk prediction and the identification of driving factors in Brazil. The dataset can also help public health departments implement microregion-level dengue prevention strategies. In addition, the dataset is applicable to studies of other vector-borne infectious diseases, such as chikungunya58 and yellow fever59, which share the same vectors (i.e., Aedes aegypti and Aedes albopictus). However, this dataset has several limitations that should be noted when using our data. First, due to the absence of population data from 2021 to 2024 in the WorldPop dataset, we used the 2020 population data to perform population weighting on the data from 2021 to 2024, based on the assumption that the population does not change significantly over these few years. We plan to update the dataset with the latest WorldPop population data as it becomes available. Second, the resolution constraints of both the factors and the population data may limit the applicability of our method to microregion scales or larger spatial scales (e.g., mesoregions in Brazil). For municipality-level analysis, higher-resolution data would be preferable. Third, the large-scale climate raster data currently available on the Google Earth Engine (GEE) platform are generally derived from a combination of numerical meteorological models and observational data. The data assimilation outcomes may exhibit systematic biases, particularly in regions with sparse observational coverage60. Last, variables such as GDP and building surface area are available only on an annual basis, so this dataset does not capture changes in the economy or the urbanization process at the epidemiological-week scale.