Abstract
The quality of public transport is essential when considering urban mobility in large cities. Several factors, such as urban population growth, rain, and traffic events, can impact mobility and cause congestion. Addressing this issue is essential for the population and is part of the UN’s 2030 Agenda for Sustainable Development goals. Integrating data from different sources is crucial to understanding and planning urban traffic. This work provides a dataset with spatiotemporal information on the mobility of municipal buses, including the estimated emission of polluting gases and the rainfall volume in Rio de Janeiro from 2014 to 2023. Its format facilitates integration with other Rio de Janeiro City Hall datasets, enabling broader and deeper analyses. This work is the first to combine bus observation data with positional information on neighborhoods and rainfall regions, rainfall volumes, and pollutant gas emissions. Thus, its availability opens opportunities for research on public transport associated with environmental indicators and for data science studies on time series and positional data.
Background & Summary
In 1950, approximately 30% of the global population resided in urban areas. This percentage has now surpassed half and is projected to reach 80% by 2050 (ref. 1). Citizens have been relocating to cities because of convenient access to other individuals, employment, and opportunities. An indirect consequence of this migration is the occurrence of traffic jams, which affect people’s accessibility2. Therefore, public transportation plays a crucial role in the population’s daily lives, as it is a critical infrastructure for the economy3.
Today, the motorized transportation sector is responsible for emitting over 15% of the world’s polluting gases. Consequently, reducing vehicular traffic in cities has also become imperative to mitigate the impacts of climate emergencies4,5. This objective is part of the goals outlined in the 2030 Agenda for Sustainable Development developed by the United Nations (UN).
More broadly, understanding urban mobility is essential for transportation planning and other urban processes, such as the spread of epidemics4. The recent surge in urban data volumes has paved the way for the emergence of a new field of study known as the Science of Cities. However, developing a simple model to explain the dominant mechanisms governing the formation and evolution of mobility patterns remains elusive6.
Since traffic jams vary across times and locations, incorporating spatiotemporal data enhances the reliability of assessing their impacts2. Thus, this work provides a dataset called Carioca_MapBus that contains more than just spatiotemporal information on the mobility of municipal buses in Rio de Janeiro. It integrates data from multiple sources, expanding research on various subjects related to the study of mobility. For instance, this information includes estimates of pollutant gas emissions such as Carbon Monoxide (CO), Carbon Dioxide (CO2), Hydrocarbons (HC), and Nitrogen Oxides (NOx), as well as historical records of daily rainfall volume, administrative regions, bus garages, public transport terminals, and areas of interest (e.g., express corridors).
The availability of the Carioca_MapBus dataset presents opportunities for researching urban mobility in a megalopolis and its connection to environmental issues. It also provides an opportunity to conduct data science studies involving space-time series. Examples of these opportunities are presented in this paper.
Methods
Buses data sources
The dataset stems from ongoing data collection efforts initiated in 2014 by the CEFET’s Laboratory of Computational Intelligence in Engineering and Management (LINCE), coinciding with the SIURB (Municipal Urban Information System) implementation by the City Hall of Rio de Janeiro. This system began providing information related to the public transit system of the city of Rio de Janeiro7. Each vehicle in the Rio bus system contains an embedded board programmed to transmit trip information such as vehicle ID, geopositioning, service line, and instant velocity at regular intervals. The City Hall collects this trip information and offers a web access point containing data about every vehicle in the system.
The information on the City Hall’s access point resembles an aerial photograph of the city, providing the most up-to-date data for every bus. However, fetching data at timed intervals may lead to duplicated information if a bus has not yet updated its data between fetches. In addition, the service does not provide historical data.
Dataset preparation
The dataset was created in five stages. Each stage includes metadata and new variables to make the information more useful. The pipeline of stages is illustrated in Fig. 1, and a Data Summary Table (DST) file results from each stage, providing descriptive statistics, provenance, and versioning. Stage (A) extracts and treats the raw data to fix errors and adds information about topographic elevation. Stage (B) adds data about the administrative regions of Rio de Janeiro city, neighborhoods, garages, and bus terminals. Then, Stage (C) inserts the rainfall zones and volumes. Stage (D) includes sample intervals, average speeds, elevations, and bus travel distances. Finally, Stage (E) incorporates estimates of polluting gas emissions. Each stage is detailed as follows.
Fig. 1. Pipeline for building the Data Summary Tables of Carioca_MapBus. Each stage augments the usefulness of the dataset, adding different domain information to every observation throughout the process. Stage A assures the quality of the data and adds the elevation coordinate. Stage B attaches the administrative information associated with the position of each observation. Stage C assimilates the rainfall-related information. Stage D adds variables associated with the vehicle displacement. Stage E computes polluting emissions related to the observation.
Stage (A) extracts the raw data from the Rio de Janeiro City Hall monitoring facility (Data.Rio: https://www.data.rio/) and stores it on our servers. The raw data consists of observations of buses operating in Rio de Janeiro, collected into a file every minute from April 16, 2014, to June 30, 2023. Each entry, depicted in Table 1, contains the most recent data collected for each vehicle, including the time captured from an embedded GPS. This information includes the date, hour, minute, and second (GPSTIMESTAMP, or t), a unique identification number for the bus (BUSID), the service line on which it was operating at the time of observation (LINE), the latitude (LATITUDE), longitude (LONGITUDE), and the instant speed recorded through a GPS sensor installed on the vehicle (VELOCITY). Processing then groups each day’s raw data (1,440 files per day) and removes duplicated data, transmission errors, and invalid data, such as vehicles with negative instant speeds or speeds above 120 km/h. Finally, we clip any observation outside a polygon defined by the municipality border (the bounding box comprises: maximum latitude −22.6°, minimum latitude −23.2°, maximum longitude −43.0°, minimum longitude −44.0°) and map the remaining observations to their elevation (ELEVATION, integrated with data from the Shuttle Radar Topography Mission8). We process the result as a data stream, as shown at the top of Fig. 2. This stream comprises unique observations, each indicating a change in the state of the transit system. The following stages (B and C) tag each observation with position-related information. Then, in Stage D, we split this stream into several streams, one per vehicle, suitable for displacement-related calculations.
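As an illustration only, the Stage (A) filtering rules described above could be sketched as follows. The `Observation` type, the `clean_day` function, and the choice of (BUSID, GPSTIMESTAMP) as the duplicate key are assumptions made for this sketch and are not taken from the published pipeline; only the speed limits and the bounding box come from the text.

```python
from typing import Iterable, NamedTuple


class Observation(NamedTuple):
    timestamp: str    # GPSTIMESTAMP
    bus_id: str       # BUSID
    line: str         # LINE
    latitude: float   # LATITUDE, degrees
    longitude: float  # LONGITUDE, degrees
    velocity: float   # VELOCITY, km/h


# Bounding box and speed limit quoted in the text.
LAT_MIN, LAT_MAX = -23.2, -22.6
LON_MIN, LON_MAX = -44.0, -43.0
MAX_SPEED_KMH = 120.0


def clean_day(observations: Iterable[Observation]) -> list[Observation]:
    """Drop duplicates, invalid speeds, and points outside the bounding box."""
    seen = set()
    kept = []
    for obs in observations:
        key = (obs.bus_id, obs.timestamp)  # assumed definition of a duplicate
        if key in seen:
            continue
        if not 0.0 <= obs.velocity <= MAX_SPEED_KMH:
            continue
        inside = (LAT_MIN <= obs.latitude <= LAT_MAX
                  and LON_MIN <= obs.longitude <= LON_MAX)
        if not inside:
            continue
        seen.add(key)
        kept.append(obs)
    return kept
```

The sketch clips against the bounding box only; clipping against the actual municipality polygon would need a point-in-polygon test on the border geometry.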
Stage (B) tags each observation with administrative labels. The first group of labels comprises the regional subdivision of the municipality. It indicates the administrative region (ADMINISTRATIVEREGION) and the neighborhood (NEIGHBORHOOD). The tagging process uses the observation’s geopositioning (latitude and longitude) and GeoJSON files representing the regions, provided by the Rio de Janeiro City Hall (the dataset with the delimitation of neighborhoods is available9). Table 2 presents the city’s division by administrative regions and neighborhoods. We also provide transit-related information, indicating whether an observation lies in a parking lot (PARKING) or a bus terminal (TERMINAL). To build the parking GeoJSON file, clustering techniques are used to group observations with zero speed at night and in the early morning hours, followed by manual inspection of satellite imagery. We use a similar process to create the terminal GeoJSON file. Figure 3(a) depicts Rio de Janeiro’s public bus transit system with each Metropolitan service route, and Fig. 3(b) shows the parking garages in red, bus terminals in green, and express corridors in blue.
Stage (C) includes information about rainfall volume (RAINFALLVOLUME) and zone (RAINFALLZONE), shown in Fig. 4. Rainfall volumes are available cumulatively at 15-minute intervals from 2014 to 2023 in the Alerta Rio system (The rainfall volumes dataset is available9). Thus, for each observation, we first identify the zone and then the rainfall volume associated with the 15-minute interval of its occurrence.
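A minimal sketch of the second step, the lookup of the 15-minute rainfall volume, is shown below. It assumes that the rainfall records are keyed by the zone name and by the start of each 15-minute interval; the Alerta Rio records may instead be labeled by the end of the interval, in which case the flooring would have to be adjusted. The function names and the dictionary layout are hypothetical, and the zone identification itself (a point-in-polygon test) is not shown.

```python
from datetime import datetime
from typing import Optional


def interval_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def rainfall_for(ts: datetime, zone: str,
                 rainfall: dict[tuple[str, datetime], float]) -> Optional[float]:
    """Return the volume recorded for (zone, interval), or None if missing.

    `rainfall` maps (zone name, interval start) to a volume; this layout
    is an assumption of the sketch, not the format of the source data.
    """
    return rainfall.get((zone, interval_start(ts)))
```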
Fig. 4. The City Hall of Rio de Janeiro divided the city into Rainfall Zones, where pluviometric measurement sensors are installed and monitored. The figure shows each Rainfall Zone; the color, varying from white to red, indicates the number of observations in the database per area unit (density). Table 7 provides the name of each zone.
Stage (D) splits the data stream into sections, one for each vehicle, as shown in Fig. 2. For every observation sequence δ, we compute variables related to the previous observation (δ − 1). Each value is recorded in an attribute linked with the observation sequence δ. The variables include the two-dimensional distance hdδ (the haversine distance between observation sequences δ − 1 and δ), the height hδ determined using the elevations (eδ and eδ−1), the interval iδ, the three-dimensional distance dδ, and the average speed sδ. Table 3 summarizes the notation used.
Equation (1) corresponds to the two-dimensional distance hdδ traveled between observation sequences δ − 1 and δ, computed with the Haversine formula (i.e., the distance between the two points along a great circle of the sphere), where r is the radius of the Earth, θδ−1 and θδ are the latitudes, and λδ−1 and λδ are the longitudes of each observation.
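The Haversine computation described above can be sketched as follows. The function name and the mean Earth radius of 6,371 km are assumptions of this sketch; the radius actually used by the authors is not stated in the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius, in meters


def haversine_m(lat_prev: float, lon_prev: float,
                lat_cur: float, lon_cur: float,
                r: float = EARTH_RADIUS_M) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    theta_prev = math.radians(lat_prev)
    theta_cur = math.radians(lat_cur)
    d_theta = theta_cur - theta_prev
    d_lambda = math.radians(lon_cur - lon_prev)
    a = (math.sin(d_theta / 2) ** 2
         + math.cos(theta_prev) * math.cos(theta_cur) * math.sin(d_lambda / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

With this radius, one degree of latitude corresponds to roughly 111 km, which is a convenient sanity check for the output.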
Let hδ be the height difference between two subsequent observations of the same bus, as described in Equation (2).
The interval iδ is the time difference between a vehicle’s current and previous observations. If this difference is less than or equal to 240 seconds, Equation (3) is applied. If it exceeds 240 seconds, a failure in the communication system is assumed and the resulting value is set to NA (not available) (Equation (4)).
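A minimal sketch of this rule, with `None` standing in for NA, is given below. The function name is hypothetical, and the handling of zero or negative gaps is not described in the text, so the sketch passes them through unchanged.

```python
from datetime import datetime
from typing import Optional

MAX_GAP_S = 240.0  # threshold quoted in the text, in seconds


def interval_s(prev: datetime, cur: datetime) -> Optional[float]:
    """Seconds between two observations of one bus, or None (NA) above 240 s."""
    gap = (cur - prev).total_seconds()
    return gap if gap <= MAX_GAP_S else None
```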
Let dδ be the three-dimensional distance traveled between two observations considering the two-dimensional distance and height. It is computed from the Pythagorean Theorem using the information hdδ and hδ, as described in Equation (5).
The average speed between two observations, sδ, is obtained by dividing the three-dimensional distance (dδ) by the interval (iδ), according to Equation (6).
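The height, three-dimensional distance, and average speed described above can be sketched as follows. The sign convention hδ = eδ − eδ−1 and the units (meters and seconds, giving meters per second) are assumptions of this sketch; the units stored in the dataset may differ. An interval of `None` stands for NA and propagates to the speed.

```python
import math
from typing import Optional


def height_diff(e_prev: float, e_cur: float) -> float:
    """Height difference between consecutive observations (assumed sign)."""
    return e_cur - e_prev


def distance_3d(hd: float, h: float) -> float:
    """Pythagorean combination of the 2D distance and the height difference."""
    return math.hypot(hd, h)


def average_speed(d: float, i: Optional[float]) -> Optional[float]:
    """Distance divided by interval; None when the interval is NA or not positive."""
    if i is None or i <= 0:
        return None
    return d / i
```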
Stage (E) finally inserts the information about the estimated emission of polluting gases CO, CO2, HC, and NOx into the dataset. We adopt a bottom-up approach to compute these emission estimates, which is valuable for identifying emission sources and, thus, essential for developing public policies and implementing mitigation measures. However, it should be noted that the methodology employed lacks a comprehensive analysis of emission sources, considering the nature and type of activity responsible for emissions. In this work, we only use buses circulating in Rio de Janeiro city as a source of emissions. Thus, we use the Vehicle Specific Power (VSP) model to estimate modal emission rates and associate them with average emission rates for diesel transport buses10. The VSP model defines the power per unit mass of the source (kW/ton), making it a convenient way to estimate a vehicle’s emissions. Several factors, such as vehicle acceleration, road slope, tire rolling resistance, and aerodynamic resistance, influence absolute power11. A limitation of this estimate is that the vehicle’s weight affects the emissions of heavy-duty diesel vehicles12, and passenger counts for the fleet are unavailable. The VSP is computed based on the bus speed sδ, the acceleration or deceleration of the vehicle a, and the slope of the road \(\sin (\alpha )={h}_{\delta }/{d}_{\delta }\). We first compute the acceleration a using Equation (7), in which sδ − sδ−1 is the speed variation, i.e., the difference between the current observation speed and the previous one, and iδ is the interval, i.e., the difference between the current observation time and the previous one.
For two successive observations, the VSPδ is computed using Equation (8). Then, the average modal emission rates for CO2, CO, NOx, and HC are obtained using Table 4.
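A hedged sketch of the acceleration, slope, and VSP computation follows. The acceleration and slope follow the definitions given in the text. The VSP coefficients (9.81, 0.092, and 0.00021) are an assumption: they are the form commonly quoted in the VSP literature for diesel transit buses, with speed in m/s and acceleration in m/s², and the text does not confirm that Equation (8) uses exactly these values. All function names are hypothetical.

```python
G = 9.81           # gravitational acceleration, m/s^2
ROLLING = 0.092    # assumed rolling resistance term
AERO = 0.00021     # assumed aerodynamic drag term


def acceleration(s_prev: float, s_cur: float, i: float) -> float:
    """Speed variation divided by the interval between observations."""
    return (s_cur - s_prev) / i


def sin_slope(h: float, d: float) -> float:
    """sin(alpha) = h / d; a zero distance is treated as a flat road."""
    return h / d if d > 0 else 0.0


def vsp(s: float, a: float, sin_alpha: float) -> float:
    """Vehicle Specific Power in kW/ton under the assumed coefficients."""
    return s * (a + G * sin_alpha + ROLLING) + AERO * s ** 3
```

The resulting VSP value would then be mapped to a VSP mode, and the mode to the average emission rates of Table 4, which are not reproduced here.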
The dataset provides the computed estimates in mass per second, so it is still necessary to multiply these emission rates by the interval between the two observations to obtain the total mass of pollutants emitted. Finally, we add the emissions obtained for the entire bus line journey to obtain the total emissions of CO, CO2, HC, and NOx.
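The aggregation described above can be sketched for a single pollutant as follows. The record layout (line, emission rate in mass per second, interval in seconds) and the function name are hypothetical, and skipping records whose interval is NA is an assumption of the sketch.

```python
from collections import defaultdict
from typing import Iterable, Optional


def total_mass_per_line(
    records: Iterable[tuple[str, float, Optional[float]]]
) -> dict[str, float]:
    """Sum rate * interval per service line, skipping NA intervals."""
    totals: dict[str, float] = defaultdict(float)
    for line, rate_per_s, interval in records:
        if interval is None:  # NA interval: no mass can be attributed
            continue
        totals[line] += rate_per_s * interval
    return dict(totals)
```

For example, a rate of 2.0 units per second over 60 seconds contributes 120 units to the total of its line.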
Data gaps
The path between data generation in the vehicle and the final storage is error-prone and presents several points of failure. Although we developed an extensive dataset, it is incomplete for some periods.
Data Records
The dataset Carioca_MapBus is publicly available in OSF13. Table 5 shows the available dataset and related files for download. The Open Science Framework (OSF) repository, within the research project Carioca_MapBus13, makes available all necessary links to access raw and processed data, file descriptions, GTFS, and external links9, including, among others, population data.
Technical Validation
To ensure the dataset’s reliability, we critically assess three factors: the quality of the information source, discontinuities, and various aspects of the data, as outlined below.
The dataset consists of 3,228 files structured as data summary tables (DST) in Parquet format14, comprising over 9 billion observations across 25 attributes detailed in Table 6. These files are categorized into five types of DST files representing different themes: positioning information (DST-A), city administrative data (DST-B), rainfall data (DST-C), displacement information (DST-D), and emissions (DST-E). One attribute, shared across all DST themes, uniquely identifies observations and correlates DSTs. In DST-A, the attribute ID is a primary key, whereas in DST-B, DST-C, DST-D, and DST-E, the attribute ID is both a primary key and a foreign key to DST-A. In DST-A, two attributes identify the vehicle and the service line within the Rio de Janeiro City Hall, facilitating the identification of observed buses and the monitoring of the consistency of their routes with operational lines.
Attributes are computed based on combinations of time, space, and metrics related to the environment and mobility. Temporal attributes include the bus’s GPS timestamp for tracking observations over time, while spatial attributes encompass latitude, longitude, elevation, administrative region, neighborhood, parking, terminal, corridor, and rainfall zone. One attribute represents the rainfall volume in the last 15 minutes, and six attributes are associated with bus mobility metrics, including 2D and 3D distances, elevation, interval, and instant and average speeds. Finally, four attributes estimate gas emissions for CO, CO2, HC, and NOx. Note that all buses in Rio de Janeiro run on diesel and use ARLA 32 (Automotive Liquid Reducing Agent), a urea-based exhaust-treatment fluid rather than a fuel.
The dataset contains one DST file for each day with data from April 16, 2014, to June 30, 2023, a span of 3,362 days. In this period, 134 days have no information due to acquisition issues, representing less than 4% of the period and leaving the 3,228 files mentioned above. The lack of information has different causes, such as (i) instability in the City Hall server that made the data available, (ii) instability in the server that collected the data, (iii) temporary interruption in the availability of data due to maintenance, and (iv) data corruption during generation on the server.
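The figures quoted in this section can be cross-checked with a few lines of date arithmetic. The sketch below takes the span as the difference between the two dates and subtracts the 134 missing days; the variable names are illustrative.

```python
from datetime import date

# Days between the first and last collection dates quoted in the text.
span_days = (date(2023, 6, 30) - date(2014, 4, 16)).days
missing_days = 134

files_with_data = span_days - missing_days
missing_share = 100 * missing_days / span_days

print(span_days)                 # 3362
print(files_with_data)           # 3228
print(round(missing_share, 2))   # 3.99
```

The result agrees with the 3,228 files reported for the dataset and with the statement that the missing days represent less than 4% of the period.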
The City Hall of Rio de Janeiro divides the city into Rainfall Zones, deploying and monitoring pluviometric measurement sensors. In Fig. 4, Rainfall Zones are displayed, showing the density of observations per area unit. Neighborhoods near Guanabara Bay, associated with the Center Zone of the city, exhibit more observations than those in the West Zone, reflecting the daily commuting pattern where people travel from residential zones to downtown for work.
Figure 5(a) illustrates the average daily observations for each year. There was a peak in 2015, with the average surpassing 4 million daily observations. However, this number drastically reduced to around 2 million by 2020, likely due to the COVID-19 pandemic. Figure 5(b) shows the average time interval between observations per year, with dispersion increasing in 2020 due to the pandemic and decreasing in 2023 due to City Hall improvements. Figure 5(c) displays the distribution of the average number of daily buses in circulation per year, showing a decline over the years, likely correlating with the decrease in observations. A notable decrease in this indicator occurred in 2020, coinciding with the onset of COVID-19 spread in Rio de Janeiro, as depicted by the lower limit of the boxplot. Additionally, Figure 6 illustrates the decline in the number of buses in circulation starting from March 2020, coinciding with the reinforcement of lockdown measures in Brazil.
We specifically focus on observations of buses in circulation, excluding those indicating buses stopped in parking areas or terminals. Figure 7(a) illustrates the average bus speed per shift (dawn, morning, afternoon, and night) for each year, segmented based on Table 8. Night and morning shifts exhibit the highest average speeds, which aligns with expectations given the reduced street activity during those periods, as further evidenced in Fig. 7(b), which portrays the average number of observations for each shift across the years.
Figure 8(a) presents a heat map depicting the total number of observations by neighborhood. Some neighborhoods, such as Freguesia/Jacarepaguá (ID:120), Campo Grande (ID:144), and Centro (ID:5), stand out with more circulating buses. In Fig. 8(b), the heat map represents the mass of CO emitted in each neighborhood throughout the evaluation period, indicating potentially poorer air quality in smaller areas, without taking air circulation into account (more detail in ref. 15).
Fig. 8. Neighborhoods in Rio de Janeiro: (a) shows the number of observations per neighborhood, and (b) depicts the mass of CO gas emitted by vehicles from 2014 to 2023. Lower values are associated with lighter colors and higher values with darker ones. Table 2 provides the correspondence between ID numbers and neighborhoods.
Usage Notes
The Carioca_MapBus dataset provides historical data on urban mobility in Rio de Janeiro. It can enrich various research projects, such as analyzing urban mobility in specific periods or regions or broader studies that incorporate additional information from DSTs. For example, researchers can explore correlations between urban mobility and environmental factors such as rainfall and air pollution. Below, we provide more examples drawn from our research.
One potential research topic involves analyzing bus density during different time slots and in specific city regions. Increased density may cause traffic jams, even if it does not significantly impact average speed. If we find this correlation, we can explore creating specific interval routes or dedicated bus traffic lanes to enhance average speeds in the region.
Rainfall information is also available in the dataset, enabling researchers to evaluate the speed of buses in times of rain and helping to identify rainfall zones that suffer significant consequences for urban mobility due to rain. Public authorities can then direct efforts to these locations and anticipate risks based on rain forecasts.
The dataset also allows rainfall data to be combined with bus observations. This way, researchers can obtain the volume of rain a bus experienced along its route, supporting the identification of possible flooding points.
To exemplify this scenario, we identified the occurrence of heavy rains on February 22, 2023, in the city of Rio de Janeiro, based on a TV news broadcast from RecordTV9. Figure 9(a), taken from the broadcast, shows two buses stopped in a flooded area in the Bonsucesso neighborhood. By searching for this information in the Carioca_MapBus dataset, we identified one of the buses stopped in the flood as the bus on line 917 with identification B51518. Figure 9(c) shows the moving average speed of this bus over the last 5 minutes and the accumulated rainfall it experienced on February 22, 2023. Note that from 8 pm onwards the volume of rain increases, and that between 10 pm and midnight the bus remains practically at a standstill. Figure 9(b) illustrates the region with the stopped vehicle. As reported in the news, the location is Avenida Itaoca in the Bonsucesso neighborhood. This analysis highlights the possibility of using the Carioca_MapBus dataset to identify possible flooding points.
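A trailing moving average such as the one plotted in Fig. 9(c) could be computed along the following lines. The sketch assumes an unweighted mean of all speed samples whose timestamps fall within the previous 5 minutes, which may differ from the smoothing the authors applied; the function name and the sample layout are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def moving_average_speed(
    samples: Iterable[tuple[datetime, float]],
    window: timedelta = timedelta(minutes=5),
) -> list[tuple[datetime, float]]:
    """Mean speed over the trailing window for time-ordered (time, speed) samples."""
    buffer: deque[tuple[datetime, float]] = deque()
    total = 0.0
    result = []
    for ts, speed in samples:
        buffer.append((ts, speed))
        total += speed
        # Drop samples that are older than the window relative to `ts`.
        while ts - buffer[0][0] > window:
            total -= buffer.popleft()[1]
        result.append((ts, total / len(buffer)))
    return result
```

A bus that stays at zero speed for longer than the window drives this average to zero, which is the standstill pattern described for the flooded bus.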
Fig. 9. Buses stopped in a flooded area in the Bonsucesso neighborhood. Source: TV news broadcast from RecordTV (a), and the geo-position information in the database recorded for B51518 bus between 10:00 pm and 11:30 pm (b). The moving average speed in the last 5 minutes (blue) of the B51518 vehicle and the accumulated rainfall (red) experienced by it on February 22, 2023 (c).
Another potential research topic is correlating pollutant gas emissions with bus movements to assist the government in proposing measures that mitigate emissions from the city’s public transportation, such as reducing routes, changing trajectories, or decreasing the number of buses on specific routes. Geotagging and estimated emissions are the building blocks for spatiotemporal and traffic congestion analysis of pollutant emissions.
It is important to note that although buses are the predominant mode of transport in Rio de Janeiro’s urban mobility, they are not the only public transport in the city. However, as seen in Fig. 3(a), the bus lines have an operating network covering the entire city, which supports understanding the city’s urban transport quality.
Overall, the availability of the Carioca_MapBus dataset significantly advances studying urban mobility in Rio de Janeiro, combining data from bus observation with positional information on neighborhoods and rainfall regions, rainfall volumes, and pollutant gas emissions. By providing the dataset construction process, it becomes possible to reproduce the study with new data to observe new insights and improvements.
Code availability
The pipeline that generates the data is published on GitHub and is available9. The query code corresponds to the “dst-retrieval.py” file, and we also provide a PDF document (Readme) that describes its usage, available at https://osf.io/6h3wy.
References
1. Bettencourt, L. & West, G. A unified theory of urban living. Nature 467, 912–913 (2010).
2. Christodoulou, A., Dijkstra, L., Christidis, P., Bolsi, P. & Poelman, H. A fine resolution dataset of accessibility under different traffic conditions in European cities. Scientific Data 7 (2020).
3. Huang, Q. et al. Data descriptor: The temporal geographically-explicit network of public transport in Changchun city, Northeast China. Scientific Data 6 (2019).
4. Verbavatz, V. & Barthelemy, M. Access to mass rapid transit in OECD urban areas. Scientific Data 7 (2020).
5. Verbavatz, V. & Barthelemy, M. Critical factors for mitigating car traffic in cities. PLoS ONE 14 (2019).
6. Louf, R. & Barthelemy, M. How congestion shapes cities: From mobility patterns to scaling. Scientific Reports 4 (2014).
7. Prefeitura do Rio de Janeiro. DATA.RIO. Tech. Rep., https://www.data.rio/pages/histria (2024).
8. Farr, T. G. et al. The shuttle radar topography mission. Reviews of Geophysics 45 (2007).
9. MapBus, C. Carioca_MapBus important links. Tech. Rep., https://osf.io/yptkj/wiki/ImportantLinks (2025).
10. Zhai, H., Frey, H. C. & Rouphail, N. M. A vehicle-specific power approach to speed- and facility-specific emissions estimates for diesel transit buses. Environmental Science and Technology 42, 7985–7991 (2008).
11. Koupal, J., Cumberworth, M., Michaels, H., Beardsley, M. & Brzezinski, D. Draft design and implementation plan for EPA’s multi-scale motor vehicle and equipment emission systems (MOVES). Tech. Rep., https://nepis.epa.gov/Exe/ZyPDF.cgi?Dockey=P1000527.PDF (2002).
12. Clark, N. N., Kern, J. M., Atkinson, C. M. & Nine, R. D. Factors affecting heavy-duty diesel vehicle emissions. Journal of the Air and Waste Management Association 52, 84–94 (2002).
13. MapBus, C. Carioca_MapBus: Dataset on bus mobility and environmental indicators from Rio de Janeiro. Tech. Rep., https://doi.org/10.17605/osf.io/yptkj (2025).
14. Apache Software Foundation. Parquet Specifications. Tech. Rep., https://arrow.apache.org/docs/format/index.html (2024).
15. MapBus, C. Code for plotting neighborhoods in Rio de Janeiro. Tech. Rep., https://osf.io/yptkj/files/osfstorage/68532dc3db041d479445887b (2025).
Acknowledgements
The authors thank Cefet/RJ, FAPERJ, CAPES, and CNPq. Finance code: FAPERJ (Grant E-26/290.123/2021) and the Fundação para a Ciência e a Tecnologia, I.P. (Portuguese Foundation for Science and Technology) through the project UIDB and UIDP 05064/2020 (doi.org/10.54499/UIDB/05064/2020) (VALORIZA – Research Centre for Endogenous Resource Valorization).
Author information
Contributions
All authors contributed equally to the study, and all authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Carvalho, D., Vancellote, V., Casais, P.M. et al. Dataset on bus mobility and environmental indicators from Rio de Janeiro. Sci Data 12, 1569 (2025). https://doi.org/10.1038/s41597-025-05755-6