Background & Summary

The building sector, including residential and commercial buildings, accounts for approximately one-third of global energy consumption and carbon dioxide (CO2) emissions1. To improve the energy efficiency of building operations, data-driven approaches have been widely used for building load forecasting2, occupant behavior modeling3, machine-learning-based control of building systems4,5,6, building analytics7, and energy management8. To support the development of these data-driven approaches, high-fidelity data with detailed building operational information is essential.

With the deployment of building management systems (BMS) and smart meters in buildings, various open-source datasets are available with different levels of fidelity. Most datasets focus on building energy consumption9,10,11,12. Several datasets have been released with a focus on occupancy data13,14,15,16. More comprehensive datasets have also been proposed that consist of energy consumption, indoor environment, occupancy, weather conditions, and heating, ventilation, and air conditioning (HVAC) operations17,18,19. However, only a very limited number of datasets are reported for ultra-low energy buildings with low-energy and passive technologies, such as natural ventilation, ground source heat pumps, and thermally active building systems (TABS), which play a crucial role in achieving the carbon-neutral goal for the building sector. Agee, Nikdel and Roberts20,21 proposed a dataset for a zero-energy building that consists of energy uses, photovoltaic (PV) production, and building air leakage data, but it does not include detailed HVAC system operational data. Schweiker, Kleber and Wagner22,23 introduced a dataset for a naturally ventilated office building. However, the operation status of the manually operated windows is recorded only as closed or open, without the window opening percentage, which may be required to develop natural ventilation prediction and smart window controls. Therefore, a dataset containing detailed system operational information of ultra-low energy buildings with low-energy and passive technologies is needed.

This paper describes a high-fidelity dataset which provides granular insights into the performance and operations of an ultra-low energy building. With its intricate sensor network, the building captures diverse performance parameters as listed in Table 3. The dataset includes the following unique aspects compared to existing datasets in this area:

  • It provides data of integrated low-energy building systems, such as natural ventilation with automatic windows combined with geothermal powered TABS for heating and cooling, automatic operable skylights, and PV systems. To the best of our knowledge, data of such a low-energy naturally ventilated building combined with geo-powered TABS has not been reported in existing datasets.

  • It provides data from an extensive sensor network, including not only energy uses and indoor environment data as reported in existing datasets, but also detailed system operational data. Examples include window openings, temperature and flowrate of both source side and load side water loops of the heat pump, as well as outdoor sensors, such as localized weather stations and building façade temperatures that provide the boundaries of the microclimate.

  • It provides data of zone-level BTU meters for the TABS, which has rarely been reported in existing datasets. This data helps understand the zone-level thermal load and the detailed operations of the TABS for each zone, e.g., water temperature, flowrate and valve status, in response to disturbances.

In summary, this dataset provides high-fidelity data regarding micro-climatic conditions, façade temperature, zone-level TABS, and thermal information with loads for a naturally ventilated building that utilizes geothermal heating and cooling. With this dataset, users will be able to better understand the operations of such integrated low-energy and passive building technologies in a real building and thereby develop advanced methods/algorithms to better design and operate such systems. For example, similar to the analysis in6, the user can investigate the operational performance of the coupled natural ventilation (NV) and TABS and identify operation issues and potential improvement strategies, which may provide valuable information for researchers or operators of other similar buildings.

Methods

The building, called HouseZero® (see Fig. 1), was retrofitted from a pre-1940s house into an office building and living laboratory that functions as a prototype for ultra-efficient buildings. It has a total floor area of 356 m2.

Fig. 1 The office building HouseZero® in Cambridge, Massachusetts, USA.

HouseZero® consists of four floors: basement (level 0), first floor (level 1), second floor (level 2), and third floor (level 3), and all zones are identified in Fig. 2. The basement is located at the underground lower level and has a large conference room, a server room, and a kitchenette. The first floor has direct access to the main entrance with semi-open spaces, as well as an open lounge which is designed for five occupants. The second floor is an open workspace, designed for 16 occupants. Lastly, the third floor is connected to the open lounge, with one laboratory (‘Live Lab’) and other workspaces designed for two occupants. The ‘Live Lab’ is designed to conduct room-scale experiments with functionality that represents the systems and operation of the building. In addition, it has the capability for experimentation with different façade systems, as the entire window system is designed to be removed and replaced with other experimental systems.

Fig. 2 Layout of each floor and thermal zones.

The building integrates multiple low-energy technologies, including natural ventilation with automatic windows and operable skylights, an automatic light system, TABS, a heat pump, a geothermal system, solar PV, and a hot water system, as well as a sensor network and advanced controls24. Figure 3 describes the system configurations at HouseZero®. For a detailed description of the network, systems and controls, please refer to7,8,24.

Fig. 3 Schematics of the building systems in HouseZero®.

The workflow of data collection and processing is shown in Fig. 4. HouseZero® has dedicated Building Automation System (BAS) server networks for controls and performance optimization. The raw dataset was first downloaded from the BAS servers and then processed in four steps.

Fig. 4 Workflow of the data collection and processing.

The first step is outlier filtering. Outliers were identified using four methods: Absolute Difference25, Z-score26,27, DBSCAN, and Mahalanobis distance28. The absolute difference method flags values that differ sharply from the preceding data points, which helps pinpoint distinct spikes in the raw data. The Z-score method identifies abnormal data: points whose absolute Z-score exceeded a threshold (typically 3, i.e., three standard deviations from the mean) were excluded. DBSCAN groups closely packed data points into dense clusters and labels points that do not belong to a cluster with at least 15 neighbors as outliers. The Mahalanobis distance measures a data point’s distance from the center of the distribution while accounting for the covariance between variables; points with a distance exceeding a limit are deemed outliers. This limit is derived from the Chi-squared distribution, based on the chosen significance level (between 0.01 and 0.05) and the number of variables. This method is especially effective for datasets with correlated variables and was thus applied to zone temperature, slab temperature, and other BTU-related data.
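As an illustration of three of the outlier tests described above, the following minimal sketch uses only the Python standard library. It is not the released processing code: the function names, the `max_jump` parameter, the choice to compare each point against the last accepted value, and the restriction of the Mahalanobis test to two variables are all assumptions made for this example. With two variables, the Chi-squared limit has the closed form -2 ln(alpha), which avoids an external statistics package. DBSCAN is not shown.

```python
import math
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def abs_diff_outliers(values, max_jump):
    """Return indices that jump more than max_jump from the last accepted value.

    Assumption: the first value is trusted and flagged points are not used
    as the reference for later comparisons.
    """
    flagged = []
    last_good = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - last_good) > max_jump:
            flagged.append(i)
        else:
            last_good = values[i]
    return flagged


def mahalanobis_outliers_2d(points, alpha=0.05):
    """Flag (x, y) points whose squared Mahalanobis distance exceeds the
    Chi-squared limit for two variables at significance level alpha."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    det = sxx * syy - sxy ** 2
    if det <= 0:
        return []  # degenerate covariance, distance undefined
    limit = -2.0 * math.log(alpha)  # Chi-squared quantile, 2 degrees of freedom
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > limit:
            flagged.append(i)
    return flagged
```

For example, `zscore_outliers([21.0] * 19 + [60.0])` flags only the last index, because a single spike among 19 identical readings has a Z-score of about 4.4.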

The second step is data aggregation, in which the data are resampled to hourly intervals for several reasons. The main reason is that hourly data, such as typical meteorological year weather files and operational data, are commonly used in practice and in research studies for building energy simulation. Moreover, at hourly intervals the dataset has a manageable size and is easier to use. However, it is noteworthy that hourly aggregation may lose the fidelity of short-term control dynamics, such as the winter pulse ventilation, which occurred on an hourly basis with a window opening duration of 30 seconds.
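The loss of short-term dynamics can be made concrete with a small sketch. It assumes that hourly values are arithmetic means of the raw samples (the aggregation function is an assumption here) and that the raw window signal is sampled every 30 seconds; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def resample_hourly_mean(samples):
    """Average (timestamp, value) samples into hourly bins.

    Samples with value None are skipped; hours without any valid sample
    are simply absent from the result.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if value is not None:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            buckets[hour].append(value)
    return {hour: fmean(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical pulse ventilation: window fully open (100 %) for one
# 30-second sample, closed for the remaining 119 samples of the hour.
pulse_start = datetime(2023, 1, 15, 9, 0, 0)
pulse_samples = [
    (pulse_start + timedelta(seconds=30 * k), 100.0 if k == 0 else 0.0)
    for k in range(120)
]
pulse_hourly = resample_hourly_mean(pulse_samples)
```

Under these assumptions the hourly mean opening is 100/120, or about 0.83 %, so a full window opening of 30 seconds is almost invisible in the hourly record.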

The third step is data imputation. During resampling to hourly intervals, a mean fill technique is used to handle short-term missing data, such as values removed as outliers, and to complete the dataset. This technique identifies each missing interval and calculates the average of the last non-missing value before the gap and the first non-missing value after it. This average is then used to fill the missing values. For long-term missing data, where the missing period is longer than a threshold, the data points are left blank without imputation and the missing periods are documented in the data report. The thresholds that define long-term missing data differ between sensors and can be found in the data processing Python code provided in the Section “Code Availability”.
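The mean fill rule described above can be sketched as follows. This is an illustrative version, not the released code: the function name and the `max_gap` parameter (the long-term threshold, in hours) are assumptions, and gaps at the very start or end of the series are left blank because they lack a value on one side.

```python
def mean_fill(values, max_gap):
    """Fill each gap of None values with the mean of its two neighbors.

    Gaps longer than max_gap points, and gaps touching either end of the
    series, are left as None.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first non-missing value after the gap
        has_both_neighbors = start > 0 and end < n
        if has_both_neighbors and (end - start) <= max_gap:
            fill_value = (filled[start - 1] + filled[end]) / 2.0
            for j in range(start, end):
                filled[j] = fill_value
    return filled
```

With `max_gap=3`, the series `[20.0, None, None, 22.0, None, None, None, None, 24.0]` becomes `[20.0, 21.0, 21.0, 22.0, None, None, None, None, 24.0]`: the two-hour gap is filled with the neighbor mean, while the four-hour gap stays blank.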

Finally, repetition filtering removes values that repeat consecutively for more than a set number of hours, based on thresholds tailored to assumptions about sensor errors and data characteristics. Such repetition is assumed to indicate a sensor malfunction or a server outage, so the repeated values were recorded as missing data to preserve data integrity. Repetition filtering was conducted as the last step rather than the first so that the removed data points would not be filled again by the imputation in the third step. Table 1 summarizes the methods applied to the different sensor data described in this section.
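A minimal sketch of the repetition filter is given below. The function name and the `max_repeat` parameter are hypothetical, and blanking the entire run (rather than keeping its first value) is an assumption of this example; the sensor-specific thresholds are those in the released code.

```python
def filter_repetition(values, max_repeat):
    """Replace runs of identical consecutive values longer than
    max_repeat points with None. Existing None values are left as is."""
    cleaned = list(values)
    n = len(values)
    start = 0
    while start < n:
        end = start + 1
        if values[start] is not None:
            # Extend the run while the same value keeps repeating.
            while end < n and values[end] == values[start]:
                end += 1
            if end - start > max_repeat:
                for j in range(start, end):
                    cleaned[j] = None
        start = end
    return cleaned
```

For instance, `filter_repetition([21.0, 21.5, 21.5, 21.5, 21.5, 22.0], max_repeat=3)` blanks the four repeated readings of 21.5 and keeps the first and last values.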

Table 1 Summary of abnormality detection and filtering methods for different sensors.

As shown in Table 2, sensor data is organized within the database into three main categories: outdoor sensors, indoor sensors, and sensors for systems, each tracking various environmental and system-related parameters. Outdoor sensors provide hourly data on local weather conditions and façade temperatures, while indoor sensors measure zone and slab temperatures, CO2 concentrations, and relative humidity across different zones. Sensors for systems offer detailed insights into building system operations, including heat pumps, domestic hot water systems, lighting, and other loads, as well as operational data from geothermal systems, TABS, window openings, and valve statuses.

Table 2 Folder Structure of the Dataset.

To ensure transparency and reproducibility, the missing periods were documented in a data report. The missing periods were identified during data processing and may be due to sensor malfunctions, server outages, or other technical issues. The report includes the start and end dates of each missing period, as well as the names of the affected data points, so that the processed data can be used reliably and accurately in future analysis.

Data Records

The time-series data with one-hour intervals are in CSV format. The data is hosted at Figshare29. A data report documenting the detailed description of the dataset is available in the same repository29.

This section provides a description of the data and special events that occurred within the data collection period. Table 3 includes all available sensors and missing data percentages. The data cover the period from June 2022 to the end of May 2024 (Year 1: June 2022 to May 2023; Year 2: June 2023 to May 2024). The last two columns indicate the percentage of data missing across Year 1 and Year 2 for different sensor data. In this paper, the outliers and missing data are reported in a separate file.
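A missing data percentage of the kind reported per year can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the input is taken to be a mapping from hourly timestamps to values (with `None` or an absent key meaning missing), and the percentage is taken relative to the number of hours in the period. Note that Year 2 contains 29 February 2024 and therefore has 8784 hours rather than 8760.

```python
from datetime import datetime, timedelta


def expected_hours(start, end):
    """Number of whole hours in the half-open period [start, end)."""
    return int((end - start).total_seconds() // 3600)


def missing_percentage(records, start, end):
    """Percentage of hourly slots in [start, end) with no valid value.

    records maps hourly datetime stamps to values; a missing key or a
    None value both count as missing.
    """
    total = expected_hours(start, end)
    present = sum(
        1
        for k in range(total)
        if records.get(start + timedelta(hours=k)) is not None
    )
    return 100.0 * (total - present) / total


YEAR_1 = (datetime(2022, 6, 1), datetime(2023, 6, 1))
YEAR_2 = (datetime(2023, 6, 1), datetime(2024, 6, 1))
```

Calling `missing_percentage(records, *YEAR_1)` on a sensor's hourly records would give that sensor's Year 1 value under these assumptions.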

Table 3 Summary of released data and missing data percentages.

While the building has more than 300 sensors and meters, this paper releases data from 189 sensors and meters that are closely related to the main operational performance of the building. Outdoor sensors include two localized weather stations and nine façade temperature sensors (see Fig. 5 for the locations of the façade temperature sensors). One weather station is installed on the roof of HouseZero® and the other on the roof of a nearby building. These weather stations independently measure outdoor weather conditions, and their readings are cross-checked against each other for consistency.

Fig. 5 Locations of the building façade temperature sensors.

Indoor sensors monitor zone-level air temperature, CO2, and relative humidity, as well as slab temperature. Sensors for systems are related to operational status of the integrated building systems shown in Fig. 3. Meters have been installed to monitor all load breakdowns from individual breakers, including PV, loads and net meters. Multiple meters are installed in the building for cross-validation of the electric loads to enhance accuracy and ensure data fidelity. For example, in addition to the electrical provider utility meter, a net meter was installed to validate both the PV production and loads as well as to provide the export and import electrical data. BTU meters are included for the radiant floor system in each zone, as well as the heat pump and geothermal systems. The status of each individual window and valve is also monitored.

Over the course of the data collection period, certain meters and sensors experienced temporary interruptions and data loss. Table 4 summarizes the operational issues over the two years. This table is derived from the building operation log, which provides additional information to help understand the causes of the missing data. The events that may cause missing data include system updates, snow covering, sensor issues, server communication issues, database maintenance issues, and other issues as listed in Table 4.

Table 4 Summary of the issues causing missing data.

Technical Validation

The dataset supports completeness by minimizing temporal gaps over the two-year data collection period, and the building is equipped with the sensors needed to capture critical data such as temperature, weather, system operations, and window controls.

In this section, examples of data processing results and data samples from different sensors and meters are presented to demonstrate the data quality and coverage of the dataset. To ensure the validity and soundness of the dataset, the raw data has been processed using the methods described in the Methods section. Figures 6 and 7 show examples of the detection of outliers and data repetition in the raw data, which demonstrate the efficacy of our data processing methods and support the dataset’s reliability.

Fig. 6 An example of outlier filtering using Z-score.

Fig. 7 An example of data repetition filtering.

The pie charts in Fig. 8 depict the breakdown of energy end uses over the course of two years. The annual energy consumption was 38.5 kWh/m2 and 36.3 kWh/m2 for Years 1 and 2, respectively. In both years, the IT load was the dominant end use, followed by heating and plug loads.
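To put the annual figures in absolute terms, the short calculation below converts the reported intensities into annual totals. It assumes that the intensities are normalized by the 356 m2 total floor area given in the Methods section, which the text does not state explicitly; the function names are hypothetical.

```python
FLOOR_AREA_M2 = 356.0  # total floor area from the Methods section (assumed basis)


def annual_energy_kwh(intensity_kwh_per_m2, area_m2=FLOOR_AREA_M2):
    """Convert an annual energy intensity (kWh/m2) to a total in kWh."""
    return intensity_kwh_per_m2 * area_m2


def percent_change(previous, current):
    """Relative change from previous to current, in percent."""
    return 100.0 * (current - previous) / previous


year1_kwh = annual_energy_kwh(38.5)   # about 13,706 kWh
year2_kwh = annual_energy_kwh(36.3)   # about 12,923 kWh
change_pct = percent_change(38.5, 36.3)  # about -5.7 %
```

Under this assumption, Year 2 consumption is roughly 5.7 % lower than Year 1.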

Fig. 8 Pie chart of the breakdown energy end uses for two years.

Figures 9 and 10 present the daily load trends observed on sample summer and winter days in Year 1. In Year 1, the cooling load remained at an average of approximately 0.2 kWh. The largest proportion of the total load during summer was attributed to IT and plug loads. Meanwhile, cooling, control, and other loads exhibited a constant energy demand. During the winter season, heating was the largest load and IT loads were the second-largest contributor to the total load. The other loads were stable and remained largely similar on both typical summer and winter days.

Fig. 9 Daily pattern of electricity end uses in a sample summer day.

Fig. 10 Daily pattern of electricity end uses in a sample winter day.

BTU meters measured the water flowrates (in gallons per minute, GPM), energy rates (in kilo British Thermal Units per hour, kBTU/h), and supply and return water temperatures. Figure 11 illustrates the energy rate data from BTU meters in different zones on a sample winter day as an example. The energy rate data reflect the heating or cooling demand of different zones. In winter, the variation of the heating demand depends on the slab temperature and the slab temperature setpoint.
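The relationship between the three measured quantities can be checked with the common rule of thumb for water, in which the heat rate in BTU/h is approximately 500 times the flowrate in GPM times the temperature difference in °F. The sketch below applies it; it is not a description of how the installed meters compute their output. The function names are hypothetical, the factor of 500 is an approximation for water near room temperature, and temperatures are assumed to be given in °C.

```python
BTU_PER_H_PER_GPM_DEGF = 500.0   # approx. 8.33 lb/gal * 60 min/h * 1 BTU/(lb*degF)
KW_PER_KBTU_H = 0.29307107       # 1 kBTU/h expressed in kW


def energy_rate_kbtu_h(flow_gpm, supply_c, return_c):
    """Approximate heat delivered by a water loop, in kBTU/h.

    Positive when the supply is warmer than the return (heating),
    negative when it is colder (cooling).
    """
    delta_t_f = (supply_c - return_c) * 1.8  # temperature difference in degF
    return BTU_PER_H_PER_GPM_DEGF * flow_gpm * delta_t_f / 1000.0


def kbtu_h_to_kw(rate_kbtu_h):
    """Convert an energy rate from kBTU/h to kW."""
    return rate_kbtu_h * KW_PER_KBTU_H
```

As a usage example with a hypothetical flowrate of 2 GPM, a supply temperature of 26.7 °C and a return temperature of 22.7 °C give about 7.2 kBTU/h, or roughly 2.1 kW.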

Fig. 11 Daily pattern of energy rate data from a BTU meter in a sample winter day.

As shown in Fig. 12, the indoor CO2 concentration exhibits a diurnal cycle, with concentration levels varying throughout the day across multiple data series. For most of the zones, the CO2 level rose in the morning as occupants entered the office, reached a peak in the afternoon, and decreased from the evening onward. The pattern indicates a potential correlation with the daily occupancy profile of the building.

Fig. 12 Daily pattern of the indoor CO2 concentrations in a sample summer day.

Figure 13 presents the relative humidity (RH, %) across various zones in a sample summer day. The RH values remained relatively constant, with slight fluctuations during the day. The pattern showed a potential correlation with the occupant schedule.

Fig. 13 Daily pattern of relative humidity in a sample summer day.

Figure 14 illustrates the temperature trends and CO2 variation with natural ventilation during the passive mode. The windows were controlled based on indoor and outdoor air temperatures as well as indoor CO2 concentration to maintain the indoor air temperature and CO2 concentration within the comfortable range or acceptable level. During nighttime, the windows were operated for free cooling with night flushing to further improve the energy efficiency.

Fig. 14 Indoor temperature and CO2 with natural ventilation in the passive mode.

Figure 15 shows the monthly PV production and solar radiation in Year 1. Overall, the PV production and solar radiation follow the same trend throughout the year. A new inverter with higher efficiency was installed in early 2023, which contributed to the improved PV efficiency from March 2023.

Fig. 15 Monthly PV production of Year 1.

Figure 16 shows how TABS operated on a sample summer day. Cold water with an average temperature of 19.6 °C drawn from the geothermal well was supplied directly to the building and circulated through the piping system in the slab. The mean temperature of the return water, which carried heat away from the indoor space, was about 21.1 °C. The slab temperature was slightly lower than the return water temperature. As a result, the indoor temperature was maintained between 24 and 26 °C.

Fig. 16 Operation of the TABS in a sample summer day of Zone 23.

Figure 17 shows how TABS operated on a sample winter day. The hot water with an average temperature of 26.7 °C generated from the ground source heat pump was supplied to the slab-embedded piping systems. The supply water temperature was controlled as a function of the outdoor temperature. The water returned at a temperature of around 22.7 °C. The slab temperature and indoor temperature remained at approximately 22 °C and stayed within the comfort zone.

Fig. 17 Operation of the TABS in a sample winter day of Zone 23.

Figure 18 shows how the heat pump operated on a sample winter day. The detailed operational performance of the heat pump was monitored, including the power consumption, supply and return water temperatures, and flowrates for both the house side and the geo side, which supports the analysis of heat pump operations and the development of prediction models and advanced controls. Part of the data is illustrated in this figure as an example.

Fig. 18 Operation of the heat pump in a sample winter day.

This dataset contributes to the development of sustainable building practices in the following ways. First, the dataset provides a building energy benchmark for an ultra-efficient building with integrated low-energy and passive building technologies. Second, through its detailed operational data, the dataset provides a better understanding of the operations of an ultra-efficient building with integrated natural ventilation and geo-powered TABS. Potential uses include load shape analysis, energy prediction, system operational pattern analysis, and data-driven modeling. In addition, the dataset can be used for the validation of building simulation models and the development of learning-based control algorithms. The operation data of the building demonstrates the effectiveness of NV combined with TABS for maintaining a comfortable indoor environment while achieving high energy efficiency, which helps promote the wider adoption of such low-energy technologies.

The use of this dataset is subject to some limitations due to the specific boundaries of the building. First, the dataset depends on the configurations and characteristics of the subject building, such as occupancy, design, and envelope/materials. Second, the dataset depends on the local weather conditions of climate zone 5A. Third, the dataset depends on sensor settings, such as locations and resolution/accuracy.