Background & Summary

High-quality spatiotemporal characterization of solar energy systems (photovoltaic–PV and concentrating solar power–CSP) has historically been sparse, incomplete, or held behind privacy barriers or paywalls. This gap in data availability has hindered regional- and global-scale analysis on a key component of the growing and diversifying energy landscape, and inhibits solar energy design, distribution, and monitoring investigation. Recently, numerous groups have attempted to fill this gap using remote sensing, manual digitization, crowdsourcing, and machine learning techniques to spatiotemporally characterize solar energy across the globe (e.g., refs. 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40). There are also others maintaining databases of value-added attributes for a variety of applications (e.g., refs. 41,42,43,44,45,46,47,48,49,50,51). However, dataset availability, quality, and completeness vary widely, leaving key solar energy design information unknown for the broader scientific community. We aim to fill this gap by compiling a harmonized spatiotemporal dataset of ground-mounted solar energy arrays in the United States (US). We go further using high-spatial resolution aerial imagery alongside high-temporal resolution satellite imagery to independently estimate a suite of installation design metadata that contributes new knowledge on the solar energy landscape.

Understanding the location and design of current renewable energy infrastructure allows for more effective modeling, monitoring, and planning efforts for future infrastructure. The United States Large-Scale Solar Photovoltaic Database (USPVDB) is the most comprehensive publicly available, regularly updated, and standardized dataset of georectified utility-scale solar arrays in the US26,52. Importantly, this database contains valuable permitting data from the US Energy Information Administration (EIA) Form 860 including installation year, installed capacity, mount technology (fixed-axis, single-axis tracking, or dual-axis tracking), tilt, azimuth, prior land use, agrivoltaic acceptance, and more. Although USPVDB is the current best available solar metadata dataset in the US, it does have limitations. USPVDB reports utility-scale (≥1 MWDC) solar PV installations, which comprise the majority of installed capacity in the US26. However, this database omits the more numerous and distributed commercial-scale (<1 MWDC) solar PV projects53,54 and CSP installations. These limitations leave critical data gaps in our understanding of distributed energy resources and the solar energy landscape. There are key differences in the ecosystem service and economic land use trade-offs between commercial- and utility-scale installations55. Proportionally, commercial-scale systems tend to reside on cropland more often than utility-scale installations14 and experience less regulation and oversight56. Recent work has also shown that reported metadata can be incomplete or contain errors53 and underestimates the total extent of installed solar energy57, thus there is a need for a comprehensive and independent metadata characterization of this rapidly deployed technology.

Kruitwagen et al.14 produced the first global and publicly available geospatial dataset of solar energy installations. The follow-on product, the TransitionZero Solar Asset Mapper (TZ-SAM), is an open-access, global, and regularly updated dataset of commercial- and utility-scale solar facilities, derived using machine learning with Copernicus Sentinel-2 imagery (10 m), trained and validated on existing and hand annotated datasets36. OpenStreetMap (OSM), one of the richest geographical databases in existence, also provides access to commercial- and utility-scale solar arrays and panel-row data generated by open collaboration–crowd-sourced hand annotation of aerial and satellite imagery58–that has been used in a number of previous solar data acquisition efforts (e.g., refs. 8,9,14,36,40). There are concerns about the spatial quality and consistency of medium-coarse resolution remote sensing and crowd-sourced solar array delineation17,26,36. For example, remote sensing and crowd-sourced datasets are known to overestimate the total area of an array due to ambiguous array definitions or to classification of medium resolution satellite imagery9,17,26. Yet, datasets like TZ-SAM and OSM are critical for filling temporal, scale, and reporting bias limitations. Together, USPVDB, TZ-SAM, OSM, and similar high-fidelity spatial delineation and metadata acquisition efforts provide the foundation for understanding the renewable energy landscape and for the dataset presented here.

Solar array siting, management, and design choices have long-term impacts on electricity production, the physical landscape, and related ecosystem services. Most often, available datasets derived from permitting data, remote sensing, or even manual annotation stop at the project or array scale. However, panel-row design metadata would allow for the scaling of in-depth design and design-impact analyses that are often completed at a single system level56 and thus limited by a lack of high-resolution data. Several modern tools and approaches have been published that aim to optimize solar designs for electricity production59,60,61,62, co-production of electricity and vegetation63,64,65, and stormwater runoff66,67,68,69,70. Along with numerous other tools, these models depend on array location and design characteristics that have not been widely available prior to this work.

Here, we leverage the best available datasets and databases to compile a comprehensive ground-mounted solar array dataset that is up to date, open access, and not limited to utility-scale capacity. We also compile existing panel-row datasets and, where available, use high spatial-resolution imagery to delineate new panel-row objects within solar array bounds. We use this new panel-row delineation to standardize and enhance existing array and panel-row boundaries, addressing concerns about accuracy and harmonization of manually digitized datasets and medium to coarse remote sensing-derived datasets17,26,36. We add value to the dataset by independently estimating several array- and panel-row attributes including installation year, installed capacity, mount technology, ground cover ratio (GCR), fixed-axis tilt, panel-row azimuth and dimensions, and inter-row spacing.

The Ground-Mounted Solar Energy in the United States (GM-SEUS) v1.0 dataset contains 15,017 ground-mounted solar PV and CSP arrays covering 2,944 km2. The dataset includes 9,631 utility-scale arrays composing an estimated 184.2 GWDC and 5,386 commercial-scale arrays composing an estimated 2.1 GWDC, making this the largest publicly available US solar repository to date (Fig. 1). For 9,042 arrays (83.1 GWDC), we delineated 2.92 million high-quality solar panel-rows, thus improving array geometries and providing sub-array design metadata. Collectively, the solar panel-row geometries compose 466 km2 in total area. Including harmonized metadata, 42% of arrays were fixed-axis, 21% were single-axis, 2.1% were dual-axis, 2.6% were mixed, and 33% were unknown. Solar PV GCR varies with mount type, and on average was 53% for fixed-axis, 42% for single-axis tracking, 50% for dual-axis tracking, and 63% for arrays with mixed mounts (GCR1).

Fig. 1

GM-SEUS solar array distribution. GM-SEUS arrays are zonally grouped by number of solar installations within Uber H3 hexagons (resolution 5). H3 hexagons at this resolution in the region are 260 ± 37 km2.

The goal of this effort is to provide researchers and policymakers with a distributed ground-mounted solar array dataset and panel-row design metadata on the existing US solar energy landscape. Additional use cases may include nowcast modeling and transmission planning71,72,73,74,75; grid pricing and incentive planning76; tracking policy effectiveness77; evaluating property value impacts78,79; assessing public perception towards projects with varying scale and levels of community engagement80; tracking agricultural production opportunity costs55; modeling carbon storage and sequestration potential81; modeling potential for habitat connectivity and pollination services82; pattern recognition and deep learning semantic segmentation models3,36; soiling and performance monitoring83,84,85; repowering preparation86; and tracking spatiotemporal material stock and recycling potential in existing infrastructure87,88,89,90.

Several efforts have extracted panel-row design metadata (e.g., refs. 21,23,28,53,91,92) but this is the first endeavor to provide a publicly-available dataset of this magnitude and spatial coverage. Importantly, the dataset is open access with all code and data available for training and acquisition of new array datasets both in the US and other countries. Greater knowledge on global solar PV panel-level distribution would enhance use cases reported here to the global PV market and all impacted landscapes. We intend to update this dataset annually and invite others to continue to introduce new value-added attributes to this dataset.

Methods

The five development phases of GM-SEUS are outlined in Fig. 2 and described in detail in the following Methods and Technical Validation sections. In phase 1, Compile Existing Geospatial Data, we collected and harmonized freely available solar array and panel-row datasets with existing geospatial data, removing duplicates. In phase 2, Georeference Metadata, we spatially referenced value-added attributes from solar array point data to polygon boundaries by proximity and manual digitization, delineating new array boundaries where necessary. Phase 3, Acquire Panel-Row Shapes, involved using a combination of image analysis approaches to classify solar panel-rows using high-resolution aerial imagery combined with existing panel-row data. In phase 4, Enhance Design Metadata, we used this panel-row data to generate new array boundaries and estimate several design attributes using remote sensing and geospatial analysis. Finally, in phase 5, Validate and Share, we performed technical validation of the derived design metadata and new spatial boundaries using high-quality reference datasets, while also ensuring that we maintain open access data principles.

Fig. 2

GM-SEUS development workflow diagram. This generalized workflow progresses from left to right and top to bottom within each section, with each bullet summarizing major steps within each development phase. GM-SEUS development uses numerous existing geospatial sources outlined in Table 1 and Google Earth Engine cloud repository and computing resources. Note that FAIR data is Findable, Accessible, Interoperable, and Reusable.

Compiling existing geospatial data

Existing solar array data in the US

We compiled a dataset of distributed multi-scale ground-mounted solar arrays across the contiguous US (CONUS) complete through December 2024. We chose to use only freely available datasets to ensure availability of our results. We used existing ground-mounted solar array datasets in the US that contained explicit array polygons. Each dataset is unique in coverage and metadata completeness and was created for a distinct purpose. We collected data from the following open repositories: The United States Large-Scale Photovoltaic Database v2.0 (USPVDB)26,52, project and panel-row annotations from OpenStreetMap (OSM)58, the TransitionZero Global Solar Asset Mapper Q3-2024 (TZ-SAM)36,57, a California Central Valley solar PV dataset (CCVPV)21,93, and a Chesapeake Watershed solar dataset (CWSD)25,94.

We defined a solar array footprint or boundary as adjacent, existing, and connected solar panel-rows (PV or CSP) of the same installation year including the inter-row spacing between them (Fig. 3). We qualitatively assessed the spatial quality of input array geometries based on adherence to our array boundary definition (Fig. 3), inferred from reported delineation methods and visual inspection. High-quality sources reported clearly documented methods that aligned with our array footprint definition. Lower quality sources were those with less certain boundary definitions or those derived from medium-resolution imagery, often delineating project boundaries rather than array footprints. The resulting order of adherence was: USPVDB, CCVPV, CWSD, OSM, and TZ-SAM, with USPVDB, CCVPV, and CWSD most closely following the provided definition of an array.

Fig. 3

Conceptual hierarchical system boundaries and solar panel-row metadata logic. Green boundaries indicate the conceptual boundary for each term. This study reports on the geospatial and temporal characteristics of panel-rows and arrays. A panel-row is a spatially unique collection of one or more panel-assemblies connected by proximity and often sharing one mount, but not necessarily electrically connected. An array is composed of one or more adjacent rows of the same installation year, and the row-spacing between them. The cell, panel, assembly, and project are not the system boundaries focused on in this study. The ratio of the long-edge to the short-edge is the L/W ratio. Note that these conceptual system boundaries represent crystalline-silicon solar PV. Thin-film and CSP panel-rows tend to present spatially similar patterns at the panel-row and array level but differ in their internal system components.

We also compiled existing value-added solar energy datasets that contained location (latitude and longitude) spatial data without explicit array boundaries. These data were from the following open repositories: the National Renewable Energy Laboratory (NREL) Agrivoltaic Map from the InSPIRE initiative45, the Lawrence Berkeley National Lab (LBNL) Utility-Scale Solar (USS), 2024 Edition report49, the NREL Photovoltaic Data Acquisition initiative (PVDAQ)46, the International Energy Agency and NREL hosted Solar Power and Chemical Energy Systems (SolarPACES) initiative CSP.guru data product47,95, Global Energy Monitor’s (GEM) Global Solar Power Tracker (GSPT)48, and the World Resources Institute’s Global Power Plant Database v1.3.0 (GPPDB)42,96. Some of these datasets were compilations of each other and existing datasets including the EIA Form 860, Wiki-Solar41, and various other regional and global sources including those used here. We joined these locations with existing array boundaries using a 190 m radius, the distance at which ~75% of solar location data is associated with an existing array53. For all geospatial datasets, we excluded decommissioned arrays where that information was available. Existing array and panel-row data sources are described in Table 1.
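The proximity join described above can be illustrated with a minimal sketch. It is not the authors' workflow: it assumes each array is represented by a single latitude and longitude rather than a polygon boundary, uses a spherical-Earth haversine distance, and the function names `haversine_m` and `join_points_to_arrays` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def join_points_to_arrays(points, arrays, radius_m=190.0):
    """Map each point id to the nearest array id within radius_m, else None.

    points and arrays are dicts of id -> (lat, lon). Using one coordinate per
    array instead of a polygon boundary is an assumption of this sketch.
    """
    matches = {}
    for pid, (plat, plon) in points.items():
        best_id, best_d = None, radius_m
        for aid, (alat, alon) in arrays.items():
            d = haversine_m(plat, plon, alat, alon)
            if d <= best_d:
                best_id, best_d = aid, d
        matches[pid] = best_id
    return matches
```

Points that return `None` correspond to the case handled in the next section, where no existing boundary lies within the join radius.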

Table 1 Existing and publicly available geospatial datasets of solar energy systems.

Existing solar panel-row data in the US

To our knowledge, only two data repositories contain large quantities of ground-mounted solar panel-row geospatial data in the US: our recently published dataset of panel-row geometries in California’s Central Valley21,93, and mixed array and panel-row data within OSM, most often tagged with generator:source = solar8,58. We took guidance and motivation from existing OSM solar data extraction methods to process current solar array and panel-row data from OSM8,97, though we developed our own independent workflow. Complete polygon data were extracted from OSM for both generator:source = solar (likely panel-rows) and plant:source = solar (likely arrays). We separated panel-rows from arrays within both tags by identifying geometries whose area and perimeter to area ratio were comparable to those of the panel-rows reported in Stid et al.21. We removed repeat panel-row objects, prioritizing hand-digitized objects from OSM over the imagery classification approach of CCVPV.

Georeferencing metadata

Digitizing missing solar array boundaries

The 190 m georeferencing distance used to identify missing boundaries is shorter than the distances used for metadata attribution in similar studies, which range from 300 m (ref. 9) to 400 m (ref. 8). There were 1,616 arrays with value-added reference point data but without georeferenced solar array boundary data within 190 m. For the initial GM-SEUS v1.0, we manually delineated 126 missing array boundaries from the NREL Agrivoltaic Map using the most recently available imagery, including National Agriculture Imagery Program (NAIP) aerial imagery98, Copernicus Sentinel-2 imagery99, and Google Maps basemap imagery100. We followed delineation logic from Fujita et al.26 and our array definition, creating new boundaries that encompass panel-rows and the space between them. If the array geometry was present in the existing solar array datasets but was outside the 190 m radius, we georeferenced information to that shape, and added omitted array boundaries where necessary. Where possible, we investigated context using the array name in a Google search (which often pointed to InSPIRE, OSM, and GSPT repositories), leading to georeferencing and value-added attribute joining of new and existing objects between 191 m and ~50 km from the provided coordinates. We omitted 27 solar arrays that were installed after the available reference imagery, or that lacked available imagery and context. In total, we added 4.26 km2 of new array area for 34 arrays. We intend to fully delineate and georeference new arrays and the remaining 1,490 point data arrays in future version updates.

A complete reference dataset of existing ground-mounted panel-row and array data

We excluded rooftop solar arrays by removing existing and delineated array boundaries that had more than a 50% areal intersection with the Global Google-Microsoft Open Buildings Dataset101, and panel-rows within those array boundaries. The resulting preliminary dataset of existing ground-mounted solar arrays contained 14,905 arrays with over 3,056 km2 in original direct land use area. For 4,470 of those existing arrays, the preliminary dataset contained 1.07 million unique panel-row objects composing 137.3 km2 in direct panel-row area.

Acquiring solar panel-row shapes

We acquired solar panel-row shapes within solar array boundaries using high-resolution aerial imagery and a combination of pixel-based and object-based image analysis approaches. Though solar installations possess a distinctive spectral signature15,21, pixel-based classifications alone can suffer from spectral confusion (noise) due to variance across imagery acquisition conditions and some spectral similarity to shadows, water, and impervious surfaces21,102. The consistent pattern and constrained layout of solar panel-rows within solar arrays make geographic object-based image analysis particularly effective for improving classification accuracy. We thus used a combination of supervised machine learning approaches and unsupervised object-based image analysis including Random Forest103, X-means clustering104, simple non-iterative clustering (SNIC)105, and gray-level co-occurrence matrix (GLCM) texture106,107. The integration of these methods is described in the following section and is well-supported in the literature for both general land cover mapping and solar classification (e.g., refs. 15,21,34,39,108,109).

NAIP 4-band imagery is the only free and widely available imagery with spatial resolution capable of delineating individual solar PV and CSP panel-rows. NAIP is collected during the primary regional growing season every two to three years at the state-level at 0.3 to 0.6 m resolution (Fig. 4). At the time of writing, the most recent NAIP mosaics made available in Google Earth Engine range from 2021 to 2023 depending on the state and flight contracts for specific years. During GM-SEUS processing, NAIP 2023 was actively being uploaded to Google Earth Engine, replacing 2021 imagery in some states. NAIP imagery dates used in the development of GM-SEUS are shown in Fig. 4.

Fig. 4

The most recently available NAIP imagery dates within Google Earth Engine. The color gradient indicates the within-year acquisition date, ranging from earlier (lighter shade) to later (darker shades) in the calendar year. These dates represent the temporal limitation of panel-row delineation and array boundary enhancement based on available imagery (through 2023). Note that NAIP 2023 was being processed at the time of writing and has been included for states made available in Google Earth Engine by December 2024. Grey areas are either prohibited flight spaces for national security reasons (e.g., S. Nevada), or where the most recent imagery was prior to January 2021 (e.g., W. Washington).

Panel-row image classification

We classified panel-rows using five spectral indices with reported utility in identifying solar panel-rows21 and arrays15. The indices were the normalized difference photovoltaic index (NDPVI)21, the normalized blue deviation (NBD)21, 4-band brightness (Br), the normalized difference vegetation index (NDVI)110,111, and the normalized difference water index (NDWI)112. These indices are calculated by:

$${NDPVI}=\frac{(\alpha \ast B-{NIR})}{(\alpha \ast B+{NIR})}$$
(1)
$${NBD}=\frac{B-\frac{R+G}{2}}{B+\frac{R+G}{2}}$$
(2)
$${Br}=\frac{R+G+B+{NIR}}{4}$$
(3)
$${NDVI}=\frac{({NIR}-R)}{({NIR}+R)}$$
(4)
$${NDWI}=\frac{(G-{NIR})}{(G+{NIR})}$$
(5)

where α is a weighting coefficient (0.5) to reduce the importance of variations in the blue band when differentiating impervious surfaces from the rest of the landscape21. The letters R, G, B, and NIR indicate the red, green, blue, and near infrared bands of the aerial imagery, respectively.
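As an illustration of Eqs. (1) to (5), the following minimal sketch computes the five indices for a single pixel. The function name `spectral_indices` and the zero-denominator guard are assumptions of this sketch and are not taken from the GM-SEUS code, which runs in Google Earth Engine.

```python
def spectral_indices(r, g, b, nir, alpha=0.5):
    """Return the five indices of Eqs. (1)-(5) for one pixel's band values."""

    def norm_diff(x, y):
        # Normalized difference; returns 0.0 when the denominator is zero
        # (a guard chosen for this sketch, not specified in the text).
        return (x - y) / (x + y) if (x + y) != 0 else 0.0

    return {
        "NDPVI": norm_diff(alpha * b, nir),   # Eq. (1)
        "NBD": norm_diff(b, (r + g) / 2),     # Eq. (2)
        "Br": (r + g + b + nir) / 4,          # Eq. (3)
        "NDVI": norm_diff(nir, r),            # Eq. (4)
        "NDWI": norm_diff(g, nir),            # Eq. (5)
    }
```

Four of the five indices share the same normalized-difference form, so they are bounded between −1 and 1 for non-negative band values, while Br keeps the units of the input bands.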

To incorporate spatial context, we clustered imagery within array boundaries using SNIC of the given spectral indices and a GLCM textural measure (sum average) of each index. SNIC is a polygonization segmentation approach that generates superpixel objects across a seed grid based on spatial-spectral context parameters such as compactness, connectivity, and neighborhood105 that has shown promise in delineating solar arrays in combination with Random Forest15. GLCM textural metrics, specifically the sum average, further capture the unique spatial-neighborhood relationships between proximal panel-row-like pixels. GLCM has demonstrated utility in mapping high-resolution land cover (NAIP)109 and in solar classification34. SNIC and GLCM sum average were calculated for each spectral index using native Google Earth Engine functions.

We randomly sampled the SNIC superpixel clusters of the five spectral indices and the sum average for each index at 1,000 points within each array boundary to train a locally relevant X-means clustering algorithm. We used a minimum number of clusters of 2 (solar and non-solar) and a maximum of 4 clusters, allowing one level of variability in solar and non-solar supercluster averages (e.g., two module types or two ground covers). For large arrays (>5 ha) and arrays with multi-polygon boundaries, we split imagery within the array sub-boundary into equal area chunks (no greater than 5 ha) to enhance computational efficiency and to allow large arrays to have greater X-means variability.

To classify the unsupervised X-means clusters, we trained and ran a Random Forest model to identify panel-rows (distinct from other land covers) within each array. We generated a new CONUS NAIP training dataset with 12,000 training points composed of 6 classes and 2,000 sample points per class (solar: 0, developed: 1, vegetated: 2, water: 3, snow/ice: 4, barren/sparse vegetation: 5). This training dataset is distributed along with the GM-SEUS data to facilitate land use classification with NAIP data by others. To generate solar samples, we randomly sampled 2,000 panel-row centroids in existing solar panel-row data. We acquired land cover samples by randomly sampling 2018 and 2019 NAIP imagery within 25,000 Land Change Monitoring, Assessment, and Projection (LCMAP) validated reference plots from Pengra et al.113, ensuring each class had as close to 2,000 samples as possible given class limitations. Given that LCMAP contains few examples of snow/ice plots, we also randomly sampled ~2,000 snow/ice points from the Randolph Glacier Inventory114 within CONUS. Qualitatively, we observed that within array bounds where the surrounding ground cover was relatively constrained to vegetation, barren surfaces, and impervious surfaces, CSP and thin-film panel-rows exhibited spectral signatures distinct from those of solar PV. CSP panel-rows were often spectrally similar to snow/ice due to the high reflectance of CSP reflectors, and thin-film panel-rows could be spectrally similar to water because of their high absorbance in the visible and NIR spectrum. Thus, the final solar image classification included those respective classes for CSP and thin-film module type installations.

We classified the original NAIP imagery using the five spectral indices and the new NAIP training dataset with 200 trees and a bag fraction of 0.5. The Random Forest model had an overall accuracy of 99.6% based on the aggregated performance per tree and an out-of-bag error estimate of 0.35. Each cluster with the majority of its area classified as solar was assigned to the solar class. We then eroded small islands (commission errors) and filled holes (omission errors). Panel-rows were then vectorized and negative-buffered by a single pixel width to dissolve single-pixel inter-row connections. Given the large number of vertices captured by sub-meter image classification, we improved processing and storage efficiency by saving the convex hull of each pixel-based panel-row as the final geometry. All imagery and data were accessed, trained, classified, and analyzed in Google Earth Engine115.
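The majority-area rule for assigning clusters to the solar class can be sketched as follows. This is a simplified illustration, not the Google Earth Engine implementation: it assumes equal-sized pixels so that pixel counts stand in for area, reads "majority" as strictly more than half, and uses the hypothetical function name `solar_clusters`.

```python
from collections import defaultdict

SOLAR = 0  # class code for solar in the training scheme described above


def solar_clusters(cluster_ids, class_ids, solar_class=SOLAR):
    """Return the set of cluster ids whose pixels are mostly classified as solar.

    cluster_ids and class_ids are equal-length sequences, one entry per pixel.
    Equal-sized pixels are assumed, so pixel counts stand in for area.
    """
    total = defaultdict(int)
    solar = defaultdict(int)
    for cid, cls in zip(cluster_ids, class_ids):
        total[cid] += 1
        if cls == solar_class:
            solar[cid] += 1
    # "Majority" is read here as strictly more than half of the pixels.
    return {cid for cid in total if solar[cid] * 2 > total[cid]}
```

For CSP and thin-film installations, the same vote could be taken over the set of classes treated as solar (for example snow/ice or water) by testing membership in a set instead of equality with a single class code.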

Filtering for high-quality panel-rows

The vectorized panel-row dataset contained commissions that were geometrically dissimilar to true positives within the dataset (universally) and within the array (locally). Thus, we universally removed panel-rows based on several criteria. We removed panel-row objects that were outside a minimum (15 m2) and maximum (2000 m2) panel-row area based on the minimum and maximum panel-row areas of the existing panel-rows dataset. We then universally removed any panel-row object with a perimeter to area ratio less than the minimum of the existing panel-row dataset (0.18).
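A minimal sketch of the universal filter, using the thresholds reported above, is given here. The function name `passes_universal_filter` is hypothetical, and two details are assumptions of this sketch: the area bounds are treated as inclusive, and the perimeter to area ratio is taken in units of m per m2.

```python
MIN_AREA_M2 = 15.0               # minimum panel-row area from the text
MAX_AREA_M2 = 2000.0             # maximum panel-row area from the text
MIN_PERIMETER_AREA_RATIO = 0.18  # minimum perimeter to area ratio from the text


def passes_universal_filter(area_m2, perimeter_m):
    """Return True if a candidate panel-row survives the universal filters."""
    # Reject objects outside the area range (bounds treated as inclusive).
    if not (MIN_AREA_M2 <= area_m2 <= MAX_AREA_M2):
        return False
    # Reject compact blobs whose perimeter to area ratio is too small.
    return perimeter_m / area_m2 >= MIN_PERIMETER_AREA_RATIO
```

For example, a 50 m by 4 m row (area 200 m2, perimeter 108 m, ratio 0.54) passes, whereas a 40 m by 40 m square (ratio 0.10) is rejected by the ratio test even though its area is within range.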

Locally (within each array) we removed panel-rows in which two or more of five geometric similarity measures failed: (1) mount technology of the panel composed less than 10% of the array, (2) ratio of the long-edge to the short-edge (length ratio) more than three standard deviations from the array mean, (3) the ratio of the panel-row area to the bounding box area more than three standard deviations from the array mean, (4) the perimeter to area ratio more than three standard deviations from the array mean, and (5) the Polsby-Popper ratio of compactness more than three standard deviations from the array mean. The Polsby-Popper ratio, first used to defend against gerrymandering116, is defined by:

$${Compactness}=\frac{4\ast \pi \ast {rowArea}}{{{rowPerimeter}}^{2}}$$
(6)
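The within-array screening can be sketched as below: Eq. (6) for compactness, a three-standard-deviation outlier test per measure, and removal of rows that fail two or more measures. The function names are hypothetical, and the use of the population standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, pstdev


def polsby_popper(area, perimeter):
    """Polsby-Popper compactness, Eq. (6): 1.0 for a circle, smaller otherwise."""
    return 4 * math.pi * area / perimeter ** 2


def outlier_flags(values, n_sigma=3.0):
    """Flag values more than n_sigma standard deviations from the array mean."""
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) > n_sigma * sigma for v in values]


def rows_to_remove(measures, min_failures=2):
    """Return one boolean per row: True if it failed at least min_failures tests.

    measures maps a measure name to a list of per-row failure flags.
    """
    n_rows = len(next(iter(measures.values())))
    return [
        sum(flags[i] for flags in measures.values()) >= min_failures
        for i in range(n_rows)
    ]
```

In this sketch, the mount-share test in criterion (1) would be supplied as one more list of failure flags alongside the four statistical measures.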

While accounting for universal and within-array outliers removed a considerable quantity of commissions, some arrays contained overall low-quality classifications, leading to retention of low-quality panel-row objects. To address this, we created temporary solar array boundaries from the panel-row objects (see Enhancing existing array boundaries with panel-rows). We then calculated the area and the perimeter to area ratio of both the new array and the original existing array shape. We removed all panel-row objects within an array when the new array area was less than 25% of the original array area and the new perimeter to area ratio was greater than the 99th percentile of the existing arrays' perimeter to area ratios.

To create the final GM-SEUS panel-row dataset, newly derived panel-rows were merged with existing panel-rows giving preference to existing panel-rows. The result was 1.07 million panel-rows from existing sources, and 1.85 million newly delineated panel-rows. In addition to the quality-controlled dataset of array and panel-row boundaries, the data repository contains the raw Google Earth Engine output for all NAIP classified panel rows. The panel-row and enhanced array boundary delineation workflow is shown in Fig. 5.

Fig. 5

Example panel-row and array boundary delineation logic by mount technology. The left column contains NAIP aerial imagery and input geometries originating from the Source dataset of greatest spatial quality for that array. RF refers to the Random Forest model and SNIC refers to the simple non-iterative clustering algorithm.

Enhancing design metadata

Enhancing existing array boundaries with panel-rows

We created new solar array geometries from GM-SEUS panel-rows using a buffer, dissolve, and erode approach. The selected buffer distance was 10 m, allowing for large panel-assemblies in panel-rows and high-latitude panel-rows to both have large (up to 20 m) spacing. This was similar to our previous approach where we used a 5 m buffer to group panel-rows into an array21 and to Hu et al.17 who used a 3 m buffer to assign an array group for rooftop solar. Importantly, the erosion used here removed the area external to the panel-rows and the space between. Thus, this process inherently aligns with our definition of a solar array spatial footprint (see Fig. 5).

The GM-SEUS repository contains a version of all newly created array boundaries from the final panel-row dataset and a version of all existing array boundaries replaced with newly delineated array boundaries where available. We maintained USPVDB and CCVPV array boundaries due to their completeness and matching to our definition of an array. OSM, CWSD, and TZ-SAM arrays do not inherently follow our definition of an array and contain manual delineation (OSM and CWSD) and medium resolution remote sensing (CWSD and TZ-SAM) biases26,36. Additionally, given that CWSD and TZ-SAM arrays were independent of project-level metadata, we allowed newly delineated disconnected array shapes to be considered as separate arrays in this source dataset. For these arrays, we grouped newly created sub-array geometries by the same installation year (see Estimating installation year), allowing arrays installed in different years to be their own installation. In total, 5,017 arrays (174 km2) received a new array boundary delineation. The original area of these arrays was 281 km2.

Estimating panel-row azimuth, mount technology, inter-row spacing, and tilt

We estimated the azimuth and mount technology for each panel-row object. We defined the azimuth as the primary south-facing cardinal direction of the short-edge vector in the minimum bounding rectangle (±180°), given that all arrays were in the northern hemisphere. In the final GM-SEUS dataset, the average azimuth (avgAzimuth) for single-axis mounted solar arrays was corrected to the southward-normal (perpendicular) angle to the panel-row face direction to follow azimuth definitions in existing datasets26,49. The final panel-row dataset maintains azimuth (rowAzimuth) as the primary direction of the short-edge vector (panel-row face).

To classify mount technology, we also calculated the length ratio of the long vector to the short vector and the ratio of panel-row area to bounding box area. The conditions for classifying mount technology were (see Fig. 3): single-axis, where the azimuth is within 30° of east or west and the length ratio is greater than 2.5; fixed-axis, where the azimuth is within 60° of south and the length ratio is greater than 2.5; and dual-axis, where the length ratio is less than 2.5. We also calculated the distance between each panel-row and the nearest panel-row (rowSpace) in the azimuthal direction for fixed- and single-axis panel-rows and in all directions for dual-axis panel-rows. Azimuth and mount classification logic is similar to that of Edun et al.91 and Perry et al.92.
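The mount classification conditions can be expressed as a short rule-based sketch. Several details are assumptions made here for illustration: azimuth is given as a compass bearing in degrees clockwise from north (90 = east, 180 = south, 270 = west), the single-axis rule is checked before the fixed-axis rule where the two ranges meet, and rows matching no rule are labeled "unknown". The function names are hypothetical.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two compass bearings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def classify_mount(azimuth_deg, length_ratio, ratio_threshold=2.5):
    """Classify one panel-row from its azimuth and long-edge/short-edge ratio."""
    if length_ratio < ratio_threshold:
        return "dual-axis"
    if length_ratio > ratio_threshold:
        east_west = min(angular_diff(azimuth_deg, 90.0),
                        angular_diff(azimuth_deg, 270.0))
        if east_west <= 30.0:
            return "single-axis"
        if angular_diff(azimuth_deg, 180.0) <= 60.0:
            return "fixed-axis"
    # No rule matched (e.g., north-facing azimuth or ratio exactly at threshold).
    return "unknown"
```

With these conventions, a long row facing due south is classified as fixed-axis, a long row whose short edge points nearly east or west as single-axis, and a nearly square object as dual-axis.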

Optimal fixed-axis tilt is generally assumed to correlate with array latitude, with slight deviations at higher latitudes117. However, local climate and topography also play important roles118. Using newly acquired azimuth and mount metadata, we estimated the optimum tilt angle (tiltEst) for fixed-axis solar PV arrays (and mixed-mounted arrays) using the pvlib iotools package60,119. The latitude and longitude of each array were used to retrieve local typical meteorological year data from the PVGIS-ERA5 v5.3 database120. The typical meteorological year data provided location-specific irradiance data that incorporate shading from local topography. Global plane of array irradiance for orientations between 10 and 70 degrees from horizontal facing the avgAzimuth was modeled using the typical meteorological year and an isotropic model in Python. The tilt with the greatest annual modeled global plane of array irradiance was selected for tiltEst.
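The tilt search can be illustrated with a self-contained sketch of an isotropic transposition model. This does not reproduce the pvlib calls used in this work: solar zenith and azimuth are taken as precomputed inputs, the ground albedo of 0.2 and the one-degree tilt step are assumptions, and the function names `poa_global_isotropic` and `best_tilt` are hypothetical.

```python
import math


def poa_global_isotropic(tilt, surf_az, zenith, sun_az, dni, ghi, dhi, albedo=0.2):
    """Global plane-of-array irradiance (W/m2) for one time step.

    Angles are in degrees. Beam, isotropic sky diffuse, and ground-reflected
    components are summed; albedo=0.2 is an assumed default.
    """
    t, z = math.radians(tilt), math.radians(zenith)
    cos_aoi = (math.cos(z) * math.cos(t)
               + math.sin(z) * math.sin(t) * math.cos(math.radians(sun_az - surf_az)))
    beam = dni * max(cos_aoi, 0.0)                   # no beam from behind the panel
    sky = dhi * (1 + math.cos(t)) / 2                # isotropic sky diffuse
    ground = ghi * albedo * (1 - math.cos(t)) / 2    # ground-reflected
    return beam + sky + ground


def best_tilt(records, surf_az, tilts=range(10, 71)):
    """Return the tilt (degrees) with the greatest summed plane-of-array irradiance.

    records is an iterable of (zenith, sun_az, dni, ghi, dhi) tuples, for
    example one per hour of a typical meteorological year.
    """
    records = list(records)

    def annual(tilt):
        return sum(poa_global_isotropic(tilt, surf_az, *rec) for rec in records)

    return max(tilts, key=annual)
```

Because the sky diffuse term decreases with tilt while the beam term depends on sun geometry, locations with a larger diffuse fraction tend to favor lower tilts than latitude alone would suggest.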

Estimating ground cover ratio (GCR)

Ground cover ratio (GCR), sometimes referred to as packing factor (PF), has previously been defined by two different relationships. We calculated both:

$${{GCR}}_{1}=\frac{{totRowArea}}{{{totArea}}^{\ast }}$$
(7)
$${{GCR}}_{2}=\frac{{rowWidth}}{{rowWidth}+{rowSpace}}$$
(8)

where totRowArea is the top-down or apparent total panel-row area within an array at peak solar inclination (for tracking arrays), totArea* is the total land area of the panel-rows and the spacing between them21,117,121,122,123,124, which is equivalent to totArea for arrays with complete panel-row delineation and arrays where a new boundary was delineated and replaced an original boundary, rowWidth is the distance from the bottom edge to the top edge of a row along the short edge, and rowSpace is the distance (azimuthal distance for fixed- and single-axis) from an array edge to the nearest panel-row edge. The sum of rowWidth and rowSpace is the horizontal ground distance between any identical point of a module in a directly adjacent row59,62,125. We filled in gaps for arrays without panel-row information by estimating GCR1 and GCR2 using a multiple linear regression between GCR and latitude and longitude from arrays with panel-row information for each mount and module technology.
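As a minimal worked example of Eqs. 7 and 8, the two ratios can be computed directly from the quantities defined above. The function names and input values are hypothetical and chosen only to show the arithmetic.

```python
def gcr1(tot_row_area_m2, tot_area_m2):
    """Eq. 7: panel-row area divided by the land area of rows plus spacing."""
    return tot_row_area_m2 / tot_area_m2

def gcr2(row_width_m, row_space_m):
    """Eq. 8: row width divided by the row-to-row pitch (width plus spacing)."""
    return row_width_m / (row_width_m + row_space_m)

# Hypothetical array: 4,200 m2 of panel-rows on a 10,000 m2 footprint,
# with 2 m wide rows separated by 3 m gaps.
example_gcr1 = gcr1(4200.0, 10000.0)   # 0.42
example_gcr2 = gcr2(2.0, 3.0)          # 0.40
```

The two definitions need not agree for a given array, since GCR1 also reflects gaps along rows and at array edges, while GCR2 reflects only the row pitch.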

Estimating installation year

Solar array installation year is often acquired by change detection and manual validation of aerial and satellite remote sensing imagery where permitting data is not available14,21,29,33,36. Given the quantity of arrays in GM-SEUS and the lack of permit data for commercial-scale installations, we needed a way to independently and automatically estimate the year of completed installation. We used the Google Earth Engine implementation of Landsat-based detection of trends in disturbance and recovery (LandTrendr) algorithms126 v0.2.0 to estimate the solar array installation year. LandTrendr is a suite of temporal segmentation algorithms tailored to detecting changes in forested areas at 30 m resolution, but with broader applications. We previously used LandTrendr and NDPVI to detect solar installation years between 2008 and 2018 in California with a 79% accuracy within one year of the manually validated installation year21.

Temporal segmentation requires a significant change in pixel spectral trajectory, which is dependent on the historical land use and land cover, and the subsequent land management between the arrays. Landsat pixels (30 m) contain mixed land cover of panel-rows and the space between them, meaning the post-installation reflectance is dependent on GCR, panel-row area, and ground cover management. Given the broader spectrum of possible spectral histories and variables affecting solar reflectance across the US, and reported utility in employing multiple indices for LandTrendr disturbance detection127, we modified our original approach to include a multi-index performance-weighted average of LandTrendr years of disturbance across twelve spectral indices. These were the indices with reported utility in solar detection (NDPVI, NBD, Br, NDVI, and NDWI) along with seven land use change indices built into LandTrendr including the enhanced vegetation index (EVI)128, the normalized burn ratio (NBR)129, the normalized difference moisture index (NDMI)130, and the Tasseled Cap-Transformations (greenness, brightness, wetness, and angle)131,132. All built-in indices except EVI take advantage of short-wave infrared bands that Landsat includes but NAIP does not.

We used LandTrendr and these indices to estimate the year of newest disturbance, or land use change, within the boundaries of each array polygon between 2009 and 2023, noting that a significant amount of existing solar has been installed in the last decade26. For arrays without permitted installation years, if an input array shape contained multiple newly delineated or existing polygons, we broke it into component polygons to check for unique installation years, identifying and separating mistakenly grouped arrays within the original boundary. This step ensures consistency with our array definition. We also removed segmented boundary years (2008 and 2024) to address known biases in the LandTrendr segmentation at edge years21. To promote accuracy, we subset LandTrendr indices where the mean absolute error (MAE) between the USPVDB permitted installation year values and LandTrendr estimated installation year (~4,000 arrays) was less than two years (NBD, NDPVI, NDWI, TCG, and NDMI). For these indices, we applied an inverse variance-weighted average to calculate the installation year. Similar performance-weighted average approaches are common for various applications (e.g., refs. 127,133,134). We then included indices with greater MAE to fill in non-detect installation years (TCA, NDVI, NBR, Br, TCW, EVI, and TCB). Of the over 16,000 input array polygon boundaries, this LandTrendr method only omitted installation year estimates for 41 array polygons, only 2 of which did not have installation years from other sources.
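The inverse variance-weighted average can be sketched as follows. This is an illustration under stated assumptions: each index is weighted by the reciprocal of its error variance relative to a reference such as permitted installation years, indices without a detected year are skipped, and the variance values in the usage note are hypothetical.

```python
def inverse_variance_year(year_by_index, error_variance_by_index):
    """Weighted mean of per-index disturbance years, with weights 1 / variance.

    year_by_index: dict mapping index name to detected year, or None when
    the index produced no detection.
    error_variance_by_index: dict mapping index name to its error variance
    against reference installation years (assumed known and positive).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for name, year in year_by_index.items():
        if year is None:
            continue  # non-detect for this index
        weight = 1.0 / error_variance_by_index[name]
        weighted_sum += weight * year
        weight_total += weight
    if weight_total == 0.0:
        return None  # no index detected a disturbance year
    return weighted_sum / weight_total
```

For example, detections of 2016 (variance 1.0) and 2018 (variance 4.0) give a weighted estimate of 2016.4, which could then be rounded to a calendar year.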

We independently estimated the installation year for all arrays. However, we also retain installation years from existing datasets in the following order: USPVDB, InSPIRE, USS, SolarPACES, GSPT, GPPDB, TZ-SAM, CWSD, and OSM. We omitted CCVPV since the multi-index performance-weighted average method is more robust than our original single index approach. We also only included installation year from TZ-SAM and CWSD for 2018 or later, since these approaches are based on Sentinel-2 availability. When LandTrendr did not result in a year of detected disturbance, we manually analyzed the installation year using available historical satellite and aerial imagery for each array and LandTrendr time series plots, following the methods of Stid et al.21,55. We allowed new array shapes derived within TZ-SAM array boundaries to be independent arrays, and regrouped shapes if they were installed in the same year. We provide both a compiled existing installation year (instYr) and the new LandTrendr-derived installation year (instYrLT) in the final GM-SEUS dataset.

Estimating installed capacity

Installed solar capacity is commonly assumed to correlate with panel-row surface area. Others have used statistical regression relationships between known solar PV capacity and panel-row surface area19 along with several adjustable parameters such as the spectral-intensity of the module surface area17,135. With temporal information and module composition, we estimated module efficiency and thus installed capacity for solar PV arrays with (Eq. 9) and without (Eq. 10) panel-row information. We thus used the following relationships modified from Martín-Chivelet117 and Phillpott et al.36 to estimate peak installed capacity for solar PV arrays (power, MWDC):

$${{capMWest}}_{{PV}}={totRowArea}\ast \eta \ast {G}_{{STC}}$$
(9)
$${{capMWest}}_{{PV}}=({{totArea}}^{\ast \ast }\ast {{GCR}}_{{local}})\ast \eta \ast {G}_{{STC}}$$
(10)

where η is the annual average value for ground-mounted systems from the LBNL Tracking the Sun 2024 Report44 for each technology (c-Si or thin-film), GSTC is the irradiance at standard test conditions (1 kWDC m–2), totArea** is the total array area adjusted for area bias of the input dataset relative to USPVDB array area, and GCRlocal is the estimated GCR1 for arrays without panel-row information in relation to latitude and longitude by mount technology and module type. Note that Eqs. 9 and 10 are effectively the same, because GCRlocal is equivalent to the mount and spatially relevant average ratio of totRowArea to totArea. Area bias was determined by intersecting array shapes from input datasets with USPVDB arrays and acquiring the average percent-difference in array polygon area (Fig. 6B and 6C). This corrects for array datasets that tend to over or underestimate total array area (by our definition), reducing erroneous totRowArea estimates when multiplying by GCRlocal.
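Eqs. 9 and 10 can be illustrated with a short sketch. The efficiency of 0.20 used in the usage note is a hypothetical placeholder rather than a value from the Tracking the Sun report. The division by 1,000 makes the conversion from kW to MW explicit, which is an assumption about units given GSTC in kW m–2, and the bias-adjusted area is taken as an input rather than computed.

```python
G_STC_KW_PER_M2 = 1.0  # irradiance at standard test conditions, kW per m2

def pv_capacity_mw(tot_row_area_m2, efficiency):
    """Eq. 9: capacity from panel-row area, with kW converted to MW."""
    return tot_row_area_m2 * efficiency * G_STC_KW_PER_M2 / 1000.0

def pv_capacity_mw_from_array(adjusted_tot_area_m2, gcr_local, efficiency):
    """Eq. 10: panel-row area approximated as the bias-adjusted array area
    multiplied by the locally estimated GCR1."""
    return pv_capacity_mw(adjusted_tot_area_m2 * gcr_local, efficiency)
```

For instance, 50,000 m2 of panel-rows at an assumed efficiency of 0.20 corresponds to about 10 MWDC, and a 100,000 m2 array with a local GCR1 of 0.5 gives the same result through Eq. 10.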

Fig. 6

Spatial validation compared to USPVDB. GM-SEUS array boundaries considered here are all those newly derived by the NAIP segmentation approach. (A) Cumulative distribution function of the IoU for each existing (TZ-SAM, OSM, CWSD, CCVPV) and new (GM-SEUS) array dataset relative to USPVDB with the red horizontal line representing the median IoU. (B) Total area of intersecting arrays with USPVDB representing the USPVDB area intersecting with array area and Ref. Data representing the intersecting reference dataset array area from other existing and new datasets. (C) Proportional array area difference for existing and new array dataset boundaries intersecting USPVDB with the red horizontal line representing no area difference between USPVDB and the reference dataset.

After dataset harmonization, 59 of 74 CSP arrays were missing a reported installed capacity. For these arrays, we estimated thermal capacity (MWth) for solar CSP arrays with (Eq. 11) and without (Eq. 12) panel-row information by:

$${{capMWest}}_{{CSP}}={{totRowArea}}_{{effective}}\ast {C}_{f}$$
(11)
$${{capMWest}}_{{CSP}}={({totArea}\ast {{GCR}}_{{local}})}_{{effective}}\ast {C}_{f}$$
(12)

where totRowAreaeffective is the effective panel-row (for CSP, collector) area, estimated for parabolic trough, linear Fresnel, and dish-CSP systems as half of the circumference of a circle with a diameter of rowWidth, and for power tower, beam down tower, and hybrid-CSP systems as the totRowArea, and Cf is the recommended conversion factor of 0.0007 MWth m–2 of aperture collector area to installed thermal capacity136. Again, totArea * GCRlocal is a spatial regression of GCR1 as it relates to totRowArea. This is a vastly simplified approach with numerous limitations and does not include assumptions about thermal-to-electric efficiency. More robust methods are available (e.g., refs. 137,138), but are beyond the scope of this initial version of GM-SEUS.

Similar to other new solar array attributes, we estimated installed capacity for all arrays and retain capacity attributes from existing datasets with a capacity attribute in order of perceived quality. We considered high-quality capacity estimates as those that were largely complete and derived from single-source permit records or data directly from industry partners (USPVDB, InSPIRE, USS, SolarPACES, PVDAQ). Lower quality estimates are capacity data that were derived from multiple sources, including some permit or operator data, but with less consistency (GSPT, GPPDB). Due to uncertainty in source or quality, we report only the newly estimated capacity for all arrays from CCVPV, TZ-SAM, and OSM datasets using Eqs. 9–12. CWSD did not report capacity.

Data Records

GM-SEUS v1.0 is available for public use and is provided in the Zenodo Repository139. The final data repository provides all geospatial files as geopackage, shapefile, and CSV. We also provide 17,500 input and target images derived from GM-SEUS and NAIP imagery for direct application in deep learning and pattern recognition use cases. When using these products, please cite the original data sources and articles21,25,26,36,42,45,46,47,48,49,52,57,58,93,94,95,96 along with this publication. All arrays and panel-rows contain a ‘Source’ attribute, which references the data source of the original spatial information (see README in the data repository for more detail). See Table 1 for attribute-level information in each dataset. Data records are up to date through December 2024. The USPVDB, TZ-SAM, and OSM datasets intend to update their completeness on a regular basis. We intend to provide annual updates to this dataset.

The GM-SEUS open repository contains the following files:

  • GMSEUS_Arrays_Final: Final array dataset containing boundaries from existing datasets and enhanced by buffer-dissolve-erode technique with GM-SEUS panel-rows containing all array-level attributes (ESRI:102003), geopackage, shapefile, csv

  • GMSEUS_Panels_Final: Final panel-row dataset containing boundaries from existing datasets and newly delineated GM-SEUS panel-rows containing all panel-row-level attributes (ESRI:102003), geopackage, shapefile, csv

  • GMSEUS_NAIP_Arrays: All array boundaries created by buffer-dissolve-erode method of newly delineated (NAIP) GM-SEUS panel-rows (ESRI:102003), geopackage, shapefile, csv

  • GMSEUS_NAIP_Panels: Newly delineated panel-rows from NAIP imagery with low-quality panel-rows removed (ESRI:102003), geopackage, shapefile, csv

  • GMSEUS_NAIP_PanelsNoQAQC: All newly delineated panel-rows from NAIP imagery without any quality control (ESRI:102003), geopackage, shapefile, csv

  • NAIPtrainRF: Training dataset of 12,000 NAIP training points (2,000 class–1) containing class values, spectral index values, the year of NAIP imagery accessed, and point coordinates (EPSG:4326), csv

  • LabeledImages: Directory containing image and mask subdirectories with ~17,500 input and target images for deep learning pattern recognition applications, GeoTIFF

We provide the following attribute fields in GM-SEUS Final Arrays:

  • arrayID: unique numeric ID of each solar array in GM-SEUS, unitless

  • Source: original array boundary source from existing datasets or manual digitization, unitless

  • nativeID: numeric ID of each solar array from the source spatial dataset if an indexing system existed, unitless

  • latitude: latitude of the array boundary centroid (EPSG:4269), decimal degrees

  • longitude: longitude of the array boundary centroid (EPSG:4269), decimal degrees

  • newBound: binary, whether the array boundary was derived from the existing data sources (0) or from a buffer-dissolve-erode of panel-rows following our definition of an array boundary (1), unitless

  • totArea: total land footprint of panel-rows and the space between them, m2

  • totRowArea: If numRow is greater than 0, sum of rowArea within an array. Otherwise, estimated based on totArea and GCR1 estimation where no panel-rows were detected, m2

  • numRow: number of panel-rows within an array, unitless

  • instYr: installation year from existing sources, with gaps filled in by instYrLT, year

  • instYrLT: LandTrendr-derived installation year independent of any data source other than Landsat spectral trajectory, year

  • capMW: installed capacity from existing sources, with gaps filled in by capMWest, MWDC or MWth

  • capMWest: estimated installed capacity derived from capacity to panel-row area relationships described in Eqs. 9–12 independent of any data source, MWDC or MWth

  • modType: reported panel-row (module) technology at the array level (c-Si, thin-film, csp). If unreported, assumed to be c-Si, unitless

  • effInit: initial panel-row efficiency from existing sources, with gaps filled in by efficiency estimates based on modType and instYr taken from the annual Tracking the Sun report, %

  • GCR1: 0-1, the ratio of totRowArea to the total area of panel-rows and the space between them. For arrays with complete panel delineation and arrays where newBound is 1, this is equivalent to totArea. This is also called packing factor. If numRow is greater than 0, GCR1 is an actual GCR1 for the array. Otherwise, GCR1 is estimated by linear regression of latitude and longitude by mount and module type, unitless

  • GCR2: 0-1, the ratio of the average width of the panel-row short edge (rowWidth) to the horizontal ground distance between identical panel-row points, defined as the sum of avgWidth and rowSpace. If numRow is greater than 0, GCR2 is an actual GCR2 for the array. Otherwise, GCR2 is estimated by linear regression of latitude and longitude by mount and module type, unitless

  • mount: mount technology derived from the azimuth and geometry of each panel-row within the array or from existing sources, with preference given to newly derived mount technology. Either ‘fixed_axis’, ‘single_axis’, ‘dual_axis’, or ‘mixed_’ with a lower-case letter denoting the mixed mounts (e.g., mixed_fs), unitless

  • tilt: panel-row tilt for fixed-axis arrays (including arrays with mixed-mounting) from existing sources and filled in by tiltEst, degrees above horizontal

  • tiltEst: estimated panel-row tilt for fixed-axis arrays (including arrays with mixed-mounting) estimated using pvlib, degrees above horizontal

  • avgAzimuth: median estimated azimuth of panel-rows within array bounds or reported azimuth from existing sources, with preference given to newly estimated azimuth. For single-axis tracking arrays this is the cardinal direction of the long-edge. For all other mount types, this is the cardinal direction of the panel-row face, degrees from north

  • avgLength: median length of the long edge of panel-rows within an array, meters

  • avgWidth: median length of the short edge of panel-rows within an array, meters

  • avgSpace: median spacing between panel-rows within an array, measured between panel-row edges projected onto the ground, meters

  • STATEFP: unique geographic identifier for the U.S. Census Bureau state entity, unitless

  • COUNTYFP: unique geographic identifier for the U.S. Census Bureau county entity, unitless

  • geometry: best new or available geometry matching the array definition which contains panel-rows and the space between them, derived from existing sources (newBound = 0) or from a buffer-dissolve-erode of newly delineated panel-rows (newBound = 1), meters

  • version: GM-SEUS version in which the array geometry and attributes are derived. Each subsequent version will re-derive new geometries and the best delineation from each version will be selected, unitless

We provide the following attribute fields in GM-SEUS Final Panel-Rows:

  • panelID: unique numeric ID of the panel-row in GM-SEUS, unitless

  • arrayID: unique numeric ID of each solar array in GM-SEUS that the panel-row is associated with, unitless

  • Source: panel-row boundary source from existing datasets or GM-SEUS, unitless

  • rowArea: top-down or apparent panel-row area directly from the output of image classification, m2

  • rowWidth: length of the short-edge of the panel-row, meters

  • rowLength: length of the long-edge of the panel-row, meters

  • rowAzimuth: azimuth of the panel-row, with 0 at North, degrees

  • rowMount: mount technology (fixed-axis, single-axis, or dual-axis) of the panel-row, unitless

  • rowSpace: the inter-row spacing between the panel-row and the nearest panel-row in the azimuthal direction (fixed- and single-axis) or any direction (dual-axis), meters

  • geometry: top-down or perceived geometry, meters

  • version: GM-SEUS version in which the panel-row geometry and attributes are derived. Each subsequent version will re-derive new geometries and the best delineation from each version will be selected, unitless

Technical Validation

GM-SEUS completeness compared to other datasets

Through the end of Q3 2024, the Solar Energy Industries Association and Wood Mackenzie reported that the US had installed 5.3 million solar systems with a total solar capacity of ~220 GWDC (ref. 54). Given available information, ~18% is residential-scale solar (~40 GWDC), ~12% is commercial- or community-scale (~25 GWDC), and ~70% is utility-scale (~150 GWDC), with more than 80 GWDC installed since 2023 (ref. 54).

GM-SEUS reports an estimated 186 GWDC installed through December 2024 (TZ-SAM and OSM provide the most recent data), or ~107% of estimated non-residential solar capacity (~85% of all solar capacity) through 2024. We also provide new sub-array metadata for 9,042 arrays (83.1 GWDC), 5,858 (16.7 GWDC) of which are not contained within USPVDB. Note that we include all arrays from USPVDB and TZ-SAM but refined the TZ-SAM array area where panel-row information is available and independently estimate installed capacity using site-level information.

For reference, within CONUS, the USPVDB v2.0 is complete through Q3 2023 and represents a permitted 90.4 GWDC (ref. 52), or ~65% of non-residential solar capacity through 2023. TZ-SAM Q3 2024 estimates capacity from array area and country-wide values for GCR and inverter loading ratio, and estimates 197 GWAC within the CONUS (~257 GWDC assuming a median inverter loading ratio of 1.3 from USPVDB), or ~147% of non-residential solar capacity (~117% of all solar capacity) through Q3 2024 (ref. 57). Note, however, that coarse GCR estimates and array geometries generated from medium-coarse resolution satellite imagery may overestimate encompassed area and thus estimated capacity26. This is evident in the TZ-SAM arrays that intersect USPVDB arrays, which overestimate total array area on average by ~45% (by our definition of an array – Figs. 3 and 6) and installed capacity by ~30% (assuming 1.3 inverter loading ratio) for the same arrays.

Many of the input datasets report being the most complete or comprehensive for their scope at the time of publication. We have compiled these data repositories, removed repeat information preferencing quality, acquired updated data from OSM, and enhanced spatiotemporal metadata for many existing array datasets. GM-SEUS is thus a harmonization and enhancement of the most comprehensive publicly available ground-mounted solar energy datasets available in the US through 2024 (see Table 1). However, we exclude non-contiguous US regions and residential and rooftop systems. The most comprehensive residential and rooftop datasets to-date are Bradbury et al.1 in the US (data: ref. 140) and Stowell et al.9 in the United Kingdom (data: ref. 141).

We have inevitably omitted existing ground-mounted solar energy systems and likely included commissions (non-solar objects). We have no way of knowing the extent of omission error beyond comparing against broad solar industry trends and acknowledging the 1,490 point data sources that we still need to manually georeference and digitize. Since NAIP imagery is often a year behind present due to quality control and inspection requirements, we are also not able to directly determine commission error (non-solar) in the existing dataset compilation. Thus, some arrays derived with moderate resolution remote sensing (CWSD, TZ-SAM) may contain non-solar commissions. For example, we have no way of knowing if arrays without available NAIP imagery including panel-rows are: 1) arrays installed after the most recent NAIP imagery or 2) inclusive of non-solar objects. There may also be situations where erroneous classifications of non-solar objects passed the panel-row quality control. However, if we only consider arrays verifiable with recent imagery (USPVDB, CCVPV, OSM, digitized arrays, and arrays with compiled or identified panel-rows), total GM-SEUS installed capacity is 125 GWDC, 72% of reported non-residential capacity through 2024. In general, validation of completeness is limited by our reliance on freely available data and our decision not to include additional existing data held behind paywalls. The difficulty in validating GM-SEUS underscores the motivation to create this product along with similar efforts.

Spatial confidence in array and panel-row delineation

USPVDB and EIA Form 860 form the most rigorous, robust, and widely available data against which to compare our results and are often the source of metadata for other existing datasets. The hand delineated array boundaries in USPVDB ensure completeness and also match our definition of an array (Fig. 3). Thus, for both spatial confidence and attribution technical validation, we compare our results to USPVDB.

We generated panel-row geometries for 9,042 arrays. Although we maintain USPVDB boundaries in the final dataset, we use these high-fidelity hand delineated boundaries to validate our NAIP array delineation approach for other existing array datasets. To evaluate confidence in newly generated array geometries, we use the Jaccard Similarity Index, also known as the Intersection over Union (IoU)142. IoU is bound by 0 and 1, where zero indicates no overlap and 1 indicates identical overlap of input geometries (A, B). IoU was calculated by:

$${IoU}(A,B)=\frac{A\cap B}{A\cup B}$$
(13)
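Eq. 13 can be illustrated with a minimal sketch that computes IoU from areas of intersection and union. For simplicity it assumes axis-aligned rectangles given as (xmin, ymin, xmax, ymax); real array polygons would require a geometry library to compute the same two areas. The function names are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (0 to 1)."""
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_y
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit along x share an area of 2 out of a union of 6, giving an IoU of one third.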

We generated panel-rows using NAIP imagery and created array boundaries for 2,871 of 4,185 USPVDB arrays (52.9% of total USPVDB area). We calculated the IoU for all NAIP panel-row delineated array boundaries that intersected with any USPVDB array polygon. Due to USPVDB multi-polygons and connected array shapes, we dissolved all boundaries and considered any individual polygon where boundaries overlapped. The 1,314 array omissions were due either to poor quality panel-row or array delineation or to imagery that predated the installation of the array. Note, however, that this partial coverage applies only to validation of the array boundary delineation method and that we include all USPVDB arrays and array area in GM-SEUS.

The median IoU for GM-SEUS array boundaries was 0.88 (Fig. 6A), which is comparable or superior to numerous instances of IoU being used to validate solar array boundary delineation (e.g., refs. 1,14,15,28,29). The NAIP panel-row delineation method underestimates USPVDB area on average by ~12% (Fig. 6B and C). This makes sense, given our highly conservative panel-row selection for high-quality sub-array metadata and that we only consider array area within the existing array boundary. We also compared the spatial array delineation of existing datasets to USPVDB using IoU, resulting in median values of 0.95 for CCVPV, 0.85 for CWSD, 0.83 for OSM, and 0.69 for TZ-SAM, with total proportion of USPVDB area captured being 3.5% for CCVPV and CWSD, 93.0% for OSM, and 99.6% for TZ-SAM. Note that CCVPV and CWSD more closely follow the array definition used here (and in USPVDB), whereas OSM and TZ-SAM more closely correspond to project area (Fig. 3), explaining the overestimate of array area relative to USPVDB (Fig. 6C).

We also used IoU to compare newly generated GM-SEUS panel-rows to existing OSM and CCVPV panel-row datasets, given that USPVDB does not provide panel-row spatial data. The median IoU for GM-SEUS panel-row boundaries was 0.48 (Fig. 7A). This is considerably lower than the array boundary IoU; however, comparing high-spatial resolution individual panel-rows is similar to pixel-wise IoU, which is known to yield lower scores than array-wise IoU17. Additionally, OSM contributors most often use Bing or Maxar imagery to delineate panel-rows in the OSM user-interface143. Horizontal accuracy standards for NAIP rectification require 95% confidence within 4-meters of true ground102. At the scale of panel-rows, ground sample errors up to a few meters can mean partially or entirely missing overlap with panel-rows hand delineated in OSM. Thus, what is more spatially important than panel-row geometric alignment is the correlation between the total estimated panel-row area within each array (Fig. 7B) and the difference in total panel-row area for each individual intersection (Figs. 7C and 7D). Figure 7B shows that array-total panel-row area is highly correlated (log-log transform R2 = 0.95) between existing panel-rows and GM-SEUS NAIP-generated panel-rows, and that NAIP panel-rows are ~15% larger than hand delineated panel-rows (Fig. 7C and 7D).

Fig. 7

Spatial validation compared to input existing panel-row datasets. GM-SEUS panel-row boundaries considered here are all those newly derived by the NAIP segmentation approach. (A) Cumulative distribution function of the IoU for GM-SEUS NAIP panel-rows and existing panel-rows with the red horizontal line representing the median IoU. (B) Log-Log transformed relationship between total panel-row area within a unique array for existing and GM-SEUS NAIP panel-rows. (C) Total area of intersecting panel-rows, with Exist. Rows representing the existing panel-row area and NAIP Rows representing the intersecting GM-SEUS NAIP panel-row area. (D) Proportional panel-row area difference for GM-SEUS NAIP panel-row boundaries intersecting existing panel-row boundaries with the red horizontal line representing no area difference between existing and GM-SEUS NAIP panel-rows.

Attribute validation

We estimated several solar array and panel-row characteristics that have available reference data for validation. We estimated the array installation year based on the LandTrendr-derived year of greatest spectral change, module efficiency based on installation year and module type, GCR based on new and existing panel-row delineation, installed capacity based on panel-row area or local GCR, predominant array mount technologies and azimuth based on panel-rows, and panel-row tilt based on latitude. We are not aware of validation datasets for other panel-row metadata metrics.

We harmonized and estimated the installation year for all arrays in GM-SEUS. Comparing GM-SEUS to the utility-scale solar arrays in USPVDB v2.0, the estimated installation year MAE was 1.52 years (Fig. 8). Error was skewed high in early installation years and low in late installation years, displaying limitations in this short-year-range and high spatial-resolution application of LandTrendr temporal segmentation. However, the more years available to segment, the less impact this error will have. Additionally, some indices (NBD and NDPVI) did not have the early installation year bias. For the individual indices used in estimating installation year, NBD had the lowest MAE (1.53 years) and TCB had the greatest MAE (4.07 years). Within USPVDB, individual indices omitted installation years for between 113 arrays (TCG) and 551 arrays (TCB). In total, 66% of estimated installation years were within one year of permitted installation year, and 80% were within two years.
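The two summary statistics reported above (MAE and the share of estimates within a given number of years) can be computed as in the following sketch. The function names and the sample years in the usage note are hypothetical and are not drawn from GM-SEUS or USPVDB.

```python
def mean_absolute_error(estimated, reference):
    """Mean of absolute differences between paired estimates and references."""
    errors = [abs(e - r) for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)

def fraction_within(estimated, reference, tolerance):
    """Share of paired values whose absolute difference is <= tolerance."""
    hits = sum(1 for e, r in zip(estimated, reference)
               if abs(e - r) <= tolerance)
    return hits / len(estimated)
```

For estimated years 2015, 2017, 2020, and 2012 against permitted years 2015, 2016, 2018, and 2016, the MAE is 1.75 years, with half of the estimates within one year and three quarters within two years.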

Fig. 8

Estimated installation year validation compared to USPVDB. Boxplots show quartiles and median installation year deviation, with high values indicating an overestimate of installation year. The dotted red line indicates no deviation, and the red alpha range displays a ±1 year deviation range.

We acquired permitted capacity from existing data sources for 4,894 arrays and estimated capacity for the remaining 10,123 arrays in GM-SEUS. Compared to USPVDB, estimated installed capacity had a log-log transformed R2 of 0.84 for solar PV arrays (c-Si and thin-film). Due to limited availability of CSP capacity validation data (15 arrays) and aperture area conversion factor limitations across CSP technologies, we did not perform a comparative statistical analysis for CSP estimated capacity. Estimated solar PV capacity error is shown in Fig. 9.

Fig. 9

Estimated solar PV installed capacity validation. Note that the plot and regression are Log-Log transformed and do not include CSP arrays.

Relative to USPVDB, GM-SEUS mount types were correct 92% of the time. Regarding azimuth and tilt, Perry et al.53 described accuracy of reported azimuths within 15 degrees of ground truth (64%) and of reported tilt within 5 degrees of ground truth (63%). For GM-SEUS, 89% of azimuth values and 12% of optimal tilt values were within 15 and 5 degrees of permit-reported azimuth and tilt respectively. Note that rather than estimating actual tilt, we estimated optimal tilt for an array using local typical meteorological year and topography data. Expanding the threshold, estimated optimum tilt was within 10 degrees of permitted tilt for 41% of arrays, and 80% within 15 degrees of permitted tilt.

Median GM-SEUS Solar PV GCR1 values were 53% for fixed-axis, 42% for single-axis tracking, 50% for dual-axis tracking, and 63% for arrays with mixed mounts. We compared GM-SEUS GCR1 estimates with CCVPV packing factor estimates21, estimates from Ong et al.123, and USA-wide estimates from Phillpott et al.36. Note, however, that Phillpott et al.36 report GCR for small (43%), medium (36%), and large (30%) arrays, rather than by mount technology. GCR1 and GCR2 distributions and their comparison to validation data are shown in Fig. 10.

Fig. 10

Ground cover ratio distribution compared to existing sources. GCR1 and GCR2 are newly derived metrics described by Eqs. 7 and 8. CCVPV is the packing factor for only fixed- and single-axis tracking arrays from Stid et al.21, and Ong represents reported packing factor values from Ong et al.123. Although not included in the graph, Phillpott et al.36 reports GCR for small (43%), medium (36%), and large (30%) arrays.

Fixed-axis and single-axis tracking GCR1 values are comparable across GM-SEUS, CCVPV, and Ong et al.123, and to the “small” array GCR (43%) reported by Phillpott et al.36. GM-SEUS dual-axis GCR1 is more than twice (50% vs. 22%) what is reported by Ong et al.123. However, Ong et al.123 reported only 9 dual-axis arrays (and only 83 systems in total). We considered 230 dual-axis systems, 5,282 fixed-axis systems, 2,583 single-axis systems, and 298 mixed-mount systems. Ultimately, panel-row area (Fig. 7B–D) and GCR (Fig. 10) estimates indicate that panel-row area and packing layout are generally consistent with prior findings and data, with a small overestimation of panel-row area.

Usage Notes

GM-SEUS limitations

This dataset provides a broad characterization of solar array design practices. Any characterization of solar array design and management derived from remote sensing imagery should be considered with extreme scrutiny given the limitations of such approaches17. While our work fills a critical data gap and compiles and enhances existing high-fidelity datasets, the design practices reported here are subject to uncertainty and should not be used to represent actual conditions at individual sites. No warranty is expressed or implied regarding accuracy, completeness, or fitness for a specific purpose. We publish this dataset as open access to support the broader scientific community, policy makers, and stakeholders in addressing questions about the existing renewable energy landscape, and we do not consent to these data being used to target, identify, or make claims about individual arrays, properties, or entities. Any such use is strictly prohibited.

Despite our best efforts, we acknowledge limitations in the creation of GM-SEUS, several of which are noted above. Hu et al.17 define several additional cautions when performing solar energy characterization using overhead imagery. These include issues with distribution shift, availability of testing data, standardization of comparison metrics, and the scale of evaluation impacting reported performance. Where possible, we mitigated these pitfalls by evaluating performance with data of similar scope and coverage (USPVDB) and at the array or panel-row level, rather than aggregating17. We also make our dataset publicly available and use common evaluation metrics (e.g., IoU).

There are additional challenges when using NAIP for image classification due to high-resolution aerial imagery metadata variability. These include timing of acquisition (season, date, time of day), camera tilt, look angle, flight path mosaic and georectification artifacts, and low radiometric resolution102. On a pixel-wise basis, for example, these challenges could lead to underestimating the panel-row area of a single-axis tracking system if imagery was acquired at the start or end of contracted flight time144. We have accounted for this bias in the past by trigonometrically correcting panel-row area based on the maximum estimated tilt angle of the tracking mount at the timing of imagery21. Even at sub-meter resolution, however, this correction could overestimate panel-row area if edge pixels (or shadows) were included in the classification or if tilt is lower than expected. In fact, we know that in some arrays, GM-SEUS overestimates panel-row area due to convex-hull inclusion of isolated exterior pixels and commission errors that include shadows and access roads (Fig. 7C,D and Fig. 11). This overestimation may also explain some uncharacteristically high GCR values, particularly for dual-axis installations (Fig. 10).
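To make the direction of this bias concrete, the following is a minimal sketch of a trigonometric area correction. It assumes the simplest possible geometry, in which a panel tilted by an angle from horizontal and viewed from directly overhead appears foreshortened by the cosine of that angle; the correction actually applied in the cited prior work may differ, and the function name and values are hypothetical.

```python
import math

def correct_panel_area(observed_area_m2, tilt_deg):
    """Recover panel surface area from its overhead (nadir) footprint.

    Assumption: observed_area = true_area * cos(tilt), so dividing by cos(tilt)
    undoes the foreshortening. Tilt must be in [0, 90) degrees.
    """
    if not 0.0 <= tilt_deg < 90.0:
        raise ValueError("tilt_deg must be in the range [0, 90)")
    return observed_area_m2 / math.cos(math.radians(tilt_deg))

# Hypothetical footprint of 100 m^2 observed in imagery.
assumed_max_tilt = correct_panel_area(100.0, 60.0)  # correction at an assumed 60 degree tilt
actual_lower_tilt = correct_panel_area(100.0, 30.0)  # correction if the tracker was only at 30 degrees
```

Because the cosine shrinks as tilt grows, correcting with a tilt larger than the true tilt inflates the result (`assumed_max_tilt` exceeds `actual_lower_tilt` here), which is the overestimation risk described above.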

Fig. 11

Panel-row delineation errors and limitations. The left column contains NAIP aerial imagery and input geometries originating from the source dataset of greatest spatial quality for that array.

We do not recommend using our estimated CSP capacity values for granular analyses of the CSP contribution to energy infrastructure. Our current approach is a highly simplified first-order estimate of thermal capacity (Eqs. 11 and 12) for 59 of 74 included CSP arrays. While the applied estimation approach is not recommended for tower-CSP plants, we also extended it to these systems due to a lack of comparable alternatives. Additionally, we made no assumptions regarding thermal-to-electric efficiency and did not convert thermal capacity (MWth) to electric capacity (MWe). As a result, our results include a mix of estimated thermal capacity and electric capacity values from existing sources. Including both retained and estimated values, the total estimated GM-SEUS CSP capacity was 1.96 GWth+e, compared to reported values from SolarPACES (1.51 GWe), USS (1.77 GWe), and GSPT (1.81 GWe).

Similar to the results reported in Fujita et al.26, GM-SEUS array boundaries do not represent total project area and thus total land transformation (see Fig. 3). GM-SEUS array boundaries are also spatially conservative, at times omitting array and panel-row area and under-representing the actual spatial footprint of the array (Fig. 11). This is due to our exclusion of panel-rows that were not geometrically consistent and panel-rows that produced unreasonable array boundaries. These limitations reduce the total intersection area relative to reference area and result in a low IoU (Fig. 11). We accept this limitation here because we valued high-quality panel-row metadata over within-array panel-row completeness. However, any calculation of solar energy-land interactions (e.g., land-use efficiency, land transformation, or footprint) should consider this knowledge and report limitations accordingly124.
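The effect of a spatially conservative boundary on IoU can be illustrated with a small sketch. For simplicity it assumes axis-aligned rectangles given as (xmin, ymin, xmax, ymax) in meters, whereas real array boundaries are arbitrary polygons; the function names and coordinates are hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax); 0 if degenerate."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def rect_iou(a, b):
    """Intersection over union of two axis-aligned rectangles."""
    intersection = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = rect_area(intersection)
    union_area = rect_area(a) + rect_area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

# Hypothetical reference boundary and a conservative boundary that omits 40% of it.
reference = (0.0, 0.0, 100.0, 100.0)
conservative = (0.0, 0.0, 100.0, 60.0)
iou = rect_iou(reference, conservative)
```

In this example the conservative boundary contains no commission error at all, yet its IoU is only 0.6, because omitted area lowers the intersection while the union stays at the full reference area.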

Input array polygon and point metadata limitations

Aside from manual digitization of point data to polygon data, GM-SEUS does not search for new arrays that are not contained within existing reference polygon datasets (Table 1). Thus, GM-SEUS completeness is currently limited by the coverage of existing datasets. The coverage, metadata completeness, and quality of input datasets vary depending on the scope and age of the dataset. Below, we outline key limitations associated with the metadata attributes of these input datasets (see Table 1).

The TZ-SAM dataset36,57 contains metadata on installation year, installed capacity, and GCR. TZ-SAM installation year is based on Copernicus Sentinel-2 imagery (S2) change detection, which is limited to installations after 2017. TZ-SAM also reports installation year as a range of potential dates based on the timing of imagery where change was detected. We select the median date within this range (only if the start of the range is 2018 or later) as the TZ-SAM-provided installation year. TZ-SAM installation year is also incomplete. TZ-SAM GCR estimates are derived from an OpenStreetMap validation dataset, are used to estimate installed capacity, and are provided at the country level rather than the array level.
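The installation-year rule described above can be sketched as follows. This is an illustrative reading rather than the GM-SEUS code: it interprets the median date of a two-ended range as its midpoint, and the function and parameter names are hypothetical rather than actual TZ-SAM field names.

```python
from datetime import date

def tzsam_install_year(range_start, range_end, min_start_year=2018):
    """Installation year from a change-detection date range, or None if unusable.

    Assumptions: the "median date" is the midpoint of the range, and ranges that
    start before min_start_year are discarded.
    """
    if range_end < range_start:
        raise ValueError("range_end must not precede range_start")
    if range_start.year < min_start_year:
        return None
    # date + timedelta uses whole days only, so the midpoint is a calendar date.
    midpoint = range_start + (range_end - range_start) / 2
    return midpoint.year

# Hypothetical detection windows.
year_kept = tzsam_install_year(date(2019, 10, 1), date(2020, 6, 1))
year_dropped = tzsam_install_year(date(2017, 6, 1), date(2018, 6, 1))
```

Under these assumptions, a range spanning a calendar boundary is assigned to whichever year contains its midpoint, and ranges beginning before 2018 yield no installation year.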

The CWSD dataset25,94 contains metadata on installation year. As with TZ-SAM, CWSD installation year is based on S2 change detection, which is limited to installations after 2017.

The OSM dataset58 contains metadata on installation year, mount, and installed capacity. OSM installation years are based on the ‘start_date’ attribute, whose definition is uncertain because it is crowdsourced. The same issue applies to mount and installed capacity reporting in OSM.

The SolarPACES or CSP.guru dataset47,95 contains metadata on installation year and installed capacity. SolarPACES also contains technology information (e.g., Parabolic Trough, Power Tower), from which we made assumptions about mount technologies.

The World Resources Institute’s GPPDB dataset42,96 contains metadata on installation year and installed capacity. GPPDB installation years were inferred from a commissioning year attribute, which may not be the same as the year of completed installation.

Use case product: labeled imagery for semantic segmentation

To demonstrate the utility of the granularity of GM-SEUS arrays and panel-rows, we provide solar panel-row labeled images as an auxiliary data product intended for training deep learning models for pattern recognition (e.g., semantic segmentation). Deep learning convolutional neural networks require abundant and well-labeled training data and are used in a number of existing solar identification and characterization efforts (e.g., refs. 3,4,5,6,7,10,11,12,13,14,16,17,18,19,20,23,25,27,28,30,32,33,36,53,91,92).

Labeled imagery was created for solar energy arrays within GM-SEUS that contained NAIP-generated panel-rows (CCVPV or GM-SEUS Source) and at least 10 identified panel-rows. To reduce panel-row omission error, imagery was only selected in sub-array regions where panel-rows were present. Images (inputs) are 4-band (R, G, B, and NIR) rasters, and masks (targets) are binary single-band rasters (0: non-solar, 1: solar), where solar labels are GM-SEUS panel-row vectors rasterized at local NAIP resolution and projection. Images and masks are provided at 256 × 256 pixel dimensions. For each array, the number of randomly placed, point-centered image windows was set to 50% of the panel-row-containing array area divided by the tile area. This resulted in ~17,500 images and masks within 4,452 arrays. We intend to augment this dataset with higher-density sampling to include 200,000+ training images in subsequent GM-SEUS versions. Example images and masks for fixed-, single-, and dual-axis mounted installations are shown in Fig. 12.
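The sampling density described above can be made concrete with a short arithmetic sketch. It assumes square pixels, a tile footprint of 256 pixels times the local pixel size on each side, and leaves rounding to the caller because the rounding rule is not stated in the text; the function names and example values are hypothetical.

```python
TILE_SIZE_PX = 256  # tile width and height in pixels, as stated for the image product

def tile_area_m2(pixel_size_m):
    """Ground area covered by one tile, assuming square pixels of pixel_size_m meters."""
    side_m = TILE_SIZE_PX * pixel_size_m
    return side_m * side_m

def target_window_count(array_area_m2, pixel_size_m, fraction=0.5):
    """Unrounded number of sampling windows: fraction * array area / tile area.

    Assumption: array_area_m2 is the area of the array region containing panel-rows.
    """
    return fraction * array_area_m2 / tile_area_m2(pixel_size_m)

# Hypothetical array of 1,310,720 m^2 imaged at 1 m resolution.
windows = target_window_count(1_310_720.0, 1.0)
```

At finer pixel sizes each tile covers less ground, so the same array yields more windows; for example, a 0.6 m pixel gives a tile footprint of roughly 23,593 m² instead of 65,536 m².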

Fig. 12

Deep learning and pattern recognition use case image product examples. Three columns of six examples containing inputs or images (left) and targets or masks (right) for fixed-, single-, and dual-axis mounted arrays contained within GM-SEUS. Note, this data was not used to create GM-SEUS but is provided as a value-added product within the data repository. Image products were created from NAIP imagery and GM-SEUS panel-rows.

All files are stored as GeoTIFFs in the native NAIP imagery projection (the UTM zone of the source imagery location); spatial reference information is included in both images and masks in case reprojection is needed. Images and masks retain the same file naming logic for easy application. Importantly, file nomenclature includes the respective arrayID from GM-SEUS, meaning images can be selected for metadata-specific applications (e.g., avgAzimuth, mount). File names follow the pattern “id3044_tile0.tif”, where ‘3044’ is the arrayID from GM-SEUS and ‘tile0’ is the tile number for that array.
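A minimal sketch of how the file nomenclature could be used to select tiles for specific arrays is given below. The regular expression is inferred from the single example file name above and may not cover every file in the repository; the function names and the example file list are hypothetical.

```python
import re

# Pattern inferred from the example "id3044_tile0.tif" (an assumption, not a specification).
TILE_NAME = re.compile(r"^id(?P<array_id>\d+)_tile(?P<tile>\d+)\.tif$")

def parse_tile_name(filename):
    """Return (arrayID, tile number) parsed from a tile file name."""
    match = TILE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match["array_id"]), int(match["tile"])

def select_tiles(filenames, wanted_array_ids):
    """Keep only file names whose arrayID is in wanted_array_ids."""
    wanted = set(wanted_array_ids)
    return [name for name in filenames if parse_tile_name(name)[0] in wanted]

# Hypothetical directory listing.
names = ["id3044_tile0.tif", "id3044_tile1.tif", "id17_tile0.tif"]
selected = select_tiles(names, [3044])
```

In practice, the set of wanted arrayIDs would come from filtering the GM-SEUS array attributes (e.g., by mount or avgAzimuth), and the same names could then be used to pair each image with its mask.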