Introduction

Earth Observation (EO) satellites are critical for observing Earth’s atmosphere, oceans, land, and cryosphere1, and they underpin research in climate dynamics2, urbanization3, and epidemiology4. However, conventional ground-based processing pipelines, which depend on post-downlink analysis, are inadequate for real-time applications requiring sub-hour response times. To overcome these limitations, recent efforts have turned to embedding Artificial Intelligence (AI) and Machine Learning (ML) capabilities directly onboard satellites (Fig. 1). This enables in-situ semantic analysis and intelligent data filtering, reducing transmission of low-value data (such as cloud-covered scenes) and easing downlink constraints5,6,7, while satisfying the latency demands of time-critical applications8. In parallel, advances in the Internet of Things (IoT) and edge computing are promoting interdisciplinary convergence and lowering the cost of entry for satellite missions, catalyzing growth in the New Space Economy9. As AI-driven spaceborne edge computing reshapes the full “acquisition-to-feedback” pipeline8,10, small satellites equipped with AI-enabled processors are being deployed in growing numbers11,12,13,14,15. Deploying ML models in orbit remains constrained by the harsh spaceborne environment, particularly power limitations, thermal dissipation, and radiation tolerance, which significantly influence the complexity and architecture of models that can be deployed on edge hardware. While future satellite chipsets may boost onboard compute, efficiency remains essential for mission longevity, cost, and reliability9. A key metric is the duty cycle, the fraction of time processing is active. Optimizing it requires low-power, time-efficient models suited to intermittent operation constrained by power, thermal, or communication limits.

Fig. 1

Block diagram of the proposed hardware-aware framework, which jointly optimizes task-specific performance metrics and device-specific latency. The graphical representation illustrates the AI-enabled Earth Observation (EO) data handling pipeline. In contrast to traditional processing chains, typically constrained by line-of-sight downlink windows of 8–12 min per orbit, the integration of onboard AI significantly reduces data volume by up to 85% and decreases latency from hours to minutes.

Model compression and parameter efficiency techniques have thus become indispensable tools for optimizing neural network deployment in resource-constrained environments16. These approaches minimize model footprint and computational demand while preserving accuracy, facilitating compliance with real-time processing requirements and stringent hardware limitations. Such efficiencies are crucial given the inherent constraints of spaceborne platforms, where communication bandwidth restrictions impose strict caps on the volume of data and model updates transmissible during brief contact intervals. For example, uplink rates in small satellite missions often limit daily model updates to tens or hundreds of megabytes, a figure further reduced by operational overhead, complicating timely model adaptation to evolving payload conditions or sensor drift12. Compounding these challenges is the growing adoption of large foundation models in remote sensing, which, despite their superior performance, present significant computational and memory burdens. Their integration within spaceborne systems demands innovative strategies to balance accuracy, power consumption, and temporal efficiency under strict physical constraints. Ultimately, advancing computational efficiency through model design tailored to optimize duty cycle and resource usage is essential to realize scalable, robust onboard intelligence across diverse satellite architectures. Conventional model compression techniques primarily aim to reduce the size of pre-existing architectures, enhancing efficiency while remaining inherently limited by the original design constraints17,18. In contrast, Neural Architecture Search (NAS) automates the discovery of optimized network topologies and has been successfully applied to both convolutional19,20 and attention-centric architectures21,22.
A key advantage of NAS lies in its potential to incorporate hardware-awareness into the optimization loop, explicitly accounting for memory usage, power consumption, quantization-induced performance degradation, and latency constraints. However, while model optimization and NAS techniques have advanced significantly in the research community, their integration into onboard satellite AI pipelines remains underexplored, particularly under the stringent constraints of spaceborne deployment.

To address this gap, we propose a Genetic Algorithm (GA)-based NAS framework that builds upon the hardware-aware principles introduced in ProxylessNAS23, extending them to support direct architecture search on the target task and under explicit hardware constraints. By incorporating latency-aware evaluation and resource-bounded architectural generation, our approach is tailored to identify models that not only satisfy predictive performance criteria but also adhere to the inference and energy requirements of edge platforms in orbit, thereby enabling efficient and autonomous onboard decision-making.

Consequently, we make the following contributions: (1) we introduce a hardware-aware NAS framework explicitly designed for edge devices operating onboard satellites; and (2) we conduct a rigorous empirical validation of our methodology on a novel and representative dataset. To the best of the authors’ knowledge, and in contrast to prior works such as Peng et al.24, Li et al.25, and Kadway et al.26, our approach is the first to employ a population-based hardware-aware NAS. While these earlier studies explore efficient architecture search strategies or zero-shot scoring methods, none of them jointly optimize for task accuracy and hardware constraints.

The remainder of this paper is organized as follows. The Background section reviews prior work on model compression and hardware-aware neural architecture search. The Methods section details the proposed NAS framework and its hardware-aware optimization strategy. The Results section introduces the datasets developed for benchmarking and presents experimental findings that demonstrate the effectiveness of the approach. Finally, the Conclusions section summarizes the contributions and outlines directions for future research.

Background

Onboard satellites, especially CubeSats, computational efficiency is paramount due to strict power, thermal, and memory constraints. Hardware-aware optimization and model compression are key to enabling complex analytics without ground support. The following subsections present two core strategies, model compression and NAS, for efficient AI deployment in spaceborne environments.

Model compression techniques

Model compression transforms large neural network architectures into compact and efficient forms, striving to maintain predictive accuracy while reducing resource requirements. This efficiency is particularly critical when deploying models under stringent memory, power, and bandwidth constraints. Common compression methodologies encompass quantization27, pruning28, Knowledge Distillation (KD)29, and low-rank decomposition30,31, each offering distinct trade-offs between compression ratio, computational overhead, and predictive fidelity32,33,34,35.

Quantization reduces memory footprint by decreasing numerical precision, commonly converting 32-bit floating-point parameters to formats such as 8-bit integers27,36. Approaches include Post-Training Quantization (PTQ), performed post-training, and Quantization-Aware Training (QAT), integrated during training. Despite its storage and computational efficiency, quantization requires careful calibration to balance bitwidth reduction with minimal accuracy degradation due to inherent trade-offs37,38.
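As a concrete illustration of PTQ, the affine int8 scheme below maps a floating-point tensor onto 256 integer levels via a scale and zero-point. This is a minimal sketch of the general idea under a min–max calibration assumption, not the calibration procedure of any particular toolchain; the function names are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Affine post-training quantization of a weight tensor to int8.

    Returns the quantized values plus the (scale, zero_point) needed to
    dequantize at inference time. Min-max calibration is assumed.
    """
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # guard against constant tensors
    zero_point = int(round(-w_min / scale)) - 128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale
```

The round trip loses at most half a quantization step per weight, which is the source of the accuracy degradation that calibration (or QAT) tries to contain.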

Pruning simplifies network architectures by eliminating redundant or less significant weights and neurons, thereby creating sparser networks. This process directly reduces both memory usage and computational complexity, aiming to preserve original predictive performance28.

Low-rank decomposition techniques, such as singular value decomposition and advanced tensor factorization methods, approximate weight matrices by decomposing them into products of smaller matrices. These approaches achieve reductions in computational complexity and memory usage, with varied effectiveness depending on the method and application context30,31,39. KD leverages knowledge transfer from larger “teacher” networks to smaller “student” architectures. This approach enables compact networks to achieve performance comparable to more complex models across diverse applications. However, KD’s efficacy is sensitive to differences in capacity and architecture between teacher and student models and demands meticulous hyperparameter tuning to avoid information loss from softened targets29,40. Alongside these compression techniques, NAS offers more than an alternative, as it can be integrated with quantization-aware training and structured pruning to jointly optimize architecture and weight representations, targeting low-precision accelerators (e.g., 8-bit integers) for efficient deployment.
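To make the low-rank idea concrete, the sketch below factorizes a dense weight matrix with a truncated SVD; the function name and rank choice are illustrative, not taken from the works cited above.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (m x n) as A @ B with A (m x r) and B (r x n).

    Replacing a dense layer's W with the pair (A, B) reduces its parameter
    count from m*n to r*(m + n), a saving whenever r < m*n / (m + n).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B
```

In a network, the single matrix multiply `x @ W.T` is then replaced by two cheaper ones, `(x @ B.T) @ A.T`, with accuracy loss governed by the discarded singular values.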

Neural architecture search

Rather than compressing an existing model, NAS searches for optimal architectures from scratch, providing a proactive approach to balancing model efficiency and predictive capability. NAS formulates the problem as a constrained optimization over the architectural search space \(\mathscr {A}\). Specifically, the objective is to identify an architecture \(a \in \mathscr {A}\) that minimizes a task-specific loss function \(\mathscr {L}(a, \mathscr {D})\), while satisfying resource constraints such as computational cost \(\mathscr {C}(a)\). Contemporary NAS approaches employ a variety of empirical strategies to navigate this space efficiently, including learning-based methods and meta-heuristic algorithms. Core strategies can be broadly categorized into GA41, Reinforcement Learning42, One-Shot Methods43, Bayesian Optimization44,45, and Gradient-Based Approaches46,47. The following provides a brief overview of these methods, highlighting their key characteristics. GAs evolve a population of candidate architectures through iterative mutation and recombination. Reinforcement Learning-based NAS employs a controller, typically a Recurrent Neural Network (RNN), to generate architectures, optimizing them via reward signals from performance evaluations. One-Shot methods, such as Differentiable Architecture Search (DARTS)48 and ProxylessNAS23, utilize an overparameterized supernet that enables efficient evaluation of sub-architectures without retraining. Bayesian Optimization (BO) leverages probabilistic models to guide exploration toward promising regions of the search space. Gradient-Based methods reformulate architecture search as a continuous optimization problem, enabling the use of gradient descent for efficient exploration. Notably, hybrid strategies often combine One-Shot and gradient-based techniques, as exemplified by DARTS48 and ProxylessNAS23, to mitigate the limitations of individual approaches.

It is worth noting that each of these NAS methods offers distinct advantages and challenges. For example, GA and reinforcement learning are well-suited for exploring a wide variety of architectural configurations but are computationally intensive. One-shot methods and gradient-based approaches are efficient due to weight-sharing mechanisms, though this can constrain the performance of individual architectures. BO reduces search time by focusing on promising candidates, albeit at the cost of potentially missing out on high-performing architectures due to biased exploration.

Despite surpassing manual and compressed models in performance and compactness, early NAS methods often decouple accuracy from efficiency, neglecting deployment-critical factors such as memory, latency, power, and radiation resilience, which are especially vital for spaceborne edge computing. Hardware-aware NAS addresses these limitations by embedding deployment constraints into the search, yielding models suited for real-time, resource-constrained inference. Though gaining traction, its application in Earth Observation is limited, with most efforts focusing on ground-based tasks24,25,26,49,50,51. Notable onboard attempts include a zero-shot NAS for low-power devices26 and a reinforcement learning approach for fire detection on nanosatellites51. Despite their relevance, these methods face key limitations: restricted search spaces, fixed or heuristic objectives, and limited architectural diversity. Crucially, they overlook the accuracy–efficiency trade-offs essential for autonomous, real-time inference in orbit.

Methods

Figure 1 illustrates the developed framework, which can be schematically divided into three key building blocks: Model Generator, Optimizer, and Hardware Awareness Block (HAB).

The Model Generator creates new architectures from the search space, which lists all possible neural network designs to explore during optimization. It provides the basis for generating candidate models. The Optimizer searches this space to find designs that trade off accuracy and efficiency. It uses algorithms to improve model choices step by step. The HAB adds hardware benchmarking, so chosen designs are tailored to the target device.

Search space

The admissible set of candidate architectures is formally defined as \(\mathscr {A}\), from which individual configurations \(a \in \mathscr {A}\) are stochastically sampled during the optimization process. Each architecture is instantiated by a single-path hierarchical generator that sequentially composes convolutional blocks, as also displayed in Fig. 1. Each block incorporates two modular primitives: a normal cell, consisting of standard convolutional operations, and a reduction cell, which halves the spatial resolution via max or average pooling. This cell-stacking design paradigm is widely adopted in both handcrafted and NAS-derived architectures. The total number of blocks is uniformly sampled from a discrete interval \([3, M_n]\), where \(M_n\) is a tunable upper bound.

Candidate architectures are parameterized by a combination of discrete and continuous variables, including the type and order of layers (e.g., convolutional, pooling, fully connected), as well as hyperparameters such as kernel size, number of filters, and activation function. Each architectural configuration serves as a genotype, akin to chromosomes in genetic algorithms, encoding structural elements like skip connections and network depth. This encoding enables expressive and flexible representations within the search space.

Each candidate architecture \(a \in \mathscr {A}\) is represented by a structured string called the architecture code. For instance, the architecture code LRr3agn1EPaELco2k3s2p2agn1EPMEHCEE corresponds to the genotype [“LRr3agn1”, “Pa”, “Lco2k3s1p1agn1”, “PM”, “HC”], where each segment encodes a functional block. The token Lco2k3s1p1agn1, for example, specifies a convolutional layer with kernel size 3, stride 1, padding 1, GELU activation, and doubled output channels. This code is parsed into a sequence of components, referred to as the genotype, that specify the network’s building blocks and their order. This multi-level encoding mechanism, spanning from symbolic strings to interpretable layer definitions, enables expressive model specification and efficient traversal of the architectural search space. Moreover, it provides a robust foundation for mutation and crossover operations in the genetic algorithm, ensuring the generation of structurally valid and hardware-feasible architectures.
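The full token grammar is not spelled out here, so as a minimal sketch, a parser for convolutional tokens might look as follows. The regular expression and field names are assumptions inferred from the single documented example (“Lco2k3s1p1agn1”: doubled output channels, kernel 3, stride 1, padding 1); the remaining suffix is left unparsed.

```python
import re

# Hypothetical pattern for convolutional tokens of the form "Lco2k3s1p1...".
# Field meanings beyond channel multiplier / kernel / stride / padding are
# not specified in the text and are therefore not decoded here.
CONV_TOKEN = re.compile(r"Lco(?P<ch_mult>\d+)k(?P<kernel>\d+)s(?P<stride>\d+)p(?P<pad>\d+)")

def parse_conv_token(token: str) -> dict:
    """Decode a convolutional genotype token into layer hyperparameters."""
    m = CONV_TOKEN.match(token)
    if m is None:
        raise ValueError(f"not a convolutional token: {token!r}")
    return {name: int(value) for name, value in m.groupdict().items()}
```

A genotype is then simply a list of such tokens, which is what makes string-level mutation and crossover straightforward to implement.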

To ensure real-time feasibility for embedded platforms, the generator enforces a strict upper bound on model complexity by capping the number of trainable parameters at 10 million. Once instantiated, each candidate architecture \(a \in \mathscr {A}\) is trained via supervised learning and evaluated using a hardware-aware fitness function.

Optimization strategy

As one of the first contributions tailored to on-orbit satellite deployment, the architecture search strategy is formulated using a GA. This methodological choice is motivated by two key considerations: (1) the widespread success of evolutionary approaches in NAS, and (2) their inherent ability to optimize non-differentiable, discrete objective functions without relying on continuous relaxations. The optimization process operates over a fixed number of generations \(G_n\), with each generation maintaining a population of \(P_s\) candidate architectures. The evolutionary cycle iteratively refines the population through a balance of exploration and exploitation.

Two primary genetic operators drive population evolution. First, mutation, applied at a probability \(M_f (\%)\), introduces stochastic variability by randomly replacing one layer in an architecture with a structurally distinct alternative from the predefined search space. This mechanism ensures continual architectural innovation and prevents premature convergence. Second, single-point crossover is applied to the top-performing \(M_p (\%)\) of the population (the mating pool), ranked by fitness. For each selected parent pair, a random crossover point is sampled, and the architectural components before and after this point are exchanged, producing two offspring. This recombination process promotes the inheritance of beneficial architectural traits, accelerating convergence toward high-performing solutions.
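The two operators above can be sketched compactly, assuming genotypes are represented as Python lists of block tokens (the helper names are hypothetical):

```python
import random

def mutate(genotype: list, search_space: list, rate: float = 0.2) -> list:
    """With probability `rate`, replace one randomly chosen block with a
    structurally distinct alternative drawn from the search space."""
    child = list(genotype)
    if random.random() < rate:
        i = random.randrange(len(child))
        alternatives = [b for b in search_space if b != child[i]]
        child[i] = random.choice(alternatives)
    return child

def crossover(parent_a: list, parent_b: list):
    """Single-point crossover: swap the tails of two genotypes at a random cut."""
    cut = random.randrange(1, min(len(parent_a), len(parent_b)))
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])
```

In practice, each offspring would additionally be checked for structural validity and the 10M-parameter cap before entering the population.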

To promote convergence toward high-quality solutions while preserving exploratory capability, the evolutionary process integrates three key mechanisms. First, elitism is applied: the top \(K_{\text {best}}\) architectures, ranked by fitness, are retained across generations, ensuring monotonic performance improvement.

Second, diversity injection mitigates premature convergence. At each generation, \(n_{\text {random}}\) novel architectures are uniformly sampled from the search space, enabling exploration of underrepresented regions that may elude incremental refinement.

Third, fitness-based selection guides genetic operations. Architectures are chosen with probabilities proportional to their multi-objective fitness scores, favoring those that best trade off accuracy and efficiency. To calculate the fitness, we adopt a weighted sum with exponential penalty, formulated as follows:

$$\begin{aligned} \mathscr {L}_{fitness} = \alpha \cdot \widehat{\text {fps}} + \beta \cdot \text {Metric} \cdot e^{\gamma \cdot \text {Metric}} \end{aligned}$$
(1)

where \(\alpha , \beta ,\) and \(\gamma\) are scalar weights that balance the contributions of speed and accuracy, enabling task-specific trade-offs during optimization. In this experiment we set \(\alpha , \beta , \gamma = 1\) to balance the influence of speed and accuracy, noting that the exponential term increases the weight of higher accuracy values. The inference speed is normalized against a target value of 120 Frames Per Second (FPS) (\(\text {fps}_{\text {target}} = 120\)), yielding \(\widehat{\text {fps}} = \frac{\text {fps}}{120}\), a dimensionless metric adjustable via a hyperparameter to reflect deployment needs. Finally, the model accuracy is captured via a task-specific Metric, with exponential weighting \(e^{\gamma \cdot \text {Metric}}\) to emphasize high-performing architectures and penalize low-performance cases. In our case, we use the mean Intersection over Union (mIoU) as the accuracy metric.
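Equation (1) translates directly into code. One caveat: the fitness values reported later for fast models are consistent with the normalized FPS being clipped at 1.0 once the target throughput is reached; that clipping is an assumption of this sketch, not stated explicitly in the text.

```python
import math

def fitness(fps: float, metric: float,
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
            fps_target: float = 120.0) -> float:
    """Weighted-sum fitness with exponential accuracy emphasis (Eq. 1).

    Clipping fps/fps_target at 1.0 is an assumption inferred from the
    reported fitness values of models exceeding the target throughput.
    """
    fps_norm = min(fps / fps_target, 1.0)
    return alpha * fps_norm + beta * metric * math.exp(gamma * metric)
```

For example, ResNet-18’s reported operating point (mIoU 0.840 at 40.4 FPS) evaluates to roughly 2.28 under this formula, matching the value quoted in the Results section.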

Hardware-awareness

After each training cycle, the training server interfaces with a designated device to collect performance feedback. Specifically, the NAS process performs inference on the target deployment environment to evaluate system-level metrics that inform the fitness function. If required by the device, model quantization and hardware-specific optimizations are integrated into the NAS workflow during the compilation phase, ensuring that resulting architectures are not only efficient but also compatible with deployment constraints. This feedback loop enables the search process to jointly optimize for both predictive accuracy and hardware-aware metrics, such as inference latency and memory access cost, ensuring that the selected architectures meet the stringent requirements of real-time, resource-constrained onboard applications.

Results

Data

We evaluated the proposed NAS approach on two separate tasks, i.e., thermal hotspot classification and burnt area segmentation.

Burnt area segmentation

We developed a novel dataset with near-global coverage, paying particular attention to class balance across training, validation, and test splits to ensure robust learning and evaluation. The legend in Figure 2 shows how class distributions are carefully structured to avoid bias, promoting generalization across diverse geospatial and ecological conditions.

Fig. 2

Overview of the dataset employed in this study showing the spatial distribution of sampled data points with an example tile. The legend displays the class-wise percentage composition.

The burned area dataset consists of high-resolution, multispectral Sentinel-2 imagery curated for post-fire analysis, supporting accurate delineation of burned regions across varied ecological and climatic settings. While Sentinel-2 offers 10 m spatial resolution for most bands, we simulate the higher-resolution and sensor-specific characteristics of \(\Phi\)-sat-2 to support realistic onboard deployment scenarios. This simulation bridges the gap between freely available satellite data and the constraints and opportunities of in-orbit sensing systems like \(\Phi\)-sat-2. The dataset comprises 115 fire events geographically distributed across North America (17), South America (17), Africa (17), Europe (18), Asia (29), and Australia (17), thereby ensuring broad geographic and climatic coverage. By encompassing fire regimes from boreal, temperate, and tropical ecosystems, the dataset supports comprehensive analyses of fire severity, post-fire vegetation dynamics, and ecosystem resilience. The inclusion of spatially and temporally diverse fire events enhances the generalization capability of learned models, facilitating robust performance across a wide range of environmental conditions.

To approximate \(\Phi\)-sat-2 onboard observations, we applied a dedicated simulation pipeline to the Sentinel-2 L1C inputs. This included selection of key bands (B02, B03, B04, B08, B05, B06, B07) and derivation of cloud, cloud shadow, and cirrus masks from the Scene Classification Layer (SCL). Solar geometry and irradiance metadata were used to compute radiances, from which a synthetic pan-chromatic band was derived. Sentinel-2 imagery was then resampled to 4.75 m resolution, aligning with \(\Phi\)-sat-2’s imager. To reflect in-orbit sensor characteristics, the imagery was also degraded with simulated band misalignment, signal-dependent noise profiles (SNR), and modulation transfer function (MTF) blur. In some cases, simulated L1C reflectances were recalculated from radiance. Finally, image chips of size \(256 \times 256\) were tiled and saved with corresponding masks, producing an AI-ready dataset closely emulating \(\Phi\)-sat-2’s onboard acquisition conditions.
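The final tiling step of the pipeline can be sketched as follows; discarding incomplete border tiles is an assumed convention here, since edge handling is not specified in the text.

```python
import numpy as np

def tile_chips(image: np.ndarray, size: int = 256) -> list:
    """Tile an (H, W, C) scene into non-overlapping size x size chips.

    Incomplete border tiles are discarded (one simple convention; padding
    or overlap strategies are equally possible).
    """
    h, w = image.shape[:2]
    chips = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            chips.append(image[y:y + size, x:x + size])
    return chips
```

The same grid is applied to the corresponding masks so that each chip–mask pair stays spatially aligned.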

To improve classification reliability and address spectral ambiguities, annotations are structured into four classes: Background, Burned Areas, Clouds, and Waterbodies, with the latter two explicitly included to mitigate occlusion effects and reduce misclassification. The explicit labeling of Clouds and Waterbodies is particularly critical, as these features frequently obscure terrestrial surfaces in optical satellite imagery. Cloud cover, characterized by high spatial and temporal variability, introduces discontinuities that hinder burned area detection, while waterbodies exhibit spectral signatures that may be erroneously classified as burn scars if not properly distinguished. By incorporating these additional classes, the dataset substantially improves segmentation robustness, reducing errors associated with atmospheric interference and surface-level ambiguities.

Thermal hotspots classification

To showcase the generality of our approach on a different task, we employ the “end2end” dataset from our previous work6, specifically designed to enable the development of models for onboard, real-time thermal anomaly detection. Unlike conventional imagery that relies on atmospheric correction or fine geometric registration, the dataset preserves the characteristics of raw Sentinel-2 granules, using only pre-computed coarse shifts to align bands B11 and B12 with B8A. This design choice reflects the constraints of in-orbit processing, where minimal pre-processing is essential to meet latency and computational requirements. Each granule, approximately \(1152 \times 1296\) pixels, was subdivided into \(256 \times 256\) patches, and patches containing at least nine hotspot pixels within a THRawS bounding box were labeled as “events,” with all others labeled “non-events.” Labels were further refined through visual inspection to ensure reliability. The resulting collection comprises 5033 patches, with a strong imbalance between event (394, 7.8%) and non-event (4636, 92.2%) classes, thereby reflecting the rarity of thermal anomalies in real-world monitoring scenarios. To approximate operational deployment, the dataset was split in a geographically stratified manner, ensuring that training and testing samples originate from distinct regions while maintaining class proportions.
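The patch-labeling rule reduces to a simple threshold on hotspot pixel counts. The sketch below checks the binary mask directly, whereas the original pipeline restricts the count to pixels inside a THRawS bounding box:

```python
import numpy as np

def label_patch(hotspot_mask: np.ndarray, min_pixels: int = 9) -> str:
    """Label a patch as an event if it contains at least `min_pixels`
    hotspot pixels (simplification: the whole mask is counted, rather
    than only pixels inside a THRawS bounding box)."""
    return "event" if int(hotspot_mask.sum()) >= min_pixels else "non-event"
```

With a 7.8% event rate, such a rule produces the heavy class imbalance that motivates the choice of MCC as the classification metric below.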

Evaluation metrics

The metric chosen for evaluating model performance is the mean Intersection over Union (mIoU), a standard metric for semantic segmentation. It is defined as the average IoU computed across all semantic classes: \(\text {IoU}_c = \frac{|S_p^c \cap S_g^c|}{|S_p^c \cup S_g^c|}, \quad {\text {mIoU} = \frac{1}{C} \sum _{c=1}^{C} \text {IoU}_c}\), where \(S_p^c\) and \(S_g^c\) denote the predicted and ground truth masks for class \(c\), and \(C\) is the number of classes. The mIoU quantifies spatial overlap and is particularly effective for evaluating segmentation performance across heterogeneous class distributions. For classification, the Matthews Correlation Coefficient (MCC) is selected, providing a balanced evaluation even under class imbalance. The MCC is defined as \(\text {MCC} = \frac{ TP \cdot TN - FP \cdot FN }{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\), where \(TP\), \(TN\), \(FP\), and \(FN\) denote true positives, true negatives, false positives, and false negatives, respectively.
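Both metrics follow directly from their definitions. In the sketch below, skipping classes that are absent from both prediction and ground truth is one common convention for mIoU, assumed here:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU: intersection-over-union averaged over the semantic classes."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent in both masks: skip
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from \(-1\) to \(+1\), with 0 corresponding to chance-level prediction, which is what makes it informative for the 7.8%-event classification task.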

Experiments

All models were trained on an Amazon EC2 instance of type g4dn.12xlarge, equipped with four NVIDIA T4 Graphics Processing Units (GPUs), 48 vCPUs, and 192 GB of RAM, running Ubuntu 24.04 LTS (Noble Numbat). The software stack included CUDA 12.4 and PyTorch 2.6.0. To ensure a fair and reproducible comparison across candidate architectures, no hyperparameter tuning was performed. Focal Loss52 was used during training to address class imbalance in segmentation tasks. This improves sensitivity to hard-to-segment regions and sharpens boundary delineation. To assess deployment feasibility under spaceborne constraints, we evaluate our NAS framework on two complementary edge platforms: the NVIDIA Jetson AGX Orin\(^{\text {TM}}\)53 and the Intel® Movidius\(^{\text {TM}}\) Myriad\(^{\text {TM}}\) X54. These devices exemplify distinct trade-offs in computational capability, energy efficiency, and integration complexity, providing a comprehensive perspective on onboard AI processing. The NVIDIA Jetson AGX Orin\(^{\text {TM}}\) delivers up to 275 TOPS within a 15–60 W envelope, supporting 32-bit and mixed-precision inference via CUDA and TensorRT. A variant, the Orin NX, has been selected for space deployment aboard SpaceX’s Transporter-11, equipped with radiation shielding55. The Myriad\(^{\text {TM}}\) X offers over 1 TOPS at approximately 2 W, with 16-bit fixed-point support and a dedicated Neural Compute Engine. It has demonstrated in-orbit capability on D-Orbit’s Wild Ride ION mission for onboard Earth observation56. In addition, we evaluated our framework on the NVIDIA A100-SXM (300 W), a datacenter-grade accelerator delivering up to 312 TFLOPS of mixed-precision performance and optimized for large-scale AI workloads. The contrast between 32-bit and 16-bit inference regimes underscores the breadth of our evaluation, showcasing the NAS framework’s capacity to adapt across heterogeneous precision constraints and deployment scenarios.
We note that Jetson AGX Orin experiments used 32-bit precision despite native 16- and 8-bit support, as our goal was to assess NAS robustness across hardware rather than optimize a single device. On both devices, each model required about 2 minutes per generation, with the full NAS procedure lasting roughly 48 hours for the segmentation task and about 4 hours for the classification task, subject to dataset size and parameter choices.

Segmentation task

Figure 3a displays results for the NVIDIA Jetson AGX Orin\(^{\text {TM}}\). In particular, Fig. 3a(1) shows that the model discovered by our approach achieves a competitive mIoU while delivering an exceptional inference throughput, approximately \(3\times\) faster than baseline architectures, i.e., MobileOne-S057, EfficientNet-B019, and ResNet-1858. Indeed, the model, highlighted by an arrow, occupies the right segment of the metric–FPS space, decisively dominating the performance–efficiency trade-off. ResNet-18, with 18 layers and approximately 11.7M parameters, achieves a mIoU of 0.840 at 40.4 FPS (fitness = 2.28). Despite its accuracy, it exhibits substantially higher latency and complexity compared to our model, which matches its performance while using only 34.2 K parameters. EfficientNet-B0 achieves a lower mIoU of 0.820 and an inference speed of 38.5 FPS (fitness = 2.18), whereas MobileOne-S0 yields a higher accuracy of 0.859 but at a lower throughput of 25.4 FPS (fitness = 2.24). Despite these trade-offs, none of the baselines jointly optimize for both accuracy and latency as effectively as the proposed architecture. The Pareto front, shown in Fig. 3a(2), comprises the top 20 architectures output by the NAS optimization. The graph outlines how the final optimized model reaches a mIoU of 0.845 while delivering an exceptional inference throughput of 168.1 FPS (fitness = 2.97). Notably, the evolutionary algorithm exhibits consistent convergence across successive generations, attaining a peak fitness value of 2.97 by generation 15 (see Fig. 3a(3)), higher than any of the baseline models. This convergence is further substantiated by a pronounced reduction in the fitness gap, indicative of diminished variance and the emergence of dominant, high-performing architectural configurations. Finally, Fig. 3a(4) further illustrates the joint progression of maximum mIoU and FPS across generations.

Fig. 3

Summary of the NAS process on the segmentation task. Subfigures illustrate: (1) performance–latency–complexity trade-offs of discovered models compared to state-of-the-art baselines (ResNet-1858, MobileOne-S057, EfficientNet-B019; bubble size denotes parameter count); (2) Pareto front (top 20 models) showing the trade-off between segmentation accuracy (mIoU) and inference speed (FPS); the deployment-selected architecture is indicated by an arrow; (3) fitness progression over generations, including maximum, minimum, and gap values, along with parameter count evolution of the top-performing model; (4) trajectory of maximum mIoU and FPS across generations, demonstrating joint optimization; (5) architecture discovered by our framework. NAS was performed over 15 generations with a population size of 50, training each model for 10 epochs. In each generation, the top 50% of models composed the mating pool (mutation rate: 0.2), with 10 randomly initialized architectures injected per generation. The highest-performing model was preserved to ensure elitism.

Similar considerations apply to the NAS optimization on the Movidius\(^{\text {TM}}\) Myriad\(^{\text {TM}}\) X, reported in Fig. 3b. A critical distinction in this context is that all models undergo 16-bit quantization and hardware-specific compilation prior to deployment. The performance variability observed across architectures highlights the absence of hardware-awareness in baseline models, which are not tailored to the operational constraints of the Myriad\(^{\text {TM}}\) X accelerator. The bubble graph in Fig. 3b(1) shows how our discovered architecture maintains its trade-off advantage over the baseline models. EfficientNet-B019 achieves a mIoU of 0.795 at 4.73 FPS (fitness = 1.80). ResNet-1858 follows with an accuracy of 0.770 and 5.86 FPS (fitness = 1.72), while MobileOne-S057 exhibits the poorest performance on this platform, attaining only 0.683 accuracy and 4.32 FPS (fitness = 1.39).

These results highlight the sensitivity of baseline architectures to hardware constraints and post-training quantization (PTQ). Specifically, this pronounced drop in segmentation accuracy for baseline models on the Myriad\(^{\text {TM}}\) X can be attributed to several architectural mismatches between these models and the hardware’s constraints. In particular, ResNet-18 and MobileOne-S0 include layers and operations that are either unsupported or sub-optimally executed on the Myriad\(^{\text {TM}}\) X. These include large-kernel convolutions (e.g., \(7{\times }7\) in ResNet-18), unoptimized skip connections, and dynamic activation patterns such as Squeeze-and-Excitation (SE) blocks in MobileOne. Furthermore, the Myriad\(^{\text {TM}}\) X relies on 16-bit floating-point (FP16) precision and static memory allocation, making it particularly sensitive to models trained under 32-bit floating-point assumptions or those involving extensive branching and memory reuse. Unlike our proposed architectures-which co-evolved with deployment constraints in mind-baseline models do not account for the Myriad\(^{\text {TM}}\) X’s compilation and runtime pipeline. This leads to reduced throughput and degraded inference fidelity after quantization. These findings reinforce the critical importance of hardware-aware design in edge AI pipelines. Even on this constrained platform, our model maintains nearly a \(3\times\) speedup over all tested alternatives, confirming its superior hardware generalizability and robustness. This behavior is further highlighted in Fig. 3b(2), where our proposed solution sustains its Pareto dominance even with drastically fewer parameters-less than 1K-compared to multi-million-parameter baselines. In contrast to the Jetson AGX Orin\(^{\text {TM}}\), the Myriad\(^{\text {TM}}\) X’s performance saturates quickly: further shrinking of model size along the Pareto front offers diminishing returns. This is evident in the altered shape of the fitness progression curve in Fig. 3b(3), which flattens earlier due to hardware-imposed throughput limitations. Fig. 3b(4) illustrates the joint evolution of accuracy and speed. This reflects the effective adaptation of the algorithm to hardware-specific restrictions, particularly in this regime where the latency target of 120 FPS is not attainable. In this scenario, the optimization process places increased emphasis on minimizing inference time, as evidenced by the steep slope of the FPS progression curve. Since further gains in mIoU are limited and latency improvements occur within a relatively narrow performance envelope, the resulting evolution of the fitness function exhibits a flatter trajectory compared to that of the Jetson AGX Orin\(^{\text {TM}}\). This behavior confirms the algorithm’s sensitivity to hardware-imposed ceilings and its capacity to prioritize latency-aware optimization in low-throughput, resource-constrained environments.
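The scalar fitness values quoted above combine task accuracy with device-measured throughput, with the speed term saturating at the latency target (120 FPS in this regime). A minimal sketch of such an objective is given below; the weights and the exact functional form are assumptions for illustration, not the paper's formula.

```python
def fitness(miou: float, fps: float, fps_target: float = 120.0,
            w_acc: float = 2.0, w_speed: float = 1.0) -> float:
    """Illustrative scalar fitness for hardware-aware NAS.

    The speed term is capped at 1.0 once the latency target is met,
    so throughput beyond the target yields no additional reward and
    the search pressure shifts entirely to accuracy.
    """
    speed_term = min(fps / fps_target, 1.0)  # saturates at the latency target
    return w_acc * miou + w_speed * speed_term
```

On the Myriad\(^{\text {TM}}\) X, where no candidate reaches the target, the speed term never saturates; every FPS gain still increases fitness, which is consistent with the steep FPS progression and flatter overall fitness curve described above.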

From a qualitative perspective, Fig. 4 presents a comparison of segmentation masks generated by baseline models (EfficientNet-B0, MobileOne-S0, ResNet-18) and those discovered through our NAS framework (the reader is referred to Appendix B for a comprehensive account of additional qualitative and quantitative comparisons). The results are evaluated against high-resolution ground truth annotations across four representative scenes. The optimized architectures (PyNAS for Myriad\(^{\text {TM}}\) X and for Jetson AGX Orin\(^{\text {TM}}\)) consistently produce segmentation maps with high spatial fidelity, matching, and in some cases exceeding, the accuracy of the baseline models. Indeed, both the Jetson AGX Orin\(^{\text {TM}}\)- and Myriad\(^{\text {TM}}\) X-optimized variants yield clean predictions with sharp object contours. Notably, the PyNAS (Jetson AGX Orin\(^{\text {TM}}\)) model achieves superior structural agreement with the ground truth across all samples, particularly in the complex scenes of the second and fourth rows, which involve small, fragmented regions in the upper portion of the sample. These observations again validate the effectiveness of our approach in discovering architectures that match, if not surpass, manual baselines with significantly higher computational efficiency. The consistency across hardware targets underscores the robustness and adaptability of the proposed methodology for real-world remote sensing applications.

Fig. 4

Qualitative comparison of segmentation masks produced by various models-EfficientNet-B0, MobileOne-S0, ResNet-18, PyNAS (NVIDIA Jetson AGX Orin\(^{\text {TM}}\)), and PyNAS (Intel® Movidius\(^{\text {TM}}\) Myriad\(^{\text {TM}}\) X)-alongside the input image and ground truth. The tiles used were gathered on 8 August 2019 in northeastern Kazakhstan (48.83°–49.53° N, 71.96°–72.25° E). Each tile spans approximately 1.8 \(\times\) 1.8 km.

Finally, we note that on both devices each generation required approximately 2 min, and the complete NAS procedure took about 48 h. These timings may vary depending on the dataset size and the chosen NAS parameters.

Classification task

To showcase the potential of future on-orbit datacenters, as envisioned by the European Space Agency (ESA) to enable AI-powered space missions and agile in-orbit applications59, we also evaluate our approach on a high-performance device, the NVIDIA A100-SXM (300 W). While not currently certified for space deployment due to power and radiation constraints, this experiment illustrates the performance envelope achievable under unconstrained computational conditions, offering insights into the capabilities that next-generation cognitive cloud infrastructures in space could unlock. The results for the classification task are reported in Fig. 5, where our model achieved an MCC of 0.974 and an inference throughput of 8555 FPS. For comparison, the best model in Meoni et al.6 (EfficientNet-lite0) reached a maximum MCC of 0.902 with 187 FPS, despite extensive experimentation with different learning policies and hyperparameter tuning. This represents an absolute accuracy gain of \(\approx\)+8% in MCC, while delivering more than a 45-fold increase in inference speed. It is worth emphasizing that the model of Meoni et al.60 is already a well-engineered architecture tailored to operate on edge devices, making it a particularly strong baseline. The fact that our solution not only surpasses it in accuracy but also dramatically reduces latency suggests that the proposed hardware-aware search framework does not merely compress existing designs but is capable of discovering fundamentally more efficient architectures. This points to a broader implication: hardware-aware NAS has the potential to bridge the longstanding gap between model compactness and predictive fidelity, a trade-off that has traditionally constrained spaceborne AI applications. The Pareto front output of the evolutionary search is displayed in Fig. 5a, while Fig. 5b highlights improvements in MCC and inference speed. Beyond quantitative metrics, Fig. 5c presents representative inference examples confirming the reliability of the outputs. Concluding the panel, Fig. 5d depicts the PyNAS architecture, whose depth of five was enabled by the more powerful computational resources available. Beyond the raw performance figures, these results raise two important reflections. First, they demonstrate the adaptability of our approach across heterogeneous hardware platforms, from highly constrained edge devices to powerful datacenter-grade accelerators. This adaptability suggests that the same methodology could be flexibly repurposed for hybrid Earth–space architectures, where initial training or adaptation occurs on the ground with high-performance GPUs, followed by deployment of compressed and specialized models on orbit. Second, they highlight that future advancements in space-qualified hardware may unlock even greater potential for onboard autonomy: as radiation-hardened high-performance GPUs become available, the efficiency gains we observe on high-performance chips could be directly translated into operational benefits, such as lower duty cycles, reduced downlink requirements, and enhanced mission resilience.
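For reference, the Matthews correlation coefficient (MCC) used above ranges from \(-1\) (total disagreement) to \(+1\) (perfect prediction) and remains informative under class imbalance, which is common in cloud-screening datasets. A minimal binary-case implementation follows; the classification task here may be multi-class, in which case the generalized multi-class form applies instead.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any marginal is empty (the conventional fallback
    for an undefined denominator).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A chance-level classifier scores near 0 regardless of class proportions, which is why MCC is preferred over plain accuracy for this comparison.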

Fig. 5

Summary of the NAS process on the classification task. Subfigures illustrate: (a) Pareto front (top 20 models) showing the trade-off between classification accuracy (MCC) and inference speed (FPS)-the deployment-selected architecture is indicated by an arrow; (b) trajectory of maximum MCC and FPS across generations, demonstrating joint optimization; (c) representative inference results on the test set; (d) architecture discovered by our framework. NAS was performed over 8 generations with a population size of 50, training each model for 15 epochs. In each generation, the top 50% of models composed the mating pool (mutation rate: 0.2), with 10 randomly initialized architectures injected per generation. The highest-performing model was preserved to ensure elitism.

Conclusion

This study introduces a hardware-aware NAS framework tailored specifically for satellite-based edge processing constraints. By integrating evolutionary optimization with real-time latency profiling on heterogeneous hardware platforms-NVIDIA Jetson AGX Orin\(^{\text {TM}}\) and Intel® Movidius\(^{\text {TM}}\) Myriad\(^{\text {TM}}\) X-we have shown that the proposed approach significantly surpasses state-of-the-art handcrafted architectures in terms of the performance–efficiency trade-off. Additionally, we showcase its potential for future cognitive cloud computing infrastructures in space, aligning with emerging initiatives on on-orbit datacenters and AI-enabled space missions.

Specifically, on the NVIDIA Jetson AGX Orin\(^{\text {TM}}\), our optimal architecture achieved a remarkable mIoU of 0.845 at 168.1 FPS with merely 34.2K parameters, substantially exceeding the efficiency of MobileOne-S0 (0.859 mIoU at 25.4 FPS) and ResNet-18 (0.840 mIoU at 40.4 FPS), both of which comprise millions of parameters. Similarly impressive results were observed on the Intel® Movidius\(^{\text {TM}}\) Myriad\(^{\text {TM}}\) X, where NAS-derived architectures achieved 0.802 mIoU at 11.5 FPS with fewer than 1K parameters, demonstrating extraordinary compactness and robust quantization resilience. Beyond segmentation, the classification task further highlighted the framework’s efficiency: on the NVIDIA A100-SXM, our model achieved an MCC of 0.974 at 8555 FPS, compared to 0.902 at 187 FPS for the baseline model (EfficientNet-lite0), representing an \(\approx\)+8% accuracy gain and a more than 45-fold increase in inference throughput. These results, obtained across edge and datacenter-grade platforms, emphasize the scalability and versatility of the proposed methodology.

The fitness convergence analysis further confirmed stable evolutionary optimization, clearly evidencing simultaneous improvements in accuracy and latency. Additionally, qualitative assessments validated the structural fidelity and generalization capability of the segmentation models. The consistency of these results across multiple hardware platforms underscores the effectiveness and reliability of our hardware-in-the-loop methodology.

Building on these findings, and given that our current results are simulation-based, future work will explore real on-orbit deployment. This includes evaluating the approach on suitable hardware platforms to assess its robustness and feasibility under operational satellite conditions. Future directions will also explore multi-objective optimization strategies that incorporate energy efficiency and radiation resilience, and broaden the framework’s applicability to complementary tasks such as anomaly detection and onboard compression. Detailed studies of the speed–accuracy trade-off under reduced-precision quantization are also needed. A systematic comparison of NAS approaches constitutes another key research avenue. Such a comparison, while currently constrained by computational resources, will be prioritized as part of our ongoing and future research efforts to ensure a fair and comprehensive assessment of NAS methods for satellite edge computing scenarios. Furthermore, distillation of geospatial foundation models, while not empirically substantiated in the current study, may offer promising opportunities; its potential utility and alignment with satellite-edge processing constraints warrant careful future investigation.