Introduction

Neural networks (NN) have found extensive application across various fields such as healthcare1,2,3, surveillance4,5, Industry 4.06,7,8, and Internet of Things (IoT)9,10,11. A neural network can be composed of a large number of layers of different types, each with its own hyperparameters. Hence, for a given dataset/application: (1) finding the right set of neural layers; (2) connecting them in the right topology; and (3) selecting the best hyperparameters for each layer can be a daunting task requiring a large amount of computation resources, human expert involvement, and time. Requiring a given neural network to perform (during inferencing) under specific resource-constrained conditions (as is the case for many IoT/Edge devices) adds to the complexity of the neural architecture search process. For example, designing a neural network to have more than \(x\%\) accuracy for a given task is a hard problem, and it becomes harder if we further constrain the problem with additional parameters such as frames-per-second (FPS) requirements during inferencing and power consumption limits.

Traditional neural architecture search (NAS) frameworks are typically designed to identify the best architecture within a specified search space. This approach is constrained by its pre-defined search space, which limits its capacity for generating novel neural network architectures (outside the search space). Additionally, most NAS frameworks prioritize final model accuracy, leading to very high search cost (time, energy consumption) and poor edge AI performance (low FPS and high inferencing energy).

To mitigate these concerns, we propose a large language model guided neural architecture discovery (LEMONADE) framework that allows the discovery of novel neural network architectures without relying on a pre-defined search space. This framework is designed to allow the network-builder to efficiently trade off between: (1) final model accuracy; (2) neural search/discovery speed and energy consumption; (3) final model energy consumption and inferencing frames per second. These objectives are enforced through an iterative approach utilizing a large language model (LLM) and an expert system for driving the LLM towards the target discovery. The expert system uses a set of configurable rules and several user-defined metrics to generate a set of instructions for the LLM, leading to progressive refinement of the generated neural architecture.

To validate the LEMONADE framework, we perform extensive experimentation using the CIFAR-10, CIFAR-100, ImageNet16-120, EuroSAT, Malaria Parasite, and IMDb datasets. We use the framework to generate different neural networks for diverse application requirements and priorities. Neural networks generated using LEMONADE for CIFAR-10 (\(95.54\%\) test accuracy) and CIFAR-100 (\(79.43\%\) test accuracy) demonstrated state-of-the-art performance in terms of final model accuracy. For ImageNet16-120, LEMONADE was also able to generate fairly competitive architectures (\(42.95\%\) test accuracy). LEMONADE is also very efficient in terms of model generation and training, demonstrating a notable reduction in network search/discovery time and associated energy consumption. Using GPT-4o12 as the backend LLM, LEMONADE was able to generate and train CIFAR-10 models in about 5.8 hours while consuming only about 1.20 kWh-PUE of energy. LEMONADE is also capable of prioritizing metrics beyond accuracy, which enables the creation of neural architectures tailored to different IoT/Edge requirements such as high-speed inferencing at low power. This is achieved by efficiently trading off model accuracy, as demonstrated by our experimental results across several datasets/applications. LEMONADE has generated novel neural architectures from scratch, thereby opening a new opportunity for search-space-agnostic neural architecture search research. We have also validated the framework while using Gemini-Pro as the LLM component. To summarize, we:

  1. Formalize and design a cost-effective and search-space agnostic neural architecture discovery framework (LEMONADE) leveraging LLMs.

  2. Formulate an expert system with associated rules and relevant metrics that is capable of driving a given LLM toward discovering different neural architectures.

  3. Implement LEMONADE as a highly configurable/efficient tool for immediate application and easy future extensions.

  4. Qualitatively and quantitatively evaluate LEMONADE using CIFAR-10, CIFAR-100, ImageNet16-120, EuroSAT, Malaria Parasite, and IMDb datasets for diverse settings and application requirements.

Background and motivation

Next, we will briefly describe relevant related works and discuss the motivations that drove the development of LEMONADE.

Neural architecture search

Methods of neural architecture search (NAS) are extensively used across various applications such as image processing13,14,15,16, signal processing17,18,19, object detection20,21, and natural language processing22,23. NAS involves identifying the best neural network for a given task through repeated trials, traditionally judged solely on final model accuracy. Early NAS techniques were mainly based on evolutionary algorithms (EA)24 and reinforcement learning (RL)25. Although these methods showed promising results in building quality NNs, they require substantial computing power and time. To address this issue, weight-reusing26 approaches were proposed that avoid the need to train each design from scratch, resulting in lower computation costs. One-shot approaches for NAS27 were also proposed, which involve training a large network called a SuperNet that incorporates every conceivable architecture within the search domain. Differentiable neural architecture search (DNAS)28 is another weight-reusing approach where all the SubNet parameters are optimized by gradient descent.

Most NAS methods utilize NAS datasets that contain a large list of potential neural architectures, from which the NAS method is expected to find the best architecture for the target application. One such dataset is NAS-Bench-10129, which contains 5 million distinct neural architectures and was designed for the CIFAR-10 dataset. The NAS-Bench-20130 dataset has 15625 cell layouts and is derived from a cell-based search technique (for the CIFAR-1031, CIFAR-10031, and ImageNet16-12032 datasets). In33, the authors proposed a NAS method named \(\beta\)-DARTS to address the weak generalization ability found in the DARTS method. They used NAS-Bench-201 to evaluate their framework. In another work34, the authors suggested \(\Lambda\)-DARTS as a solution for the structural flaws caused by the weight-sharing approach in DARTS. In a recent work35, the authors proposed GENIUS, in which an LLM is used to solve the NAS problem while utilizing a pre-defined search space and focusing solely on maximizing final model accuracy (no consideration is given to model search efficiency or inferencing speed).

Shortcomings of NAS

Most traditional NAS techniques rely on having access to a pre-defined search space of potential neural architectures, making it difficult to scale across different applications and use cases. Additionally, most NAS frameworks do not have the capability to allow the search process to consider parameters such as: (1) Inferencing speed; (2) Inferencing energy consumption; (3) NAS search and training efficiency.

Why LLM and expert system for neural discovery?

We hypothesize that a large language model (LLM) trained on a large volume of open-domain data will also have knowledge of different neural architectures. LLMs have demonstrated success in searching for NN architectures given a search space35,36,37,38. However, we wanted to go one step further and find out whether LLMs can generate novel NN architectures (discovery) without using a pre-defined search space. We also wanted to analyze whether: (1) the open-domain knowledge has provided these LLMs with insights into different metrics associated with a neural architecture, such as estimated training power consumption and inferencing speed; (2) these LLMs can follow automated instructions generated by an expert system for refining an NN. The abbreviations used in this study are listed in Table 1.

Table 1 Abbreviation.

Methodology

Neural discovery process

In Fig. 1 we show an overview of the LEMONADE framework where an expert system (ES) takes the task specification from the user (metrics) and generates commands (prompt) for the LLM using a set of rules for creating a neural architecture. The generated neural network (in the form of python code) from the LLM is then trained on the training dataset and subsequently evaluated on the validation set. The associated evaluation-based metrics are used by ES to generate the next LLM prompt that modifies the current neural network architecture for improving overall efficacy.

Fig. 1 LEMONADE framework: Expert system guided iterative and multi-parameter search for neural network discovery.

Algorithm 1 shows the overall procedure of LEMONADE. M is a set of user-defined metrics, TS contains the task specification (classification/regression, input/output shape, datasets, etc.), and TC is a user-defined terminating condition (in our case, a maximum number of iterations). In lines 2–5, all variables are initialized and the initial command is stored in the cmd variable. For example, the initial cmd might be something like: ‘Please suggest a pytorch image classification model with input shape of (3,32,32) and output of 100’.

Algorithm 1 LEMONADE neural network discovery/search.

In line 7, LLM_AGI (GPT-4o12 for this study) returns a response based on the command and subsequently, a model is created from the response.

From lines 10–12, the generated model is trained and evaluated with the training and validation datasets respectively and a set of metrics such as training energy expenditure (\(T_{NE}\)), training accuracy (\(T_{Acc}\)), validation energy expenditure (\(V_{NE}\)), validation accuracy (\(V_{Acc}\)), and validation set inferencing frames-per-second (NF) are calculated and stored in the Metrics dictionary. The Metrics dictionary is then passed to the NNGES (Fig. 2) to generate a set of instructions for the next round of LLM-based neural network generation.
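The iterative loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function `discover` and its callable parameters (`ask_llm`, `train_and_evaluate`, `make_prompt`, `score`) are hypothetical names, and the stand-ins used in the demo replace the real LLM call and the real training run with fixed values.

```python
from typing import Callable, Dict, Optional, Tuple


def discover(
    initial_cmd: str,
    ask_llm: Callable[[str], str],
    train_and_evaluate: Callable[[str], Dict[str, float]],
    make_prompt: Callable[[str, Dict[str, float]], str],
    score: Callable[[Dict[str, float]], float],
    max_iters: int = 30,
) -> Tuple[Optional[str], float]:
    """Iteratively query an LLM for model code and keep the best-scoring reply.

    All callables are hypothetical placeholders for the real components:
    the LLM backend, the train/validate step, the expert system that turns
    metrics into the next prompt, and the combined metric (CM).
    """
    best_code, best_score = None, float("-inf")
    cmd = initial_cmd
    for _ in range(max_iters):  # terminating condition: iteration budget
        code = ask_llm(cmd)                   # LLM proposes a model
        metrics = train_and_evaluate(code)    # train, then validate
        current = score(metrics)              # combined metric for this model
        if current > best_score:
            best_code, best_score = code, current
        cmd = make_prompt(code, metrics)      # expert-system feedback prompt
    return best_code, best_score


# Toy demo with made-up replies and scores (no real LLM or training involved).
fake_scores = {"net_v1": 1.2, "net_v2": 1.7, "net_v3": 1.5}
_replies = iter(fake_scores)
best_code, best_cm = discover(
    "Please suggest a pytorch image classification model "
    "with input shape of (3,32,32) and output of 100",
    ask_llm=lambda cmd: next(_replies),
    train_and_evaluate=lambda code: {"CM": fake_scores[code]},
    make_prompt=lambda code, m: f"Improve this model: {code}",
    score=lambda m: m["CM"],
    max_iters=3,
)
print(best_code, best_cm)
```

Passing the components in as callables keeps the loop independent of any particular LLM backend, which matches the paper's use of both GPT-4o and Gemini-Pro.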

Fig. 2 Flowchart of the neural network generation expert system (NNGES).

Any conflicts between the generated instructions are removed using Algorithm 2. To identify the best network architecture, we utilize the following combined model effectiveness metric (CM):

$$\begin{aligned} CM = W_A \cdot (T_{Acc} + V_{Acc}) + (W_F \cdot NF) - W_E \cdot (T_{NE} + V_{NE}) \end{aligned}$$
(1)

where \(W_A\) is the weight for accuracy; \(T_{Acc}\) is the training accuracy; \(V_{Acc}\) is the validation accuracy; \(W_F\) is the weight for FPS; NF is the normalized FPS; \(W_E\) is the weight for energy; \(T_{NE}\) is the normalized training energy; and \(V_{NE}\) is the normalized validation energy. All parameters (user-defined and evaluation-based) are described in Table 2.
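As a small worked example of Eq. 1, the sketch below computes CM for two candidate models and picks the better one under two weightings. The candidate names and all metric values are invented for illustration, and the helper names `combined_metric` and `best_candidate` are assumptions rather than part of the LEMONADE tool.

```python
def combined_metric(t_acc, v_acc, nf, t_ne, v_ne, w_a=1.0, w_f=0.0, w_e=0.0):
    """Eq. 1: CM = W_A*(T_Acc + V_Acc) + W_F*NF - W_E*(T_NE + V_NE)."""
    return w_a * (t_acc + v_acc) + w_f * nf - w_e * (t_ne + v_ne)


# Hypothetical candidates; energies and FPS are assumed to be normalized.
candidates = {
    "model_a": dict(t_acc=0.98, v_acc=0.93, nf=0.20, t_ne=0.80, v_ne=0.70),
    "model_b": dict(t_acc=0.95, v_acc=0.92, nf=0.90, t_ne=0.10, v_ne=0.10),
}


def best_candidate(models, **weights):
    """Return the name of the model with the highest CM for the given weights."""
    return max(models, key=lambda name: combined_metric(**models[name], **weights))


print(best_candidate(candidates, w_a=1.0))           # accuracy only
print(best_candidate(candidates, w_a=0.5, w_e=0.5))  # accuracy and energy
```

With accuracy-only weights the more accurate but energy-hungry `model_a` wins; once energy carries half the weight, the slightly less accurate `model_b` scores higher, which is the trade-off the weights are meant to expose.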

Algorithm 2 LEMONADE conflict resolution.

Table 2 User-defined and evaluation based parameters.

Expert system for instruction set generation

We have developed an expert system for guiding the NAS process because: (1) the backend LLM can exhibit random behavior if proper bounds/rules are not set; (2) the backend LLM can start hallucinating during a lengthy neural search process, and an expert system can keep it on track; (3) LLMs are not always aware of how to achieve a certain effect in a neural network and require additional input from an expert system to succeed.

The expert system utilizes a rulebook (Table 3) which is based on strategies that are commonly used by data scientists for constructing effective neural networks. In future, these rules can potentially be learned based on historical neural search data. The expert system drives the LLM towards constructing an optimal neural network for a given set of user-defined parameters by generating a series of instructions based on the Metrics calculated after each search iteration.

Table 3 Rule book for the LEMONADE expert system.

Figure 2 illustrates the flowchart of the Neural network generation expert system (NNGES) for instruction generation. The system takes M, a set of user-defined and evaluation metrics, as its input. It begins by initializing an empty dictionary for storing instructions. Next, the system compares the input metrics (M) with the user-defined metrics (see Table 2) and assigns different weights to each rule (as defined in Table 3) to construct an instruction dictionary. For instance, if the training accuracy (\(M.T_{ACC}\)) falls below the user-defined threshold (\(M.TT_{ACC}\)), the instruction assigns priority weights (\(PT_{ACC}\)) to Ins[’ACL’], Ins[’AMK’], Ins[’ADL’], and Ins[’ASC’]. This implies that ACL, AMK, ADL, and ASC are expected to enhance training accuracy in the next iteration.

The system sequentially evaluates all conditions and stores the corresponding weights in the instruction dictionary, which will be further refined in the next section.
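The rule evaluation described above can be sketched as a table-driven function. Only the first rule (low training accuracy boosts ACL, AMK, ADL, and ASC by the priority weight) comes from the text; the second rule is a made-up placeholder, since the contents of Table 3 are not reproduced here. The key names and the choice to accumulate weights when several rules touch the same instruction are also assumptions.

```python
from collections import defaultdict

# Each rule: (measured metric, threshold, priority weight, instruction codes).
# A rule fires when the measured metric is below its threshold.
RULES = [
    # From the text: training accuracy below target -> ACL, AMK, ADL, ASC.
    ("T_ACC", "TT_ACC", "PT_ACC", ["ACL", "AMK", "ADL", "ASC"]),
    # Hypothetical rule, for illustration only: FPS below target -> RDL.
    ("NF", "TF", "PF", ["RDL"]),
]


def generate_instructions(m):
    """Build the instruction dictionary from measured and user-defined metrics."""
    ins = defaultdict(float)
    for metric, threshold, priority, codes in RULES:
        if m[metric] < m[threshold]:
            for code in codes:
                ins[code] += m[priority]  # assumption: weights accumulate
    return dict(ins)


metrics = {
    "T_ACC": 0.80, "TT_ACC": 0.99, "PT_ACC": 0.5,  # accuracy below target
    "NF": 20000, "TF": 14000, "PF": 0.4,           # FPS already above target
}
print(generate_instructions(metrics))
```

Keeping the rules in a data table rather than in nested conditionals makes the rulebook easy to extend or replace without touching the evaluation logic.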

Conflict resolution

Instruction generation can produce conflicts (see Table 3) due to the nature of the instructions themselves. For example, instructions such as adding a dense layer (ADL) and removing a dense layer (RDL) may both be assigned positive weights by the NNGES algorithm (e.g., 0.5 for ADL and 0.4 for RDL). However, since these instructions have opposite effects, both cannot be passed to the LLM simultaneously. To resolve this, Algorithm 2 prioritizes and selects the instructions with higher weights. Algorithm 2 shows the overall procedure for eliminating conflicts. In line 3, we sort Ins in descending order by value. In lines 7–9, the algorithm identifies the instructions that have values larger than 0 and stores them in \(Refined\_Ins\). In lines 10–15, the algorithm compares the current instruction with the remaining instructions to identify any conflicts. If a conflict is detected, the algorithm uses the computed weights to decide which instruction is more appropriate for optimizing the NAS, and sets the weight of the less appropriate one to zero.
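A minimal sketch of this conflict-resolution step is shown below. It follows the description (sort by weight, keep positive entries, drop the weaker member of a conflicting pair) but is not a transcription of Algorithm 2. The `CONFLICTS` set contains only the ADL/RDL pair named in the text; the full conflict list from Table 3 is assumed to be supplied by the user.

```python
# Pairs of mutually exclusive instructions. Only ADL/RDL is taken from the
# text; any further pairs would have to come from the rulebook.
CONFLICTS = {frozenset(("ADL", "RDL"))}


def in_conflict(a, b):
    """True if instructions a and b cannot be issued together."""
    return frozenset((a, b)) in CONFLICTS


def resolve_conflicts(ins):
    """Keep positive-weight instructions, preferring the higher weight on conflict."""
    ordered = sorted(ins.items(), key=lambda kv: kv[1], reverse=True)
    refined = {}
    for code, weight in ordered:
        if weight <= 0:
            continue  # instruction was never triggered
        if any(in_conflict(code, kept) for kept in refined):
            continue  # a higher-weight conflicting instruction is already kept
        refined[code] = weight
    return refined


print(resolve_conflicts({"ADL": 0.5, "RDL": 0.4, "ACL": 0.5, "ASC": 0.0}))
```

Because Python's sort is stable, two conflicting instructions with equal weights are resolved in favor of the one that appears first in the input dictionary; a real implementation may want an explicit tie-breaking rule.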

Dataset description and preparation

In this study, we have considered five publicly available image datasets: CIFAR-1031, CIFAR-10031, ImageNet16-12032, EuroSAT39, and Malaria Parasite40, and one text dataset, IMDb41, to validate LEMONADE. Both the CIFAR-10 and the CIFAR-100 datasets contain 60k images of dimensions \(32\times 32\) pixels, where 50k and 10k samples are designated for training and testing purposes, respectively. The CIFAR-10 dataset has 10 output classes whereas the CIFAR-100 dataset has 100 output classes. ImageNet16-120 has 151k training and 6k testing samples with a resolution of \(16\times 16\) distributed across 120 classes. The Malaria Parasite dataset has two classes, (i) parasitized cells and (ii) uninfected cells, with 27558 data samples in total. For our experiments with the Malaria Parasite dataset, we resized the data to \(32\times 32\) resolution and then split it in an 8:2 ratio into training and validation sets. The EuroSAT dataset contains 27000 data samples across 10 classes: (i) AnnualCrop, (ii) Forest, (iii) HerbaceousVegetation, (iv) Highway, (v) Industrial, (vi) Pasture, (vii) PermanentCrop, (viii) Residential, (ix) River, and (x) SeaLake. We also resized this dataset to \(32\times 32\) resolution and split it in an 8:2 ratio into training and validation sets. Finally, the IMDb dataset contains 50k samples of text data with corresponding sentiment labels (positive and negative). We first split the dataset in an 8:2 ratio for training and validation and then pre-processed the text by removing URLs, special characters, hashtags, and mentions. Subsequently, we tokenized the text and transformed it into sequences of fixed length.
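The text pre-processing steps for IMDb can be sketched with the Python standard library alone. The exact regular expressions, the lowercasing step, the vocabulary size, the sequence length, and the reserved ids (0 for padding, 1 for unknown words) are all assumptions made for this illustration; the paper does not specify them.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs
TAG_RE = re.compile(r"[@#]\w+")                # mentions and hashtags
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")        # remaining special characters


def clean(text):
    """Lowercase, strip URLs, hashtags, mentions and special characters, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return text.split()


def build_vocab(token_lists, max_words=10000):
    """Map the most frequent words to integer ids, reserving 0 and 1."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(tokens, vocab, max_len=8):
    """Convert tokens to ids, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


tokens = clean("Loved it!! See https://example.com #great @bob")
vocab = build_vocab([tokens])
print(tokens)
print(encode(tokens, vocab, max_len=5))
```

In practice the vocabulary would be built from the training split only, so that the validation split does not leak into the tokenizer.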

Experimental analysis and results

All experiments are run on a single NVIDIA A100 GPU to allow a fair comparison of metrics such as power consumption and runtime. We utilize a Python library named PyJoules42 to measure the energy consumption of both the CPU and GPU during network search/training. In this section, we discuss the search strategy, the training process, and the experimental results. We use the following user-defined metrics for all experiments (unless specifically stated otherwise): \(\{PT_{Acc} = 0, PV_{Acc} = 1, TT_{Acc} = 0.99, TV_{Acc} = 0.99, PF = 0, TF = 14000, PT_E = 0, TT_E = 1\times 10^{-3}, PV_E = 0, TV_E = 1\times 10^{-5}, OT = 0.10, UT = 0.05\}\).

Intermediate and final model training process

During the neural discovery process, LEMONADE was executed for 30 iterations (the terminating condition for Algorithm 1). To ascertain the quality of the searched network after each iteration, the searched networks are trained for 50 epochs with a batch size of 128, utilizing the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.025 and a weight decay of \(3\times 10^{-4}\). To improve convergence, a cosine annealing learning rate schedule was employed, which modulates the learning rate according to a cosine function. Furthermore, to mitigate overfitting, data augmentation techniques such as random rotation by 10 degrees, random horizontal flipping, and random cropping of size 16 × 16 were integrated into the training regimen. The trained model is then evaluated using the validation data (from the corresponding dataset) to obtain the combined metric (CM) as shown in Algorithm 1. For the NLP task (IMDb dataset), we used the Adam optimizer with an initial learning rate of 0.001 and a batch size of 128.
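For readers unfamiliar with cosine annealing, the sketch below evaluates the standard closed-form schedule \(\eta_t = \eta_{min} + \tfrac{1}{2}(\eta_0 - \eta_{min})(1 + \cos(\pi t / T))\) for the search-phase settings quoted above (initial learning rate 0.025, 50 epochs). The minimum learning rate of 0 and the absence of warm restarts are assumptions, since the text does not state them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr=0.025, min_lr=0.0):
    """Learning rate at a given epoch under a single-cycle cosine schedule."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)


# Learning rate at the start, middle, and end of a 50-epoch run.
for epoch in (0, 25, 50):
    print(epoch, round(cosine_annealing_lr(epoch, total_epochs=50), 6))
```

The rate starts at the base value, falls to half of it at the midpoint, and reaches the minimum at the final epoch, so most of the large steps happen early in training.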

Upon completion of the NAS procedure by LEMONADE, the final model is trained on the entire dataset for 600 epochs (200 epochs for IMDb). A batch size of 256 is employed, utilizing the SGD optimizer (Adam for IMDb), with the initial learning rate set to 0.025 (0.001 for IMDb) and a weight decay of \(3\times 10^{-4}\). The learning rate is adjusted according to the annealing learning rate schedule. It is important to clarify that the cost of this final training phase is not inherently tied to the NAS process. Notably, larger datasets generally demand an increased number of training epochs. This is a challenge faced by all NAS frameworks and is not specific to our methodology. In this experimental setup, GPT-4o was configured with a temperature parameter of 0.5.

Comparing LEMONADE with state-of-the-art NAS frameworks

Table 4 provides a comparative analysis of several state-of-the-art (SOTA) NAS methods alongside LEMONADE for the CIFAR-10, CIFAR-100, and ImageNet16-120 datasets. For these experiments we set \(W_A = 1\), \(W_E = 0\), and \(W_F = 0\) because all the SOTA NAS frameworks we compare against prioritize only accuracy. For CIFAR-10, LEMONADE generated a neural network with 95.54% test accuracy, beating all SOTA frameworks. LEMONADE also beat the SOTA NAS frameworks for CIFAR-100 with a test accuracy of 79.43%. For ImageNet16-120, LEMONADE produced a neural network with almost SOTA performance. For the CIFAR-10 dataset, the NAS method (besides LEMONADE) that achieved the highest accuracy was NSGANet47, but it takes 648 GPU hours to perform the search compared to the 5.8 GPU hours that LEMONADE takes (over 111x faster). For CIFAR-100, EIGEN45 produced the best-performing neural network (besides LEMONADE), but that search took 120 GPU hours compared to the 7.12 GPU hours taken by LEMONADE. Hence, LEMONADE can not only discover highly accurate models, it can also perform the search faster than many SOTA NAS frameworks. Figure 3 shows the training/validation accuracy and loss for five different datasets during the final model training process (over 600 epochs).

Table 4 Performance metrics for various methods across datasets.

Fig. 3 Training process: loss and accuracy graph.

To better capture the efficacy of the NAS frameworks, we report a joint metric that combines both cost (search time or energy consumed during search) and final model accuracy into one number that is weighted based on the user’s need. This Goodness metric (GM) is computed as shown in Eq. 2, where \(X_A\) and \(X_E\) are the weights of accuracy and energy, respectively. \(E_{Norm}\) is the normalized energy obtained from Eq. 3, where E, \(E_{min}\), and \(E_{max}\) are the energy consumed by a given NAS framework, the minimum energy consumed for the same task across all NAS frameworks, and the maximum energy consumed for the same task across all NAS frameworks, respectively (per Table 5). Kilowatt-hour power usage effectiveness (kWh-PUE) serves as a metric for evaluating the energy efficiency of an Edge AI system by comparing the overall energy consumed to that used specifically for AI inference/search/training59. We calculate the energy \(p_t\) (in kWh-PUE) using Eq. 4, as described in59. In this equation, \(p_c\), \(p_r\), and \(p_g\) represent the power draw (in watts) of the CPU, RAM, and GPU, respectively, t is the total run time in hours, and g is the number of GPUs. We obtain the run time for each NAS method from Table 4 and assume the maximum GPU power draw (in watts) of the NVIDIA A100.

$$\begin{aligned} GM= & X_A\times Accuracy + (1-E_{Norm})\times X_E \end{aligned}$$
(2)
$$\begin{aligned} E_{Norm}= & \frac{E-E_{min}}{E_{max}-E_{min}} \end{aligned}$$
(3)
$$\begin{aligned} p_t= & \frac{1.58 t\left( p_c+p_r+g p_g\right) }{1000} \end{aligned}$$
(4)
Table 5 Understanding the effectiveness of LEMONADE and other SOTA NAS frameworks utilizing the GM metric.

We observe a similar trend in Table 5 when comparing LEMONADE with other NAS frameworks. LEMONADE beats state-of-the-art NAS frameworks in terms of overall performance (GM) for CIFAR-10 and CIFAR-100 across different weight values. LEMONADE performs almost at the SOTA level for ImageNet16-120 as well.
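Eqs. 2–4 translate directly into code. The sketch below is a plain restatement of those three formulas; the function names are assumptions, the wattages and energies in the example are invented and do not correspond to any entry in Table 4 or Table 5, and accuracy is assumed to be expressed as a fraction in [0, 1].

```python
def energy_kwh_pue(hours, p_cpu, p_ram, p_gpu, n_gpus=1, pue=1.58):
    """Eq. 4: energy in kWh-PUE from run time (hours) and power draw (watts)."""
    return pue * hours * (p_cpu + p_ram + n_gpus * p_gpu) / 1000


def normalize_energy(e, e_min, e_max):
    """Eq. 3: min-max normalization of energy across frameworks.

    Requires e_max > e_min; equal values would divide by zero.
    """
    return (e - e_min) / (e_max - e_min)


def goodness(accuracy, e_norm, x_a, x_e):
    """Eq. 2: GM = X_A * Accuracy + (1 - E_Norm) * X_E."""
    return x_a * accuracy + (1 - e_norm) * x_e


# Hypothetical run: 2 hours on one GPU with made-up power figures.
energy = energy_kwh_pue(hours=2, p_cpu=50, p_ram=10, p_gpu=250)
print(round(energy, 4))

# A framework that uses the least energy of all gets E_Norm = 0.
e_norm = normalize_energy(1.0, e_min=1.0, e_max=5.0)
print(goodness(accuracy=0.95, e_norm=e_norm, x_a=0.5, x_e=0.5))
```

Because \(E_{Norm}\) is 0 for the most frugal framework and 1 for the most expensive one, the term \((1 - E_{Norm})\) rewards low energy use, and the two weights decide how much that reward counts against accuracy.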

Utilizing LEMONADE to construct neural networks for diverse datasets and application needs

It is evident that LEMONADE can efficiently (in terms of both cost and accuracy) generate neural networks with good accuracy for standard datasets when priority is given only to final model accuracy. However, LEMONADE was designed to serve a more practical goal: automating the AI integration process for diverse applications with varying needs. LEMONADE is designed to help non-AI-experts build solutions for their respective tasks with varying needs such as high frames-per-second (FPS) and low-energy inferencing for battery-operated edge devices. In Table 6, we demonstrate how LEMONADE can generate different neural networks to serve different application priorities (settings). For example, when we ask the LEMONADE system to give 100% priority (\(W_A = 1\)) to the final model accuracy (Malaria dataset), we obtain a neural network with an accuracy of 97.12% that consumes about \(2.5\times 10^{-9}\) kWh-PUE of energy for inferencing one image with an FPS of 157. For the same dataset (Malaria), if we make the LEMONADE system design a neural network with equal priority given to inferencing energy and accuracy (\(W_A = 0.5, W_E = 0.5\)), we obtain a model that is 450x more energy efficient with about 1% lower accuracy.

Table 6 Utilizing LEMONADE for building neural networks for diverse applications with different priorities and requirements.

To assess the generalizability of LEMONADE beyond image data, we also report text classification results on the IMDb dataset across three different settings in Table 6. When prioritizing accuracy at 100%, LEMONADE generates a model achieving 90.51% accuracy, 57 FPS, and an energy consumption of \(8.42\times 10^{-9}\) kWh-PUE per inference. In contrast, with a priority distribution of 70% for accuracy, 10% for energy efficiency, and 20% for FPS, the generated model attains 89.03% accuracy, 473 FPS, and consumes only \(6.19\times 10^{-12}\) kWh-PUE.

Table 7 shows a comparative analysis of different SOTA NAS frameworks and LEMONADE (using both GPT-4o and Gemini-Pro as the backend). As comparison points, we consider several edge AI metrics: test accuracy, required training time in hours, inference speed in milliseconds (ms), inference power in milliwatts (mW), and model size in MB. In Table 7, the setting denotes the priority configuration, where 1 refers to full priority on accuracy, 2 refers to 50% priority on accuracy and 50% priority on energy, and 3 refers to 70% priority on accuracy, 10% on energy, and 20% on FPS. For the Malaria dataset, DARTS shows 96.61% accuracy with 26.1 ms inference speed and draws 799.58 mW of power for inference. NasNet and AmoebaNet show 96.88% and 97.09% accuracy with 25.07 ms and 22.29 ms inference speed, and 744.65 mW and 745.66 mW inference power, respectively. LEMONADE with GPT-4o outperforms these three NAS frameworks in accuracy, training time, and inference speed. The LEMONADE model size is 37.53 MB, which is larger than that of the SOTA NAS models because full priority is given to accuracy. For setting 2 (50% priority to both accuracy and energy consumption), LEMONADE generates a lightweight model with decent accuracy and faster inference speed. This is also noticeable for LEMONADE with the Gemini-Pro backend. For the EuroSAT dataset, the NasNet-generated model shows almost the same accuracy as LEMONADE (GPT-4o), but LEMONADE is more efficient in terms of training time, inference time, and inference power.

Table 7 In-depth analysis of LEMONADE with three different NAS with various priority settings: {\(1 \rightarrow (W_A = 1); \, 2 \rightarrow (W_A=0.5, W_E=0.5); \, 3 \rightarrow (W_A=0.7, W_E=0.1, W_F=0.2)\)} with consideration of edge AI metrics.

In-depth analysis and limitations of LLM

The LEMONADE framework is also equipped with a post-processing module that ensures that the models received from the LLM are indeed valid. If an invalid model is received, LEMONADE invokes an LLM command to fix the identified issue. We illustrate such an example in Fig. 4.
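A validate-then-repair step of this kind can be sketched as follows. This is a simplified illustration, not the framework's post-processing module: it only checks that the returned code parses, whereas catching a layer shape mismatch would additionally require instantiating the model and running a forward pass on a dummy input. The function names, the reply format, and the wording of the repair prompt are all assumptions.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example tidy
FENCE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply):
    """Return the first fenced code block in an LLM reply, or the reply itself."""
    match = FENCE_RE.search(reply)
    return match.group(1) if match else reply


def check_model_code(code):
    """Return None if the code parses, otherwise a short error description."""
    try:
        compile(code, "<llm_reply>", "exec")
    except SyntaxError as err:
        return f"{type(err).__name__}: {err.msg} (line {err.lineno})"
    return None


def repair_prompt(code, error):
    """Build a follow-up command asking the LLM to fix the reported problem."""
    return (
        "The following model code failed validation with this error:\n"
        f"{error}\n"
        "Please return a corrected version of the full model.\n\n"
        f"{code}"
    )


reply = "Here is the model:\n" + FENCE + "python\ndef build(:\n    pass\n" + FENCE
code = extract_code(reply)
error = check_model_code(code)
if error is not None:
    print(repair_prompt(code, error))
```

Feeding the concrete error message back to the LLM, rather than a generic retry request, gives it the information it needs to correct the specific fault.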

Fig. 4 Qualitative analysis of the effectiveness of LEMONADE with GPT-4o backend.

We also showcase, with a few examples, how the feedback mechanism of LEMONADE’s expert system (ES) guides the LLM towards changing the generated neural network.

  • Case 1: The initially generated model was a simple CNN with two convolutional layers with ReLU activations and Batch Normalization (BN), max pooling, and a single Dense layer. After receiving the feedback (ACL, ASC, ADL), GPT-4o generated a model with four blocks of convolutional layers with ReLU activation and BN, and two Dense layers. It also added two skip connections based on the feedback.

  • Case 2: This case shows another behavior of GPT-4o, where it adds more skip connections by adding some convolutional layers.

  • Case 3: In some situations, skip connections make the network architecture more energy intensive to train/run. In this case, GPT-4o reduces the skip connections based on the feedback from the ES.

  • Case 4: GPT-4o did not always perform well. We carefully checked the responses from GPT-4o and saw that in approximately 10% of the cases it failed to follow the provided instructions. In one case, GPT-4o generated an invalid model that failed compilation due to a layer shape mismatch.

Limitations of using the GPT-4o/Gemini-Pro for NAS

GPT-4o and Gemini-Pro can generate neural networks based on a prompt, but they have several limitations that necessitate the use of an additional automated guidance system (such as LEMONADE). We discuss some of these limitations below:

  • Prompt dependency: The effectiveness of a network architecture is significantly influenced by the quality of the prompt. A well-defined prompt enables GPT-4o or Gemini-Pro to produce network architectures that align with the specified requirements. Conversely, an ambiguous prompt may lead to outcomes that do not fulfill the objectives.

  • Inadequate validation capacity: Although the LLMs can propose a network structure, they lack the capability to independently train and validate the proposed architecture on a dataset.

  • Deficiency in numerical optimization: Current LLMs cannot directly calculate the ideal hyperparameters of an architecture for the given task specifications.

Fine tuning GPT-4o/Gemini-Pro for the NAS

The following techniques have been used to improve the performance of the LLMs to generate good quality neural network architecture:

  • Prompt engineering: Fine-tuning the prompt by specifying the constraints and validation outcomes helped us generate good-quality architecture.

  • Integrate expert system: We have proposed an expert system that helps generate high quality architectures by providing structured feedback to the LLM backend (GPT-4o or Gemini-Pro).

  • Integrate external validation: Since the LLMs lack the capability to train the proposed architectures, we have integrated a system for assessing various performance metrics such as accuracy, power, inference speed, and model size. These metrics were subsequently utilized in the next iteration’s prompt, emulating a reinforcement approach.

Conclusion

In this article, we have formalized, implemented, and evaluated a multi-parameter neural discovery framework, LEMONADE, that can efficiently generate novel neural networks for diverse requirements without leveraging any pre-defined search space. LEMONADE can effectively trade off final model accuracy against other edge AI parameters such as FPS and inferencing energy. The proposed framework operates with the help of a set of customizable metrics and a rules-driven expert system. The expert system generates instructions for a backend large language model (LLM) such as GPT-4o or Gemini-Pro to iteratively produce novel neural networks. LEMONADE was able to successfully create state-of-the-art neural networks that are optimized for accuracy, FPS, and power consumption across different applications/requirements and datasets (CIFAR-10, CIFAR-100, ImageNet16-120, Malaria, EuroSAT, IMDb). This work paves the way toward a new paradigm of AI-guided AI design. Future work will investigate efficient model pruning and quantization using AI, and will also explore the use of a customized LLM that is specifically trained to generate AI models for a wider range of user-defined applications.