Table 5 Key hyperparameter values and rationale.

From: Reinforcement learning-driven dynamic optimization strategy for parametric design of 3D models

| Hyperparameter | Symbol | Value(s) | Rationale |
| --- | --- | --- | --- |
| HLA Update Interval | K | 10 | Provides temporal abstraction: the low-level agent executes a sequence of 10 fine-tuning actions for each high-level strategic goal, balancing strategic stability with adaptive control. |
| Reward Weights | λ₁ (weight), λ₂ (stress), λ₃ (manufacturability) | 0.4, 0.4, 0.2 | Prioritizes safety and lightweighting as primary objectives while keeping manufacturability as a critical secondary constraint; the weights are normalized to sum to 1. |
| Encoding Sensitivity | γⱼ | 1.0 (uniform) | A default scaling factor that preserves the natural output range of the tanh function (−1 to 1), ensuring stable gradient flow during backpropagation. Can be tuned per parameter if sensitivity analysis dictates. |
| Discount Factor | γ | 0.99 | Encourages long-horizon planning by making the agent value future rewards highly, which is essential for complex design tasks where the consequences of early actions unfold over time. |
| PPO Clipping Range | ε | 0.2 | A standard value that prevents destructively large policy updates, ensuring stable and monotonic policy improvement throughout training. |
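The reward-weight row can be illustrated with a minimal sketch. Only the weight values (0.4, 0.4, 0.2) come from the table; the sub-reward names and function signature are illustrative assumptions, since the paper's exact reward terms are not reproduced here.

```python
# Illustrative composite reward using the lambda weights from Table 5.
# The sub-reward names (weight_term, stress_term, mfg_term) are assumed,
# not taken from the paper; each is presumed normalized to [0, 1].

LAMBDA_1 = 0.4  # lightweighting objective
LAMBDA_2 = 0.4  # structural-stress (safety) objective
LAMBDA_3 = 0.2  # manufacturability constraint

def composite_reward(weight_term: float, stress_term: float, mfg_term: float) -> float:
    """Weighted sum of the three sub-rewards; weights are normalized to sum to 1."""
    return LAMBDA_1 * weight_term + LAMBDA_2 * stress_term + LAMBDA_3 * mfg_term

# Because the weights sum to 1, a design scoring 1.0 on every
# sub-objective receives the maximum composite reward of 1.0.
assert abs(LAMBDA_1 + LAMBDA_2 + LAMBDA_3 - 1.0) < 1e-9
```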
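The encoding-sensitivity row describes a per-parameter tanh scaling. A minimal sketch of that mapping, assuming γⱼ multiplies the raw parameter before the tanh (the function name `encode_parameter` is hypothetical):

```python
import math

def encode_parameter(x: float, gamma_j: float = 1.0) -> float:
    """Map a raw design parameter to (-1, 1) via tanh.

    With the default gamma_j = 1.0 (as in Table 5), the natural tanh
    range is preserved; larger gamma_j sharpens sensitivity near zero,
    which is why the paper notes it can be tuned per parameter.
    """
    return math.tanh(gamma_j * x)
```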
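The ε = 0.2 clipping range refers to the standard PPO clipped surrogate objective; a per-sample sketch of how that clipping bounds the policy update (probability `ratio` and `advantage` would come from the rollout, and are assumed inputs here):

```python
def ppo_clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    Clipping the probability ratio to [1-eps, 1+eps] removes the incentive
    to move the policy far from the old one in a single update, which is
    the 'stable, monotonic improvement' rationale in Table 5.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With ε = 0.2, a ratio of 1.5 on a positive advantage is capped at 1.2, so the update gains nothing from pushing the ratio beyond the clip boundary.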