Fig. 6: Human-aware loss functions (HALOs) from PPO to present.
From: Current and future state of evaluation of large language models for medical summarization tasks

The development timeline for HALOs from the advent of Proximal Policy Optimization (PPO) in 2017 through 2024. Each HALO is connected to its precursor (either DPO or PPO) by a dotted line. If a HALO has an algorithmic basis in reinforcement learning, it is presented as white text on a solid color background. If a HALO has an algorithmic basis that is reinforcement-learning free, it is presented as colored text on a white background. Each color, whether of text or background, corresponds to the data requirements for that HALO. Blue corresponds to HALOs that use only prompt/response pair data. Orange corresponds to HALOs that use response preference pairs in addition to the prompt. Finally, green corresponds to HALOs that use binary judgment data in addition to the prompt/response pair. The figure includes PPO Proximal Policy Optimization53, DPO Direct Preference Optimization54, RSO Statistical Rejection Sampling108, IPO Identity Preference Optimization109, cDPO Conservative DPO110, KTO Kahneman Tversky Optimization61, JPO Joint Preference Optimization59, ORPO Odds Ratio Preference Optimization111, rDPO Robust DPO112, BCO Binary Classifier Optimization113, DNO Direct Nash Optimization62, TR-DPO Trust Region DPO114, CPO Contrastive Preference Optimization115, SPPO Self-Play Preference Optimization116, PAL Pluralistic Alignment Framework62, EXO Efficient Exact Optimization117, AOT Alignment via Optimal Transport118, RPO Iterative Reasoning Preference Optimization119, NCA Noise Contrastive Alignment120, RTO Reinforced Token Optimization121, SimPO Simple Preference Optimization60.
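As a rough illustration of the three color-coded data requirements in the figure, the sketch below shows one plausible way to represent each format. The dataclass and field names are hypothetical and not taken from any of the cited papers.

```python
from dataclasses import dataclass


@dataclass
class PromptResponse:
    """Blue: prompt/response pair data only."""
    prompt: str
    response: str


@dataclass
class PreferencePair:
    """Orange: a prompt plus a preferred and a dispreferred response,
    as consumed by DPO-style objectives."""
    prompt: str
    chosen: str
    rejected: str


@dataclass
class BinaryJudgment:
    """Green: a prompt/response pair plus a standalone binary
    desirability label, as consumed by KTO-style objectives."""
    prompt: str
    response: str
    desirable: bool
```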