<aside> 📝

Paper

</aside>

<aside> 🔗

Github Repo

</aside>

👋🏻 Hi there! This is the homepage of our survey paper “Towards a unified view of preference learning for LLMs: A survey”. On this page, we organize the content of the survey and introduce representative existing work on preference learning.

If you find this page valuable for your research, you are welcome to star ⭐ our GitHub repo or cite our paper! Likewise, if you think any part of this page can be improved, please let us know by email.

🚀 1. The motivation of our paper

Traditional categorization perspectives tend to split existing methods into reinforcement learning (RL)-based methods, such as RLHF, which requires a reward model for online RL, and supervised learning (SL)-based methods, such as Direct Preference Optimization (DPO), which directly performs preference optimization in an offline setting. However, this split can unconsciously create a barrier between the two groups of work and hinder researchers from understanding the common core of preference alignment. Therefore, we strive to establish a unified perspective covering both sides and introduce an innovative classification framework.

🔎 2. The unified view of preference learning for LLMs

An overview of preference learning: to align an LLM with human preferences, we prepare preference data and the corresponding feedback obtained from an environment that reflects human preferences. By feeding the model, data, and feedback into a specific algorithm, we obtain an aligned LLM.

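To make this overview concrete, the following minimal sketch (the class and field names are ours, not an API from the paper or repo) lists the four ingredients and how they combine:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferenceLearningSetup:
    """Illustrative container for the ingredients shown in the overview."""
    model: object                          # the LLM to be aligned
    data: List[Tuple[str, str]]            # preference data, e.g. (prompt, response) pairs
    feedback: Callable[[str, str], float]  # preference signal obtained from the environment
    algorithm: Callable                    # a specific alignment algorithm (e.g. RLHF or DPO)

    def align(self):
        # Feed the model, data, and feedback into the algorithm to obtain an aligned LLM.
        return self.algorithm(self.model, self.data, self.feedback)
```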

Inspired by recent works, we survey existing work from a unified perspective along the following two dimensions:

Following Shao et al. (2024), the gradient with respect to the parameters $\theta$ for a given training method can be written as:

$$
\nabla_{\theta} = \mathbb{E}_{(q,o) \sim \mathcal{D}}\left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \delta_{\mathcal{A}}(r, q, o, t)\, \nabla_{\theta}\log \pi_{\theta}(o_t \mid q, o_{<t})\right],
$$

where $\mathcal{D}$ denotes the data source containing the input question $q$ and the output $o$, $r$ denotes the feedback signal (from the preference oracle), $\mathcal{A}$ denotes the algorithm, and $\delta$ denotes the gradient coefficient, which directly determines the direction and step size of the preference optimization. The gradient coefficient is jointly determined by the specific algorithm, the data, and the corresponding feedback.
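As a concrete illustration of this view, below is a minimal PyTorch sketch (the function name and the example coefficient values are ours, purely for illustration) showing that once the per-token gradient coefficient $\delta$ is fixed, the parameter update takes the same form no matter which algorithm produced it:

```python
import torch

def unified_alignment_loss(logprobs, delta):
    """Surrogate loss whose gradient matches the unified form:
    E[(1/|o|) * sum_t delta_t * grad log pi_theta(o_t | q, o_<t)]."""
    # delta is treated as a constant: no gradient flows through it.
    return -(delta.detach() * logprobs).mean()

# Hypothetical coefficient choices for a single sampled response of 10 tokens:
delta_sft = torch.ones(10)          # SFT-style: delta_t = 1 on a chosen response
delta_rl = torch.full((10,), 0.7)   # REINFORCE-style: delta_t = response-level advantage

logprobs = torch.randn(10, requires_grad=True)     # placeholder for log pi_theta(o_t | q, o_<t)
loss = unified_alignment_loss(logprobs, delta_rl)  # swap in delta_sft for the SFT-style update
loss.backward()                                    # each logprob receives gradient -delta_t / |o|
```

Real algorithms differ only in how they compute $\delta$ from the data and the feedback.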

In the context of alignment, online learning refers to settings where the preference oracle $r$ or its approximator $\hat{r}$ can be queried during training, i.e., feedback on responses sampled from the current actor model can be obtained on the fly. If the feedback signal cannot be obtained in real time, the setting is considered offline learning. From the traditional perspective, RL-based methods are flexible with respect to online/offline settings, while SL-based methods are typically offline. However, given the first point, where we unified RL-based and SL-based approaches, it follows that SL-based methods can also be applied in an online setting, as demonstrated by recent work (Guo et al., 2024). Therefore, unlike the categorization in other survey papers, we use neither online/offline nor RL/SL as the criterion for classifying algorithms. Instead, we decouple the algorithm from the online/offline setting.
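To make this decoupling concrete, here is a minimal Python sketch (the function and argument names are ours, purely illustrative): the update rule, i.e., the algorithm, stays fixed, and only the source of the feedback determines whether training is online or offline.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

def train(policy_update: Callable, sample: Callable, prompts: Iterable[str],
          feedback_fn: Optional[Callable] = None,
          offline_data: Optional[Dict[str, Tuple[str, float]]] = None):
    """The algorithm (`policy_update`) is decoupled from the online/offline setting."""
    for q in prompts:
        if feedback_fn is not None:
            o = sample(q)            # online: sample from the current actor model ...
            r = feedback_fn(q, o)    # ... and query r (or r_hat) on the fly
        else:
            o, r = offline_data[q]   # offline: reuse pre-collected responses and feedback
        policy_update(q, o, r)       # the same update rule in either setting
```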