ORPO

ORPO is a fine-tuning technique that streamlines the process of adapting LLMs to specific tasks. It addresses a limitation of the traditional two-stage approach. While SFT effectively adapts the model to a desired domain, it can inadvertently increase the probability of generating undesirable responses alongside preferred ones.

Preference alignment is employed to address this issue. It aims to:

Increase the likelihood of generating preferred responses.
Decrease the likelihood of generating rejected responses. Reinforcement Learning with Human Feedback (RLHF) or Direct Preference Optimization (DPO) solves the problem using a separate reference model, increasing computational complexity, while ORPO combines SFT and preference alignment into a single objective function. It modifies the standard language modeling loss by incorporating an odds ratio (OR) term. This term:
Weakly penalizes rejected responses.
Strongly rewards preferred responses. By simultaneously optimizing for both objectives, ORPO allows the LLM to learn the target task while aligning its outputs with human preferences.

Edmondo's Vault

Explorer

ORPO

Graph View

Backlinks