Artificial intelligence has journeyed from simple search engines to dynamic AI assistants that can code, write, and conduct research. Today, you can find these advanced tools right in your pocket through smartphone apps or web APIs, making them accessible to everyone. Whether it’s helping with relationship advice, fact-checking, planning your diet, or suggesting travel ideas, these assistants are becoming an integral part of our daily routine.
As models become more sophisticated, concerns naturally arise about whether AI-generated responses truly align with human values. Traditionally, developers have fine-tuned AI systems using human preference data—pairs of responses to the same prompt, one labelled as chosen and the other as rejected. This approach has paved the way for improved model alignment and safety, and it has given rise to a variety of training algorithms.
Among these methods, Direct Preference Optimization (DPO) has caught the eye for its simplicity and efficiency. Yet it isn’t without flaws. DPO assigns the same weight to every word in a response, even though some words naturally carry more meaning or relevance than others. To put it in context, imagine asking an AI about the capital of France. While the correct answer is obviously important, any extra commentary becomes secondary, even if it’s beautifully written. This is where OTPO steps in, aiming to weigh words more intelligently according to their true significance.
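To make the equal-weighting issue concrete, here is a minimal sketch of the DPO objective computed from per-token log-probabilities, with an optional `token_weights` argument standing in for the kind of importance weighting OTPO aims at. The tensor shapes and the `token_weights` argument are illustrative assumptions for this sketch, not the exact OTPO formulation.

```python
import torch
import torch.nn.functional as F


def dpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, token_weights=None):
    """Sketch of the DPO loss built from per-token log-probabilities.

    Each *_logps tensor holds per-token log-probs of shape (seq_len,).
    Standard DPO sums the tokens with equal weight; the hypothetical
    `token_weights` dict illustrates how a reweighting scheme could let
    important tokens contribute more to the preference score.
    """
    if token_weights is None:
        # Vanilla DPO: every token contributes equally to the sequence score.
        chosen_score = (chosen_logps - ref_chosen_logps).sum()
        rejected_score = (rejected_logps - ref_rejected_logps).sum()
    else:
        # Illustrative token weighting (assumption, not the OTPO formula):
        # scale each token's log-ratio by an importance weight before summing.
        chosen_score = (token_weights["chosen"] *
                        (chosen_logps - ref_chosen_logps)).sum()
        rejected_score = (token_weights["rejected"] *
                          (rejected_logps - ref_rejected_logps)).sum()

    # Bradley-Terry style objective: push the chosen response above the rejected one.
    return -F.logsigmoid(beta * (chosen_score - rejected_score))


# Tiny usage example with made-up per-token log-probs.
chosen = torch.tensor([-0.5, -1.2, -0.3])
rejected = torch.tensor([-0.9, -1.5, -0.8])
ref_chosen = torch.tensor([-0.6, -1.3, -0.4])
ref_rejected = torch.tensor([-0.8, -1.4, -0.7])
print(dpo_loss(chosen, rejected, ref_chosen, ref_rejected))
```

In the vanilla branch, a filler token and the token carrying the actual answer move the loss by exactly as much per unit of log-probability, which is the uniformity OTPO sets out to correct.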