Mitigating Sycophancy in Large Language Models via Direct Preference Optimization

  • Azal Ahmad Khan
  • Sayan Alam
  • Xinran Wang
  • Ahmad Faraz Khan
  • Debanga Raj Neog
  • Ali Anwar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, yet they occasionally exhibit sycophantic behavior, generating responses that agree with a user's stated opinions or preferences even when those opinions are incorrect or biased. This sycophantic tendency can undermine the trustworthiness and reliability of LLMs. This work proposes a novel approach to mitigating sycophancy in LLMs by fine-tuning them on a carefully curated dataset comprising prompts paired with sycophantic and non-sycophantic responses. Our method leverages Direct Preference Optimization (DPO), which optimizes LLMs to generate responses that align with the preferred (non-sycophantic) outputs without requiring explicit reward modeling. We develop a dataset of 1000 prompts with sycophantic and non-sycophantic responses to fine-tune LLMs. Our approach achieves an average reduction of 85% in persona-based tests and 84% in preference-driven tests, demonstrating significant mitigation of sycophantic behaviors. Our findings pave the way for more trustworthy and reliable language models that can provide objective and unbiased responses, aligning with human preferences while maintaining factual accuracy.
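To make the mechanism in the abstract concrete, the standard DPO objective on a (preferred, dispreferred) response pair can be sketched as below. This is a generic illustration of the DPO loss, not the authors' implementation; the function name, the toy log-probabilities, and the choice of beta = 0.1 are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are sequence log-probabilities of the preferred
    (non-sycophantic) and dispreferred (sycophantic) responses under
    the policy being fine-tuned; ref_logp_* are the same quantities
    under the frozen reference model. beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimized by shifting
    # probability mass toward the non-sycophantic response, with no
    # separate reward model required.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy still matches the reference, the margin is 0 and the
# loss equals ln 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

Minimizing this loss over the 1000 curated prompt pairs pushes the model to rank non-sycophantic responses above sycophantic ones while the KL-like anchoring to the reference model (via the log-ratio terms) limits drift from the base model's behavior.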

Original language: English (US)
Title of host publication: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
Editors: Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1664-1671
Number of pages: 8
ISBN (Electronic): 9798350362480
State: Published - 2024
Event: 2024 IEEE International Conference on Big Data, BigData 2024 - Washington, United States
Duration: Dec 15 2024 - Dec 18 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024

Conference

Conference: 2024 IEEE International Conference on Big Data, BigData 2024
Country/Territory: United States
City: Washington
Period: 12/15/24 - 12/18/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Finetuning
  • Large Language Models
