TY - EJOUR
AU - Qu, Yubin
AU - Huang, Song
AU - Li, Long
AU - Nie, Peng
AU - Yao, Yongming
TI - Beyond Intentions: A Critical Survey of Misalignment in LLMs
T2 - Computers, Materials & Continua
PY - 2025
VL - 85
IS - 1
SN - 1546-2226
AB - Large language models (LLMs) represent significant advances in artificial intelligence. However, their increasing capabilities come with a serious challenge: misalignment, the deviation of model behavior from designers’ intentions and human values. This review synthesizes the current understanding of LLM misalignment and provides researchers and practitioners with a comprehensive overview. We define the concept of misalignment and elaborate on its manifestations, including harmful content generation, factual errors (hallucinations), bias propagation, instruction-following failures, deceptive behaviors, and emergent misalignment. We explore the multifaceted causes of misalignment, systematically analyzing factors ranging from surface-level technical issues (e.g., training data, objective function design, model scaling) to deeper fundamental challenges (e.g., difficulties in formalizing values, discrepancies between training signals and true intentions). The review covers existing and emerging techniques for detecting and evaluating the degree of misalignment, such as benchmark tests, red-teaming, and formal safety assessments. We then examine strategies to mitigate misalignment, focusing on mainstream alignment techniques such as reinforcement learning from human feedback (RLHF), Constitutional AI (CAI), and instruction fine-tuning, as well as novel approaches that address scalability and robustness. In particular, we analyze recent advances in misalignment attack research, including system prompt modification, supervised fine-tuning, self-supervised representation attacks, and model editing, all of which challenge the robustness of model alignment. We categorize and analyze the surveyed literature, highlighting major findings, persistent limitations, and ongoing points of contention. Finally, we identify key open questions and propose several promising directions for future research, including constructing high-quality alignment datasets, exploring novel alignment methods, reconciling diverse values, and examining the deeper philosophical aspects of alignment. This work underscores the complexity and multidimensionality of LLM misalignment, calling for interdisciplinary approaches to reliably align LLMs with human values.
KW - Large language models
KW - alignment
KW - misalignment
KW - AI safety
KW - human values
DO - 10.32604/cmc.2025.067750
ER - 