Publications

Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling

in International Conference on Learning Representations, 2026

It introduces a Response-conditioned Bradley-Terry (Rc-BT) model that, through training on an augmented dataset, improves both length-bias mitigation and length-instruction following. It further proposes the Rc-RM and Rc-DPO algorithms, which apply the Rc-BT model to reward modeling and direct preference optimization (DPO) of LLMs.

Download here
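As background for Rc-BT: it builds on the standard Bradley-Terry preference model, whose negative log-likelihood is sketched below. The response-conditioning scheme and the augmented-data construction are the paper's contributions and are not modeled here; this is only the base objective.

```python
import math

def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigma(r(x, y_w) - r(x, y_l)), written in a numerically stable form."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d))
    return math.log1p(math.exp(-diff))
```

A wider reward margin for the preferred response yields a lower loss, which is the training signal a reward model (or a DPO policy) optimizes.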

Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks

under review at Neural Information Processing Systems, 2025

The research identifies a critical oversight in existing techniques: they predominantly compare responses while neglecting valuable latent signals embedded in the prompt inputs, and they consider preference disparities only at the intra-sample level, ignoring the inter-sample preference differentials that exist across preference data. To leverage these previously neglected signals, it proposes Multi-level Aware Preference Learning (MAPL), a novel framework that enhances multi-instruction capabilities.

Download here
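The intra- vs. inter-sample distinction can be illustrated with a toy pair of losses. Everything below (the function names, the margin-comparison form) is a hypothetical illustration under assumed definitions, not MAPL's actual objective.

```python
import math

def logsigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x))

def intra_sample_loss(r_chosen: float, r_rejected: float) -> float:
    # Intra-sample level: compare the two responses within one preference pair.
    return -logsigmoid(r_chosen - r_rejected)

def inter_sample_loss(margin_a: float, margin_b: float) -> float:
    # Hypothetical inter-sample term: a pair with a clearer preference
    # (pair A) should end up with a larger reward margin than a more
    # ambiguous pair (pair B).
    return -logsigmoid(margin_a - margin_b)
```

The point of the sketch is only the structure: one term ranks responses within a sample, the other ranks preference strength across samples.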

Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

under review at Neural Information Processing Systems, 2025

To model the intricate nature of length bias more accurately and enable more effective mitigation, it proposes FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects underlying bias patterns.

Download here
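A common way to frame length-bias correction is to fit the length-predictable component of the reward and subtract it. The sketch below assumes a simple linear bias for illustration; FiMi-RM's point is precisely that it learns the bias pattern rather than assuming its form, so treat the linear fit as a hypothetical placeholder.

```python
def fit_linear_length_bias(lengths, rewards):
    """Least-squares fit of reward ~ a * length + b, standing in for a
    learned bias model (the linear form is an assumption for illustration)."""
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_r = sum(rewards) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(lengths, rewards))
    var = sum((l - mean_l) ** 2 for l in lengths)
    a = cov / var
    b = mean_r - a * mean_l
    return a, b

def debias(reward, length, a, b):
    # Remove the length-predictable component from the raw reward.
    return reward - (a * length + b)
```

After correction, a reward that was fully explained by response length maps to roughly zero, so longer responses no longer earn reward for length alone.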

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

in Neural Information Processing Systems, 2025

This work is the first to systematically investigate the effectiveness and underlying mechanisms of activation engineering for mitigating hallucinations in VideoLLMs. It also proposes a temporal-aware activation engineering framework that adaptively identifies and manipulates hallucination-sensitive modules based on temporal variation characteristics, substantially reducing hallucinations without additional LLM fine-tuning.

Download here
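Activation engineering, in its simplest generic form, shifts a module's hidden state along a steering direction at inference time. The sketch below shows only that generic operation; the paper's contribution, selecting which modules to steer based on temporal variation, is not modeled here.

```python
import math

def steer(hidden: list[float], direction: list[float], alpha: float) -> list[float]:
    """Shift a hidden-state vector by alpha along a (normalized) steering
    direction: h' = h + alpha * d / ||d||. A generic activation-steering
    step, not the paper's specific procedure."""
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    unit = [d / norm for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]
```

Applied inside a forward hook on selected layers, this kind of edit changes model behavior without any fine-tuning, which is what makes the approach attractive for hallucination mitigation.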