Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

1University of Washington 2University of California, Los Angeles 3University of California, Santa Barbara 4Microsoft Research Asia 5Valencian Research Institute for Artificial Intelligence 6William & Mary

Abstract

Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics, such as factuality, bias, or toxicity, overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce "personalized safety" to fill this gap and present PENGUIN, a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE, a training-free, two-stage agent framework that strategically acquires user-specific background information. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical solution for personalizing LLM responses without model retraining. This work establishes a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.

The Personalized Safety Problem Is Common

The same seemingly harmless, empathetic response can bring emotional relief to a low-risk user yet trigger fatal action in another user with suicidal intent. Despite advances in general LLM capabilities, these personalized safety failures remain a critical blind spot in current LLM safety research.

Motivation

Left (blue dashed box): Two users with different personal contexts ask the same sensitive query, but a generic response leads to divergent safety outcomes: harmless for one, harmful for the other. Left (blue region): Evaluating this query across 1,000 diverse user profiles reveals highly inconsistent safety scores across models. Right (orange dashed box): When user-specific context is included, LLMs produce safer and more empathetic responses. Right (orange region): This trend generalizes across 14,000 context-rich scenarios, motivating our PENGUIN benchmark for evaluating personalized safety in high-risk settings.

PENGUIN BENCHMARK

PENGUIN benchmark (Personalized Evaluation of Nuanced Generation Under Individual Needs).

PENGUIN Benchmark

The first large-scale testbed for evaluating LLM safety in personalized, high-stakes scenarios. Each user scenario is associated with structured context attributes and is paired with both context-rich and context-free queries. Responses are scored on a three-dimensional personalized safety scale to quantify the impact of user context information.
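To make this structure concrete, below is a minimal sketch of what a single scenario record might look like. The field names and the three dimension names are illustrative assumptions for this sketch, not the released schema.

from dataclasses import dataclass

@dataclass
class PenguinScenario:
    # Field names are assumptions for illustration, not the released schema.
    domain: str                 # one of the seven sensitive domains
    context: dict               # structured user attributes, e.g. {"age": "19", "emotion": "hopeless"}
    context_free_query: str     # the query without any user background
    context_rich_query: str     # the same query with the user background attached

@dataclass
class SafetyScore:
    # Three-dimensional personalized safety scale; dimension names are assumed here.
    risk_sensitivity: float
    emotional_empathy: float
    personalization: float

    def overall(self) -> float:
        return (self.risk_sensitivity + self.emotional_empathy + self.personalization) / 3.0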

Example Data

We collect data from both Reddit and synthetic generation. Below is an example of the Reddit data.


Experiment

Experiments on six state-of-the-art LLMs.

Baseline

All models demonstrate substantial improvements with personalized context information. On average, safety scores increase from 2.79 to 4.00 across the dataset, reflecting a consistent and significant trend. These results indicate that the benefits of personalized information generalize across diverse model architectures and capability levels, which motivates our next question: which user attributes contribute most to improving personalized safety? This question is particularly relevant because collecting full context is not always feasible in real-world applications.
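As a rough illustration of this baseline protocol, the sketch below compares the same model on the context-free and context-rich variants of each scenario. The query_llm and judge_safety callables are hypothetical stand-ins for the model under test and the safety judge, not part of the released code.

from typing import Callable, Sequence

def evaluate_model(
    scenarios: Sequence,                          # PenguinScenario records as sketched above
    query_llm: Callable[[str], str],              # hypothetical: prompt -> model response
    judge_safety: Callable[[str, dict], float],   # hypothetical: (response, user context) -> safety score
) -> dict:
    # Average safety with and without user context, mirroring the 2.79 -> 4.00 comparison above.
    free = [judge_safety(query_llm(s.context_free_query), s.context) for s in scenarios]
    rich = [judge_safety(query_llm(s.context_rich_query), s.context) for s in scenarios]
    return {"context_free": sum(free) / len(free), "context_rich": sum(rich) / len(rich)}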

RAISE (Risk-Aware Information Selection Engine)

RAISE method

Overview of our proposed RAISE framework. Left: We formulate the task as a sequential attribute selection problem, where each state represents a partial user context. Middle: An offline LLM-guided Monte Carlo Tree Search (MCTS) planner explores this space to discover optimized acquisition paths that maximize safety scores under budget constraints. Right: At inference time, the online agent follows the retrieved path via an Acquisition Module, while an Abstention Module decides when context suffices for safe response generation.
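The sketch below illustrates the offline planning idea in deliberately simplified form: plain MCTS over which user attributes to acquire next, under a budget on the number of questions. The estimate_safety callable stands in for the paper's LLM-guided reward signal and is an assumption of this sketch, not the released implementation.

import math
import random
from typing import Callable, Dict, Sequence, Tuple

State = Tuple[str, ...]  # attributes acquired so far, in order

def mcts_plan(
    attributes: Sequence[str],
    estimate_safety: Callable[[State], float],  # hypothetical: partial context -> expected safety score
    budget: int = 3,
    iterations: int = 500,
    c: float = 1.4,
) -> State:
    assert budget <= len(attributes), "budget must not exceed the number of attributes"
    visits: Dict[State, int] = {}
    values: Dict[State, float] = {}

    def children(state: State):
        return [state + (a,) for a in attributes if a not in state]

    def ucb(parent: State, child: State) -> float:
        if child not in visits:
            return float("inf")  # always try unvisited actions first
        return values[child] / visits[child] + c * math.sqrt(math.log(visits[parent]) / visits[child])

    for _ in range(iterations):
        # Selection / expansion: descend by UCB until an unvisited node or the budget is reached.
        path, state = [()], ()
        while len(state) < budget:
            state = max(children(state), key=lambda ch: ucb(path[-1], ch))
            path.append(state)
            if visits.get(state, 0) == 0:
                break
        # Rollout: randomly complete the path up to the budget, then score the full context.
        rollout = list(state)
        remaining = [a for a in attributes if a not in rollout]
        random.shuffle(remaining)
        rollout += remaining[: budget - len(rollout)]
        reward = estimate_safety(tuple(rollout))
        # Backpropagation.
        for s in path:
            visits[s] = visits.get(s, 0) + 1
            values[s] = values.get(s, 0.0) + reward

    # Extract the most-visited acquisition path as the plan.
    best, state = (), ()
    while len(state) < budget:
        expanded = [ch for ch in children(state) if ch in visits]
        if not expanded:
            break
        state = max(expanded, key=lambda ch: visits[ch])
        best = state
    return best

At the end of planning, the most-visited path can be cached and handed to the online agent, which only follows it rather than re-planning per user.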

Performance Improvement by RAISE

Ablation Results

We denote the unmodified, context-free model as the vanilla baseline, which achieves an average safety score of only 2.86. Adding the abstention mechanism enables the model to determine when more user information is needed, avoiding unsafe responses when information is insufficient and improving safety scores to 3.56 (a 24.5% improvement). Further introducing the path planner allows the system to select the most valuable attribute query sequence, raising safety scores to 3.77 (an additional 5.9% improvement) while keeping query counts moderate, at an average of just 2.7 queries per user. From the baseline model to the complete RAISE framework, safety scores improve by 31.6% overall. The planner optimizes the attribute collection strategy, while the abstention mechanism ensures generation is deferred until sufficient information is gathered. Together, they create a system that is both safe and efficient for high-stakes personalized LLM use cases.
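A minimal sketch of how the online stage could combine the two modules is shown below. Here ask_user, is_sufficient, and generate_response are hypothetical callables standing in for the Acquisition Module, the Abstention Module, and the final response generator, under the assumption that the planned attribute order comes from the offline planner.

from typing import Callable, Dict, Sequence

def raise_inference(
    query: str,
    planned_path: Sequence[str],                    # attribute order produced by the offline planner
    ask_user: Callable[[str], str],                 # hypothetical: attribute name -> user's answer
    is_sufficient: Callable[[str, Dict], bool],     # hypothetical: (query, context) -> enough info to answer safely?
    generate_response: Callable[[str, Dict], str],  # hypothetical: (query, context) -> personalized response
) -> str:
    context: Dict[str, str] = {}
    for attribute in planned_path:
        if is_sufficient(query, context):
            break  # abstain from asking further questions once the context suffices
        context[attribute] = ask_user(attribute)
    return generate_response(query, context)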

BibTeX

@misc{wu2025personalizedsafetyllmsbenchmark,
      title={Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach}, 
      author={Yuchen Wu and Edward Sun and Kaijie Zhu and Jianxun Lian and Jose Hernandez-Orallo and Aylin Caliskan and Jindong Wang},
      year={2025},
      eprint={2505.18882},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2505.18882}, 
}