Beyond Safe Answers:
A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models

Baihui Zheng*, Boren Zheng*, Kerui Cao*, Yingshui Tan
Zhengdong Liu, Weixun Wang, Jiaheng Liu, Jian Yang
Xiaoyong Zhu, Wenbo Su, Bo Zheng, Kaifu Zhang
Taobao & Tmall Group of Alibaba
*Indicates equal contribution. Corresponding author.

Abstract

Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA)—a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce BeyondSafeAnswer Bench, a novel benchmark comprising 2,000 challenging instances organized into 3 distinct SSA scenario types and spanning 9 risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.

Dataset Statistics

Total Samples
  • Over-Sensitivity: 200
  • Risk Omission: 600
  • Cognitive Shortcut: 1,200
  • Total: 2,000

Primary Risk Categories (9)
  • Offense & Prejudice: 22.50%
  • Specially Regulated Items: 12.75%
  • Property Infringement: 11.63%
  • Invasion of Privacy & Confidentiality: 11.63%
  • Physical & Mental Health: 11.53%
  • Violence & Terrorism: 10.00%
  • Ethics & Morality: 9.38%
  • Rumors: 5.83%
  • Child Pornography: 4.75%

Query Length (tokens)
  • Maximum: 312
  • Minimum: 5
  • Average: 73

※ Note: Each question may contain two risk types, so the total number of risk classifications exceeds the number of questions.

Data Construction Pipeline

An overview of the data construction, filtering, verification, and quality control processes of the Beyond Safe Answers benchmark.

Key Features of Beyond Safe Answers Benchmark

  • Detailed Risk Rationales: Each instance is accompanied by explicit annotations that detail the underlying risks, enabling precise assessment of a model's reasoning depth (a sketch of a possible instance format follows this list).
  • Comprehensive Coverage: Contains 2,000 carefully curated samples spanning three distinct SSA scenarios—Over-Sensitivity, Cognitive Shortcut, and Risk Omission—across 9 primary risk categories, ensuring diverse and extensive evaluation.
  • Challenging Evaluation: Top-performing LRMs achieve only moderate accuracy in correctly identifying risk rationales, highlighting the benchmark's rigor and difficulty.
  • Robust Methodology: Incorporates meticulous human annotations, rigorous quality control, and validation using multiple state-of-the-art LRMs to ensure reliability and validity.
  • Insightful Conclusions: Demonstrates the efficacy of explicit safety guidelines and of fine-tuning with high-quality reasoning data in mitigating SSA, and shows that decoding strategies have minimal impact.
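The exact release format of each instance is not shown on this page. The following is a minimal sketch, assuming a JSON-style record with hypothetical field names (`scenario`, `risk_categories`, `risk_rationale`), of how a single annotated sample might look.

```python
# Hypothetical sketch of a Beyond Safe Answers instance.
# Field names are assumptions for illustration; the released dataset may
# use different keys or a different container format.
import json

example_instance = {
    "id": "bsa-000123",                          # hypothetical identifier
    "query": "An ostensibly harmless question hiding an implicit risk...",
    "scenario": "Cognitive Shortcut",            # one of the three SSA scenarios
    "risk_categories": ["Offense & Prejudice"],  # up to two categories per question
    "risk_rationale": "Explanation of why the query is risky and what a "
                      "genuinely safe answer must recognize.",
}

print(json.dumps(example_instance, indent=2, ensure_ascii=False))
```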

Main Results

  • BeyondSafeAnswer Bench is Challenging. All tested LRMs exhibit significant Superficial Safety Alignment (SSA). Although models achieve high surface-level safety scores (over 90%), their internal risk reasoning accuracy is low, with even top-performing models scoring below 40% on accurately identifying risks.
  • Reasoning Accuracy Correlates with Safety. Higher reasoning accuracy strongly predicts safer outputs. Models proficient at internally identifying risks consistently generate safer responses, whereas models with poor internal reasoning demonstrate inconsistent safety behaviors.
  • Larger Models Exhibit Superior Safety Performance. Performance analysis indicates that larger models generally achieve better results across safety metrics, particularly in scenarios involving subtle or hidden risks (Risk Omission). The enhanced memory and knowledge retrieval capabilities of larger LRMs significantly improve their internal risk assessment.
  • Explicit Safety Guidelines Enhance Model Safety. Integrating explicit safety rules into prompts significantly improves both surface-level response safety and internal reasoning accuracy, particularly addressing the issue of Risk Omission. However, this approach may increase model hypersensitivity in certain scenarios.
  • Fine-tuning with High-quality Reasoning Data Improves Performance. Fine-tuning LRMs with curated reasoning trajectories enhances overall safety and internal risk reasoning accuracy. Smaller models see greater relative improvements, whereas larger models, already performing well, achieve more modest gains.

Leaderboard

Models Safe@1 Safe@k Think@1 Think@k F-score OS@1 OS@k CS@1 CS@k RO@1 RO@k
Closed-Source Large Reasoning Models
Doubao-1.5-thinking-pro 92.97 86.50 37.24 18.55 53.21 60.60 19.00 17.67 4.25 68.60 47.00
GLM-Z1-AirX 91.59 82.59 32.65 11.90 41.65 53.30 13.00 14.72 1.33 61.63 32.67
Kimi-K1.5 78.68 64.70 28.82 9.75 36.53 52.00 8.00 12.77 1.33 53.20 27.17
Open-Source Large Reasoning Models
QwQ-32B 93.54 85.10 33.38 11.40 49.89 49.80 7.50 17.12 2.58 60.43 30.33
Qwen3-235B-A22B 97.52 93.30 35.25 12.45 44.82 55.40 9.00 16.47 2.17 66.10 34.17
Qwen3-30B-A3B 98.27 95.15 30.84 11.40 48.46 52.00 10.00 11.38 0.83 62.70 33.00
Qwen3-32B 96.50 91.25 34.02 11.25 51.09 57.00 12.00 15.55 1.42 63.30 30.67
Qwen3-14B 98.19 94.30 31.84 11.65 49.40 57.60 13.00 12.67 1.17 61.60 32.17
Qwen3-8B 97.14 92.15 28.62 9.30 46.09 56.40 11.00 10.90 0.75 54.80 25.83
Qwen3-4B 95.63 88.85 25.57 8.25 42.77 53.10 10.00 7.82 0.33 51.90 23.50
Qwen3-1.7B 79.87 62.85 15.37 2.95 29.23 34.00 3.00 4.12 0.08 31.67 8.67
Qwen3-0.6B 41.09 18.05 5.88 0.25 12.55 25.10 2.00 2.07 0.00 7.10 0.17
DeepSeek-R1 94.63 88.85 37.98 16.20 54.22 52.70 13.50 20.78 4.33 67.47 40.83
R1-Distill-Llama-70B 86.69 79.50 23.45 7.55 39.05 49.60 12.00 10.17 2.17 41.30 16.83
R1-Distill-Qwen-32B 80.64 71.70 20.91 5.60 35.40 46.00 10.50 9.97 1.67 34.43 11.83
R1-Distill-Qwen-14B 83.07 73.55 19.61 6.05 34.43 45.20 8.50 7.05 0.83 36.20 15.67
R1-Distill-Llama-8B 71.50 58.60 14.73 3.90 27.28 34.70 6.50 4.77 0.42 28.00 10.00
R1-Distill-Qwen-7B 66.64 52.05 8.72 1.20 19.27 26.20 1.00 2.70 0.17 14.93 3.33
R1-Distill-Qwen-1.5B 39.96 17.25 2.94 0.15 8.13 14.60 1.00 1.00 0.00 2.93 0.17
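In the table, the @1 and @k columns report per-single-sample and across-k-samples scores respectively (the discussion below refers to Safe@1 as initial safety and Safe@k as sustained safety), while OS, CS, and RO are the Over-Sensitivity, Cognitive Shortcut, and Risk Omission scenario scores. The exact estimator is not specified on this page; the sketch below shows one plausible way to compute Safe@1 and Safe@k from binary per-sample safety judgments, under that assumption.

```python
# Hedged sketch: one plausible Safe@1 / Safe@k computation from binary
# per-sample safety judgments. The paper's exact estimator may differ; here
# Safe@1 is the mean per-sample safety rate and Safe@k is the fraction of
# prompts whose k sampled responses are all judged safe.
from typing import List

def safe_at_1(judgments: List[List[bool]]) -> float:
    """Average probability that a single sampled response is safe."""
    per_prompt = [sum(j) / len(j) for j in judgments]
    return 100.0 * sum(per_prompt) / len(per_prompt)

def safe_at_k(judgments: List[List[bool]]) -> float:
    """Fraction of prompts for which every one of the k samples is safe."""
    return 100.0 * sum(all(j) for j in judgments) / len(judgments)

# judgments[i][j] = True if the j-th sample for prompt i was judged safe
judgments = [
    [True, True, True, True],    # consistently safe prompt
    [True, False, True, True],   # inconsistent: contributes to Safe@1 but not Safe@k
]
print(f"Safe@1 = {safe_at_1(judgments):.1f}%, Safe@k = {safe_at_k(judgments):.1f}%")
```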

Impact of Decoding Sampling Strategies on Large Reasoning Models

Figure: effect of decoding parameters. (a) Temperature impact on Qwen3-32B. (b) Top-p and top-k impact on QwQ-32B.

Evaluating the effects of different decoding-phase sampling strategies on Qwen3-32B and QwQ-32B, we find that altering decoding parameters such as temperature, top-p, and top-k has minimal influence on performance in safety reasoning tasks. The experiments systematically tested various combinations of these parameters, assessing their impact on metrics including Safe@1, Safe@k, Think@1, Think@k, and the scenario-specific metrics (RO, OS, CS).

The experimental outcomes reveal that the core reasoning and safety evaluation capabilities of the tested large reasoning models are predominantly determined by intrinsic knowledge structures acquired during pre-training and alignment phases, rather than by decoding parameter adjustments. Notably, despite variations in sampling strategies, performance remained relatively stable across all metrics, suggesting that model training quality and inherent knowledge representation are far more critical to reliable safety alignment than the selection of decoding-phase parameters.
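As an illustration of the kind of sweep described above, the sketch below varies temperature, top-p, and top-k with Hugging Face transformers. The model id, the parameter grid, and the (omitted) safety judge are assumptions; the paper's actual evaluation harness is not specified on this page.

```python
# Illustrative decoding-parameter sweep, not the paper's exact harness.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-32B"  # assumption: one of the evaluated checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str, temperature: float, top_p: float, top_k: int) -> str:
    """Sample one response under the given decoding configuration."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=1024,
    )
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Hypothetical grid of decoding settings for the sweep.
for temperature, top_p, top_k in itertools.product([0.2, 0.6, 1.0], [0.8, 0.95], [20, 50]):
    response = generate("query from the benchmark...", temperature, top_p, top_k)
    # A safety judge would then score the response for Safe@1 / Think@1 (not shown).
```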

Impact of Safety Rules Integration on LRMs

We evaluated the efficacy of explicitly integrating safety rules into the input prompts of several Large Reasoning Models (LRMs), assessing changes in response safety and reasoning accuracy across multiple dimensions. The radar chart illustrates comparative performance before ("Base") and after ("With Rule") applying these explicit safety guidelines.

The integration of safety rules significantly enhanced overall safety metrics, notably Safe@1, which reached as high as 99.8% for QwQ-32B. Additionally, the rules markedly improved performance in risk omission scenarios (RO@1), suggesting models were more adept at detecting subtle or previously overlooked risks. However, this rule-based approach also resulted in a notable increase in over-sensitivity (OS@1), reflecting a tendency towards overly cautious behavior in ambiguous contexts. These findings underscore a trade-off where explicit rule integration improves risk detection at the expense of increased false positives in low-risk scenarios, indicating a need for carefully balanced safety guidelines.
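A minimal sketch of the "With Rule" condition, assuming the explicit rules are injected as a system message; the rule text below is invented for illustration and is not the benchmark's actual rule set.

```python
# Illustrative "Base" vs. "With Rule" prompt construction.
# SAFETY_RULES is placeholder text, not the benchmark's published guidelines.
SAFETY_RULES = (
    "Before answering, explicitly check the request against the nine risk "
    "categories (e.g., violence, privacy invasion, regulated items). "
    "If any risk is present, explain it and refuse or answer safely."
)

def build_messages(user_query: str, with_rule: bool) -> list[dict]:
    """Return a chat-format message list, optionally prefixed with safety rules."""
    messages = []
    if with_rule:
        messages.append({"role": "system", "content": SAFETY_RULES})
    messages.append({"role": "user", "content": user_query})
    return messages

base_msgs = build_messages("benchmark query...", with_rule=False)   # "Base"
ruled_msgs = build_messages("benchmark query...", with_rule=True)   # "With Rule"
```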


Effect of Fine-tuning with Superior Reasoning Data

We investigated the impact of fine-tuning Large Reasoning Models (LRMs) using high-quality safety reasoning trajectories from the STAR-1 dataset, measuring improvements across multiple safety evaluation metrics. The provided visualizations compare model performances before ("Base") and after ("Fine-tuned") fine-tuning across various scales of the Qwen3 model family.

The results clearly demonstrate that fine-tuning substantially improves safety and reasoning accuracy metrics, especially for smaller models (e.g., Qwen3-0.6B and Qwen3-1.7B), which exhibited the largest relative gains in both Safe@1 and Think@1. Fine-tuning also notably enhanced performance in the Cognitive Shortcut (CS) and Risk Omission (RO) scenarios, indicating increased capacity for nuanced risk detection. Conversely, fine-tuned models exhibited heightened over-sensitivity (OS): while fine-tuning substantially boosts safety reasoning ability, it also tends to increase cautiousness, a trade-off that alignment strategies must balance.
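A minimal supervised fine-tuning sketch in the spirit of this experiment, using TRL's SFTTrainer; the STAR-1 dataset id, the column layout expected by the trainer, and the hyperparameters are assumptions, not the paper's exact recipe.

```python
# Hedged SFT sketch on STAR-1-style safety reasoning trajectories.
# Dataset id and hyperparameters are placeholders; adapt to the actual release.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("UCSC-VLAA/STAR-1", split="train")  # assumed dataset id

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",  # smaller models showed the largest relative gains
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-1.7b-star1-sft",
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```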

Positive Correlation Between Reasoning Accuracy and Response Safety

We observed a significant positive correlation between reasoning accuracy (Think@1) and response safety (Safe@1) in the evaluated Large Reasoning Models (LRMs). Models demonstrating high accuracy in risk identification during reasoning consistently yielded safer final outputs. Conversely, models with weak reasoning capabilities, particularly smaller-scale models such as Qwen3-0.6B and R1-Distill-Qwen-1.5B, displayed substantial gaps between their initial safety (Safe@1) and sustained safety performance (Safe@k), indicating reduced robustness in safety alignment. These findings underscore that reliable internal reasoning is critical to maintaining robust and consistent safety across diverse contexts, and they emphasize the need to prioritize accurate internal reasoning within LRMs.
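As a quick illustration, the Think@1 and Safe@1 columns from the leaderboard above can be correlated directly; the snippet below computes the Pearson correlation with SciPy over the 19 evaluated models.

```python
# Pearson correlation between Think@1 (reasoning accuracy) and Safe@1
# (single-sample response safety), using the leaderboard values above.
from scipy.stats import pearsonr

think_at_1 = [37.24, 32.65, 28.82, 33.38, 35.25, 30.84, 34.02, 31.84, 28.62,
              25.57, 15.37, 5.88, 37.98, 23.45, 20.91, 19.61, 14.73, 8.72, 2.94]
safe_at_1 = [92.97, 91.59, 78.68, 93.54, 97.52, 98.27, 96.50, 98.19, 97.14,
             95.63, 79.87, 41.09, 94.63, 86.69, 80.64, 83.07, 71.50, 66.64, 39.96]

r, p = pearsonr(think_at_1, safe_at_1)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```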

Dataset Examples

BibTeX

TODO:Citation