Existing safety evaluations primarily assess response-level safety, leaving reasoning-level risks unmeasured. Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. We identify Superficial Safety Alignment (SSA): a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, creating a dangerous illusion of safety and rendering systems prone to catastrophic failure under minor perturbations. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA), a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenarios and spanning nine risk categories, each meticulously annotated with risk rationales. We evaluate 23 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with the best model reaching 54.57% accuracy on risk-rationale identification. Current benchmarks are largely blind to this latent risk; to our knowledge, BSA is the first benchmark designed to systematically diagnose SSA. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work aims for verifiably robust safety reasoning in LRMs, moving beyond mere superficial compliance and enabling practitioners to evaluate and improve safety-reasoning fidelity with measurable evidence.
| Total Samples | |
|---|---|
| Over-Sensitivity | 200 | 
| Risk Omission | 600 | 
| Cognitive Shortcut | 1200 | 
| Total | 2000 | 
| Primary Risk Categories (9) | |
|---|---|
| Offense & Prejudice | 22.50% | 
| Specially Regulated Items | 12.75% | 
| Property Infringement | 11.63% | 
| Invasion of Privacy & Confidentiality | 11.63% | 
| Physical & Mental Health | 11.53% | 
| Violence & Terrorism | 10.00% | 
| Ethics & Morality | 9.38% | 
| Rumors | 5.83% | 
| Child Pornography | 4.75% | 
| Data Tokens | |
|---|---|
| Max query tokens | 312 | 
| Min query tokens | 5 | 
| Average query tokens | 73 | 
 
          Key Features of Beyond Safe Answers Benchmark
Main Results
| Models | Safe@1 | Safe@k | Think@1 | Think@k | F-score | OS@1 | OS@k | CS@1 | CS@k | RO@1 | RO@k | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Large Language Models | |||||||||||
| Doubao-1.5-thinking-pro | 92.97 | 86.50 | 37.24 | 18.55 | 53.21 | 60.60 | 19.00 | 17.67 | 4.25 | 68.60 | 47.00 | 
| Gemini-2.5-Flash | 95.38 | 90.75 | 40.46 | 19.7 | 56.53 | 60 | 19.5 | 22.7 | 6.67 | 69.47 | 45.83 | 
| Gemini-2.5-Pro | 94.51 | 88.2 | 38.02 | 18.15 | 54.23 | 68.6 | 26.5 | 19.55 | 5.67 | 64.77 | 40.33 | 
| Claude-3.7-Sonnet | 99.28 | 98.05 | 54.57 | 30.7 | 68.92 | 53.4 | 9.5 | 40.05 | 18.08 | 84 | 63 | 
| Claude-4-Sonnet | 98.98 | 96.75 | 48.89 | 25.55 | 64.37 | 58.3 | 14 | 36.35 | 16.58 | 70.83 | 47.33 | 
| GLM-Z1-AirX | 91.59 | 82.59 | 32.65 | 11.90 | 41.65 | 53.30 | 13.00 | 14.72 | 1.33 | 61.63 | 32.67 | 
| Kimi-K1.5 | 78.68 | 64.70 | 28.82 | 9.75 | 36.53 | 52.00 | 8.00 | 12.77 | 1.33 | 53.20 | 27.17 | 
| Open-Source Large Language Models | |||||||||||
| QwQ-32B | 93.54 | 85.10 | 33.38 | 11.40 | 49.89 | 49.80 | 7.50 | 17.12 | 2.58 | 60.43 | 30.33 | 
| Qwen3-235B-A22B | 97.52 | 93.30 | 35.25 | 12.45 | 44.82 | 55.40 | 9.00 | 16.47 | 2.17 | 66.10 | 34.17 | 
| Qwen3-30B-A3B | 98.27 | 95.15 | 30.84 | 11.40 | 48.46 | 52.00 | 10.00 | 11.38 | 0.83 | 62.70 | 33.00 | 
| Qwen3-32B | 96.50 | 91.25 | 34.02 | 11.25 | 51.09 | 57.00 | 12.00 | 15.55 | 1.42 | 63.30 | 30.67 | 
| Qwen3-14B | 98.19 | 94.30 | 31.84 | 11.65 | 49.40 | 57.60 | 13.00 | 12.67 | 1.17 | 61.60 | 32.17 | 
| Qwen3-8B | 97.14 | 92.15 | 28.62 | 9.30 | 46.09 | 56.40 | 11.00 | 10.90 | 0.75 | 54.80 | 25.83 | 
| Qwen3-4B | 95.63 | 88.85 | 25.57 | 8.25 | 42.77 | 53.10 | 10.00 | 7.82 | 0.33 | 51.90 | 23.50 | 
| Qwen3-1.7B | 79.87 | 62.85 | 15.37 | 2.95 | 29.23 | 34.00 | 3.00 | 4.12 | 0.08 | 31.67 | 8.67 | 
| Qwen3-0.6B | 41.09 | 18.05 | 5.88 | 0.25 | 12.55 | 25.10 | 2.00 | 2.07 | 0.00 | 7.10 | 0.17 | 
| Deepseek-R1 | 94.63 | 88.85 | 37.98 | 16.20 | 54.22 | 52.70 | 13.50 | 20.78 | 4.33 | 67.47 | 40.83 | 
| R1-Distill-Llama-70B | 86.69 | 79.50 | 23.45 | 7.55 | 39.05 | 49.60 | 12.00 | 10.17 | 2.17 | 41.30 | 16.83 | 
| R1-Distill-Qwen-32B | 80.64 | 71.70 | 20.91 | 5.60 | 35.40 | 46.00 | 10.50 | 9.97 | 1.67 | 34.43 | 11.83 | 
| R1-Distill-Qwen-14B | 83.07 | 73.55 | 19.61 | 6.05 | 34.43 | 45.20 | 8.50 | 7.05 | 0.83 | 36.20 | 15.67 | 
| R1-Distill-Llama-8B | 71.50 | 58.60 | 14.73 | 3.90 | 27.28 | 34.70 | 6.50 | 4.77 | 0.42 | 28.00 | 10.00 | 
| R1-Distill-Qwen-7B | 66.64 | 52.05 | 8.72 | 1.20 | 19.27 | 26.20 | 1.00 | 2.70 | 0.17 | 14.93 | 3.33 | 
| R1-Distill-Qwen-1.5B | 39.96 | 17.25 | 2.94 | 0.15 | 8.13 | 14.60 | 1.00 | 1.00 | 0.00 | 2.93 | 0.17 | 
 
                (a) Qwen3-32B
 
                (b) QwQ-32B
Through evaluating the effects of different decoding-phase sampling strategies on the performance of Qwen3-32B and QwQ-32B models, we identify that altering decoding parameters such as temperature, top-p, and top-k minimally influences the models' performance in safety reasoning tasks. The experiments systematically tested various combinations of these parameters, assessing their impact on metrics including Safe@1, Safe@k, Think@1, Think@k, and scenario-specific metrics (RO, OS, CS).
The experimental outcomes reveal that the core reasoning and safety evaluation capabilities of the tested large reasoning models are predominantly determined by intrinsic knowledge structures acquired during pre-training and alignment phases, rather than by decoding parameter adjustments. Notably, despite variations in sampling strategies, performance remained relatively stable across all metrics, suggesting that model training quality and inherent knowledge representation are far more critical to reliable safety alignment than the selection of decoding-phase parameters.
 
          We evaluated the efficacy of explicitly integrating safety rules into the input prompts of several Large Reasoning Models (LRMs), assessing changes in response safety and reasoning accuracy across multiple dimensions. The radar chart illustrates comparative performance before ("Base") and after ("With Rule") applying these explicit safety guidelines.
The integration of safety rules significantly enhanced overall safety metrics, notably Safe@1, which reached as high as 99.8% for QwQ-32B. Additionally, the rules markedly improved performance in risk omission scenarios (RO@1), suggesting models were more adept at detecting subtle or previously overlooked risks. However, this rule-based approach also resulted in a notable increase in over-sensitivity (OS@1), reflecting a tendency towards overly cautious behavior in ambiguous contexts. These findings underscore a trade-off where explicit rule integration improves risk detection at the expense of increased false positives in low-risk scenarios, indicating a need for carefully balanced safety guidelines.
: 
               
              We investigated the impact of fine-tuning Large Reasoning Models (LRMs) using high-quality safety reasoning trajectories from the STAR-1 dataset, measuring improvements across multiple safety evaluation metrics. The provided visualizations compare model performances before ("Base") and after ("Fine-tuned") fine-tuning across various scales of the Qwen3 model family.
The results clearly demonstrate that fine-tuning substantially improves safety and reasoning accuracy metrics, especially for smaller models (e.g., Qwen3-0.6B and Qwen3-1.7B), which exhibited the largest relative gains in both Safe@1 and Think@1 metrics. Fine-tuning also notably enhanced model performance in scenarios involving Cognitive Shortcut (CS) and Risk Omission (RO), highlighting increased capacity for nuanced risk detection. Conversely, fine-tuned models exhibited heightened over-sensitivity (OS), indicating that fine-tuning, while significantly boosting comprehensive safety reasoning abilities, also tends to increase cautiousness, thus highlighting an essential balance to be considered in model alignment strategies.
 
          We observed a significant positive correlation between reasoning accuracy (Think@1) and response safety (Safe@1) in the evaluated Large Reasoning Models (LRMs). Models demonstrating high accuracy in risk identification during reasoning processes consistently yielded safer final outputs. Conversely, models exhibiting weak reasoning capabilities, particularly smaller-scale models like Qwen3-0.6B and R1-Distill-Qwen-1.5B, displayed substantial inconsistencies between their initial safety (Safe@1) and sustained safety performance (Safe@k), highlighting reduced robustness in safety alignment. These findings underscore that reliable internal reasoning is critical to maintaining robust and consistent safety across diverse contexts, and emphasizes the necessity of prioritizing accurate internal reasoning processes within LRMs.
 
          @misc{zheng2025safeanswersbenchmarkevaluating,
      title={Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models}, 
      author={Baihui Zheng and Boren Zheng and Kerui Cao and Yingshui Tan and Zhendong Liu and Weixun Wang and Jiaheng Liu and Jian Yang and Wenbo Su and Xiaoyong Zhu and Bo Zheng and Kaifu Zhang},
      year={2025},
      eprint={2505.19690},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.19690}, 
}