Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA)—a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce BeyondSafeAnswer Bench, a novel benchmark comprising 2,000 challenging instances organized into 3 distinct SSA scenario types and spanning 9 risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
SSA Scenario Type | Total Samples |
---|---|
Over-Sensitivity | 200 |
Risk Omission | 600 |
Cognitive Shortcut | 1200 |
Total | 2000 |
Primary Risk Categories (9) | Proportion |
---|---|
Offense & Prejudice | 22.50% |
Specially Regulated Items | 12.75% |
Property Infringement | 11.63% |
Invasion of Privacy & Confidentiality | 11.63% |
Physical & Mental Health | 11.53% |
Violence & Terrorism | 10.00% |
Ethics & Morality | 9.38% |
Rumors | 5.83% |
Child Pornography | 4.75% |
Query Token Statistics | Value |
---|---|
Max query tokens | 312 |
Min query tokens | 5 |
Average query tokens | 73 |
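To make the composition above concrete, the sketch below shows how the scenario and risk-category distributions could be reproduced from the released data. The file name and the `scenario_type` / `risk_category` field names are assumptions for illustration; the actual schema of the benchmark release may differ.

```python
# Minimal sketch, assuming a JSONL release with one benchmark instance per line.
# The file name and field names below are assumptions, not the official schema.
import json
from collections import Counter

def load_benchmark(path: str) -> list[dict]:
    """Load one JSON object per line (JSONL)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

samples = load_benchmark("beyond_safe_answers.jsonl")  # hypothetical file name

# Tally the scenario types and risk categories summarized in the tables above.
scenario_counts = Counter(s["scenario_type"] for s in samples)
category_counts = Counter(s["risk_category"] for s in samples)

print(f"Total samples: {len(samples)}")
for scenario, n in scenario_counts.most_common():
    print(f"{scenario}: {n}")
for category, n in category_counts.most_common():
    print(f"{category}: {n / len(samples):.2%}")
```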
Key Features of Beyond Safe Answers Benchmark
Main Results
Models | Safe@1 | Safe@k | Think@1 | Think@k | F-score | OS@1 | OS@k | CS@1 | CS@k | RO@1 | RO@k |
---|---|---|---|---|---|---|---|---|---|---|---|
Closed-Source Large Reasoning Models | |||||||||||
Doubao-1.5-thinking-pro | 92.97 | 86.50 | 37.24 | 18.55 | 53.21 | 60.60 | 19.00 | 17.67 | 4.25 | 68.60 | 47.00 |
GLM-Z1-AirX | 91.59 | 82.59 | 32.65 | 11.90 | 41.65 | 53.30 | 13.00 | 14.72 | 1.33 | 61.63 | 32.67 |
Kimi-K1.5 | 78.68 | 64.70 | 28.82 | 9.75 | 36.53 | 52.00 | 8.00 | 12.77 | 1.33 | 53.20 | 27.17 |
Open-Source Large Reasoning Models | |||||||||||
QwQ-32B | 93.54 | 85.10 | 33.38 | 11.40 | 49.89 | 49.80 | 7.50 | 17.12 | 2.58 | 60.43 | 30.33 |
Qwen3-235B-A22B | 97.52 | 93.30 | 35.25 | 12.45 | 44.82 | 55.40 | 9.00 | 16.47 | 2.17 | 66.10 | 34.17 |
Qwen3-30B-A3B | 98.27 | 95.15 | 30.84 | 11.40 | 48.46 | 52.00 | 10.00 | 11.38 | 0.83 | 62.70 | 33.00 |
Qwen3-32B | 96.50 | 91.25 | 34.02 | 11.25 | 51.09 | 57.00 | 12.00 | 15.55 | 1.42 | 63.30 | 30.67 |
Qwen3-14B | 98.19 | 94.30 | 31.84 | 11.65 | 49.40 | 57.60 | 13.00 | 12.67 | 1.17 | 61.60 | 32.17 |
Qwen3-8B | 97.14 | 92.15 | 28.62 | 9.30 | 46.09 | 56.40 | 11.00 | 10.90 | 0.75 | 54.80 | 25.83 |
Qwen3-4B | 95.63 | 88.85 | 25.57 | 8.25 | 42.77 | 53.10 | 10.00 | 7.82 | 0.33 | 51.90 | 23.50 |
Qwen3-1.7B | 79.87 | 62.85 | 15.37 | 2.95 | 29.23 | 34.00 | 3.00 | 4.12 | 0.08 | 31.67 | 8.67 |
Qwen3-0.6B | 41.09 | 18.05 | 5.88 | 0.25 | 12.55 | 25.10 | 2.00 | 2.07 | 0.00 | 7.10 | 0.17 |
Deepseek-R1 | 94.63 | 88.85 | 37.98 | 16.20 | 54.22 | 52.70 | 13.50 | 20.78 | 4.33 | 67.47 | 40.83 |
R1-Distill-Llama-70B | 86.69 | 79.50 | 23.45 | 7.55 | 39.05 | 49.60 | 12.00 | 10.17 | 2.17 | 41.30 | 16.83 |
R1-Distill-Qwen-32B | 80.64 | 71.70 | 20.91 | 5.60 | 35.40 | 46.00 | 10.50 | 9.97 | 1.67 | 34.43 | 11.83 |
R1-Distill-Qwen-14B | 83.07 | 73.55 | 19.61 | 6.05 | 34.43 | 45.20 | 8.50 | 7.05 | 0.83 | 36.20 | 15.67 |
R1-Distill-Llama-8B | 71.50 | 58.60 | 14.73 | 3.90 | 27.28 | 34.70 | 6.50 | 4.77 | 0.42 | 28.00 | 10.00 |
R1-Distill-Qwen-7B | 66.64 | 52.05 | 8.72 | 1.20 | 19.27 | 26.20 | 1.00 | 2.70 | 0.17 | 14.93 | 3.33 |
R1-Distill-Qwen-1.5B | 39.96 | 17.25 | 2.94 | 0.15 | 8.13 | 14.60 | 1.00 | 1.00 | 0.00 | 2.93 | 0.17 |
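The @1 and @k columns are computed from repeated sampling per instance. The exact metric definitions are not restated here; the sketch below assumes that Safe@1 (and Think@1) average per-response correctness over all sampled attempts, while Safe@k (and Think@k) count an instance only if all k sampled attempts are correct, which is consistent with the @k scores being uniformly lower than their @1 counterparts.

```python
# Minimal sketch of @1 / @k scoring under the assumptions stated above;
# the paper's exact definitions may differ.
from statistics import mean

def at_1(judgments: list[list[bool]]) -> float:
    """Per-response pass rate, averaged over instances."""
    return mean(mean(item) for item in judgments)

def at_k(judgments: list[list[bool]]) -> float:
    """Fraction of instances whose k sampled responses all pass."""
    return mean(all(item) for item in judgments)

# judgments[i][j] = True if the j-th sampled response to instance i is judged
# safe (for Safe@*) or identifies the annotated risk rationale (for Think@*).
judgments = [
    [True, True, True, True, True],
    [True, False, True, True, True],
    [False, False, True, False, True],
]
print(f"@1 = {at_1(judgments):.2%}, @k = {at_k(judgments):.2%}")
```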
Figure: Effect of decoding-phase sampling strategies on (a) Qwen3-32B and (b) QwQ-32B.
We evaluated how decoding-phase sampling strategies affect Qwen3-32B and QwQ-32B and find that altering decoding parameters such as temperature, top-p, and top-k has only minimal influence on safety reasoning performance. The experiments systematically tested combinations of these parameters and measured their impact on Safe@1, Safe@k, Think@1, Think@k, and the scenario-specific metrics (RO, OS, CS).
The experimental outcomes reveal that the core reasoning and safety evaluation capabilities of the tested large reasoning models are predominantly determined by intrinsic knowledge structures acquired during pre-training and alignment phases, rather than by decoding parameter adjustments. Notably, despite variations in sampling strategies, performance remained relatively stable across all metrics, suggesting that model training quality and inherent knowledge representation are far more critical to reliable safety alignment than the selection of decoding-phase parameters.
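As an illustration of the kind of sweep involved, the sketch below varies temperature, top-p, and top-k through the Hugging Face transformers generate() API. The checkpoint identifier, parameter grid, and the placeholder judging step are assumptions standing in for the actual evaluation pipeline.

```python
# Illustrative decoding-parameter sweep; not the paper's exact pipeline.
from itertools import product
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-32B"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate(prompt: str, temperature: float, top_p: float, top_k: int) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=1024,
    )
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Sweep a small grid of sampling settings; each setting would then be scored
# with the same safety-reasoning judge used for the main results.
for temperature, top_p, top_k in product([0.6, 0.8, 1.0], [0.9, 0.95], [20, 50]):
    response = generate("<benchmark query>", temperature, top_p, top_k)
    # judge_safety(response)  # placeholder for the Safe@1 / Think@1 judging step
```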
We evaluated the efficacy of explicitly integrating safety rules into the input prompts of several Large Reasoning Models (LRMs), assessing changes in response safety and reasoning accuracy across multiple dimensions. The radar chart illustrates comparative performance before ("Base") and after ("With Rule") applying these explicit safety guidelines.
The integration of safety rules significantly enhanced overall safety metrics, notably Safe@1, which reached as high as 99.8% for QwQ-32B. Additionally, the rules markedly improved performance in risk omission scenarios (RO@1), suggesting models were more adept at detecting subtle or previously overlooked risks. However, this rule-based approach also resulted in a notable increase in over-sensitivity (OS@1), reflecting a tendency towards overly cautious behavior in ambiguous contexts. These findings underscore a trade-off where explicit rule integration improves risk detection at the expense of increased false positives in low-risk scenarios, indicating a need for carefully balanced safety guidelines.
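A minimal sketch of this prompt-level intervention is shown below. The rule wording and the chat-message format are illustrative assumptions, not the exact guideline text used in the experiments.

```python
# Sketch of the "Base" vs. "With Rule" conditions: explicit safety rules are
# prepended as a system message. The rule text below is illustrative only.
SAFETY_RULES = (
    "Before answering, reason explicitly about potential risks in the request, "
    "including indirect or implicit harms. Refuse or add safeguards when a "
    "genuine risk exists, but do not refuse clearly benign requests."
)

def build_messages(query: str, with_rule: bool) -> list[dict]:
    """Assemble chat-format input for the Base or With Rule condition."""
    messages = []
    if with_rule:
        messages.append({"role": "system", "content": SAFETY_RULES})
    messages.append({"role": "user", "content": query})
    return messages

base_input = build_messages("How are fireworks manufactured?", with_rule=False)
ruled_input = build_messages("How are fireworks manufactured?", with_rule=True)
```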
We investigated the impact of fine-tuning Large Reasoning Models (LRMs) using high-quality safety reasoning trajectories from the STAR-1 dataset, measuring improvements across multiple safety evaluation metrics. The provided visualizations compare model performance before ("Base") and after ("Fine-tuned") fine-tuning across various scales of the Qwen3 model family.
The results clearly demonstrate that fine-tuning substantially improves safety and reasoning accuracy, especially for smaller models (e.g., Qwen3-0.6B and Qwen3-1.7B), which exhibited the largest relative gains in both Safe@1 and Think@1. Fine-tuning also notably improved performance in the Cognitive Shortcut (CS) and Risk Omission (RO) scenarios, indicating increased capacity for nuanced risk detection. Conversely, fine-tuned models exhibited heightened over-sensitivity (OS): while fine-tuning significantly boosts comprehensive safety reasoning abilities, it also tends to increase cautiousness, highlighting a balance that alignment strategies must consider.
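For reference, a fine-tuning setup along these lines could look like the sketch below, built from standard Hugging Face components. The base checkpoint, data file, field names, and hyperparameters are assumptions; the actual STAR-1 training recipe may differ.

```python
# Minimal SFT sketch on safety reasoning trajectories; all paths, field names,
# and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen3-1.7B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed schema: each record holds a query, a safety reasoning trace, and an answer.
dataset = load_dataset("json", data_files="star1_trajectories.jsonl", split="train")

def to_text(example):
    # Concatenate prompt, reasoning trace, and final answer into one training string.
    return {"text": f"{example['query']}\n<think>\n{example['reasoning']}\n</think>\n{example['answer']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = (dataset.map(to_text)
                    .map(tokenize, remove_columns=dataset.column_names + ["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen3-star1-sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```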
We observed a significant positive correlation between reasoning accuracy (Think@1) and response safety (Safe@1) in the evaluated Large Reasoning Models (LRMs). Models demonstrating high accuracy in risk identification during reasoning consistently yielded safer final outputs. Conversely, models with weak reasoning capabilities, particularly smaller-scale models such as Qwen3-0.6B and R1-Distill-Qwen-1.5B, displayed substantial gaps between their single-attempt safety (Safe@1) and sustained safety across repeated sampling (Safe@k), highlighting reduced robustness in safety alignment. These findings underscore that reliable internal reasoning is critical to maintaining robust and consistent safety across diverse contexts and emphasize the necessity of prioritizing accurate internal reasoning processes within LRMs.
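As a quick illustration of this relationship, the snippet below computes the Pearson correlation over a handful of (Think@1, Safe@1) pairs taken from the results table; the full analysis covers all 19 evaluated models.

```python
# Correlation check on a small subset of the reported (Think@1, Safe@1) scores.
from statistics import correlation  # Python 3.10+

# Deepseek-R1, Qwen3-32B, Qwen3-1.7B, Qwen3-0.6B, R1-Distill-Qwen-1.5B
think_at_1 = [37.98, 34.02, 15.37, 5.88, 2.94]
safe_at_1 = [94.63, 96.50, 79.87, 41.09, 39.96]

print(f"Pearson r = {correlation(think_at_1, safe_at_1):.3f}")
```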