Chinese SafetyQA

Abstract

Existing safety evaluations primarily assess response-level safety, leaving reasoning-level risks unmeasured. Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. We identify Superficial Safety Alignment (SSA): a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, creating a dangerous illusion of safety and rendering systems prone to catastrophic failure under minor perturbations. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA), a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenarios and spanning nine risk categories, each meticulously annotated with risk rationales. We evaluate 23 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with the best model reaching 54.57% accuracy on risk-rationale identification. Current benchmarks are largely blind to this latent risk; to our knowledge, BSA is the first benchmark designed to systematically diagnose SSA. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work aims for verifiably robust safety reasoning in LRMs, moving beyond mere superficial compliance and enabling practitioners to evaluate and improve safety-reasoning fidelity with measurable evidence.

Total Samples
Over-Sensitivity	200
Risk Omission	600
Cognitive Shortcut	1200
Total	2000

Primary Risk Categories (9)
Offense & Prejudice	22.50%
Specially Regulated Items	12.75%
Property Infringement	11.63%
Invasion of Privacy & Confidentiality	11.63%
Physical & Mental Health	11.53%
Violence & Terrorism	10.00%
Ethics & Morality	9.38%
Rumors	5.83%
Child Pornography	4.75%

Data Tokens
Max query tokens	312
Min query tokens	5
Average query tokens	73

Data Construction Pipline

An overview of the data construction, filtering, verification, and quality control processes of Chinese SafetyQA.

Key Features of Beyond Safe Answers Benchmark

Detailed Risk Rationales: Each instance is accompanied by explicit annotations that detail the underlying risks, enabling precise assessment of a model's reasoning depth
Comprehensive Coverage: Contains over 2,000 carefully curated samples spanning three distinct SSA scenarios—Over Sensitivity, Cognitive Shortcut, and Risk Omission—across 9 primary risk categories, ensuring diverse and extensive evaluation.
Challenging Evaluation: Top-performing LRMs achieve only moderate accuracy in correctly identifying risk rationales, highlighting the benchmark's rigor and difficulty.
Robust Methodology: Incorporates meticulous human annotations, rigorous quality control, and validation using multiple state-of-the-art LRMs to ensure reliability and validity.
Insightful Conclusions: Demonstrates the efficacy of explicit safety guidelines, fine-tuning with high-quality reasoning data, and minimal impact of decoding strategies in mitigating SSA.

Main Results

BeyondSafeAnswer Bench is Challenging. All tested LRMs exhibit significant Superficial Safety Alignment (SSA). Although models achieve high surface-level safety scores (over 90%), their internal risk reasoning accuracy is low, with even top-performing models scoring below 40% on accurately identifying risks.
Reasoning Accuracy Correlates with Safety. Higher reasoning accuracy strongly predicts safer outputs. Models proficient at internally identifying risks consistently generate safer responses, whereas models with poor internal reasoning demonstrate inconsistent safety behaviors.
Larger Models Exhibit Superior Safety Performance. Performance analysis indicates that larger models generally achieve better results across safety metrics, particularly in scenarios involving subtle or hidden risks (Risk Omission). The enhanced memory and knowledge retrieval capabilities of larger LRMs significantly improve their internal risk assessment.
Explicit Safety Guidelines Enhance Model Safety. Integrating explicit safety rules into prompts significantly improves both surface-level response safety and internal reasoning accuracy, particularly addressing the issue of Risk Omission. However, this approach may increase model hypersensitivity in certain scenarios.
Fine-tuning with High-quality Reasoning Data Improves Performance. Fine-tuning LRMs with curated reasoning trajectories enhances overall safety and internal risk reasoning accuracy. Smaller models see greater relative improvements, whereas larger models, already performing well, achieve more modest gains.

LeaderBoard

Models	Safe@1	Safe@k	Think@1	Think@k	F-score	OS@1	OS@k	CS@1	CS@k	RO@1	RO@k
Closed-Source Large Language Models
Doubao-1.5-thinking-pro	92.97	86.50	37.24	18.55	53.21	60.60	19.00	17.67	4.25	68.60	47.00
Gemini-2.5-Flash	95.38	90.75	40.46	19.7	56.53	60	19.5	22.7	6.67	69.47	45.83
Gemini-2.5-Pro	94.51	88.2	38.02	18.15	54.23	68.6	26.5	19.55	5.67	64.77	40.33
Claude-3.7-Sonnet	99.28	98.05	54.57	30.7	68.92	53.4	9.5	40.05	18.08	84	63
Claude-4-Sonnet	98.98	96.75	48.89	25.55	64.37	58.3	14	36.35	16.58	70.83	47.33
GLM-Z1-AirX	91.59	82.59	32.65	11.90	41.65	53.30	13.00	14.72	1.33	61.63	32.67
Kimi-K1.5	78.68	64.70	28.82	9.75	36.53	52.00	8.00	12.77	1.33	53.20	27.17
Open-Source Large Language Models
QwQ-32B	93.54	85.10	33.38	11.40	49.89	49.80	7.50	17.12	2.58	60.43	30.33
Qwen3-235B-A22B	97.52	93.30	35.25	12.45	44.82	55.40	9.00	16.47	2.17	66.10	34.17
Qwen3-30B-A3B	98.27	95.15	30.84	11.40	48.46	52.00	10.00	11.38	0.83	62.70	33.00
Qwen3-32B	96.50	91.25	34.02	11.25	51.09	57.00	12.00	15.55	1.42	63.30	30.67
Qwen3-14B	98.19	94.30	31.84	11.65	49.40	57.60	13.00	12.67	1.17	61.60	32.17
Qwen3-8B	97.14	92.15	28.62	9.30	46.09	56.40	11.00	10.90	0.75	54.80	25.83
Qwen3-4B	95.63	88.85	25.57	8.25	42.77	53.10	10.00	7.82	0.33	51.90	23.50
Qwen3-1.7B	79.87	62.85	15.37	2.95	29.23	34.00	3.00	4.12	0.08	31.67	8.67
Qwen3-0.6B	41.09	18.05	5.88	0.25	12.55	25.10	2.00	2.07	0.00	7.10	0.17
Deepseek-R1	94.63	88.85	37.98	16.20	54.22	52.70	13.50	20.78	4.33	67.47	40.83
R1-Distill-Llama-70B	86.69	79.50	23.45	7.55	39.05	49.60	12.00	10.17	2.17	41.30	16.83
R1-Distill-Qwen-32B	80.64	71.70	20.91	5.60	35.40	46.00	10.50	9.97	1.67	34.43	11.83
R1-Distill-Qwen-14B	83.07	73.55	19.61	6.05	34.43	45.20	8.50	7.05	0.83	36.20	15.67
R1-Distill-Llama-8B	71.50	58.60	14.73	3.90	27.28	34.70	6.50	4.77	0.42	28.00	10.00
R1-Distill-Qwen-7B	66.64	52.05	8.72	1.20	19.27	26.20	1.00	2.70	0.17	14.93	3.33
R1-Distill-Qwen-1.5B	39.96	17.25	2.94	0.15	8.13	14.60	1.00	1.00	0.00	2.93	0.17

Decoding Sampling Strategies' Impact on Large Reasoning Models

(a) Qwen3-32B

(b) QwQ-32B

Through evaluating the effects of different decoding-phase sampling strategies on the performance of Qwen3-32B and QwQ-32B models, we identify that altering decoding parameters such as temperature, top-p, and top-k minimally influences the models' performance in safety reasoning tasks. The experiments systematically tested various combinations of these parameters, assessing their impact on metrics including Safe@1, Safe@k, Think@1, Think@k, and scenario-specific metrics (RO, OS, CS).

The experimental outcomes reveal that the core reasoning and safety evaluation capabilities of the tested large reasoning models are predominantly determined by intrinsic knowledge structures acquired during pre-training and alignment phases, rather than by decoding parameter adjustments. Notably, despite variations in sampling strategies, performance remained relatively stable across all metrics, suggesting that model training quality and inherent knowledge representation are far more critical to reliable safety alignment than the selection of decoding-phase parameters.

Impact of Safety Rules Integration on LRMs

We evaluated the efficacy of explicitly integrating safety rules into the input prompts of several Large Reasoning Models (LRMs), assessing changes in response safety and reasoning accuracy across multiple dimensions. The radar chart illustrates comparative performance before ("Base") and after ("With Rule") applying these explicit safety guidelines.

The integration of safety rules significantly enhanced overall safety metrics, notably Safe@1, which reached as high as 99.8% for QwQ-32B. Additionally, the rules markedly improved performance in risk omission scenarios (RO@1), suggesting models were more adept at detecting subtle or previously overlooked risks. However, this rule-based approach also resulted in a notable increase in over-sensitivity (OS@1), reflecting a tendency towards overly cautious behavior in ambiguous contexts. These findings underscore a trade-off where explicit rule integration improves risk detection at the expense of increased false positives in low-risk scenarios, indicating a need for carefully balanced safety guidelines.

:

Effect of Fine-tuning with Superior Reasoning Data

We investigated the impact of fine-tuning Large Reasoning Models (LRMs) using high-quality safety reasoning trajectories from the STAR-1 dataset, measuring improvements across multiple safety evaluation metrics. The provided visualizations compare model performances before ("Base") and after ("Fine-tuned") fine-tuning across various scales of the Qwen3 model family.

The results clearly demonstrate that fine-tuning substantially improves safety and reasoning accuracy metrics, especially for smaller models (e.g., Qwen3-0.6B and Qwen3-1.7B), which exhibited the largest relative gains in both Safe@1 and Think@1 metrics. Fine-tuning also notably enhanced model performance in scenarios involving Cognitive Shortcut (CS) and Risk Omission (RO), highlighting increased capacity for nuanced risk detection. Conversely, fine-tuned models exhibited heightened over-sensitivity (OS), indicating that fine-tuning, while significantly boosting comprehensive safety reasoning abilities, also tends to increase cautiousness, thus highlighting an essential balance to be considered in model alignment strategies.

Positive Correlation Between Reasoning Accuracy and Response Safety

We observed a significant positive correlation between reasoning accuracy (Think@1) and response safety (Safe@1) in the evaluated Large Reasoning Models (LRMs). Models demonstrating high accuracy in risk identification during reasoning processes consistently yielded safer final outputs. Conversely, models exhibiting weak reasoning capabilities, particularly smaller-scale models like Qwen3-0.6B and R1-Distill-Qwen-1.5B, displayed substantial inconsistencies between their initial safety (Safe@1) and sustained safety performance (Safe@k), highlighting reduced robustness in safety alignment. These findings underscore that reliable internal reasoning is critical to maintaining robust and consistent safety across diverse contexts, and emphasizes the necessity of prioritizing accurate internal reasoning processes within LRMs.

BSA Examples

BibTeX

@misc{zheng2025safeanswersbenchmarkevaluating,
      title={Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models}, 
      author={Baihui Zheng and Boren Zheng and Kerui Cao and Yingshui Tan and Zhendong Liu and Weixun Wang and Jiaheng Liu and Jian Yang and Wenbo Su and Xiaoyong Zhu and Bo Zheng and Kaifu Zhang},
      year={2025},
      eprint={2505.19690},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.19690}, 
}

Beyond Safe Answers:A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models