Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Yancheng He*, Shilong Li*, Jiaheng Liu*†, Yingshui Tan, Weixun Wang, Hui Huang,
Chengwei Hu, Boren Zheng, Zhuoran Lin, Xuepeng Liu, Dekai Sun, Shirong Lin,
Zhicheng Zheng, Xiaoyong Zhu, Wenbo Su, Bo Zheng
Taobao & Tmall Group of Alibaba

*Indicates Equal Contribution

†Corresponding Author

Abstract

In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality ability of language models to answer short questions. Chinese SimpleQA features five main properties: (i) it is designed for the Chinese language; (ii) it is diverse, spanning six major topics with 99 fine-grained subtopics in total; (iii) it undergoes a comprehensive quality control process to ensure high-quality questions and answers; (iv) its reference answers are static and do not change over time; and (v) it is easy to evaluate, since the questions and answers are short and can be graded via the OpenAI API. Using Chinese SimpleQA, we conduct an exhaustive evaluation of the factuality abilities of existing LLMs. We hope that Chinese SimpleQA helps developers better understand the Chinese factuality capabilities of their models and supports the advancement of foundation models.

Data Construction Pipeline

An overview of the data construction, filtering, verification, and quality control processes of Chinese SimpleQA.

Chinese SimpleQA's Features

  • Chinese: Our Chinese SimpleQA focuses on the Chinese language, which provides a comprehensive evaluation of the factuality abilities of existing LLMs in Chinese.
  • Diverse: Chinese SimpleQA covers 6 topics (i.e., "Chinese Culture", "Humanities", "Engineering, Technology, and Applied Sciences", "Life, Art, and Culture", "Society", and "Natural Science"), and these topics include 99 fine-grained subtopics in total, which demonstrates the diversity of our Chinese SimpleQA.
  • High-quality: We conduct a comprehensive and rigorous quality control process to ensure the quality and accuracy of our Chinese SimpleQA.
  • Static: Following SimpleQA, to preserve the evergreen property of Chinese SimpleQA, all reference answers will not change over time.
  • Easy-to-evaluate: Following SimpleQA, since the questions and answers are very short, the grading procedure is fast to run via existing LLMs (e.g., the OpenAI API); a minimal grading sketch follows this list.
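
Because both the questions and the reference answers are only a few words long, grading can be delegated to an LLM judge. The snippet below is a minimal sketch of such a grading call through the OpenAI API; the prompt wording, judge model name, and function names are illustrative assumptions rather than the official evaluation script.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt (not the official grading template).
GRADER_PROMPT = """You are grading a short factual answer.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""

def grade(question: str, reference: str, prediction: str, judge: str = "gpt-4o") -> str:
    """Ask an LLM judge to label a single prediction."""
    resp = client.chat.completions.create(
        model=judge,
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
    )
    return resp.choices[0].message.content.strip()
```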

Key Observations

  • Chinese SimpleQA is challenging. Only o1-preview and Doubao-pro-32k achieve a passing score (63.8% and 61.9% on the correct metric, respectively), and many closed-source and open-source LLMs still have substantial room for improvement.
  • Larger models lead to better results. Based on the results of the Qwen2.5, InternLM, and Yi-1.5 series, we observe that larger models consistently achieve better performance.
  • Larger models are more calibrated. We observe that o1-preview is more calibrated than o1-mini, and GPT-4o is more calibrated than GPT-4o-mini.
  • RAG matters. Introducing a RAG strategy into existing LLMs narrows the performance gaps between them considerably. For example, the F-score gap between GPT-4o and Qwen2.5-3B shrinks from 42.4% to 9.3% when RAG is used.
  • Alignment tax exists. Existing alignment or post-training strategies usually decrease the factuality of language models.
  • Rankings on SimpleQA and Chinese SimpleQA differ. Several LLMs focused on Chinese (e.g., Doubao-pro-32k and GLM-4-Plus) perform close to the top-performing o1-preview. In particular, on the “Chinese Culture” topic, these Chinese community LLMs are significantly better than the GPT and o1 series models.

Leaderboard

Models | CO | NA | IN | CGA | F-score | CC | HU | ETAS | LAC | SO | NS
Closed-Source Large Language Models
o1-preview | 63.8 | 12.2 | 24.0 | 72.7 | 67.9 | 45.7 | 69.8 | 72.4 | 65.0 | 73.5 | 72.3
Doubao-pro-32k | 61.9 | 10.3 | 27.8 | 69.1 | 65.3 | 61.8 | 69.3 | 69.0 | 56.1 | 64.2 | 70.4
GLM-4-Plus | 58.7 | 7.4 | 33.9 | 63.4 | 60.9 | 56.5 | 64.1 | 64.9 | 50.7 | 66.6 | 62.8
GPT-4o | 59.3 | 1.4 | 39.3 | 60.1 | 59.7 | 39.4 | 64.0 | 65.1 | 53.3 | 68.6 | 62.0
Qwen-Max | 54.1 | 11.3 | 34.6 | 61.0 | 57.4 | 47.8 | 59.9 | 63.5 | 49.9 | 61.2 | 59.3
Gemini-1.5-pro | 54.4 | 8.0 | 37.6 | 59.1 | 56.7 | 41.4 | 59.1 | 60.8 | 52.2 | 56.3 | 64.3
DeepSeek-V2.5 | 54.1 | 5.9 | 40.0 | 57.5 | 55.7 | 50.4 | 57.6 | 58.8 | 50.1 | 59.4 | 56.9
Claude-3.5-Sonnet | 46.2 | 27.4 | 26.4 | 63.6 | 53.5 | 28.7 | 61.3 | 60.4 | 42.2 | 59.8 | 57.7
Yi-Large | 47.3 | 16.4 | 36.3 | 56.6 | 51.5 | 41.1 | 56.5 | 55.1 | 41.7 | 57.6 | 53.8
moonshot-v1-8k | 48.7 | 5.4 | 45.9 | 51.5 | 50.1 | 49.8 | 54.1 | 56.8 | 41.4 | 53.0 | 46.6
GPT-4-turbo | 45.6 | 14.2 | 40.2 | 53.1 | 49.1 | 24.2 | 55.2 | 58.9 | 43.9 | 52.5 | 50.8
GPT-4 | 45.4 | 8.4 | 46.2 | 49.6 | 47.4 | 25.2 | 54.0 | 52.8 | 41.8 | 52.8 | 50.6
Baichuan3-turbo | 45.2 | 9.0 | 45.8 | 49.6 | 47.3 | 32.3 | 52.5 | 54.0 | 35.4 | 54.6 | 50.9
o1-mini | 39.5 | 20.6 | 39.9 | 49.7 | 44.1 | 21.3 | 49.2 | 55.9 | 33.8 | 48.8 | 46.8
Doubao-lite-4k | 36.7 | 31.2 | 32.1 | 53.3 | 43.4 | 40.2 | 44.8 | 51.0 | 31.1 | 41.4 | 50.4
GPT-4o mini | 37.6 | 0.9 | 61.5 | 37.9 | 37.8 | 19.0 | 42.4 | 46.4 | 31.0 | 42.2 | 39.8
GPT-3.5 | 29.7 | 2.9 | 67.4 | 30.6 | 30.1 | 13.3 | 35.8 | 35.2 | 25.6 | 32.7 | 31.7
Open-Source Large Language Models
Qwen2.5-72B | 48.4 | 7.1 | 44.5 | 52.1 | 50.2 | 36.3 | 56.1 | 57.9 | 37.1 | 53.3 | 56.4
Qwen2.5-32B | 38.8 | 11.1 | 50.1 | 43.6 | 41.1 | 33.7 | 45.8 | 48.7 | 27.3 | 44.7 | 44.9
Qwen2.5-14B | 35.4 | 9.6 | 55.0 | 39.2 | 37.2 | 30.2 | 41.8 | 46.1 | 24.1 | 38.8 | 41.0
Qwen2.5-7B | 26.6 | 9.9 | 63.5 | 29.5 | 27.9 | 20.1 | 32.7 | 33.8 | 18.0 | 28.6 | 32.0
Qwen2.5-3B | 16.2 | 12.8 | 71.0 | 18.6 | 17.3 | 13.4 | 17.9 | 26.1 | 9.3 | 15.6 | 20.8
Qwen2.5-1.5B | 11.1 | 14.6 | 74.3 | 13.1 | 12.0 | 11.0 | 11.3 | 18.7 | 6.7 | 12.2 | 12.9
GLM4-9B | 25.9 | 12.5 | 61.6 | 29.6 | 27.6 | 28.8 | 32.1 | 32.0 | 17.6 | 28.9 | 27.8
ChatGLM3-6B | 11.2 | 13.6 | 75.2 | 12.9 | 12.0 | 12.1 | 13.8 | 12.4 | 8.8 | 13.4 | 11.8
InternLM2.5-20B | 31.5 | 7.7 | 60.8 | 34.1 | 32.8 | 32.0 | 37.1 | 37.7 | 21.2 | 35.7 | 34.3
InternLM2.5-7B | 24.7 | 7.5 | 67.8 | 26.7 | 25.7 | 25.5 | 29.4 | 31.0 | 16.4 | 26.9 | 25.8
InternLM2.5-1.8B | 5.3 | 31.1 | 63.6 | 7.6 | 6.2 | 6.1 | 8.7 | 7.2 | 3.3 | 4.5 | 7.4
Yi-1.5-34B | 30.9 | 5.8 | 63.3 | 32.8 | 31.8 | 28.2 | 36.9 | 36.8 | 24.4 | 32.8 | 31.4
Yi-1.5-9B | 18.2 | 2.9 | 78.9 | 18.7 | 18.4 | 17.2 | 20.2 | 24.3 | 10.2 | 20.1 | 19.8
Yi-1.5-6B | 15.9 | 2.8 | 81.3 | 16.3 | 16.1 | 14.2 | 17.9 | 21.3 | 10.3 | 16.8 | 16.5
LLaMA3.1-70B | 38.3 | 9.4 | 52.3 | 42.3 | 40.2 | 22.9 | 47.2 | 49.3 | 34.5 | 49.6 | 40.4
LLaMA3.1-8B | 16.9 | 8.8 | 74.3 | 18.6 | 17.7 | 8.5 | 20.7 | 23.4 | 9.7 | 20.5 | 20.7
DeepSeek-67B | 43.5 | 14.8 | 41.7 | 51.1 | 47.0 | 34.3 | 54.5 | 50.3 | 42.3 | 49.0 | 46.2
DeepSeek-V2-Lite-Chat | 33.7 | 12.8 | 53.5 | 38.6 | 36.0 | 35.3 | 38.5 | 41.7 | 32.2 | 37.5 | 31.2
DeepSeek-7B | 23.2 | 13.2 | 63.6 | 26.7 | 24.8 | 24.5 | 27.2 | 28.9 | 20.6 | 27.0 | 21.5
Baichuan2-13B | 19.1 | 24.9 | 56.0 | 25.4 | 21.8 | 24.0 | 25.8 | 23.3 | 16.8 | 23.0 | 18.7
Baichuan2-7B | 12.5 | 21.8 | 65.7 | 16.0 | 14.0 | 14.6 | 16.1 | 15.4 | 11.1 | 13.8 | 13.3
Mixtral-8x22B-Instruct-v0.1 | 27.3 | 2.2 | 70.5 | 27.9 | 27.6 | 10.6 | 32.3 | 36.0 | 21.0 | 34.1 | 26.9
Mixtral-8x7B-Instruct-v0.1 | 20.4 | 7.2 | 72.4 | 22.0 | 21.2 | 5.2 | 26.5 | 29.0 | 13.0 | 25.0 | 23.3
Mistral-7B-Instruct-v0.2 | 15.0 | 8.8 | 76.2 | 16.4 | 15.6 | 4.5 | 18.2 | 22.2 | 9.5 | 21.4 | 15.7

Results of different models on Chinese SimpleQA. For metrics, CO, NA, IN, and CGA denote "Correct", "Not attempted", "Incorrect", and "Correct given attempted", respectively. For topics, CC, HU, ETAS, LAC, SO, and NS represent "Chinese Culture", "Humanities", "Engineering, Technology, and Applied Sciences", "Life, Art, and Culture", "Society", and "Natural Science", respectively. Following SimpleQA, the F-score is the harmonic mean of "Correct" and "Correct given attempted".
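
The columns are related in a simple way: CGA = CO / (CO + IN), and the F-score is the harmonic mean of CO and CGA. For example, for o1-preview, 63.8 / (63.8 + 24.0) ≈ 72.7, and the harmonic mean of 63.8 and 72.7 ≈ 67.9, matching the table. A small aggregation sketch, assuming per-question judge labels of CORRECT / INCORRECT / NOT_ATTEMPTED (variable names are our own):

```python
from collections import Counter

def simpleqa_metrics(labels: list[str]) -> dict[str, float]:
    """Aggregate per-question judge labels into the table's metrics (in %)."""
    n = len(labels)
    counts = Counter(labels)
    co = counts["CORRECT"] / n          # Correct (CO)
    na = counts["NOT_ATTEMPTED"] / n    # Not attempted (NA)
    inc = counts["INCORRECT"] / n       # Incorrect (IN)
    cga = co / (co + inc) if (co + inc) > 0 else 0.0          # Correct given attempted
    f = 2 * co * cga / (co + cga) if (co + cga) > 0 else 0.0  # harmonic mean of CO and CGA
    return {"CO": 100 * co, "NA": 100 * na, "IN": 100 * inc,
            "CGA": 100 * cga, "F-score": 100 * f}
```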

Rankings on Chinese SimpleQA vs. SimpleQA

Model rankings differ significantly between the SimpleQA and Chinese SimpleQA benchmarks. For example, Doubao-pro-32k rises from 12th to 2nd on the Chinese version, while GPT-4 drops from 3rd to 9th. This highlights the importance of evaluating models in multilingual environments. Notably, o1-preview holds the top position on both datasets, and many models developed by the Chinese community perform better on Chinese SimpleQA than on SimpleQA.

Detailed Results on Subtopics

The benchmark covers 99 subtopics to assess models' knowledge across various fields. Overall, o1-preview performs the most comprehensively, followed by Doubao, while Moonshot is the weakest. There is a noticeable gap between Chinese community models and the o1 series in Computer Science and Medicine, but a smaller one in Education and Economics; notably, some Chinese models outperform o1-preview in Education. Moonshot struggles in Mathematics, Law, and Entertainment, and Baichuan also underperforms in Entertainment. Yi-Large excels in Education, while o1 maintains strong performance across the other domains. Evaluating models across diverse domains helps users choose the best fit for their needs.

Calibration and Test-Time Compute

Calibration Analysis

We analyze the calibration of different LLMs on Chinese SimpleQA by instructing models to provide a confidence level from 0 to 100 alongside each answer. Ideally, stated confidence should match actual accuracy. The results show that GPT-4o is better calibrated than GPT-4o-mini, and o1-preview is better calibrated than o1-mini. Within the Qwen2.5 series, larger models show better calibration. All models tend to be overconfident, especially when their stated confidence is above 50.
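
One way to reproduce such a reliability analysis from graded outputs is to bucket the stated confidences and compare each bucket's mean confidence with its empirical accuracy; buckets where confidence exceeds accuracy indicate overconfidence. A sketch under that setup (the binning choices and names below are our own, not the paper's code):

```python
import numpy as np

def reliability_table(confidences, is_correct, n_bins: int = 10):
    """Bucket stated confidences (0-100) and report, per bucket, the number of
    questions, the mean stated confidence, and the empirical accuracy (%)."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 100.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & ((conf < hi) if hi < 100.0 else (conf <= hi))
        if mask.any():
            rows.append({
                "bin": f"{lo:.0f}-{hi:.0f}",
                "count": int(mask.sum()),
                "mean_confidence": float(conf[mask].mean()),
                "accuracy_pct": float(100.0 * corr[mask].mean()),
            })
    return rows
```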

Test-Time Compute Analysis

We also evaluate the relationship between increased test-time compute and accuracy. On random samples from Chinese SimpleQA, response accuracy improves as the number of inference attempts increases and eventually reaches a ceiling. This aligns with the dataset's purpose of probing model knowledge boundaries.
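
The section does not spell out the exact aggregation rule, so purely as an illustration, the sketch below estimates accuracy when each question is answered by a majority vote over n sampled generations (one common test-time-compute setup); the function names and the `grade_fn` callback are assumptions, not the paper's procedure:

```python
import random
from collections import Counter

def majority_vote_accuracy(samples_per_question, references, n, grade_fn, trials=20):
    """Estimate accuracy under majority voting over n sampled answers per question.
    Averaging over several random draws of n samples reduces estimation noise."""
    correct, total = 0, 0
    for samples, ref in zip(samples_per_question, references):
        for _ in range(trials):
            draw = random.sample(samples, k=min(n, len(samples)))
            answer = Counter(draw).most_common(1)[0][0]  # most frequent answer wins
            correct += int(grade_fn(answer, ref))        # grade_fn returns True/False
            total += 1
    return correct / total
```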

Analysis On The Alignment Tax

Prior studies (OpenAI, 2023; Song et al., 2023) have found that alignment can decrease the abilities of language models, a phenomenon known as the "alignment tax". To illustrate the effect of alignment on factuality, we conduct a comparative performance analysis between pre-trained models and aligned models trained with Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF). As illustrated in Figure 6, different models exhibit varying trends after post-training, but most show a significant decline. Among these, the Baichuan2 series shows the largest decreases, with Baichuan2-7B and Baichuan2-13B suffering F-score reductions of 47% and 28%, respectively. This indicates that the alignment training of most current LLMs still tends to introduce knowledge hallucinations, which further underscores the necessity of our dataset.

Analysis On The Effect Of RAG

In this study, we explore the effectiveness of the Retrieval-Augmented Generation (RAG) strategy in enhancing the factual accuracy of large language models (LLMs) on the Chinese SimpleQA dataset. Specifically, we reproduce a RAG system based on LlamaIndex (Liu, 2022), incorporating Google search APIs. As illustrated in Figure 5, all models demonstrate a substantial improvement in accuracy with RAG; for example, the performance of Qwen2.5-3B improves more than threefold. Notably, nearly all models with RAG outperform the native GPT-4o model. Meanwhile, RAG also leads to a marked reduction in performance disparities among models: the F-score difference between Qwen2.5-3B with RAG and Qwen2.5-72B with RAG is only 6.9%. This suggests that RAG greatly narrows the performance gaps between models, enabling even smaller models to achieve high performance. Overall, RAG serves as an effective shortcut for enhancing the factuality of LLMs.
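
The gist of the pipeline is retrieve-then-answer: fetch web evidence for the question and prepend it to the prompt before querying the model. The stripped-down sketch below conveys only that structure; the `web_search` helper and the prompt are placeholders, and the actual system in the paper is built on LlamaIndex with Google search APIs rather than this code.

```python
from openai import OpenAI

client = OpenAI()

def web_search(query: str, k: int = 5) -> list[str]:
    """Placeholder retriever: return the top-k text snippets for the query.
    The system described in the paper uses LlamaIndex with Google search APIs."""
    raise NotImplementedError

def rag_answer(question: str, model: str = "gpt-4o") -> str:
    """Retrieve evidence for the question and prepend it to the prompt."""
    snippets = web_search(question)
    evidence = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question concisely, using the evidence below when relevant.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```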

Dataset Examples

Category | Question | Reference Answer | Model Output
中华文化-中医 | 创立了调气活血的“衡法”治则的是哪一位中医学家? | 颜德馨 | GPT-4o: 创立调气活血“衡法”治则的中医学家是施今墨
中华文化-民俗 | 西迁节主要是哪个少数民族的节日? | 锡伯族 | GPT-4o: 西迁节主要是裕固族的节日
自然与自然科学-资讯科学 | 《中国图书馆分类法》第五版中规定U6表示哪一类? | 水路运输 | GPT-4o: 根据《中国图书馆分类法》第五版,"U6"用于表示公路运输
自然与自然科学-数学 | 圆周率π小数点后第15位的数字是什么? | 3 | GPT-4o: 圆周率π的小数点后第15位是数字9
工程、技术与应用科学-计算机科学 | 1.7.0版本的pytorch可以兼容的最新cuda版本是多少? | 11.0 | Qwen2-72b-instruct: PyTorch 1.7.0版本主要与CUDA 10.2和CUDA 11.1兼容

This table shows a few examples from the Chinese SimpleQA dataset. More samples are available in the complete dataset.
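
For illustration, one convenient way to store and iterate over such items is a JSON-lines file with one question per record; the field names below are our assumptions and may differ from the official release format.

```python
import json

# Hypothetical record layout (field names are assumptions, not the official schema).
records = [
    {"category": "中华文化-中医",
     "question": "创立了调气活血的“衡法”治则的是哪一位中医学家?",
     "answer": "颜德馨"},
]

with open("chinese_simpleqa_sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

with open("chinese_simpleqa_sample.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        print(item["category"], item["question"], item["answer"])
```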

BibTeX

@misc{he2024chinesesimpleqachinesefactuality,
      title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models}, 
      author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},
      year={2024},
      eprint={2411.07140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07140}, 
}