In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties: (i) it is Chinese, targeting the Chinese language specifically; (ii) it is diverse, spanning six major topics and 99 subtopics; (iii) it is high-quality, built with a comprehensive quality-control process for both questions and answers; (iv) it is static, with reference answers that do not change over time; and (v) it is easy to evaluate, since questions and answers are short and grading is done via the OpenAI API. Using Chinese SimpleQA, we conduct an exhaustive evaluation of the factuality of current LLMs. We hope that Chinese SimpleQA helps developers better understand the Chinese factuality capabilities of their models and supports the advancement of foundation models.
Chinese SimpleQA's Features
Key Observations
Models | CO | NA | IN | CGA | F-score | CC | HU | ETAS | LAC | SO | NS |
---|---|---|---|---|---|---|---|---|---|---|---|
Closed-Source Large Language Models | |||||||||||
o1-preview | 63.8 | 12.2 | 24.0 | 72.7 | 67.9 | 45.7 | 69.8 | 72.4 | 65.0 | 73.5 | 72.3 |
Doubao-pro-32k | 61.9 | 10.3 | 27.8 | 69.1 | 65.3 | 61.8 | 69.3 | 69.0 | 56.1 | 64.2 | 70.4 |
GLM-4-Plus | 58.7 | 7.4 | 33.9 | 63.4 | 60.9 | 56.5 | 64.1 | 64.9 | 50.7 | 66.6 | 62.8 |
GPT-4o | 59.3 | 1.4 | 39.3 | 60.1 | 59.7 | 39.4 | 64.0 | 65.1 | 53.3 | 68.6 | 62.0 |
Qwen-Max | 54.1 | 11.3 | 34.6 | 61.0 | 57.4 | 47.8 | 59.9 | 63.5 | 49.9 | 61.2 | 59.3 |
Gemini-1.5-pro | 54.4 | 8.0 | 37.6 | 59.1 | 56.7 | 41.4 | 59.1 | 60.8 | 52.2 | 56.3 | 64.3 |
DeepSeek-V2.5 | 54.1 | 5.9 | 40.0 | 57.5 | 55.7 | 50.4 | 57.6 | 58.8 | 50.1 | 59.4 | 56.9 |
Claude-3.5-Sonnet | 46.2 | 27.4 | 26.4 | 63.6 | 53.5 | 28.7 | 61.3 | 60.4 | 42.2 | 59.8 | 57.7 |
Yi-Large | 47.3 | 16.4 | 36.3 | 56.6 | 51.5 | 41.1 | 56.5 | 55.1 | 41.7 | 57.6 | 53.8 |
moonshot-v1-8k | 48.7 | 5.4 | 45.9 | 51.5 | 50.1 | 49.8 | 54.1 | 56.8 | 41.4 | 53.0 | 46.6 |
GPT-4-turbo | 45.6 | 14.2 | 40.2 | 53.1 | 49.1 | 24.2 | 55.2 | 58.9 | 43.9 | 52.5 | 50.8 |
GPT-4 | 45.4 | 8.4 | 46.2 | 49.6 | 47.4 | 25.2 | 54.0 | 52.8 | 41.8 | 52.8 | 50.6 |
Baichuan3-turbo | 45.2 | 9.0 | 45.8 | 49.6 | 47.3 | 32.3 | 52.5 | 54.0 | 35.4 | 54.6 | 50.9 |
o1-mini | 39.5 | 20.6 | 39.9 | 49.7 | 44.1 | 21.3 | 49.2 | 55.9 | 33.8 | 48.8 | 46.8 |
Doubao-lite-4k | 36.7 | 31.2 | 32.1 | 53.3 | 43.4 | 40.2 | 44.8 | 51.0 | 31.1 | 41.4 | 50.4 |
GPT-4o mini | 37.6 | 0.9 | 61.5 | 37.9 | 37.8 | 19.0 | 42.4 | 46.4 | 31.0 | 42.2 | 39.8 |
GPT-3.5 | 29.7 | 2.9 | 67.4 | 30.6 | 30.1 | 13.3 | 35.8 | 35.2 | 25.6 | 32.7 | 31.7 |
Open-Source Large Language Models | |||||||||||
Qwen2.5-72B | 48.4 | 7.1 | 44.5 | 52.1 | 50.2 | 36.3 | 56.1 | 57.9 | 37.1 | 53.3 | 56.4 |
Qwen2.5-32B | 38.8 | 11.1 | 50.1 | 43.6 | 41.1 | 33.7 | 45.8 | 48.7 | 27.3 | 44.7 | 44.9 |
Qwen2.5-14B | 35.4 | 9.6 | 55.0 | 39.2 | 37.2 | 30.2 | 41.8 | 46.1 | 24.1 | 38.8 | 41.0 |
Qwen2.5-7B | 26.6 | 9.9 | 63.5 | 29.5 | 27.9 | 20.1 | 32.7 | 33.8 | 18.0 | 28.6 | 32.0 |
Qwen2.5-3B | 16.2 | 12.8 | 71.0 | 18.6 | 17.3 | 13.4 | 17.9 | 26.1 | 9.3 | 15.6 | 20.8 |
Qwen2.5-1.5B | 11.1 | 14.6 | 74.3 | 13.1 | 12.0 | 11.0 | 11.3 | 18.7 | 6.7 | 12.2 | 12.9 |
GLM4-9B | 25.9 | 12.5 | 61.6 | 29.6 | 27.6 | 28.8 | 32.1 | 32.0 | 17.6 | 28.9 | 27.8 |
ChatGLM3-6B | 11.2 | 13.6 | 75.2 | 12.9 | 12.0 | 12.1 | 13.8 | 12.4 | 8.8 | 13.4 | 11.8 |
InternLM2.5-20B | 31.5 | 7.7 | 60.8 | 34.1 | 32.8 | 32.0 | 37.1 | 37.7 | 21.2 | 35.7 | 34.3 |
InternLM2.5-7B | 24.7 | 7.5 | 67.8 | 26.7 | 25.7 | 25.5 | 29.4 | 31.0 | 16.4 | 26.9 | 25.8 |
InternLM2.5-1.8B | 5.3 | 31.1 | 63.6 | 7.6 | 6.2 | 6.1 | 8.7 | 7.2 | 3.3 | 4.5 | 7.4 |
Yi-1.5-34B | 30.9 | 5.8 | 63.3 | 32.8 | 31.8 | 28.2 | 36.9 | 36.8 | 24.4 | 32.8 | 31.4 |
Yi-1.5-9B | 18.2 | 2.9 | 78.9 | 18.7 | 18.4 | 17.2 | 20.2 | 24.3 | 10.2 | 20.1 | 19.8 |
Yi-1.5-6B | 15.9 | 2.8 | 81.3 | 16.3 | 16.1 | 14.2 | 17.9 | 21.3 | 10.3 | 16.8 | 16.5 |
LLaMA3.1-70B | 38.3 | 9.4 | 52.3 | 42.3 | 40.2 | 22.9 | 47.2 | 49.3 | 34.5 | 49.6 | 40.4 |
LLaMA3.1-8B | 16.9 | 8.8 | 74.3 | 18.6 | 17.7 | 8.5 | 20.7 | 23.4 | 9.7 | 20.5 | 20.7 |
DeepSeek-67B | 43.5 | 14.8 | 41.7 | 51.1 | 47.0 | 34.3 | 54.5 | 50.3 | 42.3 | 49.0 | 46.2 |
DeepSeek-V2-Lite-Chat | 33.7 | 12.8 | 53.5 | 38.6 | 36.0 | 35.3 | 38.5 | 41.7 | 32.2 | 37.5 | 31.2 |
DeepSeek-7B | 23.2 | 13.2 | 63.6 | 26.7 | 24.8 | 24.5 | 27.2 | 28.9 | 20.6 | 27.0 | 21.5 |
Baichuan2-13B | 19.1 | 24.9 | 56.0 | 25.4 | 21.8 | 24.0 | 25.8 | 23.3 | 16.8 | 23.0 | 18.7 |
Baichuan2-7B | 12.5 | 21.8 | 65.7 | 16.0 | 14.0 | 14.6 | 16.1 | 15.4 | 11.1 | 13.8 | 13.3 |
Mixtral-8x22B-Instruct-v0.1 | 27.3 | 2.2 | 70.5 | 27.9 | 27.6 | 10.6 | 32.3 | 36.0 | 21.0 | 34.1 | 26.9 |
Mixtral-8x7B-Instruct-v0.1 | 20.4 | 7.2 | 72.4 | 22.0 | 21.2 | 5.2 | 26.5 | 29.0 | 13.0 | 25.0 | 23.3 |
Mistral-7B-Instruct-v0.2 | 15.0 | 8.8 | 76.2 | 16.4 | 15.6 | 4.5 | 18.2 | 22.2 | 9.5 | 21.4 | 15.7 |
Results of different models on Chinese SimpleQA. For metrics, CO, NA, IN, and CGA denote "Correct", "Not attempted", "Incorrect", and "Correct given attempted", respectively. For subtopics, CC, HU, ETAS, LAC, SO, and NS represent "Chinese Culture", "Humanities", "Engineering, Technology, and Applied Sciences", "Life, Art, and Culture", "Society", and "Natural Science", respectively. Following SimpleQA, F-score is the harmonic mean between correct and correct given attempted.
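Concretely, writing CO and IN for the fractions of correct and incorrect answers, "correct given attempted" is the accuracy over attempted questions, and the F-score is the harmonic mean of CO and CGA:

$$ \mathrm{CGA} = \frac{\mathrm{CO}}{\mathrm{CO} + \mathrm{IN}}, \qquad \text{F-score} = \frac{2\,\mathrm{CO}\cdot\mathrm{CGA}}{\mathrm{CO} + \mathrm{CGA}} $$

For example, o1-preview's CO = 63.8 and IN = 24.0 give CGA ≈ 72.7 and an F-score of about 67.9, matching the table.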
Model rankings differ markedly between the SimpleQA and Chinese SimpleQA benchmarks. For example, Doubao-pro-32k rises from 12th to 2nd on the Chinese version, while GPT-4 drops from 3rd to 9th. This highlights the importance of evaluating models across languages. Notably, o1-preview holds the top position on both datasets, and many models developed by the Chinese community perform better on Chinese SimpleQA than on SimpleQA.
The benchmark covers 99 subtopics to assess models' knowledge across diverse fields. Overall, o1-preview performs the most consistently across subtopics, followed by Doubao, while Moonshot is the weakest. There is a noticeable gap between Chinese community models and o1-preview in Computer Science and Medicine, but a smaller one in Education and Economics; some Chinese models even outperform o1-preview in Education. Moonshot struggles in Mathematics, Law, and Entertainment, and Baichuan also underperforms in Entertainment. Yi-Large excels in Education, and o1-preview remains strong in the other domains. Evaluating models across such diverse domains helps users choose the model that best fits their needs.
Calibration Analysis
We analyzed the calibration of different LLMs on Chinese SimpleQA. Models were instructed to state a confidence level from 0 to 100 alongside each answer; ideally, stated confidence should match actual accuracy. The results show that GPT-4o is better calibrated than GPT-4o mini, and o1-preview better than o1-mini. In the Qwen2.5 series, larger models are better calibrated. All models tend to be overconfident, especially when their stated confidence exceeds 50.
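As a minimal sketch of how such a calibration curve can be computed (the record format, bin width, and use of bin midpoints are illustrative assumptions, not the paper's code):

```python
# Bin self-reported confidence (0-100) against grading results and compute a
# simple Expected Calibration Error (ECE). Each record is (confidence, is_correct).
from collections import defaultdict

def calibration_table(records, bin_width=10):
    bins = defaultdict(lambda: [0, 0])  # bin index -> [num_correct, num_total]
    for confidence, is_correct in records:
        idx = min(int(confidence // bin_width), (100 // bin_width) - 1)
        bins[idx][0] += int(is_correct)
        bins[idx][1] += 1

    rows, ece = [], 0.0
    total = sum(n for _, n in bins.values())
    for idx in sorted(bins):
        correct, count = bins[idx]
        accuracy = correct / count
        mid_confidence = (idx + 0.5) * bin_width  # bin midpoint, in percent
        rows.append((mid_confidence, 100 * accuracy, count))
        # Count-weighted gap between stated confidence and observed accuracy.
        ece += (count / total) * abs(mid_confidence / 100 - accuracy)
    return rows, ece
```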
Test-Time Compute Analysis
We evaluated the relationship between increased test-time compute and accuracy. On random samples from Chinese SimpleQA, accuracy improves as the number of inference passes increases and eventually reaches a ceiling. This aligns with the dataset's purpose of probing models' knowledge boundaries.
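The exact selection procedure is not spelled out here; the sketch below assumes a simple majority vote over N independently sampled answers, with hypothetical `generate` and `grade` stand-ins for the model call and the (OpenAI-API-based) judge:

```python
# Estimate accuracy as a function of the number of inference passes N.
from collections import Counter

def generate(question: str) -> str:
    """Hypothetical: one sampled answer from the model (temperature > 0)."""
    raise NotImplementedError

def grade(question: str, answer: str, reference: str) -> bool:
    """Hypothetical: LLM judge returns True if the answer matches the reference."""
    raise NotImplementedError

def accuracy_at_n(dataset, n: int) -> float:
    correct = 0
    for question, reference in dataset:
        answers = [generate(question) for _ in range(n)]
        # Majority vote: take the most frequent answer as the final answer.
        final_answer, _ = Counter(answers).most_common(1)[0]
        correct += grade(question, final_answer, reference)
    return correct / len(dataset)

# Sweeping n (e.g. 1, 2, 4, 8, ...) traces accuracy against inference count;
# on Chinese SimpleQA it rises and then plateaus at the model's knowledge ceiling.
```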
Prior studies (OpenAI, 2023; Song et al., 2023) have found that alignment can degrade the abilities of language models, a phenomenon known as the "alignment tax". To illustrate the effect of alignment on factuality, we compare pre-trained models with aligned models trained via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF). As illustrated in Figure 6, different models exhibit varying trends after post-training, but most show a significant decline. The Baichuan2 series declines the most, with Baichuan2-7B and Baichuan2-13B losing 47% and 28% of their F-scores, respectively. This indicates that the alignment training of most current LLMs still has clear drawbacks with respect to knowledge hallucination, which further underscores the need for our dataset.
In this study, we explore how a Retrieval-Augmented Generation (RAG) strategy affects the factual accuracy of large language models (LLMs) on Chinese SimpleQA. Specifically, we reproduce a RAG system based on LlamaIndex (Liu, 2022), incorporating Google search APIs. As illustrated in Figure 5, all models demonstrate a substantial improvement in accuracy with RAG; for example, the performance of Qwen2.5-3B improves more than threefold. Notably, nearly all models with RAG outperform the vanilla GPT-4o. RAG also markedly narrows the performance gaps among models: the F-score difference between Qwen2.5-3B with RAG and Qwen2.5-72B with RAG is only 6.9%, so even small models can reach high performance when augmented with retrieval. Overall, this suggests that RAG serves as an effective shortcut for enhancing the factuality of LLMs.
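A minimal sketch of such a search-augmented RAG pipeline on top of LlamaIndex, assuming a hypothetical `google_search` wrapper around the search API (the actual system's prompts and retrieval settings may differ):

```python
# Retrieve web snippets for each question, index them, and answer over the index.
from llama_index.core import Document, VectorStoreIndex

def google_search(query: str, num_results: int = 5) -> list[str]:
    """Hypothetical wrapper around a Google search API; returns text snippets."""
    raise NotImplementedError("plug in your search API client here")

def answer_with_rag(question: str) -> str:
    # Wrap the retrieved snippets as documents.
    snippets = google_search(question)
    documents = [Document(text=s) for s in snippets]
    # Build an in-memory vector index over the snippets.
    index = VectorStoreIndex.from_documents(documents)
    # Query the index; the underlying LLM answers grounded in the retrieved text.
    return str(index.as_query_engine().query(question))
```

The resulting answers can then be graded with the same judging procedure used for the main benchmark results.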
Category | Question | Reference Answer | Model Output |
---|---|---|---|
中华文化-中医 | 创立了调气活血的“衡法”治则的是哪一位中医学家? | 颜德馨 | GPT-4o: 创立调气活血“衡法”治则的中医学家是施今墨
中华文化-民俗 | 西迁节主要是哪个少数民族的节日? | 锡伯族 | GPT-4o: 西迁节主要是裕固族的节日
自然与自然科学-资讯科学 | 《中国图书馆分类法》第五版中规定U6表示哪一类? | 水路运输 | GPT-4o: 根据《中国图书馆分类法》第五版,"U6"用于表示公路运输
自然与自然科学-数学 | 圆周率π小数点后第15位的数字是什么? | 3 | GPT-4o: 圆周率π的小数点后第15位是数字9
工程、技术与应用科学-计算机科学 | 1.7.0版本的pytorch可以兼容的最新cuda版本是多少? | 11.0 | Qwen2-72B-Instruct: PyTorch 1.7.0版本主要与CUDA 10.2和CUDA 11.1兼容
This table shows a few examples from the Chinese SimpleQA dataset. More samples are available in the complete dataset.
@misc{he2024chinesesimpleqachinesefactuality,
title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models},
author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},
year={2024},
eprint={2411.07140},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.07140},
}