With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information in complex e-commerce applications, so a benchmark of e-commerce concepts is needed. Existing benchmarks face two primary challenges: (1) handling the heterogeneous and diverse nature of e-commerce tasks, and (2) distinguishing between generality and specificity within the e-commerce field. To address these problems, we propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality, and E-commerce Expertise. Fundamental concepts are designed to apply across a diverse array of e-commerce tasks, addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad and specialized e-commerce concepts, allowing precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations of mainstream LLMs and provide several valuable insights. We hope that ChineseEcomQA can guide future domain-specific evaluations and facilitate broader LLM adoption in e-commerce applications.
Starting from the basic elements of e-commerce, such as user behavior and product information, we summarize the main types of e-commerce concepts and define 10 sub-concepts, ranging from basic to advanced: Industry Categorization, Industry Concept, Category Concept, Brand Concept, Attribute Concept, Spoken Concept, Intent Concept, Review Concept, Relevance Concept, and Personalized Concept.
*Figure: ChineseEcomQA's features.*

*Figure: Key observations.*
Model | Avg. | IC | IDC | CC | BC | AC | SC | ITC | RVC | RLC | PC
---|---|---|---|---|---|---|---|---|---|---|---
**Closed-Source Large Language Models** | | | | | | | | | | |
GLM-4-Plus | 69.2 | 57.3 | 54.1 | 76.3 | 77.6 | 69.5 | 59.5 | 72.2 | 83.1 | 68.5 | 74.0
Qwen2.5-max | 68.5 | 62.2 | 62.2 | 71.1 | 77.6 | 63.0 | 58.5 | 57.8 | 88.7 | 63.9 | 80.4
Yi-Large | 67.6 | 56.6 | 59.5 | 71.1 | 81.8 | 62.0 | 58.5 | 70.0 | 70.4 | 68.5 | 78.0
o1-preview | 66.8 | 69.2 | 63.1 | 78.4 | 80.0 | 67.0 | 52.0 | 43.3 | 83.1 | 61.3 | 71.1
Baichuan4-Turbo | 66.4 | 57.3 | 56.8 | 82.0 | 72.4 | 61.0 | 59.5 | 66.7 | 78.9 | 55.4 | 74.6
GPT-4o | 65.6 | 68.2 | 52.3 | 74.7 | 72.4 | 64.5 | 56.5 | 50.0 | 80.3 | 57.7 | 79.8
Doubao-1.5-pro-32k | 64.0 | 69.6 | 64.0 | 62.9 | 74.1 | 56.5 | 64.5 | 48.9 | 69.0 | 62.6 | 68.2
Claude-3.5-Sonnet | 63.8 | 70.6 | 56.8 | 73.2 | 64.1 | 63.0 | 31.5 | 62.2 | 81.7 | 65.2 | 69.4
Gemini-1.5-pro | 61.1 | 59.8 | 49.6 | 67.0 | 70.0 | 56.0 | 43.5 | 55.6 | 81.7 | 54.1 | 73.4
o1-mini | 55.4 | 59.1 | 41.4 | 53.1 | 37.1 | 59.0 | 53.0 | 58.9 | 64.8 | 63.6 | 64.2
Gemini-1.5-flash | 54.5 | 62.9 | 35.1 | 57.2 | 46.5 | 52.5 | 53.0 | 36.7 | 74.7 | 54.4 | 71.7
**Open-Source Large Language Models** | | | | | | | | | | |
DeepSeek-R1 | 74.0 | 62.9 | 72.1 | 72.1 | 84.7 | 70.5 | 55.5 | 67.8 | 85.9 | 76.1 | 92.5
DeepSeek-V3 | 72.2 | 67.5 | 64.9 | 74.2 | 80.6 | 69.0 | 62.0 | 72.2 | 77.5 | 68.2 | 86.1
DeepSeek-V2.5 | 67.4 | 66.4 | 58.6 | 73.7 | 76.5 | 64.0 | 60.0 | 75.6 | 83.1 | 54.1 | 61.8
DeepSeek-67B | 58.4 | 61.2 | 47.7 | 70.6 | 62.9 | 47.0 | 52.5 | 60.0 | 59.2 | 55.7 | 67.1
DeepSeek-7B | 47.5 | 38.5 | 41.1 | 59.3 | 45.9 | 40.0 | 49.0 | 54.4 | 47.9 | 54.4 | 44.7
DeepSeek-R1-Distill-Qwen-32B | 57.1 | 63.6 | 46.0 | 62.4 | 47.6 | 36.0 | 43.0 | 61.1 | 78.9 | 61.6 | 70.6
DeepSeek-R1-Distill-Qwen-14B | 50.6 | 64.7 | 43.2 | 62.9 | 38.8 | 27.5 | 41.0 | 60.0 | 67.6 | 59.0 | 40.9
DeepSeek-R1-Distill-Qwen-7B | 38.9 | 48.6 | 18.0 | 53.6 | 16.5 | 25.0 | 35.5 | 36.7 | 52.1 | 48.2 | 57.2
DeepSeek-R1-Distill-Qwen-1.5B | 26.2 | 35.7 | 2.7 | 46.9 | 6.5 | 8.5 | 23.5 | 18.9 | 40.9 | 40.0 | 38.2
Qwen2.5-72B | 62.7 | 57.3 | 46.0 | 66.0 | 64.7 | 55.5 | 58.0 | 67.8 | 76.1 | 56.7 | 78.6
Qwen2.5-32B | 60.9 | 62.2 | 42.3 | 58.8 | 50.6 | 61.5 | 52.5 | 66.7 | 74.7 | 62.3 | 77.5
Qwen2.5-14B | 55.3 | 57.0 | 40.5 | 54.6 | 48.8 | 59.0 | 49.0 | 40.0 | 66.2 | 59.3 | 78.6
Qwen2.5-7B | 47.1 | 45.8 | 24.3 | 51.6 | 37.6 | 44.5 | 54.0 | 31.1 | 64.8 | 48.5 | 68.8
Qwen2.5-3B | 41.7 | 52.1 | 14.4 | 41.8 | 34.1 | 42.5 | 34.0 | 30.0 | 60.6 | 51.1 | 56.7
LLaMA3.1-70B | 54.6 | 59.1 | 35.7 | 58.8 | 39.4 | 58.0 | 37.5 | 73.3 | 74.7 | 53.8 | 56.1
LLaMA3.1-8B | 42.4 | 40.6 | 11.7 | 61.3 | 17.1 | 42.0 | 38.5 | 42.2 | 66.2 | 44.6 | 60.0
Table 1: Accuracy (%) of different models on ChineseEcomQA. For sub-concepts, IC, IDC, CC, BC, AC, SC, ITC, RVC, RLC, and PC stand for "Industry Categorization", "Industry Concept", "Category Concept", "Brand Concept", "Attribute Concept", "Spoken Concept", "Intent Concept", "Review Concept", "Relevance Concept", and "Personalized Concept", respectively.
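For reference, the "Avg." column is consistent with a macro-average over the ten sub-concept accuracies (e.g., for GLM-4-Plus, 692.1 / 10 = 69.2). Below is a minimal sketch of this aggregation, assuming a hypothetical record format of one (sub-concept, correct) pair per graded question:

```python
from collections import defaultdict

def score(records):
    """records: iterable of (sub_concept, is_correct) pairs, one per question.
    Returns the macro-average accuracy and per-sub-concept accuracies in %."""
    totals, hits = defaultdict(int), defaultdict(int)
    for concept, correct in records:
        totals[concept] += 1
        hits[concept] += int(correct)
    per_concept = {c: 100.0 * hits[c] / totals[c] for c in totals}
    macro_avg = sum(per_concept.values()) / len(per_concept)
    return macro_avg, per_concept

# Example: macro-average over two sub-concepts
print(score([("IC", True), ("IC", False), ("BC", True)]))
# -> (75.0, {'IC': 50.0, 'BC': 100.0})
```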
In this section, we evaluate the calibration of various models on the category concept and brand concept tasks, with results visualized in Figure 6. The results show how well each model's stated confidence correlates with its actual accuracy. Notably, o1-preview exhibits the best alignment, followed by o1-mini. Within the Qwen2.5 series, the calibration hierarchy is Qwen2.5-MAX > Qwen2.5-72B > Qwen2.5-14B > Qwen2.5-7B > Qwen2.5-3B, suggesting that larger model scales correlate with better calibration. However, most models consistently fall below the perfect-alignment line, indicating a prevalent tendency toward overconfident predictions. This highlights significant room for improving LLM calibration to mitigate the overconfident generation of erroneous responses.
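One common way to quantify this alignment is the expected calibration error (ECE). The paper does not specify its exact metric, so the following is only a sketch, assuming each answer comes with a stated confidence in [0, 1] and a correctness flag:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin stated confidences, compare each bin's mean confidence with its
    empirical accuracy, and average the gaps weighted by bin size.
    0.0 means the model sits exactly on the perfect-alignment diagonal."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences >= lo) & (
            (confidences < hi) if i < n_bins - 1 else (confidences <= hi)
        )
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight gap by the bin's share of samples
    return ece

# Overconfident example: high stated confidence, mediocre accuracy
print(expected_calibration_error([0.9, 0.9, 0.8, 0.7], [1, 0, 1, 0]))
```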
In this study, we explore the effectiveness of the Retrieval-Augmented Generation (RAG) strategy in enhancing the domain knowledge of large language models (LLMs) on the ChineseEcomQA dataset. Specifically, we reproduce a RAG system following the settings of Chinese SimpleQA on the category concept and brand concept tasks. As shown in Figure 7, all models improve significantly with RAG. We summarize three detailed conclusions:
In conclusion, the discussions above suggest that RAG serves as an effective method for enhancing the e-commerce knowledge of LLMs.
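As a rough illustration of the retrieve-then-answer setup (a sketch only, not the paper's exact pipeline; `search` and `llm` are hypothetical stand-ins for a retriever and a model endpoint):

```python
def rag_answer(question: str, search, llm, top_k: int = 5) -> str:
    """Retrieve top-k passages, then let the model answer grounded in them."""
    passages = search(question, top_k=top_k)  # hypothetical retriever call
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the e-commerce question using the evidence below.\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)  # hypothetical LLM call
```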
Model | Behavior A | Behavior B | Behavior C | Behavior D
---|---|---|---|---
DeepSeek-R1-Distill-Qwen-7B | 23.97 | 2.17 | 68.02 | 5.85
DeepSeek-R1-Distill-Qwen-14B | 40.27 | 3.75 | 47.05 | 8.94
DeepSeek-R1-Distill-Qwen-32B | 39.57 | 2.14 | 52.01 | 6.29
DeepSeek-R1 | 62.80 | 2.80 | 26.49 | 7.92

Table 2: Distribution (%) of thinking types for reasoning models on the category and brand concept tasks.
Inspired by Liu et al. (2025), we categorize the thinking processes of reasoning models into the following four types:
We use GPT-4o as the judge LLM to classify the thinking types of different models on the category and brand concept tasks; a sketch of this judging step follows the discussion below. The specific results are shown in Table 2. Analyzing the dominant reasoning types, we draw the following conclusions:
Overall, types A and B reflect abilities that reasoning LLMs acquire by scaling up test-time computation, whereas types C and D are superficial self-reflections that lead to incorrect final answers. DeepSeek-R1, built on a powerful base model, demonstrates better generalization. In contrast, the DeepSeek-R1-Distill-Qwen series, distilled on specific domains, appears to struggle with superficial self-reflection: factual errors accumulated during intermediate reasoning steps increase the overall error rate. For smaller reasoning LLMs, reasoning ability honed on mathematical and logical tasks does not directly generalize to open domains, and better methods are needed to improve their performance.
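A minimal sketch of the judging step using the OpenAI Python client; the prompt wording here is illustrative (not the paper's actual prompt), and the definitions of types A–D would be filled in from the list above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You will see a model's chain of thought for an e-commerce
question. Classify it as exactly one of the four thinking types (A, B, C, or D,
as defined above) and reply with the single letter only.

Chain of thought:
{trace}
"""

def classify_thinking_type(trace: str) -> str:
    """Ask the judge model (GPT-4o) to label one reasoning trace."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(trace=trace)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```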
To demonstrate the distinctions between Chinese SimpleQA and ChineseEcomQA, we compare the ranking differences of various models across the two benchmarks. As illustrated in the figure, significant performance discrepancies emerge among models. Notably, o1-preview ranks first on Chinese SimpleQA but drops to 4th on ChineseEcomQA, while GLM-4-Plus ascends from 3rd to 1st place. These ranking variations reveal that most models developed by the Chinese community (e.g., Qwen-Max, GLM-4-Plus, Yi-Large) exhibit superior Chinese e-commerce domain adaptation within identical linguistic contexts. Furthermore, the distinct ranking distributions indicate that ChineseEcomQA has discriminative power complementary to Chinese SimpleQA, enabling comprehensive evaluation of LLMs' domain-specific capabilities in Chinese e-commerce scenarios.
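The degree of ranking disagreement between the two benchmarks could be summarized with a rank correlation; the paper does not report such a statistic, so this is only a sketch. Only the o1-preview and GLM-4-Plus ranks below come from the text; the Qwen-Max entries are hypothetical placeholders:

```python
from scipy.stats import spearmanr

# From the text: o1-preview 1st -> 4th, GLM-4-Plus 3rd -> 1st.
# The Qwen-Max ranks are hypothetical placeholders for illustration.
ranks_simpleqa = {"o1-preview": 1, "GLM-4-Plus": 3, "Qwen-Max": 2}
ranks_ecomqa = {"o1-preview": 4, "GLM-4-Plus": 1, "Qwen-Max": 2}

models = sorted(ranks_simpleqa)
rho, _ = spearmanr(
    [ranks_simpleqa[m] for m in models],
    [ranks_ecomqa[m] for m in models],
)
print(f"Spearman rank correlation: {rho:.2f}")  # low rho => complementary rankings
```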
*Figure: Example questions from the ChineseEcomQA dataset. More samples are available in the complete dataset.*
@misc{he2024chinesesimpleqachinesefactuality,
title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models},
author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},
year={2024},
eprint={2411.07140},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.07140},
}