ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models

Haibin Chen*, Kangtao Lv*, Chengwei Hu, Yanshi Li, Yujin Yuan,
Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, Bo Zheng
Taobao & Tmall Group of Alibaba

*Indicates Equal Contribution

Corresponding Author

Abstract

With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information within complex e-commerce applications, so an e-commerce concept benchmark is necessary. Existing benchmarks face two primary challenges: (1) handling the heterogeneous and diverse nature of e-commerce tasks, and (2) distinguishing between generality and specificity within the e-commerce field. To address these problems, we propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality, and E-commerce Expertise. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations of mainstream LLMs and provide valuable insights. We hope ChineseEcomQA can guide future domain-specific evaluations and facilitate broader LLM adoption in e-commerce applications.

E-commerce Concepts


Starting from the basic elements of e-commerce, such as user behavior and product information, we summarize the main types of e-commerce concepts and define 10 sub-concepts, ranging from basic to advanced (an illustrative question format is sketched after the list):

  • Industry Categorization. Given an e-commerce corpus (such as user queries or web text), the model needs to identify which e-commerce industries and categories are involved. The difficulty lies in distinguishing similar categories within the e-commerce domain.
  • Industry Concept. The model needs to understand the specialized knowledge of different e-commerce industries. The difficulty lies in accurately memorizing professional factual knowledge.
  • Category Concept. The model must understand which category a common, standard product belongs to.
  • Brand Concept. The model needs to recognize major brands and understand background information about them.
  • Attribute Concept. E-commerce text often describes products using basic attributes, such as style or age group. The model must be able to pick out these specific attribute words.
  • Spoken Concept. The e-commerce field is closely tied to daily life, and people often use casual, imprecise language to express what they want. The model needs to understand the true meaning behind these expressions.
  • Intent Concept. Beyond informal language, consumers sometimes simply list a set of attributes. The model needs to infer the consumer's true intention from these phrases (for example, how to choose a product).
  • Review Concept. The model needs to understand common concepts in user reviews, such as emotional tendencies and commonly used evaluation aspects.
  • Relevance Concept. One of the most crucial concepts in e-commerce is judging how relevant a product is to what a user wants. The model needs to integrate basic concepts, such as the intent and category concepts, to determine the relevance between a user's expression and products.
  • Personalized Concept. Personalization is one of the most important parts of the user experience. This requires combining basic e-commerce concepts with general reasoning skills to recommend new product categories that best match a user's recent preferences.
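To make the question-answering format concrete, the sketch below shows what a single short-answer item and its scoring might look like. All fields, wording, and the scoring rule are illustrative assumptions, not drawn from the released dataset.

    # A hypothetical ChineseEcomQA-style item (fields are illustrative,
    # not taken from the actual benchmark).
    item = {
        "sub_concept": "Category Concept",
        "question": "Which standard product category do 'trail running shoes' belong to?",
        "reference_answer": "sports footwear",
    }

    def exact_match(prediction: str, reference: str) -> bool:
        # Simple normalized string match; the actual benchmark may use an
        # LLM judge or a more lenient matching rule.
        return prediction.strip().lower() == reference.strip().lower()

    print(exact_match("Sports Footwear", item["reference_answer"]))  # True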

Data Construction Pipeline

An overview of the data construction, filtering, verification, and quality control processes of ChineseEcomQA.
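The abstract describes a construction process that chains LLM validation, RAG validation, and manual annotation. The sketch below shows one way such a multi-stage filter could be orchestrated; every function here is a hypothetical placeholder, not the authors' actual implementation.

    # Schematic only: each stage stands in for a validation step named in
    # the paper; the bodies are placeholders.
    def llm_validate(item):
        # e.g., ask a strong LLM whether the question and answer are consistent
        return True  # placeholder: always passes here

    def rag_validate(item):
        # e.g., retrieve external evidence and verify the answer against it
        return True  # placeholder: always passes here

    def manual_annotate(item):
        # e.g., final human review and correction
        return True  # placeholder: always passes here

    def build_benchmark(candidates):
        # An item enters the benchmark only if it survives all three stages.
        stages = [llm_validate, rag_validate, manual_annotate]
        return [item for item in candidates if all(stage(item) for stage in stages)]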

ChineseEcomQA's Features

  • Focus on Fundamental Concept: We focus on fundamental concepts that enable unified generative evaluation.
  • E-commerce Generality: The concepts assessed by the benchmark must be common across the e-commerce industry, avoiding platform-specific implementations or task-specific formulations.
  • E-commerce Expertise: Real-world e-commerce problems often require a foundation of specialized e-commerce knowledge, complemented by the application of general comprehension and reasoning skills.

Key Observations

  • Leading Models: DeepSeek-R1 and DeepSeek-V3 are currently the best models, demonstrating the promising potential of powerful foundation LLMs (and reasoning LLMs) in the e-commerce field.
  • Significant Challenges: ChineseEcomQA poses considerable challenges, with many state-of-the-art models achieving below 60% accuracy on specific sub-concepts.
  • Scaling Laws: E-commerce concept understanding follows a scaling law, with larger models demonstrating superior capability on advanced concepts.
  • Calibration: Larger models show better calibration in confidence estimation.
  • Reasoning LLMs: The DeepSeek-R1-Distill-Qwen series performs worse than the original Qwen series and struggles to identify and correct its own factual errors, indicating that open-domain reasoning still faces many challenges.
  • RAG Matters: Introducing a RAG strategy into existing LLMs yields significant performance improvements for models of all sizes, narrowing the gap among models.

Leaderboard

Accuracy (%) of each model: overall average (Avg.) and the 10 sub-concepts.
Model  Avg.  IC  IDC  CC  BC  AC  SC  ITC  RVC  RLC  PC
Closed-Source Large Language Models
GLM-4-Plus 69.2 57.3 54.1 76.3 77.6 69.5 59.5 72.2 83.1 68.5 74.0
Qwen2.5-max 68.5 62.2 62.2 71.1 77.6 63.0 58.5 57.8 88.7 63.9 80.4
Yi-Large 67.6 56.6 59.5 71.1 81.8 62.0 58.5 70.0 70.4 68.5 78.0
o1-preview 66.8 69.2 63.1 78.4 80.0 67.0 52.0 43.3 83.1 61.3 71.1
Baichuan4-Turbo 66.4 57.3 56.8 82.0 72.4 61.0 59.5 66.7 78.9 55.4 74.6
GPT-4o 65.6 68.2 52.3 74.7 72.4 64.5 56.5 50.0 80.3 57.7 79.8
Doubao-1.5-pro-32k 64.0 69.6 64.0 62.9 74.1 56.5 64.5 48.9 69.0 62.6 68.2
Claude-3.5-Sonnet 63.8 70.6 56.8 73.2 64.1 63.0 31.5 62.2 81.7 65.2 69.4
Gemini-1.5-pro 61.1 59.8 49.6 67.0 70.0 56.0 43.5 55.6 81.7 54.1 73.4
o1-mini 55.4 59.1 41.4 53.1 37.1 59.0 53.0 58.9 64.8 63.6 64.2
Gemini-1.5-flash 54.5 62.9 35.1 57.2 46.5 52.5 53.0 36.7 74.7 54.4 71.7
Open-Source Large Language Models
DeepSeek-R1 74.0 62.9 72.1 72.1 84.7 70.5 55.5 67.8 85.9 76.1 92.5
DeepSeek-V3 72.2 67.5 64.9 74.2 80.6 69.0 62.0 72.2 77.5 68.2 86.1
DeepSeek-V2.5 67.4 66.4 58.6 73.7 76.5 64.0 60.0 75.6 83.1 54.1 61.8
DeepSeek-67B 58.4 61.2 47.7 70.6 62.9 47.0 52.5 60.0 59.2 55.7 67.1
DeepSeek-7B 47.5 38.5 41.1 59.3 45.9 40.0 49.0 54.4 47.9 54.4 44.7
DeepSeek-R1-Distill-Qwen-32B 57.1 63.6 46.0 62.4 47.6 36.0 43.0 61.1 78.9 61.6 70.6
DeepSeek-R1-Distill-Qwen-14B 50.6 64.7 43.2 62.9 38.8 27.5 41.0 60.0 67.6 59.0 40.9
DeepSeek-R1-Distill-Qwen-7B 38.9 48.6 18.0 53.6 16.5 25.0 35.5 36.7 52.1 48.2 57.2
DeepSeek-R1-Distill-Qwen-1.5B 26.2 35.7 2.7 46.9 6.5 8.5 23.5 18.9 40.9 40.0 38.2
Qwen2.5-72B 62.7 57.3 46.0 66.0 64.7 55.5 58.0 67.8 76.1 56.7 78.6
Qwen2.5-32B 60.9 62.2 42.3 58.8 50.6 61.5 52.5 66.7 74.7 62.3 77.5
Qwen2.5-14B 55.3 57.0 40.5 54.6 48.8 59.0 49.0 40.0 66.2 59.3 78.6
Qwen2.5-7B 47.1 45.8 24.3 51.6 37.6 44.5 54.0 31.1 64.8 48.5 68.8
Qwen2.5-3B 41.7 52.1 14.4 41.8 34.1 42.5 34.0 30.0 60.6 51.1 56.7
LLaMA3.1-70B 54.6 59.1 35.7 58.8 39.4 58.0 37.5 73.3 74.7 53.8 56.1
LLaMA3.1-8B 42.4 40.6 11.7 61.3 17.1 42.0 38.5 42.2 66.2 44.6 60.0

Results of different models on ChineseEcomQA. For sub-concepts, IC, IDC, CC, BC, AC, SC, ITC, RVC, RLC and PC represent "Industry Categorization", "Industry Concept", "Category Concept", "Brand Concept", "Attribute Concept", "Spoken Concept", "Intent Concept", "Review Concept", "Relevance Concept" and "Personalized Concept", respectively.
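The Avg. column appears to be the unweighted mean of the ten sub-concept accuracies; a quick check against the GLM-4-Plus row:

    # Sub-concept scores for GLM-4-Plus (IC ... PC), from the table above.
    scores = [57.3, 54.1, 76.3, 77.6, 69.5, 59.5, 72.2, 83.1, 68.5, 74.0]
    print(round(sum(scores) / len(scores), 1))  # 69.2, matching the reported Avg.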

Calibration capabilities of various models

In this section, we evaluate the calibration capabilities of various models on the category concept and brand concept tasks, with results visualized in Figure 6. The results show the correlation between a model's stated confidence and its actual accuracy. Notably, o1-preview exhibits the best alignment, followed by o1-mini. Within the Qwen2.5 series, the calibration hierarchy emerges as Qwen2.5-MAX > Qwen2.5-72B > Qwen2.5-14B > Qwen2.5-7B > Qwen2.5-3B, suggesting that larger model scales correlate with improved calibration. However, most models consistently fall below the perfect alignment line, indicating a prevalent tendency towards overconfidence. This highlights significant room for improving LLM calibration to mitigate the overconfident generation of erroneous responses.
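For readers who want to reproduce this kind of reliability analysis on their own model outputs, the sketch below bins stated confidences against empirical accuracy and computes the expected calibration error (ECE). The equal-width binning is a standard choice, not necessarily the scheme used in the paper.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Bin stated confidence vs. empirical accuracy; ECE is the
        # bin-weighted average gap between the two.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                acc = correct[mask].mean()        # empirical accuracy in the bin
                conf = confidences[mask].mean()   # average stated confidence
                ece += mask.mean() * abs(acc - conf)
        return ece

    # Toy usage: an overconfident model (stated confidence above accuracy).
    print(expected_calibration_error([0.9, 0.9, 0.8, 0.6], [1, 0, 1, 1]))  # 0.35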

Analysis On The Effect Of RAG

In this study, we explore the effectiveness of the Retrieval-Augmented Generation (RAG) strategy in enhancing the domain knowledge of LLMs on the ChineseEcomQA dataset. Specifically, we reproduce a RAG system following the settings of Chinese SimpleQA on the category concept and brand concept tasks. As shown in Figure 7, all models improve significantly with RAG. We draw three detailed conclusions:

  • For small LLMs, introducing RAG information significantly increases absolute accuracy. For example, Qwen2.5-14B achieves a 27.9% absolute improvement.
  • For large LLMs, RAG also brings significant relative improvements. For example, DeepSeek-V3's average relative improvement reaches 10.44% (accuracy rising from 77.4 to 85.5).
  • Under the RAG setting, performance across models still follows the scaling law, but the gap narrows rapidly. For example, the accuracy difference between DeepSeek-V3 and Qwen2.5-72B shrinks from 12.1% to 4%.

In conclusion, the discussions above suggest that RAG serves as an effective method for enhancing the e-commerce knowledge of LLMs.
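A minimal retrieve-then-generate sketch in the spirit of this setup follows; the keyword-overlap retriever, corpus, and prompt template are illustrative stand-ins for the Chinese-SimpleQA-style RAG system actually used, not its implementation.

    def retrieve(query: str, corpus: list, k: int = 2) -> list:
        # Toy keyword-overlap retriever; a real system would use a search
        # engine or a dense retriever.
        overlap = lambda doc: len(set(query.split()) & set(doc.split()))
        return sorted(corpus, key=overlap, reverse=True)[:k]

    def build_rag_prompt(question: str, passages: list) -> str:
        context = "\n".join(f"- {p}" for p in passages)
        return (
            "Answer the e-commerce question using the evidence below.\n"
            f"Evidence:\n{context}\n"
            f"Question: {question}\nAnswer:"
        )

    # Hypothetical evidence corpus for illustration only.
    corpus = [
        "Brand X was founded in 1995 and sells sports footwear.",
        "Category pages group products such as running shoes and sneakers.",
    ]
    question = "When was Brand X founded?"
    prompt = build_rag_prompt(question, retrieve(question, corpus))
    # `prompt` would then be sent to the LLM under evaluation.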

Table 2: Distribution of the four thinking types in reasoning LLMs (%).

Model  Type A  Type B  Type C  Type D
DeepSeek-R1-Distill-Qwen-7B 23.97 2.17 68.02 5.85
DeepSeek-R1-Distill-Qwen-14B 40.27 3.75 47.05 8.94
DeepSeek-R1-Distill-Qwen-32B 39.57 2.14 52.01 6.29
DeepSeek-R1 62.80 2.80 26.49 7.92

Inspired by Liu et al. (2025), we categorize the thinking process of reasoning models into the following four types:

  • Type A: The reasoning LLM repeatedly confirms the correct answer through self-reflection.
  • Type B: The reasoning LLM initially makes a mistake but corrects it through self-reflection.
  • Type C: The reasoning LLM introduces knowledge errors through self-reflection, modifying potentially correct answers into incorrect ones.
  • Type D: The reasoning LLM undergoes repeated self-reflection; although it ultimately produces an answer, it never reaches a certain, confident answer through reflection.

We use a judge LLM (GPT-4o) to classify the thinking types of different models on the category and brand concept tasks; the results are shown in Table 2 (a sketch of such a judge prompt follows at the end of this section). Analyzing the dominant reasoning types, we draw the following conclusions:

  • According to the Type A column, after arriving at the correct answer, reasoning LLMs verify it through multiple rounds of reflection.
  • According to the Type B column, reasoning LLMs, regardless of size, have acquired the ability to correct their own erroneous thinking. For e-commerce concepts, the underlying reasoning paths are less complex than in areas like mathematics or programming, so self-correction occurs less frequently. The results also indicate that error-correction processes do not substantially enhance knowledge capacity.
  • According to the Type C column, smaller LLMs are more likely to introduce factual errors during their thinking process, leading to incorrect answers. This is one of the main reasons why smaller reasoning LLMs perform worse than the original Qwen series models.

Overall, Types A and B reflect abilities that reasoning LLMs obtain by scaling up test-time computation, whereas Types C and D are superficial self-reflections that lead to incorrect final answers. DeepSeek-R1 demonstrates better generalization thanks to its powerful base model. In contrast, the DeepSeek-R1-Distill-Qwen series, distilled on specific domains, appears to struggle with superficial self-reflection: the accumulation of factual errors during intermediate reasoning steps increases the overall error rate. For smaller reasoning LLMs, open-domain reasoning ability does not follow directly from mathematical and logical reasoning ability, and better methods are needed to improve their performance.
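The sketch below shows how a judge prompt for this four-way classification might look; the wording and the `judge` callable are hypothetical, not the exact prompt or interface used with GPT-4o.

    # Hypothetical judge prompt mirroring the Type A-D definitions above.
    JUDGE_PROMPT = """You are given a reasoning trace from a model answering an
    e-commerce question, plus the gold answer. Classify the trace as exactly one of:
    A: repeatedly confirms the correct answer through self-reflection.
    B: initially wrong, but corrects itself through self-reflection.
    C: self-reflection introduces a knowledge error, turning a correct answer wrong.
    D: reflects repeatedly but never reaches a confident final answer.

    Gold answer: {gold}
    Reasoning trace:
    {trace}

    Reply with a single letter (A/B/C/D)."""

    def classify_thinking(trace: str, gold: str, judge) -> str:
        # `judge` is any callable that sends a prompt to the judge LLM
        # (e.g., GPT-4o) and returns its text reply.
        return judge(JUDGE_PROMPT.format(gold=gold, trace=trace)).strip()[:1]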

Rankings on ChineseEcomQA vs. Chinese SimpleQA

To demonstrate the distinctions between Chinese SimpleQA and ChineseEcomQA, we compare the ranking differences of various models across the two benchmarks. As illustrated in the figure, significant performance discrepancies emerge. Notably, o1-preview ranks first on Chinese SimpleQA but drops to 4th on ChineseEcomQA. Conversely, GLM-4-Plus ascends from 3rd to 1st place. These ranking variations reveal that most models developed by the Chinese community (e.g., Qwen-Max, GLM-4-Plus, Yi-Large) exhibit superior performance on Chinese e-commerce domain adaptation within identical linguistic contexts. Furthermore, the distinct ranking distributions indicate that ChineseEcomQA offers discriminative power complementary to Chinese SimpleQA, enabling comprehensive evaluation of LLMs' domain-specific capabilities in Chinese e-commerce scenarios.

Dataset Examples

This figure shows a few examples from the ChineseEcomQA dataset. More samples are available in the complete dataset.

BibTeX

@misc{he2024chinesesimpleqachinesefactuality,
      title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models}, 
      author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},
      year={2024},
      eprint={2411.07140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07140}, 
}