DeltaBench

Yancheng He*1, Shilong Li*1, Jiaheng Liu*,†1, Weixun Wang*1, Xingyuan Bu1, Ge Zhang2,
Zhongyuan Peng1, Zhaoxiang Zhang3, Zhicheng Zheng1, Wenbo Su1, Bo Zheng1
1Alibaba Group 2M-A-P 3CASIA
*Equal Contribution   †Corresponding Author

Abstract

Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and to measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which contains long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) on different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to Detect Errors in Long CoT ReAsoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting the errors in each annotated section, in order to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long CoT reasoning abilities of their models.

💥 DeltaBench

Illustration of the evaluation process for critic models and Process Reward Models (PRMs) for DeltaBench.

DeltaBench is the first dataset for analyzing the quality of long CoTs generated by o1-like models and for evaluating the ability of existing critic models and PRMs to Detect Errors in Long CoT ReAsoning. Specifically, DeltaBench comprises 1,236 samples across diverse domains, including Math, Programming, PCB (physics, chemistry, and biology), and General Reasoning. Each sample contains a problem, its corresponding long CoT solution, and comprehensive human annotations.
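As a rough sketch of how such data could be consumed, assuming the samples are exported as a JSONL file with one record per line (the file name and field names below, such as task_l1 and sections, are illustrative placeholders, not the official schema):

```python
# Minimal sketch of loading DeltaBench-style samples, assuming a JSONL export
# with one JSON object per line. Field names are illustrative, not official.
import json
from collections import Counter


def load_samples(path: str) -> list[dict]:
    """Read one JSON object per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    samples = load_samples("deltabench.jsonl")  # hypothetical file name
    # Count samples per domain (e.g., Math, Programming, PCB, General Reasoning).
    print(Counter(s.get("task_l1", "unknown") for s in samples))
```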

💫 Introduction

DeltaBench is the first dataset designed to analyze the quality of long CoTs generated by o1-like models and to evaluate the ability of existing critic models and PRMs to Detect Errors in Long CoT ReAsoning. Specifically, to build DeltaBench, we first collect a diverse set of long CoTs generated by various o1-like models (i.e., QwQ, DeepSeek-R1, and Gemini-2.0 Flash Thinking) across different reasoning tasks such as Math, Programming, PCB (physics, chemistry, and biology), and General Reasoning. Then, we divide each long CoT into sections, where each section denotes an independent subtask, as shown in the figure below.

After that, each section is annotated with the following tags (a schematic example of one annotated section is sketched after the list):

  • 1️⃣ Strategy Shift: whether this section introduces a new method or strategy attempt. If a new strategy is introduced, the specific step is annotated.
  • 2️⃣ Reasoning Usefulness: whether the reasoning in this section is useful.
  • 3️⃣ Reasoning Correctness: whether this section contains any errors. If an error is present, additional error-related fields are annotated, including the number of the first step at which the error occurs, an explanation, and a correction.
  • 4️⃣ Reflection Efficiency: whether this section contains reflection and whether the reflection is correct. If reflection is present, the step at which the reflection begins is annotated.
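
Put together, the annotation for a single section might look like the following sketch. The keys mirror the four tags above but are illustrative placeholders; the released schema may use different field names.

```python
# Illustrative sketch of one annotated section in a DeltaBench-style sample.
# Keys follow the four tags described above; they are placeholders, not the
# official field names.
section_annotation = {
    "section_id": 3,
    "strategy_shift": True,          # 1) a new strategy is attempted in this section
    "strategy_shift_step": 12,       #    step where the new strategy starts
    "reasoning_usefulness": True,    # 2) the reasoning here contributes to the solution
    "reasoning_correctness": False,  # 3) this section contains an error
    "first_error_step": 15,          #    first step at which the error occurs
    "error_explanation": "...",      #    why the step is wrong
    "error_correction": "...",       #    corrected reasoning
    "reflection": True,              # 4) the section contains reflection
    "reflection_start_step": 17,     #    step where the reflection begins
    "reflection_correct": False,     #    whether the reflection resolves the error
}
```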

Leaderboard

| Model | Recall | Precision | F1 | Math F1 | Code F1 | PCB F1 | General F1 |
|---|---|---|---|---|---|---|---|
| Process Reward Models (PRMs) | | | | | | | |
| Qwen2.5-Math-PRM-7B | **30.30** | **34.96** | **29.22** | **29.64** | **23.76** | <u>31.09</u> | <u>34.19</u> |
| Qwen2.5-Math-PRM-72B | <u>28.16</u> | <u>29.37</u> | <u>26.38</u> | <u>24.16</u> | <u>22.02</u> | **31.14** | **35.83** |
| Llama3.1-8B-PRM-Deepseek-Data | 11.71 | 5.59 | 12.02 | 12.28 | 10.95 | 16.76 | 12.59 |
| Llama3.1-8B-PRM-Mistral-Data | 9.64 | 11.21 | 9.45 | 9.40 | 10.72 | 13.43 | 12.40 |
| Skywork-o1-Qwen-2.5-1.5B | 3.32 | 3.84 | 3.07 | 1.30 | 6.66 | 5.43 | 7.87 |
| Skywork-o1-Qwen-2.5-7B | 2.49 | 2.22 | 2.17 | 0.78 | 6.28 | 6.02 | 3.11 |
| LLM as Critic Models | | | | | | | |
| GPT-4-turbo-128k | **57.19** | **37.35** | **40.76** | **37.56** | **43.06** | <u>45.54</u> | <u>42.17</u> |
| GPT-4o-mini | <u>49.88</u> | 35.37 | <u>37.82</u> | <u>33.26</u> | 37.95 | **45.98** | **46.39** |
| Doubao-1.5-Pro | 39.68 | <u>37.02</u> | 35.25 | 32.46 | <u>39.47</u> | 33.53 | 37.00 |
| GPT-4o | 36.52 | 32.48 | 30.85 | 28.61 | 28.53 | 39.25 | 36.50 |
| Qwen2.5-Max | 36.11 | 30.82 | 30.49 | 26.73 | 32.81 | 39.49 | 29.54 |
| Gemini-1.5-pro | 35.51 | 30.32 | 29.59 | 26.56 | 28.20 | 40.13 | 33.66 |
| DeepSeek-V3 | 32.33 | 28.13 | 27.33 | 27.04 | 27.73 | 27.35 | 27.45 |
| Llama-3.1-70B-Instruct | 32.22 | 28.85 | 27.67 | 21.49 | 32.13 | 28.45 | 39.18 |
| Qwen2.5-32B-Instruct | 30.12 | 28.63 | 26.73 | 22.34 | 31.37 | 33.78 | 24.37 |
| DeepSeek-R1 | 29.20 | 32.66 | 28.43 | 24.17 | 29.28 | 34.78 | 35.87 |
| o1-preview | 27.92 | 30.59 | 26.97 | 22.19 | 28.09 | 33.11 | 35.94 |
| Qwen2.5-14B-Instruct | 26.64 | 27.27 | 24.73 | 21.51 | 29.05 | 29.98 | 20.59 |
| Llama-3.1-8B-Instruct | 25.71 | 28.01 | 24.91 | 18.12 | 32.17 | 27.30 | 29.93 |
| o1-mini | 22.90 | 22.90 | 19.89 | 16.71 | 21.70 | 20.37 | 26.94 |
| Qwen2.5-7B-Instruct | 21.99 | 19.61 | 18.63 | 11.61 | 25.92 | 29.85 | 15.18 |
| DeepSeek-R1-Distill-Qwen-32B | 17.19 | 18.65 | 16.28 | 13.02 | 23.55 | 15.05 | 11.56 |
| DeepSeek-R1-Distill-Qwen-14B | 12.81 | 14.54 | 12.55 | 9.40 | 18.36 | 10.44 | 12.01 |

Results of PRMs and critic models on DeltaBench. For each group of models, bold indicates the best results, while underline indicates the second best results.
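
For reference, section-level error detection can be scored with precision, recall, and F1 over the sections flagged as erroneous. The sketch below shows one way to compute these per sample; it is a minimal illustration, and the paper's exact matching and averaging rules may differ.

```python
# Minimal sketch of section-level error-detection metrics, assuming each sample
# provides the set of annotated erroneous sections and the set of sections the
# model flags as erroneous. The zero-handling and averaging here are illustrative.
def prf1(predicted: set[int], annotated: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over erroneous section indices for one sample."""
    tp = len(predicted & annotated)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


# Example: the model flags sections {2, 5}; annotators marked {2, 3} as erroneous.
print(prf1({2, 5}, {2, 3}))  # -> (0.5, 0.5, 0.5)
```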

BibTeX

@misc{he2025largelanguagemodelsdetect,
      title={Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?}, 
      author={Yancheng He and Shilong Li and Jiaheng Liu and Weixun Wang and Xingyuan Bu and Ge Zhang and Zhongyuan Peng and Zhaoxiang Zhang and Zhicheng Zheng and Wenbo Su and Bo Zheng},
      year={2025},
      eprint={2502.19361},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19361}, 
}