CoRe 🥑

⭐ NeurIPS 2025 D&B Track Spotlight ⭐

Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks

📄 Paper 🐙 GitHub 🤗 HuggingFace

💡 Overview

CoRe is a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth.
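To make the three task types concrete, here is a small illustrative Python snippet annotated with the kinds of dependencies a model is asked to reason about. This example is ours and is not drawn from the benchmark; the variable names are purely illustrative.

```python
def scale(values, threshold):
    total = 0
    for v in values:
        if v > threshold:       # control dependency: the update of `total` below
            total += v          #   executes only when this condition holds
    result = total * 2          # data dependency: `result` reads the value of `total`
    return result               # information flow: `threshold` never flows into `result`
                                #   directly, but influences it through the branch above
```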

🏆 Leaderboard

Dependency Classification (F1 Score)

| Model | Data Dependency | Control Dependency | Information Flow | Overall |
|-------|-----------------|--------------------|------------------|---------|

Trace Generation (Correct Trace Rate, %)

| Model | Data Dependency | Control Dependency | Information Flow | Overall |
|-------|-----------------|--------------------|------------------|---------|

Dependency Source Enumeration (Exact Match, %)

| Model | Data Dependency | Control Dependency | Information Flow | Overall |
|-------|-----------------|--------------------|------------------|---------|

📝 Citation

@article{xie2025core,
  title={CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks},
  author={Xie, Danning and Zheng, Mingwei and Liu, Xuwei and Wang, Jiannan and Wang, Chengpeng and Tan, Lin and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2507.05269},
  year={2025}
}