CoRe🥑

Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks

📄 Paper 🐙 GitHub 🤗 HuggingFace

Overview

CoRe is a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth.

- Dependency Classification
- Trace Generation
- Dependency Source Enumeration

F1

| # | Model | Data Dependency | Control Dependency | Information Flow | Overall |
|---|-------|-----------------|--------------------|------------------|---------|
| 1 | Model A | 95.1 | 92.3 | 93.7 | 93.7 |
| 2 | Model B | 91.0 | 89.5 | 90.1 | 90.2 |
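As a rough sketch of how a binary F1 over dependency-classification instances could be computed (the exact metric definition is in the paper; the 0/1 label encoding here is an assumption):

```python
def f1_score(gold, pred):
    """Binary F1 over parallel 0/1 label lists (1 = dependency exists).

    Illustrative only; see the paper for the precise metric definition."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.8
```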

Correct Trace Rate (%)

| # | Model | Data Dependency | Control Dependency | Information Flow | Overall |
|---|-------|-----------------|--------------------|------------------|---------|
| 1 | Model A | 83.0 | 80.5 | 78.2 | 80.6 |
| 2 | Model B | 75.0 | 73.1 | 72.5 | 73.5 |

Exact Match (%)

| # | Model | Data Dependency | Control Dependency | Information Flow | Overall |
|---|-------|-----------------|--------------------|------------------|---------|
| 1 | Model A | 70.0 | 72.3 | 71.4 | 71.2 |
| 2 | Model B | 60.0 | 62.5 | 61.9 | 61.5 |
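For Dependency Source Enumeration, exact match plausibly means the predicted set of dependency sources equals the gold set; a minimal sketch under that assumption (see the paper for the precise definition):

```python
def exact_match_rate(gold_sets, pred_sets):
    """Percent of instances whose predicted source set equals the gold set.

    Assumes Exact Match compares whole sets; illustrative only."""
    hits = sum(set(g) == set(p) for g, p in zip(gold_sets, pred_sets))
    return 100.0 * hits / len(gold_sets)

gold = [{3, 5}, {2}, {7, 9}]      # gold source lines per instance
pred = [{3, 5}, {2, 4}, {7, 9}]   # one spurious source in instance 2
print(exact_match_rate(gold, pred))  # ~66.7
```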

Examples

Data Dependency
Control Dependency
Information Flow
```python
# Example Data Dependency (illustrative)
input_code = """
x = 1
y = x + 1
z = x + y
"""

# line 3 (`z = x + y`) is data-dependent on the definitions
# of `x` (line 1) and `y` (line 2)
expected = {(3, "x"): 1, (3, "y"): 2, (2, "x"): 1}
```
```python
# Example Control Dependency
def factorial(n: int) -> int:
    if n <= 1:
        return 1  # control-dependent on the `if n <= 1` guard
    return n * factorial(n - 1)
```
```javascript
// Example InfoFlow
function readUserToken() {
  const token = localStorage.getItem('token');
  sendToServer(token);  // token flows from localStorage to the network
}
```
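The data-dependency idea above can be sketched with the standard `ast` module: for straight-line Python, map each variable use to the line of its most recent assignment. This is a toy extractor, not the benchmark's tooling:

```python
import ast

def data_deps(src: str) -> dict:
    """Map (use_line, var) -> line of the most recent assignment to var.

    Handles straight-line code only (no branches, loops, or aug-assigns)."""
    last_def = {}  # var -> line of latest assignment seen so far
    deps = {}
    for stmt in ast.parse(src).body:
        # record uses before (possibly) updating definitions
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    deps[(node.lineno, node.id)] = last_def[node.id]
        if isinstance(stmt, ast.Assign):
            for tgt in stmt.targets:
                if isinstance(tgt, ast.Name):
                    last_def[tgt.id] = stmt.lineno
    return deps

src = "x = 1\ny = x + 1\nz = x + y\n"
print(data_deps(src))  # {(2, 'x'): 1, (3, 'x'): 1, (3, 'y'): 2}
```

A real analysis would additionally handle control flow (merging definitions across branches), which is exactly what makes these tasks non-trivial for LLMs.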

Citation

```bibtex
@misc{your2025dataset,
  title        = {Your Awesome Dataset},
  author       = {Doe, Jane and Doe, John},
  howpublished = {\url{https://yourdataset.github.io}},
  year         = {2025}
}
```