BigCodeBench addresses the limitations of existing code benchmarks by providing a more comprehensive and challenging suite of 1,140 function-level tasks that require complex reasoning and the compositional use of diverse libraries and function calls. The benchmark aims to better evaluate LLMs' programming capabilities in realistic, practically grounded scenarios.
BigCodeBench consists of two variants. BigCodeBench-Complete targets code completion: the model must implement a function body given its signature and a detailed docstring. BigCodeBench-Instruct targets instruction-tuned LLMs, recasting each task as a concise natural-language instruction to evaluate models in a more conversational setting.
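To make the two prompt styles concrete, here is a hypothetical, simplified task in the spirit of BigCodeBench-Complete (not an actual benchmark item; the function name and docstring format are illustrative assumptions). The model is shown the signature and docstring and must generate the body:

```python
import re
from collections import Counter

def task_func(text):
    """
    Count word frequencies in a string, ignoring case and punctuation.

    Parameters:
    text (str): The input text.

    Returns:
    Counter: A Counter mapping each lowercase word to its frequency.

    Example:
    >>> task_func("Hello, hello world!")
    Counter({'hello': 2, 'world': 1})
    """
    # --- The lines below are what the model is expected to produce ---
    words = re.findall(r"[a-z]+", text.lower())  # lowercase, strip punctuation
    return Counter(words)
```

The BigCodeBench-Instruct version of the same task would instead present a short natural-language prompt, e.g. "Write a function that counts word frequencies in a string, ignoring case and punctuation," leaving the model to infer the interface details that the docstring spells out explicitly.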