Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Further, reliably benchmarking code efficiency across varying hardware specifications is a hurdle, especially for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code editing. On ECCO, we adapt and thoroughly investigate the three most promising existing LLM-based approaches: in-context learning, iterative refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade functional correctness and moderately improve program efficiency, we find that adding execution information often helps maintain functional correctness, while NL feedback is more effective at improving efficiency. We release our benchmark to support future work on LLM-based generation of efficient code.
In ECCO, we evaluate both the runtime and memory efficiency of programs across two paradigms.
In the history-based code editing paradigm, we evaluate the efficiency of programs that LLMs produce by editing an existing correct program, drawing on its execution and editing history.
We extend the relative efficiency metrics from PIE (Shypula et al., 2024) to cover both runtime and memory:
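As a concrete sketch of these metrics (illustrative names of our own, assuming runtime and peak memory are measured for the input and edited programs on identical test inputs):

def speedup(input_runtime: float, edited_runtime: float) -> float:
    # PIE-style speedup: values above 1 mean the edited program runs faster
    # than the slow input program it was edited from.
    return input_runtime / edited_runtime

def memory_reduction(input_memory: float, edited_memory: float) -> float:
    # Memory counterpart of speedup: values above 1 mean the edited program
    # uses less peak memory than the input program.
    return input_memory / edited_memory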
In the NL-instructed generation paradigm, we evaluate the efficiency of programs that LLMs generate from NL prompts. This paradigm tests how well LLMs can interpret and implement efficiency instructions given in natural language.
We introduce relative efficiency metrics that compare generated code against the full spectrum of user-submitted solutions, allowing us to characterize each model's efficiency relative to human solutions:
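One plausible instantiation, sketched below with our own naming, is a percentile over the pool of user-submitted solutions to the same problem, i.e., the share of human submissions that the generated program outperforms on runtime (and, analogously, on peak memory):

def runtime_percentile(generated_runtime: float, user_runtimes: list[float]) -> float:
    # Percentage of user-submitted solutions that the generated program is
    # strictly faster than; the memory metric is computed the same way.
    beaten = sum(1 for runtime in user_runtimes if generated_runtime < runtime)
    return 100.0 * beaten / len(user_runtimes)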
By using these relative metrics, we can assess how our NL-instructed generations compare to human-written code in terms of both speed and memory efficiency.
We explore the three most promising classes of approaches:
In-Context Learning: Comparing the two paradigms, history-based editing achieves substantially higher pass@1 because it can refer to a correct base program, whereas NL-instructed generation has no base program to start from.
Iterative-Refinement: Methods that incorporate NL feedback (self-refine and nl+exec refine) achieve the highest speedup across models, whereas refining on raw execution results (exec-refine) better maintains correctness while optimizing (a minimal sketch of such an execution-feedback loop follows these findings). We conjecture that execution outputs convey functional correctness better than NL descriptions, while high-level optimization strategies are easier to convey in NL.
Multiple iterations of iterative refinement lead to a consistent decrease in pass@1, which highlights the need for more robust methods that emphasize correctness-preserving optimizations.
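To make the exec-refine setup concrete, here is a minimal sketch of such an execution-feedback loop; the generate and run_tests callables are placeholders for a model API and a sandboxed judge, not part of ECCO's released code:

from typing import Callable

def exec_refine(
    problem: str,
    tests: list[dict],
    generate: Callable[[str], str],                      # placeholder for an LLM call
    run_tests: Callable[[str, list[dict]], list[dict]],  # placeholder judge returning per-test results
    rounds: int = 3,
) -> str:
    # Each round, the candidate program is executed on the test cases and the
    # raw execution results are fed back to the model as refinement feedback.
    code = generate(f"Write a correct and efficient Python solution.\n\n{problem}")
    for _ in range(rounds):
        results = run_tests(code, tests)  # e.g. dicts with 'passed', 'input', 'expected', 'actual'
        if all(r["passed"] for r in results):
            feedback = "All tests pass; optimize runtime and memory without breaking correctness."
        else:
            feedback = f"Failing tests: {[r for r in results if not r['passed']]}"
        code = generate(f"{problem}\n\nCurrent solution:\n{code}\n\nExecution feedback:\n{feedback}")
    return code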
@article{waghjale2024ecco,
title={ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?},
author={Waghjale, Siddhant and Veerendranath, Vishruth and Wang, Zora Zhiruo and Fried, Daniel},
journal={arXiv preprint arXiv:2407.14044},
year={2024}
}