ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency

1SKLP, Institute of Computing Technology, CAS, Beijing, China
2UCAS, Beijing, China
*Corresponding Author

Overview

Compiler backends are responsible for generating executable machine code for processors. With the proliferation of diverse processors, programmers must tailor a specific compiler backend to each one. Meanwhile, compiler backend development remains a laborious and time-consuming task that lacks effective automation methods. Although language models have demonstrated strong abilities in code-related tasks, the lack of appropriate datasets for compiler backend development limits their application in this field.
In this paper, we introduce ComBack, the first public dataset designed to improve the compiler backend development capabilities of language models. ComBack includes 178 backends for mainstream compilers and covers three tasks representing common development scenarios: statement-level completion, next-statement suggestion, and code generation. We conducted experiments by fine-tuning six pre-trained language models on ComBack, demonstrating its effectiveness in enhancing model accuracy across the three tasks. We further evaluated the top-performing model (CodeT5+) on the three tasks for new targets, comparing its accuracy with a conventional method (Fork-Flow), ChatGPT-3.5-Turbo, and Code-LLaMA-34B-Instruct. Remarkably, CodeT5+ with only 220M parameters, fine-tuned on ComBack, significantly outperformed the Fork-Flow method and surpassed ChatGPT-3.5-Turbo and Code-LLaMA-34B-Instruct, suggesting potential efficiency improvements in compiler development. ComBack is available at https://huggingface.co/datasets/docz1105/ComBack.

Overview of the ComBack Dataset

ComBack is a comprehensive dataset curated to advance the application of language models in compiler backend development. It amalgamates code from 178 distinct backends, 77 from GCC (versions 3.0-13.0) and 101 from LLVM (versions 2.0.1-17.0.1), sourced from numerous GitHub repositories and official compiler websites and totaling over 5.7 million lines of code. A key feature is its pre-processing, which systematically replaces target-specific values (such as instruction encodings and flags) with generic placeholders. The dataset is structured around three primary development tasks: statement-level completion, next-statement suggestion, and code generation from natural language descriptions. The goal is to enable models to learn common patterns across diverse target architectures (CPUs, MPUs, GPUs) and thereby enhance efficiency in creating and customizing compiler backends.
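To make this concrete, below is a minimal sketch of such value abstraction, assuming hypothetical placeholder spellings (<STR>, <NUM>); ComBack's actual placeholder vocabulary and extraction pipeline may differ.

    import re

    # Minimal sketch of target-value abstraction. The placeholder
    # spellings (<STR>, <NUM>) are assumptions, not ComBack's tokens.
    def abstract_target_values(stmt: str) -> str:
        """Replace target-specific literals in a backend statement with
        generic placeholders so models can learn cross-target patterns."""
        # String literals, e.g. instruction encodings like "\0\0\x40\x03".
        stmt = re.sub(r'"(?:\\.|[^"\\])*"', "<STR>", stmt)
        # Integer literals, e.g. immediate bounds or object sizes.
        stmt = re.sub(r"(?<![\w.])-?\d+\b", "<NUM>", stmt)
        return stmt

    print(abstract_target_values('if (!isImm(Op, 16, 31)) return false;'))
    # -> if (!isImm(Op, <NUM>, <NUM>)) return false;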

Main Results

Experimental results demonstrate the efficacy of the ComBack dataset. Fine-tuning six representative pre-trained language models on ComBack led to substantial accuracy improvements across the three core compiler backend development tasks: statement-level completion, next-statement suggestion, and code generation. Specifically, Edit Distance (ED) scores improved by an average of 41.64 to 77.21 points, and Exact Match (EM) accuracy for statement completion tasks saw absolute gains ranging from 42.58% to 67.77%. Table 2 illustrates this broad improvement.

[Table 2: Accuracy of the six fine-tuned models across the three tasks]
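For reference, the two statement-level metrics can be approximated as below; this is a generic sketch (difflib's ratio stands in for a Levenshtein-style edit-distance score), not ComBack's exact evaluation harness.

    import difflib

    def exact_match(pred: str, ref: str) -> bool:
        # Exact Match (EM): the prediction equals the reference verbatim.
        return pred.strip() == ref.strip()

    def edit_similarity(pred: str, ref: str) -> float:
        # Character-level similarity in [0, 100]; a stand-in for the
        # Levenshtein-based Edit Distance (ED) score.
        return 100.0 * difflib.SequenceMatcher(None, pred, ref).ratio()

    ref = "MFI->setMaxCallFrameSize(maxCallFrameSize);"
    pred = "MFI->setMaxCallFrameSize(maxFrameSize);"
    print(exact_match(pred, ref), round(edit_similarity(pred, ref), 2))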

Crucially, when tasked with generating code for new targets of existing types (such as RISC-V, ARC, and NVPTX), a CodeT5+ model with only 220M parameters, fine-tuned on ComBack, significantly outperformed both ChatGPT-3.5-Turbo and Code-LLaMA-34B-Instruct, as detailed in Table 3. The fine-tuned CodeT5+ also outperformed the conventional Fork-Flow development method for code generation, as shown in Figure 6.

[Figure 6: Code-generation accuracy of fine-tuned CodeT5+ vs. the Fork-Flow method]

As shown in Table 5, fine-tuning CodeT5+ on ComBack iteratively expanded with data for a specific customized target (RI5CY) yielded significant accuracy improvements for that target across all three tasks: Statement-Level Completion EM increased by +7.90%, Next-Statement Suggestion EM by +9.96%, and Code Generation BLEU-4 by +25.05, showcasing the dataset's ability to enhance model performance on specialized, evolving targets.
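A rough sketch of this iterative-expansion workflow appears below; the local file names and the shared example schema are assumptions for illustration, not ComBack's actual layout.

    from datasets import load_dataset, concatenate_datasets

    # Hypothetical local exports of ComBack-style examples; the file
    # names and field schema are assumptions, not ComBack's actual layout.
    base = load_dataset("json", data_files="comback_train.json", split="train")
    ri5cy = load_dataset("json", data_files="ri5cy_examples.json", split="train")

    # Expand the training set with the new target's data, then resume
    # fine-tuning from the previously fine-tuned CodeT5+ checkpoint.
    expanded = concatenate_datasets([base, ri5cy]).shuffle(seed=42)
    print(f"training examples after expansion: {len(expanded)}")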

Example

Figure 4 showcases the kind of target-specific information prevalent in compiler backend code.

  • In Figure 4(a), RISCVMCExpr::VK_RISCV_LO is an enumeration value specific to the RISC-V target, while CSKYMCExpr::VK_CSKY_GOT is specific to CSKY. During ComBack's preprocessing, these distinct values are mapped to a generic placeholder. A language model fine-tuned on ComBack learns that a Kind variable is often assigned such a placeholder in this context, allowing it to correctly suggest or complete similar assignments for new or different targets, even if it has never seen their exact enumeration values.
  • Similarly, in Figure 4(b), the immediate bounds (16, 31) and (-8, 7) used with isImm are replaced by generic numeric placeholders. The model thus learns the structure of isImm calls independent of any target's specific ranges.
  • In Figure 4(c), the instruction encoding strings like "\0\0\x40\x03" and sizes like 4 are replaced by generic string and numeric placeholders, respectively. The model learns the pattern of emitting encodings via OS.write() regardless of the target's concrete values (see the sketch after this list).
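To make these bullets concrete, the sketch below pairs statements resembling those in Figure 4 with their abstracted forms; the placeholder spellings (<ENUM>, <NUM>, <STR>) are illustrative, not ComBack's exact tokens.

    # Statements resembling Figure 4, before and after value abstraction.
    # Placeholder spellings (<ENUM>, <NUM>, <STR>) are illustrative only.
    examples = [
        # (a) target-specific enumeration values
        ("Kind = RISCVMCExpr::VK_RISCV_LO;", "Kind = <ENUM>;"),
        ("Kind = CSKYMCExpr::VK_CSKY_GOT;", "Kind = <ENUM>;"),
        # (b) target-specific immediate bounds
        ("return isImm(Op, 16, 31);", "return isImm(Op, <NUM>, <NUM>);"),
        ("return isImm(Op, -8, 7);", "return isImm(Op, <NUM>, <NUM>);"),
        # (c) target-specific encodings and sizes
        ('OS.write("\\0\\0\\x40\\x03", 4);', "OS.write(<STR>, <NUM>);"),
    ]
    for before, after in examples:
        print(f"{before:40} -> {after}")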
Figure 5 illustrates the three practical development scenarios ComBack addresses; a minimal model-invocation sketch follows the list.

  • Statement-Level Completion: As shown, given an incomplete input like ... adjustReg(...), a model fine-tuned with ComBack can complete the current statement, suggesting MachineInstr::FrameDestroy);. This directly aids in faster and more accurate code entry.
  • Next-Statement Suggestion: Following a completed input statement such as ... maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;, the model predicts the subsequent logical statement, here MFI -> setMaxCallFrameSize(maxCallFrameSize);, guiding developers in constructing code sequences.
  • Code Generation: This example demonstrates how a natural language description (e.g., "Returns a TargetRegisterClass used for pointer values") and target-specific values (e.g., "Sparc, SP::I64RegsRegClass") can be used as input for the model to generate the entire corresponding function body, significantly speeding up the creation of new functions.
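As a rough illustration of how such a model could be invoked for these tasks, the snippet below loads the public CodeT5+ 220M checkpoint via Hugging Face transformers; the ComBack fine-tuned weights and the exact prompt format are assumptions here.

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    # Public base checkpoint; a checkpoint fine-tuned on ComBack would be
    # substituted here (those weights are an assumption in this sketch).
    checkpoint = "Salesforce/codet5p-220m"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)

    # Hypothetical next-statement-suggestion prompt; the input format
    # actually used for ComBack fine-tuning may differ.
    prompt = "maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))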

BibTeX

    @inproceedings{zhong2024comback,
      title={ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency},
      author={Ming Zhong and Fang Lyu and Lulin Wang and Hongna Geng and Lei Qiu and Huimin Cui and Xiaobing Feng},
      booktitle={Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
      year={2024}
    }