VEGA: Automatically Generating Compiler Backends using a Pre-trained Transformer Model

1 SKL of Processors, Institute of Computing Technology, CAS
2 University of Chinese Academy of Sciences
3 UNSW Sydney, Australia
* Corresponding Author

Overview

We introduce VEGA, an AI-driven system aimed at easing the development of compiler backends for new targets. Our approach categorizes functions from existing backends into function groups, each comprising various target-specific implementations of a standard compiler interface function, abstracted as a single function template. Generating a new backend therefore amounts to customizing these function templates to the requirements of a specific target. To capitalize on AI's capabilities in code generation, VEGA maps the statements in each target-specific version of a function template into feature vectors, distinguishing target-independent from target-specific properties. Leveraging a pre-trained model, VEGA can efficiently auto-generate a version of each function template tailored to a specific target, thereby enabling the construction of a complete compiler backend for a new target based solely on its target description files.

We evaluated VEGA on three distinct targets: a CPU processor (RISC-V), a customized processor with instruction extensions (RI5CY), and an IoT processor (xCORE). VEGA demonstrated high efficiency, generating each compiler backend in under an hour, which can substantially enhance developer productivity. Across the three targets, VEGA achieved accuracy rates of 71.5%, 73.2%, and 62.2% for all generated functions, significantly outperforming the traditional fork-flow method, which yielded less than 8% accuracy. Moreover, VEGA provides explicit confidence scores for generated functions and statements, allowing developers to quickly identify the areas that require manual intervention and keep that effort minimal. This research has the potential to improve the effectiveness of traditional compiler backend development.

The Architecture of VEGA

VEGA is an automated system that uses a pre-trained transformer model to generate compiler backends. By distinguishing target-specific from target-independent features, it can efficiently produce backend code for a new architecture, requiring only its target description files as input.

Main Results

VEGA demonstrates significant improvements in compiler backend generation efficiency and accuracy. It can generate complete backends for new targets like RISC-V, RI5CY, and xCORE in under an hour. Evaluation using pass@1 on LLVM regression tests shows average function-level accuracy rates of 71.5% (RISC-V), 73.2% (RI5CY), and 62.2% (xCORE). This vastly outperforms the traditional fork-flow approach, which achieved less than 8% accuracy for these targets. VEGA also provides confidence scores, aiding developers in identifying potentially incorrect code sections, further enhancing productivity.

[Figure: VEGA accuracy results]

Compared to the manual effort required by the traditional fork-flow method, VEGA drastically reduces the need for modifications. While fork-flow required over 85% of statements to be manually changed, VEGA achieves high statement-level accuracy (e.g., 55.0% for RISC-V, 58.5% for RI5CY) automatically, minimizing manual intervention.

[Figure: VEGA vs. fork-flow comparison]

Example

The paper walks through a complete example of VEGA's workflow: automatically generating the getRelocType function for the RISC-V target. From the existing implementations of this interface function, VEGA abstracts a function template, identifying its target-independent and target-specific parts.

[Figure: example-1]
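To make the template idea concrete, here is a small illustrative sketch (not VEGA's actual internal representation): a function template is modeled as a sequence of statements in which target-independent statements are fixed and target-specific slots are filled in per target. The getRelocType signature shown is simplified from LLVM's real interface.

# Illustrative sketch only -- not VEGA's internal representation.
# A function template is a list of statements; plain strings are
# target-independent, Slot entries must be filled per target.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Slot:
    hint: str  # what the target-specific statement should do

Template = List[Union[str, Slot]]

# Simplified template abstracted from existing getRelocType implementations.
GET_RELOC_TYPE_TEMPLATE: Template = [
    "unsigned getRelocType(const MCFixup &Fixup, bool IsPCRel) {",
    "  switch (Fixup.getKind()) {",
    Slot(hint="map each target fixup kind to its ELF relocation type"),
    "  }",
    "}",
]

def instantiate(template: Template, fills: List[str]) -> str:
    """Replace each Slot with the next target-specific statement."""
    it = iter(fills)
    return "\n".join(s if isinstance(s, str) else next(it) for s in template)

# Hypothetical target-specific fill for a new target:
print(instantiate(GET_RELOC_TYPE_TEMPLATE,
                  ["  case FK_Data_4: return ELF::R_RISCV_32;"]))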

First, VEGA collects existing implementations of getRelocType from ARM and MIPS backends and aligns their code using GumTree.

[Figure: example-2]
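VEGA performs this alignment with GumTree, an AST differencing tool. The sketch below is only a rough line-level stand-in built on Python's difflib; it shows how aligning two simplified implementations exposes which statements are target-independent (identical across targets) and which are target-specific (they differ).

# Rough illustration of statement alignment. VEGA uses GumTree on ASTs;
# this stand-in aligns simplified statement lists with difflib instead.
from difflib import SequenceMatcher

arm_stmts = [
    "switch (Fixup.getKind()) {",
    "case FK_Data_4: return ELF::R_ARM_ABS32;",
    "default: return ELF::R_ARM_NONE;",
    "}",
]
mips_stmts = [
    "switch (Fixup.getKind()) {",
    "case FK_Data_4: return ELF::R_MIPS_32;",
    "default: return ELF::R_MIPS_NONE;",
    "}",
]

# 'equal' runs are shared statements (target-independent); 'replace' runs
# pair up the differing, target-specific statements across the two targets.
sm = SequenceMatcher(None, arm_stmts, mips_stmts)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal":
        for s in arm_stmts[i1:i2]:
            print("target-independent:", s)
    elif tag == "replace":
        for a, m in zip(arm_stmts[i1:i2], mips_stmts[j1:j2]):
            print("target-specific   :", a, "<->", m)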

Next, VEGA extracts feature vectors from these implementations, capturing semantic properties relevant to backend generation. Using these feature vectors, it fine-tunes a pre-trained transformer model. Finally, given only the RISC-V target description files, VEGA generates a new RISC-V-specific getRelocType function. Throughout the process, it assigns confidence scores to each generated statement, helping developers quickly locate and fix uncertain parts.
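The exact feature encoding and confidence model are described in the paper; the hypothetical records below merely illustrate the idea that each generated statement carries a target-specific flag and a confidence score, so low-confidence statements can be surfaced for review. The field names and threshold are made up for this sketch.

# Hypothetical per-statement records for a generated function; the schema
# and the review threshold are illustrative, not VEGA's actual format.
generated = [
    {"stmt": "switch (Fixup.getKind()) {",
     "target_specific": False, "confidence": 0.98},
    {"stmt": "case FK_Data_4: return ELF::R_RISCV_32;",
     "target_specific": True, "confidence": 0.91},
    {"stmt": "case FK_Data_8: return ELF::R_RISCV_64;",
     "target_specific": True, "confidence": 0.46},
    {"stmt": "}",
     "target_specific": False, "confidence": 0.99},
]

REVIEW_THRESHOLD = 0.5  # illustrative cutoff for flagging statements

# A function-level confidence could be summarized as the mean statement score.
function_confidence = sum(s["confidence"] for s in generated) / len(generated)
print(f"function confidence ~ {function_confidence:.2f}")

# Surface low-confidence statements for manual inspection.
for s in generated:
    if s["confidence"] < REVIEW_THRESHOLD:
        print("review:", s["stmt"])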

Quick Start

Hardware Dependencies:
- 8 Nvidia Tesla V100 GPUs, each with 16 GB memory.

Software Dependencies:
- CUDA version: 11.7
- Python version: 3.8.1
- Conda: Any version that supports Python 3.8.1

Dataset and Base Model
ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency
- Paper: https://neurips.cc/virtual/2024/poster/97455
- Dataset & Models: https://huggingface.co/docz1105/ComBack_Models
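If you want to poke at the base model outside the provided scripts, something like the following may work. This is a minimal sketch assuming the UniXcoder checkpoint loads through Hugging Face transformers (it is RoBERTa-based); the fine-tuned VEGA checkpoints are meant to be driven through the scripts below (run_test.sh and friends).

# Minimal sketch: load the UniXcoder base model with Hugging Face transformers.
# Assumes transformers/torch are installed (see requirements.txt).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

inputs = tokenizer("unsigned getRelocType(const MCFixup &Fixup);",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)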

Follow these steps to set up the environment and run a functionality test.

# 1. Clone the repository
git lfs clone https://huggingface.co/docz1105/VEGA_AE
cd VEGA_AE

# 2. Set up Conda environment
# Option A: Use provided YAML file
conda env create -f vega_ae.yml
conda activate vega_ae

# Option B: Create manually
conda create -n vega_ae python=3.8.1
conda activate vega_ae
pip install -r requirements.txt

# 3. Run Functionality Test (Generates one function for RI5CY)
# Takes < 3 minutes on 8x V100 GPUs
bash run_function_test.sh

# Check the output (generated code vs ground truth)
cat ./models/FT_Model/result.jsonl

# 4. Run Full Code Generation (for RISC-V, RI5CY, xCORE)
# Uses the provided fine-tuned model and test data
bash run_test.sh
# (Results will be in ./models/FT_Model/result.jsonl, overwriting previous results)

# 5. (Optional) Fine-tune the model from scratch
# Uses the original UniXcoder and training data
bash run_fine_tuning.sh
# (New model saved in ./models/New_FT_Model)

BibTeX


@inproceedings{zhong2025vega,
  title={VEGA: Automatically Generating Compiler Backends Using a Pre-Trained Transformer Model},
  author={Ming Zhong and Fang Lv and Lulin Wang and Lei Qiu and Yingying Wang and Ying Liu and Huimin Cui and Xiaobing Feng and Jingling Xue},
  booktitle={2025 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
  year={2025}
}