LLM4Decompile: Decompiling Binary Code with LLM

LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models

For more details, please check out the paper.

0. Updates

2024.03.16 Added the llm4decompile-6.7b-uo model, which is trained without prior knowledge of the optimization levels (O0~O3); the average re-executability is around 0.21.

1. Introduction of LLM4Decompile and Decompile-Eval

Our goal is to create and release the first open-source LLM dedicated to decompilation, and to assess its capabilities by constructing the first decompilation benchmark focused on re-compilability and re-executability.

We start by compiling a million C code samples from AnghaBench into assembly code using GCC with different configurations, forming a dataset of assembly-source pairs of 4 billion tokens. We then fine-tune the DeepSeek-Coder model, a leading code LLM, using this dataset. This is followed by constructing the evaluation benchmark, Decompile-Eval, based on HumanEval questions and test samples. Specifically, we formulate the evaluation from two perspectives: whether the decompiled code can recompile successfully, and whether it passes all assertions in the test cases.

Figure 1 presents the steps involved in our decompilation evaluation. First, the source code (denoted as src) is compiled by the GCC compiler with specific parameters, such as optimization levels, to produce the executable binary. This binary is then disassembled into assembly language (asm) using the objdump tool. The assembly instructions are subsequently decompiled to reconstruct the source code in a format that is readable to humans (noted as src’). To assess the quality of the decompiled code (src’), it is tested for its ability to be recompiled with the original GCC compiler (re-compilability) and for its functionality through test assertions (re-executability).

2. Evaluation Results

Metrics

Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity. It ensures that the decompiled code is not only readable, but also adheres to the structural and syntactical standards expected by the compiler.
However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess whether the decompilation preserved the program logic and behavior.
Together, re-compilability and re-executability indicate syntax recovery and semantic preservation – both essential for usable and robust decompilation.
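
As a rough illustration, the two checks could be scripted as below. This is a minimal sketch, not the official evaluation harness: the helper name, the temporary paths, and the way the test assertions are appended to the decompiled function are assumptions for illustration.

import os
import subprocess
import tempfile

def evaluate_decompiled(decompiled_src, c_test):
    # Append the test assertions (a main() with asserts) to the decompiled function.
    program = decompiled_src + '\n' + c_test
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, 'candidate.c')
        bin_path = os.path.join(tmp, 'candidate')
        with open(src_path, 'w') as f:
            f.write(program)
        # Re-compilability: GCC accepts the decompiled code without errors.
        compiled = subprocess.run(['gcc', src_path, '-o', bin_path, '-lm'],
                                  capture_output=True)
        if compiled.returncode != 0:
            return False, False
        # Re-executability: the recompiled binary passes all assertions (exit code 0).
        executed = subprocess.run([bin_path], capture_output=True, timeout=10)
        return True, executed.returncode == 0

Averaging the two flags over the benchmark samples then gives the re-compilability and re-executability rates.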

Results

3. How to Use The Model

Our LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face.

llm4decompile-1.3b

llm4decompile-6.7b

llm4decompile-33b

llm4decompile-6.7b-nsp

llm4decompile-6.7b-uo

Note: The NSP model is trained with assembly code; the average re-executability is around 0.17.

Note: The unified optimization (UO) model is trained without prior knowledge of the optimization levels (O0~O3); the average re-executability is around 0.21. The pre-processing of the UO model is slightly different (no prior knowledge of the On), please check the model page.
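
For illustration only, the difference in the prompt might look like the sketch below; the exact UO wording is an assumption here, so please refer to the model page for the authoritative template.

# Illustration only -- the UO prompt wording is assumed; check the model page for the real template.
input_asm = "<assembly instructions produced by the preprocessing step in Section 3>"
# Standard models: the optimization level is stated in the prompt.
prompt_std = "# This is the assembly code with O2 optimization:\n" + input_asm + "\n# What is the source code?\n"
# UO model (assumption): the optimization level is omitted, since it is trained without knowledge of O0~O3.
prompt_uo = "# This is the assembly code:\n" + input_asm + "\n# What is the source code?\n"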

Here is an example of how to use our model.

Preprocessing: compile the C code into binary, disassemble the binary into assembly instructions.


import subprocess
import os
import re

digit_pattern = r'\b0x[a-fA-F0-9]+\b'# hexadecimal values in the disassembly
zeros_pattern = r'^0+\s'# leading zeros
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'path/to/file'
with open(fileName+'.c','r') as f:#original file
    c_func = f.read()
for opt_state in OPT:
    output_file = fileName +'_' + opt_state
    input_file = fileName+'.c'
    compile_command = f'gcc -c -o {output_file}.o {input_file} -{opt_state} -lm'#compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o> {output_file}.s'#disassemble the binary file into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)
    
    input_asm = ''
    with open(output_file+'.s') as f:#asm file
        asm= f.read()
    asm = asm.split('Disassembly of section .text:')[-1].strip()
    for tmp in asm.split('\n'):
        tmp_asm = tmp.split('\t')[-1]#remove the binary code
        tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
        input_asm+=tmp_asm+'\n'
    input_asm = re.sub(zeros_pattern, '', input_asm)
    before = f"# This is the assembly code with {opt_state} optimization:\n"#prompt
    after = "\n# What is the source code?\n"#prompt
    input_asm_prompt = before+input_asm.strip()+after
    with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
        f.write(input_asm_prompt)

Decompilation: use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'arise-sustech/llm4decompile-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + opt_state +'.asm','r') as f:#assembly prompt file from preprocessing
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=500)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
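
The decoded string can then be printed or written out for inspection; a minimal follow-up (the output filename below is arbitrary) could be:

print(f'Decompiled function ({opt_state}):')
print(c_func_decompile)
with open(fileName + '_' + opt_state + '_decompiled.c', 'w') as f:  # arbitrary output name
    f.write(c_func_decompile)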

4. How to Use Decompile-Eval

Data are stored in llm4decompile/decompile-eval/decompile-eval.json, using the JSON list format. There are 164*4 (O0, O1, O2, O3) samples, each with 5 keys (a minimal loading sketch follows the list):

  • task_id: indicates the ID of the problem.
  • type: the optimization level, one of [O0, O1, O2, O3].
  • c_func: the C solution for the HumanEval problem.
  • c_test: the C test assertions.
  • input_asm_prompt: assembly instructions with prompts, which can be derived as in our preprocessing example.
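
A minimal sketch of loading the benchmark and reading these keys, assuming the repository layout above:

import json

with open('llm4decompile/decompile-eval/decompile-eval.json') as f:
    samples = json.load(f)  # JSON list with 164*4 entries

for sample in samples:
    task_id = sample['task_id']              # ID of the HumanEval problem
    opt_level = sample['type']               # one of O0, O1, O2, O3
    c_func = sample['c_func']                # reference C solution
    c_test = sample['c_test']                # C test assertions
    asm_prompt = sample['input_asm_prompt']  # assembly instructions with the prompt
    # Feed asm_prompt to LLM4Decompile as in Section 3, then recompile the output
    # together with c_test to measure re-compilability and re-executability.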

To run the evaluation on a single GPU and a single process:

cd LLM4Decompile
python ./evaluation/run_evaluation_llm4decompile_singleGPU.py

To run the evaluation using TGI (10x faster, supports multiple GPUs and multi-process):
First, please install text-generation-inference following the official link.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
pip install -r requirements.txt

# Before running the evaluation script, please update model_path to your local model path.
bash ./scripts/run_evaluation_llm4decompile.sh

5. Ongoing Work

LLM4Binary: We plan to include a larger dataset to pre-train the model with assembly code and C code.

Decompiler-ALL: Support more languages/platforms and settings (e.g., decompiling multiple functions).

6. License

This code repository is licensed under the MIT License.

7. Contact

If you have any questions, please raise an issue.

8. Thoughts

The discussion about a language model decompiler that took place on Reddit roughly a year ago was quite inspiring to us.

9. Citation

@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, 
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}