Our paper was published at the Network and Distributed System Security Symposium (NDSS) 2019. If you use the provided resources for academic research, please cite the following paper.

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs.

@inproceedings{zuo2019neural,
title={Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs},
author={Zuo, Fei and Li, Xiaopeng and Young, Patrick and Luo, Lannan and Zeng, Qiang and Zhang, Zhexin},
booktitle={Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS)},
year={2019}
}
* We previously submitted the paper to NDSS 2018 (in August 2017) and to S&P 2019 (in May 2018); it was finally accepted to NDSS 2019 after significant improvements. The main NMT-inspired idea, however, remains the same. Here is our NDSS 2018 submission page.

Prerequisites:

Make sure you have installed all of the following packages or libraries (including their dependencies, if necessary) on your computer:

  1. gensim
  2. NLTK
  3. Scikit-learn
  4. TensorFlow
  5. Keras (version ≤ 2.1.4)
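As a quick sanity check before running the tool, the following sketch reports which of the packages above can be imported and whether the installed Keras satisfies the version ≤ 2.1.4 constraint. The helper names (`version_tuple`, `installed_versions`, `keras_ok`) are hypothetical and not part of the provided scripts; the check relies on each package exposing a `__version__` attribute.

```python
import importlib
import re

# Import names of the required packages (scikit-learn is imported as "sklearn").
REQUIRED = ["gensim", "nltk", "sklearn", "tensorflow", "keras"]
MAX_KERAS = (2, 1, 4)  # upper bound stated in the list above


def version_tuple(version):
    """Turn a version string such as '2.1.4' into (2, 1, 4).

    Parsing stops at the first dot-separated piece that does not start with a digit.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_versions(names=REQUIRED):
    """Return {import name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


def keras_ok(version):
    """True only if a parsable Keras version is present and not newer than 2.1.4."""
    if version is None:
        return False
    parsed = version_tuple(version)
    return bool(parsed) and parsed <= MAX_KERAS


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(name, "->", version or "NOT INSTALLED")
    if not keras_ok(versions["keras"]):
        print("Keras is missing or newer than 2.1.4")
```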

Download Embedding Files

We have trained three sets of instruction embeddings, with dimensions 50, 100, and 150. To use our Siamese-network-based tool for binary similarity detection, you should first download them from the link.

Ideally, every instruction has an embedding in the pre-trained .w2v files. If not, any unknown word is replaced with a zero vector.
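The zero-vector fallback described above can be sketched as follows. This is a minimal illustration, assuming the embeddings have already been loaded into a mapping from instruction string to vector that supports `in` and `[]` (how the .w2v files are loaded is not shown here). The function name `embed_instructions`, the toy two-dimensional table, and its tokens are hypothetical; plain Python lists stand in for real vectors to keep the sketch dependency-free.

```python
def embed_instructions(instructions, embeddings, dim):
    """Map each instruction to its vector; unknown instructions get a zero vector."""
    vectors = []
    for ins in instructions:
        if ins in embeddings:
            vectors.append(list(embeddings[ins]))
        else:
            # Out-of-vocabulary instruction: fall back to a zero vector of length dim.
            vectors.append([0.0] * dim)
    return vectors


# Toy 2-D table standing in for a real 50/100/150-D embedding file (hypothetical values).
toy_embeddings = {
    "mov": [0.1, -0.3],
    "ret": [0.7, 0.2],
}

block = ["mov", "never_seen_instruction", "ret"]
vectors = embed_instructions(block, toy_embeddings, dim=2)
print(vectors)
```

In real use, `dim` should match the dimension of the embedding file that was downloaded (50, 100, or 150).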

Input Data

Only well pre-processed data can be accepted by the Siamese-neural-network-based binary similarity detector. To demonstrate the usage of the Siamese model, we provide some input samples (e.g. test_set_O2.csv), which you can download from the link.

Please note that, as described in our NDSS paper, we collect the assembly code via LLVM. Thus, the assembly code in our dataset may differ from that obtained from a reverse engineering tool such as IDA Pro.

Pre-trained Model

As an example, we provide the pre-trained model weights for a Siamese-network-based binary similarity detector; please click the link to download them. In detail, each sub-network in this Siamese network is a two-layer LSTM that takes a 100D vector (one per instruction embedding) as input. Please refer to the example Python script to re-run the test case.
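This README does not spell out how the outputs of the two sub-networks are compared. One common choice for Siamese LSTM networks, assumed here purely for illustration, is to map the Manhattan distance between the two output vectors to a score in (0, 1] via exp(-||h1 - h2||_1). The sketch below uses plain Python lists in place of actual LSTM outputs; the function name and the sample vectors are hypothetical.

```python
import math


def manhattan_similarity(h1, h2):
    """Similarity in (0, 1]: 1.0 for identical vectors, approaching 0 as they diverge.

    Assumed scoring function (exp of the negative L1 distance), not confirmed
    by this README.
    """
    if len(h1) != len(h2):
        raise ValueError("vectors must have the same length")
    distance = sum(abs(a - b) for a, b in zip(h1, h2))
    return math.exp(-distance)


# Stand-ins for the final hidden states of the two LSTM sub-networks (hypothetical values).
same = manhattan_similarity([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])
different = manhattan_similarity([0.0, 0.0], [1.0, 1.0])
print(same, different)
```

A score close to 1 would then indicate that the two inputs are likely similar, and a score close to 0 that they are dissimilar.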

You can modify the LLVM backend to automatically output the boundaries of basic blocks (in the .s file) and assign an identifier to each of them. To do that, you need to replace the original AsmPrinter.cpp file in the LLVM project with the one provided here.