Identifying x86_64 ELF Symbols in Stripped Binaries using AI
Scenario
Let's say we have a binary that we have either reverse engineered or have the associated debugging information. We will refer to this executable as binary S. We also have another executable called P that we know is similar to binary S in some way but it is stripped and has no debugging information. Binary P could have a similar source code to binary S or could be made from the same statically linked library. However, S and P are sufficiently different that they cannot be diffed with BinDiff.
We want to use information from binary S to transpose symbols from S to P.
This blog post will explain how the AI we are developing at RevEng.AI does exactly that. We will look at one technique that explores our program comprehension embeddings API to locate symbols in P given a single known anchor point in it.
Setting the scene
As a running example, S will be the binary miredo
from the Debian Sid package miredo/amd64
. Miredo is a Teredo client (RFC 4380) that tunnels IPv6 packets in IPv4 UDP packets. We don't have access to the source code but we can access symbol information via the miredo-dbgsym/amd64
package. It is important to note that RFC 4380 uses a combination of MD5 and HMAC in its protocol design. You can download binary S below.
As for executable P, we will be using the binary fdupes
from the Debian Sid package fdupes/amd64
. FDupes is a program that identifies duplicate files within directories using MD5 hashes and byte by byte comparisons. You can download binary P below.
Our intuition informs us that, given both binaries contain code for processing MD5, can we port MD5 related symbols from S to P?
An inspection of our source binary S reveals the following symbols:
A quick internet search for "md5" and "fdupes" reveals that FDupes has the same md5_process
, md5_finish
, md5_append
, and md5_init
functions in its source code (Github search).
md5_init
is a small 42 byte function and should be easy to identify in Ghidra. For the rest of this blog post we are going to assume that we know the location of a single symbol in P - md5_init
. This will be our known anchor point in P.
We are going to use this anchor point with the relative difference between symbols' embedded representations from S to locate another symbol in P. Before we explain exactly what that means or why we would want to do that, take a look at md5.c and you will see that md5_process
is not a direct caller or callee of md5_init
and therefore cannot easily be identified by simple static rules.
We will now explain how to locate md5_process
in the stripped binary P using md5_init
and the relative difference between md5_process
and md5_init
in our source executable S.
Machine Learning Background
RevEng.AI's BinNet model uses a deep multi-output transformer architecture to capture the semantic meaning of binary code. It represents binary code (and all strings, XREFS, constants, etc.) as real-valued vectors that can be used for downstream ML tasks.
To understand this properly you need to have a basic understanding of Word2Vec and the concept of word representations. We recommend pausing here and reading the following blog post if you are unfamiliar with modern NLP techniques.
The famous example from Word2Vec allowed for the vector representations of the words king
- man
+ woman
≈ queen
. This proved that the vector space generated by Word2Vec captured semantic information beyond what was previously possible by Neural Network Language Models (NNLMs).
You might be able to guess where this is going...
Recovering Symbol Information using Embedded Symbol Representations
We are going to use RevEng.AI's BinNet model and embeddings API to locate md5_process
in P by calculating the difference between the embedded representations of md5_process
- md5_init
in S and adding the representation of md5_init
in P.
We will use the following Python code:
Our query vector is a real-valued numpy vector of 32 bit floats:
We now need to find the closest symbol in P to our query vector. We do this using the cosine similarity against all functions that Ghidra found in P.
The closest BinNet embeddings in P to our query can then be accessed by looking up the row corresponding to the closest index:
And now we can find the matched symbols virtual address:
The symbol md5_process
should have a virtual address of 14068 (0x36f4) in our binary P. If we now download symbol information from the fdupes-dbgsym/amd64
Debian package and merge the ELF files we indeed find this is correct.
How difficult was this?
OK, BinDiff doesn't work but can't we just hash a function's disassembly code and find matches between the two functions? Can we use IDA FLIRT?
Let's download the debugging package and inspect P with symbols. The first thing to note is that md5_process
is 2113 bytes in S and 2067 bytes in P. Matching hashes clearly won't work.
Is the disassembly similar? Let's take a look with Rizin:
fdupes
miredo
The disassembly is significantly different and would not match with hashing or semantic hashing. IDA FLIRT works by matching the first 32 bytes of each function and would fail to find the match.
Discussion
Why would we want to do this? RevEng.AI's BinNet embeddings aren't just useful for matching symbol names. They inherently capture the semantic meaning of binary code and could be used for:
- Porting symbols between different malware samples or OS versions
- Identifying known vulnerabilities in closed source software
- Detecting malware embedded inside proprietary software
- Aiding reverse engineering of never seen before programs
- Speeding up fuzzing by directing path selection to heap allocation functions or other defined behaviour
There are far more efficient ways to use RevEng.AI's API to identify symbols in stripped binaries but this has been a fun example. If you are interested in using our API please contact us via the contact page at https://reveng.ai.
No source code was analysed by our AI in the development of this example.