Identifying x86_64 ELF Symbols in Stripped Binaries using AI
Scenario
Let's say we have a binary that we have either reverse engineered or have the associated debugging information. We will refer to this executable as binary S. We also have another executable called P that we know is similar to binary S in some way but it is stripped and has no debugging information. Binary P could have a similar source code to binary S or could be made from the same statically linked library. However, S and P are sufficiently different that they cannot be diffed with BinDiff.
We want to use information from binary S to transpose symbols from S to P.
This blog post will explain how the AI we are developing at RevEng.AI does exactly that. We will look at one technique that explores our program comprehension embeddings API to locate symbols in P given a single known anchor point in it.
Setting the scene
As a running example, S will be the binary miredo
from the Debian Sid package miredo/amd64
. Miredo is a Teredo client (RFC 4380) that tunnels IPv6 packets in IPv4 UDP packets. We don't have access to the source code but we can access symbol information via the miredo-dbgsym/amd64
package. It is important to note that RFC 4380 uses a combination of MD5 and HMAC in its protocol design. You can download binary S below.
As for executable P, we will be using the binary fdupes
from the Debian Sid package fdupes/amd64
. FDupes is a program that identifies duplicate files within directories using MD5 hashes and byte by byte comparisons. You can download binary P below.
Our intuition informs us that, given both binaries contain code for processing MD5, can we port MD5 related symbols from S to P?
An inspection of our source binary S reveals the following symbols:
root@desyl:~# readelf -s /usr/sbin/miredo | grep -i md5
81: 0000000000000000 0 FILE LOCAL DEFAULT ABS md5.c
82: 0000000000006360 2113 FUNC LOCAL DEFAULT 14 md5_process
138: 0000000000006e10 194 FUNC GLOBAL DEFAULT 14 md5_finish
369: 0000000000006be0 552 FUNC GLOBAL DEFAULT 14 md5_append
396: 0000000000006bb0 42 FUNC GLOBAL DEFAULT 14 md5_init
A quick internet search for "md5" and "fdupes" reveals that FDupes has the same md5_process
, md5_finish
, md5_append
, and md5_init
functions in its source code (Github search).
md5_init
is a small 42 byte function and should be easy to identify in Ghidra. For the rest of this blog post we are going to assume that we know the location of a single symbol in P - md5_init
. This will be our known anchor point in P.
We are going to use this anchor point with the relative difference between symbols' embedded representations from S to locate another symbol in P. Before we explain exactly what that means or why we would want to do that, take a look at md5.c and you will see that md5_process
is not a direct caller or callee of md5_init
and therefore cannot easily be identified by simple static rules.
We will now explain how to locate md5_process
in the stripped binary P using md5_init
and the relative difference between md5_process
and md5_init
in our source executable S.
Machine Learning Background
RevEng.AI's BinNet model uses a deep multi-output transformer architecture to capture the semantic meaning of binary code. It represents binary code (and all strings, XREFS, constants, etc.) as real-valued vectors that can be used for downstream ML tasks.
To understand this properly you need to have a basic understanding of Word2Vec and the concept of word representations. We recommend pausing here and reading the following blog post if you are unfamiliar with modern NLP techniques.
The famous example from Word2Vec allowed for the vector representations of the words king
- man
+ woman
≈ queen
. This proved that the vector space generated by Word2Vec captured semantic information beyond what was previously possible by Neural Network Language Models (NNLMs).
You might be able to guess where this is going...
Recovering Symbol Information using Embedded Symbol Representations
We are going to use RevEng.AI's BinNet model and embeddings API to locate md5_process
in P by calculating the difference between the embedded representations of md5_process
- md5_init
in S and adding the representation of md5_init
in P.
We will use the following Python code:
from reveng import BinNet, Binary, Symbol
bn = BinNet(apikey='freeware')
S = Binary(path='/usr/bin/miredo')
P = Binary(path='/usr/bin/fdupes')
# run binary analysis, P is stripped
# extract function boundaries from ghidra
S.analyse()
P.analyse(FBD='ghidra', stripped=True)
# store embeddings in numpy matrix, shape=(N, 192)
# N is number of symbols in binary
# 192 is the number of dimensions for BinNet embeddings
Xs = bn.get_embeddings(S.symbols)
Xp = bn.get_embeddings(P.symbols)
# get embeddings for each symbol
# md5_init is at 0x00003f07 in fdupes
s_md5_init = Xs[S.get_symbol('md5_init'), :]
s_md5_process = Xs[S.get_symbol('md5_process'), :]
p_md5_init = Xp[S.get_symbol_at_loc(0x00003f07), :]
# build query vector
query_vec = s_md5_process - s_md5_init + p_md5_init
Our query vector is a real-valued numpy vector of 32 bit floats:
In [216]: query_vec
Out[216]:
array([[-1.0754741 , -0.53394 , 2.5343153 , -0.8589107 , -0.80127776,
3.6827471 , -0.2754942 , -0.25956583, -0.9386326 , -0.8809752 ,
-0.94144917, -0.9657012 , -0.37274447, 3.551879 , 1.4170911 ,
...
-0.5281741 , -0.65251297, -0.9238823 , 1.2517035 , -0.77191585,
-0.9819512 , -0.695347 , -0.47379667, -0.6443882 , 0.8159219 ,
-0.20785426, -0.52132106, -0.7369828 , 2.757089 , -0.07254058,
-1.1361468 , -0.45225745]], dtype=float32)
We now need to find the closest symbol in P to our query vector. We do this using the cosine similarity against all functions that Ghidra found in P.
from sklearn.metrics.pairwise import cosine_similarity
closest = cosine_similarity(Xp, query_vec).squeeze().argsort()[::-1][0]
The closest BinNet embeddings in P to our query can then be accessed by looking up the row corresponding to the closest index:
In [235]: Xp[closest, :]
Out[235]:
array([[-1.03495793, -0.60625399, 2.51171437, -0.84164309, -0.83068409,
3.69240599, -0.31704739, -0.22428862, -1.0581016 , -0.96066797,
-0.87574955, -0.95887101, -0.40816309, 3.62616827, 1.39601726,
...
-0.46266309, -0.58537404, -0.87706089, 1.23309646, -0.75341943,
-0.94693962, -0.58251833, -0.45131244, -0.71748744, 0.77900041,
-0.25757586, -0.56068685, -0.74393312, 2.77102427, -0.04606016,
-1.16797348, -0.41193305]], dtype=float32)
And now we can find the matched symbols virtual address:
In [236]: P.symbols[closest].vaddr
Out[236]: 14068
The symbol md5_process
should have a virtual address of 14068 (0x36f4) in our binary P. If we now download symbol information from the fdupes-dbgsym/amd64
Debian package and merge the ELF files we indeed find this is correct.
How difficult was this?
OK, BinDiff doesn't work but can't we just hash a function's disassembly code and find matches between the two functions? Can we use IDA FLIRT?
Let's download the debugging package and inspect P with symbols. The first thing to note is that md5_process
is 2113 bytes in S and 2067 bytes in P. Matching hashes clearly won't work.
Is the disassembly similar? Let's take a look with Rizin:
fdupes
[0x000013b0]> pdf @ sym.md5_process
┌ 2067: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
│ 0x000036f4 4157 push r15
│ 0x000036f6 4156 push r14
│ 0x000036f8 4155 push r13
│ 0x000036fa 4154 push r12
│ 0x000036fc 55 push rbp
│ 0x000036fd 53 push rbx
│ 0x000036fe 8b4708 mov eax, dword [rdi + 8]
│ 0x00003701 894424b8 mov dword [rsp - 0x48], eax
│ 0x00003705 8b470c mov eax, dword [rdi + 0xc]
│ 0x00003708 89442498 mov dword [rsp - 0x68], eax
│ 0x0000370c 8b4710 mov eax, dword [rdi + 0x10]
│ 0x0000370f 8944249c mov dword [rsp - 0x64], eax
│ 0x00003713 8b4714 mov eax, dword [rdi + 0x14]
│ 0x00003716 894424a0 mov dword [rsp - 0x60], eax
│ 0x0000371a 4889f0 mov rax, rsi
│ 0x0000371d 40f6c603 test sil, 3
│ ┌─< 0x00003721 744c je 0x376f
│ │ 0x00003723 488b06 mov rax, qword [rsi]
│ │ 0x00003726 48894424c0 mov qword [rsp - 0x40], rax
│ │ 0x0000372b 488b4608 mov rax, qword [rsi + 8]
miredo
[0x00003560]> pdf @ sym.md5_process
┌ 2113: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
| 0x00006360 4157 push r15
│ 0x00006362 4156 push r14
│ 0x00006364 4155 push r13
│ 0x00006366 4154 push r12
│ 0x00006368 55 push rbp
│ 0x00006369 53 push rbx
│ 0x0000636a 4881ec880000. sub rsp, 0x88
│ 0x00006371 64488b042528. mov rax, qword fs:[0x28]
│ 0x0000637a 4889442478 mov qword [var_78h], rax
│ 0x0000637f 31c0 xor eax, eax
│ 0x00006381 8b4708 mov eax, dword [rdi + 8]
│ 0x00006384 89442428 mov dword [var_28h], eax
The disassembly is significantly different and would not match with hashing or semantic hashing. IDA FLIRT works by matching the first 32 bytes of each function and would fail to find the match.
Discussion
Why would we want to do this? RevEng.AI's BinNet embeddings aren't just useful for matching symbol names. They inherently capture the semantic meaning of binary code and could be used for:
- Porting symbols between different malware samples or OS versions
- Identifying known vulnerabilities in closed source software
- Detecting malware embedded inside proprietary software
- Aiding reverse engineering of never seen before programs
- Speeding up fuzzing by directing path selection to heap allocation functions or other defined behaviour
There are far more efficient ways to use RevEng.AI's API to identify symbols in stripped binaries but this has been a fun example. If you are interested in using our API please contact us via the contact page at https://reveng.ai.
No source code was analysed by our AI in the development of this example.