Identifying x86_64 ELF Symbols in Stripped Binaries using AI
This blog post is the first in a new series that explains some of the new capabilities we are developing at RevEng.AI - starting with identifying symbols relative to known anchor points. What exactly does that mean?
Let's say we have a binary that we have either reverse engineered or have the associated debugging information. We will refer to this executable as binary S. We also have another executable called P that we know is similar to binary S in some way but it is stripped and has no debugging information. Binary P could have a similar source code to binary S or could be made from the same statically linked library. However, S and P are sufficiently different that they cannot be diffed with BinDiff.
We want to use information from binary S to transpose symbols from S to P.
This blog post will explain how the AI we are developing at RevEng.AI does exactly that. We will look at one technique that explores our program comprehension embeddings API to locate symbols in P given a single known anchor point in it.
Setting the scene
As a running example, S will be the binary
miredo from the Debian Sid package
miredo/amd64. Miredo is a Teredo client (RFC 4380) that tunnels IPv6 packets in IPv4 UDP packets. We don't have access to the source code but we can access symbol information via the
miredo-dbgsym/amd64 package. It is important to note that RFC 4380 uses a combination of MD5 and HMAC in its protocol design. You can download binary S below.
As for executable P, we will be using the binary
fdupes from the Debian Sid package
fdupes/amd64. FDupes is a program that identifies duplicate files within directories using MD5 hashes and byte by byte comparisons. You can download binary P below.
Our intuition informs us that, given both binaries contain code for processing MD5, can we port MD5 related symbols from S to P?
An inspection of our source binary S reveals the following symbols:
root@desyl:~# readelf -s /usr/sbin/miredo | grep -i md5 81: 0000000000000000 0 FILE LOCAL DEFAULT ABS md5.c 82: 0000000000006360 2113 FUNC LOCAL DEFAULT 14 md5_process 138: 0000000000006e10 194 FUNC GLOBAL DEFAULT 14 md5_finish 369: 0000000000006be0 552 FUNC GLOBAL DEFAULT 14 md5_append 396: 0000000000006bb0 42 FUNC GLOBAL DEFAULT 14 md5_init
A quick internet search for "md5" and "fdupes" reveals that FDupes has the same
md5_init functions in its source code (Github search).
md5_init is a small 42 byte function and should be easy to identify in Ghidra. For the rest of this blog post we are going to assume that we know the location of a single symbol in P -
md5_init. This will be our known anchor point in P.
We are going to use this anchor point with the relative difference between symbols' embedded representations from S to locate another symbol in P. Before we explain exactly what that means or why we would want to do that, take a look at md5.c and you will see that
md5_process is not a direct caller or callee of
md5_init and therefore cannot easily be identified by simple static rules.
We will now explain how to locate
md5_process in the stripped binary P using
md5_init and the relative difference between
md5_init in our source executable S.
Machine Learning Background
RevEng.AI's BinNet model uses a deep multi-output transformer architecture to capture the semantic meaning of binary code. It represents binary code (and all strings, XREFS, constants, etc.) as real-valued vectors that can be used for downstream ML tasks.
To understand this properly you need to have a basic understanding of Word2Vec and the concept of word representations. We recommend pausing here and reading the following blog post if you are unfamiliar with modern NLP techniques.
The famous example from Word2Vec allowed for the vector representations of the words
queen. This proved that the vector space generated by Word2Vec captured semantic information beyond what was previously possible by Neural Network Language Models (NNLMs).
You might be able to guess where this is going...
Recovering Symbol Information using Embedded Symbol Representations
We are going to use RevEng.AI's BinNet model and embeddings API to locate
md5_process in P by calculating the difference between the embedded representations of
md5_init in S and adding the representation of
md5_init in P.
We will use the following Python code:
from reveng import BinNet, Binary, Symbol bn = BinNet(apikey='freeware') S = Binary(path='/usr/bin/miredo') P = Binary(path='/usr/bin/fdupes') # run binary analysis, P is stripped # extract function boundaries from ghidra S.analyse() P.analyse(FBD='ghidra', stripped=True) # store embeddings in numpy matrix, shape=(N, 192) # N is number of symbols in binary # 192 is the number of dimensions for BinNet embeddings Xs = bn.get_embeddings(S.symbols) Xp = bn.get_embeddings(P.symbols) # get embeddings for each symbol # md5_init is at 0x00003f07 in fdupes s_md5_init = Xs[S.get_symbol('md5_init'), :] s_md5_process = Xs[S.get_symbol('md5_process'), :] p_md5_init = Xp[S.get_symbol_at_loc(0x00003f07), :] # build query vector query_vec = s_md5_process - s_md5_init + p_md5_init
Our query vector is a real-valued numpy vector of 32 bit floats:
In : query_vec Out: array([[-1.0754741 , -0.53394 , 2.5343153 , -0.8589107 , -0.80127776, 3.6827471 , -0.2754942 , -0.25956583, -0.9386326 , -0.8809752 , -0.94144917, -0.9657012 , -0.37274447, 3.551879 , 1.4170911 , ... -0.5281741 , -0.65251297, -0.9238823 , 1.2517035 , -0.77191585, -0.9819512 , -0.695347 , -0.47379667, -0.6443882 , 0.8159219 , -0.20785426, -0.52132106, -0.7369828 , 2.757089 , -0.07254058, -1.1361468 , -0.45225745]], dtype=float32)
We now need to find the closest symbol in P to our query vector. We do this using the cosine similarity against all functions that Ghidra found in P.
from sklearn.metrics.pairwise import cosine_similarity closest = cosine_similarity(Xp, query_vec).squeeze().argsort()[::-1]
The closest BinNet embeddings in P to our query can then be accessed by looking up the row corresponding to the closest index:
In : Xp[closest, :] Out: array([[-1.03495793, -0.60625399, 2.51171437, -0.84164309, -0.83068409, 3.69240599, -0.31704739, -0.22428862, -1.0581016 , -0.96066797, -0.87574955, -0.95887101, -0.40816309, 3.62616827, 1.39601726, ... -0.46266309, -0.58537404, -0.87706089, 1.23309646, -0.75341943, -0.94693962, -0.58251833, -0.45131244, -0.71748744, 0.77900041, -0.25757586, -0.56068685, -0.74393312, 2.77102427, -0.04606016, -1.16797348, -0.41193305]], dtype=float32)
And now we can find the matched symbols virtual address:
In : P.symbols[closest].vaddr Out: 14068
md5_process should have a virtual address of 14068 (0x36f4) in our binary P. If we now download symbol information from the
fdupes-dbgsym/amd64 Debian package and merge the ELF files we indeed find this is correct.
How difficult was this?
OK, BinDiff doesn't work but can't we just hash a function's disassembly code and find matches between the two functions? Can we use IDA FLIRT?
Let's download the debugging package and inspect P with symbols. The first thing to note is that
md5_process is 2113 bytes in S and 2067 bytes in P. Matching hashes clearly won't work.
Is the disassembly similar? Let's take a look with Rizin:
[0x000013b0]> pdf @ sym.md5_process ┌ 2067: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh); │ 0x000036f4 4157 push r15 │ 0x000036f6 4156 push r14 │ 0x000036f8 4155 push r13 │ 0x000036fa 4154 push r12 │ 0x000036fc 55 push rbp │ 0x000036fd 53 push rbx │ 0x000036fe 8b4708 mov eax, dword [rdi + 8] │ 0x00003701 894424b8 mov dword [rsp - 0x48], eax │ 0x00003705 8b470c mov eax, dword [rdi + 0xc] │ 0x00003708 89442498 mov dword [rsp - 0x68], eax │ 0x0000370c 8b4710 mov eax, dword [rdi + 0x10] │ 0x0000370f 8944249c mov dword [rsp - 0x64], eax │ 0x00003713 8b4714 mov eax, dword [rdi + 0x14] │ 0x00003716 894424a0 mov dword [rsp - 0x60], eax │ 0x0000371a 4889f0 mov rax, rsi │ 0x0000371d 40f6c603 test sil, 3 │ ┌─< 0x00003721 744c je 0x376f │ │ 0x00003723 488b06 mov rax, qword [rsi] │ │ 0x00003726 48894424c0 mov qword [rsp - 0x40], rax │ │ 0x0000372b 488b4608 mov rax, qword [rsi + 8]
[0x00003560]> pdf @ sym.md5_process ┌ 2113: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh); | 0x00006360 4157 push r15 │ 0x00006362 4156 push r14 │ 0x00006364 4155 push r13 │ 0x00006366 4154 push r12 │ 0x00006368 55 push rbp │ 0x00006369 53 push rbx │ 0x0000636a 4881ec880000. sub rsp, 0x88 │ 0x00006371 64488b042528. mov rax, qword fs:[0x28] │ 0x0000637a 4889442478 mov qword [var_78h], rax │ 0x0000637f 31c0 xor eax, eax │ 0x00006381 8b4708 mov eax, dword [rdi + 8] │ 0x00006384 89442428 mov dword [var_28h], eax
The disassembly is significantly different and would not match with hashing or semantic hashing. IDA FLIRT works by matching the first 32 bytes of each function and would fail to find the match.
Why would we want to do this? RevEng.AI's BinNet embeddings aren't just useful for matching symbol names. They inherently capture the semantic meaning of binary code and could be used for:
- Porting symbols between different malware samples or OS versions
- Identifying known vulnerabilities in closed source software
- Detecting malware embedded inside proprietary software
- Aiding reverse engineering of never seen before programs
- Speeding up fuzzing by directing path selection to heap allocation functions or other defined behaviour
There are far more efficient ways to use RevEng.AI's API to identify symbols in stripped binaries but this has been a fun example. If you are interested in using our API please contact us via the contact page at https://reveng.ai.
No source code was analysed by our AI in the development of this example.