6 min read

Identifying x86_64 ELF Symbols in Stripped Binaries using AI

This blog post is the first in a new series that explains some of the new capabilities we are developing at RevEng.AI - starting with identifying symbols relative to known anchor points. What exactly does that mean?
Identifying x86_64 ELF Symbols in Stripped Binaries using AI
Photo by Alexander Sinn / Unsplash

Scenario

Let's say we have a binary that we have either reverse engineered or have the associated debugging information. We will refer to this executable as binary S. We also have another executable called P that we know is similar to binary S in some way but it is stripped and has no debugging information. Binary P could have a similar source code to binary S or could be made from the same statically linked library. However, S and P are sufficiently different that they cannot be diffed with BinDiff.

We want to use information from binary S to transpose symbols from S to P.

This blog post will explain how the AI we are developing at RevEng.AI does exactly that. We will look at one technique that explores our program comprehension embeddings API to locate symbols in P given a single known anchor point in it.

Setting the scene

As a running example, S will be the binary miredo from the Debian Sid package miredo/amd64. Miredo is a Teredo client (RFC 4380) that tunnels IPv6 packets in IPv4 UDP packets. We don't have access to the source code but we can access symbol information via the miredo-dbgsym/amd64 package. It is important to note that RFC 4380 uses a combination of MD5 and HMAC in its protocol design. You can download binary S below.

As for executable P, we will be using the binary fdupes from the Debian Sid package fdupes/amd64. FDupes is a program that identifies duplicate files within directories using MD5 hashes and byte by byte comparisons. You can download binary P below.

Our intuition informs us that, given both binaries contain code for processing MD5, can we port MD5 related symbols from S to P?

An inspection of our source binary S reveals the following symbols:

root@desyl:~# readelf -s /usr/sbin/miredo | grep -i md5
    81: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS md5.c
    82: 0000000000006360  2113 FUNC    LOCAL  DEFAULT   14 md5_process
   138: 0000000000006e10   194 FUNC    GLOBAL DEFAULT   14 md5_finish
   369: 0000000000006be0   552 FUNC    GLOBAL DEFAULT   14 md5_append
   396: 0000000000006bb0    42 FUNC    GLOBAL DEFAULT   14 md5_init

A quick internet search for "md5" and "fdupes" reveals that FDupes has the same md5_process, md5_finish, md5_append, and md5_init functions in its source code (Github search).

md5_init is a small 42 byte function and should be easy to identify in Ghidra. For the rest of this blog post we are going to assume that we know the location of a single symbol in P - md5_init. This will be our known anchor point in P.

We are going to use this anchor point with the relative difference between symbols' embedded representations from S to locate another symbol in P. Before we explain exactly what that means or why we would want to do that, take a look at md5.c and you will see that md5_process is not a direct caller or callee of md5_init and therefore cannot easily be identified by simple static rules.

We will now explain how to locate md5_process in the stripped binary P using md5_init and the relative difference between md5_process and md5_init in our source executable S.

Machine Learning Background

RevEng.AI's BinNet model uses a deep multi-output transformer architecture to capture the semantic meaning of binary code. It represents binary code (and all strings, XREFS, constants, etc.) as real-valued vectors that can be used for downstream ML tasks.

To understand this properly you need to have a basic understanding of Word2Vec and the concept of word representations. We recommend pausing here and reading the following blog post if you are unfamiliar with modern NLP techniques.

The famous example from Word2Vec allowed for the vector representations of the words king - man + womanqueen. This proved that the vector space generated by Word2Vec captured semantic information beyond what was previously possible by Neural Network Language Models (NNLMs).

You might be able to guess where this is going...

Recovering Symbol Information using Embedded Symbol Representations

We are going to use RevEng.AI's BinNet model and embeddings API to locate md5_process in P by calculating the difference between the embedded representations of md5_process - md5_init in S and adding the representation of md5_init in P.

We will use the following Python code:

from reveng import BinNet, Binary, Symbol

bn = BinNet(apikey='freeware')
S  = Binary(path='/usr/bin/miredo')
P  = Binary(path='/usr/bin/fdupes')

# run binary analysis, P is stripped
# extract function boundaries from ghidra
S.analyse()
P.analyse(FBD='ghidra', stripped=True)

# store embeddings in numpy matrix, shape=(N, 192)
# N is number of symbols in binary
# 192 is the number of dimensions for BinNet embeddings
Xs = bn.get_embeddings(S.symbols)
Xp = bn.get_embeddings(P.symbols)

# get embeddings for each symbol
# md5_init is at 0x00003f07 in fdupes
s_md5_init    = Xs[S.get_symbol('md5_init'), :]
s_md5_process = Xs[S.get_symbol('md5_process'), :]
p_md5_init    = Xp[S.get_symbol_at_loc(0x00003f07), :]

# build query vector
query_vec = s_md5_process - s_md5_init + p_md5_init

Our query vector is a real-valued numpy vector of 32 bit floats:

In [216]: query_vec
Out[216]:
array([[-1.0754741 , -0.53394   ,  2.5343153 , -0.8589107 , -0.80127776,
         3.6827471 , -0.2754942 , -0.25956583, -0.9386326 , -0.8809752 ,
        -0.94144917, -0.9657012 , -0.37274447,  3.551879  ,  1.4170911 ,
        ...
        -0.5281741 , -0.65251297, -0.9238823 ,  1.2517035 , -0.77191585,
        -0.9819512 , -0.695347  , -0.47379667, -0.6443882 ,  0.8159219 ,
        -0.20785426, -0.52132106, -0.7369828 ,  2.757089  , -0.07254058,
        -1.1361468 , -0.45225745]], dtype=float32)

We now need to find the closest symbol in P to our query vector. We do this using the cosine similarity against all functions that Ghidra found in P.

from sklearn.metrics.pairwise import cosine_similarity

closest = cosine_similarity(Xp, query_vec).squeeze().argsort()[::-1][0]

The closest BinNet embeddings in P to our query can then be accessed by looking up the row corresponding to the closest index:

In [235]: Xp[closest, :]
Out[235]:
array([[-1.03495793, -0.60625399,  2.51171437, -0.84164309, -0.83068409,
         3.69240599, -0.31704739, -0.22428862, -1.0581016 , -0.96066797,
        -0.87574955, -0.95887101, -0.40816309,  3.62616827,  1.39601726,
        ...
        -0.46266309, -0.58537404, -0.87706089,  1.23309646, -0.75341943,
        -0.94693962, -0.58251833, -0.45131244, -0.71748744,  0.77900041,
        -0.25757586, -0.56068685, -0.74393312,  2.77102427, -0.04606016,
        -1.16797348, -0.41193305]], dtype=float32)

And now we can find the matched symbols virtual address:

In [236]: P.symbols[closest].vaddr
Out[236]: 14068

The symbol md5_process should have a virtual address of 14068 (0x36f4) in our binary P. If we now download symbol information from the fdupes-dbgsym/amd64 Debian package and merge the ELF files we indeed find this is correct.

How difficult was this?

OK, BinDiff doesn't work but can't we just hash a function's disassembly code and find matches between the two functions? Can we use IDA FLIRT?

Let's download the debugging package and inspect P with symbols. The first thing to note is that md5_process is 2113 bytes in S and 2067 bytes in P. Matching hashes clearly won't work.

Is the disassembly similar? Let's take a look with Rizin:

fdupes
[0x000013b0]> pdf @ sym.md5_process
┌ 2067: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
│           0x000036f4      4157           push r15
│           0x000036f6      4156           push r14
│           0x000036f8      4155           push r13
│           0x000036fa      4154           push r12
│           0x000036fc      55             push rbp
│           0x000036fd      53             push rbx
│           0x000036fe      8b4708         mov eax, dword [rdi + 8] 
│           0x00003701      894424b8       mov dword [rsp - 0x48], eax
│           0x00003705      8b470c         mov eax, dword [rdi + 0xc] 
│           0x00003708      89442498       mov dword [rsp - 0x68], eax
│           0x0000370c      8b4710         mov eax, dword [rdi + 0x10]
│           0x0000370f      8944249c       mov dword [rsp - 0x64], eax
│           0x00003713      8b4714         mov eax, dword [rdi + 0x14]
│           0x00003716      894424a0       mov dword [rsp - 0x60], eax
│           0x0000371a      4889f0         mov rax, rsi   
│           0x0000371d      40f6c603       test sil, 3
│       ┌─< 0x00003721      744c           je 0x376f
│       │   0x00003723      488b06         mov rax, qword [rsi]  
│       │   0x00003726      48894424c0     mov qword [rsp - 0x40], rax
│       │   0x0000372b      488b4608       mov rax, qword [rsi + 8] 

miredo

[0x00003560]> pdf @ sym.md5_process
┌ 2113: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
|           0x00006360      4157           push r15
│           0x00006362      4156           push r14
│           0x00006364      4155           push r13
│           0x00006366      4154           push r12
│           0x00006368      55             push rbp
│           0x00006369      53             push rbx
│           0x0000636a      4881ec880000.  sub rsp, 0x88
│           0x00006371      64488b042528.  mov rax, qword fs:[0x28]
│           0x0000637a      4889442478     mov qword [var_78h], rax
│           0x0000637f      31c0           xor eax, eax
│           0x00006381      8b4708         mov eax, dword [rdi + 8]  
│           0x00006384      89442428       mov dword [var_28h], eax

The disassembly is significantly different and would not match with hashing or semantic hashing. IDA FLIRT works by matching the first 32 bytes of each function and would fail to find the match.

Discussion

Why would we want to do this? RevEng.AI's BinNet embeddings aren't just useful for matching symbol names. They inherently capture the semantic meaning of binary code and could be used for:

  • Porting symbols between different malware samples or OS versions
  • Identifying known vulnerabilities in closed source software
  • Detecting malware embedded inside proprietary software
  • Aiding reverse engineering of never seen before programs
  • Speeding up fuzzing by directing path selection to heap allocation functions or other defined behaviour

There are far more efficient ways to use RevEng.AI's API to identify symbols in stripped binaries but this has been a fun example. If you are interested in using our API please contact us via the contact page at https://reveng.ai.

No source code was analysed by our AI in the development of this example.