Esh - Statistical Similarity of Binaries.
Say that you have a procedure in assembly code (a query), and you want to find similar procedures, say the same source code compiled by a different compiler or slightly patched, in a corpus of assembly procedures (targets).
Why would I want to do this? Imagine you have a vulnerable procedure (e.g. Heartbleed), and you want to check if it was embedded in some binary in your organization, but don't know what compiler was used, or whether the code was patched. Other use-cases can be clone detection for reducing code-base maintenance (by enabling code-reuse) or for identifying IP theft. That's where Esh comes in!
What's Esh? Esh is a tool that allows searching a corpus of procedures in assembly code, for a query procedure, even if the source code was compiled with a different compiler of was slightly patched. Esh does not rely on syntactic information and thus is able to operate even when procedures were stripped of debug information. Read on to learn more.
So what were those pieces of code? We included a demo of a key component of Esh - the ability to find semantic similarity of small linear pieces of code, where no matching of variable names exist. Why? because this is the common scenario when dealing with assembly code - no variable names exists, so you need to get creative to compute similarity. Other key aspects of Esh are procedure decomposition and statistical significance.
Here is a (very high-level) breakdown of how Esh computes similarity:
Look at the 3 pictures above. Which of the query pictures: 1 or 2, is more similar to the target picture? Can you say why?
If you thought (like we did) that image 2 is more similar to the target, then the above pictures explains why. The pictures mark and number the common regions of the queries and target image. Image 2 is more similar to the target, simply because they share more mutual regions, and these regions are big and non-trivial (for instance the black background in the pictures should not be used as evidence for similarity). This observation is inspired by work done by Michal Irani and Oren Boiman in the field of image processing (dancer images courtesy of Prof. Michal Irani). Esh shows that the same principals of similarity apply to code!
Consider the assembly code in the diagram above, extracted from binaries stripped of debug information. The blocks show partial assembly code of three procedures, two of them (t and q2) containing the “Heartbleed” vulnerability and another (q1) is unrelated. The vulnerable procedures were compiled using different compilers. Finding similarity between the procedures using syntactic techniques is challenging, as different compilers can produce significantly different assembly code.
Instead, we decompose each procedure to strands, semantically compare similarity of strands, and lift the results to procedures. In the diagram, the target code t and the query code q1, q2 share three matching strands, numbered in the figure as 1, 2, and 3 (within circles). Each strand is a sequence of instructions, and strands are considered as matches when they perform an equivalent computation. In the figure, we mark matching strands using the same circled number. Two syntactically different strands can be equivalent.
For example, the strands numbered 2 in q and t2 are different syntactically but are equivalent (they perform the same operation on inputs). Further, strands need not be syntactically contiguous, this is because they are based on data-flow dependencies rather than on syntactic properties. For example, strand 3 in the query procedure q2 ( mov r12, rbx; lea rdi, [r12+3] ). This strand matches the strand mov r13, rbx;lea rcx, [r13+3] in the target procedure t. In contrast, the code in q1 only matches the single strand 1 in the target, and that strand is of low statistical significance as it is common and thus suggests less of similarity.