Context Triggered Piecewise Hashing To Detect Malware Similarity

At Nettitude we collect a large amount of malware binary samples, both from our Honeypot network, from our customers and from incident response. One of the first steps we take is to calculate the MD5 hash of the malware and compare this hash to known samples, while unknown samples can be examined further by an analyst.
Traditional hashes like MD5 take an input and create a fixed length output hash that represents the data. A one bit change to the data will produce an entirely different MD5 hash, so quite often during this examination process we find that the malware is essentially identical to another known piece of malware and only slight variations in the file cause the MD5 hash to differ. Eliminating these duplicates is something I’ve recently been looking into.
SSDeep
Initially I took a look at the well-known SSDeep tool, which can be used to determine document similarity. This uses a technique known as context triggered piecewise hashing (CTPH) to compare files for similarities.
SSDeep can be used to hash and compare any two files; it doesn’t, however, take into account the internal structure of those files when hashing. Therefore in order to get more reliable results I decided to implement a version of CTPH myself, which takes into account the format of a windows executable image (also known as the portable executable file format PE).
What Is Context Triggered Piecewise Hashing?
CTPH – also known as fuzzy hashing – is based on using a rolling hash, where the hash has a siding window and a ‘state’. The state maintains the hash of the last few bytes of the data that are in the current window and is constructed in such a way that allows the removal of influence of previous bytes of data, and the addition of new data, as the sliding window position moves.
The Rabin-Karp string search algorithm uses this rolling hash technique to locate substrings that only require one comparison per character:

Figure 1 – Rabin-Karp String Search

When examining documents for similarity using CTPH, a rolling hash of the data is performed and, at an arbitrarily chosen trigger point. For example, when the modulus of the hash matches a certain value, a traditional hash such as MD5 is calculated on the data processed since the previous trigger point. Part of the traditional hash is saved, for instance, the last two bytes and the rolling hash continues. The final CTP hash consists of the saved parts of the traditional hash.
Comparing two CTP hashes
CTP hashes can be compared by using a Bloom filter and a Hamming distance calculation. A bloom filter is a data structure that can be used to test if an element of data is a member of a set and the Hamming distance gives a weighting value based on the difference of two strings.
An example of this is Sdhash, which uses a bloom filter combined with a calculation of the Hamming distance.
The parts of the two CTP hashes to be compared can be added to two separate bloom filters, and then the Hamming distance between the bloom filters is calculated to determine the weight of differences between the two hashes.
Results
After some initial research, I have implemented a prototype tool.
The prototype uses pebliss to read the PE file, SQLite as a backend to store the results, the Rabin-Fingerprint rolling hash algorithm and MD5 hashing. For hash comparison, a Bloom filter is also used and the percentage matching bits is used to gauge similarity. To begin with, only the code section of the malware is hashed and compared.
The CTPH algorithm allows an arbitrary trigger point to be chosen. For the triggering I experimented with various values and finally settled on a system which looks at the next byte to be processed and compares it to an assembler “jmp”, “call” or “ret” instruction. Although the byte being examined may actually not be a jmp and might simply be part of another instruction, that can’t be any worse than choosing an arbitrary modulus value of the rolling hash. At the trigger point I generate an MD5 sum.
Hashing 200 malware samples with the tool shows that the code section for a large number of our unique samples is in fact the same. This may of course be because a packer has been used to compress or obfuscate the malware.

Figure 2 – SQLite Similarity Results

Conclusion
The initial prototype has proved that the concept of fuzzy hashing the individual malware sections can be useful.
This still, however, needs to be developed into a complete tool. I’ll be moving towards implementing the following features over the next few weeks:

Hash and compare the resource and data sections of the PE files
Identify which malware has been packed
Move away from SQLite and place the data into a NoSql database (mongo or elasticsearch)
Calculating the Hamming distance of two hashes

Some interesting research has already been done on hashing disassembled functions using CTPH. This is also something that I will be investigating in the coming weeks:
http://www.shadowserver.org/wiki/uploads/Information/FuzzyHashing.pdf
http://www.blackhat.com/presentations/bh-usa-09/RAYGOZA/BHUSA09-Raygoza-MalwareSimAnalysis-PAPER.pdf

To contact Nettitude’s editor, please email media@nettitude.com.