The Problem of Data Loss Intelligence

Data Loss Intelligence (DLI) concerns the information that is available to you once your data has been compromised. It’s distinct from Data Loss Protection (DLP) technologies, which are concerned with preventing your data from being compromised in the first place. Think of DLI as your last line of defence: it tries to let you know when DLP has failed, and what is happening now that your data is out in the wild.

Tracking data is hard. It’s even harder when you lose control of its distribution. Whether through an insider leak, an external attack or an innocent mistake, once data leaves the confines of your regular security perimeter, it is often lost entirely. Knowing who has your data, where they are keeping it and what they are doing with it is an increasingly important part of managing your sensitive business assets and intellectual property.

Data is Inert

The heart of the problem is that data is inherently inert. Typically it cannot execute or exhibit behaviours by itself. It is entirely dependent on a counterpart piece of software, a reader, to parse its content and interpret any actions the file suggests be performed. Any attempt to contact the outside world or “call home”, for example, will only succeed if the reader chooses to honour it. Often it does not, because calling home allows the end user’s actions to be tracked and so conflicts directly with their desire for privacy. Even where an external call is permitted, each data type usually has a plethora of available readers, and the behaviour may not be consistent across them all.

The Problem of Consistency

While on the topic of consistency, even if you could guarantee a call home with one data type, your data is often not in a single discrete format. Each separate data type and reader needs a different tracking technique, resulting in confusing and tedious operating procedures. One approach to this is to abstract the protection away from the data itself by wrapping, or “enveloping”, the data in a protective layer of encryption. The problem here is that at some point, in order to allow legitimate access, you must provide the key to the end user. Once decrypted, the content is unsecured again and is subject to traditional threats. Even if the software that applies the envelope does not explicitly allow access to the decrypted material, a skilled attacker may be able to capture the key and use it to decrypt the original content.

Ultimately, to have a robust DLI solution, we need to be able to guarantee a behaviour on a computer we don’t control. This is a challenge because it is nearly identical to the goal of malware and, as such, is rightly and proactively hindered by various security mechanisms and product updates. Somehow we need to balance user privacy against data control. It’s a problem that has been around for years, and it has strong similarities to digital rights management (DRM). It is one we are actively investigating at Nettitude.

To contact Nettitude’s editor, please email


Context Triggered Piecewise Hashing To Detect Malware Similarity

At Nettitude we collect a large number of malware binary samples from our honeypot network, from our customers and from incident response. One of the first steps we take is to calculate the MD5 hash of each sample and compare it to the hashes of known samples; unknown samples can then be examined further by an analyst.

Traditional hashes like MD5 take an input and create a fixed-length output hash that represents the data. A one-bit change to the data produces an entirely different MD5 hash. As a result, quite often during this examination process we find that a sample is essentially identical to a known piece of malware, and only slight variations in the file cause the MD5 hash to differ. Eliminating these duplicates is something I’ve recently been looking into.
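To make the one-bit point concrete, here is a small Python sketch. The input bytes are made-up stand-in data rather than a real sample; it flips a single bit and counts how many bits of the two MD5 digests differ.

```python
import hashlib

# Stand-in for a sample's contents (hypothetical data, not real malware).
original = b"MZ" + bytes(64)

# Copy the data and flip exactly one bit.
modified = bytearray(original)
modified[10] ^= 0x01

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(bytes(modified)).hexdigest()

# XOR the two 128-bit digests and count the bits that differ.
differing_bits = bin(int(digest_a, 16) ^ int(digest_b, 16)).count("1")

print(digest_a)
print(digest_b)
print(differing_bits, "of 128 digest bits differ")
```

Two files that differ by a single bit are, to MD5, simply two unrelated files, which is why an exact-match lookup misses near-duplicates.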


Initially I took a look at the well-known SSDeep tool, which can be used to determine document similarity. This uses a technique known as context triggered piecewise hashing (CTPH) to compare files for similarities.

SSDeep can be used to hash and compare any two files; it doesn’t, however, take into account the internal structure of those files when hashing. To get more reliable results, I decided to implement a version of CTPH myself that takes into account the format of a Windows executable image (the Portable Executable, or PE, file format).

What Is Context Triggered Piecewise Hashing? 

CTPH (also known as fuzzy hashing) is based on a rolling hash, where the hash has a sliding window and a ‘state’. The state holds the hash of the last few bytes of data in the current window, and is constructed so that the influence of old bytes can be removed, and new bytes added, as the sliding window moves.

The Rabin-Karp string search algorithm uses this rolling hash technique to locate substrings, requiring only one hash comparison per character:

Figure 1 – Rabin-Karp String Search
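As a rough illustration of the same idea in code, here is a minimal Python sketch of a Rabin-Karp search. The base and modulus are arbitrary values picked for the example, and the function name is my own. The two lines near the end of the loop are the "rolling" part: the oldest byte is removed from the hash and the newest byte is added, without rehashing the whole window.

```python
def rabin_karp(text: bytes, pattern: bytes,
               base: int = 256, mod: int = 1_000_003) -> list[int]:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Weight of the byte that leaves the window on each step.
    high = pow(base, m - 1, mod)

    # Hash the pattern and the first window of the text.
    target = window = 0
    for i in range(m):
        target = (target * base + pattern[i]) % mod
        window = (window * base + text[i]) % mod

    matches = []
    for i in range(n - m + 1):
        # Hashes can collide, so confirm a hit with a direct comparison.
        if window == target and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            window = (window - text[i] * high) % mod      # remove oldest byte
            window = (window * base + text[i + m]) % mod  # add newest byte
    return matches


print(rabin_karp(b"abracadabra", b"abra"))
```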


When examining documents for similarity using CTPH, a rolling hash of the data is computed and, at an arbitrarily chosen trigger point (for example, when the modulus of the hash matches a certain value), a traditional hash such as MD5 is calculated on the data processed since the previous trigger point. Part of the traditional hash is saved (for instance, the last two bytes) and the rolling hash continues. The final CTP hash consists of the saved parts of the traditional hashes.
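The steps above can be sketched in a few lines of Python. This is a simplified illustration and not the SSDeep implementation: the rolling hash here is just the sum of the last few bytes, and the window size and trigger modulus are values I have assumed for the example.

```python
import hashlib

WINDOW = 7        # bytes in the rolling window (assumed value)
TRIGGER_MOD = 64  # trigger when rolling % TRIGGER_MOD == TRIGGER_MOD - 1 (assumed)


def ctph(data: bytes) -> str:
    """Return a toy CTP hash: the last two MD5 bytes of each piece, as hex."""
    pieces = []
    start = 0     # start of the current piece
    rolling = 0   # sum of the bytes currently inside the window
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if rolling % TRIGGER_MOD == TRIGGER_MOD - 1:
            # Trigger point: hash everything since the previous trigger.
            digest = hashlib.md5(data[start:i + 1]).digest()
            pieces.append(digest[-2:].hex())
            start = i + 1
    if start < len(data):
        # Hash whatever is left after the final trigger point.
        pieces.append(hashlib.md5(data[start:]).digest()[-2:].hex())
    return "".join(pieces)
```

Because the trigger depends only on the bytes inside the window, inserting a byte near the start of a file changes the piece it lands in, but the trigger points further on resynchronise, so the tail of the two hashes still matches.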

Comparing two CTP hashes

CTP hashes can be compared by using a Bloom filter and a Hamming distance calculation. A Bloom filter is a data structure that can be used to test whether an element is a member of a set, and the Hamming distance counts the positions at which two equal-length strings differ.

An example of this is sdhash, which uses a Bloom filter combined with a calculation of the Hamming distance.

The parts of the two CTP hashes to be compared can be added to two separate Bloom filters, and the Hamming distance between the Bloom filters is then calculated to determine the weight of the differences between the two hashes.
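A minimal Python sketch of that comparison follows. It is not how sdhash works internally; the filter size, the number of bit positions per element and the use of SHA-256 to pick those positions are all assumptions made for the example.

```python
import hashlib

FILTER_BITS = 256  # Bloom filter size in bits (assumed value)
NUM_HASHES = 3     # bit positions set per element (assumed value)


def bloom_filter(pieces: list[str]) -> int:
    """Build a Bloom filter, held as an integer bitmask, from hash pieces."""
    bits = 0
    for piece in pieces:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{piece}".encode()).digest()
            position = int.from_bytes(digest[:4], "big") % FILTER_BITS
            bits |= 1 << position
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which the two filters differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int) -> float:
    """Percentage of filter bits that match."""
    return 100.0 * (FILTER_BITS - hamming_distance(a, b)) / FILTER_BITS


first = bloom_filter(["a1b2", "c3d4", "e5f6"])
second = bloom_filter(["a1b2", "c3d4", "0789"])
print(hamming_distance(first, second), similarity(first, second))
```

Two hashes that share most of their pieces set mostly the same bits, so the distance stays small; each piece that differs can change at most NUM_HASHES bits in its filter.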


After some initial research, I have implemented a prototype tool.

The prototype uses pebliss to read the PE file, SQLite as a backend to store the results, the Rabin fingerprint rolling hash algorithm and MD5 hashing. For hash comparison, a Bloom filter is used and the percentage of matching bits gauges similarity. To begin with, only the code section of the malware is hashed and compared.

The CTPH algorithm allows an arbitrary trigger point to be chosen. For the triggering I experimented with various values and finally settled on a system which looks at the next byte to be processed and compares it to an assembler “jmp”, “call” or “ret” instruction. Although the byte being examined may actually not be a jmp and might simply be part of another instruction, that can’t be any worse than choosing an arbitrary modulus value of the rolling hash. At the trigger point I generate an MD5 sum.
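A minimal Python sketch of this kind of trigger is shown below. It is not the prototype's code, and the opcode values are my assumption of a reasonable set of single-byte x86 encodings; the exact bytes used by the tool are not listed here. A piece ends whenever the next byte to be processed is one of the trigger opcodes, and each piece is then hashed with MD5.

```python
import hashlib

# Single-byte x86 opcodes assumed for the trigger (illustrative choice).
TRIGGER_OPCODES = {
    0xE8,  # call rel32
    0xE9,  # jmp rel32
    0xEB,  # jmp rel8
    0xC3,  # ret
}


def split_on_opcodes(code: bytes) -> list[bytes]:
    """Split code so that each piece ends just before a trigger byte."""
    pieces, start = [], 0
    for i, byte in enumerate(code):
        # No disassembly is done: any matching byte value counts as a trigger.
        if byte in TRIGGER_OPCODES and i > start:
            pieces.append(code[start:i])
            start = i
    if start < len(code):
        pieces.append(code[start:])
    return pieces


def piece_hashes(code: bytes) -> list[str]:
    """MD5 each piece produced by the opcode trigger."""
    return [hashlib.md5(piece).hexdigest() for piece in split_on_opcodes(code)]


# push ebp; mov ebp, esp; call +0; pop ebp; ret
sample = bytes([0x55, 0x8B, 0xEC, 0xE8, 0x00, 0x00, 0x00, 0x00, 0x5D, 0xC3])
print([piece.hex() for piece in split_on_opcodes(sample)])
```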

Hashing 200 malware samples with the tool shows that the code section for a large number of our unique samples is in fact the same. This may of course be because a packer has been used to compress or obfuscate the malware.

Figure 2 – SQLite Similarity Results


The initial prototype has proved that the concept of fuzzy hashing the individual malware sections can be useful.

This still, however, needs to be developed into a complete tool. I’ll be moving towards implementing the following features over the next few weeks:

  • Hash and compare the resource and data sections of the PE files
  • Identify which malware has been packed
  • Move away from SQLite and place the data into a NoSQL database (MongoDB or Elasticsearch)
  • Calculate the Hamming distance of two hashes

Some interesting research has already been done on hashing disassembled functions using CTPH. This is also something that I will be investigating in the coming weeks.




Shellter – A Dynamic Shellcode Injector

Shellter was recently added to the official repository of Kali Linux. This is a very important milestone in the development of the project. Since there are not many tools that can assist penetration testers in evading anti-virus (AV), we decided to write a few words about it.

What is Shellter?

It is a truly dynamic shellcode injector. By ‘dynamic’ we mean that the injected code does not start at a location chosen by very strict rules, such as the entry point of an executable or some other statically predictable location. Shellter currently only supports 32-bit executables. I started this project and have been developing it in my own time for the past two years.

How are the injection points selected?

The injection points are based on the execution flow of the executable. Shellter traces the execution flow of the application in userland and logs the instructions and locations that fall within the range of the executable where the injection will take place.

Once the tracing has finished, Shellter filters the execution flow based on the size of the code that is about to be injected, and only considers the injection points that remain valid under its various filtering parameters.

What other features does Shellter provide?

Even though avoiding static injection locations is very important for bypassing AV, Shellter is not limited to that and provides some additional advanced features.

More exotic features are currently being added to the upcoming version of Shellter, which will be officially presented in public for the first time at BSidesLisbon 2015.



The Prestige in Malware Persistence


Just like a magic trick, a malware infection very often consists of three parts or acts. Paraphrasing the following narration from the film “The Prestige” (2006) gives an idea of what we are going to talk about.

“Every malware infection consists of three parts or acts. The first part is called the pledge; the attacker shows you something ordinary, like a document or a web page. The second act is called the turn; the attacker takes that ordinary something and makes it into something extraordinary. But you wouldn’t clap yet, because making something disappear isn’t enough. You have to bring it BACK.”

In this case, bringing this something back is not something the attackers want you to see, and they definitely won’t be offended if you don’t clap. This is what the prestige in malware persistence is all about.

The case study

During recent malware analysis engagements, we came across an interesting persistence mechanism which proves to be quite effective.

The malware itself is built as a DLL file and uses the common technique of loading itself by adding a registry key that launches rundll32.exe, which in turn loads the malicious DLL. The DLL is dropped in the “C:\Users\<username>\AppData\Roaming” directory under a random name matching the regex pattern ^[0-9A-F]{3,4}\.tmp$, for example “1FD9.tmp”.
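As a quick way to apply the naming pattern, here is a small Python sketch that filters a list of file names against it. The function name is my own, and I have written the dot before “tmp” as an escaped literal dot, on the assumption that this is what the pattern intends.

```python
import re

# Naming pattern from the analysis; the dot is escaped to match a literal ".".
DLL_NAME = re.compile(r"^[0-9A-F]{3,4}\.tmp$")


def suspicious_names(filenames: list[str]) -> list[str]:
    """Return the file names that match the dropped DLL naming pattern."""
    return [name for name in filenames if DLL_NAME.match(name)]


listing = ["1FD9.tmp", "notes.txt", "ABC.tmp", "12345.tmp", "1fd9.tmp"]
print(suspicious_names(listing))
```

A match is only a hint, since legitimate software also writes short .tmp files; it is the combination with the registry key and the memory artefacts described below that makes the indicator useful.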

However, what happens next is the interesting part of this case study. Every time the system starts, the malicious DLL injects itself into explorer.exe and then deletes its persistence traces, including its own file and the aforementioned key in the Windows registry.

During analysis, we noticed that the malware was also injecting itself into the Windows ‘Desktop Window Manager’ process (dwm.exe). Both of the infected processes normally run in the security context of the current user.

During shutdown, the malware puts back both its own file and the registry key, which serves as the persistence mechanism that loads it after every reboot of the infected host.

Are you looking closely?

Any attempt to look for common persistence indicators using ordinary tools, such as the well-known ‘autoruns’ utility from Sysinternals, will fail in this case, for the simple reason that the malware deletes its traces while it is active.

However, by examining the virtual address space of Windows Explorer with VMMap from Sysinternals, we can retrieve some interesting clues (figure 1).

Figure 1. Mapped Memory with RWE permissions

If we look more closely at the first of these memory regions, we find another very important clue that something might be wrong on this host (figure 2).

Figure 2. Prestige in Malware Persistence

We can clearly see an absolute path to a file, but browsing to that directory or attempting to find a registry key that contains any data pointing to that will not return anything useful to us.

A good old trick

Restarting the host in safe mode prevents the malware from loading and deleting its traces (figures 3, 4).

Figure 3. The malicious DLL file

Figure 4. The registry key used for persistence

Also note that the malware changes its own filename every time it executes.


Going one step further

So far we have found significant details about the hiding and persistence mechanisms of this malware, but we still haven’t shown proof of what we have been discussing.

Debugging Windows Explorer locally on shutdown, to catch the moment when the malware writes itself back, won’t help us much, since the OS will also kill the debugger process before we manage to see anything useful.

However, by enabling auditing on that directory for successful events related to creating files and writing data to them, we can go back afterwards and examine the logs (figure 5).

Figure 5. Event logs

We can clearly see that it was indeed the Windows Explorer process that dropped the malicious DLL file back into place.



During malware analysis we see all sorts of tricks and tactics for evading detection. This particular case was really interesting because the malware was very effective at hiding from common auditing tools that report auto-run registry keys as potential indicators of compromise. Furthermore, it achieved this without using any high-level or low-level system hooks, which are harder to implement and also leave a fair number of artefacts on the system.
