SXS, hashing and persistence

How often have you asked yourself what those files in WinSXS are, and how their names are generated? How is a DLL located using a manifest?

What we will elaborate on here is code which we developed a long time ago for the fun of it, but we are showing it now because it can be used for persistence on Windows.

There are mysteries surrounding WinSXS in general. What are these long paths in WinSXS folder? What are those numbers there? How does Windows locate these paths?

Now let’s have a look at one long path:

To understand this, we need to look at sxs.dll and its logic for locating data, and we need to know that there are two hashing methods: one which uses the version from the manifest, and a second which ignores the version. From now on we will call them hash and versionhash. This is the large hex value at the end of the DLL folder name.

Let’s have a look at the default Manifest provided by Microsoft to enable visual styles.

So how does all this data become the above path?

First we need to split the manifest into these parts:

language can be “*”, and so can processorArchitecture, as we may see from two functions in sxs.dll:

If language (internally referred to as culture) is set to “*”, it is replaced by the string value “none”, while processorArchitecture is set according to the actual architecture. You may see the available architectures by disassembling sxs.dll!FusionpFormatProcessorArchitecture.

Hashing is performed on lower case attributes and values. If the value is “none“, the hashing of the attribute and value is skipped.
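The normalization rules above can be sketched in a few lines of Python. This is only a minimal sketch of the rules as we described them, not the code from sxs.dll: the function name normalize_identity is ours, the attribute values are those of the common controls manifest, and processorArchitecture is assumed to be already resolved to a concrete value.

```python
def normalize_identity(attrs):
    """Lower-case names and values, map a '*' language to 'none',
    and drop every attribute whose value is 'none'."""
    normalized = {}
    for name, value in attrs.items():
        name = name.lower()
        value = value.lower()
        if name == "language" and value == "*":
            value = "none"
        if value == "none":
            continue  # attributes with the value "none" are not hashed
        normalized[name] = value
    return normalized


identity = {
    "type": "win32",
    "name": "Microsoft.Windows.Common-Controls",
    "version": "6.0.0.0",
    "processorArchitecture": "x86",  # assumption: "*" already resolved
    "publicKeyToken": "6595b64144ccf1df",
    "language": "*",
}
print(normalize_identity(identity))
```

Only the attributes that survive this step take part in the hashing.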

From “GeneratePsuedokeyFromAttributes” you may see that 2 hashes are generated, one without version and one with version. The code is a bit long, so we will paste some relevant parts in decompiled C.

Where RtlHashEncodedLBlob is the important part of the hashing, which basically runs this loop on the whole lower case unicode string:

So we can rewrite this in C like this:
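For readers who prefer Python, the same kind of loop can be sketched as below. Treat it as a hedged sketch: we assume the multiply-and-add form with the multiplier 65599 (0x1003F) and a 32-bit result, and we treat each Python character as one UTF-16 code unit. The name hash_string is ours, and the multiplier is a parameter so you can check it against your own disassembly.

```python
def hash_string(text, multiplier=65599):
    """Multiplicative rolling hash over the lower-cased string,
    truncated to 32 bits. The multiplier is an assumption."""
    value = 0
    for ch in text.lower():
        value = (value * multiplier + ord(ch)) & 0xFFFFFFFF
    return value


# Case does not matter, because the input is lower-cased first.
print(hex(hash_string("Microsoft.Windows.Common-Controls")))
```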

Now we can go back to GeneratePsuedokeyFromAttributes and look at how the data is combined:

Et voilà, we have the hashes for versionhash and the final path. At this point the final path hash is still invalid, because it was computed with the version taken from the manifest, which is not the version on disk. Now we turn to SxspGetAutoServicingVersion.

It will open HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\SideBySide\Winners key. Then it will look for the generated name. To make it simpler, the name is generated using this formula:

This subkey is opened. It has subkeys named Major.Minor, which you get from the manifest, and the version is obtained by querying the REG_SZ default value.


We obtain the latest available version which we can use to generate the hash again for the DLL path, and the DLL path is generated as:
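The bookkeeping around versions can be sketched as follows. This is a hedged sketch with helper names of our own (major_minor, outranks, folder_name). It assumes the usual layout of WinSXS folder names, architecture_name_publicKeyToken_version_culture_hash, and ignores the shortening that Windows applies to very long names. The hash value in the example is a placeholder, not a real hash.

```python
def parse_version(text):
    """'6.0.19041.1110' -> (6, 0, 19041, 1110), so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def major_minor(version):
    """Name of the Major.Minor subkey for a manifest version."""
    major, minor = parse_version(version)[:2]
    return f"{major}.{minor}"


def outranks(candidate, winner):
    """True if candidate is a higher version than the current winner."""
    return parse_version(candidate) > parse_version(winner)


def folder_name(arch, name, token, version, culture, hash_hex):
    """Assumed folder layout; hash_hex must come from the real hashing code."""
    return "_".join([arch, name.lower(), token, version, culture, hash_hex])


print(major_minor("6.0.0.0"))
print(outranks("6.0.9.0", "6.0.10.0"))  # numeric, not string, comparison
print(folder_name("x86", "Microsoft.Windows.Common-Controls",
                  "6595b64144ccf1df", "6.0.0.0", "none",
                  "0123456789abcdef"))  # placeholder hash
```

Comparing versions as tuples of integers matters: as plain strings, "6.0.9.0" would sort above "6.0.10.0".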

We may use this logic to generate the path for a DLL with a higher version and plant it in %windir%\winsxs, so that your DLL is always loaded, as a way of persistence. Why use a higher version rather than replace the DLL in the existing folder? Simple: if Microsoft updates the DLL with a newer version, your persistence is gone. Note also that %windir%\winsxs is owned by TrustedInstaller, so you will need to take ownership of it before adding data there. The same version change has to be made in the proper Winners key. You will also need to take ownership of %windir%\winsxs\manifests to create the manifest file, whose name follows the rule folder_path.manifest.

This hashing mechanism has been in use from Windows Vista up to the latest Windows 10 Insider build at the time of publishing this article.

As you may see the DLL is loaded in the process after restart from a different path, which we have generated following the hashing algorithm that we have reversed and provided in this article.


We have provided the sample code which will dynamically generate .local for its own process, and the persistence method can be left as an exercise to the reader. We are also providing a python script which you might use to generate the Winners key, and the path with the provided version.

The code can be obtained from github.


WannaCry: Untold story

When WannaCry came to be, the whole internet was struck. People were racing against each other to write up more details on WannaCry and its spreading method, or any kind of details at all. At that time we didn’t want to write right away about what we had spotted. The write-ups published that day and later did mention this thing, but none of them understood its purpose.

This thing later became the so-called KillSwitch. None of the article publishers or researchers asked one simple question: how did this Malware 101 get around all anti-virus products at that time? It had all the signs of malware: a PE file in the resources, no code signature, APIs to create services, an import table that looks suspicious in general, not to mention the strings. Malware 101, and still it got around all security products at that time. Why?

The trick lies in the KillSwitch. The kill switch is not a kill switch; it is an anti-anti-virus technique used to bypass all anti-virus products at that time.

As anyone could have seen, the URL was very long, and apparently randomly typed. If we look carefully at it:

we can see that the author or authors randomly smashed on the keyboard (look at distributions of characters and check your keyboard, default US keyboard) to get the random URL.
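A quick way to look at the character distribution is to count how many characters fall on each row of a US keyboard. The sketch below is ours and only illustrates the idea. The domain used as input is the widely reported kill-switch domain without its .com suffix, and the helper name row_distribution is hypothetical.

```python
from collections import Counter

# Rows of a default US keyboard layout.
ROWS = {
    "digits": "1234567890",
    "top": "qwertyuiop",
    "home": "asdfghjkl",
    "bottom": "zxcvbnm",
}


def row_distribution(text):
    """Count how many characters of text sit on each keyboard row."""
    counts = Counter()
    for ch in text.lower():
        for row, keys in ROWS.items():
            if ch in keys:
                counts[row] += 1
    return dict(counts)


# Widely reported kill-switch domain, given here as an assumption.
domain = "iuqerfsodp9ifjaposdfjhgosurijfaewrwergwea"
print(row_distribution(domain))
```

A name chosen by a human for readability would usually spread over all three letter rows; a keyboard smash tends to cluster where the hands rest.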

Let’s put our theory to the test:

Now we will run this through 3 different AV engines, whose names we won’t mention, but all of them make the same wrong assumption when emulating InternetOpenUrlA/W: they return success even if the specified URL is invalid. We assure you that they are the biggest on the market. The dumps that you will see were produced by our tracers. These logs, like the sample code above, were made on the morning of Saturday, 13th of May 2017, when everybody was aware of the outbreak, as soon as we managed to get one sample and make an initial analysis.

Emulator 1:

Emulator 2:

Emulator 3:

As you may see, all 3 of them hit the ExitProcess call because the emulated InternetOpenUrlA/W was successful. As the URL wasn’t active at the time of the initial spread, this leads to only one possible conclusion. Returning success is not a bad decision by the AVs, for many reasons which we won’t go into here, but the assumption is exploited by the bad guys to fight back.

This is the whole truth about the so-called KillSwitch and its domain. No mystery, no speculation about whether the authors wanted to stop it if something went wrong, no Illuminati conspiracy. Plain and simple, it was an anti-anti-virus technique. Honestly, it was wrongly implemented, or intentionally wrongly implemented so that the whole world would ask what this URL was all about. Or was it done this way so the same trick could be reused in the future with a different domain in order to puzzle people again? This we might never know.
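The logic can be modelled in a few lines. This is a toy model with names of our own (sample_behaviour, emulator_open_url, real_open_url_unregistered), not code from the sample or from any AV engine. It only shows why an emulator that always reports success never sees the malicious behaviour.

```python
def sample_behaviour(open_url):
    """Model of the check: exit early if the URL can be opened."""
    handle = open_url("http://example.invalid/")  # hypothetical placeholder URL
    if handle is not None:
        return "ExitProcess"  # nothing malicious is ever observed
    return "payload"          # the real behaviour continues


def emulator_open_url(url):
    """Emulator stub: reports success for every URL."""
    return 1


def real_open_url_unregistered(url):
    """Real machine while the domain is not registered: the call fails."""
    return None


print(sample_behaviour(emulator_open_url))
print(sample_behaviour(real_open_url_unregistered))
```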

APC for the WIN(dows)

Through APC we can inject almost anything into a remote process. No process handle is needed, nor any other API is required to gain full control of a given process. So let’s set some ground rules, we can’t use anything except NtQueueApcThread. This API will be used to allocate RWX memory, to copy data and to trigger execution.

What most people don’t know about APCs is that you can call any API which takes anything between 0 and 3 arguments. On x32 the APC callback is defined as stdcall, but that is not strictly required: due to the logic of the APC dispatcher, CONTEXT is kept in EDI and later passed to NtContinue, so we don’t have to take care of stack alignment. We can call stdcall or cdecl APIs via APC without any problem. This logic of the APC dispatcher hasn’t changed from Windows XP to the latest Windows 10 Insider. On x64 we don’t have to care about the stack at all, as there is only one calling convention, and that is fastcall.

In this disassembly you may see that CONTEXT is saved in EDI, and that it’s passed to NtContinue. Feel free to check XP/Vista/Win7 as well and you will see the same code. This is why we can call LoadLibrary without any issue, or any other API which takes up to 3 arguments. Of course you may also call APIs which take four or more arguments, as long as the extra arguments are not actually used, since they will contain bogus data.

How do we find an Alertable thread? In user mode that’s not an easy task. In kernel mode we would check ETHREAD for the Alertable flag, but that’s not available in user mode. The easiest way that I know, and which doesn’t require anything except NtQueueApcThread, is to check the state of every thread. If a thread is in the Waiting state but its WaitReason is not DelayExecution, we queue SleepEx to it. If on the next scan the thread has changed to DelayExecution, we have found an Alertable thread. Of course, we can miss Alertable threads which are already in DelayExecution, but that’s a sacrifice that we have to make. The scan is done by using NtQuerySystemInformation and examining the thread information.
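The two-pass scan can be sketched as pure logic. In this sketch each snapshot is a dictionary of thread id to (state, wait reason), standing in for what NtQuerySystemInformation returns; the names pick_candidates and confirm_alertable are ours, and the step of actually queuing SleepEx between the two snapshots is left out.

```python
WAITING = "Waiting"
DELAY_EXECUTION = "DelayExecution"


def pick_candidates(snapshot):
    """First scan: waiting threads whose wait reason is not DelayExecution."""
    return {tid for tid, (state, reason) in snapshot.items()
            if state == WAITING and reason != DELAY_EXECUTION}


def confirm_alertable(candidates, second_snapshot):
    """Second scan: candidates that moved to DelayExecution ran our SleepEx APC."""
    return {tid for tid in candidates
            if second_snapshot.get(tid) == (WAITING, DELAY_EXECUTION)}


first = {
    100: (WAITING, "UserRequest"),
    104: (WAITING, DELAY_EXECUTION),  # already sleeping: we miss this one
    108: ("Running", None),
    112: (WAITING, "WrQueue"),
}
# ... SleepEx is queued as an APC to every candidate here ...
second = {
    100: (WAITING, DELAY_EXECUTION),  # APC was delivered: alertable
    104: (WAITING, DELAY_EXECUTION),
    108: ("Running", None),
    112: (WAITING, "WrQueue"),        # APC still pending: not alertable
}
candidates = pick_candidates(first)
print(confirm_alertable(candidates, second))
```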

How do we get RWX memory using APC only? We can’t call any API which would allocate executable memory. We can’t call VirtualAlloc/Ex, NtAllocateVirtualMemory, NtProtectVirtualMemory or NtMapViewOfSection. We can’t call WriteProcessMemory. We can, however, call HeapCreate to get RWX memory in a remote process, but then we would need to obtain its address somehow. The easiest way would be to check PEB.ProcessHeaps, but this would require us to use NtReadVirtualMemory, which is a no go for us. Another method would be to use NtQueryVirtualMemory to find this RWX region, which is again in violation of our rules. HeapCreate would be awesome if we knew that the process didn’t have any heap, so we could obtain the heap from PEB.ProcessHeaps by using memcpy or another API to copy data from PEB + offsetof(PEB, ProcessHeaps).

Now we come to design issues of .NET and native images. If you have ever looked into .NET precompiled images, you will have noticed the .xdata section, which is RWX. These images are generated by ngen.exe [1], and they are located under %windir%\assembly\NativeImages_*. Depending on the architecture, you want to load the proper one. By issuing an APC, we can use LoadLibraryA/W to load this DLL into a remote process. Of course, if LoadLibraryA/W are too obvious, there are many trampolines that can be used to call these APIs. Now we have RWX memory loaded into a remote process. We have solved a huge step, and we can continue. Later we can use GetModuleHandleExA/W to obtain the base of this DLL and write it somewhere, as GetModuleHandleExA/W takes as its 3rd argument a pointer to the memory where the base of the requested DLL will be stored, similar to LdrLoadDll or LdrGetDllHandle. Remember: we only know the address in memory where the base of this DLL will be stored, but we don’t know the base itself.

Where do we write? We will write all data to a memory gap in a well-known DLL. Whoever has done any sort of PE patching, or fixed corrupted dumps, knows what I’m talking about. For those who are not aware, the trick lies in one simple fact: a section’s VirtualSize is not always PAGE_SIZE aligned. We know that the space between the end of VirtualSize and the next PAGE_SIZE boundary won’t be used, and thus we can use it for our own purposes. For this we can write a simple algorithm which scans the sections of ntdll/kernel32/kernelbase, looks for sections with IMAGE_SCN_MEM_WRITE, and sees where we have enough room. This trick wouldn’t work if Microsoft had implemented ASLR per process. This is something which we will eventually leave for another blog post, as we already have code which performs this task.
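The gap calculation itself is simple arithmetic. The sketch below works on (name, VirtualAddress, VirtualSize, Characteristics) tuples that you would read from the section headers; the helper names section_gap and find_gap and the sample section values are ours, and a 4 KB page size is assumed.

```python
PAGE_SIZE = 0x1000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_gap(virtual_address, virtual_size):
    """Return (start RVA, size) of the unused tail of the section's last page."""
    end = virtual_address + virtual_size
    aligned_end = (end + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    return end, aligned_end - end


def find_gap(sections, needed):
    """First writable section whose tail gap holds at least `needed` bytes."""
    for name, va, vsize, characteristics in sections:
        if not characteristics & IMAGE_SCN_MEM_WRITE:
            continue  # we can only write into writable sections
        start, size = section_gap(va, vsize)
        if size >= needed:
            return name, start, size
    return None


# Hypothetical section table, not taken from a real DLL.
sections = [
    (".text", 0x1000, 0x0500, 0x60000020),  # read/execute, skipped
    (".data", 0x3000, 0x0F40, 0xC0000040),  # read/write, 0xC0 bytes free
]
print(find_gap(sections, 0x80))
```

The returned start is an RVA, so the address to write to is the DLL base plus that value.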

Once we obtain the pointer where we want to write, the question is how to write there. With APC we have several options, and I’ll list only 2 which are easy to use. On x32 systems we can use InterlockedExchange, which is good, as the first argument is the address of a LONG and the second argument is the value we want to write there. On x64 it’s different, as we can’t use InterlockedExchange since no such export exists on x64 systems. What we can use on x64 are the TLS slots stored in the TEB. The TEB address we can obtain via NtQueryInformationThread. There are a total of 64 TLS slots, which gives us 64 * 8 or 64 * 4 bytes to write data, depending on architecture. To set TLS we call TlsSetValue with an index from 0 to 63. Once we have filled the TLS slots, we may use ntdll!RtlMoveMemory to copy this data to the desired location (e.g. a memory gap). With that we have also solved the issue of copying memory to a remote process via APC only.
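Preparing the values for the TlsSetValue calls is just a matter of cutting the data into pointer-sized pieces. A minimal sketch, with a helper name of our own (tls_values) and little-endian packing assumed:

```python
import struct

TLS_SLOTS = 64


def tls_values(payload, pointer_size=8):
    """Split payload into pointer-sized little-endian integers,
    one per TLS slot, zero-padding the last one."""
    if len(payload) > TLS_SLOTS * pointer_size:
        raise ValueError("payload does not fit into the TLS slots")
    padded = payload + b"\x00" * (-len(payload) % pointer_size)
    fmt = "<Q" if pointer_size == 8 else "<I"
    return [struct.unpack_from(fmt, padded, offset)[0]
            for offset in range(0, len(padded), pointer_size)]


# Each value would be passed as the second argument of TlsSetValue(index, value).
for index, value in enumerate(tls_values(b"\x01\x02\x03\x04\x05\x06\x07\x08\x09")):
    print(index, hex(value))
```

Data larger than 512 bytes on x64 (256 on x32) has to be written in several rounds of fill and copy.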

Now we come to the main question: how do we copy to RWX memory? We know where we stored the base of the loaded DLL by using GetModuleHandleExA/W, but we don’t know its value. First things first. We need to increment the stored DLL base by the offset of the RWX section to get a pointer to the RWX memory. On x32 we can achieve this by using InterlockedExchangeAdd. We know the VirtualAddress of the sections, so this is straightforward, nothing magical in it. On x64 we are again stuck. No such API as InterlockedExchangeAdd. So what do we do? If you have spent some time working with COM objects, you know that all of them have AddRef. AddRef is supposed to increment the reference count of a certain object. The count is a ULONG value, but as we are only incrementing the low part of a ULONGLONG, we can use AddRef. What is good about AddRef is that it’s also CFG compatible. After thinking for a few minutes about which one to take, I went for IClassFactory from ole32.dll, which we can obtain by calling DllGetClassObject in our current process. If we look at the disassembly of ole32.dll we will get this for CDefClassFactory::AddRef:

So by calling this function with the first argument set to our (pointer – 8) we will increment it. We can reduce the number of APCs if we know that the default allocation granularity on Windows is 64K, which means that all memory allocated or DLLs mapped will have the last 4 hex digits set to 0, so we can start incrementing like this:

So if our DLL base is at 0x10000 for example, and we need to increment by 0xC000, we can queue 0xC000 APCs which is really not a problem for windows and would work quite fast (you wouldn’t even notice the difference). If we want it faster we can set increment at 0x100XX which would require only 0xC0 APCs queued. Not really important but something to keep in mind for optimization. Now we have an incremented pointer to RWX and we are ready to go.
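To see why moving the increment one byte up saves so many APCs, we can simulate the stored pointer as 8 little-endian bytes and apply the increments to the 32-bit value starting at a given byte offset. The helper name addref_hits is ours, and the sketch assumes each AddRef call adds exactly 1 to a 32-bit counter.

```python
def addref_hits(buffer, offset, count):
    """Simulate `count` AddRef calls on the little-endian ULONG
    located at buffer[offset:offset + 4]."""
    value = int.from_bytes(buffer[offset:offset + 4], "little")
    value = (value + count) & 0xFFFFFFFF
    buffer[offset:offset + 4] = value.to_bytes(4, "little")


base = 0x10000

# Increment at the lowest byte: one APC per byte of offset.
slow = bytearray(base.to_bytes(8, "little"))
addref_hits(slow, 0, 0xC000)

# Increment one byte higher: every APC adds 0x100 to the pointer.
fast = bytearray(base.to_bytes(8, "little"))
addref_hits(fast, 1, 0xC0)

print(hex(int.from_bytes(slow, "little")), hex(int.from_bytes(fast, "little")))
```

Both variants end at the same pointer, but the second one needs 0xC0 queued APCs instead of 0xC000. It only works because the low byte of the base is known to be zero.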

Now we need to copy our shellcode to the RWX memory. All memory copy/move APIs that I’m aware of take only a pointer to memory, but none takes a pointer to a pointer or as we would say PVOID *. But there is one API which we might abuse to do what we need. That API is RtlCopyUnicodeString. What we do now is to make 2 UNICODE_STRING structures and let RtlCopyUnicodeString do the job for us.

The memory layout at this point (applies to both x32 and x64):

Now we do RtlMoveMemory(0x1058, 0x1020, 8) and the code becomes:

Now we use Interlocked* or a combination of  TlsSetValue/RtlMoveMemory to set other data and the code becomes:

And we issue RtlCopyUnicodeString, voila, the shellcode is in RWX memory.
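The whole trick can be replayed on a fake address space. In the sketch below a bytearray stands in for the remote process memory, the offsets are our own hypothetical choices (they are not the addresses from the layout above), and the model of RtlCopyUnicodeString is simplified: it copies min(source Length, destination MaximumLength) bytes and updates the destination Length, ignoring null termination. The structure layout is the x64 UNICODE_STRING: USHORT Length, USHORT MaximumLength, 4 bytes of padding, then the Buffer pointer.

```python
import struct

US_FMT = "<HH4xQ"  # x64 UNICODE_STRING: Length, MaximumLength, pad, Buffer


def rtl_move_memory(mem, dst, src, size):
    """Plain copy between two known addresses."""
    mem[dst:dst + size] = mem[src:src + size]


def rtl_copy_unicode_string(mem, dst_addr, src_addr):
    """Simplified model: the copy goes to wherever dst.Buffer points."""
    _, dst_max, dst_buf = struct.unpack_from(US_FMT, mem, dst_addr)
    src_len, _, src_buf = struct.unpack_from(US_FMT, mem, src_addr)
    count = min(src_len, dst_max)
    mem[dst_buf:dst_buf + count] = mem[src_buf:src_buf + count]
    struct.pack_into("<H", mem, dst_addr, count)


mem = bytearray(0x200)
SHELLCODE, PTR_SLOT, SRC_US, DST_US, RWX = 0x00, 0x40, 0x50, 0x60, 0x100
shellcode = b"\x90\x90\xcc\xc3"

mem[SHELLCODE:SHELLCODE + len(shellcode)] = shellcode
struct.pack_into("<Q", mem, PTR_SLOT, RWX)  # value unknown to the injector
struct.pack_into(US_FMT, mem, SRC_US, len(shellcode), len(shellcode), SHELLCODE)
struct.pack_into(US_FMT, mem, DST_US, 0, len(shellcode), 0)  # Buffer not set yet

# Copy the stored pointer into the Buffer field without ever reading it.
rtl_move_memory(mem, DST_US + 8, PTR_SLOT, 8)
rtl_copy_unicode_string(mem, DST_US, SRC_US)
print(mem[RWX:RWX + len(shellcode)].hex())
```

The injector only ever uses addresses it already knows (the slot, the two structures, the staged shellcode); the unknown RWX pointer is dereferenced by the API on our behalf.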

Once we have everything in place, we can’t simply call this pointer. The reason is that we don’t know what that pointer is. There are some ways to execute this pointer through thread context modification APIs, such as calling RtlCaptureContext, copying the pointer to Eip/Rip and calling NtContinue/RtlRestoreContext/SetThreadContext, but those trigger CFG on Windows 10, and we are limited to only one API from our injection process. What we can do is abuse COM objects again, or to be more accurate, the QueryInterface of some COM object. We don’t really use a COM object, only its QueryInterface. The good part is that QueryInterface takes 3 arguments:

QueryInterface should return a valid interface if a proper GUID is specified, and call its AddRef. I chose again to stick with IClassFactory from the previous case on x64 (the same interface we are using in the x32 version):

We know where our pointer to RWX memory is; all we need now is to make a fake Vtbl. This is quite easy and straightforward, but for clarity here is an example.

The Vtbl now looks like:

Now just follow the execution flow of the last few lines:

And our code is executed in the remote process. As you may see, we executed code only by using NtQueueApcThread, no other API is used or required.

In our final implementation we have used the TlsSetValue approach to make the code cross-compilable with minimal changes between the x32 and x64 versions. You may write the InterlockedExchange version as an exercise. You may download the code from our github repository.




Welcome to the stolenbytes blog. We are at the moment preparing some articles, which we hope you will find interesting. Stay tuned.