APC for the WIN(dows)

Through APCs we can inject almost anything into a remote process. No process handle is needed, and no other API is required to gain full control of a given process. So let’s set some ground rules: we can’t use anything except NtQueueApcThread. This API will be used to obtain RWX memory, to copy data and to trigger execution.

What most people don’t know about APCs is that you can use them to call any API that takes between 0 and 3 arguments. On x32 the APC callback is defined as stdcall, but the target doesn’t really have to be stdcall: due to the logic of the APC dispatcher, the CONTEXT pointer is kept in EDI and later passed to NtContinue, which restores the thread’s registers, stack pointer included. So we don’t have to take care of stack cleanup, and we can call stdcall or cdecl APIs via APC without any problem. This logic of the APC dispatcher hasn’t changed from Windows XP to the latest Windows 10 Insider. On x64 there is only one calling convention (fastcall), so we don’t have to care about the stack there either.

In this disassembly you can see that CONTEXT is saved in EDI and that it’s passed to NtContinue. Feel free to check XP/Vista/Win7 as well; you will see the same code. This is why we can call LoadLibrary, or any other API which takes 3 or fewer arguments, without any issue. Of course, you may also call APIs which take four or more arguments, as long as the extra arguments are not actually used, because they will contain bogus data.

How do we find an Alertable thread? In user mode that’s not an easy task. In kernel mode we would check ETHREAD for the Alertable flag, but that’s not available to us in user mode. The easiest way that I know, and which doesn’t require anything else except NtQueueApcThread, is to check the state of every thread using NtQuerySystemInformation and examine the thread information. If a thread is in the Waiting state but its WaitReason is not DelayExecution, we queue SleepEx to it. If on the next scan the thread has changed its WaitReason to DelayExecution, we have found an Alertable thread. Of course, we can miss some Alertable threads which are already in DelayExecution, but that’s a sacrifice we have to make.
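The two-scan check described above can be sketched as a pure data model. This is only an illustration: `ThreadSnapshot`, `pick_candidates` and `confirm_alertable` are hypothetical names, the thread lists are made-up data standing in for what NtQuerySystemInformation would report, and no APC is actually queued. The numeric values are those of Waiting in KTHREAD_STATE and DelayExecution in KWAIT_REASON.

```python
from dataclasses import dataclass

# Values from the KTHREAD_STATE and KWAIT_REASON enumerations.
THREAD_STATE_WAITING = 5
WAIT_REASON_DELAY_EXECUTION = 4


@dataclass(frozen=True)
class ThreadSnapshot:
    """Hypothetical, simplified view of one thread entry in a snapshot."""
    thread_id: int
    state: int
    wait_reason: int


def pick_candidates(snapshot):
    """First scan: waiting threads whose wait reason is not DelayExecution."""
    return {
        t.thread_id
        for t in snapshot
        if t.state == THREAD_STATE_WAITING
        and t.wait_reason != WAIT_REASON_DELAY_EXECUTION
    }


def confirm_alertable(candidates, later_snapshot):
    """Second scan: candidates that are now waiting in DelayExecution."""
    return {
        t.thread_id
        for t in later_snapshot
        if t.thread_id in candidates
        and t.state == THREAD_STATE_WAITING
        and t.wait_reason == WAIT_REASON_DELAY_EXECUTION
    }


# Made-up data: thread 2 picked up the queued SleepEx, thread 1 did not,
# and thread 3 was already in DelayExecution so it is never a candidate.
first_scan = [ThreadSnapshot(1, 5, 6), ThreadSnapshot(2, 5, 6), ThreadSnapshot(3, 5, 4)]
second_scan = [ThreadSnapshot(1, 5, 6), ThreadSnapshot(2, 5, 4), ThreadSnapshot(3, 5, 4)]

candidates = pick_candidates(first_scan)
print(sorted(confirm_alertable(candidates, second_scan)))
```

Thread 3 shows the trade-off mentioned above: it may well be Alertable, but the model cannot tell, so it is skipped.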

How do we get RWX memory using APC only? We can’t call any API which would allocate executable memory: no VirtualAlloc/Ex, NtAllocateVirtualMemory, NtProtectVirtualMemory or NtMapViewOfSection. We can’t call WriteProcessMemory either. We can, however, call HeapCreate to get RWX memory in a remote process, but we would need to obtain its address somehow. The easiest way would be to check PEB.ProcessHeaps, but this would require NtReadVirtualMemory, which is a no go for us. Another method would be to use NtQueryVirtualMemory to find the RWX region, which again violates our rules. HeapCreate would be awesome if we knew that the process didn’t have any heap, as we could then obtain the heap from PEB.ProcessHeaps by using memcpy or another API to copy data from PEB + offsetof(PEB, ProcessHeaps).

Now we come to design issues of .NET and native images. If you have ever looked into .NET precompiled images, you will have noticed the .xdata section, which is RWX. These images are generated by ngen.exe [1] and are located under %windir%\assembly\NativeImages_*. Depending on the architecture, you want to load the proper one. By issuing an APC, we can use LoadLibraryA/W to load this DLL into a remote process. Of course, if LoadLibraryA/W are too obvious, there are many trampolines that can be used to call these APIs. Now we have RWX memory loaded into a remote process. We have solved a huge step and can continue. Later we can use GetModuleHandleExA/W to obtain the base of this DLL and write it somewhere, as GetModuleHandleExA/W takes as its 3rd argument a pointer to the memory where the base of the requested DLL will be stored, similar to LdrLoadDll or LdrGetDllHandle. Remember: we only know the address in memory where the base of this DLL will be stored, but we don’t know the base itself.

Where do we write? We will write all data to a memory gap in a well-known DLL. Whoever has done any sort of PE patching or fixing of corrupted dumps knows what I’m talking about. For those who are not aware, the trick lies in one simple fact: a section’s VirtualSize is not always PAGE_SIZE aligned. We know that the difference between the page-aligned size and VirtualSize won’t be used, and thus we can use it for our own purposes. For this we can write a simple algorithm that scans the sections of ntdll/kernel32/kernelbase, looking for a section with IMAGE_SCN_MEM_WRITE that has enough free space. This trick wouldn’t work if Microsoft had implemented ASLR per process. This is something which we will leave for another blog post eventually, as we already have code which performs this task.
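The gap calculation itself is simple arithmetic. Here is a minimal sketch, assuming a 4 KB page size; `section_slack` is a hypothetical helper and the section values in the example are made up, not taken from a real ntdll/kernel32/kernelbase build.

```python
PAGE_SIZE = 0x1000                 # assumed page size
IMAGE_SCN_MEM_WRITE = 0x80000000   # section characteristic flag from the PE format


def section_slack(virtual_address, virtual_size, characteristics, page_size=PAGE_SIZE):
    """Return (gap_rva, gap_size) for a writable section, or None if unusable."""
    if not characteristics & IMAGE_SCN_MEM_WRITE:
        return None
    # Round VirtualSize up to the next page boundary.
    aligned_size = (virtual_size + page_size - 1) & ~(page_size - 1)
    gap_size = aligned_size - virtual_size
    if gap_size == 0:
        return None
    # The gap starts right after the last byte the section really uses.
    return virtual_address + virtual_size, gap_size


# Made-up writable section: RVA 0x11E000, VirtualSize 0x1234.
gap = section_slack(0x11E000, 0x1234, 0xC0000040)
print(hex(gap[0]), hex(gap[1]))
```

A real scan would take the three input values from each IMAGE_SECTION_HEADER and keep the first section whose gap is large enough.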

Once we have the pointer where we want to write, the question is how to write there. With APCs we have several options, and I’ll list only two that are easy to use. On x32 systems we can use InterlockedExchange, which fits well: the first argument is the address of a LONG, and the second is the value we want to write there. On x64 it’s different: we can’t use InterlockedExchange, since it doesn’t exist as an exported function on x64 systems. What we can use on x64 are the TLS slots stored in the TEB, whose address we can obtain via NtQueryInformationThread. There are 64 TLS slots in total, which gives us 64 * 8 bytes on x64 or 64 * 4 bytes on x32 to write data. To set a slot we call TlsSetValue with an index from 0 to 63. Once the slots are filled, we can use ntdll!RtlMoveMemory to copy the data to the desired location (e.g. a memory gap). With this we have also solved the problem of copying memory into a remote process using only APCs.
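To make the slot arithmetic concrete, here is a small sketch of how a byte buffer maps onto pointer-sized slot values. `tls_capacity` and `split_into_slot_values` are hypothetical helpers; nothing here touches a real TEB, it only shows the 64 * 8 versus 64 * 4 byte limit and the little-endian packing.

```python
import struct

TLS_SLOT_COUNT = 64  # slots held directly in the TEB (indexes 0 to 63)


def tls_capacity(pointer_size):
    """Bytes that fit in the TEB slots: 512 on x64, 256 on x32."""
    return TLS_SLOT_COUNT * pointer_size


def split_into_slot_values(data, pointer_size):
    """Zero-pad data to a multiple of the pointer size and return one integer per slot."""
    if pointer_size not in (4, 8):
        raise ValueError("pointer_size must be 4 or 8")
    if len(data) > tls_capacity(pointer_size):
        raise ValueError("data does not fit into 64 TLS slots")
    padded = data + b"\x00" * (-len(data) % pointer_size)
    code = "Q" if pointer_size == 8 else "I"
    return list(struct.unpack("<%d%s" % (len(padded) // pointer_size, code), padded))


print(tls_capacity(8), tls_capacity(4))
print([hex(v) for v in split_into_slot_values(b"ABCDEFGHIJ", 8)])
```

Each returned integer corresponds to one value written to one slot index, so ten bytes need two slots on x64 and three on x32.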

Now we come to the main question: how do we copy to RWX memory? We know where GetModuleHandleExA/W stored the base of the loaded DLL, but we don’t know the base itself. First things first: we need to add the offset of the RWX section to the stored DLL base to get a pointer to the RWX memory. On x32 we can achieve this by using InterlockedExchangeAdd. We know the VirtualAddress of the section, so this is straightforward, nothing magical in it. On x64 we are stuck again, as there is no such API as InterlockedExchangeAdd. So what do we do? If you have spent some time working with COM objects, you know that all of them have AddRef, which is supposed to increment the reference count of an object. The reference count is a ULONG, but as we are only incrementing the low part of a ULONGLONG, we can use AddRef. What is good about AddRef is that it’s also CFG compatible. After thinking for a few minutes about which one to take, I went for IClassFactory from ole32.dll, which we can obtain by calling DllGetClassObject in our current process. If we look at the disassembly of ole32.dll, we will get this for CDefClassFactory::AddRef:

So by calling this function with the first argument set to our (pointer - 8), we will increment the value at our pointer. We can reduce the number of APCs if we remember that the default allocation granularity on Windows is 64K, which means that every memory allocation or mapped DLL will have the last 4 hex digits of its base set to 0. We can start incrementing like this:

So if our DLL base is at 0x10000, for example, and we need to increment it by 0xC000, we can queue 0xC000 APCs, which is really not a problem for Windows and works quite fast (you wouldn’t even notice the difference). If we want it faster, we can aim the increment one byte higher, at the 0x100XX part of the value, so that each APC adds 0x100 and only 0xC0 queued APCs are required. Not really important, but something to keep in mind for optimization. Now we have an incremented pointer to RWX memory and we are ready to go.
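The numbers above can be checked with a tiny little-endian memory model. This is a sketch under assumptions: `add_ref` only mimics the observable effect of an AddRef call (a 32-bit increment of the reference count), the (pointer - 8) adjustment is left out, and `bump` is a hypothetical helper. It reads the faster variant as targeting the counter one byte above the stored base.

```python
import struct


def add_ref(memory, counter_offset):
    """Model of AddRef: increment the 32-bit little-endian value at counter_offset."""
    (value,) = struct.unpack_from("<I", memory, counter_offset)
    struct.pack_into("<I", memory, counter_offset, (value + 1) & 0xFFFFFFFF)


def bump(base, counter_offset, calls):
    """Store a 64-bit base at offset 0, apply add_ref `calls` times, return the result."""
    memory = bytearray(struct.pack("<Q", base)) + bytearray(8)
    for _ in range(calls):
        add_ref(memory, counter_offset)
    return struct.unpack_from("<Q", memory, 0)[0]


# One increment per call on the whole value: 0xC000 calls are needed.
print(hex(bump(0x10000, 0, 0xC000)))
# Counter placed one byte higher: every call adds 0x100, so 0xC0 calls suffice.
print(hex(bump(0x10000, 1, 0xC0)))
```

Both calls end at 0x1c000. The shortcut works only because the low byte of the base is known to be zero, which the 64K allocation granularity guarantees.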

Now we need to copy our shellcode to the RWX memory. All memory copy/move APIs that I’m aware of take only a pointer to memory, but none takes a pointer to a pointer or as we would say PVOID *. But there is one API which we might abuse to do what we need. That API is RtlCopyUnicodeString. What we do now is to make 2 UNICODE_STRING structures and let RtlCopyUnicodeString do the job for us.

The memory layout at this point (applies to both x32 and x64):

Now we do RtlMoveMemory(0x1058, 0x1020, 8) and the code becomes:

Now we use Interlocked* or a combination of TlsSetValue/RtlMoveMemory to set the other data, and the code becomes:

Then we issue RtlCopyUnicodeString and, voila, the shellcode is in RWX memory.
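Why RtlCopyUnicodeString helps may be easier to see in a model: the destination address is not passed as an argument, it is read from the Buffer field of the destination UNICODE_STRING, so it is enough to have the unknown pointer copied into that field. The sketch below assumes the 64-bit structure layout (Length at +0, MaximumLength at +2, Buffer at +8) and simplifies the real API (no null terminator handling). All addresses except 0x1020 and 0x1058 are hypothetical, and the copied data is a harmless string.

```python
import struct


def read_unicode_string(memory, address):
    """Parse a 64-bit UNICODE_STRING and return (Length, MaximumLength, Buffer)."""
    length, maximum_length = struct.unpack_from("<HH", memory, address)
    (buffer,) = struct.unpack_from("<Q", memory, address + 8)
    return length, maximum_length, buffer


def copy_unicode_string(memory, dest_address, src_address):
    """Simplified RtlCopyUnicodeString: copy bytes between the two Buffer pointers."""
    src_length, _, src_buffer = read_unicode_string(memory, src_address)
    _, dest_maximum, dest_buffer = read_unicode_string(memory, dest_address)
    count = min(src_length, dest_maximum)
    memory[dest_buffer:dest_buffer + count] = memory[src_buffer:src_buffer + count]
    struct.pack_into("<H", memory, dest_address, count)  # update dest Length


memory = bytearray(0x3000)
POINTER_SLOT, SRC_STRING, DEST_STRING = 0x1020, 0x1040, 0x1050
SRC_DATA, TARGET = 0x1100, 0x2000  # hypothetical locations

data = "hello".encode("utf-16-le")
memory[SRC_DATA:SRC_DATA + len(data)] = data
struct.pack_into("<Q", memory, POINTER_SLOT, TARGET)  # pointer the writer never sees
struct.pack_into("<HHxxxxQ", memory, SRC_STRING, len(data), len(data), SRC_DATA)
struct.pack_into("<HHxxxxQ", memory, DEST_STRING, 0, len(data), 0)

# Equivalent of RtlMoveMemory(0x1058, 0x1020, 8): fill in dest.Buffer blindly.
memory[DEST_STRING + 8:DEST_STRING + 16] = memory[POINTER_SLOT:POINTER_SLOT + 8]
copy_unicode_string(memory, DEST_STRING, SRC_STRING)
print(bytes(memory[TARGET:TARGET + len(data)]).decode("utf-16-le"))
```

The caller only ever handles the addresses of the two structures and of the pointer slot, never the value of the target pointer.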

Once we have everything in place, we can’t simply call this pointer, because we don’t know what that pointer is. There are ways to execute it through thread context modification (call RtlCaptureContext, copy the pointer to Eip/Rip and then call NtContinue/RtlRestoreContext/SetThreadContext), but those trigger CFG on Windows 10, and we are limited to only one API from our injecting process. What we can do is abuse COM objects again, or to be more accurate, the QueryInterface of some COM object. We don’t really use a COM object, only its QueryInterface. The good part is that QueryInterface takes 3 arguments: the implicit this pointer, riid and ppvObject.

QueryInterface should return a valid interface if a proper GUID is specified, and call its AddRef. I again chose to stick with IClassFactory from the previous case on x64 (the same interface we use in the x32 version):

We know where our pointer to RWX memory is; all we need now is to build a fake Vtbl. This is quite easy and straightforward, but for clarity here is an example.

The Vtbl now looks like:

Now just follow the execution flow of the last few lines:

And our code is executed in the remote process. As you may see, we executed code only by using NtQueueApcThread, no other API is used or required.
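The dispatch through the fake Vtbl can be modelled in a few lines. This is my reading of the flow described above, not a reconstruction of CDefClassFactory::QueryInterface: it assumes that on a matching IID the function stores the interface and then calls AddRef through the object's Vtbl (slot 1 of IUnknown). Memory is a dictionary of 64-bit values, "code" is a dictionary of Python callables, and all addresses are hypothetical.

```python
IID_ICLASSFACTORY = "00000001-0000-0000-C000-000000000046"
S_OK = 0x00000000
E_NOINTERFACE = 0x80004002
POINTER_SIZE = 8  # x64 model


def query_interface(memory, code, this, riid, ppv_object):
    """Model: on a matching IID, store the interface and call this->Vtbl->AddRef."""
    if riid != IID_ICLASSFACTORY:
        memory[ppv_object] = 0
        return E_NOINTERFACE
    memory[ppv_object] = this
    vtbl = memory[this]                         # first field of a COM object
    add_ref_target = memory[vtbl + POINTER_SIZE]  # slot 0 QueryInterface, slot 1 AddRef
    code[add_ref_target](this)                  # execution lands wherever slot 1 points
    return S_OK


executed = []
FAKE_OBJECT, FAKE_VTBL, OUT_SLOT, RWX_POINTER = 0x1060, 0x1070, 0x1090, 0x2000
memory = {
    FAKE_OBJECT: FAKE_VTBL,                # object -> fake Vtbl
    FAKE_VTBL + POINTER_SIZE: RWX_POINTER,  # AddRef slot -> copied code
}
code = {RWX_POINTER: lambda this: executed.append(hex(this))}

result = query_interface(memory, code, FAKE_OBJECT, IID_ICLASSFACTORY, OUT_SLOT)
print(result, executed)
```

The point of the model is the double indirection: the caller supplies only the address of the fake object, and the call target is resolved from memory at call time.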

In our final implementation we used the TlsSetValue approach, so that the code compiles for both x32 and x64 with minimal changes between the two versions. You may write the InterlockedExchange version as an exercise. You may download the code from our GitHub repository.

[1] https://docs.microsoft.com/en-us/dotnet/framework/tools/ngen-exe-native-image-generator

 

2 Replies to “APC for the WIN(dows)”

  1. Interesting article, I had a question/addition about this: “On x64, however, due to calling convention we don’t have to care about the stack as there is only one calling convention and that is fastcall.”

    https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/x64-architecture#Calling_Conventions

    Specifically point 3 states: “The caller reserves space on the stack for arguments passed in registers. The called function can use this space to spill the contents of registers to the stack.” so you do actually have to care about the stack, but in this case NtQueueApcThread appears to take care of it…

    1. Thank you for your feedback.

      Good point. In this case the delivery of the APC by the kernel takes care of the shadow space allocation. Or, to be more precise, it is part of CONTEXT.P1Home-P6Home, which is used as the shadow space in the APC delivery on x64 systems.
