Jumping with DynASM

Directly continuing from the first DynASM example, one obvious optimisation would be to write the remaining loop of run_job in assembly as well, thereby avoiding a function call on every iteration. This idea leads to the following version of transcode.dasc:

|.arch x64
|.actionlist transcode_actionlist
|.section code
|.globals GLOB_

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
| jmp ->loop_test
|->loop_body:
| dec r8
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
|     mov eax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap eax
      }
|     mov [rdx + field->output_offset], eax
      break;
    case 8:
|     mov rax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap rax
      }
|     mov [rdx + field->output_offset], rax
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
| add rcx, job->input_record_size
| add rdx, job->output_record_size
|->loop_test:
| test r8, r8
| jnz ->loop_body
| ret
}

In order, the changes to note are:

  1. The addition of the following:
    |.globals GLOB_
    
  2. The addition of the following loop head:
    | jmp ->loop_test
    |->loop_body:
    | dec r8
  3. The addition of the following loop tail:
    | add rcx, job->input_record_size
    | add rdx, job->output_record_size
    |->loop_test:
    | test r8, r8
    | jnz ->loop_body

The interesting components of these changes are the jumps and the labels. Once you know that the -> prefix is DynASM's notation for so-called global labels, the syntax becomes the same as in any other assembler: labels are introduced by suffixing them with a colon, and are jumped to by being used as an operand to a jump instruction. As well as global labels, DynASM also supports so-called local labels. The defining difference between the two is that an assembly fragment containing a global label can only be emitted once, whereas local labels can be emitted an unlimited number of times. As a consequence, when jumping to a local label, you need to specify whether to jump backwards to the nearest previous emission of that label, or forwards to the next subsequent emission of it. As global labels can only be emitted once, no such specification is needed for them.

Label type  Syntax    Usage                    Available names           Maximum emissions  Retrievable in C
Global      ->name:   jmp ->name               Any C identifier          1                  Yes
Local       name:     jmp >name (forward) or   Integers between 1 and 9  Unlimited          No
                      jmp <name (backward)
PC          =>expr:   jmp =>expr               Any C expression          N/A                No
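
Purely to illustrate the local label syntax, the loop above could instead be written with numeric local labels; a hypothetical sketch (the per-field moves are elided, and nothing here would be retrievable from C):

| jmp >2
|1:
| dec r8
  // per-field mov/bswap lines emitted here, exactly as before
| add rcx, job->input_record_size
| add rdx, job->output_record_size
|2:
| test r8, r8
| jnz <1
| ret
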
With labels explained, the remaining curiosity is the .globals directive: its effect is to emit a C enumeration with the names of all global labels. For this example, it causes the following to be written in transcode.h:
//|.globals GLOB_
enum {
  GLOB_loop_test,
  GLOB_loop_body,
  GLOB__MAX
};

Now that we're using labels, we need to do slightly more initialisation work. In particular, between calling dasm_init and dasm_setup, we need to do the following:

void* global_labels[GLOB__MAX];
dasm_setupglobal(&state, global_labels, GLOB__MAX);

After calling dasm_encode, the absolute address of ->loop_test: will be stored in global_labels[GLOB_loop_test], and likewise the absolute address of ->loop_body: will be stored in global_labels[GLOB_loop_body].
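
For example, the addresses could be inspected immediately after encoding; a small illustrative fragment (assuming <stdio.h> is included, purely for the printf):

/* After dasm_encode, each slot of global_labels holds the absolute
   address of the corresponding global label. */
printf("loop_body at %p, loop_test at %p\n",
       global_labels[GLOB_loop_body],
       global_labels[GLOB_loop_test]);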

For completeness, the final C code is as follows:

void (*make_transcoder(transcode_job_t* job))(const void*, void*, int)
{
  dasm_State* state;
  int status;
  void* code;
  size_t code_size;
  void* global_labels[GLOB__MAX];

  dasm_init(&state, DASM_MAXSECTION);
  dasm_setupglobal(&state, global_labels, GLOB__MAX);
  dasm_setup(&state, transcode_actionlist);

  emit_transcoder(&state, job);
  
  status = dasm_link(&state, &code_size);
  assert(status == DASM_S_OK);

  code = VirtualAlloc(nullptr, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  status = dasm_encode(&state, code);
  assert(status == DASM_S_OK);

  dasm_free(&state);
  return (void(*)(const void*, void*, int))code;
}

void run_job(transcode_job_t* job)
{
  void (*transcode_n_records)(const void*, void*, int) = make_transcoder(job);
  transcode_n_records(job->input, job->output, job->num_input_records);
}

A first DynASM example

As an example of where you might want to use DynASM, let us consider the problem of transforming an array of binary structures into an array of slightly different binary structures. For the sake of concreteness, let us assume that such transformation jobs are described by the following C structures:

struct field_info_t
{
  int byte_width;

  int input_offset;
  int input_endianness;

  int output_offset;
  int output_endianness;
};

struct transcode_job_t
{
  const void* input;
  int input_record_size;
  int num_input_records;

  void* output;
  int output_record_size;

  int num_fields;
  field_info_t* fields;
};

Naïve code for performing these jobs might look something like the following:

void swap_endianness(char* first, char* last); /* defined below */

void run_job(transcode_job_t* job)
{
  const char* input = (const char*)job->input;
  char* output = (char*)job->output;
  for(int r = 0; r < job->num_input_records; ++r)
  {
    for(int f = 0; f < job->num_fields; ++f)
    {
      field_info_t* field = job->fields + f;
      memcpy(output + field->output_offset, input + field->input_offset, field->byte_width);
      if(field->input_endianness != field->output_endianness)
        swap_endianness(output + field->output_offset, output + field->output_offset + field->byte_width - 1); 
    }

    input += job->input_record_size;
    output += job->output_record_size;
  }
}

void swap_endianness(char* first, char* last)
{
  for(; first < last; ++first, --last)
  {
    char tmp = *first;
    *first = *last;
    *last = tmp;
  }
}
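
To make the job description concrete, a caller might fill in the structures along the following lines; this is a hypothetical sketch in which the offsets, widths, endianness values, and the input_buffer/output_buffer/num_records variables are all made up for illustration:

/* Each 12-byte input record holds a 4-byte big-endian field followed by an
   8-byte little-endian field; the output should be little-endian throughout. */
field_info_t fields[] = {
  /* byte_width, input_offset, input_endianness, output_offset, output_endianness */
  { 4, 0, 1, 0, 0 },
  { 8, 4, 0, 4, 0 },
};

transcode_job_t job;
job.input = input_buffer;
job.input_record_size = 12;
job.num_input_records = num_records;
job.output = output_buffer;
job.output_record_size = 12;
job.num_fields = sizeof(fields) / sizeof(fields[0]);
job.fields = fields;

run_job(&job);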

If num_input_records is really large and the transcoding needs to be done as fast as mechanically possible, then one idea might be to unroll the inner loop of run_job at runtime using DynASM. The idea is that the resulting code will look something like the following:

void run_job(transcode_job_t* job)
{
  void (*transcode_one_record)(const char*, char*) = make_transcoder(job);

  const char* input = (const char*)job->input;
  char* output = (char*)job->output;
  for(int r = 0; r < job->num_input_records; ++r)
  {
    transcode_one_record(input, output);

    input += job->input_record_size;
    output += job->output_record_size;
  }
}

As the first step toward implementing make_transcoder, we need something to feed into DynASM. The following code is such an input, which we'll assume is in a file called transcode.dasc:

|.arch x64
|.actionlist transcode_actionlist
|.section code

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
|     mov eax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap eax
      }
|     mov [rdx + field->output_offset], eax
      break;
    case 8:
|     mov rax, [rcx + field->input_offset]
      if(field->input_endianness != field->output_endianness) {
|       bswap rax
      }
|     mov [rdx + field->output_offset], rax
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
| ret
}

With this written, we can use DynASM to transform it into a file called transcode.h using the following command line:

luajit dynasm.lua --nolineno -o transcode.h transcode.dasc

The resulting file, transcode.h, should look something like the following:

//This file has been pre-processed with DynASM.

//|.arch x64
//|.actionlist transcode_actionlist
static const unsigned char transcode_actionlist[27] = {
  139,129,233,255,15,200,255,137,130,233,255,72,139,129,
  233,255,72,15,200,255,72,137,130,233,255,195,255
};

//|.section code
#define DASM_SECTION_CODE   0
#define DASM_MAXSECTION 	1

static void emit_transcoder(Dst_DECL, transcode_job_t* job)
{
  for(int f = 0; f < job->num_fields; ++f)
  {
    field_info_t* field = job->fields + f;
    switch(field->byte_width)
    {
    case 4:
//|     mov eax, [rcx + field->input_offset]
dasm_put(Dst, 0, field->input_offset);
      if(field->input_endianness != field->output_endianness) {
//|       bswap eax
dasm_put(Dst, 4);
      }
//|     mov [rdx + field->output_offset], eax
dasm_put(Dst, 7, field->output_offset);
      break;
    case 8:
//|     mov rax, [rcx + field->input_offset]
dasm_put(Dst, 11, field->input_offset);
      if(field->input_endianness != field->output_endianness) {
//|       bswap rax
dasm_put(Dst, 16);
      }
//|     mov [rdx + field->output_offset], rax
dasm_put(Dst, 20, field->output_offset);
      break;
    default:
      throw std::exception("TODO: Other byte widths");
    }
  }
//| ret
dasm_put(Dst, 25);
}

With this, we're now able to implement make_transcoder:

#define DASM_FDEF static
#include "dynasm/dasm_proto.h" // For declarations of the dasm_ functions
#include "dynasm/dasm_x86.h"   // For x64 implementations of the dasm_ functions
#include "transcode.h"         // For emit_transcoder
#include <assert.h>            // For assert
#include <Windows.h>           // For VirtualAlloc

void (*make_transcoder(transcode_job_t* job))(const char*, char*)
{
  dasm_State* state;
  int status;
  void* code;
  size_t code_size;

  dasm_init(&state, DASM_MAXSECTION);
  dasm_setup(&state, transcode_actionlist);

  emit_transcoder(&state, job);
  
  status = dasm_link(&state, &code_size);
  assert(status == DASM_S_OK);

  code = VirtualAlloc(NULL, code_size, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
  status = dasm_encode(&state, code);
  assert(status == DASM_S_OK);

  dasm_free(&state);
  return (void(*)(const char*, char*))code;
}
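
One detail glossed over here is that the buffer obtained from VirtualAlloc is never released. When a generated transcoder is no longer needed, the memory could be returned with VirtualFree; a minimal sketch (the helper and its name are my own, not part of the article):

/* Hypothetical cleanup for a code buffer previously returned by make_transcoder. */
static void free_transcoder(void (*transcode_one_record)(const char*, char*))
{
  VirtualFree((void*)transcode_one_record, 0, MEM_RELEASE);
}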

What is DynASM?

DynASM advertises itself as a dynamic assembler for code generation engines. I can think of several interpretations of what a dynamic assembler might be, not all of which are compatible with each other. As such, it is worth beginning my series about DynASM with a description of what it is and what it isn't.

The envisioned usage pattern is to have fragments of assembly code which are syntactically complete, except possibly for the values of some constants (i.e. the instructions, addressing modes, and registers are all fixed). A decision is made at runtime as to how many copies to make of each fragment, what the value of each constant should be in each fragment, and in what order to emit the fragments.

The DynASM site states that DynASM takes mixed C/Assembler source as input, and produces plain C code output. While this is true, it is also easy to misinterpret: the input to DynASM is a C file whose intended purpose is to emit machine code - the assembly portion of the input is the code to emit rather than the code to run in-line with the C portion. As an example, the intent of the following code is that when write_return_n is called, the machine code for return n; is emitted:

void write_return_n(dasm_state ds, int n)
{
| mov rax, n
| ret
}

If DynASM were implemented by building up a string of assembly code and then passing the result to a stand-alone assembler, then the result of passing the above code through DynASM might be:

void write_return_n(dasm_state ds, int n)
{
  ds += "mov rax, " + n + "\n";
  ds += "ret\n";
}

In reality, DynASM builds up a string of machine code rather than a string of assembly code, meaning that the actual output is somewhat closer to the following:

void write_return_n(dasm_state ds, int n)
{
  dasm_append_bytes(ds, 0x48, 0xC7, 0xC0); dasm_append_dword(ds, n);
  dasm_append_bytes(ds, 0xC3);
}

With this example in mind, DynASM can be described as a text-processing tool which takes lines starting with a vertical bar, interprets them as assembly code, and replaces them with C code which writes out the corresponding machine code. This description fails to mention a bunch of really nice features, but it gives the general idea.

A note on D3D10_RESOURCE_MISC_GDI_COMPATIBLE

Direct3D 10.1 is interoperable with the Windows GDI, but there is very little explicit documentation on the matter. MSDN has a diagram with Direct2D at the center of the interoperability web, but we have to rely on a DirectX blog post for the complete interoperability diagram. In particular, note that the latter diagram has a path from Direct3D10.1 to GDI via DXGI 1.1.

MSDN gives the impression that this interoperability is simple: when calling CreateTexture2D, there is a nice flag called D3D10_RESOURCE_MISC_GDI_COMPATIBLE, the documentation for which says that after enabling the flag, the resulting texture can be cast to an IDXGISurface1 and have GetDC called on it. A similar flag called DXGI_SWAP_CHAIN_FLAG_GDI_COMPATIBLE exists for creating swap chains rather than textures. Unfortunately, if you naively use one of these flags, resource creation might well fail with E_INVALIDARG. The reason for this failure is stated in the documentation for GetDC, but really needs to be more prominent:

The format for the surface or swap chain must be DXGI_FORMAT_B8G8R8A8_UNORM_SRGB or DXGI_FORMAT_B8G8R8A8_UNORM.

If this constraint isn't satisfied, then it isn't GetDC which will fail, but the resource creation itself.
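
Putting the pieces together, creating a GDI-compatible texture and obtaining a device context from it might look like the following sketch; this is my own illustration rather than code from the post, device is assumed to be a valid ID3D10Device1*, and error handling is abbreviated:

D3D10_TEXTURE2D_DESC desc = {};
desc.Width = 256;
desc.Height = 256;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM; // mandatory for GDI compatibility
desc.SampleDesc.Count = 1;
desc.Usage = D3D10_USAGE_DEFAULT;
desc.BindFlags = D3D10_BIND_RENDER_TARGET;
desc.MiscFlags = D3D10_RESOURCE_MISC_GDI_COMPATIBLE;

ID3D10Texture2D* texture = nullptr;
HRESULT hr = device->CreateTexture2D(&desc, nullptr, &texture);
if(SUCCEEDED(hr))
{
  IDXGISurface1* surface = nullptr;
  hr = texture->QueryInterface(__uuidof(IDXGISurface1), (void**)&surface);
  if(SUCCEEDED(hr))
  {
    HDC dc = nullptr;
    hr = surface->GetDC(FALSE, &dc);
    if(SUCCEEDED(hr))
    {
      // ... draw with GDI via dc ...
      surface->ReleaseDC(nullptr);
    }
    surface->Release();
  }
  texture->Release();
}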

Making COM nice to use

I write a lot of C++ code for Windows. There are a lot of Windows APIs which expose cool functionality and are implemented using COM; for me this currently means Direct2D, DirectWrite, Direct3D, DXGI, and Windows Imaging Component. In terms of presenting a compatible ABI, COM is a good thing (and is certainly nicer than straight flattening a C++ API to a C API). That said, COM looks extremely clunky in modern C++, for a number of reasons: every method reports failure through an HRESULT return value which has to be checked at each call site, results come back through out-parameters rather than return values, and reference counting has to be done by hand via AddRef and Release.

Some of those reasons are fairly minor, others are a major nuisance, but in aggregate they make COM programming rather unpleasant. Some people take small measures to attack one of the reasons individually, such as a CHECK_HRESULT macro for throwing an exception upon an unsuccessful HRESULT (but you still need to wrap each call in this macro), or a com_ptr<T> templated smart pointer which at least does AddRef and Release automatically (but you then need to unwrap the smart pointer when passing it as a parameter to a COM method). I think that a much better approach is to attack all of the problems simultaneously. It isn't as easy as writing a single macro or a single smart pointer template, but the outcome is nice-to-use COM rather than COM-with-one-less-problem.

As with switching on Lua strings, my answer is code generation. I have a tool which:

  1. Takes as input a set of COM headers (for example d3d10.h, d3d10_1.h, d2d1.h, dwrite.h, dxgi.h, and wincodec.h).
  2. Identifies every interface which is defined across this set of headers.
  3. For each interface, it writes out a new class (in an appropriate namespace) which is like a smart pointer on steroids:
    1. If the interface inherits from a base interface, then the smart class inherits from the smart class corresponding to the base interface.
    2. Just like a smart pointer, AddRef and Release are handled automatically by copy construction, move construction, copy assignment, and move assignment.
    3. For each method of the interface, a new wrapper method (or set of overloaded methods) is written for the class:
      1. If the return type was HRESULT, then the wrapper checks for failure and throws an exception accordingly.
      2. Out-parameters become return values (using a std::tuple if there was more than one out-parameter, or an out-parameter on a method which already had a non-void and non-HRESULT return type).
      3. Coupled parameter pairs (such as pointer and length, or pointer and IID) become a single templated parameter.
      4. Pointers to POD structures get replaced with references to POD structures, with optional pointers becoming an overload which omits the parameter entirely.
      5. Pointers to COM objects get replaced with (references to) their corresponding smart class (in the case of in-parameters and also out-parameters).

As an example, consider the following code which uses raw Direct2D to create a linear gradient brush:

// rt has type ID2D1RenderTarget*
// brush has type ID2D1LinearGradientBrush*
D2D1_GRADIENT_STOP stops[] = {
  {0.f, colour_top},
  {1.f, colour_bottom}};
ID2D1GradientStopCollection* stops_collection = nullptr;
HRESULT hr = rt->CreateGradientStopCollection(
  stops,
  sizeof(stops) / sizeof(D2D1_GRADIENT_STOP),
  D2D1_GAMMA_2_2,
  D2D1_EXTEND_MODE_CLAMP,
  &stops_collection);
if(FAILED(hr))
  throw Exception(hr, "ID2D1RenderTarget::CreateGradientStopCollection");
hr = rt->CreateLinearGradientBrush(
  LinearGradientBrushProperties(Point2F(), Point2F()),
  BrushProperties(),
  stops_collection,
  &brush);
stops_collection->Release();
stops_collection = nullptr;
if(FAILED(hr))
  throw Exception(hr, "ID2D1RenderTarget::CreateLinearGradientBrush");

With the nice-COM headers, the exact same behaviour is expressed in a much more concise manner:

// rt has type C6::D2::RenderTarget
// brush has type C6::D2::LinearGradientBrush
D2D1_GRADIENT_STOP stops[] = {
  {0.f, colour_top},
  {1.f, colour_bottom}};
brush = rt.createLinearGradientBrush(
  LinearGradientBrushProperties(Point2F(), Point2F()),
  BrushProperties(),
  rt.createGradientStopCollection(
    stops,
    D2D1_GAMMA_2_2,
    D2D1_EXTEND_MODE_CLAMP));
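
To give a feel for where the conciseness comes from, here is a hypothetical sketch of the kind of wrapper the tool might generate inside the C6::D2::RenderTarget class; the raw() accessor and the ownership-taking constructor are assumptions of mine rather than details from the article:

template <size_t N>
GradientStopCollection createGradientStopCollection(
  const D2D1_GRADIENT_STOP (&stops)[N],
  D2D1_GAMMA gamma,
  D2D1_EXTEND_MODE extend_mode)
{
  ID2D1GradientStopCollection* result = nullptr;
  // The pointer and length collapse into the single array-reference parameter.
  HRESULT hr = raw()->CreateGradientStopCollection(
    stops, (UINT32)N, gamma, extend_mode, &result);
  if(FAILED(hr))
    throw Exception(hr, "ID2D1RenderTarget::CreateGradientStopCollection");
  // The out-parameter becomes the return value, wrapped in the smart class.
  return GradientStopCollection(result);
}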