corsix.org

Anthropic are currently in the tech news for (re)producing a C compiler using $20,000 of Claude's time, but I'm more interested in their compiler performance take-home challenge from two weeks ago. One part of Anthropic is showing that compiler engineers might be obsolete in the medium term, whereas another part is trying to find and hire the best compiler engineers on the planet. That both things can be simultaneously true is amusing to me, but perhaps not surprising: software engineers should be in the game of automating away their current problems so that they can move onward to new problems. Just as the higher education sector has expert lecturers teaching novice undergraduates, the AI sector can have expert humans teaching novice AIs.

Anyway, enough philosophical musing, I want to look at that compiler take-home challenge. It revolves around this little computation graph:

Several copies of this graph can be chained together vertically ("rounds"), and then multiple copies can be placed side-by-side horizontally ("batch size"). For example, here are two vertical copies and three horizontal copies, along with the initial loads and final stores:

The full challenge involves 16 vertical copies and 32 horizontal copies, meaning 512 copies in total. The challenge is to take this computational graph and schedule it on a hypothetical CPU capable of ~10 instructions per cycle, trying to minimise the total cycle count. For example, scheduling it in 1300 cycles involves considering a grid which is ~10 cells wide and 1300 cells tall, then placing each box from the diagram in one of those cells. Placing boxes into grid cells isn't all that hard, but I've been glossing over an important fact which makes it hard: the ~10 cells in each grid row are not all the same. Each grid row in fact consists of 7½ "valu" cells, 2 "load" cells, 2 "store" cells, and finally 1 "flow" cell. Most of the boxes from the diagram have to go into a "valu" cell, but there are a few exceptions:

"vload" boxes require one "load" cell and one "flow" cell.
"vstore" boxes require one "store" cell and one "flow" cell.
"gather" boxes require eight "load" cells.

What looked simple now looks a bit harder: the 512 total "gather" boxes require 4096 "load" cells, thus requiring a grid at least 2048 cells tall. At this point, if I told you it was possible to make everything fit into a grid less than 1000 cells tall, you might think I was delusional. I assure you it is possible though, courtesy of two main strategies:

Reducing the number of boxes on the diagram.
Replacing some of the "+ base" and "gather" boxes with alternatives.

I don't have much to say about strategy one, as the majority of the changes can be shown using a single diagram:

Strategy two is more interesting. The gather operation can be replaced with a selection tree: preload the values of every possible idx, and use the output from all earlier & 1 boxes to select the appropriate value out of all the preloaded values. Each binary select operation requires either a single "flow" cell, or one or two "valu" cells. The number of binary select operations required to replace a "+ base" and "gather" then depends on the vertical level within the graph:

Level	Gather-style	Select-style (ignoring one-off overheads)
0	1x "valu" + 8x "load"	Free
1	1x "valu" + 8x "load"	1x ("flow" or "valu")
2	1x "valu" + 8x "load"	2x ("flow" or "valu") + 1x ("flow" or 2x "valu")
3	1x "valu" + 8x "load"	4x ("flow" or "valu") + 3x ("flow" or 2x "valu")
4	1x "valu" + 8x "load"	8x ("flow" or "valu") + 7x ("flow" or 2x "valu")
5	1x "valu" + 8x "load"	16x ("flow" or "valu") + 15x ("flow" or 2x "valu")
⋮	⋮	⋮
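The select-style column can be reproduced with a small scalar model. This is only a sketch under stated assumptions: `select_cost`, `select_tree_cost`, and `select_tree` are hypothetical names, the real challenge works on vectors rather than scalars, and the way the earlier & 1 outputs are packed into `path` (least-significant bit consumed first) is chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Select counts for one level, split the same way as the table's two terms. */
typedef struct {
  uint32_t first_term;  /* each costs one "flow" or one "valu" */
  uint32_t second_term; /* each costs one "flow" or two "valu" */
} select_cost;

select_cost select_tree_cost(unsigned level) {
  assert(level < 32);
  select_cost c = {0, 0};
  if (level > 0) {
    c.first_term = 1u << (level - 1);
    c.second_term = c.first_term - 1;
  }
  return c;
}

/* Scalar model of the tree. vals[] holds the 2^level preloaded candidates
   and is overwritten as scratch space. Bit l of path picks within pairs at
   layer l. Adds the number of binary selects performed to *num_selects. */
uint32_t select_tree(uint32_t* vals, unsigned level, uint32_t path,
                     uint32_t* num_selects) {
  assert(level < 32);
  for (unsigned l = 0; l < level; ++l) {
    uint32_t outputs = 1u << (level - 1 - l);
    uint32_t bit = (path >> l) & 1u;
    for (uint32_t i = 0; i < outputs; ++i) {
      vals[i] = bit ? vals[2 * i + 1] : vals[2 * i];
      ++*num_selects;
    }
  }
  return vals[0];
}
```

At level 3 the model performs seven selects in total, which is the 4 + 3 shown in the table.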

More than 280 gathers can be gainfully replaced with selection trees. Doing so massively reduces the number of "load" cells required, at the cost of increasing the number of "flow" cells required. In turn, some of those "flow" cells can be traded for "valu" cells. Part of the challenge is finding just the right instruction mix to balance "valu", "load", and "flow" in the 7½ : 2 : 1 ratio. If the full graph is considered as a whole, it isn't too hard to balance the overall instruction counts so as to hit 7½ : 2 : 1. Unfortunately, this isn't enough: every single one of the ~1000 cycles wants to hit that 7½ : 2 : 1 ratio, which means instruction selection and instruction scheduling are intertwined problems. This is where the real meat of the challenge is found: the space of possible instruction selections and instruction schedules is vast, and so the winner is the person (or system) capable of finding the best point in that search space. If you can find a good point, submit it to the leaderboard (or email Anthropic asking for a job).
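The whole-graph balance check is simple enough to write down. The following is a rough sketch, with `min_rows` being a hypothetical name: it takes total cell counts per kind and returns the smallest grid height that the per-row capacities allow, ignoring data dependencies and the per-row balance problem entirely.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t ceil_div(uint64_t a, uint64_t b) {
  assert(b != 0);
  return (a + b - 1) / b;
}

/* Lower bound on grid height from whole-graph cell counts, using the
   per-row capacities from the text: 7.5 valu, 2 load, 2 store, 1 flow. */
uint64_t min_rows(uint64_t valu, uint64_t load, uint64_t store, uint64_t flow) {
  uint64_t rows = ceil_div(2 * valu, 15); /* 7.5 per row == 15 per 2 rows */
  uint64_t r;
  if ((r = ceil_div(load, 2)) > rows) rows = r;
  if ((r = ceil_div(store, 2)) > rows) rows = r;
  if (flow > rows) rows = flow;
  return rows;
}
```

Feeding in just the 512 × 8 "load" cells of the unmodified gathers gives the 2048-row bound mentioned earlier.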

You might also ask whether this challenge is a reasonable proxy for actual compiler engineer experience. It is easy to cheat the default test harness, but I'm going to ignore that, as the objective is clearly to find compiler engineers rather than security engineers, and human review of submissions can easily identify the latter. There are also various well-known compiler problems which the challenge doesn't touch: there's no control flow, all instructions have single-cycle latency, memory hazards can be (almost) entirely ignored, there are no cache effects, and no doubt all sorts of other things. The target machine is clearly fictitious, but there is more VLIW hardware in the AI space than many people realise, and that final ½ "valu" slot is an endearing complication. Overall I think it is an OK proxy, though I much prefer it as an unlimited-time challenge rather than its original format of a two-hour timed exam.

Sebastian Aaltonen recently wrote an excellent piece titled No Graphics API, which you should read if you're interested in the mechanics of GPUs and APIs for talking to them. You should especially read it if you're an ambitious young engineer at Microsoft who would like to make your mark on the world by designing and championing DirectX 13. If you're instead a hardware engineer designing a new GPU (or something GPU-adjacent), you should again read it and ensure that your hardware design is amenable to what is described.

Is there anything further which can be stripped away?

The outlined API is already quite thin, but perhaps it could be even thinner still.

Compute-only

If only caring about GPGPU and not at all about graphics, there's a subset of the outlined API which drops the graphics-specific bits. The surviving functions are:

Memory: gpuMalloc, gpuFree, gpuHostToDevicePointer.
Pipelines: gpuCreateComputePipeline, gpuFreePipeline.
Queue: gpuCreateQueue, gpuStartCommandRecording, gpuSubmit.
Semaphores: gpuCreateSemaphore, gpuWaitSemaphore, gpuDestroySemaphore.
Commands: gpuMemCpy, gpuBarrier, gpuSignalAfter, gpuWaitBefore, gpuSetPipeline, gpuDispatch, gpuDispatchIndirect.

If you do care about graphics then the rest should obviously be kept, but it is an interesting little thought experiment to consider a useful compute-only subset.

`gpuHostToDevicePointer`

If a GPU has a sufficiently good MMU, then in most cases, firmware on the GPU and drivers on the host can conspire to set up the GPU-side MMU mapping to make gpuHostToDevicePointer a no-op. If most could be extended to all, then gpuHostToDevicePointer could be removed, but perhaps this is a scenario in which covering 99% of cases is easy, but the final 1% is very hard.

`gpuSubmit`

If command buffers are always one-shot, then gpuSubmit looks potentially unnecessary: all commands enqueued to a command buffer will eventually execute, so they could be eligible for execution immediately upon being enqueued, with no gpuSubmit call necessary. CUDA works this way: kernels are enqueued with one call; there's no need for separate enqueue and submit. That said, there are a few possible arguments for the two-stage dance:

Throughput can be improved (at a slight latency cost) by coalescing multiple commands together. For example, perhaps enqueueing can be done entirely in a userspace driver, whereas submission requires doing a system call.
Though the API synopsis at the end of the blog post doesn't show semaphores interacting with gpuSubmit, an earlier example does. If removing gpuSubmit, there would need to be a different way of doing the semaphore part of submit.
If doing graphics (rather than just compute), there needs to be some way to tell the GPU that all commands for the current frame are complete, and that the associated render target should be displayed to the human.

`uint32x3` for `SV_ThreadID` and `SV_GroupID` and `SV_GroupThreadID`

Do we really need these to be 3D, or does it suffice for them to be 1D? Software can always unravel a 1D index to 3D if it needs to, and the driver might be inserting such an unravel already if the hardware is really only 1D under the hood.
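As an illustration of how little work the unravel is, here is a minimal sketch in C. The `uint32x3` struct and the `unravel_thread_id` name are stand-ins invented for this example, and x is assumed to be the fastest-varying dimension.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t x, y, z; } uint32x3;

/* Turns a flat 1D index into a 3D index, given the extents of the x and y
   dimensions (the z extent is implied by the range of flat). */
uint32x3 unravel_thread_id(uint32_t flat, uint32_t dim_x, uint32_t dim_y) {
  assert(dim_x != 0 && dim_y != 0);
  uint32x3 r;
  r.x = flat % dim_x;
  r.y = (flat / dim_x) % dim_y;
  r.z = flat / dim_x / dim_y; /* two divisions avoid overflow in dim_x * dim_y */
  return r;
}
```

When the extents are powers of two, the divisions and remainders reduce to shifts and masks.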

Is there anything else which needs adding?

Though I'm a fan of minimalism, it is possible to be too minimalistic.

Multi-device support

In a system with multiple GPUs, gpuMalloc needs to know which GPU to allocate on, which means either an extra argument or a sideband function call to set the active GPU. The same is true for all of gpuTextureSizeAlign, gpuCreateSemaphore, gpuCreate*Pipeline, and gpuCreate*State (some of these cases could instead be lazy and defer the actual GPU-specific resource creation until the first time the resultant object is used with a GpuCommandBuffer, but laziness causes other problems).

A related can of worms is peer-to-peer support between multiple GPUs: can one GPU write to another's memory in the same way it can write to CPU MEMORY_READBACK memory? Is there some form of barrier or signal or semaphore allowing one GPU to wait for work on another to complete? Many further questions are possible.

Multi-process support

Some GPGPU workloads benefit from having a single GPU memory allocation visible to multiple distinct CPU processes. They might also benefit from being able to create a pipeline and then share the GPU-side state associated with that pipeline between multiple distinct CPU processes, though this is more of a minor optimisation to avoid the same state being created multiple times.

Memory pinning

It is often very convenient to be able to take an arbitrary memory allocation performed by the application (i.e. not through gpuMalloc), and make that memory visible to the GPU as if gpuMalloc were used with the MEMORY_READBACK flag. In general, making this work requires at least one of:

UMA.
A sufficiently good MMU on the GPU.
An IOMMU on the host.
The OS to find a suitable range of contiguous physical memory and change which physical memory backs the allocation (only possible for pageable memory).

If relying on at least one of the above isn't viable, an alternative is adding a variant of gpuMemCpy which accepts CPU pointers. It is always possible to implement such an async memcpy API: the driver can do some combination of temporary pinning / splitting one non-contiguous copy into several contiguous ones / bounce buffers / DMA controller scatter-gather lists.
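To give a flavour of the splitting step, here is a rough sketch which assumes that memory is only guaranteed to be physically contiguous within a page, so a copy gets cut at every page boundary. The `chunk_fn` callback type and the `split_at_page_boundaries` name are hypothetical; a real driver would also have to pin each page and translate it to a physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback, invoked once per contiguous piece. */
typedef void (*chunk_fn)(uintptr_t addr, size_t len, void* ctx);

/* Splits [addr, addr + len) at page boundaries and returns the number of
   pieces. page_size must be a power of two. fn may be NULL to just count. */
size_t split_at_page_boundaries(uintptr_t addr, size_t len, size_t page_size,
                                chunk_fn fn, void* ctx) {
  assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
  size_t pieces = 0;
  while (len != 0) {
    size_t room = page_size - (size_t)(addr & (page_size - 1));
    size_t piece = len < room ? len : room;
    if (fn) fn(addr, piece, ctx);
    addr += piece;
    len -= piece;
    ++pieces;
  }
  return pieces;
}
```

Adjacent pieces which happen to be physically adjacent could be merged again afterwards, which is roughly what a scatter-gather list builder does.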

Instruction cache fences

Some hardware contains non-coherent instruction caches which need to be explicitly cleared after loading (or modifying) code and before executing said code. This is an obvious candidate for a gpuBarrier flag / mode. Alternatively, hardware which requires it could have the driver transparently perform the appropriate fence as part of every gpuCreate*Pipeline call, or transparently perform it just before the first gpuSubmit call after a gpuCreate*Pipeline call.

Does anything give me cause for concern?

Most of the outlined API has me thinking "yep, this all seems sensible", but a few areas cause me to think a little bit harder.

Write-combining memory

The approach to memory management is relying on either UMA or PCIe ReBAR. ReBAR doesn't strictly require write-combining memory, but you really want something like WC memory to give CPU → GPU writes acceptable performance. This is fine on x86 / x86-64, but potentially an area of concern for any other CPU architectures which lack the concept of WC. Even where it exists, write combining is not your friend: handing out pointers to write-combining memory to user code comes with lots of potential footguns. Some of these footguns can be mitigated with education and documentation, but not entirely.

Deadlock avoidance

If one command buffer can do a gpuWaitBefore for a gpuSignalAfter from a different command buffer, then commands from the two buffers need to be run in the right order, lest the GPU commit to blocking on the wait command before running the commands which would unblock it. This might look like an easy problem to solve: if a GPU would be blocking on a wait, it should instead actively go looking for other work (from other submitted command buffers) to perform. Actual reality is slightly more annoying: perhaps there are a finite number of hardware command queues, so the GPU driver multiplexes multiple software command buffers onto the same hardware command queue, and if it does that multiplexing in the wrong order, the resultant queue ends up with the wait before the signal. There are many ways to make this problem go away; one such way is to put the onus on the developer, and require that it is valid (even if not optimal) for the GPU to execute submitted command buffers one after another, and commands within each of those buffers in the order they were enqueued, with no reordering anywhere. CUDA happens to design the API to ensure this: streams don't need any explicit submission (so it is valid for the GPU to run commands in the exact order they were enqueued), and cudaEventRecord must be enqueued before cudaStreamWaitEvent is enqueued, as that's just how events work.
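The "valid to execute strictly in order" requirement can be phrased as a check over the flattened command sequence. The sketch below uses a deliberately simplified model, not the real API: `cmd`, `cmd_kind`, and `in_order_execution_is_safe` are hypothetical, and each signal or wait just names one of up to 64 sync objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CMD_OTHER, CMD_SIGNAL_AFTER, CMD_WAIT_BEFORE } cmd_kind;

typedef struct {
  cmd_kind kind;
  unsigned sem; /* which sync object (0..63); ignored for CMD_OTHER */
} cmd;

/* Returns 1 if running cmds[0..n) strictly in order never blocks on a wait
   whose signal has not yet been executed, and 0 otherwise. */
int in_order_execution_is_safe(const cmd* cmds, size_t n) {
  uint64_t signalled = 0;
  for (size_t i = 0; i < n; ++i) {
    assert(cmds[i].sem < 64);
    uint64_t bit = 1ull << cmds[i].sem;
    if (cmds[i].kind == CMD_SIGNAL_AFTER) {
      signalled |= bit;
    } else if (cmds[i].kind == CMD_WAIT_BEFORE && !(signalled & bit)) {
      return 0; /* in-order execution would block here forever */
    }
  }
  return 1;
}
```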

Leaving 32 bits behind

In practice, PCIe ReBAR means having a 64-bit operating system. I'm fine with excluding 32-bit operating systems, but perhaps not everybody is.

I have slightly more sympathy for 32-bit programs on 64-bit operating systems. To make them work, gpuHostToDevicePointer would need to return a 64-bit value rather than a pointer. Even then, structure definitions containing pointers could not be shared between CPU and GPU, and the amount of memory allocatable with gpuMalloc would be limited to a few gigabytes. It might be easier to just say that 32-bit programs are a legacy which we're prepared to leave behind.

Conclusion

You'll note that my collection of thoughts takes up far fewer words than the referenced piece. As I said at the start, it is an excellent piece: most of it doesn't require any further commentary.

I'm the kind of person who thinks about the design and implementation of hash tables. One design which I find particularly cute, and I think deserves a bit more publicity, is Robin Hood open-addressing with linear probing and power-of-two table size. If you're not familiar with hash table terminology, that might look like a smorgasbord of random words, but it should become clearer as we look at some actual code.

To keep the code simple to start with, I'm going to assume:

(1) Keys are randomly-distributed 32-bit integers.
(2) Values are also 32-bit integers.
(3) If the key 0 is present, its value is not 0.
(4) The table occupies at most 32 GiB of memory.

Each slot in the table is either empty, or holds a key and a value. The combination of properties (1) and (2) allows a key/value pair to be stored as a 64-bit integer, and property (3) means that the 64-bit value 0 can be used to represent an empty slot (some hash table designs also need a special value for representing tombstones, but this design doesn't need tombstones). Combining a key and a value into 64 bits couldn't be easier: the low 32 bits hold the key, and the high 32 bits hold the value.
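As a minimal sketch of that packing (the `kv_pack`, `kv_key`, and `kv_val` names are made up for this example and aren't used by the later code, which inlines the same expressions):

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits hold the key, high 32 bits hold the value. */
uint64_t kv_pack(uint32_t key, uint32_t val) {
  uint64_t kv = key | ((uint64_t)val << 32);
  assert(kv != 0); /* key 0 with value 0 would collide with "empty" */
  return kv;
}

uint32_t kv_key(uint64_t kv) { return (uint32_t)kv; }
uint32_t kv_val(uint64_t kv) { return (uint32_t)(kv >> 32); }
```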

The structure for the table itself needs a pointer to the array of slots, the length of said array, and the number of non-empty slots. As the length is always a power of two, it's more useful to store length - 1 instead of length, which leads to mask rather than length, and property (4) means that mask can be stored as 32 bits. As the load factor should be less than 100%, we can assume count < length, and hence count can also be 32 bits. This leads to a mundane-looking:

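#include <stdint.h>
#include <stdlib.h> // calloc and free, used by table_rehash

// The typedef lets plain C code refer to the type as hash_table_t.
typedef struct hash_table_t {
  uint64_t* slots;
  uint32_t mask;
  uint32_t count;
} hash_table_t;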
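Creating an empty table is then just a zeroed allocation. The following is a sketch rather than part of the design proper: `table_init` and `table_destroy` are hypothetical names, the struct is repeated so that the example compiles on its own, and a 64-bit size_t is assumed for the largest table sizes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct hash_table_t {
  uint64_t* slots;
  uint32_t mask;
  uint32_t count;
} hash_table_t;

/* Allocates 2^log2_len empty slots. Returns 0 on success, or -1 if the
   allocation fails. calloc zeroes the array, and zero means "empty". */
int table_init(hash_table_t* table, unsigned log2_len) {
  assert(log2_len <= 32);
  uint64_t length = 1ull << log2_len;
  uint64_t* slots = calloc(length, sizeof(uint64_t));
  if (slots == NULL) return -1;
  table->slots = slots;
  table->mask = (uint32_t)(length - 1);
  table->count = 0;
  return 0;
}

void table_destroy(hash_table_t* table) {
  free(table->slots);
  table->slots = NULL;
  table->mask = 0;
  table->count = 0;
}
```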

Property (1) means that we don't need to hash keys, as they're already randomly distributed. Every possible key K has a "natural position" in the slots array, which is just K & mask. If there are collisions, the slot in which a key actually ends up might be different to its natural position. The "linear probing" part of the design means that if K cannot be in its natural position, the next slot to be considered is (K + 1) & mask, and if not that slot then (K + 2) & mask, then (K + 3) & mask, and so on. This leads to the definition of a "chain": if K is some key present in the table, C_K denotes the sequence of slots starting with K's natural position and ending with K's actual position. We have the usual property of open-addressing: none of the slots in C_K are empty slots. The "Robin Hood" part of the design then imposes an additional rather interesting property: for each slot S in C_K, Score(S.Index, S.Key) ≥ Score(S.Index, K), where:

S.Index is the index of S in the slots array (not the index of it in C_K).
S.Key is the key present in slot S (i.e. the low 32 bits of slots[S.Index]).
Score(Index, Key) is (Index - Key) & mask.
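Written out as code, these definitions are tiny. The function names below are just for illustration (the later functions inline the expressions), but the arithmetic is exactly as defined above, including the wrap-around at the end of the array:

```c
#include <assert.h>
#include <stdint.h>

/* Where key would live if there were no collisions. */
uint32_t natural_position(uint32_t key, uint32_t mask) {
  return key & mask;
}

/* How far (with wrap-around) index is past key's natural position. */
uint32_t score(uint32_t index, uint32_t key, uint32_t mask) {
  return (index - key) & mask;
}
```

For example, with mask 7, the key 22 has natural position 6, and its score is 0 at index 6, 1 at index 7, and 2 at index 0.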

These properties give us the termination conditions for the lookup algorithm: for a possible key K, we look at each slot starting from K's natural position, and either we find K, or we find an empty slot, or we find a slot with Score(S.Index, S.Key) < Score(S.Index, K). In either of the latter two cases, K cannot have been present in the table. In the function below, Score(S.Index, K) is tracked as d. In a language with a modern type system, the result of a lookup would be Optional<Value>, but if sticking to plain C, property (3) can be used to make something similar: the 64-bit result is zero if the key is absent, and otherwise the value is in the low 32 bits of the result (which may themselves be zero, but the full 64-bit result will be non-zero). The logic is thus:

uint64_t table_lookup(hash_table_t* table, uint32_t key) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = (key + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      return 0;
    } else if (key == (uint32_t)slot) {
      return (slot >> 32) | (slot << 32);
    } else if (((idx - (uint32_t)slot) & mask) < d) {
      return 0;
    }
  }
}
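Callers then need to pick the result apart. A minimal sketch of doing so, assuming the result convention described above (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* The rotation applied to a slot to form a result: value low, key high. */
uint64_t slot_to_result(uint64_t slot) {
  return (slot >> 32) | (slot << 32);
}

/* A result is zero if and only if the key was absent. */
int lookup_found(uint64_t result) { return result != 0; }

/* Only meaningful when lookup_found(result) is true. */
uint32_t lookup_value(uint64_t result) { return (uint32_t)result; }
```

Note that a found key can still have a value of zero: in that case the low 32 bits are zero, but the (non-zero) key sits in the high 32 bits.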

If using a rich 64-bit CPU architecture, many of the expressions in the above function are cheaper than they might initially seem:

slots[idx] involves zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), adding it to slots, and then loading from that address. All this is a single instruction on x86-64 or arm64.
key == (uint32_t)slot involves a comparison using the low 32 bits of a 64-bit register, which is a completely standard operation on x86-64 or arm64.
(slot >> 32) | (slot << 32) is a rotation by 32 bits, which again is a single instruction on x86-64 or arm64.

On the other hand, if using riscv64, things are less good:

If the Zba extension is present, sh3add.uw is a single instruction for zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), and adding it to slots. If not, each step is a separate instruction, though the zero-extension can be eliminated with a slight reformulation to encourage the compiler to fold the zero-extension onto the load of table->mask (as riscv64 usually defaults to making sign-extension free, in contrast to x86-64/arm64 which usually make zero-extension free). Regardless, the load is always its own instruction.
key == (uint32_t)slot hits a gap in the riscv64 ISA: it doesn't have any 32-bit comparison instructions, so this either becomes a 32-bit subtraction followed by a 64-bit comparison against zero, or promotion of both operands from 32 bits to 64 bits followed by a 64-bit comparison.
If the Zbb extension is present, rotations are a single instruction. If not, they're three instructions, and so it becomes almost worth reworking the slot layout to put the key in the high 32 bits and the value in the low 32 bits.

Moving on from lookup to insertion, there are various different options for what to do when the key being inserted is already present. I'm choosing to show a variant which returns the old value (in the same form as table_lookup returns) and then overwrites with the new value, though other variants are obviously possible. The logic follows the same overall structure as seen in table_lookup:

uint64_t table_set(hash_table_t* table, uint32_t key, uint32_t val) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  uint64_t kv = key + ((uint64_t)val << 32);
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = ((uint32_t)kv + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      // Inserting new value (and slot was previously empty)
      slots[idx] = kv;
      break;
    } else if ((uint32_t)kv == (uint32_t)slot) {
      // Overwriting existing value
      slots[idx] = kv;
      return (slot >> 32) | (slot << 32);
    } else {
      uint32_t d2 = (idx - (uint32_t)slot) & mask;
      if (d2 < d) {
        // Inserting new value, and moving existing slot
        slots[idx] = kv;
        table_reinsert(slots, mask, slot, d2);
        break;
      }
    }
  }
  if (++table->count * 4ull >= mask * 3ull) {
    // Expand table once we hit 75% load factor
    table_rehash(table);
  }
  return 0;
}

To avoid the load factor becoming too high, the above function will sometimes grow the table by calling this helper function:

void table_rehash(hash_table_t* table) {
  uint32_t old_mask = table->mask;
  uint32_t new_mask = old_mask * 2u + 1u;
  uint64_t* new_slots = calloc(new_mask + 1ull, sizeof(uint64_t));
  uint64_t* old_slots = table->slots;
  uint32_t idx = 0;
  do {
    uint64_t slot = old_slots[idx];
    if (slot != 0) {
      table_reinsert(new_slots, new_mask, slot, 0);
    }
  } while (idx++ != old_mask);
  table->slots = new_slots;
  table->mask = new_mask;
  free(old_slots);
}

Both of table_set and table_rehash make use of a helper function which is very similar to table_set, but doesn't need to check for overwriting an existing key and also doesn't need to update count:

void table_reinsert(uint64_t* slots, uint32_t mask, uint64_t kv, uint32_t d) {
  for (;; ++d) {
    uint32_t idx = ((uint32_t)kv + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      slots[idx] = kv;
      break;
    } else {
      uint32_t d2 = (idx - (uint32_t)slot) & mask;
      if (d2 < d) {
        slots[idx] = kv;
        kv = slot;
        d = d2;
      }
    }
  }
}

That covers lookup and insertion, so next up is key removal. As already hinted at, this hash table design doesn't need tombstones. Instead, removing a key involves finding the slot containing that key and then shifting slots left until finding an empty slot or a slot with Score(S.Index, S.Key) == 0. This removal strategy works due to a neat pair of emergent properties:

If slot S has Score(S.Index, S.Key) != 0, it is viable for S.Key to instead be at (S.Index - 1) & mask (possibly subject to additional re-arranging to fill the gap formed by moving S.Key).
If slot S has Score(S.Index, S.Key) == 0, and S is part of some chain C_K, then S is at the very start of C_K. Hence it is viable to turn (S.Index - 1) & mask into an empty slot without breaking any chains.

This leads to the tombstone-free removal function, which follows the established pattern of returning either the old value or zero:

uint64_t table_remove(hash_table_t* table, uint32_t key) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = (key + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      return 0;
    } else if (key == (uint32_t)slot) {
      uint32_t nxt = (idx + 1) & mask;
      --table->count;
      while (slots[nxt] && ((slots[nxt] ^ nxt) & mask)) {
        slots[idx] = slots[nxt];
        idx = nxt;
        nxt = (idx + 1) & mask;
      }
      slots[idx] = 0;
      return (slot >> 32) | (slot << 32);
    } else if (((idx - (uint32_t)slot) & mask) < d) {
      return 0;
    }
  }
}
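When experimenting with insertion and removal, it can be handy to have a debug-only function which checks that every chain still satisfies the open-addressing and Robin Hood properties. The following is a sketch of such a checker; `table_check_invariants` is a hypothetical name, and it makes no attempt to be fast.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if every key's chain is free of empty slots and every slot S in
   the chain of key K has Score(S.Index, S.Key) >= Score(S.Index, K). */
int table_check_invariants(const uint64_t* slots, uint32_t mask) {
  uint32_t idx = 0;
  do {
    uint64_t slot = slots[idx];
    if (slot != 0) {
      uint32_t key = (uint32_t)slot;
      uint32_t d = (idx - key) & mask; /* length of the chain before idx */
      for (uint32_t i = 0; i < d; ++i) {
        uint32_t j = (key + i) & mask; /* Score(j, key) == i */
        uint64_t s = slots[j];
        if (s == 0) return 0;                         /* hole in the chain */
        if (((j - (uint32_t)s) & mask) < i) return 0; /* Robin Hood broken */
      }
    }
  } while (idx++ != mask);
  return 1;
}
```

Calling this after every table_set and table_remove in a randomised test is a cheap way of catching mistakes in the shifting logic.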

The final interesting hash table operation is iterating over all keys and values, which is just an array iteration combined with filtering out zeroes:

void table_iterate(hash_table_t* table, void(*visit)(uint32_t key, uint32_t val)) {
  uint64_t* slots = table->slots;
  uint32_t mask = table->mask;
  uint32_t idx = 0;
  do {
    uint64_t slot = slots[idx];
    if (slot != 0) {
      visit((uint32_t)slot, (uint32_t)(slot >> 32));
    }
  } while (idx++ != mask);
}

That wraps up the core concepts of this hash table, so now it is time to revisit some of the initial simplifications.

If keys are 32-bit integers but are not randomly-distributed, then we just need an invertible hash function from 32 bits to 32 bits, the purpose of which is to take keys following ~any real-world pattern and emit a ~random pattern. The table_lookup, table_set, and table_remove functions gain key = hash(key) at the very start but are otherwise unmodified (noting that if the hash function is invertible, hash equality implies key equality, hence no need to explicitly check key equality), and table_iterate is modified to apply the inverse function before calling visit. If hardware CRC32 / CRC32C instructions are present (as is the case on sufficiently modern x86-64 and arm64 chips), these can be used for the task, although their inverses are annoying to compute, so perhaps not ideal if iteration is an important operation. If CRC32 isn't viable, one option out of many is:

uint32_t u32_hash(uint32_t h) {
  h ^= h >> 16;
  h *= 0x21f0aaad;
  h ^= h >> 15;
  h *= 0x735a2d97;
  h ^= h >> 15;
  return h;
}

uint32_t u32_unhash(uint32_t h) {
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x97132227;
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x333c4925;
  h ^= h >> 16;
  return h;
}
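The constants in u32_unhash aren't magic: each multiplier is the multiplicative inverse (modulo 2^32) of the corresponding multiplier in u32_hash, and each pair of shifts undoes one xor-shift. A short sketch of how they can be derived, with `u32_mul_inverse` and `u32_unxorshift` being names made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd constant modulo 2^32, via Newton's
   iteration: each step doubles the number of correct low bits. */
uint32_t u32_mul_inverse(uint32_t a) {
  assert(a & 1u); /* even numbers have no inverse modulo 2^32 */
  uint32_t x = a; /* correct to 3 bits, as a * a == 1 (mod 8) for odd a */
  for (int i = 0; i < 4; ++i) { /* 3 -> 6 -> 12 -> 24 -> 48 bits */
    x *= 2u - a * x;
  }
  return x;
}

/* Inverse of h ^= h >> s, for 1 <= s <= 31. */
uint32_t u32_unxorshift(uint32_t h, unsigned s) {
  assert(s >= 1 && s <= 31);
  for (unsigned k = s; k < 32; k *= 2) {
    h ^= h >> k;
  }
  return h;
}
```

For s = 15 the loop performs shifts by 15 and then 30, and for s = 16 it performs a single shift by 16, matching the shape of u32_unhash.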

If keys and values are larger than 32 bits, then the design can be augmented with a separate array of key/value pairs, with the design as shown containing a 32-bit hash of the key and the array index of the key/value pair. To meet property (3) in this case, either the hash function can be chosen to never be zero, or "array index plus one" can be stored rather than "array index". It is not possible to make the hash function invertible in this case, so table_lookup, table_set, and table_remove do need extending to check for key equality after confirming hash equality. Iteration involves walking the separate array of key/value pairs rather than the hash structure, which has the added benefit of iteration order being related to insertion order rather than hash order. As another twist on this, if keys and values are variably-sized, then the design can instead be augmented with a separate array of bytes, with key/value pairs serialised somewhere in that array, and the hash structure containing a 32-bit hash of the key and the byte offset (within the array) of the key/value pair.

Of course, a design can only stretch so far. If you're after a concurrent lock-free hash table, look elsewhere. If you can rely on 128-bit SIMD instructions being present, you might instead want to group together every 16 key/value pairs, keep an 8-bit hash of each key, and rely on SIMD to perform 16 hash comparisons in parallel. If you're building hardware rather than software, it can be appealing to have multiple hash functions, each one addressing its own SRAM bank. There is no one-size-fits-all hash table, but I've found the one shown here to be good for a lot of what I do.

Continuing the recent theme, I was given an ET-SoC-1 PCIe board, which is now installed in my home lab. The first order of business is confirming exactly which RISC-V instructions are supported by its minion CPU cores. We could try to learn this from documentation, or from the system emulator, or from the C compiler (all of which exist), but the ground truth can only be confirmed by testing on real hardware. This requires writing some code to interact with the hardware, and while there is a high-level runtime intended for this, it is more illuminating to jump in at a slightly lower level. We begin with:

int fd = open("/dev/et0_ops", O_RDWR | O_CLOEXEC);
if (fd < 0) FATAL("Could not open PCIe device");

The kernel driver creates two files per PCIe card: /dev/et<N>_ops and /dev/et<N>_mgmt. In broad strokes, the former is useful for launching compute kernels, whereas the latter is useful for updating firmware. Different filesystem permissions can be applied to the two files: perhaps only root should be able to update the firmware, but any user should be able to launch kernels.

Launching a kernel on the device requires uploading some RISC-V code to it, and in turn that requires choosing somewhere in the device's address space to place said code. Code for minion cores has to live within the device's DRAM, which is a 32 GiB region starting at address 0x80_0000_0000, but firmware takes a little bit for itself. There's an ioctl to determine how much is available:

struct dram_info dram_info;
if (ioctl(fd, ETSOC1_IOCTL_GET_USER_DRAM_INFO, &dram_info) < 0) {
  FATAL("Could not issue ETSOC1_IOCTL_GET_USER_DRAM_INFO");
}
printf("Have %llu bytes of DRAM starting at 0x%llx\n",
  (long long unsigned)dram_info.size,
  (long long unsigned)dram_info.base);

This prints 34265366528 (32 GiB minus 90 MiB) and 0x8005801000 (~88 MiB after 0x80_0000_0000). Proper host software would create a memory allocator at this point to dynamically manage this region, but for this post we'll just bump allocate starting from dram_info.base.
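For reference, a bump allocator over device addresses is only a few lines. This is a sketch with hypothetical names (`bump_alloc`, `bump_init`, `bump_take`), and it returns 0 on failure on the assumption that 0 is never a valid address within the user DRAM region.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint64_t next; /* first device address not yet handed out */
  uint64_t end;  /* one past the last usable device address */
} bump_alloc;

void bump_init(bump_alloc* a, uint64_t base, uint64_t size) {
  a->next = base;
  a->end = base + size;
}

/* Returns a device address with the requested power-of-two alignment, or 0
   if the region is exhausted. There is deliberately no way to free. */
uint64_t bump_take(bump_alloc* a, uint64_t size, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  uint64_t start = (a->next + align - 1) & ~(align - 1);
  if (start < a->next || start > a->end || size > a->end - start) {
    return 0;
  }
  a->next = start + size;
  return start;
}
```

With the values printed above, the first allocation lands exactly at dram_info.base, as that address is already 4 KiB aligned.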

To actually launch a kernel, we need to think about queues. Firmware on the device initialises some submission queues (SQ) and completion queues (CQ), and the kernel driver knows how to push onto an SQ and how to pop from a CQ. These queues are small at the moment: each SQ can hold just over 1 KiB, and each CQ just under 1½ KiB. Each queue contains some number of messages, and we begin with a little helper function to (ask the kernel driver to) push a message onto an SQ:

uint16_t sq_push(int fd, struct cmn_header_t* msg, uint8_t flags) {
  struct cmd_desc msg_desc = {
    .cmd = msg,
    .size = msg->size,
    .flags = flags,
  };
  uint16_t tag = msg->tag_id = (uint16_t)rand();
  for (;;) {
    if (ioctl(fd, ETSOC1_IOCTL_PUSH_SQ, &msg_desc) < 0) {
      if (errno == EAGAIN) continue;
      FATAL("Could not issue ETSOC1_IOCTL_PUSH_SQ");
    }
    return tag;
  }
}

One kind of message we can push to a /dev/et0_ops SQ is struct device_ops_kernel_launch_cmd_t, which in particular includes:

code_start_address: The address of some RISC-V code on the device.
pointer_to_args: An arbitrary 64-bit value to be passed to the RISC-V code in the a0 register. To pass more than this, the values can be placed somewhere in device memory, and the address of those values passed.
shire_mask: A bitmask of which minion tiles to execute the RISC-V code on; every set bit will cause the code to be executed 64 times (because there are 32 minions per tile, and 2 hardware threads per minion).
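As a quick sanity check on the shire_mask arithmetic, the number of hardware threads which will run the code is 64 times the number of set bits. A sketch, assuming a 64-bit mask and using a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* 32 minions per tile, 2 hardware threads per minion. */
enum { THREADS_PER_TILE = 32 * 2 };

/* Number of times the RISC-V code runs for a given shire_mask. */
uint32_t launch_thread_count(uint64_t shire_mask) {
  uint32_t tiles = 0;
  for (; shire_mask != 0; shire_mask &= shire_mask - 1) {
    ++tiles; /* each iteration clears the lowest set bit */
  }
  return tiles * THREADS_PER_TILE;
}
```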

If the optional CMD_FLAGS_KERNEL_LAUNCH_ARGS_EMBEDDED flag is specified, then we get what Vulkan calls "push constants": the SQ message can include a little bit of data immediately after struct device_ops_kernel_launch_cmd_t, which the ioctl will push to the device for us, and firmware on the device will copy to pointer_to_args prior to invoking code_start_address. Kernel arguments aren't required for this post, but we can (ab)use this mechanism to push the RISC-V code itself to the device.

This causes the message to be:

struct {
  struct device_ops_kernel_launch_cmd_t launch;
  uint32_t rv_code[3];
} __attribute__((packed, aligned(8))) launch_cmd = {
  .launch = {
    .command_info = {
      .cmd_hdr = {
        .size = sizeof(launch_cmd),
        .msg_id = DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_CMD,
        .flags = CMD_FLAGS_KERNEL_LAUNCH_ARGS_EMBEDDED,
      }
    },
    .code_start_address = dram_info.base,
    .pointer_to_args = dram_info.base,
    .shire_mask = 0x1,
  },
  .rv_code = {
    0x00000013, // nop
    0x00800513, // li a0, 8
    0x00000073, // ecall
  },
};
uint16_t tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);

The pushed RISC-V code consists of three instructions: a nop which'll come in handy later, and then two instructions to perform a syscall (the number 8 is SYSCALL_RETURN_FROM_KERNEL).
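
For anyone wanting to check the hex constants by hand, the nop and the li are both addi instructions in the standard I-type format, and ecall uses the same layout with the SYSTEM opcode. The sketch below rebuilds them; encode_i_type and encode_addi are helper names made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Encode a RISC-V I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode. */
static uint32_t encode_i_type(int32_t imm, unsigned rs1, unsigned funct3,
                              unsigned rd, unsigned opcode) {
  assert(imm >= -2048 && imm <= 2047);  /* 12-bit signed immediate */
  assert(rs1 < 32 && rd < 32 && funct3 < 8 && opcode < 128);
  return ((uint32_t)imm & 0xfffu) << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode;
}

/* addi rd, rs1, imm: opcode OP-IMM (0x13), funct3 = 0.
   "nop" is addi x0, x0, 0 and "li a0, 8" is addi x10, x0, 8. */
static uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm) {
  return encode_i_type(imm, rs1, 0, rd, 0x13);
}
```

Here encode_addi(0, 0, 0) gives 0x00000013, encode_addi(10, 0, 8) gives 0x00800513, and encode_i_type(0, 0, 0, 0, 0x73) gives the 0x00000073 of ecall.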

Once the device firmware has popped this SQ message and completed running the RISC-V code, it'll push a CQ message. Firmware can also push unsolicited CQ messages for other reasons, which proper host software should do something with, but we'll just ignore in the interest of brevity. This leads to a helper function for popping CQ messages until the message we want arrives:

void cq_pop_until(int fd, struct rsp_desc* dst, uint16_t kind, uint16_t tag) {
  for (;;) {
    if (ioctl(fd, ETSOC1_IOCTL_POP_CQ, dst) < 0) {
      if (errno == EAGAIN) continue;
      FATAL("Could not issue ETSOC1_IOCTL_POP_CQ");
    }
    struct cmn_header_t* hdr = (struct cmn_header_t*)dst->rsp;
    if (hdr->msg_id == kind && hdr->tag_id == tag) return;
  }
}

Using this, we can obtain the struct device_ops_kernel_launch_rsp_t telling us how the kernel launch went:

char rsp_buf[256];
struct rsp_desc rsp_desc = {
  .rsp = rsp_buf,
  .size = sizeof(rsp_buf),
};
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
printf("Kernel launch status: %u\n", ((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status);

This prints 0, meaning DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED.

So far so good: we can launch a trivial kernel, and it completes successfully. The neat part is that we only need a tiny modification to use this for probing which RISC-V instructions are supported by the ET-SoC-1's minion CPUs. The key is the nop at the start of rv_code: we can replace this with any other RISC-V instruction, and if the kernel still completes successfully then the instruction was supported, whereas if the kernel fails then the instruction wasn't supported. Firmware on the device handles the grungy details of catching the invalid instruction and getting everything neat and tidy again ready for running the next kernel (similar to how your operating system catches segfaults and limits their impact to terminating just the single faulty process rather than the whole machine).

Trying this out merely requires changing the nop to something invalid, launching the new kernel, and printing the status again:

launch_cmd.rv_code[0] = 0x1234deaf;
tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
printf("Faulty launch status: %u\n", ((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status);

This prints 2, meaning DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_EXCEPTION.

As it happens, firmware on the device can give us more details about the exception by populating a struct execution_context_t somewhere in device memory. We need to allocate enough device memory to hold an execution_context_t array, with one array element per hardware thread. As we set shire_mask to 0x1, the kernel runs on hardware threads 0 through 63, so we need a 64-element array. Bump allocating the device memory is easy enough: all we need to do is add .exception_buffer = dram_info.base + 64, to the definition of launch_cmd. Pulling the array back to the host is slightly more involved, motivating another little helper function:

uint16_t async_memcpy_from_device(int fd, void* dst, uint64_t src, size_t size) {
  struct {
    struct cmn_header_t header;
    struct dma_read_node node;
  } __attribute__((packed, aligned(8))) dma_cmd = {
    .header = {
      .size = sizeof(dma_cmd),
      .msg_id = DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_CMD,
    },
    .node = {
      .dst_host_virt_addr = (uint64_t)(uintptr_t)dst,
      .src_device_phy_addr = src,
      .size = size,
    }
  };
  return sq_push(fd, &dma_cmd.header, CMD_DESC_FLAG_DMA);
}

We need to allocate some special host memory to be the target of the async memcpy, but doing so is just an mmap call, and then we can do the copy and look at the execution_context_t::scause we get back:

const size_t contexts_size = sizeof(execution_context_t) * 64;
void* dma_buf = mmap(NULL, contexts_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (dma_buf == MAP_FAILED) FATAL("Could not allocate %llu byte DMA buffer", (long long unsigned)contexts_size);
tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
printf("Faulty launch scause: %u\n", (unsigned)((execution_context_t*)dma_buf)->scause);

This also prints 2, but the 2 now means something different: it comes from the RISC-V Instruction Set Manual: Volume II: Privileged Architecture scause value list, which says 2 means Illegal instruction. This is exactly what we expect, but we'll observe some other values in due course.

Next up, we need a list of encoded RISC-V instructions to try running. We could consult the RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture's opcode map for standard RISC-V instructions and ET Programmer's Reference Manual for custom instructions, but transcribing instruction encodings out of manuals is dull work, especially when someone else has already done it for us: the fork of binutils for this device has riscv-opc.h with lots of standard encodings, and esperanto-opc.h for the custom ones. Each of the #define MATCH_<INSN> <ENCODING> lines therein gives us one possible encoding of INSN, typically with all register operands set to x0 or f0 and any immediate operands set to 0. We can start by pulling out a handful of instructions from each file:

// Standard:
#define MATCH_LD          0x3003
#define MATCH_FENCE          0xf
#define MATCH_FENCE_I     0x100f
#define MATCH_DIV      0x2004033
#define MATCH_FMUL_S  0x10000053
// Custom:
#define MATCH_FMUL_PS 0x1000007b
#define MATCH_FDIV_PI 0x1e00007b
#define MATCH_BITMIXB 0x8000703b

const struct insn_entry {
  const char* name;
  uint32_t encoding;
} g_insns[] = {
  {"ld",      MATCH_LD},
  {"fence",   MATCH_FENCE},
  {"fence.i", MATCH_FENCE_I},
  {"div",     MATCH_DIV},
  {"fmul.s",  MATCH_FMUL_S},
  {"fmul.ps", MATCH_FMUL_PS},
  {"fdiv.pi", MATCH_FDIV_PI},
  {"bitmixb", MATCH_BITMIXB},
  {NULL, 0},
};
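
As an aside, it can be instructive to pull the fixed fields out of these MATCH_ values. The following sketch (decode_fields and struct insn_fields are names invented here) extracts the major opcode, funct3, and funct7 using the standard R-type bit positions:

```c
#include <stdint.h>

/* Fixed fields of a 32-bit RISC-V instruction word (R-type layout). */
struct insn_fields {
  unsigned opcode;  /* bits 6:0   */
  unsigned funct3;  /* bits 14:12 */
  unsigned funct7;  /* bits 31:25 */
};

static struct insn_fields decode_fields(uint32_t insn) {
  struct insn_fields f = {
    .opcode = insn & 0x7f,
    .funct3 = (insn >> 12) & 0x7,
    .funct7 = insn >> 25,
  };
  return f;
}
```

Applied to the constants above, MATCH_DIV decodes as opcode 0x33, funct3 4, funct7 1, while MATCH_FMUL_S and MATCH_FMUL_PS both have funct7 of 8 but differ in major opcode (0x53 versus 0x7b). That is merely what these two constants show; I haven't checked whether the pattern holds for every custom instruction.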

Each instruction can then be tested in turn:

for (const struct insn_entry* i = g_insns; i->name; ++i) {
  launch_cmd.rv_code[0] = i->encoding;
  tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
  cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
  if (((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status == DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED) {
    printf("%-8s -> OK\n", i->name);
  } else {
    tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
    cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
    printf("%-8s -> scause %u\n", i->name, (unsigned)((execution_context_t*)dma_buf)->scause);
  }
}

This prints:

`ld`	→	scause 5
`fence`	→	OK
`fence.i`	→	scause 30
`div`	→	OK
`fmul.s`	→	OK
`fmul.ps`	→	OK
`fdiv.pi`	→	scause 30
`bitmixb`	→	scause 2

The same documentation as before tells us that scause of 5 means Load access fault, which is to be expected: the encoded instruction is ld x0, 0(x0), and while this is a valid instruction, 0(x0) isn't a valid address in the minion's memory map. More curious is scause of 30: the standard documentation puts this under Designated for custom use, and so we need to look at the aforementioned ET Programmer's Reference Manual to see it described as M-code emulation. This means that the hardware's instruction decoder does recognise the instruction as valid, but the hardware doesn't natively implement the instruction; instead it is asking firmware to invisibly (albeit slowly) emulate it. Unfortunately, the firmware logic for instruction emulation hasn't been written yet, so we get a very visible exception rather than invisible emulation. The distinction between Illegal instruction and M-code emulation is somewhat arbitrary: firmware could choose to perform emulation in response to Illegal instruction, and could choose not to perform emulation in response to M-code emulation (as seen in the current firmware where the emulation logic hasn't been written yet). Despite it being arbitrary, I'll maintain the distinction.
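
For reference while reading raw scause numbers, here is a small lookup sketch. The standard names are taken from the privileged specification's exception code table; the entry for 30 reflects only what is described above for this device, and scause_name is a helper name invented for this illustration:

```c
#include <stdint.h>

/* Hypothetical helper: human-readable name for a synchronous exception code.
   Codes not listed here are reserved, or designated for custom use. */
static const char* scause_name(uint64_t scause) {
  switch (scause) {
  case 0:  return "Instruction address misaligned";
  case 1:  return "Instruction access fault";
  case 2:  return "Illegal instruction";
  case 3:  return "Breakpoint";
  case 4:  return "Load address misaligned";
  case 5:  return "Load access fault";
  case 6:  return "Store/AMO address misaligned";
  case 7:  return "Store/AMO access fault";
  case 8:  return "Environment call from U-mode";
  case 9:  return "Environment call from S-mode";
  case 12: return "Instruction page fault";
  case 13: return "Load page fault";
  case 15: return "Store/AMO page fault";
  case 30: return "M-code emulation (ET-SoC-1 custom)";
  default: return "Unknown";
  }
}
```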

The testing code can then be improved to interpret scause values:

for (const struct insn_entry* i = g_insns; i->name; ++i) {
  launch_cmd.rv_code[0] = i->encoding;
  tag = sq_push(fd, &launch_cmd.launch.command_info.cmd_hdr, 0);
  cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_KERNEL_LAUNCH_RSP, tag);
  const char* status = "OK";
  if (((struct device_ops_kernel_launch_rsp_t*)rsp_buf)->status != DEV_OPS_API_KERNEL_LAUNCH_RESPONSE_KERNEL_COMPLETED) {
    tag = async_memcpy_from_device(fd, dma_buf, launch_cmd.launch.exception_buffer, contexts_size);
    cq_pop_until(fd, &rsp_desc, DEV_OPS_API_MID_DEVICE_OPS_DMA_READLIST_RSP, tag);
    switch (((execution_context_t*)dma_buf)->scause) {
    case  2: status = "Invalid"; break;
    case 30: status = "Emulate"; break;
    }
  }
  printf("%-8s -> %s\n", i->name, status);
}

This now prints:

`ld`	→	OK
`fence`	→	OK
`fence.i`	→	Emulate
`div`	→	OK
`fmul.s`	→	OK
`fmul.ps`	→	OK
`fdiv.pi`	→	Emulate
`bitmixb`	→	Invalid

Testing more instructions is just a matter of adding more entries to g_insns. I've put my list of entries along with all supporting code up as a gist, which you're welcome to read through, but you'll need a real hardware device to make the code useful. Alternatively, keep on reading here as I go through the results from running it on the device in my lab.

All of RV64I is OK: this is addi, addiw, slti, sltiu, andi, ori, xori, slli, slliw, srli, srliw, srai, sraiw, lui, auipc, add, addw, sub, subw, slt, sltu, and, or, xor, sll, sllw, srl, srlw, sra, sraw, jal, jalr, beq, bne, blt, bltu, bge, bgeu, ld, lw, lwu, lh, lhu, lb, lbu, sd, sw, sh, sb, fence, ecall, ebreak, and various assembler pseudo-instructions expanding to these.

Standard extensions are fairly quickly enumerated:

M extension: all OK; this is mul, mulh, mulhsu, mulhu, mulw, div, divu, rem, remu, divw, divuw, remw, remuw.
Zicsr extension: all OK; this is csrrw, csrrs, csrrc, csrrwi, csrrsi, csrrci.
Zifencei extension: fence.i is emulated.
Supervisor / Machine-Mode privileged instructions: mret and sret and wfi are all OK, sfence.vma is emulated.
F extension: mostly OK (frcsr, fscsr, frrm, fsrm, fsrmi, frflags, fsflags, fsflagsi, flw, fsw, fadd.s, fsub.s, fmul.s, fmin.s, fmax.s, fmadd.s, fmsub.s, fnmsub.s, fnmadd.s, fcvt.w.s, fcvt.wu.s, fcvt.s.w, fcvt.s.wu, fsgnj.s, fsgnjn.s, fsgnjx.s, fmv.x.s, fmv.s.x, feq.s, flt.s, fle.s, fclass.s), but with a few instructions emulated (fdiv.s, fsqrt.s, fcvt.l.s, fcvt.lu.s, fcvt.s.l, fcvt.s.lu).
C extension: instructions valid as per their corresponding non-compressed counterpart.
All other standard extensions are invalid (A, B, D, Q, V, Zfa, Zfh, Zicntr, Zicond, and so on and so forth).

Rather more interesting are the non-standard instructions. There are various ways of grouping these, but I'll start with custom scalar integer arithmetic: packb is OK, but bitmixb is invalid. The behaviour of packb is just rd = (rs1 & 0xff) | ((rs2 & 0xff) << 8). The behaviour of bitmixb is far more interesting, performing a variety of bit interleavings of two 8-bit values, of the kind you might want for 2D texture address swizzling on a GPU. Conceptually this instruction takes three inputs (two 8-bit values and a 16-bit control), but RISC-V doesn't do three-input instructions, so two 8-bit inputs are packed together in a single input register, which no doubt is part of the motivation for packb. Alas, bitmixb is invalid, but perhaps a future chip will have it.
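
In C terms, packb is a one-liner. Alongside it, the sketch below shows one fixed bit interleaving of two 8-bit values (Morton order), purely to illustrate the kind of result a texture swizzle produces. To be clear, interleave8 is my own illustrative function and is not the semantics of bitmixb, which selects between a variety of interleavings via its control input:

```c
#include <stdint.h>

/* packb as described above: low byte of rs1 in bits 7:0,
   low byte of rs2 in bits 15:8. */
static uint64_t packb(uint64_t rs1, uint64_t rs2) {
  return (rs1 & 0xff) | ((rs2 & 0xff) << 8);
}

/* Illustration only (not bitmixb): interleave x into the even bits
   and y into the odd bits of a 16-bit result. */
static uint16_t interleave8(uint8_t x, uint8_t y) {
  uint16_t out = 0;
  for (int bit = 0; bit < 8; ++bit) {
    out |= (uint16_t)(((x >> bit) & 1u) << (2 * bit));
    out |= (uint16_t)(((y >> bit) & 1u) << (2 * bit + 1));
  }
  return out;
}
```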

Next up are the cache-aware narrow store instructions: shl, shg, sbl, and sbg are all OK. These instructions exist because L1 and L2 caches are not coherent on the ET-SoC-1. The standard RISC-V sh / sb instructions operate at per-minion L1D, whereas the shl / sbl instructions do not interact with L1 at all and instead operate at per-tile L2, and then shg / sbg do not interact with L1 or L2 at all and instead operate at per-ASIC L3 (or similar, depending on the exact address). Due to the non-coherence, software needs to be very aware of the cache hierarchy. If writing entire cache lines (which are aligned 64 byte ranges), software can write at any cache level, and then rely on either implicit or explicit cache eviction to propagate the lines to higher cache levels (at which point they can become visible to other cores). If writing less than a cache line, and other cores are writing other parts of the same line, then all writers need to direct their writes to a cache which is common to all writers: L2 (and hence l suffix instruction) if all writers are in the same tile, L3 (and hence g suffix instruction) otherwise. Subsequent readers also need to use a load instruction which operates at that same cache, or need to explicitly flush any lower-level caches prior to issuing regular loads.

Moving on, all of amoadd[lg].[wd], amomin[lg].[wd], amominu[lg].[wd], amomax[lg].[wd], amomaxu[lg].[wd], amoand[lg].[wd], amoor[lg].[wd], amoxor[lg].[wd], amoswap[lg].[wd], and amocmpswap[lg].[wd] are OK. These instructions are inspired by the standard Zaamo instructions, but with the same L2 (l) or L3 (g) suffix as before, and all available as either 32-bit (w) or 64-bit (d). Degenerate forms of amoswap can be used as cache-aware variants of sw and sd, hence bespoke swl / swg / sdl / sdg instructions aren't required, and degenerate forms of amoor can be used for cache-aware loads.
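
The "degenerate AMO" trick has a direct analogue in portable C11 atomics, which might make it easier to picture. This is a host-side sketch of the idea rather than device code; the function names are made up, and I'm assuming that "degenerate" means discarding the swap result for a store and OR-ing with zero for a load:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A store expressed as a swap whose old value is discarded. */
static void store_via_swap(_Atomic uint32_t* p, uint32_t value) {
  (void)atomic_exchange_explicit(p, value, memory_order_relaxed);
}

/* A load expressed as an OR with zero: memory is left unchanged,
   and the previous (i.e. current) value is returned. */
static uint32_t load_via_or(_Atomic uint32_t* p) {
  return atomic_fetch_or_explicit(p, 0, memory_order_relaxed);
}
```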

Other than amocmpswap, 32-bit variants of the scalar AMOs also exist in SIMD form: famoadd[lg].pi, famomin[lg].pi, famominu[lg].pi, famomax[lg].pi, famomaxu[lg].pi, famoand[lg].pi, famoor[lg].pi, famoxor[lg].pi, and famoswap[lg].pi are all OK. These all operate in a scatter / gather fashion: each SIMD lane forms its target address as rs2 + fs1.i32<i>, and arbitrary lanes can be skipped by setting the m0 mask register appropriately. Wrapping up the SIMD AMOs, famomin[lg].ps and famomax[lg].ps are also OK: they perform an fp32 min/max rather than an i32 or u32 min/max.
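
A host-side model can make the scatter / gather addressing concrete. The sketch below models a masked SIMD add over a flat byte array. It is single-threaded (so nothing here is actually atomic), it processes lanes in ascending order, and both the source of the addends and the absence of a returned old value are simplifying assumptions of mine rather than documented behaviour:

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };

/* Hypothetical model: for each lane enabled in m0, add addends[lane] to the
   32-bit word at mem + base + offsets[lane], where base plays the role of
   rs2 and offsets the role of fs1. */
static void model_famoadd_pi(uint8_t* mem, uint64_t base,
                             const int32_t offsets[LANES],
                             const int32_t addends[LANES],
                             uint8_t m0) {
  for (int lane = 0; lane < LANES; ++lane) {
    if (!((m0 >> lane) & 1)) continue;  /* lane disabled by the mask */
    uint8_t* addr = mem + base + (int64_t)offsets[lane];
    int32_t old;
    memcpy(&old, addr, sizeof old);
    int32_t sum = (int32_t)((uint32_t)old + (uint32_t)addends[lane]);  /* wrapping add */
    memcpy(addr, &sum, sizeof sum);
  }
}
```

Note that two enabled lanes may target the same address, in which case both additions land.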

Unlike their scalar counterparts, degenerate AMOs don't need to be used for cache-aware SIMD scatters / gathers. For L1D scatters, fscw.ps is OK, and for L2 / L3 fscw[lg].ps are OK. Narrower versions of these are also OK: fsch.ps and fsch[lg].ps for 16-bit values (from the low bits of each 32-bit lane), and fscb.ps and fscb[lg].ps for 8-bit values (again from the low bits of each 32-bit lane). Gather variants are the same, just starting with fg rather than fsc: all of fgw.ps, fgw[lg].ps, fgh.ps, fgh[lg].ps, fgb.ps, and fgb[lg].ps are OK. There are also "restricted" SIMD scatters / gathers, where all eight memory accesses of the instruction are within the same aligned 32-byte region: fsc32[whb].ps for scatters and fg32[whb].ps for gathers, all of which are OK (but aren't available in cache-aware variants; these restricted instructions always target L1D).

We're almost done with SIMD memory instructions, but there's one more batch to go. flw.ps and flw[lg].ps are OK: they load from consecutive memory locations rather than being a gather, but still respect m0 as a lane-enable mask. The naming follows the usual pattern for L1D / L2 / L3. When lane-enable isn't required, there's instead flq2, which is OK. There are no L2 / L3 variants of flq2, but flq2 is exactly what a compiler might want for spilling registers to the stack, and the stack is per-minion, so L1D is fine for that. Each of these load instructions has a matching store instruction: fsw.ps, fsw[lg].ps, and fsq2 are all reported as OK by the test program, though code in the emulator suggests that A0 silicon doesn't support lane-enables for fsw[lg].ps. Finally, fbc.ps is OK: it loads one 32-bit value from memory, and then broadcasts it to all lanes (subject to m0).

For broadcasting immediates, fbci.pi and fbci.ps are OK: they both have a 20-bit immediate (as is conventional for RISC-V), but differ in how they expand that to 32 bits. Meanwhile, fbcx.ps is OK and broadcasts from a GPR rather than an immediate. As per usual, these instructions all respect m0 as a lane-enable mask, so using any of them with a one-hot mask acts like an insert rather than a broadcast. In the other direction, fmvs.x.ps and fmvz.x.ps are OK: they extract a single lane to a GPR, either sign-extending or zero-extending to get from 32 bits to 64 bits. For shuffling lanes around within a SIMD register, fswizz.ps is OK: it has an 8-bit immediate encoding an arbitrary four-lane shuffle, which is applied to lanes 0-3 and 4-7. For rearranging registers rather than lanes, fcmov.ps and fcmovm.ps are OK, performing conditional moves (or variable blends in SSE/AVX terms).
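
To make the shuffle concrete, here is a sketch of how an 8-bit immediate can encode a four-lane shuffle applied to both halves of the register. The selector layout (two bits per output lane, lowest lane in the lowest bits, as in x86's pshufd) is my assumption rather than something taken from the manual, and the m0 lane-enable is ignored:

```c
#include <stdint.h>

/* Assumed layout: output lane j of each half takes input lane
   (imm >> 2j) & 3 from the same half. dst and src must not alias. */
static void model_fswizz(uint32_t dst[8], const uint32_t src[8], uint8_t imm) {
  for (int half = 0; half < 8; half += 4) {
    for (int j = 0; j < 4; ++j) {
      dst[half + j] = src[half + ((imm >> (2 * j)) & 3)];
    }
  }
}
```

Under this assumed layout, an immediate of 0xe4 is the identity, 0x1b reverses each group of four lanes, and 0x00 broadcasts lane 0 within lanes 0-3 and lane 4 within lanes 4-7.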

Masks have been mentioned in passing, but also have a few dedicated instructions: maskand, maskor, maskxor, and masknot are all OK for performing bitwise manipulation of mask registers. For initializing a single mask register from a GPR or an immediate, mov.m.x is OK. Meanwhile, mova.x.m and mova.m.x are OK for doing bulk moves of all eight mask registers to / from one 64-bit GPR. There's no instruction for moving just one mask to a GPR, but maskpopc and maskpopcz are OK: they count the number of 1s or 0s in the mask and put the result in a GPR. The semantics of maskpopc.rast seem like a cute extension of that, but unfortunately the instruction is invalid.
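
Eight mask registers of eight bits each fit exactly into one 64-bit GPR, which is what makes the bulk moves possible. A sketch of the packing and of the two population counts follows; the byte order within the GPR (m0 in the lowest byte) is an assumption on my part:

```c
#include <stdint.h>

/* Assumed packing: mask register m<i> occupies bits 8i+7 .. 8i of the GPR. */
static uint64_t model_mova_x_m(const uint8_t masks[8]) {
  uint64_t gpr = 0;
  for (int i = 0; i < 8; ++i) gpr |= (uint64_t)masks[i] << (8 * i);
  return gpr;
}

static void model_mova_m_x(uint8_t masks[8], uint64_t gpr) {
  for (int i = 0; i < 8; ++i) masks[i] = (uint8_t)(gpr >> (8 * i));
}

/* Count of 1 bits in a mask. */
static unsigned model_maskpopc(uint8_t mask) {
  unsigned n = 0;
  for (; mask; mask &= (uint8_t)(mask - 1)) ++n;  /* clear lowest set bit */
  return n;
}

/* Count of 0 bits in a mask. */
static unsigned model_maskpopcz(uint8_t mask) {
  return 8 - model_maskpopc(mask);
}
```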

At long last, we get to SIMD arithmetic. Starting with 32-bit integers, fand.pi, fandi.pi, for.pi, fxor.pi, fnot.pi, fadd.pi, faddi.pi, fsub.pi, fmul.pi, fmulh.pi, and fmulhu.pi are all OK, each being an 8x32b SIMD equivalent to a corresponding scalar instruction. Also OK are fmin.pi, fminu.pi, fmax.pi, and fmaxu.pi, which don't have scalar equivalents in ET-SoC-1, but do the obvious thing. There are bitwise shifts in the form of fsll.pi, fslli.pi, fsra.pi, fsrai.pi, fsrl.pi, and fsrli.pi, which are all OK, but have a subtle difference to their scalar counterparts: the shift amount isn't taken mod 32, so a shift amount greater than 31 causes the entire original value to be shifted out. The instruction fslloi.pi is invalid, but doesn't appear in the manual nor in the simulator, so I infer that it's a shift purely from its name. Next up, fsat8.pi and fsatu8.pi are both OK, with semantics of clamping each lane to the limits of int8_t or uint8_t, and then zero-extending from 8 bits back up to 32 bits. Even more specialised are fpackreph.pi and fpackrepb.pi, which are both OK, taking the low 16 or 8 bits of each lane, concatenating them to form 128 or 64 bits, then broadcasting that back up to 256 bits. To wrap up the section, fdiv.pi, fdivu.pi, frem.pi, and fremu.pi are all emulated.
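
The shift and saturation semantics are easy to get subtly wrong, so here is a per-lane sketch of my reading of them. The function names are invented, scalar_sllw models only the low 32 bits of the scalar result, and lane_fsat8 follows the clamp-then-zero-extend description above:

```c
#include <stdint.h>

/* Scalar behaviour: only the low 5 bits of the shift amount are used. */
static uint32_t scalar_sllw(uint32_t value, uint32_t amount) {
  return value << (amount & 31);
}

/* One lane of a SIMD left shift as described: the amount is not reduced
   mod 32, so anything above 31 shifts every bit out. */
static uint32_t lane_fsll(uint32_t value, uint32_t amount) {
  return amount > 31 ? 0 : value << amount;
}

/* One lane of fsat8.pi as described: clamp to the int8_t range, then
   zero-extend the resulting 8 bits back up to 32 bits. */
static uint32_t lane_fsat8(int32_t value) {
  if (value < -128) value = -128;
  if (value > 127) value = 127;
  return (uint32_t)value & 0xff;
}
```

For example, a shift amount of 33 gives 2 from scalar_sllw(1, 33) but 0 from lane_fsll(1, 33).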

For integer SIMD comparisons, feq.pi, fle.pi, flt.pi, and fltu.pi are all OK, performing lane-wise comparisons and then placing the result in another SIMD register where each lane is either -1 (comparison true) or 0 (comparison false). For results instead in a mask register, fltm.pi and fsetm.pi are both OK: the former performing signed less-than, and the latter checking for not-equal-to-zero.
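
The two result styles are related in a simple way, which the following host-side sketch tries to show (the model_ functions are illustrative names, not anything from the toolchain): a comparison into a SIMD register followed by a not-equal-to-zero test gives the same mask as comparing straight into a mask register.

```c
#include <stdint.h>

/* flt.pi style: each result lane is -1 (all ones) if a < b, else 0. */
static void model_flt_pi(int32_t dst[8], const int32_t a[8], const int32_t b[8]) {
  for (int i = 0; i < 8; ++i) dst[i] = a[i] < b[i] ? -1 : 0;
}

/* fltm.pi style: bit i of the mask is set if a[i] < b[i] (signed). */
static uint8_t model_fltm_pi(const int32_t a[8], const int32_t b[8]) {
  uint8_t mask = 0;
  for (int i = 0; i < 8; ++i) mask |= (uint8_t)((a[i] < b[i]) << i);
  return mask;
}

/* fsetm.pi style: bit i of the mask is set if a[i] != 0. */
static uint8_t model_fsetm_pi(const int32_t a[8]) {
  uint8_t mask = 0;
  for (int i = 0; i < 8; ++i) mask |= (uint8_t)((a[i] != 0) << i);
  return mask;
}
```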

We then reach FP32 SIMD arithmetic, with fadd.ps, fsub.ps, fmul.ps, fmin.ps, fmax.ps, fmadd.ps, fmsub.ps, fnmadd.ps, fnmsub.ps, fsgnj.ps, fsgnjn.ps, fsgnjx.ps and fclass.ps all being OK as obvious 8x32b SIMD equivalents to scalar instructions from the F extension. Just like in F, fdiv.ps and fsqrt.ps are emulated. To aid with that emulation, frcp.ps is OK, computing the approximate reciprocal with at most 1 ULP of approximation error. The similarly-named frcp.fix.rast is however invalid. Continuing the approximate theme, fexp.ps and flog.ps are both OK, computing the base-2 exponent or logarithm with at most 1 ULP of approximation error, but then fsin.ps and frsq.ps (reciprocal square root) are both emulated. Completing this section, fround.ps and ffrc.ps are both OK: the former rounds a floating-point value to have a zero fractional component, and the latter gives just the fractional component.

FP32 SIMD comparisons are no surprise given the integer SIMD comparisons. feq.ps and flt.ps and fle.ps are all OK, with results as a SIMD register. Their variants feqm.ps and fltm.ps and flem.ps are also OK, this time with results in a mask register.

When it comes to SIMD data type conversions, fcvt.f16.ps and fcvt.pw.ps and fcvt.pwu.ps are all OK, as are their inverses fcvt.ps.f16 and fcvt.ps.pw and fcvt.ps.pwu. All other SIMD variants of fcvt are invalid.

The remaining SIMD instructions are cubeface.ps, cubefaceidx.ps, cubesgnsc.ps, and cubesgntc.ps, all of which are invalid.

That concludes all the RISC-V instructions of the ET-SoC-1's minion CPU cores. However, RISC-V instructions aren't the full story, as additional specialised functionality is made available via dedicated CSRs:

Tensor operations are all expressed via CSR writes.
Most cache control operations (such as prefetching and evicting) are expressed via CSR writes, though some are instead ESR writes.
Synchronization operations (fast local barriers and fast credit counters) are also expressed as a combination of CSRs and ESRs.

All of this specialised functionality is worthy of study, but I've got to draw a line somewhere: functionality exposed via CSRs will have to wait for a future post.

Previously, we saw that ET-SoC-1 lives on, and contains lots of Minions tiles:

Each Minions tile contains a NoC mesh stop, 4 MiB of SRAM which can act as L2 cache or L3 cache or scratchpad, and then four "neighborhoods" of CPU cores:

The 4 MiB is built from four 1 MiB blocks, and each block can be individually configured as L2 or L3 or scratchpad. If configured as scratchpad, the 1 MiB is regular memory in the address space, which any CPU within the ASIC can access (as could the host, if the device firmware included it as part of the exposed address space). If configured as L2, the 1 MiB is a hardware-managed cache, used by the local tile to cache accesses to the on-device DRAM. If configured as L3, the 1 MiB is a hardware-managed cache used by all tiles to cache accesses to any address.

Delving deeper, each neighborhood contains 32 KiB of L1 instruction cache, one set of PMU counters, and eight minion cores:

The instruction cache can deliver two 64-byte lines to cores every cycle, so is relying on each core executing at least four instructions from each line to avoid stalls. As each RISCV instruction is either 2 or 4 bytes, and a common rule-of-thumb is one branch every five instructions, this all seems reasonable. Executable code for minions always comes from on-device DRAM, which slightly simplifies this cache. There's a single set of PMU (performance monitoring unit) counters per neighborhood, which is a slight divergence from the RISCV specification of per-core performance counters, but a shared PMU is better than no PMU, so I can live with it.
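
The "at least four instructions per line" figure falls out of simple arithmetic, sketched here under the assumptions that all eight cores fetch at the same rate and that a single-issue core consumes at most one instruction per cycle:

```c
enum {
  LINES_PER_CYCLE = 2,         /* instruction cache bandwidth, from the text */
  CORES_PER_NEIGHBORHOOD = 8,  /* from the text */
  LINE_BYTES = 64,             /* from the text */
};

/* With the bandwidth shared evenly, each core receives one line
   every this-many cycles. */
static unsigned cycles_between_lines_per_core(void) {
  return CORES_PER_NEIGHBORHOOD / LINES_PER_CYCLE;
}

/* A core issuing one instruction per cycle therefore needs at least this
   many useful instructions from each line to stay busy. */
static unsigned min_insns_per_line(void) {
  return cycles_between_lines_per_core();
}

/* How many instructions a line holds for a given instruction size. */
static unsigned insns_in_line(unsigned bytes_per_insn) {
  return LINE_BYTES / bytes_per_insn;
}
```

A line holds between 16 and 32 instructions, so needing just four of them per line leaves plenty of slack for taken branches.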

At long last, we now reach an individual minion core:

Starting on the left hand side, we have a fairly conventional setup for a RISCV core with two hardware threads sharing one set of execution resources. The core is described as single-issue in-order, though I'm assuming that "in-order" merely means that instructions from each thread start in program order and retire in program order, but can reorder during execution (in particular so that loads are effectively asynchronous, blocking at the first instruction which consumes the load result, rather than blocking at the load instruction itself). Speaking of loads, one pleasant surprise is the presence of an MMU (and associated privilege modes) for converting virtual addresses to physical. My initial reaction to the presence of an MMU was that it was overkill (cf. Tenstorrent baby RISCV cores, which lack one), but after contemplating it for a bit, I'm really glad the hardware designers spent the transistors on it. The only notable limitation is that both hardware threads share a single satp CSR, meaning that both threads see the same virtual address space; in OS terminology, they need to come from the same process rather than separate processes. Things get slightly more exotic on the right hand side of the diagram, in particular with the Tensor μop Sequencer, but we can initially ignore that and focus on the RISCV side of things. If we do that, then the instruction set is 64-bit RISCV (RV64I), with various extensions:

Standard M Extension for Integer Multiplication and Division
Standard F Extension for Single-Precision Floating-Point, with a few tweaks:
- Denormals are treated as sign-preserved zero on input
- Denormals are flushed to sign-preserved zero on output (before rounding)
- Division and square-root are not implemented in hardware
- Floating-point registers (FPRs) are 256 bits wide (rather than just 32 bits wide), with F instructions ignoring the high 224 bits on input and writing zero to the high 224 bits on output
Standard C Extension for Compressed Instructions
Standard Zicsr Extension for Control and Status Register (CSR) Instructions
Custom atomic memory operations extension, similar to the standard Zaamo extension, but with various tweaks:
- All atomic instructions come in both "local" and "global" variants, where "local" conceptually executes at the L2 data cache, and "global" conceptually executes at L3 and/or the memory controller
- Most atomic instructions come in both scalar and vector variants, where "scalar" operates on 32 bits or 64 bits, and "vector" operates on eight 32-bit lanes sequentially
- Compare-and-swap instructions
Custom SIMD extension, much closer to AVX10/256 than it is to the standard RISCV vector extension:
- Operates on the same floating-point registers as the F extension, which have been widened to 256 bits
- Most instructions operate lane-wise on eight 32-bit lanes
- Adds eight mask registers (m0 ... m7), each 8 bits wide (one bit per lane)
- Most F instructions have a SIMD equivalent operating lanewise on eight 32-bit lanes
- Most RV32I instructions have a SIMD equivalent operating lanewise on eight 32-bit lanes, as does integer multiplication (but division doesn't)
- Gather and scatter instructions
- Conditional move instructions
- Unary fp32 functions operating lanewise: exp2, log2, reciprocal
- Low-precision dot product support, albeit only exposed to the Tensor μop Sequencer and not exposed as RISCV instructions
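
The denormal handling listed under the F extension tweaks can be sketched on the host by manipulating the fp32 bit pattern directly. This shows only the flush itself; where exactly the hardware applies it relative to rounding is not modelled, and the function names are mine:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret an fp32 value as its raw bit pattern. */
static uint32_t float_bits(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);
  return bits;
}

/* Replace a denormal fp32 value with a zero of the same sign;
   every other value passes through unchanged. */
static float flush_denormal(float x) {
  uint32_t bits = float_bits(x);
  uint32_t exponent = (bits >> 23) & 0xff;
  uint32_t mantissa = bits & 0x7fffff;
  if (exponent == 0 && mantissa != 0) bits &= 0x80000000u;  /* keep only the sign bit */
  memcpy(&x, &bits, sizeof x);
  return x;
}
```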

At the bottom of the diagram is 4 KiB of L1, which I've drawn after the MMU for the sake of diagram simplicity, but I assume is virtually-indexed physically-tagged and therefore operating in parallel with the MMU. This 4 KiB has three possible configuration modes:

Mode	Thread 0	Thread 1	Tensor Coprocessor
Shared	4 KiB data cache, shared		Mostly disabled
Split	2 KiB data cache	2 KiB data cache	Mostly disabled
Scratchpad	½ KiB data cache	½ KiB data cache	3 KiB register file

The most exotic part of each minion core is the Tensor μop Sequencer. If you're just after a bunch of RISCV CPU cores with SIMD extensions, you can ignore it, but if you're optimising the performance of matrix multiplication then you'll eventually need to look at it. It is used for computing C = A @ B or C += A @ B, where C is a matrix of 32-bit elements between 1x1 and 16x16 in size. The possible data types are:

C	`+=`	A	`@`	B	Relative throughput
fp32	`+=`	fp32	`@`	fp32	1x
fp32	`+=`	fp16	`@`	fp16	2x
(u)int32	`+=`	(u)int8	`@`	(u)int8	8x

Notably absent are bf16 and fp8 data types, which is possibly due to the age of the design. When A and B are both fp32, the Tensor μop Sequencer makes use of the same FMA circuitry as used by the fp32 SIMD instructions, and so has throughput of eight scalar FMA operations per cycle (each one adding a single scalar product to one element of C). When A and B are both fp16, a variant of the circuitry is used which performs two fp16 multiplications in each lane followed by a non-standard three-way floating-point addition in each lane (thus adding two scalar products to each of eight elements of C). When A and B are both 8-bit integers, there are four scalar products and a five-way addition per lane per cycle, but this time the hardware can compute 16 lanes per cycle.
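
The relative throughput column can be recovered from the lane counts and products per lane given above. Here is a sketch of that arithmetic, along with a reference for one lane of the 8-bit integer mode; the struct and function names are invented for illustration, and integer overflow of the accumulator is not modelled:

```c
#include <stdint.h>

struct tensor_mode {
  const char* name;
  unsigned lanes_per_cycle;
  unsigned products_per_lane;
};

static const struct tensor_mode g_modes[] = {
  {"fp32 += fp32 @ fp32",  8, 1},
  {"fp32 += fp16 @ fp16",  8, 2},
  {"int32 += int8 @ int8", 16, 4},
};

static unsigned products_per_cycle(const struct tensor_mode* m) {
  return m->lanes_per_cycle * m->products_per_lane;
}

/* Throughput relative to the fp32 mode. */
static unsigned relative_throughput(const struct tensor_mode* m) {
  return products_per_cycle(m) / products_per_cycle(&g_modes[0]);
}

/* One lane, one cycle of the int8 mode: four products and a
   five-way addition into the 32-bit accumulator. */
static int32_t ima8_lane_step(int32_t acc, const int8_t a[4], const int8_t b[4]) {
  return acc + a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```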

All of these matrix multiplications require storage for A and B and C. We've already seen the storage for A: it's the 3 KiB register file present when L1 is configured in scratchpad mode. The documentation refers to this register file as L1Scp[0] through L1Scp[47], where each L1Scp[i] holds between 4 and 64 bytes of matrix data. We'll come back to B. Moving on to C, when C is fp32, it is stored in the floating-point registers (i.e. f0 ... f31) of thread 0: a pair of floating-point registers can hold a 16-element matrix row, so the 32 FPRs can collectively hold a 16x16 matrix. Things are slightly more complex when C is (u)int32, possibly because there's not enough bandwidth from the FPRs for 16 lanes per cycle. This motivates the TenC registers, which can collectively hold a 16x16 (u)int32 matrix, and are used exclusively as a temporary accumulator for integer matrix multiplications: the actual instruction variants for these end up looking like TenC = A @ B or TenC += A @ B or FPRs = TenC + A @ B. Coming back to B, it can either come from L1Scp (like A), or from the elusive TenB register file. I say elusive because TenB exists for the purposes of exposition, but doesn't actually exist as permanent architectural state. If instructions can indeed reorder during execution (as is very desirable to hide load latency), then hardware will have some kind of queue structure for holding the results of instructions, where a queue entry is allocated early in the lifetime of an instruction, is populated when its execution completes, and is popped if it has completed and is at the front of the queue (at which point the instruction's result is committed to the register file). I posit that this queue is TenB, except that upon being popped, the results are sent to the multipliers rather than committed to a register file. This would be consistent with all the documented properties of TenB, and would be a cute way of reusing existing hardware resources.
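
The storage arithmetic above can be double-checked with a few lines of C. The sizes come from the text, whereas the exact placement of a C element within the FPRs (row r in registers 2r and 2r+1, lower columns first) is a guess of mine for illustration:

```c
#include <assert.h>

enum {
  FPR_LANES = 8,           /* 256-bit register / 32-bit elements */
  TILE_DIM = 16,           /* C is at most 16x16 */
  L1SCP_ENTRIES = 48,      /* L1Scp[0] through L1Scp[47] */
  L1SCP_ENTRY_BYTES = 64,  /* each entry holds up to 64 bytes */
};

struct fpr_slot {
  unsigned reg;   /* index i of the FPR f<i> */
  unsigned lane;  /* 32-bit lane within that register */
};

/* Assumed layout: row r of C lives in f(2r) and f(2r+1). */
static struct fpr_slot c_element_slot(unsigned row, unsigned col) {
  assert(row < TILE_DIM && col < TILE_DIM);
  struct fpr_slot s = { 2 * row + col / FPR_LANES, col % FPR_LANES };
  return s;
}

/* Total capacity of the scratchpad register file, in bytes. */
static unsigned l1scp_bytes(void) {
  return L1SCP_ENTRIES * L1SCP_ENTRY_BYTES;
}
```

This confirms that 48 entries of 64 bytes is 3 KiB, and that the last element of a 16x16 matrix lands in the last lane of f31.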

That covers the TensorFMA32, TensorFMA16A32, and TensorIMA8A32 instructions. There are also a variety of load instructions to take data from memory, optionally transpose it, and write it to L1Scp (TensorLoad, TensorLoadInterleave16, TensorLoadInterleave8, TensorLoadTranspose32, TensorLoadTranspose16, TensorLoadTranspose8). The elusive TensorLoadB instruction takes data from memory and writes it to TenB (though as TenB doesn't really exist, it instead forwards the loaded data to the next instruction which "reads" TenB). There's also a TensorQuant instruction for performing various in-place transformations on a 32-bit matrix in the FPRs (i.e. a C matrix). To wrap up, there are a pair of store instructions, which take data from L1Scp or from FPRs and write it out to memory.

The final fun surprise of the tensor instructions is the bundle of TensorSend, TensorRecv, TensorBroadcast, and TensorReduce. Of this bundle, TensorSend and TensorRecv are easiest to explain: any pair of minion CPU cores anywhere in the ASIC can choose to communicate with each other, with one executing TensorSend and the other executing TensorRecv (in both cases including the mhartid of the other core as an operand in their instruction). The sender will transmit data from its FPRs, and the receiver will either write that data to its FPRs, or combine it pointwise with data in its FPRs. TensorBroadcast builds on this, but has a fixed data movement pattern rather than being arbitrary pairwise: if cores 0 through 2^N-1 each execute N TensorBroadcast instructions, then data from core 0 is sent to all cores 1 through 2^N-1. TensorReduce is the opposite: if cores 0 through 2^N-1 each execute N TensorReduce instructions, data from all cores 1 through 2^N-1 is sent to core 0, in the shape of a binary reduction tree (with + or min or max applied at each tree node).
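
The N-step structure of these collectives is perhaps clearest as a host-side simulation, with one array element standing in for each core's data. The pairing used here (in step k, core i receives from core i + 2^k) is the textbook binary tree and is my assumption; the hardware's actual pairing pattern may differ, and only the + reduction is shown:

```c
#include <assert.h>
#include <stdint.h>

/* Reduce 2^n_log2 per-core values into core 0 in n_log2 steps. */
static int32_t model_tensor_reduce_add(int32_t* values, unsigned n_log2) {
  assert(n_log2 < 16);
  unsigned cores = 1u << n_log2;
  for (unsigned k = 0; k < n_log2; ++k) {
    unsigned stride = 1u << k;
    for (unsigned i = 0; i < cores; i += 2 * stride) {
      values[i] += values[i + stride];  /* core i combines data sent by core i + stride */
    }
  }
  return values[0];
}

/* Broadcast core 0's value to all 2^n_log2 cores in n_log2 steps,
   walking the same tree in the opposite direction. */
static void model_tensor_broadcast(int32_t* values, unsigned n_log2) {
  assert(n_log2 < 16);
  unsigned cores = 1u << n_log2;
  for (unsigned k = n_log2; k-- > 0;) {
    unsigned stride = 1u << k;
    for (unsigned i = 0; i < cores; i += 2 * stride) {
      values[i + stride] = values[i];  /* core i sends to core i + stride */
    }
  }
}
```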

None of the previously-described tensor instructions exist as RISC-V instructions. Instead, they are "executed" by writing carefully crafted 64-bit values to particular CSRs using csrrw instructions. In some cases the 64 bits written to the CSR aren't quite enough, so an additional 64 bits are taken from the x31 GPR - in such cases the instructions are effectively 128 bits wide. These instructions are then presumably queued up inside the Tensor μop Sequencer, which converts them to μops and sends them out to the Load / Store Unit and the SIMD Execution Lanes. As a slight quirk, only thread 0 of each minion can write to these CSRs and therefore execute these instructions. The only tensor instruction which thread 1 of each minion can execute is TensorLoadL2Scp, which requires that the issuing core is in a tile whose L2 is at least partially configured as scratchpad memory, and copies data from an arbitrary location in memory to said scratchpad (making it effectively a prefetch instruction).

Tensor instructions are enqueued in program order, and order is maintained through to μop emission, but these μops act like a 3rd hardware thread, and thus can arbitrarily interleave with subsequent instructions from the original thread. Explicit wait instructions are required to wait for the completion of tensor instructions. The Tensor μop Sequencer also lacks certain hazard tracking, so software needs to insert additional wait instructions between some pairs of conflicting tensor instructions. It isn't pretty, but such is the reality of squeezing out performance in low-power designs. Thankfully the documentation spells out all the cases in which waits are required.

Overall, the minion cores look like nice little CPU cores - hopefully I'll be able to get my hands on some to play with soon (thankfully AINekko understand the importance of getting cards out for developers to play around with, but hardware takes time). This will allow me to explore my unanswered questions, such as:

What clock speed are the minions running at? I think Esperanto's original goal was 1 GHz, but reality might be more like 600 MHz.
Are "local" versus "global" atomics merely a performance optimisation, or are the "local" atomics actually non-coherent in some way?
How good are these cores at hiding latency? Contemporary GPUs can round-robin between 16 warps in each warp scheduler, whereas minions can only round-robin between 2 threads.
Can we run Linux on a 1056C / 2112T machine?

