ET's Minions
Previously, we saw that ET-SoC-1 lives on, and contains lots of Minions tiles:
Each Minions tile contains a NoC mesh stop, 4 MiB of SRAM which can act as L2 cache or L3 cache or scratchpad, and then four "neighborhoods" of CPU cores:
The 4 MiB is built from four 1 MiB blocks, and each block can be individually configured as L2 or L3 or scratchpad. If configured as scratchpad, the 1 MiB is regular memory in the address space, which any CPU within the ASIC can access (as could the host, if the device firmware included it as part of the exposed address space). If configured as L2, the 1 MiB is a hardware-managed cache, used by the local tile to cache accesses to the on-device DRAM. If configured as L3, the 1 MiB is a hardware-managed cache used by all tiles to cache accesses to any address.
Delving deeper, each neighborhood contains 32 KiB of L1 instruction cache, one set of PMU counters, and eight minion cores:
The instruction cache can deliver two 64-byte lines to cores every cycle, so with eight cores sharing it, each core gets a fresh line at most once every four cycles, and the cache is relying on each core executing at least four instructions from each line to avoid stalls. As each RISCV instruction is either 2 or 4 bytes, and a common rule-of-thumb is one branch every five instructions, this all seems reasonable. Executable code for minions always comes from on-device DRAM, which slightly simplifies this cache. There's a single set of PMU (performance monitoring unit) counters per neighborhood, which is a slight divergence from the RISCV specification of per-core performance counters, but a shared PMU is better than no PMU, so I can live with it.
At long last, we now reach an individual minion core:
Starting on the left hand side, we have a fairly conventional setup for a RISCV core with two hardware threads sharing one set of execution resources. The core is described as single-issue in-order, though I'm assuming that "in-order" merely means that instructions from each thread start in program order and retire in program order, but can reorder during execution (in particular so that loads are effectively asynchronous, blocking at the first instruction which consumes the load result, rather than blocking at the load instruction itself). Speaking of loads, one pleasant surprise is the presence of an MMU (and associated privilege modes) for converting virtual addresses to physical. My initial reaction to the presence of an MMU was that it was overkill (cf. Tenstorrent baby RISCV cores, which lack one), but after contemplating it for a bit, I'm really glad the hardware designers spent the transistors on it. The only notable limitation is that both hardware threads share a single satp CSR, meaning that both threads see the same virtual address space — or in OS terminology, they need to come from the same process rather than separate processes.

Things get slightly more exotic on the right hand side of the diagram, in particular with the Tensor μop Sequencer, but we can initially ignore that and focus on the RISCV side of things. If we do that, then the instruction set is 64-bit RISCV (RV64I), with various extensions:
- Standard M Extension for Integer Multiplication and Division
- Standard F Extension for Single-Precision Floating-Point, with a few tweaks:
  - Denormals are treated as sign-preserved zero on input
  - Denormals are flushed to sign-preserved zero on output (before rounding)
  - Division and square-root are not implemented in hardware
  - Floating-point registers (FPRs) are 256 bits wide (rather than just 32 bits wide), with F instructions ignoring the high 224 bits on input and writing zero to the high 224 bits on output
- Standard C Extension for Compressed Instructions
- Standard Zicsr Extension for Control and Status Register (CSR) Instructions
- Custom atomic memory operations extension, similar to the standard Zaamo extension, but with various tweaks:
  - All atomic instructions come in both "local" and "global" variants, where "local" conceptually executes at the L2 data cache, and "global" conceptually executes at L3 and/or the memory controller
  - Most atomic instructions come in both scalar and vector variants, where "scalar" operates on 32 bits or 64 bits, and "vector" operates on eight 32-bit lanes sequentially
  - Compare-and-swap instructions
- Custom SIMD extension, much closer to AVX10/256 than it is to the standard RISCV vector extension (see the C sketch after this list):
  - Operates on the same floating-point registers as the F extension, which have been widened to 256 bits
  - Most instructions operate lane-wise on eight 32-bit lanes
  - Adds eight mask registers (m0 ... m7), each 8 bits wide (one bit per lane)
  - Most F instructions have a SIMD equivalent operating lanewise on eight 32-bit lanes
  - Most RV32I instructions have a SIMD equivalent operating lanewise on eight 32-bit lanes, as does integer multiplication (but division doesn't)
  - Gather and scatter instructions
  - Conditional move instructions
  - Unary fp32 functions operating lanewise: exp2, log2, reciprocal
  - Low-precision dot product support, albeit only exposed to the Tensor μop Sequencer and not exposed as RISCV instructions
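To make the shape of this SIMD extension concrete, here's a minimal C sketch of the lane-wise masking behaviour. The type and function names are mine rather than Esperanto mnemonics, and whether masked-off lanes merge or zero is a guess on my part:

```c
#include <stdint.h>

// Illustration only: a 256-bit FPR viewed as eight fp32 lanes, plus an
// 8-bit mask register with one bit per lane.
typedef struct { float lane[8]; } vec256;
typedef uint8_t mask8;

// Lane-wise add under a mask: lanes whose mask bit is clear keep their
// previous destination value (merge behaviour assumed, not documented here).
vec256 simd_fadd_masked(vec256 dst, vec256 a, vec256 b, mask8 m) {
    for (int i = 0; i < 8; i++) {
        if (m & (1u << i)) {
            dst.lane[i] = a.lane[i] + b.lane[i];
        }
    }
    return dst;
}
```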
At the bottom of the diagram is 4 KiB of L1, which I've drawn after the MMU for the sake of diagram simplicity, but I assume it is virtually-indexed, physically-tagged and therefore operating in parallel with the MMU. This 4 KiB has three possible configuration modes:
| Mode | Thread 0 | Thread 1 | Tensor Coprocessor |
|---|---|---|---|
| Shared | 4 KiB data cache, shared with thread 1 | 4 KiB data cache, shared with thread 0 | Mostly disabled |
| Split | 2 KiB data cache | 2 KiB data cache | Mostly disabled |
| Scratchpad | ½ KiB data cache | ½ KiB data cache | 3 KiB register file |
The most exotic part of each minion core is the Tensor μop Sequencer. If you're just after a bunch of RISCV CPU cores with SIMD extensions, you can ignore it, but if you're optimising the performance of matrix multiplication then you'll eventually need to look at it. It is used for computing C = A @ B or C += A @ B, where C is a matrix of 32-bit elements between 1x1 and 16x16 in size. The possible data types are:
| C | += | A | @ | B | Relative throughput |
|---|---|---|---|---|---|
| fp32 | += | fp32 | @ | fp32 | 1x |
| fp32 | += | fp16 | @ | fp16 | 2x |
| (u)int32 | += | (u)int8 | @ | (u)int8 | 8x |
Notably absent are bf16 and fp8 data types, which is possibly due to the age of the design. When A and B are both fp32, the Tensor μop Sequencer makes use of the same FMA circuitry as used by the fp32 SIMD instructions, and so has throughput of eight scalar FMA operations per cycle (each one adding a single scalar product to one element of C). When A and B are both fp16, a variant of the circuitry is used which performs two fp16 multiplications in each lane followed by a non-standard three-way floating-point addition in each lane (thus adding two scalar products to each of eight elements of C). When A and B are both 8-bit integers, there are four scalar products and a five-way addition per lane per cycle, but this time the hardware can compute 16 lanes per cycle.
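As a plain-C reference for what these variants compute (the semantics, not the hardware schedule), here is the fp32 case; M, N and K are just my notation for the up-to-16x16 shapes:

```c
// Reference semantics of C += A @ B for the fp32 case: C is up to 16x16,
// and the hardware retires eight of these scalar FMAs per cycle.
void tensor_fma32_reference(int M, int N, int K,
                            float C[16][16],
                            const float A[16][16],
                            const float B[16][16]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < K; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```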
All of these matrix multiplications require storage for A and B and C. We've already seen the storage for A: it's the 3 KiB register file present when L1 is configured in scratchpad mode. The documentation refers to this register file as L1Scp[0] through L1Scp[47], where each L1Scp[i] holds between 4 and 64 bytes of matrix data. We'll come back to B.

Moving on to C, when C is fp32, it is stored in the floating-point registers (i.e. f0 ... f31) of thread 0: a pair of floating-point registers can hold a 16-element matrix row, so the 32 FPRs can collectively hold a 16x16 matrix. Things are slightly more complex when C is (u)int32, possibly because there's not enough bandwidth from the FPRs for 16 lanes per cycle. This motivates the TenC registers, which can collectively hold a 16x16 (u)int32 matrix, and are used exclusively as a temporary accumulator for integer matrix multiplications: the actual instruction variants for these end up looking like TenC = A @ B or TenC += A @ B or FPRs = TenC + A @ B.

Coming back to B, it can either come from L1Scp (like A), or from the elusive TenB register file. I say elusive because TenB exists for the purposes of exposition, but doesn't actually exist as permanent architectural state. If instructions can indeed reorder during execution (as is very desirable to hide load latency), then hardware will have some kind of queue structure for holding the results of instructions, where a queue entry is allocated early in the lifetime of an instruction, is populated when its execution completes, and is popped if it has completed and is at the front of the queue (at which point the instruction's result is committed to the register file). I posit that this queue is TenB, except that upon being popped, the results are sent to the multipliers rather than committed to a register file. This would be consistent with all the documented properties of TenB, and would be a cute way of reusing existing hardware resources.
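Backing up for a moment to how a 16x16 fp32 C matrix fits into thread 0's FPRs, here's the arithmetic expressed as a hypothetical (row, col) to (register, lane) mapping; the actual register assignment could differ:

```c
// Hypothetical mapping from element (row, col) of a 16x16 fp32 C matrix to
// (FPR index, lane index), assuming a simple row-major packing into f0..f31.
// Each 256-bit FPR holds 8 fp32 lanes, so one 16-element row needs two FPRs
// and sixteen rows use all 32.
typedef struct { int fpr; int lane; } fpr_slot;

fpr_slot c_element_slot(int row, int col) {
    fpr_slot s;
    s.fpr  = row * 2 + col / 8;  // which of f0 ... f31
    s.lane = col % 8;            // which fp32 lane within that register
    return s;
}
```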
That covers the TensorFMA32, TensorFMA16A32, and TensorIMA8A32 instructions. There are also a variety of load instructions to take data from memory, optionally transpose it, and write it to L1Scp (TensorLoad, TensorLoadInterleave16, TensorLoadInterleave8, TensorLoadTranspose32, TensorLoadTranspose16, TensorLoadTranspose8). The elusive TensorLoadB instruction takes data from memory and writes it to TenB (though as TenB doesn't really exist, it instead forwards the loaded data to the next instruction which "reads" TenB). There's also a TensorQuant instruction for performing various in-place transformations on a 32-bit matrix in the FPRs (i.e. a C matrix). To wrap up, there are a pair of store instructions, which take data from L1Scp or from FPRs and write it out to memory.
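As a rough sketch of what the transposing loads do, here's the idea in free-form C; the real instructions support a much more constrained set of tile shapes, strides, and L1Scp addresses than this suggests:

```c
#include <stdint.h>

// Rough idea of a transposing load: read a tile of 32-bit elements from
// memory (with a row stride, in elements) and write it transposed into
// L1Scp. Tile shape and L1Scp addressing are simplified guesses.
void tensor_load_transpose32_sketch(uint32_t *l1scp, const uint32_t *mem,
                                    int rows, int cols, int mem_stride) {
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            l1scp[c * rows + r] = mem[r * mem_stride + c];
}
```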
The final fun surprise of the tensor instructions is the bundle of TensorSend, TensorRecv, TensorBroadcast, and TensorReduce. Of this bundle, TensorSend and TensorRecv are easiest to explain: any pair of minion CPU cores anywhere in the ASIC can choose to communicate with each other, with one executing TensorSend and the other executing TensorRecv (in both cases including the mhartid of the other core as an operand in their instruction). The sender will transmit data from its FPRs, and the receiver will either write that data to its FPRs, or combine it pointwise with data in its FPRs. TensorBroadcast builds on this, but has a fixed data movement pattern rather than being arbitrary pairwise: if cores 0 through 2^N-1 each execute N TensorBroadcast instructions, then data from core 0 is sent to all cores 1 through 2^N-1. TensorReduce is the opposite: if cores 0 through 2^N-1 each execute N TensorReduce instructions, data from all cores 1 through 2^N-1 is sent to core 0, in the shape of a binary reduction tree (with + or min or max applied at each tree node).
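To make the tree shape concrete, here's a little C sketch of the pairing I'd expect TensorReduce to use over 2^N cores; the exact pairing per step, and what the remaining instructions on already-finished cores do, are my guesses rather than documented behaviour:

```c
#include <stdio.h>

// Print the binary-reduction-tree pairing over 2^N cores: at each step,
// half of the still-active cores send their partial result to a partner,
// which combines it (with + or min or max).
void print_reduce_tree(int N) {
    int cores = 1 << N;
    for (int step = 0; step < N; step++) {
        int bit = 1 << step;
        for (int core = 0; core < cores; core++) {
            // A core whose lowest set bit matches this step sends and is done.
            if ((core & ((bit << 1) - 1)) == bit) {
                printf("step %d: core %d sends partial result to core %d\n",
                       step, core, core - bit);
            }
        }
    }
    // After N steps, core 0 holds the reduction of all 2^N partial results.
}
```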
None of the previously-described tensor instructions exist as RISCV instructions. Instead, they are "executed" by writing carefully crafted 64-bit values to particular CSRs using csrrw instructions. In some cases the 64 bits written to the CSR aren't quite enough, so an additional 64 bits are taken from the x31 GPR - in such cases the instructions are effectively 128 bits wide. These instructions are then presumably queued up inside the Tensor μop Sequencer, which converts them to μops and sends them out to the Load / Store Unit and the SIMD Execution Lanes. As a slight quirk, only thread 0 of each minion can write to these CSRs and therefore execute these instructions. The only tensor instruction which thread 1 of each minion can execute is TensorLoadL2Scp, which requires that the issuing core is in a tile whose L2 is at least partially configured as scratchpad memory, and copies data from an arbitrary location in memory to said scratchpad (making it effectively a prefetch instruction).
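As an illustration of the issue mechanism (and only that; the CSR number and payload layout below are placeholders, not the documented encoding), issuing a tensor instruction from thread 0 looks something like:

```c
#include <stdint.h>

// Illustration only: a tensor "instruction" is a carefully crafted 64-bit
// value written to a CSR with csrrw, optionally with a further 64 bits
// taken from x31. The CSR number 0x7C0 and the payload layout here are
// placeholders, and only thread 0 of a minion can do this.
static inline void issue_tensor_op(uint64_t encoded_op, uint64_t extra) {
    register uint64_t extra_in_x31 asm("x31") = extra;  // x31 is t6 in ABI naming
    asm volatile("csrrw x0, 0x7C0, %0"
                 :
                 : "r"(encoded_op), "r"(extra_in_x31)
                 : "memory");
}
```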
Tensor instructions are enqueued in program order, and order is maintained through to μop emission, but these μops act like a 3rd hardware thread, and thus can arbitrarily interleave with subsequent instructions from the original thread. Explicit wait instructions are required to wait for the completion of tensor instructions. The Tensor μop Sequencer also lacks certain hazard tracking, so software needs to insert additional wait instructions between some pairs of conflicting tensor instructions. It isn't pretty, but such is the reality of squeezing out performance in low-power designs. Thankfully the documentation spells out all the cases in which waits are required.
Overall, the minion cores look like nice little CPU cores - hopefully I'll be able to get my hands on some to play with soon (thankfully AINekko understand the importance of getting cards out for developers to play around with, but hardware takes time). This will allow me to explore my unanswered questions, such as:
- What clock speed are the minions running at? I think Esperanto's original goal was 1 GHz, but reality might be more like 600 MHz.
- Are "local" versus "global" atomics merely a performance optimisation, or are the "local" atomics actually non-coherent in some way?
- How good are these cores at hiding latency? Contemporary GPUs can round-robin between 16 warps in each warp scheduler, whereas minions can only round-robin between 2 threads.
- Can we run Linux on a 1056C / 2112T machine?

