ET's Minions

Previously, we saw that ET-SoC-1 lives on, and contains lots of Minions tiles:

Each Minions tile contains a NoC mesh stop, 4 MiB of SRAM which can act as L2 cache or L3 cache or scratchpad, and then four "neighborhoods" of CPU cores:

The 4 MiB is built from four 1 MiB blocks, and each block can be individually configured as L2 or L3 or scratchpad. If configured as scratchpad, the 1 MiB is regular memory in the address space, which any CPU within the ASIC can access (as could the host, if the device firmware included it as part of the exposed address space). If configured as L2, the 1 MiB is a hardware-managed cache, used by the local tile to cache accesses to the on-device DRAM. If configured as L3, the 1 MiB is a hardware-managed cache used by all tiles to cache accesses to any address.

Delving deeper, each neighborhood contains 32 KiB of L1 instruction cache, one set of PMU counters, and eight minion cores:

The instruction cache can deliver two 64-byte lines to cores every cycle, so it relies on each core executing at least four instructions from each line to avoid stalls. As each RISCV instruction is either 2 or 4 bytes, and a common rule-of-thumb is one branch every five instructions, this all seems reasonable. Executable code for minions always comes from on-device DRAM, which slightly simplifies this cache. There's a single set of PMU (performance monitoring unit) counters per neighborhood, which is a slight divergence from the RISCV specification of per-core performance counters, but a shared PMU is better than no PMU, so I can live with it.
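As a quick back-of-the-envelope check of that claim (the figures of eight cores per neighborhood and one instruction fetched per core per cycle come from the surrounding description; the arithmetic is mine):

CORES = 8             # minion cores per neighborhood
LINES_PER_CYCLE = 2   # 64-byte lines delivered by the L1 instruction cache
LINE_BYTES = 64

# On average, each core receives a fresh line only once every this many cycles:
cycles_per_line_per_core = CORES / LINES_PER_CYCLE    # 4.0

# At one instruction per cycle, a core therefore needs to execute at least four
# instructions from each line to avoid starving. A 64-byte line holds 16 to 32
# instructions (4-byte or 2-byte encodings), so a branch every five instructions
# still leaves plenty of headroom.
instrs_per_line = (LINE_BYTES // 4, LINE_BYTES // 2)  # (16, 32)
print(cycles_per_line_per_core, instrs_per_line)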

At long last, we now reach an individual minion core:

Starting on the left hand side, we have a fairly conventional setup for a RISCV core with two hardware threads sharing one set of execution resources. The core is described as single-issue in-order, though I'm assuming that "in-order" merely means that instructions from each thread start in program order and retire in program order, but can reorder during execution (in particular so that loads are effectively asynchronous, blocking at the first instruction which consumes the load result, rather than blocking at the load instruction itself). Speaking of loads, one pleasant surprise is the presence of an MMU (and associated privilege modes) for converting virtual addresses to physical. My initial reaction to the presence of an MMU was that it was overkill (c.f. Tenstorrent baby RISCV cores, which lack one), but after contemplating it for a bit, I'm really glad the hardware designers spent the transistors on it. The only notable limitation is that both hardware threads share a single satp CSR, meaning that both threads see the same virtual address space — or in OS terminology, they need to come from the same process rather than separate processes. Things get slightly more exotic on the right hand side of the diagram, in particular with the Tensor μop Sequencer, but we can initially ignore that and focus on the RISCV side of things. If we do that, then the instruction set is 64-bit RISCV (RV64I), with various extensions:

At the bottom of the diagram is 4 KiB of L1, which I've drawn after the MMU for the sake of diagram simplicity, but I assume is virtually-indexed physically-tagged and therefore operating in parallel with the MMU. This 4 KiB has three possible configuration modes:

Mode        Thread 0                  Thread 1                  Tensor Coprocessor
Shared      4 KiB data cache, shared                            Mostly disabled
Split       2 KiB data cache          2 KiB data cache          Mostly disabled
Scratchpad  ½ KiB data cache          ½ KiB data cache          3 KiB register file

The most exotic part of each minion core is the Tensor μop Sequencer. If you're just after a bunch of RISCV CPU cores with SIMD extensions, you can ignore it, but if you're optimising the performance of matrix multiplication then you'll eventually need to look at it. It is used for computing C = A @ B or C += A @ B, where C is a matrix of 32-bit elements between 1x1 and 16x16 in size. The possible data types are:

C += A @ B                     Relative throughput
fp32 += fp32 @ fp32            1x
fp32 += fp16 @ fp16            2x
(u)int32 += (u)int8 @ (u)int8  8x

Notably absent are bf16 and fp8 data types, which is possibly due to the age of the design. When A and B are both fp32, the Tensor μop Sequencer makes use of the same FMA circuitry as used by the fp32 SIMD instructions, and so has throughput of eight scalar FMA operations per cycle (each one adding a single scalar product to one element of C). When A and B are both fp16, a variant of the circuitry is used which performs two fp16 multiplications in each lane followed by a non-standard three-way floating-point addition in each lane (thus adding two scalar products to each of eight elements of C). When A and B are both 8-bit integers, there are four scalar products and a five-way addition per lane per cycle, but this time the hardware can compute 16 lanes per cycle.
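As a reference point for what is being computed, here's a small NumPy sketch of the C += A @ B operation, together with rough cycle counts for a full 16x16x16 multiply-accumulate. The per-cycle rates (8, 16, and 64 scalar products for fp32, fp16, and int8 respectively) are my reading of the description above, and the sketch ignores rounding differences arising from the three-way and five-way adds:

import numpy as np

def tensor_fma(C, A, B):
  # Reference semantics of C += A @ B: C is 16x16 fp32 or (u)int32, while A
  # and B are fp32, fp16, or 8-bit integer matrices of compatible shape.
  C += A.astype(C.dtype) @ B.astype(C.dtype)
  return C

C = np.zeros((16, 16), dtype=np.float32)
A = np.ones((16, 16), dtype=np.float16)
B = np.ones((16, 16), dtype=np.float16)
print(tensor_fma(C, A, B)[0][0])   # 16.0

# Rough cycle counts for a 16x16 C with a 16-deep accumulation:
total_products = 16 * 16 * 16      # 4096 scalar multiply-adds
for name, products_per_cycle in [("fp32", 8), ("fp16", 16), ("int8", 64)]:
  print(name, total_products // products_per_cycle, "cycles")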

All of these matrix multiplications require storage for A and B and C. We've already seen the storage for A: it's the 3 KiB register file present when L1 is configured in scratchpad mode. The documentation refers to this register file as L1Scp[0] through L1Scp[47], where each L1Scp[i] holds between 4 and 64 bytes of matrix data. We'll come back to B. Moving on to C, when C is fp32, it is stored in the floating-point registers (i.e. f0 ... f31) of thread 0: a pair of floating-point registers can hold a 16-element matrix row, so the 32 FPRs can collectively hold a 16x16 matrix. Things are slightly more complex when C is (u)int32, possibly because there's not enough bandwidth from the FPRs for 16 lanes per cycle. This motivates the TenC registers, which can collectively hold a 16x16 (u)int32 matrix, and are used exclusively as a temporary accumulator for integer matrix multiplications: the actual instruction variants for these end up looking like TenC = A @ B or TenC += A @ B or FPRs = TenC + A @ B.

Coming back to B, it can either come from L1Scp (like A), or from the elusive TenB register file. I say elusive because TenB exists for the purposes of exposition, but doesn't actually exist as permanent architectural state. If instructions can indeed reorder during execution (as is very desirable to hide load latency), then hardware will have some kind of queue structure for holding the results of instructions, where a queue entry is allocated early in the lifetime of an instruction, is populated when its execution completes, and is popped if it has completed and is at the front of the queue (at which point the instruction's result is committed to the register file). I posit that this queue is TenB, except that upon being popped, the results are sent to the multipliers rather than committed to a register file. This would be consistent with all the documented properties of TenB, and would be a cute way of reusing existing hardware resources.

That covers the TensorFMA32, TensorFMA16A32, and TensorIMA8A32 instructions. There are also a variety of load instructions to take data from memory, optionally transpose it, and write it to L1Scp (TensorLoad, TensorLoadInterleave16, TensorLoadInterleave8, TensorLoadTranspose32, TensorLoadTranspose16, TensorLoadTranspose8). The elusive TensorLoadB instruction takes data from memory and writes it to TenB (though as TenB doesn't really exist, it instead forwards the loaded data to the next instruction which "reads" TenB). There's also a TensorQuant instruction for performing various in-place transformations on a 32-bit matrix in the FPRs (i.e. a C matrix). To wrap up, there are a pair of store instructions, which take data from L1Scp or from FPRs and write it out to memory.

The final fun surprise of the tensor instructions is the bundle of TensorSend, TensorRecv, TensorBroadcast, and TensorReduce. Of this bundle, TensorSend and TensorRecv are easiest to explain: any pair of minion CPU cores anywhere in the ASIC can choose to communicate with each other, with one executing TensorSend and the other executing TensorRecv (in both cases including the mhartid of the other core as an operand in their instruction). The sender will transmit data from its FPRs, and the receiver will either write that data to its FPRs, or combine it pointwise with data in its FPRs. TensorBroadcast builds on this, but has a fixed data movement pattern rather than being arbitrary pairwise: if cores 0 through 2^N-1 each execute N TensorBroadcast instructions, then data from core 0 is sent to all cores 1 through 2^N-1. TensorReduce is the opposite: if cores 0 through 2^N-1 each execute N TensorReduce instructions, data from all cores 1 through 2^N-1 is sent to core 0, in the shape of a binary reduction tree (with + or min or max applied at each tree node).
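To illustrate the TensorReduce pattern, here's a Python sketch of a binary reduction tree over 2^N cores; the exact pairing of senders and receivers is my guess at one plausible schedule, not a statement of what the hardware does (TensorBroadcast would be the same tree traversed in the opposite direction):

def tree_reduce(values, op):
  # values[i] is the data held by core i; len(values) must be a power of two.
  values = list(values)
  n = len(values)
  step = 1
  while step < n:            # N rounds for 2**N cores
    for receiver in range(0, n, 2 * step):
      sender = receiver + step
      # sender executes TensorSend, receiver executes TensorRecv + combine
      values[receiver] = op(values[receiver], values[sender])
    step *= 2
  return values[0]           # core 0 ends up holding the reduction

print(tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max))                 # 9
print(tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b))  # 31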

None of the previously-described tensor instructions exist as RISCV instructions. Instead, they are "executed" by writing carefully crafted 64-bit values to particular CSRs using csrrw instructions. In some cases the 64 bits written to the CSR aren't quite enough, so an additional 64 bits are taken from the x31 GPR - in such cases the instructions are effectively 128 bits wide. These instructions are then presumably queued up inside the Tensor μop Sequencer, which converts them to μops and sends them out to the Load / Store Unit and the SIMD Execution Lanes. As a slight quirk, only thread 0 of each minion can write to these CSRs and therefore execute these instructions. The only tensor instruction which thread 1 of each minion can execute is TensorLoadL2Scp, which requires that the issuing core is in a tile whose L2 is at least partially configured as scratchpad memory, and copies data from an arbitrary location in memory to said scratchpad (making it effectively a prefetch instruction).

Tensor instructions are enqueued in program order, and order is maintained through to μop emission, but these μops act like a 3rd hardware thread, and thus can arbitrarily interleave with subsequent instructions from the original thread. Explicit wait instructions are required to wait for the completion of tensor instructions. The Tensor μop Sequencer also lacks certain hazard tracking, so software needs to insert additional wait instructions between some pairs of conflicting tensor instructions. It isn't pretty, but such is the reality of squeezing out performance in low-power designs. Thankfully the documentation spells out all the cases in which waits are required.

Overall, the minion cores look like nice little CPU cores - hopefully I'll be able to get my hands on some to play with soon (thankfully AINekko understand the importance of getting cards out for developers to play around with, but hardware takes time). This will allow me to explore my unanswered questions, such as:

Esperanto lives on at AINekko

Esperanto Technologies was founded back in 2014, completed the RTL for their Maxion CPUs in September 2018, were doing bring-up and characterization of their ET-SoC-1 in H2 2021, and were happily outlining their next-gen ET-SoC-2x and ET-SoC-3x in November 2024. Unfortunately, things did not go to plan: they closed down in July 2025, retaining just a few people to facilitate selling or licensing their accumulated IP. The official line from the company is that competitors poached their staff with compensation packages up to 4x higher than what Esperanto could offer, and that slowly bleeding staff led to eventual death. Some voices in the media instead posit that the company failed on publicity and community engagement. Amongst other potential problems, their website (still) has an "I want to evaluate Esperanto systems" form, rather than an online store with prices and "Add to cart" / "Go to checkout" buttons. Being an AI chip startup is hard: you've got to get lots of things right, and just one of them being wrong is enough to condemn you to failure.

Fast-forward to the present day - October 2025 - and AINekko drops out of stealth, with two interesting repos on GitHub: et-platform and et-man. I don't know for sure, but it looks like AINekko purchased the ET-SoC-1 IP, and is proceeding to open it all up. et-platform contains a simulator, a kernel driver, and all sorts of firmware / bootcode / management software, all available under the Apache License v2. Meanwhile, et-man currently contains "just" a comprehensive programmer's reference manual, but it looks like more documentation should land there in due course. There's also a claim that the RTL will be open-sourced eventually, though I imagine that this will only be the RTL written by Esperanto, and not any 3rd-party IP they licensed to include in their chip (e.g. PCIe controllers from Synopsys). The et-platform README helpfully states:

AINekko's ET is an open-source manycore platform for parallel computing acceleration. It is built on the legacy of Esperanto Technologies ET-SoC-1 chip.

It lives on, but what exactly is this chip? There's a photo of AINekko's CTO holding one of the PCIe boards on X:

An image from the Esperanto product page shows what's under that heatsink:

Presumably it's a PCIe Gen4 x8 edge connector on the bottom, 32 GiB of LPDDR4x (the four ~square black chips), and an ET-SoC-1 in the middle. The ET-SoC-1 ASIC is itself a grid of tiles with a NoC mesh stop in each tile:

There are four different types of tile:

Kind     Count    Contents (per tile)
DRAM     8        Bridge to 4 GiB LPDDR4x (2 16-bit channels per tile)
PCIe     1        Bridge to host over PCIe
Maxions  1        4x Maxion CPU core, 1x Minion-lite CPU core, 5 MiB SRAM
Minions  34 (†)   32x Minion CPU core (2 threads each), 4 MiB SRAM

(†) Documentation suggests that 1 of the 34 is lost for yield purposes, leaving 33 usable.

This grid is somewhat reminiscent of a Tenstorrent Wormhole or Blackhole, though with a major philosophical difference: the Tenstorrent approach is to have a separate address space per tile with software explicitly initiating asynchronous NoC transactions to move data around, whereas ET-SoC-1 adopts a more GPU-like approach with a single address space spanning the entire ASIC and a hierarchy of hardware L1 / L2 / L3 caches to mitigate latency. Another big difference at the NoC level is that Wormhole and Blackhole choose to go with two unidirectional toruses, whereas ET-SoC-1 has a single bidirectional grid.

The DRAM and PCIe tiles are relatively mundane, containing LPDDR4x controllers and PCIe controllers respectively. Neither has any programmable CPU cores, and they are mostly transparent to software: parts of the address space do physically live in these tiles, but software doesn't need to worry about the physical location of memory, as the NoC hides the details.

Next up, the Maxions tile is roughly similar to a Tenstorrent ARC tile combined with a SiFive x280 tile as found on Blackhole: each Maxion CPU core is a 64-bit RISC-V single-threaded superscalar out-of-order CPU core capable of running Linux, just like a SiFive x280, and the Minion-lite serves the same role as an ARC core with regards to board and system management. The 5 MiB SRAM in the tile is split as 1 MiB scratchpad for the Minion-lite and 4 MiB L2 for the Maxions (though the L2 can instead be reconfigured as scratchpad).

The majority of the ET-SoC-1 ASIC is made up of Minion tiles, similar to how the majority of a Tenstorrent ASIC is made up of Tensix tiles. Both contain low-power in-order RISCV cores, but the differences quickly become apparent. There is a lot to be said about these minions, but it'll have to wait until next time.

RISC-V Conditional Moves

I'm a big fan of aarch64's csel family of instructions. A single instruction can evaluate rd = cond ? rs1 : f(rs2), where cond is any condition code and f is any of f0(x) = x or f1(x) = x+1 or f2(x) = ~x or f3(x) = -x. Want to convert a condition to a boolean? Use f1 with rs1 == rs2 == x0. Want to convert a condition to a mask? Use f2 with rs1 == rs2 == x0. Want to compute an absolute value? Use f3 with rs1 == rs2. It is pleasing that the composition of f1 and f2 is f3. I could continue espousing, but hopefully you get the idea.
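To make that concrete, here's a small Python model of the csel family and the three idioms above (64-bit wrap-around is emulated with a mask; the model is mine, but the semantics follow the description):

MASK = (1 << 64) - 1        # model 64-bit registers with wrap-around

def csel_family(cond, rs1, rs2, f):
  return rs1 if cond else f(rs2) & MASK   # rd = cond ? rs1 : f(rs2)

f0 = lambda x: x            # csel
f1 = lambda x: x + 1        # csinc
f2 = lambda x: ~x           # csinv
f3 = lambda x: -x           # csneg
x0 = 0                      # the always-zero register

print(csel_family(False, x0, x0, f1))       # condition -> boolean: 1
print(hex(csel_family(False, x0, x0, f2)))  # condition -> mask: 0xffff...f
rs = -5 & MASK
print(csel_family(rs < (1 << 63), rs, rs, f3))  # absolute value: 5
# The composition of f1 and f2 is f3 (two's complement negation):
assert all((f1(f2(x)) & MASK) == (f3(x) & MASK) for x in (0, 1, 5, MASK))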

RISC-V is the hot new thing, but it lacks a direct equivalent to csel. Some cases of converting conditions to booleans are possible with the slt family of instructions in the base instruction set. Beyond that, a few special cases are implemented by instruction set extensions: Zbb adds min and max instructions which are a particular pattern of compare and select, and Zicond adds czero.eqz and czero.nez which again are particular patterns of compare and select. But the general case? Considered and rejected, as per this direct quote from The RISC-V Instruction Set Manual Volume I Version 20250508:

We considered but did not include conditional moves or predicated instructions, which can effectively replace unpredictable short forward branches. Conditional moves are the simpler of the two, but are difficult to use ...

That quote hints at short forward branches being the recommended alternative. It doesn't quite go as far as to say that out-of-order cores are encouraged to perform macro fusion in the frontend to convert short forward branches back into conditional moves (when possible), but it is commonly taken to mean this, and some SiFive cores implement exactly this fusion.

Continuing to quote from The RISC-V Instruction Set Manual Volume I Version 20250508, the introductory text motivating Zicond also mentions fusion:

Using these [Zicond] instructions, branchless sequences can be implemented (typically in two-instruction sequences) without the need for instruction fusion, special provisions during the decoding of architectural instructions, or other microarchitectural provisions.

One of the shortcomings of RISC-V, compared to competing instruction set architectures, is the absence of conditional operations to support branchless code-generation: this includes conditional arithmetic, conditional select and conditional move operations. The design principles of RISC-V (e.g. the absence of an instruction-format that supports 3 source registers and an output register) make it unlikely that direct equivalents of the competing instructions will be introduced.

The design principles mentioned in passing mean that czero.eqz has slightly odd semantics. Assuming rd ≠ rs2, the intent is that these two instruction sequences compute the same thing:

Base instruction set            With Zicond
  mv rd, x0                       czero.eqz rd, rs1, rs2
  beq rs2, x0, skip_next
  mv rd, rs1
skip_next:

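For the value-level semantics (setting aside the memory-model point made next), a quick Python model of both columns, with registers as plain integers:

def branchy(rs1, rs2):
  rd = 0                    # mv rd, x0
  if rs2 != 0:              # beq rs2, x0, skip_next
    rd = rs1                # mv rd, rs1
  return rd                 # skip_next:

def with_zicond(rs1, rs2):
  return 0 if rs2 == 0 else rs1   # czero.eqz rd, rs1, rs2

assert all(branchy(a, b) == with_zicond(a, b)
           for a in (0, 1, 42) for b in (0, 7, 42))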
The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.

To see why, consider this example, again from The RISC-V Instruction Set Manual Volume I Version 20250508:

Control dependencies behave differently from address and data dependencies in the sense that a control dependency always extends to all instructions following the original target in program order.

  lw x1, 0(x2)
  bne x1, x0, next
next:
  sw x3, 0(x4)

Even though both branch outcomes have the same target, there is still a control dependency from the memory operation generated by the first instruction in this snippet to the memory operation generated by the last instruction. This definition of control dependency is subtly stronger than what might be seen in other contexts (e.g., C++), but it conforms with standard definitions of control dependencies in the literature.

The general point highlighted by this example is that every branch (or indirect jump) instruction imposes a syntactic control dependency on every store instruction anywhere after it in program order. If a branch is converted to a conditional move, there is no longer a syntactic control dependency. There can instead be an address or data dependency, but this only applies to stores which use the result of the conditional move, whereas the syntactic control dependency applied to all stores. In other words, not equivalent.

TLDR: If RISC-V cores want to perform fusion of short forward branches into conditional moves (to mitigate the lack of conditional moves in the instruction set), the resultant fused instruction needs to retain some branch-like properties to avoid violating the memory model.

Reworking Lengauer-Tarjan

In compiler circles, Lengauer & Tarjan's 1979 paper "A Fast Algorithm for Finding Dominators in a Flowgraph" is a classic: it describes and proves a viable algorithm for computing the idom (immediate dominator) of every node in a rooted directed graph. I have just two criticisms of the paper:

  1. All the sample code is written in Algol. This was a fine choice in 1979, but tastes have changed over the intervening 46 years, and most modern eyes will see the Algol code as somewhat antiquated.
  2. It starts by presenting an optimised variant of the algorithm, rather than starting with the most basic exposition and then introducing optimisations.

I believe that both problems can be addressed via a little bit of reworking.


The algorithm takes as input a graph G with some root node r. We immediately perform a depth first search on G; let T be the spanning tree discovered by that search, and replace all node labels with the pre-order index from that search. The root node is now 0, one of its successors is 1, and so forth. For arbitrary nodes v and w, this allows us to write things like v < w and v ≤ w and min(v, w), all of which are operating on these indices. We also introduce some notation for talking about these graphs:

Notation    Meaning
v ->G w     There is an edge from v to w in G, and w != r
v -*->T w   There is a path from v to w in T, or v == w

We jump in at what the paper calls Theorem 4, which enables this definition:

sdom(w) = min({     v  |               v ->G w and v ≤ w} union
              {sdom(u) | u -*->T v and v ->G w and v > w and u > w})

This definition is recursive, but only in the case of u > w, so we can compute it for all nodes by first computing it for w = N-1, then for w = N-2, and so forth until we eventually reach w = 1 (we cannot compute it for w = 0, as both sets are empty in that case).
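As a sanity check of that definition, here's a direct and deliberately naive Python transcription of it (quadratic in the worst case). It assumes the nodes have already been relabelled 0 through N-1 in pre-order, with pred(w) giving the predecessors in G and parent(u) the parent in T, matching the pseudocode further down:

def naive_sdom(N, pred, parent):
  sdom = [None] * N
  for w in range(N - 1, 0, -1):   # decreasing order of w
    candidates = []
    for v in pred(w):             # every v with v ->G w
      if v <= w:
        candidates.append(v)      # first set: v itself
      else:
        u = v                     # second set: sdom(u) for u -*->T v, u > w
        while u > w:
          candidates.append(sdom[u])
          u = parent(u)
    sdom[w] = min(candidates)
  return sdom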

Reworking Theorem 3 slightly enables this definition:

idomfrom(w) = argminu{sdom(u) | u -*->T w and u > sdom(w)}

Where argminu{expr | condition} gives the u which minimises expr (over all the possible u which meet condition), or any such u if there are multiple u achieving that minimum. Note that the set is always non-empty, as u = w meets the condition. As u -*->T w implies u ≤ w, we also have idomfrom(w) ≤ w.

Then reworking Theorem 2 slightly enables this definition:

idom(w) = sdom(w) if idomfrom(w) ≥ w else idom(idomfrom(w))

This definition is recursive, but only in the case of idomfrom(w) < w, so we can compute it for all nodes by first computing it for w = 1, then for w = 2, and so forth until eventually we reach w = N-1.

This gives us everything we need to compute idom, via a four step process:

  1. Perform a depth first search on G, to give T (and pre-order integer nodes).
  2. Compute sdom(w) for all w > 0, in decreasing order of w.
  3. Compute idomfrom(w) for all w > 0, in any order.
  4. Compute idom(w) for all w > 0, in increasing order of w.

In both steps 2 and 3, we're interested in either min or argminu over the set {sdom(u) | u -*->T v and u > limit}, for various values of v and limit. There are (at least) three different strategies for computing each of these min / argminu:

  1. From scratch every time: whenever such a computation is required, start with u = v, and then iterate u = parent(u) (where parent gives the parent in T) until u ≤ limit, taking the min / argminu of sdom(u) over all the observed u for which u > limit.
  2. Like strategy 1, but introduce a layer of caching to avoid repeated work. The paper calls this "path compression", which is one way of viewing it, but you reach the exact same place if you instead apply aggressive caching to strategy 1. For this to work, all queries need to be made in decreasing order of limit, as otherwise previously cached results wouldn't be valid. This happens naturally for step 2 (because it has limit = w, and is performed in decreasing order of w), but slightly more work is required to extend it to step 3: the usual approach is to chop up step 3 into lots of small pieces, and perform them interleaved with step 2 at appropriate times.
  3. Like strategy 2, but doing the union-find equivalent of "union by rank/size" rather than the union-find equivalent of "path compression". Note that the min and argminu operations in question aren't quite union-find, but they're sufficiently similar that the ideas can carry over.

These three strategies are in increasing order of implementation complexity, but also decreasing order of worst-case algorithmic complexity. Strategy 3 is best in theory, but the general consensus is that strategy 2 usually beats it in practice. Meanwhile, the Go compiler uses strategy 1 and considers it sufficient.

If using strategy 1, the pseudocode for steps 2 through 4 can be quite concise:

# Step 2
for w in range(N-1, 0, -1):
  sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))

# Step 3
for w in range(1, N):
  idomfrom[w] = strategy1(w, sdom[w])[1]

# Step 4
for w in range(1, N):
  idom[w] = sdom[w]           # Relevant  when idomfrom[w] = w
  idom[w] = idom[idomfrom[w]] # No change when idomfrom[w] = w

def strategy1(v, limit):
  us = [v]
  while parent(us[-1]) > limit: us.append(parent(us[-1]))
  return min((sdom[u], u) for u in us)

Note that pred in step 2 gives the predecessors in G, whereas parent in strategy1 gives the parent in T.
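The pseudocode above takes N, pred, and parent as given by step 1. For completeness, here's a sketch of that step; representing G as a dict mapping each node label to its list of successors is my own choice, and the pseudocode's parent(u) / pred(w) then become lookups into the returned arrays:

def step1(successors, r):
  # Depth first search from r. Returns the number of reachable nodes N, the
  # parent in T of each node (by pre-order index, -1 for the root), and the
  # predecessors in G of each node (by pre-order index).
  index = {}                 # original label -> pre-order index
  order = []                 # pre-order index -> original label
  parent = []
  stack = [(r, -1)]
  while stack:
    x, p = stack.pop()
    if x in index:
      continue               # already visited via another edge
    index[x] = len(order)
    order.append(x)
    parent.append(p)
    for y in reversed(successors[x]):
      if y not in index:
        stack.append((y, index[x]))
  N = len(order)
  pred = [[] for _ in range(N)]
  for x in order:
    for y in successors[x]:
      pred[index[y]].append(index[x])
  return N, parent, pred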

There are two little tricks to avoid some of the function calls in step 3:

  1. If sdom(w) == 0, then u = w is an acceptable result from argminu.
  2. If sdom(w) == parent(w), then u = w will be the only possible argminu.

We can revise step 3 to detect these two cases, and handle them without a call:

# Step 3
for w in range(1, N):
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    idomfrom[w] = strategy1(w, sdom[w])[1]

If we then want to use something better than strategy1, step 3 needs to be chopped up and interleaved with step 2. One way of doing this is to introduce an array called deferred, which is conceptually storing various sets: calls to strategy1 in step 3 are changed to instead add a node to a set, and then the actual calls happen later when the set is drained. The pseudocode for steps 2 and 3 thus becomes:

# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
for w in range(N-1, 0, -1):
  v = deferred[w]
  while v != 0: # Drain set
    idomfrom[v] = strategy1(v, w)[1]
    v = deferred[v]
  sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    # Add w to set, will drain when step 2 reaches sdom(w)
    deferred[w] = deferred[sdom[w]]
    deferred[sdom[w]] = w

Then we can upgrade steps 2 and 3 from strategy1 to strategy2:

# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
for w in range(N-1, 0, -1):
  v = deferred[w]
  while v != 0: # Drain set
    idomfrom[v] = strategy2(v, w)[1]
    v = deferred[v]
  sdom[w] = min(strategy2(v, w)[0] if v > w else v for v in pred(w))
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    # Add w to set, will drain when step 2 reaches sdom(w)
    deferred[w] = deferred[sdom[w]]
    deferred[sdom[w]] = w
  cache_ancestor[w] = parent(w)
  cache_result[w] = (sdom[w], w)

def strategy2(v, limit):
  vs = []
  ancestor = cache_ancestor[v]
  while ancestor > limit:
    vs.append(v)
    v = ancestor
    ancestor = cache_ancestor[v]
  result = cache_result[v]
  while vs:
    v = vs.pop()
    result = min(result, cache_result[v])
    cache_result[v] = result
    cache_ancestor[v] = ancestor
  return result

I consider strategy2 to be sufficient, so I won't present a strategy3. Instead, a few memory optimisations are available to anyone squeezing out performance:


Once idom(w) has been computed, it can be used to compute the dominance frontier DF(w) in the style of Cytron et al's 1991 paper "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph":

for w in bottom_up_traversal_of_dominator_tree:
  DF(w) = set()
  for v in succ(w): # successors in G
    if idom(v) != w:
      DF(w).add(v)
  for v in children(w): # children in the dominator tree
    for u in DF(v):
      if idom(u) != w:
        DF(w).add(u)

Alternatively, idom(w) can be used to compute DF(w) in the style of Cooper et al's 2001 paper "A Simple, Fast Dominance Algorithm":

for w in G:
  DF(w) = set()
for w in G:
  if len(pred(w)) ≥ 2: # predecessors in G
    for v in pred(w):  # predecessors in G
      u = v
      while u != idom(w):
        DF(u).add(w)
        u = idom(u)

Note that if len(pred(w)) ≥ 2: is an optimisation, rather than a requirement for correctness: when len(pred(w)) == 1, the sole predecessor of w will be idom(w), so the innermost loop won't execute.

This formulation is easily amenable to representing DF values as arrays rather than sets, as deduplication needs only to look at the most recently inserted value:

for w in G:
  DF(w) = []
for w in G:
  if len(pred(w)) ≥ 2: # predecessors in G
    for v in pred(w):  # predecessors in G
      u = v
      while u != idom(w):
        if len(DF(u)) == 0 or DF(u)[-1] != w: DF(u).append(w)
        u = idom(u)

Finally, once DF(w) is available for all w, SSA construction becomes simple: if G denotes a control flow graph of basic blocks, then for every variable assigned-to in w, DF(w) is exactly the set of basic blocks which require a phi function (or basic block argument) inserted at their start to reconverge the various assignments of that variable. A mere two caveats apply:

  1. The phi function is another assignment to the variable in question, which possibly enlarges the set of variables assigned-to by the basic block into which the phi was inserted. Multiple iterations can be required to reach a fixpoint.
  2. Some liveness analysis can be applied to avoid inserting phi functions whose result would never be used.
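As a sketch of how caveat 1 plays out in code, here's the usual worklist formulation of phi placement (the defs map, from variable to the set of blocks assigning it, is my own framing; caveat 2's liveness pruning is omitted):

def place_phis(defs, DF):
  phis = {}                        # variable -> blocks needing a phi for it
  for x, def_blocks in defs.items():
    phis[x] = set()
    worklist = list(def_blocks)
    enqueued = set(def_blocks)
    while worklist:
      b = worklist.pop()
      for d in DF[b]:
        if d not in phis[x]:
          phis[x].add(d)           # d needs a phi for x...
          if d not in enqueued:
            enqueued.add(d)        # ...and that phi is itself an assignment,
            worklist.append(d)     # so d's frontier must be processed too
  return phis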

Tenstorrent Wormhole Series Part 8: Reference

This blog series (1, 2, 3, 4, 5, 6, 7) has given an introduction to Wormhole and a tour of some of its low-level functionality. If it has whetted your appetite, then the follow-up is the tt-isa-documentation repository I've been writing - it is the comprehensive reference manual going into detail on all the pieces. The reference manual bookends the blog series; there's no need for more blog parts, as the manual contains all the material I'd want to cover.
