RISC-V Conditional Moves

I'm a big fan of aarch64's csel family of instructions. A single instruction can evaluate rd = cond ? rs1 : f(rs2), where cond is any condition code and f is any of f0(x) = x, f1(x) = x+1, f2(x) = ~x, or f3(x) = -x. Want to convert a condition to a boolean? Use f1 with rs1 == rs2 == x0. Want to convert a condition to a mask? Use f2 with rs1 == rs2 == x0. Want to compute an absolute value? Use f3 with rs1 == rs2. It is pleasing that the composition of f1 and f2 is f3. I could continue espousing, but hopefully you get the idea.
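
To make that concrete, here's a minimal Python model of rd = cond ? rs1 : f(rs2), treating registers as 64-bit unsigned values; the f0 through f3 variants correspond to the usual csel / csinc / csinv / csneg mnemonics:

MASK = (1 << 64) - 1  # model 64-bit registers as unsigned integers

def csel_family(cond, rs1, rs2, f):
  return rs1 if cond else f(rs2) & MASK

f0 = lambda x: x      # csel
f1 = lambda x: x + 1  # csinc
f2 = lambda x: ~x     # csinv
f3 = lambda x: -x     # csneg

assert csel_family(False, 0, 0, f1) == 1     # condition to boolean: 0 or 1
assert csel_family(False, 0, 0, f2) == MASK  # condition to mask: 0 or all-ones
x = 42
assert csel_family(x >= 0, x, x, f3) == x    # absolute value (positive case)
assert csel_family(-x >= 0, -x, -x, f3) == x # absolute value (negative case)
assert f1(f2(x)) & MASK == f3(x) & MASK      # composing f1 and f2 gives f3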

RISC-V is the hot new thing, but it lacks a direct equivalent to csel. Some cases of converting conditions to booleans are possible with the slt family of instructions in the base instruction set. Beyond that, a few special cases are implemented by instruction set extensions: Zbb adds min and max instructions which are a particular pattern of compare and select, and Zicond adds czero.eqz and czero.nez which again are particular patterns of compare and select. But the general case? Considered and rejected, as per this direct quote from The RISC-V Instruction Set Manual Volume I Version 20250508:

We considered but did not include conditional moves or predicated instructions, which can effectively replace unpredictable short forward branches. Conditional moves are the simpler of the two, but are difficult to use ...

That quote hints at short forward branches being the recommended alternative. It doesn't quite go as far as to say that out-of-order cores are encouraged to perform macro fusion in the frontend to convert short forward branches back into conditional moves (when possible), but it is commonly taken to mean this, and some SiFive cores implement exactly this fusion.

Continuing to quote from The RISC-V Instruction Set Manual Volume I Version 20250508, the introductory text motivating Zicond also mentions fusion:

Using these [Zicond] instructions, branchless sequences can be implemented (typically in two-instruction sequences) without the need for instruction fusion, special provisions during the decoding of architectural instructions, or other microarchitectural provisions.

One of the shortcomings of RISC-V, compared to competing instruction set architectures, is the absence of conditional operations to support branchless code-generation: this includes conditional arithmetic, conditional select and conditional move operations. The design principles of RISC-V (e.g. the absence of an instruction-format that supports 3 source registers and an output register) make it unlikely that direct equivalents of the competing instructions will be introduced.

The design principles mentioned in passing mean that czero.eqz has slightly odd semantics. Assuming rd ≠ rs2, the intent is that these two instruction sequences compute the same thing:

Base instruction set          With Zicond
  mv rd, x0                     czero.eqz rd, rs1, rs2
  beq rs2, x0, skip_next
  mv rd, rs1
skip_next:

The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.

To see why, consider this example, again from The RISC-V Instruction Set Manual Volume I Version 20250508:

Control dependencies behave differently from address and data dependencies in the sense that a control dependency always extends to all instructions following the original target in program order.

  lw x1, 0(x2)
  bne x1, x0, next
next:
  sw x3, 0(x4)

Even though both branch outcomes have the same target, there is still a control dependency from the memory operation generated by the first instruction in this snippet to the memory operation generated by the last instruction. This definition of control dependency is subtly stronger than what might be seen in other contexts (e.g., C++), but it conforms with standard definitions of control dependencies in the literature.

The general point highlighted by this example is that every branch (or indirect jump) instruction imposes a syntactic control dependency on every store instruction anywhere after it in program order. If a branch is converted to a conditional move, there is no longer a syntactic control dependency. There can instead be an address or data dependency, but this only applies to stores which use the result of the conditional move, whereas the syntactic control dependency applied to all stores. In other words, not equivalent.

TLDR: If RISC-V cores want to perform fusion of short forward branches into conditional moves (to mitigate the lack of conditional moves in the instruction set), the resultant fused instruction needs to retain some branch-like properties to avoid violating the memory model.

Reworking Lengauer-Tarjan

In compiler circles, Lengauer & Tarjan's 1979 paper "A Fast Algorithm for Finding Dominators in a Flowgraph" is a classic: it describes and proves a viable algorithm for computing the idom of every node in a rooted directed graph. I have just two criticisms of the paper:

  1. All the sample code is written in Algol. This was a fine choice in 1979, but tastes have changed over the intervening 46 years, and most modern eyes will see the Algol code as somewhat antiquated.
  2. It starts by presenting an optimised variant of the algorithm, rather than starting with the most basic exposition and then introducing optimisations.

I believe that both problems can be addressed via a little bit of reworking.


The algorithm takes as input a graph G with some root node r. We immediately perform a depth first search on G; let T be the spanning tree discovered by that search, and replace all node labels with the pre-order index from that search. The root node is now 0, one of its successors is 1, and so forth. For arbitrary nodes v and w, this allows us to write things like v < w and v ≤ w and min(v, w), all of which are operating on these indices. We also introduce some notation for talking about these graphs:

Notation     Meaning
v ->G w      There is an edge from v to w in G, and w != r
v -*->T w    There is a path from v to w in T, or v == w
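
As a concrete sketch of that initial depth first search, a recursive traversal suffices (this assumes G is given as a successor map succ; the parent list produced here is what parent(u) refers to in the later pseudocode, and very deep graphs would want an iterative version):

def dfs_renumber(succ, r):
  index = {}   # original node -> pre-order index
  order = []   # pre-order index -> original node
  parent = []  # parent[i] = pre-order index of node i's parent in T (parent[0] = 0)
  def visit(node, p):
    i = len(order)
    index[node] = i
    order.append(node)
    parent.append(p)
    for s in succ[node]:
      if s not in index:
        visit(s, i)
  visit(r, 0)
  return index, parent, order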

We jump in at what the paper calls Theorem 4, which enables this definition:

sdom(w) = min({      v |                v ->G w and v ≤ w} union
              {sdom(u) | u -*->T v and v ->G w and v > w and u > w})

This definition is recursive, but only in the case of u > w, so we can compute it for all nodes by first computing it for w = N-1, then for w = N-2, and so forth until we eventually reach w = 1 (we cannot compute it for w = 0, as both sets are empty in that case).

Reworking Theorem 3 slightly enables this definition:

idomfrom(w) = argminu{sdom(u) | u -*->T w and u > sdom(w)}

Where argminu{expr | condition} gives the u which minimises expr (over all the possible u which meet condition), or any such u if there are multiple u achieving that minimum. Note that the set is always non-empty, as u = w meets the condition. As u -*->T w implies u ≤ w, we also have idomfrom(w) ≤ w.

Then reworking Theorem 2 slightly enables this definition:

idom(w) = sdom(w) if idomfrom(w) ≥ w else idom(idomfrom(w))

This definition is recursive, but only in the case of idomfrom(w) < w, so we can compute it for all nodes by first computing it for w = 1, then for w = 2, and so forth until eventually we reach w = N-1.

This gives us everything we need to compute idom, via a four step process:

  1. Perform a depth first search on G, to give T (and pre-order integer nodes).
  2. Compute sdom(w) for all w > 0, in decreasing order of w.
  3. Compute idomfrom(w) for all w > 0, in any order.
  4. Compute idom(w) for all w > 0, in increasing order of w.

In both steps 2 and 3, we're interested in either min or argminu over the set {sdom(u) | u -*->T v and u > limit}, for various values of v and limit. There are (at least) three different strategies for computing each of these min / argminu:

  1. From scratch every time: whenever such a computation is required, start with u = v, and then iterate u = parent(u) (where parent gives the parent in T) until u ≤ limit, taking the min / argminu of sdom(u) over all the observed u for which u > limit.
  2. Like strategy 1, but introduce a layer of caching to avoid repeated work. The paper calls this "path compression", which is one way of viewing it, but you reach the exact same place if you instead apply aggressive caching to strategy 1. For this to work, all queries need to be made in decreasing order of limit, as otherwise previously cached results wouldn't be valid. This happens naturally for step 2 (because it has limit = w, and is performed in decreasing order of w), but slightly more work is required to extend it to step 3: the usual approach is to chop up step 3 into lots of small pieces, and perform them interleaved with step 2 at appropriate times.
  3. Like strategy 2, but doing the union-find equivalent of "union by rank/size" rather than the union-find equivalent of "path compression". Note that the min and argminu operations in question aren't quite union-find, but they're sufficiently similar that the ideas can carry over.

These three strategies are in increasing order of implementation complexity, but also decreasing order of worst-case algorithmic complexity. Strategy 3 is best in theory, but the general consensus is that strategy 2 usually beats it in practice. Meanwhile, the Go compiler uses strategy 1 and considers it sufficient.

If using strategy 1, the pseudocode for steps 2 through 4 can be quite concise:

# Step 2
for w in range(N-1, 0, -1):
  sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))

# Step 3
for w in range(1, N):
  idomfrom[w] = strategy1(w, sdom[w])[1]

# Step 4
for w in range(1, N):
  idom[w] = sdom[w]           # Relevant  when idomfrom[w] = w
  idom[w] = idom[idomfrom[w]] # No change when idomfrom[w] = w

def strategy1(v, limit):
  us = [v]
  while parent(us[-1]) > limit: us.append(parent(us[-1]))
  return min((sdom[u], u) for u in us)

Note that pred in step 2 gives the predecessors in G, whereas parent in strategy1 gives the parent in T.

There are two little tricks to avoid some of the function calls in step 3:

  1. If sdom(w) == 0, then u = w is an acceptable result from argminu (as sdom(w) = 0 is already the smallest possible value, w itself attains the minimum).
  2. If sdom(w) == parent(w), then u = w will be the only possible argminu (the condition u > sdom(w) excludes parent(w) and everything above it, leaving just w in the set).

We can revise step 3 to detect these two cases, and handle them without a call:

# Step 3
for w in range(1, N):
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    idomfrom[w] = strategy1(w, sdom[w])[1]

If we then want to use something better than strategy1, step 3 needs to be chopped up and interleaved with step 2. One way of doing this is to introduce an array called deferred, which is conceptually storing various sets: calls to strategy1 in step 3 are changed to instead add a node to a set, and then the actual calls happen later when the set is drained. The pseudocode for steps 2 and 3 thus becomes:

# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
for w in range(N-1, 0, -1):
  v = deferred[w]
  while v != 0: # Drain set
    idomfrom[v] = strategy1(v, w)[1]
    v = deferred[v]
  sdom[w] = min(strategy1(v, w)[0] if v > w else v for v in pred(w))
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    # Add w to set, will drain when step 2 reaches sdom(w)
    deferred[w] = deferred[sdom[w]]
    deferred[sdom[w]] = w

Then we can upgrade steps 2 and 3 from strategy1 to strategy2:

# Step 2 and Step 3
deferred = [0] * N # Initialise deferred (all sets empty)
for w in range(N-1, 0, -1):
  v = deferred[w]
  while v != 0: # Drain set
    idomfrom[v] = strategy2(v, w)[1]
    v = deferred[v]
  sdom[w] = min(strategy2(v, w)[0] if v > w else v for v in pred(w))
  if sdom[w] in {0, parent(w)}:
    idomfrom[w] = w
  else:
    # Add w to set, will drain when step 2 reaches sdom(w)
    deferred[w] = deferred[sdom[w]]
    deferred[sdom[w]] = w
  cache_ancestor[w] = parent(w)
  cache_result[w] = (sdom[w], w)

def strategy2(v, limit):
  vs = []
  ancestor = cache_ancestor[v]
  while ancestor > limit:
    vs.append(v)
    v = ancestor
    ancestor = cache_ancestor[v]
  result = cache_result[v]
  while vs:
    v = vs.pop()
    result = min(result, cache_result[v])
    cache_result[v] = result
    cache_ancestor[v] = ancestor
  return result

I consider strategy2 to be sufficient, so I won't present a strategy3. Instead, a few memory optimisations are available to anyone squeezing out performance:


Once idom(w) has been computed, it can be used to compute DF(w) in the style of Cytron et al's 1991 paper "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph":

for w in bottom_up_traversal_of_dominator_tree:
  DF(w) = set()
  for v in succ(w): # successors in G
    if idom(v) != w:
      DF(w).add(v)
  for v in children(w): # children in the dominator tree
    for u in DF(v):
      if idom(u) != w:
        DF(w).add(u)

Alternatively, idom(w) can be used to compute DF(w) in the style of Cooper et al's 2001 paper "A Simple, Fast Dominance Algorithm":

for w in G:
  DF(w) = set()
for w in G:
  if len(pred(w)) ≥ 2: # predecessors in G
    for v in pred(w):  # predecessors in G
      u = v
      while u != idom(w):
        DF(u).add(w)
        u = idom(u)

Note that if len(pred(w)) ≥ 2: is an optimisation, rather than a requirement for correctness: when len(pred(w)) == 1, the sole predecessor of w will be idom(w), so the innermost loop won't execute.

This formulation is easily amenable to representing DF values as arrays rather than sets, as deduplication needs only to look at the most recently inserted value:

for w in G:
  DF(w) = []
for w in G:
  if len(pred(w)) ≥ 2: # predecessors in G
    for v in pred(w):  # predecessors in G
      u = v
      while u != idom(w):
        if len(DF(u)) == 0 or DF(u)[-1] != w: DF(u).append(w)
        u = idom(u)
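
As a quick end-to-end example of the array-based formulation, here's a runnable transcription on a diamond-shaped CFG (block numbering and dict representation are mine for illustration):

pred = {0: [], 1: [0], 2: [0], 3: [1, 2]}  # 0 -> {1, 2}, 1 -> {3}, 2 -> {3}
idom = {1: 0, 2: 0, 3: 0}                  # as computed by the algorithm above

DF = {w: [] for w in pred}
for w in pred:
  if len(pred[w]) >= 2:
    for v in pred[w]:
      u = v
      while u != idom[w]:
        if len(DF[u]) == 0 or DF[u][-1] != w:
          DF[u].append(w)
        u = idom[u]

assert DF == {0: [], 1: [3], 2: [3], 3: []}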

Finally, once DF(w) is available for all w, SSA construction becomes simple: if G denotes a control flow graph of basic blocks, then for every variable assigned-to in w, DF(w) is exactly the set of basic blocks which require a phi function (or basic block argument) inserted at their start to reconverge the various assignments of that variable. A mere two caveats apply:

  1. The phi function is another assignment to the variable in question, which possibly enlarges the set of variables assigned-to by the basic block into which the phi was inserted. Multiple iterations can be required to reach a fixpoint.
  2. Some liveness analysis can be applied to avoid inserting phi functions whose result would never be used.
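
Putting DF to use, here's a minimal sketch of the standard worklist approach to phi placement; it handles caveat 1 by pushing newly-phi'd blocks back onto the worklist (the liveness pruning of caveat 2 is omitted), and assumes assigns[b] gives the variables assigned in block b and DF[b] gives its dominance frontier:

def place_phis(blocks, assigns, DF):
  phis = {b: set() for b in blocks}  # variables needing a phi at block b
  for var in {v for b in blocks for v in assigns[b]}:
    worklist = [b for b in blocks if var in assigns[b]]
    enqueued = set(worklist)
    while worklist:
      b = worklist.pop()
      for d in DF[b]:
        if var not in phis[d]:
          phis[d].add(var)
          # The phi is itself an assignment (caveat 1), so d may need to
          # propagate the variable onwards to its own dominance frontier.
          if d not in enqueued:
            enqueued.add(d)
            worklist.append(d)
  return phis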

Tenstorrent Wormhole Series Part 8: Reference

This blog series (1, 2, 3, 4, 5, 6, 7) has given an introduction to Wormhole and a tour of some of its low-level functionality. If it has whetted your appetite, then the follow-up is the tt-isa-documentation repository I've been writing - it is the comprehensive reference manual going into detail on all the pieces. The reference manual bookends the blog series; there's no need for more blog parts, as the manual contains all the material I'd want to cover.

Tenstorrent Wormhole Series Part 7: Bits of the MatMul

Previously, in part 6, I took a deep dive into Tensix Vector, but if you're buying Tenstorrent hardware, you almost certainly want fast matrix multiplication. Furthermore, if that is your view, then everything else on the chip is secondary to matrix multiplication: SRAM and DRAM are for storing matrices close to the compute logic, Tensix Vector is for operating on matrix multiplication results without having to send them back to the host over PCIe, and so on.

Back in part 5, we saw one representation of a T tile, emphasizing the RISC-V cores. A different representation of the same thing instead emphasizes Tensix Matrix and the data paths to/from it:

The above diagram requires a few remarks:

The focus might be on Tensix Matrix, but the data path between Tensix Matrix and L1 goes through Tensix Unpack and Tensix Pack. The very quick summary of these units is that they can perform some light data type conversion and some light reshaping, but mainly just shovel data between L1 and SrcB / SrcA / Dst. The bidirectional arrow on the diagram between Tensix Pack and L1 is there because Tensix Pack can perform either L1 = Dst or L1 += Dst, with the latter eliminating some of the need for Tensix Unpack to write to Dst. Tensix Pack can also perform some flavors of ReLU, though only as L1 = ReLU(Dst) or L1 += ReLU(Dst), and not as L1 = ReLU(L1 + Dst).

The sub-32-bit types that Tensix Unpack can write to SrcA and SrcB include:

With that interlude complete, we can finally look at Tensix Matrix. It is capable of a few different operations, but the big one is matrix multiplication, doing Dst += SrcB @ SrcA, where Dst and SrcB are both 8x16 and SrcA is 16x16:

The TT-Metal API tends to expose 32x32 blocks, matmul for which can be built out of the 8x16 matmul, one possible arrangement being:

For the primitive 8x16 operation, what gets applied to each scalar element of Dst is d = d + a0·b0 + a1·b1 + ⋯ + a15·b15, where ai ranges over a column of SrcA and bi ranges over a row of SrcB. For a 32x32 block, the primitive 8x16 operation needs to be applied 16 times in total (because there are 8 chunks of Dst, and each chunk requires a 32-element dot product rather than just a 16-element dot product).
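
As a quick numpy sketch of one such arrangement, the 16 invocations of the 8x16 primitive can be arranged as 4 row chunks by 2 column chunks by 2 dot-product chunks (the loop order here is purely illustrative, not Tenstorrent's actual scheduling):

import numpy as np

def matmul_8x16(dst, srcb, srca):
  dst += srcb @ srca  # the primitive: Dst (8x16) += SrcB (8x16) @ SrcA (16x16)

def matmul_32x32(B, A):
  D = np.zeros((32, 32))
  for i in range(0, 32, 8):       # 8-row chunk of Dst / SrcB
    for j in range(0, 32, 16):    # 16-column chunk of Dst / SrcA
      for k in range(0, 32, 16):  # 16-element chunk of each dot product
        matmul_8x16(D[i:i+8, j:j+16], B[i:i+8, k:k+16], A[k:k+16, j:j+16])
  return D

B, A = np.random.rand(32, 32), np.random.rand(32, 32)
assert np.allclose(matmul_32x32(B, A), B @ A)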

The scalar + and · in the previous paragraph are floating-point addition and multiplication. If you're not a hardware designer, you might view floating-point addition and multiplication as magical primitive operations provided by hardware. If you are a hardware designer, you'll know that nothing is magic: floating-point addition and multiplication need to eventually be implemented via a mixture of integer comparisons, shifts, additions, and multiplications. To really reinforce the point that nothing is magic, I can show you code for bf16 fused-multiply-add implemented using just integer operations. Any actual hardware would be described in (System)Verilog rather than C, but there are more people who can read C than Verilog. If you look through that code, you'll see a few major pieces:

  1. Decompose the bf16 inputs to their sign / exponent / mantissa (lines 20-21).
  2. For normal numbers, re-attach the implicit one bit at the top of the mantissa (line 21). For denormal numbers, use __builtin_clz and << to find and reposition their most significant one bit (lines 22-24).
  3. Check for any input being NaN or infinity, and return an appropriate NaN or infinity if so (lines 33-45).
  4. Perform the multiply: xor of signs, integer addition of exponents, integer multiplication of mantissas (lines 31, 47-49).
  5. Perform the addition: determine the largest exponent, use >> on the mantissa of the other value to equalize the exponents, then integer add or subtract the mantissas (lines 51-73).
  6. Turn the result back into bf16: __builtin_clz and << or >> to get exactly 10 bits of mantissa, clamp exponent to the range 0-255 (possibly another >> here if the value is denormal), then pack sign and exponent and mantissa into 16 bits (lines 75-85).
  7. Considering all the bits thrown away by >> in previous steps, perform rounding on the result (rounding to nearest, ties to even). This is just line 88, though it relies heavily on p_m <<= 3, z_m <<= 3 and on sticky_shift in earlier steps.

If you're aiming for full IEEE754 conformance, then all of the above steps are required. If instead you're designing hardware for AI, then any step that can be removed and/or simplified is a massive win: each Tensix Matrix unit on a Wormhole chip wants to perform 2048 floating-point multiplies and 2048 floating-point additions per cycle, and there are 80 Tensix Matrix units on each chip, so any cost savings in either multiplication or addition get amplified 163,840 times. Even a small simplification becomes noticeable when amplified that much. I don't know exactly what simplifications Tenstorrent have opted for, but there are potential savings in every step:

  1. Perform decomposition somewhere else (e.g. in Tensix Unpack rather than Tensix Matrix).
  2. Treat all denormals as zero, thus saving on __builtin_clz and <<.
  3. Correctly handle only some cases of inputs being NaN or infinity.
  4. Use only some of the mantissa for multiplication rather than all of it.
  5. If adding more than two values together, determine the largest exponent of all of them, then use >> to equalize all their exponents at once (some academic work calls this a Class IV multi-term adder). All the mantissas can then be added together using a multiple-input integer adder, rather than a tree of two-input adders.
  6. If the result is going to immediately feed into step 1 of another multiply/add, skip step 6 and the subsequent step 1. Additionally, if the result is denormal, treat it as zero to save on >>.
  7. Use a cheaper rounding mode: either rounding toward zero (no rounding bits required), or rounding to nearest with ties away from zero (just one rounding bit required).

Treating denormals as zero on input (step 2) and on output (step 6) are relatively common simplifications; usually even CPUs have options for it. The more unusual simplifications are in steps 4 and 5. For step 4, if you want to be able to multiply a variety of floating-point formats between bfp2 and tf32, then you'd normally need an 11b by 11b multiplier for this step:

That said, fp8 (e5m2) only requires a 3b by 3b multiply, and bf16 only requires 8b by 8b, so the full 11b by 11b might feel wasteful if lower-precision operations make up the majority of your workload. The particular simplification that Tenstorrent have opted for in Wormhole is a 7b by 5b multiplier, which can be invoked up to four times to cover (almost) the full range:
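
As a rough sketch of the partial-product idea behind this, a narrow multiplier can be invoked once per partial product and the results shifted and summed; the particular bit split below (14b by 10b) and the mapping of partial products onto the fidelity phases are my own assumptions for illustration, not something stated in the post:

def mul_7x5(a7, b5):
  assert a7 < (1 << 7) and b5 < (1 << 5)
  return a7 * b5  # what a single invocation of the multiplier provides

def mul_wide(a, b):
  # Four invocations of the 7b x 5b multiplier covering a wider multiply.
  a_lo, a_hi = a & 0x7f, a >> 7
  b_lo, b_hi = b & 0x1f, b >> 5
  return ((mul_7x5(a_hi, b_hi) << 12) + (mul_7x5(a_hi, b_lo) << 7) +
          (mul_7x5(a_lo, b_hi) << 5) + mul_7x5(a_lo, b_lo))

assert mul_wide(12345, 678) == 12345 * 678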

This has a number of interesting implications:

At this point, we can cross-check with the advertised TFLOP/s numbers. We'd expect 4.096 TFLOP/s from each Tensix Matrix unit when LoFi is in use (e.g. for fp8), half that when LoFi + HiFi2 is in use (e.g. for bfp8), and half that again when all of LoFi + HiFi2 + HiFi3 + HiFi4 are required (e.g. for fp16). With 72 usable Tensix Matrix units on an n150s board and 128 usable on an n300s board, this would mean:

                                   n150s           n300s
Just LoFi, e.g. fp8 (e5m2)         294.9 TFLOP/s   524.3 TFLOP/s
LoFi+HiFi2, e.g. bfp8              147.5 TFLOP/s   262.1 TFLOP/s
LoFi+HiFi2+HiFi3+HiFi4, e.g. fp16  73.7 TFLOP/s    131.1 TFLOP/s
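
The arithmetic behind that table, as a quick Python cross-check (the 1 GHz clock is an assumption on my part, inferred from 4096 FLOP per cycle coming out at 4.096 TFLOP/s):

flops_per_cycle = 2048 + 2048                   # multiplies + additions per Tensix Matrix unit
per_unit_tflops = flops_per_cycle * 1.0 / 1e3   # at a 1 GHz clock: 4.096 TFLOP/s per unit
for board, units in [("n150s", 72), ("n300s", 128)]:
  for fidelity, phases in [("LoFi", 1), ("LoFi+HiFi2", 2), ("LoFi+HiFi2+HiFi3+HiFi4", 4)]:
    print(board, fidelity, round(per_unit_tflops * units / phases, 1), "TFLOP/s")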

Meanwhile, the one-pager from the Tenstorrent sales page reports:

The bfp8 and fp16 numbers are exactly as expected. The oddity is the fp8 number, which is only 88.9% (i.e. sixteen eighteenths) of expected. This suggests that at such low precision, the bottleneck becomes data transfer (e.g. 16 cycles to multiply a 32x32 tile, but 18 cycles to get the data in or get the data out).

To hit these TFLOP/s numbers, "all" you have to do is write code for one of the RV "T" cores to issue a matrix multiplication instruction every cycle, and write code for the other four RV cores to keep data flowing in and out at the right rate. The Macro-Op Expander and Replay Expander seen in part 5 help with this task: the expanders can output a Tensix instruction every cycle without themselves needing to be instructed every cycle, giving the RV cores a bit of time to do other things (such as control flow). Then repeat this 72 times for an n150s board, or 128 times for an n300s board. In either case, that'll result in several hundred RV cores busily working away!

That wraps up part 7; there will likely be a long gap before the next part due to a certain factory game expansion pack being released next week.

Tenstorrent Wormhole Series Part 6: Vector instruction set

Back in part 4, we were considering the entire Wormhole PCIe card, and then in part 5 we zoomed in on a single T tile. Today I'm going to zoom in even more, looking at the box that part 5 labelled "Tensix Vector (SFPU)". To draw a very rough analogy to GPU graphics programming, Tensix Unpack/Matrix/Pack are somewhat like a (configurable) fixed-function pipeline, whereas Tensix Vector can execute arbitrary shader programs. To instead draw a very rough analogy to GPU AI programming, Tensix Unpack/Matrix/Pack are like tensor cores, whereas Tensix Vector is like CUDA cores. That said, neither analogy is entirely accurate, as fundamentally this hardware is trying to be its own thing rather than trying to be a GPU. Continuing the AI theme, Unpack/Matrix/Pack can execute (amongst other things) linear layers consisting of matrix multiplication, optionally adding a bias, and then optionally ReLU, but once you stray too much beyond this, you'll need to pull in Tensix Vector. Tanh? Tensix Vector. Dropout? Tensix Vector. Cumsum? You guessed it, Tensix Vector.

The Tenstorrent documentation and code often refer to Tensix Vector as "SFPU", but I'll stick to calling it Tensix Vector. The hardware is wrapped with an API/toolchain/compiler called SFPI, which has an associated documentation page. I'll try to explain the raw underlying hardware, though I'll occasionally make reference to things the SFPI toolchain does. The documentation makes reference to an emulator in main.cc, which I can't find per se, but sfpu.cc gets somewhat close. Unfortunately, it operates at the level of a partially-lowered compiler IR, so some interpretation is required to map between that IR and actual machine instructions. Speaking of machine instructions, we saw the general layout of Tensix instructions in part 5. As a reminder, these are disjoint from RISC-V instructions, so there's no relation between the RISC-V "V" (for Vector) extension and Tensix Vector, and the general instruction layout is:

Whereas RISC-V "V" tries to present arbitrary-length vectors, Tensix Vector is a good old SIMD instruction set, like AArch64 NEON or x86 SSE/AVX, with 32 SIMD lanes in Wormhole. Each lane consists of 32 bits, which depending on the instruction are viewed as either fp32 or int32 or signmag32.

With the introduction done, we can start to get into the details. The remainder of this post gets very low-level and very dense, so skip it if that's not for you.

Execution environment

Before diving into the Tensix Vector instructions, it is useful to consider the environment in which the instructions execute. The important parts are:

                          Size
Vector registers (L0-L7)  8 registers, 32 lanes per register, 32b per lane
Fixed constants           4 values, 32b each
Programmable constants    4 "constants", 8 lanes per constant, 32b per lane
Flags active              1b
Per lane flags            32 lanes, 1b per lane
Flag stack                Between 0 and 8 entries, (1+32×1)b per entry
PRNG                      32 lanes, 32b LFSR per lane (with caveats)
Dst                       Either 512 rows, 16 lanes per row, 32b per lane
                          or 1024 rows, 16 lanes per row, 16b per lane
RWC_Dst                   10b

The vector registers are called L0 through L7, which is a somewhat poor choice of naming scheme, given that L1 could easily instead refer to the 1464 KiB of SRAM on each tile. Considering the vector registers and the constants all together, there are 16 possible operands, which are encoded into 4b fields in instructions like so:

Encoding   Meaning
0 - 7      Vector Registers L0 - L7
8          Fixed Constant 0.8373 (bit pattern 0x3F56594B)
9          Fixed Constant 0.0 (bit pattern 0x00000000)
10         Fixed Constant 1.0 (bit pattern 0x3F800000)
11         Programmable constant, though toolchain requires it to be -1.0
12         Programmable constant (vConstIntPrgm0 / vConstFloatPrgm0)
13         Programmable constant (vConstIntPrgm1 / vConstFloatPrgm1)
14         Programmable constant (vConstIntPrgm2 / vConstFloatPrgm2)
15         Fixed Constant lane_number << 1 (i.e. 0, 2, 4, ..., 62)

The programmable constants are set using the SFPCONFIG instruction, which we'll cover later. The toolchain exposes two names for each, differing in type, but they're backed by the same storage. The programmable constants usually have the same value in all eight of their lanes, but in the event that the lanes end up with different values, four copies of the constant are stacked horizontally to form 32 lanes. The fixed constants 0.8373 and 0.0 and 1.0 have the same value in every lane, and then the final fixed constant has a different value in every lane.

Next up are flags. Flags can be active or inactive. If flags are active, then there is a 1b flag per lane controlling whether that lane is enabled. Initially all lanes are enabled, and then various instructions can "refine" the per-lane flag: lanes which fail the condition switch from enabled to disabled (whilst previously disabled lanes remain disabled, with neither their contents nor their flags being updated). The toolchain exposes refinement through the v_and macro. If flags are inactive, then all lanes are enabled regardless of the 1b flag per lane. There is also a stack on to which all this state can be pushed and then later popped. Contrast this to Apple G13: there each lane has a disabled counter rather than a stack of flags.

We then find some PRNG state, which can optionally be used for stochastic rounding. The seeding mechanism leaves something to be desired though, as does the state update function, so I'd recommend avoiding the PRNG if you care about your random numbers having high quality and low correlation.

The final notable part of the execution environment is Dst: the large 2D piece of memory that the Tensix Matrix unit writes the result of matrix operations to. The rows of this memory are 16 scalars wide, the combination of 16 rows is typically called a 16x16 face (which is what a lot of the LLK code operates on), and then the combination of four such faces is typically called a 32x32 tile (which is what the TT-Metal API exposes). Expressed differently, 64 rows of Dst are required for holding a 32x32 tile. The SFPLOAD and SFPSTORE instructions transfer data between a single vector register and some rows of Dst (they do not transfer between a vector register and main memory!), with the target rows determined by the summation of an immediate operand to SFPLOAD / SFPSTORE and the RWC_Dst variable, taken modulo the number of rows of Dst (512 when it is holding 32b scalars, 1024 when holding 16b scalars). The toolchain exposes RWC_Dst via the slightly questionable syntax dst_reg++.

Notation

I'll use VD to mean an arbitrary vector register used as the output (and often also an input) of an instruction. I'll use VA / VB / VC to mean arbitrary vector registers or constants used as inputs to an instruction. When instructions operate on fixed registers, I'll use the names L0 through L7. Scalar inputs that come from N bits within the instruction itself are referred to as ImmN. Signed immediates (in two's complement form) of N+1 bits will be ±ImmN, the range of which is -(2^N) through (2^N)-1.

Some instructions can operate in several distinct modes, in which case they'll be listed multiple times in different sections and marked with (‡) each time.

Instruction encoding

The Mod0 family of encodings put a "VD" field at the top, then a modifier field, then immediates at the bottom:

Meanwhile, the Mod1 family of encodings put a modifier field at the bottom, then "VD", then other operands, then immediates at the top:

Each instruction links through to emulation code for that instruction, giving (my best guess of) its precise encoding and behaviour. In each case, the encoding will be one of the above, but the opcode varies by instruction, as does the interpretation of Mod0 / Mod1.

With the stage set, we can now take a brief look at all the instructions handled by Tensix Vector.

Int32 arithmetic and bitwise operations

We begin with some gentle integer instructions:

              Per-lane behaviour (int32)
SFPIADD       VD = VC ± VD or VD = VC ± Imm11
              Optionally refine flags based on VD < 0 or inverse thereof
SFPAND        VD &= VC
SFPOR         VD |= VC
SFPXOR        VD ^= VC
SFPNOT        VD = ~VC
SFPLZ         VD = CountLeadingZeros(VC)
              Optionally refine flags based on VC != 0 or inverse thereof
SFPABS (‡)    VD = Abs(VC)
SFPSHFT       VD = VD << (VC % 32) or VD = VD >> (-VC % 32) or
              VD = VD << Imm5 or VD = VD >> -Imm5
SFPSHFT2 (‡)  VD = VB << (VC % 32) or VD = VB >> (-VC % 32)
SFPSETCC      Refine flags based on VC != 0 or VC < 0 or inverse of either

Nothing greatly surprising here, though it is a shame that so many instructions treat VD as both an input and an output (this isn't for lack of encoding space, as there's plenty of that, and isn't for lack of register file ports, as SFPMAD requires three read ports and a write port, so I'm not sure of the rationale here). Shifts are all done modulo 32, with the sign of the shift amount determining whether the shift is left or right. Right shifts are always unsigned, though apparently Blackhole adds support for signed right shifts. There's also a somewhat insane variant of SFPSHFT2 that shifts VB by an immediate, but bits 12 through 15 specify both VB and (part of) the immediate, so the possible options there are L0 << 0, L1 << 1, L2 << 2, and so forth.

Flags are generally refined based on the sign or the zero-ness of the result. The conditions VC != 0 and VC < 0 are native, as are their inverses VC == 0 and VC >= 0. The non-native VC > 0 is achieved by refining on VC >= 0 and then refining on VC != 0. Its inverse (VC <= 0) is achieved by refining on VC >= 0 and then refining on VC != 0 and then issuing SFPCOMPC to invert the overall result. Three instructions for VC <= 0 isn't great, but again is addressed in Blackhole. Comparisons where the right hand side isn't zero are done by subtracting the two operands, and then comparing the subtraction result against zero. This causes < / <= / > / >= to do the wrong thing if overflow occurs during the subtraction, which is mildly concerning.

Flag stack

           Per-lane behaviour
SFPENCC    Configure whether flags are active, also set flags
SFPPUSHC   Push copy of flags on to flag stack
SFPCOMPC   Invert per-lane flags, using top of stack as context
SFPPOPC    Pop from flag stack into flags, or read top of stack into flags

The SFPENCC instruction is used to initialise the flags subsystem: it can set flags to active, and initialise the per-lane flags to either enable or disable all lanes.

SFPPUSHC and SFPPOPC mostly do what you'd expect. If SFPPUSHC is used more than eight times, then it'll start overwriting previous entries. The stack size counter is four bits, and it too will wrap if SFPPUSHC is used sixteen times. If SFPPOPC is used with the size counter equal to zero, then the counter will underflow to fifteen, but the resultant flags state will always be all lanes active. I would not advise trying to do anything clever with stack underflow or overflow.

SFPCOMPC is slightly interesting: it inverts the per-lane flags, but does this subject to the state on the top of the stack; lanes that would be disabled in that state are set to disabled rather than being inverted.

Fp32 field manipulation

Up next are some unconventional, though not unwelcome, instructions to manipulate the three fields of an IEEE754 float:

            Per-lane behaviour (fp32 sign/exponent/mantissa)
SFPEXEXP    VD = VC.Exponent or VD = VC.Exponent - 127
            Optionally refine flags based on VD < 0 or inverse thereof
SFPEXMAN    VD = { 0, !Imm1, VC.Mantissa}
SFPMOV (‡)  VD = {!VC.Sign, VC.Exponent, VC.Mantissa}
SFPSETSGN   VD = { VD.Sign, VC.Exponent, VC.Mantissa} or
            VD = { Imm1, VC.Exponent, VC.Mantissa}
SFPABS (‡)  VD = { 0, VC.Exponent, VC.Mantissa}
SFPSETEXP   VD = { VC.Sign, VD.Mantissa & 255, VC.Mantissa} or
            VD = { VC.Sign, VD.Exponent, VC.Mantissa} or
            VD = { VC.Sign, Imm8, VC.Mantissa}
SFPSETMAN   VD = { VC.Sign, VC.Exponent, VD.Mantissa} or
            VD = { VC.Sign, VC.Exponent, Imm12 << 11}
SFPDIVP2    VD = { VC.Sign, Imm8, VC.Mantissa} or
            VD = { VC.Sign, VC.Exponent ± Imm7, VC.Mantissa}

There is no SFPEXSGN instruction, as integer instructions suffice for this: SFPSETCC can refine flags based on the sign bit, and SFPSHFT can do a right shift by 31 to extract just the sign bit.

The SFPDIVP2 instruction can perform addition/subtraction on the exponent field, thereby providing multiplication or division by a power of two, though the arithmetic will wrap around if it overflows, so some care is required. The only saving grace is that the VC.Exponent ± Imm7 form will leave VC untouched if it starts as ±Inf or ±NaN. If wrapping is a concern, use SFPMULI instead (described in the next section).

There is some overlap between instructions here; an absolute-value function can be built from SFPSETSGN, or SFPABS can be used for this. Similarly, one mode of SFPSETEXP is completely identical to one mode of SFPDIVP2.

Fp32 arithmetic

Then we reach the floating point multiply/add unit:

            Per-lane behaviour (fp32)
SFPMUL      VD = VA * VB + 0
SFPADD      VD = 1 * VB + VC
SFPMAD      VD = VA * VB + VC
SFPMULI     VD *= Bf16ToFp32(Imm16)
SFPADDI     VD += Bf16ToFp32(Imm16)
SFPLUT      TmpA, TmpC = LUT({L0.Low16, L1.Low16, L2.Low16}, Abs(L3))
            VD = TmpA * Abs(L3) + TmpC
SFPLUTFP32  TmpA, TmpC = LUT({L0, L1, L2, L4, L5, L6}, Abs(L3))
            VD = TmpA * Abs(L3) + TmpC

All of these instructions take two cycles, i.e. VD is not available until two cycles after the instruction is issued. An SFPNOP instruction must be inserted if the next instruction would otherwise want to consume VD (Blackhole relieves the compiler of this duty).

There is no fp32 subtract instruction; it is instead achieved by SFPMAD with VB set to -1.0. Most ISAs with a floating-point fused-multiply-add instruction have variants of the instruction to negate the result of the multiplication and/or negate the addend, as doing so is incredibly cheap in hardware. This glaring omission is seemingly corrected in Blackhole.

The SFPADD instruction always has VA set to the constant 1.0 by the compiler, allowing hardware to treat SFPADD exactly like SFPMAD if it so desires. Similarly, SFPMUL always has VC set to the constant 0.0 by the compiler, allowing hardware to treat SFPMUL exactly like SFPMAD. The chip I'm playing with indeed treats SFPADD and SFPMUL exactly like SFPMAD, though future chips might be able to just add or just multiply faster than SFPMAD (e.g. Zen 4 takes four cycles for a multiply-add, but just 3 cycles for either a multiply or an add).

There are no dedicated fp32 comparison instructions (though see the min/max mode of SFPSWAP described later), as the integer SFPSETCC generally suffices, though this does mean that -NaN is considered less than -Inf and +Inf is considered less than +NaN. It would also mean that -0 is considered less than +0, but it looks like all arithmetic instructions normalize -0 to +0 (similarly, it looks like all denormal inputs are treated as zero and all denormal outputs are flushed to +0; see also Tenstorrent's statement on infinities and NaNs and denormals).

The unusual instructions are SFPLUT and SFPLUTFP32, which create a 3-element or 6-element table from various bits of L0/L1/L2 and optionally L4/L5/L6, then use the magnitude of L3 to determine which table element to use, extract TmpA and TmpC from said element, calculate VD = TmpA * Abs(L3) + TmpC, then optionally overwrite the sign of the result with the original sign of L3. These instructions allow for various unary functions to be approximated in a piecewise linear fashion (similar in spirit, though not at all in implementation, to genlut in Apple's AMX).

For SFPLUT, the table ranges are:

                     TmpA (multiplicand)          TmpC (addend)
Abs(L3) < 1.0        Fp8ToFp32((L0 >> 8) & 255)   Fp8ToFp32(L0 & 255)
1.0 ≤ Abs(L3) < 2.0  Fp8ToFp32((L1 >> 8) & 255)   Fp8ToFp32(L1 & 255)
2.0 ≤ Abs(L3)        Fp8ToFp32((L2 >> 8) & 255)   Fp8ToFp32(L2 & 255)

Whereas for SFPLUTFP32 in mode FP16_3ENTRY_TABLE:

                     TmpA (multiplicand)    TmpC (addend)
Abs(L3) < 1.0        Fp16ToFp32(L0 >> 16)   Fp16ToFp32(L0 & 0xffff)
1.0 ≤ Abs(L3) < 2.0  Fp16ToFp32(L1 >> 16)   Fp16ToFp32(L1 & 0xffff)
2.0 ≤ Abs(L3)        Fp16ToFp32(L2 >> 16)   Fp16ToFp32(L2 & 0xffff)

For SFPLUTFP32 in mode FP32_3ENTRY_TABLE:

                     TmpA (multiplicand)   TmpC (addend)
Abs(L3) < 1.0        L0                    L4
1.0 ≤ Abs(L3) < 2.0  L1                    L5
2.0 ≤ Abs(L3)        L2                    L6

For SFPLUTFP32 in mode FP16_6ENTRY_TABLE1:

                     TmpA (multiplicand)       TmpC (addend)
Abs(L3) < 0.5        Fp16ToFp32(L0 & 0xffff)   Fp16ToFp32(L4 & 0xffff)
0.5 ≤ Abs(L3) < 1.0  Fp16ToFp32(L0 >> 16)      Fp16ToFp32(L4 >> 16)
1.0 ≤ Abs(L3) < 1.5  Fp16ToFp32(L1 & 0xffff)   Fp16ToFp32(L5 & 0xffff)
1.5 ≤ Abs(L3) < 2.0  Fp16ToFp32(L1 >> 16)      Fp16ToFp32(L5 >> 16)
2.0 ≤ Abs(L3) < 3.0  Fp16ToFp32(L2 & 0xffff)   Fp16ToFp32(L6 & 0xffff)
3.0 ≤ Abs(L3)        Fp16ToFp32(L2 >> 16)      Fp16ToFp32(L6 >> 16)

And finally SFPLUTFP32 in mode FP16_6ENTRY_TABLE2:

                     TmpA (multiplicand)       TmpC (addend)
Abs(L3) < 0.5        Fp16ToFp32(L0 & 0xffff)   Fp16ToFp32(L4 & 0xffff)
0.5 ≤ Abs(L3) < 1.0  Fp16ToFp32(L0 >> 16)      Fp16ToFp32(L4 >> 16)
1.0 ≤ Abs(L3) < 1.5  Fp16ToFp32(L1 & 0xffff)   Fp16ToFp32(L5 & 0xffff)
1.5 ≤ Abs(L3) < 2.0  Fp16ToFp32(L1 >> 16)      Fp16ToFp32(L5 >> 16)
2.0 ≤ Abs(L3) < 4.0  Fp16ToFp32(L2 & 0xffff)   Fp16ToFp32(L6 & 0xffff)
4.0 ≤ Abs(L3)        Fp16ToFp32(L2 >> 16)      Fp16ToFp32(L6 >> 16)
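
To make the mechanism concrete, here's a single-lane Python model of the FP32_3ENTRY_TABLE mode, following the table above; the function name and the retain_sign parameter (modelling the optional final step of overwriting the result's sign with the original sign of L3) are mine:

import math

def sfplutfp32_fp32_3entry(l0, l1, l2, l4, l5, l6, l3, retain_sign=False):
  x = abs(l3)
  if x < 1.0:
    tmp_a, tmp_c = l0, l4
  elif x < 2.0:
    tmp_a, tmp_c = l1, l5
  else:
    tmp_a, tmp_c = l2, l6
  result = tmp_a * x + tmp_c
  return math.copysign(result, l3) if retain_sign else result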

Many of the instructions in this section also support a mode whereby the result of the instruction isn't written to VD, but is instead written to the register number in the low four bits of L7. This can be viewed as a kind of scatter operation. SFPMAD also supports a kind of gather operation: rather than reading from VA, the multiplicand input can be taken from the register number in the low four bits of L7.

Min / max / swap

             Per-lane behaviour (fp32 or signmag32)
SFPSWAP (‡)  VD, VC = Min(VD, VC), Max(VD, VC)
SFPSWAP (‡)  VD, VC = VC, VD

This instruction takes two cycles, possibly because it has two destinations and there's only one write port on the register file, and SFPSWAP must be followed by SFPNOP. When doing min / max, it uses the total ordering whereby -NaN < -Inf < finite negative values < -0 < +0 < finite positive values < +Inf < +NaN. The smaller of the two inputs ends up in VD, and the larger in VC. There are also variants which compute Min,Max for some groups of 8 lanes, and Max,Min for other groups of 8 lanes.

This is not an arithmetic instruction, so it does not flush denormals on input or on output. This means it can also be used for 32-bit integers in sign/magnitude form. The plain swap can also be used on int32 lanes.

Data type conversions to / from fp32

                 Per-lane behaviour
SFPSTOCHRND (‡)  VD = Fp32ToBf16(VC) << 16
SFPSTOCHRND (‡)  VD = Fp32ToTf32(VC)
SFPSTOCHRND (‡)  VD = Fp32ToInt32(Min(Abs(VC), 255)) or
                 VD = Fp32ToInt32(Min(Abs(VC), 65535))
SFPSTOCHRND (‡)  VD = Fp32ToSignMag32(Clamp(VC, ±127)) or
                 VD = Fp32ToSignMag32(Clamp(VC, ±32767))
SFPCAST          VD = SignMag32ToFp32(VC)

All of the above support two rounding modes, either stochastic rounding or round to nearest (SFPSTOCHRND resolves ties away from zero, SFPCAST resolves ties to even, which seems like a strange discrepancy). The stochastic rounding relies on the hardware PRNG, though as mentioned in the introduction, the quality of its randomness is poor: adjacent vector lanes will have 30 out of 32 bits in common, and consecutive random values within a lane will have 31 out of 32 bits in common. This leads to significant correlation between random values if more than one random value is obtained.

The PRNG state can also be observed directly with an oddball variant of SFPMOV:

            Per-lane behaviour
SFPMOV (‡)  VD = RandomInt32()
SFPNOP      No-op, delay subsequent instructions by one cycle

SFPNOP is listed here as it is required for PRNG seeding: the seeding procedure involves writing the new seed to the PRNG_SEED::Seed_Val configuration register and then executing a bunch of SFPNOP instructions.

Rounding and clamping of sign / magnitude integers

                 Per-lane behaviour (signmag32)
SFPSTOCHRND (‡)  VD = Min(Round(Abs(VC) >> (VB % 32)), 255) or
                 VD = Min(Round(Abs(VC) >> Imm5), 255)
SFPSTOCHRND (‡)  VD = Clamp(Round(VC >> (VB % 32)), ±127) or
                 VD = Clamp(Round(VC >> Imm5), ±127)

All of the above support two rounding modes, based on the shifted-out bits: either stochastic rounding or round to nearest with ties away from zero. The toolchain uses the names int32_to_uint8 and int32_to_int8 for these operations. The PRNG used for stochastic rounding is the same as in the previous section.

Note that the lane type here is signmag32: the high bit is a sign bit, and then the low 31 bits are a magnitude. When the magnitude is clamped, it stays in the low bits. Negative zero is allowed as an input, but is always normalised to +0 on output.
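
As a single-lane sketch of just the Round step above (the clamping and sign handling from the table are omitted, and the function name is mine), the two rounding modes work off the shifted-out bits like so:

import random

def round_shift(mag, shift, stochastic=False):
  kept = mag >> shift
  if shift == 0:
    return kept
  dropped = mag & ((1 << shift) - 1)
  if stochastic:
    # Round up with probability dropped / 2^shift.
    return kept + (random.getrandbits(shift) < dropped)
  # Round to nearest, ties away from zero.
  return kept + (dropped >= (1 << (shift - 1)))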

Constants

               Per-lane behaviour
SFPLOADI       VD = Bf16ToFp32(Imm16) or
               VD = Fp16ToFp32(Imm16) or
               VD = Imm16 or
               VD = ±Imm15 or
               VD.High16 = Imm16 or
               VD.Low16 = Imm16
SFPCONFIG (‡)  SelectedProgrammableConstant = L0[0:8]

There are various modes of SFPLOADI for setting all lanes of a vector register to a 16-bit immediate. A 32-bit immediate can be formed by using two SFPLOADI instructions: Bf16ToFp32 or High16 to set the high 16 bits, and then Low16 to set just the low 16 bits. A selection of interesting 32-bit values can also be formed in a single cycle by using SFPSETSGN / SFPDIVP2 / SFPSETMAN with VC set to one of the fixed constants.

To load a value into one of the programmable constants, first use SFPLOADI to load it into all lanes of L0, then use SFPCONFIG to copy L0[0:8] into one of the programmable constants.

Cross-lane data movement

              Whole-vector behaviour
SFPMOV (‡)    VD = VC
SFPSHFT2 (‡)  L0, L1, L2, L3 = L1, L2, L3, Zeros or
              L0, L1, L2, L3 = L1, L2, L3, {L0[8:32], Zeros[0:8]} or
              L0, L1, L2, L3 = L1, L2, L3, RotateLanesRight(VC)
SFPSHFT2 (‡)  VD = RotateLanesRight(VC) or
              VD = ShiftLanesRight(VC)
SFPTRANSP     Transpose(L0, L1, L2, L3); Transpose(L4, L5, L6, L7)

The RotateLanesRight function rotates each group of eight lanes right by one lane, so VD = RotateLanesRight(VC) does VD[i] = VC[i&7 ? i-1 : i+7]. The similar VD = ShiftLanesRight(VC) is meant to do VD[i] = i&7 ? VC[i-1] : 0, but a hardware bug means that instead of every 8th lane getting zero, it gets whatever the most recent RotateLanesRight wrote to that lane. Between this and the comedy mode that can do L0 << 0 or L1 << 1 or L2 << 2 etc, I get the impression that SFPSHFT2 was poorly specified and/or poorly tested. Hopefully it is all fixed in Blackhole.

The variants of SFPSHFT2 involving RotateLanesRight / ShiftLanesRight require two cycles to execute. If it weren't for this, the variant of SFPSHFT2 which moves zeros to L3 would be redundant, as it could be implemented with the RotateLanesRight variant with constant zero as VC.

Meanwhile, SFPTRANSP causes the following transformation:

          L0        L1         L2          L3
[ 0: 8]   L0[0:8]   L0[8:16]   L0[16:24]   L0[24:32]
[ 8:16]   L1[0:8]   L1[8:16]   L1[16:24]   L1[24:32]
[16:24]   L2[0:8]   L2[8:16]   L2[16:24]   L2[24:32]
[24:32]   L3[0:8]   L3[8:16]   L3[16:24]   L3[24:32]

          L4        L5         L6          L7
[ 0: 8]   L4[0:8]   L4[8:16]   L4[16:24]   L4[24:32]
[ 8:16]   L5[0:8]   L5[8:16]   L5[16:24]   L5[24:32]
[16:24]   L6[0:8]   L6[8:16]   L6[16:24]   L6[24:32]
[24:32]   L7[0:8]   L7[8:16]   L7[16:24]   L7[24:32]

The naïve implementation of this instruction would either require 8 cycles to execute, or require a register file with 8 write ports. Neither of these things seems likely, so perhaps what we're seeing is 8x 32b as the fundamental unit of storage, L0/L1/L2/L3 being backed by 16 units of storage, and the SFPTRANSP instruction flipping how L0/L1/L2/L3 map on to that storage (ditto L4/L5/L6/L7, and their backing 16 units of storage). The modes of SFPSHFT2 which write to all four of L0 through L3 might pull a similar trick; actually writing to one register and just shuffling indices for the others.
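
Whatever the implementation, the architectural effect is just a 4x4 transpose of 8-lane chunks; here's a little Python model of the table above (function name mine; pass [L0, L1, L2, L3] or [L4, L5, L6, L7], each as a list of 32 lanes):

def sfptransp(regs):
  out = [[0] * 32 for _ in range(4)]
  for i in range(4):    # destination register within the group of four
    for j in range(4):  # 8-lane chunk within that register
      out[i][j*8:(j+1)*8] = regs[j][i*8:(i+1)*8]
  return out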

Transfer between Dst and vector registers

At long last, we reach the means of getting data in and out of the vector world:

          Whole-vector behaviour
SFPLOAD   VD = Dst[R:R+4, 0:15:2] or VD = Dst[R:R+4, 1:16:2]
SFPSTORE  Dst[R:R+4, 0:15:2] = VD or Dst[R:R+4, 1:16:2] = VD

Given that rows of Dst have 16 lanes, and vector registers have 32 lanes, you might expect SFPLOAD / SFPSTORE to reference two rows of Dst at a time. This is not the case; they instead reference half of four rows at a time. With Imm10 denoting the 10-bit immediate in SFPLOAD / SFPSTORE, the initial row R is (RWC_Dst + Imm10) & 0x3fc. If (RWC_Dst + Imm10) & 2 is zero, then the even columns of Dst[R:R+4] are referenced, whereas if (RWC_Dst + Imm10) & 2 is non-zero, then the odd columns of Dst[R:R+4] are referenced. Row R corresponds to vector lanes [0:8], R+1 to [8:16], R+2 to [16:24], and R+3 to [24:32], which neatly matches up with some of the cross-lane data movement instructions.
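
Transcribing that addressing into a small Python sketch (function name mine; Imm10 is the instruction's 10-bit immediate), each of the 32 vector lanes maps to a (row, column) of Dst like so:

def sfp_dst_lanes(rwc_dst, imm10):
  addr = rwc_dst + imm10
  r = addr & 0x3fc                                           # initial row R, a multiple of 4
  cols = range(1, 16, 2) if addr & 2 else range(0, 16, 2)    # odd or even columns
  # Lanes [0:8] use row R, [8:16] use R+1, [16:24] use R+2, [24:32] use R+3.
  return [(r + lane // 8, c) for lane, c in enumerate(list(cols) * 4)]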

SFPLOAD / SFPSTORE can also increment RWC_Dst after performing the data transfer. The mechanism for this is somewhat involved:

The SETRWC and INCRWC instructions can also be used to modify RWC_Dst. Furthermore, these instructions can also modify RWC_SrcA, RWC_SrcB, and RWC_Fidelity; Tensix Matrix instructions make use of all of these, but Tensix Vector only needs RWC_Dst. Meanwhile, Tensix Pack and Unpack use totally different sets of counters for selecting their memory locations and their Dst / SrcA / SrcB locations.

When SFPLOAD / SFPSTORE access Dst, the lane width of Dst is either 16 bits per lane or 32 bits per lane, controlled by the ALU_ACC_CTRL_SFPU::Fp32_enabled configuration register. A data type conversion is also performed as part of the access, revealing a variety of possible formats for the lanes of Dst:

Dst lane type                                  Vector lane type
fp32                                           fp32
fp16 (with slightly permuted bits)             fp32
bf16                                           fp32
int32                                          int32
signmag32                                      int32
signmag8 (permuted and packed into 16 bits)    signmag32
signmag11 (permuted and packed into 16 bits)   int32
signmag16 (with slightly permuted bits)        signmag16 (in half a lane)

If SFPLOAD / SFPSTORE do not specify a data type conversion, then the value of the ALU_FORMAT_SPEC_REG1::SrcB configuration register is used to infer the data type of Dst, and an appropriate conversion is chosen based on this. This is what the Tenstorrent documentation means when it says that on Wormhole, the destination register format is always determined by the runtime.

There's also a SFPLOADMACRO instruction, which is similar to SFPLOAD, but then executes a little pre-programmed instruction sequence. In part 5 we saw the Macro-Op Expander and the Replay Expander; SFPLOADMACRO is yet another mechanism for one instruction to expand to several, albeit limited in scope to Tensix Vector. I can only find one example of Tenstorrent code using this mechanism, which is enough to confirm its existence, but not enough for me to extrapolate further.

Conclusion

We've seen everything that Tensix Vector can do in Wormhole. Constructing useful high-level functionality from the low-level pieces is left as an exercise for the reader (or you can use what Tenstorrent have already built). That wraps up part 6; if you're reading along, then part 7 is next.
