Previously, in part 4, we identified the 128 usable T tiles on my Wormhole n300s board. These tiles are the workhorses of the board, so it is about time we took a detailed look inside one of them. Ignoring all the NoC functionality, my best guess as to the contents of each T tile is this diagram:

Starting at the top of the diagram, we have 1464 KiB of SRAM, which is directly mapped into the tile-local address space starting at address 0. It is connected to lots of other components within the tile, and other tiles can also access it via NoC requests (again, I have not shown any of the NoC functionality on the above diagram). The advertised capacity is 1.5 MiB of SRAM, which is 72 KiB more than the 1464 KiB shown; possibly the missing 72 KiB is used for Dst (64 KiB) and SrcA (~4 KiB) and SrcB (~4 KiB).

Moving down a row, we have five RISC-V RV32IM cores, which I've labelled as "B", "T0", "T1", "T2", and "NC". Each core has 32 GPRs, each 32 bits wide, along with a 32-bit program counter. The RV32IM instruction set can be roughly split into three pieces: load/store, ALU (arithmetic operations, bitwise operations, and multiply and divide), and branches - these execution resources are shown on the diagram within each core. The host system can put whatever RISC-V machine code it desires in L1, and the RISC-V cores will happily execute it. Said code will have exclusive bare-metal control of the cores; there are no interrupts, no user-mode/kernel-mode split, no hypervisor, etc. The RISC-V cores execute completely independently (of each other, and of the host), though there are mechanisms to synchronize them.

Moving down another row, things start to get interesting. Firstly, each core has 2 KiB or 4 KiB of core-local RAM mapped into the address space starting at address 0xFFB00000. The C/C++ call stack is usually located here, thereby decreasing the load on L1, albeit with the trade-off that pointers into the stack cannot be meaningfully passed between cores nor used as the source or destination pointer for NoC requests. Next up, the "NC" core has 16 KiB of instruction RAM mapped into the address space starting at address 0xFFC00000, presumably again to reduce the load on L1. Finally, this row contains three "Tensix" instruction pipes, one attached to each "T" core. This is where we leave the world of standard RISC-V instructions, and enter the world of Tenstorrent special sauce. One way of describing Tensix would be a massive AI coprocessor glued on to the three "T" cores, with emphasis on the word massive: the assorted Tensix pieces occupy much more area and perform vastly more FLOPs than the RISC-V cores that drive them. We'll look at the Tensix instruction pipes in more detail later, but the quick summary is that they ingest Tensix instructions and output (slightly modified) Tensix instructions. Said instructions are 32 bits wide, but other than the width being the same, the Tensix instruction set is completely unrelated to any RISC-V instruction set. The Tensix instruction set is also evolving with each Tenstorrent generation; Grayskull is slightly different to Wormhole, which in turn is slightly different to Blackhole, and so on.
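
For illustration, here's how bare-metal code running on these cores might name the tile-local address ranges described above (a sketch; the macro names are mine, not Tenstorrent's):

#include <stdint.h>

#define L1_BASE         ((volatile uint8_t*)0x00000000)  // 1464 KiB of SRAM
#define LOCAL_RAM_BASE  ((volatile uint8_t*)0xFFB00000)  // 2 KiB / 4 KiB core-local RAM
#define NC_IRAM_BASE    ((volatile uint8_t*)0xFFC00000)  // 16 KiB "NC" instruction RAM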

Moving down again, we hit "Tensix Sync". At least conceptually, this unit ingests Tensix instructions coming out of the three pipes, and dispatches Tensix instructions to the eight backend execution resources. A handful of instructions relating to synchronization of the three inbound pipes execute at "Tensix Sync", either manipulating the mutexes and semaphores within "Tensix Sync", or selectively pausing an inbound pipe until certain conditions are met. Instructions leaving "Tensix Sync" are tagged with which pipe they originated from, which is relevant information for most backend instructions.

The next row of the diagram contains the eight Tensix backend execution resources, from left to right: Scalar (often called ThCon), ThCfg, Unpack, Matrix (often called FPU), Pack, Vector (often called SFPU), TDMA, and Xmov. For AI workloads, the star of the show is the Matrix unit, which amongst other things can dispatch Dst[8,16] = SrcB[8,16] @ SrcA[16,16] every cycle (128 output values, each a 16-element dot product, which involves 2048 individual multipliers, each 7b x 5b, followed by the equivalent of 2048 individual additions). To the left of Matrix is the Unpack unit, which moves values from memory (in a variety of data formats, including some block-float ones) into SrcA and SrcB, and then the Pack unit on the other side does the inverse: moving values from Dst back out to memory. Also of note is the Vector unit for performing 32-wide SIMD. This unit cannot directly access memory, but it can do transfers in both directions between Dst and the eight SIMD registers, making it well suited to applying non-linear functions to the results of matrix multiplies before those results are written out to memory. The Matrix and Vector units are sometimes collectively called "Math". All of these units contain far more configuration parameters than can fit into a 32-bit instruction, so there are lots of configuration registers scattered about the place, along with the Scalar and ThCfg units to help drive all this configuration. The Tensix Scalar unit also has a set of 64 32-bit GPRs per pipe, meaning that it contains more GPRs than all of the RISC-V cores in the tile do (3 times 64 versus 5 times 32).

The final row of the diagram I've labelled as "L0 ???", as the descriptions of several Tensix instructions mention an L0, but I'm not particularly confident as to its presence or size or functionality. If it exists, possibly it is a hardware-managed cache that all Tensix loads transparently go through, and Tensix stores can either target or skip and write directly to L1 (for when the stored values are less valuable than the pre-existing contents of the cache).

We can now look at some of the pieces in more detail.

Tensix Instruction Pipe

Each of the three Tensix instruction pipes looks something like this:

Tensix instructions enter at the top via two means. The conceptually simpler means is the MMIO box in the top right of the diagram; any "T" core can write a 32-bit value to address 0xFFE40000 to push a Tensix instruction into the pipe associated with that core. Said instructions are 32 bits wide, laid out as:

In contrast, 32-bit RISC-V instructions look totally different:

The Tensix opcode is 8 bits wide, but values ≥ 0xC0 aren't used, meaning that if a Tensix instruction is rotated left by two bits, it will never overlap with a 32-bit RISC-V instruction (it lands in the encoding space normally reserved for 16-bit RVC instructions, though not used for that purpose here):

This leads us to the box in the top left of the diagram: if a "T" core tries to execute an instruction whose low two bits are not 0b11, then the instruction bits will be rotated right by two and then treated as data to be written to the aforementioned 0xFFE40000. Regardless of the means of entry, once a Tensix instruction has entered the pipe, RISC-V execution and Tensix execution proceed completely independently of each other.
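
To make the two entry mechanisms concrete, here's a sketch (the function names are mine) of how a "T" core can emit Tensix instructions:

#include <stdint.h>

// MMIO route: any 32-bit write to 0xFFE40000 pushes one Tensix
// instruction into the pipe associated with the writing "T" core.
static inline void tensix_push(uint32_t insn) {
  *(volatile uint32_t*)0xFFE40000 = insn;
}

// Inline route: rotate left by two. As the opcode is < 0xC0, the low two
// bits of the result are never 0b11, so when a "T" core fetches it, the
// core rotates it right by two and forwards it to 0xFFE40000 itself.
static inline uint32_t tensix_to_rvc(uint32_t insn) {
  return (insn << 2) | (insn >> 30);
}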

Next up, we hit the Macro-Op Expander, which is where the MOP_CFG(u16 zhi) and MOP(u1 template, u7 count1, u16 zlo) instructions execute (instructions other than MOP_CFG and MOP flow through the Macro-Op Expander unchanged). Of these, MOP_CFG just stores the 16-bit immediate to a 16-bit register within the expander, whereas MOP is the really interesting one; it causes the expander to run through one of the following templates:

Template 0:

zmask = (zhi << 16) | zlo;
flags = mop_cfg[1];
for (i = 0; i <= count1; ++i) {
  if ((zmask & 1) == 0) {
    exec(mop_cfg[3]);
    if (flags & 0x02) {
      exec(mop_cfg[4]);
      exec(mop_cfg[5]);
      exec(mop_cfg[6]);
    }
    if (flags & 0x01) {
      exec(mop_cfg[2]);
    }
  } else {
    exec(mop_cfg[7]);
    if (flags & 0x02) {
      exec(mop_cfg[7]);
      exec(mop_cfg[7]);
      exec(mop_cfg[7]);
    }
    if (flags & 0x01) {
      exec(mop_cfg[8]);
    }
  }
  zmask >>= 1;
}

Template 1:

i_count = mop_cfg[0];
j_count = mop_cfg[1];
for (i = 0; i < i_count;) {
  exec(mop_cfg[2]);
  ++i;
  for (j = 0; j < j_count;) {
    exec(mop_cfg[5]);
    ++j;
    if (j != j_count) {
      exec(mop_cfg[6]);
    } else if (i != i_count) {
      exec(mop_cfg[8]);
    } else {
      exec(mop_cfg[7]);
    }
  }
  exec(mop_cfg[3]);
  exec(mop_cfg[4]);
}

Any call to exec(x) in the above causes the expander to output the Tensix instruction x. In this way, a single MOP instruction expands to a somewhat programmable sequence of instructions. The programmability comes from the immediate operands to MOP and the values stored in the mop_cfg registers. For the latter, each "T" core can set the mop_cfg registers of its associated pipe by writing to the uint32_t[9] starting at address 0xFFB80000.
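
As a hypothetical sketch (encode_mop being a stand-in for whatever produces the MOP encoding), a "T" core could pre-program and then fire template 1 like so:

volatile uint32_t* const mop_cfg = (volatile uint32_t*)0xFFB80000; // uint32_t[9]
volatile uint32_t* const pipe    = (volatile uint32_t*)0xFFE40000;

void run_template1(uint32_t i_count, uint32_t j_count, uint32_t body_insn) {
  mop_cfg[0] = i_count;   // outer loop trip count
  mop_cfg[1] = j_count;   // inner loop trip count
  mop_cfg[5] = body_insn; // executed j_count times per outer iteration
  // mop_cfg[2,3,4,6,7,8] must also hold valid (possibly no-op) instructions.
  *pipe = encode_mop(/*template=*/1, /*count1=*/0, /*zlo=*/0); // hypothetical encoder
}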

Moving down a row in the diagram, we find a sneaky back door allowing the "B" core to inject Tensix instructions into any of the three pipes:

"B" core MMIO address    Semantics of 32-bit write
0xFFE40000               Push instruction into pipe associated with "T0"
0xFFE50000               Push instruction into pipe associated with "T1"
0xFFE60000               Push instruction into pipe associated with "T2"

This allows the "B" core to help initialize some of the state within the various Tensix units prior to the "T" cores being turned on, but it probably isn't intended for much more than this.

Moving down to the final row, we hit the Replay Expander, which is where REPLAY(u5 idx, u5 len, u2 mode) instructions execute. The three possible modes of this instruction are:

When not in Record mode, instructions other than REPLAY will flow through the Replay Expander unchanged (though the incoming stream is paused while Playback is in progress).

Tensix Sync

There are eight mutexes within this unit, each with four possible states: released, or acquired by one of the three inbound pipes.

Some instructions execute at Tensix Sync to manipulate these mutexes:

ATGETM(u3 mutex_index)

If the specified mutex is already acquired by the pipe on which ATGETM appeared, does nothing. Otherwise, pauses said pipe until the mutex is released, and then atomically acquires it for said pipe and unpauses the pipe.
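
In the pseudocode style used for the atomics later in this post, that is roughly:

retry:
atomic {
  if (mutex[mutex_index] != my_pipe) {
    if (mutex[mutex_index] != released) {
      goto retry; // Pipe is paused until the mutex is released
    }
    mutex[mutex_index] = my_pipe;
  }
}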

ATRELM(u3 mutex_index)

If the specified mutex is already acquired by the pipe on which ATRELM appeared, then it is released. Otherwise, does nothing.

There are also eight semaphores within this unit, each having a four-bit counter value and a four-bit maximum value. Some instructions execute at Tensix Sync to manipulate these semaphores:

SEMINIT(u4 max, u4 ctr, u8 which_sems_mask)

Set the counter value and the maximum value of the specified semaphores to the given values.

SEMPOST(u8 which_sems_mask)

Increment the counter value of the specified semaphores, if not already equal to 15. Note that the upper limit is always 15; the maximum as set by SEMINIT is only used by SEMWAIT.

SEMGET(u8 which_sems_mask)

Decrement the counter value of the specified semaphores, if not already equal to zero.

SEMWAIT(u9 to_pause_mask, u8 which_sems_mask, u2 condition)

For as long as (any of?) the specified semaphores have counter equal to zero (condition == 1) or have counter equal to their maximum (condition == 2), prevent the pipe on which SEMWAIT appeared from dispatching any instructions to the execution resources in to_pause_mask.
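
My reading of SEMWAIT (including the any-vs-all uncertainty) in the same pseudocode style:

recheck:
for (s in which_sems_mask) {
  if ((condition == 1 && sem[s].ctr == 0)
   || (condition == 2 && sem[s].ctr == sem[s].max)) {
    // Originating pipe dispatches nothing to the execution
    // resources named in to_pause_mask, then tries again.
    goto recheck;
  }
}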

The "T" cores can also manipulate the semaphores via MMIO:

One final instruction executes at Tensix Sync:

STALLWAIT(u9 to_pause_mask, u15 condition_mask)

Similar to SEMWAIT, but waits while assorted non-semaphore conditions are met. Said conditions can include various execution resources being busy, SrcA or SrcB being valid, and SrcA or SrcB being clear.

Any instructions not yet described will flow through Tensix Sync to one of the backend execution resources, though that flow can be paused while ATGETM or SEMWAIT or STALLWAIT are in progress.

Tensix Scalar (ThCon)

This unit contains 3x 64x 32-bit GPRs, the roles for which are typically statically assigned. Instructions manipulate the set of 64 GPRs corresponding to the pipe from which the instruction originally came. Each "T" core can also access its register set via MMIO to the uint32_t[64] starting at address 0xFFE00000.

Various ALU-style operations execute here to manipulate these GPRs:

SETDMAREG(u16 value, u1 mode, u6 gpr_idx, u1 lo_hi)

Sets the low 16 bits (lo_hi == 0) or high 16 bits (lo_hi == 1) of the specified GPR to the specified value, leaving the other bits unchanged. Does something totally different if mode == 1; consult the YAML for details.
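
One consequence: since each SETDMAREG in this mode writes only 16 bits, materialising a full 32-bit constant in a GPR takes a pair of them, along these lines:

SETDMAREG(value & 0xFFFF, 0, gpr_idx, 0)  // low 16 bits
SETDMAREG(value >> 16,    0, gpr_idx, 1)  // high 16 bits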

ADDDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a + (b_is_const ? b : gprs[b]).

SUBDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a - (b_is_const ? b : gprs[b]).

MULDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = (gpr_a & 0xFFFF) * (b_is_const ? b : (gprs[b] & 0xFFFF)).
Note that only the low 16 bits of each input are used.

BITWOPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a &|^ (b_is_const ? b : gprs[b]),
where &|^ is & (op == 0) or | (op == 1) or ^ (op == 2).

CMPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a <==> (b_is_const ? b : gprs[b]),
where <==> is < (op == 1) or == (op == 2) or > (op == 0).

SHIFTDMAREG(u1 b_is_const, u1 op, u6 gpr_out, u6 b, u6 gpr_a)

Does gpr_out = gpr_a <<>> (b_is_const ? b : gprs[b]),
where <<>> is << (op == 0) or >> (op == 1).

Then there are instructions to move values between these GPRs and L0/L1:

LOADIND(u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)

Loads from L1 to GPRs.
The L1 address is gpr_base*16 + ((gpr_ofs >> (lo_hi*16)) & 0xFFFF).
Various size modes:

  • sz == 3: Load 8 bits (high 24 bits of gpr_data unchanged).
  • sz == 2: Load 16 bits (high 16 bits of gpr_data unchanged).
  • sz == 1: Load 32 bits.
  • sz == 0: Load 128 bits (to four GPRs starting at gpr_data & 0x3c).

Also various options for incrementing after the load:

  • inc == 0: No auto-increment.
  • inc == 1: Increment the low/high 16 bits of gpr_ofs by 2.
  • inc == 2: Increment the low/high 16 bits of gpr_ofs by 4.
  • inc == 3: Increment the low/high 16 bits of gpr_ofs by 16.
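
As a worked example (values invented for illustration): if gpr_base holds 0x100 and the low 16 bits of gpr_ofs hold 0x0008, then LOADIND with sz == 1, lo_hi == 0, inc == 2 loads the 32 bits at L1 address 0x100*16 + 0x8 = 0x1008 into gpr_data, and then bumps the low 16 bits of gpr_ofs to 0x000C, ready for the next load.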

STOREIND(u1 l1, u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)

Stores from GPRs to L0/L1.
Other than the extra l1 operand, all operands as per LOADIND.

ATSWAP(u1 l1, u8 ofs_mask, u6 gpr_data, u6 gpr_base)

Does an atomic swap between GPRs and L0/L1 of up to 128 bits.
The L1 address is gpr_base*16. Four GPRs starting at gpr_data & 0x3c give 128 bits, which are partially swapped with the 128 bits at the L1 address: if bit i of ofs_mask is set, then bits i*16 through i*16+15 are swapped.

ATCAS(u1 l1, u4 set_val, u4 cmp_val, u2 ofs, u6 gpr_base)

Does an atomic compare/set against L0/L1. The logic is along the lines of:

uint32_t *word = gpr_base*16 + ofs*4;
retry:
atomic {
  if (*word != cmp_val) {
    goto retry; // Comparison failed
  }
  *word = set_val;
}

ATINCGET(u1 l1, u5 len, u2 ofs, u6 gpr_data, u6 gpr_base)

Does an atomic increment against L0/L1. The logic is along the lines of:

uint32_t *word = gpr_base*16 + ofs*4;
uint32_t incr_mask = (1u << (len + 1)) - 1;
atomic {
  uint32_t incremented = *word + gpr_data;
  gpr_data = *word;
  *word = (incremented & incr_mask) | (*word &~ incr_mask);
}

ATINCGETPTR(u1 l1, u1 no_incr, u5 incr_log2, u4 len, u2 ofs, u6 gpr_data, u6 gpr_base)

Does an atomic FIFO operation against L0/L1. The logic is along the lines of:

struct fifo_ctl_t {
  uint32_t rd;
  uint32_t wr;
  uint32_t pad[2];
} *fifo = gpr_base*16;
uint32_t *word = gpr_base*16 + ofs*4;
uint32_t fifo_capacity = 1u << (len - 1);
uint32_t fifo_mask = (1u << len) - 1;
retry:
atomic {
  if (ofs & 1) {
    uint32_t fifo_size = (fifo->wr - fifo->rd) & fifo_mask;
    if (fifo_size == fifo_capacity) {
      goto retry; // Cannot write to full FIFO
    }
  } else {
    if (fifo->rd == fifo->wr) {
      goto retry; // Cannot read from empty FIFO
    }
  }
  uint32_t incremented = *word + (!no_incr << incr_log2);
  gpr_data = *word;
  *word = (incremented & fifo_mask) | (*word &~ fifo_mask);
}
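
Reading off the ofs handling above: fifo->rd sits at L1 address gpr_base*16 and fifo->wr at gpr_base*16 + 4, so a consumer targets ofs == 0 to pop (stalling while the FIFO is empty) and a producer targets ofs == 1 to push (stalling while it is full), with gpr_data receiving the pre-increment pointer value in either case.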

Two instructions move between GPRs and the 1 MiB range of address space starting at 0xFFB00000, though they cannot access the 2 KiB / 4 KiB core-local RAMs within this range:

LOADREG(u6 gpr_data, u18 ofs)

Does gpr_data = *(0xFFB00000 | (ofs << 2)).

STOREREG(u6 gpr_data, u18 ofs)

Does *(0xFFB00000 | (ofs << 2)) = gpr_data.

Configuration Registers

There are two broad categories of configuration registers:

  1. 261 per-pipe registers, each of which is between 1 and 16 bits wide, packed into 57x 16b per pipe (so 3x 57x 16b total). A packed 16b group is set using the SETC16(u6 idx, u16 val) instruction, which executes on the ThCfg unit. I have not found any MMIO region exposing these registers. Contents include:
    • CFG_STATE_ID::StateID
    • DEST_TARGET_REG_CFG_MATH::Offset
    • ADDR_MOD_SET::Base
    • ADDR_MOD_{AB, DST, PACK, BIAS}_SEC[0-7]::*
    • SRCA_SET::{Base, SetOvrdWithAddr}
    • SRCB_SET::Base
    • CLR_DVALID::{SrcA, SrcB}_Disable
    • FIDELITY_BASE::Phase
    • UNPACK_MISC_CFG::CfgContext{Offset, CntReset, CntInc}[01]
    • NOC_OVERLAY_MSG_CLEAR::{StreamId, MsgNum}_[01]
    • CG_CTRL_{EN, KICK}::*
    • PERF_CNT_CMD::Cmd[0-3]{Start, Stop}
    • ENABLE_ACC_STATS::Enable
    • FPU_BIAS_SEL::Pointer
    • FP16A_FORCE::Enable
  2. 248+26+39+174 unit-specific registers, each of which is between 1 and 32 bits wide, packed into (72+14+8+28)x 32b. There are two copies of each of these registers, with the per-pipe CFG_STATE_ID::StateID configuration register determining which copy is in use by a given pipe. Both copies are accessible via MMIO from the "B" or "T" cores, the 1st as uint32_t[188] at 0xFFEF0000, and the 2nd as uint32_t[188] at 0xFFEF02F0. A packed 32b group can be moved to / from a Tensix Scalar GPR using the RDCFG(u6 gpr, u8 idx) / WRCFG(u6 gpr, u1 wr128, u8 idx) instructions, and 8b-aligned subgroups can be manipulated using the RMWCIB[0-3](u8 mask, u8 bits, u8 idx) instructions (see the sketch after this list). We have:
    • 248 registers, packed into 72x 32b, that nominally live in Tensix Scalar, but mostly control other units. These can be set using REG2FLOP rather than WRCFG.
      • THCON_SEC[01]_REG0::TileDescriptor
      • THCON_SEC[01]_REG[189]::* for Tensix Pack?
      • THCON_SEC[01]_REG[23457]::* for Tensix Unpack?
      • THCON_SEC[01]_REG6::* for Tensix Xmov?
    • 26 registers, packed into 14x 32b, for Tensix Unpack:
      • UNP[01]_ADDR_CTRL_XY_REG_[01]::[XY]stride
      • UNP[01]_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
      • UNP[01]_ADDR_BASE_REG_[01]::Base
      • UNP[01]_FORCED_SHARED_EXP::shared_exp
      • UNP[01]_ADD_DEST_ADDR_CNTR::add_dest_addr_cntr
      • UNP0_BLOBS_Y_START_CNTX_{01,23}::blobs_y_start
    • 39 registers, packed into 8x 32b, for Tensix Matrix and Tensix Vector:
      • ALU_FORMAT_SPEC_REG*::*
      • ALU_ROUNDING_MODE::*
      • ALU_ACC_CTRL::*
      • STACC_RELU::{ApplyRelu, ReluThreshold}
      • DISABLE_RISC_BP::*
      • ECC_SCRUBBER::*
      • STATE_RESET::EN
      • DEST_OFFSET::Enable
      • DEST_REGW_BASE::Base
      • INT_DESCALE::{Enable, Mode}
    • 174 registers, packed into 28x 32b, for Tensix Pack:
      • PCK0_ADDR_CTRL_XY_REG_[01]::[XY]stride
      • PCK0_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
      • PCK0_ADDR_BASE_REG_[01]::Base
      • PCK_DEST_RD_CTRL::*
      • PCK_EDGE_MODE::mode
      • PCK_EDGE_TILE_FACE_SET_SELECT::{select, enable}
      • PCK_EDGE_TILE_ROW_SET_SELECT::select
      • PCK_EDGE_OFFSET_SEC[0-3]::mask
      • PACK_COUNTERS_SEC[0-3]::*
      • PACK_CONCAT_MASK_SEC[0-3]::pack_concat_mask
      • TILE_ROW_SET_MAPPING_[0-3]::row_set_mapping_[0-15]
      • TILE_FACE_SET_MAPPING_[0-3]::face_set_mapping_[0-15]

I'm not going to make any attempt to explain the details of every configuration register, or really any configuration register, as that would take far too long.

General shape of low-level kernels

What we've seen so far should make Tenstorrent's low-level kernels (LLKs) slightly more scrutable. Each LLK has an init step, which configures the Macro-Op Expander and the Replay Expander and the Tensix Scalar GPRs and the relevant configuration registers, and then a runtime step which takes advantage of all that pre-programming. These LLKs are wrapped by things in Metalium's llk_api directory, which in turn are wrapped by things in Metalium's compute_kernel_api directory, which is the API that developers are meant to use.

The LLKs make use of various instructions not yet covered; you'll have to consult the mostly-accurate YAML file outlining every instruction, or the C header generated from that YAML, for further details. The general pattern of that header is that TT_OP_X(...) generates the encoding of instruction X (e.g. for later MMIO use), TT_X(...) generates the encoding of X and immediately does an MMIO write to push it into the instruction pipe, and TTI_X(...) uses the rotate-into-RVC-space trick from earlier to splat the encoding of X directly into the RISC-V instruction stream (so TTI_X can be used instead of TT_X when all the operands are compile-time constants).
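
Concretely, for a hypothetical instruction X taking a single operand, the three flavours behave like:

uint32_t enc = TT_OP_X(arg); // just compute the 32-bit encoding
TT_X(arg);                   // compute the encoding, then MMIO-write it
                             // to the pipe at runtime
TTI_X(arg);                  // compile-time encoding, rotated into RVC
                             // space and emitted inline in the RISC-V code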

An obvious next step would be dissecting a matrix multiplication kernel to describe how it orchestrates the Unpack and Matrix and Pack units, but this post is long enough already, so it'll have to wait for another time. That wraps up part 5; more parts to follow once I write them.