Tenstorrent Wormhole Series Part 5: Taking apart T tiles
Previously, in part 4, we identified the 128 usable T tiles on my Wormhole n300s board. These tiles are the workhorse of the board, so it is about time we took a detailed look inside one of them. Ignoring all the NoC functionality, my best guess as to the contents of each T tile is this diagram:
Starting at the top of the diagram, we have 1464 KiB of SRAM, which is directly mapped into the tile-local address space starting at address 0. It is connected to lots of other components within the tile, and other tiles can also access it via NoC requests (again, I have not shown any of the NoC functionality on the above diagram). The advertised capacity is 1.5 MB of SRAM; if you were hoping for 1.5 MiB, then you'd need 72 KiB more than the 1464 KiB shown, but you can find that distributed across the tile (32 KiB in Dst, 30 KiB in the core-local RAMs, 4 KiB in SrcA, 4 KiB in SrcB, 1 KiB in Lreg, and so on).
Moving down a row, we have five RISC-V RV32IM cores, which I've labelled as "B", "T0", "T1", "T2", and "NC". Each core has 32 GPRs, each 32 bits wide, along with a 32-bit program counter. The RV32IM instruction set can be roughly split into three pieces: load/store, ALU (arithmetic operations, bitwise operations, and multiply and divide), and branches - these execution resources are shown on the diagram within each core. The host system can put whatever RISC-V machine code it desires in L1, and the RISC-V cores will happily execute it. Said code will have exclusive bare-metal control of the cores; there are no interrupts, no user-mode/kernel-mode split, no hypervisor, etc. The RISC-V cores execute completely independently (of each other, and of the host), though there are mechanisms to synchronize them.
Moving down another row, things start to get interesting. Firstly, each core has 2 KiB or 4 KiB of core-local RAM mapped into the address space starting at address 0xFFB00000. The C/C++ call stack is usually located here, thereby decreasing the load on L1, albeit with the trade-off that pointers into the stack cannot be meaningfully passed between cores nor used as the source or destination pointer for NoC requests. Next up, the "NC" core has 16 KiB of instruction RAM mapped into the address space starting at address 0xFFC00000, presumably again to reduce the load on L1. Finally, this row contains three "Tensix" instruction pipes, one attached to each "T" core. This is where we leave the world of standard RISC-V instructions, and enter the world of Tenstorrent special sauce. One way of describing Tensix would be a massive AI coprocessor glued on to the three "T" cores, with emphasis on the word massive: the assorted Tensix pieces occupy much more area and perform vastly more FLOPs than the RISC-V cores that drive them. We'll look at the Tensix instruction pipes in more detail later, but the quick summary is that they ingest Tensix instructions and output (slightly modified) Tensix instructions. Said instructions are 32 bits wide, but other than the width being the same, the Tensix instruction set is completely unrelated to any RISC-V instruction set. The Tensix instruction set is also evolving with each Tenstorrent generation; Grayskull is slightly different to Wormhole, which in turn is slightly different to Blackhole, and so on.
Moving down again, we hit "Tensix Sync". At least conceptually, this unit ingests Tensix instructions coming out of the three pipes, and dispatches Tensix instructions to the eight backend execution resources. A handful of instructions relating to synchronization of the three inbound pipes execute at "Tensix Sync", either manipulating the mutexes and semaphores within "Tensix Sync", or selectively pausing an inbound pipe until certain conditions are met. Instructions leaving "Tensix Sync" are tagged with which pipe they originated from, which is relevant information for most backend instructions.
The next row of the diagram contains the eight Tensix backend execution resources, from left to right: Scalar (often called ThCon), ThCfg, Unpack, Matrix (often called FPU), Pack, Vector (often called SFPU), TDMA, and Xmov. For AI workloads, the star of the show is the Matrix unit, which amongst other things can dispatch Dst[8,16] = SrcB[8,16] @ SrcA[16,16] every cycle (which involves 2048 individual multipliers, each 7b x 5b, followed by the equivalent of 2048 individual additions). To the left of Matrix is the Unpack unit, which moves values from memory (in a variety of data formats, including some block-float ones) into SrcA and SrcB, and then the Pack unit on the other side does the inverse: moving values from Dst back out to memory. Also of note is the Vector unit for performing 32-wide SIMD. This unit cannot directly access memory, but it can do transfers in both directions between Dst and the eight SIMD registers. This is suited to performing non-linear functions on the results of matrix multiplies prior to writing said results out to memory. The Matrix and Vector units are sometimes collectively called "Math". All of these units contain far more configuration parameters than can fit into a 32-bit instruction, so there are lots of configuration registers scattered about the place, along with the Scalar and ThCfg units to help drive all this configuration. The Tensix Scalar unit also has a set of 64 32-bit GPRs per pipe, meaning that it contains more GPRs than all of the RISC-V cores in the tile do (3 times 64 versus 5 times 32).
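The 2048 figure follows directly from the operation's shape: each of the 8x16 output elements needs a 16-element dot product. As a trivial check (names are mine, for illustration only):

```c
#include <assert.h>

/* Shape of the per-cycle Matrix unit operation described above:
   Dst[8,16] = SrcB[8,16] @ SrcA[16,16]. Each output element needs a
   16-element dot product, so the multiplier count is 8 * 16 * 16. */
enum { DST_ROWS = 8, DST_COLS = 16, INNER = 16 };

static int matrix_multipliers_per_cycle(void) {
  return DST_ROWS * DST_COLS * INNER; /* = 2048 */
}
```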
The final row of the diagram I've labelled as "L0 ???", as the descriptions of several Tensix instructions mention an L0, but I'm not particularly confident as to its presence or size or functionality. If it exists, possibly it is a hardware-managed cache that all Tensix loads transparently go through, and Tensix stores can either target or skip and write directly to L1 (for when the stored values are less valuable than the pre-existing contents of the cache).
We can now look at some of the pieces in more detail.
Tensix Instruction Pipe
Each of the three Tensix instruction pipes looks something like this:
Tensix instructions enter at the top via two means. The conceptually simpler means is the MMIO box in the top right of the diagram; any "T" core can write a 32-bit value to address 0xFFE40000 to push a Tensix instruction into the pipe associated with that core. Said instructions are 32 bits wide, laid out as:
In contrast, 32-bit RISC-V instructions look totally different:
The Tensix opcode is 8 bits wide, but values ≥ 0xC0 aren't used, meaning that if a Tensix instruction is rotated left by two bits, it will never overlap with a 32-bit RISC-V instruction (it lands in the encoding space normally reserved for 16-bit RVC instructions, though not used for that purpose here):
This leads us to the box in the top left of the diagram: if a "T" core tries to execute an instruction whose low two bits are not 0b11, then the instruction bits will be rotated right by two and then treated as data to be written to the aforementioned 0xFFE40000. Regardless of the means of entry, once a Tensix instruction has entered the pipe, RISC-V execution and Tensix execution proceed completely independently of each other.
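To make the rotation trick concrete, here's a small C model of it (function names are mine). Because 32-bit RISC-V instructions always have their low two bits equal to 0b11, and Tensix opcodes stay below 0xC0, a rotated Tensix instruction can never be mistaken for one:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the encoding trick described above. A Tensix opcode is < 0xC0,
   i.e. the top two bits of the 32-bit instruction are never both set, so
   after a rotate-left-by-two the low two bits are never 0b11 -- exactly the
   encoding space that 32-bit RISC-V instructions occupy. */
static uint32_t rotl2(uint32_t x) { return (x << 2) | (x >> 30); }
static uint32_t rotr2(uint32_t x) { return (x >> 2) | (x << 30); }

/* 1 if `insn` would decode as a 32-bit RISC-V instruction. */
static int is_rv32_encoding(uint32_t insn) { return (insn & 3) == 3; }
```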
Next up, we hit the Macro-Op Expander, which is where the MOP_CFG(u16 zhi) and MOP(u1 template, u7 count1, u16 zlo) instructions execute (instructions other than MOP_CFG and MOP flow through the Macro-Op Expander unchanged). Of these, MOP_CFG just stores the 16-bit immediate to a 16-bit register within the expander, whereas MOP is the really interesting one; it causes the expander to run through one of the following templates:
| Template 0 | Template 1 |
|---|---|
| | |
Any call to exec(x) in the above causes the expander to output the Tensix instruction x. In this way, a single MOP instruction expands to a somewhat programmable sequence of instructions. The programmability comes from the immediate operands to MOP and the values stored in the mop_cfg registers. For the latter, each "T" core can set the mop_cfg registers of its associated pipe by writing to the uint32_t[9] starting at address 0xFFB80000.
Moving down a row in the diagram, we find a sneaky back door allowing the "B" core to inject Tensix instructions into any of the three pipes:
"B" core MMIO address | Semantics of 32-bit write |
---|---|
0xFFE40000 | Push instruction into pipe associated with "T0" |
0xFFE50000 | Push instruction into pipe associated with "T1" |
0xFFE60000 | Push instruction into pipe associated with "T2" |
This allows the "B" core to help initialize some of the state within the various Tensix units prior the "T" cores being turned on, but it probably isn't intended for much more than this.
Moving down to the final row, we hit the Replay Expander, which is where REPLAY(u5 idx, u5 len, u2 mode) instructions execute. The three possible modes of this instruction are:
- Record: The next len instructions which enter the Replay Expander are swallowed up by the Replay Expander, and written to buffer[idx:idx+len].
- Tee: The next len instructions which flow through the Replay Expander are written to buffer[idx:idx+len] in addition to flowing through.
- Playback: The Replay Expander outputs buffer[idx:idx+len], one instruction at a time.
When not in Record mode, instructions other than REPLAY will flow through the Replay Expander unchanged (though the incoming stream is paused while Playback is in progress).
Tensix Sync
There are eight mutexes within this unit, each with four possible states:
- Acquired by "T0" pipe
- Acquired by "T1" pipe
- Acquired by "T2" pipe
- Released
Some instructions execute at Tensix Sync to manipulate these mutexes:
ATGETM
If the specified mutex is already acquired by the pipe on which ATGETM appeared, does nothing. Otherwise, pauses said pipe until the mutex is released, and then atomically acquires it for said pipe and unpauses the pipe.
ATRELM
If the specified mutex is already acquired by the pipe on which ATRELM appeared, then it is released. Otherwise, does nothing.
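The mutex semantics can be modeled in a few lines of C (a sketch; the pause is modeled as a failure return, and the state encoding and function names are my own):

```c
#include <assert.h>

/* Toy model of one Tensix Sync mutex: state is either Released or Acquired
   by pipe 0/1/2. ATGETM is modeled as a non-blocking attempt: it returns 1
   once the pipe holds the mutex, 0 where the hardware would pause the pipe. */
enum { RELEASED = 3 }; /* 0..2 = acquired by that pipe (encoding assumed) */

static int atgetm(int *state, int pipe) {
  if (*state == pipe) return 1;        /* already held: does nothing */
  if (*state != RELEASED) return 0;    /* held elsewhere: pipe pauses */
  *state = pipe;                       /* atomically acquire */
  return 1;
}

static void atrelm(int *state, int pipe) {
  if (*state == pipe) *state = RELEASED; /* otherwise does nothing */
}

/* Self-check of the acquire/release rules; returns 1 on success. */
static int mutex_demo(void) {
  int s = RELEASED;
  if (!atgetm(&s, 0)) return 0;   /* T0 acquires */
  if (atgetm(&s, 1)) return 0;    /* T1 must wait */
  atrelm(&s, 1);                  /* wrong pipe: no effect */
  if (s != 0) return 0;
  atrelm(&s, 0);
  return atgetm(&s, 1);           /* now T1 can acquire */
}
```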
There are also eight semaphores within this unit, each having a four-bit counter value and a four-bit maximum value. Some instructions execute at Tensix Sync to manipulate these semaphores:
SEMINIT(u4 max, u4 ctr, u8 which_sems_mask)
Set the counter value and the maximum value of the specified semaphores to the given values.
SEMPOST(u8 which_sems_mask)
Increment the counter value of the specified semaphores, if not already equal to 15. Note that the upper limit is always 15; the maximum as set by SEMINIT is only used by SEMWAIT.
SEMGET(u8 which_sems_mask)
Decrement the counter value of the specified semaphores, if not already equal to zero.
SEMWAIT(u9 to_pause_mask, u8 which_sems_mask, u2 condition)
For as long as (any of?) the specified semaphores have counter equal to zero (condition == 1) or have counter equal to their maximum (condition == 2), prevent the pipe on which SEMWAIT appeared from dispatching any instructions to the execution resources in to_pause_mask.
The "T" cores can also manipulate the semaphores via MMIO:
- Reading from
0xFFE80020 + 4*i
gives the counter value of semaphorei
. - Writing 0 to
0xFFE80020 + 4*i
does whatSEMPOST(1u << i)
would do. - Writing 1 to
0xFFE80020 + 4*i
does whatSEMGET(1u << i)
would do.
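Here's a toy C model of one semaphore under the semantics above (names are mine), including the detail that SEMPOST clamps at 15 rather than at the SEMINIT maximum:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one Tensix Sync semaphore: a 4-bit counter and a 4-bit
   maximum, with the SEMINIT/SEMPOST/SEMGET/SEMWAIT behavior described
   above. */
typedef struct { uint8_t ctr, max; } sem_model_t;

static void seminit(sem_model_t *s, uint8_t max, uint8_t ctr) {
  s->max = max; s->ctr = ctr;
}
static void sempost(sem_model_t *s) { if (s->ctr != 15) s->ctr++; } /* clamps at 15, not max */
static void semget(sem_model_t *s)  { if (s->ctr != 0)  s->ctr--; }

/* 1 if a SEMWAIT with the given condition would pause the pipe. */
static int semwait_blocked(const sem_model_t *s, int condition) {
  if (condition == 1) return s->ctr == 0;
  if (condition == 2) return s->ctr == s->max;
  return 0;
}

/* Self-check; returns 1 on success. */
static int sem_demo(void) {
  sem_model_t s;
  seminit(&s, 2, 0);
  if (!semwait_blocked(&s, 1)) return 0; /* empty: condition 1 blocks */
  sempost(&s); sempost(&s);
  if (!semwait_blocked(&s, 2)) return 0; /* at max: condition 2 blocks */
  sempost(&s); /* counter keeps going: max only matters to SEMWAIT */
  semget(&s);
  return s.ctr == 2;
}
```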
One final instruction executes at Tensix Sync:
STALLWAIT(u9 to_pause_mask, u15 condition_mask)
Similar to SEMWAIT, but waits while assorted non-semaphore conditions are met. Said conditions can include various execution resources being busy, SrcA or SrcB being valid, and SrcA or SrcB being clear.
Any instructions not yet described will flow through Tensix Sync to one of the backend execution resources, though that flow can be paused while ATGETM or SEMWAIT or STALLWAIT are in progress.
Tensix Scalar (ThCon)
This unit contains 3x 64x 32-bit GPRs, the roles for which are typically statically assigned. Instructions manipulate the set of 64 GPRs corresponding to the pipe from which the instruction originally came. Each "T" core can also access its register set via MMIO to the uint32_t[64] starting at address 0xFFE00000.
Various ALU-style operations execute here to manipulate these GPRs:
SETDMAREG(u16 value, u1 mode, u6 gpr_idx, u1 lo_hi)
Sets the low 16 bits (lo_hi == 0) or high 16 bits (lo_hi == 1) of the specified GPR to the specified value, leaving the other bits unchanged. Does something totally different if mode == 1; consult the YAML for details.
ADDDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = gpr_a + (b_is_const ? b : gprs[b]).
SUBDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = gpr_a - (b_is_const ? b : gprs[b]).
MULDMAREG(u1 b_is_const, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = (gpr_a & 0xFFFF) * (b_is_const ? b : (gprs[b] & 0xFFFF)). Note only the low 16 bits of each input are used.
BITWOPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = gpr_a &|^ (b_is_const ? b : gprs[b]), where &|^ is & (op == 0) or | (op == 1) or ^ (op == 2).
CMPDMAREG(u1 b_is_const, u2 op, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = gpr_a <==> (b_is_const ? b : gprs[b]), where <==> is < (op == 1) or == (op == 2) or > (op == 0).
SHIFTDMAREG(u1 b_is_const, u1 op, u6 gpr_out, u6 b, u6 gpr_a)
Does gpr_out = gpr_a <<>> (b_is_const ? b : gprs[b]), where <<>> is << (op == 0) or >> (op == 1).
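As a sanity check on the semantics above, here's a small C model of a few of these operations against a 64-entry GPR file (function names are mine; SETDMAREG's mode == 1 behaviour is not modeled):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a few Tensix Scalar ALU operations, modeled on a 64-entry GPR
   file. `b_is_const` selects between an immediate and a GPR, as above. */
static uint32_t gprs[64];

static uint32_t operand_b(int b_is_const, uint32_t b) {
  return b_is_const ? b : gprs[b];
}

/* SETDMAREG with mode == 0: write one 16-bit half, keep the other half. */
static void setdmareg16(int gpr_idx, int lo_hi, uint16_t value) {
  uint32_t keep = lo_hi ? 0x0000FFFFu : 0xFFFF0000u;
  gprs[gpr_idx] = (gprs[gpr_idx] & keep) | ((uint32_t)value << (lo_hi * 16));
}

static void adddmareg(int out, int b_is_const, uint32_t b, int a) {
  gprs[out] = gprs[a] + operand_b(b_is_const, b);
}

static void muldmareg(int out, int b_is_const, uint32_t b, int a) {
  /* Note: only the low 16 bits of each input take part. */
  gprs[out] = (gprs[a] & 0xFFFF) * (operand_b(b_is_const, b) & 0xFFFF);
}

/* Self-check; returns 1 on success. */
static int alu_demo(void) {
  setdmareg16(0, 0, 0x5678);      /* gpr0 = 0x....5678 */
  setdmareg16(0, 1, 0x1234);      /* gpr0 = 0x12345678 */
  adddmareg(1, 1, 8, 0);          /* gpr1 = gpr0 + 8 */
  muldmareg(2, 1, 2, 0);          /* gpr2 = 0x5678 * 2 */
  return gprs[0] == 0x12345678u && gprs[1] == 0x12345680u && gprs[2] == 0xACF0u;
}
```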
Then instructions to move between these GPRs and L0/L1:
LOADIND(u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)
Loads from L1 to GPRs. The L1 address is gpr_base*16 + ((gpr_ofs >> (lo_hi*16)) & 0xFFFF). Various size modes:
- sz == 3: Load 8 bits (high 24 bits of gpr_data unchanged).
- sz == 2: Load 16 bits (high 16 bits of gpr_data unchanged).
- sz == 1: Load 32 bits.
- sz == 0: Load 128 bits (to four GPRs starting at gpr_data & 0x3c).
Also various options for incrementing after the load:
- inc == 0: No auto-increment.
- inc == 1: Increment the low/high 16 bits of gpr_ofs by 2.
- inc == 2: Increment the low/high 16 bits of gpr_ofs by 4.
- inc == 3: Increment the low/high 16 bits of gpr_ofs by 16.
STOREIND(u1 l1, u2 sz, u6 gpr_ofs, u1 lo_hi, u2 inc, u6 gpr_data, u6 gpr_base)
Stores from GPRs to L0/L1. Other than the extra l1 operand, all operands are as per LOADIND.
ATSWAP(u1 l1, u8 ofs_mask, u6 gpr_data, u6 gpr_base)
Does an atomic swap between GPRs and L0/L1 of up to 128 bits. The L1 address is gpr_base*16. Four GPRs starting at gpr_data & 0x3c give 128 bits, which are partially swapped with the 128 bits at the L1 address: if bit i of ofs_mask is set, then bits i*16 through i*16+15 are swapped.
ATCAS(u1 l1, u4 set_val, u4 cmp_val, u2 ofs, u6 gpr_base)
Does an atomic compare/set against L0/L1. The logic is along the lines of:
```
uint32_t *word = gpr_base*16 + ofs*4;
retry:
atomic {
  if (*word != cmp_val) {
    goto retry; // Comparison failed
  }
  *word = set_val;
}
```
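Since the compare value is only 4 bits, ATCAS is well suited to small flags and locks in L1. A non-blocking C model (helper names are mine, and the hardware's spin is modeled as a failure return) shows how it yields a spinlock:

```c
#include <assert.h>
#include <stdint.h>

/* One attempt of ATCAS's compare/set: returns 1 and stores set_val if
   *word matched cmp_val, else returns 0 (where the hardware would retry). */
static int atcas_once(uint32_t *word, uint32_t cmp_val, uint32_t set_val) {
  if (*word != cmp_val) return 0;
  *word = set_val;
  return 1;
}

/* Self-check: use compare/set 0 -> 1 as a lock; returns 1 on success. */
static int atcas_demo(void) {
  uint32_t lock = 0;
  if (!atcas_once(&lock, 0, 1)) return 0; /* acquire */
  if (atcas_once(&lock, 0, 1)) return 0;  /* second acquire would spin */
  lock = 0;                               /* release */
  return atcas_once(&lock, 0, 1);
}
```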
ATINCGET(u1 l1, u5 len, u2 ofs, u6 gpr_data, u6 gpr_base)
Does an atomic increment against L0/L1. The logic is along the lines of:
```
uint32_t *word = gpr_base*16 + ofs*4;
uint32_t incr_mask = (1u << (len + 1)) - 1;
atomic {
  uint32_t incremented = *word + gpr_data;
  gpr_data = *word;
  *word = (incremented & incr_mask) | (*word &~ incr_mask);
}
```
ATINCGETPTR(u1 l1, u1 no_incr, u5 incr_log2, u4 len, u2 ofs, u6 gpr_data, u6 gpr_base)
Does an atomic FIFO operation against L0/L1. The logic is along the lines of:
```
struct fifo_ctl_t {
  uint32_t rd;
  uint32_t wr;
  uint32_t pad[2];
} *fifo = gpr_base*16;
uint32_t *word = gpr_base*16 + ofs*4;
uint32_t fifo_capacity = 1u << (len - 1);
uint32_t fifo_mask = (1u << len) - 1;
retry:
atomic {
  if (ofs & 1) {
    uint32_t fifo_size = (fifo->wr - fifo->rd) & fifo_mask;
    if (fifo_size == fifo_capacity) {
      goto retry; // Cannot write to full FIFO
    }
  } else {
    if (fifo->rd == fifo->wr) {
      goto retry; // Cannot read from empty FIFO
    }
  }
  uint32_t incremented = *word + (!no_incr << incr_log2);
  gpr_data = *word;
  *word = (incremented & fifo_mask) | (*word &~ fifo_mask);
}
```
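The FIFO pointer arithmetic rewards a closer look: with len-bit pointers but a capacity of only 2^(len-1) entries, the extra pointer bit is what lets rd == wr mean "empty" while a pointer distance of 2^(len-1) means "full", without wasting a slot. A small C model of just that arithmetic (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Occupancy of a FIFO with len-bit read/write pointers, as in the
   ATINCGETPTR pseudocode: size is the pointer difference modulo 2^len,
   while capacity is 2^(len-1). */
static uint32_t fifo_size(uint32_t rd, uint32_t wr, int len) {
  return (wr - rd) & ((1u << len) - 1);
}

/* Self-check of the empty/full distinction; returns 1 on success. */
static int fifo_demo(void) {
  int len = 4;                       /* 16-value pointer space */
  uint32_t cap = 1u << (len - 1);    /* 8-entry FIFO */
  uint32_t rd = 0, wr = 0;
  if (fifo_size(rd, wr, len) != 0) return 0;     /* rd == wr: empty */
  for (uint32_t i = 0; i < cap; i++) wr = (wr + 1) & ((1u << len) - 1);
  if (fifo_size(rd, wr, len) != cap) return 0;   /* distance == cap: full */
  rd = (rd + 1) & ((1u << len) - 1);
  return fifo_size(rd, wr, len) == cap - 1;      /* one slot freed */
}
```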
Two instructions move between GPRs and the 1 MiB range of address space starting at 0xFFB00000, though they cannot access the 2 KiB / 4 KiB core-local RAMs within this range:
LOADREG(u6 gpr_data, u18 ofs)
Does gpr_data = *(0xFFB00000 | (ofs << 2)).
STOREREG(u6 gpr_data, u18 ofs)
Does *(0xFFB00000 | (ofs << 2)) = gpr_data.
Configuration Registers
There are two broad categories of configuration registers:
- 261 per-pipe registers, each of which is between 1 and 16 bits wide, packed into 57x 16b per pipe (so 3x 57x 16b total). A packed 16b group is set using the SETC16(u6 idx, u16 val) instruction, which executes on the ThCfg unit. I have not found any MMIO region exposing these registers. Contents include:
  - CFG_STATE_ID::StateID
  - DEST_TARGET_REG_CFG_MATH::Offset
  - ADDR_MOD_SET::Base
  - ADDR_MOD_{AB, DST, PACK, BIAS}_SEC[0-7]::*
  - SRCA_SET::{Base, SetOvrdWithAddr}
  - SRCB_SET::Base
  - CLR_DVALID::{SrcA, SrcB}_Disable
  - FIDELITY_BASE::Phase
  - UNPACK_MISC_CFG::CfgContext{Offset, CntReset, CntInc}[01]
  - NOC_OVERLAY_MSG_CLEAR::{StreamId, MsgNum}_[01]
  - CG_CTRL_{EN, KICK}::*
  - PERF_CNT_CMD::Cmd[0-3]{Start, Stop}
  - ENABLE_ACC_STATS::Enable
  - FPU_BIAS_SEL::Pointer
  - FP16A_FORCE::Enable
- 248+26+39+174 unit-specific registers, each of which is between 1 and 32 bits wide, packed into (72+14+8+28)x 32b. There are two copies of each of these registers, with the per-pipe CFG_STATE_ID::StateID configuration register determining which copy is in use by a given pipe. Both copies are accessible via MMIO from the "B" or "T" cores, the 1st as uint32_t[188] at 0xFFEF0000, and the 2nd as uint32_t[188] at 0xFFEF02F0. A packed 32b group can be moved to / from a Tensix Scalar GPR using the RDCFG(u6 gpr, u8 idx) / WRCFG(u6 gpr, u1 wr128, u8 idx) instructions, and 8b-aligned subgroups can be manipulated using the RMWCIB[0-3](u8 mask, u8 bits, u8 idx) instructions. We have:
  - 248 registers, packed into 72x 32b, that nominally live in Tensix Scalar, but mostly control other units. These can be set using REG2FLOP rather than WRCFG.
    - THCON_SEC[01]_REG0::TileDescriptor
    - THCON_SEC[01]_REG[189]::* for Tensix Pack?
    - THCON_SEC[01]_REG[23457]::* for Tensix Unpack?
    - THCON_SEC[01]_REG6::* for Tensix Xmov?
  - 26 registers, packed into 14x 32b, for Tensix Unpack:
    - UNP[01]_ADDR_CTRL_XY_REG_[01]::[XY]stride
    - UNP[01]_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
    - UNP[01]_ADDR_BASE_REG_[01]::Base
    - UNP[01]_FORCED_SHARED_EXP::shared_exp
    - UNP[01]_ADD_DEST_ADDR_CNTR::add_dest_addr_cntr
    - UNP0_BLOBS_Y_START_CNTX_{01,23}::blobs_y_start
  - 39 registers, packed into 8x 32b, for Tensix Matrix and Tensix Vector:
    - ALU_FORMAT_SPEC_REG::{SrcA, SrcB, Dstacc}_{val, override}
    - ALU_FORMAT_SPEC_REG0::{SrcAUnsigned, SrcBUnsigned, SrcA}
    - ALU_FORMAT_SPEC_REG1::SrcB
    - ALU_FORMAT_SPEC_REG2::Dstacc
    - ALU_ROUNDING_MODE::{Fpu, Gasket, Packer}_srnd_en
    - ALU_ACC_CTRL::*
    - STACC_RELU::{ApplyRelu, ReluThreshold}
    - DISABLE_RISC_BP::*
    - ECC_SCRUBBER::*
    - STATE_RESET::EN
    - DEST_OFFSET::Enable
    - DEST_REGW_BASE::Base
    - INT_DESCALE::{Enable, Mode}
  - 174 registers, packed into 28x 32b, for Tensix Pack:
    - PCK0_ADDR_CTRL_XY_REG_[01]::[XY]stride
    - PCK0_ADDR_CTRL_ZW_REG_[01]::[ZW]stride
    - PCK0_ADDR_BASE_REG_[01]::Base
    - PCK_DEST_RD_CTRL::*
    - PCK_EDGE_MODE::mode
    - PCK_EDGE_TILE_FACE_SET_SELECT::{select, enable}
    - PCK_EDGE_TILE_ROW_SET_SELECT::select
    - PCK_EDGE_OFFSET_SEC[0-3]::mask
    - PACK_COUNTERS_SEC[0-3]::*
    - PACK_CONCAT_MASK_SEC[0-3]::pack_concat_mask
    - TILE_ROW_SET_MAPPING_[0-3]::row_set_mapping_[0-15]
    - TILE_FACE_SET_MAPPING_[0-3]::face_set_mapping_[0-15]
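To make the RMWCIB[0-3] instructions concrete, here's my reading of what they do to a packed 32b group: a read-modify-write of the i'th byte under a mask. Treat the exact semantics as an assumption to verify against the YAML; the function name is mine:

```c
#include <assert.h>
#include <stdint.h>

/* Plausible model of RMWCIB[i](mask, bits, idx) applied to one packed 32b
   configuration group: update byte i under `mask`, leaving the other bytes
   (and the unmasked bits of byte i) untouched. Semantics are assumed. */
static uint32_t rmwcib(uint32_t old, int i, uint8_t mask, uint8_t bits) {
  uint32_t m = (uint32_t)mask << (i * 8);
  uint32_t b = (uint32_t)(bits & mask) << (i * 8);
  return (old & ~m) | b;
}
```

This would explain why four variants exist: the instruction's 8-bit immediate fields can only describe one byte, so the byte index is baked into the opcode.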
I'm not going to make any attempt to explain the details of every configuration register, or really any configuration register, as that would take far too long.
General shape of low-level kernels
What we've seen so far should make Tenstorrent's low-level kernels slightly more scrutable. Each LLK has an init step which configures the Macro-Op Expander and the Replay Expander and the Tensix Scalar GPRs and the relevant configuration registers, and then a runtime step which takes advantage of all that pre-programming. These LLKs are wrapped by things in Metalium's llk_api directory, which in turn are wrapped by things in Metalium's compute_kernel_api directory, which is the API that developers are meant to use.
The LLKs make use of various instructions not yet covered; you'll have to consult the mostly-accurate YAML file outlining every instruction, or the C header generated from that YAML, for further details. The general pattern of that header is that TT_OP_X(...) generates the encoding of instruction X (e.g. for later MMIO use), TT_X(...) generates the encoding of X and immediately does an MMIO write to push it into the instruction pipe, and TTI_X(...) uses the T6 as RVC trick to generate the encoding of X and splat it into the RISC-V instruction stream (so TTI_X can be used instead of TT_X when all the operands are compile-time constants).
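As a sketch (not the actual generated header, whose operand packing differs per instruction), the macro families plausibly boil down to something like the following, assuming the 8-bit opcode sits in the top byte with operands packed below it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TT_OP_X-style encoder: 8-bit opcode in the top byte,
   instruction-specific operand bits below (packing varies per instruction;
   consult the generated header for the real layouts). */
static uint32_t tt_op(uint8_t opcode, uint32_t operands) {
  return ((uint32_t)opcode << 24) | (operands & 0x00FFFFFFu);
}

/* The TTI_X "T6 as RVC" form: rotate left by two so the result's low two
   bits are not 0b11, letting it sit directly in the RISC-V instruction
   stream. */
static uint32_t tti_encoding(uint32_t tensix_insn) {
  return (tensix_insn << 2) | (tensix_insn >> 30);
}
```

TT_X would then be tt_op followed by a volatile 32-bit store to 0xFFE40000, which is why it works with runtime operands while TTI_X needs compile-time constants.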
An obvious next step would be dissecting a matrix multiplication kernel to describe how it orchestrates the Unpack and Matrix and Pack units, but this post is long enough already, so it'll have to wait for another time. That wraps up part 5; if you're reading along, then part 6 is next.