Mapping & Compilation¶
Parts of Compiler¶
- Frontend: Prog lang to IR (Intermediate Representation)
- Middle-end (Optimizer)
- Backend: IR to Assembly/Machine code (Code generator)
ML Compilation System¶
Mapping onto Hardware¶
- Dataflow choice
- Tiling
- Vectorization
- Bind ops to PEs
- On-chip memory management
Dataflow selection¶
The compiler can re-order loops without changing functionality
Compiler heuristics model the approximate effect of each ordering on runtime and memory access
Eg: 1D Convolution — the same loop nest (over outputs \(i\) and weights \(k\)) admits several dataflows:
- Weight stationary: each weight is held fixed in a PE and reused across all outputs
- Output stationary: each partial sum is held fixed in a PE until its output is complete
- Input stationary: each input is held fixed and reused across the weights/outputs that need it
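The stationary dataflows above can be sketched as loop re-orderings of the same 1D convolution; only which operand stays "stationary" in the inner loop changes, not the result (input stationary is analogous):

```python
# Toy 1D convolution y[i] = sum_k w[k] * x[i + k], written with two
# functionally identical loop orders. Only the reuse pattern changes.

def conv1d_weight_stationary(x, w):
    # Outer loop over weights: each w[k] is fetched once and reused
    # across all outputs (weight stationary).
    n = len(x) - len(w) + 1
    y = [0.0] * n
    for k in range(len(w)):
        for i in range(n):
            y[i] += w[k] * x[i + k]
    return y

def conv1d_output_stationary(x, w):
    # Outer loop over outputs: each partial sum y[i] stays resident
    # (e.g., in a register) until complete (output stationary).
    n = len(x) - len(w) + 1
    y = [0.0] * n
    for i in range(n):
        for k in range(len(w)):
            y[i] += w[k] * x[i + k]
    return y
```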
Tiling¶
Choose a tile size so that the data being operated on fits in the relevant level of the memory system
Break a loop into nested loops, each of which can be mapped hierarchically onto memory system (DRAM, SRAM, Registers)
Other names
- CUDA: Thread Block
- OpenCL: Work Group
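A minimal sketch of tiling a single loop, assuming an illustrative tile size of 4: the outer loop models per-tile DRAM-to-SRAM transfers, the inner loop works only on the on-chip copy.

```python
# Tiling: break one loop of N iterations into an outer loop over tiles
# and an inner loop within a tile.

def sum_untiled(data):
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total

def sum_tiled(data, tile=4):
    total = 0
    # Outer loop: one iteration per tile (e.g., one SRAM fill).
    for t0 in range(0, len(data), tile):
        chunk = data[t0:t0 + tile]  # stand-in for a DRAM -> SRAM copy
        # Inner loop: operates entirely on the on-chip copy.
        for v in chunk:
            total += v
    return total
```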
Vectorization¶
Parallelize operations within smallest tile, to leverage hardware parallelism
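A sketch of the idea, assuming a vector width of 4: the innermost tile is processed in lane-wide chunks rather than one scalar at a time (here a slice operation stands in for a SIMD instruction).

```python
VLEN = 4  # assumed hardware vector width (illustrative)

def add_scalar(a, b):
    # One element per "instruction".
    return [ai + bi for ai, bi in zip(a, b)]

def add_vectorized(a, b):
    out = [0] * len(a)
    for i in range(0, len(a), VLEN):
        # One "instruction" handles VLEN elements at once.
        out[i:i + VLEN] = [x + y for x, y in zip(a[i:i + VLEN], b[i:i + VLEN])]
    return out
```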
Binding¶
Specify PE index \(i\) that will execute loop iteration \(j\)
Needed when the number of PEs \(\ne\) the number of loop iterations
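Two common binding schemes can be sketched as maps from iteration index to PE index (the names "cyclic" and "blocked" are standard terms, not from the notes):

```python
def bind_cyclic(num_iterations, num_pes):
    # Iteration j runs on PE j % num_pes (round-robin).
    return {j: j % num_pes for j in range(num_iterations)}

def bind_blocked(num_iterations, num_pes):
    # Contiguous chunks of iterations per PE.
    block = -(-num_iterations // num_pes)  # ceil division
    return {j: j // block for j in range(num_iterations)}
```

Both bindings compute the same work; they differ in locality and load balance, which is why binding is part of the mapping space.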
On-chip memory management¶
Graph Compiler¶
On-Chip buffer¶
“Spatio-temporal tetris”
Mem management passes:
- Scheduling: order of subgraph execution
- Allocation: where to put data in buffer
- Slicing/fusing: how to break/merge operations
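The allocation pass can be sketched as "spatio-temporal tetris": given a schedule and each tensor's size and live range, a first-fit allocator (an assumed strategy, not necessarily what any particular compiler uses) picks buffer offsets so tensors that are live at the same time never overlap in space.

```python
# Tensor format assumed here: (name, size, start_step, end_step).

def first_fit_allocate(tensors):
    placed = []   # (offset, size, start, end) of already-placed tensors
    offsets = {}
    for name, size, start, end in tensors:
        offset = 0
        # Walk placed tensors in offset order, bumping past conflicts.
        for o, s, ps, pe in sorted(placed):
            overlap_time = not (end < ps or pe < start)
            overlap_space = offset < o + s and o < offset + size
            if overlap_time and overlap_space:
                offset = o + s  # slide up past the conflicting tensor
        placed.append((offset, size, start, end))
        offsets[name] = offset
    return offsets
```

Tensors whose live ranges do not overlap in time can share the same buffer region, which is what makes slicing/fusing decisions interact with allocation.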
Mapping Space¶
Usually very large
- Many mappings are functionally identical; eg: binding operations differently
- Many mappings are invalid; eg: tile size doesn’t fit in memory
Navigate space using heuristics
Informed by device performance/energy models
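A minimal search sketch under assumed numbers (loop size, SRAM capacity, and a toy cost model counting DRAM transfers, none of which come from a real device): enumerate candidate tile sizes, prune invalid ones that do not fit on-chip, and pick the cheapest valid mapping.

```python
N = 1024           # loop trip count (illustrative)
SRAM_WORDS = 256   # assumed on-chip capacity

def dram_transfers(tile):
    # Toy cost model: one DRAM burst per tile, so larger tiles are cheaper.
    return -(-N // tile)  # ceil(N / tile)

def search_mappings():
    candidates = [2 ** k for k in range(1, 11)]          # tile sizes 2..1024
    valid = [t for t in candidates if t <= SRAM_WORDS]   # prune: must fit
    return min(valid, key=dram_transfers)                # heuristic pick
```

Real mappers use the same shape of loop: generate, filter invalid, rank with a performance/energy model.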
DLA ISA¶
- Domain-specific, simple ISAs
- VLIW (Very Long Instruction Word) encodings are common
Over time, DLA hardware increasingly handles loop nests and dataflow mapping itself, so the compiler need not manage them explicitly
Operation fusion: coarse-grained optimization that merges adjacent operations so intermediates need not be materialized in memory
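A sketch of fusion on a conv + ReLU pair: unfused, the intermediate result round-trips through a buffer between the two ops; fused, each element passes through both ops while still "in registers", producing the same values without the intermediate.

```python
def relu(v):
    return v if v > 0 else 0.0

def conv_then_relu(x, w):
    # Unfused: the full intermediate tmp is materialized, then re-read.
    n = len(x) - len(w) + 1
    tmp = [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n)]
    return [relu(v) for v in tmp]

def conv_relu_fused(x, w):
    # Fused: relu is applied immediately; no intermediate buffer exists.
    n = len(x) - len(w) + 1
    return [relu(sum(w[k] * x[i + k] for k in range(len(w))))
            for i in range(n)]
```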