Skip to content

Efficiency

All elements on a single layer of a network are parallelizable

CPU Chip Area

image-20240416121111928

Hardware Types

Purpose
General CPU (Central Processing Unit) Low Latency
Control Flow
GPU (Graphics Processing Unit) High throughput
Data flow
TPU (Tensor Processing Unit)
NPU (Neural Processing Unit)
Specialized FPGA (Field Programmable Gate Assembly) Re-Programmable Logic
ASIC (Application Specific Integrated Circuit) Fixed logic

Performance Metrics

Metric Common Units Affected by Hardware Affected by DNN
Compute FLOPs/s **F**loating-point **op**erations per **s**econd
OPs/s Non floating-point **op**erations per **s**econd
MACs/s Multipy-Accumulate Ops/s Half FLOPs/s
Latency No of sec per operation s
Throughput No of operations per second Ops/s
Memory Capacity GB
Bandwidth GB/s
Workload Operational intensity Op/B
HW Utilization

OPs

\[ \begin{aligned} &\text{OPs} \\ &= \text{Ops/sec} \\ &= \underbrace{ \dfrac{1}{\text{Cycles/Op}} \times \text{Cycles/sec} }_\text{for single PE} \times \text{No of PEs} \end{aligned} \]

PE = Processing Element

Roofline Plot

Characterize performance of given hardware device across different workloads, to help identify if a workload is memory-bound or compute-bound

image-20240416134952487

Speed up Technique
Memory-bound Algorithmic improvement
(reduce precision)
image-20240416135852498
Faster memory chip image-20240416135930083
Compute-Bound Faster PE
(Overclocking)
image-20240416141235433

Operational Intensity

\[ \begin{aligned} \text{Operational Intensity} &= \dfrac{\text{No of Ops}}{\text{Mem Footprint}} \\ \text{No of Ops} &= \text{Multiplications} + \text{Additions} \\ \text{Mem Footprint} &= \text{Size of parameters} + \text{Size of activations} \end{aligned} \]

Quantifies the ratio of computations to memory footprint of a DNN

The same DNN can have different operational intensity on different hardware, if each device supports different numerical precision (Size of data affects operational intensity)

image-20240416141806577

IDK

image-20240416142904897

Performance Bottlenecks

  • Memory access efficiency
  • Uncoalesced reads
  • Compute utilization
  • Overhead of control logic
  • Complex DNN topologies
  • Control flow and data hazards may stall execution even if hardware is available

Hardware Efficiency

Energy breakdown

img

Hardware Efficiency Approaches

Approach Technique
Arithmetic Specialized instructions Amortize overhead
Reduce overhead fraction

Perform complex/fused operations with the same data fetch
SIMD

Matrix Multiple Unit

HFMA
HDP4A
HMMA
image-20240416143737789
Quantization Lower numerical precision image-20240418115326365
Memory Locality Move data to inexpensive on-chip memory image-20240418115729873
image-20240418120330868
Re-use Avoid expensive memory fetches

Temporal: Read once, use same data multiple times by same PE
SIMD, SIMT

Spatial: Read once, use data multiple times by multiple PEs
Dataflow processing

Weights stationary (CNNs)
Input stationary (Fully-Connected Layers)
Output stationary
image-20240418120607987
image-20240418140557228
image-20240418140811910
Operations Sparsity Skip ineffectual operations

Activation Sparsity (Sparse Activation Functions: ReLU)
Weight Sparsity (Regularization/Pruning)
Block Sparsity

Coarse-grained
Fine-grained - Overhead
Interleaving
Model storage CSC Representation
(Compressed Sparse Column)
Model Optim:
Change DNN arch (and hence workload) to better fit HW
Compression
Distillation
AutoML

Floating-point add is more expensive relative to integer, compared to multiplication , due to shifting operations

Guidelines for DSAs

Domain-Specific Architectures

  • Dedicated memory to minimize distance of data transfer
  • Invest resources saved from dropping advanced micro-architectural optimizations into more arithmetic units/larger memories
  • Use easiest form of parallelism that matches the domain
  • Reduce data size and type to simplest needed for the domain
  • Use domain-specific programming language to port code to DSA
Last Updated: 2024-05-12 ; Contributors: AhmedThahir, web-flow

Comments