# Plan Memory This document describes the **PlanMemory** transformation (`PlanMemoryPass`) in HIVM: hardware context, algorithm, interface, and constraints. ## Hardware background Ascend on-chip memory uses a buffer model. It includes storage for the Cube (matrix) and Vector units. Software must control addresses explicitly and respect alignment. For Atlas A2 training/inference products, the hardware layout[1] is: ![image](../../../../images/developer_guide/HardwareStructure.png) Buffer alignment requirements: | Buffer | Alignment | Role | |--------|-----------|------| | Unified Buffer (UB) | 32 bytes | General cache for vector/scalar ops | | L1 Buffer | 32 bytes | Temporary feature maps, convolution data | | L0A Buffer | 512 bytes | Left matrix (e.g. feature map) for matrix ops | | L0B Buffer | 512 bytes | Right matrix (e.g. weight) for matrix ops | | L0C Buffer | 512 bytes | Matrix op intermediate and output | | BT Buffer | 64 bytes | BiasTable for matrix bias | | FP Buffer | 64 bytes | Fixpipe: quantization/ReLU parameters, etc. | ## Algorithm ### Software context In the input IR, `memref.alloc` buffers have a name and size but no address. The AscendNPU IR memory module (PlanMemory) assigns addresses based on **lifetime** to avoid overwrites and precision issues. To fit all buffers in limited on-chip memory, PlanMemory performs **memory reuse** using IR semantics and hardware rules. To avoid unnecessary data dependencies that hurt performance, it uses a **three-level allocation** strategy to balance performance and utilization. Allocation targets the storage used by Cube (L1, L0A, L0B, L0C, etc.) and Vector (UB). Cube uses L0A/L0B for left/right matrices (loaded from L1) and L0C for results; allocation is mainly for L1 and L0C. Vector uses UB for inputs and outputs. PlanMemory also allocates a small amount of **Workspace** (`memref_ext.alloc_workspace`) for CV flows (e.g. moving Cube results from L0C to UB via workspace). Workspace size is reported to the framework runtime. ### Terms - **BufferLife**: The **lifetime** of a buffer — from first write (gen) to last read (kill). Non-overlapping lifetimes can share the same memory; PlanMemory computes offsets so non-overlapping buffers reuse space. - **Alias**: Two values that refer to the same underlying data (e.g. before/after `subview`). - **Inplace reuse**: An op’s **output** can overwrite an **input** buffer (e.g. `vcast` f16→i16 same width). PlanMemory assigns the same offset to such output under hardware inplace rules. - **Address offset / pointer_cast**: After allocation, instead of a separate alloc, PlanMemory emits `hivm.hir.pointer_cast(offset)` (byte offset in that memory space). ### Implementation **Source**: `bishengir/lib/Dialect/HIVM/Transforms/PlanMemory.cpp` Main steps: - **Lifetime analysis**: For each buffer, compute gen and kill. - **Allocation**: Assign addresses from lifetime and reuse rules. - **OP rewrite**: Replace original `alloc` with `hivm.hir.pointer_cast(offset)`. #### Lifetime analysis 1. Use community `Liveness` for liveness. 2. Walk the IR (including scf.for, scf.if, scf.while), collect **gen** (buffers defined) and **kill** (last use) per op to get BufferLife. 3. Compute **lifetime** per buffer (first write to last read). Non-overlapping lifetimes may share memory. 4. Use Alias to identify **inplace** candidates and assign the same base address where valid. #### Allocation Two modes: **sequential** and **reuse**. If the sum of buffer sizes fits in a Memory Scope (e.g. UB, L1), sequential allocation is used. Otherwise, reuse is applied so that reuse does not cause conflicts. Reuse includes **Inplace** and **three-level reuse**. ##### Inplace Conditions: same Memory Scope (e.g. UB); in an `A = B + C` pattern, A’s kill is C’s gen; and hardware inplace rules are satisfied. ##### Three-level reuse The compiler tries higher levels first and rolls back to a lower level if allocation fails. **[1] Level2** Different pipelines (PIPE) run in parallel. Reusing memory across pipelines can add dependencies and hurt throughput. Level2: prefer reuse within the same pipeline type (e.g. Vector with Vector), so non-DMA pipelines reuse first. Example: ```text Shared A [A0, A1] Shared B [B] Shared C [C] Shared D [D0, D1] Loop i: // sync op1(A0, A1) // DMA OP, Double Buffer op2(B) // Vector OP op3(C) // Vector OP op4(D0, D1) // DMA OP, Double Buffer ``` With double buffering, A and D use two buffers each for DMA. When shared memory is tight, reusing C with A would introduce an extra dependency between op1 (DMA PIPE) and op3 (Vector PIPE), so MTE_PIPE and V_PIPE could not run in parallel and performance would drop. With Level2, C reuses B; both are Vector ops, so V_PIPE is serial anyway and reuse does not add cross-pipeline dependency. ![image](../../../../images/developer_guide/plan_memory_level2.png) - Pros: Same-pipeline reuse does not add cross-pipeline dependency; better overall performance. - Cons: Smaller reuse space; reuse may fail more often. **[2] Level1** Double buffering improves overlap of load and compute and is important for performance. If a single buffer reuses one slot of a double buffer, the double buffer cannot run in parallel and the pipeline stalls. Level1: in the same loop, if a single buffer would reuse a double buffer, the single buffer is upgraded to a double buffer so both slots can be used without stalling. Example: ```text Shared A [A0, A1] Shared B [B] Shared C [C] Shared D [D0, D1] Loop i: // sync op1(A0, A1) // DMA OP, Double Buffer op2(B) // Vector OP op3(C) // Vector OP op4(D0, D1) // DMA OP, Double Buffer ``` With double buffering, A and D use two buffers each. When shared memory is tight, if C (single buffer) reuses one slot of A, then op1 needs A0 while op3 still holds C (same as A0), so op1 must wait for op3 and the pipeline stalls. With Level1, C is upgraded to a double buffer; op1 uses A0 while op3 uses C1 (backed by A1), so op1 does not wait for op3 and the pipeline can overlap. ![image](../../../../images/developer_guide/plan_memory_level1.png) - Pros: Avoids pipeline stall in double-buffer scenarios. - Cons: Extra buffer for the new double buffer; may reduce reuse success rate. **[3] Level0** Level0: any two buffers with non-overlapping lifetimes can share memory. ![image](../../../../images/developer_guide/plan_memory_level0.png) - Pros: Maximum reuse. - Cons: Ignores pipeline structure; bad reuse can hurt performance. #### OP rewrite After computing offsets, replace `memref_ext.alloc_workspace` (GLOBAL_WORKSPACE_PLAN) and `memref.alloc` (LOCAL_MEM_PLAN) with `hivm.hir.pointer_cast(offset)`. ### Tests **File**: `bishengir/test/Dialect/HIVM/plan-memory.mlir` **Typical CHECK**: ```mlir // CHECK-NOT: memref.alloc() // CHECK: %[[CONST0:.*]] = arith.constant 0 : i64 // CHECK: {{.*}} = hivm.hir.pointer_cast(%[[CONST0]]) ``` ## Interface | Option | Default | Description | |--------|---------|-------------| | `-mem-plan-mode=global-work-space-plan` | false | Use `GLOBAL_WORKSPACE_PLAN` in CV pipeline | | `enable-global-workspace-reuse` | false | Reuse buffers inside workspace | | `restrict-inplace-as-isa` | false | Restrict inplace to match ISA behavior | ## Constraints The total size of buffers live at any one time must not exceed the actual hardware memory size for that scope. > Each buffer is aligned as in [Hardware background](#hardware-background). Otherwise, PlanMemory fails with a scope overflow error (e.g. `UB overflow`): ```bash loc("/tmp/tmp0h121237/kernel.ttadapter.mlir":2:3): error: ub overflow, requires 3219456 bits while 1572864 bits available! (possible reason: tiling basic block is too large or block number is more than what user expect due to multi-buffer feature is enabled and some ops need extra local buffer.) ```