# Cube–Vector optimization overview

## Hardware background

This document describes the Cube–Vector (CV) optimization flow in AscendNPU IR at a high level. CV optimizations target NPU hardware such as Altas A2/A3: they apply a series of transformations in the HIVM (Huawei Intermediate Virtual Machine) layer so that the **Cube** (matrix) and **Vector** (vector) units work together efficiently and Mix kernel execution is improved.

### Terms and background (read first)

The following terms appear throughout the CV documentation. It helps to understand them before reading individual pass details.

| Term | Meaning | Notes |
|------|---------|--------|
| **HIVM** | Huawei Intermediate Virtual Machine | A dialect in AscendNPU IR that carries NPU-oriented ops (e.g. mmadL1, fixpipe, vadd) and control flow. |
| **IR** | Intermediate Representation | The compiler’s abstraction between source and machine code. This project uses MLIR; IR is in SSA form. |
| **Bufferization** | Converting the tensor abstraction into concrete memory (memref) | In the “pre-bufferization” phase, IR is still **tensor**-centric (logical multi-dim arrays). After bufferization, **memref** (memory with address/layout) is introduced. Most CV passes run in pre-bufferization, so you will see “tensor.empty”, “tensor slice”, etc. |
| **tensor vs memref** | tensor: logical multi-dim array, no explicit address; memref: memory region with base, stride, shape | In the CV flow, fixpipe “output” is often first represented as a tensor, then materialized via workspace (memref) or bufferization. |
| **Workspace** | A contiguous region of memory on GM, passed as a kernel argument at runtime | Used for intermediate data between Cube and Vector (e.g. fixpipe output). Multiple buffers can be allocated at different **offsets** in the same workspace; PlanMemory computes offsets for reuse. |
| **Liveness** | The lifetime of a buffer from “defined/first use” to “last use” | PlanMemory uses liveness to decide if two buffers can overlap; if not, they can share the same base address with different offsets. |
| **Inplace** | The op’s output is written over the input’s storage, reusing the same buffer | e.g. vcast f16→i16 (same width); output can overwrite input. PlanMemory identifies inplace ops and assigns offsets accordingly. |
| **AIC / AIV** | AIC: sub-kernel dominated by **C**ube; AIV: sub-kernel dominated by **V**ector | After splitting a Mix kernel, AIC runs mainly on Cube and AIV on Vector; they exchange data via fixpipe, DMA, etc., and the host/scheduler coordinates the call order. |
| **Host / Device** | Host: runs on CPU; Device: runs on NPU | Mix kernels are on the device side. “Mix can only be called from host” means the mix entry is invoked from a host-launched kernel call; another device kernel cannot directly call a mix function (current convention). |
| **CC / CV / VC / VV** | Two letters: “previous compute unit” and “next compute unit” (C=Cube, V=Vector) | e.g. CV = Cube then fixpipe/load then Vector; CC = two Cube segments connected by fixpipe+load. Used to describe InsertWorkSpaceForMixCV patterns. |

**Pre- vs post-bufferization:**  
Most CV passes run in the **pre-bufferization** phase (`hivmPreBufferizationOptimizationPipeline`), when IR is still tensor-centric with `scf.for` and similar control flow. Bufferization turns tensors into memrefs and fixes layout; after that, **post-bufferization** optimizations (e.g. another PlanMemory pass on `memref.alloc`) run. Understanding “CV structure first, then memory materialization” helps with pass ordering.

### Architecture

Ascend NPU uses a heterogeneous architecture with:

| Component | Description | Typical (Example) |
|-----------|-------------|------------------|
| **Cube** | Matrix unit; runs `mmadL1`, `batchMmadL1`, etc. | 24 AI Cores |
| **Vector** | Vector unit; runs `vadd`, `vcast`, `vreduce`, etc. | 48 AI Cores |
| **L0C** | Cube output buffer for matmul results | 128KB |
| **L0A/L0B** | Cube input buffers | 64KB |
| **UB** | Unified Buffer; main memory for Vector ops | 256KB |
| **GM** | Global memory | External DDR |

**fixpipe** is the data path between Cube and Vector. On Ascend, Cube and Vector are separate at the hardware level; the interaction path depends on the chip. For the 910 series, after Cube completes, fixpipe moves results from L0C to GM for later Vector use. In IR this is the `hivm.hir.fixpipe` op; on hardware it corresponds to the L0C→UB path and can do type conversion, quantization, etc. (controlled by fixpipe attributes such as `pre_quant`, `pre_relu`). The 910-series architecture is shown below.

![V220 architecture](../../../../images/developer_guide/cvarch.png)

## Algorithm overview

### createNormalizeMatmulPass

- **Role**: Normalize M/K/N dimensions, init conditions, and per-channel add form for `hivm.hir.mmadL1` and `hivm.hir.batchMmadL1`.
- **Goal**: Unify matmul IR so that later fixpipe insertion, tiling, etc. can match and transform consistently.
- **Typical transforms**: Inline vbrc+vadd-style bias into mmadL1 init; extract real M/K/N and replace constants; handle PerChannel cases.
- **Typical scenario**: Elementwise accumulation.

Before:

```mlir
%2 = ops // not 0 const
%3 = hivm.hir.mmadL1 ins(*)
       outs(%2 : tensor<16x32xf32>) -> tensor<16x32xf32>
```

After:

```mlir
%2 = ops
%3 = tensor.empty() : tensor<16x32xf32>
%4 = hivm.hir.mmadL1 ins(*)
        outs(%3 : tensor<16x32xf32>) -> tensor<16x32xf32>
%5 = hivm.hir.vadd ins(%2, %4: tensor<1x32xf32>) outs(%2 : tensor<16x32xf32>)
```

### createInlineFixpipePass

- **Role**: Insert `hivm.hir.fixpipe` between mmadL1/batchMmadL1 and store; merge store+vcast etc. into fixpipe’s quant/activation options.
- **Goal**: Make Cube-to-Vector data movement explicit so that workspace allocation and load/store insertion have clear insertion points.
- **Typical transforms**: Insert fixpipe on the use chain from mmadL1 result to store; fuse vcast (f32→f16) etc. as fixpipe `pre_quant = F322F16`.
- **Typical scenario**: Pure Cube to store.

Before:

```text
mmadL1 -> store
```

After:

```text
mmadL1 -> fixpipe
```

InlineFixpipe inserts fixpipe and then tries to inline ops such as hivm.vcast, hivm.vrelu, hivm.store onto the new fixpipe.

### createTileBatchMMIntoLoopPass

- **Role**: Expand `hivm.hir.batchMmadL1` along the batch dimension into an `scf.for` loop; each iteration runs a single `mmadL1` and fixpipe.
- **Goal**: Turn the batch dimension into a loop so that load/fixpipe/store can be indexed by batch for workspace and pipelining.
- **Typical transforms**: TileBatchMMIntoLoop expands batchMmadL1 into a loop.

Before:

```text
batchmmadL1 a : [batch, m, k], b[batch, k, n]
fixpipe workspace : [batch, m, n]
```

After:

```text
for batch_idx in range(batch):
  mmadL1(extract_slice(a), extract_slice(b))
  fixpipe(extract_slice(workspace))
```

### createInsertLoadStoreForMixCVPass

- **Role**: Insert load/store at Cube–Vector boundaries so that data flows correctly between tensor and global workspace.
- **Goal**: Ensure correct data transfer between Cube and Vector.
- **Typical transforms**: batchMmadL1+fixpipe rewritten to loop with mmadL1+fixpipe; extract_slice/insert_slice on inputs/outputs.
- **Typical scenario**: Cube–Vector mix (CV pattern).

Before:

```text
mmadL1
fixpipe
vadd
```

After:

```text
mmadL1
fixpipe
load
vadd
```

### createInsertWorkSpaceForMixCVPass

- **Role**: At Cube–Vector boundaries (CC/CV/VC/VV), replace `tensor.empty` with `memref_ext.alloc_workspace`.
- **Goal**: Allocate fixpipe output, store output, and other intermediates from global workspace for cross-iteration and cross-core sharing.
- **Patterns**: CC (mmadL1→fixpipe→load→mmadL1), CV (mmadL1→fixpipe→load→vector), VC (vector→store→load→mmadL1), VV (vector→store→load→vector).
- **Typical scenario**: Cube–Vector mix (CV pattern).

Before:

```mlir
%1 = mmadL1
%2 = tensor.empty()
%3 = fixpipe ins(%1) outs(%2)
%4 = load ins(%3)
vadd (%4)
```

After:

```mlir
%1 = mmadL1
%2 = memref_ext.alloc_workspace()
%3 = bufferization.to_tensor(%2)
%4 = fixpipe ins(%1) outs(%3)
%5 = load ins(%4)
vadd (%5)
```

### createBindWorkSpaceArgPass

- **Role**: Bind in-function `memref_ext.alloc_workspace` to the function’s workspace argument (`hacc.arg_type = #hacc.arg_type<workspace>`).
- **Goal**: Single source of workspace so that the runtime passes the workspace pointer as an argument and multiple kernels can share one workspace.
- **Preconditions**: The function must have a workspace argument; InsertWorkSpaceForMixCV must have inserted alloc_workspace.
- **Typical scenario**: Cube–Vector mix (CV pattern).

Before:

```mlir
func.func @bind_workspace_arg(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}){
  memref_ext.alloc_workspace() : memref<100xi32>
  return
}
```

After:

```mlir
func.func @bind_workspace_arg(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}){
  memref_ext.alloc_workspace() from %arg1 : memref<100xi32>
  return
}
```

### createPlanMemoryPass

- **Role**: In `GLOBAL_WORKSPACE_PLAN` mode, plan memory for `memref_ext.alloc_workspace`, replacing alloc with `hivm.hir.pointer_cast` + offset.
- **Goal**: On a given workspace base, assign offsets by liveness and inplace rules to maximize reuse and minimize total workspace size.
- **Typical transforms**: Multiple alloc_workspace mapped to different offsets in the same workspace; conflicting buffers get different offsets.

Before:

```mlir
func.func @bind_workspace_arg(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}){
  memref_ext.alloc_workspace() from %arg1 : memref<100xi32>
  return
}
```

After:

```mlir
func.func @bind_workspace_arg(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>}){
  memref_ext.alloc_workspace() from %arg1 offset=[0] : memref<100xi32>
  return
}

```

### createSplitMixKernelPass

- **Role**: Split the Mix kernel into AIC (Cube) and AIV (Vector) sub-functions and generate the mix entry.
- **Goal**: The backend can schedule AIC/AIV to Cube/Vector cores separately for pipelining and sync.
- **Typical transforms**: Traverse IR by core type; put Cube code in AIC and Vector in AIV; use `annotation.mark` for tensors passed across cores.

before:

```mlir
func.func @bind_workspace_arg(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>},
     hivm.func_core_type = #hivm.func_core_type<MIX>){
  mmadl1
  memref_ext.alloc_workspace() from %arg1 offset=[0] : memref<100xi32>
  fixpipe
  load
  vadd
}
```

after:

```mlir
func.func @bind_workspace_arg_aic(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>},
     hivm.func_core_type = #hivm.func_core_type<AIC>){
  mmadl1
  memref_ext.alloc_workspace() from %arg1 offset=[0] : memref<100xi32>
  fixpipe
}
func.func @bind_workspace_arg_aiv(
              %arg0: i64 {hacc.arg_type = #hacc.arg_type<ffts_base_address>},
              %arg1: memref<?xi8> {hacc.arg_type = #hacc.arg_type<workspace>},
     hivm.func_core_type = #hivm.func_core_type<AIV>){
  load
  vadd
}
```

## Interface

Testing individual passes:

```bash
bishengir-opt -hivm-normalize-matmul input.mlir -o output.mlir
bishengir-opt -hivm-inline-fixpipe input.mlir -o output.mlir
bishengir-opt --hivm-tile-batchmm-into-loop input.mlir -o output.mlir
bishengir-opt -insert-workspace-for-mix-cv input.mlir -o output.mlir
bishengir-opt --hivm-bind-workspace-arg input.mlir -o output.mlir
bishengir-opt -hivm-plan-memory -mem-plan-mode=global-work-space-plan input.mlir -o output.mlir
bishengir-opt -hivm-split-mix-kernel input.mlir -o output.mlir
```

### Test Cases

Currently, all test cases in the repository are located under `path-to-ascendnpuir\bishengir\test`. To run a specific pass, search for the corresponding compilation command to locate the relevant test file. For example, searching for `hivm-normalize-matmul` will lead you to the corresponding test file `bishengir\test\Dialect\HIVM\normalize-matmul.mlir`.

### Test Commands

The specific run commands are provided at the top of each test file. For example:

```bash
// RUN: bishengir-opt -hivm-normalize-matmul %s -split-input-file -verify-diagnostics -allow-unregistered-dialect | FileCheck %s
```

Here, `bishengir-opt` and `FileCheck` are binary executable files generated during compilation, located in `path-to-ascendnpuir\build\bin`. In the above command, `%s` should be replaced with the corresponding test file, such as `bishengir\test\Dialect\HIVM\normalize-matmul.mlir`.

The output MLIR will match the `CHECK:` part in the test file. The execution is successful if no `CHECK failed` errors are reported after testing.

## Constriants

- The createPlanMemoryPass handles the space size for data interaction points. Since the total required space size is dynamically returned, there is no limitation on the size.
- The createInlineFixpipePass currently can only inline three types of operations: vast, relu, and store.