# Tile Cube and Vector Loop

This document describes the **TileCubeVectorLoop** pass in HIVM, which optimizes CV-style kernels. Before reading, we recommend [CV Optimization](CVOptimization.md) for CV terminology.

## Hardware background

Current Ascend AI accelerator chips adopt a decoupled architecture for AIC and AIV cores. Data exchange between AIC and AIV cores is performed through Global Memory. When data dependencies exist between the AIC and AIV cores, inter-core synchronization instructions are required to ensure data correctness. However, frequent inter-core synchronization degrades performance. To improve operator execution performance, the synchronization frequency between AIC and AIV cores should therefore be kept as low as possible.

## Overview

![Effect of using Tile Cube and Vector Loop](../../../../images/developer_guide/TileCubeAndVectorLoop.png)

**After CV Pipelining, this pass performs another tiling pass on the Cube and Vector loops in Mix kernels**: it splits one large iteration into multiple smaller iterations. Goals:

1. **Reduce inter-core sync**: Smaller iterations are more likely to fit in local buffers (L0C, UB), reducing sync cost.
2. **Coarser tiles**: Under hardware limits, larger tile sizes can improve memory and compute efficiency:
   - **Cube**: Matmul results go to L0C; if one iteration exceeds L0C capacity, it cannot be placed in L0C in one shot.
   - **Vector**: Too large an iteration can cause UB overflow.

## Interface

### Compile options

| Option | Default | Description |
|--------|---------|-------------|
| `tile-mix-cube-loop` | 1 | Cube loop trip count; 1 means no tiling. |
| `tile-mix-vector-loop` | 1 | Vector loop trip count; 1 means no tiling. |

## Algorithm

The pass walks the IR, finds CopyOut ops for Cube and Vector loops, uses them as anchors to split their producers, and fuses the split into the loop.

Before:

```mlir
scf.for {
  hivm.load A
  hivm.load B
  hivm.hir.mmadL1
  hivm.hir.fixpipe
} {cube_loop}
```

After:

```mlir
scf.for {
  for {
    hivm.load slice_A
    hivm.load slice_B
    hivm.hir.mmadL1
    hivm.hir.fixpipe
  } {sub_tile}
} {cube_loop}
```

## Constraints

1. Only `scf.for` with **`hivm.loop_core_type`** is processed; allowed values:
   - **`#hivm.tcore_type`**: Cube loop
   - **`#hivm.tcore_type`**: Vector loop
2. For Vector: if the tiled block size is smaller than UB alignment, no tiling is applied.
3. For Cube: if the block size before tiling is smaller than total L0C size, no tiling is applied. (*Note: L1 size is not yet considered; L1 memory overflow may occur in some cases; Cube tiling will be refined with liveness analysis.*)
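
The size checks in constraints 2 and 3, together with the rule that a trip count of 1 means no tiling, can be illustrated with a small Python sketch. This is not the pass implementation. The function names, the choice of bytes as the size unit, and the values of `L0C_BYTES` and `UB_ALIGN_BYTES` are all assumptions made for illustration; the real pass works on IR and takes its limits from the target description.

```python
# Hypothetical hardware parameters, chosen only for illustration.
L0C_BYTES = 128 * 1024   # assumed total L0C capacity
UB_ALIGN_BYTES = 32      # assumed UB alignment


def cube_sub_tiles(block_bytes: int, trip_count: int,
                   l0c_bytes: int = L0C_BYTES) -> int:
    """Return how many sub-tiles a Cube loop body is split into (1 = no tiling)."""
    if trip_count <= 1:
        return 1  # option value 1 disables tiling
    if block_bytes < l0c_bytes:
        return 1  # constraint 3: block is already smaller than total L0C
    return trip_count


def vector_sub_tiles(block_bytes: int, trip_count: int,
                     ub_align_bytes: int = UB_ALIGN_BYTES) -> int:
    """Return how many sub-tiles a Vector loop body is split into (1 = no tiling)."""
    if trip_count <= 1:
        return 1  # option value 1 disables tiling
    tiled_bytes = block_bytes // trip_count
    if tiled_bytes < ub_align_bytes:
        return 1  # constraint 2: tiled block would fall below UB alignment
    return trip_count


if __name__ == "__main__":
    # A 256 KiB Cube block with trip count 2 is tiled under the assumed L0C size.
    print(cube_sub_tiles(256 * 1024, 2))
    # A 64-byte Vector block split four ways would be 16 bytes, so it is left untiled.
    print(vector_sub_tiles(64, 4))
```

In this sketch the two `trip_count` arguments stand in for the `tile-mix-cube-loop` and `tile-mix-vector-loop` options. The L1 capacity is deliberately not checked, matching the note in constraint 3.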