# Auto-Subtiling

## Hardware background

As Ascend chips evolved, the AI Cube core (AIC) and the AI Vector core (AIV) were separated, with a core ratio of 1:2: each Cube core is paired with two Vector cores.

![image](../../../../images/developer_guide/cvarch.png)

In the current ecosystem, user-written kernels and community operators typically do not implement the sub-block logic that the Ascend Cube–Vector 1:2 layout requires. To improve compute efficiency and make better use of the hardware, the compiler needs an automatic sub-block (subtiling) capability. This feature applies a Cube–Vector 1:2 subtiling strategy and splits the data accordingly, so that each of the two vector sub-blocks processes half of it.

## Algorithm overview

The overall approach is:

![image](../../../../images/developer_guide/auto_subtiling2.png)

Effects:

![image](../../../../images/developer_guide/auto_subtiling3.png)

### Input/output example

Original code:

```mlir
%t0 = hivm.hir.vexp ins(%src: tensor<64xf16>)
                     outs(%init: tensor<64xf16>) -> tensor<64xf16>
%t1 = hivm.hir.vabs ins(%t0: tensor<64xf16>)
                     outs(%init: tensor<64xf16>) -> tensor<64xf16>
hivm.hir.store ins(%t1: tensor<64xf16>) outs(%output : memref<64xf16>)
```

After auto 1:2 subtiling succeeds, each vector sub-block processes one half of the data:

```mlir
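%0 = hivm.hir.get_sub_block_idx -> i64
// Each sub-block owns one contiguous half: offset = sub_block_idx * 32.
%idx = arith.index_cast %0 : i64 to index
%c32 = arith.constant 32 : index
%offset = arith.muli %idx, %c32 : index
%slice_src = tensor.extract_slice %src[%offset][32][1] : tensor<64xf16> to tensor<32xf16>
%t0 = hivm.hir.vexp ins(%slice_src: tensor<32xf16>)
                     outs(%new_init: tensor<32xf16>) -> tensor<32xf16>
%t1 = hivm.hir.vabs ins(%t0: tensor<32xf16>)
                     outs(%new_init: tensor<32xf16>) -> tensor<32xf16>
%output_slice = memref.subview %output[%offset][32][1] : memref<64xf16> to memref<32xf16, strided<[1], offset: ?>>
hivm.hir.store ins(%t1: tensor<32xf16>) outs(%output_slice : memref<32xf16, strided<[1], offset: ?>>)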
```
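The slice parameters in the example can be reproduced with a small Python model. This is only a sketch: `slice_params` is a hypothetical helper, not a compiler API, and it assumes that each sub-block owns one contiguous half of the chosen axis and that the extent divides evenly by the number of sub-blocks.

```python
def slice_params(shape, axis, sub_block_idx, num_sub_blocks=2):
    """Return (offsets, sizes, strides) of the tile owned by one sub-block.

    Hypothetical helper for illustration. Assumes the extent along `axis`
    is evenly divisible by `num_sub_blocks`.
    """
    extent = shape[axis]
    if extent % num_sub_blocks != 0:
        raise ValueError(f"extent {extent} is not divisible by {num_sub_blocks}")
    tile = extent // num_sub_blocks
    offsets = [0] * len(shape)
    sizes = list(shape)
    strides = [1] * len(shape)
    offsets[axis] = sub_block_idx * tile   # sub-block i starts at i * tile
    sizes[axis] = tile                     # every sub-block gets the same tile size
    return offsets, sizes, strides


# The 1-D tensor<64xf16> case from the example above.
for idx in range(2):
    print(idx, slice_params((64,), axis=0, sub_block_idx=idx))
```

For the 64-element tensor, sub-block 0 gets offset 0 and sub-block 1 gets offset 32, both with size 32 and stride 1.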

### Implementation idea

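1. Split the data written by each store in half by inserting an extract-slice inside a two-iteration for-loop.
2. Bubble the extract-slice up toward the kernel inputs using the BubbleUpExtractSlice pattern.
3. Map the for-loop to the sub-block index, so that each vector sub-block executes one iteration.
4. If all of the steps above succeed, subtiling is complete.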

If subtiling fails, the compiler falls back to 1:1.

![image](../../../../images/developer_guide/auto_subtiling4.png)

Figure: Auto-subtiling 1:2 implementation.
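The three transformation steps can be modeled in plain Python to show why bubbling the slice up matters. This is a sketch under stated assumptions, not the compiler's implementation: Python lists stand in for tensors, `half` stands in for the extract-slice, and the `work` list (an illustrative device) records how many elements each vector op touches.

```python
import math


def vexp(xs, work):
    work.append(len(xs))  # record how many elements this op processed
    return [math.exp(x) for x in xs]


def vabs(xs, work):
    work.append(len(xs))
    return [abs(x) for x in xs]


def half(xs, i):
    """Model of extract-slice: the i-th contiguous half of xs."""
    n = len(xs) // 2
    return xs[i * n:(i + 1) * n]


def stage1(src, work):
    """Step 1: full-size compute; a two-iteration loop slices the stored data."""
    out = [0.0] * len(src)
    t1 = vabs(vexp(src, work), work)
    n = len(src) // 2
    for i in range(2):
        out[i * n:(i + 1) * n] = half(t1, i)
    return out


def stage2(src, work):
    """Step 2: the slice has bubbled up to the input, so each op is half-size."""
    out = [0.0] * len(src)
    n = len(src) // 2
    for i in range(2):
        out[i * n:(i + 1) * n] = vabs(vexp(half(src, i), work), work)
    return out


def stage3(src, out, sub_block_idx, work):
    """Step 3: the loop index becomes the sub-block index (one iteration each)."""
    n = len(src) // 2
    i = sub_block_idx
    out[i * n:(i + 1) * n] = vabs(vexp(half(src, i), work), work)


src = [x / 8.0 - 4.0 for x in range(64)]
w1, w2 = [], []
ref = stage1(src, w1)
assert stage2(src, w2) == ref

out = [0.0] * len(src)
per_block = []
for idx in range(2):  # on hardware the two sub-blocks run concurrently
    w = []
    stage3(src, out, idx, w)
    per_block.append(w)
assert out == ref
```

In this model every stage produces the same output, but after step 3 each sub-block only runs ops on 32 elements instead of 64.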

### Design

#### Dimension analyzer (axis selection)

The Dimension Analyzer chooses a **parallel axis** for splitting by analyzing all operators in the target kernel.
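A minimal sketch of this kind of axis selection is shown below. The selection criteria of the real Dimension Analyzer are not described here, so the rule used (take the outermost axis that is parallel in every op and has an even extent) is an assumption, as are the function name `choose_split_axis`, the per-op iterator-type lists, and the simplification that all ops share one iteration space.

```python
def choose_split_axis(shape, op_iterator_types):
    """Return the outermost axis that every op treats as parallel, or None.

    shape: extents of the shared iteration space (assumed identical for all ops).
    op_iterator_types: op name -> list of "parallel" / "reduction", one per axis.
    The even-extent requirement is an assumption made for a clean 1:2 split.
    """
    for axis, extent in enumerate(shape):
        if extent % 2 != 0:
            continue
        if all(iters[axis] == "parallel" for iters in op_iterator_types.values()):
            return axis
    return None


# Hypothetical kernel: an elementwise op followed by a reduction over axis 1.
ops = {
    "vexp": ["parallel", "parallel"],
    "reduce_sum": ["parallel", "reduction"],
}
print(choose_split_axis((16, 64), ops))
```

When no axis qualifies the function returns `None`, which corresponds to the "axis selection fails" case that triggers the 1:1 fallback.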

#### Why choose a parallel axis

Vector cores do not share a direct data path, so splitting must avoid cross-tile dependencies, both to preserve correctness and to maximize parallelism. Splitting along a parallel axis lets each tile be computed independently on its own vector core.
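The difference between splitting a parallel axis and splitting a reduction axis can be seen on a row-sum, used here purely as an illustrative example. Rows are the parallel axis and columns are the reduction axis.

```python
def row_sums(matrix):
    """Reduce each row to a single value (columns are the reduction axis)."""
    return [sum(row) for row in matrix]


m = [[r * 4 + c for c in range(4)] for r in range(4)]
full = row_sums(m)

# Split along the parallel axis (rows): each tile yields final results
# for its own rows, so the two tiles never need to exchange data.
top, bottom = row_sums(m[:2]), row_sums(m[2:])
assert top + bottom == full

# Split along the reduction axis (columns): each tile only yields partial
# sums, and a cross-tile addition is required to obtain the result.
left = row_sums([row[:2] for row in m])
right = row_sums([row[2:] for row in m])
assert left != full and right != full
assert [a + b for a, b in zip(left, right)] == full
```

The second split would require the two vector cores to combine their partial results, which is exactly the cross-tile dependency that choosing a parallel axis avoids.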

#### Tile and slice store (leaf)

Before each StoreOp/leaf node, an ExtractSliceOp for 1:2 splitting is inserted along the axis chosen by the Dimension Analyzer.

#### BubbleUp Extract Slice

A BubbleUp strategy is implemented per op type. Supported op types include:

- BroadcastOp, ReduceOp, ExpandOp (specific shapes), CollapseOp (specific shapes)
- ElementwiseOp, LoopOp, ExtractSliceOp (specific cases), InsertSliceOp (specific cases)

Additional op types can be supported by adding matchAndRewrite patterns.
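The per-op-type structure can be sketched as a small rewrite over expression trees. This is a simplified model, not the actual pattern code: expressions are nested tuples, only elementwise ops are handled, `PATTERNS` plays the role of the registered rewrite patterns, and `sort` is a hypothetical op with no registered pattern.

```python
class UnsupportedOp(Exception):
    """Raised when no bubble-up pattern is registered for an op."""


def through_elementwise(op, i):
    # slice(f(a, b, ...)) -> f(slice(a), slice(b), ...)
    name, *operands = op
    return (name, *(bubble_up(("slice", x, i)) for x in operands))


# Registry of per-op-type strategies; supporting a new op means adding an entry.
PATTERNS = {
    "vexp": through_elementwise,
    "vabs": through_elementwise,
}


def bubble_up(slice_expr):
    """Push ("slice", operand, i) toward the kernel inputs."""
    _, operand, i = slice_expr
    name = operand[0]
    if name == "input":
        return slice_expr  # the slice has reached a kernel input
    pattern = PATTERNS.get(name)
    if pattern is None:
        raise UnsupportedOp(name)
    return pattern(operand, i)


before = ("slice", ("vabs", ("vexp", ("input", "src"))), 0)
after = bubble_up(before)
print(after)
```

An op without an entry in `PATTERNS` raises `UnsupportedOp` in this model, which mirrors the case where the compiler gives up on subtiling and falls back to 1:1.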

## Interface

Behavior is controlled by:

- `--enable-auto-bind-sub-block=True`: enables the feature (default).
- `--enable-auto-bind-sub-block=False`: disables the feature.

## Constraints and fallback

If subtiling or an intermediate transformation fails, the compiler automatically falls back to 1:1 to preserve correctness.

Common reasons for falling back to 1:1:

1. Axis selection fails (no valid parallel axis for splitting).
2. BubbleUpExtractSlicePattern encounters an unsupported op.

### Fallback example

Original code:

```mlir
%t0 = hivm.hir.vexp ins(%src: tensor<64xf16>)
                     outs(%init: tensor<64xf16>) -> tensor<64xf16>
%t1 = hivm.hir.vabs ins(%t0: tensor<64xf16>)
                     outs(%init: tensor<64xf16>) -> tensor<64xf16>
hivm.hir.store ins(%t1: tensor<64xf16>) outs(%output : memref<64xf16>)
```

After auto 1:2 subtiling fails, the kernel body is wrapped in an `scf.if` so that only sub-block 0 executes it:

```mlir
%0 = hivm.hir.get_sub_block_idx -> i64
%c0_cst = arith.constant 0 : i64
%1 = arith.cmpi eq, %0, %c0_cst : i64
scf.if %1 {
  %t0 = hivm.hir.vexp ins(%src: tensor<64xf16>)
                       outs(%init: tensor<64xf16>) -> tensor<64xf16>
  %t1 = hivm.hir.vabs ins(%t0: tensor<64xf16>)
                       outs(%init: tensor<64xf16>) -> tensor<64xf16>
  hivm.hir.store ins(%t1: tensor<64xf16>) outs(%output : memref<64xf16>)
}
```
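The effect of the guard can be illustrated with a small Python model. It is a sketch only: the loop over `idx` stands in for the two vector sub-blocks, and `launch` and its `subtiled` argument are hypothetical names used for illustration.

```python
import math


def kernel(xs):
    """Reference kernel: elementwise exp followed by abs."""
    return [abs(math.exp(x)) for x in xs]


def launch(src, subtiled):
    """Run the kernel on two modeled sub-blocks; return (output, active ids)."""
    out = [0.0] * len(src)
    active = []
    n = len(src) // 2
    for idx in range(2):  # stands in for the two vector sub-blocks
        if subtiled:
            # 1:2 mode: each sub-block computes and stores its own half.
            out[idx * n:(idx + 1) * n] = kernel(src[idx * n:(idx + 1) * n])
            active.append(idx)
        elif idx == 0:
            # 1:1 fallback: only sub-block 0 passes the guard and does all work.
            out[:] = kernel(src)
            active.append(idx)
    return out, active


src = [x / 16.0 - 2.0 for x in range(64)]
out_12, active_12 = launch(src, subtiled=True)
out_11, active_11 = launch(src, subtiled=False)
assert out_12 == out_11
```

Both modes produce the same output in this model. The fallback gives up the parallel speedup, and the guard keeps the second sub-block from redundantly computing and storing the full tensor.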