# AutoSchedule ## HFusion AutoSchedule: Design for Automatic Fusion and Scheduling HFusion is a high-level framework on Bisheng IR for operator fusion and automatic scheduling. The **AutoSchedule** module generates efficient schedules for Ascend NPU once fusion units are fixed. Design goals include: - **Automation**: Choose scheduling and Tiling strategies from fusion patterns and operator traits. - **Extensibility**: Common scheduler base and kernel abstractions for new strategies. - **Performance**: Dynamic shape, multi-core reduce, and related optimizations. - **Engineering**: Schedules expressed as reusable, interpretable Transform Dialect sequences. ### Hardware Background The Ascend NPU uses a multi-level memory hierarchy: global memory (GM) has large capacity but high access latency, while on-chip memory (e.g., L1, UB) has lower latency but limited capacity. To leverage on-chip memory bandwidth and reduce global memory traffic: - **Maximize on-chip memory utilization**: Through large-scale operator fusion, multiple ops are merged into the same kernel, so intermediate results are reused in on-chip buffers and GM read/write counts are reduced. - **Satisfy hardware access constraints**: On-chip memory (e.g., UB) access must meet hardware requirements; for example, **UB access requires 32-byte alignment** (stride-align). Otherwise, misaligned access can cause runtime errors or performance degradation. - **Fit Tiling and loop structure**: When generating Tiling and loop nests, stride/size/tile alignment constraints must be respected so that the resulting kernel complies with hardware specifications. AutoSchedule is designed around these hardware characteristics: it uses automated scheduling and Tiling strategies to maximize on-chip memory utilization while preserving legality. ### Algorithm Principle The core algorithm centers on **large-scale operator fusion + dimension-mapping-driven loop generation + Transform Dialect fusion execution**: 1. **Dimension Analyzer axis mapping** - `DimensionAnalyzer` analyzes the axis mapping of each op in the kernel relative to the anchor (`getCommonAxis`, `getNormalizedInterchange`, etc.). - It establishes the correspondence between tensor dimensions and anchor dimensions, supporting broadcast, reduce, transpose, and other patterns. - This provides dimension-level information for subsequent Tiling and loop construction. 2. **Loop generation and op fusion** - Based on the axis mapping, a unified loop structure is generated for the fusion graph, and ops are fused into the same loop via MLIR Transform Dialect primitives such as `fuseIntoContaining`, `fuseLoops`, and `coalesceLoops`. - During fusion, hardware constraints are explicitly enforced: stride-align (32-byte alignment), size-align, tile-align, etc., so the generated IR meets Ascend NPU memory access rules. 3. **Tiling computation and selection** - Using `StmtExprBuilder` and the `Expr` system, together with static or dynamic shape, multiple candidate Tiling schemes (`TilingCases`) are generated. - `getStrideAlignments()` and `getTileAlignments()` are applied during Tiling; dimensions are adjusted with `alignTo(alignment)` to produce valid Tiling that satisfies stride-align and related constraints. - A best `TilingKey` is chosen (e.g., by cost or alignment), and the schedule description is built accordingly. 4. **Transform Dialect interpretation** - The scheduler does not modify IR directly; it builds a Transform Dialect program. `AutoScheduleInterpreter` translates it into concrete Transform operations and applies them to the target IR so the schedule takes effect. ### Interface Description #### Code location - **Headers (APIs and abstractions)**: `bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/` - **Implementation**: `bishengir/lib/Dialect/HFusion/Transforms/AutoSchedule/` #### Core interfaces and abstractions | Type | Name | Description | | ---------- | ---------------------- | --------------------------------------------------------------------------- | | Base class | `SchedulerBase` | Abstract base for all schedulers, encapsulating the common scheduling flow | | Scheduler | `PureElemwiseScheduler`| Pure elementwise fusion strategy | | Scheduler | `AnyPBRScheduler` | Generic strategy for Pointwise/Broadcast/Reduce and similar fusion patterns | | Kernel | `KernelInfo` | Unified description of fused kernel IO, dimensions, alignment, multi-core | | Alignment | `getStrideAlignments()`| Returns stride alignment constraints (dim index, unit), e.g. 32-byte align | | Alignment | `getSizeAlignments()` | Returns size dimension alignment constraints | | Alignment | `getTileAlignments()` | Returns tile dimension alignment constraints | | Analysis | `DimensionAnalyzer` | `getCommonAxis`, `getInterchange`, `getNormalizedInterchange`, etc. | | Primitive | `cacheRead` / `cacheWrite` | IO cache | | Primitive | `tileUsingFor` / `tileUsingForAll` / `tileReductionUsingFor` | Tiling | | Primitive | `fuseLoops` / `fuseIntoContaining` / `coalesceLoops` | Loop fusion and coalesce | | Primitive | `setBufferSize` | Resource constraints | #### Strategy selection and call chain - **Pass entry**: The AutoSchedule pass is invoked in the HFusion pipeline and receives the `func::FuncOp` and fusion information. - **Strategy selection**: In `AutoScheduleBase.cpp::applySchedule()`, the scheduler is chosen by `FusionKind`: - `FusionKind::PureElemwise` → `PureElemwiseScheduler` - `FusionKind::AnyPB` / `FusionKind::LastAxisPBR` / `FusionKind::AnyPBR` → `AnyPBRScheduler` - **Main flow**: `runPreScheduleProcedure()` → `runScheduleProcedure()` (including `calculateTilingImpl()`, `createScheduleImpl()`) → `runPostScheduleProcedure()` → Transform Dialect application. ### Constraints and Capabilities AutoSchedule explicitly handles the following constraints during scheduling and Tiling: | Constraint | Description | API / Implementation | | ---------------- | ---------------------------------------------------------------------------- | -------------------- | | **Stride align** | UB and other on-chip memory access must be 32-byte aligned | `getStrideAlignments()`, `alignTo()` in `calculateTilingImpl()` | | **Size align** | Some ops (e.g., transpose, concat, cast) require tile/size alignment | `getSizeAlignments()` | | **Tile align** | Combination of stride and size constraints, applied to Tiling schemes | `getTileAlignments()` | | **Reduce axis** | Reduce, broadcast, extract_slice, transpose, etc. impose lowest-dim alignment| `KernelInfo` per-op stride-align logic | | **On-chip buffer** | Buffer allocation limited by L1/UB capacity and `maxBufferCnt` | `setBufferSize`, `maxBufferCnt` | | **Multi-core reduce** | Multi-core parallel reduce only when specific conditions hold | `analyzeMultiCoreReduceInfo()` | These constraints are applied uniformly in `KernelInfo::getStrideAlignments()` and scheduler-specific `calculateTilingImpl()` to ensure the generated schedule is valid and hardware-compatible. ### Architecture overview #### Core components ##### Scheduler base and strategies - **SchedulerBase**: Abstract base for all schedulers (`AutoScheduleBase.h`), encapsulating the common scheduling flow. - **Concrete strategy schedulers**: - **PureElemwiseScheduler**: Pure elementwise fusion (`PureElemwiseSchedule.h/cpp`). - **AnyPBRScheduler**: Generic strategy for AnyPBR (Pointwise/Broadcast/Reduce) and similar ops (`AnyPBRSchedule.h/cpp`). ##### Kernel and tiling abstraction - **KernelInfo**: Unified description of a fused kernel (`KernelInfo.h`), including IO, dimensions, alignment, and multi-core capability. - **Tiling abstraction and utilities** (`TilingUtils.h/cpp`): - **TilingInfo**, **TilingStruct**, **TilingData**: Describe a single or multiple candidate tiling schemes. - **Expr** / **StmtExprBuilder**: Build tiling expressions that depend on static or dynamic shape. ##### Schedule operations - **ScheduleOperations.cpp**: Implements reusable schedule primitives, including: - IO cache: `cacheRead` / `cacheWrite` - Tiling: `tileUsingFor` / `tileUsingForAll` / `tileReductionUsingFor` - Loop transforms: `fuseLoops` / `fuseIntoContaining` / `coalesceLoops` - Resource constraints: `setBufferSize`, etc. ##### Schedule interpretation - **AutoScheduleInterpreter.cpp**: Converts the high-level schedule description produced by the scheduler into Transform Dialect operations and applies them to the target IR so the schedule takes effect. #### Strategy selection and call chain The overall call chain is: - **Pass entry** - The AutoSchedule pass is invoked in the HFusion pipeline and receives the `func::FuncOp` and fusion information to process. - **Strategy selection and scheduler construction** - In `AutoScheduleBase.cpp::applySchedule()`, the scheduler is chosen by fusion kind `FusionKind`: - `FusionKind::PureElemwise` → `PureElemwiseScheduler` - `FusionKind::AnyPB` / `FusionKind::LastAxisPBR` / `FusionKind::AnyPBR` → `AnyPBRScheduler` - The scheduler instance is created with `std::make_unique<...>(funcOp)`. - **Main scheduling flow (`SchedulerBase::runOnOperation()`)** - **Pre** (`runPreScheduleProcedure()`): Insert IO cache, analyze fusion graph and legality; call `analyzeAndVerifyKernelImpl()` for strategy-specific kernel analysis and checks. - **Schedule** (`runScheduleProcedure()`): Call `calculateTilingImpl()` to get `TilingComputeFn` and candidate tiling; choose a `TilingKey` (e.g. by cost or alignment); call `createScheduleImpl()` to build the schedule for that key; pass the schedule to the Transform interpreter via `applyScheduleImpl()`. - **Post** (`runPostScheduleProcedure()`): Optional structure cleanup and statistics. - **Transform Dialect application** - `AutoScheduleInterpreter` parses the schedule description, translates it into a sequence of Transform Dialect operations, and applies them to the HFusion IR. #### Key data structures ##### KernelInfo (kernel description) - Abstracts the structure and constraints of a single fused kernel. Typical information includes: - Input/output tensors and their shape/layout. - Topology of ops in the fusion graph. - Stride, size, and tile alignment constraints for hardware. - Whether multi-core reduce is supported and which dimensions can be parallelized. - For specific fusion patterns, derived classes (e.g. `AnyPBRKernelInfo`) can add pattern-specific analysis. ##### Tiling (`TilingUtils.h`) - **TilingData**: Tiling parameters for a single dimension (constant or expression). - **TilingStruct** / **TilingCases**: A full tiling scheme and sets of candidate schemes. - **Expr** / **StmtExprBuilder**: - `DimSymbol`: Symbol for a dynamic dimension. - `Expr`: Arithmetic (e.g. dimension/factor, align-to granularity). - `StmtExprBuilder`: Builds `Expr` from IR shape and constants and generates the host-side tiling function. ##### ValueHandle - Uniform wrapper for MLIR `Value`, function arguments, and named values for consistent access and manipulation. - Common types include `NamedValueHandle`, `FuncArgHandle`, etc. ### Scheduling strategies #### PureElemwise scheduling strategy - **Use case**: Graphs that are mostly elementwise ops, without complex broadcast/reduce. - **Implementation location**: `PureElemwiseSchedule.h/cpp`. - **Strategy characteristics**: - Aims at regular loop structure with simple, regular tiling. - Emphasizes contiguous access and multi-level cache friendliness after fusion. - `calculateTilingImpl()` and `createScheduleImpl()` perform tiling and schedule primitive chaining. #### AnyPBR scheduling strategy (AnyPBRScheduler) - **Use case**: Fused subgraphs containing broadcast, reduce, and similar patterns. - **Implementation location**: `AnyPBRSchedule.h/cpp`. - **Capabilities**: - **Tiling**: - In `calculateTilingImpl()`, considers stride alignment, dynamic shape symbols, reduce/broadcast axes, etc.; uses `StmtExprBuilder` to build expressions and produce multiple `TilingCases`. - **Multi-core reduce analysis and utilization**: - `analyzeMultiCoreReduceInfo()` determines whether multi-core reduce conditions are met (see §7.3). - **Schedule construction**: - In `createScheduleImpl()`, for the chosen `TilingKey`, enables concrete scheduling strategies, considering on-chip buffer allocation, axis-specific tiling, loop fuse/coalesce, and multi-core binding. ### Main optimizations This section summarizes **stride-align**, **dynamic shape**, and **multi-core reduce** and their role in AutoSchedule. #### Stride-align memory alignment optimization - **Goal**: Avoid unaligned UB access. - **APIs and data source**: - `KernelInfo::getStrideAlignments()` (`KernelInfo.h`): Returns (dimension index, alignment) pairs describing the minimum stride alignment for certain dimensions during memory access. - Related APIs: `getSizeAlignments()` for size alignment; `getTileAlignments()` for tile alignment. - **Usage in scheduling strategy (AnyPBR example)**: - In `AnyPBRSchedule.cpp::calculateTilingImpl()`: - Generate initial tiling from problem size. - Iterate over `getStrideAlignments()` and `getTileAlignments()`, and adjust relevant dimension tiling sizes with `alignTo(alignment)`. - Output stride-aligned `TilingCases`. - **When**: Stride-align is applied during **tiling**, i.e. when `calculateTilingImpl()` is called inside `runScheduleProcedure()`. #### Dynamic shape support - **Problem and requirements**: - Some dimensions (e.g., batch size, spatial size) may not be constant at compile time. - AutoSchedule must support computing suitable tiling parameters at **runtime** from actual input shapes. - **Expr system design**: - **Expr** / **DimSymbol** / **StmtExprBuilder** in `TilingUtils.h` form a lightweight expression framework: - **DimSymbol**: Symbol for a dimension (e.g., N, H, W); can be created via `StmtExprBuilder::createDimSymbolExpr()`. - **Expr**: Supports basic arithmetic, e.g., `N/4`, `min(N,64)`, `(H*W)/factor`. - **StmtExprBuilder**: Builds `Expr` from IR shape and constants; generates host-side tiling computation statements. - **Host tiling function generation and execution**: - `TilingComputeFn` from `calculateTilingImpl()` is typically a lambda or callable. - When executed on the host with concrete shapes, `DimSymbol`s are bound to values and tiling is computed. - For fully static shapes, expressions fold to constants at compile time. - **Configuration and extension**: - Options such as `AutoScheduleOptions::enableSymbolAnalysis` enable symbolic equivalence analysis for dynamic shape tiling optimization. #### Multi-core reduce - Multi-core reduce is analyzed (e.g. via `analyzeMultiCoreReduceInfo()`) and applied when the kernel and pattern satisfy the required conditions (see dedicated documentation). ### Extending AutoSchedule: custom strategy This section describes how to **add a new fusion pattern and its scheduling strategy** in the HFusion AutoSchedule framework, including scheduler implementation, kernel info extension, and registration. #### Define a new FusionKind - In the HFusion enum definition (e.g. `HFusionEnums.td`), add a new fusion kind, e.g. `FusionKind::MyKind`. - In fusion analysis and pattern matching, ensure that fusion units with this kind are produced so that AutoSchedule can select the corresponding scheduler. #### Custom scheduler (inherit SchedulerBase) - Add a header (e.g. `MySchedule.h`) under `bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/` and define the scheduler class: ```cpp class MyScheduler : public SchedulerBase { public: explicit MyScheduler(func::FuncOp funcOpIn) : SchedulerBase(funcOpIn, FusionKind::MyKind) {} // 1. Kernel analysis and verification LogicalResult analyzeAndVerifyKernelImpl() override; // 2. Tiling computation (static / dynamic shape) TilingComputeFn calculateTilingImpl() override; // 3. Schedule creation (Transform Dialect primitives) LogicalResult createScheduleImpl(TilingKey key, OpBuilder &opBuilder) override; // 4. Optional: pre/post extensions LogicalResult runPreScheduleProcedure(OpBuilder &opBuilder) override; LogicalResult runPostScheduleProcedure(OpBuilder &opBuilder) override; }; ``` - Add implementation (e.g. `MySchedule.cpp`) and implement the overrides: - **analyzeAndVerifyKernelImpl()** - Use `KernelInfoCollector` to gather kernel info (reuse an existing `KernelInfo` or add a custom subclass). - Check that the fusion graph matches the strategy (op types, shape relations, etc.). - **calculateTilingImpl()** - Build and return `TilingComputeFn`: use `StmtExprBuilder` for static/dynamic dimension expressions; apply stride-align and tile-align; generate multiple `TilingCases` for different scenarios (e.g. small/large, different ranks) for selection. - **createScheduleImpl(TilingKey key, OpBuilder &opBuilder)** - For the chosen `TilingKey`, call schedule primitives in order: IO cache (`cacheRead`/`cacheWrite`), tiling (`tileUsingFor`/`tileUsingForAll`/`tileReductionUsingFor`), loop transforms (`fuseLoops`, `fuseIntoContaining`, `coalesceLoops`). Ensure the generated Transform sequence is correct and consistent with `KernelInfo`. - **runPreScheduleProcedure()** / **runPostScheduleProcedure()** (optional) - Add strategy-specific pre/post logic, e.g. pattern normalization, schedule validation, or statistics. #### Extend KernelInfo (optional) If the new strategy needs extra structured information, extend `KernelInfo` by subclassing: ```cpp class MyKernelInfo : public KernelInfo { public: MyKernelInfo(MLIRContext *ctx) : KernelInfo(FusionKind::MyKind, ctx) {} static bool classof(const KernelInfo *T) { return T->getFusionKind() == FusionKind::MyKind; } // Add fields and accessors required by this fusion pattern }; ``` In `KernelInfoCollector`, add handling for `FusionKind::MyKind` to construct and fill `MyKernelInfo` so the scheduler can use it in `analyzeAndVerifyKernelImpl()` and `calculateTilingImpl()`. #### Register the strategy In `AutoScheduleBase.cpp::applySchedule()`, add a branch: ```cpp case FusionKind::MyKind: scheduler = std::make_unique(funcOp); break; ``` Ensure `MySchedule.cpp` is in the build and linked into the HFusion Transform module; the new strategy is then available in the pipeline. #### Schedule primitives (Schedule API) quick reference Inside `createScheduleImpl()`, you can reuse the schedule APIs in `ScheduleOperations.cpp`: - **IO cache and buffer management** - `cacheRead` - `cacheWrite` - `setBufferSize` - **Tiling and loop structure control** - `tileUsingFor` - `tileUsingForAll` - `tileReductionUsingFor` - **Loop fuse and coalesce** - `fuseLoops` - `fuseIntoContaining` - `coalesceLoops` - **Multi-core binding** - `bindLoopToMulticore` - See AnyPBR for `getMultiCoreNum` and similar to determine core count configuration. By combining these primitives, you can implement flexible and efficient schedule strategies for new fusion patterns within the same framework. ### Internal mechanisms (brief) #### ValueHandle abstraction - The `ValueHandle` family provides a uniform abstraction over MLIR values, function arguments, and named values from different sources. - They offer a single interface for access and manipulation so scheduler code does not depend on low-level IR details. - This helps keep the code concise and maintainable in both schedule construction and Transform interpretation. #### Transform Dialect integration and interpretation - AutoSchedule does not modify operator IR directly inside the scheduler; it builds a **Transform Dialect program**. - `AutoScheduleInterpreter.cpp` is responsible for: - Receiving the schedule description produced by the scheduler; - Translating it into a sequence of Transform Dialect operations; - Applying these operations to the target `func::FuncOp` to perform the actual IR transformation. - This design keeps schedule logic decoupled from IR details and makes the schedule traceable (e.g., by printing the Transform program), debuggable, and reusable. #### Tiling computation framework - The `TilingInfo` and `Expr` system unify **dimension size**, **alignment rules**, and **dynamic shape** as expressions. - For static shape: - Expressions can be evaluated at compile time and folded into constant tiling parameters. - For dynamic shape: - They are evaluated in the host tiling function with concrete input shapes; - The same expressions serve both static and dynamic cases, reducing code duplication.