AutoSchedule

HFusion AutoSchedule: Design for Automatic Fusion and Scheduling

HFusion is a high-level framework on Bisheng IR for operator fusion and automatic scheduling. The AutoSchedule module generates efficient schedules for Ascend NPU once fusion units are fixed. Design goals include:

  • Automation: Choose scheduling and Tiling strategies from fusion patterns and operator traits.

  • Extensibility: Common scheduler base and kernel abstractions for new strategies.

  • Performance: Dynamic shape, multi-core reduce, and related optimizations.

  • Engineering: Schedules expressed as reusable, interpretable Transform Dialect sequences.

Hardware Background

The Ascend NPU uses a multi-level memory hierarchy: global memory (GM) has large capacity but high access latency, while on-chip memory (e.g., L1, UB) has lower latency but limited capacity. To leverage on-chip memory bandwidth and reduce global memory traffic:

  • Maximize on-chip memory utilization: Through large-scale operator fusion, multiple ops are merged into the same kernel, so intermediate results are reused in on-chip buffers and GM read/write counts are reduced.

  • Satisfy hardware access constraints: On-chip memory (e.g., UB) access must meet hardware requirements; for example, UB access requires 32-byte alignment (stride-align). Otherwise, misaligned access can cause runtime errors or performance degradation.

  • Fit Tiling and loop structure: When generating Tiling and loop nests, stride/size/tile alignment constraints must be respected so that the resulting kernel complies with hardware specifications.

AutoSchedule is designed around these hardware characteristics: it uses automated scheduling and Tiling strategies to maximize on-chip memory utilization while preserving legality.

Algorithm Principle

The core algorithm centers on large-scale operator fusion + dimension-mapping-driven loop generation + Transform Dialect fusion execution:

  1. Dimension Analyzer axis mapping

    • DimensionAnalyzer analyzes the axis mapping of each op in the kernel relative to the anchor (getCommonAxis, getNormalizedInterchange, etc.).

    • It establishes the correspondence between tensor dimensions and anchor dimensions, supporting broadcast, reduce, transpose, and other patterns.

    • This provides dimension-level information for subsequent Tiling and loop construction.

  2. Loop generation and op fusion

    • Based on the axis mapping, a unified loop structure is generated for the fusion graph, and ops are fused into the same loop via MLIR Transform Dialect primitives such as fuseIntoContaining, fuseLoops, and coalesceLoops.

    • During fusion, hardware constraints are explicitly enforced: stride-align (32-byte alignment), size-align, tile-align, etc., so the generated IR meets Ascend NPU memory access rules.

  3. Tiling computation and selection

    • Using StmtExprBuilder and the Expr system, together with static or dynamic shape, multiple candidate Tiling schemes (TilingCases) are generated.

    • getStrideAlignments() and getTileAlignments() are applied during Tiling; dimensions are adjusted with alignTo(alignment) to produce valid Tiling that satisfies stride-align and related constraints.

    • A best TilingKey is chosen (e.g., by cost or alignment), and the schedule description is built accordingly.

  4. Transform Dialect interpretation

    • The scheduler does not modify IR directly; it builds a Transform Dialect program. AutoScheduleInterpreter translates it into concrete Transform operations and applies them to the target IR so the schedule takes effect.

Interface Description

Code location

  • Headers (APIs and abstractions): bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/

  • Implementation: bishengir/lib/Dialect/HFusion/Transforms/AutoSchedule/

Core interfaces and abstractions

Type

Name

Description

Base class

SchedulerBase

Abstract base for all schedulers, encapsulating the common scheduling flow

Scheduler

PureElemwiseScheduler

Pure elementwise fusion strategy

Scheduler

AnyPBRScheduler

Generic strategy for Pointwise/Broadcast/Reduce and similar fusion patterns

Kernel

KernelInfo

Unified description of fused kernel IO, dimensions, alignment, multi-core

Alignment

getStrideAlignments()

Returns stride alignment constraints (dim index, unit), e.g. 32-byte align

Alignment

getSizeAlignments()

Returns size dimension alignment constraints

Alignment

getTileAlignments()

Returns tile dimension alignment constraints

Analysis

DimensionAnalyzer

getCommonAxis, getInterchange, getNormalizedInterchange, etc.

Primitive

cacheRead / cacheWrite

IO cache

Primitive

tileUsingFor / tileUsingForAll / tileReductionUsingFor

Tiling

Primitive

fuseLoops / fuseIntoContaining / coalesceLoops

Loop fusion and coalesce

Primitive

setBufferSize

Resource constraints

Strategy selection and call chain

  • Pass entry: The AutoSchedule pass is invoked in the HFusion pipeline and receives the func::FuncOp and fusion information.

  • Strategy selection: In AutoScheduleBase.cpp::applySchedule(), the scheduler is chosen by FusionKind:

    • FusionKind::PureElemwisePureElemwiseScheduler

    • FusionKind::AnyPB / FusionKind::LastAxisPBR / FusionKind::AnyPBRAnyPBRScheduler

  • Main flow: runPreScheduleProcedure()runScheduleProcedure() (including calculateTilingImpl(), createScheduleImpl()) → runPostScheduleProcedure() → Transform Dialect application.

Constraints and Capabilities

AutoSchedule explicitly handles the following constraints during scheduling and Tiling:

Constraint

Description

API / Implementation

Stride align

UB and other on-chip memory access must be 32-byte aligned

getStrideAlignments(), alignTo() in calculateTilingImpl()

Size align

Some ops (e.g., transpose, concat, cast) require tile/size alignment

getSizeAlignments()

Tile align

Combination of stride and size constraints, applied to Tiling schemes

getTileAlignments()

Reduce axis

Reduce, broadcast, extract_slice, transpose, etc. impose lowest-dim alignment

KernelInfo per-op stride-align logic

On-chip buffer

Buffer allocation limited by L1/UB capacity and maxBufferCnt

setBufferSize, maxBufferCnt

Multi-core reduce

Multi-core parallel reduce only when specific conditions hold

analyzeMultiCoreReduceInfo()

These constraints are applied uniformly in KernelInfo::getStrideAlignments() and scheduler-specific calculateTilingImpl() to ensure the generated schedule is valid and hardware-compatible.

Architecture overview

Core components

Scheduler base and strategies
  • SchedulerBase: Abstract base for all schedulers (AutoScheduleBase.h), encapsulating the common scheduling flow.

  • Concrete strategy schedulers:

    • PureElemwiseScheduler: Pure elementwise fusion (PureElemwiseSchedule.h/cpp).

    • AnyPBRScheduler: Generic strategy for AnyPBR (Pointwise/Broadcast/Reduce) and similar ops (AnyPBRSchedule.h/cpp).

Kernel and tiling abstraction
  • KernelInfo: Unified description of a fused kernel (KernelInfo.h), including IO, dimensions, alignment, and multi-core capability.

  • Tiling abstraction and utilities (TilingUtils.h/cpp):

    • TilingInfo, TilingStruct, TilingData: Describe a single or multiple candidate tiling schemes.

    • Expr / StmtExprBuilder: Build tiling expressions that depend on static or dynamic shape.

Schedule operations
  • ScheduleOperations.cpp: Implements reusable schedule primitives, including:

    • IO cache: cacheRead / cacheWrite

    • Tiling: tileUsingFor / tileUsingForAll / tileReductionUsingFor

    • Loop transforms: fuseLoops / fuseIntoContaining / coalesceLoops

    • Resource constraints: setBufferSize, etc.

Schedule interpretation
  • AutoScheduleInterpreter.cpp: Converts the high-level schedule description produced by the scheduler into Transform Dialect operations and applies them to the target IR so the schedule takes effect.

Strategy selection and call chain

The overall call chain is:

  • Pass entry

    • The AutoSchedule pass is invoked in the HFusion pipeline and receives the func::FuncOp and fusion information to process.

  • Strategy selection and scheduler construction

    • In AutoScheduleBase.cpp::applySchedule(), the scheduler is chosen by fusion kind FusionKind:

      • FusionKind::PureElemwisePureElemwiseScheduler

      • FusionKind::AnyPB / FusionKind::LastAxisPBR / FusionKind::AnyPBRAnyPBRScheduler

    • The scheduler instance is created with std::make_unique<...>(funcOp).

  • Main scheduling flow (SchedulerBase::runOnOperation())

    • Pre (runPreScheduleProcedure()): Insert IO cache, analyze fusion graph and legality; call analyzeAndVerifyKernelImpl() for strategy-specific kernel analysis and checks.

    • Schedule (runScheduleProcedure()): Call calculateTilingImpl() to get TilingComputeFn and candidate tiling; choose a TilingKey (e.g. by cost or alignment); call createScheduleImpl() to build the schedule for that key; pass the schedule to the Transform interpreter via applyScheduleImpl().

    • Post (runPostScheduleProcedure()): Optional structure cleanup and statistics.

  • Transform Dialect application

    • AutoScheduleInterpreter parses the schedule description, translates it into a sequence of Transform Dialect operations, and applies them to the HFusion IR.

Key data structures

KernelInfo (kernel description)
  • Abstracts the structure and constraints of a single fused kernel. Typical information includes:

    • Input/output tensors and their shape/layout.

    • Topology of ops in the fusion graph.

    • Stride, size, and tile alignment constraints for hardware.

    • Whether multi-core reduce is supported and which dimensions can be parallelized.

  • For specific fusion patterns, derived classes (e.g. AnyPBRKernelInfo) can add pattern-specific analysis.

Tiling (TilingUtils.h)
  • TilingData: Tiling parameters for a single dimension (constant or expression).

  • TilingStruct / TilingCases: A full tiling scheme and sets of candidate schemes.

  • Expr / StmtExprBuilder:

    • DimSymbol: Symbol for a dynamic dimension.

    • Expr: Arithmetic (e.g. dimension/factor, align-to granularity).

    • StmtExprBuilder: Builds Expr from IR shape and constants and generates the host-side tiling function.

ValueHandle
  • Uniform wrapper for MLIR Value, function arguments, and named values for consistent access and manipulation.

  • Common types include NamedValueHandle, FuncArgHandle, etc.

Scheduling strategies

PureElemwise scheduling strategy

  • Use case: Graphs that are mostly elementwise ops, without complex broadcast/reduce.

  • Implementation location: PureElemwiseSchedule.h/cpp.

  • Strategy characteristics:

    • Aims at regular loop structure with simple, regular tiling.

    • Emphasizes contiguous access and multi-level cache friendliness after fusion.

    • calculateTilingImpl() and createScheduleImpl() perform tiling and schedule primitive chaining.

AnyPBR scheduling strategy (AnyPBRScheduler)

  • Use case: Fused subgraphs containing broadcast, reduce, and similar patterns.

  • Implementation location: AnyPBRSchedule.h/cpp.

  • Capabilities:

    • Tiling:

      • In calculateTilingImpl(), considers stride alignment, dynamic shape symbols, reduce/broadcast axes, etc.; uses StmtExprBuilder to build expressions and produce multiple TilingCases.

    • Multi-core reduce analysis and utilization:

      • analyzeMultiCoreReduceInfo() determines whether multi-core reduce conditions are met (see §7.3).

    • Schedule construction:

      • In createScheduleImpl(), for the chosen TilingKey, enables concrete scheduling strategies, considering on-chip buffer allocation, axis-specific tiling, loop fuse/coalesce, and multi-core binding.

Main optimizations

This section summarizes stride-align, dynamic shape, and multi-core reduce and their role in AutoSchedule.

Stride-align memory alignment optimization

  • Goal: Avoid unaligned UB access.

  • APIs and data source:

    • KernelInfo::getStrideAlignments() (KernelInfo.h): Returns (dimension index, alignment) pairs describing the minimum stride alignment for certain dimensions during memory access.

    • Related APIs: getSizeAlignments() for size alignment; getTileAlignments() for tile alignment.

  • Usage in scheduling strategy (AnyPBR example):

    • In AnyPBRSchedule.cpp::calculateTilingImpl():

      • Generate initial tiling from problem size.

      • Iterate over getStrideAlignments() and getTileAlignments(), and adjust relevant dimension tiling sizes with alignTo(alignment).

      • Output stride-aligned TilingCases.

  • When: Stride-align is applied during tiling, i.e. when calculateTilingImpl() is called inside runScheduleProcedure().

Dynamic shape support

  • Problem and requirements:

    • Some dimensions (e.g., batch size, spatial size) may not be constant at compile time.

    • AutoSchedule must support computing suitable tiling parameters at runtime from actual input shapes.

  • Expr system design:

    • Expr / DimSymbol / StmtExprBuilder in TilingUtils.h form a lightweight expression framework:

      • DimSymbol: Symbol for a dimension (e.g., N, H, W); can be created via StmtExprBuilder::createDimSymbolExpr().

      • Expr: Supports basic arithmetic, e.g., N/4, min(N,64), (H*W)/factor.

      • StmtExprBuilder: Builds Expr from IR shape and constants; generates host-side tiling computation statements.

  • Host tiling function generation and execution:

    • TilingComputeFn from calculateTilingImpl() is typically a lambda or callable.

    • When executed on the host with concrete shapes, DimSymbols are bound to values and tiling is computed.

    • For fully static shapes, expressions fold to constants at compile time.

  • Configuration and extension:

    • Options such as AutoScheduleOptions::enableSymbolAnalysis enable symbolic equivalence analysis for dynamic shape tiling optimization.

Multi-core reduce

  • Multi-core reduce is analyzed (e.g. via analyzeMultiCoreReduceInfo()) and applied when the kernel and pattern satisfy the required conditions (see dedicated documentation).

Extending AutoSchedule: custom strategy

This section describes how to add a new fusion pattern and its scheduling strategy in the HFusion AutoSchedule framework, including scheduler implementation, kernel info extension, and registration.

Define a new FusionKind

  • In the HFusion enum definition (e.g. HFusionEnums.td), add a new fusion kind, e.g. FusionKind::MyKind.

  • In fusion analysis and pattern matching, ensure that fusion units with this kind are produced so that AutoSchedule can select the corresponding scheduler.

Custom scheduler (inherit SchedulerBase)

  • Add a header (e.g. MySchedule.h) under bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/ and define the scheduler class:

class MyScheduler : public SchedulerBase {
public:
  explicit MyScheduler(func::FuncOp funcOpIn)
      : SchedulerBase(funcOpIn, FusionKind::MyKind) {}

  // 1. Kernel analysis and verification
  LogicalResult analyzeAndVerifyKernelImpl() override;

  // 2. Tiling computation (static / dynamic shape)
  TilingComputeFn calculateTilingImpl() override;

  // 3. Schedule creation (Transform Dialect primitives)
  LogicalResult createScheduleImpl(TilingKey key,
                                 OpBuilder &opBuilder) override;

  // 4. Optional: pre/post extensions
  LogicalResult runPreScheduleProcedure(OpBuilder &opBuilder) override;
  LogicalResult runPostScheduleProcedure(OpBuilder &opBuilder) override;
};
  • Add implementation (e.g. MySchedule.cpp) and implement the overrides:

  • analyzeAndVerifyKernelImpl()

    • Use KernelInfoCollector to gather kernel info (reuse an existing KernelInfo or add a custom subclass).

    • Check that the fusion graph matches the strategy (op types, shape relations, etc.).

  • calculateTilingImpl()

    • Build and return TilingComputeFn: use StmtExprBuilder for static/dynamic dimension expressions; apply stride-align and tile-align; generate multiple TilingCases for different scenarios (e.g. small/large, different ranks) for selection.

  • createScheduleImpl(TilingKey key, OpBuilder &opBuilder)

    • For the chosen TilingKey, call schedule primitives in order: IO cache (cacheRead/cacheWrite), tiling (tileUsingFor/tileUsingForAll/tileReductionUsingFor), loop transforms (fuseLoops, fuseIntoContaining, coalesceLoops). Ensure the generated Transform sequence is correct and consistent with KernelInfo.

  • runPreScheduleProcedure() / runPostScheduleProcedure() (optional)

    • Add strategy-specific pre/post logic, e.g. pattern normalization, schedule validation, or statistics.

Extend KernelInfo (optional)

If the new strategy needs extra structured information, extend KernelInfo by subclassing:

class MyKernelInfo : public KernelInfo {
public:
  MyKernelInfo(MLIRContext *ctx)
      : KernelInfo(FusionKind::MyKind, ctx) {}

  static bool classof(const KernelInfo *T) {
    return T->getFusionKind() == FusionKind::MyKind;
  }

  // Add fields and accessors required by this fusion pattern
};

In KernelInfoCollector, add handling for FusionKind::MyKind to construct and fill MyKernelInfo so the scheduler can use it in analyzeAndVerifyKernelImpl() and calculateTilingImpl().

Register the strategy

In AutoScheduleBase.cpp::applySchedule(), add a branch:

case FusionKind::MyKind:
  scheduler = std::make_unique<MyScheduler>(funcOp);
  break;

Ensure MySchedule.cpp is in the build and linked into the HFusion Transform module; the new strategy is then available in the pipeline.

Schedule primitives (Schedule API) quick reference

Inside createScheduleImpl(), you can reuse the schedule APIs in ScheduleOperations.cpp:

  • IO cache and buffer management

    • cacheRead

    • cacheWrite

    • setBufferSize

  • Tiling and loop structure control

    • tileUsingFor

    • tileUsingForAll

    • tileReductionUsingFor

  • Loop fuse and coalesce

    • fuseLoops

    • fuseIntoContaining

    • coalesceLoops

  • Multi-core binding

    • bindLoopToMulticore

    • See AnyPBR for getMultiCoreNum and similar to determine core count configuration.

By combining these primitives, you can implement flexible and efficient schedule strategies for new fusion patterns within the same framework.

Internal mechanisms (brief)

ValueHandle abstraction

  • The ValueHandle family provides a uniform abstraction over MLIR values, function arguments, and named values from different sources.

  • They offer a single interface for access and manipulation so scheduler code does not depend on low-level IR details.

  • This helps keep the code concise and maintainable in both schedule construction and Transform interpretation.

Transform Dialect integration and interpretation

  • AutoSchedule does not modify operator IR directly inside the scheduler; it builds a Transform Dialect program.

  • AutoScheduleInterpreter.cpp is responsible for:

    • Receiving the schedule description produced by the scheduler;

    • Translating it into a sequence of Transform Dialect operations;

    • Applying these operations to the target func::FuncOp to perform the actual IR transformation.

  • This design keeps schedule logic decoupled from IR details and makes the schedule traceable (e.g., by printing the Transform program), debuggable, and reusable.

Tiling computation framework

  • The TilingInfo and Expr system unify dimension size, alignment rules, and dynamic shape as expressions.

  • For static shape:

    • Expressions can be evaluated at compile time and folded into constant tiling parameters.

  • For dynamic shape:

    • They are evaluated in the host tiling function with concrete input shapes;

    • The same expressions serve both static and dynamic cases, reducing code duplication.