AutoSchedule¶
HFusion AutoSchedule: Design for Automatic Fusion and Scheduling¶
HFusion is a high-level framework on Bisheng IR for operator fusion and automatic scheduling. The AutoSchedule module generates efficient schedules for Ascend NPU once fusion units are fixed. Design goals include:
Automation: Choose scheduling and Tiling strategies from fusion patterns and operator traits.
Extensibility: Common scheduler base and kernel abstractions for new strategies.
Performance: Dynamic shape, multi-core reduce, and related optimizations.
Engineering: Schedules expressed as reusable, interpretable Transform Dialect sequences.
Hardware Background¶
The Ascend NPU uses a multi-level memory hierarchy: global memory (GM) has large capacity but high access latency, while on-chip memory (e.g., L1, UB) has lower latency but limited capacity. To leverage on-chip memory bandwidth and reduce global memory traffic:
Maximize on-chip memory utilization: Through large-scale operator fusion, multiple ops are merged into the same kernel, so intermediate results are reused in on-chip buffers and GM read/write counts are reduced.
Satisfy hardware access constraints: On-chip memory (e.g., UB) access must meet hardware requirements; for example, UB access requires 32-byte alignment (stride-align). Otherwise, misaligned access can cause runtime errors or performance degradation.
Fit Tiling and loop structure: When generating Tiling and loop nests, stride/size/tile alignment constraints must be respected so that the resulting kernel complies with hardware specifications.
AutoSchedule is designed around these hardware characteristics: it uses automated scheduling and Tiling strategies to maximize on-chip memory utilization while preserving legality.
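The 32-byte alignment requirement above reduces to simple round-up arithmetic. The sketch below is a minimal, self-contained illustration; the helper name alignTo mirrors the API mentioned later in this document, while strideAlignElems and the element sizes are invented for the example:

```cpp
#include <cassert>
#include <cstdint>

// Round `n` up to the next multiple of `alignment` (alignment > 0).
// This is the arithmetic behind stride-align: a tile of `n` elements
// must occupy a multiple of 32 bytes in UB.
constexpr int64_t alignTo(int64_t n, int64_t alignment) {
  return ((n + alignment - 1) / alignment) * alignment;
}

// Given an element size in bytes, the minimum element-count granularity
// that keeps UB accesses 32-byte aligned.
constexpr int64_t strideAlignElems(int64_t elemBytes) {
  return 32 / elemBytes; // e.g. 16 for fp16, 8 for fp32
}
```

For example, a tile of 100 fp16 elements (2 bytes each) would be padded to alignTo(100, strideAlignElems(2)) = 112 elements, i.e. 224 bytes, a multiple of 32.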
Algorithm Principle¶
The core algorithm centers on large-scale operator fusion + dimension-mapping-driven loop generation + Transform Dialect fusion execution:
Dimension Analyzer axis mapping
DimensionAnalyzer analyzes the axis mapping of each op in the kernel relative to the anchor (getCommonAxis, getNormalizedInterchange, etc.). It establishes the correspondence between tensor dimensions and anchor dimensions, supporting broadcast, reduce, transpose, and other patterns.
This provides dimension-level information for subsequent Tiling and loop construction.
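To make the anchor-relative axis mapping concrete, here is a small self-contained sketch; the data layout is hypothetical (the real DimensionAnalyzer operates on MLIR ops), but it captures the idea that each op records which anchor dimension every tensor dimension maps to, and that two ops can share loops along the anchor axes both of them cover:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Anchor-relative axis map of one op: axisMap[i] is the anchor dimension
// that tensor dimension i corresponds to. Broadcast/reduce ops simply
// omit the anchor axes they do not cover.
using AxisMap = std::vector<int>;

// Anchor axes covered by both ops -- the axes along which they can share
// one loop nest (a simplified analogue of getCommonAxis).
std::vector<int> commonAxes(const AxisMap &a, const AxisMap &b) {
  std::vector<int> out;
  for (int axis : a)
    if (std::find(b.begin(), b.end(), axis) != b.end())
      out.push_back(axis);
  return out;
}
```

An elementwise op over an [N, H, W] anchor has map {0, 1, 2}; a reduce over the last axis yields a result with map {0, 1}; their common axes are {0, 1}, so the two ops can share the outer two loops.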
Loop generation and op fusion
Based on the axis mapping, a unified loop structure is generated for the fusion graph, and ops are fused into the same loop via MLIR Transform Dialect primitives such as fuseIntoContaining, fuseLoops, and coalesceLoops. During fusion, hardware constraints are explicitly enforced: stride-align (32-byte alignment), size-align, tile-align, etc., so the generated IR meets Ascend NPU memory access rules.
Tiling computation and selection
Using StmtExprBuilder and the Expr system, together with static or dynamic shape information, multiple candidate Tiling schemes (TilingCases) are generated. getStrideAlignments() and getTileAlignments() are applied during Tiling; dimensions are adjusted with alignTo(alignment) to produce valid Tiling that satisfies stride-align and related constraints. The best TilingKey is chosen (e.g., by cost or alignment), and the schedule description is built accordingly.
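The generate-align-select loop can be illustrated with a self-contained mock; the struct names echo TilingCases/TilingKey, but the candidate guesses and the cost model here are invented for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct TilingCase {
  int64_t key;      // analogous to a TilingKey
  int64_t tileSize; // tile size along the innermost dimension
};

int64_t alignTo(int64_t n, int64_t alignment) {
  return ((n + alignment - 1) / alignment) * alignment;
}

// Generate candidates from a few tile-size guesses, align each one,
// and drop candidates that overflow the on-chip buffer capacity.
std::vector<TilingCase> generateCases(int64_t dim, int64_t align,
                                      int64_t bufferCap) {
  std::vector<TilingCase> cases;
  int64_t key = 0;
  for (int64_t guess : {dim, dim / 2, dim / 4}) {
    int64_t t = alignTo(std::max<int64_t>(guess, 1), align);
    if (t <= bufferCap)
      cases.push_back({key++, t});
  }
  return cases;
}

// Toy cost model: prefer the largest tile that fits (fewest iterations).
// Assumes `cases` is non-empty.
TilingCase selectBest(const std::vector<TilingCase> &cases) {
  TilingCase best = cases.front();
  for (const auto &c : cases)
    if (c.tileSize > best.tileSize)
      best = c;
  return best;
}
```

For dim = 1000, align = 16, and a capacity of 512 elements, the full-size candidate is rejected and the half-size candidate, padded to 512, wins.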
Transform Dialect interpretation
The scheduler does not modify IR directly; it builds a Transform Dialect program. AutoScheduleInterpreter translates it into concrete Transform operations and applies them to the target IR so the schedule takes effect.
Interface Description¶
Code location¶
Headers (APIs and abstractions): bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/
Implementation: bishengir/lib/Dialect/HFusion/Transforms/AutoSchedule/
Core interfaces and abstractions¶
| Type | Name | Description |
|---|---|---|
| Base class | SchedulerBase | Abstract base for all schedulers, encapsulating the common scheduling flow |
| Scheduler | PureElemwiseScheduler | Pure elementwise fusion strategy |
| Scheduler | AnyPBRScheduler | Generic strategy for Pointwise/Broadcast/Reduce and similar fusion patterns |
| Kernel | KernelInfo | Unified description of fused kernel IO, dimensions, alignment, multi-core |
| Alignment | getStrideAlignments() | Returns stride alignment constraints (dim index, unit), e.g. 32-byte align |
| Alignment | getSizeAlignments() | Returns size dimension alignment constraints |
| Alignment | getTileAlignments() | Returns tile dimension alignment constraints |
| Analysis | DimensionAnalyzer | Axis mapping of each op relative to the anchor |
| Primitive | cacheRead / cacheWrite | IO cache |
| Primitive | tileUsingFor / tileUsingForAll / tileReductionUsingFor | Tiling |
| Primitive | fuseLoops / fuseIntoContaining / coalesceLoops | Loop fusion and coalesce |
| Primitive | setBufferSize | Resource constraints |
Strategy selection and call chain¶
Pass entry: The AutoSchedule pass is invoked in the HFusion pipeline and receives the func::FuncOp and fusion information.
Strategy selection: In AutoScheduleBase.cpp::applySchedule(), the scheduler is chosen by FusionKind:
FusionKind::PureElemwise → PureElemwiseScheduler
FusionKind::AnyPB / FusionKind::LastAxisPBR / FusionKind::AnyPBR → AnyPBRScheduler
Main flow: runPreScheduleProcedure() → runScheduleProcedure() (including calculateTilingImpl(), createScheduleImpl()) → runPostScheduleProcedure() → Transform Dialect application.
Constraints and Capabilities¶
AutoSchedule explicitly handles the following constraints during scheduling and Tiling:
| Constraint | Description | API / Implementation |
|---|---|---|
| Stride align | UB and other on-chip memory access must be 32-byte aligned | KernelInfo::getStrideAlignments() |
| Size align | Some ops (e.g., transpose, concat, cast) require tile/size alignment | KernelInfo::getSizeAlignments() |
| Tile align | Combination of stride and size constraints, applied to Tiling schemes | KernelInfo::getTileAlignments() |
| Reduce axis | Reduce, broadcast, extract_slice, transpose, etc. impose lowest-dim alignment |  |
| On-chip buffer | Buffer allocation limited by L1/UB capacity | setBufferSize |
| Multi-core reduce | Multi-core parallel reduce only when specific conditions hold | analyzeMultiCoreReduceInfo() |
These constraints are applied uniformly in KernelInfo::getStrideAlignments() and scheduler-specific calculateTilingImpl() to ensure the generated schedule is valid and hardware-compatible.
Architecture overview¶
Core components¶
Scheduler base and strategies¶
SchedulerBase: Abstract base for all schedulers (AutoScheduleBase.h), encapsulating the common scheduling flow.
Concrete strategy schedulers:
PureElemwiseScheduler: Pure elementwise fusion (PureElemwiseSchedule.h/cpp).
AnyPBRScheduler: Generic strategy for AnyPBR (Pointwise/Broadcast/Reduce) and similar ops (AnyPBRSchedule.h/cpp).
Kernel and tiling abstraction¶
KernelInfo: Unified description of a fused kernel (KernelInfo.h), including IO, dimensions, alignment, and multi-core capability.
Tiling abstraction and utilities (TilingUtils.h/cpp):
TilingInfo, TilingStruct, TilingData: Describe single or multiple candidate tiling schemes.
Expr / StmtExprBuilder: Build tiling expressions that depend on static or dynamic shape.
Schedule operations¶
ScheduleOperations.cpp: Implements reusable schedule primitives, including:
IO cache: cacheRead / cacheWrite
Tiling: tileUsingFor / tileUsingForAll / tileReductionUsingFor
Loop transforms: fuseLoops / fuseIntoContaining / coalesceLoops
Resource constraints: setBufferSize, etc.
Schedule interpretation¶
AutoScheduleInterpreter.cpp: Converts the high-level schedule description produced by the scheduler into Transform Dialect operations and applies them to the target IR so the schedule takes effect.
Strategy selection and call chain¶
The overall call chain is:
Pass entry
The AutoSchedule pass is invoked in the HFusion pipeline and receives the func::FuncOp and fusion information to process.
Strategy selection and scheduler construction
In AutoScheduleBase.cpp::applySchedule(), the scheduler is chosen by fusion kind FusionKind:
FusionKind::PureElemwise → PureElemwiseScheduler
FusionKind::AnyPB / FusionKind::LastAxisPBR / FusionKind::AnyPBR → AnyPBRScheduler
The scheduler instance is created with std::make_unique<...>(funcOp).
Main scheduling flow (SchedulerBase::runOnOperation())
Pre (runPreScheduleProcedure()): Insert IO cache, analyze fusion graph and legality; call analyzeAndVerifyKernelImpl() for strategy-specific kernel analysis and checks.
Schedule (runScheduleProcedure()): Call calculateTilingImpl() to get TilingComputeFn and candidate tiling; choose a TilingKey (e.g. by cost or alignment); call createScheduleImpl() to build the schedule for that key; pass the schedule to the Transform interpreter via applyScheduleImpl().
Post (runPostScheduleProcedure()): Optional structure cleanup and statistics.
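This flow is a template method: the base class fixes the phase order, strategies override only the hooks. A standalone sketch of that structure (hook names follow the document; everything else, including the recorded trace, is mock scaffolding for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal mock of the SchedulerBase template method.
class SchedulerBaseMock {
public:
  virtual ~SchedulerBaseMock() = default;
  std::vector<std::string> trace; // records phase order for illustration

  void runOnOperation() {
    runPreScheduleProcedure();
    runScheduleProcedure();
    runPostScheduleProcedure();
  }

protected:
  virtual void analyzeAndVerifyKernelImpl() = 0;
  virtual void calculateTilingImpl() = 0;
  virtual void createScheduleImpl() = 0;

  void runPreScheduleProcedure() { analyzeAndVerifyKernelImpl(); }
  void runScheduleProcedure() {
    calculateTilingImpl(); // produce candidate tilings, pick a TilingKey
    createScheduleImpl();  // build the Transform program for that key
  }
  void runPostScheduleProcedure() { trace.push_back("post"); }
};

// A strategy fills in only the hooks, as PureElemwise/AnyPBR schedulers do.
class MockScheduler : public SchedulerBaseMock {
  void analyzeAndVerifyKernelImpl() override { trace.push_back("analyze"); }
  void calculateTilingImpl() override { trace.push_back("tiling"); }
  void createScheduleImpl() override { trace.push_back("schedule"); }
};

// Convenience helper for checking the phase order.
std::vector<std::string> runMock() {
  MockScheduler s;
  s.runOnOperation();
  return s.trace;
}
```

Adding a new strategy then means subclassing and overriding the three hooks; the ordering guarantees of the flow come for free from the base class.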
Transform Dialect application
AutoScheduleInterpreter parses the schedule description, translates it into a sequence of Transform Dialect operations, and applies them to the HFusion IR.
Key data structures¶
KernelInfo (kernel description)¶
Abstracts the structure and constraints of a single fused kernel. Typical information includes:
Input/output tensors and their shape/layout.
Topology of ops in the fusion graph.
Stride, size, and tile alignment constraints for hardware.
Whether multi-core reduce is supported and which dimensions can be parallelized.
For specific fusion patterns, derived classes (e.g. AnyPBRKernelInfo) can add pattern-specific analysis.
Tiling (TilingUtils.h)¶
TilingData: Tiling parameters for a single dimension (constant or expression).
TilingStruct / TilingCases: A full tiling scheme and sets of candidate schemes.
Expr / StmtExprBuilder:
DimSymbol: Symbol for a dynamic dimension.
Expr: Arithmetic (e.g. dimension/factor, align-to granularity).
StmtExprBuilder: Builds Expr from IR shape and constants and generates the host-side tiling function.
ValueHandle¶
Uniform wrapper for MLIR Value, function arguments, and named values for consistent access and manipulation.
Common types include NamedValueHandle, FuncArgHandle, etc.
Scheduling strategies¶
PureElemwise scheduling strategy¶
Use case: Graphs that are mostly elementwise ops, without complex broadcast/reduce.
Implementation location: PureElemwiseSchedule.h/cpp.
Strategy characteristics:
Aims at regular loop structure with simple, regular tiling.
Emphasizes contiguous access and multi-level cache friendliness after fusion.
calculateTilingImpl() and createScheduleImpl() perform tiling and schedule primitive chaining.
AnyPBR scheduling strategy (AnyPBRScheduler)¶
Use case: Fused subgraphs containing broadcast, reduce, and similar patterns.
Implementation location: AnyPBRSchedule.h/cpp.
Capabilities:
Tiling: In calculateTilingImpl(), considers stride alignment, dynamic shape symbols, reduce/broadcast axes, etc.; uses StmtExprBuilder to build expressions and produce multiple TilingCases.
Multi-core reduce analysis and utilization: analyzeMultiCoreReduceInfo() determines whether multi-core reduce conditions are met (see the Multi-core reduce section).
Schedule construction: In createScheduleImpl(), for the chosen TilingKey, enables concrete scheduling strategies, considering on-chip buffer allocation, axis-specific tiling, loop fuse/coalesce, and multi-core binding.
Main optimizations¶
This section summarizes stride-align, dynamic shape, and multi-core reduce and their role in AutoSchedule.
Stride-align memory alignment optimization¶
Goal: Avoid unaligned UB access.
APIs and data source:
KernelInfo::getStrideAlignments() (KernelInfo.h): Returns (dimension index, alignment) pairs describing the minimum stride alignment for certain dimensions during memory access.
Related APIs: getSizeAlignments() for size alignment; getTileAlignments() for tile alignment.
Usage in scheduling strategy (AnyPBR example):
In AnyPBRSchedule.cpp::calculateTilingImpl():
Generate initial tiling from problem size.
Iterate over getStrideAlignments() and getTileAlignments(), and adjust relevant dimension tiling sizes with alignTo(alignment).
Output stride-aligned TilingCases.
When: Stride-align is applied during tiling, i.e. when calculateTilingImpl() is called inside runScheduleProcedure().
Dynamic shape support¶
Problem and requirements:
Some dimensions (e.g., batch size, spatial size) may not be constant at compile time.
AutoSchedule must support computing suitable tiling parameters at runtime from actual input shapes.
Expr system design:
Expr / DimSymbol / StmtExprBuilder in TilingUtils.h form a lightweight expression framework:
DimSymbol: Symbol for a dimension (e.g., N, H, W); can be created via StmtExprBuilder::createDimSymbolExpr().
Expr: Supports basic arithmetic, e.g., N/4, min(N,64), (H*W)/factor.
StmtExprBuilder: Builds Expr from IR shape and constants; generates host-side tiling computation statements.
Host tiling function generation and execution:
TilingComputeFn from calculateTilingImpl() is typically a lambda or callable.
When executed on the host with concrete shapes, DimSymbols are bound to values and tiling is computed.
For fully static shapes, expressions fold to constants at compile time.
Configuration and extension:
Options such as AutoScheduleOptions::enableSymbolAnalysis enable symbolic equivalence analysis for dynamic shape tiling optimization.
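A tiny self-contained analogue of this expression framework can make the deferred-evaluation idea concrete. The real Expr/StmtExprBuilder are MLIR-based; here, purely for illustration, symbols are bound from a name-to-value map on the "host":

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// An expression is deferred arithmetic over a runtime shape binding,
// mirroring how an Expr tree is evaluated once DimSymbols get values.
using Shape = std::map<std::string, int64_t>;
using Expr = std::function<int64_t(const Shape &)>;

Expr dimSymbol(std::string name) {
  return [name](const Shape &s) { return s.at(name); };
}
Expr constant(int64_t v) {
  return [v](const Shape &) { return v; };
}
Expr div(Expr a, Expr b) {
  return [a, b](const Shape &s) { return a(s) / b(s); };
}
Expr alignToExpr(Expr a, int64_t align) {
  return [a, align](const Shape &s) {
    return (a(s) + align - 1) / align * align;
  };
}

// Example "host tiling function": tile = alignTo(N / 4, 16).
int64_t tileFor(int64_t n) {
  Expr tile = alignToExpr(div(dimSymbol("N"), constant(4)), 16);
  return tile({{"N", n}});
}
```

With N bound to 1000 at runtime, the expression evaluates to 256; for a fully static N the same expression could simply be folded to that constant at compile time.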
Multi-core reduce¶
Multi-core reduce is analyzed (e.g. via analyzeMultiCoreReduceInfo()) and applied when the kernel and pattern satisfy the required conditions (see the dedicated documentation).
Extending AutoSchedule: custom strategy¶
This section describes how to add a new fusion pattern and its scheduling strategy in the HFusion AutoSchedule framework, including scheduler implementation, kernel info extension, and registration.
Define a new FusionKind¶
In the HFusion enum definition (e.g. HFusionEnums.td), add a new fusion kind, e.g. FusionKind::MyKind.
In fusion analysis and pattern matching, ensure that fusion units with this kind are produced so that AutoSchedule can select the corresponding scheduler.
Custom scheduler (inherit SchedulerBase)¶
Add a header (e.g. MySchedule.h) under bishengir/include/bishengir/Dialect/HFusion/Transforms/AutoSchedule/ and define the scheduler class:
class MyScheduler : public SchedulerBase {
public:
explicit MyScheduler(func::FuncOp funcOpIn)
: SchedulerBase(funcOpIn, FusionKind::MyKind) {}
// 1. Kernel analysis and verification
LogicalResult analyzeAndVerifyKernelImpl() override;
// 2. Tiling computation (static / dynamic shape)
TilingComputeFn calculateTilingImpl() override;
// 3. Schedule creation (Transform Dialect primitives)
LogicalResult createScheduleImpl(TilingKey key,
OpBuilder &opBuilder) override;
// 4. Optional: pre/post extensions
LogicalResult runPreScheduleProcedure(OpBuilder &opBuilder) override;
LogicalResult runPostScheduleProcedure(OpBuilder &opBuilder) override;
};
Add implementation (e.g. MySchedule.cpp) and implement the overrides:
analyzeAndVerifyKernelImpl()
Use KernelInfoCollector to gather kernel info (reuse an existing KernelInfo or add a custom subclass).
Check that the fusion graph matches the strategy (op types, shape relations, etc.).
calculateTilingImpl()
Build and return TilingComputeFn: use StmtExprBuilder for static/dynamic dimension expressions; apply stride-align and tile-align; generate multiple TilingCases for different scenarios (e.g. small/large, different ranks) for selection.
createScheduleImpl(TilingKey key, OpBuilder &opBuilder)
For the chosen TilingKey, call schedule primitives in order: IO cache (cacheRead/cacheWrite), tiling (tileUsingFor/tileUsingForAll/tileReductionUsingFor), loop transforms (fuseLoops, fuseIntoContaining, coalesceLoops). Ensure the generated Transform sequence is correct and consistent with KernelInfo.
runPreScheduleProcedure() / runPostScheduleProcedure() (optional)
Add strategy-specific pre/post logic, e.g. pattern normalization, schedule validation, or statistics.
Extend KernelInfo (optional)¶
If the new strategy needs extra structured information, extend KernelInfo by subclassing:
class MyKernelInfo : public KernelInfo {
public:
MyKernelInfo(MLIRContext *ctx)
: KernelInfo(FusionKind::MyKind, ctx) {}
static bool classof(const KernelInfo *T) {
return T->getFusionKind() == FusionKind::MyKind;
}
// Add fields and accessors required by this fusion pattern
};
In KernelInfoCollector, add handling for FusionKind::MyKind to construct and fill MyKernelInfo so the scheduler can use it in analyzeAndVerifyKernelImpl() and calculateTilingImpl().
Register the strategy¶
In AutoScheduleBase.cpp::applySchedule(), add a branch:
case FusionKind::MyKind:
scheduler = std::make_unique<MyScheduler>(funcOp);
break;
Ensure MySchedule.cpp is in the build and linked into the HFusion Transform module; the new strategy is then available in the pipeline.
Schedule primitives (Schedule API) quick reference¶
Inside createScheduleImpl(), you can reuse the schedule APIs in ScheduleOperations.cpp:
IO cache and buffer management: cacheRead, cacheWrite, setBufferSize
Tiling and loop structure control: tileUsingFor, tileUsingForAll, tileReductionUsingFor
Loop fuse and coalesce: fuseLoops, fuseIntoContaining, coalesceLoops
Multi-core binding: bindLoopToMulticore; see AnyPBR for getMultiCoreNum and similar to determine core count configuration.
By combining these primitives, you can implement flexible and efficient schedule strategies for new fusion patterns within the same framework.
Internal mechanisms (brief)¶
ValueHandle abstraction¶
The ValueHandle family provides a uniform abstraction over MLIR values, function arguments, and named values from different sources.
They offer a single interface for access and manipulation so scheduler code does not depend on low-level IR details.
This helps keep the code concise and maintainable in both schedule construction and Transform interpretation.
Transform Dialect integration and interpretation¶
AutoSchedule does not modify operator IR directly inside the scheduler; it builds a Transform Dialect program. AutoScheduleInterpreter.cpp is responsible for:
Receiving the schedule description produced by the scheduler;
Translating it into a sequence of Transform Dialect operations;
Applying these operations to the target func::FuncOp to perform the actual IR transformation.
This design keeps schedule logic decoupled from IR details and makes the schedule traceable (e.g., by printing the Transform program), debuggable, and reusable.
Tiling computation framework¶
The TilingInfo and Expr system unify dimension size, alignment rules, and dynamic shape as expressions.
For static shape:
Expressions can be evaluated at compile time and folded into constant tiling parameters.
For dynamic shape:
They are evaluated in the host tiling function with concrete input shapes;
The same expressions serve both static and dynamic cases, reducing code duplication.