# Framework Integration

AscendNPU IR supports framework integration (PyTorch/TensorFlow/MindSpore) in two ways:

- **DSL integration**: Integrate via domain-specific languages such as Triton and TileLang, which compile to AscendNPU IR.
- **IR integration**: Integrate via IR representation, supporting multi-level IR (Torch IR, Linalg/HFusion IR, HIVM IR), with automatic fusion and tiling for Ascend-friendly kernels.

## DSL integration

AscendNPU IR supports upstream integration with languages and frameworks such as Triton and TileLang, so that third-party DSLs can target Ascend hardware and run custom operators on the NPU.

| Integration | Description |
|-------------|-------------|
| [Triton interface](triton_interface.md) | Use Triton to write high-performance kernels and run them on Ascend NPU via Triton Ascend. Covers installation, environment, op mapping, and Ascend extensions. |
| [TileLang interface](tile_lang_interface.md) | Use TileLang Ascend (a tile-lang/TVM-based DSL) to develop kernels for Ascend NPU (e.g., GEMM, vector ops, attention). Covers environment, build, and quick start. |

## IR integration

AscendNPU IR supports multi-level IR integration; the levels differ in abstraction and control granularity (see [Interface API - Multi-level IR Abstraction](interface_api.md#multi-level-ir-abstraction)):

- **Torch IR**: Framework-level ATen ops, lowered to Linalg/HFusion via passes.
- **Linalg/HFusion IR**: General tensor algebra and hardware-aware fusion layer; standard MLIR dialects express operator semantics, and HFusion performs fusion, tiling, and scheduling automatically.
- **HIVM IR**: NPU instruction layer; direct mapping to hardware instructions, with explicit control of the memory hierarchy (GM/UB/L1/L0) and compute pipelines (Vector/Cube/MTE) for fine-grained tuning.

### Torch IR integration

Use Torch dialect ATen ops; passes such as `convert-torch-to-hfusion` lower them to Linalg/HFusion named ops, which then enter the fusion and scheduling flow.
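When an ATen elementwise op takes operands of different ranks (as in the `torch_mul` example below, which multiplies a `[4096]` tensor by a `[1,56,4096]` tensor), the lowering must materialize an explicit `linalg.broadcast` before the elementwise op. A minimal pure-Python sketch of the NumPy-style shape rule involved; `broadcast_shape` is a hypothetical helper for illustration only, not part of the BishengIR toolchain:

```python
def broadcast_shape(a, b):
    """Compute the NumPy-style broadcast shape of two tensor shapes.

    Illustrative only: this mirrors the shape inference a lowering must
    perform before inserting an explicit broadcast, it is not a real API.
    """
    n = max(len(a), len(b))
    # Right-align both shapes, padding the shorter one with 1s.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible dimensions {x} and {y}")
        out.append(max(x, y))
    return tuple(out)

# The running example of this section: [4096] against [1,56,4096].
print(broadcast_shape((4096,), (1, 56, 4096)))
```

For the example above this yields `(1, 56, 4096)`: the `[4096]` operand is expanded along the two leading dimensions, which is exactly what `dimensions = [0, 1]` on the lowered `linalg.broadcast` records.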
#### Torch → AscendNPU IR pipeline

Torch IR is integrated via the `torch-backend-to-named-op-backend-pipeline` conversion pipeline. The custom `convert-torch-to-hfusion` pass first lowers Torch ATen ops to Linalg/HFusion named ops; ops it does not cover fall back to the standard lowering path of upstream torch-mlir.

Main conversion stages:

- `convert-torch-to-hfusion`: BishengIR custom lowering of 55+ ATen ops to Linalg/HFusion named ops.
- `convert-torch-to-linalg`: Upstream torch-mlir lowering for the remaining ops.
- `convert-torch-to-scf / arith / tensor`: Upstream torch-mlir conversion of control flow, arithmetic, and tensor ops.
- `func-backend-type-conversion`: Converts Torch types (`!torch.vtensor`) to builtin types (`tensor`).

#### Example `torch.mlir`

```mlir
func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16>
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16>
  return %0 : !torch.vtensor<[1,56,4096],f16>
}
```

Invocation: two methods are available; both go through the same compile pipeline.

- **Stepwise conversion**: Converts Torch IR to Linalg/HFusion IR first; suitable for caching or inspecting intermediate IR. After conversion, use `torch_to_hfusion.mlir` as input and continue with the [Linalg/HFusion IR integration](#linalghfusion-ir-integration) flow to produce a binary.
  - Command: `bishengir-opt -torch-backend-to-named-op-backend-pipeline torch.mlir -o torch_to_hfusion.mlir`
  - **Expected output**: MLIR text file (`.mlir` format) containing the converted Linalg/HFusion IR.
  For example:

  ```mlir
  func.func @torch.aten.mul_tensor(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
      attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
    %0 = tensor.empty() : tensor<1x56x4096xf16>
    %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1]
    %1 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
    return %1 : tensor<1x56x4096xf16>
  }
  ```

- **End-to-end compilation**: Uses `bishengir-compile` to compile Torch IR directly to an executable binary, running through the full Torch → HFusion → HIVM IR compile pipeline.
  - Command: `bishengir-compile -enable-torch-compile=true -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 torch.mlir -o torch_kernel.o`
  - **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

#### Supported Torch ops

##### Elementwise binary

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.add.Tensor` / `aten.add.Scalar` | `linalg.binary_fn` |
| `aten.sub.Tensor` / `aten.sub.Scalar` | `linalg.binary_fn` |
| `aten.mul.Tensor` / `aten.mul.Scalar` | `linalg.binary_fn` |
| `aten.div.Tensor` / `aten.div.Scalar` | `linalg.binary_fn` |
| `aten.maximum` | `linalg.binary_fn` |
| `aten.minimum` | `linalg.binary_fn` |
| `aten.clamp_min` / `aten.clamp_min.Tensor` | `linalg.binary_fn` |
| `aten.clamp_max` / `aten.clamp_max.Tensor` | `linalg.binary_fn` |
| `aten.clamp` | Combination of `max_signed` + `min_signed` |
| `aten.pow.Tensor_Tensor` / `aten.pow.Tensor_Scalar` / `aten.pow.Scalar` | `hfusion.binary_fn` |
| `aten.logical_and` | `hfusion.binary_fn` |
| `aten.logical_or` | `hfusion.binary_fn` |

##### Elementwise unary

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.abs` | `linalg.unary_fn` |
| `aten.ceil` | `linalg.unary_fn` |
| `aten.floor` | `linalg.unary_fn` |
| `aten.neg` | `linalg.unary_fn` |
| `aten.log` | `linalg.unary_fn` |
| `aten.exp` | `linalg.unary_fn` |
| `aten.reciprocal` | `hfusion.unary_fn` |
| `aten.relu` | `hfusion.unary_fn` |
| `aten.rsqrt` | `hfusion.unary_fn` |
| `aten.sqrt` | `hfusion.unary_fn` |
| `aten.erf` | `hfusion.unary_fn` |
| `aten.tanh` | `hfusion.unary_fn` |
| `aten.sin` | `hfusion.unary_fn` |
| `aten.cos` | `hfusion.unary_fn` |
| `aten.bitwise_not` | `hfusion.unary_fn` |
| `aten.sigmoid` | Decomposed to negf -> exp -> add -> div |
| `aten.gelu` | Decomposed to tanh-based approximation |

##### Compare

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.gt.Scalar` / `aten.gt.Tensor` | `hfusion.compare_fn` |
| `aten.lt.Scalar` / `aten.lt.Tensor` | `hfusion.compare_fn` |
| `aten.ge.Scalar` / `aten.ge.Tensor` | `hfusion.compare_fn` |
| `aten.le.Scalar` / `aten.le.Tensor` | `hfusion.compare_fn` |
| `aten.eq.Scalar` / `aten.eq.Tensor` | `hfusion.compare_fn` |
| `aten.ne.Scalar` / `aten.ne.Tensor` | `hfusion.compare_fn` |

##### Reduction

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.sum` / `aten.sum.dim_IntList` | `linalg.reduce` + `arith.addf/addi` |
| `aten.prod` / `aten.prod.dim_int` | `linalg.reduce` + `arith.mulf/muli` |
| `aten.max` | `linalg.reduce` + `arith.maximumf/maxsi` |
| `aten.min` | `linalg.reduce` + `arith.minimumf/minsi` |
| `aten.max.dim` | `hfusion.reduce_with_index` (MAX) |
| `aten.min.dim` | `hfusion.reduce_with_index` (MIN) |
| `aten.any` / `aten.any.dim` / `aten.any.dims` | `linalg.reduce` + `arith.ori` |
| `aten.all` / `aten.all.dim` | `linalg.reduce` + `arith.andi` |

##### Data movement

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.permute` | `linalg.transpose` |
| `aten.broadcast_to` | `linalg.broadcast` |

##### Other

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.to.dtype` | `hfusion.cast` |
| `aten.where.self` | `hfusion.select` |
| `aten.arange.start_step` | `hfusion.arange` |

### Linalg/HFusion IR integration

Use Linalg/Tensor, HFusion, and other standard MLIR dialects to express operator semantics; the input goes directly into the Linalg/HFusion IR layer's fusion and scheduling flow.

#### Example `hfusion.mlir`

```mlir
func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>, %arg1: tensor<40960x1024xf32>, %arg2: tensor<40960x1024xf32>, %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %1 = tensor.empty() : tensor<40960x1024xf32>
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3 : tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>) outs(%4 : tensor<40960xf32>) dimensions = [1]
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>) outs(%5 : tensor<40960xf32>) -> tensor<40960xf32>
  return %6 : tensor<40960xf32>
}
```

Invocation:

- Command: `bishengir-compile -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 hfusion.mlir -o hfusion_kernel.o`
- **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

#### Automatic fusion

Once Linalg/HFusion IR is ingested, the HFusion compile flow performs **automatic fusion and scheduling** on eligible ops: multiple ops are merged into the same kernel so that intermediate results are reused in on-chip memory and global-memory traffic is reduced; scheduling and tiling strategies are selected automatically based on fusion patterns and operator traits, producing efficient schedules for the Ascend NPU. After fusion, the IR passes through tiling, loop generation, Transform dialect application, and similar steps before being lowered to HIVM and emitted as an executable binary.

Supported op types:

- Elemwise
- Broadcast
- Reduce
- Transpose
- Concat

For algorithm details, constraints, architecture, and related topics, see [HFusion AutoSchedule: Automatic Fusion and Scheduling](../features/AutoSchedule/HFusion_AutoSchedule.md).

### HIVM IR integration

For fine-grained hardware control, you can write kernels directly in the HIVM dialect, managing the memory hierarchy and compute pipelines explicitly.
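At this level, data movement is explicit: inputs are staged from global memory (GM) into the on-chip Unified Buffer (UB), computed on-chip, and written back to GM. A plain-Python conceptual sketch of that load/compute/store pattern, using lists as stand-ins for memory regions; this illustrates the programming model only and is not the HIVM API:

```python
def vadd_kernel(gm_a, gm_b, gm_c):
    """Conceptual model of an explicit GM -> UB -> GM vector add.

    Illustrative only: lists stand in for memory regions, and the
    comments map each step to the HIVM op that expresses it.
    """
    ub_a = list(gm_a)                            # hivm.hir.load: DMA GM -> UB
    ub_b = list(gm_b)                            # hivm.hir.load: DMA GM -> UB
    ub_c = [x + y for x, y in zip(ub_a, ub_b)]   # hivm.hir.vadd: compute in UB
    gm_c[:] = ub_c                               # hivm.hir.store: DMA UB -> GM

a, b, c = [1.0] * 16, [2.0] * 16, [0.0] * 16
vadd_kernel(a, b, c)   # c becomes [3.0] * 16
```

The HIVM example that follows expresses the same four steps directly in IR, with each buffer's location made explicit through address-space annotations.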
#### Example `hivm.mlir`

```mlir
func.func @hivm_vadd(%valueA: memref<16xf16, #hivm.address_space<gm>>, %valueB: memref<16xf16, #hivm.address_space<gm>>, %valueC: memref<16xf16, #hivm.address_space<gm>>)
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %ubA = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueA : memref<16xf16, #hivm.address_space<gm>>) outs(%ubA : memref<16xf16, #hivm.address_space<ub>>)
  %ubB = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueB : memref<16xf16, #hivm.address_space<gm>>) outs(%ubB : memref<16xf16, #hivm.address_space<ub>>)
  %ubC = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.vadd ins(%ubA, %ubB : memref<16xf16, #hivm.address_space<ub>>, memref<16xf16, #hivm.address_space<ub>>) outs(%ubC : memref<16xf16, #hivm.address_space<ub>>)
  hivm.hir.store ins(%ubC : memref<16xf16, #hivm.address_space<ub>>) outs(%valueC : memref<16xf16, #hivm.address_space<gm>>)
  return
}
```

HIVM uses `#hivm.address_space` to annotate the memory hierarchy: `gm` (global memory), `ub` (Unified Buffer), `l1` (L1 Buffer), `l0a`/`l0b`/`l0c` (L0 Buffer). Use `hivm.hir.load`/`hivm.hir.store` for explicit DMA transfers, and `hivm.hir.vadd` and similar ops for on-chip compute.

Invocation: HIVM IR does not require the HFusion compile pipeline. The default HIVM compile pipeline performs synchronization insertion, memory planning, and other optimizations.

- Command: `bishengir-compile -enable-hfusion-compile=false -enable-hivm-compile=true -target=Ascend910B1 hivm.mlir -o hivm_kernel.o`
- **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

For IR-level concepts, common compile options, and other integration paths (e.g., Triton, TileLang), see [Interface API](interface_api.md).