# Framework Integration

AscendNPU IR supports framework integration (PyTorch/TensorFlow/MindSpore) in two ways:

- **DSL integration**: Integrate via domain-specific languages such as Triton and TileLang, which compile to AscendNPU IR.
- **IR integration**: Integrate via IR representation, supporting multi-level IR (Torch IR, Linalg/HFusion IR, HIVM IR), with automatic fusion and tiling for Ascend-friendly kernels.

## DSL integration

AscendNPU IR supports upstream integration with languages and frameworks such as Triton and TileLang, so that third-party DSLs can target Ascend hardware and run custom operators on the NPU.

| Integration | Description |
|-------------|-------------|
| [Triton interface](triton_interface.md) | Use Triton to write high-performance kernels and run them on Ascend NPU via Triton Ascend. Covers installation, environment, op mapping, and Ascend extensions. |
| [TileLang interface](tile_lang_interface.md) | Use TileLang Ascend (a tile-lang/TVM-based DSL) to develop kernels for Ascend NPU (e.g., GEMM, vector ops, attention). Covers environment, build, and quick start. |

## IR integration

AscendNPU IR supports multi-level IR integration; the levels differ in abstraction and control granularity (see [Interface API - Multi-level IR Abstraction](interface_api.md#multi-level-ir-abstraction)):

- **Torch IR**: Framework-level ATen ops, lowered to Linalg/HFusion via passes.
- **Linalg/HFusion IR**: General tensor algebra and hardware-aware fusion layer; standard MLIR dialects express operator semantics, and HFusion performs fusion, tiling, and scheduling automatically.
- **HIVM IR**: NPU instruction layer; direct mapping to hardware instructions, with explicit control of the memory hierarchy (GM/UB/L1/L0) and compute pipelines (Vector/Cube/MTE) for fine-grained tuning.

### Torch IR integration

Use Torch dialect ATen ops; passes such as `convert-torch-to-hfusion` lower them to Linalg/HFusion named ops, which then enter the fusion and scheduling flow.
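When an ATen elementwise op takes operands of different ranks (as in the `torch_mul` example below, which multiplies a `[4096]` tensor by a `[1,56,4096]` tensor), the lowering must materialize an explicit `linalg.broadcast` before the elementwise op. A minimal pure-Python sketch of the NumPy-style shape rule involved; `broadcast_shape` is a hypothetical helper for illustration only, not part of the BishengIR toolchain:

```python
def broadcast_shape(a, b):
    """Compute the NumPy-style broadcast shape of two tensor shapes.

    Illustrative only: this mirrors the shape inference a lowering must
    perform before inserting an explicit broadcast, it is not a real API.
    """
    n = max(len(a), len(b))
    # Right-align both shapes, padding the shorter one with 1s.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible dimensions {x} and {y}")
        out.append(max(x, y))
    return tuple(out)

# The running example of this section: [4096] against [1,56,4096].
print(broadcast_shape((4096,), (1, 56, 4096)))
```

For the example above this yields `(1, 56, 4096)`: the `[4096]` operand is expanded along the two leading dimensions, which is exactly what `dimensions = [0, 1]` on the lowered `linalg.broadcast` records.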
#### Torch → AscendNPU IR pipeline

Torch IR is integrated via the `torch-backend-to-named-op-backend-pipeline` conversion pipeline. The custom `convert-torch-to-hfusion` pass first lowers Torch ATen ops to Linalg/HFusion named ops; ops it does not cover fall back to the standard lowering path of upstream torch-mlir.

Main conversion stages:

- `convert-torch-to-hfusion`: BishengIR custom lowering of 55+ ATen ops to Linalg/HFusion named ops.
- `convert-torch-to-linalg`: Upstream torch-mlir lowering for the remaining ops.
- `convert-torch-to-scf / arith / tensor`: Upstream torch-mlir conversion of control flow, arithmetic, and tensor ops.
- `func-backend-type-conversion`: Converts Torch types (`!torch.vtensor`) to builtin types (`tensor`).

#### Example `torch.mlir`

```mlir
func.func @torch_mul(%arg0: !torch.vtensor<[4096],f16>, %arg1: !torch.vtensor<[1,56,4096],f16>) -> !torch.vtensor<[1,56,4096],f16>
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %0 = torch.aten.mul.Tensor %arg0, %arg1 : !torch.vtensor<[4096],f16>, !torch.vtensor<[1,56,4096],f16> -> !torch.vtensor<[1,56,4096],f16>
  return %0 : !torch.vtensor<[1,56,4096],f16>
}
```

Invocation: two methods are available; both go through the same compile pipeline.

- **Stepwise conversion**: Converts Torch IR to Linalg/HFusion IR first; suitable for caching or inspecting intermediate IR. After conversion, use `torch_to_hfusion.mlir` as input and continue with the [Linalg/HFusion IR integration](#linalghfusion-ir-integration) flow to produce a binary.
  - Command: `bishengir-opt -torch-backend-to-named-op-backend-pipeline torch.mlir -o torch_to_hfusion.mlir`
  - **Expected output**: MLIR text file (`.mlir` format) containing the converted Linalg/HFusion IR.
  For example:

  ```mlir
  func.func @torch.aten.mul_tensor(%arg0: tensor<4096xf16>, %arg1: tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
      attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
    %0 = tensor.empty() : tensor<1x56x4096xf16>
    %broadcasted = linalg.broadcast ins(%arg0 : tensor<4096xf16>) outs(%0 : tensor<1x56x4096xf16>) dimensions = [0, 1]
    %1 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%broadcasted, %arg1 : tensor<1x56x4096xf16>, tensor<1x56x4096xf16>) outs(%0 : tensor<1x56x4096xf16>) -> tensor<1x56x4096xf16>
    return %1 : tensor<1x56x4096xf16>
  }
  ```

- **End-to-end compilation**: Uses `bishengir-compile` to compile Torch IR directly to an executable binary, running through the full Torch → HFusion → HIVM IR compile pipeline.
  - Command: `bishengir-compile -enable-torch-compile=true -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 torch.mlir -o torch_kernel.o`
  - **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

#### Supported Torch ops

##### Elementwise binary

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.add.Tensor` / `aten.add.Scalar` | `linalg.binary_fn` |
| `aten.sub.Tensor` / `aten.sub.Scalar` | `linalg.binary_fn` |
| `aten.mul.Tensor` / `aten.mul.Scalar` | `linalg.binary_fn` |
| `aten.div.Tensor` / `aten.div.Scalar` | `linalg.binary_fn` |
| `aten.maximum` | `linalg.binary_fn` |
| `aten.minimum` | `linalg.binary_fn` |
| `aten.clamp_min` / `aten.clamp_min.Tensor` | `linalg.binary_fn` |
| `aten.clamp_max` / `aten.clamp_max.Tensor` | `linalg.binary_fn` |
| `aten.clamp` | Combination of `max_signed` + `min_signed` |
| `aten.pow.Tensor_Tensor` / `aten.pow.Tensor_Scalar` / `aten.pow.Scalar` | `hfusion.binary_fn` |
| `aten.logical_and` | `hfusion.binary_fn` |
| `aten.logical_or` | `hfusion.binary_fn` |

##### Elementwise unary

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.abs` | `linalg.unary_fn` |
| `aten.ceil` | `linalg.unary_fn` |
| `aten.floor` | `linalg.unary_fn` |
| `aten.neg` | `linalg.unary_fn` |
| `aten.log` | `linalg.unary_fn` |
| `aten.exp` | `linalg.unary_fn` |
| `aten.reciprocal` | `hfusion.unary_fn` |
| `aten.relu` | `hfusion.unary_fn` |
| `aten.rsqrt` | `hfusion.unary_fn` |
| `aten.sqrt` | `hfusion.unary_fn` |
| `aten.erf` | `hfusion.unary_fn` |
| `aten.tanh` | `hfusion.unary_fn` |
| `aten.sin` | `hfusion.unary_fn` |
| `aten.cos` | `hfusion.unary_fn` |
| `aten.bitwise_not` | `hfusion.unary_fn` |
| `aten.sigmoid` | Decomposed to negf -> exp -> add -> div |
| `aten.gelu` | Decomposed to tanh-based approximation |

##### Compare

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.gt.Scalar` / `aten.gt.Tensor` | `hfusion.compare_fn` |
| `aten.lt.Scalar` / `aten.lt.Tensor` | `hfusion.compare_fn` |
| `aten.ge.Scalar` / `aten.ge.Tensor` | `hfusion.compare_fn` |
| `aten.le.Scalar` / `aten.le.Tensor` | `hfusion.compare_fn` |
| `aten.eq.Scalar` / `aten.eq.Tensor` | `hfusion.compare_fn` |
| `aten.ne.Scalar` / `aten.ne.Tensor` | `hfusion.compare_fn` |

##### Reduction

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.sum` / `aten.sum.dim_IntList` | `linalg.reduce` + `arith.addf/addi` |
| `aten.prod` / `aten.prod.dim_int` | `linalg.reduce` + `arith.mulf/muli` |
| `aten.max` | `linalg.reduce` + `arith.maximumf/maxsi` |
| `aten.min` | `linalg.reduce` + `arith.minimumf/minsi` |
| `aten.max.dim` | `hfusion.reduce_with_index` (MAX) |
| `aten.min.dim` | `hfusion.reduce_with_index` (MIN) |
| `aten.any` / `aten.any.dim` / `aten.any.dims` | `linalg.reduce` + `arith.ori` |
| `aten.all` / `aten.all.dim` | `linalg.reduce` + `arith.andi` |

##### Data movement

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.permute` | `linalg.transpose` |
| `aten.broadcast_to` | `linalg.broadcast` |

##### Other

| Torch Op | Lowering target |
|----------|-----------------|
| `aten.to.dtype` | `hfusion.cast` |
| `aten.where.self` | `hfusion.select` |
| `aten.arange.start_step` | `hfusion.arange` |

### Linalg/HFusion IR integration

Use Linalg/Tensor, HFusion, and other standard MLIR dialects to express operator semantics; the input goes directly into the Linalg/HFusion IR layer's fusion and scheduling flow.

#### Example `hfusion.mlir`

```mlir
func.func @hfusion_reduce_mul(%arg0: tensor<40960xf32>, %arg1: tensor<40960x1024xf32>, %arg2: tensor<40960x1024xf32>, %arg3: tensor<40960x1024xf32>) -> tensor<40960xf32>
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %1 = tensor.empty() : tensor<40960x1024xf32>
  %3 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg1, %arg2 : tensor<40960x1024xf32>, tensor<40960x1024xf32>) outs(%arg3 : tensor<40960x1024xf32>) -> tensor<40960x1024xf32>
  %4 = tensor.empty() : tensor<40960xf32>
  %sum = linalg.reduce {arith.addf} ins(%3 : tensor<40960x1024xf32>) outs(%4 : tensor<40960xf32>) dimensions = [1]
  %5 = tensor.empty() : tensor<40960xf32>
  %6 = linalg.elemwise_binary {fun = #linalg.binary_fn<mul>} ins(%arg0, %sum : tensor<40960xf32>, tensor<40960xf32>) outs(%5 : tensor<40960xf32>) -> tensor<40960xf32>
  return %6 : tensor<40960xf32>
}
```

Invocation:

- Command: `bishengir-compile -enable-hfusion-compile=true -enable-hivm-compile=true -target=Ascend910B1 hfusion.mlir -o hfusion_kernel.o`
- **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

#### Automatic fusion

Once Linalg/HFusion IR is ingested, the HFusion compile flow performs **automatic fusion and scheduling** on eligible ops: multiple ops are merged into the same kernel so that intermediate results are reused in on-chip memory and global-memory traffic is reduced; scheduling and tiling strategies are selected automatically based on fusion patterns and operator traits, producing efficient schedules for the Ascend NPU. After fusion, the IR passes through tiling, loop generation, Transform dialect application, and similar steps before being lowered to HIVM and emitted as an executable binary.

Supported op types:

- Elemwise
- Broadcast
- Reduce
- Transpose
- Concat

For algorithm details, constraints, architecture, and related topics, see [HFusion AutoSchedule: Automatic Fusion and Scheduling](../features/AutoSchedule/HFusion_AutoSchedule.md).

### HIVM IR integration

For fine-grained hardware control, you can write kernels directly in the HIVM dialect, managing the memory hierarchy and compute pipelines explicitly.
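At this level, data movement is explicit: inputs are staged from global memory (GM) into the on-chip Unified Buffer (UB), computed on-chip, and written back to GM. A plain-Python conceptual sketch of that load/compute/store pattern, using lists as stand-ins for memory regions; this illustrates the programming model only and is not the HIVM API:

```python
def vadd_kernel(gm_a, gm_b, gm_c):
    """Conceptual model of an explicit GM -> UB -> GM vector add.

    Illustrative only: lists stand in for memory regions, and the
    comments map each step to the HIVM op that expresses it.
    """
    ub_a = list(gm_a)                            # hivm.hir.load: DMA GM -> UB
    ub_b = list(gm_b)                            # hivm.hir.load: DMA GM -> UB
    ub_c = [x + y for x, y in zip(ub_a, ub_b)]   # hivm.hir.vadd: compute in UB
    gm_c[:] = ub_c                               # hivm.hir.store: DMA UB -> GM

a, b, c = [1.0] * 16, [2.0] * 16, [0.0] * 16
vadd_kernel(a, b, c)   # c becomes [3.0] * 16
```

The HIVM example that follows expresses the same four steps directly in IR, with each buffer's location made explicit through address-space annotations.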
#### Example `hivm.mlir`

```mlir
func.func @hivm_vadd(%valueA: memref<16xf16, #hivm.address_space<gm>>, %valueB: memref<16xf16, #hivm.address_space<gm>>, %valueC: memref<16xf16, #hivm.address_space<gm>>)
    attributes {hacc.entry, hacc.function_kind = #hacc.function_kind} {
  %ubA = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueA : memref<16xf16, #hivm.address_space<gm>>) outs(%ubA : memref<16xf16, #hivm.address_space<ub>>)
  %ubB = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.load ins(%valueB : memref<16xf16, #hivm.address_space<gm>>) outs(%ubB : memref<16xf16, #hivm.address_space<ub>>)
  %ubC = memref.alloc() : memref<16xf16, #hivm.address_space<ub>>
  hivm.hir.vadd ins(%ubA, %ubB : memref<16xf16, #hivm.address_space<ub>>, memref<16xf16, #hivm.address_space<ub>>) outs(%ubC : memref<16xf16, #hivm.address_space<ub>>)
  hivm.hir.store ins(%ubC : memref<16xf16, #hivm.address_space<ub>>) outs(%valueC : memref<16xf16, #hivm.address_space<gm>>)
  return
}
```

HIVM uses `#hivm.address_space` to annotate the memory hierarchy: `gm` (global memory), `ub` (Unified Buffer), `l1` (L1 Buffer), `l0a`/`l0b`/`l0c` (L0 Buffer). Use `hivm.hir.load`/`hivm.hir.store` for explicit DMA transfers, and `hivm.hir.vadd` and similar ops for on-chip compute.

Invocation: HIVM IR does not require the HFusion compile pipeline. The default HIVM compile pipeline performs synchronization insertion, memory planning, and other optimizations.

- Command: `bishengir-compile -enable-hfusion-compile=false -enable-hivm-compile=true -target=Ascend910B1 hivm.mlir -o hivm_kernel.o`
- **Expected output**: Ascend NPU operator binary (`.o` format), loadable and runnable on device via the CANN runtime.

For IR-level concepts, common compile options, and other integration paths (e.g., Triton, TileLang), see [Interface API](interface_api.md).