‘hfusion’ Dialect Passes

-adapt-triton-kernel

Adapt the Triton kernel

Options

-hivmc-version : Specify hivmc version to resolve backward compatibility

-hfusion-add-ffts-addr

Add FFTS base address to func param and annotation

Options

-force-add-ffts-addr : Force adding the FFTS base address at the user-specified param location. The default value -1 means no insertion; 0 means insert at the first param location.

-hfusion-auto-schedule

Auto schedule fused kernels.

Options

-block-dim                      : Number of blocks to use
-enable-auto-multi-buffer       : Enable auto multi buffer
-enable-deterministic-computing : Enable deterministic computing
-max-buffer-count-tuning        : Allow maxBufferCnt tuning
-enable-count-buffer-dma-opt    : If enabled, the buffer used by DMA operations will not be reused by Vector operations
-enable-manage-host-resources   : Enable managing resources for Host functions
-cube-tiling-tuning             : Allow cube tiling params tuning
-external-tiling-func-path      : Path to an external tiling function to add automatically
-enable-symbol-analysis         : Enable symbol analysis for tiling and fusion

-hfusion-cache-io

Cache input and output arguments

-hfusion-cache-io-for-return-arg

Cache arguments that are returned directly

-hfusion-compose-multi-reduce

Compose multi reduce optimization

Options

-max-compose   : Maximum number of reduces composed into a single operation; -1 means unlimited
-max-dist-diff : Maximum distance difference from the common ancestor
-aggressive    : Aggressive mode will try to reshape if shapes are loosely matched
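
As a hypothetical illustration (shapes and payload ops are chosen for this sketch, not taken from the pass implementation), two reduces over the same source along the same dimensions:

%sum = linalg.reduce ins(%src : tensor<8x16xf32>)
                     outs(%acc0 : tensor<8xf32>) dimensions = [1]
  (%in: f32, %out: f32) {
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
%max = linalg.reduce ins(%src : tensor<8x16xf32>)
                     outs(%acc1 : tensor<8xf32>) dimensions = [1]
  (%in: f32, %out: f32) {
    %0 = arith.maximumf %in, %out : f32
    linalg.yield %0 : f32
  }

could be composed into one multi-result reduce:

%sum, %max = linalg.reduce
    ins(%src, %src : tensor<8x16xf32>, tensor<8x16xf32>)
    outs(%acc0, %acc1 : tensor<8xf32>, tensor<8xf32>) dimensions = [1]
  (%in0: f32, %in1: f32, %out0: f32, %out1: f32) {
    %0 = arith.addf %in0, %out0 : f32
    %1 = arith.maximumf %in1, %out1 : f32
    linalg.yield %0, %1 : f32, f32
  }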

-hfusion-constantize-tiling-data

Propagate constants between tiling and device function

Propagate constants from the tiling calculation to the device function.

Modifications made:

  • Constant tiling data are inlined into the device function.

  • Constant tiling data are removed from the tiling function.

  • Constant tiling data are removed from the arguments of the device function, and the call sites are modified accordingly.

  • Constant tiling data are removed from the callers of the device function, and the callers of the callers, and so on.

Constraints/Assumptions:

  • For all the device functions sharing the same tiling function, the order of tiling data arguments is exactly the same.

  • The tiling arguments in the device function’s input arguments have the exact same order as the return values of the tiling function.

Input

func.func @tiling_func(%arg0: tensor<?x?xf16>) -> (i64, i64)
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %ret0 = "some_calculation"() : () -> i64
  %ret1 = arith.constant 42: i64
  return %ret0, %ret1: i64, i64
}

func.func @device_kernel_tiling_0(%arg0: tensor<?x?xf16>,
                                 %arg1: i64 {hacc.tiling_data},
                                 %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @device_kernel_tiling_1(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data},
                                  %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @main(%arg0: tensor<?x?xf16>,
                %arg1: i64 {hacc.tiling_data},
                %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %0 = arith.index_castui %arg1 : i64 to index
  %1 = scf.index_switch %0 -> tensor<?x?xf16>
  case 1 {
    %2 = func.call @device_kernel_tiling_1(%arg0, %arg1, %arg2) : (tensor<?x?xf16>, i64, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  case 0 {
    %2 = func.call @device_kernel_tiling_0(%arg0, %arg1, %arg2): (tensor<?x?xf16>, i64, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  default {
    %false = arith.constant false
    cf.assert %false, "Invalid tiling key"
    %2 = ub.poison : tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  return %1 : tensor<?x?xf16>
}

Output

func.func @tiling_func(%arg0: tensor<?x?xf16>) -> (i64)
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %ret0 = "some_calculation"() : () -> i64
  return %ret0: i64
}

func.func @device_kernel_tiling_0(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  %arg2 = arith.constant 42 : i64
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @device_kernel_tiling_1(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  %arg2 = arith.constant 42 : i64
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @main(%arg0: tensor<?x?xf16>,
                %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %0 = arith.index_castui %arg1 : i64 to index
  %1 = scf.index_switch %0 -> tensor<?x?xf16>
  case 1 {
    %2 = func.call @device_kernel_tiling_1(%arg0, %arg1) : (tensor<?x?xf16>, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  case 0 {
    %2 = func.call @device_kernel_tiling_0(%arg0, %arg1): (tensor<?x?xf16>, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  default {
    %false = arith.constant false
    cf.assert %false, "Invalid tiling key"
    %2 = ub.poison : tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  return %1 : tensor<?x?xf16>
}

-hfusion-convert-generic-to-named

Convert linalg generic ops to linalg named ops and hfusion named ops.

-hfusion-decompose

Decompose ops that implemented AggregatedOpInterface.

Options

-hfusion-decompose-phase : Specify which decompose phase to apply.

-hfusion-decompose-multi

Decompose multi ops into single ones

-hfusion-downgrade-fp64

Downgrade fp64 constants to fp32
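
For example, assuming constants are represented with arith.constant (a sketch, not taken from the pass implementation), a constant such as

%c = arith.constant 1.000000e+00 : f64

would be rewritten to

%c = arith.constant 1.000000e+00 : f32

with the uses of %c adjusted to f32 accordingly.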

-hfusion-drop-symbols

Drop ranked tensor symbols from operations

-hfusion-eliminate-duplicate-funcs

Eliminate duplicate functions after fusion

-hfusion-flatten-ops

Flatten linalg and hfusion ops.

Options

-flatten-mode        : Flatten mode, tidy mode will do an analysis on the entire function
-skip-host           : Whether to skip the host function or not
-multi-dynamic-shape : Whether to collapse multiple dynamic shape or not
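
A hypothetical sketch of flattening (shapes and ops are illustrative): an elementwise op on a 2-D tensor is rewritten to operate on the collapsed 1-D tensor, and the result is expanded back:

%flat = tensor.collapse_shape %arg0 [[0, 1]] : tensor<4x8xf32> into tensor<32xf32>
%init = tensor.empty() : tensor<32xf32>
%res  = linalg.exp ins(%flat : tensor<32xf32>)
                   outs(%init : tensor<32xf32>) -> tensor<32xf32>
%out  = tensor.expand_shape %res [[0, 1]] output_shape [4, 8]
        : tensor<32xf32> into tensor<4x8xf32>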

-hfusion-fold-symbolic-dim

Replace tensor.dim source operands with hfusion::SymbolicDimOp

-hfusion-fuse-ops

HFusion Fuse operations on tensors

Options

-output-mode                : Outlined function output mode (default is multi; can also be single or single-aggr)
-fusion-mode                : Fusion kind is determined by label
-always-inline              : Enable always inline for the outlined function
-move-out-to-param          : Whether to move the out tensors to params or not
-max-horizontal-fusion-size : Maximum horizontal (non-dependent) fusion allowed; -1 for unlimited attempts of horizontal fusion
-multi-kernel               : When disabled, the graph must fuse as a single kernel; when enabled, outline multiple kernels
-enable-symbol-analysis     : Enable symbol dialect analysis

-hfusion-hoist-tensor-empty

Hoist tensor empty to func parameters and merge into one parameter

This pass merges all tensor.empty ops into one func parameter.
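
A hypothetical before/after sketch (function and parameter names are illustrative, not taken from the pass implementation):

// Before: a tensor.empty inside the kernel.
func.func @kernel(%arg0: tensor<16xf32>) -> tensor<16xf32> {
  %empty = tensor.empty() : tensor<16xf32>
  %0 = "some_op"(%arg0, %empty) : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}

// After: the buffer is taken from a merged function parameter.
func.func @kernel(%arg0: tensor<16xf32>, %workspace: tensor<16xf32>) -> tensor<16xf32> {
  %0 = "some_op"(%arg0, %workspace) : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}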

-hfusion-infer-func-fusion-kind

Infer the fusion kind of functions

-hfusion-infer-out-shapes

Generate the output tensor’s shape function for the kernel

-hfusion-inline-brc

Inline broadcast-like ops.

-hfusion-legalize-bf16

Normalize BF16 to FP32.
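
For instance, a bf16 addition can be legalized by widening to f32, computing, and truncating back (a sketch of the usual pattern, assuming standard arith ops):

%x32 = arith.extf %x : bf16 to f32
%y32 = arith.extf %y : bf16 to f32
%s32 = arith.addf %x32, %y32 : f32
%s   = arith.truncf %s32 : f32 to bf16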

-hfusion-legalize-bool

Cast int8 to int1 for input and int1 to int8 for output.
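
A hypothetical sketch of the boundary casts (op names assume standard arith; "some_bool_op" is a placeholder):

%b    = arith.trunci %arg0 : i8 to i1   // int8 input narrowed to int1
%r    = "some_bool_op"(%b) : (i1) -> i1
%out8 = arith.extui %r : i1 to i8       // int1 result widened back to int8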

-hfusion-normalize-ops

Normalize hfusion ops

-hfusion-normalize-slice-ops

Normalize Slice Ops.

Options

-skip-aligned-slice : Skip FoldInsertSliceToConcat pattern for aligned slice.

-hfusion-outline-single-op

Outline single linalg ops into kernels.

Options

-move-out-to-param : Whether to move the out tensors to params or not

-hfusion-pack-tiling-data

Pack dynamic tiling information into a struct.

Pack the tiling information into a struct.

Options

-include-symbols                      : Comma separated list of symbols that should apply this transformation. If empty, the default behavior is to apply transformation to all functions.
-emit-get-tiling-struct-size-function : When enabled, a host function that returns the number of i64 tiling data is emitted.
-pack-tiling-key                      : When enabled, the tiling key is also packed into the tiling struct. Otherwise, the tiling key is directly written to a pointer.

-hfusion-recache-io

Recache IO

-hfusion-reorder-ops

Reorder the ops by BFS.

-hfusion-simplify-ops

Simplify operations

-hfusion-tensor-results-to-out-params

Move tensor results to function output parameters

Options

-include-symbols              : Comma separated list of symbols that should apply this transformation. If empty, the default behavior is to apply transformation to all functions.
-enable-manage-host-resources : Enable managing resources for Host functions
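
A hypothetical before/after sketch of the signature change (names are illustrative):

// Before: the kernel returns its result.
func.func @kernel(%arg0: tensor<16xf32>) -> tensor<16xf32>

// After: the result is moved to an output parameter, and callers
// read the result from %out instead of a return value.
func.func @kernel(%arg0: tensor<16xf32>, %out: tensor<16xf32>)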

-hfusion-unfold-symbolic-dim

Replace hfusion::SymbolicDimOp with the same symbolic arguments

-hfusion-wrap-host-func

Create wrappers for certain host related functions

This pass creates wrapper functions for host tiling func, infer shape func, etc.

Options

-remove-unused-arguments : Whether to remove unused arguments in the host wrapper function or not