‘hfusion’ Dialect Passes

-adapt-triton-kernel

Adapt the Triton kernel

Options

-hivmc-version : Specify hivmc version to resolve backward compatibility

-hfusion-add-ffts-addr

Add FFTS base address to func param and annotation

Options

-force-add-ffts-addr : Force adding the FFTS base address at the user-specified param location. The default value -1 means no insertion; 0 means insert at the first param location.

-hfusion-auto-schedule

Auto schedule fused kernels.

Options

-block-dim                      : Number of blocks to use
-enable-auto-multi-buffer       : Enable auto multi buffer
-enable-deterministic-computing : Enable deterministic computing
-max-buffer-count-tuning        : Allow maxBufferCnt tuning
-enable-count-buffer-dma-opt    : If enabled, the buffer used by DMA operations will not be reused by Vector operations
-enable-manage-host-resources   : Enable managing resources for Host functions
-cube-tiling-tuning             : Allow cube tiling params tuning
-external-tiling-func-path      : Path to an external tiling function to add automatically
-enable-symbol-analysis         : Enable symbol analysis for tiling and fusion

-hfusion-cache-io

Cache input and output arguments

-hfusion-cache-io-for-return-arg

Cache arguments that are returned directly

-hfusion-compose-multi-reduce

Compose multi reduce optimization

Options

-max-compose   : Maximum number of reduces composed into a single operation; -1 means unlimited
-max-dist-diff : Maximum distance difference from the common ancestor
-aggressive    : Aggressive mode will try to reshape if shapes are loosely matched
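
As a hypothetical illustration (shapes and payload ops are chosen for this sketch, not taken from the pass implementation), two reduces over the same source along the same dimensions:

%sum = linalg.reduce ins(%src : tensor<8x16xf32>)
                     outs(%acc0 : tensor<8xf32>) dimensions = [1]
  (%in: f32, %out: f32) {
    %0 = arith.addf %in, %out : f32
    linalg.yield %0 : f32
  }
%max = linalg.reduce ins(%src : tensor<8x16xf32>)
                     outs(%acc1 : tensor<8xf32>) dimensions = [1]
  (%in: f32, %out: f32) {
    %0 = arith.maximumf %in, %out : f32
    linalg.yield %0 : f32
  }

could be composed into one multi-result reduce:

%sum, %max = linalg.reduce
    ins(%src, %src : tensor<8x16xf32>, tensor<8x16xf32>)
    outs(%acc0, %acc1 : tensor<8xf32>, tensor<8xf32>) dimensions = [1]
  (%in0: f32, %in1: f32, %out0: f32, %out1: f32) {
    %0 = arith.addf %in0, %out0 : f32
    %1 = arith.maximumf %in1, %out1 : f32
    linalg.yield %0, %1 : f32, f32
  }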

-hfusion-constantize-tiling-data

Propagate constants between tiling and device function

Propagate constants from the tiling calculation to the device function.

Modifications made:

  • Constant tiling data are inlined into the device function.

  • Constant tiling data are removed from the tiling function.

  • Constant tiling data are removed from the arguments of the device function, and the call sites are modified accordingly.

  • Constant tiling data are removed from the callers of the device function, and the callers of the callers, and so on.

Constraints/Assumptions:

  • For all the device functions sharing the same tiling function, the order of tiling data arguments is exactly the same.

  • The tiling arguments in the device function’s input arguments have the exact same order as the return values of the tiling function.

Input

func.func @tiling_func(%arg0: tensor<?x?xf16>) -> (i64, i64)
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %ret0 = "some_calculation"() : () -> i64
  %ret1 = arith.constant 42: i64
  return %ret0, %ret1: i64, i64
}

func.func @device_kernel_tiling_0(%arg0: tensor<?x?xf16>,
                                 %arg1: i64 {hacc.tiling_data},
                                 %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @device_kernel_tiling_1(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data},
                                  %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @main(%arg0: tensor<?x?xf16>,
                %arg1: i64 {hacc.tiling_data},
                %arg2: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %0 = arith.index_castui %arg1 : i64 to index
  %1 = scf.index_switch %0 -> tensor<?x?xf16>
  case 1 {
    %2 = func.call @device_kernel_tiling_1(%arg0, %arg1, %arg2) : (tensor<?x?xf16>, i64, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  case 0 {
    %2 = func.call @device_kernel_tiling_0(%arg0, %arg1, %arg2): (tensor<?x?xf16>, i64, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  default {
    %false = arith.constant false
    cf.assert %false, "Invalid tiling key"
    %2 = ub.poison : tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  return %1 : tensor<?x?xf16>
}

Output

func.func @tiling_func(%arg0: tensor<?x?xf16>) -> (i64)
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %ret0 = "some_calculation"() : () -> i64
  return %ret0: i64
}

func.func @device_kernel_tiling_0(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  %arg2 = arith.constant 42 : i64
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @device_kernel_tiling_1(%arg0: tensor<?x?xf16>,
                                  %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<DEVICE>, hacc.tiling_func = "tiling_func"} {
  "some_use"(%arg1) : (i64) -> ()
  %arg2 = arith.constant 42 : i64
  "some_use"(%arg2) : (i64) -> ()
  %ret0 = "some_op"(%arg0) : (tensor<?x?xf16>) -> tensor<?x?xf16>
  return %ret0 : tensor<?x?xf16>
}

func.func @main(%arg0: tensor<?x?xf16>,
                %arg1: i64 {hacc.tiling_data}) -> tensor<?x?xf16>
attributes {hacc.function_kind = #hacc.function_kind<HOST>} {
  %0 = arith.index_castui %arg1 : i64 to index
  %1 = scf.index_switch %0 -> tensor<?x?xf16>
  case 1 {
    %2 = func.call @device_kernel_tiling_1(%arg0, %arg1) : (tensor<?x?xf16>, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  case 0 {
    %2 = func.call @device_kernel_tiling_0(%arg0, %arg1): (tensor<?x?xf16>, i64) -> tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  default {
    %false = arith.constant false
    cf.assert %false, "Invalid tiling key"
    %2 = ub.poison : tensor<?x?xf16>
    scf.yield %2 : tensor<?x?xf16>
  }
  return %1 : tensor<?x?xf16>
}

-hfusion-convert-generic-to-named

Convert linalg generic ops to linalg named ops and hfusion named ops.

-hfusion-decompose

Decompose ops that implemented AggregatedOpInterface.

Options

-hfusion-decompose-phase : Specify which decompose phase to apply.

-hfusion-decompose-multi

Decompose multi ops into single ones

-hfusion-downgrade-fp64

Downgrade fp64 constants to fp32
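
For example, assuming constants are represented with arith.constant (a sketch, not taken from the pass implementation), a constant such as

%c = arith.constant 1.000000e+00 : f64

would be rewritten to

%c = arith.constant 1.000000e+00 : f32

with the uses of %c adjusted to f32 accordingly.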

-hfusion-drop-symbols

Drop ranked tensor symbols from operations

-hfusion-eliminate-duplicate-funcs

Eliminate duplicate functions after fusion

-hfusion-flatten-ops

Flatten linalg and hfusion ops.

Options

-flatten-mode        : Flatten mode, tidy mode will do an analysis on the entire function
-skip-host           : Whether to skip the host function or not
-multi-dynamic-shape : Whether to collapse multiple dynamic shape or not
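
A hypothetical sketch of flattening (shapes and ops are illustrative): an elementwise op on a 2-D tensor is rewritten to operate on the collapsed 1-D tensor, and the result is expanded back:

%flat = tensor.collapse_shape %arg0 [[0, 1]] : tensor<4x8xf32> into tensor<32xf32>
%init = tensor.empty() : tensor<32xf32>
%res  = linalg.exp ins(%flat : tensor<32xf32>)
                   outs(%init : tensor<32xf32>) -> tensor<32xf32>
%out  = tensor.expand_shape %res [[0, 1]] output_shape [4, 8]
        : tensor<32xf32> into tensor<4x8xf32>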

-hfusion-fold-symbolic-dim

Replace tensor.dim source operands with hfusion::SymbolicDimOp

-hfusion-fuse-ops

HFusion Fuse operations on tensors

Options

-output-mode                : Outlined function output mode (default is multi; can also be single or single-aggr)
-fusion-mode                : Fusion kind is determined by label
-always-inline              : Enable always inline for the outlined function
-move-out-to-param          : Whether to move the out tensors to params or not
-max-horizontal-fusion-size : Maximum horizontal (non-dependent) fusion allowed; -1 for unlimited attempts of horizontal fusion
-multi-kernel               : When disabled, the graph must fuse as a single kernel; when enabled, outline multiple kernels
-enable-symbol-analysis     : Enable symbol dialect analysis

-hfusion-hoist-tensor-empty

Hoist tensor empty to func parameters and merge into one parameter

This pass merges all tensor.empty ops into one func parameter.
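
A hypothetical before/after sketch (function and parameter names are illustrative, not taken from the pass implementation):

// Before: a tensor.empty inside the kernel.
func.func @kernel(%arg0: tensor<16xf32>) -> tensor<16xf32> {
  %empty = tensor.empty() : tensor<16xf32>
  %0 = "some_op"(%arg0, %empty) : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}

// After: the buffer is taken from a merged function parameter.
func.func @kernel(%arg0: tensor<16xf32>, %workspace: tensor<16xf32>) -> tensor<16xf32> {
  %0 = "some_op"(%arg0, %workspace) : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  return %0 : tensor<16xf32>
}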

-hfusion-infer-func-fusion-kind

Infer the fusion kind of functions

-hfusion-infer-out-shapes

Generate the output tensor’s shape function for the kernel

-hfusion-inline-brc

Inline broadcast-like ops.

-hfusion-legalize-bf16

Normalize BF16 to FP32.
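
For instance, a bf16 addition can be legalized by widening to f32, computing, and truncating back (a sketch of the usual pattern, assuming standard arith ops):

%x32 = arith.extf %x : bf16 to f32
%y32 = arith.extf %y : bf16 to f32
%s32 = arith.addf %x32, %y32 : f32
%s   = arith.truncf %s32 : f32 to bf16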

-hfusion-legalize-bool

Cast int8 to int1 for input and int1 to int8 for output.
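
A hypothetical sketch of the boundary casts (op names assume standard arith; "some_bool_op" is a placeholder):

%b    = arith.trunci %arg0 : i8 to i1   // int8 input narrowed to int1
%r    = "some_bool_op"(%b) : (i1) -> i1
%out8 = arith.extui %r : i1 to i8       // int1 result widened back to int8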

-hfusion-normalize-ops

Normalize hfusion ops

-hfusion-normalize-slice-ops

Normalize Slice Ops.

Options

-skip-aligned-slice : Skip FoldInsertSliceToConcat pattern for aligned slice.

-hfusion-outline-single-op

Outline single linalg ops into kernels.

Options

-move-out-to-param : Whether to move the out tensors to params or not

-hfusion-pack-tiling-data

Pack dynamic tiling information into a struct.

Pack the tiling information into a struct.

Options

-include-symbols                      : Comma separated list of symbols that should apply this transformation. If empty, the default behavior is to apply transformation to all functions.
-emit-get-tiling-struct-size-function : When enabled, a host function that returns the number of i64 tiling data is emitted.
-pack-tiling-key                      : When enabled, the tiling key is also packed into the tiling struct. Otherwise, the tiling key is directly written to a pointer.

-hfusion-recache-io

Recache IO

-hfusion-reorder-ops

Reorder the ops by BFS.

-hfusion-simplify-ops

Simplify operations

-hfusion-tensor-results-to-out-params

Move tensor results to function output parameters

Options

-include-symbols              : Comma separated list of symbols that should apply this transformation. If empty, the default behavior is to apply transformation to all functions.
-enable-manage-host-resources : Enable managing resources for Host functions
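
A hypothetical before/after sketch of the signature change (names are illustrative):

// Before: the kernel returns its result.
func.func @kernel(%arg0: tensor<16xf32>) -> tensor<16xf32>

// After: the result is moved to an output parameter, and callers
// read the result from %out instead of a return value.
func.func @kernel(%arg0: tensor<16xf32>, %out: tensor<16xf32>)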

-hfusion-unfold-symbolic-dim

Replace hfusion::SymbolicDimOp with the same symbolic arguments

-hfusion-wrap-host-func

Create wrappers for certain host related functions

This pass creates wrapper functions for host tiling func, infer shape func, etc.

Options

-remove-unused-arguments : Whether to remove unused arguments in the host wrapper function or not