‘hivm’ Dialect Passes

-auto-blockify-parallel-loop

Enable an automatic loop over blocks when the logical block count is larger than the physical one
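The idea of looping over logical blocks on fewer physical blocks can be sketched as follows. This is only an illustration: the round-robin assignment and the helper name `logical_blocks_for` are assumptions, and the source does not say which distribution the pass actually uses.

```python
def logical_blocks_for(physical_id, num_physical, num_logical):
    """Logical block ids visited by one physical block.

    Assumes a round-robin distribution (hypothetical; the real pass may
    distribute the work differently).
    """
    if num_physical <= 0:
        raise ValueError("num_physical must be positive")
    # Start at the physical id and step by the physical block count.
    return list(range(physical_id, num_logical, num_physical))


# 10 logical blocks executed on 4 physical blocks.
schedule = {p: logical_blocks_for(p, 4, 10) for p in range(4)}
```

In this sketch every logical block is visited exactly once, and a physical block runs more than one iteration only when the logical count exceeds the physical count.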

-compose-collapse-expand

Compose collapse and expand op

-convert-non-contiguous-reshape-to-copy

Generate copy for reassociative reshape that might be non-contiguous

-convert-to-hivm-op

Convert Ops from other dialects to HIVM Ops

-cv-pipelining

Cube and vector core pipelining for multi-buffered mix-cv ops

Options

-enable-auto-balance : Enable balancing of vector subtasks during pipelining.

-hivm-add-ffts-to-syncblocksetop

Add FFTS (arg0) to SyncBlockSetOp

This pass adds FFTS (arg0) to SyncBlockSetOp.

-hivm-aggregated-decompose-op

Decompose hivm ops that use hivm AggregatedOpInterface

Options

-decompose-phase : Specify which decompose phase to apply.

-hivm-align-alloc-size

Automatically align the memref.alloc size for special hivm ops that must access an aligned size

Some hivm ops can only access memory in sizes aligned to the hardware unit size, so this pass adjusts the memref.alloc size in those cases to avoid out-of-bounds accesses.
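The size adjustment amounts to rounding an allocation up to a multiple of the access unit. A minimal sketch, assuming sizes in bytes; the helper name `align_up` and the 32-byte unit in the test values are illustrative and not taken from the pass:

```python
def align_up(size_in_bytes, unit_in_bytes):
    """Round size_in_bytes up to the next multiple of unit_in_bytes."""
    if unit_in_bytes <= 0:
        raise ValueError("unit_in_bytes must be positive")
    # Integer ceiling division, then scale back to bytes.
    return (size_in_bytes + unit_in_bytes - 1) // unit_in_bytes * unit_in_bytes


# A 100-byte buffer accessed in hypothetical 32-byte units needs 128 bytes.
padded = align_up(100, 32)
```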

-hivm-alloc-extra-buffer

Allocate additional temporary buffers for ops that need them

-hivm-auto-infer-buffer-size

Auto infer buffer size

Infer buffer size by inserting annotation.mark Op

-hivm-bind-sub-block

Tile and bind sub block

-hivm-bind-sync-block-lock-arg

Bind func argument with hacc.syncblocklock to CreateSyncBlockLockOp

-hivm-bind-workspace-arg

Bind func argument with hacc.workspace to AllocWorkspaceOp

-hivm-bubble-up-extract-slice

Tile and bind sub block

-hivm-clone-tensor-empty

Output clones to different empty tensors based on hivmOp.

This pass clones a distinct tensor.empty for each hivmOp output.

-hivm-constantize-buffer-size

Try to constantize dynamic shape buffers.

This pass tries to constantize dynamic shape buffers by upper-bounding their original shape. If successful, a new, static shaped alloc will be created and subviewed to the original shape for further use.
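The upper-bounding step can be sketched as below. This is an illustration only: dynamic dimensions are modeled as `None`, and the `upper_bounds` mapping stands in for whatever bound analysis the pass performs, which the source does not describe.

```python
def constantize_shape(shape, upper_bounds):
    """Return a static shape, or None when some dynamic dim has no bound.

    shape        : list of ints, with None marking a dynamic dimension
    upper_bounds : dict mapping a dimension index to its known upper bound
                   (hypothetical input; the pass derives bounds itself)
    """
    static_shape = []
    for index, dim in enumerate(shape):
        if dim is not None:
            static_shape.append(dim)  # already static
        elif index in upper_bounds:
            static_shape.append(upper_bounds[index])  # replace by bound
        else:
            return None  # cannot constantize this buffer
    return static_shape


# A ?x16 buffer whose first dimension is known to be at most 128.
static = constantize_shape([None, 16], {0: 128})
```

The static allocation then covers the worst case, and a subview of the original dynamic shape is what later ops use.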

-hivm-decompose-op

Decompose compound hivm op to multiple hivm ops according to hardware ability.

This pass decomposes a compound hivm op into multiple hivm ops according to hardware ability. For example, the hardware cannot cast f32 to i8 directly, so the cast needs to be decomposed into an f32 to f16 cast op followed by an f16 to i8 cast op. In dynamic cases, the pass creates an annotation.markOp with the attr buffer_size_in_byte for the allocated extra buffer; the value of buffer_size_in_byte is the same as that of the src or dst operand of the original op.
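The cast example can be modeled as a search for a chain of directly supported casts. A minimal sketch: the `DIRECT_CASTS` table contains only the two casts named in the text and is not a description of real hardware, and the breadth-first search is an assumption about how a chain could be found, not the pass's implementation.

```python
from collections import deque

# Only the two casts mentioned in the text; a real table is hardware specific.
DIRECT_CASTS = {("f32", "f16"), ("f16", "i8")}


def decompose_cast(src, dst, direct=DIRECT_CASTS):
    """Return the shortest list of direct (from, to) casts, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            # Pair up consecutive types into individual cast steps.
            return list(zip(path, path[1:]))
        for cast_from, cast_to in sorted(direct):
            if cast_from == path[-1] and cast_to not in seen:
                seen.add(cast_to)
                queue.append(path + [cast_to])
    return None


steps = decompose_cast("f32", "i8")
```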

-hivm-enable-multi-buffer

Enable multi buffer

This pass enables multi buffer for a hivm op if the op is marked “hivm.multi_buffer”.

-hivm-enable-stride-align

Align memref allocations according to stride align marks

Re-allocate memrefs according to annotations of storage_align marks

-hivm-flatten-ops

Flatten HIVM ops.

-hivm-graph-sync-solver

Graph sync solver

Options

-enable-unit-flag : Enable unit-flag modes for synchronization

-hivm-infer-data-layout

Infer data layout for HIVM Ops

-hivm-infer-func-core-type

Infer the core type of each function

-hivm-infer-mem-scope

Infer memory scope for HIVM Ops

-hivm-init-entry-kernel

Insert set_mask_norm() at the beginning of entry kernel

-hivm-inject-block-sync

Auto inject block sync

Options

-block-all-sync                 : Enable inject all block sync for HIVM injectBlockSync.
-assume-alive-loops             : Assume that all loops (forOp whileOp) will execute at least once.
-disable-auto-inject-block-sync : Toggle auto set/wait insertion, always keep SetFFTSBaseAddrOp

-hivm-inject-sync

Auto inject sync

Options

-sync-mode          : inject sync mode (default is inject normal)
-enable-unit-flag   : Enable unit-flag modes for synchronization
-assume-alive-loops : Assume that all loops (forOp whileOp) will execute at least once.

-hivm-inline-fixpipe

Convert ops to HIVM Fixpipe op

-hivm-inline-load-copy

Inline Copied load

-hivm-inline-otf-broadcast

Inline OTF broadcast

-hivm-inline-otf-load-store

On the fly Inline Load and Store operations

-hivm-insert-infer-sync-block-lock-num-and-init-func

Insert infer-sync-block-lock callback func for host

Calculate the total static sync block lock num and init, and then create a host callback to return this size

-hivm-insert-infer-task-type-func

Infer the module’s task type and emit a host function that returns it.

Detect whether the module is CubeVectorMix, CubeOnly, VectorOnly or Unknown, then emit a host‑side function <original_func>_infer_task_type_function that returns an i8 constant encoding the detected type and is marked with the appropriate HACC host‑function attributes.

-hivm-insert-infer-workspace-size-func

Insert infer-workspace callback func for host

Calculate the total static workspace size after the plan-workspace pass, and then create a host callback to return this size

-hivm-insert-init-and-finish-for-debug

Insert init and finish for debug

-hivm-insert-load-store-for-mix-cv

Insert load store op for mix cv

-hivm-insert-nz2nd-for-debug

Insert nz2nd for debug

-hivm-lift-lowest-stride

Lift lowest stride of operands of hivm ops

For most hivm structured ops, lift the lowest stride of the operands if the last dim is not contiguous.

Exceptions: MacroOp and VArangeOp.

For example, if the operand type is memref<16xf16, strided<[8]>>, then after LiftLowestStride the type becomes memref<16x1xf16, strided<[8, 1]>>, with a contiguous last dim.
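The shape and stride rewrite in this example can be sketched on plain lists. This models only the shape and stride bookkeeping; the function name `lift_lowest_stride` is illustrative and the exceptions listed above are not handled.

```python
def lift_lowest_stride(shape, strides):
    """Append a unit dim when the innermost stride is not 1.

    Returns (new_shape, new_strides). Operands whose last dim is already
    contiguous, and rank-0 operands, are returned unchanged.
    """
    if strides and strides[-1] != 1:
        # The new trailing dim has size 1 and stride 1, so it is contiguous.
        return list(shape) + [1], list(strides) + [1]
    return list(shape), list(strides)


# Shape [16] with stride [8] becomes shape [16, 1] with strides [8, 1].
new_shape, new_strides = lift_lowest_stride([16], [8])
```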

-hivm-lift-zero-rank

-hivm-lower-create-sync-block-lock

Lower CreateSyncBlockLockOp to ViewOp

-hivm-lower-to-loops

Lower hivm ops to loops

-hivm-map-forall-to-blocks

Map forall to hivm blocks.

This pass maps each scf.forall operation to HIVM block ops. The mapping is one-to-one, and the induction variables of scf.forall are rewritten to hivm block idx ops.

-hivm-mark-disable-load

Mark the memref.loads that need to disable dcache

-hivm-mark-multi-buffer

Mark multi buffer for HIVM Ops

This pass marks multi buffer for hivm ops if the option enable-auto is true. Note that buffers with scope L0C are not marked. If enable-auto is false, the pass does nothing.

Options

-enable-auto                                   : Mark multi buffer automatically.
-limit-auto-multi-buffer-only-for-local-buffer : Disable multi-buffer mark on workspace
-limit-auto-multi-buffer-of-local-buffer       : Limit local buffer auto multi buffer
-limit-mix-auto-multi-buffer-buffer            : Disable multi-buffer-buffer on cube, vector Or no limit
-set-workspace-multibuffer                     : Override for multibuffer number for workspace

-hivm-mark-real-core-type

Mark scalar operations with core-type attribute.

Options

-remove-core-type-attrs : Remove all core type attributes. If set to true, this pass becomes a cleanup pass.

-hivm-mark-stride-align

Automatically annotate stride_align marks for operands of hivm ops

For all hivm ops, annotate their memref operands with storage_align marks automatically

-hivm-memref-alloc-to-alloca

Convert local AllocOp to AllocaOp

This pass replaces every memref.alloc in a non-global memory space with memref.alloca.

-hivm-normalize-loop-iterator

Normalize special state of loop iterator before plan-memory

-hivm-normalize-matmul

Normalize hivm matmul op

-hivm-opt-func-output

Try to optimize function output after bufferization.

Try to remove unnecessary address return.

-hivm-opt-single-point

Optimize single point hivm op by scalar operation.

This pass optimizes single point hivm ops by using scalar operations.

-hivm-plan-memory

Plan memory for HIVM Ops

Options

-mem-plan-mode                 : plan mem mode (default is LOCAL_MEM_PLAN)
-enable-global-workspace-reuse : Enable global workspace reuse (default is false)
-restrict-inplace-as-isa       : restrict memory inplace as isa (default is false)

-hivm-recognize-deinterleave-op

Optimize discontinuous access to deinterleave.

This pass optimizes discontinuous memory access using deinterleave.

-hivm-reduce-rank-subview

Reduce rank using subview

-hivm-set-buffer-size

-hivm-split-mix-kernel

Split Mix device functions into AICube and AIVector functions.

Split mix kernels into separate AICube and AIVector kernels, and mark the parent module as a Mix module.

Note:

  • If a Mix kernel is called within a Host function, a function declaration is generated for the final kernel launch. Calling a Mix kernel within a device function is currently not supported.

Input:

func (workspace) attribute {tcore_type = #hivm.tcore_type<CUBE_OR_VECTOR>} {
  t = cube_op ins() outs(workspace)
  ... = vector_op ins(t) ...
}

Output:

func (workspace) attribute {tcore_type = #hivm.tcore_type<CUBE>} {
  t = cube_op ins() outs(workspace)
  annotation.mark t // mark to avoid dce
}

func (workspace) attribute {tcore_type = #hivm.tcore_type<VECTOR>} {
  ... = vector_op ins(workspace) ...
}

-hivm-sync-block-hoisting

Hoist syncblock lock and unlock operations to the parent region if they are inside an scf.for or scf.while

-hivm-tile-batchmm-into-loop

Tile batch matmul into loop with iteration on batch dimension
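Conceptually, tiling on the batch dimension turns one batched matmul into a loop of ordinary 2-D matmuls. A minimal sketch on nested Python lists; the function names are illustrative, and nothing here reflects how the pass rewrites the IR.

```python
def matmul_2d(lhs, rhs):
    """Plain 2-D matrix product of two lists of rows."""
    rows, inner, cols = len(lhs), len(rhs), len(rhs[0])
    return [
        [sum(lhs[i][k] * rhs[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def batch_matmul_as_loop(lhs, rhs):
    """Batched matmul written as a loop over the leading batch dimension."""
    if len(lhs) != len(rhs):
        raise ValueError("batch sizes must match")
    result = []
    for batch in range(len(lhs)):  # one 2-D matmul per batch element
        result.append(matmul_2d(lhs[batch], rhs[batch]))
    return result


lhs = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
rhs = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
out = batch_matmul_as_loop(lhs, rhs)
```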

-insert-workspace-for-mix-cv

Insert workspace for mix cv

-tile-cube-vector-loop

Tile cube and vector loops on local buffer

This pass will attempt to tile cube and vector ops again on the local buffer because:

  1. we can reduce the amount of inter-core synchronizations, which is costly.

  2. we can make the tiling size bigger.

Options

-tile-mix-vector-loop : The trip count of the tiled vector loop for mix kernels
-tile-mix-cube-loop   : The trip count of the tiled cube loop for mix kernels

-triton-global-kernel-args-to-hivm-op