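MLIR: a matrix multiplication op, creating a new Dialect, and lowering

MLIR: creating a new Dialect and lowering

Multi-Level Intermediate Representation (MLIR) is a new approach to building reusable and extensible compiler infrastructure.

Dialects are at the core of the MLIR project. MLIR itself ships with Dialects such as linalg, tosa, and affine. Different Dialects make different kinds of optimizations and transformations possible.

If the earlier parts were MLIR's hill start, this section is where we launch off the catapult. This installment starts covering Dialect lowering, that is, the process of converting MLIR code step by step into machine code.

As mentioned before, the MLIR ecosystem targets only the intermediate stages, so its lowering does not involve much of the final IR generation; that part relies on LLVM as its foundation.

There is a lot of material here, so consider bookmarking it and reading slowly.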

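1 Review

For the toolchain, the overview, and other background, see the earlier articles under the MLIR tag.

The goal of the mlir-hello^[1] project is to implement a hello world through the MLIR ecosystem using a self-built Dialect. Concretely:

Build hello-opt, which converts the original print.mlir (think of it as the main.cpp of a hello world) into a print.ll file
Run the print.ll file directly with LLVM's lli interpreter

The previous article mainly covered how to define a new Dialect/Op with ODS^[2].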

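2 Lowering

MLIR looks clean, but implementing the related Passes is still a huge amount of work.

After every aspect of HelloDialect has been defined and written, its ops still have to be brought back to the "standard library" Dialects that ship with MLIR before hardware-oriented code generation can happen, because the remaining work for those Dialects connects "painlessly" to LLVM's basic components.

In mlir-hello specifically, the lowering from HelloDialect to the standard Dialects, such as the affine dialect and the llvm dialect, is coded by hand.

This is probably the most labor-intensive part of MLIR-related work.

As the start of the lowering material, this article explains how to implement the lowering from HelloDialect to the affine dialect in C++.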

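3 Code walkthrough

Put simply, Dialect-to-Dialect lowering is a match-and-rewrite process.

Note that one C++ technique introduced earlier and used heavily in MLIR may be worth reviewing: CRTP (the Curiously Recurring Template Pattern), in which a class passes itself as a template argument to its own base class.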
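As a refresher, here is a minimal CRTP sketch in plain C++. It does not use MLIR at all; the names PatternBase, ConstantLowering, rewriteImpl, and applyPattern are made up for illustration. The pass shown later in this article uses the same shape when it derives from mlir::PassWrapper<HelloToAffineLowerPass, ...>.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class: Derived passes itself as the template argument.
template <typename Derived>
struct PatternBase {
  // Dispatches statically to Derived::rewriteImpl; no virtual call is involved.
  std::string rewrite(const std::string &opName) const {
    return static_cast<const Derived *>(this)->rewriteImpl(opName);
  }
};

// Hypothetical derived class: inherits from the base instantiated with itself.
struct ConstantLowering : PatternBase<ConstantLowering> {
  std::string rewriteImpl(const std::string &opName) const {
    return "lowered(" + opName + ")";
  }
};

// Generic code can be written once against the base template.
template <typename Derived>
std::string applyPattern(const PatternBase<Derived> &pattern,
                         const std::string &opName) {
  return pattern.rewrite(opName);
}
```

Calling applyPattern(ConstantLowering{}, "hello.constant") resolves rewriteImpl at compile time, which is how a base class can reuse behavior of the derived class without virtual dispatch.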

Pass registration

mlir-hello/include/Hello/HelloPasses.h

The header declares two factory functions, each returning a std::unique_ptr<mlir::Pass>, one per lowering pass.

These declarations act as hooks; the definition of createLowerToAffinePass is given in the .cpp file covered in the next section.

// mlir-hello/include/Hello/HelloPasses.h
// Declares the factory functions of the two lowering passes; nothing special.

#ifndef MLIR_HELLO_PASSES_H
#define MLIR_HELLO_PASSES_H

#include <memory>

#include "mlir/Pass/Pass.h"

namespace hello {
  std::unique_ptr<mlir::Pass> createLowerToAffinePass();
  std::unique_ptr<mlir::Pass> createLowerToLLVMPass();
}

#endif // MLIR_HELLO_PASSES_H

Lowering implementation

mlir-hello/lib/Hello/LowerToAffine.cpp

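This file implements the lowering from hello to affine. It has two parts: the per-Op lowering patterns and the Dialect-to-Dialect pass. The final createLowerToAffinePass is the definition behind the factory function declared in the header.

1. Op lowering

For some op Xxx, the lowering patterns share the following shape:

Defined as class XxxOpLowering

Inherits from mlir::OpRewritePattern<hello::XxxOp> (or mlir::OpConversionPattern<hello::XxxOp>, as PrintOpLowering does in the full listing)

Overrides matchAndRewrite with the concrete implementation

XxxOpLowering is finally passed as a template argument to add<...>() on the mlir::RewritePatternSet of the new pass

For example, class ConstantOpLowering is implemented as follows. It writes the values carried by ConstantOp into a memref, one mlir::AffineStoreOp per element.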

class ConstantOpLowering : public mlir::OpRewritePattern<hello::ConstantOp> {
  using OpRewritePattern<hello::ConstantOp>::OpRewritePattern;

  mlir::LogicalResult matchAndRewrite(hello::ConstantOp op, mlir::PatternRewriter &rewriter) const final {
    // Capture the metadata of the ConstantOp: its value and its location.
    mlir::DenseElementsAttr constantValue = op.getValue();
    mlir::Location loc = op.getLoc();

    // When lowering, the constant's values are stored into a memref allocation.
    auto tensorType = op.getType().cast<mlir::TensorType>();
    auto memRefType = convertTensorToMemRef(tensorType);
    auto alloc = insertAllocAndDealloc(memRefType, loc, rewriter);

    // Create the constant indices up front, up to the largest dimension,
    // to avoid generating many redundant operations.
    auto valueShape = memRefType.getShape();
    mlir::SmallVector<mlir::Value, 8> constantIndices;

    if (!valueShape.empty()) {
      for (auto i : llvm::seq<int64_t>(
          0, *std::max_element(valueShape.begin(), valueShape.end())))
        constantIndices.push_back(rewriter.create<mlir::arith::ConstantIndexOp>(loc, i));
    } else {
      // The case of a rank-0 tensor.
      constantIndices.push_back(rewriter.create<mlir::arith::ConstantIndexOp>(loc, 0));
    }

    // ConstantOp represents a multi-dimensional constant, so a store has to be
    // generated for each element. The recursion looks like this:
    // [4, 3] (1, 2, 3, 4, 5, 6, 7, 8)
    // storeElements(0)
    //   indices = [0]
    //   storeElements(1)
    //     indices = [0, 0]
    //     storeElements(2)
    //       store (const 1) [0, 0]
    //     indices = [0]
    //     indices = [0, 1]
    //     storeElements(2)
    //       store (const 2) [0, 1]
    //  ...

    // A recursive functor captures this: starting from the first dimension,
    // it recurses into the 2nd, 3rd, ... dimension and visits every element.
    mlir::SmallVector<mlir::Value, 2> indices;
    auto valueIt = constantValue.getValues<mlir::FloatAttr>().begin();
    std::function<void(uint64_t)> storeElements = [&](uint64_t dimension) {
      // Base case: every dimension has an index, so store the element
      // at the current index.
      if (dimension == valueShape.size()) {
        rewriter.create<mlir::AffineStoreOp>(
            loc, rewriter.create<mlir::arith::ConstantOp>(loc, *valueIt++), alloc,
            llvm::makeArrayRef(indices));
        return;
      }
      // Otherwise, iterate over the current dimension and extend the index list.
      for (uint64_t i = 0, e = valueShape[dimension]; i != e; ++i) {
        indices.push_back(constantIndices[i]);
        storeElements(dimension + 1);
        indices.pop_back();
      }
    };

    // Start the recursion from the first dimension.
    storeElements(/*dimension=*/0);

    // Replace this operation with the generated alloc.
    rewriter.replaceOp(op, alloc);
    return mlir::success();
  }
};
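To see the order in which the recursive functor visits the elements, the same recursion can be reproduced without MLIR. The sketch below is only an illustration: enumerateIndices is a hypothetical helper that records each index tuple at the point where the real pattern would emit an AffineStoreOp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: lists every index tuple of `shape` in the order the
// recursion visits them (row-major, last dimension varying fastest).
std::vector<std::vector<int64_t>> enumerateIndices(const std::vector<int64_t> &shape) {
  std::vector<std::vector<int64_t>> visited;
  std::vector<int64_t> indices;
  std::function<void(std::size_t)> walk = [&](std::size_t dimension) {
    // Base case: one index per dimension has been chosen.
    if (dimension == shape.size()) {
      visited.push_back(indices);
      return;
    }
    // Otherwise, try every index of the current dimension and recurse.
    for (int64_t i = 0; i < shape[dimension]; ++i) {
      indices.push_back(i);
      walk(dimension + 1);
      indices.pop_back();
    }
  };
  walk(0);
  return visited;
}
```

For a shape of {2, 3} this yields [0,0], [0,1], [0,2], [1,0], [1,1], [1,2], which matches the order of the flat value iterator valueIt. For a rank-0 shape the base case is hit immediately and a single empty index tuple is produced.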

2. Dialect to Dialect

Once the op lowerings are defined, a point-to-point lowering pass describes how to convert between Dialects.

The class HelloToAffineLowerPass mainly has to implement the runOnOperation function.

namespace {
// Derives from PassWrapper to define HelloToAffineLowerPass; an instance of it
// is what the factory function declared in the header returns.
class HelloToAffineLowerPass : public mlir::PassWrapper<HelloToAffineLowerPass, mlir::OperationPass<mlir::ModuleOp>> {
public:
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(HelloToAffineLowerPass)

  // The standard Dialects this pass depends on.
  void getDependentDialects(mlir::DialectRegistry &registry) const override {
      registry.insert<mlir::AffineDialect, mlir::func::FuncDialect, mlir::memref::MemRefDialect>();
  }

  void runOnOperation() final;
};
}

// The function to implement; it describes how the lowering is done.
void HelloToAffineLowerPass::runOnOperation() {
    // Create the conversion target from the current context.
    mlir::ConversionTarget target(getContext());

    // addIllegalDialect marks HelloDialect as illegal (it must be lowered).
    target.addIllegalDialect<hello::HelloDialect>();
    // Declare which Dialects are legal (the lowering targets, usually standard Dialects).
    target.addLegalDialect<mlir::AffineDialect, mlir::BuiltinDialect,
    mlir::func::FuncDialect, mlir::arith::ArithDialect, mlir::memref::MemRefDialect>();
    // Legality can also be decided dynamically: here hello::PrintOp is legal
    // only when none of its operands is still a tensor.
    target.addDynamicallyLegalOp<hello::PrintOp>([](hello::PrintOp op) {
        return llvm::none_of(op->getOperandTypes(),
                            [](mlir::Type type) { return type.isa<mlir::TensorType>(); });
    });

    // Describe how to lower: add the lowering patterns of the illegal ops
    // to the RewritePatternSet as template arguments of add<...>().
    mlir::RewritePatternSet patterns(&getContext());
    patterns.add<ConstantOpLowering, PrintOpLowering>(&getContext());

    if (mlir::failed(mlir::applyPartialConversion(getOperation(), target, std::move(patterns)))) {
        signalPassFailure();
    }
}
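The legality rules configured in runOnOperation can be mimicked with a small toy model in plain C++. This is only a sketch of the idea, not the MLIR API: ToyOp, ToyTarget, allLegal, and makeHelloTarget are hypothetical names, and the model assumes that a per-op rule takes precedence over the dialect-wide rule and that ops from unlisted dialects are left alone, as a partial conversion does.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an operation: a name such as "hello.print" plus a flag
// saying whether any operand still has a tensor type.
struct ToyOp {
  std::string name;
  bool hasTensorOperand;
};

// Toy stand-in for a conversion target; this is NOT mlir::ConversionTarget.
struct ToyTarget {
  std::set<std::string> illegalDialects;
  std::map<std::string, std::function<bool(const ToyOp &)>> dynamicOps;

  static std::string dialectOf(const ToyOp &op) {
    return op.name.substr(0, op.name.find('.'));
  }

  // A per-op dynamic rule wins over the dialect-wide rule.
  bool isLegal(const ToyOp &op) const {
    auto it = dynamicOps.find(op.name);
    if (it != dynamicOps.end())
      return it->second(op);
    return illegalDialects.count(dialectOf(op)) == 0;
  }
};

// In this model the conversion succeeds when no illegal op is left.
bool allLegal(const ToyTarget &target, const std::vector<ToyOp> &ops) {
  for (const ToyOp &op : ops)
    if (!target.isLegal(op))
      return false;
  return true;
}

// Hypothetical setup mirroring the pass: hello is illegal, except that
// hello.print is legal once none of its operands is a tensor.
ToyTarget makeHelloTarget() {
  ToyTarget target;
  target.illegalDialects.insert("hello");
  target.dynamicOps["hello.print"] = [](const ToyOp &op) {
    return !op.hasTensorOperand;
  };
  return target;
}
```

Under this model, hello.constant is always illegal and must be rewritten by ConstantOpLowering, while hello.print survives the pass as soon as its operands have been updated to memrefs, which is exactly what PrintOpLowering does.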

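Implementing Passes is indeed a lot of work, but it is unavoidable, because the path from a new Dialect to the standard Dialects has to be written by hand. This is also where many of the objections to MLIR come from. We will continue in the next installment.

Appendix: full code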

mlir-hello/lib/Hello/LowerToAffine.cpp



// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

#include "Hello/HelloDialect.h"
#include "Hello/HelloOps.h"
#include "Hello/HelloPasses.h"

#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/DialectConversion.h"
#include "llvm/ADT/Sequence.h"

static mlir::MemRefType convertTensorToMemRef(mlir::TensorType type) {
  assert(type.hasRank() && "expected only ranked shapes");
  return mlir::MemRefType::get(type.getShape(), type.getElementType());
}

static mlir::Value insertAllocAndDealloc(mlir::MemRefType type, mlir::Location loc,
                                         mlir::PatternRewriter &rewriter) {
  auto alloc = rewriter.create<mlir::memref::AllocOp>(loc, type);

  // Make sure to allocate at the beginning of the block.
  auto *parentBlock = alloc->getBlock();
  alloc->moveBefore(&parentBlock->front());

  // Make sure to deallocate this alloc at the end of the block. This is fine
  // as toy functions have no control flow.
  auto dealloc = rewriter.create<mlir::memref::DeallocOp>(loc, alloc);
  dealloc->moveBefore(&parentBlock->back());
  return alloc;
}

class ConstantOpLowering : public mlir::OpRewritePattern<hello::ConstantOp> {
  using OpRewritePattern<hello::ConstantOp>::OpRewritePattern;

  mlir::LogicalResult matchAndRewrite(hello::ConstantOp op, mlir::PatternRewriter &rewriter) const final {
    mlir::DenseElementsAttr constantValue = op.getValue();
    mlir::Location loc = op.getLoc();

    // When lowering the constant operation, we allocate and assign the constant
    // values to a corresponding memref allocation.
    auto tensorType = op.getType().cast<mlir::TensorType>();
    auto memRefType = convertTensorToMemRef(tensorType);
    auto alloc = insertAllocAndDealloc(memRefType, loc, rewriter);

    // We will be generating constant indices up-to the largest dimension.
    // Create these constants up-front to avoid large amounts of redundant
    // operations.
    auto valueShape = memRefType.getShape();
    mlir::SmallVector<mlir::Value, 8> constantIndices;

    if (!valueShape.empty()) {
      for (auto i : llvm::seq<int64_t>(
          0, *std::max_element(valueShape.begin(), valueShape.end())))
        constantIndices.push_back(rewriter.create<mlir::arith::ConstantIndexOp>(loc, i));
    } else {
      // This is the case of a tensor of rank 0.
      constantIndices.push_back(rewriter.create<mlir::arith::ConstantIndexOp>(loc, 0));
    }
    // The constant operation represents a multi-dimensional constant, so we
    // will need to generate a store for each of the elements. The following
    // functor recursively walks the dimensions of the constant shape,
    // generating a store when the recursion hits the base case.

    // [4, 3] (1, 2, 3, 4, 5, 6, 7, 8)
    // storeElements(0)
    //   indices = [0]
    //   storeElements(1)
    //     indices = [0, 0]
    //     storeElements(2)
    //       store (const 1) [0, 0]
    //     indices = [0]
    //     indices = [0, 1]
    //     storeElements(2)
    //       store (const 2) [0, 1]
    //  ...
    //
    mlir::SmallVector<mlir::Value, 2> indices;
    auto valueIt = constantValue.getValues<mlir::FloatAttr>().begin();
    std::function<void(uint64_t)> storeElements = [&](uint64_t dimension) {
      // The last dimension is the base case of the recursion, at this point
      // we store the element at the given index.
      if (dimension == valueShape.size()) {
        rewriter.create<mlir::AffineStoreOp>(
            loc, rewriter.create<mlir::arith::ConstantOp>(loc, *valueIt++), alloc,
            llvm::makeArrayRef(indices));
        return;
      }

      // Otherwise, iterate over the current dimension and add the indices to
      // the list.
      for (uint64_t i = 0, e = valueShape[dimension]; i != e; ++i) {
        indices.push_back(constantIndices[i]);
        storeElements(dimension + 1);
        indices.pop_back();
      }
    };

    // Start the element storing recursion from the first dimension.
    storeElements(/*dimension=*/0);

    // Replace this operation with the generated alloc.
    rewriter.replaceOp(op, alloc);
    return mlir::success();
  }
};

class PrintOpLowering : public mlir::OpConversionPattern<hello::PrintOp> {
  using OpConversionPattern<hello::PrintOp>::OpConversionPattern;

  mlir::LogicalResult matchAndRewrite(hello::PrintOp op, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const final {
      // We don't lower "hello.print" in this pass, but we need to update its
      // operands.
      rewriter.updateRootInPlace(op,
                                 [&] { op->setOperands(adaptor.getOperands()); });
      return mlir::success();
  }
};

namespace {
class HelloToAffineLowerPass : public mlir::PassWrapper<HelloToAffineLowerPass, mlir::OperationPass<mlir::ModuleOp>> {
public:
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(HelloToAffineLowerPass)

  void getDependentDialects(mlir::DialectRegistry &registry) const override {
      registry.insert<mlir::AffineDialect, mlir::func::FuncDialect, mlir::memref::MemRefDialect>();
  }

  void runOnOperation() final;
};
}

void HelloToAffineLowerPass::runOnOperation() {
  mlir::ConversionTarget target(getContext());

  target.addIllegalDialect<hello::HelloDialect>();
  target.addLegalDialect<mlir::AffineDialect, mlir::BuiltinDialect,
    mlir::func::FuncDialect, mlir::arith::ArithDialect, mlir::memref::MemRefDialect>();
  target.addDynamicallyLegalOp<hello::PrintOp>([](hello::PrintOp op) {
      return llvm::none_of(op->getOperandTypes(),
                           [](mlir::Type type) { return type.isa<mlir::TensorType>(); });
  });

  mlir::RewritePatternSet patterns(&getContext());
  patterns.add<ConstantOpLowering, PrintOpLowering>(&getContext());

  if (mlir::failed(mlir::applyPartialConversion(getOperation(), target, std::move(patterns)))) {
    signalPassFailure();
  }
}

std::unique_ptr<mlir::Pass> hello::createLowerToAffinePass() {
  return std::make_unique<HelloToAffineLowerPass>();
}

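Adding a matrix multiplication op to MLIR

Building MLIR: follow the official documentation. Note that the build needs a machine with plenty of memory.

Code for each chapter: mlir/examples/toy/Ch..
Examples for each chapter: mlir/test/Examples/Toy/Ch..
Executables for each chapter: build/bin

Ch1

Ch1 mainly introduces the Toy language and its AST. The AST is produced from Toy source with a command such as:

./toyc-ch1 ../../mlir/test/Examples/Toy/Ch1/ast.toy -emit=ast

toyc-ch1 and the other executables are in the LLVM-main/build/bin folder.

Ch2

Ch2 shows how to define a Dialect (the tutorial's Toy Dialect is defined in Ops.td, in TableGen form) and how to define its operations.

Below we use ODS to define matrix multiplication, MatmulOp. (The tutorial discusses Op vs Operation in MLIR: Operation is the generic class that models every operation, while Op classes such as MatmulOp are typed wrappers around an Operation*, not subclasses of it.)

(1) Edit Ops.td

This can follow the existing MulOp almost exactly, because at this point we are only defining the basic structure of the Op, not what the Op actually does.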

def MatmulOp : Toy_Op<"matmul"> {
  let summary = "matrix multiplication";
  let description = [{
    The "matmul" operation performs multiplication between two matrices.
  }];
  let arguments = (ins F64Tensor:$lhs, F64Tensor:$rhs); // input operands
  let results = (outs F64Tensor); // result
  let parser = [{ return ::parseBinaryOp(parser, result); }];
  let printer = [{ return ::printBinaryOp(p, *this); }];
  let builders = [
    OpBuilder<(ins "Value":$lhs, "Value":$rhs)>
  ];
}

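(2) Edit Dialect.cpp

Define a custom build method. For now, try to get a feel for what build does; a later article will cover it in detail:

void MatmulOp::build(mlir::OpBuilder &builder, mlir::OperationState &state,
                     mlir::Value lhs, mlir::Value rhs) {
  // The result shape is not known yet, so the result is an unranked f64 tensor.
  state.addTypes(UnrankedTensorType::get(builder.getF64Type()));
  state.addOperands({lhs, rhs});
}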

(3) Edit MLIRGen.cpp

Add handling for calls to matmul:

if (callee == "transpose") {
      if (call.getArgs().size() != 1) {
        emitError(location, "MLIR codegen encountered an error: toy.transpose "
                            "does not accept multiple arguments");
        return nullptr;
      }
      return builder.create<TransposeOp>(location, operands[0]);
    } else if (callee == "matmul") {
      // matmul is a binary op: reject any call without exactly two arguments.
      if (call.getArgs().size() != 2) {
        emitError(location, "MLIR codegen encountered an error: toy.matmul "
                            "expects exactly two arguments");
        return nullptr;
      }
      return builder.create<MatmulOp>(location, operands[0], operands[1]);
    }

This completes the code that can be written in the Ch2 part. To try it out, rewrite the example codegen.toy as:

def multiply_transpose(a, b) {
  return transpose(a) * transpose(b);
}

def main() {
  var a<2, 3> = [[1, 2, 3], [4, 5, 6]];
  var b<2, 3> = [1, 2, 3, 4, 5, 6];
  var c = multiply_transpose(a, b);
  var d = multiply_transpose(b, a);
  var g = matmul(transpose(a), b);
  print(g);
  print(d);
}

First rebuild the code:

Run cmake --build . in the build folder

Then, in the bin folder, run `./toyc-ch2 ../../mlir/test/Examples/Toy/Ch2/codegen.toy --emit=mlir`

Ch3

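Ch2 defined the basic structure of the matmul op, but that is far from enough to say how it behaves internally. Ch3 shows how to implement Op optimizations through the canonicalization pass. The tutorial optimizes transpose and reshape; here we implement the rewrite matmul(matmul(A,B),C) -> matmul(A,matmul(B,C)) for MatmulOp.

(1) Modify Ops.td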

def MatmulOp : Toy_Op<"matmul",
    [NoSideEffect]> { // add the NoSideEffect trait so that optimizations can be more thorough
  let summary = "matrix multiplication";
  let description = [{
    The "matmul" operation performs multiplication between two matrices.
  }];
  let arguments = (ins F64Tensor:$lhs, F64Tensor:$rhs);
  let results = (outs F64Tensor);
  let parser = [{ return ::parseBinaryOp(parser, result); }];
  let printer = [{ return ::printBinaryOp(p, *this); }];
  let builders = [
    OpBuilder<(ins "Value":$lhs, "Value":$rhs)>
  ];
  let hasCanonicalizer = 1; // enabled: this Op provides patterns for the canonicalization pass
}

(2) Modify ToyCombine.cpp

// Added: rewrite matmul(matmul(A, B), C) into matmul(A, matmul(B, C)).
struct RepositionRedundantMatmul : public mlir::OpRewritePattern<MatmulOp> {
  RepositionRedundantMatmul(mlir::MLIRContext *context)
      : OpRewritePattern<MatmulOp>(context, /*benefit=*/2) {}

  mlir::LogicalResult matchAndRewrite(MatmulOp op,
                                      mlir::PatternRewriter &rewriter) const override {
    mlir::Value MatmulLhs = op.getOperands()[0]; // the first operand
    mlir::Value MatmulRhs = op.getOperands()[1];
    MatmulOp matmulLhsOp = MatmulLhs.getDefiningOp<MatmulOp>();
    // Match only when the first operand is itself produced by a MatmulOp.
    if (!matmulLhsOp)
      return failure();
    // Create the new ops: B x C first, then A x (B x C).
    auto BxC = rewriter.create<MatmulOp>(op.getLoc(), matmulLhsOp.getOperands()[1], MatmulRhs);
    auto AxBC = rewriter.create<MatmulOp>(op.getLoc(), matmulLhsOp.getOperands()[0], BxC);
    rewriter.replaceOp(op, {AxBC}); // replace the original Op
    return success();
  }
};

// This is effectively the registration step: it makes the pattern available
// to the canonicalization pass.
void MatmulOp::getCanonicalizationPatterns(RewritePatternSet &results, MLIRContext *context) {
  results.add<RepositionRedundantMatmul>(context);
}
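Reassociating a matmul chain keeps the result the same but can change the amount of work. The sketch below counts scalar multiplications for both groupings from the shapes alone; Shape, mulCount, resultShape, costLeftNested, and costRightNested are hypothetical helpers written for this illustration. Note that the pattern above applies the rewrite unconditionally and does not look at the cost.

```cpp
#include <cassert>

// Hypothetical 2-D shape: rows x cols.
struct Shape {
  long rows;
  long cols;
};

// Scalar multiplications needed by the schoolbook product lhs * rhs.
long mulCount(Shape lhs, Shape rhs) {
  assert(lhs.cols == rhs.rows && "inner dimensions must agree");
  return lhs.rows * lhs.cols * rhs.cols;
}

// Shape of the product lhs * rhs.
Shape resultShape(Shape lhs, Shape rhs) {
  assert(lhs.cols == rhs.rows && "inner dimensions must agree");
  return Shape{lhs.rows, rhs.cols};
}

// Cost of matmul(matmul(A, B), C).
long costLeftNested(Shape a, Shape b, Shape c) {
  return mulCount(a, b) + mulCount(resultShape(a, b), c);
}

// Cost of matmul(A, matmul(B, C)).
long costRightNested(Shape a, Shape b, Shape c) {
  return mulCount(b, c) + mulCount(a, resultShape(b, c));
}
```

For shapes 3x2, 2x3, and 3x2, the left-nested form needs 18 + 18 = 36 multiplications and the right-nested form needs 12 + 12 = 24, so the rewrite happens to be cheaper for these shapes; for other shapes it can be more expensive.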

After building, verify the result.

transpose_transpose.toy:

def transpose_transpose(x) {
  return transpose(transpose(x));
}

def main() {
  var a<2, 3> = [[1, 2, 3], [4, 5, 6]];
  var b = transpose_transpose(a);
  var c = matmul(matmul(transpose(a), b), transpose(a));
  print(c);
  print(b);
}

`./toyc-ch3 ../../mlir/test/Examples/Toy/Ch3/transpose_transpose.toy --emit=mlir` (The figure shows the result shapes as they appear after Ch4 has been completed, so the result tensors already have concrete shapes. Without Ch4, the result type is the unranked tensor<*xf64> produced by the build method.)

Ch4

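Ch4 covers shape inference implemented with interfaces. For example, before inference the op has the type

(tensor<2x3xf64>, tensor<3x2xf64>) -> tensor<*xf64>, and after shape inference it has the type

(tensor<2x3xf64>, tensor<3x2xf64>) -> tensor<2x2xf64>. In my view, the point of shape inference is that knowing the result shape ahead of time makes it easier to plan memory allocation.

(1) Modify Dialect.cpp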

void MatmulOp::inferShapes() {
  auto lhsShape = getOperand(0).getType().cast<RankedTensorType>().getShape();
  auto rhsShape = getOperand(1).getType().cast<RankedTensorType>().getShape();
  // Build the result shape: rows of lhs by columns of rhs.
  SmallVector<int64_t, 2> dims;
  dims.push_back(lhsShape[0]);
  dims.push_back(rhsShape[1]);
  getResult().setType(RankedTensorType::get(
      dims, getOperand(0).getType().cast<RankedTensorType>().getElementType()));
}
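The shape rule itself is small enough to check in isolation. The following sketch uses a hypothetical free function, inferMatmulShape, that works on plain vectors instead of MLIR types. Unlike the inferShapes method above, it also rejects operands whose inner dimensions disagree; that check is an addition made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: writes the shape of lhs * rhs into `result`.
// Returns false for non-2-D operands or mismatched inner dimensions.
bool inferMatmulShape(const std::vector<int64_t> &lhs,
                      const std::vector<int64_t> &rhs,
                      std::vector<int64_t> &result) {
  if (lhs.size() != 2 || rhs.size() != 2 || lhs[1] != rhs[0])
    return false;
  // (m x k) * (k x n) -> (m x n)
  result = {lhs[0], rhs[1]};
  return true;
}
```

With operands of shape 2x3 and 3x2 the inferred shape is 2x2, matching the tensor<2x2xf64> result type shown earlier.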

(2) Modify Ops.td

def MatmulOp : Toy_Op<"matmul",
    [NoSideEffect, DeclareOpInterfaceMethods<ShapeInferenceOpInterface>]> { // the interface is added here
  let summary = "matmul operation";
  let description = [{
    matmul operation
  }];
  let arguments = (ins F64Tensor:$lhs, F64Tensor:$rhs);
  let results = (outs F64Tensor);
  let parser = [{ return ::parseBinaryOp(parser, result); }];
  let printer = [{ return ::printBinaryOp(p, *this); }];
  let builders = [
    OpBuilder<(ins "Value":$lhs, "Value":$rhs)>
  ];
  let hasCanonicalizer = 1;
}

The effect is what the figure in the Ch3 section already shows.

Ch5

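Ch5 implements lowering to the lower-level Affine Dialect. This is harder than the previous chapters, because here the matrix multiplication itself has to be spelled out.

The main changes are in LowerToAffineLoops.cpp: a helper that builds the loop nest and a conversion pattern that uses it. The pattern also has to be added to the pattern list of the lowering pass, next to the existing lowerings.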

static void lowerOpToLoopsMatmul(Operation *op, ValueRange operands,
                                 PatternRewriter &rewriter,
                                 LoopIterationFn processIteration) {
  auto tensorType = (*op->result_type_begin()).cast<TensorType>(); // result type
  auto loc = op->getLoc();
  auto memRefType = convertTensorToMemRef(tensorType);
  // Allocate space for the result; alloc acts like a pointer to it.
  auto alloc = insertAllocAndDealloc(memRefType, loc, rewriter);
  SmallVector<int64_t, 4> lowerBounds(tensorType.getRank(), 0);
  SmallVector<int64_t, 4> steps(tensorType.getRank(), 1);
  // Reduction size: dimension 1 of the first operand, which equals
  // dimension 0 of the second operand.
  SmallVector<int64_t, 1> dimV;
  auto dim = op->getOperand(0).getType().cast<RankedTensorType>().getShape()[1];
  dimV.push_back(dim);
  // Build the two outer loops over the result indices.
  buildAffineLoopNest(rewriter, loc, lowerBounds, tensorType.getShape(), steps,
    [&](OpBuilder &nestedBuilder, Location loc, ValueRange ivs) {
      // First initialize the result element to 0. The indices are used in loop
      // order, without llvm::reverse, so that no entry of the output is left
      // as 0; the index order must be the same wherever the buffer is accessed.
      SmallVector<Value, 2> setZeroIvs(ivs);
      // Store an explicit 0.0 constant. Loading the freshly allocated element
      // and computing x - x would give NaN whenever the memory holds NaN or inf.
      Value zero = rewriter.create<arith::ConstantOp>(loc, rewriter.getF64FloatAttr(0.0));
      // The store writes a value into the buffer at the given indices.
      rewriter.create<AffineStoreOp>(loc, zero, alloc, llvm::makeArrayRef(setZeroIvs));

      // Prepare the innermost loop.
      SmallVector<int64_t, 4> lowerBounds(1, 0); // the 1 means a single loop
      SmallVector<int64_t, 4> steps(1, 1);
      // Keep the induction variables of the two outer loops for the innermost body.
      ValueRange resultIvs = ivs;
      SmallVector<Value, 3> forIvs;
      forIvs.push_back(ivs[0]);
      forIvs.push_back(ivs[1]);
      // Build the innermost loop over the reduction dimension.
      buildAffineLoopNest(rewriter, loc, lowerBounds, dimV, steps,
        [&](OpBuilder &nestedBuilder, Location loc, ValueRange ivs) {
          // All three induction variables are available here.
          forIvs.push_back(ivs[0]);
          Value valueToAdd = processIteration(nestedBuilder, operands, forIvs);
          // Accumulate: result[i][j] += lhs[i][k] * rhs[k][j].
          auto loadResult = nestedBuilder.create<AffineLoadOp>(loc, alloc, resultIvs);
          Value valueToStore = nestedBuilder.create<arith::AddFOp>(loc, loadResult, valueToAdd);
          nestedBuilder.create<AffineStoreOp>(loc, valueToStore, alloc, resultIvs);
        });
    });
  // The op can be replaced directly with the result buffer.
  rewriter.replaceOp(op, alloc);
}

// Provides the body of the innermost loop.
struct MatmulOpLowering : public ConversionPattern {
  MatmulOpLowering(MLIRContext *ctx)
      : ConversionPattern(toy::MatmulOp::getOperationName(), 1, ctx) {}

  LogicalResult matchAndRewrite(Operation *op, ArrayRef<Value> operands,
                                ConversionPatternRewriter &rewriter) const final {
    auto loc = op->getLoc();
    lowerOpToLoopsMatmul(op, operands, rewriter,
      [loc](OpBuilder &builder, ValueRange memRefOperands, ValueRange loopIvs) {
        typename toy::MatmulOpAdaptor MatmulAdaptor(memRefOperands);
        // loopIvs = (i, j, k): load lhs[i][k] and rhs[k][j].
        SmallVector<Value, 2> LhsIvs, RhsIvs;
        LhsIvs.push_back(loopIvs[0]);
        LhsIvs.push_back(loopIvs[2]);
        RhsIvs.push_back(loopIvs[2]);
        RhsIvs.push_back(loopIvs[1]);
        auto loadedLhs = builder.create<AffineLoadOp>(loc, MatmulAdaptor.lhs(), LhsIvs);
        auto loadedRhs = builder.create<AffineLoadOp>(loc, MatmulAdaptor.rhs(), RhsIvs);
        return builder.create<arith::MulFOp>(loc, loadedLhs, loadedRhs);
      });
    return success();
  }
};
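The loop nest generated by this lowering corresponds to the classic three-loop matrix product. As a reference for what the affine loops compute, here is a plain C++ sketch; Matrix and matmulLoops are names chosen for this illustration, and the inputs are assumed to be non-empty and rectangular.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reference implementation mirroring the generated loop nest: two outer
// loops over the result indices (i, j) and one inner reduction loop over k.
Matrix matmulLoops(const Matrix &lhs, const Matrix &rhs) {
  assert(!lhs.empty() && !rhs.empty() && lhs[0].size() == rhs.size());
  const std::size_t rows = lhs.size();
  const std::size_t cols = rhs[0].size();
  const std::size_t inner = rhs.size();
  Matrix out(rows, std::vector<double>(cols));
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      out[i][j] = 0.0; // zero-initialize the result element first
      for (std::size_t k = 0; k < inner; ++k)
        out[i][j] += lhs[i][k] * rhs[k][j]; // lhs uses (i, k), rhs uses (k, j)
    }
  }
  return out;
}
```

For a = [[1, 2, 3], [4, 5, 6]], the product of transpose(a) and a is [[17, 22, 27], [22, 29, 36], [27, 36, 45]], which is the value to expect from matmul(transpose(a), b) when b holds the same six values as a.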

After building, check the result. Rewrite codegen.toy as:

def main() {
  var a<2, 3> = [[1, 2, 3], [4, 5, 6]];
  var b<2, 3> = [1, 2, 3, 4, 5, 6];
  var g = matmul(transpose(a), b);
  print(g);
}

Run: `./toyc-ch5 ../../mlir/test/Examples/Toy/Ch5/codegen.toy --emit=mlir-affine`

Result:

The output matches the definition of matrix multiplication.

Ch6

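Ch6 goes one step further: it lowers to LLVM IR and then executes the code through LLVM's JIT. No code changes are needed here, but you should read through the tutorial chapter.

Rewrite jit.toy:

def main() {
    var a<2, 3> = [[1, 2, 3], [4, 5, 6]];
    var b<3, 5> = [[1, 2, 3, 4, 5],[1,2,3,4,5],[1,2,3,4,5]];
    var c = matmul(a, b);
    print(c);
}

Run: `./toyc-ch6 ../../mlir/test/Examples/Toy/Ch6/jit.toy --emit=mlir-llvm` to lower to the LLVM dialect of MLIR (use `--emit=llvm` to dump the LLVM IR itself)

Run: `./toyc-ch6 ../../mlir/test/Examples/Toy/Ch6/jit.toy --emit=jit` to see the result of executing the matrix multiplication: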