Bonus: Fine-tuning Stable Diffusion V1.4

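1. Overview

This article gives a high-level introduction to Stable Diffusion. It does not attempt to explain the underlying theory, let alone work through the mathematical derivations, because no single module of the Stable Diffusion system can be covered properly in one blog post. The goal is to make the structure of Stable Diffusion clear and then, through a fine-tuning example, to get a feel for diffusion models at the code level.

This article mainly follows a project from the official TensorFlow blog, together with the stable_diffusion model in the keras-cv library. Because the pretrained weights are large, they are not downloaded when the model is constructed; instead they are downloaded in advance from Hugging Face and loaded from local files when the model is built.

The base model fine-tuned in this article is SD v1.4.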

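2. How Stable Diffusion works

2.1 Overall architecture

As the figure above shows, Stable Diffusion consists of three separate modules, from left to right. The first is an image encoder/decoder, namely a VAE. The second is the core of Stable Diffusion, the diffusion module, which is mainly a U-Net. The third is the conditioning module; for text-to-image generation this is the text encoder of CLIP.

To understand Stable Diffusion, the training process and the inference process need to be understood separately.

Training, or fine-tuning, only updates the diffusion module here. The CLIP text encoder and the VAE are already pretrained, and they influence the final result much less than the diffusion module does. The training data for Stable Diffusion consists of text-image pairs. During training, the VAE encoder first compresses each image into the latent space (4*64*64 for a 512*512 image); all subsequent forward and reverse diffusion happens in this latent space. The latent representation is then passed to the diffusion process: noise is added to the latent at a randomly chosen timestep, and the U-Net is trained to predict that noise. To make the generated image controllable, the encoded text is also fed into the U-Net through cross-attention, so that generation attends to the input text. Minimizing the loss between the predicted and the actual noise by backpropagation eventually yields a U-Net that can predict the noise.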
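
The training step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the linear beta schedule and its values are assumed for the example (they are not taken from the fine-tuning code), the "latent" is a short list of numbers rather than a 4*64*64 tensor, and the all-zero `predicted_noise` is a placeholder for what a real U-Net would output.

```python
import math
import random


def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an illustrative linear beta schedule."""
    alphas_cumprod, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0, noise, a_bar_t):
    """Forward diffusion in closed form: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    return [
        math.sqrt(a_bar_t) * x + math.sqrt(1.0 - a_bar_t) * n
        for x, n in zip(x0, noise)
    ]


def mse(pred, target):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
alphas_cumprod = make_alphas_cumprod()

x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]     # stand-in for a flattened image latent
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]  # the noise the U-Net should learn to predict
t = rng.randrange(len(alphas_cumprod))            # random timestep for this training sample
x_t = add_noise(x0, noise, alphas_cumprod[t])     # noisy latent that would be fed to the U-Net

# Placeholder prediction; a real U-Net would compute this from (x_t, t, text embedding).
predicted_noise = [0.0] * len(noise)
loss = mse(predicted_noise, noise)
print(f"timestep={t}, loss={loss:.4f}")
```

In real training the loss is backpropagated through the U-Net only, which is why the VAE and the text encoder can stay frozen.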

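Inference feeds the encoded text and randomly sampled noise into the trained U-Net. The U-Net predicts the noise, which is then removed from the current latent; this is repeated for a number of iterations. Common sampling (noise scheduling) algorithms include DDPM and DDIM. The denoising process recovers the latent representation the image should have, and this latent is finally passed to the VAE decoder, which maps it back to pixel space to produce the generated image.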
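
One denoising iteration can be sketched as follows. The sketch mirrors the update used by the serving function later in this article (guidance scale 7.5, then a deterministic DDIM-style step), but it works on short Python lists instead of tensors, and the function names `guided_noise` and `ddim_step` are made up for this illustration.

```python
import math


def guided_noise(uncond_eps, cond_eps, scale=7.5):
    """Classifier-free guidance: push the unconditional prediction towards the text-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(uncond_eps, cond_eps)]


def ddim_step(x_t, eps, a_t, a_prev):
    """One deterministic DDIM-style update from noise level a_t to the next level a_prev."""
    # Estimate the clean latent implied by the predicted noise.
    pred_x0 = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for x, e in zip(x_t, eps)]
    # Re-noise that estimate to the (lower) noise level of the next step.
    return [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e for p, e in zip(pred_x0, eps)]


# Toy check: if the predicted noise is exactly right and a_prev = 1 (no noise left),
# a single step recovers the clean latent.
x0 = [0.5, -1.0, 2.0]
eps = [0.3, 0.1, -0.7]
a_t = 0.25
x_t = [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x0, eps)]
recovered = ddim_step(x_t, eps, a_t, a_prev=1.0)
print(recovered)
```

In practice the U-Net prediction is never exact, which is one reason the update is applied over many small steps rather than once.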

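2.2 Brief introduction to the algorithms used in Stable Diffusion

Transformer

Bonus: Chinese-to-English translation with a Transformer (tf2.x) - lotuslaw - 博客园 (cnblogs.com)

CLIP

CLIP stands for Contrastive Language-Image Pre-training, a pretraining method (or model) based on contrasting text-image pairs. CLIP is a multimodal model trained with contrastive learning. Unlike contrastive methods in computer vision such as MoCo and SimCLR, its training data consists of text-image pairs: an image and its corresponding text description. The aim is for the model to learn which text matches which image. As shown in the figure below, CLIP contains two models, a Text Encoder and an Image Encoder. The Text Encoder extracts text features and can be a text Transformer of the kind commonly used in NLP; the Image Encoder extracts image features and can be a common CNN or a Vision Transformer.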
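
The matching objective can be illustrated with a small symmetric contrastive loss. This is a simplified sketch, not CLIP's actual training code: the embeddings are hand-written two-dimensional vectors instead of encoder outputs, and the fixed temperature of 0.07 is an assumed value chosen for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i and vice versa."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity of every image with every text in the batch.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched, mismatched)
```

The loss is small when each image is closest to its own caption and large when the pairs are swapped, which is the behaviour the training objective rewards.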

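U-Net

[1505.04597] U-Net: Convolutional Networks for Biomedical Image Segmentation (arxiv.org)

U-Net was originally proposed for medical image segmentation. Its structure captures both global and local information in an image effectively. During inference, the U-Net is called repeatedly, and the noise it predicts at each step is removed from the current noisy latent, giving a progressively denoised image representation.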

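VAE

The Variational Auto-Encoder (VAE) is a deep generative model proposed by Kingma et al. in 2014, a generative network architecture based on Variational Bayes (VB) inference. Whereas a conventional autoencoder describes the latent space with deterministic values, a VAE describes it probabilistically, which has proven very useful for data generation.

One point worth emphasizing: in Stable Diffusion the VAE acts somewhat like a filter. The VAE used later in this article, for example, leans toward an anime style.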
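
Two small pieces of the VAE's role can be sketched in plain Python: drawing a latent sample from the encoder's predicted distribution (the reparameterization trick), and the latent size that results from the 8x spatial downsampling and 4 latent channels used in this article. The function names are hypothetical, and the lists stand in for real tensors.

```python
import math
import random


def sample_latent(mean, logvar, rng):
    """Reparameterization trick: z = mean + std * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


def latent_shape(img_height, img_width, channels=4, downsample=8):
    """Shape of the latent for a given image size (height, width, channels)."""
    return (img_height // downsample, img_width // downsample, channels)


rng = random.Random(0)
z = sample_latent(mean=[0.2, -0.1, 0.4], logvar=[-2.0, -2.0, -2.0], rng=rng)
print(z)
print(latent_shape(512, 512))  # latent size for the 512*512 run
print(latent_shape(256, 256))  # latent size for the 256*256 run
```

Working in this much smaller latent space is what makes the diffusion process affordable compared with operating on pixels directly.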

DDPM/DDIM/PLMS

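These are the sampling algorithms used for denoising in Stable Diffusion. During image generation, noise is removed from the image latent step by step over a specified number of iterations, and these algorithms determine how the noise is removed at each iteration.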
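
A sampler does not need to visit every training timestep. The sketch below shows the evenly spaced subsampling that the export code later in this article uses (`tf.range(1, 1000, 1000 // num_steps)`), written with plain Python ranges; the helper name `sampling_timesteps` is made up for this example.

```python
def sampling_timesteps(num_steps, num_train_steps=1000):
    """Evenly spaced subset of the training timesteps, ordered from noisiest to cleanest."""
    stride = num_train_steps // num_steps
    timesteps = list(range(1, num_train_steps, stride))
    return timesteps[::-1]


timesteps = sampling_timesteps(50)
print(len(timesteps), timesteps[:3], timesteps[-3:])
```

With 50 sampling steps the stride is 20, so the sampler jumps through the 1000 training timesteps in 50 iterations instead of 1000.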

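3. Fine-tuning in practice

Source code repository

3.1 Environment setup

Because a GPU is required, the environment needs some care. The versions used here are CUDA 11.2, cuDNN 8.1, tensorflow 2.10.0, keras-cv 0.3.5, tensorflow-datasets 4.8.0, pandas 1.3.5, numpy 1.21.6 and matplotlib 3.5.3.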
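
A quick way to compare an environment against this list is to query the installed package versions. The sketch below assumes Python 3.8 or later (for `importlib.metadata`) and assumes that the distribution names match the package names written above; CUDA and cuDNN are not Python packages and are not covered by it.

```python
from importlib import metadata

# Versions listed above; the distribution names are an assumption.
PINNED = {
    "tensorflow": "2.10.0",
    "keras-cv": "0.3.5",
    "tensorflow-datasets": "4.8.0",
    "pandas": "1.3.5",
    "numpy": "1.21.6",
    "matplotlib": "3.5.3",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {name: (wanted, installed_or_None, matches)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch is not necessarily fatal, but the combination above is the one the rest of this article was run with.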

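Install any remaining dependencies as prompted by errors at run time.

3.2 Data preparation

The fine-tuning data is the Pokémon dataset from Hugging Face. The original training code downloads the dataset from Hugging Face by default; to avoid network problems, download the dataset in advance and then change the datasets.py script to point at the local copy.

In other words, once fine-tuning is complete, the images generated by Stable Diffusion are expected to have a Pokémon style.
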
# data_path = tf.keras.utils.get_file(
#     origin=DEFAULT_DATA_ARCHIVE,
#     untar=True,
# )
data_path = r'D:/PythonProject/尝试/fine-tune-sd/pokemon_dataset'

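3.3 Preparing the pretrained model weights

As with the data, download the weights of the VAE, the CLIP text encoder, the CLIP text encoder vocabulary and the Stable Diffusion diffusion model from Hugging Face in advance, then modify the datasets.py and finetune.py scripts.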

self.tokenizer = SimpleTokenizer(bpe_path=r'D:\PythonProject\尝试\pre-train\text\bpe_simple_vocab_16e6.txt.gz')
self.text_encoder = TextEncoder(MAX_PROMPT_LENGTH, download_weights=False)
self.text_encoder.load_weights(r'D:\PythonProject\尝试\pre-train\text-encoder\kcv_encoder.h5')

parser.add_argument(
    "--pretrained_ckpt",
    default=r'D:\PythonProject\尝试\pre-train\sd\kcv_diffusion_model.h5',
    type=str,
    help="Provide a local path to a diffusion model checkpoint in the `h5`"
    " format if you want to start over fine-tuning from this checkpoint.",
)

image_encoder = ImageEncoder(args.img_height, args.img_width, download_weights=False)
image_encoder.load_weights(r'D:\PythonProject\尝试\pre-train\vae-encoder\vae_encoder.h5')
diffusion_model_tmp = DiffusionModel(
    args.img_height, args.img_width, MAX_PROMPT_LENGTH, download_weights=False
)
diffusion_model_tmp.load_weights(r'D:\PythonProject\尝试\pre-train\sd\kcv_diffusion_model.h5')

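3.4 Training

Fine-tuning here only targets the U-Net of Stable Diffusion; the VAE and CLIP stay fixed. The method is the simplest one: all U-Net parameters are fine-tuned on a new set of data so that the model learns the style of the new data.
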
# 256*256
python finetune.py --batch_size 4 --num_epochs 577

# 512*512
python finetune.py --img_height 512 --img_width 512 --batch_size 1 --num_epochs 72 --mp
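
After training finishes you will have two model weight files. The tests below use the result of the 512*512 fine-tuning run.

3.5 Comparison test

The original project goes on to demonstrate end-to-end deployment with tf-serving. That is not demonstrated here; once a model can be saved in the tf (SavedModel) format, it can be deployed with tf-serving.

First load the original weights, without fine-tuning, generate some images and look at the style.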

import tensorflow as tf
import time
import base64
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt
from keras_cv.models.stable_diffusion.text_encoder import TextEncoder
from keras_cv.models.stable_diffusion.diffusion_model import DiffusionModel
from keras_cv.models.stable_diffusion.decoder import Decoder
from keras_cv.models.stable_diffusion.constants import _ALPHAS_CUMPROD
from tensorflow.python.saved_model import tag_constants
from keras_cv.models.stable_diffusion.clip_tokenizer import SimpleTokenizer
from keras_cv.models.stable_diffusion.constants import _UNCONDITIONAL_TOKENS

MAX_PROMPT_LENGTH = 77
IMG_HEIGHT = 512
IMG_WIDTH = 512

# As above, load the model weights that were downloaded in advance
text_encoder = TextEncoder(MAX_PROMPT_LENGTH, download_weights=False)
text_encoder.load_weights('./pre-train/text-encoder/kcv_encoder.h5')
diffusion_model = DiffusionModel(IMG_HEIGHT, IMG_WIDTH, MAX_PROMPT_LENGTH, download_weights=False)
diffusion_model.load_weights('./pre-train/sd/kcv_diffusion_model.h5')
decoder = Decoder(IMG_HEIGHT, IMG_WIDTH, download_weights=False)
decoder.load_weights('./pre-train/decoder/kcv_decoder.h5')

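Saving the main Stable Diffusion model

# A technique well worth learning here: by defining a custom function signature, pre-processing or post-processing steps can be packaged into the model, and the saved model can then be deployed directly with tf-serving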

signature_dict = {
    "context": tf.TensorSpec(shape=[None, 77, 768], dtype=tf.float32, name="context"),
    "unconditional_context": tf.TensorSpec(
        shape=[None, 77, 768], dtype=tf.float32, name="unconditional_context"
    ),
    "num_steps": tf.TensorSpec(shape=[], dtype=tf.int32, name="num_steps"),
    "batch_size": tf.TensorSpec(shape=[], dtype=tf.int32, name="batch_size"),
}


def diffusion_model_exporter(model: tf.keras.Model):
    IMG_HEIGHT = 512
    IMG_WIDTH = 512
    MAX_PROMPT_LENGTH = 77
    _ALPHAS_CUMPROD_tf = tf.constant(_ALPHAS_CUMPROD)
    UNCONDITIONAL_GUIDANCE_SCALE = 7.5
    SEED = None

    @tf.function
    def get_timestep_embedding(timestep, batch_size, dim=320, max_period=10000):
        half = dim // 2
        log_max_period = tf.math.log(tf.cast(max_period, tf.float32))
        freqs = tf.math.exp(
            -log_max_period * tf.range(0, half, dtype=tf.float32) / half
        )
        args = tf.convert_to_tensor([timestep], dtype=tf.float32) * freqs
        embedding = tf.concat([tf.math.cos(args), tf.math.sin(args)], 0)
        embedding = tf.reshape(embedding, [1, -1])
        return tf.repeat(embedding, batch_size, axis=0)

    @tf.function(input_signature=[signature_dict])
    def serving_fn(inputs):
        img_height = tf.cast(tf.math.round(IMG_HEIGHT / 128) * 128, tf.int32)
        img_width = tf.cast(tf.math.round(IMG_WIDTH / 128) * 128, tf.int32)

        batch_size = inputs["batch_size"]
        num_steps = inputs["num_steps"]

        context = inputs["context"]
        unconditional_context = inputs["unconditional_context"]

        latent = tf.random.normal((batch_size, img_height // 8, img_width // 8, 4))

        timesteps = tf.range(1, 1000, 1000 // num_steps)
        # `dtype` is deprecated in tf.map_fn; `fn_output_signature` is its replacement.
        alphas = tf.map_fn(
            lambda t: _ALPHAS_CUMPROD_tf[t], timesteps, fn_output_signature=tf.float32
        )
        alphas_prev = tf.concat([[1.0], alphas[:-1]], 0)

        index = num_steps - 1
        latent_prev = None
        for timestep in timesteps[::-1]:
            latent_prev = latent
            t_emb = get_timestep_embedding(timestep, batch_size)
            unconditional_latent = model(
                [latent, t_emb, unconditional_context], training=False
            )
            latent = model([latent, t_emb, context], training=False)
            latent = unconditional_latent + UNCONDITIONAL_GUIDANCE_SCALE * (
                latent - unconditional_latent
            )
            a_t, a_prev = alphas[index], alphas_prev[index]
            pred_x0 = (latent_prev - tf.math.sqrt(1 - a_t) * latent) / tf.math.sqrt(a_t)
            latent = (
                latent * tf.math.sqrt(1.0 - a_prev) + tf.math.sqrt(a_prev) * pred_x0
            )
            index = index - 1

        return {"latent": latent}

    return serving_fn

tf.saved_model.save(
    diffusion_model,
    "./diffusion_model/1/",
    signatures={"serving_default": diffusion_model_exporter(diffusion_model)},
)

!saved_model_cli show --dir diffusion_model/1/ --tag_set serve --signature_def serving_default

"""
The given SavedModel SignatureDef contains the following input(s):
  inputs['batch_size'] tensor_info:
      dtype: DT_INT32
      shape: ()
      name: serving_default_batch_size:0
  inputs['context'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 77, 768)
      name: serving_default_context:0
  inputs['num_steps'] tensor_info:
      dtype: DT_INT32
      shape: ()
      name: serving_default_num_steps:0
  inputs['unconditional_context'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 77, 768)
      name: serving_default_unconditional_context:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['latent'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 64, 64, 4)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
"""

saved_model_loaded = tf.saved_model.load(
    "./diffusion_model/1/", tags=[tag_constants.SERVING]
)
predict_fn = saved_model_loaded.signatures["serving_default"]

Handle the VAE Decoder and the CLIP text Encoder in the same way

signature_dict = {
    "latent": tf.TensorSpec(shape=[None, 64, 64, 4], dtype=tf.float32, name="latent"),
}


def decoder_exporter(model: tf.keras.Model):
    @tf.function(input_signature=[signature_dict])
    def serving_fn(inputs):
        latent = inputs["latent"]
        decoded = model(latent, training=False)
        decoded = ((decoded + 1) / 2) * 255
        images = tf.clip_by_value(decoded, 0, 255)
        images = tf.cast(images, tf.uint8)
        return {"generated_images": images}

    return serving_fn

tf.saved_model.save(
    decoder,
    "./decoder/1/",
    signatures={"serving_default": decoder_exporter(decoder)},
)

saved_model_loaded = tf.saved_model.load("./decoder/1/", tags=[tag_constants.SERVING])
decoder_predict_fn = saved_model_loaded.signatures["serving_default"]

signature_dict = {
    "tokens": tf.TensorSpec(shape=[None, 77], dtype=tf.int32, name="tokens"),
    "batch_size": tf.TensorSpec(shape=[], dtype=tf.int32, name="batch_size"),
}


def text_encoder_exporter(model: tf.keras.Model):
    MAX_PROMPT_LENGTH = 77
    POS_IDS = tf.convert_to_tensor([list(range(MAX_PROMPT_LENGTH))], dtype=tf.int32)
    UNCONDITIONAL_TOKENS = tf.convert_to_tensor([_UNCONDITIONAL_TOKENS], dtype=tf.int32)

    @tf.function(input_signature=[signature_dict])
    def serving_fn(inputs):
        batch_size = inputs["batch_size"]

        # context
        encoded_text = model([inputs["tokens"], POS_IDS], training=False)
        encoded_text = tf.squeeze(encoded_text)

        if tf.rank(encoded_text) == 2:
            encoded_text = tf.repeat(
                tf.expand_dims(encoded_text, axis=0), batch_size, axis=0
            )

        # unconditional context
        unconditional_context = model([UNCONDITIONAL_TOKENS, POS_IDS], training=False)

        unconditional_context = tf.repeat(unconditional_context, batch_size, axis=0)
        return {"context": encoded_text, "unconditional_context": unconditional_context}

    return serving_fn

tf.saved_model.save(
    text_encoder,
    "./text_encoder/1/",
    signatures={"serving_default": text_encoder_exporter(text_encoder)},
)

saved_model_loaded = tf.saved_model.load(
    "./text_encoder/1/", tags=[tag_constants.SERVING]
)
text_encoder_predict_fn = saved_model_loaded.signatures["serving_default"]

tokenizer = SimpleTokenizer(bpe_path='./pre-train/text/bpe_simple_vocab_16e6.txt.gz')
padding_token = 49407

# Plot the images
def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")

prompt = "Yoda"
tokens = tokenizer.encode(prompt)

padding_token = 49407

tokens = tokens + [padding_token] * (MAX_PROMPT_LENGTH - len(tokens))
tokens = tf.convert_to_tensor([tokens], dtype=tf.int32)
tokens.shape

batch_size = tf.constant(4)  # Denotes how many images to generate.

encoded_text = text_encoder_predict_fn(
    tokens=tokens,
    batch_size=batch_size,
)

num_steps = 50

latents = predict_fn(
    batch_size=batch_size,
    context=encoded_text["context"],
    num_steps=tf.convert_to_tensor(num_steps),
    unconditional_context=encoded_text["unconditional_context"],
)

decoded_images = decoder_predict_fn(latent=latents["latent"])

plot_images(decoded_images["generated_images"].numpy())

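Yoda is a character from Star Wars. The Yoda drawn by the original model matches the original character fairly well, but a look at the details shows that the SD v1.4 model is still weak and falls short of commercial quality. For more impressive results, it is better to draw on the fine-tuned checkpoints and VAE models shared by the community; see Civitai: https://civitai.com/

Next, try a person: prompt = "A rag picking grandpa". The overall result matches the prompt, but the details are much weaker.

Next, try a cartoon: prompt = "An image of a squirrel in Picasso style". The cartoon result is quite good, so the current base model and VAE seem relatively good at cartoon-style images.

Now look at the results of the fine-tuned model

# Load the fine-tuned weights in place of the original weights; all other code is unchanged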
diffusion_model = DiffusionModel(IMG_HEIGHT, IMG_WIDTH, MAX_PROMPT_LENGTH, download_weights=False)
diffusion_model.load_weights(r'D:/PythonProject/尝试/fine-tune-sd/stable-diffusion-keras-ft-main/stable-diffusion-keras-ft-main/ckpt_epochs_72_res_512_mp_False.h5')

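Look at the same Yoda prompt, prompt = "Yoda". For the same prompt, Yoda now has a strong Pokémon style, although much detail is missing. This is enough to confirm that fine-tuning has had an effect and that the whole pipeline works. Reaching commercial quality would probably require improvements on several fronts, such as the data and the fine-tuning method (LoRA, for example).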
