自定义Graph Component：1-开发指南-526互联

可以使用自定义NLU组件和策略扩展Rasa，本文提供了如何开发自己的自定义Graph Component指南。
Rasa提供各种开箱即用的NLU组件和策略。可以使用自定义Graph Component对其进行自定义或从头开始创建自己的组件。
要在Rasa中使用自定义Graph Component，它必须满足以下要求：

它必须实现GraphComponent接口
必须注册使用过的model配置
必须在配置文件中使用它
它必须使用类型注释。Rasa利用类型注释来验证模型配置。不允许前向引用。如果使用Python 3.7，可以使用from __future__ import annotations来摆脱前向引用。

一.Graph Components
Rasa使用传入的模型配置（config.yml）来构建DAG，描述了config.yml中Component间的依赖关系以及数据如何在它们之间流动。这有两个主要好处：

Rasa可以使用计算图来优化模型的执行。这方面的例子包括训练步骤的高效缓存或并行执行独立的步骤。
Rasa可以灵活地表示不同的模型架构。只要图保持非循环，Rasa理论上可以根据模型配置将任何数据传递给任何图组件，而无需将底层软件架构与使用的模型架构绑定。

将config.yml转换为计算图时，Policy和NLU组件成为该图中的节点。虽然模型配置中的Policy和NLU组件之间存在区别，但当它们被放置在图中时，这种区别就被抽象出来了。此时，Policy和NLU组件成为抽象图组件。在实践中，这由GraphComponent接口表示：Policy和NLU组件都必须继承此接口，才能与Rasa的图兼容并可执行。

二.入门指南
在开始之前，必须决定是实现自定义NLU组件还是Policy。如果正在实现自定义策略，那么建议扩展现有的rasa.core.policies.policy.Policy类，该类已经实现了GraphComponent接口。如下所示：

from rasa.core.policies.policy import Policy
from rasa.engine.recipes.default_recipe import DefaultV1Recipe

# TODO: Correctly register your graph component
@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.POLICY_WITHOUT_END_TO_END_SUPPORT], is_trainable=True
)
class MyPolicy(Policy):
    ...

如果要实现自定义NLU组件，要从以下框架开始：

from typing import Dict, Text, Any, List

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData

# TODO: Correctly register your component with its type
@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
)
class CustomNLUComponent(GraphComponent):
    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        # TODO: Implement this
        ...

    def train(self, training_data: TrainingData) -> Resource:
        # TODO: Implement this if your component requires training
        ...

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        # TODO: Implement this if your component augments the training data with
        #       tokens or message features which are used by other components
        #       during training.
        ...

        return training_data

    def process(self, messages: List[Message]) -> List[Message]:
        # TODO: This is the method which Rasa Open Source will call during inference.
        ...
        return messages

下面会介绍如何解决上述示例中的TODO，以及需要在自定义组件中实现的其它方法。
自定义词法分析器：如果创建了一个自定义的tokenizer，应该扩展rasa.nlu.tokenizers.tokenizer. Tokenizer类。train和process方法已经实现，所以只需要覆盖tokenize方法。

三.GraphComponent接口
要使用Rasa运行自定义NLU组件或Policy，必须实现GraphComponent接口。如下所示：

from __future__ import annotations
from abc import ABC, abstractmethod
from typing import List, Type, Dict, Text, Any, Optional

from rasa.engine.graph import ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage


class GraphComponent(ABC):
    """Interface for any component which will run in a graph."""

    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return []

    @classmethod
    @abstractmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new `GraphComponent`.

        Args:
            config: This config overrides the `default_config`.
            model_storage: Storage which graph components can use to persist and load
                themselves.
            resource: Resource locator for this component which can be used to persist
                and load itself from the `model_storage`.
            execution_context: Information about the current graph run.

        Returns: An instantiated `GraphComponent`.
        """
        ...

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any,
    ) -> GraphComponent:
        """Creates a component using a persisted version of itself.

        If not overridden this method merely calls `create`.

        Args:
            config: The config for this graph component. This is the default config of
                the component merged with config specified by the user.
            model_storage: Storage which graph components can use to persist and load
                themselves.
            resource: Resource locator for this component which can be used to persist
                and load itself from the `model_storage`.
            execution_context: Information about the current graph run.
            kwargs: Output values from previous nodes might be passed in as `kwargs`.

        Returns:
            An instantiated, loaded `GraphComponent`.
        """
        return cls.create(config, model_storage, resource, execution_context)

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config.

        Default config and user config are merged by the `GraphNode` before the
        config is passed to the `create` and `load` method of the component.

        Returns:
            The default config of the component.
        """
        return {}

    @staticmethod
    def supported_languages() -> Optional[List[Text]]:
        """Determines which languages this component can work with.

        Returns: A list of supported languages, or `None` to signify all are supported.
        """
        return None

    @staticmethod
    def not_supported_languages() -> Optional[List[Text]]:
        """Determines which languages this component cannot work with.

        Returns: A list of not supported languages, or
            `None` to signify all are supported.
        """
        return None

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return []

    @classmethod
    def fingerprint_addon(cls, config: Dict[str, Any]) -> Optional[str]:
        """Adds additional data to the fingerprint calculation.

        This is useful if a component uses external data that is not provided
        by the graph.
        """
        return None

1.create方法
create方法用于在训练期间实例化图组件，并且必须被覆盖。Rasa在调用该方法时传递以下参数：
（1）config：这是组件的默认配置，与模型配置文件中提供给图组件的配置合并。
（2）model_storage：可以使用此功能来持久化和加载图组件。有关其用法的更多详细信息，请参阅模型持久化部分。
（3）resource：模型存储中组件的唯一标识符。有关其用法的更多详细信息，请参阅模型持久性部分。
（4）execution_context：提供有关当前执行模式的额外信息：

model_id: 推理过程中使用的模型的唯一标识符。在训练过程中，此参数为None。
should_add_diagnostic_data：如果为True，则应在实际预测的基础上向图组件的预测中添加额外的诊断元数据。
is_finetuning：如果为True，则可以使用微调来训练图组件。
graph_schema：graph_schema描述用于训练助手或用它进行预测的计算图。
node_name：node_name是图模式中步骤的唯一标识符，由所调用的图组件完成。

2.load方法
在推理过程中，使用load方法来实例化图组件。此方法的默认实现会调用create方法。如果图组件将数据作为训练的一部分，建议覆盖此方法。有关各个参数的描述，参阅create方法。

3.get_default_config方法
get_default_config方法返回图组件的默认配置。它的默认实现返回一个空字典，这意味着图组件没有任何配置。Rasa将在运行时使用配置文件（config.yml）中的给定值更新默认配置。

4.supported_languages方法
supported_languages方法指定了图组件支持的语言。Rasa将使用模型配置文件中的语言键来验证图组件是否可用于指定的语言。如果图组件返回None（这是默认实现），则表示图组件支持not_supported_languages中未包含的所有语言。示例如下所示：

[]：图组件不支持任何语言
None：支持所有语言，但不支持not_supported_languages中定义的语言
["en"]：图组件只能用于英语对话

5.not_supported_languages方法
not_supported_languages方法指定图组件不支持哪些语言。Rasa将使用模型配置文件中的语言键来验证图组件是否可用于指定的语言。如果图组件返回None（这是默认实现），则表示它支持supported_languages中指定的所有语言。示例如下所示：

无或[]：支持supported_languages中指定的所有语言。
["en"]：该图形组件可用于除英语以外的任何语言。

6.required_packages方法
required_packages方法表明需要安装哪些额外的Python包才能使用此图组件。如果在运行时找不到所需的库，Rasa将在执行过程中抛出错误。默认情况下，此方法返回一个空列表，这意味着图组件没有任何额外的依赖关系。示例如下所示：

[]：使用此图组件不需要额外的包
["spacy"]：需要安装Python包spacy才能使用此图组件。

四.模型持久化
一些图组件需要在训练期间持久化数据，这些数据在推理时应该对图组件可用。一个典型的用例是存储模型权重。为此，Rasa为图组件的create和load方法提供了model_storage和resource参数，如下面的代码片段所示。model_storage提供对所有图组件数据的访问。resource允许唯一标识图组件在模型存储中的位置。

from __future__ import annotations

from typing import Any, Dict, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage

class MyComponent(GraphComponent):
    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> MyComponent:
        ...

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any
    ) -> MyComponent:
        ...

1.写模型存储
下面的代码片段演示了如何将图组件的数据写入模型存储。要在训练后持久化图组件，train方法需要访问model_storage和resource的值。因此，应该在初始化时存储model_storage和resource的值。图组件的train方法必须返回resource的值，以便Rasa可以在训练之间缓存训练结果。self._model_storage.write_to(self._resource)上下文管理器提供了一个目录路径，可以在其中持久化图组件所需的任何数据。

from __future__ import annotations
import json
from typing import Optional, Dict, Any, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.training_data import TrainingData

class MyComponent(GraphComponent):

    def __init__(
        self,
        model_storage: ModelStorage,
        resource: Resource,
        training_artifact: Optional[Dict],
    ) -> None:
        # Store both `model_storage` and `resource` as object attributes to be able
        # to utilize them at the end of the training
        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> MyComponent:
        return cls(model_storage, resource, training_artifact=None)

    def train(self, training_data: TrainingData) -> Resource:
        # Train your graph component
        ...

        # Persist your graph component
        with self._model_storage.write_to(self._resource) as directory_path:
            with open(directory_path / "artifact.json", "w") as file:
                json.dump({"my": "training artifact"}, file)

        # Return resource to make sure the training artifacts
        # can be cached.
        return self._resource

2.读模型存储
Rasa将调用图组件的load方法来实例化它以进行推理。可以使用上下文管理器self._model_storage.read_from(resource)来获取图组件数据所保存的目录的路径。使用提供的路径，可以加载保存的数据并用它初始化图组件。请注意，如果给定的资源没有找到保存的数据，model_storage将抛出ValueError。

from __future__ import annotations
import json
from typing import Optional, Dict, Any, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage

class MyComponent(GraphComponent):

    def __init__(
        self,
        model_storage: ModelStorage,
        resource: Resource,
        training_artifact: Optional[Dict],
    ) -> None:
        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any,
    ) -> MyComponent:
        try:
            with model_storage.read_from(resource) as directory_path:
                with open(directory_path / "artifact.json", "r") as file:
                    training_artifact = json.load(file)
                    return cls(
                        model_storage, resource, training_artifact=training_artifact
                    )
        except ValueError:
            # This allows you to handle the case if there was no
            # persisted data for your component
            ...

五.用模型配置注册Graph Components
为了让图组件可用于Rasa，可能需要使用recipe注册图组件。Rasa使用recipe将模型配置的内容转换为可执行的graph。目前，Rasa支持default.v1和实验性graph.v1 recipe。对于default.v1 recipe，需要使用DefaultV1Recipe.register装饰器注册图组件：

from rasa.engine.graph import GraphComponent
from rasa.engine.recipes.default_recipe import DefaultV1Recipe

@DefaultV1Recipe.register(
    component_types=[DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER],
    is_trainable=True,
    model_from="SpacyNLP",
)
class MyComponent(GraphComponent):
    ...

Rasa使用register装饰器中提供的信息以及图组件在配置文件中的位置来调度图组件及其所需数据的执行。DefaultV1Recipe.register装饰器允许指定以下详细信息：
1.component_types
指定了图组件在助手内实现的目的。可以指定多种类型（例如，如果图组件既是意图分类器又是实体提取器）。
（1）ComponentType.MODEL_LOADER
语言模型的组件类型。如果指定了model_from=，则此类型的图组件为其它图组件的train、process_training_data和process方法提供预训练模型。这个图组件在训练和推理期间运行。Rasa将使用此图组件的provide方法检索应提供给依赖项图组件的模型。
（2）ComponentType.MESSAGE_TOKENIZER
分词器的组件类型。如果指定了is_trainable=True，则此类型的图形组件在训练和推理期间运行。Rasa将使用此图形组件的train方法进行训练。Rasa将使用 process_training_data进行训练数据示例的分词，并在推理期间使用process进行消息的分词。
（3）ComponentType.MESSAGE_FEATURIZER
特征提取器的组件类型。如果指定了is_trainable=True，则此类型的图组件在训练和推理期间运行。Rasa将使用此图组件的train方法进行训练。Rasa将使用 process_training_data进行训练数据示例的特征提取，并在推理期间使用process进行消息的特征提取。
（4）ComponentType.INTENT_CLASSIFIER 意图分类器的组件类型。如果指定了is_trainable=True，则此类型的图组件仅在训练期间运行。此组件在推理期间始终运行。如果指定了is_trainable=True，Rasa将使用此图形组件的train方法进行训练。Rasa将使用此图组件的process方法在推理期间对消息的意图进行分类。
（5）ComponentType.ENTITY_EXTRACTOR
实体提取器的组件类型。如果指定了is_trainable=True，则此类型的图组件仅在训练期间运行。此组件在推理期间始终运行。如果指定了is_trainable=True，Rasa将使用此图组件的train方法进行训练。Rasa将使用此图组件的process方法在推理期间提取实体。
（6）ComponentType.POLICY_WITHOUT_END_TO_END_SUPPORT
不需要其它端到端功能的策略的组件类型（有关更多信息，请参阅end-to-end training）。如果指定了is_trainable=True，则此类型的图组件仅在训练期间运行。此组件在推理期间始终运行。如果指定了is_trainable=True，Rasa将使用此图组件的train方法进行训练。Rasa将使用此图组件的predict_action_probabilities来预测在对话中应运行的下一个动作。
（7）ComponentType.POLICY_WITH_END_TO_END_SUPPORT
需要其它端到端功能（请参阅end-to-end training以获取更多信息）的策略的组件类型。端到端功能将作为预计算参数传递到图组件的train和predict_action_probabilities中。如果指定了is_trainable=True，则此类型的图组件仅在训练期间运行。此组件在推理期间始终运行。如果指定了is_trainable=True，Rasa将使用此图组件的train方法进行训练。Rasa将使用此图组件的predict_action_probabilities来预测在对话中应运行的下一个动作。
2.is_trainable
指定在处理其它依赖图组件的训练数据之前，或者在可以进行预测之前，是否需要训练图组件本身。
3.model_from
指定是否需要向图组件的train、process_training_data和process方法提供预训练语言模型。这些方法必须支持参数模型以接收语言模型。请注意，仍然需要确保提供此模型的图组件是模型配置的一部分。一个常见的用例是，如果想将SpacyNLP语言模型暴露给其它NLU组件。

六.在模型配置中使用自定义组件
可以在模型配置中使用自定义图组件，就像其它NLU组件或策略一样。唯一的变化是，必须指定完整的模块名称，而不是仅指定类名。完整的模块名称取决于模块相对于指定的PYTHONPATH的位置。默认情况下，Rasa会将运行CLI的目录添加到PYTHONPATH。例如，如果从/Users/<user>/my-rasa-project运行CLI，并且模块MyComponent在/Users/<user>/my-rasa-project/custom_components/my_component.py 中，则模块路径为custom_components.my_component.MyComponent。除了name条目之外，所有内容都将作为config传递给组件。config.yml文件如下所示：

recipe: default.v1
language: en
pipeline:
# other NLU components
- name: your.custom.NLUComponent
  setting_a: 0.01
  setting_b: string_value

policies:
# other dialogue policies
- name: your.custom.Policy

七.实现提示
1.消息元数据
当在训练数据中为意图示例定义元数据时，NLU组件可以在处理过程中访问意图元数据和意图示例元数据，如下所示：

# in your component class
def process(self, message: Message, **kwargs: Any) -> None:
    metadata = message.get("metadata")
    print(metadata.get("intent"))
    print(metadata.get("example"))

2.稀疏和稠密消息特征
如果创建了一个自定义的消息特征器，可以返回两种不同的特征：序列特征和句子特征。序列特征是一个大小为(number-of-tokens x feature-dimension)的矩阵，即该矩阵包含序列中每个token的特征向量。句子特征由大小为(1 x feature-dimension)的矩阵表示。

八.自定义组件的例子
1.稠密消息特征器
使用预训练模型的一个dense message featurizer的例子，如下所示：

import numpy as np
import logging
from bpemb import BPEmb
from typing import Any, Text, Dict, List, Type

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.graph import ExecutionContext, GraphComponent
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.nlu.featurizers.dense_featurizer.dense_featurizer import DenseFeaturizer
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.constants import (
    DENSE_FEATURIZABLE_ATTRIBUTES,
    FEATURIZER_CLASS_ALIAS,
)
from rasa.shared.nlu.constants import (
    TEXT,
    TEXT_TOKENS,
    FEATURE_TYPE_SENTENCE,
    FEATURE_TYPE_SEQUENCE,
)


logger = logging.getLogger(__name__)


@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=False
)
class BytePairFeaturizer(DenseFeaturizer, GraphComponent):
    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return [Tokenizer]

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return ["bpemb"]

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config."""
        return {
            **DenseFeaturizer.get_default_config(),
            # specifies the language of the subword segmentation model
            "lang": None,
            # specifies the dimension of the subword embeddings
            "dim": None,
            # specifies the vocabulary size of the segmentation model
            "vs": None,
            # if set to True and the given vocabulary size can't be loaded for the given
            # model, the closest size is chosen
            "vs_fallback": True,
        }

    def __init__(
        self,
        config: Dict[Text, Any],
        name: Text,
    ) -> None:
        """Constructs a new byte pair vectorizer."""
        super().__init__(name, config)
        # The configuration dictionary is saved in `self._config` for reference.
        self.model = BPEmb(
            lang=self._config["lang"],
            dim=self._config["dim"],
            vs=self._config["vs"],
            vs_fallback=self._config["vs_fallback"],
        )

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new component (see parent class for full docstring)."""
        return cls(config, execution_context.node_name)

    def process(self, messages: List[Message]) -> List[Message]:
        """Processes incoming messages and computes and sets features."""
        for message in messages:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self._set_features(message, attribute)
        return messages

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        """Processes the training examples in the given training data in-place."""
        self.process(training_data.training_examples)
        return training_data

    def _create_word_vector(self, document: Text) -> np.ndarray:
        """Creates a word vector from a text. Utility method."""
        encoded_ids = self.model.encode_ids(document)
        if encoded_ids:
            return self.model.vectors[encoded_ids[0]]

        return np.zeros((self.component_config["dim"],), dtype=np.float32)

    def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
        """Sets the features on a single message. Utility method."""
        tokens = message.get(TEXT_TOKENS)

        # If the message doesn't have tokens, we can't create features.
        if not tokens:
            return None

        # We need to reshape here such that the shape is equivalent to that of sparsely
        # generated features. Without it, it'd be a 1D tensor. We need 2D (n_utterance, n_dim).
        text_vector = self._create_word_vector(document=message.get(TEXT)).reshape(
            1, -1
        )
        word_vectors = np.array(
            [self._create_word_vector(document=t.text) for t in tokens]
        )

        final_sequence_features = Features(
            word_vectors,
            FEATURE_TYPE_SEQUENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sequence_features)
        final_sentence_features = Features(
            text_vector,
            FEATURE_TYPE_SENTENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sentence_features)

    @classmethod
    def validate_config(cls, config: Dict[Text, Any]) -> None:
        """Validates that the component is configured properly."""
        if not config["lang"]:
            raise ValueError("BytePairFeaturizer needs language setting via `lang`.")
        if not config["dim"]:
            raise ValueError(
                "BytePairFeaturizer needs dimensionality setting via `dim`."
            )
        if not config["vs"]:
            raise ValueError("BytePairFeaturizer needs a vector size setting via `vs`.")

2.稀疏消息特征器
以下是稀疏消息特征器的示例，它训练了一个新模型：

import logging
from typing import Any, Text, Dict, List, Type

from sklearn.feature_extraction.text import TfidfVectorizer
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.graph import ExecutionContext, GraphComponent
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.nlu.featurizers.sparse_featurizer.sparse_featurizer import SparseFeaturizer
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.constants import (
    DENSE_FEATURIZABLE_ATTRIBUTES,
    FEATURIZER_CLASS_ALIAS,
)
from joblib import dump, load
from rasa.shared.nlu.constants import (
    TEXT,
    TEXT_TOKENS,
    FEATURE_TYPE_SENTENCE,
    FEATURE_TYPE_SEQUENCE,
)

logger = logging.getLogger(__name__)


@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=True
)
class TfIdfFeaturizer(SparseFeaturizer, GraphComponent):
    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return [Tokenizer]

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return ["sklearn"]

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config."""
        return {
            **SparseFeaturizer.get_default_config(),
            "analyzer": "word",
            "min_ngram": 1,
            "max_ngram": 1,
        }

    def __init__(
        self,
        config: Dict[Text, Any],
        name: Text,
        model_storage: ModelStorage,
        resource: Resource,
    ) -> None:
        """Constructs a new tf/idf vectorizer using the sklearn framework."""
        super().__init__(name, config)
        # Initialize the tfidf sklearn component
        self.tfm = TfidfVectorizer(
            analyzer=config["analyzer"],
            ngram_range=(config["min_ngram"], config["max_ngram"]),
        )

        # We need to use these later when saving the trained component.
        self._model_storage = model_storage
        self._resource = resource

    def train(self, training_data: TrainingData) -> Resource:
        """Trains the component from training data."""
        texts = [e.get(TEXT) for e in training_data.training_examples if e.get(TEXT)]
        self.tfm.fit(texts)
        self.persist()
        return self._resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new untrained component (see parent class for full docstring)."""
        return cls(config, execution_context.node_name, model_storage, resource)

    def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
        """Sets the features on a single message. Utility method."""
        tokens = message.get(TEXT_TOKENS)

        # If the message doesn't have tokens, we can't create features.
        if not tokens:
            return None

        # Make distinction between sentence and sequence features
        text_vector = self.tfm.transform([message.get(TEXT)])
        word_vectors = self.tfm.transform([t.text for t in tokens])

        final_sequence_features = Features(
            word_vectors,
            FEATURE_TYPE_SEQUENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sequence_features)
        final_sentence_features = Features(
            text_vector,
            FEATURE_TYPE_SENTENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sentence_features)

    def process(self, messages: List[Message]) -> List[Message]:
        """Processes incoming message and compute and set features."""
        for message in messages:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self._set_features(message, attribute)
        return messages

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        """Processes the training examples in the given training data in-place."""
        self.process(training_data.training_examples)
        return training_data

    def persist(self) -> None:
        """
        Persist this model into the passed directory.

        Returns the metadata necessary to load the model again. In this case; `None`.
        """
        with self._model_storage.write_to(self._resource) as model_dir:
            dump(self.tfm, model_dir / "tfidfvectorizer.joblib")

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Loads trained component from disk."""
        try:
            with model_storage.read_from(resource) as model_dir:
                tfidfvectorizer = load(model_dir / "tfidfvectorizer.joblib")
                component = cls(
                    config, execution_context.node_name, model_storage, resource
                )
                component.tfm = tfidfvectorizer
        except (ValueError, FileNotFoundError):
            logger.debug(
                f"Couldn't load metadata for component '{cls.__name__}' as the persisted "
                f"model data couldn't be loaded."
            )
        return component

    @classmethod
    def validate_config(cls, config: Dict[Text, Any]) -> None:
        """Validates that the component is configured properly."""
        pass

九.NLP元学习器
NLU Meta学习器是一个高级用例。以下部分仅适用于拥有一个基于先前分类器输出学习参数的组件的情况。对于具有手动设置参数或逻辑的组件，可以创建一个is_trainable=False的组件，而不用担心前面的分类器。
NLU Meta学习器是意图分类器或实体提取器，它们使用其它经过训练的意图分类器或实体提取器的预测，并尝试改进其结果。Meta学习器的一个例子是平均两个先前意图分类器输出的组件，或者是一个fallback分类器，它根据意图分类器对训练示例的置信度设置阈值。
从概念上讲，要构建可训练的fallback分类器，首先需要将该fallback分类器创建为自定义组件：

from typing import Dict, Text, Any, List

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.nlu.classifiers.fallback_classifier import FallbackClassifier

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
)
class MetaFallback(FallbackClassifier):

    def __init__(
        self,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> None:
        super().__init__(config)

        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> FallbackClassifier:
        """Creates a new untrained component (see parent class for full docstring)."""
        return cls(config, model_storage, resource, execution_context)

    def train(self, training_data: TrainingData) -> Resource:
        # Do something here with the messages
        return self._resource

接下来，需要创建一个定制的意图分类器，它也是一个特征器，因为分类器的输出需要被下游的另一个组件使用。对于定制的意图分类器组件，还需要定义如何将其预测添加到指定process_training_data方法的消息数据中。确保不要覆盖意图的真实标签。这里有一个模板，显示了如何为此目的对DIET进行子类化：

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.nlu.classifiers.diet_classifier import DIETClassifier

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER,
     DefaultV1Recipe.ComponentType.ENTITY_EXTRACTOR,
     DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER], is_trainable=True
)
class DIETFeaturizer(DIETClassifier):

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        # classify and add the attributes to the messages on the training data
        return training_data

参考文献：
[1]Custom Graph Components：https://rasa.com/docs/rasa/custom-graph-components

component tokenizer graph 1.2

jiebatokenizer component graph 1.1