Custom Graph Components: 1 - Development Guide

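You can extend Rasa with custom NLU components and policies. This article is a guide to developing your own custom graph components.
Rasa provides a variety of NLU components and policies out of the box. With custom graph components you can customize them or build your own components from scratch.
To use a custom graph component with Rasa, it must fulfil the following requirements: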

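It must implement the GraphComponent interface.
It must be registered with the model configuration (recipe) that is used.
It must be used in your config file.
It must use type annotations. Rasa uses the type annotations to validate the model configuration. Forward references are not allowed. If you are using Python 3.7, you can use from __future__ import annotations to get rid of forward references.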

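I. Graph Components
Rasa uses the model configuration that is passed in (config.yml) to build a directed acyclic graph (DAG). The graph describes the dependencies between the components in config.yml and how data flows between them. This has two main benefits: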

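Rasa can use the computational graph to optimize the execution of the model. Examples include efficient caching of training steps and running independent steps in parallel.
Rasa can represent different model architectures flexibly. As long as the graph remains acyclic, Rasa can in theory pass any data to any graph component based on the model configuration, without tying the underlying software architecture to the model architecture in use.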

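When config.yml is translated into the computational graph, policies and NLU components become nodes within this graph. Although the model configuration distinguishes between policies and NLU components, that distinction is abstracted away once they are placed in the graph. At that point, policies and NLU components are simply abstract graph components. In practice, this is represented by the GraphComponent interface: both policies and NLU components have to inherit from this interface to be compatible with, and executable in, Rasa's graph.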
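
To make the idea of a computational graph concrete, the following sketch orders a few pipeline steps with Python's standard graphlib module (Python 3.9 or newer). The node names and the dependency mapping are illustrative assumptions; this is not Rasa's graph schema or scheduler.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each node maps to the nodes whose output it needs.
dependencies: Dict[str, Set[str]] = {
    "tokenizer": set(),
    "featurizer": {"tokenizer"},
    "intent_classifier": {"featurizer"},
    "entity_extractor": {"featurizer"},
    "policy": {"intent_classifier", "entity_extractor"},
}


def execution_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Returns one valid run order; raises graphlib.CycleError for a cyclic graph."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)
```

In this toy graph, intent_classifier and entity_extractor do not depend on each other, which is the kind of independence that makes parallel execution possible.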

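II. Getting Started
Before you start, decide whether you are implementing a custom NLU component or a policy. If you are implementing a custom policy, it is recommended to extend the existing rasa.core.policies.policy.Policy class, which already implements the GraphComponent interface: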

from rasa.core.policies.policy import Policy
from rasa.engine.recipes.default_recipe import DefaultV1Recipe

# TODO: Correctly register your graph component
@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.POLICY_WITHOUT_END_TO_END_SUPPORT], is_trainable=True
)
class MyPolicy(Policy):
    ...

If you want to implement a custom NLU component, start from the following skeleton:

from typing import Dict, Text, Any, List

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData

# TODO: Correctly register your component with its type
@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
)
class CustomNLUComponent(GraphComponent):
    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        # TODO: Implement this
        ...

    def train(self, training_data: TrainingData) -> Resource:
        # TODO: Implement this if your component requires training
        ...

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        # TODO: Implement this if your component augments the training data with
        #       tokens or message features which are used by other components
        #       during training.
        ...

        return training_data

    def process(self, messages: List[Message]) -> List[Message]:
        # TODO: This is the method which Rasa Open Source will call during inference.
        ...
        return messages

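The sections below explain how to resolve the TODOs in the examples above, and which other methods a custom component needs to implement.
Custom tokenizers: if you create a custom tokenizer, extend the rasa.nlu.tokenizers.tokenizer.Tokenizer class. The train and process methods are already implemented, so you only need to override the tokenize method.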

III. The GraphComponent Interface
To run a custom NLU component or policy with Rasa, it must implement the GraphComponent interface, shown below:

from __future__ import annotations
from abc import ABC, abstractmethod
from typing import List, Type, Dict, Text, Any, Optional

from rasa.engine.graph import ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage


class GraphComponent(ABC):
    """Interface for any component which will run in a graph."""

    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return []

    @classmethod
    @abstractmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new `GraphComponent`.

        Args:
            config: This config overrides the `default_config`.
            model_storage: Storage which graph components can use to persist and load
                themselves.
            resource: Resource locator for this component which can be used to persist
                and load itself from the `model_storage`.
            execution_context: Information about the current graph run.

        Returns: An instantiated `GraphComponent`.
        """
        ...

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any,
    ) -> GraphComponent:
        """Creates a component using a persisted version of itself.

        If not overridden this method merely calls `create`.

        Args:
            config: The config for this graph component. This is the default config of
                the component merged with config specified by the user.
            model_storage: Storage which graph components can use to persist and load
                themselves.
            resource: Resource locator for this component which can be used to persist
                and load itself from the `model_storage`.
            execution_context: Information about the current graph run.
            kwargs: Output values from previous nodes might be passed in as `kwargs`.

        Returns:
            An instantiated, loaded `GraphComponent`.
        """
        return cls.create(config, model_storage, resource, execution_context)

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config.

        Default config and user config are merged by the `GraphNode` before the
        config is passed to the `create` and `load` method of the component.

        Returns:
            The default config of the component.
        """
        return {}

    @staticmethod
    def supported_languages() -> Optional[List[Text]]:
        """Determines which languages this component can work with.

        Returns: A list of supported languages, or `None` to signify all are supported.
        """
        return None

    @staticmethod
    def not_supported_languages() -> Optional[List[Text]]:
        """Determines which languages this component cannot work with.

        Returns: A list of not supported languages, or
            `None` to signify all are supported.
        """
        return None

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return []

    @classmethod
    def fingerprint_addon(cls, config: Dict[str, Any]) -> Optional[str]:
        """Adds additional data to the fingerprint calculation.

        This is useful if a component uses external data that is not provided
        by the graph.
        """
        return None

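1. The create method
The create method is used to instantiate the graph component during training and must be overridden. Rasa passes the following arguments when calling it:
(1) config: the component's default configuration, merged with the configuration given to the graph component in the model configuration file.
(2) model_storage: can be used to persist and load the graph component. See the model persistence section for details on its usage.
(3) resource: the unique identifier of the component within the model storage. See the model persistence section for details on its usage.
(4) execution_context: provides additional information about the current mode of execution: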

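model_id: a unique identifier for the model used during inference. This parameter is None during training.
should_add_diagnostic_data: if True, additional diagnostic metadata should be added to the graph component's predictions on top of the actual prediction.
is_finetuning: if True, the graph component can be trained using fine-tuning.
graph_schema: describes the computational graph that is used to train the assistant or to make predictions with it.
node_name: a unique identifier for the step in the graph schema that is fulfilled by the called graph component.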

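2. The load method
The load method is used to instantiate the graph component during inference. Its default implementation calls the create method. It is recommended to override it if the graph component persists data as part of training. See the create method for a description of the individual parameters.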

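3. The get_default_config method
The get_default_config method returns the default configuration of the graph component. Its default implementation returns an empty dictionary, which means the graph component has no configuration. At runtime, Rasa updates the default configuration with the values given in the configuration file (config.yml).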
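
The merge of default and user configuration can be pictured as a plain dictionary update. This is a simplified sketch: merge_config and the configuration keys are hypothetical, and Rasa performs its own merge inside the graph node, which may treat nested values differently.

```python
from typing import Any, Dict


def merge_config(defaults: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a new dict in which user-supplied values override the defaults."""
    return {**defaults, **user}


# Hypothetical defaults and the values given for the component in config.yml.
default_config = {"epochs": 10, "threshold": 0.3}
user_config = {"threshold": 0.7}

merged = merge_config(default_config, user_config)
print(merged)
```

Building a new dictionary instead of updating the defaults in place keeps the defaults reusable for other instances of the same component.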

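4. The supported_languages method
The supported_languages method specifies which languages the graph component supports. Rasa uses the language key in the model configuration file to validate that the graph component can be used with the specified language. If the graph component returns None (the default implementation), it supports every language that is not listed in not_supported_languages. Examples: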

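[]: the graph component supports no language.
None: all languages are supported, except those defined in not_supported_languages.
["en"]: the graph component can only be used for English conversations.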

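5. The not_supported_languages method
The not_supported_languages method specifies which languages the graph component does not support. Rasa uses the language key in the model configuration file to validate that the graph component can be used with the specified language. If the graph component returns None (the default implementation), it supports all languages specified in supported_languages. Examples: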

None or []: all languages specified in supported_languages are supported.
["en"]: the graph component can be used with any language except English.
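
The interplay of the two language methods can be summarized in a small check. The function below is one reading of the rules listed above, written as a standalone sketch; is_language_allowed is a hypothetical helper and not Rasa's validation code.

```python
from typing import List, Optional


def is_language_allowed(
    language: str,
    supported: Optional[List[str]],
    not_supported: Optional[List[str]],
) -> bool:
    """Checks a language against an allow-list and a deny-list."""
    if supported is not None:
        # An explicit allow-list decides: an empty list allows nothing.
        return language in supported
    # No allow-list: everything is allowed except the deny-listed languages.
    return language not in (not_supported or [])


print(is_language_allowed("de", None, ["en"]))
print(is_language_allowed("en", None, ["en"]))
```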

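6. The required_packages method
The required_packages method states which additional Python packages must be installed to use the graph component. Rasa raises an error during execution if a required library cannot be found at runtime. By default the method returns an empty list, which means the graph component has no extra dependencies. Examples: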

[]: no additional packages are required to use this graph component.
["spacy"]: the Python package spacy must be installed to use this graph component.
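
A dependency check of this kind can be approximated with the standard library. The sketch below only illustrates the idea for top-level package names; find_missing_packages is a hypothetical helper, and how Rasa performs its own check is not shown here.

```python
import importlib.util
from typing import List


def find_missing_packages(package_names: List[str]) -> List[str]:
    """Returns the top-level packages that cannot be imported in this environment."""
    return [
        name for name in package_names if importlib.util.find_spec(name) is None
    ]


# "json" ships with Python; the second name is a deliberately non-existent package.
missing = find_missing_packages(["json", "surely_not_an_installed_package_xyz"])
print(missing)
```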

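IV. Model Persistence
Some graph components need to persist data during training that should be available to them at inference time. A typical use case is storing model weights. For this purpose, Rasa passes the model_storage and resource parameters to the create and load methods of graph components, as shown in the snippet below. model_storage provides access to the data of all graph components. resource uniquely identifies the location of a graph component's data within the model storage.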

from __future__ import annotations

from typing import Any, Dict, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage

class MyComponent(GraphComponent):
    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> MyComponent:
        ...

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any
    ) -> MyComponent:
        ...

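1. Writing to the model storage
The snippet below shows how to write a graph component's data to the model storage. To persist the graph component after training, the train method needs access to model_storage and resource, so store both values at initialization time. The train method must return the resource so that Rasa can cache the training result between trainings. The self._model_storage.write_to(self._resource) context manager provides the path to a directory in which any data the graph component needs can be persisted.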

from __future__ import annotations
import json
from typing import Optional, Dict, Any, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.training_data import TrainingData

class MyComponent(GraphComponent):

    def __init__(
        self,
        model_storage: ModelStorage,
        resource: Resource,
        training_artifact: Optional[Dict],
    ) -> None:
        # Store both `model_storage` and `resource` as object attributes to be able
        # to utilize them at the end of the training
        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> MyComponent:
        return cls(model_storage, resource, training_artifact=None)

    def train(self, training_data: TrainingData) -> Resource:
        # Train your graph component
        ...

        # Persist your graph component
        with self._model_storage.write_to(self._resource) as directory_path:
            with open(directory_path / "artifact.json", "w") as file:
                json.dump({"my": "training artifact"}, file)

        # Return resource to make sure the training artifacts
        # can be cached.
        return self._resource
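
Caching between trainings depends on recognizing that nothing relevant has changed since the last run. The sketch below illustrates that general idea with a content hash over the inputs of a training step. It is an assumption about the concept only: the fingerprint function and its inputs are hypothetical and do not reproduce Rasa's fingerprinting.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any], training_texts: List[str]) -> str:
    """Returns a stable hash of the inputs that determine the training result."""
    payload = json.dumps({"config": config, "data": training_texts}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = fingerprint({"epochs": 10}, ["hello", "bye"])
same = fingerprint({"epochs": 10}, ["hello", "bye"])
changed = fingerprint({"epochs": 20}, ["hello", "bye"])

# Equal fingerprints mean a cached result could be reused; a change forces retraining.
print(first == same, first == changed)
```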

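2. Reading from the model storage
Rasa calls the load method of the graph component to instantiate it for inference. The context manager model_storage.read_from(resource) gives you the path to the directory in which the graph component's data was persisted. Using that path, you can load the persisted data and initialize the graph component with it. Note that model_storage raises a ValueError if no persisted data is found for the given resource.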

from __future__ import annotations
import json
from typing import Optional, Dict, Any, Text

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage

class MyComponent(GraphComponent):

    def __init__(
        self,
        model_storage: ModelStorage,
        resource: Resource,
        training_artifact: Optional[Dict],
    ) -> None:
        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any,
    ) -> MyComponent:
        try:
            with model_storage.read_from(resource) as directory_path:
                with open(directory_path / "artifact.json", "r") as file:
                    training_artifact = json.load(file)
                    return cls(
                        model_storage, resource, training_artifact=training_artifact
                    )
        except ValueError:
            # This allows you to handle the case if there was no
            # persisted data for your component
            ...
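
To try the write and read halves together without installing Rasa, the sketch below uses a hypothetical LocalStorage class that mimics only the write_to and read_from context managers described above. It is a simplified stand-in under that assumption (resources are plain strings here) and not Rasa's ModelStorage.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class LocalStorage:
    """Hypothetical stand-in for a model storage backed by one directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    @contextmanager
    def write_to(self, resource_name: str) -> Iterator[Path]:
        directory = self._root / resource_name
        directory.mkdir(parents=True, exist_ok=True)
        yield directory

    @contextmanager
    def read_from(self, resource_name: str) -> Iterator[Path]:
        directory = self._root / resource_name
        if not directory.is_dir():
            raise ValueError(f"No persisted data for resource '{resource_name}'.")
        yield directory


def round_trip(root: Path) -> dict:
    """Writes an artifact as train would, then reads it back as load would."""
    storage = LocalStorage(root)
    with storage.write_to("my_component") as directory:
        (directory / "artifact.json").write_text(json.dumps({"my": "training artifact"}))
    with storage.read_from("my_component") as directory:
        return json.loads((directory / "artifact.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    restored = round_trip(Path(tmp))
print(restored)
```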

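V. Registering Graph Components with the Model Configuration
To make a graph component available to Rasa, you may have to register it with a recipe. Rasa uses recipes to translate the content of the model configuration into an executable graph. Rasa currently supports the default.v1 recipe and the experimental graph.v1 recipe. For the default.v1 recipe, register the graph component with the DefaultV1Recipe.register decorator: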

from rasa.engine.graph import GraphComponent
from rasa.engine.recipes.default_recipe import DefaultV1Recipe

@DefaultV1Recipe.register(
    component_types=[DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER],
    is_trainable=True,
    model_from="SpacyNLP",
)
class MyComponent(GraphComponent):
    ...

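Rasa uses the information provided in the register decorator, together with the position of the graph component in the configuration file, to schedule the execution of the graph component and the data it requires. The DefaultV1Recipe.register decorator lets you specify the following details:
1. component_types
Specifies which purpose the graph component fulfils within the assistant. Multiple types can be specified (for example, if the graph component is both an intent classifier and an entity extractor).
(1) ComponentType.MODEL_LOADER
Component type for language models. Graph components of this type provide pretrained models to the train, process_training_data and process methods of other graph components that have specified model_from. This graph component runs during training and inference. Rasa uses its provide method to retrieve the model that should be provided to dependent graph components.
(2) ComponentType.MESSAGE_TOKENIZER
Component type for tokenizers. Graph components of this type run during training and inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses process_training_data to tokenize the training data examples, and process to tokenize messages during inference.
(3) ComponentType.MESSAGE_FEATURIZER
Component type for featurizers. Graph components of this type run during training and inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses process_training_data to featurize the training data examples, and process to featurize messages during inference.
(4) ComponentType.INTENT_CLASSIFIER
Component type for intent classifiers. Graph components of this type run during training only if is_trainable=True is specified; they always run during inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses the component's process method to classify the intent of messages during inference.
(5) ComponentType.ENTITY_EXTRACTOR
Component type for entity extractors. Graph components of this type run during training only if is_trainable=True is specified; they always run during inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses the component's process method to extract entities during inference.
(6) ComponentType.POLICY_WITHOUT_END_TO_END_SUPPORT
Component type for policies that do not require additional end-to-end features (see end-to-end training for more information). Graph components of this type run during training only if is_trainable=True is specified; they always run during inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses the component's predict_action_probabilities method to predict the next action to run within a conversation.
(7) ComponentType.POLICY_WITH_END_TO_END_SUPPORT
Component type for policies that require additional end-to-end features (see end-to-end training for more information). The end-to-end features are passed as precomputed parameters into the component's train and predict_action_probabilities methods. Graph components of this type run during training only if is_trainable=True is specified; they always run during inference. If is_trainable=True is specified, Rasa uses the component's train method for training. Rasa uses the component's predict_action_probabilities method to predict the next action to run within a conversation.
2. is_trainable
Specifies whether the graph component itself has to be trained before it can process training data for other dependent graph components, or before it can make predictions.
3. model_from
Specifies whether a pretrained language model needs to be provided to the train, process_training_data and process methods of the graph component. These methods must accept a parameter named model to receive the language model. Note that you still have to make sure that the graph component providing this model is part of your model configuration. A common use case is exposing the SpacyNLP language model to other NLU components.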

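VI. Using Custom Components in the Model Configuration
You can use custom graph components in the model configuration like any other NLU component or policy. The only difference is that you have to specify the full module name instead of just the class name. The full module name depends on the module's location relative to the specified PYTHONPATH. By default, Rasa adds the directory from which the CLI is run to the PYTHONPATH. For example, if you run the CLI from /Users/<user>/my-rasa-project and the class MyComponent is defined in /Users/<user>/my-rasa-project/custom_components/my_component.py, the module path is custom_components.my_component.MyComponent. Everything except the name entry is passed to the component as config. The config.yml file looks like this: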

recipe: default.v1
language: en
pipeline:
# other NLU components
- name: your.custom.NLUComponent
  setting_a: 0.01
  setting_b: string_value

policies:
# other dialogue policies
- name: your.custom.Policy
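
How a full module path turns into a class can be sketched with importlib. The helper resolve_class below is hypothetical and only illustrates the lookup; Rasa's own loading code is not shown, and a standard-library class stands in for a custom component so that the example runs anywhere.

```python
import importlib
from typing import Any


def resolve_class(module_path: str) -> Any:
    """Imports package.module.ClassName and returns the class object."""
    module_name, _, class_name = module_path.rpartition(".")
    if not module_name:
        raise ValueError(f"'{module_path}' is not a full module path.")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stands in for a path such as custom_components.my_component.MyComponent.
resolved = resolve_class("collections.OrderedDict")
print(resolved)
```

The import only succeeds if the module's parent directory is on the PYTHONPATH, which is why the directory the CLI is run from matters.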

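VII. Implementation Hints
1. Message metadata
When you define metadata for intent examples in the training data, an NLU component can access both the intent metadata and the intent example metadata during processing: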

# in your component class
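def process(self, messages: List[Message]) -> List[Message]:
    for message in messages:
        # `metadata` may be missing if none was defined in the training data.
        metadata = message.get("metadata") or {}
        print(metadata.get("intent"))
        print(metadata.get("example"))
    return messages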

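2. Sparse and dense message features
If you create a custom message featurizer, you can return two different kinds of features: sequence features and sentence features. Sequence features are a matrix of size (number-of-tokens x feature-dimension), i.e. the matrix contains a feature vector for every token in the sequence. Sentence features are represented by a matrix of size (1 x feature-dimension).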
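
The two shapes can be illustrated with plain nested lists. The sketch derives sentence features by averaging the token vectors, which is only one possible choice and an assumption made for illustration; the text above does not prescribe how sentence features are computed, and a real featurizer would use NumPy arrays or SciPy sparse matrices instead of lists.

```python
from typing import List

Matrix = List[List[float]]


def sentence_features_by_mean(sequence_features: Matrix) -> Matrix:
    """Averages the token vectors column-wise into a single (1 x dim) row."""
    n_tokens = len(sequence_features)
    return [[sum(column) / n_tokens for column in zip(*sequence_features)]]


# Hypothetical 2-dimensional feature vectors for the three tokens of "book a flight".
sequence_features: Matrix = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
sentence_features = sentence_features_by_mean(sequence_features)

print(len(sequence_features), len(sequence_features[0]))  # number of tokens, dimension
print(len(sentence_features), len(sentence_features[0]))  # always one row, same dimension
```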

VIII. Examples of Custom Components
1. Dense message featurizer
The following is an example of a dense message featurizer that uses a pretrained model:

import numpy as np
import logging
from bpemb import BPEmb
from typing import Any, Text, Dict, List, Type

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.graph import ExecutionContext, GraphComponent
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.nlu.featurizers.dense_featurizer.dense_featurizer import DenseFeaturizer
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.constants import (
    DENSE_FEATURIZABLE_ATTRIBUTES,
    FEATURIZER_CLASS_ALIAS,
)
from rasa.shared.nlu.constants import (
    TEXT,
    TEXT_TOKENS,
    FEATURE_TYPE_SENTENCE,
    FEATURE_TYPE_SEQUENCE,
)


logger = logging.getLogger(__name__)


@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=False
)
class BytePairFeaturizer(DenseFeaturizer, GraphComponent):
    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return [Tokenizer]

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return ["bpemb"]

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config."""
        return {
            **DenseFeaturizer.get_default_config(),
            # specifies the language of the subword segmentation model
            "lang": None,
            # specifies the dimension of the subword embeddings
            "dim": None,
            # specifies the vocabulary size of the segmentation model
            "vs": None,
            # if set to True and the given vocabulary size can't be loaded for the given
            # model, the closest size is chosen
            "vs_fallback": True,
        }

    def __init__(
        self,
        config: Dict[Text, Any],
        name: Text,
    ) -> None:
        """Constructs a new byte pair vectorizer."""
        super().__init__(name, config)
        # The configuration dictionary is saved in `self._config` for reference.
        self.model = BPEmb(
            lang=self._config["lang"],
            dim=self._config["dim"],
            vs=self._config["vs"],
            vs_fallback=self._config["vs_fallback"],
        )

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new component (see parent class for full docstring)."""
        return cls(config, execution_context.node_name)

    def process(self, messages: List[Message]) -> List[Message]:
        """Processes incoming messages and computes and sets features."""
        for message in messages:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self._set_features(message, attribute)
        return messages

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        """Processes the training examples in the given training data in-place."""
        self.process(training_data.training_examples)
        return training_data

    def _create_word_vector(self, document: Text) -> np.ndarray:
        """Creates a word vector from a text. Utility method."""
        encoded_ids = self.model.encode_ids(document)
        if encoded_ids:
            return self.model.vectors[encoded_ids[0]]

        # Fall back to a zero vector; the config is stored in `self._config`.
        return np.zeros((self._config["dim"],), dtype=np.float32)

    def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
        """Sets the features on a single message. Utility method."""
        tokens = message.get(TEXT_TOKENS)

        # If the message doesn't have tokens, we can't create features.
        if not tokens:
            return None

        # We need to reshape here such that the shape is equivalent to that of sparsely
        # generated features. Without it, it'd be a 1D tensor. We need 2D (n_utterance, n_dim).
        text_vector = self._create_word_vector(document=message.get(TEXT)).reshape(
            1, -1
        )
        word_vectors = np.array(
            [self._create_word_vector(document=t.text) for t in tokens]
        )

        final_sequence_features = Features(
            word_vectors,
            FEATURE_TYPE_SEQUENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sequence_features)
        final_sentence_features = Features(
            text_vector,
            FEATURE_TYPE_SENTENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sentence_features)

    @classmethod
    def validate_config(cls, config: Dict[Text, Any]) -> None:
        """Validates that the component is configured properly."""
        if not config["lang"]:
            raise ValueError("BytePairFeaturizer needs language setting via `lang`.")
        if not config["dim"]:
            raise ValueError(
                "BytePairFeaturizer needs dimensionality setting via `dim`."
            )
        if not config["vs"]:
            raise ValueError("BytePairFeaturizer needs a vocabulary size setting via `vs`.")

2. Sparse message featurizer
The following is an example of a sparse message featurizer that trains a new model:

import logging
from typing import Any, Text, Dict, List, Type

from sklearn.feature_extraction.text import TfidfVectorizer
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.graph import ExecutionContext, GraphComponent
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.nlu.featurizers.sparse_featurizer.sparse_featurizer import SparseFeaturizer
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.constants import (
    DENSE_FEATURIZABLE_ATTRIBUTES,
    FEATURIZER_CLASS_ALIAS,
)
from joblib import dump, load
from rasa.shared.nlu.constants import (
    TEXT,
    TEXT_TOKENS,
    FEATURE_TYPE_SENTENCE,
    FEATURE_TYPE_SEQUENCE,
)

logger = logging.getLogger(__name__)


@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=True
)
class TfIdfFeaturizer(SparseFeaturizer, GraphComponent):
    @classmethod
    def required_components(cls) -> List[Type]:
        """Components that should be included in the pipeline before this component."""
        return [Tokenizer]

    @staticmethod
    def required_packages() -> List[Text]:
        """Any extra python dependencies required for this component to run."""
        return ["sklearn"]

    @staticmethod
    def get_default_config() -> Dict[Text, Any]:
        """Returns the component's default config."""
        return {
            **SparseFeaturizer.get_default_config(),
            "analyzer": "word",
            "min_ngram": 1,
            "max_ngram": 1,
        }

    def __init__(
        self,
        config: Dict[Text, Any],
        name: Text,
        model_storage: ModelStorage,
        resource: Resource,
    ) -> None:
        """Constructs a new tf/idf vectorizer using the sklearn framework."""
        super().__init__(name, config)
        # Initialize the tfidf sklearn component
        self.tfm = TfidfVectorizer(
            analyzer=config["analyzer"],
            ngram_range=(config["min_ngram"], config["max_ngram"]),
        )

        # We need to use these later when saving the trained component.
        self._model_storage = model_storage
        self._resource = resource

    def train(self, training_data: TrainingData) -> Resource:
        """Trains the component from training data."""
        texts = [e.get(TEXT) for e in training_data.training_examples if e.get(TEXT)]
        self.tfm.fit(texts)
        self.persist()
        return self._resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        """Creates a new untrained component (see parent class for full docstring)."""
        return cls(config, execution_context.node_name, model_storage, resource)

    def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
        """Sets the features on a single message. Utility method."""
        tokens = message.get(TEXT_TOKENS)

        # If the message doesn't have tokens, we can't create features.
        if not tokens:
            return None

        # Make distinction between sentence and sequence features
        text_vector = self.tfm.transform([message.get(TEXT)])
        word_vectors = self.tfm.transform([t.text for t in tokens])

        final_sequence_features = Features(
            word_vectors,
            FEATURE_TYPE_SEQUENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sequence_features)
        final_sentence_features = Features(
            text_vector,
            FEATURE_TYPE_SENTENCE,
            attribute,
            self._config[FEATURIZER_CLASS_ALIAS],
        )
        message.add_features(final_sentence_features)

    def process(self, messages: List[Message]) -> List[Message]:
        """Processes incoming message and compute and set features."""
        for message in messages:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self._set_features(message, attribute)
        return messages

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        """Processes the training examples in the given training data in-place."""
        self.process(training_data.training_examples)
        return training_data

    def persist(self) -> None:
        """Persists the fitted vectorizer to the model storage."""
        with self._model_storage.write_to(self._resource) as model_dir:
            dump(self.tfm, model_dir / "tfidfvectorizer.joblib")

    @classmethod
    def load(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
        **kwargs: Any,
    ) -> GraphComponent:
        """Loads trained component from disk."""
        try:
            with model_storage.read_from(resource) as model_dir:
                tfidfvectorizer = load(model_dir / "tfidfvectorizer.joblib")
                component = cls(
                    config, execution_context.node_name, model_storage, resource
                )
                component.tfm = tfidfvectorizer
                return component
        except (ValueError, FileNotFoundError):
            logger.debug(
                f"Couldn't load metadata for component '{cls.__name__}' as the persisted "
                f"model data couldn't be loaded."
            )
            # Fall back to an untrained component so that `component` is never
            # referenced before assignment.
            return cls.create(config, model_storage, resource, execution_context)

    @classmethod
    def validate_config(cls, config: Dict[Text, Any]) -> None:
        """Validates that the component is configured properly."""
        pass

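IX. NLU Meta Learners
NLU meta learners are an advanced use case. This section only applies if you have a component that learns its parameters from the output of previous classifiers. For components whose parameters or logic are set manually, you can create a component with is_trainable=False and need not worry about the preceding classifiers.
NLU meta learners are intent classifiers or entity extractors that use the predictions of other trained intent classifiers or entity extractors and try to improve on their results. One example is a component that averages the output of two previous intent classifiers. Another is a fallback classifier that sets its threshold based on how confident the intent classifier is on the training examples.
Conceptually, to build a trainable fallback classifier, you first need to create that fallback classifier as a custom component: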

from typing import Dict, Text, Any, List

from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.nlu.classifiers.fallback_classifier import FallbackClassifier

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
)
class MetaFallback(FallbackClassifier):

    def __init__(
        self,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> None:
        super().__init__(config)

        self._model_storage = model_storage
        self._resource = resource

    @classmethod
    def create(
        cls,
        config: Dict[Text, Any],
        model_storage: ModelStorage,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> FallbackClassifier:
        """Creates a new untrained component (see parent class for full docstring)."""
        return cls(config, model_storage, resource, execution_context)

    def train(self, training_data: TrainingData) -> Resource:
        # Do something here with the messages
        return self._resource

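Next, you need to create a custom intent classifier that is also a featurizer, because the classifier's output has to be consumed by another component downstream. For the custom intent classifier, you also need to define how its predictions are added to the message data by implementing the process_training_data method. Make sure not to overwrite the true labels of the intents. Here is a template that shows how to subclass DIET for this purpose: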

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.nlu.classifiers.diet_classifier import DIETClassifier

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER,
     DefaultV1Recipe.ComponentType.ENTITY_EXTRACTOR,
     DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER], is_trainable=True
)
class DIETFeaturizer(DIETClassifier):

    def process_training_data(self, training_data: TrainingData) -> TrainingData:
        # classify and add the attributes to the messages on the training data
        return training_data
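
The two meta-learner examples named in this section, averaging the output of two intent classifiers and falling back below a confidence threshold, can be sketched on plain dictionaries. Both functions are hypothetical illustrations outside of Rasa; the intent names, the confidences and the "nlu_fallback" label are assumed values, and a trainable meta learner would learn the threshold from training data instead of receiving it as an argument.

```python
from typing import Dict


def average_intent_confidences(
    first: Dict[str, float], second: Dict[str, float]
) -> Dict[str, float]:
    """Averages two intent -> confidence mappings; a missing intent counts as 0.0."""
    intents = set(first) | set(second)
    return {
        intent: (first.get(intent, 0.0) + second.get(intent, 0.0)) / 2
        for intent in intents
    }


def apply_fallback(confidences: Dict[str, float], threshold: float) -> str:
    """Returns the top intent, or a fallback label if its confidence is too low."""
    top_intent = max(confidences, key=confidences.get)
    return top_intent if confidences[top_intent] >= threshold else "nlu_fallback"


# The two classifiers disagree, so the averaged confidences stay below the threshold.
averaged = average_intent_confidences(
    {"greet": 0.75, "goodbye": 0.25}, {"greet": 0.25, "goodbye": 0.75}
)
print(apply_fallback(averaged, threshold=0.6))
```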

References:
[1] Custom Graph Components: https://rasa.com/docs/rasa/custom-graph-components