4 PreLiminaries

Let \(D^k=\left\{d_i^k\right\}_{i=1}^{N_k}\) denote a set of dialogs between the seeker \(s_k\) \(\left(0 \leq k<\mathbb{N}_s\right)\) and the recommender, where \(N_k\) is the number of dialogs by the seeker \(s_k\), and \(\mathbb{N}_s\) is the number of seekers. Recall that we attach each dialog (say \(d_i^k\) ) with a seeker profile \(\mathcal{P}^k\) (updated based on dialog corpus), a knowledge graph \(\mathcal{K}\), and a goal sequence \(\mathcal{G}=\left\{\mathbf{g}_t\right\}_{t=1}^f\).

Definition 1. Dialog goal. To facilitate the modeling of indepth sub-dialog about a given topic, we augment the goals in DuRecDial with topic attribute information and then define each goal as a triplet \(\mathbf{g}=\left\langle g^p, g^o, g^r\right\rangle\), where \(g^p \in \mathbb{P}\) denotes goal type (e.g., \(Q A\), chitchat and recommendation), \(g^o \in \mathbb{O}\) represents goal topic and \(g^r \in \mathbb{R}\) is a set of topic facets about \(g^o\).

Specifically, the last goal in the goal sequence \(\mathcal{G}\), denoted by \(\mathbf{g}_f\), is the long-term goal we wish to achieve and is predetermined before the dialog.

Problem Statement. Different from previous work of conversational recommenders that focus on retrieving appropriate recommendation items [3], [5], the focus of this work is how to lead the dialog to reach a predefined long-term goal (e.g., recommending a given movie to the user) and then respond to user feedback appropriately. Then our problem is formally defined as follows: Given a dialog context \(X\) with a sequence of utterances \(\left\{u_j\right\}_{j=1}^m\), short-term goal history \(\mathcal{G}^{\prime}=\left\{\mathbf{g}_i\right\}_{i=1}^t\) of \(X\left(\mathbf{g}_t\right.\) as the goal for \(u_{m}(t \leq m)\) ), a pre-defined long-term goal \(\mathbf{g}_{f}\) for current dialog session, knowledge graph \(\mathcal{K}\), and updated seeker profile \(\mathcal{P}^{k}\), our system will plan a goal sequence \(\left\{\mathbf{g}_{t+1}, \mathbf{g}_{t+2}, \ldots, \mathbf{g}_{f}\right\}\) to guide this ongoing dialog and then generate appropriate responses \(Y=\left\{y_{1}, y_{2}, \ldots, y_{n}\right\}\) for completion of each goal, including both short-term and long-term goals.

OVERVIEW OF MGCG

To strengthen the capability of dialog flow control, we focus on dialog management modeling as done in previous work [38], [41], [42], and then decouple dialog management from response generation. Following this design principle, we propose a MultiGoal driven Conversation Generation (MGCG) framework, which consists of a Graph-Grounded Goal Planning module, and a Responding module. Compared with our previous work [21], we employ a novel hierarchical goal graph to facilitate goal planning, where the graph provides prior information about the transitions between goal elements (goal types/goal topics/topic attributes). Specifically, the goal planning module first takes \(X, \mathcal{G}^{\prime}, \mathcal{P}^{k}\), and \(\mathcal{K}\) as input, then outputs \(g_{t+1}\) as guidance information of responding module by learning to walk over the graph. Then, the responding module is responsible for completing each goal by producing responses or a sub-dialog conditioned on \(X, g_{t+1}\), and \(\mathcal{K}\). Next, we will provide more details about the two modules.

Graph-GroUnded Goal PlanNing

In the presence of explicit dialog goals, a straightforward approach is to devise a goal planning module to determine the current shortterm dialog goal \(\mathbf{g}_{t}\) for response generation. A baseline model is to decompose the goal planning task into two sub-tasks, the current goal completion estimation, and the next goal prediction. For the first task, the probability of current goal completion can be estimated by:

\[P_{G C}\left(l=1 \mid X, \mathbf{g}_{t}\right) \]

In each step, if \(\mathbf{g}_{t}\) is not completed \(\left(P_{G C}<0.5\right)\), we keep \(g_{t+1}=\) \(g_{t}\) without goal transition. Otherwise, we predict the next goal by maximizing the following probability:

\[\mathbf{g}_{t+1}=\arg \max _{\mathbf{g}_{t+1}} P_{G P}\left(\mathbf{g}_{t+1} \mid X, \mathcal{G}^{\prime}, \mathcal{P}^{k}, \mathcal{K}\right), \]

and we further use \(\mathbf{g}_{t+1}\) for subsequent generation of \(Y\).

However, the above baseline neglects the global structural correlation and local sequential dependency among dialog goals, which can be exploited to further improve the goal planning module. On the one hand, the transition of goal type, goal topic, and topic attribute naturally form a hierarchical graph structure. Modeling such a hierarchical structure can promote natural transition between goals. On the other hand, the current goal is coherent with historical goals, and the transition from the current goal to subsequent goals is toward the long-term goal. Incorporating historical goals in past utterances and the long-term goal can further capture the dependency of the goal sequence. In this section, we introduce our Graph-Grounded Goal Planning module for more effective goal planning.

Overview

Figure 3 shows the end-to-end framework of our graph-grounded goal planning module. Given a dialog corpus and a knowledge graph, we expect to build a graph-grounded goal planning module that can output appropriate goal types, goal topics, and topic attributes to guide response generation with the following two modules. In the hierarchical goal graph construction and learning module, we first construct a hierarchical goal graph to capture the global correlation among dialog goals, by combining sequence information of goal types, goal topics, and topic attributes in dialog corpus and knowledge triples in the knowledge graph. In particular, the goal type or topic attribute sequences provide important prior information for goal type or topic attribute transitions, while the knowledge triples provide complementary information for goal topic transitions. Then, the hierarchical graph representation learning component learns embeddings of goal types, goal topics, and topic attributes by preserving the firstorder proximity among them in the hierarchical goal graph, which reflects the global relatedness between these goal elements. In the goal sequence dependency learning and prediction module, based on the learned goal embeddings from the previous graph module, the Goal Sequence Dependency Learning component incorporates the local sequential dependency between goals in dialog corpus into goal embeddings by analogizing goal sequences as routes on the graph. In this way, the transition to the next goal should not only be coherent with past goals but also approach the final goal. Besides, the asynchronous goal transition control component integrates several pruning strategies to guarantee the transition efficiency of goal type, goal topic and topic attributes toward the final goal. Finally, the next goal prediction component outputs appropriate goal types, goal topics, and topic attributes for response generation.

Hierarchical Graph Construction and Representa- tion Learning

We first learn the representations of goal type, goal topic and topic attribute over a hierarchical goal graph that combines sequence information in a dialog corpus and semantic relationship between goals in a knowledge graph.

Hierarchical Goal Graph Construction

Definition 2. Hierarchical goal graph. A hierarchical goal graph is defined as \(G=(\mathcal{V}, \mathcal{E}, \mathcal{A})\), where \(\mathcal{V}=\mathbb{P} \cup \mathbb{O} \cup \mathbb{R}\) is a union of goal types \(\mathbb{P}\), goal topics \(\mathbb{O}\) and topic attributes \(\mathbb{R}\), \(\mathcal{E}\) is the set of edges indicating connectivity among vertices, \(\mathcal{A}=\left\{A^{p}, A^{o}, A^{r}, A^{p o}, A^{\text {or }}\right\}\) is the set of adjacency matrices where the five elements are indicating 1) the adjacency score between two distinct goal types, 2) the adjacency score between two distinct goal topics, 3) the adjacency score between two distinct topic attributes, 4) the adjacency score between a goal type and a goal topic, and 5) the adjacency score between a goal topic and a topic attribute.

The hierarchical goal graph is a three-layer heterogeneous graph. Two vertices are linked by an edge if the adjacency score between two vertices is greater than zero. In particular, \(A^{p}, A^{o}\), and \(A^{r}\) describe the adjacency scores of homogenous nodes in three homogeneous subgraphs, including a goal type subgraph \(G^{p}\), a goal topic subgraph \(G^{o}\), and a topic attribute subgraph \(G^{r}\). \(A^{p o}\) and \(A^{\text {or }}\) describe the adjacency of heterogeneous nodes from goal types to goal topics, and from goal topics to topic attributes.

Fig. 3. An overview of our end-to-end graph-grounded goal planning module. The sub-graphs in each ellipse are homogeneous graphs, and crossellipse sub-graphs form the heterogeneous graph.

Homogeneous subgraph construction. We construct three homogeneous subgraphs (i.e., \(G^{p}, G^{o}\), and \(G^{r}\) ) based on the co-occurrence graph and the knowledge graph. Specifically, the co-occurrence graph is constructed based on the co-occurrence frequencies of goals in dialog corpus, which is determined by the distribution of dialogs in the training data. Take the goal topic subgraph \(G^{o}\) for example, the co-occurrence graph is constructed as below. Consider two homogeneous vertices \(v_{i}^{o}\) and \(v_{j}^{o}\) in \(G^{o}\), \(e_{i j}^{o}=k\) if and only if they are co-occurred in \(k\) dialogs in the dataset, and \(e_{i j}^{o}=0\) otherwise. The co-occurrence graph of goal types and topic attributes are constructed in the same way. Differently, the knowledge graph is constructed based on human knowledge prior, which reflects the semantic relationship between goal elements. The construction of knowledge graph is described in the dataset construction in Section 3.1 and Appendix 1.1. In practice, the co-occurrence graph contains important prior information for goal type or topic attribute transitions but may contain noise relationships. The knowledge graph can provide complementary information for goal topic transitions, but cannot directly handle goal type or topic attribute transitions. In this work, we combine the co-occurrence graph with the knowledge graph, which can overcome the above drawbacks to some extent.

Formally, let \(G_{O}^{o}=\left(\mathcal{V}_{O}^{o}, \mathcal{E}_{O}^{o}\right)\) and \(G_{K}^{o}=\left(\mathcal{V}_{K}^{o}, \mathcal{E}_{K}^{o}\right)\) denote the co-occurrence graph and the knowledge graph for goal topics, respectively. To transform the above two graphs into the same scale, we first construct the adjacency matrix for each set of homogeneous vertices in the knowledge graph, where goal topics, goal entities, and attributes of each entity are regarded as individual vertices. For instance, for goal topics \(v_{i}^{o}\) and \(v_{j}^{o}\), we set the adjacency score \(\left[A_{K}^{o}\right]_{i j}=1\) if and only if they are connected by a relation in the knowledge graph and 0 otherwise. Besides, we construct adjacency score of two goal attributes to 1 if they are connected by a same goal topic. In this way, we extract three subgraphs of homogeneous nodes from the knowledge graph for homogeneous graph construction. Then, we normalize the adjacency matrix of the co-occurrence graph \(A_{O}^{o}\) to \(A_{O}^{o \prime}=\left(D_{O}^{o}\right)^{-1 / 2} A_{O}^{o}\left(D_{O}^{o}\right)^{-1 / 2}\), where \(\left[D_{O}^{o}\right]_{i i}=\sum_{j}\left[A_{O}^{o}\right]_{i j}\). The adjacency matrix of the knowledge graph \(A_{K}^{o}\) is normalized in the same way. Note that we use \([\cdot]_{a b}\) to denote entry \((a, b)\) in a matrix, use \([\cdot]_{a *}\) and \([\cdot]_{* b}\) to respectively denote the \(a\)-th row and \(b\)-th column. After that, two normalized graphs are combined by updating the new adjacency matrix,

\[A_{O K}^{o}=\lambda A_{O}^{o \prime}+(1-\lambda) A_{K}^{o}{ }^{\prime} \]

where \(\lambda \in[0,1]\) is weight coefficient. With the updated adjacency matrix \(A_{O K}^{o}\), we construct the corresponding edges by

\[\mathcal{E}_{O K}^{o}=\left\{\left(V_{i}^{o}, V_{j}^{o}\right) \mid\left[A_{O K}^{o}\right]_{i j} \neq 0 \text {, and } 0 \leq i, j \leq\left|\mathcal{V}^{o}\right|\right\} \text {. } \]

We use \(G_{O K}^{o}=\left(\mathcal{V}^{o}, \mathcal{E}_{O K}^{o}, A_{O K}^{o}\right)\) or simply \(G^{o}\) to denote the fused goal topic subgraph. The subgraphs of goal types \(G_{O K}^{p}\) and topic attributes \(G_{O K}^{r}\) are constructed in the same way.

Heterogeneous subgraph construction. We also construct two heterogeneous subgraphs (i.e., with directed edges from goal types to goal topics or from goal topics to topic attributes). Different from homogeneous subgraphs, the heterogeneous structure information is only available in dialog corpus (i.e., the cooccurrence statistics). We initialize the adjacency matrix between goal types and goal topics with a slack constraint,

\[\left[A^{p o}\right]_{i j}=\left\{\begin{array}{l} 1, \text { if } V_{i}^{p} \text { and } V_{j}^{o} \text { co-occurred } \\ \tau, \text { otherwise } \end{array},\right. \]

where \(\tau\) is a small constant. The adjacency matrix between goal topics and topic attributes \(A^{p o}\) is constructed in the same way.

Hierarchical Graph Representation Learning

Then, we devise the graph convolution operation to capture the structural correlation of the hierarchical goal graph in an end-toend manner, as illustrated in Figure 3 (b). Intuitively, a vertex in the hierarchical goal graph has a higher correlation with adjacent vertices compared with distant vertices. For instance, a goal topic more likely leads to the subsequent conversation to a connected topic attribute (i.e., talk on the topic) or a connected goal topic (i.e., conversation topic transition). It is appealing to preserve the structural proximity between vertices in the hierarchical graph to enhance the effectiveness and naturalness of proactive goal planning. Graph Convolution Networks (GCN) have shown their superiority on processing non-Euclidean graph structures by smoothing the representation of neighboring nodes via aggregation and transformation functions [43]. In this work, we leverage GCN to derive node representations so that correlated goal types, goal topics, and topic attributes are closer to each other in the unified vector space. Formally, let \(\mathbf{e}_{i}^{p} \in \mathbf{E}^{p}\), \(\mathbf{e}_{i}^{p} \in \mathbf{E}^{o}\) and \(\mathbf{e}_{i}^{r} \in \mathbf{E}^{r}\) denote embedding of type node \(V_{i}^{p}\), topic node \(V_{i}^{o}\) and attribute node \(V_{i}^{r}\), respectively. We first initialize the embedding of type nodes in the graph \(\mathbf{e}^{p} \in \mathbb{R}^{\left|V^{p}\right| \times F}\) by Gaussian distribution with mean equals to 0 and variance equals to 1 , where \(F \ll\left|V^{t}\right|\). The goal topic embedding and topic attribute embedding are initialized by \(\mathbf{E}^{o}=A^{p o} \mathbf{E}^{p}\) and \(\mathbf{E}^{r}=A^{o r} \mathbf{E}^{o}\). Topic-attribute level representation. For topic attributes, the corresponding embeddings are updated by stacking multiple graph convolution layers, which is defined as

\[\mathbf{E}_{(l+1)}^{r}=\sigma\left(A^{r} \mathbf{E}_{(l)}^{r} W_{(l)}^{r}\right) \]

where \(A^{r}\) is the adjacency matrix of attributes, \(\mathbf{E}_{(l)}^{r} \in \mathbb{R}^{\left|V^{r}\right| \times C_{(l)}^{r}}\) denotes the attribute embedding of all \(V^{r}\) at the \(l\)-th layer. \(W_{(l)}^{r} \in \mathbb{R}^{C_{(l)}^{r} \times C_{(l+1)}^{r}}\) is a weighted matrix which is learnable in the training phase. \(\sigma(\cdot)\) is a non-linear activation function. The learned topic attribute embeddings capture the co-occurrence and cognitive correlation among attributes, which will be used for higher-level goal topic representation learning as well as subsequent goal planning task.

Goal-topic level representation. As shown in Figure 3 (b), being different from the topic attribute embedding, the update of goal topic embedding is based on both the topic relationship subgraph and the topic attribute embeddings via an Attribute Gate. Specifically, for a topic \(V_{i}^{o}\), we adopt the attention mechanism to quantify the influence of lower level attribute embeddings,

\[\left[\mathbf{Z}^{o}\right]_{i *}=\sum_{j \in \mathcal{N}_{i}^{o r}} \alpha_{i j}^{o}\left[\mathbf{E}^{r}\right]_{j *}, \]

where \(\mathcal{N}_{i}^{o r}\) is neighbor attributes of \(g_{i}^{o}, \alpha_{i j}^{o}\) is the attention score derived by

\[\alpha_{i j}^{o}=\frac{\exp \left(W_{o}\left(\left[\mathbf{E}^{o}\right]_{i *} \|\left[\mathbf{E}^{r}\right]_{j *}\right)\right)}{\sum_{k \in \mathcal{N}_{i}^{o r}} \exp \left(W_{o}\left(\left[\mathbf{E}^{o}\right]_{i *} \|\left[\mathbf{E}^{r}\right]_{k *}\right)\right)}, \]

where \(\|\) denotes concatenation operation. The embedding of the topic \(V_{i}^{o}\) is updated by the following graph convolution operation,

\[\mathbf{E}_{(l+1)}^{o}=\sigma\left(A^{o} \mathbf{E}_{(l)}^{o} W_{(l)}^{o}\right) \]

where \(A^{o}\) is the adjacency matrix of goal topics, \(\mathbf{E}_{(l)}^{o} \in\) \(\mathbb{R}^{\left|V^{o}\right| \times C_{(l)}^{o}}\) is the topic embedding for all \(\left|V^{o}\right|\) topics in the \(l\)-th layer graph convolution, and \(W_{(l)}^{o} \in \mathbb{R}^{C_{(l)}^{o} \times C_{(l+1)}^{o} \text { is the }}\) learnable transformation matrix. Note that \(\mathbf{E}_{(0)}^{o} \equiv \mathbf{E}^{o} \| \mathbf{Z}^{o}\), which differentiates Equation 9 from Equation 6 .

Goal-type level representation. Similar to goal topic embedding, goal type embedding is updated jointly based on the goal type subgraph and lower-level goal topic embeddings via a Topic Gate. We reuse Equation 7 to Equation 9 to generate goal type embeddings, but with goal type specific parameters \(\mathcal{N}_{i}^{p o}, \alpha_{i j}^{p}, W_{p}\) and embeddings \(\mathbf{E}_{(0)}^{p} \equiv \mathbf{E}^{p} \| \mathbf{Z}^{p}\).

On the one hand, the low level nodes (e.g., goal topics and topic attributes) contribute to high level node representations in the hierarchical learning. On the other hand, during the end-toend graph representation learning, the low level representations are also guided by higher level supervision signals via backpropagation. Therefore, the hierarchical graph learning module absorbs knowledge bidirectionally from the graph to enhance the representation effectiveness. Note the hierarchical graph degrade to a heterogeneous graph without considering the hierarchical structure. However, directly learning representation for three types of nodes over heterogeneous graph neural network [44], [45] relies on human defined relations or meta-paths and may overlook the hierarchical dependency knowledge [46]. We left how to incorporate hierarchical learning with heterogeneous graph neural network as a future work to enhance the representation learning of goal type, goal topic, and topic attribute.

Goal Sequence Dependency Learning and Goal Prediction

After that, based on the output of the hierarchical graph representation learning component, we incorporate the local sequence dependency between goals into learned goal type, goal topic, and topic attribute embeddings to enhance the proactivity and naturalness of goal planning.

Goal Sequence Dependency Learning

The intuition of the goal sequence dependency learning component is that the selection of the next goal depends on not only the previous goal sequence in previous utterances (i.e., coherence and naturalness), but also the long-term goal (i.e., long-term goal completion). However, in practice, the cost of selecting each goal (e.g., coherence, informativeness) to reach the long-term goal is unknown. Inspired by the recent success of integrating neural network into heuristic shortest path search algorithm [47] for route planning, we introduce the Goal Sequence Dependency Learning component to capture the implicit goal dependency in previous goal sequences for next goal prediction, as illustrated in Figure 3 (c). Given the previous goal sequence \(\left[\cdots, \mathbf{g}_{t-1}, \mathbf{g}_{t}\right]\) and the long-term goal \(\mathbf{g}_{f}\), we estimate the state representation of the next goal as follows.

Dependency learning of goal types and goal topics. Consider the sequence of corresponding goal type representations \(\left[\cdots, \mathbf{e}_{i, t-1}^{p}, \mathbf{e}_{j, t}^{p}\right]\) and next goal type candidate \(\mathbf{e}_{k, t+1}^{p}\) derived from hierarchical graph representation learning component, we employ \(G R U\) for previous goal sequence dependency modeling. Specifically, we denote the goal type status at current step \(t\) and next step \(t+1\) as \(\mathbf{h}_{t}^{p}\) and \(\mathbf{h}_{t+1}^{p}\), respectively. The dependency between \(\mathbf{h}_{t+1}^{p}\) and \(\mathbf{h}_{t}^{p}\) is defined as

\[\mathbf{h}_{t+1}^{p}=\left(1-\mathbf{z}_{t+1}^{p}\right) \circ \mathbf{h}_{t}^{p}+\mathbf{z}_{t+1}^{p} \circ \widetilde{\mathbf{h}}_{t+1}^{p}, \]

where \(\mathbf{z}_{t+1}^{p}\) and \(\widetilde{\mathbf{h}}_{t+1}^{p}\) are defined as

\[\left\{\begin{array}{l} \mathbf{r}_{t+1}^{p}=\sigma\left(\left[\mathbf{h}_{t}^{p} \| \mathbf{e}_{k, t+1}^{p}\right] W_{r}^{p}+\mathbf{b}_{r}^{p}\right) \\ \mathbf{z}_{t+1}^{p}=\sigma\left(\left[\mathbf{h}_{t}^{p} \| \mathbf{e}_{k, t+1}^{p}\right] W_{z}^{p}+\mathbf{b}_{z}^{p}\right) \\ \left.\widetilde{\mathbf{h}}_{t+1}^{p}=\tanh \left(\left[\mathbf{r}_{t+1}^{p} \circ \mathbf{h}_{t}^{p} \| \mathbf{e}_{k, t+1}^{p}\right] W_{\widetilde{h}}^{p}+\mathbf{b}_{\widetilde{h}}^{p}\right)\right) \end{array}\right. \]

where \(W_{r}^{p}, W_{z}^{p}, W_{\widetilde{h}}^{p}, \mathbf{b}_{r}^{p}, \mathbf{b}_{z}^{p}, \mathbf{b}_{\widetilde{h}}^{p}\) are learnable parameters, and \(\circ\) denotes Hadamard products. \(\mathbf{h}_{t+1}^{p}\) can be regarded as the intermediate cost of selecting \(g_{t+1}^{p}\) as the next goal type. To model the remaining cost of unobserved type sequence to reach \(g_{f}^{p}\) via \(g_{t+1}^{p}\), the destination embedding \(\mathbf{e}_{f}^{p}\) is further incorporated via a fully connected feedforward neural network

\[\mathbf{h}_{t+1}^{p^{\prime}}=\mathbf{M L P}\left(\mathbf{h}_{t+1}^{p} \| \mathbf{e}_{f}^{p}\right) \]

and \(\mathbf{h}_{t+1}^{p^{\prime}}\) will be used for the next goal type prediction. Similarly, we reuse Equation 10 to Equation 12 to derive the state representation of the next goal topic \(g_{t+1}^{o}\), but with different parameters, denoted by \(\mathbf{h}_{t+1}^{o^{\prime}}\). Note that since goal types and goal topics may transit to the next one asynchronously, the subscript \(t\) in corresponding state representations may be different.

Dependency learning of topic attributes. Different from cost estimation of goal types and goal topics, the selection of next attribute highly depends on current goal topic. Therefore, we estimate the representation of next topic attribute by \(\mathbf{h}_{t+1}^{r}=\mathbf{M L P}\left(A^{o r} \mathbf{h}_{t}^{o^{\prime}} \mathbf{e}_{f}^{r}\right)\), where \(\mathbf{h}_{t}^{o^{\prime}}\) is the state of current goal topic, and \(A^{o r}\) prunes attributes irrelevant with current goal topic. The state of next topic attribute is derived solely based on a fully connected feedforward neural network.

Goal Prediction

Asynchronous goal transition control. In multi-turn dialogs, the transition of goal type, goal topic, and topic attribute is often asynchronous. For instance, in a dialog, the goal topic may transit from the movie "Almost a Love Story" to "Forrest Gump", whereas the goal type keep unchanged (i.e., both belong to "movie recommendation"). As another example, users may talk about different topic attributes such as "Tom Hanks" and "Academy Awards", which are all under the same goal topic "Forrest Gump". We introduce the Goal Completion Gate to determine if we should change goal type and goal topic to the next one. Specifically, we employ another GRU to capture the semantic information \(\mathbf{h}_{u}^{s}\) in the last utterance. The completion probability of current goal type \(g_{t}^{p}\) is defined as \(P_{G C}^{p}\left(l=1 \mid \mathbf{h}_{t}^{p^{\prime}}, \mathbf{h}_{u}^{s}\right)\), where \(\mathbf{h}_{t}^{p^{\prime}}\) is the state representation of current goal sequence, which is derived by Equation 10. Similarly, the completion probability of the current goal topic \(g_{t}^{o}\) is defined as \(P_{G C}^{o}\left(l=1 \mid \mathbf{h}_{t}^{o^{\prime}}, \mathbf{h}_{u}^{s}\right)\). Different with goal types and goal topics, not every utterance is grounded on an attribute. We employ an Attribute Gate to predict if we should use an attribute for response, \(P_{N A}\left(l=1 \mid \mathbf{h}_{t}^{p^{\prime}}, \mathbf{h}_{t}^{o^{\prime}}\right)\), we set \(g_{t}^{r}\) empty if \(P_{N A}<0.5\). Once the output of the goal-transition control module exceed a threshold, our system predicts the next goal type, goal topic or topic attribute as below.

Next goal prediction. Based on the representations derived from the goal sequence dependency learning component and the asynchronous goal transition control component, we estimate the probability of selecting \(g_{i}^{p}\) as the next goal type by

\[P_{G P}\left(g_{i}^{p}\right)=\frac{\exp \left(\mathbf{h}_{i}^{p^{\prime}}\right)}{\sum_{j=1}^{k} \exp \left(\mathbf{h}_{j}^{p^{\prime}}\right)}, \]

where \(g_{i}^{p}\) is one of \(k\) valid goal type candidates in time step \(t+1\). We predict the next goal type by

\[\hat{g}_{t+1}^{p}=\arg \max _{g_{i}^{p} \in \mathbb{P}} P_{G P}\left(g_{i}^{p}\right), \]

where \(\mathbb{P}\) are all possible goal types. The next goal topic \(\hat{g}_{t+1}^{o}\) and the next topic attribute \(\hat{g}_{t+1}^{r}\) are predicted in the same way, but using \(\mathbf{h}_{i}^{o^{\prime}}\) and \(\mathbf{h}_{i}^{r}\) as input. In addition, we leverage the adjacency constraint between goal type and goal topic \(A^{p o}\) as well as the adjacency constraint between goal topic and topic attribute \(A^{o r}\) further to prune invalid goal topic and topic attribute candidates.

Optimization

For the goal planning module, we aim to optimize the negative \(\log\)-likelihood between the ground truth and the estimated goal. For example, we optimize the following goal type objective for all dialogs in the dataset,

\[L_{p}=-\frac{1}{N} \sum_{i=1}^{N} g_{i}^{p} \log \hat{g}_{i}^{p} \]

Overall, we aim to minimize the following objective function for goal planning,

\[L_{G P}=L_{p}+L_{o}+L_{r}, \]

where \(L_{p}, L_{o}\) and \(L_{r}\) are losses for goal types, goal topics and goal attributes, respectively.

Fig. 4. Architecture of our end-to-end goal-guided responding module.

GOAL-GUIDEd RespONDING MODULE

Overview

To use the aforementioned goal information to guide response generation, our responding module produces a response conditioned on dialog context, a goal, and reference knowledge at each turn. Figure 4 provides an overview of this responding module. For the implementation of this module, we adopt a retrieval model and a generation model proposed by [31] and modify them to suit our task.

Specifically, to improve dialog proactivity, naturalness, and recommendation effectiveness, we fully exploit the hierarchical graph to enhance the accuracy of predicting goals, and introduce a goal loss and copy mechanism to emphasize the guidance of goals for responding module.

Goal-guided Response Retrieval Model

As aforementioned, the conversational goal is an important guidance signal for response ranking to improve dialog proactivity, naturalness, and recommendation effectiveness. Therefore, we modify the original retrieval model in [31] by introducing an independent encoder for goals and an auxiliary Goal loss to emphasize the guidance of goals.

As shown in Figure 4(a), our end2end retrieval-based model (matcher) consists of four components: a context-response representation component (C-R Encoder), a knowledge representation component (Knowledge Encoder), a goal representation component (Goal Encoder), and a matching component (Ranker).

The C-R Encoder has the same architecture as BERT [48], and it takes a context \(X\) and a candidate response \(Y\) as segment_a and segment_b in BERT, and leverages a stacked self-attention to produce the joint representation of \(X\) and \(Y\), denoted as \(x y\). The model regards the current goal topic \(\hat{g}_{t}^{o}\) and the current topic attribute \(\hat{g}_{t}^{r}\) predicted by hierarchical goal planning module as reference knowledge \(\mathrm{K}\), which is also encoded as a vector by the Knowledge Encoder using a bi-directional GRU [49]. Specifically, it can be formulated as \(k_{c}=\left[\overrightarrow{\mathbf{h}_{\mathbf{T}_{\mathbf{k}}}} ; \overleftarrow{\mathbf{h}_{\mathbf{0}}}\right]\), where \(T_{k}\) denotes the length of \(\mathrm{K}, \overrightarrow{\mathbf{h}_{\mathbf{T}_{\mathrm{k}}}}\) and \(\overleftarrow{\mathbf{h}_{0}}\) represent the last and initial hidden states of the two directional GRU respectively.

The Goal Encoder uses bi-directional GRUs to encode a dialog type and a dialog topic for goal representation (denoted as \(g_{c}\) ).

We view \(k_{c}, g_{c}\), and \(x y\) as the information from knowledge source, goal source, and dialogue source respectively, and fuse the three information sources into a single vector via concatenation. Finally, we calculate a matching probability for each \(Y\) by:

\[p\left(l=1 \mid X, Y, K, g_{c}\right)=\operatorname{softmax}\left(\mathbf{M L P}\left(\left[x y ; k_{c} ; g_{c}\right]\right)\right) \text {. } \]

Goal-guided Response Generation Model

Similarly, to highlight the guidance of conversational goals, we also modify the original generation model in [31] by introducing an independent encoder for goals and an auxiliary Goal loss. As shown in Figure 4(b), our end2end generation-based model (generator) consists of four components: a Context Encoder, a Knowledge Encoder, a Goal Encoder, and a Decoder.

The reference knowledge \(k_{c}\) is obtained similar to the retrievalbased method with one notable difference that we normalize the entity with its type in the knowledge graph. For example, @star stands for all the movie stars, and @movie stands for all the movies, to improve generalization capabilities.

Given a context \(X\), conversational goal \(g_{c}\), and knowledge \(k_{c}\), our model first encodes them as vectors with the use of the above encoders (based on bi-directional GRUs).

The decoder is implemented with the Hierarchical Gated Fusion Unit (HGFU) described in [50], which is a standard GRU based decoder enhanced with external knowledge gates, with one noticeable difference that the copy mechanism [51] is introduced to enhance the accuracy of knowledge utilization. Specifically, we use HGFU to learn a dynamic dialogue strategy to generate a sketch response \(Y^{\prime}\) without real entity values, for example, generate "@star is @constellation.", instead of "Jay Chou is capricorn.", and replace the sketch tags in \(Y^{\prime}\) into real entity values based on the copy mechanism to generate the final response \(Y\).

For the generator, we jointly optimize the \(N L L\) loss, the \(B O W\) loss [52], and the Goal loss. In particular, the \(N L L\) loss computes the negative log-likelihood of the ground-truth response \(\left(L_{N L L}(\theta)\right)\). We use the \(B O W\) loss to ensure the accuracy of the fused knowledge \(k_{c}\) by enforcing the relevancy between the knowledge and the ground-truth response. Let \(w^{k}=\mathbf{M L P}\left(k_{c}\right) \in\) \(\mathcal{R}^{\left|V^{k}\right|}\), where \(\left|V^{k}\right|\) is vocabulary size, we define

\[p\left(y_{t} \mid k_{c}\right)=\frac{\exp \left(w_{y_{t}}^{k}\right)}{\sum_{v=1}^{V^{k}} \exp \left(w_{v}^{k}\right)} . \]

Then, the \(B O W\) loss is defined as follows

\[L_{B O W}(\theta)=-\frac{1}{m} \sum_{t=1}^{m} \log p\left(y_{t} \mid k_{c}\right) . \]

To ensure the relevancy between current goal and generated responses, we introduce Goal loss as follows

\[L_{G o a l}(\theta)=-\frac{1}{m} \sum_{t=1}^{m} \log p\left(y_{t} \mid g_{c}\right) . \]

In summary, we optimize the generator by minimizing the following loss:

\[L(\theta)=\alpha \cdot L_{N L L}(\theta)+\beta \cdot L_{B O W}(\theta)+\beta \cdot L_{\text {Goal }}(\theta), \]

where \(\alpha\) and \(\beta\) are hyper-parameters.

Complexity AnALYsis

In this section, we provide an analysis of the computational complexity of MGCG, including both the goal planning module and the responding module.

Complexity of goal planning. The complexity of goal planning mainly depends on two components, i.e., the hierarchical graph representation learning component and the goal sequence dependency learning component. For hierarchical graph representation learning, the complexities of graph convolution operations for each homogeneous nodes are \(O\left(l\left(\left|V^{p}\right| F^{2}+\left|E^{p}\right| F\right)\right)\), \(O\left(l\left(\left|V^{o}\right| F^{2}+\left|E^{o}\right| F\right)\right)\), and \(O\left(l\left(\left|V^{a}\right| F^{2}+\left|E^{a}\right| F\right)\right)\), respectively. The complexities of Topic Gate and Attribute Gate are \(O\left(\left|V^{p}\right|\left|V^{o}\right| F\right)\) and \(O\left(\left|V^{o}\right|\left|V^{r}\right| F\right)\). For goal sequence dependency learning, the complexity of the GRU component at each time step is \(O\left(N_{D} F^{2}\right)\). The overall complexity of goal planning is

\[\begin{aligned} O_{G P}= & O\left(s \left(l\left(|V| F^{2}+|E| F\right)+\left|V^{p}\right|\left|V^{o}\right| F\right.\right. \\ & \left.\left.+\left|V^{o} \| V^{r}\right| F+N_{D} F^{2}\right)\right), \end{aligned} \]

where \(s\) is the length of dialog sequence.

Complexity of matcher. The complexity of the matcher mainly depends on three encoder components. The complexities for c-r encoder, knowledge encoder, and goal encoder are \(O\left(s_{c r}\right)\), \(O\left(s_{k}\right)\), and \(O\left(s_{g}\right)\), respectively. The overall complexity of the generator is

\[O_{R M}=O\left(s_{c r}\right)+O\left(s_{k}\right)+O\left(s_{g}\right) \]

where \(s_{c r}, s_{k}, s_{g}\) are the lengths of context-response, knowledge, and goal, respectively.

Complexity of generator. The complexity of the generator mainly depends on two components, i.e., three encoder components and the decoder component. The complexities for context encoder, knowledge encoder, and goal encoder are \(O\left(s_{c}\right), O\left(s_{k}\right)\), and \(O\left(s_{g}\right)\), respectively. The complexity for the decoder black is \(O\left(t \cdot s_{N}\right)\). The overall complexity of generator is

\[O_{G M}=O\left(s_{c}\right)+O\left(s_{k}\right)+O\left(s_{g}\right)+O\left(t \cdot s_{V}\right), \]

where \(s_{c}, s_{k}, s_{g}, s_{V}\) are the length of dialog context, knowledge, goal, and vocabulary size, respectively. \(t\) is the number of time steps of decoding.

EXPERIMENTS

Experiment Setting

We conduct both automatic and human evaluation for our systems, MGCG_R_graph (with a retrieval based responding model) and MGCG_G_graph (with a generation based responding model), on DuRecDial. We mainly focus on (1) the overall performance of MGCG compared with six baselines using both automatic and human metrics, (2) an ablation study for each module, (3) parameter sensitivity, and (4) the framework efficiency. We split DuRecDial into train/dev/test data by randomly sampling \(65 \% / 10 \% / 25 \%\) data at the level of seekers, instead of individual dialogs.

We carefully select a few state-of-the-art models proposed for non-task conversational recommendation [13], [17], [26], [27],