Policy learning for task-oriented dialogue systems via reinforcement learning techniques
Affiliation: Computing and Information Systems
Document Type: Masters Research thesis
Access Status: Open Access
© 2018 Chuandong Yin
Task-oriented dialogue systems such as Apple Siri and Microsoft Cortana are becoming increasingly popular and are attracting much attention. Task-oriented dialogue systems aim to serve people as virtual personal assistants. For example, they can help users create calendar events or look up the weather through conversation, improving work efficiency and convenience. Nevertheless, task-oriented dialogue systems are still in their infancy and face many problems in dialogue policy learning (i.e., learning the next response given a user input and a dialogue context). Most existing work uses supervised learning techniques to learn dialogue policies. However, supervised learning models, especially deep learning models, are usually data-hungry and need a large number of training dialogues. Obtaining such a large number of training dialogues is difficult, since intensive labeling effort is needed to collect them for a given task domain. Moreover, it is also difficult to measure the quality (correctness) of a training dialogue, or of each response in a dialogue, while a supervised learning method needs such information as the training signal to guide policy learning.

To overcome these shortcomings, we take a reinforcement learning based approach to policy learning in task-oriented dialogue systems. In the reinforcement learning paradigm, a user simulator (i.e., the environment) is introduced to mimic various user behaviors and to interact with an agent (i.e., a task-oriented dialogue system) that needs to be trained. Through simulated interactions with the user simulator, the agent is able to learn how to make correct responses using a small number of training dialogues, and eventually becomes well trained enough to serve real users. We identify two limitations of the reinforcement learning based approach and offer solutions to overcome them in this study.

First, existing reinforcement learning based training procedures have not taken into account the context uncertainties of ambiguous user inputs when learning dialogue policies. In task-oriented dialogue systems, user inputs are first transformed into a series of ⟨entity name, entity value, confidence⟩ tuples by the automatic speech recognition (ASR) and natural language understanding (NLU) components. Here, entity name represents an attribute that makes up a target task (e.g., cuisine type in a restaurant booking task) and entity value represents its value (e.g., “Korean”). The confidence field indicates how confident the ASR and NLU components are in recognizing the entity name and entity value from the user input. For ambiguous user input (e.g., due to environmental noise or user accents), the confidence may be low. A low confidence value triggers a “confirmation” response from the task-oriented dialogue system, which asks the user to confirm whether the recognized entity name and value are the intended ones. Existing work uses a fixed confidence threshold to trigger confirmation responses. However, the confidence threshold should vary from person to person. For users with accents, even when the entity values are correctly recognized, the confidence is generally lower than for users without accents. A universal confidence threshold may therefore lead to many rounds of unnecessary confirmation and lengthy dialogues, which degrades the user experience. To address this issue, we propose to learn a dynamic threshold based on the dialogue context.
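To make the confirmation mechanism concrete, here is a minimal sketch contrasting a fixed confirmation threshold with a context-dependent one. The class, function names, and numeric thresholds are illustrative assumptions for this sketch, not the model proposed in the thesis.

```python
from dataclasses import dataclass

# Hypothetical illustration of the confirmation-trigger mechanism described
# above; all names and threshold values are assumptions for this sketch.

@dataclass
class SlotHypothesis:
    entity_name: str   # e.g. "cuisine"
    entity_value: str  # e.g. "Korean"
    confidence: float  # ASR/NLU confidence in [0, 1]

FIXED_THRESHOLD = 0.7  # a single universal threshold, as in prior work

def needs_confirmation_fixed(slot: SlotHypothesis) -> bool:
    """Prior approach: one fixed threshold for every user and context."""
    return slot.confidence < FIXED_THRESHOLD

def needs_confirmation_dynamic(slot: SlotHypothesis, context_threshold: float) -> bool:
    """Proposed idea: the threshold is predicted from the dialogue context,
    e.g. lower for accented users whose confidences are systematically low."""
    return slot.confidence < context_threshold

# An accented user whose value is correctly recognized but low-confidence:
slot = SlotHypothesis("cuisine", "Korean", confidence=0.62)
print(needs_confirmation_fixed(slot))          # True  -> unnecessary confirmation
print(needs_confirmation_dynamic(slot, 0.55))  # False -> dialogue proceeds
```

One natural way to learn such a threshold, in the spirit of parameterized-action reinforcement learning, is to treat it as a continuous parameter attached to a discrete confirmation action; the sketch above illustrates only the triggering logic itself.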
Learning a dynamic threshold, however, is challenging because the response (action) space is very large: each different response of the task-oriented dialogue system may require a different confidence threshold. Finding an optimal dynamic threshold that suits the entire response space requires a large number of simulation steps, so it is difficult for reinforcement learning models to fully explore the response space. We therefore propose a parameterized auxiliary asynchronous advantage actor-critic (PA4C) model to solve this problem. PA4C utilizes the action parameterization technique to reduce the size of the response space and introduces auxiliary tasks to explore responses efficiently, which helps learn the dynamic threshold and improves task completion rates. We evaluate PA4C on a calendar event creation task, and the experimental results show that PA4C outperforms the state-of-the-art baselines by over 10% in completion rate, dialogue length, and episode reward.

Second, existing task-oriented dialogue systems such as Apple Siri can only handle short dialogues that can be completed in a few turns, such as looking up the weather. As a dialogue gets longer, learning the optimal policy at each turn becomes increasingly difficult. This is because we can only obtain a global reward at the end of a training dialogue, indicating whether the task has been completed or not; there are no local rewards to evaluate the quality of the response at each intermediate dialogue turn. Propagating the global reward back to each dialogue turn to optimize the turn-by-turn policy is challenging. This is known as the sparse rewards problem in reinforcement learning, and it may trap reinforcement learning models in local optima. To address this issue, we propose a new reinforcement learning model named PGGAN that provides a local reward for each dialogue turn by combining policy gradient (PG) methods and generative adversarial networks (GANs). In particular, we train a discriminator to evaluate the similarity between a response produced by the dialogue agent and a human response. This similarity score is then used as the local reward to optimize the dialogue agent (i.e., the generator) at each intermediate dialogue turn. In this way, the sparse rewards problem can be solved and the dialogue agent can handle long dialogues. We evaluate PGGAN on a public dialogue dataset, and the experimental results show that PGGAN significantly outperforms the state-of-the-art baselines by up to 25% in completion rate.
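As a rough illustration of the local-reward idea, the sketch below scores each turn's response with a discriminator and plugs that score into a REINFORCE-style policy-gradient loss. The feature dimension, the two-layer discriminator, and the helper functions are assumptions made for this sketch; the thesis's actual PGGAN architecture and training procedure are not reproduced here.

```python
import torch
import torch.nn as nn

# Minimal sketch of a GAN-derived local reward, assuming each dialogue
# response is encoded as a feature vector of size FEAT_DIM. The architecture
# and shapes below are illustrative assumptions.

FEAT_DIM = 128

discriminator = nn.Sequential(   # D: response features -> P(response is human)
    nn.Linear(FEAT_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

def local_reward(response_feats: torch.Tensor) -> torch.Tensor:
    """Use D's similarity-to-human score as the per-turn (local) reward,
    so every intermediate turn gets a training signal instead of only the
    global task-completion reward at the end of the dialogue."""
    with torch.no_grad():  # rewards are not differentiated through D
        return discriminator(response_feats).squeeze(-1)

def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE objective: weight each turn's log-probability by its local
    reward (a baseline/advantage term is omitted for brevity)."""
    return -(log_probs * rewards).sum()

# One simulated dialogue of 5 turns:
turn_feats = torch.randn(5, FEAT_DIM)  # generator's response features per turn
turn_log_probs = torch.randn(5)        # log pi(a_t | s_t) from the dialogue agent
loss = policy_gradient_loss(turn_log_probs, local_reward(turn_feats))
```

The key design point is that the discriminator's score replaces the missing per-turn reward, giving the policy-gradient update a dense signal across long dialogues.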
- Click on "Export Reference in RIS Format" and choose "open with... Endnote".
- Click on "Export Reference in RIS Format". Login to Refworks, go to References => Import References