Invented by Rashmi Gangadharaiah, Balakrishnan Narayanaswamy, Charles Elkan, Amazon Technologies Inc

Task-oriented dialog systems have become increasingly popular in recent years with the rise of virtual assistants such as Siri, Alexa, and Google Assistant. These systems are designed to help users complete specific tasks, such as setting reminders, ordering food, or booking a hotel room. Developing effective task-oriented dialog systems is challenging, however, because the system must understand natural language, recognize user intent, and provide accurate responses. To address these challenges, researchers have been exploring combined supervised and reinforcement learning techniques. Supervised learning trains the system on a large dataset of labeled examples, while reinforcement learning uses trial and error to optimize the system's performance.

The market for task-oriented dialog systems using combined supervised and reinforcement learning is expected to grow significantly in the coming years. According to a report by MarketsandMarkets, the global conversational AI market is expected to reach $13.9 billion by 2025, at a compound annual growth rate of 30.2% from 2020 to 2025. One of the key drivers of this growth is the increasing demand for virtual assistants across industries such as healthcare, finance, and retail. In healthcare, virtual assistants can help providers schedule appointments, answer patient questions, and provide personalized health recommendations. In finance, they can help customers manage their accounts, make payments, and get financial advice. In retail, they can help customers find products, track orders, and receive personalized recommendations.

Another factor driving this growth is the increasing availability of data and computing power. With the rise of big data and cloud computing, it is now possible to train and deploy complex machine learning models at scale, which has enabled researchers to develop more sophisticated task-oriented dialog systems that can handle a wide range of tasks and user interactions.

There are, however, several challenges in developing effective task-oriented dialog systems with combined supervised and reinforcement learning. One of the main challenges is the lack of high-quality training data, as large datasets of labeled examples can be difficult to obtain for specific tasks. Another is the need for effective evaluation metrics, since the performance of task-oriented dialog systems is hard to measure in real-world scenarios.

Despite these challenges, the market for task-oriented dialog systems using combined supervised and reinforcement learning is expected to continue growing. As virtual assistants become more ubiquitous across industries, demand will increase for more sophisticated systems that provide personalized and accurate responses to users. With continued advances in machine learning and natural language processing, significant progress in this area is likely in the near future.

The Amazon Technologies Inc. invention works as follows

Techniques for intelligent, task-oriented multi-turn dialog system automation are described. A seq2seq model can be developed using a corpus of training data and a loss function that is at least partially based on the distance from a goal. A user utterance can be input to the seq2seq model, and a nearest-neighbor algorithm can select one or several candidate responses to that user utterance. In some embodiments, the specially adapted seq2seq ML model can be trained using unsupervised training and then adapted to select intelligent, coherent agent responses that move a dialog toward its completion.
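
As a rough illustration of the selection step described above, the sketch below uses an invented bag-of-words stand-in for the seq2seq encoder and a made-up candidate pool; all names (encode_utterance, candidate_pool, nearest_responses) are hypothetical. It shows how a nearest-neighbor lookup over embeddings could pick candidate agent responses for a user utterance; the patent itself relies on a learned seq2seq model rather than this hashing trick.

```python
# Minimal sketch: nearest-neighbor selection of candidate agent responses over
# utterance embeddings. The encoder here is a toy stand-in, not the patented model.
import numpy as np

def encode_utterance(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a seq2seq encoder: hash tokens into a fixed-size unit vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Candidate agent responses (e.g., utterances seen in a training corpus).
candidate_pool = [
    "Which city would you like to stay in?",
    "What dates do you need the room for?",
    "Your reservation is confirmed.",
]
candidate_vecs = np.stack([encode_utterance(c) for c in candidate_pool])

def nearest_responses(user_utterance: str, k: int = 2) -> list:
    """Return the k candidates closest (by cosine similarity) to the user utterance."""
    q = encode_utterance(user_utterance)
    scores = candidate_vecs @ q          # cosine similarity (vectors are unit-norm)
    top = np.argsort(-scores)[:k]
    return [candidate_pool[i] for i in top]

print(nearest_responses("I need a hotel in Boston next weekend"))
```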

Background for Task-oriented Dialog Systems Using Combined Supervised and Reinforcement Learning

Conversational agents have been proposed and used for many commercial domain-specific applications. These applications can be task-oriented in the sense that they are designed to help customers/users achieve a certain goal, such as making a hotel or airline reservation. To achieve this goal, the agent must collect relevant information from the user (e.g., preferences), give the user relevant knowledge (e.g., prices and availability), and issue appropriate system calls (e.g., to make a payment) to complete the task effectively.

Chatbots, which are now ubiquitous thanks to recent advancements in speech recognition (e.g., smart speakers in the home, mobile applications, and computer programs), reach many people via speech-based services. Recently, "chit-chat" has received a lot of attention in open-ended contexts. The term refers to systems that can generate fluent responses to questions and other utterances that are reasonable within the context of a conversation. This is in contrast to the task-oriented settings discussed above, where the system guides or conducts a conversation in order to complete a specific task.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings illustrate various embodiments of the present disclosure.

FIG. 6 is a diagram of an illustrative environment in which machine learning models can be trained and hosted, in accordance with some embodiments.

FIG. 8 is a diagram that illustrates an example computer system which may be used in some embodiments.

FIG. 10 illustrates an example system for implementing various aspects in accordance with different embodiments.

The description includes various embodiments of intelligent multi-turn task-oriented dialog systems. In some embodiments, an ML model (such as a sequence-to-sequence (seq2seq) model) can be trained with a corpus (e.g., prior multi-turn, task-oriented dialogs) and a loss function that is at least partly based on the distance from a goal. The ML model can be given a user's utterance, and the output of the ML model (e.g., a vector of values from a plurality of hidden units in a seq2seq model) can be used to select a candidate response to the user's utterance. The specially adapted ML models can be trained by unsupervised learning and adapted to select intelligent agent responses that move a dialog toward completion.
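
The passage above refers to a vector of values from the hidden units of a seq2seq model. As a minimal sketch of what that could look like, the PyTorch snippet below (all sizes and names are assumptions, not taken from the patent) extracts the final GRU hidden state of an encoder as a dialog-state embedding that could then be compared against candidate responses.

```python
# Hypothetical encoder: the final GRU hidden state serves as the "vector of
# values from the hidden units" mentioned above.
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        _, h_n = self.gru(emb)                # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                 # one dialog-state embedding per example

encoder = UtteranceEncoder()
tokens = torch.randint(0, 1000, (1, 12))      # a toy tokenized user utterance
state_embedding = encoder(tokens)             # (1, 64) vector used to rank candidates
print(state_embedding.shape)
```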

Large-domain task-oriented dialog systems are widely used. Agents may be required to perform simultaneous actions, such as database queries, while also generating fluent responses in natural language. Approaches to implementing such systems may use reinforcement learning (RL) or supervised learning (SL), for example.

RL refers to a class of techniques that allow machines to learn sequential decision making from sparse and distant reward. In RL approaches to dialog, a policy is learned online via interactions with users who provide feedback. RL has the advantage that it learns models that optimize the appropriate long-term reward, in this case, fast and accurate completion of the user’s task. However, RL approaches usually require separate Natural Language Understanding (NLU) and Generation (NLG) components that are tuned separately to generate states, and also typically use predefined templates with slots and values to specify actions. Thus, the state space, the action space, and the rewards need to be carefully defined, requiring expensive human annotations or domain knowledge. Requiring domain-specific knowledge as rules or templates limits the expressive power of the models as the responses must belong to the pre-defined sets of possible responses, making the deployment of such systems difficult in the real world.
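
To make this limitation concrete, the illustrative snippet below (entirely invented, not from the patent) shows the kind of hand-crafted action templates with slots and a sparse, end-of-dialog reward that such RL approaches typically depend on; every agent response must map onto one of the predefined templates.

```python
# Illustrative only: a hand-crafted action space and a sparse, delayed reward,
# of the sort the passage says traditional RL dialog systems rely on.
ACTION_TEMPLATES = [
    "request(city)",
    "request(check_in_date)",
    "inform(price={price})",
    "book_hotel(city={city}, check_in={check_in_date})",
]

def reward(dialog_finished: bool, task_successful: bool, num_turns: int) -> float:
    """No signal until the dialog ends; then success/failure minus a per-turn penalty."""
    if not dialog_finished:
        return 0.0
    return (1.0 if task_successful else -1.0) - 0.05 * num_turns
```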

In contrast, SL-based approaches learn the dialog policy offline by looking at expert trajectories. These approaches are appealing because the dialog policy can be learned without online interaction with users. However, SL methods require many example dialogs to reach acceptable performance levels. This trade-off is reasonable for some dialog applications, such as customer service and support, where many example dialogs created by human agents already exist.
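
A minimal sketch of the SL objective described here, assuming PyTorch and toy shapes (the tensors are random placeholders for a decoder's outputs and a logged agent utterance): at each turn the predicted word distribution is matched against the true agent utterance with cross-entropy, with no notion of the task's final goal.

```python
# Toy SL training step: match the decoder's word distributions to the true
# agent utterance. No future reward or goal information is involved.
import torch
import torch.nn as nn

vocab_size = 1000
logits = torch.randn(1, 8, vocab_size, requires_grad=True)   # decoder output: (batch, turn_len, vocab)
target = torch.randint(0, vocab_size, (1, 8))                 # true agent utterance token ids

loss_fn = nn.CrossEntropyLoss()
sl_loss = loss_fn(logits.view(-1, vocab_size), target.view(-1))
sl_loss.backward()   # gradients come only from per-token utterance matching
```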

A major disadvantage of SL-based approaches is that they do not optimize for future reward. They learn to match every utterance of a training dialog, using a loss function such as the cross-entropy between the predicted word distributions and the true agent utterances. However, they do not take into account the task, the dialog history, or the final goal. This matters because dialog systems are characterized by repeated, sequential interactions. Cross-entropy loss also has a major flaw: it penalizes small changes to the order or choice of words even when the sentences are semantically the same, so the loss can differ greatly between responses that are all equally valid within the context of a conversation.
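
The toy example below (with an invented vocabulary and made-up probabilities) illustrates this flaw: a response that is a perfectly valid paraphrase of the reference agent utterance receives a much larger token-level cross-entropy than an exact match.

```python
# Toy illustration: token-level cross-entropy heavily penalizes a valid paraphrase
# because it compares word-by-word against a single reference.
import math

reference  = ["what", "date", "would", "you", "like"]
paraphrase = ["which", "date", "do", "you", "prefer"]

# Suppose the model assigns probability 0.9 to each word it actually produced,
# and only 0.02 to the reference word at positions where the two differ.
def token_ce(ref, hyp, p_match=0.9, p_miss=0.02):
    loss = 0.0
    for r, h in zip(ref, hyp):
        loss += -math.log(p_match if r == h else p_miss)
    return loss / len(ref)

print(token_ce(reference, reference))   # ~0.11: exact match, low loss
print(token_ce(reference, paraphrase))  # ~2.39: valid paraphrase, heavily penalized
```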

Embodiments disclosed in this document can use aspects of both SL-type and RL-type approaches to implement highly accurate task-oriented dialog systems. Embodiments may provide rewards for every step of the dialog based on the goal state. Embodiments are able to use SL-type techniques to learn embeddings of the dialog history at every turn, without needing additional annotation. Embodiments may add a negative reward term to the cross-entropy loss at each turn. This term measures the deviation between the predicted learned state embedding at that turn and the embedding of the dialog's final state. The final embedding can capture information such as the goal API call issued by the agent, or any other event/state that ends the dialog, and may also include information gathered from the customer during the dialog. This reward term encourages agents to respond in a way that moves the conversation in a positive direction within the latent space while also reducing cross-entropy. This does not imply that the dialog agent looks ahead to the customer's final goal at inference time; instead, the RL rewards are shaped during training so as to encourage good behavior.
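
A minimal sketch of how such a combined objective could look, assuming PyTorch and invented shapes and weights (the goal_weight value and all tensor sizes are assumptions, not taken from the patent): the usual per-turn cross-entropy is augmented with a term that measures the distance between the predicted turn embedding and the dialog's final (goal) state embedding, so minimizing the combined loss corresponds to maximizing the negative-distance reward described above.

```python
# Sketch of the combined per-turn objective: cross-entropy on the agent utterance
# plus a penalty on the distance between the turn embedding and the goal embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim, goal_weight = 1000, 64, 0.1   # goal_weight is an assumed hyperparameter

# Decoder word logits for one agent turn and the true agent utterance (toy placeholders).
logits = torch.randn(1, 8, vocab_size, requires_grad=True)
target = torch.randint(0, vocab_size, (1, 8))

# Learned dialog-state embeddings: one predicted at the current turn, one for the
# final state of the dialog (e.g., the turn where the goal API call is issued).
turn_embedding = torch.randn(1, hidden_dim, requires_grad=True)
goal_embedding = torch.randn(1, hidden_dim)

ce_loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), target.view(-1))
goal_distance = F.mse_loss(turn_embedding, goal_embedding)   # deviation from the goal state

# The "reward" is the negative of this distance, so minimizing the combined loss
# pushes each turn's embedding toward the final goal embedding.
loss = ce_loss + goal_weight * goal_distance
loss.backward()
```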
