Case Study

Virtual Assistant Manual

by Jay Goluguri and Ryan Wheeler

4 minutes

owners manual
deep-learning
tensorflow
pytorch
kubernetes
nlp

A Toyota owner's manual is over 500 pages. Imagine needing to quickly learn how to open the hood or find information about Toyota Safety Sense. With the goal of crafting a more engaging experience, we set out to develop a novel AI-powered voice interface: a user simply asks a question in natural language and, in turn, guided answers are displayed on the head unit.

At Toyota Connected, we developed a scalable, cloud-agnostic, microservice-based application running on Kubernetes, powered by deep learning models written in TensorFlow.

Deep Learning - Vector Representations

We experimented with a broad range of algorithms and techniques for question answering using a variety of semi-supervised and supervised learning methods. One key aspect of language modeling is a sufficiently dense vector representation that is capable of robustly encoding the syntactic and semantic relationships of Toyota domain-specific words. Pre-trained embeddings exist, ranging from public GloVe to Word2Vec embeddings, but these suffer from a high rate of out-of-vocabulary (OOV) words; for example, eTune is not present in the largest embedding lookup files trained on Google News or Wikipedia. (A quick aside: we highly favor transfer learning approaches that do not suffer from word-based OOV and prefer using the latest state-of-the-art temporal encoders.)

A first pass was to scrape all of the owner's manuals on the Toyota website, as well as 20 GB of Reddit comment data from all car/auto-related subreddits, and then train our own robust embedding space. Training weakly supervised language models frames an optimization problem in which an algorithm learns to predict surrounding words in a given context; in this space we heavily used Transformer-based approaches with subword masking, such as BERT, as well as LSTM+Attention based models. Once these language models are trained, semantically similar words can be found by taking the cosine similarity between a set of vectors. In our case, the head unit software known as eTune shows high semantic correlation with words like music, head-unit, Bluetooth, etc. One idea we hope to revisit in embedding representations is exploring Poincare Embeddings (https://arxiv.org/pdf/1705.080...) for ontological relationships.
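The "semantically similar words via cosine similarity" step above can be sketched in a few lines. The vectors and vocabulary below are toy stand-ins, not our real learned embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors standing in for learned embeddings;
# the values are illustrative assumptions only.
embeddings = {
    "etune":     np.array([0.9, 0.8, 0.1, 0.0]),
    "bluetooth": np.array([0.8, 0.9, 0.2, 0.1]),
    "mud":       np.array([0.0, 0.1, 0.9, 0.8]),
}

# A head-unit term should score closer to "bluetooth" than to "mud".
sim_related = cosine_similarity(embeddings["etune"], embeddings["bluetooth"])
sim_unrelated = cosine_similarity(embeddings["etune"], embeddings["mud"])
```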

For downstream learning tasks, we found that the most robust feature representation came from concatenating a variety of dense feature vectors: embeddings trained on a large public corpus (integrated into our deep learning models using TensorFlow Hub) together with domain-specific embeddings.
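A minimal sketch of that concatenation, assuming toy lookup tables (the dimensions, 3 public and 2 domain, and the vectors themselves are invented for illustration):

```python
import numpy as np

# Toy lookup tables; in practice the public side comes from a pre-trained
# corpus and the domain side from our own training, not literal dicts.
public_emb = {"music": np.array([0.2, 0.1, 0.4])}
domain_emb = {"music": np.array([0.9, 0.3]), "etune": np.array([0.8, 0.7])}

def combined_vector(word):
    """Concatenate the public pre-trained vector with the domain-specific
    vector, falling back to zeros when a lookup is out of vocabulary."""
    pub = public_emb.get(word, np.zeros(3))
    dom = domain_emb.get(word, np.zeros(2))
    return np.concatenate([pub, dom])

# "etune" is OOV in the public table but covered by the domain table,
# so the combined vector still carries a useful signal.
vec = combined_vector("etune")
```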

Supervised Language Modeling - Question Answering

Our Machine Learning engineers approached the question answering task using two high-level techniques. The first frames the problem in the following manner.

Given a question: How do I disable the VSC

And given a general paragraph about the VSC:

The VSC OFF switch can be used to help free a stuck vehicle in surroundings like mud, dirt or snow. While the vehicle is stopped, press switch to disable the TRAC system. To disable both VSC and TRAC systems, press and hold the switch for at least 3 seconds

Predict where in the above passage the answer occurs. More specifically, our task is to predict the indices of the start and end tokens, which would yield the answer:

To disable both VSC and TRAC systems, press and hold the switch for at least 3 seconds.
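Concretely, the prediction target looks like a pair of token indices into the passage. In the toy sketch below the indices are hard-coded to show the target format; a trained model would emit them (and real systems use subword tokenizers rather than whitespace splitting):

```python
# The passage from the manual, tokenized by whitespace for illustration.
passage = ("The VSC OFF switch can be used to help free a stuck vehicle in "
           "surroundings like mud, dirt or snow. While the vehicle is stopped, "
           "press switch to disable the TRAC system. To disable both VSC and "
           "TRAC systems, press and hold the switch for at least 3 seconds")
tokens = passage.split()

# A trained model would predict these two indices; here they are
# hard-coded (hypothetical model output) to show the target format.
start = tokens.index("To")      # first capitalized "To", start of the answer
end = len(tokens) - 1           # "seconds", end of the answer

answer = " ".join(tokens[start:end + 1])
```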

If this looks familiar, it is very similar to the SQuAD machine comprehension challenge. At the time this code was written, we implemented a modified version of the R-NET (https://www.microsoft.com/en-u...) algorithm and experimented with a few similar approaches, using the SQuAD dataset to train an algorithm to recognize answers to questions given a sufficient context. All approaches made heavy use of various forms of attention mechanisms, which allow deep learning models to learn which areas of long texts (represented as LSTM hidden states) are most important.
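As a rough sketch of the core mechanism, here is scaled dot-product attention in NumPy. This is one common form, used as an illustrative stand-in for the various attention variants mentioned above, with toy numbers rather than real model activations:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value vector by how well its key matches the query:
    the model learns to focus on the most relevant positions in a long text."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to every key
    # Numerically stable softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# One query attending over three positions (toy 2-d keys, 1-d values).
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
context, weights = scaled_dot_product_attention(Q, K, V)
```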

Paragraph Identification

While the above approach worked once we had sufficiently narrowed down the correct context, the next step was to retrieve the best context or document from the user manual dataset. There are a variety of ways to do this, and a robust body of literature describes various information retrieval techniques. (When discussing this problem in its early phase, I remember thinking, "why don't we just throw this into Elasticsearch and index it?") Treating document retrieval as a supervised learning problem (both classification and learning-to-rank), we presented crowdsource annotators with a section of the owner's manual and asked, "What questions could this passage answer?" Each section of the manual was treated as a separate class, and we could either retrieve the associated context and display it on the screen, or use it as the context for the more refined machine comprehension algorithm. Here we again used a variety of multi-objective Transformer-based models with BERT embeddings, combined with the domain-specific embeddings mentioned above.
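To make "retrieve the best section" concrete, here is a bag-of-words baseline in the spirit of the Elasticsearch quip above: score every section against the question and return the best class. The section ids and texts are invented, and this is a baseline sketch, not our production Transformer models:

```python
import math
from collections import Counter

# Hypothetical manual sections; each section id acts as a "class".
sections = {
    "vsc_off":   "VSC OFF switch disable TRAC system stuck vehicle mud snow",
    "bluetooth": "pair phone bluetooth audio head unit music streaming",
    "hood":      "open hood release lever pull latch engine compartment",
}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question):
    """Return the section id whose word counts best match the question."""
    q = Counter(question.lower().split())
    return max(sections,
               key=lambda s: cosine(q, Counter(sections[s].lower().split())))

best = retrieve("How do I disable the VSC")
```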

Deep Learning in Kubernetes

Our goal at the outset was not simply to explore an algorithm and do research, but to put models into production. We make heavy use of TensorFlow Serving in Kubernetes running on GPUs, and have also written our own model server in Rust. Our Machine Learning engineers are responsible for every step of the modeling process, from data generation, to translating papers into TensorFlow/PyTorch code, to ensuring the models scale to hundreds of requests per second.
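For a flavor of how a client talks to a served model, TensorFlow Serving exposes a REST predict endpoint of the form `http://<host>:8501/v1/models/<model>:predict` taking a JSON body with an `instances` list. The host, model name, and feature layout below are illustrative assumptions, not our actual service:

```python
import json

def predict_request(model_name, token_ids):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call.
    The service host "tf-serving" and the "token_ids" feature name are
    hypothetical; real values depend on the deployment and model signature."""
    url = f"http://tf-serving:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [{"token_ids": token_ids}]})
    return url, body

url, body = predict_request("qa_model", [101, 2129, 2079, 102])
```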

    Join our Team

    We build technology that empowers people to move, and makes their lives easier and safer at Toyota Connected.