Machine Learning in Production for Conversational AI with Rachael Tatman
Learn about the challenges involved in building and deploying conversational models, as well as what it's like to work at Kaggle
BIO

Dr. Rachael Tatman is a linguist and developer advocate at Rasa, where she helps developers build and deploy conversational applications using an open source framework. Before that she was a data scientist at Kaggle, where she was also a Kaggle Grandmaster. She also did really interesting work at the University of Washington as part of her PhD in linguistics.

TRANSCRIPT


Tell us a little bit about your experience at Kaggle.

I was at Kaggle for two and a half years, and I think most people who are familiar with Kaggle are familiar with the competitions, which are generally supervised machine learning competitions where everyone's working on the same problem with the same dataset. I never actually supported competitions directly. I worked on the dataset hosting platform; I worked on the hosted Jupyter notebook environment that Kaggle develops, which is called Kaggle Notebooks (it was called Kernels for a while because you could also have scripts); and also the forums where developers talk with each other. I also worked a bit on the Learn content. Learn is Kaggle's set of machine learning courses that are becoming more structured and fully featured over time. So I worked on all those parts of the website, and I worked a lot with the community, developing educational content and making product recommendations.

One of the things that I'm most proud of: I mentioned we had scripts, which were sort of flat hosted Python, R, or R Markdown files, and we also had notebooks, but for a while if you had a module you were working on in a script, there was no way to use that module in a notebook. So I worked with the engineering team to spec out what it would look like to have importable scripts and how you could do that.


Was Kaggle your first industry job coming out of a PhD?

Yeah, I started right after I graduated, and I was actually still applying for faculty positions as well. I was in a sort of limbo where I was working at Kaggle and also on the faculty job market for academic positions, and I found that I really enjoyed working in industry. I was already getting the things that I would have liked about a faculty position, the teaching and helping people build cool things. My preference would have been more language-y things, which is why I'm at Rasa now. I had a lot more reach and impact at Kaggle than I would have had in a university setting, so I found that really appealing.


Were there any things you had to get used to going into an industry job out of academia? I feel like that's a hard adjustment for some people, but it sounds like you really enjoyed it.

I did. I think one of the biggest changes for me, and a lot of people talk about this, is that the pace of work in academia is much, much slower, and if you've ever been in industry or academia trying to collaborate across those fields, it can be a little challenging.

For me, a bigger change was that the North Star of what success looks like would change very dramatically. Kaggle was a startup that was acquired by Google, and it still has sort of a startup mentality of iterating fairly quickly. So when I first started, my focus was on trying to increase the number of datasets on the dataset platform, and we eventually found that was happening organically, so my focus changed to helping people write Kaggle notebooks using the hosted platform, and then to growing the community and helping them grow their skills.

Whereas in academia, you know you need to publish in top tier conferences and journals, show up for the classroom teaching, and you know that in order to continue to advance, you basically need to have the highest number of high quality publications possible. So that North Star, what success looks like for you in academia, is very, very static, and in industry, at least in my experience, it has been much more variable. I wouldn't call it a North Star, I'd call it more like a planet.

In academia you can be working on a single research paper for years and years, whereas a long term project in an industry setting would be something like a quarter. That would be a big ask, at least in my experience.

[Image: Kaggle logo]


I know Kaggle helps a lot of people get into machine learning and they've made so many open data sets and increased collaboration. What do you think is the bigger goal of Kaggle?

I think the big goal of Kaggle is to really help all data scientists with their work. Since I'm no longer with the company I don't really know the near-term goals at the moment, but all of the different things that Kaggle is doing are in service of that higher goal: to help people get better at their job and then do their job successfully. That includes people who are brand new to the field, starting to try it out and taking their first steps, and also people who are really advanced in the field and want a challenge. As most professional data scientists and machine learning engineers know, you don't spend most of your time building models. That's a fairly small slice of the data science workflow, but for many people I think it's the part that drew them to data science and machine learning in the first place. So having a place where you can go and just do the part of the task that you really like, in a very challenging way, is really appealing to people. Practically, XGBoost or some sort of gradient boosted model will work for most things. It's fast, it's cheap to run, that's probably what you're going to be doing day to day, and you don't really need to get much fancier, so having a place to go and cut loose is nice as well.
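To make that "fast and cheap default" concrete, here is a minimal sketch of a gradient boosted baseline using XGBoost's scikit-learn wrapper. The dataset and hyperparameters are placeholders chosen for illustration, not anything Rachael recommends in the interview.

```python
# A minimal gradient-boosted baseline of the kind described above.
# Dataset and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```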


What is Rasa?

Rasa is a startup that has an open source conversational AI framework. Basically, it takes in text from a conversation, figures out the information that's in that text, decides what to do next, and then takes that action, whether that's a conversational turn or running some code. Then on top of that open source framework there's a free platform called Rasa X, and Rasa X lets you deploy models, have people test your models, annotate data, and fold that back in, so you get a little bit more of a human-in-the-loop learning process where you're iterating. If you're a business that wants to use these tools, we have Rasa Enterprise, which has lots of additional features. We're focused on conversational AI, so chatbots, virtual assistants, anything where you would be interacting with an automated system through a text or voice conversation rather than through a GUI or a command line.
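To make the "take in text, figure out what's in it, decide what to do next, take that action" loop concrete, here is a toy sketch in Python. None of the class or function names below come from Rasa's actual API; it's only the shape of the loop under the assumptions described above.

```python
# Toy sketch of the conversational loop described above (not Rasa's real API).
from dataclasses import dataclass, field

@dataclass
class Tracker:
    """Keeps conversation state: filled slots and past turns."""
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def understand(text: str) -> dict:
    """Stand-in NLU step: extract an intent and entities from the user's message."""
    if "rent a car" in text.lower():
        return {"intent": "rent_car", "entities": {}}
    return {"intent": "out_of_scope", "entities": {}}

def decide_next_action(parsed: dict, tracker: Tracker) -> str:
    """Stand-in dialogue policy: pick the next action given NLU output and state."""
    if parsed["intent"] == "rent_car":
        return "utter_ask_car_type"   # respond with a conversational turn
    return "action_fallback"          # or run custom code / ask to rephrase

tracker = Tracker()
parsed = understand("Hi, I'd like to rent a car")
action = decide_next_action(parsed, tracker)
tracker.history.append((parsed, action))
print(action)  # -> utter_ask_car_type
```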

[Image: Rasa logo]


What makes you excited about conversational AI?

We've probably all had bad experiences with chatbots in the past. There was definitely a period in the last couple of years when people were very excited to try the technology, and I think the industry-wide design expertise wasn't there yet. There's a study saying something like 54 percent of people have had a bad chatbot experience. But as design has really matured, I think it opens up being able to do computational tasks, anything where you need to use a computer, to a much wider variety of people.

As an example, people who aren't literate have limited ability to use GUIs and sort of have to memorize where things are, but they can, especially with voice technology, interact very naturally with whatever computational system they're using. Even just thinking of computer literacy: using a mouse is second nature if you're a technologist, but it's a learned skill. So being able to provide services and open up access to people with a variety of different abilities and backgrounds is, to me, the most appealing part of conversational AI. But it also comes with a second challenge: people who are using conversational AI come in knowing how conversations work and will always judge a conversational system against a human. This conversation is high quality, and if I were having it with a bot my mind would be blown. I don't think that's necessarily where we're at quite yet. So being able to achieve that really fluent level of conversational interaction is a really large engineering challenge and a really large machine learning challenge, and I'm really excited to be working on it.


Do you have a good example today, of a kind of conversational thing that you can actually interact with?

I don't think it's publicly available, but the most recent one I had that was really fantastic was for booking time off. I knew the dates that I wanted for my vacation, and I knew that if I was going to go through the website I'd have to do like 80 different things. I wasn't really sure how the process worked, I'd never done it before, and a co-worker of mine was like, hey, use this bot. It was a really fantastic experience and well designed. In turns where there were very few possible options, instead of having me generate text, it had buttons. Using buttons in that conversational flow made it feel much faster, and across the whole process there was a variety of different things that needed to happen. It kept track of the things that I said before, like the dates and the things that I needed. At the end, it took maybe two minutes to do what otherwise would've taken me a good half hour. So in general, I think a good conversational interaction is one where it is faster to do the thing that you need to do than it would be otherwise.


I get the feeling that NLP has made some huge improvements in the last year or two. Are these things already deployed in conversational agents or is there more work to do to make them production ready?

We're open source, so if you want to use something else you'd be welcome to, but we use contextual embeddings, specifically conversationally trained contextual embeddings. If you wanted to use BERT instead, you're definitely welcome to.

A lot of the more recent work that has been a little more headline-grabbing has been around natural language generation, the GPT-2 stuff and Meena, which is a Google project that came out relatively recently. I think natural language generation is much trickier to get right. The default setup for Rasa is that you have a limited number of utterances that your bot can say. You might have some slot filling; you might say, "Hello Lukas, I see that you recently went to Vienna, do you want to rate your hotel?" The tricky thing about a lot of the natural language generation is that it sounds very fluent, right? It sounds like something a human conceivably could say, which is very exciting and no small feat, but it's not grounded. If you look through the GPT-2 text examples, there was one where, like, "these scientists discovered unicorns." Scientists have not discovered unicorns. It doesn't have ties to any sort of knowledge base that is the ground truth that the text is being generated around.
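As a contrast with free-form generation, here is a toy example of the kind of controlled, slot-filled response described above. The template and slot names are invented for illustration and are not Rasa's actual response syntax.

```python
# Toy example: a controlled, templated response with slot filling,
# as opposed to letting a language model generate text freely.
# Template and slot names are illustrative, not Rasa's actual format.
RESPONSE_TEMPLATES = {
    "utter_ask_hotel_rating": (
        "Hello {name}, I see that you recently went to {city}, "
        "do you want to rate your hotel?"
    ),
}

def render(template_name: str, slots: dict) -> str:
    """Fill slot values into a fixed, human-written template."""
    return RESPONSE_TEMPLATES[template_name].format(**slots)

print(render("utter_ask_hotel_rating", {"name": "Lukas", "city": "Vienna"}))
# -> Hello Lukas, I see that you recently went to Vienna, do you want to rate your hotel?
```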

I think my worry is that especially people without a deeper understanding of NLP will see these very fluent text productions and be like, "oh, I don't have to build a bot, I can just pipe the user input into this predictor that will come up with some sort of text that I should say back," and there's nothing to stop it from being completely unfactual, from being like, "Yes, of course. We'll give you a full refund on your house." There's nothing to stop it from being abusive. With most of the large language models there are certain sorts of adversarial texts you can use, small text strings, ten to twelve characters, that will cause them to produce really vile, abusive output. That's obviously not something you want to be showing to people, hopefully obviously. So let me put it this way: I would not be comfortable building a completely neural natural language generation conversational assistant. I definitely would want to have more control over utterances.


So the way Rasa works is it figures out an intention and then it sort of fills in slots, is that right?

Yeah, the current approach is to do entity extraction and then intent identification, where you have a set number of intents that you've provided training data for. Going forward we're combining intents and entities together, so instead of having intent classification as a single, separate part of the pipeline, it's a little bit more tied into the rest of the NLU that's going on. That's research we're working on.


What are the core technical challenges to make and deploy a chatbot?

The first hurdle is the same one you'll hit with any machine learning project, which is getting the data. There's a sort of older-school approach, which is to build a state machine. The person said "hi", so we're going to say "hi". The person wants to know how to rent a car, or the person wants to know how to find the dealerships, so we're going to go down that path. OK, they wanted a car. What type of car did they want? So it's sort of like a decision tree, but for a dialogue agent. One of the big challenges is when people are like, OK, I want a car, actually, where is the dealership? Being able to recover from someone interleaving other types of intents into the happy path that you've constructed is fairly challenging for that sort of state-machine-based approach.

So one thing that we've done at Rasa is we have an attention model. Instead of going straight through the tree, you start with sample stories, dialogues that someone might have. Then when it comes time to pick the next turn, if you have an exact example of the specific conversational flow that you've seen before, you just continue on that flow, because you've seen it before and you're pretty sure it's right. If you aren't sure what the intent or the entity or any of the required information is, then you have a fallback, like "Could you rephrase that?" or "I'm not entirely sure what you're asking for, but here are the closest results," or that sort of thing. Then there's also a machine learning based policy that ranks possible responses and says, OK, I think this one's probably the correct next one. Those are all considered, and if there's one that's highly likely then that's the one you go with, and if there isn't, you go back to the fallback policy. So it can handle these sorts of interleaved structures in conversations in a way that a more rigid state machine cannot.
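Here is a rough sketch of the older-school, state-machine approach being contrasted above. The states, intents, and transitions are made up; the point is that any user turn that doesn't match the expected transition at the current state falls off the happy path, which is exactly the recovery problem described.

```python
# Toy state-machine dialogue (the older-school approach described above).
# States and intents are illustrative. Note how an interleaved question like
# "actually, where is the dealership?" doesn't fit the expected transition.
TRANSITIONS = {
    ("start", "greet"): "ask_task",
    ("ask_task", "rent_car"): "ask_car_type",
    ("ask_task", "find_dealership"): "give_dealership_info",
    ("ask_car_type", "inform_car_type"): "confirm_booking",
}

def step(state: str, intent: str) -> str:
    next_state = TRANSITIONS.get((state, intent))
    if next_state is None:
        # The user interleaved something the tree didn't anticipate;
        # a rigid state machine has no good way to recover here.
        return "fallback"
    return next_state

print(step("start", "greet"))                  # -> ask_task
print(step("ask_task", "rent_car"))            # -> ask_car_type
print(step("ask_car_type", "find_dealership")) # -> fallback (off the happy path)
```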


So it sounds like the first thing is essentially a state machine, a kind of rule-based system.

Yes. That's not what underlies Rasa, but it's a very common approach to building conversational systems.


What's the first Rasa approach?

We have a variety of policies that are all considered together, and the one that has the highest confidence is usually the one that's used. You can think of a policy as a multi-class classification across all of the possible responses; it selects the one that's most likely based on the training data.
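Conceptually, each turn is a classifier producing a confidence over the possible next responses, with a fallback when nothing is confident enough. The snippet below is a hand-rolled sketch of that idea, not Rasa's implementation; the action names, scores, and threshold are invented.

```python
# Sketch of "pick the next action as a multi-class classification":
# score every candidate response, take the argmax if it's confident enough,
# otherwise fall back. Scores here are hard-coded stand-ins for model output.
import numpy as np

CANDIDATE_ACTIONS = ["utter_greet", "utter_ask_car_type", "utter_goodbye", "action_query_cars"]

def choose_action(confidences: np.ndarray, threshold: float = 0.5) -> str:
    """Return the highest-confidence action, or a fallback if none is likely enough."""
    best = int(np.argmax(confidences))
    if confidences[best] < threshold:
        return "action_fallback"  # e.g. "Could you rephrase that?"
    return CANDIDATE_ACTIONS[best]

print(choose_action(np.array([0.05, 0.80, 0.05, 0.10])))  # -> utter_ask_car_type
print(choose_action(np.array([0.30, 0.25, 0.25, 0.20])))  # -> action_fallback
```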


So the training would be like an utterance or like a conversation and an intent?

There are two types of training data. One is example intents and example entities, things that the user would say. These you might just sit down and come up with, or you might collect them from FAQs you've gotten. That's more on the NLU side. On the other side, to determine the dialogue policy, what gets said next, you'll have examples of conversations. So you'll probably have the one that's the ideal: the person says hi, they want to know what car to rent, I'm going to query the database and get the available cars and then tell them the available cars, and all of that. Then you might have other turns, other possible stories. Someone's like, hey, are you a bot? And you're like, yeah, I'm a bot. And they're like, I would talk to a human, or whatever, and then it helps them out with the thing that they need. Those stories and those example utterances together are used to train the model. And you don't need that much if you're using a language for which we have pre-trained embeddings. You don't need that much data to get started, and the idea is you build a minimally viable assistant, then you deploy it and have people have test conversations with it. You go back, you annotate those, you put those back into the training data, you retrain, and you continue on. So you could add additional stories, you could add new intents, you could add new utterances; you can change your model to fit the actual conversations that you see.
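To give a feel for those two kinds of training data, here is a small sketch laid out as Python data structures. Rasa projects declare this in their own training data files; the intents, entities, and stories below are invented for illustration only.

```python
# Illustrative layout of the two kinds of training data described above
# (NLU examples on one side, example conversations / stories on the other).
# Rasa has its own file formats for this; these dicts are just a sketch.
nlu_examples = {
    "greet": ["hi", "hello there", "hey"],
    "rent_car": [
        "I want to rent a car",
        "I need a vehicle for next week",   # entity: [next week](date)
        "can I book a car in Berlin",       # entity: [Berlin](city)
    ],
    "ask_if_bot": ["are you a bot?", "am I talking to a human?"],
}

stories = [
    # The "happy path": greet, ask for a car, query the database, answer.
    ["greet -> utter_greet",
     "rent_car -> action_query_available_cars",
     "inform_car_type -> utter_confirm_booking"],
    # A side path: the user asks whether they're talking to a bot mid-task.
    ["greet -> utter_greet",
     "ask_if_bot -> utter_i_am_a_bot",
     "rent_car -> action_query_available_cars"],
]
```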

[Image: Rasa chatbot]

Wow, that's so cool. How much training data do you think you need before you get something reasonable?

For some of the examples that I've worked on, you'll probably need 10 to 20 examples per intent and then maybe three or four stories. We're using pre-trained embeddings, so you know that "I want a car" and "I want a vehicle" are going to be similar, because the embeddings for car and vehicle are similar. So you have that sort of fuzziness in machine learning to help you out with it.
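The "car" versus "vehicle" point is essentially similarity between pre-trained word vectors. Here is a minimal sketch using spaCy's medium English model; any source of pre-trained embeddings would do, and the specific model name is only an example.

```python
# Minimal sketch: pre-trained embeddings place "car" and "vehicle" nearby,
# which is what lets a small amount of NLU training data generalize.
# Requires a model with vectors, e.g.: python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
car, vehicle, banana = nlp("car"), nlp("vehicle"), nlp("banana")

print("car ~ vehicle:", car.similarity(vehicle))  # relatively high
print("car ~ banana: ", car.similarity(banana))   # much lower
```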


I know that you write a lot about papers that you've read. A really common question I get from folks is how to approach papers. How do I find what papers to read? How do I even go about reading a paper? I would think you'd have some smart advice on that.

Yeah, don't go to arXiv every day, you'll just make yourself upset. I don't try to stay on top of things right after they're written. How I come to know that there's a paper I want to read is that I'm very active on Twitter, and a lot of people will be talking about a specific paper, either because they like it or because they're not a fan of it. Either way, once I have enough people that I trust saying, "hey, this is an important paper for one reason or another," I'll set aside time to read it. First of all, if it's a really seminal paper, like the transformer paper, oftentimes there will be a blog post or a talk that someone has done, and you can read that instead of the paper and get the same information. If there isn't, I start by reading the abstract, and then I like to read the introduction and then the conclusion, so I have a good general idea of what's going on with the paper. After that I start from the top, with the related work section or the literature review section (sometimes it's at the end).

I wouldn't go and chase down all of the terms that might be new to you right away. Just sort of skim that section, go to the methods, and if you see terms there that are repeated, that you saw previously and that look like they're going to be used a lot in the paper, then go and look those up if you're not familiar with them. When you get to math, like an equation, my strategy is always to try to take it and put it into human words, the way that I would say it. In the process of doing that, I'm usually like, oh, I don't actually know what this term is. Can I figure out what this term is from other places in the text? For me, that's the part that takes the longest in reading a paper, which I think is true for most people, unless it's very similar to a field you already work in and you're very familiar with the bones of the equation. Continuing through, I always skim the results, because it's usually like, oh look, here are our results, here are all these other people's results, we got state of the art, huzzah. Unless there's something very specific you're interested in.

Then I'll pay more attention to the ablation results if they have any ablations. Ablations are when you have a full model and you start to take parts out of it and see what changes what. I find those really useful, particularly if you're a practitioner, if you're thinking, oh, maybe I want to implement this, but that's a lot of layers, maybe I want a couple fewer layers. Figuring out what you can get rid of that may be practical in an academic setting but not in a production setting is really helpful. There's also a subset of conferences, like ICLR, where the reviews are made public. So it might be helpful for you to go back and also read the reviews of the paper if it's something where you're like, do I want to put this in production? What are other people saying? So the more you care about the paper, the further down that list you go.


What's a paper that you've gone pretty deep in recently?

I'm about 50 percent of the way through that list for the ConveRT paper, which I mentioned earlier. That's a paper we've implemented in Rasa; it's Henderson et al., 2019. It's a transformer embedding architecture specifically for conversational data, which is obviously very relevant to us.


I was looking at your papers from grad school and I saw you had a bunch of papers on Twitter and gender, and I remember being super interested in that. I'm curious if you have any favorite paper or any favorite result that you want to talk about.

I got a lot of traction with my papers looking at automatic speech recognition accuracy across different demographic groups. I had two papers, one at ACL and then one at Interspeech. ACL is an NLP conference, Interspeech is a speech conference. The ACL paper was on gender and region, and I found differences for both of them, but that was using user-uploaded YouTube videos, and I was guessing at gender in a way that wasn't particularly, I would say, ecologically valid, so not the absolute best methodology. When I repeated the experiment with additional APIs and higher quality audio, where the signal-to-noise ratio was controlled, so basically recorded in a quiet environment with high quality microphones, I found that the gender differences disappeared, and for this one I did have self-reported gender from users. The regional differences were still there, and this time I also had access to race and ethnicity data and found a really strong difference. So when signal-to-noise ratio is controlled, I didn't find that the gender difference held, but there was a really strong difference in accuracy for people from different geographic regions. The sort of general American prestige dialect, educated, upper middle class, had the best recognition rates. Any other regional dialect had lower recognition rates.

It's interesting because in England there is a very specific set of pronunciation rules considered the standard, received pronunciation or RP, and anything else is considered a regional dialect. In the United States you can have quite a bit of variation in pronunciation and still be considered a general American speaker, and it seems to be more the lexical items and grammatical features that make you sound "not accented", though of course everybody has an accent. So I would say it's a variety that's less internally consistent as a dialect than a lot of other dialects, like California English, for example.


It sounds like you basically found the quality got worse with any variation from the general American speaker?

General American, or standard American English, the variety where speakers are consciously avoiding regional forms. Both regional speakers and also African-American speakers had much higher error rates, and that's not due to African-American English being any less internally consistent or harder to recognize. It's almost certainly due to imbalances in the training data, so your classic class imbalance problem.


Did you have recommendations on how to deal with this?

Yeah. So the gender thing is real. There is a gender difference, but it's more on the signal processing side and less on the automatic speech recognition, machine learning modeling side. A couple of things: one is that women in general tend to be slightly smaller, tend to have slightly lower lung capacity, and tend to be slightly quieter. So for an equal decibel level of noise, you'll tend to have a little bit less signal. Also, when we were developing telephony and recording in general, the frequency band that was picked as the target band for basically all systems (and a lot of speech recognition comes directly from Bell Labs and the telephone work of the earlier days) was picked to suit a male voice and not any of the other types of voices you might encounter. So children also tend to have really high recognition error rates. Partially that's because children vary more as they're learning the language; partially, it's just because their frequency range isn't represented as well.


But I guess downstream then the error is higher?

Definitely the regional and racial differences are due to things that you could fix with machine learning, whereas I think the other differences are not.


How were you able to pull that apart?

I believe I had fairly balanced classes, and specifically on the modeling side I used mixed effects models, so you can control for some features while identifying the effects of others.
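For readers unfamiliar with mixed effects models, here is a hedged sketch of that kind of fit using statsmodels: fixed effects for the demographic variables of interest and a random intercept per speaker. The column names and synthetic data below are hypothetical, not from the actual study.

```python
# Hedged sketch of a mixed effects model of the kind described above.
# Column names and data are synthetic, purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "wer": rng.normal(0.2, 0.05, n),  # word error rate per recording
    "dialect_region": rng.choice(["general_american", "southern", "new_england"], n),
    "gender": rng.choice(["female", "male"], n),
    "speaker_id": rng.choice([f"spk_{i}" for i in range(40)], n),
})

# Fixed effects for region and gender, random intercept per speaker.
model = smf.mixedlm("wer ~ dialect_region + gender", df, groups=df["speaker_id"])
result = model.fit()
print(result.summary())
```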


What do you think is an underrated aspect of machine learning that you think people should pay more attention to?

Data visualization. I've seen a lot of really excellent machine learning engineers who have a hard time communicating their results and models because their charts are just unreadable.


Does anything come to mind where you saw a really good visualization and something you want to call out as an excellent example?

If you want to see some masterclasses in visualization, check out The Pudding, which is a data journalism web magazine with really stunning visualizations that push the limits of the art. As for data visualization, one of my biggest pet peeves is that you should not use lines to connect points unless there is a logical reason to do it, like it's a time series or something could actually exist in the space between the two points. Don't do it for categories; it drives me up the wall.
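That pet peeve is easy to show. The sketch below plots the same made-up categorical counts twice: once as a connected line, which wrongly implies values exist between the categories, and once as a bar chart.

```python
# Made-up categorical data: connecting the points with a line (left) implies
# values exist "between" categories; a bar chart (right) does not.
import matplotlib.pyplot as plt

categories = ["cats", "dogs", "birds", "fish"]
counts = [23, 41, 12, 30]

fig, (bad, good) = plt.subplots(1, 2, figsize=(8, 3))

bad.plot(categories, counts, marker="o")   # don't: line implies continuity
bad.set_title("Don't: line over categories")

good.bar(categories, counts)               # do: bars for categorical data
good.set_title("Do: bar chart")

plt.tight_layout()
plt.show()
```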


One question. How do you feel about 3D visualizations?

Are you in AR? Are you presenting them in a way that people can walk around them? I'm not against them in general, I just find them harder to parse. There's a whole field of study that specifically looks at how humans process information and what is most useful for conveying different types of information visually. Please, please read some visualization papers, y'all.


What do you think is the biggest challenge for making machine learning work in the real world right now?

It's not really a technical challenge, but I think the biggest thing that trips people up is deciding to build things that don't need to, or shouldn't, exist. I understand it is a very exciting time to be in machine learning, and we all want to work on fun projects and change the world. But particularly if you are building anything that deals with a sensitive or vulnerable community, I would highly encourage you to reach out to people from that community and work with them, to make sure that it's something that does need to exist and that you are building it in such a way that it's actually useful. An example that comes to mind: there have been a number of projects built by people who are not deaf and who are not signers to help deaf people communicate. Usually they take the form of gloves or computer vision to take sign language and turn it into a different language. That's not usually the problem in communication between speaking and deaf people. Usually it's that the speaking person does not have a very good way of communicating their intent. In general, deaf people are masterful communicators and do not struggle to make themselves understood. So that's just an example of, I get it, I understand that it's an exciting project and you are very passionate about it, but before you spend a lot of time and money and resources building something, make sure people want it.


Check out Rachael's website and follow her on Twitter here.