We’ve all heard that we should be testing our Machine Learning models in production and getting good test coverage. This talk gives an overview of four types of tests: unit, integration, regression, and parametrized tests, via examples in pytest. You’ll write more robust code in no time; no prior testing experience necessary.
Dr. Irina Kukuyeva is a Principal at Kukuyeva Consulting, advising companies on data strategy, collection, implementation, and hiring. She has 10+ years of experience collaborating with technical and business/clinical stakeholders, developing and productionizing machine learning models for start-ups and large companies in healthcare, IoT, fashion, hospitality, market research, and online advertising. Irina holds a Ph.D. in Statistics from UCLA, where she developed a novel image compression algorithm for the Hubble Space Telescope’s images of Jupiter. In her spare time, she enjoys mentoring adults who want to get into tech.
Before we get started, I wanted to give you a little bit of my background. I've been a data scientist for over 10 years, and I founded Kukuyeva Consulting to help overwhelmed founders make more sense of their data. After working with me on their data strategy and implementation, they're able to make their products more profitable, more scalable, and more user friendly. In my experience developing a lot of customer-facing models, I've found that I have to make time for tests, and today we'll learn why we need to write tests and how you actually go about doing it.

First things first, I want to talk about what happened at Knight Capital. You may or may not have heard of this financial institution. In 2012 they reverted their code base to a previous checkpoint, and what ended up happening was that any time somebody went to fill an order, it was never getting tagged as filled. So in August of 2012, when they went to place about 200 orders, they ended up placing over 2 million. In the span of 45 minutes they lost over 400 million dollars, their stock tanked by 75%, and they were also fined by the SEC. So we'll try to make our code more robust and more testable so that this doesn't happen to our clients.

What's the prerequisite for this talk? I've tried to make it approachable for everyone, so the only requirement is being able to read small Python functions. As for the agenda: first I'll introduce the terminology for the four types of tests in a testing suite, then we'll talk about how you actually run a suite, what it means to get test coverage and how you get it, then tips and recommendations, and finally where to go from here. I'll also pause along the way, so if you have questions you can type them in the chat or in the Q&A, and when we come to a stopping point I'll read them aloud and answer them.

So let's get started. First of all, what's a testing suite? As I mentioned, you may have been asked to write one. Your code probably has what I'll call an existing code base: a bunch of Python scripts, either exported from a notebook or written as .py files from the start, and one or more of these files might, for instance, read in a dataset, do some data processing, fit a predictive model, and then make recommendations. We'll come back to that example and talk about how to write a testing suite for it, but since I find it harder to follow along when there's a lot of code on the page, for this tutorial I'll walk through applying and explaining the four tests on a very, very tiny Python script. It has only five lines, and it computes how much something costs after applying a discount and then taxes. Our code has no functions, because with machine learning model development it's usually a proof of concept: it tends to live in a Python notebook, it's exploratory analysis that looked promising, and then we were asked to productionize it. So we have no functions, and that's fine.

Let's dive right in with the first type of test: the unit test. The "unit" part says it's the smallest piece of the code that you can test, and the one caveat is that it has to be a function. In our tiny example, since we don't have any functions, the simplest way to start is to throw the whole script into a function and then test its inputs and outputs. So v0 of our code is putting the two lines of code into a function, calling that good for now, and testing it. For a function, we have to define what it's called; since ours does two things, it's going to be called price_after_discount_and_taxes. It takes three arguments (the price of the item, the discount, and the tax rate), runs our two lines of calculations, and returns a final price.

So one unit test might be: apply this function to a set of inputs, look at the output, and check that it's what we expected. I try to have each test function's name start with "test", then the name of the function under test, and maybe a couple of other details, so that as your tests run (we'll talk more later about what that means) you get meaningful output: if something breaks, you'll know which test to look at to tell you what happened. In our case we expect to see a price of $82.50, given that the item originally cost $100, we had a discount of 25%, and we have a high tax rate of 10%. The fourth line of the test checks: for the input we fed to our function, did we get the result we expected, $82.50? The reason we check that an absolute difference is less than a small tolerance, rather than exact equality, is that with numerical representation on a computer the result might look like 82.5, but it might also come out as 82.5000...1 or 82.4999..., and for all intents and purposes we're saying that looks about the same, so our test passes.

I'll stop right here and check whether there are any questions on unit tests, the code, or anything else before we move on, so feel free to add them in the Q&A or the chat and I'll wait a little bit. Okay, I think there are no questions for now, so I'll keep going.
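As a sketch of what that first unit test might look like in pytest (the function name and numbers are reconstructed from the talk, not the original slides):

```python
# v0: the whole tiny script wrapped in one function, good enough for now.
def price_after_discount_and_taxes(price, discount, tax_rate):
    discounted = price * (1 - discount)        # apply the discount
    final_price = discounted * (1 + tax_rate)  # then apply taxes
    return final_price

# Unit test: feed in known inputs and check the output, using a small
# tolerance because 82.50 may come back as 82.5000...1 or 82.4999...
def test_price_after_discount_and_taxes_typical_input():
    result = price_after_discount_and_taxes(100, 0.25, 0.10)
    assert abs(result - 82.50) < 0.001
```

Running pytest on a file containing this would discover the `test_`-prefixed function and run it.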
As I mentioned before, our one function was doing two things, so if something broke, it was a little hard to figure out which part of the code was responsible, and this was just two lines. An easier way is to break the one function up into two, and those two become candidates for two separate unit tests. The rule of thumb: if you're writing a function and there's an "and" in the function name or in the docstring, that's a candidate for refactoring, reorganizing your code to make it simpler. In our case we can make two functions: one that applies a discount to the price and returns it, and another that increases the value of our item by a certain tax rate and returns that price.

Now that we have two functions, we can have two separate unit tests, very similar to what we did two slides ago. One test says we expect to see a discounted price of 75 dollars, given (line 16) that the item cost $100 and the discount was 25%, and line 17 checks how close our expected price is to the output price, within a tolerance we can set to, say, 0.001. Then we'd test the other function similarly: price_after_tax gets a unit test of its own.

Now that we have two functions, we can also check whether they actually perform as expected together, and for that you use an integration test. An integration test is for two or more functions: when you send the output of one into the other, do you get the output that you expect? Here we still have our two refactored functions, and what our integration test does is call the first function (line 10), send that result to the second function (line 11), and then check (line 12) whether the output from both of them together is the same as what we expected. As we saw before with the single function that did two things, we want to see an item price of $82.50: first we see what value we get from calling the first function, then what value we get from feeding that into the second function to get our price after taxes, and then we compare that to the value we expect.

I'll stop here and see if there are any questions in the Q&A or the chat. Okay, there's one; let me stop sharing so I can see the Q&A. That's a good question: how many values should you test? I'll talk about this a bit more later, but I'll address part of it right now. You want to check all the edge cases. For instance, can the values be negative? What formats do the inputs take: does the function take a list, or only integers, and what does it do when it gets a list? We'll talk about parametrized tests, which help you test the same function with many inputs, and we'll also talk about the testing pyramid, which says you should have a lot more unit tests than other types of tests. I will say I've definitely gotten into discussions with coworkers about this. Testing design, how many tests, what to test: it's an art and a science, and I've gotten feedback from coworkers that I had too many tests in our code and should take some out. My take is that whatever gives an idea of what inputs and outputs to expect, anything weird in the data, and any edge cases you decided to handle or not handle should be part of the testing suite. You can also treat the suite as additional documentation; I'll touch on all of these points a little later as well.
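A sketch of the refactored functions, their two unit tests, and the integration test described above (names and the tolerance constant are assumptions based on the talk):

```python
EPSILON = 0.001  # roundoff tolerance for comparing prices

def price_after_discount(price, discount):
    # One function, one job: apply the discount.
    return price * (1 - discount)

def price_after_tax(price, tax_rate):
    # One function, one job: apply the tax.
    return price * (1 + tax_rate)

def test_price_after_discount():
    assert abs(price_after_discount(100, 0.25) - 75.0) < EPSILON

def test_price_after_tax():
    assert abs(price_after_tax(75.0, 0.10) - 82.50) < EPSILON

def test_integration_discount_then_tax():
    # Does the output of one function, fed into the other, give what we expect?
    discounted = price_after_discount(100, 0.25)
    final_price = price_after_tax(discounted, 0.10)
    assert abs(final_price - 82.50) < EPSILON
```

The integration test exercises the hand-off between the two functions, which neither unit test covers on its own.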
For instance, if you look at a code base with no documentation, you can start with the testing suite and see what got tested, because you'd expect those to be the most important parts of the code, and get familiar with what's going on before you dive into understanding the rest of the code base. Very good question.

Okay, so briefly: we have our two functions, a unit test for each, and an integration test, and then you hand this to a user and they say, hey, I want a discount of 120 percent: give me the item for free and actually give me money back. That may or may not be a valid ask. We may decide that was actually a bug: we shouldn't have been able to handle arbitrary percentages, they should always be smaller than 100%. So it's a bug in our code, and once we fix it, one thing we can add is what's called a regression test. Regression here is not the same thing as linear regression; it's regression in the sense that you don't want the bug to come back. If you already fixed something, you write a regression test to check that it stays fixed, so that if this bug comes back later, you know it actually wasn't completely fixed. In our case, if somebody asks for a discount of more than a hundred percent, we want to throw an error (line 8 says this is not a valid value, raise an error) and otherwise actually give the discount. Then our regression test checks that the function actually raises an error when given that invalid value: line 34 says we expect a raise, in this case the ValueError from line 8 above (we want to make sure it raises the correct error type), and line 35 calls our function with one kind of invalid input.
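The fix and its regression test might look like this (a sketch; with pytest.raises, the test passes only when the expected error is actually raised):

```python
import pytest

def price_after_discount(price, discount):
    if discount > 1:
        # Discounts over 100% were the bug; fail loudly instead.
        raise ValueError("discount must be at most 1 (i.e. 100%)")
    return price * (1 - discount)

# Regression test: if this ever stops raising, the bug has come back.
def test_price_after_discount_raises_on_invalid_discount():
    with pytest.raises(ValueError):
        price_after_discount(100, 1.20)
```

If the bug reappears (the function silently accepts 120%), this test fails, which tells you the fix wasn't complete.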
What happens then is that the function call fails, but we've marked that we expected it to fail, so that's exactly the right situation.

At this point we've fixed a bug in one function, but we had two, so we'd have to copy-paste the fix into the second function, and from then on remember to carry over whatever we do in one to the other. By now you may have noticed that the two functions were really, really similar, and whenever you feel like you have to copy-paste something, you shouldn't: this is called DRY, don't repeat yourself, because at some point you'll inevitably forget that this code was copy-pasted in two or three places. What you should do is refactor your code so you only need to make the update once. So we can combine price_after_discount and price_after_tax into one function that is a little more general and just takes a price and an adjustment: if the adjustment is negative it's a discount, and if it's positive we treat it as a tax rate. Now we only have to test one function.

And if you want to test multiple inputs for that one function without copy-pasting the guts of the test, one way to do it is with a parametrized test. At the top is again our function, and in the parametrized test you specify all the arguments, all the inputs and expected outputs, and it runs them for you. For instance, in lines 41 through 46 we specified five different scenarios to test. The first (line 42) makes sure that if somebody asks for a discount of a hundred and twenty-five percent, the function throws an error. Line 45 checks that if somebody asks for the price of an item after a tax rate of forty percent, we get the right output. And similarly, line 46 checks that if somebody wants to apply a tax rate of more than one hundred percent, the function also throws an error. Then the middle of the test, lines 53 and 54, says: for each of the argument sets we specified in lines 41 through 47, put that scenario into our function, compute the output, and check whether the value matches what we specified. What this actually does is run what looks like five different unit tests, using the same test function with different arguments.

I want to stop here and see if there are any questions before we talk through tips and tricks. Okay, let's see, I see one... it was the same question as before. No new questions, okay.
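Pulling the combined function and the parametrized test together as a sketch (the scenario values echo the talk's examples; names are assumptions):

```python
import pytest

def adjust_price(price, adjustment):
    # One general function: negative adjustment is a discount,
    # positive adjustment is a tax rate.
    if abs(adjustment) > 1:
        raise ValueError("adjustment must be between -1 and 1")
    return price * (1 + adjustment)

@pytest.mark.parametrize(
    "price, adjustment, expected",
    [
        (100, -0.25, 75.0),   # a 25% discount
        (100, 0.10, 110.0),   # a 10% tax
        (100, 0.40, 140.0),   # a 40% tax
    ],
)
def test_adjust_price(price, adjustment, expected):
    assert abs(adjust_price(price, adjustment) - expected) < 0.001

@pytest.mark.parametrize("bad_adjustment", [-1.25, 1.25])
def test_adjust_price_raises_on_invalid_adjustment(bad_adjustment):
    # The error scenarios from the talk: discounts or taxes over 100%.
    with pytest.raises(ValueError):
        adjust_price(100, bad_adjustment)
```

pytest reports each parametrized row as its own test, so these two test functions show up as five separate results in the output.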
So now we've learned about four different types of tests: how do you actually run them, and how do you get code coverage? One way to get a testing suite to run is with the following folder structure. At the very top you have the project name: in the screenshot, a folder called price_adjustment. It has an __init__.py file where you can define constants; for instance, here I've defined the roundoff error used to compare when one value is close to another to be 0.001. You have your code base, one or more Python files; in this case I have just one file, adjust_price. And then you have a separate folder called tests, which also has an __init__.py, usually empty, though if there are certain things that only apply to the tests, you can put them there. Then, depending on how big your testing suite is and how many files are in your code base: if there's only one file, you may just want one file called test_adjust_price for the unit tests and one called test_integration_adjust_price; if you have multiple scripts in your code base, you may want a folder for the unit tests containing test_code_base_1.py, test_code_base_2.py, and so on and so forth, so it's a little more organized.

Once your code is in this structure, if you navigate in your terminal to this folder, price_adjustment (you can clone and download this repository, and it'll have all the code I've shown on the screen), you can run the following command. pytest tests says run all the tests in my tests folder, and then the qualifiers: -x says if a test fails, break at the first failure; --pdb says open the debugger so I can step in and see what happened; and -v says verbose output, so you get to see which values the parametrized test actually ran. In the screenshot below, we wrote what looks like five tests: test_price_adjust_discount, test_price_adjust_tax, test_price_adjustment_discount_throws_error_below_zero (which passed), five that were parametrized, and one more, test_price_adjustment_integration_discount. So we wrote five test functions and they got expanded into nine different tests without much more work, which is nice. It also tells you that seven passed and two failed, but those two were marked as expected failures, so that's good. And it tells you how long the run took; here it took less than a second, which is very nice, because the longer the tests take to run, the less people may want to run the testing suite, and you want it to finish before you add new code, for instance. One thing to note is that with this testing suite in place, you can configure your repo to automatically run the tests every time code is committed.
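To recap, the folder layout described above might look like this (file names reconstructed from the talk's screenshot as best I can):

```
price_adjustment/
├── __init__.py          # project-wide constants, e.g. EPSILON = 0.001
├── adjust_price.py      # the code base (one or more .py files)
└── tests/
    ├── __init__.py                       # usually empty
    ├── test_adjust_price.py              # unit tests
    └── test_integration_adjust_price.py  # integration tests
```

From the price_adjustment folder you would then run something like `pytest tests -x --pdb -v`: -x stops at the first failure, --pdb drops into the debugger there, and -v prints verbose output, including each parametrized case.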
That way, if you add code and your tests break or fail, that code doesn't get merged into master until it's fixed, which is a good practice to have.

Now that we have a testing suite, we can also check what's called code coverage. Code coverage measures how much of your code is exercised by the testing suite: if we have a hundred lines of code, and we have one unit test that checks one function that spans three lines, we're probably covering only three lines out of a hundred. In this case, this is a super tiny repository with a lot of tests, and we're getting 100% test coverage, which is very rare, if not extremely rare, but again, this is a toy example. One thing I do want to note is that getting 100% code coverage is not the same thing as getting 100% amazing tests. You can write tests that aren't very informative; for example, if your test passes in a data set of a thousand rows and something breaks, it's hard to debug, so a small data set of, say, three rows may be more informative for testing.

Now for some more tips and recommendations, now that we have a testing suite and can run it. What else can we do? As I mentioned, you can think of a testing suite as additional documentation; I put "additional" in parentheses because sometimes there's no other documentation. It can explain the inputs and outputs of each function, or how the function handles edge cases: hey, maybe when my data is missing, all my functionality is still there and it returns a certain result, and the test explains to the user what that is. It may also explain what is and isn't handled in data processing, just by what's tested in the suite. And of course my favorite: you start writing code, you have your testing suite, things are working, you change one line, and nothing works anymore. If you use version control and you have a testing suite, the testing suite will tell you which test broke, and your version control will show you which lines of the code you changed, so it's much easier to narrow down which portion of your code broke and how to fix it.

Another thing, as we briefly discussed: it's much easier to write tests when each function does one thing. If you start writing a function name with an "and" in it, or a docstring that says this function does this and that, that should probably be at least two functions, and then it's again much easier to write tests. I've also found that it's much easier to write tests when you're doing development side by side. What does that mean? You've probably heard of test-driven development, where you write the test first and then write the code that satisfies it. With data science, a lot of models are proofs of concept, with the first iterations in a Jupyter notebook, and you don't even know whether this is going to be production code down the line. What helps me is that as I write in my Jupyter notebook, and start refactoring my code into functions that I test in certain cells, I later save the code that tested that functionality into a separate file. That way I already have the inputs that went in and what I was testing, so that if the work proves to be production-worthy, I already have a draft of what my testing suite, or some of its components, should look like, and it's much easier to add, say, the function names and then the guts of each test. Another useful thing I found is that pandas has a built-in function that helps you check whether two data sets are equivalent, with options for whether you care about the column order, the row order, and a couple of other things.
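The pandas helper I have in mind here is pandas.testing.assert_frame_equal (its check_like flag, for instance, ignores row and column order); a minimal sketch with made-up data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_processed_data_matches_expected():
    expected = pd.DataFrame({"item": ["hat", "scarf"], "price": [75.0, 82.5]})
    # Same content, different column order:
    actual = pd.DataFrame({"price": [75.0, 82.5], "item": ["hat", "scarf"]})
    # check_like=True says we don't care about row/column order,
    # one of the knobs mentioned above.
    assert_frame_equal(expected, actual, check_like=True)
```

When the frames differ, assert_frame_equal raises an AssertionError whose message points at the first mismatching column and values, which is far more readable than comparing frames by hand.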
That makes it really, really nice, and makes it much easier to write a testing suite for data sets.

So now that we've talked about what to test, let's talk about what not to test. If there are any functions from base Python or from third-party modules you had to install, you don't need to test them. In this example, say we read a CSV: that's not on you to test. If you're fitting a logistic regression, you don't need to check that the fit itself works, that it takes a list or takes a data frame; that's outside your scope, because you did not develop that function. You may want to check that the predictions happen after the model was estimated, but you don't need to test the fit function itself. Similarly, hopefully, unless your manager tells you otherwise, the same goes for first-party functions that another team develops. If one team developed a module with function B, and you're writing module C that calls function B, it's not on you to write the unit tests for function B, unless that's your job. If your scope of work is module C, you may want to write an integration test for how function B ties in with your work, but you don't need to unit test function B in your module. Hopefully that helps; otherwise, feel free to add questions in the Q&A or the chat.

So now, what to test? Going back to the machine learning example from the beginning, what are some things you might want to add as unit or integration tests? One is whether your functions allow different types. Do they allow missing values, and what does it mean for a thing to be missing: does it have to be completely missing, or is an empty space missing too? If you're doing text processing, what does that mean, what is and isn't allowed? And then there are checks on the feature engineering side, after you apply a transformation or multiple transformations.
Do each of those transformations give you what you expect? If you transformed everything into categorical variables, do you get the expected number of categories? Again, how do you handle missing values, if at all? Another check: you create a certain number of features, so after we give a data set to our data processing step, do we get the 20 features we expect? Do we get a certain feature? For instance, if I'm developing a model to predict sentiment around certain text features, does the thing I care about show up in the vocabulary? If I'm trying to pick out features to predict sentiment around clothes, do I get clothing attributes or clothing mentions in my vocabulary? That should be a unit test.

On the output side, there are a couple of other checks I'd highly recommend. What does your output look like? Again, if we're predicting sentiment, do you surface the probability, do you surface the category, do you surface a human-readable output, or maybe a JSON or dictionary that says: this was the probability, this was the category, this was the sentiment, this was the model version, this is the timestamp, and a bunch of other variables? Another thing you may want to test: I know different data sets will be different, but there are certain outputs you expect; for one type of input, you expect a certain result. For instance, if you give your model the raw text "I loved it", do you get positive sentiment? If it originally did, and then you update your model and it doesn't, and that check is part of your testing suite, the suite tells you something changed.

Another thing on the output side is what happens when the model shouldn't be predicting a sentiment at all. This was an actual situation from a code base I worked on in the past, where the input is a customer giving us feedback: "it's missing a button". That's not positive sentiment or negative; it's factual: it was just missing a button. The pipeline transform does a lot of the pre-processing steps (parse the tags, format them, see where the text sits in the vocabulary, feature engineer it) and then makes a prediction. Line 10 gets the prediction, and line 11 asks whether we were actually able to make one; in this case, the value negative 99 says no, we're not able to predict a sentiment for it. The test then says: we don't expect this input to have a sentiment, does the model actually behave that way? Since it's a factual statement, it shouldn't have sentiment, and this test passes if the model behaves as expected.

This already came up in the Q&A, and it's a really good question: how many tests do you write? As I mentioned, it's an art and a science. I'd stress again that if there are certain nuances in the code, or nuances in the data, those should be tested, or at least explained: like the missing button, or, if somebody gives you factual statements, how do you handle that, or do you handle it at all? Then there's the testing pyramid: unit tests take the least amount of time, so have a lot of those; integration tests combine more than one function, so there should be somewhat fewer of them; and end-to-end tests, which you can think of as going from getting the data, to doing the feature engineering, to estimating the model or making the prediction, take longer, so there should be fewer of them still. Not zero, but fewer. And again, as we mentioned, 100% code coverage does not mean you have good tests: you might not be testing an edge case, for instance, or the data sets used might be too big.
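A sketch of that kind of output test, with an entirely hypothetical stand-in for the real pipeline (the real one parsed tags, featurized, and called a model; the -99 sentinel echoes the talk's example):

```python
NO_SENTIMENT = -99  # sentinel: the pipeline declined to predict a sentiment

def predict_sentiment(text):
    # Hypothetical stand-in for the real parse -> featurize -> predict
    # pipeline, just enough logic to make the tests below runnable.
    lowered = text.lower()
    if "loved" in lowered:
        return "positive"
    if "hated" in lowered:
        return "negative"
    return NO_SENTIMENT  # factual statements get no sentiment

def test_known_input_gives_known_output():
    # Pin down behavior we rely on, so a model update that changes it is caught.
    assert predict_sentiment("I loved it") == "positive"

def test_factual_statement_gets_no_sentiment():
    # "it's missing a button" is feedback, not an opinion.
    assert predict_sentiment("it's missing a button") == NO_SENTIMENT
```

With a real model, the assertions stay the same; only predict_sentiment would call the actual pipeline.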
If a test fails with a data set that big, it's much harder to figure out what went wrong.

So I'll wrap up, summarize, and make recommendations. As we learned, we should be testing our code, a lot. It helps when our functions do only one thing, so it's much easier to write the unit tests and then combinations of them. We learned that we should definitely be testing edge cases, and that every time we squash a bug we should add a regression test to make sure the bug doesn't come back. Every time we have multiple functions, we should make sure they still combine and work well together, with an integration test. And if you see yourself copy-pasting code: if it's regular code, it should become a function that you call multiple times, and if it's within your testing suite, you should use a parametrized test, so you can pass in different arguments without rewriting the test function itself multiple times.

Next steps: we covered a lot, but we still only touched on a little bit of the whole gamut of what you can do with a testing suite. There are a lot of resources on this page and the next. There's a really good table on how to become a better data scientist. There's also a Google paper with many more recommendations on how to test a machine learning system: how to test whether a distribution changed, or whether the same trends come out given the same data; I'd highly recommend reading that paper. There are also blog posts with more about the testing pyramid. I found the pytest book really helpful and recommend it as well, and the pytest documentation is also pretty good. There's a podcast, and a couple of other resources: one for beginners on getting started with software development, and one on the notion of writing tests black box (you didn't write the code and you don't know what it's doing), gray box (maybe you wrote the code, but it's been six months since you wrote it), and white box (you're writing tests as you develop the code).

I also wanted to briefly mention type hints, which are becoming more popular; a tutorial on what type hints are came out just this past week from Vicki. Again, I'd stress that they're additional documentation: they tell you what argument types the function allows, so if you allow floats and lists you can specify that, and later, if somebody looking at your code wants to pass in a data frame, their tooling can actually tell them that's not a valid input. That helps a lot. I didn't cover this, but I also want to recommend a package called hypothesis, where you specify the types of inputs and outputs and it generates tests for you: you can say, I want a hundred tests with this float and this other float, and it will test your code for you, to help make your code base even more robust.

We also didn't talk at all about what are called fixtures. If the input to your functionality is, for instance, a data set, or maybe a dictionary or vocabulary object that you want to check all your functions against, you can define it in what's called a conftest.py file, and then all of your tests, or a portion of them, can use it. For instance, you can say: my vocabulary contains 10 words, and this is what they are; and then you can check, when my input is a certain data set (which you can also specify in conftest.py), do I get this vocabulary after my data processing step or not? And then there are mocks: if you're getting data from a database to build an input, you don't want your testing suite calling your production database, so what you can do instead is create tiny local databases to test against.
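A sketch of a fixture as it might appear in a conftest.py (the vocabulary contents here are made up; the helper function exists only so the fixture's value can also be built outside a pytest run):

```python
import pytest

def build_vocabulary():
    # Made-up ten-word clothing vocabulary standing in for the real one.
    return {"dress", "shirt", "skirt", "button", "zipper",
            "sleeve", "collar", "hem", "pocket", "cuff"}

# In a real project this lives in tests/conftest.py, so every test in
# the folder can receive `vocabulary` simply by naming it as an argument.
@pytest.fixture
def vocabulary():
    return build_vocabulary()

def test_vocabulary_contains_clothing_terms(vocabulary):
    assert "button" in vocabulary
    assert len(vocabulary) == 10
```

Because pytest injects the fixture by argument name, many tests can share the same vocabulary without each one rebuilding it.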
I also briefly mentioned that once you have a testing suite, you can have the tests run automatically as soon as you push code to the repository; that's done with continuous integration. There are a couple of ways to do that: Jenkins, or what's called CircleCI or Travis CI, CI for continuous integration. You integrate it with your repository so that as soon as you commit something, it runs the tests, and if a test breaks, your code doesn't get pushed to master; it tells you, hey, you actually need to fix this. I also found another blog post with a lot of tips on how to work with code that you don't understand. And last but not least: now your code is ready for production and it has a testing suite; how is it actually going to be rolled out to your customer base? Is it going to be rolled out to ten percent of your audience, then twenty or thirty percent? When should it be rolled out to everyone? How do you do that so that, on the implementation side, you can also check whether the model is working or not?

Now I want to stop for questions again, so if you have any, send them to the chat or the Q&A. The slides are available at the bit.ly link, and there's a GitHub link as well. This webinar is recorded and will be posted on the YouTube channel, I would say probably in the next two to four weeks. And while I await your questions, since you stayed until the end, I wanted to offer you a free gift: I'll be giving out five free 15-minute strategy sessions to the first five people to email me.

Let's see if there are any questions in the Q&A or chat. Okay, we have a new question in the chat window: how do you approach testing deep learning architectures, where you can end up training the wrong model for quite
a while? That's a really good question. I'll give a simplified approach first: you can treat it as any other model. What inputs do you give it, what are the outputs, and is the output formatted in the way you expect, as we talked about? Then, depending on which framework you're using, you could save intermediate results, so this layer could be one function and that layer another function, and check to make sure each is consistent. There have been a lot more blog posts specifically on deep learning, just because there are a lot more parameters, but I would say the big-picture ideas are still the same: test any edge cases, and for scenarios where the prediction should come out a certain way, check whether your model actually does that or not. Hopefully that helps, and if not, feel free to follow up.

Are there any other questions? I see one more in the Q&A: can you post the link to the slides and the repo in the chat? Yes, I just shared the link to the slides and the repository in the chat, and we will share it in the YouTube video description as well.

All right, if there are no more questions, we'll end the talk here. Thank you for being here with us today, and thank you, Irina, for giving us a wonderful talk. Thank you all for attending, have a good rest of your weekend, and have a good day. Bye, everyone.
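One concrete version of the "treat it like any other model" advice above is to check output shapes and formats. This is a sketch with NumPy standing in for a real deep learning framework; the single dense layer and all of its dimensions are made up for illustration:

```python
import numpy as np

def dense_layer(x, weights, bias):
    """One layer treated as one testable function (illustrative)."""
    return np.maximum(x @ weights + bias, 0.0)  # ReLU activation

def test_layer_output_shape_and_range():
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(4, 8))    # 4 samples, 8 features
    weights = rng.normal(size=(8, 3))  # maps 8 features to 3 units
    bias = np.zeros(3)

    out = dense_layer(batch, weights, bias)
    assert out.shape == (4, 3)      # output format we expect
    assert np.all(out >= 0.0)       # ReLU output is never negative
    assert np.all(np.isfinite(out)) # no NaNs or infs crept in
```

The same pattern scales up: each layer or preprocessing step becomes a small function with its own shape, range, and edge-case checks, before any expensive training run starts.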