IP Insights Model ‘De-simplify’ — Part IV — Model Deployment and Performance Evaluation
After you train your model, you can deploy it using Amazon SageMaker to get predictions in any of the following ways:
- To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services.
- To get predictions for an entire dataset, use SageMaker batch transform.
The AWS GitHub example demonstrates how each of these pipelines works, but it does not provide any guidance on evaluation or monitoring. In this post, I will demonstrate how to evaluate the model.
Since this is an anomaly detection model (in effect, an imbalanced binary classification problem), we can use the popular classification metrics, adjusted to reduce the impact of the class imbalance.
There are many good blogs that discuss evaluation metrics and the particular challenges of imbalanced data. We will skip the introduction to each metric here and focus instead on evaluation principles and some tips for monitoring a production ML model.
1. Business impact matters the most
Most classification metrics, be it AUC, ROC, F1, or KS, are based on the number of misclassified samples or records. However, you should also create metrics that measure the business cost in dollars or some other unit. For example, if you have a model that detects potential customers, you should create a metric that measures the revenue gain from your model: how many dollars did the model bring to your company by correctly identifying potential customers? Similarly, how much did your model lose because it failed to find the right customers?
You have probably already reported these numbers to your leadership as a measure of team value, but you should also use these metrics when training your model.
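As a rough illustration of the customer example above, the sketch below turns predictions into dollar figures. The function name `business_impact`, the per-customer `customer_value` list, and the `cost_per_contact` parameter are all hypothetical names introduced for this example; your own cost model will likely look different.

```python
from dataclasses import dataclass


@dataclass
class BusinessImpact:
    revenue_gained: float  # dollars from customers the model found (true positives)
    revenue_missed: float  # dollars from customers the model missed (false negatives)
    outreach_cost: float   # dollars spent contacting everyone the model flagged

    @property
    def net_value(self) -> float:
        # What the model actually earned after paying for its own alerts.
        return self.revenue_gained - self.outreach_cost


def business_impact(y_true, y_pred, customer_value, cost_per_contact=0.0):
    """Sum up dollars gained, missed, and spent for one set of predictions."""
    gained = missed = cost = 0.0
    for actual, predicted, value in zip(y_true, y_pred, customer_value):
        if predicted == 1:
            cost += cost_per_contact      # every flagged record costs something
            if actual == 1:
                gained += value           # correctly identified customer
        elif actual == 1:
            missed += value               # real customer the model failed to find
    return BusinessImpact(gained, missed, cost)


# Toy data: labels, model decisions, and assumed dollar value per record.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
customer_value = [120.0, 80.0, 200.0, 50.0, 30.0]

impact = business_impact(y_true, y_pred, customer_value, cost_per_contact=5.0)
print(impact, impact.net_value)
```

Comparing candidate models on `net_value` and `revenue_missed`, next to the usual sample-level metrics, makes it easier to see when a model with a slightly lower AUC is still the better business choice.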
2. Always a mix of metrics
In reality, we always rely on a few metrics together to pick and finalize the ‘best’ model to deploy.
The first type of metric measures model performance, such as the confusion matrix, AUC, ROC, KS, etc. These metrics should be measured at both the sample level and the business impact level, and you may sometimes need to customize them. Choose the ones that align with your business scenario. For example, AUC and ROC are good measures of the overall performance of the model, while the KS statistic is a good measure of the model’s discriminative power in separating two population groups.
The second type of metric checks model stability. Whether your model is retrained daily or annually, leaders will rarely sacrifice its stability in exchange for a few percentage points of ‘accuracy’ gain. The most common stability check is to plot the prediction performance as a time series and look for any unreasonable spikes. In addition, you also need to check the quantile distribution if you use the classification probability as an input to later steps.
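To make the difference between AUC and KS concrete, here is a minimal sketch that computes both from scratch. It is kept to plain Python so it runs anywhere; in practice you would more likely use library implementations. It assumes label 1 is the positive class and that higher scores point toward the positive class, so flip the sign of your scores if your model works the other way around.

```python
def split_by_label(y_true, scores):
    """Separate scores into positive-class and negative-class lists."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return pos, neg


def roc_auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a sketch but slow on large datasets.
    """
    pos, neg = split_by_label(y_true, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between the score CDFs of the two classes."""
    pos, neg = split_by_label(y_true, scores)
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


# Toy imbalanced data: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]

auc = roc_auc(y_true, scores)
ks = ks_statistic(y_true, scores)
print(f"AUC={auc:.3f}  KS={ks:.3f}")
```

AUC summarizes ranking quality over all thresholds, while KS reports the single point where the two score distributions are furthest apart, which is why it reads as a measure of separation between the two groups.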
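Both stability checks can be automated as well as plotted. The sketch below is one simple way to do it, not a standard recipe: `find_spikes` flags days whose metric sits far from the trailing window average, and `quantile_shift` compares the deciles of two score distributions. The window size, threshold, and `min_std` floor are assumed values that you would tune for your own data.

```python
import statistics


def find_spikes(values, window=7, threshold=3.0, min_std=1e-3):
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    spikes = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        # Floor the spread so a perfectly flat history does not divide by zero.
        std = max(statistics.pstdev(history), min_std)
        if abs(values[i] - mean) / std > threshold:
            spikes.append(i)
    return spikes


def quantile_shift(baseline_scores, current_scores, n=10):
    """Largest absolute difference between matching quantile cut points."""
    base_cuts = statistics.quantiles(baseline_scores, n=n)
    current_cuts = statistics.quantiles(current_scores, n=n)
    return max(abs(b - c) for b, c in zip(base_cuts, current_cuts))


# Daily AUC with one bad day at index 8.
daily_auc = [0.91, 0.92, 0.90, 0.91, 0.93, 0.92, 0.91, 0.92, 0.78, 0.91]
spikes = find_spikes(daily_auc)

# Scores that moved up by 0.1 across the whole distribution.
baseline_scores = [i / 100 for i in range(100)]
current_scores = [s + 0.1 for s in baseline_scores]
shift = quantile_shift(baseline_scores, current_scores)

print(spikes, round(shift, 3))
```

A large `quantile_shift` matters most when downstream steps apply fixed cut-offs to the predicted probability, because the same cut-off then selects a different share of the population.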
Real-Time Performance Monitoring
Be proactive about monitoring
Model performance monitoring is critical, but we should spend equal time monitoring the quality of the input data and of the engineered features. Data issues can eventually be caught by performance monitoring, but relying on it is risky because the signal always arrives late.
There are some mature tools for data quality checks. The most popular ones are Great Expectations and AWS Deequ (Glue), and of course you can build your own based on statistical checks such as the mean, quantiles, etc.
There is a great post that lists some common measurements for a real-time production system. Feel free to check it out.
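If you go the build-your-own route, a check based on the mean and quartiles can be quite small. The sketch below is a hypothetical starting point rather than a replacement for the tools above: `profile`, `check_batch`, and the tolerance values are names and numbers invented for this example, and missing values are assumed to arrive as `None`.

```python
import statistics


def profile(values):
    """Summarize a numeric feature with its mean and quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"mean": statistics.fmean(values), "q1": q1, "median": median, "q3": q3}


def check_batch(baseline_profile, batch, rel_tol=0.2, max_null_rate=0.05):
    """Compare an incoming batch against a baseline profile.

    Returns a list of human-readable issues; an empty list means the batch passed.
    """
    if not batch:
        return ["batch is empty"]

    issues = []
    present = [v for v in batch if v is not None]
    null_rate = 1 - len(present) / len(batch)
    if null_rate > max_null_rate:
        issues.append(f"null rate {null_rate:.2f} exceeds {max_null_rate:.2f}")

    if len(present) < 2:
        issues.append("too few non-null values to profile")
        return issues

    current = profile(present)
    for name, expected in baseline_profile.items():
        observed = current[name]
        scale = max(abs(expected), 1e-12)  # avoid dividing by zero
        if abs(observed - expected) / scale > rel_tol:
            issues.append(f"{name} drifted from {expected:.2f} to {observed:.2f}")
    return issues


# Baseline computed once from trusted training data (toy values here).
baseline = profile([10, 12, 11, 13, 12, 11, 10, 12])

good_batch = [11, 12, 10, 13, 12, 11, 12, 10]
bad_batch = [110, 120, None, 130, 125, None, 115, 118]  # wrong unit plus nulls

good_issues = check_batch(baseline, good_batch)
bad_issues = check_batch(baseline, bad_batch)
print(good_issues)
print(bad_issues)
```

Running a check like this before the batch reaches the model is what makes the monitoring proactive: the bad batch is rejected on arrival instead of showing up days later as a drop in model performance.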
I have a few ‘updated’ models that add some real-world factors to the model.
A detailed explanation of how these models cover the real-life challenges we face in data science can be found in this blog.