Coder’s Cauldron | Coding Best Practices in MLOps

“…developing and deploying ML systems is relatively fast and cheap but maintaining them over time is difficult and expensive.”  

– D. Sculley, Hidden Technical Debt in Machine Learning Systems, NIPS 2015 

Most professional data scientists know the truth of this quote. You have probably encountered it while troubleshooting one of your machine learning system’s many processes. Trying to patch several problems at once usually makes things worse, and problems tend to grow in complexity as the system ages. The result is wasted time and resources – and, at worst, production failures.

Building a model that fulfils its business objectives and deploying it can be relatively easy; operating it in production is the hard part. A change in the input data, or in the relationship between the features and the target variable, can degrade both model performance and production quality. Microsoft itself has stated that data drift is one of the major reasons model accuracy degrades over time.

Models also need to be retrained as new observations arrive, and their performance metrics need to be determined and tracked continuously – processes known as continuous training and continuous monitoring. Both are key to ML models’ long-term success. So let’s consider some best practices for consistent delivery in machine learning systems.

Open Communication

Machine learning teams need to communicate if they want to meet business objectives in the face of changing resources, data patterns, and expectations. In an ML project, teams include experts in data engineering, DevOps, data visualization, AI, software engineering, business domain knowledge, etc. All these experts must work together for the success of the project. And clear communication between these professionals as well as with the business teams and stakeholders is essential if the model is to perform properly. 

Cost-Benefit Analysis

A major part of MLOps is managing the costs of the machine learning lifecycle. Introducing MLOps in short-term projects might be expensive; on the other hand, it pays off in long-term projects. Therefore, understanding where MLOps can be helpful is critical.

Automation (which is covered later in this article) can reduce the manual effort needed to manage ML models, giving engineers more flexibility and more time for productive tasks.

When evaluating the desirability of MLOps, follow a systematic process that considers goals, budgets, ML activities, and team capabilities.

Clear Naming Conventions

Naming conventions make sure that everyone is on the same page. In general, a naming convention is a descriptive way to label variables, functions, and other elements within code or its documentation. The goal is for everyone to be able to easily read and understand the element by its name – what it does, represents, or contains. 

A naming convention also helps promote consistency within teams. By avoiding naming conflicts, it makes it easier to transfer projects to another team or to a new team member. Therefore, it’s a good idea to introduce this standard early and get buy-in from the entire team, ensuring that everyone moves in the same direction together.
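As a minimal sketch of what such a convention might look like in Python (the file path, function, and variable names below are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

# One possible convention: UPPER_CASE for constants,
# snake_case for functions and variables,
# a _df suffix for DataFrames and _path for file paths.
RAW_DATA_PATH = "data/customer_churn.csv"  # illustrative path

def load_churn_training_df(csv_path: str) -> pd.DataFrame:
    """Load the churn training data; the name states what it returns."""
    return pd.read_csv(csv_path)

churn_training_df = load_churn_training_df(RAW_DATA_PATH)
```

Any teammate reading this code can tell at a glance what each element contains or does, without digging through its implementation.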

Choose Your MLOps Toolkit Strategically

There are many MLOps tools available. When choosing which tools you will use, think about the long term. Consider your business objective and the tasks you’ll need to do. Think about your budget constraints and the ML team’s knowledge and skill. Examine the data you’ll use and your options for data versioning, parameter tuning, production monitoring, etc. Once you get the complete picture, you’ll be able to select MLOps tools that will adequately support your team’s efforts throughout the different lifecycle stages. 

Experiment Tracking 

Developing machine learning models is a highly iterative process. Unlike traditional software development, ML allows multiple model-training experiments to be performed in parallel before the production model is finalized.

Experimentation during ML model development can take several forms. One way to track multiple experiments is to use version-control branches (for example, in Git), each one dedicated to a separate experiment. The best ML model is then selected from the various trained candidates based on its performance metrics.
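Dedicated tracking tools can complement branching. As a minimal sketch, here is how a run might be logged with MLflow, one popular open-source tracker (the experiment name, parameters, and metric values are illustrative):

```python
import mlflow

# Group related runs under one named experiment
mlflow.set_experiment("churn-model-experiments")

with mlflow.start_run(run_name="baseline-logreg"):
    # Log the hyperparameters that define this experiment
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("regularization_strength", 0.01)

    # ... train the model here ...

    # Log the metrics used to compare candidate models later
    mlflow.log_metric("validation_accuracy", 0.87)
```

Because every run records its parameters and metrics in one place, selecting the best model across many parallel experiments becomes a query rather than an archaeology exercise.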

Validating Models Across Market Segments 

ML models are vulnerable to poor data quality. A model’s performance also degrades over time, at which point the model needs to be retrained; doing this reliably requires a training pipeline.

Once a model is trained, it is evaluated with a testing dataset that is held out from the training data. The main purpose of the testing dataset is to check the generalization ability of the trained model.
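As the section title suggests, aggregate test metrics can hide poor performance in a single market segment. A minimal sketch of per-segment validation with scikit-learn follows (the synthetic data, the "segment" column, and the segment names are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real customer features
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "feature_2": rng.normal(size=1000),
    "segment": rng.choice(["retail", "enterprise"], size=1000),
})
df["label"] = (df["feature_1"] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Hold the test set out of training entirely
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(train_df[["feature_1", "feature_2"]], train_df["label"])

# Report accuracy per market segment, not just in aggregate
for segment, group in test_df.groupby("segment"):
    preds = model.predict(group[["feature_1", "feature_2"]])
    print(segment, accuracy_score(group["label"], preds))
```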

Monitoring 

Changes in data can cause models’ performance to deteriorate over time; we need to ensure that our systems are monitoring and responding to this degradation. 

Therefore, we need to monitor the performance of our models online, tracking summary statistics of the data and sending notifications when values deviate from expectations. Such a deviation serves as the signal to start a new iteration of the ML process and retrain the model on new data.
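As a minimal sketch of such a check (assuming you have stored a reference sample of the training data), a two-sample Kolmogorov–Smirnov test can flag when a live feature’s distribution has drifted from what the model was trained on:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_sample: np.ndarray,
                        live_sample: np.ndarray,
                        alpha: float = 0.05) -> bool:
    """Return True if the live distribution differs significantly
    from the training distribution (a possible drift signal)."""
    statistic, p_value = ks_2samp(train_sample, live_sample)
    return p_value < alpha

# Illustrative data: live traffic has shifted upward
rng = np.random.default_rng(seed=1)
train_feature = rng.normal(loc=0.0, size=5000)
live_feature = rng.normal(loc=0.4, size=500)

if check_feature_drift(train_feature, live_feature):
    print("Drift detected: consider retraining")  # hook for alerts
```

In practice a check like this would run on a schedule against production data, with the print statement replaced by whatever notification channel the team uses.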

Automation  

In many MLOps environments, most machine learning tasks are performed manually. These include data pre-processing, feature engineering, splitting data into training and testing sets, model training, etc.

Performing these procedures by hand wastes data scientists’ time and creates opportunities for error – and then more time is lost resolving those errors. That time could instead be spent on research and exploration.

Continuous retraining to forestall model drift is often the entry point of MLOps automation. This can lead to automating data ingestion pipelines, model validation and testing, and more. 
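As a minimal sketch (using scikit-learn; the steps and parameters are illustrative), bundling pre-processing and training into a single pipeline lets a retraining job rerun every step consistently, with no manual intervention:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One object captures the whole train-time procedure
training_pipeline = Pipeline([
    ("scale", StandardScaler()),       # automated pre-processing
    ("model", LogisticRegression()),   # model training
])

def retrain(features: np.ndarray, labels: np.ndarray) -> Pipeline:
    """Re-run the full pipeline on fresh data; a scheduler or a
    drift alert (see Monitoring above) can call this automatically."""
    return training_pipeline.fit(features, labels)

# Illustrative call with synthetic data
rng = np.random.default_rng(seed=2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
model = retrain(X, y)
```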

Conclusion   

MLOps can help machine learning teams manage the complex maintenance and operation of models. This requires clear communication, well-defined naming conventions, automation of time-consuming routine tasks, and more. These practices save time and encourage a company to develop more ML models.

References

D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NIPS 2015.

Authored by: Anmol Singh, Consultant at Absolutdata