A more data-centric approach to machine learning

Isaac Chan
8 min read · Jun 6, 2021

What is MLOps?

MLOps is a nascent field in data science. NVIDIA defines MLOps as a set of best practices for businesses to run AI successfully, and the field is young largely because the commercial use of AI is itself fairly new. Since MLOps spans several stages of the machine learning project life-cycle, this article focuses only on the data portion of MLOps, specifically the challenges of label noise and the lack of quality data.

Source: Machine Learning Engineering by Andriy Burkov

I will also draw heavily on the talk given by Andrew Ng titled MLOps: From Model-Centric to Data-Centric AI and the book Machine Learning Engineering by Andriy Burkov. I highly recommend both these resources to you as they are fantastic!

What is Considered Good Data?

Source: Akamai

Andrew Ng describes good-quality data as data that is labeled consistently, with an unambiguous definition of each label. Good data should also cover the important cases, with good coverage of the input features, and it should be refreshed with timely feedback from production so that data drift and concept drift are captured.

Label Consistency

Data is the raw material of machine learning models, but as the famous saying goes, “garbage in, garbage out”. Poor-quality data and labels lead to poor model performance, and the problem is magnified when the dataset is small or when certain observations or classes are in the minority.

Poorly labeled data also makes model evaluation and error analysis challenging: we cannot trust our evaluation metrics if the labels themselves can’t be trusted. The problem is worse when the data is hard to understand and interpret, and domain experts are needed to properly determine the label of each observation.

Source: Ledger Insights

For example, our team deals with legal contracts between SAP and its clients. These contracts are often very long, some running to more than 50 pages, and our tasks consist of classifying these documents based on certain criteria. Mislabeled data is especially tricky for us to deal with because domain knowledge is needed to label each observation in the first place.

Source: Data-Driven Investor

Label noise does not arise only in classification tasks. In his talk, Andrew Ng shared an example of inconsistent bounding-box annotations for object detection. This, he suggested, could be due to different labeling conventions between labelers, or even the same labeler labeling observations differently over time.

How can we improve label consistency?

Firstly, Andrew Ng suggests having clear and specific labeling instructions. He also suggests having two independent labelers label the same sample of observations. If they disagree on a certain type of observation or set of classes, the labeling instructions can be revised until they become clearer.
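
As a rough illustration of how that disagreement could be measured, the sketch below computes Cohen's kappa between two labelers using scikit-learn; the labeler names and labels are made up for the example, not taken from the talk.

```python
# Measure agreement between two independent labelers on the same sample.
# The two parallel lists of labels below are purely illustrative.
from sklearn.metrics import cohen_kappa_score

labeler_a = ["contract", "amendment", "contract", "nda", "contract"]
labeler_b = ["contract", "contract", "contract", "nda", "amendment"]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1 suggest the instructions need revising

# Inspect the specific disagreements to decide which labeling rules to clarify
disagreements = [
    (i, a, b) for i, (a, b) in enumerate(zip(labeler_a, labeler_b)) if a != b
]
print(disagreements)
```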

Andriy Burkov also notes that different labelers may provide labels of varying quality, so it is important to keep track of who created which labeled example, which makes data versioning very important. Additionally, he suggests asking several individuals to provide labels for the same training example. In some situations, the team may only accept a label if all individuals assigned the same label to that example; in less demanding situations, the team can accept a label if the majority of individuals assigned it.
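
A minimal sketch of what this could look like in practice is shown below; the record structure, annotator names, and voting helper are my own illustration rather than anything prescribed in the book.

```python
# Track which labeler produced each label and derive a consensus label
# only when enough labelers agree (unanimous or simple majority).
from collections import Counter

annotations = {
    "doc_001": {"alice": "contract", "bob": "contract", "carol": "contract"},
    "doc_002": {"alice": "nda", "bob": "amendment", "carol": "nda"},
}

def consensus_label(labels_by_annotator, require_unanimous=False):
    counts = Counter(labels_by_annotator.values())
    label, votes = counts.most_common(1)[0]
    if require_unanimous and votes < len(labels_by_annotator):
        return None  # send back for discussion / re-labeling
    if votes > len(labels_by_annotator) / 2:
        return label  # accepted by majority
    return None  # no majority either

for doc_id, labels in annotations.items():
    print(doc_id, consensus_label(labels, require_unanimous=False))
```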

Model-Centric vs Data-Centric

In his talk on MLOps, Andrew Ng emphasized the differences between a model-centric approach and a data-centric approach. In his definition of the model-centric approach, a data science team collects what data is available, holds the data fixed, and improves the code. Improving the code could include using different models, different hyper-parameters, etc. Conversely, a data-centric approach holds the code fixed and iteratively tries to improve the data quality.

Although most people agree that data is key, much of the research and hype around AI focuses on the latest algorithms and technologies. In an informal survey of recent papers published on arXiv, Andrew Ng found that roughly 99% were about ML algorithms and only one was about data augmentation. Most of the courses I took, whether in school or online, also tended to focus on algorithms, with little emphasis on the data.

Source: The Fabricator

Andrew Ng argues that a data-centric approach leads to faster and better improvements in model performance. He shared an example of a model inspecting steel sheets for defects: a model-centric focus on the code did not improve performance at all, whereas a data-centric effort to improve data quality raised performance from 76.2% to 93.1%.

This mirrors my experience developing machine learning models in my current job and in past work with companies. Tackling data issues such as distribution shift, label noise, and data quality, and performing error analysis, tended to produce much better results than blindly spending hours on hyper-parameter tuning or increasingly complex models. For me, finding out what kind of data was being predicted wrongly, and why, is crucial.

In a similar vein, Andrew Ng argues that modern neural network architectures are large, low-bias models, so it is the variance that needs to be addressed, and one good way to do so is by improving data quality. He finds that once the data is good enough, there are plenty of models that will work well.

Small Datasets and Label Noise

Source: A Chat with Andrew on MLOps: From Model-Centric to Data-Centric AI

Andrew Ng mentions that a small dataset tends to accentuate the problem of label noise. A survey of Kaggle datasets by a colleague, and a question he posed to the audience during the talk, showed that datasets of small to moderate size are very common.

However, even big-data settings such as web search queries or self-driving-car footage suffer from the “long-tail” problem, where some observations appear only a handful of times, for example an extremely rare search query or a highly unusual traffic situation.

Source: A Chat with Andrew on MLOps: From Model-Centric to Data-Centric AI

Andrew then walked through an example of the effect of label noise on small data. As the image shows, a model fitted on a small, noisily labeled dataset tends to perform poorly, whereas with big data the model can still fit well despite noisy labels. The most important point, however, was that a small dataset with clean and consistent labels can already be good enough.
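
To make the effect concrete, here is a small toy experiment of my own (not from the talk) that flips a fixed fraction of the labels and compares a model trained on 100 versus 9,000 examples; the dataset, model, and numbers are arbitrary choices for illustration.

```python
# Toy illustration: the same label-noise rate hurts a small training set
# far more than a large one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def accuracy_with_noise(n_train, noise_rate=0.12):
    idx = rng.choice(len(X_train), size=n_train, replace=False)
    Xs, ys = X_train[idx], y_train[idx].copy()
    flip = rng.random(n_train) < noise_rate
    ys[flip] = 1 - ys[flip]  # corrupt a fraction of the labels
    model = LogisticRegression(max_iter=1000).fit(Xs, ys)
    return model.score(X_test, y_test)  # accuracy on clean test labels

print("small + noisy:", accuracy_with_noise(100))
print("large + noisy:", accuracy_with_noise(9000))
```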

Cleaning Data VS Collecting More Data

With noisy labels, Andrew Ng proposes either collecting more data or cleaning the existing data. In one scenario he describes, if you have 500 examples and 12% of them are noisily labeled, then cleaning up the noise and collecting another 500 new examples are about equally effective.

Source: A Chat with Andrew on MLOps: From Model-Centric to Data-Centric AI

Andrew also presents another scenario graphically: to reach the same level of performance as a cleaned dataset, you would need almost triple the amount of noisy training data. The cost of cleaning the labels therefore needs to be weighed against the cost of collecting more data.

One way to make the label-cleaning process more automated is to detect potentially mislabeled examples. Andriy Burkov suggests applying the model to the very training data it was built from and analyzing the examples for which it predicts something different from the human-provided label. Another method is to examine predictions whose score is close to the decision threshold.
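
A minimal sketch of these two heuristics, assuming a scikit-learn classifier, binary labels, and NumPy arrays for the training data, might look like this (the function name and margin value are illustrative):

```python
# Flag training examples where the model disagrees with the human label,
# and examples scored close to the decision threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def flag_suspect_examples(X_train, y_train, threshold=0.5, margin=0.05):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_train)[:, 1]
    preds = (scores >= threshold).astype(int)

    disagrees = np.where(preds != y_train)[0]                          # model vs. human label
    near_threshold = np.where(np.abs(scores - threshold) < margin)[0]  # uncertain scores
    return disagrees, near_threshold
```

Examples flagged by either heuristic can then be routed to a domain expert for review.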

Source: Classification Metrics & Thresholds Explained, Kamil Mysiak

If the team decides to label new data instead, Burkov suggests another interesting approach: use the best model to score the unlabeled examples and label those whose prediction score is close to the decision threshold. Alternatively, if error analysis has revealed error patterns through visualization, the focus could be on such examples.
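
A short sketch of the first idea, with placeholder names (`best_model`, `X_unlabeled`, `budget`) standing in for your own objects:

```python
# Score an unlabeled pool with the current best model and send the most
# uncertain examples (scores closest to the threshold) to the labelers first.
import numpy as np

def select_for_labeling(best_model, X_unlabeled, threshold=0.5, budget=300):
    scores = best_model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = np.abs(scores - threshold)  # small = close to the threshold
    return np.argsort(uncertainty)[:budget]   # indices of examples to label next
```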

Systematic Error Analysis

Andrew proposes the following systematic approach to improving the data. First, train a model and perform error analysis to identify the types of data the algorithm performs poorly on. Next, get more of that data via data augmentation, data generation, or data collection, or change the labels if needed.

I believe this method can be combined with the more systematic form of error analysis suggested by Andriy Burkov. He distinguishes between uniform errors, which appear at the same rate in all use cases, and focused errors, which appear more frequently in certain types of use cases.

In the words of Andriy Burkov, “Focused errors following a specific pattern are those that merit special attention. By fixing an error pattern, you fix it once for many examples. Focused errors, or error trends, usually happen when some use cases aren’t well-represented in the training data…This can be done by clustering test examples, and by testing the model on examples coming from different clusters.”
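
One hedged way to act on this, assuming NumPy arrays and an already-trained model, is to cluster the test set and compare error rates per cluster; k-means and the helper below are my own choices for the sketch, not Burkov's prescription.

```python
# Cluster the test examples and compute the error rate within each cluster
# to spot focused error patterns.
import numpy as np
from sklearn.cluster import KMeans

def per_cluster_error_rates(X_test, y_test, y_pred, n_clusters=10):
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_test)
    errors = (y_pred != y_test).astype(float)
    return {c: errors[clusters == c].mean() for c in range(n_clusters)}

# Clusters whose error rate is far above the overall rate point to
# under-represented use cases worth collecting more data for.
```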

Source: Imperva

Burkov suggests using dimensionality-reduction tools such as UMAP or an autoencoder to project the data down to 2D and then visually inspecting the distribution of errors. Different colors or shapes can represent points belonging to different classes, or whether the model predicted an observation correctly, so that we can focus on regions that are consistently predicted wrongly. He suggests considering 100–300 examples at a time and iteratively improving the model.
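
A rough sketch of that inspection, assuming the umap-learn and matplotlib packages and NumPy arrays for the placeholder inputs `X_test`, `y_test`, and `y_pred`:

```python
# Project the test set to 2D with UMAP and color points by whether the
# model classified them correctly, to spot error-dense regions.
import matplotlib.pyplot as plt
import umap

def plot_error_map(X_test, y_test, y_pred):
    embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_test)
    correct = y_pred == y_test
    plt.scatter(embedding[correct, 0], embedding[correct, 1],
                c="lightgray", s=10, label="correct")
    plt.scatter(embedding[~correct, 0], embedding[~correct, 1],
                c="red", s=10, label="misclassified")
    plt.legend()
    plt.title("Test examples in 2D: look for regions dense with errors")
    plt.show()
```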
