How Can You Handle Missing Data In A Dataset In Machine Learning?

In machine learning, the quality and completeness of the data largely determine how well a model performs. Gaps are common in real datasets, and they can easily distort the results of a predictive model. Whether your data is numerical, categorical, or a time series, missing values deserve close attention, because ignoring them can degrade the performance of your models. This article covers several techniques for handling missing data: deleting the affected data, replacing the missing values, or transforming the data.

Understanding Missing Data

Values can be missing from a dataset for many reasons, ranging from human error to system crashes. There are three main types of missing data:

Missing Completely at Random (MCAR): Values are missing without any logic or pattern; the chance that a value is missing is unrelated to any data, observed or unobserved.
Missing at Random (MAR): The chance that a value is missing depends on the observed data, not on the missing values themselves.
Missing Not at Random (MNAR): The chance that a value is missing depends on the unobserved value itself, which makes this type the hardest to manage.
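
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the column names (age, income), the thresholds, and the missingness probabilities are assumptions chosen for the example, not values from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical complete dataset (names and distributions are assumptions).
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: the chance of a missing income depends on age, which stays observed.
mar = df.copy()
p_mar = np.where(df["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(n) < p_mar, "income"] = np.nan

# MNAR: the chance of a missing income depends on the income value itself.
mnar = df.copy()
p_mnar = np.where(df["income"] > 65_000, 0.40, 0.05)
mnar.loc[rng.random(n) < p_mnar, "income"] = np.nan

for name, frame in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    share = frame["income"].isna().mean()
    print(f"{name}: {share:.1%} of income values are missing")
```

In the MAR case the pattern can be detected from the age column, while in the MNAR case the cause is hidden inside the very values that are missing, which is why that type is harder to correct for.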

Main Strategies for Dealing with Missing Information

1. Removal of Data

One approach to missing data is to delete the records (rows) or variables (columns) that contain incomplete information. This works well when only a few values are missing and they are scattered at random. It is not advisable when a large portion of the data is missing, because deletion can then bias the dataset or shrink it considerably. Removing rows or columns is a viable approach when:

Only a small fraction of the data (about 4%) is missing.
The missing data is not important to your analysis.

Even so, removing rows with missing values is not always beneficial: even in large datasets, it can discard significant patterns along with the gaps.
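
A minimal sketch of removal with pandas follows. The toy data and the 50% column threshold are assumptions made for illustration; the right threshold depends on your dataset.

```python
import numpy as np
import pandas as pd

# Toy dataset (values are made up for the example).
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, 38],
    "income": [40_000, 52_000, np.nan, 61_000, 45_000],
    "notes": [np.nan, np.nan, np.nan, "vip", np.nan],
})

# Share of missing values per column, to judge whether removal is reasonable.
missing_share = df.isna().mean()
print(missing_share)

# Drop columns where more than half of the values are missing (assumed threshold).
mostly_present = df.loc[:, missing_share <= 0.5]

# Drop the remaining rows that still contain a missing value.
complete_rows = mostly_present.dropna(axis=0)
print(complete_rows)
```

Checking the missing share first makes the cost of deletion visible: here two of the five rows are lost, which would be a steep price on a real dataset.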

2. Imputation

Imputation fills in the missing values so that the whole dataset can be used. It can be performed in several ways:

Mean/Median/Mode Imputation: One of the simplest yet effective methods is to replace missing values with the mean, median, or mode of the column. The mean and median apply to numerical data, while the mode also works for categorical data. This is reasonable when the data is MCAR and the imputation does not noticeably change the data distribution.
K-Nearest Neighbors (KNN) Imputation: Each missing value is replaced using the values of the k nearest rows, as measured by a distance metric. This technique suits interval and ratio data, and it can be extended to nominal and ordinal data with a suitable distance measure or encoding.
Regression Imputation: If a feature with missing values can be predicted from other features, a regression model can be built to predict the missing values. Although the predictions carry their own error, this method can be more accurate than mean or median imputation.
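
The first two options can be sketched with scikit-learn's SimpleImputer and KNNImputer. The small array and the choice of n_neighbors=2 are assumptions made to keep the example easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numerical data with one gap in each column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
    [5.0, 50.0],
])

# Median imputation: each gap is replaced by the median of its column.
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# KNN imputation: each gap is replaced by the average of that feature
# over the 2 nearest rows, using a distance that skips missing entries.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

print("median:\n", X_median)
print("knn:\n", X_knn)
```

Median imputation gives every gap in a column the same value, while KNN imputation adapts the fill to the rows that look most similar. For categorical columns, SimpleImputer also accepts strategy="most_frequent".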

Other techniques, such as Multiple Imputation and MICE (Multiple Imputation by Chained Equations), can reduce the bias that simple imputation may introduce.
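
As a rough sketch of the idea, scikit-learn offers IterativeImputer, an experimental estimator inspired by MICE. The synthetic data, the five repetitions, and the averaging step below are assumptions for illustration; this is not a full multiple-imputation workflow.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the third column is strongly related to the first.
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = 2.0 * X_full[:, 0] + rng.normal(scale=0.1, size=200)

# Hide roughly 20% of the third column.
X = X_full.copy()
X[rng.random(200) < 0.2, 2] = np.nan

# Draw several plausible completed datasets by sampling from the
# predictive distribution with different random seeds.
imputations = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputations.append(imputer.fit_transform(X))

# Averaging the completed datasets is a simplification used here only
# to inspect the result.
pooled = np.mean(imputations, axis=0)
print(pooled[:5])
```

In a proper multiple-imputation analysis, the downstream model is usually fitted on each completed dataset separately and the resulting estimates are then combined, so that the uncertainty about the missing values carries through to the final result.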

3. Using Algorithms that Handle Missing Data

Some machine learning algorithms, including certain implementations of random forests and gradient boosting, handle missing data internally, so the user does not have to. These algorithms build decision trees, which can split the data on the values that are present and route or ignore the data points where a value is missing. As a result, they may reduce the need for explicit imputation or deletion.

For instance, depending on the implementation, Random Forests deal with missing values either by substituting the most frequent value or by using surrogate splits, which split on an alternative feature when the primary one is absent.

Advanced Methods

4. Predictive Modeling and Machine Learning Imputation

More advanced methods use machine learning models to predict the missing values. For example, in a large dataset where a few columns have many missing values while the remaining attributes are complete, the complete attributes can be used to train a model that predicts the missing values. This approach is common with large datasets and is often preferred because simple forms of imputation can introduce considerable bias.
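
A minimal sketch of this idea follows, assuming a hypothetical housing table in which price has gaps while rooms and area are complete. The column names, the synthetic data, and the choice of RandomForestRegressor are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Hypothetical dataset: price depends on rooms and area.
df = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=n).astype(float),
    "area": rng.normal(80, 20, size=n),
})
df["price"] = 1_000 * df["area"] + 5_000 * df["rooms"] + rng.normal(0, 2_000, size=n)
true_price = df["price"].copy()

# Hide about 30% of the prices.
df.loc[rng.random(n) < 0.3, "price"] = np.nan

features = ["rooms", "area"]
known = df["price"].notna()

# Train on rows where the price is observed.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[known, features], df.loc[known, "price"])

# Predict the price only for rows where it is missing.
df_imputed = df.copy()
df_imputed.loc[~known, "price"] = model.predict(df.loc[~known, features])

# Compare against a plain mean fill on the hidden values.
model_mae = (df_imputed.loc[~known, "price"] - true_price[~known]).abs().mean()
mean_mae = (df.loc[known, "price"].mean() - true_price[~known]).abs().mean()
print(f"model MAE: {model_mae:.0f}, mean-fill MAE: {mean_mae:.0f}")
```

The comparison is only possible here because the true values were hidden on purpose. On real data, a similar check can be approximated by masking some observed values and measuring how well the model recovers them.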

Conclusion

Managing missing data is a crucial preprocessing step in machine learning, and the choice of method depends on the amount and type of missing data in the dataset. Whether you choose to delete, impute, or use algorithms that handle missing data, it is important to understand how those decisions affect your final model. Poor management of missing data can introduce bias, hurt the performance of the model, and lead to wrong predictions.

For more details on handling missing data and other preprocessing steps in machine learning, read up on data preprocessing techniques.

By applying the methods described above, you can improve the performance of your machine learning models even when the data is incomplete.

Shivani Singh
