MISSING VALUE TREATMENT 

Treatment of missing values is one of the most important steps in data processing. It involves identifying missing values and handling them so that as little data as possible is lost. Missing values must be treated before the data is used for modeling, because they make it harder for the model to learn and can lead us to draw incorrect conclusions from it, resulting in inaccurate forecasts and classifications.

Missing values appear in data for various reasons. They can arise during data extraction and data collection, and human error on the part of the person collecting the data can also leave gaps. In survey data, for instance, a value may be missing because the respondent chose not to answer a question, gave an unusable response, or was asked for information they were not inclined to share.

Three types of Missing Values 

Missing values are broadly classified into MCAR, MAR, and NMAR.

MCAR: Missing Completely at Random means that the missing values occur entirely at random; whether a value is missing does not depend on the observed values of any variable or on the value that is missing.

MAR: Missing at Random occurs when the missing values still look random, but there is a connection: the probability that a value is missing can depend on the observed values of other variables, though not on the actual value that is missing.

NMAR: Not Missing at Random means the missing values follow a pattern: the probability that a value is missing depends on the (unobserved) value itself, so the missingness cannot be explained by the other observed variables alone.

Methods to treat Missing Values 

Ignoring and discarding data 

There are two ways to discard records with missing values: list-wise deletion and pairwise deletion.

List-wise deletion: List-wise deletion is the simplest way to treat missing values: records (rows) that contain any missing value are removed from the data. Its biggest drawback is that it can cause substantial data loss, particularly when many records in the dataset contain missing values.
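A minimal sketch of list-wise deletion with pandas; the DataFrame and its columns are hypothetical examples, not from the original text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Asha", "Ben", "Cara", "Dev"],
    "Marks":  [78, np.nan, 91, 85],
    "Gender": ["F", "M", None, "M"],
})

# List-wise deletion: drop every row that contains at least one missing value.
df_listwise = df.dropna()
print(df_listwise)  # only the fully observed rows remain
```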

Pairwise Deletion: Pairwise deletion is also referred to as available-case analysis because each analysis uses every case in which the variables involved are observed. For example, when computing correlation coefficients, each pair of variables uses all the rows where both values are present. The benefit of this method is that less information is lost than with list-wise deletion; the disadvantage is that different variables (and different pairs) end up with different sample sizes.
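A short illustration of pairwise deletion, again on made-up data: pandas' corr() drops missing values pair by pair, so each correlation coefficient is computed from all rows available for that particular pair of variables.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":    [23, 31, np.nan, 45, 52],
    "Income": [40, np.nan, 55, 70, 80],
    "Cars":   [1, 1, 2, np.nan, 3],
})

# Each pairwise correlation uses its own available sample.
print(df.corr())

# Differing non-missing counts per column illustrate the drawback:
# the effective sample size varies from pair to pair.
print(df.count())
```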

Mean/ Median/ Mode Imputation 

With these imputation methods, the missing values are replaced by a summary statistic such as the mean, median, or mode. The mean or median is used for numerical variables, while the mode is used to impute missing values in categorical variables. For instance, suppose a dataset has the variables Name (name of the student), Marks (marks of the student), and Gender (gender of the student), and missing values are found in Marks and Gender. The mean of Marks can replace the missing marks, and the mode of Gender can fill the missing genders; this kind of mean imputation is referred to as Generalized Imputation. If instead we calculate a separate mean for male students and for female students, i.e. we take the student's gender into account and fill the missing marks separately for males and females, this is known as Similar Case Imputation.
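A sketch of both variants on the hypothetical Name / Marks / Gender data; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":   ["Asha", "Ben", "Cara", "Dev", "Eli"],
    "Marks":  [78, np.nan, 91, 85, np.nan],
    "Gender": ["F", "M", "F", "M", "M"],
})

# Generalized imputation: one overall mean for Marks, overall mode for Gender.
df_general = df.copy()
df_general["Marks"] = df_general["Marks"].fillna(df_general["Marks"].mean())
df_general["Gender"] = df_general["Gender"].fillna(df_general["Gender"].mode()[0])

# Similar case imputation: a separate mean of Marks within each Gender group.
df_similar = df.copy()
df_similar["Marks"] = df_similar.groupby("Gender")["Marks"].transform(
    lambda s: s.fillna(s.mean())
)
```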

Use of Prediction Models 

Here, a supervised model is built to predict the missing values. The data is split into train and test sets, where the training set contains the rows with no missing values in the target variable and the test set contains the rows where it is missing. The training data is used to fit a model for the target variable, and the fitted model then predicts the missing values in the test rows. Modeling techniques such as Linear Regression (for missing values in continuous variables), Logistic Regression (for missing values in categorical variables), and so on can be employed. The major drawback of this procedure is that if there is no relationship between the variable with missing values and the predictor variables, the missing values cannot be predicted correctly, so we must assume that the attributes are connected (correlated). Furthermore, the predicted values tend to be better behaved than real observations, so this kind of method understates the noise or randomness in the data.
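A minimal sketch of model-based imputation with linear regression, assuming a numeric column with missing values ("Marks") and fully observed predictors ("Attendance", "Hours"); all names and values here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Attendance": [60, 75, 80, 90, 95, 70],
    "Hours":      [2, 4, 5, 6, 8, 3],
    "Marks":      [55, np.nan, 72, 81, np.nan, 60],
})

train = df[df["Marks"].notna()]   # rows where the target is observed
test = df[df["Marks"].isna()]     # rows whose target must be imputed

# Fit on the observed rows, then predict the missing marks.
model = LinearRegression().fit(train[["Attendance", "Hours"]], train["Marks"])
df.loc[df["Marks"].isna(), "Marks"] = model.predict(test[["Attendance", "Hours"]])
```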

K-Nearest Neighbour as an Imputation Method 

KNN is described in detail in the section on Supervised Modeling. The K-Nearest Neighbour algorithm can also be used to estimate and replace missing data: a missing value is imputed from the observations closest to it, where closeness is measured on the other characteristics.

For instance, suppose a dataset contains three variables: Age, Income, and Number of cars owned, and the third variable has missing values. With KNN we look at the age and income of the observation with the missing value, find the observations nearest to it on those variables, and fill the gap from its neighbours, on the assumption that people of similar age and income own a similar number of cars. In other words, the missing value is approximated from the values of the points nearest to it with respect to the other variables. For categorical variables, the most frequent value among the K nearest neighbours is used; for continuous variables, the average of the values of the K nearest neighbours is used.
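A sketch using scikit-learn's KNNImputer on made-up Age / Income / Cars data; it fills a missing entry with the average of its K nearest neighbours, which matches the continuous case described above (for categorical variables one would take the neighbours' mode instead).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age":    [25, 27, 26, 45, 47, 46],
    "Income": [40, 42, 41, 90, 95, 92],
    "Cars":   [1, 1, np.nan, 2, 2, np.nan],
})

# Neighbours are found on the observed columns (Age, Income);
# each missing Cars value is filled with the average of its 2 nearest neighbours.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```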

This is an extremely time-consuming method, especially when the dataset is large, and the choice of K strongly affects the imputed values. To address this, cross-validation or a small experiment may be needed to assess how well the K-Nearest Neighbour imputation works for different values of K, which makes the procedure even more time-consuming.
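One hedged way to compare values of K is to hide some known values, impute them with different K, and measure the error; the data, the K grid, and the masking scheme below are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
full = pd.DataFrame({
    "Age":    rng.integers(20, 60, 200),
    "Income": rng.normal(60, 15, 200),
})
full["Cars"] = (full["Income"] // 30).astype(float)  # a crude, assumed relationship

# Hide 30 known Cars values so we can score the imputation against the truth.
masked = full.copy()
hide = rng.choice(len(full), size=30, replace=False)
masked.loc[hide, "Cars"] = np.nan

for k in (1, 3, 5, 10):
    imputed = KNNImputer(n_neighbors=k).fit_transform(masked)
    err = np.abs(imputed[hide, 2] - full.loc[hide, "Cars"]).mean()
    print(f"K={k:2d}  mean absolute error={err:.3f}")
```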

Missing values are common, particularly when working with large amounts of data. Since they can hurt a model's performance, treating them before modeling is essential. Methods range from simply dropping records or variables to more complex approaches that use predictive models, and the choice depends on the degree of accuracy needed. Outliers should be dealt with before missing value imputation; they were covered in the blog's previous article.
