Missing data requires robust handling strategies to protect data quality; otherwise, analysis will likely provide little value. As the saying goes: garbage in, garbage out.
This post will look at some of the ways we can process missing data without skewing the dataset, allowing us to produce valuable insight.
Identification
The first step in the process is to identify exactly what is missing from the dataset. Missing data can be categorised into the following groups:
- MCAR – Missing Completely at Random
- MAR – Missing at Random
- MNAR – Missing Not at Random
Here is an example of a complete dataset. We will show how this dataset looks with missing data from each category.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | McGeady | 22 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 39 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
MCAR
This data is missing from the dataset with no discernible pattern, meaning any feature can have missing values. This is a rare case and leaves the data unbiased; it is usually caused by errors in collection or storage. Below we can see that there is no obvious pattern to where missing data occurs.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | |
 | Smith | | Male | £ 50,400 |
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | | 22 | | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | | 39 | Female | £ 36,000 |
George | Simpson | 23 | | £ 24,200 |
Emily | Jones | 41 | Female | |
MAR
This data is missing, but its absence can be explained by another feature in the dataset. This often occurs during data collection; for example, older people may be less likely to disclose their earnings. In the example below, almost half of the age data is missing, and the missingness is highly correlated with gender. This can lead to bias in the data, so it must be dealt with appropriately.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | | Female | £ 30,300 |
Sarah | McGeady | | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
MNAR
This data is missing from the dataset with a discernible pattern, but the pattern cannot be explained without knowledge of the missing values themselves. In the example below, older respondents have not disclosed their age. Without prior knowledge of this feature, the pattern would be hard to spot. This will clearly bias the dataset towards younger people, so it must be dealt with appropriately.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | | Male | £ 50,400 |
Mary | White | | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | McGeady | 22 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 39 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | | Female | £ 40,100 |
The first two situations are simple to identify:
- Calculate the missing-value rate of the whole dataset (e.g., 10% missing values).
- Group the dataset by different features and assess whether each group's missing-value rate is reasonably consistent with the overall rate.
- If it is, we can assume MCAR; if not, MAR.
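As a sketch, these checks take only a few lines in pandas. The fragment below is a hypothetical slice of the MAR example above, with Age missing mostly for female respondents:

```python
import pandas as pd

# Hypothetical MAR fragment: Age is missing, mostly for female respondents.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female",
               "Female", "Male", "Female", "Male", "Female"],
    "Age": [28, 47, None, 35, None, None, 26, None, 23, 41],
})

# Overall missing-value rate for the Age feature.
overall_rate = df["Age"].isna().mean()  # 0.4 (4 of 10 values missing)

# Missing-value rate grouped by another feature. A large gap between
# groups (here 0.8 vs 0.0) suggests MAR rather than MCAR.
rate_by_gender = df["Age"].isna().groupby(df["Gender"]).mean()
```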
The third situation is more complex: it requires domain knowledge of the features within the dataset. We can compare the distribution of the non-missing values to the expected distribution of values. In the example above, we know that we recorded people of all ages, so finding no values greater than 40 is a red flag. There is no one-size-fits-all process for identifying this situation, as it depends on the dataset, and it can sometimes require further data collection.
Solutions
There are two main groups of solutions to missing data: deletion and imputation. The best method depends on multiple factors, such as the amount of data you have; whether the missing values are MCAR, MAR, or MNAR; the amount of time you want to spend on the process; and more. We will look at a few of the most popular methods and weigh their pros and cons.
Deletion
By far the simplest way of handling missing data is to delete anything that’s missing from the dataset. Deletion can be broken down into two methods:
Row Deletion
We can delete any individual record that contains missing data. This should only be done with MCAR data; otherwise we risk introducing bias into the dataset. The cleaned MCAR data would look as follows:
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Pros:
- Simple operation, requires little effort or computation
Cons:
- Can greatly reduce the dataset if there’s a high level of missing data
- Can only be performed on MCAR data, which is the least common type of missing data
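In pandas, row deletion is a one-liner. The fragment below is a hypothetical illustration:

```python
import pandas as pd

# Hypothetical MCAR fragment: gaps scattered across rows and columns.
df = pd.DataFrame({
    "First name": ["John", None, "Mary", "Kevin"],
    "Age": [28, 47, 48, None],
    "Salary": [None, 50400, 46900, 40500],
})

# dropna() removes every row containing at least one missing value.
cleaned = df.dropna()  # only Mary's fully observed row survives
```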
Column Deletion
We can delete any column that contains missing data. This can be done with any type of missing data, but greatly reduces potential for insights. The cleaned MNAR data would look as follows:
First name | Surname | Gender | Salary |
---|---|---|---|
John | Doe | Male | £ 29,200 |
Frank | Smith | Male | £ 50,400 |
Mary | White | Female | £ 46,900 |
Kevin | Park | Male | £ 40,500 |
Jane | Johnson | Female | £ 30,300 |
Sarah | McGeady | Female | £ 17,900 |
Gareth | Hancock | Male | £ 28,000 |
Sally | Robbins | Female | £ 36,000 |
George | Simpson | Male | £ 24,200 |
Emily | Jones | Female | £ 40,100 |
Pros:
- Simple operation, requires little effort or computation
- Can be used on MNAR/MAR/MCAR data.
Cons:
- Can greatly reduce the dataset if there’s a high level of missing data
- Reduces potential for insights.
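The same pandas call handles column deletion via its `axis` parameter; again a hypothetical fragment:

```python
import pandas as pd

# Hypothetical MNAR fragment: only the Age column has gaps.
df = pd.DataFrame({
    "First name": ["John", "Frank", "Mary"],
    "Age": [28, None, None],
    "Salary": [29200, 50400, 46900],
})

# axis=1 drops every column containing at least one missing value.
cleaned = df.dropna(axis=1)  # Age disappears; all rows are kept
```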
Imputation
Imputation is more complex than simply deleting data, ranging from a simple mean imputation up to a more involved k-nearest neighbours methodology. Let’s look at some imputation methods in more detail:
Mean/Median/Mode Imputation
Mean/median/mode imputation simply replaces each missing value with an average of the values that are available. Any of the three averages can be chosen, depending on the dataset. The cleaned MAR data looks as follows (using the mean):
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 33.3 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33.3 | Female | £ 30,300 |
Sarah | McGeady | 33.3 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.3 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Simple operation, requires little effort or computation
- Can be used on MAR/MCAR data.
Cons:
- Reduces dataset variance.
- Cannot be used on MNAR data.
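A minimal mean-imputation sketch in pandas, using the ages from the MAR example:

```python
import pandas as pd

# Ages from the MAR example; None marks a missing value.
ages = pd.Series([28, 47, None, 35, None, None, 26, None, 23, 41])

mean_age = ages.mean()           # pandas ignores missing values: ~33.3
imputed = ages.fillna(mean_age)  # every gap receives the same value
```

Note how every imputed row gets an identical value, which is exactly why this method reduces the dataset's variance.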
Regression Imputation
Linear regression uses relationships within the data to calculate expected values. I won’t go into detail of how to calculate it, but you can read more on the process here. Certain conditions are required for regression to work well; otherwise it may provide a worse estimate than the mean. We can see the cleaned MAR data below, which clearly has better estimates than the mean imputation above.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 43.9 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 28.6 | Female | £ 30,300 |
Sarah | McGeady | 17.2 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.9 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Takes multiple features of the dataset into account, so produces a better imputed estimate
- Can be used on MAR/MCAR data.
- Can choose which features to use.
Cons:
- Requires more effort and computation
- Requires a linear relationship between the variables used.
There are further types of regression that can be used, such as logistic regression, multinomial regression, polynomial regression, and more, depending on the type of data and its relationships.
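As a rough sketch of the single-predictor case, fitting Age against Salary on the observed rows with NumPy's `polyfit` produces estimates close to those in the table above; a real pipeline would typically use a dedicated library and several predictors:

```python
import numpy as np
import pandas as pd

# MAR example: Age is missing for some rows, Salary is fully observed.
df = pd.DataFrame({
    "Age": [28, 47, None, 35, None, None, 26, None, 23, 41],
    "Salary": [29200, 50400, 46900, 40500, 30300, 17900,
               28000, 36000, 24200, 40100],
})

# Fit Age = slope * Salary + intercept on the fully observed rows.
observed = df.dropna(subset=["Age"])
slope, intercept = np.polyfit(observed["Salary"], observed["Age"], deg=1)

# Predict Age wherever it is missing.
missing = df["Age"].isna()
df.loc[missing, "Age"] = slope * df.loc[missing, "Salary"] + intercept
```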
K-Nearest Neighbours Imputation
K-Nearest Neighbours selects the records most similar to the record with the missing value, and averages across them. The user must select the number of nearest neighbours and the distance metric (how similarity is determined). You can read more about the workings of this method here. We can see the cleaned MAR data below.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 41.0 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 25.7 | Female | £ 30,300 |
Sarah | McGeady | 25.7 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.9 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Takes multiple features of the dataset into account, so produces a better imputed estimate
- Can be used on MAR/MCAR data.
- Can choose which features to use.
Cons:
- Requires more effort and computation
- Requires thought around distance metric and number of neighbours
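A deliberately simplified sketch of the idea, using only salary distance and k = 2; real implementations (e.g. scikit-learn's `KNNImputer`) normalise and combine several features:

```python
# Hypothetical (salary, age) records from the MAR example; None marks
# a missing age. Distance here is simply the absolute salary gap.
records = [
    (29200, 28), (50400, 47), (46900, None), (40500, 35), (30300, None),
    (17900, None), (28000, 26), (36000, None), (24200, 23), (40100, 41),
]

K = 2  # number of nearest neighbours to average over

observed = [(salary, age) for salary, age in records if age is not None]

def knn_impute_age(salary, k=K):
    """Average the ages of the k observed records closest in salary."""
    nearest = sorted(observed, key=lambda rec: abs(rec[0] - salary))[:k]
    return sum(age for _, age in nearest) / k

imputed = [(s, a if a is not None else knn_impute_age(s))
           for s, a in records]
```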
Timeseries Imputation
When we are dealing with missing values from timeseries datasets, there are a few more strategies we can consider. Here is an example timeseries dataset:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 12 |
12:10 | 14 |
12:15 | 14 |
12:20 | 16 |
12:25 | 18 |
12:30 | 19 |
Forward Fill
Forward filling makes a forward pass over the dataset, row by row, starting with the earliest timestamp. Any missing value is filled in with the most recent observed value. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 11 |
12:10 | 11 |
12:15 | 14 |
12:20 | 14 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
Cons:
- Reduces variance of dataset
- Ignores variables other than timestamp
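In pandas this is the `ffill` method; the sketch below uses the example readings:

```python
import pandas as pd

# Timeseries from the example above; None marks the missing readings
# at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Each gap takes the most recent observed value.
filled = values.ffill()  # 11, 11, 11, 14, 14, 18, 19
```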
Backward Fill
Backward filling makes a backward pass over the dataset, in the opposite direction to a forward fill, starting with the latest timestamp. This time, any missing value is filled in with the next observed value. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 14 |
12:10 | 14 |
12:15 | 14 |
12:20 | 18 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
Cons:
- Reduces variance of dataset
- Ignores variables other than timestamp
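The pandas counterpart is `bfill`:

```python
import pandas as pd

# Same example readings, with gaps at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Each gap takes the next observed value.
filled = values.bfill()  # 11, 14, 14, 14, 18, 18, 19
```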
Linear Interpolation
Linear interpolation takes the values on either side of the missing data and assumes a linear increment for each time step; in effect, we fit a straight line between the two known points. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 12 |
12:10 | 13 |
12:15 | 14 |
12:20 | 16 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
- Generally, it’s more accurate than forward/backward fill.
Cons:
- Ignores variables other than timestamp
- Assumes linear relationship between points.
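pandas' `interpolate` method defaults to exactly this behaviour, treating the readings as equally spaced:

```python
import pandas as pd

# Same example readings, with gaps at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Fit a straight line between the observed points on each side of a gap.
filled = values.interpolate()  # 11, 12, 13, 14, 16, 18, 19
```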
Conclusion
Through human or machine error, missing data is always a possibility. Being aware of this, and planning how to deal with it, can help mitigate its effects. We have covered some of the most common ways of identifying, categorising, and dealing with missing data. As always, there is no one-size-fits-all method, and care should be taken when choosing a strategy.
If you’d like to learn more about how we ensure robust data collection strategies, get in touch at hello@harksys.com