
How to deal with missing data

In data science, it’s rare to come across a perfect dataset; data is frequently of poor quality. It may have been recorded incorrectly, or it may be missing completely, often because of poor data collection or storage methodologies.

Missing data requires robust strategies to improve data quality; otherwise, analysis will likely provide little value. As the saying goes: garbage in, garbage out.

This post looks at some of the ways we can process missing data without skewing the dataset, allowing us to produce valuable insight.

Identification

The first step in the process is to identify exactly what is missing from the dataset. Missing data can be categorised into the following groups:

  • MCAR – Missing Completely at Random
  • MAR – Missing at Random
  • MNAR – Missing Not at Random

Here is an example of a complete dataset. We will show how this dataset looks with missing data from each category.

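| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | 47 | Male | £50,400 |
| Mary | White | 48 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 33 | Female | £30,300 |
| Sarah | McGeady | 22 | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | 39 | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | 41 | Female | £40,100 |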

MCAR

This data is missing from the dataset with no discernible pattern. This means that any features can have missing values. This is a rare case and leaves the data unbiased. This is usually caused by errors in collection or storage. Below we can see that there is no obvious pattern to where missing data occurs.

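| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | |
| | Smith | | Male | £50,400 |
| Mary | White | 48 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 33 | Female | £30,300 |
| Sarah | | 22 | | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | | 39 | Female | £36,000 |
| George | Simpson | 23 | | £24,200 |
| Emily | Jones | 41 | Female | |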

MAR

This data is missing, but the missingness can be explained by another feature in the dataset. This often occurs during data collection. For example, older people may be less likely to disclose their earnings, in which case missing salaries would be explained by age. In the example below, 4 of the 10 ages are missing, and all of them belong to female respondents, so the missingness is highly correlated with gender. This can lead to bias in the data, and so must be dealt with appropriately.

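| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | 47 | Male | £50,400 |
| Mary | White | | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | | Female | £30,300 |
| Sarah | McGeady | | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | 41 | Female | £40,100 |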

MNAR

This data is missing from the dataset and has a discernible pattern, but the pattern depends on the missing values themselves, so it cannot be explained without knowledge of the missing data. In the example below, older respondents have not disclosed their age. Without prior knowledge of this feature, this pattern would be hard to spot. This will lead to bias towards younger people in the dataset, so it must be dealt with appropriately.

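| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | | Male | £50,400 |
| Mary | White | | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 33 | Female | £30,300 |
| Sarah | McGeady | 22 | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | 39 | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | | Female | £40,100 |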

The first two situations are relatively simple to identify:

  • Calculate the missing value rate of the whole dataset (e.g., 10% missing values).
  • Group the dataset by different features and assess whether the missing value rate in each group is reasonably consistent with the total missing value rate.
  • If it is, we can assume MCAR; if not, MAR.
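The steps above can be sketched in a few lines of plain Python. This is a minimal illustration using the MAR example table, not a definitive implementation; the helper names `missing_rate` and `missing_rate_by_group` are made up for this sketch, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict

# MAR example from above; None marks a missing age.
records = [
    {"name": "John", "gender": "Male", "age": 28},
    {"name": "Frank", "gender": "Male", "age": 47},
    {"name": "Mary", "gender": "Female", "age": None},
    {"name": "Kevin", "gender": "Male", "age": 35},
    {"name": "Jane", "gender": "Female", "age": None},
    {"name": "Sarah", "gender": "Female", "age": None},
    {"name": "Gareth", "gender": "Male", "age": 26},
    {"name": "Sally", "gender": "Female", "age": None},
    {"name": "George", "gender": "Male", "age": 23},
    {"name": "Emily", "gender": "Female", "age": 41},
]


def missing_rate(rows, field):
    """Fraction of rows where `field` is missing (None)."""
    return sum(row[field] is None for row in rows) / len(rows)


def missing_rate_by_group(rows, field, group_field):
    """Missing rate of `field` within each value of `group_field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row)
    return {key: missing_rate(members, field) for key, members in groups.items()}


overall = missing_rate(records, "age")
by_gender = missing_rate_by_group(records, "age", "gender")
print(overall)    # 0.4
print(by_gender)  # {'Male': 0.0, 'Female': 0.8}
```

The per-gender rates (0% and 80%) are far from the overall rate of 40%, which points towards MAR rather than MCAR. How large a gap counts as "reasonably consistent" is a judgement call that depends on the size of the dataset.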

The third situation is more complex. It requires domain knowledge of the features within the dataset. We can compare the distribution of non-missing values to the expected distribution of values. In the above example, we know that we had recorded people of all ages, so finding no values greater than 40 is a red flag. There is no one-size-fits-all process for identifying this situation, as it depends on the dataset, and it can sometimes require further data collection.

Solutions

There are two main groups of solutions to missing data: deletion and imputation. The best method to choose depends on multiple factors, such as the amount of data you have, whether the missing values are MCAR, MAR, or MNAR, and the amount of time you would like to spend on the process. We will look at a few of the most popular methods and their pros and cons.

Deletion

By far the simplest way of handling missing data is to delete anything that’s missing from the dataset. Deletion can be broken down into two methods:

Row Deletion

We can delete any individual record that contains missing data. This should only be done with MCAR data; otherwise, we risk introducing bias to the dataset. The cleaned MCAR data would look as follows:

| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| Mary | White | 48 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 33 | Female | £30,300 |
| Gareth | Hancock | 26 | Male | £28,000 |

Pros:

  • Simple operation, requires little effort or computation

Cons:

  • Can greatly reduce the dataset if there’s a high level of missing data
  • Can only be performed on MCAR data, which is the least common type of missing data
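As a rough sketch of row deletion, the MCAR example can be filtered in plain Python by keeping only the records with no missing field. The tuple layout and the use of `None` for missing cells are assumptions made for this illustration.

```python
# MCAR example from above as (first name, surname, age, gender, salary).
# None marks a missing cell.
rows = [
    ("John", "Doe", 28, "Male", None),
    (None, "Smith", None, "Male", 50400),
    ("Mary", "White", 48, "Female", 46900),
    ("Kevin", "Park", 35, "Male", 40500),
    ("Jane", "Johnson", 33, "Female", 30300),
    ("Sarah", None, 22, None, 17900),
    ("Gareth", "Hancock", 26, "Male", 28000),
    ("Sally", None, 39, "Female", 36000),
    ("George", "Simpson", 23, None, 24200),
    ("Emily", "Jones", 41, "Female", None),
]

# Keep a row only if every one of its fields is present.
complete_rows = [row for row in rows if all(value is not None for value in row)]

print([row[0] for row in complete_rows])  # ['Mary', 'Kevin', 'Jane', 'Gareth']
```

Only 4 of the 10 records survive, which illustrates how quickly row deletion can shrink a dataset. If the data lives in a pandas DataFrame, `DataFrame.dropna()` performs the same kind of row-wise filtering.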

Column Deletion

We can delete any column that contains missing data. This can be done with any type of missing data, but greatly reduces potential for insights. The cleaned MNAR data would look as follows:

| First name | Surname | Gender | Salary |
|---|---|---|---|
| John | Doe | Male | £29,200 |
| Frank | Smith | Male | £50,400 |
| Mary | White | Female | £46,900 |
| Kevin | Park | Male | £40,500 |
| Jane | Johnson | Female | £30,300 |
| Sarah | McGeady | Female | £17,900 |
| Gareth | Hancock | Male | £28,000 |
| Sally | Robbins | Female | £36,000 |
| George | Simpson | Male | £24,200 |
| Emily | Jones | Female | £40,100 |

Pros:

  • Simple operation, requires little effort or computation
  • Can be used on MNAR/MAR/MCAR data.

Cons:

  • Can greatly reduce the dataset if there’s a high level of missing data
  • Reduces potential for insights.
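Column deletion can be sketched in the same way: find the columns that contain at least one missing value and drop them from every record. The column names and the `None` marker below are assumptions for this illustration, applied to the MNAR example.

```python
columns = ["first_name", "surname", "age", "gender", "salary"]

# MNAR example from above; None marks a missing age.
rows = [
    ("John", "Doe", 28, "Male", 29200),
    ("Frank", "Smith", None, "Male", 50400),
    ("Mary", "White", None, "Female", 46900),
    ("Kevin", "Park", 35, "Male", 40500),
    ("Jane", "Johnson", 33, "Female", 30300),
    ("Sarah", "McGeady", 22, "Female", 17900),
    ("Gareth", "Hancock", 26, "Male", 28000),
    ("Sally", "Robbins", 39, "Female", 36000),
    ("George", "Simpson", 23, "Male", 24200),
    ("Emily", "Jones", None, "Female", 40100),
]

# Indices of the columns in which no row has a missing value.
keep = [i for i in range(len(columns)) if all(row[i] is not None for row in rows)]

kept_columns = [columns[i] for i in keep]
cleaned = [tuple(row[i] for i in keep) for row in rows]

print(kept_columns)  # ['first_name', 'surname', 'gender', 'salary']
print(cleaned[1])    # ('Frank', 'Smith', 'Male', 50400)
```

All 10 records are kept, but the age feature is lost entirely. With a pandas DataFrame, `DataFrame.dropna(axis="columns")` gives the equivalent result.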

Imputation

Imputation is more complex than simply deleting data, and ranges from a simple mean imputation up to more complicated methods such as k-nearest neighbours. Let’s look at some imputation methods in more detail:

Mean/Median/Mode Imputation

Mean/median/mode imputation fills each missing value with an average of the values that are available for that feature. Any of the three averages can be chosen, depending on the dataset. The cleaned MAR data looks as follows (using the mean):

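| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | 47 | Male | £50,400 |
| Mary | White | 33.3 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 33.3 | Female | £30,300 |
| Sarah | McGeady | 33.3 | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | 33.3 | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | 41 | Female | £40,100 |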

Pros:

  • Simple operation, requires little effort or computation
  • Can be used on MAR/MCAR data.

Cons:

  • Reduces dataset variance.
  • Cannot be used on MNAR data.
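A minimal sketch of mean imputation on the age column of the MAR example, using Python's standard `statistics` module. Ages are listed in table order, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Age column of the MAR example, in table order; None marks a missing age.
ages = [28, 47, None, 35, None, None, 26, None, 23, 41]

# Average the observed values only, then use that value for every gap.
observed = [age for age in ages if age is not None]
fill_value = mean(observed)

imputed = [fill_value if age is None else age for age in ages]

print(round(fill_value, 1))  # 33.3
```

Swapping `mean` for `statistics.median` or `statistics.mode` gives the other two variants. Because every gap receives the same value, the spread of the imputed column is smaller than that of the true ages, which is the variance reduction listed in the cons above.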

Regression Imputation

Linear regression uses the relationships between features in the data to calculate expected values. I won’t go into detail on how to calculate it, but you can read more on the process here. Certain conditions are required to use regression; otherwise, it may provide a worse estimate than the mean. We can see the cleaned MAR data below, whose estimates are, overall, closer to the true ages in the complete dataset than the mean imputation above (for example, 43.9 rather than 33.3 for Mary, whose true age is 48).

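| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | 47 | Male | £50,400 |
| Mary | White | 43.9 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 28.6 | Female | £30,300 |
| Sarah | McGeady | 17.2 | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | 33.9 | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | 41 | Female | £40,100 |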

Pros:

  • Takes multiple features of the dataset into account, so produces a better imputed estimate
  • Can be used on MAR/MCAR data.
  • Can choose which features to use.

Cons:

  • Requires more effort and computation
  • Requires a linear relationship between the variables used.

There are further types of regression that can be used, such as logistic regression, multinomial regression, polynomial regression, and more, depending on the type of data and its relationships.
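To make regression imputation concrete, here is a minimal sketch that fits a least-squares line of age against salary on the records with a known age, then predicts the missing ages. Using salary as the only predictor is an assumption made for this sketch; under that assumption, the predictions round to the values in the regression table above.

```python
# MAR example as (first name, salary, age); None marks a missing age.
people = [
    ("John", 29200, 28), ("Frank", 50400, 47), ("Mary", 46900, None),
    ("Kevin", 40500, 35), ("Jane", 30300, None), ("Sarah", 17900, None),
    ("Gareth", 28000, 26), ("Sally", 36000, None), ("George", 24200, 23),
    ("Emily", 40100, 41),
]

# Fit on the complete records only.
known = [(salary, age) for _, salary, age in people if age is not None]
n = len(known)
mean_salary = sum(salary for salary, _ in known) / n
mean_age = sum(age for _, age in known) / n

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
slope = sum((s - mean_salary) * (a - mean_age) for s, a in known) / sum(
    (s - mean_salary) ** 2 for s, _ in known
)
intercept = mean_age - slope * mean_salary

# Predict an age for every record where it is missing.
imputed = {
    name: round(intercept + slope * salary, 1)
    for name, salary, age in people
    if age is None
}

print(imputed)  # {'Mary': 43.9, 'Jane': 28.6, 'Sarah': 17.2, 'Sally': 33.9}
```

With more predictors, or with categorical features such as gender, a library implementation of regression would be a more practical choice than fitting by hand.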

K-Nearest Neighbours Imputation

K-Nearest Neighbours selects the records most similar to the record with the missing value, and averages across them. The user must select the number of nearest neighbours and the distance metric (how similarity is determined). You can read more about the workings of this method here. We can see the cleaned MAR data below.

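| First name | Surname | Age | Gender | Salary |
|---|---|---|---|---|
| John | Doe | 28 | Male | £29,200 |
| Frank | Smith | 47 | Male | £50,400 |
| Mary | White | 41.0 | Female | £46,900 |
| Kevin | Park | 35 | Male | £40,500 |
| Jane | Johnson | 25.7 | Female | £30,300 |
| Sarah | McGeady | 25.7 | Female | £17,900 |
| Gareth | Hancock | 26 | Male | £28,000 |
| Sally | Robbins | 33.9 | Female | £36,000 |
| George | Simpson | 23 | Male | £24,200 |
| Emily | Jones | 41 | Female | £40,100 |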

Pros:

  • Takes multiple features of the dataset into account, so produces a better imputed estimate
  • Can be used on MAR/MCAR data.
  • Can choose which features to use.

Cons:

  • Requires more effort and computation
  • Requires thought around distance metric and number of neighbours
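The idea can be sketched in plain Python as follows. Both key choices here are assumptions made for illustration: three neighbours (`K = 3`) and absolute salary difference as the distance metric. They are not necessarily the settings used for the table above, so the output need not match it exactly.

```python
K = 3  # number of neighbours; an illustrative choice

# MAR example as (first name, salary, age); None marks a missing age.
people = [
    ("John", 29200, 28), ("Frank", 50400, 47), ("Mary", 46900, None),
    ("Kevin", 40500, 35), ("Jane", 30300, None), ("Sarah", 17900, None),
    ("Gareth", 28000, 26), ("Sally", 36000, None), ("George", 24200, 23),
    ("Emily", 40100, 41),
]

# Only records with a known age can act as neighbours.
known = [(salary, age) for _, salary, age in people if age is not None]


def knn_age(salary, k=K):
    """Average age of the k known records closest in salary."""
    nearest = sorted(known, key=lambda pair: abs(pair[0] - salary))[:k]
    return sum(age for _, age in nearest) / len(nearest)


imputed = {
    name: round(knn_age(salary), 1)
    for name, salary, age in people
    if age is None
}
print(imputed)
```

Changing `K` or the distance metric (for example, adding gender to the comparison) changes which records count as neighbours, and therefore the imputed values, which is why both need careful thought.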

Timeseries Imputation

When we are dealing with missing values from timeseries datasets, there are a few more strategies we can consider. Here is an example timeseries dataset:

| Timestamp | Value |
|---|---|
| 12:00 | 11 |
| 12:05 | 12 |
| 12:10 | 14 |
| 12:15 | 14 |
| 12:20 | 16 |
| 12:25 | 18 |
| 12:30 | 19 |

Forward Fill

Forward filling makes a forward pass over the dataset, row by row, starting with the earliest timestamp. Any missing value is filled in with the most recent actual value before it. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:

| Timestamp | Value |
|---|---|
| 12:00 | 11 |
| 12:05 | 11 |
| 12:10 | 11 |
| 12:15 | 14 |
| 12:20 | 14 |
| 12:25 | 18 |
| 12:30 | 19 |

Pros:

  • Easy to implement
  • Can be used on MAR/MCAR data.

Cons:

  • Reduces variance of dataset
  • Ignores variables other than timestamp
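A forward fill is short enough to write by hand. The sketch below assumes the values are already sorted by timestamp and that `None` marks a missing reading; the function name `forward_fill` is made up for this example.

```python
# Values from 12:00 to 12:30 with 12:05, 12:10 and 12:20 missing.
values = [11, None, None, 14, None, 18, 19]


def forward_fill(series):
    """Replace each missing value with the last observed value before it."""
    filled = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


print(forward_fill(values))  # [11, 11, 11, 14, 14, 18, 19]
```

Note that a gap at the very start of the series has no earlier value to copy, so it stays missing. In pandas, `Series.ffill()` behaves the same way.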

Backward Fill

Backward filling makes a backward pass over the dataset, in the opposite direction to a forward fill, starting with the latest timestamp. This time, any missing value is filled in with the next actual value after it. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:

| Timestamp | Value |
|---|---|
| 12:00 | 11 |
| 12:05 | 14 |
| 12:10 | 14 |
| 12:15 | 14 |
| 12:20 | 18 |
| 12:25 | 18 |
| 12:30 | 19 |

Pros:

  • Easy to implement
  • Can be used on MAR/MCAR data.

Cons:

  • Reduces variance of dataset
  • Ignores variables other than timestamp
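A backward fill can be sketched as the mirror image: walk the series from the latest timestamp to the earliest, carrying the most recently seen value. As before, the values are assumed to be sorted by timestamp with `None` for missing readings, and `backward_fill` is a name chosen for this example.

```python
# Values from 12:00 to 12:30 with 12:05, 12:10 and 12:20 missing.
values = [11, None, None, 14, None, 18, 19]


def backward_fill(series):
    """Replace each missing value with the next observed value after it."""
    filled = []
    next_seen = None
    for value in reversed(series):
        if value is not None:
            next_seen = value
        filled.append(next_seen)
    # The pass ran latest-to-earliest, so restore chronological order.
    return filled[::-1]


print(backward_fill(values))  # [11, 14, 14, 14, 18, 18, 19]
```

Here a gap at the very end of the series stays missing, since there is no later value to copy. The pandas equivalent is `Series.bfill()`.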

Linear Interpolation

Linear interpolation takes the difference between the values on either side of the missing data and assumes a linear increment for each time step. This is similar to linear regression, in that we are fitting a straight line between the two points. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:

| Timestamp | Value |
|---|---|
| 12:00 | 11 |
| 12:05 | 12 |
| 12:10 | 13 |
| 12:15 | 14 |
| 12:20 | 16 |
| 12:25 | 18 |
| 12:30 | 19 |

Pros:

  • Easy to implement
  • Can be used on MAR/MCAR data.
  • Generally, it’s more accurate than forward/backward fill.

Cons:

  • Ignores variables other than timestamp
  • Assumes linear relationship between points.
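The following sketch shows linear interpolation in plain Python. It assumes evenly spaced timestamps (every 5 minutes, as in the example), so the position in the list can stand in for time; with irregular timestamps, the step would need to be weighted by the actual time gaps. The function name `interpolate_linear` is made up for this example.

```python
# Values from 12:00 to 12:30 with 12:05, 12:10 and 12:20 missing.
values = [11, None, None, 14, None, 18, 19]


def interpolate_linear(series):
    """Fill gaps between observed values along a straight line."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    # Walk over each pair of consecutive observed points.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


print(interpolate_linear(values))  # [11, 12.0, 13.0, 14, 16.0, 18, 19]
```

Gaps before the first or after the last observed value are left untouched, since there is no second point to draw a line to. In pandas, `Series.interpolate()` uses linear interpolation by default.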

Conclusion

Through human or machine error, missing data is always a possibility. Awareness of this, and planning how to deal with it, can help mitigate its effects. We have covered some of the most common ways of identifying, categorising, and dealing with missing data. As always, there is no one-size-fits-all method, and care should be taken when choosing a strategy.

If you’d like to learn more about how we ensure robust data collection strategies, get in touch at hello@harksys.com.
