Missing data requires robust handling strategies to protect data quality; otherwise, analysis will likely provide little value. As the saying goes: garbage in, garbage out.
This post will look at some of the ways we can process missing data without skewing the dataset, allowing us to produce valuable insight.
Identification
The first step in the process is to identify exactly what is missing from the dataset. Missing data can be categorised into the following groups:
- MCAR – Missing Completely at Random
- MAR – Missing at Random
- MNAR – Missing Not at Random
Here is an example of a complete dataset. We will show how this dataset looks with missing data from each category.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | McGeady | 22 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 39 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
MCAR
This data is missing from the dataset with no discernible pattern, meaning any feature can have missing values. This is a rare case and leaves the data unbiased; it is usually caused by errors in collection or storage. Below we can see that there is no obvious pattern to where missing data occurs.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | |
 | Smith | | Male | £ 50,400 |
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | | 22 | | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | | 39 | Female | £ 36,000 |
George | Simpson | 23 | | £ 24,200 |
Emily | Jones | 41 | Female | |
MAR
This data is missing, but its absence can be explained by another feature in the dataset. This often occurs during data collection; for example, older people may be less likely to disclose their earnings. In the example below, almost half of the age data is missing, and the missingness is highly correlated with gender. This can lead to bias in the data, so it must be dealt with appropriately.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | | Female | £ 30,300 |
Sarah | McGeady | | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
MNAR
This data is missing from the dataset with a discernible pattern, but the pattern cannot be explained without knowledge of the missing values themselves. In the example below, older respondents have not disclosed their age. Without prior knowledge of this feature, the pattern would be hard to spot. This will clearly bias the dataset towards younger people, so it must be dealt with appropriately.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | | Male | £ 50,400 |
Mary | White | | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Sarah | McGeady | 22 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 39 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | | Female | £ 40,100 |
The first two situations are simple to identify:
- Calculate the missing-value rate of the whole dataset (e.g., 10% missing values).
- Group the dataset by different features and assess whether each group's missing-value rate is reasonably consistent with the overall rate.
- If it is, we can assume MCAR; if not, MAR.
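As a sketch, these checks take only a few lines in pandas. The fragment below is a hypothetical slice of the MAR example above, with Age missing mostly for female respondents:

```python
import pandas as pd

# Hypothetical MAR fragment: Age is missing, mostly for female respondents.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female",
               "Female", "Male", "Female", "Male", "Female"],
    "Age": [28, 47, None, 35, None, None, 26, None, 23, 41],
})

# Overall missing-value rate for the Age feature.
overall_rate = df["Age"].isna().mean()  # 0.4 (4 of 10 values missing)

# Missing-value rate grouped by another feature. A large gap between
# groups (here 0.8 vs 0.0) suggests MAR rather than MCAR.
rate_by_gender = df["Age"].isna().groupby(df["Gender"]).mean()
```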
The third situation is more complex: it requires domain knowledge of the features within the dataset. We can compare the distribution of the non-missing values to the expected distribution of values. In the example above, we know that we recorded people of all ages, so finding no values greater than 40 is a red flag. There is no one-size-fits-all process for identifying this situation, as it depends on the dataset, and it can sometimes require further data collection.
Solutions
There are two main groups of solutions to missing data: deletion and imputation. The best method depends on multiple factors, such as the amount of data you have; whether the missing values are MCAR, MAR, or MNAR; the amount of time you want to spend on the process; and more. We will look at a few of the most popular methods and weigh their pros and cons.
Deletion
By far the simplest way of handling missing data is to delete anything that’s missing from the dataset. Deletion can be broken down into two methods:
Row Deletion
We can delete any individual record that contains missing data. This should only be done with MCAR data; otherwise we risk introducing bias into the dataset. The cleaned MCAR data would look as follows:
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
Mary | White | 48 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33 | Female | £ 30,300 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Pros:
- Simple operation, requires little effort or computation
Cons:
- Can greatly reduce the dataset if there’s a high level of missing data
- Can only be performed on MCAR data, which is the least common type of missing data
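In pandas, row deletion is a one-liner. The fragment below is a hypothetical illustration:

```python
import pandas as pd

# Hypothetical MCAR fragment: gaps scattered across rows and columns.
df = pd.DataFrame({
    "First name": ["John", None, "Mary", "Kevin"],
    "Age": [28, 47, 48, None],
    "Salary": [None, 50400, 46900, 40500],
})

# dropna() removes every row containing at least one missing value.
cleaned = df.dropna()  # only Mary's fully observed row survives
```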
Column Deletion
We can delete any column that contains missing data. This can be done with any type of missing data, but greatly reduces potential for insights. The cleaned MNAR data would look as follows:
First name | Surname | Gender | Salary |
---|---|---|---|
John | Doe | Male | £ 29,200 |
Frank | Smith | Male | £ 50,400 |
Mary | White | Female | £ 46,900 |
Kevin | Park | Male | £ 40,500 |
Jane | Johnson | Female | £ 30,300 |
Sarah | McGeady | Female | £ 17,900 |
Gareth | Hancock | Male | £ 28,000 |
Sally | Robbins | Female | £ 36,000 |
George | Simpson | Male | £ 24,200 |
Emily | Jones | Female | £ 40,100 |
Pros:
- Simple operation, requires little effort or computation
- Can be used on MNAR/MAR/MCAR data.
Cons:
- Can greatly reduce the dataset if there’s a high level of missing data
- Reduces potential for insights.
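The same pandas call handles column deletion via its `axis` parameter; again a hypothetical fragment:

```python
import pandas as pd

# Hypothetical MNAR fragment: only the Age column has gaps.
df = pd.DataFrame({
    "First name": ["John", "Frank", "Mary"],
    "Age": [28, None, None],
    "Salary": [29200, 50400, 46900],
})

# axis=1 drops every column containing at least one missing value.
cleaned = df.dropna(axis=1)  # Age disappears; all rows are kept
```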
Imputation
Imputation is more complex than simply deleting data, ranging from a simple mean imputation up to a more involved k-nearest neighbours methodology. Let’s look at some imputation methods in more detail:
Mean/Median/Mode Imputation
Mean/median/mode imputation simply replaces each missing value with an average of the values that are available. Any of the three averages can be chosen, depending on the dataset. The cleaned MAR data looks as follows (using the mean):
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 33.3 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 33.3 | Female | £ 30,300 |
Sarah | McGeady | 33.3 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.3 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Simple operation, requires little effort or computation
- Can be used on MAR/MCAR data.
Cons:
- Reduces dataset variance.
- Cannot be used on MNAR data.
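A minimal mean-imputation sketch in pandas, using the ages from the MAR example:

```python
import pandas as pd

# Ages from the MAR example; None marks a missing value.
ages = pd.Series([28, 47, None, 35, None, None, 26, None, 23, 41])

mean_age = ages.mean()           # pandas ignores missing values: ~33.3
imputed = ages.fillna(mean_age)  # every gap receives the same value
```

Note how every imputed row gets an identical value, which is exactly why this method reduces the dataset's variance.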
Regression Imputation
Linear regression uses relationships within the data to calculate expected values. I won’t go into detail of how to calculate it, but you can read more on the process here. Certain conditions are required for regression to work well; otherwise it may provide a worse estimate than the mean. We can see the cleaned MAR data below, which clearly has better estimates than the mean imputation above.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 43.9 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 28.6 | Female | £ 30,300 |
Sarah | McGeady | 17.2 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.9 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Takes multiple features of the dataset into account, so produces a better imputed estimate
- Can be used on MAR/MCAR data.
- Can choose which features to use.
Cons:
- Requires more effort and computation
- Requires a linear relationship between the variables used.
There are further types of regression that can be used, such as logistic regression, multinomial regression, polynomial regression, and more, depending on the type of data and its relationships.
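As a rough sketch of the single-predictor case, fitting Age against Salary on the observed rows with NumPy's `polyfit` produces estimates close to those in the table above; a real pipeline would typically use a dedicated library and several predictors:

```python
import numpy as np
import pandas as pd

# MAR example: Age is missing for some rows, Salary is fully observed.
df = pd.DataFrame({
    "Age": [28, 47, None, 35, None, None, 26, None, 23, 41],
    "Salary": [29200, 50400, 46900, 40500, 30300, 17900,
               28000, 36000, 24200, 40100],
})

# Fit Age = slope * Salary + intercept on the fully observed rows.
observed = df.dropna(subset=["Age"])
slope, intercept = np.polyfit(observed["Salary"], observed["Age"], deg=1)

# Predict Age wherever it is missing.
missing = df["Age"].isna()
df.loc[missing, "Age"] = slope * df.loc[missing, "Salary"] + intercept
```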
K-Nearest Neighbours Imputation
K-Nearest Neighbours selects the records most similar to the record with the missing value, and averages across them. The user must select the number of nearest neighbours and the distance metric (how similarity is determined). You can read more about the workings of this method here. We can see the cleaned MAR data below.
First name | Surname | Age | Gender | Salary |
---|---|---|---|---|
John | Doe | 28 | Male | £ 29,200 |
Frank | Smith | 47 | Male | £ 50,400 |
Mary | White | 41.0 | Female | £ 46,900 |
Kevin | Park | 35 | Male | £ 40,500 |
Jane | Johnson | 25.7 | Female | £ 30,300 |
Sarah | McGeady | 25.7 | Female | £ 17,900 |
Gareth | Hancock | 26 | Male | £ 28,000 |
Sally | Robbins | 33.9 | Female | £ 36,000 |
George | Simpson | 23 | Male | £ 24,200 |
Emily | Jones | 41 | Female | £ 40,100 |
Pros:
- Takes multiple features of the dataset into account, so produces a better imputed estimate
- Can be used on MAR/MCAR data.
- Can choose which features to use.
Cons:
- Requires more effort and computation
- Requires thought around distance metric and number of neighbours
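A deliberately simplified sketch of the idea, using only salary distance and k = 2; real implementations (e.g. scikit-learn's `KNNImputer`) normalise and combine several features:

```python
# Hypothetical (salary, age) records from the MAR example; None marks
# a missing age. Distance here is simply the absolute salary gap.
records = [
    (29200, 28), (50400, 47), (46900, None), (40500, 35), (30300, None),
    (17900, None), (28000, 26), (36000, None), (24200, 23), (40100, 41),
]

K = 2  # number of nearest neighbours to average over

observed = [(salary, age) for salary, age in records if age is not None]

def knn_impute_age(salary, k=K):
    """Average the ages of the k observed records closest in salary."""
    nearest = sorted(observed, key=lambda rec: abs(rec[0] - salary))[:k]
    return sum(age for _, age in nearest) / k

imputed = [(s, a if a is not None else knn_impute_age(s))
           for s, a in records]
```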
Timeseries Imputation
When we are dealing with missing values from timeseries datasets, there are a few more strategies we can consider. Here is an example timeseries dataset:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 12 |
12:10 | 14 |
12:15 | 14 |
12:20 | 16 |
12:25 | 18 |
12:30 | 19 |
Forward Fill
Forward filling makes a forward pass over the dataset, row by row, starting with the earliest timestamp. Any missing value is filled in with the most recent observed value. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 11 |
12:10 | 11 |
12:15 | 14 |
12:20 | 14 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
Cons:
- Reduces variance of dataset
- Ignores variables other than timestamp
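In pandas this is the `ffill` method; the sketch below uses the example readings:

```python
import pandas as pd

# Timeseries from the example above; None marks the missing readings
# at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Each gap takes the most recent observed value.
filled = values.ffill()  # 11, 11, 11, 14, 14, 18, 19
```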
Backward Fill
Backward filling makes a backward pass over the dataset, in the opposite direction to a forward fill, starting with the latest timestamp. This time, any missing value is filled in with the next observed value. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 14 |
12:10 | 14 |
12:15 | 14 |
12:20 | 18 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
Cons:
- Reduces variance of dataset
- Ignores variables other than timestamp
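The pandas counterpart is `bfill`:

```python
import pandas as pd

# Same example readings, with gaps at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Each gap takes the next observed value.
filled = values.bfill()  # 11, 14, 14, 14, 18, 18, 19
```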
Linear Interpolation
Linear interpolation takes the values on either side of the missing data and assumes a linear increment for each time step; in effect, we fit a straight line between the two known points. For example, with missing data for 12:05, 12:10, and 12:20, the results would look as follows:
Timestamp | Value |
---|---|
12:00 | 11 |
12:05 | 12 |
12:10 | 13 |
12:15 | 14 |
12:20 | 16 |
12:25 | 18 |
12:30 | 19 |
Pros:
- Easy to implement
- Can be used on MAR/MCAR data.
- Generally, it’s more accurate than forward/backward fill.
Cons:
- Ignores variables other than timestamp
- Assumes linear relationship between points.
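pandas' `interpolate` method defaults to exactly this behaviour, treating the readings as equally spaced:

```python
import pandas as pd

# Same example readings, with gaps at 12:05, 12:10, and 12:20.
values = pd.Series(
    [11, None, None, 14, None, 18, 19],
    index=["12:00", "12:05", "12:10", "12:15", "12:20", "12:25", "12:30"],
)

# Fit a straight line between the observed points on each side of a gap.
filled = values.interpolate()  # 11, 12, 13, 14, 16, 18, 19
```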
Conclusion
Through human or machine error, missing data is always a possibility. Being aware of this, and planning how to deal with it, can help mitigate its effects. We have covered some of the most common ways of identifying, categorising, and dealing with missing data. As always, there is no one-size-fits-all method, and care should be taken when choosing a strategy.
If you’d like to learn more about how we ensure robust data collection strategies, get in touch at hello@harksys.com