Missing Values in Data Analytics
Missing Values
It is not unusual for an object to be missing one or more attribute values.
In some cases, the information was not collected.
- Example: Some people decline to provide their age or weight.
In other cases, certain attributes are not applicable to all objects.
- Example: Forms may have conditional parts that are filled out only when a person answers a previous question in a specific way. For simplicity, all fields are stored regardless of their applicability, so the inapplicable fields are left empty.
Missing values should be taken into account during data analysis.
Strategies for Dealing with Missing Data
1. Eliminate Data Objects or Attributes
Objects:
A simple and effective strategy is to eliminate objects with missing values.
However, even a partially specified object contains some information.
If many objects have missing values, eliminating them can make reliable analysis difficult or impossible.
If only a few objects have missing values, it may be convenient to omit them.
Attributes:
Another approach is to eliminate attributes with missing values.
This should be done cautiously since the eliminated attributes may be critical to the analysis.
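Both kinds of elimination can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the records, the attribute names, and the use of None as the missing-value marker are all assumptions made for the example.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 34, "weight": 70.5, "city": "Oslo"},
    {"age": None, "weight": 82.0, "city": "Lima"},
    {"age": 51, "weight": None, "city": "Pune"},
    {"age": 28, "weight": 64.2, "city": "Kyiv"},
]

def drop_incomplete_objects(rows):
    """Keep only the objects (rows) that have no missing attribute values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_incomplete_attributes(rows):
    """Keep only the attributes (columns) that are present in every object."""
    complete = [k for k in rows[0] if all(row[k] is not None for row in rows)]
    return [{k: row[k] for k in complete} for row in rows]

complete_rows = drop_incomplete_objects(records)     # 2 of 4 objects survive
complete_cols = drop_incomplete_attributes(records)  # only "city" survives

print(len(complete_rows), list(complete_cols[0]))
```

In this toy data, dropping objects discards half the records, and dropping attributes discards both numeric attributes, which shows why either form of elimination needs care.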
2. Estimate Missing Values
- Missing data can sometimes be reliably estimated.
Examples:
Time Series:
- If a time series changes in a reasonably smooth manner but has a few scattered missing values, the missing values can be estimated using the remaining values.
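One simple way to estimate scattered gaps in a smooth series is linear interpolation between the nearest known values on each side. The sketch below assumes evenly spaced observations and uses None as the missing-value marker; the function name and sample readings are made up for illustration, and other estimators (for example splines or moving averages) may suit a given series better.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between the nearest known values.

    Assumes evenly spaced observations. Leading or trailing gaps have a known
    value on only one side and are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

readings = [10.0, None, 14.0, 15.0, None, None, 21.0]
smoothed = interpolate_missing(readings)
print(smoothed)  # [10.0, 12.0, 14.0, 15.0, 17.0, 19.0, 21.0]
```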
Similar Data Points:
- If a dataset has many similar data points, the points closest to the one with the missing value can be used to estimate it.
Approaches:
Continuous Attributes:
- Use the average attribute value of the nearest neighbors.
Categorical Attributes:
- Use the most commonly occurring attribute value among the nearest neighbors.
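The two approaches can be sketched together as a small nearest-neighbor imputation routine. This is an illustrative sketch under several assumptions: neighbors are found by Euclidean distance on attributes that are fully observed, k is fixed at 3, None marks a missing value, and the records and function names are hypothetical.

```python
from collections import Counter
from math import dist

def nearest_neighbors(target, rows, features, k):
    """Return the k rows closest to target (Euclidean distance on `features`)."""
    def distance(row):
        return dist([row[f] for f in features], [target[f] for f in features])
    return sorted(rows, key=distance)[:k]

def impute(target, rows, attribute, features, k=3):
    """Estimate target[attribute] from the k nearest rows that have a value for it."""
    donors = [r for r in rows if r is not target and r[attribute] is not None]
    values = [r[attribute] for r in nearest_neighbors(target, donors, features, k)]
    if not values:
        return None                                   # nothing to estimate from
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)              # continuous: neighbor average
    return Counter(values).most_common(1)[0][0]       # categorical: most common value

# Hypothetical data: the last person is missing both income and diet.
people = [
    {"height": 170, "weight": 65, "income": 40.0, "diet": "omnivore"},
    {"height": 172, "weight": 68, "income": 44.0, "diet": "vegetarian"},
    {"height": 169, "weight": 66, "income": 42.0, "diet": "omnivore"},
    {"height": 190, "weight": 95, "income": 90.0, "diet": "vegan"},
    {"height": 171, "weight": 67, "income": None, "diet": None},
]
target = people[4]
income = impute(target, people, "income", ["height", "weight"])
diet = impute(target, people, "diet", ["height", "weight"])
print(income, diet)  # 42.0 omnivore
```

The distant fourth record is not among the three nearest neighbors, so it does not distort either estimate. In practice the features would usually be scaled before computing distances.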
3. Ignore the Missing Value During Analysis
- Many data mining approaches can be adapted to ignore missing values.
Example:
When clustering data, similarity between pairs of objects needs to be calculated.
If one or both objects of a pair have missing values for some attributes, the similarity can be calculated using only the attributes that are present in both objects.
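A minimal sketch of this idea, using Euclidean distance as the (inverse) measure of similarity, is shown below. The rescaling by the fraction of attributes actually used is one common convention and is an assumption here, not something the notes prescribe; the function name and sample vectors are also made up.

```python
from math import sqrt

def distance_ignoring_missing(a, b):
    """Euclidean distance over the attributes present in both objects.

    The squared sum is rescaled by (total attributes / attributes used) so that
    pairs compared on fewer attributes do not look artificially close.
    Returns None when the two objects share no observed attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    squared = sum((x - y) ** 2 for x, y in shared)
    return sqrt(squared * len(a) / len(shared))

p = [1.0, None, 3.0, 4.0]
q = [1.0, 2.0, None, 8.0]
d = distance_ignoring_missing(p, q)  # uses only the first and last attributes
print(d)
```

Pairs with no attribute in common cannot be compared at all, so a clustering routine built on this function would need to decide how to treat a None result.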