Data cleaning is a key element in HR analytics. Before you can analyze your data, it needs to be ‘clean’. In this data cleaning guide, we will explain why data cleaning is important and how you can do it. At the bottom of the article we included a helpful data cleaning infographic.
A common saying in data analysis is: “garbage in, garbage out”.
This saying means that you can put a lot of thought and effort into your data analysis and come up with lots of results. However, these results will mean nothing if the input data is not accurate. In fact, the results may even be harmful as they can misrepresent reality.
Why is data cleaning important?
HR data is oftentimes dirty. Dirty data is any data record that contains errors. This can happen for different reasons.
The simplest one being missing data. Other examples of dirty data are different labels for one and the same job functions, multiple records for the same people in one system, non-matching records in different systems, and so on.
Cleaning and ordering this data can be a time-consuming process. Indeed, aggregating data from all these different data sources and making them compliant can take weeks or even months. This holds especially true for international companies. These often use different systems in different countries to record the same data.
The problem with data is that it’s easy to get dirty. As soon as data collection procedures differ in the slightest, the data will become inconsistent.
As a company, you can decide to clean all your data at once. Some companies opt for this strategy. However, this can take tremendous amounts of time. It is hence much smarter to clean only the data you need to perform a specific analysis.
This approach will prevent a lot of unnecessary work and produce results faster. Based on the outcomes of the first analysis, you can determine which extra data you need to clean to run your next analysis.
Data cleaning helps to run a smooth analysis. It also helps normal HR reporting as clean data can be fed back into the HR systems. This will help improve the data quality and is extremely beneficial for later data analyses and data aggregation efforts.
Data cleaning is thus a necessary step in the HR analytics process.
The data cleaning process
When cleaning HR data there are two things you need to understand. The first is data validity and the second is data reliability.
When data is not valid or reliable, it may tell you something different than what you are looking for. The following section will explore this deeper. It is important to understand these two terms. However, if you’re looking for a more practical step-by-step guide, you can scroll down to the next section.
Validity is whether you’re actually measuring what you need to measure. Does the appraisal system only measure individual performance, or does it (also) measure who is best liked by his/her manager? Is data collected evenly throughout the organization, or is it skewed in one way or another?
An example: The city of Boston made an app that their drivers could install on their smartphone. The app would measure bumps in the road and report their location via GPS. These bumps were then recorded and the city road service would fix them. According to a spokesperson: “the data provides the city with real-time information it uses to fix problems and plan long-term investments”.
Unfortunately, not everyone benefited equally from this system. The app was mainly used by the young and in more affluent communities. Meanwhile, the poorer communities did not have equal access to smartphones and mobile data. This is a significant bias in the data.
Questions you can ask yourself to check for validity, are:
- Does the data represent what we want to measure?
- Are there any biases in the way we measured our data?
- Was the data collected in a clear and consistent way?
- Are there outliers in the data?
Reliability is about measuring the same thing over and over again and achieving the same result.
When you measure someone’s engagement in the morning you want to have a similar result as when you measure it again in the afternoon. This is because engagement is a trait that is relatively stable over time.
The same holds true for different raters. If you ask both Bill and Jim to rate Wendy’s engagement, you want both Bill and Jim to give Wendy the same rating. However, when the scales that are used to rate Wendy are vague and open to different interpretations, Bill and Jim will likely give Wendy different ratings. This is called a rater bias and is best avoided.
This might sound obvious but it is not. Oftentimes reported data depends on other factors like the instructions that are given, and the mood of the person who gives the rating. This is the big question when we talk about reliability: Are the same scores achieved when the same data is measured in the same way by different people and at different times of the day/week?
Procedures play an important role in this process. In rating performance, if one manager considers an employee’s performance over the last six months, while another only thinks back over the last two weeks, performance ratings will likely differ and be unreliable. Clearly documented procedures would help different managers measure performance the same way.
Questions you should ask yourself in this context, are:
- Did we consistently produce the same results when the same thing was measured multiple times?
- Did we use clearly documented data collection methods?
- Were data collection instructions followed each time?
A simple data cleaning checklist
The previous questions on validity and reliability help you to analyze whether your input data is sufficiently accurate to yield reliable and valid results. There are several other criteria your data needs to comply with. For example, your data needs to be up to date.
Data that is outdated will produce potentially irrelevant results and can potentially mess up your results. Additionally, you need to check if you have all the relevant data: records are oftentimes missing. Depending on how you analyze your data, this may or may not cause problems. Some methods of analysis allow for missing data while other algorithms struggle when data is missing.
Missing data will narrow your population. Plus, there is a real chance that there are shared similarities between the people whose data is missing. For instance, if one department still uses an outdated performance management system that omits certain questions, it would mean that you’d lack data of all the people working in that department. This can seriously skew your results towards the other departments and threaten the generalizability of the results.
This is a practical checklist with six steps for data cleaning:
- Check if the data is up-to-date.
- Check for reccurring unique identifiers. Some people hold more than one positions. Systems often create separate records for each position. These people thus end up having multiple records in a single database. Depending on the situation, these records may be condensed.
- Check data labels across multiple fields and merged datasets and see if all the data matches.
- Count missing values. When missing values are over-represented in specific parts of the organization, they may skew your results. We saw this in the previous example. In addition, an analysis with too many missing values (i.e. insufficient data) runs the risk of becoming inaccurate. This also impacts the generalizability of your results.
- Check for numerical outliers. Calculate the descriptive statistics and the values of the quantiles. These enable you to calculate potential outliers. The minimum and maximum values are a good starting point.
In addition, you can calculate the interquartile range. You can do this by multiplying the difference between quantile 3 (Q3) and Q1 by 1.5. The result can be added to Q3 and subtracted from Q1. Values outside this range are assumed to be outliers. This Wikipedia article describes how to do this in more detail.
- Define valid data output and remove all invalid data values. This is useful for all data. Character data is clearly defined. For example, gender is defined by M or F. These are the valid data values. Any other values are presumed to be invalid. This data can be easily flagged for inspection.
Numeric data is often limited in range (e.g. working age is between 15 and 100). Numeric data that falls outside the predefined range can be flagged the same way.