This module describes strategies and rules to follow when you start working with a new dataset that might need cleaning. It is not tool specific; in the next module, Working with data: cleaning data in practice, these principles are put into practice using Microsoft Excel.

Metadata

When you start looking at a new dataset, your first reflex should be to look for the metadata. Metadata, “data about the data”, explains who collected the data, when and how it was collected, and for what purpose it was collected.

Good metadata will also explain what each column in the data means exactly and what the units of the numbers in the data are. This information is crucial for interpreting the data correctly and for making meaningful visualisations with it.

So never assume the meaning of a column in the data, or the units of the numbers in it: make sure to consult the metadata, and ask your data provider for the metadata when it is missing.

Inspecting data: data profiling

Inspecting the data should be your next step. Check whether your data is in wide or long format (see the Wide versus long data section in the Working with data: files, formats and structures module). Then check the number of rows and columns: are they consistent with what you were expecting from the data?

To get an idea of the content of each column, you can create a summary of each column. For text columns this means checking the unique values in a column (and counting the number of times each value occurs); for numerical columns this means getting an idea of the mean (or median) values in a column, and of the variability in the numbers. For date/time columns this means checking the range of the date or time values in the column.
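As a sketch of this profiling step, here is a minimal Python example using only the standard library. The sample rows and column names are hypothetical; in practice the rows would come from your data file.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample rows standing in for a real dataset.
rows = [
    {"country": "NL", "population": 17.5},
    {"country": "NL", "population": 17.4},
    {"country": "BE", "population": 11.6},
]

# Text column: list the unique values and how often each occurs.
country_counts = Counter(row["country"] for row in rows)
print(country_counts)  # Counter({'NL': 2, 'BE': 1})

# Numerical column: mean, median, minimum and maximum give a first impression
# of the typical values and the variability.
pops = [row["population"] for row in rows]
print(mean(pops), median(pops), min(pops), max(pops))  # 15.5 17.4 11.6 17.5
```

For date/time columns, the same minimum/maximum check gives you the range covered by the data.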

Special attention should go to columns whose values were entered manually. Values in these columns are prone to typos, and different names for the same thing can be present (like “Netherlands”, “The Netherlands”, “NL” and “Holland”). Values like this should be standardised, so that only one value is used to represent each category in the data.

In this step, you should also check for blanks in your data. These missing data values can show up as empty cells in your data, but they can also be codified in different ways, like with certain text strings (like “NA” or “null”) or with special numerical values (like 0, -1 or 9999). Make sure that you know how missing values are encoded, that you are able to find them, and that you understand why these values are missing.
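A small sketch of translating such encodings into genuine missing values. The set of sentinel values below is hypothetical: always check the metadata for the codes actually used in your dataset.

```python
# Hypothetical sentinel values for "missing"; consult the metadata
# to learn which codes your dataset really uses.
MISSING = {"", "NA", "null", "-1", "9999"}

values = ["12", "NA", "", "9999", "15"]

# Map every sentinel to None so missing values are explicit and countable.
cleaned = [None if v in MISSING else int(v) for v in values]
print(cleaned)                 # [12, None, None, None, 15]
print(cleaned.count(None))     # 3 missing values found
```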

You also need to make sure that the software you use to work with the data has recognised the data columns correctly. Are numbers correctly interpreted as numbers? And are dates recognised as dates, and not stored as text?
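A minimal sketch of such checks in Python, assuming hypothetical field names and formats. Dates stored as text can be parsed into real date objects, and numbers that load as text (for example because of thousands separators) can be converted explicitly.

```python
from datetime import datetime

# Hypothetical raw record in which both fields arrived as text.
raw = {"date": "2023-05-01", "amount": "1,250"}

# Parse the date string into a real date object instead of keeping it as text.
date = datetime.strptime(raw["date"], "%Y-%m-%d").date()

# Thousands separators often make numbers load as text; strip them first.
amount = int(raw["amount"].replace(",", ""))

print(date.year, amount)  # 2023 1250
```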

In this phase, it can be helpful to use data visualisation as a tool to “profile” your data. For text values, you can make bar charts to show how often each value occurs, and for numerical columns you can make histograms to get an idea of the distribution of the numbers in these columns. This is a good way to spot outliers in the data. Outliers can also be spotted by simply sorting the data on the numerical columns.
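Both ideas can be sketched without any charting tool: a crude text histogram already reveals a stray value, and sorting pushes suspicious values to the extremes. The sample values below are made up for illustration.

```python
from collections import Counter

values = [12, 15, 14, 13, 9999, 11, 12, 14]  # hypothetical column with one outlier

# Crude text histogram: bucket each value into bins of width 10
# and print one bar of '#' characters per bucket.
buckets = Counter((v // 10) * 10 for v in values)
for b in sorted(buckets):
    print(f"{b:>5}: {'#' * buckets[b]}")
#    10: #######
#  9990: #

# Sorting also makes the outlier easy to spot at the end of the list.
print(sorted(values)[-1])  # 9999
```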

Duplicate rows in your data (rows containing the exact same values for all columns) are also suspicious. The same is true for values in a column of unique id numbers that occur more than once. So check that unique ids are indeed unique; when they are not, or when you find duplicate rows in your data, investigate why this is the case before proceeding.
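Checking uniqueness can be sketched as follows, with a hypothetical id column: count how often each supposedly unique id occurs and flag any that appear more than once.

```python
from collections import Counter

ids = [101, 102, 103, 102, 104]  # hypothetical "unique" id column

# Any id with a count above 1 violates the uniqueness assumption.
duplicated_ids = [i for i, n in Counter(ids).items() if n > 1]
print(duplicated_ids)  # [102]
```

The same pattern works for whole rows: count complete rows (as tuples) instead of single ids to find exact duplicate rows.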

Cleaning

When you need to edit the data for further analysis and visualisation, it is a good idea to create a duplicate of the original data file and make the edits only in the duplicated file. This way, you can always go back to the source data to check whether a suspicious value was already present in the source data, or whether you introduced the error yourself while cleaning.

Standardising data

Standardising your data can be done by performing search and replace operations. For example, you can decide to search for all occurrences of “Netherlands”, “NL” and “Holland” and replace all of them with “The Netherlands”.
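As a sketch, such a search and replace can be expressed as a lookup table that maps every known variant to the one standard value; unknown values pass through unchanged. The mapping below is the hypothetical example from the text.

```python
# Hypothetical mapping from variant spellings to the standard value.
STANDARD = {
    "Netherlands": "The Netherlands",
    "NL": "The Netherlands",
    "Holland": "The Netherlands",
}

countries = ["NL", "Holland", "Belgium", "The Netherlands"]

# Replace known variants; leave all other values as they are.
standardised = [STANDARD.get(c, c) for c in countries]
print(standardised)
# ['The Netherlands', 'The Netherlands', 'Belgium', 'The Netherlands']
```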

If the data is not too big, you can make edits to the data manually. When the data is too big for manual edits, you can make use of clustering algorithms to detect possible unstandardised data, and correct it (see the Working with data: tools module for an example of this clustering technique).

A hard-to-detect but common issue with unstandardised text columns is spaces (and other invisible characters) at the start and end of strings. These are called leading and trailing spaces, and if you suspect they might be present in your data, you should run a script or use a formula to remove them. In many tools, removing leading and trailing spaces is called “trimming”.
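In Python, trimming is a one-liner with the built-in str.strip() method, which removes leading and trailing whitespace, including tabs. The sample values are made up for illustration.

```python
raw = ["The Netherlands ", "  Belgium", "Germany\t"]  # hypothetical untrimmed values

# str.strip() removes whitespace from both ends of each string.
trimmed = [s.strip() for s in raw]
print(trimmed)  # ['The Netherlands', 'Belgium', 'Germany']
```

Note that without trimming, “Belgium” and “  Belgium” would count as two different categories in any summary or chart.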

Data type mismatches