Data visualisation always involves data preparation, and when working with data, running into some pitfalls is almost guaranteed. This module lists some of the most common ones and what you can do to avoid them.

File encoding

Computers store characters as sequences of 0s and 1s, but there are different ways of translating characters into bits. These translations are called file encodings. When the software you use to open a file assumes one encoding while the file actually uses a different one, you get garbled text and data.

A Wikipedia article in UTF-8 encoding displayed using the Windows-1252 encoding. Source: Wikimedia Foundation, CC BY SA 3.0

Some encodings lack support for special characters, like letters with certain accents. When text is displayed through one of these encodings, the accented characters are replaced by odd symbols or by the generic “�” symbol. This is a sign that the wrong encoding is being used to read and display the file.
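
A quick way to see what an encoding mismatch does is to decode the same bytes with different encodings. The sketch below (plain Python, with made-up example text) shows both the garbled output and the generic replacement character:

    # The same UTF-8 bytes, read back with three different encodings.
    text = "café naïve"
    utf8_bytes = text.encode("utf-8")

    print(utf8_bytes.decode("utf-8"))                    # café naïve   (correct)
    print(utf8_bytes.decode("cp1252"))                   # cafÃ© naÃ¯ve (garbled)
    print(utf8_bytes.decode("ascii", errors="replace"))  # caf�� na��ve (replacement characters)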

Many software programs let you set the encoding both when saving a file and when opening or importing one. UTF-8 is widely used and a safe bet as a default.

Setting the file encoding in Mac OS Numbers. Source: Maarten Lambrechts, CC BY SA 4.0

If you notice file encoding issues in your data, you should ask your data provider what encoding the data file is using.
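
The same principle applies in a scripting workflow: state the encoding explicitly when reading and writing files. A minimal sketch with pandas, assuming a hypothetical CSV file named data.csv:

    import pandas as pd

    # Read the file with an explicit encoding instead of relying on the default.
    df = pd.read_csv("data.csv", encoding="utf-8")

    # If this fails with a UnicodeDecodeError, the file probably uses another
    # encoding, for example Windows-1252; try encoding="cp1252" instead.

    # When saving, write UTF-8 so other tools can read the file reliably.
    df.to_csv("data_clean.csv", index=False, encoding="utf-8")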

Missing values

Missing values in data need to be identified, understood and dealt with before any further processing of the data.

There are numerous ways to store missing values in data. Values may be missing completely (like empty cells in a spreadsheet), or they can be stored as empty strings (""), as special strings (like NA or null) or as impossible values (like -1 or -9999).

The value of 0 deserves special attention. Does it really mean 0, or is it maybe used as a replacement for missing data? Missing date values might be present in the data as January 1, 1900 or as January 1, 1970, so these dates are suspicious and might be an indication of missing values. Make sure that you know how missing values in your data are stored, and that you are able to identify them.
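
In a tool like pandas, for instance, you can declare which markers should be treated as missing on import and then count what is actually missing. A sketch, in which the file name and column names are hypothetical:

    import pandas as pd

    # Treat the sentinel values used in this (hypothetical) file as missing on import.
    df = pd.read_csv("data.csv", na_values=["", "NA", "null", -1, -9999])

    # Count missing values per column.
    print(df.isna().sum())

    # Flag suspicious values that are not formally missing: zeros and placeholder dates.
    print((df["value"] == 0).sum())
    dates = pd.to_datetime(df["date"], errors="coerce")
    print(dates.isin([pd.Timestamp("1900-01-01"), pd.Timestamp("1970-01-01")]).sum())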

Missing data can also be “hidden”. Take for example a time series with monthly values. If the data for August were missing, the data could look like this:

Month Value
January 123
February 126
March 145
April 136
May 111
June 98
July 128
August
September 154
October 167
November 183
December 186

In the table above, the value for August is not hidden but explicitly missing. This way of storing missing data is preferable to “hiding” it, as in the table below, where the August row is left out entirely (the sketch after this table shows how to surface such hidden gaps):

Month Value
January 123
February 126
March 145
April 136
May 111
June 98
July 128
September 154
October 167
November 183
December 186
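
Hidden gaps like this can be surfaced by reindexing the series against a complete calendar. A sketch with pandas, using made-up monthly data that mirrors the table above:

    import pandas as pd

    # Monthly series in which the August row is simply absent.
    df = pd.DataFrame({
        "month": ["2023-01", "2023-02", "2023-03", "2023-04", "2023-05", "2023-06",
                  "2023-07", "2023-09", "2023-10", "2023-11", "2023-12"],
        "value": [123, 126, 145, 136, 111, 98, 128, 154, 167, 183, 186],
    })

    # Reindex against a full monthly calendar: the hidden gap becomes an explicit NaN.
    df["month"] = pd.to_datetime(df["month"])
    df = df.set_index("month").asfreq("MS")  # "MS" = month start frequency

    print(df[df["value"].isna()])  # reveals the missing August row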

You also need to know why values are missing. Was the data never collected for the records with missing values? Is the data the result of joining two tables together, where not all records were available in both tables? Or is the column with missing data the result of a calculation that produced an error for some rows?

If you know why values are missing, you will have a better idea of how to handle them: should you filter them out, set them to 0 or some other value, or try to recover them from additional data sources?
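
Each of those answers points to a different treatment. A sketch of the most common options in pandas, with a hypothetical column named value:

    import pandas as pd

    df = pd.DataFrame({"value": [123, None, 145, None, 111]})  # toy data

    dropped = df.dropna(subset=["value"])                  # filter out rows with missing values
    as_zero = df.fillna({"value": 0})                      # replace with 0 (only if 0 truly means 0)
    flagged = df.assign(value_missing=df["value"].isna())  # keep the rows, but flag them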

Duplicates, aggregates and totals

Duplicate rows in your data (rows containing exactly the same values in all columns) are also suspicious. The same is true for a column of supposedly unique ID numbers in which some values occur more than once. So check that unique IDs are indeed unique, and when they are not, or when you find duplicate rows in your data, investigate why before proceeding.
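
Both checks are easy to automate. A sketch with pandas, in which the file name and the id column are hypothetical:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Rows that are fully duplicated (identical values in every column).
    print(df[df.duplicated(keep=False)])

    # Supposedly unique IDs that occur more than once.
    id_counts = df["id"].value_counts()
    print(id_counts[id_counts > 1])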