23 Research Data Things – 21. Dirty Data


There’s a great story about an influential economics paper Growth in a Time of Debt by Reinhart & Rogoff, where the dataset had a simple calculation error that was busted by a student when he tried to replicate the study for an assignment. Here’s an article about it that makes for good reading:  http://www.bbc.com/news/magazine-22223190. My economist brother told me about it a while ago, and it instantly sprung to mind.

This incident also appears in European Spreadsheet Risks Interest Group’s list of horror stories (it’s the second entry) http://www.eusprig.org/horror-stories.htm. The Excel formula used for the calculation that was key to the researcher’s argument was AVERAGE(L30:L44) when it should have been AVERAGE(L30:L49) – leaving off 5 rows of data and schewing the results. Oops!

But it’s so easy to make that kind of mistake! The cynic in me wonders if resistance to open data is at least partially influenced by researchers being worried they’ll have made a similar type of mistake somewhere and have it found out. If it can happen to a couple of Harvard professors, it can happen to anyone!

My Excel-Fu is pretty good, but in the past I’ve made mistakes like re-sorting the data in just one column in a table (meaning to re-sort the whole table of course) while all the other columns stayed in their original order. That meant that I had totally mixed up entries in that column and had no way to know which row each cell should have belonged to. The data wasn’t “dirty” so much as completely destroyed! My mistake didn’t stand out to me until I had done some more work on the table and realized something was not right with that column. Even the “Undo” button couldn’t help me – I had to restore to a previous version, undoing quite a bit of work. It’s one of those things where you just need to learn the proper method before you try anything.

I liked learning about Open Refine, and was impressed at its capability for cleaning up inconsistent cells. You can use Excel’s Filter options to clean up consistences, typos, or duplicates (or if you’re entering data into a spreadsheet, you can use the data validation tool to control what values can be put in each cell to prevent it becoming messy), but the process is more manual and Open Refine’s functions make it a lot easier.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s