Aaron Schiff
aschiff 26 at gmail dot com

Data’s dirty secret

Data sounds so clean and clinical but the dirty little secret of data is that it’s messy. When working with data invariably the first step involves cleaning the data and processing it into a form that is usable for analysis. This process is even more complicated when bringing together multiple datasets.

Basic data problems that need to be dealt with before analysis include:

  • Missing or incomplete data
  • Inconsistent definitions across datasets
  • Different frequencies (e.g. annual, monthly, quarterly) used in different datasets

Data wrangling is the process of sorting out these types of problems to produce a nice clean dataset. The objective is usually a flat database-style dataset, with columns for data descriptors and then the data itself. Missing data often needs to be imputed or otherwise estimated. It is often necessary to come up with creative ways to adjust the data to account for inconsistencies in definitions across datasets. In fact, inconsistencies can arise within the same data series, for example when the definition of a time-series has changed at some point. In such a case, some kind of back-casting can often be used to produce a series that is consistent over time.

I use Excel most often for data wrangling. Useful tools include pivot tables, the database commands like DSUM, and conditional commands like SUMIF or the newer SUMIFS. I also sometimes write custom macros that re-shape data into the format that I want, if there is too much data for manual editing. However, often the process of macro-writing and debugging can take longer than just doing the data edits manually.

March 8th, 2010 in Data Wrangling
Subscribe RSS
View all content ...

Things I do

  • Aviation economics
  • Competition policy
  • Data analysis
  • Data presentation
  • Data wrangling
  • Economic modelling
  • Intellectual property
  • Internet economics
  • Quantitative modelling
  • Regulatory economics
  • Research papers
  • Telecommunications

Archives

  • June 2010
  • May 2010
  • April 2010
  • March 2010
© Copyright Aaron Schiff 2010