Woohoo, the World Bank’s World Development Indicators dataset is now available for free. This is an excellent dataset of cross-country economic and social development indicators. There’s even an API for accessing the data directly!
A nice Excel tip:
In this Excel data validation example, you’ll create an Excel Data Validation drop down list that allows users to add new items.
Data sounds so clean and clinical but the dirty little secret of data is that it’s messy. When working with data invariably the first step involves cleaning the data and processing it into a form that is usable for analysis. This process is even more complicated when bringing together multiple datasets.
Basic data problems that need to be dealt with before analysis include:
Data wrangling is the process of sorting out these types of problems to produce a nice clean dataset. The objective is usually a flat database-style dataset, with columns for data descriptors and then the data itself. Missing data often needs to be imputed or otherwise estimated. It is often necessary to come up with creative ways to adjust the data to account for inconsistencies in definitions across datasets. In fact, inconsistencies can arise within the same data series, for example when the definition of a time-series has changed at some point. In such a case, some kind of back-casting can often be used to produce a series that is consistent over time.
I use Excel most often for data wrangling. Useful tools include pivot tables, the database commands like DSUM, and conditional commands like SUMIF or the newer SUMIFS. I also sometimes write custom macros that re-shape data into the format that I want, if there is too much data for manual editing. However, often the process of macro-writing and debugging can take longer than just doing the data edits manually.