The Importance of Clean Data in Excel

Clean Data in Excel

To be sure that you’re able to report using your data, try to ensure that each row represents a Single Event!

Normally, each row in a table contains data about the first column, or the subject that you will likely be reporting on. To be able to take full advantage of PowerPivot, it’s important to break each row down further, by Unpivoting any aggregated columns. Ideally, you will be able to control how the data is being entered or stored, so you don’t have to deal with trying to unravel it later.

For example, imagine that you’re analyzing hitters in a baseball game. You want to focus on creating a row for each at-bat, as opposed to having one row for each player containing game or season totals. As you compile data from new games, you hope to be able to identify trends between the player, the number of unique at bats, and hopefully the events that take place in each at bat.

Is it more advantageous to:

  • List the player’s name in column A, the number/order of each at-bat in columns B – F (or more), and the result in additional columns?
  • Have the player’s name in column A, the at-bat number in column B, and the result in column C? (This would create a separate row for every at-bat, for each player listed).

I already provided my answer above, but in my opinion, it’s MUCH better to setup your data using the second schema. This method allows you to track any number of statistics in the columns to the right.

Once you setup your Pivot Table(s) and Pivot Chart(s) to summarize and analyze your data, they will be able to be refreshed as you add at-bats to your source file/worksheet. Microsoft has really created the most user-friendly ways to easily visualize, and refresh your data. Starting with Excel, and now via PowerBI, the reporting capabilities are hard to match…

Each Column should represent an Attribute that corresponds to a Unique Event

If you followed the convoluted narrative above, then you see the benefit of setting your data up so that each row represents a single event. With the data set properly arranged, each additional column represents one Attribute (or statistic), for that single event. This way you are able to track as many statistics as you want, in one table.

Pivot Tables make it incredibly easy to summarize data that’s in this format, allowing you to create a handful of insightful tables and charts in minutes, based on one big data set.

How would you have tracked multiple statistics, if you had listed the at-bat number across the column headings? After one full slate of MLB games, you’d have over 1,000 columns of statistics… How would you use it in a Pivot Table? Would it make sense to anyone besides you?

Why is Clean Data Necessary?

In addition to the formatting above, Clean Data also means that there are no errors, and blanks are handled properly. To ensure that you are reporting accurate information to the boss, you need to be sure that you understand the source data. It’s important to understand where it came from, how it’s generated, the format it comes in, and most of all, what the values in each cell REALLY represent.

Don’t make assumptions about the meaning of any value, based upon its title, because everybody uses their preferred terminology when they create or enter data. If you’re not the source of the raw data, then it’s very important to take the time to fully understand what it really means!

Leave a Reply

Your email address will not be published. Required fields are marked *