Cleaner, Smarter Spreadsheets Start with Structure
Learning to structure data is more important than memorizing formulas
At the 2017 CAR conference earlier this month, I did a demo workshop on how to structure data in spreadsheets properly—the right way, the first time.
Normally when people teach Excel, they go through the common formulas that one might use to analyze their data. Personally, I think understanding good structure within an Excel file (proper headers, cells formatted correctly) is more important than learning every single formula that you can do in Excel or Google Sheets.
A well-structured file is similar to a well-structured story. It has all the information you need in a neat format. Formulas are easy to look up, but understanding why and how you should structure a file properly is really important.
Think Now About the Structure You’ll Need Later
While creating a spreadsheet or database, consider what you might do with the data and the issues that you might run into. Think about the functions one commonly uses in a spreadsheet: sort, aggregate, transform, filter. For example, if the file contains names, it’s a lot easier to sort a file if you separate first name, middle name, and last name into three separate columns right away than put them all in one column and split them later on. By separating the names into their own columns, you can see how many people share the same last name. Or if you are using the data in a display, you can choose to display only the first initial and last name.
What does a poorly structured file look like? Here’s a classic—an example of what I’d like to call “the government file structure.”
The hardest part about seeing a poorly structured file is knowing that the person spent a lot of time creating it, but that, in the end, it may be unusable.
For my demo workshop at the CAR conference, I created a spreadsheet that I commented out.
I created this file by combining some of the most common mistakes I’ve seen in files created from scratch.
How to Never Make a Poorly Structured Spreadsheet
Alongside your new spreadsheet, you should also create two things, a data dictionary and a data diary.
- A data dictionary is a file that explains what each column header in your spreadsheet is, along with information on the source of the data.
- A data diary has all the queries and notes about the data set in one place. It’s important to keep track of your steps in your data diary, so you can reproduce the work if you need to, later on.
Tip: Stop using Post-Its to remember things and create one file which has all the information that someone may need to understand your data.
The data dictionary can contain (but is not limited to):
- Field name: these should be clear. If you use headers like revenue1 and revenue2, chances are you probably won’t remember what they are. Save yourself time and make them concise and make sure to write this up in the data dictionary.
- Meaning - what does the field name mean
- Format - the format of the file: text, integer, float, date format
- Source - where the data comes from. This is especially helpful if you are combining multiple data sources.
With all this in mind, here are some common issues that can be addressed when thinking about how to structure a file.
It’s really important to specify the format the date will be written in, especially if multiple people are working together to enter data.
Dates can be (but are not limited to):
- Month-day-year written out (January 24, 1992) or abbreviated (Jan. 24, 1992)
- Month-day-year (01/24/1992)
- Month-day-year, only this time it’s the last two digits of a year (01/24/92)
- Day-month-year written out (24 January, 1992) or abbreviated (24 Jan. 1992)
- Day-month-year (24–01–1992)
- Day-month-year, only this time it’s the last two digits of a year (24–01–92)
- Year-month-day (1992–01–24)
- Year-month-day abbreviated (92–01–24)
Blank rows or cells
Don’t put blank rows before your header row. It doesn’t look pretty, and it just adds confusion. Also, don’t assume that the next person reading your file will know why you included blank cells. For example, don’t put
thing1 in cell
a1 and leave
a3 blank because you assume that the person will know that the next two cells are also
thing1. If you leave blank cells, you will also run into issues when sorting. Save yourself time and just enter data in each cell and don’t think someone will understand your strange logic.
Have clear and concise names for headers and use dashes or underscores in between words to make it easier to parse later on (ex: first-name instead of first name). If you mess up the spacing between words and have two spaces in one column header and only one in the other, it may run into issues when you’re trying to put the file into another format.
Comments on cells
This works fine if you’re copyediting or working with multiple people entering data. But just remember that when you convert it to a CSV, you will lose these comments. If necessary, create a notes column to keep track of the comments on your cells.
This is especially important when dealing with states and cities. It’s best to create a column for the city and another column for the state, especially if you’re working with multiple states. Stick to the same format—if you’re going to spell out the state, make sure you do it for all of them. An example: New Jersey, can be entered as NJ, N.J., or New Jersey. Don’t use nicknames like NOLA and assume people will know what that is. Make it easier on yourself and enter the data the same way each time, so you can join on other files and not have to spend time later cleaning the file.
Formats for numbers
Thousands-place separators vary by country, so remember that if you’re working with data from several countries to check these formats. I err toward not using commas in my numbers just so that I don’t add confusion. If you are entering data, use the actual numeric values rather than writing out the number (ex: 70000 rather than 70 thousand). You can’t sort if there is text in the number field. This is the same with currency. If you’re working with multiple currencies, you can create a column which says what the currency is, rather than putting a dollar or euro sign in the cell itself. That way it’s easier to convert the money to a single currency—and you won’t have to go into each cell and delete the currency sign. Stick to numbers.
Multiple tables in one sheet
Don’t do this. You will only cause yourself more confusion. Create a separate sheet for each table or dataset. It will be harder to sort and parse later on if you include multiple tables in one sheet rather than putting them in separate files.
They’re great for identifying areas that you want to mark, or to highlight an area when you’re copyediting or trying to do research. But colors can add confusion to a spreadsheet and will not translate over to a CSV. You may not remember why you colored a cell red and another yellow, so instead use flags. A flag can be a numeric value (I like to use 1 and 0) to denote the different things you are highlighting.
So for example, if you are copyediting, let’s say you used yellow for “yes” and red highlighting for “no.” Instead of using colors, you can create a column that says “checked” and uses 1 to indicate the file has been edited and 0 to indicate that it has not. This way, when you convert the Excel file or Google Sheet to a CSV, you have a way of keeping track of the edits, even though the colors will be removed. If you are really insistent on using colors, you can use what’s called conditional formatting. Here are the two steps in Google Sheets.
Remember, Structure is Your Friend
Save yourself time and think through the proper format of your spreadsheet before you make it. Whether you’re learning Excel for the first time, or if you have other tips or ideas, I’d love to hear from you.
Sandhya Kambhampati just finished up her 2016 Knight-Mozilla fellowship at Correctiv in Berlin, Germany and is currently looking for her next job. During her fellowship, she worked on investigations at Correctiv, a guide for newsrooms for on-boarding and off-boarding and taught at conferences. Follow her @sandhya__k