A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

Messy data needn't be bad data, but it might not be in a format that makes it easy to process. Many tables used for data presentation will contain implicit variables, such as person or result in the table here:
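The messy table itself is an image in the original post, but a layout of that kind is easy to sketch in pandas; the names and values below are illustrative, not the figures from the image.

```python
import pandas as pd

# A presentation-style table: 'person' is only implicit in the row labels,
# and 'result' is spread across one column per treatment rather than being
# a column of its own. (Illustrative names and values only.)
messy = pd.DataFrame(
    {"treatment_a": [None, 16, 3], "treatment_b": [2, 11, 1]},
    index=["John Smith", "Jane Doe", "Mary Johnson"],
)
print(messy)
```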
If you've ever generated, aggregated, or inherited data of any scale for analysis, you're almost certainly already familiar with the basic ideas. You've probably also done informally (with much cursing, copy-pasting, and burnt fingers) what this paper formalises as a small set of transformation patterns that, applied appropriately, will make messy data tidier.
The table below tidies the table above to have one row per experimental observation, one data value per cell, and all variables explicitly named:
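The tidy version is also shown as an image in the post; on the sketch above, the reshape amounts to a single melt once the implicit person variable has been made explicit (the column names here are my own choices):

```python
# Make the implicit 'person' variable an explicit column, then melt so that
# each row is one observation: (person, treatment, result).
tidy = (
    messy.rename_axis("person")
         .reset_index()
         .melt(id_vars="person", var_name="treatment", value_name="result")
)
print(tidy)
#        person    treatment  result
# 0  John Smith  treatment_a     NaN
# 1    Jane Doe  treatment_a    16.0
# ...
```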
In our discussion we noted some interesting overlap with the observability white paper we read recently. Although the term tidy data wasn't used, these pieces of advice suggest that the folks over at Honeycomb.io are familiar with the idea:
Exploring your systems and looking for common characteristics requires support for high-cardinality fields as a first-order group-by entity.
Add units to field names, not values (such as parsing_duration_µs or file_size_gb)
Generate one event per service/hop/query/etc.

Wickham's paper is short and readable, so I won't summarise it here, but I will note that the operations have snappy names (melting, splitting, casting) and examples that clearly illustrate them, their application, and their composition.
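For a concrete sense of that vocabulary: the paper works in R's reshape package, but rough pandas stand-ins for casting (spreading a tidy table back out to a wide layout) and splitting (pulling several variables out of one column) look something like this, continuing the sketch above:

```python
# "Casting": spread the tidy table back into a wide, presentation-style layout.
wide = tidy.pivot(index="person", columns="treatment", values="result")

# "Splitting": break a column that encodes more than one variable into parts.
# Here the combined 'treatment_x' label is split on '_' purely for illustration.
tidy[["measure", "group"]] = tidy["treatment"].str.split("_", expand=True)

print(wide)
print(tidy.head())
```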
I'll also mention that, once again, I'm struck by how useful it is to name a thing and take it out of the realms of tacit, experiential knowledge and into the world of explicit, inspectable knowledge and hence shared value.
Images: Tidy Data, AbeBooks
Comments
Data science people I follow say that a data science project is usually 80% data engineering and only 20% actual data science. (Data engineering includes cleaning, but also covers denormalising and other tasks.)