Good data management is often described as reaching for the “5 Vs of Data”: volume, velocity, variety, value, and veracity. But what do you do when you are handed dirty data to begin with?

Data cleansing, the art of transforming dirty data into something useful that can yield information, metrics, KPIs, or meaningful statistics for the business, can be intimidating to even the most senior DBAs. I recently attended the PASS Data Community Summit in Seattle, Washington, and I was shocked at how little information I found on data cleansing. It inspired me to write a beginner's guide to approaching dirty data, breaking the fight against this behemoth into smaller, more manageable steps.

Step 1: Determine What Constitutes Dirty Data in Your Organization

Asking “what does clean data look like?” is the first step to getting from dirty data to usable data. To answer it, create a list of concrete, actionable business rules that separate clean data from dirty data. These rules can vary greatly by industry. For example, a table of current card holders for a retail credit card should not have birth dates more than about 115 years in the past, and a table of students in an online school should probably not have birth dates more than about twenty years in the past.
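
To make that concrete, here is what the card holder rule might look like once translated into a query (a minimal sketch; the cardHolders table and birth_date column are hypothetical names, and Step 2 covers this translation in more detail):

SELECT *
FROM cardHolders
WHERE birth_date < DATEADD(YEAR, -115, GETDATE());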

Other business rules are fairly universal. Some common ones: expecting the absence of special characters, or expecting zip codes to match the state of the address to which they're attached (or to exist at all). So, for example, if an address has the zip code 90210, it should be associated with Beverly Hills, CA, or thereabouts. If 90210 is associated with, say, Chicago, IL, then the address is incorrect, and you can consider it “dirty data.” If the zip code 84672 is associated with Tampa, FL, it too can be considered dirty data, because 84672 is not a zip code that exists in the US and I just made it up. The business rule to account for this can be as simple as “the zip code must be associated with the state in the address.”

If you feed this data into an application, any business rule that the application artificially enforces by throwing errors when the rule is broken should be included in this list. For example, if the application will only read varchar data up to 30 characters in length, but your column is varchar(MAX), add “no data in this column should be longer than 30 characters” to your list of business rules.
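
That rule, too, translates directly into a query (a sketch only; the customers table and customer_name column are assumptions, so substitute your own):

SELECT *
FROM customers
WHERE LEN(customer_name) > 30;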

Step 2: Identify and Prepare Dirty Data

Next, compile a query or a set of queries to gather the dirty data. This is as simple as turning the business rules into queries:

SELECT *
FROM addresses a
WHERE NOT EXISTS (SELECT *
                  FROM zipCodesIDidntMakeUp z
                  WHERE a.state = z.state
                    AND a.zipcode = z.zipcode);

Don’t make that harder than it needs to be.

Next, check the data manually to ensure your business rules are going to work the way you intend, within reason. Create a test case so that you know exactly what clean data looks like. If necessary, return to step one and readdress your business rules.
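
One lightweight way to build that test case (a sketch, using the hypothetical addresses table and columns from above) is to seed one row you know is clean and one you know is dirty, then confirm the detection query flags only the dirty one:

-- one known-clean row and one deliberately dirty row
INSERT INTO addresses (street, city, state, zipcode)
VALUES ('123 Main St', 'Beverly Hills', 'CA', '90210'),  -- clean
       ('456 Oak Ave', 'Chicago', 'IL', '90210');        -- dirty on purpose

-- re-run the detection query from above; only the Chicago row should come back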

Finally, before you move on to the next step, take a backup of your data as is. We are about to change data, and it is ALWAYS prudent to create a backup first.
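
In SQL Server, that can be as simple as the following (a sketch; the CustomerDB name and file path are assumptions, so substitute your own):

-- a copy-only backup, so the regular backup chain is not disturbed
BACKUP DATABASE CustomerDB
TO DISK = N'D:\Backups\CustomerDB_PreCleansing.bak'
WITH COPY_ONLY, INIT;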

Step 3: Data Cleansing

Not all data is created equal, and not all data will be cleaned in the same way. I generally would not suggest that the IT department do it manually. I could (and will) write another entire article on data cleansing, but for now let me briefly offer two better (in my humble opinion) solutions:

Do It En Masse

Replacing and removing special characters is usually a task that can be performed with minimal detriment to the data and can be done en masse. Partitioning data that is old or outdated is generally safe. Anything that does not change the meaning behind the data, only its accessibility (partitioning) or presentation (uppercasing a whole column of single characters, changing “Miss” to “Ms” in a title column), can often be done without fear. This is what update statements were made for. If you have a test environment (please, for the love of god, create a test environment if you don't have one), run these updates there first.
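
For example (a sketch; the customers table, its columns, and the stray-character pattern are all assumptions):

-- normalize one presentation-only value across the whole column
UPDATE customers
SET title = 'Ms'
WHERE title = 'Miss';

-- strip a specific special character without touching anything else
UPDATE customers
SET customer_name = REPLACE(customer_name, '#', '')
WHERE customer_name LIKE '%#%';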

Have the Customer Do It

Updating states based on zip codes, or zip codes based on states (assuming that the person living in Tampa, FL intended to put “33614,” or that the person who lived in “90210, Chicago, IL” lives in Beverly Hills, CA), on the other hand, is very dangerous and risks corrupting the data beyond repair. You, the person who is going to be held responsible for this data, have no way of knowing what they intended. That information is lodged exclusively in the mind of your customer. The best solution then becomes obvious: have the customer fix their own data. Consider politely emailing them to request that they update their address or whatever other information you find to be faulty.
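
Conveniently, the detection query from Step 2 doubles as the mailing list (a sketch; the email column on addresses is an assumption):

SELECT a.email, a.street, a.city, a.state, a.zipcode
FROM addresses a
WHERE NOT EXISTS (SELECT *
                  FROM zipCodesIDidntMakeUp z
                  WHERE a.state = z.state
                    AND a.zipcode = z.zipcode);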

For a more technical view of data cleansing in MSSQL specifically, keep an eye on your email for my next article.

Step 4: Build a Virtual Roomba

Most forms of data storage let you create rules that determine how a row can be updated, or defaults for the row. Instead of deciding that clean data is good enough for now, and allowing it to accumulate “dust” (bad data) over time until it becomes dirty data again, put your business rules (you didn't think we were done with those, did you?) to work. Create technical safeguards, be they PowerShell operations, table constraints, triggers, or application constraints, to prevent the list of issues you've discovered. Then, at the very least, next time you'll have a list of issues you can rule out.
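
In SQL Server, two of the rules from this article might look like this (a sketch; it assumes the hypothetical tables above, and the foreign key requires a unique key on zipCodesIDidntMakeUp(state, zipcode)):

-- enforce the 30-character application rule at the table level
ALTER TABLE customers
ADD CONSTRAINT chk_customer_name_length CHECK (LEN(customer_name) <= 30);

-- enforce the zip-must-match-state rule with a foreign key
ALTER TABLE addresses
ADD CONSTRAINT fk_addresses_zip
FOREIGN KEY (state, zipcode)
REFERENCES zipCodesIDidntMakeUp (state, zipcode);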