Tuesday, June 30, 2009

Dealing with messy data, part 1

On most of our projects, we set up the Modeling Studio so that Microsoft Excel is used for the majority of input parameters for the simulation model. This makes a lot of sense for most of our users, for several reasons:
  • They are typically comfortable working within the Microsoft Excel environment
  • We can take advantage of Excel’s formatting to provide a more professional appearance
  • We can use built-in Excel features such as data validation and conditional formatting to make sure that the input data provided to the simulation model is correct, preventing extra coding in the model’s read routines (see the sketch after this list)
  • It requires less code within the Modeling Studio user interface itself (thus less developer time, less testing, and fewer bugs) than custom forms
  • It helps create a standard look-and-feel across all TranSystems simulation engagements
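
We set those validation rules up by hand inside Excel, but as a minimal illustration of the same idea, here is a Python sketch using the openpyxl library (not something the original workflow used) that attaches a data-validation rule to a hypothetical input column. The workbook name, column layout, and cell range are made up for the example.

```python
# A minimal sketch (not from the original post) of guarding a model-input
# column with Excel's built-in data validation, via the openpyxl library.
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws["A1"] = "SKU"
ws["B1"] = "Inventory Level"

# Reject anything that is not a whole number >= 0, showing an error in
# Excel instead of letting bad data reach the model's read routines.
dv = DataValidation(type="whole", operator="greaterThanOrEqual",
                    formula1="0", allow_blank=False)
dv.showErrorMessage = True
dv.errorTitle = "Invalid input"
dv.error = "Inventory level must be a whole number >= 0"
ws.add_data_validation(dv)
dv.add("B2:B1000")  # hypothetical input range

wb.save("model_inputs.xlsx")
```

Whether the rule is added by hand or by script, the payoff is the same: the model’s read routine can assume the column contains valid whole numbers.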

On the occasional project, we may need to extract data from the customers’ internal systems and data warehouses. For example, on a recent engagement for a grocery distributor’s distribution center, we pulled the majority of our data directly from the customer’s legacy data warehouse. This included a list of all SKUs available in the warehouse and their current inventory levels, a week’s worth of historical inbound and outbound orders, and the SKU-level detail supporting those orders.

In that project, the customer wanted to run historical data through a model that reflected physical system and control software changes in the warehouse, to understand the benefit of those changes relative to what actually happened that week. In other projects, we might generate representative data based on broad profiles of what is ordered and when. In this case, we wanted to consider every pallet that actually arrived at Receiving that week, as well as every outbound shipment request. Getting good historical data was key.

So how would we get this data? For IT and security reasons, it is usually impractical to establish a direct connection to the customer’s data warehouse from within our source code. It would be hard to develop and test from the comfort of our own office; establishing a VPN connection to the client’s network is an option, but even then access would likely be slow.

As a result, we’d usually ask the client to extract this data for us: we specify the files we want and the fields we want in those files, and ask the client to send us CSV (comma-separated values) or flat text files containing exactly those fields. Simple, right? After all, the client’s team knows the existing data warehouse (table relationships, structure, etc.) better than anybody. And we’ve provided them with a document outlining precisely what we want to see in those files. It should be a straightforward process to write those queries and send the data back.

Yeah… that works out perfectly every time, doesn’t it?

Data can be messy in many ways

Real world data is messy. It just is. It’s not (usually) the fault of a single person or a single system. It’s not that your client is deliberately trying to provide you with crappy information to sabotage your otherwise pristine and flawless simulation model. It just happens, and you have to learn to expect it and handle it as part of your project.

Some of the types of things you see frequently:
  • Columns out of sequence or missing entirely
  • Additional fields provided that you didn’t specify (“they were available in the system”)
  • Null values in individual fields
  • Negative values for timestamps, inventory levels, or other things that should always be positive
  • Values that you expected to be numeric but instead turn out to be strings (e.g. for Door Number I expected the number 1, but the data contained “DR 01”)
  • Mixed numeric and string values within the same field (e.g. on one project, every truck in the fleet had a numeric identifier, but the data was hand-entered into the data warehouse, so the same truck #17 could be specified as “17”, “T-17”, “T17”, or “017” – see the normalization sketch after this list)
  • Line items for zero pieces – who orders 0 of anything?
  • Bad cross references between files / tables – e.g. orders for items that don’t exist in the SKU inventory
  • Leading zeroes getting truncated on fixed-width fields (e.g. the product code is “0000714” but your data file contains just the value “714”). This can be a contributor to bad cross references.
  • Duplicate entries representing the same information
  • Different types of data within the same file (e.g. a file representing all UPS shipment transactions within a calendar year also contained rows for all weekly invoices to UPS for that year; summing the costs would double the actual total, because each shipment would be counted once in the transaction and again in the roll-up of the weekly invoice)
  • Nonsensical data within the fields – often occurs with data that’s manually entered instead of scanned
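
To make the mixed-identifier problem concrete, here is a minimal Python sketch (mine, not from the original project) that maps hand-entered truck IDs like the ones above onto a single canonical number. The regular expression is an assumption that happens to fit the variants listed; real data may need a stricter rule.

```python
import re

def normalize_truck_id(raw: str):
    """Map hand-entered variants like '17', 'T-17', 'T17', '017' to 17.

    Returns None when no numeric identifier can be recovered, so the
    caller can count and report the offending records instead of
    silently guessing.
    """
    match = re.search(r"(\d+)", raw.strip())
    return int(match.group(1)) if match else None

# The hypothetical variants of truck #17 from the list above:
for raw in ["17", "T-17", "T17", "017", "SPARE"]:
    print(repr(raw), "->", normalize_truck_id(raw))
# '17' -> 17, 'T-17' -> 17, 'T17' -> 17, '017' -> 17, 'SPARE' -> None
```

Returning None rather than raising an error keeps the decision about unrecoverable records where it belongs: in the assessment step described below.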

These anomalies are all things you should remember to look for on your projects. Knowing that this can and will happen, what can you do about it?

No, the answer isn’t always “Send it back!”

And even worse, it can’t be “Well, I’ll just run it through the model and see what happens.” Unfortunately, under tight schedule pressure, we can sometimes feel tempted to take that shortcut on our projects. It’s important that we remember the old adage – “Garbage in, Garbage out” – and take the time to thoroughly understand and clean up the data that serves as input to our simulations.

Understand, interpret, and summarize

Anytime you receive data from the customer, one of your first responsibilities is to assess just how bad it is. It’s tempting to dismiss the entire data set because you notice a few mistakes, and send it right back: “This is wrong, do it over, a team of monkeys could do a better job with your data warehouse!” But as an analyst, you need to be methodical and specific, and quantify the problem (a small profiling sketch follows this list).
  • What are the specific mismatches between the data expected and the data received? Think of all the things that could go wrong as outlined above – on every record, column, and file.
  • How many records were affected by each type of mismatch? What percentage does this represent of the overall data set?
  • How significant are these discrepancies to the problem you are trying to solve?
  • If the data is not corrected, how big of an impact would you expect it to have on the simulation model results?
  • Can you fix the data yourself by writing a query, macro, or function?
  • Can the bad data simply be deleted?
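
To answer the “how many” and “what percentage” questions above, it helps to profile the extract programmatically. Here is a minimal Python sketch (not from the original post; the file name, column names, and specific checks are assumptions for illustration) that tallies a few of the anomaly types from the earlier list:

```python
import csv
from collections import Counter

# Hypothetical extract and SKU master; adjust names to your own files.
ORDER_FILE = "outbound_orders.csv"
valid_skus = {"0000714", "0000715"}  # would normally be loaded from the SKU master

issues = Counter()
total = 0

with open(ORDER_FILE, newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        if not row.get("product_id"):
            issues["null product_id"] += 1
        elif row["product_id"] not in valid_skus:
            issues["product_id not in SKU master"] += 1
        qty = row.get("quantity", "")
        if not qty.lstrip("-").isdigit():
            issues["non-numeric quantity"] += 1
        elif int(qty) <= 0:
            issues["zero or negative quantity"] += 1

# Report each mismatch type with its share of the data set -- the raw
# material for the conversation with the customer described below.
for issue, count in issues.most_common():
    print(f"{issue}: {count} of {total} records ({100 * count / total:.1f}%)")
```

Counts and percentages like these are exactly what lets you make the “small enough to delete” call described next, and they double as the documentation to bring back to the customer.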

Many times, it’s OK to just get rid of the offending data for the purposes of your analysis. For instance, I was involved in a transportation analysis where we were using the client’s data to determine their total freight spend across various modes of transit: private truck fleet, LTL (Less-Than-Truckload), small parcel ground (UPS and FedEx), and small parcel air (2-day vs. next-day). Since the data was extracted from a legacy system, I found all kinds of discrepancies, numbering in the thousands of records. Sure, that sounds like a lot. But in the end, the bad or unexplainable records represented less than 5% of the total data set, so we felt comfortable simply deleting them.

Document whatever you do. This can be used as a discussion vehicle to work with the customer on specific issues. “I noticed that 273 out of 37014 order records had null fields in the Product ID column. In addition, there were another 138 records that had Product IDs that weren’t in your master list of products. Does this have something to do with the data gathering process? Or is it acceptable to just ignore these records for modeling purposes?”

Perhaps they can explain the discrepancy, or think of a different way of extracting the data that you asked for, or maybe it was a simple mistake in the query that was used. Or perhaps you will learn something about how the database is used that changes what data should be collected. Whatever the resolution, the customer has become part of the decision-making process. And your joint decision will then become part of the key assumptions in your final presentation, report, or other deliverable.

Fix it yourself

You can also fix many types of data problems yourself. In the old days, we used to write stand-alone utility applications called “pre-processors” in C that would parse the flat files and make all necessary corrections. But there’s a tool already available on your computer that makes this much easier: Microsoft Access. Did you know that the Modeling Studio can be set up to download input parameters from an Access database, just as you can set it up to download from an Excel file?
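
Part 2 covers the Access side, but to show the kind of corrections a pre-processor makes, here is a minimal Python sketch standing in for those old C utilities (the file and column names are made up for illustration) that repairs two problems from the earlier list: truncated leading zeroes on product codes, and zero-piece line items.

```python
import csv

# Hypothetical file and column names for illustration.
IN_FILE, OUT_FILE = "orders_raw.csv", "orders_clean.csv"
PRODUCT_CODE_WIDTH = 7  # fixed-width code, e.g. "0000714"

with open(IN_FILE, newline="") as src, open(OUT_FILE, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    dropped = 0
    for row in reader:
        # Restore leading zeroes truncated by the extract ("714" -> "0000714"),
        # so cross references against the SKU master line up again.
        row["product_id"] = row["product_id"].zfill(PRODUCT_CODE_WIDTH)
        # Drop line items for zero pieces -- nobody orders 0 of anything.
        if row["quantity"].strip() == "0":
            dropped += 1
            continue
        writer.writerow(row)

print(f"Dropped {dropped} zero-piece line items")  # document whatever you do
```

In Access, the same two fixes would typically be an update query and a delete query; either way, keeping a count of what you changed supports the “document whatever you do” rule above.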

In part 2 of this article, I’ll outline a few tips and conventions for working with Microsoft Access to clean up your simulation data files.