Tuesday, June 30, 2009

Dealing with messy data, part 1

On most of our projects, we set up the Modeling Studio so that Microsoft Excel is used for the majority of input parameters for the simulation model. This makes a lot of sense for most of our users, for several reasons:
  • They are typically comfortable working within the Microsoft Excel environment
  • We can take advantage of Excel’s formatting to provide a more professional appearance
  • We can use built-in Excel functions such as data validation and conditional formatting to make sure that the input data provided to the simulation model is correct, preventing extra coding in the model’s read routines
  • It requires less code within the Modeling Studio user interface itself (thus less developer time, less testing, and fewer bugs) than custom forms
  • It helps create a standard look-and-feel across all TranSystems simulation engagements

On the occasional project, we may need to extract data from the customers’ internal systems and data warehouses. For example, on a recent engagement for a grocery distributor’s distribution center, we pulled the majority of our data directly from the customer’s legacy data warehouse. This included a list of all SKUs available in the warehouse and their current inventory levels, a week’s worth of historical inbound and outbound orders, and the SKU-level detail supporting those orders.

In that project, the customer wanted to run historical data through a model that reflected physical system and control software changes in the warehouse, to understand the benefit of those changes relative to what actually happened that week. In other projects, we might generate representative data based on broad profiles of what is ordered and when. In this case, we wanted to consider every pallet that actually arrived at Receiving that week, as well as every outbound shipment request. Getting good historical data was key.

So how would we get this data? For IT and security reasons, it is usually impractical to try to establish a direct connection to the customer’s data warehouse within our source code. It would be hard to develop and test from the comforts of our own office, although establishing a VPN connection to the client’s network is an option – and even then it would likely be slow.

As a result, we’d usually ask the client to extract this data for us: we specify the files we want and the fields we want in each file, and ask the client to send us CSV (comma-separated values) or flat text files containing exactly those fields. Simple, right? After all, the client’s team knows the existing data warehouse (table relationships, structure, etc.) better than anybody. And you’ve provided them with a document outlining precisely what you want to see in those files. It should be a straightforward process to write those queries and get the data back to you.

Yeah…. that works out perfectly every time, doesn’t it?

Data can be messy in many ways

Real world data is messy. It just is. It’s not (usually) the fault of a single person or a single system. It’s not that your client is deliberately trying to provide you with crappy information to sabotage your otherwise pristine and flawless simulation model. It just happens, and you have to learn to expect it and handle it as part of your project.

Some of the types of things you see frequently:
  • Columns out of sequence or missing entirely
  • Additional fields provided that you didn’t specify (“they were available in the system”)
  • Null values in individual fields
  • Negative values for timestamps, inventory levels, or other things that should always be positive
  • Values that you expected to be numeric but instead turn out to be strings (e.g. Door Number: I expected the number 1, but the data had “DR 01”)
  • Mixed numeric and string values within the same field (e.g. on one project, every truck in the fleet had a numeric identifier, but the data was hand-entered into the data warehouse, so the same truck #17 could be specified as “17”, “T-17”, “T17”, “017”)
  • Line items for zero pieces – who orders 0 of anything?
  • Bad cross references between files / tables – e.g. orders for items that don’t exist in the SKU inventory
  • Leading zeroes getting truncated on fixed-width fields (e.g. the product code is “0000714” but your data file contains just the value “714”). This can be a contributor to bad cross references.
  • Duplicate entries representing the same information
  • Different types of data within the same file (e.g. the same file representing all UPS shipment transactions within a calendar year also contained rows for all weekly invoices to UPS for that year. Adding up all the costs would double the actual spend, because each shipment would be counted once as a transaction and again in the roll-up of the weekly invoice.)
  • Nonsensical data within the fields – often occurs with data that’s manually entered instead of scanned

These anomalies are all things you should remember to look for on your projects. Knowing that this can and will happen, what can you do about it?

No, the answer isn’t always “Send it back!”

And even worse, it can’t be “Well, I’ll just run it through the model and see what happens.” Unfortunately, under tight schedule pressure, we can sometimes feel tempted to take that shortcut on our projects. It’s important that we remember the old adage – “Garbage in, Garbage out” – and take the time to thoroughly understand and clean up the data that serves as input to our simulations.

Understand, interpret, and summarize

Anytime you receive data from the customer, one of your first responsibilities is to do an assessment of just how bad it is. It’s tempting to dismiss the entire data set because you notice a few mistakes, and send it right back: “This is wrong, do it over, a team of monkeys could do a better job with your data warehouse!” But as an analyst, you need to be methodical, specific, and quantify the problem.
  • What are the specific mismatches between the data expected and the data received? Think of all the things that could go wrong as outlined above – on every record, column, and file.
  • How many records were affected by each type of mismatch? What percentage does this represent of the overall data set?
  • How significant are these discrepancies to the problem you are trying to solve?
  • If the data is not corrected, how big of an impact would you expect it to have on the simulation model results?
  • Can you fix the data yourself by writing a query, macro, or function?
  • Can the bad data simply be deleted?

Many times, it’s OK to just get rid of the offending data for the purposes of your analysis. For instance, I was involved in a transportation analysis where we were using the client’s data to determine their total freight spend in various modes of transit: private truck fleet, LTL (Less-Than-Truckload), small parcel ground (UPS and FedEx), and small parcel air (2-day vs. next-day). Since the data was extracted from a legacy system, I found all kinds of discrepancies, spread across thousands of records. Sure, that sounds like a lot. But in the end, the bad or unexplainable records represented less than 5% of the total data set, so we felt comfortable simply deleting them.

Document whatever you do. This can be used as a discussion vehicle to work with the customer on specific issues. “I noticed that 273 out of 37014 order records had null fields in the Product ID column. In addition, there were another 138 records that had Product IDs that weren’t in your master list of products. Does this have something to do with the data gathering process? Or is it acceptable to just ignore these records for modeling purposes?”
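
If you’re curious what that kind of tally looks like in practice, here’s a minimal C# sketch. The file names, the Product ID column position, and the CSV layout are all hypothetical – adjust them to whatever extract you actually received:

using System;
using System.Collections.Generic;
using System.IO;

class DataQualityTally
{
    static void Main()
    {
        // master list of valid product IDs, one per line (assumed layout)
        HashSet<string> knownProducts = new HashSet<string>(File.ReadAllLines("products.csv"));

        int total = 0, nullProductId = 0, unknownProductId = 0;

        foreach (string line in File.ReadAllLines("orders.csv"))
        {
            total++;
            string[] fields = line.Split(',');
            string productId = fields.Length > 2 ? fields[2].Trim() : "";  // assumed column position

            if (productId.Length == 0)
                nullProductId++;                        // null / blank Product ID
            else if (!knownProducts.Contains(productId))
                unknownProductId++;                     // bad cross reference
        }

        Console.WriteLine("{0} of {1} records ({2:P1}) had null Product IDs",
            nullProductId, total, (double)nullProductId / total);
        Console.WriteLine("{0} of {1} records ({2:P1}) referenced unknown products",
            unknownProductId, total, (double)unknownProductId / total);
    }
}

Twenty minutes of counting like this gives you exact numbers to bring to the customer, instead of a vague “your data looks wrong.”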

Perhaps they can explain the discrepancy, or think of a different way of extracting the data that you asked for, or maybe it was a simple mistake in the query that was used. Or perhaps you will learn something about how the database is used that changes the data to be collected. Whatever the resolution, the customer has become part of the decision-making process. And your joint decision will then become part of the key assumptions in your final presentation, report, or other deliverable.

Fix it yourself

You can also fix many types of data problems yourself. In the old days, we used to write stand-alone utility applications called “pre-processors” in C that would parse the flat files and make all necessary corrections. But there’s a tool already available on your computer to make this much easier: Microsoft Access. Did you know that the Modeling Studio can be set up to download input parameters from an Access database, just like you can set it up to download from an Excel file?
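
To make that concrete, here’s a minimal sketch of the kind of corrections a pre-processor (or a little .NET code sitting in front of Access) might apply. The “T-17”-style truck IDs and the 7-digit product code width come from the examples earlier in this post, but treat them as assumptions – your client’s conventions will differ:

using System;
using System.Text.RegularExpressions;

static class CleanupHelpers
{
    // "17", "T-17", "T17", "017"  -->  "17"
    public static string NormalizeTruckId(string raw)
    {
        Match m = Regex.Match(raw.Trim().ToUpper(), @"\d+");   // grab the numeric part, if any
        return m.Success ? m.Value.TrimStart('0') : raw.Trim();
    }

    // "714"  -->  "0000714"  (restore leading zeros truncated from a fixed-width code)
    public static string RestoreLeadingZeros(string code, int width)
    {
        return code.Trim().PadLeft(width, '0');
    }
}

Running something like this over each record as you stage the data takes care of a couple of the items on the list above before the model ever sees them.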

In part 2 of this article, I’ll outline a few tips and conventions for working with Microsoft Access to clean up your simulation data files.

Saturday, February 28, 2009

The OOBE

So you’ve just spent weeks developing the perfect simulation model for your client’s distribution center. You have achieved the pinnacle of what it means to be a simulationist. Your logic is flawless, the model validates perfectly on 10 different data sets, and you’ve just saved the client big bucks when the model results showed that the system doesn’t really need the 267-lane sorter the vendor was trying to sell them after all. The project is complete, and you proudly hand over your work of art to the client so that they can continue to run what-if scenarios on their own. You strut around the office letting your co-workers know Who’s The Man [1] and consider giving yourself a day off to celebrate your greatness.

The client installs it. Then they try to run it….. Uh oh.

You cringe in embarrassment as you get the call:

Customer (irritated voice): “Your model doesn’t work. I thought you said you were done. This is broken.”

You (shocked voice): “Well, it works on my machine!”

Customer (annoyed voice): “…… Maybe you didn’t hear me.”

You (backpedaling voice): “Well, let’s see how we can fix it. Did the install process go okay?”

Customer (discombobulated voice): “Sure, it said it installed successfully after I clicked OK a bunch of times. There were a couple of weird messages – something about libraries. Then I opened the Modeling Studio like you showed me the other day. But when I try to run it, nothing happens.”

You (staticky voice): “Hello? Can you hear me? I think we have a bad connection. I’m gonna have to call you back – are you free four months from next Tuesday?”

This is what we call the OOBE – the Out Of Box Experience. And in this example, it’s a bad one.

What’s an OOBE?

There is no impression like the first impression. If something goes wrong before the client even has a chance to use your brilliant, wonderful simulation masterpiece, you’ve dug yourself into a hole that can be hard to climb out of. It’s like showing up at one of those speed dating things right after eating a garlic and anchovy pizza. [2] Or like showing up to a job interview in spandex tights. [3]

It is your job to make sure that the customer’s OOBE is as good as it possibly could be. If you give your customer a bad OOBE, what effect do you think this will have on the customer’s perception of you, your product, and your company? What’s a more likely thought process: a) “If I can just get it to run once, I bet everything else is absolutely perfect!” Or b) “I can’t even get it to run once – I wonder what else they did wrong?”

Think of it this way: pretend you are sending the model to your Aunt. What would you want her experience to be like when she installs your software? (For the sake of this discussion, pretend it’s an Aunt that you like, not that creepy old aunt with all of the cats running around the house.)

Before a project deliverable goes out the door to your Aunt, or to your customer, it's time for a thorough OOBE test. This is different from conducting structured testing of the specifics of your model. The objective is to step out of the project’s intricate details, put yourself in the customer’s shoes, and assess the Big Picture. Does the model run the first time? Do the links in the Modeling Studio’s navigation panel work? Do the logos and images show up where you’d expect them to? Do all of the outputs load correctly? Can you save and restore a scenario?

Note that none of this has to do with the specific functionality of your model -- it's testing the little stuff you take for granted. You’ve been working on the project for weeks, and you’ve just come to assume that all of these little things are in place and that they just plain work. But that’s on your development computer. What about the customer’s environment?

This is where a Virtual Machine (VM) comes in. A VM is just like a computer within your computer. You set up a VM to match the customer’s expected environment, and boot it up just as if you had another physical box in your office. Then, you can install your model onto that VM so you can fairly evaluate the OOBE.

A VM test should be the final step in your project process before you deliver a release to your customer. This is not “something we might do if we have time and budget”. This is a critical part of ensuring the quality of the product you deliver to your customer.

What’s in it for me?

I hear you saying: “Customer this, customer that, blah blah blah, what do I get out of it?” Ah, you selfish simulationist. But believe me, a VM test helps your project in many ways:
  • You save face. You get to see what the client sees the first time they install and try to run your model. By doing this, you can anticipate any problems they might encounter – and fix them – before you get embarrassed by them. You personally get to put your own stamp of approval on the OOBE, and make sure there is nothing in the way of the client’s beaming appreciation of your glorious work of art.

  • It helps you remember. By conducting a VM test, you ensure that your installation package is complete and that you haven’t forgotten any key files that are required to run the application. Ever seen what happens in the Modeling Studio if you forget to include one of the files defined as part of a scenario? It will completely reformat your hard drive. [4] Good thing you’re testing on a virtual machine!

  • You ensure that your application doesn’t rely on software that’s only installed in the developer environment. We had this problem with DynaZip, which is used by the Scenario Manager: if you run it on a developer’s machine, it looks like you only need the library dznet.dll for it to function. However, there’s an additional file, dzncore.dll, inconveniently tucked away in C:\Windows\System32, that’s also required. If you forget to include it in your installation package, bam! The Modeling Studio crashes. We found out about this problem via testing on a virtual machine.

  • Similarly, you are checking how your own software will function within the client’s environment, which is probably different from TranSystems’ environment. Chances are that the customer’s IT Department has mandated the operating system, version of Office, and version of Java on each of the staff computers. For example, we’ve got a significant client that is still using Office 2000 (it’s 2009, by the way, and Microsoft has released three versions since then). You know that fancy chart you created using Excel 2007 on your machine? Guess what: it looks like a bunch of spaghetti in Excel 2000. Matching the customer environment on your virtual machine gives you the best chance of detecting these types of problems before the client finds them.

  • Maybe you’ve spent weeks in the formal testing process. You’ve checked every combination of inputs and made sure that the outputs produced by your simulation are meaningful. You’ve checked boundary conditions, zero value inputs, and have flooded the system with 3x the volume expected to prove that the model does not crash. You have been an excellent Tester. You’ve done it the Right Way. Now how discouraging is it if the customer can’t even run the darn thing?

  • If something goes wrong in the install process, your own development environment is not compromised. For instance, we had this problem with the infamous Scripting Runtime Engine, scrrun.dll. This is one of the core libraries that ships with some versions of Windows but is not guaranteed to be found in a client’s environment. Sometimes we write code that references the Scripting Runtime, which means we need to include scrrun.dll in our installation packages. (Hi2u, RMS!) If you install such a package directly on your computer and later uninstall it, scrrun.dll might be removed along with it. To coin a phrase, that can really grouse up the skizzlewits. A Virtual Machine protects you from all of this.

As I often preach, improve the quality of the process and you improve the quality of the product. A VM test, just before release, is one of those process steps that’s easy to do (and easy to make excuses not to do), and there’s really no downside to doing it. So do it. You only have one chance for your product to make a first impression. And a good first impression goes a long way toward customer satisfaction.

Which, as consultants in the service industry, is what we’re all about.

So how’s your OOBE?

Next installment: How to set up a Virtual Machine on your computer, featuring a special guest author!

[1] In this context, “Who’s The Man” is a gender-neutral term that applies to all of you.
[2] Which the author has never done. But I did know this guy once…
[3] Professional wrestler job interviews excluded.
[4] It’s an Easter Egg I had Geoff put in there in 2006. Try it. I dare you.

Saturday, January 31, 2009

ADO.NET Entity Framework

Wow is it 2009 already? Sorry about the lack of posting, I've been swamped with project work for Ortho Clinical Diagnostics -- not to mention all of the turbulence surrounding the company's reorganization last fall. I'll try to do better this year.

Some of the team and I have been discussing Microsoft's next-generation data access technology, the ADO.NET Entity Framework, and someone suggested (demanded?) that I put this up on the jimblog. "Next generation" is a bit misleading, however -- this shipped last year with the .NET 3.5 SP1 framework. Our current .NET technology at TranSystems -- for the Modeling Studio, or for other software development -- is .NET 2.0, which was available with Visual Studio 2005. I'm not sure when we will transition, but the Entity Framework is a good reason to consider it.

The key concept behind the Entity Framework is to provide a conceptual language for you to code with, instead of writing low-level data access code. So let's take an example from the MSDN overview where you want to find all the pending orders for sales people in Washington. In traditional ADO.NET, you would write a SQL command that joins the tables together, return the results in a DataSet, and then iterate through the rows in the dataset. Something like:

sSQL = "SELECT * FROM tblSalesOrders _
INNER JOIN tblSalesPeople ON tblSalesOrders.SalesPersonID = tblSalesPeople.SalesPersonID _
WHERE tblSalesOrders.Status = 'Pending Stock Verification' _
AND tblSalesPeople.State
= WA"

cmd = New SQLCommand(sSQL, cnn) // you've already set up cnn somewhere

dsOrders = cmd.ExecuteReader

I may have fudged a detail or two, but you get the idea.

Note that implicitly you need to know some things about how the data is stored -- what tables have relationships to each other, on which fields, etc. And you may tell me, "This is usually not a big deal -- if you're writing the code, you're probably also the one who designed the database. And we're all in the habit of doing this anyway." And you would be correct.

But the cool thing about the Entity Framework is that all of this SQL stuff is just abstracted away. You are programming in terms of the business logic itself. You don't need to know anything structural about the database. Rather, once you set it up, you can simply write (this is C# below):
// find all the pending orders for sales people in Washington
var orders = from order in orderTracking.SalesOrders
             where order.Status == "Pending Stock Verification"
                && order.SalesPerson.State == "WA"
             select order;
Check it out. There are no joins. No SQL. You go from the Order object to a related SalesPerson object to get the State field. It's almost like the code was written in English! In case you don't see right away why this is so cool, here's how MSDN describes it:

There are two elements that are important to highlight in the code above:

  • No artificial constructs. It's common to see applications that need to adapt to peculiarities of the underlying store schema. For example, applications built on top of relational databases often have to make extensive use of joins in order to navigate through relationships. In the code above, in contrast, the "shape" of the data follows the abstractions of the problem being modeled; there are "orders", which have "order lines" and that are related to a "sales person".
  • No plumbing. The code is very database intensive, yet there are no database connection objects, no external language such as SQL for query formulation, no parameter binding, no configuration embedded in code. In this sense, you could say this code is "pure business logic".
Think about this last point -- the pseudocode (you do that, right?) that you'd write for a critical design review (you do those with your team sometimes, right?) becomes really really close to the actual code that's implemented. How much time do you think that would save you?
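
One more small point while we're at it: consuming the results is just as clean, and (as I understand the framework) the query above doesn't actually execute until you enumerate it. A quick sketch -- OrderID is a made-up property name for illustration:

// nothing has hit the database yet; the query runs when you enumerate it
foreach (var order in orders)
{
    Console.WriteLine("Order {0}: {1}", order.OrderID, order.Status);
}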

Anyway, there's a lot more in the full technical article. Also, here are some white papers that might be of interest. They're from a company called IdeaBlade, whose product DevForce helps bootstrap construction of enterprise-level software solutions so you can get going quickly (but that's a subject for another day...)

Though I haven't tried the Entity Framework yet, I've been monitoring the technology for a couple of years, and I'd encourage you to put it on your radar screens as well. We have at least two pure software development projects right now (NASSCO and WholeWorks), and while it's probably too late in the game to affect those, it's always a good idea to keep your design options open for the future. At least we should know it exists!

Cheers,
-- Jim