On Terminology: “Single Source of the Truth”

According to Wikipedia, Single Source of the Truth “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once” (emphasis is mine). This would mean, for example, a customer’s first name to be stored in once repository, not in every system that refers to the customer.

First, it’s a concept that is both difficult, and subject to various interpretations and implementations. The Wikipedia page does a nice job of mentioning the difficult parts, like dealing with the schemata of Vendor products, etc. As for the variety of implementations, you can enforce this in a dogmatic way where data is truly only stored in one place. Or you can implement with policy, having a location for each piece of data that is considered the master, and other pieces of data are responsible for publishing changes and updating periodically from that source. Either way, it is clear that this is a strategy to choose judiciously.

Additionally, choosing this strategy requires strong consideration of the effects on performance, reliability, and caching. If secondary storage is allowed, then stale data and concurrency issues arise. If secondary storage is prohibited, then you now have a single point of failure for many applications. Using the example of a CRM system being the single source of a customer’s first name, imagine the impact of that CRM system being down if other applications are not allowed to store that data.

So why this post? Why all this time and effort to define the term and discuss some of it’s nuances? In a variety of work places, I’ve seen this catch on as part of the lingua franca between business and IT workers, but used carelessly. And the number one problem is that I’ve seen Type A managers use this term to justify their oversimplified view of information management.

Notice the emphasis on information, and that I emphasized data in the definition “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once.” Information is data within a context, and that’s the key problem when you get sloppy with the concept of “Single Source of the Truth.”

In one example, a particular manager had a problem with the fact that weather data was being stored in many different systems across the enterprise. I was part of a team tasked with creating a single consolidated data store and import program for all weather data across the company, because of his goal of having a single source of the truth. Briefly after looking into the other systems, it was clear that he didn’t grasp the ramifications of the concept.

Weather is a key factor in the demand for this customer, and so it is the basis of historical analysis, contract bidding, countless other aspects of their business. To our anonymous manager, that meant it was crucial to consolidate this information and have only one source. He was certain that people were out there using inconsistent sources that were causing efficiency problems, among other things.

Let’s start with the different types of weather data. There are forecasts and actuals. There is daily weather and hourly (and daily sources are peak for some uses, average for others). Finally, it’s worth noting that weather data is often corrected later, when the real-time value provided was measured incorrectly, or some other type of error occurred.

So let’s assume that we’re trying to consolidate hourly actual data. All applications should use this source. And let’s look at a couple of those uses:

A bid for service is based on historical data, where the agent writing the bid used that weather data to evaluate the customers demand sensitivity to weather, and to evaluate the companies supply as trend of weather.
A report on the effect on weather on supply is regularly supplied to operations managers.
The accuracy of this data supplied by an outside vendor is to be regularly audited by Supply Chain.

Now, let’s assume those activities have taken place for the month of April, and it’s the middle of May. Now the vendor comes in with correction data for the middle of April. For the first purpose (the contract), I want to store what information was used to write the contract at the time. It’s the only fair way to evaluate the agent, as he wrote the bid based on the best available information.

Because supply and production is naturally affected by weather, for the second purpose (operations evaluation), I want to rerun those reports based on new, more accurate information.

Even more disruptive is the fact that in order to evaluate the variances in the accuracy of the vendor data, the company should be storing both values.

This leaves you at a decision point: Do you handle this by declaring these as different information, or version the information. In other words, the bid history is linked to uncorrected vendor data, and the updates are used to create a corrected data source that can be used for the operations purpose.

The alternative is that that corrections cause the creation a new set, but all sets are retained. Differentiation is handled with a version number or timestamp, and all the above problems are solved. While this sounds simple, versioned data grows quickly, and is difficult to query and understand.

Due to the timestamp, each record can now be referred to as the single source of weather data, for that location, occurance date and time (date of the weather), as provided on said date and time (import time). But for each location and time, there are multiple potential values as corrections are entered. And there is forecast vs actual data.

So to be precise, I still can’t say “give me Cleveland’s weather for March 7, 2011.” I would have to say “give me the actual weather value for Cleveland on March 7, 2011 that was available when I wrote a bid on April 5th.” Or in the case of an operations manager, they would request “the latest value of actual weather for Cleveland on March 7, 2011.”

Those are different pieces of data. But I don’t think that’s what our manager had in mind when he requested a single source of the truth for weather data. Because he meant weather information. Context / details / reality didn’t fit the mental model he had of weather data.

In the case of this project, we were able to slightly reduce the amount of weather data stored. And we certainly reduced the amount of batch jobs involved in fetching that data from external sources. But we also created a performance and reliability bottleneck. That may or may not have been the right decision. My point is that it is worth taking some time to think through and understand the terms you are using. Sometimes simple answers are great, but sometimes they are really just a sign of naivety.