Data Quality - The Elephant In The Room
So, you are looking to add time-series data from one of your facilities to your data lake to start doing "Big Data Analytics" on the data, huh? Well, do you REALLY understand how good your data is in that unknown time-series world? I would be willing to bet that some or all of it isn't that useful to you. Sure, some of it will be fine, or so you think. However, that one key data stream may not be at the fidelity that you need to unlock any insights about something you really want to know about, or worse yet, the data has been stale for a while and no one knew it, the instrument went out of calibration and you didn't know it, or a host of other calamities.
Proper handling of time series data for both real time analysis and for analysis in more of the "big data" platforms is really critical, but is often overlooked. Below, I will discuss common problems that I have seen.
First, I am going to address the traditional "Data Historian" market for just a moment - this will include the following, among others:
OSIsoft's PI System
Rockwell's FactoryTalk Historian
Canary Labs Historian
Some of these systems have data compression algorithms and some don't, and I would like to address this topic, as I think there is a LOT of misconception out there. If you are using a SQL based historian or some other means to collect data that doesn't have some kind of compression algorithm, you really need to look at doing something else.
There are really two camps on the subject of compression: those who strongly believe in data compression (count me among them), and those who don't (Capstone is among the most notable that don't). There really isn't a middle ground: you either believe in data compression or you don't. I can poke holes in all of Capstone's arguments for not having data compression, but this isn't the forum to do so. If you want to discuss this idea further, you will just have to email or call me.
If you do have a historian that has data compression, you really need to evaluate on a tag by tag basis if you have the right compression settings. If you don't, here are the issues you will face:
You will miss lots of valuable information if you over compress the data
You will have trouble with large data pulls if you under compress the data
I was doing an investigation for a customer one time and found that they had way over compressed some differential pressure readings. Fortunately, of the 12 readings I was looking at, only one was over compressed, but it did skew the number of times that a DP threshold was violated and gave us a skewed answer to how bad their DP excursion problem was. Here is a picture of 4 of the trends; 3 are compressed reasonably well and the other one is a straight line (over compressed):
In the report I had created to investigate the high DP problem, I was looking at how many times and for how long the DP had gone above 5 PSI, and the equipment that had over compressed data had significantly fewer events than the other three pieces of equipment that did not have over compressed data. Imagine if all of the tags related to differential pressure were tuned the way that the one represented by a straight line was. I wouldn't have found anything at all and my report would have been completely useless.
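To make the effect concrete, here is a minimal sketch of how over compression can hide threshold excursions. It uses a simple deadband filter as a stand-in, which is an assumption on my part and not the algorithm of any particular historian. The readings, the function names and the deviation settings are all made up for illustration; only the 5 PSI threshold comes from the report above.

```python
def deadband_compress(samples, deviation):
    """Keep a sample only if it differs from the last KEPT value by more than `deviation`.

    This is a simple deadband filter used for illustration only.
    """
    kept = [samples[0]]
    for t, value in samples[1:]:
        if abs(value - kept[-1][1]) > deviation:
            kept.append((t, value))
    return kept


def count_excursions(samples, threshold):
    """Count the number of times the value rises from at/below the threshold to above it."""
    count = 0
    above = False
    for _, value in samples:
        if value > threshold and not above:
            count += 1
        above = value > threshold
    return count


# Hypothetical DP readings in PSI, one per minute: (minute, value)
raw = [(0, 4.0), (1, 4.2), (2, 5.3), (3, 4.1), (4, 4.0),
       (5, 5.4), (6, 5.6), (7, 4.2), (8, 5.2), (9, 4.0)]

reasonable = deadband_compress(raw, deviation=0.5)  # hypothetical "tuned" setting
over = deadband_compress(raw, deviation=2.0)        # hypothetical "over compressed" setting

print(count_excursions(raw, 5.0))         # 3
print(count_excursions(reasonable, 5.0))  # 3
print(count_excursions(over, 5.0))        # 0
print(len(over))                          # 1 stored point, i.e. a straight line
```

With the wide deviation, only the first point is ever stored, so every excursion above 5 PSI disappears from the archive, much like the straight-line trend described above.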
There is actually a tool that I use against the PI System called CompressionInsight from Pattern Discovery. You can read about it here. This tool gives you a "toolbox" that lets you tune tags individually. Below is a screenshot that I took from a tag on my demo PI System:
I know it is a bit difficult to see, but the blue line is the raw data coming in, the red line is the data I am storing (look familiar?) and what is hard to see is the orange line where I did some manual "what if" scenarios on the data fidelity. Below, there will be a link to a YouTube video where you can see this more clearly, as I discuss data quality in greater detail and show examples.
Now, imagine yourself pouring a bunch of time-series data into your data lake, and you have the above scenario. How do you think your analytics will look? Not great, and they may even lie to you. This is not a good situation. I would venture to guess that each of you reading this has at least one critical piece of information that is behaving like this. You know why I am so confident? I haven't been anywhere yet where this wasn't true. I have always been able to find a critical tag in this state. Now, I am sure I will go somewhere where this isn't true, but I bet the longer the PI System has been in place at the customer site, the better chance I have of finding over compressed data. This is because data storage and network bandwidth really were issues at one time, so people purposely over compressed data to store less of it. Now, due to attrition of people and lack of attention on the data, this issue has never been corrected to take advantage of better network bandwidth and less expensive storage.
The other issue that I often see is that data can be under compressed. This presents a different issue. Let's say I have a tag that I want to retrieve data from for a year and the data is being collected at 1 second intervals with no compression at all. That means I am getting 86,400 points per day times 365 days or 31,536,000 data points. What if I am trying to retrieve several hundred tags at a time? This will be a huge hog on my computer and potentially my network. Some people argue, as do some in the industry, that you want to keep ALL of the data. Well, are you sure? Are you sure you really need ALL of the data? Watch this and tell me if you still think that way.
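The arithmetic above is easy to extend to a multi-tag pull. The sketch below is a back-of-the-envelope estimate only: the tag count of 300 and the 16 bytes per point (an 8-byte timestamp plus an 8-byte value) are my assumptions for illustration, not figures from any particular historian.

```python
SECONDS_PER_DAY = 24 * 60 * 60                  # 86,400 one-second samples per day
points_per_tag_year = SECONDS_PER_DAY * 365     # uncompressed points per tag per year

tag_count = 300        # assumption: "several hundred tags"
bytes_per_point = 16   # assumption: 8-byte timestamp + 8-byte float value

total_points = points_per_tag_year * tag_count
total_bytes = total_points * bytes_per_point

print(f"{points_per_tag_year:,} points per tag per year")
print(f"{total_points:,} points for {tag_count} tags")
print(f"{total_bytes / 1e9:.1f} GB before any protocol or storage overhead")
```

Under these assumptions, a one-year pull of 300 uncompressed tags is over nine billion points, which is why under compressed data makes large retrievals so painful.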
Now, for a data lake or some other "big data" type of an application, the tool will likely want evenly spaced data, and other considerations come into play. A friend of mine, Holger Amort, and others weighed in on this topic recently in PI Square. I think Holger and the others bring up really valid points. You will likely want to take some type of moving average, rather than sampling evenly spaced data, so that it is as accurate as possible.
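Here is a rough sketch of the difference between sampling and averaging when you turn unevenly spaced data into evenly spaced data. It is one possible interpretation of "some type of moving average": a time-weighted average per fixed interval, assuming each value holds until the next recorded one (step interpolation). The function names, the data and the 60-second interval are hypothetical, and this is not taken from the PI Square discussion.

```python
def value_at(samples, t):
    """Return the last recorded value at or before time t (step interpolation)."""
    current = samples[0][1]
    for ts, value in samples:
        if ts > t:
            break
        current = value
    return current


def time_weighted_average(samples, start, end):
    """Average the step-held signal over [start, end), weighting each value by its duration."""
    total = 0.0
    t = start
    v = value_at(samples, start)
    for ts, value in samples:
        if ts <= start:
            continue
        if ts >= end:
            break
        total += v * (ts - t)   # the previous value was held from t until ts
        t, v = ts, value
    total += v * (end - t)      # the last value is held until the end of the interval
    return total / (end - start)


# Hypothetical unevenly spaced samples: (seconds, value), with a short spike at t=50
samples = [(0, 10.0), (50, 30.0), (60, 10.0), (120, 10.0)]

for start in (0, 60):
    sampled = value_at(samples, start)
    averaged = time_weighted_average(samples, start, start + 60)
    print(start, sampled, round(averaged, 2))
```

Sampling once per minute returns 10.0 for both intervals and never sees the spike, while the time-weighted average for the first interval is pulled above 10.0 by the ten seconds spent at 30.0.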
You obviously won't quite see the level of detail if you aggregate the data, but a "big data" analysis will not require high fidelity data anyway. However, it will require an accurate depiction of high fidelity data at evenly spaced intervals. It also may point out places where you need to go take a deeper look at some high fidelity data. In the example below, I am using Tableau to look at a lot of downtime events (comparing the number of downtime events versus the actual downtime in a scatterplot and looking at the outliers) for a piece of equipment, and I have included a link to the high fidelity data trend for each of the circles. In this way, someone can use lots of aggregated data to see where the high fidelity data for a production run may be of use. In the case below, one can follow the Coresight hyperlink to investigate a production run that had 8 downtime events and 386 minutes of downtime. So, make sure that you use each tool in the right way.
So, my real point is, no matter what historian you are using, you should have some kind of compression algorithm to store just the data you need for quality data analysis in either real time or with larger sets of data. More times than not, these data compression algorithms are not applied correctly and can cause big data analytics to "lie" if these issues aren't rectified. So, please, don't be one of those companies or individuals that overlook this key idea, and get the data tuned correctly, so that both real-time visualization and analytics tools, as well as big data tools, will give you the answers you need.