Why Compression Is So Important For The PI System
Industrial Insight's engineers have found that in every PI System the data quality is pretty low. This is not because the PI System isn’t capable of collecting high-quality data; more often, it is because the system and its individual tags are poorly tuned and don’t reflect the nature of the process. In many cases far too much data is stored, and in others not nearly enough.
With the advent of modern tools within the PI System, such as PI Analytics, PI Event Frames, and the PI Integrator for Business Analytics, as well as machine learning and other advanced analytics techniques that are being applied to time-series datasets, data quality is more critical than ever. We have seen events go uncaptured, incredibly slow data retrieval speeds, and large inaccuracies in totalizers and utilization calculations because of poorly chosen tuning parameters within the PI System. This causes people to not trust the data, which is shameful in this day and age.
One of the main reasons people don’t trust their data is that the data compression algorithm is poorly tuned for the majority of the tags within the system.
There is actually a fair amount of misinformation, and a wide range of opinions, about the need and desire for data compression, even in the time-series data community. Numerous entities will tell you that data compression is a BAD thing and that you need to keep ALL of the data.
It is Industrial Insight’s belief that it is not a good idea to keep all data, except in some special circumstances. In most cases, and almost always with sensor data, minuscule changes in sensor readings may be well within the instrument’s error and can be regarded as “noise,” or the changes could reflect loss in the analog-to-digital conversion.
For instance, a J type thermocouple used to measure temperatures up to 1300 degrees Fahrenheit has a typical tolerance of around +/- 4 degrees. So, when monitoring a temperature, a change from 1,000 degrees to 1,001 degrees is well within the normal error range and may or may not be significant. We have often seen engineers ask for high-speed (1 second) data with storage of all readings. Our contention is that when it takes 5 minutes to heat up from 1,000 to 1,005 degrees, scanning every second and storing every minuscule temperature change may not be the best idea, because the result is excess data that may also be excessively noisy. The scan rate and the amount of data kept should be balanced so that they reflect what is going on in the process.
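To make the trade-off concrete, here is a minimal Python sketch of a simple deadband filter applied to the 1,000 to 1,005 degree ramp described above. It is an illustration only, not the PI System's actual exception or compression logic; the function name deadband_filter, the 1-degree deadband, and the perfectly linear, noise-free ramp are all assumptions made for the example.

```python
def deadband_filter(samples, deadband):
    """Keep a sample only when its value differs from the last kept
    value by more than `deadband`. The first sample is always kept."""
    kept = []
    last_value = None
    for timestamp, value in samples:
        if last_value is None or abs(value - last_value) > deadband:
            kept.append((timestamp, value))
            last_value = value
    return kept


# Hypothetical ramp: 1,000 to 1,005 degrees over 5 minutes (300 seconds),
# scanned once per second, with no noise added.
ramp = [(t, 1000.0 + 5.0 * t / 300.0) for t in range(301)]

# Assumed deadband of 1 degree, well inside the +/- 4 degree tolerance.
kept = deadband_filter(ramp, deadband=1.0)

print(f"scanned: {len(ramp)} samples, kept: {len(kept)} samples")
```

On this idealized ramp the filter keeps only a handful of the 301 scanned values, and a wider deadband, closer to the instrument's stated tolerance, would keep fewer still. Real sensor data is noisier, which is why the deadband has to be chosen per tag rather than guessed.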
The data compression algorithm (aka the swinging door algorithm) in the PI System is designed so that each PI Tag (sensor reading) has its own “tuning parameters.” However, this is one of the most misunderstood components of the PI System; it is often difficult to visualize and is very often misconfigured or misapplied. We often see “under compressed” tags (storing way too much data) and “over compressed” tags (storing way too little data). These conditions often go unnoticed until one of the following events happens:
Someone is trying to troubleshoot and notices an over compressed tag that prevents troubleshooting because the actual sensor readings are filtered out
Someone is trying to configure analytics or event frames based on tags that are over compressed (not enough events or analytics way off from expected) or under compressed (takes extraordinarily long to backfill or retrieve data), and results are uneven
Someone is trying to retrieve data over a long time range and notices that it is either exceedingly fast or exceedingly slow to do so
Only then are tags “tuned,” and this is typically done based on feel rather than on any type of scientific method. This is why Pattern Discovery Technologies wrote the software tool CompressionInsight many years ago: to have a scientific way of tuning tags. By using this tool, one can ensure proper data quality and data fidelity within the PI System, so that the most accurate process information is reflected in the PI System.
For a full explanation of how compression and exception testing work within the PI System, please watch this video: https://www.youtube.com/watch?v=89hg2mme7S0
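For readers who prefer code to video, the following is a minimal Python sketch of the textbook swinging door idea mentioned earlier. It is not the PI System's exact implementation, which has additional per-tag settings; the function name swinging_door and the single deviation parameter are assumptions for illustration, and timestamps are assumed to be strictly increasing. A point is archived only when no single straight line from the last archived point can pass within the deviation of every sample seen since.

```python
def swinging_door(samples, deviation):
    """Textbook swinging-door compression over (timestamp, value) pairs.

    Returns the subset of samples that would be archived. Timestamps
    must be strictly increasing.
    """
    if len(samples) <= 2:
        return list(samples)

    archived = [samples[0]]          # the first sample is always archived
    anchor_t, anchor_v = samples[0]  # last archived point
    slope_hi = float("inf")          # tightest upper bound on the slope
    slope_lo = float("-inf")         # tightest lower bound on the slope
    previous = samples[0]

    for t, v in samples[1:]:
        dt = t - anchor_t
        new_hi = min(slope_hi, (v + deviation - anchor_v) / dt)
        new_lo = max(slope_lo, (v - deviation - anchor_v) / dt)
        if new_lo > new_hi:
            # The "doors" have closed: archive the previous sample and
            # restart the doors from it, using the current sample.
            archived.append(previous)
            anchor_t, anchor_v = previous
            dt = t - anchor_t
            slope_hi = (v + deviation - anchor_v) / dt
            slope_lo = (v - deviation - anchor_v) / dt
        else:
            slope_hi, slope_lo = new_hi, new_lo
        previous = (t, v)

    archived.append(samples[-1])     # always keep the most recent sample
    return archived


# Hypothetical signal: ramps up for 5 steps, then back down for 5 steps.
zigzag = [(t, v) for t, v in enumerate([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0])]
print(swinging_door(zigzag, deviation=0.1))
```

With this input only the start, the peak, and the end survive, because the points in between lie on the straight lines joining them. Setting the deviation too large would also drop the peak (over compression); setting it near zero on a noisy signal would keep nearly every sample (under compression).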
Because compression is so often misapplied, it is often seen as unnecessary in the modern world: data storage and bandwidth are relatively cheap, and retrieval speeds are faster than ever. However, at scale, and with millions more sensors coming on the market and into our customers’ data systems, it is Industrial Insight’s opinion that data quality should be handled on the front end of data collection, rather than dealt with by expensive data scientists on the back end or, in many cases, never dealt with at all. We also believe it is lazy to just “store all of the data and deal with it later,” and that storing unnecessary data is expensive, since even low-cost data storage can get “blown up” quickly by lots of excess process data.
As an example, below is a weight transmitter. 21,600 raw data points were collected, and the customer had over compressed the data and was storing only 77 of the 21,600. CompressionInsight recommended keeping 743 of the 21,600 samples, which would give 95% fidelity while storing only 3.4% of the data. This is an extreme case, but often, 50%-75% of the original samples can be kept and one can still achieve 90-95% data fidelity. Multiply these types of examples across millions of sensors and the data storage and retrieval savings can be massive, yet with little to no real loss of information.
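The percentages in the weight transmitter example can be checked with a few lines of Python. Only the three sample counts come from the example; the helper name retained_percent is an assumption for illustration, and the fidelity figures themselves come from CompressionInsight and are not recomputed here.

```python
def retained_percent(kept, total):
    """Share of raw samples that end up stored, as a percentage."""
    return 100.0 * kept / total


RAW_SAMPLES = 21_600      # raw data points collected
CUSTOMER_STORED = 77      # stored under the customer's over-compressed settings
RECOMMENDED_STORED = 743  # stored under the recommended settings

customer_pct = retained_percent(CUSTOMER_STORED, RAW_SAMPLES)
recommended_pct = retained_percent(RECOMMENDED_STORED, RAW_SAMPLES)

print(f"customer settings:    {customer_pct:.2f}% of raw samples stored")
print(f"recommended settings: {recommended_pct:.2f}% of raw samples stored")
```

The recommended settings store roughly ten times as many points as the customer's original settings, yet still discard more than 96% of the raw samples.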