Are Data Lakes Evil?
I recently saw a former colleague of mine, Tjeerd Zwijnenberg of OSIsoft, post the Forbes article "Why Data Lakes Are Evil." There are definitely some points made in the article that I agree with, and there are some things that I believe get swept under the rug a bit.
Don't be a pack rat:
The first point in the article is that you shouldn't just blindly store all data in a data lake. Here is Adam Wray of Basho's comment:
"Why is this? Well, think about a pack rat. Sure they have lots of stuff, and a few items might be valuable, but because they have no ability to categorize or judge between priceless and worthless items, they end up with disorder."
I believe this is true. We all do this in our personal lives to some degree. Look in your garage, attic, storage building, or paid-for storage pod at all of the stuff you have packed away that you might need one day. You never know when you might need some item, right? However, when that day comes, good luck remembering where you stored it.
If you choose to store everything in a data lake, you are acting on the same premise. You throw everything in, at varying levels of cleanliness, and it becomes difficult to find and extract just what you need, just when you need it. Wait, did I ever decide to store that? Oh dear...
There is a flip side to this as well. In the last several years, I have taken time-series data that people were collecting but not really using and turned it into insights they hadn't even thought of before. If they hadn't kept all of that data (at least it was easy to find), I wouldn't have been able to help them. Probably the biggest use case I am seeing for this is condition-based maintenance. I recently did a mash-up of some maintenance work order data with some time-series data on some pumps to show a customer how this unused data could help them with condition-based maintenance. I am almost certain that this data was rarely, if ever, looked at. So, the flip side of the above premise is often to collect all of the data you think you might need, even if you don't know how you will use it yet. Just make sure it is organized and easy to get to.
All you have to do is?
There is definitely one statement in the article that I believe is glossed over and is "easier said than done." Here is the statement:
In many cases, data should be summarized or acted upon before being stored in the data lake. Take IoT data for example. Proponents of data lakes often say we should store all the data from IoT sensors. But consider a temperature sensor on a turbine. If the temperature reaches a certain threshold, action should be taken to shut it down or dispatch someone to fix it. That’s timely information. And even for the long term, if the sensor transmits the temperature every 15 seconds, you don’t need to store all those values along with timestamps; you can summarize the data when the temperature is stable without actually losing any information. Instead of storing everything, decision-making based on needed data should be accommodated into workflows.
Now, this sounds easy and is actually quite pragmatic, but exactly how is it accomplished? How do you define when an instrument is "stable"? My guess is that various instruments and signals would all have slightly different parameters, so now you have to go through EVERY signal and define stability metrics. I recently looked at a signal where a drop from 3.11 to 3.09 was actually quite significant. It was a specific gravity measurement, and it told a customer of ours that something was in their product that shouldn't be there. However, if we had simply summarized the data because it was "stable," we would have completely missed this relationship. "Stable" will likely mean something different depending on the equipment and process stability, as well as the accuracy of the instrument. Sometimes +/- 1 degree is "stable," and for other instruments, it may be +/- 10 degrees.
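To make the point concrete, here is a minimal sketch of what a per-signal stability check might look like. The helper name, the deadband values, and the windowing are all my own illustration, not anything from the article or the PI System; the point is simply that the deadband has to be chosen per signal, or a meaningful move like 3.11 to 3.09 gets averaged away.

```python
def is_stable(window, deadband):
    """Hypothetical helper: a window of readings is 'stable' only if its
    total spread stays inside a signal-specific deadband."""
    return (max(window) - min(window)) <= deadband

# The specific-gravity example from above: the drop from 3.11 to 3.09
# matters, so the deadband for this signal must be tighter than 0.02.
sg_readings = [3.11, 3.11, 3.10, 3.09]
print(is_stable(sg_readings, deadband=0.05))  # True  - a loose deadband hides the drop
print(is_stable(sg_readings, deadband=0.01))  # False - a tight one flags the movement
```

The same window is "stable" or "not stable" depending entirely on the deadband, which is exactly why someone has to define that number for every signal.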
At what interval do you summarize it? As I outlined in an earlier blog post, Business Intelligence and Machine Learning algorithms expect data to be stored at evenly spaced intervals, so how will you account for this uneven data - some of it summarized because it is "stable" and other data coming in at high speed? At what level does data get stored - i.e., if you have a data historian, do you have a "raw" tag and a "summary" tag? How do you mesh the two?
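To illustrate the evenly-spaced-intervals problem, here is a small sketch of re-gridding uneven samples onto a regular interval using sample-and-hold (previous value) interpolation. This is purely illustrative plain Python, not how any particular historian or BI tool does it, but it shows the kind of work that has to happen before "stable" summarized data and high-speed raw data can sit in the same table:

```python
def resample_previous(points, start, end, step):
    """Re-grid unevenly spaced (timestamp, value) samples onto an even
    interval by holding the previous value until a new sample arrives."""
    points = sorted(points)
    out = []
    i = 0
    last = None
    t = start
    while t <= end:
        # Advance past every raw sample at or before this grid time.
        while i < len(points) and points[i][0] <= t:
            last = points[i][1]
            i += 1
        out.append((t, last))
        t += step
    return out

# Uneven data: dense while the signal moves, sparse while it is "stable".
raw = [(0, 20.0), (2, 20.5), (3, 21.0), (15, 21.1)]
print(resample_previous(raw, start=0, end=15, step=5))
# [(0, 20.0), (5, 21.0), (10, 21.0), (15, 21.1)]
```

Note what the even grid throws away: the readings at t=2 and t=3 are gone, which is fine for a trend chart but may not be fine for every analysis downstream.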
For those of us who work with the PI System, do we configure two tags at two different scan intervals, or do we load every sensor down with several PI analytics that check for stability and summarize the data when they find it? To me, what the author was getting at is that proper data compression is important. However, someone has to DEFINE stability, and define what "not losing any information" means for each signal, because it is likely quite different.
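As a rough sketch of what compression does, here is a simplified deadband (exception-style) filter: only store a reading when it moves more than a configured amount away from the last stored value. This is my own minimal illustration, not the PI System's actual algorithm (PI uses swinging-door compression, which also accounts for slope), and the deadband value is again a per-signal decision:

```python
def deadband_compress(samples, deadband):
    """Simplified exception-style compression: keep a (timestamp, value)
    sample only when it differs from the last stored value by more than
    the deadband. Not PI's swinging-door algorithm - an illustration only."""
    if not samples:
        return []
    stored = [samples[0]]
    for t, v in samples[1:]:
        if abs(v - stored[-1][1]) > deadband:
            stored.append((t, v))
    # Always keep the final sample so the signal's end state survives.
    if stored[-1] != samples[-1]:
        stored.append(samples[-1])
    return stored

readings = [(0, 3.11), (15, 3.11), (30, 3.11), (45, 3.09), (60, 3.09)]
print(deadband_compress(readings, deadband=0.01))
# [(0, 3.11), (45, 3.09), (60, 3.09)]
```

With a 0.01 deadband, the flat repeats are dropped but the meaningful 3.11 to 3.09 drop survives; set the deadband to 0.05 and that drop disappears from the archive entirely. That is the "someone has to DEFINE stability" problem in one parameter.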
Here is what a signal might look like under the above approach (the diamonds are the stored data points, overlaid on the noisier-looking raw signal). How would we store this signal in a data lake? The stored data is definitely not evenly spaced, and does it even meet the stability requirements for that signal? Who knows...
The whole statement reeks of someone who says "all you have to do is..." without fully understanding HOW to do "all you have to do is..." This harkens back to my post about data quality - it is really that important.
You have to solve a business problem
The approach recommended at the end of the article is one that I have been using with the PI, Business Intelligence, and Machine Learning projects I have been working on for the last 3 years, and it is this (emphasis mine):
As Wray points out, this can prove disastrous in the long run. “CEOs should pause work on data lake projects under the ‘store everything’ model and instead focus on individual projects, and the analytics necessary to provide high value by assembling just the data needed for that workload. This is not a step to becoming a data lake, but instead a step toward creating actionable data that will solve your business problems today,” said Wray. “After completing a series of individual projects, you can look for what data is used most often and what data is a priority to keep versus deleting and then, you can create a corresponding repository that's more efficient and effective."
We have been doing smaller scale analytics and BI projects that teach us what works and what doesn't. We also learn what could scale and what might have performance issues as we scale it. This approach is less risky, yet still with high potential upside.
As with ANY data project, you have to look at the benefit-cost ratio and get a clear picture of what that ratio could be. I typically look for a 2:1 to 4:1 benefit-cost ratio. We often realize that we can't capture 100% of the value, and that the project requires more effort than we all thought at the outset. We often discover things that need to be done or looked at that we never would have anticipated at the beginning of the project, and these items drive up the complexity and cost. So, my take is that if we get half the value and the project costs twice as much as originally intended, the hurdle rate still needs to be good enough that people want to go forward with more projects.
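The arithmetic behind that rule of thumb is worth making explicit. A small sketch (the function and the slippage factors are my own illustration of the "half the value, twice the cost" scenario described above):

```python
def realized_ratio(expected_benefit, expected_cost,
                   benefit_factor=0.5, cost_factor=2.0):
    """If we only capture half the expected value and the cost doubles,
    what does the benefit-cost ratio shrink to?"""
    return (expected_benefit * benefit_factor) / (expected_cost * cost_factor)

# A project planned at 4:1 still breaks even after the usual slippage;
# one planned at 2:1 drops to 0.5:1 and loses money.
print(realized_ratio(4.0, 1.0))  # 1.0
print(realized_ratio(2.0, 1.0))  # 0.5
```

That is why the planned ratio has to sit well above 1:1 at the outset: the slippage is applied to both the numerator and the denominator.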
I think another benefit of doing smaller-scale projects is that they give a truer sense of the data quality in an organization - it is easier to get a handle on how big or small the problem really is. I suspect that data quality is the number one issue that impedes the value of data analytics projects.
A final thought
So, are data lakes evil? No, but as the article eloquently points out, data lakes should be approached with a healthy dose of skepticism. Goals and metrics should be clearly defined, and smaller projects that potentially have high value should be tackled first. From those learnings, the project(s) can then be scaled up.