Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Cloud Database and Cloud DataLake

Posted by Dylan Wan on June 15, 2022

The term DataLake was coined to describe a kind of data storage: after Hadoop and HDFS were introduced, you had a cheaper place to store your data without using a traditional database. By traditional, I mean an RDBMS, a relational database management system. Cheaper is not just about cost; it also means that much less effort is needed just to store the data. Now I see many vendors trying to mimic the RDBMS and make that line disappear.

One idea I do not really like is that DDL has come back. DDL refers to Data Definition Language. One such example is Snowflake. When I used Oracle, I had to create the table before I could start storing data in it. When I use HDFS, the magic word COPY TO does the job, and we can create an object without specifying the schema. This idea is important. The data comes first, and the schema used to describe the data structure comes later.
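To make "data first, schema later" concrete, here is a minimal sketch in plain Python. It is not HDFS or Snowflake code; the file name, the records, and the helper functions `store_raw` and `infer_schema` are all hypothetical and only illustrate the idea of storing data without declaring a structure and deriving the structure afterwards.

```python
import json
import tempfile
from pathlib import Path


def store_raw(records, path):
    """Write records as JSON lines; no schema is declared up front."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def infer_schema(path):
    """Derive a field -> set-of-type-names mapping from the stored data."""
    schema = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            for key, value in json.loads(line).items():
                schema.setdefault(key, set()).add(type(value).__name__)
    return schema


# Hypothetical records; the second one adds a field the first did not have.
records = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta", "region": "west"},
]

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "events.jsonl"
    store_raw(records, data_file)      # the data comes first
    schema = infer_schema(data_file)   # the schema comes later
```

Note that the second record introduces a new field without any ALTER TABLE step. With a DDL-first approach, that change would have to be declared before the row could be stored.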

However, the word cheaper has a new meaning too. When I use an RDBMS, the cost has to be paid up front. In Snowflake, we “pay as you go”, but it means we have to specify the resource and give it a profile before we run any compute.

The real value does not come from avoiding vendor lock-in as such, but from choosing which vendors we avoid being locked in to. Under Snowflake's assumption, we avoid lock-in to the cloud storage and cloud compute providers by having a software vendor that helps us manage the resources.

In 2022, this becomes important because we now have far fewer cloud providers to choose from. Actually, even with Snowflake, we are probably talking about a choice among only three vendors: AWS, GCP, and Azure.

Being vendor agnostic and having the compute resources managed for us is not free either. The price we pay is that we have to define these resources and think about them all the time, and we have to deal with the DDL.

It is not a DataLake solution for me, and I lost the freedom in exchange for getting an even cheaper solution. The story will go on: once we are locked in to Snowflake, I think we will find that it was never cheap.

Selections and alternatives will make things cheaper, not this type of software. I appreciate the existence of Snowflake, and it will force the vendors to provide better fine-grained control over storage and compute resources. It also tells me that a DataLake is not cheaper: even when I am free to make my COPY TO, I should also think about the storage and compute resources, and they are never free.
