Data Lake and Data Warehouse
Posted by Dylan Wan on April 7, 2017
This is an old topic, but I have learned more and come up with more perspectives over time.
- Raw Data vs Clean Data
- What kind of services are required?
- Data as a Service
- Analytics as a Service
Raw Data and Clean Data
I think that assuming you can use raw data directly is a dangerous thing. However, being able to use the data without waiting for it to be processed is also a requirement. In the past, raw data could not be brought into a usable place without significant impact on the source system, so raw data was not usable.
Because the cost of storage has dropped and data ingestion has advanced, data are now brought into a place where they can be consumed. That place is called a Data Lake.
Do we need clean data? Of course we do. Garbage in, garbage out.
“Clean data” can mean different things to different people. From a data quality perspective, it means deduplication, match and merge, standardization, formatting, extra coding, data enrichment, etc. A statistician or a data scientist wants tidy data. From a data warehousing perspective, we may follow the dimensional modeling technique to design the model and use ETL to transform and prepare the data.
I think that metadata management is as important as it ever was, if not more important.
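To make the data quality view concrete, here is a minimal sketch of standardization followed by deduplication. The record layout (name and email), the choice of email as the match key, and the function names are all assumptions made for illustration. A real match-and-merge process would use much richer rules.

```python
def standardize(record):
    """Normalize whitespace and case so equivalent values compare equal."""
    return {
        # Collapse repeated spaces and use title case (hypothetical rule).
        "name": " ".join(record["name"].split()).title(),
        # Trim and lowercase the email, which is the assumed match key.
        "email": record["email"].strip().lower(),
    }


def dedup(records):
    """Keep the first record seen for each standardized email."""
    seen = {}
    for rec in map(standardize, records):
        seen.setdefault(rec["email"], rec)
    return list(seen.values())


# Hypothetical raw records as they might arrive from a source system.
raw = [
    {"name": "  alice  smith ", "email": "Alice@Example.com "},
    {"name": "Alice Smith", "email": "alice@example.com"},
    {"name": "bob jones", "email": "BOB@example.com"},
]

clean = dedup(raw)
```

The first two raw records only differ in spacing and case, so after standardization they collapse into one.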
In the past, a professional designed and defined the metadata before the data could be used. BI tools use metadata to expose the presentation model to the users, so the users know what data are available and what the data look like. The BI tool uses the metadata to generate the queries. The source data for a BI tool need to be defined before the tool can use them. Data preparation is a professional job. Data from different systems may need to be extracted, transformed, and loaded into a place for the BI tool to consume. The ETL metadata describe how the data will flow.
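As a rough illustration of how a presentation model can drive query generation, here is a small sketch. The metadata layout, the table and column names, and the build_query helper are all hypothetical. No particular BI product works exactly this way.

```python
# Hypothetical presentation-model metadata: business names mapped to
# physical columns and aggregate expressions.
PRESENTATION_MODEL = {
    "table": "sales_fact",
    "dimensions": {"Region": "region_name", "Year": "fiscal_year"},
    "measures": {"Revenue": "SUM(revenue_amt)"},
}


def build_query(model, dimensions, measures):
    """Generate a SQL string from the business names the user picked."""
    dim_cols = [model["dimensions"][d] for d in dimensions]
    select = [f'{model["dimensions"][d]} AS "{d}"' for d in dimensions]
    select += [f'{model["measures"][m]} AS "{m}"' for m in measures]
    sql = f'SELECT {", ".join(select)} FROM {model["table"]}'
    if dim_cols:
        # Measures are aggregates, so group by every selected dimension.
        sql += f' GROUP BY {", ".join(dim_cols)}'
    return sql


query = build_query(PRESENTATION_MODEL, ["Region"], ["Revenue"])
```

The user only sees "Region" and "Revenue". The physical names and the aggregation rule live in the metadata.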
So, what do schema-less and schema on demand mean? I think that these terms are misleading, and that we should describe them in more detail. Do we need to have metadata? I think that the answer is yes. Do we always need the metadata to query data? We should not strictly require metadata before we can query. Do I need to define metadata before loading data? No, the metadata should be derived from the data. Should I always skip metadata? No, once the metadata exists, more metadata-layer processes are required. I think that schema inference is important, but a schema discovery and matching process may be even more important.
All the old requirements, such as sharing metadata, querying metadata, comparing, subsetting and patching, and programmatically manipulating it, are as important as they were before.
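The difference between inferring a schema and matching it against known metadata can be sketched as follows. This is a simplified illustration that assumes all values arrive as strings and that only three types exist. The function names and the sample columns are made up for the example.

```python
def infer_type(value):
    """Guess a type name from one string value."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"


def infer_schema(rows):
    """Derive {column: type} from the data, widening when values disagree."""
    rank = {"integer": 0, "float": 1, "string": 2}
    schema = {}
    for row in rows:
        for col, val in row.items():
            t = infer_type(val)
            if col not in schema or rank[t] > rank[schema[col]]:
                schema[col] = t
    return schema


def diff_schemas(known, inferred):
    """Match an inferred schema against metadata that already exists."""
    shared = set(known) & set(inferred)
    return {
        "added": sorted(set(inferred) - set(known)),
        "removed": sorted(set(known) - set(inferred)),
        "changed": {c: (known[c], inferred[c])
                    for c in sorted(shared) if known[c] != inferred[c]},
    }


# Hypothetical rows loaded without any schema defined up front.
rows = [
    {"id": "1", "amount": "10", "region": "west"},
    {"id": "2", "amount": "10.5", "region": "east"},
]
inferred = infer_schema(rows)

# Hypothetical metadata already registered for this data set.
known = {"id": "integer", "amount": "integer", "channel": "string"}
drift = diff_schemas(known, inferred)
```

Inference alone gives a schema for this one load. The comparison step is what lets the metadata be shared, checked, and patched over time.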
What services are required?