BI without a data warehouse
Posted by Dylan Wan on September 18, 2016
When we think of a data warehouse, we typically think of a database that stores the data, a schema we need to design, and an ETL program to populate the database. If we take the “database” out of that description, can it still be called a data warehouse?
I think that a place to capture the snapshots and a place to consolidate the data is still required. However, whether a database, or specifically a relational database, is required is very questionable.
The concept of a BI solution without a data warehouse is not about reporting against the source system. Minimizing the impact on the source system is still desirable. A solution like OTBI is almost unacceptable from the OLTP perspective unless there is some way to separate the load of the BI queries from the transaction processing.
BI without a data warehouse does not necessarily mean real-time BI. Actually, the real-time requirement is really about latency. Traditional BI with a data warehouse requires an ETL process to move the data. Due to the nature of SQL processing, which pulls the data from the source, it is almost always a batch process. The real breakthrough should come from getting the (incremental) data in much smaller batches, much more frequently.
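One way to sketch this idea of small, frequent incremental batches is a watermark-based extract. This is an illustrative sketch, not any specific product's mechanism; the `updated_at` column, `extract_incremental` function, and sample rows are all assumptions for the example.

```python
from datetime import datetime

def extract_incremental(source_rows, last_watermark):
    """Pull only rows changed since the last watermark (a micro-batch)."""
    batch = [r for r in source_rows if r["updated_at"] > last_watermark]
    # Advance the watermark so the next small batch starts where this one ended.
    new_watermark = max((r["updated_at"] for r in batch), default=last_watermark)
    return batch, new_watermark

# Illustrative source rows; in practice this would be a query on the OLTP side.
rows = [
    {"id": 1, "updated_at": datetime(2016, 9, 18, 10, 0)},
    {"id": 2, "updated_at": datetime(2016, 9, 18, 10, 5)},
    {"id": 3, "updated_at": datetime(2016, 9, 18, 10, 9)},
]
batch, wm = extract_incremental(rows, datetime(2016, 9, 18, 10, 2))
```

Run frequently with a small window, this replaces one large nightly pull with many tiny ones, which is what lowers the latency.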
How is a BI tool without a data warehouse better?
1. Utilize the memory
Querying the data directly from memory runs much faster than querying it from a database, because you avoid accessing the disk to get the data.
This is like querying from cache.
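A minimal sketch of the point: once the fact rows are already held in memory, an aggregation is a single pass over in-memory objects with no disk I/O at all. The data and field names here are made up for illustration.

```python
from collections import defaultdict

# Fact rows already resident in memory (illustrative data).
sales = [
    {"region": "East", "amount": 100},
    {"region": "West", "amount": 250},
    {"region": "East", "amount": 50},
]

# Equivalent of SELECT region, SUM(amount) ... GROUP BY region,
# computed in one in-memory pass, no disk access.
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["amount"]
```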
2. Data Structure
When we prepare the data in a data warehouse, we purposely follow the star schema design so that a relational database engine can query it faster. However, joining the data, and creating and using indexes, whether bitmap or B-tree, are still necessary at run time.
With an advanced design of the data structure, the query can run much faster.
This is not like querying a cache, because a cache cannot adapt to the dynamics of different queries.
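One common example of such a data structure (my assumption of what is meant here, not the author's stated design) is a columnar layout: storing each attribute as its own array lets a query scan only the columns it needs instead of whole rows, and it serves any filter or aggregation, not just the one a cache was built for.

```python
# Row-store representation (illustrative data).
rows = [
    ("East", "Widget", 100),
    ("West", "Gadget", 250),
    ("East", "Widget", 50),
]

# Column-store representation: one array per attribute.
region, product, amount = map(list, zip(*rows))

# SELECT SUM(amount) WHERE region = 'East' touches only two columns;
# the product column is never read.
east_total = sum(a for r, a in zip(region, amount) if r == "East")
```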
3. Write only, no update, during ETL Extract
An update has a higher cost because it needs to locate the record on disk before writing on top of it.
Typically the staging table works this way.
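A minimal sketch of the write-only pattern: during extract, records are only appended, never located and updated in place; resolving the latest version of a key is deferred to read time. The structure and field names are illustrative assumptions.

```python
# Append-only staging area: no lookup, no in-place update during extract.
staging = []

def stage(record):
    staging.append(record)  # pure append; nothing is overwritten

stage({"id": 1, "amount": 100})
stage({"id": 1, "amount": 120})  # a newer version of id 1 is appended too

# Downstream, the latest version per key is resolved at read time.
latest = {}
for rec in staging:
    latest[rec["id"]] = rec  # later appends win
```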
4. Write to a file instead of database for extracted data
Writing into a database incurs more overhead.
Cassandra also writes to a log file to achieve its low latency.
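The commit-log style of write can be sketched as appending one record per line to a flat file: sequential appends only, no seeking, no index maintenance. The file layout (JSON lines) is an assumption for illustration, not Cassandra's actual on-disk format.

```python
import json
import os
import tempfile

# A fresh log file for the sketch.
fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)

# Extracted records are appended sequentially, one JSON line each.
with open(path, "a") as f:
    for rec in [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]:
        f.write(json.dumps(rec) + "\n")  # sequential append, no seek

# Reading the log back recovers the extracted records in order.
with open(path) as f:
    records = [json.loads(line) for line in f]
```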
5. Write the target data into memory directly for preparing BI queries
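One reading of this step (a sketch under my own assumptions, not the author's implementation): after transformation, the target rows go straight into an in-memory structure that is already keyed for the BI queries, skipping the database load entirely.

```python
# Transformed target rows (illustrative data), kept in memory.
target = [
    {"region": "East", "amount": 100},
    {"region": "West", "amount": 250},
]

# Build a query-ready index in memory as part of the load.
by_region = {}
for row in target:
    by_region.setdefault(row["region"], []).append(row)

# A BI query then becomes a dictionary lookup plus a sum.
east_total = sum(r["amount"] for r in by_region.get("East", []))
```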
6. Processing the data directly without using the database technologies
Transformation cannot be totally avoided. However, why move the data into a database just to perform the transformation? The database engine is not the only engine that can run SQL. It is still convenient if we are using a database-based tool like ODI; otherwise, Apache Spark is actually much more powerful and faster.
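The transformation can be expressed in the functional style Spark encourages; the sketch below uses plain Python `map`/`filter`/`reduce` to stay self-contained (in Spark these would be the analogous transformations on an RDD or DataFrame). The data, field names, and cleaning rules are illustrative assumptions.

```python
from functools import reduce

# Extracted records, as raw strings (illustrative data).
extracted = [
    {"id": 1, "amount": "100", "status": "OK"},
    {"id": 2, "amount": "250", "status": "VOID"},
    {"id": 3, "amount": "50",  "status": "OK"},
]

# Transform outside any database: cast types, filter bad rows, aggregate.
cleaned = map(lambda r: {**r, "amount": int(r["amount"])}, extracted)
valid = filter(lambda r: r["status"] == "OK", cleaned)
total = reduce(lambda acc, r: acc + r["amount"], valid, 0)
```

No SQL engine is involved at any point; the same pipeline shape scales out when expressed in Spark.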
(not yet finished, to be continued)