Schema On Demand and Extensibility
Posted by Dylan Wan on September 3, 2015
Today I saw quite an impressive demo at the Global Big Data Conference.
AtScale provides a BI metadata tool for data stored in Hadoop.
At first, I thought this was just another BI tool that accesses Hadoop via Hive, like what we have in OBIEE. I have heard that SQL performance for BI queries over Hive can be very slow. The typical issue is that when a query involves joins, Hive may translate the SQL join into MapReduce code. Doing the join this way may not be as efficient as doing it in an RDBMS.
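To make the cost of such a join concrete, here is a minimal sketch of a reduce-side join in plain Python. It is a toy illustration of the general technique, not the code Hive actually generates, and Hive can also choose other strategies such as map-side joins. The table layout and the names `map_phase`, `shuffle`, `reduce_phase`, and `product_id` are assumptions made up for this example.

```python
from collections import defaultdict

def map_phase(facts, dims):
    """Tag every record with its source table and emit it under the join key."""
    for f in facts:
        yield f["product_id"], ("fact", f)
    for d in dims:
        yield d["product_id"], ("dim", d)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every fact row with every matching dimension row."""
    for key in sorted(groups):
        dim_rows = [row for tag, row in groups[key] if tag == "dim"]
        fact_rows = [row for tag, row in groups[key] if tag == "fact"]
        for f in fact_rows:
            for d in dim_rows:
                yield {**f, **d}

def reduce_side_join(facts, dims):
    return list(reduce_phase(shuffle(map_phase(facts, dims))))

# Hypothetical sample data
facts = [
    {"product_id": 1, "amount": 10},
    {"product_id": 2, "amount": 5},
    {"product_id": 1, "amount": 7},
]
dims = [
    {"product_id": 1, "name": "A"},
    {"product_id": 2, "name": "B"},
]
joined = reduce_side_join(facts, dims)
```

Every record from both tables has to be emitted, shuffled by key, and regrouped before a single joined row comes out, which is the overhead an RDBMS avoids with indexes and in-memory join algorithms.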
However, the concept here is actually very different. Traditionally, ROLAP is built on a relational database, and we use relational joins between the fact table and the dimension tables. With a tool like Endeca, which Oracle acquired, we already saw the data modeling principles change. Endeca does not model data in a star schema. It simply denormalizes the dimension data into the fact table, and can thus run queries fast. AtScale seems to be doing exactly the same thing. When the data is stored in the Hadoop cluster, it is not normalized by separating it into facts and dimensions. It is simply stored as it is in the source, with the dimension data duplicated into the fact. There is really no join here. The closest design technique in OBIEE I can think of is the degenerate dimension approach. However, will it work when using Hadoop as a source?
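The difference between the two modeling styles can be sketched in a few lines of Python. This is only an illustration under assumed data: the tables `fact_sales` and `dim_product`, and the function names, are hypothetical and do not come from Endeca or AtScale.

```python
from collections import defaultdict

# Star schema: the fact rows carry only a key into the dimension table.
dim_product = {1: {"category": "Books"}, 2: {"category": "Games"}}
fact_sales = [
    {"product_id": 1, "amount": 10.0},
    {"product_id": 2, "amount": 5.0},
    {"product_id": 1, "amount": 7.0},
]

def revenue_by_category_star(facts, dim):
    """Aggregate with a lookup (the join) for every fact row."""
    totals = defaultdict(float)
    for row in facts:
        category = dim[row["product_id"]]["category"]  # join step
        totals[category] += row["amount"]
    return dict(totals)

def denormalize(facts, dim):
    """Copy the dimension attributes into each fact row once, at load time."""
    return [{**row, **dim[row["product_id"]]} for row in facts]

def revenue_by_category_flat(flat_rows):
    """Aggregate directly; no join is needed at query time."""
    totals = defaultdict(float)
    for row in flat_rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)

flat_sales = denormalize(fact_sales, dim_product)
star_result = revenue_by_category_star(fact_sales, dim_product)
flat_result = revenue_by_category_flat(flat_sales)
```

Both queries return the same totals. The denormalized form pays for it with duplicated dimension values in every row, which is a trade that is usually acceptable when storage is cheap and joins are expensive.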
What really impressed me is the concept of Schema on Demand. I feel that this is actually the major challenge for ROLAP and relational database technology. When we model potential additional attributes, we have to add placeholder columns to the relational table. However, in data storage and database technologies that store attributes as key-value pairs or as a map, the data does not have to be stored as columns. This is actually nothing new. Oracle Database has supported VARRAY since Oracle 8. However, no BI tool I am aware of supports this Oracle object type. While Oracle Database has moved beyond supporting only relational tables, BI tools still assume that they only need to support relational tables.
It seems that AtScale has solved this challenge by generating metadata that can perform the attribute-map-to-column transformation. I guess that we will start to see these big data technologies move into the traditional BI tool space. That is not because of the 3 Vs of big data; it is because of the flexibility.
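A rough sketch of what an attribute-map-to-column transformation could look like is shown here in Python. It is a guess at the general idea, not AtScale's actual mechanism: the record layout, the `attrs` field, and the functions `discover_columns` and `project` are all hypothetical.

```python
def discover_columns(records, map_field="attrs"):
    """Scan the key-value maps and list every attribute name, in first-seen order."""
    columns = []
    for rec in records:
        for key in rec.get(map_field, {}):
            if key not in columns:
                columns.append(key)
    return columns

def project(records, wanted, map_field="attrs"):
    """Expose the requested attributes as ordinary columns, filling gaps with None."""
    rows = []
    for rec in records:
        attrs = rec.get(map_field, {})
        row = {k: v for k, v in rec.items() if k != map_field}
        for col in wanted:
            row[col] = attrs.get(col)
        rows.append(row)
    return rows

# Hypothetical records: the second one carries an attribute the first lacks.
records = [
    {"order_id": 1, "attrs": {"color": "red"}},
    {"order_id": 2, "attrs": {"color": "blue", "size": "L"}},
]
columns = discover_columns(records)
table = project(records, columns)
```

A new attribute such as `size` shows up as a column as soon as one record carries it, with no placeholder column reserved in advance. That is the flexibility a fixed relational table cannot offer without a schema change.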