Lessons Learned – Data Warehouse on Cloud – Amazon Redshift
Posted by Dylan Wan on September 22, 2015
I feel that these rules are applicable to any cloud-based data warehouse solution. In general, I feel that on-premise data warehouse deployments will probably remain for a while.
1. For a columnar database, “select *” is bad
I think that projection needs to be done as early as possible and should be pushed down.
If a column is not needed in the downstream flow, it should not be selected in the first place.
If the application logic is defined in the metadata, the tool should read it and generate the extract logic.
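As a minimal sketch of the idea (the table and column names here are hypothetical), the difference is simply explicit projection versus `select *`:

```sql
-- Bad for a columnar store: forces a read of every column's blocks,
-- even columns the downstream flow never uses.
SELECT * FROM sales;

-- Better: only the blocks for the named columns are scanned.
SELECT order_id, order_date, amount
FROM sales;
```

In a columnar database the I/O cost is roughly proportional to the number of columns touched, so trimming the select list is one of the cheapest optimizations available.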
2. Pick the right index / partition key that is optimized for the storage
This relates to data warehouse tuning. Amazon Redshift stores data by column and stores the data in blocks, so the sort key is critical: when the sort key appears in a range-based filter, Redshift can skip the blocks that fall outside the range.
In Oracle Database, the similar concept is the partition key. If we choose a date column as the partition key, then when that column is used in a query filter, the irrelevant partitions can be skipped.
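A sketch of what this looks like in Redshift DDL, assuming a hypothetical sales table where queries usually filter on the order date:

```sql
-- Declare the date column as the sort key so that range filters
-- can skip blocks whose min/max values fall outside the range.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(12,2)
)
DISTKEY (order_id)
SORTKEY (order_date);

-- A range filter on the sort key lets Redshift prune blocks,
-- much like partition pruning in Oracle.
SELECT SUM(amount)
FROM sales
WHERE order_date BETWEEN '2015-01-01' AND '2015-03-31';
```

The choice of `order_date` here is an assumption; the sort key should match whichever column your workload actually filters on by range.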
3. Deploying the BI tool on the cloud together with the data warehouse
Amazon Redshift supports JDBC / ODBC access. This seems to suggest that users can deploy a BI tool locally and access the data in the cloud-based data warehouse.
I think that end users should just interact with BI as a service. ‘Data Warehouse as a service’ and ‘BI as a service’ seem different, and Amazon seems to provide only ‘Data Warehouse as a service’. Running an on-premise BI tool against a remote cloud warehouse will of course be slow.
4. Aggregate the data
It is a conventional data warehouse best practice.
However, it is not always needed with a modern BI tool, as many BI tools manage aggregation for you.
I feel that the major issue in its architecture is that Amazon Redshift supports a conventional data warehouse without a proper BI tool.
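For the conventional practice, a pre-aggregated summary table is a short sketch (again with hypothetical table and column names):

```sql
-- Build a monthly summary so dashboards hit a small aggregate
-- instead of scanning the detail-level fact table.
CREATE TABLE sales_monthly_agg AS
SELECT DATE_TRUNC('month', order_date) AS sales_month,
       product_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM sales
GROUP BY 1, 2;
```

A BI tool with aggregate awareness would maintain and route queries to such a table automatically; without one, this becomes an ETL job you own.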
5. Filter as early as possible in ETL and push down the filter to closer to source in Query
The join order and filter order are determined by the optimizer if you are using Oracle Database.
In big data technologies, many queries are written in lower-level languages against a query API. Optimization is then the responsibility of the developers, not the database management tool.
The Spark-to-Cassandra connector can push filters down from Spark to Cassandra; it follows the same principle.
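The same principle can be made explicit in plain SQL when the engine does not do it for you. A sketch, assuming hypothetical `sales` and `customers` tables:

```sql
-- Without pushdown: every sales row participates in the join,
-- and the filter is applied afterwards.
SELECT c.region, SUM(s.amount)
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.order_date >= '2015-01-01'
GROUP BY c.region;

-- With the filter pushed down: the inline view reduces the
-- sales rows (and columns) before the join happens.
SELECT c.region, SUM(s.amount)
FROM (SELECT customer_id, amount
      FROM sales
      WHERE order_date >= '2015-01-01') s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region;
```

A mature optimizer rewrites the first form into the second; in lower-level query APIs, the developer has to write the second form directly.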
6. Configuring and tuning based on the # of concurrent users
I cannot believe this “Run one query at a time” can be a “best practice for Amazon Redshift”. LOL
However, the cloud may enable more access, and the system needs to be prepared for concurrent access.
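Redshift exposes concurrency tuning through workload management (WLM). As a sketch (the slot counts here are assumptions about your WLM queue configuration), a session can claim extra slots for one heavy query and release them afterwards:

```sql
-- Claim more WLM slots so a heavy query gets more of the
-- queue's memory; other queries wait for free slots meanwhile.
SET wlm_query_slot_count TO 3;

-- ... run the heavy query here ...

-- Return to the default of one slot per query.
SET wlm_query_slot_count TO 1;
```

This is the practical middle ground between "run one query at a time" and leaving every query to fight over default settings.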
I like slide 46 from this slideshare – Uses and Best Practices for Amazon Redshift.
Many ideas were triggered by that slide.