Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Lessons Learned – Data Warehouse on Cloud – Amazon Redshift

Posted by Dylan Wan on September 22, 2015

I feel that these rules are applicable to any cloud-based data warehouse solution.  In general, on-premise data warehouse deployments will probably remain for a while.

1. For a columnar database, “select *” is bad

I think that the projection needs to be done as early as possible and should be pushed down.

If a column is not needed in the downstream flow, it should not be selected in the first place.

If the application logic is defined in the metadata, the tool should read it and generate the extract logic.
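As a minimal sketch of the projection point (the sales_fact table and its columns below are hypothetical), compare the two forms:

-- Bad for a columnar store: every column's blocks must be read.
SELECT * FROM sales_fact;

-- Better: only the blocks of the projected columns are scanned.
SELECT order_id, customer_id, sale_amount
FROM sales_fact;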

2. Pick the right index / partition key that is optimized for the storage

This relates to data warehouse tuning.  Amazon Redshift stores data by column, in blocks, and keeps min/max metadata for each block.  The sort key is critical because Redshift can skip whole blocks when the sort key appears in a range-based filter.
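For example, here is a sketch of a Redshift table sorted by date, reusing the hypothetical sales_fact table from above:

-- Sorting by sale_date lets Redshift consult per-block min/max
-- metadata and skip blocks that cannot match a date-range filter.
CREATE TABLE sales_fact (
  order_id     BIGINT,
  customer_id  BIGINT,
  sale_date    DATE,
  sale_amount  DECIMAL(18,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

-- A range filter on the sort key can skip irrelevant blocks.
SELECT SUM(sale_amount)
FROM sales_fact
WHERE sale_date BETWEEN '2015-01-01' AND '2015-03-31';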

In Oracle Database, the analogous concept is the partition key.  If we choose a date column as the partition key, then when that column is used in a query filter, the irrelevant partitions can be pruned.
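The Oracle counterpart, again as a sketch with the same hypothetical table, would be range partitioning:

-- Range-partitioning by the date column allows the optimizer to
-- prune partitions that the filter rules out.
CREATE TABLE sales_fact (
  order_id     NUMBER,
  customer_id  NUMBER,
  sale_date    DATE,
  sale_amount  NUMBER(18,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2015q1 VALUES LESS THAN (DATE '2015-04-01'),
  PARTITION p2015q2 VALUES LESS THAN (DATE '2015-07-01'),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);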

3. Deploying the BI tool on the cloud together with the data warehouse

Amazon Redshift supports JDBC / ODBC access.  This seems to suggest that the user can deploy a BI tool locally and access the data in the cloud-based data warehouse.

I think that end users should just interact with BI as a service.  ‘Data Warehouse as a service’ and ‘BI as a service’ seem different, and Amazon seems to provide only ‘Data Warehouse as a service’.  Querying a cloud-based warehouse from a locally deployed BI tool will of course be slow.

4. Aggregate the data

It is a conventional data warehouse best practice.

However, it is not always needed with a modern BI tool, as many BI tools manage aggregates for you.

I feel that the major issue in the Amazon Redshift architecture is that it supports a conventional data warehouse without providing a proper BI tool on top.
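For reference, here is a sketch of a pre-built aggregate in Redshift SQL, again using the hypothetical sales_fact table:

-- Roll daily detail up to month/customer grain so that dashboard
-- queries scan far fewer rows than the base fact table.
CREATE TABLE sales_month_agg
SORTKEY (sale_month)
AS
SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       customer_id,
       SUM(sale_amount) AS total_amount,
       COUNT(*)         AS order_count
FROM sales_fact
GROUP BY 1, 2;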

5. Filter as early as possible in ETL and push the filter down closer to the source in queries

If you are using Oracle Database, the join order and filter order are determined by the optimizer.

In big data technologies, many queries are written in lower-level languages against a query API.  The optimization is the responsibility of the developers, not of a database management tool.

The Spark-Cassandra connector has an optimization that pushes filters down from Spark to Cassandra.  It follows the same principle.
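In plain SQL, with the hypothetical sales_fact table and a hypothetical customer_dim table, the idea looks like this; a good optimizer rewrites the first form into the second, but a hand-coded ETL flow has to do it explicitly:

-- Filter written after the join:
SELECT c.customer_name, s.sale_amount
FROM sales_fact s
JOIN customer_dim c ON c.customer_id = s.customer_id
WHERE s.sale_date >= '2015-09-01';

-- Filter pushed down, so fewer rows flow into the join:
SELECT c.customer_name, s.sale_amount
FROM (SELECT customer_id, sale_amount
      FROM sales_fact
      WHERE sale_date >= '2015-09-01') s
JOIN customer_dim c ON c.customer_id = s.customer_id;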

6. Configuring and tuning based on the # of concurrent users

I cannot believe that “Run one query at a time” can be a “best practice for Amazon Redshift”.  LOL

However, the cloud may enable more users to access the system, and the system needs to be prepared for that concurrent access.
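As a sketch of what tuning for concurrency looks like in Redshift (the queue setup itself lives in the cluster parameter group, not in SQL):

-- Inspect the current workload management (WLM) queue configuration;
-- service classes 1-4 are reserved for system use.
SELECT service_class, num_query_tasks, query_working_mem
FROM stv_wlm_service_class_config
WHERE service_class > 4;

-- Temporarily let this session's queries use 3 of the queue's slots.
SET wlm_query_slot_count TO 3;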

I like slide 46 from this SlideShare deck – Uses and Best Practices for Amazon Redshift

Many ideas were triggered by that slide.
