Aggregation for Data Warehouse, Part 3
Posted by Dylan Wan on May 1, 2007
This is my 3rd post related my series of postings on various techniques (see part 1) for providing aggregations in a data warehouse. I mentioned in part 2 that many ETL tools provide the support of creating, populating, and maintaining the aggregation tables. Today, I am describing the 2nd method – manually write the summarization process in a programming language.
There are several situations that you may need to code your own summarization routine when using a ETL tool is not preferred:
- You cannot afford to buy a ETL tool.
- Your summarization job is very simple and using a ETL tool is overkilled.
- Your summarization job is too complex to be handed in the ETL tools available to you.
- You have specific needs that cannot be addressed by the ETL tools available to you.
I will skip #1 and #2 and discuss some possibilities of #3 and #4.
One possibility that you need to use a programming language to code your ETL process is that you would like to use some specific performance or scalability features available from your database and the new syntax is not supported by the ETL. For example, when the Oracle database first introduced the new bulk load features in their SQL and PL/SQL. There were few ETL tools can support it at that time. You do not have much choice but code the process by yourself.
Sometime you may feel that it is easier to code the complex logic in a programming language you are familiar then playing the logic in the ETL tool. For example, project XYZ is a two year project started in Feb last year. We already have weekly base summary for project cost data. We can aggregate the inception to date (ITD) project cost information using the following two approaches:
Approach 1: Sum up the weekly data
Project ITD Cost on April, 20, 2007 = week 1 (Jan 29 – Feb 4) + week 2 (Feb 5 – Feb 10 + week 3 + … + week n (Apr 15- Apr 20)
Approach 2: Prior Year Summary + Prior Month Summary + Prior Week Summary
Project ITD Cost on April, 28, 2007 = Year 2006 + Jan 2007 + Feb 2007 + March 2007 + Week (Apr 1- Apr 7 ) + Week (Apr 8 – Apr 14) + Week ( Apr 15 – Apr 20)
In the 2nd approac, the volume of data is much reduced. It may be easier to write such logic in a programming language.