Use Smart Key for the Date Dimension
The Date dimension table stores the individual dates that can be the reference from the Fact table. Ralph Kimball touches this topic in the following design tips:
- Kimball Design Tip #2: Multiple Time Stamps
- Kimball Design Tip #5: Surrogate Keys for the Time Dimension
- Kimball Design Tip #51: Latest Thinking on Time Dimension Table
- Kimball Design Tip #85: Smart Date Keys to Partition Fact Tables
He strongly suggests the use of a surrogate key for the Date dimension. He also suggested that the date dimension should have every discrete day and populate with all the attributes that nowaday can be found by database function.
He does not like the design of using the SQL timestamp. He feels that the approach cannot deal the case such as
- Some fact tables may contain “unknown” date
- Some fact tables may want to have “have not happen yet” date
I cannot totally agree with him. My observation is that many database or software actually store the date information as a numeric value internally. I feel that we can use a smart key that is very close to the surrogate key.
In Excel, if you take out the date format, you will find that the date value is actually stored as a numeric value as the number of days since “January 1, 1900″. For example 39070 is actually the date “12/19/2006″
In Oracle Database, you can very easily convert any date as the Julian day number using the ‘J’ as the date format:
SELECT TO_CHAR(TO_DATE('01-01-1997', 'MM-DD-YYYY'),'J') FROM DUAL; TO_CHAR -------- 2450450
The beauty of these approaches is that in the ETL program, you do not need hit the data warehouse and join the Date dimension table to get the right foreign key for stamping the fact table. This is particular useful in a heterogeneous environment, where the source data can come from various system. You can also have the flexibility to load the Date dimension after loading the Fact tables.
As Ralph Kimball mentioned, one of the benefits of using a surrogate key that has a correct sequence is that the fact table can be physically partitioned on the base of the sequence of the key column. I think that using a smart key like the Excel numeric representation of a date or the Julian day can provide the same benefit.
Storing the numeric key will take less then space then the “YYYYMMDD” format. Also you can also use the range scan, apply the plus or minus calculation on the day column.