Dimension Hierarchy Support in Data Warehouse(2)
Posted by Dylan Wan on December 8, 2011
I am describing the column flattening table in this post.
The column flattening table is a typical data warehouse modeling pattern for addressing the dimension hierarchy requirements described in my last post Dimension Hierarchy Support in Data Warehouse (1).
This technique can be considered as a denormalization of the parent information into the child records.
Here are some of the rules we followed to populate in a column flattening data warehouse table:
1. Each record is representing a node that can have transaction data associated with it.
If this the product category hierarchy, each record is created for a product category.
If this is the employee reporting hierarchy, each record is an employee.
The surrogate key of this table will be the primary that can be joined from the fact tables.
The fact data can thus be associated with each of the nodes.
Typically, the ancestors also have separate rows. The reason is that the ancestors can have fact data associated with them. If the ancestors never have data directly associated with them, there is no need to create separate records for the ancestors.
2. Parent / Ancestor columns, one set of such columns for each level
I call these “level attribute columns”.
If this is the employee reporting hierarchy table, the employee name will be a good level attribute. If this table is created for product category hierarchy, the name of the product category will be a good level attribute.
The level attribute is the column that you would like to show on the report and would like to drill from it. When we drill down from one level to the next level, you would like to see the attribute shown in the next level.
Typically they are identifying attributes. They are unique or almost unique so you can use it to distinguish one from the other. For example, the employee name in the employee table is one of such attribute. Employee Number, if you are frequently use it directly, could be a candidate. The gender of the employee will not be considered as an identifying attribute. You can show them as level attribute but it will not be a very good attribute for the users to drill down.
You do not want to capture too many level attributes. The width of the table has significant impacts on the query performance. The level attribute columns times the number of levels you support plus other system columns and base level attribute will be the number of column supported in the column flattening table.
I feel that more than six columns should not be allowed.
3. The number of the levels
Column flattening table are denormalizing the parent attributes. It will have the limitation on how many columns can be supported as part of the table design. The number of levels supported is really depending on the given dimension. It is a case by case decision depending on the nature of the entity.
It is also related to the usability of the hierarchy. Unlimited number of levels is not reasonable since you are not able to use it properly. When you use the dimension hierarchy, you are adding one level at a time while drilling. How many column can be shown on the screen? Unless the UI design allow you to view each level relatively, the parent / ancestor will be on your screen.
The number of levels will increase the overhead in the BI query as well as in the data warehouse ETL process. It will be unwise to create a huge numbers of level to the column flattening table. It is unnecessary cost to those people who do not have such big number of levels.
4. Populate the level columns from the top to the bottom
The level attribute columns are populated from top to the bottom.
This is related to how the table is being by the BI tool.
As mentioned in my last post, the assumption is that the user would like to view from the top or higher level and drill down to see more details. The data populated in the top column are the top most node in your hierarchy. The second level from the top is populated in the 2nd column from the top column.
This rule is best described using examples. Let’s say that you have an hierarchy with four nodes:
A is a parent of B and B is a parent of C. B is also a parent of C.
You will have four records:
In all four records, the top will be populated with A
In all four records, the 2nd level from the top will be populated with B except the top record.
A: A > A
B: A > B
C: A > B
D: A > B
When we reach the next level, some data warehouse design, such as Oracle BI Apps, follows a padding process. The idea is that you can see the data at any level. The data should still have proper break down.
The leave node will be repeated. So you can see the next level as
A: A > A> A
B: A > B > B
C: A > B > C
D: A > B > D
Although the hierarchy only have 3 level, but data warehouse table design may have more level columns that can support the hierarchy with the deeper hierarchy. The next level column will be populated as:
A: A > A > A > A
B: A > B > B > B
C: A > B > C > C
D: A > B > D > D