Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Use Surrogate Key in Data Warehouse

Posted by Dylan Wan on August 31, 2017

Using surrogate key is part of dimensional modeling technique for populating a data warehouse using a relational database.

The original idea was to generate the sequence generated IDs and use them in between the fact and dimension table, so we can avoid using the concatenated string or using composite key to join.  Also, due to the size of the column, it becomes a compression of the data and thus performs better.

However, maintaining the surrogate keys itself has actually a big portion of the ETL process:

Introduce the task dependency between fact and dimension

Since the fact table has a FK to the dimension and the ID value stored in the fact table needs to match the surrogate key value in the dimension table, the dimension table always has to be populated first.

Populating dimension becomes a precedence task to the job of populating the fact table.  This type of Finish-Start relationship makes the ETL job becomes longer and inflexible.

FK Lookup

FK lookup is the task of using the source unique key to get the data warehouse generated surrogate key.  It is the logic that we typically maintain in the ETL task for fact tables.

This task is a major ETL tuning effort.

Informatica allows us to use connected lookup, disconnected lookup, joining to the dimension table as different approaches for addressing the same requirement.  The benefit is the choices provided so depending on different situations, different technical approach can be used.

However, switching among different approaches requires coding, not just configuration.

Tuning the ETL is not just like creating index, adding database hint, but involve re-coding.

Full Refresh may be required when brew away a dimension table

Changes are not uncommon in the source system.  During development phase, the logic of populating dimension may change due to the design change or bug fixing.

If you decide to do a full load on a dimension table, the surrogate key will be regenerated.  Regenerating a surrogate key means that FK IDs stored in the fact table becomes invalid.


Some ideas…

If a data warehouse is not built based on relational database which relies on Join to get the data, using a surrogate key is actually not necessary.   Eliminating the surrogate key generation and the FK lookup can reduce the time in developing the ETL and reduce the duration of running the ETL.  It can provide  more flexibility for the data warehouse to adopt the changes.

However, we should not overlook the function of data integrity check  through this process.  The key generation is actually goes hand in hand with duplicate check.  The process of FK lookup can actually detect the missing FK.  We may still need to develop alternate solution for data validation if we are not using surrogate keys.





Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s