Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

How to – Incremental ETL

Posted by Dylan Wan on September 6, 2017

This is a very basic topic.  An ETL 101 question come up a lot in interview.

Even we are moving to a different storage and different processing framework, the concepts are still important.

The idea is simple – you do not need to keep extracting and updating all data in the data store that are used for analytics and reporting.  Only the newly created or changed data are needed.   When the “Delta” arrive, we need to identify what are the impacts to the existing data, which can be either Adding or Modifying.

Deletion is a different topic.  We need to identify what are deleted from the source and we need to reflect the change, which can be either marking  them as deleted or removing it from the data store so when the query is issued, the deleted data won’t appear

  1. Define key in the target table
    Knowing what the key columns are for determining a unique row is important.
    If the source has a composite key, the target may or may not have composite key depending on whether we concatenate these values.
    Whether the key will be registered in the database depending on whether we would like to enforce the business rule using database constraint and whether we will use database level service for UPSERT.
  2. Use the UPSERT logic
    Most database support MERGE statement.
    MySQL support ON DUPLICATE KEY UPDATE
    Use a Target Lookup to check if the row already exists and use the UPDATE strategy to determine if INSERT or UPDATE the existing row.
    If Update is expensive, skip the unnecessary updates by comparing the values. For example, comparing the source last update column from the source table with the value stored in the target table.
  3. Apply incremental filters on the source tables
    1. If there are more than one source tables, all source table should have the incremental filter.  Basically, any change to the source table will generate delta in the joined result set.  One simple rule is to based on the columns in the SELECT statement and ask yourselves:  Can the value of this column be change in the source database?  Do we need to update the target database with the latest value?
    2. In some cases, the joined table is for determining if a record is applicable. The last updated date from the joined table should be considered even though none of columns presented in the SELECT statement.
    3. Logically speaking, the logic is to use OR condition to connect the different incremental filters.  SELECT the key and UNION the key value in a subquery . can be an alternate.  This is for performance tuning.

 

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s