Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Incremental ETL : Streaming via Micro-Batch

Posted by Dylan Wan on October 11, 2017

A modern analytic application streams data to perform much the same process as traditional data warehousing incremental ETL.

Actually, if we look into Spark Streaming in detail, the concepts behind streaming in Spark and incremental ETL are the same:

Spark Streaming is a micro-batch-based streaming engine.

Each micro-batch is much like getting the delta from the source since the last refresh.

The difference is that the last refresh time may be just half a minute ago.
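The idea can be sketched in a few lines of plain Python. This is an illustrative analogue, not Spark code: `extract_delta`, the row shape, and the `updated_at` column are all hypothetical names standing in for whatever change-tracking column a real source exposes.

```python
from datetime import datetime

# Hypothetical incremental extract: each micro-batch pulls only the rows
# that changed after the last refresh timestamp, exactly like a classic
# data warehouse incremental ETL step.
def extract_delta(rows, last_refresh):
    """Return the rows modified after the last refresh timestamp."""
    return [r for r in rows if r["updated_at"] > last_refresh]

rows = [
    {"id": 1, "updated_at": datetime(2017, 10, 11, 9, 0, 0)},
    {"id": 2, "updated_at": datetime(2017, 10, 11, 9, 0, 40)},
]
# With a "last refresh" only 30 seconds ago, only row 2 is new.
last_refresh = datetime(2017, 10, 11, 9, 0, 30)
delta = extract_delta(rows, last_refresh)
```

The only thing that makes this "streaming" rather than batch ETL is how small the interval between refreshes is.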

Trigger — How Frequently to Check Sources For New Data

Spark Streaming uses the Trigger class to set the interval, i.e., how often the incremental ETL will be scheduled.
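A processing-time trigger boils down to a loop like the sketch below. This is a simplified pure-Python illustration of the scheduling behavior, not Spark's actual implementation; `run_with_trigger` and `batch_fn` are invented names.

```python
import time

# Minimal sketch of a processing-time trigger: run one micro-batch, then
# sleep for whatever remains of the interval before starting the next one.
def run_with_trigger(batch_fn, interval_secs, num_batches):
    results = []
    for _ in range(num_batches):
        start = time.monotonic()
        results.append(batch_fn())          # one incremental ETL run
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, interval_secs - elapsed))
    return results

counter = {"n": 0}
def one_batch():
    counter["n"] += 1
    return counter["n"]

out = run_with_trigger(one_batch, 0.01, 3)
```

In Structured Streaming itself, the equivalent is configured on the writer, e.g. `df.writeStream.trigger(processingTime="30 seconds")`.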

The ProgressReporter maintains information similar to an ETL load history table:

  • lastTriggerStartTimestamp
  • currentStatus
  • currentTriggerStartTimestamp
  • currentTriggerEndTimestamp
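To see why these fields play the role of a load history table, here is an illustrative analogue as a small record that rolls the timestamps forward per trigger. The class and methods are hypothetical; only the field names follow the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative analogue of one row of an ETL load history, carrying the
# same fields ProgressReporter tracks per trigger.
@dataclass
class TriggerHistory:
    lastTriggerStartTimestamp: Optional[int] = None
    currentTriggerStartTimestamp: Optional[int] = None
    currentTriggerEndTimestamp: Optional[int] = None
    currentStatus: str = "idle"

    def start_trigger(self, now_ms: int) -> None:
        # The previous trigger's start becomes the "last refresh" marker.
        self.lastTriggerStartTimestamp = self.currentTriggerStartTimestamp
        self.currentTriggerStartTimestamp = now_ms
        self.currentStatus = "running"

    def finish_trigger(self, now_ms: int) -> None:
        self.currentTriggerEndTimestamp = now_ms
        self.currentStatus = "idle"

h = TriggerHistory()
h.start_trigger(1000)
h.finish_trigger(1500)
h.start_trigger(2000)
```

After the second trigger starts, `lastTriggerStartTimestamp` holds the first trigger's start, which is exactly the "last refresh time" an incremental ETL would read from its load history.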

We can read the FileStreamSource to see how the incremental filter is applied. It tries to get all the files that have a timestamp later than the last timestamp. The comment in the source code explains it well:

Note that we are testing against lastPurgeTimestamp here so we'd never miss a file that is older than (latestTimestamp – maxAgeMs) but has not been purged yet.
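The check described in that comment can be sketched in Python (the real code is Scala); `is_new_file` and its parameters are illustrative names for the logic, not Spark's actual signatures.

```python
# A file is "new" when its timestamp is after lastPurgeTimestamp and its
# path has not been seen before. Testing against lastPurgeTimestamp rather
# than (latestTimestamp - maxAgeMs) means a file that is old but not yet
# purged from the seen-files map is never missed.
def is_new_file(path, file_ts, seen_files, last_purge_ts):
    return file_ts > last_purge_ts and path not in seen_files

seen_files = {"/data/a.json": 100}

# Newer file, never seen before: picked up.
new1 = is_new_file("/data/b.json", 150, seen_files, 90)
# Already-seen file: skipped even though its timestamp is recent enough.
new2 = is_new_file("/data/a.json", 100, seen_files, 90)
# Older than the purge watermark: skipped.
new3 = is_new_file("/data/c.json", 80, seen_files, 90)
```

So the incremental filter of the file source is again a "give me everything since the last high-water mark" check, the same pattern as a timestamp-based incremental ETL extract.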

