Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Is ETL still necessary?

Posted by Dylan Wan on January 22, 2019

ETL stands for Extract, Transform, and Load.

The existence of Extract and Load implies that the source data and the target data are stored separately, so you need to extract from the source and load the data into the target data store.

Extract and Load won’t go away as long as the data used for reporting is not stored in the same place as the data captured in the source system for transactional processing.

However, we sometimes use the term Replicate, which refers to simply copying the data from the source to the target without any transformation.

In a way, ETL is not necessary when the reporting solution runs directly against the transactional system.  In the old days, Oracle DBI running off Oracle E-Business Suite did not require ETL.  It did have batch programs for summarization, but it did not extract and load data.

ETL also seems to go away when a replication or backup based approach is used.  A solution such as Oracle GoldenGate can eliminate the need for separate Extract and Load steps, since it does both in one step.  Sometimes people call this ELT: Extract and Load first, then Transform.

The real question is how easily these traditional ETL programs can be developed and maintained.

The real question is whether an ETL tool and ETL programs running outside the BI tool are necessary.  The processes, at least the Transform process, won’t go away as long as the data needs to be prepared for reporting.

Are we able to get rid of the heavy ETL development time? Are we able to get rid of the separate ETL tool?  Are we able to stop maintaining those ETL programs? Definitely.



Posted in BI, Business Intelligence, ETL | Leave a Comment »

Is Star Schema necessary?

Posted by Dylan Wan on January 18, 2019

A star schema describes the data by fact and dimension.

From one angle, it is a data modeling technique for designing the data warehouse based on relational database technology.  In the old OLAP world, even though a cube also links to the dimensions that describe the measures, we typically won’t call that a star schema.

A star schema design is based on the assumption that joins can be expensive, so the dimension tables are created by denormalizing the data from their parents.  It is also based on the assumption that most business questions are like “How much did we sell this period, by salesperson, by product, by region, by customer category, by sales channel?”: slicing the data by the elements we can track performance against and act on.  On the technical side, it assumes that by keeping the dimensions normalized against the facts rather than duplicating them into every measure, performance can be optimized with techniques such as bitmap indexes and prebuilt aggregates, and that common dimensions make it possible to summarize data across subject areas.
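The “slice by dimension” pattern described above can be sketched with a tiny star schema.  This is my own illustration (table and column names are made up), using SQLite for a self-contained example:

```python
import sqlite3

# A minimal star schema: one fact table joined to one dimension table
# through a surrogate key, then sliced by a dimension attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                               (2, 'Gadget', 'Hardware'),
                               (3, 'License', 'Software');
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (3, 200.0), (1, 25.0);
""")

# "How much did we sell, by category" -- the fact is sliced by joining
# to the dimension and grouping on one of its attributes.
rows = conn.execute("""
SELECT d.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product d ON d.product_key = f.product_key
GROUP BY d.category
ORDER BY d.category
""").fetchall()
print(rows)  # [('Hardware', 175.0), ('Software', 200.0)]
```

The single-column integer join key and the grouping attribute on the dimension are exactly what bitmap indexes and prebuilt aggregates optimize for.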

So, is star schema design necessary?

If data storage and query generation change and we do not use a relational database, I feel that following the star schema design to prepare the data is definitely questionable.

On one extreme, if you don’t have to aggregate data because you just want to view the data in a list report, and you don’t have the requirement of viewing different subject areas side by side, a fully denormalized structure may be ideal.

Will the way of describing business questions using the terms “dimension” and “fact” go away?

For the time being, while the professional and functional people who live in the traditional BI world are still in place, I think the separation of data into “dimension” and “fact” will stay alive for a while, until better terms are invented to describe the same concepts.  For example, I find myself starting to use “grouping attribute” more than “dimension”.


Posted in Business Intelligence | Leave a Comment »

Incremental ETL : Streaming via Micro-Batch

Posted by Dylan Wan on October 11, 2017

A modern analytic application takes the approach of streaming data to perform a process similar to traditional data warehousing incremental ETL.

Actually, if we look into Spark Streaming in detail, the concept of streaming in Spark and incremental ETL are the same: Read the rest of this entry »
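The equivalence can be sketched in a few lines (my own illustration, not Spark code): each micro-batch plays the role of one incremental ETL run, merging a “delta” of change records into the target state.

```python
# Micro-batch streaming as incremental ETL: every batch is a delta
# that gets merged (upserted) into the target store by key.
def process_micro_batch(target, batch):
    """Merge a batch of (key, value) change records into the target store."""
    for key, value in batch:
        target[key] = value  # insert or update, like an ETL upsert
    return target

target = {}
# Each micro-batch below is the equivalent of one incremental ETL run.
for batch in [[("a", 1), ("b", 2)], [("a", 10), ("c", 3)]]:
    process_micro_batch(target, batch)
print(target)  # {'a': 10, 'b': 2, 'c': 3}
```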

Posted in Data Warehouse | Tagged: | Leave a Comment »

Incremental ETL – The last refresh date

Posted by Dylan Wan on October 11, 2017

There are multiple ways to model the last refresh date.

In OBIA, DAC and Informatica based ETL, the last refresh date is maintained within DAC. It is maintained at the level of the source tables that populate the data.

Oracle BI DAC User Guide > About Refresh Dates and DAC’s Incremental Load Strategy

In OBIA ODI based ETL, the “Last Extract Date” is maintained in an ETL_LOAD_DATE table at the interface package level.

Oracle BI Applications ETL Guide > ETL Overview > About Changed Data Capture

There are other simpler ways: Read the rest of this entry »
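One simple scheme (my own sketch, not the DAC or ODI mechanism) keeps a single watermark per source table and extracts only rows changed since it:

```python
from datetime import datetime, timezone

# Last-refresh-date bookkeeping: one watermark per source table.
refresh_dates = {}  # source table name -> last refresh timestamp

def extract_incremental(table, rows, now):
    """Return rows changed since the last refresh, then advance the watermark."""
    last = refresh_dates.get(table, datetime.min.replace(tzinfo=timezone.utc))
    delta = [r for r in rows if r["updated_at"] > last]
    refresh_dates[table] = now
    return delta

t1 = datetime(2017, 10, 1, tzinfo=timezone.utc)
t2 = datetime(2017, 10, 10, tzinfo=timezone.utc)
rows = [{"id": 1, "updated_at": t1}, {"id": 2, "updated_at": t2}]

first = extract_incremental("orders", rows, now=t2)   # first run: everything
second = extract_incremental("orders", rows, now=t2)  # second run: nothing new
print(len(first), len(second))  # 2 0
```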

Posted in Business Intelligence, Data Warehouse | Leave a Comment »

Use Bit to represent groups

Posted by Dylan Wan on October 11, 2017

Here I am providing an alternate approach of supporting group membership in MySQL.

It is a commonly seen requirement that a group may have multiple members and a person may be added to multiple groups.  This many-to-many relationship is typically modeled in an intersection table.

When the group membership is being used as a filter, for example, to show all the students within a group, it becomes a subquery.

Read the rest of this entry »
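The bit-based idea can be sketched like this (my own illustration of the approach): give each group one bit, store a person’s memberships as a single integer, and filter with a bitwise AND instead of a subquery against the intersection table.

```python
# One bit per group; a membership value is the OR of the group bits.
GROUPS = {"math_club": 1 << 0, "chess_club": 1 << 1, "band": 1 << 2}

def add_to_group(membership, group):
    """Set the group's bit in the membership value."""
    return membership | GROUPS[group]

def is_member(membership, group):
    """Test the group's bit -- this replaces the subquery filter."""
    return membership & GROUPS[group] != 0

alice = add_to_group(0, "math_club")
alice = add_to_group(alice, "band")
print(is_member(alice, "band"))        # True
print(is_member(alice, "chess_club"))  # False
```

In MySQL the same test is a plain `WHERE membership & 4 != 0` predicate on one column.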

Posted in Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Schema On Read?

Posted by Dylan Wan on September 24, 2017

I first saw “CREATE EXTERNAL TABLE” in Oracle Database 11g.

It was created for the purpose of loading data.

When Hive was introduced, a lot of data had already been created in HDFS.

Hive was introduced to provide a SQL interface on top of that data.

Using the external table concept is a natural consequence of the design.  It is not really a creative thing.

Read the rest of this entry »

Posted in Big Data, Data Lake, Data Science, Oracle | Tagged: , | Leave a Comment »

Preserve Surrogate Key During Upgrade

Posted by Dylan Wan on September 23, 2017

The generated surrogate key is used everywhere in the data warehouse.  What do we do during upgrade?

Here are some approaches:

1. Full Refresh

You can perform a full refresh of the data warehouse.  The surrogate keys will be regenerated.  The FKs will be updated.

Obviously, this is not a good approach.  There are problems with it:

  • Too big an impact
  • The historical data may not be available from the source
  • Historical data in the data warehouse is lost
    • Type 2 records
    • Historical snapshots

2. Partial Refresh – Refresh the impacted fact tables  Read the rest of this entry »

Posted in Data Warehouse, ETL | Tagged: , | Leave a Comment »

Unified Data Model or Not

Posted by Dylan Wan on September 13, 2017

Do we need to store the data all together in the same place?

Do we need to use the same data model?

Do we need to put data into cloud?

Read the rest of this entry »

Posted in CDH, EDW, Master Data Management | Leave a Comment »

How to – Incremental ETL

Posted by Dylan Wan on September 6, 2017

This is a very basic topic.  An ETL 101 question that comes up a lot in interviews.

Even as we move to different storage and processing frameworks, the concepts are still important.

The idea is simple: you do not need to keep extracting and updating all the data in the data store used for analytics and reporting.  Only the newly created or changed data is needed.  When the “delta” arrives, we need to identify the impact on the existing data, which can be either adding or modifying.

Deletion is a different topic.  We need to identify what was deleted from the source and reflect the change, either by marking the rows as deleted or by removing them from the data store, so that deleted data won’t appear when a query is issued.
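The two cases can be sketched together (my own illustration, with made-up names): apply the delta as adds or modifies keyed by id, and handle deletes separately with a soft-delete flag.

```python
# Incremental apply: delta rows are upserted by key; deletes are marked,
# not removed, so queries can filter them out.
def apply_delta(target, delta, deleted_keys):
    for row in delta:
        target[row["id"]] = {**row, "deleted": False}  # add or modify
    for key in deleted_keys:
        if key in target:
            target[key]["deleted"] = True  # soft delete: mark, don't remove
    return target

target = {1: {"id": 1, "name": "old", "deleted": False}}
apply_delta(target,
            delta=[{"id": 1, "name": "new"}, {"id": 2, "name": "b"}],
            deleted_keys=[2])
print(target[1]["name"], target[2]["deleted"])  # new True
```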

Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Use Surrogate Key in Data Warehouse

Posted by Dylan Wan on August 31, 2017

Using surrogate keys is part of the dimensional modeling technique for populating a data warehouse using a relational database.

The original idea was to generate sequence-generated IDs and use them to join between the fact and dimension tables, so we can avoid joining on a concatenated string or a composite key.  Also, due to the smaller size of the column, it acts as a compression of the data and thus performs better.

However, maintaining the surrogate keys is actually a big portion of the ETL process:
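The core of that maintenance is a lookup table from natural key to generated key, which has to survive incremental runs.  A minimal sketch (my own illustration):

```python
# Surrogate key maintenance: map each natural key to a generated integer,
# reusing the existing key when the same natural key arrives again.
class SurrogateKeyGenerator:
    def __init__(self):
        self.mapping = {}   # natural key -> surrogate key
        self.next_key = 1

    def get_key(self, natural_key):
        if natural_key not in self.mapping:
            self.mapping[natural_key] = self.next_key
            self.next_key += 1
        return self.mapping[natural_key]

gen = SurrogateKeyGenerator()
print(gen.get_key("US-CA"))  # 1
print(gen.get_key("US-NY"))  # 2
print(gen.get_key("US-CA"))  # 1 -- same natural key, same surrogate key
```

Every fact row has to pass through this lookup before loading, which is why the key maintenance dominates so much of the ETL.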

Read the rest of this entry »

Posted in BI, Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Prebuilt BI Contents should replace BI Tools

Posted by Dylan Wan on August 13, 2017

Most school districts need the same kind of reports and dashboards for measuring the performance of students, teachers, and schools.  They do not really need IT to build reports for them if the vendors can provide the reports OOTB.

There is really hardly a need for a custom reporting tool for building or customizing reports if the OOTB reports do the job. Read the rest of this entry »

Posted in BI, BI Application, Business Intelligence | Tagged: | Leave a Comment »

Users of Analytics Applications

Posted by Dylan Wan on May 21, 2017

Business Users are the people who consume the data and the reports.  They see information pushed to them.  They see alerts on their phones.  They see emails.  They bookmark the page in their browser and look at it periodically.  They are executives, managers, and busy users who have other duties.  They don’t spend much time and may not come to see the data every day.  In a school system, they are teachers, principals, vice principals, and school counselors.  In a corporation, they are the business partners of IT: controllers, accounts payable clerks, contractor admins, buyers, and billing clerks.  Many of them know what information they would like to have, and that information is essential for their jobs.  They do not like complex navigation.  They would like to see dashboards or data integrated into their portal or the applications they already use. Read the rest of this entry »

Posted in BI, Business Intelligence, Incorta, OBIEE | Leave a Comment »

Data Lake and Data Warehouse

Posted by Dylan Wan on April 7, 2017

This is an old topic, but I have learned more and come up with more perspectives over time.

  • Raw Data vs Clean Data
  • Metadata
  • What kind of services are required?
  • Data as a Service
  • Analytics as a Service

Read the rest of this entry »

Posted in BI, Big Data, Business Intelligence, Data Lake, Data Warehouse, EDW, Enrichment, Master Data Management | Tagged: , | Leave a Comment »

RTI and Actionable BI

Posted by Dylan Wan on April 1, 2017

RTI stands for Response to Intervention.  It is one of the Data-Driven Instruction methods.  It is a way to monitor students’ progress, determine what interventions students may need, assign the appropriate intervention to them, and monitor the progress closely.

This is a good demonstration of the need for actionable BI.  The system collects information about students and identifies the students who need interventions.  The problems may be learning related or behavior related.   Read the rest of this entry »

Posted in BI, Business Intelligence | Tagged: | Leave a Comment »

Analytics apps or embedded analytics

Posted by Dylan Wan on February 16, 2017

An analytics app should be an application.  We should not see it as a reporting solution or just a way to view or download data.

The best analytics app is simply an app, one that lets people forget it is an “analytics” app.

If you keep the end user in mind and drive the design of the app using user personas and real knowledge of what users do, the line between analytics apps and other apps should not be very clear.

Read the rest of this entry »

Posted in Business Intelligence | Leave a Comment »

Dashboard vs Report

Posted by Dylan Wan on February 16, 2017

A dashboard is a data driven application.  It is important to think about:

Read the rest of this entry »

Posted in Business Intelligence | Leave a Comment »

Schema-less or Schema On Demand

Posted by Dylan Wan on January 29, 2017

I am trying to define the criteria for a schema-less, or schema-on-demand, data storage.

In a relational database, we use DDL (data definition language) to define the schema.

We have to create a table before we can insert data into it.  When we update data, we refer to the columns by name.  We use DDL before we use DML.

We have to know the names of the columns in order to construct a SQL statement with a specific select clause.  If a column does not exist, the system throws an error when we try to query it.  SELECT * FROM does not have that requirement.  CREATE TABLE AS SELECT also gives us some tolerance, but the table it creates will be unstable.  INSERT ... SELECT * is a bad practice, as the statement breaks when the source schema changes.
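The asymmetry is easy to demonstrate (my own sketch, using SQLite): SELECT * tolerates not knowing the column list, while naming a column that does not exist fails when the query runs.

```python
import sqlite3

# DDL first, then DML -- the relational contract described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")  # schema defined up front
conn.execute("INSERT INTO t VALUES (1, 'x')")

rows = conn.execute("SELECT * FROM t").fetchall()   # no column names needed
print(rows)  # [(1, 'x')]

try:
    conn.execute("SELECT missing_col FROM t")       # names a missing column
    failed = False
except sqlite3.OperationalError:
    failed = True  # "no such column" at query time
print(failed)  # True
```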

Schema describes the name of table, the names and orders of the columns, and the data type (or domain) of the columns.

Here is what I feel is possible and worth pursuing: Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse, EDW, OBIEE | Leave a Comment »

Uncommonly Common

Posted by Dylan Wan on January 26, 2017

An interesting concept.

Significant Terms Aggregation – Elastic Search

Posted in Business Intelligence | Leave a Comment »

Elastic Stack

Posted by Dylan Wan on January 24, 2017

This post comes from notes I took while learning Elastic Stack.

ELK is a combination of three tools:

  • Elastic Search
  • Logstash
  • Kibana

When ELK changed to Elastic Stack, a fourth tool was added:

  • Beats

There is a lot of information on the net.  I do not want to repeat any of it.  I will just write my impressions.

In this Elastic world, data are represented, passed, and stored in JSON format.  In our relational world, data are represented, passed, and stored in tabular format.  Data in the tabular format can easily be represented in JSON, but not the other way around.  Data stored in XML or JSON can be nested, and to store it in normalized form in a relational database, the information about the relationships is lost, or exists only in the form of metadata and keys (PK and FK).
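The asymmetry can be shown with a tiny example (my own illustration): flattening a nested JSON document into rows works only by repeating the parent identifier as a key column, which is exactly the relational normalization that makes the reverse direction lossy without metadata.

```python
# A nested JSON document: the order/line relationship is explicit in the nesting.
order = {"order_id": 1, "customer": "Acme",
         "lines": [{"item": "bolt", "qty": 10}, {"item": "nut", "qty": 20}]}

# JSON -> tabular: the parent/child nesting becomes a repeated FK column;
# reconstructing the nesting from these rows requires knowing the key.
rows = [{"order_id": order["order_id"], **line} for line in order["lines"]]
print(rows)
# [{'order_id': 1, 'item': 'bolt', 'qty': 10},
#  {'order_id': 1, 'item': 'nut', 'qty': 20}]
```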

Elastic Search

Elastic Search is the engine in the picture.  However, the real engine is Apache Lucene.  Elastic Search was built on top of Lucene by adding two things: distributed processing and RESTful APIs.  Elastic Search / Lucene is the database (index) engine, which organizes the data and processes it, like the CRUD (Create, Read, Update, Delete) operations in a relational database.

We use SQL for CRUD in a relational database; we use the query language in the RESTful API for similar purposes.  To use Elastic Search, you just put the data in and get the data out by searching.  Knowing how it works internally helps with optimizing and providing a scalable solution, but may not be required.


Logstash

Logstash was introduced to me as an ETL tool.  However, my impression is that it is a specific, not a generic, transformation tool for handling log information.  It provides a specific capability to parse log files, which they call “grok”.  The technology is built on top of regular expressions.  Other impressive features (plugins) are “geoip”, which is an enrichment service, and the “date” processor, which can parse many if not all date formats.

The beauty of this tool is its extensibility.  Like ODI, which can be extended by adding Knowledge Modules, Logstash can be extended by adding plugins.  R similarly benefits from the packages contributed by its community.


Kibana

Kibana is the BI/visualization tool in the picture.  It issues queries to the engine (Elastic Search), gets the data, and displays it in charts or tables.

It seems to emphasize time series analysis, as it is commonly used in log analysis.  Otherwise, most of the charts and tables are just like those in other similar tools.


Beats

These are source-side agents for collecting information.  They extract data on the source and deliver it to the target.





Posted in Business Intelligence | Leave a Comment »

HR Reporting

Posted by Dylan Wan on January 16, 2017

HR is an important business function in any company.  It supports business growth and satisfies compliance requirements.  Reports are required to support the HR functions.

Typical HR Reporting Requirements

HR reports surround the HR functions:  Human Resource Planning, Recruitment and Selection, Orientation and Training, Performance Appraisal, and Compensation and Benefits.

They serve many different roles, and thus there are many different ways to slice and dice the data.

  • Personal Information: Demographics, Education, Job, Location, Manager, Shift, Salaries, Payroll, Skills, Benefit
  • Personal Information Changes – Before and After for audit and review
  • New Hire – Who are they?  Where are we in the hiring process?  When will they start?  Temporary worker or employee?  Full time or part time?  New hire or rehire?  Have they attended orientation?  Are they getting a sign-on bonus?  Are they getting stock options?
  • Allowance – Who got the allowance?  How much?  How was it calculated?
  • Transfer – Internal transfers: what is the releasing org?  Which org is the person moving to?
  • Assignment – Global or special assignments: start date, end date, purpose, etc.
  • Training – Who takes training?  What training?  Compliance status
  • Termination – Who has left?  Who is leaving?  When?  Last payroll date?


I found that HR reporting is somewhat challenging:

Read the rest of this entry »

Posted in BI Application, Business Intelligence, Human Resource Intelligence, Incorta | Leave a Comment »
