Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Predictive Analytics and AI/ML

Posted by Dylan Wan on February 25, 2024

We have all heard about advanced analytics, commonly described by dividing analytics into three main types:

  • Descriptive
  • Predictive
  • Prescriptive

Many articles explore these concepts, such as Prescriptive vs. Predictive Analytics: Examples & Use Cases.

I aim to approach these categories from the standpoint of a software vendor and describe how they relate to machine learning practices.


Posted in Business Intelligence, Incorta

Semantic Layer for Data Scientist

Posted by Dylan Wan on June 16, 2022

I recently read a good buyer’s guide from AtScale: The Buyer’s Guide to the Best Semantic Layer Tools for Data and Analytics. I think the guide is fair, not because the company I work for also has such a semantic layer, but because I really feel that vendor-specific semantic layers create unnecessary vendor lock-in and their design is not customer-oriented.


Posted in BI, Data Science

Cloud Database and Cloud DataLake

Posted by Dylan Wan on June 15, 2022

The term DataLake was invented to describe the data storage that emerged after Hadoop and HDFS were introduced: a cheaper way and place to store your data without using a traditional database. By traditional, I mean an RDBMS, a relational database management system. Cheaper is not just about cost; it is about needing much less effort to simply store the data. Now I see that many vendors are trying to mimic an RDBMS and make the line disappear.


Posted in Business Intelligence

ML Data Engineering and Feature Store

Posted by Dylan Wan on March 22, 2022

A typical ML process flow looks like this:

  • Load the data
  • Explore and Clean the data
  • Create features
  • Create ML model
  • Deploy the ML model for inference/prediction

The problem with this flow is that it ignores the fact that the process has to be repeatable and the data need to be reusable.
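A minimal sketch of what a repeatable flow can look like, assuming pandas and scikit-learn; the tables, columns, and labels are hypothetical. The point is that the feature logic is factored into one function that both the training run and the inference run reuse, which is the core idea behind a feature store.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def build_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Reusable feature logic: the same code serves training and inference."""
    feats = orders.groupby("customer_id").agg(
        order_count=("order_id", "count"),
        total_amount=("amount", "sum"),
    )
    feats["avg_amount"] = feats["total_amount"] / feats["order_count"]
    return feats

# Load the data (hypothetical orders); exploring/cleaning happens elsewhere.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_id":    [10, 11, 12, 13, 14],
    "amount":      [20.0, 35.0, 5.0, 8.0, 100.0],
})
labels = pd.Series([1, 0, 1], index=[1, 2, 3])  # hypothetical outcomes

# Create features, then create the ML model.
X = build_features(orders)
model = LogisticRegression().fit(X, labels.loc[X.index])

# Deploy for inference: the same feature function is reused, not rewritten.
print(model.predict(build_features(orders)))
```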


Posted in ML

Getting Data into Cloud

Posted by Dylan Wan on August 27, 2021

When I worked on data warehousing technologies, we extracted the data from the source. The “Extract” is the first step in ETL (or ELT). The extraction was typically done over a SQL connection to the database that holds the transactional data.

When we started introducing cloud-based storage, or the Data Lake, much of this process became “Data Ingestion”.

The real difference between “Data Extraction” and “Data Ingestion” is that when we extract the data using SQL, we typically know what the data look like. The structure of the data is known, so we can design the target placeholder, whether staging tables, tables in the data warehouse, or the operational data store (ODS), before we write the code to extract the data. When we perform data ingestion, we should not need to know what the source structure looks like, and we definitely do not have to design the target for holding the data until later, when we need to use the data. The design and development phase can be shortened or deferred.

When a cloud service cannot support schema inference, or cannot get insights from data whose structure is unknown or may change over time, the intelligence has to be put into the extraction or ingestion layer.

Intelligent ingestion means that even though the source structure is complex, the complexity of the data model has already been handled and structured data are delivered to the destination.
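A minimal sketch of schema inference at ingestion, assuming pandas; the records and field names are hypothetical. No target schema is designed up front, and the structure is discovered from the data itself:

```python
import json
import pandas as pd

# Raw records land as-is; no staging table was designed up front.
raw = [
    '{"id": 1, "name": "Acme", "address": {"city": "Austin", "zip": "78701"}}',
    '{"id": 2, "name": "Initech", "phone": "555-0100"}',  # an unplanned field
]

# Ingestion infers the structure, flattening nested objects into columns.
df = pd.json_normalize([json.loads(r) for r in raw])
print(df.dtypes)  # the columns and types were discovered, not designed
print(df)
```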


How is this different from the prebuilt BI Apps?

Prebuilt apps provide the end-to-end solution: not just the extraction, the transformation, or the target schema design, but prebuilt libraries of metrics, analytics workflows, prebuilt dashboards, etc.

With intelligent ingestion, the data is there, and you can still build the analytics application yourselves.

Posted in Business Intelligence

Time Series

Posted by Dylan Wan on March 12, 2021

A time series is defined as a series of data points, typically the values of a variable, which may change over time.

A set of statistical methods were developed for analyzing such data.

Those methods help us understand and interpret the data, and once the data are understood, the model can be used for forecasting.

Time series analysis assumes the data include these components (https://medium.com/high-data-stories/time-series-patterns-600b03d13903); the sketch after the list pulls them apart:

  • Trend
  • Pattern – sometimes called seasonality
  • Cycle
  • Random Error
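A minimal decomposition sketch, assuming statsmodels is installed; the synthetic monthly series is hypothetical:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + seasonal pattern + random error.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(
    np.linspace(100, 160, 48)                      # trend
    + 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # seasonal pattern
    + rng.normal(0, 2, 48),                        # random error
    index=idx,
)

parts = seasonal_decompose(y, model="additive", period=12)
print(parts.trend.dropna().head())    # the trend component
print(parts.seasonal.head(12))        # the repeating seasonal component
print(parts.resid.dropna().head())    # what is left: the random error
```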

On the business side, much of the data is time series, but we don’t typically use those time series methods to analyze it. There are typical ways that analytics application dashboards and data models deal with time series.



Posted in Business Intelligence

Migrating from OBI to Incorta

Posted by Dylan Wan on December 9, 2020

I am sharing my experience of migrating from OBI to Incorta.

Process

  1. Start with the Incorta EBS Blueprint
  2. Configure and customize for the deploying company
  3. Optionally, demo the Fusion Connector
  4. Preview and demo to business users using their own data
  5. Provide an analysis of existing OBI dashboard usage to help prioritize the replacement project
  6. Provide the lineage and SQL from OBI; give the development team access to the nqquery log and SDE logic
  7. Analyze the SQL and OBI reports one by one against the blueprint business schemas
  8. Iterate on enhancing the Incorta physical schema and creating dashboards in Incorta
  9. Create materialized views when necessary
  10. Demo and get feedback from business users during the development
  11. Leverage the existing OBI team for verification and comparison; help the team gain knowledge of Incorta for future support
  12. Prepare the pilot and the overlap period before shutting down OBI; define the exit criteria
  13. Prepare training for the future analysts/superusers who will create and manage their own content. This may need to recur for several months; providing office hours will be helpful
  14. Provide a demo of the Excel add-in if necessary
  15. Provide the SQL interface if necessary, and control the usage by understanding how it is used. This may be helpful for Tableau users, but be careful, as it impacts system resource usage
  16. Schedule dashboard delivery and downloads; configure integration with network drive storage: FTP, Dropbox, OneDrive, GDrive, etc.
  17. Along with the development of the new content, start discussing job schedule options, such as dependencies between Incorta and source app batch processes. Also, understand the source data update pattern and frequency: when the Transfer to GL is executed, when the depreciation process runs, how frequently these processes run, and whether period-closing weeks differ

There was very little resistance from existing OBI developers, since most of them see this as an opportunity.

They learn new skills, such as using Spark SQL and PySpark to create materialized views, and explore the ML library in Incorta.

Their data model knowledge is typically an asset on the new platform.

Posted in BI, BI Application, DBI, EBS, Incorta, Informatica, OBIA, OBIEE, Oracle, Oracle BI Suite EE, OTBI

Oracle App Cloud and Incorta

Posted by Dylan Wan on September 30, 2020

OTBI is great. But when people migrating from Oracle EBS to Oracle Cloud Apps would like to view the data from both EBS and Oracle Cloud, Incorta becomes a cost-saving and quick-to-implement solution that does not require building a data warehouse.

Incorta is not a data warehouse, although it does have data storage, which is optimized for extraction, for internal processing (reading and transformation), and for analytical queries.

Scalability is accomplished through a distributed architecture, so there is no issue with resource contention.

The support for Tableau and Power BI provides an alternative for companies that have already standardized their BI front end.

I am very interested to see how Incorta can help upsell Oracle Cloud Apps, not the other way around.

Posted in Business Intelligence, EBS, Human Resource Intelligence, Incorta, Oracle

Scalable Distributed BI Architecture

Posted by Dylan Wan on November 26, 2019

I have not written anything for almost a year. I think one of the major changes is that I am now dealing with a distributed BI architecture.

There is not much extra work for a solution architect to do in a distributed BI architecture. The design of schemas and dashboards is the same. The metadata is the same. The challenge is all on the system side; supporting it, including debugging issues, is new, and understanding how it works is itself interesting.

High Availability – The BI system needs to be always available. If a machine is down, can the site still be up? For the front-end UI, we need a load balancer that serves as the site accessed by users. We can have multiple nodes that serve requests from the users; which node is used is random, or can be configured based on rules. Having HA means that the nodes involved provide the same services, so they can replace each other. Even if one node fails, the other nodes can still take care of new requests.

Disaster Recovery – A stable BI system needs to support recovery. DR is about resuming operation after a disaster. It is different from HA in that the nodes in HA are all serving requests, while a DR site is a backup and may be passive rather than active. A DR site typically has some distance from the main site so it won’t experience the same disaster.

Horizontal scaling – Achieving horizontal scaling may also involve multiple distributed nodes, but it is a bit different. This is about reacting to the growth of your organization, and thus the growth of the data. Adding more nodes to serve new requests is horizontal scaling, but having nodes that provide the same services, without distinguishing between existing and new, is more desirable for achieving it.

Separation of Data Preparation/Integration and Data Request – Many conventional BI and ETL systems are separate by nature. But breaking a system that was originally designed as one JVM and one process into separate but integrated systems is equally challenging. Data consistency requires that updated data not become visible unless the other integrated parts are also updated. Latency is OK, but inconsistency is not. While the data is being updated, users who rely on the data to make decisions should not be affected, until the moment the new, consistent data becomes available.
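A toy sketch of that last point in plain Python, with hypothetical names: readers keep seeing the old snapshot while a new load is prepared elsewhere, and the switch to the new, consistent data happens in one atomic step.

```python
import threading

class VersionedStore:
    """Readers always see one complete, consistent snapshot."""

    def __init__(self, data):
        self._lock = threading.Lock()
        self._current = data

    def read(self):
        with self._lock:
            return self._current  # whatever is current is complete

    def publish(self, new_data):
        # The load/transform happened elsewhere; only the pointer swap is
        # done here, so readers never observe a half-updated state.
        with self._lock:
            self._current = new_data

store = VersionedStore({"revenue": 100})
store.publish({"revenue": 120, "orders": 30})  # becomes visible all at once
print(store.read())
```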

Technologies

Zookeeper – I don’t want to use the term database to describe it, but it is indeed a database that stores a specific type of data: the messages between nodes.

Apache Helix – a cluster management framework; there is a good SlideShare presentation about it.

Shared Metadata Database

Shared File System

An enterprise-level BI architecture is not just about providing instant queries, and not just about providing minimal latency. It is about being highly available, supporting disaster recovery, and being scalable over time. The different types of challenges indicate how much the deploying company relies on the system and reflect the stage of the product and the company in their lifecycle.


Posted in BI, Business Intelligence, Incorta

Is ETL still necessary?

Posted by Dylan Wan on January 22, 2019

ETL stands for Extract, Transform, and Load.

The existence of Extract and Load implies that the source data and target data are stored separately, so you need to extract from the source and load the data into the target data store.

Extract and Load won’t go away if the data used for reporting is not stored in the same place as the data captured in the source system for transactional processing.

However, sometimes we use the term Replicate, which refers to simply copying the data from the source to the target without any transformation.

In a way, ETL was not necessary when we could have a reporting solution directly on the transactional system. In the old days, Oracle DBI running off Oracle E-Business Suite did not require ETL. It did have batch programs for summarization, but it did not extract and load data.

ETL also goes away once a replication- or backup-based approach is used. A solution such as Oracle GoldenGate can eliminate the need for separate Extract and Load steps, since it does Extract and Load in one step. Sometimes people call this ELT: Extract and Load first, then Transform.

The real question is how easily these traditional ETL programs can be developed and maintained.

The real question is whether an ETL tool and ETL programs running outside the BI tool are necessary. The processes, at least the Transform process, won’t go away as long as the data needs to be prepared for reporting.

Are we able to get rid of the heavy ETL development time? Are we able to get rid of the separate ETL tool?  Are we able to stop maintaining those ETL programs? Definitely.


Posted in BI, Business Intelligence, ETL

Is Star Schema necessary?

Posted by Dylan Wan on January 18, 2019

A star schema describes the data by fact and dimension.

From one angle, it is a data modeling technique for designing a data warehouse based on relational database technology. In the old OLAP world, even though a cube also links to the dimensions that describe the measures, we typically wouldn’t call that a star schema.

A star schema design is based on the assumption that joins can be expensive, so the dimension tables are created by denormalizing the data from their parents. It is also based on the assumption that most business questions are like “How much did we sell this period, by salesperson, by product, by region, by customer category, by sales channel?”: slicing the data by the elements against which we can track performance and take action. On the technical side, it also assumes that, via the normalized relationship between fact and dimension rather than duplicating the dimensions onto each measure, performance can be optimized by techniques such as bitmap indexes and prebuilt aggregates, and that common dimensions make it possible to summarize the data across subject areas.
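As a minimal sketch of that kind of question against a star schema, here pandas stands in for the SQL a BI tool would generate; the tables and columns are hypothetical.

```python
import pandas as pd

# Dimension tables: denormalized descriptive attributes.
dim_product = pd.DataFrame({"product_key": [1, 2], "category": ["Laptop", "Phone"]})
dim_region  = pd.DataFrame({"region_key": [10, 20], "region": ["West", "East"]})

# Fact table: measures plus surrogate keys pointing at the dimensions.
fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2, 2],
    "region_key":  [10, 20, 10, 20],
    "amount":      [1200.0, 900.0, 650.0, 700.0],
})

# "How much did we sell, by product category, by region?"
report = (
    fact_sales
    .merge(dim_product, on="product_key")
    .merge(dim_region, on="region_key")
    .groupby(["category", "region"])["amount"].sum()
)
print(report)
```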

So, is star schema design necessary?

If data storage and query generation have changed and we do not use a relational database, I feel that following the star schema design to prepare the data is definitely questionable.

At one extreme, if you don’t have to aggregate data, because you just want to view the data in a list report and you don’t have the requirement of viewing different subject areas side by side, a fully denormalized structure may be ideal.

Will the way of describing business questions using the terms “dimension” and “fact” go away?

For the time being, while the professional and functional people who live in the traditional BI world are still in place, I think the separation of data into “dimension” and “fact” will stay alive for a while, until better terms are invented to describe the same concepts. For example, I find myself starting to use “grouping attribute” more than “dimension”.


Posted in Business Intelligence

Incremental ETL : Streaming via Micro-Batch

Posted by Dylan Wan on October 11, 2017

A modern analytic application takes the approach of streaming data to perform a process similar to traditional data warehousing incremental ETL.

Actually, if we look into Spark Streaming in detail, the concept of streaming in Spark and the concept of incremental ETL are the same:
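A hedged PySpark sketch of the parallel, with a hypothetical path and placeholder merge logic: each micro-batch hands you exactly the delta that arrived since the last trigger, which is what one incremental ETL run processes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-etl").getOrCreate()

# The stream delivers only new files/rows: the "delta" of incremental ETL.
new_orders = (spark.readStream
              .schema("order_id INT, amount DOUBLE")
              .json("/landing/orders"))  # hypothetical landing path

def apply_delta(batch_df, batch_id):
    # Each micro-batch plays the role of one incremental ETL run:
    # transform the delta, then merge it into the target.
    batch_df.createOrReplaceTempView("delta")
    # ... upsert logic against the warehouse table would go here ...

query = (new_orders.writeStream
         .foreachBatch(apply_delta)
         .trigger(processingTime="5 minutes")  # the micro-batch interval
         .start())
# query.awaitTermination()
```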

Posted in Data Warehouse

Incremental ETL – The last refresh date

Posted by Dylan Wan on October 11, 2017

There are multiple ways to model the last refresh date.

In OBIA DAC- and Informatica-based ETL, the last refresh date is maintained within DAC, at the level of the source tables that populate the data.

Oracle BI DAC User Guide > About Refresh Dates and DAC’s Incremental Load Strategy

In OBIA ODI-based ETL, the “Last Extract Date” is maintained in an ETL_LOAD_DATE table at the interface package level.

Oracle BI Applications ETL Guide > ETL Overview > About Changed Data Capture

There are other simpler ways:
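For instance, a minimal sketch of one simple approach, a single watermark table, using SQLite purely for illustration; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_control (table_name TEXT PRIMARY KEY, last_refresh TEXT);
    CREATE TABLE src_orders (order_id INT, amount REAL, last_update_date TEXT);
    INSERT INTO etl_control VALUES ('src_orders', '2017-10-01');
    INSERT INTO src_orders VALUES (1, 10.0, '2017-09-30'), (2, 25.0, '2017-10-05');
""")

# Incremental extract: only rows changed since the last refresh date.
(last_refresh,) = conn.execute(
    "SELECT last_refresh FROM etl_control WHERE table_name = 'src_orders'"
).fetchone()
delta = conn.execute(
    "SELECT * FROM src_orders WHERE last_update_date > ?", (last_refresh,)
).fetchall()
print(delta)  # only order 2 qualifies

# After a successful load, advance the watermark.
conn.execute(
    "UPDATE etl_control SET last_refresh = '2017-10-11' "
    "WHERE table_name = 'src_orders'"
)
```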

Posted in Business Intelligence, Data Warehouse

Use Bit to represent groups

Posted by Dylan Wan on October 11, 2017

Here I am providing an alternative approach to supporting group membership in MySQL.

It is a commonly seen requirement that a group may have multiple members and a person may be added to multiple groups. This many-to-many relationship is typically modeled with an intersection table.

When the group membership is used as a filter, for example to show all the students within a group, it becomes a subquery.
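A minimal sketch of the bit approach in Python, with hypothetical group assignments; MySQL’s & operator supports the same predicate directly in SQL, so the filter stops being a subquery.

```python
# Each group gets one bit position instead of rows in an intersection table.
GROUPS = {"math_club": 1 << 0, "chess_club": 1 << 1, "debate": 1 << 2}

# One integer column per student replaces the many-to-many table.
students = {
    "alice": GROUPS["math_club"] | GROUPS["debate"],  # bits 0 and 2 -> 5
    "bob":   GROUPS["chess_club"],                    # bit 1 -> 2
}

# Membership filter is a bitwise AND, not a subquery;
# in MySQL this would be:  WHERE group_bits & 4 > 0
in_debate = [name for name, bits in students.items() if bits & GROUPS["debate"]]
print(in_debate)  # ['alice']
```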


Posted in Business Intelligence, Data Warehouse, ETL

Schema On Read?

Posted by Dylan Wan on September 24, 2017

I first saw “create external table” in Oracle Database 11g.

It was created for the purpose of loading data.

When Hive was introduced, a lot of data had already been created in HDFS.

Hive was introduced to provide a SQL interface on top of that data.

Using the external table concept was a natural outcome of the design. It is not really a creative thing.
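A minimal sketch of schema-on-read, assuming pandas; the file contents and column names are hypothetical. The raw file sits untouched, and a schema is applied only at read time:

```python
import io
import pandas as pd

# Data already "landed" as a plain file: no table was created to load it.
raw_file = io.StringIO("1,Acme,2017-09-01\n2,Initech,2017-09-15\n")

# Schema on read: the structure is declared when we query, not when we store.
df = pd.read_csv(
    raw_file,
    names=["id", "customer", "signup_date"],  # schema applied at read time
    dtype={"id": "int64", "customer": "string"},
    parse_dates=["signup_date"],
)
print(df.dtypes)
```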


Posted in Big Data, Data Lake, Data Science, Oracle

Preserve Surrogate Key During Upgrade

Posted by Dylan Wan on September 23, 2017

The generated surrogate key is used everywhere in the data warehouse. What do we do during an upgrade?

Here are some approaches:

1. Full Refresh

You can perform a full refresh of the data warehouse. The surrogate keys will be regenerated, and the FKs will be updated; a sketch of that remapping follows the list below.

Obviously, this is not a good approach. The problems include:

  • Too big an impact
  • The historical data may not be available from the source
  • Losing the historical data in the data warehouse
    • Type 2 records
    • Historical snapshots
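A rough illustration of the FK update, assuming each dimension carries a natural key and using hypothetical pandas tables: regenerated surrogate keys force every fact row to be remapped through that natural key.

```python
import pandas as pd

# Before the refresh: old surrogate keys in dimension and fact.
dim_old = pd.DataFrame({"cust_wid": [101, 102], "cust_natural_key": ["C-1", "C-2"]})
fact    = pd.DataFrame({"cust_wid": [101, 102, 101], "amount": [10.0, 20.0, 30.0]})

# After a full refresh the sequence restarts: same customers, new keys.
dim_new = pd.DataFrame({"cust_wid": [1, 2], "cust_natural_key": ["C-1", "C-2"]})

# Remap every fact FK: old key -> natural key -> new key.
key_map = dim_old.merge(dim_new, on="cust_natural_key", suffixes=("_old", "_new"))
fact["cust_wid"] = fact["cust_wid"].map(
    key_map.set_index("cust_wid_old")["cust_wid_new"]
)
print(fact)
```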

2. Partial Refresh – Refresh the impacted fact tables

Posted in Data Warehouse, ETL

Unified Data Model or Not

Posted by Dylan Wan on September 13, 2017

Do we need to store the data all together in the same place?

Do we need to use the same data model?

Do we need to put the data into the cloud?


Posted in CDH, EDW, Master Data Management

How to – Incremental ETL

Posted by Dylan Wan on September 6, 2017

This is a very basic topic: an ETL 101 question that comes up a lot in interviews.

Even as we move to different storage and processing frameworks, the concepts are still important.

The idea is simple: you do not need to keep extracting and updating all the data in the data store used for analytics and reporting. Only the newly created or changed data are needed. When the “delta” arrives, we need to identify the impacts on the existing data, which can be either adding or modifying.

Deletion is a different topic. We need to identify what was deleted from the source and reflect the change, either by marking the rows as deleted or by removing them from the data store, so that when a query is issued, the deleted data won’t appear.
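A toy sketch of applying one delta in plain Python, with hypothetical keys and a soft-delete flag:

```python
# Target store keyed by business key; each row carries a deleted flag.
target = {
    "A-1": {"amount": 100, "deleted": False},
    "A-2": {"amount": 200, "deleted": False},
}

# The incoming delta: one changed row, one brand-new row.
delta = {"A-2": {"amount": 250}, "A-3": {"amount": 50}}
# Keys reported as removed from the source in this cycle.
deleted_keys = {"A-1"}

for key, row in delta.items():
    if key in target:
        target[key].update(row)                  # modify existing data
    else:
        target[key] = {**row, "deleted": False}  # add newly created data

for key in deleted_keys:
    if key in target:
        target[key]["deleted"] = True  # soft delete: queries filter this out

print(target)
```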


Posted in Business Intelligence, Data Warehouse, ETL

Use Surrogate Key in Data Warehouse

Posted by Dylan Wan on August 31, 2017

Using surrogate keys is part of the dimensional modeling technique for populating a data warehouse on a relational database.

The original idea was to generate sequence-based IDs and use them to join the fact and dimension tables, so we can avoid joining on concatenated strings or composite keys. Also, because the column is small, it acts as a compression of the data and thus performs better.

However, maintaining the surrogate keys is itself a big portion of the ETL process:
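A minimal sketch of that maintenance work, assuming pandas and hypothetical table names: the dimension load assigns the sequence-generated keys, and every fact load must look them up from the natural keys.

```python
import pandas as pd

# Dimension load: assign a sequence-generated surrogate key per natural key.
dim_customer = pd.DataFrame({"customer_code": ["C-1", "C-2", "C-3"]})
dim_customer["customer_wid"] = range(1, len(dim_customer) + 1)

# Fact load: source rows arrive with natural keys only.
src_fact = pd.DataFrame({
    "customer_code": ["C-2", "C-1", "C-2"],
    "amount":        [40.0, 15.0, 60.0],
})

# The lookup step every fact ETL must perform: natural key -> surrogate key.
fact_sales = (
    src_fact.merge(dim_customer, on="customer_code", how="left")
            .drop(columns=["customer_code"])  # fact keeps only the integer key
)
print(fact_sales)
```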


Posted in BI, Business Intelligence, Data Warehouse, ETL

Prebuilt BI Contents should replace BI Tools

Posted by Dylan Wan on August 13, 2017

Most school districts need the same kinds of reports and dashboards for measuring the performance of students, teachers, and schools. They do not really need IT to build reports for them if the vendors can provide the reports out of the box (OOTB).

There is really hardly a need for a custom reporting tool to build or customize reports if the OOTB reports do the job.

Posted in BI, BI Application, Business Intelligence