Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Archive for the ‘Data Warehouse’ Category

Incremental ETL : Streaming via Micro-Batch

Posted by Dylan Wan on October 11, 2017

A modern analytic application takes the approach of streaming data to perform the similar process as the traditional data warehousing incremental ETL.

Actually, if we look into Spark Streaming in details, the concept of streaming in Spark and Incremental ETL are the same: Read the rest of this entry »

Advertisements

Posted in Data Warehouse | Tagged: | Leave a Comment »

Incremental ETL – The last refresh date

Posted by Dylan Wan on October 11, 2017

There are multiple ways to model the last refresh date.

In OBIA, DAC and Informatica based ETL, the last refresh date is maintained within DAC. It is maintained at the level of the source tables that populates the data.

Oracle BI DAC User Guide > About Refresh Dates and DAC’s Incremental Load Strategy

In OBIA ODI based ETL, the “Last Extract Date” is maintained in an ETL_LOAD_DATE table at the interface package level.

Oracle BI Applications ETL Guide > ETL Overview > About Changed Data Capture

There are other simpler ways: Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse | Leave a Comment »

Use Bit to represent groups

Posted by Dylan Wan on October 11, 2017

Here I am providing an alternate approach of supporting group membership in MySQL.

It is a common seen requirement that a group may have multiple members and a person may be added to multiple groups.  This many to many relationship is typically modeled in an intersection table.

When the group membership is being used as a filter, for example, to show all the students within a group, it becomes a subquery.

Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Preserve Surrogate Key During Upgrade

Posted by Dylan Wan on September 23, 2017

The generated surrogate key is used everywhere in the data warehouse.  What do we do during upgrade?

Here are some approaches:

1. Full Refresh

You can perform a full refresh of the data warehouse.  The surrogate keys will be regenerated.  The FK will be updated.

Obviously, this is not a good approach.  There are problems with this approach:

  • Too big impact
  • The historical data may not be available from source
  • Lost the historical data in the data warehouse
    • Type2 records
    • Historical snapshots

2. Partial Refresh – Refresh the impacted fact tables  Read the rest of this entry »

Posted in Data Warehouse, ETL | Tagged: , | Leave a Comment »

How to – Incremental ETL

Posted by Dylan Wan on September 6, 2017

This is a very basic topic.  An ETL 101 question come up a lot in interview.

Even we are moving to a different storage and different processing framework, the concepts are still important.

The idea is simple – you do not need to keep extracting and updating all data in the data store that are used for analytics and reporting.  Only the newly created or changed data are needed.   When the “Delta” arrive, we need to identify what are the impacts to the existing data, which can be either Adding or Modifying.

Deletion is a different topic.  We need to identify what are deleted from the source and we need to reflect the change, which can be either marking  them as deleted or removing it from the data store so when the query is issued, the deleted data won’t appear

Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Use Surrogate Key in Data Warehouse

Posted by Dylan Wan on August 31, 2017

Using surrogate key is part of dimensional modeling technique for populating a data warehouse using a relational database.

The original idea was to generate the sequence generated IDs and use them in between the fact and dimension table, so we can avoid using the concatenated string or using composite key to join.  Also, due to the size of the column, it becomes a compression of the data and thus performs better.

However, maintaining the surrogate keys itself has actually a big portion of the ETL process:

Read the rest of this entry »

Posted in BI, Business Intelligence, Data Warehouse, ETL | Leave a Comment »

Data Lake and Data Warehouse

Posted by Dylan Wan on April 7, 2017

This is an old topic but I learned more and come up more perspectives over time.

  • Raw Data vs Clean Data
  • Metadata
  • What kind of services are required?
  • Data as a Service
  • Analytics as a Service

Read the rest of this entry »

Posted in BI, Big Data, Business Intelligence, Data Lake, Data Warehouse, EDW, Enrichment, Master Data Management | Tagged: , | Leave a Comment »

Schema-less or Schema On Demand

Posted by Dylan Wan on January 29, 2017

I am trying to define what are the criteria for a schema-less, or a schema on demand data storage.

In relational database, we use DDL, data definition language, to define schema.

We have to create table first, before we can insert data into a table.  When we update data, we update the data by mentioning the name of columns.  We use DDL before we use DML.

We have to know the name of columns in order to construct the SQL statement with specific select clause.  If a column does not exist, the system throws an error when we try to query. SELECT * FROM does not have the requirement.  CREATE SELECT also kind of giving us some tolerance, but the table it creates will be unstable.  INSERT.. SELECT * will be a bad practice as when the source schema changes, this statement becomes broken.

Schema describes the name of table, the names and orders of the columns, and the data type (or domain) of the columns.

Here are what I feel possible and something we can pursue: Read the rest of this entry »

Posted in Business Intelligence, Data Warehouse, EDW, OBIEE | Leave a Comment »

Generate Date dimension in PL/SQL Table Function

Posted by Dylan Wan on October 30, 2015

Almost all data warehouse have a date dimension.  The purpose of the date dimension is to provide some pre-calculated grouping for dates.  It helps rolling up the data that entered against dates to a higher level, such as year, quarter, month, week, etc.

In some system, source files are used in generating the date dimension.  IMHO, it makes the modification to the logic difficult.  In some ETL programs, the task involves various table joins, try to generate the rows for the year range.

This post is for describing how to populate a table with rows for each date for a given year range. Read the rest of this entry »

Posted in Common Dimension, Data Warehouse | Tagged: | Leave a Comment »

Data Lake vs. Data Warehouse

Posted by Dylan Wan on October 4, 2015

These are different concepts.

Data Lake – Collect data from various sources in a central place.  The data are stored in the original form.  Big data technologies are used and thus the typical data storage is Hadoop HDFS.

Data Warehouse – “Traditional” way of collecting data from various sources for reporting.  The data are consolidated and are integrated.  A data warehouse design that follow the dimensional modeling technique may store data in star schema with fact tables and dimension tables.   Typically a relational database is used.

If we look at the Analytics platform at Ebay from this linkedin slideshare and this 2013 article: Read the rest of this entry »

Posted in Big Data, Data Warehouse, EDW | Tagged: , , | Leave a Comment »

Using Load Plan for managing your ETL task in BI Apps 11.1.1.7.1 (1)

Posted by Dylan Wan on May 14, 2013

One of the major change introduced in BI Apps 11.1.1.7.1 is the way how we manage the ETL task sequence and trim the unnecessary tasks.

This functionality was accomplished earlier using DAC.  The problem we frequently faced was that the DAC repository and the INFA repository are maintained as two separate repositories.  We have to sync up the name of tasks exactly in order to use DAC to manage the task execution of the Informatica workflow tasks.

Load Plan and Load Plan Generator was designed for addressing this requirement.

Here is a good article that describes the story.

Load Plan Generator – An Inside Look

Posted in BI Application, Business Intelligence, DAC, Data Warehouse, ETL, Infomatica, OBIA, ODI, Oracle, Oracle Data Integrator | Tagged: , , , | Leave a Comment »

ODI Shortcut

Posted by Dylan Wan on May 25, 2012

ODI provides the shortcut feature since the 11.1.1.6.0 release.  Shortcuts look like symbolic links in Linux file system and shortcuts in windows. You keep a reference to an object that exists in another folder.

The symbolic link feature has been used widely in prepackaged apps development to handle multiple releases of the software product internally. Typically multiple top folders are created, one for each release.

Many codes can be shared from an old release to its next release.  Instead of copying the codes, a symbolic link can be created in the folder created for the new release to point to the code in the folder created for the old release.

ODI shortcut feature was created for a similar purpose.

ODI allow you to create shortcuts for these objects under the ODI Project:

  • Interface
  • Package
  • Procedure

Shortcuts are for the codes that the developers created and modified.  It try to maximize the code reuse.  You create a code in a folder, and reference it in another folder, after you change the code in the original folder, the reference in the other folder also get the change.  You do not need to make the same change over again for each release.

I also notice that the scenarios are not shortcutable.  The scenarios (package scenario and interface scenario) are executable objects.

This dsign means that the scenarios should be regenerated after the shortcuts are created.  It also means that the scenario should be regenerated after the code in the base folder is modified.  It is kind of strange, but it is a design to help you to maximize the code reuse.  Each interface or a package is tie to an ODI model and model could be source apps version specific.  You want your code to run against a different model even the code is a shortcut.

Folders are also not shortcutable. ODI Folders are not regular folders like the way how we think in the file system directory.  Each folder carries a property called “Release Tag”.  The release tag can help redirecting the model to be used in generating the scenarios.

 

Posted in Data Warehouse, ODI, Oracle, Oracle Data Integrator, OWB, Sunopsis | Tagged: | Leave a Comment »

About Big Data (2): SQL and No SQL

Posted by Dylan Wan on April 14, 2012

This is a good article written by Alan Gates, Pig architect at Yahoo!

Comparing Pig Latin and SQL for Constructing Data Processing Pipelines

He compares Pig Latin, a query language for Hive with SQL, the query language for relational database. He gave a good example that helps those who know SQL to understand the differences.

The query language is not difficult to write.  The key point is that it lets the programmer to control the execution plan.  SQL on the other is a higher level abstraction that hides the execution plan from users.  It does not means that we do not really need to be worried about execution plan.  We do.  It is a key for performance tuning.

A good practice for SQL developers is to get the explain plan and to see how Oracle plans to execute the query and optionally use hint and index to control the execution.

Comparing to this, the approach of directly telling the system what to do may seem easier.

But it means that the database system to handle big data does less things, not smarter.

Posted in Big Data, Data Warehouse | Tagged: , , | 1 Comment »

About Big Data (1)

Posted by Dylan Wan on March 7, 2012

Recently I read several articles and books about big data.

I found that many use a very funny definition to define big data.

Big data is the data that you typically cannot handle in the database.  It is bigger than the size of the data you have.

What?

It is a joke I told my daughter during the dinner.  Someone said that they are selling a very good car.  You asked them:  How good is it?  They said that their car can take more people, run faster, much more comfortable, provide better safety, and cheaper.  When you ask them about more details, they keep saying that it will better than what you have. Will you buy it.  She felt that the sale person is a liar.

I do believe that the big data problem does exist today, but it is a special kind of data and requires some special way to handle.

It is not everything.  It may require a new way that does not exist before.  It may be also likely to require some ways that have been there for some time, but we just did not pay attention to it.

Posted in Big Data, Data Warehouse | Tagged: | Leave a Comment »

OTBI vs. OBIA

Posted by Dylan Wan on October 5, 2011

Several people are curious about what are OTBI and OBIA, and what are the differences between OTBI and OBIA.  I will discuss these in this article.

OTBI stands for Oracle Transactional Buisness Intelligence. 

OBIA stands for Oracle Business Intelligence Applications.

Let’s start with OBIA.  OBIA is the pre-packaged BI Apps that Oracle has provided for several years.  It is the data warehouse based solution.  It is based on the universal data warehouse design with different prebuilt adapters that can connect to various source application to bring the data into the data warehouse.  It allows you to conslidate the data from various sources and bring them together.  It provides a library of metrics that help you measure your business. It also provides a set of predefined reports and dashboards.  OBIA works for multiple sources, including E-Business Suite, PeopleSoft, JDE, SAP, and Fusion Applications.

OTBI is different.  First of all, it is a real time BI.  There is no data warehouse or ETL process for OTBI.  Second, it is for Fusion Apps only.   OTBI is leveraging the advanced technologies from both BI platform and ADF to enable the online BI queries agains the Fusion Applications database directly.  In addition, in some area, such as Financial, you can also connect to the Essbase cubes.  Unlike OBIA, OTBI does not have a lot of prebuilt dashboards and reports.  The reason is that for some advanced analysis, the data need to be prepared.  You cannot get eveything you can get from the OBIA data warehouse in OTBI. 

Both OTBI and OBIA are available from the same metadata repository.  Some of the repository objects are shared between OTBI and OBIA.  It was designed to allow you have the following configurations:

  • OTBI Only
  • OBIA only
  • OTBI and OBIA coexist

If you implement Fusion Apps, you can enable OTBI.  You can use the BI EE Answer to access the prebuild metadata and metrics those are built against the Fusion Apps.  You may not get the full powerful prebuild dashboard and repost and prebuilt navigation workflow.  However, you can start experiencing what the BI EE based reports look like.  You can start bring the data out from your OLTP system.  You can provide training to the users to get familar with the subject areas, some of which are shared with OBIA. 

If you enjoy OTBI and want to further get OBIA with a data warehouse based solution.  You can implement OBIA later.  Some of the OTBI reports maybe switched to run against OBIA.  Some of OTBI reports can continue connecting to Fusion Apps directly.  They can coexist in a single BI server and a single BI answer client.

Both OTBI and OBIA are accessing Fusion Apps via the ADF.   This is a more advanced topic. 

 

 

 

Posted in BI, BI Application, BI Links, Business Intelligence, Data Warehouse, DBI, essbase, ETL, Infomatica, OBIA, OBIEE, Oracle, Oracle BI Suite EE, OTBI | 1 Comment »

Dimension Hierarchy Support in Data Warehouse (1)

Posted by Dylan Wan on June 7, 2011

This is my first post about dimension hierarchy support in a data warehouse.

I will first starting with the requirement assumptions in this post and later posts will talk about the implementations.

Dimension is mainly about “View By”,  “Group by”, and “Filter By”.  You say that you want to view your last year sales by regions.  Last year is a filter and “By region” is the “view by” or “group by”, so there are two dimensions involved here:  the Calendar dimension and the Region dimension.

Each year, such as year 2010, is a dimension member in the Calendar dimension.  Each region, such as the East Region, is a dimension member.

Sometime there are hierarchical relationship among the dimension members.  for example, year 2010 is a dimension member, and the month “January 2010” is also a member and we know that the member year 2010 can be related to another 12 dimension members.  the data for the dimension member year 2010 can actually further break down by those twelves members.  we call this relationship between the member year 2010 and the member month January 2010  is a hierarchical relationship.  The year 2010 is a parent member and the  month “January 2010” is a child member.

The relationship is useful in BI since you can see where the data come from.  Basically if you know that the year 2010 is consistent of 12 child members, January 2010, February 2010, …, to December 2010, it would be great if BI allows you to drill from the group by view with the year 2010 to the view by the child members of year 2010.

If there is hierarchical relationships among regions, it would be great that when you view any region in your report, you can further see what are the other regions that the region is consist of and see the details, especially see how the figure is made from.

For example, if you see the sales for 2010 for the East region is 21M and the east region is consist of three child regions, region A, region B, and region C.  You may want to see how this 21M come from.  Whether the figure is 7M for each region, or the figure is actually unbalanced among regions may mean different for you and different action plan may come up.

The requirement assumptions are

  • Dimension Hierarchy is for supporting drill down reporting.  You should be able to drill into a dimension member and see the further details about the member.
  • Dimension hierarchy let you see the break down.  Browsing the dimension members is not the main purpose.  The purpose of having the dimension support in BI is for viewing the metrics along with the dimension.
  • The number that is associated with the parent dimension members would typically be a number that can be added up from the child members.

Next post, I will talk about BI tool implementation.

Posted in BI, BI Work, Business Intelligence, Data Warehouse, OBIA, OBIEE, Oracle BI Suite EE | 1 Comment »

User Defined Function (UDF) in ODI

Posted by Dylan Wan on March 21, 2011

User Defined Function (UDF) is a very powerful feature from ODI.

One of features that are absent from other ETL tool is to address the need to support different database platforms.  I won’t blame those ETL tools since they are not really designed for pre-package BI Apps.

Who will need to switch database platform like that?

If your data warehouse is deployed on Oracle, you can use Oracle SQL.  If you are using Teredata, you can use Teradata.  You know that your PeopleSoft is running on DB2, you can write the DB2 SQL.  In the custom data warehouse ETL environment, switching database platforms is uncommon, one time only task.  You do not need to switch among different database platforms within your code.

A prepackaged BI apps ETL developers, however, are facing different challenges.  You do not know if the source apps is running on which database platform.  Also, you want to give customers the choices on the database platforms to deploy the data warehouse.

ODI UDF comes very handy.  You can create a UDF to use in your SQL, you can have multiple implementation of the UDF for different database platform.  You can use GetDate() for MS SQL and use SYSDATE for Oracle database in the implementation, but you can create you own function such TODAY() and use in your SQL.

User Defined Function is not a new idea.  You may see something similar in other tools.  However, to be able to use UDF in SQL and to be able to use UDF with multiple implementations under different technologies, I only see the feature in ODI.

I won’t be surprised to see those “me too” products in the near feature.

More and more companies are moving to use prepackaged BI apps.

Posted in BI, BI Application, BI Work, Business Intelligence, Data Warehouse, ETL, Infomatica, OBIA, ODI, Oracle, Oracle Data Integrator | Tagged: , | 1 Comment »

Why I do not use PowerConnect to access PeopleSoft Tree

Posted by Dylan Wan on March 16, 2011

1. It does not allow you to use parameters to the PeopleSoft connect. It may be changed later. However, it was a big issue when we try to address customer issues.

2. It requires EFFDT as an input.
It expects that people change the EFFDT using Mapping Editor. How can a business user does that every month?

3. It asks for a Tree Name. Many PeopleSoft tree structure supports multiple trees. Tree is just a header of the hierarchy. Whenever you add a new Tree, you need to create a new mapping!!

It does not make sense to use PowerConnect due to the customer demands. All above requirements are from customers.

We have no choice but stop using it.

It is a nice feature, but it was not designed for a prepackaged apps.

Posted in BI, BI Application, Business Intelligence, Data Warehouse, ETL, Infomatica, OBIEE, PeopleSoft | Leave a Comment »

EDW and BI Apps (Part 3)

Posted by Dylan Wan on November 12, 2010

EDW and BI Apps integration is a fun topic. I heard that more and more organizations are facing this situation. The reason is that many people buy the prepackage BI Application even though they already have an enterprise data warehouse.

I found that an interesting thing is that their existing enterprise data warehouse covers many more subject areas specific to their business, but when they touch the data from the ERP or CRM apps, they still would like to use the prepackaged BI Apps applications. The reason is that it saves a lot of their efforts.

Since BI Applications supports the ERP or CRM apps, their BI Application deployment typically supports the horizontal business functions. On the ERP side, it supports back office operations in financial, procurement, order management, and human resource. Their CRM apps supports marketing, sales, and services. However, the core business system may not be using the prepackaged enterprise apps. The data source for the enterprise data warehouse are industry specific or even in-house built systems.

This leads into the following scenario about integrating the EDW and BI Apps. The integration is really about integrating the Vertical data warehouse with the Horizontal data warehouse. Conformed dimension is a key successful factor for this integration.

There are multiple of technical approaches of doing the integration, such as building a cross reference table or directly sharing the logical or physical layers as I mentioned in the prior posts. No matter which technical approach are taken, I think that they should follow some data warehouse conformance process.

Some people just jump directly into the process of comparing the data warehouse schema. It seems both data warehouse has the party dimension. Let’s merge them. It seems both has the location dimension. Let’s create a cross reference.

I think that it is dangerous to look at the problem in this way. Just because that both data warehouse has something with the name of “Party” does not mean that they are the same thing.

It is important to go through the follow steps if you are involved in a such project:

1. What are the business questions you would like to answer via the conformance and integration?

2. What are the data available in each of your systems?

3. Where are the data required stored?

4. Determine the technical approach to integrate.

Many valuable information you can get from the prepackaged Horizontal BI apps that can be leveraged as part of conformance project:

1. BI Apps collects your people (employee/resource) information from your enterprise apps.
– It may also give your the headcount and reporting structure information.
– The people / resources may have various roles depending on the sets of enterprise apps are deployed.

2. It collects your customer information from your enterprise apps
– It may also provide you the revenue information by major customer related attributes such as geography and industry.
– If financial is being used, you can get the payment and credit information as well.

3. It may have your supplier information if you are using the procurement or Financial payables apps.

4. It has the GL account / Financial reporting structure information
– It already has the cost / expense information collected from various places for accounting

5. It has the internal organization structure information
The org structure defined for business processing as well the org structure defined for reporting / management reporting are there.

6. It has the calendar / fiscal year and quarter definition
If you have the accounting system, the fiscal calendar will be there.

7. It has the product / item information
– It could be the products the organization is selling.
– The items the deploying organization is building
– It can also include the product that the deploying organization is buying.

These of course depend on the nature of the business.

Posted in BI, BI Application, BI Work, Business Intelligence, Data Warehouse, EDW | Leave a Comment »

EDW and BI Apps (Part 2)

Posted by Dylan Wan on October 11, 2010

Last week I talked about EDW as a data model offering. I also mentioned that it is possible to use the enterprise data warehouse as the source for BI Apps, but it requires a lot of manual work. The topic discussed is about evaluating if you need to have the EDW data model or BI Apps, or both.

There is another commonly seen scenario that you may already have an existing enterprise data warehouse. If you already have an enterprise data warehouse and already have the integration from various in-house systems, what do you do?

First of all, I think that it is not wrong to co-exist. You have existing investments in your EDW and you should consider keeping it for the value it already provides. However, considering the benefits and the cost and time saving you may get from the prepackaged BI apps, you may still want to deploy a prepackaged BI Apps. There is really no conflict.

The other questions come from those people who think of one plus one should not be two, but more. I think that it is possible to integrate the EDW and prepackaged BI apps in several ways to gain the additional values:

1. Dashboard and report level integration

BI tool, such as Oracle BI EE, allows you to have multiple data sources for your BI. You can put the reports or regions from different data warehouses into the same end user business flow. You can even put them into the same page if it makes sense.

If you include a cross reference table or cross reference from at least one side, you can actually drill into from one to the other.

The integration between the two will be similar from the integration between the BI apps and an OLTP system, such as E-Business Suite. For example, you can navigate to a EBS page as long as the page is callable. You can use URL rewrite to pass the context. The URL can encrypt the identifier so the data can still be secured. You can the URL as an presentation layer attribute that can be a derived attribute that include the object instance ID from the record.

You can also define the page navigation from one BI page to the other.

I will call this loosely-decoupled approach.

2. Logical Layer Integration via Data Federation

We can also use the data federation feature from a BI tool such as Oracle BI EE. The concept of the data federation is very simple. Basically, as an end user of BI, you should not need to know where your data is physically located. Your BI design architect can tell the BI system as part of the metedata repository about where the data is physically located and what the semantic layer of the data mode should be and how the data are related. During run time, the BI tool can get the data from the various physical database systems or even the text files and spreadsheet data. It will merge the results and show the data to the users.

3. Data Warehouse and Database level integration

The data federation is done via the BI tool. There is mot much impact to the ETL process as long as we can identify the share nature key.

The Data warehouse level integration means that you can not only make the BI Apps co-exist with your enterprise data warehouse, you may actually build some integration via the ETL process.

For example, you do not really need duplicate the Date dimension. The Calendar Date is an important dimension in the data warehouse. However, the definition of the Gregorian calendar is defined outside your organizations. Most of data warehouses have the similar design on the Date dimension. If the various facts can get the agreement on how to derive the Date dimension foreign key, you do not really need to have two Date dimension tables.

You need to be careful for going into this approach though. The benefit of using a prepackaged BI apps is not just cut your initial cost, but also reduce your lifetime maintenance. When you upgrade your OLTP apps, or when you want to implement additional modules, it may be the time for you to upgrade your BI Apps. The prepackaged BI Apps can provide the upgraded adapters to the latest version of the OLTP. If you change the out of the box date dimension from the BI Apps, you make need to find a strategy to keep the change isolated from the upgrade to make the change be easily redone or be protected from the upgrade.

(to be continued…)

Posted in BI, BI Application, BI Work, Business Intelligence, Data Warehouse, EBS, ETL, OBIA, OBIEE, Oracle BI Suite EE | Leave a Comment »

 
%d bloggers like this: