Dylan's BI Study Notes

My notes about Business Intelligence, Data Warehousing, OLAP, and Master Data Management

Archive for the ‘Big Data’ Category

Schema On Read?

Posted by Dylan Wan on September 24, 2017

I saw “create external table ” first in Oracle DBMS 11G.

It was created for the purpose of loading data.

When Hive was introduced, a lot of data were already created in HDFS.

Hive was introduced to provided the SQL interface on these data.

Using the external table concept is a nature of the design.  It is not really a creative thing.

Read the rest of this entry »


Posted in Big Data, Data Lake, Data Science, Oracle | Tagged: , | Leave a Comment »

Data Lake and Data Warehouse

Posted by Dylan Wan on April 7, 2017

This is an old topic but I learned more and come up more perspectives over time.

  • Raw Data vs Clean Data
  • Metadata
  • What kind of services are required?
  • Data as a Service
  • Analytics as a Service

Read the rest of this entry »

Posted in BI, Big Data, Business Intelligence, Data Lake, Data Warehouse, EDW, Enrichment, Master Data Management | Tagged: , | Leave a Comment »

Data Lake vs. Data Warehouse

Posted by Dylan Wan on October 4, 2015

These are different concepts.

Data Lake – Collect data from various sources in a central place.  The data are stored in the original form.  Big data technologies are used and thus the typical data storage is Hadoop HDFS.

Data Warehouse – “Traditional” way of collecting data from various sources for reporting.  The data are consolidated and are integrated.  A data warehouse design that follow the dimensional modeling technique may store data in star schema with fact tables and dimension tables.   Typically a relational database is used.

If we look at the Analytics platform at Ebay from this linkedin slideshare and this 2013 article: Read the rest of this entry »

Posted in Big Data, Data Warehouse, EDW | Tagged: , , | Leave a Comment »

Is Apache Spark becoming a DBMS?

Posted by Dylan Wan on September 9, 2015

I attended a great meetup and this is the question I have after the meeting.

Perhaps the intent is to make it like a DBMS, like Oracle, or even a BI platform, like OBIEE?

The task flow it actually very similar to a typical database profiling and data analysis job.

1. Define your question

2. Understand and identify your data

3. Find the approach / model that can be used

Read the rest of this entry »

Posted in Big Data, Business Intelligence | Tagged: , | Leave a Comment »

About Big Data (2): SQL and No SQL

Posted by Dylan Wan on April 14, 2012

This is a good article written by Alan Gates, Pig architect at Yahoo!

Comparing Pig Latin and SQL for Constructing Data Processing Pipelines

He compares Pig Latin, a query language for Hive with SQL, the query language for relational database. He gave a good example that helps those who know SQL to understand the differences.

The query language is not difficult to write.  The key point is that it lets the programmer to control the execution plan.  SQL on the other is a higher level abstraction that hides the execution plan from users.  It does not means that we do not really need to be worried about execution plan.  We do.  It is a key for performance tuning.

A good practice for SQL developers is to get the explain plan and to see how Oracle plans to execute the query and optionally use hint and index to control the execution.

Comparing to this, the approach of directly telling the system what to do may seem easier.

But it means that the database system to handle big data does less things, not smarter.

Posted in Big Data, Data Warehouse | Tagged: , , | 1 Comment »

About Big Data (1)

Posted by Dylan Wan on March 7, 2012

Recently I read several articles and books about big data.

I found that many use a very funny definition to define big data.

Big data is the data that you typically cannot handle in the database.  It is bigger than the size of the data you have.


It is a joke I told my daughter during the dinner.  Someone said that they are selling a very good car.  You asked them:  How good is it?  They said that their car can take more people, run faster, much more comfortable, provide better safety, and cheaper.  When you ask them about more details, they keep saying that it will better than what you have. Will you buy it.  She felt that the sale person is a liar.

I do believe that the big data problem does exist today, but it is a special kind of data and requires some special way to handle.

It is not everything.  It may require a new way that does not exist before.  It may be also likely to require some ways that have been there for some time, but we just did not pay attention to it.

Posted in Big Data, Data Warehouse | Tagged: | Leave a Comment »