Posted by Dylan Wan on October 4, 2015
These are different concepts.
Data Lake – Collect data from various sources in a central place. The data are stored in their original form. Big data technologies are used, so the typical data storage is Hadoop HDFS.
Data Warehouse – The "traditional" way of collecting data from various sources for reporting. The data are consolidated and integrated. A data warehouse design that follows the dimensional modeling technique may store data in a star schema with fact tables and dimension tables. Typically a relational database is used.
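To make the star schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the relational database. The table names (dim_product, dim_date, fact_sales), the columns, and the sample rows are all made up for illustration; a real warehouse design would have many more dimensions and attributes.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key  INTEGER PRIMARY KEY,
            product_name TEXT,
            category     TEXT
        );
        CREATE TABLE dim_date (
            date_key      INTEGER PRIMARY KEY,
            calendar_date TEXT,
            year          INTEGER
        );
        -- The fact table holds the measures plus a key into each dimension.
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key    INTEGER REFERENCES dim_date(date_key),
            quantity    INTEGER,
            amount      REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Laptop", "Computers"), (2, "Mouse", "Accessories")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20150101, "2015-01-01", 2015), (20150102, "2015-01-02", 2015)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 20150101, 2, 2000.0), (2, 20150101, 5, 50.0), (1, 20150102, 1, 1000.0)],
    )
    conn.commit()


def sales_by_category(conn):
    """A typical reporting query: join the fact table to a dimension and aggregate."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
result = sales_by_category(conn)
print(result)
```

The point of the layout is that reporting queries always follow the same shape: pick measures from the fact table, join to the dimensions you want to slice by, and aggregate.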
If we look at the analytics platform at eBay from this LinkedIn SlideShare and this 2013 article:
Posted in Big Data, Data Warehouse, EDW | Tagged: big-data, Data Warehouse, hadoop
Posted by Dylan Wan on September 9, 2015
I attended a great meetup and this is the question I have after the meeting.
Perhaps the intent is to make it like a DBMS, like Oracle, or even a BI platform, like OBIEE?
The task flow is actually very similar to a typical database profiling and data analysis job.
1. Define your question
2. Understand and identify your data
3. Find the approach / model that can be used
Posted in Big Data, Business Intelligence | Tagged: ApacheSpark, BI
Posted by Dylan Wan on April 14, 2012
This is a good article written by Alan Gates, Pig architect at Yahoo!
Comparing Pig Latin and SQL for Constructing Data Processing Pipelines
He compares Pig Latin, the data flow language of Apache Pig, with SQL, the query language for relational databases. He gives a good example that helps those who know SQL understand the differences.
Pig Latin is not difficult to write. The key point is that it lets the programmer control the execution plan. SQL, on the other hand, is a higher-level abstraction that hides the execution plan from users. That does not mean we do not need to worry about the execution plan. We do. It is key to performance tuning.
A good practice for SQL developers is to get the explain plan, see how Oracle plans to execute the query, and optionally use hints and indexes to control the execution.
Compared to this, the approach of directly telling the system what to do may seem easier.
But it means that the system that handles big data does less for you; it does not mean it is smarter.
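The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (via the standard sqlite3 module) stands in for Oracle, its `EXPLAIN QUERY PLAN` statement stands in for Oracle's explain plan, and the `orders` table, the index name, and the data are invented. The hand-written loop at the end is plain Python, not Pig Latin; it only mimics the "tell the system each step" style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"


def plan_details(conn, sql):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Declarative style: we state what we want and the optimizer picks the plan.
plan_before = plan_details(conn, QUERY)

# Adding an index changes the plan without changing the query text.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan_details(conn, QUERY)

total_sql = conn.execute(QUERY).fetchone()[0]

# Step-by-step style: the programmer decides the order of operations
# (load everything, then filter, then aggregate). Nothing optimizes it for us.
rows = conn.execute("SELECT customer_id, amount FROM orders").fetchall()
filtered = [amount for customer_id, amount in rows if customer_id == 42]
total_manual = sum(filtered)

print("plan before index:", plan_before)
print("plan after index: ", plan_after)
print("totals match:", total_sql == total_manual)
```

The exact plan text depends on the SQLite version, so it is printed rather than hard-coded. What matters is that the same SQL gets a different plan once the index exists, while the hand-written pipeline runs exactly as written.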
Posted in Big Data, Data Warehouse | Tagged: Hive, Pig, Pig Latin
Posted by Dylan Wan on March 7, 2012
Recently I read several articles and books about big data.
I found that many of them define big data in a very funny way.
Big data is data that you typically cannot handle in a database. It is bigger than the data you have.
It is like a joke I told my daughter during dinner. Someone says that they are selling a very good car. You ask them: How good is it? They say that their car can carry more people, runs faster, is much more comfortable, provides better safety, and is cheaper. When you ask for more details, they keep saying that it is better than what you have. Will you buy it? She felt that the salesperson was a liar.
I do believe that the big data problem exists today, but it involves a special kind of data that requires special handling.
It is not everything. It may require new approaches that did not exist before. It may also require approaches that have been around for some time but that we did not pay attention to.
Posted in Big Data, Data Warehouse | Tagged: big-data