Why Use Data Mining with Data Warehouse?
Posted by Dylan Wan on April 1, 2015
1. Use the data warehouse data as the training set
Data mining needs training data to train the learning algorithm. The data warehousing process provides the following services:
- Consolidate the data from different sources
- Aggregate the data: for example, the source holds individual order return transactions, but the training data may need the number of returns by customer and by product.
- Capture the historical data: the source system may not keep history, which you need if you are going to do time series analysis. This can be accomplished using Type 2 dimensions or periodic snapshots.
- Cleanse the data: the quality of the data affects the quality of the scoring engine. For example, handle missing data by setting default values.
- Normalize the values using domain lookups or transformation logic. For example, transform numeric data into categories.
- Transform the data structure to fit the structure required by data mining models
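A few of the steps above (cleansing, numeric-to-category transformation, and aggregation) can be sketched in plain Python. This is only an illustration under assumptions: the sample rows, the field names, the default value, and the band thresholds are all hypothetical, and a real warehouse would do this work in ETL rather than in application code.

```python
from collections import Counter

# Hypothetical order-return transactions, already consolidated from the sources.
returns = [
    {"customer": "C1", "product": "P1", "amount": 20.0},
    {"customer": "C1", "product": "P2", "amount": None},  # missing amount
    {"customer": "C2", "product": "P1", "amount": 250.0},
    {"customer": "C1", "product": "P1", "amount": 75.0},
]

DEFAULT_AMOUNT = 0.0  # assumed default for missing amounts


def cleanse(row):
    """Return a copy of the row with a missing amount replaced by the default."""
    fixed = dict(row)
    if fixed["amount"] is None:
        fixed["amount"] = DEFAULT_AMOUNT
    return fixed


def amount_band(amount):
    """Transform a numeric amount into a category (illustrative thresholds)."""
    if amount < 50:
        return "low"
    if amount < 200:
        return "medium"
    return "high"


def build_training_rows(transactions):
    """Cleanse, aggregate, and reshape transactions into training rows."""
    cleaned = [cleanse(t) for t in transactions]
    # Aggregate: number of returns by customer and by product.
    returns_by_customer = Counter(t["customer"] for t in cleaned)
    returns_by_product = Counter(t["product"] for t in cleaned)
    return [
        {
            "customer": t["customer"],
            "product": t["product"],
            "amount_band": amount_band(t["amount"]),
            "customer_returns": returns_by_customer[t["customer"]],
            "product_returns": returns_by_product[t["product"]],
        }
        for t in cleaned
    ]


training_rows = build_training_rows(returns)
```

Each training row now carries the aggregated counts and the categorized amount, which is closer to the flat structure most mining algorithms expect than the raw transactions are.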
2. Provide scoring as an additional service of BI applications
The scoring engine can be deployed as a service. The service can be provided by the BI platform and embedded in other apps.
For example, a data warehouse may use historical orders for market basket analysis. The results of the scoring engine need to be deployed in the ecommerce apps, not as BI reports or dashboards.
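As a rough sketch of the market basket example, the code below trains on historical orders by counting how often items are bought together, then exposes a `recommend` function that an ecommerce app could call. It is a simplified pairwise co-occurrence model, not a full association-rule algorithm, and the order data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical historical orders from the warehouse; each order is a set of items.
orders = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def train_pair_counts(baskets):
    """Count single items and item pairs across all baskets."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so each pair is always stored in the same order.
        pair_counts.update(combinations(sorted(basket), 2))
    return item_counts, pair_counts


def recommend(item, item_counts, pair_counts, top_n=2):
    """Rank other items by confidence = count(item and other) / count(item)."""
    scores = {}
    for (a, b), n in pair_counts.items():
        if item == a:
            scores[b] = n / item_counts[item]
        elif item == b:
            scores[a] = n / item_counts[item]
    # Highest confidence first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


item_counts, pair_counts = train_pair_counts(orders)
suggestions = recommend("butter", item_counts, pair_counts)
```

The training step belongs with the warehouse, where the order history lives, while `recommend` is the part that would sit behind a service endpoint for the ecommerce app.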
3. Show the score or prediction together with the rest of the content
For example, the customer profitability score can be shown wherever the customer data is shown. The predictive profitability score can help adjust the customer interactions at all layers of the activities.
This can be done at different layers:
a. Run-time scoring: No ETL process involved, call the scoring API from BI
This depends on the BI platform you are using. If you are using Oracle BIEE and Oracle Data Mining Option, the opaque view can be used.
b. Scoring as part of the regular ETL process or as a batch process: we can set up persistent storage to hold the scoring results. The scores are updated whenever the data is refreshed.
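Option b can be sketched with an in-memory SQLite database standing in for the warehouse. Everything here is an assumption made for illustration: the table names, the columns, and `profitability_score`, which is a simple margin formula used as a placeholder for a trained model.

```python
import sqlite3


def profitability_score(revenue, cost):
    """Hypothetical stand-in for a trained model's scoring function."""
    if revenue <= 0:
        return 0.0
    return round((revenue - cost) / revenue, 2)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id TEXT PRIMARY KEY, revenue REAL, cost REAL)"
)
# Persistent storage for the scoring results.
conn.execute(
    "CREATE TABLE customer_score (customer_id TEXT PRIMARY KEY, score REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", 1000.0, 600.0), ("C2", 500.0, 450.0)],
)


def refresh_scores(connection):
    """Batch step: score every customer and overwrite the stored scores."""
    rows = connection.execute(
        "SELECT customer_id, revenue, cost FROM customer"
    ).fetchall()
    scored = [(cid, profitability_score(rev, cost)) for cid, rev, cost in rows]
    connection.executemany(
        "INSERT OR REPLACE INTO customer_score VALUES (?, ?)", scored
    )
    connection.commit()


refresh_scores(conn)

# A report query shows the score next to the rest of the customer data.
report = conn.execute(
    "SELECT c.customer_id, c.revenue, s.score "
    "FROM customer c JOIN customer_score s ON s.customer_id = c.customer_id "
    "ORDER BY c.customer_id"
).fetchall()
```

Because the scores are stored, reports read them with an ordinary join, at the cost of the scores being only as fresh as the last refresh. Run-time scoring (option a) makes the opposite trade-off.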