Tuesday, September 1, 2015

Staging or Not Staging

Data extracted, transformed and loaded (ETL) from the source to destination may require temporary staging for various reasons such as addressing failures, reducing load on the source system, data cleansing, auditing purposes etc. 

Whether you need to stage the data or not as part of your ETL process is common design decision that you have to make. 

Not Staging

In most simple scenario, the ETL process copies data from the source directly to the data warehouse. In this case, if there is a failure, you restart the job, again extracting data from the source system. 

The opportunity to avoid a staging database is even greater when using SSIS, which allows you to apply multiple transformations to data as it passes through memory between the source and the destination.

Also, if the source is simple and clean enough that it does not need any data cleansing or transformations before loading to warehouse then it is a simple mapping without the need of a separate transformation process.  The required data is extracted, mapped and loaded to warehouse, is the scenario here.

Factors for Staging

  • There is a time lag between Extraction and Loading process. If the extraction occurs at 7 AM and the loading to the warehouse happens at midnight then the extracted data has to be stored somewhere. A staging database is a reasonable option.
  • Restart on failure: If various parts of the ETL process are vulnerable to fail, recovery is easier if a staging database is available as the starting point. For example, Out of 100 tables, the extraction process is failed after extracting 60 files to staging area successfully.  When the extraction process is restarted it does not need to extract all but the failed 40 as the data is already available for successful 60 in staging.  When the source system resources are constrained, it is particularly important to avoid repeating an extract unnecessarily. The staging database allows the Extraction process to be decoupled from the subsequent processes.
  • Memory Constraints. Some transformations particularly those that require sorting, can consume substantial memory resources. In many cases, those transformations can be done more efficiently by making use of a disk-based staging. Depending entirely on memory availability limits the ability of the ETL process to scale. When there are memory intensive transformations, having a disk-based process allows for more growth in the volume of data you can process. 
  • Auditing Requirements. Depending on the specifics of the extraction and transformation process, a staging area can provide a level of auditing support. For example, if it turns out that a specific customer is missing from the data warehouse, you could examine the staging database to help identify why that customer’s records were lost.
  • Multiple Source systems: When the data is to be consolidated from multiple source systems, a staging database allows the consolidating to take place before processing the transformation. 

Source: http://biarchanddesignguide.codeplex.com/releases/view/6572


No comments:

Post a Comment