Monday, August 31, 2015

Guidelines for BI Data Extraction Modes


In BI, there are three data extraction modes for extracting data from the source and transferring it to the destination: Pull, Push, and Hybrid.

Pull Mode

Pull mode is the most common type of data extraction process. 

The staging database initiates the data extraction process, issuing SQL statements to query the data from the source system.

You should use the pull method when it is necessary to minimize the impact on the source server. This may be the case when the source system is limited in disk space, CPU, or memory, or when you do not have access to the source applications or systems and cannot enhance them for the extraction process.
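
For illustration, here is a minimal T-SQL sketch of a pull extraction issued from the staging server. All names (the SRC linked server, etl.ExtractMetadata, the Orders tables) are assumptions for this sketch, not part of the original guide.

    -- Runs on the staging server; assumes a linked server named SRC
    -- pointing at the source system and a hypothetical metadata table.
    DECLARE @LastExtract datetime;

    SELECT @LastExtract = LastExtractTime
    FROM etl.ExtractMetadata
    WHERE SourceTable = 'Orders';

    -- Pull only the delta since the last successful extraction.
    INSERT INTO stg.Orders (OrderID, CustomerID, Amount, ModifiedDate)
    SELECT OrderID, CustomerID, Amount, ModifiedDate
    FROM SRC.SourceDb.dbo.Orders
    WHERE ModifiedDate > @LastExtract;

Because only the staging side runs code, the source system needs nothing more than read access, which is what makes pull suitable for constrained or inaccessible sources.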

Push Mode

The source system initiates the data extraction process, which typically results in extracting and moving the data to the staging database. The source system is responsible for extracting the incremental data, tracking the process, restarting in case of a failure, and even moving the data to the staging area.

Use the push method when the source system has special scheduling requirements, when you do not have read access to the source system databases, or when you need near real-time data transfers to the destination.
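
As a sketch of the push pattern, a scheduled job on the source system could run a procedure like the following, writing its delta directly to the staging server. The STAGING linked server, the table names, and dbo.PushLog are illustrative assumptions.

    -- Runs on the SOURCE system (e.g., from a scheduled job); assumes a
    -- linked server named STAGING that points at the staging database.
    CREATE PROCEDURE dbo.PushOrdersDelta
        @Since datetime
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @Rows int;

        -- The source extracts its own incremental data and moves it out.
        INSERT INTO STAGING.StagingDb.stg.Orders
            (OrderID, CustomerID, Amount, ModifiedDate)
        SELECT OrderID, CustomerID, Amount, ModifiedDate
        FROM dbo.Orders
        WHERE ModifiedDate > @Since;

        SET @Rows = @@ROWCOUNT;

        -- Record what was pushed so the process can restart after a failure.
        INSERT INTO dbo.PushLog (PushedAt, RowsPushed)
        VALUES (SYSDATETIME(), @Rows);
    END;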

Hybrid Mode

The source system extracts data into an intermediate store, from which the staging system picks it up at regular intervals.

A hybrid model can provide some of the best features of both push and pull. To use a hybrid model, you must have some ability to modify processes on the source server. One scenario involves creating triggers on the source database; another involves changing the source application. Hybrid models are better than a pure push model in that the source server does not need to know anything about the staging server; this local isolation is sometimes referred to as disconnected architecture. In addition, hybrid models are better than a pure pull model in that keeping track of delta batches is relatively straightforward.

Guidelines for Using Pull Mode

Following are guidelines for extracting data by using the pull method:
  • Use a scheduler to launch the extraction process during off-peak hours for the source system or, in cases where there is no off-peak time, at specific periodic intervals.
  • Store metadata required for the extraction on the staging server. Use the staging server to maintain the status of all extractions.
  • Configure the extraction process to read the metadata for configuration parameters and to configure source connections. Make the extraction process begin by checking the schema to see if there have been any modifications to the source since the last extraction process.
  • Configure the extraction process to prepare the extraction queries based on the metadata. This includes information such as the list of tables to extract and how to identify the appropriate delta period.
  • Configure the extraction process to keep track of the status for each source table. Create a status code (such as initialized, in progress, completed, or failed) for each stage of the extraction, and during each extraction update the status code in the metadata table; a sketch of such a status table appears at the end of this section. Use the status code to manage recovery, whether immediately, at a specified recovery interval, or during the next normal load process. A simple restart may be sufficient if the problem was a failed connection, but actual changes to the source may require manual intervention to change the extraction process or the staging database.
  • After the extraction, check to make sure all the data transferred successfully, perhaps by comparing total row counts. In addition, you can check the data for errors such as primary key violations, null violations, and date time violations (or you can defer quality checks to the transformation process). Also, after the extraction, update the metadata information needed to manage the next extraction.
A pull method extraction may be significantly more complex than suggested in the above guidelines. For example, if you extract data from multiple sources asynchronously, the extract from one server may succeed while another fails, which would require special attention to the recovery process. Likewise, if extractions from different servers happen at different times (for example, one at 10:00 PM, and one at 6:00 AM) then the extraction process needs to track each extraction separately.
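
To make the status-tracking guideline concrete, here is a minimal sketch of a staging-side metadata table and its status transitions; the table, column names, and exact status values are assumptions.

    -- Hypothetical metadata table on the staging server.
    CREATE TABLE etl.ExtractStatus (
        SourceTable     sysname     NOT NULL,
        StatusCode      varchar(12) NOT NULL,  -- initialized / in progress / completed / failed
        LastExtractTime datetime    NULL,
        RowsExtracted   int         NULL
    );

    -- Mark the extraction as started.
    UPDATE etl.ExtractStatus
    SET StatusCode = 'in progress'
    WHERE SourceTable = 'Orders';

    -- ... the extraction itself runs here ...

    -- On success, verify the transfer (for example, by comparing row
    -- counts) and record the metadata needed for the next extraction.
    UPDATE etl.ExtractStatus
    SET StatusCode      = 'completed',
        LastExtractTime = SYSDATETIME(),
        RowsExtracted   = (SELECT COUNT(*) FROM stg.Orders)
    WHERE SourceTable = 'Orders';

If the process fails, setting StatusCode to 'failed' gives the recovery logic a clear signal, whether it restarts immediately or waits for the next scheduled load.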

Guidelines for Using Push Mode

  • Schedule the extraction process at regular intervals, or when specific triggers fire, such as when certain data values change.
  • Store the metadata required to perform the extraction process on the source system (or in a neutral location).
  • Choose whether to store the extracted information temporarily in the local source system or to send it directly to the staging server.
  • Maintain metadata status information for recovery purposes, just as when using the pull method.

Scenarios for Using Hybrid Mode

Scenario 1


In this scenario, triggers on the source database push every transaction to audit tables. The staging server reads the audit tables and manages the triggers. You don’t need to change the source applications, but you do need to be able to create triggers that log transactions.

  • Add triggers to the source system to log each transaction to alternating audit tables, using codes for inserts, updates, and deletions (a minimal sketch follows this list).
  • In the staging server extraction process, read the metadata to determine which audit table is currently receiving the logged records, and then switch logging to the other table. For example, if the trigger is logging records to A1, change the trigger to log to A2.
  • In the extraction process, extract the data from the previous audit table to the staging server and, if successful, truncate the table.
  • In the extraction process, update the metadata to indicate which audit table is currently receiving the logged records. This prepares for the next execution.
  • In a subsequent process, load the information into the data warehouse.
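
Below is a minimal sketch of the trigger half of this scenario. The audit table dbo.Orders_Audit_A1 (with a matching Orders_Audit_A2 it alternates with), the change codes, and the table names are assumptions layered onto the steps above.

    -- Hypothetical audit table; an identical dbo.Orders_Audit_A2
    -- alternates with it, so one table can be drained and truncated
    -- while the other continues to receive logged records.
    CREATE TABLE dbo.Orders_Audit_A1 (
        OrderID    int      NOT NULL,
        ChangeCode char(1)  NOT NULL,  -- 'I' insert, 'U' update, 'D' delete
        ChangedAt  datetime NOT NULL DEFAULT (GETDATE())
    );
    GO

    -- Trigger on the source table logging every transaction with a code.
    CREATE TRIGGER dbo.trg_Orders_Audit
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Rows in inserted: 'U' if deleted also has rows, else 'I'.
        INSERT INTO dbo.Orders_Audit_A1 (OrderID, ChangeCode)
        SELECT i.OrderID,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
        FROM inserted AS i;

        -- Pure deletes: rows in deleted with nothing in inserted.
        INSERT INTO dbo.Orders_Audit_A1 (OrderID, ChangeCode)
        SELECT d.OrderID, 'D'
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted);
    END;

The staging process would then point the trigger at Orders_Audit_A2 (for example, with ALTER TRIGGER) before draining and truncating Orders_Audit_A1, so the extraction never competes with live logging.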

Scenario 2: Hybrid extraction – Application logging to flat file

[Figure: the source application writes transaction records to rotating flat-file logs, which the staging server reads and loads into the staging database.]

In this scenario the application (for example, an ASP.NET web application logging click-stream data) logs records to a flat file, as illustrated in the figure above.

Here the data is not extracted from the source database but is written redundantly by the source application. This scenario can be adopted when the source application collects a huge volume of data from the user interface (such as click-stream data). The application creates and switches log files as configured, for example one log file per day, so the source application needs to create a new file at the end of each day.

Use a time stamp (e.g., 200608070030.txt) or a sequential number (e.g., ABC00043.txt) as the file name. The extraction process on the staging server then reads the files, loads the data into the staging database, and then deletes the source files. The process can be extended to extract data from multiple servers.
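
As a sketch of the staging-side read, T-SQL's BULK INSERT can load one of these dated files; the target table, file path, and delimiter options are assumptions.

    -- Runs on the staging server: load one click-stream log file.
    -- The file name (for example, 200608070030.txt) follows the
    -- time-stamp naming convention described above.
    BULK INSERT stg.ClickStream
    FROM 'C:\Logs\200608070030.txt'
    WITH (
        FIELDTERMINATOR = ',',    -- assumed delimiter
        ROWTERMINATOR   = '\n'
    );

    -- After a successful load, a separate step deletes the source file,
    -- and the process repeats for each file and each server extracted.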

You do have to modify the source application to write to the log file for each transaction in addition to the regular updating of the source database. Writing two entries does add overhead to the source application, but the cost should be minimal, as the source application already knows everything necessary about the transaction, including the type of change.

Source: http://biarchanddesignguide.codeplex.com/releases/view/6572
