In BI, there are three data extraction modes for extracting data from the source and transferring it to the destination: Pull, Push, and Hybrid.
Pull Mode
Pull mode is the most common type of data extraction. The staging database initiates the extraction process, issuing SQL statements to query the data from the source system.
You should use the pull method when it is necessary to minimize the impact on the source server. This may be the case when the source system is limited in disk space, CPU, or memory, or when you do not have access to the source applications or systems and cannot enhance them for the extraction process.
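To make the mechanism concrete, here is a minimal sketch of a pull-style delta query issued from the staging server; the linked server SourceServer, the table dbo.SalesOrders, and the ModifiedDate column are hypothetical, and @LastExtract would come from the staging server's extraction metadata.

```sql
-- Hypothetical delta pull: the staging server queries the source for
-- rows changed since the last successful extraction.
DECLARE @LastExtract datetime = '2006-08-06T22:00:00';

SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM SourceServer.SourceDB.dbo.SalesOrders
WHERE ModifiedDate > @LastExtract;
```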
Push Mode
The source system initiates the data extraction process, which typically results in extracting and moving the data to the staging database. The source system is responsible for extracting the incremental data, tracking the process, restarting in case of a failure, and even moving the data to the staging area.
Use the push method when the source system has special scheduling requirements, when you do not have read access to the source system databases, or when you need near real-time data transfers to the destination.
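As a sketch of what a push step might look like, the following job step could run on the source server on its own schedule; StagingServer is a hypothetical linked server, the table and column names are assumptions, and the source system would track @LastPush in its own metadata, advancing it only after the insert succeeds.

```sql
-- Hypothetical push step: the source server sends its incremental rows
-- directly to a staging table.
DECLARE @LastPush datetime = (SELECT LastPushed
                              FROM dbo.PushMetadata
                              WHERE SourceTable = 'dbo.SalesOrders');

INSERT INTO StagingServer.StagingDB.dbo.SalesOrders_Stage
       (OrderID, CustomerID, OrderTotal, ModifiedDate)
SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM dbo.SalesOrders
WHERE ModifiedDate > @LastPush;
```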
Hybrid Mode
The source system extracts data into an intermediate store, from which the staging system picks it up at regular intervals.
A hybrid model can provide some of the best features of both push and pull. To use a hybrid model, you must have some ability to modify processes on the source server. One scenario involves creating triggers on the source database; another involves changing the source application. Hybrid models are better than a pure push model in that the source server does not need to know anything about the staging server; this isolation is sometimes referred to as a disconnected architecture. In addition, hybrid models are better than a pure pull model in that keeping track of delta batches is relatively straightforward.
Guidelines for Using Pull Mode
Following are guidelines for extracting data by using the pull method:
- Use a scheduler to launch the extraction process during off-peak hours for the source system or, in cases where there is no off-peak time, at specific periodic intervals.
- Store metadata required for the extraction on the staging server. Use the staging server to maintain the status of all extractions.
- Configure the extraction process to read the metadata for configuration parameters and to configure source connections. Make the extraction process begin by checking the schema to see whether there have been any modifications to the source since the last extraction process.
- Configure the extraction process to prepare the extraction queries based on the metadata. This includes information such as the list of tables to extract and how to identify the appropriate delta period.
- Configure the extraction process to keep track of the status of each source table. Create a status code (such as initialized, in progress, completed, or failed) for each stage of the extraction; a sketch of such a status table appears after this list. During each extraction, update the status code in the metadata table. Use the status code to manage recovery, whether immediately, at a specified recovery interval, or during the next normal load process. A simple restart may be sufficient if the problem was a failed connection, but actual changes to the source may require manual intervention to change the extraction process or the staging database.
- After the extraction, check to make sure all the data transferred successfully, perhaps by comparing total row counts. In addition, you can check the data for errors such as primary key violations, null violations, and date/time violations (or you can defer quality checks to the transformation process). Also, after the extraction, update the metadata information needed to manage the next extraction.
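The following is one possible shape for the status metadata described in the list above; all object and column names are illustrative, not prescribed by the guide. One row tracks each source table's current extraction run, including the row counts used for verification.

```sql
-- Hypothetical extraction-status table kept on the staging server.
CREATE TABLE dbo.ExtractionStatus (
    SourceTable sysname     NOT NULL PRIMARY KEY,
    DeltaStart  datetime    NOT NULL,
    DeltaEnd    datetime    NOT NULL,
    StatusCode  varchar(12) NOT NULL,  -- initialized / in progress / completed / failed
    SourceRows  int         NULL,      -- row count reported by the source query
    StagedRows  int         NULL,      -- row count actually landed in staging
    LastUpdated datetime    NOT NULL DEFAULT GETDATE()
);

-- After a load, compare the counts before marking the run completed.
UPDATE dbo.ExtractionStatus
SET StatusCode  = CASE WHEN SourceRows = StagedRows
                       THEN 'completed' ELSE 'failed' END,
    LastUpdated = GETDATE()
WHERE SourceTable = 'dbo.SalesOrders'
  AND StatusCode  = 'in progress';
```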
A pull method extraction may be significantly more complex than suggested in the above guidelines. For example, if you extract data from multiple sources asynchronously, the extract from one server may succeed while the one from another fails, which would require special attention to the recovery process. Likewise, if extractions from different servers happen at different times (for example, one at 10:00 PM and one at 6:00 AM), the extraction process needs to track each extraction separately.
Guidelines for Using Push Mode
Following are guidelines for extracting data by using the push method:
- Schedule the extraction process at regular intervals, or when certain triggers are fired, such as when certain data values change.
- Store the metadata required to perform the extraction process on the source system (or in a neutral location); a sketch of such a metadata table follows this list.
- Choose whether to store the extracted information temporarily in the local source system or to send it directly to the staging server.
- Maintain metadata status information for recovery purposes, as when using the pull method.
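A minimal sketch of such source-side metadata, with illustrative names: the scheduled push job would read LastPushed as its delta start and advance it only after the transfer to staging succeeds.

```sql
-- Hypothetical push metadata kept on the source system.
CREATE TABLE dbo.PushMetadata (
    SourceTable sysname     NOT NULL PRIMARY KEY,
    LastPushed  datetime    NOT NULL,
    StatusCode  varchar(12) NOT NULL  -- initialized / in progress / completed / failed
);
```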
Scenarios for Using Hybrid Mode
Scenario 1: Hybrid extraction – Database triggers logging to audit tables
- Add triggers to the source system to log each transaction to alternating audit tables, using codes for inserts, updates, and deletions (a sketch of such a trigger appears after this list).
- In the staging server extraction process, read the metadata to determine which audit table currently receives the logged records, and then switch the logging to the other table. For example, if the trigger is logging records to A1, change the trigger to log to A2.
- In the extraction process, extract the data from the previous audit table to the staging server and, if successful, truncate the table.
- In the extraction process, update the metadata to indicate which audit table is currently receiving the logged records. This prepares for the next execution.
- In a subsequent process, load the information into the data warehouse.
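Here is a minimal sketch of the audit trigger from the first step; dbo.Orders and the audit tables OrdersAudit_A1/OrdersAudit_A2 are hypothetical names. To switch tables, the extraction process would re-issue ALTER TRIGGER with the other audit table name before draining and truncating the previous one.

```sql
-- Hypothetical audit trigger logging each transaction with an
-- operation code: 'I' = insert, 'U' = update, 'D' = delete.
CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserts and updates: affected rows appear in the "inserted" table.
    INSERT INTO dbo.OrdersAudit_A1 (OrderID, OperationCode, LoggedAt)
    SELECT i.OrderID,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           GETDATE()
    FROM inserted AS i;

    -- Deletes: affected rows appear only in the "deleted" table.
    INSERT INTO dbo.OrdersAudit_A1 (OrderID, OperationCode, LoggedAt)
    SELECT d.OrderID, 'D', GETDATE()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;
```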
Scenario 2: Hybrid extraction – Application logging to flat file
In this scenario, the application (for example, an ASP.NET web application logging click-stream data) logs records to a flat file.
Here, the data is not extracted from the source database but written redundantly by the source application. This scenario can be adopted when the source application collects a large volume of data from the user interface (such as click-stream data). The application creates and switches the logs as configured, such as one log file for each day; in that case, the source application creates a new file at the end of each day.
Use a time stamp (e.g., 200608070030.txt) or a sequential number (e.g., ABC00043.txt) as the file name. The extraction process on the staging server then reads the files, loads the data into the staging database, and then deletes the source files; a sketch of the load step follows. The process can be extended to extract data from multiple servers.
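As a sketch of the staging-side load, assuming a comma-delimited log file and a hypothetical staging table and file path:

```sql
-- Hypothetical load of one day's log file into the staging database.
-- The extraction process would loop over all unprocessed files, load
-- each one, and delete the source file only after a successful load.
BULK INSERT Staging.dbo.ClickStreamLog
FROM 'D:\Logs\200608070030.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 1);
```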
You do have to modify the source application to write to the log file for each transaction, in addition to the regular updating of the source database. Writing two entries does add overhead to the source application, but the cost should be minimal, because the source application already knows everything necessary about the transaction, including the type of the change.
Source: http://biarchanddesignguide.codeplex.com/releases/view/6572