In BI, there are three data extraction modes for extracting data from the source and transferring it to the destination: Pull, Push, and Hybrid.
Pull Mode
Pull mode is the most common type of data extraction. The staging database initiates the extraction process, issuing SQL statements to query the data from the source system.
You should use the pull method when it is necessary to minimize the impact on the source server. This may be the case when the source system is limited in disk space, CPU, or memory, or when you do not have access to the source applications or systems and cannot enhance them for the extraction process.
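To make the mechanism concrete, here is a minimal sketch of a pull-style delta query issued from the staging server; the linked server SourceServer, the table dbo.SalesOrders, and the ModifiedDate column are hypothetical, and @LastExtract would come from the staging server's extraction metadata.

```sql
-- Hypothetical delta pull: the staging server queries the source for
-- rows changed since the last successful extraction.
DECLARE @LastExtract datetime = '2006-08-06T22:00:00';

SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM SourceServer.SourceDB.dbo.SalesOrders
WHERE ModifiedDate > @LastExtract;
```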
Push Mode
The source system initiates the data extraction process, which typically results in extracting and moving the data to the staging database. The source system is responsible for extracting the incremental data, tracking the process, restarting in case of a failure, and even moving the data to the staging area.
Use the push method when the source system has special scheduling requirements, when you do not have read access to the source system databases, or when you need near real-time data transfers to the destination.
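As a sketch of what a push step might look like, the following job step could run on the source server on its own schedule; StagingServer is a hypothetical linked server, the table and column names are assumptions, and the source system would track @LastPush in its own metadata, advancing it only after the insert succeeds.

```sql
-- Hypothetical push step: the source server sends its incremental rows
-- directly to a staging table.
DECLARE @LastPush datetime = (SELECT LastPushed
                              FROM dbo.PushMetadata
                              WHERE SourceTable = 'dbo.SalesOrders');

INSERT INTO StagingServer.StagingDB.dbo.SalesOrders_Stage
       (OrderID, CustomerID, OrderTotal, ModifiedDate)
SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM dbo.SalesOrders
WHERE ModifiedDate > @LastPush;
```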
Hybrid Mode
The source system extracts data into an intermediate store, from which the staging system picks it up at regular intervals.
A hybrid model can provide some of the best features of both push and pull. To use a hybrid model, you must have some ability to modify processes on the source server. One scenario involves creating triggers on the source database; another involves changing the source application. Hybrid models are better than a pure push model in that the source server does not need to know anything about the staging server; this isolation is sometimes referred to as a disconnected architecture. In addition, hybrid models are better than a pure pull model in that keeping track of delta batches is relatively straightforward.
Guidelines for Using Pull Mode
Following are guidelines for extracting data by using the pull method:
- Use a scheduler to launch the extraction process during off-peak hours for the source system or, in cases where there is no off-peak time, at specific periodic intervals.
- Store metadata required for the extraction on the staging server. Use the staging server to maintain the status of all extractions.
- Configure the extraction process to read the metadata for configuration parameters and to configure source connections. Make the extraction process begin by checking the schema to see whether there have been any modifications to the source since the last extraction process.
- Configure the extraction process to prepare the extraction queries based on the metadata. This includes information such as the list of tables to extract and how to identify the appropriate delta period.
- Configure the extraction process to keep track of the status of each source table. Create a status code (such as initialized, in progress, completed, or failed) for each stage of the extraction; a sketch of such a status table appears after this list. During each extraction, update the status code in the metadata table. Use the status code to manage recovery, whether immediately, at a specified recovery interval, or during the next normal load process. A simple restart may be sufficient if the problem was a failed connection, but actual changes to the source may require manual intervention to change the extraction process or the staging database.
- After the extraction, check to make sure all the data transferred successfully, perhaps by comparing total row counts. In addition, you can check the data for errors such as primary key violations, null violations, and date/time violations (or you can defer quality checks to the transformation process). Also, after the extraction, update the metadata information needed to manage the next extraction.
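The following is one possible shape for the status metadata described in the list above; all object and column names are illustrative, not prescribed by the guide. One row tracks each source table's current extraction run, including the row counts used for verification.

```sql
-- Hypothetical extraction-status table kept on the staging server.
CREATE TABLE dbo.ExtractionStatus (
    SourceTable sysname     NOT NULL PRIMARY KEY,
    DeltaStart  datetime    NOT NULL,
    DeltaEnd    datetime    NOT NULL,
    StatusCode  varchar(12) NOT NULL,  -- initialized / in progress / completed / failed
    SourceRows  int         NULL,      -- row count reported by the source query
    StagedRows  int         NULL,      -- row count actually landed in staging
    LastUpdated datetime    NOT NULL DEFAULT GETDATE()
);

-- After a load, compare the counts before marking the run completed.
UPDATE dbo.ExtractionStatus
SET StatusCode  = CASE WHEN SourceRows = StagedRows
                       THEN 'completed' ELSE 'failed' END,
    LastUpdated = GETDATE()
WHERE SourceTable = 'dbo.SalesOrders'
  AND StatusCode  = 'in progress';
```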
A pull method extraction may be significantly more complex than suggested in the above guidelines. For example, if you extract data from multiple sources asynchronously, the extract from one server may succeed while the one from another fails, which would require special attention to the recovery process. Likewise, if extractions from different servers happen at different times (for example, one at 10:00 PM and one at 6:00 AM), the extraction process needs to track each extraction separately.
Guidelines for Using Push Mode
Following are guidelines for extracting data by using the push method:
- Schedule the extraction process at regular intervals, or when certain triggers are fired, such as when certain data values change.
- Store the metadata required to perform the extraction process on the source system (or in a neutral location); a sketch of such a metadata table follows this list.
- Choose whether to store the extracted information temporarily in the local source system or to send it directly to the staging server.
- Maintain metadata status information for recovery purposes, as when using the pull method.
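A minimal sketch of such source-side metadata, with illustrative names: the scheduled push job would read LastPushed as its delta start and advance it only after the transfer to staging succeeds.

```sql
-- Hypothetical push metadata kept on the source system.
CREATE TABLE dbo.PushMetadata (
    SourceTable sysname     NOT NULL PRIMARY KEY,
    LastPushed  datetime    NOT NULL,
    StatusCode  varchar(12) NOT NULL  -- initialized / in progress / completed / failed
);
```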
Scenarios for Using Hybrid Mode
Scenario 1: Hybrid extraction – Database triggers logging to audit tables
- Add triggers to the source system to log each transaction to alternating audit tables, using codes for inserts, updates, and deletions (a sketch of such a trigger appears after this list).
- In the staging server extraction process, read the metadata to determine which audit table currently receives the logged records, and then switch the logging to the other table. For example, if the trigger is logging records to A1, change the trigger to log to A2.
- In the extraction process, extract the data from the previous audit table to the staging server and, if successful, truncate the table.
- In the extraction process, update the metadata to indicate which audit table is currently receiving the logged records. This prepares for the next execution.
- In a subsequent process, load the information into the data warehouse.
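Here is a minimal sketch of the audit trigger from the first step; dbo.Orders and the audit tables OrdersAudit_A1/OrdersAudit_A2 are hypothetical names. To switch tables, the extraction process would re-issue ALTER TRIGGER with the other audit table name before draining and truncating the previous one.

```sql
-- Hypothetical audit trigger logging each transaction with an
-- operation code: 'I' = insert, 'U' = update, 'D' = delete.
CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserts and updates: affected rows appear in the "inserted" table.
    INSERT INTO dbo.OrdersAudit_A1 (OrderID, OperationCode, LoggedAt)
    SELECT i.OrderID,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           GETDATE()
    FROM inserted AS i;

    -- Deletes: affected rows appear only in the "deleted" table.
    INSERT INTO dbo.OrdersAudit_A1 (OrderID, OperationCode, LoggedAt)
    SELECT d.OrderID, 'D', GETDATE()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;
```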
Scenario 2: Hybrid extraction – Application logging to flat file
In this scenario, the application (for example, an ASP.NET web application logging click-stream data) logs records to a flat file.
Here, the data is not extracted from the source database but written redundantly by the source application. This scenario can be adopted when the source application collects a large volume of data from the user interface (such as click-stream data). The application creates and switches the logs as configured, such as one log file for each day; in that case, the source application creates a new file at the end of each day.
Use a time stamp (e.g., 200608070030.txt) or a sequential number (e.g., ABC00043.txt) as the file name. The extraction process on the staging server then reads the files, loads the data into the staging database, and then deletes the source files; a sketch of the load step follows. The process can be extended to extract data from multiple servers.
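As a sketch of the staging-side load, assuming a comma-delimited log file and a hypothetical staging table and file path:

```sql
-- Hypothetical load of one day's log file into the staging database.
-- The extraction process would loop over all unprocessed files, load
-- each one, and delete the source file only after a successful load.
BULK INSERT Staging.dbo.ClickStreamLog
FROM 'D:\Logs\200608070030.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 1);
```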
You do have to modify the source application to write to the log file for each transaction, in addition to the regular updating of the source database. Writing two entries does add overhead to the source application, but the cost should be minimal, because the source application already knows everything necessary about the transaction, including the type of the change.
Source: http://biarchanddesignguide.codeplex.com/releases/view/6572