Monday, August 31, 2015

Business Intelligence Concept Framework



This text provides a common framework for architects and developers. The framework is conceptual and technology agnostic, and it covers the major phases, features, and functionality required to implement a BI solution effectively.

Data Source

The data in a BI system almost always originates in many different storage systems. Extracting data from those sources and merging it into a single, consistent dataset is a daunting challenge. As the number of data source elements increases, so does the complexity of the overall system.

When working with diverse data sources, consider the following issues:
  • Different source environments with different systems, including different platforms and operating systems: Heterogeneous environments tend to be silos of information because of differences in business processes and applications; each application has its own database, business rules, and schema. Homogeneous environments, where both the source and destination use similar technologies and versions, are much easier to deal with. For example, a homogeneous environment may use SQL Server 2005 on both the online transaction processing (OLTP) and online analytical processing (OLAP) systems.
  • Different database systems: Source data may originate in many different types of database system, including Oracle, DB2, SQL Server, or others. The components that read the source data are usually referred to as data adapters. Connecting to a wide variety of data sources means choosing the right source adapter to facilitate the connection and retrieval of data. Consider the type of host systems and applications: the data could be read from a standard database (SAP, DB2, Oracle, SQL Server), from a flat file such as an IIS log or a queue, or via a third-party broker or Web service. When connecting to a data source system, apply appropriate security precautions such as authentication, authorization, and secure communication. (A minimal connection sketch appears after this list.)
  • Different schemas, data formats, and naming conventions: Legacy transaction systems may use EBCDIC string formats, or the source systems may use VSAM files, which are not relational in nature. Table and column names from an enterprise application such as SAP may be difficult to understand.
  • Geographically separated source locations: If the data warehouse collects data from geographically dispersed locations, consider the implications for extraction timing and network bandwidth in order to address latency challenges.
  • Source system ownership: If the same team, product, or department owns both the OLTP and DW systems, it is relatively easy to work out issues over ownership. Ownership pertains to both access permission and data quality. Getting permission to read data from transaction system databases can be a complex political problem, and it is critical to understand trust boundaries so that appropriate quality gates are in place. If control over the source system is restricted, take the necessary precautions to handle its limitations or variations.
  • Impact on the source system: Retrieving large quantities of data from transactional databases can have a negative impact on operational applications. Consider techniques that minimize the use of source resources while still extracting all of the required information in a timely manner, and take care to extract only the data that is required for business analysis.
  • Source data volatility: The source data can be volatile as a result of streaming data, transactional data, periodic snapshot data, stand-alone data, and replicated data. Consider an appropriate window for data extraction, since source systems change continuously with operational or batch transactions.
  • Source data volume: Data volume is a major consideration for the extraction operation, since it affects both processing time and resource utilization. Consider extraction techniques that reduce the data volume on an ongoing basis while still meeting the business requirements.
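
The sketch below illustrates the adapter idea in a hedged way: one connection object per source system, and a query that pulls only the columns and rows the warehouse needs. It uses Python with SQLAlchemy purely as an example stack; the connection strings, table name, and column names are assumptions, not part of the original guidance.

    # A minimal sketch of reading from two different source systems through
    # separate adapters while keeping the pull narrow. Connection strings,
    # table names, and columns are hypothetical.
    import sqlalchemy as sa

    # One engine per source system; the dialect/driver acts as the source adapter.
    sql_server = sa.create_engine(
        "mssql+pyodbc://reader@erp-host/ERP?driver=ODBC+Driver+17+for+SQL+Server")
    oracle = sa.create_engine("oracle+cx_oracle://reader:secret@crm-host:1521/CRM")

    def extract_orders(since):
        """Pull only the columns and rows needed, to limit impact on the source."""
        query = sa.text(
            "SELECT OrderID, CustomerID, OrderDate, Amount "
            "FROM dbo.Orders WHERE OrderDate >= :since")
        with sql_server.connect() as conn:
            return conn.execute(query, {"since": since}).fetchall()

The same pattern applies whatever adapter technology is used: connect with a low-privilege reader account and push filtering down to the source so that only the required data crosses the wire.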

Data Integration

The following issues need to be considered while integrating data from various sources:
  • Data profiling: Once you have access to a data source, studying the underlying data, its dependencies, and its rules is a complex task. The ability to profile and analyze the source data to identify potential issues such as anomalies, outliers, dependency violations, and redundant or orphaned data is commonly referred to as data profiling. The three commonly cited aspects of data profiling are column analysis, dependency analysis, and redundancy analysis. Column or attribute analysis evaluates the distribution range, completeness, uniqueness, format, type, size, and frequency of data. Dependency or referential analysis looks for relationships, integrity, and business rule dependencies. Finally, redundancy analysis, as the name suggests, is the technique for identifying duplicate data as well as orphan records. (A small column-analysis sketch appears after this list.)
  • Data extraction: Once the source data has been studied, extracting meaningful data is the next big challenge in data integration. Extracting data from the source requires connecting to heterogeneous or homogeneous data sources using a source adapter. During extraction, the following issues need to be addressed: What source format are you dealing with? What is the frequency of the extraction? What is the load? What if there is no record tracking for the transactions on the source system? How do you select and extract only the data that has changed since your last extraction? How do you minimize the load on the source system during extraction? (A watermark-based extraction sketch appears after this list.)
  • Data staging: Staging is a location where data is temporarily stored before it is loaded into the destination warehouse. Staging may not be necessary for simple cleansing or transformation operations; however, it may be needed for other reasons. For example, you may want to avoid overhead on the source system, so cleansing and transformation checks are performed locally on the staging server; there may be procedural reasons causing a time lag between data extraction and loading; or you may need to perform complex transformations that require access to multiple sources, lookups, or fact tables.
  • Data transformation: Once you have connected to the source system, identified the relevant data, and extracted it, the data is ready to be transformed. During the transformation process you can perform various actions on the data depending on the scenario: sort, split, merge, look up, handle slowly changing dimensions, audit, pivot, or aggregate. This process often goes hand in hand with data cleansing. Transformation can be done at various stages depending on the type of operation and the data load; you can choose to do it at the source, before staging, or before loading the data into the destination warehouse database.
  • Data cleansing: Data cleansing ensures that inconsistent and invalid data is cleaned before it is loaded into the data warehouse. During this process, common data quality problems such as absent values, inconsistent values, duplicate values, primary key reuse, and violations of business rules are detected and corrected. Correcting data quality issues at the source is ideal; however, in most cases it is not possible, for example because you have no ownership or influence over the source system, or because the source system has dependencies that increase the complexity.
  • Data loading: Populating the data warehouse is the last step of the Extract, Transform, and Load (ETL) process. If you are populating the warehouse for the first time, you will load the historical data, followed by new transactional data on a periodic basis. For loading large volumes of transaction data into fact tables, consider issues such as loading data during off-peak usage, loading data into temporary tables, creating indexes on those temporary tables that mirror the fact table indexes and merging the temporary table back into the fact table as a partition, or dropping indexes on the fact table before loading the data.
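
As a concrete illustration of column analysis, the following sketch profiles completeness, uniqueness, and the most frequent value per column. It uses pandas purely as an example; the toy DataFrame and column names are made up.

    # A small column-analysis sketch; the frame stands in for any extracted table.
    import pandas as pd

    def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
        """Summarize completeness, uniqueness, and top value for each column."""
        rows = []
        for col in df.columns:
            s = df[col]
            rows.append({
                "column": col,
                "dtype": str(s.dtype),
                "non_null_pct": round(100 * s.notna().mean(), 1),  # completeness
                "distinct": s.nunique(dropna=True),                # uniqueness
                "top_value": s.mode(dropna=True).iloc[0] if s.notna().any() else None,
            })
        return pd.DataFrame(rows)

    orders = pd.DataFrame({"OrderID": [1, 2, 2, 4], "Region": ["EU", None, "EU", "US"]})
    print(profile_columns(orders))

Dependency and redundancy analysis follow the same pattern but compare columns across tables rather than summarizing one column at a time.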
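
For the "changed since last extraction" question, one common answer is a watermark: persist the highest modification timestamp loaded so far and ask the source only for rows after it. The sketch below assumes a LastModified column on the source table and a local watermark file, both of which are illustrative assumptions rather than part of the original guidance.

    # A hedged sketch of watermark-based incremental extraction.
    from datetime import datetime
    import sqlalchemy as sa

    engine = sa.create_engine(
        "mssql+pyodbc://reader@erp-host/ERP?driver=ODBC+Driver+17+for+SQL+Server")

    def read_watermark(path="orders.watermark"):
        try:
            with open(path) as f:
                return datetime.fromisoformat(f.read().strip())
        except FileNotFoundError:
            return datetime.min  # first run: full historical load

    def extract_changed_rows():
        since = read_watermark()
        stmt = sa.text("SELECT * FROM dbo.Orders WHERE LastModified > :since")
        with engine.connect() as conn:
            rows = conn.execute(stmt, {"since": since}).fetchall()
        # Persist the new watermark only after the downstream load succeeds.
        return rows

If the source offers change data capture or replication, those mechanisms can replace the watermark query and further reduce the load on the operational system.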

Data Storage

Data Storage is the end result of the Data Source and Data Integration layers and typically takes the form of a Data Warehouse (DW) or Data Mart (DM). The Data Analysis and Data Presentation layers leverage the information held in Data Storage. The following issues need to be considered regarding Data Storage:
  • Dimensional modeling: Strategies for effectively organizing data are critical to implementing BI/DW systems. The technique used to model the logical data warehouse is commonly referred to as dimensional modeling, and its design guidelines differ from those for transaction systems. As a designer you have to choose appropriate schema types, such as a star or snowflake schema, design fact tables with measures relevant to the business and at the appropriate level of granularity, and address attributes that change over time by picking appropriate types of slowly changing dimensions (SCDs). (A simplified SCD sketch appears after this list.)
  • Partitions: Warehouse tables typically contain millions of rows. Dividing large tables and their indexes into multiple segments and assigning them to filegroups is called partitioning. As a designer you have to create a scalable partition design that enables the best possible performance: choose an appropriate partitioning strategy, choose optimal partition functions, place partitions appropriately in filegroups, use index alignment, and plan for data management, such as moving new data into a partition and moving aging data out of it.
  • Indexes: Designing an appropriate indexing strategy, taking into account factors such as usage patterns, schema design, column types, and storage needs, is important for efficient operation.
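
To make the slowly changing dimension point concrete, here is a deliberately simplified Type 2 update sketch: changed customer rows are expired and re-inserted as new current versions. Real warehouses usually do this in the ETL tool or in set-based SQL; the pandas layout, the surrogate columns (IsCurrent, ValidFrom, ValidTo), and the single tracked attribute (City) are assumptions made for illustration.

    # Simplified SCD Type 2 handling; brand-new customers are omitted for brevity.
    import pandas as pd

    def apply_scd_type2(dim: pd.DataFrame, incoming: pd.DataFrame, today: str) -> pd.DataFrame:
        current = dim[dim["IsCurrent"]]
        merged = current.merge(incoming, on="CustomerID", suffixes=("_old", "_new"))
        changed = merged.loc[merged["City_old"] != merged["City_new"], "CustomerID"]

        # Expire the old versions of changed customers.
        mask = dim["CustomerID"].isin(changed) & dim["IsCurrent"]
        dim.loc[mask, "IsCurrent"] = False
        dim.loc[mask, "ValidTo"] = today

        # Append the new versions as current rows.
        new_rows = incoming[incoming["CustomerID"].isin(changed)].copy()
        new_rows["ValidFrom"] = today
        new_rows["ValidTo"] = None
        new_rows["IsCurrent"] = True
        return pd.concat([dim, new_rows], ignore_index=True)

Type 1 (overwrite) and Type 3 (previous-value column) variants change only the expire/append step; the choice depends on how much history the business needs to keep.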

Data Analysis

The data analysis layer covers managing, summarizing, querying, and analyzing massive quantities of data. The following issues need to be considered regarding Data Analysis:
  • OLAP: In addition to providing data storage, an OLAP engine such as SQL Server Analysis Services is designed for the analysis of business measures; it is optimized for bulk loads and delivers superior performance for business intelligence and large, complex queries. Data is organized and preprocessed into multidimensional cubes based on a dimensional model, which enables you to rapidly summarize information for analytical queries. (A toy aggregation sketch appears after this list.)
  • Data Mining: You can use sophisticated data mining algorithms to analyze the data and expose interesting information that is useful to decision makers. You can create complex models, browse and query them, perform predictions against those models, and test each model's accuracy. Choosing an appropriate data mining algorithm for a specific business problem requires testing various algorithms, separately or together, and exploring the results. (A small mining sketch appears after this list.)
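
As a toy illustration of the kind of pre-summarization an OLAP engine performs, the following sketch rolls a tiny fact table up across two dimensions. A real cube (for example, in Analysis Services) precomputes and stores such aggregations at scale; the fact data below is invented.

    # Summarize the Amount measure across the Region and Year dimensions.
    import pandas as pd

    fact_sales = pd.DataFrame({
        "Year":    [2014, 2014, 2015, 2015, 2015],
        "Region":  ["EU", "US", "EU", "US", "US"],
        "Product": ["A", "A", "B", "A", "B"],
        "Amount":  [100, 250, 175, 300, 125],
    })

    cube = pd.pivot_table(fact_sales, values="Amount", index="Region",
                          columns="Year", aggfunc="sum", margins=True)
    print(cube)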
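
The mining workflow described above, build a model, make predictions, and test accuracy, can be sketched as follows. scikit-learn and its synthetic dataset are used only to illustrate the cycle; they are not part of the original guidance, and a real project would train on warehouse data and compare several algorithms.

    # Build a model, make predictions, and test accuracy on held-out data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier(max_depth=4, random_state=0)  # build the model
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)                          # predict
    print("accuracy:", accuracy_score(y_test, predictions))      # test accuracy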

Data Presentation

Presenting meaningful and visually appealing information in interesting ways is critical to helping analysts, managers, and leaders make informed decisions. 

The following critical factors need to be considered for a data presentation system:
  • Navigation. The presentation layer provides easy access to complex query results in a rich user interface that facilitates navigation. Users can interactively explore the data through drilling, pivoting, and drag-and-drop capabilities. Visualizing and navigating through data to analyze root causes helps organizations gain deeper insight into which business drivers matter the most.
  • Format. Depending on the type of information available and the message you want to deliver, it is important to choose an appropriate format. The right format, such as a chart, graph, table, report, dashboard, or Key Performance Indicator (KPI), depends on usage patterns such as trends, behaviors, comparisons, correlation, change, classifications, or facts. (A minimal chart sketch appears after this list.)
  • Host. Choosing an appropriate client depends on the business needs. If the business demands access to information at any time, anywhere, and through any device, the design needs to accommodate such a requirement. Given the broad spectrum of client devices available in the market today – desktop PCs, Tablet PCs, and handheld mobile devices such as Pocket PCs and Smartphones – your client presentation can be a thin client, smart client, or mobile client. However, each type of client has its advantages and disadvantages. When designing your application, you will need to carefully consider the specifics of your situation before you can determine which is appropriate. 
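
As one small example of the formats listed above, the following sketch renders a trend chart with a KPI target line using matplotlib. The figures are invented; the same output could equally be a report, a dashboard tile, or a gauge depending on the audience.

    # Render a revenue trend against a KPI target and save it for a report page.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    revenue = [410, 445, 430, 480, 510, 530]
    target = 475

    fig, ax = plt.subplots()
    ax.plot(months, revenue, marker="o", label="Revenue")
    ax.axhline(target, linestyle="--", color="red", label="KPI target")
    ax.set_title("Monthly revenue vs. target")
    ax.set_ylabel("Revenue (k$)")
    ax.legend()
    fig.savefig("revenue_kpi.png")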

General Concerns

Regardless of which layer of the data flow you deal with, the following issues must be constantly addressed:

  • Metadata
  • Security
  • Performance
  • Operations


Source: http://biarchanddesignguide.codeplex.com/releases/view/6572 

