Data Warehouse Implementation & Use



index
Disabled back button Next Section
printable version

Section 0: Module Objectives or Competencies
Course Objective or Competency Module Objectives or Competency
The student will be able to list and explain basic data warehouse concepts. The student will be able to describe the data warehouse implementation steps, such as creating a data warehouse, the ETL process, developing data warehouse front-end applications, and data warehouse deployment.
The student will be able to explain how data warehouses are used.
The student will be able to explain online analytical processing (OLAP).
Section 1: Overview

Creating a data warehouse involves using the functionalities of database management software to implement the data warehouse model as a collection of physically created and mutually connected database tables.

Creating a Data Warehouse

Assume we have the model for the data warehouse shown below:

The CREATE TABLE statements for the model are shown below:

The created data warehouse tables are populated by the ETL infrastructure.

Section 2: ETL: Extraction, Transformation, Load

ETL (extraction, transformation, and load) involves a set of processes used to populate the data warehouse tables with the appropriate relevant data retrieved from the operational databases.

Here is an overview titled What is ETL?

In this example, the tables will be populated with the data from the following three data sources:

The ETL infrastructure will extract the data from the three sources, transform it, and load it into the data warehouse.


Extraction

Extraction refers to retrieving data from the operational data sources so that it can eventually be loaded into the data warehouse.

All of the data from Sources 1 and 3 will be extracted, but only some of the data from Source 2 will be extracted.


Transformation

Transformation involves converting the structure of extracted data to conform to the structure of the target data warehouse model.

In the example, the transformation process follows these steps:

  • The records from the Source 1 CUSTOMER table in the Sales Department database are merged with the records from the Source 3 CUSTOMER table in the Customer Demographic Data external table.
    • In other words, data about customers involved in sales transactions is merged with customers' demographic data.
  • The surrogate key is added to each record, and data is transformed in such a way to fit the structure of the dimension CUSTOMER in the target data warehouse.

The uniqueness data quality problem arises because two tables from Source 1 contain records with duplicated data.

  • The last displayed record in the SALESTRANSACTION table and the last three records in the INCLUDES table are a result of the uniqueness data quality problem.
  • The data entry process in Source 1 incorrectly recorded the same transaction twice, using two different TID values (T999 and T1000).
  • Both recorded transactions involve the same customer, store, and products and occurred on the same day at the exact same time.
  • The data cleansing action that took place during the transformation process identified this uniqueness data quality problem and solved it by excluding the records with the TID value of T1000 from the set of extracted data.
  • The non-unique data was filtered out, and did not end up in the SALES fact table in the data warehouse.

The process of transformation may involve standardizing different versions of the same data present in different data sources.

In this example, both the Source 1 and Source 2 databases have overlapping information about stores (Store ID and Store Zip).

  • Fortunately, this information is recorded in the exact same fashion in Sources 1 and 2.
  • However, Source 1 and Source 3 contain an instance of overlapping information about customers that is not recorded in the same fashion.
  • In particular, the customer whose ID is 3-4-555 in table CUSTOMER in Source 1 has her name recorded as Pam. That same customer in table CUSTOMER in Source 3 has her name recorded as Pammy.
  • The transformation process has been coded in such a way that it recognizes that these are both variations of the same name belonging to the same person and chooses one of them to be used in the dimension CUSTOMER.

Another type of activity that is part of the transformation process is deciding on and implementing slowly changing dimensions policies, for columns that are subject to changes.

There are two major kind of transformations: active transformations and passive transformations.


Load

Load involves storing the extracted, transformed, and quality-assured data into the target data warehouse.


ETL Infrastructure

The ETL processes are facilitated by the ETL infrastructure.


In the example, the result of the ETL process is the populated data warehouse as shown below.

Section 3: Online Analytical Processing (OLAP)

Online analytical processing (OLAP) refers to querying and presenting data from data warehouses for analytical purposes.

The correlation of OLAP in operational databases is online transaction processing (OLTP).

  • OLTP refers to updating (i.e., inserting, modifying, and deleting), querying, and presenting data from databases for operational purposes.
Comparison

What is OLAP?

Section 4: OLAP/BI Tools

BI Tools Overview (link no longer available on YouTube)



The data in a data warehouse can be accessed directly by the users via SQL or, more commonly, via OLAP tools, also known as business intelligence (BI) tools.

Section 5: OLAP/BI Tools Interface

There are numerous OLAP/BI tools (by vendors such as Tableau, Splunk, Domo, IBM Cognos, Apache Kylin, SAP BusinessObject, Oracle BI, or Qlik) available in the marketplace today, and the following is an overview of functionalities that are common for all OLAP/BI tools.

An OLAP/BI tool interface, with a layout similar to the figure below, might be used to analyze various facts across all attributes of all dimensions in the data warehouse.

The user could specify any of the particular queries analyzing DollarsSold and/or UnitsSold across the dimensions by simply choosing the attributes needed for the query via the drag-and-drop method.

In the following scenario, an OLAP/BI tool will be used to analyze the DollarsSold and UnitsSold sales facts across all attributes of all dimensions in the data warehouse to address the following query.

  • OLAP Query I: For each individual store, show separately for male and female shoppers the number of product units sold for each product category.

To address OLAP Query 1,

  • The user drags StoreID and Gender to the vertical axis, and Category to the horizontal axis, and then
  • drags the attribute UnitsSold (from the SALES fact table) into the fact area of the query construction space.
  • This is illustrated below:

Once the query is constructed, and then executed, results like the following will be shown on screen.



When a query has displayed the results, the user has the option to perform any of the three basic OLAP operations: slice and dice, pivot (rotate), and drill-down or roll-up, as discussed in the next section.



BI Dashboard Demo





Additional OLAP/BI Tools Functionality Note

The dimensional model is essential for OLAP.

Section 6: OLAP/BI Tools Operations

Three basic OLAP/BI tool features regularly used by analysts include:

Here are some links that provide greater detail about the above operations:



Slice and Dice

Recall that the slice and dice operation adds, replaces, or eliminates specified dimension attributes (or particular values of the dimension attributes) from the already displayed result.

The query could be modified so that only the results for stores 1 and 2 are displayed, that is, the results showing sales in store 3 will be "sliced out" from the original query.

The modified query is specified as follows:

  • OLAP Query 2: For stores 1 and 2, show separately for male and female shoppers the number of product units sold for each product category.

The next figure shows the result of OLAP Query 2, created by performing a slice and dice operation on the OLAP Query 1.


In addition to eliminating dimension attributes or selected values of dimension attributes, a slice and dice operation can replace or add dimension attributes.

This can be illustrated by modifying the OLAP Query 1 by replacing the ProductCategory attribute from the PRODUCT dimension with the DayType attribute from the CALENDAR dimension.

The wording of the modified query is as follows:

  • OLAP Query 3: For each individual store, show separately for male and female shoppers the number of product units sold on workdays and on weekends/holidays.

Here is the result of OLAP Query 3, created by performing another slice and dice operation on OLAP Query 1.



Pivot (Rotate)

Unlike the slice and dice operation, the pivot (rotate) operation does not change the values displayed in the original query, it simply reorganizes them.

The following example of a pivot operation uses OLAP Query 1 as a starting point and then simply swaps the axes for the ProductCategory attribute and the StoreID attribute.

The result of this pivot action is shown below:

Because the pivot action does not change the values shown to the user, the wording for the query is not modified.

In the original query, ProductCategory was placed on the horizontal axis and CustomerGender was placed on the vertical axis, but in this version, the pivoting action was performed and ProductCategory was rotated onto the vertical axis, whereas CustomerGender was moved to the horizontal axis.

The page axis can be used in pivot operations to view the table per single value of the attribute placed in the page axis.

This figure below shows the result of pivoting the CustomerGender attribute from the horizontal axis to the page axis, using OLAP Query 1.

In this case, the table shows the sales in different stores per product category for Female shoppers only.

Users can easily switch from one value of the attribute (Female, as shown above) to another value of the attribute (Male, as shown below) by using a point-and-click menu mechanism that usually accompanies the page axis feature in a typical OLAP/BI tool.



Drill-Down and Roll-Up

The purpose of the drill-down operation is to make the granularity of the data in the query result finer; in the case of roll-up, it is to make it coarser.

The drill-down operation allows users to drill through hierarchies within dimensions.

A hierarchy is a set of attributes within a dimension where an attribute is related to one or more attributes at a lower level but only related to one item at a higher level.

The following example uses the OLAP Query 1 result as a starting point and then drills down in the PRODUCT dimension from ProductCategory to ProductName.

The wording of the query is expanded in the following way:

  • OLAP Query 4: For each individual store, show separately for male and female shoppers the number of product units sold for each individual product in each product category.

The following figure shows the result of OLAP Query 4, created by performing a drill-down operation on the OLAP Query 1 result.

Conversely, to illustrate the roll-up operation, consider the recent query results as starting query and the OLAP Query 1 result as the product of a roll-up operation from ProductName to ProductCategory.



Video Supplements

OLAP Tools

OLAP Operations

OLAP Operations

Multidimensional Analysis

Section 7: Front-End Applications

OLAP/BI Tools have two purposes:

Data Warehouse Front-End (BI) Applications provide access for indirect use.

Section 8: Data Warehouse Deployment

Data warehouse deployment refers to allowing the end user access to the newly created and populated data warehouse and to its front-end applications.

Like most other information systems, the data warehousing system often goes through a testing phase before it is fully deployed to all of its users, with the feedback from the testing phases leading to modifications to the system prior to the actual deployment of a functioning data warehousing system.

Once the data warehousing system is deployed, the end user can start using it for the tasks that require data from the data warehouse.

Section 9: Resources

Granularity is the relative size, scale, or level of detail that characterizes an object or activity.

When granularity is applied to information technology, it refers to the level of detail (fineness) in a model, requirements, data, code modularity, or even a decision making process.

Greater (or finer) granularity corresponds to an increase in the level of detail (fineness of data).

The term "ad hoc" generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be adapted to other purposes.

Business intelligence (BI) is a process for gathering usable knowledge about the external business environment and turning it into the intelligence required for tactical or strategic decisions.