Data Warehouse Design Basics

Data Warehousing Concepts

Data warehouse concepts are set of useful information to create a centralized database system used to record information from various areas within the business process. Data warehousing, on the other hand, is another concept under data warehousing wherein it involves the method creating and storing information that a certain company has gathered and will be utilized for future purposes.

 

After recognizing the existence of the industry’s need for a more advance and technology-based approaches in managing corporate information, many multinational corporation decided to embrace the concept of data warehousing as a useful tool in their company’s success. Date warehouses are created to help business across the world in different aspect of data management from collection, handling, storing, and all the way to retrieval. The massive innovation in data warehousing also helps to fill the necessity for subject- oriented concepts especially since Integration is linked to subject orientation closely. Data warehouses actually works by integrating technology on how the end-user can manage their internal data. The data is from different sources which will be brought to reliable and consistent structure. Among the problems and issues that must be resolved with the introduction of this technology are the problems in naming and classifying data with conflicting and inconsistencies in nature.

 

Since data warehousing is focused on specific and detailed information when storing them, it is important that the primary source of all the information that are set to be stored are imprinted cleaned, and cataloged before one can say that the entire data warehousing system is considered a success. The company that has installed this type of system in their business can use the information stored for future used and in archival purposes. The system was also proven effective in processing online analytics, running some market research, and most especially, in policy making of the company, among others.

 

Data warehouse concepts were also proven effective in compiling statistical data of the company, analyzing facts, extracting past corporate records, loading previous items, managing and even storing the recorded dictionary. Data Warehousing system also referred to as a business intelligence tool, in which it can be used in removing date and information or loading them and in managing or retrieving metadata.

 

All enterprise or organizations for that matter need to direct the data collected that are already present in the company to become useful and consistent with the objective of the data warehousing system which is to have a strong and consistent compiled information. The facts could turn into answers and also solutions that will support the entire company particularly on decision-making. The system is a must in the company’s recording and analysis section that must be prioritized.

Data Warehouse Architecture

Different data warehousing systems have different structures .The simplest structure can be defined as follow.

Data Sources

For any organization DWH contains data from different sources. It could be any flat file, database or any data dump generated through any software. All the data from these data sources are put into an organized way into the DWH by the ETL

ETL

You must load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To perform this operation, data from one or more operational systems must be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. The entire process, including data loading, is referred to as ETL.

DWH

This is where all the data of an organization will reside. There is no specific process to follow to organize and store data into the DWH. It varies based on the organization’s need. The simplest and most efficient way to store data into the following structure.

Raw data

In the DWH all the data from different data sources are kept into the DWH as it is (from the source). There is no cleansing or transformation done in between. Many of the organization does avoid this whereas many of the organization want to keep all these data for safety purpose.

Storing this kind of data in the DWH will certainly consume more disk space. Unless any organization have extremely big storage, should avoid storing these raw data.

Summary Data

Summary tables store data that is aggregated and/or summarized for performance reasons (i.e., to improve the performance of business queries). Most business queries (i.e., approximately 80%) will run against summary tables.

Data is aggregated by combining multiple concepts together and/or combining large amounts of detailed data together. Most business queries analyze a summarization or aggregation of data (i.e., facts) across one or more dimensions. Therefore, a summary table may use multiple dimensions. For example, a table that analyzes accounts by region by customer by service by month uses four dimensions.

Design Considerations

The main objective when designing summary tables is to minimize the amount of data being accessed and the number of tables being joined. This is done by storing intermediate query results, such as:

  • summaries of large amounts of data (e.g., summing product inventory by quarter),
  • combinations of multiple concepts (e.g., sales by customer by market),
  • Reference data (e.g., product description).

Identify What to Aggregate

Examine highly used business queries and problem business queries (i.e., queries that are slow or consume a lot of resources) and identify aggregation requirements. Define:

  • Fact data to be summarized. For example, a fact table has the following three dimensions: Service, Geographical Location, and Time. There are multiple queries that aggregate the facts by month. The summary table created to meet this requirement has the following dimensions: Service, Geographical Location, and Month. The summary table time dimension contains a month (i.e., MM) instead of a date (i.e., YYMMDD). This summary table reduces the number of rows to be read and the length of the rows.
  • Attributes (i.e., metrics) in the fact tables that should be aggregated. Examine problem queries and identify the attributes that are aggregated by the queries. The aggregation will most probably be across multiple dimensions, and multiple attributes from the same fact table will also probably have to be aggregated.
  • Related facts to be aggregated into the same summary table. Examine problem business queries and for each query identify the facts (i.e., from different fact tables) that are aggregated by the query. Aggregating multiple facts into the same summary table improves performance by replacing multiple queries or a complex query containing multiple unions with a simple query.

Identify How Much to Aggregate

For each summary table, select the degree of aggregation required. Once facts are aggregated to a certain level of detail more detail is not available within that summary table.
To ensure greater flexibility, a rule of thumb is to aggregate to one level of detail greater than what is required and aggregate up a level when the queries run. However, this approach should not be used if the number of rows to be aggregated (when the queries are run) will be large.

Select the Level of De-normalization

Summary tables are recreated on a regular basis, therefore including dimension data in summary tables is not an issue. To limit the number of table joins (i.e., summary table to dimension tables), a rule of thumb is to use real world keys in summary tables and not use generated keys. This approach should not be used if the summarization level of the summary table is low (i.e., the summary table contains a lot of rows).
Another rule of thumb that minimizes joins is to always store physical dates in summary tables.

Design Indexes

To maximize the performance (and use) of summary tables, the rule of thumb is to index all access paths to a summary table.

Moving Data

Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date data values. Furthermore, they generally maintain this information for no more than a year (and often much less). In contrast, data warehouses contain data that is generally loaded from the operational databases daily, weekly, or monthly which is then typically maintained for a period of 3 to 10 years. This is a major difference between the two types of environments.

Master Data

Master Data is conceptualized in the core entities of the enterprise. These entities represent the slowly changing part of the business’ data and usually fall into four categories:

  • Parties: represents all parties the enterprise conducts business with such as customers, prospects, individuals, suppliers, partners, etc.
  • Places: represents the physical places and their segmentations such as geographies, locations, subsidiaries, sites, areas, zones, etc.
  • Things: usually represents what the enterprise actually sells such as products, services, packages, items, financial services, etc.
  • Financial and Organizational: represents all roll-up hierarchies used in many places for reporting and accounting purposes such as organization structures, sales territories, chart of accounts, cost centers, business units, profit centers, price lists, etc.

Transactional Data such as purchase orders, invoices or financial statements, is not usually considered master data since it actually registers a “fact” that happened at a certain point in time. Transactional Data is really what drives the business indicators of the enterprise and it relies entirely on Master Data.

Analytics & Reporting

The main reasons we store data in a data warehousing systems are so they can be:

  • Reported against
  • Cleaned up
  • Transported to another data store where they can be reported against and/or cleaned up.

This is the part where end business user will be benefited from the DWH. Since the data already centralized and organized into the DWH, user can use any BI tool to generate their own report. This will help the business user/decision maker to change/upgrade/enhance the existing process.
DWH should be designed in a way which can cater any (or most) of the business query requested by the Business users.

[feather_share size="24" show="twitter, google_plus, facebook, linkedin, mail " hide="reddit, pinterest, tumblr"]

We love to hear from you