Data Source Integration Fundamentals – ETL Basics

Introduction

ETL = Extract – Transform – Load

  • Extract
    – Get the data from source system as efficiently as
  • Transform
    – Perform calculations on data
  • Load
    – Load the data in the target storage

Extraction of Data

During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system’s capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.

Transportation & Loading of Data

After data is extracted, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too.

Why is ETL (System) Important?

  • Adds value to data
    ▪ Removes mistakes and corrects data
    ▪ Documented measures of confidence in data
    ▪ Captures the flow of transactional data
    ▪ Adjusts data from multiple sources to be used together (conforming)
    ▪ Structures data to be usable by BI tools
    ▪ Enables subsequent business / analytical data

ETL Disambiguation

  • ETL = Extract – Transform – Load
    ▪ Not tight specifically to DW anymore
  • Process/System
    ▪ A complete process including
    – Data extraction
    – Enforcing DQ and consistency standards
    – Conforming data from disparate systems
    – Delivering data to target
    – People, HW, Documentation, Support, etc.
  • Tool
    ▪ A piece of software implementing the
    – Three (four) E-(C)-T-L steps.
    – A tool designed specifically to perform data transformations

ETL Process

ETL Data Integration

ETL is a bridge for bi-directional flow.

It can work in either direction. It does not necessarily extract data from source only and load into destination only after transform. The vice versa is true as well.

ETL Data Integration Solution Step

Data Migration

Process of transferring data between storage types or formats. An automated migration frees up human resources from tedious tasks. Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs.

Data Consolidation

Usually associated with moving data from remote locations to a central location or combining data due to an acquisition or merger

Data Integration

Process of combining data residing at different sources and providing a unified view. Emerges in both commercial and scientific fields and is focus of extensive theoretical work. Also referred to as Enterprise Information Integration.

twittergoogle_plusFacebooklinkedinmail

We love to hear from you