SSIS for loading External Database into Staging

02 November4 min read
SSIS for loading External Database into Staging

The time-consuming task of staging external relational data can be made more accessible with automation. Here are the steps to overcome the problems of manual staging and make building a data warehouse quick. 

Introduction

External transactional databases act as the primary data sources for building a data warehouse. Nobody wants to set up additional staging for many operating tables, which is a hectic thing to do. 

Since it is usually performed in the last stages of DWH development, people do not want more transformations when external databases are transferred into staging. But it is not the same for SSIS as there will be changes in runtime metadata, and it will not work without precise mappings for each table. 

This article presents the automation of this process which will save more time. 

The main aim is to avoid doing twice the same work, and existing database metadata such as database schema and unique indexes helps achieve this. A big problem for the vanilla SSIS toolbox, providing support for dynamic SSIS metadata, will be solved by Components Suite such as dynamic SQL and CozyRoc's SSIS+. 

Creation of staging tables and preparation of statements are done by Dynamic SQL, whereas SSIS+ will support dynamic runtime metadata. 

This can also be done using an SSIS Script task and a C# developer, but the license will be cheaper for something developed already. 

Extraction of Database Schema and Staging Creation

A solution was created, which can transfer the databases approximately from several Informix servers to a single SQL Server which will be used as staging, for providing proof for the concept. Although databases contain the same schema, each has its operational data.  

Analyzing the source data and planning is the foremost step. It will be apparent during analysis that the tables do not have the same structure. 

We implemented multiple data load variations to support this, and the three of them are: 

Full table load: it overwrites the data by fetching the whole table from the source. 

Incremental load: it only brings new data and will not look for the existing rows for updation or deletion. 

Lookup load: used for tables with a composite primary key and critical columns sort the data and compare them. Larger tables need RowVersion columns. 

To store the metadata, we need specific tables. They are:

  • Servers table: list of servers in the process
  • Tables table: list of tables in the process.
  • Columns: list of table columns per server.
  • Incremental tables: it is a subset table of Tables, which tracks the last loaded value and total key name.
  • TableLoadStatus: tracks load status of each table and operation row count.
  • SSISErrorLog: it logs any errors that are thrown by SSIS tasks.

Apart from tables, some procedures were needed. They are: 

  • meta.csp_FetchServersForLoad: creates list of servers in the load
  • meta.csp_CreateMissingObjects: for creating schemas and tables based on server metadata.
  • meta.csp_CreateMissingColumns: for a given server name, it creates new or missing columns.

We also add meta columns to help with data tracking. ServerName and RowStatus are the most significant ones. 

Data loading into staging tables

The prepared staging tables are needed to be populated with data. 

Each data load package will have two parallel tasks; one is for the server list and the other for the table list. 

Twenty-four processes were running in parallel, and 4 different servers provided data. 

Full Table Load

Small tables could be loaded in full each time, and some tables do not have any incremental key, which is valid. Full table load is for such ones. 

Incremental Table Load

It is used for tables that use the primary or surrogate incremental key. But it will not recognize the updated or deleted rows. 

To switch from one process to another, the latest key value will be updated. 

Lookup Table Load

It tracks any data changes, compares two data sets, source, and stage them, and sends to essential outputs. Based on the metadata, it prepares select statements from New and Old. 

Last words

Apart from staging external relational data, there exists a lot of works. But as far as this article is concerned, Hope it helped in providing an alternative way for a considerable part of the work. 

0 comments