2 Nov · 4 min read
Staging external relational data is time-consuming, but automation can make it far easier. Here are the steps to overcome the problems of manual staging and build a data warehouse quickly.
External transactional databases act as the primary data sources for building a data warehouse, but nobody wants to set up staging for dozens of operational tables by hand; it is a hectic thing to do.
Since staging is usually a straight copy performed late in DWH development, people do not want additional transformations when external databases are transferred into staging. SSIS complicates this, however: its data-flow metadata is fixed at design time, so any change in runtime metadata breaks the package, and it will not work without precise mappings for each table.
This article presents an automated version of this process, which saves considerable time.
The main aim is to avoid doing the same work twice, and existing database metadata such as the schema and unique indexes helps achieve this. The big gap in the vanilla SSIS toolbox, support for dynamic SSIS metadata, is filled by dynamic SQL together with CozyRoc's SSIS+ component suite. Dynamic SQL creates the staging tables and prepares the statements, while SSIS+ supplies support for dynamic runtime metadata.
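As a minimal sketch of the dynamic-SQL idea, the snippet below builds a `CREATE TABLE` statement for a staging table from source metadata. The function name, the `stg` schema, and the column list are assumptions for illustration; in practice the column names and types would come from the source catalog views.

```python
# Hypothetical sketch: generate staging DDL from (column name, SQL type)
# pairs harvested from the source database's metadata.
def build_staging_ddl(table, columns, schema="stg"):
    cols = ",\n  ".join(f"[{name}] {sqltype}" for name, sqltype in columns)
    return (
        f"IF OBJECT_ID('{schema}.{table}') IS NULL\n"
        f"CREATE TABLE [{schema}].[{table}] (\n  {cols}\n);"
    )

ddl = build_staging_ddl("customer", [("id", "INT"), ("name", "NVARCHAR(100)")])
print(ddl)
```

Generating the DDL from metadata is what removes the need to hand-craft each staging table.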
This could also be done with an SSIS Script task and a C# developer, but a license for something already developed is cheaper.
As a proof of concept, a solution was created that transfers the databases from several Informix servers to a single SQL Server used as staging. Although the databases share the same schema, each contains its own operational data.
Analyzing the source data and planning is the first step. During analysis it becomes apparent that the tables do not all have the same structure.
To support this, we implemented multiple data-load variations, three of which are:
Full table load: fetches the whole table from the source and overwrites the staged data.
Incremental load: brings only new data; it does not check existing rows for updates or deletions.
Lookup load: used for tables with a composite primary key; the critical columns are used to sort and compare the two data sets. Larger tables need RowVersion columns.
Apart from the tables themselves, some supporting procedures were also needed.
We also add meta columns to help with data tracking; ServerName and RowStatus are the most significant ones.
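The meta columns can simply be appended to the harvested column list before the staging DDL is generated. The exact types, the status codes, and the helper name below are assumptions for illustration, not the article's actual definitions.

```python
# Sketch: append tracking meta columns to a source column list.
# ServerName records which source server a row came from; RowStatus
# flags the row's state (codes here are assumed: N=new, U=updated, D=deleted).
META_COLUMNS = [
    ("ServerName", "NVARCHAR(128) NOT NULL"),
    ("RowStatus", "CHAR(1) NOT NULL DEFAULT 'N'"),
]

def with_meta_columns(columns):
    return list(columns) + META_COLUMNS

cols = with_meta_columns([("id", "INT")])
```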
The prepared staging tables then need to be populated with data.
Each data-load package has two parallel tasks: one iterates over the server list and the other over the table list.
Twenty-four processes ran in parallel, with four different servers providing the data.
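The fan-out described above can be sketched as follows: every (server, table) pair becomes one load task, capped at 24 parallel workers. The server names and the `load_table` placeholder are hypothetical; the real work would be the SSIS transfer.

```python
# Sketch of the parallel fan-out across servers and tables.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def load_table(server, table):
    # Placeholder for the actual transfer from source to staging.
    return f"{server}.{table} loaded"

def load_all(servers, tables, max_workers=24):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: load_table(*pair), product(servers, tables)))

results = load_all(["ifx1", "ifx2"], ["customer", "orders"])
```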
Some tables are small enough to be loaded in full each time, and some tables validly have no incremental key at all; the full table load handles both cases.
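A full table load reduces to two generated statements: truncate the staging table, then re-select everything from the source. The helper and table names below are illustrative assumptions.

```python
# Sketch: statement pair for a full table load. The SELECT runs against
# the source server; the TRUNCATE runs against the staging database.
def full_load_statements(table, schema="stg"):
    return (
        f"TRUNCATE TABLE [{schema}].[{table}];",
        f"SELECT * FROM {table};",
    )

truncate_sql, select_sql = full_load_statements("currency")
```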
The incremental load is used for tables with a primary or surrogate incremental key, but it will not recognize updated or deleted rows. To hand off from one run to the next, the latest key value is stored and updated.
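The hand-off can be sketched as a simple watermark: select rows above the last stored key, then advance the stored key for the next run. A plain dict stands in for whatever persistent store the real solution uses; all names are assumptions.

```python
# Sketch of incremental-load watermarking. The watermark store would be
# a control table in practice; a dict stands in for it here.
watermarks = {}

def incremental_select(table, key_column):
    last = watermarks.get(table, 0)
    return f"SELECT * FROM {table} WHERE {key_column} > {last};"

def advance_watermark(table, max_key_seen):
    # Record the highest key loaded so the next run resumes after it.
    watermarks[table] = max_key_seen

first_sql = incremental_select("orders", "order_id")
advance_watermark("orders", 42)
next_sql = incremental_select("orders", "order_id")
```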
The lookup load tracks any data change: it compares the two data sets, source and stage, and routes rows to the relevant outputs. Based on the metadata, it prepares the SELECT statements for the New and Old sets.
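The comparison at the heart of the lookup load can be sketched like this: rows are keyed by their (composite) primary key, and comparing source against stage yields the new, updated, and deleted outputs. The function and variable names are illustrative.

```python
# Sketch: classify rows by comparing source and stage, both keyed by the
# composite primary key, with the remaining columns as the compared value.
def compare(source_rows, stage_rows):
    new = [k for k in source_rows if k not in stage_rows]
    deleted = [k for k in stage_rows if k not in source_rows]
    updated = [k for k in source_rows
               if k in stage_rows and source_rows[k] != stage_rows[k]]
    return new, updated, deleted

src = {(1, "A"): "x", (2, "B"): "y"}
stg = {(2, "B"): "z", (3, "C"): "w"}
new_keys, updated_keys, deleted_keys = compare(src, stg)
```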
A lot of work remains beyond staging external relational data, but hopefully this article has shown an alternative way to handle a considerable part of it.