Contents

Loading a Data Warehouse Dimension with RabbitMQ and SSIS

In this post I’m going to take a break from the actual development of custom components for SSIS and show how the RabbitMQ Source that I have created so far could be used in an ETL package to load a data warehouse.

The Scenario

The scenario for this post is a company that sells cool stuff over the internet, they have a website where people can signup and registered users can place orders.

The way this imaginary system works:

  • When a user signs up, their details are created in the operational database which is used by the website. After creation their details are then sent to RabbitMQ
  • When a user updates their details, they are updated in the operational database, this information is also sent to RabbitMQ
  • When a user places an order, the details are sent to two different RabbitMQ queues for processing:
    • An imaginary process writes these details into the operational database in “real-time”
    • An ETL process runs periodically to load the orders into the data warehouse

In this post I’m going to show how the customer dimension can be loaded from the RabbitMQ queue, in a future post I will show how to load the fact table via SSIS.

My ETL package will do the following:

  • Read from a queue of customers, parse the xml and populate attribute columns
  • Using the SCD (slowly changing dimension) component to populate the customer dimension

Preparing the solution

The only preparation I did before starting was to create a new database with a new table DimCustomer

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
CREATE TABLE [dbo].[DimCustomer](
    [CustomerKey] [int] IDENTITY(1,1) NOT NULL,
    [CustomerBusinessKey] [int] NOT NULL,
    [FirstName] [nvarchar](50) NOT NULL,
    [Surname] [nvarchar](50) NOT NULL,
    [Gender] [char](1) NOT NULL,
    [City] [nvarchar](50) NOT NULL,
    [Country] [nvarchar](50) NOT NULL,
    [MaritalStatus] [char](1) NOT NULL,
    [StartDate] [datetime] NULL,
    [EndDate] [datetime] NULL,
 CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED
(
    [CustomerKey] ASC
)
) ON [PRIMARY]

Creating the ETL Package

The ETL Package is going to have a number of different steps that it goes through.

  1. Retrieve customer messages from queue
  2. Parse XML
  3. Load customers into the dimension, accounting for SCD attributes

The end result ended up looking like this:

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-1.png

Building the package is actually very straightforward. After creating a new solution and adding a new package, I added a new data flow task.

To the data flow task, I first added the RabbitMQ Source and configured it with a new RabbitMQ Connection manager to my local broker, I also told it to use the customer queue.

Next I added a script component, this is where the xml message from RabbitMQ will be parsed.

The script component is the key to making the messages in RabbitMQ understood in the SSIS package. What the script does is  it takes in each of the messages and then updates rows in a new output buffer. This output buffer contains columns corresponding to the elements in the xml message.

The XML messages that are published look like this:

1
2
3
4
5
6
7
8
9
<customer>
  <id>1</id>
  <firstname>Peter</firstname>
  <surname>Parker</surname>
  <gender>M</gender>
  <maritalstatus>S</maritalstatus>
  <city>New York</city>
  <country>USA</country>
</customer>

To configure the Script component, first add a new output then add each of the fields from the xml above.

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-21.png

Next edit the script itself and updated the Input0_ProcessInputRow method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
  var stringReader = new System.IO.StringReader(Row.MessageContents);
  var xmlDoc = new XmlDocument();
  xmlDoc.Load(stringReader);
 
  Row.CustomerKey = int.Parse(xmlDoc.GetElementsByTagName("id")[0].InnerText);
  Row.FirstName = xmlDoc.GetElementsByTagName("firstname")[0].InnerText;
  Row.Gender = xmlDoc.GetElementsByTagName("gender")[0].InnerText;
  Row.Surname = xmlDoc.GetElementsByTagName("surname")[0].InnerText;
  Row.City = xmlDoc.GetElementsByTagName("city")[0].InnerText;
  Row.Country = xmlDoc.GetElementsByTagName("country")[0].InnerText;
  Row.MaritalStatus = xmlDoc.GetElementsByTagName("maritalstatus")[0].InnerText;
}

That is it for the script component, the next step is to drag in the Slowly Changing Dimension component. After connecting the script component to the slowly changing dimension component, double clicking will bring up the Slowly Changing Dimension Wizard.

The first page of the wizard is where you configure the connection and table of the dimension, after doing that the wizard will attempt to match the different columns in the table to columns from the input into the component. The next step on this page is to select which column is the business key.

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-3.png

The next page is where the SCD types are set for the columns. Historical is SCD Type 2, Changing is SCD Type 1 and Fixed is SCD Type 0.

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-4.png

In the next page you can configure how SSIS should deal with different scenarios, below you can see that I’ve opted not to fail the package if fixed attributes change, and ticked the second option.

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-5.png

The next page is used to configure how Type 2 dimensions should be stored in the dimension table, so that they can be recognized.

The wizard gives us two options:

  • Use a status column, the first wizard option, whereby a column can be selected as the indicator column then two values can be selected which correspond to the current or expired flag
  • Use start/end date columns, the second wizard option, for this option two datetime columns need to be selected and then set what value to use for the date value

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-6.png

The next page gives the option to enable interred member support, outside the scope of this article. I have left this turned off.

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-7.png

After clicking Finish, the wizard will close and the remaining elements, that you can see in the picture of the complete package above, will be added.

Running the package

To demonstrate the package, I’m going to using the RabbitMQ management interface put some messages into the queue and then run the package.

The first message I’m placing into the queue is this:

1
2
3
4
5
6
7
8
9
<customer>
  <id>1</id>
  <firstname>Peter</firstname>
  <surname>Parker</surname>
  <gender>M</gender>
  <maritalstatus>S</maritalstatus>
  <city>New York</city>
  <country>USA</country>
</customer>

This is the xml sample from above. After running the package I can query the database and get the following output

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-demo-1.png

The next two messages I’m putting in are:

1
2
3
4
5
6
7
8
9
<customer>
  <id>1</id>
  <firstname>Spider</firstname>
  <surname>Man</surname>
  <gender>M</gender>
  <maritalstatus>S</maritalstatus>
  <city>Los Angeles</city>
  <country>USA</country>
</customer>

and

1
2
3
4
5
6
7
8
9
<customer>
<id>2</id>
  <firstname>Clarke</firstname>
  <surname>Kent</surname>
  <gender>M</gender>
  <maritalstatus>S</maritalstatus>
  <city>Kansas</city>
  <country>USA</country>
</customer>

The first is an update to Peter Parker, the second is a new customer. After running the package the output from the table is now this:

/loading-a-data-warehouse-dimension-with-rabbitmq-and-ssis/images/etl-demo-2.png

You can see that the package has identified that Peter Parker is an existing customer, so expires the old dimension and adds a new one, it also adds a new row for Clarke Kent.

Summary

As you can see using the RabbitMQ source itself is very easy, and thanks to some of the pre-built components in SSIS, building a package that loads dimensions into a data warehouse is very easy.

I’m going to be taking a break from blogging about Custom SSIS Components and start a new series on building a real-time monitoring system with StreamInsight.

🍪 I use Disqus for comments

Because Disqus requires cookies this site doesn't automatically load comments.

I don't mind about cookies - Show me the comments from now on (and set a cookie to remember my preference)