31 Days of SSIS – Generating Row Numbers (23/31)

It’s Sunday afternoon and a good time to be putting up post twenty-three for the 31 Days of SSIS blog series.  A list of the other posts in this series can be found in the introductory post.  If you haven’t had a chance to read through that, I would recommend taking a look at it after reading this post.

Package Requirements

For today, we’re going to review a solution I devised for a problem I came across a while back.  For this problem, I had to generate row numbers for rows in a file.   My first inclination had been to use the Aggregate transformation, but it does aggregations which this problem wasn’t requiring.

What I needed here was to add a row number column to the data set.  For each account number in the data set, the row number would start at one and increment for all rows that matched to account number.  This needed to be done to data being received in a flat file and also prior to when it was loaded to the SQL Server database.

Row Number Solution

Solving this issue would actually be fairly simple in SQL Server.  All that would be needed is to add the ROW_NUMBER() function to the query.  Then partition the data on the the account number and all is said and done.  Unfortunately, things don’t work out that simply with SSIS.

As I mentioned, the Aggregation transformation would not work for this problem.  It could count the number of rows or average some values.  Also, no other transformations had similar functionality that could be used directly to solve this problem.

Fortunately, there is an indirect way to solve this problem.  I’ve talked about the flexibility of the Script Component transformation a couple times before.  Once again, that flexibility will come in handy.  By using that transformation along with the Sort transformation the requirements can be accomplished.

To start with, we need to sort the data that needs the row numbers.  It should be sort along the groups that would be the partitions with using the ROW_NUMBER() function.  The sort also needs to include the columns for the sorting for the order of the row numbers.  Next, a Script Component transformation will be used to add the row numbers.  By comparing each group with the other group values, the row number value can be determined.

Since we have the requirements and the solution description, lets take a look at what the package for this would look:

Package Control Flow

ssis-139There isn’t a lot to this SSIS package from the control flow perspective.  There will be a single Data Flow task and no others tasks and no variables.  The Data Flow task is:

  • DF_RowNumbers: Data flow task that contains the Sort and Script Component transformations.

Package Data Flows

Package Data Flow

As mentioned above, there is only one Data Flow task in the SSIS Package.  Since all the work happens there, let’s get on to it.

DF_RowNumbers

ssis-140In this data flow, we’ll be bringing in some data from the AdventureWorks database.  The data will be sorted and then sent into two Script Component transformations.  The first will create a row number across all rows.  The second will create a set of row numbers for each e-mail address in the data being processed through the SSIS package.

To accomplish all of this, the data flow has the following transformations:

  • DS_OLE_SalesOrderHeader: OLE DB Data Source to bring data into the data flow.  When I encountered this issue in the “wild”, it was with flat files but this data source will provide the same effect of bringing in data.
  • SRT_EmailAddressOrderDate: Sort transformation to sort all of the rows in the data flow by EmailAddress then OrderDate.
  • SCR_RowNumberAll: Script Component transformation that adds row numbers to each row beginning at the start of the data set through to the end.  The row numbers are sequential across the whole set.
  • SCR_RowNumberByEmail: Script Component transformation that adds row numbers to each row.  The first row for each EmailAddress group starts a new sequential set of row numbers.
  • RSAMP_Demo: Row Sampling transformation that was added solely to provide a Data Flow path to put a Data Viewer on.

Script Component Transformation Setup

The most important areas to configure for this SSIS package are the Sort transformations and the Script Component transformations.  Setting up the Sort transformation should be fairly easy.  The Script Component transformations will take a little bit more work.

SCR_RowNumberAll

ssis-141To start configuring the Script Component transformation, you will first need to add an output column to the Inputs and Outputs tab.  In this example, the column is named RowNumberAll and it is a 4-byte integer.

Now launch the script editor from the script tab and replace the contents of ScriptMain : UserComponent with the following code.

int rowNumber = 1;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
Row.RowNumberAll = rowNumber;
rowNumber = rowNumber + 1;
}

ssis-142This code will keep track of the state of the row numbers through the variable rowNumber.  It will also increment the value after each row is processed.

With that, the transformation is configured.  If executed, a data viewer on the data flow path will provide results similar to those in the image  to the right.

SCR_RowNumberByEmail

ssis-143Next, let’s look at the Script Component transformation that will be used to add row numbers across each partition of EmailAddress.  The same output column as above will be added to this transformation.  This time the column will be named RowNumberByEmail.  Then add in the column EmailAddress as one of the Input Columns.

As before, launch the script editor from the script tab.  The contents of of ScriptMain : UserComponent should be replaced with the code below.

string emailAddress = "";
int rowNumber = 0;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (emailAddress == Row.EmailAddress)
{
rowNumber = rowNumber + 1;
}
else
{
rowNumber = 1;
}

emailAddress = Row.EmailAddress;
Row.RowNumberByEmail = rowNumber;
}

ssis-144The code in this transformation is similar to the previous Script Component transformation.  Except in the previous transformation it incremented the row number for each row, where now it will check to see if the EmailAddress has changed before setting the row number.  If the EmailAddress has changed then the row number is reset to 1.  Otherwise the row number is incremented by 1.

Go ahead and execute the SSIS package again and the data viewer results will look like those in the image above

Row Number Wrap-up

It doesn’t take much to implement this logic in your SSIS package.  All that is needed is a couple transformations and a little outside of the box thinking.  The key to remember is that if there isn’t a specific transformation for your business needs, there is probably a way to put it together with the Script Component transformation.

Why would someone need to add a row number to a data set in an SSIS package?  The main reason I would consider it would be to add an “identity” value to a row.  This can allow rows that have no identifying aspects to be sliced and diced into various transformations and still retain a value that can reliably tie the information back together.  Also, if the identifying information for a row spans multiple columns, this single, narrower column can assist in improving performance.

With a couple of tweaks, this solution can also be varied from just adding a row numbers to solving other requirements.  One such that comes to mind immediately would be the add running totals to a data flow.  Instead of incrementing the row numbers add the values from the current row to the previous rows.

As always, the package for this post is below.  If you have any questions, feel free to leave a comment.

Package Download

23of31-RowNumbers

23 thoughts on “31 Days of SSIS – Generating Row Numbers (23/31)

  1. Excellent post, I just solved a problem using this and I don’t know how to spell C#. Worked exactly as I needed it to!

    Like

  2. This is great!! I’m still wondering why there are only three comments? Thank you! The logic/flow just makes sense, much easier than writing complex code

    Like

    1. Thanks for stopping by and leaving a comment. Not sure why only three comments, these SSIS posts are some of my most read content, so I know people are reading them.

      Like

  3. It’s interesting that you add a column to the OUTPUT (in the designer), and then in the script, you populate the column value in the INPUT buffer.

    This behavior is only possible if the script component is synchronous. If you change the “SynchronousInputId” property of the output to say “None”, then I believe that you get an asynchronous script component. Also, you can’t populate the RowNumberAll member of the Input0Buffer object (it doesn’t exist).

    I don’t know much more about this behavior, or whether it’s good or bad; maybe you could comment on that.

    It also seems like a weird thing to do, adding a column to the output and then populating that same column in the input. Your text description doesn’t mention that distinction.

    I think I read somewhere that SSIS sets up all of the buffers (at least, all of the synchronous ones) to contain all of the columns ahead of time.

    The sample asynchronous script component that I saw online explicitly creates the output row using

    //Add an output row
    Output0Buffer.AddRow();

    and then you have to populate the items in the output buffer. Again, I’m not sure what the differences are, in practice, defining a component to be sync or async. Do you have any comments?

    Thanks.

    Like

Comments are closed.