Friday, March 25, 2011

Handling extra/missing columns in Delimited Flat file

Delimited Flat files (CSV) are one of the important non-relational source for SSIS ETL but if we get robust file all the time then we can use inbuild flat file source in SSIS. Surprisingly we are not that much lucky and get the data in which columns are missing (or we can say extra columns) like as follow...
Problem Reference(Click here) 

So handling such a file is very problematic for a SSIS ETL developer.Here I'm giving you some work around for such cases

  1. SSIS Delimited File Source available at Codeplex  (Click here)
  2. Script Component--- Check Sudeep's Blog (Click here)
  3. Flat file source which reads all data into one column and then splits them based on index of the comma in a derived column.
Here I'm giving you the demonstration to handle such file with the help of derived column.We have to write expression (sometime complex) to get the proper data in column format. In the above example I am getting all the data as String and after splitting we are saving it as string only (If you want to change the data type then you can use Data Conversion task). If number of columns are more then this approach is not recommended, In that case you should go with the first 2 approach.

Step1: Inside the Data Flow Task (DFT), Select a Flat File Source and Derived column.

Step2: Configure the FlatFile Source in such a way that you will get all input data in single column.

Note:  We are keeping the Column name as "C" in the example.

Step3: Double click the derived column and use the following logic.

  • For separating the column "C1" from the column "C", we need to fetch the data from position 1 to first occurrence of delimiter " " so finding the first delimiter we have to use FINDSTRING SSIS function (Returns the location of the specified occurrence of a string within a character expression).
  • SUBSTRING(C,1,FINDSTRING(C,",",1) - 1)
  • Here we have to do "-1" because we have to exclude delimiter from the column value.
  • For separating the column "C2" from the column "C", we have to fetch the data after the first delimiter to 2nd delimiter but before that we have to check whether we have 2nd delimiter in place

  • FINDSTRING(C,",",2) != 0 ? SUBSTRING(C,FINDSTRING(C,",",1) + 1,FINDSTRING(C,",",2) - FINDSTRING(C,",",1) - 1) : SUBSTRING(C,FINDSTRING(C,",",1) + 1,LEN(C) - FINDSTRING(C,",",2))  
  • For separating the column "C3" from the column "C"
  • FINDSTRING(C,",",2) == 0 ? "NULL" : (FINDSTRING(C,",",3) != 0 ? SUBSTRING(C,FINDSTRING(C,",",2) + 1,FINDSTRING(C,",",3) - FINDSTRING(C,",",2) - 1) : "NULL")

  • For separating the Last column "C4" from the column "C",we have to fetch the data after the third delimiter to end of the row. and for calculating the last position of the row, we have to use LEN SSIS function
  • FINDSTRING(C,",",3) == 0 ? "NULL" : SUBSTRING(C,FINDSTRING(C,",",3) + 1,LEN(C) - FINDSTRING(C,",",2))
Step4: And result would be as expected..

Note: Let me know your valuable comments to make this post more robust and helpful..