Problem

A couple of days ago, my client sent historical data of sales to me through a zip file. I tried to unzip it and loaded with SSIS package. However, the zip file contains a lot of sub folders and files in and a sales file is also zipped within .gz extension. It requires me to traverse list of sub folders for unzipping and loading data of sales into a staging table, the folders are organized as below

Sales_Data(parent folder)

|___20170101 (sub folder)

|        |__20170101_Sales.txt.gz

|___20170102 (sub folder)

|        |__20170102_Sales.txt.gz

|___…….

|___Customer.txt

|___Product.txt

 

Format file of Sales data:

Date             |     CustomerCode         |      ProductCode   |   SalesAmount

Format file of Customer data

CustomerCode       |         FirstName         |           LastName    |      PhoneNumber         |    Address

Format file of Product

Product Code       | ProductName | ProductSubCategory | ProductCategory

 

In this article, I would like to introduce how to traverse list of sub folders and unzip file using GzStream library in .NET.

 

Implementation

Create a SSIS project

Create staging table: StgSales, StgCustomer and StgProduct

CREATE TABLE StgSales
(
Date int,
CustomerCode nvarchar(50),
ProductCode nvarchar(50),
SalesAmount float
)
GO
CREATE TABLE StgCustomer
(
CustomerCode nvarchar(50),
FirstName nvarchar(50),
LastName nvarchar(50),
PhoneNumber nvarchar(15),
Address nvarchar(100)
)
GO
CREATE TABLE StgProduct
(
ProductCode nvarchar(50),
ProductName nvarchar(100),
ProductSubCategory nvarchar(50),
ProductCategory nvarchar(50)
)
GO

Create a new package named as Load_SalesData_Folder.dtsx

Create an OLEDBC connection to staging database

Create three Flat File Connections for Sales file, Customer file and Product file

Create FileConnectionString package variable and configure ConnectionString property of three flat file connections to this variable

Drag and drop Execute SQL Task component to truncate the staging tables

TRUNCATE TABLE StgSales;
TRUNCATE TABLE StgCustomer;
TRUNCATE TABLE StgProduct;

Drag and drop Foreach Loop Container task, it uses to traverse the folder and it’s sub folders and if it finds the file, then assign path of file to the FileConnectionString variable created above

Connection tab

Enumerator: Foreach File Enumerator

Folder: D:\Projects\SSIS_Projects\Data\Sales_Data (path of parent folder

Files: *.* ==> This wildcard allows us to find all files without caring any extension file

Check on Traverse subfolders ==> This option is the most important because it allows to traverse the child folders from the parent folder

Variable Mappings

User::FileConnectionString => Index = 0

Drag and drop Scrip Task, we need this component because the folder consists of three kind of files Sales, Customer and Product so we need to detect what file is Sales data, what is Customer and then load data into appropriate tables. Besides, the Sales files are zipped, we need to unzip them before importing into database.

Create a new package variable named as FileType

Double click on Script Task to open Scrip Task Editor

ReadOnlyVariables: User::FileConnectionString

ReadWriteVariable: User::FileType

Click on Edit Script

using System.IO;
using System.IO.Compression;

public void Main()
{
// TODO: Add your code here
string filefullpath = Dts.Variables[“User::FileConnectionString”].Value.ToString();
string filename = Path.GetFileName(filefullpath);
string filetype = “”;
if(filename.ToUpper().Contains(“SALES”))
{
filetype = “SALES”;
}
else if(filename.ToUpper().Contains(“CUSTOMER”))
{
filetype = “CUSTOMER”;
}
else if (filename.ToUpper().Contains(“PRODUCT”))
{
filetype = “PRODUCT”;
}
// Becuse Sales files are zipped so we need to unzip them in case file type is SALES
if(Path.GetExtension(filename)==”.gz” && filetype ==”SALES”)
{
FileInfo f = new FileInfo(filefullpath);
// Call Decompress function to decompress file
Decompress(f);

}
Dts.Variables[“User::FileType”].Value = filetype;
Dts.TaskResult = (int)ScriptResults.Success;
}
public void Decompress(FileInfo fileToDecompress)
{
// The function use GzStream library
// Refer examaple https://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream(v=vs.110).aspx
using (FileStream originalFileStream = fileToDecompress.OpenRead())
{
string currentFileName = fileToDecompress.FullName;
string newFileName = currentFileName.Remove(currentFileName.Length – fileToDecompress.Extension.Length);

using (FileStream decompressedFileStream = File.Create(newFileName))
{
using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
{
decompressionStream.CopyTo(decompressedFileStream);
//Console.WriteLine(“Decompressed: {0}”, fileToDecompress.Name);
}
}
// Assign the path of decompressed file to Dts.Variables[“User::FileConnectionString”].Value.ToString()
Dts.Variables[“User::FileConnectionString”].Value = newFileName;
}
}

 

Drag and drop three Data Flow Task into Foreach Loop Container and name as Load Sales, Load Customer and Load Product

Connect Script Task and Load Sales data flow, double click on arrow to open Precedence Constraint Editor

Evaluation operation: Expression

Expression: @[User::FileType]==”SALES” ==> this express is to check if the file is SALES then load data into StgSales table

Check on Logical OR…

Repeat the above step for Customer and Product data flow

 

Open each data flow and configure to load data into staging tables

Finally, the package looks like that

Testing

Query StgSales, StgCustomer and StgProduct and make sure there is no data

Execute package

Verify data

Conclusion

In this article, I describe the simple approach how to traverse sub folders and unzip file before loading them into database. The keywords are to use GzStream library and you need to check Traverse subfolders options on Foreach Loop Container.

Obviously, there are a lot of approaches that can help you load data through sub folders. You can use Script Task to get all sub folders and then use Foreach Loop Container on each folder to read it’s files. This approach is very flexible because you can execute recursively, depending on your situation and you can choose right approach.

Please contact me for example files and source code.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s