Being tasked with data profiling responsibilities during an initial project start-up can seem like a daunting task. With the use of the Data Profiling task in Integration Services within SQL Server 2008, the job becomes a bit more streamlined. This is a great tool to use in data discovery – it’s fast, it’s fairly intuitive, and it provides a tremendous first look into the system you may need to profile. Results of SSIS’s Data Profiler task can either be directed to a variable, or an xml file. For this example, we are looking at the xml file result, and the tool that will help us view those results – the Data Profile Viewer.Data Profile Task
First, we need to work with the Data Profiling Task;
Drag to the Control Flow tab, and simply double-click to enter the properties window.
Depending on how you are going to use the task, your output can be sent to a variable or an xml file. In our case, we will choose an xml file in order to view the results through the Data Profile Viewer. Select ‘new file connection’ in the Destination drop-down and create an xml destination. Once this is complete we can then choose what features we want to use by clicking on the ‘Quick Profile…’ button.
Select your connection – I’m connecting to the Production.Product table within the AdventureWorks Database, and choose which options you would like the profiler to compute;
- Column Null Ratio Profile
- Column Statistics Profile
- Column Value Distribution Profile
- Column Length Distribution Profile
- Column Pattern Profile
- Candidate Key Profile
- Functional Dependency Profile
Once your selections are made, close out of the properties windows and run the task. Now we are ready to view the results.
Start -> All Programs -> Microsoft SQL Server 2008 -> Integration Services -> Data Profile ViewerData Profile Viewer
The Data Profile Viewer is a simple tool that provides a great level of detail for profiling purposes. It is straight-forward and very intuitive. Once opening the viewer, open the xml file created in the above Data Profile task. The results of the profile are now available for us to walk through.Column Null Ratio
Avoiding unintentional null columns in any database is important, but particularly in a warehouse environment. Joins to tables with nulls can return unexpected results, and hinder a well-designed system. The Null Ratio results give a great view into the percentage and total counts by column of what is populated, and what has null values.
As the name implies, the column statistics results display;
- Standard Deviation
This piece of the Data Profiler puzzle will give detail about each distinct value, the column’s domain, and the percentage of total.
If creating or validating source to target mappings – this is a great first step to begin to work with.Column Length Distribution
The length distribution provides a Min and Max length for each column in the table. In addition, the percentage of total is also given.
The column pattern results will give insight into any patterns found within the data. When viewing the results of the pattern profile, you may not fully understand what the pattern is trying to portray, there is a drill-down button which will take you down to row-level detail.
Candidate Key Profiles
The Candidate Key Profile give insight into potential columns that can be used as keys. In our example, there are 3 such columns; Name, ProductID, and ProductNumber. If there were columns that were suggested that had less than 100% uniqueness, the violations would be listed as well.
Functional Dependency will illustrate which columns have a direct dependency with any other column. A good example of this would be State and State Abbreviation. The state abbreviation would be directly affected by the state description. Using the example below, ProductSubcategoryID is identified to have a dependency on the Style column. Looking closely at the Style column, you can see that 28 records support the dependency that the profile identified, while 4 records were in violation.
Record level detail can be seen at the bottom of the screen:
The 28 support rows, and the 4 violation rows are detailed.
There is on more profile that is useful to cover that can't be accessed through the 'Quick Profile...' button. The Value Inclusion Profile will check one column in a table against a corresponding column in another table. The value of this profile is being able to determine if a column may be used as a foreign key to a particular table. For this example, we'll still be looking at the AdventureWorks Database, but we'll be comparing the SalesPersonID in the Sales.Store table versus the Sales.SalesPerson table. In essence, we are determining if the domain in one column is inclusive of the domain of the second column.
To set this up, we have to navigate to the Profile Requests page of the Data Profiling properties window (second down on the left column). You will notice that all of the profiling requests that we made with the 'Quick Profile...' button will be listed. Scroll down to the bottom of the list and click on the next blank row - select 'Value Inclusion Profile Request.' This will allow you to alter the variables.
Under ConnectionManager, select the AdventureWorkds Database. Follow the criteria above to select the two tables, and the columns in each that you are comparing. After running the task, the Value Inclusion Profile will be included in the Data Profile Viewer.
The Value Inclusion is at the bottom of the left screen, and you can see by the 100% Inclusion strength above that all of the SalesPersonID values in the Sales.Store table are found in the Sales.SalesPerson table.
This is a broad overview of the task and the types of profiling that can be done. To become familiar with the tool, the best course of action is always hands-on!