Last November (2012) Microsoft introduced the second version of its massively parallel processing (MPP) architecture called Parallel Data Warehouse. There are some distinct differences between version 1 and version 2. Two of the biggest differences are:
1) PDW ships installed on a Dell appliance.
2) The inclusion of Polybase.
PDW V2 is built using Dell’s Power Edge 12th generation servers. Some of the advantages that are included in this construction are:
- Lower price point (backup and landing zone servers not longer required)
- Increased CPU power (16 vs V1′s 12 cores per compute node)
- Increased Memory (256 gb vs V1′s 96 gb per server)
- Increased storage/scalability (support for up to 5 Petabytes of disk space vs V1′s 610Tb)
- Infiniband Data Movement Speeds: FDR (14x bandwidth) vs V1′s QDR (4x bandwidth)
- Microsoft V2 enhancements include “direct attached” SAS JBOD storage, Hyper V, Mirroring, and Server 2012 Storage Spaces
Putting aside the advantages related to the hardware, this article aims to focus on Polybase which has quickly become a big part of Microsoft’s Big Data strategy.
Polybase is Microsoft’s entry to the cross platform query tool which enables users to write T-SQL code to simultaneously query relational and unstructured data. By allowing users to query in the familiar T-SQL code, Polybase eliminates the need to learn a new query language (MapReduce) and allows for the combination of the different data stores at the query level. The cross platform querying allows you to leave Hadoop data in its unstructured state and thus eliminates the need to populate a Data Warehouse with Hadoop data.
All Hadoop data has some inherent structure and Polybase analyzes this structure and brings the schema of the HDFS cluster into the metadata of the relational database which allows the Hadoop/unstructured data to be treated as a local table. There are a few preliminary steps that need to be performed, such as registering the Hadoop cluster (you can register many clusters as this is not a one-to-one relationship) and setting the file format for the cluster, however these are one time workloads that when completed enable you to create what is essentially a view over the HDFS data which is accessible to the relational query engine.
Polybase introduces the CTAS functionality, ‘Create Table as Select’, and when used in conjunction with the EXTERNAL keyword when querying unstructured data, allows the relational engine to retrieve the data from the nodes in the Hadoop cluster and store it as a temp and/or permanent table in the relational database. Not only can you do simple selects, you will be able to perform JOIN’s and Group By’s on the Hadoop data.
Pulling the data from the HDFS to a table in the relational database isn’t always the optimal plan but all this is done in parallel which allows for better performance and gains in processing. Also, as I’ll explain shortly, Phase 2 of Polybase will eliminate the need to store unstructured data in the relational db.
The result of the combining of traditional and non-traditional data sources, as seen in the below illustration, allows unified access with Microsoft’s BI tools such as Power Pivot and Power View with Excel. This simplifies the access to unstructured data and helps remove the complexity associated with ‘Big Data’.
As many might be aware, Polybase isn’t the first tool to allow querying of structured and unstructured data and what will truly distinguish Polybase from other tools such as Sqoop is the functionality that will be available in Phase 2.
When Phase 2 ships the query optimizer will be able make a cost-based decision, when referencing data in an HDFS, to determine whether it should transform the query into a MapReduce job to be performed on the Hadoop cluster or if it should just process using the SQL server instances on the PDW. Also, the optimizer will have the ability to move the workload of a query involving only PDW data to the Hadoop cluster. This intelligence within the optimizer will allow it to split the workload between the 2 platforms and thus leverage the true capabilities of the Hadoop cluster. All these decisions are of course “behind the scenes” and are totally transparent to the end user.
Ideally, getting the data into the hands (and on the screens) of the folks that know what to do with it is where the power of Business Intelligence truly lies and this is a big step in bringing that closer to reality. The next step in this march: shipping Polybase with other SQL Server versions as opposed to solely PDW. We’ll just have to wait and see what Polybase Phase 3 brings.