Webcast Follow-Up: Recognizing the Parallel Data Warehouse (PDW) Workload

Last Monday, I presented a session for Pragmatic Works Training on the T’s on Microsoft’s Parallel Data Warehouse.  In the session, titled Recognizing the Parallel Data Warehouse (PDW) Workload, I discussed the role of MPP and PDW in the enterprise and the types of workloads they can assist with.

The abstract for the session is:

If you are like many, you’ve been hearing a lot about Parallel Data Warehouse (PDW) lately. PDW is Microsoft’s appliance solution with SQL Server 2012 that allows massively parallel processing (MPP) of database workloads. But how do you know that you have a workload that requires or could benefit from MPP over the standard symmetric multiprocessing (SMP) architecture? In this session, we’ll review that question and more with a run-through of how to identify MPP workloads. As part of the session, we’ll also review some of the key components and differentiators of the PDW platform.

Webcast Recording

If you missed the webcast, you can access the recording from the Pragmatic Works Learning Center.  The session is available here.  The presentation deck isn’t available for download, but if you are looking to get more information on PDW or would like the session delivered to your team or user group, feel free to reach out to me in the comments or through Twitter.

Session Questions and Answers

The questions below were asked during the session.  I’ll keep the answers fairly brief, so if I miss a point and need to explain more, let me know.

Q: Data distribution should be made based on data usage/query. A single table could be used for multiple queries. Is there some rule of thumb to be followed that helps in determining the distribution key?

Generally, you will want to distribute data based on the column that is most frequently joined between the fact and dimension tables.  For instance, if every query includes a pageId and that is used in joins to the dimensions, then it may make a good distribution key.  Of course, if the queries will also filter down on that value limiting the distribution of values that are returned, then another column, such as date, may be more appropriate.  The idea is to choose a value that is leveraged often and which is also going to leverage as many nodes of the appliance as possible for each query.
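To make the idea concrete, here is a minimal sketch of what a hash-distributed fact table looks like in PDW DDL.  The table and column names (FactPageViews, PageId) are illustrative, not from the session:

```sql
-- Hypothetical fact table distributed on the column most frequently
-- joined to the dimensions; each row is hashed on PageId and stored
-- on one of the appliance's distributions.
CREATE TABLE dbo.FactPageViews
(
    PageId    INT    NOT NULL,
    DateId    INT    NOT NULL,
    ViewCount BIGINT NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (PageId),
    CLUSTERED COLUMNSTORE INDEX
);
```

If most queries filter on PageId to a handful of values, hashing on DateId instead would spread the work across more distributions for those queries.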

Q: Does it really return millions of rows in no time if I call it from an ASP.NET webform?

PDW can aggregate millions of rows very quickly, and building the resultset will have similar performance.  Returning millions of rows to a webform, however, will run into the same network issues that any other database platform hits when transferring that much data across the wire.  PDW’s value is in aggregating data, which may be millions or billions of rows, and calculating a resultset from them.  Whether the application making the request is SQL Server Data Tools or a webform makes no difference.

Q: Have you used PDW 2012 with Clustered Column Store with Analysis Server in Tabular Model utilizing a PassThrough to the Column Store in PDW?

I haven’t personally done this, but I did stay in a Holiday Inn last night.  In all seriousness, when I was working with a client on a proof of concept (POC) a few months back, one of the Microsoft team members performed a demo for the client using their data with PowerPivot and a Tabular Model.  They were very impressed.

Q: In your experience implementing PDW have you found any of the limitations of PDW, as opposed to base SQL Server, to be an issue? If so, what did you do to work around them?

I haven’t found the language limitations too challenging.  Some nice-to-have items would be scalar functions, common table expressions, and UNPIVOT.  That being said, they are easy to work around.  For the first two, I leverage CTAS (CREATE TABLE AS SELECT), and for the last, I use UNION ALLs.  These aren’t that bothersome and really just get me to flex my old SQL skills.  I’ve used numbers tables more often with PDW than is typical, but with those solutions, I’ve seen great improvements in performance.  One way to think of it is, instead of having 10 ways to write any solution in SQL, I’m now left with 9.
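Both workarounds can be sketched briefly.  The table and column names below are invented for illustration; only the CTAS and UNION ALL patterns themselves are the point:

```sql
-- CTAS in place of a CTE: materialize the intermediate result as a
-- table, then query it in the next statement.
CREATE TABLE dbo.DailyTotals
WITH (DISTRIBUTION = HASH (PageId))
AS
SELECT PageId, DateId, SUM(ViewCount) AS TotalViews
FROM dbo.FactPageViews
GROUP BY PageId, DateId;

-- UNION ALL in place of UNPIVOT: one SELECT per column being
-- rotated into rows.
SELECT PageId, 'Q1' AS QuarterName, Q1Sales AS Sales FROM dbo.QuarterlySales
UNION ALL
SELECT PageId, 'Q2', Q2Sales FROM dbo.QuarterlySales;
```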

Q: How is D-sql different from T-Sql? What is the learning curve involved?

The difference between the languages is that some constructs we are used to in T-SQL haven’t been implemented yet in D-SQL.  Some good examples are in the previous answer.  The nice thing is that, with the development cycle, commands are being added to D-SQL with every update.  We just recently upgraded and can now use GETDATE() and CTEs.

Since the languages are so close, the learning curve isn’t too challenging.  The biggest consideration is watching how D-SQL statements are interpreted by the query engine and whether data movement between nodes is required to satisfy queries.  If a lot of data needs to move around, that will impact query performance and may require rewrites to prevent the movement.
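One way to watch for that data movement is PDW’s EXPLAIN command, which returns the distributed execution plan without running the query.  The query below is a made-up example; the thing to look for in the plan output is movement operations such as SHUFFLE_MOVE or BROADCAST_MOVE:

```sql
-- EXPLAIN returns the distributed plan as XML instead of executing
-- the query; data movement steps in the plan indicate rows being
-- redistributed between nodes to satisfy the join.
EXPLAIN
SELECT d.PageName, SUM(f.ViewCount) AS TotalViews
FROM dbo.FactPageViews AS f
JOIN dbo.DimPage AS d
    ON f.PageId = d.PageId
GROUP BY d.PageName;
```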

Q: How easy is it to integrate with the cloud? Will performance be affected by the speed or efficiency of the cloud architecture?

The only integration that can be done currently with the cloud is to query the on-premises PDW from the cloud.  To do so, either site-to-site or point-to-site virtual networking would need to be configured.  This is really an Azure task that is unrelated to PDW.  Communication with PDW can happen over many of the standard SQL Server protocols that we are all used to using.

Q: If someone wanted to upgrade to PDW 2.1 from PDW 2.0, who does that work? Pragmatic Works? Or do you go to the vendor such as HP?

Neither Pragmatic Works nor HP/Dell would perform that work.  The upgrade would be a task assigned to the Microsoft Designated Support Engineer (DSE) that is assigned to the PDW implementation when it is purchased.  To my knowledge, only DSEs are allowed to do PDW upgrades.

Q: Could you touch on dimension management in PDW, if it isn’t handled in SSIS?

Because PDW can process millions of rows and build new tables faster than SSIS can often push changes through slowly changing dimension transformations, this logic is typically moved into stored procedures.  In the stored procedures, the changes are made using CTAS statements and merge techniques that build a new dimension table, which then replaces the existing table.  It seems like doing so would be more work, but in my experience the kill-and-fill technique often takes seconds to execute.
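A minimal sketch of the kill-and-fill pattern follows.  The table names, columns, and merge logic are hypothetical placeholders; the pattern is CTAS to build the replacement table, then RENAME OBJECT to swap it in:

```sql
-- Build the new dimension table from the current dimension merged
-- with staged changes (illustrative one-column merge).
CREATE TABLE dbo.DimCustomer_New
WITH (DISTRIBUTION = REPLICATE)
AS
SELECT c.CustomerId,
       COALESCE(s.CustomerName, c.CustomerName) AS CustomerName
FROM dbo.DimCustomer AS c
LEFT JOIN dbo.StageCustomer AS s
    ON s.CustomerId = c.CustomerId;

-- Swap the new table in for the old one, then drop the old copy.
RENAME OBJECT dbo.DimCustomer TO DimCustomer_Old;
RENAME OBJECT dbo.DimCustomer_New TO DimCustomer;
DROP TABLE dbo.DimCustomer_Old;
```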

Q: Is PDW more suitable for a certain type of schema, e.g., a star schema or a normalized schema?  Or does it not matter?

Both normalized schemas and star schemas can be used, but star schemas are the better fit.  With star schemas, the hashed distribution values will be easier to leverage and will likely make queries easier to write.  A normalized schema can still be used, but more tables will be replicated across the nodes than in a star schema.  This will lead to more complex join patterns and could result in issues when tuning queries.
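In the typical star schema layout, the fact table is hash-distributed and the smaller dimension tables are replicated, so joins to the dimensions need no data movement.  A sketch of a replicated dimension (names assumed for illustration):

```sql
-- A small dimension replicated in full to every compute node; joins
-- from a distributed fact table to this table avoid data movement.
CREATE TABLE dbo.DimDate
(
    DateId    INT  NOT NULL,
    DateValue DATE NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);
```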

Q: Can SSIS/SSRS also be used in PDW?

Yes, both can be used.  SSIS has some drawbacks, since performing data changes in-memory and using the PDW Connector is typically slower than a DWLoader import with stored procedures transforming the data.  With SSRS, there are no real concerns unless, of course, the query returns a few billion rows back to the SSRS service.  You’ll want to perform as many aggregations within PDW as possible, for performance reasons.

Q: Regular ELT using PDW – you mentioned PDW can load 2 TB per hour.  I’ve seen SSIS do 1 TB in 30 minutes; isn’t that about the same as PDW?

This is true.  With a decent amount of time tuning and architecting disk and table designs, SSIS can be coaxed into performing 1 TB loads in 30 minutes.  I’ve gone through this before while using very fast storage for both the database and source files.  It was a fair amount of work, and I eventually got some pretty decent results.  With PDW, the effort to accomplish the same data load isn’t hard and doesn’t require a white paper’s worth of instructions.  Simply point the command-line DWLoader tool at the file, reference the PDW IP address, and the load is ready to go.  The only thing that may need tuning is the source file disk subsystem, since PDW can likely receive and process the data faster than it can be sent over.
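For a rough idea of what that invocation looks like, here is a sketch of a DWLoader call.  The server address, credentials, paths, and table name are all placeholders, and the exact switches should be checked against the dwloader reference for your appliance version:

```
dwloader.exe -S 10.0.0.1 -U loaduser -P <password>
    -i C:\loads\pageviews.txt
    -T SalesDW.dbo.FactPageViews
    -R C:\loads\pageviews.rejects
    -M append -fh 1
```

The point is that the load is a single command against a delimited file, not a tuning exercise.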

Q: The architecture looks very similar to Teradata, are there any major differences between the two?

I’m not too familiar with Teradata, but my understanding is that conceptually the platforms manage data similarly.  One thing I’ve heard from people comparing PDW and Teradata is that the price quotes are substantially different.  The other difference, I’ve been told, is in language.  With D-SQL, you get to write queries very similar to T-SQL; with Teradata, it’s really a new language to learn, which results in a steeper learning curve.

Q: How much do the racks cost?

I don’t have a fixed price that I can share.  It’s not that the price changes based on how wealthy you are, it’s just not a published number.  If you are looking at tier-one data warehouse solutions on SQL Server systems, need to process millions and billions of rows, are concerned about future scale, and the cost of the SMP system that should handle the data today is freaking you out, you should talk to Microsoft about PDW, because I’m sure you’ll be pleasantly surprised.

Thanks for Attending

Thanks to those that attended the webcast.  The session went pretty much up to the end of the hour, so the questions that were asked are listed above.  If you had a question that didn’t get asked or answered during the webcast, feel free to leave it in the comments below.