Something I hear a lot as a consultant is “Let’s just get it done and we can go back and fine tune it later”. When I hear that, what *I* hear is “Let’s just get it done and we can go back and fix it later”. In this case, “Fine Tune” tends to mean “fix it” or “make it work”. And rarely is that an easy task – it almost always means a change in architecture to support performance, and/or a rewrite to the code.
What I want to talk about today is the value of getting it right the first time. I’m not trying to say that every project has the ability to be perfect first time out the door – but taking the approach that you MUST get your data defined and modeled correctly the first time can often make the difference between a successful project and one that never quite gets off the ground, especially for large data volumes.
Having a good solid data architecture is as important to your IT project as having a good solid architecture for a house. Build your house on a shaky foundation and you will almost always have to tear it down and rebuild. It is the same in the data world. If you build your data warehouse on a shaky data architecture – one that is rushed through development to “just get it done” – rather than taking the time to understand the data structures and validate that they will work long term – you will almost always end up rebuilding the data structures or doing extensive modifications to make them support your long term strategy for data.
Where this really gets noticed is when you are building a data warehouse that will hold millions and/or billions of rows. Making a simple mistake like using too large of a data field can cost megabytes or gigabytes when that data is implemented. Multiply that extra meg/gig by all of your backups, and you can waste a ton of space. Or maybe a bigger issue, too SHORT of a field – defining a field without realizing that the values might change in the future and be larger than originally planned. If you have run into this problem before, you know how painful it is to modify a table that already has a lot of data in it to accommodate a few extra characters.
Or probably the most painful thing of all – changing your mind about how to store the tables. Realizing that you should have denormalized values into tables to reduce joins and improve performance is something you want to address at the beginning of your project, not in a “performance tuning” phase.
Architecting for performance is something that is often overlooked at the beginning of a project, and too many times I see developers use the “make it work, we’ll fix it for performance later” approach – and it almost always results in a poor design that needs to be rewritten – especially in a data warehouse/data mart project.
What started me thinking about this right now is a project I’m working on that has very little data in the test db. For me, getting it right for a lot of data is way more important than getting it to work with my little amount of test data. I have to think about “how will this work with 100 million rows”, not the 2 million I’m testing with. If I don’t get it right the first time, it will mean having a bug assigned, and rewriting my queries later – which means more testing, and most likely a slip in the schedule. So even though I don’t have performance testing data available, I need to think like I do and code as if I had that volume of data.
So how does one put performance up front in a development schedule? Especially if there is little to no test data, and when the test systems rarely give the same performance characteristics as the larger test boxes or production systems? I’m going to try and address some of the issues that come up in this type of development project in this and the next few blog posts I write.
If I had ONE TIP to give to every developer out there about how to build queries that perform with large volumes of records, it’s this: STOP TRYING TO WRITE THE REPORT IN ONE LINE OF CODE. I can’t tell you how many times I’ve been called into “performance tune” a query that is so frickin’ huge my eyes roll back in my head trying to read it. SQL Server is a pretty smart engine, and it is pretty good at figuring out how to handle SQL Statements effectively using indexes and statistics, but there comes a point when the code is so huge any optimization flies out the window and it just starts working on each line of code without choosing the “best” plan. This is because SQL is a “cost-based optimizer” – if the optimizer decides that the cost of looking for the best plan is probably larger than just using any plan – it will pick a plan and use it even if it’s not the best plan. This is most likely going to happen on a huge complex SQL query statement: lots of joins, loops, cursors, subqueries, in-memory tables, CTE’s, etc etc etc. The quickest way to get a crappy query plan is to make your statement so complex that the time it takes to parse is longer than SQL Server will spend finding the right plan.
When I tackle a large complex task, I try and map out what I want SQL to do when the query plan is executed, and then figure out how can I get SQL to follow “my” query plan. I will create the statements that are simple and have non-complicated query plans so that it’s 1) easy to troubleshoot and 2) easy to predict behavior even with large volumes of rows.
Don’t get me wrong – there are places for all of these complicated SQL constructs – but use them WISELY, not just to “get it working” with the thought that you will come back later and “optimize” it. The rewrite process can be extremely painful, especially when you realize that a little managed denormalization up front could save you TONS of time downstream if you think about performance at the beginning, not at the end. And whatever you do – keep that architecture decision in mind when you are writing the code – denormalize for performance, don’t assume that those joins that work fine in your OLTP will be fine with all of the records in a report. Understand when it’s wise to change the architecture to support large datasets, and when you can get away with using a more normalized schema.
More development tips for SQL and SSIS to come over the next couple of weeks – stay tuned!!
Today’s picture: in the mid-1960’s my dad worked for Lockeed in the early 1960’s, and was in charge of the “Zero Defects – do it right the first time” campaign. Our cat had kittens in the box of posters for his campaign. She did it right the first time 😉 This photo made the cover of the Lockeed daily newspaper in 1965. Thanks Dad for setting a great example for me!