I’ve just spent two weeks learning how to cleanse data using the DQS tool that is a part of SQL 2012. Here’s my take so far.
The engine does a really good job catching spelling errors, and allowing me to define synonyms for data that helps me standardize data values that are import into the system. It’s easy to build the knowledge base, and has a pretty easy to use interface. The rules engine allows a data specialist (not a developer!) to define specific criteria that the data must meet to be considered “valid”, and it can support some pretty complex rules. And it works as advertised – data passes through the engine, and comes out with enough information that you can easily decide how to process the “cleansed” data. It also supports the concept of a “Complex Domain” – multiple fields that need to be validated as a group (like people’s addresses, or company-specific codes). And, it has great interaction with external cleansing tools and datasets through the Azure Marketplace.
Some of my frustration with the tool is the simplicity of the toolset – it works great if you have a pretty simple set of data, and that it’s all in the correct data type (it doesn’t support transformations). For example, in my case all of my data comes in as string data. I need to validate that the data can be converted to a specific format (like “is it a date field” or “is it an integer”) before I actual convert it. But, DQS does not do that conversion and only works with the data type that the field was input as. And it limits rules for a field based on the data type, so if I want to validate that a field is a valid date I can do that – but if I want to check that the date is less than Today it won’t work because I can’t validate a string field against a date. Same with numbers – if I want to make sure the field is a valid number, I can do that, but I can’t validate that it’s greater than five. I can check a number field to make sure it’s greater than 5, but not a string field. Here’s what I would have to do to test my data 1) check the string fields for valid values for the field type, 2) use SSIS to convert the fields to the proper data type, then 3) run DQS again to validate the values with the rules for the corrected data type. Not very efficient, and it means that basic ETL rules are in two places – one set in DQS and one set in SSIS. This also means we have to have a business/data analyst manage half the process, and an IT developer manage the transformations in SSIS. This is a huge issue for my customer who is trying to reduce the “noise” in IT when they add new data feeds.
I really like the engine, and the concept of DQS. But what might get it booted off of our project is the lack of operational depth. I think the tool was created primarily for the data analyst trying to manage a small set of files, not a data integrator importing hundreds of thousands of rows a night in an automated process. While there is a tool in SSIS to allow the automation of cleansing packages, I had to find a third party tool on CodePlex that would allow me to import domain values from a database. I could import them manually through the Knowledge Management tool in order to get the values, but that won’t work in an automated way. I needed the third party tool in order to be able to take our SQL tables and sync the values with the Knowledge base in DQS. The fact that on this project we will need a data analyst to manage the DQS and an SSIS person to manage the rest of the rules is very frustrating.
And my biggest frustration – no way to output those values into a SQL table or even a data file!! Sure, you can export a knowledge base but you can’t import that information anywhere other than another knowledge base! This almost got DQS booted out of our project. I finally figured out I could use the cleansed records to update our domains in SQL, but this is not a really good way to manage domains that are in SQL and in the Knowledgebase. I hope this is something that gets fixed soon!
Other really frustrating things: A domain can be created as a “related” domain (data exists once, can validate two domains), but not if it is part of a complex domain. In my case I want to use Complex domains to validate company specific values, but I can’t re-use the Company domain in multiple complex domains… argh!!! I also can’t update more than one domain at a time – very slow to update the domains. And probably the worst thing? Lack of scalability – the white papers I have read indicate that scalability is handled by creating multiple copies of the Knowledgebase and round-robin-ing the data through the various copies. Not the most scaleable system because we have to keep these things in sync. Argh again.
The jury is still out on whether DQS will be adapted at this customer. Part of the problem is having them adapt from a code driven environment (they are suffering from “we didn’t write it” syndrome) to this tool based approach. I hear regularly “I know how to write a tool to clean the data”, and it’s hard for me to convince them that while they indeed know how to code, they aren’t a software shop and they shouldn’t be trying to create that tool. But they have valid issues with DQS – the performance issues, configuration challenges, along with the lack of being able to define ALL processing rules in a single tool (transformations included) – are definitely making it a challenge to put into production. I still believe it’s the right tool, but I won’t be surprised if they don’t implement it due to the limitations in the current tool.
Hey DQS folks – email me and I’m happy to work with you on this! Has anyone else had experience with DQS? Leave me a comment and tell me your thoughts!
Picture of the Day: I’m going to Italy in October, so in honor of that I had dinner near Venice Beach (get it?!) at a great Italian restaurant that had an open-air section. Details: A single shot with my Nikon D800, ISO 1600, F2.8, 1/25th. Taken with my my 14-24 at 24mm. Processed slightly in Lightroom to highlight the dark areas, and bring out a little color. Thank you Nikon high dynamic range!!