I blogged last week that the way to solve my current customer’s problem (aggregating data from lots of sources) was to use the Master Data Services (MDS) and Data Quality Services (DQS) tools in Microsoft SQL Server. What I didn’t know when I said that was how little information I would find on these tools. I don’t know if that means no one cares about data quality, that no one knows about the tools, or that everyone uses other tools for data quality and data management. I was shocked to search Amazon.com and find only ONE book on Master Data Services, and the only book on DQS was in French!!
I am worried that if the tools aren’t catching on, Microsoft will drop support for them, but for now they work pretty well for what my customer wants to do. From what I understand, Microsoft is putting lots of resources into MDS, and hopefully better integration with DQS will come out of that work.
So, here’s how I plan to use them in the data workflow I’m architecting:
1) We will define a knowledge base from the data that is already in place. Defining all of the possible domains and populating those domains with current values will make the process of reimporting the data into the new structures much easier (there’s a quick sketch of seeding the domains right after this list).
2) To test our new process, we will be re-processing all of the customer files for the last 13 months (active data) and validating the new structures against the existing reports to make sure we haven’t done anything wrong.
3) When they decide to bring new customers on board, they can run the new files through the DQS “Knowledge Discovery” process to add new values to the existing domains. This will be an iterative process: a report of the data values can go back to the customer to help them clean up their data, and once the data is valid (enough), it can be run through the standard nightly SSIS process that moves the customer data into the production systems.
4) The data cleansing performed by the DQS Cleansing transformation in SSIS will always return four sets of rows: rows that are 100% correct, rows where something in the row has been corrected, rows where new domain values are suggested, and rows that are invalid. The SSIS steps after the cleansing step will need to evaluate every row that is not 100% correct and add or fix values in the existing domains based on the suggestions or values added in the knowledge base (see the routing sketch after this list). This is how we can keep the knowledge base synchronized with the SQL Server domain tables.
5) The cleansing process does not transform the data into the proper data types, nor does it give us things like primary keys (instead of code values). What it does do is guarantee that the follow-on processes will be able to successfully identify the PKs for valid codes and the keys for lookups. So after the SSIS cleansing step runs, there is still one more “custom” step that takes the customer’s data and formats it into “standard” data for the rest of the process (also sketched after this list). From this step forward, the processing should be parameter- and data-driven rather than coded for specific customers.
6) Once the data is in the standard format, the remaining steps will validate record-to-record data rules and apply the additional complex calculations and data derivations. The data will be converted from a “snapshot” format to a transactional format that will allow my customer to see how the data has changed over time (a stripped-down version of that conversion is sketched after this list as well).
7) The last step of the process will be to take the transactions and domain tables and turn them into tables that can be consumed by the various BI tools used for data analysis.
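Since a few of these steps are easier to picture with a little T-SQL, here are some rough sketches. None of this is final code; every table and column name below is a placeholder I made up for illustration, so treat these as the shape of the solution rather than the solution itself. First, seeding the domains from step 1: pull the distinct values that already exist in production into one table per domain, then point the DQS Knowledge Discovery activity at those tables.

-- Placeholder names: Customer_Current is the existing production data,
-- and the DomainSeed tables are what Knowledge Discovery would read from.
SELECT DISTINCT StateCode   INTO dbo.DomainSeed_State   FROM dbo.Customer_Current;
SELECT DISTINCT ProductCode INTO dbo.DomainSeed_Product FROM dbo.Customer_Current;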
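Next, the routing from step 4. The DQS Cleansing transformation tags each row with a record status (correct, corrected, new value suggested, invalid), so the steps after it simply split on that status: suggested values get queued for review and for pushing back into the knowledge base, and only the correct/corrected rows continue down the pipeline. The staging table, the State column, and even the exact status strings here are stand-ins, so check them against your own package.

-- Queue every value DQS flagged as new so it can be reviewed and, once
-- approved, added to both the SQL domain table and the DQS knowledge base
-- (repeat per mapped column/domain).
INSERT INTO dbo.Domain_Value_Review (DomainName, ProposedValue, SourceFileID)
SELECT 'State', s.State_Output, s.CustomerFileID
FROM dbo.Cleansed_Customer_Stage AS s
WHERE s.Record_Status = 'New';

-- Only rows that are already correct, or that DQS auto-corrected, move on to
-- the standardization step; invalid rows wait for manual review.
INSERT INTO dbo.Customer_Ready_To_Standardize (CustomerFileID, StateValue, ProductValue, AmountValue)
SELECT s.CustomerFileID, s.State_Output, s.Product_Output, s.Amount_Output
FROM dbo.Cleansed_Customer_Stage AS s
WHERE s.Record_Status IN ('Correct', 'Corrected');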
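The “custom” step from step 5 is mostly lookups and casts: swap the cleansed code values for the real primary keys and force everything into the proper data types. Something along these lines, with made-up dimension and staging table names:

-- Dim_State and Dim_Product stand in for whatever lookup tables hold the
-- surrogate keys for the cleansed code values.
INSERT INTO dbo.Customer_Standardized (CustomerFileID, StateKey, ProductKey, Amount)
SELECT s.CustomerFileID,
       st.StateKey,                                  -- surrogate key, not the raw code
       p.ProductKey,
       TRY_CONVERT(decimal(18, 2), s.AmountValue)    -- cast from the cleansed string
FROM dbo.Customer_Ready_To_Standardize AS s
JOIN dbo.Dim_State   AS st ON st.StateCode   = s.StateValue
JOIN dbo.Dim_Product AS p  ON p.ProductCode  = s.ProductValue;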
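And the snapshot-to-transaction conversion from step 6, boiled way down: compare each snapshot to the prior one and write a transaction row only when something actually changed. The real rules are more involved, and the monthly self-join assumes the snapshots line up cleanly, but this is the basic idea.

-- Compare this month's snapshot to last month's and record only the changes.
-- Customer_Snapshot and Customer_Transactions are placeholder names.
INSERT INTO dbo.Customer_Transactions (CustomerKey, AttributeName, OldValue, NewValue, ChangedOn)
SELECT curr.CustomerKey,
       'Status',
       prev.Status,
       curr.Status,
       curr.SnapshotDate
FROM dbo.Customer_Snapshot AS curr
LEFT JOIN dbo.Customer_Snapshot AS prev
       ON prev.CustomerKey  = curr.CustomerKey
      AND prev.SnapshotDate = DATEADD(month, -1, curr.SnapshotDate)
WHERE prev.CustomerKey IS NULL              -- first time we have seen this customer
   OR prev.Status <> curr.Status;           -- or the value actually changed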
Whew. Believe it or not, that took me almost all week to figure out. I had to determine what the DQS tool would and would not do, and how far I could push it. For example, it turns out that while you can export domains from DQS, the only thing that can consume the export is another DQS knowledge base!! Since one of my requirements is to keep the SQL domain tables and DQS in sync, this was a big challenge! I found a third-party utility on CodePlex that will let DQS consume domain values from SQL tables, but nothing to export them!! (Hint hint, anyone wanting to write something useful for CodePlex, here’s your sign!!) I also had to learn how the domains affect columns: how to do complex IF/CASE logic, how to do multiple checks on a single field, and how to handle values without importing millions of numbers!
I also realized that none of these tools does everything I need, and that I needed to think outside the box to get them all to work together. For example, while DQS does a great job of validating that a string field is indeed a date, it doesn’t OUTPUT it as a date. Doh! Keeping my customer’s primary goals in mind (easy to use, fast onboarding of new file formats, minimal developer time) reminded me to step outside the usual patterns and use each tool for what it does best.
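To make that concrete, here is the kind of follow-on conversion I mean (table and column names made up again): DQS has already confirmed the string parses as a date, but the output column is still text, so the standardize step still has to do the actual cast. Anything that somehow slips through comes back as NULL from TRY_CONVERT (SQL Server 2012 and later), so it is easy to flag.

-- DQS validated the value as a date, but it is still a string at this point.
SELECT s.CustomerFileID,
       TRY_CONVERT(date, s.EffectiveDate_Output) AS EffectiveDate
FROM dbo.Cleansed_Customer_Stage AS s;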
To be honest, I haven’t found a use for MDS yet; in fact, I just got it running on my laptop today, but I’ll be seeing whether we need that tool in the process. And based on what I found today, maybe writing a book is in my future 😉
Today’s picture: Earlier this summer I went to Mt. Rainier to photograph the Milky Way. This particular shot is from that night, and it was a challenge to process. When I opened my car door and accidentally lit up the night, the scene was actually lit with the yellow parking lot lights. If I adjusted the color settings for the parking lot, the sky looked stupid blue, but if I left the sky looking good, the lighted area looked too yellow. I ended up blending two versions of the same shot, each one color balanced for the area I wanted to highlight, in order to get this awesome shot. More out-of-the-box thinking FTW!!