One of the nice features in DQS is the ability to manage “complex” domains, i.e. groups of fields that must be validated on their own. An Address is a type of complex domain – you validate an address not only by the City, State, and Zip Code but by a valid combination of all three values. So for example, “Seattle, WA, 98103” is a valid combination, but “Seattle, CA, 90211” is not valid. Individually all of those columns are valid – but when combined they become an invalid combination. DQS allows a data analyst to manage a complex domain by validating the individual values, then defining valid groups.
Another nice feature is the ability to have DQS map values (synonyms) to a standard value. This is a great way to translate values from a customer-defined format/value to a standard value. For example, status codes can be translated from customer-specific values to standard values for reporting.
But what happens when you want to use synonyms in a complex domain? If I have company specific codes, and the codes are not unique within the code domain but within the company-code complex domain? That is a bigger challenge, and that’s the subject of what I tackled this week.
Here’s a version of what I tackled this week, but with nice generic data: We’ll pretend that my requirement is to accumulate data on rental cars. Each night I will get data on cars rented, including the car Manufacturer (Make), the Model, and the Model Year, plus associated rental information. My goal is to clean that data so it can be aggregated along Make/Model/ModelYear. Data comes from multiple sources and is not standard: for example Model Year 2006 could come in as “2006”, “06”, “6”, and the misspelled “206”. DQS allows me to define a domain “Model Year”, and specify the valid value (“2006”) as well as the synonyms for this value that are not valid and should be corrected (“06”, “6”, and “206”). Here is a screenshot showing this domain definition:
That data correction works great, primarily because these values all translate the same across all of my data vendors. But what happens when the value is not unique within the domain? If my data vendors all give me values for “Ford” as “Ford”, “F”, and maybe some misspellings (“Forde”, “Fard”) then it’s relatively easy to create synonyms. But what happens when different vendors send values that are only unique to them? If the value is not unique, then setting up synonyms across domains becomes a much bigger challenge.
Let’s evaluate the case where the Manufacturer Codes are unique for each vendor. This table illustrates the problem:
Data Vendor | Manufacturer Code | Manufacturer Name | Our Code | Our Name |
Avis | F | Ford | FORD | Ford Motor Company |
Avis | T | Toyota | TOY | Toyota Car Company |
Avis | K | Kia | KIA | Kia Motors |
Avis | C | Chevy | GM | General Motors |
Hertz | FORD | Ford Motors | FORD | Ford Motor Company |
Hertz | TOYOTA | Toyota Motor | TOY | Toyota Car Company |
Hertz | KIA | Kit Motors | KIA | Kia Motors |
Hertz | GM | General Motors | GM | General Motors |
Budget | A | Ford | FORD | Ford Motor Company |
Budget | B | Toyoda | TOY | Toyota Car Company |
Budget | C | Kia | KIA | Kia Motors |
Budget | F | Cheverolet | GM | General Motors |
Look at the Manufacturer Code – Avis thinks that F is for Ford, but Budget uses F for Chevrolet, which we map to General Motors. within the domain for Manufacturer Code, F is a valid value – but I can’t define a specific value for it in synonym mapping because that depends on another value.
In this case, I will need to use a Complex Domain in DQS in order to define the proper data mapping rules. First I’d define a domain for the Data Provider, with values for Avis, Hertz, and Budget. Then I’d define a domain for Make (manufacturer) codes, with the values T, K, C, Ford, Toyota, KIA, GM, A, B, D, and F.
Then I’d define a Complex Domain (ProviderMake) to validate and transform these combinations, as I’ve shown here:
The two rules in this example define how the values for each Data Provider need to be mapped. To finish, I’d add one row for each of the valid combinations in the table above to do the translation for “Make”.
But that’s only half the battle: Now a domain for Manufacturer Names needs to be established, and a complex domain for ProviderManufacturerName needs to be built, with all of the proper translations for the Name fields.
NOW, think about how complex this gets if we add a third value to the list, like Model. Each manufacturer has many models, and each data provider most likely has a different list of codes for the proper values. The number of potential combinations for a group like this can grow quite large! For example, if we had 10 data providers, and 5 Car manufacturers, and each manufacturer had up to 10 models, that’s 500 possible combinations of Data Provider – Manufacturer – Model. Now imagine if each of those have up to 3 years of Model Years – the complex domain to validate Provider-Make-Model-Year now would have upwards of 1500 data translation rules!!! Adding these by hand can certainly become cumbersome.
If I had to support this set of data, I would do one additional thing: I would use Synonyms to map the majority of my data, then use the Complex Domain rules to override those values that don’t map correctly. So I’d define a synonym for “F” to be “Ford” but I’d add a rule that changes that value to “General Motors” for the Budget cars where the code is “F”. This would let me take advantage of the great Synonym feature to map the majority of my data, but would allow me to use the rules to manage those codes that aren’t unique across the entire domain.
If DQS had an automated way to load the meta data that describes these rules, it would be a much more powerful tool! (Can you hear me DQS folks? From reading posts on the Microsoft Connect site (reporting bugs) I can tell that I’m not the only customer who wishes they had an automated tool for uploading rules, but no word on when this feature might appear.
And BTW – there is another alternative for this problem that does support an automate upload, but it’s not pretty. Leave me a comment if you want to know the ugly way of managing this kind of complex domain.
Picture of the Day: “Too many people grow up. That’s the real trouble with the world, too many people grow up. They forget. They don’t remember what it’s like to be 12 years old.” — Walt Disney
Photographic details don’t matter. Remember what it’s like to be a kid instead.