July 9, 2015

TSQL pattern matching as part of a table join

Written by

OK so anyone with some fundamental SQL knowledge will be able to read the following SQL statement and understand what it’s returning, let’s take a look:

To summarize we’re joining the FactInternetSales table to the DimCustomer table via the foreign key called CustomerKey. From the DimCustomer table we’re returning the LastName and FirstName and from the FactInternetSales table we’re returning the TotalProductCost.

TSQL refresher over let’s move on to a more interesting example. This problem presented itself during some work for a client. Let me explain the data first.

The company receives data in the form of flat files from the mainframe system that gets processed overnight and imported into their data-warehouse. One of the pieces of data received is regarding sales promotions. All sales promotions have a field called PromotionCode. The PromotionCode is a 3 letter alpha numeric code where each character has a meaning:

CSC
CSF
CSH

etc

The first character is the Promotion, the second character is the Sub Promotion and the final character is for a specific Demographic. The issue is the source system (a mainframe) only has information the the highest level, so we only get data coming through in the flat files that describe the first character. In the example above if:

CSC = Commodore Sydney Car Club Promotion
CSF = Commodore Sydney Freemason’s Promotion
CST = Commodore Sydney Tradies Promotion

The only data describing these three promotions would be Commodore Promotion.

Moving along, what is needed is some reference data that describes these codes at a more granular level but also the ability to not have define every combination of the promotion codes as the maintenance would simply be too high (there are thousands of combinations in the source system). For some promo’s we want to define all three characters in our reference table, and others we want to partially define the promo’s (we might not have enough information at the time). Our table would look something like this:

Taking a look at the defined reference data:

The underscores in the PromoCode field represent a place for a single character, refer to https://msdn.microsoft.com/en-us/library/ms179859(v=sql.90).aspx for more information. Now we’ve define some very specific Promo descriptions such as CSC all the way to generic ones like C__ or _ST. The Priority column to determine the priority that we apply the matches.

Now let’s take a look at the source data:

The column we’re interested in is PromoCode, The PromoCode’s CSX, CSS, CAT, XST and COO aren’t defined as part of our reference data, but because we’ve defined a wildcard match on a single character they should be picked up. Now let’s take a look at the SQL we need to wire this all up:

and the result:

Our source data was 8 rows and it’s now blown out to 19 rows. This is because we’ve used the LIKE operator on our JOIN, it’s matched all possible matches. What we now need to do is use the priority columns to return the highest priority match.

Now let’s improve the SQL statement by using the ROW_NUMBER() window function the grab only the highest priority match:

And the results:

And that’s it. By creating a reference table and using pattern matching wildcards, we’ve minimized the number of Promotion Code that need to be maintained. We can also match these partially defined Promotion Code quite easily in SQL.

 

 

 

 

 

 

Category : SQLTSQL

Leave a Reply

Your email address will not be published. Required fields are marked *