Author Archives:

Naming Conventions for Relational Databases

This is a subject that has obviously generated a lot of volume and controversy over the years, and there still isn’t any single agreed upon system for naming entities in relational databases. Through years of experience I have developed a system that I believe has a lot of merits above and beyond the typical recommendations I have seen. My system is a little counter intuitive at first, but I think the value is easy to recognize.

Obviously, one of the most well known naming conventions in computer programming in general is Hungarian Notation, which uses the convention of prefixing an entity with an abbreviation indicating the type of entity that it is. In the RDBMS world this has typically been employed in the naming of high level databases entities like tables, stored procedures, views, and functions, etc., but hasn’t been widely used for the naming of things like columns or variables. If we had a table, a view, and a stored procedure, all of which stored or returned customer data, then using Hungarian Notation they  may be called something like tbl_Customers, vw_Customers, and sp_Customers respectively.

There is widespread agreement in the database world that this type of naming convention isn’t preferable for an RDBMS, to a large extent because the prefix is mostly meaningless within each type of entity, i.e. prefixing all tables with tbl_ provides no differentiation between tables, it only provides differentiation between types of entities, but there again, we shouldn’t actually be trying to differentiate between types of entities. One of the purposes of views is for them to be interchangeable with tables. In fact, there are cases where views are used to replace legacy tables, i.e. a table needs to be deprecated  or re-named, so a view with the same name as the table is created in order to allow legacy code to continue to operate. In addition, SQL Server, and other database systems have a concept of  Synonyms, which are basically aliases and in many cases they are used in a way that would be totally inconsistent with Hungarian Notation.  For example, you can use Synonyms to create aliases to entities in a different database in order to deploy the same exact code to multiple databases or instances, but have all of the code reference a single database on a single instance. If the underlying lying object being aliased is tbl_Customers, then the Synonym would have to be called tbl_Customers in the other databases as well, which would violate the naming rule, because according to the naming rule it should be called syn_Customers, or syn_tbl_Customers, etc.

So, what convention do I use, and why?

The naming convention that I use for container entities, i.e. tables, views, stored procedures, etc. is to use a prefix that denotes the function of the entity. Some standard prefixes that I use for tables are:

  • data_ (Tables that the contain information of primary concern, i.e. in the case of customers, the actual customer names and attributes, etc.)
  • join_ (Tables used for the creation of many-to-many relationships)
  • ref_ (Reference tables that store look-up values, i.e. customer type, etc.)

For more specialized development I have also used the following:

  • app_ (Tables containing data that will be appended another set of data)
  • raw_ (Tables housing raw data imported from some external source)
  • typed_ (Tables with appropriately defined data types hold data that was previously imported in a raw_ table that may or may not have had correctly defined types (i.e. maybe in the raw_ table date columns were imported into a char column, etc.))
  • std_ (Tables that represents a “standard” to which other data has been mapped)
  • sum_ (Summary tables that are aggregations of data held in other tables)
  • dim_ (Typically convention used in a data warehouse for dimensions)
  • fact_ (Typical convention used in a data warehouse for facts that are to be associated with dimensions)

Note that all of those conventions are for Tables. Some of them I also apply to other entities as well, such as app_ and sum_, both of which I also apply to Views. For Views I’m a little less consistent, sometimes I don’t use any prefix at all, but other times I do use them. The prefixes I use for views are less consistent because they have a lot to do with the specific application. For example, I may create a view that pivots data in a table for the purpose of indicating records that are to be excluded from some form of analysis. This view may be joined to another table or view to provide a flattened listing of the reasons why records are marked for exclusion. In this case I’d prefix the view with ex_, etc.

When it comes to stored procedures things get rather “unconventional”, in that here how I prefix the procedures depends on what exactly is being done. In some cases I’ve created procedures that are intended to be run in a sequence. In those cases the sequence is the prefix. There I’ve gone back to the GW Basic days of line number conventions, using prefixes like 010_InitializeTables, 020_AppendWhatever, 030_EtcEtc, etc. The purpose here is making it easy to create a new procedure called 021_ if needed to fit between 020_ and 030_.

In other cases, where sequencing isn’t an issue I’ve tended more to things likes load_, rpt_, seq_, etc. reflecting the primary function of the procedure.

So, what’s the point to all of this? Well its all about organization, which to a large degree is that database are all about. Because entities tend to be listed in alphabetical order, prefixing is a way to group similar things together. In addition, the prefix gives you some information about what the function of the entity is. Looking at a database that I have designed, its easy to know which tables are the reference tables, which tables house the core data, and which tables you need to use for many-to-many joins. Yes, you can get that information from an ERD, assuming one has been created, this just makes it easier and more obvious.

But now on to columns within a Table or View. Here is where my convention is a little less intuitive, but I think is quickly understandable. I tend to name almost everything backwards from the way that we normally say things in the English language. This is because in the English language we tend to put descriptors in front of objects, i.e. we describe the object before we identify the object in the English language, e.g. “red car”, “primary address”, “last name”, “customer id”.

What I do in naming columns is reverse this order, and put the object first, e.g. “address primary”,  ”name last”, “id customer”. Of course I would use id_customer  or idCustomer, etc. In my case I use all lower case with words separated by underscores.  The purpose for this is for avoiding any possible case sensitivity issues. Windows and SQL Server are not case sensitive, but others systems are and some programming languages are, so given that you are “never sure” how the database may be used or whatever, I tend to make my column names all lower case to reduce potential problems. The use of case and isn’t a big issue and these practices depend on what the practice is where you work, or what the customer requests etc.

In addition to putting the object first in the column name, when I design tables I always place the primary key columns at the front of the table in the order of the clustered index, and all other columns are sequenced alphabetically. This makes finding what you are looking for in a table a lot easier, especially if there are more than 10 or so columns.

By putting the object first in the name it not only makes recognizing what a column is easier, it also again leads to better grouping, especially when columns are created in alphabetical order. This becomes even more important when data is de-normalized, for whatever reason, be it in a table, an export file, via a view, or in a OLAP cube.

And last but not least, of course we have the issue of Schemas. (You thought I was going to leave that out didn’t you?) Because security is tied to Schemas they are best used in relation to “who”. Schemas are basically containers into which entities can be placed based on how those entities should be accessed, and by whom. As far as I know there is no “naming convention” for Schemas, other than to try and keep them as short as possible.  I try to keep schema names below 5 characters personally. Depending on the setting, I’ve also used Schemas to identify who created the entity, i.e. “data”, “eng”, “dba”, “ba”, etc. In other words, this identifies if a table was created by a Data Analyst, someone in Engineering, a DBA, a Business Analyst, etc. This is more useful in a data warehouse type setting or data staging setting, you wouldn’t use this in the design of a OLTP system, for example.

So now let’s look at what  basic database might look like using the naming conventions I’ve described above. Here is what a really basic set of tables would look like using this convention:

And here is an ERD with some additional detail:

As you can see, this does a pretty good job of organizing like entities together. Some possible questions here are:

  • In the join table, why is id_customer before id_address since “a” comes before “c”? Answer: Because the columns are in the order of the clustered index, which is puts customer first since that’s how the data would most typically be inserted and read.
  • Why prefix everything in the address table with the word address? This is a preference issue, but I did it here so that when the data is de-normalized alongside the customer data the address columns would be clearly identifiable, for example if you were to create a View that returned customer name along with address.
  • But the address table looks silly with address_city in front of address_line1. Yeah, and if you are manually scripting the tables this may be a place where, within a given prefix, you violate the alphabetical rule to put things into a logical order, however I often create my metadata separately and use a script generator that automatically sorts all the (non-PK) columns in alphabetical order when generating the CREATE TABLE script.

So there you have it. That’s the naming convention that I’ve developed over the years and I think it works really well. There are some kinks here and there and times when aspects of it can seem subjective (what’s a data table vs what’s a reference table, its usually clear, but not always), but overall its done a good job for me of clarifying database structure and making it easier to find what is being looked for. Where I’ve gotten the most push-back is in the column naming, but once people see it in context they get on board unless there is need to preserve an existing convention. When people hear id_x or name_last, they often say, “Why say name_last instead of last_name, sounds weird,” but then I explain it and they usually get it.

Data Profiling using dynamic SQL – Part 1

This is the first in what will be a series of posts on data profiling. This covers one of the simpler profiling tasks, which is profiling a well typed table for the purpose of better understanding the data in the table. This is something I’ve had to do frequently for a variety of reasons, so I’ve developed a relatively simple stored procedure that I use regularly.

In my situation I have to operate on a large number of different databases across multiple instances. What I’ve done is I’ve created a single database called DataStudio, which is copied to all of the instances, that houses a set of data analysis and ETL tools. One of the tool that features prominently in that toolkit is the stored procedure listed below called GetDataStats, which generates a standard profile report on a well typed table, meaning a table for which I believe all of the columns are of an appropriate type.

For tables in which all of the columns are character columns or tables that I know need to be re-factored I have another profiling procedure that seeks to determine the best possible type for columns, which will be the subject of a later post.

I typically use this GetDataStats procedure as a quick and easy first shot at better understanding the content of a table and doing some some general bounds checking above and beyond any bounds checking defined by the schema.

Below is a sample of the output from the procedure as run against a table of stock market index data.

table_name       column_name   column_type   value_count  null_count   null_percent  distinct_count  sum_of             avg_of          top_10_values                                              bottom_10_values
---------------- ------------- ------------- ------------ ------------ ------------- --------------- ------------------ --------------- ---------------------------------------------------------- --------------------------------------------------------
dbo.adj_INDEX    trade_date    datetime      981088       0            0             2666            NULL               NULL            [Jan  1 2001 12:00AM][Jan  2 2001 12:00AM][Jan  3 2001 ... [Mar 22 2011 12:00AM][Mar 21 2011 12:00AM][Mar 18 201...
dbo.adj_INDEX    ticker        varchar       981088       0            0             368             NULL               NULL            [ADR][ADVA][ADVN][ADVQ][AEX][AMZ][ATX][BANK][BDI][BFX]     [ZWI][ZRS][ZOC][ZAX][YIH][XWH][XUO][XUH][XTV][XTC]
dbo.adj_INDEX    high          decimal       760150       220938       0.2251969     267300          2418520957848.34   3181820.637974  [*NULL*][0.0000][0.0050][0.0100][0.0150][0.0200][0.0300... [3174239000.0000][3066776000.0000][3054216000.0000][3...
dbo.adj_INDEX    low           decimal       760150       220938       0.2251969     265535          1144237193618.56   1505444.518789  [*NULL*][-19887.0000][-4.3400][-4.2360][-0.4700][-0.410... [2676176000.0000][2671483000.0000][2632589000.0000][2...
dbo.adj_INDEX    close         decimal       760150       220938       0.2251969     262146          2417073929787.85   3179887.634848  [*NULL*][0.0000][0.0050][0.0100][0.0150][0.0200][0.0250... [3174239000.0000][3066776000.0000][3054216000.0000][3...

As you can see the procedure returns one row for each column in the table. Value_count is the number of rows that have a value other than NULL. Null_count obviously is the number of rows with a NULL value. Null_percent returns the percentage of NULLs as a value between 0 and 1, and the sum_of and avg_of values are only calculated for numeric columns.

Instead of simply showing the minimum and maximum values, I show the top 10 and bottom 10 values (the results have been truncated here for display purposes).

I place the values within brackets in order to more easily show leading and trailing white-space. I return NULL values as the value [*NULL*] because I've come across instances where the literal string "NULL" has existed in tables, which has caused much confusion as one can imagine.

The procedure has three parameters, only two of which are required, a database name and table name. The schema is to be included with the table name the way I have the procedure written. This simplifies the parameters by not requiring a third parameter, and also makes it easy to populate the parameters by dragging and dropping the table name within SSMS. The third, and optional, parameter is @dropresults which by default is set to drop the summary table after the results are returned. A reason not to drop the summary table after returning the results would be if you want to run further queries or processes against the profile data.

EXEC DataStudio.dbo.sp_GetDataStats
  @database ='dbname',
  @tableName ='dbo.table';

The procedure itself is as follows:
CREATE PROCEDURE [dbo].[sp_GetDataStats]
  @database nvarchar(128),-- Database Name
  @tableName nvarchar(128),-- Name of the table to be examined. Include schema in the name
  @dropResults bit=-- 1 = Yes, drop the summary table after returning results | 0 = No, don't drop the summary table after returning results

WITH 
RECOMPILE
AS-- =============================================
-- Author: R. G. Price
-- Create date: 2012/03/01
-- Description: This procedure profiles a table by creating a summary table, into which summary information
-- is inserted. The procedure uses a cursor to loop through each column in a table performing a series of queries,
-- then inserts the results into the summary table.
-- =============================================

SET NOCOUNT ON;
SET ANSI_WARNINGS OFF;

BEGIN

  DECLARE @sql nvarchar(max);

  SET @sql =N'IF EXISTS (SELECT * FROM sysobjects WHERE xtype = ''U'' AND name = ''_DataStats'+@database+''') DROP TABLE [_DataStats'+@database+'];';

  EXEC(@sql);

  SET @sql =N'CREATE TABLE [_DataStats'+@database+'] (table_name nvarchar(128),column_name nvarchar(128),column_type nvarchar(128),value_count bigint,null_count bigint,null_percent real,distinct_count bigint,sum_of float, avg_of float, top_10_values nvarchar(max), bottom_10_values nvarchar(max));';

  EXEC(@sql);

  SET @sql =
  N'DECLARE @columnName nvarchar(128),
  @columnType nvarchar(128),
  @iSQL nvarchar(max);

  DECLARE columns_cursor CURSOR FAST_FORWARD
  FOR
  SELECT
    COLUMN_NAME, DATA_TYPE
  FROM
    ['+@database+'].INFORMATION_SCHEMA.COLUMNS
  WHERE
    TABLE_SCHEMA+''.''+TABLE_NAME = '''+@tableName+'''
  ORDER BY
    ORDINAL_POSITION

  OPEN columns_cursor

  FETCH NEXT FROM columns_cursor
  INTO @columnName, @columnType

  WHILE @@FETCH_STATUS = 0
  BEGIN
  RAISERROR(@columnName, 1, 1) WITH NOWAIT;

  SET @iSQL =
  N''DECLARE @count bigint, @distCount bigint, @nullCount bigint, @sumOf float, @avgOf float, @myTop nvarchar(max), @myBottom nvarchar(max);

  SELECT @count = COUNT_BIG(*) FROM ['+@database+'].'+@tableName+' WHERE [''+@columnName+''] IS NOT NULL;
  SELECT @distCount = COUNT_BIG(DISTINCT [''+@columnName+'']) FROM ['+@database+'].'+@tableName+';
  SELECT @nullCount = COUNT_BIG(*) FROM '+@database+'.'+@tableName+' WHERE [''+@columnName+''] IS NULL;

  IF ''''''+@columnType+'''''' IN (''''bigint'''', ''''real'''', ''''float'''', ''''decimal'''', ''''money'''', ''''numeric'''')
  BEGIN
    EXEC sp_executesql N''''SELECT @colSum = SUM([''+@columnName+'']) FROM ['+@database+'].'+@tableName+''''', N''''@colSum float OUTPUT'''', @colSum = @sumOf OUTPUT;

    EXEC sp_executesql N''''SELECT @colAvg = AVG([''+@columnName+'']) FROM ['+@database+'].'+@tableName+' WHERE [''+@columnName+''] != 0'''', N''''@colAvg float OUTPUT'''', @colAvg = @avgOf OUTPUT;
  END
  ELSE IF ''''''+@columnType+'''''' IN (''''tinyint'''', ''''smallint'''', ''''int'''')
  BEGIN
    EXEC sp_executesql N''''SELECT @colSum = SUM(CAST([''+@columnName+''] as bigint)) FROM ['+@database+'].'+@tableName+''''', N''''@colSum float OUTPUT'''', @colSum = @SumOf OUTPUT;

    EXEC sp_executesql N''''SELECT @colAvg = AVG(CAST([''+@columnName+''] as bigint)) FROM ['+@database+'].'+@tableName+' WHERE [''+@columnName+''] != 0'''', N''''@colAvg float OUTPUT'''', @colAvg = @avgOf OUTPUT;
  END

  SET @myTop = '''''''';

  SELECT
    @myTop += ''''[''''+ISNULL(CAST([''+@columnName+''] as varchar(50)), ''''*NULL*'''')+'''']''''
  FROM
    (SELECT DISTINCT TOP 10 [''+@columnName+''] FROM ['+@database+'].'+@tableName+' ORDER BY [''+@columnName+''] ASC) t

  SET @myBottom = '''''''';

  SELECT
    @myBottom += ''''[''''+ISNULL(CAST([''+@columnName+''] as varchar(50)), ''''*NULL*'''')+'''']''''
  FROM
    (SELECT DISTINCT TOP 10 [''+@columnName+''] FROM ['+@database+'].'+@tableName+' ORDER BY [''+@columnName+''] DESC) b

  INSERT INTO [_DataStats'+@database+']

  VALUES('''''+@tableName+''''', ''''''+@columnName+'''''', ''''''+@columnType+'''''', @Count, @NullCount, CASE @NullCount WHEN 0 THEN 0 ELSE CAST(@nullCount as float)/(CAST(@count+@nullCount as float)) END, @distCount, @sumOf, @avgOf, @myTop, @myBottom);'';

  EXEC(@iSQL);

  FETCH NEXT FROM columns_cursor
  INTO @columnName, @columnType;
  END

  CLOSE columns_cursor;
  DEALLOCATE columns_cursor;';

  EXEC(@sql);

  SET @sql ='SELECT * FROM [_DataStats'+@database+'];';

  EXEC(@sql);

  IF @dropResults = 1
  BEGIN
    SET @sql ='DROP TABLE [_DataStats'+@database+'];';

    EXEC(@sql);
  END

  SET NOCOUNT OFF;
END

This procedure can take a while to run. I've run it on tables with hundreds of columns and millions of rows, but it takes several hours to complete on such a table in the environments I've run it in. Never the less, it's still worth it because it's something you can simply kick off to run in the background or over night and it's still faster for assessing the content of a table than hunting and pecking or doing ad hoc analysis. For tables with dozens of columns and  thousands of rows, however, it typically completes in a matter of seconds or minutes.

What this procedure does:

The procedure starts out by creating a "permanent" table into which the summary information will be inserted. I use a permanent table instead of a temporary table or table variable because there are scenarios in which I want to retain the summary information for later use. If there is no need to retain the data then the data is dropped as indicated by the @dropResults parameter.

Dynamic SQL has to be used throughout in order to accommodate being able to run the procedure on any given database and table. You'll notice the addition of bracket delimiters around the database name and column names used in the dynamic SQL. The more technically correct way to do this would be to use the QUOTENAME function since it handles escaping of brackets should they exist in the object name, however it's more complicated to do this and for my use I didn't feel it was worth the trouble since we don't have any objects with brackets in the object name.

In addition, you'll see that the table name is never delimited. This is because I'm using a single parameter for the schema and table name. In order to properly delimit the values I'd have to parse the string and delimit the schema and table name separately. If we had table names with invalid characters in them I'd go to the trouble to do that as the first step in the procedure, but I have not yet had to run this on table with names that include invalid characters so again it's not something I've gone to the trouble to do.

The main body of the procedure actually includes nested dynamic SQL. A cursor is used to loop through each column in the table. The columns are pulled from the INFORMATION_SCHEMA, as opposed to sys system catalog views because INFORMATION_SCHEMA is simpler to use for this purpose, and INFORMATION_SCHEMA is defined by the SQL-92 standard, and thus less likely to change in the future anyway.

The use of a cursor is appropriate in this situation because the processing of each column requires procedural logic.

Inside the cursor loop I begin by using RAISEERROR() WITH NOWAIT to report the name of the column that is currently being processed. The objective of using RAISEERROR is not to throw an error, but rather to report the status of the procedure in real time. The PRINT function can't be used for this purpose because PRINT statements aren't returned until the entire procedure completes running, which would defeat the purpose of a status update. By using RAISEERROR() WITH NOWAIT with a severity and status of 1, the status is reported immediately as the procedure runs without interrupting the flow of the procedure, which let's you know how far along the procedure is. This works even within a TRY...CATCH black as well by the way. One could easily make this status message even more informative if desired by adding a count reporting something like "Processing "Column Name" : (X out of Y)", where X is the ordinal position of the column being processed out of the total count of columns.

Within the cursor loop nested dynamic SQL is created in order to make use of the column variables populated by the cursor (yes it would be nice if SQL Server were object oriented so we wouldn't have to do this type of stuff).

Within the nested dynamic SQL (iSQL for inner SQL) variables are declared to hold the values that will be inserted into the summary table.

You'll notice that sp_executesql is used for the SUM and AVG values. This is because a compile time error will be generated when the dynamic SQL is created for non-numeric columns, even though those columns would never be summed or averaged due to the IF statement which will prevent the SUM and AVG functions from being applied to non-numeric types. In order to prevent the compiler error the SUM and AVG lines have to be performed within yet a third nesting of dynamic SQL, but since variables are scoped within dynamic SQL in order to return the values to the to calling outer scope sp_executesql has to be used with an OUTPUT parameter.

When the dynamic SQL for this step is actually created it looks like the following:

IF 'datetime' IN ('bigint','real','float','decimal','money','numeric')

This of course seems like a silly statement, but this is how we do things in the non-object oriented world. In the object oriented world we would be doing something like if column.type in ()...

The reason for the ELSE statement is that SUM returns a value of the same type that was passed to it, so the the SUM of an int, smallint, or tinyint column exceeds the size of the passed in type, it will cause an error. In other words, if you SUM a column of tinyint (whose valid range is from 0..255) and the SUM is greater than 255, it will raise a run time error. To get around this its best to cast all integers smaller than bigint to a bigint when doing SUMs.

Notice that the AVG being done here is not a true average, because of the WHERE clause, which says WHERE column != 0. I did this because I'm operating on tables with lots of dubious zero values which can cause the averages to be kind of meaningless. The average returned here is the average of non-zero, and obviously non-NULL (automatically eliminated by the AVG function), values.

Lastly the top 10 and bottom 10 values are built as strings by concatenating each of the values together using the += operator. For backwards compatibility with versions of SQL older than 2008 you can simply change this to @myTop = @myTop + column, etc.

The value of each of these variables is then inserted into the summary table.

Upon completion of the cursor the results are selected from the table and the table is dropped if specified by the @dropResults parameter.

So there you have it.

Note that due to the way this procedure is written, you can't run this multiple times against tables in the same database concurrently. I have, however, created another procedure to sequentially run this procedure against all tables in a database.

Parsing and normalizing string data

This is just a little something I had to do recently and it makes simple use of several different techniques so I’ll throw it out here.

Basically I had a set of 4 digit codes that had anywhere from 0 to 4 single character modifiers append to them, and I needed to separate out each modifier to create a mapping indicating which codes had which modifiers. The modifiers represent available ”options” associated with each code. You can think of this as a data normalization exercise. In this case what we have is multiple pieces of information all mashed together into a single data element. The first 4 characters represent a base code, and each subsequent character represents a discrete piece of information. What we will do here is create a normalized mapping for each code-modifier pair.

This solution makes use of a recursive CTE (Common Table Expression), a “numbers table” (generated by the CTE), CROSS JOIN, and SUBSTRING().

Now obviously if all we needed to do was get the first 4 characters for the code and a single character on the end as the modifier this would have been a simple case of using LEFT(X,4) and RIGHT(X,1), but in this case each modifier is one character and there can be anywhere from 0 to 4 of them, so a simple LEFT() and RIGHT() wasn’t going to work.

My starting data basically looked something like this (except thousands of rows):

code
other columns
0045
...
1478X
...
2387
...
4678P*
...
8906RX
...
9048PNX*
...

We can start by creating this simple sample data to work with:

CREATE TABLE dbo.Codes
(code varchar(8NOT NULL);

INSERT INTO dbo.Codes
(code)
VALUES
(’0045′),
(’1478X’),
(’2387′),
(’4678P*’),
(’8906RX’),
(’9048PNX*’);

Now let’s create the Numbers table we will use to parse the codes using a recursive CTE and just test it out to see what it does:

WITH Numbers(num)
AS
(
SELECT 5
UNION ALL
SELECT
  num+1
FROM
  Numbers
WHERE
  num < 10
)
SELECT FROM Numbers;

This returns the following results:

num
 -----------
 5
 6
 7
 8
 9
 10
(6 row(s) affected)

Now let’s join this up to our data:

WITH Numbers(num)
AS
(
  SELECT 5
  UNION ALL
  SELECT
    num+1
  FROM
    Numbers
  WHERE
    num < 10
)
SELECT
  code, numSUBSTRING(code, num, 1)
FROM
  dbo.Codes
    CROSS JOIN Numbers
WHERE
  num <=LEN(code)
ORDERBY
  code, num

Here are the results from this query:

code     num
-------- ----------- ----
1478X    5           X
4678P*   5           P
4678P*   6           *
8906RX   5           R
8906RX   6           X
9048PNX* 5           P
9048PNX* 6           N
9048PNX* 7           X
9048PNX* 8           *
(9 row(s) affected)

The CROSS JOIN joins every value in Numbers to every value in Codes, and as we can see here, we only get the codes returned that actually have modifiers on them; any codes without modifiers aren’t returned because in those instances num (the smallest value being 5) is less than the length of the string, which would be 4.

So let’s move on to the “final” solution (though in this case I’m inserting into a temp table just to for demonstration purposes), which is a mapping between the base 4 character codes and each of the associated possible modifiers.

WITH Numbers(num)
AS
(
SELECT 5
UNION ALL
SELECT
  num+1
FROM
  Numbers
WHERE
  num < 10
)
SELECT
  LEFT(code, 4as codeSUBSTRING(code, num, 1)as modifier
INTO
  #ModMapping
FROM
  dbo.codes
    CROSS JOIN Numbers
WHERE
  num <=LEN(code)
ORDERBY
  code, num

SELECT FROM #ModMapping;

Which produces the following results:

(9 row(s) affected)
code modifier
---- --------
1478 X
4678 P
4678 *
8906 R
8906 X
9048 P
9048 N
9048 X
9048 *
(9 row(s) affected)

And now we have a basic mapping of codes to modifiers. This could be the first step in a process for creating a table of available options associated with entry codes of some kind, etc. i.e. when code 9048 is entered into some system options P, N, X and * would be presented. Of course one would want to map those code values to some more meaningful description, etc.

At any rate, this is a demonstration of one approach for using set logic to parse a portion of a string and generate a normalized result set comprised of the base portion of the string and associated data elements.

Follow

Get every new post delivered to your Inbox.