Friday, January 11, 2008

A nice Title-case SQL function

This is written for Microsoft T-SQL, but should work with minor tweaks in most SQL environments. I only know of one guy who reads this blog who also writes SQL (besides me) ... but for him and whoever else may run across this, this will save you some time ...

The problem with most string handling in SQL Server (and other SQL environments) is that it invites programmers to write non-set based code. Every function I've ever seen to put a string into title case ('Something Like This') parses the string, character by character, and writes the output. Functional, sure, but slow (in a SQL environment) -- this code does the same thing, using set-based updates. It's about 15 times faster than the code it replaced:

First, put the string to be altered (Address, in this example) into lower case. I used a temp table while developing this --

UPDATE #TMP
SET Address = LOWER(Address)

UPPER the first character in each Address column --

UPDATE #TMP
SET Address = UPPER(LEFT(Address,1)) + RIGHT(Address,LEN(Address) -1)

Look for the occurrance of a space in the Address column ...

-- WHILE through the available addresses
WHILE EXISTS
(SELECT 1 FROM #TMP WHERE Address LIKE '% %')
BEGIN

-- Replace space plus a character with ~ plus an UPPERed character
UPDATE #TMP
SET Address = STUFF(
Address,
CHARINDEX(' ',Address),
2,
'~' + UPPER(SUBSTRING(Address,CHARINDEX(' ',Address) + 1,1))
)
FROM #TMP
WHERE Address LIKE '% %'

END

--When we're done, we replace the ~ with spaces.
UPDATE #TMP
SET Address = REPLACE(Address,'~',' ')

Conceptually this isn't difficult; lots of string handling code does similar stuff, just without the set-based orientation. The first update statement will update only the first case it encounters of the space + a character, so we keep hitting the recordset until there are no more spaces in it. Then we replace the ~ with spaces, and we're done.

In my environment I have further processing to do after this -- I have to make sure that directionals like 'NE' don't turn into 'Ne,' for example, that MacAdams Drive doesn't turn into Macadams Drive, and so on ... but for most simple title case scenarios, this should suffice, and run quickly -- as I said, about 15 times faster than the non-set based code it replaced.

11 comments:

Thomas said...

It's solutions like this that always make me push said tasks out to the middle or top layer. When a one-line regex replacement can do the task, it seems counter productive to do it on the SQL end of things.

Just my 0.02 of course. Your solution provides a way to get consistent output no matter the front end, and I can respect that.

Dan Moran said...

No doubt there are far better string handling environments than the database. That said, every mature business environment I've ever been in had SQL string handling code floating around. Whether it's right or wrong is sort of irrelevant to the question of whether people are doing it.

There's also, as you note, something to be said for a unified interface to the rest of the world. If your database serves many masters, this can be the only practical way to get consistent results.

howard said...

Ah, SQL, I actually miss coding, it's been so long. My favorite task was reverse engineering crappy code and rewriting it. I remember debugging one program that took hours to run and finding out someone had a really bad join in the wrong spot. Reduced processing time by 95% by the time I was done ...

Ray said...

I end up doing a lot of SQL work, most of it on MSSQL, and I've read your books so your blog is a win/win for me :-).

Just as an aside to your solution, if the above is more than a one-off data cleansing script then you may want to switch from temp tables to table variables (# -> @), depending on the row count, of course. (#temp tables are transaction logged, @table_vars are kept in memory only.) Some systems have anemic disk setups, so anything that avoids hitting the disk is always a win.

Also, it may be faster to add a TOP, like so:

WHILE EXISTS
(SELECT TOP 1 1 FROM #TMP WHERE Address LIKE '% %')
BEGIN

At least it would allow SQL Server to do a simpler optimization than noticing the WHILE EXISTS wrapped around the select, but I've no idea if it'll take advantage of it without testing. In my experience, SQL Server is horrible at optimizing things that should be pretty obvious.

J.D. Ray said...

Sounds like there's more than one reader that writes SQL in one fashion or another.

I've never really worked with SQL Server, using Oracle instead. From what little knowledge I have, this thing looks like it would port fairly easily to PL/SQL. Adding a little complexity to deal with the oddballs (as you say, so NE doesn't come out Ne) and compiling the result as a stored procedure should result in a convenient little utility. OTOH, what about an external call to regex() installed on the DB server? I'm not sure what the processor overhead is to make external calls like that, but the net speed might be favorable.

Dan Moran said...

Ray,

SELECT TOP 1 Whatever vs SELECT Whatever evaluate out to be the same thing, in an IF EXISTS clause -- the engine's smart enough to find the first example of whatever you're looking for, thus proving existence, and then quit. You can verify this using the Query Analyzer -- hit Ctrl-M before you execute the code, to get back an execution plan, and compare the two versions. They do the same thing. That SELECT 1 thing is a habit left over from the old days when SQL Server would pull a whole row from disk if you used "SELECT *" ... but even back then I don't think it actually pulled a whole recordset, once it found a matching criteria.

Table variables are fine, but they're not disk access free -- they do create entries in tempdb, they just don't persist the data there. For smaller data sets they're preferable to temp tables, but when you start to get into the ranges where an index would be useful, you'll need to use a temp table, since you can't index table variables. (Well, you can define a primary key with the table variable declaration, but not beyond that.) For very large data sets, the performance improvements of an indexed temp table over a non-indexed table variable can be huge, particularly if you end up having to join the table in a query. Finally, if you have a lot of processes declaring table variables, you can have memory issues -- a few megs of memory being allocated a thousand times a minute, and persisting for several seconds, can cause real memory problems on a busy box. So there's a balance that has to be hit there --

Also, table variables are a pain in the ass to develop with, since the data won't persist once the session that creates it terminates -- I usually develop with temp tables even when I know I'm going to use table variables in production, so I can examine my data a little more easily, and then swap out the "CREATE TABLE #TableName" with "DECLARE @TableName TABLE" when I get done ... then you search and replace #TableName with @TableName in the rest of your code, and you're off to the races.

SQL Server 2008 lets you pass table variables directly between stored procedures. I'm looking forward to that. :-)

JD Ray,

I've never used regular expressions in SQL Server -- they've had them as SQLCLR objects ever since 2005 shipped -- but I haven't yet worked in an environment where anyone was using the CLR. Developers don't want to write against the DB, and DBAs don't want developers in the DB.

It's certainly the sort of thing the CLR was stuck into SQL Server to do ... but as I say, I've never seen it.

Anonymous said...

please go to this web site it provide more easy way

http://blog.sqlauthority.com/2007/02/01/sql-server-udf-function-to-convert-text-string-to-title-case-proper-case/

Dan Moran said...

PinalDave, is that you? If so, let me tell you I've enjoyed your blog a number of times.

Yep, the version ref'd is easier to use than what I wrote. Slower, though.

yogesh.narayanan said...

Title Case SQL Server Function found simple solution...

jeremy said...

This proved helpful to me 3 and a half years after you posted. Thanks.

I had some trailing spaces in some of my data that threw off

SET Name = UPPER(LEFT(Name,1)) + RIGHT(Name,LEN(Name)-1)

I used RTRIM to get rid of the trailing spaces, and then it worked perfectly. I like your solution better than others because it's not a function. I'm not the DBA and creating a function isn't an option for me.

Sudhir DBAKings said...

Nice post very helpful

dbakings