Friday, April 25, 2014

Character Replacement With Regular Expression Range

I was working on a project at work, and came up with the 2 scripts below. Some people may find the scripts useful. The first sql script will replace a range of characters within a single string. The second script will replace a range of characters within a rowset.
Real life use of these scripts example:
Let's assume that we have an application(or a spreadsheet) tracking people annual income but the income column data type is varchar or text, and users may enter these:
100000.00
$100000
$100000
$100000/yr
$ 100000/annual

If we try to convert the values above(except the first one) to a decimal or money, the conversion will fail. Our goal is to remove all character accept character 0 to 9 and the period(.). After that we can convert it to a different data type such decimal or money.
Notice that I use the same technique on the example in this old post, Replace Function with Recursive CTE
--Script 1: replace a range of characters for a single string.
DECLARE @original_str  VARCHAR(50) = '$100000.00/yr'
      , @characters_range_str VARCHAR(50) = '[^0-9.]'--regex range, any character except for number 0 to 9 and period (.)
      , @replacement_str VARCHAR(10) = ''; -- an empty string. Note: if use CHAR, recursion will max out. I don't know why. I will do research and let you know.

; WITH T AS (--recursive cte
      --anchor member
      SELECT 
         @original_str AS string 
       , row_id = 1
      UNION ALL
     --Convert to VARCHAR(50) for data type and length to match with anchor member
      SELECT 
         CAST(REPLACE(string, SUBSTRING(string, PATINDEX('%'+@characters_range_str+'%', string), 1), @replacement_str) AS VARCHAR(50))
       , row_id = row_id + 1
      FROM T
      WHERE PATINDEX('%'+@characters_range_str+'%', string) <> 0 --PATINDEX return 1 when match is found
)
SELECT string
FROM T
ORDER BY row_id DESC
OFFSET 0 ROW FETCH NEXT 1 ROW ONLY --get final result from cte above 
--OPTION(MAXRECURSION 0); --change recursion levels. default = 100. Use 0 for max limit = 32767.






--Script 2: Replace a range of characters for a table
--temp table to hold test data
CREATE TABLE #salary (
 empid INT IDENTITY(1,1) PRIMARY KEY,
 income VARCHAR(20)
);
GO
--insert some rows for testing
INSERT INTO #salary (income)
VALUES   ('80000.00')
        ,('$90000')
        ,('$110000/year')
        ,('$200000/yr')
        ,('$ 150000 per year');
GO

--remove all characters except character 0 to 9.
DECLARE @characters_range_str VARCHAR(10) = '[^0-9.]' --regular expression character range, any character that not number 0 to 9 and period (.)
      , @replacement_str VARCHAR(10) = ''; -- an empty string. Note: if use CHAR, recursion will max out. I don't know why. I will do research and let you know.

; WITH T AS ( --recursive cte
      --anchor member
      SELECT 
          empid 
        , income
        , 1 AS row_id -- will be used to identify final output.
      FROM #salary
       UNION ALL
      SELECT 
          empid
          --Convert to VARCHAR(20) for data type and length to match with anchor member
        , CAST(REPLACE(income, SUBSTRING(income,PATINDEX('%'+@characters_range_str+'%', income),1), @replacement_str) AS VARCHAR(20))
        , row_id + 1
      FROM T
      WHERE PATINDEX('%'+@characters_range_str+'%', income) <> 0 --PATINDEX returns 1 When match is found
)
, T2 AS (--rank the rowset from cte above to get final results
      SELECT 
         empid
       , income
        --rank the row in descending order to get the result
       , ROW_NUMBER() OVER(PARTITION BY empid ORDER BY row_id DESC) AS result_row_id
      FROM T 
)
SELECT empid, income 
FROM T2
WHERE result_row_id = 1
ORDER BY empid
--OPTION(MAXRECURSION 0); --change recursion levels. default = 100. Use 0 for max limit = 32767.

--drop temp table
DROP TABLE #salary;












Note that if the data type for the variable @replacement_str is CHAR, the queries above yield this error:
Msg 530, Level 16, State 1, Line 5
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
I don't know why this happens. I think because Replace function returns nvarchar or varchar, thus to avoid implicit casting, all arguments need to be either nvarchar or varchar. If you know the reason why this error occurs, please share. Anyway, please let me know if you have other way to implement this.

Reference:
PATINDEX function: http://technet.microsoft.com/en-us/library/ms188395.aspx

No comments:

Post a Comment