This website requires JavaScript.

[译] 使用T-SQL进行URL解码

本文将使用纯T-SQL来解码URL地址

背景

最近我接到一个需求.需要用T-SQL解码URL . 我尝试在网上找解决方案,但是那些都是解码单个字节的, 大家都知道一个字节的值小于128. 例如'%2f'解码了就是 "/". 但是2个字节,3个字节甚至4个字节的Unicode. 单字节的解码就没作用了.例如 "%c2%ae" 应该是 "®"而结果是"®", "%E2%84%A2"应该解成"™",结果是"â"¢", "%e6%9d%a8"是"杨",结果却是"杨".我写了个函数来解决这个问题.

现在让我们回顾下URL转换非ASCII字符的方式. 他用的是UTF-8 编码的字节串. 在最前面加"%" 然后跟十六进制的数字,带代替非ASCII的字符.

SQL Server 不支持UTF-8的编码. 但是支持UTF-16编码,所以我们需要把UTF-8 转换为UTF16. 我们可以用内置的'NCHAR'函数来反馈 Unicode 字符, NCHAR 可以接受十进制值,我们只要把UTF-8转换成十进制.

2,3,4字节的UTF-8 第一个字节前面都是以 2,3,4个1位跟一个0位的格式.后面的字节都是以10开始.如下表:

1-byte UTF-8 = 0xxxxxxxbin = 7 bits = 0-7Fhex

2-byte UTF-8 = 110xxxxx 10xxxxxxbin = 5+6(11) bits = 80-7FFhex

3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxxbin = 4+6+6(16) bits = 800-FFFFhex

4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxxbin = 3+6+6+6(21) bits = 10000-10FFFFhex

让我们看一下他们第一个字节,最大最小值如下:

**Bytes** **Byte 1** **Min-Byte1** **Max-Byte1**
1-byte UTF-8 0xxxxxxx 0x00  (0) 0x7F  (127)
2-byte UTF-8 110xxxxx 0xC0  (192) 0xDF  (223)
3-byte UTF-8 1110xxxx 0xE0  (224) 0xEF  (239)
4-byte UTF-8 11110xxx 0xF0  (240) 0xF7  (247)
我通过"Byte1" 来确定我们需要把多少字节的UTF-8转为Unicode 字符

代码如下:

CREATE FUNCTION [dbo].[UrlDecode] ( @URL NVARCHAR(4000) )
RETURNS NVARCHAR(4000) AS BEGIN DECLARE @Position INT, @Base CHAR(16), @High TINYINT, @Low TINYINT, @Pattern CHAR(21) DECLARE @Byte1Value INT, @SurrogateHign INT, @SurrogateLow INT SELECT @Pattern = '%[%][0-9a-f][0-9a-f]%', @Position = PATINDEX(@Pattern, @URL)

WHILE @Position > 0
BEGIN
   SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
          @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
          @High = @High / 17 * 10 + @High % 17,
          @Low  = @Low  / 17 * 10 + @Low  % 17,
          @Byte1Value = 16 * @High + @Low
   IF @Byte1Value < 128 --1-byte UTF-8
      SELECT @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
             @Position = PATINDEX(@Pattern, @URL)
   ELSE IF @Byte1Value >= 192 AND @Byte1Value < 224 AND @Position > 0 --2-byte UTF-8
   BEGIN
       SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6),
              @URL = STUFF(@URL, @Position, 3, ''),
              @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
                 @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
                 @Position = PATINDEX(@Pattern, @URL)
   END
   ELSE IF @Byte1Value >= 224 AND @Byte1Value < 240 AND @Position > 0 --3-byte UTF-8
   BEGIN
       SELECT @Byte1Value = (@Byte1Value & (POWER(2,4) - 1)) * POWER(2,12),
              @URL = STUFF(@URL, @Position, 3, ''),
              @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
                 @URL = STUFF(@URL, @Position, 3, ''),
                 @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
                 @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
                 @Position = PATINDEX(@Pattern, @URL)
   END
   ELSE IF @Byte1Value >= 240 AND @Position > 0  --4-byte UTF-8
   BEGIN
       SELECT @Byte1Value = (@Byte1Value & (POWER(2,3) - 1)) * POWER(2,18),
              @URL = STUFF(@URL, @Position, 3, ''),
              @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,12),
                 @URL = STUFF(@URL, @Position, 3, ''),
                 @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
                 @URL = STUFF(@URL, @Position, 3, ''),
                 @Position = PATINDEX(@Pattern, @URL)
       IF @Position > 0
       BEGIN
          SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                 @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                 @High = @High / 17 * 10 + @High % 17,
                 @Low  = @Low  / 17 * 10 + @Low  % 17,
                 @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))
                 --,@URL = STUFF(@URL, @Position, 3, cast(@Byte1Value as varchar))
                 --,@Position = PATINDEX(@Pattern, @URL)

          SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
                 @SurrogateLow = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
                 @URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow)),
                 @Position = PATINDEX(@Pattern, @URL)
       END
   END
END
RETURN REPLACE(@URL, '+', ' ')

END

使用代码

使用起来很简单:

SELECT dbo.UrlDecode('URL Encoded String')
例子 1: 1,2,3字节的英语UTF-8
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=2nd+Generation+Intel%C2%AE+Core%E2%84%A2+Processors+with+Intel%C2%AE+HD+Graphics+3000%2F2000')
结果为:

http://domain.com/search?keyword=2nd Generation Intel® Core™ Processors with Intel® HD Graphics 3000/2000

例子2: 2,3位字节的中文UTF-8

SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E8%8B%B1%E7%89%B9%E5%B0%94%C2%AE+%E9%85%B7%E7%9D%BF%E2%84%A2+M+%E5%A4%84%E7%90%86%E5%99%A8')
结果为:

http://domain.com/search?keyword=英特尔® 酷睿™ M 处理器

例子3: 2,3个字节的阿拉伯UTF-8

SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%D8%A5%D9%86%D8%AA%D9%84+%C2%AE+%D9%83%D9%88%D8%B1+%E2%84%A2+M+%D8%A7%D9%84%D9%85%D8%B9%D8%A7%D9%84%D8%AC%0D%0A')
结果:

http://domain.com/search?keyword=إنتل ® كور ™ M المعالج

例子4: 2,3个字节的日本UTF-8

SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E3%82%A4%E3%83%B3%E3%83%86%E3%83%AB%C2%AE+Core%E2%84%A2+M+%E3%83%97%E3%83%AD%E3%82%BB%E3%83%83%E3%82%B5%E3%83%BC%0D%0A')
结果为:

http://domain.com/search?keyword=インテル® Core™ M プロセッサー

例子5: 4字节UTF-8中文

SELECT dbo.UrlDecode ('%F0%A9%A3%91%F0%A3%85%BF')
结果为:

??

说明

### 计算十六进制对应的十进制
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
       @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
       @High = @High / 17 * 10 + @High % 17,
       @Low  = @Low  / 17 * 10 + @Low  % 17
计算图解如下:
** ** **Hex Digit ** **0** **1** **2** **3** **4** **5** **6** **7** **8** **9** **A** **B** **C** **D** **E** **F**
**(1) =** ASCII Value 48 49 50 51 52 53 54 55 56 57 65 66 67 68 69 70
**(2) =** (1) - 48 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21 22
**(3) =** (2) / 17 *10 0 0 0 0 0 0 0 0 0 0 10 10 10 10 10 10
**(4) =** (2) % 17 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
**(5) =** (3) + (4) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
 
### 将UTF-8编码转成对应的字符
让我们用2字节的UTF-8 距离,2字节的UTF-8 的二进制形式如下:

Byte1: 110xxxxx

Byte2: 10xxxxxx

去掉byte1的"110"和byte2的"10" . 将两个二进制组合在一起就是"xxxxx xxxxx",计算他的十进制值,然后转为UTF-16 字符

SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48, @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48, @High = @High / 17 * 10 + @High % 17, @Low  = @Low  / 17 * 10 + @Low  % 17, @Byte1Value = 16 * @High + @Low

SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6)

SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48, @Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48, @High = @High / 17 * 10 + @High % 17, @Low = @Low / 17 * 10 + @Low % 17, @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))

SELECT NCHAR(@Byte1Value)

  • 一个select 语句用来获取首个字节,并且计算它的十进制值.
  • 第二个语句用按位来移除 byte1的"110' (2^5-1 ,即0x1F ,也就是 0001 1111 ).然后左移六位得到结果2^6
  • 第三个select语句用来处理第二个字符,去除'10' 同样的使用了按位 (2^6-1) 即 0x3F , 0011 1111 然后把byte1和byte2的值相加
  • 最后一个select语句则用来把值转为Unicode字符

考虑数据库的collation是否支持扩展字符supplementary character(SC)

4字节的UTF-8,值的范围从65536到1113111(0x10000 到 0x10FFFF). 按照 MSDN 的描述,如果数据库的collation不支持扩展,NCHAR函数会返回NULL. 我用来做测试的数据库collation是" SQL_Latin1_General_CP1_CI_AS" ,它并不支持SC,我需要进行转换处理. 算法如下:

  • 减掉0x10000;
  • 分割成高10位和低10位的值;
  • 将高位加上0xD800
  • 将低位加上0xDC00 代码如下:
SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
       @SurrogateLow  = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
       @URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow))
如果数据库支持字符扩展,你可以注释掉上面的转换,然后取消下面语句的注释
--,@URL = STUFF(@URL, @Position, 3, cast(@Byte1Value as varchar))
--,@Position = PATINDEX(@Pattern, @URL)

历史

  • 01 Jul 2015 - - 发布
  • 10 Sep 2015 – 修改显示的样式,代码没有改变.

版权

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

原文地址 : http://www.codeproject.com/Articles/1005508/URL-Decode-in-T-SQL 代码: http://www.codeproject.com/KB/database/1005508/UrlDecode.zip

0条评论
avatar