T-SQL의 Levenshtein 거리
Levenshtein 거리를 계산하는 T-SQL의 알고리즘에 관심이 있습니다.
내가 알고있는 다른 버전에 비해 속도를 향상시키는 몇 가지 최적화를 통해 TSQL에서 표준 Levenshtein 편집 거리 기능을 구현했습니다. 두 문자열의 시작 부분에 공통 문자 (공유 접두사), 끝에 공통 문자 (공유 접미사), 문자열이 크고 최대 편집 거리가 제공되는 경우 속도 향상이 현저합니다. 예를 들어 입력이 두 개의 매우 유사한 4000 문자열이고 최대 편집 거리가 2로 지정된 경우 이는 거의 3 배 더 빠릅니다.edit_distance_within허용 된 답변에서 함수를 사용하여 0.073 초 (73 밀리 초) 대 55 초 내에 답변을 반환합니다. 또한 두 개의 입력 문자열 중 더 큰 공간에 일정한 공간을 더한 것과 동일한 공간을 사용하여 메모리 효율적입니다. 열을 나타내는 단일 nvarchar "배열"을 사용하고 그 안에서 모든 계산과 일부 도우미 int 변수를 제자리에서 수행합니다.
최적화 :
- 공유 접두사 및 / 또는 접미사 처리를 건너 뜁니다.
- 큰 문자열이 전체 작은 문자열로 시작하거나 끝나는 경우 조기 반환
- 크기 차이가 최대 거리 초과를 보장하는 경우 조기 반환
- 행렬의 열을 나타내는 단일 배열 만 사용합니다 (nvarchar로 구현 됨).
- 최대 거리가 주어지면 시간 복잡도는 (len1 * len2)에서 (min (len1, len2))로 이동합니다. 즉, 선형
- 최대 거리가 주어지면 최대 거리 한계를 달성 할 수 없다고 알려진 즉시 조기 복귀
최적화는 TSQL의 Levenshtein에 대한 내 블로그 게시물 과 유사한 Damerau-Levenshtein 구현이있는 다른 게시물에 대한 링크에 좀 더 자세히 설명되어 있습니다. 그러나 다음은 코드입니다 (2014 년 1 월 20 일 업데이트 됨).
-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- http://blog.softwx.net/2014/12/optimizing-levenshtein-algorithm-in-tsql.html
--
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
@s nvarchar(4000)
, @t nvarchar(4000)
, @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
DECLARE @distance int = 0 -- return variable
, @v0 nvarchar(4000)-- running scratchpad for storing computed distances
, @start int = 1 -- index (1 based) of first non-matching character between the two string
, @i int, @j int -- loop counters: i for s string and j for t string
, @diag int -- distance in cell diagonally above and left if we were using an m by n matrix
, @left int -- distance in cell to the left if we were using an m by n matrix
, @sChar nchar -- character at index i from s string
, @thisJ int -- temporary storage of @j to allow SELECT combining
, @jOffset int -- offset used to calculate starting value for j loop
, @jEnd int -- ending value for j loop (stopping point for processing a column)
-- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
, @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1)) -- length of smaller string
, @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1)) -- length of larger string
, @lenDiff int -- difference in length between the two strings
-- if strings of different lengths, ensure shorter string is in s. This can result in a little
-- faster speed by spending more time spinning just the inner loop during the main processing.
IF (@sLen > @tLen) BEGIN
SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
SELECT @s = @t, @sLen = @tLen
SELECT @t = @v0, @tLen = @i
END
SELECT @max = ISNULL(@max, @tLen)
, @lenDiff = @tLen - @sLen
IF @lenDiff > @max RETURN NULL
-- suffix common to both strings can be ignored
WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
SELECT @sLen = @sLen - 1, @tLen = @tLen - 1
IF (@sLen = 0) RETURN @tLen
-- prefix common to both strings can be ignored
WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1))
SELECT @start = @start + 1
IF (@start > 1) BEGIN
SELECT @sLen = @sLen - (@start - 1)
, @tLen = @tLen - (@start - 1)
-- if all of shorter string matches prefix and/or suffix of longer string, then
-- edit distance is just the delete of additional characters present in longer string
IF (@sLen <= 0) RETURN @tLen
SELECT @s = SUBSTRING(@s, @start, @sLen)
, @t = SUBSTRING(@t, @start, @tLen)
END
-- initialize v0 array of distances
SELECT @v0 = '', @j = 1
WHILE (@j <= @tLen) BEGIN
SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
SELECT @j = @j + 1
END
SELECT @jOffset = @max - @lenDiff
, @i = 1
WHILE (@i <= @sLen) BEGIN
SELECT @distance = @i
, @diag = @i - 1
, @sChar = SUBSTRING(@s, @i, 1)
-- no need to look beyond window of upper left diagonal (@i) + @max cells
-- and the lower right diagonal (@i - @lenDiff) - @max cells
, @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
, @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
WHILE (@j <= @jEnd) BEGIN
-- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
, @thisJ = @j
SELECT @distance =
CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag --match, no change
ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag --substitution
WHEN @left < @distance THEN @left -- insertion
ELSE @distance -- deletion
END END
SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
, @diag = @left
, @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
END
SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
END
RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END
이 함수의 주석에서 언급했듯이 문자 비교의 대소 문자 구분은 유효한 데이터 정렬을 따릅니다. 기본적으로 SQL Server의 데이터 정렬은 대소 문자를 구분하지 않는 비교를 수행하는 데이터 정렬입니다. 이 함수를 항상 대소 문자를 구분하도록 수정하는 한 가지 방법은 문자열이 비교되는 두 위치에 특정 데이터 정렬을 추가하는 것입니다. 그러나, 특히 데이터베이스가 기본이 아닌 데이터 정렬을 사용하는 경우 부작용에 대해 철저하게 테스트하지 않았습니다. 다음은 대소 문자 구분 비교를 강제하기 위해 두 줄을 변경하는 방법입니다.
-- prefix common to both strings can be ignored
WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)
과
SELECT @distance =
CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag --match, no change
Arnold Fribble은 sqlteam.com/forums에서 두 가지 제안을 했습니다.
- 하나 2005년 6월 및
- 2006 년 5 월의 또 다른 업데이트
이것은 2006 년의 젊은 것입니다.
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
@cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
WHILE @j <= @tl
SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
WHILE @i <= @sl
BEGIN
SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
WHILE @j <= @tl
BEGIN
SET @c = @c + 1
SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
IF @c > @c1 SET @c = @c1
SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
IF @c > @c1 SET @c = @c1
IF @c < @cmin SET @cmin = @c
SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
END
IF @cmin > @d BREAK
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO
IIRC, SQL Server 2005 이상에서는 모든 .NET 언어로 저장 프로 시저를 작성할 수 있습니다 . SQL Server 2005에서 CLR 통합 사용 . 이를 통해 Levenstein 거리 를 계산하는 절차를 작성하는 것은 어렵지 않습니다 .
간단한 Hello, World! 도움말에서 추출 :
using System;
using System.Data;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
public class HelloWorldProc
{
[Microsoft.SqlServer.Server.SqlProcedure]
public static void HelloWorld(out string text)
{
SqlContext.Pipe.Send("Hello world!" + Environment.NewLine);
text = "Hello world!";
}
}
그런 다음 SQL Server에서 다음을 실행하십시오.
CREATE ASSEMBLY helloworld from 'c:\helloworld.dll' WITH PERMISSION_SET = SAFE
CREATE PROCEDURE hello
@i nchar(25) OUTPUT
AS
EXTERNAL NAME helloworld.HelloWorldProc.HelloWorld
이제 테스트 실행할 수 있습니다.
DECLARE @J nchar(25)
EXEC hello @J out
PRINT @J
도움이 되었기를 바랍니다.
Levenshtein Distance Algorithm을 사용하여 문자열을 비교할 수 있습니다.
여기 http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx 에서 T-SQL 예제를 찾을 수 있습니다 .
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int
DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000,
@j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@c = @i,
@cv0 = CAST(@i AS binary(2)),
@j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
(Joseph Gama가 개발 한 기능)
사용법 :
select
dbo.edit_distance('Fuzzy String Match','fuzzy string match'),
dbo.edit_distance('fuzzy','fuzy'),
dbo.edit_distance('Fuzzy String Match','fuzy string match'),
dbo.edit_distance('levenshtein distance sql','levenshtein sql server'),
dbo.edit_distance('distance','server')
알고리즘은 한 단계에서 다른 문자를 대체하여 한 문자열을 다른 문자열로 변경하는 stpe 수를 반환합니다.
I was looking for a code example for the Levenshtein algorithm, too, and was happy to find it here. Of course I wanted to understand how the algorithm is working and I was playing around a little bit with one of the above examples I was playing around a little bit that was posted by Veve. In order to get a better understanding of the code I created an EXCEL with the Matrix.
distance for FUZZY compared with FUZY
Images say more than 1000 words.
With this EXCEL I found that there was potential for additional performance optimization. All values in the upper right red area do not need to be calculated. The value of each red cell results in the value of the left cell plus 1. This is because, the second string will be always longer in that area than the first one, what increases the distance by the value of 1 for each character.
You can reflect that by using the statement IF @j <= @i and increasing the value of @i Prior to this statement.
CREATE FUNCTION [dbo].[f_LevenshteinDistance](@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int;
DECLARE @s2_len int;
DECLARE @i int;
DECLARE @j int;
DECLARE @s1_char nchar;
DECLARE @c int;
DECLARE @c_temp int;
DECLARE @cv0 varbinary(8000);
DECLARE @cv1 varbinary(8000);
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000 ,
@j = 1 ,
@i = 1 ,
@c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1;
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@c = @i ,
@cv0 = CAST(@i AS binary(2)),
@j = 1;
SET @i = @i + 1;
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1;
IF @j <= @i
BEGIN
SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END;
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1;
IF @c > @c_temp SET @c = @c_temp;
END;
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1;
END;
SET @cv1 = @cv0;
END;
RETURN @c;
END;
In TSQL the best and fastest way to compare two items are SELECT statements that join tables on indexed columns. Therefore this is how I suggest to implement the editing distance if you want to benefit from the advantages of a RDBMS engine. TSQL Loops will work too, but Levenstein distance calculations will be faster in other languages than in TSQL for large volume comparisons.
I have implemented the editing distance in several systems using series of Joins against temporary tables designed for that purpose only. It requires some heavy pre-processing steps - the preparation of the temporary tables - but it works very well with large number of comparisons.
In a few words: the pre-processing consists of creating, populating and indexing temp tables. The first one contains reference ids, a one-letter column and a charindex column. This table is populated by running a series of insert queries that split every word into letters (using SELECT SUBSTRING) to create as many rows as word in the source list have letters (I know, that's a lot of rows but SQL server can handle billions of rows). Then make a second table with a 2-letter column, another table with a 3-letter column, etc. The end results is a series of tables which contain reference ids and substrings of each the words, as well a the reference of their position in the word.
Once this is done, the whole game is about duplicating these tables and joining them against their duplicate in a GROUP BY select query which counts the number of matches. This creates a series of measures for every possible pair of words, which are then re-aggregated into a single Levenstein distance per pair of words.
Technically this is very different than most other implementations of the Levenstein distance (or its variants) so you need to deeply understand how the Levenstein distance works and why it was designed as it is. Investigate the alternatives as well because with that method you end up with a series of underlying metrics which can help calculate many variants of the editing distance at the same time, providing you with interesting machine learning potential improvements.
Another point already mentioned by previous answers in this page: try to pre process as much as possible to eliminate the pairs that do not require distance measurement. For example a pair of two words that have not a single letter in common should be excluded, because the editing distance can be obtained from the length of the strings. Or do not measure the distance between two copies of the same word, since it is 0 by nature. Or remove duplicates before doing the measurement, if your list of words comes from a long text it is likely that the same words will appear more than once, so measuring the distance only once will save processing time, etc.
참고URL : https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql
'IT TIP' 카테고리의 다른 글
| 프로그래밍 방식으로 색상 밝게 (0) | 2020.10.31 |
|---|---|
| 자바 스크립트에서 객체가 해시인지 배열인지 어떻게 식별 할 수 있습니까? (0) | 2020.10.31 |
| 투영과 선택이란 무엇입니까? (0) | 2020.10.31 |
| REST 리소스 URL의 쿼리 문자열 (0) | 2020.10.31 |
| 마우스 오버시 캔버스에서 픽셀 색상 가져 오기 (0) | 2020.10.31 |