Sunday, March 1, 2015

Minimum Edit distance, part 1

Minimum Edit distance can be used to perform fuzzy string matching.

Instead of calculating whether a string S and a string T are identical, you can use minimum edit distance to see how close two strings are to each other.

For instance, KITCHEN and MITTEN are not identical, but they share several common letters, in order. The minimum edit distance is the minimum number of insert, delete, and substitute operations which will transform the source string into the destination string.

This can be used, together with a threshold, to discover if two strings are close enough to each other that you would consider them identical.

(You can modify this basic minimum edit distance in various ways. For instance, here, the cost of each insertion, deletion, or substitution operation is 1. You can define a cost function so that these operations have different costs.)

I constructed the rather neat spreadsheet below, which calculates Minimum Edit Distance using only three formulas. Perhaps first read the Wikipedia entry on Minimum Edit Distance works.


 

In cell B3 is the value 0.
In C3 (and then copied to the right) is =B3+1
In B4 (and then copied down) is =B3+1
In C4 (and then copied down and to the right) is =MIN(C3+1, B4+1, B3+(C$2<>$A4))

The way this works is as follows. KITCHEN is the source word and SMITTEN is the destination word. Row 3 shows the cost of inserting each letter of the destination word where the source word (in the third row) is still null. Column B shows the cost of deleting each letter of the source word where the destination (in column B) is still null.

The formula is C4 is where all the calculation occurs. That formula, again, is:
=MIN(C3+1, B4+1, B3+(C$2<>$A4))

That is, take the minimum of three values:

  • C3+1: same column, one row up, plus the deletion cost of 1
  • B4+1: same row, one column left, plus the insertion cost of 1
  • B3+(C$2<>$A4): one column left, one row up, plus the substitution cost. The substitution cost will either be 0 or 1, as follows. If C$2 (the current letter of the destination word, with a mixed reference of $2 fixing the row) <> $A4 (the current letter of the source word, with a mixed reference of $A fixing the column), then 1. Otherwise, 0.

--------------------

We could also code the same in VBA, as a UDF:

Option Explicit

Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
    Const DeletionCost = 1
    Const InsertionCost = 1
    Const SubstitutionCost = 1
    Dim matrix() As Long
    ReDim matrix(Len(Source), Len(Destination))
    
    ' initialize first column and row of array
    Dim c As Long, r As Long
    For c = 0 To Len(Destination)
        matrix(0, c) = c
    Next c
    
    For r = 0 To Len(Source)
        matrix(r, 0) = r
    Next r
    
    ' dynamic programming to fill matrix
    For r = 1 To Len(Source)
        For c = 1 To Len(Destination)
            matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
        Next c
    Next r
    
    MinEditDistance = matrix(Len(Source), Len(Destination))
End Function

------------------

However, I've discovered that when there are a lot of calls to MinEditDistance in a workbook, then it can slow to a crawl.

This is because the performance of MinEditDistance is O(n * m), where n and m are the lengths of the source and destination strings. Perhaps in a later post we can consider how to improve this performance.
to be continued...

0 comments: