Wednesday, March 25, 2015

Minimum Edit Distance, part 2

In part one, we introduced Minimum Edit Distance and showed how to implement (as a dynamic programming approach) using just three Excel formulae. Then, we implemented it in VBA, as a UDF -- a user defined function that can be called from any cell. I ended that first post with the following note:
However, I've discovered that when there are a lot of calls to MinEditDistance in a workbook, then it can slow to a crawl. 
This is because the performance of MinEditDistance is O(n * m), where n and m are the lengths of the source and destination strings. Perhaps in a later post we can consider how to improve this performance.
We could try to improve the performance of the algorithm, and achieve some minor speed-up by tweaking the VBA. But, I am not going to focus on such minor tweaks here.

On other projects, when I've needed to run Minimum Edit Distance on millions of pairs of words, I ordered the words and then, based on shared prefixes, reused already partially-filled matrices. This achieves a more significant speed-up. In this case, Minimum Edit Distance is to be used as a UDF, and we cannot readily control the order in which pairs of words are presented to the function, and so this is not doable.

I've come up with an algorithmic improvement, in which, assuming the cost function for insertion, deletion and substitution are the same, we could change the runtime from O(n * m) to something usually much smaller. However, we won't delve into such algorithmic improvements here. Even if we can improve this particular algorithm, there will be other algorithms which we cannot improve in like manner.

Instead, we can use the same memoization technique already used in the dynamic programming approach. That is, MinEditDistance on these many pairs of words will be slow once. But once these results have been computed, we will store the results somewhere. Any time there is a cause to recompute a pair (e.g. an explicit call to recompute via pressing F9, or on workbook save), the UDF will first check if it has computed the result already. If it has, then it will reuse that result. Only if it has not already computed the result will it run the code for Minimum Edit Distance.

The difficulty with this is that we need someplace to store the computed results. We cannot store them on a hidden worksheet (writing something like ShtHidden.[B5].Value = ComputedResult) because MinEditDistance is a UDF. Recall that a UDF, which is able to be invoked from any cell, is not allowed to make any change to a workbook.

Instead, we will use the Scripting Dictionary. This dictionary is implemented as a Hashmap, such that lookup is O(1).

To use the scripting dictionary, we need to make use of the Microsoft Scripting Library, and the Scripting.Dictionary object within that library. We can add a Reference to the Microsoft Scripting Library via the Tools menu, References... menu item, and checking off the checkbox for this library:


This will allow us to declare our variable using the Dictionary data type.
Option Explicit

Dim dictResults As Dictionary

Sub foo()
    Set dictResults = New Dictionary
    Call dictResults.Add("foo", "bar")
    Debug.Print dictResults("foo")
End Sub
Alternatively, we could just use CreateObject as follows:

Option Explicit

Dim dictResults As Object

Sub foo()
    Set dictResults = CreateObject("Scripting.Dictionary")
    Call dictResults.Add("foo", "bar")
    Debug.Print dictResults("foo")
End Sub
The former is better because it makes use of early binding. Also, we can use Intellisense to guide us while programming.

We can assign an entry to a Dictionary because any changes to the Dictionary occurs in the code, rather than on the sheet.

Here is the new version of the UDF, with the lines number put in only for the new lines which handle memoization.
Option Explicit

Dim dictResults As Dictionary

Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
10  If dictResults Is Nothing Then
20      Set dictResults = New Dictionary
30  End If
    
40  Dim key As String
50  key = Source & "@@@@@@@" & Destination
60  If dictResults.Exists(key) Then
70      MinEditDistance = dictResults(key)
80      Exit Function
90  End If

    Const DeletionCost = 1
    Const InsertionCost = 1
    Const SubstitutionCost = 1
        
    Dim matrix() As Long
    ReDim matrix(Len(Source), Len(Destination))
    
    ' initialize first column and row of array
    Dim c As Long, r As Long
    For c = 0 To Len(Destination)
        matrix(0, c) = c
    Next c
    
    For r = 0 To Len(Source)
        matrix(r, 0) = r
    Next r
    
    ' dynamic programming to fill matrix
    For r = 1 To Len(Source)
        For c = 1 To Len(Destination)
            matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
        Next c
    Next r
    
    MinEditDistance = matrix(Len(Source), Len(Destination))
100 dictResults.Add key, MinEditDistance
End Function
Outside the function, we have declared dictResults as a reference to a Dictionary object, though we do not create the actual Dictionary object.

Then,
10  If dictResults Is Nothing Then
20      Set dictResults = New Dictionary
30  End If
We check if a dictResults has not been created (or if it was set to nothing). If so, we create it.

Then,
40  Dim key As String
50  key = Source & "@@@@@@@" & Destination
60  If dictResults.Exists(key) Then
70      MinEditDistance = dictResults(key)
80      Exit Function
90  End If
We use as our key the Source string followed by seven @ symbols followed by the Destination string. The hope is that the string  "@@@@@@@" is unique enough that it will not occur in either the source or destination string. If this key already exists in the dictionary, then we have stored the result of the MinEditDistance computation, and so we retrieve it from the dictionary (an O(1) operation) and then return it. Otherwise, we proceed with the calculation.

Finally,
100 dictResults.Add key, MinEditDistance
At the very end of the function, besides assigning MinEditDistance (the return code) the result of the computation, we add it to the Dictionary under that key as well.

Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
Dim key As String
If Source < Destination Then
    key = Source & "@@@@@@@" & Destination
Else
    key = Destination & "@@@@@@@" & Source
End If
Because in many cases we cannot be certain that the string "@@@@@@@" or any other selected string will not occur in either the Source or the Destination strings, we can instead maintain a Dictionary of Dictionaries. That is, the key of the first Dictionary will be just the Source string. Lookup by the Source key will yield another Dictionary, in which the key is the Destination string and the value is the calculated MinEditDistance.

The VBA code is as follows, with the line numbers put in only for the lines we have added:
Option Explicit

Dim dictResults As Dictionary

Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
    If dictResults Is Nothing Then
        Set dictResults = New Dictionary
    End If
    
10  Dim key1 As String, key2 As String
20  key1 = Source
30  key2 = Destination
    
40  If dictResults.Exists(key1) Then
50      Dim dict As Dictionary
60      Set dict = dictResults(key1)
70      If dict.Exists(key2) Then
72          MinEditDistance = dict(key2)
75          Exit Function
80      End If
90  Else
100     dictResults.Add key1, New Dictionary
110 End If

    Const DeletionCost = 1
    Const InsertionCost = 1
    Const SubstitutionCost = 1
        
    Dim matrix() As Long
    ReDim matrix(Len(Source), Len(Destination))
    
    ' initialize first column and row of array
    Dim c As Long, r As Long
    For c = 0 To Len(Destination)
        matrix(0, c) = c
    Next c
    
    For r = 0 To Len(Source)
        matrix(r, 0) = r
    Next r
    
    ' dynamic programming to fill matrix
    For r = 1 To Len(Source)
        For c = 1 To Len(Destination)
            matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
        Next c
    Next r
    
    MinEditDistance = matrix(Len(Source), Len(Destination))
120 dictResults(key1).Add key2, MinEditDistance
End Function
We now go through these lines in detail. First, we now have two keys, key1 and key2.

10  Dim key1 As String, key2 As String
20  key1 = Source
30  key2 = Destination

Key1 is the Source string, used to look into the primary Dictionary, dictResults. Key2 is the Destination string, used to look into one of the Dictionaries stored in the primary dictionary.

Next,

40  If dictResults.Exists(key1) Then
50      Dim dict As Dictionary
60      Set dict = dictResults(key1)
70      If dict.Exists(key2) Then
72          MinEditDistance = dict(key2)
75          Exit Function
80      End If
90  Else
100     dictResults.Add key1, New Dictionary
110 End If

If key1 exists in the primary Dictionary, then there has been a secondary Dictionary stored there. And so we assign that Dictionary to the dict variable. We then check if key2 exists in that secondary Dictionary. If it does, then we have matched both Source and Destination strings, and so at some time in the past, we have already computed MinEditDistance for those two strings. Therefore, we retrieve that computed value and return.

A second possibility is that key1 does not exist in the primary dictionary. If so, that particular Source string has never been used in a MinEditDistance operation. We create a new, blank, secondary dictionary and store it in the primary dictionary under the key key1. In this way, in the future, and later in the code, both primary and secondary dictionary will exist.

A third possibility is that key1 is found in the primary dictionary but key2 is not found in its secondary dictionary. If so, there is nothing to do. The the primary and secondary dictionaries exist, and we fall through this code without doing anything.

Finally,

120 dictResults(key1).Add key2, MinEditDistance

We have changed the memoization step here. We confidently look up the secondary dictionary via a call to dictResults(key1). After all, if we have reached this point in the code, an entry in the primary dictionary for key1 either existed from before, as in line 40, or else was added in line 100. With that secondary dictionary in hand, we add an entry under the key key2, with the computed value MinEditDistance.

One last modification is, as before, to save on space. Earlier we said:
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
We do the same, as follows:
Dim key1 As String, key2 As String
If Source < Destination Then
    key1 = Source
    key2 = Destination
Else
    key1 = Destination
    key2 = Source
End If
The rest of the code follows as before.

This memoization makes it so that we are not calling this time-consuming operation millions of times unnecessarily.

Yet, there is room for improvement. A Dictionary object resides in memory, rather than on the sheet. That means that if something were to wipe out that Dictionary, all the results would need to be recomputed. This would happen if there were some VBA error which caused a crash, if the user clicked on the Stop button in the VBA editor. Or, quite commonly, if the user closes the workbook. We need some way to make this persistent. Perhaps in a part three of this series we will cover this.

Sunday, March 1, 2015

Minimum Edit distance, part 1

Minimum Edit distance can be used to perform fuzzy string matching.

Instead of calculating whether a string S and a string T are identical, you can use minimum edit distance to see how close two strings are to each other.

For instance, KITCHEN and MITTEN are not identical, but they share several common letters, in order. The minimum edit distance is the minimum number of insert, delete, and substitute operations which will transform the source string into the destination string.

This can be used, together with a threshold, to discover if two strings are close enough to each other that you would consider them identical.

(You can modify this basic minimum edit distance in various ways. For instance, here, the cost of each insertion, deletion, or substitution operation is 1. You can define a cost function so that these operations have different costs.)

I constructed the rather neat spreadsheet below, which calculates Minimum Edit Distance using only three formulas. Perhaps first read the Wikipedia entry on Minimum Edit Distance works.


 

In cell B3 is the value 0.
In C3 (and then copied to the right) is =B3+1
In B4 (and then copied down) is =B3+1
In C4 (and then copied down and to the right) is =MIN(C3+1, B4+1, B3+(C$2<>$A4))

The way this works is as follows. KITCHEN is the source word and SMITTEN is the destination word. Row 3 shows the cost of inserting each letter of the destination word where the source word (in the third row) is still null. Column B shows the cost of deleting each letter of the source word where the destination (in column B) is still null.

The formula is C4 is where all the calculation occurs. That formula, again, is:
=MIN(C3+1, B4+1, B3+(C$2<>$A4))

That is, take the minimum of three values:

  • C3+1: same column, one row up, plus the deletion cost of 1
  • B4+1: same row, one column left, plus the insertion cost of 1
  • B3+(C$2<>$A4): one column left, one row up, plus the substitution cost. The substitution cost will either be 0 or 1, as follows. If C$2 (the current letter of the destination word, with a mixed reference of $2 fixing the row) <> $A4 (the current letter of the source word, with a mixed reference of $A fixing the column), then 1. Otherwise, 0.

--------------------

We could also code the same in VBA, as a UDF:

Option Explicit

Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
    Const DeletionCost = 1
    Const InsertionCost = 1
    Const SubstitutionCost = 1
    Dim matrix() As Long
    ReDim matrix(Len(Source), Len(Destination))
    
    ' initialize first column and row of array
    Dim c As Long, r As Long
    For c = 0 To Len(Destination)
        matrix(0, c) = c
    Next c
    
    For r = 0 To Len(Source)
        matrix(r, 0) = r
    Next r
    
    ' dynamic programming to fill matrix
    For r = 1 To Len(Source)
        For c = 1 To Len(Destination)
            matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
        Next c
    Next r
    
    MinEditDistance = matrix(Len(Source), Len(Destination))
End Function

------------------

However, I've discovered that when there are a lot of calls to MinEditDistance in a workbook, then it can slow to a crawl.

This is because the performance of MinEditDistance is O(n * m), where n and m are the lengths of the source and destination strings. Perhaps in a later post we can consider how to improve this performance.
to be continued...