In
part one, we introduced Minimum Edit Distance and showed how to implement (as a dynamic programming approach) using just three Excel formulae. Then, we implemented it in VBA, as a UDF -- a user defined function that can be called from any cell. I ended that first post with the following note:
However, I've discovered that when there are a lot of calls to MinEditDistance in a workbook, then it can slow to a crawl.
This is because the performance of MinEditDistance is O(n * m), where n and m are the lengths of the source and destination strings. Perhaps in a later post we can consider how to improve this performance.
We could try to improve the performance of the algorithm, and achieve some
minor speed-up by tweaking the VBA. But, I am not going to focus on such minor tweaks here.
On other projects, when I've needed to run Minimum Edit Distance on millions of pairs of words, I ordered the words and then, based on shared prefixes, reused already partially-filled matrices. This achieves a more significant speed-up. In this case, Minimum Edit Distance is to be used as a UDF, and we cannot readily control the order in which pairs of words are presented to the function, and so this is not doable.
I've come up with an algorithmic improvement, in which, assuming the cost function for insertion, deletion and substitution are the same, we could change the runtime from O(n * m) to something usually much smaller. However, we won't delve into such algorithmic improvements here. Even if we can improve this particular algorithm, there will be other algorithms which we cannot improve in like manner.
Instead, we can use the same memoization technique already used in the dynamic programming approach. That is, MinEditDistance on these many pairs of words will be slow
once. But once these results have been computed, we will store the results somewhere. Any time there is a cause to recompute a pair (e.g. an explicit call to recompute via pressing F9, or on workbook save), the UDF will first check if it has computed the result already. If it has, then it will reuse that result. Only if it has not already computed the result will it run the code for Minimum Edit Distance.
The difficulty with this is that we need someplace to store the computed results. We
cannot store them on a hidden worksheet (writing something like
ShtHidden.[B5].Value = ComputedResult) because MinEditDistance is a UDF. Recall that a UDF, which is able to be invoked from any cell, is
not allowed to make any change to a workbook.
Instead, we will use the Scripting Dictionary. This dictionary is implemented as a Hashmap, such that lookup is O(1).
To use the scripting dictionary, we need to make use of the Microsoft Scripting Library, and the Scripting.Dictionary object within that library. We can add a Reference to the Microsoft Scripting Library via the Tools menu, References... menu item, and checking off the checkbox for this library:
This will allow us to declare our variable using the Dictionary data type.
Option Explicit
Dim dictResults As Dictionary
Sub foo()
Set dictResults = New Dictionary
Call dictResults.Add("foo", "bar")
Debug.Print dictResults("foo")
End Sub
Alternatively, we could just use CreateObject as follows:
Option Explicit
Dim dictResults As Object
Sub foo()
Set dictResults = CreateObject("Scripting.Dictionary")
Call dictResults.Add("foo", "bar")
Debug.Print dictResults("foo")
End Sub
The former is better because it makes use of early binding. Also, we can use Intellisense to guide us while programming.
We can assign an entry to a Dictionary because any changes to the Dictionary occurs in the code, rather than on the sheet.
Here is the new version of the UDF, with the lines number put in only for the new lines which handle memoization.
Option Explicit
Dim dictResults As Dictionary
Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
10 If dictResults Is Nothing Then
20 Set dictResults = New Dictionary
30 End If
40 Dim key As String
50 key = Source & "@@@@@@@" & Destination
60 If dictResults.Exists(key) Then
70 MinEditDistance = dictResults(key)
80 Exit Function
90 End If
Const DeletionCost = 1
Const InsertionCost = 1
Const SubstitutionCost = 1
Dim matrix() As Long
ReDim matrix(Len(Source), Len(Destination))
' initialize first column and row of array
Dim c As Long, r As Long
For c = 0 To Len(Destination)
matrix(0, c) = c
Next c
For r = 0 To Len(Source)
matrix(r, 0) = r
Next r
' dynamic programming to fill matrix
For r = 1 To Len(Source)
For c = 1 To Len(Destination)
matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
Next c
Next r
MinEditDistance = matrix(Len(Source), Len(Destination))
100 dictResults.Add key, MinEditDistance
End Function
Outside the function, we have declared dictResults as a reference to a Dictionary object, though we do not create the actual Dictionary object.
Then,
10 If dictResults Is Nothing Then
20 Set dictResults = New Dictionary
30 End If
We check if a dictResults has not been created (or if it was set to nothing). If so, we create it.
Then,
40 Dim key As String
50 key = Source & "@@@@@@@" & Destination
60 If dictResults.Exists(key) Then
70 MinEditDistance = dictResults(key)
80 Exit Function
90 End If
We use as our key the Source string followed by seven @ symbols followed by the Destination string. The hope is that the string
"@@@@@@@" is unique enough that it will not occur in either the source or destination string. If this key already exists in the dictionary, then we have stored the result of the MinEditDistance computation, and so we retrieve it from the dictionary (an O(1) operation) and then return it. Otherwise, we proceed with the calculation.
Finally,
100 dictResults.Add key, MinEditDistance
At the very end of the function, besides assigning MinEditDistance (the return code) the result of the computation, we add it to the Dictionary under that key as well.
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
Dim key As String
If Source < Destination Then
key = Source & "@@@@@@@" & Destination
Else
key = Destination & "@@@@@@@" & Source
End If
Because in many cases we cannot be certain that the string
"@@@@@@@" or any other selected string will not occur in either the Source or the Destination strings, we can instead maintain a Dictionary of Dictionaries. That is, the key of the first Dictionary will be just the Source string. Lookup by the Source key will yield another Dictionary, in which the key is the Destination string and the value is the calculated MinEditDistance.
The VBA code is as follows, with the line numbers put in only for the lines we have added:
Option Explicit
Dim dictResults As Dictionary
Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long
If dictResults Is Nothing Then
Set dictResults = New Dictionary
End If
10 Dim key1 As String, key2 As String
20 key1 = Source
30 key2 = Destination
40 If dictResults.Exists(key1) Then
50 Dim dict As Dictionary
60 Set dict = dictResults(key1)
70 If dict.Exists(key2) Then
72 MinEditDistance = dict(key2)
75 Exit Function
80 End If
90 Else
100 dictResults.Add key1, New Dictionary
110 End If
Const DeletionCost = 1
Const InsertionCost = 1
Const SubstitutionCost = 1
Dim matrix() As Long
ReDim matrix(Len(Source), Len(Destination))
' initialize first column and row of array
Dim c As Long, r As Long
For c = 0 To Len(Destination)
matrix(0, c) = c
Next c
For r = 0 To Len(Source)
matrix(r, 0) = r
Next r
' dynamic programming to fill matrix
For r = 1 To Len(Source)
For c = 1 To Len(Destination)
matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0))
Next c
Next r
MinEditDistance = matrix(Len(Source), Len(Destination))
120 dictResults(key1).Add key2, MinEditDistance
End Function
We now go through these lines in detail. First, we now have two keys, key1 and key2.
10 Dim key1 As String, key2 As String
20 key1 = Source
30 key2 = Destination
Key1 is the Source string, used to look into the primary Dictionary, dictResults. Key2 is the Destination string, used to look into one of the Dictionaries stored in the primary dictionary.
Next,
40 If dictResults.Exists(key1) Then
50 Dim dict As Dictionary
60 Set dict = dictResults(key1)
70 If dict.Exists(key2) Then
72 MinEditDistance = dict(key2)
75 Exit Function
80 End If
90 Else
100 dictResults.Add key1, New Dictionary
110 End If
If key1 exists in the primary Dictionary, then there has been a secondary Dictionary stored there. And so we assign that Dictionary to the dict variable. We then check if key2 exists in that secondary Dictionary. If it does, then we have matched both Source and Destination strings, and so at some time in the past, we have already computed MinEditDistance for those two strings. Therefore, we retrieve that computed value and return.
A second possibility is that key1 does not exist in the primary dictionary. If so, that particular Source string has never been used in a MinEditDistance operation. We create a new,
blank, secondary dictionary and store it in the primary dictionary under the key key1. In this way, in the future, and later in the code, both primary and secondary dictionary will exist.
A third possibility is that key1 is found in the primary dictionary but key2 is not found in its secondary dictionary. If so, there is nothing to do. The the primary and secondary dictionaries exist, and we fall through this code without doing anything.
Finally,
120 dictResults(key1).Add key2, MinEditDistance
We have changed the memoization step here. We confidently look up the secondary dictionary via a call to dictResults
(key1
). After all, if we have reached this point in the code, an entry in the primary dictionary for key1 either existed from before, as in line 40, or else was added in line 100. With that secondary dictionary in hand, we add an entry under the key key2, with the computed value MinEditDistance.
One last modification is, as before, to save on space. Earlier we said:
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
We do the same, as follows:
Dim key1 As String, key2 As String
If Source < Destination Then
key1 = Source
key2 = Destination
Else
key1 = Destination
key2 = Source
End If
The rest of the code follows as before.
This memoization makes it so that we are not calling this time-consuming operation millions of times unnecessarily.
Yet, there is room for improvement. A Dictionary object resides in memory, rather than on the sheet. That means that if something were to wipe out that Dictionary, all the results would need to be recomputed. This would happen if there were some VBA error which caused a crash, if the user clicked on the Stop button in the VBA editor. Or, quite commonly, if the user
closes the workbook. We need some way to make this persistent. Perhaps in a part three of this series we will cover this.