In part one, we introduced Minimum Edit Distance and showed how to implement (as a dynamic programming approach) using just three Excel formulae. Then, we implemented it in VBA, as a UDF -- a user defined function that can be called from any cell. I ended that first post with the following note:
On other projects, when I've needed to run Minimum Edit Distance on millions of pairs of words, I ordered the words and then, based on shared prefixes, reused already partially-filled matrices. This achieves a more significant speed-up. In this case, Minimum Edit Distance is to be used as a UDF, and we cannot readily control the order in which pairs of words are presented to the function, and so this is not doable.
I've come up with an algorithmic improvement, in which, assuming the cost function for insertion, deletion and substitution are the same, we could change the runtime from O(n * m) to something usually much smaller. However, we won't delve into such algorithmic improvements here. Even if we can improve this particular algorithm, there will be other algorithms which we cannot improve in like manner.
Instead, we can use the same memoization technique already used in the dynamic programming approach. That is, MinEditDistance on these many pairs of words will be slow once. But once these results have been computed, we will store the results somewhere. Any time there is a cause to recompute a pair (e.g. an explicit call to recompute via pressing F9, or on workbook save), the UDF will first check if it has computed the result already. If it has, then it will reuse that result. Only if it has not already computed the result will it run the code for Minimum Edit Distance.
The difficulty with this is that we need someplace to store the computed results. We cannot store them on a hidden worksheet (writing something like ShtHidden.[B5].Value = ComputedResult) because MinEditDistance is a UDF. Recall that a UDF, which is able to be invoked from any cell, is not allowed to make any change to a workbook.
Instead, we will use the Scripting Dictionary. This dictionary is implemented as a Hashmap, such that lookup is O(1).
To use the scripting dictionary, we need to make use of the Microsoft Scripting Library, and the Scripting.Dictionary object within that library. We can add a Reference to the Microsoft Scripting Library via the Tools menu, References... menu item, and checking off the checkbox for this library:
This will allow us to declare our variable using the Dictionary data type.
We can assign an entry to a Dictionary because any changes to the Dictionary occurs in the code, rather than on the sheet.
Here is the new version of the UDF, with the lines number put in only for the new lines which handle memoization.
Then,
Then,
Finally,
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
The VBA code is as follows, with the line numbers put in only for the lines we have added:
Key1 is the Source string, used to look into the primary Dictionary, dictResults. Key2 is the Destination string, used to look into one of the Dictionaries stored in the primary dictionary.
Next,
If key1 exists in the primary Dictionary, then there has been a secondary Dictionary stored there. And so we assign that Dictionary to the dict variable. We then check if key2 exists in that secondary Dictionary. If it does, then we have matched both Source and Destination strings, and so at some time in the past, we have already computed MinEditDistance for those two strings. Therefore, we retrieve that computed value and return.
A second possibility is that key1 does not exist in the primary dictionary. If so, that particular Source string has never been used in a MinEditDistance operation. We create a new, blank, secondary dictionary and store it in the primary dictionary under the key key1. In this way, in the future, and later in the code, both primary and secondary dictionary will exist.
A third possibility is that key1 is found in the primary dictionary but key2 is not found in its secondary dictionary. If so, there is nothing to do. The the primary and secondary dictionaries exist, and we fall through this code without doing anything.
Finally,
We have changed the memoization step here. We confidently look up the secondary dictionary via a call to dictResults(key1). After all, if we have reached this point in the code, an entry in the primary dictionary for key1 either existed from before, as in line 40, or else was added in line 100. With that secondary dictionary in hand, we add an entry under the key key2, with the computed value MinEditDistance.
One last modification is, as before, to save on space. Earlier we said:
This memoization makes it so that we are not calling this time-consuming operation millions of times unnecessarily.
Yet, there is room for improvement. A Dictionary object resides in memory, rather than on the sheet. That means that if something were to wipe out that Dictionary, all the results would need to be recomputed. This would happen if there were some VBA error which caused a crash, if the user clicked on the Stop button in the VBA editor. Or, quite commonly, if the user closes the workbook. We need some way to make this persistent. Perhaps in a part three of this series we will cover this.
However, I've discovered that when there are a lot of calls to MinEditDistance in a workbook, then it can slow to a crawl.
This is because the performance of MinEditDistance is O(n * m), where n and m are the lengths of the source and destination strings. Perhaps in a later post we can consider how to improve this performance.We could try to improve the performance of the algorithm, and achieve some minor speed-up by tweaking the VBA. But, I am not going to focus on such minor tweaks here.
On other projects, when I've needed to run Minimum Edit Distance on millions of pairs of words, I ordered the words and then, based on shared prefixes, reused already partially-filled matrices. This achieves a more significant speed-up. In this case, Minimum Edit Distance is to be used as a UDF, and we cannot readily control the order in which pairs of words are presented to the function, and so this is not doable.
I've come up with an algorithmic improvement, in which, assuming the cost function for insertion, deletion and substitution are the same, we could change the runtime from O(n * m) to something usually much smaller. However, we won't delve into such algorithmic improvements here. Even if we can improve this particular algorithm, there will be other algorithms which we cannot improve in like manner.
Instead, we can use the same memoization technique already used in the dynamic programming approach. That is, MinEditDistance on these many pairs of words will be slow once. But once these results have been computed, we will store the results somewhere. Any time there is a cause to recompute a pair (e.g. an explicit call to recompute via pressing F9, or on workbook save), the UDF will first check if it has computed the result already. If it has, then it will reuse that result. Only if it has not already computed the result will it run the code for Minimum Edit Distance.
The difficulty with this is that we need someplace to store the computed results. We cannot store them on a hidden worksheet (writing something like ShtHidden.[B5].Value = ComputedResult) because MinEditDistance is a UDF. Recall that a UDF, which is able to be invoked from any cell, is not allowed to make any change to a workbook.
Instead, we will use the Scripting Dictionary. This dictionary is implemented as a Hashmap, such that lookup is O(1).
To use the scripting dictionary, we need to make use of the Microsoft Scripting Library, and the Scripting.Dictionary object within that library. We can add a Reference to the Microsoft Scripting Library via the Tools menu, References... menu item, and checking off the checkbox for this library:
This will allow us to declare our variable using the Dictionary data type.
Alternatively, we could just use CreateObject as follows:Option Explicit Dim dictResults As Dictionary Sub foo() Set dictResults = New Dictionary Call dictResults.Add("foo", "bar") Debug.Print dictResults("foo") End Sub
The former is better because it makes use of early binding. Also, we can use Intellisense to guide us while programming.Option Explicit Dim dictResults As Object Sub foo() Set dictResults = CreateObject("Scripting.Dictionary") Call dictResults.Add("foo", "bar") Debug.Print dictResults("foo") End Sub
We can assign an entry to a Dictionary because any changes to the Dictionary occurs in the code, rather than on the sheet.
Here is the new version of the UDF, with the lines number put in only for the new lines which handle memoization.
Outside the function, we have declared dictResults as a reference to a Dictionary object, though we do not create the actual Dictionary object.Option Explicit Dim dictResults As Dictionary Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long 10 If dictResults Is Nothing Then 20 Set dictResults = New Dictionary 30 End If 40 Dim key As String 50 key = Source & "@@@@@@@" & Destination 60 If dictResults.Exists(key) Then 70 MinEditDistance = dictResults(key) 80 Exit Function 90 End If Const DeletionCost = 1 Const InsertionCost = 1 Const SubstitutionCost = 1 Dim matrix() As Long ReDim matrix(Len(Source), Len(Destination)) ' initialize first column and row of array Dim c As Long, r As Long For c = 0 To Len(Destination) matrix(0, c) = c Next c For r = 0 To Len(Source) matrix(r, 0) = r Next r ' dynamic programming to fill matrix For r = 1 To Len(Source) For c = 1 To Len(Destination) matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0)) Next c Next r MinEditDistance = matrix(Len(Source), Len(Destination)) 100 dictResults.Add key, MinEditDistance End Function
Then,
We check if a dictResults has not been created (or if it was set to nothing). If so, we create it.10 If dictResults Is Nothing Then 20 Set dictResults = New Dictionary 30 End If
Then,
We use as our key the Source string followed by seven @ symbols followed by the Destination string. The hope is that the string "@@@@@@@" is unique enough that it will not occur in either the source or destination string. If this key already exists in the dictionary, then we have stored the result of the MinEditDistance computation, and so we retrieve it from the dictionary (an O(1) operation) and then return it. Otherwise, we proceed with the calculation.40 Dim key As String 50 key = Source & "@@@@@@@" & Destination 60 If dictResults.Exists(key) Then 70 MinEditDistance = dictResults(key) 80 Exit Function 90 End If
Finally,
At the very end of the function, besides assigning MinEditDistance (the return code) the result of the computation, we add it to the Dictionary under that key as well.100 dictResults.Add key, MinEditDistance
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:
Because in many cases we cannot be certain that the string "@@@@@@@" or any other selected string will not occur in either the Source or the Destination strings, we can instead maintain a Dictionary of Dictionaries. That is, the key of the first Dictionary will be just the Source string. Lookup by the Source key will yield another Dictionary, in which the key is the Destination string and the value is the calculated MinEditDistance.Dim key As String If Source < Destination Then key = Source & "@@@@@@@" & Destination Else key = Destination & "@@@@@@@" & Source End If
The VBA code is as follows, with the line numbers put in only for the lines we have added:
We now go through these lines in detail. First, we now have two keys, key1 and key2.Option Explicit Dim dictResults As Dictionary Function MinEditDistance(ByVal Source As String, ByVal Destination As String) As Long If dictResults Is Nothing Then Set dictResults = New Dictionary End If 10 Dim key1 As String, key2 As String 20 key1 = Source 30 key2 = Destination 40 If dictResults.Exists(key1) Then 50 Dim dict As Dictionary 60 Set dict = dictResults(key1) 70 If dict.Exists(key2) Then 72 MinEditDistance = dict(key2) 75 Exit Function 80 End If 90 Else 100 dictResults.Add key1, New Dictionary 110 End If Const DeletionCost = 1 Const InsertionCost = 1 Const SubstitutionCost = 1 Dim matrix() As Long ReDim matrix(Len(Source), Len(Destination)) ' initialize first column and row of array Dim c As Long, r As Long For c = 0 To Len(Destination) matrix(0, c) = c Next c For r = 0 To Len(Source) matrix(r, 0) = r Next r ' dynamic programming to fill matrix For r = 1 To Len(Source) For c = 1 To Len(Destination) matrix(r, c) = WorksheetFunction.Min(matrix(r - 1, c) + DeletionCost, matrix(r, c - 1) + InsertionCost, matrix(r - 1, c - 1) + IIf(Mid(Source, r, 1) <> Mid(Destination, c, 1), SubstitutionCost, 0)) Next c Next r MinEditDistance = matrix(Len(Source), Len(Destination)) 120 dictResults(key1).Add key2, MinEditDistance End Function
10 Dim key1 As String, key2 As String 20 key1 = Source 30 key2 = Destination
Next,
40 If dictResults.Exists(key1) Then 50 Dim dict As Dictionary 60 Set dict = dictResults(key1) 70 If dict.Exists(key2) Then 72 MinEditDistance = dict(key2) 75 Exit Function 80 End If 90 Else 100 dictResults.Add key1, New Dictionary 110 End If
If key1 exists in the primary Dictionary, then there has been a secondary Dictionary stored there. And so we assign that Dictionary to the dict variable. We then check if key2 exists in that secondary Dictionary. If it does, then we have matched both Source and Destination strings, and so at some time in the past, we have already computed MinEditDistance for those two strings. Therefore, we retrieve that computed value and return.
A second possibility is that key1 does not exist in the primary dictionary. If so, that particular Source string has never been used in a MinEditDistance operation. We create a new, blank, secondary dictionary and store it in the primary dictionary under the key key1. In this way, in the future, and later in the code, both primary and secondary dictionary will exist.
A third possibility is that key1 is found in the primary dictionary but key2 is not found in its secondary dictionary. If so, there is nothing to do. The the primary and secondary dictionaries exist, and we fall through this code without doing anything.
Finally,
120 dictResults(key1).Add key2, MinEditDistance
We have changed the memoization step here. We confidently look up the secondary dictionary via a call to dictResults(key1). After all, if we have reached this point in the code, an entry in the primary dictionary for key1 either existed from before, as in line 40, or else was added in line 100. With that secondary dictionary in hand, we add an entry under the key key2, with the computed value MinEditDistance.
One last modification is, as before, to save on space. Earlier we said:
Note that the Minimum Edit Distance of Source to Destination is the same as that of Destination to Source. If so, we could save on space in our dictionary by making the key the smaller of the strings followed by the larger of the strings:We do the same, as follows:
The rest of the code follows as before.Dim key1 As String, key2 As String If Source < Destination Then key1 = Source key2 = Destination Else key1 = Destination key2 = Source End If
This memoization makes it so that we are not calling this time-consuming operation millions of times unnecessarily.
Yet, there is room for improvement. A Dictionary object resides in memory, rather than on the sheet. That means that if something were to wipe out that Dictionary, all the results would need to be recomputed. This would happen if there were some VBA error which caused a crash, if the user clicked on the Stop button in the VBA editor. Or, quite commonly, if the user closes the workbook. We need some way to make this persistent. Perhaps in a part three of this series we will cover this.
0 comments:
Post a Comment