String Edit Distance

A "fuzzy way of comparing strings." If we have a string s1 and a string s2, the edit distance is the minimum cost sequence of operations to turn s1 into s2. Here are the operations we're allowed to do

  • Add a character to s1
  • Delete a character from s1
  • Swap a character in s1

All three of these operations cost 1

Ex) "chris" "chase"

In [ ]:
    .
chase

    .
chase

match(c, c)
match(h, h)
delete(r) - costs 1
add(a) - costs 1
match(a, a)
swap(i, s) - costs 1
swap(s, e) - costs 1

Costs a total of 4
In [ ]:
    .
chase

    .
chase

match(c, c)
match(h, h)
swap(r, a) - costs 1
swap(i, s) - costs 1
swap(s, e) - costs 1

Costs a total of 3

Ex) "school" vs "fools"

In [ ]:
    .
fool

    .
fools

swap(s, f) - costs 1
match(f, f)
delete(c) - costs 1
delete(h) - costs 1
match(o, o)
match(o, o)
match(l, l)
add(s) - costs 1

Best you can do is 4

Edit Distance Recursive Algorithm

In [ ]:
edit("chris", "chase") = min(

    edit("chri", "chase") + 1,  # This costs 1 to delete 's' from "chris"
    
    edit("chris", "chas") + 1,  # This costs 1 to delete 'e' from "chase"
    
    edit("chri", "chas") + 1, # I'm swapping in 'e' for 's' or 's' for 'e'
)
In [1]:
min(edit(s1[0:-1], s2)+1, 
    edit(s1, s2[0:-1]) + 1, 
    edit(s1[0:-1], s2[0:-1])) (+ 1 if s1[-1] is different from s2[-1])
Out[1]:
1

We can design a recursive algorithm for edit distance by working backwards. We assume we know how much it costs to match substrings with one character chopped off, and considering three different cases.

  1. We can chop off the last character of the first string at a cost of 1, then match.
  2. We can chop off the last character of the second string at a cost of 1, then match.
  3. We can chop off both characters from each string, match the two strings, and then match the characters that were chopped off. If they're different, it's akin to a swap of cost 1. If they're the same, then they match at cost 0

It's not clear which one of these three options the best thing to do at the end, so we take the one with minimum cost after recursively solving for the three.

We also setup our base cases of matching an empty string to something else

In [ ]:
edit("", "chase") = len("chase")

## Base cases

edit("", s2) = len(s2)
edit(s1, "") = len(s1)