A "fuzzy way of comparing strings." If we have a string s1 and a string s2, the edit distance is the minimum cost sequence of operations to turn s1 into s2. Here are the operations we're allowed to do
All three of these operations cost 1
.
chase
.
chase
match(c, c)
match(h, h)
delete(r) - costs 1
add(a) - costs 1
match(a, a)
swap(i, s) - costs 1
swap(s, e) - costs 1
Costs a total of 4
.
chase
.
chase
match(c, c)
match(h, h)
swap(r, a) - costs 1
swap(i, s) - costs 1
swap(s, e) - costs 1
Costs a total of 3
.
fool
.
fools
swap(s, f) - costs 1
match(f, f)
delete(c) - costs 1
delete(h) - costs 1
match(o, o)
match(o, o)
match(l, l)
add(s) - costs 1
Best you can do is 4
edit("chris", "chase") = min(
edit("chri", "chase") + 1, # This costs 1 to delete 's' from "chris"
edit("chris", "chas") + 1, # This costs 1 to delete 'e' from "chase"
edit("chri", "chas") + 1, # I'm swapping in 'e' for 's' or 's' for 'e'
)
min(edit(s1[0:-1], s2)+1,
edit(s1, s2[0:-1]) + 1,
edit(s1[0:-1], s2[0:-1])) (+ 1 if s1[-1] is different from s2[-1])
We can design a recursive algorithm for edit distance by working backwards. We assume we know how much it costs to match substrings with one character chopped off, and considering three different cases.
It's not clear which one of these three options the best thing to do at the end, so we take the one with minimum cost after recursively solving for the three.
We also setup our base cases of matching an empty string to something else
edit("", "chase") = len("chase")
## Base cases
edit("", s2) = len(s2)
edit(s1, "") = len(s1)