2011年6月18日土曜日

Levenshtein Distance

I recently found a classic dynamic programming problem called "Edit distance," aka,"Levenshtein distance." It's a distance between a string to another string, literally.
More specifically, how many procedure are needed to change a string A to another string B.
Allowed procedure are:
1. Insert a character into a string.
2. Delete a character from a string.
3. Change a character within a string.

For instance, consider the following situation.
You are given "kitten" as A, "sitting" as B, respectively.

0. kitten
1. sitten (change the 1st character from k to s)
2. sittin (change the 5th character from e to i)
3. sitting (insert g at the end of string)

The preceded procedure change a string "kitten" into "sitting", and also this is the minimal number of procedures needed. So the Levenshtein distance between "kitten" and "sitting" is 3.

This problem can be solved by dynamic programming. let dp[i][j] be the Levenshtein distance between substring(A, i) and substring(B, j). Here, substring(X, n) is the sub string of X between 1 and n, inclusive. (notice that n is 1-based number)

Then we come down to the next recurrence formula:
dp[i][0] = i
dp[0][j] = j
dp[i][j] = min (dp[i-1][j], dp[i][j-1], f(i, j))
where f(i, j) = 1 if A[i] == B[i]
f(i, j) = 0 if A[i] != B[i]

So, let's solve a problem on PKU. As in the problem, Levenshtein distance is helpful in dealing with DNA. I'll put my solution to it in the case my explanation above doesn't make sense to you.

`int dp[1024][1024];int main() {    int n1, n2;    string s1, s2;    while (cin >> n1 >> s1 >> n2 >> s2) {        for (int i = 0; i <= n1; i++)            dp[i][0] = i;        for (int i = 0; i <= n2; i++)            dp[0][i] = i;        for (int i = 1; i <= n1; i++) {            for (int j = 1; j <= n2; j++) {                dp[i][j] = min(dp[i][j-1]+1, dp[i-1][j]+1);                if (s1[i-1] == s2[j-1])                    dp[i][j] = min(dp[i][j], dp[i-1][j-1]);                else                    dp[i][j] = min(dp[i][j], dp[i-1][j-1]+1);            }        }        cout << dp[n1][n2] << endl;    }    return 0;}`