Edit Distance

It is the minimum number of edit operations to transform one document to another. Changes allowed are:
 * Add a letter
 * Remove a letter
 * Replace one letter by another

Document Similarity
Given two documents, how similar are they?
 * Plagiarism detection
 * Checking changes between versions of code
 * Answering web search queries more effectively

Brute Force

 * Delete all of the first document. Add all of the second document.
 * Impossibly Inefficient.

Naive Solution

 * Make the first character in both documents same. Explore all possible edit operations to do the same. Recursively fix the rest of the document.
 * Naive solution is inefficient as the same problem is solved recursively many times.

Dynamic Programming

 * Making recursive computations efficient by ensuing that each subproblem is computed only once.We store and look up answers to already solved subproblems.
 * Quite efficient.

Needleman-Wunsch's Algorithm
Consider two strings A[1 ... n] and B[1 ... m]. We define V(i,j) to be the score of alignment between A[1 ... i] and B[1 ... j] and score(A,B) is the score if a character A is aligned with character B. Best case: V(0,0) = 0 // no score for matching 2 empty strings

Recurrences: For i>0 and j>0: V(0,j) = V(0,j-1) + score(-,B[j]) // insert space j times to make alignment V(i,0) = V(i-1,0) + score(A[i],i) // delete i times to make alignment V(i,j) = max(option1,option2,option3), where option1 = V(i-1,j-1) + score(A[i],B[j]) // score of match or mismatch option2 = V(i-1,j) + score(A[i],-) // delete option3 = V(i,j-1) + score(-,B[j]) // insert

In short, this DP algorithm concentrates on three possibilities for the last pair of characters, which must either be a match/mismatch, a deletion, or an insertion. Although we do not know which one is the best, we can try all possibilities while avoiding re-computation of overlapping subproblems (i.e. basically a DP technique). With a simple cost function where a match gets a +2 point and mismatch, insert, delete all get a -1 point, the detail of string alignment score of A = 'ACAATCC' and B = 'AGCATGC' is shown in the figure below. The alignment score is 7 (bottom right). "String Alignment Example for A = 'ACAATCC' and B = 'AGCATGC' (score = 7)"Follow the dashed (red) arrows from the bottom right cell to reconstruct the solution. Diagonal arrow means a match or a mismatch (e.g. the last 'C'). Vertical arrow means a deletion (e.g. .. CAT.. to ..C_A..). Horizontal arrow means an insertion (e.g. A_C.. to AGC..). A = 'A_CAAT[C]C' B = 'AGC_AT[G]C' As we need to fill in all the entries in the table of n X m matrix and each entry can be computed in O(1), the time complexity is O(nm). The space complexity is O(nm) = the size of the DP table.

Variations

 * Interested in only the meaning of the document
 * Focus on words
 * Documents are near if they overlap in many words
 * Order in which words occur may not matter
 * Useful for topic based web search
 * Can have dictionary of "similar" words