Implementing a simple text diff tool, starting from an algorithm problem

Time:2022-1-2

As we all know, many communities have a content review mechanism: not only the first publication but also every later modification has to be reviewed. The crudest approach is to read the whole text again from the beginning, but the reviewers would want to kill you. It is obviously inefficient; if you only fixed a typo, a reviewer may read the text several times without spotting it. It would be much more convenient to know exactly what changed each time, like git's diff. This article implements a simple version of that.

Find the longest common subsequence

To find the difference between two texts, we can first find the content they have in common; whatever is left over was either deleted or added. This is a classic algorithm problem, LeetCode problem 1143, Longest Common Subsequence, which is described as follows:

image-20210816195639935.png

Problems that ask for an optimal value like this are usually solved with dynamic programming. Dynamic programming works a bit like step-by-step reasoning: it can be solved top-down with recursion, or bottom-up with a for loop. The loop version typically uses an array called dp, and how many dimensions it needs depends on the problem. Here there are two variables (the lengths of the two strings), so we use a two-dimensional array and define dp[i][j] as the length of the longest common subsequence of the first i characters of text1 and the first j characters of text2. First consider i = 0: the substring of text1 is empty, so whatever j is, the longest common subsequence has length 0. The same holds when j = 0. So we can initialize a dp array whose values are all 0:

let longestCommonSubsequence = function (text1, text2) {
    let m = text1.length
    let n = text2.length
    // fill(0) is needed here: forEach skips the empty slots of a bare new Array(m + 1)
    let dp = new Array(m + 1).fill(0)
    dp.forEach((item, index) => {
        dp[index] = new Array(n + 1).fill(0)
    })
}
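A side note on the initialization, as a minimal sketch: forEach is never called for the empty slots of an array created with new Array(length), so the rows are only created if the outer array is filled first. Array.from with a mapping function is another way to build the table that gives every row its own array. The helper name createTable is hypothetical and not part of the article's code.

```javascript
// forEach is not invoked for empty slots, so this callback never runs
let visited = 0;
new Array(3).forEach(() => { visited++; });
console.log(visited); // 0

// Hypothetical helper: build a rows x cols table filled with 0
function createTable(rows, cols) {
  // The mapping function runs once per row, so rows are independent arrays
  return Array.from({ length: rows }, () => new Array(cols).fill(0));
}

const table = createTable(3, 4);
table[1][2] = 7; // only row 1 changes
console.log(table);
```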

When neither i nor j is 0, there are two cases to consider:

1. When text1[i - 1] === text2[j - 1], the characters at these two positions are the same, so they can safely be counted as part of the longest common subsequence, and the current length depends on the substrings before them, that is, dp[i][j] = 1 + dp[i - 1][j - 1]

2. When text1[i - 1] !== text2[j - 1], dp[i][j] clearly depends on earlier states, of which there are three: dp[i - 1][j - 1], dp[i][j - 1] and dp[i - 1][j]. The first can be ruled out because it can never be larger than the other two: each of those covers one more character than the first, so it may be longer by 1. We therefore take the larger of the latter two.

Next, we just need a double loop to traverse all cases of the two-dimensional array:

let longestCommonSubsequence = function (text1, text2) {
    let m = text1.length
    let n = text2.length
    //Initialize 2D array
    let dp = new Array(m + 1).fill(0)
    dp.forEach((item, index) => {
        dp[index] = new Array(n + 1).fill(0)
    })
    for(let i = 1; i <= m; i++) {
        // i and j both start at 1, so subtract 1 to get the character index
        let t1 = text1[i - 1]
        for(let j = 1; j <= n; j++) {
            let t2 = text2[j - 1]
            // Case 1
            if (t1 === t2) {
                dp[i][j] = 1 + dp[i - 1][j - 1]
            } else { // Case 2
                dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1])
            }
        }
    }
}
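To make the recurrence concrete, here is a small self-contained sketch that fills the table for two short strings and returns it. The function name lcsTable and the sample strings are assumptions chosen for illustration; the loop body is the same two cases described above.

```javascript
// Hypothetical helper: returns the filled dp table for two strings
function lcsTable(text1, text2) {
  const m = text1.length;
  const n = text2.length;
  const dp = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      if (text1[i - 1] === text2[j - 1]) {
        // Case 1: same character, extend the subsequence of the two prefixes
        dp[i][j] = 1 + dp[i - 1][j - 1];
      } else {
        // Case 2: drop one character from either string and keep the better result
        dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
      }
    }
  }
  return dp;
}

const smallTable = lcsTable('abc', 'ac');
console.log(smallTable);
// [ [ 0, 0, 0 ], [ 0, 1, 1 ], [ 0, 1, 1 ], [ 0, 1, 2 ] ]
console.log(smallTable[3][2]); // 2, the length of the common subsequence "ac"
```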

The value of dp[m][n] is the length of the longest common subsequence, but knowing only the length is not enough; we need the actual positions, which takes another pass. Why not collect the positions in the t1 === t2 branch of the loop above? Because that loop compares every position of one string with every position of the other, so when a character occurs more than once the same position is collected repeatedly, as follows:

image-20210817191053130.png
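The duplication is easy to reproduce. The sketch below (the function name collectAllMatches is hypothetical) records a pair every time two characters are equal, which is what collecting inside the t1 === t2 branch would amount to:

```javascript
// Hypothetical helper: collect every (i, j) pair whose characters match
function collectAllMatches(text1, text2) {
  const pairs = [];
  for (let i = 0; i < text1.length; i++) {
    for (let j = 0; j < text2.length; j++) {
      if (text1[i] === text2[j]) {
        pairs.push([i, j]);
      }
    }
  }
  return pairs;
}

const pairs = collectAllMatches('aa', 'aa');
console.log(pairs);
// [ [ 0, 0 ], [ 0, 1 ], [ 1, 0 ], [ 1, 1 ] ]
// Four pairs are collected, although the longest common subsequence has length 2.
```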

We define a collect function that recursively decides whether positions i and j belong to the longest common subsequence. For positions i and j, if text1[i - 1] === text2[j - 1], both positions are obviously in the longest common subsequence, and next we only need to check i - 1 and j - 1. If the characters at the current positions differ, we can consult the dp array, because we already know every value in it:

image-20210817194735329.png

So there is no need to try every position again, and there are no duplicates. For example, if dp[i][j - 1] > dp[i - 1][j], the next positions to check are i and j - 1; otherwise check i - 1 and j. The recursion ends when either i or j reaches 0:

let arr1 = []
let arr2 = []
let collect = function (dp, text1, text2, i, j) {
    if (i <= 0 || j <= 0) {
        return
    }
    if (text1[i - 1] === text2[j - 1]) {
        //Collect the index of the same character in two strings
        arr1.push(i - 1)
        arr2.push(j - 1)
        return collect(dp, text1, text2, i - 1, j - 1)
    } else {
        if (dp[i][j - 1] > dp[i - 1][j]) {
            return collect(dp, text1, text2, i, j - 1)
        } else {
            return collect(dp, text1, text2, i - 1, j)
        }
    }
}

The results are as follows:

image-20210817202220822.png

You can see that the indexes come out in reverse order. Sort them in ascending order (the code later in this article walks the arrays assuming ascending order):

arr1.sort((a, b) => {
    return a - b
});
arr2.sort((a, b) => {
    return a - b
});
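Putting the pieces together, the following is a minimal self-contained sketch, not the author's exact code: it fills the table, then backtracks with a while loop instead of recursion (which avoids deep call stacks on long inputs) and returns both index arrays in ascending order. The function name lcsIndexes is an assumption; the tie-breaking rule matches the collect function above.

```javascript
// Hypothetical helper: returns [indexesInText1, indexesInText2] of one LCS
function lcsIndexes(text1, text2) {
  const m = text1.length;
  const n = text2.length;
  const dp = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i][j] = text1[i - 1] === text2[j - 1]
        ? 1 + dp[i - 1][j - 1]
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  // Walk back from the bottom-right corner of the table
  const arr1 = [];
  const arr2 = [];
  let i = m;
  let j = n;
  while (i > 0 && j > 0) {
    if (text1[i - 1] === text2[j - 1]) {
      arr1.push(i - 1);
      arr2.push(j - 1);
      i--;
      j--;
    } else if (dp[i][j - 1] > dp[i - 1][j]) {
      j--;
    } else {
      i--;
    }
  }
  // Indexes were collected back to front, so reverse them
  return [arr1.reverse(), arr2.reverse()];
}

console.log(lcsIndexes('abcde', 'ace')); // [ [ 0, 2, 4 ], [ 0, 1, 2 ] ]
```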

We are not finished yet. From the longest common subsequence we still have to work out which positions were deleted and which were added. This is fairly simple: traverse the two strings, and any character whose position is not in arr1 or arr2 was deleted or added, respectively:

let getDiffList = (text1, text2, arr1, arr2) => {
    let delList = []
    let addList = []
    //Traverse old string
    for (let i = 0; i < text1.length; i++) {
        // A position in the old string that is not in the common subsequence was deleted
        if (!arr1.includes(i)) {
            delList.push(i)
        }
    }
    //Traverse new string
    for (let i = 0; i < text2.length; i++) {
        // A position in the new string that is not in the common subsequence was added
        if (!arr2.includes(i)) {
            addList.push(i)
        }
    }
    return {
        delList,
        addList
    }
}

image-20210818094131108.png
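As a quick usage sketch, here is the same idea run on a tiny input. This variant is an assumption, not the article's code: it looks positions up in a Set rather than with includes, which keeps each lookup constant-time on longer strings. The sample strings and index arrays are made up for illustration.

```javascript
// Variant of getDiffList that uses Set lookups instead of Array.prototype.includes
function getDiffList(text1, text2, arr1, arr2) {
  const kept1 = new Set(arr1);
  const kept2 = new Set(arr2);
  const delList = [];
  const addList = [];
  for (let i = 0; i < text1.length; i++) {
    // Old positions outside the common subsequence were deleted
    if (!kept1.has(i)) delList.push(i);
  }
  for (let i = 0; i < text2.length; i++) {
    // New positions outside the common subsequence were added
    if (!kept2.has(i)) addList.push(i);
  }
  return { delList, addList };
}

// 'abc' -> 'axc': the common subsequence "ac" sits at indexes [0, 2] in both strings
const diffResult = getDiffList('abc', 'axc', [0, 2], [0, 2]);
console.log(diffResult); // { delList: [ 1 ], addList: [ 1 ] }
```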

Marking deletions and additions

Now that we know the common subsequence and the indexes of the added and deleted characters, we can mark them. For example, deleted characters get a red background and new ones a green background, so the changes can be seen at a glance.

For simplicity, we display additions and deletions in the same text, like this:

image-20210818094506055.png

Suppose there are two pieces of text to compare, each using \n to separate lines. We first split them into arrays of lines and then compare them line by line. If the old and new lines are equal, the line is added to the display array as is. Otherwise we work on the new line: a character that was added is wrapped in a tag, and a deleted character is given its corresponding position in the new line, wrapped in a tag and inserted there. The template part is as follows:

{{ index + 1 }}

Then compare the lines pair by pair:

export default {
    data () {
        return {
            oldTextArr: [],
            newTextArr: [],
            showTextArr: []
        }
    },
    mounted () {
        this.diff()
    },
    methods: {
        diff () {
            //Split old and new text into arrays
            this.oldTextArr = oldText.split(/\n+/g)
            this.newTextArr = newText.split(/\n+/g)
            let len = this.newTextArr.length
            for (let row = 0; row < len; row++) {
                //If the old and new texts are identical, there is no need to compare them
                if (this.oldTextArr[row] === this.newTextArr[row]) {
                    this.showTextArr.push(this.newTextArr[row])
                    continue
                }
                //Otherwise, the position of the longest common subsequence of old and new text is calculated
                let [arr1, arr2] = longestCommonSubsequence(
                    this.oldTextArr[row],
                    this.newTextArr[row]
                )
                //Label operation
                this.mark(row, arr1, arr2)
            }
        }
    }
}

The mark method generates the final string with the difference information. It first calls the getDiffList method above to get the indexes of the deleted and added characters. Because we work on the new text, additions are fairly simple: traverse the added indexes, find the character at each position in the new string, and wrap it in an opening and a closing tag:

/*
oldArr: index array of the longest common subsequence in the old text
newArr: index array of the longest common subsequence in the new text
*/
mark (row, oldArr, newArr) {
    let oldText = this.oldTextArr[row];
    let newText = this.newTextArr[row];
    //Get deleted and added location indexes
    let { delList, addList } = getDiffList(
        oldText,
        newText,
        oldArr,
        newArr
    );
    // The inserted span tags also take up space, which shifts the later indexes, so we correct for that by adding the length the tags occupy
    let addTagLength = 0;
    // Traverse the array of added positions
    addList.forEach((index) => {
        let pos = index + addTagLength;
        // Take the string before the current position
        let pre = newText.slice(0, pos);
        // Take the string after the current position
        let post = newText.slice(pos + 1);
        newText = pre + `<span class="add">${newText[pos]}</span>` + post;
        addTagLength += 25; // The length of <span class="add"></span> is 25
    });
    });
    this.showTextArr.push(newText);
}

The effects are as follows:

image-20210818181744111.png

image-20210818181753703.png
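The offset bookkeeping can be tried in isolation. This is a minimal sketch under two assumptions: the wrapper is `<span class="add">…</span>` (18 + 7 = 25 characters, which matches the length used above), and the indexes arrive in ascending order. The function name markAdded is hypothetical.

```javascript
const ADD_OPEN = '<span class="add">';
const ADD_CLOSE = '</span>';

// Hypothetical helper: wrap the characters at the given (ascending) indexes
function markAdded(newText, addList) {
  let result = newText;
  // Total length of the tags inserted so far; every later index shifts by this much
  let offset = 0;
  for (const index of addList) {
    const pos = index + offset;
    result =
      result.slice(0, pos) + ADD_OPEN + result[pos] + ADD_CLOSE + result.slice(pos + 1);
    offset += ADD_OPEN.length + ADD_CLOSE.length; // 25
  }
  return result;
}

console.log(markAdded('axyc', [1, 2]));
// a<span class="add">x</span><span class="add">y</span>c
```

Deriving the offset from the tag strings instead of hard-coding 25 keeps the arithmetic correct if the class name ever changes.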

Deletions are a bit more troublesome, because a deleted character obviously does not exist in the new text. We need to work out where it would be if it had not been deleted, and insert it back there. Let's draw a picture:

image-20210818183712848.png

Look first at the deleted character "Flash". Its position in the old string is 3. Using the longest common subsequence, we can find the index in the new text of the common character just before it; the position right after that index is where the deleted character belongs in the new string:

image-20210818184131840.png

First write a function that returns the index in the new text where a deleted character belongs:

getDelIndexInNewTextIndex (index, oldArr, newArr) {
    for (let i = oldArr.length - 1; i >= 0; i--) {
        if (index > oldArr[i]) {
            return newArr[i] + 1;
        }
    }
    return 0;
}
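A worked example may help here. The sketch below is a standalone copy of the same lookup (renamed getDelIndexInNewText, written as a plain function so it runs outside the component) applied to two tiny made-up inputs; both index arrays must be in ascending order.

```javascript
// Standalone copy of the lookup: find the nearest common character before the
// deleted one and return the position right after it in the new text
function getDelIndexInNewText(index, oldArr, newArr) {
  for (let i = oldArr.length - 1; i >= 0; i--) {
    if (index > oldArr[i]) {
      return newArr[i] + 1;
    }
  }
  // No common character precedes it, so it belongs at the very start
  return 0;
}

// 'abc' -> 'axc': common indexes are [0, 2] in both; 'b' (old index 1) was deleted
console.log(getDelIndexInNewText(1, [0, 2], [0, 2])); // 1, right after 'a'

// 'xab' -> 'ab': common indexes are [1, 2] and [0, 1]; 'x' (old index 0) was deleted
console.log(getDelIndexInNewText(0, [1, 2], [0, 1])); // 0, the start of the text
```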

The next step is to calculate the exact position in the string. For "Flash", the position is calculated as follows:

image-20210818185833500.png

mark (row, oldArr, newArr) {
    // ...

    //Traverses the deleted index array
    delList.forEach((index) => {
        let newIndex = this.getDelIndexInNewTextIndex(index, oldArr, newArr);
        //Number of characters added before
        let addLength = addList.filter((item) => {
            return item < newIndex;
        }).length;
        //The number of characters that have not changed before
        let noChangeLength = newArr.filter((item) => {
            return item < newIndex;
        }).length;
        let pos = addLength * 26 + noChangeLength;
        let pre = newText.slice(0, pos);
        let post = newText.slice(pos);
        newText = pre + `<span class="del">${oldText[index]}</span>` + post;
    });

    this.showTextArr.push(newText);
}

Now "Flash" ends up in the right place. See the effect:

image-20210818190258024.png

You can see that everything after it is a mess. The reason is simple: for "crystal", for example, we did not add the space occupied by the newly inserted "Flash":

//The position occupied by the inserted character
let insertStrLength = 0;
delList.forEach((index) => {
    let newIndex = this.getDelIndexInNewTextIndex(index, oldArr, newArr);
    let addLength = addList.filter((item) => {
        return item < newIndex;
    }).length;
    let noChangeLength = newArr.filter((item) => {
        return item < newIndex;
    }).length;
    //Add the total length of newly inserted characters
    let pos = insertStrLength + addLength * 26 + noChangeLength;
    let pre = newText.slice(0, pos);
    let post = newText.slice(pos);
    newText = pre + `<span class="del">${oldText[index]}</span>` + post;
    // The length of <span class="del">X</span> is 26
    insertStrLength += 26;
});

With that, our rough diff tool is complete:

image-20210818191126696.png
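For comparison, the same output can be produced without any offset arithmetic by walking both strings once and appending to a result string. This is an alternative sketch, not the approach the article takes; the function name renderDiff and the span class names are assumptions, both index arrays must be ascending, and no HTML escaping is done.

```javascript
// Hypothetical alternative: build the marked-up string in a single pass
function renderDiff(oldText, newText, oldArr, newArr) {
  let html = '';
  let i = 0; // cursor in the old text
  let j = 0; // cursor in the new text
  for (let k = 0; k <= oldArr.length; k++) {
    // Everything before the next common character (or the end) is a change
    const oldEnd = k < oldArr.length ? oldArr[k] : oldText.length;
    const newEnd = k < newArr.length ? newArr[k] : newText.length;
    while (i < oldEnd) {
      html += `<span class="del">${oldText[i++]}</span>`;
    }
    while (j < newEnd) {
      html += `<span class="add">${newText[j++]}</span>`;
    }
    if (k < oldArr.length) {
      // The common character itself is copied unchanged
      html += newText[j];
      i++;
      j++;
    }
  }
  return html;
}

console.log(renderDiff('abc', 'axc', [0, 2], [0, 2]));
// a<span class="del">b</span><span class="add">x</span>c
```

Because the output is only ever appended to, inserted tags never shift positions that are still to be processed.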

Existing problems

I am sure you have spotted a problem with the implementation above: if a whole line is deleted or added, the old and new texts have different numbers of lines. Let's fix the diff function first:

diff () {
    this.oldTextArr = oldText.split(/\n+/g);
    this.newTextArr = newText.split(/\n+/g);
    //If the number of new and old lines is different, fill it with an empty string
    let oldTextArrLen = this.oldTextArr.length;
    let newTextArrLen = this.newTextArr.length;
    let diffRow = Math.abs(oldTextArrLen - newTextArrLen);
    if (diffRow > 0) {
        let fixArr = oldTextArrLen > newTextArrLen ? this.newTextArr : this.oldTextArr;
        for (let i = 0; i < diffRow; i++) {
            fixArr.push('');
        }
    }
    // ...
}

If the line that was added or deleted is the last one, this works fine:

image-20210818192107232.png

However, if a line in the middle is added or deleted, the diff of every line after it becomes meaningless:

image-20210818192342057.png

The reason is simple: deleting a line shifts all the following pairwise comparisons out of alignment. What can we do? One idea is to detect that a line was deleted or added and correct which lines are compared. Another is not to diff line by line at all but to diff the whole text directly, so that adding or deleting a line no longer matters.

I could not work out the first idea, so let's look at the second. We remove the logic that splits on line breaks and diff the entire text directly:

diff () {
    this.oldTextArr = [oldText];// .split(/\n+/g);
    this.newTextArr = [newText];// .split(/\n+/g);
    // ...
}

image-20210818192825679.png

It seems to work. Let's increase the amount of text:

image-20210818193054909.png

Sure enough, it falls over. Evidently our simple longest-common-subsequence algorithm cannot cope with that much text: either the dp array takes up too much space, or the recursion goes too deep, and memory overflows.
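To put a number on the space problem: the table has (m + 1) × (n + 1) cells, so two texts of about 20,000 characters each need roughly 400 million cells. If only the length is wanted, each row depends only on the row above it, so two rows are enough. The sketch below shows that idea under the hypothetical name lcsLengthTwoRows. It does not recover the positions, which the diff needs; doing that in little memory requires a different technique such as Hirschberg's algorithm, so this is an illustration of the cost, not a fix for the tool.

```javascript
// Hypothetical helper: LCS length using two rows instead of the full table
function lcsLengthTwoRows(text1, text2) {
  const n = text2.length;
  let prev = new Array(n + 1).fill(0); // row i - 1
  let curr = new Array(n + 1).fill(0); // row i
  for (let i = 1; i <= text1.length; i++) {
    for (let j = 1; j <= n; j++) {
      curr[j] = text1[i - 1] === text2[j - 1]
        ? 1 + prev[j - 1]
        : Math.max(prev[j], curr[j - 1]);
    }
    // The finished row becomes the previous row for the next iteration
    [prev, curr] = [curr, prev];
  }
  return prev[n];
}

console.log(lcsLengthTwoRows('abcde', 'ace')); // 3
```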

For an author who is weak at algorithms, this is a dead end. What now? We can only turn to the power of open source. Ta-da, here it is: diff-match-patch

The diff-match-patch library

diff-match-patch is a high-performance library for working with text. It is available in several programming languages. Besides computing the difference between two texts, it also supports fuzzy matching and patching, as the name suggests.

It is easy to use. First bring it in. If you want to load it with import, you need to modify the source file: by default the source attaches the class to the global environment, so we have to export the class manually. Then create an instance with new and call the diff method:

import diff_match_patch from './diff_match_patch_uncompressed';

const dmp = new diff_match_patch();

diffAll () {
    let diffList = dmp.diff_main(oldText, newText);
    console.log(diffList);
}

The returned result is as follows:

image-20210818195940486.png

The result is an array in which each item represents one difference: 0 means no change, 1 means added, and -1 means deleted. We just need to traverse the array and concatenate the strings, which is very simple:

diffAll () {
    let diffList = dmp.diff_main(oldText, newText);
    let htmlStr = '';
    diffList.forEach((item) => {
        switch (item[0]) {
            case 0:
                htmlStr += item[1];
                break;
            case 1:
                htmlStr += `<span class="add">${item[1]}</span>`;
                break;
            case -1:
                htmlStr += `<span class="del">${item[1]}</span>`;
                break;
            default:
                break;
        }
    });
    this.showTextArr = htmlStr.split(/\n+/);
}

image-20210818201307191.png
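One caveat when concatenating markup this way: if the compared text itself contains characters such as < or &, they end up inside the generated HTML. The sketch below renders a list in the same [operation, text] shape and escapes the text first. It is a minimal illustration that does not call the library: the sample list is written by hand, and the names escapeHtml and renderDiffList are hypothetical.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML text
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: turn [operation, text] pairs into a marked-up string
function renderDiffList(diffList) {
  return diffList
    .map(([op, text]) => {
      const safe = escapeHtml(text);
      if (op === 1) return `<span class="add">${safe}</span>`;
      if (op === -1) return `<span class="del">${safe}</span>`;
      return safe; // 0: unchanged text
    })
    .join('');
}

// Hand-written sample in the same shape as the array described above
const sampleDiff = [[0, 'a'], [-1, 'b'], [1, '<x>'], [0, 'c']];
console.log(renderDiffList(sampleDiff));
// a<span class="del">b</span><span class="add">&lt;x&gt;</span>c
```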

In testing, a diff of 21432 characters took about 4 ms, which is still very fast.

Well, the editors can slack off happily from now on~

Summary

This article worked through the algorithm problem of finding the longest common subsequence and looked at how it applies to text diff. Our simple algorithm cannot support a real project, though, so if you have such a requirement, you can use the open source library introduced in this article.

Complete sample code: https://github.com/wanglin2/text_diff_demo