Java implementation of Chinese and English spell checking and correction? But I can only write CRUD!

Time:2021-7-28

A simple requirement

It was nearly quitting time. Xiao Ming had finished the day’s tasks and was getting ready to head home.

A message flashed on his screen.

“The official account platform recently launched a nice spell-checking feature. It helps users catch wrongly written characters, and the experience is good. Build one for our system.”

Looking at the message, Xiao Ming offered a silent greeting in his heart.

“If I could damn well build that, I would be working at their headquarters instead of putting up with you here.”

“OK,” Xiao Ming replied, “let me look into it first.”

Today not even the King of Heaven himself could keep me from leaving on time. Not even Jesus.

Xiao Ming thought, and went home.


Calm analysis

When it came to spell checking, Xiao Ming was not entirely clueless.

He had never eaten pork, but he had seen pigs run.

He had seen official-account authors share the news that the platform had launched a spell-checking feature, claiming there would never be typos again.

Afterwards, Xiao Ming still saw plenty of typos in their articles. And that was the end of that.

Why not ask the omnipotent GitHub?

Xiao Ming opened GitHub and found no mature open-source Java project for this. The few that existed had too few stars to inspire confidence.

Presumably most NLP work is done in Python.

Xiao Ming quietly lit a Huazi cigarette.

The night outside the window was as cool as water. He could not help wondering: where did I come from? Where am I going? What is the meaning of life?


The still-warm ash fell on the slippers Xiao Ming had bought, and the burn jolted the runaway horse of his thoughts.

He had no leads and no ideas. Better to wash up and sleep first.

That night, Xiao Ming had a long dream. There were no typos in the dream; every word sat in exactly the right place.

Turn for the better

The next day, Xiao Ming opened the search box and entered spelling correct.

Fortunately, he found an explanation of an English spelling-correction algorithm.

“I once spent a whole day thinking, but it was not as good as a moment of learning.” Xiao Ming sighed and looked up.

Algorithm idea

English words are built from just 26 letters, so spelling errors are easy to make.

First, obtain a list of correctly spelled English words. An excerpt:

apple,16192
applecart,41
applecarts,1
appledrain,1
appledrains,1
applejack,571
applejacks,4
appleringie,1
appleringies,1
apples,5914
applesauce,378
applesauces,1
applet,2

Each line holds a word followed by its frequency, separated by a comma.

Take the user input appl as an example. If the word does not exist in the list, apply insert / delete / replace operations to find the closest word. (In essence, this finds the word with the smallest edit distance.)

If the entered word exists, it indicates that it is correct and does not need to be processed.
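To make "smallest edit distance" concrete, here is a minimal sketch of the classic Levenshtein distance. The class name EditDistance is made up for this illustration and is not part of word-checker. Note that plain Levenshtein counts a swap of two adjacent letters as two edits.

```java
/** Illustrative helper (hypothetical name, not part of word-checker). */
final class EditDistance {
    private EditDistance() {
    }

    /** Minimum number of single-character inserts, deletes, and replacements turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // The row just computed becomes the previous row of the next iteration
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this definition, appl is one edit away from apple and two edits away from apples, so apple is the closer candidate.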

Obtaining the word list

Where can such an English word list be found?

Xiao Ming searched in various places and eventually found a fairly complete English word-frequency list with more than 270,000 (27W+) words.

An excerpt:

aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2


Core code

Generate every candidate that is one edit away from the user's input. The core code is as follows:

/**
 * Builds every candidate that is one edit away from the given word.
 *
 * @param word the input word
 * @return all candidate strings (may contain duplicates)
 * @since 0.0.1
 * @author old ma xiaoxifeng
 */
private List<String> edits(String word) {
    List<String> result = new LinkedList<>();
    // Delete one character
    for (int i = 0; i < word.length(); ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1));
    }
    // Swap two adjacent characters
    for (int i = 0; i < word.length() - 1; ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2));
    }
    // Replace one character with a letter from a to z
    for (int i = 0; i < word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i + 1));
        }
    }
    // Insert a letter from a to z at every position
    for (int i = 0; i <= word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i));
        }
    }
    return result;
}

Then keep only the candidates that appear in the word list:

List<String> options = edits(formatWord);
List<CandidateDto> candidateDtos = new LinkedList<>();
for (String option : options) {
    if (wordDataMap.containsKey(option)) {
        CandidateDto dto = CandidateDto.builder()
                .word(option).count(wordDataMap.get(option)).build();
        candidateDtos.add(dto);
    }
}

Finally, the matching candidates are sorted by word frequency, so the most common word comes first. On the whole, this is relatively simple.
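The filtering and ranking step can be sketched as follows. This is an illustration under assumptions, not the library's code: the class name FrequencyRanker is hypothetical, and the dictionary is assumed to be a plain map from word to frequency.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative helper (hypothetical name, not part of word-checker). */
final class FrequencyRanker {
    private FrequencyRanker() {
    }

    /** Keeps the candidates found in the dictionary and orders them by descending frequency. */
    static List<String> rank(Iterable<String> candidates, Map<String, Long> dictionary) {
        // A LinkedHashSet drops duplicate candidates and keeps first-seen order
        Set<String> known = new LinkedHashSet<>();
        for (String candidate : candidates) {
            if (dictionary.containsKey(candidate)) {
                known.add(candidate);
            }
        }
        List<String> ranked = new ArrayList<>(known);
        // List.sort is stable, so candidates with equal frequency keep first-seen order
        ranked.sort(Comparator.comparingLong((String w) -> dictionary.get(w)).reversed());
        return ranked;
    }
}
```

Given the frequencies from the excerpt above (apple 16192, apples 5914, applet 2), the candidates applet, apples, appl, apple come back as apple, apples, applet; appl is dropped because it is not in the dictionary.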

Chinese spelling

A drop in the bucket

At first glance, Chinese spell checking looks similar to English, but Chinese differs in one important way.

Every Chinese character has a fixed written form, so users never type a misspelled character. They can only type a valid character in the wrong place.

Judging a single character in isolation is meaningless. There must be a word or some context.

This makes correction much more difficult.

Xiao Ming shook his head helplessly. Chinese culture is broad and profound.

Algorithm idea

There are several ways to detect wrongly used Chinese characters:

(1) Confusion set

A list of commonly confused forms. For example, the idiom 万变不离其宗 ("all changes stay true to their origin") is often miswritten as 万变不离其中.

(2) N-gram

That is, the context of a word. The 2-gram (bigram) model is the most widely used, and a corresponding corpus is available from Sogou Labs.

When the first word is fixed, each possible second word has a corresponding probability. The higher the probability, the more likely it is what the user originally meant to type.

For example, the user may type one way of writing "runs fast" when a near-identical variant is probably the correct one.
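A minimal bigram sketch, assuming the training sentences are already segmented into tokens. The class BigramModel and its methods are hypothetical names for illustration, and a real system would train on a large corpus and add smoothing for unseen pairs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative bigram counter (hypothetical name, not part of word-checker). */
final class BigramModel {
    // counts.get(first).get(next) = how often next directly followed first
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // totals.get(first) = how often first was followed by any token
    private final Map<String, Integer> totals = new HashMap<>();

    /** Records every adjacent token pair of one segmented sentence. */
    void train(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.computeIfAbsent(tokens.get(i), k -> new HashMap<>())
                    .merge(tokens.get(i + 1), 1, Integer::sum);
            totals.merge(tokens.get(i), 1, Integer::sum);
        }
    }

    /** Estimates P(next | first) as count(first, next) / count(first, any); 0 if first is unseen. */
    double probability(String first, String next) {
        int total = totals.getOrDefault(first, 0);
        if (total == 0) {
            return 0.0;
        }
        return counts.get(first).getOrDefault(next, 0) / (double) total;
    }

    /** Returns the option with the highest P(option | first); ties keep the earlier option. */
    String mostLikely(String first, List<String> options) {
        if (options.isEmpty()) {
            throw new IllegalArgumentException("options must not be empty");
        }
        String best = options.get(0);
        for (String option : options) {
            if (probability(first, option) > probability(first, best)) {
                best = option;
            }
        }
        return best;
    }
}
```

If "run fast" was seen three times and "run feast" once during training, the estimate for fast after run is 0.75, so fast is preferred over feast.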

Error correction

Another difficulty in Chinese is that one word cannot be turned into another simply through insert / delete / replace operations.

There are still several approaches, however:

(1) Homophones / near-homophones

(2) Visually similar characters

(3) Synonyms

(4) Disordered word order, added or missing characters


Algorithm implementation

Pressed for time, Xiao Ming chose the simplest approach: the confusion set.

First, find a dictionary of commonly confused forms. An excerpt:

一丘之鹤 一丘之貉
一仍旧惯 一仍旧贯
一付中药 一服中药
...
黯然消魂 黯然销魂
鼎立相助 鼎力相助
鼓躁而进 鼓噪而进
龙盘虎据 龙盘虎踞

The first column is the wrong form and the second is the correct usage.

Use the wrong forms as the dictionary, then run fast-forward word segmentation over the Chinese text to find each wrong form and obtain its correct form.

Of course, at the beginning we can simply let the user enter a single phrase, in which case the implementation is a direct lookup in the corresponding map:

public List<String> correctList(String word, int limit, IWordCheckerContext context) {
    final Map<String, List<String>> wordData = context.wordData().correctData();
    // A correct word is its own (only) correction
    if(isCorrect(word, context)) {
        return Collections.singletonList(word);
    }
    // Look up the known corrections for the wrong form
    List<String> allList = wordData.get(word);
    if(allList == null) {
        // Defensive: no correction is known for this word
        return Collections.emptyList();
    }
    // Return at most limit corrections
    final int minLimit = Math.min(allList.size(), limit);
    List<String> resultList = Guavas.newArrayList(minLimit);
    for(int i = 0; i < minLimit; i++) {
        resultList.add(allList.get(i));
    }
    return resultList;
}
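For text longer than one phrase, the dictionary-driven scan described above can be sketched as a forward longest-match pass. This is a simplified illustration, not the library's segmenter: ConfusionCorrector is a hypothetical name, and each wrong form is assumed to map to a single correct form.

```java
import java.util.Map;

/** Illustrative forward longest-match corrector (hypothetical name, not part of word-checker). */
final class ConfusionCorrector {
    private ConfusionCorrector() {
    }

    /** Scans left to right and replaces the longest wrong form found at each position. */
    static String correct(String text, Map<String, String> confusion) {
        // The longest key bounds how far ahead each lookup has to reach
        int maxLen = 0;
        for (String wrong : confusion.keySet()) {
            maxLen = Math.max(maxLen, wrong.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            // Try the longest possible substring first, then shrink
            for (int len = Math.min(maxLen, text.length() - i); len > 0; len--) {
                String right = confusion.get(text.substring(i, i + len));
                if (right != null) {
                    out.append(right);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No wrong form starts here: copy one character unchanged
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

Trying the longest substring first matters when one wrong form is a prefix of another: the longer, more specific entry wins.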

Mixed Chinese and English long text

Algorithm idea

Real articles usually mix Chinese and English.

To make the tool convenient, we cannot expect users to enter one phrase at a time.

What should we do?

The answer is word segmentation: split the input into words, then tell Chinese from English and handle each accordingly.

For word segmentation, the following open-source project is recommended:

https://github.com/houbb/segment
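Before a piece of text can be routed to the English or the Chinese checker, it has to be classified. The following is a minimal sketch that splits mixed text into runs of the same kind. ScriptSplitter is a hypothetical name, it uses only the JDK, and it is not how the segment project works internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitter (hypothetical name, not part of word-checker or segment). */
final class ScriptSplitter {
    enum Kind { ENGLISH, CHINESE, OTHER }

    private ScriptSplitter() {
    }

    /** Classifies one code point as an ASCII letter, a Han character, or anything else. */
    static Kind kindOf(int codePoint) {
        if ((codePoint >= 'a' && codePoint <= 'z') || (codePoint >= 'A' && codePoint <= 'Z')) {
            return Kind.ENGLISH;
        }
        if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
            return Kind.CHINESE;
        }
        return Kind.OTHER;
    }

    /** Splits text into maximal runs whose code points all share the same kind. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        int i = 0;
        Kind current = null;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            Kind kind = kindOf(codePoint);
            if (current != null && kind != current) {
                // The kind changed: close the run that ended just before i
                runs.add(text.substring(start, i));
                start = i;
            }
            current = kind;
            i += Character.charCount(codePoint);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }
}
```

Each English run can then go to the English checker, each Chinese run to a Chinese segmenter and checker, and everything else (spaces, digits, punctuation) is passed through unchanged.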

Algorithm implementation

The core correction logic for mixed text reuses the Chinese and English implementations.

@Override
public String correct(String text) {
    // Empty input has nothing to correct. All-English text must still reach the English checker below.
    if(StringUtil.isEmpty(text)) {
        return text;
    }

    StringBuilder stringBuilder = new StringBuilder();
    final IWordCheckerContext zhContext = buildChineseContext();
    final IWordCheckerContext enContext = buildEnglishContext();

    // Step 1: perform word segmentation
    List<String> segments = commonSegment.segment(text);
    // Step 2: correct each segment with the checker that matches its language
    for(String segment : segments) {
        if(StringUtil.isEnglish(segment)) {
            // English segment
            String correct = enWordChecker.correct(segment, enContext);
            stringBuilder.append(correct);
        } else if(StringUtil.isChinese(segment)) {
            // Chinese segment
            String correct = zhWordChecker.correct(segment, zhContext);
            stringBuilder.append(correct);
        } else {
            // Anything else (spaces, punctuation, digits) is kept unchanged
            stringBuilder.append(segment);
        }
    }

    return stringBuilder.toString();
}

The default implementation of word segmentation is as follows:

import com.github.houbb.heaven.util.util.CollectionUtil;
import com.github.houbb.nlp.common.segment.ICommonSegment;
import com.github.houbb.nlp.common.segment.impl.CommonSegments;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * The default mixed word segmentation, supporting Chinese and English.
 *
 * @author binbin.hou
 * @since 0.0.8
 */
public class DefaultSegment implements ICommonSegment {

    @Override
    public List<String> segment(String s) {
        //Separated by spaces
        List<String> strings = CommonSegments.defaults().segment(s);
        if(CollectionUtil.isEmpty(strings)) {
            return Collections.emptyList();
        }

        List<String> results = new ArrayList<>();
        ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese();
        for(String text : strings) {
            //Chinese word segmentation
            List<String> segments = chineseSegment.segment(text);

            results.addAll(segments);
        }


        return results;
    }

}

The text is first split on spaces, and then fast-forward segmentation with the confusion set is applied to the Chinese parts.

Of course, all of this is easy to describe but genuinely tedious to implement. Xiao Ming has open-sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, feel free to fork / star it~

Quick start

word-checker is a tool for checking the spelling of words. It supports English spelling detection and Chinese spelling detection.

Without further ado, let’s try the tool out.

Features

  • Quickly determine whether a word is misspelled
  • Return the best matching correction
  • Return a list of matching corrections, with a configurable maximum size
  • Error messages support I18N
  • Case normalization and full-width / half-width normalization
  • Custom word lists
  • Built-in English word list with more than 270,000 (27W+) words
  • Basic Chinese spelling detection

Quick start

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>word-checker</artifactId>
    <version>0.0.8</version>
</dependency>

Test case

It will automatically return the best correction result according to the input.

final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Introduction to core API

The core API lives in the EnWordCheckers utility class.

Function | Method | Parameters | Return value | Remarks
Check whether a word is spelled correctly | isCorrect(string) | Word to check | boolean |
Return the best correction | correct(string) | Word to check | String | Returns the word itself if no correction is found
Return the list of matching corrections | correctList(string) | Word to check | List | Returns all matching corrections
Return a correction list of limited size | correctList(string, int limit) | Word to check, maximum list size | List | The list size is less than or equal to limit

Test examples

See EnWordCheckerTest.java

Is it spelled correctly

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));

Return best match results

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Default correction list

final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());

Specify the correction list size

final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());

Chinese spelling correction

Core API

To reduce the learning cost, the core API of ZhWordCheckers is consistent with that of the English checker.

Is it spelled correctly

final String right = "正确";
final String error = "万变不离其中";

Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));

Return best match results

final String right = "正确";
final String error = "万变不离其中";

Assert.assertEquals("正确", ZhWordCheckers.correct(right));
Assert.assertEquals("万变不离其宗", ZhWordCheckers.correct(error));

Default correction list

final String word = "万变不离其中";

List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Specify the correction list size

final String word = "万变不离其中";
final int limit = 1;

List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Long text mixed in Chinese and English

Scenario

In real spelling correction, the best experience is to let the user enter a long text, which may mix Chinese and English.

The functions described above are then applied to that text.

Core methods

The WordCheckers utility class provides automatic correction of long text that mixes Chinese and English.

Function | Method | Parameters | Return value | Remarks
Check whether the text is spelled correctly | isCorrect(string) | Text to check | boolean | Returns true only if everything is correct
Return the best correction | correct(string) | Text to check | String | Returns the text itself if nothing can be corrected
Return the corrections for each token | correctMap(string) | Text to check | Map | Returns all matching corrections for each token
Return the corrections for each token, with limited size | correctMap(string, int limit) | Text to check, maximum list size | Map | Each list size is less than or equal to limit

Is the spelling correct

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));

Return the best correction result

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertEquals("hello 你好", WordCheckers.correct(hello));
Assert.assertEquals("spelling 你好 以毒攻毒", WordCheckers.correct(speling));

Return the correction map for the text

Each token maps to its list of corrections.

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing, sperling, seeling, spieling, spiling, speeling, speiling, spelding], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling).toString());

Return the correction map with a size limit

The same as above, but each list holds at most the given number of corrections.

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";

Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling, 2).toString());

Format processing

User input comes in many forms. This tool normalizes the input before checking it.

Case

Uppercase is uniformly formatted as lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));

Full width

Full width is uniformly formatted as half width.

final String word = "ｓｔｒｉｎｇ";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
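The two normalization rules can be sketched with the JDK alone. This is an illustration of the idea, not the library's formatter, and InputNormalizer is a hypothetical name. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts.

```java
/** Illustrative normalizer (hypothetical name, not part of word-checker). */
final class InputNormalizer {
    private static final int FULL_WIDTH_OFFSET = 0xFEE0;

    private InputNormalizer() {
    }

    /** Converts full-width ASCII variants to half width, then lowercases every character. */
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 0x3000) {
                // The ideographic (full-width) space becomes a regular space
                c = ' ';
            } else if (c >= 0xFF01 && c <= 0xFF5E) {
                // Full-width forms map onto ASCII 0x21 to 0x7E
                c = (char) (c - FULL_WIDTH_OFFSET);
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

After normalization, stRing and the full-width spelling of string both become the plain lowercase string, so a single dictionary lookup covers all three forms.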

Custom English Thesaurus

File configuration

You can create the file resources/data/define_word_checker_en.txt in the project’s resource directory.

The contents are as follows:

my-long-long-define-word,2
my-long-long-define-word-two

Different words are on separate lines.

The first column of each line is the word and the second column is its number of occurrences, separated by a comma (,).

The higher the count, the higher the word ranks among the returned corrections. If the count is omitted, it defaults to 1.

The user-defined thesaurus has higher priority than the built-in thesaurus.
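The file format and the priority rule can be sketched as follows. CustomDictionary is a hypothetical name, and "higher priority" is read here as "the user-defined entry replaces the built-in one on conflict", which is an assumption about the library's behavior.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative loader (hypothetical name, not part of word-checker). */
final class CustomDictionary {
    private static final long DEFAULT_COUNT = 1L;

    private CustomDictionary() {
    }

    /** Parses "word,count" lines; a missing count falls back to the default of 1. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> result = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int comma = trimmed.lastIndexOf(',');
            String word = comma < 0 ? trimmed : trimmed.substring(0, comma).trim();
            String count = comma < 0 ? "" : trimmed.substring(comma + 1).trim();
            if (word.isEmpty()) {
                continue;
            }
            try {
                result.put(word, count.isEmpty() ? DEFAULT_COUNT : Long.parseLong(count));
            } catch (NumberFormatException e) {
                // Skip lines whose count is not a number
            }
        }
        return result;
    }

    /** Merges both dictionaries; user-defined entries win on conflict (assumed rule). */
    static Map<String, Long> merge(Map<String, Long> builtIn, Map<String, Long> custom) {
        Map<String, Long> merged = new HashMap<>(builtIn);
        merged.putAll(custom);
        return merged;
    }
}
```

For the two example lines above, my-long-long-define-word gets the count 2 and my-long-long-define-word-two gets the default count 1.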

Test code

After we specify the corresponding word, the spell detection will take effect.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));

Custom Chinese Thesaurus

File configuration

You can create the file resources/data/define_word_checker_zh.txt in the project’s resource directory.

The contents are as follows:

默守成规 墨守成规

The two columns are separated by a single ASCII space. The first is the wrong form and the second is the correct form.
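A minimal sketch of reading such a file into the map shape used for Chinese correction (wrong form mapped to a list of correct forms). ConfusionFileParser is a hypothetical name, and the handling of repeated or malformed lines is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative parser (hypothetical name, not part of word-checker). */
final class ConfusionFileParser {
    private ConfusionFileParser() {
    }

    /** Parses "wrong correct" lines; a wrong form repeated on several lines collects all its corrections. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(" +");
            if (parts.length != 2) {
                // Assumption: blank lines and lines without exactly two columns are ignored
                continue;
            }
            result.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return result;
    }
}
```

The resulting map can be queried directly with the wrong form, and the order of the corrections follows the order of the lines in the file.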

Summary

Correcting Chinese and English spelling has long been a popular and difficult topic.

In recent years, thanks to progress in NLP and artificial intelligence, commercial applications have gradually become successful.

The implementation described here relies on traditional algorithms, and its core is the word list.

Xiao Ming has open-sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, feel free to fork / star it~

Follow-up

After several days of hard work, Xiao Ming finally finished one of the simplest possible spell-checking tools.

“About that spell-check feature like the official account’s that you asked for last time: do you still want it?”

“No. I had forgotten about it until you mentioned it.” The product manager looked a little surprised. “It doesn’t matter whether we build it or not. A lot of business requirements have piled up lately, so take a look at those first.”

“……”

“I recently saw another feature on XXX that is also very good. Build one for our system.”

“……”