C# forward maximum matching and dictionary tree (word segmentation, retrieval) example code

Time:2020-10-7

Scenario: there is a wrong-word database that maintains the mapping from wrong words to correct words. For example, the wrong word “我门” ("my door") corresponds to the correct word “我们” ("we"). We need to check the user’s input text for wrong words: determine whether the text contains any, and point them out to the user. The correct word can be shown for the user to confirm, and if the user agrees that it is an error, the wrong word is replaced.

The first idea is to load the wrong-word list into memory. When the user finishes typing, foreach over the list and check whether the input string contains each wrong word. This works and is easy to implement. The problem is the size of the list: there are currently more than 100,000 wrong words, and the list will keep growing. So I passed on this scheme. To speed up the search, the wrong words are stored in a dictionary tree instead.
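
For comparison, the rejected foreach scheme can be sketched roughly as follows. This is only an illustration: the class name NaiveWrongWordScan and the method Find are hypothetical, not part of the original project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the "foreach over every wrong word" idea.
public static class NaiveWrongWordScan
{
    // Returns every wrong word that occurs somewhere in the input text.
    // The cost grows with the size of the wrong-word list: one substring
    // search per entry, no matter how short the input is.
    public static List<string> Find(string input, IEnumerable<string> wrongWords)
    {
        var found = new List<string>();
        foreach (string wrongWord in wrongWords)
        {
            // Ordinal comparison: compare characters exactly, ignoring culture rules.
            if (input.IndexOf(wrongWord, StringComparison.Ordinal) >= 0)
            {
                found.Add(wrongWord);
            }
        }
        return found;
    }
}
```

With 100,000 entries this performs 100,000 substring searches for every input, which is the cost the dictionary tree is meant to avoid.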

Dictionary tree

A trie, also known as a dictionary tree, word search tree or key tree, is a tree structure. It is typically used to count and sort large numbers of strings (though it is not limited to strings), so search engines often use it for word frequency statistics. Its advantage is that it minimizes unnecessary string comparisons.

The core idea of a trie is to trade space for time: it uses the common prefixes of strings to reduce query time and so improve efficiency.

The query time complexity of a dictionary tree is O(L), where L is the length of the string being looked up, and it does not depend on how many words are stored. The foreach loop mentioned above is O(n) in the number of wrong words n, and each iteration also performs a substring search over the input. Judged by time complexity, the dictionary tree should be a feasible scheme.

Dictionary tree principle

The root node contains no character; every other node contains exactly one character. Concatenating the characters on the path from the root to a node gives the string that node represents. All children of a node contain different characters.

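For example, suppose the wrong words are “我门” ("my door"), “旱睡” ("dry sleep") and “旱起” ("dry rise"). In the dictionary tree (the figure is not reproduced here), the root has two children, 我 and 旱; 我 has the child 门, and 旱 has the children 睡 and 起.

The red dots in the figure mark end-of-word nodes: the characters on the path from the root down to such a node spell one of our words.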

Implement dictionary tree:

using System;
using System.Collections.Generic;
using System.Linq;

public class Trie
{
  private class Node
  {
    /// <summary>
    ///Whether this node is the end of a word
    /// </summary>
    public bool isTail = false;

    public Dictionary<char, Node> nextNode;

    public Node(bool isTail)
    {
      this.isTail = isTail;
      this.nextNode = new Dictionary<char, Node>();
    }
    public Node() : this(false)
    {
    }
  }

  /// <summary>
  ///Root node
  /// </summary>
  private Node rootNode;
  private int size;
  private int maxLength;

  public Trie()
  {
    this.rootNode = new Node();
    this.size = 0;
    this.maxLength = 0;
  }

  /// <summary>
  ///The maximum length of words stored in the dictionary tree
  /// </summary>
  /// <returns></returns>
  public int MaxLength()
  {
    return maxLength;
  }

  /// <summary>
  ///The number of words stored in the dictionary tree
  /// </summary>
  public int Size()
  {
    return size;
  }

  /// <summary>
  ///Get all the words in the dictionary tree
  /// </summary>
  public List<string> GetWordList()
  {
    return GetStrList(this.rootNode);
  }

  private List<string> GetStrList(Node node)
  {
    List<string> wordList = new List<string>();

    foreach (char nextChar in node.nextNode.Keys)
    {
      string firstWord = Convert.ToString(nextChar);
      Node childNode = node.nextNode[nextChar];

      if (childNode == null || childNode.nextNode.Count == 0)
      {
        wordList.Add(firstWord);
      }
      else
      {

        if (childNode.isTail)
        {
          wordList.Add(firstWord);
        }

        List<string> subWordList = GetStrList(childNode);
        foreach (string subWord in subWordList)
        {
          wordList.Add(firstWord + subWord);
        }
      }
    }

    return wordList;
  }

  /// <summary>
  ///Add a new word to the dictionary
  /// </summary>
  /// <param name="word"></param>
  public void Add(string word)
  {
    //Start at the root node
    Node cur = this.rootNode;
    //Loop through words
    foreach (char c in word.ToCharArray())
    {
      //If the letter is not in the dictionary tree node, add
      if (!cur.nextNode.ContainsKey(c))
      {
        cur.nextNode.Add(c, new Node());
      }
      cur = cur.nextNode[c];
    }
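    //Count the word only the first time it is added, so duplicates do not inflate Size()
    if (!cur.isTail)
    {
      cur.isTail = true;
      size++;
    }

    if (word.Length > this.maxLength)
    {
      this.maxLength = word.Length;
    }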
  }

  /// <summary>
  ///Query whether a word exists in the dictionary
  /// </summary>
  /// <param name="word"></param>
  /// <returns></returns>
  public bool Contains(string word)
  {
    return Match(rootNode, word);
  }

  /// <summary>
  ///Find a match
  /// </summary>
  /// <param name="node"></param>
  /// <param name="word"></param>
  /// <returns></returns>
  private bool Match(Node node, string word)
  {
    //Walk down one node per character; fail as soon as a character has no child node
    foreach (char c in word)
    {
      if (!node.nextNode.TryGetValue(c, out node))
      {
        return false;
      }
    }
    //Every character was found; it is a stored word only if this node ends a word
    return node.isTail;
  }
}

A quick test:
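
A minimal check might look like the sketch below. So that it compiles on its own, it uses a condensed stand-in for the class above (the names MiniTrie and TrieQuickCheck are hypothetical). The sample words are the three example wrong words written as two-character strings (我门, 旱睡, 旱起), which is an assumption about their original spelling.

```csharp
using System;
using System.Collections.Generic;

// Condensed stand-in for the Trie class above (hypothetical name),
// kept small so this snippet is self-contained.
public class MiniTrie
{
    private class Node
    {
        public bool IsTail;
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
    }

    private readonly Node root = new Node();
    public int Size { get; private set; }
    public int MaxLength { get; private set; }

    public void Add(string word)
    {
        Node cur = root;
        foreach (char c in word)
        {
            // Create the child node the first time this character is seen here.
            if (!cur.Next.TryGetValue(c, out Node child))
            {
                child = new Node();
                cur.Next.Add(c, child);
            }
            cur = child;
        }
        // Count each distinct word once.
        if (!cur.IsTail) { cur.IsTail = true; Size++; }
        MaxLength = Math.Max(MaxLength, word.Length);
    }

    public bool Contains(string word)
    {
        Node cur = root;
        foreach (char c in word)
        {
            if (!cur.Next.TryGetValue(c, out cur)) return false;
        }
        // A path that exists only as a prefix of longer words is not a word.
        return cur.IsTail;
    }
}

public static class TrieQuickCheck
{
    // Builds the three-word example tree and reports a few lookups as text.
    public static string Run()
    {
        var trie = new MiniTrie();
        foreach (string w in new[] { "我门", "旱睡", "旱起" }) trie.Add(w);
        return string.Join(",", trie.Contains("我门"), trie.Contains("旱"),
                           trie.Contains("旱起"), trie.Size, trie.MaxLength);
    }
}
```

TrieQuickCheck.Run() returns "True,False,True,3,2": 我门 and 旱起 are stored words, 旱 is only a prefix, three words are stored, and the longest has two characters.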

Now we have a dictionary tree. A dictionary tree is meant for retrieval, so we should not foreach over it. Instead, we take the string the user typed as the data source and look up pieces of it in the dictionary tree to see whether it contains wrong words. That means splitting the input string into candidate words, and for this we use forward maximum matching.

Forward maximum matching

The purpose of word segmentation is to split the input string into words. Forward maximum matching scans from front to back and, at each position, looks for the longest word that exists in the dictionary.

Example: let’s assume MaxLength = 3, meaning a word is at most 3 characters long. In practice, the maximum word length stored in the dictionary tree should be used as the window size for segmentation (for the dictionary above, that is 2), which is more efficient. MaxLength = 3 is assumed here only because it makes the matching process easier to follow.

Take the sentence “我们应该早睡早起” ("We should go to bed early and get up early"). Because I am matching wrong words, I changed it to “我门应该旱睡旱起”, which contains the wrong words “我门” ("my door"), “旱睡” ("dry sleep") and “旱起” ("dry rise").

The first pass: take the substring “我门应” and match forward. Whenever a match fails, remove the last character of the candidate.

“我门应”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “我门”.

“我门”: scan the words in the dictionary, the match succeeds, we get the wrong word “我门”, and the remaining input becomes “应该旱睡旱起”.

The second pass: take the substring “应该旱”.

“应该旱”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “应该”.

“应该”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “应”.

“应”: scan the words in the dictionary, there is no match. Drop this character; the remaining input becomes “该旱睡旱起”.

The third pass: take the substring “该旱睡”.

“该旱睡”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “该旱”.

“该旱”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “该”.

“该”: scan the words in the dictionary, there is no match. Drop this character; the remaining input becomes “旱睡旱起”.

The fourth pass: take the substring “旱睡旱”.

“旱睡旱”: scan the words in the dictionary, there is no match, so the substring length minus 1 gives “旱睡”.

“旱睡”: scan the words in the dictionary, the match succeeds, we get the wrong word “旱睡”, and the remaining input becomes “旱起”.

By analogy, the last pass matches “旱起”, so we get the wrong words 我门 / 旱睡 / 旱起.

Because I am matching wrong words against the dictionary tree, a single character may also be a wrong word, so matching continues all the way down to a single character. For plain word segmentation, shrinking would stop once the candidate is down to one character: that character is taken as a word and the remaining string is simply shortened by 1.
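
The walkthrough above can be sketched in a few lines. To keep the snippet self-contained, a HashSet<string> stands in for the dictionary tree, and the names ForwardMaxMatchSketch and Match are hypothetical. The sample strings in the note below assume the example sentence is written as 我门应该旱睡旱起.

```csharp
using System;
using System.Collections.Generic;

public static class ForwardMaxMatchSketch
{
    // Scans the input from left to right. At each position it tries the longest
    // window first (maxLength characters) and shrinks it from the right until
    // the window is found in the dictionary. If not even a single character
    // matches, that character is skipped.
    public static List<string> Match(string input, ISet<string> dictionary, int maxLength)
    {
        var matches = new List<string>();
        int position = 0;
        while (position < input.Length)
        {
            // Near the end of the input, the window cannot exceed what remains.
            int length = Math.Min(maxLength, input.Length - position);
            while (length > 0 && !dictionary.Contains(input.Substring(position, length)))
            {
                length--;
            }

            if (length > 0)
            {
                // Matched: record the word and jump past it.
                matches.Add(input.Substring(position, length));
                position += length;
            }
            else
            {
                // No match at this position: skip one character.
                position++;
            }
        }
        return matches;
    }
}
```

Calling Match("我门应该旱睡旱起", new HashSet<string> { "我门", "旱睡", "旱起" }, 3) goes through the same windows as the walkthrough and returns 我门, 旱睡, 旱起 in that order.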

This family of matching methods also includes backward maximum matching and bidirectional matching, which are worth reading about.

With forward maximum matching implemented, backward maximum matching can be implemented alongside it.

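// If this class lives in its own file, it needs:
//   using System;
//   using System.Collections.Generic;
public class ErrorWordMatch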
  {
    private static ErrorWordMatch singleton = new ErrorWordMatch();
    private static Trie trie = new Trie();
    private ErrorWordMatch()
    {

    }

    public static ErrorWordMatch Singleton()
    {
      return singleton;
    }

    public void LoadTrieData(List<string> errorWords)
    {
      foreach (var errorWord in errorWords)
      {
        trie.Add(errorWord);
      }
    }

    /// <summary>
    ///Finds wrong words by forward or backward maximum matching
    /// </summary>
    /// <param name="inputStr">The string to check for wrong words</param>
    /// <param name="leftToRight">true to segment from left to right, false to segment from right to left</param>
    /// <returns>The wrong words that were matched</returns>
    public List<string> MatchErrorWord(string inputStr, bool leftToRight)
    {
      if (string.IsNullOrWhiteSpace(inputStr))
        return null;
      if (trie.Size() == 0)
      {
        throw new ArgumentException("The dictionary tree has no data. Call the LoadTrieData method to load the dictionary tree first.");
      }
      //Maximum length of words
      int maxLength = trie.MaxLength();
      //The current length of the word
      int wordLength = maxLength;
      //In word segmentation, the current position in the string
      int position = 0;
      //In word segmentation, the total length of the string that has been processed
      int segLength = 0;
      //The word string used to attempt word segmentation
      string word = "";

      //List that stores the forward (left to right) segmentation results
      List<string> segWords = new List<string>();
      //List that stores the backward (right to left) segmentation results
      List<string> segWordsReverse = new List<string>();

      //Start word segmentation and cycle through the following operations until all are complete
      while (segLength < inputStr.Length)
      {
        //If fewer characters remain than the maximum word length, take only what remains
        if ((inputStr.Length - segLength) < maxLength)
          wordLength = inputStr.Length - segLength;
        //Otherwise, the maximum length is adopted
        else
          wordLength = maxLength;

        //The starting position differs between left-to-right and right-to-left scanning
        //Scanning starts at one end of the string and advances toward the other end as segmentation proceeds
        if (leftToRight)
          position = segLength;
        else
          position = inputStr.Length - segLength - wordLength;

        //Take a candidate word of wordLength characters starting at position
        word = inputStr.Substring(position, wordLength);


        //Look the candidate up in the dictionary
        //If it is not there, drop one character and look it up again
        //Repeat until only one character is left
        while (!trie.Contains(word))
        {
          //If even the single remaining character does not match, set word to null to signal no match (for plain word segmentation, just break here)
          if (word.Length == 1)
          {
            word = null;
            break;
          }

          //Drop the character at the edge of the candidate
          //Left to right drops the last character, right to left drops the first
          if (leftToRight)
            word = word.Substring(0, word.Length - 1);
          else
            word = word.Substring(1);
        }

        //Add the matched word to the result list for the current direction
        if (word != null)
        {
          if (leftToRight)
            segWords.Add(word);
          else
            segWordsReverse.Add(word);
          //The string length of the completed word segmentation should be increased accordingly
          segLength += word.Length;
        }
        else
        {
          //No match: advance by 1 and drop this character (for plain word segmentation, skip the null check and return the single character as a word)
          segLength += 1;
        }
      }

      //If it is reverse word segmentation, reverse the order of segmentation results
      if (!leftToRight)
      {
        for (int i = segWordsReverse.Count - 1; i >= 0; i--)
        {
          //Store the reversed results in segWords so that the same variable can be returned in both directions
          segWords.Add(segWordsReverse[i]);
        }
      }

      return segWords;
    }
  }

The singleton pattern is used so that the instance can be shared across the project. Once the dictionary tree has been loaded for the first time, wrong words can be matched from anywhere else.

This code is simplified from my own specific use case. If you only need word segmentation, the same implementation approach works. That is all for now; if there is something wrong, please correct it.
