Multi-Criteria Chinese Word Segmentation (CWS)
Author: Song Tongtong
Natural language processing (NLP) is an important and challenging direction in artificial intelligence, and word segmentation is usually its first step: the quality of segmentation directly affects all downstream work. Enormous amounts of Chinese text are produced in daily life and work, yet Chinese differs greatly from English at the word and sentence level. English words are naturally separated by spaces; in Chinese, sentences and paragraphs can be delimited by obvious punctuation marks, but “words” and “phrases” have no explicit delimiters and their boundaries are vague, which makes Chinese word segmentation comparatively complex and difficult. This article discusses Chinese word segmentation (CWS).
1. Current situation of Chinese word segmentation
Chinese word segmentation is the task of splitting a sequence of Chinese characters into individual words, i.e. recombining a continuous character sequence into a word sequence according to certain norms. Existing methods fall into three categories: word segmentation based on string matching, word segmentation based on understanding, and word segmentation based on statistics.
- The string-matching method, also called the mechanical method, matches the Chinese character string to be analyzed against the entries of a “sufficiently large” machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). Common matching strategies include: forward maximum matching (left to right), reverse maximum matching (right to left), minimum segmentation (cutting each sentence into the fewest words), and bidirectional maximum matching (scanning in both directions). These algorithms are fast (time complexity stays at O(n)), simple to implement, and acceptably accurate, but they handle ambiguity and unknown words poorly.
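To make the mechanical approach concrete, here is a minimal sketch of forward maximum matching, assuming a tiny illustrative dictionary (real systems use much larger lexicons):

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedily scan left to right, always taking the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"研究", "研究生", "生命", "命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
# → ['研究生', '命', '起源']
```

Note the output: the greedy left-to-right scan produces 研究生/命/起源, whereas the intended reading is 研究/生命/起源 — a classic illustration of the ambiguity problem mentioned above.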
- The understanding-based method makes the computer simulate human understanding of sentences in order to recognize words. Its basic idea is to analyze syntactic and semantic information during segmentation and use that information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity, i.e. it simulates the process of human sentence understanding. This approach needs a great deal of linguistic knowledge; because Chinese linguistic knowledge is general and complex, it is difficult to organize all of it into machine-readable form, so understanding-based segmentation systems remain at the experimental stage.
- The statistics-based method uses statistical machine learning models to learn segmentation regularities from a large amount of already-segmented text (this is called training), and then segments unseen text; examples include maximum-probability segmentation and maximum-entropy segmentation. With the construction of large-scale corpora and the development of statistical machine learning, statistics-based Chinese word segmentation has gradually become the mainstream approach. The main statistical models are: N-gram models, the Hidden Markov Model (HMM), Maximum Entropy (ME) models, Conditional Random Fields (CRF), etc. In practice, a statistics-based segmentation system still uses a dictionary to match and segment strings, while using statistical methods to identify new words, i.e. it combines string frequency statistics with string matching. This retains the speed and efficiency of dictionary matching while exploiting context to recognize new words and automatically resolve ambiguity.
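Statistical models such as HMM and CRF typically cast segmentation as character-level sequence labeling with the standard BMES scheme (Begin/Middle/End of a word, or Single-character word). A minimal sketch of the label encoding and decoding:

```python
def words_to_tags(words):
    """Convert a segmented sentence into per-character BMES labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                                  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # multi-character word
    return tags

def tags_to_words(chars, tags):
    """Recover the segmentation from characters and their BMES labels."""
    words, current = [], ""
    for ch, t in zip(chars, tags):
        current += ch
        if t in ("S", "E"):   # a word ends at S or E
            words.append(current)
            current = ""
    if current:               # flush any trailing partial word
        words.append(current)
    return words

tags = words_to_tags(["自然", "语言", "处理"])
print(tags)                                   # → ['B', 'E', 'B', 'E', 'B', 'E']
print(tags_to_words("自然语言处理", tags))      # → ['自然', '语言', '处理']
```

The model's job is then to predict the BMES tag for each character; the word boundaries follow deterministically from the tags.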
Statistical segmentation relies on a large amount of existing segmented text, i.e. corpora. Building a practical word segmentation tool therefore requires not only an efficient algorithm but also a large-scale corpus. Research teams and individuals with limited funding can obtain only a few small corpora such as SIGHAN 2005, and the annotation standards of these corpora are incompatible with one another, so they cannot simply be mixed for training.
2. Multi-criteria joint learning
Some teams have begun to study how to learn Chinese word segmentation jointly from multiple corpora. For example, Chen et al. (2017) carefully designed adversarial neural networks to extract criterion-specific and criterion-invariant features for each corpus, but the performance was not ideal. Then came the scheme proposed by Han He et al. in 2018: inspired by Google’s multilingual translation system, it takes an engineering approach and uses tags to mark which segmentation standard each sentence comes from, improving the model through transfer learning across different corpora while being able to output segmentation results under multiple standards at the same time.
3. Experiment and results
The training model is the familiar Bi-LSTM + CRF. During joint training, the two introduced artificial identifiers are treated as ordinary characters, so the source of each sentence does not need to be handled specially. The identifiers hint to the RNN which segmentation standard the sentence belongs to, so the representation produced for each character is conditioned on that standard.
At test time, the two artificial identifiers specify the desired segmentation standard, but they are excluded from the accuracy calculation.
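A minimal sketch of the identifier trick; the tag spellings `<pku>` / `</pku>` are illustrative assumptions, and the actual tokens used in the paper's code may differ:

```python
def wrap_with_criterion(chars, criterion):
    """Surround a character sequence with criterion-identifying tokens.

    At training time these tokens are fed to the model like ordinary
    characters; at test time they select the desired segmentation standard.
    """
    return [f"<{criterion}>"] + list(chars) + [f"</{criterion}>"]

def strip_criterion(tokens):
    """Remove the identifier tokens so they are excluded from scoring."""
    return [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]

sample = wrap_with_criterion("商品和服务", "pku")
print(sample)                   # → ['<pku>', '商', '品', '和', '服', '务', '</pku>']
print(strip_criterion(sample))  # → ['商', '品', '和', '服', '务']
```

Because the identifiers are just two extra vocabulary items, the model architecture and parameter count are unchanged regardless of how many corpora are combined.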
In the paper, experiments were carried out on SIGHAN 2005 and SIGHAN 2008, and better scores were achieved even without corpus-specific tuning (hardware was limited at the time, so the same set of hyperparameters was used on all datasets). All scores were computed with the official evaluation script. In the figure below, baseline is the result of training separately on each corpus, +naive is the result of merging the corpora without identifiers, and +multi is the joint training scheme of the paper.
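For reference, segmentation quality is conventionally measured by word-level precision, recall, and F1 over matching word spans. A minimal sketch of that computation (the official SIGHAN script also reports OOV recall and other statistics):

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f1_score(gold_words, pred_words):
    """Word-level F1: a predicted word counts as correct only if its
    character span exactly matches a gold word's span."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["研究", "生命", "起源"]
pred = ["研究生", "命", "起源"]
print(round(f1_score(gold, pred), 4))  # → 0.3333
```

Here only 起源 matches exactly, so precision = recall = 1/3 and F1 ≈ 0.33; this span-matching convention is why a single boundary error penalizes two words at once.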
The features used in the experiments are minimal: just characters and bigrams. Adding richer n-gram features and word embeddings, as in recently popular practice, might improve results further. However, the focus of the paper is a simple multi-criteria segmentation scheme: it favors simplicity and efficiency rather than chasing high scores at the expense of efficiency, so it does not resort to such feature engineering. The experiments and results on SIGHAN 2008 are not detailed here.
This is a simple multi-criteria Chinese word segmentation solution that can train a single model on multiple combined corpora without increasing model complexity. Although the scheme is simple, it brings significant performance gains, especially on small datasets such as WTB. However, very large datasets (e.g. MSR) benefit little or not at all; this is left for future research. The project address and some reference materials are given below for interested readers to explore further.
Blog: http://www.hankcs.com/nlp/segment/multi-criteria-cws.html
Paper: Effective Neural Solution for Multi-Criteria Word Segmentation, 2018, https://arxiv.org/abs/1712.02856
Mo AI Club is a club initiated by the R&D and product team of the artificial intelligence online modeling platform Mo (website: https://momodel.cn), committed to lowering the threshold for developing and using artificial intelligence. The team has experience in big data processing and analysis, visualization, and data modeling, has undertaken intelligent projects in multiple fields, and has full-stack design and development capability from the bottom layer to the front end. Its main research directions are big data management and analysis and artificial intelligence technology, in support of data-driven scientific research.
At present, the team holds offline salons in Hangzhou every two weeks (on Saturdays) for paper sharing and academic exchange on machine learning. We hope to gather friends from all walks of life who are interested in AI, keep communicating and growing together, and promote the democratization and popularization of AI.