Hengyuan Cloud: When augmenting text data, which words should (and shouldn't) be selected?

Time: 2022-05-21

Source: Hengyuan Cloud Community (Hengyuan Cloud, a shared computing platform focusing on the AI industry)

Original address|Paper notes

Original author | mathor



Start of text:

Text augmentation is now widely used because it can improve the performance of text classification. Commonly used methods include, but are not limited to, replacement, deletion, and insertion. In general, text augmentation makes the final performance better, though in a few cases it makes it worse. You might suspect that some important words in the sentence are destroyed by deletion or replacement, but which words in a sentence are the important ones? Which words can be augmented, and which words are best left untouched?

ACL 2022 has a paper,《Roles of Words: What Should (n’t) Be Augmented in Text Augmentation on Text Classification Tasks?》, that studies this problem and gives guidance. First, the author trains on the FD news dataset, reaching a final accuracy of 98.92% on the test set, which shows that the model fits the dataset very well. Then the author manually enters several test samples, as shown below.

Because the words “basketball” and “athletes” often appear in training samples labeled “Sport”, the model can accurately predict them as “Sport”. However, the second and fourth samples show that the model does not perform as well as we might think: because “based on” and “team” often co-occur with “Sport” sentences in the training set, the model naturally picks up some “bias” after being trained on this data. The last example shows that the model cannot correctly recognize sports-specific vocabulary such as “three-pointer”.

The above example inspires us to look at each word in a sentence from the perspectives of “statistical relevance” and “semantic similarity”. Specifically, we can assign a “role” to each word from these two perspectives. There are four roles in total:

  1. Common class indicating words (CC words): high statistical correlation and high semantic similarity
  2. Specific class indicating words (SC words): low statistical correlation and high semantic similarity
  3. Intermediate class indicating words (IC words): high statistical correlation and low semantic similarity
  4. Class-irrelevant words / other words (O-words): low statistical correlation and low semantic similarity

STATISTICAL CORRELATION & SEMANTIC SIMILARITY

The author uses the weighted log-likelihood ratio (WLLR) to measure the statistical correlation between each word in a sentence and a category. The WLLR score is calculated as follows:

\[\text{wllr}(w, y) = p(w \mid y) \cdot \log\frac{p(w \mid y)}{p(w \mid \bar{y})}\]

where \(w\) is a word, \(y\) is a category, and \(\bar{y}\) represents all other categories. The larger \(\text{wllr}(w, y)\) is, the greater the statistical correlation between word \(w\) and category \(y\).
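As a rough illustration (my own sketch, not the paper's code), the WLLR score can be computed from class-conditional word frequencies; add-one smoothing is my assumption, used here so unseen words do not cause a division by zero:

```python
import math
from collections import Counter

def wllr(word, label, docs, labels):
    """Weighted log-likelihood ratio of `word` for class `label`:
    wllr(w, y) = p(w|y) * log(p(w|y) / p(w|y_bar)),
    where probabilities are word frequencies inside / outside the class.
    `docs` is a list of token lists, `labels` the class of each doc."""
    in_class, out_class = Counter(), Counter()
    for toks, y in zip(docs, labels):
        (in_class if y == label else out_class).update(toks)
    n_in = sum(in_class.values())
    n_out = sum(out_class.values())
    # Add-one smoothing (an assumption, not from the paper).
    p_in = (in_class[word] + 1) / (n_in + 1)
    p_out = (out_class[word] + 1) / (n_out + 1)
    return p_in * math.log(p_in / p_out)

# Toy corpus: a class-indicating word scores higher than an off-class word.
docs = [["basketball", "team", "wins"], ["stock", "market", "falls"]]
labels = ["Sport", "Finance"]
print(wllr("basketball", "Sport", docs, labels) >
      wllr("stock", "Sport", docs, labels))  # → True
```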

To measure the semantic similarity of two words, the most direct way is to compute the cosine similarity of their vectors. The author does not use a more complex BERT-based model to extract word vectors, because that requires relatively large computing resources; instead, the author directly uses the simple word2vec method to obtain word vectors. The cosine similarity is calculated as follows:

\[\cos(v_w, v_l) = \frac{v_w \cdot v_l}{\lVert v_w \rVert \, \lVert v_l \rVert}\]

where \(l\) represents the category, and \(v_w, v_l\) represent the vector representations of the word and the category, respectively.

Generally speaking, categories have text descriptions, such as “Sports” or “Computer”; we directly use these descriptions as \(l\).
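For illustration, a minimal cosine-similarity computation over toy vectors (the vector values are made up for this example, not real word2vec output) might look like:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word2vec-style vectors (values invented for illustration).
vec = {
    "basketball": [0.9, 0.1, 0.2],
    "sports":     [0.8, 0.2, 0.1],  # class description used as l
    "market":     [0.1, 0.9, 0.3],
}
# "basketball" is semantically closer to the class description "sports".
print(cosine(vec["basketball"], vec["sports"]) >
      cosine(vec["market"], vec["sports"]))  # → True
```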

After calculating the statistical correlation and cosine similarity of all words in a given sentence, we set thresholds to distinguish high (low) WLLR scores \(c_h\) (\(c_l\)) and high (low) cosine scores \(s_h\) (\(s_l\)), and assign each word a role accordingly:

\[w \in \begin{cases} w_{CC}, & \text{wllr}(w, y) \ge c_h \ \text{and}\ \cos(v_w, v_l) \ge s_h \\ w_{SC}, & \text{wllr}(w, y) \le c_l \ \text{and}\ \cos(v_w, v_l) \ge s_h \\ w_{IC}, & \text{wllr}(w, y) \ge c_h \ \text{and}\ \cos(v_w, v_l) \le s_l \\ w_{O}, & \text{otherwise} \end{cases}\]

where \(w_{CC}, w_{SC}, w_{IC}, w_{O}\) denote CC words, SC words, IC words, and O-words, respectively. The paper gives a real extraction example.
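Putting the two scores together, the role assignment can be sketched in Python. Here I use the median of each score as a single threshold (as in the paper's experimental setting); the words and score values are invented for illustration:

```python
import statistics

def assign_roles(wllr_scores, cos_scores):
    """Assign each word one of the four roles, using the median of each
    score as the threshold: CC = high/high, SC = low wllr / high cos,
    IC = high wllr / low cos, O = low/low."""
    c = statistics.median(wllr_scores.values())  # WLLR threshold
    s = statistics.median(cos_scores.values())   # cosine threshold
    roles = {}
    for w in wllr_scores:
        high_stat = wllr_scores[w] >= c
        high_sem = cos_scores[w] >= s
        if high_stat and high_sem:
            roles[w] = "CC"
        elif not high_stat and high_sem:
            roles[w] = "SC"
        elif high_stat and not high_sem:
            roles[w] = "IC"
        else:
            roles[w] = "O"
    return roles

# Hypothetical scores for one sentence (illustrative values only).
wllr_scores = {"basketball": 0.9, "team": 0.8, "three-pointer": 0.1, "the": 0.0}
cos_scores  = {"basketball": 0.9, "team": 0.2, "three-pointer": 0.8, "the": 0.1}
print(assign_roles(wllr_scores, cos_scores))
# → {'basketball': 'CC', 'team': 'IC', 'three-pointer': 'SC', 'the': 'O'}
```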

RESULTS

The thresholds used in the author’s experiments are the medians of the two indicators. The first experiment is deletion.

From the results, deleting CC words causes a large performance drop, while deleting SC words and IC words has a more positive effect. The first conclusion is easy to anticipate: CC words have both high statistical correlation and high semantic similarity with the label, so deleting them will certainly reduce the accuracy of the model’s judgments. But the latter conclusion is somewhat inconsistent with my guess. At first I thought deleting O-words would work best, because O-words are not very related to the label, so deleting them should be harmless. But in fact, deleting SC words and IC words works better. The paper’s explanation is that, since SC words have relatively low statistical correlation with the label but relatively high semantic similarity, deleting them forces the model to pay more attention to CC words; and since IC words have relatively high statistical correlation but relatively low semantic similarity, they are usually noisy and biased data, so deleting them helps the model avoid learning incorrect features for the category.
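Following this finding, a role-aware deletion step might look like the sketch below (the function name, deletion probability, and toy sentence are my own, not the paper's code): only SC and IC words are candidates for deletion, while CC words are kept intact.

```python
import random

def selective_delete(tokens, roles, targets=("SC", "IC"), p=0.3, seed=0):
    """Role-aware deletion: randomly drop only words whose role is in
    `targets` (SC/IC by default, the roles whose deletion helped most),
    keeping CC words intact. A sketch, not the paper's exact procedure."""
    rng = random.Random(seed)
    return [t for t in tokens
            if roles.get(t, "O") not in targets or rng.random() > p]

tokens = ["the", "team", "scored", "a", "three-pointer"]
roles = {"team": "IC", "three-pointer": "SC", "scored": "CC"}
# With p=1.0 every SC/IC word is removed; CC and O words survive.
print(selective_delete(tokens, roles, p=1.0))  # → ['the', 'scored', 'a']
```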

Similarly, the author also experiments with the insertion, replacement, and swap augmentation methods. The results are not listed here one by one; interested readers can read the original paper. The paper includes a summary table covering all four augmentation methods.

Personal summary

This paper proposes a selective text augmentation method. Specifically, it defines four roles, assigns each word a role, and operates differently on words of different roles depending on the augmentation method. This effectively avoids information loss and generates high-quality augmented text data.