Notes on HanLP's Introduction to Natural Language Processing, chapter 9: keyword, key sentence and phrase extraction


The notes are reproduced in GitHub project:

9. Information extraction

Information extraction is a broad concept: it refers to techniques for extracting structured information from unstructured text. These techniques fall into rule-based pattern matching, supervised learning and unsupervised learning. This chapter uses some simple and practical unsupervised methods. Because they need no annotated corpus, they can be applied to massive amounts of unstructured text.

This chapter introduces unsupervised methods for extracting new words, keywords, key phrases and key sentences, in order of increasing granularity.

9.1 new word extraction

  1. Overview

    "New word" is a relative concept and everyone's standard differs, so we define it here: a word outside the dictionary (out-of-vocabulary, OOV) is called a new word.

    New word extraction matters a great deal for Chinese word segmentation, because annotating a corpus is expensive. So how can a domain dictionary be revised? This is where unsupervised new word extraction shows its practical value.

  2. Basic principles

    • A large number of words are extracted from the text (raw corpus), whether old or new.
    • Use the dictionary to filter out the existing words and get new words.

    Step 2 is easy; the key is step 1, extracting words from the text without supervision. Take an arbitrary fragment of a given text. If the fragment's left and right collocations are rich and the collocation of its internal components is fixed, the fragment can be regarded as a word. Select such fragments and sort them by frequency from high to low; the fragments near the top are very likely to be words.

    If the text is large enough, you can get "new words" by filtering out "old words" with a general dictionary.

    The richness of a fragment's external collocations can be measured with information entropy, and the fixedness of its internal collocations can be measured with the mutual information between its subsequences.

  3. Information entropy

    In information theory, information entropy refers to the amount of information contained in a message. It reflects the reduction in uncertainty about an event after hearing the message. For example, before a coin is tossed we do not know whether it will land heads or tails. Once someone tells us that it landed heads, our uncertainty about the toss immediately drops to zero. This reduction in uncertainty is the information entropy. The formula is as follows:

    \[H(X)=-\sum_{x} p(x) \log p(x)\]

    Given a string S as a word candidate, let X be the character that may appear immediately to the left of S (its left neighbor); then H(X) is called the left information entropy of S. The right information entropy H(Y) is defined similarly. Consider the following example:


    Two butterflies fly and fly (两只蝴蝶飞啊飞)
    These butterflies fly away (这些蝴蝶飞走了)

    For the string "butterfly", the left entropy is 1 and the right entropy is 0, because in this raw corpus the right neighbor of "butterfly" is always "fly". If we collect more sentences, such as "butterfly effect" and "butterfly transformation", the right information entropy will increase considerably.

    The larger the left and right information entropy, the richer the string's possible collocations, and the more likely the string is a word.
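The left and right entropy of a candidate can be sketched in a few lines of plain Python. This is only an illustration, not HanLP's implementation: the function names are hypothetical, entropy is measured in bits (log base 2), and the two Chinese sentences are an assumed toy corpus chosen so that the result agrees with the values quoted above.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum p * log2(1 / p), which equals -sum p * log2(p)
    return sum(c / total * math.log2(total / c) for c in counts.values())

def neighbor_entropy(corpus, fragment):
    """Return (left_entropy, right_entropy) of `fragment` over the corpus."""
    left, right = Counter(), Counter()
    for sentence in corpus:
        start = sentence.find(fragment)
        while start != -1:
            if start > 0:
                left[sentence[start - 1]] += 1   # character just before
            end = start + len(fragment)
            if end < len(sentence):
                right[sentence[end]] += 1        # character just after
            start = sentence.find(fragment, start + 1)
    return entropy(left), entropy(right)

# Assumed toy corpus: two different left neighbors, one right neighbor.
corpus = ["两只蝴蝶飞啊飞", "这些蝴蝶飞走了"]
print(neighbor_entropy(corpus, "蝴蝶"))  # (1.0, 0.0)
```

Adding sentences in which other characters follow the fragment raises the second value, which is the effect described for "butterfly effect" and "butterfly transformation".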

    Left and right information entropy alone are not enough. For example, the fragment translated as "a" or "one" in phrases such as "ate a meal", "watched it once", "slept one night" and "went on a trip" also has very rich left and right collocations, yet it is not a word. To get better results we must also consider how cohesive the internal pieces of a candidate are, which is measured by mutual information.

  4. Mutual information

    Mutual information measures the degree of correlation between two discrete random variables X and Y. It is defined as follows:

    \[\begin{aligned}
    I(X ; Y) &=\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \\
    &=E_{p(x, y)} \log \frac{p(x, y)}{p(x) p(y)}
    \end{aligned}\]

    The definition of mutual information can be illustrated with a Venn diagram.

    In the diagram, the left circle represents H(X) and the right circle represents H(Y). Their union is the entropy H(X, Y) of the joint distribution, the set differences are the conditional entropies, and the intersection is the mutual information. The greater the mutual information, the more closely the two random variables are correlated, or the more likely they are to occur together.

    A fragment can be split into parts in several ways; the split with the smallest mutual information is taken as its representative. Once the left and right information entropy and the mutual information are available, fragments whose values fall below the chosen thresholds are filtered out, the remaining fragments are sorted in descending order of frequency, and the top N are kept. This completes the word extraction process.
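The "smallest mutual information over all splits" step can be sketched as follows. This is a simplified illustration under stated assumptions, not HanLP's code: it scores each two-way split with the pointwise term log p(xy) / (p(x) p(y)) from inside the sum above, estimates probabilities crudely as substring count divided by the total number of characters, and assumes the fragment occurs in the corpus. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def substring_counts(corpus, max_len):
    """Count every substring of length 1..max_len in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence)):
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
                counts[sentence[i:j]] += 1
    return counts

def min_split_pmi(fragment, counts, total_chars):
    """Smallest pointwise mutual information over all two-way splits."""
    def p(s):
        return counts[s] / total_chars
    return min(
        math.log2(p(fragment) / (p(fragment[:k]) * p(fragment[k:])))
        for k in range(1, len(fragment))
    )

corpus = ["ABAB", "AxBy"]
total_chars = sum(len(s) for s in corpus)
counts = substring_counts(corpus, max_len=3)

# "AB" sticks together more often than chance, "BA" less often.
print(min_split_pmi("AB", counts, total_chars) > min_split_pmi("BA", counts, total_chars))  # True
```

Taking the minimum is deliberately pessimistic: a candidate is only as cohesive as its weakest split.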

  5. Implementation

    We extract the 100 most frequent words from the Four Great Classical Novels.

    For the code, see (the corpus is downloaded automatically):

    The results are as follows:

    Although we did not train on any corpus of classical literature, the new word extraction module successfully recognized rare words such as Sheyue and Gao Taiwei. The module is also suitable for non-standard social media text such as microblogs.

9.2 keyword extraction

Another need at word granularity is to pick out the important words in a text, which is called keyword extraction. Keywords also lack a quantitative standard, so it is hard to annotate a corpus consistently, and unsupervised learning is used for this task as well.

Word frequency, TF-IDF and TextRank are introduced in turn. Word frequency and TextRank work on a single document, whereas TF-IDF extracts keywords using multiple documents.

  1. Word frequency statistics

    Keywords usually appear repeatedly in an article, because the author mentions them again and again in order to explain them. By counting the frequency of each word in the article and sorting, we obtain some keywords.

    However, words that repeat often are not necessarily keywords, for example the particle "de" (的). Stop words must therefore be removed before counting.

    The usual pipeline is word segmentation, stop word filtering, and taking the top n words by frequency. Finding the n largest of m elements (n <= m) is usually solved with a heap, with complexity O(m log n). The HanLP code is as follows:

    from pyhanlp import *

    TermFrequency = JClass('com.hankcs.hanlp.corpus.occurrence.TermFrequency')
    TermFrequencyCounter = JClass('com.hankcs.hanlp.mining.word.TermFrequencyCounter')

    if __name__ == '__main__':
        # The input strings are English renderings of the original Chinese text.
        counter = TermFrequencyCounter()
        counter.add("Come on, China!")  # first document
        counter.add("Chinese audience cheers for China")  # second document
        for termFrequency in counter:  # traverse each word and its frequency
            print("%s=%d" % (termFrequency.getTerm(), termFrequency.getFrequency()))
        print(counter.top(2))  # top n
        # Keyword extraction based on word frequency
        print(TermFrequencyCounter.getKeywordList("The women's volleyball team won the championship, the audience cheered the women's volleyball team!", 3))

    The output is as follows:

    China=2
    China team=1
    Come on=3
    Audience=1
    Shout=1
    [Come on=3, China=2]
    [women's volleyball, audience, cheers]
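The counting and top-n steps can be sketched with the standard library alone. This is an illustration rather than HanLP's implementation: `top_n_terms` is a hypothetical helper, the token list stands in for the output of a word segmenter, and the stop word set is made up. `heapq.nlargest` keeps a heap of at most n items, which gives the O(m log n) behavior mentioned above.

```python
import heapq
from collections import Counter

def top_n_terms(tokens, stop_words, n):
    """Return the n most frequent non-stop-word tokens as (term, count) pairs."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

# Hypothetical, already-segmented input.
tokens = ["come on", "come on", "China team", "of",
          "China", "audience", "shout", "come on", "China"]
print(top_n_terms(tokens, stop_words={"of"}, n=2))
# [('come on', 3), ('China', 2)]
```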

    Word frequency has a flaw as a keyword extractor: high-frequency words are not necessarily keywords. For example, on a sports website where every article is about the Olympic Games, "Olympic Games" has the highest frequency in each article, whereas users want keywords that show what is distinctive about each one. This is where TF-IDF comes in handy.

  2. TF-IDF

    TF-IDF (term frequency-inverse document frequency) is a statistic used in information retrieval to measure how important a word is. It is widely used in Lucene, Solr, Elasticsearch and other search engines.

    Compared with word frequency, TF-IDF also takes the rarity of a word into account. The importance of a word is directly proportional to its frequency in the document and inversely proportional to the number of documents that contain it. The more documents contain a word, the more common it is and the less it reflects what is specific to any one document. Because it needs the whole corpus or document set, TF-IDF is a multi-document method of keyword extraction.

    The formula is as follows:

    \[\begin{aligned} \text{TF-IDF}(t, d) &= \frac{\mathrm{TF}(t, d)}{\mathrm{DF}(t)} \\ &= \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t) \end{aligned}\]

    Here t is a word, d is a document, TF(t, d) is the frequency of t in d, and DF(t) is the number of documents that contain t. The reciprocal of DF is called IDF (inverse document frequency), which is where the name TF-IDF comes from.

    In practice some extensions are needed, such as add-one smoothing and taking the logarithm of IDF to prevent floating-point overflow. A HanLP example is as follows:

    from pyhanlp import *

    TfIdfCounter = JClass('com.hankcs.hanlp.mining.word.TfIdfCounter')

    if __name__ == '__main__':
        # The ids and texts are English renderings of the original Chinese input.
        counter = TfIdfCounter()
        # add(id, text): input multiple documents
        counter.add("Women's Volleyball Championship", "women's volleyball team won the Beijing Olympic Games")
        counter.add("Badminton men's singles", "men's singles final of Beijing Olympic Games")
        counter.add("Women's volleyball team", "The Chinese women's volleyball team has won the gold medal of Beijing Olympic Games and returned to its peak. The audience cheered the women's volleyball team!")
        counter.compute()  # input completed
        for id in counter.documents():
            # extract keywords according to the TF-IDF of each document
            print(id + " : " + counter.getKeywordsOf(id, 3).toString())
        # Using the existing IDF information, extract keywords for a new document outside the corpus
        print(counter.getKeywords("Olympic Anti Doping", 2))

    The output is as follows:

    Women's volleyball team : [women's volleyball=5.150728289807123, return=1.6931471805599454, peak=1.6931471805599454]
    Women's Volleyball Championship : [title=1.6931471805599454, women's volleyball=1.2876820724517808, Olympic Games=1.0]
    Badminton men's singles : [final=1.6931471805599454, badminton=1.6931471805599454, men's singles=1.6931471805599454]
    [anti, doping]

    From the output we can see that TF-IDF avoids giving too much weight to the common word "Olympic Games".
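The scores printed above are consistent with a smoothed, logarithmic IDF of the form ln((1 + N) / (1 + DF)) + 1, where N is the number of documents. That form is inferred from the numbers, not taken from the HanLP source, so the sketch below should be read as an assumption. The helper name `tf_idf` and the hand-segmented token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: dict of id -> token list. Returns id -> {term: score}."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))          # each document counts once per term
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            # assumed smoothing: add one to both counts, take the log, add one
            term: freq * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, freq in tf.items()
        }
    return scores

# Hypothetical segmentation of the three example documents.
docs = {
    "championship": ["women's volleyball", "Beijing", "Olympic Games", "title"],
    "badminton": ["Beijing", "Olympic Games", "badminton", "men's singles", "final"],
    "volleyball": ["women's volleyball"] * 4 + ["Beijing", "Olympic Games", "return", "peak"],
}
scores = tf_idf(docs)
print(round(scores["volleyball"]["women's volleyball"], 4))  # 5.1507
print(scores["championship"]["Olympic Games"])               # 1.0
```

A term that occurs in every document gets IDF 1, so its score never exceeds its raw frequency, while rarer terms are boosted.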

    Computing TF-IDF statistics over a large corpus resembles a learning process. What if we have no such corpus, or no memory to hold the IDF table, and still want to do better than word frequency? In that case we can use the TextRank algorithm.

  3. TextRank

    TextRank is the application of PageRank to text. PageRank is a random-walk-based algorithm for ranking web pages. It treats the Internet as a directed graph in which web pages are nodes and a hyperlink from node V_i to node V_j is a directed edge. The weight S(V_i) of every node is initialized to 1 and then updated iteratively. The update expression for each iteration is as follows:

    \[S\left(V_{i}\right)=(1-d)+d \times \sum_{V_{j} \in \mathrm{In}\left(V_{i}\right)} \frac{1}{\left|\mathrm{Out}\left(V_{j}\right)\right|} S\left(V_{j}\right)\]

    Here d is a constant factor in (0, 1); in PageRank it simulates the probability that a user clicks a link and leaves the current page. In(V_i) is the set of nodes that link to V_i, and Out(V_j) is the set of nodes that V_j links to. It follows that having more inbound links does not by itself guarantee a higher PageRank: the more outbound links a site gives to other sites, the lower the weight of each link. If all of a site's inbound links are low-weight links of this kind, its PageRank drops as well. As the saying goes, birds of a feather flock together: the sites recommended by junk websites are often junk websites too. PageRank can therefore rank websites fairly.

    When PageRank is applied to keyword extraction, words are the nodes, and the outbound links of each word go to all the words before and after it within a fixed-size window.

    The HanLP code is as follows:

    from pyhanlp import *

    # Keyword extraction (the text is an English rendering of the original Chinese input)
    content = (
        "Programmers are professionals engaged in program development and maintenance. "
        "Programmers are generally divided into program designers and program coders, "
        "but the boundary between the two is not very clear, especially in China. "
        "Software practitioners are divided into junior programmers, senior programmers, "
        "system analysts and project managers."
    )
    TextRankKeyword = JClass("com.hankcs.hanlp.summary.TextRankKeyword")
    keyword_list = HanLP.extractKeyword(content, 5)
    print(keyword_list)

    The operation results are as follows:

    [programmer, program, divided into, personnel, software]
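The window-based graph and the iterative update can be sketched in plain Python. This is a minimal illustration, not HanLP's implementation: the function name is hypothetical, the window size, the damping factor d = 0.85 and the iteration count are assumed values, and links are stored in both directions so that In(V) and Out(V) coincide.

```python
from collections import defaultdict

def textrank_scores(tokens, window=5, d=0.85, iterations=30):
    """Score words with the PageRank update over a co-occurrence graph."""
    # Link each word to the (window - 1) words that follow it.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}  # every weight starts at 1
    for _ in range(iterations):
        scores = {
            word: (1 - d) + d * sum(scores[v] / len(neighbors[v]) for v in linked)
            for word, linked in neighbors.items()
        }
    return scores

# Hypothetical, already-segmented and stop-word-filtered input.
tokens = ["programmer", "develop", "program", "programmer", "maintain",
          "software", "programmer", "analyze", "system"]
scores = textrank_scores(tokens, window=2)
print(max(scores, key=scores.get))  # programmer
```

Sorting the words by score in descending order and keeping the first few gives the keyword list.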

9.3 phrase extraction

Another important task in information extraction is extracting Chinese phrases, that is, recognizing fixed multi-word expressions. Phrase extraction is often used for search engine query suggestions, document summary generation and so on.

Using mutual information and left and right information entropy, the new word extraction algorithm extends easily to phrase extraction: simply replace "character" with "word" and "string" with "word list". To obtain words we still need Chinese word segmentation. Stop words rarely help express the meaning of a phrase, so they are usually filtered out after segmentation.

The code is as follows:

from pyhanlp import *

# Phrase extraction (the text is an English rendering of the original Chinese input)
text = '''
  Algorithm Engineer
  Algorithm is a series of clear instructions to solve problems, that is to say, it can obtain the required output in a limited time for a certain standard input.
  If an algorithm is defective or not suitable for a problem, executing the algorithm will not solve the problem. Different algorithms may take different time,
  space or efficiency to complete the same task. The quality of an algorithm can be measured by its space complexity and time complexity. An algorithm engineer is a person who uses algorithms to handle things.

  1. Job description
  Algorithm engineer is a very high-end position;
  Major requirements: computer, electronics, communication, mathematics and other related majors;
  Education requirements: Bachelor degree or above, most of them are master degree or above;
  Language requirements: English is required to be proficient, basically able to read foreign professional books and periodicals;
  Must master computer related knowledge, skilled use of simulation tools such as MATLAB, must know a programming language.

  2. Research direction
  Video algorithm engineer, image processing algorithm engineer, audio algorithm engineer, communication baseband algorithm engineer

  3. Current situation at home and abroad
  At present there are many engineers doing algorithm research in China, but senior algorithm engineers are few; it is a profession in very short supply.
  By research field, algorithm engineers mainly work on audio/video algorithm processing, two-dimensional information algorithm processing in image technology, and
  one-dimensional information algorithm processing in communication physical layer, radar signal processing, biomedical signal processing and other fields.
  Among two-dimensional information processing algorithms for computer audio and video and graphics and image technology, the most advanced at present is machine vision, which has become the core of this kind of algorithm research;
  in addition, there are 2D-to-3D conversion, de-interlacing,
  motion estimation / motion compensation, noise reduction, scaling,
  sharpening, super resolution, gesture recognition and face recognition.
  The commonly used algorithms in the field of one-dimensional information, such as communication physical layer, are RRM and RTT in wireless field, modulation and demodulation, channel equalization, signal detection, network optimization, signal decomposition and so on.
  In addition, data mining and Internet search algorithms have become popular directions.
  Algorithm engineers are gradually moving toward artificial intelligence.'''
phrase_list = HanLP.extractPhrase(text, 5)
print(phrase_list)

The operation results are as follows:

[algorithm engineer, algorithm processing, one dimensional information, algorithm research, signal processing]

At present the module only supports extracting bigram phrases. In other cases, keywords or key phrases are too fragmented to express the overall theme. The central sentences are then usually extracted as a short summary of the article. Key sentence extraction is again an extension of PageRank.

9.4 key sentence extraction

Since an article almost never contains two identical sentences, plain PageRank does not work at sentence granularity. To apply PageRank to sentences, we introduce the BM25 algorithm to measure sentence similarity and use it to refine the link weights. The links between the central sentence of a window and its neighboring sentences then vary in strength, and similar sentences cast more votes for each other. The central sentence of an article usually has high similarity with the sentences that elaborate on it, which is exactly what the algorithm relies on. This section first introduces BM25 and then the application of TextRank to key sentence extraction.

  1. BM25

    In information retrieval, BM25 is an improved variant of TF-IDF. TF-IDF measures the importance of a single word in a document, but in search engines a query usually consists of several words. Measuring the relevance between a multi-word query and a document is the problem that BM25 solves.

    In the formal definition, Q is a query consisting of the keywords q_1 to q_n, D is a retrieved document, and the BM25 measure is as follows:

    \[\operatorname{BM25}(D, Q)=\sum_{i=1}^{n} \operatorname{IDF}\left(q_{i}\right) \cdot \frac{\operatorname{TF}\left(q_{i}, D\right) \cdot\left(k_{1}+1\right)}{\operatorname{TF}\left(q_{i}, D\right)+k_{1} \cdot\left(1-b+b \cdot \frac{|D|}{\operatorname{avgDL}}\right)}\]
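
In the formula, |D| is the length of document D, avgDL is the average document length in the collection, and k_1 and b are tuning constants. A minimal sketch of the score, assuming k_1 = 1.5 and b = 0.75 (common defaults, not values stated in these notes) and an always non-negative IDF variant, since the formula above does not specify how IDF is computed; `bm25_scorer` is a hypothetical helper:

```python
import math
from collections import Counter

def bm25_scorer(documents, k1=1.5, b=0.75):
    """documents: list of token lists. Returns score(query_tokens, doc_index)."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    term_freqs = [Counter(doc) for doc in documents]
    df = Counter()
    for tf in term_freqs:
        df.update(tf.keys())
    # Assumed IDF variant; it stays positive even for very common terms.
    idf = {t: math.log(1 + (n_docs - n + 0.5) / (n + 0.5)) for t, n in df.items()}

    def score(query_tokens, doc_index):
        tf = term_freqs[doc_index]
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(documents[doc_index]) / avg_len)
        return sum(
            idf[q] * tf[q] * (k1 + 1) / (tf[q] + norm)
            for q in query_tokens
            if q in tf
        )

    return score

# Hypothetical, already-segmented documents.
docs = [["water", "red", "line"], ["water", "permit", "approval"], ["press", "conference"]]
score = bm25_scorer(docs)
query = ["water", "permit"]
print(score(query, 1) > score(query, 0) > score(query, 2))  # True
```

The document containing both query words scores highest, the one containing only the more common word scores lower, and the unrelated document scores zero.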
  2. TextRank

    With BM25 available, one sentence is treated as the query and its adjacent sentences as the documents to be queried. The resulting similarity is used as the link weight in PageRank, which yields the improved algorithm known as TextRank. It is computed as follows:

    \[\mathrm{WS}\left(V_{i}\right)=(1-d)+d \times \sum_{V_{j} \in \mathrm{In}\left(V_{i}\right)} \frac{\mathrm{BM25}\left(V_{i}, V_{j}\right)}{\sum_{V_{k} \in \mathrm{Out}\left(V_{j}\right)} \mathrm{BM25}\left(V_{k}, V_{j}\right)} \mathrm{WS}\left(V_{j}\right)\]

    WS(V_i) is the score of the i-th sentence of the document. The final scores are obtained by iterating the expression several times. After sorting, the first n sentences are output as the key sentences. The code is as follows:

    from pyhanlp import *

    # Automatic summarization (the text is an English rendering of the original Chinese input)
    document = '''Chen Mingzhong, director of the water resources department of the Ministry of water resources, said at a press conference held by the State Council Information Office on September 29,
    According to the assessment of water resources management system, some provinces are close to the red line,
    Some provinces exceed the red line. For some places beyond the red line, Chen Mingzhong said that regional approval restrictions should be imposed on some water intake projects,
    Strictly carry out water resources demonstration and approval of water intake permit.'''
    TextRankSentence = JClass("com.hankcs.hanlp.summary.TextRankSentence")
    sentence_list = HanLP.extractSummary(document, 3)
    print(sentence_list)

    The results were as follows:

    [strictly carry out water resources demonstration and approval of water intake permit, Chen Mingzhong, director general of Water Resources Department of the Ministry of water resources, revealed at a press conference held by the State Council Information Office on September 29 that some provinces exceeded the red line indicators]
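
The weighted update for sentences can be sketched independently of HanLP. This is an illustration under assumptions: `rank_sentences` is a hypothetical helper, `weights[j][i]` stands for the similarity used as the weight of the link from sentence j to sentence i (a BM25 score in the method described above, a hand-written matrix here), and d = 0.85 and 30 iterations are assumed values.

```python
def rank_sentences(weights, d=0.85, iterations=30):
    """Iterate the weighted TextRank update; returns one score per sentence."""
    n = len(weights)
    # Total weight leaving each sentence (the denominator of the formula).
    out_sum = [sum(w for i, w in enumerate(weights[j]) if i != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

# Toy similarities: sentence 0 resembles both others, which barely resemble each other.
weights = [
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]
scores = rank_sentences(weights)
print(scores.index(max(scores)))  # 0
```

The sentence that is most similar to the rest collects the most votes, which matches the intuition that the central sentence of an article resembles the sentences elaborating on it.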

9.5 summary

We can see that new word extraction and phrase extraction, like keyword extraction and key sentence extraction, are in principle the same algorithm applied at different text granularities. Notably, none of these algorithms needs an annotated corpus, which satisfies the wish to "get something for nothing". It must be pointed out, however, that the effectiveness of these algorithms is quite limited. For the same task, supervised methods usually perform far better than unsupervised ones.

9.6 GitHub

HanLP, He Han: notes on Introduction to Natural Language Processing

The project is constantly updated


Chapter 1: Getting started
Chapter 2: Dictionary segmentation
Chapter 3: Bigram model and Chinese word segmentation
Chapter 4: Hidden Markov model and sequence annotation
Chapter 5: perceptron classification and sequence tagging
Chapter 6: conditional random fields and sequence labeling
Chapter 7: part of speech tagging
Chapter 8: Named Entity Recognition
Chapter 9: information extraction
Chapter 10: Text Clustering
Chapter 11: Text Classification
Chapter 12: dependency parsing
Chapter 13: deep learning and natural language processing