Extracting business keywords from requirement documents

Time: 2020-7-23

Background

To build an automatic tool that can locate system faults, we use crawlers during development to fetch the bug descriptions from the company's test platform, together with a crawler that collects the code fault analysis from Sonar. With the data obtained, system faults can be located, analyzed and classified. Because the system belongs to the banking business, specialized vocabulary is involved and tends to be classified inaccurately. This is where the requirement of "extracting business keywords from requirement documents" comes from.


Basic concepts and principles of algorithms

The TextRank algorithm is an improvement on the PageRank algorithm, and the two share the same basic idea. The differences are: PageRank builds its network from the link relationships between web pages, while TextRank builds its network from the co-occurrence relationships between words; the edges in the network built by TextRank are undirected and weighted, while the edges in the network built by PageRank are directed and unweighted.
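The core formula of the TextRank algorithm is as follows, where w_ji is the weight of the edge between nodes V_j and V_i, so that different connections can carry different importance:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

Here WS(V_i) is the score of node V_i, d is the damping factor (commonly set to 0.85), In(V_i) is the set of nodes that point to V_i, and Out(V_j) is the set of nodes that V_j points to. In an undirected word graph both sets are simply the neighbors of the node.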
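To make the update rule concrete, here is a minimal sketch of the weighted TextRank iteration on a tiny hand-built graph. The function name `textrank_scores`, the damping factor d = 0.85, the convergence tolerance and the three example words are all assumptions chosen for illustration; they are not taken from the project.

```python
def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted TextRank update until the scores stop changing.

    weights maps each node to {neighbor: edge weight}. For an undirected
    graph the mapping is symmetric, so weights[i][j] == weights[j][i].
    """
    scores = {node: 1.0 for node in weights}
    for _ in range(max_iter):
        new_scores = {}
        for i in weights:
            total = 0.0
            for j, w_ji in weights[i].items():
                out_sum = sum(weights[j].values())  # sum of w_jk over neighbors k of j
                if out_sum > 0:
                    total += w_ji / out_sum * scores[j]
            new_scores[i] = (1 - d) + d * total
        delta = max(abs(new_scores[n] - scores[n]) for n in weights)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical co-occurrence graph: "loan" co-occurs twice with "account"
# and once with "interest".
graph = {
    "loan": {"account": 2.0, "interest": 1.0},
    "account": {"loan": 2.0},
    "interest": {"loan": 1.0},
}
scores = textrank_scores(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the most connected word, "loan", ends up with the highest score, and the heavier edge makes "account" rank above "interest".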


For ease of understanding, the specific steps for extracting keywords and keyword groups with the TextRank algorithm are given as follows:
  1. Split the document to be processed, T, into complete sentences, i.e. T = [S_1, S_2, ..., S_m];
  2. For each sentence S_i in T, perform word segmentation and part-of-speech tagging, then delete the stop words and keep only the candidate words with the specified parts of speech, such as nouns, verbs and adjectives, i.e. S_i = [t_{i,1}, t_{i,2}, ..., t_{i,n}], where t_{i,j} is a candidate word retained in sentence i;
  3. Build the word graph G = (V, E), where V is the set of nodes, made up of the candidate words produced in step 2. The edges between nodes are built from the co-occurrence relationship: an edge exists between two nodes if and only if the corresponding words co-occur within a window of length k, where k is the window size, i.e. at most k words co-occur; in general, k = 2;
  4. According to the above formula, calculate the weight of each node iteratively until it converges;
  5. Sort the node weights in descending order and take the first N words as the top-N candidate keywords;
  6. Mark the top-N keywords in the original text. If some of them form adjacent phrases there, they are combined and extracted as keyword groups.
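The graph construction (steps 1 to 3) and the phrase merging (step 6) can be sketched as below. This is a simplified illustration under several assumptions: it uses English text with regular-expression tokenization instead of a real word segmenter, it skips part-of-speech filtering entirely, and the stop-word list, the function names and the example sentences are hypothetical.

```python
import re
from collections import defaultdict

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "to", "and", "is", "must", "be", "each"}


def tokenize_sentences(text):
    """Step 1: split the text into sentences, then each sentence into lowercase words."""
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]


def candidate_words(sentences, stop_words=STOP_WORDS):
    """Step 2 (simplified): drop stop words; no part-of-speech filter here."""
    return [[w for w in words if w not in stop_words] for words in sentences]


def build_graph(candidates, window=2):
    """Step 3: undirected weighted graph; weight = number of co-occurrences in the window."""
    graph = defaultdict(lambda: defaultdict(float))
    for words in candidates:
        for i, word in enumerate(words):
            for other in words[i + 1 : i + window]:
                if other != word:
                    graph[word][other] += 1.0
                    graph[other][word] += 1.0
    return {word: dict(neighbors) for word, neighbors in graph.items()}


def merge_adjacent(sentences, keywords):
    """Step 6: join keywords that are adjacent in the original text into phrases."""
    phrases = set()
    for words in sentences:
        run = []
        for word in words + [None]:  # None flushes the last run
            if word in keywords:
                run.append(word)
            else:
                if len(run) > 1:
                    phrases.add(" ".join(run))
                run = []
    return phrases


text = "The loan account must be approved. Each loan account accrues interest."
sentences = tokenize_sentences(text)
graph = build_graph(candidate_words(sentences), window=2)
phrases = merge_adjacent(sentences, {"loan", "account", "interest"})
```

With a window of 2, only directly neighboring candidate words are linked, so "loan" and "account" get an edge of weight 2 because they co-occur in both sentences. The keyword set passed to `merge_adjacent` stands in for the top-N result of steps 4 and 5.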

Figure 3-2 TextRank keyword extraction algorithm flow chart

As a keyword extraction method, the greatest advantage of the TextRank algorithm over other extraction methods is that it is unsupervised and works on a single text as its unit, so it does not need a large corpus for training. With the computing power of current machines it runs quickly, it can be applied to documents in a variety of file formats or on a variety of topics, and it extracts keywords in a short time. Once the keywords have been extracted, a summary can also be generated automatically, and the overall result is relatively coherent.

Problems encountered

1. Access to requirements documents

  • Download the documents from SVN and put them into the project folder so they can be read directly (since I am not yet very proficient in Python, this is the only method used for now)
  • Use a crawler to fetch the documents from SVN (the next planned optimization)
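The first approach, reading the downloaded documents straight from the project folder, might look like the sketch below. It assumes the requirement documents have been saved as UTF-8 plain-text files; the function name `read_requirement_docs` and the `*.txt` pattern are hypothetical, and other formats such as Word documents would need a dedicated parser.

```python
from pathlib import Path


def read_requirement_docs(folder, pattern="*.txt", encoding="utf-8"):
    """Return {relative path: text} for every matching file under folder."""
    root = Path(folder)
    docs = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            # errors="ignore" skips undecodable bytes instead of failing.
            docs[key] = path.read_text(encoding=encoding, errors="ignore")
    return docs
```

Keying the result by relative path keeps documents with the same file name in different subfolders apart.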

2. The extracted keywords are not relevant to the business domain

The keywords extracted at first included many phrases that were more technical than business-related. By revising the stop word list, key phrases relevant to the business were finally obtained. At present they are only printed out. The next optimization is to write the extracted keywords directly to a document and to package this as a tool.
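A minimal sketch of both ideas, filtering technical terms through an extended stop-word list and writing the result to a file instead of printing it, is given below. The function names, the tab-separated output format, and the example words and scores are made up for illustration and are not results from the project.

```python
from pathlib import Path


def load_stop_words(path, encoding="utf-8"):
    """Read one stop word per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding=encoding).splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_keywords(ranked, stop_words):
    """Drop ranked (word, score) pairs whose word is in the stop-word list."""
    return [(word, score) for word, score in ranked if word not in stop_words]


def write_keywords(ranked, out_path, encoding="utf-8"):
    """Write one 'word<TAB>score' line per keyword."""
    lines = [f"{word}\t{score:.4f}" for word, score in ranked]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding=encoding)


# Illustrative values only.
ranked = [("interface", 1.42), ("loan", 1.31), ("database", 1.10), ("repayment", 0.97)]
technical_stop_words = {"interface", "database"}
business_keywords = filter_keywords(ranked, technical_stop_words)
```

Keeping the technical terms in a separate stop-word file means the list can be revised without touching the code, which matches how the keyword quality was tuned here.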