Through text mining, we have discovered the secrets of the national civil service examination


Reportedly, in 2020 more than 1.43 million people signed up for the written civil service examination of the central organs and their directly affiliated institutions, involving 86 units and 23 directly affiliated institutions of the central and state organs, while only 24,000 people were planned to be recruited. The ratio of applicants who passed the qualification review to planned hires was about 60:1. It is not surprising that, during the registration period, several positions drew more than a thousand applicants for a single opening.

I have never taken the national civil service examination (hereinafter “the national examination”). Still, in the spirit of “experts see the craft, laymen watch the spectacle”, I decided to take a bystander’s perspective: using some semantic analysis techniques, we can find out what was tested in the administrative vocational ability test (hereinafter “the aptitude test”) over the past eight years (2011-2018) and see whether any regular patterns emerge.

Data sources

To keep the analysis current, the author collected only the national examination questions from 2011-2018 (the prefecture-level and sub-provincial-level papers are combined), and extracted only the question stems, excluding the answer options.

To get an intuitive picture of what eight years of examination questions talk about, the author first extracted keywords from the corpus as a whole.

“Travel problems” are a major question type

The word cloud below shows the top 150 extracted keywords; the size of each word reflects its importance.

[Figure: word cloud of the top 150 keywords]

The figure shows at a glance that the word “speed” appears frequently in the national examination questions of the past 8 years, which suggests that “travel problems” (distance, speed and time) account for a high proportion of the question types, as in the following examples:

  • Xiao Wang’s walking speed is 50% slower than his running speed, and his running speed is 50% slower than his cycling speed. If he… How many minutes would it take Xiao Wang to run from city A to city B?
  • Party A and Party B plan to walk from place A to place B. Party B starts at 7:00 in the morning and walks at a constant speed to… To catch up with Party B, Party A decides to run. Party A’s running speed is 2.5 times Party B’s walking speed, but Party A must rest for half an hour after every half hour of running. When can Party A catch up with Party B?
  • As shown in the figure on the right, Party A and Party B set out from places A and B at the same time and walk along the path in different directions. Party A’s speed is known to be twice that of Party B. Which of the following graphs accurately describes the relationship between the straight-line distance between the two people and time?

Keywords such as “quantity”, “mileage” and “price” also carry high weight, which likewise shows that the national examination contains many calculation questions. The mathematical operations part of the national examination is not particularly difficult overall. The answers can generally be obtained by ordinary methods, but these are relatively slow; with a few good techniques, the answers can be obtained quickly. In addition, in recent years the calculation questions in the civil service examination have focused on testing the examinee’s understanding, mastery and flexible use of common methods and techniques. The commonly used ones are the rounding method, the mantissa (last-digit) method, the grouping or elimination method, and the estimation method.


The keyword extraction above mainly considers the following four factors:

  • Frequency: the more often a word appears, the more important it is;
  • Position: whether the word appears at the beginning, middle or end of a sentence; its weight varies with its position;
  • Part of speech: mainly nouns and verbs;
  • Word length: generally speaking, the longer a word is, the richer the semantic information it carries, and the higher the weight it is given.
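As a rough illustration of how these four factors might be combined, here is a minimal sketch. It is not the author's actual extractor: the function name `score_keywords`, the weight values, and the choice to give sentence-initial and sentence-final words a small bonus are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical weights; the article does not give the real values.
POS_WEIGHT = {"n": 1.0, "v": 0.8}  # nouns and verbs; other tags score 0
EDGE_BONUS = 1.2                   # assumed bonus for first/last word of a sentence


def score_keywords(tagged_sentences, top_n=3):
    """Rank words by frequency x part of speech x position x length.

    tagged_sentences: list of sentences, each a list of (word, pos) pairs.
    """
    # Factor 1: how often each word occurs in the whole corpus.
    freq = Counter(word for sent in tagged_sentences for word, _ in sent)
    scores = {}
    for sent in tagged_sentences:
        last = len(sent) - 1
        for i, (word, pos) in enumerate(sent):
            pos_w = POS_WEIGHT.get(pos, 0.0)           # Factor 3: part of speech
            if pos_w == 0.0:
                continue
            position_w = EDGE_BONUS if i in (0, last) else 1.0  # Factor 2: position
            length_w = 1.0 + 0.1 * len(word)           # Factor 4: longer words weigh more
            score = freq[word] * pos_w * position_w * length_w
            # Keep the best score seen for each word.
            scores[word] = max(scores.get(word, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


sentences = [
    [("speed", "n"), ("is", "x"), ("constant", "a")],
    [("runner", "n"), ("doubles", "v"), ("speed", "n")],
]
top_words = score_keywords(sentences, top_n=2)
print(top_words)
```

A real extractor would need a word segmenter and part-of-speech tagger for Chinese text; the toy input here is already tokenized and tagged by hand.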

The keyword cloud above captures the main words but ignores the relationships between them, and interpreting keywords in isolation sometimes makes it hard to find meaningful insights.

So, is there a way to capture the key information (that is, find the keywords) and also show intuitively how the words relate to one another?

The answer is yes.

A word association map of all question stems from the past eight years

The word association map extends the keyword cloud above by adding the dimension of context: it expresses the relatedness of words that often appear in the same context.

Because it is built on automatic clustering, the word association map naturally reflects the semantic features and latent structure of the test questions, so the key points of the national examination over the past eight years can be identified accurately and clearly.

The generated visualization results can be interpreted as follows:

  • Font size indicates the weight of a word. The principle is the same as above: it reflects the importance of the word in the text;
  • Different colors represent different topics;
  • The closer two words are, the more frequently they appear in the same context and the more semantically related they are. For example, the words “speed”, “law enforcement ship”, “driving”, “hour” and “cycling” sit close together, so we can quickly associate them with the “travel problems” in the test questions rather than with politics, physics or cars.
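The notion of "appearing in the same context" can be made concrete with a simple co-occurrence count. The sketch below assumes that one question stem is one context and that the stems are already tokenized; the function name `cooccurrence` and the toy stems are made up for illustration and are not taken from the author's data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(stems):
    """Count how many stems each unordered pair of words shares."""
    pairs = Counter()
    for words in stems:
        # Deduplicate within a stem and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


stems = [
    ["speed", "hour", "cycling"],
    ["speed", "hour"],
    ["price", "total"],
]
pairs = cooccurrence(stems)
print(pairs.most_common(3))
```

Pairs with high counts would be drawn close together in the map, while pairs that never share a stem (such as "price" and "speed" here) would end up far apart.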

The following figure shows the result of the automatic clustering, which produced 8 topics:

[Figure: word association map clustered into 8 topics]

In the figure above, meaningful topics are selected according to the importance of the words and the size of their clusters (font size, number of topic words). From the keywords, we can infer four hot topics of the national examination over the past eight years:

  • Travel problems: these questions generally involve the changing relationship among distance, speed and time. They appear mainly in the purple word cluster, as can be seen from words such as “speed”, “driving”, “distance” and “cycling”;
  • Biomedical common sense: these questions mainly test how broadly candidates know biology and medicine-related common sense. They appear mainly in the dark blue word cluster, as can be seen from words such as “convulsion”, “phytoplankton”, “suspended matter” and “sea water”;
  • Finance: these questions mainly test candidates’ ability to do simple calculations on macroeconomic indicators. They appear mainly in the yellow word cluster, as can be seen from words such as “transaction scale”, “total amount”, “aquatic products” and “year-on-year growth”;
  • Scenario calculation: set in candidates’ everyday life and work situations, these questions test basic calculation ability. They appear mainly in the turquoise and sapphire blue word clusters, as can be seen from words such as “training”, “department”, “unit”, “average age”, “probability”, “pricing” and “balance”.

These four categories are the ones the author could identify at a glance. Readers who have taken the national examination may recognize other categories; you are welcome to share them in the comment area~


The word association map here is based on HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Compared with traditional clustering algorithms (K-means, spectral clustering, agglomerative clustering, DBSCAN, etc.), it has the following three advantages:

  • There is no need to set the number of clusters; the algorithm determines it automatically;
  • It handles noise in the data better;
  • It can find clusters of different densities (unlike DBSCAN) and is more robust to parameter selection.

Finally, the author also wanted to see whether the national examination questions have changed much over the years. This can be framed as a text mining task: measuring the similarity between the examination questions of different years. It can be done through correspondence analysis.

Similarity of test questions over the past eight years

Using the keyword extraction method described above, the top 200 keywords are extracted from the question stems of each of the past eight years. These keywords are enough to represent the national examination questions of that year. With these data, correspondence analysis can be carried out, which finally yields the following figure:

[Figure: correspondence analysis map of the 2011-2018 test questions and their keywords]

The visualization in the figure above can be interpreted as follows. First, the smaller the angle between two years, the higher their similarity. Second, the closer a keyword is to a year’s test questions, the more important that keyword is in that year’s questions and the better it represents their characteristics. This gives us two angles of analysis:

  • In terms of the similarity of test content over the years, the content of 2011 and 2012, and of 2017 and 2018, is highly correlated, which means the test structure shows good continuity; the same holds for 2013, 2014, 2015 and 2016. By contrast, the content similarity between 2012 and 2013 is low, indicating a certain jump in content. On the whole, the content of the national examination questions shows good continuity, with only occasional changes.
  • In terms of the characteristics of each year’s questions, 2011 has a more pronounced humanities flavor, 2018 leans toward economics, logic questions are more prominent in 2018, 2015 leans toward language, 2016 leans toward calculation, and the characteristics of the other years are less prominent.


Correspondence analysis can reveal the differences between the categories of one variable, as well as the correspondence between the categories of different variables. Here, for example, the test questions of different years are one set of categories and the keywords are the other. The correspondence analysis map shows how closely the eight years of test questions are related through an intuitive positional map.
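For readers who want to see the mechanics, the following is a minimal sketch of classical correspondence analysis computed with a singular value decomposition in NumPy. The small year-by-keyword count table is invented for illustration (it is not the author's data), and the helper names `correspondence_analysis` and `cosine` are assumptions of this example.

```python
import numpy as np


def correspondence_analysis(counts):
    """Return 2-D principal coordinates for the rows and columns of a count table."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]    # principal coordinates of rows
    cols = (Vt.T * sv) / np.sqrt(c)[:, None] # principal coordinates of columns
    return rows[:, :2], cols[:, :2]


def cosine(u, v):
    """Cosine of the angle between two points as seen from the origin."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Invented counts: rows are three years, columns are four keywords.
counts = [
    [10, 8, 1, 0],   # year X
    [9, 9, 2, 1],    # year Y, similar keyword profile to year X
    [1, 2, 10, 9],   # year Z, very different profile
]
year_xy, _ = correspondence_analysis(counts)
sim_xy = cosine(year_xy[0], year_xy[1])
sim_xz = cosine(year_xy[0], year_xy[2])
print(sim_xy, sim_xz)
```

Years with similar keyword profiles end up at a small angle to each other (cosine close to 1), while years with very different profiles point in opposite directions, which is the reading rule used for the figure above.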

The above is an analysis made by the author as a “layman” in the national examination. Because only the question stems were extracted, the amount of text data is small, which inevitably leads to some mistakes. Moreover, to readers who have actually taken the national examination, the results may look like only a rough outline.

Here, I would like to express my admiration for the candidates who took part in “China’s No. 1 exam” and worked so hard. On the theme of “a name on the golden list” (not as an acrostic), I let the machine write four poems to show respect:

[Figure: four machine-generated poems]

Note: technical support for the above was provided by Daguan Data.