Use ES to segment Chinese articles and rank the words by frequency

Time: 2019-9-5

Preface

First, the requirement: given an article of roughly 10,000 Chinese characters, count which words appear most frequently. The hard part is how to split a passage into words. Take "北京是中华人民共和国的首都" ("Beijing is the capital of the People's Republic of China"): "北京" (Beijing), "中华" (China), "华人" (Chinese), "人民" (the people), "共和国" (republic), and "首都" (capital) are words that should be split out, while fragments such as "京是" and "民共和" are not meaningful words and must not be. Writing these segmentation rules yourself is very troublesome, but the open-source IK analyzer handles it easily, and the granularity can be chosen through its segmentation mode.

 

ik_max_word splits the text at the finest granularity. For example, "中华人民共和国国歌" ("the national anthem of the People's Republic of China") is split into "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "人", "民", "共和国", "共和", "和", "国国", "国歌", exhausting every possible combination.

 

ik_smart does the coarsest-grained splitting. For example, "中华人民共和国国歌" is split into just "中华人民共和国" and "国歌".
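Once ES and the IK plugin are installed (the setup below), the two modes can be compared directly through the _analyze API. A minimal sketch, assuming ES is listening on localhost:9200:

$ curl 'http://localhost:9200/_analyze?analyzer=ik_max_word&pretty' -d '{"text": "中华人民共和国国歌"}'   # finest granularity
$ curl 'http://localhost:9200/_analyze?analyzer=ik_smart&pretty' -d '{"text": "中华人民共和国国歌"}'      # coarsest granularity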

 

I: Prepare the environment

If you already have an ES environment, you can skip the first two steps. Here I assume you have only a freshly installed CentOS 6.x system, so you can run through the whole process.

(1) Install the JDK

$ wget http://download.oracle.com/otn-pub/java/jdk/8u111-b14/jdk-8u111-linux-x64.rpm
$ rpm -ivh jdk-8u111-linux-x64.rpm
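Note that Oracle's otn-pub links may require accepting the license cookie (e.g. wget --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" ...). Once installed, verify:

$ java -version   # should report java version "1.8.0_111"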

 

(2) Install ES

$ wget  https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.4.2/elasticsearch-2.4.2.rpm
$ rpm -iv elasticsearch-2.4.2.rpm
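Optionally, register ES to start on boot (standard CentOS 6 service management, not specific to this setup):

$ chkconfig --add elasticsearch
$ chkconfig elasticsearch on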

 

(3) Install the IK analyzer plugin

Download version 1.10.2 of the IK plugin from GitHub. Note: our ES version is 2.4.2, and the matching IK version is 1.10.2 (see the compatibility table in the plugin's README).


 

$ mkdir /usr/share/elasticsearch/plugins/ik
$ wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.2/elasticsearch-analysis-ik-1.10.2.zip
$ unzip elasticsearch-analysis-ik-1.10.2.zip -d /usr/share/elasticsearch/plugins/ik
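Once ES is started in step (5), you can confirm the plugin was picked up:

$ curl 'localhost:9200/_cat/plugins?v'   # the IK plugin should appear in the listing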

 

(4) Configure ES

$ vim /etc/elasticsearch/elasticsearch.yml
###### Cluster ######
cluster.name: test
###### Node ######
node.name: test-10.10.10.10
node.master: true
node.data: true
###### Index ######
index.number_of_shards: 5
index.number_of_replicas: 0
###### Path ######
path.data: /data/elk/es
path.logs: /var/log/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
###### Refresh ######
refresh_interval: 5s
###### Memory ######
bootstrap.mlockall: true
###### Network ######
network.publish_host: 10.10.10.10
network.bind_host: 0.0.0.0
transport.tcp.port: 9300
###### Http ######
http.enabled: true
http.port: 9200
###### IK ########
index.analysis.analyzer.ik.alias: [ik_analyzer]
index.analysis.analyzer.ik.type: ik
index.analysis.analyzer.ik_max_word.type: ik
index.analysis.analyzer.ik_max_word.use_smart: false
index.analysis.analyzer.ik_smart.type: ik
index.analysis.analyzer.ik_smart.use_smart: true
index.analysis.analyzer.default.type: ik
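Note: bootstrap.mlockall only takes effect if the elasticsearch user is allowed to lock memory; otherwise ES logs a warning at startup and runs without the lock. A minimal sketch of the usual CentOS 6 prerequisite (assuming the stock init script runs ES as the elasticsearch user):

$ vim /etc/security/limits.conf
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited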

 

(5) Start ES

$ /etc/init.d/elasticsearch start

 

(6) Check ES node status

$ curl 'localhost:9200/_cat/nodes?v'   # shows one working node
host         ip           heap.percent ram.percent load node.role master name
10.10.10.10 10.10.10.10           16          52 0.00 d         *      test-10.10.10.10

$ curl 'localhost:9200/_cat/health?v'   # cluster status is green
epoch      timestamp cluster            status node.total node.data shards pri relo init
1483672233 11:10:33  test               green           1         1     0   0    0    0

 

II: Test the word segmentation function

(1) Create a test index

$ curl -XPUT http://localhost:9200/test

 

(2) Create mapping

$ curl -XPOST http://localhost:9200/test/fulltext/_mapping -d'
  {
      "fulltext": {
               "_all": {
              "analyzer": "ik"
          },
          "properties": {
              "content": {
                  "type" : "string",
                  "boost" : 8.0,
                  "term_vector" : "with_positions_offsets",
                  "analyzer" : "ik",
                  "include_in_all" : true
              }
          }
      }
  }'
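You can verify that the mapping was applied:

$ curl 'http://localhost:9200/test/_mapping?pretty'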

 

(3) Test the analyzer

$ curl 'http://localhost:9200/test/_analyze?analyzer=ik&pretty=true' -d '{"text": "美国留给伊拉克的是个烂摊子吗"}'   # "Is what the US left Iraq a mess?"

Returned content:

{
  "tokens" : [ {
    "token" : "美国",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "留给",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "伊拉克",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "伊",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "拉",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 4
  }, {
    "token" : "克",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "个",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "CN_CHAR",
    "position" : 6
  }, {
    "token" : "烂摊子",
    "start_offset" : 10,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "摊子",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "摊",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "子",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "CN_CHAR",
    "position" : 10
  }, {
    "token" : "吗",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "CN_CHAR",
    "position" : 11
  } ]
}

 

III: Import the real data

(1) Upload the Chinese text file to Linux

$ cat /tmp/zhongwen.txt  
Continuous inspections of heavily polluted weather in Beijing, Tianjin and Hebei found malicious production by enterprises
The producer of "Gufang Buzi" is accused of "playing like a theatre": the special effects are not in place
Barack Obama disregarded Trump's objection to insisting on moving prisoners out of Guantanamo Prison
.
.
.
.
Korean Media: Japan calls for the suspension of currency swap negotiations between Korea and Japan
China's millions of tax elites have to pay more than 400,000 yuan a year to develop abroad

Note: make sure the text file is UTF-8 encoded, otherwise the content will be garbled when it is imported into ES.

$ vim /tmp/zhongwen.txt

In command mode, enter :set fileencoding and you should see fileencoding=utf-8.

If it shows fileencoding=utf-16le, convert it by entering :set fileencoding=utf-8 and saving the file.
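Outside of vim, the encoding can also be checked and converted with standard tools (a sketch; the path matches this example):

$ file -i /tmp/zhongwen.txt   # e.g. text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 /tmp/zhongwen.txt -o /tmp/zhongwen.utf8.txt
$ mv /tmp/zhongwen.utf8.txt /tmp/zhongwen.txt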

 

(2) Create the index and mapping

Create an index

$ curl -XPUT http://localhost:9200/index

Create the mapping. This sets the IK analyzer on the message field (the field to be segmented) and enables fielddata on it so that it can be aggregated.

$ curl -XPOST http://localhost:9200/index/logs/_mapping -d '
{
  "logs": {
    "_all": {
      "analyzer": "ik"
    },
    "properties": {
      "path": {
        "type": "string"
      },
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_millis",
        "type": "date"
      },
      "@version": {
        "type": "string"
      },
      "host": {
        "type": "string"
      },
      "message": {
        "include_in_all": true,
        "analyzer": "ik",
        "term_vector": "with_positions_offsets",
        "boost": 8,
        "type": "string",
        "fielddata" : { "format" : "true" }
      },
      "tags": {
        "type": "string"
      }
    }
  }
}'

 

(3) Use Logstash to write the text file into ES

Install logstash

$ wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.1.1-1.noarch.rpm
$ rpm -ivh logstash-2.1.1-1.noarch.rpm

Configure logstash

$ vim /etc/logstash/conf.d/logstash.conf
input {
  file {
      codec => 'json' 
      path => "/tmp/zhongwen.txt"
      start_position => "beginning" 
  }
}
output {
    elasticsearch {
      hosts => "10.10.10.10:9200"
      index => "index"
      flush_size => 3000
      idle_flush_time => 2
      workers => 4
     }
  stdout { codec => rubydebug }
}
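Before starting, it is worth sanity-checking the config file (the 2.x rpm installs the binary under /opt/logstash):

$ /opt/logstash/bin/logstash --configtest -f /etc/logstash/conf.d/logstash.conf
Configuration OK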

Start it up:

$ /etc/init.d/logstash start

Watch the stdout output to confirm that events are being written to ES. (Because the input is plain text rather than JSON, the json codec tags each event with _jsonparsefailure, as seen in the search results below, but the raw line is still indexed in the message field.)

$ tail -f /var/log/logstash.stdout

 

(4) Check whether there is data in the index

$ curl 'localhost:9200/_cat/indices/index?v'   # shows 6007 documents
health status index pri rep docs.count docs.deleted store.size pri.store.size 
green  open   index   5   0       6007            0      2.5mb          2.5mb
$ curl -XPOST  "http://localhost:9200/index/_search?pretty"
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5227,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "logs",
      "_id" : "AVluC7Dpbw7ZlXPmUTSG",
      "_score" : 1.0,
      "_source" : {
        "Message": "China has to pay more than 400,000 tax elites a million years'salary to go abroad to develop."
        "tags" : [ "_jsonparsefailure" ],
        "@version" : "1",
        "@timestamp" : "2017-01-05T09:52:56.150Z",
        "host" : "0.0.0.0",
        "path" : "/tmp/333.log"
      }
    }, {
      "_index" : "index",
      "_type" : "logs",
      "_id" : "AVluC7Dpbw7ZlXPmUTSN",
      "_score" : 1.0,
      "_source" : {
        "Message": "Barack Obama disregarded Trump's objection to insisting on moving prisoners out of Guantanamo prison."
        "tags" : [ "_jsonparsefailure" ],
        "@version" : "1",
        "@timestamp" : "2017-01-05T09:52:56.222Z",
        "host" : "0.0.0.0",
        "path" : "/tmp/333.log"
      }
    }, ... ]
  }
}

 

IV: Compute word frequencies and rank the tokens

(1) Query the TOP10 most frequent tokens among all words

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'
{  
    "size" : 0,  
    "aggs" : {   
        "messages" : {   
            "terms" : {   
               "size" : 10,
              "field" : "message"
            }  
        }  
    }
}'

Return results (the keys are IK tokens, mostly single Chinese characters, shown here in translation; note that doc_count is the number of documents containing the token, not its total number of occurrences):

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6007,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "doc_count_error_upper_bound" : 154,
      "sum_other_doc_count" : 94992,
      "buckets" : [ {
        "Key": "one".
        "doc_count" : 1582
      }, {
        "Key": "after"
        "doc_count" : 560
      }, {
        "Key": "human".
        "doc_count" : 541
      }, {
        "Key": "home".
        "doc_count" : 538
      }, {
        "Key": "out".
        "doc_count" : 489
      }, {
        "Key": "hair".
        "doc_count" : 451
      }, {
        "Key": "ge".
        "doc_count" : 440
      }, {
        "Key": "state".
        "doc_count" : 421
      }, {
        "Key": "age".
        "doc_count" : 405
      }, {
        "Key": "son".
        "doc_count" : 402
      } ]
    }
  }
}
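The doc_count_error_upper_bound of 154 shows these top-N counts are approximate. If more accuracy is needed, the terms aggregation's standard shard_size parameter makes each shard return more candidate terms before the final merge (the value 1000 here is an arbitrary example):

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'
{
    "size" : 0,
    "aggs" : {
        "messages" : {
            "terms" : {
                "size" : 10,
                "shard_size" : 1000,
                "field" : "message"
            }
        }
    }
}'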

 

(2) Query the TOP10 most frequent two-character words. Single characters dominated the previous result, so the include regex below matches exactly two characters in the CJK unified ideograph range \u4E00-\u9FA5, filtering out everything else.

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'
{  
    "size" : 0,
    "aggs" : {   
        "messages" : {  
            "terms" : {   
                 "size" : 10,
              "field" : "message",
                "include" : "[\u4E00-\u9FA5][\u4E00-\u9FA5]"
            }  
        }  
    },
   "highlight": {
     "fields": {
      "message": {}
    }
  }     
}'

Return results (keys shown in translation):

{
  "took" : 22,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6007,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "doc_count_error_upper_bound" : 73,
      "sum_other_doc_count" : 42415,
      "buckets" : [ {
        "Key": "woman"
        "doc_count" : 291
      }, {
        "Key": "man"
        "doc_count" : 264
      }, {
        "Key": "unexpectedly".
        "doc_count" : 257
      }, {
        "Key": "Shanghai".
        "doc_count" : 255
      }, {
        "Key": "this".
        "doc_count" : 238
      }, {
        "Key": "girl"
        "doc_count" : 174
      }, {
        "Key": "these".
        "doc_count" : 167
      }, {
        "Key": "one".
        "doc_count" : 159
      }, {
        "Key": "Attention".
        "doc_count" : 143
      }, {
        "Key": "like this".
        "doc_count" : 142
      } ]
    }
  }
}

 

(3) Query the TOP10 most frequent two-character words, excluding those that begin with 女 ("female").

$ curl -XGET "http://localhost:9200/index/_search?pretty" -d'
{  
    "size" : 0,
    "aggs" : {   
        "messages" : {  
            "terms" : {   
              "size" : 10,
              "field" : "message",
              "include" : "[\u4E00-\u9FA5][\u4E00-\u9FA5]",
              "Exclude": "female. *"
            }  
        }  
    },
   "highlight": {
     "fields": {
      "message": {}
    }
  }     
}'

Return results (keys shown in translation):

{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5227,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "doc_count_error_upper_bound" : 71,
      "sum_other_doc_count" : 41773,
      "buckets" : [ {
        "Key": "man"
        "doc_count" : 264
      }, {
        "Key": "unexpectedly".
        "doc_count" : 257
      }, {
        "Key": "Shanghai".
        "doc_count" : 255
      }, {
        "Key": "this".
        "doc_count" : 238
      }, {
        "Key": "these".
        "doc_count" : 167
      }, {
        "Key": "one".
        "doc_count" : 159
      }, {
        "Key": "Attention".
        "doc_count" : 143
      }, {
        "Key": "like this".
        "doc_count" : 142
      }, {
        "Key": "Chongqing".
        "doc_count" : 142
      }, {
        "Key": "result"
        "doc_count" : 137
      } ]
    }
  }
}

 

There are many more segmentation features, such as synonyms (set "番茄" and "西红柿", two Chinese words that both mean "tomato", as synonyms, so that a search for either one also finds the other) and pinyin analysis (so that searching for "zhonghua" also finds "中华"), and so on.
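For the synonym case, a minimal sketch using ES's built-in synonym token filter together with the IK tokenizer (the index name syn_test and the synonym pair are illustrative assumptions, not part of this setup; pinyin search requires the separate elasticsearch-analysis-pinyin plugin):

$ curl -XPUT 'http://localhost:9200/syn_test' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [ "番茄,西红柿" ]
        }
      },
      "analyzer": {
        "ik_syn": {
          "type": "custom",
          "tokenizer": "ik",
          "filter": [ "my_synonyms" ]
        }
      }
    }
  }
}'

With this analyzer applied to a field, a search for "番茄" also matches documents that contain only "西红柿".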
