[elasticsearch] Installing the IK Chinese analyzer for Elasticsearch

Time:2020-1-14

1、 Compiling the IK analyzer

1. Maven environment configuration

wget http://mirrors.cnnic.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
  • Decompress
tar xvf apache-maven-3.3.9-bin.tar.gz
  • Move Maven to /usr/local/apache-maven
mv apache-maven-3.3.9 /usr/local/apache-maven
  • Configure environment variables
vim  ~/.bashrc

#Add the following
export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin 
export PATH=$M2:$PATH
  • Load environment variables
source ~/.bashrc
  • View version
mvn -version
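
The manual edit of ~/.bashrc above can also be scripted. The following is a minimal sketch, assuming Maven was moved to /usr/local/apache-maven as described; the function name add_maven_env is a hypothetical helper, not part of Maven.

```shell
#!/bin/sh
# add_maven_env RC_FILE [M2_HOME]
# Appends the Maven environment variables to RC_FILE unless they are already present.
# (add_maven_env is a hypothetical helper name, not a Maven command.)
add_maven_env() {
    rc_file=$1
    m2_home=${2:-/usr/local/apache-maven}

    # Skip the append if a previous run already wrote the M2_HOME line.
    if grep -q '^export M2_HOME=' "$rc_file" 2>/dev/null; then
        return 0
    fi

    # \$ keeps the variable references literal so they expand when the rc file is sourced.
    cat >> "$rc_file" <<EOF
export M2_HOME=$m2_home
export M2=\$M2_HOME/bin
export PATH=\$M2:\$PATH
EOF
}
```

A possible use is `add_maven_env ~/.bashrc && . ~/.bashrc && mvn -version`. Because of the grep guard, running it twice should not duplicate the lines.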

2. IK version selection

IK version    ES version
master        5.x -> master
5.6.1         5.6.1
5.5.3         5.5.3
5.4.3         5.4.3
5.3.3         5.3.3
5.2.2         5.2.2
5.1.2         5.1.2
1.10.1        2.4.1
1.9.5         2.3.5
1.8.1         2.2.1
1.7.0         2.1.1
1.5.0         2.0.0
1.2.6         1.0.0
1.2.5         0.90.x
1.1.3         0.20.x
1.0.0         0.16.2 -> 0.19.0

The Elasticsearch instance used here is version 2.4, so according to the table above the matching IK version is 1.10.1.

  • 1.10.1 IK GitHub address
https://github.com/medcl/elasticsearch-analysis-ik/tree/v1.10.1
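
The lookup in the version table can be expressed as a small shell function. This is only a sketch: ik_version_for_es is a hypothetical name, and it covers just the table rows that name one exact Elasticsearch version (ranges such as 0.90.x are left out).

```shell
#!/bin/sh
# ik_version_for_es ES_VERSION
# Prints the IK release that the version table pairs with the given Elasticsearch version.
# Only rows with a single exact ES version are covered; anything else returns 1.
ik_version_for_es() {
    case $1 in
        5.6.1|5.5.3|5.4.3|5.3.3|5.2.2|5.1.2) echo "$1" ;;  # 5.x rows: IK version equals ES version
        2.4.1) echo 1.10.1 ;;
        2.3.5) echo 1.9.5 ;;
        2.2.1) echo 1.8.1 ;;
        2.1.1) echo 1.7.0 ;;
        2.0.0) echo 1.5.0 ;;
        1.0.0) echo 1.2.6 ;;
        *) return 1 ;;
    esac
}
```

For example, `ik_version_for_es 2.4.1` prints 1.10.1, which is the release used in the rest of this post.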

2、 Install IK

Download the IK source code, upload it to the server, and change into the source directory.

1. Package IK

mvn package

Packaging downloads a number of jar dependencies from the Maven repository, so it takes a while; be patient.

After the build finishes, the resulting zip package is in the target/releases directory.

cd target/releases

The compiled IK package is there and can be used directly:

elasticsearch-analysis-ik-1.10.1.zip

2. Configure IK

Extract the IK package and place its contents in the /data/components/elasticsearch/plugins/ik directory.

vim plugin-descriptor.properties

#Adjust the following parameters to match your environment
elasticsearch.version=2.4.1
java.version=1.7
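
Instead of editing plugin-descriptor.properties in vim, the two properties can be rewritten with sed. The following is a hedged sketch: set_descriptor_versions is a hypothetical helper, and it assumes both keys already exist in the file, each at the start of its own line.

```shell
#!/bin/sh
# set_descriptor_versions FILE ES_VERSION JAVA_VERSION
# Rewrites the elasticsearch.version and java.version lines of a
# plugin-descriptor.properties file, leaving all other lines untouched.
set_descriptor_versions() {
    file=$1
    es_version=$2
    java_version=$3
    tmp=$(mktemp) || return 1

    # Write to a temporary file first so a failed sed never truncates the original.
    sed -e "s/^elasticsearch\.version=.*/elasticsearch.version=$es_version/" \
        -e "s/^java\.version=.*/java.version=$java_version/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

With the values used in this post, the call would be `set_descriptor_versions plugin-descriptor.properties 2.4.1 1.7`, run from the plugin directory.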

3、 Configure Elasticsearch

1. Modify the Elasticsearch configuration

vim /etc/elasticsearch/elasticsearch.yml

#Add the following
index.analysis.analyzer.default.type: ik
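
The same setting can be applied from a script so that repeated runs do not add duplicate lines. This is a minimal sketch, assuming the setting sits on a single top-level line as shown above; set_default_analyzer is a hypothetical helper name.

```shell
#!/bin/sh
# set_default_analyzer YML_FILE [ANALYZER]
# Makes sure YML_FILE contains the index.analysis.analyzer.default.type setting,
# replacing an existing value or appending the line if it is missing.
set_default_analyzer() {
    yml=$1
    analyzer=${2:-ik}
    key='index.analysis.analyzer.default.type'

    if grep -q "^$key:" "$yml" 2>/dev/null; then
        # The key is already present: rewrite its value in place via a temporary file.
        new=$(mktemp) || return 1
        sed "s/^$key:.*/$key: $analyzer/" "$yml" > "$new" && cat "$new" > "$yml"
        status=$?
        rm -f "$new"
        return $status
    fi

    # The key is missing: append it.
    echo "$key: $analyzer" >> "$yml"
}
```

A typical call would be `set_default_analyzer /etc/elasticsearch/elasticsearch.yml ik`, run with sufficient privileges to write the file.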

2. Restart Elasticsearch

systemctl restart elasticsearch.service

Configuration is complete; the next step is to test it.

4、 Test the IK analyzer

IK provides two analyzers:
ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; text already assigned to one word is not reused in another word

1. ik_max_word test

curl -XGET 'http://172.16.200.101:9200/_analyze?pretty&analyzer=ik_max_word' -d 'Zhangtong Home is the largest preschool education management platform in the world'

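#The response is as follows (token values are shown in English translation; the offsets refer to character positions in the original Chinese sentence)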

{
  "tokens" : [ {
    "token" : "palm",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "communication",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "home",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "home",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "global",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "maximum",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "preschool education",
    "start_offset" : 10,
    "end_offset" : 12,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "teaching management",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "management",
    "start_offset" : 12,
    "end_offset" : 14,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "platform",
    "start_offset" : 14,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "set",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 10
  } ]
}

2. ik_smart test

curl -XGET 'http://172.16.200.101:9200/_analyze?pretty&analyzer=ik_smart' -d 'Zhangtong Home is the largest preschool education management platform in the world'

#The response is as follows
{
  "tokens" : [ {
    "token" : "palm",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "communication",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "home",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "global",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "maximum",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "preschool education",
    "start_offset" : 10,
    "end_offset" : 12,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "management",
    "start_offset" : 12,
    "end_offset" : 14,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "platform",
    "start_offset" : 14,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 7
  } ]
}
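
To compare the two analyzers at a glance, it helps to reduce each response to its list of tokens. The following is a line-oriented sketch, assuming the pretty-printed layout shown above with the lowercase "token" key that Elasticsearch returns; extract_tokens is a hypothetical helper, and it is not a general JSON parser.

```shell
#!/bin/sh
# extract_tokens
# Reads an _analyze response on stdin and prints one token value per line.
# Relies on the pretty-printed layout (one "token" field per line).
extract_tokens() {
    sed -n 's/^[[:space:]]*"token"[[:space:]]*:[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p'
}
```

It could be used by piping the response of either request into it, for example `curl -s -XGET 'http://172.16.200.101:9200/_analyze?pretty&analyzer=ik_smart' -d "$text" | extract_tokens`, where the variable text (an assumption of this sketch) holds the sentence to analyze.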