Pitfalls of Spring Cloud health checks

Time:2022-5-26


Health check

Health checks based on Spring Boot Actuator are an essential part of Spring Cloud microservices; they are used to determine whether a service is available.

After adding Spring Boot Actuator, visit http://ip:port/health to see the default monitoring results provided by HealthEndpoint, including the disk space check and the database check, as follows:


{
    "status": "UP",
    "diskSpace": {
        "status": "UP",
        "total": 398458875904,
        "free": 315106918400,
        "threshold": 10485760
    },
    "db": {
        "status": "UP",
        "database": "MySQL",
        "hello": 1
    }
}

Exclude unnecessary health check items

One day, a caller suddenly reported that our service could not be reached. The Eureka console showed the service status as UP, and the service process was running normally. Just as I was at a loss, it occurred to me that the health check might be the cause, because the Eureka client relies on the health check to decide whether a service is available. If any single monitoring item of Spring Boot Actuator is DOWN, the overall health status of the application is also DOWN, and the caller then treats the service as unavailable.

Checking http://ip:port/health again, sure enough, the mail health check was DOWN.

The project had recently introduced spring-boot-starter-mail to send mail.

The mail server had gone down, which caused the health check status of the whole service to be DOWN.


{
  "status": "DOWN",
  "mail": {
    "status": "DOWN",
    "location": "email-smtp.test.com:-1",
    "error": "javax.mail.AuthenticationFailedException: 535 Authentication Credentials Invalid\n"
  },
  "diskSpace": {
    "status": "UP",
    "total": 266299998208,
    "free": 146394308608,
    "threshold": 10485760
  },
  "hystrix": {
    "status": "UP"
  }
}
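
The aggregation rule behind this result, where a single DOWN item makes the whole application DOWN, can be sketched in plain Java. This is a simplified model written for illustration, not Spring Boot Actuator's actual implementation; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of health aggregation; NOT Spring Boot's actual code.
class HealthAggregationSketch {

    // The overall status is DOWN as soon as any single item reports DOWN.
    static String aggregate(Map<String, String> itemStatuses) {
        for (String status : itemStatuses.values()) {
            if ("DOWN".equals(status)) {
                return "DOWN";
            }
        }
        return "UP";
    }

    public static void main(String[] args) {
        Map<String, String> items = new LinkedHashMap<>();
        items.put("mail", "DOWN");
        items.put("diskSpace", "UP");
        items.put("hystrix", "UP");
        // One DOWN item drags the whole result down.
        System.out.println(aggregate(items));

        // Leaving the non-core mail item out of the check restores the overall status.
        items.remove("mail");
        System.out.println(aggregate(items));
    }
}
```

In this model, a non-core item such as mail has the same weight as a core one, which is why it is worth deciding which items take part in the check at all.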

Since sending mail is not a core function, such non-core components can be excluded from the health check so that they cannot make the whole service unavailable.

Disable the mail health check with the following configuration:


management.health.mail.enabled=false

A major pitfall caused by Spring Cloud health check timeouts

0. Terminology

Service: a single microservice.

Server: an application instance that provides exactly one microservice. A service generally has multiple servers.

1. Problem description

In production, Spring Cloud ran into the following problem: occasionally, all servers of a service were removed.

2. Cause analysis

By default, Spring Cloud uses the health URL of Spring Boot Actuator as its health check, and the default check timeout is 10s. If the production environment runs into problems such as a slow or hanging network, DB, or Redis, the health check request times out. The Spring Cloud registry then considers the server abnormal and changes its status to critical, and the service caller (Feign) removes the abnormal server from load balancing (HealthServiceServerListFilter).

If a whole network segment is affected, or there is a larger network, DB, or similar problem, all servers of a service are removed by the registry, and the service becomes unavailable.

In reality, though, the servers only have a partial problem. For example, the DB or Redis is merely slow, not unavailable, yet the servers are forcibly removed by the registry.
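
The effect of a check timeout on a dependency that is slow but still working can be illustrated with a minimal sketch. It uses only the Java standard library and does not reproduce the registry's real logic; the class name, the `slowProbe` helper, and the scaled-down timeout (200 ms instead of 10 s) are assumptions made to keep the example short.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative model of a health check with a timeout; NOT the registry's real code.
class TimedHealthCheckSketch {

    // Runs the probe and reports "critical" if it fails or does not finish in time.
    static String check(Supplier<String> probe, long timeoutMillis) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(probe);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            return "critical"; // the dependency was slow, not necessarily unavailable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "critical";
        } catch (ExecutionException e) {
            return "critical";
        }
    }

    // Hypothetical probe that answers UP after a delay, standing in for a slow DB or Redis.
    static Supplier<String> slowProbe(long delayMillis) {
        return () -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "UP";
        };
    }

    public static void main(String[] args) {
        // Timeout scaled down from 10 s to 200 ms for the demo.
        System.out.println(check(slowProbe(10), 200));   // fast dependency
        System.out.println(check(slowProbe(1000), 200)); // slow dependency that would still answer UP
    }
}
```

The second probe would eventually answer UP, but the caller has already given up and recorded a failure. When every server shares the same slow dependency, every server fails the check at the same time.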

3. Solutions

3.1 General solution

Disable the health checks so that the endpoint always returns UP. As long as the program starts normally, it is considered able to provide service.

The following is the default health check output of the project template:


{
 "description": "",
 "status": "UP",
 "diskSpace": {
  "description": "",
  "status": "UP",
  "total": 50715856896,
  "free": 7065239552,
  "threshold": 10485760
 },
 "solr": {
  "description": "",
  "status": "UP",
  "solrStatus": "OK"
 },
 "redis": {
  "description": "",
  "status": "UP",
  "version": "2.8.21"
 },
 "db": {
  "description": "",
  "status": "UP",
  "authDataSource": {
   "description": "",
   "status": "UP",
   "database": "MySQL",
   "hello": "x"
  },
  "autodealerDataSource": {
   "description": "",
   "status": "UP",
   "database": "Microsoft SQL Server",
   "hello": "x"
  }
 }
}

How to disable the health checks:

# In application*.yml
management:
  health:
    defaults:
      enabled: false

Health check result after disabling:


{
 "description": "",
 "status": "UP",
 "application": {
  "description": "",
  "status": "UP"
 }
}

4. If a specific health check is needed

After the health checks are disabled, any specific health check that is still required must be enabled individually. The configuration is as follows:

management:
  health:
    defaults:
      enabled: false
    # The following setting enables the DB health check
    db:
      enabled: true

The health check results are as follows:


{
 "description": "",
 "status": "UP",
 "db": {
  "description": "",
  "status": "UP",
  "authDataSource": {
   "description": "",
   "status": "UP",
   "database": "MySQL",
   "hello": "x"
  },
  "autodealerDataSource": {
   "description": "",
   "status": "UP",
   "database": "Microsoft SQL Server",
   "hello": "x"
  }
 }
}
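
How the two settings combine can be modeled with a short sketch: an indicator-specific flag takes precedence, and the defaults flag applies otherwise. This is a simplified illustration, assuming the defaults flag is true when unset; it is not Spring Boot's actual condition code, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of how the health check switches are resolved; NOT Spring Boot's code.
class HealthToggleSketch {

    // An indicator-specific flag wins; otherwise the defaults flag applies
    // (assumed to be true when it is not set).
    static boolean isEnabled(Map<String, String> props, String indicator) {
        String specific = props.get("management.health." + indicator + ".enabled");
        if (specific != null) {
            return Boolean.parseBoolean(specific);
        }
        return Boolean.parseBoolean(
                props.getOrDefault("management.health.defaults.enabled", "true"));
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("management.health.defaults.enabled", "false");
        props.put("management.health.db.enabled", "true");

        System.out.println("db: " + isEnabled(props, "db"));       // explicitly enabled
        System.out.println("redis: " + isEnabled(props, "redis")); // falls back to defaults
    }
}
```

Under this model only db remains in the health check output, which matches the result shown above.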

The above is my personal experience; I hope it serves as a useful reference.