Introduction: Monitoring best practices for Redis and business interfaces
On December 4, 2020, the CPU usage of db0 in a customer's Redis cluster edition instance suddenly rose to 100%, and the database could no longer serve requests normally. Investigation found a big key of about 2M in the customer's business data, which blocked db0. The customer's clients connected to the cluster in the default proxy mode, shown in the figure below, so the blocked db0 also prevented the other nodes from serving requests normally. Resolution: the customer cut off the frequent calls from the service that used the big key, after which requests recovered.
Figure 1: proxy mode
This problem seriously affected the customer's course registration entry point and prompted a deeper review. The monitoring and alarms in place for Redis and similar products were neither complete nor fine-grained enough: a later review of the business logs showed that the error rate had been rising gradually before the problem surfaced at the Redis level. To address the big key problem in Redis, this article describes how to analyze big keys and hot keys, and proposes making the customer-side monitoring alarms clearer and adding error alarms on the business log interface.
2. Database monitoring and analysis
2.1 Redis monitoring metrics
The cloud monitoring metrics for Redis cluster edition are listed in the following table.
|Metric|Unit|Metric name|Dimensions|Statistics|
|---|---|---|---|---|
|Average response time|us|ShardingAvgRt|userId, instanceId, nodeId|Average, Maximum|
|Inbound traffic|KByte/s|ShardingIntranetIn|userId, instanceId, nodeId|Average, Maximum|
|Inbound bandwidth usage|%|ShardingIntranetInRatio|userId, instanceId, nodeId|Average, Maximum|
|Outbound traffic|KByte/s|ShardingIntranetOut|userId, instanceId, nodeId|Average, Maximum|
|Outbound bandwidth usage|%|ShardingIntranetOutRatio|userId, instanceId, nodeId|Average, Maximum|
|Number of keys in cache|count|ShardingKeys|userId, instanceId, nodeId|Average, Maximum|
|Maximum response time|us|ShardingMaxRt|userId, instanceId, nodeId|Average, Maximum|
|QPS usage|%|ShardingQPSUsage|userId, instanceId, nodeId|Average, Maximum|
|Average requests per second|count|ShardingUsedQPS|userId, instanceId, nodeId|Average, Maximum|
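
As an illustration of how these metrics might feed an alarm, the following minimal sketch compares the Maximum statistic of each node against a threshold. The metric names come from the table; the threshold values, the `Datapoint` type and the `find_breaches` helper are assumptions made for this example, not part of any cloud monitoring API.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to the instance specification.
THRESHOLDS = {
    "ShardingQPSUsage": 80.0,          # percent
    "ShardingIntranetInRatio": 80.0,   # percent
    "ShardingIntranetOutRatio": 80.0,  # percent
    "ShardingAvgRt": 5000.0,           # microseconds
}


@dataclass
class Datapoint:
    metric: str     # metric name from the table
    node_id: str    # value of the nodeId dimension
    maximum: float  # Maximum statistic for the sampling period


def find_breaches(points, thresholds=THRESHOLDS):
    """Return the datapoints whose Maximum exceeds the configured threshold."""
    return [
        p for p in points
        if p.metric in thresholds and p.maximum > thresholds[p.metric]
    ]
```

Checking per node rather than per instance matters here: in the incident described in the introduction only db0 was saturated, and an instance-wide average could have hidden it.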
2.2 Redis big key analysis
1. Select the instance in the console and run the big key and hot key analysis.
Figure 2: instance analysis
2. Use the API to analyze big keys and hot keys.
For details on cache analysis and hot key queries, see the reference at the end of this article.
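
Besides the console and the API, a rough big key check can also be scripted on the client side. The sketch below is not the console's analysis method; it assumes a redis-py style client that exposes `scan_iter` and `memory_usage` (the `SCAN` and `MEMORY USAGE` commands), and the 1 MiB threshold and the `find_big_keys` name are chosen only for illustration.

```python
def find_big_keys(client, threshold_bytes=1024 * 1024, scan_count=1000):
    """Yield (key, size_in_bytes) for keys at or above the threshold.

    `client` is assumed to behave like redis-py's `redis.Redis`:
    `scan_iter` walks the keyspace incrementally with SCAN, and
    `memory_usage` wraps MEMORY USAGE (available since Redis 4.0).
    """
    for key in client.scan_iter(count=scan_count):
        size = client.memory_usage(key)
        # MEMORY USAGE returns nil if the key expired after SCAN returned it.
        if size is not None and size >= threshold_bytes:
            yield key, size
```

With redis-py installed, the function could be called as `list(find_big_keys(redis.Redis(host="...", port=6379)))`, where the connection details are placeholders. Because `SCAN` is incremental it avoids the blocking behavior of `KEYS`, but a full scan still adds load, so it is better run during off-peak hours.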
2.3 Monitoring the database on the same link
Alarm rules for a group are now created from the application group page.
2.3.1 create application group
Figure 3: creating application groups
2.3.2 creating alarm rules
Figure 4: creating alarm rules
Figure 5: setting alarm rules
3. Log monitoring
With the customer's logs collected into SLS (Log Service), dashboards and alarms can be configured by setting rules. In this solution, logs are collected with Logtail over the intranet.
3.1 installing logtail
For how to install Logtail, see the reference at the end of this article.
3.2 create project and logstore
Log in to the Log Service console, then create the project and the Logstore in the corresponding region.
Figure 6: project logstore creation
3.3 Data access wizard
The customer's logs come in two formats: JSON and log4j.
For JSON logs: select JSON text log > select an existing machine group > complete the corresponding Logtail configuration
Figure 7: logtail configuration
1. Set index
For nested (multi-level) JSON logs, change the index field type to JSON.
Figure 8: setting index
2. Query and analysis
Figure 9: query analysis
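
The rising error rate mentioned in the introduction is the kind of signal these JSON logs can expose. As a minimal offline sketch of the calculation (not an SLS query), the function below computes the error rate per interface from JSON log lines. The field names `interface` and `status`, and the rule that a status of 500 or above counts as an error, are assumptions; the real field names depend on the customer's log schema.

```python
import json
from collections import defaultdict


def error_rate_by_interface(lines, interface_field="interface", status_field="status"):
    """Return {interface: error_rate} computed from JSON log lines."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        name = record.get(interface_field, "unknown")
        totals[name] += 1
        status = record.get(status_field)
        # Assumption: an integer status of 500 or above marks a failed call.
        if isinstance(status, int) and status >= 500:
            errors[name] += 1
    return {name: errors[name] / totals[name] for name in totals}
```

An alarm rule would then compare each rate against a threshold over a time window, so that a gradual increase is caught before it reaches the database.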
For log4j logs: select regular expression text log > select an existing machine group > complete the corresponding Logtail configuration
1. Identify the first line of each log entry with a regular expression
Figure 10: setting up automatic generation
2. Extract fields
Figure 11: log extraction fields
3. Set index
Note: the index only takes effect for newly written data.
Figure 12: setting index
4. Query and analysis
Figure 13: query analysis
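
The first-line recognition and field extraction steps above can be sketched offline as follows. The log layout is an assumption (timestamp, level, thread in brackets, logger, then the message); the actual regular expression must match the customer's log4j pattern, and `FIRST_LINE` and `parse_entries` are names invented for this example.

```python
import re

# Assumed layout: "2020-12-04 10:15:30,123 ERROR [thread] logger - message"
FIRST_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<logger>\S+)\s+-\s+(?P<message>.*)$"
)


def parse_entries(lines):
    """Group raw lines into log entries and extract the named fields.

    A line that matches FIRST_LINE starts a new entry; any other line
    (for example a stack trace line) is appended to the previous entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = FIRST_LINE.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            entries[-1]["message"] += "\n" + line
    return entries
```

Testing the expression against a few real log samples, including a multi-line stack trace, before saving the Logtail configuration helps avoid entries being split or merged incorrectly.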
3.4 log alarm
3.4.1 Dashboard
Figure 14: dashboard information display
Click Alarm in the navigation bar at the upper right of the dashboard and select Create from the drop-down menu.
Figure 15: creating alarms
Figure 16: alarm content setting
For the alarm content sent to the DingTalk robot, refer to the DingTalk robot alarm template listed at the end of this article.
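
When an alarm is pushed from a custom script instead of the console template, the message for a DingTalk custom robot might be assembled as in the sketch below. It uses the robot's text message format (`msgtype`, `text.content`, `at`); the function name and the example fields are hypothetical, and the webhook URL with its access token has to come from the robot's own settings.

```python
import json


def build_dingtalk_text_message(title, details, at_mobiles=()):
    """Build the JSON body of a DingTalk custom robot text message."""
    lines = [title] + [f"{key}: {value}" for key, value in details.items()]
    return {
        "msgtype": "text",
        "text": {"content": "\n".join(lines)},
        "at": {"atMobiles": list(at_mobiles), "isAtAll": False},
    }


# Example body for a Redis alarm; the values are illustrative only.
body = json.dumps(
    build_dingtalk_text_message(
        "Redis alarm",
        {"metric": "ShardingQPSUsage", "node": "db0", "value": "97%"},
    )
)
```

The body would be sent with an HTTP POST and a `Content-Type: application/json` header to the robot's webhook URL.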
Cache analysis and hot key query: https://help.aliyun.com/document_detail/184226.html?spm=a2c4g.11186623.6.975.255f3635R5By1i
Install Logtail (Linux system): https://help.aliyun.com/document_detail/28982.html?spm=a2c4g.111866188.8.131.52a09d7cBfTtvl
DingTalk robot alarm template: https://help.aliyun.com/document_detail/91785.html?spm=5176.2020520112.0.dexternal.62b334c0S2Jxx2
We are the SRE team of Alibaba Cloud Intelligence Global Technology Services. We are committed to becoming a technology-based, service-oriented engineering team focused on high availability. We provide professional, systematic SRE services to help customers use the cloud better, build more stable and reliable business systems on the cloud, and improve business stability. We hope to share more technical content that helps enterprise customers move to the cloud, make good use of it, and run their cloud business more stably and reliably. You can scan the QR code below with DingTalk to join the DingTalk group of the Alibaba Cloud SRE Institute of Technology and talk with more cloud practitioners about the cloud platform.