How to view search engine spider and crawler activity with Linux and nginx

Time:2021-12-2

Abstract

The first step in website SEO is getting spiders and crawlers to visit your site often. The Linux commands below show clearly how spiders are crawling it. The example uses an nginx server whose log file is /usr/local/nginx/logs/access.log. The access.log file should hold the log of the last day. First check the size of the log. If it is large (more than 50 MB), it is better not to run these commands on it directly, because they consume CPU; instead, copy the log to a separate analysis machine and run them there, so that the speed of the website is not affected.
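The size check described above could be scripted roughly as follows. This is a minimal sketch: the function name log_too_large is hypothetical, and the 50 MB default simply mirrors the threshold suggested in the text.

```shell
#!/bin/sh
# Hypothetical helper: returns success (exit status 0) when the log file
# is larger than the given limit in megabytes (default: 50).
log_too_large() {
    log_file=$1
    limit_mb=${2:-50}
    # wc -c reports the size in bytes; tr strips any padding some systems add.
    size_bytes=$(wc -c < "$log_file" | tr -d '[:space:]')
    [ "$size_bytes" -gt $((limit_mb * 1024 * 1024)) ]
}

# Usage (the path is the one named in the article; adjust it to your install):
# if log_too_large /usr/local/nginx/logs/access.log; then
#     echo "Log is large: copy it to another machine before analysing it"
# fi
```

Reading the size with wc -c is cheap compared with the grep pipelines that follow, so the check itself should not noticeably load the server.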

Linux shell commands

1. Number of times the Baidu spider crawled the site

cat access.log | grep Baiduspider | wc

The leftmost value in the output (the line count) is the number of crawls.

2. Detailed records of the Baidu spider (press Ctrl+C to stop the output)

cat access.log | grep Baiduspider

You can also use the following command:

cat access.log | grep Baiduspider | tail -n 10
cat access.log | grep Baiduspider | head -n 10

These show only the last 10 or the first 10 entries, which also tells you the start date and time of the log file.
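Instead of reading the timestamps off the head and tail output by eye, they can be extracted directly. The sketch below assumes nginx's default "combined" log format, in which the fourth whitespace-separated field looks like [02/Dec/2021:08:15:01; the function name log_time_range is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the timestamps of the first and last log
# entries that match a pattern (default: Baiduspider).
log_time_range() {
    log_file=$1
    pattern=${2:-Baiduspider}
    grep "$pattern" "$log_file" | awk '
        NR == 1 { first = substr($4, 2) }   # substr drops the leading "["
                { last  = substr($4, 2) }   # overwritten on every line
        END     { if (NR > 0) print first " -> " last }'
}

# Usage:
# log_time_range access.log
# log_time_range access.log Googlebot
```

If the server uses a custom log_format, the field number will differ and needs adjusting.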

3. Detailed records of the Baidu spider crawling the home page

cat access.log | grep Baiduspider | grep "GET / HTTP"

Baidu spiders seem to love the home page and visit it every hour, while Google and Yahoo spiders prefer inner pages.

4. Time distribution of Baidu spider crawl records

cat access.log | grep "Baiduspider" | awk '{print $4}'
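The raw list of timestamps can be long, so one option is to group it by hour. The following is a sketch under the assumption that the log uses nginx's default "combined" format, where field 4 is [day/month/year:hour:minute:second; the function name spider_hits_per_hour is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count matching requests per hour of the day.
# Output: one line per hour, "count hour", in ascending hour order.
spider_hits_per_hour() {
    log_file=$1
    pattern=${2:-Baiduspider}
    grep "$pattern" "$log_file" |
        # Splitting "[02/Dec/2021:08:15:01" on ":" puts the hour in t[2].
        awk '{ split($4, t, ":"); print t[2] }' |
        sort | uniq -c
}

# Usage:
# spider_hits_per_hour access.log
```

A spider that really visits every hour, as described for the home page above, would show a non-zero count for each hour covered by the log.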

5. Pages crawled by the Baidu spider, in descending order of crawl count

cat access.log | grep "Baiduspider" | awk '{print $7}' | sort | uniq -c | sort -rn
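On a busy site the full list can run to thousands of lines, so it may help to keep only the most-crawled pages. This is a sketch assuming the default "combined" log format, where field 7 is the request path; the function name top_crawled_pages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show the N most-crawled paths for a spider.
# Arguments: log file, pattern (default Baiduspider), N (default 10).
top_crawled_pages() {
    log_file=$1
    pattern=${2:-Baiduspider}
    top_n=${3:-10}
    grep "$pattern" "$log_file" |
        awk '{ print $7 }' |           # request path
        sort | uniq -c |               # count identical paths
        sort -rn | head -n "$top_n"    # numeric, largest first
}

# Usage:
# top_crawled_pages access.log Baiduspider 20
```

The numeric flag on the final sort matters: without it the counts are compared as text rather than as numbers.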

Baiduspider in the commands in this article can be replaced with Googlebot to view Google's data. Given the particular situation in mainland China, Baidu's log deserves more attention.
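To compare several spiders without retyping the command, the spider name can be passed in as a parameter. The sketch below is one way to do that; the function name count_spider_hits is hypothetical, and the names in the usage line are the user-agent substrings mentioned in this article.

```shell
#!/bin/sh
# Hypothetical helper: print "name count" for each spider name given.
# Arguments: log file, then one or more user-agent substrings.
count_spider_hits() {
    log_file=$1
    shift
    for spider in "$@"; do
        # grep -c prints the number of matching lines (0 if none).
        printf '%s %s\n' "$spider" "$(grep -c "$spider" "$log_file")"
    done
}

# Usage:
# count_spider_hits access.log Baiduspider Googlebot Mediapartners
```

Each name triggers a separate pass over the log, which is simple but means the cost grows with the number of spiders checked.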

Appendix: detailed crawl records of the Google AdSense spider (Mediapartners-Google)

cat access.log | grep Mediapartners

What is Mediapartners-Google? Google AdSense ads can match the page content because, each time a page containing AdSense ads is accessed, the Mediapartners-Google spider soon visits that page; refresh after a few minutes and relevant ads are displayed. It's really powerful!

PS: how to enable the nginx website log under Linux to view spiders and crawlers

The default log path is specified when you install nginx.

If you used an installation package such as LNMP, you can locate nginx from the shell:

whereis nginx

After finding the corresponding path, look at the configuration file in the conf folder under the nginx directory. If logging is enabled, the path of the log file is given in the configuration file.
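The lookup in the configuration file can be done with grep on the access_log directive, which is the nginx directive that sets the log path. This is a minimal sketch: the function name find_access_log_directives is hypothetical, and the conf directory must be supplied from whatever path you found for your own install.

```shell
#!/bin/sh
# Hypothetical helper: list active access_log directives under a conf
# directory, prefixed with file name and line number.
find_access_log_directives() {
    conf_dir=$1
    # The leading-whitespace anchor skips lines commented out with "#".
    grep -rnE '^[[:space:]]*access_log[[:space:]]' "$conf_dir"
}

# Usage (replace the argument with your own nginx conf directory):
# find_access_log_directives /usr/local/nginx/conf
```

If the command prints nothing, no access_log directive is set in those files; a directive whose value is off means logging is disabled for that context.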