Using a Chrome extension to capture full page content and intercept images

Time: 2020-10-30

Ordinary crawler: send a request from code, read the page content from the response stream, then parse it for the information you need. This approach is simple and fast, but it is easily blocked and has a high failure rate.
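For reference, a minimal sketch of such an ordinary crawler in Java 11+ might look like the following; the target URL is a placeholder and the title extraction is deliberately crude:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleCrawler {

    public static void main(String[] args) throws Exception {
        // Plain HTTP client from the JDK (Java 11+)
        HttpClient client = HttpClient.newHttpClient();

        // example.com is a placeholder target
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com"))
                .header("User-Agent", "Mozilla/5.0") // many sites reject requests without a browser-like UA
                .GET()
                .build();

        // Read the whole response body as a string
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        String html = response.body();

        // Crude parsing: pull out the <title> tag as an example
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            System.out.println("Title: " + html.substring(start + 7, end));
        }
    }
}

This is exactly the kind of request that anti-crawling measures can detect and block, which motivates the extension-based approach below.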
Another way is to use a Chrome extension to capture the full content of the page, and then parse out the information you want.
Extension name: chromeCrawl
Installation address: click me
If you cannot reach the Chrome Web Store, you can follow the manual installation tutorial on GitHub: click me

Basic usage of the extension
After installation, the extension icon appears in the upper right corner of the browser; it exposes three options.
[Screenshot: the extension's popup with three check boxes]

Explanation of the three check boxes:

Enable page crawling: when checked, the page content is sent to the backend interface, and the receive-data interface field appears
Auto close page: when checked, the page is closed automatically after it has been crawled
Do not load multimedia resources: when checked, images, videos, fonts, and similar resources are not loaded, which speeds up page loading
Remarks:
Receive data interface: the endpoint that receives the page data. You define it yourself; the default is http://localhost:8080/content. It works together with the enable page crawling option

Once crawling is enabled and we want to receive the page content, the settings look like this:
[Screenshot: the receive-data interface setting]
If the backend is Java, it can receive the data like this:

package com.molikam.shop.controller;

import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CrawlerController {

    // Counts how many pages have been received so far
    AtomicInteger count = new AtomicInteger(0);

    // The extension POSTs the page data to /content; Spring binds the
    // "content" form field to the String parameter of the same name
    @RequestMapping(value = "/content", method = {RequestMethod.POST})
    public void getContent(String content) {

        System.out.println(count.incrementAndGet());
        System.out.println(content);

    }
}
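If you want to keep the captured pages rather than just print them, here is a minimal sketch that writes each received page to a numbered file. The class name and the /save path are made up for this example (you would point the receive-data interface at http://localhost:8080/save), and it assumes the same content form field as above:

package com.molikam.shop.controller;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CrawlerSaveController {

    // Counter used to give each captured page its own file name
    AtomicInteger count = new AtomicInteger(0);

    // Hypothetical endpoint; point the extension's receive-data interface here
    @RequestMapping(value = "/save", method = {RequestMethod.POST})
    public void saveContent(String content) throws IOException {
        int n = count.incrementAndGet();
        // Write the raw HTML to pages/page-N.html
        Path file = Paths.get("pages", "page-" + n + ".html");
        Files.createDirectories(file.getParent());
        Files.writeString(file, content == null ? "" : content, StandardCharsets.UTF_8);
        System.out.println("Saved " + file);
    }
}

Compared with printing to the console, writing each page to its own file makes it easy to re-parse the captured HTML later without crawling the site again.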

If you want to read the extension's source code or extend it with your own requirements, download it from GitHub: click me

Edit the two files background.js and content_script.js to add whatever you need; tutorials on writing Chrome extensions are easy to find online.
