Introduction to the use of the distributed web crawler Greenfinger (Part I)

Time: 2021-09-19

Greenfinger is a high-performance, extension-oriented distributed web crawler framework written in Java, built on the Spring Boot framework. Through a few configuration parameters, you can easily build a distributed web crawler microservice and form a cluster. In addition, the Greenfinger framework provides a rich set of APIs for customizing your own application system.

Framework features


  1. Perfect compatibility with Spring Boot 2.2.0 (or later)
  2. Supports both general-purpose and vertical crawlers
  3. Uses a depth-first crawling strategy
  4. Designed as a multi-process, highly available crawler architecture that supports dynamic horizontal scaling and load balancing
  5. Multiple built-in load balancing algorithms, with support for custom algorithms
  6. Supports both full and incremental indexing
  7. Supports scheduled tasks for updating indexes
  8. Supports a variety of mainstream HTTP client parsing technologies
  9. Supports deduplication of 100 million URLs
  10. Multiple built-in conditional interrupt policies, with support for custom policies
  11. Multi-version index query mechanism
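Large-scale URL deduplication of the kind mentioned in item 9 is commonly implemented with a Bloom filter, which uses constant memory per URL and never yields false negatives. The following is a minimal self-contained sketch of the general technique in Java; it is an illustration only, not Greenfinger's actual implementation:

```java
import java.util.BitSet;

// Minimal Bloom filter for URL deduplication: k bit positions are set per
// URL; a URL is "probably seen" only if all k positions are set.
public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th position from two base hashes (double hashing)
    private int position(String url, int i) {
        int h1 = url.hashCode();
        int h2 = h1 >>> 16 | h1 << 16;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String url) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(url, i));
        }
    }

    // True means "probably seen"; false means "definitely not seen"
    public boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(url, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Sizing the bit array and hash count controls the false-positive rate; for hundreds of millions of URLs, a persistent store such as Redis bitmaps is typically used instead of an in-process BitSet.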

Compatibility


  1. JDK 8 (or later)
  2. Spring Boot 2.2.x (or later)
  3. Redis 3.x (or later)
  4. PostgreSQL 9.x (or later)
  5. ElasticSearch 6.x (or later)
    Notes:

    • Redis is used to store cluster information
    • PostgreSQL is used to store the crawled URL information
    • Elasticsearch is used to create indexes and provide retrieval functions

How to install


  • Git address:
    https://github.com/paganini20…
  • Directory structure:

    ├── greenfinger
    |  ├── greenfinger-console
    |  |  ├── pom.xml
    |  |  └── src
    |  ├── greenfinger-spring-boot-starter
    |  |  ├── pom.xml
    |  |  └── src
    |  ├── LICENSE
    |  ├── pom.xml
    |  └── README.md
  • Software description:

    • greenfinger-console
      The web version of Greenfinger: a standalone Spring Boot application with its own management interface, which can add, modify, start, and stop crawler tasks, and provides a search interface for real-time queries
    • greenfinger-spring-boot-starter
      The Greenfinger core jar, which implements all of the framework features above and provides REST APIs for crawler management, search, and so on; include this jar package to customize your own system

To install the greenfinger console:


Step 1: Enter the greenfinger-console directory
Step 2: Execute the command: mvn clean install
Step 3: After a successful build, an additional directory named run appears. Move this directory to your working directory (a directory of your choice)
Step 4: Run the jar: java -jar greenfinger-console-1.0-RC2.jar --spring.config.location=config/ (command is for reference only)

  • Generated run directory structure:

    ├── config
    |  ├── application-dev.properties
    |  └── application.properties
    ├── db
    |  └── crawler.sql
    ├── greenfinger-console-1.0-RC2.jar
    ├── lib
    |  ├── aggs-matrix-stats-client-6.8.6.jar
    |  ├── aspectjweaver-1.9.5.jar
    |  ├── chaconne-spring-boot-starter-1.0-RC2.jar
    |  ├── checker-compat-qual-2.5.5.jar
    |  ├── classmate-1.5.1.jar
    |  ├── commons-codec-1.13.jar
    |  ├── commons-io-2.6.jar
    |  ├── ...
    └── logs
     └── atlantis
  • Reference configuration:
    The greenfinger-console interface uses FreeMarker. There are currently two configuration files: application.properties and application-dev.properties

The following is the default configuration of the greenfinger-console (which can be extended according to your actual situation).
The application.properties file mainly stores the following global configuration:

spring.application.name=greenfinger-console
spring.application.cluster.name=greenfinger-console-cluster

#Freemarker Configuration
spring.freemarker.enabled=true
spring.freemarker.suffix=.ftl
spring.freemarker.cache=false
spring.freemarker.charset=UTF-8
spring.freemarker.template-loader-path=classpath:/META-INF/templates/
spring.freemarker.expose-request-attributes=true
spring.freemarker.expose-session-attributes=true
spring.freemarker.setting.number_format=#
spring.freemarker.setting.locale=en_US
spring.freemarker.setting.url_escaping_charset=UTF-8

server.port=21212
server.servlet.context-path=/atlantis/greenfinger

spring.profiles.active=dev

application-dev.properties configuration:

#Jdbc Configuration
spring.datasource.jdbcUrl=jdbc:postgresql://localhost:5432/db_webanchor
spring.datasource.username=fengy
spring.datasource.password=123456
spring.datasource.driverClassName=org.postgresql.Driver

#Redis Configuration
spring.redis.host=localhost
spring.redis.port=6379
spring.redis.password=123456
spring.redis.messager.pubsub.channel=greenfinger-console-messager-pubsub

#Vortex Configuration
atlantis.framework.vortex.bufferzone.collectionName=MyGarden
atlantis.framework.vortex.bufferzone.pullSize=100

#Elasticsearch Configuration
spring.data.elasticsearch.cluster-name=es
spring.data.elasticsearch.cluster-nodes=localhost:9300
spring.data.elasticsearch.repositories.enabled=true
spring.data.elasticsearch.properties.transport.tcp.connect_timeout=60s

#Chaconne Configuration
#atlantis.framework.chaconne.producer.location=http://localhost:6543
#atlantis.framework.chaconne.mail.host=smtp.your_company.com
#[email protected]_company.com
#atlantis.framework.chaconne.mail.password=0123456789
#atlantis.framework.chaconne.mail.default-encoding=UTF-8

#webcrawler.pagesource.selenium.webdriverExecutionPath=D:\\software\\chromedriver_win32\\chromedriver.exe

#logging.level.indi.atlantis.framework.greenfinger=INFO

Notes:
application-dev.properties configures the external resources that Greenfinger depends on. By default, Greenfinger stores the crawled link information in PostgreSQL; of course, you can also store it elsewhere (such as a NoSQL database or files). As mentioned earlier, Greenfinger is an extension-oriented web crawler that provides rich APIs for this purpose, which I will explain in detail in a later article on Greenfinger's implementation principles.
The address information in the above configuration should be modified to match your own environment.
Be careful: under JDK 8, starting the greenfinger-console may report an error indicating that the JDK version is too low, so you may need a JDK 11 environment. I can run it successfully under JDK 11; other versions have not been tried yet.

How to customize your crawler application?


Step 1: Add the Maven dependency:

<dependency>
    <groupId>com.github.paganini2008.atlantis</groupId>
    <artifactId>greenfinger-spring-boot-starter</artifactId>
    <version>1.0-RC2</version>
</dependency>

Step 2: Reference code:

@EnableGreenFingerServer
@SpringBootApplication
public class GreenFingerServerConsoleMain {

    public static void main(String[] args) {
        SpringApplication.run(GreenFingerServerConsoleMain.class, args);
    }
}

Step 3: Reference configuration:

spring.application.name=cool-crawler
spring.application.cluster.name=cool-crawler-cluster

#Jdbc Configuration
spring.datasource.jdbcUrl=jdbc:postgresql://localhost:5432/db_webanchor
spring.datasource.username=fengy
spring.datasource.password=123456
spring.datasource.driverClassName=org.postgresql.Driver

#Redis Configuration
spring.redis.host=localhost
spring.redis.port=6379
spring.redis.password=123456
spring.redis.messager.pubsub.channel=greenfinger-console-messager-pubsub

#Vortex Configuration
atlantis.framework.vortex.bufferzone.collectionName=MyGarden
atlantis.framework.vortex.bufferzone.pullSize=100

#Elasticsearch Configuration
spring.data.elasticsearch.cluster-name=es
spring.data.elasticsearch.cluster-nodes=localhost:9300
spring.data.elasticsearch.repositories.enabled=true
spring.data.elasticsearch.properties.transport.tcp.connect_timeout=60s

#Chaconne Configuration
#atlantis.framework.chaconne.producer.location=http://localhost:6543
#atlantis.framework.chaconne.mail.host=smtp.your_company.com
#[email protected]_company.com
#atlantis.framework.chaconne.mail.password=0123456789
#atlantis.framework.chaconne.mail.default-encoding=UTF-8

#webcrawler.pagesource.selenium.webdriverExecutionPath=D:\\software\\chromedriver_win32\\chromedriver.exe

#logging.level.indi.atlantis.framework.greenfinger=INFO

You can modify the above configuration according to your own environment.

Greenfinger console introduction:


  • First, the concepts of Catalog and Resource:
    In the Greenfinger framework, each target website (the website to be crawled) is called a Catalog, and each URL crawled from it represents a Resource
  • At present, Greenfinger's web interface is still being improved, so it looks relatively simple
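The Catalog/Resource relationship can be pictured as a simple one-to-many data model. The sketch below is purely illustrative; the class and field names are assumptions, not Greenfinger's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative one-to-many model: one Catalog (a target website) owns
// many Resources (URLs crawled from it). Not Greenfinger's actual classes.
public class Catalog {
    final String name;          // catalog name
    final String url;           // initial address
    int version = 1;            // increased on each rebuild
    final List<Resource> resources = new ArrayList<>();

    Catalog(String name, String url) {
        this.name = name;
        this.url = url;
    }

    // Every URL crawled from the catalog becomes one Resource,
    // tagged with the catalog version it was crawled under
    void addResource(String resourceUrl, String title) {
        resources.add(new Resource(resourceUrl, title, version));
    }

    static class Resource {
        final String url;
        final String title;
        final int version;

        Resource(String url, String title, int version) {
            this.url = url;
            this.title = title;
            this.version = version;
        }
    }
}
```

Tagging each resource with a version is one way to realize the multi-version index query mechanism listed in the framework features: queries can be restricted to the latest version while older crawls remain available.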

Enter the home page address: http://localhost:21212/atlant…

  • View the catalog list
    Operating instructions:

    • [Edit] edit the catalog
    • [Delete] delete the catalog (including the resources and indexes under it)
    • [Clean] clean the catalog (the resources and indexes under it are removed, but the catalog itself remains and its version number is reset to 0)
    • [Rebuild] rebuild the catalog (that is, start a crawler, crawl the catalog again, build the index, and increase the version number)
    • [Update] update the catalog (that is, start a crawler, continue crawling from the latest crawled address, and build the index, with the version number unchanged)
      While the crawler is running, you can also:
    • [Stop] stop the crawler
    • [Realtime] monitor the crawler's runtime statistics, etc.
  • Create or save a catalog:
    Notes:
  • Name: catalog name
  • Cat: category name
  • URL: initial address
  • Page Encoding: page encoding
  • Path Pattern: URL matching patterns (multiple patterns separated by commas)
  • Excluded Path Pattern: excluded URL matching patterns (multiple patterns separated by commas)
  • Max Fetch Size: maximum number of crawled links (100,000 by default)
  • Duration: the crawler's running time as a millisecond value (20 minutes by default); after this time, the crawler automatically ends its crawling work
  • Monitor crawler operation
  • While the crawler is crawling, you can also search by keyword in real time
  • If no keyword is entered, all results are returned
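The catalog lifecycle described above — the version changes under [Clean]/[Rebuild]/[Update], plus the Max Fetch Size and Duration budgets that end a crawl — can be summarized in a small sketch. All names here are hypothetical illustrations, not Greenfinger's actual API:

```java
// Illustrative model of the catalog lifecycle described above.
// Class and method names are hypothetical, not Greenfinger's API.
public class CatalogLifecycleSketch {
    private int version = 1;
    private int fetched;
    private final int maxFetchSize;      // "Max Fetch Size" field
    private final long durationMillis;   // "Duration" field (milliseconds)
    private long startTime;

    public CatalogLifecycleSketch(int maxFetchSize, long durationMillis) {
        this.maxFetchSize = maxFetchSize;
        this.durationMillis = durationMillis;
    }

    public int version() { return version; }

    // [Clean]: resources and indexes removed, version reset to 0
    public void clean() { version = 0; fetched = 0; }

    // [Rebuild]: crawl the catalog again from scratch, version increases
    public void rebuild() { version++; startCrawl(); }

    // [Update]: continue from the latest crawled address, version unchanged
    public void update() { startCrawl(); }

    private void startCrawl() {
        startTime = System.currentTimeMillis();
        fetched = 0;
    }

    public void onUrlFetched() { fetched++; }

    // The crawler stops when either the URL budget or the time budget runs out
    public boolean shouldStop() {
        return fetched >= maxFetchSize
            || System.currentTimeMillis() - startTime >= durationMillis;
    }
}
```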

Finally, due to its high complexity, the Greenfinger framework makes comprehensive use of the core features of three other frameworks: the microservice distributed collaboration framework tridenter, the distributed streaming processing framework vortex, and the distributed task scheduling framework chaconne. Limited by space, this article has mainly described how to operate the greenfinger-console interface to create a crawler task, run the crawler, and finally search the crawled content by keyword. A later article will focus on the implementation principles of Greenfinger.