Go one library a day, colly

Time: 2021-10-26

Brief introduction

colly is a powerful crawler framework written in Go. It provides a concise API, offers strong performance, handles cookies and sessions automatically, and has a flexible extension mechanism.

First, we introduce the basic concepts of colly. Then we demonstrate its usage and features through several examples: pulling GitHub Trending, pulling the Baidu novel hot list, and downloading pictures from the unsplash website.

Quick use

The code in this article uses go modules.

Create directory and initialize:

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install the colly library:

$ go get -u github.com/gocolly/colly/v2

Use it:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

colly is easy to use:

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Since every page contains many links to other pages, crawling might never stop without restrictions, so the option colly.AllowedDomains("www.baidu.com") is passed in to restrict crawling to pages under the domain www.baidu.com.

Then we call the c.OnHTML method to register an HTML callback that is executed for every a element that has an href attribute. Here we continue to visit the URL pointed to by href. In other words, the crawled page is parsed, and the links to other pages found in it are visited in turn.

Call c.OnRequest() to register a request callback that is executed every time a request is sent. Here it simply prints the request URL.

Call c.OnResponse() to register a response callback that is executed every time a response is received. Here it simply prints the URL and the response size.

Call c.OnError() to register an error callback that is executed when a request fails. Here it simply prints the URL and the error.

Finally, we call c.Visit() to start visiting the first page.

Run it:

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "Baidu homepage" ->/
Link found: "Settings" - > javascript:;
Link found: "login" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "news" - > http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "map" - > http://map.baidu.com
Link found: "live broadcast" -> https://live.baidu.com/
Link found: "video" -> https://haokan.baidu.com/?sfrom=baidu -top
Link found: "Post Bar" -> http://tieba.baidu.com
...

After colly crawls a page, it parses the page with goquery. It then looks up the element selectors registered by the HTML callbacks, wraps each matching goquery.Selection into a colly.HTMLElement, and executes the callback.

colly.HTMLElement is in fact a thin wrapper around goquery.Selection:

type HTMLElement struct {
  Name string
  Text string
  Request *Request
  Response *Response
  DOM *goquery.Selection
  Index int
}

It also provides some simple, easy-to-use methods:

  • Attr(k string): returns the attribute k of the current element. In the example above, we used e.Attr("href") to get the href attribute;
  • ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element selected by goquerySelector;
  • ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements selected by goquerySelector, as a []string;
  • ChildText(goquerySelector string): concatenates and returns the text content of the child elements selected by goquerySelector;
  • ChildTexts(goquerySelector string): returns the text content of the child elements selected by goquerySelector, as a []string;
  • ForEach(goquerySelector string, callback func(int, *HTMLElement)): executes callback for each child element selected by goquerySelector;
  • Unmarshal(v interface{}): unmarshals an HTMLElement into a struct instance by specifying tags in goquerySelector format on the struct fields; a usage sketch follows below.
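
As a quick illustration of a few of these methods, here is a minimal sketch. The selectors used (ul.books, span.title, span.author, span.tag) refer to a made-up page and are only there to show the calls:

c.OnHTML("ul.books > li", func(e *colly.HTMLElement) {
  // ChildAttr: the href of the first link inside the list item
  link := e.Request.AbsoluteURL(e.ChildAttr("a", "href"))
  // ChildText: the concatenated text of the title element
  title := e.ChildText("span.title")
  // ChildTexts: the texts of all author elements, as a []string
  authors := e.ChildTexts("span.author")
  fmt.Println(title, authors, link)

  // ForEach: run a callback for every matched child element
  e.ForEach("span.tag", func(_ int, tag *colly.HTMLElement) {
    fmt.Println("tag:", tag.Text)
  })
})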

These methods are used frequently. Below we introduce the features and usage of colly through some examples.

GitHub Trending

I previously wrote an API for pulling GitHub Trending; using colly is more convenient:

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )


  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func (e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 > a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    parts = strings.Split(addStr, " ")
    repo.Add, _ = strconv.Atoi(parts[0])

    // built by
    e.ForEach("div.mt-2 > span.mr-3 img[src]", func(index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")
  
  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
      fmt.Println("Author:", repo.Author)
      fmt.Println("Name:", repo.Name)
      break
  }
}

We use ChildText to get the author, repository name, language, number of stars and forks, stars added today, and other information, and ChildAttr to get the repository link. The link is a relative path, so we convert it to an absolute path by calling e.Request.AbsoluteURL().

Run it:

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn

Baidu novel hot list

The structure of the page is as follows:

(screenshot of the page structure omitted)

The structure of each part is as follows:

  • Each hot-list entry is inside a div.category-wrap_iQLoo;
  • div.index_1Ew5p under the a element is the rank;
  • The content is inside div.content_1YWBm;
  • a.title_dIF3B inside the content is the title;
  • There are two div.intro_1l0wp elements inside the content; the first is the author and the second is the type;
  • div.desc_3CTjT inside the content is the description.

From this we define the struct:

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

CSS selector syntax is added in the struct tags so that HTMLElement.Unmarshal() can be called directly to populate a Hot object.

Then create the Collector object:

c := colly.NewCollector()

Register the callbacks:

c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {
  hot := &Hot{}

  err := e.Unmarshal(hot)
  if err != nil {
    fmt.Println("error:", err)
    return
  }

  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:", len(r.Body))
})

OnHTML executes Unmarshal for each entry to generate a Hot object.

OnRequest/OnResponse simply output debugging information.

Then call c.Visit() to visit the website:

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally, add some debug prints:

fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}

Run output:

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: evil god against heaven
Author: Mars gravity
Type: Fantasy
Desc: hold the Pearl of heaven's poison, inherit the blood of evil gods and repair the power against heaven. A generation of evil gods will reign in the world! See more >

Unsplash

The background pictures for my WeChat official account articles are mostly taken from the unsplash website, which provides a large number of rich, free images. One problem with the site is that it is relatively slow to access. Since we are learning crawlers, let's just use a program to download the pictures automatically.

The unsplash homepage is shown in the figure below:

(screenshot of the unsplash homepage omitted)

The structure of the page is as follows:

(screenshot of the page structure omitted)

However, the home page shows only the smaller thumbnails. Click the link of a picture:

(screenshot of the picture page omitted)

The structure of the page is as follows:

(screenshot of the page structure omitted)

Because a three-tier page structure is involved (the img URL still needs to be visited at the end), using a single colly.Collector object would require very careful OnHTML callback settings and impose a heavy mental burden while coding. colly supports multiple Collectors, so we write it this way:

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
}

We use three Collector objects: the first Collector collects the picture links on the home page, the second Collector visits those picture links, and the third Collector downloads the actual images. Above, we also register request and error callbacks on the first Collector.

After the third Collector downloads the actual image content, it saves it locally:

func main() {
  //... omitted
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })
}

Above, atomic.AddUint32() is used to generate sequence numbers for the images.

Run the program and crawl the results:

(screenshot of the downloaded images omitted)

Asynchronous

By default, colly crawls pages synchronously, one after another, as in the unsplash program above, and that takes a long time. colly provides asynchronous crawling: we only need to pass the option colly.Async(true) when constructing the Collector object to turn it on:

c1 := colly.NewCollector(
  colly.Async(true),
)

However, because the crawl is asynchronous, the program must wait for the Collectors to finish at the end; otherwise main returns early and the program exits:

c1.Wait()
c2.Wait()
c3.Wait()
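
For clarity, here is a minimal sketch of how main might be laid out with asynchronous crawling enabled. The callback registrations are elided to the earlier snippets, and the home-page URL is an assumption made for the example; note that Clone() copies a collector's configuration (including Async) but not its callbacks:

func main() {
  c1 := colly.NewCollector(
    colly.Async(true), // turn on asynchronous crawling
  )
  c2 := c1.Clone() // configuration (including Async) is copied, callbacks are not
  c3 := c1.Clone()

  // ... register the OnHTML/OnResponse/OnRequest/OnError callbacks shown above ...

  c1.Visit("https://unsplash.com") // assumed entry point: the unsplash home page

  // wait for all asynchronous requests of every collector to finish,
  // otherwise main would return and the program would exit too early
  c1.Wait()
  c2.Wait()
  c3.Wait()
}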

Run again, much faster.

Second version

Scrolling down the unsplash page, we find that subsequent images are loaded asynchronously. Scroll the page and inspect the requests in the Network tab of Chrome:

(screenshot of the network requests omitted)

The request path is /photos, with the per_page and page parameters set, and it returns a JSON array. So there is another approach:

Define the structure of an item, keeping only the necessary fields:

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

Then parse the JSON in the OnResponse callback, and for each Download link call the Visit() method of the Collector responsible for downloading images:

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  json.Unmarshal(r.Body, &items)
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})
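
The collector d used above is not shown in this snippet. Here is a minimal sketch of one possible setup, mirroring the third collector from the first version; the names c and d and the images/ directory are assumptions for the example:

c := colly.NewCollector() // fetches the JSON list pages
d := c.Clone()            // downloads the actual image files

var count uint32
d.OnResponse(func(r *colly.Response) {
  // save the downloaded image bytes under a sequential file name
  fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
  if err := r.Save(fileName); err != nil {
    fmt.Printf("saving %s failed: %v\n", fileName, err)
  }
})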

To kick things off, we pull 3 pages with 12 items per page (consistent with the requests observed on the page):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}

Run it and check the downloaded pictures:

(screenshot of the downloaded images omitted)

Rate limiting

Sometimes there are too many concurrent requests and the website restricts access; this is where LimitRule comes in. Simply put, LimitRule limits the access rate and concurrency:

type LimitRule struct {
  DomainRegexp string
  DomainGlob   string
  Delay        time.Duration
  RandomDelay  time.Duration
  Parallelism  int
}

The commonly used fields are Delay/RandomDelay/Parallelism, which are the fixed delay between requests, the random delay, and the number of concurrent requests, respectively. In addition, you must specify which domains the rule applies to, via DomainRegexp or DomainGlob; if neither of these fields is set, the Limit() method returns an error. Used in the example above:

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

We specify that, for the domain unsplash.com, the random delay between requests is at most 500ms and at most 12 requests may run concurrently.

Set timeout

Sometimes the network is slow. The http.Client used by colly has a default timeout mechanism, which we can override with WithTransport():

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})
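
If only the overall request timeout needs adjusting, the Collector also provides a SetRequestTimeout method that sets the timeout of the underlying http.Client. A one-line sketch:

// give up on any single request that takes longer than 30 seconds overall
c.SetRequestTimeout(30 * time.Second)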

Extensions

colly provides some extension features in its sub-package extensions. The most commonly used is the random User-Agent. Websites usually use the User-Agent header to identify whether a request comes from a browser, and crawlers generally set this header to disguise themselves as browsers. It is also easy to use:

import "github.com/gocolly/colly/v2/extensions"

func main() {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The implementation of the random User-Agent is also very simple: it randomly picks one from a predefined array of User-Agent generators and sets it in the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}

It is not difficult to implement our own extension. For example, if we need to set a specific header on every request, the extension can be written as follows:

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Call the MyHeader() function with the Collector object:

MyHeader(c)

Summary

colly is the most popular crawler framework in the Go world and supports a rich set of features. This article introduces some common features, supplemented with examples. Due to space limitations, some advanced features, such as queues and storage, are not covered. If you are interested in crawlers, you can dig deeper into them.

If you find a fun and easy-to-use Go library, you are welcome to submit an issue to the Go one library a day repository on GitHub.

References

  1. Go one library a day GitHub: https://github.com/darjun/go-daily-lib
  2. Go one library a day, goquery: https://darjun.github.io/2020/10/11/godailylib/goquery/
  3. Implement a GitHub Trending API with Go: https://darjun.github.io/2021/06/16/github-trending-api/
  4. colly GitHub: https://github.com/gocolly/colly

About me

My blog: https://darjun.github.io

Welcome to follow my WeChat official account, GoUpUp. Let's learn and make progress together.
