[go libp2p source code analysis] swarm dialing

Time:2022-1-15

1. Introduction

The libp2p swarm is a low-level interface to the libp2p network that gives finer control over all aspects of the system. A swarm can listen for inbound connections and dial other hosts to establish new ones (for example, a TCP connection to a host). Dialing here means establishing an outbound connection. Its implementation is fairly complex, so this article walks through it.

2. Code structure

Repository: https://github.com/libp2p/go-libp2p-swarm.git
The dialing code lives mainly in three files, swarm_dial.go, limiter.go, and dial_sync.go, which contain the following structures:
swarm_dial.go: DialBackoff, backoffAddr
DialBackoff limits how soon an address can be dialed again after a failed dial.
dial_sync.go: DialSync, activeDial
DialSync is a dial synchronization helper. It ensures that only one dial to a given peer is active at any time.
limiter.go: dialLimiter, dialJob, dialResult
dialLimiter limits the number of concurrent dials.

3. Sequence diagram

[Figure: sequence diagram of swarm dialing]

As the figure above shows, dialing is essentially a series of checks covering concurrency, synchronization, and retry, followed by a call to the transport to do the actual dial. Suppose there are 1000 peers and each peer has five different addresses. Dialing them one at a time would be inefficient, so multiple goroutines are started to dial concurrently. That concurrency cannot be left unbounded, though, so dialLimiter restricts the number of concurrent dials. If dialing an address fails, retrying immediately is likely to fail again and wastes resources, so the dialer waits for a period first. How long to wait is determined by an algorithm, which DialBackoff implements. Why is DialSync needed? When external code calls DialPeer, it may start several goroutines that dial the same peer concurrently. Since the callers cannot be constrained, the restriction has to be applied at the source of the dial (concurrent dialing across a peer's addresses is already implemented internally).
The diagram in swarm_dial.go shows how DialSync works:

 Diagram of dial sync:

   many callers of Dial()   synched w.  dials many addrs       results to callers
  ----------------------\    dialsync    use earliest          /--------------
  -----------------------\             |----------\           /----------------
  ------------------------>------------<------------>---------<-----------------
  -----------------------|              \----x                 \----------------
  ----------------------|                \-----x                \---------------
                                         any may fail          if no addr at end retry dialAttempt x
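
The idea in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in dial_sync.go: peers are plain strings, the "connection" is a string, and the names dialSync, activeDial.refs, waiting, and countDials are hypothetical. It only shows that concurrent callers for the same peer share one underlying dial; the real DialSync also handles per-caller cancellation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// activeDial is a hypothetical record of one in-flight dial to a peer.
type activeDial struct {
	done chan struct{} // closed when the dial finishes
	refs int           // number of callers sharing this dial
	conn string        // stand-in for a real connection
	err  error
}

// dialSync is a simplified stand-in for DialSync: at most one dial per peer
// runs at a time, and concurrent callers share its result.
type dialSync struct {
	mu     sync.Mutex
	dials  map[string]*activeDial
	dialFn func(peer string) (string, error)
}

// dial joins the active dial for peer, starting one if none exists.
func (ds *dialSync) dial(peer string) (string, error) {
	ds.mu.Lock()
	ad, ok := ds.dials[peer]
	if !ok {
		ad = &activeDial{done: make(chan struct{})}
		ds.dials[peer] = ad
		go func() {
			ad.conn, ad.err = ds.dialFn(peer)
			ds.mu.Lock()
			delete(ds.dials, peer) // later callers start a fresh dial
			ds.mu.Unlock()
			close(ad.done)
		}()
	}
	ad.refs++
	ds.mu.Unlock()

	<-ad.done
	return ad.conn, ad.err
}

// waiting reports how many callers have joined the active dial for peer.
func (ds *dialSync) waiting(peer string) int {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if ad, ok := ds.dials[peer]; ok {
		return ad.refs
	}
	return 0
}

// countDials issues n concurrent dials to one peer and reports how many
// times the underlying dial function actually ran.
func countDials(n int) int64 {
	var calls int64
	release := make(chan struct{})
	ds := &dialSync{dials: make(map[string]*activeDial)}
	ds.dialFn = func(peer string) (string, error) {
		atomic.AddInt64(&calls, 1)
		<-release // hold the dial open until every caller has joined
		return "conn-to-" + peer, nil
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ds.dial("peerA")
		}()
	}
	for ds.waiting("peerA") < n {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&calls)
}

func main() {
	fmt.Println("underlying dials:", countDials(10))
}
```

Holding the dial open until all callers have joined makes the demonstration deterministic; in practice the callers simply arrive while a dial happens to be in flight.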

3. Call entry

Swarm exposes a DialPeer method that applications can call to dial a peer directly. It is called from two places.

// DialPeer connects to a peer.
func (s *Swarm) DialPeer(ctx context.Context, p peer.ID) (network.Conn, error) {
    if s.gater != nil && !s.gater.InterceptPeerDial(p) {
        log.Debugf("gater disallowed outbound connection to peer %s", p.Pretty())
        return nil, &DialError{Peer: p, Cause: ErrGaterDisallowedConnection}
    }

    return s.dialPeer(ctx, p)
}

1. The Connect method of BasicHost calls DialPeer

func (h *BasicHost) Connect(ctx context.Context, pi peer.AddrInfo) error {
    // absorb addresses into peerstore
    h.Peerstore().AddAddrs(pi.ID, pi.Addrs, peerstore.TempAddrTTL)

    if h.Network().Connectedness(pi.ID) == network.Connected {
        return nil
    }

    resolved, err := h.resolveAddrs(ctx, h.Peerstore().PeerInfo(pi.ID))
    if err != nil {
        return err
    }
    h.Peerstore().AddAddrs(pi.ID, resolved, peerstore.TempAddrTTL)

    return h.dialPeer(ctx, pi.ID)
}

func (h *BasicHost) dialPeer(ctx context.Context, p peer.ID) error {
    log.Debugf("host %s dialing %s", h.ID(), p)
    c, err := h.Network().DialPeer(ctx, p)
    if err != nil {
        return err
    }
    select {
    case <-h.ids.IdentifyWait(c):
    case <-ctx.Done():
        return ctx.Err()
    }

    log.Debugf("host %s finished dialing %s", h.ID(), p)
    return nil
}

In turn, IpfsDHT calls the Connect method of BasicHost

func (dht *IpfsDHT) dialPeer(ctx context.Context, p peer.ID) error {
    // short-circuit if we're already connected.
    if dht.host.Network().Connectedness(p) == network.Connected {
        return nil
    }

    logger.Debug("not connected. dialing.")
    routing.PublishQueryEvent(ctx, &routing.QueryEvent{
        Type: routing.DialingPeer,
        ID:   p,
    })

    pi := peer.AddrInfo{ID: p}
    if err := dht.host.Connect(ctx, pi); err != nil {
        logger.Debugf("error connecting: %s", err)
        routing.PublishQueryEvent(ctx, &routing.QueryEvent{
            Type:  routing.QueryError,
            Extra: err.Error(),
            ID:    p,
        })

        return err
    }
    logger.Debugf("connected. dial success.")
    return nil
}

2. In addition, Swarm's NewStream also dials through dialPeer. If no connection has been established yet, it dials first

func (s *Swarm) NewStream(ctx context.Context, p peer.ID) (network.Stream, error) {
    log.Debugf("[%s] opening stream to peer [%s]", s.local, p)
    dials := 0
    for {
        c := s.bestConnToPeer(p)
        if c == nil {
            if nodial, _ := network.GetNoDial(ctx); nodial {
                return nil, network.ErrNoConn
            }

            if dials >= DialAttempts {
                return nil, errors.New("max dial attempts exceeded")
            }
            dials++

            var err error
            c, err = s.dialPeer(ctx, p)
            if err != nil {
                return nil, err
            }
        }
        s, err := c.NewStream()
        if err != nil {
            if c.conn.IsClosed() {
                continue
            }
            return nil, err
        }
        return s, nil
    }
}

4. Dialer initialization

type Swarm struct {
    // ...
    // dialing helpers
    dsync   *DialSync
    backf   DialBackoff
    limiter *dialLimiter
}

func NewSwarm(ctx context.Context, local peer.ID, peers peerstore.Peerstore, bwc metrics.Reporter, extra ...interface{}) *Swarm {
    s := &Swarm{
        local: local,
        peers: peers,
        bwc:   bwc,
    }
    .....
    s.dsync = NewDialSync(s.doDial)
    s.limiter = newDialLimiter(s.dialAddr, s.IsFdConsumingAddr)
    s.proc = goprocessctx.WithContext(ctx)
    s.ctx = goprocessctx.OnClosingContext(s.proc)
    s.backf.init(s.ctx)

    return s
}

type DialFunc func(context.Context, peer.ID) (*Conn, error)

// NewDialSync constructs a new DialSync
func NewDialSync(dfn DialFunc) *DialSync {
    return &DialSync{
        dials:    make(map[peer.ID]*activeDial),
        dialFunc: dfn,
    }
}

type dialfunc func(context.Context, peer.ID, ma.Multiaddr) (transport.CapableConn, error)
type isFdConsumingFnc func(ma.Multiaddr) bool

func newDialLimiter(df dialfunc, fdFnc isFdConsumingFnc) *dialLimiter {
    fd := ConcurrentFdDials
    if env := os.Getenv("LIBP2P_SWARM_FD_LIMIT"); env != "" {
        if n, err := strconv.ParseInt(env, 10, 32); err == nil {
            fd = int(n)
        }
    }
    return newDialLimiterWithParams(fdFnc, df, fd, DefaultPerPeerRateLimit)
}

func newDialLimiterWithParams(isFdConsumingFnc isFdConsumingFnc, df dialfunc, fdLimit, perPeerLimit int) *dialLimiter {
    return &dialLimiter{
        isFdConsumingFnc:   isFdConsumingFnc,
        fdLimit:            fdLimit,
        perPeerLimit:       perPeerLimit,
        waitingOnPeerLimit: make(map[peer.ID][]*dialJob),
        activePerPeer:      make(map[peer.ID]int),
        dialFunc:           df,
    }
}

func (db *DialBackoff) init(ctx context.Context) {
    if db.entries == nil {
        db.entries = make(map[peer.ID]map[string]*backoffAddr)
    }
    go db.background(ctx)
}

The DialSync, dialLimiter, and DialBackoff dialing helpers are initialized when NewSwarm creates the Swarm instance.
NewDialSync takes a dial function as a parameter (in practice, Swarm's doDial method).
newDialLimiter takes two functions: the dial function (in practice, Swarm's dialAddr method) and a function that reports whether an address's transport consumes a file descriptor (UNIX / TCP).
DialBackoff's init also starts a background goroutine that cleans up expired backoff entries.

5. Goroutines involved

1. For each peer, DialSync starts a goroutine to dial

func (ad *activeDial) start(ctx context.Context) {
    ad.conn, ad.err = ad.ds.dialFunc(ctx, ad.id)

    // This isn't the user's context so we should fix the error.
    switch ad.err {
    case context.Canceled:
        // The dial was canceled with `CancelDial`.
        ad.err = errDialCanceled
    case context.DeadlineExceeded:
        // We hit an internal timeout, not a context timeout.
        ad.err = ErrDialTimeout
    }
    close(ad.waitch)
    ad.cancel()
}

func (ds *DialSync) getActiveDial(p peer.ID) *activeDial {
    ds.dialsLk.Lock()
    defer ds.dialsLk.Unlock()

    actd, ok := ds.dials[p]
    if !ok {
        adctx, cancel := context.WithCancel(context.Background())
        actd = &activeDial{
            id:     p,
            cancel: cancel,
            waitch: make(chan struct{}),
            ds:     ds,
        }
        ds.dials[p] = actd

        go actd.start(adctx)
    }

    // increase ref count before dropping dialsLk
    actd.incref()

    return actd
}

2. For each address of each peer, dialLimiter starts a goroutine to dial

func (dl *dialLimiter) addCheckFdLimit(dj *dialJob) {
    if dl.shouldConsumeFd(dj.addr) {
        if dl.fdConsuming >= dl.fdLimit {
            log.Debugf("[limiter] blocked dial waiting on FD token; peer: %s; addr: %s; consuming: %d; "+
                "limit: %d; waiting: %d", dj.peer, dj.addr, dl.fdConsuming, dl.fdLimit, len(dl.waitingOnFd))
            dl.waitingOnFd = append(dl.waitingOnFd, dj)
            return
        }

        log.Debugf("[limiter] taking FD token: peer: %s; addr: %s; prev consuming: %d",
            dj.peer, dj.addr, dl.fdConsuming)
        // take token
        dl.fdConsuming++
    }

    log.Debugf("[limiter] executing dial; peer: %s; addr: %s; FD consuming: %d; waiting: %d",
        dj.peer, dj.addr, dl.fdConsuming, len(dl.waitingOnFd))
    go dl.executeDial(dj)
}

func (dl *dialLimiter) addCheckPeerLimit(dj *dialJob) {
    if dl.activePerPeer[dj.peer] >= dl.perPeerLimit {
        log.Debugf("[limiter] blocked dial waiting on peer limit; peer: %s; addr: %s; active: %d; "+
            "peer limit: %d; waiting: %d", dj.peer, dj.addr, dl.activePerPeer[dj.peer], dl.perPeerLimit,
            len(dl.waitingOnPeerLimit[dj.peer]))
        wlist := dl.waitingOnPeerLimit[dj.peer]
        dl.waitingOnPeerLimit[dj.peer] = append(wlist, dj)
        return
    }
    dl.activePerPeer[dj.peer]++

    dl.addCheckFdLimit(dj)
}

// executeDial calls the dialFunc, and reports the result through the response channel when finished. Once the response is sent it also releases all tokens it held during the dial.
func (dl *dialLimiter) executeDial(j *dialJob) {
    defer dl.finishedDial(j)
    if j.cancelled() {
        return
    }

    dctx, cancel := context.WithTimeout(j.ctx, j.dialTimeout())
    defer cancel()

    con, err := dl.dialFunc(dctx, j.peer, j.addr)
    select {
    case j.resp <- dialResult{Conn: con, Addr: j.addr, Err: err}:
    case <-j.ctx.Done():
        if err == nil {
            con.Close()
        }
    }
}
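
The token idea behind the limiter can be illustrated with a much smaller sketch. Note the substitution: limiter.go tracks counters and explicit waiting lists (waitingOnFd, waitingOnPeerLimit), while the sketch below uses a buffered channel as a counting semaphore. The function runLimited and its peak-concurrency bookkeeping are hypothetical and exist only to make the bound observable.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited runs every job in its own goroutine but lets at most limit of
// them execute at once. It returns the highest concurrency it observed.
func runLimited(jobs []func(), limit int) int64 {
	tokens := make(chan struct{}, limit) // one slot per allowed concurrent dial
	var running, peak int64
	var wg sync.WaitGroup

	for _, job := range jobs {
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			tokens <- struct{}{}        // take a token, blocking at the limit
			defer func() { <-tokens }() // release it when the job finishes

			// Record the peak number of jobs running at the same time.
			n := atomic.AddInt64(&running, 1)
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			job()
			atomic.AddInt64(&running, -1)
		}(job)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	jobs := make([]func(), 20)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(5 * time.Millisecond) } // simulated dial
	}
	fmt.Println("peak concurrency:", runLimited(jobs, 4))
}
```

The explicit waiting lists in the real limiter make it possible to drop all queued jobs for a peer at once (clearAllPeerDials), which a plain semaphore does not offer.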

3. Backoff cleanup

func (db *DialBackoff) init(ctx context.Context) {
    if db.entries == nil {
        db.entries = make(map[peer.ID]map[string]*backoffAddr)
    }
    go db.background(ctx)
}

func (db *DialBackoff) background(ctx context.Context) {
    ticker := time.NewTicker(BackoffMax)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            db.cleanup()
        }
    }
}

func (db *DialBackoff) cleanup() {
    db.lock.Lock()
    defer db.lock.Unlock()
    now := time.Now()
    for p, e := range db.entries {
        good := false
        for _, backoff := range e {
            backoffTime := BackoffBase + BackoffCoef*time.Duration(backoff.tries*backoff.tries)
            if backoffTime > BackoffMax {
                backoffTime = BackoffMax
            }
            if now.Before(backoff.until.Add(backoffTime)) {
                good = true
                break
            }
        }
        if !good {
            delete(db.entries, p)
        }
    }
}
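
To make the cleanup rule concrete, here is a small sketch with fixed timestamps. It assumes the default constants shown later in this article (5s base, 1s coefficient, 5m maximum), uses string keys instead of peer.ID, and passes the current time in as a parameter; the names expired and demoCleanup are hypothetical. A peer's entry is dropped only when every one of its addresses has passed until plus its backoff duration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
	backoffMax  = 5 * time.Minute
)

// backoffAddr mirrors the two fields used by the cleanup logic.
type backoffAddr struct {
	tries int
	until time.Time
}

// expired reports whether an address entry can be dropped at time now.
func expired(b backoffAddr, now time.Time) bool {
	d := backoffBase + backoffCoef*time.Duration(b.tries*b.tries)
	if d > backoffMax {
		d = backoffMax
	}
	return !now.Before(b.until.Add(d))
}

// cleanup removes peers whose address entries have all expired.
func cleanup(entries map[string]map[string]backoffAddr, now time.Time) {
	for p, addrs := range entries {
		keep := false
		for _, b := range addrs {
			if !expired(b, now) {
				keep = true
				break
			}
		}
		if !keep {
			delete(entries, p)
		}
	}
}

// demoCleanup builds two peers and returns the entries left after cleanup.
func demoCleanup() map[string]map[string]backoffAddr {
	now := time.Date(2022, 1, 15, 12, 0, 0, 0, time.UTC)
	entries := map[string]map[string]backoffAddr{
		// Only address: until+6s lies 4s in the past, so the peer is removed.
		"peerA": {"addr1": {tries: 1, until: now.Add(-10 * time.Second)}},
		// addr2: until+9s lies 8s in the future, so the whole peer is kept.
		"peerB": {
			"addr1": {tries: 1, until: now.Add(-10 * time.Second)},
			"addr2": {tries: 2, until: now.Add(-1 * time.Second)},
		},
	}
	cleanup(entries, now)
	return entries
}

func main() {
	for p := range demoCleanup() {
		fmt.Println("still tracked:", p)
	}
}
```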

6. Some important rules and algorithms

1. Filtering dial addresses

// filterKnownUndialables takes a list of multiaddrs, and removes those that we definitely don't want to dial: addresses configured to be blocked, IPv6 link-local addresses, addresses without a dial-capable transport, and addresses that we know to be our own. This is an optimization to avoid wasting time on dials that we know are going to fail.
func (s *Swarm) filterKnownUndialables(p peer.ID, addrs []ma.Multiaddr) []ma.Multiaddr {
    lisAddrs, _ := s.InterfaceListenAddresses()
    var ourAddrs []ma.Multiaddr
    for _, addr := range lisAddrs {
        protos := addr.Protocols()
        // we're only sure about filtering out /ip4 and /ip6 addresses, so far
        if len(protos) == 2 && (protos[0].Code == ma.P_IP4 || protos[0].Code == ma.P_IP6) {
            ourAddrs = append(ourAddrs, addr)
        }
    }

    return addrutil.FilterAddrs(addrs,
        addrutil.SubtractFilter(ourAddrs...),
        s.canDial,
        // TODO: Consider allowing link-local addresses
        addrutil.AddrOverNonLocalIP,
        func(addr ma.Multiaddr) bool {
            return s.gater == nil || s.gater.InterceptAddrDial(p, addr)
        },
    )
}

// FilterAddrs is a filter that removes certain addresses, according to the given filters.
// If all filters return true, the address is kept.
func FilterAddrs(a []ma.Multiaddr, filters ...func(ma.Multiaddr) bool) []ma.Multiaddr {
    b := make([]ma.Multiaddr, 0, len(a))
    for _, addr := range a {
        good := true
        for _, filter := range filters {
            good = good && filter(addr)
        }
        if good {
            b = append(b, addr)
        }
    }
    return b
}

// AddrOverNonLocalIP returns whether the addr uses a non-local ip link
func AddrOverNonLocalIP(a ma.Multiaddr) bool {
    split := ma.Split(a)
    if len(split) < 1 {
        return false
    }
    if manet.IsIP6LinkLocal(split[0]) {
        return false
    }
    return true
}
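
The filter chain above composes simple predicates. The sketch below shows the same pattern on plain strings so that it runs without the multiaddr packages. The predicates notOneOf, hasKnownTransport, and notIP6LinkLocal are hypothetical simplifications: in particular, the link-local check only matches the common fe80: prefix, whereas manet.IsIP6LinkLocal works on parsed addresses.

```go
package main

import (
	"fmt"
	"strings"
)

// filterAddrs keeps only the addresses accepted by every filter.
func filterAddrs(addrs []string, filters ...func(string) bool) []string {
	kept := make([]string, 0, len(addrs))
	for _, a := range addrs {
		ok := true
		for _, f := range filters {
			if !f(a) {
				ok = false
				break
			}
		}
		if ok {
			kept = append(kept, a)
		}
	}
	return kept
}

// notOneOf rejects addresses that appear in own (our listen addresses).
func notOneOf(own ...string) func(string) bool {
	set := make(map[string]bool, len(own))
	for _, o := range own {
		set[o] = true
	}
	return func(a string) bool { return !set[a] }
}

// hasKnownTransport is a stand-in for canDial: only TCP and UDP count here.
func hasKnownTransport(a string) bool {
	return strings.Contains(a, "/tcp/") || strings.Contains(a, "/udp/")
}

// notIP6LinkLocal rejects addresses with the common fe80: link-local prefix.
func notIP6LinkLocal(a string) bool {
	return !strings.HasPrefix(a, "/ip6/fe80:")
}

// dialable applies all three filters to the candidate addresses.
func dialable(own, candidates []string) []string {
	return filterAddrs(candidates, notOneOf(own...), hasKnownTransport, notIP6LinkLocal)
}

func main() {
	own := []string{"/ip4/192.168.1.5/tcp/4001"}
	candidates := []string{
		"/ip4/192.168.1.5/tcp/4001",  // our own listen address
		"/ip6/fe80::1/tcp/4001",      // IPv6 link-local
		"/ip4/203.0.113.7/sctp/4001", // no known transport in this sketch
		"/ip4/203.0.113.7/tcp/4001",  // kept
	}
	fmt.Println(dialable(own, candidates))
}
```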

2. Dialing address sorting

// ranks addresses in descending order of preference for dialing   Private UDP > Public UDP > Private TCP > Public TCP > UDP Relay server > TCP Relay server
    rankAddrsFnc := func(addrs []ma.Multiaddr) []ma.Multiaddr {
        var localUdpAddrs []ma.Multiaddr // private udp
        var relayUdpAddrs []ma.Multiaddr // relay udp
        var othersUdp []ma.Multiaddr     // public udp

        var localFdAddrs []ma.Multiaddr // private fd consuming
        var relayFdAddrs []ma.Multiaddr //  relay fd consuming
        var othersFd []ma.Multiaddr     // public fd consuming

        for _, a := range addrs {
            if _, err := a.ValueForProtocol(ma.P_CIRCUIT); err == nil {
                if s.IsFdConsumingAddr(a) {
                    relayFdAddrs = append(relayFdAddrs, a)
                    continue
                }
                relayUdpAddrs = append(relayUdpAddrs, a)
            } else if manet.IsPrivateAddr(a) {
                if s.IsFdConsumingAddr(a) {
                    localFdAddrs = append(localFdAddrs, a)
                    continue
                }
                localUdpAddrs = append(localUdpAddrs, a)
            } else {
                if s.IsFdConsumingAddr(a) {
                    othersFd = append(othersFd, a)
                    continue
                }
                othersUdp = append(othersUdp, a)
            }
        }

        relays := append(relayUdpAddrs, relayFdAddrs...)
        fds := append(localFdAddrs, othersFd...)

        return append(append(append(localUdpAddrs, othersUdp...), fds...), relays...)
    }
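
The resulting order can be checked with a small stand-alone sketch. Instead of the six buckets used above, it assigns each address a score and applies a stable sort, which yields the same order and also keeps the input order within each class. The addr struct and its boolean fields are hypothetical stand-ins for the checks on ma.P_CIRCUIT, manet.IsPrivateAddr, and IsFdConsumingAddr.

```go
package main

import (
	"fmt"
	"sort"
)

// addr is a hypothetical, pre-classified address.
type addr struct {
	name    string
	relay   bool // goes through a circuit relay
	private bool // private (local network) address
	fd      bool // transport consumes a file descriptor (e.g. TCP)
}

// score returns the dialing priority of an address; lower is dialed first.
func score(a addr) int {
	switch {
	case a.relay && a.fd:
		return 5 // relay, FD-consuming
	case a.relay:
		return 4 // relay, UDP
	case a.fd && !a.private:
		return 3 // public, FD-consuming
	case a.fd:
		return 2 // private, FD-consuming
	case !a.private:
		return 1 // public UDP
	default:
		return 0 // private UDP
	}
}

// rank returns a copy of addrs sorted by dialing preference.
func rank(addrs []addr) []addr {
	out := make([]addr, len(addrs))
	copy(out, addrs)
	sort.SliceStable(out, func(i, j int) bool { return score(out[i]) < score(out[j]) })
	return out
}

func main() {
	in := []addr{
		{name: "relay-tcp", relay: true, fd: true},
		{name: "public-tcp", fd: true},
		{name: "private-quic", private: true},
		{name: "public-quic"},
		{name: "private-tcp", private: true, fd: true},
		{name: "relay-quic", relay: true},
	}
	for _, a := range rank(in) {
		fmt.Println(a.name)
	}
}
```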

3. Backoff time setting

// BackoffBase is the base amount of time to backoff (default: 5s).
var BackoffBase = time.Second * 5

// BackoffCoef is the backoff coefficient (default: 1s).
var BackoffCoef = time.Second

// BackoffMax is the maximum backoff time (default: 5m).
var BackoffMax = time.Minute * 5

// AddBackoff lets other nodes know that we've entered backoff with peer p, so dialers should not wait unnecessarily. We still will attempt to dial with one goroutine, in case we get through.
//
// Backoff is not exponential, it's quadratic and computed according to the following formula:
//
//     BackoffBase + BackoffCoef * PriorBackoffs^2
//
// Where PriorBackoffs is the number of previous backoffs.
func (db *DialBackoff) AddBackoff(p peer.ID, addr ma.Multiaddr) {
    saddr := string(addr.Bytes())
    db.lock.Lock()
    defer db.lock.Unlock()
    bp, ok := db.entries[p]
    if !ok {
        bp = make(map[string]*backoffAddr, 1)
        db.entries[p] = bp
    }
    ba, ok := bp[saddr]
    if !ok {
        bp[saddr] = &backoffAddr{
            tries: 1,
            until: time.Now().Add(BackoffBase),
        }
        return
    }

    backoffTime := BackoffBase + BackoffCoef*time.Duration(ba.tries*ba.tries)
    if backoffTime > BackoffMax {
        backoffTime = BackoffMax
    }
    ba.until = time.Now().Add(backoffTime)
    ba.tries++
}
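
As a worked example of the formula, the sketch below evaluates the backoff for several values of PriorBackoffs using the default constants above. The function backoffFor is a hypothetical helper written for this illustration; it is not part of the library.

```go
package main

import (
	"fmt"
	"time"
)

const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
	backoffMax  = 5 * time.Minute
)

// backoffFor returns the wait applied after the given number of prior
// backoffs: base + coef*prior^2, capped at the maximum.
func backoffFor(prior int) time.Duration {
	d := backoffBase + backoffCoef*time.Duration(prior*prior)
	if d > backoffMax {
		d = backoffMax
	}
	return d
}

func main() {
	for _, prior := range []int{0, 1, 2, 3, 10, 17, 18} {
		fmt.Printf("prior=%d backoff=%s\n", prior, backoffFor(prior))
	}
}
```

With these defaults the first failure backs off for 5s, the following ones for 6s, 9s, and 14s, and the 5 minute cap is reached at 18 prior backoffs (5s + 324s exceeds 300s).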

7. Core dialing logic

func (s *Swarm) dialAddrs(ctx context.Context, p peer.ID, remoteAddrs []ma.Multiaddr) (transport.CapableConn, *DialError) {
    /*
        This slice-to-chan code is temporary, the peerstore can currently provide
        a channel as an interface for receiving addresses, but more thought
        needs to be put into the execution. For now, this allows us to use
        the improved rate limiter, while maintaining the outward behaviour
        that we previously had (halting a dial when we run out of addrs)
    */
    var remoteAddrChan chan ma.Multiaddr
    if len(remoteAddrs) > 0 {
        remoteAddrChan = make(chan ma.Multiaddr, len(remoteAddrs))
        for i := range remoteAddrs {
            remoteAddrChan <- remoteAddrs[i]
        }
        close(remoteAddrChan)
    }

    log.Debugf("%s swarm dialing %s", s.local, p)

    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancel work when we exit func

    // use a single response type instead of errs and conns, reduces complexity *a ton*
    respch := make(chan dialResult)
    err := &DialError{Peer: p}

    defer s.limiter.clearAllPeerDials(p)

    var active int
dialLoop:
    for remoteAddrChan != nil || active > 0 {
        // Check for context cancellations and/or responses first.
        select {
        case <-ctx.Done():
            break dialLoop
        case resp := <-respch:
            active--
            if resp.Err != nil {
                // Errors are normal, lots of dials will fail
                if resp.Err != context.Canceled {
                    s.backf.AddBackoff(p, resp.Addr)
                }

                log.Infof("got error on dial: %s", resp.Err)
                err.recordErr(resp.Addr, resp.Err)
            } else if resp.Conn != nil {
                return resp.Conn, nil
            }

            // We got a result, try again from the top.
            continue
        default:
        }

        // Now, attempt to dial.
        select {
        case addr, ok := <-remoteAddrChan:
            if !ok {
                remoteAddrChan = nil
                continue
            }

            s.limitedDial(ctx, p, addr, respch)
            active++
        case <-ctx.Done():
            break dialLoop
        case resp := <-respch:
            active--
            if resp.Err != nil {
                // Errors are normal, lots of dials will fail
                if resp.Err != context.Canceled {
                    s.backf.AddBackoff(p, resp.Addr)
                }

                log.Infof("got error on dial: %s", resp.Err)
                err.recordErr(resp.Addr, resp.Err)
            } else if resp.Conn != nil {
                return resp.Conn, nil
            }
        }
    }

    if ctxErr := ctx.Err(); ctxErr != nil {
        err.Cause = ctxErr
    } else if len(err.DialErrors) == 0 {
        err.Cause = network.ErrNoRemoteAddrs
    } else {
        err.Cause = ErrAllDialsFailed
    }
    return nil, err
}

A peer may have multiple addresses. After the undialable addresses are filtered out, the remaining ones are ranked and passed to dialAddrs. The function first puts the addresses into a channel and then consumes them in a loop (dialLoop):

STEP 1. Check the context and responses
1.1. If the context is cancelled, break out of the loop.
1.2. If a response is received, decrement the active count. If that dial failed, call AddBackoff (unless the error is context.Canceled), record the error, and continue with the next iteration. If it succeeded, return the Conn immediately.

STEP 2. Attempt to dial
2.1. Take an address from the channel, call limitedDial (which starts a goroutine internally to do the dial), and increment the active count.
2.2. If the context is cancelled, break out of the loop.
2.3. If a response is received, decrement the active count. If the dial failed, call AddBackoff (unless the error is context.Canceled) and record the error, then go on to the next iteration, which checks the context and responses first. If it succeeded, return the Conn immediately.

STEP 3. Return an error
If dialLoop ends without returning a Conn, then either the context was cancelled, there were no addresses to dial, or every dial failed.

Suppose there are three addresses. Each dial runs in its own goroutine, so suppose the first fails and the second succeeds. While dialAddrs is waiting for the second dial's response, the third dial job may already be running. When dialAddrs returns the Conn, its deferred calls run in reverse order of registration: s.limiter.clearAllPeerDials(p) runs first and clears the peer's data in waitingOnPeerLimit, then cancel() runs, so the third dial job receives a cancel signal. If the third dial has nevertheless succeeded, its Conn is closed, as in dialLimiter.executeDial. Whether the dial succeeds or fails, dialing for this peer is over.
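
The three-address scenario can be reproduced with a stripped-down sketch. It keeps only the core pattern of dialAddrs: dial every address in its own goroutine, return the first success, and cancel the rest through the context. There is no limiter and no backoff here, and the names dialFirst, fakeDial, demoDial, and the two error values are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var (
	errNoAddrs        = errors.New("no addresses to dial")
	errAllDialsFailed = errors.New("all dials failed")
)

type result struct {
	addr string
	err  error
}

// dialFirst dials all addrs concurrently and returns the first address that
// succeeds. Remaining dials are cancelled via the context when it returns.
func dialFirst(ctx context.Context, addrs []string, dial func(context.Context, string) error) (string, error) {
	if len(addrs) == 0 {
		return "", errNoAddrs
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel outstanding dials when we exit

	respch := make(chan result)
	for _, a := range addrs {
		go func(a string) {
			err := dial(ctx, a)
			select {
			case respch <- result{addr: a, err: err}:
			case <-ctx.Done(): // nobody is listening any more; drop the result
			}
		}(a)
	}

	for active := len(addrs); active > 0; active-- {
		select {
		case r := <-respch:
			if r.err == nil {
				return r.addr, nil
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", errAllDialsFailed
}

// fakeDial simulates a transport: "bad" fails at once, "good" succeeds
// after 10ms, and "slow" would succeed after one second unless cancelled.
func fakeDial(ctx context.Context, addr string) error {
	delay := map[string]time.Duration{
		"good": 10 * time.Millisecond,
		"slow": time.Second,
	}[addr]
	select {
	case <-time.After(delay):
	case <-ctx.Done():
		return ctx.Err()
	}
	if addr == "bad" {
		return errors.New("connection refused")
	}
	return nil
}

// demoDial runs the three-address scenario from the text.
func demoDial() (string, error) {
	return dialFirst(context.Background(), []string{"bad", "good", "slow"}, fakeDial)
}

func main() {
	addr, err := demoDial()
	fmt.Println(addr, err)
}
```

In this sketch the "slow" dial is still in flight when "good" succeeds; the deferred cancel() makes it return early instead of running for the full second.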

Official account: Netwarps
