A deep dive into Go's sync.Pool

Time: 2020-07-16

Recently I ran into a GC problem at work: the project created many objects over and over, which made the GC workload huge and caused frequent CPU spikes. I planned to use sync.Pool to cache objects and reduce the cost of GC. To use it with confidence, I studied it specially, which resulted in this article. It goes from usage all the way to source code analysis, step by step.

This article is based on Go 1.14.

Contents
  • What is it?
  • What’s the usage?
  • How to use it?
    • Simple examples
    • How the fmt package uses it
    • pool_test
    • Other
  • Source code analysis
    • Pool structure
    • Get
      • pin
      • popHead
      • getSlow
      • popTail
    • Put
      • pushHead
    • pack/unpack
    • GC
  • Summary
  • References

What is it?

sync.Pool is a component in the sync package. It serves as a “pool” that holds temporarily unused objects for later reuse. I find its name misleading, because objects in the pool can be reclaimed at any time without notice; sync.Cache might have been a better name.

What’s the usage?

For the many places where memory must be allocated and freed repeatedly, sync.Pool is a good choice. Frequent allocation and reclamation burden the GC and, in severe cases, cause CPU spikes. sync.Pool caches objects that are temporarily unused and hands them back the next time they are needed, without allocating memory again. Reusing objects’ memory reduces GC pressure and improves system performance.

How to use it?

First of all, sync.Pool is goroutine-safe, which is extremely convenient for callers. Before using it, set its New function, which is called when there is no cached object in the Pool. After that, anywhere in the program, at any time, you only need the Get() and Put() methods to retrieve and return objects.

Here is how a 2018 “Go night reading” session described the scenarios sync.Pool applies to:

When multiple goroutines need to create the same kind of object, and there are many goroutines, the number of created objects surges and GC pressure grows. This forms a vicious cycle: high concurrency → high memory usage → slow GC → reduced processing capacity → even higher concurrency.

At this point you need an object pool: each goroutine no longer creates objects on its own, but fetches them from the pool (when one already exists there).

So the key idea is to reuse objects and avoid repeated creation and destruction. Let’s look at how to use it.

Simple examples

Let’s start with a simple example:

package main
import (
	"fmt"
	"sync"
)

var pool *sync.Pool

type Person struct {
	Name string
}

func initPool() {
	pool = &sync.Pool{
		New: func() interface{} {
			fmt.Println("Creating a new Person")
			return new(Person)
		},
	}
}

func main() {
	initPool()

	p := pool.Get().(*Person)
	fmt.Println("Get from pool for the first time:", p)

	p.Name = "first"
	fmt.Printf("Set p.Name = %s\n", p.Name)

	pool.Put(p)

	fmt.Println("There is already an object in the pool: &{first}, call Get:", pool.Get().(*Person))
	fmt.Println("Pool has no object, call Get:", pool.Get().(*Person))
}

Output:

Creating a new Person
Get from pool for the first time: &{}
Set p.Name = first
There is already an object in the pool: &{first}, call Get: &{first}
Creating a new Person
Pool has no object, call Get: &{}

First the Pool needs initializing; the only requirement is to set the New function. When Get is called, if there is a cached object in the pool it is returned directly; if there is no stock, the New function is called to create a new one.

In addition, notice that the object Get returns is exactly the one we passed to Put earlier; the pool performs no “clearing” whatsoever. We should not rely on this, however, since in real concurrent scenarios this ordering is not guaranteed. The best practice is to clear the object’s fields before Put (or right after Get), as in the sketch below.
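
For example, a minimal sketch of clearing before Put, reusing the Person type from the example above:

func putPerson(p *Person) {
	p.Name = "" // reset fields so the next Get starts from a clean state
	pool.Put(p)
}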

How the fmt package uses it

This part focuses on how fmt.Printf uses the pool:

func Printf(format string, a ...interface{}) (n int, err error) {
	return Fprintf(os.Stdout, format, a...)
}

Keep looking into Fprintf:

func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error) {
	p := newPrinter()
	p.doPrintf(format, a)
	n, err = w.Write(p.buf)
	p.free()
	return
}

The Fprintf function takes an io.Writer as its argument; Printf passes os.Stdout, which means writing directly to standard output. Here newPrinter is where the pool comes in:

// newPrinter allocates a new pp struct or grabs a cached one.
func newPrinter() *pp {
	p := ppFree.Get().(*pp)
	p.panicking = false
	p.erroring = false
	p.wrapErrs = false
	p.fmt.init(&p.buf)
	return p
}

var ppFree = sync.Pool{
	New: func() interface{} { return new(pp) },
}

Back to the Fprintf function: after obtaining a pp pointer, it performs the formatting work, writes the contents of p.buf to w, and finally calls the free function to return the pp pointer to the pool:

// free saves used pp structs in ppFree; avoids an allocation per invocation.
func (p *pp) free() {
	if cap(p.buf) > 64<<10 {
		return
	}

	p.buf = p.buf[:0]
	p.arg = nil
	p.value = reflect.Value{}
	p.wrappedErr = nil
	ppFree.Put(p)
}

Some of the object’s fields are cleared before it goes back to the pool, so a cached object obtained later via Get can be used safely.

pool_test

Studying a package’s test file is a good way to learn its source code, because the tests represent the “official” usage. More importantly, test cases deliberately probe the “pitfalls”; learn them and you will know how to steer around them in your own code.

The pool_test file contains 7 tests and 4 benchmarks.

TestPool and TestPoolNew are relatively simple, mainly exercising basic Get/Put behavior. Let’s look at TestPoolNew:

func TestPoolNew(t *testing.T) {
	// disable GC so we can control when it happens.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	i := 0
	p := Pool{
		New: func() interface{} {
			i++
			return i
		},
	}
	if v := p.Get(); v != 1 {
		t.Fatalf("got %v; want 1", v)
	}
	if v := p.Get(); v != 2 {
		t.Fatalf("got %v; want 2", v)
	}

	// Make sure that the goroutine doesn't migrate to another P
	// between Put and Get calls.
	Runtime_procPin()
	p.Put(42)
	if v := p.Get(); v != 42 {
		t.Fatalf("got %v; want 42", v)
	}
	Runtime_procUnpin()

	if v := p.Get(); v != 3 {
		t.Fatalf("got %v; want 3", v)
	}
}

First the GC percentage is set to -1, which disables GC entirely. Why defer it, given the function is about to end anyway? Note that debug.SetGCPercent is called twice and returns the previous setting, so the defer restores the GC percentage that was in effect before this function ran, i.e., it restores the scene.

Next the pool’s New function is defined: it returns an int that is incremented by 1 on every call. Then Get is called twice in a row. Since the pool has no cached object yet, New creates one each time, so the first call returns 1 and the second returns 2.

Then Runtime_procPin() is called to prevent the goroutine from being preempted, guaranteeing that the following Put and Get operate on the same P’s “pool”. This Get does not call New, because a Put happened just before it.

Finally, Get is called once more; there is no stock left, so New is called again and returns 3.

TestPoolGC and TestPoolRelease mainly test GC’s effect on objects in the pool. A finalizer is used to count how many objects get reclaimed by GC:

runtime.SetFinalizer(v, func(vv *string) {
	atomic.AddUint32(&fin, 1)
})

When garbage collection finds that v is unreachable and v has an associated finalizer, it has a separate goroutine call the finalizer function — the func argument in the code above. That call makes v reachable again, so it is not reclaimed during this GC cycle. Afterwards v is unbound from its finalizer, and the next time GC finds v unreachable it is reclaimed for real.
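
Putting the pieces together, here is a minimal runnable sketch of this counting technique (hedged: finalizers run asynchronously on a separate goroutine, so the printed counts can lag behind the GC that queued them):

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

func main() {
	var fin uint32
	p := sync.Pool{}

	v := new(string)
	runtime.SetFinalizer(v, func(vv *string) {
		atomic.AddUint32(&fin, 1)
	})
	p.Put(v)
	v = nil // the pool now holds the only reference

	runtime.GC()                         // 1st GC: the object moves into the victim cache
	fmt.Println(atomic.LoadUint32(&fin)) // expected 0: still cached

	runtime.GC()                         // 2nd GC: the victim cache is dropped, finalizer queued
	runtime.GC()                         // give the finalizer goroutine a chance to have run
	fmt.Println(atomic.LoadUint32(&fin)) // expected 1 (may lag)
}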

TestPoolStress is, as the name suggests, a stress test. It starts 10 goroutines that keep putting objects into the pool and getting them back, checking that nothing goes wrong.

TestPoolDequeue and TestPoolChain both call testPoolDequeue, which does the actual work. It takes a PoolDequeue interface:

// poolDequeue testing.
type PoolDequeue interface {
	PushHead(val interface{}) bool
	PopHead() (interface{}, bool)
	PopTail() (interface{}, bool)
}

PoolDequeue is a double-ended queue: elements are pushed at the head and can be popped from either the head or the tail. The former test passes in NewPoolDequeue(16), the latter NewPoolChain(); at the bottom, both are backed by the poolDequeue structure. Here is what testPoolDequeue does:

[Figure: the double-ended queue]

A total of 10 goroutines are started: one producer and nine consumers. The producer keeps pushing elements onto the head of the queue with PushHead and, every 10 pushes, pops the head once; the consumers keep taking elements from the tail. Every element, whether taken from the head or the tail, is recorded in a map, and the test finally checks that each element was taken out exactly once.

The rest are benchmarks. The first, BenchmarkPool, is simple: it Puts/Gets in a loop to measure performance.

BenchmarkPoolSTW first turns off GC, puts a large number of objects (100,000) into the pool, forcibly triggers GC, records the GC pause time, and sorts the pauses to compute the p50 and p95 STW times. This function is worth adding to your personal code library:

func BenchmarkPoolSTW(b *testing.B) {
	// Take control of GC.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	var mstats runtime.MemStats
	var pauses []uint64

	var p Pool
	for i := 0; i < b.N; i++ {
		// Put a large number of items into a pool.
		const N = 100000
		var item interface{} = 42
		for i := 0; i < N; i++ {
			p.Put(item)
		}
		// Do a GC.
		runtime.GC()
		// Record pause time.
		runtime.ReadMemStats(&mstats)
		pauses = append(pauses, mstats.PauseNs[(mstats.NumGC+255)%256])
	}

	// Get pause time stats.
	sort.Slice(pauses, func(i, j int) bool { return pauses[i] < pauses[j] })
	var total uint64
	for _, ns := range pauses {
		total += ns
	}
	// ns/op for this benchmark is average STW time.
	b.ReportMetric(float64(total)/float64(b.N), "ns/op")
	b.ReportMetric(float64(pauses[len(pauses)*95/100]), "p95-ns/STW")
	b.ReportMetric(float64(pauses[len(pauses)*50/100]), "p50-ns/STW")
}

I ran it on my Mac:

go test -v -run=none -bench=BenchmarkPoolSTW

and got this output:

goos: darwin
goarch: amd64
pkg: sync
BenchmarkPoolSTW-12    361    3708 ns/op    3583 p50-ns/STW    5008 p95-ns/STW
PASS
ok      sync    1.481s

The last one, BenchmarkPoolExpensiveNew, tests the pool’s performance when the cost of New is high. It is also worth adding to your own code library.
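
To give a flavor of what such a benchmark looks like, here is a minimal sketch (not the stdlib version; the millisecond sleep is a stand-in for an expensive constructor), to be placed in a _test.go file:

package main

import (
	"bytes"
	"sync"
	"testing"
	"time"
)

func BenchmarkExpensiveNewPool(b *testing.B) {
	p := sync.Pool{
		New: func() interface{} {
			time.Sleep(time.Millisecond) // simulate an expensive New
			return new(bytes.Buffer)
		},
	}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			buf := p.Get().(*bytes.Buffer)
			buf.Reset()
			p.Put(buf)
		}
	})
}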

Other

The standard library’s encoding/json also uses sync.Pool to improve performance, and the famous gin framework uses sync.Pool for its context.

Let’s look at how gin uses sync.Pool. It sets the New function:

engine.pool.New = func() interface{} {
	return engine.allocateContext()
}

func (engine *Engine) allocateContext() *Context {
	return &Context{engine: engine, KeysMutex: &sync.RWMutex{}}
}

And uses it like this:

// ServeHTTP conforms to the http.Handler interface.
func (engine *Engine) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	c := engine.pool.Get().(*Context)
	c.writermem.reset(w)
	c.Request = req
	c.reset()

	engine.handleHTTPRequest(c)

	engine.pool.Put(c)
}

It first calls Get to take out a cached object, performs some reset operations, runs handleHTTPRequest, and finally Puts the context back into the pool.

The echo framework likewise uses sync.Pool to manage its context, achieving zero heap allocation:

It leverages sync pool to reuse memory and achieve zero dynamic memory allocation with no GC overhead.
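
The same pattern is easy to apply in your own code. Here is a minimal sketch with a bytes.Buffer pool (the names are illustrative, not taken from any framework):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // reset: never assume the buffer comes back empty
	defer bufPool.Put(buf)
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String() // String copies, so returning it is safe
}

func main() {
	fmt.Println(render("gopher"))
}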

Source code analysis

Pool structure

First, let’s look at the structure of Pool:

type Pool struct {
	noCopy noCopy

    //The local queue of each P; the actual type is [P]poolLocal
	local     unsafe.Pointer // local fixed-size per-P pool, actual type is [P]poolLocal
	//Size of the [P]poolLocal array
	localSize uintptr        // size of the local array

	victim     unsafe.Pointer // local from previous cycle
	victimSize uintptr        // size of victims array

	//Custom object creation callback function, which is called when no object is available in the pool
	New func() interface{}
}

Because a Pool must not be copied after first use, the struct contains a noCopy field; the go vet tool can then detect whether user code copies the Pool.

noCopy is a static-check mechanism introduced in Go 1.7. It works not only in the runtime and the standard library but also in user code.

You only need to embed such a struct, which consumes no memory and exists purely for static analysis, to ensure an object is not copied after its first use.

The implementation is very simple:

//noCopy may be embedded into a struct to ensure it is not copied after first use
//
//See https://golang.org/issues/8005#issuecomment-190753527
type noCopy struct{}

//Lock and Unlock are no-ops used by `go vet`'s -copylocks checker
func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

The local field holds a pointer to a [P]poolLocal array (strictly speaking, a slice), and localSize is the size of that array. On access, a P’s ID is used as the index into [P]poolLocal. With this design, when multiple goroutines use the same pool, contention is reduced and performance improves.
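
The indexLocal helper that performs this indexing lives in src/sync/pool.go; it is plain pointer arithmetic over the array:

func indexLocal(l unsafe.Pointer, i int) *poolLocal {
	// Address of the i-th poolLocal in the array that l points to.
	lp := unsafe.Pointer(uintptr(l) + uintptr(i)*unsafe.Sizeof(poolLocal{}))
	return (*poolLocal)(lp)
}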

When a round of GC arrives, victim and victimSize “take over” local and localSize respectively. The victim mechanism reduces the performance jitter of a GC cold start and makes object allocation smoother.

The victim cache is originally a computer-architecture concept, a CPU hardware caching technique; sync.Pool borrows it to reduce GC pressure and improve the hit rate.

When the pool has no cached object, the New method is called to generate a new one.

type poolLocal struct {
	poolLocalInternal

	//Pads poolLocal up to a multiple of 128 bytes (two cache lines) to prevent false sharing.
	//Each cache line is 64 bytes, i.e. 512 bits.
	//A typical 32 KB L1 cache has 32 * 1024 / 64 = 512 cache lines.
	//The padding only occupies space, preventing multiple poolLocalInternals from being allocated on one cache line.
	pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

// Local per-P Pool appendix.
type poolLocalInternal struct {
	//P's private cache; no lock is needed to access it
	private interface{}
	//Shared cache. The local P can pushHead/popHead; other Ps can only popTail
	shared  poolChain
}

The pad field exists mainly to prevent false sharing. Dong Zerun’s article on CPU caches explains:

In modern CPUs, the cache is managed in units of cache lines (cache blocks). On x86_64 systems a cache line is usually 64 bytes, and the cache line is the smallest unit of operation.

Even if a program only wants to read a single byte from memory, the 63 adjacent bytes are loaded into the cache along with it. If the program reads more than 64 bytes, multiple cache lines must be loaded.

In short, without the pad field, when the poolLocal at index 0 is accessed, the CPU loads index 0 and index 1 into the cache together. If index 0 is then modified, the cached copy of index 1 becomes invalid; when another thread wants to read index 1, it suffers a cache miss and has to reload it, which hurts performance. Adding pad rounds each element up to a whole cache line, so every poolLocal is loaded into its own cache line independently and false sharing does not occur.
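
A runnable sketch of this padding trick; localInternal here is a hypothetical stand-in for poolLocalInternal:

package main

import (
	"fmt"
	"unsafe"
)

type localInternal struct {
	private interface{}
	shared  []interface{}
}

// pad rounds the struct size up to a multiple of 128 bytes (two cache
// lines), so adjacent elements of a []local never share a cache line.
type local struct {
	localInternal
	pad [128 - unsafe.Sizeof(localInternal{})%128]byte
}

func main() {
	fmt.Println(unsafe.Sizeof(local{})) // 128: a multiple of the cache-line size
}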

poolChain is an implementation of a double-ended queue:

type poolChain struct {
	//head is pushed to only by the producer, so no lock is needed
	head *poolChainElt

	//tail is popped from by consumers; reads and writes require atomics
	tail *poolChainElt
}

type poolChainElt struct {
	poolDequeue

	//next is written by the producer and read by consumers, so it only transitions from nil to non-nil.
	//prev is written by consumers and read by the producer, so it only transitions from non-nil to nil.
	next, prev *poolChainElt
}

type poolDequeue struct {
	// The head index is stored in the most-significant bits so
	// that we can atomically add to it and the overflow is
	// harmless.
	//headTail packs a 32-bit head pointer and a 32-bit tail pointer; both are indices modulo len(vals)-1.
	//tail points to the oldest data in the queue; head points to the next slot to be filled.
	//The valid range of slots is [tail, head), owned by consumers.
	headTail uint64

	//vals is a ring buffer storing interface{} values; its size must be a power of 2.
	//If a slot is empty, vals[i].typ is nil; otherwise it is non-nil.
	//A slot is vacated when tail no longer points to it and vals[i].typ is nil:
	//it is set to nil by the consumer and read by the producer.
	vals []eface
}

poolDequeue is implemented as a single-producer, multi-consumer, fixed-size, lock-free (atomics-based) ring queue (the underlying storage is an array, with two indices marking head and tail). The producer can insert and remove at the head, while consumers can only remove from the tail.

headTail points at the head and the tail of the queue; the two indices are packed into the single headTail variable using bit operations.

Here is a picture describing the Pool structure as a whole:

[Figure: the Pool struct]

The double-ended queue diagram below is borrowed from mubai’s article “What are the disadvantages of sync.Pool”:

[Figure: the double-ended queue]

As you can see, the Pool does not use poolDequeue directly, because its size is fixed while a pool’s size is unbounded. So poolDequeue is wrapped into poolChainElt nodes that form a doubly linked list, which can grow dynamically.

Get

Straight to the source:

func (p *Pool) Get() interface{} {
    // ......
	l, pid := p.pin()
	x := l.private
	l.private = nil
	if x == nil {
		x, _ = l.shared.popHead()
		if x == nil {
			x = p.getSlow(pid)
		}
	}
	runtime_procUnpin()
    // ......
	if x == nil && p.New != nil {
		x = p.New()
	}
	return x
}

The elided parts are race-related code — noise when reading the source — so they are commented out for now. With that removed, the whole Get flow is clear:

  1. First, p.pin() binds the current goroutine to P, disables preemption, and returns the current P’s poolLocal and pid.

  2. Then l.private is assigned to x, and l.private is set to nil.

  3. If x is nil, it tries to pop an object from the head of l.shared and assign it to x.

  4. If x is still nil, getSlow is called to “steal” an object from the tail of another P’s shared double-ended queue.

  5. After the pool operations are done, runtime_procUnpin() is called to re-enable preemption.

  6. Finally, if no cached object was obtained, the preconfigured New function is called to create one.

Here is a flow chart of the whole process:

[Figure: Get flow chart]

After sorting out the overall process, let’s take a look at some of the key functions.

pin

Let’s look at Pool.pin() first:

// src/sync/pool.go

// The caller must call runtime_procUnpin() when done with the value, to re-enable preemption.
func (p *Pool) pin() (*poolLocal, int) {
	pid := runtime_procPin()
	s := atomic.LoadUintptr(&p.localSize) // load-acquire
	l := p.local                          // load-consume
	//The number of Ps can be adjusted dynamically at runtime, so pid must be range-checked
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	return p.pinSlow()
}

pin’s job is to bind the current goroutine to P and disable preemption, returning the corresponding poolLocal and the ID of P.

If G were preempted, its state would change from running to runnable, and it would be put back into P’s local queue or the global queue to wait for the next scheduling round — and next time it runs it might not be paired with the current P. Since the pid is used afterwards, preemption could mean that the pid used later no longer matches the P we bound.

The “binding” work is ultimately handed to procPin:

// src/runtime/proc.go

func procPin() int {
	_g_ := getg()
	mp := _g_.m

	mp.locks++
	return int(mp.p.ptr().id)
}

The code is very simple: it increments the locks field of the M bound to the goroutine, which completes the “binding”. For the principle behind pin, see “Source code interpretation of golang sync.Pool’s object pool”, which analyzes in detail why, after procPin, the goroutine cannot be preempted and why GC will not clean objects out of the pool meanwhile.

Back in p.pin(): p.localSize and p.local are loaded atomically. If pid is less than p.localSize, the element at index pid of the poolLocal array is taken directly. Otherwise the pool has not yet created its poolLocal array, and p.pinSlow() is called to create it.

func (p *Pool) pinSlow() (*poolLocal, int) {
	// Retry under the mutex.
	// Can not lock the mutex while pinned.
	runtime_procUnpin()
	allPoolsMu.Lock()
	defer allPoolsMu.Unlock()
	pid := runtime_procPin()
	// poolCleanup won't be called while we are pinned.
	//No atomic operations are needed here because the global lock is held
	s := p.localSize
	l := p.local
	//pinSlow may have been called by another thread in the meantime, so pid must be checked again. If pid is within the range of p.local's size, there is no need to create the poolLocal slice; return directly.
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	if p.local == nil {
		allPools = append(allPools, p)
	}
	// If GOMAXPROCS changes between GCs, we re-allocate the array and lose the old one.
	//The current number of Ps
	size := runtime.GOMAXPROCS(0)
	local := make([]poolLocal, size)
	//The old local array will be reclaimed by GC
	atomic.StorePointer(&p.local, unsafe.Pointer(&local[0])) // store-release
	atomic.StoreUintptr(&p.localSize, uintptr(size))         // store-release
	return &local[pid], pid
}

Because the big lock allPoolsMu is about to be taken, the function name carries slow. The coarser the lock and the heavier the contention, the “slower” things naturally get. Note that the binding must be released before locking and re-acquired afterwards: locking may block, and the bigger the lock the higher the chance of blocking; blocking while still occupying a P would waste resources.

After unpinning, pinSlow may have been called by other threads and p.local may have changed, so pid must be checked again: if pid is within the range of p.localSize, there is no need to create a poolLocal slice, and the element is returned directly.

After that, a poolLocal slice is created with make, sized to the current number of Ps, runtime.GOMAXPROCS(0), and p.local and p.localSize are set with atomic stores.

Finally, the element at index pid of p.local is returned.

About this big lock allPoolsMu: Cao Da gave an example in “Several lock problems you may run into in Go systems”. A third-party library used fasttemplate.Template, a struct that contains a sync.Pool field. The engineers created a new instance of this struct for every request, so every request tried to fetch a cached object from an empty pool; the goroutines ended up blocked on this big lock while all trying to execute allPools = append(allPools, p), causing performance problems.
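
A minimal sketch of that anti-pattern and its fix (Template here is hypothetical, standing in for fasttemplate.Template):

package main

import "sync"

type Template struct {
	pool sync.Pool // the first Get on a fresh pool goes through pinSlow and allPoolsMu
}

// Anti-pattern: a new Template (hence a new, empty pool) per request.
func handleBad() {
	t := &Template{}
	_ = t.pool.Get() // every request contends on allPoolsMu
}

// Fix: create the Template once and reuse it across requests.
var tmpl = &Template{}

func handleGood() {
	_ = tmpl.pool.Get()
}

func main() {
	handleBad()
	handleGood()
}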

popHead

Back in the Get function, let’s look at another key function, poolChain.popHead():

func (c *poolChain) popHead() (interface{}, bool) {
	d := c.head
	for d != nil {
		if val, ok := d.popHead(); ok {
			return val, ok
		}
		// There may still be unconsumed elements in the
		// previous dequeue, so try backing up.
		d = loadPoolChainElt(&d.prev)
	}
	return nil, false
}

popHead is called only by the producer. It first takes the head node c.head; if it is non-nil, it tries the head node’s popHead method. Note that these two popHead methods are not actually the same: one belongs to poolChain, the other to poolDequeue. If in doubt, look back at the Pool structure diagram. Now poolDequeue.popHead():

// /usr/local/go/src/sync/poolqueue.go

func (d *poolDequeue) popHead() (interface{}, bool) {
	var slot *eface
	for {
		ptrs := atomic.LoadUint64(&d.headTail)
		head, tail := d.unpack(ptrs)
		//Determine whether the queue is empty
		if tail == head {
			// Queue is empty.
			return nil, false
		}

		//head points one position past the most recently pushed slot, so step back one first.
		//Decrement head before reading the slot's value, taking the slot out of the queue's control
		head--
		ptrs2 := d.pack(head, tail)
		if atomic.CompareAndSwapUint64(&d.headTail, ptrs, ptrs2) {
			// We successfully took back slot.
			slot = &d.vals[head&uint32(len(d.vals)-1)]
			break
		}
	}

    //Take out val
	val := *(*interface{})(unsafe.Pointer(slot))
	if val == dequeueNil(nil) {
		val = nil
	}
	
	//Reset the slot: set typ and val back to nil.
	//The zeroing here differs from popTail's: there is no race with pushHead here, so no special care is needed
	*slot = eface{}
	return val, true
}

This function removes and returns the head node of the queue; if the queue is empty, it returns false. What the queue stores, of course, are the objects cached in the pool.

The core of the whole function is an infinite loop, a common pattern in lock-free programming in Go.

First unpack is called to separate the head and tail pointers. If head equals tail, the queue is empty, and nil, false is returned directly.

Otherwise the head pointer is moved back one slot, i.e. head is decremented by 1, and pack re-packs the head and tail pointers. atomic.CompareAndSwapUint64 checks whether headTail has changed in the meantime; if not, that is equivalent to acquiring the lock: headTail is updated, and the element at the corresponding index of vals is assigned to slot.

Because the length of vals is always a power of 2, say 2^n, the value len(d.vals)-1 has its low n bits all set to 1; ANDing it with head yields exactly the low n bits of head — head modulo the queue length.
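
A quick illustration of this power-of-two masking:

package main

import "fmt"

func main() {
	const size = 8 // a power of two, like len(d.vals)
	head := uint32(10)
	fmt.Println(head & (size - 1)) // 2: the low 3 bits of head
	fmt.Println(head % size)       // 2: same result via modulo
}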

After the slot’s element is obtained, a type conversion checks whether it is dequeueNil; if so, it holds no cached object, and nil is returned.

// /usr/local/go/src/sync/poolqueue.go
//Because nil is used to mark an empty slot, dequeueNil stands in for a cached interface{}(nil)
type dequeueNil *struct{}

Finally, before val is returned, the slot is “zeroed”: *slot = eface{}.

Back in poolChain.popHead(): once poolDequeue.popHead() yields a cached object, it is returned immediately. Otherwise d is re-pointed at d.prev and the loop keeps trying to find a cached object.

getSlow

If no cached object was found in shared, Pool.getSlow() is called next, trying to “steal” from other Ps’ poolLocals:

func (p *Pool) getSlow(pid int) interface{} {
	// See the comment in pin regarding ordering of the loads.
	size := atomic.LoadUintptr(&p.localSize) // load-acquire
	locals := p.local                        // load-consume
	// Try to steal one element from other procs.
	//Stealing objects from other p's
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i+1)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	//Try to fetch an object from the victim cache. This happens only after stealing
	//from other Ps' poolLocals has failed, which makes victim objects easier to reclaim eventually.
	size = atomic.LoadUintptr(&p.victimSize)
	if uintptr(pid) >= size {
		return nil
	}
	locals = p.victim
	l := indexLocal(locals, pid)
	if x := l.private; x != nil {
		l.private = nil
		return x
	}
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	//Mark the victim cache as empty so that subsequent Gets don't bother looking here
	atomic.StoreUintptr(&p.victimSize, 0)

	return nil
}

Starting from the poolLocal at index pid+1, it tries shared.popTail() to get a cached object. If nothing is found, it searches the victim cache next, with logic similar to the poolLocal search.

Finally, if still nothing is found, victimSize is set to 0, so that subsequent callers will not search the victim cache again.

At the end of the Get function, if no cached object was found after all this, New is called to create a brand-new object.

popTail

Finally, the popTail function:

func (c *poolChain) popTail() (interface{}, bool) {
	d := loadPoolChainElt(&c.tail)
	if d == nil {
		return nil, false
	}

	for {
		d2 := loadPoolChainElt(&d.next)

		if val, ok := d.popTail(); ok {
			return val, ok
		}

		if d2 == nil {
			//The chain has only this one node, and it is now empty
			return nil, false
		}

		//The dequeue at the tail of the chain has been drained, so look at the next node.
		//Also try to drop the drained tail from the chain, so the next popTail doesn't have to inspect an empty dequeue again.
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&c.tail)), unsafe.Pointer(d), unsafe.Pointer(d2)) {
			//Get rid of the tail node
			storePoolChainElt(&d2.prev, nil)
		}
		d = d2
	}
}

At the start of the for loop, d.next is loaded into d2 before popping. This is important: d may be transiently empty, but if d2 (d.next) was non-nil before the pop and the pop still fails, then d is permanently empty, and only in that case is it safe to drop d from the chain.

Finally, c.tail is updated to d2, so that the next popTail doesn’t have to check an empty dequeue; and d2.prev is set to nil, so that the next popHead doesn’t check an empty dequeue either.

Now the core, poolDequeue.popTail:

// src/sync/poolqueue.go:147

func (d *poolDequeue) popTail() (interface{}, bool) {
	var slot *eface
	for {
		ptrs := atomic.LoadUint64(&d.headTail)
		head, tail := d.unpack(ptrs)
		//Determine whether the queue is empty
		if tail == head {
			// Queue is empty.
			return nil, false
		}

		//Advance the tail first; if the CAS below succeeds, the slot at the old tail belongs to us
		ptrs2 := d.pack(head, tail+1)
		if atomic.CompareAndSwapUint64(&d.headTail, ptrs, ptrs2) {
			// Success.
			slot = &d.vals[tail&uint32(len(d.vals)-1)]
			break
		}
	}

	// We now own slot.
	val := *(*interface{})(unsafe.Pointer(slot))
	if val == dequeueNil(nil) {
		val = nil
	}

	slot.val = nil
	atomic.StorePointer(&slot.typ, nil)
	// At this point pushHead owns the slot.

	return val, true
}

popTail removes an element from the tail of the queue; if the queue is empty, it returns false. This function may be called by multiple consumers at the same time.

The core of the function is again an infinite loop of lock-free programming. It first unpacks the head and tail pointer values; if they are equal, the queue is empty.

Because an element is being removed from the tail, the tail pointer is advanced by 1 and headTail is then updated with an atomic CAS.

Finally, the val and typ of the removed slot are zeroed:

slot.val = nil
atomic.StorePointer(&slot.typ, nil)

Put

// src/sync/pool.go

//Put adds the object to the pool 
func (p *Pool) Put(x interface{}) {
	if x == nil {
		return
	}
	// ……
	l, _ := p.pin()
	if l.private == nil {
		l.private = x
		x = nil
	}
	if x != nil {
		l.shared.pushHead(x)
	}
	runtime_procUnpin()
    //…… 
}

The race-related code is removed here as well, which makes it look much cleaner. The logic of Put is also clear:

  1. First g and P are bound, then Put tries to assign x to the private field.

  2. If that fails because private is already occupied, pushHead is called to put x into the double-ended queue maintained by the shared field.

The whole process is shown in the flow chart:

[Figure: Put flow chart]

pushHead

Let’s look at the pushHead source code; it is fairly clear:

// src/sync/poolqueue.go

func (c *poolChain) pushHead(val interface{}) {
	d := c.head
	if d == nil {
		//The initial length of a poolDequeue is 8
		const initSize = 8 // Must be a power of 2
		d = new(poolChainElt)
		d.vals = make([]eface, initSize)
		c.head = d
		storePoolChainElt(&c.tail, d)
	}

	if d.pushHead(val) {
		return
	}

    //Double the length of the previous poolDequeue
	newSize := len(d.vals) * 2
	if newSize >= dequeueLimit {
		// Can't make it any bigger.
		newSize = dequeueLimit
	}

    //Link the new node after d, forming the chain
	d2 := &poolChainElt{prev: d}
	d2.vals = make([]eface, newSize)
	c.head = d2
	storePoolChainElt(&d.next, d2)
	d2.pushHead(val)
}

If c.head is nil, a poolChainElt must be created as the first node (and, of course, also the tail node). The double-ended queue it manages has an initial length of 8. When a node fills up, a new poolChainElt is created whose queue is twice as long. There is, of course, a maximum length (2^30):

const dequeueBits = 32
const dequeueLimit = (1 << dequeueBits) / 4

poolDequeue.pushHead is then called to try to place the object into the poolDequeue:

// src/sync/poolqueue.go

//pushHead adds val at the head of the double-ended queue. It returns false if the queue is full. It may only be called by a single producer
func (d *poolDequeue) pushHead(val interface{}) bool {
	ptrs := atomic.LoadUint64(&d.headTail)
	head, tail := d.unpack(ptrs)
	if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
		// Queue is full.
		return false
	}
	slot := &d.vals[head&uint32(len(d.vals)-1)]
	// ... (the rest is examined piece by piece below)
}

First it checks whether the queue is full:

if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {

That is, len(d.vals) is added to the tail pointer, the low 32 bits are taken, and the result is compared with head. Since len(d.vals) is fixed, the two sides are equal exactly when the queue is full, in which case false is returned.

Otherwise the queue is not full, and the slot to fill is located through the head pointer: head & uint32(len(d.vals)-1), i.e. the low n bits of head.

// Check if the head slot has been released by popTail.
typ := atomic.LoadPointer(&slot.typ)
if typ != nil {
	// Another goroutine is still cleaning up the tail, so
	// the queue is actually still full.
	//popTail sets val to nil first and typ last; only once typ is nil may pushHead reuse this slot
	return false
}

This check detects a conflict with popTail; if there is one, false is returned directly.

Finally, val is assigned to the slot and the head pointer is incremented by 1:

//Take ownership of the slot and store val into vals
*(*interface{})(unsafe.Pointer(slot)) = val
//Increment head, publishing the slot to popTail
atomic.AddUint64(&d.headTail, 1<<dequeueBits)

The implementation here is quite clever. slot is of type eface; converting the slot pointer to *interface{} and assigning val through it makes the assignment fill in both of slot’s words, so slot.typ and slot.val point to val’s type and memory block, and neither is nil.
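
An illustrative sketch of this trick; the eface layout below mirrors the runtime’s empty-interface representation (an assumption about internals, fine for illustration):

package main

import (
	"fmt"
	"unsafe"
)

// eface mirrors the memory layout of interface{}.
type eface struct {
	typ, val unsafe.Pointer
}

func main() {
	var slot eface
	// Assigning through a *interface{} writes both words of the slot.
	*(*interface{})(unsafe.Pointer(&slot)) = 42
	fmt.Println(slot.typ != nil, slot.val != nil) // true true
}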

pack/unpack

Finally, let’s look at the pack and unpack functions. They are a pair that packs and unpacks the head and tail pointers.

// src/sync/poolqueue.go

const dequeueBits = 32

func (d *poolDequeue) pack(head, tail uint32) uint64 {
	const mask = 1<<dequeueBits - 1
	return (uint64(head) << dequeueBits) |
		uint64(tail&mask)
}

mask has its low 32 bits all set to 1 and the rest 0; ANDing it with tail keeps only the low 32 bits of tail. head, shifted left by 32 bits, has zeros in its low 32 bits. Finally, ORing the two parts binds head and tail together.

The corresponding unpacking function:

func (d *poolDequeue) unpack(ptrs uint64) (head, tail uint32) {
	const mask = 1<<dequeueBits - 1
	head = uint32((ptrs >> dequeueBits) & mask)
	tail = uint32(ptrs & mask)
	return
}

head is recovered by shifting ptrs right by 32 bits and ANDing with mask, keeping the low 32 bits of the result. tail is even simpler: just AND ptrs with mask directly.
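
A round-trip sketch of pack/unpack outside the pool, using the same constants:

package main

import "fmt"

const dequeueBits = 32

func pack(head, tail uint32) uint64 {
	const mask = 1<<dequeueBits - 1
	return (uint64(head) << dequeueBits) | uint64(tail&mask)
}

func unpack(ptrs uint64) (head, tail uint32) {
	const mask = 1<<dequeueBits - 1
	head = uint32((ptrs >> dequeueBits) & mask)
	tail = uint32(ptrs & mask)
	return
}

func main() {
	h, t := unpack(pack(7, 3))
	fmt.Println(h, t) // 7 3
}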

GC

A pool cannot grow without bound; otherwise the cached objects would occupy too much memory and eventually exhaust it.

In almost all pooling techniques, some cached objects are cleared or evicted at certain moments. When are unused objects in Go’s pool cleaned up?

The answer is when GC occurs.

In the init function of the pool.go file, a function is registered that describes how to clean up the pools when GC occurs:

// src/sync/pool.go

func init() {
	runtime_registerPoolCleanup(poolCleanup)
}

The compiler wires this up behind the scenes with a go:linkname directive:

// src/runtime/mgc.go

// Hooks for other packages

var poolcleanup func()

//Register the sync package's cleanup with the runtime, via the compiler directive below
//go:linkname sync_runtime_registerPoolCleanup sync.runtime_registerPoolCleanup
func sync_runtime_registerPoolCleanup(f func()) {
	poolcleanup = f
}

The cleanup itself looks like this:

func poolCleanup() {
	for _, p := range oldPools {
		p.victim = nil
		p.victimSize = 0
	}

	// Move primary cache to victim cache.
	for _, p := range allPools {
		p.victim = p.local
		p.victimSize = p.localSize
		p.local = nil
		p.localSize = 0
	}

	oldPools, allPools = allPools, nil
}

poolCleanup is called during the STW phase of GC. Overall it is quite simple: it essentially swaps local into victim, so that GC does not empty all pools at once — the victim cache keeps a reserve at the bottom.

If sync.Pool’s Get/Put rhythm is steady, no new pool memory is allocated. If the Get rate drops, objects may then be released within two GC cycles, instead of one.
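
A minimal sketch of the two-cycle lifetime (hedged: the results assume no extra GC is triggered in between, which holds for this tiny program):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	p := sync.Pool{New: func() interface{} { return "new" }}

	p.Put("cached")
	runtime.GC()         // local → victim; the object survives
	fmt.Println(p.Get()) // "cached", served from the victim cache

	p.Put("cached")
	runtime.GC()
	runtime.GC()         // two GCs: the victim cache has been dropped
	fmt.Println(p.Get()) // "new", created by New
}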

“How is sync.Pool optimized in Go 1.13?” describes the optimizations made in 1.13.

The reference “Understanding the design and implementation of Go 1.13 sync.Pool” manually simulates calls to poolCleanup and traces how oldPools, allPools, and p.victim change before and after each call — a wonderful walkthrough:

  1. In the initial state, oldPools and allPools are both nil.
  2. Get is called for the first time. Since p.local is nil, p.local is created in pinSlow and p is appended to allPools. Now len(allPools) == 1 and oldPools is nil.
  3. After the object is used, Put is called for the first time to return it.
  4. In the STW phase of the first GC, every p.local in allPools is assigned to p.victim and then set to nil; allPools is assigned to oldPools. Afterwards, allPools is nil and len(oldPools) == 1.
  5. Get is called a second time. Since p.local is nil, it tries to fetch the object from p.victim.
  6. After use, Put is called a second time. Because p.local is nil, p.local is recreated and the object is put back into it. Now len(allPools) == 1 and len(oldPools) == 1.
  7. In the STW phase of the second GC, every p.victim in oldPools is set to nil — the previous cache is reclaimed in this GC — and every p.local in allPools is assigned to p.victim and set to nil. Finally, allPools is nil and len(oldPools) == 1.

I drew a picture of this process, which makes it easier to follow:

[Figure: the poolCleanup process]

Note that allPools and oldPools are both slices whose elements are pointers to Pools, and Get/Put operations do not need to go through them. In step 7, if other pools have performed Put operations, allPools will contain multiple elements.

In implementations before Go 1.13, poolCleanup was comparatively “crude”:

func poolCleanup() {
    for i, p := range allPools {
        allPools[i] = nil
        for i := 0; i < int(p.localSize); i++ {
            l := indexLocal(p.local, i)
            l.private = nil
            for j := range l.shared {
                l.shared[j] = nil
            }
            l.shared = nil
        }
        p.local = nil
        p.localSize = 0
    }
    allPools = []*Pool{}
}

It set every pool’s p.local and every poolLocal.shared to nil, dropping all caches at once.

Comparing the two, the new version reclaims at a coarser granularity than before Go 1.13: the actual reclamation timeline is stretched out, so the GC cost per unit of time is reduced.

So the role of p.victim is now clear: it acts as a second-level cache. Objects are moved into it at GC time, and if a Get arrives before the next GC, objects are fetched from p.victim — until the following GC reclaims them.

At the same time, because used objects are Put back into p.local rather than p.victim, the cost of the next GC is reduced to some extent: a cost originally paid in one GC is stretched over two, and amortized down. That is precisely the intent of p.victim.

At the end of “Understanding the design and implementation of Go 1.13 sync.Pool”, the design points of sync.Pool are summarized: lock-free operation, per-P object isolation, atomics in place of locks, behavior isolation via the linked list, and the victim cache to reduce GC overhead. It is very good; recommended reading.

In addition, on optimizing sync.Pool lock contention, Xiaorui’s post “Optimizing lock contention” is recommended.

Summary

This article first introduced what Pool is and what it is for, then showed how to use it — in the standard library and in some third-party libraries — and walked through some of the test cases in pool_test. Finally, it explained the sync.Pool source code in detail.

To wrap up, the key points about sync.Pool:

  1. The key idea is object reuse, avoiding repeated creation and destruction. Temporarily unused objects are cached and handed out directly on the next use, without going through memory allocation again; reusing objects’ memory relieves GC pressure.

  2. sync.Pool is goroutine-safe and trivially easy to use: after setting the New function, just call Get and Put to retrieve and return objects.

  3. Go’s built-in fmt and encoding/json packages both use sync.Pool, and frameworks such as gin and echo use it too.

  4. Don’t make any assumptions about the object Get returns; it is better to “clear” an object when returning it.

  5. The lifetime of objects in a Pool is governed by GC, so a Pool is not suitable as a connection pool, which needs to manage object lifetimes itself.

  6. A Pool cannot be given a size; its size is bounded only by GC.

  7. procPin binds g and P, preventing g from being preempted; while bound, GC cannot clean up the cached objects.

  8. Before the victim mechanism was added, the maximum cache time of an object in sync.Pool was one GC cycle: at GC, all unreferenced objects were cleaned up. With the victim mechanism, the maximum cache time is two GC cycles.

  9. The victim cache is originally a computer-architecture concept, a CPU hardware caching technique; sync.Pool introduces it to reduce GC pressure and improve the hit rate.

  10. Under the hood, sync.Pool implements the double-ended queue with slices plus a linked list, storing the cached objects in the slices.

References

[Changkun's source code analysis] https://changkun.us/archives/2018/09/256/

[go night reading] https://reading.hidevops.io/reading/20180817/2018-08-17-sync-pool-reading.pdf

[night reading video No. 14] https://www.youtube.com/watch?v=jaepwn2PWPk&list=PLe5svQwVF1L5bNxB0smO8gNfAZQYWdIpI

[source code analysis, pseudo sharing] https://juejin.im/post/5d4087276fb9a06adb7fbe4a

[Source code interpretation of golang sync.Pool's object pool] https://zhuanlan.zhihu.com/p/99710992

[Understanding the design and implementation of Go 1.13 sync.Pool] https://zhuanlan.zhihu.com/p/110140126

[advantages and disadvantages, figure] http://cbsheng.github.io/posts/golang Standard library sync.pool Analysis of principle and source code/

[Xiaorui on optimizing lock contention] http://xiaorui.cc/archives/5878

[road to performance optimization, customize multiple cache specifications] https://blog.cyeam.com/golang/2017/02/08/go-optimize-slice-pool

[What are the disadvantages of sync.Pool] https://mp.weixin.qq.com/s?__biz=MzA4ODg0NDkzOA==&mid=2247487149&idx=1&sn=f38f2d72fd7112e19e97d5a2cd304430&source=41#wechat_redirect

[Evolution in 1.12 and 1.13] https://github.com/watermelo/dailyTrans/blob/master/golang/sync_pool_understand.md

[Dong Zerun on the evolution] https://www.jianshu.com/p/2e08332481c5

【noCopy】https://github.com/golang/go/issues/8005#issuecomment-190753527

[Dong Zerun on CPU caches] https://www.jianshu.com/p/dc4b5562aad2

[gomemcache example] https://docs.kilvn.com/The-Golang-Standard-Library-by-Example/chapter16/16.01.html

[bird’s nest 1.13 optimization] https://colobu.com/2019/10/08/how-is-sync-Pool-improved-in-Go-1-13/

【A journey with go】https://medium.com/a-journey-with-go/go-understand-the-design-of-sync-pool-2dde3024e277

[package a counting component] https://www.akshaydeo.com/blog/2017/12/23/How-did-I-improve-latency-by-700-percent-using-syncPool/

[pseudo sharing] http://ifeve.com/falsesharing/
