Exploring an in function for slices in Go

Time:2021-1-14

A while ago I saw a question on Zhihu: why doesn't Golang have an in like Python's? I searched around and found that many people have the same question.

Let’s talk about this topic today.

in is a very common feature, and some languages call it contains. The spelling differs between languages, but almost all of them have it. Unfortunately, Go does not: it provides neither a Python-like in operator nor a standard library function such as PHP's in_array.

Go's philosophy is that less is more. I suspect the Go team considers this feature too trivial to be worth providing.

Why is it trivial? And if you want to implement it yourself, how should you go about it?

I can think of three implementations: traversal, binary search via the sort package, and key lookup in a map.

The source code for this article is on my GitHub, at poloxue/gotin.

Traversal

Traversal is the easiest approach to implement.

An example:


func InIntSlice(haystack []int, needle int) bool {
	for _, e := range haystack {
		if e == needle {
			return true
		}
	}

	return false
}

The example above checks whether a given int exists in a []int. Simple, isn't it? It also shows why I call this trivial to implement.

The example has one flaw: it supports only a single type. To get a general in like the one in interpreted languages, we need reflection.

The code is as follows:
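As a quick check of the traversal version, here is a minimal, self-contained sketch that repeats InIntSlice from the example and calls it on a few inputs. The sample values are chosen for illustration only.

```go
package main

import "fmt"

// InIntSlice reports whether needle occurs in haystack (linear scan).
func InIntSlice(haystack []int, needle int) bool {
	for _, e := range haystack {
		if e == needle {
			return true
		}
	}

	return false
}

func main() {
	haystack := []int{4, 2, 5, 1, 6} // illustrative data

	fmt.Println(InIntSlice(haystack, 5)) // true
	fmt.Println(InIntSlice(haystack, 7)) // false
	fmt.Println(InIntSlice(nil, 7))      // false: ranging over a nil slice runs zero iterations
}
```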


func In(haystack interface{}, needle interface{}) (bool, error) {
	sVal := reflect.ValueOf(haystack)
	kind := sVal.Kind()
	// Only slices and arrays are supported.
	if kind == reflect.Slice || kind == reflect.Array {
		for i := 0; i < sVal.Len(); i++ {
			// Compare each element with needle as interface values.
			if sVal.Index(i).Interface() == needle {
				return true, nil
			}
		}

		return false, nil
	}

	return false, ErrUnSupportHaystack
}

To be more general, both parameters of In, haystack and needle, are of type interface{}.

Declaring the parameters as interface{} has two main benefits.

First, because haystack is an interface{}, In supports arrays as well as slices. Inside the function, the kind of haystack is checked through reflection: slices and arrays are accepted, and any other type returns an error. Adding support for another type, such as map, would be easy, but I don't recommend it, because _, ok := m[k] already gives you in for a map.

Second, because haystack is an interface{}, a []interface{} is also accepted, and needle is an interface{} too. That gives us the same effect as in an interpreted language.

What does that mean? An example:


gotin.In([]interface{}{1, "two", 3}, "two")

Here haystack is []interface{}{1, "two", 3} and needle is an interface{} holding the value "two". Just as in an interpreted language, the elements can be of any type. With this we can use In quite freely.

However, the implementation of In contains this piece of code:
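To try the mixed-type call without the gotin package, the following sketch repeats In in a single file. The definition of ErrUnSupportHaystack is not shown in this article, so the errors.New value below is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// ErrUnSupportHaystack is assumed here; its real definition is not shown in the article.
var ErrUnSupportHaystack = errors.New("haystack must be a slice or an array")

// In is the reflection-based function from the article.
func In(haystack interface{}, needle interface{}) (bool, error) {
	sVal := reflect.ValueOf(haystack)
	kind := sVal.Kind()
	if kind == reflect.Slice || kind == reflect.Array {
		for i := 0; i < sVal.Len(); i++ {
			if sVal.Index(i).Interface() == needle {
				return true, nil
			}
		}

		return false, nil
	}

	return false, ErrUnSupportHaystack
}

func main() {
	fmt.Println(In([]interface{}{1, "two", 3}, "two")) // true <nil>
	fmt.Println(In([3]int{1, 2, 3}, 2))                // true <nil>: arrays work too
	fmt.Println(In([]int{1, 2, 3}, int64(2)))          // false <nil>: int and int64 are different types

	_, err := In(map[int]bool{1: true}, 1)
	fmt.Println(err == ErrUnSupportHaystack) // true: maps are rejected
}
```

Note the third call: the comparison is between interface values, so a needle of a different type never matches, even when the numeric value is the same.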


if sVal.Index(i).Interface() == needle {
 ...
}

Not every Go type can be compared with ==. If an element and needle both hold the same uncomparable type, such as a slice or a map, the comparison panics at run time.
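The panic can be observed directly. The sketch below uses a hypothetical helper, safeEqual, which is not part of gotin: it compares two interface values with == and uses recover to turn the run-time panic into a flag.

```go
package main

import "fmt"

// safeEqual is a hypothetical helper for illustration: it compares two
// interface values with == and reports whether the comparison panicked.
func safeEqual(a, b interface{}) (equal, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			equal, panicked = false, true
		}
	}()

	return a == b, false
}

func main() {
	fmt.Println(safeEqual(1, 1))                    // true false
	fmt.Println(safeEqual("two", 2))                // false false: different dynamic types
	fmt.Println(safeEqual([]int{1}, []int{1}))      // false true: []int is not comparable
	fmt.Println(safeEqual(map[string]int{}, "two")) // false false: types differ, so no panic
}
```

The panic only occurs when both sides hold the same uncomparable type; when the dynamic types differ, == simply yields false.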

Binary search

Traversal has one drawback: if the array or slice holds a lot of data, say 1,000,000 elements, then in the worst case we have to visit all 1,000,000 of them. The time complexity is O(n).

Is there any way to reduce the number of iterations?

Binary search comes to mind naturally; its time complexity is O(log n). But it has a precondition: the sequence must be ordered.

So the first problem to solve is ordering the sequence. Go's standard library already provides this in the sort package.

The sample code is as follows:


s := []int{4, 2, 5, 1, 6}
sort.Ints(s) // sorts in place and returns nothing; s is now [1 2 4 5 6]

For []int, the function we use is sort.Ints. The sort package provides similar functions for other slice types; for example, []string can be sorted with sort.Strings.

After sorting, we can run a binary search. Fortunately, Go provides this too: for []int, the function is sort.SearchInts.

To introduce this function briefly, let's look at its definition first:


func SearchInts(a []int, x int) int

The parameters are easy to understand: search for x in slice a, which must be sorted in ascending order. The return value deserves attention, because we rely on it later to confirm whether the element exists. It is the position of x in the slice; if x is not present, it is the position at which x would have to be inserted to keep the slice ordered.

For example, the sequence is as follows:

1 2 6 8 9 11

If x is 6, it is found at index 2. If x is 7, the element does not exist; inserted into the sequence, it would go between 6 and 8, at index 3, so the return value is 3.

Let's verify with code:


fmt.Println(sort.SearchInts([]int{1, 2, 6, 8, 9, 11}, 6)) // 2
fmt.Println(sort.SearchInts([]int{1, 2, 6, 8, 9, 11}, 7)) // 3

To decide whether the element is in the sequence, we only need to check whether the value at the returned position equals the value we searched for.

There is one more case. If the insertion point is at the end of the sequence, for example when the value is 12, the returned position is 6, the length of the sequence. Reading the element at position 6 directly would be out of bounds. What to do? It is enough to check whether the returned index is less than the slice length; if it is not, the element is not in the slice.

The complete implementation code is as follows:


func SortInIntSlice(haystack []int, needle int) bool {
	sort.Ints(haystack)

	index := sort.SearchInts(haystack, needle)
	return index < len(haystack) && haystack[index] == needle
}
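The boundary cases discussed above can be checked with a small, self-contained sketch. It repeats SortInIntSlice and reuses the sequence 1 2 6 8 9 11 from the earlier example.

```go
package main

import (
	"fmt"
	"sort"
)

// SortInIntSlice is the function from the article: sort, then binary search.
func SortInIntSlice(haystack []int, needle int) bool {
	sort.Ints(haystack)

	index := sort.SearchInts(haystack, needle)
	return index < len(haystack) && haystack[index] == needle
}

func main() {
	a := []int{1, 2, 6, 8, 9, 11}

	fmt.Println(sort.SearchInts(a, 12)) // 6, equal to len(a), so a[6] would be out of range
	fmt.Println(sort.SearchInts(a, 0))  // 0: smaller than every element

	fmt.Println(SortInIntSlice(a, 12))      // false
	fmt.Println(SortInIntSlice(a, 0))       // false
	fmt.Println(SortInIntSlice(a, 11))      // true
	fmt.Println(SortInIntSlice([]int{}, 1)) // false: index 0 is not less than length 0
}
```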

But there is still a problem. For unordered data, sorting on every query is not cost-effective. Ideally we sort only once, so let's modify the code slightly.


func InIntSliceSortedFunc(haystack []int) func(int) bool {
	sort.Ints(haystack)

	return func(needle int) bool {
		index := sort.SearchInts(haystack, needle)
		return index < len(haystack) && haystack[index] == needle
	}
}

In this implementation, calling InIntSliceSortedFunc sorts the haystack slice once and returns a function that can be used many times.

A use case:


in := gotin.InIntSliceSortedFunc(haystack)

for i := 0; i < maxNeedle; i++ {
	if in(i) {
		fmt.Printf("%d is in %v", i, haystack)
	}
}
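One side effect is worth keeping in mind: sort.Ints sorts in place, so InIntSliceSortedFunc reorders the slice that the caller passed in. If that matters, a variant can sort a private copy instead. The sketch below is such a variant; the name inIntSliceSortedCopyFunc is hypothetical and not part of gotin.

```go
package main

import (
	"fmt"
	"sort"
)

// inIntSliceSortedCopyFunc is a hypothetical variant of InIntSliceSortedFunc
// that sorts a private copy, leaving the caller's slice order untouched.
func inIntSliceSortedCopyFunc(haystack []int) func(int) bool {
	sorted := make([]int, len(haystack))
	copy(sorted, haystack)
	sort.Ints(sorted)

	return func(needle int) bool {
		index := sort.SearchInts(sorted, needle)
		return index < len(sorted) && sorted[index] == needle
	}
}

func main() {
	haystack := []int{4, 2, 5, 1, 6}
	in := inIntSliceSortedCopyFunc(haystack)

	fmt.Println(in(5), in(3)) // true false
	fmt.Println(haystack)     // [4 2 5 1 6]: original order preserved
}
```

The trade-off is one extra allocation of the same size as the haystack.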

What are the disadvantages of binary search?

The main one that comes to mind is that the elements must be sortable, as int, string and float values are. For structs, slices, arrays, maps and other types it is less convenient. It can still be done, but we need some extension to sort by a specified criterion, such as one field of the struct.

That concludes the binary search implementation of in.

Map key

This section describes the map key approach. Its complexity is O(1): however large the data set, query performance stays essentially the same. It relies on Go's map type, a hash map, so we can check directly whether a key exists. You are probably familiar with the algorithm: the key is hashed straight to an index position.
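As a rough sketch of such an extension, the standard library's sort.Slice and sort.Search accept custom comparison functions. The User type and the function name inUsersByIDFunc below are hypothetical and only serve the illustration; they are not part of gotin.

```go
package main

import (
	"fmt"
	"sort"
)

// User is a hypothetical struct used only for this illustration.
type User struct {
	ID   int
	Name string
}

// inUsersByIDFunc sorts a copy of users by ID and returns a lookup function
// that binary searches on that field.
func inUsersByIDFunc(users []User) func(id int) bool {
	sorted := make([]User, len(users))
	copy(sorted, users)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	return func(id int) bool {
		// sort.Search returns the smallest index whose ID is >= id,
		// or len(sorted) if there is no such index.
		i := sort.Search(len(sorted), func(i int) bool { return sorted[i].ID >= id })
		return i < len(sorted) && sorted[i].ID == id
	}
}

func main() {
	users := []User{{ID: 3, Name: "c"}, {ID: 1, Name: "a"}, {ID: 2, Name: "b"}}
	in := inUsersByIDFunc(users)

	fmt.Println(in(2), in(4)) // true false
}
```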

We often use this method.


_, ok := m[k]
if ok {
	fmt.Println("Found")
}

So how does it combine with in? An example makes it clear.

Suppose we have a variable of type []int, as follows:

s := []int{1, 2, 3, 4}

To use a map's ability to check whether an element exists, we can convert s into a map[int]struct{}.


m := map[int]struct{}{
	1: struct{}{},
	2: struct{}{},
	3: struct{}{},
	4: struct{}{},
}

If you want to check whether an element exists, you only need to write it as follows:


k := 4
if _, ok := m[k]; ok {
	fmt.Printf("%d is found\n", k)
}

Simple, isn't it?

As for why struct{} is used here, you can read an article I wrote earlier about how to implement a set in Go.

Following this idea, the function is implemented as follows:
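In short, and as a hedged summary rather than a substitute for that article: an empty struct value occupies no memory, so it is a common choice when only the keys of a map matter. The sketch below uses a hypothetical helper, sizes, to print the sizes reported by unsafe.Sizeof.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sizes is a hypothetical helper: it returns the size in bytes of an
// empty struct value and of a bool value.
func sizes() (emptyStruct, boolean uintptr) {
	var e struct{}
	var b bool

	return unsafe.Sizeof(e), unsafe.Sizeof(b)
}

func main() {
	e, b := sizes()
	fmt.Println(e, b) // 0 1
}
```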


func MapKeyInIntSlice(haystack []int, needle int) bool {
	set := make(map[int]struct{})

	for _, e := range haystack {
		set[e] = struct{}{}
	}

	_, ok := set[needle]
	return ok
}

It is not hard to implement, but it has the same problem as binary search: the data must be preprocessed first, converting the slice into a map. If the data is the same on every call, we can modify the implementation slightly.


func InIntSliceMapKeyFunc(haystack []int) func(int) bool {
	set := make(map[int]struct{})

	for _, e := range haystack {
		set[e] = struct{}{}
	}

	return func(needle int) bool {
		_, ok := set[needle]
		return ok
	}
}

For the same data, it will return an in function that can be used multiple times. A use case is as follows:


in := gotin.InIntSliceMapKeyFunc(haystack)

for i := 0; i < maxNeedle; i++ {
	if in(i) {
		fmt.Printf("%d is in %v", i, haystack)
	}
}

Compared with the previous two algorithms, this one has the most efficient lookups and is well suited to large data sets. The performance tests below show the effect.

Performance

Having introduced all the approaches, let's compare their performance. The test source code is in the gotin_test.go file.

The benchmarks examine how the algorithms perform at different data sizes. This article uses three sample sizes: 10, 1000 and 1000000.

For the convenience of testing, a function is defined to generate haystack and needle sample data.

The code is as follows:


func randomHaystackAndNeedle(size int) ([]int, int) {
	haystack := make([]int, size)

	for i := 0; i < size; i++ {
		haystack[i] = rand.Int()
	}

	return haystack, rand.Int()
}

The input parameter is size. The function uses rand.Int() to randomly generate a haystack of length size and one needle. Each benchmark case calls this function to generate its data.

Here is an example:


func BenchmarkIn_10(b *testing.B) {
	haystack, needle := randomHaystackAndNeedle(10)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, _ = gotin.In(haystack, needle)
	}
}

First, randomHaystackAndNeedle generates a random slice with 10 elements. Because the time spent generating sample data should not count toward the benchmark, we reset the timer with b.ResetTimer().

Second, benchmark functions are named by the rule Benchmark + function name + sample size. BenchmarkIn_10 in this case benchmarks the In function with a sample size of 10. To benchmark InIntSlice with 1000 elements, the function is named BenchmarkInIntSlice_1000.

Let's start testing! First, my laptop's configuration: Mac Pro 15, 16 GB of memory, 512 GB SSD, 4-core 8-thread CPU.
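Following that naming rule, a benchmark for InIntSlice with 1000 elements could look like the sketch below. In the repository such a function would live in a _test.go file and be run by go test; here testing.Benchmark is used so that the sketch runs as a plain program, and InIntSlice is repeated locally instead of being imported from gotin. Both choices are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

// InIntSlice is the traversal function from the article, repeated locally.
func InIntSlice(haystack []int, needle int) bool {
	for _, e := range haystack {
		if e == needle {
			return true
		}
	}

	return false
}

// randomHaystackAndNeedle generates size random ints and one random needle.
func randomHaystackAndNeedle(size int) ([]int, int) {
	haystack := make([]int, size)

	for i := 0; i < size; i++ {
		haystack[i] = rand.Int()
	}

	return haystack, rand.Int()
}

// BenchmarkInIntSlice_1000 is named Benchmark + function name + sample size.
func BenchmarkInIntSlice_1000(b *testing.B) {
	haystack, needle := randomHaystackAndNeedle(1000)

	b.ResetTimer() // exclude data generation from the measurement
	for i := 0; i < b.N; i++ {
		_ = InIntSlice(haystack, needle)
	}
}

func main() {
	// testing.Init registers the testing flags, which is needed when
	// calling testing.Benchmark outside of "go test".
	testing.Init()

	result := testing.Benchmark(BenchmarkInIntSlice_1000)
	fmt.Println(result)
}
```

The printed numbers depend on the machine, so no expected output is given.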

Test the performance of all functions when the amount of data is 10.

$ go test -run=none -bench=10$ -benchmem

This matches all benchmark functions whose names end in 10.

Test results:

goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_10-8 3000000 501 ns/op 112 B/op 11 allocs/op
BenchmarkInIntSlice_10-8 200000000 7.47 ns/op 0 B/op 0 allocs/op
BenchmarkInIntSliceSortedFunc_10-8 100000000 22.3 ns/op 0 B/op 0 allocs/op
BenchmarkSortInIntSlice_10-8 10000000 162 ns/op 32 B/op 1 allocs/op
BenchmarkInIntSliceMapKeyFunc_10-8 100000000 17.7 ns/op 0 B/op 0 allocs/op
BenchmarkMapKeyInIntSlice_10-8 3000000 513 ns/op 163 B/op 1 allocs/op
PASS
ok github.com/poloxue/gotin 13.162s

The best performer is neither SortedFunc nor MapKeyFunc but the simplest one, single-type traversal, at an average of 7.47 ns/op. The other two also do well, at 22.3 ns/op and 17.7 ns/op respectively.

The worst performers are In, SortInIntSlice and MapKeyInIntSlice, at an average of 501 ns/op, 162 ns/op and 513 ns/op respectively.

Test the performance of all functions with 1000 data.

$ go test -run=none -bench=1000$ -benchmem

Test results:


goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_1000-8 30000 45074 ns/op 8032 B/op 1001 allocs/op
BenchmarkInIntSlice_1000-8 5000000 313 ns/op 0 B/op 0 allocs/op
BenchmarkInIntSliceSortedFunc_1000-8 30000000 44.0 ns/op 0 B/op 0 allocs/op
BenchmarkSortInIntSlice_1000-8 20000 65401 ns/op 32 B/op 1 allocs/op
BenchmarkInIntSliceMapKeyFunc_1000-8 100000000 17.6 ns/op 0 B/op 0 allocs/op
BenchmarkMapKeyInIntSlice_1000-8 20000 82761 ns/op 47798 B/op 65 allocs/op
PASS
ok github.com/poloxue/gotin 11.312s

The top three are still InIntSlice, InIntSliceSortedFunc and InIntSliceMapKeyFunc, but this time the order has changed. MapKeyFunc is the best at 17.6 ns/op, essentially unchanged from the run with 10 elements, which again confirms the earlier claim that its lookup cost does not depend on the data size.

Finally, the run with 1000000 elements:


$ go test -run=none -bench=1000000$ -benchmem

The results are as follows

goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_1000000-8 30 46099678 ns/op 8000098 B/op 1000001 allocs/op
BenchmarkInIntSlice_1000000-8 3000 424623 ns/op 0 B/op 0 allocs/op
BenchmarkInIntSliceSortedFunc_1000000-8 20000000 72.8 ns/op 0 B/op 0 allocs/op
BenchmarkSortInIntSlice_1000000-8 10 138873420 ns/op 32 B/op 1 allocs/op
BenchmarkInIntSliceMapKeyFunc_1000000-8 100000000 16.5 ns/op 0 B/op 0 allocs/op
BenchmarkMapKeyInIntSlice_1000000-8 10 156215889 ns/op 49824225 B/op 38313 allocs/op
PASS
ok github.com/poloxue/gotin 15.178s

MapKeyFunc is still the best at 16.5 ns per operation, followed by SortedFunc at 72.8 ns, while InIntSlice grows linearly with the data size. In general, unless you have special performance requirements together with a large amount of data, single-type traversal already performs very well.

The results also show that the general In function implemented with reflection performs a large number of memory allocations on every call: its convenience comes at the cost of performance.

Summary

This article started from the question of why Go has no Python-like in. In my opinion, on one hand it is simple to implement and so not really necessary. On the other hand, different scenarios call for different approaches, chosen according to the actual situation rather than fixed in advance.

We then introduced three ways to implement in and analyzed their advantages and disadvantages. The performance tests give a general picture of which approach suits which scenario, but the analysis is not very detailed overall; interested readers can dig further.
