Original link: https://blog.thinkeridea.com/
Recently, on the Go forum, I came across a question about limiting a string to 20 characters. “hollowaykeanho” gave the relevant answers, and I noticed that the truncation approaches suggested there were not the most efficient ones, so I ran a series of experiments and arrived at efficient ways to truncate strings. This article walks through that process step by step.
Byte slice truncation
This is the first approach “hollowaykeanho” gave, and I suspect it is also the first one many people think of: use Go's built-in slicing syntax to truncate the string:
s := "abcdef"
fmt.Println(s[1:4])
We quickly see that this truncates by byte. For strings made up only of single-byte ASCII characters, there is no better way. Chinese characters, however, usually take more than one byte; in UTF-8 encoding each takes three bytes. The following program produces garbled output:
s := "Go 语言"
fmt.Println(s[1:4])
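To make the byte-versus-character distinction concrete, here is a minimal sketch. The helper name isCleanCut is hypothetical and exists only for this illustration; it reports whether a byte slice of the string is still valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isCleanCut is a hypothetical helper for this sketch. It reports whether
// the byte range s[i:j] is valid UTF-8, i.e. whether the cut did not split
// a multi-byte character.
func isCleanCut(s string, i, j int) bool {
	return utf8.ValidString(s[i:j])
}

func main() {
	s := "Go 语言"

	// len counts bytes; utf8.RuneCountInString counts characters (runes).
	fmt.Println(len(s))                    // 9: three ASCII bytes plus 2 * 3 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 5

	// s[1:4] ends after the first byte of the three-byte character "语".
	fmt.Println(isCleanCut(s, 1, 4)) // false

	// s[1:6] ends on a character boundary.
	fmt.Println(isCleanCut(s, 1, 6)) // true
}
```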
Trump card – type conversion to []rune
The second approach “hollowaykeanho” gave is to convert the string to []rune, slice it with the normal slice syntax, and convert the result back to a string.
s := "Go 语言"
rs := []rune(s)
fmt.Println(string(rs[1:4]))
First of all, we get the correct result, which is the biggest improvement. However, I have always been wary of type conversions and worried about their performance, so I searched the search engines and the major forums for a better answer. This was the approach I found most often; it seemed to be the only one.
I wrote a benchmark to evaluate its performance:
package benchmark
import (
"testing"
)
Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20
func SubStrRunes(s string, length int) string {
if utf8.RuneCountInString(s) > length {
rs := []rune(s)
return string(rs[:length])
}
return s
}
func BenchmarkSubStrRunes(b *testing.B) {
for i := 0; i < b.N; i++ {
SubStrRunes(benchmarkSubString, benchmarkSubStringLength)
}
}
I got some surprising results:
goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRunes-8 872253 1363 ns/op 336 B/op 2 allocs/op
PASS
ok github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark 2.120s
Truncating a 69-character string to its first 20 characters takes about 1.3 microseconds, far more than I expected. The memory allocations come from the type conversions: converting to []rune allocates a new slice, converting back produces a new string, and the conversions themselves require a good deal of computation.
A lifeline – utf8.DecodeRuneInString
I wanted to eliminate the extra computation and memory allocation caused by type conversion. I went through the strings package carefully and found no suitable tool, and then I thought of the utf8 package, which provides utilities for multi-byte characters. To be honest, I was not familiar with it, or rather I had never used it directly. I read through all of its documentation and found that utf8.DecodeRuneInString decodes a single character and returns the number of bytes it occupies. I tried the following experiment:
package benchmark
import (
"testing"
"unicode/utf8"
)
Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20
func SubStrDecodeRuneInString(s string, length int) string {
var size, n int
for i := 0; i < length && n < len(s); i++ {
_, size = utf8.DecodeRuneInString(s[n:])
n += size
}
return s[:n]
}
func BenchmarkSubStrDecodeRuneInString(b *testing.B) {
for i := 0; i < b.N; i++ {
SubStrDecodeRuneInString(benchmarkSubString, benchmarkSubStringLength)
}
}
Running it gave me a pleasant surprise:
goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrDecodeRuneInString-8 10774401 105 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark 1.250s
Compared with the []rune conversion, this is about 13 times faster and performs no memory allocation, which was genuinely exciting. I could not wait to reply to “hollowaykeanho” to tell him that I had found a better way, and I included the relevant benchmarks.
Still a little excited, I browsed through various interesting questions on the forum. While reading the answers to one of them (I forget which one), I was surprised to discover yet another approach.
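To see what utf8.DecodeRuneInString reports at each step, here is a small sketch; the helper name runeSizes is hypothetical. For invalid UTF-8 the function returns utf8.RuneError with a size of 1, so a loop like this always advances on non-empty input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSizes returns the encoded size in bytes of each character in s
// (hypothetical helper for this sketch).
func runeSizes(s string) []int {
	var sizes []int
	for len(s) > 0 {
		_, size := utf8.DecodeRuneInString(s)
		sizes = append(sizes, size)
		s = s[size:] // advance past the character just decoded
	}
	return sizes
}

func main() {
	// Three single-byte ASCII characters followed by two three-byte ones.
	fmt.Println(runeSizes("Go 语言")) // [1 1 1 3 3]
}
```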
Good medicine doesn’t have to be bitter – range string iteration
Many people seem to forget that range iterates over a string by character, not by byte. When iterating over a string with range, each iteration yields the starting byte index of a character and the character itself. I immediately tried writing the following test case using this feature:
package benchmark
import (
"testing"
)
Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20
func SubStrRange(s string, length int) string {
var n, i int
for i = range s {
if n == length {
break
}
n++
}
return s[:i]
}
func BenchmarkSubStrRange(b *testing.B) {
for i := 0; i < b.N; i++ {
SubStrRange(benchmarkSubString, benchmarkSubStringLength)
}
}
I ran it, and as if by magic, it did not disappoint me:
goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRange-8 12354991 91.3 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark 1.233s
That is only a 13% improvement, but the code is simple and easy to understand; it looked like the good medicine I had been looking for.
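As a quick illustration of what range yields for a string, the sketch below collects the byte index at which each character starts; the helper name runeStarts is hypothetical. The indexes jump by the encoded width of each character rather than by one.

```go
package main

import "fmt"

// runeStarts returns the byte index at which each character of s begins
// (hypothetical helper for this sketch).
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	// range yields (byte index, character) pairs.
	for i, r := range "Go 语言" {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()

	fmt.Println(runeStarts("Go 语言")) // [0 1 2 3 6]
}
```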
If you think that is the end of it, it is not. For me, this was only the beginning of the exploration.
The ultimate moment – make your own wheels
After taking the good medicine that is range, I calmed down. I wanted to build a wheel of my own, one that would be easier to use and more efficient.
So I looked closely at the two optimized approaches. Both exist to find the byte index at which the specified number of characters ends. If I could provide such a function, users would get a simple truncation such as s[:strIndex(20)]. Once the idea took root I could not shake it, and I spent two days thinking hard about how to offer an easy-to-use interface.
I then created exutf8.RuneIndexInString and exutf8.RuneIndex, which compute the index at which the specified number of characters ends in a string and in a byte slice, respectively.
I implemented a string truncation test with exutf8.RuneIndexInString:
package benchmark
import (
"testing"
"unicode/utf8"
"github.com/thinkeridea/go-extend/exunicode/exutf8"
)
Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20
func SubStrRuneIndexInString(s string, length int) string {
n, _ := exutf8.RuneIndexInString(s, length)
return s[:n]
}
func BenchmarkSubStrRuneIndexInString(b *testing.B) {
for i := 0; i < b.N; i++ {
SubStrRuneIndexInString(benchmarkSubString, benchmarkSubStringLength)
}
}
I ran it and was very pleased with the result:
goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneIndexInString-8 13546849 82.4 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark 1.213s
That is about 10% faster than range. I was glad to gain another improvement, which showed that the approach works.
It is efficient enough, but not easy enough to use: truncating a string takes two lines of code, and extracting the characters between positions 10 and 20 takes four. That is not an easy-to-use interface. Looking at the sub_string functions in other languages, I decided I should design a similar interface for users.
exutf8.RuneSubString and exutf8.RuneSub are the functions I wrote after careful consideration:
func RuneSubString(s string, start, length int) string
It has three parameters:
- s: the input string.
- start: the position at which to start. If start is non-negative, the returned string starts at position start in s, counting from 0. For example, in the string "abcdef", the character at position 0 is "a", the character at position 2 is "c", and so on. If start is negative, the returned string starts at the start-th character counting back from the end of s. If s has fewer characters than start, an empty string is returned.
- length: the length to extract. If a positive length is provided, the returned string contains at most length characters from start (depending on the length of s). If a negative length is provided, that many characters are omitted from the end of s (if start is negative, it is counted from the end of the string). If start is not within the text, an empty string is returned. If a length of 0 is provided, the returned substring starts at start and runs to the end of the string.
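The parameter rules above can be illustrated with a simplified sketch. The function runeSubString below is hypothetical and is not the go-extend implementation: it uses a []rune conversion for clarity, so it allocates, and clamping a negative start that reaches past the beginning of the string to 0 is an assumption of this sketch rather than documented behavior.

```go
package main

import "fmt"

// runeSubString is a hypothetical illustration of the start/length rules
// described above. It is not the go-extend implementation.
func runeSubString(s string, start, length int) string {
	rs := []rune(s)
	n := len(rs)

	if start < 0 {
		start += n // count back from the end of the string
		if start < 0 {
			start = 0 // assumption: clamp to the beginning
		}
	}
	if start >= n {
		return "" // start is not within the text
	}

	end := n // length == 0: run to the end of the string
	switch {
	case length > 0:
		if length < n-start {
			end = start + length
		}
	case length < 0:
		end = n + length // omit characters from the end
	}
	if end <= start {
		return ""
	}
	return string(rs[start:end])
}

func main() {
	s := "Go 语言"
	fmt.Println(runeSubString(s, 0, 2))  // Go
	fmt.Println(runeSubString(s, 3, 0))  // 语言
	fmt.Println(runeSubString(s, -2, 0)) // 语言
	fmt.Println(runeSubString(s, 1, -1)) // o 语
}
```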
I also gave them aliases. Since people habitually look to the strings package for solutions to this kind of problem, I created exstrings.SubString and exbytes.Sub as aliases that are easier to find.
Finally, I ran one more benchmark to confirm its performance:
package benchmark
import (
"testing"
"github.com/thinkeridea/go-extend/exunicode/exutf8"
)
Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20
func SubStrRuneSubString(s string, length int) string {
return exutf8.RuneSubString(s, 0, length)
}
func BenchmarkSubStrRuneSubString(b *testing.B) {
for i := 0; i < b.N; i++ {
SubStrRuneSubString(benchmarkSubString, benchmarkSubStringLength)
}
}
Running it did not let me down:
goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneSubString-8 13309082 83.9 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark 1.215s
Although it is slightly slower than exutf8.RuneIndexInString, it offers an interface that is easy to use, and I think it is the most practical approach. If you want the last bit of performance, you can still use exutf8.RuneIndexInString, which remains the fastest option.
Summary
When you come across questionable code, even very simple code, it is still worth digging into. Exploring it further is not dull; on the contrary, it pays off in many ways.
Going from the initial []rune type conversion to the final home-made wheel, I not only gained a 16-fold performance improvement, I also learned the utf8 package, deepened my understanding of how range iterates over strings, and added several practical and efficient solutions to the go-extend repository so that more go-extend users can benefit.
go-extend is a repository of practical and efficient methods. If you have good functions or general-purpose, efficient solutions, I look forward to your pull requests. You can also use this repository to speed up feature development and improve performance.
Reprint:
Author of this paper: Qi Yin (thinkeridea)
Link to this article: https://blog.thinkeridea.com/201910/go/efficient_string_truncation.html
Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY 4.0 CN. Please cite the source when reprinting!