Some thoughts on efficient string truncation in Go

Time:2019-12-29

Original link: https://blog.thinkeridea.com/

Recently, on the Go forum, I came across a question about truncating a string to 20 characters. "hollowaykeanho" gave relevant answers, but I found that the truncation approaches offered there were not optimal, so I ran a series of experiments and arrived at efficient ways to truncate strings. This article walks through that process step by step.

Byte slice truncation

This is exactly the first approach "hollowaykeanho" gave, and I think it is also the first one many people come up with. It uses Go's built-in slicing syntax to truncate the string:

s := "abcdef"
fmt.Println(s[1:4])

We soon learn that this truncates by byte. For ASCII-only strings, where every character is a single byte, there is no better way. Chinese characters, however, usually take more than one byte; in UTF-8 encoding they take three bytes each. The following program produces garbled output:

s := "Go 语言"
fmt.Println(s[1:4])
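
To see why the slice above comes out garbled, it helps to compare byte counts with character counts. The following is a minimal sketch, assuming the sample string is "Go 语言"; the helper name describe is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length and the character (rune) count of s,
// and whether the byte slice s[from:to] is still valid UTF-8.
// It is a hypothetical helper used only for this illustration.
func describe(s string, from, to int) (nBytes, nRunes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s[from:to])
}

func main() {
	s := "Go 语言"

	// s[1:4] ends after the first byte of a three-byte character.
	nBytes, nRunes, valid := describe(s, 1, 4)
	fmt.Println(nBytes, nRunes, valid) // 9 5 false

	// s[3:9] covers the two Chinese characters exactly.
	_, _, valid = describe(s, 3, 9)
	fmt.Println(valid) // true
}
```

A byte range is only safe when both ends fall on character boundaries, which is what the rest of this article is about finding cheaply.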

The trump card: type conversion to []rune

The second solution from "hollowaykeanho" is to convert the string to []rune, slice it with the usual slice syntax, and convert the result back to a string.

s := "Go 语言"
rs := []rune(s)
fmt.Println(string(rs[1:4]))

First of all, we get the correct result, which is the biggest step forward. However, I am always cautious about type conversions and worried about their performance, so I looked for answers in search engines and the major forums. This was the solution I found most often, and it seemed to be the only one.

I wrote a benchmark to evaluate its performance:

package benchmark

import (
    "testing"
)

Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20

func SubStrRunes(s string, length int) string {
    if utf8.RuneCountInString(s) > length {
        rs := []rune(s)
        return string(rs[:length])
    }

    return s
}

func BenchmarkSubStrRunes(b *testing.B) {
    for i := 0; i < b.N; i++ {
        SubStrRunes(benchmarkSubString, benchmarkSubStringLength)
    }
}

I got some surprising results:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRunes-8            872253              1363 ns/op             336 B/op          2 allocs/op
PASS
ok      github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark     2.120s

It takes about 1.3 microseconds to truncate a 69-character string to its first 20 characters, which greatly exceeded my expectations. The memory allocations come from the type conversions: converting to []rune allocates a new slice, converting back allocates a new string, and the conversions themselves require a lot of computation.

A lifeline – utf8.DecodeRuneInString

I wanted to get rid of the extra work and memory allocation caused by the type conversions. I combed through the strings package carefully and found no relevant tool. Then I thought of the utf8 package, which provides tools for working with multibyte characters. To be honest, I was not familiar with it, or rather had never used it directly. I read through all of its documentation and found that utf8.DecodeRuneInString decodes a single character and reports the number of bytes it occupies. I tried the following experiment:

package benchmark

import (
    "testing"
    "unicode/utf8"
)

Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20

func SubStrDecodeRuneInString(s string, length int) string {
    var size, n int
    for i := 0; i < length && n < len(s); i++ {
        _, size = utf8.DecodeRuneInString(s[n:])
        n += size
    }

    return s[:n]
}

func BenchmarkSubStrDecodeRuneInString(b *testing.B) {
    for i := 0; i < b.N; i++ {
        SubStrDecodeRuneInString(benchmarkSubString, benchmarkSubStringLength)
    }
}

After running it, I got a pleasant surprise:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrDecodeRuneInString-8     10774401               105 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark     1.250s

Compared with the []rune conversion, this is 13 times faster, which was really exciting. I could not wait to reply to "hollowaykeanho" to tell him that I had found a better way, and to provide the relevant benchmarks.

Still a little excited, I browsed all kinds of interesting questions in the forum. While reading the replies to one question (I forgot which one), I was surprised to find another way of thinking about it.
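
How utf8.DecodeRuneInString steps through a string can be seen in a small sketch. It assumes the sample string "Go 语言"; the helper name runeSizes is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSizes decodes s one character at a time and returns the byte width
// of each character, in order. It is a hypothetical helper for illustration.
func runeSizes(s string) []int {
	var sizes []int
	for n := 0; n < len(s); {
		// The second return value is the number of bytes the character occupies.
		_, size := utf8.DecodeRuneInString(s[n:])
		sizes = append(sizes, size)
		n += size
	}
	return sizes
}

func main() {
	fmt.Println(runeSizes("Go 语言")) // [1 1 1 3 3]
}
```

Summing the first few widths gives the byte index at which that many characters end, and the only work done is decoding those characters; the rest of the string is never touched.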

Good medicine doesn’t have to be bitter – range string iteration

Many people seem to forget that range iterates over a string by character, not by byte. When iterating over a string with range, each step returns the starting byte index of a character and the character itself. I immediately tried to write the following case using this feature:

package benchmark

import (
    "testing"
)

Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20

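func SubStrRange(s string, length int) string {
    var n int
    for i := range s {
        // i is the byte index at which the current character starts.
        if n == length {
            return s[:i]
        }

        n++
    }

    // s has at most length characters, so return it whole.
    return s
}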

func BenchmarkSubStrRange(b *testing.B) {
    for i := 0; i < b.N; i++ {
        SubStrRange(benchmarkSubString, benchmarkSubStringLength)
    }
}

I ran it. It seemed to have boundless magic, and the result did not disappoint me.

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRange-8          12354991                91.3 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark     1.233s

It is only a 13% improvement, but it is simple and easy to understand, which seems to be the good medicine I was looking for.

If you think this is the end, it is not. For me, the exploration had only just begun.
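
The indices that range reports are byte offsets of character starts, not character numbers. A minimal sketch, assuming the sample string "Go 语言" and a hypothetical helper named runeStarts:

```go
package main

import "fmt"

// runeStarts returns the byte index at which each character of s begins,
// as reported by a range loop over the string.
// It is a hypothetical helper for illustration.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	fmt.Println(runeStarts("Go 语言")) // [0 1 2 3 6]
}
```

Note that the last index reported is the start of the final character, not len(s). A truncation helper built on range therefore needs to return the whole string when the loop finishes without reaching the requested count.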

The ultimate moment – building my own wheel

Having taken the good medicine of range, I seemed to calm down. I needed to build a wheel, one that is easier to use and more efficient.

So I looked carefully at the two optimized approaches. Both of them boil down to finding the byte index at which a given number of characters ends. If I could provide such a method, could I offer users a simple truncation such as s[:strIndex(20)]? Once this idea sprouted I could not get rid of it, and I thought hard for two days about how to provide an easy-to-use interface.

Then I created the exutf8.RuneIndexInString and exutf8.RuneIndex methods, which calculate the index at which the specified number of characters ends in a string and in a byte slice, respectively.
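
The idea behind such a helper can be sketched with the standard library alone. The function below is a hypothetical stand-in written for illustration, not the actual exutf8 implementation; its name and its (index, ok) return shape are assumptions modeled on how the library call is used in the benchmark that follows.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeIndexInString returns the byte index just past the first n characters
// of s. ok is false when s has fewer than n characters, in which case the
// returned index is len(s).
// Hypothetical stand-in; the real exutf8 function may behave differently.
func runeIndexInString(s string, n int) (index int, ok bool) {
	for i := 0; i < n; i++ {
		if index >= len(s) {
			return len(s), false
		}
		_, size := utf8.DecodeRuneInString(s[index:])
		index += size
	}
	return index, true
}

func main() {
	s := "Go 语言"
	n, ok := runeIndexInString(s, 4)
	fmt.Println(s[:n], ok) // Go 语 true

	n, ok = runeIndexInString(s, 10)
	fmt.Println(n, ok) // 9 false
}
```

Returning an index rather than a substring leaves the caller free to slice from either side, for example s[:n] for a prefix or s[n:] for the remainder.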

I implemented a string truncation test with exutf8.RuneIndexInString:

package benchmark

import (
    "testing"
    "unicode/utf8"

    "github.com/thinkeridea/go-extend/exunicode/exutf8"
)

Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20

func SubStrRuneIndexInString(s string, length int) string {
    n, _ := exutf8.RuneIndexInString(s, length)
    return s[:n]
}

func BenchmarkSubStrRuneIndexInString(b *testing.B) {
    for i := 0; i < b.N; i++ {
        SubStrRuneIndexInString(benchmarkSubString, benchmarkSubStringLength)
    }
}

I ran it and was very pleased with the results:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneIndexInString-8      13546849                82.4 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark     1.213s

That is a 10% improvement over range. I was glad to get another gain, which proves the approach is effective.

It is efficient enough, but not easy enough to use: I need two lines of code to truncate a string, and if I want the characters from 10 to 20, I need four lines. That is not an easy-to-use interface. Looking at the sub_string methods of other languages, I thought I should design an interface like that for users as well.

exutf8.RuneSubString and exutf8.RuneSub are the methods I wrote after careful consideration:

func RuneSubString(s string, start, length int) string

It has three parameters:

  • s: the input string.
  • start: the position at which to start. If start is non-negative, the returned string begins at that position in the string, counting from 0. For example, in the string "abcdef", the character at position 0 is "a", the character at position 2 is "c", and so on. If start is negative, the returned string begins that many characters from the end of the string. If the string is shorter than start, an empty string is returned.
  • length: the number of characters to take. If a positive length is provided, the returned string contains at most length characters beginning at start (depending on the length of the string). If a negative length is provided, that many characters are omitted from the end of the string (a negative start is likewise counted from the end of the string). If start is not within the text, an empty string is returned. If a length of 0 is provided, the returned substring begins at start and runs to the end of the string.

I also gave them aliases. Since people are more inclined to look in a strings package for solutions to this kind of problem, I created exstrings.SubString and exbytes.Sub as aliases that are easier to find.
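
The parameter rules above can be sketched in plain Go. This is a hypothetical illustration of the described semantics, not the exutf8 source; the names subString and runeOffset are made up, and clamping an out-of-range negative start to 0 is an assumption the description does not spell out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffset returns the byte index at which the n-th character (0-based)
// of s starts, or len(s) if s has n or fewer characters.
func runeOffset(s string, n int) int {
	count := 0
	for i := range s {
		if count == n {
			return i
		}
		count++
	}
	return len(s)
}

// subString is a hypothetical sketch of the start/length rules described
// above. It counts characters first, so it favors clarity over speed.
func subString(s string, start, length int) string {
	total := utf8.RuneCountInString(s)
	if start < 0 {
		start += total // count from the end of the string
		if start < 0 {
			start = 0 // assumption: clamp rather than return ""
		}
	}
	if start > total {
		return ""
	}

	end := total // length == 0: run to the end of the string
	switch {
	case length > 0 && length < total-start:
		end = start + length
	case length < 0:
		end = total + length // omit -length characters from the end
	}
	if end <= start {
		return ""
	}
	return s[runeOffset(s, start):runeOffset(s, end)]
}

func main() {
	s := "Go 语言"
	fmt.Printf("%q\n", subString(s, 0, 4))  // "Go 语"
	fmt.Printf("%q\n", subString(s, -2, 0)) // "语言"
	fmt.Printf("%q\n", subString(s, 1, -1)) // "o 语"
	fmt.Printf("%q\n", subString(s, 10, 2)) // ""
}
```

Working in character positions first and converting to byte offsets only at the end keeps the negative-start and negative-length cases easy to reason about.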

Finally, I need to do another performance test to ensure its performance:

package benchmark

import (
    "testing"

    "github.com/thinkeridea/go-extend/exunicode/exutf8"
)

Var benchmarksubstring = "go language is a kind of programming language developed by Google with strong static type, compiled type, parallel hairstyle and garbage collection function. In order to facilitate search and recognition, it is sometimes called golang. "
var benchmarkSubStringLength = 20

func SubStrRuneSubString(s string, length int) string {
    return exutf8.RuneSubString(s, 0, length)
}

func BenchmarkSubStrRuneSubString(b *testing.B) {
    for i := 0; i < b.N; i++ {
        SubStrRuneSubString(benchmarkSubString, benchmarkSubStringLength)
    }
}

Running it did not let me down:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneSubString-8          13309082                83.9 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark     1.215s

Although it is slightly slower than exutf8.RuneIndexInString, it provides an interface that is easy to use. I think it is the most practical solution. If you are after the last bit of performance, you can still use exutf8.RuneIndexInString, which remains the fastest solution.

Summary

Code that looks questionable is worth studying, even when it is very simple. Exploring it persistently is not tedious, and there is a lot to gain.

Going from the initial []rune type conversion to the final self-made wheel, I not only got a 16-fold performance improvement, but also learned the utf8 package and deepened my understanding of how range traverses strings. Several practical and efficient solutions were added to the go-extend repository along the way, so that more go-extend users can benefit.

go-extend is a repository that collects practical and efficient methods. If you have good functions or general-purpose, efficient solutions, I look forward to your pull requests. You can also use this repository to speed up feature implementation and improve performance.
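
When swapping one truncation approach for another, a quick agreement check on edge cases is cheap insurance. The sketch below compares a []rune-based and a range-based truncation using only the standard language features; the function names are hypothetical, and the comparison assumes valid UTF-8 input, since the []rune round trip rewrites invalid bytes.

```go
package main

import "fmt"

// truncateRunes keeps the first length characters via a []rune conversion.
func truncateRunes(s string, length int) string {
	rs := []rune(s)
	if len(rs) > length {
		return string(rs[:length])
	}
	return s
}

// truncateRange keeps the first length characters by finding the byte
// index at which character number length starts.
func truncateRange(s string, length int) string {
	n := 0
	for i := range s {
		if n == length {
			return s[:i]
		}
		n++
	}
	return s
}

// agree reports whether both approaches return the same result for s
// at every length from 0 up to max.
func agree(s string, max int) bool {
	for length := 0; length <= max; length++ {
		if truncateRunes(s, length) != truncateRange(s, length) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"", "abc", "Go 语言"} {
		fmt.Printf("%q: %v\n", s, agree(s, 6))
	}
}
```

The interesting cases are the boundaries: an empty string, a length of 0, and a length equal to or greater than the number of characters in the string.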

Reprint:

Author of this paper: Qi Yin (thinkeridea)

Link to this article: https://blog.thinkeridea.com/201910/go/efficient_string_truncation.html

Copyright notice: unless otherwise stated, all articles in this blog are licensed under the CC BY 4.0 CN license. Please credit the source when reprinting!
