Design and Implementation of Golang Failpoint

Time: 2019-07-31

Author: Long Heng

A large, complex system usually consists of multiple modules or components. Testing such a system requires simulating faults in each subsystem. These faults must be integrated into the automated test system non-intrusively and activated automatically during automated tests. By simulating faults and checking whether the final results match expectations, we can judge the correctness and stability of the system. If testing a distributed system required a colleague to plug and unplug network cables to simulate network anomalies, or testing a storage system required destroying hard disks to simulate disk damage, the high cost would make testing a disaster, and tests that need fine-grained control would be hard to simulate. So we need automated methods for deterministic fault testing.

The Failpoint project is a Golang implementation of FreeBSD failpoints. It allows errors or abnormal behavior to be injected into code and activated dynamically through environment variables or code. Failpoint can be used to exercise error handling in a variety of complex systems and thereby improve their fault tolerance, correctness, and stability, for example:

  • In microservices, a service responds with random latency or becomes unavailable.
  • In a storage system, disk IO latency increases, IO throughput is too low, or persisting data to disk takes too long.
  • In a scheduling system, a hot spot appears or a scheduling instruction fails.
  • In a recharge (top-up) system, a third party repeatedly calls the recharge-success callback interface.
  • In game development, the player network is unstable, frames are dropped, latency is excessive, or abnormal input arrives (such as requests from cheat tools), and the system must still work correctly.
  • ……

Why reinvent the wheel?

The etcd team developed gofail in 2016, which greatly simplified error injection and contributed a great deal to the Golang ecosystem. We adopted gofail for error injection testing in 2018, but ran into some functional and usability problems, so we decided to build a better “wheel”.

How to use gofail

  • Use comments to inject a failpoint into the program:

    // gofail: var FailIfImportedChunk int
    // if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
    // rc.checkpointsWg.Done()
    // rc.checkpointsWg.Wait()
    // panic("forcing failure due to FailIfImportedChunk")
    // }
    // goto RETURN1
    
    // gofail: RETURN1:
    
    // gofail: var FailIfStatusBecomes int
    // if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
    // rc.checkpointsWg.Done()
    // rc.checkpointsWg.Wait()
    // panic("forcing failure due to FailIfStatusBecomes")
    // }
    // goto RETURN2
    
    // gofail: RETURN2:
  • The code after conversion with gofail enable:

    if vFailIfImportedChunk, __fpErr := __fp_FailIfImportedChunk.Acquire(); __fpErr == nil { defer __fp_FailIfImportedChunk.Release(); FailIfImportedChunk, __fpTypeOK := vFailIfImportedChunk.(int); if !__fpTypeOK { goto __badTypeFailIfImportedChunk} 
        if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
            rc.checkpointsWg.Done()
            rc.checkpointsWg.Wait()
            panic("forcing failure due to FailIfImportedChunk")
        }
        goto RETURN1; __badTypeFailIfImportedChunk: __fp_FailIfImportedChunk.BadType(vFailIfImportedChunk, "int"); };
    
    /* gofail-label */ RETURN1:
    
    if vFailIfStatusBecomes, __fpErr := __fp_FailIfStatusBecomes.Acquire(); __fpErr == nil { defer __fp_FailIfStatusBecomes.Release(); FailIfStatusBecomes, __fpTypeOK := vFailIfStatusBecomes.(int); if !__fpTypeOK { goto __badTypeFailIfStatusBecomes} 
        if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
            rc.checkpointsWg.Done()
            rc.checkpointsWg.Wait()
            panic("forcing failure due to FailIfStatusBecomes")
        }
        goto RETURN2; __badTypeFailIfStatusBecomes: __fp_FailIfStatusBecomes.BadType(vFailIfStatusBecomes, "int"); };
    
    /* gofail-label */ RETURN2:

Problems encountered in the use of gofail

  • Failpoints are injected through comments, so the code is error-prone and gets no compiler checking.
  • A failpoint can only take effect globally. Large projects introduce parallel tests to shorten automated test time, and different parallel tasks then interfere with each other.
  • Some hack code is needed to avoid unnecessary error logs. In the code above, for example, you must write // goto RETURN2 and // gofail: RETURN2: with a blank line between them. The reason becomes clear from the generated code.

What kind of failpoint do we want to design?

What should the ideal failpoint implementation look like?

Ideally, a failpoint should be defined in code and be non-intrusive to business logic. In a language that supports macros (such as Rust), we can define a fail_point macro to define failpoints:

fail_point!("transport_on_send_store", |sid| if let Some(sid) = sid {
    let sid: u64 = sid.parse().unwrap();
    if sid == store_id {
        self.raft_client.wl().addrs.remove(&store_id);
    }
})

But in Golang we face some problems:

  • Golang does not support macros as a language feature.
  • Golang does not support compiler plug-ins.
  • Golang build tags cannot provide an elegant implementation either (go build -tags="enable-failpoint-a").

Failpoint Design Criteria

  • Define failpoints using Golang code, not comments or other forms.
  • Failpoint code should not add any overhead:

    • It must not affect normal functional logic or intrude on functional code.
    • Injecting failpoint code must not cause performance regression.
    • Failpoint code must not appear in the final release binary.
  • Failpoint code must be easy to read and write, and must be checked by the compiler.
  • The generated code must be readable.
  • In the generated code, the line numbers of functional logic code must not change (for easy debugging).
  • Support parallel testing by using context.Context to control whether a specific failpoint is activated.

How can Golang implement a macro-like failpoint?

What is the essence of a macro? Going back to first principles, we find that a failpoint can be implemented in Golang through AST rewriting. The principle is as follows:

For any Golang source file, we can parse the file into a syntax tree, traverse the whole tree to find all failpoint injection points, and then rewrite the tree into the desired logic.

Relevant concepts

Failpoint

A failpoint is a code snippet that is executed only when the corresponding failpoint name is activated. Once it is disabled through failpoint.Disable("failpoint-name-for-demo"), the corresponding failpoint never triggers. No failpoint code snippet is compiled into the final binary. For example, we can simulate file system permission control:

func saveTo(path string) error {
    failpoint.Inject("mock-permission-deny", func() error {
         // It's OK to access outer scope variable
         return fmt.Errorf("mock permission deny: %s", path)
    })
    // Normal save logic is omitted here.
    return nil
}

Marker function

A Marker function marks a part of the code that needs to be rewritten in the AST rewriting stage. It has the following properties:

  • It signals the Rewriter to rewrite the call as an equivalent IF statement.

    • The parameters of the Marker function are the parameters needed during rewriting.
    • A Marker function is an empty function, so the compiler can inline it and eliminate the call.
    • The failpoint injected through a Marker function is a closure. If the closure accesses variables in the outer scope, closure syntax allows capturing them, so no compilation error occurs. The converted code is an IF statement, which can also access outer-scope variables without any problem. The closure capture therefore exists only to keep the syntax legal and adds no extra overhead in the end.
  • It is simple and easy to read and write.
  • It brings in compiler checking: if the arguments of a Marker function are incorrect, the program does not compile, which ensures the correctness of the converted code.
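The properties above can be illustrated with a toy marker. This is a sketch under stated assumptions rather than the failpoint package itself: Value and Inject are local stand-ins, and the closure parameter is typed as interface{} here so that closures of different shapes are accepted.

```go
package main

import "fmt"

// Value stands in for failpoint.Value in this sketch.
type Value interface{}

// Inject is a marker: its body is empty, so calling it does nothing at
// run time. A rewriter would only use the call as a signpost showing
// where to generate code.
func Inject(fpname string, fpbody interface{}) {}

func saveTo(path string) error {
	Inject("mock-permission-deny", func(_ Value) error {
		// Capturing the outer variable path is legal closure syntax,
		// so this compiles even though the closure is never called.
		return fmt.Errorf("mock permission deny: %s", path)
	})
	return nil
}

func main() {
	// Before rewriting, the marker is a no-op and saveTo succeeds.
	fmt.Println(saveTo("/tmp/demo"))
}
```

Because the marker is an ordinary function call, a misspelled function name or a wrong argument count is rejected by the compiler, which is the checking that comment-based injection lacks.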

List of Marker functions currently supported:

  • func Inject(fpname string, fpblock func(val Value)) {}
  • func InjectContext(ctx context.Context, fpname string, fpblock func(val Value)) {}
  • func Break(label ...string) {}
  • func Goto(label string) {}
  • func Continue(label ...string) {}
  • func Fallthrough() {}
  • func Return(results ...interface{}) {}
  • func Label(label string) {}

How to use failpoint for injection in your program?

The simplest way is to use failpoint.Inject to inject a failpoint at the call site. The failpoint.Inject call is eventually rewritten as an IF statement, in which mock-io-error determines whether the failpoint triggers, and the logic in failpoint-closure is executed once it triggers. For example, we inject an IO error into a function that reads files:

failpoint.Inject("mock-io-error", func(val failpoint.Value) error {
    return fmt.Errorf("mock error: %v", val.(string))
})

The final converted code is as follows:

if ok, val := failpoint.Eval(_curpkg_("mock-io-error")); ok {
    return fmt.Errorf("mock error: %v", val.(string))
}

Use failpoint.Enable("mock-io-error", "return(\"disk error\")") to activate the failpoint in the program. If you need to assign a custom value to failpoint.Value, pass in a failpoint expression, such as return("disk error") here. For more syntax, refer to the failpoint grammar.
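To make the relationship between activation and the rewritten IF statement concrete, here is a toy model. It is not the failpoint runtime: the registry type and its enable, disable, and eval methods are hypothetical, and the value is stored directly instead of being parsed from an expression such as return("disk error").

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a toy stand-in for a failpoint runtime. It maps an
// activated failpoint name to the value its closure should receive.
type registry struct {
	mu     sync.RWMutex
	values map[string]interface{}
}

var fps = &registry{values: map[string]interface{}{}}

// enable activates a failpoint and stores the value it should yield.
func (r *registry) enable(name string, val interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.values[name] = val
}

// disable deactivates a failpoint.
func (r *registry) disable(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.values, name)
}

// eval reports whether the failpoint is active and, if so, its value.
func (r *registry) eval(name string) (bool, interface{}) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	val, ok := r.values[name]
	return ok, val
}

// readFile has the shape of the code after rewriting.
func readFile() error {
	if ok, val := fps.eval("mock-io-error"); ok {
		return fmt.Errorf("mock error: %v", val.(string))
	}
	return nil
}

func main() {
	fmt.Println(readFile()) // failpoint inactive
	fps.enable("mock-io-error", "disk error")
	fmt.Println(readFile()) // failpoint active
	fps.disable("mock-io-error")
	fmt.Println(readFile()) // failpoint inactive again
}
```

When the failpoint is inactive, the only cost at the injection point is the lookup in eval, and the body of the IF statement is skipped.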

The closure can be nil. For example, with failpoint.Enable("mock-delay", "sleep(1000)"), the goal is to sleep for one second at the injection point without executing any additional logic.

failpoint.Inject("mock-delay", nil)
failpoint.Inject("mock-delay", func(){})

Ultimately, the following code will be generated:

failpoint.Eval(_curpkg_("mock-delay"))
failpoint.Eval(_curpkg_("mock-delay"))

If we only want to trigger a panic in the failpoint and do not need to receive failpoint.Value, we can ignore the value in the parameters of the closure. For example:

failpoint.Inject("mock-panic", func(_ failpoint.Value) error {
    panic("mock panic")
})
// OR
failpoint.Inject("mock-panic", func() error {
    panic("mock panic")
})

Best practices are as follows:

failpoint.Enable("mock-panic", "panic")
failpoint.Inject("mock-panic", nil)
// GENERATED CODE
failpoint.Eval(_curpkg_("mock-panic"))

To prevent different test tasks from interfering with each other in parallel tests, we can attach a callback function to context.Context to finely control the activation and deactivation of failpoints:

failpoint.InjectContext(ctx, "failpoint-name", func(val failpoint.Value) {
    fmt.Println("unit-test", val)
})

The converted code:

if ok, val := failpoint.EvalContext(ctx, _curpkg_("failpoint-name")); ok {
    fmt.Println("unit-test", val)
}

An example using failpoint.WithHook:

func (s *dmlSuite) TestCRUDParallel() {
    sctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        return ctx.Value(fpname) != nil // Determine by ctx key
    })
    insertFailpoints := map[string]struct{}{
        "insert-record-fp": {},
        "insert-index-fp": {},
        "on-duplicate-fp": {},
    }
    ictx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        _, found := insertFailpoints[fpname] // Only enables some failpoints.
        return found
    })
    deleteFailpoints := map[string]struct{}{
        "tikv-is-busy-fp": {},
        "fetch-tso-timeout": {},
    }
    dctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        _, found := deleteFailpoints[fpname] // Only disables some failpoints.
        return !found
    })
    // other DML parallel test cases.
    s.RunParallel(buildSelectTests(sctx))
    s.RunParallel(buildInsertTests(ictx))
    s.RunParallel(buildDeleteTests(dctx))
}

If we use failpoints in a loop, we may need other Marker functions:

failpoint.Label("outer")
for i := 0; i < 100; i++ {
    inner:
        for j := 0; j < 1000; j++ {
            switch rand.Intn(j) + i {
            case j / 5:
                failpoint.Break()
            case j / 7:
                failpoint.Continue("outer")
            case j / 9:
                failpoint.Fallthrough()
            case j / 10:
                failpoint.Goto("outer")
            default:
                failpoint.Inject("failpoint-name", func(val failpoint.Value) {
                    fmt.Println("unit-test", val.(int))
                    if val == j/11 {
                        failpoint.Break("inner")
                    } else {
                        failpoint.Goto("outer")
                    }
                })
            }
        }
}

The above code will eventually be rewritten as follows:

outer:
    for i := 0; i < 100; i++ {
    inner:
        for j := 0; j < 1000; j++ {
            switch rand.Intn(j) + i {
            case j / 5:
                break
            case j / 7:
                continue outer
            case j / 9:
                fallthrough
            case j / 10:
                goto outer
            default:
                if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
                    fmt.Println("unit-test", val.(int))
                    if val == j/11 {
                        break inner
                    } else {
                        goto outer
                    }
                }
            }
        }
    }

Why do we need Marker functions for label, break, continue, and fallthrough? Why not use the keywords directly?

  • In Golang, a program with an unused variable or label fails to compile.

    label1: // compiler error: unused label1
        failpoint.Inject("failpoint-name", func(val failpoint.Value) {
            if val.(int) == 1000 {
                goto label1 // illegal to use goto here
            }
            fmt.Println("unit-test", val)
        })
    
  • break and continue can only be used in a loop context, and they cannot reach an enclosing loop from inside a closure.

Some complex injection examples

Example 1: Injecting a failpoint into the INITIAL statement and CONDITIONAL expression of an IF statement

if a, b := func() {
    failpoint.Inject("failpoint-name", func(val failpoint.Value) {
        fmt.Println("unit-test", val)
    })
}, func() int { return rand.Intn(200) }(); b > func() int {
    failpoint.Inject("failpoint-name", func(val failpoint.Value) int {
        return val.(int)
    })
    return rand.Intn(3000)
}() && b < func() int {
    failpoint.Inject("failpoint-name-2", func(val failpoint.Value) int {
        return rand.Intn(val.(int))
    })
    return rand.Intn(6000)
}() {
    a()
    failpoint.Inject("failpoint-name-3", func(val failpoint.Value) {
        fmt.Println("unit-test", val)
    })
}

The above code will eventually be rewritten as:

if a, b := func() {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
        fmt.Println("unit-test", val)
    }
}, func() int { return rand.Intn(200) }(); b > func() int {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
        return val.(int)
    }
    return rand.Intn(3000)
}() && b < func() int {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name-2")); ok {
        return rand.Intn(val.(int))
    }
    return rand.Intn(6000)
}() {
    a()
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name-3")); ok {
        fmt.Println("unit-test", val)
    }
}

Example 2: Injecting a failpoint into a CASE of a SELECT statement to dynamically control whether the case is blocked

func (s *StoreService) ExecuteStoreTask() {
    select {
    case <-func() chan *StoreTask {
        failpoint.Inject("priority-fp", func(_ failpoint.Value) chan *StoreTask {
            return make(chan *StoreTask)
        })
        return s.priorityHighCh
    }():
        fmt.Println("execute high priority task")

    case <- s.priorityNormalCh:
        fmt.Println("execute normal priority task")

    case <- s.priorityLowCh:
        fmt.Println("execute low priority task")
    }
}

The above code will eventually be rewritten as:

func (s *StoreService) ExecuteStoreTask() {
    select {
    case <-func() chan *StoreTask {
        if ok, _ := failpoint.Eval(_curpkg_("priority-fp")); ok {
            return make(chan *StoreTask)
        }
        return s.priorityHighCh
    }():
        fmt.Println("execute high priority task")

    case <- s.priorityNormalCh:
        fmt.Println("execute normal priority task")

    case <- s.priorityLowCh:
        fmt.Println("execute low priority task")
    }
}

Example 3: Dynamically injecting a SWITCH CASE

switch opType := operator.Type(); {
case opType == "balance-leader":
    fmt.Println("create balance leader steps")

case opType == "balance-region":
    fmt.Println("create balance region steps")

case opType == "scatter-region":
    fmt.Println("create scatter region steps")

case func() bool {
    failpoint.Inject("dynamic-op-type", func(val failpoint.Value) bool {
        return strings.Contains(val.(string), opType)
    })
    return false
}():
    fmt.Println("do something")

default:
    panic("unsupported operator type")
}

The above code will eventually be rewritten as follows:

switch opType := operator.Type(); {
case opType == "balance-leader":
    fmt.Println("create balance leader steps")

case opType == "balance-region":
    fmt.Println("create balance region steps")

case opType == "scatter-region":
    fmt.Println("create scatter region steps")

case func() bool {
    if ok, val := failpoint.Eval(_curpkg_("dynamic-op-type")); ok {
        return strings.Contains(val.(string), opType)
    }
    return false
}():
    fmt.Println("do something")

default:
    panic("unsupported operator type")
}

Beyond the examples above, failpoints can be injected in more complex places:

  • Loop INITIAL statements, CONDITIONAL expressions, and POST statements
  • FOR RANGE statement
  • SWITCH INITIAL statement
  • Construction and Index of Slice
  • Dynamic Initialization of Structures
  • ……

In fact, failpoint can be injected anywhere you can call a function, so use your imagination.

Failpoint naming best practices

The generated code above automatically wraps each failpoint-name in a _curpkg_ call. Because failpoint names are global, the package name is included in the final name to avoid naming conflicts. _curpkg_ works like a macro that expands to the package name at run time. You do not need to implement _curpkg_ in your own application: it is generated and added automatically by failpoint-ctl enable and removed by failpoint-ctl disable.

package ddl // ddl’s parent package is `github.com/pingcap/tidb`

func demo() {
    // _curpkg_("the-original-failpoint-name") will be expanded as `github.com/pingcap/tidb/ddl/the-original-failpoint-name`
    if ok, val := failpoint.Eval(_curpkg_("the-original-failpoint-name")); ok {...}
}
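One way such a helper could obtain the package path at run time is through reflection on a type declared in the package. The following is a sketch under that assumption. The names bindingType and curpkg are hypothetical, and the code that failpoint-ctl actually generates may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// bindingType exists only so that reflection can report the import path
// of the package in which it is declared.
type bindingType struct{}

// pkgPath is resolved once, when the package is initialized.
var pkgPath = reflect.TypeOf(bindingType{}).PkgPath()

// curpkg prefixes a failpoint name with the path of the current package,
// so that equal names in different packages do not collide.
func curpkg(name string) string {
	return pkgPath + "/" + name
}

func main() {
	// The prefix is whatever import path the toolchain assigns to this package.
	fmt.Println(curpkg("the-original-failpoint-name"))
}
```

Declaring the helper inside each package is what lets the same short failpoint name be reused across packages without conflict.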

Since all failpoints under the same package are in the same namespace, careful naming is required to avoid naming conflicts. Here are some recommended rules to improve this situation:

  • Make sure the name is unique in the package.
  • Use a self-explanatory name.

    • Failpoint can be activated by environment variables:
     GO_FAILPOINTS="github.com/pingcap/tidb/ddl/renameTableErr=return(100);github.com/pingcap/tidb/planner/core/illegalPushDown=return(true);github.com/pingcap/pd/server/schedulers/balanceLeaderFailed=return(true)"
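The format of the variable above is a semicolon-separated list of name=expression pairs. A minimal sketch of splitting such a string follows. parseFailpoints is a hypothetical helper, not the library's parser, and it assumes that no expression contains a semicolon.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFailpoints splits a GO_FAILPOINTS-style string into a map from
// the full failpoint name to its expression.
func parseFailpoints(env string) (map[string]string, error) {
	result := make(map[string]string)
	for _, item := range strings.Split(env, ";") {
		if item == "" {
			continue // tolerate empty input and trailing semicolons
		}
		// Split on the first '=' only, so the expression may contain '='.
		parts := strings.SplitN(item, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("malformed failpoint entry %q", item)
		}
		result[parts[0]] = parts[1]
	}
	return result, nil
}

func main() {
	env := "github.com/pingcap/tidb/ddl/renameTableErr=return(100);" +
		"github.com/pingcap/tidb/planner/core/illegalPushDown=return(true)"
	fps, err := parseFailpoints(env)
	if err != nil {
		panic(err)
	}
	for name, expr := range fps {
		fmt.Printf("%s -> %s\n", name, expr)
	}
}
```

Since every name carries its full package path, the entries stay unambiguous even when several packages define failpoints with the same short name.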

Acknowledgments

  • Thanks to gofail for providing the initial implementation and the inspiration that let us iterate on failpoint while standing on the shoulders of giants.
  • Thanks to FreeBSD for defining the grammar specification.

Finally, we welcome you to discuss with us to improve the Failpoint project.