If the fault chooses you

Time: 2021-6-4

Introduction: Do you think chaos engineering is far removed from you? You do not choose the moment of failure; it chooses you, and all you can do is prepare for it. Chaos engineering has been used at Alibaba for many years, and the open source project ChaosBlade distills Alibaba's experience of fighting faults by injecting faults. To give you a deeper understanding of how it is implemented and how to extend it with fault injection for the components you need, we have prepared a series of detailed technical analyses: architecture, model, protocol, bytecode, plug-ins, and hands-on practice.

By Ye Fei and qionggu

Original title: Technical Analysis of Chaos Engineering Implementation for Java Scenarios, Series (1) | Architecture

Preface

In a distributed system architecture, dependencies between services grow increasingly complex, and it is hard to assess the impact of a single service failure on the whole system. Request chains are long and monitoring and alerting are incomplete, which makes problems hard to detect and locate. At the same time, business and technology iterate quickly, so continuously guaranteeing system stability and high availability is a great challenge.

We know that you do not choose the moment of failure; it chooses you, and all you can do is prepare for it. Chaos engineering is therefore an important part of building a stability system: injecting faults within a controlled scope or environment lets you continuously improve the stability and high availability of the system.

ChaosBlade (GitHub: https://github.com/chaosblade-io/chaosblade) is a chaos engineering tool that follows the principles of chaos engineering experiments. It provides a rich set of fault scenarios and helps distributed systems improve their fault tolerance and recoverability. It can inject low-level faults and is simple to operate, non-invasive, and highly extensible. Within it, the chaosblade-exec-jvm project (GitHub: https://github.com/chaosblade-io/chaosblade-exec-jvm) implements zero-cost fault injection for Java application services. It supports mainstream framework components such as Dubbo, Servlet, and RocketMQ. It can also inject delays and exceptions into any specified class and method, and it supports complex experiment scenarios written as Java or Groovy scripts.

To help you understand how it is implemented and how to extend it with fault injection for new components, this series is divided into six parts: architecture, model, protocol, bytecode, plug-ins, and hands-on practice. This article describes the overall architecture of chaosblade-exec-jvm in detail, so that readers gain a working understanding of the project.

System design

chaosblade-exec-jvm modifies bytecode on top of JVM-Sandbox. Running the chaosblade tool attaches the fault-injection Java agent to the specified application process. The agent follows the chaos experiment model and supports different Java components through a pluggable design, so new plug-ins can easily be added to cover more fault scenarios. Plug-ins follow AOP concepts: advice (Advice), enhancer (Enhancer), and pointcut (PointCut). Combined with the chaos experiment model, they also define the model spec (ModelSpec), experiment target (Target), matcher (Matcher), and attack action (Action).

When chaosblade-exec-jvm is compiled and packaged with make build, the JVM-Sandbox release package is downloaded. After packaging, chaosblade-exec-jvm runs as a JVM-Sandbox module. Once the agent is loaded, the module listens to JVM-Sandbox events to manage the whole chaos experiment, and injects faults through class transformation using Java agent technology.

Principle analysis

In everyday back-end development we often expose APIs to clients, and these APIs inevitably suffer timeouts, exceptions, and similar problems caused by the network, system load, and so on. In Java, HTTP APIs are usually served through servlets. chaosblade-exec-jvm has a servlet plug-in that can inject timeouts, custom exceptions, and other faults. This article analyzes the fault injection process in chaosblade-exec-jvm using the example of injecting a delay into a servlet API.

To delay the servlet API endpoint /topic by 3 seconds, the steps are as follows:

// Attach the agent
blade prepare jvm --pid 888
{"code":200,"success":true,"result":"98e792c9a9a5dfea"}

// Inject the fault
blade create servlet --requestpath=/topic delay --time=3000 --method=post
{"code":200,"success":true,"result":"52a27bafc252beee"}

// Destroy the experiment
blade destroy 52a27bafc252beee

// Detach the agent
blade revoke 98e792c9a9a5dfea
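
The create command above can be read in terms of the experiment model: a target (servlet), matchers that select requests (requestpath, method), and an action (delay) with its own flags (time). The following is a minimal sketch of that split, not the project's code. Every class and field name is hypothetical, and the set of matcher flag names is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the experiment model: target, matchers, action.
// None of these types come from chaosblade-exec-jvm.
final class ExperimentModelSketch {

    /** One experiment: where to inject (target), when (matchers), what (action). */
    static final class Model {
        final String target;
        final String action;
        final Map<String, String> matchers = new LinkedHashMap<>();
        final Map<String, String> actionFlags = new LinkedHashMap<>();

        Model(String target, String action) {
            this.target = target;
            this.action = action;
        }
    }

    /** Assumed spec: flag names that select requests; all other flags configure the action. */
    static final Set<String> SERVLET_MATCHER_FLAGS =
            new HashSet<>(Arrays.asList("requestpath", "method", "querystring"));

    /** Splits raw command-line flags into matcher flags and action flags. */
    static Model build(String target, String action, Map<String, String> flags) {
        Model model = new Model(target, action);
        for (Map.Entry<String, String> flag : flags.entrySet()) {
            if (SERVLET_MATCHER_FLAGS.contains(flag.getKey())) {
                model.matchers.put(flag.getKey(), flag.getValue());
            } else {
                model.actionFlags.put(flag.getKey(), flag.getValue());
            }
        }
        return model;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("requestpath", "/topic");
        flags.put("time", "3000");
        flags.put("method", "post");
        Model model = build("servlet", "delay", flags);
        System.out.println(model.target + " " + model.matchers
                + " " + model.action + " " + model.actionFlags);
    }
}
```

Note that the position of a flag on the command line does not decide its role here: --method=post comes after the action name but is still a matcher, which is why the sketch classifies flags by name.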

  1. Execution process

The following describes the fault injection process in detail by taking servlet request delay as an example.

  1. ChaosBlade issues the attach command, which attaches the sandbox to the application process and activates the Java agent, for example: blade p jvm --pid 888
  2. After the sandbox is attached, it loads the chaosblade-exec-jvm module, which loads plug-ins such as ServletPlugin and DubboPlugin.
  3. The pointcut of the ServletPlugin is matched, and event listeners are registered on the doPost and doGet methods of HttpServlet.
  4. ChaosBlade issues the fault rule command: blade create servlet --requestpath=/topic delay --time=3000 --method=post
  5. The fault rule is matched. For example, with --requestpath=/topic, a request to http://127.0.0.1/topic matches the rule.
  6. Once the rule matches, the fault is triggered, for example a delay or a custom exception.
  7. ChaosBlade issues a command to unload the Java agent, for example: blade revoke 98e792c9a9a5dfea

  2. Code analysis

1) Mount agent

blade p jvm --pid 888

After this command is issued, the agent is attached to the target Java process and the SandboxModule onLoad() event fires. It initializes the PluginLifecycleListener, which manages the plug-in life cycle. The SandboxModule onActive() event also fires, loading some plug-ins and their corresponding ModelSpec.

// Agent load event
public void onLoad() throws Throwable {
  ManagerFactory.getListenerManager().setPluginLifecycleListener(this);
  dispatchService.load();
  ManagerFactory.load();
}
// ChaosBlade module activation
public void onActive() throws Throwable {
  loadPlugins();
}

2) Load plugin

When a plug-in is loaded, an event listener is created with SandboxEnhancerFactory.createBeforeEventListener(plugin). It listens for the events of interest, such as BeforeAdvice and AfterAdvice. The implementation is as follows:

// Load a plug-in
public void add(PluginBean plugin) {
    PointCut pointCut = plugin.getPointCut();
    if (pointCut == null) {
        return;
    }
    String enhancerName = plugin.getEnhancer().getClass().getSimpleName();
    // Create a filter that matches the pointcut
    Filter filter = SandboxEnhancerFactory.createFilter(enhancerName, pointCut);

    // Register the event listener
    int watcherId = moduleEventWatcher.watch(filter, SandboxEnhancerFactory.createBeforeEventListener(plugin), Event.Type.BEFORE);
    watchIds.put(PluginUtil.getIdentifier(plugin), watcherId);
}

3) Match pointcut

After the SandboxModule onActive() event triggers plug-in loading, SandboxEnhancerFactory creates a Filter, which filters internally using the ClassMatcher and MethodMatcher of the PointCut.

public static Filter createFilter(final String enhancerClassName, final PointCut pointCut) {
  return new Filter() {
    @Override
    public boolean doClassFilter(int access, String javaClassName, String superClassTypeJavaClassName,
                                 String[] interfaceTypeJavaClassNameArray,
                                 String[] annotationTypeJavaClassNameArray
                                ) {
      // Match with the ClassMatcher
      ClassMatcher classMatcher = pointCut.getClassMatcher();
      ...
    }

    @Override
    public boolean doMethodFilter(int access, String javaMethodName,
                                  String[] parameterTypeJavaClassNameArray,
                                  String[] throwsTypeJavaClassNameArray,
                                  String[] annotationTypeJavaClassNameArray) {
      // Match with the MethodMatcher
      MethodMatcher methodMatcher = pointCut.getMethodMatcher();
      ...
    }
  };
}
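
As an illustration of how a pointcut combines the two matchers, here is a minimal, self-contained sketch. The interfaces below are simplified stand-ins with hypothetical signatures, not the project's ClassMatcher and MethodMatcher, and the servlet pointcut is assumed from the execution process described earlier (doGet and doPost on HttpServlet).

```java
// Hypothetical sketch of a pointcut made of a class matcher and a method matcher.
// The real chaosblade-exec-jvm interfaces take more parameters than shown here.
final class PointCutSketch {

    interface ClassMatcher {
        boolean matches(String className);
    }

    interface MethodMatcher {
        boolean matches(String methodName);
    }

    static final class PointCut {
        final ClassMatcher classMatcher;
        final MethodMatcher methodMatcher;

        PointCut(ClassMatcher classMatcher, MethodMatcher methodMatcher) {
            this.classMatcher = classMatcher;
            this.methodMatcher = methodMatcher;
        }

        /** A method is instrumented only if both matchers accept it. */
        boolean matches(String className, String methodName) {
            return classMatcher.matches(className) && methodMatcher.matches(methodName);
        }
    }

    /** Assumed servlet pointcut: doGet and doPost on javax.servlet.http.HttpServlet. */
    static PointCut servletPointCut() {
        return new PointCut(
                className -> "javax.servlet.http.HttpServlet".equals(className),
                methodName -> "doGet".equals(methodName) || "doPost".equals(methodName));
    }

    public static void main(String[] args) {
        PointCut pointCut = servletPointCut();
        System.out.println(pointCut.matches("javax.servlet.http.HttpServlet", "doPost")); // prints "true"
        System.out.println(pointCut.matches("javax.servlet.http.HttpServlet", "doPut"));  // prints "false"
    }
}
```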

4) Trigger enhancer

Once a plug-in is loaded and the target application matches the filter, the EventListener can be triggered. However, the state of chaosblade-exec-jvm is managed by the StatusManager, so no fault is triggered at this point.

For example, BeforeEventListener calls the beforeAdvice() method of BeforeEnhancer, which returns early at the ManagerFactory.getStatusManager().expExists(targetName) check. The implementation is as follows:

public void beforeAdvice(String targetName, 
                         ClassLoader classLoader, 
                         String className,
                         Object object,
                         Method method, 
                         Object[] methodArguments) throws Exception {

  // Check whether an experiment exists for this target
  if (!ManagerFactory.getStatusManager().expExists(targetName)) {
    return;
  }
  EnhancerModel model = doBeforeAdvice(classLoader, className, object, method, methodArguments);
  if (model == null) {
    return;
  }
  ...
  // Injection stage
  Injector.inject(model);
}
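
The effect of this check can be sketched in isolation: the advice runs on every intercepted call, but it does nothing until an experiment is active for the target. The class below is a hypothetical stand-in for the status check, not the project's StatusManager, and the counter merely replaces the real injection step.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the advice always runs, but injects only for active targets.
final class StatusGateSketch {

    private final Set<String> activeTargets = ConcurrentHashMap.newKeySet();
    final AtomicInteger injections = new AtomicInteger();

    void activate(String target) {
        activeTargets.add(target);
    }

    void deactivate(String target) {
        activeTargets.remove(target);
    }

    /** Called on every intercepted method; returns true if a fault was injected. */
    boolean beforeAdvice(String target) {
        if (!activeTargets.contains(target)) {
            return false; // no experiment for this target: the call proceeds untouched
        }
        injections.incrementAndGet(); // stand-in for the real injection step
        return true;
    }

    public static void main(String[] args) {
        StatusGateSketch gate = new StatusGateSketch();
        System.out.println(gate.beforeAdvice("servlet")); // prints "false"
        gate.activate("servlet");
        System.out.println(gate.beforeAdvice("servlet")); // prints "true"
    }
}
```

Keeping the instrumentation in place and toggling only the state makes creating and destroying an experiment cheap, since no class needs to be re-transformed.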

5) Create the chaos experiment

blade create servlet --requestpath=/topic delay --time=3000

After this command is issued, the SandboxModule method annotated with @Http("/create") is triggered and dispatches the event to com.alibaba.chaosblade.exec.service.handler.CreateHandler for handling.

Once the required parameters (uid, target, action, and model) have been determined, handleInjection is called. handleInjection registers the experiment with the StatusManager. If the plug-in's ModelSpec is a PreCreateInjectionModelHandler, some preprocessing is performed. If the action is a DirectlyInjectionAction, the fault is injected directly without going through an Enhancer, as with the JVM OOM fault.

public Response handle(Request request) {
  if (unloaded) {
    return Response.ofFailure(Code.ILLEGAL_STATE, "the agent is uninstalling");
  }
  // Check the suid, which is the context ID of an experiment
  String suid = request.getParam("suid");
  ...
  return handleInjection(suid, model, modelSpec);
}

private Response handleInjection(String suid, Model model, ModelSpec modelSpec) {
  RegisterResult result = this.statusManager.registerExp(suid, model);
  if (result.isSuccess()) {
    // Run the pre-create handler, if there is one
    applyPreInjectionModelHandler(suid, modelSpec, model);
  }
}

ModelSpec

  • com.alibaba.chaosblade.exec.common.model.handler.PreCreateInjectionModelHandler: pre-creation
  • com.alibaba.chaosblade.exec.common.model.handler.PreDestroyInjectionModelHandler: pre-destruction

private void applyPreInjectionModelHandler(String suid, ModelSpec modelSpec, Model model)
  throws ExperimentException {
  if (modelSpec instanceof PreCreateInjectionModelHandler) {
    ((PreCreateInjectionModelHandler)modelSpec).preCreate(suid, model);
  }
}
...

DirectlyInjectionAction

If the ModelSpec is a PreCreateInjectionModelHandler and the ActionSpec is a DirectlyInjectionAction, the fault is injected directly, as with the JVM OOM fault. If the ActionSpec is not a DirectlyInjectionAction, the plug-in is loaded.

private Response handleInjection(String suid, Model model, ModelSpec modelSpec) {
    // Register the experiment
    RegisterResult result = this.statusManager.registerExp(suid, model);
    if (result.isSuccess()) {
        // Handle the injection
        try {
            applyPreInjectionModelHandler(suid, modelSpec, model);
        } catch (ExperimentException ex) {
            this.statusManager.removeExp(suid);
            return Response.ofFailure(Response.Code.SERVER_ERROR, ex.getMessage());
        }

        return Response.ofSuccess(model.toString());
    }
    return Response.ofFailure(Response.Code.DUPLICATE_INJECTION, "the experiment exists");
}

The uid is returned after successful registration. If the fault was injected directly at this stage, or if the custom Enhancer's advice returns null, the fault is not triggered through the Injector class.
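
The register-then-roll-back pattern in handleInjection can be sketched on its own. The code below is a simplified illustration with hypothetical names: it keys experiments by suid only and returns plain strings instead of a response type, whereas the real duplicate check may compare models.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of registering an experiment and rolling back on failure.
final class RegisterSketch {

    /** Stand-in for a pre-create hook that may fail. */
    interface PreCreateHandler {
        void preCreate(String suid) throws Exception;
    }

    private final Map<String, String> experiments = new ConcurrentHashMap<>();

    /** Returns "ok", "duplicate", or "error: <message>". */
    String create(String suid, String model, PreCreateHandler handler) {
        // putIfAbsent makes the duplicate check and the registration one atomic step
        if (experiments.putIfAbsent(suid, model) != null) {
            return "duplicate";
        }
        try {
            handler.preCreate(suid);
        } catch (Exception ex) {
            experiments.remove(suid); // roll back so a failed experiment leaves no state
            return "error: " + ex.getMessage();
        }
        return "ok";
    }

    boolean exists(String suid) {
        return experiments.containsKey(suid);
    }

    public static void main(String[] args) {
        RegisterSketch registry = new RegisterSketch();
        System.out.println(registry.create("52a27bafc252beee", "servlet delay", suid -> { })); // prints "ok"
        System.out.println(registry.create("52a27bafc252beee", "servlet delay", suid -> { })); // prints "duplicate"
    }
}
```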

6) Inject the fault

A fault is injected by having an ActionExecutor execute it. There are two ways to get there:

  • Injection through the Injector;
  • Direct injection by a DirectlyInjectionAction. Direct injection skips the Injector call stage; the JVM OOM fault is an example.

A DirectlyInjectionAction goes straight to the ActionExecutor execution stage, without the Enhancer wrapping and matching parameters. With Injector-based injection, the StatusManager has already registered the experiment, so when the event fires again the ManagerFactory.getStatusManager().expExists(targetName) check no longer returns early and execution continues. In a custom Enhancer you can read the parameters and types of the original method, and even invoke other methods of the original type by reflection. Reflection is risky; generally, you only read member variables or call getter methods to obtain the parameters that are matched in the inject stage.

7) Wrap the matching parameters

A custom Enhancer, such as ServletEnhancer, wraps the values that need to be matched against the command line in a MatcherModel, then wraps that in an EnhancerModel and returns it. For example, with --requestpath=/index, requestpath is compared with the request URI, while --querystring="name=xx" uses custom matching. After the parameters are wrapped, they are evaluated in the Injector.inject(model) stage.

public EnhancerModel doBeforeAdvice(ClassLoader classLoader, String className, Object object,
                                    Method method, Object[] methodArguments)
        throws Exception {
    Object request = methodArguments[0];
    String requestURI = ReflectUtil.invokeMethod(request, ServletConstant.GET_REQUEST_URI, new Object[]{}, false);
    String requestMethod = ReflectUtil.invokeMethod(request, ServletConstant.GET_METHOD, new Object[]{}, false);

    MatcherModel matcherModel = new MatcherModel();
    matcherModel.add(ServletConstant.METHOD_KEY, requestMethod);
    matcherModel.add(ServletConstant.REQUEST_PATH_KEY, requestURI);

    Map<String, Object> queryString = getQueryString(requestMethod, request);

    EnhancerModel enhancerModel = new EnhancerModel(classLoader, matcherModel);
    // Custom parameter matching
    enhancerModel.addCustomMatcher(ServletConstant.QUERY_STRING_KEY, queryString, ServletParamsMatcher.getInstance());
    return enhancerModel;
}
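
To show what the wrapped values are later used for, here is a minimal sketch of comparing them with the flags of an experiment. It is not the project's compare method: the class and method are hypothetical, and the case-insensitive comparison is an assumption made so that --method=post matches a POST request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of matching request values against experiment flags.
final class MatcherCompareSketch {

    /**
     * Returns true only if every matcher flag set on the experiment is present
     * in the request values with an equal value (ignoring case, by assumption).
     */
    static boolean compare(Map<String, String> experimentFlags, Map<String, String> requestValues) {
        for (Map.Entry<String, String> flag : experimentFlags.entrySet()) {
            String actual = requestValues.get(flag.getKey());
            if (actual == null || !actual.equalsIgnoreCase(flag.getValue())) {
                return false;
            }
        }
        return true; // flags that were not set on the command line match anything
    }

    public static void main(String[] args) {
        Map<String, String> flags = new HashMap<>();
        flags.put("requestpath", "/topic");
        flags.put("method", "post");

        Map<String, String> request = new HashMap<>();
        request.put("requestpath", "/topic");
        request.put("method", "POST");

        System.out.println(compare(flags, request)); // prints "true"
        request.put("requestpath", "/other");
        System.out.println(compare(flags, request)); // prints "false"
    }
}
```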

8) Check the preconditions

In the inject stage, the experiments registered with the StatusManager are fetched first, and compare(model, enhancerModel) compares the parameters; if the comparison fails, the experiment is skipped. limitAndIncrease(statusMetric) then evaluates --effect-count and --effect-percent to control how many times, and for what percentage of calls, the fault takes effect.

public static void inject(EnhancerModel enhancerModel) throws InterruptProcessException {
    String target = enhancerModel.getTarget();
    List<StatusMetric> statusMetrics = ManagerFactory.getStatusManager().getExpByTarget(
        target);
    for (StatusMetric statusMetric : statusMetrics) {
      Model model = statusMetric.getModel();
      // Match against the command-line parameters
      if (!compare(model, enhancerModel)) {
        continue;
      }
      // Count this hit and check whether the effect count has been reached
      boolean pass = limitAndIncrease(statusMetric);
      if (!pass) {
        break;
      }
      enhancerModel.merge(model);
      ModelSpec modelSpec = ManagerFactory.getModelSpecManager().getModelSpec(target);
      ActionSpec actionSpec = modelSpec.getActionSpec(model.getActionName());
      // The ActionExecutor executes the fault
      actionSpec.getActionExecutor().run(enhancerModel);
      break;
    }
}
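
The body of limitAndIncrease is not shown in the article. One way such a limit could work is sketched below, under stated assumptions: a percentage check that samples calls at random, followed by an atomic counter that caps the total number of injections. The class and its fields are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of limiting a fault by count and by percentage of calls.
final class EffectLimitSketch {

    private final long effectCount;  // maximum injections; 0 or less means unlimited
    private final int effectPercent; // 0..100, share of matching calls to affect
    private final AtomicLong hits = new AtomicLong();

    EffectLimitSketch(long effectCount, int effectPercent) {
        this.effectCount = effectCount;
        this.effectPercent = effectPercent;
    }

    /** Returns true if this call should be affected, and counts it if so. */
    boolean limitAndIncrease() {
        // Percentage first, so calls that are sampled out do not use up the count
        if (effectPercent < 100 && ThreadLocalRandom.current().nextInt(100) >= effectPercent) {
            return false;
        }
        if (effectCount > 0) {
            // incrementAndGet keeps check-and-count atomic under concurrent requests
            return hits.incrementAndGet() <= effectCount;
        }
        return true;
    }

    public static void main(String[] args) {
        EffectLimitSketch limit = new EffectLimitSketch(2, 100);
        System.out.println(limit.limitAndIncrease()); // prints "true"
        System.out.println(limit.limitAndIncrease()); // prints "true"
        System.out.println(limit.limitAndIncrease()); // prints "false"
    }
}
```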

9) Trigger the fault

Whether triggered through the Injector or directly by a DirectlyInjectionAction, the custom ActionExecutor is finally called to produce the fault, for example DefaultDelayExecutor. At this point the fault is in effect.

public void run(EnhancerModel enhancerModel) throws Exception {
    String time = enhancerModel.getActionFlag(timeFlagSpec.getName());
    Integer sleepTimeInMillis = Integer.valueOf(time);
    // Trigger the delay
    TimeUnit.MILLISECONDS.sleep(sleepTimeInMillis);
}
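
The executor above passes the flag straight to Integer.valueOf, which throws NumberFormatException on malformed input. As a hedged variation, the sketch below validates the flag before sleeping. The class and method names are hypothetical, and the validation rules (non-negative whole milliseconds) are an assumption.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a delay action that validates its time flag first.
final class DelaySketch {

    /** Parses the time flag as a non-negative number of milliseconds. */
    static long parseMillis(String timeFlag) {
        if (timeFlag == null) {
            throw new IllegalArgumentException("missing time flag");
        }
        long millis;
        try {
            millis = Long.parseLong(timeFlag.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("time must be a whole number of milliseconds: " + timeFlag, e);
        }
        if (millis < 0) {
            throw new IllegalArgumentException("time must not be negative: " + timeFlag);
        }
        return millis;
    }

    /** Blocks the calling thread, which is what delays the intercepted request. */
    static void delay(String timeFlag) throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(parseMillis(timeFlag));
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parseMillis("3000")); // prints "3000"
        delay("50");
    }
}
```

Because the sleep happens on the request thread inside the intercepted method, the caller observes the delay as added latency on that request.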

  3. Destroy the experiment

blade destroy 52a27bafc252beee

After this command is issued, the SandboxModule method annotated with @Http("/destroy") is triggered, and the event is dispatched to com.alibaba.chaosblade.exec.service.handler.DestroyHandler, which deregisters the experiment. When the Enhancer is triggered again, the StatusManager finds that the experiment has been destroyed, and no fault is injected.

// The StatusManager checks the experiment state
if (!ManagerFactory.getStatusManager().expExists(targetName)) {
    return;
}

If the plug-in's ModelSpec is a PreDestroyInjectionModelHandler and the ActionSpec is a DirectlyInjectionAction, the fault injection is stopped. If the ActionSpec is not a DirectlyInjectionAction, the plug-in is unloaded.

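// DestroyHandler deregisters the experiment
public Response handle(Request request) {
    String uid = request.getParam("suid");
    ...
    // Without a uid, fall back to target and action
    if (StringUtil.isBlank(uid)) {
        if (StringUtil.isBlank(target) || StringUtil.isBlank(action)) {
            // handle() returns a Response, so report the missing parameters as a failure
            // (the exact Response.Code constant used here is an assumption)
            return Response.ofFailure(Response.Code.ILLEGAL_PARAMETER, "uid, or target and action, is required");
        }
        // Deregister by target and action
        return destroy(target, action);
    }
    return destroy(uid);
}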
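
The two destroy paths, by uid or by target and action, can be sketched with a small in-memory registry. This is an illustration only: the types are hypothetical, and returning a count of removed experiments is a choice made here, not something the article shows.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of destroying experiments by uid, or by target and action.
final class DestroySketch {

    static final class Experiment {
        final String target;
        final String action;

        Experiment(String target, String action) {
            this.target = target;
            this.action = action;
        }
    }

    private final Map<String, Experiment> experiments = new ConcurrentHashMap<>();

    void register(String uid, String target, String action) {
        experiments.put(uid, new Experiment(target, action));
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    /** Returns the number of experiments removed. */
    int destroy(String uid, String target, String action) {
        if (!isBlank(uid)) {
            return experiments.remove(uid) != null ? 1 : 0;
        }
        if (isBlank(target) || isBlank(action)) {
            throw new IllegalArgumentException("uid, or target and action, is required");
        }
        int removed = 0;
        Iterator<Experiment> it = experiments.values().iterator();
        while (it.hasNext()) {
            Experiment experiment = it.next();
            if (experiment.target.equals(target) && experiment.action.equals(action)) {
                it.remove(); // removes the entry from the backing map
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        DestroySketch registry = new DestroySketch();
        registry.register("52a27bafc252beee", "servlet", "delay");
        System.out.println(registry.destroy("52a27bafc252beee", null, null)); // prints "1"
    }
}
```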

  4. Unload the agent

blade revoke 98e792c9a9a5dfea

After this command is issued, the SandboxModule onUnload() event is triggered; the plug-ins are unloaded and all resources created by the agent are released.

public void onUnload() throws Throwable {
    dispatchService.unload();
    ManagerFactory.unload();
    watchIds.clear();
}
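
The watchIds map cleared here is the one filled when plug-ins were loaded. A minimal sketch of that bookkeeping follows. All names are hypothetical, and the release step is an assumption: the article only shows the map being cleared, so whether each watcher is also deleted individually is not confirmed by the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of tracking one watcher ID per plug-in and releasing them on unload.
final class WatcherRegistrySketch {

    private final AtomicInteger nextId = new AtomicInteger();
    private final Map<String, Integer> watchIds = new ConcurrentHashMap<>();
    final List<Integer> released = new ArrayList<>(); // stand-in for deleting real watchers

    /** Registers a plug-in once and returns its watcher ID. */
    int add(String pluginId) {
        return watchIds.computeIfAbsent(pluginId, key -> nextId.incrementAndGet());
    }

    /** Releases every watcher (an assumed step) and forgets the IDs. */
    void unloadAll() {
        released.addAll(watchIds.values());
        watchIds.clear();
    }

    int size() {
        return watchIds.size();
    }

    public static void main(String[] args) {
        WatcherRegistrySketch registry = new WatcherRegistrySketch();
        registry.add("servlet");
        registry.add("dubbo");
        registry.unloadAll();
        System.out.println(registry.size()); // prints "0"
    }
}
```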

Summary

Taking the servlet scenario as an example, this article has described the architecture and implementation of the chaosblade-exec-jvm project in detail. Later articles will cover the model, protocol, bytecode, plug-ins, and hands-on practice, so that readers can quickly write plug-ins of their own.

As a chaos engineering experiment tool, ChaosBlade is not only simple to use but also supports a rich set of experiment scenarios and is easy to extend with new ones:

  • Basic resources: CPU, memory, network, disk, process, and other experiment scenarios;
  • Java applications: databases, caches, messaging, the JVM itself, microservices, and so on; you can also target any class and method to inject a variety of complex experiment scenarios;
  • C++ applications: injecting delays into any specified method or line of code, tampering with variables and return values, and other experiment scenarios;
  • Docker containers: killing containers, plus CPU, memory, network, disk, process, and other experiment scenarios inside containers;
  • Kubernetes platform: CPU, memory, network, disk, and process scenarios on nodes; pod network scenarios and scenarios on pods themselves, such as killing pods and pod I/O exceptions; and container scenarios such as the Docker container scenarios above;
  • Cloud resources: scenarios such as Alibaba Cloud ECS downtime.

You are welcome to join the ChaosBlade community to discuss chaos engineering practice, or any ideas and problems that come up while using ChaosBlade.

About the author

Ye Fei: GitHub @tiny-x, open source community enthusiast and ChaosBlade committer, takes part in building the ChaosBlade chaos engineering ecosystem.
Qionggu: GitHub @xcaspar, ChaosBlade project lead and chaos engineering evangelist.

This article is the original content of Alibaba cloud and cannot be reproduced without permission.