How to do a perfect abtest?

Time:2020-9-2

This article was first published on the WeChat official account of vivo Internet Technology.
Link: https://mp.weixin.qq.com/s/mO5MdwG7apD6RzDhFwZhog
Author: duzhimin

More and more companies are trying abtest, either building their own systems or relying on third-party ones. So what basic knowledge is necessary when we conduct abtest? How do we carry out an AB experiment step by step? In this article, we will walk through the process of an AB experiment.

1、 Introduction

In the business development of Internet companies, user growth is an eternal theme: without growth, there is no development. Therefore, in the early stage of a business, the faster the product iterates, the better. In a word, do whatever is fastest.

When the business develops to a certain stage, the dividend of savage growth gradually fades and, under the existing strategies, the room for user growth becomes less obvious. How to plan the product iteration strategy reasonably then becomes particularly important, and judging whether a product strategy is effective usually depends on data. The data decide whether a product or strategy lives on; after all, we do not want to waste resources on ineffective products and strategies.

So what tools or methods can ensure that a data-driven strategy is implemented effectively? Many companies achieve this through abtest and by building a corresponding experiment platform.

In 2019, we set up the vivo abtest experimental platform (Hawking experimental platform). So far, 14 business parties have conducted 40 experiments. In communicating with the business parties, we found that abtest is not yet well understood. Therefore, let’s go through the relevant knowledge points and deepen our understanding of abtest.

Abtest is usually used to compare the effects of setting different values for a certain product variable in different versions (for example, one page uses a red button and another uses a blue button). Version A is the version currently in use, while version B is the improved version. In an experiment, we generally compare whether some indicators differ between the experimental group and the control group; more often, of course, we want to see whether the experimental group performs better than the control group. The null hypothesis H0 is that there is no significant difference between the experimental group and the control group.

Most often we pay attention to proportions, such as click-through rate, conversion rate and retention rate. The characteristic of such proportional values is that for a given user (each sample point in the sample) there are only two outcomes, “successful” or “unsuccessful”; for the whole sample, the value is the proportion of users whose outcome is “successful”. Take conversion rate: a user either converts or does not. In statistics, the hypothesis test for proportional values is called the two-sample proportion test.

Let’s explain it with the account device login rate experiment.

2、 Preparation before experiment

1. Before doing the experiment, let’s answer the following questions:

1.1. What do you want to prove in the experiment?

A: I want to improve the device login rate of the account by changing the color of the device login button.

1.2. What will your control and experimental groups look like?

A: The control group is the current design, with a blue login button (see the figure below). In the experimental group, I want to change the background of the login button to orange, to see if the device login rate improves.

[Figure: the current login page with the blue login button]

[Perfect step 1: determine the experimental group and the control group]

1.3. How to avoid confounding factors?

(Confounding factors are individual differences among the research subjects. They are not the factors you are trying to compare, but they can still distort the analysis results: for example, people in different cities, of different ages, or of different genders. When conducting the experiment, try to avoid the influence of confounding factors on the results.)

A: What you are asking here is how to choose the samples for the control group and the experimental group so that individual differences between the two groups are as small as possible. How are users allocated to each experimental scheme? The Hawking experimental platform does this for us. As I understand it, the platform supports several strategies for dividing users (unique identifier hash, specifying specific users, by user tags…). We are going to adopt the unique identifier hashing strategy. Selecting randomly from the requesting users is an excellent way to avoid confounding factors: factors that might become confounders end up equally represented in the control group and the experimental group.

The following figure shows the traffic-splitting strategies supported by the experimental platform.

[Figure: traffic-splitting strategies supported by the Hawking experimental platform]

[Perfect step 2: eliminate confounding factors]
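To make the unique identifier hashing strategy more concrete, here is a minimal sketch. It is not the Hawking platform's implementation: the function name `assign_group`, the experiment name, the SHA-256 hash and the 100-bucket split are all assumptions chosen for illustration.

```python
import hashlib

def assign_group(user_id: str, experiment: str,
                 groups=("control", "experiment")) -> str:
    """Deterministically map a user to a group by hashing a unique identifier."""
    # Salting with the experiment name (an assumption of this sketch) keeps
    # assignments in different experiments independent of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100            # bucket in 0..99
    index = bucket * len(groups) // 100       # split the buckets evenly
    return groups[index]

# Hypothetical device identifiers, for illustration only.
counts = {"control": 0, "experiment": 0}
for i in range(10000):
    counts[assign_group(f"device-{i}", "login_button_color")] += 1
print(counts)  # the two groups should be close to 5000 each
```

Because the group depends only on the identifier and the experiment name, the same user always sees the same scheme, and attributes such as city, age or gender play no part in the assignment.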

2. Sample size

How many samples are needed for an A/B experiment? This is a question we have to answer when we run experiments. (In fact, for Internet applications with very large traffic, the sample size is rarely a limiting factor. However, we still need to talk about it, because it involves some other concepts that we also need to understand.)

2.1. Why should we calculate the sample size?

In theory, the more samples, the better:

  • Intuitively, when the number of samples is small, the result is easily swayed by each new sample point, so the experimental results are unstable and it is difficult to draw a firm conclusion. Conversely, when the number of samples increases, the experiment has more “evidence” and its “reliability” is stronger.

In practice, the smaller the sample size, the better:

  1. Limited traffic: large companies have so many users that they don’t have to be too careful and can run dozens or even hundreds of experiments at the same time. But small companies have only so much traffic and so many new products to develop. If the samples of different experiments must not overlap, product development will slow down considerably.
  2. The cost of trial and error is high: suppose we run the experiment on 50% of the users and, unfortunately, a week later the results show that the total income of the experimental group has decreased by 20%. The experiment has then cost the entire company 10% of its income for that week. This trial-and-error cost is high.
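The 10% figure in the second point is simply the share of traffic in the experiment multiplied by the relative drop in that group. A small sketch of the arithmetic, assuming income is spread evenly across users (the function name is hypothetical):

```python
def overall_income_loss(traffic_share: float, relative_drop: float) -> float:
    """Fraction of total income lost while the experiment runs.

    Assumes income is spread evenly across users, so a group holding
    `traffic_share` of the users also holds that share of the income.
    """
    return traffic_share * relative_drop

loss = overall_income_loss(traffic_share=0.5, relative_drop=0.2)
print(f"{loss:.0%}")  # 10%
```

Under the same assumption, running the experiment on 10% of the users would limit the loss to 2%, which is one reason to keep the sample no larger than necessary.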

2.2. Confidence and detection efficiency

To understand these two concepts, let’s first go over the basics of an A/B experiment.

First of all, an A/B test has two hypotheses:

Null hypothesis (H0): the hypothesis that we hope to overturn through the experimental results. In our example, the null hypothesis can be expressed as “the device login rate of the orange button is the same as that of the blue button”.
Alternative hypothesis (H1): the hypothesis that we hope to verify through the experimental results. In our example, it can be expressed as “the device login rate of the orange button is different from that of the blue button”.

The essence of an A/B test is to make a judgment based on the experimental data: is H0 correct or not? There are then the following four situations:

[Figure: the four possible outcomes, combining whether H0 is actually true with what the analysis concludes]

1. There is no difference in the device login rate (H0 is correct), but the experimental analysis says that there is a difference.

This judgment is wrong. We call it a type I error and denote its probability by α. Confidence = 1 – α. A type I error means that the new product does not improve the business, but we mistakenly believe that it does. Such a result not only wastes the company’s resources, but may also steer the product in the wrong direction.

Therefore, when doing A/B testing, we want the probability of a type I error to be as low as possible. In practice, we set an upper limit for α, generally 5%. In other words, we design the experiment so that the probability of a type I error does not exceed 5%.

2. The device login rate is different (H1 is correct), but the experimental analysis says that there is no difference.

Our judgment is wrong again. This is called a type II error, and its probability is denoted by β. We generally require β not to exceed 20%.

3 and 4. The remaining two cases are those in which the judgment is correct: H0 is correct and the analysis finds no difference, or H1 is correct and the analysis finds a difference. The probability of the latter, that is, of detecting a difference that really exists, is called detection efficiency (statistical power).

The basic purpose of our experiment is to detect the difference in device login rate between the orange button and the blue button. If the detection efficiency is low, then even if the new product is effective, the experiment may fail to detect it. In other words, the experiment is of little use.

According to the definition of conditional probability, detection efficiency = 1 – β = 80%.

The choice of upper limits for the two types of error (α is 5%, β is 20%) reflects an important principle of A/B experiments: it is better to cut off four good products than to let one bad product go online.
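One way to get a feel for α and detection efficiency is to simulate many experiments and count how often the test reports a difference. The sketch below is only an illustration under assumed numbers: a 15% baseline login rate, a true rate of 20% for the new button, 900 users per group, and a pooled two-proportion z-test as the analysis method. None of these are taken from the Hawking platform.

```python
import math
import random

def z_test_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p value of a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def rejection_rate(p_control, p_experiment, n, runs, alpha=0.05, seed=0):
    """Share of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(runs):
        x1 = sum(rng.random() < p_control for _ in range(n))
        x2 = sum(rng.random() < p_experiment for _ in range(n))
        if z_test_p_value(x1, n, x2, n) < alpha:
            rejected += 1
    return rejected / runs

# H0 true (both buttons at 15%): every rejection is a type I error.
type_1_rate = rejection_rate(0.15, 0.15, n=900, runs=400)
# H1 true (15% vs 20%): the rejection rate estimates the detection efficiency.
power = rejection_rate(0.15, 0.20, n=900, runs=400, seed=1)
print(f"type I error rate ~ {type_1_rate:.3f}, detection efficiency ~ {power:.3f}")
```

With these assumed numbers, the first rate should land near 5% and the second near 80%, up to simulation noise.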

2.3. Calculation formula of sample size

In most cases, we do not need to understand the sample size formula in detail. Here is the formula; let’s learn it together.

[Formula image: minimum sample size n per group, expressed in terms of P1, P2, α and β]

In the above formula, P1 is called the baseline value, which is the current value of the key indicator (control group); P2 is the level that we hope to reach through the experiment; α and β are the type I and type II error probabilities, generally taken as 0.05 and 0.2 respectively; Z is the quantile function of the standard normal distribution.

Because abtest generally has at least two groups, the total sample size required for the experiment is 2n.

What should I do if I can’t calculate such a complicated formula?

The Hawking experimental platform now provides a small tool to calculate the sample size. You only need to fill in a few numbers:

[Figure: sample size calculator on the Hawking experimental platform]

Explanation:

Current day-to-day business ratio (baseline ratio): for example, in the account device login rate experiment, the baseline ratio is the current device login rate, such as 15%.

Minimum expected improvement (minimum detectable effect): in our experiment, we chose a minimum expected improvement of 5%. This means that if the orange button really increases the device login rate by 5%, we want the experiment to be able to detect the difference.

Number of experimental groups: a normal AB experiment has two groups, one control group and one experimental group.

[Perfect step 3: calculate the minimum sample size]
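For readers who want to reproduce such a calculation by hand, one common textbook form of the sample size formula for comparing two proportions is n = (Z(1 – α/2) + Z(1 – β))² × (P1(1 – P1) + P2(1 – P2)) / (P1 – P2)². The sketch below assumes that form, which may differ in detail from the one the platform uses, and it also assumes that the 5% improvement is absolute (15% to 20%); the article does not say whether it is absolute or relative.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, beta: float = 0.2) -> int:
    """Minimum users per group to detect p1 vs p2 (two-sided test).

    Uses a common unpooled-variance approximation; this is an assumed
    form, not necessarily the formula used by the Hawking platform.
    """
    z = NormalDist().inv_cdf            # quantile function of N(0, 1)
    z_alpha = z(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    z_beta = z(1 - beta)                # about 0.84 for beta = 0.2
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

n = sample_size_per_group(0.15, 0.20)   # baseline 15%, target 20% (assumed absolute)
total = 2 * n                           # one control group plus one experimental group
print(n, total)
```

Under these assumptions the result is roughly 900 users per group. The required sample grows quickly as the expected improvement shrinks: halving the difference roughly quadruples n.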

3. Determine indicators

In an experiment, we generally compare whether some indicators differ between the experimental group and the control group; more often, of course, we want to see whether the experimental group performs better than the control group. Therefore, before conducting the experiment we should first determine the indicators to be compared. We pay most attention to proportion indicators, such as click-through rate, conversion rate and retention rate. The significance analysis of the experiment is also carried out on proportion indicators.

[Perfect step 4: determine the experimental indicators]

4. Event tracking

Once we have determined the specific indicators to be analyzed, we need to design event tracking (sometimes translated as “buried points”) to collect the relevant user behavior for subsequent data analysis, from which the experimental conclusions are drawn.

For abtest, we need to know whether the current user is in the control group or the experimental group, so this parameter must be included in the tracked events.

At present, the Hawking experimental platform supports the business side through server-side event tracking, in which case the business side’s own tracking data does not have to report whether the user is in the experimental group or the control group. However, it is recommended that the business side also record the user’s scheme information, to make the data more accurate and the analysis results more reliable.

[Perfect step 5: collect experimental data]
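As an illustration of recording the scheme information alongside each behavior event, here is a minimal sketch of an event payload. All field names, the event name and the identifiers are hypothetical; they are not the Hawking platform's schema.

```python
import json
import time

def build_event(user_id: str, event_name: str,
                experiment_id: str, group: str) -> dict:
    """Build a tracking event that carries the user's experiment group."""
    return {
        "user_id": user_id,
        "event": event_name,
        "timestamp": int(time.time()),
        # Without these two fields the analysis cannot tell which
        # scheme the user saw when the event happened.
        "experiment_id": experiment_id,
        "group": group,
    }

event = build_event("device-42", "device_login_success",
                    "login_button_color", "experiment")
print(json.dumps(event))
```

Carrying the group in every event means the login rate of each group can be computed directly from the behavior log, without joining against a separate assignment record.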

3、 Observation in the experiment

1. Observe whether the sample size meets expectations, for example whether traffic is split evenly between the experimental group and the control group. Under normal circumstances, the traffic of the two groups will not differ much. If the difference is too large, we need to analyze where the problem is.

2. Observe whether the user’s behavior is being tracked correctly. In many experiments, tracking errors were discovered only after the experiment had finished.
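Whether a traffic difference is “too large” can be checked with a chi-square goodness-of-fit test against the intended split. The sketch below is one possible check, not a feature of the Hawking platform; the counts are made up, and the 50/50 expected split is an assumption.

```python
import math

def split_p_value(n_control: int, n_experiment: int,
                  expected_share: float = 0.5) -> float:
    """p value of a chi-square test (1 degree of freedom) that the
    observed traffic matches the intended split."""
    total = n_control + n_experiment
    exp_control = total * expected_share
    exp_experiment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_experiment - exp_experiment) ** 2 / exp_experiment)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

print(split_p_value(5030, 4970))  # small imbalance, consistent with chance
print(split_p_value(5300, 4700))  # large imbalance, worth investigating
```

A very small p value suggests that the split is not behaving as designed, for example because of a bug in assignment or in event reporting, and the experiment data should not be trusted until the cause is found.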

4、 Post experiment analysis

1. After we have run the abtest, we need to analyze the data to determine the effect of the experiment, which requires a significance analysis. If the results are not significant, they cannot be used as a reference.

2. Significant difference is a statistical term: it is a statistical evaluation of the difference between data sets. When drawing a conclusion, we usually take p > 0.05 to mean that the difference is not significant, 0.01 < p < 0.05 to mean that the difference is significant, and p < 0.01 to mean that the difference is extremely significant.

3. When there is a significant difference between the data, it means that the data being compared come from two different populations. This difference may arise because the data come from different groups of experimental subjects, such as comparing middle-aged people with the elderly, or because the experimental treatment caused a fundamental change in the subjects’ characteristics. The latter is exactly what we expect in an AB experiment, and it is why the experimental data show a significant difference.

4. The following is a formula for calculating the significance of proportion indicators, for your reference (independent-samples t-test). To calculate the p value, we first need to calculate the t value. The formula is as follows:

[Formula image: t statistic for comparing the proportion indicator of the two groups]

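After calculating the t value, it is converted into a p value using the t value and the degrees of freedom n = N1 + N2 – 2, where N1 and N2 are the sample sizes of the two groups.
In Excel, the formula is P = TDIST(T, N, 1), where N is the degrees of freedom and the final argument 1 requests a one-tailed p value (use 2 for a two-tailed value).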

5. Does our business side have to carry out such a complex significance calculation itself?

A: No. The Hawking experimental platform already supports the significance calculation of experimental indicators.

[Figure: significance results for experimental indicators on the Hawking experimental platform]

[Perfect step 6: calculate the significance of the indicators]
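For reference, here is a minimal sketch of a significance calculation for a proportion indicator. Note the swap: instead of the t distribution it uses a normal approximation (an unpooled two-proportion z-test), which gives nearly the same p value at the sample sizes typical of abtest. The counts and function names are hypothetical, and this is not the platform's implementation.

```python
import math

def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p value for the difference between two proportions,
    using a normal approximation in place of the t distribution."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        return 1.0 if p1 == p2 else 0.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def describe(p: float) -> str:
    """Label a p value with the thresholds used in this article.
    How the exact boundaries 0.01 and 0.05 are treated is a choice
    made for this sketch."""
    if p < 0.01:
        return "extremely significant"
    if p < 0.05:
        return "significant"
    return "not significant"

# Made-up counts: 150 of 1000 logins (blue) vs 200 of 1000 (orange).
p = two_proportion_p_value(150, 1000, 200, 1000)
print(f"p = {p:.4f}: {describe(p)}")
```

With these made-up counts the p value falls below 0.01, so the difference would be labeled extremely significant. Whether that difference comes from the button color or from a confounding factor still has to be checked separately.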

6. As point 3 shows, a significant difference does not necessarily mean that the experiment is effective; it may be caused by confounding factors. The experimental samples therefore need further analysis to determine whether confounding factors are responsible.

[Perfect step 7: identify the root cause of significance]

7. Finally, based on the analysis, we conclude whether the experiment is effective and, if so, how much improvement it brings to the business side.

[Perfect step 8: give the experimental conclusion]

8. It’s said that the Hawking experimental platform will support real-time viewing of traffic-split data and indicator data for each scheme, right?

A: Yes, it shouldn’t be long.

5、 Summary


1. Determine the control group and the experimental group. It is best to run univariate experiments, changing one variable at a time.

2. When splitting traffic, exclude confounding factors as far as possible. In general, random splitting can be used.

3. Check whether the traffic reaches the minimum sample size requirement. If it does not, the subsequent analysis cannot be carried out and the experimental results are not credible.

4. Determine the comparison indicators: if the schemes differ, what should be measured to show it?

5. Collect user behavior data accurately, which requires the event tracking to be correct.

6. Analyze the significance of the indicators. If the indicator is not significant, the experiment is invalid.

7. Determine the root cause of the significance, ruling out significance caused by confounding factors.

8. Finally, give the experimental conclusion: valid or invalid.

For more content, please follow the vivo Internet Technology WeChat official account.


Note: please contact WeChat ID: Labs2020.