Bootstrap here is not the front-end framework from Twitter, but a concept in statistics. The following explains it through experiments.
Suppose there is an event that has occurred 10,000,000 times, and the number of occurrences follows a Poisson distribution. Of course, suppose we don't know that it's a Poisson distribution.
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

ALL = np.random.poisson(2, size=10000000)
ALL.mean()  # 2.005085
ALL.var()   # 2.0007084414277481

# Plot the true probability mass function for comparison.
x = np.arange(0, 20)
y = scipy.stats.poisson(2).pmf(x)
fig = plt.figure()
plot = fig.add_subplot(111)
plot.plot(x, y)
We only have one small sample of it, and that sample alone tells us little. For example, its mean is off.
SAMPLE = np.random.choice(ALL, size=20)
# array([1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 1, 4, 2, 5, 2])
SAMPLE.mean()  # 1.3500000000000001
Now apply the bootstrap (specifically, resampling): repeatedly draw samples with replacement from SAMPLE and compute the mean of each, which gives a distribution from which a confidence interval can be estimated.
samples = [np.random.choice(SAMPLE, size=20) for i in range(1000)]
means = [s.mean() for s in samples]
plot.hist(means, bins=30)
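The code above plots the bootstrap means but stops short of actually computing a confidence interval. A minimal sketch of the percentile method, assuming a fresh Poisson sample in place of the one drawn above (the variable names and the fixed seed are mine, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
SAMPLE = rng.poisson(2, size=20)  # stand-in for the sample drawn above

# Resample with replacement 1000 times and record each mean.
means = [rng.choice(SAMPLE, size=20).mean() for _ in range(1000)]

# The 2.5th and 97.5th percentiles of the bootstrap means give a
# 95% percentile bootstrap confidence interval.
lo, hi = np.percentile(means, [2.5, 97.5])
print(lo, SAMPLE.mean(), hi)
```

The interval brackets the sample mean; whether it also covers the true mean depends on how lucky the original sample was.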
We can repeat this a few more times:
def plot_hist():
    fig = plt.figure()
    plot1 = fig.add_subplot(221)
    plot2 = fig.add_subplot(222)
    plot3 = fig.add_subplot(223)
    plot4 = fig.add_subplot(224)
    for plot in (plot1, plot2, plot3, plot4):
        SAMPLE = np.random.choice(ALL, size=50)
        samples = [np.random.choice(SAMPLE, size=20) for i in range(1000)]
        means = [s.mean() for s in samples]
        plot.clear()
        plot.hist(means, bins=30)
    return fig
It can be seen that the randomness of the original sample has a great influence on the final histograms. Even so, hypothesis tests based on this kind of calculation are basically reliable.
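That reliability claim can be checked empirically: draw many independent samples, build a percentile bootstrap interval from each, and count how often the interval covers the true mean (2). This is a sketch under my own choices of sample size, resample count, and seed, none of which come from the original:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_MEAN = 2.0
trials = 200
covered = 0

for _ in range(trials):
    sample = rng.poisson(TRUE_MEAN, size=50)
    # Percentile bootstrap interval from 500 resamples.
    means = [rng.choice(sample, size=50).mean() for _ in range(500)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(covered / trials)  # typically close to the nominal 95%
```

The observed coverage usually lands a bit below 95% for small samples, which is a known bias of the percentile method, but it is in the right ballpark.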