Implementing 8 Common Sampling Methods in Python

Time:2022-5-9
Contents
  • Probability sampling techniques
    • 1. Random sampling
    • 2. Stratified sampling
    • 3. Cluster sampling
    • 4. Systematic sampling
    • 5. Multistage sampling
  • Non-probability sampling techniques
    • 1. Convenience sampling
    • 2. Voluntary sampling
    • 3. Snowball sampling
  • Summary

    Today let's talk about several common sampling methods and how to implement them in Python.

    Sampling is a very important and frequently used method in statistics and machine learning, because using the full data is unrealistic or impossible most of the time. For example, in inferential statistics, we often infer and estimate properties of the population from the sampled data.

    Everything described above is based on probability. There is also a class of non-probability sampling methods, so sampling methods are generally divided into two categories:

    Probability sampling: samples are selected according to probability theory. Every member of the population has a known, non-zero chance of being selected.

    Non-probability sampling: samples are selected according to non-random criteria. Not every member of the population has a chance of being selected.

    Probability sampling techniques

    1. Random sampling

    This is the simplest, most brute-force kind of sampling: draw directly at random, ignoring all other factors and relying entirely on chance. Under random sampling, every member of the population has an equal probability of being selected.

    For example, suppose there are 10000 samples, each with a serial number. If the sample size is 1000, we randomly draw 1000 numbers from 1 to 10000, and the samples with those serial numbers are selected.

    In Python, we can use the random module to draw random numbers. Here five people are randomly selected from 100.

    
    import random

    population = 100
    data = range(population)
    # Draw 5 distinct values without replacement; the result differs on every run
    print(random.sample(data, 5))
    # Example output: [4, 19, 82, 45, 41]
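
    Because random.sample returns a different result on every run, it can help to fix a seed while experimenting. The following is a minimal sketch of that idea; the helper name simple_random_sample and the seed value 42 are illustrative assumptions, not part of the original example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct items from population without replacement.

    Hypothetical helper for illustration; not part of the original example.
    """
    # A private Random instance leaves the global random state untouched
    rng = random.Random(seed)
    return rng.sample(list(population), k)


people = range(100)
first = simple_random_sample(people, 5, seed=42)
second = simple_random_sample(people, 5, seed=42)

print(first)
# The same seed reproduces the same sample
print(first == second)
```

    Passing seed=None falls back to an unpredictable sample, which is usually what you want outside of tests and demos.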
    

    2. Stratified sampling

    Stratified sampling is also a form of random sampling, but with a prerequisite added: the population to be sampled is first grouped according to some shared attribute, and samples are then drawn at random from each group separately.

    Stratified sampling can therefore be seen as a more refined random sampling in which the sample keeps the same proportions as the overall population. For example, suppose class labels 0 and 1 in a machine learning classification task have a ratio of 3:7. To preserve that ratio, you can use stratified sampling and sample randomly within each class separately.

    In Python, we can do the stratification by setting the stratify parameter of train_test_split.

    
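    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Assumed example data (not in the original): 100 rows, labels 0 and 1 in a 3:7 ratio
    population = pd.DataFrame({'value': range(100), 'label': [0] * 30 + [1] * 70})
    # Keep 10% of the rows as the sample while preserving the label ratio
    stratified_sample, _ = train_test_split(population, test_size=0.9, stratify=population[['label']])
    print(stratified_sample)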
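
    To make the "sample each group separately" step explicit, the same idea can be sketched with the standard library only. The data below is made up to match the 3:7 label ratio from the example, and the helper name stratified_sample_by is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample_by(records, key, fraction, seed=None):
    """Sample the same fraction from every group defined by key(record).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Round to the nearest whole number of records for this group
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Made-up population of (value, label) pairs: labels 0 and 1 in a 3:7 ratio
population = [(i, 0) for i in range(30)] + [(i, 1) for i in range(30, 100)]
sample = stratified_sample_by(population, key=lambda row: row[1], fraction=0.1, seed=0)

counts = {label: sum(1 for _, lab in sample if lab == label) for label in (0, 1)}
print(counts)  # {0: 3, 1: 7}
```

    Whatever the seed, the 10% sample contains 3 records with label 0 and 7 with label 1, so the 3:7 ratio is preserved.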
    

    3. Cluster sampling

    In cluster sampling, the whole population is divided into several subgroups (clusters), each of which has characteristics similar to the population as a whole. In other words, it does not sample individuals, but randomly selects entire subgroups.

    In Python, you can first assign a cluster ID to each sample, then randomly select two clusters, and finally look up the corresponding sample values, as shown below.

    import random
    import numpy as np

    clusters = 5
    pop_size = 100
    sample_clusters = 2
    # Assign cluster IDs 1 to 5 to the 100 samples in blocks of 20. This step assumes the clustering has already been done
    cluster_ids = np.repeat(range(1, clusters + 1), pop_size // clusters)
    # Randomly select the IDs of two clusters (random.sample needs a sequence, not a set)
    cluster_to_select = random.sample(sorted(set(cluster_ids)), sample_clusters)
    # Find the indexes of the samples that belong to the selected clusters
    indexes = [i for i, x in enumerate(cluster_ids) if x in cluster_to_select]
    # Extract the sample values at those indexes
    cluster_associated_elements = [el for idx, el in enumerate(range(1, 101)) if idx in indexes]
    print(cluster_associated_elements)
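
    The same steps can also be written with the standard library only, which avoids the index bookkeeping. This is a minimal sketch that assumes the same made-up layout as above (values 1 to 100 in 5 consecutive clusters of 20); the helper name cluster_sample is hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and return all their members.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    chosen_ids = sorted(rng.sample(sorted(clusters), n_clusters))
    members = [el for cid in chosen_ids for el in clusters[cid]]
    return chosen_ids, members


# Cluster ID -> members; the clustering itself is assumed to be done already
values = list(range(1, 101))
clusters = {cid: values[(cid - 1) * 20: cid * 20] for cid in range(1, 6)}

chosen_ids, members = cluster_sample(clusters, n_clusters=2, seed=1)
print(chosen_ids)
print(len(members))  # 40: two whole clusters of 20 members each
```

    Storing each cluster's members in a dictionary makes it clear that the unit being sampled is the cluster, not the individual element.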

    4. Systematic sampling

    Systematic sampling draws samples from the population at predetermined regular intervals (that is, fixed, periodic intervals), for example every ninth element. Generally speaking, this method is often more efficient than ordinary random sampling.

    The following figure shows a sequence in which every 9th element is sampled, repeating along the sequence.

    To implement this in Python, simply set the step directly in the loop.

    
    population = 100
    step = 5
    # Take every 5th element, starting from element 1
    sample = [element for element in range(1, population, step)]
    print(sample)
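
    The loop above always starts at element 1. A common variant of systematic sampling picks the starting point at random within the first interval and then proceeds in fixed steps. A minimal sketch of that variant, with systematic_sample as a hypothetical helper name:

```python
import random


def systematic_sample(items, step, seed=None):
    """Take every step-th item, starting at a random offset in [0, step).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    start = rng.randrange(step)
    return items[start::step]


population = list(range(1, 101))
sample = systematic_sample(population, step=5, seed=0)
print(sample)
print(len(sample))  # 20: one element from each block of 5
```

    With the random start, every element has the same chance (1 in step) of ending up in the sample, while the spacing between selected elements stays fixed.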
    

    5. Multistage sampling

    In multistage sampling, we chain multiple sampling methods one after another. For example, in the first stage, cluster sampling can be used to select clusters from the population, and in the second stage random sampling can be used to select elements from each cluster to form the final set.

    The Python code reuses the cluster sampling above and only adds a random sampling step at the end.

    import random
    import numpy as np

    clusters = 5
    pop_size = 100
    sample_clusters = 2
    sample_size = 5
    # Assign cluster IDs 1 to 5 to the 100 samples in blocks of 20. This step assumes the clustering has already been done
    cluster_ids = np.repeat(range(1, clusters + 1), pop_size // clusters)
    # Randomly select the IDs of two clusters (random.sample needs a sequence, not a set)
    cluster_to_select = random.sample(sorted(set(cluster_ids)), sample_clusters)
    # Find the indexes of the samples that belong to the selected clusters
    indexes = [i for i, x in enumerate(cluster_ids) if x in cluster_to_select]
    # Extract the sample values at those indexes
    cluster_associated_elements = [el for idx, el in enumerate(range(1, 101)) if idx in indexes]
    # Then randomly select samples from the members of the selected clusters
    print(random.sample(cluster_associated_elements, sample_size))
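
    The code above draws the final 5 elements from the pooled members of the selected clusters. If the second stage should instead draw from each selected cluster separately, as the description suggests, it could be sketched as follows. The helper name two_stage_sample and the per-cluster sample size are illustrative assumptions.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick whole clusters. Stage 2: sample inside each picked cluster.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    chosen_ids = sorted(rng.sample(sorted(clusters), n_clusters))
    return {cid: rng.sample(clusters[cid], n_per_cluster) for cid in chosen_ids}


# Same made-up layout: values 1 to 100 in 5 consecutive clusters of 20
values = list(range(1, 101))
clusters = {cid: values[(cid - 1) * 20: cid * 20] for cid in range(1, 6)}

result = two_stage_sample(clusters, n_clusters=2, n_per_cluster=5, seed=3)
for cid, picked in result.items():
    print(cid, picked)
```

    Sampling per cluster guarantees that every selected cluster contributes the same number of elements, which pooled sampling does not.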

    Non-probability sampling techniques

    Non-probability sampling, as the name suggests, does not rely on probability; in many cases the selection is made according to specific conditions. Because the selection is not random, it cannot be implemented through probability statistics and programming in the same way. Three methods are introduced here.

    1. Convenience sampling

    Convenience sampling means that researchers select only the individuals who are easiest to reach and most available to participate in the research. For example, in the figure below, the blue dot is the researcher, and the orange dots are the most accessible people near the blue dot.

    2. Voluntary sampling

    Under voluntary sampling, interested people usually opt in themselves, for example by filling in a survey form. The researchers therefore have no say in which individuals are chosen and rely entirely on people volunteering. For example, in the figure below, the blue dots are researchers and the orange dots are individuals who voluntarily agreed to participate in the study.

    3. Snowball sampling

    Snowball sampling means that the final sample is recruited through other participants: researchers ask known contacts to find further people willing to participate in the study. For example, in the figure below, the blue dot is the researcher, the orange dots are known contacts, and the yellow dots are other contacts around the orange dots.

    Summary

    Those are the eight commonly used sampling methods. Probability sampling methods are commonly used in day-to-day work. Non-probability methods involve no randomness, so they cannot be automated through statistics and programming.

    For example, in sample design for credit risk control, samples need to be drawn from the sample window by probability sampling. Because the quality of the sample largely determines the upper limit of your model, many issues must be considered when sampling, such as the number of samples, whether they are significant, sample crossing, and so on. A good sampling method is very important here.
