Stratified Sampling Techniques in Python
Sampling is a commonly used data collection method in statistics: a subset of a data set is selected and analyzed in order to infer the characteristics of the whole. With very large data sets, analyzing every record is often time-consuming and costly, so choosing an appropriate sampling method can make data analysis more efficient. This article introduces stratified sampling techniques in Python.
What is stratified sampling?
Stratified sampling is a commonly used sampling technique. Unlike simple random sampling, it first divides the population into several non-overlapping strata (layers) so that the members of each stratum share a common attribute. A random sample is then drawn from each stratum, possibly at a different sampling rate per stratum. The method is well suited to populations that contain clearly identifiable subgroups, and in those cases it is usually more efficient than simple random sampling.
Why is stratified sampling needed?
The advantage of stratified sampling is that it can improve precision and reduce sampling error, which in turn supports better models and inferences. In practice, a population often contains subgroups that differ on important variables. If a sample over-represents or under-represents these subgroups, the estimates become biased and the resulting model drifts away from the real situation. Stratified sampling controls how many samples come from each subgroup, so the sample reflects the composition of the population more faithfully.
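As a rough illustration of the precision claim above, the following sketch repeatedly estimates a population mean with simple random sampling and with stratified sampling, then compares how much the two sets of estimates vary. The population, the stratum means, and all function names are assumptions made up for this demonstration, not data from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: two equal-sized strata whose means clearly differ
# (for example, heights in cm of two groups).
stratum_a = rng.normal(loc=160.0, scale=5.0, size=5000)
stratum_b = rng.normal(loc=175.0, scale=5.0, size=5000)
population = np.concatenate([stratum_a, stratum_b])


def simple_random_mean(n):
    """Mean of a simple random sample of n values from the whole population."""
    return rng.choice(population, size=n, replace=False).mean()


def stratified_mean(n):
    """Mean of a stratified sample: half of the n values from each stratum."""
    half = n // 2
    a = rng.choice(stratum_a, size=half, replace=False)
    b = rng.choice(stratum_b, size=half, replace=False)
    # The strata are equally large, so the plain mean needs no extra weights.
    return np.concatenate([a, b]).mean()


trials = 500
srs_estimates = np.array([simple_random_mean(40) for _ in range(trials)])
strat_estimates = np.array([stratified_mean(40) for _ in range(trials)])

# Spread of the estimates across repeated samples (smaller is more precise).
srs_std = srs_estimates.std()
strat_std = strat_estimates.std()
print(f"simple random: {srs_std:.3f}, stratified: {strat_std:.3f}")
```

The gap between the two spreads depends on how different the strata are. If the stratum means were nearly identical, stratification would bring little benefit.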
How to perform stratified sampling in Python?
In Python, a variety of packages can be used to implement stratified sampling. The best known are the numpy and pandas libraries, both of which provide many useful functions for the task.
Below we use an example to demonstrate how to use Python to implement stratified sampling.
In this example, we assume an experimental data set with five variables: gender, age, height, weight, and smoking status. This data set lends itself well to stratified sampling.
First, we divide the data set into strata. We use gender as the stratification variable, which gives two strata: men and women.
import pandas as pd

# Generate test data
data = pd.DataFrame({
    'sex': ['M', 'M', 'M', 'F', 'F', 'F'],
    'age': [18, 20, 22, 25, 27, 30],
    'height': [170, 172, 175, 160, 165, 170],
    'weight': [65, 70, 75, 55, 60, 65],
    'smoke': [1, 1, 0, 0, 1, 0]
})

# Split the data into strata by gender
male = data[data['sex'] == 'M']
female = data[data['sex'] == 'F']
Next, we determine the sampling proportion for each stratum and the sample size that follows from it. In this example, we take 10% of the women and 20% of the men.
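# Sampling proportion for each stratum
sampling_prop = {
    'M': 0.2,
    'F': 0.1
}

# Sample size for each stratum: round to the nearest row and keep at least one,
# because int() would truncate 3 * 0.2 and 3 * 0.1 to 0 on this tiny data set
m_size = max(1, round(len(male) * sampling_prop['M']))
f_size = max(1, round(len(female) * sampling_prop['F']))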
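The proportions above are fixed by hand for each stratum. A common alternative is proportional allocation, where a total sample size is split across the strata according to their share of the population. The sketch below is one possible way to do that. The function name `proportional_allocation` and the stratum counts are hypothetical, and the largest-remainder rounding is just one reasonable choice.

```python
import math


def proportional_allocation(stratum_counts, total_sample_size):
    """Split total_sample_size across strata in proportion to their sizes.

    stratum_counts maps a stratum label to the number of rows in it.
    Uses largest-remainder rounding so the sizes add up exactly.
    """
    population = sum(stratum_counts.values())
    exact = {
        label: total_sample_size * count / population
        for label, count in stratum_counts.items()
    }
    # Start from the rounded-down shares.
    sizes = {label: math.floor(value) for label, value in exact.items()}
    leftover = total_sample_size - sum(sizes.values())
    # Hand the remaining rows to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - sizes[k], reverse=True)
    for label in by_remainder[:leftover]:
        sizes[label] += 1
    return sizes


# Hypothetical population of 600 men and 400 women, total sample of 50 rows.
sizes = proportional_allocation({'M': 600, 'F': 400}, 50)
print(sizes)
```

With proportional allocation every row has the same chance of selection, so the combined sample can be analyzed without per-stratum weights. With unequal proportions such as the 20% and 10% used in this article, population-level estimates should weight each stratum by its share of the population.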
Finally, we can use the sample method of the pandas DataFrame to draw a random sample from each stratum. In this example, we draw the required number of rows from each stratum and then combine them:
# Draw a random sample from each stratum
msample = male.sample(m_size)
fsample = female.sample(f_size)

# Combine the per-stratum samples
sample = pd.concat([msample, fsample])
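The steps above handle each stratum by hand, which becomes tedious once there are more than two strata. The following sketch wraps the same idea in a reusable helper built on `groupby`. The helper name `stratified_sample`, its parameters, and the 200-row data set are assumptions for illustration. Rounding to the nearest row with a minimum of one row per stratum is likewise a design choice, not a fixed rule.

```python
import pandas as pd


def stratified_sample(df, by, fractions, random_state=None):
    """Draw a random sample from each stratum of df.

    by is the name of the stratification column and fractions maps each
    stratum label to the fraction of its rows to draw.
    """
    parts = []
    for label, group in df.groupby(by):
        # Round to the nearest whole row, but keep at least one per stratum.
        n = max(1, round(len(group) * fractions[label]))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)


# Hypothetical data set: 100 men followed by 100 women.
people = pd.DataFrame({
    'sex': ['M'] * 100 + ['F'] * 100,
    'age': [18 + i % 50 for i in range(200)],
})

picked = stratified_sample(people, 'sex', {'M': 0.2, 'F': 0.1}, random_state=0)
counts = picked['sex'].value_counts().to_dict()
print(counts)
```

When every stratum uses the same fraction, `df.groupby(by).sample(frac=...)` does the same job in a single call in pandas 1.1 and later.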
When the strata differ in meaningful ways, stratified sampling tends to give more accurate results than a simple random sample of the same size, and models built on the sample tend to carry over better to the full population. In practice, applying stratified sampling can improve both the efficiency and the accuracy of data analysis and lead to more reliable conclusions.