Simple Random Sampling
Simple random sampling comes in two forms, with replacement and without replacement, and both can be carried out with the sample() function in base R.
sample(x, size, replace = FALSE, prob = NULL)
x: the object to sample from; if a single positive integer, sampling is from the integers 1 to x. Note that if x is a data frame, sample() draws columns rather than rows (a sketch of row sampling follows the example below)
size: the number of items to draw
replace: whether to sample with replacement; defaults to FALSE, i.e. without replacement
prob: a vector of probability weights for the elements being sampled; defaults to NULL, i.e. equal-probability sampling
Example:
sample(x=6,size=3,replace = F, prob=c(0.1,0.2,0.3,0.2,0.1,0.1))
# [1] 2 3 6
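As noted for x above, calling sample() directly on a data frame draws columns rather than rows, so a simple random sample of rows is taken by sampling row indices and subsetting. A minimal sketch (the toy data frame is only for illustration):
df=data.frame(x=c(1,2,2,3,3,4),api=c('index','index','logout','show','show','index'))
row_ind=sample(x=nrow(df),size=3,replace=F)  # draw 3 row numbers without replacement
df[row_ind,]  # the sampled rows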
Stratified Sampling
Stratified sampling can be carried out with the strata() function in the sampling package.
strata(data, stratanames = NULL, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)
data: the data to sample from
stratanames: the name(s) of the variable(s) that define the strata
size: the number of units to draw from each stratum
method: one of four sampling methods, simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling and systematic sampling; defaults to srswor
pik: the inclusion probabilities of the units within each stratum
description: whether to print a summary with basic information about each stratum
Note that with the default method srswor, sampling within each stratum is without replacement (as in the example below).
library(sampling)
df=data.frame(x=c(1,2,2,3,3,4),api=c('index','index','logout','show','show','index'))
sub2=strata(df, stratanames = 'x',size=c(1,2,1,1), method='srswor',description=T)
# Stratum 1
#
# Population total and number of selected units: 1 1
# Stratum 2
#
# Population total and number of selected units: 2 2
# Stratum 3
#
# Population total and number of selected units: 2 1
# Stratum 4
#
# Population total and number of selected units: 1 1
# Number of strata 4
# Total number of selected units 5
sub2
# x ID_unit Prob Stratum
# 1 1 1 1.0 1
# 2 2 2 1.0 2
# 3 2 3 1.0 2
# 4 3 4 0.5 3
# 6 4 6 1.0 4
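strata() returns only the identifiers (ID_unit) and inclusion probabilities of the selected units. To pull the corresponding rows out of the original data frame, the sampling package's getdata() function can be used; a minimal sketch continuing the example above:
getdata(df, sub2)  # the selected rows of df together with ID_unit, Prob and Stratum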
Cluster Sampling
Cluster sampling can be carried out with the cluster() function in the sampling package.
cluster(data, clustername, size, method = c("srswor", "srswr", "poisson", "systematic"), pik, description = FALSE)
data: the data to sample from
clustername: the name of the variable that defines the clusters
size: the number of clusters to draw
method: one of four sampling methods, simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling and systematic sampling; defaults to srswor
pik: the inclusion probabilities (used for unequal-probability sampling)
description: whether to print a summary with basic information about the sample
Example:
library(sampling)
df=data.frame(x=c(1,2,2,3,3,4),api=c('index','index','logout','show','show','index'))
sub3=cluster(df, clustername = 'x',size=2, method='srswor',description=T)
# Number of selected clusters: 2
# Number of units in the population and number of selected units: 6 3
sub3
# x ID_unit Prob
# 1 1 1 0.5
# 2 3 4 0.5
# 3 3 5 0.5
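As with strata(), the sampled rows can be recovered with getdata(); note that cluster sampling keeps every unit in each selected cluster:
getdata(df, sub3)  # all rows of df belonging to the two selected clusters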
Splitting into Training and Test Sets
Splitting data into a training set and a test set comes up constantly when fitting models, so it is worth knowing how to do it with concise, efficient code.
Implementation in R:
df=data.frame(x=1:10,y=paste0('n',1:10))
train_ind=sample(x=nrow(df),size=7,replace = F)
train_set=df[train_ind,]
train_set
# x y
# 9 9 n9
# 5 5 n5
# 3 3 n3
# 7 7 n7
# 10 10 n10
# 8 8 n8
# 4 4 n4
test_set=df[-train_ind,]
test_set
# x y
# 1 1 n1
# 2 2 n2
# 6 6 n6
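Because sample() draws a new random set of indices on every call, the split above changes from run to run. Fixing the random seed beforehand makes the split reproducible; a minimal sketch (the seed value 123 is arbitrary):
set.seed(123)  # any fixed seed gives a repeatable split
train_ind=sample(x=nrow(df),size=7,replace=F)
train_set=df[train_ind,]
test_set=df[-train_ind,]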
Implementation in Python:
Python does not offer indexing as convenient as R's negative row index, but it does provide a ready-made splitting function in the sklearn package.
Basic implementation
import pandas as pd
import numpy as np
y=["".join(('n',str(_))) for _ in range(1,11)]
df=pd.DataFrame({'x':range(1,11),'y':y})
train_ind=np.random.choice(range(10),size=7,replace=False)
test_ind=np.array(list(set(range(10))-set(train_ind)))
train_set=df.iloc[train_ind]
# x y
# 3 4 n4
# 4 5 n5
# 0 1 n1
# 8 9 n9
# 9 10 n10
# 5 6 n6
# 6 7 n7
test_set=df.iloc[test_ind]
# x y
# 1 2 n2
# 2 3 n3
# 7 8 n8
Implementation with the sklearn package
from sklearn.model_selection import train_test_split  # in older scikit-learn versions this lived in sklearn.cross_validation, which has since been removed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=None, random_state=None)
test_size: the proportion of the data assigned to the test set; if an integer, it is interpreted as the absolute number of test samples
random_state: the random seed
x=list(range(1,11))
y=["".join(('n',str(_))) for _ in range(1,11)]
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=10)
X_train
# [10, 2, 7, 8, 4, 1, 6]
X_test
# [3, 9, 5]
y_train
# ['n10', 'n2', 'n7', 'n8', 'n4', 'n1', 'n6']
y_test
# ['n3', 'n9', 'n5']