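How to Draw Samples in R Using sample()

In data analysis, sampling is one of the most common tasks an analyst performs. Studying a sample is often the most practical way to explore and understand data, and this is especially true for large datasets.

R provides the standard function sample() for drawing samples from data. Many business and data analysis problems call for it.

Let's get into it!

Syntax of sample() in R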

sample(x, size, replace = FALSE, prob = NULL)

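x - a vector or dataset to sample from
size - the number of items to draw
replace - whether to sample with replacement (TRUE) or without it (FALSE, the default)
prob - a vector of probability weights, one per element of x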
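To make the arguments concrete, here is a minimal sketch. The vector `scores` is a made-up example and is not part of the original tutorial.

```r
# Hypothetical vector used only for illustration
scores <- c(10, 20, 30, 40, 50)

# x = scores, size = 3; replace and prob keep their defaults
# (no replacement, equal weights)
picked <- sample(x = scores, size = 3)
picked

# When x is a single number n >= 1, sample() draws from 1:n
perm <- sample(5)   # a random permutation of 1, 2, 3, 4, 5
perm
```

The second call is worth remembering: passing a one-element numeric vector does not sample from that single value, which can surprise you when the length of x varies.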

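Sampling with replacement

You may be wondering what sampling with replacement means.

When you sample from a list or dataset with replace = TRUE (or T), the function is allowed to return the same value more than once.

The example below makes this clear.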

# with no size given, sample() returns a random permutation of 1 to 5
x <- sample(1:5)
# prints the sample
x
# Output -> 3 2 1 5 4

# sample range is 1 to 5 and the number of samples is 3
x <- sample(1:5, 3)
# prints the 3 samples
x
# Output -> 2 4 5

# sample range is 1 to 5 and the number of samples is 6
x <- sample(1:5, 6)
# fails: without replacement, at most 5 values can be drawn from 1:5
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'

# replace = TRUE (or T) allows repeated values, so 6 samples can be drawn
# from the range 1 to 5. Here 2 and 4 are repeated.
x <- sample(1:5, 6, replace = TRUE)
x
# Output -> 2 4 2 2 4 3
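One everyday use of sampling with replacement is simulating repeated independent draws. The sketch below is an illustration that is not in the original text: it rolls a six-sided die ten times, and the names `rolls` and `counts` are chosen only for this example.

```r
# Ten rolls of a six-sided die; the same face may come up more than once
rolls <- sample(1:6, size = 10, replace = TRUE)
rolls

# Count how often each face appeared (faces that never appeared show 0)
counts <- table(factor(rolls, levels = 1:6))
counts
```

The counts always add up to 10, but individual faces can appear several times or not at all.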

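Sampling without replacement in R

In this case, we draw samples **without replacement**. The whole idea is demonstrated below.

To sample without replacement, pass the argument replace = FALSE (or F), which is also the default. Repeated values are then not allowed.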

# samples without replacement
x <- sample(1:8, 7, replace = FALSE)
x
# Output -> 4 1 6 5 3 2 7

# asking for 9 values from a population of 8 fails
x <- sample(1:8, 9, replace = FALSE)
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'

# here the sample size equals the size of the range, so the result is a permutation
x <- sample(1:5, 5, replace = FALSE)
x
# Output -> 5 4 1 3 2
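If you want to confirm that a sample has no repeats, or keep a script running when the requested size is too large, something like the following sketch may help. The use of tryCatch() is one possible approach and not something the original tutorial prescribes.

```r
# Draw 7 of 8 values without replacement and check for duplicates
x <- sample(1:8, 7, replace = FALSE)
anyDuplicated(x)   # 0 means no value appears twice

# Catch the error and keep its message when size exceeds the population
result <- tryCatch(
  sample(1:8, 9, replace = FALSE),
  error = function(e) conditionMessage(e)
)
result
```

Here `result` holds the error message as a character string, so the script can decide what to do next.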

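Sampling with the set.seed() function

As you may have noticed, samples are random and change on every run. To avoid this, or whenever you want the same sample each time, use the set.seed() function.

set.seed() - fixes the state of the random number generator, so the code that follows produces the same sequence on every run.

This is shown below. Run the code and you will get the same random sample each time.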

# set the seed
set.seed(5)
# take a random sample with replacement
sample(1:5, 4, replace = TRUE)
# Output -> 2 3 1 3

set.seed(5)
sample(1:5, 4, replace = TRUE)
# Output -> 2 3 1 3

set.seed(5)
sample(1:5, 4, replace = TRUE)
# Output -> 2 3 1 3
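You can check reproducibility in code by comparing two seeded runs. This is a small sketch, and the variable names `first`, `second`, and `third` are chosen only for illustration.

```r
# Same seed, same call: the two samples are identical
set.seed(5)
first <- sample(1:5, 4, replace = TRUE)

set.seed(5)
second <- sample(1:5, 4, replace = TRUE)

identical(first, second)   # TRUE

# Without resetting the seed, the generator simply continues its sequence,
# so this draw is not tied to the two above
third <- sample(1:5, 4, replace = TRUE)
third
```

The exact values produced after a given seed can differ between R versions (R 3.6.0 changed the default algorithm behind sample()), so it is safer to compare runs with each other than against hard-coded numbers.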

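Sampling from a dataset

In this section, we generate a sample from a dataset in RStudio.

The code below draws a sample of 10 rows from the ToothGrowth dataset and displays it. In the same way, you can draw a sample of any size you need from a dataset.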

# draw 10 random row numbers from the 'ToothGrowth' dataset
rows <- sample(1:nrow(ToothGrowth), 10)
rows
# Output -> 53 12 16 26 37 27 9 22 28 10

# select the 10 sampled rows
ToothGrowth[rows, ]
#     len supp dose
# 53 22.4   OJ  2.0
# 12 16.5   VC  1.0
# 16 17.3   VC  1.0
# 26 32.5   VC  2.0
# 37  8.2   OJ  0.5
# 27 26.7   VC  2.0
# 9   5.2   VC  0.5
# 22 18.5   VC  2.0
# 28 21.5   VC  2.0
# 10  7.0   VC  0.5
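The two steps above (draw row numbers, then subset) can be wrapped in a small helper. `sample_rows` is a hypothetical function written for this sketch and is not part of base R.

```r
# Hypothetical helper: return n randomly chosen rows of a data frame
sample_rows <- function(data, n) {
  stopifnot(is.data.frame(data), n <= nrow(data))
  # sample.int(N, n) draws n distinct values from 1:N
  picked <- sample.int(nrow(data), n)
  data[picked, , drop = FALSE]
}

# Usage: 10 random rows of the built-in ToothGrowth dataset
tg_sample <- sample_rows(ToothGrowth, 10)
tg_sample
```

Because the helper samples row numbers without replacement, no row appears twice in the result.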

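Sampling from a dataset with the set.seed() function

In this section, we use the set.seed() function together with sample() to draw a reproducible sample from a dataset.

Run the code below to generate a sample from a dataset using set.seed().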

# set the seed
set.seed(10)
# take a sample of 10 row numbers from the iris dataset
x <- sample(1:nrow(iris), 10)
x
# Output -> 137 74 112 72 88 15 143 149 24 13

# display the 10 sampled rows
iris[x, ]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 137          6.3         3.4          5.6         2.4  virginica
# 74           6.1         2.8          4.7         1.2 versicolor
# 112          6.4         2.7          5.3         1.9  virginica
# 72           6.1         2.8          4.0         1.3 versicolor
# 88           6.3         2.3          4.4         1.3 versicolor
# 15           5.8         4.0          1.2         0.2     setosa
# 143          5.8         2.7          5.1         1.9  virginica
# 149          6.2         3.4          5.4         2.3  virginica
# 24           5.1         3.3          1.7         0.5     setosa
# 13           4.8         3.0          1.4         0.1     setosa

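No matter how many times you run this code, you get the same rows. The values do not change because we called set.seed() first.

Generating random samples with sample() in R

Let's work through a problem to understand the idea.

**Problem:** A gift shop has decided to give a surprise gift to one of its customers. To do so, it has collected a list of names.

Hint: use the sample() function to generate a random sample.

As shown below, each run of this code draws a random participant name.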

# creates a vector of names and draws one name from it at random
sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'), 1)
# --> "Rossie"

sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'), 1)
# --> "Jolie"

sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'), 1)
# --> "jack"

sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'), 1)
# --> "Edwards"

sample(c('jack','Rossie','Kyle','Edwards','Joseph','Paloma','Kelly','Alok','Jolie'), 1)
# --> "Kyle"
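Storing the names in a variable avoids retyping the vector and makes it easy to change the number of winners. In this sketch, `participants`, `winner`, and `top_three` are illustrative names, and the three-winner variation is an assumption that goes beyond the original problem.

```r
# The participant names from the example above
participants <- c('jack', 'Rossie', 'Kyle', 'Edwards', 'Joseph',
                  'Paloma', 'Kelly', 'Alok', 'Jolie')

# One surprise-gift winner
winner <- sample(participants, 1)
winner

# Three winners; sampling without replacement (the default) guarantees
# that nobody is picked twice
top_three <- sample(participants, 3)
top_three
```

If the draw needs to be auditable later, call set.seed() with a recorded value before sampling.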

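Sampling with probabilities

With the examples and concepts above, you now know how to generate random samples and draw specific rows from a dataset.

You may be relieved to hear that R also lets you set the probability of each outcome, which helps with many problems.

Consider a company that produces 10 watches, where about 20% of watches turn out to be defective, so each watch is defective with probability 0.20. The code below simulates this. Because every draw is random, the number of defective watches in a single run will not always be exactly 2.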

# sets the probabilities to 80% good watches and 20% defective watches
sample(c('Good', 'Defective'), size = 10, replace = TRUE, prob = c(0.80, 0.20))

# --> "Good"      "Good"      "Good"      "Defective" "Good"      "Good"
#     "Good"      "Good"      "Defective" "Good"

You can also try different probabilities, as shown below.

sample(c('Good', 'Defective'), size = 10, replace = TRUE, prob = c(0.60, 0.40))

# --> "Good"      "Defective" "Good"      "Defective" "Defective" "Good"
#     "Good"      "Good"      "Defective" "Good"
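With only 10 draws, the observed share of defective watches can stray quite far from the probability you set. A rough way to see the weights at work is to draw a much larger sample and tabulate it. The sample size of 10000 and the seed value are arbitrary choices made for this sketch.

```r
# Arbitrary seed so the sketch is repeatable
set.seed(1)

# Simulate many watches with an 80% / 20% split
watches <- sample(c('Good', 'Defective'), size = 10000,
                  replace = TRUE, prob = c(0.80, 0.20))

# Observed share of each outcome
shares <- prop.table(table(watches))
shares
```

The observed shares should land close to 0.80 and 0.20, though they will rarely match exactly.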

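Wrapping up

In this tutorial, you learned how to draw samples from a dataset, a vector, or a list, with or without replacement.

Try sampling from the various datasets that ship with R. You can also import CSV files and experiment with the probability settings shown in the examples.

**Further reading:** R documentation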