Outlier Analysis in R - Detecting and Removing Outliers

Hello, readers! In this article, we will focus on **outlier analysis in R programming**.

Let us begin!

What are outliers in data?

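Before diving into the concept of outliers, let us look at the preprocessing of data values.

In data science and machine learning, preprocessing of data values is a crucial step. By preprocessing, we mean removing errors and noise from the data before modeling.

In our previous article, we covered [missing value analysis]( / 社区 / 教程 / 缺少的价值分析 - 使用 r - 编程) in R programming.

Today, we will focus on a more advanced part of the same process: outlier detection and removal in R.

Outliers, as the name suggests, are data points that lie far away from the other points of the dataset. That is, they are data values that sit apart from the rest and so distort the overall distribution of the dataset.

Their presence is usually regarded as an abnormal distribution of the data values.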

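Effect of outliers on a model -

The data ends up in a skewed format.
They change the overall statistical distribution of the data in terms of mean, variance, etc.
They bias the accuracy level of the model.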
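To make the effect on summary statistics concrete, here is a minimal sketch on made-up numbers. The vector `readings` and the extreme value 500 are invented for illustration and are not taken from the bike rental data. It compares the mean, median, and standard deviation with and without a single extreme value.

```r
# Hypothetical values, chosen only to illustrate the effect of one outlier
readings <- c(10, 12, 11, 13, 12, 11, 10, 12)
with_outlier <- c(readings, 500)

# Collect the summary statistics for each version of the data
stats_clean <- c(mean = mean(readings),
                 median = median(readings),
                 sd = sd(readings))
stats_outlier <- c(mean = mean(with_outlier),
                   median = median(with_outlier),
                   sd = sd(with_outlier))

# Show both sets of statistics side by side
print(round(rbind(clean = stats_clean, with_outlier = stats_outlier), 2))
```

In this toy example the mean and standard deviation shift sharply while the median barely moves, which is one way to see how a single extreme value can distort statistics that a model relies on.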

Now that we understand the effect of outliers, it is time to move on to the implementation.

Outlier analysis - Get set GO!

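First of all, it is important to detect whether outliers are present in the dataset.

So, let us begin! We use the Bike Rental Count Prediction dataset. You can find the dataset here!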

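1. Loading the dataset

First, we load the dataset into the R environment using the read.csv() function.

Before detecting outliers, we run a missing value analysis to check for any NULL or missing values, using the `sum(is.na(data))` function.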

# Remove all existing objects from the workspace
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data = read.csv("day.csv", header = TRUE)

### Missing Value Analysis ###
sum(is.na(bike_data))
summary(is.na(bike_data))

# From the above result, it is clear that the dataset contains NO missing values.

The data here contains no missing values.
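Since day.csv has no missing values, the check above simply returns 0. As a quick illustration of what the same check reports when values are missing, the sketch below uses a small made-up data frame. The name `toy` and its contents are hypothetical and are not part of day.csv. It also shows colSums(), which gives a per-column count.

```r
# Hypothetical data frame with two missing values
toy <- data.frame(
  hum = c(0.61, NA, 0.55, 0.70),
  windspeed = c(0.16, 0.25, NA, 0.19)
)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(toy))

# Number of missing cells in each column
missing_by_col <- colSums(is.na(toy))

print(total_missing)
print(missing_by_col)
```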

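2. Detecting outliers with the boxplot() function

With that done, it is time to detect the outliers in the dataset. To do so, we store the names of the numeric data columns in a separate variable using the c() function.

We then use the boxplot() function to detect the presence of outliers in the numeric variables.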

Boxplot:

Outlier Detection - Boxplot Method

Visually, it is clear that the hum and windspeed variables contain outliers in their data values.
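The points a boxplot draws as outliers come from a simple rule. With the default settings, boxplot.stats() flags values lying more than 1.5 times the interquartile range beyond the hinges, which are the quartiles as computed by fivenum(). The following minimal sketch, on a made-up vector `x` that is not from the bike rental data, reproduces that rule by hand and compares it with boxplot.stats().

```r
# Hypothetical values with one unusually low and one unusually high point
x <- c(0.05, 0.48, 0.52, 0.55, 0.58, 0.60, 0.62, 0.65, 0.68, 0.70, 0.99)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
iqr <- five[4] - five[2]

# 1.5 is the default whisker coefficient of boxplot() and boxplot.stats()
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Values outside the fences, computed by hand and by boxplot.stats()
manual_out <- x[x < lower_fence | x > upper_fence]
builtin_out <- boxplot.stats(x)$out

print(manual_out)
print(builtin_out)
```

Both approaches should flag the same two values here. Note that quantile() uses a different default definition of quartiles than fivenum(), so fences built from quantile() can differ slightly from what the boxplot shows.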

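3. Replacing outliers with NA values

Now that we have detected the outliers in R, we replace the outlier values identified by the boxplot method with NA (missing) values, as shown below.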

############################## Outlier Analysis -- DETECTION ###########################

# 1. Outliers occur only in continuous/numeric data variables. So we store the names of the numeric and the categorical independent variables in separate vectors.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")

# 2. Use a boxplot to detect the presence of outliers in the numeric/continuous data columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])

# From the above visualization, it is clear that the variables 'hum' and 'windspeed' contain outliers.

# OUTLIER ANALYSIS -- Removal of Outliers
# 1. The boxplot shows that outliers are present. With the default settings, these are the values lying more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
# 2. Now we replace the outlier values with NA.

for (x in c('hum','windspeed'))
{
  value = bike_data[,x][bike_data[,x] %in% boxplot.stats(bike_data[,x])$out]
  bike_data[,x][bike_data[,x] %in% value] = NA
}

# Check whether the outliers in the columns above have been replaced by NA
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))
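The replacement loop above can also be wrapped in a small helper so that it is easy to try on other data. The sketch below is one possible way to do that and not part of the original walkthrough. The function name `outliers_to_na` and the data frame `toy` are hypothetical.

```r
# Hypothetical helper: set boxplot-rule outliers in the given columns to NA
outliers_to_na <- function(df, cols) {
  for (x in cols) {
    # Values that boxplot.stats() reports as outliers for this column
    out_values <- boxplot.stats(df[[x]])$out
    df[[x]][df[[x]] %in% out_values] <- NA
  }
  df
}

# Made-up data: the last row holds one extreme value in each column
toy <- data.frame(
  hum = c(0.50, 0.55, 0.60, 0.62, 0.65, 0.70, 0.01),
  windspeed = c(0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.90)
)

toy_clean <- outliers_to_na(toy, c("hum", "windspeed"))

# Count the NA values per column after the replacement
print(colSums(is.na(toy_clean)))
```

Because the function returns a modified copy, the original data frame stays untouched, which makes it easy to compare the data before and after the replacement.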

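4. Verifying that all outliers have been replaced with NA

Now we check for missing data, that is, whether the outlier values were correctly converted to missing values, using the sum(is.na()) function.

**Output:**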

> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
casual                             0
registered                         0
cnt                                0

As a result, the 2 outliers in the hum column and the 13 outliers in the windspeed column have been converted to missing (NA) values.

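5. Dropping the rows with missing values

Finally, we handle the missing values by dropping every row that contains one, using the drop_na() function from the **tidyr** library.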

# Remove the rows that contain NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))

**Output:**

As a result, all the outliers have now been effectively removed!

> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
casual                             0
registered                         0
cnt                                0
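For readers who would rather not depend on tidyr, base R can do the same row-wise removal with complete.cases(). The sketch below is an alternative to the drop_na() call used in this article and not the method shown above. The data frame `toy` and its values are made up for illustration.

```r
# Made-up data frame with NA values standing in for replaced outliers
toy <- data.frame(
  hum = c(0.61, NA, 0.55, 0.70),
  windspeed = c(0.16, 0.25, NA, 0.19),
  cnt = c(985, 801, 1349, 1562)
)

# Keep only the rows that have no NA in any column
toy_complete <- toy[complete.cases(toy), ]

# Compare the row counts before and after the removal
print(nrow(toy))
print(nrow(toy_complete))
```

Either way, a whole row is dropped even if only one of its columns was flagged, so it is worth comparing the row counts before and after to see how much data is lost.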

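Conclusion

With this, we have come to the end of this topic. Feel free to comment below if you run into any questions. For more posts like this on R programming, stay tuned!

Till then, happy learning!! :)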