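How To Install the pandas Package and Work with Data Structures in Python 3

Introduction

The Python pandas package is used for data manipulation and analysis, and is designed to let you work with labeled or relational data in a more intuitive way.

Built on the numpy package, pandas includes labels and descriptive indices, and is particularly robust in handling common data formats and missing data.

The pandas package offers spreadsheet functionality, but working with data is much quicker in Python than in a spreadsheet, and pandas proves to be very efficient.

In this tutorial, we'll first install pandas and then get you oriented with its fundamental data structures: Series and DataFrames.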

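Installing pandas

Like other Python packages, pandas can be installed with pip.

First, let's move into our local programming environment or server-based programming environment of choice and install pandas along with its dependencies there:

pip install pandas numpy python-dateutil pytz

You should receive output similar to the following (the version number depends on when you install):

[secondary_label Output]
Successfully installed pandas-0.19.2

If you prefer to install pandas within Anaconda, you can do so with the following command:

conda install pandas

At this point, you're all set up to begin working with the pandas package.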
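
Before moving on, it can help to confirm that the installation worked. The following is a minimal sketch that only assumes pandas and numpy were installed into the active environment; the version strings it prints will almost certainly differ from the 0.19.2 shown above.

```python
# Quick sanity check: both packages should import without errors.
import numpy as np
import pandas as pd

# Print the installed versions; the exact numbers depend on your environment.
print("pandas", pd.__version__)
print("numpy", np.__version__)
```

If either import raises ModuleNotFoundError, the package was most likely installed into a different environment than the one running the interpreter.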

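Series

In pandas, a Series is a one-dimensional array that can hold any data type.

Let's start the Python interpreter from the command line like so:

python

From within the interpreter, import both the numpy and pandas packages into your namespace:

import numpy as np
import pandas as pd

Before we work with Series, let's take a look at what one generally looks like. Here, [data] and [index] are placeholders for a list of values and a list of labels:

s = pd.Series([data], index=[index])

You may notice that the data is structured like a Python list.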
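
To make the general form concrete, here is a small sketch that fills in both placeholders. The values and the labels 'a', 'b', and 'c' are made up purely for illustration and do not come from the tutorial's data.

```python
import pandas as pd

# Illustrative values and labels (assumptions, not tutorial data).
data = [1.5, 2.5, 3.5]
index = ['a', 'b', 'c']

s = pd.Series(data, index=index)

print(list(s.index))  # the labels we supplied
print(s.tolist())     # the values, back as a plain Python list
print(s.dtype)        # pandas infers one dtype for the whole Series
```

Because every value here is a float, pandas stores the Series with the float64 dtype.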

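Without Declaring an Index

We'll input integer data and provide a name parameter for the Series, but we'll leave out the index parameter to see how pandas populates it implicitly:

s = pd.Series([0, 1, 4, 9, 16, 25], name='Squares')

Now, let's call the Series so we can see what pandas does with it:

s

We'll see the following output, with the index in the left column and our data values in the right column. Below the columns is information about the name of the Series and the data type of its values.

[secondary_label Output]
0     0
1     1
2     4
3     9
4    16
5    25
Name: Squares, dtype: int64

Though we did not provide an index for the array, one was added implicitly: the integer values 0 through 5.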
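
The implicit index can also be inspected directly rather than read off the printed output. This sketch rebuilds the same Squares Series and looks at its index and name attributes.

```python
import pandas as pd

s = pd.Series([0, 1, 4, 9, 16, 25], name='Squares')

print(s.index)        # a RangeIndex generated by pandas
print(list(s.index))  # the integer labels 0 through 5
print(s.name)         # the name we passed in
```

The index is a RangeIndex, which describes the labels by their start, stop, and step instead of storing each one.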

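Declaring an Index

As the syntax above shows, we can also make a Series with an explicit index. We'll use data about the average depth in meters of the Earth's oceans:

avg_ocean_depth = pd.Series([1205, 3646, 3741, 4080, 3270], index=['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern'])

With the Series constructed, let's call it to see the output:

avg_ocean_depth

[secondary_label Output]
Arctic      1205
Atlantic    3646
Indian      3741
Pacific     4080
Southern    3270
dtype: int64

We can see that the index we provided is on the left, with the values on the right.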
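
The two columns in that output correspond to two attributes of the Series. As a short sketch, the labels and values can be pulled apart and paired up again like this:

```python
import pandas as pd

avg_ocean_depth = pd.Series(
    [1205, 3646, 3741, 4080, 3270],
    index=['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern'],
)

labels = avg_ocean_depth.index.tolist()   # left-hand column
values = avg_ocean_depth.values.tolist()  # right-hand column

print(labels)
print(values)

# Pair each label with its value, in the order they were declared.
for ocean, depth in zip(labels, values):
    print(ocean, depth)
```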

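Indexing and Slicing Series

With pandas Series, we can retrieve a value by its integer position. On a Series with string labels, .iloc makes the positional lookup explicit; the bare avg_ocean_depth[2] form is deprecated in recent pandas releases:

avg_ocean_depth.iloc[2]

[secondary_label Output]
3741

We can also slice by position to retrieve a range of values:

avg_ocean_depth.iloc[2:4]

[secondary_label Output]
Indian     3741
Pacific    4080
dtype: int64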

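Additionally, we can call the value of the index to return the value that it corresponds with:

avg_ocean_depth['Indian']

[secondary_label Output]
3741

We can also slice with the values of the index to return the corresponding values:

avg_ocean_depth['Indian':'Southern']

[secondary_label Output]
Indian      3741
Pacific     4080
Southern    3270
dtype: int64

Notice that in this last example, when slicing with index labels, both endpoints are inclusive rather than exclusive.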
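
The difference between the two slicing styles is easy to miss, so here is a small sketch that puts them side by side. It uses the explicit indexers .iloc (position-based, stop excluded) and .loc (label-based, both endpoints included), which pandas provides alongside the bracket syntax.

```python
import pandas as pd

avg_ocean_depth = pd.Series(
    [1205, 3646, 3741, 4080, 3270],
    index=['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern'],
)

# Positions 2, 3, and 4: the stop position 5 is excluded.
by_position = avg_ocean_depth.iloc[2:5]

# Labels 'Indian' through 'Southern': both endpoints are included.
by_label = avg_ocean_depth.loc['Indian':'Southern']

print(by_position)
print(by_label)
print(by_position.equals(by_label))  # the two selections match
```

To select the same three rows, the positional slice has to stop one past the last row, while the label slice names the last row itself.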

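Let's exit the Python interpreter with quit().

Series Initialized with Dictionaries

With pandas, we can also use the dictionary data type to initialize a Series. This way, we do not declare an index as a separate list but instead use the dictionary's keys as the index.

Let's create a file called ocean.py and add the following dictionary with a call to print it.

[label ocean.py]
import numpy as np
import pandas as pd

# The dictionary keys become the index labels of the Series.
avg_ocean_depth = pd.Series({
    'Arctic': 1205,
    'Atlantic': 3646,
    'Indian': 3741,
    'Pacific': 4080,
    'Southern': 3270
})

print(avg_ocean_depth)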

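Now we can run the file on the command line:

python ocean.py

We'll receive the following output:

[secondary_label Output]
Arctic      1205
Atlantic    3646
Indian      3741
Pacific     4080
Southern    3270
dtype: int64

The Series is displayed in an organized manner, with the index (made up of our keys) on the left and the set of values on the right.

This behaves like a Python dictionary in that you can access values by their keys, which we can do like so:

[label ocean.py]
...
print(avg_ocean_depth['Indian'])
print(avg_ocean_depth['Atlantic':'Indian'])

[secondary_label Output]
3741
Atlantic    3646
Indian      3741
dtype: int64

However, a Series is a pandas object rather than a dictionary, so you should not expect every dictionary method to be available on it.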
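
Although a Series is not a dict, a few dictionary-style operations do carry over, and to_dict() converts back when a real dictionary is needed. The sketch below illustrates this; the label 'Baltic' is a made-up key used only to show a missing lookup.

```python
import pandas as pd

avg_ocean_depth = pd.Series({
    'Arctic': 1205,
    'Atlantic': 3646,
    'Indian': 3741,
    'Pacific': 4080,
    'Southern': 3270
})

# Membership tests check the index labels, much like dictionary keys.
print('Indian' in avg_ocean_depth)

# get() returns a default instead of raising for a missing label.
# 'Baltic' is a hypothetical label that is not in the Series.
print(avg_ocean_depth.get('Baltic', -1))

# to_dict() produces an ordinary Python dictionary.
as_dict = avg_ocean_depth.to_dict()
print(type(as_dict).__name__)
print(as_dict['Pacific'])
```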

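Python dictionaries provide another way to set up Series in pandas.

DataFrames

DataFrames are 2-dimensional labeled data structures whose columns may hold different data types.

DataFrames are similar to spreadsheets or SQL tables. In general, when you are working with pandas, DataFrames will be the object you use most often.

To understand how the pandas DataFrame works, let's set up two Series and then pass those into a DataFrame. The first Series will be our avg_ocean_depth Series from before, and the second will be max_ocean_depth, which contains the maximum depth of each ocean on Earth in meters.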

[label ocean.py]
import numpy as np
import pandas as pd

avg_ocean_depth = pd.Series({
    'Arctic': 1205,
    'Atlantic': 3646,
    'Indian': 3741,
    'Pacific': 4080,
    'Southern': 3270
})

max_ocean_depth = pd.Series({
    'Arctic': 5567,
    'Atlantic': 8486,
    'Indian': 7906,
    'Pacific': 10803,
    'Southern': 7075
})

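With those two Series set up, let's add the DataFrame to the bottom of the file, below the max_ocean_depth Series. In our example, both Series have the same index labels, but if you had Series with different labels, the missing values would be labeled NaN.

The DataFrame is constructed from a dictionary so that we can include column labels: each key is a column label, and each value is one of our Series variables.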

[label ocean.py]
...
max_ocean_depth = pd.Series({
    'Arctic': 5567,
    'Atlantic': 8486,
    'Indian': 7906,
    'Pacific': 10803,
    'Southern': 7075
})

# Each dictionary key becomes a column label; each Series becomes a column.
ocean_depths = pd.DataFrame({
    'Avg. Depth (m)': avg_ocean_depth,
    'Max. Depth (m)': max_ocean_depth
})

print(ocean_depths)

[secondary_label Output]
          Avg. Depth (m)  Max. Depth (m)
Arctic              1205            5567
Atlantic            3646            8486
Indian              3741            7906
Pacific             4080           10803
Southern            3270            7075

The output shows our two column headings along with the numeric data under each, and the labels from the dictionary keys are on the left.
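
The earlier remark about Series with different labels can be sketched with a deliberately mismatched pair. The label 'Example Sea' and its value are invented for this illustration and are not real measurements.

```python
import pandas as pd

avg_ocean_depth = pd.Series({'Arctic': 1205, 'Atlantic': 3646})

# 'Example Sea' is a hypothetical label with a made-up value,
# included only so that the two indexes do not line up.
max_ocean_depth = pd.Series({'Arctic': 5567, 'Example Sea': 100})

depths = pd.DataFrame({
    'Avg. Depth (m)': avg_ocean_depth,
    'Max. Depth (m)': max_ocean_depth,
})

print(depths)

# Count the cells that pandas filled with NaN.
print(int(depths.isna().sum().sum()))
```

The resulting DataFrame has one row for every label that appears in either Series, and each cell without a matching value holds NaN.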

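Sorting Data in DataFrames

We can sort the data in a DataFrame with the DataFrame.sort_values(by=...) function (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values).

For example, let's use the ascending Boolean parameter, which can be either True or False. Note that ascending is a parameter we can pass to the function, but descending is not; to sort in descending order, pass ascending=False.

[label ocean.py]
...
print(ocean_depths.sort_values('Avg. Depth (m)', ascending=True))

[secondary_label Output]
          Avg. Depth (m)  Max. Depth (m)
Arctic              1205            5567
Southern            3270            7075
Atlantic            3646            8486
Indian              3741            7906
Pacific             4080           10803

Now the rows are ordered from the lowest to the highest value in the Avg. Depth (m) column, the left-most numeric column.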
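
Since there is no descending parameter, a descending sort is written as ascending=False. The sketch below rebuilds the same DataFrame from plain lists so that it runs on its own, then sorts by the maximum depth from deepest to shallowest.

```python
import pandas as pd

ocean_depths = pd.DataFrame(
    {
        'Avg. Depth (m)': [1205, 3646, 3741, 4080, 3270],
        'Max. Depth (m)': [5567, 8486, 7906, 10803, 7075],
    },
    index=['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern'],
)

# ascending=False reverses the order: largest values first.
deepest_first = ocean_depths.sort_values(by='Max. Depth (m)', ascending=False)

print(deepest_first)
print(deepest_first.index.tolist())
```

sort_values returns a new DataFrame, so ocean_depths itself keeps its original row order.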

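Statistical Analysis with DataFrames

Next, let's look at some summary statistics that we can gather from pandas with the DataFrame.describe() function (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.describe.html).

Without passing particular parameters, the DataFrame.describe() function provides the following information for numeric data types: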

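| Return | What it means |
| --- | --- |
| count | Frequency count; the number of times something occurs |
| mean | The mean, or average |
| std | The standard deviation, a numerical value used to indicate how widely the data varies |
| min | The minimum, or smallest number in the set |
| 25% | 25th percentile |
| 50% | 50th percentile |
| 75% | 75th percentile |
| max | The maximum, or largest number in the set |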

Let's have Python print this statistical data for us by calling the describe() function on our ocean_depths DataFrame:

[label ocean.py]
...
print(ocean_depths.describe())

When we run this program, we'll receive the following output:

[secondary_label Output]
       Avg. Depth (m)  Max. Depth (m)
count        5.000000        5.000000
mean      3188.400000     7967.400000
std       1145.671113     1928.188347
min       1205.000000     5567.000000
25%       3270.000000     7075.000000
50%       3646.000000     7906.000000
75%       3741.000000     8486.000000
max       4080.000000    10803.000000

Now you can compare this output with the original DataFrame and get a better sense of the average and maximum depths of the Earth's oceans when considered as a group.
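
The result of describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. As a rough sketch, the following picks out a few entries and checks them against the equivalent column methods.

```python
import pandas as pd

ocean_depths = pd.DataFrame(
    {
        'Avg. Depth (m)': [1205, 3646, 3741, 4080, 3270],
        'Max. Depth (m)': [5567, 8486, 7906, 10803, 7075],
    },
    index=['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern'],
)

summary = ocean_depths.describe()
avg = ocean_depths['Avg. Depth (m)']

# Rows of the summary are labeled with the statistic names.
print(summary.loc['mean', 'Avg. Depth (m)'], avg.mean())
print(summary.loc['std', 'Avg. Depth (m)'], avg.std())

# The 50th percentile is the median of the column.
print(summary.loc['50%', 'Avg. Depth (m)'], avg.median())
```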

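Handling Missing Values

Often when working with data, you will have missing values. The pandas package provides many different ways of working with missing data (http://pandas.pydata.org/pandas-docs/stable/missing_data.html), which refers to null data, or data that is not present for some reason.

We'll go over dropping missing values with the DataFrame.dropna() function and filling missing values with the DataFrame.fillna() function (http://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-missing-values-fillna).

Let's create a new file called user_data.py, populate it with some data that has missing values, and turn it into a DataFrame: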

[label user_data.py]
import numpy as np
import pandas as pd

# np.nan marks each missing value.
user_data = {'first_name': ['Sammy', 'Jesse', np.nan, 'Jamie'],
             'last_name': ['Shark', 'Octopus', np.nan, 'Mantis shrimp'],
             'online': [True, np.nan, False, True],
             'followers': [987, 432, 321, np.nan]}

df = pd.DataFrame(user_data, columns=['first_name', 'last_name', 'online', 'followers'])

print(df)

When we run the program, our call to print shows us the following output:

[secondary_label Output]
  first_name      last_name online  followers
0      Sammy          Shark   True      987.0
1      Jesse        Octopus    NaN      432.0
2        NaN            NaN  False      321.0
3      Jamie  Mantis shrimp   True        NaN

There are quite a few missing values here.
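
On a small table the gaps are easy to spot by eye, but on larger data it helps to count them. This sketch, which assumes a pandas release recent enough to provide DataFrame.isna(), tallies the missing values per column.

```python
import numpy as np
import pandas as pd

user_data = {'first_name': ['Sammy', 'Jesse', np.nan, 'Jamie'],
             'last_name': ['Shark', 'Octopus', np.nan, 'Mantis shrimp'],
             'online': [True, np.nan, False, True],
             'followers': [987, 432, 321, np.nan]}

df = pd.DataFrame(user_data, columns=['first_name', 'last_name', 'online', 'followers'])

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()

print(missing_per_column)
print(int(missing_per_column.sum()))  # total number of missing cells
```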

First, let's drop the missing values with dropna().

[label user_data.py]
...
df_drop_missing = df.dropna()

print(df_drop_missing)

Since only one row in our small data set has no missing values at all, it is the only row that remains when we run the program:

[secondary_label Output]
  first_name last_name online  followers
0      Sammy     Shark   True      987.0
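
Dropping every row that contains any missing value can discard a lot of data. As a hedged sketch of two gentler options, dropna() also accepts a subset parameter, which limits the check to certain columns, and a how parameter, which can require that all values in a row be missing before it is dropped.

```python
import numpy as np
import pandas as pd

user_data = {'first_name': ['Sammy', 'Jesse', np.nan, 'Jamie'],
             'last_name': ['Shark', 'Octopus', np.nan, 'Mantis shrimp'],
             'online': [True, np.nan, False, True],
             'followers': [987, 432, 321, np.nan]}

df = pd.DataFrame(user_data, columns=['first_name', 'last_name', 'online', 'followers'])

# Only drop rows that are missing a first or last name.
has_name = df.dropna(subset=['first_name', 'last_name'])
print(has_name)

# Only drop rows in which every column is missing (none are, here).
not_empty = df.dropna(how='all')
print(len(not_empty))
```

Which variant is appropriate depends on which columns the later analysis actually needs.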

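As an alternative to dropping the values, we can instead fill the missing values with a value of our choice, such as 0. We will achieve this with DataFrame.fillna(0).

Delete or comment out the last two lines we added to our file, and add the following:

[label user_data.py]
...
df_fill = df.fillna(0)

print(df_fill)

When we run the program, we'll receive the following output:

[secondary_label Output]
  first_name      last_name online  followers
0      Sammy          Shark   True      987.0
1      Jesse        Octopus      0      432.0
2          0              0  False      321.0
3      Jamie  Mantis shrimp   True        0.0

Now all of our columns and rows are intact, and instead of NaN we have 0 populating those spaces. Note that the same 0 is used in every column, including the name columns, where a number is not a very meaningful replacement.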
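
A single fill value does not always suit every column. One possible refinement, sketched here, is to pass fillna() a dictionary that maps column names to fill values; the replacement string 'Unknown' is an arbitrary choice made for this example.

```python
import numpy as np
import pandas as pd

user_data = {'first_name': ['Sammy', 'Jesse', np.nan, 'Jamie'],
             'last_name': ['Shark', 'Octopus', np.nan, 'Mantis shrimp'],
             'online': [True, np.nan, False, True],
             'followers': [987, 432, 321, np.nan]}

df = pd.DataFrame(user_data, columns=['first_name', 'last_name', 'online', 'followers'])

# Map each column to its own fill value; 'Unknown' is an illustrative choice.
# Columns that are not listed (here, 'online') keep their NaN values.
df_fill = df.fillna({'first_name': 'Unknown',
                     'last_name': 'Unknown',
                     'followers': 0})

print(df_fill)
```

Leaving the online column untouched keeps "unknown" distinct from False, which may matter if that column is analyzed later.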

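At this point, you can sort data, do statistical analysis, and handle missing values in DataFrames.

Conclusion

This tutorial covered introductory information for data analysis with pandas and Python 3. You should now have pandas installed, and can work with the Series and DataFrames data structures within pandas.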