How To Work with Web Data Using Requests and Beautiful Soup with Python 3

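Introduction

The web provides more data than any of us can read and understand, so we often want to work with that information programmatically in order to make sense of it.

This tutorial goes over how to work with the Requests and Beautiful Soup Python packages in order to make use of data from web pages. The Requests module lets you integrate your Python programs with web services, while the Beautiful Soup module is designed to make screen scraping quick to do. Using the Python interactive console and these two libraries, we'll go through how to collect a web page and work with the textual information available there.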

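Prerequisites

To complete this tutorial, you'll need a development environment for Python 3. You can follow the appropriate guide for your operating system, available from the series How To Install and Set Up a Local Programming Environment for Python 3 or How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 20.04 Server, to configure everything you need.

Additionally, you should be familiar with:

With your development environment set up and these Python programming concepts in mind, let's get started working with Requests and Beautiful Soup.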

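Installing Requests

Let's begin by activating our Python 3 programming environment. Make sure you're in the directory where your environment is located, and run the following command:

. my_env/bin/activate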

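In order to work with web pages, we're going to need to request the page. The Requests library allows you to make use of HTTP within your Python programs in a human-readable way.

With our programming environment activated, we'll install Requests with pip:

pip install requests

While the Requests library is being installed, you'll receive the following output: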

Output
Collecting requests
  Downloading requests-2.26.0-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 3.1MB/s
...
Installing collected packages: chardet, urllib3, certifi, idna, requests
Successfully installed certifi-2017.4.17 chardet-3.0.4 idna-2.5 requests-2.26.0 urllib3-1.21.1

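If Requests was previously installed, you would have received feedback similar to the following from your terminal window:

Output
Requirement already satisfied
...

With Requests installed into our programming environment, we can go on to install the next module.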

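Installing Beautiful Soup

Just as we did with Requests, we'll install Beautiful Soup with pip. The current version, Beautiful Soup 4, can be installed with the following command:

pip install beautifulsoup4

Once you run this command, you should see output similar to the following: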

Output
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 6.8 MB/s
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.1-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.10.0 soupsieve-2.3.1

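Now that both Beautiful Soup and Requests are installed, we can move on to understanding how to work with the libraries to scrape websites.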
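Before moving on, it can be handy to confirm that both packages are visible to the active environment. The following is a minimal sketch using the standard library's importlib.metadata (available in Python 3.8 and later); the helper name installed_version is an assumption made for illustration and is not part of either library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the state of the two packages this tutorial relies on.
for name in ("requests", "beautifulsoup4"):
    version = installed_version(name)
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

If either package is reported as not installed, check that the environment is activated and rerun the corresponding pip command.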

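Collecting a Web Page with Requests

With the two Python libraries we'll be using now installed, we can familiarize ourselves with stepping through a basic web page.

Let's first move into the Python interactive console:

python

From here, we'll import the Requests module so that we can collect a sample web page:

import requests

We'll assign the URL (below) of the sample web page, mockturtle.html, to the variable url: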

url = 'https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html'

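Next, we can assign the result of a request for that page to the variable page with the requests.get() method, passing it the url variable.

page = requests.get(url)

The variable page is assigned a Response object: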

>>> page
<Response [200]>
>>>

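The Response object above shows us its status_code attribute in square brackets (in this case 200). We can also read this attribute directly:

>>> page.status_code
200
>>>

The returned code of 200 tells us that the page downloaded successfully. Codes that begin with the number 2 generally indicate success, while codes that begin with a 4 (client errors) or a 5 (server errors) indicate that something went wrong.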
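In a script, as opposed to the interactive console, the status check described above is often folded into a small helper. The sketch below is one way to do that under a few assumptions: the function name fetch_page and the 10-second timeout are illustrative choices, not something this tutorial prescribes. It relies on raise_for_status(), which Requests provides to turn 4xx and 5xx responses into exceptions.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the page text, or None if the request fails or returns an error status."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes.
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for errors raised by Requests,
        # including connection problems, timeouts, and malformed URLs.
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling fetch_page(url) with the url variable defined earlier should return the same string as page.text when the download succeeds.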

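In order to work with web data, we're going to want to access the text-based content of web files. We can read the content of the server's response with page.text (or page.content if we would like to access the response in bytes).

page.text

Once we press ENTER, we'll receive the following output: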

Output
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" 
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta 
http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle 
Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" 
id="first">Beautiful Soup, so rich and green,<br />\n Waiting in a hot tureen!<br />\n Who for 
such dainties would not stoop?<br />\n Soup of the evening, beautiful Soup!<br />\n Soup of 
the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful 
Soo--oop!<br />\n Beau--ootiful Soo--oop!<br />\n Soo--oop of the e--e--evening,<br />\n  
Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares 
for fish,<br />\n Game or any other dish?<br />\n Who would not give all else for two<br />\n  
Pennyworth only of Beautiful Soup?<br />\n Pennyworth only of beautiful Soup?<br /></p>\n\n  
<p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br />\n Beau--ootiful Soo--oop!<br />\n  
Soo--oop of the e--e--evening,<br />\n Beautiful, beauti--FUL SOUP!<br 
/></p>\n</body>\n</html>\n'
>>>

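Here we see that the full text of the page was printed out, with all of its HTML tags. However, it is difficult to read because there is not much spacing.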
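The difference between page.text and page.content mentioned earlier comes down to strings versus bytes. The following sketch illustrates the relationship without a network call; the sample markup and the UTF-8 encoding are assumptions chosen for the example, whereas Requests picks the encoding for page.text from the response itself.

```python
# Bytes as they might arrive from a server (the kind of value page.content holds).
raw_bytes = "<p>café</p>".encode("utf-8")

# Decoding the bytes produces a string (the kind of value page.text holds).
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, len(raw_bytes))
print(type(decoded_text).__name__, len(decoded_text))
```

The two lengths differ because the accented character occupies two bytes in UTF-8 but counts as a single character in the decoded string.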

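In the next section, we can leverage the Beautiful Soup module to work with this textual data in a more human-friendly manner.

Stepping Through a Page with Beautiful Soup

The Beautiful Soup library creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags, known as tag soup, and other malformed markup). This functionality will make the web page text more readable than what we saw coming from the Requests module.

To start, we'll import Beautiful Soup into the Python console:

from bs4 import BeautifulSoup

Next, we'll run the page.text document through the module to give us a BeautifulSoup object; that is, a parse tree of this page, which we get by running Python's built-in html.parser over the HTML.

soup = BeautifulSoup(page.text, 'html.parser')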
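As an aside on the malformed markup mentioned earlier, the sketch below feeds a deliberately broken fragment to the same parser. The fragment and the variable names messy_html and messy_soup are made up for illustration, so that the soup object created above is left untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed fragment: neither <p> nor <b> is ever closed.
messy_html = "<p>Beautiful <b>Soup"
messy_soup = BeautifulSoup(messy_html, "html.parser")

# The parse tree still contains both tags, and serializing it closes them.
print(messy_soup.b.get_text())
print(str(messy_soup))
```

Different parsers can repair broken markup in different ways, so the exact tree may vary if you swap html.parser for another parser.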

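To show the contents of the page on the terminal, we can print it with the prettify() method in order to turn the Beautiful Soup parse tree into a nicely formatted Unicode string.

print(soup.prettify())

This will render each HTML tag on its own line: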

Output
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
 ...
</html>

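In the output above, we can see that there is one tag per line and also that the tags are nested because of the tree structure used by Beautiful Soup.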
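The nesting described above can also be explored directly, since each tag in the tree knows its parent and its children. The following sketch assumes a shortened copy of the sample page pasted in as a string (so it runs without a network connection) and uses the variable name tree to avoid clashing with the soup object from the console session.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the sample page, embedded so the example is self-contained.
html = """<html><head><title>Turtle Soup</title></head>
<body><h1>Turtle Soup</h1>
<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
Waiting in a hot tureen!</p></body></html>"""

tree = BeautifulSoup(html, "html.parser")

# Attribute access walks down the tree to the first matching tag.
title_text = tree.title.string

# .parent walks back up: the <h1> tag sits inside <body>.
h1_parent = tree.h1.parent.name

# .children yields both text nodes and tags; keep only the tags.
p_child_tags = [child.name for child in tree.p.children if isinstance(child, Tag)]

print(title_text, h1_parent, p_child_tags)
```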

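Finding Instances of a Tag

We can extract a single tag from a page by using Beautiful Soup's find_all method. This will return all instances of a given tag within a document.

soup.find_all('p')

Running that method on our object returns the full text of the song along with the relevant <p> tags and any tags contained within the requested tag, which here include the line break tags <br/>: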

Output
[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
  Waiting in a hot tureen!<br/>
  Who for such dainties would not stoop?<br/>
  Soup of the evening, beautiful Soup!<br/>
  Soup of the evening, beautiful Soup!<br/></p>, <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
...
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beauti--FUL SOUP!<br/></p>]

You will notice in the output above that the data is contained in square brackets [ ]. This means it is a Python list data type (https://andsky.com/tech/tutorials/understanding-lists-in-python-3).
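Because the result behaves like a list, the usual list operations apply to it. The sketch below shows a few of them on a shortened, embedded copy of the sample markup; the variable names paragraphs and paragraph_ids are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page, embedded so the example is self-contained.
html = """<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
<p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>"""

paragraphs = BeautifulSoup(html, "html.parser").find_all("p")

# The result supports len(), indexing, and iteration like any list.
print(isinstance(paragraphs, list))
print(len(paragraphs))

# Tag attributes are read with dictionary-style access.
paragraph_ids = [p["id"] for p in paragraphs]
print(paragraph_ids)
```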

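Because it is a list, we can call a particular item within it (for example, the third <p> element), and use the get_text() method to extract all the text from inside that tag:

soup.find_all('p')[2].get_text()

The output that we receive will be the contents of the third <p> element in this case: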

Output
'Beautiful Soup! Who cares for fish,\n Game or any other dish?\n Who would not give all else for two\n Pennyworth only of Beautiful Soup?\n Pennyworth only of beautiful Soup?'

Note that the \n line breaks are also shown in the returned string above.
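If those line breaks and the leading spaces after them get in the way, Beautiful Soup offers a couple of ways to tidy the text. The following sketch uses the stripped_strings generator and the separator and strip parameters of get_text() on a shortened, embedded copy of the third verse; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the third verse, embedded so the example is self-contained.
html = """<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
 Game or any other dish?<br/>
 Who would not give all else for two<br/></p>"""

verse = BeautifulSoup(html, "html.parser").p

# One list item per line of the verse, with surrounding whitespace removed.
lines = list(verse.stripped_strings)

# A single string with the same pieces joined by one space each.
joined = verse.get_text(separator=" ", strip=True)

print(lines)
print(joined)
```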

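Finding Tags by Class and ID

HTML attributes that CSS selectors refer to, such as class and ID, can be helpful to look at when working with web data using Beautiful Soup. We can target specific classes and IDs by using the find_all() method and passing the class and ID strings as arguments.

First, let's find all of the instances of the class chorus. Because class is a reserved word in Python, Beautiful Soup has us assign the string for the class to the keyword argument class_:

soup.find_all(class_='chorus')

When we run the above line, we'll receive the following list as output: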

Output
[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beautiful Soup!<br/></p>, <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beauti--FUL SOUP!<br/></p>]

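The two <p>-tagged sections with the class of chorus were printed out to the terminal.

We can also specify that we want to search for the class chorus only within <p> tags, in case it is used for more than one tag:

soup.find_all('p', class_='chorus')

Running the line above will produce the same output as before.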
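On the sample page both calls match the same elements, so the benefit of naming the tag is easy to miss. The sketch below uses hypothetical markup, not taken from mockturtle.html, in which the chorus class appears on a <div> as well as a <p>, to show where the two calls diverge.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which the chorus class appears on two different tags.
html = """<div class="chorus">Chorus heading</div>
<p class="chorus">Beau--ootiful Soo--oop!</p>
<p class="verse">Beautiful Soup, so rich and green,</p>"""

sample = BeautifulSoup(html, "html.parser")

# Matching on class alone returns every tag that carries the class.
any_tag = sample.find_all(class_="chorus")

# Adding the tag name narrows the match to <p> elements only.
only_p = sample.find_all("p", class_="chorus")

print([tag.name for tag in any_tag])
print([tag.name for tag in only_p])
```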

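We can also use Beautiful Soup to target IDs associated with HTML tags. In this case we will assign the string 'third' to the keyword argument id:

soup.find_all(id='third')

Once we run the line above, we'll receive the following output: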

Output
[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
  Game or any other dish?<br/>
  Who would not give all else for two<br/>
  Pennyworth only of Beautiful Soup?<br/>
  Pennyworth only of beautiful Soup?<br/></p>]

The text associated with the <p> tag that has the ID of third is printed out to the terminal along with the relevant tags.
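Since an ID is meant to be unique within a page, it is often more convenient to get the single element back rather than a one-item list. The following sketch shows two alternatives that Beautiful Soup provides, find() and the CSS-selector method select_one(), on a shortened, embedded copy of the sample markup. The variable names are illustrative, and select_one() relies on the soupsieve package that pip installed alongside Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page, embedded so the example is self-contained.
html = """<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
<p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>"""

sample = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag itself, not a list.
third = sample.find(id="third")

# When nothing matches, find() returns None instead of an empty list.
missing = sample.find(id="fifth")

# The same element, located with a CSS selector.
via_css = sample.select_one("p#third")

print(third.get_text())
print(missing)
```

Checking for None before calling methods on the result avoids an AttributeError when the ID is absent from the page.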

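Conclusion

This tutorial took you through retrieving a web page with the Requests module in Python and doing some preliminary scraping of that web page's textual data in order to gain an understanding of Beautiful Soup.

From here, you can go on to create a web scraping program that builds a CSV file out of data collected from the web by following the tutorial How To Scrape Web Pages with Beautiful Soup and Python 3 (https://andsky.com/tech/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3).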