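How To Scrape Web Pages with Beautiful Soup and Python 3

Introduction

Many data analysis, big data, and machine learning projects require scraping websites to gather the data you will be working with. The Python programming language is widely used in the data science community, so it has an ecosystem of modules and tools that you can use in your own projects.

Beautiful Soup, an allusion to the Mock Turtle's song (https://en.wikipedia.org/wiki/Mock_Turtle) found in Chapter 10 of Lewis Carroll's Alice's Adventures in Wonderland, is a Python library that allows for quick turnaround on web scraping projects. Available as Beautiful Soup 4 and, at the time of writing, compatible with both Python 2.7 and Python 3, Beautiful Soup creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup (https://en.wikipedia.org/wiki/Tag_soup) and other malformed markup).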
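
As a quick illustration of that tolerance for malformed markup, the following minimal sketch parses a hypothetical fragment whose <p> and <b> tags are never closed. The fragment is invented for this example; html.parser is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Hypothetical "tag soup": neither <p> nor <b> is ever closed.
broken_html = '<p>Artists beginning with <b>Z'

soup = BeautifulSoup(broken_html, 'html.parser')

# Beautiful Soup still builds a tree, so the tags can be looked up by name.
paragraph_text = soup.p.get_text()
bold_text = soup.b.get_text()

print(paragraph_text)
print(bold_text)
```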

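In this tutorial, we will collect and parse a web page in order to grab textual data, then write the information we have gathered to a CSV file.

Prerequisites

Before working through this tutorial, you should have a local or server-based Python programming environment set up on your machine.

You should also have the Requests and Beautiful Soup modules installed, which you can achieve by following our tutorial How To Work with Web Data Using Requests and Beautiful Soup with Python 3 (https://andsky.com/tech/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3).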
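
If you are unsure whether both modules are available in your environment, a short check like the one below can tell you. This is only a sketch: the helper name missing_modules is hypothetical, and it relies on the standard library's importlib.util.find_spec(), which returns None for a top-level module that cannot be found.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# 'requests' and 'bs4' are the import names of Requests and Beautiful Soup 4.
missing = missing_modules(['requests', 'bs4'])
if missing:
    print('Not installed:', ', '.join(missing))
else:
    print('Requests and Beautiful Soup are ready to use.')
```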

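Additionally, since we will be working with data scraped from the web, you should be comfortable with HTML structure and tagging.

Understanding the Data

In this tutorial, we'll be working with data from the official website of the National Gallery of Art (https://www.nga.gov/) in the United States. The National Gallery is an art museum located on the National Mall in Washington, D.C. It holds over 120,000 pieces, dating from the Renaissance to the present day, made by more than 13,000 artists.

We would like to search the Index of Artists, which, at the time of updating this tutorial, is available via the Internet Archive's Wayback Machine at the following URL: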

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

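Note: The long URL above is due to this website having been archived by the Internet Archive.

The Internet Archive is a non-profit digital library that provides free access to internet sites and other digital media. This organization takes snapshots of websites to preserve their histories, so we can currently access an older version of the National Gallery's site that was available when this tutorial was first written.

Beneath the Internet Archive's header, you'll see a page that looks like this: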

Index of Artists Landing Page

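Since we are doing this project in order to learn about web scraping with Beautiful Soup, we don't need to pull too much data from the site, so let's limit the scope of the artist data we want to scrape. We'll choose a single letter, Z in this example, which gives us a page that looks like this: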

Artist names beginning with Z list

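In the page above, we see that the first artist listed at the time of writing is Zabaglia, Niccola, which is worth noting for when we start pulling data. We'll start by working with this first page, which has the following URL for the letter Z: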

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm

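It is important to note for later how many pages there are in total for the letter you choose, which you can discover by clicking through to the last page of artists. In this case, there are 4 pages in total, and the last artist listed at the time of writing is Zykmund, Václav. The last page of Z artists has the following URL:

https://web.archive.org/web/20121010201041/http://www.nga.gov/collection/anZ4.htm

However, you can also access the above page by using the same Internet Archive numeric string as the first page: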

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ4.htm

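This is important to note because we'll be iterating through these pages later in this tutorial.

To begin to familiarize yourself with how this web page is set up, you can take a look at its DOM, which will help you understand how the HTML is structured.

Importing the Libraries

To begin our coding project, let's activate our Python 3 programming environment. Make sure you're in the directory where your environment is located, and run the following command:

. my_env/bin/activate

With our programming environment activated, we'll create a new file, with nano for instance. You can name your file whatever you like; we'll call it nga_z_artists.py in this tutorial.

nano nga_z_artists.py

Within this file, we can begin to import the libraries we'll be using: Requests and Beautiful Soup.

The Requests library allows you to make use of HTTP within your Python programs in a human-readable way, and the Beautiful Soup module is designed to get web scraping done quickly.

For Beautiful Soup, we'll be importing from bs4, the package in which Beautiful Soup 4 is found.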

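[label nga_z_artists.py]
# Import libraries
import requests
from bs4 import BeautifulSoup

With both the Requests and Beautiful Soup modules imported, we can move on to first collecting a page and then parsing it.

Collecting and Parsing a Web Page

The next step is to collect the first web page with Requests. We'll pass the URL of the first page to the requests.get() method and assign the response to the variable page.

[label nga_z_artists.py]
import requests
from bs4 import BeautifulSoup

# Collect first page of artists' list
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

Note: Because the URL is lengthy, the code above and throughout this tutorial will not pass PEP 8 E501 (https://www.python.org/dev/peps/pep-0008/#maximum-line-length), which flags lines longer than 79 characters. You may want to assign the URL to a variable to make the code more readable in final versions. The code in this tutorial is for demonstration purposes and will allow you to swap in shorter URLs as part of your own projects.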
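
One way to follow that suggestion is to build the URL from shorter pieces. The sketch below uses hypothetical constant names (ARCHIVE_PREFIX, FIRST_PAGE_PATH) that are not part of the tutorial's program; the resulting string is the same URL passed to requests.get() above.

```python
# Hypothetical names, introduced only for this illustration.
ARCHIVE_PREFIX = 'https://web.archive.org/web/20121007172955/'
FIRST_PAGE_PATH = 'https://www.nga.gov/collection/anZ1.htm'

# Each source line stays under 79 characters; the value is unchanged.
first_page_url = ARCHIVE_PREFIX + FIRST_PAGE_PATH

print(first_page_url)
```

The variable could then be passed along as requests.get(first_page_url).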

We'll now create a BeautifulSoup object, or parse tree. This object takes as its arguments the page.text document from Requests (the content of the server's response) and the name of the parser to use, here Python's built-in html.parser.

[label nga_z_artists.py]
import requests
from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
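
To get a feel for what a BeautifulSoup object offers without making a network request, here is a small sketch that parses a hypothetical HTML string instead of page.text. The markup is invented and only loosely resembles the gallery page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.text; not the real gallery markup.
sample_html = """
<html>
  <head><title>Index of Artists</title></head>
  <body><div class="BodyText"><a href="/artist/1">Zabaglia, Niccola</a></div></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached as attributes of the parse tree.
page_title = sample_soup.title.get_text()
first_link_text = sample_soup.a.get_text()

print(page_title)
print(first_link_text)
```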

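With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting the data that we would like.

Pulling Text From a Web Page

For this project, we'll collect artists' names and the relevant links available on the website. You may want to collect different data, such as the artists' nationality and dates.

To find out how that data is described in the page's DOM, in your web browser, right-click (or CTRL + click on macOS) on the first artist's name, Zabaglia, Niccola. Within the context menu that pops up, you should see a menu item similar to Inspect Element (Firefox) or Inspect (Chrome).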

Context Menu — Inspect Element

Once you click on the relevant Inspect menu item, the web developer tools should appear within your browser.

Web Page Inspector

We'll see first that the table of names is within <div> tags where class="BodyText". This is important to note so that we only search for text within this section of the web page. We also notice that the name Zabaglia, Niccola is in a link tag, since the name references a web page that describes the artist, so we will want to target the <a> tag.

To do this, we'll use Beautiful Soup's find() and find_all() methods to pull the artists' names from the BodyText <div>.

[label nga_z_artists.py]
import requests
from bs4 import BeautifulSoup

# Collect and parse first page
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')

# Find the first element with the BodyText class (the div holding the names)
artist_name_list = soup.find(class_='BodyText')

# Collect every <a> tag within the BodyText div
artist_name_list_items = artist_name_list.find_all('a')
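
The difference between the two methods can be seen on a small, hypothetical fragment: find() returns only the first matching tag (or None when nothing matches), while find_all() returns a list-like collection of every match. The markup and variable names below are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the artist listing.
fragment = """
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <a href="/artist/2">Zaccone, Fabian</a>
</div>
<a href="/elsewhere">Outside the div</a>
"""

fragment_soup = BeautifulSoup(fragment, 'html.parser')

body_text = fragment_soup.find(class_='BodyText')    # first match only
links_in_div = body_text.find_all('a')               # every <a> inside it
no_match = fragment_soup.find(class_='NotThere')     # None when absent

print(len(links_in_div))
print(no_match)
```

Searching from body_text rather than from fragment_soup is what keeps the link outside the div out of the results.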

Next, at the bottom of our program file, we will want to create a for loop to iterate over all the artist names that we just put into the artist_name_list_items variable.

We'll print these names out with the prettify() method, which turns the Beautiful Soup parse tree into a nicely formatted Unicode string.

[label nga_z_artists.py]
...
artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

Let's run the program as we have it so far:

python nga_z_artists.py

Once we do so, we'll receive the following output:

[secondary_label Output]
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
...
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427">
 Zao Wou-Ki
</a>
<a href="/web/20121007172955/https://www.nga.gov/collection/anZ2.htm">
 Zas-Zie
</a>

<a href="/web/20121007172955/https://www.nga.gov/collection/anZ3.htm">
 Zie-Zor
</a>

<a href="/web/20121007172955/https://www.nga.gov/collection/anZ4.htm">
 <strong>
  next
  <br/>
  page
 </strong>
</a>

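What we see in the output at this point is the full text and tags of every <a> tag found within the <div class="BodyText"> tag on the first page: the artists' names, plus some additional link text at the bottom.

Removing Superfluous Data

So far, we have been able to collect all the link text data within one <div> section of our web page. However, we don't want the bottom links, which don't reference artists' names, so let's work on removing that part.

To remove the bottom links of the page, let's again right-click and inspect the DOM. We'll see that the links at the bottom of the <div class="BodyText"> section are contained in an HTML table with the class AlphaNav: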

Links in AlphaNav HTML Table

We can therefore use Beautiful Soup to find the AlphaNav class and use the decompose() method, which removes a tag from the parse tree and then destroys it along with its contents.

We'll use the variable last_links to reference these bottom links and add the following to the program file:

[label nga_z_artists.py]
import requests
from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

# Remove bottom links
last_links = soup.find(class_='AlphaNav')
last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    print(artist_name.prettify())
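
Because find() returns None when no element matches, calling decompose() on its result raises an AttributeError on a page that has no AlphaNav table. A defensive variant, sketched here on hypothetical markup, checks the result first.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with a navigation table inside the div.
nav_html = """
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <table class="AlphaNav"><tr><td><a href="/anZ2.htm">Zas-Zie</a></td></tr></table>
</div>
"""

nav_soup = BeautifulSoup(nav_html, 'html.parser')
links_before = len(nav_soup.find_all('a'))

# Only decompose when the table was actually found.
nav_table = nav_soup.find(class_='AlphaNav')
if nav_table is not None:
    nav_table.decompose()

links_after = len(nav_soup.find_all('a'))
print(links_before, links_after)
```

After the table is decomposed, only the artist link remains in the tree.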

Now, when we run the program with the python nga_z_artists.py command, we'll receive the following output:

[secondary_label Output]
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
...
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11631">
 Zanotti, Giampietro
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3427">
 Zao Wou-Ki
</a>

At this point, we see that the output no longer includes the links at the bottom of the web page, and now only displays the links associated with artists' names.

So far we have targeted the links with the artists' names specifically, but the output still carries tag data that we don't really want.

Pulling the Contents from a Tag

To access only the actual artists' names, we'll want to target the contents of the <a> tags rather than print out the entire link tag.

We can do this with Beautiful Soup's .contents, which returns the tag's children as a Python list.

Let's revise the for loop so that instead of printing the entire link and its tag, we print the first item of the list of children (i.e. the artist's full name):

[label nga_z_artists.py]
import requests
from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')
last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Use .contents to pull out the <a> tag's children
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    print(names)

Note that we take the first item of each .contents list by calling its index number, 0.
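
Indexing .contents[0] works here because each artist link holds a single text node. When a tag wraps other tags, as the next page link in the earlier output does, the first child is itself a tag. The sketch below, on hypothetical markup, contrasts .contents with get_text(), which gathers the text of all descendants.

```python
from bs4 import BeautifulSoup

# Hypothetical links: one plain, one with nested tags.
links_html = (
    '<a href="/artist/1">Zabaglia, Niccola</a>'
    '<a href="/anZ2.htm"><strong>next<br/>page</strong></a>'
)

links_soup = BeautifulSoup(links_html, 'html.parser')
plain_link, nested_link = links_soup.find_all('a')

plain_first_child = plain_link.contents[0]     # a text node
nested_first_child = nested_link.contents[0]   # the <strong> tag
nested_text = nested_link.get_text(' ')        # text joined with spaces

print(plain_first_child)
print(nested_first_child.name)
print(nested_text)
```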

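We can run the program with the python command to view the following output:

[secondary_label Output]
Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
...
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki

We have received back a list of all the artists' names available on the first page of the letter Z.

However, what if we also want to capture the URLs associated with those artists? We can extract the URLs found within a page's <a> tags by using Beautiful Soup's get('href') method.

From the output of the links above, we know that the entire URL is not being captured, so we will concatenate (https://andsky.com/tech/tutorials/an-introduction-to-working-with-strings-in-python-3#string-concatenation) the link string with the front of the URL string (in this case https://web.archive.org).

We'll add these lines to the for loop as well:

[label nga_z_artists.py]
...
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    print(names)
    print(links)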
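
Plain concatenation assumes every href is present and begins with a slash. A slightly more defensive helper, shown here as a sketch with a hypothetical function name, leaves absolute URLs alone and returns None when the attribute is missing (tag.get('href') returns None in that case).

```python
ARCHIVE_HOST = 'https://web.archive.org'


def absolute_link(href, host=ARCHIVE_HOST):
    """Hypothetical helper: turn a site-relative href into a full URL."""
    if href is None:
        return None                  # the <a> tag had no href attribute
    if href.startswith(('http://', 'https://')):
        return href                  # already absolute, leave untouched
    return host + href               # same concatenation as the tutorial


example = absolute_link(
    '/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630')
print(example)
```

Inside the loop, this would be called as absolute_link(artist_name.get('href')).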

When we run the program above, we'll receive both the artists' names and the URLs of the pages that tell us more about them:

[secondary_label Output]
Zabaglia, Niccola
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
Zaccone, Fabian
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
...
Zanotti, Giampietro
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11631
Zao Wou-Ki
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427

Although we are now getting information from the website, it is currently just printing to our terminal window. Let's instead capture this data so that we can use it elsewhere by writing it to a file.

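Writing the Data to a CSV File

Collecting data that only lives in a terminal window is not very useful. Comma-separated values (CSV) files allow us to store tabular data in plain text, and are a common format for spreadsheets and databases.

First, we need to import Python's built-in csv module along with the other modules at the top of the Python programming file:

import csv

Next, we'll create and open a file called z-artist-names.csv to write to, using the 'w' mode, and wrap it in a CSV writer that we'll assign to the variable f. Passing newline='' is what the csv module documentation recommends for files handed to a writer, and encoding='utf-8' keeps accented names such as Václav intact regardless of platform. We'll also write the header row, Name and Link, by passing a list to the writerow() method:

f = csv.writer(open('z-artist-names.csv', 'w', newline='', encoding='utf-8'))
f.writerow(['Name', 'Link'])

Finally, within our for loop, we'll write each row with the artist's name and its associated link:

f.writerow([names, links])

You can see the lines for each of these tasks in the file below:

[label nga_z_artists.py]
import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')
last_links.decompose()

# Create a file to write to, add headers row
f = csv.writer(open('z-artist-names.csv', 'w', newline='', encoding='utf-8'))
f.writerow(['Name', 'Link'])

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')

    # Add each artist's name and associated link to a row
    f.writerow([names, links])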
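
The program above never explicitly closes the file it opens, so the rows are only guaranteed to be flushed to disk once the interpreter exits. A common alternative is a with block, which closes the file when the block ends. The sketch below is self-contained: it uses hypothetical sample rows and a temporary directory instead of scraped data.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped names and links.
sample_rows = [
    ['Zabaglia, Niccola', 'https://example.com/artist/11630'],
    ['Zao Wou-Ki', 'https://example.com/artist/3427'],
]

csv_path = os.path.join(tempfile.mkdtemp(), 'z-artist-names.csv')

# newline='' is what the csv module documentation asks for when opening files.
with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Name', 'Link'])
    writer.writerows(sample_rows)

# Read the file back to confirm what was written.
with open(csv_path, newline='', encoding='utf-8') as csv_file:
    written_text = csv_file.read()

print(written_text)
```

Note that the writer quotes only the name containing a comma, which is how the comma inside "Zabaglia, Niccola" is kept from being read as a column separator.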

When you run the program now with the python command, no output will be returned to your terminal window. Instead, a file called z-artist-names.csv will be created in the directory you are working in.

Depending on what you use to open it, it may look something like this:

[label z-artist-names.csv]
Name,Link
"Zabaglia, Niccola",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630
"Zaccone, Fabian",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202
"Zadkine, Ossip",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475w
...

Or, it may look more like a spreadsheet:

CSV Spreadsheet

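In either case, you can now use this file to work with the data in more meaningful ways, since the information you have collected is now stored in a file on your computer.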
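
As one example of working with the file, the csv module's DictReader yields each row as a dictionary keyed by the header row. The sketch reads from an in-memory string with two hypothetical rows so that it runs on its own; with the real file you would pass the object returned by open('z-artist-names.csv', newline='') instead.

```python
import csv
import io

# Hypothetical CSV content in the same shape as z-artist-names.csv.
csv_text = (
    'Name,Link\r\n'
    '"Zabaglia, Niccola",https://example.com/artist/11630\r\n'
    '"Zaccone, Fabian",https://example.com/artist/34202\r\n'
)

reader = csv.DictReader(io.StringIO(csv_text, newline=''))
artists = list(reader)

# Names are stored as "Last, First"; split out the family name.
family_names = [row['Name'].split(',')[0] for row in artists]
print(family_names)
```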

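Retrieving Related Pages

We have created a program that pulls data from the first page of the list of artists whose last names start with the letter Z. However, the website has 4 such pages in total.

To collect all of these pages, we can perform more iterations with for loops. This will revise most of the code we have written so far, but will employ similar concepts.

To start, we'll want to initialize a list to hold the pages:

pages = []

We will populate this initialized list with the following for loop:

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

Earlier in this tutorial, we noted that we should pay attention to the total number of pages that contain artists' names starting with the letter Z (or whatever letter we're using). Since there are 4 pages for the letter Z, the loop uses range(1, 5), which yields the numbers 1 through 4.

For this specific website, the URLs begin with the string https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ, followed by the page number (the integer i from the for loop, which we convert to a string), and end with .htm. We concatenate these strings and append the result to the pages list.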
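
The same list can be built in a single expression. This is only a sketch of an equivalent form using an f-string and a list comprehension; the constant names are hypothetical.

```python
# Hypothetical constants; the values come from the URL pattern described above.
BASE_URL = ('https://web.archive.org/web/20121007172955/'
            'https://www.nga.gov/collection/anZ')
TOTAL_PAGES = 4

# range(1, TOTAL_PAGES + 1) yields 1, 2, 3, 4.
page_urls = [f'{BASE_URL}{number}.htm' for number in range(1, TOTAL_PAGES + 1)]

for page_url in page_urls:
    print(page_url)
```

Keeping the page count in one named constant makes it easier to adapt the script to a letter with a different number of pages.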

In addition to this loop, we'll have a second loop that goes through each of the pages above. The code in this for loop will look similar to the code we have created so far, as it repeats the task we completed for the first page of the letter Z artists for each of the 4 pages in total. Because the original program now sits inside this second loop, its own for loop becomes a nested loop.

The two for loops will look like this:

pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')

        f.writerow([names, links])

In the code above, the first for loop iterates over the page numbers to build the URLs, and the second for loop scrapes data from each of those pages, adding the artists' names and links row by row for every page.

These two for loops come below the import statements and the creation of the CSV file and writer (with the line that writes the file's header row); the pages variable is initialized (assigned to an empty list) just before them.

Within the greater context of the programming file, the complete code looks like this:

[label nga_z_artists.py]
import requests
import csv
from bs4 import BeautifulSoup

f = csv.writer(open('z-artist-names.csv', 'w', newline='', encoding='utf-8'))
f.writerow(['Name', 'Link'])

pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')

        f.writerow([names, links])

Since this program is doing a bit of work, it will take a little while to create the CSV file. Once it is done, the output will be complete, showing the artists' names and their associated links from Zabaglia, Niccola to Zykmund, Václav.

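Being Considerate

When scraping web pages, it is important to remain considerate of the servers you are grabbing information from.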
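
One simple way to go easy on a server is to pause between requests. The following sketch uses a hypothetical helper, fetch_politely, that takes the download function as a parameter so the pacing logic can be shown without touching the network; in the scraper above, that function could be something like lambda url: requests.get(url).text. The one-second default is an arbitrary assumption, not a figure recommended by any particular site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Hypothetical helper: call fetch(url) for each URL, pausing in between."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)   # wait before every request but the first
        results.append(fetch(url))
    return results


# Stand-in fetch function so the example runs offline.
demo_pages = fetch_politely(
    ['https://example.com/1', 'https://example.com/2'],
    fetch=lambda url: 'contents of ' + url,
    delay_seconds=0.01,
)
print(demo_pages)
```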

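Check to see if a site has terms of service or terms of use that pertain to web scraping. Also, check to see if a site has an API that allows you to grab data before scraping it yourself.

Once you have collected what you need from a site, run scripts that go over the data locally rather than burden someone else's servers.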
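
A straightforward way to work locally is to save each page's HTML the first time it is downloaded and read the saved copy afterwards. The helper below is a sketch with hypothetical names; the download function is passed in as a parameter, so the example runs offline with a stand-in.

```python
import tempfile
from pathlib import Path


def load_or_fetch(cache_file, fetch):
    """Hypothetical helper: return cached HTML, downloading only if absent."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    html = fetch()                                  # e.g. requests.get(url).text
    cache_file.write_text(html, encoding='utf-8')
    return html


calls = []


def fake_fetch():
    """Stand-in for a real download; records how often it is called."""
    calls.append(1)
    return '<html><body>Zabaglia, Niccola</body></html>'


cache_path = Path(tempfile.mkdtemp()) / 'anZ1.html'
first = load_or_fetch(cache_path, fake_fetch)    # downloads and saves
second = load_or_fetch(cache_path, fake_fetch)   # served from disk
print(len(calls))
```

The cached text can then be handed to BeautifulSoup as many times as needed while experimenting with selectors.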

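Additionally, it is a good idea to scrape with a header that has your name and email address so that a website can identify you and follow up if they have any questions. An example of a header you can use with the Python Requests library is as follows:

import requests

headers = {
    'User-Agent': 'Your Name, example.com',
    'From': 'you@example.com'  # placeholder address; use your own
}

url = 'https://example.com'

page = requests.get(url, headers=headers)

Using headers with identifiable information ensures that the people who go over a server's logs can reach out to you.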

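Conclusion

This tutorial went through using Python and Beautiful Soup to scrape data from a website. We stored the text that we gathered within a CSV file.

You can continue working on this project by collecting more data and making your CSV file more robust. For example, you may want to include the nationalities and years of each artist.

To continue learning about pulling information from the web, read our tutorial How To Crawl A Web Page with Scrapy and Python 3.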

Published At 2019-03-20 11:22 UTC
