How to Scrape Amazon Product Information Using Beautiful Soup

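Web scraping is the programming technique of extracting relevant information from websites and storing it on the local system for further use.

Today, web scraping has many applications in data science and marketing. Web scrapers around the world gather large amounts of information for personal or professional use, and today's tech giants rely on such web scraping methods to serve the needs of their consumer base.

In this article, we will scrape product information from the Amazon website, using a PlayStation 4 as the target product.

Note: Website layouts and tags may change over time. Readers are therefore advised to understand the concepts behind scraping so that implementing it on their own does not become a problem.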

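Web Scraping Services

If you want to build a service using web scraping, you may have to deal with IP blocking as well as proxy management. It is good to know the underlying technologies and processes, but for scraping at scale it is better to work with a scraping API provider such as [Zenscrape](https://zenscrape.com/). They even take care of Ajax requests and JavaScript for dynamic pages.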

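Some basic requirements:

To make a soup, we need the right ingredients. Similarly, our fresh web scraper requires certain components.

Python - Its ease of use and vast collection of libraries make Python the numero uno choice for scraping websites. Readers who do not have it installed can refer here (/community/tutorials/python-tutorial-beginners).
Beautiful Soup - One of the many web scraping libraries for Python. Its simple, clean usage makes it a top contender for web scraping. After a successful installation of Python, Beautiful Soup can be installed with:

pip install bs4

Basic understanding of HTML tags - Refer to the community HTML5 tutorial with examples for the necessary information about HTML tags.
Web browser - Since we have to discard a lot of unnecessary information from a website, we need specific ids and tags for filtering, and the browser is the tool we use to discover them.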

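Creating a User-Agent

Many websites have protocols for blocking robots from accessing data. Therefore, in order to extract data from a script, we need to create a User-Agent. The User-Agent is a string that tells the server what kind of client is sending the request.

Public lists of User-Agent strings are available for the reader to choose from. The following User-Agent is placed inside the headers of the request.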

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

HEADERS contains an additional field, Accept-Language, which asks the server for the English version of the page where one is available.
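To check that the headers are attached as intended before touching the network, one option is to build a prepared request and inspect it. This is only a minimal sketch: the URL is a placeholder, and nothing is actually sent.

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.5",
}

# Build the request object without sending it (example.com is a placeholder).
prepared = requests.Request("GET", "https://example.com/", headers=HEADERS).prepare()

# Header lookup on a prepared request is case-insensitive.
print(prepared.headers["user-agent"])
print(prepared.headers["Accept-Language"])
```

Seeing both values printed confirms that the dictionary is well-formed and that the same HEADERS object can be passed to requests.get().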

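Sending a request to a URL

A webpage is accessed by its URL (Uniform Resource Locator). With the help of the URL, we send a request to the webpage to access its data.

import requests

URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"
webpage = requests.get(URL, headers=HEADERS)

The requested page is an Amazon product page, so our Python script focuses on extracting product details such as the product name and the current price.

Note: The request is sent through the requests library. If you get a "No module named requests" error, install the library with pip install requests.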
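A site may answer an automated request with an error status instead of the product page, so it can help to verify the response before parsing it. The sketch below is one possible way to do that; `fetch_html`, its `get` parameter, and the `_StubResponse` class are hypothetical names introduced here so the function can be exercised without a network connection.

```python
import requests


def fetch_html(url, headers, get=requests.get, timeout=10):
    """Return the response body as bytes, or None if the request failed.

    `get` defaults to requests.get; it is a parameter only so that a stub
    can be swapped in for testing (an assumption of this sketch).
    """
    try:
        response = get(url, headers=headers, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx codes.
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.content


class _StubResponse:
    """Stand-in for requests.Response, used only for this demonstration."""

    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content

    def raise_for_status(self):
        if self.status_code >= 400:
            raise requests.HTTPError(f"status {self.status_code}")


def fake_ok(url, headers=None, timeout=None):
    return _StubResponse(200, b"<html></html>")


def fake_blocked(url, headers=None, timeout=None):
    return _StubResponse(503, b"")


ok = fetch_html("https://example.com/", {}, get=fake_ok)
blocked = fetch_html("https://example.com/", {}, get=fake_blocked)
print(ok, blocked)
```

In real use the `get` argument would simply be left at its default, and a return value of None signals that there is nothing worth handing to the parser.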

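Creating a soup of information

The webpage variable contains the response received from the website. We pass the content of the response and the type of parser to the BeautifulSoup function.

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage.content, "lxml")

lxml is a high-speed parser that Beautiful Soup uses to break the HTML page down into complex Python objects. It is a separate package; if it is missing, install it with pip install lxml. Generally, four kinds of Python objects are obtained:

Tag - Corresponds to an HTML or XML tag, including its name and attributes.
NavigableString - Corresponds to the text stored within a tag.
BeautifulSoup - The entire parsed document.
Comment - A special kind of NavigableString that holds HTML comments, which do not fit the three categories above.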
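The four object types can be seen on a tiny hand-written document. This sketch uses Python's bundled "html.parser" instead of lxml so that it runs without extra packages; the HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A tiny hand-written document; "html.parser" ships with Python.
html = '<p id="greeting">Hello<!-- a hidden note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("p")
text, note = tag.contents  # first child is the text, second is the comment

print(type(soup).__name__)  # BeautifulSoup
print(type(tag).__name__)   # Tag
print(type(text).__name__)  # NavigableString
print(type(note).__name__)  # Comment
print(tag["id"])            # attributes are read like dictionary keys
```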

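Discovering the exact tags for object extraction

The most tedious part of this project is unearthing the ids and tags that store the relevant information. As mentioned before, we use a web browser to accomplish this task.

We open the webpage in the browser and inspect the relevant element by right-clicking on it.

[Figure: Amazon Web Scraper - Inspect]

As a result, a panel opens on the right-hand side of the screen, as shown in the following figure.

[Figure: Amazon Web Scraper - Extractions]

Once we obtain the tag values, extracting information becomes a piece of cake. However, we must first learn certain functions defined for the Beautiful Soup object.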

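Extracting the product title

Using the find() function, which searches for a specific tag with specific attributes, we locate the Tag object containing the product title.

# Outer Tag Object
title = soup.find("span", attrs={"id": "productTitle"})

Then, we take out the NavigableString object.

# Inner NavigableString Object
title_value = title.string

Finally, we strip the extra whitespace and convert the object to a string value.

# Title as a string value
title_string = title_value.strip()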

We can look at the type of each variable using the type() function.

# Printing types of values for efficient understanding
print(type(title))
print(type(title_value))
print(type(title_string))
print()

# Printing Product Title
print("Product Title = ", title_string)

Output:

<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>

Product Title =  Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
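One detail worth knowing about the .string attribute used above: it only returns text when the tag has a single child. If the tag contains nested tags, .string is None, and get_text() may be a safer choice. The sketch below shows the difference on made-up HTML; the id nestedTitle is invented for this illustration and does not appear on Amazon.

```python
from bs4 import BeautifulSoup

html = """
<span id="productTitle">  Sony PlayStation 4 Pro  </span>
<span id="nestedTitle">Sony <b>PlayStation</b> 4 Pro</span>
"""
soup = BeautifulSoup(html, "html.parser")

simple = soup.find("span", attrs={"id": "productTitle"})
nested = soup.find("span", attrs={"id": "nestedTitle"})

# Works: the span has exactly one text child.
print(simple.string.strip())

# None: the span has several children, so .string cannot pick one.
print(nested.string)

# get_text() joins every piece of text inside the tag.
print(nested.get_text(" ", strip=True))
```

This is also why the extraction functions later in the article catch AttributeError: calling .strip() on None raises exactly that error.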

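In the same way, we need to figure out the tag values for other product details such as the product price and consumer ratings.

Python script to extract product information

The following Python script displays these details for a product:

The title of the product
The price of the product
The rating of the product
The number of customer reviews
The product availability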

from bs4 import BeautifulSoup
import requests


# Function to extract Product Title
def get_title(soup):
    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"id": "productTitle"})

        # Inner NavigableString Object
        title_value = title.string

        # Title as a string value
        title_string = title_value.strip()

        # # Printing types of values for efficient understanding
        # print(type(title))
        # print(type(title_value))
        # print(type(title_string))
        # print()

    except AttributeError:
        # find() returned None, or the tag has no single text child
        title_string = ""

    return title_string


# Function to extract Product Price
def get_price(soup):
    try:
        price = soup.find("span", attrs={"id": "priceblock_ourprice"}).string.strip()

    except AttributeError:
        price = ""

    return price


# Function to extract Product Rating
def get_rating(soup):
    try:
        rating = soup.find("i", attrs={"class": "a-icon a-icon-star a-star-4-5"}).string.strip()

    except AttributeError:
        # Fall back to the alternative text of the star icon
        try:
            rating = soup.find("span", attrs={"class": "a-icon-alt"}).string.strip()
        except AttributeError:
            rating = ""

    return rating


# Function to extract Number of User Reviews
def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={"id": "acrCustomerReviewText"}).string.strip()

    except AttributeError:
        review_count = ""

    return review_count


# Function to extract Availability Status
def get_availability(soup):
    try:
        available = soup.find("div", attrs={"id": "availability"})
        available = available.find("span").string.strip()

    except AttributeError:
        available = ""

    return available


if __name__ == '__main__':

    # Headers for request
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept-Language': 'en-US, en;q=0.5',
    }

    # The webpage URL
    URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"

    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)

    # Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "lxml")

    # Function calls to display all necessary product information
    print("Product Title =", get_title(soup))
    print("Product Price =", get_price(soup))
    print("Product Rating =", get_rating(soup))
    print("Number of Product Reviews =", get_review_count(soup))
    print("Availability =", get_availability(soup))
    print()
    print()
98    print()

Output:

Product Title = Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
Product Price = $473.99
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 1,311 ratings
Availability = In Stock.
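The values in this output are plain strings, which is inconvenient if prices or ratings are to be compared. A possible follow-up step is to convert them to numbers. The helpers below are a sketch with hypothetical names (parse_price, parse_rating, parse_count); they assume the US-style formatting seen in the sample output and use only the standard library.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a string such as '$473.99' to Decimal, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return Decimal(match.group().replace(",", "")) if match else None


def parse_rating(text):
    """Convert '4.7 out of 5 stars' to the float 4.7, or None if the format differs."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of", text)
    return float(match.group(1)) if match else None


def parse_count(text):
    """Convert '1,311 ratings' to the int 1311, or None if no number is found."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


print(parse_price("$473.99"))
print(parse_rating("4.7 out of 5 stars"))
print(parse_count("1,311 ratings"))
print(parse_price(""))  # an empty string, as returned when no price was found
```

Decimal is used for the price to avoid floating-point rounding when amounts are later added or compared.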

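Now that we know how to extract information from a single Amazon webpage, we can apply the same script to multiple webpages simply by changing the URL.

Next, let us try fetching links from an Amazon search results webpage.

Fetching links from an Amazon search results webpage

Previously, we obtained information about a single PlayStation 4. It would be worthwhile to extract such information for multiple PlayStations in order to compare prices and ratings.

A link is found enclosed in an <a></a> tag as the value of the href attribute.

[Figure: Amazon Web Scraper - Links]

Instead of fetching a single link, we can extract all similar links using the find_all() function.

# Fetch links as List of Tag Objects
links = soup.find_all("a", attrs={'class': 'a-link-normal s-no-outline'})

The find_all() function returns an iterable object containing multiple Tag objects. We then pick each Tag object and pluck out the link stored as the value of its href attribute.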
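When a class value containing a space is passed to find_all(), Beautiful Soup compares it against the full class string, so the order of the class names matters. If the page might list the same classes in a different order, a CSS selector via select() is one alternative. The sketch below illustrates the difference on made-up HTML; the href values are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<a class="a-link-normal s-no-outline" href="/dp/A1">first</a>
<a class="s-no-outline a-link-normal" href="/dp/A2">second</a>
<a class="a-link-normal" href="/help">unrelated</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: class order matters.
exact = soup.find_all("a", attrs={"class": "a-link-normal s-no-outline"})

# CSS selector: matches any <a> carrying both classes, in any order.
both = soup.select("a.a-link-normal.s-no-outline")

print([a["href"] for a in exact])
print([a["href"] for a in both])
```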

# Store the links
links_list = []

# Loop for extracting links from Tag Objects
for link in links:
    links_list.append(link.get('href'))

We store the links in a list so that we can iterate over each link and extract product details.
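The href values collected this way are usually relative paths, some tags may lack an href altogether, and the same product can appear more than once. A possible clean-up step, sketched here with the standard library's urljoin, turns them into unique absolute URLs. The sample list is invented for illustration, and B0EXAMPLE is a made-up product id.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"

# Example href values; None stands for an <a> tag without an href attribute.
raw_links = [
    "/dp/B07K14XKZH/",
    None,
    "/dp/B07K14XKZH/",
    "https://www.amazon.com/dp/B0EXAMPLE/",
]

absolute_links = []
for href in raw_links:
    if not href:
        continue  # link.get('href') returns None when the attribute is absent
    url = urljoin(BASE_URL, href)  # already-absolute URLs are left untouched
    if url not in absolute_links:
        absolute_links.append(url)

print(absolute_links)
```

Compared with plain string concatenation, urljoin also copes with links that are already absolute.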

# Loop for extracting product details from each link
for link in links_list:

    new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)
    new_soup = BeautifulSoup(new_webpage.content, "lxml")

    print("Product Title =", get_title(new_soup))
    print("Product Price =", get_price(new_soup))
    print("Product Rating =", get_rating(new_soup))
    print("Number of Product Reviews =", get_review_count(new_soup))
    print("Availability =", get_availability(new_soup))

We reuse the functions created earlier to extract product information. Although producing multiple soups makes the code slow, in return it provides a proper comparison of prices across multiple models and deals.
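Because every product page costs one request, it may be sensible to pause between requests rather than send them back to back. The generator below is a sketch of that idea, not part of the original script; `fetch_all`, `fetch`, and `delay_seconds` are hypothetical names, and the demonstration uses a stand-in function instead of a real HTTP call.

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0):
    """Yield (url, result) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            time.sleep(delay_seconds)
        yield url, fetch(url)


# Demonstration with a stand-in for the real download function.
def fake_fetch(url):
    return url.upper()


pages = list(fetch_all(["a", "b"], fetch=fake_fetch, delay_seconds=0))
print(pages)
```

In real use, `fetch` could be a small function that calls get() on a requests.Session with the headers shown earlier, so the underlying connection is reused across product pages.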

Python script to extract product details across multiple webpages

Below is the complete working Python script for listing multiple PlayStation deals.

from bs4 import BeautifulSoup
import requests


# Function to extract Product Title
def get_title(soup):
    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"id": "productTitle"})

        # Inner NavigableString Object
        title_value = title.string

        # Title as a string value
        title_string = title_value.strip()

        # # Printing types of values for efficient understanding
        # print(type(title))
        # print(type(title_value))
        # print(type(title_string))
        # print()

    except AttributeError:
        # find() returned None, or the tag has no single text child
        title_string = ""

    return title_string


# Function to extract Product Price
def get_price(soup):
    try:
        price = soup.find("span", attrs={"id": "priceblock_ourprice"}).string.strip()

    except AttributeError:
        try:
            # If there is some deal price
            price = soup.find("span", attrs={"id": "priceblock_dealprice"}).string.strip()

        except AttributeError:
            price = ""

    return price


# Function to extract Product Rating
def get_rating(soup):
    try:
        rating = soup.find("i", attrs={"class": "a-icon a-icon-star a-star-4-5"}).string.strip()

    except AttributeError:
        # Fall back to the alternative text of the star icon
        try:
            rating = soup.find("span", attrs={"class": "a-icon-alt"}).string.strip()
        except AttributeError:
            rating = ""

    return rating


# Function to extract Number of User Reviews
def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={"id": "acrCustomerReviewText"}).string.strip()

    except AttributeError:
        review_count = ""

    return review_count


# Function to extract Availability Status
def get_availability(soup):
    try:
        available = soup.find("div", attrs={"id": "availability"})
        available = available.find("span").string.strip()

    except AttributeError:
        available = "Not Available"

    return available


if __name__ == '__main__':

    # Headers for request
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept-Language': 'en-US',
    }

    # The webpage URL
    URL = "https://www.amazon.com/s?k=playstation+4&ref=nb_sb_noss_2"

    # HTTP Request
    webpage = requests.get(URL, headers=HEADERS)

    # Soup Object containing all data
    soup = BeautifulSoup(webpage.content, "lxml")

    # Fetch links as List of Tag Objects
    links = soup.find_all("a", attrs={'class': 'a-link-normal s-no-outline'})

    # Store the links
    links_list = []

    # Loop for extracting links from Tag Objects
    for link in links:
        links_list.append(link.get('href'))

    # Loop for extracting product details from each link
    for link in links_list:

        new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)

        new_soup = BeautifulSoup(new_webpage.content, "lxml")

        # Function calls to display all necessary product information
        print("Product Title =", get_title(new_soup))
        print("Product Price =", get_price(new_soup))
        print("Product Rating =", get_rating(new_soup))
        print("Number of Product Reviews =", get_review_count(new_soup))
        print("Availability =", get_availability(new_soup))
        print()
        print()

Output:

Product Title = SONY PlayStation 4 Slim 1TB Console, Light & Slim PS4 System, 1TB Hard Drive, All the Greatest Games, TV, Music & More
Product Price = $357.00
Product Rating = 4.4 out of 5 stars
Number of Product Reviews = 32 ratings
Availability = In stock on September 8, 2020.

Product Title = Newest Sony Playstation 4 PS4 1TB HDD Gaming Console Bundle with Three Games: The Last of Us, God of War, Horizon Zero Dawn, Included Dualshock 4 Wireless Controller
Product Price = $469.00
Product Rating = 4.6 out of 5 stars
Number of Product Reviews = 211 ratings
Availability = Only 14 left in stock - order soon.

Product Title = PlayStation 4 Slim 1TB Console - Fortnite Bundle
Product Price = 
Product Rating = 4.8 out of 5 stars
Number of Product Reviews = 2,715 ratings
Availability = Not Available

Product Title = PlayStation 4 Slim 1TB Console - Only On PlayStation Bundle
Product Price = $444.00
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 5,190 ratings
Availability = Only 1 left in stock - order soon.
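Printed output is hard to reuse, so a natural next step might be to store the records in CSV form. The sketch below uses the standard csv module and writes to an in-memory buffer; the field names are hypothetical, and the two rows are copied from the sample output above.

```python
import csv
import io

FIELDS = ["title", "price", "rating", "review_count", "availability"]

# Two records copied from the sample output above.
rows = [
    {
        "title": "PlayStation 4 Slim 1TB Console - Fortnite Bundle",
        "price": "",
        "rating": "4.8 out of 5 stars",
        "review_count": "2,715 ratings",
        "availability": "Not Available",
    },
    {
        "title": "PlayStation 4 Slim 1TB Console - Only On PlayStation Bundle",
        "price": "$444.00",
        "rating": "4.7 out of 5 stars",
        "review_count": "5,190 ratings",
        "availability": "Only 1 left in stock - order soon.",
    },
]

# An in-memory buffer keeps the example free of side effects; a real script
# could use open("products.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

Values that contain commas, such as "2,715 ratings", are quoted automatically by the csv module.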

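The above Python script is not restricted to listing PlayStations. We can switch the URL to any other Amazon search results link, such as one for headphones or earphones.

As mentioned before, the layout and tags of an HTML page may change over time, at which point the ids and class names used above will need updating.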
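One way to soften the impact of such layout changes is to keep the candidate tag locations in a single list and try them in order, so that only the list needs editing when the page changes. The helper below is a sketch of that idea; `find_first_text` and `PRICE_CANDIDATES` are hypothetical names, and the HTML string is made up for the demonstration.

```python
from bs4 import BeautifulSoup


def find_first_text(soup, candidates):
    """Return the stripped text of the first matching (tag name, attrs) pair, else ''."""
    for name, attrs in candidates:
        tag = soup.find(name, attrs=attrs)
        if tag is not None:
            return tag.get_text(" ", strip=True)
    return ""


# Candidate locations for the price, tried in order. Both ids are the ones
# used by get_price() in the scripts above.
PRICE_CANDIDATES = [
    ("span", {"id": "priceblock_ourprice"}),
    ("span", {"id": "priceblock_dealprice"}),
]

html = '<span id="priceblock_dealprice"> $357.00 </span>'
soup = BeautifulSoup(html, "html.parser")
empty_soup = BeautifulSoup("<p></p>", "html.parser")

print(find_first_text(soup, PRICE_CANDIDATES))
print(repr(find_first_text(empty_soup, PRICE_CANDIDATES)))
```

When a new id shows up in the browser's inspector, it can simply be appended to the list without touching the lookup logic.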

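Conclusion

Web scraping offers various advantages, ranging from comparing product prices to analyzing consumer tendencies. Since the internet is accessible to everyone and Python is an easy language to learn, anyone can perform web scraping to meet their needs.

We hope this article was easy to understand. Feel free to comment below with any queries or feedback. Until then, happy scraping!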