Table of Contents
- requests + BeautifulSoup
- requests + parsel
- httpx (sync) + parsel
- httpx (async) + parsel
- Comparison and summary
Two of the hottest recent tools in the Python web-scraping world are httpx and parsel. httpx bills itself as the next-generation HTTP client: it supports everything the requests library does and can also send asynchronous requests, which makes writing async crawlers much easier. parsel was originally bundled inside the well-known Scrapy framework and was later split off into a standalone package; it supports XPath selectors, CSS selectors and regular expressions for extraction, and is said to parse faster than BeautifulSoup.
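To give a feel for parsel's API before the benchmark, here is a minimal sketch of the three extraction styles it supports. The HTML string is made up purely for illustration:

from parsel import Selector

html = '<div class="price"><span>650萬</span></div>'  # toy markup, for illustration only
sel = Selector(text=html)

print(sel.css('div.price span::text').get())                  # CSS selector  -> '650萬'
print(sel.xpath('//div[@class="price"]/span/text()').get())   # XPath         -> '650萬'
print(sel.css('span::text').re_first(r'\d+'))                 # regex on top of a selector -> '650'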
In this post we put httpx and parsel to the test by scraping the second-hand listings on Lianjia (鏈家). To keep the run short, we only scrape homes for sale in Shanghai's Pudong district in the 5-8 million yuan price range.
requests + BeautifulSoup
First up is the requests + BeautifulSoup combination, which is what most people use when they first learn Python scraping. The crawler's entry URL is https://sh.lianjia.com/ershoufang/pudong/a3p5/: we first request it to find the total number of result pages, then loop over every page, parse it and pull out the fields we care about (estate name, floor, orientation, total price, unit price, and so on), and finally export everything to a CSV file. If you are reading this you presumably already know some Python scraping, so we will not walk through every line of code.
The full project code is shown below:
# homelink_requests.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import csv
import re
import time


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦東_三房_500_800萬.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            a = soup.select('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data attribute string into a dict
            max_page = eval(a[0].attrs["page-data"])["totalPage"]
            return max_page
        else:
            print("請求失敗 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            ul = soup.find_all("ul", class_="sellListContent")
            li_list = ul[0].select("li")
            for li in li_list:
                detail = dict()
                detail['title'] = li.select('div[class="title"]')[0].get_text()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.select('div[class="houseInfo"]')[0].get_text()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                # search anywhere in the string
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the 4-digit build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文蘭小區(qū) - 塘橋: extract the estate name and the area
                position_info = li.select('div[class="positionInfo"]')[0].get_text().split(' - ')
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: extract 650
                price_pattern = re.compile(r'\d+')
                total_price = li.select('div[class="totalPrice"]')[0].get_text()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: extract 64182
                unit_price = li.select('div[class="unitPrice"]')[0].get_text()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["標題", "小區(qū)", "房廳", "面積", "朝向", "樓層", "年份",
                "位置", "總價(萬)", "單價(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location",
                "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗時:{}秒".format(end-start))
Note: fake_useragent, requests and BeautifulSoup all need to be installed with pip beforehand (for example, pip install fake-useragent requests beautifulsoup4).
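A side note on the eval() call in get_max_page(): calling eval on content pulled from a web page is risky. Since page-data holds a JSON string, json.loads from the standard library should work as a drop-in replacement for eval(a[0].attrs["page-data"]); a small sketch, with an illustrative attribute value:

import json

# page-data looks roughly like this in the page source (values here are made up):
page_data = '{"totalPage":20,"curPage":1}'
max_page = json.loads(page_data)["totalPage"]   # -> 20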
Now let's look at the result: the run took about 18.5 seconds and scraped 580 listings in total.
![](http://img.jbzj.com/file_images/article/202105/2021510105338252.png?2021410105355)
requests + parsel
This time we still use requests to fetch the pages, but parse them with the parsel library (install it with pip first). parsel is used much like BeautifulSoup: you create an instance and then extract DOM elements and data with selectors, but the syntax differs slightly. BeautifulSoup has its own query methods, whereas parsel supports standard CSS and XPath selectors and returns text or attribute values through its get() and getall() methods, which is quite convenient.
# BeautifulSoup usage
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
ul = soup.find_all("ul", class_="sellListContent")[0]

# parsel usage: work with the Selector class
from parsel import Selector
selector = Selector(response.text)
ul = selector.css('ul.sellListContent')[0]

# parsel examples for extracting text or attribute values
selector.css('div.title span::text').get()
selector.css('ul li a::attr(href)').get()

>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
Note: older versions of parsel used extract() and extract_first() to pull out text or attribute values; in newer versions these have been superseded by get() and getall().
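For readers coming from Scrapy, the two APIs map one-to-one; using the selector created above:

selector.css('div.title a::text').extract_first()  # old style, same result as .get()
selector.css('div.title a::text').extract()        # old style, same result as .getall()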
The full code is shown below:
# homelink_parsel.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
import csv
import re
import time
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦東_三房_500_800萬.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # grab the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大頁碼數(shù):{}".format(max_page))
            return max_page
        else:
            print("請求失敗 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the 4-digit build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文蘭小區(qū) - 塘橋: extract the estate name and the area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: extract 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: extract 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["標題", "小區(qū)", "房廳", "面積", "朝向", "樓層",
                "年份", "位置", "總價(萬)", "單價(元/平方米)"]
        keys = ["title", "house", "bedroom", "area",
                "direction", "floor", "year", "location",
                "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗時:{}秒".format(end-start))
Now let's look at the result: scraping the same 580 listings took about 16.5 seconds, saving roughly 2 seconds. So parsel does parse faster than BeautifulSoup; with a small job the difference is minor, but it can add up on larger crawls.
![](http://img.jbzj.com/file_images/article/202105/2021510105504920.png?2021410105513)
httpx (sync) + parsel
Let's go one step further and swap requests for httpx. Sending synchronous requests with httpx works essentially the same way as with requests, so we only need to change two lines in the previous example (plus the import), replacing requests with httpx; everything else stays identical.
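One thing worth knowing when swapping in httpx: like requests, it also offers a session-style client that reuses connections across requests, which can shave a bit more time off a multi-page crawl. A minimal sketch, not used in the benchmark code below (the User-Agent string is just a placeholder):

import httpx

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA, for illustration only
with httpx.Client(headers=headers) as client:
    response = client.get("https://sh.lianjia.com/ershoufang/pudong/a3p5/")
    print(response.status_code)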
from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦東_三房_500_800萬.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        # changed here: httpx instead of requests
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # grab the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大頁碼數(shù):{}".format(max_page))
            return max_page
        else:
            print("請求失敗 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            # changed here: httpx instead of requests
            response = httpx.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the 4-digit build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文蘭小區(qū) - 塘橋: extract the estate name and the area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: extract 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: extract 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["標題", "小區(qū)", "房廳", "面積", "朝向", "樓層",
                "年份", "位置", "總價(萬)", "單價(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location",
                "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    # print(row_data)
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗時:{}秒".format(end-start))
The whole crawl took 16.1 seconds, so for synchronous requests httpx performs essentially the same as requests.
![](http://img.jbzj.com/file_images/article/202105/2021510105618420.png?2021410105627)
Note: installing httpx with pip on Windows may fail with an error asking for the Visual Studio C++ build tools; install those and the problem goes away.
Next comes the big one: an asynchronous crawler written with httpx and asyncio. Let's see how long it takes to pull the same 580 listings from Lianjia.
httpx (async) + parsel
httpx's real strength is asynchronous requests. The async crawler works like this: we first send a synchronous request to get the maximum page number, then turn the fetching and parsing of each individual page into an asyncio coroutine task (defined with async def), and finally run them all on the event loop.
Most of the code is the same as the synchronous crawler; the two main changes are shown below:
# async: a coroutine that parses a single page, given that page's url
async def parse_single_page(self, url):
    # use httpx to send an async request for the page
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=self.headers)
        selector = Selector(response.text)
        # the rest is the same as before

def parse_page(self):
    max_page = self.get_max_page()
    loop = asyncio.get_event_loop()
    # on Python 3.6 and earlier, create a single coroutine task with asyncio.ensure_future or loop.create_task
    # from Python 3.7 onward, asyncio.create_task can also be used
    tasks = []
    for i in range(1, max_page + 1):
        url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
        tasks.append(self.parse_single_page(url))
    # asyncio.gather(*tasks) is another way to hand several coroutine tasks to the event loop
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
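As the comments above note, on Python 3.7+ the event-loop bookkeeping can be simplified with asyncio.run and asyncio.gather. A sketch of what parse_page could look like in that style (assuming the same parse_single_page coroutine from this article):

def parse_page(self):
    max_page = self.get_max_page()
    urls = ['https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            for i in range(1, max_page + 1)]

    async def crawl_all():
        # gather schedules every page coroutine and waits for all of them to finish
        await asyncio.gather(*[self.parse_single_page(url) for url in urls])

    # asyncio.run creates, runs and closes the event loop for us
    asyncio.run(crawl_all())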
The full project code is shown below:
from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx
import asyncio


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦東_三房_500_800萬.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # grab the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大頁碼數(shù):{}".format(max_page))
            return max_page
        else:
            print("請求失敗 status:{}".format(response.status_code))
            return None

    # async: a coroutine that fetches and parses a single page, given that page's url
    async def parse_single_page(self, url):
        async with httpx.AsyncClient() as client:
            response = await client.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1廳 | 74.14平米 | 南 | 精裝 | 高樓層(共6層) | 1999年建 | 板樓
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the 4-digit build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文蘭小區(qū) - 塘橋: extract the estate name and the area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650萬: extract 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 單價64182元/平米: extract 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        loop = asyncio.get_event_loop()
        # on Python 3.6 and earlier, create a single coroutine task with asyncio.ensure_future or loop.create_task
        # from Python 3.7 onward, asyncio.create_task can also be used
        tasks = []
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            tasks.append(self.parse_single_page(url))
        # asyncio.gather(*tasks) is another way to hand several coroutine tasks to the event loop
        loop.run_until_complete(asyncio.wait(tasks))
        loop.close()

    def write_csv_file(self):
        head = ["標題", "小區(qū)", "房廳", "面積", "朝向", "樓層",
                "年份", "位置", "總價(萬)", "單價(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location",
                "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗時:{}秒".format(end-start))
Now for the moment of truth: the async crawler built with httpx scraped the same 580 listings from Lianjia in just 2.5 seconds!
![](http://img.jbzj.com/file_images/article/202105/2021510105732392.png?2021410105747)
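One caveat about the async version: it fires requests for every page at once, which is exactly why it is so fast, but it is also the easiest way to get rate-limited or blocked by the target site. A common remedy is to cap concurrency with an asyncio.Semaphore; a rough, hedged sketch (the helper name and the limit of 5 are purely illustrative):

import asyncio
import httpx

async def crawl_pages(urls, headers, limit=5):
    # allow at most `limit` requests in flight at any moment
    semaphore = asyncio.Semaphore(limit)

    async def fetch(client, url):
        async with semaphore:
            response = await client.get(url, headers=headers)
            return response.text

    # reuse one AsyncClient for all pages instead of opening one per request
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[fetch(client, url) for url in urls])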
Comparison and summary
Scraping exactly the same content, the different tool combinations take noticeably different amounts of time. The httpx (async) + parsel combination is the clear winner, and requests + BeautifulSoup can honourably retire:
- requests + BeautifulSoup: 18.5 s
- requests + parsel: 16.5 s
- httpx (sync) + parsel: 16.1 s
- httpx (async) + parsel: 2.5 s
Which other libraries do you like to use for Python scraping?
That wraps up this hands-on comparison of the httpx request library and the parsel parsing library for Python web scraping.