
A comparison of four common ways to locate elements in Python web scrapers: which one do you prefer?

Published: 2023-07-03 10:54:25 Author: compiled by site users

Source: 早起Python

Author: 陳熹

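When collecting data with a Python scraper, a key step is extracting data from the fetched page, and the first part of that step is correctly locating the data you want.

This article compares several common ways of locating page elements in Python scrapers:

Traditional BeautifulSoup operations
CSS selectors based on BeautifulSoup (similar to PyQuery)
XPath
Regular expressions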

http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1

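As an example, we fetch the titles of the 20 books on the first page. First check whether the site applies anti-scraping measures, that is, whether a plain request returns the content to be parsed: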

import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)

A careful look shows that all the data we need is in the returned content, so no special anti-scraping handling is needed.
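
The manual check can also be roughly automated. The sketch below is only an illustration: the helper name looks_unblocked, the marker string and the threshold of 20 are assumptions, and the sample text is a hypothetical stand-in for what requests.get(url).text returns.

```python
def looks_unblocked(page_text, marker='class="name"', expected=20):
    """Return True when the marker appears at least `expected` times."""
    # Assumption: each book entry contains the marker exactly once.
    return page_text.count(marker) >= expected


# Hypothetical stand-in for the text returned by requests.get(url).text
sample = '<li><div class="name"><a title="Book">Book</a></div></li>' * 20

print(looks_unblocked(sample))           # True
print(looks_unblocked('<html></html>'))  # False
```

If the count falls short, the page was probably blocked, redirected, or rendered by JavaScript, and the parsing code that follows would find nothing.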

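Inspecting the page elements shows that each book entry sits in an li, inside a ul whose class is bang_list clearfix bang_list_mode.

Further inspection also shows where the title sits within each entry; this is the basis for all the parsing methods below.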

1. Traditional BeautifulSoup operations

The classic BeautifulSoup approach starts with from bs4 import BeautifulSoup, then converts the text into a structured tree with soup = BeautifulSoup(html, "lxml") and parses it with the find family of methods. The code:

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li') # locate the ul, then collect its 20 li elements
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title'] # parse each entry to get the book title
        print(title)

if __name__ == '__main__':
    bs_for_parse(response)

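This prints the 20 titles. Some titles are rather long and can be trimmed with regular expressions or other string methods, which this article does not cover in detail.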
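
As one possible way to trim long titles, the sketch below keeps only the text before the first opening bracket. The helper name shorten_title and the sample titles are hypothetical, and the rule is an assumption: real titles may need a different one.

```python
import re


def shorten_title(title):
    """Keep the text before the first ASCII or full-width opening bracket."""
    cleaned = title.strip()
    # Split once at the first '(', '（', '[' or '【'.
    head = re.split(r'[(（\[【]', cleaned, maxsplit=1)[0].strip()
    # Fall back to the full title when it starts with a bracket.
    return head or cleaned


print(shorten_title('Example Novel (2023 edition, hardcover)'))  # Example Novel
print(shorten_title('Example Novel（new edition）'))               # Example Novel
```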

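2. CSS selectors based on BeautifulSoup

This approach is essentially PyQuery's CSS selectors carried over to another module, and the usage is similar. For the detailed CSS selector syntax, see: http://www.w3school.com.cn/cssref/css_selectors.asp

Because it builds on BeautifulSoup, the imports and the text-to-tree conversion are the same: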

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
        
def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml") 
    print(soup)

if __name__ == '__main__':
    css_for_parse(response)

Next, soup.select with the appropriate CSS syntax retrieves the specific content; as before, this rests on careful inspection of the elements:

import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
        
def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

if __name__ == '__main__':
    css_for_parse(response)

3. XPath

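XPath is the XML Path Language, a language for addressing parts of an XML document. If you use Chrome, installing the XPath Helper extension is recommended and makes writing XPath much more efficient.

Earlier scraper articles were mostly based on XPath, so readers should be fairly familiar with it and the code is given directly: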

import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

if __name__ == '__main__':
    xpath_for_parse(response)
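
To try the same path expressions without a network request, here is a minimal offline sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml. ElementTree supports only a subset of XPath, with no /@title step, so the attribute is read with get. The SAMPLE fragment and the function name titles_from_fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mimicking the structure described above.
SAMPLE = """
<html><body>
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="http://product.dangdang.com/1.html" title="Book A">Book A</a></div></li>
  <li><div class="name"><a href="http://product.dangdang.com/2.html" title="Book B">Book B</a></div></li>
</ul>
</body></html>
"""


def titles_from_fragment(markup):
    root = ET.fromstring(markup.strip())
    # [@class='...'] compares the whole attribute string, not single classes.
    books = root.findall(".//ul[@class='bang_list clearfix bang_list_mode']/li")
    titles = []
    for book in books:
        link = book.find("div[@class='name']/a")
        if link is not None:
            titles.append(link.get('title'))
    return titles


print(titles_from_fragment(SAMPLE))  # ['Book A', 'Book B']
```

Because the class predicate is an exact string comparison, a page that lists the same classes in a different order would not match; that is one reason to re-inspect the page when a path suddenly returns nothing.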

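4. Regular expressions

If you are not familiar with HTML, the previous methods can be hard going. Regular expressions offer a general-purpose alternative: you only need to notice what distinctive pattern surrounds the text itself, then write a rule to extract it. The module used is re.

First, look again at the raw returned content to see what is distinctive before and after the text we need: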

import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)

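A look at a few book entries gives the answer: <div class="name"><a href="http://product.dangdang.com/xxxxxxxx.html" target="_blank" title="xxxxxxx"> The title is inside this string, and the number at the end of the link changes from book to book.

With that analysis, the regular expression can be written: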

import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def re_for_parse(response):
    # \d+ matches the numeric product id; the dots are escaped so they match literally
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    re_for_parse(response)
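
A small offline check shows why the escapes in such a pattern matter: \d+ matches the product number, while a bare d+ would look for literal letters d. The SAMPLE string below is a hypothetical stand-in for the page text, and the constant names are illustrative only.

```python
import re

# Hypothetical stand-in for two entries of the returned page text.
SAMPLE = (
    '<div class="name"><a href="http://product.dangdang.com/111.html" '
    'target="_blank" title="Book A">Book A</a></div>'
    '<div class="name"><a href="http://product.dangdang.com/222.html" '
    'target="_blank" title="Book B">Book B</a></div>'
)

# Raw string with escaped dots and \d+ for the numeric id.
ESCAPED = (r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" '
           r'target="_blank" title="(.*?)">')
# Same pattern with the backslashes lost: "d+" needs one or more letters "d".
UNESCAPED = ('<div class="name"><a href="http://product.dangdang.com/d+.html" '
             'target="_blank" title="(.*?)">')

print(re.findall(ESCAPED, SAMPLE))    # ['Book A', 'Book B']
print(re.findall(UNESCAPED, SAMPLE))  # []
```

The non-greedy group (.*?) stops at the first "> after title=, which keeps each match inside a single a tag.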

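The regular expression version is the shortest to write, but it requires being very comfortable with regex rules. As the saying goes, regex is great!

Of course, each method has scenarios it suits; in practice we also need to analyze the page structure to decide how to locate elements efficiently. Finally, here is the complete code for the four methods covered in this article, so you can try them yourself: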

import requests
from bs4 import BeautifulSoup
from lxml import html
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']
        print(title)

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

def re_for_parse(response):
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    # bs_for_parse(response)
    # css_for_parse(response)
    # xpath_for_parse(response)
    re_for_parse(response)
