Python Web Scraping - Data Extraction

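Analyzing a web page means understanding its structure. That raises a question: why does this matter for web scraping? This chapter looks at the answer in detail.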

Methods for Extracting Data from Web Pages

The following methods are mainly used to extract data from a web page:

Regular Expressions

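Regular expressions are a small, highly specialized language embedded in Python and made available through the re module. They are also called REs, regexes, or regex patterns. With a regular expression, you specify rules for the set of possible strings that you want to match in the data.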
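
As a minimal sketch of what "specifying rules for a set of strings" looks like in practice, the snippet below matches ISO-style dates in a short piece of text. The sample text and the name date_pattern are made up for illustration and do not come from any real page.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order A-1021 shipped on 2023-04-05; order B-77 shipped on 2023-04-09."

# Rule: four digits, a dash, two digits, a dash, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# findall returns every non-overlapping match, in order of appearance.
dates = date_pattern.findall(text)
print(dates)
```

Compiling the pattern once with re.compile is optional here, but it keeps the rule in one named place when the same pattern is reused.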

The following example retrieves data about India from http://example.webscraping.com by using a regular expression to match the contents of the <td> elements.

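import re
import urllib.request

# Download the page and decode the response bytes into text
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()

# Capture the contents of every <td class="w2p_fw"> cell (non-greedy match)
print(re.findall(r'<td class="w2p_fw">(.*?)</td>', text))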

The corresponding output is shown below:

[
   '<img src="/places/static/images/flags/in.png" />',
   '3,287,590 square kilometres',
   '1,173,108,018',
   'IN',
   'India',
   'New Delhi',
   '<a href="/places/default/continent/AS">AS</a>',
   '.in',
   'INR',
   'Rupee',
   '91',
   '######',
   '^(\\d{6})$',
   'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
   '<div>
      <a href="/places/default/iso/CN">CN </a>
      <a href="/places/default/iso/NP">NP </a>
      <a href="/places/default/iso/MM">MM </a>
      <a href="/places/default/iso/BT">BT </a>
      <a href="/places/default/iso/PK">PK </a>
      <a href="/places/default/iso/BD">BD </a>
   </div>'
]

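Note that the output above contains the details of the country India, extracted with a regular expression. Some of the values still include nested HTML tags such as <img> and <a>.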
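
One possible way to clean up such results is sketched below. It runs the same pattern against a small inline fragment, so it does not need network access, and then strips any nested tags from each value. The fragment, the w2p_fl label class, and the helper name extract_values are assumptions made for illustration.

```python
import re

# Hypothetical fragment modeled on the table rows of the page above.
sample_html = (
    '<tr><td class="w2p_fl">Country:</td><td class="w2p_fw">India</td></tr>'
    '<tr><td class="w2p_fl">Capital:</td><td class="w2p_fw">New Delhi</td></tr>'
    '<tr><td class="w2p_fl">Continent:</td>'
    '<td class="w2p_fw"><a href="/places/default/continent/AS">AS</a></td></tr>'
)

# Non-greedy (.*?) stops at the first closing </td>;
# re.DOTALL lets "." match newlines inside a cell as well.
cell_pattern = re.compile(r'<td class="w2p_fw">(.*?)</td>', re.DOTALL)
tag_pattern = re.compile(r"<[^>]+>")


def extract_values(html_text):
    """Return the text of each value cell with nested tags removed."""
    raw_cells = cell_pattern.findall(html_text)
    return [tag_pattern.sub("", cell).strip() for cell in raw_cells]


values = extract_values(sample_html)
print(values)
```

A pattern like this depends on the exact attribute text, so it can stop matching if the page markup changes even slightly. That fragility is a common reason to prefer an HTML parser for anything beyond quick extraction.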

Beautiful Soup

Suppose we want to collect all the hyperlinks from a web page. For that we can use a parser called BeautifulSoup.

Note that this example fetches the page with the requests Python module instead of urllib.

First, import the necessary Python modules:

import requests
from bs4 import BeautifulSoup

The following line of code uses requests to make an HTTP GET request to the URL https://authoraditiagarwal.com/.

r=requests.get('https://authoraditiagarwal.com/')

Now create a Soup object, as shown below:

soup = BeautifulSoup(r.text, 'lxml')
print(soup.title)
print(soup.title.text)

The corresponding output is shown below:

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
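
The section opens with the goal of collecting every hyperlink, so here is a minimal sketch of that step. It parses a small inline HTML string instead of a downloaded page, and it assumes the built-in "html.parser" backend so that only bs4 is required. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (r.text above).
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <a href="/about">About</a>
  <a href="https://example.com/blog">Blog</a>
  <a name="top">No href here</a>
</body></html>
"""

# "html.parser" ships with Python; "lxml" also works if it is installed.
soup = BeautifulSoup(sample_html, "html.parser")

# find_all("a", href=True) keeps only anchor tags that carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```

Filtering with href=True avoids a KeyError on anchors that have no href attribute, such as named anchors.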

The lxml Library

Another Python library for web scraping that we will discuss is lxml, a high-performance HTML and XML parsing library. You can read more about it at https://lxml.de/.

Using the pip command, you can install lxml either in a virtual environment or globally.

(base) D:\ProgramData>pip install lxml
Collecting lxml
   Downloading https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl (3.6MB)
   100% |████████████████████████████████| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5

Extracting Data with lxml

The following example uses lxml and requests to scrape a specific element of a web page from authoraditiagarwal.com.

First, import requests, and import html from the lxml library, as follows:

import requests
from lxml import html

Now provide the URL of the web page to scrape:

url='https://authoraditiagarwal.com/leadershipmanagement/'

Now provide the XPath of the specific element on that web page, then fetch and parse the page:

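path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content               # raw bytes of the response body
source_code = html.fromstring(byte_string)   # parse the bytes into an element tree
tree = source_code.xpath(path)               # list of elements matching the XPath
print(tree[0].text_content())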

The corresponding output is shown below:

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical 
axis represents the hours remaining to complete the committed work.
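
The example above indexes tree[0] directly, which raises IndexError if the XPath matches nothing, for instance after a site redesign changes the element id. The sketch below shows one way to guard against that, using a small inline HTML string so it runs without network access. The sample markup and the ids are hypothetical.

```python
from lxml import html

# Hypothetical HTML standing in for response.content.
sample_html = """
<html><body>
  <div id="content">
    <p>First paragraph with <b>bold</b> text.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

tree = html.fromstring(sample_html)

# xpath() returns a list, which is empty when nothing matches.
paragraphs = tree.xpath('//*[@id="content"]/p')
first_text = paragraphs[0].text_content() if paragraphs else None

# An XPath that matches nothing yields an empty list, not an exception.
missing = tree.xpath('//*[@id="no-such-id"]/p')

print(first_text)
print(missing)
```

text_content() also gathers the text of nested elements such as <b>, which is why it is usually a better fit than the .text attribute for paragraphs with inline markup.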
